Launched this week

RunInfra
Describe the AI model you need and get an optimized AI
178 followers
Tell RunInfra what you need and it builds the production API. No dashboards. No config. Describe any open-source model or full app in plain language. We optimize it for real: we benchmark GPUs, quantize the model, and generate custom CUDA kernels with our Forge agent. It runs faster and cheaper than standard hosting. Build voice (speech → AI → speech), doc search, vision, or model routing, all in one chat. Pay per million tokens. Scale to zero. Run managed or on your own GPUs.

How does the custom CUDA kernel generation actually work in practice? Does Forge learn from existing kernels or write them from scratch, and what happens if a generated kernel underperforms the standard one at runtime?
How does the pricing per million tokens actually compare to something like RunPod or Modal when you're running a custom kernel workload, especially at lower utilization?
How does the per-token pricing actually compare to something like RunPod or Modal when running something like a 70B quantized model for a few hours a day?
How does the CUDA kernel generation actually work in practice? Does Forge just spit out a kernel you can drop into vLLM, or does it need a custom serving stack on your end?
How does the pricing actually work when something like a custom CUDA kernel is being generated? Is that a flat fee, or does it burn through tokens while Forge is reasoning?
StartupBase
Abstracting model selection and kernel tuning behind a plain description is a good bet for teams without an ML infra person. How opinionated is it? Does it pick the architecture and hardware, or mostly optimize what you hand it? The gap between 'I need X' and a deployed model is where most people get stuck.
The scale-to-zero + pay-per-million-tokens combo is the part I'd test first. I’ve had small agent prototypes where idle GPU cost felt silly. Curious how you decide when to generate custom CUDA vs. just quantizing or routing to an existing runtime?