Launching today

General Compute
AI models that run on an inference cloud optimized for speed
422 followers
AI models that run on an inference cloud optimized for speed
422 followers
GPUs are built for training, not inference. General Compute is an inference cloud running on ASICs — purpose-built alternatives to Nvidia silicon designed specifically for inference. We deliver 5x faster responses and higher per-user throughput for latency-sensitive workloads like coding and voice agents. Our OpenAI-compatible API means you swap your base URL, keep your existing workflows, and run real-time AI on infrastructure built for the job.






Neutron
Hey Product Hunt, I'm Jason, Co-founder & CTO of General Compute!
The Problem
Agents are the most exciting thing happening in AI right now but the infra they run on was designed for chatbots, not autonomous workflows. When an agent has to make 20, 50, sometimes hundreds of sequential LLM calls to complete a task, latency compounds into a ceiling on what's actually possible.
Most inference providers today hit you with one of two tradeoffs:
❌ GPU-based stacks – Great for training, but memory-bandwidth bottlenecks mean your agent runs slowly (~120 tokens/second)
❌ "Fast" inference with catches – Some providers deliver speed but lock you into small models, limited context windows, or pricing that breaks at agent-scale token volume. Speed without intelligence isn’t worth the trade off.
After years building voice agents and real-time AI products ourselves, we got tired of waiting. So we built General Compute.
How General Compute is Different 🚀
GC is an ASIC-first inference cloud built on multiple chips, including SambaNova. SN uses a 3 tier memory architecture and dataflow, which is a fancy way of saying “It’s really fast cause we don’t have the same bottlenecks”.
🔹 Agent first (OpenClaw) – Agents can sign up on their own and manage their own API keys. OpenClaw can move its inference just by pointing it at our website.
🔹 Built for agent workloads – Tuned for both coding agents and voice AI (TTFT), the things that matter when you're chaining dozens of calls. Your agent finishes in seconds, not minutes.
🔹 Speed without the tradeoffs – Frontier open models, full context windows, and pricing that actually works at production scale.
Who is this for?
If you're building AI agents, voice AI ,or even just using OpenClaw or OpenCode and want faster inference, then GC is built for you. Faster inference isn't just a nice-to-have; it unlocks use cases that weren't viable before.
🔗 Get started today
Sign up at https://generalcompute.com and start running your workloads on ASICs today. We are offering $200 in free credit to anyone that signs up through the Product Hunt launch (up from the normal $5 in credit)
Product Hunt
Neutron
@curiouskitty Bring your own model will be coming in a few weeks - unfortunately its harder on ASICs, but we're quickly closing in on it
In terms of spec decoding, we actually see a larger improvement on ASICs than GPUs, which is a bit of a surprising discovery. Most of the "hacks" to make GPUs faster still make us faster (since we utilize HBM much better)
The main limitation right now is that we are using SN40s for now and won't have our SN50s online for a few months. SN50s will crush across all model context lengths, model types, speeds, ... Keep an eye out for some announcements in the coming weeks showing how good they are! Like Cerebras but running large models with higher throughput
this is a very real agent infra problem. Chatbot latency is annoying, but agent latency compounds into a hard ceiling when workflows need dozens of sequential LLM calls. how General Compute balances raw speed with reasoning quality on longer agent workflows, especially when there is large context, tool use, retries, and coding tasks. Is the biggest gain in TTFT/throughput, or do you also see better end-to-end task completion?
Neutron
@harshalvc_ai Definitely e2e latency! We can get around 5x e2e latency speed up but more like ~2x TTFT speed up
The ASIC-for-inference approach is clever. GPU memory bandwidth just isn't optimized for inference memory access patterns. At RetainSure we've been routing latency-sensitive AI calls for customer success workflows, and 200ms vs 800ms response time matters a lot at scale. How do your ASICs handle KV cache eviction for long-context requests?
Neutron
@anand_thakkar1 Thanks! Lets discuss TTFT sometime - that craziest thing? We don't have smart prompt caching or kv cache aware routing yet. And we're already 5x faster. Prompt caching will be out in 1 month, and you'll see our gap widen even more!
The ASIC angle is interesting, how does the model selection compare to GPU clouds? Are you running your own fine-tuned models or is it more about offering the same models (Llama, etc.) just with faster inference?
Neutron
@campixl Models are limited right now since we are compute constrained. We're just getting started and onboarding new racks as fast as we can get our hands on them. Expect all the big hitter OSS models soon
Neutron
@davem_0 We use SambaNova racks, and they have a 3 tier memory system + a dataflow architecture. Currently their codebase is closed source so I can't share specifics :)
Do you guys plan on adding embedding models sometime in the future
Neutron
@sanjay_goel6 We will have all of them!