General Compute

AI models that run on an inference cloud optimized for speed

537 followers

AI models that run on an inference cloud optimized for speed

537 followers

Visit website

AI Infrastructure Tools

GPUs are built for training, not inference. General Compute is an inference cloud running on ASICs — purpose-built alternatives to Nvidia silicon designed specifically for inference. We deliver 5x faster responses and higher per-user throughput for latency-sensitive workloads like coding and voice agents. Our OpenAI-compatible API means you swap your base URL, keep your existing workflows, and run real-time AI on infrastructure built for the job.

Free Options

Launch tags:API•Software Engineering•Alpha

Launch Team

ElevenAgents by ElevenLabsScale conversations without scaling your team

Promoted

General Compute

Maker

📌

Hey Product Hunt, I'm Jason, Co-founder & CTO of General Compute!

The Problem

Agents are the most exciting thing happening in AI right now but the infra they run on was designed for chatbots, not autonomous workflows. When an agent has to make 20, 50, sometimes hundreds of sequential LLM calls to complete a task, latency compounds into a ceiling on what's actually possible.

Most inference providers today hit you with one of two tradeoffs:

❌ GPU-based stacks – Great for training, but memory-bandwidth bottlenecks mean your agent runs slowly (~120 tokens/second)
❌ "Fast" inference with catches – Some providers deliver speed but lock you into small models, limited context windows, or pricing that breaks at agent-scale token volume. Speed without intelligence isn’t worth the trade off.

After years building voice agents and real-time AI products ourselves, we got tired of waiting. So we built General Compute.

How General Compute is Different 🚀

GC is an ASIC-first inference cloud built on multiple chips, including SambaNova. SN uses a 3 tier memory architecture and dataflow, which is a fancy way of saying “It’s really fast cause we don’t have the same bottlenecks”.

🔹 Agent first (OpenClaw) – Agents can sign up on their own and manage their own API keys. OpenClaw can move its inference just by pointing it at our website.
🔹 Built for agent workloads – Tuned for both coding agents and voice AI (TTFT), the things that matter when you're chaining dozens of calls. Your agent finishes in seconds, not minutes.
🔹 Speed without the tradeoffs – Frontier open models, full context windows, and pricing that actually works at production scale.

Who is this for?

If you're building AI agents, voice AI ,or even just using OpenClaw or OpenCode and want faster inference, then GC is built for you. Faster inference isn't just a nice-to-have; it unlocks use cases that weren't viable before.

🔗 Get started today

Sign up at https://generalcompute.com and start running your workloads on ASICs today. We are offering $200 in free credit to anyone that signs up through the Product Hunt launch (up from the normal $5 in credit)

Report

2mo ago

Product Hunt

You’re pushing an ASIC-first stack (including SambaNova) while also offering “bring your own model”: what constraints does the hardware impose on model choice and deployment (architectures, context length, quantization, speculative decoding), and how do you decide what to optimize first for real-world agent traffic?

Report

2mo ago

General Compute

Maker

@curiouskitty Bring your own model will be coming in a few weeks - unfortunately its harder on ASICs, but we're quickly closing in on it

In terms of spec decoding, we actually see a larger improvement on ASICs than GPUs, which is a bit of a surprising discovery. Most of the "hacks" to make GPUs faster still make us faster (since we utilize HBM much better)

The main limitation right now is that we are using SN40s for now and won't have our SN50s online for a few months. SN50s will crush across all model context lengths, model types, speeds, ... Keep an eye out for some announcements in the coming weeks showing how good they are! Like Cerebras but running large models with higher throughput

Report

2mo ago

this is a very real agent infra problem. Chatbot latency is annoying, but agent latency compounds into a hard ceiling when workflows need dozens of sequential LLM calls. how General Compute balances raw speed with reasoning quality on longer agent workflows, especially when there is large context, tool use, retries, and coding tasks. Is the biggest gain in TTFT/throughput, or do you also see better end-to-end task completion?

Report

2mo ago

General Compute

Maker

@harshalvc_ai Definitely e2e latency! We can get around 5x e2e latency speed up but more like ~2x TTFT speed up

Report

2mo ago

The ASIC-for-inference approach is clever. GPU memory bandwidth just isn't optimized for inference memory access patterns. At RetainSure we've been routing latency-sensitive AI calls for customer success workflows, and 200ms vs 800ms response time matters a lot at scale. How do your ASICs handle KV cache eviction for long-context requests?

Report

2mo ago

General Compute

Maker

@anand_thakkar1 Thanks! Lets discuss TTFT sometime - that craziest thing? We don't have smart prompt caching or kv cache aware routing yet. And we're already 5x faster. Prompt caching will be out in 1 month, and you'll see our gap widen even more!

Report

2mo ago

The ASIC angle is interesting, how does the model selection compare to GPU clouds? Are you running your own fine-tuned models or is it more about offering the same models (Llama, etc.) just with faster inference?

Report

2mo ago

General Compute

Maker

@campixl Models are limited right now since we are compute constrained. We're just getting started and onboarding new racks as fast as we can get our hands on them. Expect all the big hitter OSS models soon

Report

2mo ago

How are you managing the KV Cache effectively within this architecture?

Report

2mo ago

General Compute

Maker

@davem_0 We use SambaNova racks, and they have a 3 tier memory system + a dataflow architecture. Currently their codebase is closed source so I can't share specifics :)

Report

2mo ago

Do you guys plan on adding embedding models sometime in the future

Report

2mo ago

General Compute

Maker

@sanjay_goel6 We will have all of them!

Report

2mo ago

1 2 3

Reviews

Hey Product Hunt, I'm Jason, Co-founder & CTO of General Compute!

The Problem

Most inference providers today hit you with one of two tradeoffs:

❌ GPU-based stacks – Great for training, but memory-bandwidth bottlenecks mean your agent runs slowly (~120 tokens/second)
❌ "Fast" inference with catches – Some providers deliver speed but lock you into small models, limited context windows, or pricing that breaks at agent-scale token volume. Speed without intelligence isn’t worth the trade off.

After years building voice agents and real-time AI products ourselves, we got tired of waiting. So we built General Compute.

How General Compute is Different 🚀

🔹 Agent first (OpenClaw) – Agents can sign up on their own and manage their own API keys. OpenClaw can move its inference just by pointing it at our website.
🔹 Built for agent workloads – Tuned for both coding agents and voice AI (TTFT), the things that matter when you're chaining dozens of calls. Your agent finishes in seconds, not minutes.
🔹 Speed without the tradeoffs – Frontier open models, full context windows, and pricing that actually works at production scale.

Who is this for?

🔗 Get started today