Pinstripes : Pinstripes Forums

Long-term goal: unlimited parallel llm instances, instant inference, without the mega bills. Right now we're optimising open weight models for high concurrency on legacy and cheap hardware. Custom engine watches for signals that tell us when to batch, when to defrag KV cache, when to deprioritize low-utilization runs. We're targeting MoE agents cos they fit way nicer on our hardware than dense models do, and are working on some data center hardware which we're likely to announce if we stay afloat long enough. We don't need fat margins to stay solvent, so pricing is fair: $0.14/M tokens depending on the model. Sadly, we don't stop ballooning token usage, but we can stop the price gouging, that's our main driver. It's a drop-in OpenAI-compatible API — one line to swap the endpoint in whatever agent framework you're using. We're still figuring out instance scaling. Cold starts happen (spinning up a model that's been idle). You might see a 200ms+ delay on first request, but sustained load is fast. Speed demo: https://x.com/i/status/206673848... Grab an API key and test it out. Let me know if you have questions or just want to chat. GLM 5.2 optimizations happening now, expect 100+ tok/s next week.

Pinstripes - waging war on inference providers with a fast, cheap api

Replies