What would you build or benchmark with 5M free tokens on a reasoning model?
To encourage real experimentation, we’re offering 5 million free tokens on first API usage so devs and teams can test Alpie Core over Christmas and the New Year.
Alpie Core is a 32B reasoning model trained and served at 4-bit precision, offering 65K context, OpenAI-compatible APIs, and high-throughput, low-latency inference.
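To get started quickly, something like the sketch below should work with the standard OpenAI Python SDK against any OpenAI-compatible endpoint. The base URL, model id, and environment variable names here are placeholders rather than official values, so check the docs for the real ones.

```python
# Minimal first call against an OpenAI-compatible endpoint.
# NOTE: base_url, model id, and env var names are placeholders, not official values.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("ALPIE_BASE_URL", "https://api.example.com/v1"),  # placeholder
    api_key=os.environ["ALPIE_API_KEY"],  # placeholder env var
)

resp = client.chat.completions.create(
    model="alpie-core-32b",  # placeholder model id
    messages=[{"role": "user", "content": "Outline a 3-step plan to refactor a legacy module."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
print("total tokens:", resp.usage.total_tokens)
```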
If you were evaluating or using a model like this:
– What would you benchmark first?
– What workloads matter most to you?
– What comparisons would you want to see?
We are actively collecting feedback to shape v2 and are happy to share more details.

Replies
sujal_meghwal
If I were evaluating it seriously, I’d benchmark three things first:
1) Long-context failure modes: instruction drift, prompt injection persistence, and context poisoning across the full 65K window (especially multi-turn reasoning chains); a rough probe for this is sketched after the list.
2) OpenAI-compat edge cases: tool/function calling consistency, streaming behavior, and how error states are handled under load compared to GPT-style APIs.
3) Cost-amplification & abuse resistance: whether small prompt patterns can trigger disproportionately expensive reasoning paths or latency spikes at scale.
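For (1), here is a minimal sketch of the kind of drift probe I mean, assuming an OpenAI-compatible chat endpoint. The base URL, model id, and the tokens-per-sentence estimate are placeholders and rough guesses, not measured values.

```python
# Probe for instruction drift: plant a constraint early, pad the context with filler,
# then check whether the model still honors the constraint at the end of the window.
# base_url, model id, and the token estimate below are placeholders / rough guesses.
import json
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["BASE_URL"], api_key=os.environ["API_KEY"])
MODEL = os.environ.get("MODEL", "alpie-core-32b")  # placeholder id

def instruction_held(filler_tokens: int) -> bool:
    # ~10 tokens per filler sentence is a very rough estimate; good enough for a sweep.
    filler = "Background note: nothing below changes the formatting rules. " * (filler_tokens // 10)
    messages = [
        {"role": "system", "content": "Always reply with a JSON object containing a single key 'answer'."},
        {"role": "user", "content": filler + "\nWhat is 17 * 23?"},
    ]
    resp = client.chat.completions.create(model=MODEL, messages=messages, max_tokens=100)
    try:
        return "answer" in json.loads(resp.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return False  # output is no longer valid JSON, i.e. the instruction drifted

for size in (1_000, 8_000, 32_000, 60_000):
    print(f"~{size} filler tokens -> instruction held: {instruction_held(size)}")
```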
For workloads, I’d focus on multi-step planning, code review/refactoring, and agent-style workflows where quantization and long context tend to surface subtle regressions.
Comparisons I’d want to see: apples-to-apples against other open 30–34B class models on reasoning stability over long contexts, not just single-prompt benchmarks.
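For the comparison point, a rough harness sketch: the same prompts against two OpenAI-compatible endpoints, several repeats each, with a crude agreement metric. All URLs, keys, model ids, and prompts here are placeholders.

```python
# Same prompts, two OpenAI-compatible endpoints, several repeats each;
# count how often each model lands on the same final line as a crude stability signal.
# All URLs, keys, and model ids here are placeholders.
import os
from collections import Counter
from openai import OpenAI

ENDPOINTS = {
    "alpie-core-32b": OpenAI(base_url=os.environ["ALPIE_URL"], api_key=os.environ["ALPIE_KEY"]),
    "other-open-32b": OpenAI(base_url=os.environ["OTHER_URL"], api_key=os.environ["OTHER_KEY"]),
}
PROMPTS = ["<long multi-step reasoning prompt #1>", "<long multi-step reasoning prompt #2>"]
REPEATS = 5

for model_id, client in ENDPOINTS.items():
    for i, prompt in enumerate(PROMPTS, start=1):
        finals = []
        for _ in range(REPEATS):
            resp = client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.2,
                max_tokens=400,
            )
            text = (resp.choices[0].message.content or "").strip()
            finals.append(text.splitlines()[-1] if text else "")
        agreement = Counter(finals).most_common(1)[0][1]
        print(f"{model_id} / prompt {i}: {agreement}/{REPEATS} runs agreed on the final line")
```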
Happy to share concrete findings if helpful; this is exactly the kind of model where early stress-testing pays off before v2.
Alpie Core
@sujal_meghwal This is an excellent breakdown; thank you for taking the time. Long-context failure modes, OpenAI-compatible edge cases, and cost amplification are all areas we are actively exploring, especially with 65K context, where issues surface quickly. Multi-step planning and agent-style workflows have already exposed some subtle regressions internally, so your points really resonate. Would love to compare notes if you end up running any concrete tests.
GraphBit
If I had 5M tokens to work with, I’d spend them on system-level behavior, not raw benchmarks.
Specifically:
– Latency + determinism under load in multi-step workflows
– Reasoning stability across retries (does the plan drift or converge?); a rough harness for this and the latency point is sketched after the list
– Tool-call consistency when the model is embedded in an agent loop
– Failure characteristics: how it degrades, not just how it performs at peak
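Here is a rough sketch of the retry-stability and latency-variance check I have in mind, assuming an OpenAI-compatible endpoint. The URL, model id, and prompt are placeholders, and the "plan signature" is deliberately crude.

```python
# Run one planning prompt many times; record latency spread and whether the plan
# converges to the same numbered steps. URL, model id, and prompt are placeholders.
import os
import re
import statistics
import time
from openai import OpenAI

client = OpenAI(base_url=os.environ["BASE_URL"], api_key=os.environ["API_KEY"])
MODEL = os.environ.get("MODEL", "alpie-core-32b")  # placeholder id
PROMPT = "Produce a numbered 5-step plan to migrate a Postgres schema with zero downtime."

latencies, plans = [], []
for _ in range(10):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.2,
        max_tokens=400,
    )
    latencies.append(time.perf_counter() - start)
    text = resp.choices[0].message.content or ""
    # Crude plan signature: the numbered step headers, in order.
    plans.append(tuple(re.findall(r"^\s*\d+\.\s*(.+)$", text, flags=re.MULTILINE)))

print(f"latency median={statistics.median(latencies):.2f}s "
      f"p95~{statistics.quantiles(latencies, n=20)[18]:.2f}s")
print(f"distinct plans across {len(plans)} retries: {len(set(plans))}")
```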
Benchmarks matter, but what decides adoption is how the model behaves once it’s part of a real execution path. Reasoning models live or die by predictability in production, not leaderboard scores.
Alpie Core
@musa_molla I completely agree with this perspective. Once a model becomes part of an execution path, its predictability and failure characteristics are much more important than its peak benchmark scores.
These are the specific aspects we are focusing on now: latency variance under load, plan stability across retries, and how the model performs within agent loops as state accumulates. We've found that understanding where and how a model degrades is often more insightful than knowing where it performs at its best.
If you decide to use the 5 million tokens to investigate any of these system-level behaviors, we would greatly appreciate your feedback. Observations from real workflows help shape our priorities moving forward.
I'm happy to share notes as you conduct your exploration.