Today we shipped AI inference routing in oneinfer-edge.

Not a load balancer. Not a failover script. A routing layer that understands the difference between your local machine, your cloud GPU, and your third-party API, and moves each request to the right one automatically.

Three directions. One routing layer.

Local to cloud. When your local inference is at capacity, routing picks up the overflow and sends it to your cloud endpoint. Your app never sees the difference.

Cloud to local. When local capacity frees up, traffic comes back automatically. Most teams end up serving 80 to 90 percent of requests from local at near-zero marginal cost. Cloud handles only genuine overflow, so cloud spend tracks actual demand instead of a flat rate that never adjusts.
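To make the overflow idea concrete, here is a minimal sketch in Python. It is not the oneinfer-edge implementation or API. The `OverflowRouter` class, its `local_capacity` parameter, and the `"local"` / `"cloud"` labels are hypothetical names chosen for illustration, and capacity is assumed to be a simple cap on concurrent in-flight requests.

```python
class OverflowRouter:
    """Illustrative only: prefer local, spill to cloud when local is full."""

    def __init__(self, local_capacity: int):
        # Assumption: capacity is the max number of concurrent local requests.
        self.local_capacity = local_capacity
        self.local_in_flight = 0

    def acquire(self) -> str:
        """Pick a target for a new request and reserve a local slot if one is free."""
        if self.local_in_flight < self.local_capacity:
            self.local_in_flight += 1
            return "local"
        # Local is saturated, so this request overflows to the cloud endpoint.
        return "cloud"

    def release(self, target: str) -> None:
        """Free the local slot when a local request finishes."""
        if target == "local" and self.local_in_flight > 0:
            self.local_in_flight -= 1


router = OverflowRouter(local_capacity=2)
first = router.acquire()    # local slot 1
second = router.acquire()   # local slot 2
third = router.acquire()    # local is full, overflows
router.release(first)       # a local slot frees up
fourth = router.acquire()   # traffic comes back to local
```

Because the check happens per request, traffic returns to local as soon as a slot frees up, with no separate "fail back" step.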

Cloud provider to cloud provider. When your primary provider starts throttling, spiking in latency, or returning errors, routing shifts traffic to your next-best option based on real-time cost, reliability, scalability, and performance signals. No single provider is a single point of failure.
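One way to picture provider-to-provider shifting is a rolling health check per provider. The sketch below is an assumption-laden illustration, not how oneinfer-edge does it: `ProviderHealth`, `choose_provider`, the window size, and the error-rate and latency thresholds are all hypothetical.

```python
from collections import deque


class ProviderHealth:
    """Illustrative rolling window of recent request outcomes for one provider."""

    def __init__(self, name: str, window: int = 20):
        self.name = name
        # Each sample is (ok, latency_seconds); old samples fall off the window.
        self.samples = deque(maxlen=window)

    def record(self, ok: bool, latency_s: float) -> None:
        self.samples.append((ok, latency_s))

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        failures = sum(1 for ok, _ in self.samples if not ok)
        return failures / len(self.samples)

    def avg_latency(self) -> float:
        if not self.samples:
            return 0.0
        return sum(latency for _, latency in self.samples) / len(self.samples)

    def healthy(self, max_error_rate: float = 0.2, max_latency_s: float = 2.0) -> bool:
        # Hypothetical thresholds; a real deployment would tune these.
        return self.error_rate() <= max_error_rate and self.avg_latency() <= max_latency_s


def choose_provider(providers):
    """Return the first healthy provider in preference order, else the primary."""
    for provider in providers:
        if provider.healthy():
            return provider
    return providers[0]


primary = ProviderHealth("primary")
secondary = ProviderHealth("secondary")
before = choose_provider([primary, secondary]).name

# Simulate the primary throttling: ten failed calls in a row.
for _ in range(10):
    primary.record(ok=False, latency_s=0.3)
secondary.record(ok=True, latency_s=0.4)
after = choose_provider([primary, secondary]).name
```

Throttling responses and errors are treated the same way here (as failed samples). A fuller version would likely weigh them differently and fold in cost.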

Every direction is driven by the same four signals. Cost. Reliability. Scalability. Performance. You set the weights. The routing layer handles everything from there.
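As a rough illustration of weighted signals, the sketch below scores each backend as a weighted average and picks the highest. This is a guess at the general shape, not the project's actual scoring. The `Backend` fields, the `score` and `pick_backend` functions, and the convention that every signal is normalized to 0..1 with higher meaning better (so a cheaper backend gets a higher `cost` value) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    # Assumption: each signal is pre-normalized to [0, 1], higher is better.
    cost: float
    reliability: float
    scalability: float
    performance: float


def score(backend: Backend, weights: dict) -> float:
    """Weighted average of the four signals; weights need not sum to 1."""
    total_weight = sum(weights.values())
    weighted = sum(w * getattr(backend, signal) for signal, w in weights.items())
    return weighted / total_weight


def pick_backend(backends, weights: dict) -> Backend:
    """Return the backend with the highest weighted score."""
    return max(backends, key=lambda b: score(b, weights))


local = Backend("local", cost=1.0, reliability=0.9, scalability=0.3, performance=0.7)
cloud = Backend("cloud", cost=0.4, reliability=0.95, scalability=0.9, performance=0.9)

cost_first = {"cost": 0.7, "reliability": 0.1, "scalability": 0.1, "performance": 0.1}
scale_first = {"cost": 0.1, "reliability": 0.1, "scalability": 0.7, "performance": 0.1}

cheap_choice = pick_backend([local, cloud], cost_first).name
burst_choice = pick_backend([local, cloud], scale_first).name
```

With the same two backends, changing only the weights flips the decision: a cost-heavy profile prefers local, a scalability-heavy profile prefers cloud.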

We built this because managing inference across multiple environments should not be a full-time job.

It's fully open source. Star the repo if this solves something you are dealing with right now.

A walkthrough video showing how to use the feature is attached.

Repo details in the description.

We welcome contributions from the community. Please feel free to open issues and contribute.

Star the repo: github.com/oneinfer/oneinfer-edge

Issues and features: github.com/oneinfer/oneinfer-edge/issues

Learn more: https://oneinfer.ai/platform/oneinfer-edge