Introduced alongside Ollama's support for OpenAI's gpt-oss models is Turbo, Ollama's privacy-first, datacenter-grade cloud inference service.
While it's currently in preview, the service costs $20/month and has both hourly and daily usage limits; usage-based pricing will be available soon. So far, Turbo only offers the gpt-oss-20b and gpt-oss-120b models, and it works with Ollama's App, CLI, and API.
Reviews highlight secure, offline use. Users echo the simplicity: easy setup, Docker-like workflows, quick prototyping, solid performance, and cost savings. Some note the best results come with mid-size models and smooth integrations via APIs.
I switched to Ollama from clunkier solutions and I have no regrets. Multimodal model support is well-implemented - feeding images through the API is just as easy as text. It’s great that the project is evolving so fast; support for new models usually pops up almost the day after they drop on Hugging Face. This is the simplest way to "play around" with modern AI locally without the headache of setting up the environment.
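To make the image point concrete, here is a minimal sketch of a multimodal request against the local Ollama API. The model name (llava) and the image file are assumptions; any vision-capable model you have pulled will do.

    # Base64-encode an image and pass it alongside the prompt.
    # The "images" field of /api/generate accepts base64-encoded image data.
    # (On macOS, use `base64 -i photo.jpg` instead of `-w0`.)
    IMG=$(base64 -w0 photo.jpg)
    curl http://localhost:11434/api/generate -d "{
      \"model\": \"llava\",
      \"prompt\": \"Describe this image.\",
      \"images\": [\"$IMG\"],
      \"stream\": false
    }"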
What needs improvement
I’d love to see a simpler way to import my own .gguf files downloaded outside the official Ollama library, without having to manually define all the parameters in a Modelfile.
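For context, importing an outside .gguf today goes through a short Modelfile; a minimal sketch, assuming the file is already in the working directory (the file and model names below are hypothetical):

    # Modelfile
    FROM ./mistral-7b-instruct.Q4_K_M.gguf

    # Register the file with Ollama, then run it
    ollama create my-imported-model -f Modelfile
    ollama run my-imported-model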
How much disk space do common models require locally?
Typically, popular medium-sized models (7B-8B parameters) like Llama 3 or Mistral take up about 4.5-5 GB in standard 4-bit quantization.
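As a rough sanity check: 8 billion weights at roughly 4.5 bits each works out to 8 × 10⁹ × 4.5 / 8 ≈ 4.5 GB, which lines up with the ~4.7 GB that the default 4-bit Llama 3 8B build occupies on disk.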
How simple is creating and customizing your own models?
It's simple using the Modelfile system; the process is very similar to writing a Dockerfile.
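A minimal sketch of what that looks like; the base model, parameter, and system prompt here are illustrative, not prescriptive:

    # Modelfile
    FROM llama3
    PARAMETER temperature 0.7
    SYSTEM "You are a concise assistant for internal engineering docs."

    # Build the custom model and chat with it, Dockerfile-style
    ollama create docs-assistant -f Modelfile
    ollama run docs-assistant

PARAMETER and SYSTEM play roughly the role of build arguments and an entrypoint: they bake default sampling settings and a system prompt into the new model tag.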
We’re exploring Ollama to test and run LLMs locally—faster iteration, zero latency, total control. It’s like having our own AI lab, minus the GPU bills.
What's great
fast performance (1)
local AI model deployment (11)
no third-party API reliance (3)
AI server hosting (2)