New Updates 9/6
You can create a cache around your app that persists between scale-down and scale-up. This helps reduce cold-start times and can be used for things such as a tensor cache, a vLLM cache, etc.
Optimized cold-starts to take less than 200ms when multiple scale-down and scale-up events occur; this is done by freezing VRAM while GPUs are idle.
Introduced a Warmed status that shows which replicas are in that state; these will cold-start in less than 200ms. We always prioritize starting Warmed replicas before scaling up Idle ones, since Warmed replicas scale up faster.
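
To illustrate the start-up ordering described above, here is a minimal Python sketch. It is not the platform's actual scheduler: the `Replica` class, the `pick_replicas_to_start` function, and the "Running" status are hypothetical names made up for this example. Only the Warmed and Idle statuses and the "Warmed first" rule come from the update.

```python
from dataclasses import dataclass

# Lower number = started first. "Warmed" and "Idle" are the statuses named
# in the update; the numeric priorities are an assumption for illustration.
STARTUP_PRIORITY = {"Warmed": 0, "Idle": 1}


@dataclass
class Replica:
    name: str    # hypothetical identifier
    status: str  # e.g. "Warmed", "Idle", or (hypothetically) "Running"


def pick_replicas_to_start(replicas, needed):
    """Return up to `needed` startable replicas, Warmed ones first."""
    # Only Warmed and Idle replicas are candidates for scaling up.
    candidates = [r for r in replicas if r.status in STARTUP_PRIORITY]
    # list.sort() is stable, so replicas with the same status keep
    # their original relative order.
    candidates.sort(key=lambda r: STARTUP_PRIORITY[r.status])
    return candidates[:needed]


if __name__ == "__main__":
    pool = [
        Replica("a", "Idle"),
        Replica("b", "Warmed"),
        Replica("c", "Running"),
        Replica("d", "Warmed"),
    ]
    for replica in pick_replicas_to_start(pool, 3):
        print(replica.name, replica.status)
```

In this sketch, Idle replicas are only picked once every Warmed replica has been used, which mirrors the stated goal of serving new load from the fastest-starting replicas first.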
