New Updates 9/6

by Anvitha V
  • You can now create a cache for your app that persists across scale-down and scale-up events. This reduces cold-start times and can be used for things such as a tensor cache or a vLLM cache.

  • Optimized cold starts to take less than 200 ms when repeated scale-down and scale-up events occur. This is done by freezing VRAM while GPUs are idle.

  • Introduced a Warmed status so you can see which replicas are in that state; Warmed replicas cold-start in less than 200 ms. When scaling up, we always start Warmed replicas before Idle ones, since Warmed replicas start faster.
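The scale-up ordering described in the last item (Warmed replicas before Idle ones) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the platform's actual scheduler: the `ReplicaState`, `Replica`, and `pick_replicas_to_start` names are hypothetical, and running replicas are assumed to be excluded from selection.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicaState(Enum):
    # Hypothetical states mirroring the statuses named in the update.
    WARMED = "warmed"    # VRAM frozen; expected to cold-start in < 200 ms
    IDLE = "idle"        # requires a slower, full cold start
    RUNNING = "running"  # already serving; never picked for scale-up


@dataclass
class Replica:
    name: str
    state: ReplicaState


# Lower rank means "start first". Running replicas are absent, so they are skipped.
_START_PRIORITY = {ReplicaState.WARMED: 0, ReplicaState.IDLE: 1}


def pick_replicas_to_start(replicas: list[Replica], count: int) -> list[Replica]:
    """Return up to `count` startable replicas, Warmed ones before Idle ones."""
    candidates = [r for r in replicas if r.state in _START_PRIORITY]
    # sorted() is stable, so replicas sharing a state keep their original order.
    candidates = sorted(candidates, key=lambda r: _START_PRIORITY[r.state])
    return candidates[:count]
```

For example, given replicas `a` (Idle), `b` (Warmed), `c` (Running), and `d` (Warmed), asking for two replicas returns `b` and `d`; asking for three adds `a`. Sorting by a fixed rank keeps the rule easy to extend if more states are added later.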
