Forge CLI - Swarm agents optimize CUDA/Triton for any HF/PyTorch model
Forge generates optimized GPU kernels from any PyTorch or HuggingFace model. 32 parallel Coder+Judge agents compete to find the fastest CUDA/Triton implementation. The resulting kernels run up to 5× faster than torch.compile(mode='max-autotune'), with 97.6% correctness.
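To make the "compete to find the fastest" idea concrete, here is a minimal sketch of a generate-then-judge selection loop in plain Python. It is not Forge's implementation: the names `reference`, `judge`, `pick_fastest` and the three candidate functions are hypothetical, ordinary Python functions stand in for CUDA/Triton kernels, and a thread pool stands in for the parallel agents. It only illustrates the general pattern of rejecting incorrect candidates first and then keeping the fastest of the rest.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def reference(xs):
    # Hypothetical baseline; stands in for the implementation being beaten.
    return [x * x + 1.0 for x in xs]


# Hypothetical candidates; in a real system these would be generated kernels.
def candidate_loop(xs):
    out = []
    for x in xs:
        out.append(x * x + 1.0)
    return out


def candidate_map(xs):
    return list(map(lambda x: x * x + 1.0, xs))


def candidate_wrong(xs):
    # Does less work but drops the "+ 1.0", so the judge must reject it.
    return [x * x for x in xs]


def judge(candidate, inputs, tol=1e-6, repeats=5):
    """Return (name, is_correct, best_seconds) for one candidate."""
    expected = reference(inputs)
    got = candidate(inputs)
    correct = len(got) == len(expected) and all(
        abs(a - b) <= tol for a, b in zip(got, expected)
    )
    if not correct:
        return (candidate.__name__, False, float("inf"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        candidate(inputs)
        best = min(best, time.perf_counter() - start)
    return (candidate.__name__, True, best)


def pick_fastest(candidates, inputs, workers=4):
    """Judge all candidates concurrently; return (winner, all_results)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: judge(c, inputs), candidates))
    valid = [r for r in results if r[1]]
    winner = min(valid, key=lambda r: r[2]) if valid else None
    return winner, results


inputs = [float(i) for i in range(10_000)]
winner, results = pick_fastest(
    [candidate_loop, candidate_map, candidate_wrong], inputs
)
print("winner:", winner[0])
```

The correctness check runs before any timing, so a fast but wrong candidate can never win. Timings taken inside Python threads are noisy; a real GPU benchmark would need warm-up runs and device synchronization before reading the clock.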
Enter a HuggingFace model ID and get optimized kernels for every layer. Powered by an optimized NVIDIA Nemotron 3 Nano 30B running at 250k tokens/sec.
"Full refund if we don't beat torch.compile"



RightNow AI