Jaber Jaber

Forge CLI - Swarm agents optimize CUDA/Triton for any HF/PyTorch model

Forge generates optimized GPU kernels from any PyTorch or HuggingFace model: 32 parallel Coder+Judge agents compete to find the fastest CUDA/Triton implementation. The result is up to 5× faster than torch.compile(mode='max-autotune') with 97.6% correctness. Enter a HuggingFace model ID and get optimized kernels for every layer. Powered by an optimized NVIDIA Nemotron 3 Nano 30B running at 250k tokens/sec. "Full refund if we don't beat torch.compile."
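For context, the headline speedup is measured against PyTorch's own autotuned compiler. Below is a minimal sketch of that baseline; the model, shapes, and timing loop are illustrative placeholders, not Forge's actual benchmark harness.

```python
import torch
import torch.nn as nn

# Illustrative stand-in model; Forge targets arbitrary HF/PyTorch models.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
).cuda().half()

# The baseline Forge compares against: torch.compile with max-autotune.
compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(32, 4096, device="cuda", dtype=torch.half)

# Warm-up runs trigger compilation and autotuning.
for _ in range(3):
    compiled(x)
torch.cuda.synchronize()

# Time the compiled forward pass with CUDA events.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    compiled(x)
end.record()
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```

Forge's claim is that its agent-generated CUDA/Triton kernels beat the latency measured this way for each layer of the model.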


Replies

Jaber Jaber
2025 was the year of AI agents; 2026 will be the year of swarm agents, and Forge is how we're starting it: 32 agents competing in parallel to optimize your GPU kernels. Enter a HuggingFace model ID and get optimized CUDA/Triton for every layer. "Full refund if we don't beat torch.compile." Would love your feedback :D
Osama Jaber
What a beautiful swarm of agents 🤗