RightNow AI

#1 GPU AI Code Editor


RightNow AI is the #1 GPU AI code editor. It combines GPU profiling, benchmarking, AI optimization, GPU virtualization, and a full GPU emulator in one environment to help developers analyze and optimize CUDA code faster.
This is the 10th launch from RightNow AI.
Forge Agent

Launched this week
Swarm Agents That Turn Slow PyTorch Into Fast GPU Kernels
Forge turns PyTorch models into optimized CUDA and Triton kernels automatically. 32 AI agents run in parallel, each trying different optimization strategies like tensor cores, memory coalescing, and kernel fusion. A judge validates every kernel for correctness before benchmarking. We got 5x faster inference than torch.compile on Llama 3.1 8B and 4x on Qwen 2.5 7B. Works on any PyTorch model. Free trial on one kernel. Full credit refund if we don't beat torch.compile.
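
To make the coder/judge pattern concrete, here is a minimal sketch of the kind of correctness-then-benchmark gate a judge could apply to each candidate kernel before it competes on speed. Every name, tolerance, and detail below is an illustrative assumption, not Forge's actual implementation.

```python
import torch

def judge_candidate(candidate_fn, reference_fn, example_inputs,
                    rtol=1e-3, atol=1e-3, warmup=10, iters=100):
    """Hypothetical judge: validate a generated kernel against a reference,
    then benchmark it. candidate_fn is the generated CUDA/Triton wrapper,
    reference_fn the eager or torch.compile baseline."""
    # 1. Correctness gate: outputs must match the reference within tolerance.
    ref_out = reference_fn(*example_inputs)
    cand_out = candidate_fn(*example_inputs)
    if not torch.allclose(cand_out, ref_out, rtol=rtol, atol=atol):
        return None  # reject the kernel before it is ever benchmarked

    # 2. Benchmark only kernels that passed the gate.
    def time_fn(fn):
        for _ in range(warmup):
            fn(*example_inputs)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(*example_inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # ms per call

    return {"candidate_ms": time_fn(candidate_fn),
            "reference_ms": time_fn(reference_fn)}
```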

Jaber Jaber
Hey PH! If we don't beat torch.compile, you get your credits back!! Real results on B200:
Llama 3.1 8B: 5x faster than torch.compile
Qwen 2.5 7B: 4x faster
SDXL UNet: 3x faster
Daniele Packard

Congrats! Can you define the rules that the judge uses?

Jaber Jaber
@daniele_packard Currently no, but nice point! We'll add the ability to edit the judge's rules.
Curious Kitty
Correctness is the main risk with generated kernels. What is your validation strategy beyond “matches reference outputs”—e.g., tolerances, randomized testing across shapes/dtypes, determinism, and how you debug/report failures so users can trust and iterate quickly?
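
For readers wondering what the validation this comment asks about could look like in practice, here is a hedged sketch of randomized correctness fuzzing across shapes and dtypes against a reference implementation. The function names, tolerances, and failure report format are assumptions for illustration, not a description of Forge's pipeline.

```python
import itertools
import torch

def fuzz_kernel(candidate_fn, reference_fn, shapes, dtypes, trials=5, seed=0):
    """Illustrative randomized check: run the candidate and reference on
    random inputs over a grid of shapes and dtypes, and collect failures
    so a user could inspect and iterate on them."""
    torch.manual_seed(seed)  # make the fuzz run reproducible
    failures = []
    for shape, dtype in itertools.product(shapes, dtypes):
        # looser tolerance for half-precision dtypes (assumed policy)
        tol = 1e-2 if dtype in (torch.float16, torch.bfloat16) else 1e-4
        for _ in range(trials):
            x = torch.randn(shape, device="cuda", dtype=dtype)
            ref = reference_fn(x)
            cand = candidate_fn(x)
            err = (cand.float() - ref.float()).abs().max().item()
            if err > tol:
                failures.append({"shape": shape, "dtype": dtype,
                                 "max_abs_err": err})
    return failures
```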
Piroune Balachandran

32 parallel coder+judge pairs is a smart setup. The judge comparison logic is the interesting part... wondering if it just checks against the torch.compile baseline or if you can define custom metrics like memory footprint or specific tensor core utilization targets.
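
As an illustration of what a "custom metric" for a candidate kernel could measure, here is a small sketch that collects peak GPU memory and latency using stock PyTorch APIs. The kernel_fn parameter and the choice of metrics are assumptions, not a documented Forge interface.

```python
import torch

def measure_memory_and_latency(kernel_fn, *inputs):
    """Collect peak GPU memory and latency for one candidate kernel
    (illustrative custom-metric example)."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = kernel_fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return {
        "peak_mem_bytes": torch.cuda.max_memory_allocated(),
        "latency_ms": start.elapsed_time(end),
        "output": out,
    }
```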

Devin Owen

Turning “PyTorch in, tuned CUDA/Triton out” into something productized like this is a very ambitious swing, especially with 32 agents coordinating on the same kernel. The hardest part of these systems in my experience is not just finding a faster variant once, but keeping the optimized kernels robust across driver changes, new GPUs and slightly different input shapes without a constant babysitting loop.

How are you handling that stability vs. raw speed tradeoff in the UX: do you bias toward conservative, portable kernels by default, or lean into aggressive, hardware-specific wins and let power users manage the risk?