Sandeep Singh

crunr - Launch and run any compute job on AWS with 1 command

crunr: run it, ghost it.

GPU compute is $1.5/hr. But your real bill looks like this:
- Idle time sitting there: $800/mo
- Infra team to manage it: $3,000/mo
- Failed setups and debugging: days lost
- 3am emergency fixes: priceless

crunr fixes all of it.

$ crunr run train.py --gpu

Spins up → runs → terminates. You pay for compute only. Nothing else. No idle bills. No DevOps. No lingering servers.

Built for ML researchers, indie AI builders, and startup teams who just want their job to run.

Sandeep Singh
Hey PH 👋 I'm Sandeep, the infra guy my data science team DM'd every time they needed GPUs.

We had a ₹9,000/day GPU server. And a Slack thread. The message was always the same: "GPU's free, who wants it next?" If nobody replied fast enough, the meter kept going 🔥. Full rate. Whether anyone was training or not.

I ran the numbers. 65% idle. We were paying for a machine doing absolutely nothing most of the day. Renting compute per day when you need it per job is like hiring a full-time delivery driver because you order food three times a week.

So I built crunr.

$ crunr run train.py --gpu

Spins up → runs → saves your outputs → terminates. Job done; instance gone. Every time. No exceptions.

No controller VM. No SaaS layer. No data moving through infrastructure we control, because there is no infrastructure we control. Your AWS. Your CloudTrail. Your data.

A 3-hour training run now costs ₹170. Between runs: ₹0. Not rounded. Exactly zero.

No more Slack thread. No more idle bills. No more 3am fixes. Just crunr run. Free to start 👇
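
For anyone who wants to check the math in the post above, here is a rough sketch of the arithmetic. The rupee figures and the 65% idle share are the ones quoted; the 24-hour billing day and all variable and function names are assumptions made for illustration.

```python
# Figures quoted in the post; everything else is derived from them.
DAILY_RATE_INR = 9_000   # flat per-day GPU server rental
IDLE_FRACTION = 0.65     # share of the day the server sat idle
RUN_COST_INR = 170       # quoted cost of one 3-hour training run
RUN_HOURS = 3
HOURS_PER_DAY = 24       # assumption: the day rate covers a full 24 hours


def idle_cost_per_day(daily_rate: float, idle_fraction: float) -> float:
    """Money spent each day on hours when nothing was running."""
    return daily_rate * idle_fraction


def hourly_rate(total_cost: float, hours: float) -> float:
    """Effective cost per hour for a given total and duration."""
    return total_cost / hours


wasted_per_day = idle_cost_per_day(DAILY_RATE_INR, IDLE_FRACTION)
flat_hourly = hourly_rate(DAILY_RATE_INR, HOURS_PER_DAY)
per_job_hourly = hourly_rate(RUN_COST_INR, RUN_HOURS)

print(f"Idle spend per day:        ₹{wasted_per_day:,.0f}")
print(f"Day-rate server, per hour: ₹{flat_hourly:,.2f}")
print(f"Per-job run, per hour:     ₹{per_job_hourly:,.2f}")
```

The comparison is only indicative: the two hourly rates may refer to different instance types, which the post does not specify.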
Artem Fedorovich

@sandeep_01 The "hiring a full-time delivery driver because you order food three times a week" line nails why per-day GPU rental is insane for bursty workloads. The 65% idle number is the whole pitch in one stat. Question on the ephemeral model, since "instance gone, every time" is the feature and the risk: what happens to a long training run that dies at hour 6 of 9? On spot especially, preemption isn't an edge case, it's Tuesday. Is checkpointing on the user to wire up, or does crunr snapshot to S3 on interruption so a killed run resumes instead of restarting from zero? For a 3-hour run that's a shrug, for a multi-day fine-tune it's the difference between the tool being usable and not. Upvoted.

Sandeep Singh

@artem_fedorovich 

glad that line landed. it's exactly how it felt watching the bill.

honest answer:

if the instance dies, it's gone. always. no idle time after a crash.

with the --s3 flag on, everything in outputs/ is already in S3 before the instance terminates. crash at hour 6: your last saved checkpoint is safe.

checkpointing logic is on you for now. save to outputs/ periodically. crunr handles the rest.
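
A minimal sketch of what "save to outputs/ periodically" could look like inside a training script. This is not crunr code: the file name checkpoint.json, the function names, and the fake training step are all hypothetical, and resuming only works if the previous checkpoint is present in outputs/ when the script starts again, which is assumed here rather than something the thread confirms.

```python
import json
import os
from pathlib import Path

CKPT_DIR = Path("outputs")                # directory the thread says gets synced
CKPT_PATH = CKPT_DIR / "checkpoint.json"  # hypothetical file name


def save_checkpoint(state: dict, path: Path = CKPT_PATH) -> None:
    """Write state to a temp file, then rename, so a crash never leaves a torn file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path: Path = CKPT_PATH) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "loss": None}


def train(total_epochs: int, save_every: int = 5, path: Path = CKPT_PATH) -> dict:
    """Run (or resume) training, checkpointing every `save_every` epochs."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        state["loss"] = 1.0 / (epoch + 1)  # stand-in for a real training step
        state["epoch"] = epoch + 1
        if state["epoch"] % save_every == 0 or state["epoch"] == total_epochs:
            save_checkpoint(state, path)
    return state


if __name__ == "__main__":
    final = train(total_epochs=20)
    print(f"finished at epoch {final['epoch']}")
```

A real script would store model and optimizer state instead of a JSON dict, but the shape is the same: load if present, save on a schedule, write atomically.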

for spot specifically: on-demand is the right call for multi-day runs. spot makes sense when your job is resumable or short enough that a restart is fine.

one week away: automatic mid-job checkpointing ships. crash anywhere, resume from exactly there. no wiring needed.

and thank you for the upvote. 🙏

Shuvam Mandal

This is a cool problem to solve. We have a similar problem on our backend: we are paying for idle server time on AWS. One question: where do you keep the instance output, and is crunr also available for CPU jobs?

Sandeep Singh

@shuvam_mandal2 

thank you! and yes — exactly that problem.

two things on your questions:

outputs: by default they rsync straight back to your laptop when the job finishes. configure S3 once with crunr s3 setup and outputs go to your own S3 bucket automatically. you can even skip the local download entirely with --s3-no-local and pull from S3 whenever you need.

CPU: fully supported. crunr run script.py without --gpu picks a CPU instance. need specific RAM? --memory 32 gets you 32GB+. add --spot if you want spot pricing. same flow: spins up, runs, terminates.

idle billing on backend servers is a real one. would love to hear more about your setup. 🙏

Gaurav Aroraa

The ephemeral spin-up-run-terminate model is the right abstraction for batch ML jobs. We've burned significant budget on GPU instances idling after failed training runs, especially when a job crashes at epoch 40 and the instance just sits there. How do you handle mid-job failures and artifact persistence? Does the runner automatically sync outputs to S3 before terminating?

Sandeep Singh

@retain_dev yes, the instance terminates the moment the job crashes. always.

artifacts: run with --s3 and everything in outputs/ syncs to your S3 bucket before the instance is gone. crash at epoch 40: your last saved checkpoint is already in S3.

one honest answer: mid-run snapshotting is on you to wire into your training script for now. save checkpoints to outputs/ periodically. crunr handles the rest.

that said, automatic mid-job checkpointing is shipping next week. crash anywhere, resume from exactly where you left off. no wiring needed.

the idle-instance-after-a-crash problem: already solved. the resume problem: one week away. 🙏