crunr - Launch and run any compute job on AWS with 1 command

crunr

2mo ago

crunr: run it, ghost it.

GPU compute is $1.5/hr. But your real bill looks like this:

- Idle time sitting there: $800/mo
- Infra team to manage it: $3,000/mo
- Failed setups and debugging: days lost
- 3am emergency fixes: priceless

crunr fixes all of it.

    $ crunr run train.py --gpu

Spins up → runs → terminates. You pay for compute only. Nothing else. No idle bills. No DevOps. No lingering servers.

Built for ML researchers, indie AI builders, and startup teams who just want their job to run.

Replies

crunr

Maker

Hey PH 👋 I'm Sandeep, the infra guy my data science team DM'd every time they needed GPUs.

We had a ₹9,000/day GPU server. And a Slack thread. The message was always the same: "GPU's free, who wants it next?" If nobody replied fast enough, the meter kept going 🔥. Full rate. Whether anyone was training or not.

I ran the numbers. 65% idle. We were paying for a machine doing absolutely nothing most of the day. Renting compute per day when you need it per job is like hiring a full-time delivery driver because you order food three times a week.

So I built crunr.

    $ crunr run train.py --gpu

Spins up → runs → saves your outputs → terminates. Job done; instance gone. Every time. No exceptions.

No controller VM. No SaaS layer. No data moving through infrastructure we control, because there is no infrastructure we control. Your AWS. Your CloudTrail. Your data.

A 3-hour training run now costs ₹170. Between runs: ₹0. Not rounded. Exactly zero.

No more Slack thread. No more idle bills. No more 3am fixes. Just crunr run.

Free to start 👇
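For anyone who wants to check the idle-cost arithmetic, here is a rough sketch using only the figures quoted in the post (₹9,000/day, 65% idle, ₹170 for a 3-hour run). The 30-day month and the variable names are assumptions made for illustration, not numbers from crunr.

```python
# Figures quoted in the post above.
DAILY_RATE_INR = 9_000   # always-on GPU server, billed per day
IDLE_PERCENT = 65        # share of the day nobody was training
JOB_COST_INR = 170       # one 3-hour training run with crunr
JOB_HOURS = 3

# Assumption for illustration only: a 30-day month.
DAYS_PER_MONTH = 30

# Money spent on the always-on server while it sat idle.
idle_cost_per_day = DAILY_RATE_INR * IDLE_PERCENT // 100
idle_cost_per_month = idle_cost_per_day * DAYS_PER_MONTH

# Hourly rate implied by the quoted per-job price.
per_job_hourly = JOB_COST_INR / JOB_HOURS

print(f"Idle spend per day:   ₹{idle_cost_per_day:,}")      # ₹5,850
print(f"Idle spend per month: ₹{idle_cost_per_month:,}")    # ₹175,500
print(f"Implied job rate:     ₹{per_job_hourly:.2f}/hr")    # ₹56.67/hr
```

The quoted job price and the quoted daily server rate may not refer to the same instance type, so the sketch does not compare them directly.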

2mo ago

@sandeep_01 The "hiring a full-time delivery driver because you order food three times a week" line nails why per-day GPU rental is insane for bursty workloads. The 65% idle number is the whole pitch in one stat. Question on the ephemeral model, since "instance gone, every time" is the feature and the risk: what happens to a long training run that dies at hour 6 of 9? On spot especially, preemption isn't an edge case, it's Tuesday. Is checkpointing on the user to wire up, or does crunr snapshot to S3 on interruption so a killed run resumes instead of restarting from zero? For a 3-hour run that's a shrug, for a multi-day fine-tune it's the difference between the tool being usable and not. Upvoted.

2mo ago

crunr

Maker

@artem_fedorovich

glad that line landed: it's exactly how it felt watching the bill.

honest answer:

if the instance dies, it's gone. always. no idle after a crash.

with the --s3 flag on, everything in outputs/ is already in S3 before the instance terminates. crash at hour 6, and your last saved checkpoint is safe.

checkpointing logic is on you for now: save to outputs/ periodically and crunr handles the rest.
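Since checkpointing is left to the training script for now, a minimal sketch of "save to outputs/ periodically" might look like the following. It relies only on what the reply states (files in outputs/ are synced when --s3 is on). The file name checkpoint.json, the helper names, and the JSON state are hypothetical stand-ins for a real framework checkpoint, not part of crunr.

```python
import json
import os
import tempfile

OUTPUT_DIR = "outputs"  # the directory the reply says is synced with --s3
CHECKPOINT = os.path.join(OUTPUT_DIR, "checkpoint.json")  # hypothetical name


def save_checkpoint(state, path=CHECKPOINT):
    """Write state atomically so a crash mid-write never leaves a torn file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path=CHECKPOINT):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def train(total_epochs, save_every=5, path=CHECKPOINT):
    """Resume from the last checkpoint and save every `save_every` epochs."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would go here ...
        state["epoch"] = epoch + 1
        if state["epoch"] % save_every == 0 or state["epoch"] == total_epochs:
            save_checkpoint(state, path)
    return state
```

Resuming on a fresh instance would also require the checkpoint to be back in outputs/ before the script starts. The thread does not say whether crunr restores it, so treat that step as something to handle yourself.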

on spot specifically: on-demand is the right call for multi-day runs. spot makes sense when your job is resumable or short enough that a restart is fine.

one week away: automatic mid-job checkpointing ships. crash anywhere, resume from exactly there. no wiring needed.

and thank you for the upvote. 🙏

2mo ago

The ephemeral spin-up-run-terminate model is the right abstraction for batch ML jobs. We've burned significant budget on GPU instances idling after failed training runs, especially when a job crashes at epoch 40 and the instance just sits there. How do you handle mid-job failures and artifact persistence? Does the runner automatically sync outputs to S3 before terminating?

2mo ago

crunr

Maker

@retain_dev yes, the instance terminates the moment the job crashes. always.

artifacts: run with --s3 and everything in outputs/ syncs to your S3 bucket before the instance is gone. crash at epoch 40, and your last checkpoint is already in S3.

one honest answer: mid-run snapshotting is on you to wire into your training script for now. save checkpoints to outputs/ periodically. crunr handles the rest.

that said, automatic mid-job checkpointing is shipping next week. crash anywhere, resume from exactly where you left off. no wiring needed.

the idle-instance-after-a-crash problem: already solved. the resume problem: one week away. 🙏

2mo ago