Aymeric Roucher

ml-intern - Hugging Face's AI agent that automates post-training

An open-source AI agent that fully automates post-training: reads arXiv papers, fixes & creates datasets, runs training jobs, debugs failures, and iterates all by itself. Results: +22 pts on GPQA in 10h and +60% on HealthBench. The future of ML research is here.


Replies

Aymeric Roucher
Hunter
Introducing ml-intern, the agent that just automated the post-training team at Hugging Face.

It's an open-source implementation of the real research loop that our ML researchers run every day. You give it a prompt; it researches papers, follows citations, implements ideas in GPU sandboxes, iterates, and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.

It can pull off crazy things:
- It trained the best model for scientific reasoning. It went through citations from the official benchmark paper, found OpenScience and Nemotron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score from 10% to 32% on GPQA in under 10h. Claude Code's best: 22.99%.
- In healthcare settings, it inspected the available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch covering emergencies, hedging, multilingual cases, etc. It then upsampled them 50x for training. Beat Codex on HealthBench by 60%.
- For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on http://hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded.

All fully backed by papers, autonomously.

How does it work? ml-intern makes full use of the HF ecosystem:
- It finds papers on arXiv and http://hf.co/papers, reads them fully, walks citation graphs, and pulls datasets referenced in methodology sections and on http://hf.co/datasets.
- It browses the Hub, reads recent docs, inspects datasets, and reformats them before training so it doesn't waste GPU hours on bad data.
- It launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, and retrains.

ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like. Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: https://github.com/huggingface/m...
Web + mobile: https://huggingface.co/spaces/sm...

And the best part? Hugging Face has also provisioned $1k in GPU resources and Anthropic credits for the quickest to use it.
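One way to picture the "upsampled 50x" step from the HealthBench example is plain repetition followed by a shuffle. The post does not say how ml-intern actually did it, so the sketch below is only an assumption: the `upsample` helper, the fixed seed, and the placeholder records are all hypothetical and stand in for the real synthetic data.

```python
import random


def upsample(examples, factor, seed=0):
    """Repeat every example `factor` times, then shuffle deterministically."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    repeated = [ex for ex in examples for _ in range(factor)]
    # A seeded RNG keeps the shuffled order reproducible across runs.
    random.Random(seed).shuffle(repeated)
    return repeated


# Placeholder records standing in for the 1100 synthetic data points (hypothetical schema).
synthetic = [{"id": i, "prompt": f"placeholder prompt {i}"} for i in range(1100)]

train_set = upsample(synthetic, factor=50)
print(len(train_set))  # 1100 * 50 = 55000
```

Repeating a small, high-quality set like this gives it more weight during training at the cost of a higher overfitting risk, which is one reason to watch eval outputs closely afterwards.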
Abijah Kajabika

Awesome tool! Thanks for releasing it, I've been using it nonstop for the past two days 🫶

Tanmay Goel

I really like that this runs on the smolagents framework and is model-agnostic. One question, though: how does the agent prioritize between conflicting documentation? When two papers recommend opposite methods, does it rely strictly on the LLM's reasoning from the standard looping sessions, or is there a specific steering mechanism to decide? Also, using JSONL datasets for session observability will be incredibly handy for debugging these complex local runs. Gonna experiment now...
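For anyone who wants to poke at JSONL session logs while debugging, a minimal sketch using only the Python standard library could look like this. The `step` and `event` fields are made up for illustration; the actual schema ml-intern writes is not described in this thread, so adjust the keys to whatever the real files contain.

```python
import io
import json
from collections import Counter


def load_jsonl(stream):
    """Yield one dict per non-empty line; name the line number if JSON is malformed."""
    for lineno, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {lineno}: invalid JSON ({err.msg})") from err


# Hypothetical session log held in memory; a real run would open a file instead.
sample = io.StringIO(
    '{"step": 1, "event": "tool_call"}\n'
    '{"step": 2, "event": "tool_result"}\n'
    "\n"
    '{"step": 3, "event": "tool_call"}\n'
)

records = list(load_jsonl(sample))
event_counts = Counter(r["event"] for r in records)
print(event_counts)
```

Counting events per type is a quick first check on whether a session got stuck repeating the same action.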

Pranav

Hey everyone!
So I was basically building a fraud detection model using isolation forests. The part that took me hours was iterating on feature engineering and retraining after adding new behavioral signals.

I would love to have an agent that reads the relevant papers, fixes the dataset issues, and reruns training jobs without me literally babysitting it.
I did have one question, though: how does it handle domain-specific tabular fraud data compared with the benchmark tasks it's been tested on?

Vedang Mirgal

Wow, this looks really cool. Watching an agent go through papers, citations, datasets, experiments, failures, retraining loops, etc. so autonomously genuinely feels like a very different direction for ML workflows.

I was especially curious about the research side of it, though. Since a lot of fast-moving ML literature appears first as preprints and experimental results, how does the system avoid overcommitting compute and resources to ideas that may later turn out to be weak, noisy, or hard to reproduce?

MEENAKSHI DHAVALESWARAPU

What I found interesting here is that it doesn’t stop at generating training code and calling it done. The whole loop of reading papers, fixing datasets, retrying failed runs, and iterating again feels much closer to an actual ML workflow. Also curious how it decides when a problem is coming from low-quality data versus the model or training setup itself.