Selene by Atla
Hey Product Hunt! Maurice here, CEO and co-founder of Atla.
At Atla, we’re a team of researchers and engineers dedicated to training models and building tools that monitor AI performance.
If you’re building with AI, you know that good evals are critical to ensuring your AI apps perform as intended.
Turns out, getting accurate evals that assess what matters for your use case is challenging. Human evaluations don’t scale and general-purpose LLMs are inconsistent evaluators. We’ve also heard that default eval metrics aren’t precise enough for most use cases, and prompt engineering custom evals from scratch is a lot of work.
🌖 Our solution
Selene 1: an LLM Judge trained specifically for evals. Selene outperforms all frontier models (OpenAI’s o-series, Claude 3.5 Sonnet, DeepSeek R1, etc.) across 11 benchmarks for scoring, classifying, and pairwise comparisons.
Alignment Platform: a tool that helps users automatically generate, test, and refine custom evaluation metrics from just a description of their task, with little-to-no prompt engineering required.
🛠️ Who is it for?
Builders of GenAI apps who need accurate and customizable evals—whether you’re fine-tuning LLMs, comparing outputs, or monitoring performance in production. Evaluate your GenAI products with Selene and ship with confidence.
You can start with our API for free. Our Alignment Platform is available for all users.
We’d love your feedback in the comments! What challenges have you faced with evals?
Selene by Atla
@masump Hey Masum! Selene won't adapt itself out of the box, but we've built the alignment platform to make it easy to continually align your LLM judge to changing requirements.
Fable Wizard
Keeping AI performance in check is no small task, and having an evaluator specifically trained for this sounds like a game-changer! How does Selene handle nuanced tasks where context is key—does it adapt based on different use cases?
Selene by Atla
@jonurbonas Hey Jonas! Yes indeed. We trained Selene to be easily customizable. It excels at following evaluation criteria and score rubrics closely, and responds well to fine-grained steering. For instance, developers using LLM Judges frequently encounter the problem of evals getting saturated, i.e. model responses receiving high scores too frequently, making the eval less useful. In such situations, one might want to “make it harsher” such that fewer responses receive high scores.
You can read more here: https://www.atla-ai.com/post/selene-1
Atla
@jonurbonas Thanks for the support! We built the alignment platform to make it super straightforward to adapt Selene to different use cases. Just describe your use case in natural language and the platform will auto-generate eval prompts to assess your AI app.
To your point, we trained Selene to be steerable to custom evals. For example, you might want to “make it harsher” so fewer responses receive high scores. Alternatively, you might want to “flip the scores” so that the eval gives high scores to failures rather than successes. Here's a graph from our benchmark testing that shows this (graph attached).
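To make the steering idea concrete, here is a minimal sketch of how a natural-language instruction can be layered onto a base eval rubric. The rubric text and the steer() helper are hypothetical illustrations, not Atla's actual SDK or prompts.

```python
# Illustrative only: BASE_RUBRIC and steer() are hypothetical, not part of
# Atla's SDK. They just show how a natural-language instruction can be
# layered onto a base eval rubric to change a judge's behaviour.

BASE_RUBRIC = (
    "Score the response from 1-5 for factual accuracy. "
    "5 = fully accurate, 1 = mostly inaccurate."
)

def steer(rubric: str, instruction: str) -> str:
    """Append a natural-language steering instruction to an eval rubric."""
    return f"{rubric}\n\nAdditional guidance: {instruction}"

# "Make it harsher": reserve top scores for flawless responses to fight saturation.
harsher = steer(
    BASE_RUBRIC,
    "Only award a 5 if the response is flawless; any unsupported claim caps the score at 3.",
)

# "Flip the scores": high scores now flag failures, e.g. for error monitoring.
flipped = steer(
    BASE_RUBRIC,
    "Invert the scale: 5 means the response contains serious factual errors, "
    "1 means it is fully accurate.",
)

print(harsher)
print(flipped)
```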
This is actually super interesting, and I'll check it out!
Atla
@mia_k1 Thank you! Let us know if you have any questions.
BITHUB
Love the thought and effort behind this. Hope it finds its audience and makes an impact!
Atla
@mohamed_zakarya Thanks for the comment Mohamed! Looking forward to helping more AI dev teams and builders of all shapes. We know good evals are key to measuring and improving AI performance.
Atla
Hey Product Hunt, Kyle here from the Atla team.
We created Selene 1 + the Alignment Platform so AI dev teams can quickly and accurately make informed decisions about system changes, such as updating your base model, your retriever, or your prompts. Your applications are designed to serve real users, and effective evals should represent their preferences.
For those who want to dive straight into the code, we've set up tutorial notebooks covering the most popular use cases we've seen. These run on public datasets with human annotations for demonstration, but feel free to swap in your own data (there's a rough sketch of the scoring flow after this list):
Detecting Hallucinations in a RAG app
1 - 5 Scoring with Ground Truth Responses across Different Metrics (Logical Coherence, Completeness etc.)
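To give a feel for the shape of the second use case (1-5 scoring against a ground-truth reference), here is a minimal sketch. The EvalRequest structure and the commented-out judge_client.evaluate() call are hypothetical placeholders, not the actual SDK; the notebooks and docs show the real interface.

```python
# Minimal sketch of a 1-5 "logical coherence" eval with a ground-truth reference.
# EvalRequest and the commented-out judge_client.evaluate() call are hypothetical
# placeholders, not Atla's actual SDK; the notebooks above show the real interface.

from dataclasses import dataclass, asdict

@dataclass
class EvalRequest:
    model_input: str          # the prompt your app received
    model_output: str         # the response your app produced
    expected_output: str      # human-written ground truth, if you have it
    evaluation_criteria: str  # what the judge should score, and on what scale

request = EvalRequest(
    model_input="Summarise the causes of the 2008 financial crisis.",
    model_output="The crisis was driven by subprime mortgage defaults and excess leverage...",
    expected_output="Key causes: subprime lending, securitisation, high leverage, weak oversight.",
    evaluation_criteria=(
        "Score 1-5 for logical coherence: does the response follow a clear, "
        "internally consistent line of reasoning compared with the reference?"
    ),
)

# A real call would send the request to the judge and get back a score plus critique:
# result = judge_client.evaluate(**asdict(request))   # hypothetical client
# print(result.score, result.critique)
print(asdict(request))
```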
You can also find our full docs here. Happy building!
Fedica
Too many AI tools, outputs, and constant tweaks can be a lot, especially when you're racing to launch. Having precise evaluations without handcrafting endless prompts sounds like a dream. I like the idea of freeing myself up to actually focus on strategy instead of chasing down inconsistencies. Super intrigued!
Selene by Atla
@jonwesselink Thank you Jon! Would be excited to help you with evals at Fedica!
Stripo.email
Reliable AI evaluation is a huge challenge, and Selene 1 looks like a major step forward in making AI performance more measurable and scalable.
Atla
@marianna_tymchuk Thanks for the support, Marianna! Exactly. And if you can't measure it, you can't improve it (stealing that from our CTO).