Evidently AI

Collaborative AI observability platform
Evidently helps you evaluate, test, and monitor your AI-powered products, from ML-based classifiers to LLM chatbots and agents. It is built on top of the leading open-source library with over 20 million downloads: https://github.com/evidentlyai/evidently
This is the 2nd launch from Evidently AI.
Open-source evaluations and observability for LLM apps
Evidently is an open-source framework to evaluate, test and monitor AI-powered apps.

📚 100+ built-in checks, from classification to RAG.
🚦 Both offline evals and live monitoring.
🛠 Easily add custom metrics and LLM judges.
Elena Samuylova
Hi Makers! I'm Elena, a co-founder of Evidently AI. I'm excited to share that our open-source Evidently library is stepping into the world of LLMs! 🚀

Three years ago, we started with testing and monitoring for what's now called "traditional" ML: think classification, regression, ranking, and recommendation systems. With over 20 million downloads, we're now bringing our toolset to help evaluate and test LLM-powered products.

As you build an LLM-powered app or feature, figuring out if it's "good enough" can be tricky. Evaluating generative AI is different from traditional software and predictive ML: it lacks clear criteria and labeled answers, making quality more subjective and harder to measure. But there is no way around it: to deploy an AI app to production, you need a way to evaluate it. For instance, you might ask:
- How does the quality compare if I switch from GPT to Claude?
- What will change if I tweak a prompt? Do my previous good answers hold?
- Where is it failing?
- What real-world quality are users experiencing?

It's not just about metrics: it's about the whole quality workflow. You need to define what "good" means for your app, set up offline tests, and monitor live quality. With Evidently, we provide the complete open-source infrastructure to build and manage these evaluation workflows. Here's what you can do:
📚 Pick from a library of metrics or configure custom LLM judges
📊 Get interactive summary reports or export raw evaluation scores
🚦 Run test suites for regression testing
📈 Deploy a self-hosted monitoring dashboard
⚙️ Integrate it with any adjacent tools and frameworks

It's open-source under an Apache 2.0 license, and we build it together with the community: I would love to learn how you address this problem, and any feedback and feature requests are welcome.

Check it out on GitHub: https://github.com/evidentlyai/e..., get started in the docs: http://docs.evidentlyai.com, or join our Discord to chat: https://discord.gg/xZjKRaNp8b.
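For reference, a minimal sketch of what such an offline evaluation run could look like with the open-source library. The import paths, class names, and parameters below are assumptions based on the description above, not a guaranteed match to the current API; see http://docs.evidentlyai.com for the real interface.

```python
# Minimal sketch of an offline evaluation run. Names below are illustrative
# assumptions; consult the Evidently docs for the current API.
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import Sentiment, TextLength

# A small evaluation dataset: inputs the app received and the answers it gave.
eval_data = pd.DataFrame(
    {
        "question": [
            "How do I reset my password?",
            "What is your refund policy?",
        ],
        "answer": [
            "Go to Settings and choose 'Reset password'.",
            "We offer refunds within 30 days of purchase.",
        ],
    }
)

# Score every answer with built-in checks and render an interactive summary report.
report = Report(
    metrics=[TextEvals(column_name="answer", descriptors=[Sentiment(), TextLength()])]
)
report.run(
    reference_data=None,
    current_data=eval_data,
    column_mapping=ColumnMapping(text_features=["question", "answer"]),
)
report.save_html("eval_report.html")
```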
@elenasamuylova Congrats on bringing your idea to life! Wishing you a smooth and prosperous journey. How can we best support you?
Elena Samuylova
@kjosephabraham Thanks for the support! We always appreciate any feedback and help in spreading the word. As an open-source tool, it is built together with the community! 🚀
Emeli Dral
Hi everyone! I am Emeli, one of the co-founders of Evidently AI. I'm thrilled to share what we've been working on lately with our open-source Python library. I want to highlight a specific new feature of this launch: LLM judge templates.

LLM-as-a-judge is a popular evaluation method where you use an external LLM to review and score the outputs of LLMs. However, one thing we learned is that no LLM app is alike: your quality criteria are unique to your use case. Even something seemingly generic like "sentiment" will mean something different each time. While we do have templates (it's always great to have a place to start), our primary goal is to make it easy to create custom LLM-powered evaluations. Here is how it works:
🏆 Define your grading criteria in plain English. Specify what matters to you, whether it's conciseness, clarity, relevance, or creativity.
💬 Pick a template. Pass your criteria to an Evidently template, and we'll generate a complete evaluation prompt for you, including formatting the output as JSON and asking the LLM to explain its scores.
▶️ Run evals. Apply these evaluations to your datasets or recent traces from your app.
📊 Get results. Once you set up a metric, you can use it across the Evidently framework: generate visual reports, run conditional test suites, and track metrics over time on a dashboard.

You can track any metric you like, from hallucinations to how well your chatbot follows your brand guidelines. We plan to expand on this feature, making it easier to add examples to your prompt and adding more templates, such as pairwise comparisons. Let us know what you think!

To check it out, visit our GitHub: https://github.com/evidentlyai/e..., the docs: http://docs.evidentlyai.com, or our Discord to chat: https://discord.gg/xZjKRaNp8b.
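To make that flow concrete, here is a hedged sketch of defining a custom judge from plain-English criteria. The class and parameter names (BinaryClassificationPromptTemplate, LLMEval, provider, model) are assumptions drawn from this description rather than a confirmed rendering of the library API.

```python
# Sketch of a custom LLM judge built from plain-English criteria.
# Class and parameter names are illustrative assumptions; check the docs.
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# 1. Grading criteria written in plain English.
conciseness = BinaryClassificationPromptTemplate(
    criteria="A CONCISE answer addresses the question directly, without unnecessary detail.",
    target_category="CONCISE",
    non_target_category="VERBOSE",
    include_reasoning=True,  # ask the judge to explain each score
)

# 2. Wrap the template as a descriptor and point it at an evaluator LLM.
conciseness_judge = LLMEval(
    template=conciseness,
    provider="openai",
    model="gpt-4o-mini",
    display_name="Conciseness",
)

# 3. The resulting descriptor can then run inside a Report or Test Suite,
#    alongside the built-in checks from the earlier sketch.
```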
Emeli Dral
@hamza_afzal_butt Thank you so much!
Prof Rod
Congratulations on the launch, Evidently team! I've always admired Evidently for its comprehensive, all-encompassing framework. I often work with teams who are unsure about what metrics to focus on or how to begin their evaluation process. For those new or unsure where to start:
* What best practices would you recommend?
* Is there a feature that helps beginners 'set things on autopilot' while they're learning the ropes?
* Do you offer any guided workflows or templates for common use cases that could help newcomers get started quickly?
Thanks for your continued innovations in this space!
Elena Samuylova
@rorcde Thanks for the support! 🙏🏻

Quickstart: We have a simple example here: https://docs.evidentlyai.com/get.... It will literally take a couple of minutes! We packaged some popular evaluations as presets and general metrics (like detecting denials). However, we generally encourage using your own custom criteria: no LLM app is exactly alike, and the beauty of using LLM as a judge is that you can use your own definitions. We made it super easy to define a custom prompt just by writing your criteria in plain English.

Best practices: That's a huuuge question. Let me try to summarize a few of them:
- Don't skip the evals! Implementing evals can sound complex, so it's tempting to "ship on vibes". But it's much easier to start with a simple evaluation pipeline that you iterate on than to try adding evals to your process later on. So, start simple.
- Make curating an evaluation dataset a part of your process. When it comes to offline evals, the metrics are as important as the data you run them on. Preparing a set of representative, realistic inputs (and, ideally, approved outputs) is a high-value activity that should be part of the process.
- Log everything. On that note, don't miss out on capturing real traces of user conversations. You can then use them for testing, to replay new prompts against them, and so on.
- Start with regression testing. This is the low-hanging fruit in evals: every time you change a prompt, re-generate outputs for a set of representative inputs and see what changed (or have peace of mind that nothing did). This is hugely important for the speed of iteration (see the sketch below).
- If you use LLM as a judge, start with binary criteria and measure the quality of your judge. It's also easier to test alignment this way.
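A minimal sketch of that regression-testing loop, assuming a hypothetical generate_answer function standing in for the app and a hypothetical score_answer function standing in for the evaluation of choice (for example, an LLM judge):

```python
# Sketch of a prompt-change regression test: re-generate answers for a fixed set
# of representative inputs and compare scores against the previous baseline.
# generate_answer and score_answer are hypothetical placeholders.
import pandas as pd


def generate_answer(question: str, prompt_version: str) -> str:
    """Placeholder: call your LLM app with the given prompt version."""
    return f"[{prompt_version}] answer to: {question}"


def score_answer(question: str, answer: str) -> float:
    """Placeholder: return a quality score for one answer."""
    return 1.0


# Representative inputs and the scores their answers got before the prompt change.
reference = pd.DataFrame(
    {
        "question": ["How do I reset my password?", "What is your refund policy?"],
        "baseline_score": [1.0, 1.0],
    }
)

rows = []
for _, row in reference.iterrows():
    new_answer = generate_answer(row["question"], prompt_version="v2")
    rows.append(
        {
            "question": row["question"],
            "baseline_score": row["baseline_score"],
            "new_score": score_answer(row["question"], new_answer),
        }
    )

results = pd.DataFrame(rows)
regressions = results[results["new_score"] < results["baseline_score"]]
print(f"{len(regressions)} of {len(results)} answers regressed after the prompt change")
```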
Kyrylo Silin
Hey Elena and Emeli,

How does Evidently AI handle potential biases in the LLM judges themselves? Do you have any plans to incorporate human feedback loops into the evaluation process?

Congrats on the launch!
Emeli Dral
Hey @kyrylosilin, thank you for bringing this up!

We design judge templates based on a binary classification model, where we thoroughly define the classification criteria and strictly structure the response format. Users also have the option to choose how to handle cases with uncertainty, whether it's by refusing to respond, detecting an issue, or deciding not to detect an issue. This approach is already implemented and helps achieve more consistent outcomes.

In the next update, we plan to add the ability to further iterate on the judges using examples of classifications they have made in previous iterations. This will help address potential biases. Users will be able to select complex cases (and even fix wrong labels if needed) and explicitly pass these examples to the judge, which will, over several iterations, improve accuracy and consistency for specific cases.
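To make the uncertainty-handling choice concrete, here is a small hedged sketch extending the earlier judge example. The parameter name and its literal values are assumptions paraphrasing the three options described above, not a confirmed part of the library API.

```python
# Illustrative only: the three uncertainty-handling choices expressed as a single
# judge parameter. Names and values are assumptions, not the real API.
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

safety = BinaryClassificationPromptTemplate(
    criteria="UNSAFE answers contain instructions that could cause harm.",
    target_category="UNSAFE",
    non_target_category="SAFE",
    # When the judge cannot decide:
    #   "unknown"    -> refuse to respond and return a separate label
    #   "target"     -> err on the side of detecting an issue
    #   "non_target" -> err on the side of not detecting an issue
    uncertainty="unknown",
)
```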
Dima Demirkylych
I have to give major respect to the team at Evidently AI for their outstanding open-source product. The introduction of evaluation for LLM apps is a game-changer. It's incredibly easy to integrate into my product, and the monitoring capabilities are top-notch. What I love most is that they provide ready-to-use tests, which are easily customizable. Kudos to Evidently AI for making such a valuable tool available to the community!
Elena Samuylova
@dima_dem Thanks for the support! 🙏🏻
Dini Aminarti
Incredible launch! Can't wait to try this out! And congrats on the launch @elenasamuylova 🚀
Elena Samuylova
@andinii Thank you! Let us know how it works out for you 🎉
Giuseppe Della Corte
@elenasamuylova and @emeli_dral great that you are building an open-source tool in the LLM evaluation space. Congrats!
Elena Samuylova
@gdc Thank you! Let us know if you have the chance to try it. We appreciate all feedback!