Hello everyone,

Over the last year, we have been working on a stealth startup to enable automated testing for LLM-based applications. I am excited to announce that the beta version is available for testing at EvalMy.AI, and I would love to hear your feedback.

As the popularity of LLMs and RAG has skyrocketed, I’ve frequently found myself helping customers use the technology to unlock value from internal documents, contracts, policies, etc. One recurring challenge was testing: our approach involved having domain experts validate whether the model's answers were correct, and we had to repeat that validation for every change in the model, architecture, or data. Manual testing is expensive, and people get frustrated rather quickly.

EvalMy.AI defines a balanced qualitative metric, the C3-score, that expresses whether the AI's answer is semantically equivalent to the expert answer. This automates the verification of the model. The metric consists of three key components: correctness, completeness, and contradiction, helping you easily identify where the AI falls short.

EvalMy.AI is a simple service, easy to integrate into anyone’s development lifecycle, and configurable for experts who do not like the default behavior. I’m especially proud of how accurate the tool is when semantically comparing answers.

Our first users were excited about how the tool reduces friction and speeds up testing, so we decided to open the service to the public for beta testing and get more feedback.

If you want to try it, just go to evalmy.ai. If you have questions, ask here. Looking forward to your feedback.

EvalMy.AI - Automated AI-answer verification
