LLM Eval Suite - Structured evaluation of Apple Foundation Models on macOS
Structured scoring, leaderboard tracking, repeatable evals, and bring-your-own judge providers. Evaluate Apple Intelligence Foundation Models on macOS.
Hey everyone — I’m a solo iOS/macOS dev, and I recently built a macOS app called LLM Eval Suite.
It came out of a problem I kept running into while building AI features in my own apps.
I’d tweak a prompt or change generation settings, run the same examples again, and think:
“Okay… this seems better?”
But I didn’t have a good way to compare versions or see what actually improved.
Sometimes the output was cleaner, but less complete.
Sometimes it was more detailed, but added things it shouldn’t.
Sometimes I just liked the wording more, which isn’t really enough to ship a change confidently.
So I built LLM Eval Suite as a native macOS app to help compare prompt/config variants, review outputs side by side, and score them with custom judges and scoring guides.
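To make the "custom judges and scoring guides" idea concrete, here's a minimal sketch of how a rubric-based comparison between two prompt variants might look. This is purely illustrative Swift — the types, names, and 1–5 scale are my assumptions, not LLM Eval Suite's actual API:

```swift
import Foundation

// Hypothetical scoring guide: a set of criteria and an allowed score range.
// (Names and scale are illustrative, not the app's real data model.)
struct ScoringGuide {
    let criteria: [String]          // e.g. "completeness", "faithfulness"
    let scale: ClosedRange<Int>     // e.g. 1...5
}

// A judge's scores for one prompt/config variant.
struct JudgeScore {
    let variant: String             // variant label, e.g. "prompt-v2"
    let scores: [String: Int]       // criterion -> score

    func total() -> Int { scores.values.reduce(0, +) }
}

// Check a score sheet covers exactly the guide's criteria, within range.
func isValid(_ s: JudgeScore, under guide: ScoringGuide) -> Bool {
    Set(s.scores.keys) == Set(guide.criteria)
        && s.scores.values.allSatisfy { guide.scale.contains($0) }
}

// Pick the better of two variants by total rubric score.
func better(_ a: JudgeScore, _ b: JudgeScore) -> JudgeScore {
    a.total() >= b.total() ? a : b
}

let guide = ScoringGuide(criteria: ["completeness", "faithfulness"], scale: 1...5)
let v1 = JudgeScore(variant: "prompt-v1", scores: ["completeness": 4, "faithfulness": 3])
let v2 = JudgeScore(variant: "prompt-v2", scores: ["completeness": 3, "faithfulness": 5])
print(better(v1, v2).variant)   // prints "prompt-v2"
```

Scoring per criterion rather than asking a judge "which is better?" is what makes trade-offs like "cleaner but less complete" visible instead of collapsing into a single gut feeling.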
I recently used it on another app I’m building, AI Doctor Notes, to improve a doctor visit summary feature. I wrote up the workflow here:
https://medium.com/@dreamlab.sol...