Francisco Mendoza

LLM Eval Suite - Structured evaluation of Apple Foundation Models on macOS

Structured scoring, leaderboard tracking, repeatable evals, and bring-your-own judge providers. Evaluate Apple Intelligence Foundation Models on macOS.

Hey everyone — I’m a solo iOS/macOS dev, and I recently built a macOS app called LLM Eval Suite. It came out of a problem I kept running into while building AI features in my own apps.

I’d tweak a prompt or change generation settings, run the same examples again, and think: “Okay… this seems better?” But I didn’t have a good way to compare versions or see what actually improved. Sometimes the output was cleaner, but less complete. Sometimes it was more detailed, but added things it shouldn’t. Sometimes I just liked the wording more, which isn’t really enough to ship a change confidently.

So I built LLM Eval Suite as a native macOS app to help compare prompt/config variants, review outputs side by side, and score them with custom judges and scoring guides.

I recently used it on another app I’m building, AI Doctor Notes, to improve a doctor visit summary feature. I wrote up the workflow here: https://medium.com/@dreamlab.sol...