AI Engineer building RagLeap — excited to join this community 👋
Hey everyone 👋
I'm TC Antony, founder of RagLeap and an AI Engineer based in Chennai, India.
I build production AI systems for a living — RagLeap is a multi-tenant RAG platform we run for businesses, handling document retrieval, agentic workflows, and voice/WhatsApp/email automation.
Outside of that, I build small focused AI tools that solve one specific problem well — things like resume screening, contract risk analysis, RAG chatbot code generation, that kind of thing. I'd rather ship 20 tools that each do one job perfectly than one bloated everything-app.
Just started exploring Product Hunt properly — excited to see what the community is building and get feedback on what I'm working on too.
If you're working on anything in the AI agent / RAG / dev tools space, would love to check it out and connect 🙏
Replies
The 20-focused-tools thing is how I've ended up thinking too — I build an AI research tool and a narrow scope lets you be opinionated about your data sources, where an everything-app has to stay generic and trust whatever the model hands back. The screening ones are interesting to me though: how do you handle eval there? That's where I find 'one job perfectly' gets expensive once you're past the demo. Welcome to PH 👋
@mesut_temizkan Good question — and honestly, eval is the part I underinvested in early and paid for later.
For the screening tools specifically (resume scorer, contract risk analyzer), what's worked:
1. Fixed test sets per tool — I keep 15-20 real examples with known "correct" outputs (e.g. resumes I'd manually score) and re-run them after any prompt change. Cheap, catches regressions fast.
2. Structured output over freeform — forcing JSON schemas with explicit fields (score, reasoning, flagged_items) makes drift way more visible than letting the model write prose. If a field goes missing or the reasoning gets vague, that's a signal before a user even complains.
3. Boundary tests, not just happy-path tests — for the riskier tools (anything insurance/legal-adjacent) I specifically test adversarial inputs, people trying to get the model to overstep scope. That's actually been harder than accuracy eval.
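A minimal sketch of how points 1 and 2 could fit together, not the actual RagLeap code. The field names (score, reasoning, flagged_items) come from the list above; everything else is an assumption made for illustration: `validate_output`, `run_regression`, the `score_fn` callable standing in for the model call, the score tolerance, and the five-word "vague reasoning" heuristic.

```python
import json

# Field names follow the post; the expected types are an assumption.
REQUIRED_FIELDS = {"score": (int, float), "reasoning": str, "flagged_items": list}


def validate_output(raw: str) -> list[str]:
    """Return a list of schema problems for one model response (empty means OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}")
    # Assumed heuristic: reasoning under five words is treated as "vague".
    if isinstance(data.get("reasoning"), str) and len(data["reasoning"].split()) < 5:
        problems.append("reasoning too short")
    return problems


def run_regression(cases, score_fn, tolerance=1.0):
    """Re-run every fixed case and collect (case_id, problems) for failures.

    cases: list of dicts with 'id', 'input', 'expected_score' (hand-scored).
    score_fn: hypothetical callable, input text -> raw model output string.
    """
    failures = []
    for case in cases:
        raw = score_fn(case["input"])
        problems = validate_output(raw)
        if not problems:
            score = json.loads(raw)["score"]
            if abs(score - case["expected_score"]) > tolerance:
                problems.append(f"score {score} vs expected {case['expected_score']}")
        if problems:
            failures.append((case["id"], problems))
    return failures
```

Run after every prompt change, an empty list from `run_regression` means none of the 15-20 fixed examples drifted. Adversarial inputs from point 3 could be added as extra cases, though checking that the model stayed in scope would need its own assertion beyond a score comparison.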
What I haven't solved well: eval for the more generative tools (RAG chatbot code gen, content writers) where there's no single "correct" answer. Still mostly manual spot-checks there. Curious how you're handling eval for your research tool — sounds like a similar problem from the other direction.
@tc_antony Honestly closer to your screening tools than the generative ones — a lot of what I pull has a real ground truth (funding, founders, dates), so I keep a set of hand-verified companies and re-run extraction against them after any change. The twist for me: regressions hide in the data, not the code — a tweak that fixes one company quietly breaks five others. So before shipping any extraction/QC change I replay it against ~100 past reports and diff the output. The narrative layer though, same as your content gen — still manual spot-checks. No clean answer there either.
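For anyone following along, a rough sketch of what a replay-and-diff step like this might look like. This is not the pipeline described above, just one way to do it under stated assumptions: each past report is stored with its raw input and the last accepted structured output, and `extract_fn`, `diff_record`, and `replay_and_diff` are hypothetical names.

```python
def diff_record(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that changed."""
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed


def replay_and_diff(baseline: dict, extract_fn) -> dict:
    """Re-run extraction on stored inputs and report per-record changes.

    baseline: {record_id: {"input": raw_text, "output": last_accepted_dict}}
    extract_fn: the candidate extraction function (hypothetical), text -> dict.
    Records whose output is unchanged are left out of the report.
    """
    report = {}
    for record_id, entry in baseline.items():
        changed = diff_record(entry["output"], extract_fn(entry["input"]))
        if changed:
            report[record_id] = changed
    return report
```

The report only lists records that changed, so a tweak meant to fix one company shows up as one expected entry, and any other entries are the quiet breakages worth reviewing before shipping.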
@mesut_temizkan The diff-against-100-past-reports approach is smart: it's essentially a regression suite built from your own production data, which is far more meaningful than synthetic test cases.
The "regressions hide in the data, not the code" point is exactly right and underappreciated. I've hit the same thing: a prompt tweak that improves P/E ratio extraction quietly makes the risk score reasoning worse for manufacturing stocks specifically. You only catch it if you're diffing structured output across a broad sample, not just eyeballing a few examples.
The narrative layer being manual for both of us probably means there's no clean solution yet, or that the solution is LLM-as-judge, which has its own reliability problems for subjective output.
What does your replay pipeline look like technically? Are you storing the raw extracted output per run and diffing that, or diffing at the final report level?