

Empromptu AI






















Instrumenting production app usage as a fine-tuning data source is genuinely clever. You avoid the cold start problem of manually curating datasets that don't reflect real user behavior. We hit that exact wall building our AI features and ended up with synthetic data that didn't generalize well. What does your quality filtering pipeline look like between raw app interactions and the training checkpoint?
@retain_dev Actually thats kinda the best part if I do say so myself Dr Sean Robinson my cofounder has a PhD is computational astrophysics and he invented a way to get up to 98% accurate outputs out of any model! Its built in to happen completely automatically based on the eval that you write and that last mile is what you label. Thanks for the complement. We know the problem is unless youre a founder, the smes and ai/ml eng are usually separate so you never really get that perfect dataset.
@retain_dev To add the engineering layer to what Shanea described, the quality filtering is what makes the self-improving loop actually work in practice rather than in theory.
The eval you write upfront becomes the ground truth signal. Every production interaction gets scored against it automatically. What surfaces for labeling isn't a random sample of outputs, it's specifically the cases that fell outside your accuracy threshold. You're not reviewing everything, you're reviewing the exact delta between what the model did and what your domain requires.
That's why the labeled dataset stays small and high signal over time. The model improves, the eval catches the new edge cases, and you're always training on the real distribution of your own users rather than synthetic approximations.
The part that resonates with what you described about SMEs and ML eng being separate is exactly the failure mode we designed around. The eval layer is built to be owned by the domain expert, not the ML team. The person who knows what correct looks like is the one defining the signal, not translating it through an eng team.
@retain_dev @sean_robinson1 Yes. Second what Sean said. We think that this is truly overlooked in the tools out there today.
@retain_dev Appreciate that you're sharing this insight!
We train on raw app interactions, actually. We find that the real-world edge case training often provides the most unique and distinguishing insights that help us to effectively compress training timelines. We often will say "your model is unique because of these edge cases," whereas most people generally find them intrusive or something they should try to mitigate.
The reality is quite the opposite.
@shanea_leven Congratulations on the launch.
One thing I’m trying to understand from your positioning. If the underlying model providers keep improving rapidly every few months, how do you measure whether the gains your customers see are actually coming from Empromptu’s learning layer versus improvements in the foundation model itself?
It seems like that’s a pretty important distinction because both could lead to better outputs over time, but only one creates a real competitive advantage for the customer. Are you able to quantify that difference in a meaningful way?
@shanea_leven @moh_codokiai yes absolutely you will actually see the performance and accuracy improvements directly in the product. And frontier models have deprecated the ability for you to fine tune their models any more. Also a model trained on your data for your product and your users is always going to be eventually more accurate than a general model.
@shanea_leven @shanealeven That makes sense. I agree a model trained on a company’s own users should eventually outperform a general-purpose model for that specific use case.
What I’m curious about is how you prove that improvement to customers. Do you have any benchmarking or evaluation framework that shows accuracy before and after the learning process, or is the validation mainly based on production outcomes and user feedback?
@shanea_leven @moh_codokiai yes! It's literally built into the product and you can see in our optimizer the performance. It's actually one of the things that my co-founder invented. Our optimizer is what our entire platform is based on. We have benchmarks that models trained on our platform are 30% more accurate than frontier models
@shanea_leven @moh_codokiai First, we're model agnostic -- the user can specify whatever they'd like to use, and we'll adapt the 'baseline' accordingly; the difference is often, for our users, the other efficiencies, such as the training, vertical integrations and other optimization components that, combined, make an enormous difference in both the quality of life they experience while building (setting up actual databases, auth sequencing, et cetera, with real-world best practices), and doing all of the AI-focused functionality from a template-based system that let's me say something like: "I want to build a growth function, and that's going to be me following up with leads we haven't talked to in more than 3 weeks, and I want that sort of outreach to look like this."
I haven't met a business leader yet who wants to replace someone on their team with a system that does that at 98% accuracy. And that's the difference they walk away remembering.
the feedback loop approach is smart. the part that usually trips teams up isnt the training pipeline though, its the quality of the corrections feeding it. if the humans correcting the AI output dont have a systematic way to evaluate whats actually wrong you end up fine-tuning on noise. curious how you handle that signal quality problem
@ozandag The correction itself is not the signal, the correction scored against a defined expected outcome is the signal. Without that layer you are just fine tuning on whoever had an opinion that day.
The eval is defined upfront by the SMEs who know what correct looks like. Every correction gets scored against it before it touches training, filtered if it doesn't meet the bar, flagged for review if it's ambiguous. When experts disagree the eval arbitrates. You never train on conflicting signal, only verified ground truth.
@ozandag +1 to what Sean said. We allow the user to first define what good looks like. and we actually remove the hard parts so they can define it in natural language. Often with a single statement. No configs or files so anyone can do it. Then we measure accuracy towards that goal.
@ozandag The pipeline's the easy half. We score corrections before they hit fine-tuning, so a sloppy "this is wrong" doesn't weigh the same as a structured one. Bad corrections are their own failure mode and most teams don't instrument for it. What surfaced this for you?
I’m Shanea, co-founder and CEO of @Empromptu AI
We built Empromptu AI's Alchemy because we believe the next phase of AI is not just building apps faster.
It is building AI that learns how your business works.
Right now, everyone is rushing to learn or add AI whether you are someone trying to figure out how to survive as an employee or take your expertise and monetize it. Everyone is plugging into the same frontier models, shipping the same generic workflows, and calling it a moat. But if everyone is using the same intelligence, no one is differentiated for long.
We're changing that.
With Empromptu AI's Alchemy you can fine tune a model with no ml expertise simply by building an AI applications. Our platform automatically captures customer usage, your corrections as a subject matter expert, edge cases, and application feedback. Alchemy turns those signals into a fine tune model that can keep itself up to date. Yes a self learning, self improving AI that doesn't cost trillions.
The simple version:
You've spent the last 5-10 years at your job learning really valuable skills whether its engineering, content, or more insane a highly regulated or specialized field.
Now your AI can learn from you so you can own your expertise.
This matters because the best knowledge usually lives inside people’s heads. The accountant knows the exception. The support lead knows when an escalation is real. The operator knows the edge case. The product team knows what “good” actually looks like.
Alchemy gives everyone a way to turn that expertise into AI that gets self improves with up to 98% accuracy.
Thank you for checking us out today. I’d love your feedback, questions, and brutal honesty.
@shanealeven Congrats on the launch team. How do you stop bad user feedback, noisy corrections, or outdated domain assumptions from being absorbed into the tuning loop
@zolani_matebese first the user can set the eval and we can automatically optimize towards that goal and as a back you can actually go in and correct the training set manually with natural language
@lakshminath_dondeti you can only fine tune oss models. The frontier models are deprecating the ability for you to fine tune them. 😬 Which is one of the reasons we're making this available. But Empromptu is totally model agnostic
@lakshminath_dondeti To add some architecture context, the fine tuned model you build on Alchemy is your asset regardless of what happens upstream with frontier models. When a new frontier model drops, you're not starting over. Your domain knowledge, your corrections, your edge cases are portable. We can use a newer base and retrain on your existing data. The expertise your team captured doesn't deprecate with the model version.
@lakshminath_dondeti Your built up dataset that you fine tune on is the real value that builds up over the use of your app. Alchemy let's you turn that into a fine tuned model, but when a new model releases you can quickly fine tune it on your existing data.
@lakshminath_dondeti We want to remove complexity and enhance outcomes from building with AI, sort of universally. What we would suggest is that because what you have is a dataset created by the 'shape' of the AI function and its trained outcomes, we can take that dataset and re-train it using the new model at an accelerated rate because the judgment exhibited in training is what creates the signal that matters, not the foundation model provider.
Optimizing our platform in this way helps us to refocus the conversation on "which model" to "what's the best/most optimal outcome" of the AI function, and to engineer the solution/function set accordingly!
@boyuan_deng1 The user absolutely gets visibility into what runs. You can also do this manually as well for all the tinkerers out there.
@boyuan_deng1 And the visibility is intentional from an architecture standpoint. A model you can't inspect is one you can't trust in production. You should always know what signal drove a change before it goes into training.
@boyuan_deng1 Our users always know what's happening and why -- we have evaluations and audits built in from the beginning, not added in as an afterthought.
I like the part abt capturing corrections and edge cases from real usage. That feels more useful than trying to guess everything upfront. One thing I wonder, how do you keep the model from leaaning the wrong patterns when user feedback is inconsistent or when diff experts correct the same situation in diff ways?
@busra_seker1 that's a great question. There is one ground truth so SMEs can override customers but SMEs have to agree what is ground truth
@busra_seker1 Exactly right on the ground truth architecture. On the inconsistency problem specifically, that's where the eval becomes the arbitration layer. Conflicting corrections don't both make it through, the eval scores against a defined expected outcome so noise and contradictions get filtered before they touch training. The model learns from signal that passed a quality bar, not raw feedback volume.
@busra_seker1 @sean_robinson1 Evals are the most important thing and yet some tools make evals really inefficient for people learning this technology to access. and take advantage of it's true power.
@busra_seker1 I think that what you're talking about is like the '7 out of 10 dentists' sort of conjecturing, and I think it's important to note that if policy is to enable flexibility from like a healthcare provider in making recommendations based on particular signals, it would seem only appropriate that the way the governance works would be similar, sort in a tightly-banded range of outcome likelihoods.
We're focused on optimizing toward outcomes, and find that helping people focus on results instead of the process itself helps to create alignment in the directionality of training outcomes. I hope that makes sense! If you'd like to discuss more, we're available for meetings to talk about your specific use case or scenario.


Empromptu AI
Yes! Thank you for acknowledging this!