Launching today

Tessl
Optimize agent skills, ship 3× better code.
209 followers
Tessl helps developers evaluate and optimize agent skills, so you focus on building with smarter AI agents instead of fixing bugs and hallucinations - no signup required ➡️ tessl.io/registry/skills/submit

Hey Product Hunt! 👋
Guypo here, founder of Tessl (previously founded Snyk).
Today, I’m excited to announce that you can evaluate your skills and optimize them on Tessl. This means you can stop debugging agent output and start shipping quality code, faster: https://tessl.io/registry/skills/submit
Agent skills help agents use your products, build in your codebase and enforce your policies.
They're the new unit of software for devs - but most are still treated like simple Markdown files copied between repos, with no versioning, no quality signal, and no updates.
Without AI evaluations, you can't tell if a skill helps, provides minimal uplift, or even degrades functionality. You spend your time course-correcting agents instead of shipping.
Tessl is a development platform and package manager for agent skills. With Tessl, we were able to evaluate and optimize ElevenLabs' skills, 2x'ing their agent success in using their APIs.
Whether you're building a personal project, maintaining an OSS library, or developing with AI at work, you can now evaluate your skill and optimize it to help agents use it properly.
What skills are you working on, and what's your use case for them?
Humans in the Loop
I'm an absolute fan of @guypod and the @Tessl team.
They're pioneers in the AI industry and active contributors, maintaining AINativeDev and organizing AI Native DevCon. So when the team reached out for this launch, I was super pumped.
@Tessl is a package manager for agent skills. It helps you find, install, and evaluate capabilities for your coding agents. It's the right direction. In a recent thread, [1] we discussed best practices to get the most out of @Claude Code. Above all? Run more agents in parallel. @Tessl teaches them coding best practices, raising the quality of the outputs.
The timing is perfect.
Go to tessl.io/registry/skills/submit and start shipping better, secure code at scale.
S/O to @guypod and team, keep up the inspiring work 👏👏
[1]: How many Claude Codes do you run in parallel?
Tessl
@fmerian incredible writeup - thank you for hunting us and for framing it so well.
The parallel-agents point is a great angle - running multiple Claude Code instances is becoming the norm for serious teams, but the quality bottleneck shifts fast when you scale agents horizontally.
That's exactly where skills and evals become essential - one poorly written skill degrades output across every parallel session. With evals and optimization, teams can skip the serious time lost to debugging bugs, hallucinations, and API misuse, and focus on shipping quality code.
Appreciate the support from day one! 🧡
The eval-driven approach makes sense. Most teams copy skill files across projects and hope they still work after a model update - there's no feedback loop telling you the context degraded. Having structured evals that catch regression before it hits production is the missing piece.
Curious about the version compatibility matrix. When a new model version drops (say Claude Opus to Sonnet), how granular is the eval detection? Does it flag per-skill degradation or just overall task completion changes? The 1.8-2X performance numbers are compelling but I'd want to know which skills contributed most vs which ones were noise.
Tessl
@zzunkie Excellent question. Whenever a new model drops, we rerun our skill evaluations, which lets us flag per-skill regressions across every scenario. The task evaluations for the content-strategy skill (https://tessl.io/registry/skills/github/coreyhaines31/marketingskills/content-strategy/evals) show how we can clearly measure the uplift - or lack of it - from adding extra context. It's also useful when a skill doesn't help much: users can see they're better off running without it for that particular task.
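To make the per-skill regression idea concrete, here's a minimal sketch of what flagging degradation between model versions could look like. All skill names, scores, and the drop threshold are hypothetical, not Tessl's actual API:

```python
# Hypothetical sketch: flag per-skill regressions when a new model version ships.
# Skill names, scores, and the 0.05 threshold below are illustrative only.

def flag_regressions(baseline: dict, candidate: dict, threshold: float = 0.05) -> dict:
    """Compare per-skill eval pass rates between two model versions.

    baseline/candidate map skill name -> pass rate in [0, 1].
    Returns the skills whose pass rate dropped by more than `threshold`.
    """
    regressions = {}
    for skill, old_score in baseline.items():
        delta = candidate.get(skill, 0.0) - old_score
        if delta < -threshold:
            regressions[skill] = round(delta, 3)
    return regressions

# Example: rerun the same eval suite after a model swap, compare per skill.
opus_scores = {"content-strategy": 0.82, "api-usage": 0.91}
sonnet_scores = {"content-strategy": 0.84, "api-usage": 0.71}
print(flag_regressions(opus_scores, sonnet_scores))
# {'api-usage': -0.2}  -> only the degraded skill is flagged, not overall completion
```

The point of the per-skill granularity is exactly the noise question above: an aggregate task-completion delta can hide one skill regressing while another improves.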
Skillkit
Tessl
@rohit_ghumare Great to hear! What’s one thing you wish the eval workflow did better - debugging failures, comparing versions, etc? We’re iterating fast based on comments like yours. :)
Skillkit
Tessl
@rohit_ghumare Hey Rohit! I ran all your skills through the Tessl review machinery and sent you a pull request.
Tessl
Excellent point, @rohit_ghumare - you can find the recommendations directly in each skill.
As for improvements, head over to "Optimize this skill"! @_popey_ already used it to improve your skill, and you can merge the changes into your repo.
I see your `insights` skill is at ~80% performance - give it a go and let me know if you hit any issues.
Agent evaluation is the part of the AI workflow that still feels unsolved — deterministic tests don't translate well when your output is non-deterministic by design. Curious how Tessl approaches defining "skill" for an agent: is it task completion rate, output quality scoring, or something closer to behavioral alignment? The 3x better code claim is a big statement, but if the eval layer is solid, the compounding effect on code quality could absolutely get there.
Tessl
@giammbo great question - we approach it from two angles today:
Skill reviews - this is what you see when you first submit a skill. It scores the skill against structure and best-practice criteria established by Anthropic, combining validation checks with LLM-judged quality on implementation and activation. Think of it as "is this skill well-constructed?" before you even run it.
Task-based evaluations - scenario-based evals where you generate or hand-write scenarios, run end-to-end tasks, and track results. This gets closer to what you're describing around task completion and output quality.
Both use LLM-as-a-judge, which we think is the right fit for non-deterministic outputs - but we know that comes with its own tradeoffs around consistency and edge cases. We're working on new approaches and will share more in the coming weeks.
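For readers wondering what "scenario-based task evals with LLM-as-a-judge" means mechanically, here's a minimal sketch. The judge and agent are injected as callables so any model can back them; the names, the 1-5 rubric, and the toy stand-ins are illustrative assumptions, not Tessl's actual schema:

```python
# Hypothetical sketch of scenario-based task evals scored by an LLM judge.
# All names and the 1-5 scoring rubric are illustrative, not a real API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scenario:
    prompt: str    # the end-to-end task given to the agent
    criteria: str  # what a successful output looks like, for the judge

def run_task_evals(scenarios: List[Scenario],
                   run_agent: Callable[[str], str],
                   judge: Callable[[str, str], int],
                   passing: int = 4) -> list:
    """Run each scenario end-to-end and let the judge score the output 1-5."""
    results = []
    for s in scenarios:
        output = run_agent(s.prompt)        # agent runs with the skill loaded
        score = judge(output, s.criteria)   # LLM-as-a-judge in the real system
        results.append({"prompt": s.prompt, "score": score,
                        "passed": score >= passing})
    return results

# Toy stand-ins so the sketch runs without a real model behind it.
fake_agent = lambda prompt: "draft content plan"
fake_judge = lambda output, criteria: 5 if "plan" in output else 2
print(run_task_evals([Scenario("Outline a content strategy", "includes a plan")],
                     fake_agent, fake_judge))
```

Running the same scenarios with and without the skill loaded is what yields the uplift (or degradation) signal discussed above.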
Curious though - when you think about "behavioral alignment" as a measure, what does that look like for you? Wondering if there's a gap between what we're evaluating today and what you actually need to trust your skills.
@baptiste_fernandez1 The LLM-as-a-judge tradeoff acknowledgment is the honest part. For behavioral alignment I mean constraint adherence on the execution path — not just "did it complete" but "did it stay in scope, avoid side effects, use the right tools in order". That gap between outcome and execution path is where I'd say the high-trust bar lives.
How are you validating real user behavior at Tessl right now?
Tessl
@danilpond Two evaluation methods today.
First, skill reviews - when you submit a skill, it gets scored against structure and best practice criteria established by Anthropic, combining validation checks with LLM-judged quality. This tells you immediately whether your skill is well-constructed.
Second, task-based evaluations - scenario-based evals where you run end-to-end tasks and track results against real agent behavior. Teams submit a skill, see their scores, iterate, and resubmit - and we can measure the delta between versions. That second approach is where we validate evaluation scenarios.
We're also working on new approaches beyond these two, more to share in the coming weeks. Keen to hear if this is what you had in mind, and whether you've spotted an opportunity for improvement?
The "package manager for agent skills" framing clicks immediately, especially coming from the Snyk founder. The dependency management and security signal problem in traditional code is exactly what's now happening with agent skills, and most teams don't have the tooling to even see it yet.
The ElevenLabs 2x result is a concrete proof point that avoids the usual vague benchmark claims. That kind of before/after is what actually convinces teams to adopt a new tool in their workflow.
I use Claude Code daily for building my own AI platform and the skill quality problem is very real. You genuinely can't tell if a skill is helping or quietly degrading outputs without proper evals. This fills a gap that's been easy to ignore until it hurts. Congrats on the launch!
Tessl
@joao_seabra A bad dependency in traditional code throws an error; a bad skill just makes your agent slightly worse, and you end up blaming the model instead of the context. 😄 Skills are at that exact moment right now. Since you're using Claude Code daily, try running an eval on one of your core skills. Would love to hear what you're building with them!