Launched this week

kodwai
The first platform that scores how you Vibe Code
56 followers
The first platform that scores how you Vibe Code
56 followers
Kodwai is the first platform that scores how well you collaborate with AI coding agents (Claude Code, Cursor, Codex). Solve real challenges in your own terminal; the CLI captures your code, tests, git history, agent transcript, and time, then scores you across three axes: Direction, Outcome, and Lift, each citing its own evidence. Climb a public leaderboard, earn badges, and build a profile that shows how you engineer, not what you memorized. Free, bring your own agent.





Hey Product Hunt, I'm Hakan, founder of Kodwai.
Three years ago we typed every line. Now most of us code by directing an agent, steering it, checking it, catching it when it's confidently wrong. That's the real skill, and nothing measures it. LeetCode tests memorized algorithms. Take-homes test free time. Neither reflects how we actually work.
So I built Kodwai. Pick a real challenge, solve it on your own machine with your own agent (Claude Code, Cursor, or Codex), and submit. We score the session across three axes: Direction (how you steer and verify), Outcome (what shipped and whether it passes), and Lift (the edge cases a one-shot prompt misses). Every score points to the evidence behind it.
Direction is half the weight on purpose. A careless prompt that passes the tests still scores low, because steering the agent well was always the hard part.
Free for developers, with a public leaderboard and a profile you can send instead of doing a take-home. Teams can run interviews on the same engine and watch the real session.
When you pair with an agent, what tells you it's going well or badly? That's what I'm trying to measure, and I'll be here all day. kodwai.com
@ege_hakan_karaagac when kodwai takes an automated path, what can I inspect or approve before I trust the result?
@romejerome
Great question, and it's the thing we obsessed over. The score isn't a black box.
A few things you can check:
- You run your own agent locally. Kodwai never auto-runs code or touches your machine. You decide what's in your working tree and when to submit, so you control exactly what gets evaluated.
- The objective part (the test suite) is deterministic. You can run the same tests yourself and get the same result.
- The Direction score (how you worked with the agent) is AI-judged, but every signal comes with a plain-English reason and the actual evidence pulled from your own transcript, plus a trace-quality and confidence label. You see why each number is what it is, not just the number.
On the submission page you can open the full breakdown, axis by axis and signal by signal. If something looks off, you've got the receipts to push back. That's the whole point.
@ege_hakan_karaagac Congrats on the launch! 🎉 Measuring how people collaborate with AI coding agents instead of just what they build is a really interesting angle. Curious—have you seen any common habits that consistently separate the highest-scoring users from everyone else?
@nicole_hynek
Thank you! 🎉 Yeah, a few habits show up again and again.
The best sessions start slow and specific. People spell out the constraints and edge cases before any code gets written (we score that as Spec Precision), and they break the work into ordered steps instead of one giant prompt (Decomposition).
The other big one is active distrust. The top scorers actually read what the agent produced, catch the wrong turn early, and redirect instead of letting it run for ten minutes (Verification Rigor and Recovery). The lower sessions tend to accept the first plausible output and pay for it an hour later.
Short version: the people who treat the agent like a fast, capable junior who still needs a clear brief and a real code review consistently come out on top.
Honestly the most fun part has been watching that play out. 😂
Love this, we've gone from "How good are you at coding?" to "How good are you at collaborating with AI?" and honestly... that's a much more relevant question today.
The idea of scoring Direction instead of just the final output is what caught my attention. Looking forward to throwing Claude Code at a few challenges this weekend and seeing whether I'm actually a good AI engineer or just really good at accepting suggestions.
Congrats on the launch! 🚀
@jayant_siktia
Thank you, this genuinely made my day 🙏 You nailed why we built it.
And ha, that last line is basically the actual test. "Good AI engineer vs just good at accepting suggestions" is something we literally score: Verification Rigor (did you catch its mistakes and push back) and Engagement (did you stay in the loop vs paste-the-spec-and-walk-away) are two of the Direction signals. So you'll find out 😄
Throw Claude Code at a few this weekend and tell me how it goes. I'd love to hear your score, and anything that felt off. Have fun with it.
@ege_hakan_karaagac
Haha, I love that those are actual scoring signals.
The more I think about it, the more I feel this is where software engineering interviews are headed. Knowing how to write code is still important, but knowing when to trust an AI, when to challenge it, and how to guide it toward the right solution is becoming an entirely new skill.
I'll definitely give it a proper run this weekend and won't sugarcoat the feedback. If it exposes bad prompting habits or moments where I blindly trusted the agent, that's probably the most valuable outcome. Looking forward to seeing where I land on the leaderboard!
For me it's whether the agent asks clarifying questions in the first 30 seconds. If it just starts coding, I know I'll be cleaning up assumptions for the next hour. If it asks "do you want me to handle X case?" before writing anything, the session almost always goes well. Bad sessions are confidently wrong. Good sessions are loudly uncertain.
@elias_motionfy
"Confidently wrong vs loudly uncertain" might be the best one-line description of agent sessions I've heard. Stealing that.
And it's funny, that reflex is exactly the human side of what we score. The strongest operators front-load it themselves: they state the constraints and the "what about X case" up front (we call it Spec Precision) so the agent doesn't have to guess. Then when it does start confidently wrong, they catch it fast and redirect (Verification Rigor and Recovery).
That's the whole reason we score Direction instead of just the final diff. Those moves, not the typing speed, are what separate a clean session from an hour of cleanup. Would genuinely love to see what your sessions score. The loudly-uncertain ones should do well 😄
@ege_hakan_karaagac Honestly? I think it works for ~20% of users, and that's enough. Most people use AI without thinking about cost or footprint at all, same way they used to leave the tap running while brushing teeth. The readout creates a moment of awareness. Some will ignore it, some will start picking the lighter model just because they can see the trade-off.
The behavior change isn't "users become eco saints." It's "users stop reaching for the heaviest model by default." That's already huge.