What should I test first for a checkable AI-work stack?

by•7d ago

Project Telos is scheduled to launch here on June 26. I am building it around one rule: if AI work matters, the person and the system should be looking at the same checkable state, not trusting the model's self-report.

The public lineup is five flagships:

- gather: witnessed intake and provenance receipts

- index: rerunnable workspace maps and MATCH / DRIFT / UNVERIFIABLE certificates

- forum: accountable multi-agent ledgers

- crucible: public GitHub pre-1.0 judgment and refinement

- the telos engine: shared human-AI perceive-and-make work
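To make the MATCH / DRIFT / UNVERIFIABLE idea concrete enough to argue with, here is a minimal sketch of a three-way check. It assumes a certificate boils down to comparing a recorded SHA-256 digest against the current content. That is an assumption for illustration, not the index implementation, and the names `Verdict`, `digest`, and `check` are hypothetical.

```python
import hashlib
from enum import Enum
from typing import Optional


class Verdict(Enum):
    MATCH = "MATCH"
    DRIFT = "DRIFT"
    UNVERIFIABLE = "UNVERIFIABLE"


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of some content."""
    return hashlib.sha256(data).hexdigest()


def check(recorded: Optional[str], current: Optional[bytes]) -> Verdict:
    """Compare current content against a previously recorded digest.

    Hypothetical rule: if either side is missing there is nothing to
    compare, so the honest answer is UNVERIFIABLE rather than a guess.
    """
    if recorded is None or current is None:
        return Verdict.UNVERIFIABLE
    return Verdict.MATCH if digest(current) == recorded else Verdict.DRIFT


recorded = digest(b"hello\n")
print(check(recorded, b"hello\n").value)   # MATCH
print(check(recorded, b"hello!\n").value)  # DRIFT
print(check(recorded, None).value)         # UNVERIFIABLE
```

The point of the third state is that a missing input is reported as its own outcome instead of being folded into pass or fail.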

The honest stage: solo, independent, pre-revenue, pre-proof on the largest thesis. I am looking for verification and testing against real workflows, technical pushback, early traction from people willing to inspect the receipts, and possibly modest grassroots research funding to keep hardening the checkable-state pieces.

Main site: https://harperz9.github.io

GitHub: https://github.com/HarperZ9

Repos: https://github.com/HarperZ9/gather - https://github.com/HarperZ9/index - https://github.com/HarperZ9/forum - https://github.com/HarperZ9/crucible - https://github.com/HarperZ9/telos


Replies


Small update: I opened an upstream PR to add Project Telos to ai-boost/awesome-harness-engineering's Demo Harnesses list: https://github.com/ai-boost/awesome-harness-engineering/pull/89

The specific test I want is not "does this feel useful?" It is: can you replay what the agent saw, what changed, and why a check passed, drifted, or stayed unverifiable?

If you build with AI agents, pick any one repo/workflow and tell me what receipt would make the result credible. Current stage is still solo, pre-revenue, author-tested, and not independently audited.
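As a starting point for that conversation, here is a rough sketch of a receipt that can be replayed. It assumes the step is deterministic and that storing the exact input plus a digest of the output is enough to re-check it later. The `Receipt` fields and the `make_receipt` / `replay` helpers are hypothetical, not taken from any of the repos.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Receipt:
    step: str           # hypothetical step name
    saw: str            # the exact input the step was given
    output_digest: str  # digest of what the step produced
    reason: str         # human-readable reason recorded with the result


def make_receipt(step: str, saw: str, fn: Callable[[str], str], reason: str) -> Receipt:
    """Run the step once and record what it saw and what it produced."""
    return Receipt(step, saw, sha256_hex(fn(saw)), reason)


def replay(receipt: Receipt, fn: Callable[[str], str]) -> bool:
    """Re-run the step on the recorded input and compare output digests."""
    return sha256_hex(fn(receipt.saw)) == receipt.output_digest


def normalize(text: str) -> str:
    # Stand-in for a deterministic agent step.
    return " ".join(text.split()).lower()


r = make_receipt("normalize", "  Hello   World ", normalize, "whitespace collapsed, lowercased")
print(json.dumps(asdict(r), indent=2))
print(replay(r, normalize))   # True: same step reproduces the recorded output
print(replay(r, str.upper))   # False: a different step does not
```

A non-deterministic step (a sampled model call, for example) would not replay this way, which is one place a receipt like this stops being useful.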


5d ago

Current-state update after launch: the five flagship repos are public, and today's local dogfood loop is passing against gather-engine 1.5.0, index-graph 2.8.0, forum-engine 1.12.0, crucible-bench 1.1.0, and the telos source demo with a 25-tool five-flagship catalog.

The test I care about most now is simple: take one messy real workflow, run it through source intake -> workspace map -> route ledger -> claim verdict -> shared state, and tell me where the receipt stops being useful. The best feedback is a concrete breakage report, not a general reaction.
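One way to picture "where the receipt stops being useful" is a chain in which each stage's entry commits to the previous one, so a break can be pinned to a specific stage. The sketch below assumes a simple SHA-256 hash chain over the five stages named above. The entry layout and the `build_chain` / `first_break` helpers are hypothetical, not the format the flagships actually emit.

```python
import hashlib
import json
from typing import Dict, List, Optional

STAGES = ["source intake", "workspace map", "route ledger", "claim verdict", "shared state"]


def entry_digest(stage: str, payload: str, prev: str) -> str:
    """Digest over a canonical JSON form of one chain entry."""
    body = json.dumps({"stage": stage, "payload": payload, "prev": prev}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build_chain(payloads: List[str]) -> List[Dict[str, str]]:
    """Link one entry per stage, each committing to the previous digest."""
    chain: List[Dict[str, str]] = []
    prev = ""
    for stage, payload in zip(STAGES, payloads):
        d = entry_digest(stage, payload, prev)
        chain.append({"stage": stage, "payload": payload, "prev": prev, "digest": d})
        prev = d
    return chain


def first_break(chain: List[Dict[str, str]]) -> Optional[str]:
    """Return the first stage whose entry no longer verifies, or None."""
    prev = ""
    for e in chain:
        recomputed = entry_digest(e["stage"], e["payload"], e["prev"])
        if e["prev"] != prev or recomputed != e["digest"]:
            return e["stage"]
        prev = e["digest"]
    return None


chain = build_chain(["3 files", "map v1", "agent A -> agent B", "MATCH", "state v1"])
print(first_break(chain))                   # None: chain is intact
chain[2]["payload"] = "agent A -> agent C"  # edit one stage after the fact
print(first_break(chain))                   # route ledger
```

A breakage report in this shape would name the stage, the recorded digest, and the recomputed one.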


5d ago