ShipGuard v4 roadmap: what should AI agents prove before you trust them?
I’m building ShipGuard as an open-source, local-first assurance layer for using Codex on real iOS work.
The current public release is green on main, with release-candidate readiness proof covering install, upgrade, uninstall, release-proof consumption, schema docs, plugin refresh, packaging, and blocked stable-release claims. In plain English: ShipGuard is getting closer to being a product, not just a bundle of useful scripts.
The v4 direction is narrower than it might look:
inspect -> prepare -> verify
Before an agent edits, ShipGuard should inspect the repo and prepare a scoped task contract: what files are allowed, what surfaces are risky, what proof is required, and what claims need verification.
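To make "scoped task contract" concrete, here is a minimal sketch in Python. Every name in it (TaskContract, allowed_paths, risky_surfaces, required_proof, claims_to_verify, and the example paths) is hypothetical and chosen for illustration; it is not ShipGuard's actual schema.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical task contract; field names are illustrative only."""
    task: str
    allowed_paths: tuple[str, ...]          # glob patterns the agent may edit
    risky_surfaces: tuple[str, ...] = ()    # glob patterns that need extra scrutiny
    required_proof: tuple[str, ...] = ()    # e.g. "unit-tests", "simulator-screenshot"
    claims_to_verify: tuple[str, ...] = ()  # statements the agent must back with evidence

    def allows(self, path: str) -> bool:
        # Note: with fnmatch-style globs, "*" also matches "/" (nested folders).
        return any(fnmatchcase(path, pattern) for pattern in self.allowed_paths)

    def is_risky(self, path: str) -> bool:
        return any(fnmatchcase(path, pattern) for pattern in self.risky_surfaces)


# Example contract for an assumed notification bug fix.
contract = TaskContract(
    task="Fix badge count after notification permission change",
    allowed_paths=("App/Notifications/*.swift", "AppTests/Notifications/*.swift"),
    risky_surfaces=("*.entitlements", "*Info.plist"),
    required_proof=("unit-tests", "simulator-screenshot"),
    claims_to_verify=("badge count resets when permission is revoked",),
)

print(contract.allows("App/Notifications/BadgeManager.swift"))
print(contract.is_risky("App/App.entitlements"))
```

The point of the shape is that scope, risk, and required proof are written down before the agent touches anything, so the later check has something fixed to compare against.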
After the agent works, ShipGuard should verify the diff, evidence, and claims: changed files, deleted tests, plist/entitlement movement, validation receipts, screenshots/logs where relevant, manual proof gaps, and overclaims.
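As a rough illustration of that verify step, the sketch below checks a list of changed files against an allowed scope and a set of required proof. The function name, the (status, path) input shape, the finding labels, and the "Tests/" heuristic are all assumptions made for this example, not how ShipGuard implements it.

```python
from fnmatch import fnmatchcase


def verify_change(changes, allowed_globs, required_proof, receipts):
    """Return a list of findings for a finished agent change.

    changes: hypothetical (status, path) pairs, where status is
             "A" (added), "M" (modified), or "D" (deleted).
    receipts: names of proof that was actually produced.
    """
    findings = []
    for status, path in changes:
        # Anything outside the agreed scope is flagged.
        if not any(fnmatchcase(path, glob) for glob in allowed_globs):
            findings.append(f"out-of-scope: {path}")
        # Crude heuristic: a deleted file under a "...Tests/" folder is a deleted test.
        if status == "D" and "Tests/" in path:
            findings.append(f"deleted-test: {path}")
        # Plist and entitlement movement always gets surfaced for review.
        if path.endswith((".entitlements", "Info.plist")):
            findings.append(f"plist-or-entitlement-change: {path}")
    # Required proof with no matching receipt is a gap.
    for proof in required_proof:
        if proof not in receipts:
            findings.append(f"missing-proof: {proof}")
    return findings


findings = verify_change(
    changes=[
        ("M", "App/Notifications/BadgeManager.swift"),
        ("D", "AppTests/Notifications/BadgeTests.swift"),
        ("M", "App/App.entitlements"),
    ],
    allowed_globs=["App/Notifications/*.swift", "AppTests/Notifications/*.swift"],
    required_proof=["unit-tests", "simulator-screenshot"],
    receipts={"unit-tests"},
)
for finding in findings:
    print(finding)
```

In this example the in-scope edit passes silently, while the deleted test, the entitlement change, and the missing screenshot each produce a finding that a reviewer would need to resolve.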
v4 is the line where this needs to feel stable enough for other iOS developers to trust without me sitting next to the repo. That means stable task/evidence schemas, migration support, clean install/upgrade/uninstall, release-proof consumption, rollback proof, security review, and external adoption evidence.
The larger roadmap is:
v3: prove ShipGuard changes real developer decisions.
v4: become a stable local-first assurance product for solo iOS developers.
v5: become an AI change-control plane for teams across repos, agents, policies, approvals, and releases.
v6: become an open protocol for task contracts, evidence receipts, verdicts, domain packs, and attestations.
The immediate v4 work is product hardening: the schema, migration, install-lifecycle, release-proof, rollback, security, and adoption items listed above, plus a public-safe eval corpus.
The part I want feedback on:
What proof would make you trust an AI-assisted change in a production iOS app?
Some areas I’m especially interested in:
- notifications and permission states
- StoreKit, entitlements, and app groups
- widgets, App Intents, and shared state
- background execution and lifecycle edge cases
- migrations and persistence changes
- performance work that needs runtime proof
- App Review / release-readiness evidence
- simulator proof versus physical-device/manual proof
If you’ve used Codex, Claude Code, Cursor, or another coding agent on a real app, I’d also love to hear the failure mode that made you slow down and review manually.
Good feedback for ShipGuard is concrete:
- “I would not trust the agent unless it proves X”
- “This kind of iOS change always needs Y manual check”
- “This evidence is noisy and would not change my review decision”
- “This should be a first-class domain pack”
- “This belongs in docs, not the CLI”
Repo: https://github.com/jlekerli-source/ShipGuard
If you have a proof case, ugly edge case, or iOS workflow that agents keep getting wrong, open an issue or drop it here. I want ShipGuard v4 to be shaped by real review pain, not just my own workflow.
