Building a multi-platform desktop app in public using Claude Code
I am building Navam Sentinel in public as a reference AI project, with source available at the official GitHub repo. The problem I am addressing is multi-agent regression testing for quality, capabilities, efficiency, and other criteria that matter. I want to do so with the least cognitive load on the end user, who is a busy developer, engineer, or scientist in an AI lab. The project is a reference in two ways: 1) how to build a multi-agent AI system solving for AI-ops automation using visual primitives, and 2) how to context engineer AI code generation for a complex multi-platform desktop AI app across tens of thousands of lines of code, hundreds of tests, and multiple releases per day.

I am posting this thread both to share my learning and to receive feedback on the product, features, and development approach from the awesome PH community. Here are a few open questions at this stage in the project:
What are the key challenges that AI labs or teams face when building complex multi-agent AI systems?
How do experienced or new AI teams handle agent testing and regressions? Is there a difference in approach, workflow, or tooling?
What is the right UI for multi-agent testing - code-first, visual-first, hybrid, or something else?

Replies
Sharing my latest learning on knowledge grounding and context engineering when using Claude Code for code generation. I am adding system menu integration to the Sentinel app, which is built with Tauri. Initial attempts to add a simple Settings/Preferences item to the app's system menu failed after multiple iterations. The LLM kept making "some changes" and assuring me the fix was done, but no joy in the app UI. So I checked the version of Tauri in use against the latest release online and discovered the app was not on the latest release. I took two actions:
Asked Claude Code to create an upgrade plan to the latest Tauri release, ensuring no regressions
Created a knowledge grounding reference folder and asked Claude Code to download the latest version docs into it
Next I updated my active backlog with the upgrade plan, referenced the latest knowledge in Claude Code memory, performed the upgrade, and manually tested key features to note any regressions.
Then I created a plan to integrate the system menu, asking Claude Code to reference the latest docs. It considered the local references, researched online for gaps, and created the plan for the system menu integration. I executed the plan and was greeted with awesome results. Not only did the system menu integration work this time for the Settings menu, Claude Code's plan also added more menus to match my application, creating placeholders where the integration capabilities were yet to be built.
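For readers trying the same thing, here is a minimal sketch of what a working Tauri 2 system menu with a Settings item can look like from the frontend side, using the @tauri-apps/api/menu module. The labels, the settings id, and the open-settings event name are illustrative placeholders rather than Sentinel's actual code, and the same thing can be done from the Rust side with Tauri's menu builders.

```typescript
// Minimal sketch, assuming Tauri 2 and its JS menu API (@tauri-apps/api/menu).
// Ids, labels, and the emitted event name are illustrative placeholders.
import { Menu, Submenu, MenuItem, PredefinedMenuItem } from '@tauri-apps/api/menu';
import { emit } from '@tauri-apps/api/event';

export async function installAppMenu(): Promise<void> {
  // A Settings… item that tells the frontend to open a (hypothetical) settings view.
  const settings = await MenuItem.new({
    id: 'settings',
    text: 'Settings…',
    action: () => emit('open-settings'),
  });

  const appSubmenu = await Submenu.new({
    text: 'Sentinel',
    items: [
      settings,
      await PredefinedMenuItem.new({ item: 'Separator' }),
      await PredefinedMenuItem.new({ item: 'Quit' }),
    ],
  });

  // Replace the default application-wide menu with this one.
  const menu = await Menu.new({ items: [appSubmenu] });
  await menu.setAsAppMenu();
}
```

A listener on the open-settings event in the app shell can then show the settings view, keeping the menu wiring decoupled from the UI routing.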
Lesson: if you find yourself stuck in bug-fix iterations with Claude Code, check the latest versions of the frameworks you use, knowledge ground the latest docs, and ensure the context is updated.
The UI for Sentinel is getting complex. Every new tab I add, every choice I give the end user (like more test template variations), and every new feature compounds the complexity of seemingly simple actions like saving a test. The user can now "effect" saving and related actions in several places in the app: changing the test graph canvas visually, importing a test, editing the test script, recording multiple runs for a test (which depends on saving a test and pointing to one version of it), selecting templates which are saved as tests, and so on. As a result of this combinatorial explosion, simple UI decisions like the Save button require user journey mapping from various starting points.
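To make the explosion concrete, here is a rough, purely illustrative sketch of treating "saving a test" as one explicit state machine that every entry point funnels through. The state, event, and source names are hypothetical, not Sentinel's actual model.

```typescript
// Illustrative only: one save state machine shared by every entry point.
type SaveState =
  | { status: 'clean'; savedVersion: number }
  | { status: 'dirty'; savedVersion: number }
  | { status: 'saving'; targetVersion: number }
  | { status: 'error'; savedVersion: number; message: string };

type SaveEvent =
  | { type: 'EDIT'; source: 'canvas' | 'script-editor' | 'import' | 'template' }
  | { type: 'SAVE_REQUESTED'; source: 'save-button' | 'run-recording' | 'auto' }
  | { type: 'SAVE_SUCCEEDED' }
  | { type: 'SAVE_FAILED'; message: string };

function versionOf(state: SaveState): number {
  return state.status === 'saving' ? state.targetVersion - 1 : state.savedVersion;
}

function reduce(state: SaveState, event: SaveEvent): SaveState {
  switch (event.type) {
    case 'EDIT':
      // Any edit surface (canvas, script, import, template) marks the test dirty.
      return state.status === 'saving'
        ? state
        : { status: 'dirty', savedVersion: versionOf(state) };
    case 'SAVE_REQUESTED':
      // Save button, run recording, and auto-save all take the same transition,
      // so they cannot diverge in behavior by starting point.
      return state.status === 'dirty'
        ? { status: 'saving', targetVersion: versionOf(state) + 1 }
        : state;
    case 'SAVE_SUCCEEDED':
      return state.status === 'saving'
        ? { status: 'clean', savedVersion: state.targetVersion }
        : state;
    case 'SAVE_FAILED':
      return state.status === 'saving'
        ? { status: 'error', savedVersion: state.targetVersion - 1, message: event.message }
        : state;
  }
}
```

The point is that the canvas, the script editor, imports, and run recording would all dispatch the same events, so the Save button's behavior stays identical regardless of where the journey started.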
This seems like a good case for state machine diagrams or a sequence diagram to help the human-in-the-loop grasp this complexity. I am wondering how the latest code generation models (Opus 4.5 just landed) handle this. What is the best way to build a shared understanding of this complexity with the model so we can both "solve" for it? Here are a few possibilities I am considering, please add/comment:
(High LLM reasoning and context reliance) Ask the model to build a state machine diagram or sequence diagram after explaining the problem as I have done in the prior paragraph.
(High tool automation reliance) Ask the model to run Playwright to snapshot every user interaction around a journey like saving tests, then study these snapshots to arrive at an understanding of the complexity and recommend potential solutions (see the sketch after this list).
(High human effort) Manually draw out the ideal user journey specification and inform the model.
(Lazy way) Iteratively inform the model about what is failing in the user journey so that the model figures out the solution over iterations.
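Sketching option 2 to make it concrete: a Playwright script that walks each save entry point and snapshots the before and after states. It assumes the Sentinel frontend is reachable in a plain browser (for example via a Tauri dev server on a hypothetical localhost:1420); the entry-point names, routes, and the Save button selector are placeholders, not the real app's.

```typescript
import { test } from '@playwright/test';

// Hypothetical starting points from which a save can be triggered.
const saveEntryPoints = [
  { name: 'graph-canvas-edit', path: '/tests/example?view=canvas' },
  { name: 'script-editor', path: '/tests/example?view=script' },
  { name: 'template-select', path: '/templates' },
];

for (const entry of saveEntryPoints) {
  test(`save journey snapshots: ${entry.name}`, async ({ page }) => {
    // Assumes the frontend dev server URL; adjust to however the app is served.
    await page.goto(`http://localhost:1420${entry.path}`);
    await page.screenshot({ path: `journeys/${entry.name}-01-start.png`, fullPage: true });

    // Trigger the save affordance from this starting point and capture the result.
    await page.getByRole('button', { name: 'Save' }).click();
    await page.screenshot({ path: `journeys/${entry.name}-02-after-save.png`, fullPage: true });
  });
}
```

The resulting image pairs per entry point are the raw material the model would study to map the journeys and propose a consolidated save flow.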