When we launched, Wallie was an open-source AI streamer that watches your screen, hears your audio, and reacts in character.
Now it does something I'm genuinely excited about: it PLAYS.
Wallie plays Minecraft survival live and unscripted: it mines, hunts for iron, crafts, fights, and tries to survive the night, all on its own. No human at the controls, no script. And it talks the whole way through, in character, reacting to what's actually happening.
Two things make it special:
Wallie V2
The "reacts to your screen" feature is the interesting differentiator here. Most AI streamers just respond to chat, which is a solved problem. An avatar that can comment on what's actually happening in the game or on screen is a different kind of presence. Curious how the screen reading works: is it vision model calls on a frame interval or something else? And what does the latency look like between something happening on screen and Wallie actually reacting to it?
Wallie V2
@ansari_adin Vision runs on a frame interval, yeah. mss captures the screen, perceptual hash (pHash) detects meaningful changes, and if the delta clears the threshold, it fires a vision model call with the current frame. The interval and sensitivity are configurable from the dashboard.
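To make the change gate concrete, here is a minimal sketch of the idea, not Wallie's actual code. It assumes each captured frame has already been reduced to a small grayscale thumbnail (for example 8x8), and it swaps in a simple average hash as a dependency-free stand-in for pHash. The names `average_hash`, `ChangeGate`, and `should_fire`, and the default threshold of 10 bits, are hypothetical.

```python
from typing import List, Optional

# Assumption: a frame arrives as a small grayscale thumbnail (e.g. 8x8),
# with each value in the range 0-255.
Gray = List[List[int]]


def average_hash(thumb: Gray) -> int:
    """Stand-in for pHash: one bit per pixel, set when the pixel is above the mean."""
    pixels = [p for row in thumb for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        # Shift the previous bits left and append this pixel's bit.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


class ChangeGate:
    """Decides whether a new frame differs enough to justify a vision call."""

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold  # hypothetical default, in differing bits
        self.last_hash: Optional[int] = None

    def should_fire(self, thumb: Gray) -> bool:
        current = average_hash(thumb)
        if self.last_hash is None:
            # Nothing to compare against yet, so the first frame always fires.
            self.last_hash = current
            return True
        if hamming(current, self.last_hash) >= self.threshold:
            # Delta clears the threshold: remember this frame and fire.
            self.last_hash = current
            return True
        return False


# Two synthetic thumbnails: bright right half vs. bright bottom half.
left_right = [[0] * 4 + [255] * 4 for _ in range(8)]
top_bottom = [[0] * 8 for _ in range(4)] + [[255] * 8 for _ in range(4)]
```

One design choice in this sketch: the gate compares against the last frame that fired rather than the immediately preceding frame, so slow drift across many frames can still add up to a trigger. Whether Wallie does the same is not stated above.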
Latency from screen event to spoken reaction: typically 2–4 seconds end to end, depending on the LLM provider. Groq + Llama-4 Scout gets you the fastest loop (~1.5–2s). Claude Sonnet is slower on raw latency but produces better reactions — especially for things like recognizing game UI, character names, or anything that requires IP/context knowledge.
The attention engine also means not every screen change triggers a full reaction. The model probabilistically assigns DEEP (22%), GLANCE (28%), TANGENT (5%), IGNORE (27%), or SILENCE (18%) — so Wallie doesn't spam reactions to every mouse move, which makes the ones that do happen feel more considered. Streak fatigue prevents the same reaction type from firing back-to-back.
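As a rough illustration of that weighted draw, here is a small sketch rather than the real attention engine. The weights are the percentages quoted above. The function name `pick_mode` is hypothetical, and so is the fatigue rule: this version simply excludes the previous mode from the next draw, and it assumes fatigue applies to all five modes, which the description above does not confirm.

```python
import random
from typing import Dict, Optional

# Percentages quoted in the reply above; they sum to 100.
WEIGHTS: Dict[str, float] = {
    "DEEP": 22,
    "GLANCE": 28,
    "TANGENT": 5,
    "IGNORE": 27,
    "SILENCE": 18,
}


def pick_mode(previous: Optional[str], rng: random.Random) -> str:
    """Draw an attention mode by weight, never repeating the previous one.

    Assumption: "streak fatigue" is modeled by dropping the previous mode
    from the candidate set, so the remaining weights are renormalized
    implicitly by random.choices.
    """
    candidates = {mode: w for mode, w in WEIGHTS.items() if mode != previous}
    modes = list(candidates)
    weights = [candidates[mode] for mode in modes]
    return rng.choices(modes, weights=weights, k=1)[0]


def simulate(steps: int, seed: int = 0) -> list:
    """Run the draw repeatedly, feeding each result back in as `previous`."""
    rng = random.Random(seed)
    history = []
    previous: Optional[str] = None
    for _ in range(steps):
        previous = pick_mode(previous, rng)
        history.append(previous)
    return history
```

Calling `simulate(20)` returns a list of 20 mode names in which no two neighbours are equal. A softer variant would reduce the previous mode's weight instead of removing it entirely.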
Super work! Is it possible to integrate it with the Gemini Live model?
Wallie V2
@ashishkingdom Gemini is already supported as an LLM provider (Gemini 2.5 Flash and Pro, streaming + vision). You set it from the dashboard: Engine → provider: gemini, then pick your model.
Gemini Live specifically (the real-time audio/multimodal API) isn't integrated yet — that's a different API surface from the standard completions endpoint Wallie uses. It's on the roadmap conceptually (the "Hearing" item — real-time audio input), but the current TTS pipeline and single-history orchestrator design would need some rethinking to accommodate it cleanly. If you're interested in contributing, that'd be a solid PR to open.