Building an autonomous video agent with Gemini 2.0 & YOLOv8. Roast my logic?
I'm a CS student building Virl.ai, an autonomous agent that takes 8-hour Twitch streams and turns them into viral TikToks without human help.
Most AI clippers just look for "loud noises." I found that this approach catches the punchline but cuts off the setup of the joke.
My Solution: I built a local Python pipeline that does two things differently:
Context Engine: It grabs a 90-second audio buffer around volume spikes and uses Gemini 2.0 Flash to find the actual start and end of the interaction (rough sketch after this list).
Universal Vision: I trained a YOLOv8 model to detect facecams vs. game UI so the editor can dynamically switch between "Split Screen" and "Full Screen" layouts (sketch below).
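
Here's roughly what the Context Engine step looks like. Minimal sketch, not my production code: it assumes the google-generativeai SDK, a pre-extracted WAV of the VOD, and a simple RMS z-score as the spike detector. find_spikes, refine_clip, the 3.0 threshold, and the prompt wording are all illustrative stand-ins.

```python
import numpy as np
import soundfile as sf
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def find_spikes(wav_path, win_s=1.0, z_thresh=3.0):
    """Return timestamps (s) where 1-second RMS spikes above the stream's baseline."""
    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # downmix stereo to mono
    win = int(win_s * sr)
    n = len(audio) // win
    rms = np.sqrt((audio[: n * win] ** 2).reshape(n, win).mean(axis=1))
    z = (rms - rms.mean()) / (rms.std() + 1e-9)
    return [i * win_s for i in np.where(z > z_thresh)[0]]

def refine_clip(wav_path, spike_t, buffer_s=90):
    """Cut a 90 s window around a spike and ask Gemini for tight clip boundaries."""
    audio, sr = sf.read(wav_path)
    start = max(0, int((spike_t - buffer_s / 2) * sr))
    end = min(len(audio), int((spike_t + buffer_s / 2) * sr))
    sf.write("window.wav", audio[start:end], sr)

    model = genai.GenerativeModel("gemini-2.0-flash")
    resp = model.generate_content([
        "This is a 90-second window around a loud moment in a Twitch stream. "
        'Reply with JSON {"start_s": float, "end_s": float} marking where the '
        "full interaction (setup + punchline) begins and ends.",
        genai.upload_file("window.wav"),
    ])
    return resp.text  # parse JSON offsets relative to the window in practice
```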
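And the Universal Vision side, using the Ultralytics API. The weights path, the class names ("facecam", "game_ui"), and the 0.6 confidence cutoff are placeholders for my fine-tuned model, not shipped values:

```python
from ultralytics import YOLO

model = YOLO("virl_facecam.pt")  # hypothetical fine-tuned YOLOv8 weights

def pick_layout(frame):
    """Return ("split", facecam_box) if a facecam is visible, else ("full", None)."""
    result = model(frame, verbose=False)[0]
    for box in result.boxes:
        if result.names[int(box.cls)] == "facecam" and float(box.conf) > 0.6:
            return "split", box.xyxy[0].tolist()  # [x1, y1, x2, y2] in pixels
    return "full", None
```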
Virl.ai - The Autonomous Growth Engine for Creators.
Streamers spend 8 hours broadcasting and 0 hours distributing. Virl.ai fixes this.
It is AI-native infrastructure that ingests long-form Twitch/YouTube streams, identifies viral moments using Context Intelligence (Audio + LLMs), and autonomously edits them into vertical shorts using Computer Vision (YOLOv8); a rough render sketch follows the feature list.
Features:
Understands jokes, setups, and reactions (90s buffer).
Detects facecams and game UI to smart-crop instantly.
Runs while you sleep.
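
For the curious, the render step boils down to one ffmpeg call. Sketch only: it assumes a 1080p 16:9 source, hard-codes an illustrative 480px cam / 1440px gameplay split, and takes the clip boundaries and facecam box from the two sketches earlier in the post.

```python
import subprocess

def render_split(src, start_s, end_s, face_box, out="short.mp4"):
    """Crop the facecam, stack it above a center-cropped gameplay feed, output 1080x1920."""
    x1, y1, x2, y2 = map(int, face_box)
    fc = (
        "[0:v]split=2[camsrc][gamesrc];"
        # facecam box -> top strip (stretches to 1080x480; fine for a sketch)
        f"[camsrc]crop={x2 - x1}:{y2 - y1}:{x1}:{y1},scale=1080:480,setsar=1[cam];"
        # center-crop gameplay to 3:4 (assumes a 1920x1080 source), then fill the rest
        "[gamesrc]crop=810:1080,scale=1080:1440,setsar=1[game];"
        "[cam][game]vstack=inputs=2[v]"
    )
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start_s), "-t", str(end_s - start_s), "-i", src,
        "-filter_complex", fc,
        "-map", "[v]", "-map", "0:a?",  # keep audio if the source has it
        out,
    ], check=True)
```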

