Launching today

Clipto
Fully local, natural language search over terabytes of media
342 followers
Like Google Photos, but fully local. Turn the terabytes of video, audio, meetings, and files you work with into searchable memories, without uploading anything to the cloud. Clipto automatically tags people, dialogue, and scenes, so you can instantly find any moment buried in your media just by describing what you're looking for. It's fast too: on a MacBook Pro M5, Clipto indexed 2TB of videos in just 24 hours.

Raycast
We've been honing Clipto's story for a few months. At the end of our last call @henry_kang proved the value of the product.
He and his team were out in the desert, testing Clipto remotely: minimal reception, terabytes of footage sitting on his laptop, and he needed to find a specific shot for the launch video.
He searched for: "the wide drone shot where the car enters the desert".
He didn't want "a cinematic moment." Not a "vibes" search.
He knew he had the clip, but in the pre-Clipto world it would have taken hours of video scrubbing to find it.
He found that clip in seconds using natural language to search over his own media, fully local.
Just like Google Photos — but nothing lives in the cloud.
This isn't an easy problem to solve. Henry has been pursuing this direction for over twenty years, starting at CMU's Robotics Institute (my alma mater, FYI), where he began pushing the limits of computer vision. He started by indexing hundreds of images, then advanced to millions of objects, and watched recognition basically explode once memory scaled.
Clipto is in many respects the culmination of that work, pointed at your personal hard drive.
And it's quick: a modern M5 MacBook chews through ~2TB of video in about a day. Why not put yours through its paces?
ViralSort
This looks really interesting.
I'm curious about how deeply it understands media content.
Does it recognise things like camera angles, shot types (wide, medium, close-up), camera movements, transitions, B-roll, and multi-camera sequences?
It would be incredibly useful if I could search for something like "close-up shot of a person smiling" or "drone footage with a slow pan" and instantly find matching clips across my archive.
Would love to know how detailed the visual understanding gets beyond basic object and dialogue detection.
Clipto
@pradeepmalakar That's a really professional cinematography question. We're working hard to enrich our understanding of cinematic language to better serve professional video creators — here's what we can reliably recognize today:
Shot Size: Wide Shot, Medium Shot (e.g. "wide shot of a city street" or "medium shot interview")
Camera Angle: High Angle, Overhead/Top-down (e.g. "overhead shot of a table" or "high angle crowd scene")
Framing & Composition: Landscape, Long (e.g. "landscape framing outdoor scene")
Scene & Setting: Urban/City, Green Screen/Studio, Day (e.g. "studio interview daytime" or "urban street scene")
Technical Specs: AV1, Rec.709, 4:2:0, 8-bit, 25 fps (e.g. filter footage by codec or color space when you need format consistency in an edit)
Focus & Quality: Out of Focus (e.g. quickly filter out unusable takes)
...and more. These are just a few examples across the many dimensions Clipto tags. Sorry I can't list them all here! Every case shown in our demo video is real.
Camera movements, transitions, B-roll classification, and multi-camera sequences are on the roadmap, and we're heads-down on them.
Would love to hear what specific search queries matter most to your workflow — it really helps us understand what to build next:)
YouMind
This is genuinely impressive — local-first AI search for video is something I didn't know I needed until now. The desert story really sold it for me.
Quick question: does Clipto index audio content like podcast recordings or interview transcripts the same way it handles video footage? I have hundreds of hours of recorded interviews and this could be a total game-changer for my workflow.
Clipto
@jaredl Absolutely. Video gets most of the attention, but Clipto works with audio just as well.
Podcasts, interviews, meetings, voice recordings, and other audio files are all indexed and made searchable. You can search across transcripts using natural language and jump directly to the relevant moments.
In fact, if you’re sitting on hundreds of hours of recorded interviews, that’s one of the strongest use cases for Clipto. Those recordings often contain valuable insights that are almost impossible to rediscover later without a system like this.
We’d love to hear how you’re currently managing and searching those interviews today.
LobeHub
Just downloaded the Mac app—the UI is surprisingly clean for a local AI tool. How many languages does the transcription support currently?
Clipto
@amazing_1 Thanks, really glad you like the UI. We currently support transcription in 99+ languages, so it should work well for multilingual audio and video content across different workflows.
I’m a YouTuber and managing b-roll is my biggest nightmare. Does Clipto allow for tagging, or is it all AI-based search?
Clipto
@song_kirby Totally feel you. Managing B-roll was my personal nightmare back when I was creating videos. It's actually one of the core reasons we built Clipto. It automatically analyzes and tags your footage across multiple dimensions — shot type, people, actions, dialogue, expressions, subjects and more. All AI, zero manual work. Your B-roll will become a fully searchable library.
And what makes it really special — at least for me personally — is this: when you're deep in an edit, you often need that one specific detail to nail the emotional continuity, the storytelling flow, or the movement between cuts. Something you half-remember from the shoot, or honestly didn't even notice you'd captured. Just describe it in plain language, and you'll find exactly what you need in seconds.
Hope Clipto helps you a lot :)
Clipto
@kjlis Great questions!
For dialogue search, we support 100+ languages through our speech recognition pipeline, including English, French, Italian, Spanish, Japanese, Chinese, and many others. As long as the language is supported by the underlying ASR models, the dialogue becomes searchable. Accuracy can vary by language, audio quality, accents, and recording conditions, but we’ve found it works very well across most major languages.
For compound queries, yes. We don’t treat search as simple keyword matching. We use semantic retrieval and reranking to understand the intent behind a query. For something like:
“Find clips that contain both X and Y”
clips matching both concepts would typically rank highest, while clips matching only X or only Y may still appear further down the results if they are semantically relevant. In practice, the system tries to optimize for the user’s intent rather than applying strict boolean logic.
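To make that ranking behavior concrete, here is a minimal sketch of a "soft AND" over two concepts. It is an illustration only, not Clipto's actual pipeline: the toy 2-D embeddings, the `rank_compound` helper, and the 50/50 blend of average and minimum similarity are all assumptions chosen for the example.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_compound(clips, concepts):
    """Rank clips against several concept vectors without strict boolean logic.

    Hypothetical scoring: the average similarity lets partial matches appear,
    while the minimum similarity rewards clips that match every concept.
    """
    scored = []
    for name, vector in clips.items():
        # Clamp negatives so unrelated clips score zero instead of being penalized.
        sims = [max(0.0, cosine(vector, concept)) for concept in concepts]
        score = 0.5 * (sum(sims) / len(sims)) + 0.5 * min(sims)
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed): axis 0 stands for concept X, axis 1 for concept Y.
CONCEPT_X = [1.0, 0.0]
CONCEPT_Y = [0.0, 1.0]
CLIPS = {
    "both": [1.0, 1.0],
    "only_x": [1.0, 0.0],
    "only_y": [0.0, 1.0],
    "neither": [-1.0, -1.0],
}

if __name__ == "__main__":
    for name, score in rank_compound(CLIPS, [CONCEPT_X, CONCEPT_Y]):
        print(f"{name}: {score:.3f}")
```

In this toy setup the clip containing both concepts ranks first, the single-concept clips follow with lower but nonzero scores, and the unrelated clip lands at the bottom.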
We’d love to hear more about the workflows you’re thinking about. This is an area we’re actively improving.
KnowU
Finally, a tool that respects our privacy. Since it's 100% local, does that mean absolutely zero data or telemetry is sent back to your servers?
Clipto
@carlvert That's exactly right. In 100% on-device mode, no data leaves your machine: no telemetry, nothing. That's the whole point of that mode.
If you choose Hybrid mode, some minimal data is used to enable cloud features like sync and collaboration — but that's opt-in, and clearly labeled when you set it up. Your choice, fully in your control:)
@carlvert @matthewwei How do you track the performance of your product, then? What if something goes wrong or your users don't like it?