Launched this week

Clipto
Fully local, natural language search over terabytes of media
805 followers
Like Google Photos, but fully local. Turn the terabytes of video, audio, meetings, and files you work with into searchable memories, without uploading anything to the cloud. Clipto automatically tags people, dialogue, and scenes, so you can instantly find any moment buried in your media just by describing what you're looking for. It's fast too: on a MacBook Pro M5, Clipto indexed 2TB of videos in just 24 hours.

Impressed by the 99+ languages for ASR. Does the natural language search also work across different languages (e.g., searching in Spanish for a video with English dialogue)?
Clipto
@bing7 A very good question!
Yes, search supports cross-language retrieval. For instance, a query in Spanish can find video content with English dialogue.
The current implementation works and will surface a large number of relevant results. That said, cross-language semantic matching is technically harder, so its accuracy doesn't fully match searching within a single language. Colloquial or culturally specific expressions in particular may match less reliably.
We're continuing to iterate on cross-language search. If you have specific search scenarios in mind, we can help you assess what results to expect.
How does it handle audio search for multiple speakers? If I search for a phrase, can I filter it by who said it, or does it just show the timestamp?
Clipto
@lvyanghuang Yeah, that's a great point ~
The search results will show both timestamps and speaker labels.
The system first automatically detects different voice characteristics to distinguish between speakers (for example, labeling them as Speaker A, Speaker B).
Next, you can rename these speakers, changing them to names or roles.
Most importantly, a rename applies globally: you only need to change it once, and it automatically applies to that speaker's recognition results across all media files, with no repeated setup.
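To make the "rename once, applies everywhere" behavior concrete, here is a minimal Python sketch. It is not Clipto's implementation: the `Segment` and `SpeakerRegistry` types, the `spk_a`-style IDs, and the sample data are all hypothetical, and the optional `speaker` filter is an assumption, since the reply only confirms that results show speaker labels. The idea is that each transcript segment stores a stable speaker ID, and the display name is looked up at query time.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    media_file: str
    start_s: float
    speaker_id: str   # stable ID assigned by diarization (hypothetical)
    text: str


@dataclass
class SpeakerRegistry:
    # Maps a stable speaker ID to whatever label the user chose.
    names: dict = field(default_factory=dict)

    def rename(self, speaker_id: str, display_name: str) -> None:
        self.names[speaker_id] = display_name

    def label(self, speaker_id: str) -> str:
        # Fall back to the raw ID until the user renames the speaker.
        return self.names.get(speaker_id, speaker_id)


def search(segments, registry, phrase, speaker=None):
    """Return (file, timestamp, label, text) for segments containing phrase."""
    hits = []
    for seg in segments:
        label = registry.label(seg.speaker_id)
        if phrase.lower() not in seg.text.lower():
            continue
        if speaker is not None and label != speaker:
            continue
        hits.append((seg.media_file, seg.start_s, label, seg.text))
    return hits


# Hypothetical sample data: the same voice appears in two files.
segments = [
    Segment("standup.mp4", 12.5, "spk_a", "Let's ship the beta on Friday"),
    Segment("review.mov", 304.0, "spk_a", "The beta feedback was positive"),
    Segment("review.mov", 310.2, "spk_b", "I missed the beta call"),
]
registry = SpeakerRegistry()
registry.rename("spk_a", "Dana")  # one rename, reflected in both files
results = search(segments, registry, "beta", speaker="Dana")
```

Because segments never store the display name, a rename is a single dictionary update rather than a rewrite of every transcript.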
Local natural language search over your own footage is a genuinely hard problem; most approaches either require cloud uploads or fall apart at scale. Curious how Clipto handles indexing updates when new footage is added: does it re-index incrementally, or does a full re-scan run again?
Clipto
@zrimko I appreciate you digging into the technical side of things.
Clipto re-indexes incrementally. When you add new footage, only the new file gets processed, not your entire library. That's specifically designed to save you time: you don't have to sit through a full re-scan every time you drop a new clip into your folder. It just works quietly in the background, so you can keep editing or searching without interruption. 😊
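For readers wondering what incremental re-indexing can look like in practice, here is a minimal Python sketch. It assumes a simple manifest keyed by file path with a size-plus-mtime fingerprint; the function names and the `process` callback are hypothetical, and Clipto's actual change detection is not described in this thread.

```python
import os
import tempfile


def fingerprint(path):
    """Cheap change detector: file size plus modification time."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def find_unindexed(root, manifest):
    """Return files under root that are new or changed since the last run."""
    pending = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if manifest.get(path) != fingerprint(path):
                pending.append(path)
    return pending


def index_incrementally(root, manifest, process):
    """Process only pending files, then record them in the manifest."""
    pending = find_unindexed(root, manifest)
    for path in pending:
        process(path)                      # hypothetical per-file indexing step
        manifest[path] = fingerprint(path)
    return pending


with tempfile.TemporaryDirectory() as library:
    manifest = {}          # a real tool would persist this to disk
    processed = []

    open(os.path.join(library, "a.mp4"), "wb").close()
    first = index_incrementally(library, manifest, processed.append)

    open(os.path.join(library, "b.mp4"), "wb").close()
    second = index_incrementally(library, manifest, processed.append)
```

On the second pass only `b.mp4` is handed to `process`, because `a.mp4` still matches its stored fingerprint.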
Tried a similar tool last year and it choked on my 500GB of GoPro footage. Curious if Clipto handles high-bitrate HEVC files smoothly.
Clipto
@trydoff Great question.
Yes, high-bitrate HEVC footage is something we encounter quite often, especially from GoPros, drones, and modern mirrorless cameras.
During indexing, Clipto normalizes media through our processing pipeline, so it can handle a wide range of formats and codecs, including H.264, H.265/HEVC, AV1, VP8/VP9, MPEG-2, ProRes, AAC, MP3, FLAC, WAV, and more.
In practice, we’ve successfully indexed and searched across libraries containing hundreds of gigabytes and even multiple terabytes of HEVC footage.
Processing speed will depend on your hardware, but HEVC itself is absolutely a first-class citizen in the workflow.
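As an illustration of what "normalizing" mixed-codec media can mean, here is a Python sketch that builds an ffmpeg command producing a small, uniform analysis copy. This is an assumption for illustration only: the thread does not say Clipto uses ffmpeg, and the function name, file names, frame rate, and resolution are hypothetical. The snippet only constructs and prints the command; it does not run it.

```python
import shlex


def build_normalize_cmd(src, dst, fps=1, height=360):
    """Build an ffmpeg command that decodes any supported input into a
    small, uniform H.264 stream meant for analysis, not for playback."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", src,
        "-vf", f"fps={fps},scale=-2:{height}",  # lower frame rate, shrink frame
        "-an",                                   # the analysis copy carries no audio
        "-c:v", "libx264", "-preset", "veryfast", "-crf", "28",
        dst,
    ]


# Hypothetical file names.
cmd = build_normalize_cmd("GX010042.MP4", "GX010042.analysis.mp4")
print(shlex.join(cmd))
```

With ffmpeg installed, the list could be passed to `subprocess.run(cmd, check=True)`. Whatever the source codec, downstream models then see one predictable format.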
This is cool. Keeping everything local is a massive win, especially with how unpredictable cloud costs can get and how worried people are about privacy right now. I really respect that you stuck to local-only storage to protect user data.
As a developer working a lot with local-first Python frameworks, I'm super curious about the performance side. How do you manage local system resources so that indexing a massive 2TB drive doesn't slow a user's device to a crawl?
Clipto
@rumiza_shaikh Thanks! Performance has actually been one of the biggest engineering challenges for us.
We’ve spent a lot of time optimizing the entire stack, from model acceleration and inference efficiency to orchestration between different models and processing pipelines.
The goal is to make indexing large media libraries feel like a background task rather than something that takes over your machine.
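One simple way to keep a long indexing job in the background is to size the worker pool from current system load. The Python sketch below is only an illustration of that idea under stated assumptions: the function names and the `reserve` parameter are hypothetical, and nothing here reflects how Clipto actually schedules work.

```python
import os


def pick_worker_count(cpu_count, load_1min, reserve=2):
    """Choose how many indexing workers to run right now.

    Leaves `reserve` cores for the user and backs off further as the
    1-minute load average rises. Always returns at least 1.
    """
    free = cpu_count - reserve - int(load_1min)
    return max(1, min(free, cpu_count - reserve))


def current_worker_count(reserve=2):
    """Apply pick_worker_count to this machine's live numbers."""
    cpus = os.cpu_count() or 1
    try:
        load = os.getloadavg()[0]   # available on Unix-like systems only
    except (AttributeError, OSError):
        load = 0.0                  # no load info: assume the machine is idle
    return pick_worker_count(cpus, load, reserve)
```

Re-evaluating `current_worker_count()` between batches lets the job shrink when the user starts something heavy and grow again when the machine is idle.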
That said, we’re definitely not done. There is still plenty of room for improvement, especially around memory footprint and resource utilization during large indexing jobs.
We’ve been shipping performance improvements continuously, and there are a few more significant optimizations currently in the pipeline.
Since you’re working with local-first systems yourself, I’d love to stay in touch and compare notes. This is one of those areas where there’s still a lot of unexplored territory.
On-device NL search over 2TB is the hard part — curious if you're embedding frames with a local CLIP-style model + ANN index, or sampling keyframes? And how does it stay incremental as the library grows?
Clipto
@qifengzheng Great question.
The high-level approach is directionally similar, but the actual system is quite a bit more involved than a straightforward CLIP + ANN pipeline.
We don’t rely on a single vision model. Instead, we combine multiple model types and signals, including visual understanding, semantic retrieval, transcripts, metadata, and other media-specific features.
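One common way to combine results from several signals is reciprocal rank fusion (RRF). The Python sketch below uses RRF purely as a stand-in: the reply does not say how Clipto merges its signals, and the signal names and clip IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of clip IDs into one ranking.

    Each clip scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by clip ID for determinism.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, clip_id in enumerate(ranked_ids, start=1):
            scores[clip_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical per-signal results for one query.
rankings = {
    "visual":     ["clip_7", "clip_2", "clip_9"],
    "transcript": ["clip_2", "clip_7", "clip_4"],
    "metadata":   ["clip_2", "clip_4"],
}
fused = reciprocal_rank_fusion(rankings)
```

RRF only needs ranks, not comparable scores, which makes it a convenient baseline when the underlying models produce scores on different scales.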
For video, we also don’t simply sample frames at fixed intervals. A big part of the challenge is identifying the most informative moments in a video while keeping the index efficient at scale. We’ve built quite a bit of logic around selecting and processing high-information-density frames rather than treating every frame equally.
As for growth over time, the library is designed to be incremental. Under the hood, we maintain a persistent local knowledge structure and database layer that allows new content to be added continuously without rebuilding everything from scratch.
Out of curiosity, have you built search or retrieval systems before? This is one of the more technical questions we’ve gotten today. 🙂
@henry_kang yeah — clip-style embeddings on fixed-interval keyframes. the missed-content problem is real, especially on long videos with slow burns. genuinely curious what you landed on.
Clipto
@qifengzheng Yeah, exactly. The missed-content problem is real, especially with long slow-burn footage.
We don’t sample purely at fixed intervals. We run content analysis to look at changes across frames and automatically select more representative / high-information frames for indexing.
That said, there is always a tradeoff between coverage, cost, speed, and index size. If you try to analyze every frame, it becomes expensive very quickly and often adds a lot of redundant signal. If you sample too sparsely, you miss subtle moments.
So our current approach is to balance efficiency and information density: detect meaningful visual changes, prioritize representative moments, and keep the index practical for multi-hour or multi-terabyte libraries.
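A minimal Python sketch of change-based frame selection, as opposed to fixed-interval sampling. This is an illustration under assumptions, not the method described above: it takes per-frame normalized color histograms as input (hypothetical), and the `threshold` and `max_gap` parameters are made up for the example.

```python
def frame_distance(hist_a, hist_b):
    """L1 distance between two normalized histograms, in [0, 2]."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def select_keyframes(histograms, threshold=0.5, max_gap=None):
    """Pick frame indices whose content differs enough from the last kept frame.

    histograms: per-frame normalized color histograms (hypothetical input).
    threshold:  minimum L1 distance from the last kept frame to keep a frame.
    max_gap:    optional cap on frames between keeps, so long static
                stretches are still sampled occasionally.
    """
    if not histograms:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(histograms)):
        changed = frame_distance(histograms[kept[-1]], histograms[i]) >= threshold
        overdue = max_gap is not None and i - kept[-1] >= max_gap
        if changed or overdue:
            kept.append(i)
    return kept


# Toy 2-bin histograms: a static scene, a hard cut, then a slow drift.
frames = [
    [1.0, 0.0], [1.0, 0.0], [1.0, 0.0],      # static
    [0.0, 1.0], [0.0, 1.0],                  # hard cut at index 3
    [0.1, 0.9], [0.2, 0.8], [0.3, 0.7],      # gradual drift
]
keyframes = select_keyframes(frames, threshold=0.5)
```

Comparing each frame against the last kept frame, rather than its immediate predecessor, lets small changes accumulate, so the slow drift at the end of the toy clip eventually triggers a keep even though no single step crosses the threshold.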
Slow-burn videos are definitely one of the harder cases. I don't think we're near perfect on that front, and we're continuing to improve that part.
I've got a lot of multi-cam footage and heavy ProRes files. Does the app struggle with professional codecs, or is it optimized for proxy-like speeds internally?
Clipto
@oscarliu Great question.
Professional codecs are absolutely a use case we designed for. We see a lot of ProRes, multi-cam footage, drone footage, and other creator-grade media libraries.
Under the hood, Clipto normalizes media through its processing pipeline, allowing us to support a wide range of formats while keeping indexing efficient and consistent.
More importantly, our goal isn’t to play back or edit every frame in real time. The goal is to understand the content and make it searchable, so we can optimize the indexing workflow differently from a traditional NLE.
We’ve successfully indexed libraries containing hundreds of gigabytes and even multiple terabytes of media, including ProRes-heavy workflows.
Out of curiosity, what’s your typical setup? Premiere, Resolve, Final Cut, or something else?