Chris Messina

Clipto - Fully local, natural language search over terabytes of media

Like Google Photos, but fully local. Turn the terabytes of video, audio, meetings, and files you work with into searchable memories, without uploading anything to the cloud. Clipto automatically tags people, dialogue, and scenes, so you can instantly find any moment buried in your media just by describing what you're looking for. It's fast too: on a MacBook Pro M5, Clipto indexed 2TB of videos in just 24 hours.

Qifeng Zheng

On-device NL search over 2TB is the hard part — curious if you're embedding frames with a local CLIP-style model + ANN index, or sampling keyframes? And how does it stay incremental as the library grows?

Henry Kang

@qifengzheng Great question.

The high-level approach is directionally similar, but the actual system is quite a bit more involved than a straightforward CLIP + ANN pipeline.

We don’t rely on a single vision model. Instead, we combine multiple model types and signals, including visual understanding, semantic retrieval, transcripts, metadata, and other media-specific features.

For video, we also don’t simply sample frames at fixed intervals. A big part of the challenge is identifying the most informative moments in a video while keeping the index efficient at scale. We’ve built quite a bit of logic around selecting and processing high-information-density frames rather than treating every frame equally.

As for growth over time, the library is designed to be incremental. Under the hood, we maintain a persistent local knowledge structure and database layer that allows new content to be added continuously without rebuilding everything from scratch.
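
For readers who want a concrete picture of incremental indexing, here is a minimal sketch. It is not Clipto's actual implementation: the `indexed` table, the `(path, size, mtime)` fingerprint, and the function names `open_catalog`, `pending_files`, and `mark_indexed` are all assumptions made for illustration. The idea is only that a persistent catalog records what has already been processed, so a rescan touches just new or changed files.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical fingerprint for a media file: (path, size_bytes, mtime).
FileInfo = Tuple[str, int, float]


def open_catalog(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalog of files that were already indexed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed ("
        "path TEXT PRIMARY KEY, size INTEGER NOT NULL, mtime REAL NOT NULL)"
    )
    return conn


def pending_files(conn: sqlite3.Connection, files: Iterable[FileInfo]) -> List[FileInfo]:
    """Return only the files that are new or whose fingerprint changed."""
    pending = []
    for path, size, mtime in files:
        row = conn.execute(
            "SELECT size, mtime FROM indexed WHERE path = ?", (path,)
        ).fetchone()
        if row is None or row != (size, mtime):
            pending.append((path, size, mtime))
    return pending


def mark_indexed(conn: sqlite3.Connection, info: FileInfo) -> None:
    """Record a file as indexed, replacing any older fingerprint."""
    conn.execute(
        "INSERT OR REPLACE INTO indexed (path, size, mtime) VALUES (?, ?, ?)", info
    )
    conn.commit()
```

With a catalog like this, a second scan of an unchanged library returns an empty list, and only files added or modified since the last pass are queued again.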

Out of curiosity, have you built search or retrieval systems before? This is one of the more technical questions we’ve gotten today. 🙂

Qifeng Zheng

@henry_kang yeah — clip-style embeddings on fixed-interval keyframes. the missed-content problem is real, especially on long videos with slow burns. genuinely curious what you landed on.

Henry Kang

@qifengzheng Yeah, exactly. The missed-content problem is real, especially with long slow-burn footage.

We don’t sample purely at fixed intervals. We run content analysis to look at changes across frames and automatically select more representative / high-information frames for indexing.

That said, there is always a tradeoff between coverage, cost, speed, and index size. If you try to analyze every frame, it becomes expensive very quickly and often adds a lot of redundant signal. If you sample too sparsely, you miss subtle moments.

So our current approach is to balance efficiency and information density: detect meaningful visual changes, prioritize representative moments, and keep the index practical for multi-hour or multi-terabyte libraries.
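
As a rough illustration of change-based frame selection (a simplified sketch, not Clipto's pipeline), the code below keeps a frame only when it differs enough from the last frame that was kept. The frame representation (a flat list of grayscale pixel values), the mean-absolute-difference metric, and the `threshold` value are all assumptions chosen for clarity.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average per-pixel absolute difference between two equal-sized frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: Sequence[Sequence[float]], threshold: float = 10.0) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    Comparing against the last *kept* frame (not the immediately previous
    one) lets slow, gradual changes accumulate until they cross the threshold.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) >= threshold:
            kept.append(i)
    return kept
```

The threshold is where the tradeoff described above shows up: a lower value keeps more frames (better coverage, larger index), while a higher value keeps fewer (smaller index, more risk of missing subtle moments).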

Slow-burn videos are definitely one of the harder cases. I don’t think we’re near perfect on that front, and we’re continuing to improve it.

OSCAR Liu

I've got a lot of multi-cam footage and heavy ProRes files. Does the app struggle with professional codecs, or is it optimized for proxy-like speeds internally?

Henry Kang

@oscarliu Great question.

Professional codecs are absolutely a use case we designed for. We see a lot of ProRes, multi-cam footage, drone footage, and other creator-grade media libraries.

Under the hood, Clipto normalizes media through its processing pipeline, allowing us to support a wide range of formats while keeping indexing efficient and consistent.

More importantly, our goal isn’t to play back or edit every frame in real time. The goal is to understand the content and make it searchable, so we can optimize the indexing workflow differently from a traditional NLE.

We’ve successfully indexed libraries containing hundreds of gigabytes and even multiple terabytes of media, including ProRes-heavy workflows.

Out of curiosity, what’s your typical setup? Premiere, Resolve, Final Cut, or something else?

Ea Z

Does the natural language search get better over time through local fine-tuning, or is the model static upon installation?

Henry Kang

@ea_z Great question.

To clarify, Clipto does not perform local model fine-tuning on your personal media library today. The underlying models themselves aren’t continuously retrained on-device.

What we do have is a local data flywheel. As you use Clipto, your interactions, edits, labels, and feedback help the system build a better understanding of your media and preferences over time.

For example, if you consistently organize content a certain way, rename detected people, or make specific editing decisions, those signals can be incorporated into future retrieval and understanding workflows.

So while the model weights remain unchanged, the system itself becomes increasingly personalized as it accumulates more context about your library and how you work with it.
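
One way to picture personalization without changing model weights is a re-ranking step applied on top of the model's scores. The sketch below is purely hypothetical and not a description of how Clipto works: the `rerank` function, the `feedback_counts` dictionary, and the `weight` parameter are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def rerank(
    results: List[Tuple[str, float]],
    feedback_counts: Dict[str, int],
    weight: float = 0.1,
) -> List[Tuple[str, float]]:
    """Re-order (clip_id, base_score) pairs using locally stored feedback.

    The base score comes from a frozen model. The boost grows with the
    logarithm of how often the user interacted with a clip, so heavy use
    of one clip does not drown out everything else.
    """
    def adjusted(item: Tuple[str, float]) -> float:
        clip_id, base_score = item
        return base_score + weight * math.log1p(feedback_counts.get(clip_id, 0))

    return sorted(results, key=adjusted, reverse=True)
```

Here the feedback lives in ordinary local data, so the ranking can shift over time while the model itself stays exactly as installed.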

We think that’s often more valuable than fine-tuning alone because it allows the experience to improve while keeping your personal media private and local.

Steven Cen

For long-form team collaboration, is there a way to share the index file with another editor, or does each person need to re-index the same footage locally?

Henry Kang

@s_cen Great question!

Today, each Clipto library and index lives locally on the user’s machine. If multiple editors are working independently, each machine maintains its own local index.

That said, collaborative workflows are something we’re actively thinking about. As media libraries grow and teams become more distributed, sharing knowledge about a media collection becomes just as important as sharing the files themselves.

Team collaboration, shared knowledge, and more flexible ways to work across multiple users and machines are all areas we’re exploring for the roadmap.

Out of curiosity, what’s your setup today? A shared NAS, cloud storage, or a traditional post-production workflow with multiple editors?

Eric

Congrats! I believe this product will be very helpful for me! Clipto arrives at the intersection of three powerful trends: on-device AI, privacy-centric computing, and knowledge management. It has genuine disruptive potential.

Henry Kang

@fei_li5 Thanks, really appreciate that.

We believe those three trends are converging faster than most people realize. What started as a search product is gradually evolving into something closer to a local memory layer for personal media.

Curious which part caught your attention first: local AI, privacy, or knowledge management?

Carol Feng

How does the search handle lighting conditions? If I search for 'forest at night' vs 'forest during the day,' is the vision model sensitive enough to distinguish the cinematic mood?

Henry Kang

@carooolxxyy Great question.

Yes. Our visual models don’t just recognize objects and scenes; they also capture contextual signals such as lighting conditions, time of day, atmosphere, and other visual characteristics.

So in practice, searches like:

• “forest at night”
• “forest during the day”
• “sunset over the ocean”
• “dark and moody street scene”

can produce very different results, even when the underlying scene category is similar.

Of course, cinematic mood is inherently subjective, so there are limits to what any model can perfectly understand. But distinguishing things like day vs. night, bright vs. dark environments, or dramatically different visual atmospheres is something the system is designed to handle.

We’d actually love to hear the kinds of searches you’d want to run. “Cinematic mood” is an area where we’re continuing to push the models forward.

Fabrizio Pfannl

Local-only across audio + video + files is the version of this I keep waiting for, congrats on shipping. The piece that usually breaks under real load is the indexing job, not the search itself. How are you handling the initial pass on someone with 5 years of meeting recordings? And does the index update incrementally or do new files queue behind the original backfill?

Henry Kang

@fabriziowexare Great question. We learned pretty early that indexing is actually a harder problem than search itself.

For large backfills (for example, years of meeting recordings), we’ve spent a lot of effort on scheduling, prioritization, and resource management.

The index is incremental. New files don’t have to wait for the entire historical backlog to finish processing.

If you’re indexing five years of recordings and a new meeting arrives today, Clipto will use a priority-based scheduling system to process the new content much sooner, rather than forcing it to sit behind a massive batch job.

Under the hood, we continuously balance long-running indexing tasks with newly arrived media so the system remains responsive while the library keeps growing.
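
To illustrate the general idea of priority-based scheduling (a minimal sketch, not Clipto's scheduler), the code below uses a heap so that newly arrived files are processed before the remaining backfill. The `IndexQueue` class, its method names, and the two priority levels are assumptions made for this example.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

# Lower number means higher priority.
PRIORITY_NEW = 0
PRIORITY_BACKFILL = 1


class IndexQueue:
    """Queue of files to index; new arrivals jump ahead of the backfill."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        # A running counter breaks ties so equal-priority jobs stay first-in, first-out.
        self._counter = itertools.count()

    def add_backfill(self, path: str) -> None:
        heapq.heappush(self._heap, (PRIORITY_BACKFILL, next(self._counter), path))

    def add_new(self, path: str) -> None:
        heapq.heappush(self._heap, (PRIORITY_NEW, next(self._counter), path))

    def next_job(self) -> Optional[str]:
        """Return the next file to index, or None when the queue is empty."""
        if not self._heap:
            return None
        _, _, path = heapq.heappop(self._heap)
        return path
```

For example, if five years of recordings are queued with `add_backfill` and today's meeting is then queued with `add_new`, the next call to `next_job` returns today's meeting, and the backfill resumes afterwards in its original order.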

Out of curiosity, how large is the media library you’re managing today? We’re seeing some users push well beyond the “normal” use case, which has been fascinating to learn from.

Manoj Yadav

I am not a creator, but I do have lots of personal photos stored in different locations on my device. Will Clipto be able to organise those for me? And can it build a memory chart out of them? For me, rather than searching, I like what Google shows me from time to time, like memories.

But sometimes searching is also required.

Henry Kang

@bravo_5951 That’s a great question, and honestly it gets close to where we think this category is heading.

Today, Clipto can absolutely index and organize personal photos alongside videos and audio. It can identify people, scenes, objects, and other visual concepts, and you can even assign custom names to detected faces, making it much easier to search for family memories later.

Everything stays local and private on your own machine.

Right now, our primary focus is helping users find what they’re looking for instantly. But we also think there’s a future beyond search, where your media library becomes a personal memory system rather than just a collection of files.

Features like surfacing meaningful moments, relationships between people, places, and events, and helping users rediscover forgotten memories are all directions we’re actively thinking about.

I’m curious: what would make you use Clipto instead of Google Photos? Is it privacy, local ownership, better search, or something else entirely?

Manoj Yadav

@henry_kang Privacy is the first thing, and then organisation of photos.

Henry Kang

@bravo_5951 That makes a lot of sense.

Privacy was one of the core reasons we built Clipto as a local-first product. And we completely agree that organization matters just as much as search when your media library starts to grow.

Thanks for sharing!

Anubhav Gupta

The concept is really interesting, and the onboarding is smooth.
How did you get the idea to make something like this? What exactly was the moment you decided to create it?

Henry Kang

@anubhav16 Thanks for asking.

The idea actually goes back much further than Clipto.

Before Clipto, I founded an AI video company called ZenVideo. We were working on turning text into videos with AI back in 2017, years before today’s wave of generative AI. The company was later acquired by Tencent. While building products for creators, we kept seeing the same problem over and over again: people were spending more time searching through footage than actually creating.

As cameras got better and storage got cheaper, creators accumulated terabytes of videos, interviews, podcasts, screenshots, and project files. The information was there, but it became harder and harder to find.

At the same time, local AI models and modern hardware finally became powerful enough to understand media directly on a personal computer.

That combination made us ask a simple question:

What if your computer could actually understand everything you’ve ever recorded and help you find any moment instantly?

That’s really where Clipto started.

Alexia Li

Since it’s 100% local, does the indexing process completely lock up the Mac, or can I still smoothly edit 4K video in the foreground while it indexes in the background?

Henry Kang

@alexia_li Great question.

One of the biggest challenges with local AI is making sure the indexing work doesn’t get in the way of the work you’re actually trying to do.

We’ve spent a lot of time building orchestration, scheduling, and resource management systems so indexing can run efficiently in the background while minimizing its impact on foreground tasks.

So yes, the goal is that you can continue editing, reviewing footage, or working normally while Clipto processes your library in the background.

Of course, we’re still actively optimizing performance. Large media libraries can be demanding, but a huge part of our engineering effort has gone into balancing indexing speed with a smooth user experience.