Launched this week

LightCrawl

Lightweight, self-hostable Web scraping API & MCP server

10 followers

LightCrawl is an ultra-fast, lightweight, and self-hostable web scraping API and Model Context Protocol (MCP) server optimized for LLMs.

Features:
• Built with TypeScript and Playwright for reliable scraping.
• Seamless Brave Search integration for autonomous browsing.
• Formats messy web pages into clean, LLM-ready Markdown.
• 100% open-source and easy to self-host with Docker & Railway.

Turn any website into clean data for your AI agents in seconds!

Free

Launch tags: Open Source • Developer Tools • Artificial Intelligence

Launch Team

Maker

Hi Product Hunt community! 👋 I'm the creator of LightCrawl.

As an engineering manager working deeply with infrastructure, I built LightCrawl because I needed a simpler, faster way to feed clean web data into AI agents without relying on heavy, expensive SaaS scraping platforms.

LightCrawl is a lightweight, 100% open-source alternative built with TypeScript and Playwright. It converts messy web pages into perfectly formatted, LLM-ready Markdown in seconds and includes full Model Context Protocol (MCP) server support right out of the box, making it seamless to connect with tools like Cursor or Claude Code.

It's designed to be completely stateless, self-hostable, and secure. You can spin it up instantly via Docker or deploy it to Railway with a single click.

I'd love to hear your thoughts, feedback, or feature requests!

5d ago

Self-hostable scraping API that also works as an MCP server is a really smart combo. AI agents need fresh web data constantly and the last thing you want is hitting rate limits from a third-party service mid-workflow. The Playwright-based rendering is a good call too - most of the interesting data is behind dynamic pages these days. Curious about the memory footprint when running headful Playwright at scale - how does it hold up on a modest VPS?

4d ago

Maker

LightCrawl takes a few careful steps to keep the memory footprint manageable, even on modest VPS instances:

Singleton browser instance — Rather than spawning a new Chromium process per request, LightCrawl reuses a single shared headless browser and creates lightweight BrowserContext instances per scrape, which are promptly closed after use.
Hybrid Mode (static-first) — By default, LightCrawl tries a fast static HTTP fetch first and only falls back to Playwright if the page requires JavaScript rendering or is bot-protected. This means Playwright isn't even invoked for simple, static pages.
Concurrency limiter — A built-in semaphore caps the number of simultaneous Playwright pages (default: 5, configurable via MAX_CONCURRENCY), preventing memory spikes from too many parallel tabs.

In practice, a 2GB RAM VPS can comfortably handle moderate workloads with these defaults.
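
For readers who want to see the shape of the static-first fallback and the concurrency cap, here is a minimal sketch. It is not LightCrawl's actual code: `Semaphore`, `createScraper`, `looksUnrendered`, and the 200-character threshold are hypothetical names and values chosen for illustration, and the two fetchers are injected so the example stays dependency-free. Only the default of 5 comes from the description above.

```typescript
// Illustrative sketch only; these names are assumptions, not LightCrawl's API.

/** Counting semaphore that caps how many tasks run at the same time. */
class Semaphore {
  private readonly limit: number;
  private readonly waiters: Array<() => void> = [];
  private active = 0;

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new RangeError("limit must be a positive integer");
    }
    this.limit = limit;
  }

  /** Runs `task` once a slot is free and releases the slot afterwards. */
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // All slots are busy: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) {
        next(); // pass the slot on directly, so `active` stays unchanged
      } else {
        this.active--;
      }
    }
  }
}

type Fetcher = (url: string) => Promise<string>;
type ScrapeResult = { html: string; via: "static" | "browser" };

/** Crude heuristic (an assumption): very little visible text suggests the page needs JS. */
function looksUnrendered(html: string, minTextLength = 200): boolean {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return text.length < minTextLength;
}

/**
 * Tries `staticFetch` first and falls back to `browserFetch` only when needed.
 * In a real deployment `browserFetch` would open a fresh BrowserContext on one
 * shared Playwright browser and close it after the scrape.
 */
function createScraper(staticFetch: Fetcher, browserFetch: Fetcher, maxConcurrency = 5) {
  const browserSlots = new Semaphore(maxConcurrency);
  return async function scrape(url: string): Promise<ScrapeResult> {
    try {
      const html = await staticFetch(url);
      if (!looksUnrendered(html)) return { html, via: "static" };
    } catch {
      // Static fetch failed (for example, it was blocked); use the browser path.
    }
    const html = await browserSlots.run(() => browserFetch(url));
    return { html, via: "browser" };
  };
}
```

Only the browser path goes through the semaphore, so cheap static fetches are never queued behind slow rendered pages.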

3d ago

Mailwarm

Do you support caching so agents don’t keep rescraping the same pages and burning time?

5d ago

Maker

Hi @karimbenkeroum

Thanks for the great question.

Currently, LightCrawl does not support caching out of the box, as it is designed to be completely stateless and minimal.

However, your point is spot on—preventing redundant rescraping is critical for AI agent workflows to save both time and compute resources. Since LightCrawl already supports Redis for its distributed crawling queue, implementing an optional Redis-backed cache layer for scraped Markdown would be a very natural next step.

I've just created a GitHub issue to track this feature request:

https://github.com/yosuke1024/LightCrawl/issues/34

Please feel free to share any specific requirements or thoughts there (e.g., TTL, cache-busting behavior)!
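
To make the discussion concrete, here is a rough sketch of what an optional cache layer with a TTL and a cache-busting flag could look like. Nothing here exists in LightCrawl today: `CacheStore`, `MemoryStore`, `withCache`, the `bypassCache` option, the `markdown:` key prefix, and the 300-second default are all hypothetical. The in-memory store stands in for Redis so the example runs on its own; a Redis-backed store would map `get` to `GET` and `set` to `SET key value EX ttl`.

```typescript
// Hypothetical sketch of the proposed feature; not part of LightCrawl today.

/** Minimal storage contract that a Redis-backed implementation could also satisfy. */
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

/** In-memory stand-in for Redis. The clock is injectable to make expiry testable. */
class MemoryStore implements CacheStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily evict expired entries
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

type Scrape = (url: string) => Promise<string>;

/** Wraps a scrape function so repeated requests for a URL reuse stored Markdown. */
function withCache(scrape: Scrape, store: CacheStore, ttlSeconds = 300) {
  return async (url: string, opts: { bypassCache?: boolean } = {}): Promise<string> => {
    const key = `markdown:${url}`;
    if (!opts.bypassCache) {
      const hit = await store.get(key);
      if (hit !== null) return hit;
    }
    // Cache miss or explicit bypass: scrape again and refresh the stored copy.
    const markdown = await scrape(url);
    await store.set(key, markdown, ttlSeconds);
    return markdown;
  };
}
```

Keeping the cache behind a small interface would let the core stay stateless by default, with Redis enabled only when it is configured.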

5d ago
