Launched this week

LightCrawl
Lightweight, self-hostable Web scraping API & MCP server
10 followers
LightCrawl is an ultra-fast, lightweight, and self-hostable web scraping API and Model Context Protocol (MCP) server optimized for LLMs.
Features:
• Built with TypeScript and Playwright for reliable scraping.
• Seamless Brave Search integration for autonomous browsing.
• Formats messy web pages into clean, LLM-ready Markdown.
• 100% open-source and easy to self-host with Docker & Railway.
Turn any website into clean data for your AI agents in seconds!



A self-hostable scraping API that also works as an MCP server is a really smart combo. AI agents need fresh web data constantly, and the last thing you want is to hit rate limits from a third-party service mid-workflow. The Playwright-based rendering is a good call too, since most of the interesting data sits behind dynamic pages these days. Curious about the memory footprint when running headful Playwright at scale: how does it hold up on a modest VPS?
LightCrawl takes a few careful steps to keep the memory footprint manageable, even on modest VPS instances:
Singleton browser instance — Rather than spawning a new Chromium process per request, LightCrawl reuses a single shared headless browser and creates lightweight BrowserContext instances per scrape, which are promptly closed after use.
Hybrid Mode (static-first) — By default, LightCrawl tries a fast static HTTP fetch first and only falls back to Playwright if the page requires JavaScript rendering or is bot-protected. This means Playwright isn't even invoked for simple, static pages.
Concurrency limiter — A built-in semaphore caps the number of simultaneous Playwright pages (default: 5, configurable via MAX_CONCURRENCY), preventing memory spikes from too many parallel tabs.
In practice, a 2GB RAM VPS can comfortably handle moderate workloads with these defaults.
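For readers who want to see how the static-first fallback and the concurrency cap fit together, here is a minimal sketch. It is not LightCrawl's actual source: `Semaphore`, `needsBrowser`, and `makeHybridScraper` are hypothetical names, the 50-character threshold is an arbitrary illustrative heuristic, and the two fetchers are passed in as plain functions so the example runs without Playwright installed. In a real setup, `browserFetch` would presumably open a BrowserContext on the shared browser and close it when done.

```typescript
// Hypothetical sketch, not LightCrawl's real code.
type Fetcher = (url: string) => Promise<string>;

// Counting semaphore: at most `limit` tasks run at the same time.
class Semaphore {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // All slots are busy: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next(); // pass the slot straight to the next waiter
      else this.active--; // nobody waiting, so free the slot
    }
  }
}

// Assumed heuristic: if almost no visible text remains after removing
// scripts and tags, the page probably needs JavaScript rendering.
function needsBrowser(html: string): boolean {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return text.length < 50;
}

function makeHybridScraper(
  staticFetch: Fetcher,
  browserFetch: Fetcher,
  maxConcurrency = 5, // mirrors the MAX_CONCURRENCY default mentioned above
) {
  const browserSlots = new Semaphore(maxConcurrency);

  return async function scrape(
    url: string,
  ): Promise<{ html: string; via: "static" | "browser" }> {
    try {
      const html = await staticFetch(url);
      if (!needsBrowser(html)) return { html, via: "static" };
    } catch {
      // Static fetch failed (for example, blocked); fall back to the browser.
    }
    // Only the browser path is limited, since that is where memory is spent.
    const html = await browserSlots.run(() => browserFetch(url));
    return { html, via: "browser" };
  };
}
```

The semaphore wraps only the browser path, so cheap static fetches are never queued behind heavy renders.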
Mailwarm
Do you support caching so agents don’t keep rescraping the same pages and burning time?
Hi @karimbenkeroum,
Thanks for the great question.
Currently, LightCrawl does not support caching out of the box, as it is designed to be completely stateless and minimal.
However, your point is spot on—preventing redundant rescraping is critical for AI agent workflows to save both time and compute resources. Since LightCrawl already supports Redis for its distributed crawling queue, implementing an optional Redis-backed cache layer for scraped Markdown would be a very natural next step.
I've just created a GitHub issue to track this feature request:
https://github.com/yosuke1024/LightCrawl/issues/34
Please feel free to share any specific requirements or thoughts there (e.g., TTL, cache-busting behavior)!
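To make the TTL and cache-busting discussion concrete, here is a rough sketch of what such a cache layer could look like. Nothing here exists in LightCrawl today: `CacheStore`, `MemoryStore`, `cachedScrape`, the `scrape:` key prefix, and the `bypassCache` option are all hypothetical. The in-memory store stands in for Redis so the example runs without dependencies; a Redis-backed store would implement the same two methods, with `set` mapping to a `SET` with an `EX` expiry.

```typescript
// Hypothetical sketch of an optional cache layer, not an existing feature.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in for Redis; entries expire after their TTL.
class MemoryStore implements CacheStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > this.now()) return entry.value;
    this.entries.delete(key); // drop expired or missing entries
    return null;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

interface CacheOptions {
  ttlSeconds?: number; // how long a scraped page stays fresh (assumed default: 1 hour)
  bypassCache?: boolean; // cache-busting: skip the lookup but refresh the stored copy
}

async function cachedScrape(
  store: CacheStore,
  scrape: (url: string) => Promise<string>,
  url: string,
  options: CacheOptions = {},
): Promise<{ markdown: string; cached: boolean }> {
  const ttlSeconds = options.ttlSeconds ?? 3600;
  const key = `scrape:${url}`;

  if (!options.bypassCache) {
    const hit = await store.get(key);
    if (hit !== null) return { markdown: hit, cached: true };
  }

  const markdown = await scrape(url);
  await store.set(key, markdown, ttlSeconds);
  return { markdown, cached: false };
}
```

Keeping the store behind a small interface would let the scraper stay stateless by default and only gain a cache when one is configured.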