techengine

GitHub - Scrape the web at Go speed.

by
GoScrapy: Harnessing Go's performance for blazingly fast web scraping, inspired by Python's Scrapy framework. - tech-engine/goscrapy

techengine
Maker

Most web scraping today is dominated by Python, especially tools like Scrapy. It's familiar, powerful, and widely used. But once you start scaling, performance limits and infrastructure costs begin to hurt.


Switching to a faster language like Go sounds like the obvious next step—but the real challenge is the transition:
👉 new syntax, new patterns, and a completely different ecosystem.

🚀 Enter GoScrapy

GoScrapy bridges that gap. It's designed for developers who want:

  • ⚡ Go-level performance

  • 🧠 A Scrapy-like developer experience


💡 Why it stands out

  • Familiar UX → Feels like Scrapy, so Python developers feel at home instantly

  • Lower learning curve → No painful transition phase

  • High performance → Built on Go for speed and efficiency

  • Cost savings → Handle more with fewer resources

🎯 The idea is simple

Don’t force developers to choose between comfort and performance.

With GoScrapy, you get both.

If you’ve ever thought:

"I wish Scrapy were faster and cheaper to run…"

This is for you.


Features
----------
🚀 Blazing Fast — Built on Go's concurrency model for high-throughput parallel scraping
🐍 Scrapy-inspired — Familiar architecture for anyone coming from Python's Scrapy
🛠️ CLI Scaffolding — Generate project structure instantly with goscrapy startproject
📡 Signal-Driven — Decoupled, event-driven architecture using a central signal bus
🧠 Auto-Discovery — Automatic detection of spider lifecycle methods (Open, Close, Idle)
🔁 Smart Retry — Automatic retries with exponential back-off on failures
🍪 Cookie Management — Maintains separate cookie sessions per scraping target
🔍 CSS & XPath Selectors — Flexible HTML parsing with chainable selectors
📦 Built-in Pipelines — Export to CSV, JSON, MongoDB, Google Sheets, and Firebase out of the box
🧩 Built-in Middleware — Plug in robust middlewares like Azure TLS and advanced Dupefilters
🎛️ Telemetry & TUI — Real-time terminal dashboard and global metrics monitoring
🔌 Extensible — Every layer (Scheduler, WorkerPool, Engine) is swappable and extensible

⭐ Check it out, try it, and if it clicks—give the repo a star and help it grow.