LLM Training Data Crawler & Curator - Curate clean, deduplicated training data for AI models.
Crawl any website, score content quality, and export JSONL or Parquet for GPT, Claude, or Llama fine-tuning.
Curate high-quality, deduplicated training data for LLM fine-tuning. Extract clean text from any website OR process your own documents with automatic quality scoring, deduplication, and format conversion.
Features
Smart Content Extraction: Automatically detects and extracts the main content of each page, filtering out navigation, ads, and boilerplate (sketch below)
Bring Your Own Data (BYOD): Processes your own text documents without crawling - perfect for existing datasets
Quality Scoring: Scores each document on vocabulary diversity, sentence structure, and content density (sketch below)
Deduplication: Uses MinHash/Jaccard similarity to remove near-duplicate content (sketch below)
Flexible Crawling: Crawls a single page, stays on the same domain or subdomain, or follows all links (sketch below)
Document Chunking: Splits long documents into training-ready chunks with configurable overlap (sketch below)
Multiple Output Formats: Exports JSONL (OpenAI-compatible), JSON, Parquet, CSV, or the HuggingFace Datasets format (sketch below)
Language Filtering: Filters content by language using ISO 639-1 codes (sketch below)
Privacy Features: Optionally removes emails and URLs from extracted text (sketch below)
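The sketches below show how features like these are commonly built; all are illustrative Python under stated assumptions, not the product's actual code. First, content extraction: the trafilatura library is assumed here as one common way to separate article text from navigation and boilerplate.

```python
# Sketch: main-content extraction with trafilatura (an assumed library
# choice -- the product may use a different extractor internally).
import trafilatura

def extract_main_text(url: str) -> str | None:
    """Return a page's main content, minus nav, ads, and boilerplate."""
    html = trafilatura.fetch_url(url)   # raw HTML, or None on failure
    if html is None:
        return None
    return trafilatura.extract(html)    # drops boilerplate by default

print(extract_main_text("https://example.com"))
```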
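Quality scoring: the product's exact formula isn't published, so the weights and 0-1 scale below are purely illustrative. The sketch combines the three signals named in the list: vocabulary diversity, sentence structure, and content density.

```python
# Illustrative quality heuristic; weights and thresholds are assumptions.
import re

def quality_score(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    vocab_diversity = len(set(words)) / len(words)            # type-token ratio
    avg_sentence_len = len(words) / max(len(sentences), 1)    # words per sentence
    # Reward sentences in a "readable" 10-30 word band.
    structure = max(0.0, 1.0 - abs(avg_sentence_len - 20) / 20)
    # Share of characters that are letters (vs. whitespace/markup leftovers).
    density = len("".join(words)) / max(len(text), 1)
    return round(0.4 * vocab_diversity + 0.3 * structure + 0.3 * density, 3)

print(quality_score("The quick brown fox jumps over the lazy dog. It runs fast."))
```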
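Deduplication: a from-scratch toy MinHash, showing how agreement between hashed signatures estimates Jaccard similarity over word shingles. Real pipelines typically use a library such as datasketch for speed; the shingle size, permutation count, and ~0.8 drop threshold here are assumptions.

```python
# Toy MinHash for near-duplicate detection (illustrative parameters).
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    sh = shingles(text)
    sig = []
    for seed in range(num_perm):
        # Salting each shingle with the seed simulates an independent hash
        # permutation; keep the minimum hash value per permutation.
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in sh
        ))
    return sig

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc_a = "the cat sat on the mat and looked at the dog"
doc_b = "the cat sat on the mat and stared at the dog"
sim = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated Jaccard similarity: {sim:.2f}")  # drop one doc above ~0.8
```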
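Crawl scope: a sketch of how the single-page / same-subdomain / same-domain / all-links modes could gate which discovered links get followed. The mode names mirror the list above but are assumptions, and the domain check is deliberately naive (it ignores multi-part TLDs like .co.uk).

```python
# Sketch of crawl-scope gating; mode names are assumptions.
from urllib.parse import urlparse

def in_scope(link: str, seed: str, mode: str) -> bool:
    link_host, seed_host = urlparse(link).netloc, urlparse(seed).netloc
    if mode == "single_page":
        return link == seed
    if mode == "same_subdomain":
        return link_host == seed_host                       # exact host match
    if mode == "same_domain":
        # Compare the registrable tail, e.g. docs.example.com vs example.com.
        return link_host.split(".")[-2:] == seed_host.split(".")[-2:]
    return True                                             # "all links"

print(in_scope("https://docs.example.com/a", "https://example.com", "same_domain"))  # True
```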
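Chunking: a minimal sliding-window splitter over words. The 512-word chunks and 64-word overlap are illustrative defaults, not the product's actual parameters.

```python
# Sliding-window chunker with configurable overlap (illustrative defaults).
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks

parts = chunk_text("word " * 1200, chunk_size=512, overlap=64)
print(len(parts), [len(p.split()) for p in parts])  # 3 chunks: 512, 512, 304 words
```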
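JSONL export: the OpenAI chat fine-tuning layout is one JSON object per line containing a "messages" array. The record contents below are placeholders.

```python
# Writing OpenAI-compatible chat fine-tuning JSONL (placeholder data).
import json

records = [
    {"prompt": "What is deduplication?",
     "completion": "Removing near-identical documents from a dataset."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        line = {"messages": [
            {"role": "user", "content": r["prompt"]},
            {"role": "assistant", "content": r["completion"]},
        ]}
        f.write(json.dumps(line, ensure_ascii=False) + "\n")
```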
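Language filtering: any detector that emits ISO 639-1 codes would work; the langdetect package is assumed here.

```python
# Keep only documents in an allowed language set; langdetect is assumed.
from langdetect import detect, LangDetectException

def keep_languages(docs: list[str], allowed: set[str]) -> list[str]:
    kept = []
    for doc in docs:
        try:
            if detect(doc) in allowed:   # detect() returns e.g. "en", "de"
                kept.append(doc)
        except LangDetectException:      # too little text to classify
            pass
    return kept

print(keep_languages(["This is clearly English text.",
                      "Dies ist deutscher Text."], {"en"}))
```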
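Privacy scrubbing: simple regex replacement of emails and URLs with placeholder tokens. The patterns are illustrative and not exhaustive.

```python
# Replace emails and URLs with placeholders (illustrative patterns).
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+")

def scrub(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    return URL_RE.sub("[URL]", text)

print(scrub("Contact jane.doe@example.com or visit https://example.com/docs for details."))
```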
Use Cases
LLM Fine-tuning: Collect domain-specific training data for fine-tuning language models
RAG Systems: Build high-quality document collections for retrieval-augmented generation
Knowledge Bases: Create clean text corpora from documentation sites
Research: Gather datasets from academic or technical resources
Data Cleaning: Clean and deduplicate existing text datasets for ML training