LLM Training Data Crawler & Curator
by Eliud Munyala

Curate clean, deduplicated training data for AI models. Crawl any website, score quality, and export JSONL/Parquet for GPT, Claude, and Llama fine-tuning.

Eliud Munyala (Maker) 📌
Curate high-quality, deduplicated training data for LLM fine-tuning. Extract clean text from any website, or process your own documents, with automatic quality scoring, deduplication, and format conversion.

Features

- Smart Content Extraction: automatically detects and extracts main content, filtering out navigation, ads, and boilerplate
- Bring Your Own Data (BYOD): process your own text documents without crawling, ideal for existing datasets
- Quality Scoring: scores each document on vocabulary diversity, sentence structure, and content density (sketched below)
- Deduplication: uses MinHash/Jaccard similarity to remove near-duplicate content (sketched below)
- Flexible Crawling: single page, same domain, same subdomain, or follow all links
- Document Chunking: splits long documents into training-ready chunks with configurable overlap (sketched below)
- Multiple Output Formats: JSONL (OpenAI compatible), JSON, Parquet, CSV, or HuggingFace Datasets format (sketched below)
- Language Filtering: filters content by language (ISO 639-1 codes)
- Privacy Features: optionally removes emails and URLs from extracted text

Use Cases

- LLM Fine-tuning: collect domain-specific training data for fine-tuning language models
- RAG Systems: build high-quality document collections for retrieval-augmented generation
- Knowledge Bases: create clean text corpora from documentation sites
- Research: gather datasets from academic or technical resources
- Data Cleaning: clean and deduplicate existing text datasets for ML training
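To make the quality-scoring feature concrete, here is a minimal sketch of how the three named signals could combine into one score. The exact formula is not published; the type-token ratio, the 20-word sentence-length target, and the equal weighting are all assumptions, not the tool's actual heuristic.

```python
import re

def quality_score(text: str) -> float:
    """Score a document in [0, 1] from three heuristic signals."""
    words = text.split()
    if not words:
        return 0.0
    # Vocabulary diversity: type-token ratio (unique words / total words).
    diversity = len({w.lower() for w in words}) / len(words)
    # Sentence structure: reward a mean sentence length near ~20 words
    # (the target value is an assumption for illustration).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    mean_len = len(words) / max(len(sentences), 1)
    structure = max(0.0, 1.0 - abs(mean_len - 20) / 20)
    # Content density: share of alphabetic characters, which penalizes
    # markup residue, number tables, and boilerplate symbols.
    density = sum(c.isalpha() for c in text) / len(text)
    return (diversity + structure + density) / 3
```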
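The deduplication bullet names MinHash/Jaccard similarity; below is a self-contained sketch of that technique. The 5-word shingles, 128 permutations, and 0.8 similarity threshold are illustrative defaults rather than the tool's settings, and a production pipeline would replace the all-pairs comparison with LSH banding.

```python
import hashlib
from itertools import combinations

NUM_PERM = 128      # number of hash functions (signature length)
SHINGLE_SIZE = 5    # words per shingle (assumption)
THRESHOLD = 0.8     # estimated-Jaccard cutoff for "near-duplicate"

def shingles(text: str) -> set[str]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + SHINGLE_SIZE])
            for i in range(max(1, len(words) - SHINGLE_SIZE + 1))}

def minhash_signature(shingle_set: set[str]) -> list[int]:
    """One min-hash per seeded hash function; the fraction of equal
    positions between two signatures estimates Jaccard similarity."""
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingle_set))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first document of each near-duplicate cluster."""
    sigs = [minhash_signature(shingles(d)) for d in docs]
    dropped = set()
    for i, j in combinations(range(len(docs)), 2):
        if j not in dropped and estimated_jaccard(sigs[i], sigs[j]) >= THRESHOLD:
            dropped.add(j)
    return [d for k, d in enumerate(docs) if k not in dropped]
```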
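Chunking with configurable overlap can be sketched as a sliding window: each step advances by the chunk size minus the overlap, so consecutive chunks share context. Whether the tool counts tokens or words, and its default sizes, are assumptions here.

```python
def chunk_document(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Slide a fixed-size word window over the document, stepping by
    chunk_size - overlap so consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```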
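The JSONL export is the simplest of the output formats: one JSON object per line, which fine-tuning and dataset loaders can stream. A minimal sketch; the per-record fields in the usage example ("text", "url", "quality_score", "language") are a hypothetical schema, not necessarily the tool's.

```python
import json

def export_jsonl(docs: list[dict], path: str) -> None:
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Hypothetical record layout for one curated document:
export_jsonl(
    [{"text": "Clean extracted text...", "url": "https://example.com/docs",
      "quality_score": 0.82, "language": "en"}],
    "train.jsonl",
)
```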