Launching today
SNEWPapers
The World's First AI Newspaper Archive
109 followers
I taught machines to read newspapers, gave them 250 years of data, extracted everything (6 million+ stories so far), separated the ads from the content, and categorized it all. You can search semantically or with your own AI research assistant, get the actual articles with full-text extraction, and build and share collections. As far as I know, this has never been done before: the data isn't on Google or in any LLM, only on SNEWPapers.
SNEWPapers
Hey Product Hunt! 👋
I'm excited to share SNEWPapers — the world’s first AI-powered historical newspaper archive. We’ve read and organized 6 million+ stories from 250 years of American newspapers (1730s–1960s) so you can finally explore history by meaning, not just broken keywords.
Maybe the biggest thing since sliced bread for digital humanities: historians, researchers, genealogists?
I built this after trying to research references in The Fourth Turning. Traditional archives dumped me into faded page scans with terrible search. So I created my own.
The result: clean, summarized articles and nearly perfect full-text OCR extractions + The Sleuth (your personal AI research assistant), smart categorization (24 categories / 1,000+ sub-categories), Collections for sharing, and a fun Today in History daily feed.
Quick start (10 minutes): → Tutorials
A few things I’d love your thoughts on:
Today in History — Would you actually open this daily?
Search + Sleuth — How useful is semantic search and the AI assistant for your research?
Collections — Would you use/share public collections?
Pricing: 7-day free trial. I priced it ~50% below traditional archives because we actually deliver usable, intelligent access. Product Hunt special: Use PRODUCTHUNT20 for 20% off any plan (valid until May 8).
Huge technical journey. I had to figure out how to acquire, store, and process nearly a million high-resolution newspaper images, build custom multi-modal systems to detect and segment articles, massively improve OCR on centuries-old ink, train models to understand newspaper layout and context, run prompt engineering at scale, balance cost vs. quality with LLMs and VLMs, build semantic and agentic search infrastructure that actually works on millions of documents, and scale a cost-effective GPU fleet.
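To give a flavor of what "semantic search that actually works" involves, here's a toy sketch in Python. It uses bag-of-words cosine similarity as a crude stand-in for a real embedding model; the function names and corpus are illustrative, not the actual SNEWPapers stack:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, articles: list[str], top_k: int = 3) -> list[str]:
    # Rank articles by similarity to the query, best match first.
    q = embed(query)
    return sorted(articles, key=lambda art: cosine(q, embed(art)), reverse=True)[:top_k]

articles = [
    "Gold discovered in California, thousands rush west",
    "Local bakery wins county fair pie contest",
    "Prospectors strike gold near Sutter's Mill",
]
print(search("gold rush mining", articles, top_k=2))
```

In production this shape stays the same, but `embed` becomes a learned dense-embedding model and the linear scan becomes an approximate-nearest-neighbor index, which is where the "works on millions of documents" engineering effort goes.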
Some “AWS-ish” stats so far:
115,000+ GPU GB-hours (OCR / Layouts)
26,000+ Lambda GB-hours moving data around
44.7 billion LLM/vLLM tokens processed
7 months of 80+ hour work weeks (organic neural network compute)
Would love your honest feedback and discoveries you make in the archive! 🫡 (here or hello@snewpapers.com)
Incredible scale!
You mentioned training the model to handle degraded paper and faded ink. Google famously used reCAPTCHA v1 for the same problem, having millions of users unknowingly label words from old NYT archives. How have you coped with this issue?
SNEWPapers
@oleksandr_utkin Haha, I didn't know about the reCAPTCHA story, that's very clever. There are quite a few decent open-source OCR tools out there. A lot of getting them to work right is first understanding the settings and limitations: ideal DPI for character recognition, GPU settings for aspect ratios. Then LLM and VLM tech can help clean things up, followed by human verification, which you can turn into a reinforcement loop for transfer learning on an open-weights model.
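To make the cleanup step concrete, here's a minimal post-OCR normalization pass in Python. The rules shown (long-s substitution, rejoining words hyphenated across line breaks, whitespace collapse) are common fixes for period newspaper text; the function is an illustrative sketch, not the actual pipeline:

```python
import re

def clean_ocr(text: str) -> str:
    # 18th/19th-century type often uses the long s ('ſ'), which OCR
    # tends to miss or misread; map it back to a plain 's'.
    text = text.replace("\u017f", "s")
    # Rejoin words hyphenated across line breaks ("cen-\ntury" -> "century").
    text = re.sub(r"-\s*\n\s*", "", text)
    # Collapse remaining line breaks and runs of whitespace.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "The diſcovery of the cen-\ntury was  announced"
print(clean_ocr(raw))  # -> "The discovery of the century was announced"
```

A rules pass like this handles the predictable errors cheaply; the genuinely ambiguous smudges are where the LLM/VLM cleanup and human verification earn their keep.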
@brett_shinnebarger Truly impressive work, the engineering depth here is rare to see on a launch. One practical question: how much disk space does the full archive end up taking after extraction?
SNEWPapers
@oleksandr_utkin I appreciate you saying that! For a million high-res images plus all the processing artifacts, it's about 6 TB in S3: big but manageable, and easy enough to scale to petabytes without much effort.
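A quick back-of-envelope on those figures (assuming the ~1M images and ~6 TB mentioned above):

```python
# Back-of-envelope: average footprint per newspaper page,
# including processing artifacts, given the numbers in the reply.
total_bytes = 6 * 10**12      # ~6 TB in S3
num_images = 1_000_000        # ~1M high-res page images
per_image_mb = total_bytes / num_images / 10**6
print(f"~{per_image_mb:.0f} MB per page (image + artifacts)")  # -> ~6 MB
```

Roughly 6 MB per page is plausible for a high-resolution scan plus its OCR, layout, and summary artifacts, which is why the total stays "big but manageable."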
Honestly, this is quite cool!
Do you plan to expand the newspaper libraries to other countries?
SNEWPapers
@markocki Thank you! I just opened up the "Today in History" summary page a moment ago for any authed-but-not-subscribed users (only the full extraction details are behind the subscription wall now), so feel free to check that out! There are plenty more US papers to get to first, but the UK would probably be easy as well, along with other languages that read left-to-right and use Latin character sets. It's also partly harder to do data validation when you don't know the language, but it's all possible.
minimalist phone: creating folders
Can someone else submit newspapers?
SNEWPapers
@busmark_w_nika I hadn't thought of this, but it's a good idea. If there are high res images in the public domain we could create a request process to add them to our pipeline, ideally it wouldn't be just one issue, but the whole history that we could grab, or more ideally other archives that want to use our tech could request for us to process their entire dataset and we host the data and provide SSO for their users. There's a lot of them out there... https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives#United_States
minimalist phone: creating folders
@brett_shinnebarger I think it is a difficult process though, because some newspapers need to be of better quality. But imagine having them from all the countries (the biggest database) :)
SNEWPapers
@busmark_w_nika You'd be surprised how well we can process all but the really degraded papers, i.e. bad scans, ripped pages, missing pieces, or text so faded that even a human would struggle with it. The free US papers in English alone would probably get SNEWPapers to 500 million stories or more. Other languages would be possible too, but they bring additional OCR-related and technical challenges.