Cleanly pull content from any website

#3 Product of the Week · April 06, 2016



Brian Donohue (Maker, Pro) @bthdonohue · CEO, Instapaper
Brian from Instapaper here! Over the past few years we've gotten a significant number of requests from developers for access to Instapaper's parser. Yesterday we launched Instaparser, an API to access Instapaper's parser. Instaparser is a paid service, but there's a free tier that can be used for testing or just quick weekend hacks. Personally, this is the first developer-focused product I've launched, and I'm very excited to get it out into the community and see what people will do with it.
PJ Camillieri @cam_pj · Founder
@bthdonohue This looks very interesting. I am not trying to be negative here, but I am just curious (as a potential customer): how do you guys compare to open source (and frankly: popular) solutions such as Newspaper?
Brian Donohue (Maker, Pro) @bthdonohue · CEO, Instapaper
@cam_pj Hi PJ! I'm unfamiliar with Newspaper, so I just took a look through the source code to get a feel for how they're doing the article parsing. It looks like a great tool for an open source parsing framework, and it also appears to be at least somewhat influenced by the Readability parser (similar paragraph scoring, checking sibling nodes, etc.). I think the major difference here is that, to get broad coverage across as many domains as possible, you need to implement and maintain a flexible system for domain-by-domain parser configurations. We have a dedicated support/community person who's trained to resolve parsing issues on a domain-by-domain basis when they come up, and we use a variety of signals to make sure the parser is up to date. Those signals come from the "Report a Problem" button in the Instapaper app, scheduled integration tests against our most popular domains, and recorded failures from the Instaparser API. We use a combination of those signals and domain popularity to prioritize fixes for parsing issues, both proactively and reactively. Creating an accurate parser requires constant maintenance from a dedicated team, and while I'm sure there are open source projects out there that will come up with 65%-75% accuracy, getting to 90%+ accuracy is the really tricky bit. Hope that's helpful!
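(Editor's sketch, not Instapaper's code.) To make the "domain-by-domain parser configurations" idea concrete, here is a minimal Python sketch of how such a system could be laid out: a per-domain rule is tried first, and a generic paragraph-scoring heuristic is the fallback. Everything in it is an assumption for illustration: the `DOMAIN_RULES` table, the `example.com` rule, the scoring formula, and the threshold are hypothetical, and regex-based HTML handling is used only to keep the example dependency-free.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-domain overrides: domain -> regex capturing the article body.
# A real system would use a proper HTML parser; a regex stops at the first
# closing </div>, so nested markup would break this rule.
DOMAIN_RULES = {
    "example.com": re.compile(r'<div class="story">(.*?)</div>', re.S),
}

TAG_RE = re.compile(r"<[^>]+>")
PARA_RE = re.compile(r"<p[^>]*>(.*?)</p>", re.S)


def strip_tags(html):
    """Remove tags and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub("", html)).strip()


def score_paragraph(text):
    """Toy score (an assumption): longer text and more commas look like prose."""
    return min(len(text) // 20, 3) + text.count(",")


def generic_extract(html):
    """Fallback: keep only paragraphs whose score reaches a made-up threshold."""
    paragraphs = [strip_tags(p) for p in PARA_RE.findall(html)]
    return "\n\n".join(p for p in paragraphs if score_paragraph(p) >= 2)


def extract(url, html):
    """Use the domain-specific rule when one exists and matches, else fall back."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    rule = DOMAIN_RULES.get(host)
    if rule is not None:
        match = rule.search(html)
        if match:
            return strip_tags(match.group(1))
    return generic_extract(html)
```

The point of the split is maintenance: when a site changes its markup, only that domain's entry needs fixing, while the generic path keeps covering everything without a dedicated rule.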
PJ Camillieri @cam_pj · Founder
@bthdonohue Understood. It makes sense. Like you said - the last 20% are always tricky with data extraction. Thanks for clarifying this.
Jeffrey Wyman @jeffrey_wyman · At the intersection of Tech and Business
Ben Tossell (Hunter) @bentossell · newCo
Pretty cool way to pull out the parts of an article. Go to the preview and test it out.
Dmitry Suholet @suholet · Product Manager, Yandex
Great news! I'm a huge fan of Instapaper. So, I'm very excited to see more products based on your Instaparser. Special thanks for a free tier :)
Brian Donohue (Maker, Pro) @bthdonohue · CEO, Instapaper
@suholet Thanks Dmitry! I was really impressed with the Yandex browser when it came out in 2014. I haven't used it much since, but I loved the innovations in the browser interface. Nice to have some mutual admiration! :-)
Dmitry Suholet @suholet · Product Manager, Yandex
@bthdonohue Brian, I'm really impressed that you've heard of our browser ) How did you find it?
Brian Donohue (Maker, Pro) @bthdonohue · CEO, Instapaper
@suholet I think it was this article from TNW in late 2014:
Dmitry Suholet @suholet · Product Manager, Yandex
@bthdonohue haha thanks for the link :) Meh... Russians suck at promoting their products :)
Andrew McLaughlin @mcandrew · Medium
Well done, Brian! It's a really useful service. Parsing is super-valuable for good mobile UX, and Instaparser does it speedily, cheaply, and with good documentation.