Extract content from articles, products, videos and more

#4 Product of the Day, May 20, 2015
Hi Product Hunters, Diffbot is an Artificial Intelligence company with the goal of converting the existing web into the world's largest database of structured data. Diffbot data is the backend that powers Instapaper, Bing Search, DuckDuckGo, eBay and others. You can read more about our company's vision in this article: http://www.xconomy.com/san-franc... It's humbling to see a product that has no "UI" featured here. What do you think about the current API and website?
Great product. I used it for about a year and it was really good.
"Using AI, computer vision, machine learning and natural language processing, Diffbot provides comprehensive tools to understand and extract from any web page."
That's correct. Unlike other markup-based scrapers, Diffbot won't break when the underlying markup changes. Amazing team, great product.
Thanks a lot, guys! Our robot looks at the actual page a human sees and makes its own decision about what is what (instead of following predetermined rules on the HTML code). @arush @emmanuelamber
Hello Diffbot team, this kind of product is really interesting, and I believe it will become more useful in the coming years. How would you differentiate yourself from https://www.kimonolabs.com/ ? Good luck and keep going!
Thanks @anselmetrochu. It's akin to the difference between a revolver and a machine gun. One is a manual rule creation tool--you click on parts of the page for each site you want to extract from--the other is an AI that uses a hybrid computer vision and natural language processing system on the rendered page to automatically extract structured data just given a URL. It makes getting data from 1000s of different sites, or the whole web, possible.
Hey Anselme, kimonolabs allows you to create an API to extract content from web pages. You need to configure it for each website you want data from. We also provide this functionality with our CustomAPIs (see http://www.diffbot.com/products/). We provide 2 other main functionalities: 1) automatic extraction: that is where artificial intelligence comes in. Our robot is able to understand (and extract from) article, product, and several other page types without configuration and without knowing the site in advance. 2) crawling entire sites (and applying APIs to each page discovered this way) with the lightning speed of Gigablast. This also comes with a searchAPI on collections created this way. Hope it helps! Don't hesitate to check out our website and give our testdrive a spin (e.g., http://www.diffbot.com/testdrive...) @anselmetrochu
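For readers wondering what "automatic extraction given just a URL" might look like from the caller's side, here is a minimal Python sketch. It is not taken from this thread: the base endpoint, the API names ("article", "product"), and the `token`/`url` query parameters are assumptions about the public API and should be checked against the current Diffbot documentation. The helper names `build_request_url` and `extract` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed base endpoint; verify against the current Diffbot documentation.
API_BASE = "https://api.diffbot.com/v3"


def build_request_url(api_name: str, token: str, page_url: str) -> str:
    """Build the GET URL for an automatic-extraction call.

    api_name is the page type to extract, e.g. "article" or "product"
    (assumed names). The target page URL is percent-encoded so that its
    own query string does not collide with the API's parameters.
    """
    query = urllib.parse.urlencode({"token": token, "url": page_url})
    return f"{API_BASE}/{api_name}?{query}"


def extract(api_name: str, token: str, page_url: str) -> dict:
    """Fetch and decode the JSON response.

    Requires network access and a valid token; no per-site rules are
    passed in, only the URL of the page to analyze.
    """
    request_url = build_request_url(api_name, token, page_url)
    with urllib.request.urlopen(request_url, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder token and page; this only prints the request URL.
    print(build_request_url("article", "YOUR_TOKEN", "http://example.com/post?id=1"))
```

The point of the sketch is the contrast drawn above: a rule-based tool needs a per-site configuration step before a call like this, whereas here the only inputs are a page type and a URL.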
Very promising direction! It would be amazing to see this approach extended to a broader range of websites, such as concert/sports/film schedules.
@benjiwheeler Automatic extraction of event data is on our roadmap! You can see a full list of types in this infographic: http://mashable.com/2012/08/16/t...