Extract content from articles, products, videos and more

#4 Product of the Day, May 20, 2015
Mike K Tung (Maker, Hiring) @mikektung
Hi Product Hunters, Diffbot is an Artificial Intelligence company with the goal of converting the existing web into the world's largest database of structured data. Diffbot data is the backend that powers Instapaper, Bing Search, DuckDuckGo, eBay and others. You can read more about our company's vision in this article: http://www.xconomy.com/san-franc... It's humbling to see a product that has no "UI" featured here. What do you think about the current API and website?
Yuval Shoshan @yuvals · Building @parrotread
Great product. I used it for about a year and it was really good.
Emmanuel Amberber (Hunter) @emmanuelamber · Data & Product Head | @YSProfiles
"Using AI, computer vision, machine learning and natural language processing, Diffbot provides comprehensive tools to understand and extract from any web page."
Arush (Hiring) @arush · Co-founder, Try.com
That's correct. Unlike other markup-based scrapers, Diffbot won't break when the underlying markup changes. Amazing team, great product.
Emmanuel Charon (Maker, Hiring) @emmanuelcharon · ML Engineer & Games craftsman
Thanks a lot guys! Our robot looks at the actual page a human sees and makes its own decisions about what is what (instead of following predetermined rules on the HTML code). @arush @emmanuelamber
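To make the point about markup-based scrapers concrete, here is a minimal sketch of a rule-based extractor that breaks when a site renames a CSS class. Everything in it is hypothetical and for illustration only: the class names `post-title` and `headline`, the sample pages, and the `extract_title` helper are assumptions, not anything taken from Diffbot or Kimono.

```python
from html.parser import HTMLParser


class ClassRuleExtractor(HTMLParser):
    """Collects text inside elements whose class list contains target_class.

    Simplification: void elements such as <br> inside the matched element
    are not handled, which is fine for the sample pages below.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matched element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_title(html, target_class="post-title"):
    """Hypothetical hand-written rule: the title lives in class 'post-title'."""
    parser = ClassRuleExtractor(target_class)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


# The same page before and after a redesign that only renames the class.
OLD_PAGE = '<html><body><h1 class="post-title">Hello Web</h1><p>Body</p></body></html>'
NEW_PAGE = '<html><body><h1 class="headline">Hello Web</h1><p>Body</p></body></html>'

print(extract_title(OLD_PAGE))  # the rule matches
print(extract_title(NEW_PAGE))  # the rule silently finds nothing
```

A human reader sees the same headline on both pages, which is the gap an approach based on the rendered page is meant to close.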
Anselme @anselmetrochu · Co-founder of Ruzzit
Hello Diffbot team, products like this are really interesting, and I believe they will become more useful in the coming years. How would you differentiate yourself from https://www.kimonolabs.com/ ? Good luck and keep going!
Mike K Tung (Maker, Hiring) @mikektung
Thanks @anselmetrochu. It's akin to the difference between a revolver and a machine gun. One is a manual rule-creation tool--you click on parts of the page for each site you want to extract from--the other is an AI that uses a hybrid computer vision and natural language processing system on the rendered page to automatically extract structured data given only a URL. It makes getting data from 1000s of different sites, or the whole web, possible.
Emmanuel Charon (Maker, Hiring) @emmanuelcharon · ML Engineer & Games craftsman
Hey Anselme, kimonolabs allows you to create an API to extract content from web pages. You need to configure it for each website you want data from. We also provide this functionality with our CustomAPIs (cf. http://www.diffbot.com/products/). We provide 2 other main functionalities: 1) automatic extraction: that is where artificial intelligence comes in. Our robot is able to understand (and extract from) article, product, and several other page types without configuration and without knowing the site in advance. 2) crawling entire sites (and applying APIs to each page discovered this way) with the lightning speed of Gigablast. This also comes with a searchAPI on collections created this way. Hope it helps! Don't hesitate to check out our website and give our testdrive a spin (e.g. http://www.diffbot.com/testdrive...) @anselmetrochu
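As a rough illustration of what "given only a URL" could look like from the caller's side, the sketch below builds a request URL for an automatic extraction call. The base URL, the `article` and `product` path segments, and the `token` and `url` query parameters are assumptions modeled on a v3-style REST endpoint; none of them are stated in this thread, so check Diffbot's current documentation before relying on them. `build_extract_url` is a hypothetical helper and makes no network call.

```python
from urllib.parse import urlencode

# Assumed base URL for illustration; not confirmed by this thread.
API_BASE = "https://api.diffbot.com/v3"


def build_extract_url(page_url, token, page_type="article"):
    """Return a request URL asking for automatic extraction of page_url.

    page_type is assumed to select the extractor (e.g. "article" or
    "product"); no per-site rules or selectors are passed.
    """
    query = urlencode({"token": token, "url": page_url})
    return f"{API_BASE}/{page_type}?{query}"


# Usage: the only site-specific input is the page URL itself.
request_url = build_extract_url("http://example.com/post?id=1", "TEST_TOKEN")
print(request_url)
```

The point of the sketch is the shape of the call: the target page's address is the only per-site input, in contrast to a per-site rule configuration.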
Benjamin Wheeler @benjamin_wheeler · Teaching and making tech. Indie hacker.
Very promising direction! It would be amazing to see this approach extended to a broader range of websites, such as concert, sports, and film schedules.
Mike K Tung (Maker, Hiring) @mikektung
@benjiwheeler Automatic extraction of event data is on our roadmap! You can see a full list of types in this infographic: http://mashable.com/2012/08/16/t...