Scraping public data from the web, transforming it, and using it for a new product can become a very successful business.
What kind of web scraping projects have you worked on and which tools did you use?
I never finished it, but I started a Strava scraping project. I think there's a ton of super interesting data in there, although I did it for interest's sake rather than to monetise it.
And yep, like @berthakgokong says - Python, Beautiful Soup, etc.
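For anyone new to that stack, here's a minimal sketch of the Python + Beautiful Soup approach. The HTML is inlined and the tag/class names are invented for illustration; a real scraper would first download pages (e.g. with the `requests` library):

```python
from bs4 import BeautifulSoup

# Inline sample standing in for a fetched page; a real scraper would
# download the HTML first (e.g. with the requests library).
html = """
<ul id="activities">
  <li class="activity"><span class="name">Morning Run</span>
      <span class="distance">5.2</span></li>
  <li class="activity"><span class="name">Evening Ride</span>
      <span class="distance">20.1</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pull out each activity's name and distance.
activities = [
    {
        "name": li.select_one(".name").get_text(strip=True),
        "distance_km": float(li.select_one(".distance").get_text(strip=True)),
    }
    for li in soup.select("li.activity")
]

print(activities)
```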
@berthakgokong @nik_hazell Also pretty cool. I think collecting data for a while and figuring out what to do with it later isn't a bad idea either. The value of data in general is only going to rise. Have you tried Puppeteer?
@berthakgokong @nik_hazell You should check it out. The usability is pretty good, especially if you use it with TypeScript. It's based on Chromium.
All in all it has some quirks when controlling a headless browser engine, but I think that's not the fault of Puppeteer itself.
I had a website that scraped automotive listings and looked at the year, model, mileage, options, and price to determine whether a car was a good deal (this was before everyone was doing it).
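The scoring logic behind a site like that can be sketched in a few lines. The depreciation rate, mileage penalty, and margin below are hypothetical placeholders, not the actual model the site used:

```python
# Hypothetical deal-scoring heuristic: compare a listing's price to a
# rough expected price derived from the car's age and mileage.
def expected_price(base_price, year, mileage, current_year=2024):
    age = current_year - year
    # Assume ~8% depreciation per year and a small penalty per km driven.
    depreciated = base_price * (0.92 ** age)
    return depreciated - 0.02 * mileage

def is_good_deal(listing, base_price, margin=0.10):
    # Flag a listing priced at least `margin` below the expected price.
    fair = expected_price(base_price, listing["year"], listing["mileage"])
    return listing["price"] <= fair * (1 - margin)

listing = {"year": 2019, "mileage": 60_000, "price": 15_000}
print(is_good_deal(listing, base_price=30_000))  # prints True
```

A real version would fit the expected-price curve from the scraped listings themselves rather than hard-coding the constants.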
I found the whole process of scraping messy and a bit shady (listing sites really want to protect their data), so I eventually abandoned it. Data ownership is a very messy subject that I decided to avoid completely.
Decided to build a CMS instead - no reliance on external data :)
It's currently in private release, and I think it offers quite a few features that set it apart from the competition.
My CMS is a SaaS platform built with Vue/Nuxt and MongoDB. I'm still ramping up but there's a bit of information on my website (check out the docs) at https://shustudios.com
I'm currently looking for a few beta testers.
@david_gregorian Yes, it is! It uses a REST API, but you can define the endpoints yourself in the CMS, as well as the data each one should return. In my opinion, that gives you the best of both worlds between a REST API and a GraphQL API.
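As a rough illustration of the idea (this shape is made up for this post, not the CMS's actual schema), a user-defined endpoint could be described with something like:

```json
{
  "endpoint": "/api/products",
  "method": "GET",
  "collection": "products",
  "fields": ["name", "price", "inStock"],
  "filter": { "inStock": true }
}
```

The client then gets only the fields the endpoint was configured to return, much like a GraphQL selection set, but over plain REST.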
Funny thing, I scraped the "Top Most Upvoted Products" using Bardeen.ai (our tool). It worked really nicely.
BUT I wanted to figure out which month is the best to launch, and it turns out they haven't updated that page, so now I've got to scrape all the products.
Let's see where this takes me.
Some Projects – LinkedIn, Salesforce (AppExchange), GitHub, Amazon, Food Inspection Scores (Texas), Google, Government Data Sets, Craigslist, Library, lots of sites...
Tools (that I like) – ScrapeStorm, Import.io, ParseHub, Octoparse, Scrapy, RPA tools (UiPath, Automation Anywhere, etc.), Selenium, CLI (wget, curl, shell scripts)...
Tools vary depending upon the task – I haven't found one tool that I can consistently use for everything.
I scrape local websites from various countries. I use Python as the programming language with the BeautifulSoup library, which is really easy, and https://scrape.do as the proxy gateway. (I couldn't get the job done without it, since the local websites I scrape usually require local residential IPs.)
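Routing traffic through a proxy gateway can be sketched with Python's standard library. The proxy URL and credentials below are placeholders for illustration, not a real scrape.do endpoint:

```python
import urllib.request

def make_proxy_opener(proxy_url):
    """Build a urllib opener that routes HTTP(S) traffic through a proxy."""
    handler = urllib.request.ProxyHandler({
        "http": proxy_url,
        "https": proxy_url,
    })
    return urllib.request.build_opener(handler)

# Placeholder credentials/host for illustration only.
opener = make_proxy_opener("http://user:token@proxy.example.com:8080")

# A real scraper would then fetch pages through the proxy, e.g.:
# html = opener.open("https://example.com/listing").read()
```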
I've been working in web scraping for almost 10 years. In terms of the volume of data scraped, the most demand we've seen is from the e-commerce industry. The common use cases are price monitoring, competitive intelligence, reputation monitoring, etc. Another hot use case is extracting data from LinkedIn.
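Price monitoring, the most common of those use cases, ultimately boils down to diffing scraped snapshots over time. A minimal sketch (the product names and prices are invented for illustration):

```python
# Hypothetical price-monitoring step: compare today's scraped prices
# against yesterday's snapshot and report what changed.
def price_changes(previous, current):
    changes = {}
    for product, price in current.items():
        old = previous.get(product)
        if old is not None and old != price:
            changes[product] = {"old": old, "new": price}
    return changes

yesterday = {"widget-a": 19.99, "widget-b": 5.49}
today = {"widget-a": 17.99, "widget-b": 5.49, "widget-c": 3.00}

print(price_changes(yesterday, today))
# prints {'widget-a': {'old': 19.99, 'new': 17.99}}
```

In production the snapshots would come from scheduled scrape runs stored in a database, with alerting on the diff.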
If I had to list the use cases our data scraping has supported, it would be more than 100 very different use cases across 20+ industries. Initially, we started with Python frameworks like Scrapy and then built our own tools internally.
I'm the founder of Datahut (https://datahut.co/), a web-scraped-data-as-a-service provider.