Katherine Strickland · Jun 4, 2020
ParseHub is a free, easy-to-use web scraping tool. Once it is downloaded as an application onto a desktop, data can be extracted from websites for analysis.
Why use a web scraping tool, rather than manual methods of web scraping?
- Speed. These tools are much faster to set up than traditional, hand-written web scraping, where inspecting the page for every selector can grow tedious and difficult (for what that manual version looks like, see the short sketch after this list).
- Visual, user-friendly interface. For instance, ParseHub brings up the website itself and highlights the relevant places on the page, so you can see exactly what it is “seeing.” When you select an element to be scraped, a box appears over the text. This makes it easy to follow what’s going on in the program, rather than tracing long lines of code.
- Easy to follow command list. Each command can be added to, modified as needed, and deleted. They can also be renamed, simply by clicking on the word and typing the new name.
- Little to no coding knowledge is necessary. This is great for small businesses who may want web-scraped information without having to hire a professional, researchers who are trying to identify trends without much coding knowledge, or anyone who wants to obtain data to explore.
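For contrast, here is a minimal sketch of the manual approach that a tool like ParseHub replaces, written in Python with requests and BeautifulSoup. The URL and CSS class here are hypothetical placeholders, not anything from the actual walkthrough below.

```python
# A minimal hand-written scrape, for comparison.
# The URL and the .product-name class are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Every selector must be discovered by inspecting the page source,
# which is the tedious part a visual tool automates.
for name in soup.select(".product-name"):
    print(name.get_text(strip=True))
```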
One of the many great things about ParseHub is the tutorial that begins automatically the first time you open the application. This saves a first-time user the trouble of figuring out the interface on their own, and provides a sample website to scrape data from as the walkthrough continues. Though the interface is fairly intuitive, some of the terms must be learned in order to successfully navigate different pages to scrape data. At the end of the tutorial, you can actually run the project, which yields data in an easy-to-download format.
Though there’s a lot to be learned from the tutorial, it’s often helpful to see an additional project walkthrough on a more realistic website. The website for this walkthrough will be Bloomist, an e-commerce site. Our goal is to extract the name, link, price, and review count for each product, and organize them into a dataframe for later analysis.
The first step is to open a new project. After clicking on the new project button, ParseHub asks for the website that it will be scraping, and the URL can be copied in. In ParseHub’s main window, the website will appear.
Commands in ParseHub are very visual: indicating what data to scrape from a website simply involves hovering over the element and clicking on it. To make sure it is “seeing” the correct data, ParseHub asks the user to confirm the pattern by selecting a second element on the page that fits it. For instance, if I wanted to scrape all of the product names from the site, I would click on two product names before ParseHub recognizes the rest of the pattern. Once it does, green boxes appear over everything on the website that matches that pattern.
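As a rough sketch of what that generalization amounts to: the two clicked examples share some structure (often a CSS class), and the pattern is whatever matches that shared structure everywhere on the page. The markup and class names below are hypothetical, not Bloomist’s actual HTML.

```python
# Intuition for the two-click pattern confirmation, with made-up markup.
from bs4 import BeautifulSoup

html = """
<div class="card"><h3 class="product-name">Terracotta Pot</h3></div>
<div class="card"><h3 class="product-name">Linen Throw</h3></div>
<div class="card"><h3 class="product-name">Dried Eucalyptus</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Two "clicks" give two examples; the classes they share define a pattern.
first, second = soup.select("h3")[:2]
pattern = "." + ".".join(set(first["class"]) & set(second["class"]))

# The pattern now matches every product name on the page (the green
# boxes): ['Terracotta Pot', 'Linen Throw', 'Dried Eucalyptus']
print([el.get_text() for el in soup.select(pattern)])
```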
From there, we want to select a new element for the web scraper to identify: in our case, the price. It’s important to link the product name and the price together so the data stays clear, and ParseHub makes this easy with the Relative Select tool. Click one of the green product-name boxes, then the price below it, and the two elements are linked. After doing this by hand twice, the pattern is again identified and applied throughout the entire page.
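In code terms, Relative Select is a scoping rule: the price is looked up inside the same container as the product name, so each pair stays linked. A sketch with the same hypothetical markup:

```python
# Name-price pairing scoped to each product card (hypothetical markup).
from bs4 import BeautifulSoup

html = """
<div class="card"><h3 class="product-name">Terracotta Pot</h3>
  <span class="price">$24</span></div>
<div class="card"><h3 class="product-name">Linen Throw</h3>
  <span class="price">$58</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select(".card"):
    # Searching only within the card is the "relative" part: the price
    # found here can only belong to this card's product.
    rows.append({
        "name": card.select_one(".product-name").get_text(strip=True),
        "price": card.select_one(".price").get_text(strip=True),
    })
print(rows)  # [{'name': 'Terracotta Pot', 'price': '$24'}, ...]
```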
While this selecting is going on, another visual tool in the ParseHub interface is updating: a preview of the dataframe that grows as different items are selected. ParseHub automatically adds each product’s link to this preview, along with the price we indicated earlier. The preview is helpful because it offers a glimpse of what the finished product, the data itself, will look like. Only the first few rows are shown unless a box is checked to include more; showing a preview of too much data can slow the application down, so sticking with the first few rows is usually a good idea.
Another thing we wanted to find out from this website was the number of reviews each product has. This isn’t something we can see on the main page of the website — it requires actually navigating to each product’s page and scrolling down to see the review number. A tedious task without a web scraping tool, but this is not hard for ParseHub to do!
The “Click” command tells ParseHub to navigate to a new page and start a new template to gather data from that page. On a product page, creating commands works the same way as it did on the main page, and ParseHub automatically applies the template to each product’s page.
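A hand-written equivalent of that Click-plus-template flow might look like the sketch below: follow each product’s link, then run the same extraction on the page it opens. The URL and all selectors are hypothetical.

```python
# Follow each product link and pull the review count from its page.
# The URL and all selectors are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START = "https://example.com/products"
listing = BeautifulSoup(requests.get(START, timeout=10).text, "html.parser")

for link in listing.select("a.product-link"):
    # The "Click": fetch the page that this product link opens...
    detail_url = urljoin(START, link["href"])
    detail = BeautifulSoup(
        requests.get(detail_url, timeout=10).text, "html.parser")
    # ...then apply the detail-page "template" to it.
    reviews = detail.select_one(".review-count")
    print(link.get_text(strip=True),
          reviews.get_text(strip=True) if reviews else "no reviews listed")
```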
Let’s say there are too many products to fit on one page, so the site has broken them up into multiple pages. We can still easily get the data from every page by selecting the next-page button and indicating that the same commands should continue on each new page of results. This means that a lot of products can be found with only one run.
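That next-page selection is essentially a loop: scrape the page, look for a next link, and repeat until there isn’t one. A sketch, again with a hypothetical URL and selectors:

```python
# Pagination as a loop (hypothetical URL and selectors).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/products"
while url:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for name in soup.select(".product-name"):
        print(name.get_text(strip=True))
    # Continue the same commands on the next page of results, if any.
    next_link = soup.select_one("a.next-page")
    url = urljoin(url, next_link["href"]) if next_link else None
```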
Previous commands can be viewed and altered easily if any changes are necessary. They can also be renamed and re-organized in the project’s full command list.
By indentation and color, the command list makes it clear exactly what ParseHub is collecting, in what order, and from which page. When the complete command list and data preview make it seem like all the data will be found, a test run is a good way to see the scraping in action. To do this, click on the green “Get Data” button and select “Test Run” from the options. The commands are highlighted as data is collected, so if something goes wrong, it is easy to tell which command is the problem. It also saves you from running the full scraper multiple times, which can take a while if there’s a lot of data.
After a test run has been done (or if you’re feeling bold), clicking the “Run” button runs the project on ParseHub’s servers. Though it can take a few minutes to complete, you can then download the website’s data in CSV or JSON format for further analysis.
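From there, picking the export back up for analysis takes just a few lines. This sketch assumes the CSV was saved as run_results.csv; the filename and columns are whatever your own project produced.

```python
# Load the downloaded export into a dataframe for analysis.
import pandas as pd

df = pd.read_csv("run_results.csv")  # hypothetical filename

# Expect one row per product: name, link, price, review count.
print(df.shape)
print(df.head())
```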
Overall, this web scraping tool was easy to use and effective! For a visual person like myself, seeing what ParseHub was “seeing” in real time was very helpful in understanding the data structure and creating a working program.