Recently I was asked to help scrape company information from the Yellow Pages website using the ScreenScraper Chrome extension. After working with this simple scraper, I decided to write a tutorial on how to use the extension to scrape similar pages. Hopefully, it will be useful to many of you.
1. Install the Chrome Extension
You can get the extension here. After installation you should see a small monitor icon in the top right corner of your Chrome browser.
2. Open the source page
Let’s open the page from which you want to scrape the company information:
3. Determine the parent element (row)
The first thing you need to do is determine which HTML element will be the parent element. A parent element is the smallest HTML element that contains all the information items you need to scrape (in our case Company Name, Company Address and Contact Phone). In effect, each parent element becomes one data row in the resulting table.
To determine it, open Google Chrome Developer Tools (by pressing Ctrl+Shift+I), click the magnifying glass (at the bottom of the window) and select the parent element on the page. I selected this one:
As soon as you have selected it, look into the developer tools window and you will see the HTML code related to this element:
As the highlighted HTML line shows, you can easily identify the parent element by its class: listingInfoAndLogo.
4. Determine the information elements (columns)
Once you know how to determine the parent element, it is easy to specify the information elements that contain the data you want to scrape (they represent columns in the resulting table).
Do this the same way you did for the parent element, by selecting the element on the page:
and looking at the highlighted HTML code below:
As you can see, the company name is identified by the businessName class.
5. Tune the ScreenScraper itself
After all the data elements you want to scrape are found, open the ScreenScraper by clicking the small monitor icon in the top-right corner of your browser. Then do the following:
- Enter the parent element class name (listingInfoAndLogo in our case) into the Selector field, preceding it with a dot (in CSS selector syntax, a leading dot means "match by class name")
- Click the Add Column button
- Enter a field name (any name you like) into the Field text box
- Enter the information item class into the Selector text box, preceding it with a dot
- Repeat the previous three steps for each information item you want to scrape
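To make the row/column idea concrete, here is a minimal Python sketch of what such a configuration amounts to: find every parent element, then look up each column's class inside it. This is not how the extension itself is implemented; it only mimics the idea using the standard library. The sample markup, the `address` and `phone` class names, and the `scrape` helper are made up for illustration; only `listingInfoAndLogo` and `businessName` come from the page discussed here.

```python
from html.parser import HTMLParser

# Hypothetical markup; the real Yellow Pages HTML is not reproduced here.
SAMPLE_HTML = """
<div class="listingInfoAndLogo">
  <h3 class="businessName">Acme Plumbing</h3>
  <span class="address">1 Main St</span>
  <span class="phone">555-0100</span>
</div>
<div class="listingInfoAndLogo">
  <h3 class="businessName">Bolt Electric</h3>
  <span class="address">2 Oak Ave</span>
  <span class="phone">555-0101</span>
</div>
"""

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A very small element tree node: classes plus ordered children."""

    def __init__(self, classes=()):
        self.classes = set(classes)
        self.children = []  # child Nodes and text strings, in document order

    def text(self):
        # Join all nested text and collapse runs of whitespace.
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(" ".join(parts).split())

    def find_all(self, class_name):
        # Depth-first search for descendants carrying the given class.
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if class_name in child.classes:
                    found.append(child)
                found.extend(child.find_all(class_name))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes reasonably well-formed markup."""

    def __init__(self):
        super().__init__()
        self.root = Node()
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node((dict(attrs).get("class") or "").split())
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def scrape(html, row_selector, columns):
    """Return one dict per parent element; selectors look like '.className'."""
    builder = TreeBuilder()
    builder.feed(html)
    rows = []
    for parent in builder.root.find_all(row_selector.lstrip(".")):
        row = {}
        for field, selector in columns.items():
            matches = parent.find_all(selector.lstrip("."))
            # Use the first match inside this parent, or an empty cell.
            row[field] = matches[0].text() if matches else ""
        rows.append(row)
    return rows


rows = scrape(
    SAMPLE_HTML,
    ".listingInfoAndLogo",
    {"Name": ".businessName", "Address": ".address", "Phone": ".phone"},
)
for row in rows:
    print(row)
```

The sketch only understands single-class selectors such as `.businessName`, which is all this tutorial needs.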
After you enter all these definitions you should see the preview of the scraped data at the bottom of the extension’s window:
If the result is satisfactory you can download it in JSON or CSV format by pressing the corresponding button.
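If you have not worked with these two formats before, the following Python sketch shows roughly how the same rows look in each. The sample rows and the `to_json` and `to_csv` helpers are hypothetical, and the exact layout of the files the extension produces may differ.

```python
import csv
import io
import json

# Made-up rows standing in for the scraped preview table.
rows = [
    {"Name": "Acme Plumbing", "Phone": "555-0100"},
    {"Name": "Bolt Electric", "Phone": "555-0101"},
]


def to_json(rows):
    # JSON keeps each row as an object with field names as keys.
    return json.dumps(rows, indent=2)


def to_csv(rows):
    # CSV writes the field names once as a header, then one line per row.
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json(rows))
print(to_csv(rows))
```

CSV is usually the more convenient choice for opening the data in a spreadsheet, while JSON is easier to feed into another program.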
That’s it! I hope the tutorial is clear enough. But if not, feel free to write your comments below and I’ll give additional explanations.
Have a nice day!