Screen Scraper

Screen Scraper is a solid, classic tool for scraping, extracting and packaging all kinds of data. However, it takes time to master properly.

Overview

Screen Scraper is a full development kit for scraping, analyzing and saving data in midsize or large-scale projects. It extracts data using a Regex-based approach. Screen Scraper comes in 3 editions: Basic, Professional and Enterprise. The Basic edition is free, while the other two provide powerful API tools for data extraction and system integration. In our use, Screen Scraper proved to be stable, comprehensive software, able to handle all kinds of tasks. The 7 detailed tutorials clearly show how to construct a data extraction project.

Characteristics

Usability
Functionality
Easy to learn
Customer support: email, phone (Enterprise edition)
Price: $412 (Professional), $2099 (Enterprise)
Trial version/Free version: Basic edition is free
OS (Specifications): Win, Mac, Unix/Linux
Data export formats: text, HTML, SQL script file, MySQL script file, XML file, HTTP submit form (if additionally programmed)
Multi-thread: yes (up to 20)
API: yes, for Professional and Enterprise versions
Scheduling: through Windows Task Scheduler

Workflow

Screen Scraper uses a built-in proxy to capture the pages on which its extraction patterns are defined. First, configure your browser's proxy so that the traffic flows through Screen Scraper (see tutorial 1). Then turn the proxy on, and create and start a new proxy session in Screen Scraper:

[Figure: Proxy setting]

After I turned the proxy on, Google Chrome issued a confusing stream of warnings about the certificates arriving from the web. When I turned the proxy off, everything went smoothly again. The cause is that HTTPS relies on encryption certificates, and web browsers raise an alert when a certificate is not properly signed. Screen Scraper, acting as a proxy, evidently cannot present a properly signed certificate, so browsers often reject HTTPS connections or ask for permission while Screen Scraper is on as a proxy. For sites that use HTTPS, the Opera browser is recommended when working through the proxy. During the proxy session, Screen Scraper downloads the main page content and all sub-links. After that, turn the proxy off; once it is off, there should be no more trouble with HTTPS.
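The same routing idea applies to a script as well as a browser. The snippet below is a minimal sketch in plain Python, not part of Screen Scraper: the helper name `build_proxied_opener` is hypothetical, and the address `127.0.0.1:8777` is an assumption that should be replaced with whatever address and port your proxy session actually listens on.

```python
import urllib.request

# Assumed address of the local proxy session; adjust to your own setup.
PROXY_URL = "http://127.0.0.1:8777"


def build_proxied_opener(proxy_url: str = PROXY_URL) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Building the opener does not contact the network; a request only goes out
# when opener.open(...) is called, and it needs the proxy to be running.
opener = build_proxied_opener()
```

With the proxy session running, a call such as `opener.open("http://example.com/")` would pass through it (the URL is a placeholder). HTTPS requests sent this way are likely to run into the same certificate checks described above.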

The following picture shows the HTTP transactions going through the proxy and being stored:

[Figure: Proxy session]

Next, you need to create a scraping session. The transactions captured during the proxy session are used to generate scrapeable files, which become part of the scraping session. The scraping session requests the web pages and applies patterns to the data:

[Figure: Scrapeable file]

Scrapeable files are just containers that hold links to the web data, the extraction patterns, and the scripts that run before or after them. Scrapeable files are launched during the scraping session (mainly in sequence). The target page is requested by its link, and the program applies the pre-set extraction patterns and the preceding or following scripts. To make an extraction pattern, search through the HTML shown on the “Last Response” tab of the scrapeable file, using the “Find” tool. After selecting the area for the pattern, right-click and choose the “generate the extraction pattern” prompt:
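To make the container idea concrete, here is a small conceptual model in plain Python. It is only a sketch of the description above and not Screen Scraper's internal design: the class `ScrapeableFile`, the function `run_session` and the `fetch` callback are all hypothetical names, and the page content is kept in memory so nothing is downloaded.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class ScrapeableFile:
    """Hypothetical container: a link, its patterns, and scripts run per match."""
    url: str
    patterns: List[re.Pattern] = field(default_factory=list)
    scripts_after_match: List[Callable[[Record], None]] = field(default_factory=list)


def run_session(files: List[ScrapeableFile], fetch: Callable[[str], str]) -> List[Record]:
    """Process the files in sequence: fetch, apply patterns, run scripts per match."""
    records: List[Record] = []
    for scrapeable in files:
        html = fetch(scrapeable.url)
        for pattern in scrapeable.patterns:
            for match in pattern.finditer(html):
                record = match.groupdict()
                for script in scrapeable.scripts_after_match:
                    script(record)
                records.append(record)
    return records


# In-memory stand-in for the web, so the example needs no network access.
pages = {"http://example.com/list": "<li>Alpha</li><li>Beta</li>"}
seen: List[Record] = []
session_files = [
    ScrapeableFile(
        url="http://example.com/list",
        patterns=[re.compile(r"<li>(?P<NAME>[^<]*)</li>")],
        scripts_after_match=[seen.append],
    )
]
records = run_session(session_files, fetch=pages.__getitem__)
```

Passing `fetch` in as a parameter keeps the sequencing logic separate from how pages are actually retrieved.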

[Figure: Generate pattern]

As the picture shows, Screen Scraper provides convenient tools for setting up extraction patterns and defining the extraction tokens within them, and each pattern can be tested right there. If you are skilled in Regex, you can write your own regular expressions for the tokens:

[Figure: Data token setting & pattern testing results]
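The relationship between an extraction pattern, its tokens and the underlying Regex can be sketched as follows. This is an approximation in plain Python, not the tool's own implementation: the `~@NAME@~` token notation is assumed here, the helper `pattern_to_regex` is hypothetical, and the default token expression `[^<>]*` is just one reasonable choice.

```python
import re

# Assumed token notation: ~@NAME@~ marks a piece of data to capture.
TOKEN = re.compile(r"~@([A-Za-z_]\w*)@~")


def pattern_to_regex(pattern: str, token_regex: str = r"[^<>]*") -> re.Pattern:
    """Turn literal text with ~@NAME@~ tokens into a regex with named groups."""
    parts = []
    position = 0
    for token in TOKEN.finditer(pattern):
        # Text between tokens must match literally, so escape it.
        parts.append(re.escape(pattern[position:token.start()]))
        # Each token becomes a named group using the token's own regex.
        parts.append(f"(?P<{token.group(1)}>{token_regex})")
        position = token.end()
    parts.append(re.escape(pattern[position:]))
    return re.compile("".join(parts))


html = '<td class="name">Blue Widget</td><td class="price">$9.99</td>'
regex = pattern_to_regex(
    '<td class="name">~@TITLE@~</td><td class="price">~@PRICE@~</td>'
)
match = regex.search(html)
print(match.groupdict())  # {'TITLE': 'Blue Widget', 'PRICE': '$9.99'}
```

Note that in this sketch the literal parts, including whitespace, must match the page exactly, and a more specific token expression (for example `\$\d+\.\d{2}` for a price) narrows what a token is allowed to capture.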

After setting up the patterns and tokens, you need to save the data to a file or some other destination. Scripts serve this purpose. They may run after a pattern matches during the session, run before the session to set up session variables, or be invoked at other times. Here is a simple output script in Interpreted Java (Screen Scraper also supports JavaScript and Python for writing scripts):

[Figure: Output script]
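As a rough illustration of what an output step does, here is a plain Python sketch that appends extracted records to a CSV file. It is not the Interpreted Java script from the screenshot and does not use Screen Scraper's scripting objects: the function `append_records`, the file name and the column names are all hypothetical, and inside the tool the values would come from the session rather than from a hand-written list.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List


def append_records(path: Path, records: Iterable[Dict[str, str]], columns: List[str]) -> int:
    """Append records to a CSV file, writing the header only when the file is new."""
    is_new = not path.exists() or path.stat().st_size == 0
    written = 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        # Keys not listed in columns are ignored; missing keys are left empty.
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        for record in records:
            writer.writerow(record)
            written += 1
    return written
```

A call such as `append_records(Path("products.csv"), [{"TITLE": "Blue Widget", "PRICE": "$9.99"}], ["TITLE", "PRICE"])` could be made once per pattern match. Appending keeps earlier rows intact when the script runs repeatedly during a session.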

Summary

Screen Scraper is a good tool for all kinds of data mining, including dynamic pages. Yet an inexperienced user needs a lot of time to master its techniques, so the quickest route may be to call upon a scraping guru. Importantly, Screen Scraper can integrate with other systems, and its Java support lets you write serious scripts for a large-scale program.