In this post we share the practical implementation (code) of the Xing companies scraping project, built with Node.js, Puppeteer and the Apify SDK. The first post, describing the project objectives, algorithm and results, is available here.

You can review the scraping algorithm here. The project files are available here.

Start Apify

To start an Apify actor locally, the quickest way is to bootstrap it with the apify create command, provided you have the Apify CLI installed. After that you can run git init in the project folder. For a quick start, refer here.

The main.js file structure might be summed up as follows:

  1. Init global vars
  2. Apify.main(async () => {
    • fetch from input
    • init settings
    • check deactivated accounts
    • compile search urls
    • add previously failed urls to the queue
    • const crawler = new Apify.PuppeteerCrawler({
      • crawler settings
      • save each found url into dataset

      });

    • await crawler.run(); // launch crawler
    • get results from datasets

    });
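
A simplified sketch of that skeleton is shown below; the dataset name and the parts marked with comments are placeholders rather than the project's exact code.

    const Apify = require('apify');

    // 1. Init global vars (illustrative)
    let input;

    Apify.main(async () => {
        // fetch from input
        input = await Apify.getInput();

        // init settings: request queue and datasets
        const requestQueue = await Apify.openRequestQueue();
        const dataset = await Apify.openDataset('companies'); // name is illustrative

        // check deactivated accounts, compile search urls,
        // add previously failed urls to the queue (covered below)

        const crawler = new Apify.PuppeteerCrawler({
            requestQueue,
            // crawler settings (concurrency, timeouts, gotoFunction with login, ...)
            maxConcurrency: 1,
            handlePageFunction: async ({ request, page }) => {
                // save each found url into the dataset
                // ...
            },
        });

        await crawler.run(); // launch crawler

        // get results from datasets
        const { items } = await dataset.getData();
        console.log(`Got ${items.length} items.`);
    });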

PuppeteerCrawler

The Apify SDK PuppeteerCrawler was chosen as the main engine of the scraping code. The code runs with a given INPUT and finishes when the request queue is exhausted.

Input is taken from an INPUT.json file. The approximate content of the input file looks like the following:
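
The original INPUT.json is not reproduced here; the sketch below only illustrates the fields discussed in this post (accounts, account_index, page_handle_max_wait_time and the crawl sub-parameters), with placeholder values.

    {
        "accounts": [
            { "login": "user1@example.com", "password": "..." },
            { "login": "user2@example.com", "password": "..." }
        ],
        "account_index": 0,
        "page_handle_max_wait_time": 10000,
        "crawl": {
            "country": "DE",
            "landern_with_letters": [2951839],
            "landern_only": [2951839],
            "empl_range": 4,
            "letters": ["aA", "aB", "aH"]
        }
    }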

Most parameters in the INPUT are self-explanatory.
The page_handle_max_wait_time parameter sets the upper bound of a random delay in the crawl process, to make sure the script does not hammer the site.
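
Inside handlePageFunction this can be a simple random pause, for example (a sketch; it assumes the parsed INPUT is available as input and Apify is required):

    // random pause of up to page_handle_max_wait_time milliseconds
    await Apify.utils.sleep(Math.random() * input.page_handle_max_wait_time);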

The account_index parameter sets the initial index of the account to use from the INPUT accounts list.
The crawl parameter in the INPUT file holds the sub-parameters needed for compiling the complex search urls at run time. Those urls are built from the following:

  • country index (country, landern_with_letters, landern_only)
  • employee range (empl_range)
  • keywords (letters)
Note that for each xing category I run separate script instances with different employee ranges.

Urls inside the crawler

The crawler handles 2 kinds of urls:

  1. Search url. One may call it a crawl url, as opposed to a scraping url. This is a service url that enables the crawler to gather as many page urls as possible. Such a url is compiled [artificially] from xing's request filter parameters; the parameter values are set in INPUT.json. Example parameters:

    filter.size[]=4
    filter.location[]=2951839
    keywords=aH

    A mix of various GET parameters composes a unique search url. A search url might look like the following:
    https://www.xing.com/search/companies?filter.location%5B%5D=2951839&filter.size%5B%5D=4&keywords=aH
    With different parameter values, such a url returns a different set of search results. This way we can query the db of xing companies backwards and forwards (inside out).
    Those search urls are generated upon each PuppeteerCrawler startup and added to the request queue (see the sketch after this list).

    If a url is already in the queue or has already been processed (handled), the Apify SDK requestQueue object filters it out by simply not adding it. That is what makes working with the Apify SDK so convenient: the requestQueue object smartly manages handled and pending urls.

  2. Page url. Each page url points to a particular xing company. Page urls are gathered from the search results inside the PuppeteerCrawler's handlePageFunction, e.g.:
    https://www.xing.com/companies/daimlerag
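
As mentioned in item 1 above, the search urls are compiled and enqueued on every startup. A minimal sketch of that step is below; it assumes the INPUT fields from the example earlier and that it runs inside Apify.main where input and requestQueue are already initialized. buildSearchUrl and the 'SEARCH' label are hypothetical, not necessarily the project's actual code.

    const { URLSearchParams } = require('url');

    // Build a xing companies search url from filter parameters.
    const buildSearchUrl = ({ locationId, emplRange, keywords }) => {
        const params = new URLSearchParams();
        params.append('filter.location[]', locationId);
        params.append('filter.size[]', emplRange);
        params.append('keywords', keywords);
        return `https://www.xing.com/search/companies?${params.toString()}`;
    };

    // Enqueue one search url per (location, keywords) combination.
    // addRequest() silently skips urls that are already pending or handled.
    for (const locationId of input.crawl.landern_with_letters) {
        for (const keywords of input.crawl.letters) {
            await requestQueue.addRequest({
                url: buildSearchUrl({ locationId, emplRange: input.crawl.empl_range, keywords }),
                userData: { label: 'SEARCH' },
            });
        }
    }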

Handling xing pagination

A single search request might find many (over 10) company pages, so pagination is required to gather them all. We spawn additional search urls by adding a page parameter to the initial search url, and the new urls are added to the requestQueue:
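
A sketch of that spawning step is below; the 10-results-per-page size and the way the total result count is obtained are assumptions.

    // Spawn paginated search urls from the first results page of a search url.
    const enqueuePaginatedSearches = async (requestQueue, searchUrl, totalResults) => {
        const RESULTS_PER_PAGE = 10; // assumption
        const totalPages = Math.ceil(totalResults / RESULTS_PER_PAGE);
        for (let pageNo = 2; pageNo <= totalPages; pageNo++) {
            await requestQueue.addRequest({
                url: `${searchUrl}&page=${pageNo}`,
                userData: { label: 'SEARCH' },
            });
        }
    };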

Exclude non-company page links

In the crawling process we decided not to use Apify's own Apify.utils.enqueueLinks() utility, in order to filter page urls more precisely. Instead, we composed our own check_link() procedure to extract only the links of company pages (page urls).
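
A minimal version of such a check might look like this (the real check_link() in the project may apply more rules):

    // Return true only for links that point to a xing company profile page,
    // e.g. https://www.xing.com/companies/daimlerag
    const check_link = (href) => {
        if (!href) return false;
        return /^https:\/\/www\.xing\.com\/companies\/[^/?#]+\/?$/.test(href);
    };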

Logging in

We created a separate file, login_xing.js, where several login procedures are stored. The login_page_simple() and check_if_logged_in() procedures are used to log in and to check whether an account is logged in, using an existing PuppeteerCrawler page instance.
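
A simplified sketch of the idea is below; the login URL and CSS selectors are illustrative assumptions, not the ones actually used in login_xing.js.

    // login_xing.js (sketch) -- URL and selectors are illustrative assumptions
    const check_if_logged_in = async (page) => {
        // assume a logged-in session exposes some user-menu element
        return (await page.$('[data-qa="user-menu"]')) !== null;
    };

    const login_page_simple = async (page, account) => {
        await page.goto('https://login.xing.com/', { waitUntil: 'networkidle2' });
        await page.type('input[name="username"]', account.login);
        await page.type('input[name="password"]', account.password);
        await Promise.all([
            page.waitForNavigation({ waitUntil: 'networkidle2' }),
            page.click('button[type="submit"]'),
        ]);
        return check_if_logged_in(page);
    };

    module.exports = { login_page_simple, check_if_logged_in };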

Work with accounts

Since xing bans an account after it has crawled a certain number of requests, we developed a strategy to check and discard any account that fails to log in to xing. Initially, each account is given a credit (validity) of 3 points, meaning the script will make at most 3 attempts to log in with it. If all 3 attempts fail, the account is marked as deactivated.

Since xing may ban/deactivate any account that we use, there is a procedure, accounts_check.js, that checks for non-active (deactivated) xing accounts so that a new one can be chosen.
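
The bookkeeping behind that strategy might look roughly like this (a sketch with illustrative field names; the actual accounts_check.js may differ):

    const MAX_LOGIN_ATTEMPTS = 3; // initial credit of each account

    // Decrease the account's credit after a failed login; deactivate it at zero.
    const registerFailedLogin = (account) => {
        if (account.credit === undefined) account.credit = MAX_LOGIN_ATTEMPTS;
        account.credit -= 1;
        if (account.credit <= 0) account.deactivated = true;
    };

    // Pick the next still-active account, starting from account_index.
    const chooseAccount = (accounts, startIndex) => {
        for (let i = 0; i < accounts.length; i++) {
            const account = accounts[(startIndex + i) % accounts.length];
            if (!account.deactivated) return account;
        }
        throw new Error('No active xing accounts left.');
    };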

Datasets

The crawler pushes scraped data into several datasets.

The wrong_website_dataset stores the urls of main-dataset items whose website parameter equals https://www.xing.com/.
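
Inside handlePageFunction that check might look roughly like this (a sketch; companyData stands for the scraped company record and the dataset names follow the ones above):

    // Company pages whose website field just points back to xing
    // are recorded in a separate dataset.
    const wrongWebsiteDataset = await Apify.openDataset('wrong_website_dataset');

    if (companyData.website === 'https://www.xing.com/') {
        await wrongWebsiteDataset.pushData({ url: request.url });
    } else {
        await dataset.pushData(companyData);
    }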

The no_links_search_url_dataset and oversized_search_dataset accumulate search requests that return either zero results or a total of over 300 results (oversized). The main.js code fetches the urls from the corresponding datasets [before the Apify.PuppeteerCrawler launch] and re-runs those [search] requests on the next launch.
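
The re-queuing step before the crawler launch might look like this (a sketch; it assumes the requestQueue has already been opened inside Apify.main and that each dataset item stores the failed search url in a url field):

    // Re-enqueue search urls collected in the "failed" datasets on the previous run.
    const noLinksDataset = await Apify.openDataset('no_links_search_url_dataset');
    const oversizedDataset = await Apify.openDataset('oversized_search_dataset');

    for (const ds of [noLinksDataset, oversizedDataset]) {
        const { items } = await ds.getData();
        for (const item of items) {
            await requestQueue.addRequest({ url: item.url, userData: { label: 'SEARCH' } });
        }
    }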

But the requestQueue [Apify object] is smart enough to exclude repeated urls, the url itself being the deduplication key. Therefore, before we rerun main.js (which adds the urls from those datasets), we have to remove those urls from the handled requests stored in the apify_storage/request_queues/<queue_name>/handled folder. For that I used a separate Python script:

First, we save the empty_searches_dataset items into a corresponding *.txt file.

Then we copy that text file from apify_storage/datasets/<dataset_name> into apify_storage/request_queues/<queue_name>.

Finally, we run the following script:
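
The original script is not shown here; the Python sketch below only illustrates the idea. It assumes the *.txt file contains one url per line, that each handled request is stored as a JSON file with a url field, and that the paths are adjusted to your queue and file names.

    # Remove handled request files whose url appears in the saved *.txt file,
    # so those search urls can be re-enqueued on the next run.
    import json
    import os

    QUEUE_DIR = 'apify_storage/request_queues/default'        # adjust <queue_name>
    HANDLED_DIR = os.path.join(QUEUE_DIR, 'handled')
    URLS_FILE = os.path.join(QUEUE_DIR, 'empty_searches.txt')  # the copied *.txt file

    with open(URLS_FILE) as f:
        urls_to_reset = {line.strip() for line in f if line.strip()}

    for name in os.listdir(HANDLED_DIR):
        path = os.path.join(HANDLED_DIR, name)
        with open(path) as f:
            request = json.load(f)
        if request.get('url') in urls_to_reset:
            os.remove(path)
            print('Removed handled request:', request['url'])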

The project files are available here.

Conclusion

Scraping the xing business directory has been a decent challenge for me. I managed to develop a crawler of medium difficulty, with custom link gathering, login handling inside the crawler's gotoFunction, and storing and re-running failed searches that found no links, etc.

Note: if you need a step-by-step guide on how to install Puppeteer and Apify and start the project, let me know in the comments.