The web scraping topic has been growing in popularity for years now, and freelance sites are overcrowded with orders for this controversial data-extraction process. Today we will combine two modern directions in web development. So, let’s look at an elegant way to scrape data from websites with Node.js!

First, a few words about the technology in use. Node.js is a cross-platform server-side runtime built on Chrome’s V8 JavaScript engine. Its two main benefits are:

  • Using JavaScript on the back-end
  • Asynchronous programming – when thousands of users are connected to the server simultaneously, Node.js works asynchronously: it sets priorities and distributes resources more rationally.

Node.js is usually used for creating APIs; it is also very convenient for building desktop and mobile apps and – take notice – IoT applications. The deeper you study it, the more clearly you will see that it is the future of back-end technologies.

If you don’t know anything about Node.js, a basic understanding of JavaScript and callback functions will be enough; the more complex code will be explained here.

Modules

Let’s start with overviewing our project. What do we need first? Node.js consists of a lot of useful modules that help you work faster. We will use these:

  • Express: A Node.js framework that makes it easy to design APIs for mobile and web apps.
  • fs: The built-in file system module. We will use it to write the results into a file.
  • Request: This module provides the simplest way to make HTTP calls. (It has since been deprecated, but it still works and appears in many tutorials.)
  • Cheerio: This allows one to use jQuery-like syntax to parse web data.

Now we will create our project and take some installation steps.

Building a project

To use Node.js you should first download and install it from nodejs.org. The installation process is very simple, so right after it completes you can start using it. We will talk about launching a bit later. For now, let’s create a project and install the needed modules.

The project building is as easy as the installation:

  1. Create a folder
  2. Inside the folder, create a file named package.json
  3. Open this file and paste the following into it:
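A minimal package.json along these lines will do – the project name, description, and author here are placeholders, so use your own:

```json
{
  "name": "node-web-scraper",
  "version": "1.0.0",
  "description": "A simple web scraping demo",
  "main": "server.js",
  "author": "Your Name",
  "dependencies": {
    "express": "latest",
    "request": "latest",
    "cheerio": "latest"
  }
}
```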
    In the package.json file this basic information is placed: the name of the project, its version and description, the main file, and the author. The dependencies section defines all the modules (and their versions – here, the latest) that will be used in the project.
  4. Now we are going to use the command line, but first we should write some code. Create a server.js file and enter the following into it:
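At this stage server.js only needs to prove that Node runs our code, so a single line is enough (the message text here is just a placeholder):

```javascript
// server.js – first version: just print a message so we can test `node server`
const message = 'Hello, Node.js scraper!'; // placeholder greeting
console.log(message);
```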
    Open a terminal, navigate to your project folder, and enter the command node server – it will print our message in the console.

The basic configuration is done. Now we should install the modules listed in the package.json file. The command

npm install

will download them into our project folder.

Scraping data with API

So, we have checked that the project works and have downloaded the modules. Let’s try to scrape some info. In the first example, we will get data about users on github.com. Fortunately, GitHub has its own open API. We will create a script which loads data about any GitHub user. For the test, we will get info about Linus Torvalds – the creator of Linux and Git.

Now we are going to visit http://localhost:8081/scrape/users. We should see our message generated by res.send('Check your console!'):

Everything is alright.

What have we got in the command line?

Awesome! Finally, we should check our file output.json (it is created automatically in your project folder):

Quite simple, isn’t it? Some parts might seem complex at first, but after an hour of practice you will be writing this kind of script like an expert.

Scraping the website content

Now we will scrape text information from a website. To practice our new skills, we will work with a typical blog – greencly.com – and grab each article title.

The first thing we need is to “make friends” with the site’s HTML code.

Go to https://greencly.com/, open the developer tools (F12), and investigate the source code. For example, we want to get all the article titles from the main page.

We see that every title is wrapped in <h1 class="entry-title"></h1>. To get it we will use the above-mentioned cheerio package.

The code is similar to our script above:

And here is our output:

Scraping with Node.js is like an art, isn’t it?

Making conclusions

Web scraping is an engaging experience. We strongly recommend that you go deeper into this topic and explore other amazing scraping features of Node.js – but remember to use the knowledge you gain only for legal purposes.

To become a guru in Node.js scraping, we recommend that you read the following 4 articles (the first being this very post):

  • Learn JavaScript
  • Useful Node.js tutorials
  • Cheerio