In this post we share the code of a simple Java email crawler. It crawls a given website for email addresses, with unlimited crawling depth. A previous post showed a simple Python email crawler.

Init settings

Our crawler uses the following libraries. To fetch and parse web pages, we use the third-party JSoup library, which has many methods for extracting and modifying web data. We also use standard Java libraries, including java.util.

Init page and its host name

We use the host name of the init page so that the crawler only follows pages of the given site.
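As a minimal sketch of this step, the host name can be taken from the init URL and compared with the host of any candidate link. The class name HostNameDemo, the helper names hostOf and sameHost, and the example.com addresses are illustrative assumptions, not names from the original code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class HostNameDemo {
    // Returns the host part of an absolute URL, e.g. "example.com".
    static String hostOf(String address) {
        try {
            // Going through URI avoids the URL(String) constructor, deprecated in recent JDKs.
            URL url = URI.create(address).toURL();
            return url.getHost();
        } catch (MalformedURLException | IllegalArgumentException e) {
            throw new IllegalArgumentException("Not an absolute URL: " + address, e);
        }
    }

    // True when the link points to the same host as the init page.
    static boolean sameHost(String address, String initHost) {
        try {
            return hostOf(address).equalsIgnoreCase(initHost);
        } catch (IllegalArgumentException e) {
            return false; // links that cannot be parsed are simply skipped
        }
    }

    public static void main(String[] args) {
        String initUrl = "https://example.com/contacts"; // placeholder start page
        String initHost = hostOf(initUrl);
        System.out.println(initHost);
        System.out.println(sameHost("https://example.com/about", initHost));
        System.out.println(sameHost("https://other.org/", initHost));
    }
}
```

Comparing hosts case-insensitively is a choice made for this sketch; a stricter or looser rule (for example, treating subdomains as the same site) may suit other crawls better.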

Before the main loop starts, we create a regex pattern for email addresses and the arrays that store the data (emails and URLs).
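A sketch of this setup might look as follows. The regex here is a simplified pattern chosen for illustration and is probably not identical to the one in the original code; the class name EmailPatternDemo and the method extractEmails are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailPatternDemo {
    // Simplified email pattern (an assumption): local part, "@", domain, dot, top-level domain.
    static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");

    // Returns the distinct email addresses found in the text, in order of first appearance.
    static List<String> extractEmails(String text) {
        List<String> emails = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            String email = matcher.group();
            if (!emails.contains(email)) { // skip duplicates
                emails.add(email);
            }
        }
        return emails;
    }

    public static void main(String[] args) {
        List<String> listOfURL = new ArrayList<>(); // storage for page URLs
        listOfURL.add("https://example.com/");      // placeholder init link
        String text = "Write to info@example.com or sales@example.com.";
        System.out.println(extractEmails(text));
    }
}
```

The contains check on a list is linear in the number of stored items; for large sites a LinkedHashSet would give the same no-duplicates behavior with faster lookups.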

Main loop

In the main loop we iterate over an array with links to pages of a given site. The init link was saved in the previous piece of code.

Note that newly found links are appended to this listOfURL array in each iteration.
The loop consists of 3 parts:

  1. The URL is taken from the array, and we fetch the HTML page at its address. The page's HTML code is stored as an org.jsoup.nodes.Document object.
  2. The following piece of code finds and saves email addresses. We pass the text of the page to a Matcher object, which finds matches for the pattern. Found email addresses are saved into the array without duplication.
  3. The last part of the code looks for links and stores them in the listOfURL array. Links to new pages on the same host are added without duplication. First, we look in the Document object for ‘a’ tags with the ‘href’ attribute. The ‘select’ method of the Document object accepts CSS-style (jQuery-like) selectors. The found ‘a’ tags are saved in an Elements object. Next, we iterate over the links in the Elements object. We store the URLs as strings in the listOfURL array (not the entire ‘a’ tag notation).
    To get the full address (resolving relative paths), we pass the attr method not just ‘href’ but ‘href’ with the prefix ‘abs:’. The resulting address is saved as an instance of the URL class for comparison with the host name of the init page. If the names match, the URL is checked for duplication before being saved to the listOfURL array.
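The three parts above can be sketched without network access as follows. This is not the original code: page fetching is replaced by a hypothetical in-memory map of Page objects so that the sketch runs offline, java.net.URI stands in for the URL class, and the email regex is a simplified assumption. The comments note where the JSoup calls described above would go.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlLoopSketch {
    // Hypothetical stand-in for a downloaded page: its text and its absolute link URLs.
    // With JSoup these could come from doc.text() and link.attr("abs:href").
    static final class Page {
        final String text;
        final List<String> links;
        Page(String text, List<String> links) { this.text = text; this.links = links; }
    }

    static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");

    // Crawls the in-memory site from initUrl, fills listOfURL, returns the emails found.
    static List<String> crawl(String initUrl, Map<String, Page> site, List<String> listOfURL) {
        String initHost = URI.create(initUrl).getHost();
        List<String> emails = new ArrayList<>();
        listOfURL.add(initUrl);
        // Index-based loop: the list grows while we iterate, which a for-each loop would not allow.
        for (int i = 0; i < listOfURL.size(); i++) {
            // Part 1: fetch the page. A JSoup version might call Jsoup.connect(url).get() here.
            Page page = site.get(listOfURL.get(i));
            if (page == null) continue; // page could not be fetched
            // Part 2: collect email addresses without duplicates.
            Matcher matcher = EMAIL.matcher(page.text);
            while (matcher.find()) {
                if (!emails.contains(matcher.group())) emails.add(matcher.group());
            }
            // Part 3: queue new links on the same host, without duplicates.
            for (String link : page.links) {
                String host;
                try {
                    host = URI.create(link).getHost();
                } catch (IllegalArgumentException e) {
                    continue; // skip links that cannot be parsed
                }
                if (initHost != null && initHost.equalsIgnoreCase(host)
                        && !listOfURL.contains(link)) {
                    listOfURL.add(link);
                }
            }
        }
        return emails;
    }

    public static void main(String[] args) {
        Map<String, Page> site = new HashMap<>();
        site.put("https://example.com/", new Page("Contact: info@example.com",
                List.of("https://example.com/about", "https://other.org/")));
        site.put("https://example.com/about", new Page("HR: hr@example.com",
                List.of("https://example.com/")));
        List<String> listOfURL = new ArrayList<>();
        System.out.println(crawl("https://example.com/", site, listOfURL));
        System.out.println(listOfURL);
    }
}
```

Because the loop only ends when no unvisited same-host links remain, crawling depth is unlimited; on a large site a page limit or a delay between requests may be worth adding.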

Whole code

For those who want to check the result (URLs), a small piece of code can be added to the end of the main method. The list of links is sorted and only the first 50 links are displayed. For this we import the Arrays class.
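One possible shape for that check is sketched here. The class ResultPreview and the helper firstSorted are hypothetical names, and the sample URLs are placeholders; in the crawler itself the body of firstSorted would sit at the end of the main method and read the real listOfURL.

```java
import java.util.Arrays;
import java.util.List;

class ResultPreview {
    // Returns at most `limit` URLs from the list, in sorted order.
    static String[] firstSorted(List<String> listOfURL, int limit) {
        String[] sorted = listOfURL.toArray(new String[0]);
        Arrays.sort(sorted);
        return Arrays.copyOf(sorted, Math.min(limit, sorted.length));
    }

    public static void main(String[] args) {
        List<String> listOfURL = List.of(
                "https://example.com/contacts",
                "https://example.com/about",
                "https://example.com/");
        // Print the first 50 links (or fewer, if the list is shorter).
        for (String url : firstSorted(listOfURL, 50)) {
            System.out.println(url);
        }
    }
}
```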
Now you can use the code, and we welcome your feedback. You may fork the project from GitHub.