Email Crawling

I often receive requests asking about email crawling. The topic is clearly of interest to anyone who wants to scrape contact information from the web (direct marketers, for example), and we have already mentioned GSA Email Spider as an off-the-shelf solution for it. In this article I want to demonstrate how easy it is to build a simple email crawler in Python. The crawler is simple, but you can learn many things from this example (especially if you’re new to scraping in Python).

I have purposely simplified the code as much as possible to distill the main idea and let you add any extra features yourself later if necessary. Despite its simplicity, though, the code is fully functional and is able to extract many emails from the web for you. Note also that this code is written in Python 3.

Ok, let’s move from words to deeds. I’ll go through the code portion by portion, commenting on what’s going on. If you need the whole thing, you can find it at the bottom of the post.

Let’s import all the necessary libraries first. In this example I use BeautifulSoup and Requests as third-party libraries, and urllib, collections and re as built-in ones. BeautifulSoup provides a simple way to search an HTML document, and the Requests library allows you to easily perform web requests.
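A sketch of the imports this walkthrough assumes (in Python 3, urlsplit lives in urllib.parse):

    from bs4 import BeautifulSoup   # third-party: HTML parsing
    import requests                 # third-party: HTTP requests
    import requests.exceptions
    from urllib.parse import urlsplit
    from collections import deque
    import re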

The following piece of code defines a list of urls to start the crawling from. As an example I chose “The Moscow Times” website, since it exposes a nice list of emails. You can add any number of urls that you want to start the scraping from. Though this collection could be a list (in Python terms), I chose a deque type, since it better fits the way we will use it:
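For instance (the exact starting url below is just a placeholder, and the variable names used throughout these snippets are my own choices):

    # a queue of urls still to be crawled, seeded with the starting page
    new_urls = deque(['https://www.themoscowtimes.com/'])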

Next, we need to store the processed urls somewhere so as not to process them twice. I chose a set type, since we need to keep unique values and be able to search among them:
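For example:

    # a set of urls that we have already processed
    processed_urls = set()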

In the emails collection we will keep the collected email addresses:
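Likewise:

    # a set of unique email addresses found so far
    emails = set()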

Let’s start scraping. We’ll keep going until there are no urls left in the queue. As soon as we take a url out of the queue, we add it to the set of processed urls, so that we do not forget about it in the future:
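The opening of the loop might look like this:

    # process urls one by one until we exhaust the queue
    while len(new_urls):
        # move the next url from the queue to the set of processed urls
        url = new_urls.popleft()
        processed_urls.add(url)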

Then we need to extract some base parts of the current url; this is necessary for converting relative links found in the document into absolute ones:
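Continuing inside the loop, something like this (using urlsplit to pick the scheme and host apart):

        # extract the base url (scheme and host) to resolve relative links later
        parts = urlsplit(url)
        base_url = '{0.scheme}://{0.netloc}'.format(parts)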

The following code gets the page content from the web. If it encounters an error it simply goes to the next page:
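For instance, catching the Requests exception base class so that any fetch error simply skips the page:

        # fetch the page content; on any request error move on to the next url
        print('Processing %s' % url)
        try:
            response = requests.get(url)
        except requests.exceptions.RequestException:
            continue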

When we have gotten the page, we can search it for new emails and add them to our set. For email extraction I use a simple regular expression that matches email addresses:
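A simple (and deliberately naive) pattern such as this one will do:

        # extract all email-looking strings from the page and add them to the set
        new_emails = set(re.findall(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]+',
                                    response.text, re.I))
        emails.update(new_emails)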

After we have processed the current page, let’s find links to other pages and add them to our url queue (this is what the crawling is about). Here I use the BeautifulSoup library for parsing the page’s html:
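For example, using the built-in html.parser:

        # create a BeautifulSoup document from the page's html
        soup = BeautifulSoup(response.text, 'html.parser')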

The find_all method of this library extracts page elements according to the tag name (<a> in our case):
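Something along these lines (the loop body is filled in over the next few steps):

        # find and process every anchor (<a> tag) in the document
        for anchor in soup.find_all('a'):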

Some <a> tags may not contain a link at all, so we need to take this into consideration:
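For example:

            # extract the link url from the anchor, or an empty string if there is none
            link = anchor.attrs['href'] if 'href' in anchor.attrs else ''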

If the link address starts with a slash, then it is a relative link, and it is necessary to add the base url to the beginning of it:
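Something like:

            # resolve root-relative links against the base url
            if link.startswith('/'):
                link = base_url + link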

Now, if we have a valid link (one starting with “http”) that is neither in our url queue nor among the processed urls, we can add it to the queue for further processing:
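For instance:

            # enqueue links that are valid and have not been seen before
            if link.startswith('http') and link not in new_urls and link not in processed_urls:
                new_urls.append(link)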

That’s it. Here is the complete code of this simple email crawler:
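The listing below assembles the snippets from the steps above into one runnable script; as before, the starting url is a placeholder and the variable names are illustrative:

    from bs4 import BeautifulSoup
    import requests
    import requests.exceptions
    from urllib.parse import urlsplit
    from collections import deque
    import re

    # a queue of urls still to be crawled, seeded with a placeholder starting page
    new_urls = deque(['https://www.themoscowtimes.com/'])

    # a set of urls that we have already processed
    processed_urls = set()

    # a set of unique email addresses found so far
    emails = set()

    # process urls one by one until we exhaust the queue
    while len(new_urls):
        # move the next url from the queue to the set of processed urls
        url = new_urls.popleft()
        processed_urls.add(url)

        # extract the base url (scheme and host) to resolve relative links later
        parts = urlsplit(url)
        base_url = '{0.scheme}://{0.netloc}'.format(parts)

        # fetch the page content; on any request error move on to the next url
        print('Processing %s' % url)
        try:
            response = requests.get(url)
        except requests.exceptions.RequestException:
            continue

        # extract all email-looking strings from the page and add them to the set
        new_emails = set(re.findall(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]+',
                                    response.text, re.I))
        emails.update(new_emails)

        # create a BeautifulSoup document from the page's html
        soup = BeautifulSoup(response.text, 'html.parser')

        # find and process every anchor (<a> tag) in the document
        for anchor in soup.find_all('a'):
            # extract the link url from the anchor, or an empty string if there is none
            link = anchor.attrs['href'] if 'href' in anchor.attrs else ''
            # resolve root-relative links against the base url
            if link.startswith('/'):
                link = base_url + link
            # enqueue links that are valid and have not been seen before
            if link.startswith('http') and link not in new_urls and link not in processed_urls:
                new_urls.append(link)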

This crawler is simple and lacks several features (like saving the found emails to a file), but it demonstrates the basic principles of email crawling. I leave it to you for further improvement. :)

And of course, if you have any questions, suggestions or corrections feel free to comment on this post below.

Have a nice day!