In this post I want to show you how I managed to complete the challenge of scraping a site with Google Apps Script (GAS).

The Challenge

The challenge was to scrape arbitrary sites and save all of a site’s pure text (with all the HTML markup stripped) into a single file. Originally I was going to use Python and PHP solutions, but then I thought I’d try Google Apps Script instead. It turned out pretty well.

To handle page retrieval I use

Gathering Internal Links

Before we strip the HTML from a given page, we have to gather all the links from it (using regex) so we can add them to our array of links to crawl. Since we only want to crawl inside the site’s boundaries, I limited the matches to internal links only.
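To illustrate the idea, here is a minimal sketch of regex-based internal link gathering. It is not the exact code from my script: the function name getInternalLinks and the baseUrl parameter are made up for this example, and it assumes baseUrl has no trailing slash. For simplicity it keeps only root-relative links ("/about") and absolute links under baseUrl, and it ignores relative paths such as "page.html".

```javascript
// Hypothetical helper: collect same-site links from an HTML string.
// Assumption: baseUrl looks like "https://example.com" (no trailing slash).
function getInternalLinks(html, baseUrl) {
  var links = [];
  var seen = {};
  // Capture the href value of <a> tags, stopping at a quote or a "#" fragment.
  var re = /<a\b[^>]*?\shref\s*=\s*["']([^"'#]*)/gi;
  var match;
  while ((match = re.exec(html)) !== null) {
    var href = match[1];
    var url = null;
    if (href === baseUrl || href.indexOf(baseUrl + "/") === 0) {
      url = href;                       // absolute link inside the site
    } else if (href.charAt(0) === "/" && href.charAt(1) !== "/") {
      url = baseUrl + href;             // root-relative link
    }
    // Everything else (external sites, mailto:, bare fragments) is skipped.
    if (url !== null && !seen[url]) {
      seen[url] = true;
      links.push(url);
    }
  }
  return links;
}
```

Each returned URL can then be appended to the crawl array if it has not been visited yet.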

Strip the HTML

For this purpose we use the XMLService library. Calling Xml.parse with the second parameter set to “true” parses the document as an HTML page. We then walk through the document (which is patched up with missing HTML and BODY elements, etc., and turned into a valid XHTML document), turning text nodes into text and expanding all other nodes.
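The walk itself is a simple recursion. The sketch below shows only that recursion, on a made-up node shape: plain objects with type, value and children fields stand in for the parsed document, so this is not the real parser's object model, and getTextFromNode is a name chosen for this example.

```javascript
// Hypothetical node shape, standing in for a parsed document:
//   text node:    { type: "text", value: "..." }
//   element node: { type: "element", name: "p", children: [...] }
function getTextFromNode(node) {
  if (node.type === "text") {
    return node.value;                  // text nodes contribute their text
  }
  if (node.type === "element") {
    var parts = [];
    for (var i = 0; i < node.children.length; i++) {
      // Expand child nodes in document order.
      parts.push(getTextFromNode(node.children[i]));
    }
    return parts.join("");
  }
  return "";                            // comments and other node kinds are dropped
}
```

Calling it on the root element returns the concatenated text of the whole page with all markup gone.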

The Whole Code

There are some utility functions I did not mention before, but I think you’ll easily catch on to them. :-)

You can read about how to automatically convert files in Google Apps Script here.

Disclaimer

The script works well, but Google Apps Script limits execution time to 6 minutes, so it won’t work for huge sites (the script is stopped after 6 minutes and no output appears in the Google Docs file). I plan on improving it so that it can scrape huge sites too, and you are welcome to contribute your suggestions for that.
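One possible starting point, as a rough sketch rather than a finished solution: stop crawling shortly before the limit, so that whatever text was collected can still be saved. Everything here is hypothetical: crawlWithBudget, the fetchText callback (assumed to return an object with text and links fields) and the injectable now clock are names invented for this example.

```javascript
// Hypothetical sketch: crawl until the queue is empty or the time budget runs out.
// Assumption: fetchText(url) returns { text: "...", links: ["...", ...] }.
function crawlWithBudget(startUrl, fetchText, budgetMs, now) {
  now = now || function () { return Date.now(); };
  var started = now();
  var queue = [startUrl];
  var visited = {};
  var texts = [];
  // The clock is checked before each page, so a slow page can still overrun a little.
  while (queue.length > 0 && now() - started < budgetMs) {
    var url = queue.shift();
    if (visited[url]) continue;         // skip pages that were queued twice
    visited[url] = true;
    var page = fetchText(url);
    texts.push(page.text);
    for (var i = 0; i < page.links.length; i++) {
      if (!visited[page.links[i]]) queue.push(page.links[i]);
    }
  }
  // pending holds the URLs that were still queued when the loop stopped.
  return { text: texts.join("\n"), pending: queue };
}
```

With a budget of, say, 5 * 60 * 1000 milliseconds, the function would return before the 6 minute cutoff in most cases, and the pending list shows which pages were left out.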