Some of you may be wondering whether it’s possible to extract a web browser’s local storage by web scraping.

In a nutshell, local storage lets a website store data on your machine instead of calling the server for it every time. It is often described as more secure than cookies because its contents are not sent to the server with every request, and it can hold far more data (typically several megabytes per origin) without affecting website performance. It’s accessible through browser scripting, for example JavaScript.

Why use JavaScript to fetch local storage? Is there another way to get it?

Here’s part of Wikipedia’s definition of local storage:

Unlike cookies, which can be accessed on both the server and client side, web storage falls exclusively under the purview of client-side scripting.
So, in my view, local storage is data that the web browser (e.g. Opera) keeps somewhere on the hard drive of the machine, local or cloud, where the browser runs. To fetch it any other way, you would have to dig into Opera’s data files on that machine, which is much harder. I think the simplest way is to use client-side scripting, namely JavaScript.

Python and Selenium

The HTTP libraries of high-level programming languages do not start a browser instance; they only request and parse raw HTML. So if we want to access the browser’s local storage when scraping a page, we need to launch a browser instance and use its JavaScript interpreter to read the local storage. For my money, Selenium is the best solution.

Here’s how to run custom scripts in a web browser instance through Selenium.
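
As a minimal sketch of that idea, the snippet below uses Selenium’s `execute_script` to run a one-line JavaScript expression in the loaded page and hand a single local storage value back to Python. The helper names `get_storage_item` and `demo`, the choice of Firefox, and whatever URL and key you pass in are assumptions for illustration only.

```python
def get_storage_item(driver, key):
    """Return the localStorage value stored under `key`, or None if it is absent."""
    # execute_script runs the JavaScript inside the page the driver has loaded;
    # extra Python arguments show up in the script as arguments[0], arguments[1], ...
    return driver.execute_script(
        "return window.localStorage.getItem(arguments[0]);", key
    )


def demo(url, key):
    """Open `url` in Firefox and print one localStorage entry.

    Hypothetical usage helper: needs the selenium package and a Firefox driver.
    """
    from selenium import webdriver  # imported lazily, only needed for a real browser

    driver = webdriver.Firefox()
    try:
        driver.get(url)  # localStorage is per origin, so a page must be loaded first
        print(get_storage_item(driver, key))
    finally:
        driver.quit()
```

Calling `demo(...)` with a page address and a key name prints the stored string, or `None` when the page has not stored anything under that key.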

A possible replacement for Selenium is PhantomJS, which runs a headless browser.

JavaScript to iterate over the localStorage browser object
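
One way to write such a loop, sketched here under assumptions: the JavaScript walks the indexes from `0` to `localStorage.length - 1`, looks up each key with `localStorage.key(i)` and its value with `localStorage.getItem(key)`, and returns everything as a JSON string. It is kept in a Python string so it can be passed to a Selenium driver; the names `ITERATE_LOCAL_STORAGE_JS` and `dump_local_storage` are made up for this example.

```python
import json

# JavaScript executed inside the page: visit every index of localStorage,
# fetch the key at that index, then the value stored under that key.
ITERATE_LOCAL_STORAGE_JS = """
var items = {};
for (var i = 0; i < localStorage.length; i++) {
    var key = localStorage.key(i);
    items[key] = localStorage.getItem(key);
}
return JSON.stringify(items);
"""


def dump_local_storage(driver):
    """Return every localStorage entry of the current page as a dict of strings."""
    # `driver` is assumed to be a Selenium WebDriver with a page already loaded.
    return json.loads(driver.execute_script(ITERATE_LOCAL_STORAGE_JS))
```

Serializing with `JSON.stringify` and decoding with `json.loads` is a design choice here: it keeps the data crossing the browser boundary a plain string, which is easy to log or save as-is.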

Advanced script

As mentioned here, an HTML5-capable browser should also implement So the script would be:

Python with Selenium script for setting up and scraping local storage
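
A possible shape for such a script, as a sketch rather than a definitive implementation: it first writes a few entries with `localStorage.setItem`, then reads everything back in one call. The function names, the sample keys and values, the `https://example.com` address, and the use of Firefox are all assumptions chosen for illustration.

```python
# JavaScript snippets run in the page through Selenium's execute_script.
SET_ITEM_JS = "window.localStorage.setItem(arguments[0], arguments[1]);"
GET_ALL_JS = """
var out = {};
for (var i = 0; i < localStorage.length; i++) {
    var k = localStorage.key(i);
    out[k] = localStorage.getItem(k);
}
return out;
"""


def set_local_storage(driver, items):
    """Write each key/value pair of `items` into the page's localStorage."""
    for key, value in items.items():
        # localStorage only holds strings, so convert explicitly on the Python side.
        driver.execute_script(SET_ITEM_JS, key, str(value))


def scrape_local_storage(driver):
    """Read all localStorage entries of the current page into a Python dict."""
    return dict(driver.execute_script(GET_ALL_JS))


def main(url="https://example.com"):
    """Hypothetical end-to-end run: needs the selenium package and a Firefox driver."""
    from selenium import webdriver  # imported lazily, only needed for a real browser

    driver = webdriver.Firefox()
    try:
        driver.get(url)  # storage belongs to an origin, so load the page first
        set_local_storage(driver, {"user": "demo", "visits": 3})
        print(scrape_local_storage(driver))
    finally:
        driver.quit()
```

When scraping a real site you would normally skip `set_local_storage` and call only `scrape_local_storage` after the page has loaded and stored its own data; the setup step is there to make the round trip easy to try out.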

Python bindings alternative to Python+Selenium

Some might argue that Selenium is overkill when all you need is to extract local storage. If you think Selenium is too bulky, you might want to try Python bindings for a desktop development framework, e.g. PyQt. That’s something I might touch on in a later post.