A few days ago I wrote the Kimono scraper review, where I mentioned that the service lacked pagination support and some other important features. To be fair, the service is developing quite rapidly: it can now not only step through multiple pages and URLs, but also keep a history of scraped data. Let's look at these features more closely.
With this option, you can store historical data and access it through Kimono's application programming interface (API). Kimono now creates and stores a new version of the data each time it connects to the source. To check the number of the most recent version, go to your API endpoint and look at the JSON metadata.
This option lets you step back through versions, from the very first one to the most recent. It does not change the original API endpoint path, which continues to return the latest version of the data. To work with an older version, pick its number and add it to the API call:
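As a rough sketch of what such a call might look like, here is a small Python helper that builds a Kimono API URL pinned to an older data version. Note that the endpoint path and the `apikey` and `kimversion` parameter names here are assumptions for illustration; check your own API detail page for the exact call format.

```python
# Hypothetical sketch of a versioned Kimono API call.
# The endpoint path and the "apikey"/"kimversion" parameter names are
# assumptions -- consult your API detail page for the real format.
from urllib.parse import urlencode


def build_versioned_url(api_id, api_key, version=None):
    """Build a Kimono API URL, optionally pinned to an older data version."""
    params = {"apikey": api_key}
    if version is not None:
        # Assumed name of the query parameter selecting a historical version.
        params["kimversion"] = version
    return "https://www.kimonolabs.com/api/{}?{}".format(api_id, urlencode(params))


# Latest version (the default behavior of the endpoint):
print(build_versioned_url("abc123", "MY_KEY"))
# Pinned to an older version, e.g. version 2:
print(build_versioned_url("abc123", "MY_KEY", version=2))
```

You could then fetch either URL with any HTTP client and get back the JSON for that particular version of the scraped data.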
If you need to retrieve more data from a page by following a link (typically a "more" or "next" link), use the pagination feature. Once you have selected all the data you need on the page, go to the top right corner of the toolbar, find the pagination icon (it looks like an open book), click it, and then click the link you want Kimono to follow. This lets Kimono find more entries for you.
Here is a one-minute video from Kimono that shows how it works:
If the pages you want to extract data from share the same format and structure, you can use the crawling option. Simply paste the URLs, one per line, into the "link list" box, and Kimono will handle them as a list of targets. The "link list" is on your API detail page, under the "pagination/crawling" tab. Alternatively, you can select "import links from API" below the link list box; this lets you extract the links from a page with another Kimono API you create and import them into the crawler's list of target links.
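Since the link list is just one URL per line, preparing it programmatically is trivial. A minimal sketch (the team-page URLs below are placeholders, not real endpoints):

```python
# Assemble the text for Kimono's "link list" box: one URL per line.
# The URLs here are hypothetical placeholders.
urls = [
    "http://example.com/team/alice",
    "http://example.com/team/bob",
    "http://example.com/team/carol",
]

# Join with newlines so each URL sits on its own line, as Kimono expects.
link_list = "\n".join(urls)
print(link_list)
```

You would then paste the resulting text into the "link list" box on the API detail page.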
The difference between Kimono's pagination and crawling options is this: pagination retrieves data from pages you reach by clicking through "next" or "more" links, while crawling extracts data from pages you provide explicitly or that are returned by Kimono API URLs. Crawling is more useful when, for example, you want to gather bio data from team members' individual pages, starting from a group page that lists the links to them. If, on the other hand, you are dealing with product catalog data and want the descriptions of the first 10 products chained together by "next" links, you would set up pagination to collect them.
You can see how crawling works in this two-minute video from Kimono: