Octoparse has recently launched version 7.0, which has turned out to be the most revolutionary upgrade of the past two years. It brings not only a more user-friendly UI but also advanced features that make web scraping even easier. In this post, I will walk through some of the new features and changes in this version, with an eye to how a beginner, even one without any coding background, can approach this web scraping tool.
Here are some new features/changes included in Octoparse v7.0:
- Modern UI with a lightweight design
- New dashboard for efficient scraping task management
- New action-guided scraping supported by a patent-pending algorithm
- Wizard mode for easy scraping/Advanced mode for complex web pages
- Improved compatibility with upgraded built-in browser
- New UA select feature for enhanced anti-blocking
- More data export formats
Modern UI with a lightweight design
Compared with past versions, Octoparse 7.0 has a clean, sleek look and feels like a much more lightweight application. Less is more.
New dashboard for efficient scraping task management
Now with the new dashboard, you have all task-related information in one place, such as task status, lines of data extracted, time elapsed since the start of the job, etc. You can also access any data extracted from the dashboard or filter tasks by task status for efficient task management.
New action-guided scraping supported by a patent-pending algorithm
One of the biggest changes in this version is the algorithm itself. As soon as you click on any content on the web page within the built-in browser, Octoparse tries to infer what you may want to do next and which data on the page you may be interested in capturing. All you have to do is pick the desired action from an interactive option menu.
In previous versions, you had to know exactly what you wanted to do at each step. The new action-guided scraping mode provides the extra guidance needed to get started quickly on any web scraping task.
Wizard Mode or Advanced Mode?
Octoparse offers two modes to build your scraper:
- Wizard Mode for easy scrapes with step-by-step guidance
- Advanced Mode, which is far more powerful and flexible and works for 98% of all websites
There are three templates provided in Octoparse Wizard Mode:
1) “List or Table” – extract a list or table from a single page or multiple pages, such as directory listings.
2) “List and Detail” – extract information from each item's detail page by following the links in a list, such as news articles.
3) “Single Page” – extract data from a single web page.
With the Octoparse Advanced Mode, you can:
- Achieve data scraping on almost all kinds of web pages
- Extract data like text, URL, image, and HTML
- Set up a scraping task to interact with the webpage, such as logging in, searching by keyword and opening a drop-down menu
- Customize your workflow for more advanced sites, such as adding a wait time, adjusting for AJAX, modifying XPath and reformatting the data extracted
Upgraded built-in browser
Octoparse has upgraded the built-in browser in version 7.0, which greatly improves the software's compatibility. Websites that could not be opened in previous versions should no longer be a problem.
User-Agent selection for anti-blocking
The newly added User-Agent selection feature helps keep you from getting blocked while scraping.
Your browser sends what’s known as a user agent with every request for a web page. This is a string that tells the target website what kind of browser and device you are using to access the page. When you scrape a website repeatedly with the same user agent, the traffic is easy to identify as bot activity. Now you can easily change this setting to reduce the chance of being blocked. Note that the software only lets you select a User-Agent; it does not rotate User-Agents automatically.
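To make the idea concrete, here is a minimal Python sketch of what "selecting a user agent" means at the HTTP level. It is not how Octoparse implements the feature; the profile names, the example strings and the `build_request` helper are assumptions made up for illustration, and no request is actually sent.

```python
import urllib.request

# Illustrative User-Agent strings (assumed examples, not Octoparse's list).
USER_AGENTS = {
    "desktop_chrome": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    ),
    "iphone_safari": (
        "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) "
        "AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 "
        "Mobile/14E277 Safari/602.1"
    ),
}


def build_request(url: str, profile: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying the chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENTS[profile]})


request = build_request("https://example.com/", "iphone_safari")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

The target site only sees the string in this header, so switching the profile changes which browser and device the scraper appears to be.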
More data export formats
With the addition of JSON, Octoparse 7.0 now supports six export formats (Excel 2007, Excel 2003, CSV, HTML, JSON and export to database), so you can pick whichever best fits how you plan to use the data.
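For readers unsure how these formats differ, the sketch below writes the same two records as JSON and as CSV using Python's standard library. The field names and values are invented for illustration and are not actual Octoparse output; the helpers assume a non-empty list of rows that all share the same keys.

```python
import csv
import io
import json

# Hypothetical scraped rows, for illustration only.
rows = [
    {"title": "First product", "price": "$10"},
    {"title": "Second product", "price": "$12"},
]


def to_json(records):
    """Serialize the records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records as CSV with a header row (assumes non-empty input)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_json(rows))
print(to_csv(rows))
```

JSON keeps each record as a self-describing object, which suits feeding data to other programs, while CSV is a flat table that opens directly in a spreadsheet.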
Built-in RegEx tool and XPath tool to help you out
Octoparse 7.0 provides two important tools for building a scraper, the RegEx Tool and the XPath Tool, which help you extract web data more precisely.
1) Regular expression tool (RegEx tool)
Having data extracted does not necessarily mean having the clean data you are looking for. Octoparse provides a built-in RegEx tool to spare you a painful data cleansing process. You can use the RegEx tool to home in on specific elements, for example, to retrieve the rating number from the HTML of star-rating icons.
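The star-rating case can be sketched in a few lines of Python. The HTML snippet, the pattern and the `extract_rating` helper are assumptions chosen for illustration; real pages will use different markup, and Octoparse's RegEx tool has its own interface for building the expression.

```python
import re

# Hypothetical markup: the rating appears only in the icon's title attribute.
sample_html = '<span class="star-rating" title="4.5 out of 5 stars"></span>'

# Capture an integer or decimal number that precedes " out of 5 stars".
RATING_PATTERN = re.compile(r'title="(\d+(?:\.\d+)?) out of 5 stars"')


def extract_rating(html):
    """Return the rating as a float, or None if the pattern is not found."""
    match = RATING_PATTERN.search(html)
    return float(match.group(1)) if match else None


print(extract_rating(sample_html))  # 4.5
```

Returning `None` when nothing matches makes it easy to spot rows where the page layout differs from what the pattern expects.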
2) XPath tool
Octoparse locates elements on the web page using XPath. Knowing how to write an XPath that targets the right elements is very helpful when the auto-generated XPath fails. The XPath tool provided in Octoparse helps you customize or write an XPath in a few easy steps.
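If XPath is new to you, the following Python sketch shows how an expression selects elements from a page. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup; the page snippet and class names are hypothetical, and this is not Octoparse's XPath tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing page used only for illustration.
page = """
<html><body>
  <ul class="results">
    <li class="item"><a href="/p/1">First product</a><span class="price">$10</span></li>
    <li class="item"><a href="/p/2">Second product</a><span class="price">$12</span></li>
    <li class="ad"><a href="/promo">Sponsored</a></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(page)

# Select the link inside every list item whose class is exactly "item".
titles = [a.text for a in root.findall(".//li[@class='item']/a")]

# Select the price span inside the same kind of list item.
prices = [s.text for s in root.findall(".//li[@class='item']/span[@class='price']")]

print(titles)
print(prices)
```

The `[@class='item']` predicate is what keeps the sponsored entry out of the results, which is the kind of adjustment you typically make when an auto-generated XPath matches too much or too little.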
Get data via local extraction/cloud extraction/API
There are three ways to get data extracted in Octoparse, so you can always pick the one that fits your budget or your use case.
Running a task on a local machine is available to all users. It is useful not only for debugging workflow issues; you can also run a task in full locally to extract the complete dataset without using cloud resources.
Octoparse offers a powerful Cloud Platform for 24/7 data extraction. When a task is set to run in the cloud, it runs across multiple Octoparse cloud servers, using Octoparse IPs rather than your own. Extracted data is saved in the cloud and can be accessed from any computer at any time. In addition, Octoparse cloud extraction comes with advantages such as:
- Automatic IP rotation
- Data extraction 6 to 20 times faster
- Scheduled crawling at any time and any frequency
- Integration with the Octoparse API
The Octoparse API allows even more seamless integration with your own program, without manually accessing the Octoparse app. The API can be used to access task information, retrieve extracted data or even control tasks (Advanced API). For example, you can have the extracted data delivered automatically to your own systems at any frequency, including in real time.
| Plan | Functions (including but not limited to) | Price |
|---|---|---|
| Free Plan | Unlimited pages, unlimited computers, 10 Crawlers | Free |
| Standard Plan | Free Plan + cloud extraction, 100 Crawlers, API access | $75/month |
| Professional Plan | Standard Plan + 250 Crawlers, Advanced API, 1-on-1 training | $209/month |
On the whole, Octoparse version 7 is sleek, powerful and easy-to-learn software that makes web scraping easy and achievable for most people, including non-coders.
I think it’s definitely worth a try if you are looking for a web scraping tool, as the improvements in version 7 show how committed the Octoparse team is to the product. Moreover, its affordability and the generous functionality of its free version are compelling in this field. As for guidance and support, there are tons of tutorials, step-by-step guides and videos on their website and YouTube, and the Octoparse Support Team is very responsive and helpful when you need extra assistance with your task or crawler.