Dexi.io is a powerful scraping suite. This cloud scraping service provides development, hosting and scheduling tools. The suite can be compared with Mozenda: both let you build web scraping projects and run them in the cloud for user convenience. Dexi also exposes an API, with each scraper defined as a JSON object, similar to other services like Import.io, Kimono Labs and ParseHub.
This JSON definition is a modern representation of the scraping robot as an object that can be easily edited, adjusted and transferred to other projects.
Robot building workflow
The robot building workflow is quite straightforward. You log in, choose the Robots tab in the left pane and click Create new robot. Enter a starting URL, name the robot and choose its type: Scraper or Crawler.
From there you just use the point-and-click UI to select page elements, choose actions, set before/after steps and more. Read more about browser-based robot building. See an example of a robot's test results:
You can also add a Crawler robot, which is defined by conditions and processes, with adjustable crawling depth. It took me quite a while to compose my first robot (mainly by watching the tutorial videos).
Runs and execution
After the robot is ready you need to configure its run. A run is the configuration of how to execute the robot: concurrency, scheduling, integrations and inputs.
Robot execution happens in the cloud, and results are kept in the available storage until you download them, request them through the API, or delete them.
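To make the notion of a run concrete, its settings can be pictured as a small configuration object. This is a hypothetical sketch for illustration only; the field names are mine, not Dexi's actual schema:

```json
{
  "robotId": "my-example-robot",
  "concurrency": 5,
  "schedule": "0 6 * * *",
  "inputs": [
    { "url": "https://example.com/products" }
  ],
  "integrations": ["csv-export", "webhook"]
}
```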
I'd also highlight some more features Dexi provides:
- The system operates with CSS and jQuery selectors, so it pays to get familiar with them.
- A User-Agent can be set for each robot run.
- Robots.txt compliance can be toggled on/off for a single run.
- The system takes a screenshot at each extraction step, which helps debug what went wrong. (+1)
- It can extract images, download files and take screenshots of any element.
You should plug in 3rd-party proxies to be used within the Dexi tool.
“We do not allow running without proxies so if your account has no proxies – we will use our free proxies for your executions.”
They now have over 160 proxies (61 DE proxies and 100 US proxies), so you might need to plug in a 3rd-party proxy service for professional web scraping.
As a modern cloud scraping tool, Dexi can be monitored, executed and have its results fetched through a REST API. More details here. Your results can be fetched with a short piece of PHP code:
$request_url = "https://app.dexi.io/api/runs/$runId/latest/result";
$accessKey = md5($accountId . $apiKey);
// The authentication headers go into an HTTP stream context
$context = stream_context_create(array(
    'http' => array(
        'header' => "X-DexiIO-Access: $accessKey\r\n" .
                    "X-DexiIO-Account: $accountId\r\n" .
                    "Accept: application/json\r\n" .
                    "Content-Type: application/json\r\n" .
                    "User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b\r\n"
    )
));
// The context is passed to the request itself; json_decode takes no context argument
$response = file_get_contents($request_url, false, $context);
$results = json_decode($response);
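For non-PHP stacks the same request is easy to reproduce. Here is a rough Python equivalent using only the standard library, assuming the same endpoint and header scheme as the PHP snippet above (the credential values are placeholders):

```python
import hashlib
import json
import urllib.request

# Placeholder credentials; substitute your own Dexi account values
account_id = "your-account-id"
api_key = "your-api-key"
run_id = "your-run-id"

# The access key is the MD5 hash of account id + API key,
# mirroring md5($accountId . $apiKey) in the PHP snippet
access_key = hashlib.md5((account_id + api_key).encode("utf-8")).hexdigest()

request = urllib.request.Request(
    f"https://app.dexi.io/api/runs/{run_id}/latest/result",
    headers={
        "X-DexiIO-Access": access_key,
        "X-DexiIO-Account": account_id,
        "Accept": "application/json",
        "Content-Type": "application/json",
    },
)

# Uncomment to actually execute the request against your account:
# with urllib.request.urlopen(request) as response:
#     results = json.loads(response.read().decode("utf-8"))
```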
Dexi now provides a built-in CAPTCHA solving service that, at the moment (July 2015), is free of charge for SaaS users, which is something you rarely see in scrapers. However, the service can solve only input-field CAPTCHAs (not JS-driven CAPTCHAs: draggable, drag-and-drop, etc.). The following steps (as suggested by their support) detail how to set up CAPTCHA solving in a robot:
Just point to the CAPTCHA image and select 'Add step for element'. Click on the step in the timeline and click 'Edit step'. In the 'Type' selector choose 'Resolve Captcha'. Then set the 'Captcha Input' by selecting the icon and pointing it to the input field on your page. Finally, add a step which clicks the submit button.
The system provides a variety of exporting options. For that, follow the Integration sign and select integrations and formats for each run. These will be invoked automatically whenever an execution of the run succeeds.
Pricing and counting
The Dexi SaaS offers a free account with full functionality, but execution time is limited to 1 hour and concurrent robots (workers) are limited to 10. For professional usage you do need to upgrade to a paid plan, starting at $119/month, which includes unlimited workers, unlimited execution time and full feature access. Dexi also provides on-demand pricing, where you pay only for the execution hours you use.
My impression of the Dexi web scraping suite is that it is a modern environment for building and hosting scrapers. It offers users a "gentleman's set" for web scraping that no similar tool provides in full. Its CAPTCHA solving sets Dexi apart from services like Import.io or Kimono. Compared to Mozenda, this suite is more convenient simply because it is fully browser-based (Mozenda requires you to install a desktop agent builder).
The docs are still under development, but the learning curve doesn't appear to be too steep. Their support is very responsive and always ready to assist you. Dexi looks modern and actively developed, and it will surely have its place among the other scraping tools and frameworks.