Now we start a new Scraper Test Drive stage called ‘Invalid HTML’. How do scrapers behave with broken HTML code? On the whole they did well; the one nearly universal problem was failing to recognize a link with unmatched quotes.
You can also see the test’s tasks in the following picture:
|General scrape: tasks 1-3, 5-6|Tolerate unmatched quotes|META attentive|HTTP header attentive|
The result for this scraper is as follows:
2>1 & 1<2 nonHTML unclosed scrapetools.com” millepah.com bad nesting проверка (windows-1251) wrong meta РїСЂРѕРІРµСЂРєР° (utf-8) wrong header
The scraper completed almost all the tasks successfully. The result shows that it can scrape the link with unmatched quotes, and that it pays more attention to the HTTP header than to the meta tag.
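To make the header-versus-meta distinction concrete, here is a minimal Python sketch of the two encoding test cases. It is an illustration only, not how any of the tested scrapers work internally: the function `pick_encoding`, its `prefer_header` flag, the two regex patterns and the sample pages are all hypothetical names and data made up for this example.

```python
import re

# Hypothetical helper patterns, written for this illustration only.
HEADER_CHARSET = re.compile(r"charset\s*=\s*([\w-]+)", re.IGNORECASE)
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def pick_encoding(content_type, body, prefer_header=True):
    """Choose a charset from the Content-Type header and the meta tag."""
    header_match = HEADER_CHARSET.search(content_type or "")
    meta_match = META_CHARSET.search(body)
    header_cs = header_match.group(1) if header_match else None
    meta_cs = meta_match.group(1).decode("ascii") if meta_match else None
    ordered = (header_cs, meta_cs) if prefer_header else (meta_cs, header_cs)
    # Fall back to UTF-8 when neither source declares a charset (an assumption).
    return ordered[0] or ordered[1] or "utf-8"


word = "проверка"

# Case 1: bytes are windows-1251, the header is right, the meta tag is wrong.
wrong_meta = b'<meta charset="utf-8"><p>' + word.encode("windows-1251") + b"</p>"
# Case 2: bytes are UTF-8, the meta tag is right, the header is wrong.
wrong_header = b'<meta charset="utf-8"><p>' + word.encode("utf-8") + b"</p>"

content_type_header = "text/html; charset=windows-1251"
for page in (wrong_meta, wrong_header):
    enc = pick_encoding(content_type_header, page)
    print(enc, page.decode(enc, errors="replace"))
```

A header-attentive scraper reads the first page correctly and garbles the second, which is the pattern seen in the results here. Calling `pick_encoding(..., prefer_header=False)` models a meta-attentive scraper, which would get the opposite outcome.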
This scraper extracts by regex, so one needs to define the extraction pattern with regular expressions or something similar. That might suit a scrape of invalid HTML, but one cannot predict what mistake the target page will contain, so I did the scrape with one general pattern.
The scrape result is as follows:
2>1 & 1<2 nonHTML unclosedmilleph.com bad nesting проверка (windows-1251) wrong meta РїСЂРѕРІРµСЂРєР° (utf-8) wrong header
Again the scraper did not find the unmatched quotes link, and it paid more attention to the HTTP header than to the meta tag.
After consulting and checking with the Sequentum tech team, I realized that VWR is able to extract the unmatched quotes link using the following regex:
with the 2nd capture group as the result.
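As an illustration of the idea (this is not the pattern supplied by Sequentum, and the sample markup is made up), a regex that tolerates mismatched quote characters around an `href` value might look like this in Python. The names `LINK_RE` and `extract_links` are hypothetical.

```python
import re

# Group 1 captures the opening quote, group 2 the URL.
# The closing quote may be either kind, so "..." , '...' and "...' all match.
LINK_RE = re.compile(
    r"""<a\s+[^>]*?href\s*=\s*(["'])(.*?)["']""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return the href values, taking the 2nd capture group of each match."""
    return [m.group(2) for m in LINK_RE.finditer(html)]


# Hypothetical sample: the first link opens with " but closes with '.
sample = (
    '<a href="http://example.com/a\'>broken</a> '
    '<a href="http://example.com/b">ok</a>'
)
print(extract_links(sample))
```

The trade-off of such a lenient pattern is that a URL which itself contains a quote character would be cut short, so it is best applied only to the page area known to hold the broken link.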
We ran the test on Content Grabber, and it produced almost the same result as its predecessor, VWR.
2>1 & 1<2
проверка (windows-1251) wrong meta
РїСЂРѕРІРµСЂРєР° (utf-8) wrong header
The scraper paid more attention to the http-header, rather than to the meta tag.
As for extracting the unmatched quotes links, Content Grabber can be programmed to do it. Just grab the whole area, choose Inner HTML, and use the following regex in the transformation script to refine the link:
with the 2nd capture group as the result: return $2
The OutWit Hub result is as follows:
This scraper fails to scrape the unmatched quotes link, and again it is more attentive to the header than to the meta tag.
Result is as follows:
(for utf-8 encoding saved) 2>1 & 1<2 nonHTML unclosed “” millepah.com bad nesting проверка (windows-1251) wrong meta РїСЂРѕРІРµСЂРєР° (utf-8) wrong header
Again the scraper fails on the unmatched quotes and does not pay attention to the meta tag.
The result for Helium Scraper is this:
|2>1 & 1<2|
|проверка (windows-1251) wrong meta|
|РїСЂРѕРІРµСЂРєР° (utf-8) wrong header|
The scraper has the unmatched quotes problem and is attentive to the HTTP header rather than to the meta tag.
Result is here:
Again the scraper missed the unmatched quotes element, and it paid more attention to the header than to the meta tag.
The result is as follows:
The scraper did not recognize the unmatched quotes link and paid more attention to the header rather than to the meta tag.
The scrapers generally performed satisfactorily, passing 5 out of 7 tasks. Web Content Extractor, Visual Web Ripper and Content Grabber did best, each passing 6 out of 7 tasks. WCE could scrape the unmatched quotes link, while VWR and CG are good at applying a regex to a chosen page area (text transformation). The rest failed to recognize the link with unmatched quotes (a single quote ‘ in place of a double quote “). The scrapers also differed in whether they paid attention to the meta tag or to the HTTP header. See the table above.