So far we have considered web scraping for legitimate purposes, but scraping is also used to steal other bloggers’ proprietary content. Let’s look at some anti-scraping WordPress plugins.
For web content, the practical indicator of ownership is which copy gets indexed first, mainly by Google. If content is scraped and immediately reposted, Google might be fooled into indexing the copy as the original, while the genuine source is treated as content farming. Higher-ranking sites may have a better chance of being indexed earlier than the site with the original content, and the latter might even be marked as spam. This is not necessarily a trend, but there have been precedents. It may seem ridiculous, but a published feed lets offenders detect new content and quickly scrape it for reposting.
We consider several approaches, and the corresponding WordPress plugins, for fighting content theft:
- Google’s Authorship signup.
- Append a “branded mark” message, visible only in the feed, to protect against feed scraping.
- Delay the RSS feed for a set length of time after posting, leaving theft sites no chance to be indexed first.
1. Google Authorship plugins
Use the Google Authorship guide to connect your content with your G+ page (read here or here). If that does not work out, for example because you have trouble inserting bylines, why not use the Google Plus Authorship plugin? After installation, just enter your Google profile URL on your profile page and place a link back from your Google Plus profile to the blog.
Another Google Authorship plugin adds the Google Authorship Badge, Google Authorship Icon and link to your WP posts (in WordPress go to Admin Panel -> Users -> Your Profile, and fill in your Google profile URL). If a post is scraped, the authorship link stays in place and keeps pointing to your G+ profile. The plugin also lets you add your bio and change your password if needed.
2. My signature at the thief’s site
How about inserting a signature in the RSS feed, so that when it is scraped and reposted, the content keeps its “branded mark”? Easily done! Just use the Anti Feed-Scraper Message plugin. After installing and activating it, go to Settings -> Anti Feed-Scraper on the WordPress Dashboard. Keep or edit the default message:
[postname] originally appeared on [sitename] on [postdate],
and your signature is now appended whenever bots catch and repost the feed. Smart. Unless they know how to cut it off …
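To make the placeholder idea concrete, here is a minimal Python sketch of how such a message template could be filled in and appended to a feed item. This is not the plugin's own code (WordPress plugins are written in PHP), and the function names `render_signature` and `append_signature` are hypothetical; only the placeholder names come from the default message above.

```python
import re
from datetime import date

# The default message quoted above; the bracketed names are its placeholders.
DEFAULT_TEMPLATE = "[postname] originally appeared on [sitename] on [postdate]"


def render_signature(template, fields):
    """Replace each [name] placeholder with fields[name] in a single pass.

    Placeholders with no matching field are left untouched.
    """
    return re.sub(
        r"\[(\w+)\]",
        lambda match: fields.get(match.group(1), match.group(0)),
        template,
    )


def append_signature(feed_content, signature):
    """Return the feed item's content with the signature on its own paragraph."""
    return f"{feed_content}\n\n{signature}"


if __name__ == "__main__":
    # Illustrative values only; a real feed would supply these per post.
    signature = render_signature(
        DEFAULT_TEMPLATE,
        {
            "postname": "My Post",
            "sitename": "Example Blog",
            "postdate": date(2013, 5, 1).isoformat(),
        },
    )
    print(append_signature("Full text of the post.", signature))
```

The substitution runs in one pass, so a post title that happens to contain text like “[sitename]” is not expanded a second time.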
3. The Feed delay plugin
Google’s distributed indexing system starts indexing web pages quite quickly. So if you delay the RSS post for a while, the original content gets indexed before any scraped copy can appear, and your authorship is protected from the duplicate-content threat. The plugin prevents the feed from publishing a post immediately. Just set up a delay.
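The delay logic itself is simple and can be sketched in a few lines of Python. Again, this is an illustration under assumptions rather than the plugin's implementation: the function name `posts_for_feed` and the post dictionaries with a `published` timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def posts_for_feed(posts, delay, now=None):
    """Return only the posts published at least `delay` ago.

    `posts` is a list of dicts, each with a timezone-aware `published` datetime.
    Newer posts stay visible on the site but are held back from the feed.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - delay
    return [post for post in posts if post["published"] <= cutoff]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    posts = [
        {"title": "Published three hours ago", "published": now - timedelta(hours=3)},
        {"title": "Published ten minutes ago", "published": now - timedelta(minutes=10)},
    ]
    # With a one-hour delay, only the older post reaches the feed.
    for post in posts_for_feed(posts, timedelta(hours=1), now=now):
        print(post["title"])
```

How long the delay should be is a judgment call: long enough for search engines to pick up the original page, short enough that feed subscribers are not kept waiting unnecessarily.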
Anti-scraping tools that protect content from being stolen for content farms have become necessary and handy for bloggers. In later posts we will develop the anti-scraping theme by reviewing more tools and methods.
If you have questions about anti-scraping tools or just want to know more about how to protect your web data, feel free to comment or leave your question through ‘Contact us’ on the side panel.