To be perfectly honest, I wasn’t sure, so I decided to try it out.
Full disclaimer here: I didn’t actually succeed. However, it was a great learning experience for me, and I think you guys could benefit from seeing what I did and where I went wrong. Who knows, maybe you can take what I’ve done and figure it out for yourself!
You can jump to any of these methods if you like…
CORS | No Referer header request | WordPress pages’ load
var xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", "http://example.com/data/json", false); // false = synchronous request
xmlhttp.send();
var data = JSON.parse(xmlhttp.responseText);
Same Origin Policy
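Under this policy the browser lets a script read a response only when the page and the resource share an origin, meaning the same scheme, host, and port. Here is a minimal sketch of that comparison; the function name isSameOrigin is made up for illustration, and the browser performs this check internally rather than exposing it like this.

```javascript
// Compare two absolute URLs the way the same-origin policy does:
// scheme, host, and port must all match. The URL class normalizes
// default ports, so ":80" on an http URL yields an empty port string.
function isSameOrigin(urlA, urlB) {
  var a = new URL(urlA);
  var b = new URL(urlB);
  return a.protocol === b.protocol &&
         a.hostname === b.hostname &&
         a.port === b.port;
}

// Different scheme, different subdomain, or different port all count as foreign.
var sameHost = isSameOrigin("http://example.com/a", "http://example.com:80/b");
var otherScheme = isSameOrigin("http://example.com/", "https://example.com/");
```

Note that even www.example.com and example.com are different origins under this rule.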
Can you bypass it? No.
You might think you can just bypass this: to request foreign-domain resources, we imitate a same-domain origin. How? By spoofing the Referer header in the XMLHttpRequest. Browsers, however, treat Referer as a forbidden header name and ignore a script’s attempt to set it.
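A minimal sketch of such an attempt is below. The function name and the XhrCtor parameter are hypothetical (the parameter only exists so the function can be tried with a stub outside a browser). In a real browser the setRequestHeader call for Referer is silently dropped, and the response would still be blocked unless the server sends CORS headers.

```javascript
// Attempt to fake the Referer on a cross-origin request.
// XhrCtor is an illustrative parameter; in a page it defaults to XMLHttpRequest.
function requestWithSpoofedReferer(url, fakeReferer, XhrCtor) {
  var Ctor = XhrCtor || XMLHttpRequest;
  var xhr = new Ctor();
  xhr.open("GET", url, true);
  // Referer is a forbidden header name: a browser ignores this call
  // and still sends the real page address as the Referer.
  xhr.setRequestHeader("Referer", fakeReferer);
  xhr.send();
  return xhr;
}
```

So the spoof never reaches the wire, which is why this route is a dead end.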
Resources allowed to be requested from a foreign domain
Now, there are services that do allow cross-origin resource sharing. This is useful for distributing resources so that a single server carries less load. E.g. CSS stylesheets, images, and scripts might be served from foreign domain servers. Here are some examples of resources which may be embedded cross-origin:
- JavaScript with <script src="..."></script>. Error messages for syntax errors are only available for same-origin scripts.
- CSS with <link rel="stylesheet" href="...">.
- Images with <img>. Supported image formats include PNG, JPEG, GIF, BMP, SVG.
- Media files with <video> and <audio>.
- Plug-ins with <object> and <embed>.
- Fonts with @font-face. Some browsers allow cross-origin fonts, others require same-origin fonts.
- Anything with <iframe>. A site can use the X-Frame-Options header to prevent this form of cross-origin interaction.
One example is the jQuery library, which is often served from the ajax.googleapis.com domain.
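As a sketch, the tag a page would use to pull jQuery from that host can be built like this. The helper name is made up, and the version number passed in is an assumption for illustration only; use whichever release your page actually needs.

```javascript
// Build the <script> tag for a Google-hosted jQuery release.
// The version argument is an illustrative assumption, not a recommendation.
function googleHostedJqueryTag(version) {
  var src = "https://ajax.googleapis.com/ajax/libs/jquery/" + version + "/jquery.min.js";
  return '<script src="' + src + '"><\/script>';
}

var tag = googleHostedJqueryTag("3.7.1");
```

The browser fetches and runs this script even though it lives on a foreign domain, because script embedding is one of the allowed cross-origin cases listed above.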
Cross-Origin Resource Sharing (CORS)
The main concept is that a target server may allow some other origins (or all of them) to request its resources. A server configured to allow cross-origin requests is useful for cross-domain API access to its resources.
If a server allows CORS for every origin, it responds with the Access-Control-Allow-Origin: * header.
If a resource owner restricts sharing to a certain domain, the server responds with that origin in the header instead of the wildcard, for example Access-Control-Allow-Origin: http://example.com.
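On the server side, the choice between the wildcard and a single origin could be sketched as follows. The function name and the allowedOrigins list are hypothetical, and a real server would also handle credentials and other CORS headers.

```javascript
// Decide which CORS headers to send for a request from requestOrigin.
// allowedOrigins is an illustrative allowlist; "*" means every origin.
function corsHeadersFor(requestOrigin, allowedOrigins) {
  if (allowedOrigins.indexOf("*") !== -1) {
    return { "Access-Control-Allow-Origin": "*" };
  }
  if (allowedOrigins.indexOf(requestOrigin) !== -1) {
    return {
      "Access-Control-Allow-Origin": requestOrigin,
      // Tell caches that the response differs per requesting origin.
      "Vary": "Origin"
    };
  }
  // No CORS header at all: the browser blocks the cross-origin read.
  return {};
}

var openApi = corsHeadersFor("http://client.example", ["*"]);
var restricted = corsHeadersFor("http://client.example", ["http://example.com"]);
```

Here openApi carries the wildcard, while restricted is empty because http://client.example is not on the list.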
For requests that are not simple, the browser first sends a preflight OPTIONS request to find out whether the server allows foreign-domain access.
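A simplified sketch of how the answer to a preflight can be read is below. It assumes the response headers arrive as a plain object with lower-cased names, and it ignores credentials and custom request headers, so treat it as an illustration rather than the full check a browser performs.

```javascript
// Decide whether a preflight response permits `method` from `origin`.
// responseHeaders: plain object with lower-cased header names (an assumption).
function preflightAllows(responseHeaders, origin, method) {
  var allowedOrigin = responseHeaders["access-control-allow-origin"];
  var allowedMethods = (responseHeaders["access-control-allow-methods"] || "")
    .split(",")
    .map(function (m) { return m.trim().toUpperCase(); });
  var wanted = method.toUpperCase();
  // GET, HEAD and POST are CORS-safelisted and need not be listed explicitly.
  var safelisted = ["GET", "HEAD", "POST"];

  var originOk = allowedOrigin === "*" || allowedOrigin === origin;
  var methodOk = safelisted.indexOf(wanted) !== -1 ||
                 allowedMethods.indexOf(wanted) !== -1 ||
                 allowedMethods.indexOf("*") !== -1;
  return originOk && methodOk;
}
```

For instance, a response with access-control-allow-origin set to * and access-control-allow-methods set to PUT, DELETE would permit a PUT from any origin but not a PATCH.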
No Referer form submission
We’ve mentioned before that an <iframe> can load foreign data regardless of the same-origin policy.
Let’s try a form submission with no Referer header. Most sites accept a request whose Referer header is empty (omitted). Websites do this because they don’t want to lose roughly 1% of their traffic. So we write a simple procedure that requests a chosen domain through a virtual form submission:
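var virtualForm = 'data:text/html,<form id="genform" action="http://www.' + site_url + '" method="GET"> <input type="submit" ></form> <script>genform.submit()<\/script>';
var iframe = document.createElement('iframe');
iframe.src = virtualForm; // the data: document submits the form, so no Referer is sent
document.body.appendChild(iframe); // attach the iframe so the browser loads it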
This code, when called client-side, adds a new <iframe> to the web page and loads the requested resource into it. The whole code is here. Kind of loading.
A web sniffer capture of this request shows the Origin header being null and no Referer header present.
The loaded site works seamlessly in the iframe, yet you can’t access its HTML. You can get the page’s screenshot as an image, but that’s not sufficient for full-scale web scraping.
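To see where the wall is, here is a small sketch of trying to read the framed page. The function name is made up; for a cross-origin frame the browser returns null for contentDocument, and the try/catch is there in case an access raises a security error instead.

```javascript
// Try to read the HTML of a framed page; returns null when access is denied.
function tryReadFrameHtml(iframe) {
  try {
    var doc = iframe.contentDocument; // null for a cross-origin frame
    return doc ? doc.documentElement.outerHTML : null;
  } catch (err) {
    // Some access paths throw a SecurityError instead of returning null.
    return null;
  }
}
```

Called on the iframe created above, this gives null, while the same call on a same-origin frame returns its markup.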
How does WordPress load foreign page shots into its admin panel
Now the CMS makes an HTTP request to its own server, embedding the link to the foreign resource. Obviously, the WordPress server then requests the resource at the provided link and returns the content.
The only catch is that the content returned by WordPress is an image: content-type: image/jpeg. You can program a server to return HTML code instead, but that’s server-side data extraction.
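The pattern described here, where the page asks its own server and embeds the foreign link in that request, can be sketched on the client side like this. The /proxy endpoint and the url parameter are hypothetical names for illustration and are not WordPress’s actual API.

```javascript
// Build a same-origin request URL that carries the foreign link as a parameter.
// proxyEndpoint and the "url" parameter name are illustrative assumptions.
function buildProxyUrl(proxyEndpoint, targetUrl) {
  return proxyEndpoint + "?url=" + encodeURIComponent(targetUrl);
}

var proxied = buildProxyUrl("/proxy", "http://example.com/page?id=1");
// The page would then request `proxied` from its own server,
// which fetches the foreign resource and returns the result.
```

Because the browser only ever talks to its own origin, the same-origin policy is not involved; the actual extraction happens on the server.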
Feel free to add more to this topic (using comments).