We now launch the new Web Scraper Test Drive stage with the Login Form Test. The test checks whether the web scrapers can get past a login before they touch the actual data to be scraped. For each scraper we check form submission via POST, handling of the HTTP 302 redirect, and cookie storage.
|Form submission||302 redirect handling||Cookie storing|
The Web Content Extractor demonstrated a good result. The scraper has a built-in login form submitter.
At the very start of a new project (at the Starting URLs window), there will be an option at the bottom for Auto Login Parameters:
Check the box, insert the URL of the page with the login form and press the “Auto Login Parameters” button; the Login Parameters window appears, where you can enter the form field parameters.
Now enter the parameters and press the login button on the web page form; the Auto Submit Script will be generated automatically. You can preview and test the Auto Submit Script by pressing the corresponding buttons:
After that, you can go on to compose the extraction project, and the initial Auto Submit will be performed successfully.
For Mozenda Extractor, in the Agent Builder, simply choose ‘Set user input’ among the actions. Then fill in the fields and click the login button (choose the ‘Click Item’ option).
The Agent Builder will save the action for the Agent and execute it properly.
To compose the form submission with the proper fields, we first define the form fields. Click on the form entry (1), then go to the Capture area, choose the Content tab (2) and enter the field value (3) to be inserted into the form. Then press the New button (4) and a new element of the FormField type will be generated. Press the Save button (5) to save it.
After we have defined the FormField type elements, we create a new template of the FormSubmit type. To do this, click on the Template tab and then click on the Login button. Name the new template, e.g. New Template, and save it.
Now we need to assign the form fields to this template. To do this, go back to the Content tab. Then, for each form field, do the following:
- Press Edit button
- Go to Options area, Form tab
- Choose the previously created template in the Form template dropdown list
Now, when you press the open button on the “New Template” template, the scraper will perform the form submission, follow the redirect and authorize the session.
Working out the login form submission in Content Grabber was a pure pleasure. When you select an input field, the scraper defines a Set Form Field command that lets the user provide input values for this web form field. In addition, the user can set multiple values for each input field, and the scraper will iterate over all of them.
The last command (Navigate Link) is for the ‘submit’ button that submits the form; it is also created automatically when you pick a submit button.
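To illustrate what iterating over several values per input field can mean, here is a minimal Python sketch. It assumes that "iterate over all of them" means trying every combination of the supplied values; that is my reading, not documented Content Grabber behavior. The function name `field_combinations` and the value `guest` are made up for the example; `usr`, `pwd`, `admin` and `12345` are the test form values.

```python
from itertools import product


def field_combinations(field_values):
    """Yield one dict per combination of the values given for each form field.

    field_values maps a field name to the list of values to try for it.
    Assumption: every combination is submitted once (Cartesian product).
    """
    names = list(field_values)
    for combo in product(*(field_values[name] for name in names)):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    # "guest" is a hypothetical second user name, used only for illustration.
    for fields in field_combinations({"usr": ["admin", "guest"], "pwd": ["12345"]}):
        print(fields)
```

Each yielded dictionary corresponds to one form submission with one value per field.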
Big thanks to the Content Grabber creators for such a fantastic visual scraping tool!
As usual, we start by recording a proxy session. The very first HTTP request, the request for the initial URL, should then be generated into a scrapeable file. Call it Home.
Then we proceed to log in on the web form. We now have the other HTTP requests caught by the proxy. The one with parameters is the actual POST request.
Login Scrapeable File
Now we generate a scrapeable file out of this POST request and call it Login. Set the sequence of the Login scrapeable file to 2, so that it is requested right after the Home scrapeable file. The picture below shows the Parameters tab of the Login scrapeable file; those parameters will be sent automatically when the scrapeable file is executed.
Now, when we run the scraping session, both the Home and Login scrapeable files will be executed; the parameters of the latter will be used for logging in and cookie passing.
Screen-scraper automatically tracks cookies, just like a web browser, so by requesting the login file near the beginning, any subsequent pages protected by the login will be accessible. Screen-scraper will automatically follow redirects, too.
Scraping session results:
Running scraping session: Login Form
Processing scripts before scraping session begins.
Scraping file: “Home”
Home: Resolved URL: http://testing-ground.extract-
Home: Sending request.
Scraping file: “Login_Form”
Login_Form: Resolved URL: http://testing-ground.extract-
Login_Form: POST data: usr=admin&pwd=12345
Setting referer to: http://testing-ground.extract-
Login_Form: Sending request.
Login_Form: Redirecting to: http://testing-ground.extract-
Login_Form: Extracting data for pattern “Untitled Extractor Pattern”
Login_Form: The following data elements were found:
Untitled Extractor Pattern–DataRecord 0:
Storing this value in a session variable.
Processing scripts after scraping session has ended.
Processing scripts always to be run at the end.
Scraping session “Login Form” finished.
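For readers who want to see the three checked steps (POST, 302 redirect, cookie) outside of any scraper, here is a minimal Python sketch using only the standard library. It is not how screen-scraper works internally. `DemoLoginHandler` is a hypothetical local stand-in for the test page, and the paths `/login` and `/welcome`, the cookie name `session` and the ACCESS DENIED text are all assumptions made for the demo. Only the POST data `usr=admin&pwd=12345` and the WELCOME result come from the test itself.

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoLoginHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: POST /login answers 302 -> /welcome."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        fields = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = fields.get("usr") == ["admin"] and fields.get("pwd") == ["12345"]
        self.send_response(302)
        if ok:
            # The session cookie is set on the redirect response itself.
            self.send_header("Set-Cookie", "session=ok; Path=/")
        self.send_header("Location", "/welcome")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # The protected page is shown only if the cookie was sent back.
        logged_in = "session=ok" in self.headers.get("Cookie", "")
        body = b"WELCOME" if logged_in else b"ACCESS DENIED"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def login(base_url, usr, pwd):
    """POST the credentials, follow the redirect, return (page text, cookie names)."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore system proxies for the local demo
        urllib.request.HTTPCookieProcessor(jar),  # stores and resends cookies
    )
    data = urllib.parse.urlencode({"usr": usr, "pwd": pwd}).encode()
    # Passing data makes this a POST; urllib follows the 302 with a GET.
    with opener.open(base_url + "/login", data=data) as resp:
        return resp.read().decode(), [cookie.name for cookie in jar]


def run_demo(usr="admin", pwd="12345"):
    """Start the stand-in server on a free local port and log in against it."""
    server = HTTPServer(("127.0.0.1", 0), DemoLoginHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return login(f"http://127.0.0.1:{server.server_port}", usr, pwd)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

With the correct credentials the client ends up on the welcome page holding the `session` cookie. With a wrong password the redirect is still followed, but no cookie is stored and the stand-in server refuses access.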
Knowing the nature of the OutWit scraper, I could not at first find any way for OutWit to submit the Test Drive form. A support inquiry, however, soon explained how to do it.
The way to do this is to generate POST queries. OutWit 3.0 does not yet sniff the HTTP queries (4.0 will) but the way to generate the queries when you know the fields you want is very simple: http://testing-
So in the OutWit Hub the HTTP POST parameters are built into the URL. Of course, to compose this kind of HTTP POST request, one needs to inspect the form fields in the source HTML. The format of POST queries can be mixed with OutWit’s query generation patterns, so that whole lists of GET and POST queries can be sent as part of a workflow.
I inserted the given query into a new Macros Automator and attached a custom scraper to it. This gave me the right result: WELCOME
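Exploring the form fields by hand is easy for a two-field form, but it can also be scripted. The sketch below, in Python with the standard library only, collects the named inputs of the first form on a page and builds the POST body from them. `SAMPLE_HTML` is a hypothetical reconstruction of a login form, not the real test page markup, and its `/login` action is an assumption. The field names `usr` and `pwd` are taken from the POST data in the session log above. It does not reproduce OutWit's own query syntax.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collects action, method and named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").upper()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a name (e.g. a plain submit button) are not sent.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # only the first form is of interest


# Hypothetical markup, assumed for illustration only.
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="text" name="usr">
  <input type="password" name="pwd">
  <input type="submit" value="Login">
</form>
"""

parser = FormFieldParser()
parser.feed(SAMPLE_HTML)

# Fill in the credentials used in the test and encode the POST body.
parser.fields.update(usr="admin", pwd="12345")
post_body = urlencode(parser.fields)

if __name__ == "__main__":
    print(parser.method, parser.action, post_body)
```

The resulting `post_body` is the same kind of parameter string that appears as POST data in the log, and it is what ends up encoded in a POST query.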
The WebSundew scraper completed the task successfully.
To log in to the website, create an agent, enter the credentials into the admin and password fields and then press the Login button. The scraper takes care of the login logic. Once the scraper has logged in and reached the next page, one can proceed to define the extraction pattern.
The program saves and passes the cookie.
With Helium Scraper, I first composed the ‘kinds’ for the form fields and the submit (login) button. Then a premade action tree, “Auto Login”, needs to be imported: File -> Online Premades -> Auto Login. This action tree contains the actions that perform the auto login. Then, as I build the project action tree, I need to invoke this premade one through “Execute actions tree”:
When I invoked it, it prompted me to supply 5 parameters: the kind that selects the username box, the kind that selects the password box, the kind that selects the submit button, and 2 more: the username value and the password value.
Now it’s time to test the auto login action. It works perfectly.
To log in with this scraper, add the login option to the existing project. In the toolbar menu go to Project -> Login Website. Then, as I enter the form data, the scraper saves it for further use.
Basically, all the scrapers have login functionality, so I met no frustrations when logging in to the website. With some, this functionality is evident, while with others it required a support inquiry.