This tutorial describes how CasperJS can be used to scrape or test multiple pages at a time. CasperJS is a navigation scripting and testing utility. Its execution is sequential: each navigation step runs only after the previous one has finished. For a small number of steps this behavior is perfectly fine, but as the number of steps increases, the total run time can become very large. This problem can be solved by introducing parallelism into the execution of navigation steps.
// Assumes `casper`, `results`, `query` and addScrapedLinksToResults
// are already defined earlier in the script.
var url = 'https://www.google.co.in/search?q=' + query;
casper.start();
// First keyword: open the results page, then scrape it
casper.thenOpen(url);
casper.then(addScrapedLinksToResults(query));
// Second keyword: queue the same two steps again
query = 'facebook';
url = 'https://www.google.co.in/search?q=' + query;
casper.thenOpen(url);
casper.then(addScrapedLinksToResults(query));
casper.run(function() {
    // echo results in some pretty fashion
    this.echo('Done');
    for (var key in results) {
        this.echo(results[key].length + ' links found for ' + key + ':');
        this.echo(' - ' + results[key].join('\n - '));
    }
    this.exit();
});
Google Scraping: For Multiple Keywords, Using an Array
Each thenOpen and then call adds a navigation step to the execution stack of CasperJS, so the code for scraping the results of multiple keywords can easily be adapted to loop over an array of keywords. For more information on navigation steps, see: What Does ‘Then’ Really Mean in CasperJS
// Scraped links, keyed by keyword
var results = {};
var casper = require('casper').create();
casper.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64)" +
    " AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36");
var query = ['google', 'facebook', 'twitter', 'pinterest', 'whatsapp', 'skype'];
casper.start();
// Queue two navigation steps per keyword: open the results page, then scrape it.
// addScrapedLinksToResults is assumed to be defined elsewhere in the script;
// it returns the step function that scrapes the links for one keyword.
for (var i = 0; i < query.length; i++) {
    var url = 'https://www.google.co.in/search?q=' + query[i];
    casper.thenOpen(url);
    casper.then(addScrapedLinksToResults(query[i]));
}
casper.run(function() {
    // echo results in some pretty fashion
    this.echo('Done');
    for (var key in results) {
        this.echo(results[key].length + ' links found for ' + key + ':');
        this.echo(' - ' + results[key].join('\n - '));
    }
    this.exit();
});
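The listing above calls addScrapedLinksToResults without showing it. The following is a minimal sketch of what such a helper might look like, not the original implementation. The scrapeLinks name and the 'h3.r a' selector are assumptions made for illustration, and the selector would likely need adjusting to whatever markup the results page currently uses. The helper returns a function so that each queued step captures its own keyword rather than sharing the loop variable.

```javascript
// Shared store: keyword -> array of scraped hrefs.
// Declared here only to keep the sketch self-contained;
// the listing above already defines it.
var results = {};

// Hypothetical scraper, meant to run inside the page via casper.evaluate().
// The 'h3.r a' selector is an assumption about the result page's markup.
function scrapeLinks() {
    var nodes = document.querySelectorAll('h3.r a');
    return Array.prototype.map.call(nodes, function (a) {
        return a.getAttribute('href');
    });
}

// Returns a step function bound to one keyword. CasperJS invokes step
// functions with `this` set to the casper instance.
function addScrapedLinksToResults(keyword) {
    return function () {
        // evaluate() may return null if the page-side function fails,
        // so fall back to an empty list.
        var links = this.evaluate(scrapeLinks) || [];
        results[keyword] = links;
    };
}
```

Because the keyword is passed in as an argument, the step queued for 'google' still stores its links under 'google' even though the loop has long finished by the time the step actually runs.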
This code produces correct and complete results. Based on observations made in the previous run, the following conclusions can be drawn:
The keyword-to-URL mapping is not correct.
For some keywords, no results were scraped: the then() step (the scraping step) executed before the page had loaded. The mapping problem above could also be a result of this issue.
Apparently only one instance worked: calling exit() on one casper instance caused all the other instances to exit.
To deal with the first two problems, jQuery was injected into every result page, and with the help of $.ready each page was scraped only once it was ready. This solution eliminated both the mapping problem and the zero-results problem. For the third problem, a status flag was added to each casper instance with casper.completed = false. The flag was set to true once the instance had executed all of its steps. All the flags were checked before calling exit(), and exit() was issued only if the flags for all the casper instances were set to true.
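The status-flag approach can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original code: the helper names allCompleted and markCompleted and the instances array are hypothetical, and only the completed property and the exit() call come from the description above.

```javascript
// True only when every instance has finished all of its steps.
function allCompleted(instances) {
    return instances.every(function (instance) {
        return instance.completed === true;
    });
}

// Hypothetical helper: call it from each instance's run() callback
// instead of calling exit() directly.
// Returns true if exit() was actually issued.
function markCompleted(instance, instances) {
    instance.completed = true;
    if (allCompleted(instances)) {
        // exit() on one instance ends every instance,
        // so a single call from the last finisher is enough.
        instance.exit();
        return true;
    }
    return false;
}
```

In use, each casper instance would start with completed set to false and be collected in one array, and its run() callback would call markCompleted(this, instances). Whichever instance finishes last is then the only one that issues exit().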