Google Scraper FAQ

What does the Google Scraper actually do?

The Google Scraper is a piece of software that lets you batch query Google. You enter a set of URLs and a set of keywords, and for each URL-keyword combination Google is queried for [keyword site:URL].
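
As an illustration, here is a minimal sketch of that batching step in Python (not the tool's actual code; the input lists are made up):

    # Build one Google query per URL-keyword combination,
    # in the form [keyword site:URL].
    urls = ["un.org", "facebook.com"]            # hypothetical input
    keywords = ["emergency", "tipping point"]    # hypothetical input

    queries = [f"{kw} site:{url}" for url in urls for kw in keywords]
    # -> ['emergency site:un.org', 'tipping point site:un.org',
    #     'emergency site:facebook.com', 'tipping point site:facebook.com']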

I’ve been setting the max number of hits to 1000 but it never seems to hit the maximum

Question (continued): ... and I’ve noticed a lot of results in the order of 400-600. My maximum so far is 679 with ‘emergency’ at facebook.com. I just tested scraping the word ‘and’ at facebook.com and the result was 686 hits. There seems to be a limit in the tool on how many hits you can get, other than the 1000 I had set. Is that right?

Answer:

The full URL to the Google query would be http://www.google.com/search?as_q=emergency&num=100&hl=&btnG=Google+Search&as_qdr=all&lr=&as_ft=i&as_filetype=&as_occt=any&as_dt=i&as_sitesearch=facebook.com&as_rights=&safe=images&cr=&ie=UTF-8&filter=1&pws=0
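
For reference, such a URL can be reconstructed as follows (a sketch: the parameter names are taken from the URL above; the commented meanings are our reading of them, not official documentation):

    # Rebuild the advanced-search URL shown above.
    from urllib.parse import urlencode

    params = {
        "as_q": "emergency",              # the keyword
        "num": 100,                       # results per result page
        "as_sitesearch": "facebook.com",  # restrict results to this site
        "as_occt": "any",
        "ie": "UTF-8",
        "filter": 1,                      # let Google collapse near-duplicate results
        "pws": 0,                         # disable personalized results
    }
    print("http://www.google.com/search?" + urlencode(params))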

At the top of Google's result page it is indicated that there are about 13,400,000 results. However, Google will not provide all of those results; it only presents the 'most relevant' ones.

By default, Google's result page shows about 10 results. Via the advanced settings (and in the URL above) it is possible to have Google return 100 results per result page. If there are more than 100 results available (which in this case there are, considering Google indicates there are about 13,400,000 results), another batch of 100 results can be fetched through the pagination at the bottom of the result page. However, Google never returns more than 1000 results in total. In the case of the query [emergency site:facebook.com], Google returns 7 result pages with at most 100 results each: 6 full pages plus 79 results on the seventh. This explains the number 679.

Why is the 'number of results per query' set to 100?

The Google Scraper has the setting "Number of results per query (max 1000)". By default this parameter is set to 100, as that results in a single request to Google (for one result page of 100 results). If the parameter is set higher than 100, the Google Scraper will loop over Google's result-page pagination (at the bottom of the result page) until either Google returns no more results (for the above example of 679 results that would be at page 7), or the desired "number of results per query" has been reached (e.g. 200 in the example of https://wiki.digitalmethods.net/Dmi/ToolLippmannianDevice#Example:_Multiple_sources_and_a_single_issue_or_key_word._40Seminal_Lippmannian_Device_45_source_partisanship_with_respect_to_single_issue_or_key_word_45_Craig_Venter_in_the_Synthetic_Biology_Issue_Space_41).

The Google Scraper's notation page[0], page[1], ... refers to where the scraper is in paginating over the result pages. Page[0] means the scraper is retrieving the first result page with 100 results, page[1] means the scraper is retrieving the second result page with 100 results, and so on.
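
A sketch of that pagination loop, assuming a hypothetical helper fetch_result_page(query, start) that requests one result page (via Google's start offset) and returns its parsed results, at most 100 per call:

    import math

    def scrape(query, max_results=1000, per_page=100):
        results = []
        for page in range(math.ceil(max_results / per_page)):
            # page[0], page[1], ... in the scraper's notation
            batch = fetch_result_page(query, start=page * per_page)  # hypothetical helper
            results.extend(batch)
            if len(batch) < per_page:   # no full page: Google ran out of results,
                break                   # e.g. page[6] holds only 79 of the 679
        return results[:max_results]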

Note: videos are not parsed as results.

What is the difference between 'Retrieved by Google Scraper' and 'Estimated by Google' in the tag cloud output?

In the resulting tag clouds one can choose to size the tags by 'Retrieved by Google Scraper' or by 'Estimated by Google'. The former is the number of results actually retrieved (679 in the above example); the latter is the number as indicated by Google (13,400,000). Also note that the latter number is an estimate and may vary considerably depending on location or time of day.

Is it possible to run multiple scrapes simultaneously?

No. In traditional mode you will be queued for sequential execution. In client mode the toolbar will get confused about which browser window to communicate with.

Querying hosts and URLs

If the box "only query discrete sites" is checked, the scraper will search the entire site. If it is unchecked, it will search the specific page.

Checking the box chops every URL after the host: e.g. http://un.org/issues/etcetera becomes http://un.org. Thus the full site is searched instead of the individual page. Also note that Google itself strips everything from the URL after the question mark, so e.g. un.org/issues/index.php?parameter1=foo&parameter2=bar is queried as un.org/issues/index.php.
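
Both reductions are easy to reproduce with Python's standard urlparse (a sketch of the assumed behaviour, not the tool's actual code):

    from urllib.parse import urlparse

    url = "http://un.org/issues/index.php?parameter1=foo&parameter2=bar"

    # With the box checked: chop the URL after the host.
    p = urlparse(url)
    host_only = f"{p.scheme}://{p.netloc}"    # -> 'http://un.org'

    # What Google itself keeps: everything before the question mark.
    without_query = url.split("?", 1)[0]      # -> 'http://un.org/issues/index.php'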

I noticed that the tool does not always find the words I’m looking for.

Question (continued): For example, it only finds the word ‘geoengineering’ once on www.keith.seas.harvard.edu/geo.html and “tipping point” only once on www.en.wikipedia.org/wiki/Geoengineering, even though these terms occur several times on those pages.

Answer: If you query a specific word in a specific page, the tool can only detect whether that word appears on the page or not. That is, if you query [geoengineering site:www.keith.seas.harvard.edu/geo.html] in Google (e.g. https://www.google.com/search?as_q=geoengineering&num=100&hl=&btnG=Google+Search&as_qdr=all&lr=&as_ft=i&as_filetype=&as_occt=any&as_dt=i&as_sitesearch=www.keith.seas.harvard.edu/geo.html&as_rights=&safe=images&cr=&ie=UTF-8&filter=1&pws=0) it returns only one result. In effect, the tool thus only measures presence or absence. If you query a site, as in [geoengineering site:www.keith.seas.harvard.edu], the tool will return how many pages within that site contain the word geoengineering at least once.

So the tool can’t query homepages (if that’s the right terminology) as individual pages?

Not unless you are able to locate a URL of the homepage which is more specific than just the hostname (e.g. http://example.com/index.php instead of http://example.com). This is not possible for all sites.

How can I verify the query made to Google?

Each request to Google will be linked by its full URL in the Process log.

How many requests can the tool handle?

A request to Google is made for each URL-keyword combination. If one enters 10 URLs and 7 keywords, 70 requests to Google will be made, provided that the "number of results per query" is set to 100 or fewer (so that each combination needs only one request).
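
A back-of-the-envelope upper bound on the request count (a sketch; real scrapes may need fewer requests, since Google often runs out of results before the cap is reached, as explained above):

    import math

    def max_requests(n_urls, n_keywords, results_per_query=100):
        # Each result page holds at most 100 results, so each URL-keyword
        # combination needs up to ceil(results_per_query / 100) requests.
        return n_urls * n_keywords * math.ceil(results_per_query / 100)

    print(max_requests(10, 7))        # -> 70, as in the example above
    print(max_requests(10, 7, 1000))  # -> 700 at most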

Batches of 100 requests should be no problem; we have had a few successes with more than 1000 requests as well. To be sure, try to stay on the low side.

The scraper is not reacting (or Firebug shows "TypeError: can't access dead object")

Make sure all your plugins (e.g. Adobe Flash Player) are up to date: go to the Tools menu > Add-ons > Plugins > Check for updates.

Help! Nothing works anymore

Try hard-reloading the Google Scraper. You might also try closing your browser and then reloading the Google Scraper. Lastly, you can try clearing your browser's cache and cookies.

How long do I have to fill in the captcha?

12 hours

Proxies

You can install a Firefox extension like FoxyProxy, which allows you to surf the Web through any kind of proxy. As the Google Scraper offloads its requests through your browser, the requests will also go through the proxy if your browser is set to use one.

Can I close my laptop and continue from home?

No. The tool expects you to be at the same IP address for the full duration of the scrape.

More info

https://wiki.digitalmethods.net/Dmi/ToolGoogleScraper

https://wiki.digitalmethods.net/Dmi/ToolLippmannianDevice

https://wiki.digitalmethods.net/Dmi/FirefoxToolBar