Things Internet Researchers Should Know About Search Engines

This page lists useful tips for doing research with search engines, particularly in combination with the Search Engine Scraper and the Lippmannian Device.

Good query design

See our lecture on query design.

Consider what it takes to turn search into research.
Look into the search operators the search engine supports (here are Google’s, for example).
Use quotes around every word that needs to be included literally, as some engines (such as Google) may otherwise:
- make automatic spelling corrections
- personalize search by using information such as sites visited before
- include synonyms of search terms (matching “car” when you search [automotive])
- find results that match similar terms to those in the query (finding results related to “floral delivery” when searching [flower shops])
- search for words with the same stem, e.g. “running” when [run] was submitted
- make some of the terms optional, like “circa” in [the scarecrow circa 1963]
Some queries might result in overly fresh results, or as Google puts it: “Search results, like warm cookies right out of the oven or cool refreshing fruit on a hot summer’s day, are best when they’re fresh”
Take into account transliterations: “osama bin laden” vs “osama ben laden” vs the Arabic spelling. Use e.g. Wikipedia’s articles in other languages.
Discrete and underspecified search terms often work well
When noting down queries in a research report one can use brackets. E.g. we queried [HIV] in google.co.uk and later refined the query to [“AIDS”] so that no synonyms are included.
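
The quoting advice above can be sketched in a few lines of Python. This is a minimal illustration and not part of the Search Engine Scraper: the function names quote_terms and bracket are made up for this example, and it assumes the engine treats a double-quoted word as a literal match.

```python
def quote_terms(query: str) -> str:
    """Wrap every whitespace-separated term in double quotes.

    Existing quotes are dropped first, so a quoted phrase is split into
    separately quoted words (an assumption made to keep the sketch simple).
    """
    terms = query.replace('"', " ").split()
    return " ".join(f'"{term}"' for term in terms)


def bracket(query: str) -> str:
    """Render a query in the bracket notation used in research reports."""
    return f"[{query}]"


if __name__ == "__main__":
    # The refined query from the example above, in report notation.
    print(bracket(quote_terms("AIDS")))
    print(bracket(quote_terms("flower shops")))
```

Noting down the bracketed form alongside the engine and date makes it easier to repeat a query later with exactly the same terms.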

Disentangling the researcher from the results

See our video tutorials on analyzing engine results and localizing web sources.

When using the Search Engine Scraper or Lippmannian Device with our Firefox toolbar, the researcher needs to take a few steps to ensure that day-to-day browsing activities do not interfere with the research.

Consider installing a separate version of Firefox, a so-called research browser, used solely for research purposes.
- Alternatively use a specially created Firefox profile or install a separate version of Firefox on a USB stick.
- See our video on setting up a research browser.
In the research browser, make sure to log out of any services that may be linked to the search engine. See our video on setting up Google for research; the same applies to other engines, e.g. Bing may be linked to your Microsoft account if you have one.
- Even when logged out, a search engine may personalize results based on previously stored cookies. For the most neutral search results, clear your cookies before searching, or configure Firefox to not allow cookies at all.

Search Engine Peculiarities

The search engine scraper tools allow searching with a search engine of your choice. Most search engines have some peculiarities, which are useful to keep in mind while scraping and analysing results:

Baidu

Baidu is a Chinese search engine, and thus focuses on Chinese sites and results.
Baidu does not show the URLs of found sites in its results, but rather a redirect URL. If it is the URLs you are interested in, you will need to resolve those redirects yourself.

Bing

Bing supports many search engine operators.
Many of Bing's search query customizations (e.g. searching by region or tweaking safe search settings) are set via a cookie. This method of customizing a query is currently not supported by the scraper tools, but you can go to https://www.bing.com/account/ to change search settings. Make sure to do this after you've cleared your cookies.
The way Bing indicates result count estimates is a little complicated (it has various ways of phrasing the estimate, e.g. "approximately 5000 results" and "result 50-100 of 5000"). The scraper will attempt to parse this value, but if results look off it is a good idea to double-check.
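
When double-checking a parsed estimate, it can help to see what such parsing might look like. The sketch below handles only the two phrasings quoted above; it is not the scraper's actual code, the name parse_result_count is hypothetical, and the wording on live result pages may differ.

```python
import re
from typing import Optional

# Patterns for the two phrasings quoted above (assumptions, not a
# complete list of what Bing may display).
_PATTERNS = [
    re.compile(r"\bof\s+(\d[\d.,]*)", re.IGNORECASE),        # "result 50-100 of 5000"
    re.compile(r"(\d[\d.,]*)\s+results?\b", re.IGNORECASE),  # "approximately 5000 results"
]


def parse_result_count(text: str) -> Optional[int]:
    """Return the estimated total as an integer, or None if none is found."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            # Drop thousands separators such as "," or "." before converting.
            return int(re.sub(r"\D", "", match.group(1)))
    return None


if __name__ == "__main__":
    for phrase in ("approximately 5000 results", "result 50-100 of 5000"):
        print(phrase, "->", parse_result_count(phrase))
```

A return value of None is a signal to look at the result page by hand rather than to record zero results.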

DuckDuckGo

See the DuckDuckGo help for a list of supported operators.
DuckDuckGo does not offer (an estimate of) the total number of results.

Google

A list of Google-supported query operators.
A CAPTCHA is triggered after a number of results when scraping results automatically (i.e. via a tool).
The number of results varies from request to request. It usually stays approximately the same, but may fluctuate by a small amount.
It is virtually never possible to get as many results as Google initially estimates, even when using a normal browser.
Google personalizes results to a great extent. For the best results, log out of all Google services, clear cookies, and opt out of further personalization.
- Results are further personalized based on locale (i.e. location and language). This is difficult to avoid entirely, but in extreme cases you may want to consider using a browser in a different language.
Google publishes a transparency report on search result removals. This may give some insights in case of conspicuously 'missing' search results.
If a non-English language is selected, Google may include translated English language results.
How Google decides what ‘nationality’ a site has: its geotargeting factors are the ccTLD, geotargeting settings for gTLDs (set in Webmaster Tools), server location, and other signals (such as addresses and phone numbers). At the bottom of this list is a list of local domain Googles.
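
Of the geotargeting signals listed above, only the ccTLD can be read directly from a result URL. The following sketch shows one way to do that when localizing a list of scraped URLs. The function name cctld is hypothetical, and the check relies on the convention that two-letter top-level domains are country codes; the other signals (Webmaster Tools settings, server location, addresses) are not covered.

```python
from typing import Optional
from urllib.parse import urlsplit


def cctld(url: str) -> Optional[str]:
    """Return the two-letter country-code TLD of a URL, or None.

    The URL must include a scheme (http:// or https://); otherwise
    urlsplit cannot identify the host name and None is returned.
    """
    host = urlsplit(url).hostname
    if not host:
        return None
    last_label = host.rstrip(".").rsplit(".", 1)[-1]
    if len(last_label) == 2 and last_label.isascii() and last_label.isalpha():
        return last_label.lower()
    return None


if __name__ == "__main__":
    for url in ("https://www.google.co.uk/search?q=HIV", "https://example.com/"):
        print(url, "->", cctld(url))
```

A site on a generic domain such as .com returns None here, which does not mean it has no ‘nationality’, only that this one signal is absent.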

Naver

This is a Korean search engine, and will therefore be particularly biased towards (South) Korean sites and results.

Yahoo Japan

It is worth emphasizing that this is not the same search engine as the international Yahoo you are probably more familiar with. Yahoo Japan is a separate company and search engine, focused on the Japanese market.
Like Baidu, Yahoo Japan results link to an internal redirect URL rather than the actual page URL. Unlike Baidu, these redirects contain the actual URL, so the scraper extracts these, and there is no need to do this manually.
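
To illustrate what extracting a URL from a redirect can involve, here is a minimal sketch. It is not the scraper's code and does not reflect Yahoo Japan's actual redirect format, which is not documented on this page: it assumes, purely for illustration, that the target is carried percent-encoded in a query parameter. The name embedded_url and the example host redirect.example are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlsplit


def embedded_url(redirect_url: str) -> Optional[str]:
    """Return the first query parameter value that is itself a URL.

    Assumption: the redirect carries its target in a query parameter.
    parse_qsl takes care of percent-decoding the value.
    """
    for _name, value in parse_qsl(urlsplit(redirect_url).query):
        if value.lower().startswith(("http://", "https://")):
            return value
    return None


if __name__ == "__main__":
    example = "https://redirect.example/r?id=123&u=https%3A%2F%2Fexample.org%2Fpage%3Fa%3D1"
    print(embedded_url(example))
```

Redirects that do not contain their target, as described for Baidu, cannot be handled this way and have to be followed instead.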

Yandex

Yandex also allows you to use a number of search operators.
CAPTCHAs are triggered rather easily, which makes scraping large numbers of results more difficult (or at least time-consuming).
Yandex does not offer an easily parseable estimate of the total number of results. The scraper will attempt to parse the number, but it may not be able to do so, e.g. if due to your location or browser Yandex returns results in a language that isn't English.

Analysing results

See our video tutorials on analyzing engine results and localizing web sources.

Harvesting and triangulating results

Harvest a page (of search results, or of another kind) by selecting the results in your browser > view-selection-source > copy > paste into the harvester. Also see our video on extracting URLs from a web page.
You might want to triangulate the results of different search engines or searches with different settings. Also see our video on comparing lists.
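
Triangulation of harvested lists can also be done with a short script. The sketch below is one possible approach and not the tool shown in the video: it compares results by host name (ignoring a leading "www."), and the function names host and triangulate as well as the example URLs are made up for this illustration.

```python
from typing import Dict, List, Set
from urllib.parse import urlsplit


def host(url: str) -> str:
    """Return the host name of a URL without a leading 'www.'."""
    name = urlsplit(url).hostname or ""
    return name[4:] if name.startswith("www.") else name


def triangulate(result_lists: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each host to the sorted names of the lists it appears in."""
    found: Dict[str, Set[str]] = {}
    for source, urls in result_lists.items():
        for url in urls:
            name = host(url)
            if name:
                found.setdefault(name, set()).add(source)
    return {name: sorted(sources) for name, sources in sorted(found.items())}


if __name__ == "__main__":
    harvested = {
        "google": ["https://www.example.org/a", "https://only-google.example/"],
        "bing": ["http://example.org/b"],
    }
    for name, sources in triangulate(harvested).items():
        print(name, sources)
```

Hosts listed under more than one source are shared between engines (or settings); hosts listed under a single source are unique to it.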

The symbiosis of search results and Wikipedia

Wilkinson and Huberman (2007) find evidence of a direct correlation between the visibility of an article (measured in terms of its Google PageRank) and the number of edits received by that article. See Wilkinson, D.M., and B.A. Huberman, 2007. Assessing the Value of Cooperation in Wikipedia.
Google PageRank has also been shown to correlate strongly with the number of times a Wikipedia page is viewed. See Spoerri, A., 2007. "What is popular on Wikipedia and why?," First Monday.

Algorithm changes

Google

What follows lists Google algorithm changes that have resulted in new, or changing, modes of research that were not possible before the change, together with their change type, from the first named and confirmed update (Boston) in 2003 until June 2015. The timeline is by no means exhaustive: Google changes its algorithm 500-600 times per year. While most of these changes are minor, others are ‘major’ in that they have the biggest impact on (re-)search. The selection is drawn from the work of SEO consultancy Moz, which keeps track of these major algorithm changes by tracking changes in results for a set of queries with their ‘Rank Tracker’ tool, community submissions and updates reported by Google. Table adapted from Weltevrede, Esther (2016). Repurposing digital methods. The research affordances of platforms and engines. Ph.D. Dissertation, Amsterdam, NL: University of Amsterdam (p. 120).

year	update name	update type	key Google algorithm change
2003	Boston	Anti-manipulation / Quality	More emphasis on quality back-links
2003	Cassandra	Anti-manipulation / Quality	Cracking down on link-quality issues, such as co-linking from domains, hidden text & hidden links
2003	Dominic	Anti-manipulation / Quality	Improving on counting and reporting backlinks
2003	Esmeralda	Infrastructure	Improvements on the index infrastructure
2003	Fritz	Infrastructure	Improvements on the index infrastructure
2003	Supplemental Index	Anti-manipulation / Quality	Update splitting off results of lesser quality into the "supplemental index"
2003	Florida	Anti-manipulation / Quality	Crack-down on low-value late 90s SEO tactics, like keyword stuffing
2004	Austin	Anti-manipulation / Quality	Crack-down on deceptive on-page SEO tactics, including invisible text and META-tag stuffing
2004	Brandy	Semantic / Query	Latent Semantic Indexing (LSI), anchor text relevance, synonyms and keywords, intro idea of link "neighbourhoods"
2005	Allegra	Anti-manipulation / Quality	Crack-down on suspicious-looking links
2005	Bourbon	Anti-manipulation / Quality	Improvements in how duplicate content and non-canonical (www vs. non-www) URLs were treated
2005	Personalized Search	Personalization / Social	Results take user's search histories into account
2005	Jagger	Anti-manipulation / Quality	Crack-down on low-quality links, including reciprocal links, link farms, and paid links
2005	Google Local/Maps	Local	Maps data is integrated into the Local Business Center
2005	Big Daddy	Infrastructure	Infrastructure update changing the way URL canonicalization, redirects and other technical issues are handled
2006	Supplemental Update	Anti-manipulation / Quality	Change to the supplemental index and how filtered pages were treated
2007	Universal Search	Universal	Integrating traditional search results with News, Video, Images, Local, and other verticals
2007	Buffy	Semantic / Query	Update to single-word search results and other small changes
2008	Dewey	Universal	Unspecified update to the index, reportedly pushing Google's own internal properties, including Google Books
2008	Google Suggest	Semantic / Query	Update displaying suggested searches in a dropdown below the search box and later powering Instant
2009	Vince	Trust	Big brands get a boost in the results
2009	Real-time Search	Real-time / Freshness	Twitter feeds, Google News, newly indexed content and other sources were integrated into a real-time feed on some SERPs
2010	Google Places	Local	"Places" originally only a part of Google Maps was now integrated more closely with local search results
2010	May Day	Anti-manipulation / Quality	Crack-down on low-quality pages ranking highly for long-tail searches
2010	Caffeine	Real-time / Freshness	Launch of new web indexing infrastructure resulting in a 50% fresher index
2010	Brand Update	Trust	Same domains are allowed to appear multiple times on a SERP
2010	Google Instant	Semantic / Query	Displaying search results as a query was being typed
2010	Social Signals	Personalization / Social	Social signals are included in determining ranking, including data from Twitter and Facebook
2010	Negative Reviews	Trust	Update to ranking based on negative reviews
2011	Panda	Anti-manipulation / Quality	Crack-down on thin content, content farms, sites with high ad-to-content ratios, and a number of other quality issues
2011	Freshness Update	Real-time / Freshness	Update primarily affecting time-sensitive results signaling a much stronger focus on recent content
2012	Search + Your World	Personalization / Social	Update pushing Google+ social data and user profiles into SERPs
2012	Venice	Local	More localized organic results and more tightly integrated local search data
2012	Penguin	Anti-manipulation / Quality	Crack-down on spam factors, including keyword stuffing and link schemes
2012	Knowledge Graph	Semantic / Query	Rolling out a SERP-integrated display providing supplemental information about certain people, places, and things
2012	Exact-Match Domain (EMD) Update	Anti-manipulation / Quality	Crack-down on low quality websites that have search terms in their domain names
2012	DMCA Penalty ("Pirate")	Anti-piracy	Crack-down on software and digital media piracy
2013	In-depth Articles	Universal	New type of result, dedicated to more ever-green, long-form content
2013	Hummingbird	Semantic / Query	Core algorithm update that powers changes to semantic search and the Knowledge Graph
2014	Payday Loan	Anti-manipulation / Quality	Crack-down on spammy queries
2014	Pigeon	Local	Altering local results and modifying how location cues are handled, creating closer ties between the local and core algorithm(s)
2014	HTTPS/SSL Update	Trust	Giving preference to secure sites
2014	Authorship Removed	Trust	Authorship bylines disappearing from all SERPs
2014	"In The News" Box	Universal	Change to News-box results expanding news links to a much larger set of potential sites
2014	Pirate 2.0	Anti-piracy	Crack-down on software and digital media piracy
2015	Mobile Update / "Mobilegeddon"	Mobile	Mobile friendliness becomes a stronger ranking factor for mobile searches
2015	The Quality Update	Anti-manipulation / Quality	Core algorithm change impacting "quality signals"

Other useful observations

Period of observance	Observation	Reference / example	Google service
2012	cheat sheet / URL parameters	http://code.google.com/intl/en/apis/customsearch/docs/xml_results.html
2002 - 2012	The maximum amount of results served by Google is 1000. In this example query Google indicates it has indexed about 226,000,000 results, while one can not click beyond result 874	http://www.google.com/search?q=Things+one+should+know+about+google&hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=6bK&num=100&start=900&sa=N	all
2004 - 2012	Screen scraping Google might get you blocked.	DMI Google scraping experience	all
- 2012	different results are returned when one is logged in		all
2002-2012	The maximum number of results returned by Google per results page is 100	add &num=100 to the URI	all
2004 - 2012	the US version of Google (google.com) returns the most "international" results	you can also go to http://google.com/ncr (No country redirect)	google web search
2004 - 2012	cheat sheet / search operators	http://www.google.com/help/cheatsheet.html	google web search
2004 - 2012	cheat sheet / search operators	http://jwebnet.net/advancedgooglesearch.html	google web search
2004 - 2012	cheat sheet / search operators	http://www.internettutorials.net/boolean.asp	google web search
2008 - 2012	Google Trends is based on 'successful queries.'	How does Google Trends for Websites work? When you enter the address of a website into the search box, Trends for Websites shows you a graph reflecting the number of daily unique visitors (the number of people who visit a website) to that website. http://www.google.com/intl/en/trends/websites/help/index.html#1	trends
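
The num and start parameters from the table above can be combined to list the result pages for a query. The sketch below only builds the URLs and does not request them; it assumes the parameters still behave as observed in 2002-2012, which may no longer be the case, and the name result_page_urls is hypothetical. Keep in mind that requesting such pages automatically may trigger a CAPTCHA or get you blocked, as noted above.

```python
from typing import List
from urllib.parse import urlencode

MAX_RESULTS = 1000  # ceiling on served results, as reported in the table above
PAGE_SIZE = 100     # largest page size (num), as reported in the table above


def result_page_urls(query: str,
                     base: str = "http://www.google.com/search",
                     hl: str = "en") -> List[str]:
    """Build one URL per results page, from start=0 up to start=900."""
    urls = []
    for start in range(0, MAX_RESULTS, PAGE_SIZE):
        params = {"q": query, "hl": hl, "num": PAGE_SIZE, "start": start}
        urls.append(f"{base}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    for url in result_page_urls("Things one should know about google"):
        print(url)
```

The last URL produced uses start=900, matching the example query in the table, where results stop before the 1000 mark is reached.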


Topic revision: 02 Oct 2018, StijnPeeters

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.