The Social Life of a Crawler

Team Members

Anat Ben-David, Anne Helmond, Jeroen Jonkbot, Marc Tuters, Oscar Coromina, Samuel Zwaan, Simeona Petkova

Research Questions

Is the web an increasingly closed space for crawlers?

Are there inclusion or exclusion policies for crawlers across different web spaces?

Is it possible to map the spaces that crawlers can/can't reach?

Which are the most marginalized bots?


To study whether websites apply exclusion or inclusion policies towards crawlers, we focused on six different spaces:
  • News (50 websites listed in Google Directory)
  • Dutch Blogosphere (Dejaap List)
  • UN websites (list provided by the UN)
  • Gov websites (Wikipedia list of U.S. .gov sites)
  • Social Networking Websites (most popular according to a Wikipedia list)
  • Edu websites (queried Google and selected the top 100)
We checked whether each website had a robots.txt file and read it as a crawler would.
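
Reading a robots.txt file "as a crawler would" can be sketched with Python's standard urllib.robotparser module. This is a minimal illustration, not the tooling used in the project: the function name can_crawl, the sample rules, the bot names and the example.org URLs are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules.

    `robots_txt` is the text of the file, or None when the site has no
    robots.txt (treated here as "everything may be crawled").
    """
    if robots_txt is None:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one bot is shut out entirely, every other bot
# is only kept away from /private/.
SAMPLE = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for bot in ("BadBot", "Googlebot"):
        for path in ("/news", "/private/report"):
            url = "http://example.org" + path
            print(bot, path, can_crawl(SAMPLE, bot, url))
```

In practice the text of the file would first be downloaded from each site's /robots.txt address; the sketch takes it as a string so that the rule check can be tried without network access.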

Preliminary Findings

  • Some websites do not have a robots.txt file, so all of their content can be crawled without limitations.
  • Some websites use robots.txt to explicitly allow all robots to crawl everything.
  • Robots.txt is also used to keep crawlers away from specific content.
  • There are also whitelists (which grant some bots access to places that are disallowed for all other crawlers) and blacklists (bots that are not allowed to crawl at all, presumably because they behaved badly).
  • Some robots.txt files contain comments that read like a kind of poetry.
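
The categories in the findings above could be assigned automatically along the following lines. This is a rough sketch under simplifying assumptions, not the procedure used in the study: the category labels and the functions parse_records and classify are invented for illustration, only "Disallow: /" counts as a full ban, and Allow lines and wildcards in paths are ignored.

```python
def parse_records(text):
    """Return {user_agent: [disallowed paths]} from robots.txt text."""
    records = {}
    agents = []        # user-agents of the record currently being read
    in_rules = False   # True once the current record has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new record starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            records.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            # An empty Disallow means nothing is off limits.
            if field == "disallow" and value:
                for agent in agents:
                    records[agent].append(value)
    return records


def classify(text):
    """Assign one of the (hypothetical) category labels to a robots.txt."""
    if text is None:
        return "no robots.txt"
    records = parse_records(text)
    everyone = records.get("*", [])
    named = {a: paths for a, paths in records.items() if a != "*"}
    if "/" in everyone and any("/" not in p for p in named.values()):
        return "whitelist"
    if "/" not in everyone and any("/" in p for p in named.values()):
        return "blacklist"
    if any(records.values()):
        return "restricts specific content"
    return "allows everything"
```

A file that bans every robot with "Disallow: /" and names no exceptions falls under "restricts specific content" in this sketch; a fuller version would probably give that case its own label.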


Sun, Y., Zhuang, Z., and Giles, C. L. (2007). A Large-Scale Study of Robots.txt. 16th International World Wide Web Conference. ACM Press, pp. 1123-1124.
Alay, S., and Ekanayake, J. (2006). Analysis of the usage statistics of robots exclusion standard. IADIS International Conference WWW/Internet 2006.
