Nofollow / Indexing Issues in the Blogosphere
Introduction: Indexing and Ranking
Search engine critiques generally focus on either the allocation of pages to be searched (indexing) or the algorithms used to determine the order of search returns (ranking). Central to both areas of study, however, is the hyperlink. The link is necessary for page discovery, and is used in PageRank and derivative methods as an indicator of reputation among a set of linked documents. Roughly speaking, the debate around search engines in the late 1990s and early 2000s centered on indexing, with search engines proudly displaying the number of pages indexed and search engine critics introducing the ominous notion of a 'dark Web,' an unindexed Web not discoverable through search engines. More recently, with the rise of Google, focus seems to have shifted to the question of relevance among results. The Nofollow case study raises both issues and, by taking the link as a starting point, questions the conventional practice in search engine studies of separating the two. Indexing, as will be seen below, is not always a straightforward act, and in some cases requires link-interpretation on the part of the crawler. Alongside the many editorial decisions embedded in ranking algorithms, this inevitably affects search engine returns.
The Nofollow tag specifically affects the indexing of links embedded in blog comments. After briefly introducing the tag and assessing its prevalence, here we present a case study investigating the returns from Google, Google Blog Search, and Technorati for links to the Masters of Media weblog. Rather than question the extent to which the devices provide blogosphere 'coverage' (an indexing question), we speculate on how editorial decisions in both indexing and ranking combine to construct different blogospheres among the various devices.
Nofollow
Working alongside Yahoo!, MSN and blog platforms such as WordPress, Google introduced the no_follow attribute in 2005 to prevent comment spam and trackback spam. According to Wikipedia,
nofollow is an HTML attribute value (no_follow) used to instruct search engines that a hyperlink should not influence the link target's ranking in the search engine's index. It is intended to reduce the effectiveness of certain types of spamdexing, thereby improving the quality of search engine results and preventing spamdexing from occurring in the first place. (Link)
While the aim is to prevent the manipulation of site rankings, no_follow also affects indexing, as indicated by the different variations on the attribute and interpretations of it:
1. Robots (don't follow link)
The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is, otherwise, publicly viewable.
For example: "Do not follow any of the hyperlinks in the body of this document." (Wikipedia)
2. Search Engines (don't count link)
How the attribute is being interpreted differs between the search engines. While some take it literally and do not follow the link to the page being linked to, others still "follow" the link to find new web pages for indexing. In the latter case rel="nofollow" actually tells a search engine "Don't score this link" rather than "Don't follow this link." (Wikipedia)
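To make the attribute concrete, the following is a minimal sketch of how one might detect rel="nofollow" on the links in a page and estimate what share of links carry it. It uses only Python's standard library; the names LinkRelParser, nofollow_share and SAMPLE_HTML are hypothetical and introduced here for illustration, and the sample markup is invented, not taken from any of the blogs studied.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect (href, is_nofollow) pairs for every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "external nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def nofollow_share(html):
    """Return the fraction of links in `html` that carry rel="nofollow"."""
    parser = LinkRelParser()
    parser.feed(html)
    if not parser.links:
        return 0.0
    flagged = sum(1 for _, is_nofollow in parser.links if is_nofollow)
    return flagged / len(parser.links)


# Hypothetical sample: one link in the post body, one in a comment.
SAMPLE_HTML = """
<div class="post"><a href="http://example.org/article">article</a></div>
<div class="comment"><a href="http://example.com/" rel="external nofollow">commenter</a></div>
"""

print(nofollow_share(SAMPLE_HTML))
```

Note that this only reports whether a link is flagged; how a given search engine then treats the flagged link is a separate matter and differs per engine.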
These initial considerations led us to a few research questions, focusing on measuring the prevalence and effects of no_follow, as well as the broader effects on blogosphere coverage when devices treat the tag differently:
- How prevalent is the nofollow tag?
- What percentage of a given network may be excluded from an internet search due to nofollow?
- What are the social/political implications of this sort of segregation?
- What do we lose by dividing our primary access to the web into two primary entry frames, blogs and not-blogs?
Prevalence of the no_follow html attribute
To gauge the prevalence of no_follow, we consulted relevant policies from blogging platforms and search engines.
Among the major blog services, no_follow is a standard addition (with some caveats):
- WordPress: default setting, can be disabled with an additional do_follow plugin.
- Blogger: default setting, can be disabled with a series of advanced steps.
- Typepad: "For TypePad subscribers, implementation will be automatic. Links from commenters will be flagged automatically in the next update, which will be deployed within the next 24 hours." (Six Apart - Support for Nofollow)
- Movable Type: "For Movable Type users, we’re shipping a plugin today to enable support on Movable Type-powered sites. The Movable Type website has full details, including a download link." (Six Apart - Support for Nofollow)
- LiveJournal: "LiveJournal also plans to implement the specification for comments from other members who are not friends." (Six Apart - Support for Nofollow)
The following table is an overview of how the various search engines interpret the no_follow attribute.
| rel="nofollow" action | Google | Yahoo | MSN Search | Ask.com |
| Follows the link | Yes | Yes | Not proven | Yes |
| Indexes the "linked to" page | No | Yes | No | Yes |
| Shows the existence of the link | Only for a previously indexed page | Yes | No | Yes |
| In SERPs for anchor text | Only for a previously indexed page | Yes | No | Yes |
(Wikipedia)
Meanwhile, Nofollow is presumably important for blog rankings, but there may be many other factors. Technorati bases blog 'authority' on inlinks, while the major factors for Google Blog Search include:
- Google’s regular ranking factors
- Scrape Gmail for links
- Frequency of Clicks
- Blogrolls
- Social Bookmarking
- Feed Readership
- Other factors
source
Indexing Issues Case Study: Google vs. Google Blogsearch vs. Technorati
How does Google segregate the static web and blogs? Do noindex and nofollow play a role?
See here for speculation on how their blog search works.
From About Google Blogsearch:
- Which blogs are included in Blogsearch? The goal of Blogsearch is to include every blog that publishes a site feed (either RSS or Atom). It is not restricted to Blogger blogs, or blogs from any other service.
- How do I get my blog listed? If your blog publishes a site feed in any format and automatically pings an updating service (such as Google Blogsearch Pinging Service), we should be able to find and list it. Also, we will soon be providing a form that you can use to manually add your blog to our index, in case we haven't picked it up automatically. Stay tuned for more information on this.
Starting point:
We will compare the results of the query "link:mastersofmedia.hum.uva.nl" in Google, Google Blogsearch and Technorati. We will use three tag clouds, one per device, as a basis for speculation.
- Google
- Google Blogsearch
- Technorati
Creating a tag cloud for inlinks to Masters of Media Blog in Google.
Question: Who links to http://mastersofmedia.hum.uva.nl, according to Google?
Tools:
Method:
- Use Google scraper with query "link:mastersofmedia.hum.uva.nl"
- Google search result for "link:http://mastersofmedia.hum.uva.nl", on 27.07.07: 126 results
- Now manually count and list the results per domain like this
Final list for Google query "link:http://mastersofmedia.hum.uva.nl":
DmiMoM
Result
- Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
- 1 = 30%
- 2 = 40%
- 3 = 50%
- 4 = 60%
- 5 = 70%
- 6 = 80%
- 7 = 90%
- 8+ = 100%
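The transparency scale above follows a simple linear rule: 20% plus 10% per inlink, capped at 100% from eight links upward. A minimal sketch of that rule, assuming one wanted to compute the values rather than set them by hand in Illustrator (the function name opacity_for_link_count is hypothetical):

```python
def opacity_for_link_count(link_count):
    """Map a domain's number of inlinks to an opacity percentage.

    Reproduces the scale used for the tag clouds:
    1 link = 30%, 2 = 40%, ..., 7 = 90%, 8 or more = 100%.
    """
    if link_count < 1:
        raise ValueError("a domain in the cloud has at least one link")
    return min(100, 20 + 10 * link_count)


for n in (1, 4, 8, 12):
    print(n, opacity_for_link_count(n))
```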
Result: Backlinks Masters of Media. Tagcloud Google Search:
Creating a tag cloud for inlinks to Masters of Media Blog in Google Blogsearch.
Question: Who links to http://mastersofmedia.hum.uva.nl, according to Google Blogsearch?
Tools:
Method:
- Query Google Blogsearch with "link:mastersofmedia.hum.uva.nl". Since there is no tool to scrape all the results, we manually copied all the URLs of the titles of the results.
- googleblogsearch.txt: Google Blogsearch Masters of Media blog inlink results (27-07-07).
- Tally results per domain.
Final Google Blogsearch results for "link:http://mastersofmedia.hum.uva.nl" (27-07-07):
DmiMoM Google Blog Search MOM inlink results
googleblogsearch.txt
- Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
- 1 = 30%
- 2 = 40%
- 3 = 50%
- 4 = 60%
- 5 = 70%
- 6 = 80%
- 7 = 90%
- 8+ = 100%
Result: links Masters of Media. Tagcloud Google Blogsearch:
Creating a tag cloud for links to Masters of Media Blog in Technorati.
Question: Who links to http://mastersofmedia.hum.uva.nl, according to Technorati?
Tools:
- [[http://service.openkapow.com/artonice/technoratipostsearch1.rest][Technorati Scraper]]
- tag cloud to svg tool
- Illustrator
Method:
- Query Technorati (advanced search) with "link:mastersofmedia.hum.uva.nl". Since there is no tool to scrape these results, we manually copied all the URLs of the titles of the results.
- Tally results per domain.
- Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
- 1 = 30%
- 2 = 40%
- 3 = 50%
- 4 = 60%
- 5 = 70%
- 6 = 80%
- 7 = 90%
- 8+ = 100%
Result: links Masters of Media. Tagcloud Technorati:
Results: Comparing the three tagclouds
To see which blogs are included in or excluded from the spheres of the three devices, we compared the results found in lists 1, 2 and 3 using the Compare Lists tool. Update: the Triangulation tool is the advanced version of Compare Lists. The results can be found here.
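The comparison amounts to set operations on the three domain lists. The following is a rough sketch of that idea, not the Compare Lists or Triangulation tool itself; the function name compare_lists, the region labels, and the sample domains are all hypothetical.

```python
def compare_lists(google, blogsearch, technorati):
    """Split linking domains by which of the three devices returned them."""
    g, b, t = set(google), set(blogsearch), set(technorati)
    return {
        "all three": g & b & t,
        "google + blogsearch only": (g & b) - t,
        "google + technorati only": (g & t) - b,
        "blogsearch + technorati only": (b & t) - g,
        "google only": g - b - t,
        "blogsearch only": b - g - t,
        "technorati only": t - g - b,
    }


# Hypothetical domains, for illustration only (not the study's data).
google = ["a.example", "b.example", "c.example"]
blogsearch = ["a.example", "d.example"]
technorati = ["a.example", "b.example", "e.example"]

overlap = compare_lists(google, blogsearch, technorati)
for region, domains in overlap.items():
    print(region, sorted(domains))
```

Each domain ends up in exactly one region, which corresponds to one zone of overlap (or non-overlap) between the three devices.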
The results were visualized in a cross-device tag cloud:
cross-device DmiMoM tag cloud overlap google googleblogsearch technorati:
Findings
- It is remarkable that Google returns mostly blog results (hardly any static web results). Are there very few static websites linking to MOM?
- Google returns blog results that cannot be found in Google Blog Search.
- Nofollow has little to do with the difference in returns across the three devices. Permalinks carry no nofollow tag; only links in comments are tagged nofollow by default, and these are excluded from the results. As a consequence, no links to DmiMoM that were placed in comments are returned. This is, however, not a defining factor in the difference in returns. The difference lies in the algorithms of the engines.