Tracing and mapping early blogospheres and the emergence of the blogroll.
Developing and testing a new tool: an archive crawler to research the Internet Archive for early networks on the web.
When did the first blogospheres arise?
When did early blogs start to link to other websites?
How did blogospheres emerge within the first years of blogging?
How did the feature of the blogroll arise within early blogging?
To research early networks, an archive crawler tool has to be developed. To explore its possibilities, problems and likely results, we use a semi-manual version of the archive crawler: we extract links from a defined set of archived blogs and map them with the ReseauLu tool.
The starting data set is the Eaton web directory's list of archived and available blogs on 15-08-2000, accessible at http://web.archive.org/web/20000815223308/http://www.eatonweb.com/portal/portal.php3
1. In a first step we manually check when the listed websites became blogs, as some sites listed in the Eaton directory were not yet blogs when they were first archived.**
2. The archive data set of the Eaton blogs is periodized by group 1 with a k-means clustering algorithm. We define how many clusters we want (e.g. 4 periods, thus 4 clusters) and let the algorithm group all time-stamped URLs into that many clusters, which are maximally coherent internally and maximally exclusive externally. http://tools.issuecrawler.net/beta/wayback/contents.php
3. Beginning with the cluster in mid 1999 and then proceeding to mid 2000 and mid 2001, we use a customized link ripper, tailored to the Internet Archive, to extract all outbound links from the archived blogs (results: mid2000). The tool only extracts links from the first page, as pages on the second or third level in the archive might not be from the same archival date.
4. The outbound links are plotted into a network using the ReseauLu tool.
5. In a final step these networks are visualised as maps. The initial set of blogs is highlighted.
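The periodization step above can be illustrated with a minimal sketch. It assumes that each archived page is identified by a 14-digit Wayback Machine timestamp and a URL, and it runs a plain one-dimensional k-means (Lloyd's algorithm) on the timestamps. The function names, the evenly spread starting centres and the example snapshots are hypothetical and only serve the illustration; the wayback tool linked above may work differently.

```python
from datetime import datetime, timezone


def parse_wayback_timestamp(ts):
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to Unix seconds."""
    moment = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return moment.timestamp()


def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm in one dimension; returns one cluster index per value."""
    ordered = sorted(values)
    # Assumption: deterministic start, centres spread evenly over the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assign every value to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Move every centre to the mean of its members (keep it if it has none).
        new_centres = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centres.append(sum(members) / len(members) if members else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def periodize(snapshots, k):
    """Group (timestamp, url) pairs into at most k periods, in chronological order."""
    times = [parse_wayback_timestamp(ts) for ts, _ in snapshots]
    groups = {}
    for snapshot, label in zip(snapshots, kmeans_1d(times, k)):
        groups.setdefault(label, []).append(snapshot)
    return sorted(groups.values(), key=lambda group: min(ts for ts, _ in group))


# Hypothetical snapshots, not taken from the Eaton data set.
snapshots = [
    ("19990612000000", "http://example.org/a"),
    ("19990701000000", "http://example.org/b"),
    ("20000615000000", "http://example.org/a"),
    ("20000720000000", "http://example.org/c"),
    ("20010610000000", "http://example.org/b"),
    ("20010705000000", "http://example.org/c"),
]
periods = periodize(snapshots, k=3)
for number, period in enumerate(periods, start=1):
    print(number, [ts[:6] for ts, _ in period])
```

With real data the choice of k fixes the number of periods in advance, so it is worth comparing a few values before settling on the mid 1999, mid 2000 and mid 2001 clusters.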
(1) In the early lists (1999) we do not expect to find a strong network between the blogs. Rather, we hope to find the emergence of outlinks. At a later stage we expect increasingly complex networks of outlinks.
(2) As only links from the first page are ripped, we expect to see the emergence of the blogroll and webrings if the blogs increasingly link to other blogs.
(3) Further, we expect the networks to cluster. Within the analysis we want to explore whether these early networks cluster around the practice of blogging itself or around actors or issues.
(4) We expect to find new blogs: blogs that are missing from the Eaton web directory but play a role in the network.
(5) We also expect to find blogs which are missing in the Internet Archive.
(6) Looking at the central actors within the emerging networks, we expect to identify A-list blogs.
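Expectations (4) and (6) could be checked roughly as follows. This is a sketch under stated assumptions, not the customized link ripper itself: it assumes the first page of each archived blog is available as an HTML string, and that archived links are rewritten to the form /web/<timestamp>/<original URL>. It rips the links from each first page, counts how many seed blogs link to each host, and lists hosts outside the seed set. All names (LinkRipper, outbound_hosts, in_degree) and the example pages are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed pattern of rewritten archive links: /web/<timestamp>/<original URL>.
WAYBACK_PREFIX = re.compile(r"^(?:https?://web\.archive\.org)?/web/\d{1,14}[a-z_]*/")


class LinkRipper(HTMLParser):
    """Collect the href of every <a> tag on a single page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def host_of(url):
    """Lower-cased host name without a leading 'www.'; empty if there is none."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def outbound_hosts(page_url, html):
    """Hosts linked from an archived first page, excluding the page's own host."""
    ripper = LinkRipper()
    ripper.feed(html)
    own_host = host_of(page_url)
    hosts = set()
    for href in ripper.hrefs:
        # Strip the archive prefix to recover the original link target.
        target = host_of(WAYBACK_PREFIX.sub("", href))
        if target and target != own_host:
            hosts.add(target)
    return hosts


def in_degree(pages):
    """Count, per host, how many distinct source pages link to it."""
    counts = Counter()
    for page_url, html in pages.items():
        counts.update(outbound_hosts(page_url, html))
    return counts


# Hypothetical first pages, keyed by their original (non-archive) URL.
pages = {
    "http://blog-a.example/": (
        '<a href="/web/20000815000000/http://www.blog-b.example/">B</a>'
        '<a href="/web/20000815000000/http://hub.example/">Hub</a>'
    ),
    "http://blog-b.example/": (
        '<a href="http://web.archive.org/web/20000815000000/http://hub.example/x">Hub</a>'
        '<a href="/web/20000815000000/http://blog-b.example/archive.html">Archive</a>'
    ),
}
counts = in_degree(pages)
seed_hosts = {host_of(url) for url in pages}
new_hosts = sorted(host for host in counts if host not in seed_hosts)
print(counts.most_common())
print(new_hosts)
```

Hosts with a high in-link count are candidates for A-list blogs, and hosts in new_hosts are candidates for blogs missing from the directory. Both lists still need manual checking, since portals and tool sites will also attract many links.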
* There are blogs from the Eaton web directory that are first archived after 15-08-2000.
**We only check blogs before 15-08-2000.