
Searching the Archives

Digital Methods Initiative Winterschool, 22-25 January 2013

Team Members

Anat Ben-David, Anne Helmond, Hugo Huurdeman, David Moats, Natalia Sánchez, Thaer Sammar, Catherine Somzé

Data Source: KB, CommonCrawl



This project explores sprint methods for searching an archived collection of a Dutch news aggregator Website. The data for the analysis has been provided by the Web Archive of the National Library of the Netherlands, which collects daily snapshots of the site. The Library's Web Archive is currently not available online. Its current interface is based on the Wayback Machine, which, similar to the Internet Archive, enables users to type a URL and view its archived snapshots at different points in time.

The National Library of the Netherlands is the cultural heritage partner of WebART, an NWO-CATCH research project aimed at exploring new means and tools for making Web archives accessible and useful for researchers in the social sciences and the humanities.

For the Digital Methods Initiative's Winterschool, the WebART project's team prepared a searchable interface to the National Library's collection of the news Website.

First, the WebART team indexed an archived collection of the Website (crawled daily between October 2011 and December 2012), received from the National Library for research purposes*. The WebART team also extracted and indexed an additional archived collection of the Website from the 2011 CommonCrawl collection. Both collections were used as data sources for the DMI Winterschool data sprint, and a search interface with advanced search options was built on top of them. The search interface, called "WebARTist", was tested for the first time during the DMI Winterschool, in order to examine how New Media researchers would be interested in exploring a (news) Web archive for research, and which tools can meet their research needs.


The following analysis presents initial findings from the DMI Winterschool. The findings should be understood as an exploration of potential methods for exploring searchable Web archives for research, rather than an in-depth analysis of the subject matter.

Research Questions

What types of analyses (and historiographical accounts) open up when Web archives are searchable?

What types of news-related analyses can be performed with a Web archive of a News Website?

Which features can WebARTist include in order to address research questions related to Web archive research or historiographical news analysis?


1. Data extraction and indexing

The following documents specify the data extraction and search engine interface developed by the WebART team:

1. Data extraction and indexing (to be updated)
2. Search interface (to be updated)

2. Methods for exploring the archive

The methodology for analyzing the data during the DMI Winterschool is specified under each analysis' header below.


1. Temporal analysis of item frequencies

Research question:
What is the frequency of news items covering (controversial) country leaders in the archived Website during 2012?

1. Query WebARTist for the following leader names: "Assad", "Mubarak", "Putin", "Kim Jung Il", "Fidel Castro" and "Raul Castro".
2. Export the search results with the date stamps of the returned news articles, and plot the number of mentions per query on a timeline.
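The counting step above can be sketched as follows. This is a minimal illustration, not the WebARTist export itself: the column names `query` and `date`, the ISO date format and the sample rows are all assumptions made up for the example.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime


def monthly_counts(rows):
    """Count exported items per query and per month.

    `rows` is an iterable of dicts with the hypothetical keys 'query'
    and 'date' (YYYY-MM-DD); a real export may use other columns.
    """
    counts = defaultdict(Counter)
    for row in rows:
        month = datetime.strptime(row["date"], "%Y-%m-%d").strftime("%Y-%m")
        counts[row["query"]][month] += 1
    return counts


# Made-up sample standing in for an exported results file.
sample_csv = """query,date,title
Putin,2012-03-04,Example title A
Putin,2012-03-05,Example title B
Putin,2011-12-10,Example title C
Mubarak,2012-06-02,Example title D
"""

counts = monthly_counts(csv.DictReader(io.StringIO(sample_csv)))
for query, per_month in counts.items():
    for month in sorted(per_month):
        print(query, month, per_month[month])
```

The resulting per-month counts can then be passed to any plotting tool to draw one line per query on a shared time axis.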


"Putin" has significantly more news items than the other queried country leader names. The rise in the frequency of items mentioning Putin in their title, seen between December 2011 and March 2012, is related to the "Putin Must Go" campaign and the protests prior to the Russian presidential elections.

Similarly, there is a rise in news items mentioning "Mubarak" in August 2011, around his court hearing, and subsequently in June 2012 following the Egyptian elections.

These findings are suggested as a data exploration method for further analyzing news coverage.

2. Temporal Co-Word analysis

Research question:

Which words are related to the query "Assad" in the archive over time?

1. Query WebARTist for the keyword "Assad"
2. Export results (date stamps, news item title, snippet text).
3. Perform Co-Word analysis of the returned snippet text using ANTA**.
4. Using Gephi, fix the timestamps on the graph and apply a spatialization algorithm that pulls related keywords together, depending on their frequency and date.
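To make the co-word step more concrete, here is a simplified stand-in written for illustration. It is not ANTA's algorithm: it only counts which words appear in the same snippet as the query term, grouped by month. The stopword list, the tokenization and the sample snippets are assumptions.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "on", "at", "for", "is"}


def coword_by_month(items, query):
    """Count words co-occurring with `query` in snippets, grouped by month.

    `items` is an iterable of (date, snippet) pairs with dates given as
    YYYY-MM-DD strings (an assumed export format).
    """
    result = defaultdict(Counter)
    q = query.lower()
    for date, snippet in items:
        words = re.findall(r"[a-z]+", snippet.lower())
        if q not in words:
            continue  # skip snippets that do not mention the query term
        month = date[:7]
        for word in set(words):  # count each word once per snippet
            if word != q and word not in STOPWORDS:
                result[month][word] += 1
    return result


# Made-up snippets standing in for translated search results.
items = [
    ("2011-06-20", "Assad speech at Damascus university"),
    ("2011-06-21", "Protests continue after Assad speech"),
    ("2011-08-03", "Mubarak appears in court"),
]
cowords = coword_by_month(items, "Assad")
print(cowords["2011-06"].most_common(3))
```

The per-month counters could be turned into a node and edge list (word, month, weight) for import into a graph tool such as Gephi.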


1. The returned results show "news item" language, such as "Press Conference", "Historic Event" and "Libyan Capital Tripoli".
2. The co-word analysis shows a greater affiliation with countries and leaders related to the "Arab Spring" (such as Egypt, Tunisia, "Tahrir Square" and "Hosni Mubarak") than with words related to coverage of the anti-regime protests in Syria. (One would expect, for example, to find mentions of city names such as "Homs" or "Damascus".)
3. The large concentration of words around June 2011 may be related to Assad's Damascus University speech.

These findings are suggested, among others, as a data exploration method for performing historical critique of international news coverage policies and practices. This method can also be applied to non-news-related Web archives.

3. Outlink analysis

Research question:

To which extent does the Website use external links as references in news articles? Are there different types of outlinks related to different issues?

1. Query WebARTist for the keywords "Sandy", "Syria" and "US elections" (in Dutch). Each query represents a key event which took place during 2011-2012.
2. Using WebARTist's interface, extract the outlinks found in the archived page of each result, and export the results (query, timestamp, related outlinks) to a CSV file.
3. Count the number of outlinks per query.
4. Using the DMI Dorling Map tool, visualize the outlinks per query.
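The counting in step 3 might look like the sketch below, which groups exported outlinks by host name per query. The (query, URL) row layout and the sample URLs are assumptions for illustration; the actual export also carries timestamps, which are ignored here.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def outlink_domains(rows):
    """Count outlink host names per query.

    `rows` is an iterable of (query, url) pairs, an assumed layout of
    the exported CSV file.
    """
    counts = defaultdict(Counter)
    for query, url in rows:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.org and example.org alike
        if host:
            counts[query][host] += 1
    return counts


# Made-up rows standing in for the exported outlinks.
rows = [
    ("Syria", "http://maps.google.com/?q=example"),
    ("Syria", "http://www.youtube.com/watch?v=example"),
    ("Syria", "http://maps.google.com/?q=another"),
    ("Sandy", "http://en.wikipedia.org/wiki/Example"),
]
counts = outlink_domains(rows)
for query, hosts in counts.items():
    print(query, hosts.most_common())
```

Host counts of this kind are the sort of (label, size) input that a bubble-style visualization needs.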


1. In the case of Syria, most outlinks are to Google Maps, perhaps indicating a practice of using digital maps as a source of reference for conflict areas where there are fewer journalists reporting from the ground. Wikipedia is also used as a reference in items related to Syria. Similarly, outlinks to YouTube may indicate references to user-generated content uploaded from Syria.

2. Compared to the other issues, the Syrian outlink space has more references to other news sources, especially the English editions of TV news channels such as Al Jazeera, Al Arabiya, BBC and CNN. There are also outlinks to newspapers such as the Washington Post and the Guardian. References are also found to the most prominent Dutch newspapers, such as the Volkskrant, NRC, Telegraaf and Trouw. By contrast, there is only one reference to the Syrian Arab News Agency (SANA).

3. The outlinks related to Hurricane Sandy refer mostly to the Sandy crisis map, to Wikipedia, CNN and to Bruce Springsteen's Website (reflecting the 12-12-12 Sandy relief concert). Other than the New York Post, there are no references to other newspapers (compared to the Syrian outlink space).

4. In contrast to the types of outlinks from news items related to Hurricane Sandy and Syria, most of the outlinks related to the 2012 US presidential elections are to Web advertising networks. Although references to other newspapers (both American and Dutch) are found, there is also significant linking to entertainment magazines and Websites such as TMZ and Hollywood Reporter, among others. This may say more about the types of wire subscriptions of the Website, which aggregates entertainment news side by side with international and national news.

We found the outlink extraction method essential for future analyses of Web archives, as it points to the larger Web space in which the archived items appeared in real time, and which might not have been archived itself. Future analyses can use the outlink analysis to examine the changes and edits to the same news item within an archived news source over time.

4. Geomapping of News wire reporting cities

Research question: The Website makes use of wire service subscriptions. The wire items are translated into Dutch and posted on the Website under the different sections, based on editorial policy. From which places do wire services report news about Syria?

1. Filter the database for all items mentioning Syria in the title field.
2. Extract from the source code of the HTML pages the names of the wire services, appearing in <div class="actions"> at the bottom of news items on the Website.
3. Extract the reporting city appearing as the first word in the H2 section of the HTML code (under the title header).
4. Create a CSV file containing the filtered news items, the city from which they were reported, and the wire service.
5. Using Google Fusion Tables, count the wire services reporting from each city and plot them on a geomap.
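Steps 2 and 3 can be sketched with Python's standard HTML parser. Only the `<div class="actions">` container and the H2 element come from the description above; the rest of the sample markup, the class name `WireItemParser` and the function `extract_city_and_wire` are hypothetical, and real archived pages may need more robust handling.

```python
from html.parser import HTMLParser


class WireItemParser(HTMLParser):
    """Collect the text of <h2> elements and of <div class="actions">."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self._actions_depth = 0  # > 0 while inside <div class="actions">
        self.h2_text = []
        self.actions_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Track nested divs so the matching closing tag is found.
            if self._actions_depth or "actions" in classes:
                self._actions_depth += 1

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        elif tag == "div" and self._actions_depth:
            self._actions_depth -= 1

    def handle_data(self, data):
        if self._in_h2:
            self.h2_text.append(data)
        elif self._actions_depth:
            self.actions_text.append(data)


def extract_city_and_wire(html):
    """Return (first word of the H2 text, text of the actions div)."""
    parser = WireItemParser()
    parser.feed(html)
    h2_words = " ".join(parser.h2_text).split()
    city = h2_words[0].strip(" -,:") if h2_words else None
    wire = " ".join(" ".join(parser.actions_text).split()) or None
    return city, wire


# Made-up page fragment; the real markup is an assumption.
sample = """
<h1>Example headline about Syria</h1>
<h2>BEIRUT - Example lead sentence.</h2>
<div class="actions"><span>ANP</span></div>
"""
print(extract_city_and_wire(sample))
```

Taking only the first word follows the step as described, so multi-word city names would be truncated and need manual correction before geocoding.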

The Dutch News Agency, ANP, provides news items reported from Syria. Yet most reports are from European cities, indicating perhaps the absence of journalists reporting from Syria during the civil conflict in 2011.
Further analysis can examine on a larger scale whether there are geographical patterns in the places wire services report from and the places they report on, and whether these patterns change over time.

5. Temporal image analysis

Research question:
What type of news image analysis and historiographical accounting can one perform with a Web archive of a news Website?

1. Adjust WebARTist to include an image search feature. For that, the images were extracted from their containing folders, renamed, and returned as separate images to the folders containing each URL. It is important to note that this adjustment was preferred over the simple extraction of the URLs of image files, as those would be rendered from the live Web, and not from the archived collection.

2. Query WebARTist for the keyword "Mubarak". Export both text search results and image search results.


3. Export the results to a Google Spreadsheet. Using a JavaScript Tool for visualizing timeline data, automatically visualize the search results on an interactive timeline.

Click here for the interactive timeline.


1. There are many repeating images, perhaps indicating the use of an image-bank rather than providing news images reported from the ground.
2. This method is offered as a new means of exploring and narrating historiographical accounts of issues and events using Web archives (that is, the exploration of an issue through a timeline based on its images).
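The observation about repeating images was made by viewing the timeline. One way to quantify it, offered here only as a possible follow-up and not as something done during the sprint, is to group image files by a hash of their bytes. The file names and byte strings below are made up, and `group_identical_images` is a hypothetical helper.

```python
import hashlib
from collections import defaultdict


def group_identical_images(images):
    """Group image names whose file contents are byte-for-byte identical.

    `images` maps a file name to the raw bytes of that file. Only groups
    with more than one member (that is, repeats) are returned.
    """
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Placeholder bytes standing in for extracted image files.
images = {
    "2012-06-02_a.jpg": b"fake-image-bytes-1",
    "2012-06-14_b.jpg": b"fake-image-bytes-1",
    "2012-08-01_c.jpg": b"fake-image-bytes-2",
}
print(group_identical_images(images))
```

An exact hash only catches identical files; resized or re-encoded copies of the same picture would need a perceptual comparison instead.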

* Winterschool project participants have been granted access to the collection, for which they signed a non-disclosure contract ensuring they will not further distribute the data.

** Snippet text was automatically translated using Google Translate, as ANTA currently does not support co-word analysis in Dutch.
Topic revision: r5 - 30 Jan 2013, AnatBenDavid