Data-Driven User Journalism: The Case of the Afghan War Diary

Team Members

Camilo Cristancho, Catalina Iorga, Matteo Cernison


Research Question

Is there an alternative account of the Afghan War Diary 2004 - 2010 documents released by WikiLeaks, a "multi-jurisdictional public service designed to protect whistleblowers, journalists and activists who have sensitive materials to communicate to the public" (1)? In other words, is the data hosted by WikiLeaks used in ways that differ from its use by the mainstream media, as represented by WikiLeaks' official partners?


Collect all inlinks to Afghan War Diary 2004 - 2010 document pages

  1. Identify the common root of all document URLs, namely ''.
  2. Query Google with the Google Scraper to obtain the first 1000 results that contain this common root as text.
  3. Submit the 95 webpages obtained (treated here as the top 100) to the Link Ripper, so that all outlinks to specific Afghan War Diary 2004 - 2010 document pages can be extracted later.
  4. Feed the Link Ripper output into the Harvester to alphabetize the URLs and remove textual descriptions.
  5. Manually clean the output by searching again for the common root '' in an Excel file, and produce a separate list of Afghan War Diary 2004 - 2010 document URLs.
  6. Analyze the list, which contains 179 non-unique results, select all document pages that receive at least two links (following the Issue Crawler logic), and create a file with the 'most mentioned' warlogs: 17, to be exact.
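
Steps 5 and 6 were carried out by hand in Excel. The same filtering and counting logic can be sketched in Python, as an illustration rather than a record of what the team ran. DOC_ROOT is a hypothetical placeholder, since the actual common root is not given on this page, and the function name and sample links are invented for the example. The sketch also assumes that "at least two links" means links from at least two distinct pages, which is one possible reading of the Issue Crawler logic.

```python
# Hypothetical placeholder: the real common root of the document URLs
# is not stated on this page.
DOC_ROOT = "http://example.org/afg/event/"


def most_mentioned(links, root=DOC_ROOT, min_inlinks=2):
    """Map each document URL to the pages linking to it.

    `links` is an iterable of (linking_page, target_url) pairs, such as
    the cleaned outlink list produced in steps 3 to 5. Only targets that
    contain `root` are kept, and only documents linked from at least
    `min_inlinks` distinct pages are returned (an assumption about how
    links are counted).
    """
    sources = {}
    for page, target in links:
        target = target.strip()
        if root in target:
            # A set ignores repeated links from the same page.
            sources.setdefault(target, set()).add(page)
    return {
        doc: sorted(pages)
        for doc, pages in sorted(sources.items())
        if len(pages) >= min_inlinks
    }


# Invented sample data, for illustration only.
sample_links = [
    ("http://blog-a.example/post1", DOC_ROOT + "doc-001"),
    ("http://blog-b.example/news", DOC_ROOT + "doc-001"),
    ("http://blog-a.example/post1", DOC_ROOT + "doc-002"),
    ("http://blog-c.example/", "http://example.org/about"),
    ("http://blog-a.example/post1", DOC_ROOT + "doc-001"),
]

for doc, pages in most_mentioned(sample_links).items():
    print(doc, len(pages), pages)
```

In the sample, only doc-001 is kept: it is linked from two different pages, while doc-002 has a single linking page and the last target does not contain the root. Counting raw links instead of distinct pages would only require replacing the set with a list.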

Preliminary Findings

As shown by the graphs in the attached presentation, content syndication was based on local interest. For example, UK political blogger James Barlow referred to British-related entries, not commenting on them but simply listing a collection of links. Engagement with the actual data is thus very low, which is unsurprising given the entries' extremely technical language.

The Afghan War Diary documents were usually not directly referenced; blog entries and news stories relied heavily on the reports and databases put together by WikiLeaks' official media partners, namely Der Spiegel, The Guardian and The New York Times.

Issues and Limitations

The highly technical language of the war diaries (military terms and codes) made them difficult to analyze individually. As a result, the envisioned content-based search was not carried out, particularly given the limited resources and time span of this project.


Based on such a reserved linking practice, the future of data-driven user journalism looks bleak. The Afghan War Diary 2004 - 2010 was a unique opportunity to engage with first-hand military information and to criticize crucial matters such as human rights violations and unjust killings. If these documents are indeed being discussed independently of linking or of major media outlets, then that analysis is happening underground, and it needs to surface for a true alternative account to emerge. The only beacon of hope in such a dark landscape, where only 17 documents are linked at least twice, is a blogger, Peak of Elephants, who astutely observes that most civilian shootings happened because of rebounds (2). This is the one user found in this study who both comments on the documents and links to them.

Further Research

Content analysis is expected to be useful for following syndication practices and thereby identifying non-hyperlinked networks. In other words, careful examination of how documents are discussed without being linked to could shed new light on the distribution and circulation of these highly controversial pieces of classified information. Special emphasis should be placed on the reusability of content, in order to avoid problems such as the undecipherable technicality of the original Afghan War Diary.




Topic attachments
Afghan_War_Diary_Twitter_Tagcloud.jpg (87 K, 13 Sep 2010 - 12:47, CatalinaIorga): A tag cloud with the most used hashtags and mentions related to the Afghan War Diaries.
Data-Driven_User_Journalism_The_Afghan_War_Diary.pdf (745 K, 13 Sep 2010 - 12:11, CatalinaIorga): Presentation with small graphs showing which alternative outlets cite the overall most mentioned Afghan War Diary entries.
List_of_Afghan_War_Diary_Inlinked_Documents.xls (40 K, 13 Sep 2010 - 12:15, CatalinaIorga): A list of the 179 Afghan War Diary entries that were linked to.
Sites_Linking_to_the_AfghanWarDiary_Homepage.xls (61 K, 13 Sep 2010 - 12:16, CatalinaIorga): A list of the sites linking to the Afghan War Diary homepage on WikiLeaks.
Twitter_Afghan_War_Diary_Results.xls (19 K, 13 Sep 2010 - 12:19, CatalinaIorga): A list of results for scraping Twitter with the query 'afghan war diary'.
Twitter_wardiary.wikileaks.org_Results.xls (53 K, 13 Sep 2010 - 12:19, CatalinaIorga): A list of results for scraping Twitter with the query ''.
Websites_Linking_to_Categories_on_wardiary.wikileaks.org_Tagcloud.jpg (113 K, 13 Sep 2010 - 12:45, CatalinaIorga): A tag cloud for the actors that link to broad categories on the WikiLeaks Afghan War Diary website (such as 'browse by region' or 'type').
Websites_Linking_to_wardiary.wikileaks.org_Tagcloud.jpg (106 K, 13 Sep 2010 - 12:46, CatalinaIorga): A tag cloud for the actors that link to the WikiLeaks Afghan War Diary homepage.
WikiLeaks_Documents_and_Who_Links_to_Each_of_Them_12082010.xls (44 K, 13 Sep 2010 - 12:24, CatalinaIorga): A list of all retrieved Afghan War Diary linked documents with the corresponding websites that mention them.
