Francisco J Grajales III
The idea that all data should be accessible by governments has a number of societal implications. What are the consequences when everyone has equal access to the databases that are used to make decisions for our society as a whole? And what happens to our privacy when we know that data gathered about our movements and actions are made public?
Investigating actors and movements in the discussion on open data could give some insight into the interests and perspectives that governments, companies and academics have regarding this topic. According to the Open Data Institute, open data is data that is made available by organizations, businesses and individuals for anyone to access, use and share (What Is Open Data? par. 2). This data is freely accessible to anyone, anytime, and anywhere without restrictions. The Open Data Institute is a non-profit organization and as such does not have any commercial interests, in contrast to, for example, the United States Government, which developed Project Open Data (Park and VanRoekel 2013). The webpage of the project describes open data in accordance with a specific set of principles; it declares that open data will be public, accessible, described, reusable, complete, timely, and managed post-release (Principles par. 1). In addition, it states that such data not only strengthens our democracy and promotes efficiency in government, but also has the potential to create economic opportunity and improve citizens' quality of life (Project Open Data par. 1).
Open data is a contested concept and raises scepticism when it comes to its relationship with privacy. Because of its open and disclosive nature, open data tends to conflict with sensitive issues such as an individual's right to privacy. Where should the line be drawn? In a political system based on open data, citizens are granted the benefit of an open government, which shares information and data with the public and thus should be transparent. On the downside, however, the same government might violate citizens' privacy rights and use the collected data against them. Citizens are not the only ones with privacy concerns: for a government that deals with top national security matters, being transparent may mean sharing those matters (partially) with the public. There seems to be a growing anxiety concerning the ambivalent implications of a political system based on open data. This controversial issue is being tackled by several NGOs and at various related events, including Open Up? 2014, whose speakers suggested that the best approach is for the government to decide, announce, and defend; civil servants and ministers would lock themselves in a darkened room, develop what they consider a beautifully crafted policy, and unleash it unto the world (Hughes par. 4). At this conference it was argued that policies on open data should be made, and that these should protect against violations of privacy.
We want to find out who runs the debate on open data and what the primary concerns are. Is it mostly run by academics? Were there changes during Open Up? 2014? To answer these questions we turn to Twitter, a social media platform where discussions (among academics and laymen) are constantly updated. By investigating this ongoing debate and the actors involved, this research can contribute to Twitter's bubble discourse.
Research Question and Hypothesis
How is the debate around privacy and open data manifested in social media, specifically Twitter? What are the main concerns that are discussed and who are the main actors driving discussion?
We expect the open data and privacy debate to be mostly academically driven. We do not expect to find ordinary people engaging with this debate. Furthermore, we anticipate topics like transparency and legislation to be the main concerns.
The initial data set from Twitter was gathered using the Twitter Capture and Analysis Tool (TCAT). The TCAT server used in this report is the one found on the DMI webpage. Under the drop-down menu we selected the privacy dataset, which contains tweets related to the keywords facebookprivacy, privacy, and surveillance. Within this set, we ran the query [open data OR opendata]. This specific set and query combination was chosen because it would produce tweets that mentioned both [privacy] and [open data]. In the query, the term [open data] was entered both with and without a space to account for the various ways it may be written in a tweet, capturing both hashtags and non-hashtags. The date parameters were set from 2014-07-10 to 2015-01-10, thus capturing six months.
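The selection logic of the query described above can be sketched in Python. This is only an illustration under stated assumptions: TCAT runs the query itself, so the function name `matches_query`, the `(text, date)` row layout, the sample tweets, and the choice to treat both date bounds as inclusive are all hypothetical.

```python
import re
from datetime import date

# Case-insensitive match for "open data" or "opendata" (also hits #opendata).
QUERY = re.compile(r"open\s?data", re.IGNORECASE)

# Date window used in the study; treating both bounds as inclusive is an assumption.
START, END = date(2014, 7, 10), date(2015, 1, 10)


def matches_query(text, created, start=START, end=END):
    """Return True if the tweet mentions open data and falls inside the window."""
    return start <= created <= end and QUERY.search(text) is not None


# Hypothetical sample rows: (tweet text, creation date).
sample = [
    ("Balancing #opendata and privacy", date(2014, 11, 12)),
    ("Open Data needs privacy safeguards", date(2014, 8, 1)),
    ("Privacy and surveillance news", date(2014, 9, 5)),
    ("opendata and privacy", date(2015, 2, 1)),
]

selected = [text for text, created in sample if matches_query(text, created)]
```

In this sketch the third sample row is dropped because it lacks the search term, and the fourth because it falls outside the date window.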
Once the data for these criteria was returned, a CSV file with all tweets and all information about them was exported. This file was uploaded to Discovertext, a cloud-based, collaborative text analytics tool, where the tweets were classified according to a range of criteria, such as the top retweets, top mentioned users, language preference of the user, location, and other similar metrics. The metrics analysed for the purposes of this paper were those concerning (i) the retweets (the top 10 and all of them collectively), (ii) the top 10 users who tweeted the most in this dataset, and (iii) the time of their activity. In the case of retweets, we first analysed the top 10 most retweeted users and the content of their retweeted tweets. We then examined the whole collection of retweets and coded each of them manually in order to determine the main concerns expressed in those retweets. After that, a word cloud of the labels was produced using infogr.am, showing the impact of each label by making the most used words the biggest. The largest labels correspond to the issues most discussed in the data set.
In order to identify the top concerns, we applied several content analysis techniques. First, we looked at the topics discussed within the data set. Using the coding tool in Discovertext, we labelled the full set of exact duplicate tweets and near duplicate tweets (retweets and modified retweets, 174 tweets in total) in order to identify the major concerns and areas of discussion. The tool brings up the original tweet for each retweet group and allows the user to add labels, either new or selected from the growing list. Through collaboration, each tweet was labelled and a comprehensive list was produced. With many different people working on this task, the initial list contained a number of errors such as redundant labels, spelling errors, and labels that were too specific or obscure. These were cleaned up by combining, editing and re-evaluating labels, resulting in a final list of relevant topics that indicates which topics were most discussed in our data set.
Using the coding tool of Discovertext, we were able to get a clear picture of the primary concerns that are discussed on Twitter, and we also used Discovertext to look at the content of the most retweeted tweets. In this section we will show our findings using visualizations and after that we will explore the content of the main concerns, using the tweet set and also other literature to get a better understanding of the concepts named in the concerns.
Having labelled all 174 retweeted tweets, we found that 74 different tags were associated with the tweets. Most of them (39) were unique tags, that is, they appeared only once in the entire sub-dataset of retweets. In this section, we give a detailed account of the 10 codes that were used the most. They are all individually defined and analysed briefly with regard to the overarching issues of open data and privacy.
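The tallying step described above can be sketched in Python. This is a minimal sketch, not the Discovertext workflow: the function name `summarize_labels` and the sample labels are hypothetical, and computing each share relative to the total number of label applications is an assumption about how the percentages were derived.

```python
from collections import Counter


def summarize_labels(labels_per_tweet, top_n=10):
    """Tally labels and return (distinct labels, single-use labels, top shares in %)."""
    counts = Counter(label for labels in labels_per_tweet for label in labels)
    total = sum(counts.values())
    singles = sum(1 for n in counts.values() if n == 1)
    top = [(label, round(100 * n / total, 2)) for label, n in counts.most_common(top_n)]
    return len(counts), singles, top


# Hypothetical coding result: one list of labels per retweet group.
coded = [
    ["big data", "balance"],
    ["big data"],
    ["transparency", "big data"],
    ["anonymity"],
]

distinct, singles, top = summarize_labels(coded, top_n=2)
```

Applied to the real coding output, `distinct` and `singles` would correspond to the 74 tags and the 39 tags used only once.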
The peak of activity was recorded on 12 November, corresponding to the 2014 Open Up? conference. The conference on open data was held in London and consisted of various talks by open data activists, government officials, private sector leaders, technologists, journalists, and other civil society leaders (openup2014.org). The number of tweets sent out on 12 November was 224, which accounts for 13.08% of the total number of tweets sent in a span of 6 months (10 July 2014 to 10 January 2015). Furthermore, most tweets were in English (1354/1712). The second and third most used languages were Italian (117) and French (81).
Figure 1. Most tweets cluster around the OpenUp?14 event, which took place on 12 November 2014 in London.
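The peak-day share reported above (224 of 1712 tweets) can be checked with a short Python sketch. The function name `peak_day` is hypothetical, and the way the remaining 1488 tweets are spread over other days is invented filler, used only so that the totals match the reported figures.

```python
from collections import Counter
from datetime import date, timedelta


def peak_day(dates):
    """Return (busiest day, tweets on that day, share of all tweets in percent)."""
    per_day = Counter(dates)
    day, n = per_day.most_common(1)[0]
    return day, n, round(100 * n / len(dates), 2)


# 224 tweets on the conference day, as reported in the text.
peak = date(2014, 11, 12)
# Invented filler: spread the other 1488 tweets over 100 earlier days.
filler_days = [date(2014, 7, 10) + timedelta(days=i) for i in range(100)]
dates = [peak] * 224 + [filler_days[i % 100] for i in range(1712 - 224)]

day, n, share = peak_day(dates)
```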
| big data || 14.04 %
| || 7.88 %
| || 6.16 %
| || 5.82 %
| protecting privacy || 5.14 %
| || 5.14 %
| || 5.14 %
| || 4.45 %
| || 3.77 %
| || 2.40 %
Table 1. Top 10 main concerns identified through the labeling of the retweets.
Big data is often used as a hashtag and is not defined in a specific way, unlike open data, which is mostly related to a specific case. Since the two terms might sound very broad, Joel Gurin explains the difference between big and open data: the former is defined by size, the latter by its use. A possible definition of big data may be "very large, complex, rapidly-changing datasets" (n. pag.), but Gurin admits that this description is not fixed and depends on technology, so the definition might change over time.
To explain what the balance between privacy and open data is, Ben Rooney quotes Harvey Lewis: "All organizations need to be thinking about the way they use data, not just in terms of the company's needs, but the balance between its needs and the needs of the customer or citizen" (n. pag.). Companies that use data should therefore weigh their own needs against those of the customer or citizen. As Rooney explains, privacy is one of those needs, and the imbalance is that customers who want to keep their privacy do not know what amount and kind of data is extracted from them.
The word or hashtag transparency is mostly used together with the word government. One of the tweets contains a link to the National Freedom of Information Coalition website. This coalition focuses on the openness of governments: "Join NFOIC in the fight for transparency." The website also says that the coalition held a conference where the relation between open government and open data was discussed. These two terms come together in another term that is used quite often in our Twitter database: open government data.
The conference Open Up? in 2014 was about the boundaries of openness and privacy. The main topic was the intersections and tensions around openness and privacy, which is why 'open up' is followed by a question mark in its title. The tweets mainly aimed to present and promote the event.
Within this discussion, many are wondering how they can protect their privacy while allowing for open data. Daniel Barth-Jones is the main actor within this concern; he is a health data specialist who works on data privacy. In his tweet about protecting privacy, Barth-Jones refers to his own article and relates it to anecdata, a term we explain further on.
As explained under transparency, and as we also see in our tweets, governments can be made open, but this still does not mean that we gain full insight into the decision-making process.
Anecdata is a term from Barth-Jones's article The Antidote for Anecdata: A Little Science Can Separate Data Privacy Facts from Folklore. It discusses the chain reaction that followed the New York-based Taxi and Limousine Commission's recent release of data on more than 173 million taxi rides in response to a Freedom of Information Law request. Barth-Jones uses the word anecdata to criticize the way that issues around open data and privacy are sometimes analyzed through anecdotes instead of proper scientific research.
Anonymity is a highly relevant concept for privacy and open data. The majority of the tweets regarding this concern were actually retweets of tweets by the bot user ATOM. Its original tweet that was retweeted and modified the most contained the word anonymizing. There is also a tweet declaring that data can be open or anonymous, but cannot be both at the same time.
The following tweet by the user opendatamcr (21/09), "Don't Spy On Us: Surveillance where do you draw the line?", indicates that some boundaries need to be established in order to protect privacy. Opendatamcr stands for the Open Data Manchester organization, which is interested in realizing the potential of open data. Other actors interested in the subject of surveillance are journalists, the Open Rights Group project, the Creative Commons board, and Omidyar Network, which is responsible for realising the OpenUp?14 event.
Security
Mainly governmental institutions, researchers and others interested in open data and open source tweeted about security. The tweets mention how and when security is achieved and how it is related to open data. Privacy is mentioned in combination with security, since open access to (personal) data can mean a violation of one's privacy. Securing data is therefore seen as vital.
Top 10 most engaged actors
In the initial top 10 of most engaged actors, the first and third places were taken by bots (A.T.O.M and NoSQL). We considered them worth mentioning because they add to the ongoing discourse, but since both are automated, they do not actually engage as an ordinary person would. Therefore the 11th and 12th most active users were taken into account and appear in the new top 10 list. The list is made up only of academics, professionals and organizations (distinguished by color, see the legend below) that have a previous vested interest in the debate between open data and privacy. No ordinary people made it into the top 10, which is not surprising considering Twitter's nature and its penchant for rewarding those who have an interesting or expert viewpoint.
Legend of colors:
| Organizations/bodies actors
| Ordinary users: None
- Daniel Barth-Jones @dbarthjones (31): Columbia University, NYC
- GoonjLabs @GoonjLabs (20): Mumbai
- RGF3 Esq. @Iron_Light (16): CEO Iron Light, Seattle
- openeverything @openevrthng (16): Politics blog, Berlin
- Eddie Copeland @EddieACopeland (13): Head of Technology Policy unit, UK.
- opencorporates @opencorporates (11): England.
- Alan Hudson @alanhudson1 (10): Managing Director for Global Integrity, Washington DC.
- Metamedio @metamedio (9): Consulting and Strategy Functions, Spain.
- Martin Tisne @martintisne (7): Director of policy at Omidyar Network, London.
- Andrew Clarke @andrew_c_clarke (6): Government Transparency Team, London.
(originally in 1st place) A.T.O.M @atomsoffice (128)
(originally in 3rd place) NoSQL @NoSQLDigest (28)
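The ranking step, in which automated accounts are set aside and the next most active users move up, can be sketched in Python. This is an illustrative sketch only: the function name `top_actors` is hypothetical, the counts are a subset of the list above, and the set of bot accounts is assumed to be supplied by hand rather than detected automatically.

```python
def top_actors(tweet_counts, bots, n=10):
    """Rank users by tweet count, skipping known automated accounts."""
    humans = {user: count for user, count in tweet_counts.items() if user not in bots}
    return sorted(humans.items(), key=lambda item: item[1], reverse=True)[:n]


# Subset of the tweet counts listed above.
counts = {
    "atomsoffice": 128,
    "dbarthjones": 31,
    "NoSQLDigest": 28,
    "GoonjLabs": 20,
    "Iron_Light": 16,
}
# Accounts identified as bots in the text; maintained manually in this sketch.
bots = {"atomsoffice", "NoSQLDigest"}

ranking = top_actors(counts, bots, n=3)
```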
Retweeted the most
Out of the total of 174 retweets, these are the top 10 most retweeted groups. Below we report the names and usernames of the actors, their affiliation and the retweet. Again, all come from users previously engaged in the field and mainly serve to disseminate information, either by linking to articles or by mentioning events and programs. Only one tweet (no. 9) seems to take a stance and attempt to start a discussion.
(The color scheme just used is not applied on the following two figures)
1. Jonathan Gray @jwyg: Director of Policy at Open Knowledge Foundation, Cambridge.
RT @jwyg: New research project exploring tensions between open data data protection and privacy: http://t.co/Ox89PnIAIG #opendata #privacy
2. Daniel Barth-Jones @dbarthjones: Columbia University, NYC.
RT @dbarthjones: De-identify Anonymize Differential Privacy & more on getting past Anecdata http://t.co/KN6Qpsvzbo #OpenData #BigData #Pr
3. Daniel Barth-Jones @dbarthjones: Columbia University, NYC.
RT @dbarthjones: De-identify Anonymize Differential Privacy & Getting Past Anecdata http://t.co/KN6Qpsvzbo #OpenData #BigData #DataSci http
4. Harvard University @Harvard: Cambridge.
RT @Harvard: The promise and perils of de-identifying learner data from MOOCs and how to balance privacy with open data http://t.co/6309hi
5. Derrick Harris, @derrickharris: Senior writer at Gigaom, San Francisco and NYC.
RT @derrickharris: 3 interesting @gigaom posts this AM: economy http://t.co/F7a1yeIhnW privacy http://t.co/MITBZpSXu7 and open data http:/
6. OpenData BC, @OpenDataBC, British Columbia.
RT @OpenDataBC: Interesting #opendata #openscience article RT @tamaradull: #Privacy Anonymity and #BigData in the Social Sciences http://
7. Policy Exchange, @Policy_Exchange.
RT @Policy_Exchange: Tues 10am at #CPC14: Open Data Big Data and Privacy w/ David Willetts @martintisne @dominiccampbell @SaturnSA4 http:/
8. Neal Mann, @fieldproducer. Multimedia Innovations Editor at The Wall Street Journal, currently working on strategy for News Corp Australia
RT @fieldproducer: Fascinating to see pressure on the #SamaritansRadar app over privacy despite open data app clearly good intentioned htt
9. Renata Avila, @avilarenata. Berlin.
RT @avilarenata: We cannot discuss #transparency and open data without discussing #surveillance secrecy inequality of data from foreigner
10. Nick Scott @thefaketree: Fredericton, Canada
RT @thefaketree: Critically important talk by @mgeist at #GovMaker Lets do #opendata right w/ respect to privacy #nbpoli (9)
Figure 3: Word Cloud of main concerns on Twitter. The size is determined by number of mentions (see fig. 2).
Figure 4: Actors in the debate on Twitter.
According to our analysis, the main concerns are related to policy and legislation of open data. The content focuses mainly on the role of governments in the debate surrounding open data. The relationship between government and transparency is considered an important one. However, open data does not necessarily mean transparency of the government: open data has to make sense to a citizen in order to actually be considered information. This brings us to the notion of big data, which is mostly used as an umbrella term for large datasets. Oftentimes, open data is big data, but big data is not necessarily open. The terms surveillance and spying are often used together, and they point to a possible invasion of citizens' privacy. Furthermore, the question of whether anonymity can always be guaranteed in the case of open data is still very much under debate.
All the different types of users are involved in the discussion, as they all use the term balance in their tweets at least once. It is used to describe the interests of citizens concerning privacy and the interests of organizations concerning open data. The balance keyword indicates that a great deal of the conversation acknowledges privacy and open data as equally important, interrelated matters, instead of looking at privacy as an obstacle to open data.
In our analysis of the retweets, we deduced that the discussion is driven by academics, and much less by ordinary users. The discussion is dominated by an article by Barth-Jones, which illustrates a specific characteristic of the platform: discussion can be initiated and led by the most influential actors. Only a few users tweeted original messages: the content is multiplied and copied endlessly, as can be seen in the case of Barth-Jones's article, which is referred to in many tweets and generates terms such as anecdata and NYC taxi.
The discussion on Twitter can be seen as mainly event-based, since most of the tweets were generated around the OpenUp? 14 event (a total of 448 tweets were tweeted on that day), and the main topic during this peak was the intersection and tension between openness and privacy. It can be said that Twitter users were trying to reach a point where a win-win situation could be achieved.
In conclusion, one can argue that open data is still a deeply controversial issue with legal, political, and economic aspects, which will require a long period of time to be fully resolved and well framed. The conversations are happening mainly among academics, researchers, activists, policy makers, journalists, corporate employees, and other highly involved and active actors, but they are still far removed from ordinary people.
Limitations and propositions for further research
There were some limitations to this research. Firstly, Twitter does not necessarily reflect public opinion, so we can form an image of the discussion only within the boundaries of the platform. Furthermore, the dataset was limited to a period of 6 months, with the OpenUp? 14 event dominating the discussion. Additionally, if bots had been excluded, the analysis might have been different.
For further research it might be interesting to exclude 12 November (the day when OpenUp? 14 took place) from the dataset, in order to look at the discussion outside of the event. Moreover, it would be interesting to figure out how the main actors communicate and debate with each other, and to what extent they have a common or a contradicting agenda.
Barth-Jones, Daniel. "The Antidote for Anecdata: A Little Science Can Separate Data Privacy Facts from Folklore." Info/Law, Nov 21, 2014. 15-1-2015 <http://blogs.law.harvard.edu/infolaw/2014/11/21/the-antidote-for-anecdata-a-little-science-can-separate-data-privacy-facts-from-folklore/>
Gurin, Joel. "Big data and open data: what's what and why does it matter?" The Guardian, April 2014. 16-1-2015 <http://www.theguardian.com/public-leaders-network/2014/apr/15/big-data-open-data-transform-government>
Hughes, Tim. "Opening policy, protecting privacy." Open Up? 2014, November 2014. 16-1-2015 <http://www.openup2014.org/opening-policy-protecting-privacy/>
Muente-Kunigami, Arturo. "Differences between '(Open Government) Data' and 'Open (Government Data)'." World Bank, March 2012. 16-1-2015 <http://blogs.worldbank.org/ic4d/differences-between-open-government-data-and-open-government-data>
Park, Todd, and Steven VanRoekel. "Introducing: Project Open Data." The White House, 2013. 16-1-2015 <http://www.whitehouse.gov/blog/2013/05/16/introducing-project-open-data>
"Principles." Project Open Data. 19-1-2015 <https://project-open-data.cio.gov/principles/>
"Project Open Data." Project Open Data. 19-1-2015 <https://project-open-data.cio.gov/>
"What Is Open Data?" theodi.org. Open Data Institute. 13-1-2015 <http://theodi.org/what-is-open-data>