A study on the bias in Twitter's Search API and Streaming API during the Hong Kong Protests, between October 11th and 12th, 2014

Team Members

Richelle Werners (10296980)

Chaïm Wijnberg (5908930)

David de Bie (10121021)

Introduction

On June 10th, 2014, Beijing issued a so-called ‘white paper’ on Hong Kong, which made clear that the city's freedoms could be revoked at any time (The Telegraph). Subsequently, on June 30th, 2014, the protest group Occupy Central organised an unofficial referendum in which 800,000 people voted in favour of greater democratic freedoms than Beijing had proposed in this ‘white paper’ (The Telegraph), with which China attempts to rule over Hong Kong, over which it regained sovereignty from Britain in 1997 (Li, Khan, Browning and Liu). Then, on August 31st, 2014, when China insisted on its right to vet candidates for Hong Kong’s next leadership election in 2017, Occupy Central and other protest groups responded by announcing an ‘era of civil disobedience’, including protests and mass sit-ins (The Telegraph). These events eventually led to actual protests that began on September 22nd, 2014 (The Telegraph). These first protests included a week-long boycott of school classes and protests near, and sometimes even inside, government headquarters (The Telegraph). The police used pepper spray to repel the protesters, who defended themselves with their umbrellas¹ (The Telegraph). In December, after two months of rallies punctuated by violence, three protest leaders surrendered to the police while the police were trying to clear the protesters off the streets (Aljazeera). Afterwards, on December 10th, the protests ended when the police managed to clear all the streets (Li, Khan, Browning and Liu).

Since the beginning of the protests, the Hong Kong protesters have won a lot of sympathy from people around the world, mostly due to social media such as Twitter, and especially since images of Hong Kong's #UmbrellaRevolution spread globally (Tahroor). Although the protests in Hong Kong could certainly have happened without the Internet, social media did play an important role in them (Parker). This is mainly because an active group of people published images and information about the protests to the world via social media, and thus found creative ways to share their messages (Parker). Hence, it can be said that social media (mostly accessed via mobile devices) have helped to sustain the protests (Parker). When this active group of people spread their images and information via social media, they included several hashtags.

The important role of social media during protests is not something new. In the past few years, especially since the Arab Spring and the Occupy movement, social movements and their social media usage have been a major topic of scholarly research. It seems that the Internet and social media can give social movements the advantage of reducing transaction costs, blurring private and public boundaries, and enabling access to information and new types of knowledge management systems, which has given social movements new strategic possibilities for organizing themselves (Bimber, Stohl and Flanagin 72). This is especially the case with Twitter: with over 300 million users, Twitter is the second most popular social network after Facebook, and it has been deemed such an indispensable tool for political activists around the world that Blake Hounshell, the managing editor of Foreign Policy², suggested in 2011 that future revolutions will be tweeted (qtd. in Theocharis 36). Hence, it seems that Twitter may be beneficial to social movements because, due to its information sharing, collaboration and community building characteristics, it facilitates loosely organized protest activism (Theocharis 49). This means that social media can help with the organization and mobilization of people (Bimber, Stohl and Flanagin 83). More specifically, activists seem to benefit from social media by improving their online organization and coordination (Theocharis 52).

The fact that social media played an important role in the Hong Kong protests, makes it interesting to investigate how the public debate on the Hong Kong protests was situated on Twitter. More specifically, it is interesting to examine which hashtags were used in the debate on Twitter about the Hong Kong protests, who used these hashtags and how much influence and/or reach these people have had by focussing on their followers, mentions and retweets. Also, the present study will examine if there are any differences or matches regarding these aspects that can be identified from the datasets of the Streaming API and the Search API, between October 11th and October 12th, 2014. Therefore, the present study will use Twitter as a research object with the use of the Twitter Capture and Analysis Toolset provided by the Digital Methods Initiative. This leads to the following research question:

RQ: When it comes to the public debate on the Hong Kong 2014 protests on Twitter, what are the main differences and/or matches regarding hashtags, influencers and top tweeters that can be identified from the datasets of the Streaming API and the Search API, between October 11th and October 12th, 2014?

Before answering this research question, an introduction to Twitter and Twitter studies will be given. Afterwards, the research methods used will be substantiated and the obtained results will be presented. This paper will end with a conclusion, in which the research question will be answered, and a discussion, in which the findings will be critically discussed, implications will be presented, shortcomings and limitations will be addressed and suggestions for future research will be given.

Twitter and Twitter studies

Twitter is a microblogging system that was launched in 2006 and currently has 284 million monthly active users (Twitter 2014). Together, these users send 500 million tweets per day (Twitter 2014). Twitter can be used via computers, laptops, tablets and mobile devices. When people use Twitter, they write tweets of up to 140 characters, which are received by all followers of the user. Depending on the privacy settings, tweets can also be viewed and searched for individually by any user. This is enabled by an important feature of Twitter, namely the so-called ‘hashtags’. Hashtags are keywords that can be attached to a tweet by placing them in the tweet, directly after the '#' symbol. According to Small, “hashtags are central to organizing information on Twitter. …Hashtags ‘organize discussions around specific topics or events’” (873-4). Bruns and Stieglitz stress the importance of the hashtag to tweets and to the functioning of Twitter by stating that hashtags are an originally user-generated mechanism for making messages related to a specific topic more easily discoverable (164). Users and non-users can use Twitter to search for hashtags and in this way fetch the stream of new messages containing specific hashtags in real time (Bruns and Stieglitz 164). According to Bruns and Stieglitz, this aspect makes hashtags a useful and important mechanism for coordinating conversations around identified themes and events, ranging from breaking news through major media events to viral marketing campaigns (164). Furthermore, hashtags are interesting because hashtagged discussions on Twitter are not controlled by any one organization or user; all sorts of actors can publish tweets online and be central figures in a Twitter discussion, and every user can decide whether or not to use hashtags (Bruns et al. 7).

Since Twitter turned out to be a very useful communication tool during events, disasters, conferences, elections and other “massively shared experiences” (Rogers 4), Twitter can also be used as a research object. The present study uses Twitter as a research object and, more specifically, as an archived data set (Rogers 2). When Twitter is used as an archived data set, there are a few ways to fetch data, that is, tweets, from Twitter. Two of those ways are the Search API and the Streaming API. Both APIs are essential parts of Twitter programming for tweet collection. It is important to note that these APIs differ in the way they work: Search goes back in time, while Streaming goes forward. Furthermore, the Search API has a rich set of operators that can filter results based on attributes like location of sender, language and various popularity measurements, can collect a wider range of data and is able to pull historical data. The Streaming API, by contrast, returns real-time data and usually returns a much higher flow of tweets (140Dev). Thus, there might be differences in the data that the two APIs collect.
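The difference in direction between the two APIs can be illustrated with a toy model. The sketch below does not call Twitter at all: the sample tweets and the function names `search_like` and `stream_like` are hypothetical and only show, under these simplifying assumptions, that a backward-looking and a forward-looking collection started at the same moment cover different tweets.

```python
from datetime import datetime, timedelta

# Hypothetical sample tweets; only the timestamps matter for this illustration.
TWEETS = [
    {"id": 1, "created_at": datetime(2014, 10, 10, 23, 0)},
    {"id": 2, "created_at": datetime(2014, 10, 11, 9, 0)},
    {"id": 3, "created_at": datetime(2014, 10, 11, 18, 0)},
    {"id": 4, "created_at": datetime(2014, 10, 12, 7, 0)},
]


def search_like(tweets, asked_at, lookback):
    """Look backward: ids of tweets posted in the `lookback` period before `asked_at`."""
    return [t["id"] for t in tweets
            if asked_at - lookback <= t["created_at"] <= asked_at]


def stream_like(tweets, connected_at, duration):
    """Look forward: ids of tweets posted in the `duration` period after `connected_at`."""
    return [t["id"] for t in tweets
            if connected_at < t["created_at"] <= connected_at + duration]


moment = datetime(2014, 10, 11, 12, 0)
backward_ids = search_like(TWEETS, moment, timedelta(days=1))
forward_ids = stream_like(TWEETS, moment, timedelta(days=1))
print(backward_ids, forward_ids)
```

In this toy case the two collections do not overlap at all; in practice, a Search capture and a Streaming capture of the same period can overlap while still differing in which tweets they return.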

Research Question

When it comes to the public debate on the Hong Kong 2014 protests on Twitter, what are the main differences and/or matches regarding hashtags, influencers and top tweeters that can be identified from the datasets of the Streaming API and the Search API, between October 11th and October 12th, 2014?

Hypothesis

Based on the characteristics of both the Search API and the Streaming API, it can be expected that there will be differences in the data they pull regarding hashtags, influencers and top tweeters that can be identified from their datasets between October 11th and October 12th, 2014. This leads to the following hypothesis:

H1: Based on the characteristics of both the Search API and the Streaming API, it can be expected that there will be differences regarding hashtags, influencers and top tweeters that can be identified from their datasets between October 11th and October 12th, 2014.

Methodology

To analyse what the main differences and/or matches are regarding hashtags, influencers and top tweeters, that can be identified from the datasets of the Search API and the Streaming API between October 11th and October 12th, 2014, the present study used the [HongKongProtests] dataset. This dataset contains all tweets with the hashtags #hongkong, #occupycentral, #umbrellarevolution, #occupyadmiralty, #hk929, #hkstudentstrike and #hongkongprotests that have been posted between October 1st, 2014 and January 6th, 2015. For this analysis the present study made use of the Twitter Capture and Analysis Toolset provided by the Digital Methods Initiative, which will be further explained in the Tools section.

Tools

As mentioned above, the present study used the Twitter Capture and Analysis Toolset (TCAT) to analyse the tweets in our datasets. To find the top 10 hashtags, influencers and top tweeters, the query “#hongkong OR #occupycentral OR #umbrellarevolution OR #occupyadmiralty OR #hk929 OR #hkstudentstrike” was used on both the Streaming API dataset and the Search API dataset in TCAT. After running the different TCAT tools, the results of the two datasets were compared.
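How such an OR query selects tweets can be sketched in a few lines of Python. This is an illustration only, not TCAT's implementation: it assumes case-insensitive matching on whole hashtags, and the helper names `query_terms`, `hashtags_in` and `matches` are hypothetical.

```python
import re

QUERY = ("#hongkong OR #occupycentral OR #umbrellarevolution OR "
         "#occupyadmiralty OR #hk929 OR #hkstudentstrike")


def query_terms(query):
    """Split the OR query into a set of lower-cased hashtags without '#'."""
    return {term.strip().lstrip("#").lower() for term in query.split(" OR ")}


def hashtags_in(text):
    """Extract the lower-cased hashtags that occur in a tweet text."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}


def matches(text, query=QUERY):
    """A tweet matches if it shares at least one hashtag with the query (assumed rule)."""
    return bool(hashtags_in(text) & query_terms(query))


print(matches("Students gather again #OccupyCentral"))  # shares a hashtag with the query
print(matches("Rain all day in #London"))               # shares none
```

Under this assumed matching rule, a tweet that carries only a related hashtag outside the query, such as #umbrellamovement, would not be selected unless it also contains one of the six query hashtags.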

Findings

In this section, two datasets will be analysed, namely [HongKongProtests], which was collected via the Streaming API, and [hongkonglookups], which was collected via the Search API. First, the results for the 11th and 12th of October will be analysed by looking at the Streaming API. Second, the results for those two days will be analysed by focussing on the Search API.

Streaming API

At the time of writing, the TCAT dataset [HongKongProtests] contains 2,058,748 tweets, sent by 521,690 distinct users, that were recorded from October 1st, 2014 until January 6th, 2015. 45.2% of those tweets contain links, which means that 54.8% do not. On the 11th and 12th of October, 36,904 tweets were sent by 16,156 distinct users; 53.7% of those tweets contain links, which means that 46.3% do not. Appendix 1 shows the different peaks of tweets that were sent on the two days.

When one looks at the user statistics tool available in DMI-TCAT, it can be found that the overall minimum number of tweets per user is 1 and the maximum is 592, with an average of 2.28 tweets per user. The average follower count is 3751.86, with one account that has 2,677,985 followers. Table 1 shows the results of the Hashtag tool, which gives the overall frequency of the hashtags that were used over the two days. The five most frequent hashtags are #hongkong, #OccupyCentral, #UmbrellaRevolution, #Occupyhk and #umbrellamovement. The first three of these are in the top five because they belong to the keywords with which the dataset was captured: hk929, hkstudentstrike, hongkong, hongkongprotests, occupyadmiralty, occupycentral and umbrellarevolution. Interestingly, the captured hashtags #hk929, #hkstudentstrike and #occupyadmiralty do not appear among the most frequent hashtags. The hashtag #香港 means 'Hong Kong' in Chinese.
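The kind of counting behind these statistics can be sketched as follows. This is not the DMI-TCAT code: the sample tweets, user names and function names are hypothetical, and hashtags are assumed to be compared case-insensitively. The last line only checks the arithmetic of the figures reported above.

```python
import re
from collections import Counter


def hashtag_frequencies(texts):
    """Count how often each hashtag occurs, ignoring case (assumed rule)."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return counts


def tweets_per_user(user_names):
    """Return (minimum, maximum, average) number of tweets per distinct user."""
    per_user = Counter(user_names)
    values = list(per_user.values())
    return min(values), max(values), sum(values) / len(values)


# Hypothetical sample data: one tweet text and one sender name per tweet.
sample_texts = [
    "Crowds in Admiralty #HongKong #OccupyCentral",
    "Umbrellas everywhere #hongkong #UmbrellaRevolution",
    "Tonight #香港 #hongkong",
]
sample_users = ["user_a", "user_a", "user_b"]

top_hashtag = hashtag_frequencies(sample_texts).most_common(1)
stats = tweets_per_user(sample_users)

# Arithmetic check on the reported two-day figures: tweets divided by distinct users.
reported_average = 36904 / 16156
print(top_hashtag, stats, round(reported_average, 2))
```

Dividing the 36,904 tweets by the 16,156 distinct users reported for the two days gives roughly 2.28, which agrees with the average number of tweets per user stated above.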

Table 2 shows the overall top 10 users who tweeted the most. [hongkongcang] has tweeted the most, with 592 tweets. This account has sent zero retweets, mentions others zero times, and has been mentioned four times. In addition, this account has 3731 followers and 4081 friends. Table 3 shows the users sorted by the number of times they have been mentioned. [fion_li] is the top user, who has been mentioned 961 times and mentions others 12 times. This account has 3379 followers and 972 friends. In addition, this user has sent 69 tweets and has retweeted three times.

The two tables show that no user appears in both lists. In Table 3, [hkdemonow] is actually the top tweeter of the list, with 84 tweets. Among the top tweeters in Table 2, the most mentioned is [tax_free], who is mentioned 209 times. This shows that there is a clear difference between the users who tweet the most and the users who are mentioned the most.

Search API

At the time of writing, the TCAT dataset [hongkonglookups] contains 259,069 tweets, sent by 70,226 distinct users, that were recorded from September 30th, 2014 until October 15th, 2014. 84.8% of those tweets contain links, which means that 15.2% do not. On the 11th and 12th of October, 27,096 tweets were sent by 9,826 distinct users; 86.2% of those tweets contain links, which means that 13.8% do not. Appendix 2 shows the different peaks of tweets that were sent on the two days.

When one looks at the user statistics tool available in DMI-TCAT, it can be found that the overall minimum number of tweets per user is 1 and the maximum is 293, with an average of 2.38 tweets per user. The average follower count is 4916.16, with one account that has 2,638,080 followers; the average friend count is 1310.09, with one account that has 892,905 friends.
Table 4 shows the results of the Hashtag tool, which gives the overall frequency of the hashtags that were used over the two days. The five most frequent hashtags are #hongkong, #OccupyCentral, #UmbrellaRevolution, #Occupyhk and #umbrellamovement. The first three of these are in the top five because they belong to the keywords with which the dataset was captured: hk929, hkstudentstrike, hongkong, hongkongprotests, occupyadmiralty, occupycentral and umbrellarevolution. Interestingly, as in the Streaming API dataset, the captured hashtags #hk929, #hkstudentstrike and #occupyadmiralty do not appear among the most frequent hashtags.

Table 5 shows the overall top 10 users who tweeted the most. [hongkongcang] has tweeted the most, with 593 tweets. The account has sent zero retweets, mentions others zero times, and has been mentioned four times. In addition, the account has 4215 followers and 4514 friends. Table 6 shows the users sorted by the number of times they have been mentioned. [fion_li] is the top user, who has been mentioned 901 times and mentions others 11 times. The account has 5257 followers and 1072 friends. In addition, the account has sent 65 tweets and has retweeted three times.

The two tables show that no user appears in both lists. In Table 6, [hkdemonow] is actually the top tweeter of the list, with 83 tweets. Among the top tweeters in Table 5, the most mentioned is [tax_free], who is mentioned 192 times. This shows that there is a clear difference between the users who tweet the most and the users who are mentioned the most.

Conclusion

The present study aimed to examine the main differences and/or matches regarding hashtags, influencers and top tweeters that can be identified from the datasets of the Streaming API and the Search API, between October 11th and October 12th, 2014. It used the Hong Kong protests of 2014 as a case study.

When looking at the top 10 hashtags in the Streaming API and the Search API datasets, it can be stated that the same hashtags appear in both. However, there is a big difference in the frequency with which the hashtags appear. For instance, the most used hashtag, #hongkong, was used 15,015 times in the Streaming API dataset, but only 13,830 times in the Search API dataset. Apparently there is a loss of 1,185 tweets. A possible explanation for this loss could be that the Search API only returns the tweets Twitter considers most relevant, and has therefore marked the 1,185 tweets as irrelevant.
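The size of this gap can be worked out explicitly. The short sketch below uses only the two #hongkong counts reported above; the function name `capture_gap` is hypothetical.

```python
def capture_gap(streaming_count, search_count):
    """Return the absolute gap and the gap as a share of the Streaming count."""
    missing = streaming_count - search_count
    share = missing / streaming_count
    return missing, share


# Reported #hongkong frequencies for October 11th and 12th, 2014.
missing, share = capture_gap(15015, 13830)
print(missing, round(share * 100, 1))
```

Expressed relative to the Streaming API count, the 1,185 missing tweets amount to roughly 7.9% of the #hongkong tweets.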

When looking at the top tweeters in the Streaming API and the Search API datasets, it can be stated that the same users appear in both. However, top tweeter [hongkongcang] sent 592 tweets according to the Streaming API dataset, but 593 tweets according to the Search API dataset. In the Search API dataset, the user also has more followers and friends, a higher number of hashtags used, and a higher number of tweets with hashtags. The only value that is the same is the number of times the user has been mentioned. A possible explanation for this could be that there was some corrupted data in either the Streaming API or the Search API dataset.
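Such a field-by-field comparison can be sketched as follows. The values are those reported for [hongkongcang] in the Findings section; the dictionary layout and the function name `field_differences` are hypothetical and only illustrate one way of lining up the two datasets.

```python
# Values reported for [hongkongcang] in the two datasets.
streaming = {"tweets": 592, "followers": 3731, "friends": 4081, "mentioned": 4}
search = {"tweets": 593, "followers": 4215, "friends": 4514, "mentioned": 4}


def field_differences(first, second):
    """Return second minus first for every field whose value differs."""
    return {key: second[key] - first[key]
            for key in first
            if first[key] != second[key]}


diffs = field_differences(streaming, search)
print(diffs)
```

Only the number of mentions drops out of the result, which agrees with the observation that this is the one value the two datasets share for this user.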

When looking at the most mentioned users in the Streaming API and the Search API datasets, it can be stated that the same users appear in both. The top user [fion_li], who has been mentioned the most, has 961 mentions according to the Streaming API dataset and 901 according to the Search API dataset. Furthermore, the user has 3379 followers and 972 friends in the Streaming API dataset, and 5257 followers and 1026 friends in the Search API dataset. According to the Streaming API dataset, the user has sent 69 tweets, mentions others 12 times, and has used 116 hashtags in 69 tweets. In contrast, the Search API dataset shows lower numbers. The only match between the two datasets is the number of retweets the user has sent. A possible explanation could be that Twitter has marked some tweets as irrelevant and that the user gained some followers and friends in the meantime. Thus, in line with the proposed hypothesis and the research question, it can be said that there are indeed differences in the data gathered via the two APIs.

Discussion

The dataset provided by the researchers of Leiden University contained two gaps: on October 5th and October 19th, no tweets were captured due to a technical issue. As avoiding these gaps would mean that the dataset would start on October 20th, and would thus miss a few important dates, the present study decided to include these gaps. Although this could affect the credibility of the data and thus of this research, the results were still substantial enough to show the value of this research. The gaps can be seen as a shortcoming, but they can also serve as inspiration for future research: the present study encourages scholars to capture data in this same period, in which there are peaks in activity. Theocharis likewise states that the use of Twitter as a medium during protests deserves further investigation: “the fact that the medium provides the means for coordination and collaboration should prompt scholars, especially in the field of social movements and protest action, to further explore whether activists actually alter their courses of action based on information received through Twitter during protest events such as demonstrations” (52).

Furthermore, research should be done on a content level as well. Big data research has the advantage of being able to investigate large amounts of data in relatively little time, but the large dataset makes it impossible to analyse the content of tweets. Because hashtags are only used to categorize tweets, tweets using particular hashtags can contain either positive or negative content about a subject or issue; this means that a content analysis of the aggregated tweets could produce differing results. As Theocharis states, future research on Twitter "should apply message (tweet) content analysis with specialized software for social media text crawling to in-depth interviews with end-users/protesters” (52). The present research can be used to distinguish the most influential users and hashtags, in order to determine which tweets should be analyzed on a content analysis level.

References

Footnotes

1. The umbrella has become the defining image of the protest movement. Umbrellas were used by the protesters to shield themselves from police tear gas, and to carry painted political slogans. See http://www.cnn.com/2014/09/30/world/asia/objects-hong-kong-protest/

2. See for entire article: http://foreignpolicy.com/2011/06/20/the-revolution-will-be-tweeted/.

Bibliography

“About”. Twitter.com. Twitter, Inc. 3 December 2014. <https://about.twitter.com/company>.

"Aggregating tweets: Search API vs. Streaming API". 140Dev. Twitter API Programming Tips, Tutorials, Source Code Libraries and Consulting. 140Dev, LLC. 19 January 2015. <http://140dev.com/twitter-api-programming-tutorials/aggregating-tweets-search-api-vs-streaming-api/>.

Aljazeera. “Hong Kong Protest Leaders Surrender to Police.” Aljazeera. Aljazeera. 4 December 2014. Web. 15 January 2015. <http://www.aljazeera.com/news/asia-pacific/2014/12/hong-kong-protest-leaders-surrender-police-201412323244153270.html>.

Bimber, Stohl & Flanagin: Chadwick, Andrew, and Philip N. Howard, eds. Routledge handbook of Internet politics. Taylor & Francis, 2010.

Bruns, Axel and Stefan Stieglitz. “Quantitative Approaches to Comparing Communication Patterns on Twitter”. Journal of Technology in Human Services 30: 3-4 (2012): 160-85.

Bruns, Axel, Tim Highfield and Jean Burgess. “The Arab Spring and Its Social Media Audiences: English and Arabic Twitter Users and Their Networks”. American Behavioral Scientist 57.7 (2013): 871-98.

Fion Li, Natasha Khan, Jonathan Browning and Alfred Liu. “Hong Kong Protests Started With a Roar End With a Whisper.” Bloomberg. Bloomberg. 11 December 2014. Web. 15 January 2015. <http://www.bloomberg.com/news/2014-12-11/hong-kong-police-remove-democracy-protesters-making-last-stand.html>.

Parker, Emily. “Social Media and the Hong Kong Protests.” The New Yorker. The New Yorker. 1 October 2014. Web. 15 January 2015. <http://www.newyorker.com/tech/elements/social-media-hong-kong-protests>.

Rogers, Richard. "Debanalizing Twitter: The Transformation of an Object of Study". Proceedings of ACM Web Science 2013. Paris: May 2013.

Small, Tamara. “WHAT THE HASHTAG?” Information, Communication and Society 14:6 (2011): 872-95.

Tahroor, Ishaan. “Hong Kong’s Students Want You to Stop Calling Their Protest a ‘Revolution’.” The Washington Post. The Washington Post. 4 October 2014. Web. 15 January 2015. <http://www.washingtonpost.com/blogs/worldviews/wp/2014/10/04/hong-kongs-students-want-you-to-stop-calling-their-protest-a-revolution/>.

Theocharis, Y. “The Wealth of (Occupation) Networks? Communication Patterns and Information Distribution in a Twitter Protest Network.” Journal of Information Technology & Politics 10.1 (2013): 35-56. 14 January 2015. <http://www.tandfonline.com/doi/pdf/10.1080/19331681.2012.701106>.

The Telegraph. “Hong Kong Pro-Democracy Protests: Timeline.” The Telegraph. The Telegraph. 11 December 2014. Web. 14 January 2015. <http://www.telegraph.co.uk/news/worldnews/asia/hongkong/11287055/Hong-Kong-pro-democracy-protests-timeline.html>.

Appendices

Appendix 1

Appendix 2

Topic revision: 19 Jan 2015, ChaimWijnberg