The Tracker's Guide to the Cloud
Team Members
Esther, Anne, Kalina, Sabine, Lonneke, Diego, Allesandro, Farida, Gabriele, Sara & Carolin.
Introduction
In this project we aim to grasp the cloud. What is the cloud, and how can we recognise it? But also: where is it located? In this way we engage with cloud critiques (jurisdiction, privacy, etc.). The cloud has many facets, so we specify which part we are interested in: the content going IN to websites (CDNs) and the content going OUT of websites (trackers). We are thus not researching which clouds run which websites, but seek to trace the ecology of cloud services and associated dataflows.
The Tracker Guide can be found here:
TrackerGuide_reduced.pdf
Research Questions
How can we detect and fingerprint the cloud?
How can we trace the flow of data in and out of the cloud?
Methodology
Starting from the top 100 websites according to Alexa, we formed two subgroups: one focusing on the data collected and traced on websites, based on the output of the Tracker Tracker tool, and the second seeking to detect fingerprints of the cloud within these pages. Both subgroups started by compiling an overview and glossary of cloud services (via CDNs) and trackers. In subsequent steps, we combined this information to trace both data inflows and outflows.
A further sub-project focused on a specific case study on the cloud relations of Amazon and its clients.
Subprojects:
I. Content IN: CDN tracing
Fingerprinting the Alexa top 100 websites
0. Fetch the Alexa top 100 websites.
1. Enter each URL in CDN Finder:
http://www.cdnplanet.com/tools/cdnfinder
2. Copy the ‘hostname’ and ‘cdn’ values from the results and paste them into the spreadsheet:
https://docs.google.com/spreadsheet/ccc?key=0Am1mvbZjJLMedDFRalJLTHRiaWJ5R2VIVVUyYTVlMFE#gid=7
Content types
Aim: Coding which content type is hosted on which CDN, per website in the Alexa top 100 (e.g. images, video, css/js)
Steps:
0. Open the spreadsheet:
https://docs.google.com/spreadsheet/ccc?key=0Am1mvbZjJLMedDFRalJLTHRiaWJ5R2VIVVUyYTVlMFE#gid=7
For the 10 annotated websites:
1. Per website, look up which content is hosted on which CDN and add this in the column ‘type of content in’, i.e. check which content is on the site: images, video, css/js, third-party scripts (widgets).
Open the website and ‘view source’. Use the browser's find function (Cmd+F on a Mac) to search the source code for each entry in the column ‘hostname’, and determine the type of content from the extension of the URL (for example .jpg, .png, .css or .js).
Note: Evaluate whether the page is a front page or login page, as is common for social networking sites such as Facebook or Twitter, or whether we should go “deeper” into the site to show the platform with all its content. If the site does not necessarily require a login (such as Twitter, where you can view other profiles without a Twitter login), go to a public page, copy the URL and rerun it in the CDN Finder. Facebook also has some public pages and groups, for example:
https://www.facebook.com/barbarian.group
If the site does require a login (such as Facebook), follow the steps below.
2. ‘Walled gardens’, or sites behind a login, require additional steps.
2.1. Open the website (or a page with a lot of diverse content) and view source.
2.2. Copy/paste source code into harvester:
https://tools.digitalmethods.net/beta/harvestUrls/
2.3. Evaluate which urls are content links to the cloud and run these through DNS resolver:
http://www.mxtoolbox.com/SuperTool.aspx
2.4. Update spreadsheet
3. For each website, if possible, locate the CDN content on the page and annotate it on a screenshot (see Annotated homepages below for the steps).
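The manual coding in steps 1 and 2 can be approximated in a few lines. The following is a minimal sketch, not the procedure the team used: it harvests URLs from page source, keeps those served from a given hostname and labels them by file extension. The extension mapping, the function names and the example hostnames are assumptions made for illustration.

```python
import os
import re
from urllib.parse import urlparse

# Extension-to-label mapping. The labels follow the content types named in
# the steps above; the exact list of extensions is an assumption.
CONTENT_TYPES = {
    ".jpg": "images", ".jpeg": "images", ".png": "images", ".gif": "images",
    ".css": "css/js", ".js": "css/js",
    ".mp4": "video", ".flv": "video", ".webm": "video",
}

# Matches absolute (http://, https://) and protocol-relative (//) URLs.
URL_PATTERN = re.compile(r"""(?:https?:)?//[^\s"'<>()]+""")


def harvest_urls(source):
    """Return every absolute or protocol-relative URL found in page source."""
    return URL_PATTERN.findall(source)


def content_types_for_host(source, hostname):
    """Return the sorted content types served from `hostname` in `source`."""
    found = set()
    for url in harvest_urls(source):
        parsed = urlparse(url)
        if parsed.hostname != hostname:
            continue
        # Read the extension from the path only, ignoring any query string.
        extension = os.path.splitext(parsed.path)[1].lower()
        found.add(CONTENT_TYPES.get(extension, "other"))
    return sorted(found)
```

Reading the extension is only a heuristic: URLs without an extension end up as "other" and still need to be inspected by hand.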
B. Annotated homepages
Question: what happens when you load a site in the browser?
1. Install Skitch (http://skitch.com/) to take and annotate a screenshot.
2. Take the lists of CDNs and trackers, and load the website in a browser that has Ghostery and Firebug installed.
3. Annotate the screenshot.
4. Collect all information about the trackers and CDNs (and widgets) active in the page from the glossary.
II. Glossary of Trackers
1. Extract the top 100 Alexa websites and detect the presence of trackers by using the
Tracker Tracker tool.
2. Create a collaborative spreadsheet and manually code the following categories:
- Data out (yes/no) – defined as the data flow that goes from the user to the tracker;
- Data in (yes/no) – defined as the data that flows from the tracker and is displayed on the website (example: ad banner);
- Category (ad/social) – the category of the Data in, if there is any.
3. Use the information provided by Ghostery
http://www.ghostery.com/apps/TRACKERNAME to identify the following categories:
- Tracker Name
- Type (as defined by Ghostery)
- Description of the tracker (in the company's own words)
- Data collected by the tracker (as defined by Ghostery)
The data categories were listed according to the information describing each tracker in Ghostery. The list was extended dynamically in the process of researching each tracking device:
- Expiry day (period of expiration of the tracking device)
- Sharing data with third parties (yes/no)
- Ad Views
- Page Views
- Browser Information
- Analytics
- Date/Time
- Demographic data
- Serving Domains
- IP Address (EU PII)
- Internet service provider
- Interaction data
- Cookie data
- Hardware/Software Type
- Search history clickstream data
- Device ID
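One way to keep the glossary columns consistent across coders is to treat each tracker as a record with fixed fields. The sketch below is an assumption about how such a record could look, not the structure of the actual spreadsheet; the class name, field names and column order are hypothetical, and the data categories are copied from the list above.

```python
from dataclasses import dataclass, field

# Data categories as listed in the glossary above.
DATA_CATEGORIES = [
    "Ad Views", "Page Views", "Browser Information", "Analytics", "Date/Time",
    "Demographic data", "Serving Domains", "IP Address (EU PII)",
    "Internet service provider", "Interaction data", "Cookie data",
    "Hardware/Software Type", "Search history clickstream data", "Device ID",
]


def yes_no(flag):
    """Render a boolean the way the spreadsheet codes it."""
    return "yes" if flag else "no"


@dataclass
class TrackerEntry:
    name: str
    tracker_type: str      # type as defined by Ghostery
    description: str       # description in the company's own words
    data_out: bool         # data flows from the user to the tracker
    data_in: bool          # tracker content is displayed on the website
    category: str = ""     # "ad" or "social", only if data_in is True
    expiry: str = ""       # period of expiration of the tracking device
    shares_with_third_parties: bool = False
    data_collected: set = field(default_factory=set)

    def to_row(self):
        """Return one spreadsheet row: 8 fixed columns, then one per category."""
        fixed = [
            self.name, self.tracker_type, self.description,
            yes_no(self.data_out), yes_no(self.data_in),
            self.category, self.expiry,
            yes_no(self.shares_with_third_parties),
        ]
        flags = [yes_no(c in self.data_collected) for c in DATA_CATEGORIES]
        return fixed + flags
```

Because the category list grew during the research, extending `DATA_CATEGORIES` adds a column to every row without touching the existing entries.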
Glossary of CDNs
Amazon case study
Aim: to what extent can we trace Amazon cloud services from the source code of Amazon's customers?
Steps:
1. Run the customers of Amazon through the CDN Finder:
http://www.cdnplanet.com/tools/cdnfinder/
Note: evaluate whether the homepage is suitable, or whether a deeper page should be used.
2. Take all hostnames (including those that do not resolve to a CDN, because we are interested in all aspects of the cloud, not only CDNs) and run them through the DNS resolver. When an IP address is returned, click on the IP address to rerun the lookup:
http://www.mxtoolbox.com/SuperTool.aspx
3. When an Amazon result is returned, copy it into the spreadsheet (note that not all Amazon results will have 'amazon' in the name; they may instead include 'cloudfront', 'aws', etc.):
https://docs.google.com/spreadsheet/ccc?key=0AhVYuU4ube_EdE05SXFNY1JJNWQ0aW5CUGdYSjZvd3c&pli=1#gid=0
(4. Optional double check, only when you want to verify the full URL of the 'hostname' you are checking: 'view source' of the website and search (Cmd+F) for the 'hostname'.)
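The filtering in step 3 can be sketched as a simple pattern match over the resolver output. This is an illustration under stated assumptions: the input is a hand-collected mapping from hostname to the names and addresses the DNS resolver returned, the function names are hypothetical, and the pattern contains only the three substrings named in step 3.

```python
import re

# Substrings named in step 3. Illustrative, not an exhaustive list of
# Amazon-related names.
AMAZON_PATTERN = re.compile(r"amazon|cloudfront|aws", re.IGNORECASE)


def is_amazon_result(resolved_name):
    """Return True if a resolver result looks like an Amazon service."""
    return AMAZON_PATTERN.search(resolved_name) is not None


def amazon_results(lookups):
    """Keep only the hostnames with at least one Amazon-looking result.

    `lookups` maps each hostname to the list of strings the resolver
    returned for it (CNAME targets, reverse-lookup names, IP addresses).
    """
    matches = {}
    for hostname, results in lookups.items():
        hits = [result for result in results if is_amazon_result(result)]
        if hits:
            matches[hostname] = hits
    return matches
```

A bare substring match can produce false positives (any name that happens to contain "aws", for instance), so the hits are candidates for the spreadsheet rather than confirmed Amazon services.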
To do: How does CDN finder determine which hostnames it returns per website? (look in code of the tool)
For the CDN tracking we used a tool called CDN Finder by CDN Planet:
http://www.cdnplanet.com/tools/cdnfinder, an open source tool:
https://github.com/sajal/cdnfinder.js.
What it does: "CDN Finder fetched the HTML and parsed it with very simple logic to get to a list of hostnames. For each found hostname CDN Finder would then do a DNS lookup, look at the last CNAME only and regex match that hostname to our list of CDN hostnames."
http://www.cdnplanet.com/blog/better-cdn-finder/
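The logic described in the quote (parse the HTML for hostnames, resolve each one, match the result against a list of CDN hostnames) can be sketched as follows. This is not the code of CDN Finder itself: the three patterns are a small illustrative sample, the class and function names are hypothetical, and the default resolver uses the canonical name reported by the system resolver as an approximation of "the last CNAME".

```python
import re
import socket
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative sample only; a real CDN list is much longer.
CDN_PATTERNS = [
    (re.compile(r"\.cloudfront\.net$"), "Amazon CloudFront"),
    (re.compile(r"\.akamai(edge)?\.net$"), "Akamai"),
    (re.compile(r"\.fastly\.net$"), "Fastly"),
]


class HostnameCollector(HTMLParser):
    """Collect the hostnames of all src and href attributes in a page."""

    def __init__(self):
        super().__init__()
        self.hostnames = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).hostname
                if host:  # relative links have no hostname
                    self.hostnames.add(host)


def canonical_name(hostname):
    """Default resolver: canonical name from the system resolver (IPv4)."""
    return socket.gethostbyname_ex(hostname)[0]


def find_cdns(html, resolve=canonical_name):
    """Map each hostname found in `html` to a CDN label, or None."""
    collector = HostnameCollector()
    collector.feed(html)
    results = {}
    for host in sorted(collector.hostnames):
        try:
            target = resolve(host).rstrip(".").lower()
        except OSError:
            target = host  # lookup failed, fall back to the hostname itself
        results[host] = next(
            (label for pattern, label in CDN_PATTERNS if pattern.search(target)),
            None,
        )
    return results
```

The `resolve` parameter makes it possible to plug in results collected elsewhere (for example from the DNS resolver used in the Amazon case study) instead of doing live lookups.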
Preliminary Findings
things to further explain (in text):
- wordpress and proxy links (Anne)
- resolving CDN fingerprints (dealing with c-names) (Esther)
- hyper-dynamic content (Sabine)
- distinguishing between CDNs/trackers/widgets - on data in/out level, and on company and activity level (Carolin)
1) Tracking ecologies for URL source sets
In previous projects, the Tracker Tracker results have been visualised as Gephi network maps. To develop an alternative way of making sense of the tracking ecologies of URL sets, we have developed an aggregated visualisation that shows the presence of trackers organised by category and highlights the key data elements that are being traced.
For this purpose, we selected 5 categories of data collected by the trackers, based on their sensitive character and their distribution across different tracking categories: data shared with 3rd parties, demographic data, IP address, interaction data and search history.
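The aggregation behind such a visualisation amounts to counting, per tracker category, how many trackers collect each of the five selected data elements. The sketch below assumes the glossary has been reduced to pairs of (tracker category, set of data elements); the function name and input shape are hypothetical.

```python
from collections import Counter

# The five data elements selected for the aggregated visualisation.
KEY_ELEMENTS = [
    "data shared with 3rd parties",
    "demographic data",
    "IP address",
    "interaction data",
    "search history",
]


def count_key_elements(trackers):
    """Count trackers per (tracker category, key data element).

    `trackers` is an iterable of (category, elements) pairs, where
    `elements` is the set of data elements one tracker collects.
    """
    counts = Counter()
    for category, elements in trackers:
        for element in KEY_ELEMENTS:
            if element in elements:
                counts[(category, element)] += 1
    return counts
```

Each (category, element) count can then be mapped to one cell or mark in the composite image.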
2) Visual glossary
In addition to the analytical visualisation, an annotated tracker and cloud glossary is being developed that can be used as a map of key trackers, but also as a legend for other visualisations. The visual glossary highlights the dataflows enabled (in/out, key categories) and features a description of the associated service.
3) 10 case studies: Data in/data out
Based on the tracker/cloud glossary, the top 5 websites of each are identified, and their trackers and data flows in/out are visualised to map out their connection to the cloud.
4) Annotated homepages
5) Amazon's pathway to the cloud
Guide
The guide can be found here:
TrackerGuide_reduced.pdf
(inspiration: Peterson Guides,
http://www.amazon.com/Field-Guide-Eastern-Central-America/dp/0395740460)
Table of contents
- Introduction
- How to:
find traces of trackers
resolve fingerprints to CDNs
- Equipment:
tools and plugins
- Trackers and CDN Glossary
- Composite Image: Tracking ecologies for URL source sets
- what happens when you open a webpage?
Screenshots and matching visualizations