The Disinformation Laundromat
Alicia Bargar, Peter Benzoni, Gaurika K. Chaturvedi, Helena Schwertheim, Bret Schafer
A consistent challenge in tackling the spread of disinformation, state-sponsored propaganda, and extremist content is our inability to rapidly detect and expose the ways in which bad actors launder harmful content into the online information ecosystem through proxies, cut-outs, aggregators, and mirror sites. This is particularly salient in the context of the war in Ukraine, where Kremlin-affiliated actors have used mirror sites to circumvent EU and tech company bans intended to limit the spread of Russian propaganda. This has allowed disinformation about the war to continue to proliferate across the internet, reaching European audiences through links and websites that obfuscate their affiliation with the Russian government. Part of the problem in detecting this activity is that the process of identifying networked websites is largely manual, resource intensive, and limited in scale. Although there are existing OSINT and media tracking tools that can provide useful pieces of the intelligence puzzle, there is no single analytic tool that pulls together the various threads analysts need to investigate linkages between suspicious websites.
The purpose of the Disinformation Laundromat is therefore to provide the OSINT community with a tool that can more effectively identify connections, both at the narrative and the technical level, between seemingly unrelated websites. The goal is to develop new methods, or document existing ones, to identify networked websites, either by discovering websites that publish the same or similar content or that share technical indicators of significance. In the past week, collaborators reviewed existing tools to examine what data can be procured about a webpage. We further compared the webpages of news websites identified as mirrors with those of legitimate news sources, to see what data can be used to point to common news production.
We found that similarities in metadata, such as shared IP addresses, registration locations, AdSense IDs, and verification IDs, can in some cases be immediately useful in discovering common ownership. In other cases, it may be productive to do a content-level analysis. This project was developed by a consortium that has spent years working on information manipulation. The particular need for this tool is thus based on prior research that has exposed, among other malign behavior, the Kremlin’s use of mirror sites to push banned state media content to EU audiences. But the tool itself will be designed to be threat-actor agnostic; we thus welcome those interested in other use cases, including exposing “pink slime” faux local news sites.
2. Research Questions
What commonalities do mirrored sites have that can be leveraged to identify them as mirrors? Are there shared attributes on a content level that point towards the same production tactics? Are there technical indicators like metadata and verification IDs that connect disparate sites to the same owner?
3. Methodology and initial datasets
Using data collected from the Alliance for Securing Democracy’s Hamilton 2.0 dashboard, which collects, among other data points, all published articles from Russian state media websites, we worked to establish methods for identifying websites publishing identical or similar content. But this project is also about testing tools and methods to mine technical indicators of significance on websites in order to measure the similarity between sites, and thus the probability that there is a relationship between seemingly unrelated sites.
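As a minimal sketch of the content-matching step (not the project's actual pipeline), near-identical articles can be flagged by comparing sets of word n-grams ("shingles") and computing their Jaccard overlap:

```python
from typing import Set

def shingles(text: str, n: int = 3) -> Set[str]:
    """Lowercase word n-grams of an article body."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def content_similarity(a: str, b: str) -> float:
    """Jaccard similarity of two articles' shingle sets; 1.0 means identical wording."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

A mirror that republishes an article verbatim scores 1.0, while lightly edited copies still score high, which is why shingle overlap is a common first filter before heavier analysis.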
The first step was a tool review. We searched, via browser queries, for existing tools that retrieve various types of metadata about a webpage. We tested the following tools: WIG, Photon, CTFR, TheHarvester, DNSlytics, URLscan, WPScan, IntelOwl, Sn0int, and SimilarWeb. We found a host of potentially useful metadata, but we also ran into some roadblocks. Some tools are oriented towards brand integrity protection and have subscription models priced for company budgets rather than individual users; these were completely blocked without a subscription. Others have subscription models but allow a few requests without payment. Two of these, BuiltWith and URLScan, were further incorporated into our tool. BuiltWith scans a webpage and reports back all the external technology it uses, including ad delivery and analytics plugins from Google and payment gateways embedded on the page. URLScan provides information about the variable names used in a webpage, its certificates, and the technologies it uses. These can be matched against other websites to see if they are set up with the same structure or use the same certificate for a technology.
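To illustrate how identity-linked IDs can be mined directly from a page, the sketch below pulls two such identifiers out of raw HTML with regular expressions. The patterns are illustrative assumptions, not the project's actual extraction code; real pages embed these IDs in scripts, meta tags, or attributes in varied formats:

```python
import re

# Illustrative patterns for two account-bound identifiers (assumed formats).
PATTERNS = {
    "google_analytics": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b|\bG-[A-Z0-9]{8,12}\b"),
    "google_adsense": re.compile(r"\bca-pub-\d{10,16}\b"),
}

def extract_ids(html: str) -> dict:
    """Return every analytics/AdSense ID found in raw page HTML."""
    return {name: sorted(set(pat.findall(html))) for name, pat in PATTERNS.items()}
```

Two sites returning the same extracted ID would then be candidates for a conclusive-tier match, since each ID maps to a single account.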
We classified the attributes we found into tiers based on how conclusively they indicate that a collection of sites shares the same owner. We further tested these indicators, as discussed below.
The first tier, “Conclusive”, is metadata that, if matched, demonstrates a high probability that a collection of sites is owned by the same entity. This includes information like shared analytics and search engine verification IDs; these are conclusive because each ID is associated with only one account. The second tier is “Associative” data: indicators that point towards a reasonable likelihood that a collection of sites is owned by the same entity. This information is most useful when sites exhibit highly similar patterns of sourcing and structuring their content. Using the same source for images, for instance, is not suspicious in a single case, but it can be when it forms part of a highly similar pattern of content production. These tend to be indicators linked to shared content delivery networks and meta tags in the HTML. The third tier, “Tertiary”, consists of indicators that could be circumstantial and should be substantiated with indicators of higher certainty. These include shared architecture such as plugins and CSS classes. Perceptual hashing (pHashing) is also used to determine whether images are similar, because images are often slightly altered to evade content detection. Here is the list of indicators as of the final day of the Winter School:
Tier 1: Conclusive
These indicators establish, with a high level of probability, that a collection of sites is owned by the same entity.
- Shared domain name
- Google Adsense IDs
- Google Analytics IDs
- SSO and search engine verification IDs
- Crypto wallet ID
- Multi-domain certificate
- Shared social media sites in meta
- (When not associated with a privacy guard) Matching whois information
- (When not associated with a privacy guard) Shared IP address
- Shared Domain name but different TLD
Tier 2: Associative
These indicators point towards a reasonable likelihood that a collection of sites is owned by the same entity.
- Shared Content Delivery Network (CDN)
- Shared subnet, e.g. 192.0.2.14 and 192.0.2.27
- Any matching meta tags
- Highly similar DOM tree
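The Tier 2 shared-subnet indicator can be sketched with Python's standard ipaddress module; the /24 prefix length here is an assumption, as the report does not specify the subnet size used:

```python
import ipaddress

def same_subnet(ip_a: str, ip_b: str, prefix: int = 24) -> bool:
    """Associative-tier check: do two hosting IPs fall in the same network?"""
    net = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in net
```

A shared subnet is only associative because co-location is common on shared hosting, which is why it sits below exact IP matches in the tiering.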
Tier 3: Tertiary
These indicators can be circumstantial correlations and should be substantiated with indicators of higher certainty.
- Shared Architecture
- Content Management System (CMS)
- Any Universal Unique Identifier (UUID)
- Highly similar images (as determined by pHash difference)
- Many shared CSS classes
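For the image-similarity indicator, a real implementation would use a DCT-based perceptual hash (e.g. via an image-hashing library), but the simplified average-hash sketch below, operating on a tiny grayscale matrix, shows the core idea: slightly altered copies of an image produce hashes a small Hamming distance apart:

```python
def average_hash(pixels):
    """Simplified perceptual hash: bit i is 1 when pixel i is above the mean.
    `pixels` is a small (e.g. 8x8) grayscale matrix; real pHash uses a DCT."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    """Number of differing bits; small distances suggest near-duplicate images."""
    return sum(a != b for a, b in zip(h1, h2))
```

Comparing hashes rather than raw bytes is what lets the tool catch images that were recompressed or lightly edited to avoid content detection.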
The tool is still in development, and new changes will be reflected on its GitHub page. To test the tool, we ran it on an initial seed list. Subject matter experts provided a seed list of known Russian state media sites, Russian intelligence-linked sites, suspect sites, and a list of domains tweeted and posted by Russian media and diplomats. A portion of the mirror domains were sourced from ISD’s pre-existing list of the main domains of RT and Rossiya Segodnya (including the most well-known domains of Sputnik and other outlets), which ISD’s OSINT analysts used to find similar and connected domains. We added a minimum requirement of at least a 90% match between tier one indicators. Using this information, we assigned each comparison a weight and visualised it in Gephi, as discussed below.
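The weighting and filtering steps can be sketched as follows; the per-tier weight values are illustrative assumptions, not the project's actual parameters:

```python
# Illustrative weights per tier (assumptions; the report does not give values).
TIER_WEIGHTS = {0: 20.0, 1: 10.0, 2: 3.0, 3: 1.0}

def edge_weight(matches):
    """Sum tier-weighted scores for the indicator matches between two sites.
    `matches` is a list of (indicator_name, tier) tuples."""
    return sum(TIER_WEIGHTS.get(tier, 0.0) for _, tier in matches)

def tier1_match_ratio(matches, tier1_checked):
    """Fraction of compared Tier 1 indicators that matched (the 90% filter)."""
    tier1_hits = sum(1 for _, tier in matches if tier == 1)
    return tier1_hits / tier1_checked if tier1_checked else 0.0
```

Each site pair passing the Tier 1 ratio threshold becomes a weighted edge, which is what a layout tool like Gephi then renders as clusters.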
We initially ran a set of roughly 30 sites, in which 1,100 indicators were found to have some level of match between sites.
We then ran a wider set of more than 750 sites, rendered it in Gephi, and were able to identify a number of clusters (ForceAtlas2.pdf). While many of these clusters are yet to be analysed, three notable clusters have already been found: a group of sites directly associated with RT.com and a number of news sites mirroring their content; a cluster of military sites (mil.ru) largely isolated from other sites; and a cluster of Russian intelligence-linked sites, including a number of sites that were not previously known to be associated.
[Figure: rt.com associates and mirrors]
Among these 750 sites, there were 48,000 total matches. Below we share a breakdown of the matches by type and tier. Overall we saw 660 Tier 0 indicators, 7,179 Tier 1 indicators, 29,201 Tier 2 indicators, and 11,751 Tier 3 indicators. This is a reasonable gradation: lower tiers (e.g. 0 and 1) should be rarer matches, and thus better signifiers of possible relationships between two sites. We expect to see a greater number of Tier 3 indicators once more of them are implemented in our script.
[Table: breakdown of matches by indicator type and tier]
We found that some features require further iteration to be useful. For example, it can be highly salient when a rare CDN domain is shared between two sites; however, it can be nearly meaningless when a common CDN domain, such as that of a popular search engine or social media application, is shared. Introducing a popularity-based normalization approach should improve the usefulness of the shared-technologies feature as well. Other features require an analyst’s perspective to be notable: it is not surprising if a site publicly known to be Russian is hosted in Russia; it is much more surprising if a news site purportedly serving a small U.S. town is hosted there.
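One plausible form for the popularity-based normalization mentioned above is an inverse-document-frequency weighting, down-weighting CDNs that appear on many sites. This is a sketch of that idea, not the project's implemented approach:

```python
import math

def idf_weights(site_cdns):
    """Weight each CDN by how rarely it appears across the crawled sites.
    `site_cdns` maps site -> set of CDN domains observed on it."""
    n = len(site_cdns)
    counts = {}
    for cdns in site_cdns.values():
        for c in cdns:
            counts[c] = counts.get(c, 0) + 1
    # log(n / count): ubiquitous CDNs get ~0 weight, rare ones score high.
    return {c: math.log(n / k) for c, k in counts.items()}
```

Under this scheme, two sites sharing a CDN seen everywhere contributes almost nothing to their edge weight, while sharing a rarely seen CDN remains salient.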
Conversely, we were pleased to see greater-than-expected usefulness of features such as UUIDs, Google Analytics IDs, and global variables. UUIDs were a better matcher than we had expected, and Google Analytics IDs proved to be as strong an indicator as we hoped. Unique global variables used in a website’s code are not very useful on their own for producing salient matches; however, when an intersection-over-union approach is applied to compare two websites’ sets of global variables, it appears to be an effective way of identifying sites produced by the same underlying code. We look forward to including this approach alongside comparisons of websites’ CSS classes and DOM structures to more effectively identify networks of clone or mirror sites.
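The intersection-over-union comparison of global variables can be sketched as below. The list of excluded common globals is an illustrative assumption; in practice one would build it from observed frequencies:

```python
# Globals injected by browsers/frameworks on most pages; excluding them keeps
# the comparison focused on site-specific code (this list is illustrative).
COMMON_GLOBALS = {"window", "document", "jQuery", "ga", "dataLayer"}

def global_var_iou(vars_a, vars_b):
    """Intersection-over-union of two sites' distinctive JS global variables."""
    a = set(vars_a) - COMMON_GLOBALS
    b = set(vars_b) - COMMON_GLOBALS
    return len(a & b) / len(a | b) if a | b else 0.0
```

Sites generated from the same template or codebase tend to share the same distinctive globals, so a high IoU suggests a common underlying build.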
The tool currently does not account for shared authors and users across websites, privacy policies, external endpoint calls, sitemaps, or content. We believe that doing so will allow us to account for the disconnected nodes. Additionally, we have yet to test the tool against legitimate sites, which may prove some of our indicators redundant or in need of further fine-tuning. Likewise, the open-ledger nature of crypto wallets allows financial transactions to be tracked.
More broadly, these clusters will need to be analyzed by experts to get feedback on what works and what does not in our current methodology, which can then be iterated upon. We are very excited about the results and about bringing this project back to the Digital Methods Institute for additional analysis and improvement!
Tackling misinformation at the website level remains a challenge and an underrepresented space in research, given the popularity of work on social media misinformation. The diversity in how websites are set up poses an obstacle to performing large-scale data gathering and analysis. Indeed, one of the greatest issues we encountered when building functions to obtain and compare identity information was predicting where it would be stored in the page’s code. For some indicators, there are standard practices or infrastructural limitations dictating where this information must be present. For other indicators, however, such as cryptocurrency wallet identifiers, the information could potentially be present anywhere on the page.
However, building comparisons for what might seem like obvious metadata, like domain names, still goes a step beyond what an average visitor of the site might consider when trusting it as a news source. Although the project is in its infancy, it has already produced previously unknown connections. Further addition of more complex indicators and building standardised ways of procuring information across sites could build the Disinformation Laundromat into an incredibly useful tool for researchers and journalists focusing on misinformation.
Kata Balint, Jordan Wildon, Francesca Arcostanzo and Kevin D. Reyes. Effectiveness of the Sanctions on Russian State-Affiliated Media in the EU – An investigation into website traffic & possible circumvention methods, Institute for Strategic Dialogue, 6 Oct 2022
Hamilton 2.0 dashboard, Alliance for Securing Democracy