web privacy beyond extensions: new browsers are pursuing … · 2020. 7. 7. · chrome edge / ie...

Web Privacy Beyond Extensions: New Browsers Are Pursuing

Deep Privacy ProtectionsPeter Snyder <pes@brave.com>

Privacy Researcher at Brave Software

In a slide…

• Web privacy is a mess.

• Privacy activists and researchers are limited by the complexity of modern browsers.

• New browser vendors are eager to work with activists to deploy their work.

Outline

1. BackgroundExtension focus in practical privacy tools

2. PresentPrivacy improvements require deep browser modifications

3. Next StepsCall to action, how to keep improving

Outline

uBlock PrivacyBadger

DisconnectAdBlock Plus

Browsers are ComplicatedPr

Browser maintenance experience

Chrome Edge / IE

Firefox Safari

Extensions as a Compromise

Chrome Edge / IE

Firefox Safari

Extensions Runtimemodifications

Privacy andBrowser Extensions

• Successes!uBlock Origin, HTTPS Everywhere, Ghostery,Disconnect, Privacy Badger, EasyList / EasyPrivacy, etc…

• AppealingEasy(er) to build, easy to share

• PopularHundreds of thousands of extensions, Millions of users

Browser Extension Limitations

• Limited CapabilitiesNetworking, request modification, rendering, layout, image processing, JS engine, etc…

• Security and PrivacyPossibly giving capabilities to malicious parties

• PerformanceLimited to JS, secondary access

Extensions vs RuntimePr

Chrome Edge / IE

Firefox Safari

Under Explored SpacePr

Chrome Edge / IE

Firefox Safari

Outline

Under Explored SpacePr

Chrome Edge / IE

Firefox Safari

Runtime Privacy Improvements

• AdGraphClient-side, ML, graph-based tracking detection

• SpeedReaderPrivacy enhancing content extraction(i.e. “aggressive reader mode”)

A�G��: A Machine Learning Approach toAutomatic and E�ective Adblocking

Umar IqbalThe University of Iowa

Zubair Sha�qThe University of Iowa

Peter SnyderBrave Software

Shitong ZhuUniversity of California, Riverside

Zhiyun QianUniversity of California, Riverside

Benjamin LivshitsBrave Software

Imperial College London

ABSTRACTFilter lists are widely deployed by adblockers to block ads and otherforms of undesirable content in web browsers. However, these �lterlists are manually curated based on informal crowdsourced feed-back, which brings with it a signi�cant number of maintenancechallenges. To address these challenges, we propose a machinelearning approach for automatic and e�ective adblocking calledA�G��. Our approach relies on information obtained from mul-tiple layers of the web stack (HTML, HTTP, and JavaScript) to traina machine learning classi�er to block ads and trackers. Our evalua-tion on Alexa top-10K websites shows that A�G�� automaticallyand e�ectively blocks ads and trackers with 97.7% accuracy. Ourmanual analysis shows that A�G�� has better recall than �lterlists, it blocks 16%more ads and trackers with 65% accuracy. We alsoshow that A�G�� is fairly robust against adversarial obfuscationby publishers and advertisers that bypass �lter lists.

1 INTRODUCTIONBackground. Adblocking deployment has been steadily increas-ing over the last several years. According to PageFair, adblockersare used on more than 600 million devices globally as of December2016 [15, 24, 26]. There are several reasons that have led to manyusers installing adblockers like Adblock Plus and uBlock Origin.First, many websites show �ashy and intrusive ads that degradeuser experience [25, 27]. Second, online advertising has been repeat-edly abused by hackers to serve malware (so-called malvertising)[44, 48, 54] and more recently cryptojacking, where attackers hijackcomputers to mine cryptocurrencies [37]. Third, online behavioraladvertising has incentivized a nefarious ecosystem of online track-ers and data brokers that infringes on user privacy [28, 36]. Mostadblockers not only block ads, but also block associated malwareand trackers. Thus, in addition to instantly improving user expe-rience, adblocking is a valuable security and privacy enhancingtechnology that protects users against these threats.Motivation. Perhaps unsurprisingly, online publishers and ad-vertisers have started to retaliate against adblockers. First, manypublishers deploy anti-adblockers, which detect adblockers andforce users to disable their adblockers. This arms race between ad-blockers and anti-adblockers has been extensively studied [45, 46].Researchers have proposed approaches to detect anti-adblocking

, ,2018. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00https://doi.org/10.1145/nnnnnnn.nnnnnnn

scripts [40, 49, 60], which can then be blocked by adblockers. Second,some advertisers have started to manipulate the delivery of theirads to bypass �ler lists used by adblockers. For example, Facebookrecently obfuscated signatures of ad elements that were used by�lter lists to block ads. Adblockers, in response, quickly identi�ednew signatures to block Facebook ads. This prompted a few backand forth actions, with Facebook changing their website to removead signatures, and adblockers responding with new signatures [52].Limitations of Filter Lists. While adblockers are able to blockFacebook ads (for now), Facebook’s whack-a-mole strategy pointsto two fundamental limitations of adblockers. First, adblockers usemanually curated �lter lists to block ads and trackers based oninformally crowdsourced feedback from the adblocking community.This manual process of �lter list maintenance is inherently slowand error-prone. When new websites are created, or existing web-sites make changes, it takes adblocking community some time tocatch up by updating the �lter lists [1]. This is similar to other areasof system security, such as updating anti-virus signatures [31, 42].Second, rules de�ned in these �lter lists are fairly simple HTTP andHTML signatures that are easy to defeat for �nancially motivatedpublishers and advertisers. Researchers have shown that random-ization techniques, where publishers constantly mutate their webcontent (e.g., URLs, element ID, style class name), can easily defeatsignatures used by adblockers [51]. Thus, publishers can easily, andcontinuously, engage in the back and forth with adblockers andbypass their �ltering rules. It would be prohibitively challengingfor the adblocking community to keep up in such an adversarialenvironment at scale. In fact, a few third-party ad “unblocking”services claim to serve unblockable ads using the aforementionedobfuscation techniques [14, 32, 58].Proposed Approach. In this paper, we aim to address these chal-lenges by developing an automatic and e�ective adblocking ap-proach. We propose A�G��, which alleviates the need for man-ual �lter list curation by using machine learning to automaticallyidentify e�ective (both accurate and robust) patterns in the pageload process to block ads and trackers. Our key idea is to constructa multi-layer graph representation that incorporates �ne-grainedHTTP, HTML, and JavaScript information during page load. Com-bining information across the di�erent layers of the web stackallows us to capture tell-tale signs of ads and trackers. We extracta variety of structural (degree and connectivity) and content (do-main and keyword) features from the constructed graph to train asupervised machine learning model to detect ads and trackers.

Current Tracking Blocking

• Extremely useful!

• Uses well known, targeted approaches

• Vulnerable to practical countermeasures

• We see increasing evasion

• Two typical approaches…

URL Based Blocking

• Representative ExtensionAdBlock Plus + EasyPrivacy

• Approach1. Identify URLs that trackers come from2. Build rules to instruct the browser to ignore these URLs

• Example1. Notice: https://example.org/tracking.js2. Block: */tracking.js

URL Based Evasions

• Rotate Domains- Domain generation algorithms (DGA)- Host on CDNs

• Move to First PartySites host local copies of tracking code

• Compose with “benign” code- Concatenate into one single file- Magnification / packing / browserify / require.js / etc.

Behavior Based Blocking• Representative Extension

PrivacyBadger

• Approach1. Look for code that does suspicious things2. Block or restrict similar code

• Example1. Notice script from tracker.com uses Canvas and WebGL oddly2. Prevent all code from tracker.com from accessing any privacy sensitive functionality

Behavior Based Evasions• Rotate Domains

- Domain generation algorithms (DGA)- Host on CDNs

• Split Suspicious Activity Across PartiesAvoid detection thresholds by distributing activity

• Evade Attribution- eval- new Function()- Promise.then()- etc…

AdGraph Alternative

• Blocking tracking resourcesJS, tracking pixels, iFrames…

• Deep browser instrumentation- Network: requests made during page execution- Layout: page structure and modifications- JavaScript: attribute above to responsible code

• Block based on contextML classification based on above described context

Common JS Example

1. Script element with inline code, that…

2. Appends a script element after itself, with remote script, that…

3. Reads document cookies (and other FP elements), creates an adjacent image, and then…

4. Fetches images from unknown URLs

AdGraph Example

AdGraph ExampleScriptElm

ScriptElm

Append Element: <script>

Attribute Modification: src=<url>

ScriptElm

ScriptResource

Fetch:<url>

ScriptElm

ScriptResource

Fetch:<url>

ImageElm

Append Element: <img>

ScriptElm

ScriptResource

Fetch:<url>

Attribute Modification: src=<url> Image

ScriptElm

ScriptResource

Fetch:<url>

ResourceFetch:<url>

ImageElm

ScriptElm

ScriptResource

Fetch:<url>

ResourceFetch:<url>

ImageElm

ScriptElm

ClassificationInformation

ScriptResource

Fetch:<url>

ResourceFetch:<url>

ImageElm

ScriptElm

ClassificationInformation

ScriptResource

Fetch:<url>

ResourceFetch:<url>

ImageElm

ScriptElm

AdGraph Results• High accuracy

> 95% compared to current, human approaches

• Strong privacy protectionsIdentifies tracking resources missed by current tools

• High performanceAs fast or faster than current approaches (and default Chromium!)

• Not limited to lists Can adapt as trackers adapt

Not Possible with Extensions

• Information BreathNeeded information not available to browser extensions

• Information DepthJS information not available to other browsers!

• PerformanceBlocking ML classifier benefits from C++ implementation

SpeedReader: Reader Mode Made Fast and PrivateMohammad GhasemisharifUniversity of Illinois at Chicago

mghas2@uic.edu

Peter SnyderBrave Softwarepes@brave.com

Andrius AucinasBrave Software

aaucinas@brave.com

Benjamin LivshitsBrave Software / Imperial College London

ben@brave.com

ABSTRACTMost popular web browsers include “reader modes” that improvethe user experience by removing un-useful page elements. Readermodes reformat the page to hide elements that are not related to thepage’s main content. Such page elements include site navigation,advertising related videos and images, and most JavaScript. Theintended end result is that users can enjoy the content they areinterested in, without distraction.

In this work, we consider whether the “reader mode” can bewidened to also provide performance and privacy improvements.Instead of its use as a post-render feature to clean up the clutter on apage we propose SpeedReader as an alternative multistep pipelinethat is part of the rendering pipeline. Once the tool decides duringthe initial phase of a page load that a page is suitable for readermode use, it directly applies document tree translation before thepage is rendered.

Based on our measurements, we believe that SpeedReader canbe continuously enabled in order to drastically improve end-userexperience, especially on slower mobile connections. Combinedwith our approach to predicting which pages should be rendered inreader mode with 91% accuracy, it achieves drastic speedups andbandwidth reductions of up to 27⇥ and 84⇥ respectively on average.We further �nd that our novel “reader mode” approach bringswith it signi�cant privacy improvements to users. Our approache�ectively removes all commonly recognized trackers, issuing 115fewer requests to third parties, and interacts with 64 fewer trackerson average, on transformed pages.

1 INTRODUCTION“Web bloat” is a colloquial term that describes the trend in websitesto accumulate size and visual complexity over time. The phenom-ena has been measured in many dimensions, including total pagesize [7], page load time [5, 43, 44], memory needed [29], number ofnetwork requests [16, 27], amount of scripts executed [25, 33, 36, 38]and third parties contacted [24, 25, 27]. This work suggests thatgrowth in page size and complexity is outpacing improvements indevice hardware. All of this has a predictably negative impact onuser experience.

Web users and browser vendors have reacted to this “bloat” in avariety of ways, all partially helpful, but with signi�cant downsides.

Ad and tracking blockers are a popular and useful tool for reduc-ing the size complexity of sites. Prior work has shown that thesetools can be e�ective in reducing privacy leaks [30], network use,and extend device memory life. Such tools are inherently limitedin the scope of improvements they can achieve. While these lists

are large [41], they are small as a proportion of all URLs on theweb. Similarly, while these lists are updated often, they are updatedslowly compared to URL updates on the web.

Similarly, “reader mode” tools, provided in many popularbrowsers and browser extensions, are an e�ort to reduce the grow-ing visual complexity of web sites. Such tools attempt to extractthe subset of page content useful to users, and remove advertising,animations, boiler plate code, and other non-core content. Current“reader modes” do not provide the user with resource savings sincethe referenced resources have already been fetched and rendered.The growth and popularity of such tools suggest they are useful tobrowser users, looking to address the problem of page clutter andvisual “bloat”.

In this work, we propose a novel strategy called SpeedReaderfor dealing with resource and bloat on websites. Our techniqueprovides a user experience similar to existing “reader mode” tools,but with network, performance, and privacy improvements thatexceed existing ad and tracking blocking tools, on a signi�cantportion of websites. Signi�cantly, SpeedReader di�ers from exist-ing deployed reader mode tools by operating before page rendering,which allows it to determine which resources are needed for thepage’s core content before fetching.How we achieve speedups. SpeedReader achieves its perfor-mance improvements through a two-step pipeline:

(1) SpeedReader uses a classi�er to determine whether thereis a readable subset of the initial, fetched page HTML. Thisclassi�er is trained on a labeled corpus of 2,833 websites, andis described in detail in Section 3, and determines whether apage can be display in reader mode with 91% accuracy.

(2) If the classi�er has determined that the page is readable,SpeedReader extracts the readable subset of document be-fore rendering, using a variety of heuristics developed in priorresearch [23] and browser vendors [9, 22], and passes thesimpli�ed, reader mode document to the browser’s renderlayer. This tree translation step is described in Section 4.

Deployment. Combined with a highly accurate classi�er of “read-able” pages, the drastic improvements in performance, reduction inbandwidth use and elimination of trackers in reader mode makethe approach practical for continuous use. We therefore proposeSpeedReader as a sticky feature that a user can toggle to be alwayson. This approximates the experience of using an e-book reader,but with strengths of content availability on the web. It is also asuitable strategy for content prerendering or prefetching that couldbe implemented by web browser vendors, automatically deliveringgraceful performance degradation in poor connectivity areas or on

SpeedReader• Prevent tracking resources

JS, tracking pixels, iFrames…

• Most of a page isn’t immediately useful- Boilerplate: navigation, you might like…- Third party ads: often undesirable, offensive, or both- JavaScript: animations and distractions

• Extract good content, don’t block bad contentFocus on identifying the valuable parts of the page, not the harmful ones

Existing Reader Modes

Server Browser Render PageExtract

Main Text and Image

Present Reader Mode Version

AdsImages Videos

Main Image

SpeedReader

Server BrowserDetermine if

Initial HTML is Readable Extract

Main Text and Image from

Present Reader Mode Version

Main Image

If Yes…

If No…

Display as normal

SpeedReader Results

Comparison of SpeedReader to standard browsing on a large sample of websites.3rd Party

(Avg)3rd Party(Median)

Scripts(Avg)

Scripts(Median)

Ads and Trackers

Ads and Trackers(Median)

Default 117 63 83 51 63 24

SpeedReader 1 1 0 0 0 0

Not Possible with Extensions

• Access RestrictionsMost browser’s don’t allow extensions to modify pages

• PerformanceML classifier benefits from C++ implementation

Outline

Unclaimed SpacePr

Chrome Edge / IE

Firefox Safari

Future privacy protections

Better Privacy is Possible• New Browser Vendors

The “big four” aren’t the only game in town anymore

• Many New Browsers are Privacy FocusedPrivacy as top-level goal, willing to be aggressive

• Eager to CollaborateThe new browsers are willing and interested to develop and maintain ambitious privacy protecting browser changes.

• Reach Out!Peter Snyder, Privacy Researcherpes@brave.com – @pes10k

web privacy beyond extensions: new browsers are pursuing … · 2020. 7. 7. · chrome edge / ie...

Documents

uncovering the secrets of malvertising · uncovering the...

plugins( extensions( add1ons) · 2...

group session: malvertising : how to detect and deal with...

amazing ppc tactics ppc tactics 062811.pdf · types of ad...

openstack extensions

pad: programming third-party web advertisement...

ad extensions 2014 - which extensions should you be using

malvertising: una minaccia in espansione

exposing hidden threats - comlinx.com.au€¦ · 10 |...

understanding malvertising through ad-injecting browser...

terrace houses: 3m extensions versus 6m extensions -...

revisiting mobile advertising threats with...

advertsing integrity: malvertising,...

hair extensions - before & after -...

fakeav: journey from trojan to a persistent threat ·...

the hidden enemy: malvertising and ransomware€¦ ·...

extensions substation extensions for high-voltage ... ·...

panel 4 -- how to avoid malvertising leading to phishing ......

integrity of the ad supply chain anti-malvertising best...

implementing php 5 oop extensions -...