web privacy beyond extensions: new browsers are pursuing … · 2020. 7. 7. · chrome edge / ie...
Post on 05-Sep-2020
6 Views
Preview:
TRANSCRIPT
Web Privacy Beyond Extensions: New Browsers Are Pursuing
Deep Privacy ProtectionsPeter Snyder <pes@brave.com>
Privacy Researcher at Brave Software
In a slide…
• Web privacy is a mess.
• Privacy activists and researchers are limited by the complexity of modern browsers.
• New browser vendors are eager to work with activists to deploy their work.
Outline
1. BackgroundExtension focus in practical privacy tools
2. PresentPrivacy improvements require deep browser modifications
3. Next StepsCall to action, how to keep improving
Outline
1. BackgroundExtension focus in practical privacy tools
2. PresentPrivacy improvements require deep browser modifications
3. Next StepsCall to action, how to keep improving
uBlock PrivacyBadger
DisconnectAdBlock Plus
Browsers are ComplicatedPr
ivac
y co
ncer
n
Browser maintenance experience
Chrome Edge / IE
Firefox Safari
uBlock PrivacyBadger
DisconnectAdBlock Plus
Extensions as a Compromise
Priv
acy
conc
ern
Browser maintenance experience
Chrome Edge / IE
Firefox Safari
Extensions Runtimemodifications
Privacy andBrowser Extensions
• Successes!uBlock Origin, HTTPS Everywhere, Ghostery,Disconnect, Privacy Badger, EasyList / EasyPrivacy, etc…
• AppealingEasy(er) to build, easy to share
• PopularHundreds of thousands of extensions, Millions of users
👍
Browser Extension Limitations
• Limited CapabilitiesNetworking, request modification, rendering, layout, image processing, JS engine, etc…
• Security and PrivacyPossibly giving capabilities to malicious parties
• PerformanceLimited to JS, secondary access
👎
uBlock PrivacyBadger
DisconnectAdBlock Plus
Extensions vs RuntimePr
ivac
y co
ncer
n
Browser maintenance experience
Chrome Edge / IE
Firefox Safari
Extensions Runtimemodifications
uBlock PrivacyBadger
DisconnectAdBlock Plus
Under Explored SpacePr
ivac
y co
ncer
n
Browser maintenance experience
Chrome Edge / IE
Firefox Safari
Extensions Runtimemodifications
?
Outline
1. BackgroundExtension focus in practical privacy tools
2. PresentPrivacy improvements require deep browser modifications
3. Next StepsCall to action, how to keep improving
uBlock PrivacyBadger
DisconnectAdBlock Plus
Under Explored SpacePr
ivac
y co
ncer
n
Browser maintenance experience
Chrome Edge / IE
Firefox Safari
Extensions Runtimemodifications
Runtime Privacy Improvements
• AdGraphClient-side, ML, graph-based tracking detection
• SpeedReaderPrivacy enhancing content extraction(i.e. “aggressive reader mode”)
Runtime Privacy Improvements
• AdGraphClient-side, ML, graph-based tracking detection
• SpeedReaderPrivacy enhancing content extraction(i.e. “aggressive reader mode”)
A�G����: A Machine Learning Approach toAutomatic and E�ective Adblocking
Umar IqbalThe University of Iowa
Zubair Sha�qThe University of Iowa
Peter SnyderBrave Software
Shitong ZhuUniversity of California, Riverside
Zhiyun QianUniversity of California, Riverside
Benjamin LivshitsBrave Software
Imperial College London
ABSTRACTFilter lists are widely deployed by adblockers to block ads and otherforms of undesirable content in web browsers. However, these �lterlists are manually curated based on informal crowdsourced feed-back, which brings with it a signi�cant number of maintenancechallenges. To address these challenges, we propose a machinelearning approach for automatic and e�ective adblocking calledA�G����. Our approach relies on information obtained from mul-tiple layers of the web stack (HTML, HTTP, and JavaScript) to traina machine learning classi�er to block ads and trackers. Our evalua-tion on Alexa top-10K websites shows that A�G���� automaticallyand e�ectively blocks ads and trackers with 97.7% accuracy. Ourmanual analysis shows that A�G���� has better recall than �lterlists, it blocks 16%more ads and trackers with 65% accuracy. We alsoshow that A�G���� is fairly robust against adversarial obfuscationby publishers and advertisers that bypass �lter lists.
1 INTRODUCTIONBackground. Adblocking deployment has been steadily increas-ing over the last several years. According to PageFair, adblockersare used on more than 600 million devices globally as of December2016 [15, 24, 26]. There are several reasons that have led to manyusers installing adblockers like Adblock Plus and uBlock Origin.First, many websites show �ashy and intrusive ads that degradeuser experience [25, 27]. Second, online advertising has been repeat-edly abused by hackers to serve malware (so-called malvertising)[44, 48, 54] and more recently cryptojacking, where attackers hijackcomputers to mine cryptocurrencies [37]. Third, online behavioraladvertising has incentivized a nefarious ecosystem of online track-ers and data brokers that infringes on user privacy [28, 36]. Mostadblockers not only block ads, but also block associated malwareand trackers. Thus, in addition to instantly improving user expe-rience, adblocking is a valuable security and privacy enhancingtechnology that protects users against these threats.Motivation. Perhaps unsurprisingly, online publishers and ad-vertisers have started to retaliate against adblockers. First, manypublishers deploy anti-adblockers, which detect adblockers andforce users to disable their adblockers. This arms race between ad-blockers and anti-adblockers has been extensively studied [45, 46].Researchers have proposed approaches to detect anti-adblocking
, ,2018. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00https://doi.org/10.1145/nnnnnnn.nnnnnnn
scripts [40, 49, 60], which can then be blocked by adblockers. Second,some advertisers have started to manipulate the delivery of theirads to bypass �ler lists used by adblockers. For example, Facebookrecently obfuscated signatures of ad elements that were used by�lter lists to block ads. Adblockers, in response, quickly identi�ednew signatures to block Facebook ads. This prompted a few backand forth actions, with Facebook changing their website to removead signatures, and adblockers responding with new signatures [52].Limitations of Filter Lists. While adblockers are able to blockFacebook ads (for now), Facebook’s whack-a-mole strategy pointsto two fundamental limitations of adblockers. First, adblockers usemanually curated �lter lists to block ads and trackers based oninformally crowdsourced feedback from the adblocking community.This manual process of �lter list maintenance is inherently slowand error-prone. When new websites are created, or existing web-sites make changes, it takes adblocking community some time tocatch up by updating the �lter lists [1]. This is similar to other areasof system security, such as updating anti-virus signatures [31, 42].Second, rules de�ned in these �lter lists are fairly simple HTTP andHTML signatures that are easy to defeat for �nancially motivatedpublishers and advertisers. Researchers have shown that random-ization techniques, where publishers constantly mutate their webcontent (e.g., URLs, element ID, style class name), can easily defeatsignatures used by adblockers [51]. Thus, publishers can easily, andcontinuously, engage in the back and forth with adblockers andbypass their �ltering rules. It would be prohibitively challengingfor the adblocking community to keep up in such an adversarialenvironment at scale. In fact, a few third-party ad “unblocking”services claim to serve unblockable ads using the aforementionedobfuscation techniques [14, 32, 58].Proposed Approach. In this paper, we aim to address these chal-lenges by developing an automatic and e�ective adblocking ap-proach. We propose A�G����, which alleviates the need for man-ual �lter list curation by using machine learning to automaticallyidentify e�ective (both accurate and robust) patterns in the pageload process to block ads and trackers. Our key idea is to constructa multi-layer graph representation that incorporates �ne-grainedHTTP, HTML, and JavaScript information during page load. Com-bining information across the di�erent layers of the web stackallows us to capture tell-tale signs of ads and trackers. We extracta variety of structural (degree and connectivity) and content (do-main and keyword) features from the constructed graph to train asupervised machine learning model to detect ads and trackers.
1
arX
iv:1
805.
0915
5v1
[cs.C
Y]
22 M
ay 2
018
Current Tracking Blocking
• Extremely useful!
• Uses well known, targeted approaches
• Vulnerable to practical countermeasures
• We see increasing evasion
• Two typical approaches…
URL Based Blocking
• Representative ExtensionAdBlock Plus + EasyPrivacy
• Approach1. Identify URLs that trackers come from2. Build rules to instruct the browser to ignore these URLs
• Example1. Notice: https://example.org/tracking.js2. Block: */tracking.js
URL Based Evasions
• Rotate Domains- Domain generation algorithms (DGA)- Host on CDNs
• Move to First PartySites host local copies of tracking code
• Compose with “benign” code- Concatenate into one single file- Magnification / packing / browserify / require.js / etc.
Behavior Based Blocking• Representative Extension
PrivacyBadger
• Approach1. Look for code that does suspicious things2. Block or restrict similar code
• Example1. Notice script from tracker.com uses Canvas and WebGL oddly2. Prevent all code from tracker.com from accessing any privacy sensitive functionality
Behavior Based Evasions• Rotate Domains
- Domain generation algorithms (DGA)- Host on CDNs
• Split Suspicious Activity Across PartiesAvoid detection thresholds by distributing activity
• Evade Attribution- eval- new Function()- Promise.then()- etc…
AdGraph Alternative
• Blocking tracking resourcesJS, tracking pixels, iFrames…
• Deep browser instrumentation- Network: requests made during page execution- Layout: page structure and modifications- JavaScript: attribute above to responsible code
• Block based on contextML classification based on above described context
Common JS Example
1. Script element with inline code, that…
2. Appends a script element after itself, with remote script, that…
3. Reads document cookies (and other FP elements), creates an adjacent image, and then…
4. Fetches images from unknown URLs
AdGraph Example
AdGraph ExampleScriptElm
AdGraph ExampleScriptElm
ScriptElm
Append Element: <script>
AdGraph ExampleScriptElm
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
AdGraph ExampleScriptElm
ScriptResource
Fetch:<url>
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
AdGraph ExampleScriptElm
ScriptResource
Fetch:<url>
ImageElm
Append Element: <img>
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
AdGraph ExampleScriptElm
ScriptResource
Fetch:<url>
Attribute Modification: src=<url> Image
Elm
Append Element: <img>
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
AdGraph ExampleScriptElm
ScriptResource
Fetch:<url>
Attribute Modification: src=<url> Image
ResourceFetch:<url>
ImageElm
Append Element: <img>
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
AdGraph ExampleScriptElm
ScriptResource
Fetch:<url>
Attribute Modification: src=<url> Image
ResourceFetch:<url>
ImageElm
Append Element: <img>
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
ClassificationInformation
AdGraph ExampleScriptElm
ScriptResource
Fetch:<url>
Attribute Modification: src=<url> Image
ResourceFetch:<url>
ImageElm
Append Element: <img>
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
ClassificationInformation
AdGraph ExampleScriptElm
ScriptResource
Fetch:<url>
Attribute Modification: src=<url> Image
ResourceFetch:<url>
ImageElm
Append Element: <img>
Attribute Modification: src=<url>
ScriptElm
Append Element: <script>
AdGraph Results• High accuracy
> 95% compared to current, human approaches
• Strong privacy protectionsIdentifies tracking resources missed by current tools
• High performanceAs fast or faster than current approaches (and default Chromium!)
• Not limited to lists Can adapt as trackers adapt
Not Possible with Extensions
• Information BreathNeeded information not available to browser extensions
• Information DepthJS information not available to other browsers!
• PerformanceBlocking ML classifier benefits from C++ implementation
Runtime Privacy Improvements
• AdGraphClient-side, ML, graph-based tracking detection
• SpeedReaderPrivacy enhancing content extraction(i.e. “aggressive reader mode”)
SpeedReader: Reader Mode Made Fast and PrivateMohammad GhasemisharifUniversity of Illinois at Chicago
mghas2@uic.edu
Peter SnyderBrave Softwarepes@brave.com
Andrius AucinasBrave Software
aaucinas@brave.com
Benjamin LivshitsBrave Software / Imperial College London
ben@brave.com
ABSTRACTMost popular web browsers include “reader modes” that improvethe user experience by removing un-useful page elements. Readermodes reformat the page to hide elements that are not related to thepage’s main content. Such page elements include site navigation,advertising related videos and images, and most JavaScript. Theintended end result is that users can enjoy the content they areinterested in, without distraction.
In this work, we consider whether the “reader mode” can bewidened to also provide performance and privacy improvements.Instead of its use as a post-render feature to clean up the clutter on apage we propose SpeedReader as an alternative multistep pipelinethat is part of the rendering pipeline. Once the tool decides duringthe initial phase of a page load that a page is suitable for readermode use, it directly applies document tree translation before thepage is rendered.
Based on our measurements, we believe that SpeedReader canbe continuously enabled in order to drastically improve end-userexperience, especially on slower mobile connections. Combinedwith our approach to predicting which pages should be rendered inreader mode with 91% accuracy, it achieves drastic speedups andbandwidth reductions of up to 27⇥ and 84⇥ respectively on average.We further �nd that our novel “reader mode” approach bringswith it signi�cant privacy improvements to users. Our approache�ectively removes all commonly recognized trackers, issuing 115fewer requests to third parties, and interacts with 64 fewer trackerson average, on transformed pages.
1 INTRODUCTION“Web bloat” is a colloquial term that describes the trend in websitesto accumulate size and visual complexity over time. The phenom-ena has been measured in many dimensions, including total pagesize [7], page load time [5, 43, 44], memory needed [29], number ofnetwork requests [16, 27], amount of scripts executed [25, 33, 36, 38]and third parties contacted [24, 25, 27]. This work suggests thatgrowth in page size and complexity is outpacing improvements indevice hardware. All of this has a predictably negative impact onuser experience.
Web users and browser vendors have reacted to this “bloat” in avariety of ways, all partially helpful, but with signi�cant downsides.
Ad and tracking blockers are a popular and useful tool for reduc-ing the size complexity of sites. Prior work has shown that thesetools can be e�ective in reducing privacy leaks [30], network use,and extend device memory life. Such tools are inherently limitedin the scope of improvements they can achieve. While these lists
are large [41], they are small as a proportion of all URLs on theweb. Similarly, while these lists are updated often, they are updatedslowly compared to URL updates on the web.
Similarly, “reader mode” tools, provided in many popularbrowsers and browser extensions, are an e�ort to reduce the grow-ing visual complexity of web sites. Such tools attempt to extractthe subset of page content useful to users, and remove advertising,animations, boiler plate code, and other non-core content. Current“reader modes” do not provide the user with resource savings sincethe referenced resources have already been fetched and rendered.The growth and popularity of such tools suggest they are useful tobrowser users, looking to address the problem of page clutter andvisual “bloat”.
In this work, we propose a novel strategy called SpeedReaderfor dealing with resource and bloat on websites. Our techniqueprovides a user experience similar to existing “reader mode” tools,but with network, performance, and privacy improvements thatexceed existing ad and tracking blocking tools, on a signi�cantportion of websites. Signi�cantly, SpeedReader di�ers from exist-ing deployed reader mode tools by operating before page rendering,which allows it to determine which resources are needed for thepage’s core content before fetching.How we achieve speedups. SpeedReader achieves its perfor-mance improvements through a two-step pipeline:
(1) SpeedReader uses a classi�er to determine whether thereis a readable subset of the initial, fetched page HTML. Thisclassi�er is trained on a labeled corpus of 2,833 websites, andis described in detail in Section 3, and determines whether apage can be display in reader mode with 91% accuracy.
(2) If the classi�er has determined that the page is readable,SpeedReader extracts the readable subset of document be-fore rendering, using a variety of heuristics developed in priorresearch [23] and browser vendors [9, 22], and passes thesimpli�ed, reader mode document to the browser’s renderlayer. This tree translation step is described in Section 4.
Deployment. Combined with a highly accurate classi�er of “read-able” pages, the drastic improvements in performance, reduction inbandwidth use and elimination of trackers in reader mode makethe approach practical for continuous use. We therefore proposeSpeedReader as a sticky feature that a user can toggle to be alwayson. This approximates the experience of using an e-book reader,but with strengths of content availability on the web. It is also asuitable strategy for content prerendering or prefetching that couldbe implemented by web browser vendors, automatically deliveringgraceful performance degradation in poor connectivity areas or on
arX
iv:1
811.
0366
1v1
[cs.I
R]
8 N
ov 2
018
SpeedReader• Prevent tracking resources
JS, tracking pixels, iFrames…
• Most of a page isn’t immediately useful- Boilerplate: navigation, you might like…- Third party ads: often undesirable, offensive, or both- JavaScript: animations and distractions
• Extract good content, don’t block bad contentFocus on identifying the valuable parts of the page, not the harmful ones
Existing Reader Modes
Server Browser Render PageExtract
Main Text and Image
Present Reader Mode Version
AdsImages Videos
JS
Main Image
SpeedReader
Server BrowserDetermine if
Initial HTML is Readable Extract
Main Text and Image from
HTML
Present Reader Mode Version
Main Image
Main Image
If Yes…
If No…
Display as normal
SpeedReader Results
Comparison of SpeedReader to standard browsing on a large sample of websites.3rd Party
(Avg)3rd Party(Median)
Scripts(Avg)
Scripts(Median)
Ads and Trackers
(Avg)
Ads and Trackers(Median)
Default 117 63 83 51 63 24
SpeedReader 1 1 0 0 0 0
Not Possible with Extensions
• Access RestrictionsMost browser’s don’t allow extensions to modify pages
• PerformanceML classifier benefits from C++ implementation
Outline
1. BackgroundExtension focus in practical privacy tools
2. PresentPrivacy improvements require deep browser modifications
3. Next StepsCall to action, how to keep improving
uBlock PrivacyBadger
DisconnectAdBlock Plus
Unclaimed SpacePr
ivac
y co
ncer
n
Browser maintenance experience
Chrome Edge / IE
Firefox Safari
Extensions Runtimemodifications
Future privacy protections
Better Privacy is Possible• New Browser Vendors
The “big four” aren’t the only game in town anymore
• Many New Browsers are Privacy FocusedPrivacy as top-level goal, willing to be aggressive
• Eager to CollaborateThe new browsers are willing and interested to develop and maintain ambitious privacy protecting browser changes.
• Reach Out!Peter Snyder, Privacy Researcherpes@brave.com – @pes10k
top related