Upload: konark-modi

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Collecting user-data-socially-responsibly

“Collecting User's Data in a Socially-Responsible Manner.” Photograph: Daniel Beltra/Greenpeace

Konark Modi @konarkmodi

Josep M. Pujol @solso

Page 2: Collecting user-data-socially-responsibly

About Cliqz

• 80+ - Team size

• 500,000 - DAU

• 3 Million+ - Downloads (Germany only)

• 1 billion+ - Indexed pages (We do not believe in indexing the web.)

• 5 TB - Indexed in memory (based on open-source and in-house-built NoSQL stores).

• 10x more coverage for anti-phishing protection - Compared to other players such as Google Safe Browsing.

• Upcoming products such as anti-tracking.

Page 3: Collecting user-data-socially-responsibly

About Cliqz

Page 4: Collecting user-data-socially-responsibly

We Love Data …

Page 5: Collecting user-data-socially-responsibly

Let's step back a bit in time, to get the context.

Page 6: Collecting user-data-socially-responsibly

Source: http://thehumanfaceofbigdata.com

“Data is the new oil” - Clive Humby (2006)

Page 7: Collecting user-data-socially-responsibly

Data is still being collected without enough controls & measures.

Is privacy the new Green ?

Page 8: Collecting user-data-socially-responsibly

The biggest by-product of this collection is SESSIONS.

Is privacy the new Green ?

Page 9: Collecting user-data-socially-responsibly

How ?

[Diagram labels: Server-Side (Alice, Alice, Bob); Client-Side (Alice, Alice, Bob); "MAP/REDUCE :D"; "Uncharted water"]

Page 10: Collecting user-data-socially-responsibly

Instead …

[Diagram labels: Server-Side (Alice, Alice, Bob); Client-Side (Alice, Alice, Bob); "MAP/REDUCE :D" (three times); "Uncharted water"]

Page 11: Collecting user-data-socially-responsibly

Who is responsible ?

Is there a conspiracy theory or an evil plan ?

Page 12: Collecting user-data-socially-responsibly

Well, we have a simpler explanation:

It’s the consequence of common development practices, which result in trading users’ data, knowingly or unknowingly!

Page 13: Collecting user-data-socially-responsibly

Demo

Page 14: Collecting user-data-socially-responsibly

This looks like a toy example ?

Page 15: Collecting user-data-socially-responsibly

Which queries are so bad that they force people to redo the same query elsewhere?

Let’s take a more complex case

Page 16: Collecting user-data-socially-responsibly

[Diagram: Alice issues the query "apache big data conf" on search engine 1 and on search engine 2. Label: Client-Side]

Page 17: Collecting user-data-socially-responsibly

[Diagram: Alice's query "apache big data conf" on search engine 1 and search engine 2, shown twice. Labels: Client-Side; Server-Side; Map-Reduce; "Uncharted water"]

Page 18: Collecting user-data-socially-responsibly

[Diagram: Alice's query "apache big data conf" on search engine 1 and search engine 2, shown twice. Labels: Client-Side; Server-Side; Map-Reduce; "Uncharted water"]

Page 19: Collecting user-data-socially-responsibly

[Diagram: Alice's query "apache big data conf" on search engine 1 and search engine 2, shown three times. Labels: Client-Side; Server-Side; Map-Reduce (twice); "Uncharted water"]

Page 20: Collecting user-data-socially-responsibly

[Diagram: Alice's query "apache big data conf" on search engine 1 and search engine 2, shown three times. Labels: Client-Side; Server-Side; Map-Reduce (twice); "Uncharted water"]

Page 21: Collecting user-data-socially-responsibly

As we mentioned before, we believe in data and are not against collecting it.

• Stopping data collection altogether would be foolish and dangerous. It would also mean stopping the wheels of innovation.

• Who would benefit the most from supporting a ban on advertisements for tobacco products?

Page 22: Collecting user-data-socially-responsibly
Page 23: Collecting user-data-socially-responsibly

“Socially responsible manner” is an analogy: it means ensuring that the events being collected are free of pollutants such as explicit IDs and implicit IDs, and that they reach home securely.

Page 24: Collecting user-data-socially-responsibly

Why does CLIQZ Care ?

Page 25: Collecting user-data-socially-responsibly

German Data Privacy Laws

Security breaches

When government knocks on your door

Page 26: Collecting user-data-socially-responsibly

So what do we bring to the table?

Page 27: Collecting user-data-socially-responsibly

HUMAN WEB

• We developed HumanWeb to balance the right to privacy with the need to build products that improve the web and allow for more openness.

• It ensures that data from which sessions, or links between navigation patterns, could be inferred is not collected.

• It does not create enough data to allow the identification of individuals.

• We do not want to know who "YOU" are, what "YOU" searched and when "YOU" searched.

• It is designed so that a "malicious/untrustworthy" actor, or for that matter anyone at Cliqz, who gets access to the raw data flow cannot infer or identify individuals.

Page 28: Collecting user-data-socially-responsibly

Sample events:

{
  "action": action of the message,
  "ver": version name,
  "type": "humanweb",
  "payload": { }, // the actual data
  "ts": UTC time capped to the day, e.g. 20150909
}

• Sample event for Page

• Sample event for Query
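The envelope above can be sketched in a few lines of Python. This is only an illustration of the "ts capped to the day" idea, not the extension's code: the function names, the string type of `ts`, and the example action, version and payload values are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def cap_to_day(moment: datetime) -> str:
    """Reduce a timezone-aware timestamp to its UTC calendar day (YYYYMMDD)."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d")


def make_event(action: str, version: str, payload: dict, moment: datetime) -> dict:
    """Build an envelope with the five fields listed on the slide."""
    return {
        "action": action,
        "ver": version,
        "type": "humanweb",
        "payload": payload,
        "ts": cap_to_day(moment),  # hour, minute and second are discarded
    }


# Hypothetical values: the action name, version and payload field are made up.
late_evening = datetime(2015, 9, 9, 23, 59, 59, tzinfo=timezone.utc)
event = make_event("query", "1.0", {"q": "apache big data conf"}, late_evening)
```

Because only the day survives, two events sent at different times on the same day carry the same `ts`, so the timestamp cannot be used to order or link them.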

Page 29: Collecting user-data-socially-responsibly

HumanWeb

[Diagram: client-side pipeline. Components shown:]

• Local storage | Structural data about webpages

• Map-Reduce: aggregations, heuristics, filtering, hashing

• Event queue: [{event1}, {event2}, {event3}] | Scheduled so that events are not sent in a batch

• Final checks

• Filtering

• Sanitisation / Masking

• Secure Channel
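The slide does not say how the event queue avoids sending events in a batch. One possible approach, shown here purely as an assumption, is to give every queued event its own random delay, so that events recorded together leave at unrelated times. The class name, the `max_delay_s` parameter and the ten-minute window are hypothetical.

```python
import heapq
import random


class EventQueue:
    """Holds events until an individually randomised send time is reached."""

    def __init__(self, rng: random.Random, max_delay_s: float = 600.0) -> None:
        self._rng = rng
        self._max_delay_s = max_delay_s
        self._heap: list[tuple[float, int, dict]] = []
        self._counter = 0  # tie-breaker so dicts are never compared

    def push(self, event: dict, now: float) -> None:
        """Queue an event with its own random delay after `now` (seconds)."""
        due = now + self._rng.uniform(0.0, self._max_delay_s)
        heapq.heappush(self._heap, (due, self._counter, event))
        self._counter += 1

    def pop_due(self, now: float) -> list[dict]:
        """Return the events whose send time has passed, earliest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


queue = EventQueue(random.Random(2017))
for name in ("event1", "event2", "event3"):
    queue.push({"id": name}, now=0.0)
```

A caller would poll `pop_due` periodically and hand each returned event to the final checks and the secure channel.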

Page 30: Collecting user-data-socially-responsibly

Privacy breaches on the way home

To achieve total privacy, we must rely on a network of proxies that remove any network-related data such as cookies, IP addresses and headers, so that fingerprinting is impossible.

Page 31: Collecting user-data-socially-responsibly

SecureChannel : Protection from network fingerprinting

Page 32: Collecting user-data-socially-responsibly

SecureChannel : What do we encrypt ?

• Queries from the user (initiated by their activity in the address bar of Cliqz’s instrumented Firefox).

• All telemetry signals (initiated by Cliqz’s instrumented Firefox).

• All messages regarding the HumanWeb data collection effort.

Also, before reaching our infrastructure, the encrypted messages are routed through a mesh of proxies.

Page 33: Collecting user-data-socially-responsibly

SecureChannel : How do we encrypt ?

Life-cycle of hashes / keys:

• AES: hash-keys used with AES are used only once, even if the user types the same query.

• Public / private key pair (client):

  • The keys on the client side are all short-lived; we continuously generate keys on the client side.

  • The public/private key pair of the client (the extension) is meant to be used only once and then thrown away. The key pairs are regenerated to fill a pool while the browser is idle.

• Public / private key pair (server):

  • Only the public part of this key is shared with the extension.

  • The client uses it while encrypting the request. This is a long-lived key, currently changed only if it is compromised.

Client side: 128-bit symmetric AES encryption, OpenSSL RSA 1024-bit encryption.

EventLogger: 128-bit symmetric AES encryption, OpenSSL RSA 4096-bit encryption.
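The "use once, refill while idle" life-cycle can be sketched as a small pool. This is an assumption-laden illustration: the class name, `target_size`, and the idea of calling `refill()` from an idle hook are hypothetical, and random 16-byte tokens stand in for real RSA key pairs, which the Python standard library cannot generate.

```python
import secrets
from collections import deque
from typing import Callable


class SingleUseKeyPool:
    """Hands out each key exactly once; topped up when the caller is idle."""

    def __init__(self,
                 generate: Callable[[], bytes] = lambda: secrets.token_bytes(16),
                 target_size: int = 4) -> None:
        self._generate = generate  # stand-in for key-pair generation
        self._target_size = target_size
        self._keys: deque[bytes] = deque()

    def __len__(self) -> int:
        return len(self._keys)

    def refill(self) -> None:
        """Top the pool up to its target size (meant for idle periods)."""
        while len(self._keys) < self._target_size:
            self._keys.append(self._generate())

    def take(self) -> bytes:
        """Remove and return a key; it is never handed out again."""
        if not self._keys:  # pool exhausted: generate on demand
            self._keys.append(self._generate())
        return self._keys.popleft()


pool = SingleUseKeyPool(target_size=3)
pool.refill()
```

Generating ahead of time keeps the cost of key generation off the request path, which is presumably why the slide ties regeneration to browser idle time.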

Page 34: Collecting user-data-socially-responsibly

SecureChannel : How do we encrypt ? (Extension)

encryptedRequest = iv:encryptedMsg:encryptedKey

iv = initialisation vector
msg = (originalRequest + ExtensionPublicKey)
key = md5(msg)
encryptedMsg = AES.encrypt(msg, key, {mode: CBC, padding: PKCS7, iv: iv})
encryptedKey = sign(EventLoggerPublicKey, key)

Each request to be encrypted has the following components:

• Message / request to encrypt: query or data.

• ExtensionPublicKey: chosen from a pool of public keys for that user on the machine (each key is used only once and then discarded).

• Initialisation vector: derived from a word array of 16 bytes.

• EventLoggerPublicKey: our public key, shared with the extension.
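Two pieces of the scheme above need nothing beyond the Python standard library: deriving the AES key with MD5 and framing the three fields into one string. The sketch below covers only those two pieces; the AES-CBC and RSA steps are left out because the standard library has no implementation of either. The function names and the Base64 encoding of each field are assumptions, since the slide shows only the `iv:encryptedMsg:encryptedKey` layout.

```python
import base64
import hashlib
import secrets


def derive_key(msg: bytes) -> bytes:
    """key = md5(msg): an MD5 digest is 16 bytes, the size of an AES-128 key."""
    return hashlib.md5(msg).digest()


def new_iv() -> bytes:
    """A fresh random 16-byte initialisation vector (one AES block)."""
    return secrets.token_bytes(16)


def frame(iv: bytes, encrypted_msg: bytes, encrypted_key: bytes) -> str:
    """Join the three fields as iv:encryptedMsg:encryptedKey.

    Base64 is assumed here; its alphabet contains no ':', so the
    separator stays unambiguous.
    """
    fields = (iv, encrypted_msg, encrypted_key)
    return ":".join(base64.b64encode(f).decode("ascii") for f in fields)


def unframe(envelope: str) -> tuple[bytes, bytes, bytes]:
    """Split an envelope back into (iv, encryptedMsg, encryptedKey)."""
    iv, encrypted_msg, encrypted_key = (
        base64.b64decode(f) for f in envelope.split(":")
    )
    return iv, encrypted_msg, encrypted_key
```

Because `msg` includes the single-use ExtensionPublicKey, `derive_key` returns a different key for every request, even when the query text repeats.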

Page 35: Collecting user-data-socially-responsibly

SecureChannel : Routing ? (Extension)

• The extension maintains a list of proxies that are healthy at that point in time.

• When sending a request / message, the extension picks the end-point in a round-robin fashion (round-robin for now).

• To reduce the risk of proxies being malicious with the message, we scramble the message and split it into a random ‘n’ parts just before sending it from the extension.

• The value of ‘n’ is determined by the extension; we expect ‘n’ to be 1, 2, 4 or 8 for the time being. The value of ‘n’ is not known to the proxies, so a proxy cannot tell whether it has all the parts.

• The only way to tamper with a message is to have all the parts needed to decrypt it. Because messages are scrambled, split and sent through different proxies, they are safe from the proxies.

• The Event Logger waits for all the parts of a message; only by combining them at our (secure) Event Logger can the message be decrypted.
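The routing steps above can be sketched as follows. This is a simplified reading, not the extension's code: "scrambling" is modelled as shuffling the order of the parts, the chunks are contiguous and of near-equal size, and the function names and proxy names are hypothetical.

```python
import itertools
import random
from typing import Iterator

ALLOWED_PARTS = (1, 2, 4, 8)  # the values of n named on the slide


def split_message(message: bytes, n: int) -> list[bytes]:
    """Cut the message into n contiguous chunks of near-equal size.

    Chunks at the end may be empty when the message is shorter than n bytes.
    """
    if n not in ALLOWED_PARTS:
        raise ValueError(f"n must be one of {ALLOWED_PARTS}")
    size = -(-len(message) // n)  # ceiling division
    return [message[i * size:(i + 1) * size] for i in range(n)]


def scramble_and_route(message: bytes,
                       endpoints: Iterator[str],
                       rng: random.Random) -> list[tuple[str, bytes]]:
    """Pick n at random, shuffle the parts, assign proxies round-robin."""
    n = rng.choice(ALLOWED_PARTS)
    parts = split_message(message, n)
    rng.shuffle(parts)  # the proxies never see the original order
    return [(next(endpoints), part) for part in parts]


# A shared cycle keeps the round-robin position across messages.
proxies = itertools.cycle(["proxy-a", "proxy-b", "proxy-c"])
routed = scramble_and_route(b"apache big data conf", proxies, random.Random(7))
```

Each proxy receives at most a fragment and does not know n, so it cannot tell whether what it holds is the complete message.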

Page 36: Collecting user-data-socially-responsibly

SecureChannel : How do we decrypt ? (Server)

EncryptedRequest = iv:encryptedMsg:encryptedKey
key = unlock(EventLoggerPrivateKey, encryptedKey)
msg = AES.decrypt(encryptedMsg, key, {mode: CBC, padding: PKCS7, iv: iv})
request = msg.data
ExtensionPublicKey = msg.pk (we need it to sign the response)

Important:

• Because the server receives messages in parts, we rely on combinations to get the key and the message.

• The message itself is scrambled, so even once it is decrypted we need to stitch it together by trying different combinations.
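"Trying different combinations" can be read as a search over orderings of the received parts. The sketch below assumes the receiver has some way to recognise a correct reassembly. The slides do not say what that check is, so comparing an MD5 digest is used only as a stand-in (the encryption slide defines `key = md5(msg)`, which makes such a comparison plausible). The function name and the example message are hypothetical.

```python
import hashlib
import itertools
from typing import Callable, Optional, Sequence


def reassemble(parts: Sequence[bytes],
               is_valid: Callable[[bytes], bool]) -> Optional[bytes]:
    """Try every ordering of the parts; return the first that validates."""
    for ordering in itertools.permutations(parts):
        candidate = b"".join(ordering)
        if is_valid(candidate):
            return candidate
    return None  # no ordering passed the check


# Hypothetical example: four parts arrive out of order.
message = b'{"q": "apache big data conf"}'
digest = hashlib.md5(message).digest()
scrambled = [message[16:24], message[0:8], message[24:], message[8:16]]
restored = reassemble(scrambled, lambda c: hashlib.md5(c).digest() == digest)
```

The search is cheap for the values of n on the routing slide: even n = 8 gives only 8! = 40,320 orderings.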

Page 37: Collecting user-data-socially-responsibly

All talk and no play makes Jack a dull boy!

Demo

Page 38: Collecting user-data-socially-responsibly

Thank You http://www.cliqz.com/en

We believe it’s possible, we are actually doing it

photo: projectsecretidentity.org