“Collecting User's Data in a Socially-Responsible Manner.”
Photograph: Daniel Beltra/Greenpeace
Konark Modi @konarkmodi
Josep M. Pujol @solso
About Cliqz
• 80+ - Team size
• 500,000 - DAU
• 3 Million+ - Downloads (Germany only)
• 1 billion+ - Indexed pages (We do not believe in indexing the web.)
• 5 TB - In-Memory indexed (Based on open-source and in-house-built NoSQL stores.)
• 10x more coverage for anti-phishing protection - As compared to other players like safebrowsing by Google.
• Upcoming products like Anti-tracking etc.
We Love Data …
Let's step back a bit in time, to get the context.
Source : http://thehumanfaceofbigdata.com
“Data is the new oil” - Clive Humby (2006)
Data is still being collected without enough controls & measures.
Is privacy the new Green ?
The biggest by-product of this data collection is SESSIONS.
How ?
[Diagram. Labels: Server-Side with MAP/REDUCE :D over Alice, Alice, Bob; Client-Side with Alice, Alice, Bob; "Uncharted water".]
Instead …
[Diagram. Labels: Server-Side; Client-Side with Alice, Alice, Bob, each with its own MAP/REDUCE :D; "Uncharted water".]
Who is responsible ?
Is there a conspiracy theory or an evil plan ?
Well, we have a simpler explanation:
It is the consequence of common development practices, which result in users' data being traded, knowingly or unknowingly!
Demo
This looks like a toy example ?
Which queries are so bad that they force people to redo the same query elsewhere?
Let’s take a more complex case
[Diagram sequence: Alice issues the query "apache big data conf" on search engine 1 and on search engine 2. The panels contrast Map-Reduce on the Server-Side with Map-Reduce on the Client-Side; one label reads "Uncharted water".]
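The client-side variant of this example can be sketched roughly as follows. This is a minimal sketch and not Cliqz's implementation: the `Visit` record, the `requery_events` function and the output field names are all hypothetical. What it illustrates is that the join over Alice's history happens on her own machine, and the only thing that would leave it is a single record with no user identifier.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Visit:
    """One search observed locally in the browser (hypothetical record)."""
    engine: str
    query: str


def requery_events(history: Iterable[Visit]) -> List[dict]:
    """Emit one event each time a query is repeated on a different engine.

    The history itself stays on the client; each emitted event carries
    no user ID and no reference to the rest of the history.
    """
    events = []
    previous = None
    for visit in history:
        if (previous is not None
                and visit.query == previous.query
                and visit.engine != previous.engine):
            events.append({
                "query": visit.query,
                "from": previous.engine,   # hypothetical field name
                "to": visit.engine,        # hypothetical field name
            })
        previous = visit
    return events


alice_history = [
    Visit("search engine 1", "apache big data conf"),
    Visit("search engine 2", "apache big data conf"),
]
print(requery_events(alice_history))
```

A server receiving such events from many clients could still count which queries are most often redone elsewhere, without ever holding a per-user session.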
As mentioned before, we believe in data and are not against its collection.
• Stopping data collection altogether would be foolish and dangerous. It would also mean stopping the wheels of innovation.
• Who would benefit the most from supporting a ban on advertisements for tobacco products?
“Socially responsible manner” is an analogy: it means ensuring that the events being collected are free of pollutants such as explicit IDs and implicit IDs, and that they reach home securely.
Why does CLIQZ Care ?
German Data Privacy Laws
Security breaches
When the government knocks on your door
So what do we bring to the table?
HUMAN WEB
• We have developed HumanWeb to balance the Right-to-Privacy with the need to build products that improve the web and allow for more openness.
• It ensures that data from which sessions, or links between navigation patterns, could be inferred is not collected.
• It does not create enough data to allow the identification of individuals.
• We do not want to know who "YOU" are, what "YOU" searched and when "YOU" searched.
• It is designed so that a "malicious/untrustworthy" actor, or for that matter anyone at Cliqz, who gets access to the raw data flow cannot infer or identify individuals.
Sample events:
{
"action": action of the message,
"ver": version name,
"type": "humanweb",
"payload": { }, //the actual data
"ts": UTC time capped to the day, e.g. 20150909
}
• Sample event for Page
• Sample event for Query
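The event shape above, and in particular the day-capped timestamp, can be illustrated with a short sketch. The `make_event` helper, the action name and the version string are hypothetical, and representing `ts` as a `YYYYMMDD` string is an assumption based on the `20150909` example.

```python
from datetime import datetime, timezone


def make_event(action: str, ver: str, payload: dict, now: datetime) -> dict:
    """Build an event in the shape shown above.

    The timestamp is capped to the UTC day (YYYYMMDD), so two events
    sent on the same day carry the same "ts" value.
    """
    day = now.astimezone(timezone.utc).strftime("%Y%m%d")
    return {
        "action": action,
        "ver": ver,
        "type": "humanweb",
        "payload": payload,
        "ts": day,
    }


event = make_event(
    action="page",   # hypothetical action name
    ver="1.0",       # hypothetical version string
    payload={},      # the actual data would go here
    now=datetime(2015, 9, 9, 17, 45, 12, tzinfo=timezone.utc),
)
print(event["ts"])
```

Because the hour, minute and second are dropped before the event is built, they cannot be used later to order one user's events into a session.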
HumanWeb
[{event1}, {event2}, {event3}]
Event Queue | Scheduled so that events are not sent in a batch
Final checks
Filtering
Sanitisation / Masking
Secure Channel
Client-side
Local storage | Structural data about webpages
Map-Reduce Aggregations, Heuristics, Filtering, Hashing
Privacy breaches on the way home
To achieve total privacy, we must rely on a network of proxies that remove any network-related data, such as cookies, IP addresses and headers, so that fingerprinting is impossible.
SecureChannel : Protection from network fingerprinting
SecureChannel : What do we encrypt ?
• The queries from the user (initiated by them through activity in the address bar of Cliqz's instrumented Firefox).
• All telemetry signals (initiated by Cliqz's instrumented Firefox).
• All messages regarding the HumanWeb data collection effort.
Also, before reaching our infrastructure, the encrypted messages are routed through a mesh of proxies.
SecureChannel : How do we encrypt ?
Life-cycle of hashes / keys:
• AES: The hash keys used with AES are used only once, even if the user types the same query.
• Public / Private KeyPair (Client):
  • The keys on the client side are all short-lived; we continuously generate keys on the client side.
  • The public/private key pair of the client (the Extension) is meant to be used only once and then thrown away. The key pairs are regenerated to fill a pool while the browser is idle.
• Public / Private KeyPair (Server):
  • Only the public part of this key is shared with the extension.
  • The client uses it while encrypting the request. This is a long-lived key, which is currently changed only if it is compromised.
Client side: 128-bit symmetric AES encryption, OpenSSL RSA 1024-bit encryption.
EventLogger: 128-bit symmetric AES encryption, OpenSSL RSA 4096-bit encryption.
SecureChannel : How do we encrypt ? (Extension)
encryptedRequest(iv:encryptedMsg:encryptedKey)
iv: Initialization Vector
msg = (originalRequest + ExtensionPublicKey)
key = md5(msg)
encryptedMsg = AES.encrypt(msg, key, {mode: CBC, padding: PKCS7, iv: iv})
encryptedKey = sign(EventLoggerPublicKey, key)
Each request to be encrypted has the following components:
• Message / Request to encrypt: Query or Data
• ExtensionPublicKey: Chosen from a pool of public keys for that user on the machine (the key is used only once and then discarded).
• Initialisation Vector: Derived from wordarray of 16-bits.
• EventLoggerPublicKey: Our public key, shared with the extension.
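The way the envelope is assembled can be sketched as below. This is only a sketch under several assumptions: the Python standard library has no AES or RSA, so the two ciphers are passed in as callables and the ones used in the example are placeholders that do not encrypt anything; packing the request and the extension's public key as JSON fields `data` and `pk`, base64-encoding each part, and using 16 random bytes for the IV are all choices made here for illustration, not details taken from the slides. One thing that does follow from the pseudocode is that an MD5 digest is 16 bytes long, which matches the 128-bit AES key size mentioned earlier.

```python
import base64
import hashlib
import json
import os


def build_encrypted_request(original_request, extension_public_key,
                            aes_cbc_encrypt, wrap_key):
    """Assemble iv:encryptedMsg:encryptedKey following the pseudocode above.

    aes_cbc_encrypt(msg, key, iv) and wrap_key(key) are supplied by the
    caller and stand in for AES-CBC and for the public-key step.
    """
    iv = os.urandom(16)  # assumption: 16 random bytes, the AES block size
    msg = json.dumps({"data": original_request,
                      "pk": extension_public_key}).encode("utf-8")
    key = hashlib.md5(msg).digest()  # 16 bytes = 128 bits
    encrypted_msg = aes_cbc_encrypt(msg, key, iv)
    encrypted_key = wrap_key(key)
    # Base64 output never contains ":", so ":" is a safe separator.
    return ":".join(base64.b64encode(part).decode("ascii")
                    for part in (iv, encrypted_msg, encrypted_key))


# Placeholder "ciphers" so that the sketch runs; they do NOT encrypt.
def fake_aes(msg, key, iv):
    return msg


def fake_wrap(key):
    return key


request = build_encrypted_request("apache big data conf", "EXT-PUBLIC-KEY",
                                  fake_aes, fake_wrap)
iv_b64, msg_b64, key_b64 = request.split(":")
print(len(base64.b64decode(iv_b64)))
```

In a real client the two placeholders would be replaced by calls into an actual cryptography library.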
SecureChannel : Routing ? (Extension)
• The extension maintains a list of proxies which are healthy / good at that point in time.
• When sending a request / message, the extension picks the end-point in a round-robin fashion (for now).
• To avoid the risk of proxies being malicious with the message, we scramble the message and split it into a random ‘n’ parts just before it is sent from the extension.
• The value of ‘n’ is determined by the extension; we expect ‘n’ to be 1, 2, 4 or 8 for the time being. The value of ‘n’ is not known to the proxies, so a proxy cannot tell whether it has all the parts.
• The only way to tamper with a message is to have all the parts needed to decrypt it. Since messages are scrambled, split and sent through different proxies, they are safe from the proxies.
• The Event Logger waits for all the parts of a message; only by combining them at our (secure) Event Logger can the message be decrypted.
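The splitting and routing described in the bullets above could look roughly like this. It is a sketch under assumptions: `split_and_route` is a hypothetical name, "scrambling" is modelled simply as shuffling the order of equally sized parts, and the proxy names are made up. The slides do not say how the scrambling is actually done.

```python
import itertools
import random
from typing import List, Tuple


def split_and_route(message: bytes, proxies: List[str], n: int,
                    rng: random.Random) -> List[Tuple[str, bytes]]:
    """Split message into n parts, shuffle them, assign proxies round-robin."""
    if n not in (1, 2, 4, 8):
        raise ValueError("n is expected to be 1, 2, 4 or 8")
    size = -(-len(message) // n)  # ceiling division
    parts = [message[i * size:(i + 1) * size] for i in range(n)]
    rng.shuffle(parts)  # assumption: scrambling = shuffling the part order
    proxy_cycle = itertools.cycle(proxies)
    return [(next(proxy_cycle), part) for part in parts]


routed = split_and_route(
    b"0123456789abcdef",
    ["proxy-a", "proxy-b", "proxy-c"],  # hypothetical healthy proxies
    n=4,
    rng=random.Random(7),
)
for proxy, part in routed:
    print(proxy, part)
```

With more parts than proxies, some proxy necessarily sees more than one part, but since the value of n is not sent along it still cannot tell whether it holds the whole message.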
SecureChannel : How do we decrypt ? (Server)
EncryptedRequest = iv:encryptedMsg:encryptedKey
key = unlock(EventLoggerPrivateKey, encryptedKey)
msg = AES.decrypt(encryptedMsg, key, {mode: CBC, padding: PKCS7, iv: iv})
request = msg.data
ExtensionPublicKey = msg.pk (We need it to sign the response)
Important:
• Because the server receives messages in parts, to get the key and message we rely on combinations.
• The message itself is scrambled, so even if it is decrypted we need to stitch it together by trying different combinations.
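Stitching by trying combinations can be sketched as a search over the orderings of the received parts. This is a sketch under assumptions: the slides do not say how the server recognises the correct ordering, so the check is passed in as a hypothetical `looks_valid` predicate, and the example uses "parses as a JSON object with a `data` field" purely as a stand-in.

```python
import itertools
import json
from typing import Callable, Iterable, Optional


def stitch(parts: Iterable[bytes],
           looks_valid: Callable[[bytes], bool]) -> Optional[bytes]:
    """Return the first ordering of parts accepted by looks_valid, else None."""
    for ordering in itertools.permutations(list(parts)):
        candidate = b"".join(ordering)
        if looks_valid(candidate):
            return candidate
    return None


def parses_as_request(candidate: bytes) -> bool:
    """Stand-in validity check: a JSON object containing a "data" field."""
    try:
        obj = json.loads(candidate.decode("utf-8"))
    except ValueError:  # covers both JSON and UTF-8 decoding errors
        return False
    return isinstance(obj, dict) and "data" in obj


received = [b': "k"}', b'": "q"', b'{"data', b', "pk"']  # scrambled order
print(stitch(received, parses_as_request))
```

The search visits at most n! orderings, which is 40320 for n = 8, the largest value of n mentioned above.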
All talk and no play makes Jack a dull boy!
Demo
Thank You http://www.cliqz.com/en
We believe it’s possible, we are actually doing it
photo: projectsecretidentity.org