Handling Personal Information in LinkedIn’s Content Ingestion System, by David Max, Senior Software Engineer, LinkedIn


Post on 12-Jun-2020


Page 1: Handling Personal Information in LinkedIn's Content ...About LinkedIn New York Engineering •Located in Empire State Building •Approximately 100 engineers and 1000 employees total

Handling Personal Information in LinkedIn’s Content Ingestion System

David Max, Senior Software Engineer

LinkedIn

Page 2

About Me

• Software Engineer at LinkedIn NYC since 2015

• Content Ingestion team

• Office Hours – Thursday 11:30-12:00

David Max, Senior Software Engineer, LinkedIn

www.linkedin.com/in/davidpmax/

Page 3

About LinkedIn New York Engineering

• Located in Empire State Building

• Approximately 100 engineers and 1000 employees total

• Multiple teams, front end, back end, and data science

New York Engineering

Page 4

Disclaimers

• I’m not a lawyer

• Some details omitted

• I am not a spokesperson for official LinkedIn policy

Page 5

Our Mission

Create economic opportunity for every member of the global workforce

Page 6

LinkedIn

• World’s largest professional network

• >546M members

• >70% of members reside outside the U.S.

• More than 200 countries and territories worldwide

Page 7

General Data Protection Regulation

• Applies to all companies worldwide that process personal data of individuals in the EU.

• Widens definition of personal data.

• Introduces restrictive data handling principles.

• Enforceable from May 25, 2018.

Page 8

Handling Personally Identifiable Information (PII)

• Data Minimization: Limit personal data collection, storage, and usage

• Consent: Cannot use collected data for a different purpose

• Retention: Do not hold data longer than necessary

• Deletion: Must delete data upon request

Page 9

Handling PII in Content Ingestion

[Diagram labels: Content Ingestion (Babylonia); Data Protection: Data Minimization, Consent, Retention, Deletion]

Page 10

What is Content Ingestion?

Content Ingestion

Babylonia


Page 13

Content Ingestion

Babylonia

url: https://www.youtube.com/watch?v=MS3c9hz0bRg

title: "SATURN 2017 Keynote: Software is Details"

image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sqpoaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB&rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg


Page 15

What is Content Ingestion?

Content Ingestion

Babylonia

• Extracts metadata from web pages

• Source of Truth for 3rd party content

• Also contains metadata for some public 1st party content

• Used by LinkedIn services for sharing, decorating, and embedding content

• Data also feeds into content understanding and relevance models
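
As a rough illustration of what "extracts metadata from web pages" could look like, the sketch below pulls a title and image out of a page's HTML and produces a record shaped like the url/title/image example shown earlier. It assumes the page exposes Open Graph `<meta>` tags; the talk does not say how Babylonia actually extracts metadata, and `MetadataParser` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects <title> text and Open Graph <meta> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and prop.startswith("og:"):
            # property="og:image" is stored under the key "image"
            self.metadata[prop[3:]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # <title> is only a fallback; an og:title tag overrides it.
        if self._in_title and "title" not in self.metadata:
            self.metadata["title"] = data.strip()


def extract_metadata(url, html):
    """Return a flat metadata record for one fetched page (hypothetical shape)."""
    parser = MetadataParser()
    parser.feed(html)
    return {"url": url, **parser.metadata}
```

Calling `extract_metadata(url, html)` on a fetched page returns a plain dictionary, which is convenient for storing as a record keyed by URL.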

Page 16

How does PII get into Babylonia?

Page 17

Ingesting 1st party pages containing publicly viewable member PII

• Profile pages

• Publish posts

• SlideShare content

Page 18

When a Member Account is Closed

What happens

• Babylonia (along with other systems) is notified that the member’s account is closed

• Other systems take down the member’s content (e.g. public profile page, publish posts)

What Babylonia needs to do

• Remove scraped data relating to the member pages that have been taken down

• Notify downstream systems that might be holding a copy of the data

Page 19

Babylonia Datasets

[Diagram labels: Content Ingestion (Babylonia); Datasets: Espresso Database, HDFS ETL, Brooklin Data Change Events]

Page 20

Downstream and Upstream Datasets

[Diagram labels: Espresso Database; HDFS ETL; Brooklin Data Change Events; 1st party web page; profile, job, article, publishing, profile; Online Service, Near Line, Offline]

Page 21

Challenges of member PII in Babylonia

• Need to identify URLs that contain a member’s PII.

• My post might contain your PII

• Connection between member and the URL resides in the upstream system

Page 22

Option #1: Require Upstream Systems to Notify Babylonia

Pros

• Simple – Babylonia waits to be told specifically which URLs should be purged

• Babylonia only does extra work when a URL needs to be purged

• Puts responsibility where the knowledge is

Cons

• Requires additional work by every system that exposes PII in publicly accessible web pages

• If the notification is missed, how will Babylonia know?

• 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too

Page 23

Option #2: Actively Refetch Every 1st Party URL

Pros

• Simple logic: Page gone? Purge the page.

• Requires little additional work from upstream systems

• Works also for old 1st party URLs

Cons

• There are a lot of 1st party URLs in Babylonia

• Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected

• Extra work to avoid false positives or false negatives

Page 24

Option #3: Eliminate Member PII in Babylonia

Pros

• The easiest data to delete is data that isn’t in your system to begin with

• Gets closer to Single Source of Truth (SSOT) for all 1st party content – better for consistency, not only for compliance

Cons

• Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience

• No substitute currently available

• Difficult to achieve based on URL – can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)

Page 25

Blended Approach

• Option 1 - Having upstream systems notify is best, but might miss some pages

• Option 2 - Active refetch is thorough but expensive. Must use to catch pages that won’t support notifications

• Option 3 - Some pages won’t work with active refetch. For example, pages that still return an HTTP status code 200 even when the data has been removed. These must be blocked

Page 26

Classification of Ingested URLs

URL

• 3rd Party

• 1st Party

    • Blocked

    • Whitelisted

        • Actively Refetched

        • Notified by Upstream
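
The classification above can be sketched as a small decision function. This is only an illustration under stated assumptions: the domain `linkedin.example`, the path prefixes, and the names `UrlClass`, `WHITELIST`, and `classify` are all made up here, and the sketch looks only at the URL string, so it would not catch shortlinks that redirect to 1st party content.

```python
from enum import Enum
from urllib.parse import urlparse


class UrlClass(Enum):
    THIRD_PARTY = "3rd party"
    BLOCKED = "1st party, blocked"
    ACTIVELY_REFETCHED = "1st party, whitelisted, actively refetched"
    NOTIFIED_BY_UPSTREAM = "1st party, whitelisted, notified by upstream"


# Hypothetical configuration: the domain and path prefixes are invented for illustration.
FIRST_PARTY_DOMAINS = {"linkedin.example"}
WHITELIST = {
    "/in/": UrlClass.NOTIFIED_BY_UPSTREAM,
    "/pulse/": UrlClass.ACTIVELY_REFETCHED,
}


def classify(url):
    """Classify a URL by host and path prefix only (no redirects are followed)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    is_first_party = any(
        host == domain or host.endswith("." + domain) for domain in FIRST_PARTY_DOMAINS
    )
    if not is_first_party:
        return UrlClass.THIRD_PARTY
    for prefix, url_class in WHITELIST.items():
        if parsed.path.startswith(prefix):
            return url_class
    # Restriction is the default: 1st party URLs not on the whitelist are blocked.
    return UrlClass.BLOCKED
```

Making `BLOCKED` the fall-through case means a new kind of 1st party page is rejected until someone explicitly adds it to the whitelist.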

Page 27

Option 1 – Upstream Notification

• Upstream system sends a Kafka message

• Babylonia consumes message and purges data

• Open source - kafka.apache.org
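
A minimal sketch of the consuming side, assuming the takedown message is a JSON object with a `urls` list. That message shape, the `ContentStore` class, and `handle_takedown_message` are hypothetical; the talk does not describe the real schema. The Kafka consumer loop itself is left out so the example runs without a broker: in practice each consumed message's value would be passed to the handler.

```python
import json


class ContentStore:
    """In-memory stand-in for the store of ingested page metadata."""

    def __init__(self):
        self._records = {}

    def put(self, url, metadata):
        self._records[url] = metadata

    def purge(self, url):
        # Returns True only if a record was actually removed.
        return self._records.pop(url, None) is not None

    def __contains__(self, url):
        return url in self._records


def handle_takedown_message(raw_message, store):
    """Purge every URL named in one takedown message; return the URLs removed."""
    event = json.loads(raw_message)  # accepts str or UTF-8 bytes
    return [url for url in event.get("urls", []) if store.purge(url)]
```

Because purging a URL that is already gone is a no-op, the handler can safely process the same message more than once.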

Page 28

Option 2 – Active Refetching

[Diagram labels: Espresso Database; HDFS ETL; RefetchURL table (shown twice); Offline job; Refetch messages; Kafka Push job; Refetch process; UPDATE; Takedown Requests for deleted pages]
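
The decision made by the refetch process might look like the sketch below. It assumes that only a 404 (or 410, which is an addition here, not something the talk mentions) counts as "page gone", and that any other non-2xx response is retried later rather than treated as a deletion, to avoid false positives. `RefetchOutcome`, `decide`, and `refetch` are hypothetical names, and `fetch_status` stands in for whatever HTTP client is used.

```python
from enum import Enum


class RefetchOutcome(Enum):
    KEEP = "keep"
    TAKEDOWN = "takedown"
    RETRY = "retry"


def decide(status_code):
    """Map the HTTP status of a refetch to an action."""
    if status_code in (404, 410):
        return RefetchOutcome.TAKEDOWN
    if 200 <= status_code < 300:
        return RefetchOutcome.KEEP
    # Redirects, throttling, and server errors say nothing definite about the page.
    return RefetchOutcome.RETRY


def refetch(urls, fetch_status):
    """Return the URLs for which a takedown request should be issued.

    fetch_status is any callable mapping a URL to an HTTP status code.
    """
    takedowns = []
    for url in urls:
        if decide(fetch_status(url)) is RefetchOutcome.TAKEDOWN:
            takedowns.append(url)
    return takedowns
```

Passing the fetcher in as a callable keeps the decision logic testable without network access.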

Page 29

Option 3 – Whitelist

• Block all 1st party URLs that can’t meet minimal requirements

• Mainly must return a 404 for an invalid or deleted URL

• Ensures new 1st party URLs are onboarded before being ingested

Page 30

Managing PII in Datasets

[Diagram labels: Espresso Database; HDFS ETL; Offline Datasets]

Page 31

Espresso Datasets

What is Espresso?

• LinkedIn distributed NoSQL database

• Data stored in Avro format (JSON)

• Indexed by specific primary key fields

Challenges

• Reference to PII not always in the key

• ETL snapshots of Espresso Dataset become offline Datasets

Page 32

Offline (HDFS) Datasets

Offline Datasets

• Files of Avro (JSON) records

Challenges

• Need to read whole record to see if it has PII

• Files not conducive to removing one record from the middle

• Dataset can be source for downstream jobs that also need to be purged
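
To make the file-rewrite challenge concrete: since a record cannot simply be cut out of the middle of a file, one common approach is to copy the surviving records to a new file and swap it in. The sketch below uses a JSON-lines file as a stand-in for an Avro file on HDFS, which is an assumption made to keep the example runnable with the standard library; `purge_records` is a hypothetical helper, not part of any LinkedIn tool.

```python
import json
import os
import tempfile


def purge_records(path, should_purge):
    """Rewrite a JSON-lines file without the records matching should_purge.

    Returns the number of records dropped.
    """
    dropped = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Write the surviving records to a temporary file in the same directory.
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, encoding="utf-8"
    ) as tmp:
        with open(path, encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue
                # The whole record has to be parsed before it can be inspected.
                record = json.loads(line)
                if should_purge(record):
                    dropped += 1
                else:
                    tmp.write(json.dumps(record) + "\n")
    # Swap the rewritten file in place of the original.
    os.replace(tmp.name, path)
    return dropped
```

Every purge costs a full read and rewrite of the file, so batching many purge requests into a single pass is worthwhile.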

Page 33

Which datasets contain member PII?

Data Discovery

WhereHows

• Data discovery and lineage tool

• Central location for all schema

• Document meanings of each column

• Trace downstream/upstream lineage of datasets

• Tag every column that can contain member reference or PII.

• Open source - github.com/linkedin/wherehows

Page 34

Dali (Data Access at LinkedIn)

• Interface for accessing datasets

• Combines dataset schema with WhereHows metadata

• Defines output virtual dataset while preserving data tags

• Supports defining virtual datasets where PII is excluded or obfuscated

[Diagram labels: Raw Dataset; WhereHows Metadata; Dali Reader]
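
The idea of a virtual dataset that excludes or obfuscates tagged columns while carrying the tags along can be illustrated as follows. This is not Dali's API: `COLUMN_TAGS`, `obfuscate`, and `virtual_view` are hypothetical, the tag names are invented, and the truncated unsalted hash is only a placeholder for whatever obfuscation a real system would apply.

```python
import hashlib

# Hypothetical column tags, standing in for metadata from a data catalog.
COLUMN_TAGS = {
    "url": set(),
    "title": set(),
    "author_name": {"PII"},
    "member_id": {"MEMBER_REFERENCE"},
}


def obfuscate(value):
    """Placeholder obfuscation: a truncated SHA-256 of the value's string form."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]


def virtual_view(rows, tags, mode="exclude"):
    """Return (rows, tags) for a view where tagged columns are excluded or obfuscated."""
    if mode not in ("exclude", "obfuscate"):
        raise ValueError(f"unknown mode: {mode}")
    sensitive = {col for col, col_tags in tags.items() if col_tags}
    out_rows = []
    for row in rows:
        out = {}
        for col, value in row.items():
            if col not in sensitive:
                out[col] = value
            elif mode == "obfuscate":
                out[col] = obfuscate(value)
        out_rows.append(out)
    # Tags travel with every column that survives into the view.
    out_tags = {
        col: col_tags
        for col, col_tags in tags.items()
        if col not in sensitive or mode == "obfuscate"
    }
    return out_rows, out_tags
```

Returning the tags together with the rows is the point of the sketch: a downstream reader of the view still knows which of its columns are sensitive.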

Page 35

Access Control

Only systems that handle PII properly are allowed access

Access Control List (ACL)

• Controls access to PII data to known list of authorized systems

• We only approve access to systems that can handle PII properly

• Ensures that member PII can’t leak into untracked systems/datasets

• Acts as a list of downstream services
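
The dual role of the ACL, gating reads and doubling as the list of downstream systems, can be shown in a few lines. `PiiAccessList` and its methods are hypothetical names for illustration, and the callback-based notification is an assumption; the talk does not describe how the real ACL is implemented.

```python
class PiiAccessList:
    """Registry of systems approved to read PII; also the purge notification list."""

    def __init__(self):
        self._authorized = {}

    def authorize(self, system, on_purge):
        """Approve a system; it must supply a callback to receive purge notices."""
        self._authorized[system] = on_purge

    def read(self, system, dataset):
        """Return the dataset's rows, but only to an approved system."""
        if system not in self._authorized:
            raise PermissionError(f"{system} is not authorized to read PII")
        return list(dataset)

    def notify_purge(self, url):
        """Tell every approved system that data for url was purged."""
        for on_purge in self._authorized.values():
            on_purge(url)
        return sorted(self._authorized)
```

Requiring the purge callback at approval time means no system can read PII without also being reachable when that data has to be removed.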

Page 36

Keeping Track of Personal Information in Babylonia

WhereHows

• Field tagging for fields containing PII

• Know where the PII is

Dali

• Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets

• Keeps tags with the data as it moves from one dataset to another

ACL

• Control the spread of PII data only to authorized readers

• Serves as a list of current downstream systems to notify when data is purged

Page 37

Apache Gobblin

• Framework for transforming large datasets

• Data lifecycle management

• Uses WhereHows tags to identify data in our Espresso or offline datasets that need to be purged

• Open source - gobblin.apache.org

Page 38

WhereHows and Gobblin

Tagging in WhereHows

• Created tags representing ingested content URLs in WhereHows

• Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
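
A toy version of tag-driven purging: once columns are tagged, a generic job can find and remove affected rows in any table without table-specific code. The tag names, `SCHEMA_TAGS`, and `purge_by_tag` are hypothetical and do not reflect the actual WhereHows or Gobblin interfaces.

```python
# Hypothetical schema tags: table -> column -> tag (None means untagged).
SCHEMA_TAGS = {
    "shares": {"shared_url": "INGESTED_CONTENT_URL", "comment": None},
    "previews": {"content_urn": "INGESTED_CONTENT_URN", "thumbnail": None},
}


def purge_by_tag(tables, schema_tags, tag, purged_values):
    """Drop every row whose column tagged `tag` holds one of purged_values.

    Mutates `tables` in place and returns the number of rows removed per table.
    """
    removed = {}
    for table, rows in list(tables.items()):
        tagged_cols = [
            col for col, col_tag in schema_tags.get(table, {}).items() if col_tag == tag
        ]
        kept = [
            row
            for row in rows
            if not any(row.get(col) in purged_values for col in tagged_cols)
        ]
        removed[table] = len(rows) - len(kept)
        tables[table] = kept
    return removed
```

A downstream team onboards by adding a tag to its schema; the purge job itself does not change.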

Page 39

Compliance Comes First

• Choose an implementation where restriction is the default until proven safe

• Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion

• Simplicity of active refetching helps keep the bar low enough to include most content safely

Page 40

Constraints

Bigger Picture

• Added constraints to the system

• Developer restrictions

• Made certain kinds of things harder to do

Page 41

“Constraints can act as guide rails that point a system where you want it to go.”

George Fairbanks

Page 42

Constraints / Guide Rails

Bigger Picture

• A constrained system is easier to predict and control

• Make the wrong things harder to do

• Give guidance to all developers on how things are supposed to be done

Page 43

Manifest Guide Rails in the Code

Bigger Picture

• Constraints should manifest in some explicit way

• Counter-Example: “No backwards incompatible schema changes”

• Hard to tell what developers refrained from doing

• WhereHows, Dali, and ACLs make metadata and the rules explicit and thus easier to perpetuate

Page 44

Architecture Hoisting

Bigger Picture

A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.

Page 45

Architecture Hoisting

Bigger Picture

• Make use of the framework to manage PII

• Requires developers to think about PII concerns up front to access the data

• Once set up, developers can focus less on managing PII because the architecture is handling it

• Users of the framework can automatically benefit from future enhancements

Page 46

Thank you