Post on 22-Jan-2018

Democratizing Data at Airbnb

CHRIS WILLIAMS / JOHN BODLEY / MAY 11, 2017

Airbnb connects people to unique travel experiences

The problem

tribal knowledge |ˈtrībəl ˈnäləj | noun

Tribal knowledge is any unwritten information that is not commonly known by others within a company

Relying on tribal knowledge stifles productivity

As Airbnb grows so do the challenges around the volume, complexity, and obscurity of data

In a large and complex organization, with a sea of data resources, users struggle to find the right data

Data is often siloed, inaccessible, or lacks context

I’m a recovering Data Scientist who wants to democratize data, automate common workflows, surface relevant information, and provide context

200k tables in our Hive data warehouse

> 10,000 Superset charts and dashboards

> 6,000 Experiments and metrics

> 6,000 Tableau workbooks and charts

> 1,500 Knowledge posts

Data resources

Beyond the data warehouse

With many more data sources and data types to love

and most importantly…

> 3,500 Airbnb employees

Portland

San Francisco

Los Angeles

Toronto

New York

Miami

Sao Paulo

Dublin

London

Paris

Barcelona

Berlin

Milan

Copenhagen

New Delhi

Seoul

Beijing

Tokyo

Sydney

Singapore

Washington, DC

> 20 offices around the world

The mandate

To democratize data and empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust

The concept

Search…

It should be fairly evident what we feed into the search indices

But are we missing something?

The relevancy of relationships

Nodes and relationships have equal standing

[Figure: nodes linked by created and consumed relationships]

The graph

[Figure: the graph, with nodes linked by created, consumed, and associated relationships]

The construction

6 Databases

4 APIs

1 Airflow DAG

We leverage all these data resources to build a graph in Hive comprising nodes and relationships

The workflow runs every day, though the graph is left to soak to prevent flickering

Addressing graph flickering

The issue is that certain types of relationships are sporadic in nature, causing the graph to flicker

Persistent vs. transient relationships

Persistent relationships (e.g. created) represent a snapshot in time

Transient relationships (e.g. consumed) represent events which are somewhat sporadic in nature, occurring on some days of the week (M Tu W Th F) and not others
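One way to picture the soak: a transient relationship such as consumed stays in the graph if it was observed on any day inside a trailing window, so a quiet day does not make the edge vanish. The sketch below is an assumption about how such smoothing could work, not the Dataportal implementation; `SOAK_DAYS`, `Observation`, and `soaked_relationships` are hypothetical names, and the window length is not given in the talk.

```python
from datetime import date, timedelta
from typing import Iterable, NamedTuple, Set, Tuple

# Hypothetical window length; the talk does not state the real value.
SOAK_DAYS = 7


class Observation(NamedTuple):
    """One day on which a transient relationship was seen."""
    source: str
    rel_type: str
    target: str
    day: date


def soaked_relationships(
    observations: Iterable[Observation],
    today: date,
    soak_days: int = SOAK_DAYS,
) -> Set[Tuple[str, str, str]]:
    """Return relationships seen at least once in the trailing window."""
    cutoff = today - timedelta(days=soak_days)
    return {
        (o.source, o.rel_type, o.target)
        for o in observations
        if cutoff < o.day <= today
    }


if __name__ == "__main__":
    events = [
        Observation("jane_doe", "consumed", "dim_users", date(2017, 5, 8)),   # Monday
        Observation("jane_doe", "consumed", "dim_users", date(2017, 5, 10)),  # Wednesday
    ]
    # Friday: nothing observed today, but the edge survives inside the window.
    print(soaked_relationships(events, today=date(2017, 5, 12)))
```

Persistent relationships such as created would bypass this filter entirely, since each daily snapshot already contains them.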

The winding data path

Airflow: Data transfer

Python: Graph datastore

neo4j-driver: Python Neo4j driver

Neo4j: Graph database

GraphAware: Neo4j/Elasticsearch plugin

Elasticsearch: Search engine

Flask: Python web framework

Hive: Data warehouse

Why we chose Neo4j for our database

The four main reasons

Logical: Given our data is represented as a graph, it is logical to use a graph database to store the data

Nimble: Performance wins when dealing with connected data versus relational databases

Popular: It is the world’s leading graph database and the community edition is free

Integrative: It integrates well with Python and Elasticsearch

The Neo4j and Elasticsearch symbiotic relationship

Courtesy of two GraphAware plugins

Neo4j plugin: Provides bi-directional integration which transparently and asynchronously replicates data from Neo4j to Elasticsearch

Elasticsearch plugin: Enables Elasticsearch to consult the Neo4j database during a search query to enrich the search rankings by leveraging the graph topology

Node label hierarchy

:Entity

    :Org
        :Group
        :User

    :Tableau
        :Workbook
        :Chart

    :Hive
        :Schema
        :Table

(:Entity:Org:User {id: 'jane_doe'})

(:Entity:Hive:Table {id: 'dim_users'})

(:Entity:Tableau:Chart {id: '12345'})

MATCH (n:Entity:Org:User {id: '<id>'}) USING INDEX n:User(id) RETURN n
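The lookup above can be generated for any label path. The following is a minimal sketch, assuming the `$id` parameter syntax of Neo4j 3.x Cypher; `match_by_local_id` is a hypothetical helper, not part of the Dataportal code. Labels cannot be passed as Cypher parameters, so the sketch validates them before interpolating.

```python
from typing import Dict, Sequence, Tuple


def match_by_local_id(labels: Sequence[str], node_id: str) -> Tuple[str, Dict[str, str]]:
    """Build a parameterized Cypher lookup for a node given its label path and id.

    The index hint targets the most specific (last) label, mirroring the
    query on the slide.
    """
    if not labels or not all(label.isidentifier() for label in labels):
        raise ValueError(f"invalid labels: {labels!r}")
    label_path = ":".join(("Entity", *labels))
    leaf = labels[-1]
    query = (
        f"MATCH (n:{label_path} {{id: $id}}) "
        f"USING INDEX n:{leaf}(id) "
        "RETURN n"
    )
    return query, {"id": node_id}


if __name__ == "__main__":
    query, params = match_by_local_id(["Org", "User"], "jane_doe")
    print(query)
    print(params)
```

The returned pair is shaped so that it could be handed to a driver session's `run(query, params)` call; the id only ever travels as a parameter.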

From local to global uniqueness

A mechanism to reference nodes in an abstract manner

GraphAware UUID plugin: Transparently assigns a globally unique UUID property to newly created elements (nodes and relationships) which cannot be changed or deleted

Globally unique: Enables us to uniquely identify a single node via the Entity label and UUID property, which allows for parameterized queries and in turn leads to faster query and execution times

MATCH (n:Entity {uuid: '<uuid>'}) USING INDEX n:Entity(uuid) RETURN n

/api/graph/nodes/org/user/<id>

/api/graph/nodes/<uuid>

/api/graph/relationships/<uuid>/created/<uuid>
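To illustrate how the UUID-based routes could map onto parameterized Cypher, here is a small sketch. It is an assumption about the translation layer, not the actual Flask application: `route_to_query` and the regular expressions are hypothetical, and only the two UUID routes are covered.

```python
import re
from typing import Dict, Optional, Tuple

Query = Tuple[str, Dict[str, str]]

_UUID = r"[0-9a-fA-F-]{36}"
_NODE = re.compile(rf"^/api/graph/nodes/(?P<uuid>{_UUID})$")
_REL = re.compile(
    rf"^/api/graph/relationships/(?P<source>{_UUID})/(?P<type>[a-z_]+)/(?P<target>{_UUID})$"
)


def route_to_query(path: str) -> Optional[Query]:
    """Translate a UUID-based API path into a Cypher query and its parameters."""
    match = _NODE.match(path)
    if match:
        return (
            "MATCH (n:Entity {uuid: $uuid}) USING INDEX n:Entity(uuid) RETURN n",
            {"uuid": match.group("uuid")},
        )
    match = _REL.match(path)
    if match:
        # Relationship types cannot be parameterized in Cypher; the regex
        # above restricts them to lowercase letters and underscores.
        rel_type = match.group("type")
        return (
            f"MATCH (s:Entity {{uuid: $source}})-[r:{rel_type}]->"
            "(t:Entity {uuid: $target}) RETURN r",
            {"source": match.group("source"), "target": match.group("target")},
        )
    return None  # not a UUID-based route


if __name__ == "__main__":
    print(route_to_query("/api/graph/nodes/123e4567-e89b-12d3-a456-426614174000"))
```

Because every node shares the Entity label, a single query shape serves all node types, which is what makes the parameterized form possible.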

The frontend

web app

Designing the interface and user experience of a data tool should not be an afterthought

User personas

Daphne Data: Technical data power user; the epitome of a tribal knowledge holder

Manager Mel: Less data literate; needs to keep tabs on her team’s resources

Nathan New: New employee, new team, or new to data; has no idea what’s going on

Designing for data exploration, discovery, and trust

Search, Resource details & metadata, User data, Group data, Company data

Google-esque search filters

Resource details & metadata

Context, context, & context

Surface relationships, everything’s a link to promote exploration

Metadata & consumption

Description, external link, social

Column details & value distributions

Table lineage

Enrich metadata on the fly

User details & metadata

What they make, what they consume

Former employees also hold tribal knowledge

Group overview

Thumbnails for maximum context

Basic organization functionality

Pinterest-like curation & suggested content

We gather over 15,000 thumbnails from Tableau, Superset, and the Knowledge Repo

Pinning flow from resource page

Edit mode / draggable grid

Employees can feel disconnected from Company-level metrics

The technology stack

Application + dependencies

DOM Testing

eslint enzyme mocha

chai

Application state

Styling

khan/aphrodite

The challenges

Proxy nodes: Abstracting complexity where necessary while accurately modeling the data ecosystem

Graph merging: Non-trivial Git-like merging of graph updates

Data-dense design: Balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps

Complex dependencies: An umbrella data tool is vulnerable to changes in upstream resource dependencies
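The Git-like merging mentioned above can be pictured as computing a change set between the previous and the new graph snapshot, then applying only the differences. The sketch below covers the diffing step for nodes only and is an assumption for illustration; `GraphDiff` and `diff_nodes` are hypothetical names, and the real merge logic is not described in the talk.

```python
from typing import Any, Dict, NamedTuple, Set, Tuple

# A node key is its label path plus local id, e.g. ("Org", "User", "jane_doe").
Key = Tuple[str, ...]


class GraphDiff(NamedTuple):
    added: Set[Key]
    removed: Set[Key]
    changed: Set[Key]


def diff_nodes(old: Dict[Key, Dict[str, Any]], new: Dict[Key, Dict[str, Any]]) -> GraphDiff:
    """Compare two snapshots that map node key to properties."""
    old_keys, new_keys = set(old), set(new)
    return GraphDiff(
        added=new_keys - old_keys,
        removed=old_keys - new_keys,
        changed={key for key in old_keys & new_keys if old[key] != new[key]},
    )


if __name__ == "__main__":
    yesterday = {
        ("Org", "User", "jane_doe"): {"team": "data"},
        ("Hive", "Table", "dim_users"): {"rows": 10},
    }
    today = {
        ("Hive", "Table", "dim_users"): {"rows": 12},
        ("Tableau", "Chart", "12345"): {},
    }
    print(diff_nodes(yesterday, today))
```

Applying a diff rather than rebuilding the graph leaves untouched nodes in place, along with properties such as their UUIDs.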

The future

Gamification: Provide content producers with a sense of value

Alerts & recommendations: Move from active exploration to delivering relevant updates and content suggestions

Certified content: Use certification to build trust and enable users to filter through a sea of stale content

Network analysis: Determine obsolete nodes, critical paths, lines of communication, etc.

The team

The Dataportal team

Analytics & Experimentation Products

John Bodley, Software Engineer

Eli Brumbaugh, Experience Designer

Jeff Feng, Product Manager

Michelle Thomas, Software Engineer

Chris Williams, Data Visualization

Thank you

Appendix

Naturally bidirectional relationships

Dealing with mutual relationships

Modeling both directions creates an unnecessary relationship

The most efficient solution is to use a single relationship in the many-to-one direction
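As an illustration of the single-relationship rule, the sketch below collapses a pair of mutual associated edges into one edge that points from the many side to the one side. The `MANY_TO_ONE` label pairs are hypothetical examples chosen for the sketch; the talk does not list which pairings exist.

```python
from typing import Iterable, Set, Tuple

Node = Tuple[str, str]    # (label, id), e.g. ("User", "jane_doe")
Edge = Tuple[Node, Node]  # (source, target) of an "associated" relationship

# Hypothetical (many, one) label pairs; the real pairings are not given in the talk.
MANY_TO_ONE = {("User", "Group"), ("Table", "Schema")}


def canonical_associations(edges: Iterable[Edge]) -> Set[Edge]:
    """Collapse mutual edges into one edge pointing from the many side to the one side."""
    canonical: Set[Edge] = set()
    for source, target in edges:
        if (source[0], target[0]) in MANY_TO_ONE:
            canonical.add((source, target))
        elif (target[0], source[0]) in MANY_TO_ONE:
            canonical.add((target, source))
        else:
            raise ValueError(f"no many-to-one rule for {source[0]} and {target[0]}")
    return canonical


if __name__ == "__main__":
    jane = ("User", "jane_doe")
    group = ("Group", "data_science")
    # Both directions go in, a single User -> Group edge comes out.
    print(canonical_associations([(jane, group), (group, jane)]))
```

Queries that need the reverse direction can still match the relationship without specifying a direction, so nothing is lost by storing one edge.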

CREATE TABLE nodes (
  labels ARRAY<STRING>,
  id STRING,
  properties STRING
);

{ labels: ['Org', 'User'], id: 'jane_doe' }

{ labels: ['Hive', 'Table'], id: 'dim_users' }

{ labels: ['Tableau', 'Chart'], id: '12345' }

CREATE TABLE relationships (
  source STRUCT<labels:ARRAY<STRING>, id:STRING>,
  target STRUCT<labels:ARRAY<STRING>, id:STRING>,
  type STRING,
  properties STRING
);
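Rows for these two tables could be assembled as follows. This is a minimal sketch: `node_row` and `relationship_row` are hypothetical helpers, and serializing `properties` as a JSON string is an assumption (the schema only says the column is a STRING).

```python
import json
from typing import Any, Dict, List, Optional

Properties = Optional[Dict[str, Any]]


def node_row(labels: List[str], node_id: str, properties: Properties = None) -> Dict[str, Any]:
    """Build a row for the nodes table."""
    return {
        "labels": list(labels),
        "id": node_id,
        # Assumption: properties are stored as a JSON-encoded string.
        "properties": json.dumps(properties or {}, sort_keys=True),
    }


def relationship_row(
    source: Dict[str, Any],
    target: Dict[str, Any],
    rel_type: str,
    properties: Properties = None,
) -> Dict[str, Any]:
    """Build a row for the relationships table from two node rows."""
    return {
        # Only labels and id identify an endpoint, matching the STRUCT columns.
        "source": {"labels": source["labels"], "id": source["id"]},
        "target": {"labels": target["labels"], "id": target["id"]},
        "type": rel_type,
        "properties": json.dumps(properties or {}, sort_keys=True),
    }


if __name__ == "__main__":
    jane = node_row(["Org", "User"], "jane_doe")
    chart = node_row(["Tableau", "Chart"], "12345", {"title": "Bookings"})
    print(relationship_row(jane, chart, "created"))
```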

Efficient data retrieval

Restrictions and workarounds with Neo4j indexes

Problem: Indexes provide for efficient data retrieval similar to an RDBMS primary key; however, they are only defined for a single label as opposed to our tuple of hierarchical labels

Solution: Create an index for every label keyed by the ID and UUID properties which, in addition to index hints, provides optimal node retrieval
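The per-label indexes could be generated rather than written by hand. The sketch below emits `CREATE INDEX ON :Label(property)` statements, which is the Neo4j 3.x syntax current at the time of the talk (later versions use a different form); `index_statements` is a hypothetical helper and the label list is taken from the node label hierarchy.

```python
from typing import Iterable, List, Sequence

# Labels from the node label hierarchy shown in the talk.
LABELS = [
    "Entity", "Org", "Group", "User",
    "Tableau", "Workbook", "Chart",
    "Hive", "Schema", "Table",
]


def index_statements(labels: Iterable[str], properties: Sequence[str] = ("id", "uuid")) -> List[str]:
    """Return one CREATE INDEX statement per (label, property) pair."""
    unique = sorted(set(labels))
    for label in unique:
        if not label.isidentifier():
            raise ValueError(f"invalid label: {label!r}")
    return [
        f"CREATE INDEX ON :{label}({prop})"
        for label in unique
        for prop in properties
    ]


if __name__ == "__main__":
    for statement in index_statements(LABELS):
        print(statement)
```

With an index on each label, a query can hint at whichever label is most selective, as in the `USING INDEX n:User(id)` example.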
