ml and data science at uber - gitpro talk 2017

ML and Data Science at Uber Sudhir Tonse, Engineering Lead, Uber FEB 18, 2017 GITPro 2017

Upload: sudhir-tonse

Post on 12-Apr-2017


TRANSCRIPT

ML and Data Science at Uber

Sudhir Tonse, Engineering Lead, Uber

FEB 18, 2017

GITPro 2017

Where do we want to go today?

Agenda: Hop on the Uber ML Ride … destination please?

• Introduction: Who am I and what are we talking about today?

• Problem Space: Why does Uber need ML and what are some of the problems we tackle?

• Tools of the Trade: What does Uber’s tech stack look like?

• Challenges & Opportunities: Challenges likely unique to Uber .. interesting opportunities

Uber, this talk and me the speaker

Introduction

Who am I?

• Engineering Leader @ Uber

• Marketplace Data

• Realtime Data Processing

• Analytics

• Forecasting

• Previous -> MicroServices/Cloud Platform at Netflix

• Twitter @stonse

Introduction: Uber

Marketplace: Uber’s logistic platform

• Driver Partner: Our partner in the ride sharing business

• Riders: Folks like you and me who request a ride on any of Uber’s transportation products, e.g. UberX, uberPool

• Merchants: Restaurants or shops that have signed on to the Uber platform.

Uber Mission: “Transportation as reliable as running water, everywhere, for everyone”

• Mapping (Routes, ETAs, …)

• Fraud and Security

• uberEATS Recommendations

• Marketplace Optimizations

• Forecasting

• Driver Positioning

• Health, Trends, Issues, ...

• And more …

ML Problems: Why do we need Machine Learning?

ETA, Route Optimization, Pickup Points, Pool rider matches

Marketplace

Build the platform, products, and algorithms responsible for the real time execution and online optimization of Uber's marketplace.

We are building the brain of Uber, solving NP-hard problems and economic optimization problems at scale.

Uber | Marketplace Mission

Request Event

Driver Accept Event

Trip Started Event

more events …

Overall Flow

Match Services

Trip States

Scale

~400 Cities

Many Billion Events per Day

Scale

Geo Space

Vehicle Types Time

• Indexing, Lookup, Rendering

• Symmetric Neighbors

• Convex & Compact Regions

• Equal Areas

• Equal Shape

Space -> Hexagons
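The properties listed above (indexing, lookup, symmetric neighbors) can be illustrated with a small hexagonal grid. This is only a sketch on a flat plane using axial coordinates; the function names, the pointy-top orientation, and the `size` parameter are assumptions made for illustration and do not describe Uber's actual geo-indexing code.

```python
import math

# Six axial-coordinate offsets; every hexagon has exactly these neighbors.
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def point_to_hex(x, y, size):
    """Map a planar point to the axial (q, r) index of a pointy-top hexagon.

    `size` is the distance from a hexagon's center to one of its corners.
    """
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _round_axial(q, r)


def _round_axial(q, r):
    # Round in cube coordinates (cx + cy + cz = 0), then recompute the
    # component with the largest rounding error so the constraint still holds.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy <= dz:
        rz = -rx - ry
    return (rx, rz)


def neighbors(h):
    """Return the six hexagons adjacent to h."""
    return [(h[0] + dq, h[1] + dr) for dq, dr in DIRECTIONS]


def hex_distance(a, b):
    """Number of hexagon steps between a and b."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2
```

Unlike a square grid, where diagonal neighbors are farther away than edge neighbors, all six neighbors here are one step away, which is one reading of the "Symmetric Neighbors" bullet.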

Granular Data

Multi-resolution Realtime Forecasting, Airport ETR

ML Examples

Real-time spatiotemporal forecasting at a variable resolution of time and space

Example 1

Rider Demand Forecasting: Predict # of Riders per hexagon for various time horizons

Spatial granularity & Multiresolution Forecasting

The more you aggregate or zoom out, trends emerge

Sparsity at hexagon level: many hexagons have little signal

1. Forecast at the hex-cluster level

2. Using past activity for a similar time window, apportion out total activity from the hex-cluster to its component hexagons

Multiresolution Forecasting: Forecasting at different spatial granularity
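The second step of the two-step approach above (apportioning a hex-cluster forecast to its component hexagons) might look roughly like the sketch below. The function name `apportion`, the dictionary layout, and the even-split fallback for clusters with no history are assumptions for illustration; the slides do not specify how the apportioning is implemented.

```python
def apportion(cluster_forecast, past_activity):
    """Split a cluster-level forecast across hexagons by historical share.

    cluster_forecast: forecast total for the whole hex-cluster.
    past_activity: dict mapping hexagon id -> activity count observed in a
        similar past time window.
    """
    total = sum(past_activity.values())
    if total == 0:
        # Assumed fallback: with no history at all, split evenly.
        even = cluster_forecast / len(past_activity)
        return {hexagon: even for hexagon in past_activity}
    return {
        hexagon: cluster_forecast * count / total
        for hexagon, count in past_activity.items()
    }


# A cluster forecast of 100 riders, where hex_a historically saw 3x the
# activity of hex_b in the same kind of time window.
shares = apportion(100.0, {"hex_a": 30, "hex_b": 10})
```

Forecasting at the cluster level sidesteps the sparsity of individual hexagons, while the historical shares restore hexagon-level detail.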

Airport ETR

ML Example No. 2

Airport Taxi Line Uber Airport Lot

Flight Arrival (t1) Client Eyeball (t2) Pickup Request (t3)

Airport Demand (ETR)

Mean Delay ~30 minutes

Half Life ~1.0 minute

“ETR too much. I bail out ..”

Solution: Time Meter Banner

“Only about 20 minutes. I would wait!”

20 minutes wait to get a $40 trip, oh yeah!

Data Science Flow: A Typical Data Scientist Workflow

• Analyze/Prepare: Data exploration, cleansing, transformations etc.

• Feature Selection: Evaluate strength of various signals

• Model Fitting: Use Python/R etc. to fit Model.

• Evaluation: Evaluate Model Performance

• Storage: Store Model with versioning

• Serving/Dissemination: Apply Model and serve predictions

• Monitoring: Evaluate Runtime Performance

Data Preparation: A Typical Data Scientist Workflow

• Analyze/Prepare: Data exploration, cleansing, transformations etc.

Data Processing

Data Science Flow: A Typical Data Scientist Workflow

• Feature Selection: Evaluate strength of various signals

• Model Fitting: Use Python/R etc. to fit Model.

• Evaluation: Evaluate Model Performance

• Storage: Store Model with versioning

Data Scientists (Analytics)
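The "Store Model with versioning" step can be pictured with a toy example. This is a minimal in-memory sketch; the `ModelStore` class and its `save`/`load` methods are hypothetical names invented here, and the talk does not describe how Uber actually stores models.

```python
import pickle


class ModelStore:
    """In-memory stand-in for a versioned model store (hypothetical API)."""

    def __init__(self):
        self._blobs = {}   # (name, version) -> pickled model bytes
        self._latest = {}  # name -> most recent version number

    def save(self, name, model):
        """Store a new version of the model and return its version number."""
        version = self._latest.get(name, 0) + 1
        self._blobs[(name, version)] = pickle.dumps(model)
        self._latest[name] = version
        return version

    def load(self, name, version=None):
        """Load a specific version, or the latest one if none is given."""
        if version is None:
            version = self._latest[name]
        return pickle.loads(self._blobs[(name, version)])


store = ModelStore()
v1 = store.save("rider_demand", {"slope": 1.0})
v2 = store.save("rider_demand", {"slope": 2.0})
```

Keeping every version addressable means a serving system can pin a known-good model or roll back if a newer one performs worse at runtime.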


Overview

Streamline the forecasting process from conception to production

• Streams w/ flexible geo-temporal resolution

• Valuable external data feeds

• Modular, reusable components at each stage

• Same code for offline model fitting and production to enable fast model iteration

Operators & Computation DAGs

Feature Generation

Online Models

Offline Model Fitting

Predictions, Metrics & Visualizations

External Data Streams

Airport feed

Weather feed

Concerts feed

Realtime Models

- Something happened at a time and a place. Now we will evaluate the DAG

- DAG evaluated for a single instant in time
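The idea of evaluating a DAG of operators for a single instant in time can be sketched as follows. The operator names, the event fields, and the `evaluate_dag` helper are all hypothetical; this only illustrates dependency-ordered evaluation using Python's standard `graphlib` module (Python 3.9+), not the framework described in the talk.

```python
from graphlib import TopologicalSorter


def evaluate_dag(operators, deps, event):
    """Evaluate every operator once, in dependency order, for one event.

    operators: dict mapping node name -> function of its upstream outputs.
    deps: dict mapping node name -> tuple of upstream node names.
    event: the single (time, place) observation that triggers evaluation;
        it is exposed to operators as the source node "event".
    """
    results = {"event": event}
    for name in TopologicalSorter(deps).static_order():
        if name == "event":
            continue
        inputs = [results[upstream] for upstream in deps[name]]
        results[name] = operators[name](*inputs)
    return results


# Hypothetical feature operators for one hexagon at one instant.
operators = {
    "demand": lambda e: e["requests"],
    "supply": lambda e: e["open_cars"],
    "demand_supply_ratio": lambda d, s: d / max(s, 1),
}
deps = {
    "demand": ("event",),
    "supply": ("event",),
    "demand_supply_ratio": ("demand", "supply"),
}
event = {"hexagon": "hex-42", "time": 1487400000, "requests": 12, "open_cars": 4}
features = evaluate_dag(operators, deps, event)
```

Because operators only see their declared inputs, the same operator definitions could in principle be replayed over historical events for offline fitting and run on live events in production.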

real-time spatiotemporal forecasting at a variable resolution of time and space

Under the hood ..

Tools & Framework

• Curated set of algorithms

• Model Versioning

• Model Performance & Visualizations

• Automated Deployment Workflow

• …

Machine Learning as a Service: ML workflow at Uber

Open Source Technologies

Samza

• Well integrated with Kafka

• Built in State Management

• Built in Checkpointing

Spark Streaming

• Micro Batch based processing

• Good integration with HDFS & S3

• Exactly once semantics

Distributed Indexes & Queries; Versatile aggregations

Jupyter/IPython

• Great community support

• Data Scientists familiar with Python

..

Challenges & Opportunities

• What’s the best model for integrating vast amounts of disparate kinds of information over space and time?

• What’s the best way of building spatiotemporal models in a fashion that is effective, elegant, and debuggable?

• About 100 or so more … :-)

ML Problems: Challenges

Happy to discuss design/architecture

Q & A

No product/business questions please :-)

@stonse


Sudhir Tonse

@stonse

Thank you