data science at trainline for smarter journeys


Upload: marco-rossetti

Post on 13-Feb-2017


Data Science at Trainline for Smarter Journeys

London, 22/11/2016

@DataScienceFest

@TrainlineTalent

Outline
• A bit about Trainline.
• Cloud-based serverless architecture for Big Data.
• Case Study: BusyBot
• Other Case Studies

John Telford, Head of Data Architecture. Leading the adoption of Big Data technology at Trainline. Manages a team of Data Engineers and Database Administrators. Previously worked on Data Warehousing and Big Data at Channel 4. Computer Science degree from Brunel University. Twitter: @jtelford1

Marco Rossetti, Senior Data Scientist. Leading personalisation initiatives, like providing context-aware personalised services, journey recommendations, and tailored travel options. Previously worked on recommender systems for researchers at Mendeley. He has a PhD in Computer Science from University of Milan-Bicocca. Twitter: @ross85

Trainline - Smarter Journeys
Help our customers save:
• Time (no more queuing for tickets at station)
• Money (book early, find cheap tickets)
• Energy (remove complexity)

Headlines...
• We process more than £2.3 billion in ticket sales annually.
• 100,000 smarter journeys every single day.
• 44 train companies, across 24 European countries.
• ~400 employees (London, Edinburgh, Paris).
• More than 30m visits per month.
• 1 ticket sold every three seconds.

Trainline takeover of Kings X, Oct 2016.


Bob's cloud laws
It’s cloud if…
1. It offers self provisioning.
2. It offers pay-as-you-go pricing.
3. It is, for all intents and purposes, infinitely scalable.

Thus, no need for support from the provider for set-up, no upfront payments for licences or minimum term agreements, and no constraints on what I can do!

• Hosting is not cloud.
• BYO licensing is not cloud.

From servers... to serverless

Servers = Pets

Virtual Machines = Cattle

Containers & Serverless = Herds

Trainline policy:
• Use PaaS wherever possible,
• Use Serverless wherever possible,
... so long as they are good enough.

Data Gateway

Data Platform

Lessons: Lambda
• Effortless scaling; we often have > 100 λs running at once.
• Warm-up time.
• Choose language / framework carefully.
• Consequences of 'freeze'.
• Monitoring – single thread.

Google "Trainline Engineering Lambda"
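The warm-up and 'freeze' points can be illustrated with a minimal sketch of a Python Lambda handler fed by a Kinesis stream. The slides do not show Trainline's code or say which language was chosen, so this is only an assumption-laden illustration: `load_reference_data`, `_CACHE` and the returned dictionary are hypothetical. Module-level code runs once per container during the cold start, and its state survives while the container is frozen between invocations.

```python
import base64
import json
import time


# Hypothetical stand-in for expensive setup (clients, lookup tables, models).
def load_reference_data():
    return {"loaded_at": time.time()}


# Runs once per container, at cold start; reused by warm invocations.
_CACHE = load_reference_data()


def handler(event, context):
    """Decode the Kinesis records delivered to one Lambda invocation.

    Lambda passes Kinesis payloads base64-encoded under
    event["Records"][i]["kinesis"]["data"].
    """
    decoded = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))
    return {
        "count": len(decoded),
        "cache_loaded_at": _CACHE["loaded_at"],  # unchanged while the container stays warm
        "items": decoded,
    }
```

Because `_CACHE` outlives a single call, anything stored there may be stale after a long freeze, which is one plausible reading of the 'freeze' bullet.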

[Figure: service time distribution; axis: execution time (ms)]

Lessons: Kinesis Streams
• TCO is generally low.
• But... understand costs, related to capacity of stream (number & size of messages), time-to-live, etc.
• Monitoring / alerting... CloudWatch is (probably) not enough.
• Compress & encrypt?

Google "AWS Overview of Security Processes"
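Since stream cost depends on the number and size of messages, one option behind "Compress?" is to batch records and gzip them before they are put on the stream. The sketch below uses only the Python standard library; `pack_records` and `unpack_records` are hypothetical helper names, the sample fields are borrowed from the feedback record shown later in the deck, and encryption is not covered.

```python
import gzip
import json


def pack_records(records):
    """Serialise a batch of dicts as JSON lines and gzip the result."""
    raw = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
    return gzip.compress(raw.encode("utf-8"))


def unpack_records(blob):
    """Inverse of pack_records: gunzip and parse each JSON line."""
    raw = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in raw.split("\n") if line]


if __name__ == "__main__":
    batch = [
        {"train_origin": "NRC", "train_destination": "RDG", "customer_got_seat": i % 2}
        for i in range(200)
    ]
    plain_size = len("\n".join(json.dumps(r) for r in batch).encode("utf-8"))
    blob = pack_records(batch)
    print(f"plain={plain_size} bytes, gzipped={len(blob)} bytes")
```

Compression pays off mainly for batches; a single small record can grow once the gzip header is added.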


BusyBot

[Figure: bar chart "Unhappy customers", x-axis from 0% to 70%. Categories: Delays, Overcrowding, Value for money, Toilet Facilities, Luggage Space, Availability of staff, Car Parking.]

Source : National Rail Passenger Survey (NRPS) 2015


Google "BusyBot overcrowding"

BusyBot Discovery
Data from March to May: approx. 100k feedback responses from our Android users.

Infrastructure – Data Gateway

Feedback collection

Daily Enrichment

{
  "train_destination": "RDG",
  "retail_train_number": "GW2980",
  "train_origin": "NRC",
  "train_date": "2016-08-08T07:38:00.000Z",
  "customer_longitude": 0,
  "train_hashid": "NRC:RDG:08/08/2016 08:38:00:GW2980",
  "customer_location_on_train": "Back",
  "customer_hashid": "…",
  "customer_got_seat": 1,
  "customer_feedback": "Yes",
  "feedback_type": 1,
  "customer_latitude": 0,
  "feedbackid": "…",
  "device_id": "…",
  "timestamp": "2016-08-08T07:41:39.390Z",
  "customer_id": "…"
}
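As a minimal sketch of working with a feedback record like the one above, the snippet below parses its two ISO-8601 timestamps and measures how long after `train_date` the feedback arrived. Treating `train_date` as the departure time is an assumption, and `parse_utc` and `seconds_after_train_date` are hypothetical helper names.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_utc(value):
    """Parse timestamps like '2016-08-08T07:41:39.390Z' as aware UTC datetimes."""
    return datetime.strptime(value, ISO_FORMAT).replace(tzinfo=timezone.utc)


def seconds_after_train_date(record):
    """Seconds between the record's train_date and its feedback timestamp."""
    delta = parse_utc(record["timestamp"]) - parse_utc(record["train_date"])
    return delta.total_seconds()


# Only the fields needed here, copied from the sample record.
sample = {
    "train_date": "2016-08-08T07:38:00.000Z",
    "timestamp": "2016-08-08T07:41:39.390Z",
}
print(seconds_after_train_date(sample))
```

For the sample record the gap is a little over three and a half minutes.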


[Figure panels: All feedbacks; ≥ 100 feedbacks; ≥ 1000 feedbacks. Scale: 0% with a seat to 100% with a seat. Annotation: City Thameslink: 50%.]

Infrastructure – Data Platform
Model Building and Validation Service

route-origin  route-destination  stop  customer-location-train  percentage-who-got-seat  feedback-count
EUS           MAN                EUS   middle                   0.738059701              4020
EUS           BHM                EUS   middle                   0.63788222               3532
KGX           LDS                KGX   middle                   0.704984154              3471
BHM           EUS                BHM   middle                   0.679082241              3356
KGX           EDB                KGX   middle                   0.5589236                3233
EUS           GLC                EUS   middle                   0.676663543              3201
MAN           EUS                MAN   middle                   0.769495772              3193
PAD           SWA                PAD   middle                   0.608086078              3067
EUS           BHM                EUS   front                    0.672365666              2866
EUS           MAN                EUS   front                    0.790479625              2773
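A table like this could be derived from the raw feedback records by a simple group-and-count. The following is a sketch under assumptions, not Trainline's model-building service: it groups on the three fields visible in the sample feedback record and ignores the `stop` column, which that record does not show. The function name `seat_rates` is hypothetical.

```python
from collections import defaultdict


def seat_rates(feedbacks):
    """Group feedback by (origin, destination, location on train).

    Returns {key: (percentage_who_got_seat, feedback_count)}.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [seats, count]
    for fb in feedbacks:
        key = (
            fb["train_origin"],
            fb["train_destination"],
            fb["customer_location_on_train"].lower(),
        )
        if fb["customer_got_seat"] == 1:
            totals[key][0] += 1
        totals[key][1] += 1
    return {key: (seats / count, count) for key, (seats, count) in totals.items()}


# Made-up records for illustration only.
feedbacks = [
    {"train_origin": "EUS", "train_destination": "MAN",
     "customer_location_on_train": "Middle", "customer_got_seat": 1},
    {"train_origin": "EUS", "train_destination": "MAN",
     "customer_location_on_train": "Middle", "customer_got_seat": 0},
    {"train_origin": "EUS", "train_destination": "MAN",
     "customer_location_on_train": "Front", "customer_got_seat": 1},
]
for key, (rate, count) in sorted(seat_rates(feedbacks).items()):
    print(key, rate, count)
```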

{
  "retailTrainIdentifier": "VT7280",
  "isBusy": false,
  "callingPoints": [
    {
      "stationCode": "EUS",
      "coaches": [
        {"position": "Back", "recommend": true},
        {"position": "Front", "recommend": false},
        {"position": "Middle", "recommend": false}
      ]
    },
    {
      "stationCode": "MKC",
      "coaches": [
        {"position": "Back", "recommend": false},
        …
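A client could read a response of this shape as sketched below. The field names come from the slide, but the payload is abridged to the one calling point shown in full, and `recommended_positions` is a hypothetical helper rather than part of any Trainline API.

```python
import json


def recommended_positions(response, station_code):
    """Return coach positions flagged recommend=true at one calling point."""
    for point in response.get("callingPoints", []):
        if point.get("stationCode") == station_code:
            return [c["position"] for c in point.get("coaches", []) if c.get("recommend")]
    return []


# Abridged to the calling point that the slide shows in full.
sample = json.loads("""
{
  "retailTrainIdentifier": "VT7280",
  "isBusy": false,
  "callingPoints": [
    {"stationCode": "EUS", "coaches": [
      {"position": "Back", "recommend": true},
      {"position": "Front", "recommend": false},
      {"position": "Middle", "recommend": false}
    ]}
  ]
}
""")
print(recommended_positions(sample, "EUS"))  # ['Back']
```

An unknown station code yields an empty list, so callers can fall back to showing no recommendation.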

Data Validation
• At least N feedbacks
• At least feedbacks for D days
• CI on the percentage who got a seat <= p
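The three validation rules can be sketched as a single check. The slides give N, D and p only as symbols and do not say how the confidence interval is computed, so the snippet makes several assumptions: a 95% normal-approximation interval, the CI rule read as "half-width no larger than p", and the day rule read as "feedback on at least D distinct days". All function names, parameter names and thresholds are hypothetical.

```python
import math


def ci_half_width(rate, count, z=1.96):
    """Half-width of a normal-approximation CI for a proportion."""
    if count <= 0:
        return math.inf
    return z * math.sqrt(rate * (1.0 - rate) / count)


def is_valid(rate, count, days_with_feedback, min_count, min_days, max_half_width):
    """Apply the three rules: enough feedback, enough days, tight enough CI."""
    return (
        count >= min_count
        and days_with_feedback >= min_days
        and ci_half_width(rate, count) <= max_half_width
    )


# First row of the table above, with made-up thresholds and day count.
print(is_valid(0.738059701, 4020, days_with_feedback=60,
               min_count=100, min_days=30, max_half_width=0.05))
# A sparse route fails the same thresholds.
print(is_valid(0.5, 20, days_with_feedback=5,
               min_count=100, min_days=30, max_half_width=0.05))
```

A Wilson interval would behave better for rates near 0% or 100%; the normal approximation is used here only to keep the sketch short.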


Journey Results

Live Tracker

BusyBot V1, Sep 2016

Coming soon…

Hotels

Journey Recommendations

Search Prediction

Summary
• BusyBot
• Hotels
• Journey Recommendations
• Search Prediction
• Delays
• Prices
• Real Time Information
• Personalisation
• ….

Any Questions?

(we are hiring!)

Data Scientist positions: [email protected]
Data Engineer positions: [email protected]