data science at scale @ barricade.io

Upload: david-coallier

Post on 25-Jan-2017

Category: Data & Analytics


TRANSCRIPT

Data Science @ Scale

@davidcoallier: Part of an amazing team at Barricade.io

Data Science is Hard

Data Hacking is “Easy”

Data Analysis is “Easy”

Data Expertise is “Easy”

Got all? Having all three is real hard!

Is that it? Well, don’t forget your purpose.

You are not an economist. /ɪˈkɒnəmɪst/: Someone with all the answers, and none of the questions.

The Data Scientific Method

Find a question.

Use the data you have

Features & Tests

Analyse Results: You will be sad.

Conversate: Talk about your findings.

Good Chats: Imply egoless and collaborative data scientists.

Recap.

1. Hacking
2. Maths & Stats
3. Expertise

And

1. Question
2. Be Pragmatic
3. Features
4. Analyse
5. Share

A team! Rarely a single-person effort.

An Example: Fraud Prevention - Business Prevention

I knew better. Obviously… duh

We didn’t share. Science has historically been shared.

Not with p-values

Empathise. Use human language, not lingo.

For us at Barricade

Doing this at scale is hard.

We’re still small: About a billion data points a day.

Humble Beginnings: Typically… a Queue and an API.

This had issues. Hard to scale, hard to decouple, etc.

Enter the Lambda Architecture.

Speed Layer

Batch Layer

Speed Layer: U new behaviour from new data

Batch Layer: All classified behaviour since T

Serving Layer

Serving Layer: Batch Layer ∪ Speed Layer
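The three layers can be sketched as a toy in Python. This is only an illustration under stated assumptions, not Barricade's implementation: the class name `LambdaViews`, its methods, and the use of plain sets for "behaviour" are all hypothetical. It shows the batch view holding everything classified so far, the speed view holding what arrived since, and the serving layer answering with their union.

```python
from dataclasses import dataclass, field


@dataclass
class LambdaViews:
    """Toy serving layer: answers from a batch view plus a speed view.

    Hypothetical sketch; names and data shapes are assumptions.
    """
    batch_view: set = field(default_factory=set)  # all classified behaviour since T
    speed_view: set = field(default_factory=set)  # new behaviour from new data

    def ingest(self, behaviour):
        # Speed layer: record new behaviour as soon as it arrives.
        self.speed_view.add(behaviour)

    def run_batch(self, master_dataset):
        # Batch layer: recompute the view from the full dataset, then
        # drop speed-layer entries the batch view now covers.
        self.batch_view = set(master_dataset)
        self.speed_view -= self.batch_view

    def serve(self):
        # Serving layer: batch view union speed view.
        return self.batch_view | self.speed_view
```

For example, after `run_batch(["scan", "login"])` and `ingest("sqli")`, `serve()` returns all three behaviours even though the batch layer has not yet seen `"sqli"`.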

Cache Layer

On Amazon AWS

Identifying an Attack.

Ahh! What’s that?

Kafka Queue.
Distributed messaging system
Append-only log
Consumers have offsets
Partition for parallelism
Replicate for redundancy
Message order guaranteed, per-partition
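A small in-memory toy can make the log properties concrete. It does not use the Kafka client API and leaves out replication and networking. The class `ToyLog` and its methods are hypothetical names chosen for illustration, and hashing the key with `crc32` is an assumption, not how Kafka assigns partitions by default. It shows an append-only log per partition, a separate offset per consumer, and ordering that holds only within a partition.

```python
from collections import defaultdict
from zlib import crc32


class ToyLog:
    """In-memory stand-in for a partitioned, append-only log (not real Kafka)."""

    def __init__(self, num_partitions=2):
        # One append-only list per partition.
        self.partitions = [[] for _ in range(num_partitions)]
        # Each consumer tracks its own read offset per partition.
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def append(self, key, value):
        # Same key -> same partition, so per-key order is preserved.
        p = crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def poll(self, consumer, partition):
        # Return records after the consumer's offset, then advance it.
        start = self.offsets[consumer][partition]
        records = self.partitions[partition][start:]
        self.offsets[consumer][partition] = start + len(records)
        return records
```

Because reading never removes records, a second consumer polling the same partition starts from offset 0 and sees the full history, while the first consumer only sees what was appended after its last poll.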

Barricade

Customer

Questions?

@davidcoallier @barricadeio