10 ways to stumble with big data

2017-09-14 · Lars Albertsson · www.mapflat.com


TRANSCRIPT

Page 1: 10 ways to stumble with big data


Page 2: Who’s talking?

● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)

Page 3: Data-centric systems, 1st generation

● The monolith
  ○ All data in one place
  ○ Analytics + online serving from a single database

[Diagram: presentation, logic, and storage layers on top of a single DB]

Page 4: Data-centric systems, 2nd generation

● Collect aggregated data from multiple online systems into a data warehouse
● Aggregate to OLAP cubes
● Analytics focused

[Diagram: services and a web application feed daily aggregates into a data warehouse]

Page 5: 3rd generation - event oriented

[Diagram: a data lake on cluster storage; ETL pipelines of datasets and jobs feed analytics, AI features, and data-driven product development]

Page 6: Why bother?

● Development iteration speed
● Data-driven development
● Machine learning features
● Democratised data access

Page 7: 1 - Spending-driven development

● Large spending before value delivery
● Vendors want you to make this mistake

Warning signs:
● No workflow orchestration tool
● Driven by the infrastructure department
● Project named “data lake” or “data platform”
● High trust in vendor

Page 8: 10 ways to stumble with big data

2 - Premature scaling● You don’t have big data!● Max cloud instance memory: 2TB● Does your data

○ fit?○ grow faster than Moore’s law?

● Scaling out only when needed● Big data Lean data

○ Time-efficient data handling○ Democratised data○ Complex business logic○ Human fault tolerance○ Data agility

88

Funky databases

In-memory technology

Daily work requires cluster
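A concrete way to apply the premature-scaling check: before reaching for a cluster, try the job on one machine. A single plain-Python process comfortably aggregates millions of records; the event shape and numbers below are illustrative, not from the slides.

```python
# "You don't have big data": one process aggregates a million events in
# seconds, no cluster required. The event shape here is an illustrative
# assumption; real events would come from logs or a dump.
from collections import Counter

# A generator standing in for a day's worth of raw events.
events = ({"country": ("SE", "DE", "US")[i % 3]} for i in range(1_000_000))

# The whole "job" is a single pass over the data.
per_country = Counter(e["country"] for e in events)
```

If the data fits on one machine and grows slower than machine capacity, this kind of code is also far cheaper to test, debug, and evolve than a distributed job.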

Page 9: 3 - The data waterfall

● Handovers add latency
● Low product agility

Warning signs:
● High time to delivery
● Unclear use cases
● Many teams from source to end
● No workflow orchestration tool
● Mono-functional teams

Page 10: 10 ways to stumble with big data

Right turn: Feature-driven teams & infrastructure● Cross-functional teams own

specific feature● Path from source data to end

user service

10

Start out with workflow orchestration

Self-service infrastructure added lazily

Postpone clusters & investments

End-to-end proof of concepts
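The "start out with workflow orchestration" advice can be sketched in miniature. A real deployment would use a tool such as Luigi or Airflow; the toy below merely stands in for one to show the core idea: each step declares its inputs and its output target, and only missing outputs are computed, so a failed run recovers by simply rerunning. All names and paths are illustrative assumptions.

```python
# Toy stand-in for a workflow orchestrator (Luigi, Airflow, ...): a task runs
# only if its output target is missing, so recovery after a failure is just
# "run the pipeline again" and only the missing pieces get recomputed.
import tempfile
from pathlib import Path

LAKE = Path(tempfile.mkdtemp())  # stand-in for cluster storage / the data lake


def run_task(name, inputs, output, func):
    """Run `func` on the input files unless `output` already exists."""
    out = LAKE / output
    if out.exists():
        return f"{name}: up to date"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(func([(LAKE / p).read_text() for p in inputs]))
    return f"{name}: done"


# Pipeline: extract one day of raw events, then derive a daily report.
run_task("extract", [], "events/2017-09-14.jsonl",
         lambda _: '{"user": "u1"}\n{"user": "u2"}\n')
run_task("report", ["events/2017-09-14.jsonl"], "reports/2017-09-14.txt",
         lambda ins: f"events: {len(ins[0].splitlines())}\n")
```

Rerunning the same two calls is a no-op, which is what makes orchestrated batch pipelines forgiving: delete a bad output and rerun, and only that step is redone.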

Page 11: 10 ways to stumble with big data

Team that owns data exports to lake

Team needing data imports to lake

4 - Lake of trash

11111111

Excessive time spent cleaning

Data feature teams access production data

Data quality & semantics issues

Page 12: 5 - Random walk

● Many iterative steps without a target vision
● Works fine for months; pain then increases gradually
● Difficult to be GDPR compliant

Warning signs:
● Autonomous / microservice culture
● Little technology governance
● No plan for schemas, deployment, privacy
● Wide changes difficult

Page 13: 6 - Distinct crawl

● Batch data pipelines are forgiving
  ○ Workflow orchestration tool for recovery
● Many practices are cargo rituals
  ○ Release management
  ○ In situ testing
  ○ Performance testing
● Start minimal & quick
  ○ Developer integration tests
  ○ Continuous deployment pipeline
● Add process only where there is pain

Warning signs:
● Enterprise culture
● Heavy practice governance
● Standard rituals applied
● Late first delivery
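The "developer integration tests" bullet is cheap to honor when a job's transform logic is a plain function over in-memory records: it can then be exercised end to end on a laptop, with no cluster and no mocks. The sessionization job below is a hypothetical example, not from the slides.

```python
# Hypothetical batch-job transform kept as a pure function, so it is testable
# like any other code: known input in, asserted output out.
def sessionize(timestamps, gap_seconds=1800):
    """Group a user's event timestamps into sessions, split at idle gaps."""
    sessions, current = [], []
    last_ts = None
    for ts in sorted(timestamps):
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)  # idle gap exceeded: close the session
            current = []
        current.append(ts)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def test_sessionize_splits_on_idle_gap():
    # Two events a minute apart share a session; one 4940 s later does not.
    assert sessionize([0, 60, 5000]) == [[0, 60], [5000]]
```

The same function can then be wrapped by whatever execution framework the pipeline uses, while the tests stay fast and local.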

Page 14: 7 - Data loss by design

Warning signs:
● Processing during data ingestion
● Unclear source of truth
● Mutable master data

Instead:
● Store every event
● Immutable data
● Reproducible execution
● Large recovery buffers

Benefits:
● Human error tolerance
● Component error tolerance
● Rapid iteration speed
● Eliminates manual precautions
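A toy sketch of "store every event, immutable data": raw events are appended to date-partitioned files and never updated in place, and derived state is recomputed from the raw log, so both human and component errors are recoverable. Paths and the event shape are illustrative assumptions.

```python
# Append-only, date-partitioned raw event log: nothing is ever overwritten,
# and any derived view can be rebuilt by replaying the log.
import json
import tempfile
from pathlib import Path

LAKE = Path(tempfile.mkdtemp())  # stand-in for cluster storage


def append_event(event, date):
    """Append the raw event; existing data is never modified."""
    part = LAKE / "events" / date / "part-0.jsonl"
    part.parent.mkdir(parents=True, exist_ok=True)
    with part.open("a") as f:
        f.write(json.dumps(event) + "\n")


def replay(date):
    """Reproducible execution: derive state by rereading the raw log."""
    part = LAKE / "events" / date / "part-0.jsonl"
    return [json.loads(line) for line in part.open()]


append_event({"user": "u1", "action": "signup"}, "2017-09-14")
append_event({"user": "u1", "action": "signup_cancelled"}, "2017-09-14")
# Both the action and its reversal survive; no information was destroyed.
```

Contrast this with mutable master data, where the cancellation would overwrite the signup and the history would be gone.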

Page 15: 10 ways to stumble with big data

8 - AI first● You can climb, not jump● PoCs are possible

Credits: “The data science hierarchy of needs”, Monica Rogati

15

AIDeep learning

A/B testingMachine learning

AnalyticsSegments

CurationAnomaly detection

Data infrastructurePipelines

InstrumentationData collection

Value Effort

Page 16: 9 - Technical bankruptcy

● Data pipeline == software product
● Apply common best practices
  ○ Quality tools & processes
  ○ Automated (integration) testing
  ○ CI/CD
  ○ Refactoring
● Avoid tools that steer you away
  ○ Local execution?
  ○ Difficult testing?
  ○ Mocks required?
● Strong software engineers needed
  ○ Rotate if necessary

Warning signs:
● Heterogeneous environment
● Weak release process
● Few code quality tools
● Excessive time on operations

Page 17: 10 - Team trinity unbalance

● Team sport
● Mutual respect & learning
● Be driven by
  ○ user value
● Balance with
  ○ innovation
  ○ engineering

[Diagram: triangle of data engineer, data scientist, and product owner; an unbalanced team risks increasing tech debt, little innovation, or low business value]

Page 18: 11 - Miss the train

● Big data + AI is not optional; cf. the Internet, smartphones, …
● Product development speed impact is significant
● Data-driven evaluation
● Forgiving environment: move fast without breaking things
● Democratised access to data