From zero to solid data pipelines - Jfokus · 2016-02-14
![Page 1: from zero to solid Data pipelines - Jfokus€¦ · · 2016-02-14from zero to solid Lars Albertsson 1. ... processing perspective Kept as long as permitted ... //red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS](https://reader031.vdocuments.net/reader031/viewer/2022022011/5b0a4ebb7f8b9a99488c0033/html5/thumbnails/1.jpg)
Data pipelines from zero to solid
Lars Albertsson
www.mapflat.com

Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
Independent data engineering consultant

Presentation goals
● Overview of data pipelines for analytics / data products
● Target audience: Big data starters
  ○ Seen wordcount, need the stuff around
● Overview of necessary components & wiring
● Base recipe
  ○ In vicinity of state-of-practice
  ○ Baseline for comparing design proposals
● Subjective best practices - not single truth
● Technology suggestions, (alternatives)

Presentation non-goals
● Stream processing
  ○ High complexity in practice
  ○ Batch processing yields > 90% of value
● Technology enumeration or (fair) comparison
● Writing data processing code
  ○ Already covered en masse

Data product anatomy
[Diagram. Stages: Ingress, ETL, Egress. Components: Service, DB, Unified log, Cluster storage, Data lake, Dataset, Job, Pipeline, Export, Business intelligence.]

Computer program anatomy
[Diagram. Stages: Input, Process, Output. Components: Input data, Output data, File, HID, Window, RAM, Variable, Function, Execution path, Lookup structure.]

Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, lint tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is the norm... :-(
Solved by mixing strong software engineers with data engineers/scientists. Mutual respect is crucial.

Event collection
[Diagram labels: Service; Unreliable (twice); Reliable, simple, write available; Bus with history: Kafka (Kinesis, Google Pub/Sub); (Secor, Camus); Cluster storage: HDFS (NFS, S3, Google CS, C*)]
Unified log: immutable events, append-only, source of truth
Immediate handoff to append-only replicated log.
Once in the log, events eventually arrive in storage.

Event registration
[Diagram labels: Service (unimportant); Service (important); Replicated bus with history; Unified log; Events are safe from here]
Asynchronous fire-and-forget handoff for unimportant data.
Synchronous, replicated, with ack for important data.

Event transportation
[Diagram labels: Bus-to-bus WAN mirror, expect delays; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
Log has long history (months+) => robustness end to end.
Avoid risk of processing & decoration. Except timestamps.

Event arrival
Bundle incoming events into datasets
● Sealed quickly, thereafter immutable
● Bucket on arrival / wall-clock time
● Predictable bucketing, e.g. hour
[Diagram labels: (Secor, Camus); Cluster storage; clicks/2016/02/08/14; clicks/2016/02/08/15]

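The hourly bucketing on arrival time can be sketched as below. This is a minimal illustration, not what Secor or Camus do internally: the function name `arrival_bucket` and the choice of UTC are assumptions; only the `clicks/2016/02/08/14` path layout comes from the slide.

```python
from datetime import datetime, timezone


def arrival_bucket(dataset: str, arrival_time: datetime) -> str:
    """Return the hourly bucket path for an event, based on when it arrived."""
    # Assumption: buckets are named in UTC so that paths are unambiguous.
    utc = arrival_time.astimezone(timezone.utc)
    return f"{dataset}/{utc:%Y/%m/%d/%H}"


first = arrival_bucket("clicks", datetime(2016, 2, 8, 14, 59, 59, tzinfo=timezone.utc))
second = arrival_bucket("clicks", datetime(2016, 2, 8, 15, 0, 0, tzinfo=timezone.utc))
print(first)   # clicks/2016/02/08/14
print(second)  # clicks/2016/02/08/15
```

Because the bucket depends only on the arrival clock, a bucket can be sealed shortly after its hour has passed, regardless of the timestamps inside the events.
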
Database state collection
[Diagram labels: Service (twice); DB (twice); "?"; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
Source of truth sometimes in database.
Snapshot to cluster storage.
Easy on surface...

Anti-pattern: Send the oliphants!
● Sqoop (dump with MapReduce) production DB
● MapReduce from production API
Hadoop / Spark == internal DDoS service
[Diagram labels: Service (twice); DB (twice); Our preciousss; Cluster storage: HDFS (NFS, S3, Google CS, C*)]

Deterministic slaves
[Diagram labels: Service; DB; backup snapshot; Restore; DB; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
Restore backup to offline slave
+ Standard procedure
- Serial or resource consuming

Using snapshots
● join(event, snapshot) => always time mismatch
● Usually acceptable
● Some behaviour difficult to catch with snapshots
  ○ E.g. user creates, then deletes account
[Diagram labels: DB’, DB, join?]

Event sourcing
● Every change to unified log == source of truth
● snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
● Allows view & join at any point in time
Application services still need DB for current state lookup
[Diagram labels: DB’, DB]

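The relation snapshot(t + 1) = sum(snapshot(t), events(t, t+1)) can be read as a fold of events over the previous snapshot. The sketch below assumes a hypothetical event shape of `(timestamp, user_id, value)`, where a value of `None` means the record was deleted; none of these names come from the talk.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical event shape: (timestamp, user_id, new_value); None means delete.
Event = Tuple[int, str, Optional[str]]
Snapshot = Dict[str, str]


def apply_events(snapshot: Snapshot, events: Iterable[Event]) -> Snapshot:
    """Return snapshot(t + 1) from snapshot(t) and the events in between."""
    result = dict(snapshot)  # never mutate the input snapshot
    for _, user_id, value in sorted(events, key=lambda event: event[0]):
        if value is None:
            result.pop(user_id, None)
        else:
            result[user_id] = value
    return result


snapshot_t = {"alice": "se"}
events = [
    (2, "bob", None),    # bob deletes the account again
    (1, "bob", "uk"),    # bob creates an account
    (3, "alice", "no"),  # alice changes country
]
snapshot_t1 = apply_events(snapshot_t, events)
print(snapshot_t1)  # {'alice': 'no'}
```

The two snapshots alone never show that bob existed; the events in the log do, which is the create-then-delete case that plain snapshots miss.
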
Event sourcing, synced database
A. Service interface generates events and DB transactions
B. Generate stream from commit log
   Postgres, MySQL -> Kafka
C. Build DB with stream processing
[Diagram labels: API (three times)]

DB snapshot lessons learnt
● Put fences between online and offline components
  ○ The latter can kill the former
● Team that owns a database/service must own exporting data to offline
  ○ Protect online stability
  ○ Affects choice of DB technology

The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from batch processing perspective
● Kept as long as permitted
● Technically homogeneous
[Diagram labels: Cluster storage, Data lake]

Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
  ○ Compatible schema
  ○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
  ○ Immutable
  ○ E.g. MobileAdImpressions(hour="2016-02-06T13")

Representation - data lake & pipes
● Directory with multiple files
  ○ Parallel processing
  ○ Sealed with _SUCCESS (Hadoop convention)
  ○ Bundled schema format
    ■ JSON lines, Avro, Parquet
  ○ Avoid old, inadequate formats
    ■ CSV, XML
  ○ RPC formats lack bundled schema
    ■ Protobuf, Thrift

Directory datasets
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
    _SUCCESS
    part-00000.json
    part-00001.json
[Diagram labels on the path, in path order: Privacy level; Dataset class; Schema version; Instance parameters, Hive convention; Seal; Partitions]
● Some tools, e.g. Spark, understand Hive name conventions

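A dataset instance and its directory layout can be modelled in a few lines. The sketch below is illustrative only: the class `DatasetInstance`, its field names, and the use of a local temporary directory instead of HDFS are assumptions, as is the reading of `red` as the privacy level and `v1` as the schema version.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetInstance:
    """Dataset class (privacy level, name, schema version) plus instance parameters."""
    privacy: str
    name: str
    version: str
    params: tuple  # ordered (key, value) pairs, rendered Hive-style as key=value

    def path(self, root: Path) -> Path:
        parts = [f"{key}={value}" for key, value in self.params]
        return root.joinpath(self.privacy, self.name, self.version, *parts)

    def is_sealed(self, root: Path) -> bool:
        # Readers should only consume directories that carry the _SUCCESS marker.
        return (self.path(root) / "_SUCCESS").exists()


pageviews = DatasetInstance(
    privacy="red", name="pageviews", version="v1",
    params=(("country", "se"), ("year", 2015), ("month", 11), ("day", 4)),
)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    directory = pageviews.path(root)
    directory.mkdir(parents=True)
    (directory / "part-00000.json").write_text('{"url": "http://site_a/"}\n')
    sealed_before = pageviews.is_sealed(root)
    (directory / "_SUCCESS").touch()  # written last, after all part files
    sealed_after = pageviews.is_sealed(root)
    relative = directory.relative_to(root).as_posix()

print(relative)  # red/pageviews/v1/country=se/year=2015/month=11/day=4
```

Making the instance a frozen dataclass mirrors the slide's point that a dataset instance is immutable once its parameters are fixed.
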
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable. E.g. write full DB table / CF, switch service to use it, never change it.

Schemas
● There is always a schema
  ○ Plan your evolution
● New field, same semantic == compatible change
● Incompatible schema change => new dataset class
● Schema on read - assumptions in code
  ○ Dynamic typing
  ○ Quick schema changes possible
● Schema on write - enumerated fields
  ○ Static typing & code generation possible
  ○ Changes must propagate down pipeline code

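With schema on read, the assumptions about fields live in the reading code. The sketch below shows one compatible change, a new optional field, handled by the reader; the record layout and the field name `referrer` are made up for illustration.

```python
import json

# Hypothetical records: early writers emit two fields, later writers add "referrer".
lines = [
    '{"user": "34ac", "url": "http://site_a/"}',
    '{"user": "56bd", "url": "http://site_c/", "referrer": "http://site_b/"}',
]


def read_pageview(line: str) -> dict:
    """Schema on read: the assumptions about fields live in this function."""
    record = json.loads(line)
    return {
        "user": record["user"],              # required since the first version
        "url": record["url"],                # required since the first version
        "referrer": record.get("referrer"),  # added later, so it may be absent
    }


pageviews = [read_pageview(line) for line in lines]
print([p["referrer"] for p in pageviews])  # [None, 'http://site_b/']
```

Changing the meaning of `url`, by contrast, would be an incompatible change and, following the slide, would call for a new dataset class rather than a tweak to this reader.
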
Schema on read or write?
[Diagram labels: DB; Service; Export; Business intelligence; "Change agility important here"; "Production stability important here"]

Batch processing
Gradual refinement
1. Wash - time shuffle, dedup, ...
2. Decorate - geo, demographic, ...
3. Domain model - similarity, clusters, ...
4. Application model - Recommendations, ...
[Diagram labels: Data lake; Job; Pipeline; Artifact of business value, e.g. service index]

Batch job code
● Components should scale up
  ○ Spark, (Scalding, Crunch)
● And scale down
  ○ More important!
  ○ Component should support local mode
    ■ Integration tests
    ■ Small jobs - less risk, easier debugging

Language choice
● People and community thing, not a technical thing
● Need for simple & quick experiments
  ○ Java - too much ceremony and boilerplate
● Stable and static enough for production
  ○ Python/R - too dynamic
● Scala connects both worlds
  ○ Current home of data innovation
● Beware of complexity - keep it sane and simple
  ○ Avoid spaceships: <|*|> |@| <**>

Batch job
Job == function([input datasets]): [output datasets]
● No orthogonal concerns
  ○ Invocation
  ○ Scheduling
  ○ Input / output location
● Testable
● No other input factors
● No side-effects
● Ideally: atomic, deterministic, idempotent

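A job in this sense can be sketched as a function from an input directory to an output directory. The example below is a hypothetical dedup job on local files, not code from the talk: `dedup` and `run_job` are illustrative names, and the staging-directory-plus-rename step is one common way to get atomic output on a POSIX file system.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def dedup(records):
    """Pure transformation: drop exact duplicates, keeping first occurrences."""
    seen = set()
    for record in records:
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            yield record


def run_job(input_dir: Path, output_dir: Path) -> None:
    """Read one input dataset, write one output dataset; the caller picks locations."""
    if (output_dir / "_SUCCESS").exists():
        return  # idempotent: a sealed output is never rewritten
    records = [json.loads(line)
               for part in sorted(input_dir.glob("part-*.json"))
               for line in part.read_text().splitlines()]
    staging = Path(tempfile.mkdtemp(dir=output_dir.parent))
    with open(staging / "part-00000.json", "w") as out:
        for record in dedup(records):
            out.write(json.dumps(record, sort_keys=True) + "\n")
    (staging / "_SUCCESS").touch()
    shutil.rmtree(output_dir, ignore_errors=True)  # clear any unsealed leftovers
    os.rename(staging, output_dir)  # publish the finished directory in one step


with tempfile.TemporaryDirectory() as tmp:
    source, target = Path(tmp, "in"), Path(tmp, "out")
    source.mkdir()
    (source / "part-00000.json").write_text('{"id": 1}\n{"id": 1}\n{"id": 2}\n')
    run_job(source, target)
    run_job(source, target)  # second run is a no-op
    output_lines = (target / "part-00000.json").read_text().splitlines()

print(output_lines)  # ['{"id": 1}', '{"id": 2}']
```

The function knows nothing about scheduling or where datasets live, which is what keeps it testable with plain temporary directories.
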
Batch job class & instance
● Pipeline equivalent of Command pattern
● Parameterised
  ○ Higher order, c.f. dataset class & instance
  ○ Job instance == job class + parameters
  ○ Inputs & outputs are dataset classes
● Instances are ideally executed when input appears
  ○ Not on cron schedule

Pipelines
● Things will break
  ○ Input will be missing
  ○ Jobs will fail
  ○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness
[Diagram labels: Cluster storage; Data lake: pristine, immutable datasets; Intermediate; Derived, regenerable]

Workflow manager
● Dataset “build tool”
● Run job instance when
  ○ input is available
  ○ output missing
  ○ resources are available
● Backfill for previous failures
● DSL describes DAG
● Includes ingress & egress
Luigi, (Airflow, Pinball)
[Diagram labels: DB]

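The "build tool" behaviour can be illustrated with a toy resolver. This is a sketch of the idea only and not how Luigi is implemented: the `build` function and the job table are hypothetical, instance parameters such as the hour are left out, and resource availability is not modelled.

```python
from typing import Dict, List, Set


def build(target: str, jobs: Dict[str, tuple], existing: Set[str], log: List[str]) -> bool:
    """Build `target` if it is missing and all of its inputs can be made available."""
    if target in existing:
        return True   # output present: nothing to do
    if target not in jobs:
        return False  # raw input that has not arrived yet
    inputs, run = jobs[target]
    # Try every input, so that whatever can be built is built on this invocation.
    if not all([build(name, jobs, existing, log) for name in inputs]):
        return False  # input missing: try again on the next invocation
    run()
    existing.add(target)
    log.append(target)
    return True


# Hypothetical job table: output dataset -> (input datasets, function that builds it).
jobs = {
    "ClientActions": (["Actions", "UserDB"], lambda: None),
    "ClientSessions": (["ClientActions"], lambda: None),
    "SessionsABResults": (["ClientSessions", "ABExperiments"], lambda: None),
}
existing = {"Actions", "UserDB"}  # ABExperiments has not arrived yet
log: List[str] = []
first_try = build("SessionsABResults", jobs, existing, log)
existing.add("ABExperiments")     # the missing input appears later
second_try = build("SessionsABResults", jobs, existing, log)
print(first_try, second_try, log)
# False True ['ClientActions', 'ClientSessions', 'SessionsABResults']
```

Re-invoking the same target repeatedly is safe, since existing outputs are skipped; that is what makes a redundant schedule with backfill workable.
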
DSL DAG example (Luigi)

    from datetime import timedelta

    from luigi import DateHourParameter
    from luigi.contrib.hdfs import HdfsTarget
    from luigi.contrib.spark import SparkSubmitTask


    class ClientActions(SparkSubmitTask):
        hour = DateHourParameter()

        def requires(self):
            # The last 12 hours of actions, plus the user DB snapshot for that day.
            return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 12)] + \
                   [UserDB(date=self.hour.date())]
        ...


    class ClientSessions(SparkSubmitTask):
        hour = DateHourParameter()

        def requires(self):
            return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]
        ...


    class SessionsABResults(SparkSubmitTask):
        hour = DateHourParameter()

        def requires(self):
            return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]

        def output(self):
            # The datetime format spec expands to year=.../month=.../day=.../hour=...
            return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                              "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
        ...

[Diagram: Actions, UserDB -> Time shuffle, user decorate -> ClientActions -> Form sessions -> ClientSessions; ClientSessions, A/B tests -> A/B compare -> A/B session evaluation. Annotations: Dataset instance; Job (aka Task) classes]
● Expressive, embedded DSL - a must for ingress, egress
  ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline

Egress datasets
● Serving
  ○ Precomputed user query answers
  ○ Denormalised
  ○ Cassandra, (many)
● Export & Analytics
  ○ SQL (single node / Hive, Presto, ..)
  ○ Workbenches (Zeppelin)
  ○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
  ○ Prepare to redirect pipelines

Test strategy considerations
● Developer productivity is the primary value of test automation
● Test at stable interface
  ○ Minimal maintenance
  ○ No barrier to refactorings
● Focus: single job + end to end
  ○ Jobs & pipelines are pure functions - easy to test
● Component, unit - only if necessary
  ○ Avoid dependency injection ceremony

Testing single job
[Diagram: standard Scalatest harness; f() generates file://test_input/, the Job runs, p() verifies file://test_output/]
1. Generate input
2. Run in local mode
3. Verify output
Don’t commit - expensive to maintain. Generate / verify with code.
Runs well in CI / from IDE
● (Tool-specific frameworks, e.g. for Spark?)
  ○ Usable, but rarely cover I/O - home of many bugs.
  ○ Tied to processing technology

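The three steps can be sketched with a plain test harness. The slide names Scalatest; the example below swaps in Python's standard `unittest` to stay in the language of the Luigi example, and the job `count_by_country` is a hypothetical stand-in for a real job run in local mode.

```python
import json
import tempfile
import unittest
from pathlib import Path


def count_by_country(input_dir: Path, output_dir: Path) -> None:
    """Hypothetical job under test: count page views per country."""
    counts = {}
    for part in sorted(input_dir.glob("part-*.json")):
        for line in part.read_text().splitlines():
            country = json.loads(line)["country"]
            counts[country] = counts.get(country, 0) + 1
    output_dir.mkdir(parents=True)
    (output_dir / "part-00000.json").write_text(
        "".join(json.dumps({"country": c, "views": n}) + "\n"
                for c, n in sorted(counts.items())))
    (output_dir / "_SUCCESS").touch()


class CountByCountryTest(unittest.TestCase):
    def test_counts_views_per_country(self):
        with tempfile.TemporaryDirectory() as tmp:
            test_input, test_output = Path(tmp, "test_input"), Path(tmp, "test_output")
            # 1. Generate input with code instead of committing data files.
            test_input.mkdir()
            views = [{"country": "se"}, {"country": "uk"}, {"country": "se"}]
            (test_input / "part-00000.json").write_text(
                "".join(json.dumps(v) + "\n" for v in views))
            # 2. Run the job locally, through its real file I/O.
            count_by_country(test_input, test_output)
            # 3. Verify output with code.
            self.assertTrue((test_output / "_SUCCESS").exists())
            rows = [json.loads(line) for line in
                    (test_output / "part-00000.json").read_text().splitlines()]
            self.assertEqual(rows, [{"country": "se", "views": 2},
                                    {"country": "uk", "views": 1}])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountByCountryTest)
result = unittest.TextTestRunner().run(suite)
```

The test touches only the job's input and output directories, so the job's internals can be refactored, or moved to another processing framework, without changing the test.
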
Testing pipelines - two options
A: Test job with sequence of jobs (standard Scalatest harness)
1. Generate input: f() -> file://test_input/
2. Run custom multi-job
3. Verify output: file://test_output/ -> p()
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance

B: Customised workflow manager setup (f(), p())
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python

● Both can be extended with Kafka, egress DBs

Deployment
[Diagram: Hg/git repo (Luigi DSL, jars, config) -> my-pipe-7.tar.gz -> HDFS; Luigi daemon; Workers (eight)]
All that a pipeline needs, installed atomically
> pip install my-pipe-7.tar.gz
Redundant cron schedule, higher frequency + backfill (Luigi range tools)
* 10 * * * bin/my_pipe_daily --backfill 14

Continuous deployment
● Poll and pull latest on worker nodes
  ○ virtualenv package/version
    ■ No need to sync environment & versions
  ○ Cron package/latest/bin/*
    ■ Old versions run pipelines to completion, then exit
[Diagram: Hg/git repo (Luigi DSL, jars, config) -> my-pipe-7.tar.gz -> HDFS; Worker; my_cd.py hdfs://pipelines/]
> virtualenv my_pipe/7
> pip install my-pipe-7.tar.gz
* 10 * * * my_pipe/7/bin/*

Start lean: assess needs
Your data & your jobs:
A. Fit in one machine, and will continue to do so
B. Fit in one machine, but grow faster than Moore’s law
C. Do not fit in one machine
● Most datasets / jobs: A
  ○ Even at large companies with millions of users
● cost(C) >> cost(A)
● Running A jobs on C infrastructure is expensive

Lean MVP
● Start simple, lean, end-to-end
  ○ No parallel cluster computations necessary?
  ○ Custom jobs or local Spark/Scalding/Crunch
● Shrink data
  ○ Downsample
  ○ Approximate algorithms (e.g. Count-min sketch)
● Get workflows running
  ○ Serial jobs on one/few machines
  ○ Simple job control (Luigi only / simple work queue)

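As an example of shrinking data with an approximate algorithm, here is a minimal Count-min sketch. It is a teaching sketch under assumed parameters (`width=1000`, `depth=5`, SHA-256 based hashing), not a tuned implementation; a production job would more likely use an existing library.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width: int = 1000, depth: int = 5):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item: str):
        # One independent-looking hash per row, derived by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, column in self._columns(item):
            self.rows[row][column] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever inflate a cell, so the minimum is the best estimate.
        return min(self.rows[row][column] for row, column in self._columns(item))


sketch = CountMinSketch()
for url in ["http://site_a/"] * 3 + ["http://site_b/"]:
    sketch.add(url)
print(sketch.estimate("http://site_a/"))
```

Memory use is fixed at width × depth counters regardless of how many distinct items are counted, which is what lets a job of this kind stay on one machine.
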
Scale carefully
● Get end-to-end workflows in production for evaluation
  ○ Improvements driven by business value, not tech
● Keep focus small
  ○ Business value
  ○ Privacy needs attention early
● Keep iterations swift
  ○ Integration test end-to-end
  ○ Efficient code/test/deploy cycle
● Parallelise jobs only when forced

Protecting privacy in practice
● Removing old personally identifiable information (PII)
● Right to be forgotten
● Access control to PII data
● Audit of access and processing
● PII content definition is application-specific
● PII handling subject to business priorities
  ○ But you should have a plan from day one

Data retention
● Remove old, promote derived datasets to lake
[Diagram labels: Cluster storage, Data lake, Derived (each shown twice)]

PII removal
● Must rebuild downstream datasets regularly
  ○ In order for PII to be washed in x days

Key on PII => difficult to wash

bobwhite,http://site_a/,2015-01-03T
bobwhite,http://site_b/,2015-01-03T
joeblack,http://site_c/,2015-01-03T

bobwhite,Bath,uk
joeblack,Bristol,uk

bobwhite,http://site_a/,2015-01-03T,Bath,uk
bobwhite,http://site_b/,2015-01-03T,Bath,uk
joeblack,http://site_c/,2015-01-03T,Bristol,uk

Split out PII, wash on user deletion

34ac,http://site_a/,2015-01-03T
34ac,http://site_b/,2015-01-03T
56bd,http://site_c/,2015-01-03T

34ac,Bath,uk
56bd,Bristol,uk

34ac,bobwhite
56bd,null

bobwhite,http://site_a/,2015-01-03T,Bath,uk
bobwhite,http://site_b/,2015-01-03T,Bath,uk
null,http://site_c/,2015-01-03T,Bristol,uk

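The "split out PII" layout can be sketched as a small key-to-name mapping that is the only place holding the user name. The class `PiiVault` and its methods are hypothetical names, and keeping the mapping in memory is a simplification; the slide only shows the resulting tables.

```python
import uuid
from typing import Dict, Optional


class PiiVault:
    """Hypothetical mapping from an opaque user key to the PII user name."""

    def __init__(self) -> None:
        self._name_by_key: Dict[str, Optional[str]] = {}
        self._key_by_name: Dict[str, str] = {}

    def key_for(self, name: str) -> str:
        if name not in self._key_by_name:
            key = uuid.uuid4().hex[:8]  # random, so it reveals nothing about the name
            self._key_by_name[name] = key
            self._name_by_key[key] = name
        return self._key_by_name[name]

    def forget(self, name: str) -> None:
        key = self._key_by_name.pop(name, None)
        if key is not None:
            self._name_by_key[key] = None  # wash: the key stays, the name is gone

    def name_of(self, key: str) -> Optional[str]:
        return self._name_by_key.get(key)


vault = PiiVault()
raw = [("bobwhite", "http://site_a/"), ("bobwhite", "http://site_b/"),
       ("joeblack", "http://site_c/")]
# Stored datasets carry only the opaque key, never the name.
pageviews = [(vault.key_for(name), url) for name, url in raw]

vault.forget("joeblack")
# Downstream datasets are rebuilt by joining against the current mapping.
rebuilt = [(vault.name_of(key), url) for key, url in pageviews]
print(rebuilt)
# [('bobwhite', 'http://site_a/'), ('bobwhite', 'http://site_b/'), (None, 'http://site_c/')]
```

Deleting a user then means changing one row in the mapping; the name disappears from derived datasets the next time they are rebuilt, which is why the rebuild interval bounds how long PII lingers.
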
Simple PII audit
● Classify PII level
  ○ Name, address, messages, ...
  ○ IP, city, ...
  ○ Total # page views, ...
● Tag datasets and jobs in code
● Manual access through gateway tool
  ○ Verify permission, log
  ○ Dedicated machines only
● Log batch jobs
  ○ Deploy with CD only, log hg/git commit hash

Parting words + sales plug
Keep things simple; batch, homogeneity & little state
Focus on developer code, test, debug cycle - end to end
Harmony with technical ecosystems
Little technology overlap with yesterday - follow leaders
Plan early: Privacy, retention, audit, schema evolution

Please give feedback -- mapflat.com/feedback
I help companies plan and build these things

Bonus slides

Cloud or not?
+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in

Security?
● Afterthought add-on for big data components
  ○ E.g. Kerberos support
  ○ Always trailing - difficult to choose global paradigm
● Container security simpler
  ○ Easy with cloud
  ○ Immature with on-premise solutions?

Data pipelines example
[Diagram. Raw: Users, Pageviews, Sales. Derived: Views with demographics, Sales with demographics, Sales reports, Conversion analytics]

Data pipelines team organisation
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
![Page 53: from zero to solid Data pipelines - Jfokus€¦ · · 2016-02-14from zero to solid Lars Albertsson 1. ... processing perspective Kept as long as permitted ... //red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS](https://reader031.vdocuments.net/reader031/viewer/2022022011/5b0a4ebb7f8b9a99488c0033/html5/thumbnails/53.jpg)
Conway’s law
“Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”
Better organise to match desired design, then.
![Page 54: from zero to solid Data pipelines - Jfokus€¦ · · 2016-02-14from zero to solid Lars Albertsson 1. ... processing perspective Kept as long as permitted ... //red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS](https://reader031.vdocuments.net/reader031/viewer/2022022011/5b0a4ebb7f8b9a99488c0033/html5/thumbnails/54.jpg)
Personae - important characteristics
Architect
- Technology updated
- Holistic: productivity, privacy
- Identify and facilitate governance
Backend developer
- Simplicity oriented
- Engineering practices obsessed
- Adapt to data world
Product owner
- Trace business value to upstream design
- Find most ROI through difficult questions
Manager
- Explain what and why
- Facilitate process to determine how
- Enable, enable, enable
Devops
- Always increase automation
- Enable, don’t control
Data scientist
- Capable programmer
- Product oriented
![Page 55: from zero to solid Data pipelines - Jfokus€¦ · · 2016-02-14from zero to solid Lars Albertsson 1. ... processing perspective Kept as long as permitted ... //red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS](https://reader031.vdocuments.net/reader031/viewer/2022022011/5b0a4ebb7f8b9a99488c0033/html5/thumbnails/55.jpg)
Protect production servers
55
(Diagram labels: Service, DB, offline slave, Cluster storage: HDFS (NFS, S3, Google CS, C*).)
+ Online service is safe
- Replication may be out of sync
- Cluster storage may be write unavailable
=> Delayed, inaccurate snapshot
![Page 56: from zero to solid Data pipelines - Jfokus€¦ · · 2016-02-14from zero to solid Lars Albertsson 1. ... processing perspective Kept as long as permitted ... //red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS](https://reader031.vdocuments.net/reader031/viewer/2022022011/5b0a4ebb7f8b9a99488c0033/html5/thumbnails/56.jpg)
Deterministic slaves
56
(Diagram labels: Service, DB, backup snapshot, Restore, DB.)
+ Standard procedure
- Serial or resource consuming
(Diagram labels: Service, DB, commit log, Incremental, controlled replay, DB, DB.)
+ Deterministic
- Ad-hoc solution
- Serial => not scalable
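The commit log variant can be sketched as follows. This is a minimal illustration, not a description of any particular database: the log format `(seq, op, key, value)` and the `replay` function are assumptions. The point is that replaying up to a chosen sequence number always yields the same snapshot, and that a later snapshot can be built incrementally from an earlier one.

```python
def replay(commit_log, state, after_seq, up_to_seq):
    """Apply log entries with after_seq < seq <= up_to_seq to a copy of state.

    commit_log holds hypothetical (seq, op, key, value) tuples.
    """
    snapshot = dict(state)  # never mutate the previous snapshot
    for seq, op, key, value in sorted(commit_log, key=lambda entry: entry[0]):
        if seq <= after_seq or seq > up_to_seq:
            continue
        if op == "put":
            snapshot[key] = value
        elif op == "delete":
            snapshot.pop(key, None)
    return snapshot


log = [
    (1, "put", "alice", "se"),
    (2, "put", "bob", "no"),
    (3, "put", "alice", "dk"),
    (4, "delete", "bob", None),
]

# Controlled replay: stop at a chosen cutoff, then continue from there later.
day1 = replay(log, {}, after_seq=0, up_to_seq=2)
day2 = replay(log, day1, after_seq=2, up_to_seq=4)
```

The loop also shows the drawback noted on the slide: entries are applied strictly in order, so the replay is serial.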
![Page 57: from zero to solid Data pipelines - Jfokus€¦ · · 2016-02-14from zero to solid Lars Albertsson 1. ... processing perspective Kept as long as permitted ... //red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS](https://reader031.vdocuments.net/reader031/viewer/2022022011/5b0a4ebb7f8b9a99488c0033/html5/thumbnails/57.jpg)
PII privacy control
● Simplify with coarse classification (red/yellow/green)
  ○ Datasets, potentially fields
  ○ Separate production areas
● Log batch jobs
  ○ Code checksum -> commit id -> source code
  ○ Tag job class with classification
    ■ Aids PII consideration in code review
    ■ Enables ad-hoc verification
57
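One way to tag job classes and log batch jobs is sketched below. Everything here is hypothetical: the `classified` decorator, the `PageviewsByUser` job, the log record fields and the commit id are illustrative names, and the rule in `may_write_to` (a job may only write to an area at least as sensitive as its tag) is an assumed policy, not one stated in the slides.

```python
import hashlib

CLASSIFICATIONS = ("green", "yellow", "red")  # least to most sensitive


def classified(level):
    """Class decorator that tags a batch job class with a classification."""
    if level not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {level}")

    def tag(cls):
        cls.classification = level  # visible to reviewers and to tooling
        return cls

    return tag


@classified("red")
class PageviewsByUser:
    """Hypothetical job that reads PII, hence tagged red."""


def job_log_record(job_cls, code_bytes, commit_id):
    """Build the log record for one job run: checksum -> commit id -> source."""
    return {
        "job": job_cls.__name__,
        "classification": job_cls.classification,
        "code_sha256": hashlib.sha256(code_bytes).hexdigest(),
        "commit_id": commit_id,
    }


def may_write_to(job_cls, output_area):
    """Ad-hoc verification under the assumed policy described above."""
    levels = CLASSIFICATIONS
    return levels.index(output_area) >= levels.index(job_cls.classification)


record = job_log_record(PageviewsByUser, b"job code", "abc123")
```

Because the tag sits on the class itself, a missing or changed classification shows up in the diff during code review.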
![Page 58: from zero to solid Data pipelines - Jfokus€¦ · · 2016-02-14from zero to solid Lars Albertsson 1. ... processing perspective Kept as long as permitted ... //red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS](https://reader031.vdocuments.net/reader031/viewer/2022022011/5b0a4ebb7f8b9a99488c0033/html5/thumbnails/58.jpg)
Audit
● Audit manual access
● Wrap all functionality in gateway tool
  ○ Log datasets, output, code used
  ○ Disallow download to laptop
  ○ Wrapper tool happens to be great for enabling data scientists, too - shields them from operations.
58
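A gateway wrapper of this kind could look roughly like the sketch below. The function `audited_run`, the in-memory `audit_log` and the record fields are assumptions for illustration; a real tool would write to an append-only store and run the job on the cluster rather than in-process.

```python
import time

audit_log = []  # stand-in for an append-only audit store


def audited_run(user, code_id, datasets, job):
    """Run job(datasets) through the gateway, logging datasets, code and output."""
    entry = {
        "user": user,
        "code": code_id,  # e.g. a commit id identifying the code used
        "datasets": sorted(datasets),  # names of the datasets accessed
        "started": time.time(),
        "output": None,
    }
    audit_log.append(entry)  # logged before running, so failed runs are recorded
    result = job(datasets)
    entry["output"] = repr(result)[:200]  # truncated summary of the output
    return result


# Hypothetical manual access: an analyst counts rows in one dataset.
result = audited_run(
    user="analyst",
    code_id="abc123",
    datasets={"pageviews": [{"url": "/home"}, {"url": "/buy"}, {"url": "/home"}]},
    job=lambda data: len(data["pageviews"]),
)
```

The analyst only supplies the job and the dataset names; where and how the job runs stays inside the wrapper, which is what shields data scientists from operations.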