Post on 20-May-2020

Building an open source data lake at scale in the cloud

Adrian Woodhead, Principal Engineer


Expedia Group Proprietary and Confidential

Agenda

Background

Data Lake foundation: data + metadata

High Availability and Disaster Recovery

Data federation

Event-based data processing


Data Lake journey

• “Traditional” RDBMS Data Warehouse

• Introduced on-premises Hadoop + Hive cluster

• RDBMS SQL replaced by SQL from Hive

• Slow at busy times

• Painful upgrade path (software and hardware)

• Migration to “Cloud” as primary data lake


Cloud Data Lake Foundation


Cloud Data Lake High Availability


Cloud Data Lake Redundancy


Redundancy by replication

• Data and Metadata

• Co-ordinated

• Data consistency during replication

• No partial reads

• Completeness more important than latency


Circus Train – Hive dataset replicator

• https://github.com/HotelsDotCom/circus-train/

• Metadata only made available after the data has been replicated

• Supports HDFS, S3, GCS etc.

• Standard “distcp” and optimised copiers

• Plugin architecture – Notifications, Copiers, Metadata transformations

• Selective data replication – custom filters, “Hive Diff”

• https://github.com/HotelsDotCom/shunting-yard

• Event-driven Circus Train
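The "data first, metadata second" ordering in the bullets above can be illustrated with a minimal sketch. This is not Circus Train's implementation: the function name, the dictionary standing in for the replica Hive metastore, and the folder naming are all hypothetical, and local folders stand in for HDFS/S3/GCS locations.

```python
import shutil
import tempfile
from pathlib import Path


def replicate_table(source_dir: Path, replica_root: Path,
                    replica_metastore: dict, table: str) -> Path:
    """Copy data first, then publish metadata pointing at the finished copy."""
    # Every run copies into a fresh folder, so an earlier replica is never
    # modified while somebody may still be reading it.
    target_dir = Path(tempfile.mkdtemp(prefix=f"{table}-", dir=replica_root))
    for src_file in source_dir.iterdir():
        if src_file.is_file():
            shutil.copy2(src_file, target_dir / src_file.name)
    # Only after all files have landed is the (hypothetical) replica
    # metastore updated, so readers never see a partially copied table.
    replica_metastore[table] = str(target_dir)
    return target_dir
```

If the copy fails part-way, the metastore entry is never touched and readers keep seeing the previous complete replica, which matches the "completeness more important than latency" trade-off.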


Data Lake Silos


Data Lake Silo Solutions

• Move back to a single data lake

• Scalability issues

• Increased “blast radius”

• Replicate shared data sets between data lakes

• Cost of maintaining replication jobs

• Increased file storage costs

• Increased network transfer costs


Federated Cloud Data Lake

• https://github.com/HotelsDotCom/waggle-dance/

• Waggle Dance – a Hive Thrift metastore proxy

• Configure it with “downstream” Hive metastores

• Configure S3 bucket access permissions

• Set “hive.metastore.uris” to Waggle Dance server

• Use as you would Hive metastore in any client app
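As a rough illustration of the "hive.metastore.uris" step, the sketch below generates the corresponding hive-site.xml fragment. The property name comes from the slide; the host name and port are placeholders and would need to match an actual Waggle Dance deployment.

```python
import xml.etree.ElementTree as ET


def hive_site_xml(metastore_uri: str) -> str:
    """Build a hive-site.xml fragment that sets hive.metastore.uris."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uri
    return ET.tostring(configuration, encoding="unicode")


# Placeholder host and port: substitute the address of your own
# Waggle Dance server.
WAGGLE_DANCE_URI = "thrift://waggle-dance.example.com:48869"
print(hive_site_xml(WAGGLE_DANCE_URI))
```

Client applications pick this setting up exactly as they would for a regular Hive metastore; only the URI changes.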


Waggle Dance Overview


Multi-Region Federated Cloud Data Lake

[Diagram: US_WEST_2 and US_EAST_1 regions, labelled “Federate” and “Replicate”]


Federated Cloud Data Lake Best Practices

• Expose read-only endpoints to “external” users

• Separate critical path infrastructure

• Federate data for access within a region

• Replicate data for access in a different region
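The last two best practices amount to a simple rule of thumb, sketched below. The function names, dataset names, and region strings are hypothetical and only illustrate the decision, not any tool's API.

```python
def access_strategy(data_region: str, consumer_region: str) -> str:
    """Federate within a region, replicate across regions."""
    return "federate" if data_region == consumer_region else "replicate"


def plan_access(datasets: dict, consumer_region: str) -> dict:
    """Map each dataset name to the strategy a consumer should use."""
    return {name: access_strategy(region, consumer_region)
            for name, region in datasets.items()}


# Hypothetical datasets and the regions they live in.
datasets = {"bookings": "us-west-2", "reviews": "us-east-1"}
print(plan_access(datasets, "us-west-2"))
```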


Federated Cloud Data Lake Alternative

• Presto – distributed SQL query engine for big data

• Federate Hive, MySQL, PostgreSQL and many others

• https://github.com/prestodb/presto OR https://github.com/prestosql/presto?


Apiary - Cloud Data Lake Components

• https://github.com/ExpediaGroup/apiary

• Various components for a federated cloud data lake

• Docker images for all services

• Terraform deployment scripts

• Ranger for authorization

• Various optional extensions


Apiary – Metadata Events

• https://github.com/ExpediaGroup/apiary-extensions/tree/master/apiary-metastore-events

• Events for tables/partitions CRUD operations

• Hive MetaStoreEventListener implementations

• Kafka

• AWS SNS

• Enable downstream data processing use cases

• ETL, Governance, Lineage etc.
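A downstream consumer of such events might look roughly like the sketch below. The JSON field names ("eventType", "dbName", "tableName") and the event type string are assumptions made for illustration; the real message schema is defined by the apiary-metastore-events listeners and should be checked there. Reading from Kafka or SNS is left out, and the dispatcher simply takes the message body as a string.

```python
import json

# Maps an event type to the functions interested in it.
handlers = {}


def on(event_type):
    """Decorator registering a handler for one (hypothetical) event type."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register


def dispatch(message: str) -> int:
    """Parse one JSON event and call its handlers; return how many ran."""
    event = json.loads(message)
    matched = handlers.get(event["eventType"], [])
    for fn in matched:
        fn(event)
    return len(matched)


lineage_log = []


@on("ADD_PARTITION")
def record_lineage(event):
    # A lineage or governance job would do real work here.
    lineage_log.append((event["dbName"], event["tableName"]))


dispatch(json.dumps({"eventType": "ADD_PARTITION",
                     "dbName": "bookings", "tableName": "hotel"}))
```

Event types without a registered handler are ignored, so new use cases can be added without touching existing ones.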


Problem – rewriting data at scale

• Changes to existing data

• Read isolation for long running queries

• Always create new folders for updates

• Repoint Hive data locations

• How to expire “orphaned data”?
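The "new folder, then repoint" pattern can be sketched as follows, with a dictionary standing in for the partition locations held in the Hive metastore and local folders standing in for cloud storage paths. All names here are hypothetical.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def rewrite_partition(table_root: Path, locations: dict,
                      partition: str, rows: list) -> Optional[Path]:
    """Write updated data to a new folder, then repoint the partition.

    Returns the previous location, which is now unreferenced ("orphaned").
    """
    new_dir = table_root / partition / f"snapshot-{uuid.uuid4().hex}"
    new_dir.mkdir(parents=True)
    (new_dir / "part-0000.csv").write_text("\n".join(rows))
    previous = locations.get(partition)
    # Repointing is a single metadata change; queries that already started
    # keep reading the old folder, which is left untouched.
    locations[partition] = new_dir
    return previous


table_root = Path(tempfile.mkdtemp())
locations = {}
rewrite_partition(table_root, locations, "date=2020-01-01", ["a,1"])
orphan = rewrite_partition(table_root, locations, "date=2020-01-01", ["a,2"])
```

After the second call, `orphan` still exists on disk but nothing points at it, which is exactly the cleanup problem raised in the last bullet.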


Beekeeper – orphaned data cleanup

• https://github.com/ExpediaGroup/beekeeper/

• Hive table parameter: beekeeper.remove.unreferenced.data=true

• Apiary event listener

• Detects data re-writes

• Schedules old data for deletion in future

• Periodically performs the data deletions
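The schedule-then-delete behaviour described above can be approximated with the sketch below. It is not Beekeeper's code: the class, its in-memory queue, and the three-day delay are assumptions for illustration, and local folders stand in for cloud storage paths.

```python
import shutil
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


class CleanupScheduler:
    """Queue orphaned paths and delete them once a grace period has passed."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.pending = []  # list of (delete_after, path) tuples

    def schedule(self, path: Path, now: datetime) -> None:
        # Called when a rewrite is detected: the old data is kept for
        # `delay` so long-running queries can finish reading it.
        self.pending.append((now + self.delay, path))

    def run(self, now: datetime) -> list:
        # Called periodically: delete everything whose grace period is over.
        due = [path for when, path in self.pending if when <= now]
        self.pending = [(when, path) for when, path in self.pending
                        if when > now]
        for path in due:
            shutil.rmtree(path, ignore_errors=True)
        return due


start = datetime(2020, 1, 1)
old_dir = Path(tempfile.mkdtemp())
scheduler = CleanupScheduler(delay=timedelta(days=3))
scheduler.schedule(old_dir, now=start)
scheduler.run(now=start + timedelta(days=1))  # too early, nothing is deleted
scheduler.run(now=start + timedelta(days=3))  # grace period over, old_dir removed
```

A real service would persist the queue so that scheduled deletions survive restarts.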


Consistent CRUD alternatives

• http://hive.apache.org/ - Hive 3.1.x with ACID

• https://iceberg.incubator.apache.org/ - Iceberg

• https://delta.io/ - Delta Lake

• https://hudi.apache.org/ - Hudi


Don’t forget to test

• https://github.com/klarna/HiveRunner/ - Hive SQL unit tests

• https://github.com/HotelsDotCom/mutant-swarm/ - Code coverage for HiveRunner

• https://github.com/HotelsDotCom/beeju - Unit tests for Thrift Hive metastore service and HiveServer2


Where to next?

• Hybrid cloud

• best of both worlds but increased complexity

• Multi-cloud

• best of breed but increased complexity

• Docker + Kubernetes

• Reduce vendor lock-in

• Massive scale without too much effort

• Minimal changes for on-prem/EKS/GKE/AKS etc


Open Source Data Lake Components

Hive Federation

https://github.com/HotelsDotCom/waggle-dance

Hive Replication

https://github.com/HotelsDotCom/circus-train

https://github.com/ExpediaGroup/shunting-yard

Cloud Data Lake

https://github.com/ExpediaGroup/apiary

Hive Cleanup

https://github.com/ExpediaGroup/beekeeper