the past, present, and future of hadoop at linkedin

The Past, Present, and Future of Hadoop @ LinkedIn Carl Steinbach Senior Staff Software Engineer Data Analytics Infrastructure Group LinkedIn

Upload: carl-steinbach

Post on 14-Apr-2017


TRANSCRIPT

Page 1: The Past, Present, and Future of Hadoop at LinkedIn

The Past, Present, and Future of Hadoop @ LinkedIn

Carl Steinbach
Senior Staff Software Engineer
Data Analytics Infrastructure Group
LinkedIn

Page 2: The Past, Present, and Future of Hadoop at LinkedIn

The (Not So) Distant Past

Page 3: The Past, Present, and Future of Hadoop at LinkedIn

PYMK (People You May Know)

First version implemented in 2006
  6-8 million members
Ran on Oracle (foreshadowing!)
Found various overlaps
  School, work, etc.
Used common connections
  Triangle closing (?)

Page 4: The Past, Present, and Future of Hadoop at LinkedIn

Triangle Closing

[Diagram: three members, Mary, Dave, and Steve, with a "?" marking the connection that would close the triangle]
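The triangle-closing idea can be sketched in a few lines of Python. This is a minimal illustration, not LinkedIn's PYMK implementation: the function name, the edge list, and the assumption that Mary is the shared connection between Dave and Steve are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def triangle_closing_candidates(connections):
    """Return {(a, b): common_count} for unconnected pairs that share
    at least one common connection."""
    neighbors = defaultdict(set)
    for a, b in connections:
        neighbors[a].add(b)
        neighbors[b].add(a)

    candidates = defaultdict(int)
    # Any two connections of the same member share that member in common.
    for friends in list(neighbors.values()):
        for a, b in combinations(sorted(friends), 2):
            if b not in neighbors[a]:  # skip pairs that are already connected
                candidates[(a, b)] += 1
    return dict(candidates)


# Hypothetical graph: Mary knows Dave and Steve; Dave and Steve are not connected.
edges = [("Mary", "Dave"), ("Mary", "Steve")]
suggestions = triangle_closing_candidates(edges)
```

Pairs with more common connections would be ranked higher as suggestions. On a member graph of tens of millions this pairwise expansion is the expensive step, which is consistent with the move from a single database to a batch cluster described in the following slides.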

Page 5: The Past, Present, and Future of Hadoop at LinkedIn

PYMK Problems

By 2008, 40-50 million members
Still running on Oracle
Failed often
Infrequent data refresh
  6 weeks to 6 months!

Page 6: The Past, Present, and Future of Hadoop at LinkedIn

Humble Beginnings Back in ’08

Page 7: The Past, Present, and Future of Hadoop at LinkedIn

Success! (circa 2009)

Apache Hadoop 0.20
20 node cluster (repurposed hardware)
PYMK in 3 days!

Page 8: The Past, Present, and Future of Hadoop at LinkedIn

The Present

Page 9: The Past, Present, and Future of Hadoop at LinkedIn

Hadoop @ LinkedIn Circa 2016

> 10 Clusters
> 10,000 Nodes
> 1000 Users

Thousands of workflows, datasets, and ad-hoc queries

MR, Pig, Hive, Gobblin, Cubert, Scalding, Tez, Spark, Presto, …

Page 10: The Past, Present, and Future of Hadoop at LinkedIn

Two Types of Scaling Challenges

Machines

People and Processes

Page 11: The Past, Present, and Future of Hadoop at LinkedIn

Scaling Machines

Page 12: The Past, Present, and Future of Hadoop at LinkedIn

Some Tough Talk About HDFS

Conventional wisdom holds that HDFS
  Scales to > 4k nodes without federation*
  Scales to > 8k nodes with federation*

What’s been our experience?
  Many Apache releases won’t scale past a couple thousand nodes
  Vendor distros usually aren’t much better

Why?
  Scale testing happens after the release, not before
  Most vendors have only a handful of customers with clusters larger than 1k nodes

* Heavily dependent on NN RPC workload, block size, average file size, average container size, etc.

Page 13: The Past, Present, and Future of Hadoop at LinkedIn

March 2015 Was Not a Good Month

Page 14: The Past, Present, and Future of Hadoop at LinkedIn

What Happened?

We rapidly added 500 nodes to a 2000 node cluster

(don’t do this!)

NameNode RPC queue length and wait time skyrocketed

Jobs crawled to a halt

Page 15: The Past, Present, and Future of Hadoop at LinkedIn

What Was the Cause?

A subtle performance/scale regression was introduced upstream

The bug was included in multiple releases

Increased time to allocate a new file

The more nodes you had, the worse it got

Page 16: The Past, Present, and Future of Hadoop at LinkedIn

How We Used to Do Scale Testing

1. Deploy the release to a small cluster (num_nodes = 100)
2. See if anything breaks
3. If no, then deploy to the next-largest cluster and go to step 2
4. If yes, figure out what went wrong and fix it

Problems with this approach
  Expensive: developer time + hardware
  Risky: sometimes you can’t roll back!
  Doesn’t always work: overlooks non-linear regressions
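A toy model shows why the staged rollout above can miss a non-linear regression. The slides say only that the regression increased the time to allocate a new file and got worse with more nodes. The quadratic shape, the constants, and the 6 ms budget below are invented for illustration.

```python
def allocation_cost_ms(num_nodes, regressed):
    """Toy model of the time to allocate a new file, in milliseconds."""
    base = 1.0
    if not regressed:
        return base
    # Assumed regression: extra cost grows with the square of the cluster size.
    return base + 1e-6 * num_nodes ** 2


def passes_scale_test(num_nodes, budget_ms=6.0):
    """A cluster 'passes' if allocation stays within the (invented) budget."""
    return allocation_cost_ms(num_nodes, regressed=True) <= budget_ms


# Staged rollout: every cluster up to 2000 nodes looks healthy,
# and the problem only appears once the cluster grows to 2500 nodes.
results = {n: passes_scale_test(n) for n in (100, 500, 2000, 2500)}
```

In this model the small and medium clusters give no warning at all, so "see if anything breaks" succeeds at every step until the largest cluster is already running the release.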

Page 17: The Past, Present, and Future of Hadoop at LinkedIn

HDFS Dynamometer

• Scale testing and performance investigation tool for HDFS
• High fidelity in all the dimensions that matter
• Focused on the NameNode
• Completely black-box
• Accurately fakes thousands of DNs on a small fraction of the hardware
• More details in forthcoming blog post

Page 18: The Past, Present, and Future of Hadoop at LinkedIn

Scaling People and Processes

Page 19: The Past, Present, and Future of Hadoop at LinkedIn


Page 20: The Past, Present, and Future of Hadoop at LinkedIn

Hadoop Performance Tuning

Page 21: The Past, Present, and Future of Hadoop at LinkedIn

Why Are Most User Jobs Poorly Tuned?

Too many dials!
Lots of frameworks: each one is slightly different.
Performance can change over time.
Tuning requires constant monitoring and maintenance!

* Tuning decision tree from “Hadoop In Practice”

Page 22: The Past, Present, and Future of Hadoop at LinkedIn

Dr Elephant: Running Light Without Overbyte

Automated Performance Troubleshooting for Hadoop Workflows

● Detects Common MR and Spark Pathologies:
  ○ Mapper Data Skew
  ○ Reducer Data Skew
  ○ Mapper Input Size
  ○ Mapper Speed
  ○ Reducer Time
  ○ Shuffle & Sort
  ○ More!
● Explains Cause of Disease
● Guided Treatment Process
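To make the "data skew" pathology concrete, here is a minimal sketch of one way such a check could work: compare the largest task input in a stage with the median. This is not Dr Elephant's actual heuristic. The function name, the threshold of 4.0, and the sample sizes are assumptions made for illustration.

```python
from statistics import median


def detect_data_skew(task_input_bytes, ratio_threshold=4.0):
    """Flag a stage as skewed when its largest task input dwarfs the median."""
    if not task_input_bytes:
        return False
    typical = median(task_input_bytes)
    if typical == 0:
        # Most tasks read nothing; any task with input is an outlier.
        return max(task_input_bytes) > 0
    return max(task_input_bytes) / typical >= ratio_threshold


# Hypothetical per-task input sizes in MB.
balanced = [128, 130, 126, 129]
skewed = [128, 130, 126, 2048]

balanced_flag = detect_data_skew(balanced)
skewed_flag = detect_data_skew(skewed)
```

A stage flagged this way finishes only when its one oversized task does, so the usual remedy is to repartition or re-key the input rather than to add more tasks.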

Page 24: The Past, Present, and Future of Hadoop at LinkedIn

Upgrades are Hard

A totally fictional story:
  The Hadoop team pushes a new Pig upgrade
  The next day thirty flows fail with ClassNotFoundExceptions
  Angry users riot
  Property damage exceeds $30mm

What happened?
  The flows depended on a third-party UDF that depended on a transitive dependency provided by the old version of Pig, but not the new version of Pig

Page 25: The Past, Present, and Future of Hadoop at LinkedIn

Bringing Shading Out of the Shadows

What most people think it is
  Package artifact and all dependencies in the same JAR + rename some or all of the package names

What it really is
  Static linking for Java

Unfairly maligned by many people

We built an improved Gradle plugin that makes shading easier for inexperienced users
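The "rename some or all of the package names" step can be illustrated with a small Python sketch of prefix relocation. Real shading tools rewrite class files and the references inside them. This sketch only shows the name mapping, and the package names and the `relocate` helper are hypothetical.

```python
def relocate(class_name, relocations):
    """Rewrite a fully qualified class name using the first matching prefix rule."""
    for old_prefix, new_prefix in relocations.items():
        if class_name == old_prefix or class_name.startswith(old_prefix + "."):
            return new_prefix + class_name[len(old_prefix):]
    return class_name


# Hypothetical rule: move a bundled third-party library under the UDF's own namespace.
rules = {"org.example.json": "com.mycompany.udf.shaded.org.example.json"}

bundled = relocate("org.example.json.Parser", rules)
untouched = relocate("com.mycompany.udf.MyUdf", rules)
```

Because the bundled copy now lives under a private package name, it no longer collides with, or depends on, whatever version of the same library the platform happens to put on the classpath.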

Page 26: The Past, Present, and Future of Hadoop at LinkedIn

Byte-Ray: “X-Ray Goggles for JAR Files”

Audit Hadoop flows for incompatible and unnecessary dependencies.

Predict failures before they happen by scanning for dependencies that won’t be satisfied post-upgrade.

Proved extremely useful during Hadoop2 migration
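The kind of pre-upgrade check described on this slide can be sketched as a set difference. This is not Byte-Ray's implementation, which the slides do not describe. The function and all class names are hypothetical, and a real tool would extract the referenced classes from the JAR files themselves.

```python
def unsatisfied_dependencies(flow_requires, flow_bundles, platform_provides):
    """Classes a flow references that neither its own JARs nor the platform supply."""
    return sorted(set(flow_requires) - set(flow_bundles) - set(platform_provides))


# Hypothetical flow: it bundles its UDF but relies on the platform for a utility class.
flow_requires = {"com.example.udf.Tokenize", "org.example.util.Strings"}
flow_bundles = {"com.example.udf.Tokenize"}

old_platform = {"org.example.util.Strings"}  # provided transitively by the old release
new_platform = set()                         # no longer provided after the upgrade

before = unsatisfied_dependencies(flow_requires, flow_bundles, old_platform)
after = unsatisfied_dependencies(flow_requires, flow_bundles, new_platform)
```

An empty result before the upgrade and a non-empty one after it is the signal that a flow would fail with a ClassNotFoundException once the new release is deployed, as in the Pig upgrade story earlier in the deck.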

Page 27: The Past, Present, and Future of Hadoop at LinkedIn

Byte-Ray in Action

Page 28: The Past, Present, and Future of Hadoop at LinkedIn

SoakCycle: Real World Integration Testing

Page 29: The Past, Present, and Future of Hadoop at LinkedIn

The Future?

Page 30: The Past, Present, and Future of Hadoop at LinkedIn

Dali

2015 was the year of the table

We want to make 2016 the year of the view

Learn more at the Dali talk tomorrow

Page 31: The Past, Present, and Future of Hadoop at LinkedIn

©2014 LinkedIn Corporation. All Rights Reserved.