
Page 1: Hadoop 2 @Twitter, Elephant Scale

Hadoop 2 @Twitter, Elephant Scale

Lohit VijayaRenu, Gera Shegalov
@lohitvijayarenu @gerashegalov @TwitterHadoop

Quay Ly
hbase-dw (in blue) should be hbase-proc
Gera Shegalov
Should we skip this? We can say this on the title slide.
Lohit VijayaRenu
I agree. Let's remove it.
Gera Shegalov
Should we split this slide in two? They encourage a 24 pt font.
Page 2: Hadoop 2 @Twitter, Elephant Scale

About this talk

Share the @TwitterHadoop team's efforts, experience, and lessons learned in moving a thousand users and multi-petabyte workloads from Hadoop 1 to Hadoop 2.

Page 3: Hadoop 2 @Twitter, Elephant Scale

Use cases

Personalization: graph analysis, recommendations, trends, user/topic modeling
Analytics: A/B testing, user behavior analysis, API analytics
Growth: network digest, people recommendations, email
Revenue: engagement prediction, ad targeting, ads analytics, marketplace optimization
Nielsen Twitter TV Rating: Tweet impressions processing
Backups & Scribe logs: MySQL backups, Manhattan backups, front-end Scribe logs

Many more...

Page 4: Hadoop 2 @Twitter, Elephant Scale

Hadoop and Data pipeline

[Diagram: the data pipeline. TFE feeds hadoop real time; data flows on through hadoop processing, hadoop warehouse, hadoop cold, and hadoop backups. Surrounding systems include Search, Ads, etc., Partners, MySQL, hadoop hbase, Vertica, Manhattan, SVN/Git, and hadoop tst test clusters.]

Lohit VijayaRenu
Remember to ask Joep about the retention policy for each cluster
Joep Rottinghuis
rt clusters 0-7 days, processing 0-30 days, DW ~0-1 year (depending on dataset), cold ~1 year+
Page 5: Hadoop 2 @Twitter, Elephant Scale

Elephant Scale

➔ Tens of thousands of Hadoop servers (mix of hardware)
➔ Hundreds of thousands of disk drives
➔ A few hundred PB of data stored in HDFS
➔ Hundreds of thousands of daily Hadoop jobs
➔ Tens of millions of daily Hadoop tasks

Individual Cluster Stats

➔ More than 3500 nodes
➔ 30-50+ PB of data stored in HDFS
➔ 35K RPC/second on NNs
➔ 30K+ jobs per day
➔ 10M+ tasks per day
➔ 6PB+ data crunched per day

Page 6: Hadoop 2 @Twitter, Elephant Scale

Hadoop 1 Challenges (Q4-2012)

Growth: supporting Twitter growth, requests for new features on an older branch, newer Java
Scalability: NameNode files/blocks, NN operations, GC pauses, checkpointing; JobTracker GC pauses, task assignment
Reliability: SPOF NN and JT, NameNode restart delays
Efficiency: slot utilization, QoS, multi-tenancy, new features & frameworks
Maintenance: old codebase, numerous issues fixed in later versions, dev branch

Page 7: Hadoop 2 @Twitter, Elephant Scale

Hadoop 2 Configuration (Q1-2013)

[Diagram: YARN ResourceManager managing worker nodes, each running a NodeManager and DataNode; JournalNodes (JN) backing the namespaces; ViewFS, HDFS Balancer, admin tools, hRaven, metrics and alerts; per-namespace logs, user, tmp, and Trash directories.]

Page 8: Hadoop 2 @Twitter, Elephant Scale

Hadoop 2 Migration (Q2-Q4 2013)

Phase 1: Testing
➔ Apache 2.0.3 branch
➔ New hardware*, new OS and JVM
➔ Benchmarks and user jobs (lots of them…)
➔ Dependent component updates
➔ Data movement between different versions

Phase 2: Semi-production
➔ Metrics, alerts, and tools
➔ Production use cases running in 2 clusters in parallel
➔ Tuning/parameter updates and learnings
➔ Started contributing fixes back to the community
➔ Educating users about the new version and changes
➔ Benefits of Hadoop 2

Phase 3: Production
➔ Stable Apache 2.0.5 release with many fixes and backports
➔ Multiple internal releases
➔ Template for new clusters
➔ Ready to roll Apache 2.3 release

*http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter

Page 9: Hadoop 2 @Twitter, Elephant Scale

CPU Utilization

[Graph: Hadoop 1 CPU utilization for one day (45% peaks)]
[Graph: Hadoop 2 CPU utilization for one day (85% peaks)]

Page 10: Hadoop 2 @Twitter, Elephant Scale

Memory Utilization

[Graph: Hadoop 1 memory utilization for one day (68% peaks)]
[Graph: Hadoop 2 memory utilization for one day (96% peaks)]

Quay Ly
Perhaps add a highlight for the avg memory number on the left? It's too hard to read the number from the graph
Lohit VijayaRenu
[email protected] I could not figure out a way to do it in VEX; do we have to do it by hand?
Page 11: Hadoop 2 @Twitter, Elephant Scale

Migration Challenge: web-based FS

Need a web-based FS to deal with H1/H2 interactions
● Hftp, chosen based on cross-DC LogMover experience
● Apps broke because no FileNotFoundException (FNF) was raised for non-existent paths: HDFS-6143 (see the sketch below)
● Faced challenges with cross-version checksums
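For illustration, a minimal sketch (not code from the deck) of the defensive pattern this forces on clients: check existence explicitly instead of relying on a web filesystem to raise FileNotFoundException for a missing path. The NN host and path are made up.

    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SafeWebFsOpen {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Illustrative Hftp URI; in H1/H2 interop the reader dials the remote NN's HTTP port.
        Path input = new Path("hftp://remote-nn.dc:50070/logs/2013/part-00000");
        FileSystem fs = input.getFileSystem(conf);
        // Pre-HDFS-6143, a missing path over Hftp did not surface as FileNotFoundException,
        // so apps that relied on catching FNF broke. Checking explicitly is version-agnostic.
        if (!fs.exists(input)) {
          throw new FileNotFoundException("Input does not exist: " + input);
        }
        fs.open(input).close();
      }
    }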

Page 12: Hadoop 2 @Twitter, Elephant Scale

Migration Challenge: hard-coded FS

1000's of occurrences of hdfs://${NN}/path and other absolute URIs
● For cluster1, dial the hdfs://hadoop-cluster1-nn.dc CNAME
● For cluster2, dial …

Ideal: use logical paths and viewfs as the defaultFS (see the sketch below)
More realistic and faster:
● HDFSCompatibleViewFS HADOOP-9985
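A minimal sketch of the "ideal" setup described above: a client-side viewfs mount table so jobs use logical paths rather than hard-coded hdfs://${NN}/ URIs. The mount-table name and NameNode hosts are illustrative, not Twitter's actual values; the fs.viewfs.mounttable.*.link properties are standard Hadoop 2 ViewFS configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsDefaultFs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical cluster name; client code never names a physical NN.
        conf.set("fs.defaultFS", "viewfs://cluster1/");
        // Mount points mapping logical paths to namespaces (hosts are illustrative).
        conf.set("fs.viewfs.mounttable.cluster1.link./user",
            "hdfs://hadoop-cluster1-nn.dc/user");
        conf.set("fs.viewfs.mounttable.cluster1.link./logs",
            "hdfs://hadoop-cluster1-nn2.dc/logs");
        FileSystem fs = FileSystem.get(conf);
        // Resolves through the mount table to the backing namespace.
        System.out.println(fs.makeQualified(new Path("/user/alice")));
      }
    }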

Page 13: Hadoop 2 @Twitter, Elephant Scale

Migration Challenge: Interoperability

Migration in progress: an H1 job requires input from H2
● the hftp://OMGwhatNN/has/my/path problem
● Ideal: use viewfs on H1, resolving to the correct H2 NN
● Realistic: see "hard-coded FS" above
● Even if you know OMGwhatNN, is it active? (see the HA client sketch below)
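On the "is it active?" question for HDFS (not Hftp) access, a minimal sketch of standard Hadoop 2 client-side HA configuration, with illustrative names throughout: the client dials a logical nameservice URI and the failover proxy provider locates the active NN.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HaClientConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice with two NameNodes (hosts are illustrative).
        conf.set("dfs.nameservices", "cluster2");
        conf.set("dfs.ha.namenodes.cluster2", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.cluster2.nn1", "nn1.dc:8020");
        conf.set("dfs.namenode.rpc-address.cluster2.nn2", "nn2.dc:8020");
        // Client-side failover: try both NNs and stick with the active one.
        conf.set("dfs.client.failover.proxy.provider.cluster2",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // The client never needs to know which NN is active.
        FileSystem fs = new Path("hdfs://cluster2/").getFileSystem(conf);
        System.out.println(fs.getUri());
      }
    }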

Page 14: Hadoop 2 @Twitter, Elephant Scale

[Diagram: an H1 client dials a cluster CNAME that fronts Active/Standby NameNode pairs in multiple namespaces.]

Load the client-side mount table on the server side:

1. redirect to the right namespace
2. redirect to the active NN within the namespace

Page 15: Hadoop 2 @Twitter, Elephant Scale

Migration: Tools and Ecosystem

● Port/recompile/package:
  o Data Access Layer/HCatalog
  o Pig
  o Cascading/Scalding
  o ElephantBird
  o hadoop-lzo
● PIG-3913 (local mode counters)
● Analytics team fixed PIG-2888 (performance)
● hRaven fixes:
  o translation between slot_millis and mb_millis

Joep Rottinghuis
Another idea might be to make the public JIRA numbers red if they are not upstream as of now
Gera Shegalov
Good idea; not sure I have the energy to check JIRA statuses now. Will add a summary at the end: how many JIRAs committed / in review
Page 16: Hadoop 2 @Twitter, Elephant Scale

HadOops found and fixed

● ViewFS can't be used for the public DistributedCache (DC)
  o HADOOP-10191, YARN-1542
● getFileStatus RPC storm on the public DC:
  o YARN-1771
● No user-specified progress string on the MR-AM UI task page
  o MAPREDUCE-5550
● Uberized jobs are great for scheduling small jobs, but...
  o can you kill them? MAPREDUCE-5841
  o are they sized correctly for map-only jobs? YARN-1190

Page 17: Hadoop 2 @Twitter, Elephant Scale

More HadOops

Incident: a job blacklists nodes by logging terabytes
● Need capping, but userlog.limit.kb loses the valuable log tail
● RollingFileAppender for MR-AM/tasks: MAPREDUCE-5672 (see the sketch below)
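A sketch of the knobs involved, assuming the property names as they appear in mapred-default.xml after MAPREDUCE-5672; the values are illustrative, not Twitter's settings.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CappedRollingLogs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap per-task user logs (KB); with plain truncation the tail past the cap is lost.
        conf.setInt("mapreduce.task.userlog.limit.kb", 10 * 1024);
        // MAPREDUCE-5672: roll task and MR-AM container logs instead of truncating,
        // keeping a few rolled segments so the recent tail survives the cap.
        conf.setInt("yarn.app.mapreduce.task.container.log.backups", 3);
        conf.setInt("yarn.app.mapreduce.am.container.log.backups", 3);
        Job job = Job.getInstance(conf, "log-capped-job");
        // ... job setup as usual ...
      }
    }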

Page 18: Hadoop 2 @Twitter, Elephant Scale

Diagnostics improvement

App/Job/Task kill:
● DAG processors/users can say why
  o MAPREDUCE-5648, YARN-1551
● MR-AM: "speculation", "reducer preemption"
  o MAPREDUCE-5692, MAPREDUCE-5825
● Thread dumps
  o On task timeout: MAPREDUCE-5044
  o On demand from CLI/UI: MAPREDUCE-5784, ...

Page 19: Hadoop 2 @Twitter, Elephant Scale

UX/UI improvements

● NameNode state and cluster stats
● App size in MB on the RM Apps page
● RM Scheduler UI improvements: queue descriptions, fixes to min/max resource calculation bugs
● Task attempt state filtering in the MR-AM

HDFS-5928, YARN-1945, HDFS-5296, ...

Page 20: Hadoop 2 @Twitter, Elephant Scale

YARN reliability improvements

● Unhealthy nodes / positive feedback
  o drain containers instead of killing: YARN-1996
  o don't rerun maps when all reduces have committed: MAPREDUCE-5817
● RM crashes: JIRAs fixed either just internally or publicly
  o YARN-351, YARN-502

Page 21: Hadoop 2 @Twitter, Elephant Scale

MapReduce usability

● memory.mb as a single tunable: Xmx and sort.mb auto-set
  o mb is optimized on a case-by-case basis
  o MAPREDUCE-5785
● Users want newer artifacts like Guava: job.classloader (see the sketch below)
  o MAPREDUCE-5146 / 5751 / 5813 / 5814
● Help users debug
  o thread dump on timeout, and on demand via the UI
  o educate users about heap dumps on OOM and Java profiling
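A minimal sketch of the job.classloader setup mentioned above, so a job can ship a newer Guava without clashing with Hadoop's copy; the system-classes value here is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class IsolatedClasspathJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Load job classes (e.g., a newer Guava shipped with the job) in a
        // separate classloader instead of the system classpath.
        conf.setBoolean("mapreduce.job.classloader", true);
        // Classes matching these prefixes still resolve from the system
        // classloader (illustrative value).
        conf.set("mapreduce.job.classloader.system.classes",
            "java.,javax.,org.apache.hadoop.");
        Job job = Job.getInstance(conf, "isolated-classpath-job");
        // ... set mapper/reducer and input/output paths as usual ...
      }
    }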

Page 22: Hadoop 2 @Twitter, Elephant Scale

Multi-DC environment

MR clients across latency boundaries. Submit fast:
● moving split calculation to the MR-AM: MAPREDUCE-207

DSCP bit coloring for DataXfer
● HDFS-5175
● Hftp (switched to Apache Commons HttpClient)

DataXfer throttling (client RW)

Page 23: Hadoop 2 @Twitter, Elephant Scale

YARN: Beyond Java & MapReduce

● MR-AM and other REST APIs across the stack for easy integration in non-JVM tools
● Vowpal Wabbit (production)
  o no extra spanning-tree step
● Spark (semi-production)

Page 24: Hadoop 2 @Twitter, Elephant Scale

Ongoing Project: Shared Cache

MapReduce function shipping: computation -> data
● Teams have jobs with 100's of jars uploaded via libjars
  o Ideal: manage a jar repo on HDFS
  o Reference jars via DistributedCache instead of uploading (see the sketch below)
  o Real: currently hard to coordinate
● YARN-1492: manage the artifacts cache transparently
● Measure it:
  o YARN-1529: localization overhead/cache-hit NM metrics
  o MAPREDUCE-5696: job localization counters
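A minimal sketch of the "reference instead of upload" idea, assuming a hypothetical jar repo path on HDFS: the jar is added to the job classpath straight from HDFS via the distributed cache, so the client re-uploads nothing at submit time.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class CachedJarJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cached-jar-job");
        // Hypothetical shared repo path; the jar already lives on HDFS, so the
        // distributed cache localizes it on the nodes instead of the client
        // shipping it with -libjars on every submission.
        job.addFileToClassPath(new Path("/repo/jars/parsers-1.2.3.jar"));
        // ... mapper/reducer and input/output setup as usual ...
      }
    }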

Page 25: Hadoop 2 @Twitter, Elephant Scale

Upcoming Challenges

● Reduce ops complexity:
  o grow to 10K+-node clusters
  o try to avoid adding more clusters
● Scalability limits for NN, RM
● NN heap sizes: large Java heaps vs. namespace splitting
● RPC QoS issues
● NN startup: long initial block report processing
● Integrating non-MR frameworks with hRaven

Quay Ly
Heap size? When we go to 10K+ nodes, we also expect the number of files to grow, right? The challenge of splitting the namespace
Joep Rottinghuis
I think you mean to say grow to 10K+ _node_ clusters; 10K clusters are not easily operated :) I'd also add the RPC throttling and back-pressure items here. Finer lock granularity is a potential solution; the others are challenges.
Lohit VijayaRenu
I also see that we seem to suggest we have only HDFS future challenges and are not thinking about YARN. We should add a few items. One thing that came to mind: as we add new frameworks, we have to think about a uniform way of integrating them with MapReduce-related tools like hRaven, the history server, and such.
Page 26: Hadoop 2 @Twitter, Elephant Scale

Future Work Ideas

● Productize RM HA and work-preserving restart
● HDFS readable standby NN
● Whole DAG in a single NN namespace
● Contribute to HDFS-5477: dedicated block manager (BM) service
● NN SLA: fair share for RPC queues: HADOOP-10598
● Finer lock granularity in the NN

Page 27: Hadoop 2 @Twitter, Elephant Scale

Summary: Hadoop 2 @ Twitter

● No JT bottleneck: lightweight RM + MR-AM
● High compute density with flexible slots
● Reduced NN bottleneck using Federation
● HDFS HA removes the angst of trying out new NN configs
● Much closer to upstream, to consume/contribute fixes
  o Development on the 2.3 branch
● Adopting new frameworks on YARN

Page 28: Hadoop 2 @Twitter, Elephant Scale

Conclusion

Migrating 1000+ users/use cases is anything but trivial… however,
● Hadoop 2 made it worthwhile
● Hadoop 2 contributions:
  o 40+ patches committed
  o ~40 in review

Page 29: Hadoop 2 @Twitter, Elephant Scale

Thank you! Questions?

@JoinTheFlock about.twitter.com/careers
@TwitterHadoop

Catch up with us in person
@LohitVijayaRenu @GeraShegalov