Hadoop 2 @Twitter, Elephant Scale
Lohit VijayaRenu (@lohitvijayarenu), Gera Shegalov (@gerashegalov)
@TwitterHadoop, v1.0 (29 slides)
About this talk
Share @twitterhadoop’s efforts, experience and learning in moving thousand users and multi petabyte workloads from Hadoop 1 to Hadoop 2
Use cases
● Personalization: graph analysis, recommendations, trends, user/topic modeling
● Analytics: A/B testing, user behavior analysis, API analytics
● Growth: Network Digest, people recommendations, email
● Revenue: engagement prediction, ad targeting, ads analytics, marketplace optimization
● Nielsen Twitter TV Rating: tweet impressions processing
● Backups & Scribe logs: MySQL backups, Manhattan backups, front-end Scribe logs
Many more...
Hadoop and Data pipeline
[Diagram: data pipeline connecting TFE and Scribe logs through real-time, processing, warehouse, cold, and backup Hadoop clusters; sources and sinks include MySQL, HBase, Vertica, Manhattan, Search/Ads partners, SVN/Git, and test clusters]
Elephant Scale
➔ Tens of thousands of Hadoop servers (mix of hardware)
➔ Hundreds of thousands of disk drives
➔ A few hundred PB of data stored in HDFS
➔ Hundreds of thousands of daily Hadoop jobs
➔ Tens of millions of daily Hadoop tasks
Individual Cluster Stats
➔ More than 3500 nodes
➔ 30-50+ PB of data stored in HDFS
➔ 35K RPC/second on NNs
➔ 30K+ jobs per day
➔ 10M+ tasks per day
➔ 6PB+ of data crunched per day
Hadoop 1 Challenges (Q4-2012)
Growth: supporting Twitter growth, requests for new features on an older branch, new Java
Scalability: NameNode files/blocks, NN operations, GC pauses, checkpointing; JobTracker GC pauses, task assignment
Reliability: SPOF NN and JT, NameNode restart delays
Efficiency: slot utilization, QoS, multi-tenancy, new features & frameworks
Maintenance: old codebase, numerous issues fixed in later versions, dev branch
Hadoop 2 Configuration (Q1-2013)
[Diagram: YARN ResourceManager over NodeManager/DataNode workers, a quorum of JournalNodes (JN), and federated namespaces (logs, user, tmp, Trash); surrounding tooling: ViewFS, HDFS Balancer, admin tools, hRaven, metrics and alerts]
Hadoop 2 Migration (Q2-Q4 2013)
Phase 1: Testing
➔ Apache 2.0.3 branch
➔ New hardware*, new OS and JVM
➔ Benchmarks and user jobs (lots of them…)
➔ Dependent component updates
➔ Data movement between different versions

Phase 2: Semi-production
➔ Metrics, alerts, and tools
➔ Production use cases running in 2 clusters in parallel
➔ Tuning/parameter updates and learnings
➔ Started contributing fixes back to the community
➔ Educating users about the new version and changes

Phase 3: Production
➔ Benefits of Hadoop 2
➔ Stable Apache 2.0.5 release with many fixes and backports
➔ Multiple internal releases
➔ Template for new clusters
➔ Ready to roll out the Apache 2.3 release
*http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
CPU Utilization
Hadoop 1 CPU Utilization for one day. (45% peaks)
Hadoop 2 CPU Utilization for one day. (85% peaks)
Memory Utilization
Hadoop 1 Memory Utilization for one day (68% peaks)
Hadoop 2 Memory Utilization for one day (96% peaks)
Migration Challenge: web-based FS
Need a web-based FS to deal with H1/H2 interactions:
● Hftp, based on cross-DC LogMover experience
● Apps broke because no FileNotFoundException was raised for non-existent paths: HDFS-6143
● Faced challenges with cross-version checksums
Migration Challenge: hard-coded FS
1000’s of occurrences of hdfs://${NN}/path and absolute URIs:
● For cluster1, dial the hdfs://hadoop-cluster1-nn.dc CNAME
● For cluster2, dial …

Ideal: use logical paths and viewfs as the defaultFS.
More realistic and faster:
● HDFSCompatibleViewFS: HADOOP-9985
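The viewfs approach maps logical paths to concrete namespaces through a client-side mount table in core-site.xml. A minimal sketch of what such a mount table looks like; the cluster name and NameNode hostnames are illustrative, not Twitter's actual values:

```xml
<!-- core-site.xml: viewfs as the default FS (hypothetical names) -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://cluster1/</value>
</property>
<!-- each logical path can resolve to a different HDFS namespace -->
<property>
  <name>fs.viewfs.mounttable.cluster1.link./user</name>
  <value>hdfs://hadoop-cluster1-nn.dc/user</value>
</property>
<property>
  <name>fs.viewfs.mounttable.cluster1.link./logs</name>
  <value>hdfs://hadoop-cluster1-logs-nn.dc/logs</value>
</property>
```

With this in place, clients open viewfs://cluster1/user/... and never hard-code a NameNode hostname.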
Migration Challenge: Interoperability
Migration in progress: an H1 job requires input from H2:
● hftp://OMGwhatNN/has/my/path problem
● Ideal: use viewfs on H1, resolving to the correct H2 NN
● Realistic: see “hardcoded FS” above
● Even if you know OMGwhatNN, is it active?
[Diagram: an H1 client dials a cluster CNAME that fronts active/standby NameNodes across multiple namespaces]
Load client-side mounttable on the server side:
1. redirect to the right namespace
2. redirect to active within namespace
Migration: Tools and Ecosystem
● Port/recompile/package:
  o Data Access Layer/HCatalog
  o Pig
  o Cascading/Scalding
  o ElephantBird
  o hadoop-lzo
● PIG-3913 (local mode counters)
● Analytics team fixed PIG-2888 (performance)
● hRaven fixes:
  o translation between slot_millis and mb_millis
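The slot_millis/mb_millis translation amounts to scaling task runtime by the memory a slot or container represents. A sketch of the idea; the 1536 MB slot size is illustrative, not Twitter's actual value:

```python
def slot_millis_to_mb_millis(slot_millis, slot_size_mb=1536):
    """Convert Hadoop 1 slot-milliseconds to Hadoop 2 MB-milliseconds
    by multiplying by the memory one map/reduce slot represented."""
    return slot_millis * slot_size_mb

def mb_millis(container_mb, runtime_millis):
    """Hadoop 2 native metric: container memory times runtime."""
    return container_mb * runtime_millis

# A 10-minute task in a 1536 MB slot accounts the same as a
# 10-minute 1536 MB container.
h1 = slot_millis_to_mb_millis(600_000)
h2 = mb_millis(1536, 600_000)
assert h1 == h2 == 921_600_000
```

This keeps job-cost reporting in hRaven comparable across H1 and H2 clusters during the migration.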
HadOops found and fixed
● ViewFS can’t be used for the public DistributedCache (DC):
  o HADOOP-10191, YARN-1542
● getFileStatus RPC storm on the public DC:
  o YARN-1771
● No user-specified progress string in the MR-AM task UI:
  o MAPREDUCE-5550
● Uberized jobs are great for scheduling small jobs, but ...
  o can you kill them? MAPREDUCE-5841
  o are they sized correctly for map-only jobs? YARN-1190
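Uberized jobs run all of a small job's tasks inside the MR-AM's own container, avoiding extra container scheduling. The properties below are the standard Hadoop 2 knobs; the thresholds shown are illustrative:

```xml
<!-- mapred-site.xml: run small jobs inside the AM container -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <!-- 0 or 1; a job with more than one reduce is never uberized -->
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```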
More HadOops
Incident: a job blacklisted nodes by logging terabytes.
● Capping is needed, but userlog.limit.kb loses the valuable log tail
● RollingFileAppender for MR-AM/tasks: MAPREDUCE-5672
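The rolling-appender fix keeps a bounded window of the most recent log output instead of truncating the tail. A configuration sketch; the property names are, to my understanding, the ones introduced around MAPREDUCE-5672, and the values are illustrative:

```xml
<!-- mapred-site.xml: cap task logs but keep the most recent output -->
<property>
  <name>mapreduce.task.userlog.limit.kb</name>
  <value>10240</value> <!-- 10 MB per log file -->
</property>
<property>
  <!-- a value > 0 switches tasks to a rolling appender -->
  <name>yarn.app.mapreduce.task.container.log.backups</name>
  <value>3</value>
</property>
<property>
  <!-- same for the MR-AM itself -->
  <name>yarn.app.mapreduce.am.container.log.backups</name>
  <value>3</value>
</property>
```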
Diagnostics improvement
App/Job/Task kill:
● DAG processors/users can say why:
  o MAPREDUCE-5648, YARN-1551
● MR-AM: “speculation”, “reducer preemption”:
  o MAPREDUCE-5692, MAPREDUCE-5825
● Thread dumps:
  o On task timeout: MAPREDUCE-5044
  o On demand from CLI/UI: MAPREDUCE-5784, ...
UX/UI improvements
● NameNode state and cluster stats
● App size in MB on the RM Apps page
● RM Scheduler UI improvements: queue descriptions, bugs in min/max resource calculation
● Task attempt state filtering in the MR-AM
HDFS-5928, YARN-1945, HDFS-5296, ...
YARN reliability improvements
● Unhealthy nodes / positive feedback:
  o drain containers instead of killing them: YARN-1996
  o don’t rerun maps when all reducers have committed: MAPREDUCE-5817
● RM crash JIRAs, fixed either just internally or publicly:
  o YARN-351, YARN-502
MapReduce usability
● memory.mb as a single tunable: Xmx and sort.mb auto-set
  o mb is optimized on a case-by-case basis
  o MAPREDUCE-5785
● Users want newer artifacts like Guava: job.classloader
  o MAPREDUCE-5146 / 5751 / 5813 / 5814
● Help users debug:
  o thread dump on timeout, and on demand via the UI
  o educate users about heap dumps on OOM and Java profiling
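Without the single-tunable fix, users set the pieces by hand; the standard Hadoop 2 properties are shown below with illustrative values (the 80% heap ratio is an assumption, not a documented rule):

```xml
<!-- mapred-site.xml: container size drives heap and sort buffer -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- the one knob users want to set -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value> <!-- roughly 80% of container memory -->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value> <!-- must fit inside the heap -->
</property>
<!-- isolate user artifacts (e.g. a newer Guava) from Hadoop's classpath -->
<property>
  <name>mapreduce.job.classloader</name>
  <value>true</value>
</property>
```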
Multi-DC environment
MR clients across latency boundaries. Submit fast:
● moving split calculation to the MR-AM: MAPREDUCE-207
DSCP bit coloring for DataXfer:
● HDFS-5175
● Hftp (switched to Apache Commons HttpClient)
DataXfer throttling (client read/write)
YARN: Beyond Java & MapReduce
● MR-AM and other REST APIs across the stack for easy integration with non-JVM tools
● Vowpal Wabbit (production):
  o no extra spanning-tree step
● Spark (semi-production)
Ongoing Project: Shared Cache
MapReduce function shipping: computation -> data
● Teams have jobs with 100’s of jars uploaded via libjars
  o Ideal: manage a jar repo on HDFS; reference jars via DistributedCache instead of uploading
  o Real: currently hard to coordinate
● YARN-1492: manage the artifacts cache transparently
● Measure it:
  o YARN-1529: localization overhead / cache-hit NM metrics
  o MAPREDUCE-5696: job localization counters
Upcoming Challenges
● Reduce ops complexity:
  o grow to 10K+-node clusters
  o try to avoid adding more clusters
● Scalability limits for the NN and RM
● NN heap sizes: large Java heap vs. namespace splitting
● RPC QoS issues
● NN startup: long initial block report processing
● Integrating non-MR frameworks with hRaven
Future Work Ideas
● Productize RM HA and work-preserving restart
● HDFS readable standby NN
● Whole DAG in a single NN namespace
● Contribute to HDFS-5477: dedicated Block Manager service
● NN SLA: fair share for RPC queues: HADOOP-10598
● Finer lock granularity in the NN
Summary: Hadoop 2 @ Twitter
● No JT bottleneck: lightweight RM + MR-AM
● High compute density with flexible slots
● Reduced NN bottleneck using Federation
● HDFS HA removes the angst of trying out new NN configs
● Much closer to upstream, making it easier to consume and contribute fixes
  o Development on the 2.3 branch
● Adopting new frameworks on YARN
Conclusion
Migrating 1000+ users/use cases is anything but trivial… however,
● Hadoop 2 made it worthwhile
● Hadoop 2 contributions:
  o 40+ patches committed
  o ~40 in review
Thank you! Questions?
@JoinTheFlock about.twitter.com/careers
@TwitterHadoop
Catch up with us in person: @LohitVijayaRenu @GeraShegalov