Hadoop 2 @Twitter, Elephant Scale
Lohit VijayaRenu (@lohitvijayarenu), Gera Shegalov (@gerashegalov)
@TwitterHadoop, v1.0 (29 slides)
About this talk
Share @twitterhadoop’s efforts, experience and learning in moving thousand users and multi petabyte workloads from Hadoop 1 to Hadoop 2
Use cases
● Personalization: graph analysis, recommendations, trends, user/topic modeling
● Analytics: A/B testing, user behavior analysis, API analytics
● Growth: Network Digest, people recommendations, email
● Revenue: engagement prediction, ad targeting, ads analytics, marketplace optimization
● Nielsen Twitter TV Rating: tweet impressions processing
● Backups & Scribe logs: MySQL backups, Manhattan backups, front-end Scribe logs
Many more...
Hadoop and Data pipeline
[Diagram: data pipeline connecting TFE and Scribe logs through real-time, processing, warehouse, cold, and backup Hadoop clusters; sources and sinks include MySQL, HBase, Vertica, Manhattan, Search/Ads partners, SVN/Git, and test clusters]
Elephant Scale
➔ Tens of thousands of Hadoop servers (mix of hardware)
➔ Hundreds of thousands of disk drives
➔ A few hundred PB of data stored in HDFS
➔ Hundreds of thousands of daily Hadoop jobs
➔ Tens of millions of daily Hadoop tasks
Individual Cluster Stats
➔ More than 3500 nodes
➔ 30-50+ PB of data stored in HDFS
➔ 35K RPC/second on NNs
➔ 30K+ jobs per day
➔ 10M+ tasks per day
➔ 6PB+ of data crunched per day
Hadoop 1 Challenges (Q4-2012)
Growth: supporting Twitter growth, requests for new features on an older branch, new Java
Scalability: NameNode files/blocks, NN operations, GC pauses, checkpointing; JobTracker GC pauses, task assignment
Reliability: SPOF NN and JT, NameNode restart delays
Efficiency: slot utilization, QoS, multi-tenancy, new features & frameworks
Maintenance: old codebase, numerous issues fixed in later versions, dev branch
Hadoop 2 Configuration (Q1-2013)
[Diagram: YARN ResourceManager over NodeManager/DataNode workers, a quorum of JournalNodes (JN), and federated namespaces (logs, user, tmp, Trash); surrounding tooling: ViewFS, HDFS Balancer, admin tools, hRaven, metrics and alerts]
Hadoop 2 Migration (Q2-Q4 2013)
Phase 1: Testing
➔ Apache 2.0.3 branch
➔ New hardware*, new OS and JVM
➔ Benchmarks and user jobs (lots of them…)
➔ Dependent component updates
➔ Data movement between different versions

Phase 2: Semi-production
➔ Metrics, alerts, and tools
➔ Production use cases running in 2 clusters in parallel
➔ Tuning/parameter updates and learnings
➔ Started contributing fixes back to the community
➔ Educating users about the new version and changes

Phase 3: Production
➔ Benefits of Hadoop 2
➔ Stable Apache 2.0.5 release with many fixes and backports
➔ Multiple internal releases
➔ Template for new clusters
➔ Ready to roll out the Apache 2.3 release
*http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
CPU Utilization
Hadoop 1 CPU Utilization for one day. (45% peaks)
Hadoop 2 CPU Utilization for one day. (85% peaks)
Memory Utilization
Hadoop 1 Memory Utilization for one day (68% peaks)
Hadoop 2 Memory Utilization for one day (96% peaks)
Migration Challenge: web-based FS
Need a web-based FS to deal with H1/H2 interactions:
● Hftp, based on cross-DC LogMover experience
● Apps broke because no FileNotFoundException was raised for non-existent paths: HDFS-6143
● Faced challenges with cross-version checksums
Migration Challenge: hard-coded FS
1000’s of occurrences of hdfs://${NN}/path and absolute URIs:
● For cluster1, dial the hdfs://hadoop-cluster1-nn.dc CNAME
● For cluster2, dial …

Ideal: use logical paths and viewfs as the defaultFS.
More realistic and faster:
● HDFSCompatibleViewFS: HADOOP-9985
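The viewfs approach maps logical paths to concrete namespaces through a client-side mount table in core-site.xml. A minimal sketch of what such a mount table looks like; the cluster name and NameNode hostnames are illustrative, not Twitter's actual values:

```xml
<!-- core-site.xml: viewfs as the default FS (hypothetical names) -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://cluster1/</value>
</property>
<!-- each logical path can resolve to a different HDFS namespace -->
<property>
  <name>fs.viewfs.mounttable.cluster1.link./user</name>
  <value>hdfs://hadoop-cluster1-nn.dc/user</value>
</property>
<property>
  <name>fs.viewfs.mounttable.cluster1.link./logs</name>
  <value>hdfs://hadoop-cluster1-logs-nn.dc/logs</value>
</property>
```

With this in place, clients open viewfs://cluster1/user/... and never hard-code a NameNode hostname.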
Migration Challenge: Interoperability
Migration in progress: an H1 job requires input from H2:
● hftp://OMGwhatNN/has/my/path problem
● Ideal: use viewfs on H1, resolving to the correct H2 NN
● Realistic: see “hardcoded FS” above
● Even if you know OMGwhatNN, is it active?
[Diagram: an H1 client dials a cluster CNAME that fronts active/standby NameNodes across multiple namespaces]
Load client-side mounttable on the server side:
1. redirect to the right namespace
2. redirect to active within namespace
Migration: Tools and Ecosystem
● Port/recompile/package:
  o Data Access Layer/HCatalog
  o Pig
  o Cascading/Scalding
  o ElephantBird
  o hadoop-lzo
● PIG-3913 (local mode counters)
● Analytics team fixed PIG-2888 (performance)
● hRaven fixes:
  o translation between slot_millis and mb_millis
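The slot_millis/mb_millis translation amounts to scaling task runtime by the memory a slot or container represents. A sketch of the idea; the 1536 MB slot size is illustrative, not Twitter's actual value:

```python
def slot_millis_to_mb_millis(slot_millis, slot_size_mb=1536):
    """Convert Hadoop 1 slot-milliseconds to Hadoop 2 MB-milliseconds
    by multiplying by the memory one map/reduce slot represented."""
    return slot_millis * slot_size_mb

def mb_millis(container_mb, runtime_millis):
    """Hadoop 2 native metric: container memory times runtime."""
    return container_mb * runtime_millis

# A 10-minute task in a 1536 MB slot accounts the same as a
# 10-minute 1536 MB container.
h1 = slot_millis_to_mb_millis(600_000)
h2 = mb_millis(1536, 600_000)
assert h1 == h2 == 921_600_000
```

This keeps job-cost reporting in hRaven comparable across H1 and H2 clusters during the migration.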
HadOops found and fixed
● ViewFS can’t be used for the public DistributedCache (DC):
  o HADOOP-10191, YARN-1542
● getFileStatus RPC storm on the public DC:
  o YARN-1771
● No user-specified progress string in the MR-AM task UI:
  o MAPREDUCE-5550
● Uberized jobs are great for scheduling small jobs, but ...
  o can you kill them? MAPREDUCE-5841
  o are they sized correctly for map-only jobs? YARN-1190
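Uberized jobs run all of a small job's tasks inside the MR-AM's own container, avoiding extra container scheduling. The properties below are the standard Hadoop 2 knobs; the thresholds shown are illustrative:

```xml
<!-- mapred-site.xml: run small jobs inside the AM container -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <!-- 0 or 1; a job with more than one reduce is never uberized -->
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```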
More HadOops
Incident: a job blacklisted nodes by logging terabytes.
● Capping is needed, but userlog.limit.kb loses the valuable log tail
● RollingFileAppender for MR-AM/tasks: MAPREDUCE-5672
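The rolling-appender fix keeps a bounded window of the most recent log output instead of truncating the tail. A configuration sketch; the property names are, to my understanding, the ones introduced around MAPREDUCE-5672, and the values are illustrative:

```xml
<!-- mapred-site.xml: cap task logs but keep the most recent output -->
<property>
  <name>mapreduce.task.userlog.limit.kb</name>
  <value>10240</value> <!-- 10 MB per log file -->
</property>
<property>
  <!-- a value > 0 switches tasks to a rolling appender -->
  <name>yarn.app.mapreduce.task.container.log.backups</name>
  <value>3</value>
</property>
<property>
  <!-- same for the MR-AM itself -->
  <name>yarn.app.mapreduce.am.container.log.backups</name>
  <value>3</value>
</property>
```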
Diagnostics improvement
App/Job/Task kill:
● DAG processors/users can say why:
  o MAPREDUCE-5648, YARN-1551
● MR-AM: “speculation”, “reducer preemption”:
  o MAPREDUCE-5692, MAPREDUCE-5825
● Thread dumps:
  o On task timeout: MAPREDUCE-5044
  o On demand from CLI/UI: MAPREDUCE-5784, ...
UX/UI improvements
● NameNode state and cluster stats
● App size in MB on the RM Apps page
● RM Scheduler UI improvements: queue descriptions, bugs in min/max resource calculation
● Task attempt state filtering in the MR-AM
HDFS-5928, YARN-1945, HDFS-5296, ...
YARN reliability improvements
● Unhealthy nodes / positive feedback:
  o drain containers instead of killing them: YARN-1996
  o don’t rerun maps when all reducers have committed: MAPREDUCE-5817
● RM crash JIRAs, fixed either just internally or publicly:
  o YARN-351, YARN-502
MapReduce usability
● memory.mb as a single tunable: Xmx and sort.mb auto-set
  o mb is optimized on a case-by-case basis
  o MAPREDUCE-5785
● Users want newer artifacts like Guava: job.classloader
  o MAPREDUCE-5146 / 5751 / 5813 / 5814
● Help users debug:
  o thread dump on timeout, and on demand via the UI
  o educate users about heap dumps on OOM and Java profiling
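Without the single-tunable fix, users set the pieces by hand; the standard Hadoop 2 properties are shown below with illustrative values (the 80% heap ratio is an assumption, not a documented rule):

```xml
<!-- mapred-site.xml: container size drives heap and sort buffer -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- the one knob users want to set -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value> <!-- roughly 80% of container memory -->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value> <!-- must fit inside the heap -->
</property>
<!-- isolate user artifacts (e.g. a newer Guava) from Hadoop's classpath -->
<property>
  <name>mapreduce.job.classloader</name>
  <value>true</value>
</property>
```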
Multi-DC environment
MR clients across latency boundaries. Submit fast:
● moving split calculation to the MR-AM: MAPREDUCE-207
DSCP bit coloring for DataXfer:
● HDFS-5175
● Hftp (switched to Apache Commons HttpClient)
DataXfer throttling (client read/write)
YARN: Beyond Java & MapReduce
● MR-AM and other REST APIs across the stack for easy integration with non-JVM tools
● Vowpal Wabbit (production):
  o no extra spanning-tree step
● Spark (semi-production)
Ongoing Project: Shared Cache
MapReduce function shipping: computation -> data
● Teams have jobs with 100’s of jars uploaded via libjars
  o Ideal: manage a jar repo on HDFS; reference jars via DistributedCache instead of uploading
  o Real: currently hard to coordinate
● YARN-1492: manage the artifacts cache transparently
● Measure it:
  o YARN-1529: localization overhead / cache-hit NM metrics
  o MAPREDUCE-5696: job localization counters
Upcoming Challenges
● Reduce ops complexity:
  o grow to 10K+-node clusters
  o try to avoid adding more clusters
● Scalability limits for the NN and RM
● NN heap sizes: large Java heap vs. namespace splitting
● RPC QoS issues
● NN startup: long initial block report processing
● Integrating non-MR frameworks with hRaven
Future Work Ideas
● Productize RM HA and work-preserving restart
● HDFS readable standby NN
● Whole DAG in a single NN namespace
● Contribute to HDFS-5477: dedicated Block Manager service
● NN SLA: fair share for RPC queues: HADOOP-10598
● Finer lock granularity in the NN
Summary: Hadoop 2 @ Twitter
● No JT bottleneck: lightweight RM + MR-AM
● High compute density with flexible slots
● Reduced NN bottleneck using Federation
● HDFS HA removes the angst of trying out new NN configs
● Much closer to upstream, making it easier to consume and contribute fixes
  o Development on the 2.3 branch
● Adopting new frameworks on YARN
Conclusion
Migrating 1000+ users/use cases is anything but trivial… however,
● Hadoop 2 made it worthwhile
● Hadoop 2 contributions:
  o 40+ patches committed
  o ~40 in review
Thank you! Questions?
@JoinTheFlock about.twitter.com/careers
@TwitterHadoop
Catch up with us in person: @LohitVijayaRenu @GeraShegalov