Pig on Tez - Low Latency ETL with Big Data


1. Pig on Tez
   Daniel Dai @daijy
   Rohini Palaniswamy @rohini_pswamy
   Hadoop Summit 2014, San Jose

2. Agenda
   - Team introduction
   - Apache Pig
   - Why Pig on Tez?
   - Pig on Tez: design, Tez features in Pig, performance, current status, future plan

3. Apache Pig on Tez Team
   - Daniel Dai, Pig PMC, Hortonworks
   - Rohini Palaniswamy, Pig PMC, Yahoo!
   - Olga Natkovich, Pig PMC, Yahoo!
   - Cheolsoo Park, VP Pig and Pig PMC, Netflix
   - Mark Wagner, Pig Committer, LinkedIn
   - Alex Bain, Pig Contributor, LinkedIn

4. Pig Latin
   - Procedural scripting language, closer to relational algebra
   - Heavily used for ETL
   - Schema or no-schema data: Pig eats everything
   - More than SQL, and feature rich: multiquery, nested foreach, Illustrate, algebraic and accumulator Java UDFs, script embedding, scalars, macros, non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby), distributed order by, skewed join

5. Pig Users
   - Heavily used for ETL at web scale by major internet companies
   - At Yahoo!: 60% of total Hadoop jobs run daily; 12 million monthly Pig jobs
   - Other heavy users: Twitter, Netflix, LinkedIn, eBay, Salesforce
   - A standard data science tool, found in university textbooks

6. Why Pig on Tez?
   - DAG execution framework
   - Low-level DAG framework: build the DAG by defining vertices and edges; customize the scheduling of the DAG and the routing of data
   - Highly customizable, with pluggable implementations
   - Resource efficient: better performance without having to increase memory
   - Natively built on top of YARN: multi-tenancy and resource allocation come for free; scale; security
   - Excellent support from the Tez community: Bikas Saha, Siddharth Seth, Hitesh Shah

7. PIG on TEZ

8. Design
   [Diagram] The logical plan is translated into a physical plan by LogToPhyTranslationVisitor. From there, TezCompiler produces a Tez plan for the Tez execution engine, while MRCompiler produces an MR plan for the MR execution engine.

9. DAG Plan: Split Group By + Join

      f = LOAD 'foo' AS (x, y, z);
      g1 = GROUP f BY y;
      g2 = GROUP f BY z;
      j = JOIN g1 BY group, g2 BY group;

   [Diagram] On MR, "Load foo" feeds two separate group-by jobs that each write to HDFS, and a third job reloads g1 and g2 to join them. On Tez, a split multiplexes "Load foo" into "Group by y" and "Group by z", whose multiple outputs feed the Join vertex directly (reduce follows reduce, with de-multiplexing at the join).

10. DAG Execution: Visualization
    [Diagram] Vertex 1 (Load, MRInput) feeds Vertex 2 (Group) and Vertex 3 (Group), which feed Vertex 4 (Join, MROutput).

11. DAG Plan: Distributed Order By

      A = LOAD 'foo' AS (x, y);
      B = FILTER A BY $0 IS NOT NULL;
      C = ORDER B BY x;

    [Diagram] On MR: a Load/Filter stage, a Sample stage whose aggregated sample map is staged on the distributed cache, then Partition and Sort stages, with HDFS writes between jobs. On Tez: Load/Filter & Sample, Aggregate, Partition, and Sort are vertices of one DAG; the sample map is broadcast and cached instead of going through the distributed cache, and the loaded data reaches the Partition vertex over an unsorted 1-1 edge.

12. Session Reuse
    - Feature: session reuse submits more than one DAG to the same AM.
    - Usage: each Pig script uses a single session; the Grunt shell uses one session for all commands until timeout; more than one DAG is submitted for merge join and exec.
    - Benefits: a Pig script with 5 MR jobs launches 5 AM containers, while a single AM per Pig script on Tez saves capacity. It also eliminates the queue and resource contention that every new MR job in the pipeline of a multi-stage Pig script faces on MR.

13. Container Reuse
    - Feature: container reuse runs new tasks on already-launched containers (JVMs).
    - Usage: turned on by default for all Pig scripts and the Grunt shell.
    - Benefits: reduced launch overhead (container request and release, resource localization, JVM launch time); reduced network IO (1-1 edge tasks are launched on the same node); object caching.
    - User impact: users have to review, profile, and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks due to JVM reuse. (A sketch of the session-reuse benefit follows this slide.)
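To make the session and container reuse benefits above concrete, here is a minimal multi-stage Pig Latin sketch; the paths, fields, and aliases are hypothetical, not from the talk. On MR, each stage of a script like this runs as a separate job paying its own AM launch and scheduling cost; on Tez, the whole script runs as a DAG inside one reused AM and session:

      -- Hypothetical multi-stage ETL script. On MR this compiles into a
      -- pipeline of jobs (join, group, order), each launching its own AM;
      -- on Tez it runs as one DAG in a single session, on reused containers.
      users   = LOAD '/data/users'  AS (uid:long, country:chararray);
      clicks  = LOAD '/data/clicks' AS (uid:long, url:chararray);
      joined  = JOIN clicks BY uid, users BY uid;
      grouped = GROUP joined BY users::country;
      counts  = FOREACH grouped GENERATE group AS country, COUNT(joined) AS n;
      ordered = ORDER counts BY n DESC;
      STORE ordered INTO '/out/clicks_by_country';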
14. Custom Vertex Input/Output/Processor/Manager
    - Features: custom vertex processor; custom input and output between vertices; custom vertex manager.
    - Usage: PigProcessor instead of MapProcessor and ReduceProcessor; unsorted input/output with a partitioner (union goes without a partitioner); broadcast edge (replicate join, order by, and skewed join); 1-1 edge (order by, skewed join, and multiquery off); custom vertex manager for automatic parallelism estimation.
    - Benefits: no framework restrictions like MR; more efficient processing and algorithms.

15. Broadcast Edge and Object Caching
    - Features: broadcast edge broadcasts the same data to all tasks in successor vertices; object caching caches objects in memory scoped to a vertex, DAG, or session; input fetch on choice, i.e. fetching can be skipped when the data is already cached.
    - Usage: the small table of a replicate join; the partitioning samples for order by and skewed join.
    - Benefits: replaces the distributed cache and avoids the NodeManager localization bottleneck; avoids input fetching when the data is already cached on container reuse; performance gains of up to 3x in tests for replicated join on smaller clusters with higher container reuse.

16. Vertex Groups
    - Feature: vertex grouping combines multiple vertices into one vertex group that produces a combined output.
    - Usage: the UNION operator (see the Pig Latin sketch after slide 25).
    - Benefits: better performance due to elimination of an additional vertex; gains of 1.2x to 2x over MR.

      A = LOAD 'a';
      B = LOAD 'b';
      C = UNION A, B;
      D = GROUP C BY $0;

    [Diagram] "Load A" and "Load B" form a vertex group feeding GROUP directly.

17. Dynamic Parallelism
    - Determining parallelism beforehand is hard, so parallelism is adjusted dynamically at runtime.
    - Tez VertexManagerPlugin: a custom policy to determine parallelism at runtime, plus a library of common policies such as ShuffleVertexManager.

18. Dynamic Parallelism: ShuffleVertexManager
    - Stock VertexManagerPlugin from Tez; used by group, hash join, etc.
    - Dynamically reduces the parallelism of a vertex based on estimated input size.
    [Diagram] A join of "Load A" and "Load B" has its JOIN vertex's parallelism reduced from 4 to 2 at runtime.

19. Dynamic Parallelism: PartitionerDefinedVertexManager
    - Custom VertexManagerPlugin; used by order by and skewed join.
    - Dynamically increases or decreases parallelism based on input size.
    [Diagram] Load/Filter & Sample feeds Sample Aggregate, which calculates the parallelism used by the Partition and Sort vertices.

20. PERFORMANCE

21. Performance Numbers (times are MR vs Tez)
    - Prod script 1: 1 MR job, 3172 vs 3172 tasks: 28 vs 18 min (1.5x)
    - Prod script 2: 12 MR jobs, 966 vs 941 tasks: 11 vs 5 min (2.1x)
    - Prod script 3: 4 MR jobs on 8.4 TB input, 21397 vs 21382 tasks: 50 vs 35 min (1.5x)
    - Prod script 4: 4 MR jobs on 25.2 TB input, 101864 vs 101856 tasks: 74 vs 72 min (2%)

22. Performance Numbers (times are MR vs Tez)
    - Prod script 1: 5 MR jobs: 25 vs 10 min (2.52x)
    - Prod script 2: 5 MR jobs: 34 vs 16 min (2.02x)
    - Prod script 3: 12 MR jobs: 1h 46m vs 48 min (2.22x)
    - Prod script 4: 15 MR jobs: 2h 22m vs 1h 21m (1.75x)

23. Lipstick from Netflix
    [Screenshot]

24. Performance Numbers: Interactive Query
    - TPC-H Q10, MR vs Tez, by input size: 10 GB: 2.49x; 5 GB: 3.41x; 1 GB: 4.89x; 500 MB: 6x.
    - When the input data is small, latency dominates; Tez significantly reduces latency through session and container reuse.

25. Performance Numbers: Iterative Algorithm
    - Pig can be used to implement iterative algorithms using embedding, and iterative algorithms are ideal for container reuse.
    - Example: the k-means algorithm. Each iteration takes 1.48 s on average after the first iteration (vs 27 s for MR).
    - [Chart] k-means, MR vs Tez: speedups of 5.37x, 13.12x, and 14.84x at 10, 50, and 100 iterations.
    - Source code can be downloaded at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
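To make the broadcast edge, skewed join, and vertex group features of slides 14-16 concrete, here is a minimal Pig Latin sketch; the paths and fields are hypothetical. The join variants (USING 'replicated', USING 'skewed') are standard Pig syntax; the comments describe how Pig on Tez executes them, per the slides above:

      -- Hypothetical sketch of operators that exercise the Tez features above.
      big   = LOAD '/data/big'   AS (k:long, v:chararray);
      small = LOAD '/data/small' AS (k:long, w:chararray);

      -- Replicated join: the small table is broadcast to every join task
      -- (broadcast edge plus object caching replace the MR distributed cache).
      r = JOIN big BY k, small BY k USING 'replicated';
      STORE r INTO '/out/replicated';

      -- Skewed join: a sample of the key distribution drives the partitioning
      -- (the sample map travels over a broadcast edge).
      s = JOIN big BY k, small BY k USING 'skewed';
      STORE s INTO '/out/skewed';

      -- Union: the two load vertices become a vertex group feeding GROUP
      -- directly, eliminating the extra vertex MR would need.
      a = LOAD '/data/a' AS (k:long);
      b = LOAD '/data/b' AS (k:long);
      u = UNION a, b;
      g = GROUP u BY k;
      STORE g INTO '/out/grouped';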
26. Performance is proportional to
    - Number of stages in the DAG: the more stages in the DAG, the better Tez performs relative to MR, due to the elimination of map read stages.
    - Size of intermediate output: the larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage.
    - Cluster/queue capacity: the more congested a queue is, the better Tez performs relative to MR, due to container reuse.
    - Size of data in the job: for smaller data and more stages, Tez performs better relative to MR, because launch overhead is a high percentage of total time for small jobs.
    (A Pig Latin sketch illustrating the first two factors follows slide 33.)

27. CURRENT & FUTURE

28. Where are we?
    - 90% feature parity with Pig on MR: no local mode yet (TEZ-235); rarely used operators are not implemented (MAPREDUCE, i.e. native MapReduce jobs, and Collected CoGroup).
    - 98% of ~1300 e2e tests pass; 35% of ~2850 unit tests pass; porting the rest is pending Tez local mode.
    - The Tez branch has been merged into trunk and will be part of the Pig 0.14 release.
    - Netflix has Lipstick working with Pig on Tez (credits: Jacob Perkins, Cheolsoo Park).

29. User Impact
    - Tez: zero-pain deployment; install the Tez library on local disk and copy it to HDFS.
    - Pig: no-pain migration from Pig on MR to Pig on Tez. Existing scripts work as-is, without any modification. Only two additional steps are needed to execute in Tez mode:

      export TEZ_HOME=/tez-install-location
      pig -x tez myscript.pig

    - Users should review, profile, and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks due to JVM reuse.

30. What next?
    - Support for Tez local mode; all unit tests ported.
    - Improve stability, usability, and debuggability.
    - Apache release: Pig 0.14 with Tez, released by Sep 2014.
    - Deployment: in research at Yahoo! by early Q3; in production at Yahoo! and Netflix by Q3/Q4.
    - Performance: from 1.2-3x to 1.5-5x by Q4.

31. Tez Features - WIP
    - Tez UI: an Application Master UI and a job history UI are in the works, integrating via the Application Timeline Server. Currently only AM logs are easily viewable; task logs are available, but you have to grep the AM log to find the URL.
    - Tez local mode.
    - Tez AM recovery: Tez checkpointing and resuming on AM failure is functional but needs more work. With single-DAG execution of a whole script, AM retries can be very costly.
    - Input fetch optimizations: a custom ShuffleHandler on the NodeManager; local input fetch on container reuse.

32. What next - Performance?
    - Shared edges: send the same output to multiple downstream vertices.
    - Multiple vertex caching.
    - Unsorted shuffle for skewed join and order by.
    - Custom edge manager and data routing for skewed join.
    - Group by and join using hashing, avoiding sorting.
    - Better memory management.
    - Dynamic reconfiguration of the DAG: automatically determine the type of join (replicate, skewed, or hash).

33. We are hiring!!!
    - Hortonworks: stop by Kiosk D5.
    - Yahoo!: stop by Kiosk P9 or reach out to us at [email protected].

    Thank You
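Tying back to slide 26, here is a minimal Pig Latin sketch (hypothetical paths and fields) of the script shape where Tez's advantage is largest: every stage boundary that MR turns into a separate job with an HDFS round trip becomes an edge inside a single Tez DAG:

      -- Hypothetical group-by followed by order-by. On MR this is a chain
      -- of jobs: the group-by output is written to HDFS and re-read by the
      -- order-by (plus its sampling stage). On Tez it is one DAG, so the
      -- intermediate data flows over edges and the extra map reads disappear.
      raw = LOAD '/data/events' AS (user:chararray, bytes:long);
      grp = GROUP raw BY user;
      agg = FOREACH grp GENERATE group AS user, SUM(raw.bytes) AS total;
      top = ORDER agg BY total DESC;
      STORE top INTO '/out/top_users';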