Quick Introduction to Apache Tez

© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent. Apache Tez Piotr Krewski, Adam Kawa

Upload: getindata

Post on 12-Jul-2015

Page 1: Quick Introduction to Apache Tez

Apache Tez

Piotr Krewski, Adam Kawa

Page 2: Quick Introduction to Apache Tez

Apache Tez

■ Efficient execution engine
  ● Faster than MapReduce

■ Can be leveraged by existing frameworks, e.g. Hive, Pig, Scalding
  ● In Hive: SET hive.execution.engine=tez; (one of tez, mr, spark)

■ Built atop Hadoop YARN
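
As a minimal sketch of the engine switch named on this slide, the helper below renders the SET statement and rejects any value other than the three the slide lists. The function name `engine_statement` is hypothetical and exists only for illustration; the property name and its values come from the slide.

```python
# Values the slide lists for hive.execution.engine.
VALID_ENGINES = ("tez", "mr", "spark")


def engine_statement(engine: str) -> str:
    """Return the HiveQL SET statement that selects an execution engine."""
    name = engine.strip().lower()
    if name not in VALID_ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {VALID_ENGINES}")
    return f"SET hive.execution.engine={name};"


if __name__ == "__main__":
    # Statement to run in a Hive session before the query itself.
    print(engine_statement("tez"))
```

The resulting statement is issued once per Hive session, so the same query text can be timed under each engine without being rewritten.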

Page 3: Quick Introduction to Apache Tez

Apache Tez

Page 4: Quick Introduction to Apache Tez

Some Advantages Of Tez

■ Natural DAG
  ● No intermediate data written to HDFS (where it would be replicated 3x)
  ● No need for "empty" map tasks just to reshuffle data
  ● No time spent waiting in a queue to start the next MapReduce job
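
The points above can be sketched as a simple counting model. This is an illustration under stated assumptions, not a measurement: it assumes every MapReduce job except the last hands exactly one intermediate dataset to the next job through HDFS, and that a Tez DAG is submitted once and keeps intermediate data off HDFS, as the slide states. The names `PlanCost`, `mapreduce_chain` and `tez_dag` are hypothetical.

```python
from typing import NamedTuple


class PlanCost(NamedTuple):
    jobs_submitted: int              # separate submissions, each may wait in a queue
    intermediate_hdfs_replicas: int  # replicated copies of intermediate data on HDFS


def mapreduce_chain(num_jobs: int, replication: int = 3) -> PlanCost:
    """Count hand-offs for a query compiled into a chain of MapReduce jobs.

    Assumption: every job except the last writes one intermediate dataset
    to HDFS, and each dataset is stored `replication` times.
    """
    if num_jobs < 1:
        raise ValueError("a plan needs at least one job")
    handoffs = num_jobs - 1
    return PlanCost(num_jobs, handoffs * replication)


def tez_dag() -> PlanCost:
    """A single DAG: one submission, no intermediate data on HDFS (per the slide)."""
    return PlanCost(1, 0)


if __name__ == "__main__":
    print(mapreduce_chain(2))  # a two-job chain: one hand-off through HDFS
    print(tez_dag())
```

The model ignores data volume entirely; it only shows that the number of HDFS hand-offs and job submissions grows with the length of the MapReduce chain and stays constant for a DAG.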

Page 5: Quick Introduction to Apache Tez

Simple Comparison

■ Three real-world queries

■ Real production datasets
  ● Stored in Avro and ORC formats

■ 900+ node cluster (thanks, Spotify!)
  ● Queries run in a queue with limited capacity

■ Hive 0.14 and Tez 0.5 (version from April 2014)

Page 6: Quick Introduction to Apache Tez

Top Three Users

■ Find the top 3 users with the largest number of streams

SELECT user_id, count(*) AS cnt
FROM stream
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 3

■ The pattern is GROUP BY, then ORDER BY, then LIMIT
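
To make the pattern concrete, here is a small in-memory equivalent of the query above. It is only a sketch of what the query computes, not of how Hive executes it; the sample user IDs are made up and the function name `top_users` is hypothetical.

```python
from collections import Counter


def top_users(stream_user_ids, limit=3):
    """Count rows per user (GROUP BY), sort by count descending (ORDER BY), keep `limit` (LIMIT)."""
    return Counter(stream_user_ids).most_common(limit)


# Made-up sample: one entry per stream event, holding the user_id column.
streams = ["ann", "bob", "ann", "cat", "bob", "ann", "dan", "cat", "cat", "cat"]

if __name__ == "__main__":
    print(top_users(streams))  # [('cat', 4), ('ann', 3), ('bob', 2)]
```

On a cluster the counting step and the global sort each need their own shuffle, which is why the plans on the next slide contain two reduce stages.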

Page 7: Quick Introduction to Apache Tez

Top Three Users

Configuration              Plan                     Wallclock Time (sec)  Improvement
Hive on MapReduce on Avro  2 MapReduce jobs         353                   -
Hive on Tez on Avro        Map => Reduce => Reduce  197                   1.8x

Page 8: Quick Introduction to Apache Tez

Top Three Users - On A Busier Cluster

Configuration              Wallclock Time (sec)  Improvement
Hive on MapReduce on Avro  576                   -
Hive on Tez on Avro        183                   3.14x
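
The improvement figures for this query are the MapReduce wall-clock time divided by the Tez wall-clock time. The sketch below recomputes them from the times reported on the two slides; the function name `speedup` and the run labels are only for illustration.

```python
def speedup(baseline_sec: float, candidate_sec: float) -> float:
    """How many times faster the candidate run is than the baseline run."""
    if baseline_sec <= 0 or candidate_sec <= 0:
        raise ValueError("wall-clock times must be positive")
    return baseline_sec / candidate_sec


# (Hive on MapReduce, Hive on Tez) wall-clock seconds from the two slides.
runs = {
    "first run": (353, 197),
    "busier cluster": (576, 183),
}

if __name__ == "__main__":
    for label, (mr_sec, tez_sec) in runs.items():
        print(f"{label}: {speedup(mr_sec, tez_sec):.2f}x")
```

The first ratio is about 1.79, reported as 1.8x. The second is about 3.147, so the slide's 3.14x is the truncated value rather than the rounded one. Tez finished in roughly the same time on the busier cluster, while the MapReduce run slowed down considerably.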

Page 9: Quick Introduction to Apache Tez

Console Output

……
Query ID = kawaa_20141130185757_3e4bd581-23bb-4d7c-b755-044c4a5783b5
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414118456795_314710)
Map 1: -/- Reducer 2: 0/5 Reducer 3: 0/1
Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1
Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1
Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1
……
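
Each progress line lists every vertex of the DAG with its completed and total task counts. As a rough sketch, the parser below turns one such line into a dictionary. It assumes only the format visible in the lines above, with "-" meaning the count is not yet known; other forms that may appear in real output are not handled. The name `parse_progress` is hypothetical.

```python
import re

# Matches "<vertex name>: <done>/<total>", where "-" means not yet known.
VERTEX = re.compile(r"((?:Map|Reducer) \d+): (-|\d+)/(-|\d+)")


def parse_progress(line: str) -> dict:
    """Map each vertex name to a (done, total) tuple; unknown counts become None."""
    progress = {}
    for name, done, total in VERTEX.findall(line):
        progress[name] = (
            None if done == "-" else int(done),
            None if total == "-" else int(total),
        )
    return progress


if __name__ == "__main__":
    print(parse_progress("Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1"))
```

For the sample line this yields 36 map tasks, 5 tasks in the first reducer vertex and 1 in the second, all inside a single job, which matches the Map => Reduce => Reduce plan.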

Page 10: Quick Introduction to Apache Tez

Some Advantages Of Tez

■ Container reuse
  ● Less time spent negotiating with the Resource Manager
  ● Smaller tasks can be started, so fewer stragglers
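
A toy model of the first point: without reuse, every task needs its own container from the Resource Manager; with reuse, a container that finishes one task picks up the next. The sketch assumes that all tasks fit any container and that at most `parallel_slots` containers run at once. Both are simplifications, and `container_requests` is a hypothetical name.

```python
def container_requests(num_tasks: int, parallel_slots: int, reuse: bool) -> int:
    """Number of containers that must be obtained to run `num_tasks` tasks.

    Assumption: any container can run any task, and at most `parallel_slots`
    containers are held at the same time.
    """
    if num_tasks < 0 or parallel_slots <= 0:
        raise ValueError("need num_tasks >= 0 and parallel_slots > 0")
    if reuse:
        # Containers are obtained once and then handed task after task.
        return min(num_tasks, parallel_slots)
    # One fresh container per task.
    return num_tasks


if __name__ == "__main__":
    tasks = 36 + 5 + 1  # task counts from the console output shown earlier
    print(container_requests(tasks, parallel_slots=10, reuse=False))
    print(container_requests(tasks, parallel_slots=10, reuse=True))
```

Because starting another task in a held container is cheap in this model, splitting work into more, smaller tasks does not multiply the negotiation cost, which is the link to the second point about stragglers. The slot count of 10 is an arbitrary example value.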

Page 12: Quick Introduction to Apache Tez

Top Ten Countries

■ Find the top 10 countries with the largest number of streams

SELECT country, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
GROUP BY country
ORDER BY cnt DESC
LIMIT 10

■ The pattern is JOIN ON, then GROUP BY, then ORDER BY, then LIMIT

Page 13: Quick Introduction to Apache Tez

Top Ten Countries

Configuration              Plan                                      Wallclock Time (sec)  Improvement
Hive on MapReduce on Avro  3 MapReduce jobs                          636                   -
Hive on Tez on Avro        Map => Map => Reduce => Reduce => Reduce  268                   2.4x
Hive on Tez on ORC Snappy  Map => Map => Reduce => Reduce => Reduce  203                   3.1x

Page 14: Quick Introduction to Apache Tez

The Biggest Polish Fan of Timbuktu

■ Find the biggest Polish fan of Timbuktu (a popular Swedish rap/reggae artist)

SELECT user_id, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
JOIN track ON stream.track_id = track.id
WHERE ...
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 1

Page 15: Quick Introduction to Apache Tez

The Biggest Polish Fan of Timbuktu

Configuration                  Plan                                   Wallclock Time (sec)  Improvement
Hive on MapReduce on ORC ZLIB  6 MapReduce jobs                       519                   -
Hive on Tez on ORC ZLIB        Map => Map => Map => Reduce => Reduce  259                   2x
Hive on Tez on ORC Snappy      Map => Map => Map => Reduce => Reduce  209                   2.5x

Page 16: Quick Introduction to Apache Tez

The Biggest Polish Fan of Timbuktu

■ We also ran this query on a 1.5-year production dataset
  ● 25+ TB of data
  ● 690 nodes

■ Benefits (after optimizations)
  ● 6+ hours with Hive on MapReduce and Avro Deflate
  ● 10 min 11 sec with Hive on Tez and ORC Zlib

■ Features used
  ● Container reuse
  ● Broadcast JOIN
  ● Warm containers

Page 17: Quick Introduction to Apache Tez

Summary

■ Very fast and smart
  ● Out-of-the-box performance for small and large queries

■ Very good at scale
  ● Tested by Yahoo!

■ Not memory-hungry
  ● Great for large datasets and multi-tenancy

■ Well integrated with YARN

■ Painless deployment and maintenance
  ● No daemons: build the Tez jars and upload them to HDFS

■ Gives you a powerful and effortless option
  ● Switch execution mode between MR, Tez or Spark using simple configuration settings

Page 18: Quick Introduction to Apache Tez

Q&A

Page 19: Quick Introduction to Apache Tez

Thanks!

Page 20: Quick Introduction to Apache Tez

About GetInData

■ Data-processing challenges addressed with passion and experience

■ 4+ years with Apache Hadoop and Big Data technologies