How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)


Page 1: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

How to use Parquet as a basis for ETL and analytics

Julien Le Dem @J_

VP Apache Parquet; Analytics Data Pipeline tech lead, Data Platform

@ApacheParquet

Page 2: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Outline

- Storing data efficiently for analysis
- Context: Instrumentation and data collection
- Constraints of ETL

Page 3: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Storing data efficiently for analysis

Page 4: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Why do we need to worry about efficiency?

Page 5: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Producing a lot of data is easy

Producing a lot of derived data is even easier. Solution: compress all the things!

Page 6: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Scanning a lot of data is easy

1% completed

… but not necessarily fast. Waiting is not productive. We want faster turnaround. Compression, but not at the cost of reading speed.

Page 7: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Interoperability: not that easy

We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.

Page 8: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Enter Apache Parquet

Page 9: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Parquet design goals


Interoperability

Space efficiency

Query efficiency

Page 10: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Efficiency

Page 11: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Columnar storage

Logical table representation (nested schema; columns a, b, c; rows 1 to 5):

- Row layout: a1 b1 c1  a2 b2 c2  a3 b3 c3  a4 b4 c4  a5 b5 c5
- Column layout: a1 a2 a3 a4 a5  b1 b2 b3 b4 b5  c1 c2 c3 c4 c5
- Encoding: each column chunk is then encoded independently (one encoded chunk per column)
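To make the row-versus-column distinction concrete, here is a minimal illustrative Java sketch (not Parquet code) that lays out the same five-row table both ways:

// Illustrative only: the same 5-row table (columns a, b, c) in row and column layout.
public class LayoutDemo {
    public static void main(String[] args) {
        String[][] rows = {
            {"a1", "b1", "c1"}, {"a2", "b2", "c2"}, {"a3", "b3", "c3"},
            {"a4", "b4", "c4"}, {"a5", "b5", "c5"}
        };

        // Row layout: the values of one record are contiguous (a1 b1 c1 a2 b2 c2 ...).
        StringBuilder rowLayout = new StringBuilder();
        for (String[] row : rows) {
            for (String value : row) rowLayout.append(value).append(' ');
        }

        // Column layout: the values of one column are contiguous (a1 a2 ... b1 b2 ... c1 c2 ...),
        // which is what makes per-column encoding and compression effective.
        StringBuilder columnLayout = new StringBuilder();
        for (int col = 0; col < 3; col++) {
            for (String[] row : rows) columnLayout.append(row[col]).append(' ');
        }

        System.out.println("row layout:    " + rowLayout);
        System.out.println("column layout: " + columnLayout);
    }
}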

Page 12: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Properties of efficient encodings

- Minimize CPU pipeline bubbles: highly predictable branching; reduced data dependencies
- Minimize CPU cache misses: reduce the size of the working set

Page 13: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

The right encoding for the right job

- Delta encodings: for sorted datasets or signals where consecutive values are close (the variation is small compared to the absolute value), e.g. timestamps, auto-generated ids, metrics, … Focuses on avoiding branching.

- Prefix coding (delta encoding for strings): when dictionary encoding does not work.

- Dictionary encoding: small (60K) set of values (server IP, experiment id, …)

- Run-length encoding: repetitive data.
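As an illustration of the simplest of these encodings, here is a minimal run-length encoding sketch in Java (Parquet's actual RLE/bit-packing hybrid encoding is more involved):

import java.util.ArrayList;
import java.util.List;

// Minimal run-length encoding sketch: turns [0,0,0,1,1,0] into (0,3)(1,2)(0,1).
public class RunLengthEncode {
    public static List<int[]> encode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int value = values[i];
            int count = 0;
            while (i < values.length && values[i] == value) { count++; i++; }
            runs.add(new int[]{value, count});   // (value, run length)
        }
        return runs;
    }

    public static void main(String[] args) {
        for (int[] run : encode(new int[]{0, 0, 0, 1, 1, 0})) {
            System.out.println("value=" + run[0] + " count=" + run[1]);
        }
    }
}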

Page 14: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Parquet nested representation

Schema tree (borrowed from the Google Dremel paper):

  Document
    DocId
    Links: Backward, Forward
    Name
      Language: Code, Country
      Url

Columns: docid, links.backward, links.forward, name.language.code, name.language.country, name.url

https://blog.twitter.com/2013/dremel-made-simple-with-parquet
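For reference, in Parquet's message notation the Document schema from the Dremel paper reads:

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required binary Code;
      optional binary Country;
    }
    optional binary Url;
  }
}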

Page 15: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Statistics for filter and query optimization

Vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need!

[Diagram: the full table (columns a, b, c; rows 1 to 5); projection selects a subset of the columns, the predicate selects a subset of the rows, and combining both reads only the matching column chunks.]
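A sketch of the idea in Java (illustrative only, not the parquet-mr API): per-row-group min/max statistics let the reader skip any row group whose range cannot contain the predicate value, here zip == 95113.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: use per-row-group min/max stats to skip row groups for "zip == 95113".
public class RowGroupPruning {
    static class RowGroupStats {
        final long offset; final int min; final int max;
        RowGroupStats(long offset, int min, int max) { this.offset = offset; this.min = min; this.max = max; }
    }

    static List<RowGroupStats> rowGroupsToRead(List<RowGroupStats> stats, int wanted) {
        List<RowGroupStats> selected = new ArrayList<>();
        for (RowGroupStats rg : stats) {
            if (wanted >= rg.min && wanted <= rg.max) selected.add(rg);  // might contain a match
            // otherwise the whole row group is skipped: predicate push down
        }
        return selected;
    }

    public static void main(String[] args) {
        List<RowGroupStats> stats = Arrays.asList(
            new RowGroupStats(0, 10000, 50000),
            new RowGroupStats(128L << 20, 90000, 99999));
        System.out.println(rowGroupsToRead(stats, 95113).size() + " of 2 row groups read");
    }
}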

Page 16: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Interoperability

Page 17: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Interoperable

Model agnostic, language agnostic.

[Layered architecture diagram]
- Java side: object models (Avro, Thrift, Protocol Buffer, Pig Tuple, Hive SerDe, …), mapped through converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive), then assembly/striping and column encoding
- C++ side: query execution and encoding in Impala, …
- Both meet at the Parquet file format

Page 18: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Frameworks and libraries integrated with Parquet

Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, SparkSQL

Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite

Data models: Avro, Thrift, Protocol Buffers, POJOs

Page 19: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Context: Instrumentation and data collection

Page 20: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Typical data flow

[Diagram: Serving: Instrumented Services and Mutable serving stores (mutation) serve Happy users.]

Page 21: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Typical data flow

[Diagram adds Data collection: Instrumented Services send logs (log collection) to a streaming log (Kafka, Scribe, Chukwa, …) feeding streaming analysis and periodic consolidation; Mutable serving stores are pulled into periodic snapshots; logs and snapshots carry a schema.]

Page 22: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Typical data flow

[Diagram adds Analysis: consolidated logs and snapshots land in storage (HDFS) in a query-efficient format (Parquet), feeding ad-hoc queries (Impala, Hive, Drill, …), automated dashboards, batch computation (graph, machine learning, …) and streaming computation (Storm, Samza, Spark Streaming, …).]

Page 23: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Typical data flow

[Same diagram as the previous slide, now with a Happy Data Scientist consuming the analysis outputs.]

Page 24: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Schema management

Page 25: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Schema in Hadoop

Hadoop does not define a standard notion of schema, but there are many available:

- Avro
- Thrift
- Protocol Buffers
- Pig
- Hive
- …

And they are all different.


Page 26: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

What they define

Schema:
- Structure of a record
- Constraints on the types

Row-oriented binary format: how records are represented, one at a time.


Page 27: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

What they *do not* define

Column-oriented binary format: Parquet reuses the schema definitions and provides a common column-oriented binary format.


Page 28: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Example: address book

[Diagram: AddressBook has a repeated addresses field; each Address has street, city, state, zip and comment.]

Page 29: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Protocol Buffers

message AddressBook {
  repeated group addresses = 1 {
    required string street = 2;
    required string city = 3;
    required string state = 4;
    required string zip = 5;
    optional string comment = 6;
  }
}

- Allows recursive definitions
- Types: group or primitive
- The binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separate from your build

Fields have ids and can be optional, required or repeated

Lists are repeated fields

Page 30: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Thrift

struct AddressBook {
  1: required list<Address> addresses;
}
struct Address {
  1: required string street;
  2: required string city;
  3: required string state;
  4: required string zip;
  5: optional string comment;
}

- No recursive definitions
- Types: struct, map, list, set, union or primitive
- The binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separately from the build

Fields have ids and can be optional or required

explicit collection types

Page 31: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Avro

{
  "type": "record",
  "name": "AddressBook",
  "fields": [{
    "name": "addresses",
    "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "zip", "type": "string"},
          {"name": "comment", "type": ["null", "string"]}
        ]
      }
    }
  }]
}

Explicit collection types.

- Allows recursive definitions
- Types: records, arrays, maps, unions or primitive
- The binary format requires knowing the write-time schema
  ➡ more compact but not self-descriptive
  ➡ renaming fields does not impact the binary format
- Code generator in Java (well integrated in the build)

null is a type; an optional field is a union with null.

Page 32: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Requirements of ETL

Page 33: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Log event collection

Initial collection is fundamentally row oriented:
- Sync to disk as early as possible to minimize event loss
- Counting events sent and received is a good idea


Page 34: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Columnar storage conversion

Columnar storage requires writes to be buffered in memory for the entire row group:
- Write many records at a time
- Better executed in batch
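A sketch of such a batch conversion using parquet-avro's writer builder (class and method names as in recent parquet-avro releases; exact signatures vary by version, and the Event schema here is only an illustration): records are written one at a time but buffered in memory and flushed one full row group at a time.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class BatchToParquet {
    public static void main(String[] args) throws Exception {
        // Illustrative Avro schema for the records being converted.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"msg\",\"type\":\"string\"}]}");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/events.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withRowGroupSize(128 * 1024 * 1024)   // buffered in memory per row group
                .build()) {
            for (long i = 0; i < 1_000_000; i++) {     // write many records at a time
                GenericRecord r = new GenericData.Record(schema);
                r.put("id", i);
                r.put("msg", "event " + i);
                writer.write(r);
            }
        }
    }
}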


Page 35: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Columnar storage conversion

Not just columnar storage:

- Dynamic partitioning

- Sort order

- Stats generation


Page 36: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Write to Parquet

MapReduce: pick the OutputFormat for your object model and define the schema:
- ProtoParquetOutputFormat: setProtobufClass(job, AddressBook.class)
- ParquetThriftOutputFormat: setThriftClass(job, AddressBook.class)
- AvroParquetOutputFormat: setSchema(job, AddressBook.SCHEMA$)

Scalding:
// define the Parquet source
case class AddressBookParquetSource(override implicit val dateRange: DateRange)
  extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)
// load and transform data
pipe.write(ParquetSource())

Pig:
STORE mydata
  INTO 'my/data'
  USING parquet.pig.ParquetStorer();

Hive / Impala:
create table parquet_table (x int, y string) stored as parquetfile;
insert into parquet_table select x, y from some_other_table;
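For the MapReduce case, a minimal driver sketch wiring up AvroParquetOutputFormat could look like this; AddressBookMapper and the Avro-generated AddressBook class are assumed here, not shown on the slide:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class WriteParquetJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "write-address-book-parquet");
        job.setJarByClass(WriteParquetJob.class);

        // The mapper (not shown) emits (null, AddressBook) pairs built from the input records.
        job.setMapperClass(AddressBookMapper.class);
        job.setNumReduceTasks(0);

        // Tell Parquet which Avro schema to use for the output file.
        job.setOutputFormatClass(AvroParquetOutputFormat.class);
        AvroParquetOutputFormat.setSchema(job, AddressBook.SCHEMA$);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}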

Page 37: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Query engines

Page 38: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Scalding

loading:
new FixedPathParquetThrift[AddressBook]("my", "data") {
  val city = StringColumn("city")
  override val withFilter: Option[FilterPredicate] =
    Some(city === "San Jose")
}

operations:
p.map( (r) => r.a + r.b )
p.groupBy( (r) => r.c )
p.join
…

Explicit push down
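The same filter can be expressed in plain Java with parquet-mr's filter2 API (a sketch; the package prefix is org.apache.parquet in post-incubation releases, parquet before that):

import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
import static org.apache.parquet.filter2.predicate.FilterApi.eq;

import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.io.api.Binary;

public class CityFilter {
    public static void configure(Job job) {
        // Only row groups / records whose city column can equal "San Jose" are read.
        FilterPredicate onlySanJose = eq(binaryColumn("city"), Binary.fromString("San Jose"));
        ParquetInputFormat.setFilterPredicate(job.getConfiguration(), onlySanJose);
    }
}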

Page 39: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Pig

loading:
mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();

operations:
A = FOREACH mydata GENERATE a + b;
B = GROUP mydata BY c;
C = JOIN A BY a, B BY b;

Projection push down happens automatically

Page 40: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

SQL engines

Hive / Impala / Presto:
- Load: create table … as …
- Query: SELECT city FROM addresses WHERE zip == 95113

Drill:
- Load: optional; Drill can directly query Parquet files
- Query: SELECT city FROM dfs.`/table/addresses` WHERE zip == 95113

SparkSQL:
- Load: val parquetFile = sqlContext.parquetFile("/table/addresses")
- Query:
  val result = sqlContext
    .sql("SELECT city FROM addresses WHERE zip == 95113")
  result.map((r) => …)

Projection push down happens automatically

Page 41: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Community

Page 42: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Parquet timeline


- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats

- March 2013: OSS announcement; Criteo signs on for Hive integration

- July 2013: 1.0 release. 18 contributors from more than 5 organizations.

- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.

- Apr 2015: Parquet graduates from the Apache Incubator

Page 43: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Thank you to our contributors

[Chart: growth of contributors over time, with the open source announcement and the 1.0 release marked.]

Page 44: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Get involved

Mailing lists:
- [email protected]

Github repo:
- https://github.com/apache/parquet-mr

Parquet sync ups:
- Regular meetings on Google Hangout

@ApacheParquet

Page 45: How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Questions


SELECT answer(question) FROM audience

@ApacheParquet