TRANSCRIPT
COSC 6339
Big Data Analytics
Data Formats –
HDF5 and Parquet files
Edgar Gabriel
Fall 2018
File Formats - Motivation
• Use-case: Analysis of all flights in the US between 2004-
2008 using Apache Spark
File Format           File Size   Processing Time
csv                   3.4 GB      525 sec
json                  12 GB       2245 sec
Hadoop sequence file  3.7 GB      1745 sec
parquet               0.55 GB     100 sec
Scientific data libraries
• Handle data on a higher level
• Provide additional information typically not available in
flat data files (Metadata)
– Size and type of data structures
– Data format
– Name
– Units
• Two widely used libraries available
– NetCDF
– HDF-5
HDF-5
• Hierarchical Data Format (HDF) developed since 1988
at NCSA (University of Illinois)
– http://hdf.ncsa.uiuc.edu/HDF5/
• Has gone through a long history of changes; the current
version, HDF-5, has been available since 1999
• HDF-5 supports
– Very large files
– Parallel I/O interface
– Fortran, C, Java, Python bindings
HDF-5 dataset
• Multi-dimensional array of basic data elements
• A dataset consists of
– Header + data
• Header consists of
– Name
– Datatype: basic (e.g. H5T_NATIVE_FLOAT) or
compound datatypes
– Dataspace: defines size and shape of a multidimensional
array. Dimensions can be fixed or unlimited.
– Storage layout: defines how multidimensional arrays are
stored in file. Can be contiguous or chunked.
Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
  GROUP "tempseries" {
    DATASET "height" {
      DATATYPE { "H5T_STD_I32BE" }
      DATASPACE { ARRAY ( 4 ) ( 4 ) }
      DATA {
        0, 50, 100, 150
      }
      ATTRIBUTES "units" {
        DATATYPE { "undefined string" }
        DATASPACE { ARRAY ( 0 ) ( 0 ) }
        DATA {
          unable to print
        }
      }
    }
    DATASET "temperature" {
      DATATYPE { "H5T_IEEE_F32BE" }
      DATASPACE { ARRAY ( 3, 8, 4 ) ( H5S_UNLIMITED, 8, 4 ) }
      DATA { … }
Storage layout: contiguous vs. chunked
[Figure: a 64-element 8x8 array shown twice, once stored contiguously in row-major order and once partitioned into rectangular 4x4 chunks stored one after another.]

Advantages and disadvantages of chunking
• Accessing rows and columns requires the same number of accesses
• Data can be extended in all dimensions
• Efficient storage of sparse arrays
• Can improve caching
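As a plain-Python illustration of the chunked layout (not HDF-5 library code), the mapping from an element's coordinates to the chunk that stores it can be sketched as:

```python
# Sketch: locating an element in a chunked 2-D array. Given the element's
# coordinates and the chunk shape, integer division gives the chunk's
# coordinates and the remainder gives the offset inside that chunk.

def chunk_index(coords, chunk_shape):
    """Return (chunk coordinates, offset within the chunk) for an element."""
    chunk = tuple(c // s for c, s in zip(coords, chunk_shape))
    offset = tuple(c % s for c, s in zip(coords, chunk_shape))
    return chunk, offset

# An 8x8 array split into 4x4 chunks: element (5, 2) lives in chunk (1, 0),
# at local offset (1, 2) inside that chunk.
print(chunk_index((5, 2), (4, 4)))  # ((1, 0), (1, 2))
```

Because a chunk holds a rectangular block, reading a row or a column of the array touches the same number of chunks, which is the access-symmetry advantage listed above.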
HDF-5 API
• HDF-5 naming convention
– All API functions start with an H5
– The next character identifies category of functions
• H5F: functions handling files
• H5G: functions handling groups
• H5D: functions handling datasets
• H5S: functions handling dataspaces
• H5A: functions handling attributes
• An HDF-5 group is a collection of datasets
– Comparable to a directory in a UNIX-like file system
h5py
• Python interface to the HDF5 binary data format
• Uses familiar Python and NumPy abstractions, such as
dictionary and NumPy array syntax
Reading and Writing an HDF-5 file
using h5py
import numpy as np
import h5py
MyData = np.random.random(size=(100,20))
h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('dataset_1', data=MyData)
h5f.close()

h5f = h5py.File('data.h5', 'r')
MyData = h5f['dataset_1'][:]
h5f.close()
Setting datatypes and compression in
h5py
f = h5py.File('integer_8.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i8')
d[:] = arr   # arr: a NumPy array of matching shape
f.close()

f = h5py.File('float.hdf5', 'w')
d = f.create_dataset('dataset', (100,), dtype='f16', compression="gzip")
d[:] = arr
f.close()
Parquet files
• Columnar data representation
• Available to many projects in the Hadoop ecosystem
• Built on the record shredding and assembly algorithm
described in the Dremel paper
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva
Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of
Web-Scale Datasets”
https://ai.google/research/pubs/pub36632.pdf
• Supports compression and efficient encoding schemes
Record vs. column oriented data
Image source: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of Web-Scale Datasets” https://ai.google/research/pubs/pub36632.pdf
Example
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward; }
repeated group Name {
repeated group Language {
required string Code;
optional string Country; }
optional string Url; }}
• The schema can be seen as a tree whose leaves are
primitive types
• A column is created for each leaf
• For this example we end up with 6 columns:
DocId
Links.Backward
Links.Forward
Name.Language.Code
Name.Language.Country
Name.Url
• As some of the fields are repeated fields, we need
extra pieces of information to be stored along with the
data to allow re-assembling the records together.
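The column-per-leaf rule can be sketched in plain Python; the nested-dict schema below is an illustrative stand-in for the Document message, not a Parquet API:

```python
# Sketch: deriving the Parquet column list from a schema tree. Groups are
# nested dicts; primitive types are strings. A column is created for every
# primitive leaf, named by its dotted path from the root.

schema = {
    "DocId": "int64",
    "Links": {"Backward": "int64", "Forward": "int64"},
    "Name": {
        "Language": {"Code": "string", "Country": "string"},
        "Url": "string",
    },
}

def leaf_paths(node, prefix=""):
    """Walk the schema tree and yield the dotted path of each leaf field."""
    for field, child in node.items():
        path = f"{prefix}.{field}" if prefix else field
        if isinstance(child, dict):   # group: recurse into its fields
            yield from leaf_paths(child, path)
        else:                         # primitive leaf: one column
            yield path

print(list(leaf_paths(schema)))
# ['DocId', 'Links.Backward', 'Links.Forward', 'Name.Language.Code',
#  'Name.Language.Country', 'Name.Url']
```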
Repetition and definition levels
• Repetition level tells us at what field in the field’s
path the value has repeated.
• Definition level specifies how many fields in a record
that could be undefined (because they are optional or
repeated) are present in the record.
• Only repeated fields increment the repetition level;
only non-required fields increment the definition level
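A minimal sketch of the repetition-level rule, under the simplifying assumption of a single repeated leaf one level below the root (like Links.Forward at level 1): the first value of a record starts a new record (R = 0), and each further value repeats the field at its own level.

```python
# Sketch (simplified record model, not the full Dremel algorithm):
# repetition levels for one record's values of a level-1 repeated leaf.

def rep_levels(values, field_level=1):
    """First value of the record gets R = 0; repeats get R = field_level."""
    return [(v, 0 if i == 0 else field_level) for i, v in enumerate(values)]

# Record 1 from the example: Links.Forward = 20, 40, 60
print(rep_levels([20, 40, 60]))  # [(20, 0), (40, 1), (60, 1)]
```

This reproduces the Links.Forward column of the worked example that follows; nested repeated groups like Name.Language need the full recursive algorithm.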
Image source: https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
R = 0 (current repetition level)
DocId: 10, R:0, D:0
Links.Backward: NULL, R:0, D:1 (no value defined so D < 2)
Links.Forward: 20, R:0, D:2
R = 1 (we are repeating 'Links.Forward' of level 1)
Links.Forward: 40, R:1, D:2
R = 1 (we are repeating 'Links.Forward' again of level 1)
Links.Forward: 60, R:1, D:2
Back to the root level: R=0
Name.Language.Code: en-us, R:0, D:2
Name.Language.Country: us, R:0, D:3
R = 2 (we are repeating 'Name.Language' of level 2)
Name.Language.Code: en, R:2, D:2
Name.Language.Country: NULL, R:2, D:2 (no value defined so D < 3)
Name.Url: http://A, R:0, D:2
R = 1 (we are repeating 'Name' of level 1)
Name.Language.Code: NULL, R:1, D:1 (Only Name is defined so D = 1)
Name.Language.Country: NULL, R:1, D:1
Name.Url: http://B, R:1, D:2
R = 1 (we are repeating 'Name' again of level 1)
Name.Language.Code: en-gb, R:1, D:2
Name.Language.Country: gb, R:1, D:3
Name.Url: NULL, R:1, D:1 (only Name is defined so D = 1)
DocId: 20, R:0, D:0
Links.Backward: 10, R:0, D:2
Links.Backward: 30, R:1, D:2
Links.Forward: 80, R:0, D:2
Name.Language.Code: NULL, R:0, D:1
Name.Language.Country: NULL, R:0, D:1
Name.Url: http://C, R:0, D:2
Resulting Columns stored in Parquet
Image source: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of Web-Scale Datasets” https://ai.google/research/pubs/pub36632.pdf
Parquet Glossary
• Row group: A logical horizontal partitioning of the data
into rows.
– A row group consists of a column chunk for each column
in the dataset.
– Max. size buffered in memory while writing
– No physical structure that is guaranteed for a row group.
• Column chunk: A chunk of the data for a particular
column.
– guaranteed to be contiguous in the file.
• Page: Column chunks are divided up into pages.
– conceptually an indivisible unit (in terms of compression
and encoding).
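The row-group split can be sketched with a hypothetical helper (not the Parquet API), illustrating the horizontal partitioning described above:

```python
# Sketch: partition a table's rows into row groups of bounded size. In a
# real Parquet writer the bound is a byte budget buffered in memory; here
# a simple row count stands in for it.

def row_groups(rows, max_rows_per_group):
    """Split rows into consecutive groups of at most max_rows_per_group."""
    return [rows[i:i + max_rows_per_group]
            for i in range(0, len(rows), max_rows_per_group)]

groups = row_groups(list(range(10)), max_rows_per_group=4)
print([len(g) for g in groups])  # [4, 4, 2]
```

Within each group, every column's values would then be stored as one column chunk, itself divided into pages.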
Example file
• A file consists of one or more row groups; here, N columns split
into M row groups
• A row group contains exactly one column chunk per column.
• Column chunks contain one or more pages.
Image source: https://parquet.apache.org/documentation/latest/
• Reading columns is straightforward
• Record level API to integrate with existing row-based
engines (Hive, Pig, M/R)
– Repetition level 0 indicates new record
– Repetition/Definition level capture the structure
– One column per leaf in the schema
• Unit of parallelization
– MapReduce - File/Row Group
– IO - Column chunk
– Encoding/Compression - Page
Encodings
• Bit encoding
– Small integers encoded in the minimum number of bits required
– Useful for repetition and definition levels
• Run-length encoding
– Sequences in which the same value occurs in many
consecutive data elements are stored as a single value and a
count
– Useful for definition levels of sparse columns
• Dictionary encoding
– Searches for matches between the data to be compressed and a
set of values contained in a 'dictionary'
– When the encoder finds a match, it substitutes a reference to
the value's position in the dictionary
– Useful for a small (~60k) set of distinct values
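The three encodings above can be illustrated in plain Python (illustration only, not Parquet's actual on-disk representation):

```python
# Sketches of bit encoding, run-length encoding, and dictionary encoding.

def bit_width(max_value):
    """Bit encoding: minimum bits needed to store values up to max_value."""
    return max(1, max_value.bit_length())

def rle(values):
    """Run-length encoding: collapse runs of equal values into (value, count)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

def dict_encode(values):
    """Dictionary encoding: replace each value by its index in a dictionary."""
    dictionary, index, ids = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

print(bit_width(3))                    # 2 (levels 0..3 fit in 2 bits)
print(rle([0, 0, 0, 0, 2, 0, 0]))      # [(0, 4), (2, 1), (0, 2)]
print(dict_encode(["us", "gb", "us"]))  # (['us', 'gb'], [0, 1, 0])
```

Note how the definition levels of a sparse column (mostly 0, with occasional higher values) collapse into a handful of runs, which is why RLE is singled out for them above.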
Encodings
• Delta encoding (new in Parquet 2)
Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635
Formats comparison (I)
Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635
Formats comparison (II)
Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635
Parquet implementations
• Java implementation
– sources can be built using mvn package
– the current stable version is available from Maven
Central
• C++ sources
– Based on Apache Thrift (a software stack with a
code generation engine to build services that work
efficiently and seamlessly between numerous
languages, including C++, Java, Python, … )