CloverETL + Hadoop

Uploaded by david-pavlis on 17-Jul-2015

CloverETL versus Hadoop in light of transforming very large data sets in parallel: a deathmatch or happy together?

=similarities

• Both technologies use data parallelism - input data are split into “partitions” which are then processed in parallel.

• Each partition is processed the same way (the same algorithm is used).

• At the end of the processing, the results of the individually processed partitions need to be merged to produce the final result.

[Diagram: input data is split (data split) into Part 1, Part 2 and Part 3, each part is processed in parallel (data process), and the partial results are merged (data merge) into the final result.]
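To make the split/process/merge idea concrete, here is a minimal plain-Java sketch (not CloverETL or Hadoop code); the partition count, the round-robin split and the lower-casing "transformation" are just assumptions for illustration.

import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class SplitProcessMerge {
    public static void main(String[] args) throws Exception {
        List<String> records = Arrays.asList("456,NY,JOHN", "457,VA,BILL", "458,MA,SUE", "459,IL,MEGAN");

        // split: round-robin the records into two partitions
        int partitions = 2;
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) parts.add(new ArrayList<>());
        for (int i = 0; i < records.size(); i++) parts.get(i % partitions).add(records.get(i));

        // process: every partition runs the same algorithm, in parallel
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (List<String> part : parts) {
            futures.add(pool.submit(() ->
                part.stream().map(String::toLowerCase).collect(Collectors.toList())));
        }

        // merge: collect the partial results into the final result
        List<String> finalResult = new ArrayList<>();
        for (Future<List<String>> f : futures) finalResult.addAll(f.get());
        pool.shutdown();

        finalResult.forEach(System.out::println);
    }
}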

✕differences

• Hadoop uses the Map->Reduce pattern originally developed by Google for web indexing and searching. Processing is divided into a Map phase (filtering & sorting) and a Reduce phase (summary operation). The Hadoop approach expects the initially large volume of data to be reduced to a much smaller result, e.g. a search for pages containing a certain keyword.

• CloverETL is based on the pipeline-parallelism pattern, where individual specialized components perform various operations on a flow of data records: parsing, filtering, joining, aggregating, de-duplicating... Clover is optimized for large volumes of data flowing through it and being transformed on the fly.
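For contrast, a minimal plain-Java sketch of pipeline parallelism (not actual CloverETL components): a "parser" stage and a "filter" stage run concurrently, connected by a queue, so one record is being filtered while the next one is still being parsed. The sample records and the filter rule are made up.

import java.util.concurrent.*;

public class PipelineSketch {
    private static final String EOF = "__EOF__"; // poison pill marking the end of the stream

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> parsedRecords = new ArrayBlockingQueue<>(100);

        // stage 1: "parse" raw lines and push them downstream
        Thread parser = new Thread(() -> {
            try {
                String[] input = {"456,NY,JOHN", "457,VA,BILL", "458,MA,SUE"};
                for (String line : input) parsedRecords.put(line.trim());
                parsedRecords.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // stage 2: filter records while the parser is still producing new ones
        Thread filter = new Thread(() -> {
            try {
                String rec;
                while (!(rec = parsedRecords.take()).equals(EOF)) {
                    if (!rec.contains(",VA,")) System.out.println(rec); // drop VA records
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        parser.start();
        filter.start();
        parser.join();
        filter.join();
    }
}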

=similarities

Both technologies use partitioned & distributed storage of data (a filesystem).

• Hadoop uses HDFS (Hadoop Distributed Filesystem), with individual DataNodes residing on the physical nodes of the Hadoop/HDFS cluster.

• CloverETL uses a Partitioned Sandbox, where data are spread over the physical nodes of the CloverETL Cluster. Each node is also a data processing node, typically (though not exclusively) processing locally stored data. One node can be part of more than one Partitioned Sandbox.

✕differences

CloverETL’s Partitioned Sandbox operates at the record level (data are read & written as complete records). Data loss prevention is left to the underlying file system storage. Clover’s Partitioned Sandbox supports the very high I/O throughput needed for massive data transformations.


HDFS operates at the byte level (data are read & written as streams of bytes). It includes data loss prevention through data redundancy. HDFS is based on the “write-once, read-many-times” pattern.

CloverETL ✕ Hadoop HDFS

HDFS stores, splits and distributes data at byte level

CloverETL stores, splits and distributes data at record level

[Diagram: HDFS splits the record “456,NY,JOHN\n” in the middle of its bytes, while CloverETL splits the record stream “456,NY,JOHN\n 457,VA,BILL\n 458,MA,SUE\n” on record boundaries.]

Hadoop HDFS

organises files into large blocks of bytes (64MB or more), which are then physically stored on different nodes of the Hadoop cluster

[Diagram: an HDFS data file is organised into 64MB data blocks; Node 1 stores blocks 1, 3, 5, 7, … and Node 2 stores blocks 2, 4, 6, 8, …; individual data records fall wherever the byte-level split puts them.]
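A small sketch of how this block placement can be inspected with the standard HDFS Java client (the /data/input.csv path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input.csv")); // hypothetical file

        // one BlockLocation per HDFS block; each block reports the nodes holding a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("block " + i + " hosted on: " + String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}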

Hadoop HDFS

partitions, distributes and stores data at byte level

[Diagram: the record “456,NY,JOHN\n” is split at a block boundary; the 1st part is stored on Node 1 and the 2nd part on Node 2.]

☛ One data record in the source data can end up being split between two different nodes.
☛ Writing or reading such a record requires accessing two different nodes via the network.
☛ HDFS presents a file as a single continuous stream of data (similar to any local filesystem).

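To illustrate that single-continuous-stream view, a minimal read sketch using the standard HDFS Java API (the path is again hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The file may physically span several blocks on several nodes,
        // but the client sees one continuous byte stream.
        try (FSDataInputStream in = fs.open(new Path("/data/input.csv"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}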

Hadoop HDFS

☛ Parallel writing to one HDFS file is impossible. Two processes cannot write to one data block at the same time, and two processes trying to write in parallel to one HDFS file (to two different blocks) will face the block boundary issue, with potential collisions.

[Diagram: the output file grows on HDFS as 64MB blocks are added (Block 1, Block 2). A 1st process executed on Node 1 and a 2nd process executed on Node 2 both write to Nodes 1 & 2; when the 2nd process starts writing to Block 2, the 1st process finds the space already filled by the 2nd process, has not enough space for its n-th record, and does not know where to write.]
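This is why parallel writers on HDFS normally create one file each (and why every Hadoop reducer produces its own part file). A rough sketch with made-up output paths:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelHdfsWriters {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int worker = 0; worker < 2; worker++) {
            final int id = worker;
            pool.submit(() -> {
                // each writer gets its own file - no shared block, no collision
                try (FileSystem fs = FileSystem.newInstance(new Configuration());
                     FSDataOutputStream out = fs.create(new Path("/output/part-" + id))) {
                    out.writeBytes("record from writer " + id + "\n");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}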

CloverETL Partitioned Sandbox

partitions, distributes and stores data at record level

[Diagram: the records “456,NY,JOHN\n 457,VA,BILL\n 458,MA,SUE\n” are split on record boundaries and get stored as complete records on Node 1 and Node 2.]

☛ Nodes contain complete records.
☛ Writing or reading records means accessing locally stored data only.
☛ Partitioned data are located in multiple files stored on the individual nodes. Clover offers a unified user view over those files; when processing, the partition files are accessed individually.


CloverETL Partitioned Sandbox


☛ Parallel writing to a Partitioned Sandbox is easy. Two processes write to two independent partitions of the Clover sandbox. Each process writes to the partition that is local to the node where it runs, so there are no collisions.

[Diagram: the 1st process, executed on Node 1, writes to Node 1 only and stores 456,NY,JOHN\n 458,VA,WILLIAM\n 460,MA,MAG\n in Partition 1; the 2nd process, executed on Node 2, writes to Node 2 only and stores 457,NJ,ANN\n 459,IL,MEGAN\n 461,WA,RYAN\n in Partition 2.]
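The same idea in plain Java rather than actual CloverETL code: each worker appends complete records to its own local partition file, so no cross-node coordination is needed (file names and records are made up).

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

public class LocalPartitionWriter {
    // On a real cluster each node would run one writer against its own local disk.
    static void writePartition(Path partitionFile, List<String> records) throws IOException {
        Files.write(partitionFile, records, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws Exception {
        writePartition(Paths.get("partition1.dat"), List.of("456,NY,JOHN", "458,VA,WILLIAM"));
        writePartition(Paths.get("partition2.dat"), List.of("457,NJ,ANN", "459,IL,MEGAN"));
    }
}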

Fault resiliency

☛ HDFS implements fault tolerance. HDFS replicates individual data blocks across cluster nodes, thus ensuring fault tolerance.

☛ Clover delegates fault resiliency to the local file system. Clover provides a unified view of data stored locally on the nodes; the nodes’ setup (OS, filesystem) is responsible for fault resiliency.
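On the HDFS side of that comparison, the replica count is controllable cluster-wide or per file; a minimal sketch (the path is hypothetical, 3 replicas is the common default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                  // default replica count for new files
        FileSystem fs = FileSystem.get(conf);
        // raise the replica count of an existing (hypothetical) file to 3 copies
        fs.setReplication(new Path("/data/input.csv"), (short) 3);
        fs.close();
    }
}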

The classic Hadoop WordCount example (mapper, reducer and job driver):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

How Hadoop processes data

[Diagram: an input data file stored as Block 1, Block 2 and Block 3 is read by map() tasks that map the data to key->value pairs; temporary data are sorted and partially merged, and reduce() tasks write the results as output.part1 and output.part2.]

• Hadoop concentrates transformation logic into 2 stages: map & reduce.
• Complex logic must be split into multiple map & reduce phases, with temporary data being stored in between.
• Intense network communication happens when the reducers (one or more) merge data from multiple mappers (mappers and reducers may run on different nodes).
• If multiple reducers are used (to accelerate processing), the resulting data are located in multiple output files, which need to be merged again to produce a single final result.
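A rough sketch of such a multi-phase driver (paths and job names are made up; mapper/reducer classes omitted): the first job writes to a temporary HDFS directory which the second job then reads, so intermediate data always passes through storage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPhaseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input  = new Path("/data/input");        // hypothetical paths
        Path temp   = new Path("/data/tmp-phase1");
        Path output = new Path("/data/output");

        // phase 1: first map/reduce pass, results land in a temporary HDFS directory
        Job phase1 = Job.getInstance(conf, "phase-1");
        // phase1.setMapperClass(...); phase1.setReducerClass(...);  // omitted
        FileInputFormat.addInputPath(phase1, input);
        FileOutputFormat.setOutputPath(phase1, temp);
        if (!phase1.waitForCompletion(true)) System.exit(1);

        // phase 2: second pass re-reads the temporary data from HDFS
        Job phase2 = Job.getInstance(conf, "phase-2");
        // phase2.setMapperClass(...); phase2.setReducerClass(...);  // omitted
        FileInputFormat.addInputPath(phase2, temp);
        FileOutputFormat.setOutputPath(phase2, output);
        System.exit(phase2.waitForCompletion(true) ? 0 : 1);
    }
}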

How CloverETL processes data

• Clover processes data via a set of transformation components running in pipeline-parallelism mode.
• Even complex transformations can be performed without temporarily storing data.
• Individual processing nodes obey data locality: each cluster node processes only its locally stored data partition.
• Clover allows partitioned output data to be automatically presented as one single result.

[Diagram: an input data file is partitioned; transformation logic with pipeline-parallelism runs over Partition 1 (456,NY,JOHN\n 458,VA,WILLIAM\n) and Partition 2 (457,NJ,ANN\n 459,IL,MEGAN\n) locally on each node, and the results are presented as a single output.full.]

Wikipedia > Pipeline parallelism: when multiple components run on the same data set, i.e. when a record is processed in one component while a previous record is being processed in another component.


✕differences

☛ HDFS optimizes for storage. HDFS optimizes for storing vast amounts of data across hundreds of cluster nodes. It follows the “write-once, read-many-times” pattern.

☛ Clover optimizes for I/O throughput. Clover optimizes for very fast parallel writing and reading of data on dozens of cluster nodes. This lends itself nicely to read & process & write (aka ETL).

Which approach is better? It depends...

CloverETL’s Partitioned Sandbox is better for typical data transformation/integration tasks, where all or most input data records get transformed and written out. The Clover Partitioned Sandbox expects short-term storage of data.

Hadoop HDFS is better for storing vast amounts of data which are written by a single process and potentially read by several processes. HDFS expects long-term storage of data.

• Clover is able to read & write data from/to HDFS.

• Clover can read and process HDFS-stored data in parallel.

• Clover can write the results of processing to its Partitioned Sandbox in parallel, or store them back to HDFS as a serial file.

• Data processing tasks can be visually designed in CloverETL.

?which one

Wouldn’t it be nice to have the best of both worlds?

It’s possible!

…thus taking advantage of both worlds.

CloverETL parallel reading from HDFS

[Diagram: an input data file on HDFS (Block 1, Block 2, Block 3); multiple instances of the Parallel Reader access HDFS to read the data in parallel; data processing is performed by standard CloverETL components, with standard CloverETL debugging available; the final result is written as a single serial file to the local filesystem.]

In this scenario:
• HDFS serves as the storage system for the raw source data.
• CloverETL is the data processing engine.

Benchmarks


The (simple) scenario

• Apache log stored on HDFS
• ~274 million web log records
• Extract year, month and IP address
• Aggregate the data to get the number of unique visitors per month
• Running on a cluster of 4 HW nodes, using:

• Hadoop only

• Hadoop+Hive

• CloverETL only

• CloverETL + Hadoop/HDFS
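Ignoring parallelism and the exact log layout, the core of this aggregation is just "count distinct IPs per year-month". A plain-Java sketch with made-up sample lines (field positions are assumptions about the Apache combined log format):

import java.util.*;

public class UniqueVisitorsPerMonth {
    public static void main(String[] args) {
        // a few fake Apache-style log lines; the real run reads ~274 million of them from HDFS
        String[] log = {
            "10.0.0.1 - - [17/Jul/2015:10:00:00 +0000] \"GET / HTTP/1.1\" 200 123",
            "10.0.0.2 - - [17/Jul/2015:10:00:01 +0000] \"GET /a HTTP/1.1\" 200 456",
            "10.0.0.1 - - [18/Jul/2015:11:00:00 +0000] \"GET /b HTTP/1.1\" 200 789"
        };

        Map<String, Set<String>> visitorsPerMonth = new TreeMap<>();
        for (String line : log) {
            String ip = line.substring(0, line.indexOf(' '));              // client IP
            int open = line.indexOf('[');
            String ts = line.substring(open + 1, line.indexOf(']', open)); // 17/Jul/2015:10:00:00 +0000
            String[] dateParts = ts.split("[/:]");
            String yearMonth = dateParts[2] + "-" + dateParts[1];          // e.g. 2015-Jul
            visitorsPerMonth.computeIfAbsent(yearMonth, k -> new HashSet<>()).add(ip);
        }

        visitorsPerMonth.forEach((month, ips) ->
            System.out.println(month + ": " + ips.size() + " unique visitors"));
    }
}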

The (simple) scenario results:

Configuration                                                      Time (sec)
Hadoop only (8 reducers)                                                  329
Hadoop Hive Query                                                         127
CloverETL only (Partitioned Sandbox)                                       59
CloverETL + Hadoop/HDFS (Segmented Parallel Reading from HDFS)             72

CloverETL brings:
• fast parallel processing
• visual design & debugging
• support for data formats and communication protocols
• process automation & monitoring

+synergy

(“Happy Together”, a song by The Turtles)

Hadoop/HDFS brings:
• low-cost storage of big data
• fault resiliency through controllable data replication

For more information on

• CloverETL Cluster architecture:
  http://www.cloveretl.com/products/server/cluster
  http://www.slideshare.net/cloveretl/cloveretl-cluster

• CloverETL in general:
  http://www.cloveretl.com
