Big Data and Scripting – Systems built on top of Hadoop

Pig/Latin

• high-level map reduce programming platform
• Pig is the name of the system
• Pig Latin is the provided programming language
• Pig Latin is
  – similar to query languages like SQL
  – still procedural¹ (in contrast to SQL)
  – extendable using various languages
• originally developed at Yahoo, moved to the Apache Software Foundation in 2007
• pig.apache.org

¹ commands describe actions to execute, not the desired result

Pig/Latin – overview

• execute commands that can be run on a Hadoop cluster
• simple/easy to learn language
• enable rapid prototyping of map reduce applications
• use map/reduce cluster similar to a database system
• interactive or batch mode
  – commands are translated into Hadoop jobs
  – executed on the Hadoop system

Pig/Latin – concepts

• commands (operators) act on relations
• a relation is typically a CSV file from HDFS
• example: assign a relation with named fields to variable A

A = LOAD 'student' USING PigStorage()
    AS (name:chararray, age:int, gpa:float);

• further operators then transform relations into other relations
• example: group items by field "age"

B = GROUP A BY age;

• lazy evaluation²
• relations can have schemas (column names and types)
• schemas can be used to ensure type-safety

² nothing is executed until needed
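
The two statements above can be mimicked in plain Python to show what a relation, a grouping and lazy evaluation amount to. This is only an illustrative sketch and not how Pig works internally: the sample rows and the names `load_students` and `group_by` are made up for the example, and Python generators stand in for Pig's deferred execution.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the 'student' file.
ROWS = [("ann", 20, 3.5), ("bob", 21, 3.1), ("eve", 20, 3.9)]

executed = []  # records which steps actually ran


def load_students():
    """Lazily yield rows as dicts using the schema (name, age, gpa)."""
    executed.append("load")
    for name, age, gpa in ROWS:
        yield {"name": name, "age": age, "gpa": gpa}


def group_by(relation, field):
    """Lazily group rows by one field, yielding (key, rows) pairs."""
    executed.append("group")
    groups = defaultdict(list)
    for row in relation:
        groups[row[field]].append(row)
    yield from sorted(groups.items())


A = load_students()       # nothing runs yet
B = group_by(A, "age")    # still nothing runs
steps_before = list(executed)

result = list(B)          # consuming B triggers both steps
```

Only the final `list(B)` causes any work, which is the behaviour the slide calls lazy evaluation: defining `A` and `B` merely describes the computation.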

extending Pig/Latin

• new functions can be added by implementing them in various languages
  – Java, Python, JavaScript, Ruby
• most extensive support is provided in Java:
  – extend class org.apache.pig.EvalFunc
  – register in Pig

REGISTER myudfs.jar;

• Java programs can be run natively on Hadoop
• special interfaces allow more efficient integration of special types of functions

Apache Hive

• distributed data warehouse allowing queries and transformations
• use various file systems as backend (HDFS, Amazon S3 fs, ...)
• SQL-like query language HiveQL
• execution by translation into map reduce jobs
• indexing to accelerate queries
• command line interface Hive CLI
• originally developed by Facebook, turned into an Apache project
• hive.apache.org

HiveQL – examples

create a table with two columns:

hive> create table student (sid string, sname string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ',';

• tables correspond to directories in the underlying file system
• stored as CSV files

load some data:

hive> LOAD DATA INPATH '/tmp/students.txt' INTO TABLE student;

• imports the file content into Hive’s storage
• dropping the table deletes data and index from Hive storage, does not affect external data

select * from student;

HiveQL – examples

• multiple tables can be joined
• only equality joins

SELECT * FROM student JOIN scores ON (student.sid = scores.sid);

• many standard SQL statements available, e.g. GROUP BY

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

• grouping and aggregation
• write result into new table

Hive – data organization

• top level organization: databases containing tables
• tables correspond to (top-level) directories
• tables are divided into partitions
  – sub directories of the table directory
  – tables can be partitioned by arbitrary column

CREATE TABLE table (col1 INT, col2 STRING)
PARTITIONED BY (col3 DATE);

• partitions are divided into buckets
  – further breakdowns of partitions
  – allow better organization with respect to map reduce
• storage of actual data in flat files
• arbitrary formats can be used, description via regular expressions
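
How partitions and buckets map onto the file system can be sketched in a few lines of Python. The sketch assumes the common `column=value` naming of partition directories; the warehouse root `/warehouse`, the choice of `col1` as bucketing column, the bucket count of 4 and the CRC32 hash are all assumptions made for illustration (Hive uses its own hash function, and the table above declares no bucketing).

```python
import zlib
from datetime import date


def partition_dir(warehouse, table, column, value):
    """Directory holding all rows whose partition column equals value."""
    return f"{warehouse}/{table}/{column}={value}"


def bucket_index(key, num_buckets):
    """Pick a bucket by hashing the key (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def place_row(row, warehouse="/warehouse", table="table", num_buckets=4):
    """Return (directory, bucket) for a row with col1, col2 and col3."""
    directory = partition_dir(warehouse, table, "col3", row["col3"].isoformat())
    return directory, bucket_index(row["col1"], num_buckets)


row = {"col1": 42, "col2": "x", "col3": date(2015, 1, 1)}
directory, bucket = place_row(row)
```

All rows sharing the same `col3` value end up in one sub directory, so a query restricted to one date only has to read that directory; within it, the bucket index spreads rows over a fixed number of files.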

Hive – summary

• bring together SQL functionality and scaling features of Hadoop
• subset of table operations specified by SQL
• no low-latency queries
• optimized for scalability
• storage in flat files on distributed file system
• querying/processing by translation into map reduce jobs
• extending storage by indexing
• due to distributed storage: no individual updates

HBase

• “sparse, distributed, persistent multidimensional sorted map”
  – implementation of the Bigtable³ idea (Google)
• uses HDFS and Hadoop, ZooKeeper for storage and execution
• implements servers for administration and storage/computation
• mapping keys to values
• keys are structured, values are arbitrary
• implements random read/write access on top of HDFS
• provides consistency (on certain levels)
• accessible via shell or Java API
• hbase.apache.org

³ research.google.com/archive/bigtable.html

HBase – structure

• keys are stored sorted, allowing range queries
• keys are highly structured into:
  – rowkey
  – column family
  – column
  – timestamp
• tables are stored sparsely
  – missing values are not encoded but simply not stored
  – every value has to be stored with full address
• data is distributed, load balancing automated
• consistency is guaranteed on row-key level:
  – all changes within one rowkey are atomic
  – all data of one rowkey is stored on a single machine

HBase – storage and access

• data is partitioned by keys
• column families define storage properties
• columns are only labels for the corresponding values

principal operations:
• put: insert/update value
• delete: delete value
• get: retrieve single value
• scan: retrieve collection of values (sequential reading)
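
The four operations can be illustrated with a small in-memory model of a sparse, sorted map keyed by (rowkey, column family, column, timestamp). This is a simplified sketch and not the HBase API: the class `TinyTable` and its method signatures are hypothetical, everything lives on one machine, and `delete` removes cells directly instead of writing delete markers.

```python
import bisect


class TinyTable:
    """Sparse, sorted map from (rowkey, family, column, timestamp) to value."""

    def __init__(self):
        self._keys = []    # sorted list of full cell addresses
        self._values = {}  # address -> value; absent cells take no space

    def put(self, row, family, column, timestamp, value):
        key = (row, family, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def _cell_range(self, row, family, column):
        # All versions of one cell are adjacent in the sorted key list.
        lo = bisect.bisect_left(self._keys, (row, family, column))
        hi = bisect.bisect_left(self._keys, (row, family, column, float("inf")))
        return lo, hi

    def get(self, row, family, column):
        """Return the newest version of one cell, or None if absent."""
        lo, hi = self._cell_range(row, family, column)
        return self._values[self._keys[hi - 1]] if hi > lo else None

    def delete(self, row, family, column):
        """Remove all versions of one cell."""
        lo, hi = self._cell_range(row, family, column)
        for key in self._keys[lo:hi]:
            del self._values[key]
        del self._keys[lo:hi]

    def scan(self, start_row, stop_row):
        """Yield (address, value) for start_row <= rowkey < stop_row in key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


t = TinyTable()
t.put("row1", "info", "name", 1, "Ann")
t.put("row1", "info", "name", 2, "Anna")  # newer version of the same cell
t.put("row2", "info", "name", 1, "Bob")
t.put("row3", "info", "age", 1, "30")
newest = t.get("row1", "info", "name")
scanned_rows = [key[0] for key, _ in t.scan("row1", "row3")]
```

Because the keys are kept sorted, `scan` is a contiguous slice of the key list, which is why range queries over rowkeys are cheap while lookups by value are not supported at all.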

HBase – guarantees

atomicity
• mutations are atomic within a row
• operation result reported
• not atomic over multiple rows (parts may fail, others succeed)

consistency and isolation
• returned rows consist of complete rows:
  – contained data may have changed in between
  – the data returned refers to a single point in the past⁴
• scans are not consistent over multiple rows
  – different rows may refer to different points in time

⁴ HBase keeps data for various time points

HBase – guarantees

visibility
• after successful writing, data is immediately visible to all clients
• versions of rows strictly increase

durability
• refers to data being stored on disk
• data that has been read from a cell is guaranteed to be durable
• successful operations are durable, failed operations not

• visibility and durability may be tuned for performance
• individual reads without visibility guarantees
• instead of durability only periodic writing of data

comparison

• Pig Latin
  – allows viewing data as tables
  – provides ad hoc queries
  – extendable to arbitrary map reduce jobs
• Hive
  – tries to provide SQL functionality
  – slow, large-scale queries
  – structured query language, query planner
• HBase
  – more like a NoSQL database or key/value store
  – no SQL operations, only storage and retrieval
  – guarantees for operations
  – optimized for random, real-time access

note: Pig and Hive can access data from HBase directly
note: Cassandra is a database system similar to HBase, optimized for security

Mahout

• scalable implementations of data mining/learning algorithms
• provide a library for easy access to machine learning implementations
• provide algorithms for the most common problems, e.g.:
  – clustering
  – classification
  – frequent pattern mining, ...
• optimize for practical (e.g. business) usage
• language: Java
• mahout.apache.org

note: Mahout is currently switching from map/reduce to Spark

Mahout – overview

• provides large API of Java classes
• integration into other applications
• execution on top of distributed cluster⁵
• implementations can be adapted to specific problems
  – provide individual I/O classes
  – individual similarity/distance functions, ...
• integration of Apache Lucene (document search engine)

⁵ e.g. Hadoop or Spark

Systems beyond Hadoop


ZooKeeper

• distributed coordination service
  – many problems/functions are shared among distributed systems
  – ZooKeeper provides a single implementation of these
  – avoid repeated implementation of the same services
• provide primitives for
  – synchronization
  – configuration maintenance
  – naming
• optimized for failure tolerance, reliability and performance
• used in other projects as sub service
• another Apache top-level project (zookeeper.apache.org)

ZooKeeper

• provides tree-like information storage
• update guarantees
  – sequential consistency (keep update order)
  – atomicity
  – single system image (one state for views)
  – reliability (applied updates persist)
  – timeliness (time bounds for updates)
• extremely simple interface
  – create/delete/test node existence
  – get/set data
  – get children
  – sync (wait for update to propagate)
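
The shape of this interface can be sketched with a single-process, in-memory tree. The class `TinyTree` and its method names are hypothetical and chosen to mirror the list above; the real service replicates the tree over several servers and adds the update guarantees, neither of which is modelled here (so there is also no `sync`).

```python
class TinyTree:
    """In-memory tree of nodes addressed by slash-separated paths."""

    def __init__(self):
        self._data = {"/": b""}  # path -> data stored at that node

    @staticmethod
    def _parent(path):
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def exists(self, path):
        return path in self._data

    def create(self, path, data=b""):
        if self.exists(path):
            raise KeyError(f"node exists: {path}")
        if not self.exists(self._parent(path)):
            raise KeyError(f"missing parent of: {path}")
        self._data[path] = data

    def delete(self, path):
        if self.get_children(path):
            raise ValueError(f"node has children: {path}")
        del self._data[path]

    def get(self, path):
        return self._data[path]

    def set(self, path, data):
        if not self.exists(path):
            raise KeyError(f"no such node: {path}")
        self._data[path] = data

    def get_children(self, path):
        """Names of the direct children of path, sorted."""
        return sorted(
            p.rsplit("/", 1)[1]
            for p in self._data
            if p != "/" and self._parent(p) == path
        )


tree = TinyTree()
tree.create("/config")
tree.create("/config/db", b"host=a")
tree.set("/config/db", b"host=b")
children = tree.get_children("/config")
```

Configuration maintenance and naming then reduce to agreeing on paths: every participant reads `/config/db` and sees the same value.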

Mesos⁶

Hadoop:
• use one (physical) cluster of machines exclusively

Mesos:
• share a physical cluster between multiple distributed systems
• implement intermediate layer between distributed frameworks and hardware
• administrate physical resources
• distribute these to involved frameworks
• improve cluster utilization
• implement prioritization and failure tolerance

⁶ Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, Hindman et al., 2011

Mesos – example scenarios

multiple Hadoop systems on the same (physical) set of machines
• production system, takes priority
• testing implementations or executing analyses that are of general interest but should not disturb the production system
• test new versions of Hadoop
• all involved Hadoop instances use the same data as input

different distributed frameworks on the same cluster
• different tasks benefit from different optimization approaches
• the map reduce approach is not optimal in every situation
• still, the different frameworks might work on the same base data

Mesos – dividing tasks

scheduling
• distribute tasks to available resources
• consider data locality:
  send tasks to nodes that already store the involved data
• depends on framework (optimization strategy), job (algorithm) and task order
should be implemented by the framework

resource distribution
• distribute available resources to frameworks
• keep track of system usage
• ensure priorities between different frameworks
should be implemented by intermediate layer (Mesos)

Mesos – architecture

• centralized master-slave system
• frameworks run tasks on slave nodes
• master implements sharing using resource offers:
  – list of free resources on various slaves
  – master decides which (and how many) resources are offered to which framework
    (implements organizational policy)
• frameworks have two parts:
  – scheduler: accepts or declines offers from Mesos
  – executor process: started on computing nodes, executes tasks
• framework decides which task is solved on a particular resource
• tasks are executed by sending task description to master
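
The division of labour between master and framework schedulers can be sketched as a toy simulation. Everything here is a simplifying assumption rather than the Mesos API: resources are reduced to a CPU count per slave, the policy is a fixed priority order of frameworks, executors are omitted, and the names `Framework`, `on_offer` and `run_master` are hypothetical.

```python
class Framework:
    """Framework scheduler: accepts an offer if a pending task fits."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)  # each task is (task_name, cpus_needed)

    def on_offer(self, slave, free_cpus):
        """Return the task to launch on this offer, or None to decline."""
        for task in self.pending:
            if task[1] <= free_cpus:
                self.pending.remove(task)
                return task
        return None


def run_master(free, frameworks):
    """Offer free CPUs slave by slave, to frameworks in priority order."""
    launched = []
    progress = True
    while progress:
        progress = False
        for slave in sorted(free):
            for fw in frameworks:  # list order encodes the policy
                if free[slave] <= 0:
                    break
                task = fw.on_offer(slave, free[slave])
                if task is not None:
                    free[slave] -= task[1]
                    launched.append((fw.name, task[0], slave))
                    progress = True
    return launched


free = {"slave1": 4, "slave2": 2}
prod = Framework("production", [("p1", 3), ("p2", 2)])
test = Framework("testing", [("t1", 1), ("t2", 4)])
launched = run_master(free, [prod, test])
```

The master only decides who gets offered what; which task runs on an offered resource is decided inside `on_offer`, i.e. by the framework. In this run the testing framework's large task `t2` stays pending because production is served first.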

Mesos – summary/overview

• a framework/library/set of servers
• allows running several distributed frameworks on top of a single cluster of machines
• administrates and distributes resources with respect to configurable priorities
• is an actually implemented and used system: mesos.apache.org
• made it into the Apache Incubator
• started at UC Berkeley AMPLab
• uses ZooKeeper

Spark⁷

• map reduce is not optimal for all problems
• many algorithms iterate a number of times over source data
• example: gradient descent
  each iteration uses source data to compute new gradient
• in Hadoop, every iteration reads all source data completely from disk, computes a single step and writes result
• approach in Spark: create resilient distributed datasets (RDDs), if possible cached in memory of the involved machines

⁷ Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010
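
The cost difference between rereading the source data in every iteration and keeping it in memory can be made visible with a small single-machine sketch. It uses neither Hadoop nor Spark: `load_points` is a hypothetical stand-in for an expensive read from disk, the three data points and the model y = w·x are made up, and the call counter merely counts how often the data would have been read.

```python
def load_points():
    """Stand-in for an expensive read of the source data from disk."""
    load_points.calls += 1
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs with y = 2x


load_points.calls = 0


def gradient(w, points):
    """Derivative of the mean squared error of the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in points) / len(points)


def descend(steps, rate, cached):
    """Run gradient descent; reload the data each step unless cached."""
    w = 0.0
    points = load_points() if cached else None
    for _ in range(steps):
        data = points if cached else load_points()
        w -= rate * gradient(w, data)
    return w


w_cached = descend(steps=50, rate=0.05, cached=True)
loads_cached = load_points.calls  # the data was read a single time
```

With `cached=False` the same result is reached, but the loader runs once per iteration, which corresponds to the repeated full read from disk described for Hadoop.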

Spark: overview

• cluster computing system, comparable to Hadoop
• provides primitives for in-memory cluster computing
  – data types are distributed in the cluster
  – parts on the individual machines kept in memory
• speedup in comparison to Hadoop for certain (iterating) algorithms (e.g. logistic regression)
• built on top of Mesos
• provides APIs for Scala, Java, Python
• originally developed for
  – iterative algorithms (iterations using the same source data)
  – interactive data mining
• spark.apache.org
• often seen as the successor of Hadoop

Spark programming model

• Spark applications consist of a driver program
  – implements the global, high-level control flow
  – launches operations that are executed in parallel
• distribution and parallelization is achieved with
  – resilient distributed datasets (RDDs)
  – parallel operations working on RDDs
  – shared variables
• RDDs are read-only, distributed collections
• constructed from input data or transformation
• held in memory (if possible)

Spark – RDDs: resilient distributed datasets

• lazy evaluation:
  – creating handle describes derivation
  – derivation is only executed when necessary
• ephemeral:
  – not guaranteed to stay in memory
  – recreated on demand
• state can be changed using cache and save
• cache:
  – still not evaluated
  – after first evaluation kept in memory if possible
• save:
  – triggers evaluation
  – writes to distributed storage
  – handle of saved RDD points to persistently stored object
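
The interplay of lazy evaluation, recomputation on demand and `cache` can be imitated with a small handle class. This is a single-machine sketch and not Spark's implementation: the class `LazyDataset` is hypothetical, data is a plain Python list, and `save` is left out.

```python
class LazyDataset:
    """Handle that describes how to derive data; evaluated only on demand."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing a list
        self._cached = False
        self._memory = None
        self.evaluations = 0     # how often the derivation actually ran

    def map(self, func):
        """Describe a new dataset; nothing is computed here."""
        return LazyDataset(lambda: [func(x) for x in self._materialize()])

    def cache(self):
        """Mark for caching; evaluation is still deferred."""
        self._cached = True
        return self

    def _materialize(self):
        if self._memory is not None:
            return self._memory
        self.evaluations += 1
        data = self._compute()
        if self._cached:
            self._memory = data  # keep the result for later use
        return data

    def collect(self):
        """Action: trigger evaluation and return the elements."""
        return list(self._materialize())


source = LazyDataset(lambda: [1, 2, 3])
squares = source.map(lambda x: x * x).cache()
before = squares.evaluations  # cache() alone evaluated nothing
first = squares.collect()     # first use runs the derivation
second = squares.collect()    # second use is served from memory
```

Without the call to `cache()`, every `collect()` would rerun the whole derivation chain back to `source`, which is the "recreated on demand" behaviour of an ephemeral dataset.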

Spark – parallel transformations

• RDDs are transformed by parallel transformations
• result is always a new RDD

emulating map reduce is simple:
• flatMap(function): apply function to each element of the RDD, produce new RDD from results (multiple results per call)
• reduceByKey(function)
  – called on collections of (K,V) key/value pairs
  – groups by key, aggregates with function

other transformations include
• union() (of two RDDs), distinct() (distinct elements)
• sort(), groupByKey()
• join() (equi-join on key), cartesian()
• cogroup() maps (K,V), (K,W) → (K, Seq(V), Seq(W))
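
How the two transformations emulate map reduce is easiest to see on word count. The functions `flat_map` and `reduce_by_key` below are hypothetical pure-Python stand-ins that work on ordinary lists, so the sketch runs without a cluster or a Spark installation; on a real RDD the same two steps would be chained as method calls.

```python
from functools import reduce
from itertools import groupby


def flat_map(func, elements):
    """Apply func to each element and concatenate the resulting lists."""
    return [result for element in elements for result in func(element)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key and fold each group's values with func."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


lines = ["to be or", "not to be"]

# "map" phase: one line produces several (word, 1) pairs
pairs = flat_map(lambda line: [(word, 1) for word in line.split()], lines)

# "reduce" phase: sum the counts per word
counts = reduce_by_key(lambda a, b: a + b, pairs)
```

The sorting step inside `reduce_by_key` plays the role of the shuffle between the map and reduce phases.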

Spark – actions

actions extract data from the RDD and transport it back into the context of the driver program:
• collect(): retrieve all elements of an RDD
• first(), take(n): retrieve first/first n elements
• reduce(func): use commutative, associative func for parallel reduction and retrieve final result
• foreach(func): run function over all elements (e.g. for statistics)
• count(): get number of elements

Spark – shared variables

• parallel functions transport variables from their original context to the node they are executed at
• these have to be transported every time a function is sent over the network
• Spark supports 2 additional forms for special use cases
• broadcast variables:
  – transported only once to all involved nodes
  – read-only in parallel functions
• accumulators:
  – parallel functions can “add” to accumulators
  – adding is some associative operation
  – can be read only by driver program

Pregel⁸

• solve large-scale problems on graphs/networks
  example: PageRank
• distributed system designed for graph computations
• assumption: many graph algorithms
  – traverse graph via edges
  – access data very locally,
    e.g. computations for a node involve values of its neighbors
• Pregel implements such a system, but is not public/open source
• Giraph is an open source framework implementing the same idea: giraph.apache.org

⁸ Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser, Czajkowski, 2010

Pregel – overview

• basic unit of computation: node with unique id and its incident edges
• nodes
  – perform computations in parallel
  – communicate with each other via messages
• a superstep is one round where each node computes
• in each superstep:
  – node receives messages from last round
  – updates/computes/sends messages to be received in next round
• each node can vote for stopping
  – turns node inactive
  – gets reactivated by received message
  – computation stops when all nodes vote for stop

example: connected components

• assume: node ids are totally ordered
• node initializes minID with its ID
• sends its ID to all neighbors
• in each following round:
  – collect all received IDs
  – update minimum
  – if minID changed send new minID to neighbors
  – else vote for stop

result:
• all nodes in a component have same minID
• nodes in different components have different minID
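
The algorithm above can be simulated on a single machine by processing the supersteps one after another. This is a sequential sketch of the message-passing scheme, not Pregel or Giraph code: the function name `connected_components` is hypothetical, and the graph is assumed to be undirected and given as a dictionary from node ID to the list of its neighbors.

```python
def connected_components(neighbors):
    """Simulate supersteps; neighbors maps node id -> list of adjacent ids."""
    min_id = {node: node for node in neighbors}  # initialize minID with own ID

    # First superstep: every node sends its own ID to all neighbors.
    inbox = {node: [] for node in neighbors}
    for node, adjacent in neighbors.items():
        for other in adjacent:
            inbox[other].append(node)

    supersteps = 1
    while any(inbox.values()):  # pending messages reactivate nodes
        outbox = {node: [] for node in neighbors}
        for node, received in inbox.items():
            if not received:
                continue  # node stays inactive, it has voted for stop
            smallest = min(received)
            if smallest < min_id[node]:  # minID changed: tell the neighbors
                min_id[node] = smallest
                for other in neighbors[node]:
                    outbox[other].append(smallest)
            # otherwise the node votes for stop by sending nothing
        inbox = outbox
        supersteps += 1
    return min_id, supersteps


# Two components: the path 1-2-3 and the edge 4-5.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, rounds = connected_components(graph)
```

The computation ends in the first superstep in which no messages are in flight, which is exactly the state where all nodes have voted for stop; `labels` then maps every node to the smallest ID in its component.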

• master-slave system
• master node:
  – determines end of algorithm
  – takes care of node failure
  – synchronizes node communication
• basic idea can be extended:
  – nodes can mutate the graph (create/delete nodes/edges)
