How to Build Big Data Pipelines for Hadoop
Dr. Mark Pollack
• “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze
• A subjective and moving target
• Big data in many sectors today ranges from tens of terabytes to multiple petabytes
Big Data
Enterprise Data Trends
Value from Data Exceeds Hardware & Software Costs
• Value in connecting data sets
• Grouping e-commerce users by user agent
• Orbitz shows more expensive hotels to Mac users
• See http://on.wsj.com/UhSlNi
The Data Access Landscape - The Value of Data
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3c
• Spring has always provided excellent data access support
– Transaction management
– Portable data access exception hierarchy
– JDBC – JdbcTemplate
– ORM – Hibernate, JPA, JDO, iBATIS support
– Cache support (Spring 3.1)
• Spring Data project started in 2010
• Goal is to “refresh” Spring’s Data Access support
– In light of the new data access landscape
Spring and Data Access
Spring Data Mission Statement
“Provide a familiar and consistent Spring-based programming model for Big Data, NoSQL, and relational stores while retaining store-specific features and capabilities.”
• Relational
– JPA
– JDBC Extensions
• NoSQL
– Redis
– HBase
– Mongo
– Neo4j
– Lucene
– Gemfire
• Big Data
– Hadoop
• HDFS and M/R
• Hive
• Pig
• Cascading
– Splunk
• Access
– Repositories
– QueryDSL
– REST
Spring Data – Supported Technologies
A View of a Big Data System
Diagram: data streams (log files, sensors, mobile), SaaS, and social sources feed an ingestion engine; data flows into an unstructured data store, stream processing, interactive processing (structured DB), and batch analysis; a distribution engine feeds integration apps, analytical apps, and real-time analytics, with monitoring/deployment spanning the stack. Spring projects can be used throughout to provide a solution.
![Page 9: How to Build Big Data Pipelines for Hadoop Dr. Mark Pollack](https://reader035.vdocuments.net/reader035/viewer/2022070305/5513efd055034679748b5b5a/html5/thumbnails/9.jpg)
• Real-world big data solutions require workflow across systems
• They share the core components of a classic integration workflow
• Big data solutions need to integrate with existing data and apps
• Event-driven processing
• Batch workflows
Big Data Problems are Integration Problems
• Spring Integration
for building and configuring message-based integration flows using input & output adapters, channels, and processors
• Spring Batch
for building and operating batch workflows and manipulating data in files and ETL; the basis for JSR 352 in Java EE 7
Spring projects offer substantial integration functionality
• Spring Data
for manipulating data in relational DBs as well as a variety of NoSQL databases and data grids (inside Gemfire 7.0)
• Spring for Apache Hadoop
for orchestrating Hadoop and non-Hadoop workflows in conjunction with Batch and Integration processing (inside GPHD 1.2)
Spring projects offer substantial integration functionality
Integration is an essential part of Big Data
Some Existing Big Data Integration tools
Hadoop as a Big Data Platform
• Hadoop has a poor out of the box programming model
• Applications are generally a collection of scripts calling command line apps
• Spring simplifies developing Hadoop applications
• By providing a familiar and consistent programming and configuration model
• Across a wide range of use cases
– HDFS usage
– Data Analysis (MR/Pig/Hive/Cascading)
– Workflow
– Event Streams
– Integration
• Allowing you to start small and grow
Spring for Hadoop - Goals
Relationship with other Spring projects
Spring Hadoop – Core Functionality
• Declarative configuration
– Create, configure, and parameterize Hadoop connectivity and all job types
– Environment profiles – easily move from dev to QA to prod
• Developer productivity
– Create well-formed applications, not spaghetti script applications
– Simplify HDFS and FsShell API with support for JVM scripting
– Runner classes for MR/Pig/Hive/Cascading for small workflows
– Helper “Template” classes for Pig/Hive/HBase
Capabilities: Spring + Hadoop
Core Map Reduce idea
• Standard Hadoop APIs
Counting Words – Configuring M/R
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
• Standard Hadoop API - Mapper
Counting Words – M/R Code

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
• Standard Hadoop API - Reducer
Counting Words – M/R Code

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
• Standard Hadoop
• SHDP (Spring Hadoop)
Running Hadoop Example Jars
bin/hadoop jar hadoop-examples.jar wordcount /wc/input /wc/output

<hdp:configuration/>

<hdp:jar-runner id="wordcount" jar="hadoop-examples.jar">
    <hdp:arg value="wordcount"/>
    <hdp:arg value="/wc/input"/>
    <hdp:arg value="/wc/output"/>
</hdp:jar-runner>
• Standard Hadoop
• SHDP
Running Hadoop Tools
bin/hadoop jar wordcount.jar org.myorg.WordCount -conf myhadoop-site.xml -D ignoreCase=true /wc/input /wc/output

<hdp:configuration resources="myhadoop-site.xml"/>

<hdp:tool-runner id="wc" jar="wordcount.jar">
    <hdp:arg value="/wc/input"/>
    <hdp:arg value="/wc/output"/>
    ignoreCase=true
</hdp:tool-runner>
Configuring Hadoop
<context:property-placeholder location="hadoop-dev.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<hdp:job id="word-count-job"
    input-path="${input.path}" output-path="${output.path}"
    jar="myjob.jar"
    mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
    reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>

<hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/>

applicationContext.xml

input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000

hadoop-dev.properties
• Access all “bin/hadoop fs” commands through FsShell – mkdir, chmod, test
HDFS and Hadoop Shell as APIs
class MyScript {

    @Autowired
    FsShell fsh;

    @PostConstruct
    void init() {
        String outputDir = "/data/output";
        if (fsh.test(outputDir)) {
            fsh.rmr(outputDir);
        }
    }
}
• FsShell is designed to support JVM scripting languages
HDFS and FsShell as APIs
// use the shell (made available under variable fsh)
if (!fsh.test(inputDir)) {
    fsh.mkdir(inputDir)
    fsh.copyFromLocal(sourceFile, inputDir)
    fsh.chmod(700, inputDir)
}
if (fsh.test(outputDir)) {
    fsh.rmr(outputDir)
}
copy-files.groovy
HDFS and FsShell as APIs

<hdp:script id="setupScript" language="groovy">
    <hdp:property name="inputDir" value="${input}"/>
    <hdp:property name="outputDir" value="${output}"/>
    <hdp:property name="sourceFile" value="${source}"/>

    // use the shell (made available under variable fsh)
    if (!fsh.test(inputDir)) {
        fsh.mkdir(inputDir)
        fsh.copyFromLocal(sourceFile, inputDir)
        fsh.chmod(700, inputDir)
    }
    if (fsh.test(outputDir)) {
        fsh.rmr(outputDir)
    }
</hdp:script>
appCtx.xml
• Externalize Script
HDFS and FsShell as APIs
<script id="setupScript" location="copy-files.groovy">
    <property name="inputDir" value="${wordcount.input.path}"/>
    <property name="outputDir" value="${wordcount.output.path}"/>
    <property name="sourceFile" value="${localSourceFile}"/>
</script>
appCtx.xml
$> demo
Streaming Jobs and Environment Configuration

bin/hadoop jar hadoop-streaming.jar \
    -input /wc/input -output /wc/output \
    -mapper /bin/cat -reducer /bin/wc \
    -files stopwords.txt

<context:property-placeholder location="hadoop-${env}.properties"/>

<hdp:streaming id="wc"
    input-path="${input}" output-path="${output}"
    mapper="${cat}" reducer="${wc}"
    files="classpath:stopwords.txt"/>

env=dev java -jar SpringLauncher.jar applicationContext.xml

input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000

hadoop-dev.properties
Streaming Jobs and Environment Configuration

bin/hadoop jar hadoop-streaming.jar \
    -input /wc/input -output /wc/output \
    -mapper /bin/cat -reducer /bin/wc \
    -files stopwords.txt

<context:property-placeholder location="hadoop-${env}.properties"/>

<hdp:streaming id="wc"
    input-path="${input}" output-path="${output}"
    mapper="${cat}" reducer="${wc}"
    files="classpath:stopwords.txt"/>

env=qa java -jar SpringLauncher.jar applicationContext.xml

input.path=/gutenberg/input/
output.path=/gutenberg/word/
hd.fs=hdfs://darwin:9000

hadoop-qa.properties
• Use dependency injection to obtain a reference to the Hadoop Job
– Perform additional runtime configuration and submit
Word Count – Injecting Jobs
public class WordService {

    @Inject
    private Job mapReduceJob;

    public void processWords() throws Exception {
        mapReduceJob.submit();
    }
}
Pig
• An alternative to writing MapReduce applications
– Improves productivity
• Pig applications are written in the Pig Latin language
• Pig Latin is a high-level data processing language
– In the spirit of sed and awk, not SQL
• Pig Latin describes a sequence of steps
– Each step performs a transformation on an item of data in a collection
• Extensible with user-defined functions (UDFs)
• A PigServer is responsible for translating Pig Latin to MR
What is Pig?
Counting Words – PigLatin Script
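The script itself appeared only as an image on the original slide. A standard Pig Latin word-count of the kind the slide presents (the /wc input and output paths are assumptions, matching the earlier MR examples) looks like:

```pig
-- load each line of the input files
lines = LOAD '/wc/input' AS (line:chararray);

-- split lines into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- group identical words and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);

STORE counts INTO '/wc/output';
```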
• Standard Pig
• Spring Hadoop
– Creates a PigServer
– Optional execution of scripts on application startup

Using Pig

pig -x mapreduce wordcount.pig

pig wordcount.pig -P pig.properties -p pig.exec.nocombiner=true

<pig-factory job-name="wc" properties-location="pig.properties">
    pig.exec.nocombiner=true
    <script location="wordcount.pig">
        <arguments>ignoreCase=TRUE</arguments>
    </script>
</pig-factory>
• Execute a small Pig workflow (HDFS, PigLatin, HDFS)
Spring’s PigRunner
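The configuration on this slide is an image; a sketch of a PigRunner definition in Spring for Apache Hadoop (the script locations and the pre-action bean name are assumptions, reusing the earlier setupScript) might look like:

```xml
<hdp:pig-factory/>

<!-- the HDFS setup script runs before, the Pig script as the job itself -->
<hdp:pig-runner id="pigRunner" pre-action="setupScript" run-at-startup="true">
    <hdp:script location="wordcount.pig"/>
</hdp:pig-runner>
```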
• PigRunner implements Callable
• Use Spring’s scheduling support

Schedule a Pig job

@Scheduled(cron = "0 0 12 * * ?")
public void process() {
    pigRunner.call();
}
• Simplifies the programmatic use of Pig
• Common tasks are ‘one-liners’
PigTemplate
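The slide’s configuration is an image; declaring a PigTemplate on top of a PigServer factory is roughly as follows (attribute names follow the Spring for Apache Hadoop namespace and are stated here as an assumption):

```xml
<hdp:pig-factory id="pigFactory" properties-location="pig.properties"/>

<hdp:pig-template pig-factory-ref="pigFactory"/>
```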
PigTemplate - Programmatic Use
public class PigPasswordRepository implements PasswordRepository {

    private PigTemplate pigTemplate;

    private String pigScript = "classpath:password-analysis.pig";

    public void processPasswordFile(String inputFile) {
        String outputDir = baseOutputDir + File.separator + counter.incrementAndGet();
        Properties scriptParameters = new Properties();
        scriptParameters.put("inputDir", inputFile);
        scriptParameters.put("outputDir", outputDir);
        pigTemplate.executeScript(pigScript, scriptParameters);
    }
    //...
}
Hive
• An alternative to writing MapReduce applications
– Improves productivity
• Hive applications are written using HiveQL
• HiveQL is in the spirit of SQL
• A HiveServer is responsible for translating HiveQL to MR
• Access via JDBC, ODBC, or Thrift RPC
What is Hive?
Counting Words - HiveQL
-- import the file as lines
CREATE EXTERNAL TABLE lines(line string);
LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;

-- split each line into words and count them
SELECT word, count(*) FROM lines
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;
• Command-line
• JDBC based
Using Hive
• Access Hive using JDBC Client and use JdbcTemplate
Using Hive with Spring Hadoop
<bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver"/>

<bean id="hive-ds" class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
    c:driver-ref="hive-driver" c:url="${hive.url}"/>

<bean id="template" class="org.springframework.jdbc.core.JdbcTemplate"
    c:data-source-ref="hive-ds"/>
• Reuse existing knowledge of Spring’s rich ResultSet-to-POJO mapping features
Using Hive with Spring Hadoop
public long count() {
    return jdbcTemplate.queryForLong("select count(*) from " + tableName);
}

List<Password> result = jdbcTemplate.query("select * from passwords",
    new ResultSetExtractor<List<Password>>() {
        public List<Password> extractData(ResultSet rs) throws SQLException {
            // extract data from result set
        }
    });
• HiveClient is not thread-safe and throws checked exceptions
Standard Hive – Thrift API
public long count() {
    HiveClient hiveClient = createHiveClient();
    try {
        hiveClient.execute("select count(*) from " + tableName);
        return Long.parseLong(hiveClient.fetchOne());
    // checked exceptions
    } catch (HiveServerException ex) {
        throw translateException(ex);
    } catch (org.apache.thrift.TException tex) {
        throw translateException(tex);
    } finally {
        try {
            hiveClient.shutdown();
        } catch (org.apache.thrift.TException tex) {
            logger.debug("Unexpected exception on shutting down HiveClient", tex);
        }
    }
}

protected HiveClient createHiveClient() {
    TSocket transport = new TSocket(host, port, timeout);
    HiveClient hive = new HiveClient(new TBinaryProtocol(transport));
    try {
        transport.open();
    } catch (TTransportException e) {
        throw translateException(e);
    }
    return hive;
}
Spring Hadoop – Batch & Integration
• Reuse same Batch infrastructure and knowledge to manage Hadoop workflows
• Step can be any Hadoop job type or HDFS script
Hadoop Workflows managed by Spring Batch
Spring Batch for file/DB/NoSQL driven applications
– Collect: Process local files
– Transform: Scripting or Java code to transform and enrich
– RT Analysis: N/A
– Ingest: (batch/aggregate) write to HDFS or split/filtering
– Batch Analysis: Orchestrate Hadoop steps in a workflow
– Distribute: Copy data out of HDFS to structured storage
– JMX enabled along with REST interface for job control
Capabilities: Spring + Hadoop + Batch
Pipeline stages: Collect → Transform → RT Analysis → Ingest → Batch Analysis → Distribute → Use
Spring Batch Configuration for Hadoop
<job id="job1">
    <step id="import" next="wordcount">
        <tasklet ref="import-tasklet"/>
    </step>
    <step id="wordcount" next="pig">
        <tasklet ref="wordcount-tasklet"/>
    </step>
    <step id="pig" next="parallel">
        <tasklet ref="pig-tasklet"/>
    </step>
    <split id="parallel" next="hdfs">
        <flow><step id="mrStep"><tasklet ref="mr-tasklet"/></step></flow>
        <flow><step id="hive"><tasklet ref="hive-tasklet"/></step></flow>
    </split>
    <step id="hdfs">
        <tasklet ref="hdfs-tasklet"/>
    </step>
</job>
• Reuse previous Hadoop job definitions
Spring Batch Configuration for Hadoop
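The configuration shown on this slide is an image; tasklets that wrap earlier Hadoop definitions can be declared roughly as follows (the bean names and element nesting are assumptions based on the Spring for Apache Hadoop namespace):

```xml
<!-- run an earlier <hdp:job> definition as a Batch step -->
<hdp:job-tasklet id="wordcount-tasklet" job-ref="word-count-job"/>

<!-- run an HDFS script as a Batch step -->
<hdp:script-tasklet id="import-tasklet" script-ref="setupScript"/>

<!-- run a Pig script as a Batch step -->
<hdp:pig-tasklet id="pig-tasklet">
    <hdp:script location="wordcount.pig"/>
</hdp:pig-tasklet>
```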
Spring Integration for event-driven applications
– Collect: Single node or distributed data collection (TCP/JMS/Rabbit)
– Transform: Scripting or Java code to transform and enrich
– RT Analysis: Connectivity to multiple analysis techniques
– Ingest: Write to HDFS, Split/Filter data stream to other stores
– JMX enabled + control bus for starting/stopping individual components
Capabilities: Spring + Hadoop + SI
Pipeline stages: Collect → Transform → RT Analysis → Ingest → Batch Analysis → Distribute → Use
• Poll a local directory for files; files are rolled over every 10 min
• Copy files to a staging area and then to HDFS
• Use an aggregator to wait until all files for the hour are available, then launch the MR job

Ingesting: Copying Local Log Files into HDFS
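The flow on this slide is an image; a minimal Spring Integration sketch of the polling-and-copy portion (the directory, polling rate, and fsShell bean name are assumptions) could be:

```xml
<!-- poll the local log directory for new files -->
<int-file:inbound-channel-adapter id="logFilePoller"
        directory="/var/log/app" channel="files">
    <int:poller fixed-rate="5000"/>
</int-file:inbound-channel-adapter>

<!-- copy each polled file into an HDFS staging area via an FsShell bean -->
<int:service-activator input-channel="files"
        expression="@fsShell.copyFromLocal(payload.absolutePath, '/data/staging')"/>
```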
• Use syslog adapter
• Transformer categorizes messages
• Route to specific channels based on category
• One route leads to an HDFS write; filtered data is stored in Redis
Ingesting Syslog into HDFS
• Syslog collection across multiple machines
• Use TCP Adapters to forward events– Or other middleware
Ingesting Multi-node Syslog into HDFS
• Use Spring Batch
– JdbcItemReader
– FlatFileItemWriter
Ingesting JDBC to HDFS
<step id="step1">
    <tasklet>
        <chunk reader="jdbcItemReader" processor="itemProcessor"
               writer="flatFileItemWriter" commit-interval="100" retry-limit="3"/>
    </tasklet>
</step>
• Use FsShell
• Include as step in Batch workflow
• Spring Batch can fire events when jobs end; SI can poll HDFS
Exporting HDFS to local Files
<step id="hdfsStep">
    <script-tasklet script-ref="hdfsCopy"/>
</step>

<hdp:script id="hdfsCopy" language="groovy">
    <hdp:property name="sourceDir" value="${sourceDir}"/>
    <hdp:property name="outputDir" value="${outputDir}"/>
    // use the shell (made available under variable fsh)
    fsh.copyToLocal(sourceDir, outputDir)
</hdp:script>
• Use Spring Batch
– MultiFileItemReader
– JdbcItemWriter
Exporting HDFS to JDBC
<step id="step1">
    <tasklet>
        <chunk reader="flatFileItemReader" processor="itemProcessor"
               writer="jdbcItemWriter" commit-interval="100" retry-limit="3"/>
    </tasklet>
</step>
• Use Spring Batch
– MultiFileItemReader
– MongoItemWriter
Exporting HDFS to Mongo
<step id="step1">
    <tasklet>
        <chunk reader="flatFileItemReader" processor="itemProcessor"
               writer="mongoItemWriter"/>
    </tasklet>
</step>
CEP – Style Data Pipeline
Diagram: an HTTP endpoint feeds a consumer that routes messages through transform and filter steps out to HDFS, Esper, Gemfire, and Greenplum Database (GPDB).
• Esper for CEP functionality
• Gemfire for Continuous Query as well as “data capacitor”-like functionality
• Greenplum Database as another ‘big data store’ for ingestion
Thank You!
• Prepping for GA – feedback welcome
• Project Page: springsource.org/spring-data/hadoop
• Source Code: github.com/SpringSource/spring-hadoop
• Books
Resources