How to Build Big Data Pipelines for Hadoop
Dr. Mark Pollack
• “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze
• A subjective and moving target
• Big data in many sectors today ranges from tens of terabytes to multiple petabytes
Big Data
Enterprise Data Trends
Value from Data Exceeds Hardware & Software Costs
• Value in connecting data sets
• Grouping e-commerce users by user agent
• Orbitz shows more expensive hotels to Mac users
• See http://on.wsj.com/UhSlNi
The Data Access Landscape - The Value of Data
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3c
• Spring has always provided excellent data access support
– Transaction management
– Portable data access exception hierarchy
– JDBC – JdbcTemplate
– ORM – Hibernate, JPA, JDO, iBATIS support
– Cache support (Spring 3.1)
• Spring Data project started in 2010
• Goal is to “refresh” Spring’s Data Access support
– In light of the new data access landscape
Spring and Data Access
Spring Data Mission Statement
“Provide a familiar and consistent Spring-based programming model for Big Data, NoSQL, and relational stores while retaining store-specific features and capabilities.”
• Relational
– JPA
– JDBC Extensions
• NoSQL
– Redis
– HBase
– Mongo
– Neo4j
– Lucene
– Gemfire
• Big Data
– Hadoop
• HDFS and M/R
• Hive
• Pig
• Cascading
– Splunk
• Access
– Repositories
– QueryDSL
– REST
Spring Data – Supported Technologies
A View of a Big Data System
Diagram: data streams (log files, sensors, mobile), SaaS, and social sources feed an ingestion engine; data flows into an unstructured data store, stream processing, interactive processing (structured DB), and batch analysis; a distribution engine feeds integration apps, analytical apps, and real-time analytics, with monitoring/deployment spanning the stack. Spring projects can be used throughout to provide a solution.
![Page 9: How to Build Big Data Pipelines for Hadoop Dr. Mark Pollack](https://reader035.vdocuments.net/reader035/viewer/2022070305/5513efd055034679748b5b5a/html5/thumbnails/9.jpg)
• Real-world big data solutions require workflow across systems
• They share the core components of a classic integration workflow
• Big data solutions need to integrate with existing data and apps
• Event-driven processing
• Batch workflows
Big Data Problems are Integration Problems
• Spring Integration
for building and configuring message-based integration flows using input & output adapters, channels, and processors
• Spring Batch
for building and operating batch workflows and manipulating data in files and ETL; the basis for JSR 352 in Java EE 7
Spring projects offer substantial integration functionality
• Spring Data
for manipulating data in relational DBs as well as a variety of NoSQL databases and data grids (inside Gemfire 7.0)
• Spring for Apache Hadoop
for orchestrating Hadoop and non-Hadoop workflows in conjunction with Batch and Integration processing (inside GPHD 1.2)
Spring projects offer substantial integration functionality
Integration is an essential part of Big Data
Some Existing Big Data Integration tools
Hadoop as a Big Data Platform
• Hadoop has a poor out of the box programming model
• Applications are generally a collection of scripts calling command line apps
• Spring simplifies developing Hadoop applications
• By providing a familiar and consistent programming and configuration model
• Across a wide range of use cases
– HDFS usage
– Data Analysis (MR/Pig/Hive/Cascading)
– Workflow
– Event Streams
– Integration
• Allowing you to start small and grow
Spring for Hadoop - Goals
Relationship with other Spring projects
Spring Hadoop – Core Functionality
• Declarative configuration
– Create, configure, and parameterize Hadoop connectivity and all job types
– Environment profiles – easily move from dev to QA to prod
• Developer productivity
– Create well-formed applications, not spaghetti script applications
– Simplify HDFS and FsShell API with support for JVM scripting
– Runner classes for MR/Pig/Hive/Cascading for small workflows
– Helper “Template” classes for Pig/Hive/HBase
Capabilities: Spring + Hadoop
Core Map Reduce idea
• Standard Hadoop APIs
Counting Words – Configuring M/R
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
• Standard Hadoop API - Mapper
Counting Words – M/R Code

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
• Standard Hadoop API - Reducer
Counting Words – M/R Code

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
• Standard Hadoop
• SHDP (Spring Hadoop)
Running Hadoop Example Jars
bin/hadoop jar hadoop-examples.jar wordcount /wc/input /wc/output

<hdp:configuration/>

<hdp:jar-runner id="wordcount" jar="hadoop-examples.jar">
    <hdp:arg value="wordcount"/>
    <hdp:arg value="/wc/input"/>
    <hdp:arg value="/wc/output"/>
</hdp:jar-runner>
• Standard Hadoop
• SHDP
Running Hadoop Tools
bin/hadoop jar wordcount.jar org.myorg.WordCount -conf myhadoop-site.xml -D ignoreCase=true /wc/input /wc/output

<hdp:configuration resources="myhadoop-site.xml"/>

<hdp:tool-runner id="wc" jar="wordcount.jar">
    <hdp:arg value="/wc/input"/>
    <hdp:arg value="/wc/output"/>
    ignoreCase=true
</hdp:tool-runner>
Configuring Hadoop
<context:property-placeholder location="hadoop-dev.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<hdp:job id="word-count-job"
    input-path="${input.path}" output-path="${output.path}"
    jar="myjob.jar"
    mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
    reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>

<hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/>

applicationContext.xml

input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000

hadoop-dev.properties
• Access all “bin/hadoop fs” commands through FsShell – mkdir, chmod, test
HDFS and Hadoop Shell as APIs
class MyScript {

    @Autowired
    FsShell fsh;

    @PostConstruct
    void init() {
        String outputDir = "/data/output";
        if (fsh.test(outputDir)) {
            fsh.rmr(outputDir);
        }
    }
}
• FsShell is designed to support JVM scripting languages
HDFS and FsShell as APIs
// use the shell (made available under variable fsh)
if (!fsh.test(inputDir)) {
    fsh.mkdir(inputDir)
    fsh.copyFromLocal(sourceFile, inputDir)
    fsh.chmod(700, inputDir)
}
if (fsh.test(outputDir)) {
    fsh.rmr(outputDir)
}
copy-files.groovy
HDFS and FsShell as APIs

<hdp:script id="setupScript" language="groovy">
    <hdp:property name="inputDir" value="${input}"/>
    <hdp:property name="outputDir" value="${output}"/>
    <hdp:property name="sourceFile" value="${source}"/>

    // use the shell (made available under variable fsh)
    if (!fsh.test(inputDir)) {
        fsh.mkdir(inputDir)
        fsh.copyFromLocal(sourceFile, inputDir)
        fsh.chmod(700, inputDir)
    }
    if (fsh.test(outputDir)) {
        fsh.rmr(outputDir)
    }
</hdp:script>
appCtx.xml
• Externalize Script
HDFS and FsShell as APIs
<script id="setupScript" location="copy-files.groovy">
    <property name="inputDir" value="${wordcount.input.path}"/>
    <property name="outputDir" value="${wordcount.output.path}"/>
    <property name="sourceFile" value="${localSourceFile}"/>
</script>
appCtx.xml
$> demo
Streaming Jobs and Environment Configuration

bin/hadoop jar hadoop-streaming.jar \
    -input /wc/input -output /wc/output \
    -mapper /bin/cat -reducer /bin/wc \
    -files stopwords.txt

<context:property-placeholder location="hadoop-${env}.properties"/>

<hdp:streaming id="wc"
    input-path="${input}" output-path="${output}"
    mapper="${cat}" reducer="${wc}"
    files="classpath:stopwords.txt"/>

env=dev java -jar SpringLauncher.jar applicationContext.xml

input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000

hadoop-dev.properties
Streaming Jobs and Environment Configuration

bin/hadoop jar hadoop-streaming.jar \
    -input /wc/input -output /wc/output \
    -mapper /bin/cat -reducer /bin/wc \
    -files stopwords.txt

<context:property-placeholder location="hadoop-${env}.properties"/>

<hdp:streaming id="wc"
    input-path="${input}" output-path="${output}"
    mapper="${cat}" reducer="${wc}"
    files="classpath:stopwords.txt"/>

env=qa java -jar SpringLauncher.jar applicationContext.xml

input.path=/gutenberg/input/
output.path=/gutenberg/word/
hd.fs=hdfs://darwin:9000

hadoop-qa.properties
• Use dependency injection to obtain a reference to the Hadoop Job
– Perform additional runtime configuration and submit
Word Count – Injecting Jobs
public class WordService {

    @Inject
    private Job mapReduceJob;

    public void processWords() throws Exception {
        mapReduceJob.submit();
    }
}
Pig
• An alternative to writing MapReduce applications
– Improves productivity
• Pig applications are written in the Pig Latin language
• Pig Latin is a high-level data processing language
– In the spirit of sed and awk, not SQL
• Pig Latin describes a sequence of steps
– Each step performs a transformation on an item of data in a collection
• Extensible with user-defined functions (UDFs)
• A PigServer is responsible for translating Pig Latin to MR
What is Pig?
Counting Words – PigLatin Script
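The script itself appeared only as an image on the original slide. A standard Pig Latin word-count of the kind the slide presents (the /wc input and output paths are assumptions, matching the earlier MR examples) looks like:

```pig
-- load each line of the input files
lines = LOAD '/wc/input' AS (line:chararray);

-- split lines into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- group identical words and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);

STORE counts INTO '/wc/output';
```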
• Standard Pig
• Spring Hadoop
– Creates a PigServer
– Optional execution of scripts on application startup

Using Pig

pig -x mapreduce wordcount.pig

pig wordcount.pig -P pig.properties -p pig.exec.nocombiner=true

<pig-factory job-name="wc" properties-location="pig.properties">
    pig.exec.nocombiner=true
    <script location="wordcount.pig">
        <arguments>ignoreCase=TRUE</arguments>
    </script>
</pig-factory>
• Execute a small Pig workflow (HDFS, PigLatin, HDFS)
Spring’s PigRunner
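The configuration on this slide is an image; a sketch of a PigRunner definition in Spring for Apache Hadoop (the script locations and the pre-action bean name are assumptions, reusing the earlier setupScript) might look like:

```xml
<hdp:pig-factory/>

<!-- the HDFS setup script runs before, the Pig script as the job itself -->
<hdp:pig-runner id="pigRunner" pre-action="setupScript" run-at-startup="true">
    <hdp:script location="wordcount.pig"/>
</hdp:pig-runner>
```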
• PigRunner implements Callable
• Use Spring’s scheduling support

Schedule a Pig job

@Scheduled(cron = "0 0 12 * * ?")
public void process() {
    pigRunner.call();
}
• Simplifies the programmatic use of Pig
• Common tasks are ‘one-liners’
PigTemplate
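The slide’s configuration is an image; declaring a PigTemplate on top of a PigServer factory is roughly as follows (attribute names follow the Spring for Apache Hadoop namespace and are stated here as an assumption):

```xml
<hdp:pig-factory id="pigFactory" properties-location="pig.properties"/>

<hdp:pig-template pig-factory-ref="pigFactory"/>
```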
PigTemplate - Programmatic Use
public class PigPasswordRepository implements PasswordRepository {

    private PigTemplate pigTemplate;

    private String pigScript = "classpath:password-analysis.pig";

    public void processPasswordFile(String inputFile) {
        String outputDir = baseOutputDir + File.separator + counter.incrementAndGet();
        Properties scriptParameters = new Properties();
        scriptParameters.put("inputDir", inputFile);
        scriptParameters.put("outputDir", outputDir);
        pigTemplate.executeScript(pigScript, scriptParameters);
    }
    //...
}
Hive
• An alternative to writing MapReduce applications
– Improves productivity
• Hive applications are written using HiveQL
• HiveQL is in the spirit of SQL
• A HiveServer is responsible for translating HiveQL to MR
• Access via JDBC, ODBC, or Thrift RPC
What is Hive?
Counting Words - HiveQL
-- import the file as lines
CREATE EXTERNAL TABLE lines(line string);
LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;

-- split each line into words and count them
SELECT word, count(*) FROM lines
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;
• Command-line
• JDBC based
Using Hive
• Access Hive using JDBC Client and use JdbcTemplate
Using Hive with Spring Hadoop
<bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver"/>

<bean id="hive-ds" class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
    c:driver-ref="hive-driver" c:url="${hive.url}"/>

<bean id="template" class="org.springframework.jdbc.core.JdbcTemplate"
    c:data-source-ref="hive-ds"/>
• Reuse existing knowledge of Spring’s rich ResultSet-to-POJO mapping features
Using Hive with Spring Hadoop
public long count() {
    return jdbcTemplate.queryForLong("select count(*) from " + tableName);
}

List<Password> result = jdbcTemplate.query("select * from passwords",
    new ResultSetExtractor<List<Password>>() {
        public List<Password> extractData(ResultSet rs) throws SQLException {
            // extract data from result set
        }
    });
• HiveClient is not thread-safe and throws checked exceptions
Standard Hive – Thrift API
public long count() {
    HiveClient hiveClient = createHiveClient();
    try {
        hiveClient.execute("select count(*) from " + tableName);
        return Long.parseLong(hiveClient.fetchOne());
    // checked exceptions
    } catch (HiveServerException ex) {
        throw translateException(ex);
    } catch (org.apache.thrift.TException tex) {
        throw translateException(tex);
    } finally {
        try {
            hiveClient.shutdown();
        } catch (org.apache.thrift.TException tex) {
            logger.debug("Unexpected exception on shutting down HiveClient", tex);
        }
    }
}

protected HiveClient createHiveClient() {
    TSocket transport = new TSocket(host, port, timeout);
    HiveClient hive = new HiveClient(new TBinaryProtocol(transport));
    try {
        transport.open();
    } catch (TTransportException e) {
        throw translateException(e);
    }
    return hive;
}
Spring Hadoop – Batch & Integration
• Reuse same Batch infrastructure and knowledge to manage Hadoop workflows
• Step can be any Hadoop job type or HDFS script
Hadoop Workflows managed by Spring Batch
Spring Batch for file/DB/NoSQL driven applications
– Collect: Process local files
– Transform: Scripting or Java code to transform and enrich
– RT Analysis: N/A
– Ingest: (batch/aggregate) write to HDFS or split/filtering
– Batch Analysis: Orchestrate Hadoop steps in a workflow
– Distribute: Copy data out of HDFS to structured storage
– JMX enabled along with REST interface for job control
Capabilities: Spring + Hadoop + Batch
Pipeline stages: Collect → Transform → RT Analysis → Ingest → Batch Analysis → Distribute → Use
Spring Batch Configuration for Hadoop
<job id="job1">
    <step id="import" next="wordcount">
        <tasklet ref="import-tasklet"/>
    </step>
    <step id="wordcount" next="pig">
        <tasklet ref="wordcount-tasklet"/>
    </step>
    <step id="pig" next="parallel">
        <tasklet ref="pig-tasklet"/>
    </step>
    <split id="parallel" next="hdfs">
        <flow><step id="mrStep"><tasklet ref="mr-tasklet"/></step></flow>
        <flow><step id="hive"><tasklet ref="hive-tasklet"/></step></flow>
    </split>
    <step id="hdfs">
        <tasklet ref="hdfs-tasklet"/>
    </step>
</job>
• Reuse previous Hadoop job definitions
Spring Batch Configuration for Hadoop
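The configuration shown on this slide is an image; tasklets that wrap earlier Hadoop definitions can be declared roughly as follows (the bean names and element nesting are assumptions based on the Spring for Apache Hadoop namespace):

```xml
<!-- run an earlier <hdp:job> definition as a Batch step -->
<hdp:job-tasklet id="wordcount-tasklet" job-ref="word-count-job"/>

<!-- run an HDFS script as a Batch step -->
<hdp:script-tasklet id="import-tasklet" script-ref="setupScript"/>

<!-- run a Pig script as a Batch step -->
<hdp:pig-tasklet id="pig-tasklet">
    <hdp:script location="wordcount.pig"/>
</hdp:pig-tasklet>
```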
Spring Integration for event-driven applications
– Collect: Single node or distributed data collection (TCP/JMS/Rabbit)
– Transform: Scripting or Java code to transform and enrich
– RT Analysis: Connectivity to multiple analysis techniques
– Ingest: Write to HDFS, Split/Filter data stream to other stores
– JMX enabled + control bus for starting/stopping individual components
Capabilities: Spring + Hadoop + SI
Pipeline stages: Collect → Transform → RT Analysis → Ingest → Batch Analysis → Distribute → Use
• Poll a local directory for files; files are rolled over every 10 min
• Copy files to a staging area and then to HDFS
• Use an aggregator to wait until all files for the hour are available, then launch the MR job

Ingesting: Copying Local Log Files into HDFS
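The flow on this slide is an image; a minimal Spring Integration sketch of the polling-and-copy portion (the directory, polling rate, and fsShell bean name are assumptions) could be:

```xml
<!-- poll the local log directory for new files -->
<int-file:inbound-channel-adapter id="logFilePoller"
        directory="/var/log/app" channel="files">
    <int:poller fixed-rate="5000"/>
</int-file:inbound-channel-adapter>

<!-- copy each polled file into an HDFS staging area via an FsShell bean -->
<int:service-activator input-channel="files"
        expression="@fsShell.copyFromLocal(payload.absolutePath, '/data/staging')"/>
```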
• Use syslog adapter
• Transformer categorizes messages
• Route to specific channels based on category
• One route leads to an HDFS write; filtered data is stored in Redis
Ingesting Syslog into HDFS
• Syslog collection across multiple machines
• Use TCP Adapters to forward events– Or other middleware
Ingesting Multi-node Syslog into HDFS
• Use Spring Batch
– JdbcItemReader
– FlatFileItemWriter
Ingesting JDBC to HDFS
<step id="step1">
    <tasklet>
        <chunk reader="jdbcItemReader" processor="itemProcessor"
               writer="flatFileItemWriter" commit-interval="100" retry-limit="3"/>
    </tasklet>
</step>
• Use FsShell
• Include as step in Batch workflow
• Spring Batch can fire events when jobs end; SI can poll HDFS
Exporting HDFS to local Files
<step id="hdfsStep">
    <script-tasklet script-ref="hdfsCopy"/>
</step>

<hdp:script id="hdfsCopy" language="groovy">
    <hdp:property name="sourceDir" value="${sourceDir}"/>
    <hdp:property name="outputDir" value="${outputDir}"/>
    // use the shell (made available under variable fsh)
    fsh.copyToLocal(sourceDir, outputDir)
</hdp:script>
• Use Spring Batch
– MultiFileItemReader
– JdbcItemWriter
Exporting HDFS to JDBC
<step id="step1">
    <tasklet>
        <chunk reader="flatFileItemReader" processor="itemProcessor"
               writer="jdbcItemWriter" commit-interval="100" retry-limit="3"/>
    </tasklet>
</step>
• Use Spring Batch
– MultiFileItemReader
– MongoItemWriter
Exporting HDFS to Mongo
<step id="step1">
    <tasklet>
        <chunk reader="flatFileItemReader" processor="itemProcessor"
               writer="mongoItemWriter"/>
    </tasklet>
</step>
CEP – Style Data Pipeline
Diagram: an HTTP endpoint feeds a consumer that routes messages through transform and filter steps out to HDFS, Esper, Gemfire, and Greenplum Database (GPDB).
• Esper for CEP functionality
• Gemfire for Continuous Query as well as “data capacitor”-like functionality
• Greenplum Database as another ‘big data store’ for ingestion
Thank You!
• Prepping for GA – feedback welcome
• Project Page: springsource.org/spring-data/hadoop
• Source Code: github.com/SpringSource/spring-hadoop
• Books
Resources