Introduction to Hadoop Programming
Bryon Gill, Pittsburgh Supercomputing Center
TRANSCRIPT
What We Will Discuss
• Hadoop Architecture Overview
• Practical Examples
• “Classic” Map-Reduce
• Hadoop Streaming
• Spark
• HBase and Other Applications
Hadoop Overview
• Framework for Big Data
• Map/Reduce
• Platform for Big Data Applications
Map/Reduce
• Apply a Function to all the Data
• Harvest, Sort, and Process the Output
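The two steps above can be sketched in plain Python, with no Hadoop involved. This is an illustration of the programming model only: a map phase applies a function to every record, a shuffle groups the emitted pairs by key, and a reduce phase folds each group to a result.

```python
from collections import defaultdict
from functools import reduce

def map_phase(records, map_fn):
    """Apply map_fn to every record; each call yields (key, value) pairs."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group values by key (the 'harvest and sort' step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Fold each key's values into a single result."""
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}

# Word count as map/reduce: map emits (word, 1), reduce sums the ones.
lines = ["the quick brown fox", "the lazy dog"]
pairs = map_phase(lines, lambda line: ((w, 1) for w in line.split()))
counts = reduce_phase(shuffle(pairs), lambda a, b: a + b)
print(counts["the"])  # 2
```

Hadoop distributes exactly this pattern: the splits run map in parallel, and the framework performs the shuffle before reducers run.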
© 2010, 2014 Pittsburgh Supercomputing Center
Big Data
… Split n
Split 1
Split 3
Split 4
Split 2
… Output n
Output 1
Output 3
Output 4
Output 2
Reduce F(x)
Map F(x)
Result
Map/Reduce
HDFS
• Distributed Filesystem Layer
• WORM (Write Once, Read Many) Filesystem
– Optimized for Streaming Throughput
• Exports
• Replication
• Process Data in Place
HDFS Invocations: Getting Data In and Out
• hdfs dfs -ls
• hdfs dfs -put
• hdfs dfs -get
• hdfs dfs -rm
• hdfs dfs -mkdir
• hdfs dfs -rmdir
Writing Hadoop Programs
• Wordcount Example: WordCount.java
– Map Class
– Reduce Class
Exercise 1: Compiling
hadoop com.sun.tools.javac.Main WordCount.java
Exercise 1: Packaging
jar cf wc.jar WordCount*.class
Exercise 1: Submitting
• hadoop jar wc.jar WordCount \
  -D mapred.reduce.tasks=2 \
  /datasets/compleat.txt \
  output
Configuring your Job Submission
• Mappers and Reducers
• Java Options
• Other Parameters
Monitoring
• Important Web Interface Ports:
– dxchd01.psc.edu:8088 – YARN Resource Manager (Track Jobs)
– dxchd01.psc.edu:50070 – HDFS (Namenode)
– dxchd02.psc.edu:8042 – NodeManager (Slave Nodes)
– dxchd02.psc.edu:50075 – Datanode (Slave Nodes)
– dxchd01.psc.edu:8080 – Spark Master Web Interface
Hadoop Streaming
• Write Map/Reduce Jobs in Any Language
• Excellent for Fast Prototyping
Hadoop Streaming: Bash Example
• Bash wc and cat
• hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -input /datasets/plays/ \
  -output streaming-out \
  -mapper '/bin/cat' \
  -reducer '/usr/bin/wc -l'
Hadoop Streaming Python Example
• Wordcount in Python
• hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer.py \
  -input /datasets/plays/ \
  -output pyout
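The mapper.py and reducer.py scripts themselves are not reproduced in the slides; a plausible minimal pair for a streaming word count might look like the sketch below (both stages shown in one file for brevity; the actual course scripts may differ). Streaming feeds each stage lines on stdin and expects tab-separated key/value lines on stdout, with reducer input arriving sorted by key.

```python
#!/usr/bin/env python
# Hypothetical mapper/reducer pair for Hadoop Streaming word count.
import sys

def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts over runs of identical keys (input is key-sorted)."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as: mapper.py map   or   reducer.py reduce
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

The reducer can assume sorted input because Hadoop's shuffle sorts by key before the reduce stage, which is what lets it emit a total each time the key changes.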
Spark
• Alternate Programming Framework Using HDFS
• Optimized for In-Memory Computation
• Well Supported in Java, Python, Scala
Spark Resilient Distributed Dataset (RDD)
“Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.”
(Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et al.)
Spark Resilient Distributed Dataset (RDD)
• RDD for Short
• Persistence-Enabled Data Collections
• Transformations
• Actions
• Flexible Implementation: Memory vs. Hybrid vs. Disk
Selected RDD Transformations
• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, fraction, seed)
• distinct([numTasks])
• union(otherDataset)
• join(otherDataset)
• cogroup(otherDataset, [numTasks])
• cartesian(otherDataset)
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• sortByKey([ascending], [numTasks])
• repartition(numPartitions)
• pipe(command, [envVars])
Selected RDD Actions
• reduce(func)
• count()
• collect()
• take(n)
• saveAsTextFile(path)
Actions Trigger Transformations!
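Plain Python generators can illustrate why actions trigger transformations; this is only an analogy for Spark's lazy evaluation (real RDDs add partitioning, lineage, and fault tolerance), but the control flow is the same: composing transformations does no work, and the action forces the whole pipeline to run.

```python
# Record which elements have actually been computed.
evaluated = []

def numbers():
    for n in range(5):
        evaluated.append(n)   # side effect marks real evaluation
        yield n

# "Transformations" (filter + map) compose lazily: nothing runs yet.
pipeline = (n * n for n in numbers() if n % 2 == 0)
assert evaluated == []        # no element has been computed

# An "action" (here, sum) pulls data through the whole pipeline.
result = sum(pipeline)        # squares of 0, 2, 4
assert result == 20
assert evaluated == [0, 1, 2, 3, 4]
```

This is also why, in Spark, a job with only transformations does nothing until a `count()`, `collect()`, or similar action is called.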
Spark example
• lettercount.py
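The lettercount.py script is not reproduced in the slides; its core logic, a letter-frequency count, can be sketched in plain Python, with the roughly equivalent RDD calls noted in a comment (the actual course script may differ):

```python
from collections import Counter

def letter_count(lines):
    """Count occurrences of each character across all lines.

    In Spark this would be roughly:
        sc.textFile(path)
          .flatMap(lambda line: list(line))
          .map(lambda c: (c, 1))
          .reduceByKey(lambda a, b: a + b)
    """
    counts = Counter()
    for line in lines:
        counts.update(line)   # Counter.update counts each character
    return counts

counts = letter_count(["to be", "or not to be"])
print(counts["o"])  # 4
```

In the Spark version, `flatMap` plays the role of splitting lines into characters and `reduceByKey` plays the role of `Counter`'s summation, distributed across partitions.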
Spark Machine Learning Library
• Clustering (K-Means)
• Many others; see the full list at http://spark.apache.org/docs/1.0.1/mllib-guide.html
K-means
• Randomly seed the cluster starting points (centroids)
• Assign each point to its nearest centroid, then recompute each cluster's mean as the new centroid
• If the centroids changed, repeat
• If the centroids stayed the same, they have converged and we're done
• Awesome visualization:
http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
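The steps above can be written out as a minimal pure-Python k-means (an illustration of the algorithm, not of MLlib's implementation, which is distributed and more sophisticated):

```python
import math, random

def dist(a, b):
    """Euclidean distance between two points (tuples)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # 1. random seed centroids
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                           # 2. assign to nearest centroid
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        new = [mean(c) if c else centroids[i]      #    recompute each mean
               for i, c in enumerate(clusters)]
        if new == centroids:                       # 4. unchanged -> converged
            return centroids, clusters
        centroids = new                            # 3. changed -> iterate

points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.8)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the loop converges in a couple of iterations to one centroid near the origin and one near (9, 9), regardless of which points were sampled as seeds.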
Exercise 2: K-Means
• spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://dxchd01.psc.edu:/datasets/kmeans_data.txt 3
• spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://dxchd01.psc.edu:/datasets/archiver.txt 2
NoSQL Databases
• “Not Only SQL”
• Similar Interface to RDBMS
• Trade Consistency Guarantees for Horizontal Scaling
NoSQL Overview
• HBase
• Hive
• Cassandra
• Others
HBase
• Modeled after Google’s Bigtable
• Linear Scalability
• Automated Failovers
• Hadoop (and Spark) Integration
• Store Files -> HDFS -> FS
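The Bigtable-style data model behind HBase, sparse rows mapping `family:qualifier` column keys to values, can be sketched as nested dictionaries. This is purely illustrative (HBase additionally versions every cell by timestamp and distributes rows across region servers):

```python
# Sparse row-oriented store: table[row][column_family:qualifier] = value.
table = {}

def put(row, column, value):
    """Insert or overwrite a single cell, like the shell's 'put'."""
    table.setdefault(row, {})[column] = value

def get(row, column):
    """Fetch one cell, or None if the row/column is absent."""
    return table.get(row, {}).get(column)

def scan():
    """Yield (row, column, value) in row-key order, like 'scan'."""
    for row in sorted(table):
        for column, value in sorted(table[row].items()):
            yield row, column, value

put("row01", "family:col01", "value1")
put("row02", "family:col01", "value2")
assert get("row01", "family:col01") == "value1"
```

Because rows are just maps, different rows can hold entirely different columns at no storage cost, which is the "sparse" property that distinguishes this model from a fixed-schema RDBMS table.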
HBase Example
hbase shell
status
help
create '$userid-table', '$userid-family'
put '$userid-table', 'row01', '$userid-family:col01', 'value1'
scan '$userid-table'
disable '$userid-table'
drop '$userid-table'
Questions?
• Thanks!
References and Useful Links
• HDFS Shell Commands: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
• Apache Hadoop Official Releases: https://hadoop.apache.org/releases.html
• Apache Spark Documentation: http://spark.apache.org/docs/latest/
• Apache Spark MLlib Documentation: https://spark.apache.org/docs/latest/mllib-guide.html
• Apache HBase: http://hbase.apache.org/