Introduction to Hadoop Programming
Bryon Gill, Pittsburgh Supercomputing Center
TRANSCRIPT
What We Will Discuss
• Hadoop Architecture Overview
• Practical Examples
• “Classic” Map-Reduce
• Hadoop Streaming
• Spark
• HBase and Other Applications
Hadoop Overview
• Framework for Big Data
• Map/Reduce
• Platform for Big Data Applications
Map/Reduce
• Apply a Function to all the Data
• Harvest, Sort, and Process the Output
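The two steps above can be sketched in plain Python, with no Hadoop involved. This is an illustration of the programming model only: a map phase applies a function to every record, a shuffle groups the emitted pairs by key, and a reduce phase folds each group to a result.

```python
from collections import defaultdict
from functools import reduce

def map_phase(records, map_fn):
    """Apply map_fn to every record; each call yields (key, value) pairs."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group values by key (the 'harvest and sort' step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Fold each key's values into a single result."""
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}

# Word count as map/reduce: map emits (word, 1), reduce sums the ones.
lines = ["the quick brown fox", "the lazy dog"]
pairs = map_phase(lines, lambda line: ((w, 1) for w in line.split()))
counts = reduce_phase(shuffle(pairs), lambda a, b: a + b)
print(counts["the"])  # 2
```

Hadoop distributes exactly this pattern: the splits run map in parallel, and the framework performs the shuffle before reducers run.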
© 2010, 2014 Pittsburgh Supercomputing Center
Big Data
… Split n
Split 1
Split 3
Split 4
Split 2
… Output n
Output 1
Output 3
Output 4
Output 2
Reduce F(x)
Map F(x)
Result
Map/Reduce
HDFS
• Distributed Filesystem Layer
• WORM (Write Once, Read Many) Filesystem
– Optimized for Streaming Throughput
• Exports
• Replication
• Process Data in Place
HDFS Invocations: Getting Data In and Out
• hdfs dfs -ls
• hdfs dfs -put
• hdfs dfs -get
• hdfs dfs -rm
• hdfs dfs -mkdir
• hdfs dfs -rmdir
Writing Hadoop Programs
• Wordcount Example: WordCount.java
– Map Class
– Reduce Class
Exercise 1: Compiling
hadoop com.sun.tools.javac.Main WordCount.java
Exercise 1: Packaging
jar cf wc.jar WordCount*.class
Exercise 1: Submitting
• hadoop jar wc.jar WordCount \
  -D mapred.reduce.tasks=2 \
  /datasets/compleat.txt \
  output
Configuring your Job Submission
• Mappers and Reducers
• Java Options
• Other Parameters
Monitoring
• Important Web Interface Ports:
– dxchd01.psc.edu:8088 – YARN Resource Manager (Track Jobs)
– dxchd01.psc.edu:50070 – HDFS (Namenode)
– dxchd02.psc.edu:8042 – NodeManager (Slave Nodes)
– dxchd02.psc.edu:50075 – Datanode (Slave Nodes)
– dxchd01.psc.edu:8080 – Spark Master Web Interface
Hadoop Streaming
• Write Map/Reduce Jobs in Any Language
• Excellent for Fast Prototyping
Hadoop Streaming: Bash Example
• Bash wc and cat
• hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -input /datasets/plays/ \
  -output streaming-out \
  -mapper '/bin/cat' \
  -reducer '/usr/bin/wc -l'
Hadoop Streaming Python Example
• Wordcount in Python
• hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer.py \
  -input /datasets/plays/ \
  -output pyout
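The mapper.py and reducer.py scripts themselves are not reproduced in the slides; a plausible minimal pair for a streaming word count might look like the sketch below (both stages shown in one file for brevity; the actual course scripts may differ). Streaming feeds each stage lines on stdin and expects tab-separated key/value lines on stdout, with reducer input arriving sorted by key.

```python
#!/usr/bin/env python
# Hypothetical mapper/reducer pair for Hadoop Streaming word count.
import sys

def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts over runs of identical keys (input is key-sorted)."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as: mapper.py map   or   reducer.py reduce
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

The reducer can assume sorted input because Hadoop's shuffle sorts by key before the reduce stage, which is what lets it emit a total each time the key changes.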
Spark
• Alternate Programming Framework Using HDFS
• Optimized for In-Memory Computation
• Well Supported in Java, Python, Scala
Spark Resilient Distributed Dataset (RDD)
“Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.”
(Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et al.)
Spark Resilient Distributed Dataset (RDD)
• RDD for Short
• Persistence-Enabled Data Collections
• Transformations
• Actions
• Flexible Implementation: Memory vs. Hybrid vs. Disk
Selected RDD Transformations
• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, fraction, seed)
• distinct([numTasks])
• union(otherDataset)
• join(otherDataset)
• cogroup(otherDataset, [numTasks])
• cartesian(otherDataset)
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• sortByKey([ascending], [numTasks])
• repartition(numPartitions)
• pipe(command, [envVars])
Selected RDD Actions
• reduce(func)
• count()
• collect()
• take(n)
• saveAsTextFile(path)
Actions Trigger Transformations!
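Plain Python generators can illustrate why actions trigger transformations; this is only an analogy for Spark's lazy evaluation (real RDDs add partitioning, lineage, and fault tolerance), but the control flow is the same: composing transformations does no work, and the action forces the whole pipeline to run.

```python
# Record which elements have actually been computed.
evaluated = []

def numbers():
    for n in range(5):
        evaluated.append(n)   # side effect marks real evaluation
        yield n

# "Transformations" (filter + map) compose lazily: nothing runs yet.
pipeline = (n * n for n in numbers() if n % 2 == 0)
assert evaluated == []        # no element has been computed

# An "action" (here, sum) pulls data through the whole pipeline.
result = sum(pipeline)        # squares of 0, 2, 4
assert result == 20
assert evaluated == [0, 1, 2, 3, 4]
```

This is also why, in Spark, a job with only transformations does nothing until a `count()`, `collect()`, or similar action is called.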
Spark example
• lettercount.py
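The lettercount.py script is not reproduced in the slides; its core logic, a letter-frequency count, can be sketched in plain Python, with the roughly equivalent RDD calls noted in a comment (the actual course script may differ):

```python
from collections import Counter

def letter_count(lines):
    """Count occurrences of each character across all lines.

    In Spark this would be roughly:
        sc.textFile(path)
          .flatMap(lambda line: list(line))
          .map(lambda c: (c, 1))
          .reduceByKey(lambda a, b: a + b)
    """
    counts = Counter()
    for line in lines:
        counts.update(line)   # Counter.update counts each character
    return counts

counts = letter_count(["to be", "or not to be"])
print(counts["o"])  # 4
```

In the Spark version, `flatMap` plays the role of splitting lines into characters and `reduceByKey` plays the role of `Counter`'s summation, distributed across partitions.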
Spark Machine Learning Library
• Clustering (K-Means)
• Many others; see the full list at http://spark.apache.org/docs/1.0.1/mllib-guide.html
K-means
• Randomly seed the cluster starting points (centroids)
• Assign each point to its nearest centroid, then recompute each cluster's mean as the new centroid
• If the centroids changed, repeat
• If the centroids stayed the same, they have converged and we're done
• Awesome visualization:
http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
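The steps above can be written out as a minimal pure-Python k-means (an illustration of the algorithm, not of MLlib's implementation, which is distributed and more sophisticated):

```python
import math, random

def dist(a, b):
    """Euclidean distance between two points (tuples)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # 1. random seed centroids
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                           # 2. assign to nearest centroid
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        new = [mean(c) if c else centroids[i]      #    recompute each mean
               for i, c in enumerate(clusters)]
        if new == centroids:                       # 4. unchanged -> converged
            return centroids, clusters
        centroids = new                            # 3. changed -> iterate

points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.8)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the loop converges in a couple of iterations to one centroid near the origin and one near (9, 9), regardless of which points were sampled as seeds.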
Exercise 2: K-Means
• spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://dxchd01.psc.edu:/datasets/kmeans_data.txt 3
• spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://dxchd01.psc.edu:/datasets/archiver.txt 2
NoSQL Databases
• “Not Only SQL”
• Similar Interface to RDBMS
• Trade Consistency Guarantees for Horizontal Scaling
NoSQL Overview
• HBase
• Hive
• Cassandra
• Others
HBase
• Modeled after Google’s Bigtable
• Linear Scalability
• Automated Failovers
• Hadoop (and Spark) Integration
• Store Files -> HDFS -> FS
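The Bigtable-style data model behind HBase, sparse rows mapping `family:qualifier` column keys to values, can be sketched as nested dictionaries. This is purely illustrative (HBase additionally versions every cell by timestamp and distributes rows across region servers):

```python
# Sparse row-oriented store: table[row][column_family:qualifier] = value.
table = {}

def put(row, column, value):
    """Insert or overwrite a single cell, like the shell's 'put'."""
    table.setdefault(row, {})[column] = value

def get(row, column):
    """Fetch one cell, or None if the row/column is absent."""
    return table.get(row, {}).get(column)

def scan():
    """Yield (row, column, value) in row-key order, like 'scan'."""
    for row in sorted(table):
        for column, value in sorted(table[row].items()):
            yield row, column, value

put("row01", "family:col01", "value1")
put("row02", "family:col01", "value2")
assert get("row01", "family:col01") == "value1"
```

Because rows are just maps, different rows can hold entirely different columns at no storage cost, which is the "sparse" property that distinguishes this model from a fixed-schema RDBMS table.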
HBase Example
hbase shell
status
help
create '$userid-table', '$userid-family'
put '$userid-table', 'row01', '$userid-family:col01', 'value1'
scan '$userid-table'
disable '$userid-table'
drop '$userid-table'
Questions?
• Thanks!
References and Useful Links
• HDFS Shell Commands: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
• Apache Hadoop Official Releases: https://hadoop.apache.org/releases.html
• Apache Spark Documentation: http://spark.apache.org/docs/latest/
• Apache Spark MLlib Documentation: https://spark.apache.org/docs/latest/mllib-guide.html
• Apache HBase: http://hbase.apache.org/