A Hadoop Overview
Outline
- Progress Report
- MapReduce Programming
- Hadoop Cluster Overview
- HBase Overview
- Q & A
Progress
- Hadoop setup has been completed.
  - Version 0.19.0, running in standalone mode.
- HBase setup has been completed.
  - Version 0.19.3, running without HDFS.
- Simple demonstration of MapReduce.
  - A simple word count program.
Testing Platform
- Fedora 10
- JDK 1.6.0_18
- Hadoop 0.19.0
- HBase 0.19.3
- One can connect to the machine using PieTTY or PuTTY.
  - Host: 140.112.90.180
  - Account: labuser
  - Password: robot3233
  - Port: 3385 (using an SSH connection)
MapReduce
- A computing framework consisting of a map phase, a shuffle phase, and a reduce phase.
- The map function and the reduce function are provided by the user.
- Key-value pair (KVP)
  - map is invoked once for each input KVP and may output any number of KVPs.
  - reduce is invoked once for each key together with its corresponding values and may output any number of KVPs.
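The three phases can be sketched in plain Java without any Hadoop classes. This is only an illustrative model of a word count, not how Hadoop implements the phases; the class name MapReducePhasesSketch, the Kvp holder, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the map, shuffle, and reduce phases (no Hadoop classes involved).
class MapReducePhasesSketch {

    // A key-value pair (KVP); this small holder class is hypothetical.
    static final class Kvp {
        final String key;
        final int value;

        Kvp(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // map: invoked once per input KVP (line number, line text); emits (word, 1) for each word.
    static List<Kvp> map(int lineNumber, String line) {
        List<Kvp> out = new ArrayList<Kvp>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new Kvp(word, 1));
            }
        }
        return out;
    }

    // shuffle: groups the map output values by key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Kvp> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Kvp kvp : mapOutput) {
            List<Integer> values = grouped.get(kvp.key);
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(kvp.key, values);
            }
            values.add(kvp.value);
        }
        return grouped;
    }

    // reduce: invoked once per key with all of its values; emits (word, total count).
    static Kvp reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new Kvp(key, sum);
    }

    // Runs the three phases in sequence over the given input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Kvp> mapOutput = new ArrayList<Kvp>();
        for (int i = 0; i < lines.size(); i++) {
            mapOutput.addAll(map(i, lines.get(i)));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapOutput).entrySet()) {
            Kvp out = reduce(entry.getKey(), entry.getValue());
            result.put(out.key, out.value);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hbase=1, hello=2}
        System.out.println(wordCount(Arrays.asList("hello hadoop", "hello hbase")));
    }
}
```

In a real job only map and reduce are written by the user; the grouping step in the middle is what the framework's shuffle phase provides.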
What does the user have to do?
1. Specify the input/output format
2. Specify the output key/value type
3. Specify the input/output location
4. Specify the mapper/reducer class
5. Specify the number of reduce tasks
6. Specify the partitioner class (discussed later)
What does the user have to do?(cont.)
- Specify the input/output format
  - An input/output format is a class that translates between raw data and KVPs.
  - It has to implement the InputFormat/OutputFormat interface.
  - The input format is required.
    - The most common choices are the KeyValueTextInputFormat class and the SequenceFileInputFormat class.
  - The output format is optional; the default is the TextOutputFormat class.
What does the user have to do?(cont.)
- Specify the output key/value type
  - This is the KVP type output by the reducer.
  - The key type has to implement the WritableComparable interface.
  - The value type has to implement the Writable interface.
- Specify the input/output location
  - The directory for input files/output files.
  - The input directory should exist and contain at least one file.
  - The output directory should not already exist.
What does the user have to do?(cont.)
- Specify the mapper/reducer class
  - The two classes should extend the MapReduceBase class.
  - The mapper/reducer class should implement the Mapper<K1, V1, K2, V2>/Reducer<K1, V1, K2, V2> interface.
- Specify the number of reduce tasks
  - Usually approximately the number of computing nodes.
  - 1 if we want a single output file.
  - 0 if we don't need the reduce phase.
    - Note that the result will not be sorted in this case.
    - The reducer class is not required in this case.
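To see why the number of reduce tasks determines how the output is split, here is a minimal plain-Java sketch of hash-based partitioning. It assumes the scheme used by Hadoop's default HashPartitioner (key hash, sign bit masked off, modulo the number of reduce tasks); the class ReducePartitionSketch and its methods are hypothetical and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of how map output keys are spread over reduce tasks.
class ReducePartitionSketch {

    // Reduce task index for a key: mask off the sign bit, then take the
    // remainder, so the result is always in [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by the reduce task (and hence the output file) that receives them.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byTask = new TreeMap<Integer, List<String>>();
        for (String key : keys) {
            int task = partitionFor(key, numReduceTasks);
            List<String> bucket = byTask.get(task);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                byTask.put(task, bucket);
            }
            bucket.add(key);
        }
        return byTask;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hadoop", "hbase", "hdfs", "mapreduce");
        // With one reduce task every key lands in task 0, hence a single output file.
        System.out.println("1 reduce task:  " + assign(keys, 1));
        // With more reduce tasks the keys are spread over several tasks.
        System.out.println("3 reduce tasks: " + assign(keys, 3));
    }
}
```

All values for the same key always reach the same reduce task, which is what allows reduce to see every value of a key at once.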
Map Phase Configuration

| Element | Required? | Default |
| Input path(s) | Yes | |
| Class to convert the input path elements to KVPs | Yes | |
| Map output key class | No | Job output key class |
| Map output value class | No | Job output value class |
| Class supplying the map function | Yes | |
| Suggested minimum number of map tasks | No | Cluster default |
| Number of threads to run each map task | No | 1 |
Reduce Phase Configuration

| Element | Required? | Default |
| Output path | Yes | |
| Class to convert the KVPs to output files | No | TextOutputFormat |
| Job input key class | No | Job output key class |
| Job input value class | No | Job output value class |
| Job output key class | Yes | |
| Job output value class | Yes | |
| Class supplying the reduce function | Yes | |
| The number of reduce tasks | No | Cluster default |
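MapReduceIntro.java

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.log4j.Logger;

    // MapReduceIntroConfig is a helper class that is not shown on this slide.
    public class MapReduceIntro {
        protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

        public static void main(final String[] args) {
            try {
                // Initial configuration
                final JobConf conf = new JobConf(MapReduceIntro.class);
                conf.set("hadoop.tmp.dir", "/tmp");

                // Map phase configuration
                conf.setInputFormat(KeyValueTextInputFormat.class);
                FileInputFormat.setInputPaths(conf, MapReduceIntroConfig.getInputDirectory());
                conf.setMapperClass(IdentityMapper.class);

                // Reduce phase configuration
                FileOutputFormat.setOutputPath(conf, MapReduceIntroConfig.getOutputDirectory());
                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(Text.class);
                conf.setNumReduceTasks(1);
                conf.setReducerClass(IdentityReducer.class);

                // Job running
                final RunningJob job = JobClient.runJob(conf);
                if (!job.isSuccessful()) {
                    logger.error("The job failed.");
                    System.exit(1);
                }
                System.exit(0);
            } catch (final IOException e) {
                // The slide does not show the handler closing the try block;
                // this minimal one is an assumption.
                logger.error("The job failed with an IOException.", e);
                System.exit(1);
            }
        }
    }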
IdentityMapper.java

    import java.io.IOException;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper<K, V, K, V>: the first two type parameters are the input type,
    // the last two are the output type.
    public class IdentityMapper<K, V> extends MapReduceBase
            implements Mapper<K, V, K, V> {

        // The Reporter parameter is discussed later.
        public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter)
                throws IOException {
            // Collect the output KVPs.
            output.collect(key, val);
        }
    }
IdentityReducer.java

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class IdentityReducer<K, V> extends MapReduceBase
            implements Reducer<K, V, K, V> {

        // Note: the input value is an Iterator<V>!
        public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output,
                Reporter reporter) throws IOException {
            while (values.hasNext()) {
                output.collect(key, values.next());
            }
        }
    }
Compiling
- Use the default Java compiler (javac).
  - Note that we have to supply the -classpath parameter so that the compiler can find the Hadoop core libraries and the other classes needed.

        $ javac -classpath $HADOOP_HOME/hadoop-0.19.0-core.jar:. -d . Myclass.java

  - $HADOOP_HOME/hadoop-0.19.0-core.jar: the Hadoop core libraries.
  - . (the current directory): the location of other class files.
Creating jar file
To create an executable jar file:
1. Create a file "manifest.mf":

        Main-Class: myclass
        Class-Path: MyExample.jar

   - Main-Class names the driver class.
   - A white space is required after each colon.
   - Class-Path is a white-space-separated list.
   - The last line must end with a carriage return (newline).
2. Type the command:

        $ jar -cfm MyExample.jar manifest.mf <list of classes>

   - The jar file and the manifest file are given in the same order as the f and m flags.
   - The wildcard character * is also accepted.
Run the jar file
- Use the hadoop command:

        $ hadoop jar MyExample.jar <param list>

- Remember that the output path should not exist.
  - If the path exists, remove it with the rm -r path command.
A simple demonstration
A simple word count program.
Reporter
Hadoop
- Full name: the Apache Hadoop project.
- An open-source implementation for reliable, scalable distributed computing.
- An aggregation of the following projects (and their core):
  - Avro
  - Chukwa
  - HBase
  - HDFS
  - Hive
  - MapReduce
  - Pig
  - ZooKeeper
Virtual Machine (VM)
- Virtualization
  - All services are delivered through VMs.
  - Allows for dynamic configuration and management.
  - Multiple VMs can run on a single commodity machine.
- VMware
HDFS (Hadoop Distributed File System)
- The highly scalable distributed file system of Hadoop.
  - Resembles the Google File System (GFS).
  - Provides reliability through replication.
- NameNode & DataNode
  - NameNode
    - Maintains file system metadata and namespace.
    - Provides management and control services.
    - Usually one instance.
  - DataNode
    - Provides data storage and retrieval services.
    - Usually several instances.
MapReduce
- The sophisticated distributed computing service of Hadoop.
  - A computation framework.
  - Usually runs on top of HDFS.
- JobTracker & TaskTracker
  - JobTracker
    - Manages the distribution of tasks to the TaskTrackers.
    - Provides job monitoring and control, and the submission of jobs.
  - TaskTracker
    - Manages single map or reduce tasks on a compute node.
Cluster Makeup
- A Hadoop cluster is usually made up of:
  - Real machines.
    - Not required to be homogeneous.
    - Homogeneity helps maintainability.
  - Server processes.
    - Multiple processes can run on a single VM.
- Master & Slave
  - The node/machine running the JobTracker or NameNode is a master node.
  - The ones running the TaskTracker or DataNode are slave nodes.
Administrator Scripts
- Administrators can use the following script files to start or stop server processes.
  - They are located in $HADOOP_HOME/bin:
    - start-all.sh / stop-all.sh
    - start-mapred.sh / stop-mapred.sh
    - start-dfs.sh / stop-dfs.sh
    - slaves.sh
    - hadoop
Configuration
- By default, each Hadoop Core server loads its configuration from several files.
  - These files are located in $HADOOP_HOME/conf.
  - Usually identical copies of those files are maintained on every machine in the cluster.
HBase
- The Hadoop scalable distributed database.
  - Resembles Google BigTable.
  - Not a relational database.
  - Resides in HDFS.
- Master & RegionServer
  - Master
    - Handles bootstrapping and RegionServer recovery.
    - Assigns regions to RegionServers.
  - RegionServer
    - Holds zero or more regions.
    - Responsible for data transactions.
Row, Column, Timestamp
- The data cell is the intersection of an individual row key and a column.
- Cells store uninterpreted arrays of bytes.
- Cell data is versioned by timestamp.
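The row/column/timestamp addressing can be modelled with nested sorted maps. The sketch below is a plain-Java illustration only, not the HBase client API; the class VersionedCellSketch and its methods are hypothetical, and row keys and column names are simplified to strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of cells addressed by (row key, column) and versioned by timestamp.
class VersionedCellSketch {

    // row key -> column name -> (timestamp -> uninterpreted bytes)
    private final Map<String, Map<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<String, Map<String, NavigableMap<Long, byte[]>>>();

    // Stores one version of a cell; a version with the same timestamp is replaced.
    void put(String row, String column, long timestamp, byte[] value) {
        Map<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) {
            columns = new TreeMap<String, NavigableMap<Long, byte[]>>();
            rows.put(row, columns);
        }
        NavigableMap<Long, byte[]> versions = columns.get(column);
        if (versions == null) {
            versions = new TreeMap<Long, byte[]>();
            columns.put(column, versions);
        }
        versions.put(timestamp, value);
    }

    // Returns the newest version of the cell, or null if the cell does not exist.
    byte[] getLatest(String row, String column) {
        NavigableMap<Long, byte[]> versions = versionsOf(row, column);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    // Returns the newest version written at or before the given timestamp, or null.
    byte[] getAsOf(String row, String column, long timestamp) {
        NavigableMap<Long, byte[]> versions = versionsOf(row, column);
        if (versions == null) {
            return null;
        }
        Map.Entry<Long, byte[]> entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    private NavigableMap<Long, byte[]> versionsOf(String row, String column) {
        Map<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    public static void main(String[] args) {
        VersionedCellSketch table = new VersionedCellSketch();
        table.put("row1", "temperature:air", 1L, "20".getBytes(StandardCharsets.UTF_8));
        table.put("row1", "temperature:air", 5L, "25".getBytes(StandardCharsets.UTF_8));
        // Newest version: prints 25
        System.out.println(new String(table.getLatest("row1", "temperature:air"), StandardCharsets.UTF_8));
        // Version visible at timestamp 3: prints 20
        System.out.println(new String(table.getAsOf("row1", "temperature:air", 3L), StandardCharsets.UTF_8));
    }
}
```

The values are kept as byte arrays on purpose: the store does not interpret them, so encoding and decoding is left to the caller.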
Row
- The row key is the primary key of the table.
  - It can consist of an arbitrary byte array.
    - Strings, binary data.
  - Each row key has to be unique.
- The table is sorted by row key.
- Any mutation of a single row is atomic.
Column/Column Family
- Columns are grouped into families, whose members share a common prefix.
  - Ex: temperature:air and temperature:dew_point
  - The prefix has to be a printable string.
  - The column name can be an arbitrary byte array.
- Column family members can be dynamically added or dropped.
- Column families must be pre-specified as part of the table schema.
- HBase is in fact a column-family-oriented store.
  - Members of the same column family are stored together in the file system.
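The family rules above can be illustrated with a small plain-Java model of a single row: families are fixed when the table is created, while members of a family can be added freely. This is a sketch under simplifying assumptions (string names and values, one row); ColumnFamilySketch and its methods are hypothetical and not part of the HBase API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Model of one row whose columns are grouped by a pre-declared column family.
class ColumnFamilySketch {

    // Families are fixed at construction, like families declared in a table schema.
    private final Set<String> families;

    // family -> (full column name -> value); members of a family are kept together.
    private final Map<String, Map<String, String>> store =
            new TreeMap<String, Map<String, String>>();

    ColumnFamilySketch(String... declaredFamilies) {
        this.families = new HashSet<String>(Arrays.asList(declaredFamilies));
    }

    // Extracts the family prefix from a column name of the form family:name.
    static String familyOf(String column) {
        int colon = column.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Expected family:name but got " + column);
        }
        return column.substring(0, colon);
    }

    // Adds or overwrites a column; new family members need no schema change.
    void put(String column, String value) {
        String family = familyOf(column);
        if (!families.contains(family)) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        Map<String, String> members = store.get(family);
        if (members == null) {
            members = new TreeMap<String, String>();
            store.put(family, members);
        }
        members.put(column, value);
    }

    // Returns the column names currently stored under the given family.
    Set<String> columnsIn(String family) {
        Map<String, String> members = store.get(family);
        return members == null ? Collections.<String>emptySet() : members.keySet();
    }

    public static void main(String[] args) {
        ColumnFamilySketch row = new ColumnFamilySketch("temperature");
        row.put("temperature:air", "20");
        row.put("temperature:dew_point", "12");
        // prints [temperature:air, temperature:dew_point]
        System.out.println(row.columnsIn("temperature"));
    }
}
```

Writing to a family that was not declared, for example humidity:relative here, is rejected, which mirrors the requirement that families be part of the schema.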
Region
- The table is automatically partitioned horizontally into regions.
  - That is, a region is a subset of data rows.
  - Regions are stored on separate RegionServers.
  - A region is defined by its first row, last row, and a randomly generated identifier.
- The partitioning is completed by the master automatically.
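Because the table is sorted by row key and each region covers a contiguous key range, finding the region for a row reduces to a lookup in a sorted map of region start keys. The following is a minimal plain-Java sketch of that idea; RegionLookupSketch, the region names, and the use of strings instead of byte arrays for row keys are all illustrative assumptions, not HBase internals.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Model of horizontal partitioning: each region covers a contiguous range of row keys.
class RegionLookupSketch {

    // start row key -> region name; a region covers the keys from its start key
    // up to (but not including) the start key of the next region.
    private final NavigableMap<String, String> regionsByStartKey =
            new TreeMap<String, String>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // Finds the region whose start key is the greatest one not after the row key.
    String regionFor(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("No region covers row key " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "region-1");   // the empty start key covers the beginning of the table
        table.addRegion("m", "region-2");  // rows from "m" onwards
        System.out.println(table.regionFor("apple")); // prints region-1
        System.out.println(table.regionFor("zebra")); // prints region-2
    }
}
```

Splitting a region in this model would simply add a new start key between two existing ones; the rows themselves would not need to be renumbered.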
Administrator Scripts
- Administrators can use the following script files to start or stop server processes.
  - They are located in $HBASE_INSTALL/bin:
    - start-hbase.sh / stop-hbase.sh
    - hbase
      - hbase shell to start a command-line interface.
      - hbase master / hbase regionserver
HBase shell command line
- Type the command help to get information.
- create 'table', 'column family1', 'column family2', …
- put 'table', 'row', 'column', 'value'
- get 'table', 'row', {COLUMN=>…}
- alter 'table', {NAME=>'...'}
  - To modify a table schema, we have to disable the table first!
- scan 'table'
- disable 'table'
- drop 'table'
  - To drop a table, we have to disable it first!
- list
A Simple Demonstration
Command line operation
Operations
- Create table (and its schema)
  - Shell

        create 'table', 'cf1', 'cf2', …
        create 'table', {NAME=>'cf1'}, {NAME=>'cf2'}, …

  - API

        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        HTableDescriptor table = new HTableDescriptor("table");
        table.addFamily(new HColumnDescriptor("cf1:"));
        table.addFamily(new HColumnDescriptor("cf2:"));
        admin.createTable(table);
Operations(cont.)
- Modify table (and its schema)
  - Shell

        alter 'table', {NAME=>'cf', KEY=>'value', …}

  - API

        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        admin.modifyColumn("table", "cf", new HColumnDescriptor(…));
        admin.modifyTable(new HTableDescriptor(…));

  - Note that exceptions will be thrown if the table is not disabled.
Operations(cont.)
- Write data
  - Shell

        put 'table', 'row', 'cf:name', 'value', ts

  - API

        HTable table = new HTable("table");
        BatchUpdate update = new BatchUpdate("row");
        // The cell value is passed as a byte array.
        update.put("cf:name", Bytes.toBytes("value"));
        table.commit(update);
Operations(cont.)
- Retrieve data
  - Shell

        get 'table', 'row', {COLUMN=>'cf:name', …}

  - API

        HTable table = new HTable("table");
        RowResult row = table.getRow("row");
        Cell data = table.get("row", "cf:name");

  - If we don't know the row to retrieve in advance, we can use a Scanner object instead.

        Scanner scanner = table.getScanner("cf:name");
Operations(cont.)
- Delete a cell
  - Shell

        delete 'table', 'row', 'cf:name'

  - API

        HTable table = new HTable("table");
        BatchUpdate update = new BatchUpdate("row");
        update.delete("cf:name");
        table.commit(update);
Operations(cont.)
- Enable/Disable a table
  - Shell

        enable 'table'
        disable 'table'

  - API

        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        admin.disableTable("table");
        admin.enableTable("table");
Q & A
- Hadoop 0.19.0 API: http://hadoop.apache.org/common/docs/r0.19.0/api/index.html
- HBase 0.19.3 API: http://hadoop.apache.org/hbase/docs/r0.19.3/api/index.html
- Any questions?