Hadoop
TRANSCRIPT
![Page 1: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/1.jpg)
BIG DATA @nuboat
Peerapat Asoktummarungsri
(c) copyright 2013 nuboat in wonderland
Thursday, April 11, 13
![Page 2: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/2.jpg)
Who am I?
Software Engineer @ FICO
Blogger @ Thailand Java User Group
Failed Startuper
Cat Lover
A little troll on Twitter
![Page 3: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/3.jpg)
What is Big Data?
Big data analytics is concerned with the analysis of large volumes of transaction/event data and behavioral analysis of human/human and human/system interactions. (Gartner)
Big data represents the collection of technologies that handle large data volumes well beyond that inflection point and for which, at least in theory, hardware resources required to manage data by volume track to an almost straight line rather than a curve. (IDC)
![Page 4: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/4.jpg)
Structured & Unstructured

| Level | Example |
| --- | --- |
| Structured | RDBMS |
| Semi-structured | XML, JSON |
| Quasi-structured | Text Document |
| Unstructured | Image, Video |
![Page 5: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/5.jpg)
Unstructured data around the world in 1 min. Source: go-globe.com
![Page 6: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/6.jpg)
The challenges associated with Big Data (diagram omitted). Source: oracle.com
![Page 7: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/7.jpg)
Use the right tool:

RDBMS
- Interactive reporting (<1s)
- Multistep transactions
- Lots of Insert/Update/Delete

HADOOP
- Affordable storage/compute
- Structured or not (agility)
- Resilient, auto scalability
![Page 8: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/8.jpg)
The Hadoop ecosystem (diagram): HDFS, Map Reduce, YARN and Common at the core, with Hive, HBase, ZooKeeper, Flume, Sqoop and Pig on top, plus related projects such as Cassandra, Mahout, etc.
![Page 9: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/9.jpg)
HDFS: Hadoop Distributed File System

Diagram: blocks 1-5 of a file are distributed across data nodes A-E with Replication Factor = 3, so every block is stored on three different nodes (the per-node block sets shown are 1-2-5, 1-4-3, 1-5-3, 2-5-4, and 4-2-3).
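To make the replication idea concrete, here is a toy, framework-free sketch (this is not HDFS code; the class name and the simple round-robin placement are invented for illustration) that assigns each of five blocks to three distinct nodes:

```java
import java.util.*;

public class ReplicationDemo {
    // Toy placement: give each block `rf` replicas on distinct nodes,
    // walking round-robin over the node list.
    static Map<Integer, List<String>> place(int blocks, List<String> nodes, int rf) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        int next = 0;
        for (int b = 1; b <= blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < rf; r++) {
                replicas.add(nodes.get(next % nodes.size()));
                next++;
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Five blocks over nodes A-E with Replication Factor = 3,
        // mirroring the diagram above.
        place(5, List.of("A", "B", "C", "D", "E"), 3)
            .forEach((b, ns) -> System.out.println("block " + b + " -> " + ns));
    }
}
```

Real HDFS is smarter than this (it is rack-aware and balances by free space), but the invariant is the same: each block ends up on `rf` different machines.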
![Page 10: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/10.jpg)
HDFS: Hadoop Distributed File System

Diagram: the same cluster after one data node is lost; the surviving nodes still hold a copy of every block, and HDFS re-replicates the missing copies to restore the replication factor.
![Page 11: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/11.jpg)
Architecture diagram: a Client connects to the two masters, the Name Node (HDFS) and the Job Tracker (Map Reduce), which coordinate a cluster of Data Node slaves. Hadoop runs in master/slave mode only.
![Page 12: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/12.jpg)
Install & Config
1. Install JDK
2. Add a Hadoop system user
3. Config SSH & passwordless login
4. Disable IPv6
5. Install Hadoop (chown, .sh, conf *.xml)
6. Format HDFS
7. Start Hadoop
8. Hadoop Web Console
9. Stop Hadoop
![Page 13: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/13.jpg)
1. Install Java

[root@localhost ~]# ./jdk-6u39-linux-i586.bin
[root@localhost ~]# chown -R root:root ./jdk1.6.0_39/
[root@localhost ~]# mv jdk1.6.0_39/ /usr/lib/jvm/
[root@localhost ~]# update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.6.0_39/bin/java" 1
[root@localhost ~]# update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.6.0_39/bin/javac" 1
[root@localhost ~]# update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.6.0_39/bin/javaws" 1
[root@localhost ~]# update-alternatives --config java

There are 3 programs which provide 'java'.

  Selection    Command
  -----------------------------------------------
*  1           /usr/lib/jvm/jre-1.6.0-openjdk/bin/java
   2           /usr/lib/jvm/jre-1.5.0-gcj/bin/java
+  3           /usr/lib/jvm/jdk1.6.0_39/bin/java

Enter to keep the current selection[+], or type selection number: 3

[root@localhost ~]# java -version
java version "1.6.0_39"
Java(TM) SE Runtime Environment (build 1.6.0_39-b04)
Java HotSpot(TM) Server VM (build 20.14-b01, mixed mode)
![Page 14: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/14.jpg)
2. Create hadoop user

[root@localhost ~]# groupadd hadoop
[root@localhost ~]# useradd -g hadoop -d /home/hdadmin hdadmin
[root@localhost ~]# passwd hdadmin

3. Config SSH

[root@localhost ~]# service sshd start
...
[root@localhost ~]# chkconfig sshd on
[root@localhost ~]# su - hdadmin
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

4. Disable IPv6

[root@localhost ~]# vi /etc/sysctl.conf

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
![Page 15: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/15.jpg)
5. Set PATH

$ vi ~/.bashrc

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
  . /etc/bashrc
fi

# Do not set HADOOP_HOME
# User specific aliases and functions
export HIVE_HOME=/usr/local/hive-0.9.0
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
export PATH=$PATH:/usr/local/hadoop-0.20.205.0/bin:$JAVA_HOME/bin:$HIVE_HOME/bin
![Page 16: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/16.jpg)
6. Install Hadoop

[root@localhost ~]# tar xzf hadoop-0.20.205.0-bin.tar.gz
[root@localhost ~]# mv hadoop-0.20.205.0 /usr/local/hadoop-0.20.205.0
[root@localhost ~]# cd /usr/local
[root@localhost ~]# chown -R hdadmin:hadoop hadoop-0.20.205.0

[root@localhost ~]# mkdir -p /data/hadoop
[root@localhost ~]# chown hdadmin:hadoop /data/hadoop
[root@localhost ~]# chmod 750 /data/hadoop
![Page 17: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/17.jpg)
7. Config Hadoop

$ vi /usr/local/hadoop-0.20.205.0/conf/hadoop-env.sh

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
export HADOOP_HOME_WARN_SUPPRESS="TRUE"

$ vi /usr/local/hadoop-0.20.205.0/conf/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
![Page 18: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/18.jpg)
$ vi /usr/local/hadoop-0.20.205.0/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
</configuration>

$ vi /usr/local/hadoop-0.20.205.0/conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>
![Page 19: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/19.jpg)
8. Format HDFS

$ hadoop namenode -format
13/04/03 13:59:54 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:26:14 UTC 2011
************************************************************/
13/04/03 14:00:02 INFO util.GSet: VM type = 32-bit
13/04/03 14:00:02 INFO util.GSet: 2% max memory = 17.77875 MB
13/04/03 14:00:02 INFO util.GSet: capacity = 2^22 = 4194304 entries
13/04/03 14:00:02 INFO util.GSet: recommended=4194304, actual=4194304
13/04/03 14:00:02 INFO namenode.FSNamesystem: fsOwner=hdadmin
13/04/03 14:00:02 INFO namenode.FSNamesystem: supergroup=supergroup
13/04/03 14:00:02 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/04/03 14:00:02 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/04/03 14:00:02 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/04/03 14:00:02 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/04/03 14:00:02 INFO common.Storage: Image file of size 113 saved in 0 seconds.
13/04/03 14:00:02 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
13/04/03 14:00:02 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/
![Page 20: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/20.jpg)
9. Start NameNode, DataNode, JobTracker and TaskTracker

$ cd /usr/local/hadoop/bin/
$ start-all.sh
starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-tasktracker-localhost.localdomain.out

$ jps
27410 SecondaryNameNode
27643 TaskTracker
27758 Jps
27504 JobTracker
27110 NameNode
27259 DataNode
![Page 21: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/21.jpg)
10. Browse HDFS via Browser
http://localhost:50070/

NameNode 'localhost.localdomain:54310'
Started: Wed Apr 03 14:11:39 ICT 2013
Version: 0.20.205.0, r1179940
Compiled: Fri Oct 7 06:26:14 UTC 2011 by hortonfo
Upgrades: There are no upgrades in progress.

Browse the filesystem | Namenode Logs

Cluster Summary
6 files and directories, 1 blocks = 7 total. Heap Size is 58.88 MB / 888.94 MB (6%)
Configured Capacity : 15.29 GB
DFS Used : 28.01 KB
Non DFS Used : 4.94 GB
DFS Remaining : 10.36 GB
DFS Used% : 0 %
DFS Remaining% : 67.72 %
Live Nodes : 1
Dead Nodes : 0
Decommissioning Nodes : 0
Number of Under-Replicated Blocks : 0
![Page 22: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/22.jpg)
Map Reduce

Map:    (k1, v1) -> list(k2, v2)
Reduce: (k2, list(v2)) -> list(k3, v3)
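To see those signatures in action, here is a minimal, framework-free word count in plain Java. The class name, the in-memory "shuffle" map, and the driver loop are stand-ins for what Hadoop actually distributes across the cluster; only the map/reduce shapes match the slide:

```java
import java.util.*;

public class WordCountSketch {
    // Map: (k1 = line offset, v1 = line) -> list of (k2 = word, v2 = 1)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reduce: (k2 = word, list(v2) = counts) -> (k3 = word, v3 = total)
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    // Driver: map every line, shuffle (group values by key), reduce each group.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be")));
    }
}
```

In real Hadoop the shuffle happens between machines and the reduce runs on many nodes in parallel, but the data flow is exactly this: map emits (word, 1) pairs, the framework groups them by word, and reduce sums each group.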
![Page 23: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/23.jpg)
Map Reduce
![Page 24: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/24.jpg)
Map
![Page 25: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/25.jpg)
Reduce
![Page 26: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/26.jpg)
Job
![Page 27: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/27.jpg)
Execute

$ hadoop jar ./wordcount.jar com.fico.Wordcount /input/* /output/wordcount_output_dir
![Page 28: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/28.jpg)
Bonus
![Page 29: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/29.jpg)
HIVE
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
![Page 30: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/30.jpg)
HIVE Command Ex.

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> SHOW TABLES;
hive> DESCRIBE invites;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> DROP TABLE pokes;
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
![Page 31: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/31.jpg)
SQOOP
With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
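A rough sketch of why the parallel import produces multiple files: the table's key range is divided into non-overlapping slices, one per mapper, and each mapper writes its own output file. This is an illustration only, not Sqoop code; the `id` column, class name, and mapper count are hypothetical (real Sqoop derives the range from the table's split-by column):

```java
import java.util.*;

public class SplitSketch {
    // Divide the key range [min, max] into `numMappers` roughly equal,
    // non-overlapping slices, expressed as one WHERE clause per mapper.
    static List<String> splits(long min, long max, int numMappers) {
        List<String> clauses = new ArrayList<>();
        long span = max - min + 1;
        long chunk = (span + numMappers - 1) / numMappers; // ceiling division
        for (long lo = min; lo <= max; lo += chunk) {
            long hi = Math.min(lo + chunk - 1, max);
            clauses.add("WHERE id >= " + lo + " AND id <= " + hi);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // 4 mappers over ids 1..1000 -> 4 slices, so 4 output files
        // (Sqoop names them part-m-00000, part-m-00001, ...).
        splits(1, 1000, 4).forEach(System.out::println);
    }
}
```

Each mapper runs its own bounded query against the database, which is what lets the import proceed in parallel without the mappers reading the same rows twice.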
![Page 32: Hadoop](https://reader031.vdocuments.net/reader031/viewer/2022020307/5596638c1a28ab00348b46a3/html5/thumbnails/32.jpg)
THANK YOU :)