“Cloudera Hadoop” by Kittirak Moungmingsuk
Managing Director of Cluster Kit Co., Ltd. and President of the Open Source Education and Development Association (OSEDA)
Big Data & Analytics seminar by Data Cube (facebook.com/datacube.th)
Cloudera Hadoop
Kittirak Moungmingsuk
Apr 4, 2015 Data Cube Seminar @ KU HOME
First, a Little About Me
Kittirak Moungmingsuk, nickname "Kik"
Currently juggles many roles at a small company called “Cluster Kit”
and has been entrusted by many people to serve as president of the Open Source Education and Development Association (OSEDA)
Education
Nak Tham Tri (elementary Dhamma studies), Ubon Ratchathani provincial school, Wat Pa Wiwek (Thammachan)
Enjoys surfing the internet, traveling, and taking part in various activities
Cluster Kit: Achievement
ThaiGrid (Tera Cluster): 800 cores, Linux cluster
133 cores, Windows cluster
Sila Cluster @ Ramkhamhaeng U.: 286 cores
BIOTEC (Eclipse Cluster): 704 cores
Virgin Radio Thailand: 7 nodes, web cluster
Geo-Informatics and Space Technology Development Agency (GISTDA): 10 nodes, web cluster
HAII (HAII Cluster I, II): 480 cores
Top500.org (update Nov 2014)
Top500 Architecture Share (June 2014)
Top500 OS Share
Why Big Data?
Source: https://practicalanalytics.files.wordpress.com/2012/10/newstyleofit.jpg
Source: http://smartdatacollective.com/yellowfin/75616/why-big-data-and-business-intelligence-one-direction
Facebook Usage Statistics (June 2014)
829 million daily active users
654 million mobile daily active users
1.32 billion monthly active users
1.07 billion mobile monthly active users
Approximately 81.7% of our daily active users are outside the US and Canada
Source: http://newsroom.fb.com/company-info/
Google Usage Statistics
Data from http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-of-google-stats-and-facts/#.VDavqq2mhNA
Number of monthly Google searches: 11.944 billion (3/20/14)
Number of monthly unique visitors: 187 million (3/25/14)
Number of Servers
More than a million machines
180,900 servers
https://www.facebook.com/ArcadianLearnings/posts/549836811713533
Low Cost, High Performance
http://www.opencompute.org/
Software
Linux
Python, C++, Java, JavaScript, Go, Sawzall (a custom log-analysis language)
Hadoop
Linux
PHP, C++, Java, Python, and Ruby.
Apache Web Server
MySQL
Hadoop
Memcached, Flashcache
HipHop, which transforms PHP source code into C++ to gain performance benefits.
What is Hadoop?
HDFS
MapReduce
How to build Hadoop cluster
How to execute MapReduce
Hive SQL
Hadoop – How was it Born?
Hadoop was created to process huge volumes of data as the amount of generated data continued to increase rapidly (Big Data). The Web was also generating more and more information, and indexing that content was becoming quite challenging.
What Is Apache Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Image Source: http://blogs.ejb.cc/archives/4290/hadoop-technical-manuals-a-the-hadoop-ecosystem/tumblr_lbbwggcer71qappj8
HDFS Architecture
Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
Data Replication
Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
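The HDFS design document linked above describes files stored as a sequence of blocks, with each block replicated across DataNodes for fault tolerance. The idea can be sketched roughly in Python as follows. This is only an illustration: the function names `split_into_blocks` and `place_replicas` are hypothetical, and the round-robin placement is a deliberate simplification, since real HDFS uses a rack-aware placement policy. The 64 MB block size and replication factor of 3 are the typical values mentioned in that version of the documentation.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the size in bytes of each block a file would be cut into."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        # Every block is full-sized except possibly the last one.
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes.

    Assumption: simple round-robin, NOT the rack-aware policy HDFS uses.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(150 * MB)
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in layout.items():
    print(block_id, blocks[block_id] // MB, "MB ->", nodes)
```

With four DataNodes and three replicas, losing any single node still leaves at least two copies of every block, which is the point of replication.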
Hadoop - Basic Architecture
Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html
Hadoop - Basic Architecture (contd.)
Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html
MapReduce
MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. – Wikipedia
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
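The map and reduce signatures can be illustrated with a tiny single-process engine in Python. This is a minimal sketch that assumes all data fits in memory. `run_mapreduce`, `first_letter_map`, and `sum_reduce` are hypothetical names chosen for illustration and are not part of Hadoop's API. The sample records are made up.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (K1, V1), group by K2, then apply reduce_fn."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        # map: (K1, V1) -> list(K2, V2)
        for k2, v2 in map_fn(k1, v1):
            # Shuffle: collect every V2 under its K2.
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        # reduce: (K2, list(V2)) -> list(K3, V3)
        output.extend(reduce_fn(k2, grouped[k2]))
    return output


def first_letter_map(country_id, country):
    # Emit the first letter of the name as the intermediate key.
    return [(country[0], 1)]


def sum_reduce(letter, ones):
    return [(letter, sum(ones))]


records = [(1, "Thailand"), (2, "Japan"), (3, "Turkey")]
print(run_mapreduce(records, first_letter_map, sum_reduce))
```

In a real cluster the map calls, the grouping step, and the reduce calls run on different machines, but the data flow between them has the same shape.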
MapReduce
Output in a list of (Key, Value)
Output in a list of (Key, List of Values)
Image source: http://www.rabidgremlin.com/data20/#(3)/
WordCount - MapReduce
Map function
Reduce function
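The map and reduce functions for WordCount were shown as an image on the slide. A rough Python equivalent, written in the style of a streaming job where the reducer receives pairs already sorted by key, might look like this. The names `mapper`, `reducer`, and `word_count` are assumptions for illustration, and `sorted()` stands in for the shuffle-and-sort step that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce: sum the counts of each word; input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() plays the role of Hadoop's shuffle-and-sort phase here.
    return dict(reducer(sorted(mapper(lines))))


print(word_count(["hello world", "hello hadoop"]))
```

Because the reducer only ever looks at one run of identical keys at a time, it never needs the whole data set in memory, which is what lets the same logic scale out.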
WordCount by Pig
A = load 'input/*';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into 'output/wordcount-pig';
Source: http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
https://pig.apache.org/
Hive
Hive provides a mechanism to project structure onto data stored in Hadoop and to query that data using a SQL-like language called HiveQL.
hive> show tables;
hive> create table country (country_id int, country string) row format delimited fields terminated by ',' stored as textfile;
hive> desc country;
hive> load data local inpath '/tmp/country.csv' into table country;
hive> select count(country_id) from country where country like 'T%';
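For readers without a Hive installation, the same create, load, and query sequence can be approximated locally with Python's built-in `sqlite3` module. This is only a stand-in to show what the statements do: SQLite is not Hive, and the CSV contents below are hypothetical because the real `/tmp/country.csv` is not shown in the slides.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for /tmp/country.csv (comma-delimited id,name rows).
COUNTRY_CSV = "1,Thailand\n2,Japan\n3,Turkey\n4,Tanzania\n"


def count_countries_starting_with_t(csv_text):
    conn = sqlite3.connect(":memory:")
    # SQLite's LIKE ignores ASCII case by default; turn that off so 'T%'
    # matches only names that begin with an uppercase T.
    conn.execute("PRAGMA case_sensitive_like = ON")
    # Mirrors: create table country (country_id int, country string) ...
    conn.execute("CREATE TABLE country (country_id INTEGER, country TEXT)")
    # Mirrors: load data local inpath '/tmp/country.csv' into table country
    rows = [(int(r[0]), r[1]) for r in csv.reader(io.StringIO(csv_text))]
    conn.executemany("INSERT INTO country VALUES (?, ?)", rows)
    # Mirrors: select count(country_id) from country where country like 'T%'
    (count,) = conn.execute(
        "SELECT COUNT(country_id) FROM country WHERE country LIKE 'T%'"
    ).fetchone()
    conn.close()
    return count


print(count_countries_starting_with_t(COUNTRY_CSV))
```

The key difference is where the work happens: SQLite evaluates the query in one process, while Hive compiles a query like this into jobs that run across the cluster.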
Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm.
Mahout supports four main data science use cases:
Collaborative filtering
Clustering
Classification
Frequent itemset mining
List of algorithms (for distributed mode)
Canopy Clustering
Dirichlet Process Clustering
Fuzzy K-Means
Hierarchical Clustering
K-Means Clustering
Latent Dirichlet Allocation
Mean Shift Clustering
Minhash Clustering
Spectral Clustering
Distributed Item-based Collaborative Filtering
Collaborative Filtering Using a Parallel Matrix Factorization
Bayesian
Random Forests
Parallel FP Growth Algorithm
Source: http://hortonworks.com/hadoop/mahout/
Source: http://imgbuddy.com/hadoop-ecosystem-components.asp
HADOOP 1.0 vs 2.0
Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/
Hadoop 2.0: YARN (Yet Another Resource Negotiator)
Source: http://hortonworks.com/get-started/yarn/
Cloudera Hadoop (CDH)
CDH is Cloudera's open source distribution of Apache Hadoop and related projects.
http://www.cloudera.com/
Source: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
Cloudera GUI (Hue)
Other Hadoop Platforms
Hortonworks
http://hortonworks.com/
MapR
https://www.mapr.com/
References
https://www.facebook.com/Engineering
https://www.facebook.com/data
https://www.facebook.com/publication
http://research.google.com
http://googleblog.blogspot.com
Stratapps, “An Introduction to Hadoop”, http://stratapps.net/intro-hadoop.php
edureka!, “Introduction to Hadoop 2.0 and advantages of Hadoop 2.0 over 1.0”, May 2014, http://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/
Second-Hand Computers for Children in Rural Areas project
The End.
Download this slide at http://goo.gl/DoibT2
Tweet to me at @kittirak