indicthreads-pune12-comparing hadoop data storage
-
7/30/2019 IndicThreads-Pune12-Comparing Hadoop Data Storage
Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig)
Rakesh Jadhav, SAS
-
Agenda
Hadoop Ecosystem
HDFS
HBase
Hive
Pig
-
Hadoop Ecosystem
-
Hadoop Ecosystem Components
HDFS: Hadoop Distributed File System
MapReduce: Hadoop Distributed Programming Paradigm
HBase: Hadoop Column-Oriented Database for Random-Access Read/Write of Smaller Data
Hive: Hadoop Petabyte-Scalable Data Warehousing Infrastructure
Pig: Hadoop Data Flow/Analysis Infrastructure
ZooKeeper: Hadoop Coordination and Configuration Service Infrastructure
Chukwa: Hadoop Monitoring Service
Avro: Hadoop Data Serialization/Deserialization Infrastructure
Mahout: Hadoop Scalable Machine Learning Library
-
HDFS (Data Storage)
Design Features
Failure Is the Norm
Designed for Large Datasets Rather Than Small Ones
Designed for Batch Processing Rather Than Interactive Use
Supports Write Once, Read Many
Provides Interfaces to Move Processing Closer to Data
-
HDFS
APPLICATION AREAS
Large Log Processing
Web Search Indexing
LIMITATIONS
Small Files Problem (Many Small Files Are Handled Inefficiently)
Single Point of Failure (the NameNode)
No Random Access
No In-Place Updates (Write Once Only)
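The small-size limitation listed above is usually called the small files problem, and a little arithmetic shows why it matters. The sketch below assumes the often-quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object and a 128 MB block size. Both figures are assumptions for illustration, not values from the slides, and `namenode_bytes` is a hypothetical helper.

```python
import math

# Assumptions for illustration only (not from the slides):
BYTES_PER_OBJECT = 150          # rough NameNode heap cost per file or block object
BLOCK_SIZE = 128 * 1024 ** 2    # 128 MB HDFS block size


def namenode_bytes(file_count: int, file_size: int) -> int:
    """Estimate NameNode memory needed to track file_count files of file_size bytes."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = file_count * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


one_tb = 1024 ** 4
# The same 1 TB stored as 1 MB files versus 1 GB files.
small = namenode_bytes(one_tb // 1024 ** 2, 1024 ** 2)
large = namenode_bytes(one_tb // 1024 ** 3, 1024 ** 3)
print(small, large, small // large)
```

Under these assumptions the same terabyte costs the NameNode a few hundred times more memory when it is stored as 1 MB files, which is why small files are usually packed into larger ones before loading.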
-
HBase (Data Storage)
Design Features
Key-Value Store (Like a Map)
Semi-Structured Data
Column Family, Time Stamp
Key = RowKey.ColumnFamily.ColumnName.TimeStamp
De-normalized Data
Faster Data Retrieval Using Column Families
Static Column Families, Dynamic Columns
-
RDBMS v/s HBase: Example
RDBMS
ID  Name  Age  Birth-Place  MaritalStatus  Location  Weight  Employer
1   Sam   35   Mumbai       Married        Pune      76      XYZ
2   Bob   56   Chicago      Married        New York  79      PQR
HBase
Row Key  Personal Information (Column Family)  Other Information (Column Family)
1        Name:T1=Sam                           Weight:T2=76
         Age:T2=35                             Weight:T1=65
         Age:T1=25                             Location:T2=Pune
         Birth-Place:T1=Mumbai                 Location:T1=Mumbai
         MaritalStatus:T2=Married              Employer:T1=XYZ
         MaritalStatus:T1=Unmarried
2
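The versioned cells in the HBase example above can be modelled as a map keyed by row key, column family and column, with one value per timestamp. The following is a minimal in-memory sketch of that idea, not the HBase client API. The class name `VersionedStore`, the family names `personal` and `other`, and the integer timestamps standing in for T1 and T2 are all hypothetical.

```python
from collections import defaultdict


class VersionedStore:
    """Toy in-memory model of the RowKey/ColumnFamily/Column/TimeStamp key."""

    def __init__(self, column_families):
        # Column families are fixed up front; columns inside them are not.
        self.column_families = set(column_families)
        # (row_key, family, column) -> {timestamp: value}
        self.cells = defaultdict(dict)

    def put(self, row_key, family, column, timestamp, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row_key, family, column)][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        """Return the value at timestamp, or the newest version if omitted."""
        versions = self.cells.get((row_key, family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


store = VersionedStore(["personal", "other"])
store.put("1", "personal", "Age", 1, 25)   # Age:T1=25
store.put("1", "personal", "Age", 2, 35)   # Age:T2=35
store.put("1", "other", "Location", 1, "Mumbai")
print(store.get("1", "personal", "Age"))      # newest version
print(store.get("1", "personal", "Age", 1))   # older version
```

New columns can be written at any time without a schema change, while a write to an undeclared column family is rejected, which mirrors the "static column families, dynamic columns" design point.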
-
HBase: Application Areas
Applications That Need to Store, Access, or Search Data by Key
Applications That Need Fast Random Access and Updates to Scalable Structured Data
Applications Needing a Flexible Table Schema
Applications Needing Range-Search Capabilities Supported by Key Ordering
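Range search works because row keys are kept in sorted order, so a scan only has to locate the start of the range and read forward. The sketch below illustrates the idea with a sorted Python list and binary search. It is not the HBase scan API, and the `<user>#<date>` key layout is a hypothetical design chosen for the example.

```python
import bisect


def range_scan(sorted_keys, start, stop):
    """Return keys k with start <= k < stop from a sorted list of row keys."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Hypothetical row keys of the form <user>#<date>, kept in sorted order.
row_keys = sorted([
    "bob#2012-01-05",
    "sam#2012-01-03",
    "sam#2012-02-11",
    "sam#2012-03-20",
    "tom#2012-01-09",
])
print(range_scan(row_keys, "sam#2012-01", "sam#2012-03"))
```

Putting the most selective attribute first in the row key keeps related rows adjacent, so a query such as "all of Sam's rows for January and February" touches one contiguous slice.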
-
HBase: Limitations
Expensive Full Row Read
No Secondary Keys
No SQL Support
Not Efficient for Big Cell Values
-
Hive (Data Access)
Design Features
Scalable data warehouse on top of Hadoop, developed by Facebook
SQL-like query language (HiveQL)
Limited JDBC support
Support for rich data types
Ability to insert custom map-reduce jobs
-
Hive: Application Areas
Ad hoc analysis of huge structured data with no low-latency requirement
Log processing
Text mining
Document indexing
Customer-facing business intelligence (e.g., Google Analytics)
Predictive modeling, hypothesis testing
-
Hive: Limitations
No Support for Updating Data
Only Bulk Load Support
Not Efficient for Small Data
-
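Hive: Example
-- Define a tab-delimited text table for employee records.
CREATE TABLE employee (id BIGINT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
-- Bulk-load a local file, replacing any rows already in the table.
LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;
-- Write the first 100 rows sorted by age, descending (the target table oldest_employee must already exist).
INSERT OVERWRITE TABLE oldest_employee SELECT * FROM employee SORT BY age DESC LIMIT 100;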
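For readers less familiar with HiveQL, the result the statements above aim for can be expressed in a few lines of plain Python. This is only a sketch of the intended outcome on a single machine, not of how Hive executes the query. The function name `oldest_employees` and the sample rows are made up for illustration.

```python
import csv
import heapq
import io


def oldest_employees(tsv_text, limit=100):
    """Parse tab-delimited (id, name, age) rows and return the `limit` oldest."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(emp_id), name, int(age)) for emp_id, name, age in reader]
    # Equivalent in effect to sorting by age descending and keeping `limit` rows.
    return heapq.nlargest(limit, rows, key=lambda row: row[2])


# Made-up sample data in the same layout as the employee table.
sample = "1\tSam\t35\n2\tBob\t56\n3\tAnn\t41\n"
print(oldest_employees(sample, limit=2))
```

The Python version holds every row in memory, which is exactly what stops working at warehouse scale and is the reason to push the same logic down to Hive.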
-
Pig (Data Access)
Pig Latin: high-level data flow language
Client-side library; no server-side deployment needed
Batch processing of large unstructured data
Procedural language
Runtime schema creation, checkpointing, support for splits in the pipeline
Custom code support
Rich data types
Support for joins
-
Pig: Application Areas
Extract, Transform, Load (ETL)
Unstructured Data Analysis
-
Pig: Limitations
Not efficient for processing small datasets
-
Pig: Example
Load employee data from a text file, filter it by age and joining year, and group it by joining year.
records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:chararray, employeeId:int, age:int);
filtered_records = FILTER records BY age > 30 AND
    (joiningYear >= 2000 OR joiningYear
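The load, filter and group steps described above form a simple data flow, sketched here in plain Python. The Pig script is cut off in the transcript, so this sketch does not try to reproduce its full filter: the predicate is passed in by the caller, and the one used below covers only the visible part of the condition. The function names and sample rows are hypothetical, and tab-separated input is assumed because that is Pig's default LOAD delimiter.

```python
from collections import defaultdict


def load(lines):
    """Parse tab-separated (joiningYear, employeeId, age) lines into dicts."""
    for line in lines:
        joining_year, employee_id, age = line.rstrip("\n").split("\t")
        yield {
            "joiningYear": int(joining_year),
            "employeeId": int(employee_id),
            "age": int(age),
        }


def group_by_year(records, keep):
    """Filter records with `keep`, then group the survivors by joiningYear."""
    groups = defaultdict(list)
    for record in records:
        if keep(record):
            groups[record["joiningYear"]].append(record["employeeId"])
    return dict(groups)


# Made-up rows; the predicate uses only the part of the filter visible above.
lines = ["1998\t1\t45", "2003\t2\t38", "2003\t3\t29", "2005\t4\t33"]
result = group_by_year(load(lines), lambda r: r["age"] > 30 and r["joiningYear"] >= 2000)
print(result)
```

Each function corresponds to one named relation in a Pig script, which is the sense in which Pig Latin is procedural: the pipeline is spelled out step by step rather than declared as a single query.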
-
Conclusion
Organizations should revisit their data strategy
Evaluate the Hadoop ecosystem
Build economical, scalable solutions for Big Data problems
-
Thank You