indicthreads-pune12-comparing hadoop data storage

7/30/2019 IndicThreads-Pune12-Comparing Hadoop Data Storage


Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig)

Rakesh Jadhav, SAS



Agenda

Hadoop Ecosystem

HDFS

HBase

Hive

Pig



Hadoop Ecosystem



Hadoop Ecosystem Components

HDFS: Hadoop Distributed File System

MapReduce: Hadoop Distributed Programming Paradigm

HBase: Hadoop Column-Oriented Database for Random-Access Read/Write of Smaller Data

Hive: Hadoop Petabyte-Scalable Data Warehousing Infrastructure

Pig: Hadoop Data Flow/Analysis Infrastructure

ZooKeeper: Hadoop Coordination Service, Configuration Service Infrastructure

Chukwa: Hadoop Monitoring Service

Avro: Hadoop Data Serialization/Deserialization Infrastructure

Mahout: Hadoop Scalable Machine Learning Library



HDFS (Data Storage)

Design Features

Failure Is the Norm

Designed for Large Datasets Rather than Small Ones

Designed for Batch Processing Rather than Interactive Use

Supports Write Once, Read Many

Provides Interfaces to Move Processing Closer to Data



HDFS

APPLICATION AREAS

Large Log Processing

Web Search Indexing

LIMITATIONS

Small Files Problem (many small files are handled inefficiently)

Single Point of Failure (the NameNode)

No Random Access

No In-Place Updates (files are write-once)
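The small files limitation can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the 64 MB block size and the figure of roughly 150 bytes of NameNode memory per file or block object are commonly quoted rules of thumb, not values from these slides, and the function names are hypothetical.

```python
import math

# Assumptions (not from the slides): commonly quoted rules of thumb.
BLOCK_SIZE_BYTES = 64 * 1024 * 1024  # assumed HDFS block size
BYTES_PER_OBJECT = 150               # assumed NameNode memory per file/block object


def namenode_objects(file_size_bytes: int, file_count: int) -> int:
    """Count the file and block objects tracked for equally sized files."""
    blocks_per_file = math.ceil(file_size_bytes / BLOCK_SIZE_BYTES)
    return file_count * (1 + blocks_per_file)  # one file entry plus its blocks


def namenode_heap_bytes(file_size_bytes: int, file_count: int) -> int:
    """Rough NameNode memory needed to track the given files."""
    return namenode_objects(file_size_bytes, file_count) * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_gib = 1024 ** 3
    # The same 1 GiB of data, stored as one large file or as 1 KiB files.
    large = namenode_heap_bytes(one_gib, 1)
    small = namenode_heap_bytes(1024, one_gib // 1024)
    print(f"one 1 GiB file: {large} bytes of NameNode memory")
    print(f"1 KiB files:    {small} bytes of NameNode memory")
```

Under these assumptions the metadata cost depends on the number of files and blocks rather than on the volume of data, which is why HDFS favors a few large files over many small ones.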



HBase (Data Storage)

Design Features

Key-Value Store (Like a Map)

Semi-Structured Data

Column Family, Timestamp

Key = RowKey.ColumnFamily.ColumnName.TimeStamp

De-normalized Data

Faster Data Retrieval Using Column Families

Static Column Families, Dynamic Columns



RDBMS v/s HBase: Example

RDBMS

ID  Name  Age  Birth-Place  MaritalStatus  Location  Weight  Employer
1   Sam   35   Mumbai       Married        Pune      76      XYZ
2   Bob   56   Chicago      Married        New York  79      PQR

HBase

Row Key  Personal Information (Column Family)  Other Information (Column Family)
1        Name:T1=Sam                            Weight:T2=76
         Age:T2=35                              Weight:T1=65
         Age:T1=25                              Location:T2=Pune
         Birth-Place:T1=Mumbai                  Location:T1=Mumbai
         MaritalStatus:T2=Married               Employer:T1=XYZ
         MaritalStatus:T1=Unmarried
2
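The versioned cells in the example can be modeled as a map keyed by (row key, column family, column name, timestamp). The following is a toy in-memory sketch of that idea, not the HBase client API: the class and method names are hypothetical, and the timestamps T1 and T2 are assumed to be the integers 1 and 2.

```python
from typing import Dict, Optional, Tuple

# Cell key mirrors the slide: RowKey.ColumnFamily.ColumnName.TimeStamp
CellKey = Tuple[str, str, str, int]


class VersionedStore:
    """Toy in-memory stand-in for a versioned table (hypothetical, not HBase)."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, str] = {}

    def put(self, row: str, family: str, column: str, ts: int, value: str) -> None:
        """Store one version of a cell; older versions are kept."""
        self._cells[(row, family, column, ts)] = value

    def get(self, row: str, family: str, column: str,
            as_of: Optional[int] = None) -> Optional[str]:
        """Return the newest value with timestamp <= as_of (newest overall if None)."""
        versions = [
            (ts, value)
            for (r, f, c, ts), value in self._cells.items()
            if (r, f, c) == (row, family, column) and (as_of is None or ts <= as_of)
        ]
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


if __name__ == "__main__":
    store = VersionedStore()
    store.put("1", "Personal Information", "Age", 1, "25")  # Age:T1=25
    store.put("1", "Personal Information", "Age", 2, "35")  # Age:T2=35
    print(store.get("1", "Personal Information", "Age"))            # newest version
    print(store.get("1", "Personal Information", "Age", as_of=1))   # value as of T1
```

Keeping the timestamp in the key is what lets old and new values of the same column coexist, as with Age, MaritalStatus, Weight and Location for row 1.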



HBase: Application Areas

Applications that need to store, access, or search data by key

Applications needing fast random access and updates to scalable structured data

Applications needing a flexible table schema

Applications needing range-search capabilities supported by key ordering
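Range search works because rows are kept sorted by row key, so a scan can locate the start key and read forward. The sketch below illustrates that idea on an in-memory sorted list using binary search; it is not HBase code, and the function name and sample row keys are hypothetical.

```python
import bisect
from typing import List, Tuple

Row = Tuple[str, str]  # (row key, value)


def range_scan(sorted_rows: List[Row], start: str, stop: str) -> List[Row]:
    """Return rows with start <= key < stop from a list sorted by row key."""
    keys = [key for key, _ in sorted_rows]
    lo = bisect.bisect_left(keys, start)  # first key >= start
    hi = bisect.bisect_left(keys, stop)   # first key >= stop (excluded)
    return sorted_rows[lo:hi]


if __name__ == "__main__":
    # Hypothetical row keys; a shared prefix keeps related rows adjacent.
    rows = sorted([
        ("order#0002", "pending"),
        ("user#0001", "Sam"),
        ("order#0001", "shipped"),
        ("user#0002", "Bob"),
    ])
    print(range_scan(rows, "order#", "order$"))
```

The design point is that the row key layout determines which queries are cheap: rows that should be read together need keys that sort next to each other.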



HBase: Limitations

Expensive Full Row Read

No Secondary Keys

No SQL Support

Not Efficient for Big Cell Values



Hive (Data Access)

Design Features

Scalable data warehouse on top of Hadoop, developed by Facebook

SQL-like query language (HiveQL)

Limited JDBC support

Support for rich data types

Ability to insert custom map-reduce jobs



Hive: Application Areas

Ad hoc analysis on huge structured data with no low-latency requirement

Log processing

Text Mining

Document Indexing

Customer-facing business intelligence (like Google Analytics)

Predictive Modeling, hypothesis testing


Hive: Limitations

No Support to Update Data

Only Bulk Load Support

Not Efficient For Small Data


Hive: Example

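-- Create a tab-delimited text table for employee records.
create table employee (id bigint, name string, age int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Bulk-load a local file, replacing any existing rows.
LOAD DATA LOCAL INPATH '/sas/employee.txt'
OVERWRITE INTO TABLE employee;

-- Write 100 employees, sorted by descending age, into oldest_employee
-- (the target table must already exist).
INSERT OVERWRITE TABLE oldest_employee
SELECT * FROM employee SORT BY age DESC LIMIT 100;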
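To show the intended result of the statements above on a single machine, here is a minimal Python sketch that parses tab-delimited employee rows and picks the oldest ones. It says nothing about how Hive executes the query as MapReduce jobs; the function names are hypothetical and the sample rows reuse Sam and Bob from the earlier RDBMS example.

```python
import csv
import heapq
import io
from typing import List, Tuple

Employee = Tuple[int, str, int]  # (id, name, age), matching the employee table


def parse_employees(text: str) -> List[Employee]:
    """Parse tab-delimited lines of id, name, age; blank lines are skipped."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(int(row[0]), row[1], int(row[2])) for row in reader if row]


def oldest(employees: List[Employee], limit: int = 100) -> List[Employee]:
    """Return up to `limit` employees ordered by age, oldest first."""
    return heapq.nlargest(limit, employees, key=lambda employee: employee[2])


if __name__ == "__main__":
    sample = "1\tSam\t35\n2\tBob\t56\n"
    for employee in oldest(parse_employees(sample)):
        print(employee)
```

Using `heapq.nlargest` avoids sorting the whole input when only the top rows are needed, which is the same motivation for pairing a sort with LIMIT in the query.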


Pig (Data Access)

Pig Latin: a high-level data flow language

Client-side library; no server-side deployment needed

Batch processing of large unstructured data

Procedural language

Runtime schema creation, checkpointing, split-pipeline support

Custom code support

Rich data types

Support for Joins


Pig: Application Areas

Extract, Transform, Load (ETL)

Unstructured Data Analysis


PIG: Limitations

Not efficient for processing small datasets


PIG: Example

Load employee data from a text file, filter it by age and joining year, and group it by joining year.

records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:chararray, employeeId:int, age:int);

filtered_records = FILTER records BY age > 30 AND
    (joiningYear >= 2000 OR joiningYear
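The Pig script is cut off in the source, so its full filter condition and its grouping step are not available. As a rough illustration of the load, filter, and group flow described on this slide, here is a Python sketch under stated assumptions: the joining-year test is supplied by the caller instead of being guessed, joiningYear is treated as an integer, and the function name and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[int, int, int]  # (joiningYear, employeeId, age)


def filter_and_group(
    records: Iterable[Record],
    year_matches: Callable[[int], bool],
    min_age: int = 30,
) -> Dict[int, List[Record]]:
    """Keep records with age > min_age and a matching year, grouped by year."""
    groups: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        joining_year, _employee_id, age = record
        if age > min_age and year_matches(joining_year):
            groups[joining_year].append(record)
    return dict(groups)


if __name__ == "__main__":
    sample = [(2001, 1, 35), (2001, 2, 25), (1995, 3, 40), (2005, 4, 50)]
    # Only the visible first half of the slide's year condition is used here.
    print(filter_and_group(sample, lambda year: year >= 2000))
```

Passing the year test as a function keeps the sketch honest about the truncated condition while still showing the shape of the pipeline.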



Conclusion

Organizations should revisit their data strategy

Evaluate the Hadoop ecosystem

Build economical, scalable solutions for Big Data problems


References

Hadoop: The Definitive Guide, by Tom White

http://hadoop.apache.org/

http://developer.yahoo.com/hadoop/tutorial/

http://www-01.ibm.com/software/data/infosphere/hadoop/

http://www.information-management.com/blogs/

http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

Thank You
