indicthreads-pune12-comparing hadoop data storage

Upload: indicthreads

Post on 04-Apr-2018


TRANSCRIPT

  • 7/30/2019 IndicThreads-Pune12-Comparing Hadoop Data Storage

    Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig)

    Rakesh Jadhav, SAS

    Agenda

    Hadoop Ecosystem

    HDFS

    HBase

    Hive

    Pig

    Hadoop Ecosystem

    Hadoop Ecosystem Components

    HDFS: Hadoop Distributed File System

    MapReduce: Hadoop Distributed Programming Paradigm

    HBase: Hadoop Column-Oriented Database for Random Access Read/Write of Smaller Data

    Hive: Hadoop Petabyte-Scalable Data Warehousing Infrastructure

    Pig: Hadoop Data Flow/Analysis Infrastructure

    ZooKeeper: Hadoop Coordination Service, Configuration Service Infrastructure

    Chukwa: Hadoop Monitoring Service

    Avro: Hadoop Data Serialization/Deserialization Infrastructure

    Mahout: Hadoop Scalable Machine Learning Library

    HDFS (Data Storage): Design Features

    Failure Is the Norm

    Designed for Large Datasets Rather than Small Ones

    Designed for Batch Processing Rather than Interactive Use

    Supports Write Once, Read Many

    Provides Interfaces to Move Processing Closer to Data

    HDFS

    APPLICATION AREAS

    Large Log Processing

    Web Search Indexing

    LIMITATIONS

    Small Files Problem

    Single Point of Failure (NameNode)

    No Random Access

    No In-Place Write (Update) Support
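The small-size limitation listed above can be illustrated with a little arithmetic. The NameNode keeps an in-memory entry for every file and every block, so the same volume of data costs far more metadata when it is split into many small files. The sketch below is only an illustration: the 128 MiB block size is an assumed, configurable value, and `namenode_objects` is a hypothetical helper, not part of any Hadoop API.

```python
import math

# Assumed block size in bytes; HDFS lets the cluster configure this.
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_objects(file_sizes):
    """Count the file entries plus block entries the NameNode would track."""
    blocks = sum(math.ceil(size / BLOCK_SIZE) for size in file_sizes)
    return len(file_sizes) + blocks


one_gib = 1024 * 1024 * 1024

# The same 1 GiB of data stored two ways.
single_large = namenode_objects([one_gib])               # one 1 GiB file
many_small = namenode_objects([one_gib // 8192] * 8192)  # 8192 files of 128 KiB

print(single_large, many_small)
```

Under these assumptions the single large file needs one file entry and eight block entries, while the small files need one file entry and one block entry each.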

    HBase (Data Storage): Design Features

    Key-Value Store (Like a Map)

    Semi-Structured Data

    Column Family, Timestamp

    Key = RowKey.ColumnFamily.ColumnName.Timestamp

    De-normalized Data

    Faster Data Retrieval Using Column Families

    Static Column Families, Dynamic Columns

    RDBMS v/s HBase: Example

    RDBMS

    ID  Name  Age  Birth-Place  MaritalStatus  Location  Weight  Employer
    1   Sam   35   Mumbai       Married        Pune      76      XYZ
    2   Bob   56   Chicago      Married        NewYork   79      PQR

    HBase

    Row Key  Personal Information (Column Family)  Other Information (Column Family)
    1        Name:T1=Sam                           Weight:T2=76
             Age:T2=35                             Weight:T1=65
             Age:T1=25                             Location:T2=Pune
             Birth-Place:T1=Mumbai                 Location:T1=Mumbai
             MaritalStatus:T2=Married              Employer:T1=XYZ
             MaritalStatus:T1=Unmarried
    2
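The HBase layout in this example can be modelled as a map from (row key, column family, column name) to a set of timestamped versions. The following is a toy in-memory sketch of that idea, not the HBase client API: `TinyTable`, its methods, and the family names `personal` and `other` are all hypothetical names chosen for illustration, and the timestamps T1 and T2 are represented as the integers 1 and 2.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory model of an HBase-style table (not the HBase API)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # (row, family, column) -> {timestamp: value}
        self.cells = defaultdict(dict)

    def put(self, row, family, column, timestamp, value):
        # Columns inside a family are dynamic; only the family is checked.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row, family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        versions = self.cells.get((row, family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # newest version by default
        return versions.get(timestamp)


table = TinyTable(["personal", "other"])
table.put("1", "personal", "Age", 1, 25)
table.put("1", "personal", "Age", 2, 35)
table.put("1", "other", "Location", 1, "Mumbai")
table.put("1", "other", "Location", 2, "Pune")

latest_age = table.get("1", "personal", "Age")
first_age = table.get("1", "personal", "Age", timestamp=1)
latest_location = table.get("1", "other", "Location")
print(latest_age, first_age, latest_location)
```

Older versions stay readable by timestamp, which is how the row for Sam can hold both the earlier and the current age without a schema change.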

    HBase: Application Areas

    Applications That Need to Store/Access/Search Using a Key

    Need Fast Random Access/Update to Scalable Structured Data

    Applications Needing a Flexible Table Schema

    Applications Needing Range-Search Capabilities Supported by Key Ordering
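Range search falls out of key ordering: if keys are kept sorted, every key in a range sits in one contiguous run that can be located with two binary searches. The sketch below shows the idea with Python's standard `bisect` module. `SortedStore`, the `user#...` key format, and the stored values are hypothetical and only stand in for an ordered key-value store; this is not how HBase is implemented internally.

```python
import bisect


class SortedStore:
    """Toy key-value store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
for key in ["user#0003", "order#0001", "user#0001", "user#0002", "video#0001"]:
    store.put(key, key.upper())

# "$" is the character right after "#", so this range covers every "user#" key.
users = store.scan("user#", "user$")
print([k for k, _ in users])
```

Choosing row keys so that related rows share a prefix is what makes this kind of scan useful in practice.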

    HBase: Limitations

    Expensive Full Row Read

    No Secondary Keys

    No SQL Support

    Not Efficient for Big Cell Values

    Hive (Data Access): Design Features

    Scalable data warehouse on top of Hadoop, developed by Facebook

    SQL-like query language: HiveQL

    Limited JDBC support

    Support for rich data types

    Ability to insert custom map-reduce jobs

    Hive: Application Areas

    Ad hoc analysis on huge structured data with no low-latency requirement

    Log processing

    Text Mining

    Document Indexing

    Customer-facing business intelligence (Google Analytics)

    Predictive Modeling, hypothesis testing

    Hive: Limitations

    No Support to Update Data

    Only Bulk Load Support

    Not Efficient For Small Data

    Hive: Example

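    -- Define a tab-delimited text table
    CREATE TABLE employee (id BIGINT, name STRING, age INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Bulk-load a local file, replacing any existing rows
    LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;

    -- Keep the 100 oldest employees (oldest_employee must already exist)
    INSERT OVERWRITE TABLE oldest_employee
    SELECT * FROM employee SORT BY age DESC LIMIT 100;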
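For readers more familiar with general-purpose code, here is a rough single-machine sketch of what the Hive statements above compute: parse tab-delimited rows, sort by age descending, keep the first N. It assumes one global sort and ignores how Hive distributes the work across reducers. `oldest_employees` is a hypothetical helper, and the third sample row is made up for illustration.

```python
import csv
import io


def oldest_employees(tsv_text, limit=100):
    """Parse (id, name, age) rows from tab-delimited text; return the oldest."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(emp_id), name, int(age)) for emp_id, name, age in reader]
    # Sort by age, oldest first, then keep only the first `limit` rows.
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]


# Sample rows; the third one is invented for this example.
sample = "1\tSam\t35\n2\tBob\t56\n3\tAmy\t41\n"
top_two = oldest_employees(sample, limit=2)
print(top_two)
```

The Hive version expresses the same intent declaratively and leaves the parallel execution to the MapReduce jobs it generates.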

    Pig (Data Access)

    Pig Latin: high-level data flow language

    Client-side library; no server-side deployment needed

    Batch processing of large unstructured data

    Procedural language

    Runtime schema creation, checkpointing ability, split-pipeline support

    Custom code support

    Rich data types

    Support for Joins

    Pig: Application Areas

    Extract, Transform, Load (ETL)

    Unstructured Data Analysis

    PIG: Limitations

    Not efficient for processing small datasets

    PIG: Example

    Load employee data from a text file, filter it using age and joining year, and group it by joining year.

    records = LOAD 'sas/input/files/employee.txt'
        AS (joiningYear:chararray, employeeId:int, age:int);

    filtered_records = FILTER records BY age > 30 AND
        ( joiningYear >=2000 OR joiningYear
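The Pig script is cut off in this transcript, so the sketch below only mirrors the steps described in the slide text: load, filter by age and joining year, then group by joining year. The year condition here (joining year of 2000 or later) is an assumption standing in for the truncated expression, and `filter_and_group` and the sample records are hypothetical.

```python
from collections import defaultdict


def filter_and_group(records, min_age=30, min_year=2000):
    """Filter (joining_year, employee_id, age) tuples, then group by year."""
    # Assumed filter: the original condition is truncated in the transcript.
    filtered = [
        r for r in records
        if r[2] > min_age and int(r[0]) >= min_year
    ]
    groups = defaultdict(list)
    for record in filtered:
        groups[record[0]].append(record)  # group key is the joining year
    return dict(groups)


# Made-up sample data in the same field order as the Pig schema.
records = [("1999", 1, 45), ("2005", 2, 32), ("2005", 3, 28), ("2010", 4, 51)]
grouped = filter_and_group(records)
print(grouped)
```

Each step maps to one Pig Latin statement, which is what makes Pig a data flow language: the script names intermediate relations rather than describing a single query.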

    Conclusion

    Organizations: revisit data strategy

    Evaluate the Hadoop ecosystem

    Build economical, scalable solutions for Big Data problems

    References

    Hadoop: The Definitive Guide, by Tom White

    http://hadoop.apache.org/

    http://developer.yahoo.com/hadoop/tutorial/

    http://www-01.ibm.com/software/data/infosphere/hadoop/

    http://www.information-management.com/blogs/

    http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation


    Thank You