An Introduction to Hive


Post on 27-Jan-2015








1. An Introduction to Apache HIVE
   Credits by: Reza Ameri
   Semester: Fall 2013
   Course: DDB
   Prof: Dr. Naderi

2. Agenda
   • Starting Note
   • What is Hive
   • What is cool about Hive
   • Hive in use
   • What Hive is not
   • Brief About Data Warehouse

3. Agenda (contd.)
   • Hive Architecture
   • Components
   • Architecture Diagram
   • Hive in Production
   • HQL
   • Data Insertion/Aggregation
   • Performance
   • Further Reading
   • References

4. Starting Note
   What is Apache Hive?
   • Open source (very important!), so free
   • Data warehouse system on Hadoop
   • Provides HQL (an SQL-like query interface)
   • Suitable for structured and semi-structured data
   • Able to deal with different storage systems and file formats

5. Starting Note (contd.)
   What is cool about Hive
   • Lets users run MapReduce (MR) jobs through the HiveQL interface without thinking in MR terms.
   Some history
   • Hive was created at Facebook.
   • Netflix is also developing it.
   • Amazon uses it in Amazon Elastic MapReduce.

6. Starting Note (contd.)
   What Hive is not
   • It does not use complex indexes, so it does not respond within seconds.
   • But it scales very well and works with data on the order of petabytes.
   • It is not independent: its performance is tied to Hadoop.

7. Brief About Data Warehouse
   • OLAP vs OLTP
   • A data warehouse (DW) is needed for OLAP.
   • We want reports and summaries, not the live transaction data needed to keep operations running.
   • We need reports to improve operations, not to conduct an operation.
   • We use ETL to populate data in the DW.

8. Brief About Data Warehouse
   Inmon approach vs Kimball approach

9. Brief About Data Warehouse
   Inmon approach vs Kimball approach

10. Brief About Data Warehouse
    Other keywords
    • ODS (Operational Data Store)
    • Fact Tables
    • Data Mart
    • Dimensions
    • Concurrent ETLs

11. Hive Architecture
    Components
    • Hadoop
    • Driver
    • Command Line Interface (CLI)
    • Web Interface
    • Metastore
    • Thrift Server

12. Hive Architecture

13. Hive Architecture
    Architecture diagram; the labels that survive are:
    • Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL)
    • Hive QL: Parser, Planner, Optimizer, Execution
    • MetaStore, Thrift API
    • UDF/UDAF: substr, sum, average
    • SerDe: CSV, Thrift, Regex
    • FileFormats: TextFile, SequenceFile, RCFile
    • Map Reduce (with user-defined map-reduce scripts), HDFS

14. Hive Architecture (contd.)
    Internal components
    • Compiler and Planner: compiles and checks the input query and creates an execution plan.
    • Optimizer: optimizes the execution plan before it runs.
    • Execution Engine: runs the execution plan. The execution plan is guaranteed to be a DAG.

15. Hive Architecture (contd.)
    Hive Data Model: all data in Hive is organized into
    • Databases: the first level of abstraction.
    • Tables: ordinary tables.
    • Partitions: to handle data transfer in MR.
    • Buckets: facilitate data access within partitions.

16. Hive in Production
    • Log processing
    • Daily reports
    • User activity measurement
    • Data/text mining
    • Machine learning (training data)
    • Business intelligence
    • Advertising delivery
    • Spam detection

17. Hive in Production
    • HQL: Create, Row Format, SerDe, Select, Cluster By/Distribute By
    • Data Insertion/Aggregation

18. HQL Samples
    CREATE TABLE with ROW FORMAT

    CREATE TABLE movies (movie_id INT, movie_name STRING, tags STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';

19. HQL Samples
    Partition (the partition column is declared only in PARTITIONED BY, not in the column list)

    CREATE TABLE table_name (id INT, name STRING)
    PARTITIONED BY (`date` STRING);

20. HQL Samples
    SerDe: a user table stored in id::gender::age::occupation::zipcode format.

    CREATE TABLE USER (id INT, gender STRING, age INT, occupation STRING, zipcode INT)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)"
    );

21. HQL Samples
    Select

    SELECT * FROM movies LIMIT 10;

    Distribute By

    SELECT * FROM movies DISTRIBUTE BY tags;

    DISTRIBUTE BY names the column used to organize the rows as they are sent to the reducers.

22. Hive Process
    Data Insertion/Aggregation: bulk ETL
    • Talend (community version)
    • Sqoop (SQL to Hadoop, Apache license)
    • SyncSort (not free!)

23. Hive Process (contd.)
    STP (Straight Through Processing)
    • Flume: Apache licensed.
    • Chukwa: a part of the Apache Hadoop distribution.
    • Scribe: Facebook's solution for log processing and aggregation.

24. Hive Process (contd.)
    Netflix case study
    • Usage of Chukwa
    • Log processing
    • Count errors per session
    • Count streams per day
    • Ad-hoc queries such as summaries (sum, max, min, ...)

25. Hive Process (contd.)

26. Hive Process (contd.)
    Phase 1
    • A Hadoop job parses the logs and loads them into Hive every hour.
    • The same job should also run every 24 hours to produce a summary.
    Phase 2
    • Real-time log processing (parse/merge/load)
    • Chukwa provides non-stop log collection.

27. Performance
    According to Globant investigations
    Tables:

28. Performance

29. Performance

30. Further Reading
    • Apache Drill: software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
    • Pig: platform for creating and using MR programs on Hadoop
    • Oracle Big Data
    • DB2 10 and InfoSphere Warehouse
    • Parallel databases: Gamma, Bubba, Volcano
    • Google: Sawzall
    • Yahoo: Pig
    • IBM: JAQL
    • Microsoft: DryadLINQ, SCOPE

31. References
    • Sqoop: Database Import for Hadoop, Cloudera, Oct. 2009
    • Beginning Microsoft SQL Server 2012 Programming, Paul Atkinson and Robert Vieira, Wiley, ISBN: 978-1-118-10228-2
    • Hive: A Petabyte Scale Data Warehouse Using Hadoop, Facebook team, 2009

32. Thanks
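
Supplementary sketch for the SerDe sample on slide 20. The short Python example below is not Hive code; it only illustrates how a regular expression like the one given in "input.regex" could split one raw line into the five declared columns. The sample row, the `deserialize` helper, and the choice to return empty values for a non-matching line are assumptions made for illustration.

```python
import re

# Same pattern as the "input.regex" value in the slide 20 sample.
INPUT_REGEX = r"(.*)::(.*)::(.*)::(.*)::(.*)"
# Column names taken from the CREATE TABLE statement on slide 20.
COLUMNS = ("id", "gender", "age", "occupation", "zipcode")


def deserialize(line):
    """Map one raw text line to a column dict using the capture groups.

    Hypothetical helper: a line that does not match the whole pattern
    yields None for every column (an assumption of this sketch).
    """
    match = re.fullmatch(INPUT_REGEX, line)
    if match is None:
        return dict.fromkeys(COLUMNS)
    # Each capture group supplies the text of one column, in order.
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample row in id::gender::age::occupation::zipcode format.
row = deserialize("1::F::1::10::48067")
print(row)
```

Every captured value is still a string here; converting it to the declared column type is left out of the sketch.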

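
Supplementary sketch for the Distribute By sample on slide 21. The idea behind `DISTRIBUTE BY tags` is that rows with the same `tags` value are sent to the same reducer. The Python example below imitates that with a plain hash-and-modulo rule; the hash function (`zlib.crc32`), the reducer count, and the sample rows are assumptions for illustration and are not what Hive uses internally.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical reducer count for this illustration


def reducer_for(key, num_reducers=NUM_REDUCERS):
    # Stand-in hash: any deterministic hash is enough to show the idea.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_by(rows, column):
    """Group rows by the reducer index that the value of `column` hashes to."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[reducer_for(r[column])].append(r)
    return dict(buckets)


# Hypothetical rows shaped like the movies table from slide 18.
movies = [
    {"movie_id": 1, "movie_name": "A", "tags": "comedy"},
    {"movie_id": 2, "movie_name": "B", "tags": "drama"},
    {"movie_id": 3, "movie_name": "C", "tags": "comedy"},
    {"movie_id": 4, "movie_name": "D", "tags": "action"},
]

assignment = distribute_by(movies, "tags")
for reducer, rows in sorted(assignment.items()):
    print(reducer, [r["movie_id"] for r in rows])
```

Because the reducer index depends only on the column value, both "comedy" rows always end up together, whichever reducer that happens to be.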
