Hortonworks Technical Workshop: HBase and Apache Phoenix


Page 1: Hortonworks Technical Workshop: HBase and Apache Phoenix


SQL on HBase with Phoenix

Page 2: Hortonworks Technical Workshop: HBase and Apache Phoenix


Agenda

What Is Apache HBase
•  High-Level Overview.
•  Technical Detail.

What Is Apache Phoenix
•  Overview.
•  What’s New.
•  Secondary Index Demo.

Page 3: Hortonworks Technical Workshop: HBase and Apache Phoenix


New Data Requires a New Data Architecture

Source: IDC

•  2.8 ZB of data in 2012; 40 ZB projected by 2020.
•  85% from new data types; 15x growth in machine data by 2020.
•  Existing sources: OLTP, ERP, CRM systems.
•  New data types: unstructured documents, emails, clickstream, server logs, sentiment and web data, sensor and machine data, geolocation.

A modern database needs to be more scalable, handle new data types, and be intelligent and predictive.

Page 4: Hortonworks Technical Workshop: HBase and Apache Phoenix


What Is Apache HBase?

•  100% open source.
•  Store and process petabytes of data.
•  Flexible schema.
•  Scale out on commodity servers.
•  High performance, high availability.
•  Integrated with YARN.
•  SQL and NoSQL interfaces.

[Diagram: HBase RegionServers 1 through N run on YARN (the data operating system), with HDFS as permanent data storage.]

Dynamic schema. Scales horizontally to PBs of data. Directly integrated with Hadoop.

Page 5: Hortonworks Technical Workshop: HBase and Apache Phoenix


Kinds of Apps Built with HBase

Interested? See HBase Case Studies later in this document.

•  Write-heavy, low-latency workloads
•  Search / indexing
•  Messaging
•  Audit / log archive
•  Advertising
•  Data cubes
•  Time series
•  Sensor / device data

Page 6: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase is Deeply Integrated with Hadoop

•  Data is stored in HDFS. You can store more data and re-use existing HDFS expertise.
•  HBase is integrated with YARN.
•  Analytics in-place using Hive, Pig, Spark and more.

Page 7: Hortonworks Technical Workshop: HBase and Apache Phoenix


Who’s Using HBase?

Page 8: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase Technical Details

Spring 2014 Version 1.0

Page 9: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase Technical Details

Based on Google BigTable
•  Dynamic schema.
•  Good for very sparse datasets.
•  All data is range-partitioned for trivial horizontal scaling across commodity hardware.

Directly integrated with HDFS and Hadoop
•  Analyze data in HBase with any Hadoop ecosystem tool (Hive, Pig, MapReduce, Tez, etc.).
•  Re-use existing Hadoop skills to run HBase (see the client sketch below).
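
To make the point concrete, here is a minimal write/read sketch using the classic HBase Java client API of the same vintage as the native-API example later in this deck. The table, column family and qualifier names are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "web_metrics");     // hypothetical, pre-created table

    // No column is declared anywhere: any qualifier may be written under family "cf".
    Put put = new Put(Bytes.toBytes("page#/index.html"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("hits"), Bytes.toBytes(42L));
    table.put(put);

    // Point read of the same row.
    Get get = new Get(Bytes.toBytes("page#/index.html"));
    Result result = table.get(get);
    long hits = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("hits")));
    System.out.println("hits = " + hits);

    table.close();
  }
}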

Page 10: Hortonworks Technical Workshop: HBase and Apache Phoenix


Page 11: Hortonworks Technical Workshop: HBase and Apache Phoenix


Logical Architecture: distributed, persistent partitions of a BigTable

[Diagram: Table A's rowkeys (a through p) are range-partitioned into Region 1 (a-d), Region 2 (e-h), Region 3 (i-l) and Region 4 (m-p). The regions are spread across the cluster: Region Server 7 hosts Table A Regions 1 and 2 along with Table G Region 1070 and Table L Region 25; Region Server 86 hosts Table A Region 3 along with Table C Region 30 and Table F Regions 160 and 776; Region Server 367 hosts Table A Region 4 along with Table C Region 17, Table E Region 52 and Table P Region 1116.]

Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.

Page 12: Hortonworks Technical Workshop: HBase and Apache Phoenix


Logical Data Model: a sparse, multi-dimensional, sorted map

Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value (see the client sketch below).
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.

[Diagram: sample contents of Table A. Row "a" holds several versioned values under column family cf1 (qualifiers such as "foo" and "bar", with values like "hello" and "world" at different timestamps) and under cf2 (e.g. "fourth of July", "almost the loneliest number"). Row "b" holds a "thumb" qualifier in cf2 whose value is 3.6 kb of PNG data. Labels: rowkey, column family, column qualifier, timestamp, value.]
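
To make the versioned-cell model concrete, here is a minimal sketch with the classic HBase Java client: it writes two versions of the same cell and reads them back with their timestamps. The table, family and qualifier names are hypothetical, and the column family is assumed to keep multiple versions (VERSIONS >= 2).

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedCellSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "table_a");   // hypothetical table with family cf1, VERSIONS >= 2

    // Two writes to the same rowkey / family / qualifier create two versions.
    Put first = new Put(Bytes.toBytes("a"));
    first.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("hello"));
    table.put(first);

    Put second = new Put(Bytes.toBytes("a"));
    second.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("world"));
    table.put(second);

    // Ask for up to 3 versions of the cell; results come back newest first.
    Get get = new Get(Bytes.toBytes("a"));
    get.setMaxVersions(3);
    Result result = table.get(get);

    List<Cell> cells = result.getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("foo"));
    for (Cell cell : cells) {
      System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
    }
    table.close();
  }
}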

Page 13: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase HA Overview (Introduced in HDP 2.1)

[Diagram: clients coordinate through the HMaster and Zookeeper and read/write against two RegionServers. One RegionServer hosts Region 0-99 as primary plus standby replicas of Regions 100-199 and 200-299; the other hosts Regions 100-199 and 200-299 as primaries plus a standby replica of Region 0-99. Each RegionServer keeps an in-memory cache for low-latency reads and writes, and all HFiles are stored in HDFS. Callouts: HBase HA uses real-time replication; data is stored to HDFS; Hadoop tools (Hive, Pig, MapReduce) can read or write directly; replica placement is aware of cluster topology.]
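
As a hedged illustration of how a client opts into reading from a standby replica, the sketch below uses the Apache HBase read-replica client API (Consistency.TIMELINE, available in newer HBase client versions). The table name is hypothetical and is assumed to have been created with region replication enabled.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("ha_table"))) {  // hypothetical table

      Get get = new Get(Bytes.toBytes("row-0042"));
      // TIMELINE consistency lets the read be answered by a standby region replica
      // when the primary is slow or unavailable, trading strict freshness for availability.
      get.setConsistency(Consistency.TIMELINE);

      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
      // isStale() reports whether the answer came from a possibly lagging replica.
      System.out.println("stale=" + result.isStale() + " value=" + Bytes.toString(value));
    }
  }
}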

Page 14: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix

Spring 2014 Version 1.0

The SQL Skin for HBase

Page 15: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix: A SQL Skin for HBase
•  Provides a SQL interface for managing data in HBase.
•  Supports a large subset of the SQL:1999 mandatory feature set.
•  Create tables, insert and update data, and perform low-latency point lookups through JDBC.
•  The Phoenix JDBC driver is easily embeddable in any app that supports JDBC (see the sketch below).

Phoenix Makes HBase Better
•  Oriented toward online / semi-transactional apps.
•  If HBase is a good fit for your app, Phoenix makes it even better.
•  Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
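
A minimal sketch of embedding the Phoenix JDBC driver in a Java application and issuing a low-latency point lookup. It assumes a table shaped like the us_population example shown later in this deck; the Zookeeper host in the URL is a placeholder for your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Registers the driver explicitly; newer client jars also auto-register via JDBC 4.0.
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

    // The URL names the cluster's Zookeeper quorum; "zk-host1" is a placeholder.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      // A filter on the full primary key becomes a single low-latency HBase row lookup.
      PreparedStatement stmt = conn.prepareStatement(
          "SELECT population FROM us_population WHERE state = ? AND city = ?");
      stmt.setString(1, "CA");
      stmt.setString(2, "San Jose");
      ResultSet rs = stmt.executeQuery();
      if (rs.next()) {
        System.out.println("population = " + rs.getLong(1));
      }
    }
  }
}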

Page 16: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix: Current Capabilities

Feature                                Supported?
Common SQL Datatypes                   Yes
Inserts and Updates                    Yes
SELECT, DISTINCT, GROUP BY, HAVING     Yes
NOT NULL and Primary Key constraints   Yes
Inner and Outer JOINs                  Yes
Views                                  Yes
Subqueries                             HDP 2.2
Robust Secondary Indexes               HDP 2.2

Page 17: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix: Future Capabilities

Feature                            Supported?
Multi-Table Transactions           Future
Scalable Joins (Fact-to-Fact)      Future
Analytics, Windowing Functions     Future

Page 18: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix Provides Familiar SQL Constructs

Compare: Phoenix versus Native API

Code:

// HBase Native API.
HBaseAdmin hbase = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("us_population");
HColumnDescriptor state = new HColumnDescriptor("state".getBytes());
HColumnDescriptor city = new HColumnDescriptor("city".getBytes());
HColumnDescriptor population = new HColumnDescriptor("population".getBytes());
desc.addFamily(state);
desc.addFamily(city);
desc.addFamily(population);
hbase.createTable(desc);

// Phoenix DDL.
CREATE TABLE us_population (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city));

Notes:
•  Familiar SQL syntax.
•  Provides additional constraint checking.
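
Continuing the comparison, here is a hedged sketch of writing to and aggregating over the us_population table through the Phoenix JDBC driver. The sample values and the Zookeeper host are placeholders; note that Phoenix uses UPSERT for both inserts and updates, and the write is only visible once the connection commits (auto-commit here).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class UsPopulationSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      conn.setAutoCommit(true);

      // UPSERT inserts the row, or updates it if the primary key already exists.
      PreparedStatement upsert =
          conn.prepareStatement("UPSERT INTO us_population VALUES (?, ?, ?)");
      upsert.setString(1, "CA");
      upsert.setString(2, "San Jose");
      upsert.setLong(3, 945942L);
      upsert.executeUpdate();

      // Plain SQL aggregation; Phoenix pushes the work to its server-side coprocessors.
      Statement stmt = conn.createStatement();
      ResultSet rs = stmt.executeQuery(
          "SELECT state, SUM(population) AS total FROM us_population GROUP BY state");
      while (rs.next()) {
        System.out.println(rs.getString("state") + " -> " + rs.getLong("total"));
      }
    }
  }
}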

Page 19: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix: Architecture

[Diagram: a user application (a Java application embedding the Phoenix JDBC Driver) talks to an HBase cluster in which each RegionServer runs the Phoenix coprocessor.]

Page 20: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix Performance

Performance characterization:
•  Suitable for tens of thousands of point lookups per second.
•  Suitable for thousands of aggregations / filtered searches per second.
•  Supports extremely high concurrency.

Performance optimizations:
•  Column skipping.
•  Table salting (see the sketch below).
•  Skip scans.

Performance characteristics:
•  Indexed point lookups in milliseconds.
•  Aggregation and Top-N queries in a few seconds over large datasets.
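
As an illustration of table salting, the sketch below creates a salted table through JDBC using Phoenix's SALT_BUCKETS table option. The table and column names are hypothetical, and the Zookeeper host is a placeholder.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SaltedTableSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      Statement stmt = conn.createStatement();
      // SALT_BUCKETS prepends a one-byte hash to every rowkey, spreading otherwise
      // monotonically increasing keys (e.g. time-ordered data) across regions.
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS metrics ("
        + " host VARCHAR NOT NULL,"
        + " metric_time DATE NOT NULL,"
        + " val DOUBLE"
        + " CONSTRAINT pk PRIMARY KEY (host, metric_time))"
        + " SALT_BUCKETS = 16");
    }
  }
}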

Page 21: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix Use Cases

Phoenix is for:
•  Rapidly and easily building an application backed by HBase.
•  Making use of your existing SQL skills and investment.
•  High-performing aggregations of moderately-sized datasets inside HBase.

Phoenix is not for:
•  Sophisticated SQL queries involving large joins or advanced SQL features.
•  Queries requiring large scans that do not use indexes.
•  ETL.

Page 22: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix: Futures

Short-term focus:
•  Transactions.
•  Scalable joins.
•  Analytical capabilities.

Long-term focus: become the primary interface for HBase.
•  Build HBase applications using Phoenix.
•  Configure cluster security and replication using Phoenix.
•  Integration with BI tools like MicroStrategy.

Page 23: Hortonworks Technical Workshop: HBase and Apache Phoenix


What’s New in Apache Phoenix

Page 24: Hortonworks Technical Workshop: HBase and Apache Phoenix


What’s New in Apache Phoenix

Phoenix in HDP 2.2
•  Based on Apache Phoenix 4.2.
•  8 new features, 143 total improvements and fixes.

Notable new features:
•  Robust secondary indexes.
•  Sub-joins.
•  Basic window functions.
•  Bulk loader improvements.

Page 25: Hortonworks Technical Workshop: HBase and Apache Phoenix


Robust Secondary Index

Background / Refresher
•  Phoenix supports local and global secondary indexes.
•  Updating a global index may require coordination with another RegionServer.
•  See the Phoenix docs if you need info on which to use when.

Before Phoenix 4.1 (HDP 2.1):
•  Using global indexes, if the RegionServer serving the index key was down, RegionServers would abort.
•  Note: does not affect local indexes.

Phoenix 4.1+:
•  If the global index cannot be updated:
   •  The index is temporarily disabled.
   •  A background job is launched to rebuild the index.
   •  Reads will go directly to base tables rather than accessing the index.
   •  Writes will continue to update the index.
•  Controlled by: phoenix.index.failure.handling.rebuild

Page 26: Hortonworks Technical Workshop: HBase and Apache Phoenix


Improved SQL: Sub-Joins

Example:

select * from A
  left join (B join C on B.bc_id = C.bc_id)
  on A.ab_id = B.ab_id and A.ac_id = C.ac_id;

Caveats related to joins still apply:
•  Still broadcast joins only.

Page 27: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix: Basic Window Functions

FIRST_VALUE, LAST_VALUE, NTH_VALUE
•  No OVER or PARTITION BY.
•  The function is applied to each group produced by GROUP BY.

Example:

SELECT
  FIRST_VALUE("column1")
  WITHIN GROUP (ORDER BY column2 ASC)
FROM
  table
GROUP BY
  column3;

Page 28: Hortonworks Technical Workshop: HBase and Apache Phoenix


ENCODE, DECODE

DECODE
•  Supports hexadecimal format.
   DECODE('000000008512af277ffffff8', 'hex')

ENCODE
•  Supports hexadecimal and Base62.
   ENCODE(1, 'base62')

What is Base62?
•  Used to encode data using only letters and numbers.
•  Commonly used for things like URL shorteners.

Page 29: Hortonworks Technical Workshop: HBase and Apache Phoenix


Demo: Phoenix Secondary Indexes

Page 30: Hortonworks Technical Workshop: HBase and Apache Phoenix


Secondary Index Recap

Index Management via JDBC (a hedged sketch follows below):
•  CREATE INDEX my_index ON my_table (v1);
•  DROP INDEX my_index ON my_table;
•  ALTER INDEX my_index ON my_table DISABLE / REBUILD;
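
A minimal sketch of issuing the index DDL and an index-backed query from Java over the Phoenix JDBC driver; my_table, my_index and v1 come from the bullets above, while the query value, column type (VARCHAR assumed) and Zookeeper host are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SecondaryIndexSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      Statement stmt = conn.createStatement();

      // Index the v1 column so lookups by v1 no longer require a full table scan.
      stmt.execute("CREATE INDEX my_index ON my_table (v1)");

      // A query filtering on the indexed column can be served from the index table.
      PreparedStatement query =
          conn.prepareStatement("SELECT v1 FROM my_table WHERE v1 = ?");
      query.setString(1, "some-value");
      ResultSet rs = query.executeQuery();
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}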

Index population during bulk import:
•  Uses the CsvBulkLoadTool utility (not psql.py).
•  Add the --index-table argument to specify your target index.

HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf \
hadoop jar phoenix-4.0.0.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE --input /data/example.csv