Hortonworks Technical Workshop: HBase and Apache Phoenix


Page 1: Hortonworks Technical Workshop: HBase and Apache Phoenix


SQL on HBase with Phoenix

Page 2: Hortonworks Technical Workshop: HBase and Apache Phoenix


Agenda

What Is Apache HBase
•  High-Level Overview.
•  Technical Detail.

What Is Apache Phoenix
•  Overview.
•  What’s New.
•  Secondary Index Demo.

Page 3: Hortonworks Technical Workshop: HBase and Apache Phoenix


New Data Requires a New Data Architecture

Source: IDC

•  2.8 ZB of data in 2012; 40 ZB projected by 2020.
•  85% from new data types; 15x growth in machine data by 2020.
•  Existing sources: OLTP, ERP, CRM systems.
•  New data types: unstructured documents, emails, clickstream, server logs, sentiment and web data, sensor and machine data, geolocation.

A modern database needs to be more scalable, handle new data types, and be intelligent and predictive.

Page 4: Hortonworks Technical Workshop: HBase and Apache Phoenix


What Is Apache HBase?

•  100% open source.
•  Store and process petabytes of data.
•  Flexible schema.
•  Scale out on commodity servers.
•  High performance, high availability.
•  Integrated with YARN.
•  SQL and NoSQL interfaces.

[Diagram: HBase RegionServers 1 through N run on YARN (the data operating system), with HDFS as permanent data storage.]

Dynamic schema. Scales horizontally to PBs of data. Directly integrated with Hadoop.

Page 5: Hortonworks Technical Workshop: HBase and Apache Phoenix


Kinds of Apps Built with HBase

Interested? See HBase Case Studies later in this document.

•  Write-heavy, low-latency workloads
•  Search / indexing
•  Messaging
•  Audit / log archive
•  Advertising
•  Data cubes
•  Time series
•  Sensor / device data

Page 6: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase is Deeply Integrated with Hadoop

•  Data is stored in HDFS. You can store more data and re-use existing HDFS expertise.
•  HBase is integrated with YARN.
•  Analytics in-place using Hive, Pig, Spark and more.

Page 7: Hortonworks Technical Workshop: HBase and Apache Phoenix


Who’s Using HBase?

Page 8: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase Technical Details

Spring 2014 Version 1.0

Page 9: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase Technical Details

Based on Google BigTable
•  Dynamic schema.
•  Good for very sparse datasets.
•  All data is range-partitioned for trivial horizontal scaling across commodity hardware.

Directly integrated with HDFS and Hadoop
•  Analyze data in HBase with any Hadoop ecosystem tool (Hive, Pig, MapReduce, Tez, etc.).
•  Re-use existing Hadoop skills to run HBase (see the client sketch below).
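
To make the point concrete, here is a minimal write/read sketch using the classic HBase Java client API of the same vintage as the native-API example later in this deck. The table, column family and qualifier names are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "web_metrics");     // hypothetical, pre-created table

    // No column is declared anywhere: any qualifier may be written under family "cf".
    Put put = new Put(Bytes.toBytes("page#/index.html"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("hits"), Bytes.toBytes(42L));
    table.put(put);

    // Point read of the same row.
    Get get = new Get(Bytes.toBytes("page#/index.html"));
    Result result = table.get(get);
    long hits = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("hits")));
    System.out.println("hits = " + hits);

    table.close();
  }
}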

Page 10: Hortonworks Technical Workshop: HBase and Apache Phoenix


Page 11: Hortonworks Technical Workshop: HBase and Apache Phoenix


Logical Architecture: distributed, persistent partitions of a BigTable

[Diagram: Table A's rowkeys (a through p) are range-partitioned into Region 1 (a-d), Region 2 (e-h), Region 3 (i-l) and Region 4 (m-p). The regions are spread across the cluster: Region Server 7 hosts Table A Regions 1 and 2 along with Table G Region 1070 and Table L Region 25; Region Server 86 hosts Table A Region 3 along with Table C Region 30 and Table F Regions 160 and 776; Region Server 367 hosts Table A Region 4 along with Table C Region 17, Table E Region 52 and Table P Region 1116.]

Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.

Page 12: Hortonworks Technical Workshop: HBase and Apache Phoenix


Logical Data Model: a sparse, multi-dimensional, sorted map

Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value (see the client sketch below).
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.

[Diagram: sample contents of Table A. Row "a" holds several versioned values under column family cf1 (qualifiers such as "foo" and "bar", with values like "hello" and "world" at different timestamps) and under cf2 (e.g. "fourth of July", "almost the loneliest number"). Row "b" holds a "thumb" qualifier in cf2 whose value is 3.6 kb of PNG data. Labels: rowkey, column family, column qualifier, timestamp, value.]
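
To make the versioned-cell model concrete, here is a minimal sketch with the classic HBase Java client: it writes two versions of the same cell and reads them back with their timestamps. The table, family and qualifier names are hypothetical, and the column family is assumed to keep multiple versions (VERSIONS >= 2).

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedCellSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "table_a");   // hypothetical table with family cf1, VERSIONS >= 2

    // Two writes to the same rowkey / family / qualifier create two versions.
    Put first = new Put(Bytes.toBytes("a"));
    first.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("hello"));
    table.put(first);

    Put second = new Put(Bytes.toBytes("a"));
    second.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("world"));
    table.put(second);

    // Ask for up to 3 versions of the cell; results come back newest first.
    Get get = new Get(Bytes.toBytes("a"));
    get.setMaxVersions(3);
    Result result = table.get(get);

    List<Cell> cells = result.getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("foo"));
    for (Cell cell : cells) {
      System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
    }
    table.close();
  }
}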

Page 13: Hortonworks Technical Workshop: HBase and Apache Phoenix


HBase HA Overview (Introduced in HDP 2.1)

[Diagram: clients coordinate through the HMaster and Zookeeper and read/write against two RegionServers. One RegionServer hosts Region 0-99 as primary plus standby replicas of Regions 100-199 and 200-299; the other hosts Regions 100-199 and 200-299 as primaries plus a standby replica of Region 0-99. Each RegionServer keeps an in-memory cache for low-latency reads and writes, and all HFiles are stored in HDFS. Callouts: HBase HA uses real-time replication; data is stored to HDFS; Hadoop tools (Hive, Pig, MapReduce) can read or write directly; replica placement is aware of cluster topology.]
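
As a hedged illustration of how a client opts into reading from a standby replica, the sketch below uses the Apache HBase read-replica client API (Consistency.TIMELINE, available in newer HBase client versions). The table name is hypothetical and is assumed to have been created with region replication enabled.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("ha_table"))) {  // hypothetical table

      Get get = new Get(Bytes.toBytes("row-0042"));
      // TIMELINE consistency lets the read be answered by a standby region replica
      // when the primary is slow or unavailable, trading strict freshness for availability.
      get.setConsistency(Consistency.TIMELINE);

      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
      // isStale() reports whether the answer came from a possibly lagging replica.
      System.out.println("stale=" + result.isStale() + " value=" + Bytes.toString(value));
    }
  }
}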

Page 14: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix

Spring 2014 Version 1.0

The SQL Skin for HBase

Page 15: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix: A SQL Skin for HBase
•  Provides a SQL interface for managing data in HBase.
•  Supports a large subset of the SQL:1999 mandatory feature set.
•  Create tables, insert and update data, and perform low-latency point lookups through JDBC.
•  The Phoenix JDBC driver is easily embeddable in any app that supports JDBC (see the sketch below).

Phoenix Makes HBase Better
•  Oriented toward online / semi-transactional apps.
•  If HBase is a good fit for your app, Phoenix makes it even better.
•  Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
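
A minimal sketch of embedding the Phoenix JDBC driver in a Java application and issuing a low-latency point lookup. It assumes a table shaped like the us_population example shown later in this deck; the Zookeeper host in the URL is a placeholder for your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Registers the driver explicitly; newer client jars also auto-register via JDBC 4.0.
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

    // The URL names the cluster's Zookeeper quorum; "zk-host1" is a placeholder.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      // A filter on the full primary key becomes a single low-latency HBase row lookup.
      PreparedStatement stmt = conn.prepareStatement(
          "SELECT population FROM us_population WHERE state = ? AND city = ?");
      stmt.setString(1, "CA");
      stmt.setString(2, "San Jose");
      ResultSet rs = stmt.executeQuery();
      if (rs.next()) {
        System.out.println("population = " + rs.getLong(1));
      }
    }
  }
}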

Page 16: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix: Current Capabilities

Feature                                Supported?
Common SQL Datatypes                   Yes
Inserts and Updates                    Yes
SELECT, DISTINCT, GROUP BY, HAVING     Yes
NOT NULL and Primary Key constraints   Yes
Inner and Outer JOINs                  Yes
Views                                  Yes
Subqueries                             HDP 2.2
Robust Secondary Indexes               HDP 2.2

Page 17: Hortonworks Technical Workshop: HBase and Apache Phoenix


Apache Phoenix: Future Capabilities

Feature                            Supported?
Multi-Table Transactions           Future
Scalable Joins (Fact-to-Fact)      Future
Analytics, Windowing Functions     Future

Page 18: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix Provides Familiar SQL Constructs

Compare: Phoenix versus Native API

Code:

// HBase Native API.
HBaseAdmin hbase = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("us_population");
HColumnDescriptor state = new HColumnDescriptor("state".getBytes());
HColumnDescriptor city = new HColumnDescriptor("city".getBytes());
HColumnDescriptor population = new HColumnDescriptor("population".getBytes());
desc.addFamily(state);
desc.addFamily(city);
desc.addFamily(population);
hbase.createTable(desc);

// Phoenix DDL.
CREATE TABLE us_population (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city));

Notes:
•  Familiar SQL syntax.
•  Provides additional constraint checking.
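
Continuing the comparison, here is a hedged sketch of writing to and aggregating over the us_population table through the Phoenix JDBC driver. The sample values and the Zookeeper host are placeholders; note that Phoenix uses UPSERT for both inserts and updates, and the write is only visible once the connection commits (auto-commit here).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class UsPopulationSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      conn.setAutoCommit(true);

      // UPSERT inserts the row, or updates it if the primary key already exists.
      PreparedStatement upsert =
          conn.prepareStatement("UPSERT INTO us_population VALUES (?, ?, ?)");
      upsert.setString(1, "CA");
      upsert.setString(2, "San Jose");
      upsert.setLong(3, 945942L);
      upsert.executeUpdate();

      // Plain SQL aggregation; Phoenix pushes the work to its server-side coprocessors.
      Statement stmt = conn.createStatement();
      ResultSet rs = stmt.executeQuery(
          "SELECT state, SUM(population) AS total FROM us_population GROUP BY state");
      while (rs.next()) {
        System.out.println(rs.getString("state") + " -> " + rs.getLong("total"));
      }
    }
  }
}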

Page 19: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix: Architecture

[Diagram: a user application (a Java application embedding the Phoenix JDBC Driver) talks to an HBase cluster in which each RegionServer runs the Phoenix coprocessor.]

Page 20: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix Performance

Performance characterization:
•  Suitable for tens of thousands of point lookups per second.
•  Suitable for thousands of aggregations / filtered searches per second.
•  Supports extremely high concurrency.

Performance optimizations:
•  Column skipping.
•  Table salting (see the sketch below).
•  Skip scans.

Performance characteristics:
•  Indexed point lookups in milliseconds.
•  Aggregation and Top-N queries in a few seconds over large datasets.
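
As an illustration of table salting, the sketch below creates a salted table through JDBC using Phoenix's SALT_BUCKETS table option. The table and column names are hypothetical, and the Zookeeper host is a placeholder.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SaltedTableSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      Statement stmt = conn.createStatement();
      // SALT_BUCKETS prepends a one-byte hash to every rowkey, spreading otherwise
      // monotonically increasing keys (e.g. time-ordered data) across regions.
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS metrics ("
        + " host VARCHAR NOT NULL,"
        + " metric_time DATE NOT NULL,"
        + " val DOUBLE"
        + " CONSTRAINT pk PRIMARY KEY (host, metric_time))"
        + " SALT_BUCKETS = 16");
    }
  }
}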

Page 21: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix Use Cases

Phoenix is for:
•  Rapidly and easily building an application backed by HBase.
•  Making use of your existing SQL skills and investment.
•  High-performing aggregations of moderately-sized datasets inside HBase.

Phoenix is not for:
•  Sophisticated SQL queries involving large joins or advanced SQL features.
•  Queries requiring large scans that do not use indexes.
•  ETL.

Page 22: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix: Futures

Short-term focus:
•  Transactions.
•  Scalable joins.
•  Analytical capabilities.

Long-term focus: become the primary interface for HBase.
•  Build HBase applications using Phoenix.
•  Configure cluster security and replication using Phoenix.
•  Integration with BI tools like MicroStrategy.

Page 23: Hortonworks Technical Workshop: HBase and Apache Phoenix


What’s New in Apache Phoenix

Page 24: Hortonworks Technical Workshop: HBase and Apache Phoenix


What’s New in Apache Phoenix

Phoenix in HDP 2.2
•  Based on Apache Phoenix 4.2.
•  8 new features, 143 total improvements and fixes.

Notable new features:
•  Robust secondary indexes.
•  Sub-joins.
•  Basic window functions.
•  Bulk loader improvements.

Page 25: Hortonworks Technical Workshop: HBase and Apache Phoenix


Robust Secondary Index

Background / Refresher
•  Phoenix supports local and global secondary indexes.
•  Updating a global index may require coordination with another RegionServer.
•  See the Phoenix docs if you need info on which to use when.

Before Phoenix 4.1 (HDP 2.1):
•  Using global indexes, if the RegionServer serving the index key was down, RegionServers would abort.
•  Note: does not affect local indexes.

Phoenix 4.1+:
•  If the global index cannot be updated:
   •  The index is temporarily disabled.
   •  A background job is launched to rebuild the index.
   •  Reads will go directly to base tables rather than accessing the index.
   •  Writes will continue to update the index.
•  Controlled by: phoenix.index.failure.handling.rebuild

Page 26: Hortonworks Technical Workshop: HBase and Apache Phoenix


Improved SQL: Sub-Joins

Example:

select * from A
  left join (B join C on B.bc_id = C.bc_id)
  on A.ab_id = B.ab_id and A.ac_id = C.ac_id;

Caveats related to joins still apply:
•  Still broadcast joins only.

Page 27: Hortonworks Technical Workshop: HBase and Apache Phoenix


Phoenix: Basic Window Functions

FIRST_VALUE, LAST_VALUE, NTH_VALUE
•  No OVER or PARTITION BY.
•  The function is applied to each group produced by GROUP BY.

Example:

SELECT
  FIRST_VALUE("column1")
  WITHIN GROUP (ORDER BY column2 ASC)
FROM
  table
GROUP BY
  column3;

Page 28: Hortonworks Technical Workshop: HBase and Apache Phoenix


ENCODE, DECODE

DECODE
•  Supports hexadecimal format.
   DECODE('000000008512af277ffffff8', 'hex')

ENCODE
•  Supports hexadecimal and Base62.
   ENCODE(1, 'base62')

What is Base62?
•  Used to encode data using only letters and numbers.
•  Commonly used for things like URL shorteners.

Page 29: Hortonworks Technical Workshop: HBase and Apache Phoenix


Demo: Phoenix Secondary Indexes

Page 30: Hortonworks Technical Workshop: HBase and Apache Phoenix


Secondary Index Recap

Index Management via JDBC (a hedged sketch follows below):
•  CREATE INDEX my_index ON my_table (v1);
•  DROP INDEX my_index ON my_table;
•  ALTER INDEX my_index ON my_table DISABLE / REBUILD;
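
A minimal sketch of issuing the index DDL and an index-backed query from Java over the Phoenix JDBC driver; my_table, my_index and v1 come from the bullets above, while the query value, column type (VARCHAR assumed) and Zookeeper host are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SecondaryIndexSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host1")) {
      Statement stmt = conn.createStatement();

      // Index the v1 column so lookups by v1 no longer require a full table scan.
      stmt.execute("CREATE INDEX my_index ON my_table (v1)");

      // A query filtering on the indexed column can be served from the index table.
      PreparedStatement query =
          conn.prepareStatement("SELECT v1 FROM my_table WHERE v1 = ?");
      query.setString(1, "some-value");
      ResultSet rs = query.executeQuery();
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}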

Index population during bulk import:
•  Uses the CsvBulkLoadTool utility (not psql.py).
•  Add the --index-table argument to specify your target index.

HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf \
hadoop jar phoenix-4.0.0.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE --input /data/example.csv