hbase - cleveland state universitycis.csuohio.edu/~sschung/cis612/lecturenotes_hbasefinal.pdf ·...

HBase CIS 612 SUNNIE CHUNG


Page 1: HBase - Cleveland State Universitycis.csuohio.edu/~sschung/cis612/LectureNotes_HBaseFinal.pdf · HBase Architecture HBase is a distributed database, designed to run on a cluster of

HBase CIS 612

SUNNIE CHUNG


What is HBase

HBase is a column-oriented database management system that runs on top of HDFS.

HBase is designed to provide random, real-time read/write access to Big Data (tables with billions of rows and millions of columns).

An RDBMS is a better fit for data with fewer than a few million rows.

HBase does not support a structured query language like SQL (it is a NoSQL database).

HBase applications are written in Java, much like a typical MapReduce application.


HBase Architecture

HBase is a distributed database, designed to run on a cluster of

servers.

HBase is a column-oriented data store. This makes certain data

access patterns less expensive than with relational database

systems.

There is a single HBase master node and multiple region

servers.

HBase tables are partitioned into multiple regions with each

region storing a range of the table’s rows, and multiple regions

are assigned by the master to a region server.

HBase uses ZooKeeper to manage region assignments to region servers, and to recover from region server crashes by loading the crashed server's regions onto other functioning region servers.


HBase Architecture

Regions contain a MemStore (in-memory data store) and HFiles, and all regions on a region server share a reference to the write-ahead log (WAL), which is used to store new data.

Each region holds a specific range of row keys. When a region exceeds its size limit, HBase scales automatically by splitting the region into child regions.

As a table grows, more regions are created. When a client requests a specific row key, HBase lets the client communicate directly with the region server that holds that key. This design minimizes disk seeks compared with a traditional RDBMS.

Clients interact with HBase via one of several available APIs, including the native Java API, a REST-based interface, and Apache Thrift or Avro interfaces.
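The row-key routing described on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: the class RegionDirectory, its methods, and the server names are hypothetical and are not part of the HBase API. It shows the idea that a region is identified by its start key and that the region for a row is the one with the greatest start key not exceeding the row key.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: maps each region's start row key to the server hosting it.
class RegionDirectory {
    // Sorted by start key; the empty string is the start key of the first region.
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    void assign(String startKey, String server) {
        startKeyToServer.put(startKey, server);
    }

    // The region holding rowKey is the one with the greatest start key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> region = startKeyToServer.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers " + rowKey);
        }
        return region.getValue();
    }

    // Splitting adds a new start key; the lower half keeps the existing entry.
    void split(String splitKey, String serverForUpperHalf) {
        startKeyToServer.put(splitKey, serverForUpperHalf);
    }
}
```

A client that caches such a directory can send each request straight to the right server, which is why the master does not sit on the data path.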


HBase: Data Model

http://hbase.apache.org/book/datamodel.html#conceptual.view

http://hbase.apache.org/

http://hbase.apache.org/book/architecture.html#arch.overview


Data Model: HBase Table

An HBase system comprises a set of tables.

A table contains rows and columns.

Each table has a primary key (the row key), which is used to access data in that table.

An HBase column represents an attribute of an object.

Example of a column: timestamp

HBase groups attributes together in column families such that the elements of a column family are all stored together.

This differs from a row-oriented relational database, where all the columns of a given row are stored together.

The table schema must be predefined and must specify the column families.

The schema is flexible: new columns can be added to families at any time.
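The difference between fixed column families and free-form columns can be shown with a small in-memory model. This is a sketch, not HBase code: SketchTable and its methods are hypothetical names chosen for illustration. Families are declared once in the constructor, while any qualifier is accepted inside a declared family.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a table whose column families are fixed at creation time.
class SketchTable {
    private final Set<String> families;
    // row key -> "family:qualifier" -> value
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    SketchTable(String... families) {
        this.families = new HashSet<>(Arrays.asList(families));
    }

    // Any qualifier is accepted, but the family must have been declared up front.
    void put(String rowKey, String family, String qualifier, String value) {
        if (!families.contains(family)) {
            throw new IllegalArgumentException("unknown column family: " + family);
        }
        rows.computeIfAbsent(rowKey, k -> new HashMap<>())
            .put(family + ":" + qualifier, value);
    }

    // Returns null when the row or the column is absent.
    String get(String rowKey, String family, String qualifier) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }
}
```

For example, a table created as new SketchTable("info") accepts info:name and info:email without any schema change, but rejects a write to an undeclared family such as stats.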


Data Model: HBase Table

A table consists of column families.

HBase is really a map from keys to values.

It is a multi-dimensional map on row key, column family:qualifier, and timestamp.

Distributed: data is spread across multiple nodes.

Sparse: a row can have any number of columns in each column family, or none at all.

Sorted: unlike most map implementations, HBase/BigTable keeps the key/value pairs in strict lexicographic (byte) order of the key.

A B+ tree index can be built on the sorted key.
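The "sorted" property is easy to misread as numeric or natural ordering, so a small sketch may help. It uses a java.util.TreeMap as a stand-in for a table (an assumption made purely for illustration; SortedRowsSketch is a hypothetical class) and shows that keys sort character by character and that a range of adjacent keys can be read with a start row and a stop row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class SortedRowsSketch {
    // Returns the row keys in [startRow, stopRow), like a scan with start and stop rows.
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, stopRow).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        for (String key : new String[] {"row2", "row10", "row1", "com.example.www", "com.cnn.www"}) {
            table.put(key, "value");
        }
        // Keys come back in lexicographic order, so "row10" sorts before "row2".
        System.out.println(table.keySet());
        // '/' is the character right after '.', so this range covers every key starting with "com.".
        System.out.println(scan(table, "com.", "com/"));
    }
}
```

Running main prints [com.cnn.www, com.example.www, row1, row10, row2] followed by [com.cnn.www, com.example.www]. This is also why the reversed domain name "com.cnn.www" is a useful row key: pages of the same domain end up next to each other.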


Data Model: HBase Column Family

A column family is like a parent element in XML.

All column members of a column family have the same prefix.

Example for the www.cnn.com website:

the columns Anchor:si.cnn.com and Anchor:cnnet.cnn.com are both members of the Anchor column family

Family = Anchor, qualifier = si.cnn.com

The colon character (:) delimits the column family from the column qualifier.

Columns can vary from row to row within a table.


Row Key            Time Stamp  ColumnFamily contents        ColumnFamily anchor            ColumnFamily people
"com.cnn.www"      t9                                       anchor:cnnsi.com = "CNN"
"com.cnn.www"      t8                                       anchor:my.look.ca = "CNN.com"
"com.cnn.www"      t6          contents:html = "<html>..."
"com.cnn.www"      t5          contents:html = "<html>..."
"com.cnn.www"      t3          contents:html = "<html>..."
"com.example.www"  t5          contents:html = "<html>..."                                 people:author = "John Doe"


Data Model: HBase Cell, Timestamp

Cell:

A cell is a combination of row, column family, and column qualifier. It contains a value and a timestamp, which represents the value's version.

A cell's value is an uninterpreted array of bytes.

Timestamp:

A timestamp is written alongside each value and identifies a given version of that value.

By default it represents the time on the region server when the data was written.

A cell can have multiple versions of a value, differentiated by timestamp.
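A cell with several timestamped versions can be modeled as a small sorted map. The following is a simplified sketch: VersionedCell is a hypothetical class, values are strings rather than byte arrays, and old versions are dropped immediately on write, which is an assumption made to keep the example short.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical cell that keeps only the newest maxVersions values, keyed by timestamp.
class VersionedCell {
    private final int maxVersions;
    // Ordered with the newest timestamp first.
    private final TreeMap<Long, String> versions =
        new TreeMap<Long, String>(Collections.reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // the last entry is the oldest version
        }
    }

    // Most recent value, or null if the cell is empty.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Newest value whose timestamp is <= the given timestamp, or null if none exists.
    String asOf(long timestamp) {
        Map.Entry<Long, String> entry = versions.ceilingEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }
}
```

With maxVersions set to 3 and writes at timestamps 3, 5, and 6, latest() returns the value written at 6 and asOf(4) returns the value written at 3. A fourth write pushes out the version written at 3.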


Multi-Dimensional Map

{
  "com.cnn.www": {
    contents: {
      t6: contents:html: "<html>..."
      t5: contents:html: "<html>..."
      t3: contents:html: "<html>..."
    }
    anchor: {
      t9: anchor:cnnsi.com: "CNN"
      t8: anchor:my.look.ca: "CNN.com"
    }
    people: {}
  }
  "com.example.www": {
    contents: {
      t5: contents:html: "<html>..."
    }
    anchor: {}
    people: {
      t5: people:author: "John Doe"
    }
  }
}
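The same nested structure can be written down as nested sorted maps in Java. This is a minimal sketch, assuming string values and a hypothetical class name MultiDimMap. It nests the qualifier above the timestamp, which is one possible arrangement of the dimensions shown in the map above.

```java
import java.util.Collections;
import java.util.TreeMap;

class MultiDimMap {
    // row key -> column family -> qualifier -> timestamp (newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> rows =
        new TreeMap<>();

    void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<Long, String>(Collections.reverseOrder()))
            .put(timestamp, value);
    }

    // Latest value of a cell, or null when the row, family, or qualifier is absent.
    String get(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> families = rows.get(row);
        if (families == null) {
            return null;
        }
        TreeMap<String, TreeMap<Long, String>> qualifiers = families.get(family);
        if (qualifiers == null) {
            return null;
        }
        TreeMap<Long, String> versions = qualifiers.get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}
```

Absent families and qualifiers simply have no entry, which is what "sparse" means in this model: the empty people family of "com.cnn.www" costs nothing to store.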


HBase Write (Insert)

Data is written to disk in HDFS.

Do not route data through the master node: with several terabytes of data, the master would be overwhelmed.

Instead, write directly to the region server that hosts the target region. The region server writes the data to HDFS automatically.

Data is consistent because every client must talk to the same region server to get the data for a given row.

HBase does not replicate data itself; it leaves that to the distributed file system (HDFS).

HBase uses Hadoop security.


ACID in HBase: Write

HBase achieves ACID (at the level of a single row) by ensuring that all transactions are committed serially.

A write transaction proceeds as follows:

It starts by retrieving the highest transaction number, called the WriteNumber.

Then the row is locked to prevent concurrent writes.

The changes are applied to the write-ahead log and then to the MemStore.

Each MemStore entry is a (Key, Value) pair that carries the transaction's WriteNumber and timestamp.

The transaction is committed, and finally the row is unlocked.
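The write steps listed on this slide can be traced in a small single-region model. This is a sketch under heavy simplifying assumptions, not HBase's implementation: WritePathSketch and its fields are hypothetical, the log is an in-memory list, and values are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical single-region model of the write steps described above.
class WritePathSketch {
    private final AtomicLong lastWriteNumber = new AtomicLong(0);
    private final Map<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();
    final List<String> wal = new ArrayList<>();                           // write-ahead log
    final Map<String, String> memstore = new ConcurrentHashMap<>();       // row -> latest value
    final Map<String, Long> memstoreWriteNumber = new ConcurrentHashMap<>();

    long write(String row, String value) {
        long writeNumber = lastWriteNumber.incrementAndGet();   // 1. obtain a WriteNumber
        ReentrantLock lock = rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
        lock.lock();                                            // 2. lock the row
        try {
            synchronized (wal) {
                wal.add(writeNumber + " " + row + "=" + value); // 3. append to the log first
            }
            memstore.put(row, value);                           // 4. then update the MemStore
            memstoreWriteNumber.put(row, writeNumber);
            return writeNumber;                                 // 5. the write is committed
        } finally {
            lock.unlock();                                      // 6. unlock the row
        }
    }
}
```

Logging before touching the MemStore is the important ordering: if the server crashes after the log append, the change can be replayed from the log.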


HBase Read (Get) Data

The client talks to ZooKeeper to find the region server that serves the data it needs.

ZooKeeper points the client to the region server.

The HBase master is not involved in reading data.


ACID in HBase: Read

A read transaction proceeds as follows:

It starts by retrieving the last committed transaction number, called the ReadPoint.

The scanner is opened, and the transaction filters every scanned KeyValue by comparing its MemStore timestamp with the ReadPoint, so that only committed data is visible.

Finally, it closes the scanner.

HBase keeps a list of uncommitted transactions, and the commit of each transaction is delayed until all prior transactions have committed.
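The ReadPoint filter amounts to one comparison per scanned entry. The sketch below assumes each stored entry is tagged with the WriteNumber of the transaction that wrote it; ReadPointSketch and Entry are hypothetical names and the storage is a plain list.

```java
import java.util.ArrayList;
import java.util.List;

class ReadPointSketch {
    // A stored entry tagged with the WriteNumber of the transaction that wrote it.
    static final class Entry {
        final String row;
        final String value;
        final long writeNumber;

        Entry(String row, String value, long writeNumber) {
            this.row = row;
            this.value = value;
            this.writeNumber = writeNumber;
        }
    }

    // Returns the entries visible to a reader whose ReadPoint is readPoint.
    static List<String> scan(List<Entry> entries, long readPoint) {
        List<String> visible = new ArrayList<>();
        for (Entry e : entries) {
            if (e.writeNumber <= readPoint) { // skip writes newer than the ReadPoint
                visible.add(e.row + "=" + e.value);
            }
        }
        return visible;
    }
}
```

A reader that started with ReadPoint 2 never sees an entry written by transaction 3, even if that entry is already in the MemStore when the scan reaches it.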


Join and Index in HBase

HBase can hold multiple big tables.

There is no join operator for tables.

A join between big tables has to be implemented by the user.

A table is sorted on its row key.

A B+ tree index is built on the sorted row key.

There are no secondary indexes and no multi-row transactions.

Tables cannot be ported directly to or from an RDBMS.
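Because both tables are sorted on their row keys, a user-implemented join can be a single merge pass. The sketch below uses two TreeMap instances as stand-ins for scans over two tables; ClientSideJoin is a hypothetical helper and not something HBase provides.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class ClientSideJoin {
    // Inner join on row key of two inputs that are already sorted by row key.
    static Map<String, String[]> join(TreeMap<String, String> left, TreeMap<String, String> right) {
        Map<String, String[]> out = new LinkedHashMap<>();
        Iterator<Map.Entry<String, String>> li = left.entrySet().iterator();
        Iterator<Map.Entry<String, String>> ri = right.entrySet().iterator();
        Map.Entry<String, String> l = li.hasNext() ? li.next() : null;
        Map.Entry<String, String> r = ri.hasNext() ? ri.next() : null;
        while (l != null && r != null) {
            int cmp = l.getKey().compareTo(r.getKey());
            if (cmp == 0) {
                // Same row key on both sides: emit the pair and advance both cursors.
                out.put(l.getKey(), new String[] {l.getValue(), r.getValue()});
                l = li.hasNext() ? li.next() : null;
                r = ri.hasNext() ? ri.next() : null;
            } else if (cmp < 0) {
                l = li.hasNext() ? li.next() : null; // left key is smaller, advance left
            } else {
                r = ri.hasNext() ? ri.next() : null; // right key is smaller, advance right
            }
        }
        return out;
    }
}
```

Each input is read once and in order, so the approach fits sequential scans. It only works when the join attribute is the row key of both tables; joining on any other column needs a different strategy, such as a MapReduce job.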


HBase Installation

Download the stable HBase release (hbase-0.94.17.tar.gz) from

http://www.apache.org/dyn/closer.cgi/hbase/

Start HBase: $ ./bin/start-hbase.sh

$ start-hbase.sh
localhost: starting zookeeper, logging to /Users/nqt289/hbase-0.92.2/logs/hbase-nqt289-zookeeper-Thuats-MacBook-Pro.local.out
starting master, logging to /Users/nqt289/hbase-0.92.2/logs/hbase-nqt289-master-Thuats-MacBook-Pro.local.out
localhost: starting regionserver, logging to /Users/nqt289/hbase-0.92.2/logs/hbase-nqt289-regionserver-Thuats-MacBook-Pro.local.out

$ jps
1202 HRegionServer
1051 HQuorumPeer
1118 HMaster
475 NameNode
1267 Jps
544 DataNode
735 TaskTracker
667 JobTracker
617 SecondaryNameNode

Confirm HBase is running via http://localhost:60010/


Starting HBase

Connect to running HBase via shell: $ ./bin/hbase shell

$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 0.92.2, r1379292, Fri Aug 31 13:13:53 UTC 2012


HBase Syntax

Type 'help' followed by a return to get a listing of commands.

HBase belongs to the NoSQL category, so its shell syntax does not look like SQL. It has its own set of commands and queries.

Test HBase by creating a table called "users" with a single column family "info", then insert some values, scan the whole table, and get a single row.

hbase(main):003:0> create 'users', 'info'
0 row(s) in 0.1010 seconds

Get table details:

hbase(main):005:0> describe 'users'
DESCRIPTION                                          ENABLED
 {NAME => 'users', FAMILIES => [{NAME => 'info', BLO true
 OMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSI
 ONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS =>
 '0', TTL => '2147483647', BLOCKSIZE => '65536', IN_
 MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0150 seconds


HBase Client API

HBase is written in Java, and its native API is a Java API.

Other languages, such as Ruby and C++, can also be used to access HBase.

Users can use the client API directly or access it through a proxy that translates requests into API calls.

REST, Thrift, and Avro are examples of popular HBase client gateways.

REST supports web-based infrastructure.

Thrift and Avro are used when throughput matters.

The HBase API supports the complete set of CRUD operations for working with data in HBase.

The primary client interface to HBase is the HTable class in the org.apache.hadoop.hbase.client package.

Pig and Hive can run on top of HBase.


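Example

// Example code (HTable-based client API, as in the 0.9x releases used in these notes)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration config = HBaseConfiguration.create();

// Instantiate an HTable object for the existing table "User"
HTable table = new HTable(config, "User");

// Build a Put for row key "row1"; the arguments of add() are (column family, qualifier, value)
Put p = new Put(Bytes.toBytes("row1"));
p.add(Bytes.toBytes("Id"), Bytes.toBytes("col1"), Bytes.toBytes("Emp1"));
p.add(Bytes.toBytes("Name"), Bytes.toBytes("col2"), Bytes.toBytes("Archana"));

// Send the Put to the table, then release the table's resources
table.put(p);
table.close();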


Other important classes

Because HBase is based on Hadoop, these Hadoop classes are also important:

InputFormat: deals with the input data, splitting it and iterating through the records.

Mapper: processes each record using the map() method.

Reducer: processes the output of the Mapper.


HBase Input

HBase takes structured data as input, i.e., data shaped like typical relational database tables.

Such data may be created by Hive from unstructured data or generated by Java programs.


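Input Example

A Java program to populate the table 'access_logs' with 10 rows of random data:

// rand, maxID, pages, and htable are assumed to be initialized earlier in the program
for (int i = 0; i < 10; i++)
{
    // Pick a random user ID in the range 1..maxID
    int userID = rand.nextInt(maxID) + 1;

    // Row key = bytes of userID followed by bytes of the loop counter
    byte[] rowkey = Bytes.add(Bytes.toBytes(userID), Bytes.toBytes(i));

    String randomPage = pages[rand.nextInt(pages.length)];

    // Store the page in column details:page of this row
    Put put = new Put(rowkey);
    put.add(Bytes.toBytes("details"), Bytes.toBytes("page"), Bytes.toBytes(randomPage));
    htable.put(put);
}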
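The composite row key built by Bytes.add(Bytes.toBytes(userID), Bytes.toBytes(i)) is eight bytes long: two 4-byte big-endian integers back to back. The sketch below reproduces that layout with java.nio.ByteBuffer so it runs without HBase on the classpath. CompositeRowKey is a hypothetical helper, and the comparison method mimics the unsigned byte-by-byte order in which row keys are sorted.

```java
import java.nio.ByteBuffer;

class CompositeRowKey {
    // 4-byte big-endian userId followed by a 4-byte big-endian sequence number.
    static byte[] encode(int userId, int seq) {
        return ByteBuffer.allocate(8).putInt(userId).putInt(seq).array();
    }

    static int userIdOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getInt(0);
    }

    static int seqOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getInt(4);
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

Putting the user ID first means all rows of one user sort next to each other, ordered by the sequence number, so a scan over one user's records touches a contiguous key range. This holds for non-negative IDs such as the ones generated above; negative integers would sort after positive ones in unsigned byte order.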


Example

Confirm table data:

hbase(main):003:0> scan 'access_logs'
ROW     COLUMN+CELL
 row0   column=Id:col1, timestamp=1396344825092, value=user3
 row0   column=Page:col2, timestamp=1396344825092, value=/b.html
 row1   column=Id:col1, timestamp=1396344825095, value=user1
 row1   column=Page:col2, timestamp=1396344825095, value=/c.html
 row2   column=Id:col1, timestamp=1396344825096, value=user3
 row2   column=Page:col2, timestamp=1396344825096, value=/
 row3   column=Id:col1, timestamp=1396344825097, value=user2
 row3   column=Page:col2, timestamp=1396344825099, value=user2


HBase Output

The outputs of an HBase process are also structured data, i.e., tables like those in a relational database, or files with a particular data model structure.

Tables in HBase are meant to support a large number of rows and are column-oriented, unlike traditional relational tables, which are row-oriented.

Each table must have a primary key.


Output Example

Example of a Java program that takes a table of log records as input and finds the frequency of each record using MapReduce:

TableMapReduceUtil.initTableMapperJob("access_logs", scan, Mapper1.class, ImmutableBytesWritable.class, IntWritable.class, job);

TableMapReduceUtil.initTableReducerJob("summary_user", Reducer1.class, job);

The table summary_user is created to store the result; the input table is access_logs.

stats : key : 1, count : 3
stats : key : 2, count : 5
stats : key : 3, count : 2
...
13/04/14 21:10:26 INFO mapred.JobClient: map 100% reduce 100%
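The bodies of Mapper1 and Reducer1 are not shown in these notes. The counting logic such a job performs can be sketched without Hadoop: the "map" step emits a 1 for the user ID of each log record and the "reduce" step sums the ones per user ID. FrequencySketch is a hypothetical stand-in and makes no claim about how the original classes were written.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FrequencySketch {
    // "Map" emits (userId, 1) per log record; "reduce" sums the ones for each userId.
    static Map<Integer, Integer> countByUser(List<Integer> userIdPerRecord) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int userId : userIdPerRecord) {
            counts.merge(userId, 1, Integer::sum);
        }
        return counts;
    }
}
```

Ten records distributed as three for user 1, five for user 2, and two for user 3 produce the same counts as the stats lines above.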
