Breaking with Relational DBMS and Dating with HBase
DESCRIPTION
Session on HBase at the IndicThreads Conference on Java, Dec 2010 http://j10.indicthreads.com/
TRANSCRIPT
1
Gaurav Kohli, Xebia
Breaking with DBMS and Dating with HBase
3
Why are we here?
Something about RDBMS
Limitations of RDBMS
Why HBase or any NoSQL solution
Overview of HBase
Specific use cases
Paradigm shift in schema design
Architecture of HBase
HBase Interface – Java API, Thrift
Conclusion
4
Databases
5
Relational Databases have a lot of
6
Data sets going into petabytes
RDBMS don't scale inherently
  Scale up / scale out (load balancing + replication)
Hard to shard / partition
High read and write throughput together is not possible
  Transactional / analytical databases
Specialized hardware is very expensive
  Oracle clustering
7
Master
Slave
Replication
8
MySQL master becomes a problem
  All slaves must have the same write capacity as the master
  Single point of failure, no easy failover
Master
Reads
Writes
Slave nodes
9
Master Master
Slave
Replication
10
11
12
2006.11 Google releases paper on BigTable
2007.2 Initial HBase prototype created as Hadoop contrib
2007.10 First usable HBase
2008.1 Hadoop becomes Apache top-level project and HBase becomes a subproject
2010.5 HBase becomes Apache top-level project
2010.6 HBase 0.20.5 released
2010.10 HBase 0.89.2010092 – third developer release
13
Distributed: uses HDFS for storage
Column-oriented
Multi-dimensional (versions)
High-Availability
High-Performance
Storage System
14
Not a SQL database
  No joins, no query engine, no datatypes, no SQL
No schema
Denormalized data
Wide and sparsely populated data structure(key-value)
No DBA needed
Hbase is
15
Bigness
  Big data, big number of users, big number of computers
Massive write performance
  Facebook needs to handle 135 billion messages a month
  Twitter stores 7 TB of data per day
Fast key-value access
Write availability
No Single point of failure
16
Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, etc.
Real-time inserts, updates, and queries.
Fraud detection by comparing transactions to known patterns in real-time.
Analytics - Use MapReduce, Hive, or Pig to perform analytical queries
Specific Use Cases
17
Column-oriented database
Tables are sorted by row key
Table schema only defines column families
  A column family can have any number of columns
Each cell value has a timestamp
18
19
20
SortedMap(RowKey, List(SortedMap(Column, List(Value, Timestamp))))
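
The nested structure above can be modelled in a few lines of plain Python. This is only a sketch of the logical data model, not HBase code: `put` and `get` are hypothetical helper functions, and ordinary dicts stand in for the sorted maps.

```python
from collections import defaultdict

# table[row_key][column] -> list of (timestamp, value), newest first.
# Plain dicts stand in for HBase's sorted maps in this toy model.
table = defaultdict(lambda: defaultdict(list))


def put(row_key, column, value, timestamp):
    """Add one version of a cell and keep versions ordered newest first."""
    versions = table[row_key][column]
    versions.append((timestamp, value))
    versions.sort(reverse=True)


def get(row_key, column, max_versions=1):
    """Return up to max_versions values for a cell, newest first."""
    if row_key not in table or column not in table[row_key]:
        return []
    return [value for _, value in table[row_key][column][:max_versions]]


put("1", "info:name", "Gaurav", 1273516197868)
put("1", "info:age", "34", 1273871823022)
put("1", "info:age", "28", 1273871824184)

print(get("1", "info:age"))                  # newest version only
print(get("1", "info:age", max_versions=2))  # both stored versions
```

A read without an explicit version count sees only the newest value, which is why an update in this model is just another put with a later timestamp.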
21
A big sorted map: row key + column key + timestamp => value

Student table
Row Key  Column Key  Timestamp      Value
1        info:name   1273516197868  Gaurav
1        info:age    1273871824184  28
1        info:age    1273871823022  34
1        info:sex    1273746281432  Male
2        info:name   1273863723227  Harsh
3        info:name   1273822456433  Raman

Row 1 has 2 versions of info:age
Column key = column family (info) : column qualifier/name
Timestamp is a long value
Sorted by row key and column key
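
One way to read the Student table is as a flat list of cells ordered by a composite key. The sketch below, in plain Python rather than HBase code, sorts the slide's cells by row key, then column key, then timestamp descending, so the newest version of a cell comes first. The helper name `latest` is made up for illustration.

```python
# Cells from the Student table: (row key, column key, timestamp, value).
cells = [
    ("1", "info:name", 1273516197868, "Gaurav"),
    ("1", "info:age", 1273871823022, "34"),
    ("1", "info:age", 1273871824184, "28"),
    ("1", "info:sex", 1273746281432, "Male"),
    ("2", "info:name", 1273863723227, "Harsh"),
    ("3", "info:name", 1273822456433, "Raman"),
]

# Sort by row key, then column key, then newest timestamp first.
cells.sort(key=lambda c: (c[0], c[1], -c[2]))


def latest(row_key, column):
    """Return the newest value for a cell, or None if it does not exist."""
    for row, col, _, value in cells:
        if row == row_key and col == column:
            return value  # first match is the newest because of the sort
    return None


for cell in cells:
    print(cell)
print(latest("1", "info:age"))
```

Note that a strict lexicographic sort places info:age before info:name within a row.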
22
Example of a Student and Subject
Student table: id (PK), name, age, sex
Subject table: id (PK), title, introduction, teacher_id
Student-Subject table: student_id, subject_id, type
Student to Subject relationship: m : n
23
Example of a Student and Subject
RDBMS

Student table
key  name    age  sex
1    Gaurav  28   Male

Subject table
id  title  introduction   teacher_id
1   Hbase  Hbase is cool  10

Student-Subject table
student_id  subject_id  type
1           1           elective
24
HBase
Student-Subject schema - HBase

Student table
Row Key     Column family  Column Keys
student_id  info           name, age, sex
student_id  subjects       Subject IDs as qualifier (key)

Subject table
Row Key     Column family  Column Keys
subject_id  info           title, introduction, teacher_id
subject_id  students       Student IDs as qualifier (key)
25
HBase
Student-Subject schema - HBase

Student table
key  info              subjects
1    info:name=Gaurav  subjects:1="elective"
     info:age=28       subjects:2="main"
     info:sex=Male

Subject table
key  info                             students
1    info:title=Hbase                 students:1
     info:introduction=Hbase is cool  students:2
     info:teacher_id=10
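
To make the denormalized layout concrete, here is a small plain-Python sketch (not HBase code) of the two tables. The relationship is written to both tables, so either direction can be read with a single row lookup and no join. The function names `enroll` and `qualifiers` are hypothetical, the slide does not give a value for the `students:` cells so an empty string is assumed, and the enrollment type for student 2 is made up.

```python
# Each table maps row key -> {column key: value}, mirroring the slide.
student_table = {
    "1": {"info:name": "Gaurav", "info:age": "28", "info:sex": "Male"},
}
subject_table = {
    "1": {
        "info:title": "Hbase",
        "info:introduction": "Hbase is cool",
        "info:teacher_id": "10",
    },
}


def enroll(student_id, subject_id, kind):
    """Record the relationship in both tables (two writes, no join table)."""
    student_table.setdefault(student_id, {})["subjects:" + subject_id] = kind
    # Assumption: the cell value is unused; the qualifier carries the id.
    subject_table.setdefault(subject_id, {})["students:" + student_id] = ""


def qualifiers(row, family):
    """List the qualifiers stored under one column family of a row."""
    prefix = family + ":"
    return sorted(k[len(prefix):] for k in row if k.startswith(prefix))


enroll("1", "1", "elective")
enroll("1", "2", "main")
enroll("2", "1", "main")  # made-up type, not on the slide

print(qualifiers(student_table["1"], "subjects"))  # subjects of student 1
print(qualifiers(subject_table["1"], "students"))  # students of subject 1
```

The price of this layout is that the application has to keep the two writes consistent itself, which an RDBMS would handle with a join table and referential integrity.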
26
Attribute    Possible Values          Default
COMPRESSION  NONE, GZ, LZO            NONE
VERSIONS     1+                       3
TTL          1-2147483647 (seconds)   2147483647
BLOCKSIZE    1 byte – 2 GB            64k
IN_MEMORY    true, false              false
BLOCKCACHE   true, false              true
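
As a rough illustration of what VERSIONS and TTL mean for a single cell, the following plain-Python sketch filters a cell's stored versions. It is a simplified model under stated assumptions (timestamps in milliseconds, TTL in seconds as in the table, expired or surplus versions simply hidden from readers); `visible_versions` is a hypothetical helper, not an HBase API.

```python
def visible_versions(versions, now_ms, max_versions=3, ttl_seconds=2147483647):
    """Return the (timestamp_ms, value) pairs a reader would still see.

    versions: list of (timestamp_ms, value) in any order.
    The defaults mirror the VERSIONS and TTL defaults listed above.
    """
    cutoff_ms = now_ms - ttl_seconds * 1000
    alive = [v for v in versions if v[0] > cutoff_ms]
    alive.sort(reverse=True)  # newest first
    return alive[:max_versions]


now = 1_000_000_000  # arbitrary "current time" in milliseconds
history = [
    (now - 5_000, "v4"),
    (now - 60_000, "v3"),
    (now - 120_000, "v2"),
    (now - 3_600_000, "v1"),
]

print(visible_versions(history, now))                  # newest three versions
print(visible_versions(history, now, ttl_seconds=90))  # only the last 90 s
```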
27
Region: Contiguous set of lexicographically sorted rows
hbase.hregion.max.filesize (default: 256 MB)
Regions are hosted by Region Servers
Each Table is partitioned into Regions
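
Because each region covers a contiguous, sorted range of row keys, finding the region for a row amounts to a search over sorted start keys. A minimal sketch in plain Python, with made-up region boundaries and server names (HBase's real lookup goes through its own catalog, which is not modelled here):

```python
import bisect

# Hypothetical layout: each region starts at a key and runs to the next start.
# The empty string marks the first region, which has no lower bound.
region_start_keys = ["", "row200", "row500"]
region_servers = ["rs-a", "rs-b", "rs-a"]  # made-up server names


def locate(row_key):
    """Return (region index, server) for the region that holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one whose range contains the key.
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


print(locate("row100"))
print(locate("row200"))
print(locate("row777"))
```

Keys compare as strings here, so "row1000" would sort before "row200"; fixed-width keys avoid that surprise.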
28
Regions and
row200
row201
row500
row1
new row
29
Regions and
row200
row201
row350
row1
row351
row501
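
The split pictured on these two slides can be sketched as follows. This is a toy model in plain Python: a row-count threshold stands in for hbase.hregion.max.filesize, the split point is simply the middle row, and `insert` and `MAX_ROWS` are hypothetical names rather than anything from HBase.

```python
import bisect

MAX_ROWS = 4  # stand-in for hbase.hregion.max.filesize (assumption)


def insert(regions, row_key):
    """Insert row_key into the region covering it; split if it grows too big.

    regions: list of sorted lists of row keys, themselves in key order.
    """
    # Pick the last region whose first key is <= row_key (else the first).
    target = 0
    for i, region in enumerate(regions):
        if region and region[0] <= row_key:
            target = i
    region = regions[target]
    bisect.insort(region, row_key)
    if len(region) > MAX_ROWS:
        middle = len(region) // 2
        # Replace the oversized region with two contiguous halves.
        regions[target:target + 1] = [region[:middle], region[middle:]]
    return regions


regions = [["row100", "row200"], ["row201", "row300", "row400", "row500"]]
insert(regions, "row350")
print(regions)
```

Only the region that received the new row is split; the other region is left untouched, which keeps the cost of a split local.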
30
Master
Zookeeper
RegionServers
HDFS
MapReduce
31
32
HBase Interface – Java API, Thrift...
33
HBase Interface – Java API, Thrift...
Java
Thrift (Ruby, PHP, Python, Perl, C++...)
REST
Groovy DSL
MapReduce
Hbase Shell
34
HBase Interface – Java API, Thrift...
Java
  Get
  Put
  Delete
  Scan
  incrementColumnValue
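
For readers who have not used the client API, the five operations can be pictured with a toy in-memory table. The sketch below is plain Python and does not call HBase; the class `ToyTable`, its method signatures, and the `stats:visits` column are invented for illustration and are much simpler than the real Java classes (no versions, no column families).

```python
class ToyTable:
    """In-memory stand-in for a table: row key -> {column: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Return a copy of the row's cells, or {} if the row is absent."""
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key):
        self.rows.pop(row_key, None)

    def scan(self, start_row="", stop_row=None):
        """Yield (row key, cells) in key order; start inclusive, stop exclusive."""
        for key in sorted(self.rows):
            if key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self.rows[key])

    def increment_column_value(self, row_key, column, amount=1):
        """Add amount to a counter cell and return the new value."""
        row = self.rows.setdefault(row_key, {})
        row[column] = row.get(column, 0) + amount
        return row[column]


table = ToyTable()
table.put("1", "info:name", "Gaurav")
table.put("2", "info:name", "Harsh")
table.put("3", "info:name", "Raman")
table.delete("3")
print(table.get("1"))
print([key for key, _ in table.scan("1", "3")])
print(table.increment_column_value("1", "stats:visits"))
```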
35
36
HBase vs. RDBMS
  Not a replacement
  Solves only a small subset (~5%)
37
Where SQL makes life easy
  Joining
  Secondary indexing
  Referential integrity (updates)
  ACID
Where HBase makes life easy
  Dataset scale
  Read/write scale
  Replication
  Batch analysis
38
39
40
Apache HBase (http://hbase.apache.org/)
HBase wiki (wiki.apache.org/hadoop/Hbase)
HBase blog (blog.hbase.org)
Images from Google Search
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html