Apache HBase Performance Tuning

Upload: lhofhansl

Post on 08-Aug-2015

TRANSCRIPT

Page 1: Apache HBase Performance Tuning

HBase Tuning
Performance and Correctness

Lars Hofhansl
Principal Architect, Salesforce (10 years!)
HBase, Phoenix Committer, PMC
Apache Incubator PMC
Apache Foundation Member
http://hadoop-hbase.blogspot.com/

Page 2: Apache HBase Performance Tuning
Page 3: Apache HBase Performance Tuning

Boring Topic
Experiment with Colorful Slides

Page 4: Apache HBase Performance Tuning

Agenda
• HDFS
• HBase – Server
• HBase – Client
• Correctness
• Performance

Page 5: Apache HBase Performance Tuning

HDFS
hdfs-site.xml

Page 6: Apache HBase Performance Tuning

HDFS - Background
• Stores HBase WAL and HFiles
• No sync-to-disk by default
• Datanode writes tmp file, moves it into place
• Old data lost on power outage

Page 7: Apache HBase Performance Tuning

HDFS Correctness Settings

• dfs.datanode.synconclose = true
  (since Hadoop 1.1)
• Mount ext4 with dirsync! Or use XFS
• You must do this!

Page 8: Apache HBase Performance Tuning

HDFS Performance Settings
1. Sync behind writes
2. Stale Datanode Detection
3. Short Circuit Reads
4. Miscellaneous Settings

Page 9: Apache HBase Performance Tuning

HDFS Sync Behind Writes
• Syncs partial blocks to disk - best effort
  (OK, since blocks are immutable)
• Necessary with sync-on-close for performance
• Always enable this
• dfs.datanode.sync.behind.writes = true
  (since Hadoop 1.1)
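
The slides write settings as `name = value`; in hdfs-site.xml each one becomes a `<property>` element. The sketch below shows that mapping for the two sync settings named on the slides. The helper name `render_properties` is hypothetical, and it only uses Python's standard `xml.etree` module.

```python
import xml.etree.ElementTree as ET


def render_properties(settings):
    """Render a dict of Hadoop settings as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available since Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Property names and values are the ones given on the slides.
hdfs_sync_settings = {
    "dfs.datanode.synconclose": "true",
    "dfs.datanode.sync.behind.writes": "true",
}

print(render_properties(hdfs_sync_settings))
```

The same helper could render any other `name = value` pair from this deck. Values are passed as strings because Hadoop expects lowercase `true`, not Python's `True`.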

Page 10: Apache HBase Performance Tuning

Stale Datanodes - Background
• Datanodes (DNs) send block reports to the Namenode (NN)
• After 10min(!) w/o a report, DN is declared dead
• NN will still direct reads and writes to those DNs
• Bad for recovery. Down by 1 DN by definition.
  (every 3rd read/write goes to a bad DN)

Page 11: Apache HBase Performance Tuning

Stale Datanodes - Detection
Don’t use a DN for read or write when it looks like it is stale (default off)
• dfs.namenode.avoid.read.stale.datanode = true
• dfs.namenode.avoid.write.stale.datanode = true
• dfs.namenode.stale.datanode.interval = 30000
  (default)

Page 12: Apache HBase Performance Tuning

HDFS short circuit reads
Read local blocks directly without DN, when RegionServers and DNs are co-located.
• dfs.client.read.shortcircuit = true
• dfs.client.read.shortcircuit.buffer.size = 131072
  (important, OOM on direct buffers, default on 0.98+)
• hbase.regionserver.checksum.verify = true
  (default on 0.98+)
• dfs.domain.socket.path
  (local Unix domain socket, not group or world readable)

Page 13: Apache HBase Performance Tuning

Misc HDFS tips
Keep DN running with some failed disks
• dfs.datanode.failed.volumes.tolerated = <N>
  (tolerate losing this many disks)

Distribute data across disks at a DN
• dfs.datanode.fsdataset.volume.choosing.policy = AvailableSpaceVolumeChoosingPolicy
  (HDFS-1804: hit drives with more space with higher probability for writes when free space differs by more than 10GB by default)

Page 14: Apache HBase Performance Tuning

Misc HDFS settings
(just trust me on these)
• dfs.block.size = 268435456
  (note that WAL is rolled at 95% of this)
• ipc.server.tcpnodelay = true
• ipc.client.tcpnodelay = true

Page 15: Apache HBase Performance Tuning

Misc HDFS settings
(just trust me on these, really)
• dfs.datanode.max.xcievers = 8192
• dfs.namenode.handler.count = 64
• dfs.datanode.handler.count = 8
  (match number of spindles)

Page 16: Apache HBase Performance Tuning
Page 17: Apache HBase Performance Tuning

HBase
RegionServer Settings

hbase-site.xml

Page 18: Apache HBase Performance Tuning

Compactions

Page 19: Apache HBase Performance Tuning

Compactions - Background
• Writes are buffered in the memstore
• Memstore contents flushed to disk as HFiles
• Need to limit # HFiles by rewriting small HFiles into fewer larger ones
• Remove deleted and expired Cells
• Same data written multiple times => Write Amplification!

Page 20: Apache HBase Performance Tuning

Read vs. Write
• Read requires merging HFiles => fewer is better
• Write throughput better with fewer compactions => leads to more files
• Optimize for Read or Write, not both

Page 21: Apache HBase Performance Tuning

Write Amplification vs. Read Performance

Page 22: Apache HBase Performance Tuning

Control the number of HFiles
• hbase.hstore.blockingStoreFiles = 10
  (do not allow more flushes when there are more than <N> files)
  small for read, large for write, will stop flushes and writes
• hbase.hstore.compactionThreshold = 3
  (number of files that starts a compaction)
  small for read, large for write
• hbase.hregion.memstore.flush.size = 128mb
  (max memstore size, default is good)
  larger is good for fewer compactions (watch Region Server heap)

Page 23: Apache HBase Performance Tuning

Time Based Compactions
• HBase does time based major compactions
• expensive, always at wrong time
• hbase.hregion.majorcompaction = 604800000
  (week, default)
• hbase.hregion.majorcompaction.jitter = 0.5
  (½ week, default)
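
To make the two defaults concrete, here is a small sketch of the window in which a time based major compaction can fire. It assumes the jitter is applied symmetrically as a fraction of the period, which is my reading of the setting rather than something the slides spell out. The function name is hypothetical.

```python
def major_compaction_window_ms(period_ms, jitter):
    """Return (earliest, latest) delay between time based major compactions.

    Assumption: jitter is a fraction of the period applied on either side.
    """
    spread = period_ms * jitter
    return period_ms - spread, period_ms + spread


MS_PER_DAY = 24 * 60 * 60 * 1000

# Defaults from the slide: one week, jitter 0.5.
earliest, latest = major_compaction_window_ms(604800000, 0.5)
print(earliest / MS_PER_DAY, latest / MS_PER_DAY)
```

Under that assumption the defaults spread major compactions over a window of roughly 3.5 to 10.5 days after the previous one.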

Page 24: Apache HBase Performance Tuning

Memstore/Cache Sizing
• hbase.hregion.memstore.flush.size = 128mb
• hbase.hregion.memstore.block.multiplier
  (allow single memstore to grow by this multiplier, good for heavy, bursty writes)
• hbase.regionserver.global.memstore.upperLimit (0.98)
  hbase.regionserver.global.memstore.size (1.0+)
  (fraction of heap, default 0.4, decrease for read heavy load)
• hfile.block.cache.size
  (fraction of heap used for the block cache, default 0.4)

Page 25: Apache HBase Performance Tuning

Autotune BlockCache vs. Memstores (1.0+)

HBASE-5349, not well tested, Must Experiment

• hbase.regionserver.global.memstore.size.{max|min}.range
• hfile.block.cache.size.{max|min}.range
• hbase.regionserver.heapmemory.tuner.class
• hbase.regionserver.heapmemory.tuner.period

Page 26: Apache HBase Performance Tuning

Data Locality
• Essential for Short Circuit Reads
• hbase.hstore.min.locality.to.skip.major.compact
  (compact even when unnecessary to restore locality)
• hbase.master.wait.on.regionservers.timeout
  (allow master to wait a bit upon restart, so not all regions go to the first servers that sign in. 30-90s is good. Default is 4.5s)
• Don’t use the HDFS balancer!

Page 27: Apache HBase Performance Tuning

HBase
Column Family Settings

Page 28: Apache HBase Performance Tuning

Block Encoding
• NONE, FAST_DIFF, PREFIX, etc
• alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
• Scan friendly, decodes as you scan
• Not so Get friendly (might need to decode many previous Cells)
• Currently produces a lot of extra garbage
• Safe to enable, always

Page 29: Apache HBase Performance Tuning

Compression
• NONE, GZIP, SNAPPY, etc
• create 'test', {NAME => 'cf', COMPRESSION => 'SNAPPY'}
• Compresses entire blocks, not Scan or Get friendly
• Typically does not achieve much over block encoding
• Blocks cached decompressed, unless hbase.block.data.cachecompressed = true
  (more cache capacity, but every access needs decompression)
• Need to test with your data

Page 30: Apache HBase Performance Tuning

HFile Block Size
• Don’t confuse with HDFS block size!
• create 'test', {NAME => 'cf', BLOCKSIZE => '4096'}
• Default 64k good compromise between Scans and point Gets
• Increase for large Scans
• Decrease for many point gets
• Rarely want to change this, likely never > 1mb

Page 31: Apache HBase Performance Tuning

RegionServer - Garbage Collection

(source: http://www.everystockphoto.com)

Page 32: Apache HBase Performance Tuning

Weak Generational Hypothesis

Most Allocated Objects Die Young

Page 33: Apache HBase Performance Tuning

Garbage Collection - Background
HotSpot manages four generations (CMS collector):
• Eden for all new objects
• Survivor I and II where surviving objects are promoted when eden is collected
• Tenured space. Objects surviving a few rounds (16 by default) of eden/survivor collection are promoted into the tenured space
• Perm gen for classes, interned strings, and other more or less permanent objects. (gone, finally, in JDK8)

Page 34: Apache HBase Performance Tuning

Garbage Collection - HBase
• Garbage from operations is short-lived (single RPC)
• Memstore is relatively long-lived
  (allocated in 2mb chunks)
• Blockcache is long-lived
  (allocation in 64k blocks)
• Deal with the “operational” garbage efficiently

Page 35: Apache HBase Performance Tuning

Garbage Collection (CMS)
-Xmn512m
  very small eden space
-XX:+UseParNewGC
  collect eden in parallel
-XX:+UseConcMarkSweepGC
  use the non-moving CMS collector
-XX:CMSInitiatingOccupancyFraction=70
  start collecting when 70% of tenured gen is full, avoid collection under pressure
-XX:+UseCMSInitiatingOccupancyOnly
  do not try to adjust CMS setting

Page 36: Apache HBase Performance Tuning

RegionServer Machine Sizing

Page 37: Apache HBase Performance Tuning

RegionServer Machine Sizing
• How much RAM/Heap?
• How many disks?
• What size of disk?
• Network?
• Number of cores?

Page 38: Apache HBase Performance Tuning

RegionServer Disk/Java Heap ratio
• Disk/Heap ratio:
  RegionSize / MemstoreSize * ReplicationFactor * HeapFractionForMemstores * 2
  (assuming memstores on average ½ filled)
• 10gb/128mb * 3 * 0.4 * 2 = 192, with default settings
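
The ratio formula is easy to play with in code. The sketch below is a direct transcription of the slide's formula; the function and parameter names are made up for illustration, and the defaults are the ones quoted on the slide.

```python
def disk_to_heap_ratio(region_size_mb, memstore_flush_mb,
                       replication=3, memstore_heap_fraction=0.4):
    """Bytes of raw disk that one byte of RegionServer heap can support."""
    # The trailing factor 2 encodes the slide's assumption that memstores
    # are on average half full.
    return (region_size_mb / memstore_flush_mb
            * replication * memstore_heap_fraction * 2)


# Slide defaults: 10gb regions, 128mb memstore flush size.
default_ratio = disk_to_heap_ratio(10 * 1024, 128)
print(round(default_ratio))

# Raw disk (in TB) that a given heap can keep busy at that ratio.
heap_gb = 32
print(heap_gb * default_ratio / 1024)
```

Changing `region_size_mb` or `memstore_flush_mb` shows how strongly the supportable disk per machine depends on these two settings.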

Page 39: Apache HBase Performance Tuning

RegionServer Disk/Java Heap ratio

• Each 192 bytes on disk need 1 byte of Heap
• With 32gb of heap, can barely fill 6T disk/machine
  (32gb * 192 = 6tb)

192?! W.T.F.

Page 40: Apache HBase Performance Tuning

How about 1gb regions?

1gb/128mb * 3 * 0.4 * 2 = 19

Page 41: Apache HBase Performance Tuning

(source: http://www.everystockphoto.com)

Page 42: Apache HBase Performance Tuning

RegionServer sizing configs
• hbase.hregion.max.filesize (default 10g is good)
• hbase.hregion.memstore.flush.size (default 128mb)
  (decrease for read heavy loads)
• hbase.regionserver.maxlogs
  (HDFS blocksize * 0.95 * <this> should be larger than 0.4 * JavaHeap)
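
The maxlogs rule of thumb can be turned into a quick calculation. This is a sketch of the inequality on the slide, not HBase's own logic; the function name is hypothetical, and the 0.4 and 0.95 factors are the ones quoted in this deck.

```python
import math


def min_maxlogs(heap_bytes, hdfs_block_bytes,
                memstore_heap_fraction=0.4, wal_roll_fraction=0.95):
    """Smallest maxlogs for which total WAL capacity covers the memstore budget."""
    wal_bytes = hdfs_block_bytes * wal_roll_fraction  # a WAL rolls at 95% of a block
    memstore_budget = heap_bytes * memstore_heap_fraction
    return math.ceil(memstore_budget / wal_bytes)


GB = 1024 ** 3

# 32gb heap with the 256mb HDFS block size suggested earlier in the deck.
print(min_maxlogs(heap_bytes=32 * GB, hdfs_block_bytes=268435456))
```

Treat the result as a lower bound: the slide asks for WAL capacity to be larger than the memstore budget, so rounding up further leaves some headroom.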

Page 43: Apache HBase Performance Tuning

RegionServer Hardware
• <= 6T disk space per machine
• Enough heap (~diskspace/200)
• Many cores are good. HBase is CPU intensive.
• Match network and disk throughput
  (1ge and 24 disks is not good: 125mb/s vs 2.4gb/s)
  (10ge and 24 disks is OK, 1ge and 4 or 6 disks is OK)
• But… For reads with filters more disks are still better.
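
The network/disk comparison can be reproduced with a few lines of arithmetic. The sketch assumes about 100 MB/s of sequential throughput per disk, which is what the slide's "24 disks = 2.4gb/s" implies; real drives will differ. Function names are hypothetical.

```python
def network_mb_per_s(link_gbit):
    """Raw link throughput in MB/s (1 Gbit/s = 125 MB/s)."""
    return link_gbit * 1000 / 8


def disk_mb_per_s(disks, per_disk_mb_per_s=100):
    """Aggregate disk throughput; 100 MB/s per disk is an assumption."""
    return disks * per_disk_mb_per_s


# The four combinations mentioned on the slide.
for link_gbit, disks in [(1, 24), (10, 24), (1, 4), (1, 6)]:
    net = network_mb_per_s(link_gbit)
    disk = disk_mb_per_s(disks)
    print(f"{link_gbit}ge, {disks} disks: network {net:.0f} MB/s, "
          f"disks {disk} MB/s, disk/network = {disk / net:.1f}")
```

A disk/network ratio far above 1 means the disks can deliver much more than the network can move, which is the mismatch the slide warns about.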

Page 44: Apache HBase Performance Tuning

HBase Client Settings

Page 45: Apache HBase Performance Tuning

Client/Server RPC chunk size
• No streaming RPC in HBase
• Can only asymptotically approach the full network bandwidth
• Typical intra datacenter latency: 0.1ms-1ms
• Transmitting 2mb over 1ge: 150ms
• Transmitting 2mb over 10ge: 15ms
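
Why larger chunks approach the full bandwidth only asymptotically can be seen with a simplified model: every request/response cycle pays the latency once, then transmits one chunk at raw line rate. This is my own idealization, ignoring protocol overhead and server time, and the function names are hypothetical.

```python
def transmit_ms(payload_bytes, link_gbit):
    """Time to push a payload through a link at raw line rate, in ms."""
    bits = payload_bytes * 8
    return bits / (link_gbit * 1e9) * 1000


def link_efficiency(payload_bytes, link_gbit, latency_ms):
    """Fraction of each request/response cycle spent actually moving data."""
    t = transmit_ms(payload_bytes, link_gbit)
    return t / (t + latency_ms)


chunk = 2 * 1024 * 1024  # the 2mb chunk discussed on the slide

for link_gbit in (1, 10):
    for latency_ms in (0.1, 1.0):
        print(f"{link_gbit}ge, latency {latency_ms}ms: "
              f"transmit {transmit_ms(chunk, link_gbit):.2f}ms, "
              f"efficiency {link_efficiency(chunk, link_gbit, latency_ms):.3f}")
```

Efficiency climbs toward 1 as the chunk grows but never reaches it. This idealized model gives shorter transfer times than the figures quoted on the slide, so treat its output as an order-of-magnitude guide only.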

Page 46: Apache HBase Performance Tuning

2mb chunks between Client and Server are good

But how should I do that?

Page 47: Apache HBase Performance Tuning

Client Chunk Size Settings
Write:
• hbase.client.write.buffer = 2mb (default write buffer, good)

Read:
• Scan.setCaching(<n>) (default 100 rows)
  (but… how large are the rows? Must guess!)
• hbase.client.scanner.max.result.size = 2mb (default scan buffer, 0.98.12+ only)

Page 48: Apache HBase Performance Tuning

Client
Consider RPC size * hbase.regionserver.handler.count for server GC

Need to be able to ride over splits and region moves:
hbase.client.pause = 100
hbase.client.retries.number = 35
hbase.ipc.client.tcpnodelay = true

Page 49: Apache HBase Performance Tuning

Replication (trust me)
• hbase.zookeeper.useMulti = true (needs ZK 3.4)
  this one is important for correctness

Other defaults are good:
• replication.sleep.before.failover = 30000
• replication.source.maxretriesmultiplier = 300
• replication.source.ratio = 0.10

Page 50: Apache HBase Performance Tuning

Linux
• Turn THP (Transparent Huge Pages) OFF
• Set Swappiness to 0
• Set vm.min_free_kbytes to AT LEAST 1GB (8GB on larger systems, server allocation immediately)
• Set zone_reclaim_mode to 0
  (one cache on NUMA)
• dirsync mount option for EXT4, or use XFS

Page 51: Apache HBase Performance Tuning

Not Covered
• Security/Kerberos
• HA NameNode/QJM
• ZK/Disk Layout
• Obscure Configs
• Offheap Caching, G1 GC

Page 52: Apache HBase Performance Tuning

(source: http://www.morguefile.com)

Page 53: Apache HBase Performance Tuning

TL;DR:
• Enable HDFS Sync on close, Sync behind writes
• Mount EXT4 with dirsync
• Enable Stale Datanode detection
• Tune HBase read vs. write load
• Set HFile block size for your load
• Get RPC Client/Server chunk size right

Page 54: Apache HBase Performance Tuning

Thank You!
http://hadoop-hbase.blogspot.com/