Accumulo Summit 2015
TRANSCRIPT
Ferrari on a Bumpy Road
Shock absorbers to smooth out Accumulo Performance
Adam Fuchs, CTO, Sqrrl, Inc.
April 28, 2015
Contents
1. Accumulo Architecture Overview
2. Focused Ingest Performance Model
3. Conclusions and Recommendations
Accumulo Processes
[Diagram: Zookeeper, HDFS, Tablet Servers, and Master Processes, with multiple Client Processes connecting to the Tablet Servers]
[Diagram: inside a Tablet Server, a Thrift Client Handler receives writes from the BatchWriter (Thrift); each Tablet writes to the Write-Ahead Log (in HDFS), an In-Memory Map, and RFiles (in HDFS)]
Performance Modeling Setup
Oriented towards weak scaling, or constant problem size per server
Continuous ingest-style load:
Uniform ingest across all tablets
Each client writes to all tablets
Tablets pre-split (for modeling simplicity)
Focus on write-ahead log (WAL) sync operations
Linear Scalability?
Source: http://arxiv.org/pdf/1406.4923v1.pdf
Write-Ahead Log (WAL) Implementation
Implemented as HDFS File Append
Shared across all tablets in tablet server
Configurable Durable Synchronization:
HFlush: Write to the HDFS pipeline (~1ms)
HSync: Make sure the pipeline hits disk (~100ms)
Rolls over every ~1GB
WAL file mapped to tablets in metadata table
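The latencies above set a hard ceiling on sync throughput. Since syncs on a single shared WAL stream are serialized, a quick back-of-envelope sketch (using the slide's rough figures, not measurements) shows why the choice between hflush and hsync matters so much:

```python
# Rough cost of durable WAL syncs, using the latencies quoted on the slide
# (assumed typical, not measured here): hflush ~1 ms, hsync ~100 ms.
HFLUSH_S = 0.001  # write to the HDFS pipeline
HSYNC_S = 0.100   # make sure the pipeline hits disk

def max_syncs_per_sec(latency_s):
    # Syncs on a single shared WAL are serialized, so the ceiling is 1/latency.
    return 1.0 / latency_s

print(max_syncs_per_sec(HSYNC_S))   # ~10 hsyncs/sec per shared WAL
print(max_syncs_per_sec(HFLUSH_S))  # ~1000 hflushes/sec
```

With only ~10 hsyncs per second available, every operation that forces a sync eats a visible slice of a tablet server's throughput, which is what the model below counts up.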
Accounting for WAL syncs
Primary WAL Syncs
Mutation groups: ApplyUpdate / CloseUpdate
Tablet definitions
Minor compaction journal entries
Minor compaction start
Minor compaction finish
Metadata WAL Syncs
Register new logs
Clear unused logs
Register minor compacted file
Major compaction handshaking:
Delete flag, File change set
Tablet lifecycle: migration, timestamp, creation/split, delete
WAL Sync Model

Variable  Definition                      WAL-Related Operation   Rate (per Byte)
S         Number of Servers               Primary Flush Rate      2 * S / BW
T         Number of Tablets per Server    Minor Compaction Rate   T / IMM
BW        BatchWriter buffer size         WAL Roll-Over Rate      T / WAL
IMM       In-Memory Map Size              Major Compaction Rate   Minor Compaction Rate / 4
WAL       Write-Ahead Log File Size

Operation                        Syncs per Byte of Data per Server   Total Syncs
tserver ops
  ApplyUpdates/CloseUpdate       Primary Flush Rate                  Total Data Table WAL Syncs
  minor compaction start         Minor Compaction Rate
  minor compaction finish        Minor Compaction Rate
  define tablet in new log       WAL Roll-Over Rate
metadata ops
  register new logs              WAL Roll-Over Rate * S              Total Metadata Table WAL Syncs
  clear unused logs              Minor Compaction Rate * S
  register minor compacted file  Minor Compaction Rate * S
  majc delete flag               Major Compaction Rate * S
  majc file set change           Major Compaction Rate * S
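The rate formulas in the table can be sketched as a few lines of Python (this is a model of the slide's arithmetic, not Accumulo code; the dictionary keys are my names, and the factor of 2 on the primary flush rate reflects the ApplyUpdate/CloseUpdate pair per BatchWriter flush):

```python
# Syncs per GB written, per server, following the rate table above.
GB = 1 << 30

def rates_per_gb(S, T, BW, IMM, WAL):
    return {
        "primary_flush":    2 * S / BW * GB,       # ApplyUpdate + CloseUpdate
        "minor_compaction": T / IMM * GB,
        "wal_rollover":     T / WAL * GB,
        "major_compaction": (T / IMM * GB) / 4,    # 1/4 of the minc rate
    }

# The simple case from the next slide: S=1, T=10, BW=100MB, IMM=4GB, WAL=1GB
r = rates_per_gb(S=1, T=10, BW=100 * 2**20, IMM=4 * GB, WAL=GB)
print(r)  # primary_flush 20.48, minor_compaction 2.5, wal_rollover 10, major_compaction 0.625
```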
[Diagram: the Application's BatchWriter sends mutations to the Tablet Server's Thrift Client Handler; the server's Tablets share one Write-Ahead Log (in HDFS), each with its own In-Memory Map and RFiles (in HDFS)]
WAL Sync Model: Simple Case

Variable  Setting       WAL-Related Operation   Rate (per GB)
S         1             Primary Flush Rate      20.48
T         10            Minor Compaction Rate   2.5
BW        104857600     WAL Roll-Over Rate      10
IMM       4294967296    Major Compaction Rate   0.625
WAL       1073741824

Operation                        Syncs per GB per Server   Total Syncs
tserver ops
  ApplyUpdates/CloseUpdate       20.48                     35.48
  minor compaction start         2.5
  minor compaction finish        2.5
  define tablet in new log       10
metadata ops
  register new logs              10                        16.25
  clear unused logs              2.5
  register minor compacted file  2.5
  majc delete flag               0.625
  majc file set change           0.625

• 1GB / (35.48 * 100ms) = 300MB/s
• 300MB/s per server looks great!
• So what’s the problem?
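The slide's headline number is easy to reproduce: sum the data-table syncs per GB and multiply by the ~100ms hsync latency to get the time spent syncing per GB written.

```python
# Reproducing the slide's arithmetic for the simple case: 35.48 data-table
# syncs per GB at ~100 ms per hsync caps a single server near 300 MB/s.
tserver_syncs_per_gb = 20.48 + 2.5 + 2.5 + 10   # = 35.48
seconds_per_gb = tserver_syncs_per_gb * 0.100   # hsync latency
mb_per_s = 1024 / seconds_per_gb
print(round(mb_per_s))  # ~289 MB/s, the slide's "300MB/s looks great"
```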
WAL Sync Model: Many Tablets

Variable  Setting       WAL-Related Operation   Rate (per GB)
S         1             Primary Flush Rate      20.48
T         1000          Minor Compaction Rate   250
BW        104857600     WAL Roll-Over Rate      1000
IMM       4294967296    Major Compaction Rate   62.5
WAL       1073741824

Operation                        Syncs per GB per Server   Total Syncs
tserver ops
  ApplyUpdates/CloseUpdate       20.48                     1520.48
  minor compaction start         250
  minor compaction finish        250
  define tablet in new log       1000
metadata ops
  register new logs              1000                      1625
  clear unused logs              250
  register minor compacted file  250
  majc delete flag               62.5
  majc file set change           62.5

• 1GB / (1625 * 100ms) = 6.6MB/s
• 1GB / (1625 * 1ms) = 660MB/s
• Use HFlush?
WAL Sync Model: Many Servers

Variable  Setting       WAL-Related Operation   Rate (per GB)
S         1000          Primary Flush Rate      20480
T         1000          Minor Compaction Rate   250
BW        104857600     WAL Roll-Over Rate      1000
IMM       4294967296    Major Compaction Rate   62.5
WAL       1073741824

Operation                        Syncs per GB per Server   Total Syncs
tserver ops
  ApplyUpdates/CloseUpdate       20480.00                  21980.00
  minor compaction start         250
  minor compaction finish        250
  define tablet in new log       1000
metadata ops
  register new logs              1000000                   1625000
  clear unused logs              250000
  register minor compacted file  250000
  majc delete flag               62500
  majc file set change           62500

• 1GB / (1.6E6 * 100ms) = 6.15KB/s
• 1GB / (1.6E6 * 1ms) = 615KB/s
• Needs improvement.
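The collapse across the three scenarios comes from the metadata table: every tablet server's roll-overs and compactions funnel into it, so its sync count scales with S while the data-table syncs stay per-server. A sketch of that per-GB metadata accounting (functions and names are mine, rates taken from the slides):

```python
# Metadata-table syncs per GB of ingest, summed over all S servers:
# register new logs + clear unused logs + register minor compacted file
# + majc delete flag + majc file set change.
def metadata_syncs_per_gb(S, minc, rollover, majc):
    return S * (rollover + minc + minc + majc + majc)

print(metadata_syncs_per_gb(S=1,    minc=2.5, rollover=10,   majc=0.625))  # 16.25
print(metadata_syncs_per_gb(S=1,    minc=250, rollover=1000, majc=62.5))   # 1625.0
print(metadata_syncs_per_gb(S=1000, minc=250, rollover=1000, majc=62.5))   # 1625000.0
```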
• Mitigated by bigger BW buffer
• Mitigated by group sync
• Need to reduce client/server connections
• Mitigated by splitting Metadata table
• Mitigated by batching Metadata updates
• Mitigated by skipping sync?
ACCUMULO-3423: Speed up WAL roll-overs
Pre-registers WALogs
Double-buffers to smooth hand-off
Parallelizes operations like log closing
Pushes bottleneck to another thread
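The pre-registration and double-buffering idea can be illustrated with a small sketch. This is a hypothetical Python illustration of the pattern, not the actual ACCUMULO-3423 Java implementation: a background thread creates and registers the next log ahead of time, so the roll-over on the hot path becomes a cheap hand-off.

```python
import itertools
import queue
import threading

class PreRegisteredWalRoller:
    """Sketch of the ACCUMULO-3423 idea: a background thread pre-creates
    (and pre-registers) the next WAL so a roll-over is a cheap swap rather
    than a synchronous create-plus-metadata-registration."""

    def __init__(self):
        self._ids = itertools.count()
        self._next = queue.Queue(maxsize=1)  # double buffer: one log kept ready
        self._worker = threading.Thread(target=self._prepare, daemon=True)
        self._worker.start()
        self.current = self._next.get()      # first log

    def _create_and_register(self, n):
        # Stand-in for creating the HDFS file and registering it in metadata.
        return f"wal-{n:05d}"

    def _prepare(self):
        while True:
            n = next(self._ids)
            # Blocks while the buffer already holds a ready log.
            self._next.put(self._create_and_register(n))

    def roll(self):
        old = self.current
        self.current = self._next.get()      # instant hand-off
        return old  # caller closes/archives the old log off the hot path
```

Parallelizing the close of the old log (returned by `roll`) is what "pushes the bottleneck to another thread": the writer never waits on file creation or closing.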
Did we solve the problem?
[Charts: ingest performance before and after the change]
Future Performance Improvements
Reduce total number of data syncs:
Cut out unnecessary syncs
Bigger batches from clients
Design patterns for reducing actively-written tablets
Fewer than NxN connections: fat tree?
Reduce total number of metadata syncs:
Batch updates to metadata
Assign logs to tablet servers instead of tablets
Drive towards consistently smooth performance
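The "fewer than NxN connections" point can be made concrete with a hypothetical two-level fan-in (the aggregator count A is my assumption for illustration; the slide only poses the fat-tree question):

```python
# Every BatchWriter writing to every tablet means N clients x N servers
# connections. Routing through A intermediate aggregators in a fat-tree
# style topology costs N*A + A*N connections instead.
def direct_connections(n_clients, n_servers):
    return n_clients * n_servers

def fat_tree_connections(n_clients, n_servers, n_aggregators):
    return n_clients * n_aggregators + n_aggregators * n_servers

print(direct_connections(1000, 1000))        # -> 1000000
print(fat_tree_connections(1000, 1000, 32))  # -> 64000
```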
[Diagram: clients reaching servers through intermediate cutoff nodes, avoiding direct NxN client/server connections]