Apache Cassandra in Bangalore - Cassandra Internals and Performance

Post on 01-Dec-2014

DESCRIPTION

Slides from http://www.meetup.com/Apache-Cassandra/events/108524582/

TRANSCRIPT

BANGALORE CASSANDRA UG APRIL 2013

CASSANDRA INTERNALS & PERFORMANCE

Aaron Morton
@aaronmorton

www.thelastpickle.com

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License

Architecture
Code

Cassandra Architecture

API's

Cluster Aware

Cluster Unaware

Clients

Disk

Cassandra Cluster Architecture

Clients

Node 1: API's, Cluster Aware, Cluster Unaware, Disk
Node 2: API's, Cluster Aware, Cluster Unaware, Disk

Dynamo Cluster Architecture

Clients

Node 1: API's, Dynamo, Database, Disk
Node 2: API's, Dynamo, Database, Disk

Architecture

API
Dynamo
Database

API Transports

Thrift
Native Binary
Read Line
RMI

Thrift Transport

// Custom TServer implementations
o.a.c.thrift.CustomTThreadPoolServer
o.a.c.thrift.CustomTNonBlockingServer
o.a.c.thrift.CustomTHsHaServer

Native Binary Transport

Beta in Cassandra 1.2
Uses Netty 3.5
Enabled with start_native_transport (disabled by default)

o.a.c.transport.Server.run()

// Setup the Netty server
new ExecutionHandler()
new NioServerSocketChannelFactory()
ServerBootstrap.setPipelineFactory()

o.a.c.transport.Message.Dispatcher.messageReceived()

// Process message from client
ServerConnection.validateNewMessage()
Request.execute()
ServerConnection.applyStateTransition()
Channel.write()

o.a.c.transport.messages

CredentialsMessage()
EventMessage()
ExecuteMessage()
PrepareMessage()
QueryMessage()
ResultMessage()

(And more...)

Messages

Defined in the Native Binary Protocol

$SRC/doc/native_protocol.spec

API Services

JMX
CLI
Thrift
CQL 3

JMX Management Beans

Spread around the code base.

Interfaces named *MBean

JMX Management Beans

Registered with names such as

org.apache.cassandra.db:type=StorageProxy

o.a.c.cli.CliMain.main()

// Connect to server to read input
this.connect()
this.evaluateFileStatements()
this.processStatementInteractive()

CLI Grammar

ANTLR Grammar
$SRC/src/java/o/a/c/cli/CLI.g

o.a.c.cli.CliClient.executeCLIStatement()

// Process statement
CliCompiler.compileQuery() # ANTLR
switch (tree.getType()) case...

o.a.c.thrift.CassandraServer

// Implements Thrift Interface
// Access control
// Input validation
// Mapping to/from Thrift and internal types

Thrift Interface

Thrift IDL
$SRC/interface/cassandra.thrift

o.a.c.thrift.CassandraServer.get_slice()

// Get columns for one row
Tracing.begin()
ClientState cState = state()
cState.hasColumnFamilyAccess()
multigetSliceInternal()

CassandraServer.multigetSliceInternal()

// Get columns for many rows
ThriftValidation.validate*()
// Create ReadCommands
getSlice()

CassandraServer.getSlice()

// Process ReadCommands
// Return Thrift types
readColumnFamily()
thriftifyColumnFamily()

CassandraServer.readColumnFamily()

// Process ReadCommands
// Return ColumnFamilies

StorageProxy.read()

o.a.c.cql3.QueryProcessor

// Prepares and executes CQL3 statements
// Used by Thrift & Native transports
// Access control
// Input validation
// Returns transport.ResultMessage

CQL3 Grammar

ANTLR Grammar
$SRC/o.a.c.cql3/Cql.g

o.a.c.cql3.statements.ParsedStatement

// Subclasses generated by ANTLR
// Tracks bound term count
// Prepare CQLStatement
prepare()

o.a.c.cql3.statements.CQLStatement

checkAccess(ClientState state)
validate(ClientState state)
execute(ConsistencyLevel cl, QueryState state, List<ByteBuffer> variables)

o.a.c.cql3.functions.Function

argsType()
returnType()
execute(List<ByteBuffer> parameters)

statements.SelectStatement.RawStatement

// Implements ParsedStatement
// Input validation
prepare()

statements.SelectStatement.execute()

// Create ReadCommands
StorageProxy.read()

Dynamo Layer

o.a.c.service
o.a.c.net
o.a.c.dht
o.a.c.locator
o.a.c.gms
o.a.c.stream

o.a.c.service.StorageProxy

// Cluster wide storage operations
// Select endpoints & check CL available
// Send messages to Stages
// Wait for response
// Store Hints

o.a.c.service.StorageService

// Ring operations
// Track ring state
// Start & stop ring membership
// Node & token queries

o.a.c.service.IResponseResolver

preprocess(MessageIn<T> message)
resolve() throws DigestMismatchException

RowDigestResolver
RowDataResolver
RangeSliceResponseResolver

Response Handlers / Callback

implements IAsyncCallback<T>

response(MessageIn<T> msg)

o.a.c.service.ReadCallback.get()

// Wait for blockfor & data
condition.await(timeout, TimeUnit.MILLISECONDS)

throw ReadTimeoutException()

resolver.resolve()
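The wait-then-resolve pattern in ReadCallback.get() can be illustrated with a small stand-alone sketch. Everything here is hypothetical (BlockForCallback, ReadTimeout and the method names are illustrative, not Cassandra's classes), and it uses a CountDownLatch where the slide shows a condition; it only shows the idea of blocking until blockFor responses arrive or the timeout expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a read callback; not Cassandra's actual class.
final class BlockForCallback<T> {
    // Unchecked timeout used by this sketch only.
    static final class ReadTimeout extends RuntimeException {
        ReadTimeout(String message) { super(message); }
    }

    private final CountDownLatch remaining;                 // reaches zero after blockFor responses
    private final List<T> responses = new CopyOnWriteArrayList<>();

    BlockForCallback(int blockFor) {
        this.remaining = new CountDownLatch(blockFor);
    }

    // Called once for each replica response that arrives.
    void response(T message) {
        responses.add(message);
        remaining.countDown();
    }

    // Block until blockFor responses have arrived, or fail with a timeout.
    List<T> get(long timeoutMillis) {
        try {
            if (!remaining.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new ReadTimeout("only " + responses.size() + " responses before the timeout");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for replicas", e);
        }
        return new ArrayList<>(responses);                  // a resolver would merge these
    }

    public static void main(String[] args) {
        BlockForCallback<String> callback = new BlockForCallback<>(2);
        callback.response("data from replica A");
        callback.response("digest from replica B");
        System.out.println(callback.get(100));
    }
}
```

With blockFor set to 2 and only one response delivered, get() throws the sketch's ReadTimeout once the timeout passes.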

o.a.c.service.StorageProxy.fetchRows()

getLiveSortedEndpoints()
new RowDigestResolver()
new ReadCallback()
MessagingService.sendRR()
---------------------------------------
ReadCallback.get() # blocking
catch (DigestMismatchException ex)
catch (ReadTimeoutException ex)

o.a.c.net.MessagingService.Verb <<enum>>

MUTATION
READ
REQUEST_RESPONSE
TREE_REQUEST
TREE_RESPONSE

(And more...)

o.a.c.net.MessagingService.verbHandlers

new EnumMap<Verb, IVerbHandler>(Verb.class)

o.a.c.net.IVerbHandler<T>

doVerb(MessageIn<T> message, String id);

o.a.c.net.MessagingService.verbStages

new EnumMap<MessagingService.Verb, Stage>(MessagingService.Verb.class)

o.a.c.net.MessagingService.receive()

runnable = new MessageDeliveryTask(message, id, timestamp);
StageManager.getStage(message.getMessageType());
stage.execute(runnable);

o.a.c.net.MessageDeliveryTask.run()

// If droppable and rpc_timeout
MessagingService.incrementDroppedMessages(verb);

MessagingService.getVerbHandler(verb)
verbHandler.doVerb(message, id)
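The receive path (verb to stage, then verb to handler, with stale messages dropped) can be sketched in miniature as below. This is a simplified illustration under stated assumptions: VerbDispatchSketch and its members are hypothetical names, every verb is treated as droppable, and each verb gets its own single-thread executor, none of which is taken from the Cassandra source.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical miniature of verb dispatch; names are illustrative only.
final class VerbDispatchSketch {
    enum Verb { MUTATION, READ, REQUEST_RESPONSE }

    interface VerbHandler { void doVerb(String payload, String id); }

    private final Map<Verb, VerbHandler> handlers = new EnumMap<>(Verb.class);
    private final Map<Verb, ExecutorService> stages = new EnumMap<>(Verb.class);
    private final AtomicInteger dropped = new AtomicInteger();
    private final long timeoutMillis;

    VerbDispatchSketch(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    void register(Verb verb, VerbHandler handler) {
        handlers.put(verb, handler);
        stages.put(verb, Executors.newSingleThreadExecutor()); // assumption: one pool per verb
    }

    // Queue the message on the stage for its verb (the verb must be registered).
    void receive(Verb verb, String payload, String id, long createdAtMillis) {
        stages.get(verb).execute(() -> {
            // The age check runs when the task is executed, not when it is queued.
            if (System.currentTimeMillis() - createdAtMillis > timeoutMillis) {
                dropped.incrementAndGet();  // too old: the sender has stopped waiting
                return;
            }
            handlers.get(verb).doVerb(payload, id);
        });
    }

    int droppedCount() { return dropped.get(); }

    // Let queued tasks finish, then stop the stage threads.
    void shutdown() {
        for (ExecutorService stage : stages.values()) {
            stage.shutdown();
            try {
                stage.awaitTermination(1, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        VerbDispatchSketch dispatch = new VerbDispatchSketch(1000);
        dispatch.register(Verb.READ, (payload, id) -> System.out.println(id + " handled " + payload));
        long now = System.currentTimeMillis();
        dispatch.receive(Verb.READ, "fresh read", "1", now);
        dispatch.receive(Verb.READ, "stale read", "2", now - 5000);
        dispatch.shutdown();
        System.out.println("dropped: " + dispatch.droppedCount());
    }
}
```

Checking the message age inside the task is the point of the sketch: a message that sat in a busy stage queue longer than the timeout is counted as dropped rather than processed.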

o.a.c.dht.IPartitioner<T extends Token>

getToken(ByteBuffer key)
getRandomToken()

LocalPartitioner
RandomPartitioner
Murmur3Partitioner

o.a.c.dht.Token<T>

compareTo(Token<T> o)

BytesToken
BigIntegerToken
LongToken
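To make the key-to-token step concrete, here is a minimal sketch in the spirit of an MD5-based partitioner that yields a BigInteger token. TokenSketch and md5Token are hypothetical names for illustration; the real partitioners wrap the value in a Token type and differ in detail.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not the Cassandra IPartitioner API.
final class TokenSketch {
    // Map a row key to a non-negative BigInteger token using an MD5 hash.
    static BigInteger md5Token(byte[] key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // The 16 digest bytes are read as a signed number, then made non-negative.
            return new BigInteger(md5.digest(key)).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        BigInteger a = md5Token("user:1".getBytes(StandardCharsets.UTF_8));
        BigInteger b = md5Token("user:2".getBytes(StandardCharsets.UTF_8));
        // Tokens are comparable, which is what places keys on the ring.
        System.out.println(a + " compared to " + b + " gives " + a.compareTo(b));
    }
}
```

Because the token is a hash, keys that sort next to each other generally land far apart on the ring.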

o.a.c.locator.IEndpointSnitch

getRack(InetAddress endpoint)
getDatacenter(InetAddress endpoint)
sortByProximity(InetAddress address, List<InetAddress> addresses)

SimpleSnitch
PropertyFileSnitch
Ec2MultiRegionSnitch

o.a.c.locator.AbstractReplicationStrategy

getNaturalEndpoints(RingPosition searchPosition)
calculateNaturalEndpoints(Token searchToken, TokenMetadata tokenMetadata)

SimpleStrategy
NetworkTopologyStrategy

o.a.c.locator.TokenMetadata

BiMultiValMap<Token, InetAddress> tokenToEndpointMap
BiMultiValMap<Token, InetAddress> bootstrapTokens
Set<InetAddress> leavingEndpoints
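A token-to-endpoint map is enough to sketch how a simple replication strategy might pick replicas: start at the first node token at or after the search token and walk the ring clockwise until enough distinct endpoints are found. RingSketch below is a hypothetical illustration using a TreeMap with long tokens and string endpoint names; it ignores racks, datacenters, bootstrapping and leaving nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical ring model; not the TokenMetadata or replication strategy classes.
final class RingSketch {
    private final TreeMap<Long, String> tokenToEndpoint = new TreeMap<>();

    void addNode(long token, String endpoint) {
        tokenToEndpoint.put(token, endpoint);
    }

    // Walk clockwise from the first token >= searchToken, collecting distinct endpoints.
    List<String> naturalEndpoints(long searchToken, int replicationFactor) {
        List<String> ordered = new ArrayList<>(tokenToEndpoint.tailMap(searchToken, true).values());
        ordered.addAll(tokenToEndpoint.headMap(searchToken, false).values()); // wrap around the ring
        List<String> endpoints = new ArrayList<>();
        for (String endpoint : ordered) {
            if (endpoints.size() == replicationFactor) {
                break;
            }
            if (!endpoints.contains(endpoint)) {
                endpoints.add(endpoint);
            }
        }
        return endpoints;
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch();
        ring.addNode(0L, "node-a");
        ring.addNode(100L, "node-b");
        ring.addNode(200L, "node-c");
        // Token 150 falls in node-c's range; the next replica wraps around to node-a.
        System.out.println(ring.naturalEndpoints(150L, 2));
    }
}
```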

o.a.c.gms.VersionedValue

// VersionGenerator.getNextVersion()

public final int version;
public final String value;

o.a.c.gms.ApplicationState <<enum>>

STATUS
LOAD
SCHEMA
DC
RACK

(And more...)

o.a.c.gms.HeartBeatState

//VersionGenerator.getNextVersion();

private int generation;
private int version;

o.a.c.gms.Gossiper.GossipTask.run()

// SYN -> ACK -> ACK2
makeRandomGossipDigest()
new GossipDigestSyn()

// Use MessagingService.sendOneWay()
Gossiper.doGossipToLiveMember()
Gossiper.doGossipToUnreachableMember()
Gossiper.doGossipToSeed()

gms.GossipDigestSynVerbHandler.doVerb()

Gossiper.examineGossiper()
new GossipDigestAck()
MessagingService.sendOneWay()

gms.GossipDigestAckVerbHandler.doVerb()

Gossiper.notifyFailureDetector()
Gossiper.applyStateLocally()
Gossiper.makeGossipDigestAck2Message()

gms.GossipDigestAck2VerbHandler.doVerb()

Gossiper.notifyFailureDetector()
Gossiper.applyStateLocally()
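The SYN, ACK, ACK2 exchange relies on comparing what each side knows about an endpoint by generation and version. The sketch below shows one plausible form of that comparison. GossipDigestSketch, Digest and Action are hypothetical names, and the rule (a higher generation wins, then a higher version) is an assumption based on the generation and version fields shown above, not a copy of Gossiper.examineGossiper().

```java
// Hypothetical comparison of gossip digests; names and rules are illustrative.
final class GossipDigestSketch {
    enum Action { REQUEST_FROM_PEER, SEND_TO_PEER, NOTHING }

    static final class Digest {
        final String endpoint;
        final int generation;   // assumed to change when a node restarts
        final int maxVersion;   // assumed to grow as the node's state changes

        Digest(String endpoint, int generation, int maxVersion) {
            this.endpoint = endpoint;
            this.generation = generation;
            this.maxVersion = maxVersion;
        }
    }

    // Decide what to do about one endpoint given our view and the peer's view of it.
    static Action examine(Digest local, Digest remote) {
        if (local == null) {
            return Action.REQUEST_FROM_PEER;            // we know nothing about this endpoint
        }
        if (remote.generation != local.generation) {
            return remote.generation > local.generation ? Action.REQUEST_FROM_PEER : Action.SEND_TO_PEER;
        }
        if (remote.maxVersion != local.maxVersion) {
            return remote.maxVersion > local.maxVersion ? Action.REQUEST_FROM_PEER : Action.SEND_TO_PEER;
        }
        return Action.NOTHING;                          // both sides are equally up to date
    }

    public static void main(String[] args) {
        Digest ours = new Digest("10.0.0.1", 5, 40);
        Digest theirs = new Digest("10.0.0.1", 5, 42);
        System.out.println(examine(ours, theirs));
    }
}
```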

Architecture

API Layer
Dynamo Layer
Database Layer

Database Layer

o.a.c.concurrent
o.a.c.db
o.a.c.cache
o.a.c.io
o.a.c.trace

o.a.c.concurrent.StageManager

stages = new EnumMap<Stage, ThreadPoolExecutor>(Stage.class);

getStage(Stage stage)

o.a.c.concurrent.Stage

READ
MUTATION
GOSSIP
REQUEST_RESPONSE
ANTI_ENTROPY

(And more...)

o.a.c.db.Table

// Keyspace
open(String table)
getColumnFamilyStore(String cfName)
getRow(QueryFilter filter)
apply(RowMutation mutation, boolean writeCommitLog)

o.a.c.db.ColumnFamilyStore

// Column Family
getColumnFamily(QueryFilter filter)
getTopLevelColumns(...)
apply(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)

o.a.c.db.IColumnContainer

addColumn(IColumn column)
remove(ByteBuffer columnName)

ColumnFamily
SuperColumn

o.a.c.db.ISortedColumns

addColumn(IColumn column, Allocator allocator)
removeColumn(ByteBuffer name)

ArrayBackedSortedColumns
AtomicSortedColumns
TreeMapBackedSortedColumns

o.a.c.db.Memtable

put(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)

flushAndSignal(CountDownLatch latch, Future<ReplayPosition> context)

Memtable.FlushRunnable.writeSortedContents()

// SSTableWriter
createFlushWriter()

// Iterate through rows & CF’s in order
writer.append()
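The put-then-flush cycle can be pictured with a sorted in-memory map that is written out in key order. MemtableSketch is a hypothetical model: string keys and values stand in for DecoratedKey and ColumnFamily, a list stands in for the SSTable writer, and merging is reduced to string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical memtable model; not the o.a.c.db.Memtable class.
final class MemtableSketch {
    // A sorted concurrent map keeps rows in key order as they are written.
    private final ConcurrentSkipListMap<String, String> rows = new ConcurrentSkipListMap<>();

    // Writing to an existing row merges the new columns into it.
    void put(String key, String columns) {
        rows.merge(key, columns, (existing, added) -> existing + "," + added);
    }

    // Flush: iterate rows in sorted order and append each to the "SSTable".
    List<String> flush() {
        List<String> sstable = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            sstable.add(row.getKey() + "=" + row.getValue());
        }
        rows.clear();
        return sstable;
    }

    public static void main(String[] args) {
        MemtableSketch memtable = new MemtableSketch();
        memtable.put("row-b", "c1");
        memtable.put("row-a", "c1");
        memtable.put("row-b", "c2");
        System.out.println(memtable.flush());
    }
}
```

Because the map is already sorted, the flush is a single sequential pass, which is why the resulting file can be written in order without a separate sort.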

o.a.c.db.ReadCommand

getRow(Table table)

SliceByNamesReadCommand
SliceFromReadCommand

o.a.c.db.IDiskAtomFilter

getMemtableColumnIterator(...)
getSSTableColumnIterator(...)

IdentityQueryFilter
NamesQueryFilter
SliceQueryFilter

Some query performance...

Today.

Write Path
Read Path

memtable_flush_queue_size test...

m1.xlarge Cassandra node
m1.xlarge client node

1 CF with 6 Secondary Indexes
1 Client Thread

10,000 Inserts, 100 Columns per Row
1100 bytes per Column

CF write latency and memtable_flush_queue_size...

[Chart: latency in microseconds (axis 0 to 1,200) at the 85th, 95th, 99th and 100th percentiles; series: memtable_flush_queue_size=7 and memtable_flush_queue_size=1.]

Request latency and memtable_flush_queue_size...

[Chart: latency in microseconds (axis 0 to 5,000,000) at the 85th, 95th, 99th and 100th percentiles; series: memtable_flush_queue_size=7 and memtable_flush_queue_size=1.]

durable_writes test...

10,000 Inserts, 50 Columns per Row
50 bytes per Column

Request latency and durable_writes (1 client)...

[Chart: latency in microseconds (axis 0 to 7,000) at the 85th, 95th and 99th percentiles; series: durable_writes enabled and disabled.]

Request latency and durable_writes (10 clients)...

[Chart: latency in microseconds (axis 0 to 30,000) at the 85th, 95th and 99th percentiles; series: durable_writes enabled and disabled.]

Request latency and durable_writes (20 clients)...

[Chart: latency in microseconds (axis 0 to 90,000) at the 85th, 95th and 99th percentiles; series: durable_writes enabled and disabled.]

CommitLog tests...

10,000 Inserts, 50 Columns per Row
50 bytes per Column

Periodic commit log adds the mutation to a queue, then acknowledges.

The commit log is appended to by a single thread; sync is called every commitlog_sync_period_in_ms.

Request latency and commitlog_sync_period_in_ms...

[Chart: latency in microseconds (axis 170 to 220) at the 85th, 95th and 99th percentiles; series: commitlog_sync_period_in_ms of 10,000 ms and 10 ms.]

Batch commit log adds the mutation to a queue and waits before acknowledging.

The writer thread processes mutations for commitlog_sync_batch_window_in_ms, then syncs, then signals.
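The difference between the two sync modes comes down to when a write may be acknowledged. The single-threaded sketch below models only that: CommitLogSketch, Mode, add() and sync() are hypothetical names, the writer thread is replaced by an explicit sync() call, and nothing is written to disk.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Hypothetical model of acknowledgement timing; not the Cassandra CommitLog.
final class CommitLogSketch {
    enum Mode { PERIODIC, BATCH }

    private final Mode mode;
    private final Queue<CompletableFuture<Void>> waiting = new ArrayDeque<>();
    private int syncs;

    CommitLogSketch(Mode mode) { this.mode = mode; }

    // Queue a mutation; the returned future completes when the write may be acknowledged.
    CompletableFuture<Void> add(String mutation) {
        if (mode == Mode.PERIODIC) {
            return CompletableFuture.completedFuture(null); // acknowledged before any sync
        }
        CompletableFuture<Void> ack = new CompletableFuture<>();
        waiting.add(ack);                                   // batch: wait for the next sync
        return ack;
    }

    // Stand-in for the writer thread finishing a sync and signalling the waiters.
    void sync() {
        syncs++;
        CompletableFuture<Void> ack;
        while ((ack = waiting.poll()) != null) {
            ack.complete(null);
        }
    }

    int syncCount() { return syncs; }

    public static void main(String[] args) {
        CommitLogSketch batch = new CommitLogSketch(Mode.BATCH);
        CompletableFuture<Void> ack = batch.add("insert row 1");
        System.out.println("acknowledged before sync: " + ack.isDone());
        batch.sync();
        System.out.println("acknowledged after sync: " + ack.isDone());
    }
}
```

In the periodic mode a write can be acknowledged before it has been synced, which is the durability trade-off behind its lower latency.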

Request latency comparing periodic and batch sync...

[Chart: latency in microseconds (axis 0 to 800) at the 85th, 95th and 99th percentiles; series: periodic and batch.]

Merge mutation...

Row level Isolation provided via SnapTree.

(https://github.com/nbronson/snaptree)

Row concurrency tests...

10,000 Columns per Row
50 bytes per Column

50 Columns per Insert

CF Write Latency and row concurrency (10 clients)...

[Chart: latency in microseconds (axis 0 to 2,000) at the 85th, 95th and 99th percentiles; series: different rows and single row.]

Secondary Indexes...

synchronized access to indexed rows.

(Keyspace wide)

Index concurrency tests...

CF with 2 Indexes
10,000 Inserts

6 Columns per Row
35 bytes per Column

Alternating column values

Request latency and index concurrency (10 clients)...

[Chart: latency in microseconds (axis 0 to 4,000) at the 85th, 95th and 99th percentiles; series: different rows and single row.]

Index tests...

10,000 Inserts
50 Columns per Row
50 bytes per Column

Request latency and secondary indexes...

[Chart: latency in microseconds (axis 0 to 3,000) at the 85th, 95th and 99th percentiles; series: no indexes and six indexes.]

Today

Write Path
Read Path

bloom_filter_fp_chance tests...

1,000,000 Rows
50 Columns per Row
50 bytes per Column

commitlog_total_space_in_mb: 1

Read random 10% of rows.
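For a sense of what the two bloom_filter_fp_chance settings cost in memory, the textbook sizing formulas for a Bloom filter can be applied: bits per key = -ln(p) / (ln 2)^2 and hash count = bits per key * ln 2. The sketch below uses only those general formulas; BloomSizingSketch is a hypothetical helper, and Cassandra's own filter sizing may round or cap these values differently.

```java
// Textbook Bloom filter sizing; not Cassandra's implementation.
final class BloomSizingSketch {
    // Optimal bits per key for a target false positive chance p.
    static double bitsPerKey(double fpChance) {
        return -Math.log(fpChance) / (Math.log(2) * Math.log(2));
    }

    // Optimal number of hash functions for that size, at least one.
    static long hashCount(double fpChance) {
        return Math.max(1, Math.round(bitsPerKey(fpChance) * Math.log(2)));
    }

    // Approximate filter size in MiB for a given number of row keys.
    static double megabytes(long keys, double fpChance) {
        return keys * bitsPerKey(fpChance) / 8.0 / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.000744, 0.1}) {
            System.out.printf("p=%s bitsPerKey=%.2f hashes=%d MiB for 1,000,000 rows=%.2f%n",
                    p, bitsPerKey(p), hashCount(p), megabytes(1_000_000L, p));
        }
    }
}
```

Under these formulas a chance of 0.000744 needs roughly 15 bits per key, while 0.1 needs under 5, so the looser setting trades about a third of the memory for more needless SSTable reads.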

CF read latency and bloom_filter_fp_chance...

[Chart: latency in microseconds (axis 0 to 7,000) at the 85th, 95th and 99th percentiles; series: bloom_filter_fp_chance default (0.000744) and 0.1.]

key_cache_size_in_mb tests...

10,000 Rows
50 Columns per Row
50 bytes per Column

Read all Rows

CF read latency and key_cache_size_in_mb...

[Chart: latency in microseconds (axis 0 to 300) at the 85th, 95th and 99th percentiles; series: default (100MB) with 100% hit rate, and disabled.]

index_interval tests...

100,000 Rows
50 Columns per Row
50 bytes per Column

key_cache_size_in_mb: 0

Read 1 Column from random 10% of Rows
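The effect of index_interval can be sketched as a sampled index: keep every Nth key of a sorted index in memory, binary search the sample, then scan forward in the full index from the sampled position. IndexSampleSketch below is a hypothetical model with integer keys and an in-memory list standing in for the on-disk index; it only counts how many index entries a lookup has to scan.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sampled-index model; not Cassandra's index summary code.
final class IndexSampleSketch {
    private final List<Integer> index;                        // stand-in for the sorted on-disk index
    private final List<Integer> sample = new ArrayList<>();   // every interval-th key, held in memory
    private final int interval;

    IndexSampleSketch(List<Integer> sortedKeys, int interval) {
        this.index = sortedKeys;
        this.interval = interval;
        for (int i = 0; i < sortedKeys.size(); i += interval) {
            sample.add(sortedKeys.get(i));
        }
    }

    // Returns how many index entries were scanned to find the key, or -1 if it is absent.
    int entriesScanned(int key) {
        int pos = Collections.binarySearch(sample, key);
        if (pos < 0) {
            pos = -pos - 2;              // greatest sampled key below the target
        }
        if (pos < 0) {
            return -1;                   // key sorts before the first indexed key
        }
        int start = pos * interval;
        int end = Math.min(start + interval, index.size());
        for (int i = start; i < end; i++) {
            if (index.get(i) == key) {
                return i - start + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add(i);
        }
        System.out.println("interval 128: " + new IndexSampleSketch(keys, 128).entriesScanned(511));
        System.out.println("interval 512: " + new IndexSampleSketch(keys, 512).entriesScanned(511));
    }
}
```

A larger interval shrinks the in-memory sample but lengthens the worst-case scan, which is the trade-off the test on this slide measures.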

CF read latency and index_interval...

[Chart: latency in microseconds (axis 0 to 20,000) at the 85th, 95th and 99th percentiles; series: index_interval=128 (default) and index_interval=512.]

row_cache_size_in_mb tests...

100,000 Rows
50 Columns per Row
50 bytes per Column

Read all Rows

CF read latency and row_cache_size_in_mb...

[Chart: latency in microseconds (axis 0 to 260) at the 85th, 95th and 99th percentiles; series: row_cache_size_in_mb=0 with key_cache_size_in_mb=100mb, and row_cache_size_in_mb=100mb with key_cache_size_in_mb=0.]

Column Index tests...

Read first Column by name from 1,200 Columns.

Read first Column by name from 1,000,000 Columns.

CF read latency and Column Index...

[Chart: latency in microseconds (axis 0 to 6,000) at the 85th, 95th and 99th percentiles; series: first Column from 1,200 and first Column from 1,000,000.]

Name Locality tests...

1,000,000 Columns
50 bytes per Column

Read 100 Columns from middle of row.
Read 100 Columns from spread across row.

CF read latency and name locality...

[Chart: latency in microseconds (axis 0 to 200,000) at the 85th, 95th and 99th percentiles; series: adjacent Columns and spread Columns.]

Start position tests...

1,000,000 Columns
50 bytes per Column

Read first 100 Columns without start.
Read first 100 Columns with start.

CF read latency and start position...

[Chart: latency in microseconds (axis 0 to 40,000) at the 85th, 95th and 99th percentiles; series: without start position and with start position.]

Start offset tests...

1,000,000 Columns
50 bytes per Column

Read first 100 Columns with start.
Read middle 100 Columns with start.

CF read latency and start offset...

[Chart: latency in microseconds (axis 0 to 40,000) at the 85th, 95th and 99th percentiles; series: first and middle.]

Start offset tests...

1,000,000 Columns
50 bytes per Column

Read first 100 Columns without start.
Read last 100 Columns with reversed.

CF read latency and reversed...

[Chart: latency in microseconds (axis 0 to 40,000) at the 85th, 95th and 99th percentiles; series: forward and reversed.]

Thanks.
