CQL Performance With Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C* Summit 2016
Aaron Morton, @aaronmorton
CEO
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
How We Got Here
Storage Engine 3.0
Read Path
How We Got Here
Way back in 2011…
2011
Blog: Cassandra Query Plans
http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
2012
Talk: Technical Deep Dive - Query Performance
https://www.youtube.com/watch?v=gomOKhMV0zc
2012
Explain Read & Write performance in 45 minutes.
Skip Forward to 2016
Blog: Introduction To The Apache Cassandra 3.x Storage Engine
http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html
“Why don’t I do another talk about Cassandra performance.”
It was a busy 4 years…
CQL 3, Collection Types, UDTs, UDFs, UDAs, Materialised Views, Triggers, SASI,…
Explain Read & Write performance in 45 minutes.
So Let's Avoid
CQL 3, Collection Types, UDTs, UDFs, UDAs, Materialised Views, Triggers, SASI,…
High Level Storage Engine 3.0
Storage Engine 3.0 Files
Data.db
Index.db
Filter.db
CompressionInfo.db
Statistics.db
Digest.crc32
CRC.db
Summary.db
TOC.txt
CQL Recap
    create table my_table (
        partition_1 text,
        cluster_1 text,
        foo text,
        bar text,
        baz text,
        PRIMARY KEY (partition_1, cluster_1)
    );
CQL Recap
WARNING: FAKE DATA AHEAD
CQL With Thrift Pre 3.0
    [default@dev] list my_table;
    -------------------
    RowKey: part_a
    => (column=clust_a:, value=, timestamp=1357…739000)
    => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
    => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
    => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
    => (column=clust_b:, value=, timestamp=1357…739000)
    => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
    => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
    => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)
CQL Pre 3.0
Clustering Keys Repeated
Column Names Repeated
Timestamps Repeated
Fixed Width Encoding
No Knowledge Of Row Contents
Storage Engine 3.0 Improvements
Delta Encoding
Variable Int Encoding
Clustering Written Once
Aggregated Metadata
Cell Presence
SerializationHeader
For each SSTable*.
Stored in each SSTable.
Held in memory.
SerializationHeader
    public class SerializationHeader
    {
        private final AbstractType<?> keyType;
        private final List<AbstractType<?>> clusteringTypes;
        private final PartitionColumns columns;
        private final EncodingStats stats;
        …
    }
EncodingStats
Collected on the fly by the Memtable.
EncodingStats
    public class EncodingStats
    {
        public final long minTimestamp;
        public final int minLocalDeletionTime;
        public final int minTTL;
        …
    }
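Collecting these minimums "on the fly" could look roughly like the sketch below. It is only an illustration: `StatsCollectorSketch` and its `update` method are hypothetical names, not Cassandra's API, and the sample values are made up. Only the three field names come from the slide above.

```java
// Hypothetical sketch of collecting encoding stats while cells are written.
// This is NOT Cassandra's EncodingStats; only the field names mirror the slide.
final class StatsCollectorSketch {
    long minTimestamp = Long.MAX_VALUE;
    int minLocalDeletionTime = Integer.MAX_VALUE;
    int minTTL = Integer.MAX_VALUE;

    // Called once per cell; only the running minimums are kept.
    void update(long timestamp, int localDeletionTime, int ttl) {
        minTimestamp = Math.min(minTimestamp, timestamp);
        minLocalDeletionTime = Math.min(minLocalDeletionTime, localDeletionTime);
        minTTL = Math.min(minTTL, ttl);
    }

    public static void main(String[] args) {
        StatsCollectorSketch stats = new StatsCollectorSketch();
        stats.update(1_000_500L, 86_400, 3_600); // made-up cell metadata
        stats.update(1_000_200L, 90_000, 7_200);
        System.out.println(stats.minTimestamp + " " + stats.minLocalDeletionTime + " " + stats.minTTL);
    }
}
```

Keeping only minimums means every later value can be stored as a non-negative delta against them.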
SerializationHeader
    public class SerializationHeader
    {
        public void writeTimestamp(long timestamp, DataOutputPlus out) throws IOException
        {
            out.writeUnsignedVInt(timestamp - stats.minTimestamp);
        }
        …
    }
VIntCoding
    public class VIntCoding
    {
        public static void writeUnsignedVInt(long value, DataOutput output) throws IOException
        {
            int size = VIntCoding.computeUnsignedVIntSize(value);
            if (size == 1)
            {
                output.write((int)value);
                return;
            }
            output.write(VIntCoding.encodeVInt(value, size), 0, size);
        }
        …
    }
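To see why delta encoding and variable-length integers work well together, here is a small self-contained sketch. It is a simplified illustration in the spirit of the excerpts above, not Cassandra's `VIntCoding`: the class name `DeltaVIntSketch`, its methods, and the sample timestamps are all assumptions made for the example. The scheme assumed here uses the number of leading 1-bits in the first byte to say how many extra bytes follow.

```java
// Simplified sketch of delta + variable-length integer encoding.
// NOT Cassandra's VIntCoding; names and sample values are made up.
final class DeltaVIntSketch {
    // Bytes needed: 7 payload bits per byte for up to 8 bytes, otherwise 9 bytes.
    static int size(long value) {
        int bits = 64 - Long.numberOfLeadingZeros(value | 1); // at least 1 significant bit
        return bits > 56 ? 9 : (bits + 6) / 7;
    }

    // Big-endian payload; leading 1-bits of the first byte count the extra bytes.
    static byte[] encode(long value) {
        int extra = size(value) - 1;
        byte[] out = new byte[extra + 1];
        for (int i = extra; i >= 0; i--) {
            out[i] = (byte) value;
            value >>>= 8;
        }
        out[0] |= (byte) ~(0xFF >> extra);
        return out;
    }

    public static void main(String[] args) {
        long minTimestamp = 1_000_000_000_000_000L; // fake SSTable-wide minimum
        long timestamp = minTimestamp + 5_000L;     // fake cell timestamp
        int deltaBytes = encode(timestamp - minTimestamp).length;
        System.out.println("fixed width: 8 bytes, delta as vint: " + deltaBytes + " bytes");
    }
}
```

The full timestamp would not shrink much as a variable-length integer on its own; it is the small delta against the minimum that makes the short encoding possible.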
Storage Engine 3.0 Data.db
Storage Engine 3.0 Partition Header
Storage Engine 3.0 Row
Storage Engine 3.0 Clustering Block
Aggregated Cell Metadata
Only store Cell Timestamp, TTL, and Local Deletion Time if different to
the Row.
Aggregated Cell Metadata

Simple Cell Component                        Byte Size
Flags                                        1
Optional Cell Timestamp (delta)              varint 1…n
Optional Cell Local Deletion Time (delta)    varint 1…n
Optional Cell TTL (delta)                    varint 1…n
Optional Cell Value                          See Below

Fixed Width Cell Value                       Byte Size
Value                                        1…n

Variable Width Cell Value                    Byte Size
Value Length                                 varint 1…n
Value                                        1…n
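The layout above can be read as "a flags byte says which optional fields follow". A minimal sketch of that idea is below. The flag names and bit values are hypothetical (the talk does not give them), and `CellMetadataSketch` is not a Cassandra class; it only illustrates skipping metadata that matches the row.

```java
// Hypothetical sketch: a cell only writes metadata that differs from its row.
// Flag names and bit values are assumptions, not Cassandra's real flags.
final class CellMetadataSketch {
    static final int USE_ROW_TIMESTAMP = 1;
    static final int USE_ROW_DELETION_TIME = 2;
    static final int USE_ROW_TTL = 4;

    // Sets one flag for each piece of metadata the cell shares with its row.
    static int flags(long ts, int deletionTime, int ttl,
                     long rowTs, int rowDeletionTime, int rowTtl) {
        int flags = 0;
        if (ts == rowTs) flags |= USE_ROW_TIMESTAMP;
        if (deletionTime == rowDeletionTime) flags |= USE_ROW_DELETION_TIME;
        if (ttl == rowTtl) flags |= USE_ROW_TTL;
        return flags;
    }

    // Number of metadata components written: the flags byte plus each unshared field.
    static int componentsWritten(int flags) {
        int count = 1; // the flags byte is always written
        if ((flags & USE_ROW_TIMESTAMP) == 0) count++;
        if ((flags & USE_ROW_DELETION_TIME) == 0) count++;
        if ((flags & USE_ROW_TTL) == 0) count++;
        return count;
    }

    public static void main(String[] args) {
        int same = flags(100L, 0, 0, 100L, 0, 0);    // cell written together with its row
        int newer = flags(105L, 0, 0, 100L, 0, 0);   // cell updated later than the row
        System.out.println(componentsWritten(same) + " vs " + componentsWritten(newer));
    }
}
```

When a whole row is written in one statement, every cell shares the row's metadata, so in this sketch only the flags byte remains per cell.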
Apache Cassandra 3.0 Storage Engine
Cell Presence
The SSTable stores a list of the Cells in this SSTable.
Each Row stores a bitmap of the Cells in this Row, with reference to the SSTable list.
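One way to picture cell presence is a bitmap indexed against the SSTable-wide column list. The sketch below is an assumption-laden illustration, not Cassandra's on-disk encoding: `CellPresenceSketch` and its methods are hypothetical, and the column list simply reuses the non-key columns from the CQL recap table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of per-row cell presence; not Cassandra's actual encoding.
final class CellPresenceSketch {
    // Column names known to the whole SSTable, stored once, in a fixed order.
    static final List<String> SSTABLE_COLUMNS = Arrays.asList("bar", "baz", "foo");

    // One bit per SSTable column; bit i set means column i has a cell in this row.
    static long presenceBitmap(List<String> rowColumns) {
        long bitmap = 0;
        for (String name : rowColumns) {
            int index = SSTABLE_COLUMNS.indexOf(name);
            if (index < 0) throw new IllegalArgumentException("unknown column: " + name);
            bitmap |= 1L << index;
        }
        return bitmap;
    }

    // Recovers the column names from a row bitmap without storing any names in the row.
    static List<String> columnsIn(long bitmap) {
        List<String> present = new ArrayList<>();
        for (int i = 0; i < SSTABLE_COLUMNS.size(); i++) {
            if ((bitmap & (1L << i)) != 0) present.add(SSTABLE_COLUMNS.get(i));
        }
        return present;
    }

    public static void main(String[] args) {
        long bitmap = presenceBitmap(Arrays.asList("foo", "bar")); // row with no baz cell
        System.out.println(Long.toBinaryString(bitmap) + " " + columnsIn(bitmap));
    }
}
```

The row carries a few bits instead of repeating each column name, which is the contrast with the pre-3.0 layout.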
Storage Engine 3.0 Row
Remember Where We Came From
    [default@dev] list my_table;
    -------------------
    RowKey: part_a
    => (column=clust_a:, value=, timestamp=1357…739000)
    => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
    => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
    => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
    => (column=clust_b:, value=, timestamp=1357…739000)
    => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
    => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
    => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)
Read Paths
Ignoring Index Read paths.
Read Commands
PartitionRangeReadCommand
SinglePartitionReadCommand
AbstractClusteringIndexFilter
ClusteringIndexNamesFilter (When we know the column names.)
ClusteringIndexSliceFilter (When we do not know the column names.)
ClusteringIndexNamesFilter
When we know what Columns to select, we know
when the search is over.
ClusteringIndexNamesFilter
1. Get Partition From Memtables.
2. Filter named columns into a temporary result.
3. Select SSTables that may contain Partition Key.
4. Order SSTables in descending max timestamp order.
5. Read from SSTables in order.
Names Filter Short Circuits
If result has a Partition Deletion newer than next SSTable max
timestamp.
Stop Search.
Names Filter Short Circuits
If read all Columns and max timestamp of next SSTable less than selected Columns min timestamp.
Stop Search.
Names Filter Short Circuits
If search clustering value not within clustering range in the SSTable.
Skip SSTable.
Names Filter Short Circuits
If SSTable Cell not in search set.
Skip reading value.
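The names filter steps and two of its short circuits (the "read all Columns" stop, and skipping cells outside the search set) can be sketched as a loop. This is a toy model under stated assumptions: `NamesFilterSketch`, `SSTable`, and `Result` are hypothetical types, each SSTable is reduced to a map of column name to cell timestamp for a single partition, and memtables, partition deletions, and clustering ranges are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of a names-filter read; not Cassandra's SinglePartitionReadCommand.
final class NamesFilterSketch {
    static final class SSTable {
        final long maxTimestamp;
        final Map<String, Long> cells; // column name -> cell timestamp
        SSTable(long maxTimestamp, Map<String, Long> cells) {
            this.maxTimestamp = maxTimestamp;
            this.cells = cells;
        }
    }

    static final class Result {
        final Map<String, Long> newest = new HashMap<>(); // column name -> newest timestamp seen
        int sstablesRead = 0;
    }

    static Result read(Set<String> wanted, List<SSTable> candidates) {
        List<SSTable> ordered = new ArrayList<>(candidates);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());
        Result result = new Result();
        for (SSTable sstable : ordered) {
            // Stop: all wanted columns are found and this SSTable (and every later,
            // older one) cannot hold anything newer than what is already selected.
            if (result.newest.keySet().containsAll(wanted)
                    && sstable.maxTimestamp < minTimestamp(result.newest)) {
                break;
            }
            result.sstablesRead++;
            for (String name : wanted) { // cells outside the search set are never looked at
                Long ts = sstable.cells.get(name);
                if (ts != null) result.newest.merge(name, ts, Long::max);
            }
        }
        return result;
    }

    static long minTimestamp(Map<String, Long> newest) {
        long min = Long.MAX_VALUE;
        for (long ts : newest.values()) min = Math.min(min, ts);
        return min;
    }
}
```

Because the SSTables are visited newest first, the loop can stop as soon as the oldest selected cell is newer than everything the remaining SSTables could contain.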
ClusteringIndexSliceFilter
When we do not know which columns to select, the search ends when it is exhausted.
ClusteringIndexSliceFilter
Used with:
Distinct.
Not all clustering columns restricted.
ClusteringIndexSliceFilter
1. Get Partition From Memtables.
2. Create Iterators for Partitions.
3. Select SSTables that may contain Partition Key.
4. Order in reverse max timestamp order.
5. Create Iterators for SSTables in order.
Slice Filter Short Circuits
If SSTable max timestamp is before max seen Partition Deletion
timestamp.
Stop Search.
Slice Filter Short Circuits
If search clustering value not within clustering range in the SSTable.
Skip SSTable.
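The slice filter's SSTable selection, with both short circuits, can be sketched in the same toy style. Everything here is an assumption for illustration: `SliceFilterSketch` and its `SSTable` type are hypothetical, clustering values are plain strings, and the sketch treats a skipped SSTable's partition deletion as already known, which simplifies whatever the real read path does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of slice-filter SSTable selection; not Cassandra's implementation.
final class SliceFilterSketch {
    static final class SSTable {
        final String name;
        final long maxTimestamp;
        final long partitionDeletion; // Long.MIN_VALUE when there is no partition deletion
        final String minClustering;
        final String maxClustering;
        SSTable(String name, long maxTimestamp, long partitionDeletion,
                String minClustering, String maxClustering) {
            this.name = name;
            this.maxTimestamp = maxTimestamp;
            this.partitionDeletion = partitionDeletion;
            this.minClustering = minClustering;
            this.maxClustering = maxClustering;
        }
    }

    // Returns the names of the SSTables an iterator would be created for.
    static List<String> select(List<SSTable> candidates, String sliceStart, String sliceEnd) {
        List<SSTable> ordered = new ArrayList<>(candidates);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());
        long maxSeenDeletion = Long.MIN_VALUE;
        List<String> selected = new ArrayList<>();
        for (SSTable s : ordered) {
            // Stop: everything from here on is older than a partition deletion already seen.
            if (s.maxTimestamp < maxSeenDeletion) break;
            maxSeenDeletion = Math.max(maxSeenDeletion, s.partitionDeletion);
            // Skip: the SSTable's clustering range does not overlap the requested slice.
            if (s.maxClustering.compareTo(sliceStart) < 0
                    || s.minClustering.compareTo(sliceEnd) > 0) continue;
            selected.add(s.name);
        }
        return selected;
    }
}
```

Unlike the names filter, there is no "all columns found" stop here, so without a partition deletion the search only ends when the candidate SSTables are exhausted.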
Thanks.
Aaron Morton, @aaronmorton
Co-Founder & Principal Consultant
www.thelastpickle.com