Cassandra @ Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
TRANSCRIPT
CASSANDRA @ INSTAGRAM 2016
Dikang Gu, Software Engineer @ Facebook
ABOUT ME
• @dikanggu
• Software Engineer
• Instagram Core Infra, 2014 - present
• Facebook Data Infra, 2012 - 2014
AGENDA
1 Overview
2 Improvements
3 Challenges
OVERVIEW
OVERVIEW: Cluster Deployment
• Cassandra Nodes: 1,000+
• Data Size: hundreds of terabytes
• Ops/sec: in the millions
• Largest Cluster: 100+ nodes
• Regions: multiple
OVERVIEW
• Client: Python/C++/Java/PHP
• Protocol: mostly Thrift, some CQL
• Versions: 2.0.x - 2.2.x
• Compaction: LCS for most tables
TEAM
USE CASE 1: Feed
PUSH
When posting, we push the media information to the followers' feed store.
When reading, we fetch the feed ids from the viewer's feed store.
USE CASE 1Feed
• Write QPS: 1M+
• Avg/P99 Read Latency: 20ms/100ms
• Data Model:
user_id -> List(media_id)
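A minimal sketch of the push model above, with an in-memory map standing in for the feed store; all names here are illustrative, not Instagram's actual code:

import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the push (fan-out-on-write) feed model:
// user_id -> List(media_id), stored per follower.
public class FeedStore {
    private final ConcurrentMap<Long, List<Long>> feeds = new ConcurrentHashMap<>();

    // On post: push the new media id into every follower's feed.
    public void push(long mediaId, Collection<Long> followerIds) {
        for (long follower : followerIds) {
            feeds.computeIfAbsent(follower, id -> new CopyOnWriteArrayList<>())
                 .add(mediaId);
        }
    }

    // On read: fetch the feed ids from the viewer's own feed store.
    public List<Long> fetch(long viewerId) {
        return feeds.getOrDefault(viewerId, Collections.emptyList());
    }

    public static void main(String[] args) {
        FeedStore store = new FeedStore();
        store.push(1001L, Arrays.asList(1L, 2L, 3L)); // a user posts media 1001
        System.out.println(store.fetch(2L));          // [1001]
    }
}

The trade-off of fan-out-on-write is doing O(followers) work per post so that each read stays a single lookup in the viewer's own partition.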
USE CASE 2: Metadata store
Applications use C* as a key-value store: they store a list of blobs associated with a key, and issue point queries or range queries at read time.
USE CASE 2: Metadata store
• Read/Write QPS: 100K+
• Avg read size: 50KB
• Avg/P99 Read Latency: 7ms/50ms
• Data Model:
user_id -> List(Blob)
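A sketch of the point/range access pattern, with a sorted in-memory map standing in for a C* partition ordered by a clustering column; the names are hypothetical:

import java.util.*;

// Sketch of the metadata-store pattern: a key owns a sorted list of blobs,
// read either by point lookup or by range scan (like a C* partition with
// a clustering column). All names are illustrative.
public class MetadataStore {
    // user_id -> (blob_id -> blob payload), blobs kept sorted by id
    private final Map<Long, NavigableMap<Long, byte[]>> rows = new HashMap<>();

    public void put(long userId, long blobId, byte[] blob) {
        rows.computeIfAbsent(userId, id -> new TreeMap<>()).put(blobId, blob);
    }

    // Point query: one blob under the key.
    public byte[] get(long userId, long blobId) {
        NavigableMap<Long, byte[]> row = rows.get(userId);
        return row == null ? null : row.get(blobId);
    }

    // Range query: all blobs with ids in [from, to).
    public Collection<byte[]> range(long userId, long from, long to) {
        NavigableMap<Long, byte[]> row = rows.get(userId);
        return row == null ? Collections.<byte[]>emptyList()
                           : row.subMap(from, true, to, false).values();
    }
}

Keeping the blobs sorted under the key is what makes the range query cheap, mirroring how C* lays out clustering columns within a partition.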
USE CASE 3: Counter
Applications issue bump/get counter operations for each user request.
USE CASE 3: Counter
• Read/Write QPS: 50K+
• Avg/P99 Read Latency: 3ms/50ms
• C* 2.2
• Data Model:
some_id -> Counter
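Roughly what a bump/get looks like through CQL, sketched with the 2.x-era DataStax Java driver; the keyspace, table, and column names are assumptions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Sketch of the bump/get counter pattern over CQL, assuming a table like:
//   CREATE TABLE app.counters (some_id bigint PRIMARY KEY, value counter);
public class CounterClient {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect("app");

            // bump: counter columns can only be incremented or decremented
            session.execute(
                "UPDATE counters SET value = value + 1 WHERE some_id = ?", 42L);

            // get: read the current value back
            Row row = session.execute(
                "SELECT value FROM counters WHERE some_id = ?", 42L).one();
            System.out.println(row.getLong("value"));
        }
    }
}

Counters are a special column type that supports only increments and decrements, which is why the bump is expressed as value = value + 1 rather than a plain write.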
IMPROVEMENTS
1. PROXY NODES
PROXY NODE: Problem
• Thrift client, NOT token aware
• Data node coordinates the requests
• High latency and timeouts when a data node is hot
PROXY NODE: Solution
• join_ring: false
• acts as coordinator only
• stores no data locally
• clients talk only to the proxy nodes (see the sketch below)
• 2X latency drop
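One way a client could be pinned to the proxy tier, sketched with the DataStax Java driver's WhiteListPolicy; the hostnames are placeholders, and since the talk's clients were mostly Thrift this is an illustration rather than Instagram's actual setup:

import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;

// Sketch: pin a client to coordinator-only proxy nodes (Cassandra
// processes started with -Dcassandra.join_ring=false). Hostnames are
// placeholders.
public class ProxyOnlyClient {
    public static void main(String[] args) {
        List<InetSocketAddress> proxies = Arrays.asList(
            new InetSocketAddress("proxy1.example.com", 9042),
            new InetSocketAddress("proxy2.example.com", 9042));

        try (Cluster cluster = Cluster.builder()
                .addContactPoints("proxy1.example.com", "proxy2.example.com")
                // never open connections to data nodes, only to the proxies
                .withLoadBalancingPolicy(
                    new WhiteListPolicy(new RoundRobinPolicy(), proxies))
                .build()) {
            System.out.println(cluster.getMetadata().getClusterName());
        }
    }
}

The proxy nodes themselves are regular Cassandra processes started with the cassandra.join_ring=false property, so they coordinate requests but own no token ranges.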
2. PENDING RANGES (CASSANDRA-9258)
PENDING RANGES: Problem
• CPU usage +30% when bootstrapping new nodes
• Client request latency jumps and timeouts
• Multimap<Range<Token>, InetAddress> PendingRange
• Inefficient O(n) pendingRanges lookup per request
PENDING RANGES: Solution
• CASSANDRA-9258
• Use two NavigableMaps to implement the pending ranges (see the sketch below)
• We can expand or shrink the cluster without affecting requests
• Thanks to Branimir Lambov for the patch review and feedback
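The core idea of the fix, as a standalone sketch: instead of scanning a Multimap of ranges on every request, index the range bounds in a sorted map so a token lookup is O(log n). This is simplified to one map and non-wrapping ranges; the actual patch uses two NavigableMaps to also cover wrap-around ranges:

import java.util.*;

// Simplified sketch of the CASSANDRA-9258 idea: index pending ranges by
// their end token in a sorted map, so finding the pending endpoints for
// a request token is O(log n) instead of a scan over every range.
public class PendingRanges {
    // range end token -> (range start token, endpoints owning that range)
    private final NavigableMap<Long, Map.Entry<Long, Set<String>>> byEnd =
        new TreeMap<>();

    // Register a non-wrapping range (start, end] owned by some endpoints.
    public void add(long start, long end, Set<String> endpoints) {
        byEnd.put(end, new AbstractMap.SimpleEntry<>(start, endpoints));
    }

    // Endpoints whose pending range covers the token, if any.
    public Set<String> pendingEndpointsFor(long token) {
        Map.Entry<Long, Map.Entry<Long, Set<String>>> e = byEnd.ceilingEntry(token);
        if (e != null && token > e.getValue().getKey()) {
            return e.getValue().getValue();
        }
        return Collections.emptySet();
    }

    public static void main(String[] args) {
        PendingRanges pr = new PendingRanges();
        pr.add(100, 200, Collections.singleton("10.0.0.5"));
        System.out.println(pr.pendingEndpointsFor(150)); // [10.0.0.5]
        System.out.println(pr.pendingEndpointsFor(250)); // []
    }
}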
3. DYNAMIC SNITCH (CASSANDRA-6908)
DYNAMIC SNITCH
• High read latency during peak time
• Unnecessary cross-region requests
• dynamic_snitch_badness_threshold: 50 (illustrated below)
• 10X P99 latency drop
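What the threshold does, in sketch form: the dynamic snitch only reorders replicas away from the statically closest one when that replica's latency score is (1 + threshold) times worse than the alternative, so a value of 50 effectively pins reads to the same region. A simplified illustration, not Cassandra's actual scoring code:

// Simplified illustration of dynamic_snitch_badness_threshold: keep the
// statically closest replica (e.g. same region) unless its measured
// latency score is (1 + threshold) times worse than the alternative.
// With threshold = 50, reads almost never leave the region.
public class BadnessThresholdDemo {
    static String pickReplica(double localScore, double remoteScore,
                              double badnessThreshold) {
        // Only abandon the statically preferred (local) replica when it
        // is dramatically worse than the remote one.
        if (localScore > remoteScore * (1 + badnessThreshold)) {
            return "remote";
        }
        return "local";
    }

    public static void main(String[] args) {
        // local replica 10x slower: still pinned with threshold = 50
        System.out.println(pickReplica(10.0, 1.0, 50.0));  // local
        // local replica 100x slower: finally fail over cross-region
        System.out.println(pickReplica(100.0, 1.0, 50.0)); // remote
    }
}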
4. COMPACTION
COMPACTION IMPROVEMENTS
• Track the write amplification. (CASSANDRA-11420, sketched after this list)
• Optimize the overlapping lookup. (CASSANDRA-11571)
• Optimize the isEOF() checking. (CASSANDRA-12013)
• Avoid searching for column index. (CASSANDRA-11450)
• Persist last compacted key per level. (CASSANDRA-6216)
• Compact tables before making available in L0. (CASSANDRA-10862)
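The first item is easy to picture: write amplification is how many times each flushed byte gets rewritten by compaction. A back-of-the-envelope sketch in the spirit of CASSANDRA-11420, not the actual patch:

import java.util.concurrent.atomic.LongAdder;

// Sketch of write-amplification tracking: WA = (flushed + compacted
// bytes) / flushed bytes. With LCS, every level a row passes through
// rewrites it one more time.
public class WriteAmplification {
    private final LongAdder flushedBytes = new LongAdder();
    private final LongAdder compactedBytes = new LongAdder();

    public void onFlush(long bytes)      { flushedBytes.add(bytes); }
    public void onCompaction(long bytes) { compactedBytes.add(bytes); }

    public double ratio() {
        long flushed = flushedBytes.sum();
        return flushed == 0 ? 0
             : (flushed + compactedBytes.sum()) / (double) flushed;
    }

    public static void main(String[] args) {
        WriteAmplification wa = new WriteAmplification();
        wa.onFlush(1_000);      // 1 KB enters L0
        wa.onCompaction(4_000); // rewritten on its way down the levels
        System.out.println(wa.ratio()); // 5.0
    }
}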
5. BIG HEAP SIZE
BIG HEAP SIZE
• 64G max heap size
• 16G new gen size
• -XX:MaxTenuringThreshold=6
• Young GC every 10 seconds
• Avoid full GC
• 2X P99 latency drop
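Expressed as JVM flags, the first three bullets map directly to:

-Xmx64g
-Xmn16g
-XX:MaxTenuringThreshold=6

How full GCs were avoided beyond the tenuring setting (collector choice and its flags) is not spelled out in the talk.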
6. NODETOOL REBUILD RANGE (CASSANDRA-10406)
NODETOOL REBUILD
• rebuild may fail for nodes with TBs of data
• CASSANDRA-10406
• adds support for rebuilding only the failed token ranges (sketched below)
• Thanks to Yuki Morishita for the review
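The idea behind the change, sketched abstractly: treat rebuild as independent per-range streaming tasks, record which ranges fail, and retry only those instead of restarting a multi-TB rebuild from scratch. streamRange below is a hypothetical stand-in for the real streaming call:

import java.util.*;
import java.util.function.Predicate;

// Conceptual sketch of resumable rebuild: stream each token range
// independently and retry only the ranges that failed, rather than
// restarting the whole rebuild.
public class ResumableRebuild {
    static Set<String> rebuild(Collection<String> ranges,
                               Predicate<String> streamRange) {
        Set<String> failed = new HashSet<>();
        for (String range : ranges) {
            if (!streamRange.test(range)) {
                failed.add(range); // remember the failure, don't abort everything
            }
        }
        return failed; // feed back into the next rebuild invocation
    }

    public static void main(String[] args) {
        List<String> ranges = Arrays.asList("(0,100]", "(100,200]", "(200,300]");
        // pretend the middle range's stream breaks
        Set<String> failed = rebuild(ranges, r -> !r.equals("(100,200]"));
        System.out.println("retry only: " + failed); // [(100,200]]
    }
}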
CHALLENGES
PERFORMANCE
P99 Read Latency
High latency on the C* nodes, and even higher on the client side.
PERFORMANCE
Compaction has difficulty catching up, which impacts read latency.
PERFORMANCE
Compaction uses too much CPU (40%+)
PERFORMANCE
Tombstones
SCALABILITY
Gossip: nodes see an inconsistent ring (CASSANDRA-11709, CASSANDRA-11740)
FEATURES
Counters: problems with repair (CASSANDRA-11432, CASSANDRA-10862)
SSTables in each level: [966/4, 20/10, 152/100, 33, 0, 0, 0, 0, 0] (levels far above their targets, e.g. 966 SSTables in L0 against a target of 4)
CLIENT
Access C* from different languages
[Diagram: multiple services sit between the language-specific clients and the Cassandra cluster]
OPERATIONS
Cluster expansion takes a long time:
15 days to bootstrap 30 nodes (nodes bootstrap one at a time, roughly half a day each)
RECAP
Improvements:
• Proxy node
• Pending ranges
• Dynamic snitch
• Compaction
• Big heap size
• Nodetool rebuild range

Challenges:
• P99 read latency
• Compaction
• Tombstones
• Gossip
• Counters
• Client
• Cluster expansion
QUESTIONS?