Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016

CASSANDRA @ INSTAGRAM 2016

Dikang Gu, Software Engineer @ Facebook

TRANSCRIPT

Page 1

CASSANDRA @ INSTAGRAM 2016

Dikang Gu, Software Engineer @ Facebook

Page 2

ABOUT ME

• @dikanggu

• Software Engineer

• Instagram core infra, 2014 - present

• Facebook data infra, 2012 - 2014

Page 3

AGENDA

1 Overview

2 Improvements

3 Challenges

Page 4

OVERVIEW

Page 5

OVERVIEW: Cluster Deployment

• Cassandra Nodes: 1,000+

• Data Size: hundreds of terabytes

• Ops/sec: in the millions

• Largest Cluster: 100+ nodes

• Regions: multiple

Page 6

OVERVIEW

• Client: Python/C++/Java/PHP

• Protocol: mostly Thrift, some CQL

• Versions: 2.0.x - 2.2.x

• Use LCS for most tables.

Page 7

TEAM

Page 8

USE CASE 1: Feed

PUSH

When posting, we push the media information to the followers' feed store.

When reading, we fetch the feed ids from the viewer's feed store.

Page 9

USE CASE 1: Feed

• Write QPS: 1M+

• Avg/P99 Read Latency: 20ms/100ms

• Data Model:

user_id -> List(media_id)
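
For illustration only (not from the talk): a minimal sketch of this fan-out-on-write data model as CQL via the DataStax Java driver. Instagram accessed C* mostly over Thrift, and the keyspace, table, and column names here are hypothetical.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class FeedSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // user_id -> List(media_id): one partition per user, newest first.
            session.execute("CREATE KEYSPACE IF NOT EXISTS ig WITH replication = " +
                            "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS ig.feed (" +
                            "user_id bigint, media_id timeuuid, " +
                            "PRIMARY KEY (user_id, media_id)) " +
                            "WITH CLUSTERING ORDER BY (media_id DESC)");

            // Push: on post, write the media id into each follower's feed partition.
            session.execute("INSERT INTO ig.feed (user_id, media_id) VALUES (42, now())");

            // Read: fetch the newest feed ids from the viewer's partition.
            session.execute("SELECT media_id FROM ig.feed WHERE user_id = 42 LIMIT 100");

            cluster.close();
        }
    }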

Page 10

USE CASE 2: Metadata Store

Applications use C* as a key-value store: they store a list of blobs associated with a key and run point or range queries at read time.

Page 11

USE CASE 2: Metadata Store

• Read/Write QPS: 100K+

• Avg read size: 50KB

• Avg/P99 Read Latency: 7ms/50ms

• Data Model:

user_id -> List(Blob)
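
A hypothetical CQL shape for this blob store (illustrative only, not Instagram's actual schema): a clustering column under the key makes both point and range reads single-partition slices.

    import com.datastax.driver.core.Session;
    import java.util.UUID;

    class MetadataSketch {
        static void createSchema(Session session) {
            // user_id -> List(Blob): all of a key's items live in one partition.
            session.execute("CREATE TABLE IF NOT EXISTS ig.metadata (" +
                            "user_id bigint, item_id timeuuid, payload blob, " +
                            "PRIMARY KEY (user_id, item_id))");
        }

        static void queries(Session session, long userId, UUID from, UUID to) {
            // Point query: one blob under the key.
            session.execute("SELECT payload FROM ig.metadata " +
                            "WHERE user_id = ? AND item_id = ?", userId, from);

            // Range query: a slice of blobs under the key.
            session.execute("SELECT payload FROM ig.metadata " +
                            "WHERE user_id = ? AND item_id >= ? AND item_id <= ?",
                            userId, from, to);
        }
    }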

Page 12

USE CASE 3: Counter

Applications issue bump/get counter operations for each user request.

Page 13

USE CASE 3: Counter

• Read/Write QPS: 50K+

• Avg/P99 Read Latency: 3ms/50ms

• C* 2.2

• Data Model:

some_id -> Counter
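
A sketch of the counter model in CQL (names are hypothetical; the talk does not show schema). Counter columns can only be incremented or decremented, never set directly:

    import com.datastax.driver.core.Session;

    class CounterSketch {
        static void createSchema(Session session) {
            session.execute("CREATE TABLE IF NOT EXISTS ig.counters (" +
                            "some_id bigint PRIMARY KEY, value counter)");
        }

        // Bump: counters support increment/decrement only.
        static void bump(Session session, long id) {
            session.execute("UPDATE ig.counters SET value = value + 1 " +
                            "WHERE some_id = ?", id);
        }

        // Get: a plain read; returns no rows if the counter was never bumped.
        static void get(Session session, long id) {
            session.execute("SELECT value FROM ig.counters WHERE some_id = ?", id);
        }
    }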

Page 14

IMPROVEMENTS

Page 15

1. PROXY NODES

Page 16

PROXY NODE: Problem

• Thrift client, NOT token aware

• Data node coordinates the requests

• High latency and timeouts when a data node is hot.

Page 17

PROXY NODE: Solution

• join_ring: false

• act as coordinator

• do not store data locally

• client only talks to proxy node

• 2X latency drop
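
A sketch of how such a coordinator-only node is typically started: the cassandra.join_ring system property is the standard switch, but exactly where it gets set (service scripts, cassandra-env.sh) depends on the deployment.

    # cassandra-env.sh (or equivalent): gossip and coordinate, but never
    # join the ring, so the node owns no token ranges and stores no data.
    JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"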

Page 18

(CASSANDRA-9258)

2. PENDING RANGES

Page 19

PENDING RANGES: Problem

• CPU usage +30% when bootstrapping new nodes.

• Client request latency spikes and timeouts

• Multimap<Range<Token>, InetAddress> PendingRange

• Inefficient O(n) pendingRanges lookup per request

Page 20

PENDING RANGES: Solution

• CASSANDRA-9258

• Use two NavigableMaps to implement the pending ranges

• We can expand or shrink the cluster without affecting requests

• Thanks to Branimir Lambov for the patch review and feedback.
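
A simplified sketch of the idea behind the fix, not the actual CASSANDRA-9258 code: keying ranges by end token in a NavigableMap turns the per-request lookup into O(log n) instead of a scan. This toy version assumes disjoint, non-wrapping (start, end] ranges; the real patch uses two maps so ranges that wrap around the token ring are covered as well.

    import java.util.AbstractMap;
    import java.util.Collections;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;

    class PendingRangeIndex {
        // end token -> (start token, pending replica addresses)
        private final NavigableMap<Long, Map.Entry<Long, Set<String>>> byEnd = new TreeMap<>();

        void add(long start, long end, Set<String> pendingReplicas) {
            byEnd.put(end, new AbstractMap.SimpleEntry<>(start, pendingReplicas));
        }

        // Find the smallest end >= token; the token lies inside that
        // (start, end] range iff it is greater than the range's start.
        Set<String> pendingFor(long token) {
            Map.Entry<Long, Map.Entry<Long, Set<String>>> e = byEnd.ceilingEntry(token);
            return (e != null && token > e.getValue().getKey())
                 ? e.getValue().getValue()
                 : Collections.emptySet();
        }
    }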

Page 21

(CASSANDRA-6908)

3. DYNAMIC SNITCH

Page 22

DYNAMIC SNITCH

• High read latency during peak time.

• Unnecessary cross region requests.

• dynamic_snitch_badness_threshold: 50

• 10X P99 latency drop
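
The threshold lives in cassandra.yaml. The default is 0.1 (reroute when the preferred replica scores 10% worse); 50 effectively pins reads to the snitch's preferred, same-region replicas:

    # cassandra.yaml: how much worse a preferred replica's latency score may
    # get before the dynamic snitch routes reads elsewhere. 50 (= 5000%)
    # all but disables latency-based rerouting to other regions.
    dynamic_snitch_badness_threshold: 50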

Page 23

4. COMPACTION

Page 24

COMPACTION IMPROVEMENTS

• Track the write amplification. (CASSANDRA-11420)

• Optimize the overlapping lookup. (CASSANDRA-11571)

• Optimize the isEOF() checking. (CASSANDRA-12013)

• Avoid searching for column index. (CASSANDRA-11450)

• Persist last compacted key per level. (CASSANDRA-6216)

• Compact tables before making available in L0. (CASSANDRA-10862)

Page 25

5. BIG HEAP SIZE

Page 26

BIG HEAP SIZE

• 64G max heap size

• 16G new gen size

• -XX:MaxTenuringThreshold=6

• Young GC every 10 seconds

• Avoid full GC

• 2X P99 latency drop
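
In cassandra-env.sh terms these settings look roughly like the sketch below (variable names are the stock ones; values are from the slide):

    # cassandra-env.sh: big CMS heap with a large young generation so
    # short-lived request state dies in young GC instead of being promoted.
    MAX_HEAP_SIZE="64G"
    HEAP_NEWSIZE="16G"
    # Promote only objects that survive 6 young collections.
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=6"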

Page 27

(CASSANDRA-10406)

6. NODETOOL REBUILD RANGE

Page 28

NODETOOL REBUILD

• rebuild may fail for nodes with TBs of data

• CASSANDRA-10406

• adds support for rebuilding only the failed token ranges

• Thanks to Yuki Morishita for the review
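
A hedged sketch of the resulting usage: upstream, CASSANDRA-10406 exposed this as a token-range option on nodetool rebuild. The flag spelling below follows the upstream patch but should be verified against your release with nodetool help rebuild; the ranges and DC name are placeholders:

    # Re-stream only the (start,end] token ranges that failed,
    # instead of rebuilding everything from the source DC.
    nodetool rebuild --tokens "(1000,2000],(3000,4000]" -- source_dc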

Page 29

CHALLENGES

Page 30

PERFORMANCE

P99 read latency (chart): high on the C* nodes, and even higher on the client side.

Page 31

PERFORMANCE

Compaction has difficulty keeping up, which hurts read latency.

Page 32

PERFORMANCE

Compaction uses too much CPU (40%+)

Page 33

PERFORMANCE

Tombstone

Page 34

SCALABILITY

Gossip: nodes see an inconsistent ring (CASSANDRA-11709, CASSANDRA-11740)

Page 35

FEATURES

Counter: problems with repair (CASSANDRA-11432, CASSANDRA-10862)

SSTables in each level: [966/4, 20/10, 152/100, 33, 0, 0, 0, 0, 0] (LCS state after repair: 966 SSTables backed up in L0 against a cap of 4)

Page 36

CLIENT

Access C* from different languages

[Diagram: multiple services, written in different languages, talking to the Cassandra cluster]

Page 37

OPERATIONS

Cluster expansion takes a long time: 15 days to bootstrap 30 nodes.

Page 38

RECAP

Improvements:

• Proxy Node

• Pending Ranges

• Dynamic Snitch

• Compaction

• Big heap size

• Nodetool rebuild range

Challenges:

• P99 Read latency

• Compaction

• Tombstone

• Gossip

• Counter

• Client

• Cluster expansion

Page 39

QUESTIONS?

Page 40