Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion Records
Kiril Savino @holacrat
GC.com/about/product-team
• have a sense of humor
• know what use cases work best
• remember that databases are hard
• don’t understate the difficulty in scaling up
• 1,480,808,857 events
• 8 terabytes of primary data
• 35 nodes
• 420GB RAM on primaries
• 21TB SSD storage
• 14TB EBS storage
• 120,000 ops/s
• Model
• Scale
• Grow
• Extend
Model
November 2009 — MongoDB 1.2
• More indexes per collection
• Faster index creation
• Map/Reduce
• Stored JavaScript functions
• Configurable fsync time
• Several small features and fixes
[Diagram: application stack with REST API, decoding/unmarshalling, business logic, Django ORM, and MySQL]
Inning, Outs, Balls, Strikes, Pitcher, Batter
Period, Minute, Location, Shooter, Rebounder, Assist
[play]
[participant][role]
[sport][play_property]
{
  _id: ObjectId(),
  code: "1B",
  participants: [
    {player_id: ObjectId(), roles: ["batter", "out"]},
    {player_id: ObjectId(), roles: ["pitcher"]}
  ],
  situation: {outs: 1, balls: 2, strikes: 0},
  properties: {location: [0.45, 0.721]}
}
{
  _id: ObjectId(),
  code: "shot",
  participants: [
    {player_id: ObjectId(), roles: ["shooter"]},
    {player_id: ObjectId(), roles: ["rebounder"]}
  ],
  situation: {period: 1, time: "5:29"},
  properties: {location: [0.45, 0.721]}
}
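Because both sports share one document shape, code that reads plays does not need per-sport tables. A minimal Python sketch, using plain dicts as stand-ins for the two play documents above; the string ids and the `players_with_role` helper are hypothetical and not from the talk:

```python
# Plain dicts standing in for the two play documents above.
# The string ids are placeholders for ObjectId values.
single = {
    "code": "1B",
    "participants": [
        {"player_id": "p1", "roles": ["batter", "out"]},
        {"player_id": "p2", "roles": ["pitcher"]},
    ],
    "situation": {"outs": 1, "balls": 2, "strikes": 0},
    "properties": {"location": [0.45, 0.721]},
}
shot = {
    "code": "shot",
    "participants": [
        {"player_id": "p3", "roles": ["shooter"]},
        {"player_id": "p4", "roles": ["rebounder"]},
    ],
    "situation": {"period": 1, "time": "5:29"},
    "properties": {"location": [0.45, 0.721]},
}


def players_with_role(play, role):
    """Return the ids of participants holding `role`, for any sport."""
    return [p["player_id"] for p in play["participants"] if role in p["roles"]]


print(players_with_role(single, "pitcher"))  # ['p2']
print(players_with_role(shot, "shooter"))    # ['p3']
```

The sport-specific details stay in `situation` and `properties`, so the helper never has to branch on the sport.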
[Diagram: application stack with REST API, decoding/unmarshalling, business logic, Django ORM, and MongoDB as the data store] 👏
Modeling data in MongoDB
• JSON won the internet
• Don’t write your own JSON storage engine
• Flexible schemas promote app simplicity
• Validation is your responsibility
• Invest in schema design early
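Since a flexible schema leaves validation to the application, a check along the following lines could run before every write. This is a sketch under assumptions: the required keys are taken from the example play documents, and `validate_play` is a hypothetical helper, not GameChanger's actual code.

```python
# Top-level keys seen in the example play documents (assumed to be required).
REQUIRED_KEYS = {"code", "participants", "situation", "properties"}


def validate_play(doc):
    """Return a list of problems; an empty list means the document passed."""
    errors = []
    missing = REQUIRED_KEYS - set(doc)
    if missing:
        errors.append("missing keys: " + ", ".join(sorted(missing)))
    for i, participant in enumerate(doc.get("participants", [])):
        if "player_id" not in participant:
            errors.append(f"participant {i} has no player_id")
        roles = participant.get("roles")
        if not isinstance(roles, list) or not roles:
            errors.append(f"participant {i} needs a non-empty roles list")
    return errors


good = {
    "code": "shot",
    "participants": [{"player_id": "p3", "roles": ["shooter"]}],
    "situation": {"period": 1, "time": "5:29"},
    "properties": {"location": [0.45, 0.721]},
}
print(validate_play(good))            # []
print(validate_play({"code": "1B"}))  # ['missing keys: participants, properties, situation']
```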
Scale
$$$ 😱
[Chart: System Latency vs. User Load]
Scaling is the process of decoupling load from latency.
Latency comes from:
• Writing data to your database
• Reading data from your database
• Aggregating data from multiple locations
• Running complex calculations
{.} This is a document.
[Diagram: documents ({.}) passing between API, MongoDB, and Browser, with a calculation step (+/-*)]
[Chart: System Latency vs. Read Load]
[Chart: System Latency vs. Write Load]
[Diagram: documents ({.}) passing between API, MongoDB, and Browser, with the calculation step (+/-*) in a Background process]
[Chart: System Latency vs. User Load]
Scaling data access
• Decouple load from latency
• Queries are expensive
• Aggregation is expensive
• Do calculation in the background
• Serve content from single* documents
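The last two points can be sketched together: a background job does the aggregation once and writes a single summary document, and the request path only fetches that document by key. The in-memory structures below stand in for MongoDB collections so the sketch runs without a database; the field names (`game_id`, `team`, `runs`) and both functions are hypothetical.

```python
# In-memory stand-ins for two collections (an assumption for this sketch).
plays = [
    {"game_id": "g1", "team": "home", "runs": 1},
    {"game_id": "g1", "team": "away", "runs": 2},
    {"game_id": "g1", "team": "home", "runs": 3},
]
game_summaries = {}  # game_id -> one precomputed document


def rebuild_summary(game_id):
    """Background job: aggregate plays once and store a single document."""
    score = {}
    for play in plays:
        if play["game_id"] == game_id:
            score[play["team"]] = score.get(play["team"], 0) + play["runs"]
    game_summaries[game_id] = {"_id": game_id, "score": score}


def get_summary(game_id):
    """Request path: one lookup by key, no querying or aggregation."""
    return game_summaries[game_id]


rebuild_summary("g1")
print(get_summary("g1"))  # {'_id': 'g1', 'score': {'home': 4, 'away': 2}}
```

Read cost then stays roughly constant as the number of plays grows, because the expensive work has moved off the request path.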
Grow
{.} {$addToSet: {a: 2}}
{.} {v: 2}, {$set: {v: 3}}
[Diagram: documents {a}, {b}, {c} and a combined document {abc}]
[Diagram: a queue of <id>s labeled "To Propagate"; an <id> moves to "Propagating…" and is applied to documents {.}{.}{.}]
Growing load
• Denormalize for constant access time
• Use MongoDB atomic operators
• Check out optimistic locking and MVCC
• Leverage external concurrency control
• Watch your oplog
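The update pairs shown earlier in this section, `{$addToSet: {a: 2}}` and `{v: 2}, {$set: {v: 3}}`, can be illustrated with a small in-memory simulation. This is only a sketch of the semantics: the `store` dict and both helper functions are hypothetical stand-ins, and unlike a MongoDB single-document update they are not atomic under real concurrency. With pymongo, the version check would typically be expressed by putting `v` in the filter of `update_one` and inspecting `modified_count` on the result.

```python
store = {"doc1": {"v": 2, "a": [1]}}  # in-memory stand-in for a collection


def update_if_version(doc_id, expected_v, changes):
    """Apply `changes` only if the stored version still equals `expected_v`.

    Mirrors the filter/update pair {v: 2}, {$set: {v: 3}}: a stale writer
    matches nothing and must re-read and retry.
    """
    doc = store[doc_id]
    if doc["v"] != expected_v:
        return False
    doc.update(changes)
    doc["v"] = expected_v + 1
    return True


def add_to_set(doc_id, field, value):
    """Stand-in for $addToSet: append only if the value is not present."""
    values = store[doc_id].setdefault(field, [])
    if value not in values:
        values.append(value)


seen_v = store["doc1"]["v"]  # two writers both read v == 2
first = update_if_version("doc1", seen_v, {"name": "first"})
stale = update_if_version("doc1", seen_v, {"name": "stale"})
add_to_set("doc1", "a", 2)
add_to_set("doc1", "a", 2)   # second call is a no-op
print(first, stale)          # True False
print(store["doc1"])         # {'v': 3, 'a': [1, 2], 'name': 'first'}
```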
Extend
So there we have it
• Design your schema to MongoDB’s strengths
• Use monolithic documents
• Don’t do (live) querying
• You can still do transactional things
• You may need to denormalize & propagate
• Think about your overall architecture
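As one illustration of the "denormalize & propagate" point, the sketch below keeps a queue of ids whose denormalized copies are stale and lets a background worker rewrite those copies. Everything here is an assumption made for illustration: the in-memory `players` and `games` dicts stand in for collections, and the function names are hypothetical.

```python
from collections import deque

# In-memory stand-ins; names and fields are hypothetical.
players = {"p1": {"name": "Sam"}}
games = {
    "g1": {"roster": [{"player_id": "p1", "name": "Sam"}]},
    "g2": {"roster": [{"player_id": "p1", "name": "Sam"}]},
}
to_propagate = deque()  # ids of source documents whose copies are stale


def rename_player(player_id, name):
    """Update the source document and queue its id for propagation."""
    players[player_id]["name"] = name
    to_propagate.append(player_id)


def propagate_all():
    """Background worker: rewrite every denormalized copy of queued ids."""
    while to_propagate:
        player_id = to_propagate.popleft()
        name = players[player_id]["name"]
        for game in games.values():
            for entry in game["roster"]:
                if entry["player_id"] == player_id:
                    entry["name"] = name


rename_player("p1", "Samantha")
propagate_all()
print(games["g2"]["roster"][0]["name"])  # Samantha
```

Reads of a game document stay a single fetch; the cost is that copies are briefly stale until the worker drains the queue.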
• have a sense of humor
• know what use cases work best
• remember that databases are hard
• don’t understate the difficulty in scaling up
@holacrat