scalable data modeling by example (carlos alonso, job and talent) | cassandra summit 2016

41
Scalable data modelling by example Carlos Alonso (@calonso)

Upload: datastax

Post on 16-Apr-2017

303 views

Category:

Software


4 download

TRANSCRIPT

Page 1: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Scalable data modelling by example

Carlos Alonso (@calonso)

Page 2: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Carlos Alonso

2

• Ex-Londoner • MSc Salamanca University, Spain • Software Engineer @ Jobandtalent • Cassandra certified developer • Datastax Cassandra MVP 2015 & 2016 • @calonso / http://mrcalonso.com

Page 3: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Jobandtalent

3

• Revolutionising how people find jobs and how businesses

hire employees.

• Leveraging data to produce a unique job matching

technology.

• 10M+ users and 150K+ companies worldwide

• @jobandtalentEng / http://jobandtalent.com

• We are hiring!!

Page 4: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Cassandra Concepts

Page 5: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

The data model is the only thing you can’t change once in production.

Page 6: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Data organisation

6

Token

Page 7: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Physical Data Layout

7

Page 8: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Consistent Hashing

8

Hash function

“Carlos” 185664

1773456738847666528349

-894763734895827651234

Page 9: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Replication factorHow many copies (replicas) for your data

9

Page 10: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Consistency LevelHow many replicas of your data must acknowledge?

10

Page 11: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

A complete read/write example

11

DriverClient

Partitioner

f81d4fae-…834

• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…

Page 12: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

12

Page 13: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

13

Page 14: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Data Modelling

Page 15: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Data Modelling

15

• Understand your data

• Decide (know) how you’ll query the data

• Define column families to satisfy those queries

• Implement and optimise

Page 16: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Data Modelling

16

Conceptual Model

Logical Model

Physical Model

Query-DrivenMethodology

Analysis &Validation

Page 17: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Query Driven Methodology: goals

17

• Spread data evenly around the cluster • Minimise the number of partitions read • Keep partitions manageable

Page 18: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Query Driven Methodology: process

18

• Entities and relationships: map to tables • Key attributes: map to primary key columns • Equality search attributes: must be at the beginning of the primary key • Inequality search attributes: become clustering columns • Ordering attributes: become clustering columns

Page 19: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

The Primary Key

19

PARTITION KEY + CLUSTERING

COLUMN(S)

CREATE TABLE . . .( fields . . . PRIMARY KEY (part_key, clust1, . . .) );

Page 20: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Analysis & Validation

20

• Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions?

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M • How much data duplication? (batches)

Page 21: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

An E-Library project.

Page 22: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Requirement: 1

22

Books can be uniquely identified and accessed by ISBN, we also need a title, genre, author and publisher.

Book

ISBN K

Title

Author

Genre

Publisher

QDM

Q1

Q1: Find books by ISBN

Page 23: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

• Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions?

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – 1 X (5 - 1 - 0) + 0 < 1M

• How much data duplication? 0

Analysis & Validation

23

Book

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBN

Page 24: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Physical data model

24

Book

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR );

SELECT * FROM books WHERE ISBN = ‘…’;

Page 25: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Requirement 2

25

Users register into the system uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or internal unique ID.

Users_by_ID

KID

full_name

QDM

Q1

Q1: Find users by ID

Users_by_login_info

email K

password K

full_name

ID

Q2

Q2: Find users by login info

Q3: Find users by email (to guarantee uniqueness)

Q3

C

Page 26: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

• Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions?

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – 1 X (2 - 1 - 0) + 0 < 1M

• How much data duplication? 0

Analysis & Validation

26

K

Users_by_ID

ID

full_name

Q1

Q1: Find users by ID

Page 27: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Physical Data Model

27

CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR );

SELECT * FROM users_by_id WHERE ID = …;

K

Users_by_ID

ID

full_name

Q1

Q1: Find users by ID

Page 28: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

• Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions?

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – 1 X (4 - 1 - 0) + 0 < 1M

• How much data duplication? 1

Analysis & Validation

28

Q2: Find users by login info

Users_by_login_info

email K

password C

full_name

ID

Q3: Find users by email (to guarantee uniqueness)

Page 29: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Physical Data Model

29

CREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY (email, password) );

SELECT * FROM users_by_login_info WHERE email = ‘…’ [AND password = ‘…’];

Q2: Find users by login info

Users_by_login_info

email K

password C

full_name

ID

Q3: Find users by email (to guarantee uniqueness)

Page 30: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Physical Data Model

30

BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…); APPLY BATCH;

Page 31: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Requirement 3

31

Users read books. We want to know which books has a user read and

show them sorted by title and author

Books_read_by_user

Kuser_IDQDM

Q1: Find all books a loggeduser has read

Q1

ISBN

genre

publisher

title

author

C

C

full_name S

Page 32: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

• Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions?

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user

• How much data duplication? 0

Analysis & Validation

32

Q1: Find all books a loggeduser has read

K

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

S

Page 33: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Physical Data Model

33

CREATE TABLE books_read_by_user ( user_id TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR STATIC, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_id, title, author) );

SELECT * FROM books_read_by_user WHERE user_ID = …;

Q1: Find all books a loggeduser has read

K

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

S

Page 34: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Requirement 4

34

In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site.

element

type

user_ID K

Actions_by_user

QDM

Q1

Q1: Find all actions a user does in a time range

time C

Page 35: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

• Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions?

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Actions X (4 - 1 - 0) + 0 < 1M => 333.333

• How much data duplication? 0

Analysis & Validation

35

K

Actions_by_user

user_ID

Q1

Q1: Find all actions a user does in a time range

time

element

type

C

Page 36: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Requirement 4: Bucketing

36

time

element

type

user_ID K

Actions_by_user

month K

C

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Actions X (5 - 2 - 0) + 0 < 1M => 333.333

per user every <bucket_size>

bucket_size = 1 year => 38 actions / h

bucket_size = 1 month => 462 actions / h

bucket_size = 1 week => 1984 actions / h

Page 37: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

• Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions?

– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Actions X (5 - 2 - 0) + 0 < 1M => 333.333 / month

• How much data duplication? 0

Analysis & Validation

37

K

Actions_by_user

user_ID

month

Q1

Q1: Find all actions a user does in a time range

K

time

element

type

C

Page 38: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Physical Data Model

38

CREATE TABLE actions_by_user ( user_ID TIMEUUID, month INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, month), time) );

SELECT * FROM actions_by_user WHERE user_ID = … AND month = … AND time < … AND time > …;

K

Actions_by_user

user_ID

month

Q1

Q1: Find all actions a user does in a time range

K

time

element

type

C

Page 39: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Further validation

39

∑sizeOf(pk) + ∑sizeOf(sc) + Nr x ∑(sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv < 200 MB

– pk = Partition Key column – sc = Static column – Nr = Number of rows – rc = Regular column – clc = Clustering column – Nv = Number of values

Page 40: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Next Steps

40

• Test your models against your hardware setup – cassandra-stress – http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)

• Monitor everything – DataStax OpsCenter – Graphite – Datadog – . . .

Page 41: Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Thanks!

Carlos Alonso Software Engineer

@calonso