distributed systems at ok.ru #rigadevday

Distributed Systems @ OK.RU
Oleg Anastasyev @m0nstermind [email protected]


Page 1: Distributed systems at ok.ru #rigadevday

Distributed Systems @ OK.RU

Oleg Anastasyev @m0nstermind [email protected]

Page 2: Distributed systems at ok.ru #rigadevday

1. Absolutely reliable network
2. with negligible Latency
3. and practically unlimited Bandwidth
4. It is homogenous
5. Nobody can break into our LAN
6. Topology changes are unnoticeable
7. All managed by a single genius admin
8. So data transport cost is zero now

OK.ru has come to:

Page 3: Distributed systems at ok.ru #rigadevday

1. Absolutely reliable network
2. with negligible Latency
3. and practically unlimited Bandwidth
4. It is homogenous (same HW and hop count to every server)
5. Nobody can break into our LAN
6. Topology changes are unnoticeable
7. All managed by a single genius admin
8. So data transport cost is zero now

Fallacies of distributed computing

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
[Peter Deutsch, 1994; James Gosling, 1997]

Page 4: Distributed systems at ok.ru #rigadevday

4 datacenters

150 distinct microservices

8000 iron servers

OK.RU has come to:

Page 5: Distributed systems at ok.ru #rigadevday


hardware engineers

network engineers

operations

developers

Page 6: Distributed systems at ok.ru #rigadevday


My friends page

1. Retrieve friends ids

2. Filter by friendship type

3. Apply black list

4. Resolve ids to profiles

5. Sort profiles

6. Retrieve stickers

7. Calculate summaries

Page 7: Distributed systems at ok.ru #rigadevday


The Simple Way™

SELECT * FROM friendlist f, users u
WHERE userId = ? AND f.kind = ? AND u.name LIKE ?
AND NOT EXISTS ( SELECT * FROM blacklist … )

Page 8: Distributed systems at ok.ru #rigadevday

• Friendships

• 12 billion edges, 300 GB

• 500,000 requests per second

Simple ways don't work

• User profiles

• more than 350 million profiles

• 3,500,000 requests per second, 50 Gbit/s

Page 9: Distributed systems at ok.ru #rigadevday


How stuff works

[Diagram: web frontend and API frontend call the app server, which calls the microservices: one-graph, user-cache, black-list]

Page 10: Distributed systems at ok.ru #rigadevday


Micro-service dissected

Remote interface

Business logic, caches

[ Local storage ]

1 JVM

Page 11: Distributed systems at ok.ru #rigadevday


Micro-service dissected

Remote interface

https://github.com/odnoklassniki/one-nio

interface GraphService extends RemoteService {
    @RemoteMethod
    long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
}

interface UserCache {
    @RemoteMethod
    User getUserById(long id);
}

Page 12: Distributed systems at ok.ru #rigadevday


App Server code

https://github.com/odnoklassniki/one-nio

long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
List<User> users = new ArrayList<>(friendsIds.length);
for (long id : friendsIds) {
    if (blackList.isAllowed(userId, id)) {
        users.add(userCache.getUserById(id));
    }
}
…
return users;

Page 13: Distributed systems at ok.ru #rigadevday

• Partition by this parameter value

• Using a partitioning strategy

• long id -> int partitionId(id) -> node1, node2, …

• Strategies can be different

• Cassandra ring, Voldemort partitions

• or … (a minimal sketch follows the interface below)

interface GraphService extends RemoteService {
    @RemoteMethod
    long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
}
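A minimal sketch of what such a strategy could look like (hypothetical names, not the one-nio API):

import java.util.List;
import java.util.Map;

// Hypothetical sketch: map a key to a partition, then look up
// which nodes own that partition.
interface PartitionStrategy {
    int partitionId(long id);
    List<String> nodesFor(int partitionId);
}

class ModuloStrategy implements PartitionStrategy {
    private final int partitions;
    private final Map<Integer, List<String>> owners; // partition -> replica nodes

    ModuloStrategy(int partitions, Map<Integer, List<String>> owners) {
        this.partitions = partitions;
        this.owners = owners;
    }

    @Override
    public int partitionId(long id) {
        return (int) Math.floorMod(id, (long) partitions); // long id -> int partition
    }

    @Override
    public List<String> nodesFor(int partitionId) {
        return owners.get(partitionId); // node1, node2, …
    }
}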

Page 14: Distributed systems at ok.ru #rigadevday


Weighted quadrant

[Diagram: p = id % 16 maps an id to one of 16 partitions, p = 0 … 15; each partition maps to a SET of nodes (N01, N02, N03 … N19, N20, N11); a node is then chosen as node = wrr(p), a weighted round-robin over that set with per-node weights such as W = 1 and W = 100]
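A minimal sketch of the wrr(p) selection above (hypothetical; the real balancer may differ): a node with a small weight, say W = 1, receives proportionally fewer requests than a W = 100 node.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of wrr(p): pick a node of the partition's set
// in proportion to its weight.
class WeightedRoundRobin {
    static final class Node {
        final String name;
        final int weight;
        Node(String name, int weight) { this.name = name; this.weight = weight; }
    }

    private final List<Node> replicas;   // nodes owning this partition
    private final int totalWeight;
    private final AtomicInteger counter = new AtomicInteger();

    WeightedRoundRobin(List<Node> replicas) {
        this.replicas = replicas;
        this.totalWeight = replicas.stream().mapToInt(n -> n.weight).sum();
    }

    Node next() {
        int ticket = Math.floorMod(counter.getAndIncrement(), totalWeight);
        for (Node n : replicas) {   // walk the weight ranges until the
            ticket -= n.weight;     // ticket falls into one of them
            if (ticket < 0) return n;
        }
        throw new AssertionError("unreachable");
    }
}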

Page 15: Distributed systems at ok.ru #rigadevday


A coding issue

https://github.com/odnoklassniki/one-nio

long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
List<User> users = new ArrayList<>(friendsIds.length);
for (long id : friendsIds) {
    if (blackList.isAllowed(userId, id)) {
        users.add(userCache.getUserById(id));
    }
}
…
return users;

Page 16: Distributed systems at ok.ru #rigadevday


A roundtrip price*

0.1-0.3 ms within the datacenter
0.7-1.0 ms to a remote datacenter

latency = 1.0 ms * 2 reqs * 200 friends = 400 ms

for 10k friends: latency = 20 seconds

* this price is tightly coupled with the specific infrastructure and frameworks

Page 17: Distributed systems at ok.ru #rigadevday


Batch requests to the rescue

public interface UserCache {
    @RemoteMethod(split = true)
    Collection<User> getUsersByIds(long[] keys);
}

long[] friendsIds = graphService.getFriendsByFilter(userId, mask);
friendsIds = blackList.filterAllowed(userId, friendsIds);
Collection<User> users = userCache.getUsersByIds(friendsIds);
…
return users;

Page 18: Distributed systems at ok.ru #rigadevday


split & merge

[Diagram: split(ids by p) -> ids0, ids1; ids0 goes to a node of partition p = 0 (N01, N02, N03 …), ids1 to a node of partition p = 1 (N11, …); then users = merge(users0, users1)]
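A minimal sketch of the split & merge step (hypothetical, assuming a modulo partitioner; one-nio's actual mechanics may differ):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: split a key batch by partition, fan out one
// request per node, then merge the partial results back together.
class SplitMerge {
    static Map<Integer, List<Long>> split(long[] ids, int partitions) {
        Map<Integer, List<Long>> byPartition = new HashMap<>();
        for (long id : ids) {
            int p = (int) Math.floorMod(id, (long) partitions);
            byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(id);
        }
        return byPartition; // ids0, ids1, …
    }

    static <T> List<T> merge(Collection<? extends List<T>> partials) {
        List<T> merged = new ArrayList<>();
        for (List<T> part : partials) merged.addAll(part); // users0 + users1 + …
        return merged;
    }
}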

Page 19: Distributed systems at ok.ru #rigadevday


1. Client crash

2. Server crash

3. Request omission

4. Response omission

5. Server timeout

6. Invalid value response

7. Arbitrary failure

What could possibly fail ?

Page 20: Distributed systems at ok.ru #rigadevday

Failures

Distributed systems at OK.RU

Page 21: Distributed systems at ok.ru #rigadevday

• We cannot prevent failures, we can only mask them

• If a failure can occur, it will occur

• Redundancy is a must to mask failures

• Information (error correction codes)

• Hardware (replicas, substitute hardware)

• Time (transactions, retries)

What to do with failures ?

Page 22: Distributed systems at ok.ru #rigadevday


What happened to the transaction ?

[Diagram: an “Add Friend” request is sent; its outcome is unknown to the client]

Don’t give up! Must retry !

Must give up! Don't retry !

Page 23: Distributed systems at ok.ru #rigadevday

• The client does not really know

• What can the client do ?

• Don’t make any guarantees.

• Never retry. At Most Once.

• Always retry. At Least Once.

Did the friendship succeed ?

Page 24: Distributed systems at ok.ru #rigadevday

1. Transaction in ACID database

• single master, success is atomic (either yes or no)

• atomic rollback is possible

2. Cache cluster refresh

• many replicas, no master

• no rollback, partial failures are possible


Making new friendship

Page 25: Distributed systems at ok.ru #rigadevday

• Operation can be reapplied multiple times with same result

• e.g.: read, Set.add(), Math.max(x,y)

• Atomic change with order and dup control


Idempotence

The “Always retry” policy can be applied only to

Idempotent Operations

https://en.wikipedia.org/wiki/Idempotence

Page 26: Distributed systems at ok.ru #rigadevday


Idempotence in ACID database

[Sequence diagram:
client -> server: Make friends
server: Already friends ? No, let’s make it !
client: wait; timeout
client -> server: Make friends (retry)
server: Already friends ? Yes, NOP !
server -> client: Friendship, peace and bubble gum !]
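A minimal sketch of that idempotent transaction (hypothetical JDBC code and schema, not OK.ru's actual implementation):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch: the retry is safe because "already friends ?"
// is checked inside the same transaction that inserts the edge.
class FriendService {
    void makeFriends(Connection db, long userA, long userB) throws SQLException {
        db.setAutoCommit(false);
        try (PreparedStatement check = db.prepareStatement(
                 "SELECT 1 FROM friendlist WHERE userId = ? AND friendId = ?");
             PreparedStatement insert = db.prepareStatement(
                 "INSERT INTO friendlist (userId, friendId) VALUES (?, ?)")) {
            check.setLong(1, userA);
            check.setLong(2, userB);
            if (!check.executeQuery().next()) { // not friends yet: make it
                insert.setLong(1, userA);
                insert.setLong(2, userB);
                insert.executeUpdate();
            }                                   // already friends: NOP
            db.commit();
        } catch (SQLException e) {
            db.rollback();                      // atomic rollback is possible
            throw e;
        }
    }
}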

Page 27: Distributed systems at ok.ru #rigadevday


Sequencing

[Sequence diagram:
client: OpId := Generate()
client -> server: MakeFriends(OpId)
server: Is Dup(OpId) ? No, making changes
server -> client: Made friends!]

Generate() examples:

• OpId += 1

• OpId = currentTimeMillis()

• OpId = TimeUUID (http://johannburkard.de/software/uuid/)
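A minimal sketch of the duplicate check (hypothetical; a real implementation would also persist the applied OpIds and enforce ordering):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the server remembers already-applied OpIds, so a
// retried MakeFriends(OpId) becomes a NOP instead of a second change.
class DedupControl {
    private final Set<Long> appliedOps = ConcurrentHashMap.newKeySet();

    boolean apply(long opId, Runnable change) {
        if (!appliedOps.add(opId)) {
            return false;   // Is Dup(OpId) ? Yes: NOP
        }
        change.run();       // No: making changes
        return true;
    }
}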

Page 28: Distributed systems at ok.ru #rigadevday

1. Transaction in ACID database

• single master, success is atomic (either yes or no)

• atomic rollback is possible

2. Cache cluster refresh

• many replicas, no master

• no rollback, partial failures are possible


Making new friendship

Page 29: Distributed systems at ok.ru #rigadevday


Cache cluster refresh

[Diagram: add(Friend) is fanned out to the replicas of partition p = 0: N01, N02, N03 …]

Retries are meaningless, but replicas’ state will diverge otherwise

Page 30: Distributed systems at ok.ru #rigadevday

• Background data sync process

• Reads updated records from the ACID store: SELECT * FROM users WHERE modified > ?

• Applies them into its memory

• Loads updates on node startup

• Retries can be omitted then (a minimal sketch follows below)

Syncing cache from DB
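A minimal sketch of that sync loop, as promised above (hypothetical names):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the background sync: pull records modified since
// the last high-water mark from the ACID store and apply them to memory.
class CacheSyncer {
    interface UserStore {               // wraps: SELECT * FROM users WHERE modified > ?
        List<User> loadModifiedAfter(long ts);
    }
    record User(long id, long modified) {}

    private final UserStore store;
    private final Map<Long, User> cache = new ConcurrentHashMap<>();
    private volatile long lastModified; // starts at 0 on node startup, replaying all updates

    CacheSyncer(UserStore store) { this.store = store; }

    void syncOnce() {                   // run periodically in the background
        for (User u : store.loadModifiedAfter(lastModified)) {
            cache.put(u.id(), u);       // apply the update into memory
            lastModified = Math.max(lastModified, u.modified());
        }
    }
}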

Page 31: Distributed systems at ok.ru #rigadevday


Death by timeout

[Sequence: client -> server: Make Friends; the server stalls in a GC pause; client: wait; timeout; meanwhile the server’s thread pool is exhausted]

Page 32: Distributed systems at ok.ru #rigadevday

1. Clients stop sending requests to the server

After X continuous failures within the last second

2. Clients monitor server availability

In the background, once a minute

3. And turn it back on (a minimal sketch follows below)

Server cut-off
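A minimal sketch of such a cut-off, essentially a circuit breaker (thresholds and names hypothetical):

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: stop calling a server after X continuous failures;
// a background probe, run about once a minute, turns it back on.
class ServerCutOff {
    private static final int X = 10;    // failure threshold (assumed value)
    private final AtomicInteger continuousFailures = new AtomicInteger();
    private volatile boolean cutOff;

    boolean isAvailable() { return !cutOff; }

    void onSuccess() { continuousFailures.set(0); }

    void onFailure() {
        if (continuousFailures.incrementAndGet() >= X) cutOff = true;
    }

    // called from the background availability monitor
    void onProbe(boolean serverAnswered) {
        if (serverAnswered) {
            continuousFailures.set(0);
            cutOff = false;             // turn the server back on
        }
    }
}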

Page 33: Distributed systems at ok.ru #rigadevday


Death by slowing down

Avg = 1.5 ms, Max = 1.5 s, 24 CPU cores: Cap = 24,000 ops

Choose a 2.4 ms timeout ?

Cut it off from the client if avg latency > 2.4 ms ?

Avg = 24 ms, Max = 1.5 s, 24 CPU cores: Cap = 1,000 ops

10,000 ops

Page 34: Distributed systems at ok.ru #rigadevday


Speculative retry

[Sequence: client sends an idempotent op to one replica; wait; timeout; client retries against another replica; the first result response wins]

Page 35: Distributed systems at ok.ru #rigadevday

• Makes requests to replicas before the timeout expires

• Better 99th percentile, even average, latencies

• More stable system

• Not always applicable:

• idempotent ops only; additional load and traffic to consider

• Can be balanced: always, >avg, >99p (a minimal sketch follows below)

Speculative retry
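A minimal sketch of a speculative retry (hypothetical; fires a second request at another replica once a threshold passes and takes whichever answer arrives first):

import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: start the request on one replica; if no result
// arrives before the speculative threshold (e.g. the 99th percentile
// latency), fire the same idempotent request at another replica.
// Note: the slower task is not cancelled in this sketch.
class SpeculativeRetry {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    <T> T call(Supplier<T> primary, Supplier<T> backup, long thresholdMs)
            throws Exception {
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        cs.submit(primary::get);
        Future<T> first = cs.poll(thresholdMs, TimeUnit.MILLISECONDS);
        if (first != null) return first.get(); // primary was fast enough
        cs.submit(backup::get);                // speculate on another replica
        return cs.take().get();                // first of the two to finish wins
    }
}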

Page 36: Distributed systems at ok.ru #rigadevday

More failures !

Distributed systems @ OK.RU

Page 37: Distributed systems at ok.ru #rigadevday

• Excessive load

• Excessive paranoia

• Bugs

• Human error

• Massive outages


All replicas failure

Page 38: Distributed systems at ok.ru #rigadevday


Use of non-authoritative datasources, degrade consistency

Use of incomplete data in UI, partial feature degradation

Single feature full degradation

Degrade (gracefully) !

Page 39: Distributed systems at ok.ru #rigadevday


The code

interface UserCache {
    @RemoteMethod
    Distributed<Collection<User>> getUsersByIds(long[] keys);
}

interface Distributed<D> {
    boolean isInconsistency();
    D getData();
}

class UserCacheStub implements UserCache {
    public Distributed<Collection<User>> getUsersByIds(long[] keys) {
        return Distributed.inconsistent();
    }
}
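Caller-side usage might then look like this (the render methods are hypothetical placeholders):

// Hypothetical usage: check the marker and degrade the feature gracefully.
Distributed<Collection<User>> users = userCache.getUsersByIds(friendsIds);
if (users.isInconsistency()) {
    renderPartialFriendsPage();        // incomplete data, degraded UI
} else {
    renderFriendsPage(users.getData()); // full, consistent data
}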

Page 40: Distributed systems at ok.ru #rigadevday

Resilience testing

Distributed systems at OK.RU

Page 41: Distributed systems at ok.ru #rigadevday


The product you make

Operations in production env

What to test for failure ?

“Standard” products: with special care !

Page 42: Distributed systems at ok.ru #rigadevday

• What it does:

• Detects network connections between servers

• Disables them (iptables drop)

• Runs auto tests

• What we check

• No crashes, nice UI messages are rendered

• Server does start and can serve requests


The product we make : “Guerrilla”

Page 43: Distributed systems at ok.ru #rigadevday

Production diagnostics

Distributed systems at OK.RU

Page 44: Distributed systems at ok.ru #rigadevday

• To know an accident exists. Fast.

• To track down the source of an accident. Fast.

• To prevent accidents before they happen.


Why

Page 45: Distributed systems at ok.ru #rigadevday

• Zabbix

• Cacti

• Operational metrics

• Names of operations, e.g. “Graph.getFriendsByFilter”

• Call counts, their successes and failures

• Latency of calls (a collection sketch follows below)

Is there (or will there be) an accident ?
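A minimal sketch of collecting such per-operation metrics (hypothetical; a real setup would export these counters to Zabbix/Cacti and snapshot them periodically):

import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: per-operation counters and latency, kept per
// operation name such as "Graph.getFriendsByFilter".
class OpMetrics {
    final AtomicLong calls = new AtomicLong();
    final AtomicLong failures = new AtomicLong();
    final AtomicLong totalLatencyNanos = new AtomicLong();

    <T> T measure(Callable<T> op) throws Exception {
        long start = System.nanoTime();
        calls.incrementAndGet();
        try {
            return op.call();
        } catch (Exception e) {
            failures.incrementAndGet();  // count failed calls
            throw e;
        } finally {
            totalLatencyNanos.addAndGet(System.nanoTime() - start);
        }
    }
}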

Page 46: Distributed systems at ok.ru #rigadevday

• Current metrics and trends

• Aggregated call and failure counts

• Aggregated latencies

• Average, Max

• Percentiles: 50, 75, 98, 99, 99.9


What charts show to us

Page 47: Distributed systems at ok.ru #rigadevday


More charts

Page 48: Distributed systems at ok.ru #rigadevday


Anomaly detection

Page 49: Distributed systems at ok.ru #rigadevday

• The possibilities for failure in distributed systems are endless

• Don't “prevent”, but mask failures through redundancy

• Degrade gracefully on unmask-able failure

• Test failures

• Production diagnostics are key to failure detection and prevention


Short summary

Page 50: Distributed systems at ok.ru #rigadevday

Distributed Systems at OK.RU

slideshare.net/m0nstermind

https://v.ok.ru/publishing.html

http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf

Notes on Theory of Distributed Systems, CS 465/565, Spring 2014. James Aspnes

Try these links for more