optimizing native code for erlang · scheduler collapse • with riak we've seen problems in...

94
Optimizing Native Code for Erlang Steve Vinoski Basho Technologies [email protected] @stevevinoski 1 Monday, September 22, 14

Upload: others

Post on 25-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

Optimizing Native Code for ErlangSteve VinoskiBasho [email protected]@stevevinoski

1Monday, September 22, 14

Page 2: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

INTEGRATION, ERLANG STYLE

• External: OS processes separate from the Erlang VM

• Ports

• C Nodes

• Jinterface

• TCP/UDP/SCTP networking

2Monday, September 22, 14

Page 3: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

INTEGRATION, ERLANG STYLE

• Internal: statically or dynamically linked into the Erlang VM

• Erlang Built-in Functions (BIFs)

• Port Drivers

• Native Implemented Functions (NIFs)

3Monday, September 22, 14

Page 4: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

INTEGRATION EXAMPLES

• rebar uses ports for external commands like git, grep, rsync

• Erlang's inet_drv port driver

• written in C

• supports TCP, UDP, SCTP for Erlang applications

• Riak's eleveldb persistence backend is a C++ NIF

4Monday, September 22, 14

Page 5: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF DETAILS

5Monday, September 22, 14

Page 6: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF DETAILS

• Start with a regular Erlang module

5Monday, September 22, 14

Page 7: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF DETAILS

• Start with a regular Erlang module

• Functions can either be stubbed out to raise errors, or have default implementations

5Monday, September 22, 14

Page 8: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF DETAILS

• Start with a regular Erlang module

• Functions can either be stubbed out to raise errors, or have default implementations

• Corresponding NIFs live in a dynamically loaded library

5Monday, September 22, 14

Page 9: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF DETAILS

• Start with a regular Erlang module

• Functions can either be stubbed out to raise errors, or have default implementations

• Corresponding NIFs live in a dynamically loaded library

• Module typically specifies a NIF loading function via -on_load

5Monday, September 22, 14

Page 10: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF DETAILS

• Start with a regular Erlang module

• Functions can either be stubbed out to raise errors, or have default implementations

• Corresponding NIFs live in a dynamically loaded library

• Module typically specifies a NIF loading function via -on_load

• NIFs replace Erlang functions of the same name/arity at module load time

5Monday, September 22, 14

Page 11: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF EXAMPLE

• Example module: bitwise

• Provides a function exor/2 that takes a binary and a value

• exor/2 computes an exclusive or of each byte of the binary with the argument value

• Find the code here: https://github.com/vinoski/bitwise.git

6Monday, September 22, 14

Page 12: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF EXAMPLE

7Monday, September 22, 14

Page 13: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF EXAMPLE

8Monday, September 22, 14

Page 14: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF EXAMPLE

8Monday, September 22, 14

Page 15: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF EXAMPLE

9Monday, September 22, 14

Page 16: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NIF EXAMPLE

10Monday, September 22, 14

Page 17: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

EXOR/2 NIF

11Monday, September 22, 14

Page 18: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

12Monday, September 22, 14

Page 19: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

13Monday, September 22, 14

Page 20: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

13Monday, September 22, 14

Page 21: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

14Monday, September 22, 14

Page 22: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NOW FOR SOME BIG DATA

• 2 billion bytes

15Monday, September 22, 14

Page 23: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

LET'S TIME OUR NIF

16Monday, September 22, 14

Page 24: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

LET'S TIME OUR NIF

• Nearly 6 seconds!

• This is bad.

16Monday, September 22, 14

Page 25: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

ERLANG PROCESS ARCHITECTURE

17Monday, September 22, 14

Page 26: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CPUCore 1

. . . . . . CPUCore N

ERLANG PROCESS ARCHITECTURE

17Monday, September 22, 14

Page 27: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

OS + kernel threadsCPU

Core 1. . . . . . CPU

Core N

ERLANG PROCESS ARCHITECTURE

17Monday, September 22, 14

Page 28: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

Erlang VM

OS + kernel threadsCPU

Core 1. . . . . . CPU

Core N

ERLANG PROCESS ARCHITECTURE

17Monday, September 22, 14

Page 29: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

Erlang VM

OS + kernel threadsCPU

Core 1. . . . . . CPU

Core N

N1

SMPScheduler Threads

(one per core)

ERLANG PROCESS ARCHITECTURE

17Monday, September 22, 14

Page 30: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

Erlang VM

Run Queues

OS + kernel threadsCPU

Core 1. . . . . . CPU

Core N

N1

SMPScheduler Threads

(one per core)

ERLANG PROCESS ARCHITECTURE

17Monday, September 22, 14

Page 31: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

Erlang VM

Run QueuesProcess

Process

Process

Process

Process

Process

OS + kernel threadsCPU

Core 1. . . . . . CPU

Core N

N1

SMPScheduler Threads

(one per core)

ERLANG PROCESS ARCHITECTURE

17Monday, September 22, 14

Page 32: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULING A PROCESS

18Monday, September 22, 14

Page 33: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULING A PROCESS

• A scheduler takes a process from its run queue

18Monday, September 22, 14

Page 34: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULING A PROCESS

• A scheduler takes a process from its run queue

• It executes it until it hits 2000 reductions (function calls) or until it waits for a message, or if it hits an emulator trap

18Monday, September 22, 14

Page 35: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULING A PROCESS

• A scheduler takes a process from its run queue

• It executes it until it hits 2000 reductions (function calls) or until it waits for a message, or if it hits an emulator trap

• The process then gets scheduled out and another one chosen

18Monday, September 22, 14

Page 36: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULING A PROCESS

• A scheduler takes a process from its run queue

• It executes it until it hits 2000 reductions (function calls) or until it waits for a message, or if it hits an emulator trap

• The process then gets scheduled out and another one chosen

• See Jesper Louis Andersen's scheduling description:http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-scheduling.html

18Monday, September 22, 14

Page 37: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

THREAD PROGRESS

• Scheduler threads share some data structures

• But using traditional locks or ref counts to protect them scales poorly

• Instead, schedulers report their progress frequently to other schedulers

• Schedulers use their knowledge of other schedulers' progress to know when certain operations are safe

• For more details see https://github.com/erlang/otp/blob/master/erts/emulator/internal_doc/ThreadProgress.md

19Monday, September 22, 14

Page 38: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

BLOCKED SCHEDULERS

• Blocking a scheduler prevents thread progress, making other schedulers wait

• Blocking a scheduler also makes it unavailable to run other processes

• A NIF shouldn't occupy a scheduler for more than 1-2 ms

• NIF reductions should also be counted properly

20Monday, September 22, 14

Page 39: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULER COLLAPSE

• With Riak we've seen problems in production where schedulers go to sleep and stop executing processes

• Caused by misbehaving NIFs in Riak's storage backends interfering with normal scheduler operations

• Can also be caused by misbehaving standard Erlang functions

• See Scott Fritchie's nifwait repository, md5 branch:https://github.com/slfritchie/nifwait.git

21Monday, September 22, 14

Page 40: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

LET'S COUNT REDUCTIONS

22Monday, September 22, 14

Page 41: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

LET'S COUNT REDUCTIONS

22Monday, September 22, 14

Page 42: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

A MISBEHAVING NIF

23Monday, September 22, 14

Page 43: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

A MISBEHAVING NIF

• Blocked a scheduler thread for 5.86 seconds

• And only 4 reductions

23Monday, September 22, 14

Page 44: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

WORKAROUNDS

• Break the data into chunks

• Call exor_bad/2 repeatedly, once for each chunk

• Combine the resulting chunks into a final result

24Monday, September 22, 14

Page 45: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CHUNKING

25Monday, September 22, 14

Page 46: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CHUNKING

26Monday, September 22, 14

Page 47: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CHUNKING

27Monday, September 22, 14

Page 48: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CHUNKING

• Problem: how to determine optimal chunk size?

• Here, we arbitrarily chose 4MB chunks

28Monday, September 22, 14

Page 49: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CHUNKING

• Problem: how to determine optimal chunk size?

• Here, we arbitrarily chose 4MB chunks

28Monday, September 22, 14

Page 50: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CHUNKING RESULTS

29Monday, September 22, 14

Page 51: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

CHUNKING RESULTS

• 476 chunks processed

• Much better reduction count of 1445

• Scheduler was never blocked (probably anyway)

• But a longer execution time of 7.87 seconds

29Monday, September 22, 14

Page 52: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

A BETTER APPROACH

• For Erlang/OTP 17.3 (released 17 Sep 2014) I added a new NIF API function: enif_schedule_nif

• Takes a name and function pointer for a NIF, and an array of arguments to pass to it

• Schedules the argument NIF for future invocation with the specified arguments

• Allows the calling NIF to yield the scheduler

30Monday, September 22, 14

Page 53: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

31Monday, September 22, 14

Page 54: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

32Monday, September 22, 14

Page 55: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

32Monday, September 22, 14

Page 56: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

32Monday, September 22, 14

Page 57: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

33Monday, September 22, 14

Page 58: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

EXOR2/6

• exor2/6 is an "internal NIF" not visible to Erlang

• Works through as much of the binary as it can before its timeslice runs out

• Reports reductions using enif_consume_timeslice

• When its timeslice is up, reschedules itself via enif_schedule_nif

• Adjusts chunksize for the next iteration based on progress in each iteration

34Monday, September 22, 14

Page 59: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

...snip...

35Monday, September 22, 14

Page 60: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

...snip...

35Monday, September 22, 14

Page 61: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

...snip...

35Monday, September 22, 14

Page 62: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

...snip...

35Monday, September 22, 14

Page 63: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

...snip...

35Monday, September 22, 14

Page 64: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

36Monday, September 22, 14

Page 65: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

36Monday, September 22, 14

Page 66: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

36Monday, September 22, 14

Page 67: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

36Monday, September 22, 14

Page 68: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

36Monday, September 22, 14

Page 69: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

A YIELDING NIF

37Monday, September 22, 14

Page 70: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

A YIELDING NIF

• 5.36 seconds, fastest so far

• At over 10000 reductions, much more accurate accounting

• We yielded the scheduler 5 times

37Monday, September 22, 14

Page 71: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

ANOTHER APPROACH:DIRTY SCHEDULERS

38Monday, September 22, 14

Page 72: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

Erlang VM

Run Queues

OS + kernel threadsCPU

Core 1. . . . . . CPU

Core N

N1

SMPScheduler Threads

(one per core)

DIRTY SCHEDULERS

39Monday, September 22, 14

Page 73: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

OS + kernel threadsCPU

Core 1. . . . . . . . . . . . . CPU

Core N

N1

DIRTY SCHEDULERS

40Monday, September 22, 14

Page 74: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

OS + kernel threadsCPU

Core 1. . . . . . . . . . . . . CPU

Core N

N1

DIRTY SCHEDULERS

. . . . . . . . . . . . .DC1 DCN

DC: Dirty CPU Scheduler41Monday, September 22, 14

Page 75: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

OS + kernel threadsCPU

Core 1. . . . . . . . . . . . . CPU

Core N

N1

DIRTY SCHEDULERS

. . . . . . . . . . . . .DC1 DCN

DC: Dirty CPU Scheduler

Shared DC Run Queue

41Monday, September 22, 14

Page 76: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

DIRTY SCHEDULERS

N

CPUCore 1

. . . . . . . . . . . . . CPUCore N

1 . . . . . . . . . . . . .DC1 DCN

Shared DC Run Queue

42Monday, September 22, 14

Page 77: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

DIRTY SCHEDULERS

N

CPUCore 1

. . . . . . . . . . . . . CPUCore N

1 . . . . . . . . . . . . .DC1 DCN

Shared DC Run Queue

Shared DI/ORun Queue

DI/O NDI/O 1

DI/O: Dirty I/O Scheduler

OS + kernel threads

42Monday, September 22, 14

Page 78: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

DIRTY SCHEDULERSShared DI/ORun Queue

DI/O NDI/O 1

DI/O: Dirty I/O Scheduler

OS + kernel threads

43Monday, September 22, 14

Page 79: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

ENABLING DIRTY SCHEDULERS

• configure --enable-dirty-schedulers

• Your Erlang shell will print something like the following system version line:

Erlang/OTP 17 [erts-6.2] [source] [64-bit] [smp:8:8] \ [ds:8:8:10] [async-threads:10] [kernel-poll:false]

44Monday, September 22, 14

Page 80: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

USING DIRTY SCHEDULERS

• Either schedule a dirty NIF via enif_schedule_nif

• Pass a flag to indicate dirty CPU or dirty I/O scheduling

• Or specify a NIF as dirty in your ErlNifFuncs array

• Both of these are new with Erlang 17.3, replacing old experimental dirty NIF API

45Monday, September 22, 14

Page 81: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

USING DIRTY SCHEDULERS

46Monday, September 22, 14

Page 82: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

USING DIRTY SCHEDULERS

46Monday, September 22, 14

Page 83: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

USING DIRTY SCHEDULERS

46Monday, September 22, 14

Page 84: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

A DIRTY EXOR/2

47Monday, September 22, 14

Page 85: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

A DIRTY EXOR/2

• 5.95 seconds on a dirty scheduler thread

• 8 reductions and 0 yields

• But was (almost) never on a regular scheduler

• Regular schedulers were running other jobs normally

47Monday, September 22, 14

Page 86: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULE IT DIRTY

48Monday, September 22, 14

Page 87: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULE IT DIRTY

• No chunking or yielding needed for dirty exor/2

48Monday, September 22, 14

Page 88: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULE IT DIRTY

• No chunking or yielding needed for dirty exor/2

• But dirty schedulers are finite resources

48Monday, September 22, 14

Page 89: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULE IT DIRTY

• No chunking or yielding needed for dirty exor/2

• But dirty schedulers are finite resources

• Evil dirty NIFs can completely occupy all dirty schedulers and prevent other dirty jobs from running

48Monday, September 22, 14

Page 90: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULE IT DIRTY

• No chunking or yielding needed for dirty exor/2

• But dirty schedulers are finite resources

• Evil dirty NIFs can completely occupy all dirty schedulers and prevent other dirty jobs from running

• A dirty NIF can use enif_schedule_nif to reschedule, yielding to allow other dirty jobs to execute

48Monday, September 22, 14

Page 91: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

SCHEDULE IT DIRTY

• No chunking or yielding needed for dirty exor/2

• But dirty schedulers are finite resources

• Evil dirty NIFs can completely occupy all dirty schedulers and prevent other dirty jobs from running

• A dirty NIF can use enif_schedule_nif to reschedule, yielding to allow other dirty jobs to execute

• A NIF can use enif_schedule_nif to flip itself between regular mode and dirty mode

48Monday, September 22, 14

Page 92: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

NEXT STEPS

• Dirty drivers already in progress

• Native processes?

• see Rickard Green's original 2011 presentation on these topics: http://www.erlang-factory.com/upload/presentations/377/RickardGreen-NativeInterface.pdf

49Monday, September 22, 14

Page 93: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

ACKNOWLEDGEMENTS

• A huge thanks to Rickard Green of the Ericsson OTP team, who has patiently guided me in this work

• Also thanks to Sverker Eriksson of the OTP team

• And thanks to Anthony Ramine for mentioning "NIF traps" one day in the #erlang IRC channel, where I got the idea for enif_schedule_nif

50Monday, September 22, 14

Page 94: Optimizing Native Code for Erlang · SCHEDULER COLLAPSE • With Riak we've seen problems in production where schedulers go to sleep and stop executing processes • Caused by misbehaving

THANKS

http://shop.oreilly.com/product/0636920024149.do#

51Monday, September 22, 14