Erlang Factory 2011 London
DESCRIPTION
Slides of the Erlang Factory 2011 London talk "Designing for performance with Erlang". Video of this presentation is available at http://vimeo.com/26715793#at=0
TRANSCRIPT
![Page 1: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/1.jpg)
Designing for Scale
Knut Nesheim @knutin
Paolo Negri @hungryblank
![Page 2: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/2.jpg)
About this talk
2 developers and Erlang
vs.
1 million daily users
![Page 3: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/3.jpg)
Social Games
Flash client (game)
HTTP API
![Page 4: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/4.jpg)
Social Games
Flash client
• Game actions need to be persisted and validated
• 1 API call every 2 secs
![Page 5: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/5.jpg)
Social Games
HTTP API
• @ 1 000 000 daily users
• 5000 HTTP reqs/sec
• more than 90% writes
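The load figures on these slides hang together as simple arithmetic. A small sketch, assuming every online client sustains the stated one call every 2 seconds (the variable names are illustrative, not from the talk):

```python
# Figures taken from the slides.
DAILY_USERS = 1_000_000
REQS_PER_SEC = 5000
SECONDS_BETWEEN_CALLS = 2

# Each online client issues 1 call every 2 seconds, i.e. 0.5 reqs/sec.
reqs_per_client = 1 / SECONDS_BETWEEN_CALLS

# Assumption: all traffic comes from clients calling at that steady rate.
concurrent_clients = REQS_PER_SEC / reqs_per_client
share_online = concurrent_clients / DAILY_USERS

print(f"{concurrent_clients:.0f} clients online at once")
print(f"{share_online:.1%} of daily users")
```

Under that assumption, 5000 reqs/sec corresponds to about 10,000 clients playing at the same moment, roughly 1% of the daily users.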
![Page 6: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/6.jpg)
The hard nut
http://www.flickr.com/photos/mukluk/315409445/
![Page 7: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/7.jpg)
Users we expect
[Chart: “Monster World” daily active users (DAU), July to December 2010; y-axis from 0 to 1,000,000]
![Page 8: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/8.jpg)
Users we have
[Chart: new game daily active users (DAU), March to June 2011; y-axis from 0 to 50]
![Page 9: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/9.jpg)
What to do?
1 Simulate users
![Page 10: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/10.jpg)
Simulating users
• Must not be too synthetic (like apachebench)
• Must look like a meaningful game session
• Users must come online at a given rate and play
![Page 11: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/11.jpg)
• Multi protocol (HTTP, XMPP) benchmarking tool
• Able to test non trivial call sequences
• Can actually simulate a scripted gaming session
http://tsung.erlang-projects.org/
Tsung
![Page 12: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/12.jpg)
http://tsung.erlang-projects.org/
Tsung - configuration
<request subst="true">
  <http url="http://server.wooga.com/users/%%ts_user_server:get_unique_id%%/resources/column/5/row/14?%%_routing_key%%"
        method="POST"
        contents='{"parameter1":"value1"}'></http>
</request>
Fixed content: the URL path and the request body
Dynamic parameters: the %%...%% placeholders, substituted by Tsung because subst="true"
![Page 13: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/13.jpg)
http://tsung.erlang-projects.org/
Tsung - configuration
• Not something you fancy writing
• We’re in development, calls change and we constantly add new calls
• A session might contain hundreds of requests
• All the calls must refer to a consistent game state
![Page 14: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/14.jpg)
http://tsung.erlang-projects.org/
Tsung - configuration
• From our ruby test code
user.resources(:column => 5, :row => 14)
• Same as
<request subst="true">
  <http url="http://server.wooga.com/users/%%ts_user_server:get_unique_id%%/resources/column/5/row/14?%%_routing_key%%"
        method="POST"
        contents='{"parameter1":"value1"}'></http>
</request>
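The slides do not show how the Ruby test calls were turned into Tsung requests. As a rough illustration of the mapping only, here is a hypothetical Python sketch that renders the same `<request>` element from a call name and its parameters. The function `tsung_request` and its argument shapes are assumptions; the base URL and the two `%%...%%` substitutions are copied from the slide.

```python
import json
from xml.sax.saxutils import quoteattr

# Base URL and the two %%...%% substitutions are copied from the slide.
BASE = "http://server.wooga.com/users/%%ts_user_server:get_unique_id%%"
ROUTING = "%%_routing_key%%"


def tsung_request(call, params, body):
    """Render one Tsung <request> element for a game API call (hypothetical helper).

    `call` and `params` mirror the Ruby test call, e.g.
    user.resources(:column => 5, :row => 14).
    """
    # Parameters become /name/value path segments, in the order given.
    path = "".join(f"/{key}/{value}" for key, value in params)
    url = f"{BASE}/{call}{path}?{ROUTING}"
    contents = json.dumps(body, separators=(",", ":"))
    # quoteattr picks a quote style that keeps the attribute well-formed.
    return (
        '<request subst="true">'
        f'<http url={quoteattr(url)} method="POST" '
        f"contents={quoteattr(contents)}></http>"
        "</request>"
    )


request_xml = tsung_request(
    "resources", [("column", 5), ("row", 14)], {"parameter1": "value1"}
)
print(request_xml)
```

Generating the XML from the same call descriptions the tests use keeps the benchmark in step with an API that is still changing.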
![Page 15: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/15.jpg)
http://tsung.erlang-projects.org/
Tsung - configuration
• Session
  • requests
• Arrival phase
  • duration
  • arrival rate
A session is a group of requests.
Sessions arrive in phases with a specific arrival rate.
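To make the session and arrival-phase vocabulary concrete, here is a small sketch of the arithmetic behind a load plan. The phase values and the 200-second session length are invented for illustration, and the class and function names are not Tsung's.

```python
from dataclasses import dataclass


@dataclass
class ArrivalPhase:
    duration_s: int       # how long the phase lasts, in seconds
    arrival_rate: float   # new sessions started per second


def sessions_started(phases):
    """Expected number of sessions started over all phases."""
    return sum(p.duration_s * p.arrival_rate for p in phases)


def steady_state_concurrency(phase, session_length_s):
    """Little's law: sessions in flight = arrival rate x session length.

    Only meaningful when the phase lasts longer than one session.
    """
    return phase.arrival_rate * session_length_s


# Hypothetical ramp-up: 5 minutes at 10 users/sec, then 10 minutes at 50 users/sec.
phases = [ArrivalPhase(300, 10), ArrivalPhase(600, 50)]
total = sessions_started(phases)
in_flight = steady_state_concurrency(phases[1], 200)
print(total, in_flight)
```

Working backwards from a target number of concurrent players gives the arrival rate to configure for a phase.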
![Page 16: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/16.jpg)
http://tsung.erlang-projects.org/
Tsung - setup
[Diagram: the benchmarking cluster (a Tsung master controlling a Tsung worker over ssh) sends HTTP requests to the application (three app servers)]
![Page 17: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/17.jpg)
http://tsung.erlang-projects.org/
Tsung
• Generates ~ 2500 reqs/sec on AWS m1.large
• Flexible but hard to extend
• Code base rather obscure
![Page 18: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/18.jpg)
What to do?
2 Collect metrics
![Page 19: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/19.jpg)
http://tsung.erlang-projects.org/
Tsung-metrics
• Tsung collects measurements and provides reports
• But these measurements include Tsung's own network/CPU congestion
• The Tsung machines aren't a good vantage point
![Page 20: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/20.jpg)
HAproxy
[Diagram: the same setup, with haproxy placed between the Tsung worker and the three app servers]
![Page 21: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/21.jpg)
HAproxy
“The Reliable, High Performance TCP/HTTP Load Balancer”
• Placed in front of http servers
• Load balancing
• Fail over
![Page 22: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/22.jpg)
HAproxy - syslog
• Easy to set up
• Efficient (UDP)
• Provides 5 timings for each request
![Page 23: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/23.jpg)
HAproxy
• Time to receive request from client
![Page 24: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/24.jpg)
HAproxy
• Time spent in HAproxy queue
![Page 25: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/25.jpg)
HAproxy
• Time to connect to the server
![Page 26: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/26.jpg)
HAproxy
• Time to receive response headers from server
![Page 27: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/27.jpg)
HAproxy
• Total session duration
![Page 28: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/28.jpg)
HAproxy - syslog
• Application URLs directly identify the server call
• Application URLs are easy to parse
• Processing the HAproxy syslog gives per-call metrics
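A minimal sketch of that processing step, assuming HAproxy's HTTP log format with the five timers written as `Tq/Tw/Tc/Tr/Tt` (the names used in the HAproxy 1.4 documentation; other versions may differ). The sample line is invented, and collapsing numeric path segments into `:n` is just one possible way to derive a call type from the URL.

```python
import re
from collections import defaultdict

# The five timings, in log order: request, queue, connect, response, total.
TIMERS = ("Tq", "Tw", "Tc", "Tr", "Tt")
LINE_RE = re.compile(
    r" (?P<timers>-?\d+/-?\d+/-?\d+/-?\d+/\+?-?\d+) (?P<status>\d{3}) "
    r'.*"(?P<method>[A-Z]+) (?P<url>\S+)'
)


def parse_line(line):
    """Return (call_type, timers dict), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    values = [int(v.lstrip("+")) for v in m.group("timers").split("/")]
    path = m.group("url").split("?", 1)[0]
    # Collapse numeric path segments so all users map to one call type.
    call_type = re.sub(r"/\d+", "/:n", path)
    return f'{m.group("method")} {call_type}', dict(zip(TIMERS, values))


# Invented sample line, shaped like an HAproxy HTTP log entry.
SAMPLE = (
    "Jun  9 12:14:14 lb1 haproxy[14389]: 10.0.1.2:33317 "
    "[09/Jun/2011:12:14:14.655] http-in app/srv1 10/0/30/69/109 200 2750 "
    "- - ---- 1/1/1/1/0 0/0 "
    '"POST /users/4711/resources/column/5/row/14?abc HTTP/1.1"'
)

# Group the server response time (Tr) by call type.
response_times = defaultdict(list)
parsed = parse_line(SAMPLE)
if parsed is not None:
    call_type, timers = parsed
    response_times[call_type].append(timers["Tr"])
print(dict(response_times))
```

Because the call type is recoverable from the URL alone, the load balancer log is enough to build per-call latency series without touching the application.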
![Page 29: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/29.jpg)
What to do?
3 Understand metrics
![Page 30: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/30.jpg)
Reading/aggregating metrics
• Python to parse/normalize syslog
• R language to analyze/visualize data
• R language console to interactively explore benchmarking results
![Page 31: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/31.jpg)
R is a free software environment for statistical computing and graphics.
![Page 32: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/32.jpg)
What you get
• Aggregate performance levels (throughput, latency)
• Detailed performance per call type
• Statistical analysis (outliers, trends, regression, correlation, frequency, standard deviation)
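The talk used R for this analysis. Purely to illustrate the kind of per-call summary meant here, the following Python sketch computes a few of the listed statistics for one call type. The latency values are made up, and the "mean plus two standard deviations" outlier rule is an assumption, not the authors' method.

```python
import statistics


def summarize(latencies_ms):
    """Basic descriptive statistics for one call type's latencies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    mean = statistics.mean(ordered)
    stdev = statistics.stdev(ordered)
    # Nearest-rank 95th percentile, using integer arithmetic for the ceiling.
    rank = -(-95 * n // 100)
    p95 = ordered[max(rank, 1) - 1]
    # Assumed rule of thumb: flag values above mean + 2 standard deviations.
    outliers = [x for x in ordered if x > mean + 2 * stdev]
    return {
        "mean": mean,
        "median": statistics.median(ordered),
        "stdev": stdev,
        "p95": p95,
        "outliers": outliers,
    }


# Made-up response times in milliseconds for a single call type.
summary = summarize([12, 14, 13, 15, 11, 13, 12, 14, 13, 250])
print(summary)
```

A single slow request barely moves the median but dominates the mean and the standard deviation, which is why looking at more than the average matters.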
![Page 33: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/33.jpg)
What you get
![Page 34: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/34.jpg)
What to do?
4 Go deeper
![Page 35: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/35.jpg)
Digging into the data
• From HAproxy log analysis, one call emerged as exceptionally slow
• Using eprof we were able to determine that most of the time was spent in a Redis query fetching many keys (MGET)
![Page 36: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/36.jpg)
Tracing erldis query
• More than 60% of runtime is spent manipulating the socket
• gen_tcp:recv/2 is the culprit
• But why is it called so many times?
![Page 37: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/37.jpg)
Understanding the redis protocol
C: LRANGE mylist 0 2
S: *2
S: $5
S: Hello
S: $5
S: World
<<"*2\r\n$5\r\nHello\r\n$5\r\nWorld\r\n">>
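To show how that reply decomposes, here is a minimal Python sketch of a parser for the two reply types on the slide: `*N` announces a multi-bulk reply of N items, and `$L` announces a bulk string of L bytes. It assumes the whole reply is already in memory and ignores the other reply types (status, error, integer, null).

```python
def parse_reply(data, pos=0):
    """Parse one Redis reply from `data` starting at `pos`.

    Returns (value, next_pos). Handles only the bulk ("$") and
    multi-bulk ("*") types shown on the slide.
    """
    end = data.index(b"\r\n", pos)
    kind, header = data[pos:pos + 1], data[pos + 1:end]
    pos = end + 2
    if kind == b"$":
        length = int(header)
        # Skip the payload plus its trailing \r\n.
        return data[pos:pos + length], pos + length + 2
    if kind == b"*":
        items = []
        for _ in range(int(header)):
            item, pos = parse_reply(data, pos)
            items.append(item)
        return items, pos
    raise ValueError(f"unsupported reply type: {kind!r}")


reply = b"*2\r\n$5\r\nHello\r\n$5\r\nWorld\r\n"
value, consumed = parse_reply(reply)
print(value)  # [b'Hello', b'World']
```

Every item costs two reads (a header line, then the payload), so a reply with many keys means many small reads if each one goes to the socket.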
![Page 38: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/38.jpg)
Understanding erldis
• recv_value/2 is used in the protocol parser to get the next data to parse
![Page 39: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/39.jpg)
A different approach
• Two ways to use gen_tcp: active or passive
• In passive mode, call gen_tcp:recv to explicitly ask for data; the call blocks
• In active mode, the socket sends the controlling process a message when there is data
• Hybrid: active once ({active, once})
![Page 40: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/40.jpg)
A different approach
• Are active sockets faster?
• A proof of concept showed active sockets to be faster
• Change erldis or write a new driver?
![Page 41: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/41.jpg)
A different approach
• Radical change => new driver
• Keep Erldis queuing approach
• Think about error handling from the start
• Use active sockets
![Page 42: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/42.jpg)
A different approach
• Active socket, parse partial replies
![Page 43: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/43.jpg)
Circuit breaker
• eredis has a simple circuit breaker for when Redis is down/unreachable
• eredis returns immediately to clients if connection is down
• Reconnecting is done outside request/response handling
• Robust handling of errors
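The fail-fast behaviour described on this slide can be sketched generically. The following Python class is an illustration of the pattern only and says nothing about how eredis is written: `FailFastClient`, the `connect` callable and the connection's `send` method are all hypothetical names.

```python
import time


class FailFastClient:
    """Returns an error immediately while the connection is down."""

    def __init__(self, connect, retry_interval_s=1.0, clock=time.monotonic):
        self._connect = connect            # hypothetical: returns a connection or raises OSError
        self._retry_interval_s = retry_interval_s
        self._clock = clock
        self._conn = None
        self._next_retry = 0.0

    def maintain(self):
        """Reconnect if needed. Call from a background loop, not from request()."""
        if self._conn is not None or self._clock() < self._next_retry:
            return
        try:
            self._conn = self._connect()
        except OSError:
            # Still down: wait before trying again.
            self._next_retry = self._clock() + self._retry_interval_s

    def request(self, command):
        """Send a command, or fail immediately if the connection is down."""
        if self._conn is None:
            return ("error", "no_connection")
        try:
            return ("ok", self._conn.send(command))
        except OSError:
            self._conn = None              # open the circuit; maintain() reconnects
            return ("error", "no_connection")
```

Keeping reconnection out of `request` means callers never wait on a connection attempt; they get an error right away and can decide what to do with it.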
![Page 44: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/44.jpg)
Benchmarking eredis
• Redis driver critical for our application
• Must perform well
• Must be stable
• How do we test this?
![Page 45: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/45.jpg)
Basho bench
• Basho produces the Riak KV store
• Basho built a tool to test KV servers: Basho bench
• We used Basho bench to test eredis
![Page 46: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/46.jpg)
Basho bench
• Create callback module
![Page 47: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/47.jpg)
Basho bench
• Configuration term-file
![Page 48: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/48.jpg)
Basho bench output
![Page 49: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/49.jpg)
eredis is open source
https://github.com/wooga/eredis
![Page 50: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/50.jpg)
What to do?
5 Measure internals
![Page 51: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/51.jpg)
Measure internals
The HAproxy point of view is valid, but how do we measure the internals of our application while we are live, without the overhead of tracing?
![Page 52: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/52.jpg)
Think Basho bench
• Basho bench can benchmark a redis driver
• Redis is very fast, 100K ops/sec
• Basho bench overhead is acceptable
• The code is very simple
![Page 53: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/53.jpg)
Cherry pick ideas from Basho Bench
• Creates a histogram of timings on the fly, reducing the number of data points
• Dumps to disk every N seconds
• Allows statistical tools to work on already aggregated data
• Near real-time, from event to stats in N+5 seconds
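The histogram idea above can be sketched in a few lines. This Python version is illustrative only: the fixed-width buckets, the JSON line format and the class name `TimingHistogram` are assumptions, not what Basho bench or the authors' code does.

```python
import json
import time
from collections import Counter


class TimingHistogram:
    """Counts timings per fixed-width bucket instead of storing every sample."""

    def __init__(self, bucket_ms=1):
        self.bucket_ms = bucket_ms
        self.counts = Counter()

    def record(self, elapsed_ms):
        # Map the timing to the lower bound of its bucket.
        bucket = int(elapsed_ms // self.bucket_ms) * self.bucket_ms
        self.counts[bucket] += 1

    def flush(self, out):
        """Write one JSON line with the current buckets, then reset."""
        snapshot = {
            "ts": int(time.time()),
            "buckets": sorted(self.counts.items()),
        }
        out.write(json.dumps(snapshot) + "\n")
        self.counts.clear()
        return snapshot


hist = TimingHistogram(bucket_ms=5)
for elapsed in (3.2, 7.9, 4.1, 12.0):
    hist.record(elapsed)
print(sorted(hist.counts.items()))
```

A timer that calls `flush` every N seconds turns millions of individual timings into a handful of bucket counts per interval, which is what lets the statistical tools keep up in near real time.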
![Page 54: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/54.jpg)
Homegrown stats
• Measures latency from the edges of our system (excludes HTTP handling)
• And at interesting points inside the system
• Statistical analysis using R
• Correlate with HAproxy data
• Produces graphs and data specific to our application
![Page 55: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/55.jpg)
Homegrown stats
![Page 56: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/56.jpg)
Recap
Measure:
• From an external point of view (HAproxy)
• At the edge of the system (excluding HTTP handling)
• Internals in the single process (eprof)
![Page 57: Erlang factory 2011 london](https://reader034.vdocuments.net/reader034/viewer/2022042515/547939b1b479596d098b4715/html5/thumbnails/57.jpg)
Recap
Analyze:
• Aggregated measures
• Statistical properties of measures
  • standard deviation
  • distribution
  • trends