metrics simplified
Metrics Simplified
Mark Lin
why?
"If you can not measure it, you can not improve it" -Lord Kelvin
99.999% ("five nines") = 5.26 minutes of downtime per year
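The five-nines figure is simple arithmetic: the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year; the function name is illustrative and not from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


print(round(allowed_downtime_minutes(99.999), 2))  # 5.26
```

Each extra nine divides the budget by ten, so "four nines" allows roughly 52.6 minutes per year.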
previously ...
Sending/collecting is complicated.
Single collection server.
Tedious to configure new metric collection or creation.
Calculating metrics from files is expensive.
bottlenecks ...
Poll based collection server
Not easy (!fun) to configure new metric collection or creation.
= grunt work for ops engineers
uhhhh....
enabling technology
Graphite
RabbitMQ
Graphite Local Proxy
RockSteady ( w/ Esper )
path to graph
1min.juicer.output.apple.sc1.jcr1 20 1276822626
echo "1min.juicer.output.apple.sc1.jcr1 20 1276822626" | nc localhost 3400
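The same one-line send can be done from application code. A minimal sketch of the `echo ... | nc localhost 3400` command in Python, assuming a listener on port 3400 (the port shown above) that accepts one `path value timestamp` line per metric; the function names are illustrative.

```python
import socket
import time


def format_metric(path: str, value, timestamp: int) -> str:
    """Build one line in the 'path value timestamp' form shown above."""
    return f"{path} {value} {timestamp}\n"


def send_metric(path, value, timestamp=None, host="localhost", port=3400):
    """Send a single metric line over TCP and close the connection."""
    if timestamp is None:
        timestamp = int(time.time())  # default to "now" in epoch seconds
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example call (needs something listening on localhost:3400):
# send_metric("1min.juicer.output.apple.sc1.jcr1", 20, 1276822626)
```

Keeping the wire format to a single text line is what makes sending simple: any language that can open a socket can emit a metric.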
graph
graph = post-event forensics
Rocksteady, metric as event
1min.juicer.common.version.sc1.jcr1 100 1276822626
INSERT INTO Deploy
SELECT * FROM Metric(name='common.revision')
MATCH_RECOGNIZE (
  partition by colo, hostname
  measures A.value as revision, A.colo as colo, A.hostname as hostname,
           A.app as app, A.timestamp as timestamp
  pattern (A)
  define A as A.value > prev(A.value))
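To make the logic of the statement above concrete, here is a rough Python sketch of the same idea: keep the last value per (colo, hostname) and emit an event when the value rises. It is not how Rocksteady or Esper is implemented. The path layout `<retention>.<app>.<name>.<colo>.<hostname>` is an assumption inferred from the sample line, and the class and function names are hypothetical. The sample line uses the name `common.version` while the statement filters on `common.revision`, so the sketch takes the metric name as a parameter rather than guessing which was intended.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    retention: str
    app: str
    name: str
    colo: str
    hostname: str
    value: float
    timestamp: int


def parse_metric(line: str) -> Metric:
    """Parse 'path value timestamp'; the path layout is an assumption."""
    path, value, timestamp = line.split()
    parts = path.split(".")
    return Metric(
        retention=parts[0],
        app=parts[1],
        name=".".join(parts[2:-2]),  # everything between app and colo
        colo=parts[-2],
        hostname=parts[-1],
        value=float(value),
        timestamp=int(timestamp),
    )


class DeployDetector:
    """Report a metric when its value exceeds the previous one for that host."""

    def __init__(self, metric_name: str):
        self.metric_name = metric_name
        self.last_value = {}  # (colo, hostname) -> last seen value

    def feed(self, metric: Metric):
        if metric.name != self.metric_name:
            return None
        key = (metric.colo, metric.hostname)  # mirrors "partition by colo, hostname"
        previous = self.last_value.get(key)
        self.last_value[key] = metric.value
        if previous is not None and metric.value > previous:
            return metric  # value increased: treat as a deploy event
        return None
```

The first value seen for a host never fires, since there is nothing earlier to compare it with.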
auto threshold, prediction
correlation
Deployment related problem.
Capture sets of metrics when important ones cross a threshold.
Determine dependencies, such as CPU to requests per second or response time.
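One simple way to check whether two metrics move together is the Pearson correlation coefficient. The talk does not say which method was used, so this is only an illustrative sketch; the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series, from -1.0 to 1.0."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near 1.0 for CPU against requests per second would suggest CPU scales with load. A value near 0 would suggest something else is driving CPU.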
revelation
beyond simple metric
Timing info per request.
Actual time spent in each component in an application.
Map out dependency, find exact area of problem.
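Per-request timing can be collected by wrapping each component call in a timer. A minimal sketch, assuming in-process instrumentation; the class name and the component names in the usage are hypothetical, not from the talk.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestTimer:
    """Accumulate wall-clock time spent in each named component of one request."""

    def __init__(self):
        self.spent = defaultdict(float)  # component name -> seconds

    @contextmanager
    def component(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the wrapped code raises.
            self.spent[name] += time.perf_counter() - start

    def slowest(self):
        """Return the component with the most accumulated time."""
        return max(self.spent, key=self.spent.get)


timer = RequestTimer()
with timer.component("db"):        # hypothetical component
    time.sleep(0.02)
with timer.component("render"):    # hypothetical component
    time.sleep(0.001)
print(timer.slowest())
```

The per-component totals could then be sent as ordinary metrics, one path per component, so the slow area shows up on a graph.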
what we learned?
1. Make metric sending simple.
2. Nice UI to make sense of data.
3. Real time processing of metric rocks.