profiling a warehouse-scale computer · “microservice architecture” thousands of services are...
TRANSCRIPT
![Page 1: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/1.jpg)
Profiling a warehouse-scale computerSvilen KanevJuan Pablo DaragoKim HazelwoodParthasarathy Ranganathan, Tipp MoseleyGu-Yeon Wei, David Brooks
Harvard UniversityUniversidad de Buenos Aires
Yahoo LabsGoogle Inc.
Harvard University
![Page 2: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/2.jpg)
The cloud is here to stay
[http://google.com/trends, 2015]2
![Page 3: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/3.jpg)
Warehouse-scale computers (of yore)
datacenters built arounda few “killer workloads”
problem sizes >> 1 machine
3
...distributed, buttightly interconnectedservices
communication through remote-procedure calls (RPCs)
![Page 4: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/4.jpg)
Now “the datacenter is the computer”(the WSC model has caught on)
4
“microservice architecture”
thousands of services are“one RPC away”
“... about a hundred of services that comprise Siri’s backend...”
[Apple, Mesos meetup 2015]
Did you mean: #pldi15
frequency[“#isca15”]++
![Page 5: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/5.jpg)
How do modern WSC applications interact with hardware?
And what does that imply for future server processors?
![Page 6: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/6.jpg)
Traditional profiling: load testing
6
Find representative inputs
Find representative operating point
Profile / optimize
Repeat
Isolate a service
![Page 7: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/7.jpg)
Live datacenter-scale profiling(Google-wide profiling)
Select randomproduction machines
~20,000 / day
GWP DB
[Ren et al. Google-wide profiling, 2010]7
Profile each one (for a while)without isolation
while running live trafficfor billions of users
Aggregate days, weeks,years worth of execution
![Page 8: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/8.jpg)
Live WSC profiling insights
8
Where are cycles spent in a datacenter?
Are there really no killer applications?
How do WSC applications interact with instruction caches?
How much ILP is there? Big / small cores?
DRAM latency vs. bandwidth?
Hyperthreading?
![Page 9: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/9.jpg)
Where are WSC cycles spent?
![Page 10: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/10.jpg)
No “killer” application to optimize for
10
Instead: a long tail of various different services
[1 week of sampled WSC cycles]
![Page 11: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/11.jpg)
Ongoing application diversification
11
[~3 years of sampled WSC cycles]
Optimizing hardware one-application-at-a-time has diminishing returns
![Page 12: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/12.jpg)
Within applications: no hotspots
Corollary: hunting for per-application hotspots is not justified
12
[search leaf node; 1 week of cycles]
![Page 13: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/13.jpg)
Shared low-level routines; typical for larger-than-1-server problems
Hotspots across applications:“datacenter tax’’
13
![Page 14: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/14.jpg)
Hotspots across applications:“datacenter tax’’
Prime candidates for accelerators in server SoCs
14
Only 6 self-contained routines account for ~30% of WSC cycles
![Page 15: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/15.jpg)
Live WSC profiling insights
15
Where are cycles spent in a datacenter? Everywhere.
Are there really no killer applications? Datacenter tax.
How do WSC applications interact with instruction caches?
How much ILP is there? Big / small cores?
DRAM latency vs. bandwidth?
Hyperthreading?
![Page 16: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/16.jpg)
Microarchitecture:WSC i-cache pressure
![Page 17: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/17.jpg)
Severe instruction cache bottlenecks
15-30% of core cycles wasted oninstruction-supply stalls
17
20,000 Intel IvyBridge servers2 daysTop-Down analysis [Yasin 2014]
![Page 18: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/18.jpg)
Severe instruction cache bottlenecks
Fetching instructions from L3 cachesVery high i-cache miss rates
10x the highest in SPEC50% higher than CloudSuite
15-30% of core cycles wasted oninstruction-supply stalls
Lots of lukewarm code100s MBs of instructions per binary; no hotspots
18
![Page 19: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/19.jpg)
A problem in the making
I-cache working sets 4-5x larger than largest in SPEC
Growing almost 30% / year
significantly fasterthan i-caches
One solution: L2 i/d partitioning
19
![Page 20: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/20.jpg)
Live WSC profiling insights
20
Where are cycles spent in a datacenter? Everywhere.
Are there really no killer applications? Datacenter tax.
How do WSC applications interact with instruction caches? Poorly.
How much ILP is there? Big / small cores? Bimodal.
DRAM latency vs. bandwidth? Latency.
Hyperthreading? Yes.
![Page 21: Profiling a warehouse-scale computer · “microservice architecture” thousands of services are “one RPC away” “... about a hundred of services that comprise Siri’s backend...”](https://reader034.vdocuments.net/reader034/viewer/2022042223/5eca4fa33e03014e8747efb7/html5/thumbnails/21.jpg)
To sum up
A growing number of programs cover “the world’s WSC cycles”. There is no “killer application”, and hand-optimizing each program is suboptimal.
Low-level routines (datacenter tax) are a surprisingly high fraction of cycles. Good candidates for accelerators in future server processors.
Common microarchitectural footprint: working sets too large for i-caches; many d-cache stalls; generally low IPC; bimodal ILP; low memory bandwidth utilization.