+ unobtrusive power proportionality for torque: design and implementation arka bhattacharya...
TRANSCRIPT
![Page 1: + Unobtrusive power proportionality for Torque: Design and Implementation Arka Bhattacharya Acknowledgements: Jeff Anderson Lee Andrew Krioukov Albert](https://reader030.vdocuments.net/reader030/viewer/2022032607/56649ec55503460f94bcf971/html5/thumbnails/1.jpg)
+Unobtrusive power proportionality for Torque: Design and Implementation
Arka Bhattacharya
Acknowledgements: Jeff Anderson Lee, Andrew Krioukov, Albert Goto
+Introduction
What is power proportionality?
The performance-power ratio at all performance levels is equivalent to that at the maximum performance level
Servers consume a high percentage of their max power even when idle
Hence, power proportionality => switch off idle servers
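A minimal sketch of the definition above, assuming a linear power model (idle power plus a load-dependent part). The wattages are hypothetical and only illustrate why a server that draws most of its peak power when idle is far from power proportional.

```python
def perf_per_watt(utilization, p_idle_w, p_max_w, perf_max=1.0):
    """Performance delivered per watt at a given utilization (0..1).

    Assumes power grows linearly from p_idle_w to p_max_w with load;
    this model is an illustration, not something stated on the slides.
    """
    power_w = p_idle_w + (p_max_w - p_idle_w) * utilization
    return perf_max * utilization / power_w


# Hypothetical server: 200 W at full load, 120 W when idle.
at_max = perf_per_watt(1.0, 120.0, 200.0)
at_40_percent = perf_per_watt(0.4, 120.0, 200.0)

# An ideal power-proportional server would draw 0 W when idle.
ideal_at_40_percent = perf_per_watt(0.4, 0.0, 200.0)

print(at_max, at_40_percent, ideal_at_40_percent)
```

For the ideal server the ratio at 40% load equals the ratio at full load; for the hypothetical real server it is roughly half of that.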
+NapSAC – Krioukov et al.
[Figure: NapSAC architecture diagram. Labels: IPS, Requests, Power, Computational “Spinning Reserve”, Load Distribution, Scheduling, Power Management, Wikipedia Request Rate. Source: CPS 2011, 4/12/11]
+The need for power proportionality of IT equipment in Soda Hall
Soda Hall Power: 450-500kW
Cluster Room Power: 120-130kW (~25%)
Total HVAC for cluster rooms: 75-85kW (~15%)
+PSI Cluster
Cluster Room Power: 120-130kW (~25% of Soda)
PSI Cluster: 20-25kW (~5% of Soda)
Total HVAC for cluster rooms: 75-85kW (~15% of Soda)
Total HVAC for PSI Cluster room: 20-25kW (~5% of Soda)
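The rounded percentages on this slide can be sanity-checked from the quoted kW ranges. The sketch below computes the lowest and highest possible share for each load; the helper name is made up for illustration.

```python
def share_range(part_kw, total_kw):
    """Return (min %, max %) of total_kw that part_kw can represent.

    Both arguments are (low, high) ranges in kW.
    """
    low = part_kw[0] / total_kw[1]
    high = part_kw[1] / total_kw[0]
    return round(100 * low, 1), round(100 * high, 1)


SODA_HALL_KW = (450, 500)

cluster_rooms = share_range((120, 130), SODA_HALL_KW)
psi_cluster = share_range((20, 25), SODA_HALL_KW)
cluster_hvac = share_range((75, 85), SODA_HALL_KW)

print(cluster_rooms, psi_cluster, cluster_hvac)
```

The slide's "~25%", "~5%" and "~15%" figures sit at or near the lower end of these ranges.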
+The PSI Cluster
The PSI Cluster consumes ~20-25kW of power irrespective of workload. It contains about 110 servers.
Recently, server faults have reduced the size of the cluster to 78 servers. (The faulty servers are mostly left powered on all the time.)
Used mainly by NLP, Vision, AI and ML graduate students.
It is an HPC Cluster running Torque
+PSI Cluster
+Possible Energy savings
Can save ~ 50% of the energy
+Current state :
+ Result:
10 kW
We save 49% of the energy
+What is Torque?
Terascale Open-source Resource and QUEue Manager
Built upon original Portable Batch System (PBS) project
Resource manager: Manages availability of, and requests for, compute node resources
Used by most academic institutions throughout the world for batch processing.
+Maui Scheduler
Job scheduler
Implements and manages:
Scheduling policies
Dynamic priorities
Reservations
Fairshare
+Sample Job Flow
Script submitted to TORQUE specifying required resources
Maui periodically retrieves from TORQUE list of potential jobs, available node resources, etc.
When resources become available, Maui tells TORQUE to execute certain jobs on particular nodes
TORQUE dispatches jobs to the PBS MOMs (machine oriented miniserver) running on the compute nodes - pbs_mom is the process that starts the job script
Job status changes reported back to Maui, information updated
+Why are we building power-proportional Torque ?
To shed load in Soda Hall
To investigate why production clusters don’t implement power proportionality
To integrate power-proportionality into a software used in many clusters throughout the world
+Desirables from an unobtrusive power proportionality feature
Avoid modifications to torque source code
Only use existing torque interfaces
Make the feature completely transparent to end users
Maintain system responsiveness
Centralized
No dependence on resource manager/scheduler version
+Analysis of the psi cluster
Logs:
• Active and idle queue log
• Job placement statistics
Logs exist for 68 days in Feb-April 2011
Logs were recorded once every minute
Logs contain information on ~169k jobs and ~40 users
+Type of servers in the psi cluster
| Server Make | Number of Cores | Memory | Count |
|---|---|---|---|
| Dell | 2 | 3GB | 64 |
| Dell | 8 | 16GB | 21 |
| Intel | 8 | 48GB | 28 |
| Intel | 24 | 256GB | 4 |
| Total | | | 117 |
• Each server class is further divided according to various features
• Not all servers listed above are switched on all the time
+CDF of server idle duration
TAKEAWAY 1: Most idle periods are short
+Contribution of server idle period to total
TAKEAWAY 2: To save energy, tackle the large idle periods
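The two takeaways can be illustrated with a small calculation: the share of idle periods that are short (the CDF view) versus the share of total idle time those short periods account for. The durations below are invented for illustration and are not taken from the psi cluster logs.

```python
def idle_period_shares(durations_s, threshold_s):
    """Return (share of periods, share of total idle time) for periods
    no longer than threshold_s seconds."""
    short = [d for d in durations_s if d <= threshold_s]
    share_of_periods = len(short) / len(durations_s)
    share_of_idle_time = sum(short) / sum(durations_s)
    return share_of_periods, share_of_idle_time


# Hypothetical idle periods in seconds: many short ones, two long ones.
sample_idle_s = [30, 45, 60, 90, 120, 150, 200, 240, 7200, 36000]

count_share, time_share = idle_period_shares(sample_idle_s, threshold_s=300)
print(f"{count_share:.0%} of periods, {time_share:.1%} of idle time")
```

In this made-up sample most periods are under five minutes, yet they account for only a few percent of the total idle time, which is the pattern the two takeaways describe.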
+CDF of job durations
[Figure: CDF of job durations for BATCH and INTERACTIVE jobs, with an annotated point at (50, 500s)]
TAKEAWAY 3: Most jobs are long. Hence a slight increase in queuing time won't hurt
+Summary of takeaways
Short server idle periods, though numerous, contribute very little to total server idle time.
The power proportionality algorithm need not be aggressive in switching off servers
Waking servers takes 5 min. Compared to the running time of a job, this is negligible
+Loiter Time vs Energy Savings
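A rough sketch of how the trade-off on this slide could be estimated from a list of idle periods. It assumes a server stays powered on for the loiter time and is off for the remainder of each idle period, and it ignores the energy spent shutting down and waking up. The idle periods are hypothetical.

```python
def idle_energy_saved_fraction(idle_periods_s, loiter_time_s):
    """Fraction of total idle time during which servers would be off.

    Simplifying assumption: a server is switched off exactly
    loiter_time_s seconds into an idle period and stays off until the
    period ends; shutdown and wake-up costs are not modelled.
    """
    total_idle = sum(idle_periods_s)
    if total_idle == 0:
        return 0.0
    time_off = sum(max(0, d - loiter_time_s) for d in idle_periods_s)
    return time_off / total_idle


# Hypothetical idle periods in seconds.
sample_idle_s = [30, 45, 60, 90, 120, 150, 200, 240, 7200, 36000]

for loiter_s in (0, 420, 3600):
    saved = idle_energy_saved_fraction(sample_idle_s, loiter_s)
    print(f"loiter {loiter_s:>4d} s -> {saved:.1%} of idle time spent off")
```

Because long idle periods dominate the total, even a loiter time of several minutes gives up little of the potential savings in this sample.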
+Design of unobtrusive Power Proportionality for Torque
+Using Torque interfaces
What useful state information does torque/maui maintain?
Maintains the state (active/offline/down) of each server, and the jobs running on it. Obtained through the “pbsnodes” command.
Maintains a list of running and queued jobs. Obtained through the “qstat” command.
Maintains job constraints and scheduling details of each job. Obtained through the “checkjob” command.
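As an illustration of using these interfaces from a script, the sketch below parses node state out of `pbsnodes`-style text. The sample text, node names and property names are made up, and the layout (a node name line followed by indented `key = value` lines) is an assumption about the command's plain-text output that should be checked against the installed Torque version.

```python
def parse_pbsnodes(text):
    """Parse 'node name + indented key = value' text into a dict of dicts."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current node's block
            continue
        if not line[0].isspace():
            current = line.strip()  # unindented line: a new node name
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


# Made-up sample; in a deployment the text might come from something like
# subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True).stdout
SAMPLE = """\
node01
     state = free
     np = 8
     properties = intel,mem48
     jobs = 0/1234.master

node02
     state = down,offline
     np = 2
     properties = dell,mem3
"""

nodes = parse_pbsnodes(SAMPLE)
idle_nodes = [n for n, attrs in nodes.items() if "jobs" not in attrs]
print(nodes["node01"]["state"], idle_nodes)
```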
+First implementation- State machine for each server
[State diagram] States: Active, Offline, Down, Waking, Problematic Server
Transition conditions:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server has woken up
• Server not waking
• If idle job can be scheduled on server
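One possible reading of this state machine, sketched as a transition function. The slide text lists the states and conditions, but which condition labels which arrow is an assumption here, as is the `WAKE_TIMEOUT_S` value used to decide that a server is "not waking". The loiter values are the ones quoted on the deployment slide.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    OFFLINE = "offline"          # powered on, but closed to new jobs
    DOWN = "down"                # powered off
    WAKING = "waking"
    PROBLEMATIC = "problematic"  # did not come back up


LOITER_TIME_S = 7 * 60
OFFLINE_LOITER_TIME_S = 3 * 60
WAKE_TIMEOUT_S = 10 * 60  # hypothetical; not given in the slides


def next_state(state, *, idle_s=0, offline_s=0, job_scheduled=False,
               idle_job_exists=False, woken=False, waking_s=0):
    """Return the next state for one server (edge mapping is assumed)."""
    if state is State.ACTIVE and idle_s > LOITER_TIME_S:
        return State.OFFLINE
    if state is State.OFFLINE:
        if idle_job_exists:
            return State.ACTIVE  # an idle job could be scheduled here
        if offline_s > OFFLINE_LOITER_TIME_S and not job_scheduled:
            return State.DOWN
    if state is State.DOWN and idle_job_exists:
        return State.WAKING
    if state is State.WAKING:
        if woken:
            return State.ACTIVE
        if waking_s > WAKE_TIMEOUT_S:
            return State.PROBLEMATIC
    return state


print(next_state(State.ACTIVE, idle_s=8 * 60))
```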
+Does not work !
Each job is submitted to a specific queue. We must ensure that the right server wakes up.
+Next implementation-State machine for each server
[State diagram] States: Active, Offline, Down, Waking, Problematic Server
Transition conditions:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server belongs to desired queue
• Server has woken up
• Server not waking
• If idle job can be scheduled on server
+Still did not work !
Each job has specific constraints which torque takes into account while scheduling
Job constraints can be obtained through “checkjob” command.
+Next implementation-State machine for each server
[State diagram] States: Active, Offline, Down, Waking, Problematic Server
Transition conditions:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server belongs to desired queue
• Server satisfies job constraints
• Server has woken up
• Server not waking
• If idle job can be scheduled on server
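The queue and constraint conditions amount to filtering the powered-down servers before waking any of them. The sketch below is one way to express that check; the field names, queue names and feature names are hypothetical, and the real constraints would come from the “checkjob” output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    queues: frozenset
    features: frozenset
    cores: int
    memory_gb: int


@dataclass(frozen=True)
class Job:
    job_id: str
    queue: str
    required_features: frozenset = frozenset()
    cores: int = 1
    memory_gb: int = 0


def can_run(server, job):
    """True if the server serves the job's queue and meets its constraints."""
    return (job.queue in server.queues
            and job.required_features <= server.features
            and job.cores <= server.cores
            and job.memory_gb <= server.memory_gb)


def wake_candidates(down_servers, job):
    """Powered-down servers that are worth waking for this idle job."""
    return [s for s in down_servers if can_run(s, job)]


# Hypothetical servers and job.
small = Server("dell-01", frozenset({"batch"}), frozenset({"dell"}), 2, 3)
big = Server("intel-01", frozenset({"batch", "interactive"}),
             frozenset({"intel", "bigmem"}), 24, 256)
job = Job("42", "batch", frozenset({"bigmem"}), cores=8, memory_gb=64)

print([s.name for s in wake_candidates([small, big], job)])
```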
+Scheduling problem: Job submission characteristics
Users tend to submit multiple jobs at a time (often >20)
Torque has its own fairness mechanisms, which won't schedule all the jobs even if there are free servers.
To accurately predict which jobs Torque will schedule, and to avoid switching on extra servers, we would have to emulate the Torque scheduling logic!
This would tie the power proportionality feature to a specific Torque policy.
Solution: Switch on only a few servers at a time and check whether Torque schedules the idle job
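A minimal sketch of the "few servers at a time" rule: each control cycle wakes at most a fixed number of servers, counting those already waking, and the next cycle looks at the queue again to see whether Torque actually scheduled the idle jobs. The function and parameter names are illustrative.

```python
def servers_to_wake(candidates, currently_waking, max_at_a_time):
    """Pick the servers to wake this cycle without exceeding the cap.

    candidates       -- servers that could run some idle job, in preference order
    currently_waking -- number of servers already in the Waking state
    max_at_a_time    -- cap on servers waking simultaneously
    """
    budget = max(0, max_at_a_time - currently_waking)
    return candidates[:budget]


# Seven candidates, two servers already waking, cap of five: wake three more.
print(servers_to_wake(["n1", "n2", "n3", "n4", "n5", "n6", "n7"], 2, 5))
```

This keeps the script independent of Torque's fairness policy: it never predicts what Torque will do, it only observes the result and adds capacity in small steps.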
+Next implementation-State machine for each server
[State diagram] States: Active, Offline, Down, Waking, Problematic Server
Transition conditions:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server belongs to desired queue
• Server satisfies job constraints
• Switch on only a few servers at a time
• Server has woken up
• Server not waking
• If idle job can be scheduled on server
+Maintain responsiveness/headroom
The debug cycle usually consists of users running short jobs and validating the output
If no server satisfying the job constraints is switched on, a user might have to wait a long time to verify that the job is running
If a job throws errors, the user might have to wait for an entire server power cycle to run the modified job
Solution: Group servers according to features. In each group, keep a limited number of servers as spinning reserve at all times
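The headroom rule can be sketched as two checks per feature group: how many servers must be woken to restore the reserve, and whether switching one off would eat into it. The dictionary layout for a server is an assumption made for this example.

```python
def spare_servers(group):
    """Count servers in a group that are powered on and not running a job."""
    return sum(1 for s in group if s["powered_on"] and not s["busy"])


def servers_needed_for_headroom(group, headroom):
    """How many more servers must be woken to restore the reserve."""
    return max(0, headroom - spare_servers(group))


def can_switch_off_one(group, headroom):
    """True if one idle server can be switched off and headroom still holds."""
    return spare_servers(group) > headroom


# Hypothetical group: two idle servers, one busy, one powered off.
group = [
    {"name": "n1", "powered_on": True, "busy": False},
    {"name": "n2", "powered_on": True, "busy": False},
    {"name": "n3", "powered_on": True, "busy": True},
    {"name": "n4", "powered_on": False, "busy": False},
]

print(servers_needed_for_headroom(group, headroom=3),
      can_switch_off_one(group, headroom=3))
```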
+Final implementation-State machine for each server
[State diagram] States: Active, Offline, Down, Waking, Problematic Server
Transition conditions:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Switching off servers leaves no headroom
• Idle job exists
• Server belongs to desired queue
• Server satisfies job constraints
• Switch on only MAX_SERVERS at a time
• Switch on server to maintain headroom
• Server has woken up
• Server not waking
• If idle job can be scheduled on server
+But the servers don’t wake up !!!
Each server has to bootstrap a list of services, such as network file systems, work directories, portmapper, etc.
Often these bootstraps fail, and hence servers are left in an undesired state (e.g. with no home directories mounted to write user output to!)
Solution: Have a health-check script on each server. Check for proper configuration of useful services, and make the server available for scheduling only if the health check succeeds.
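A hedged sketch of what such a health check might look like. The slides do not list the exact checks or paths, so the mount points and directories here are placeholders, and the `is_mount` parameter exists only so the logic can be exercised without real mounts.

```python
import os


def health_check(required_mounts, writable_dirs, is_mount=os.path.ismount):
    """Return a list of problems; an empty list means the server looks healthy."""
    problems = []
    for mount_point in required_mounts:
        if not is_mount(mount_point):
            problems.append(f"not mounted: {mount_point}")
    for directory in writable_dirs:
        if not (os.path.isdir(directory) and os.access(directory, os.W_OK)):
            problems.append(f"not writable: {directory}")
    return problems


# Placeholder paths; a real check would use the cluster's own mount points
# and would only offer the server for scheduling when the list is empty.
problems = health_check(["/home"], ["/path/that/does/not/exist"],
                        is_mount=lambda path: False)
print(problems)
```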
+Power Proportional Torque at a glance:
Completely transparent to user
Did not modify torque source code
1000-line Python script which runs only on the torque master server
Halts servers through ssh
Wakes servers through wake-on-LAN
Separates scheduling policy from mechanism. It allows torque to dictate the scheduling policy.
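For illustration, the two mechanisms can be sketched as follows. The wake-on-LAN magic packet format (six 0xFF bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast) is standard; the exact halt command run over ssh is not given in the slides, so the one below is a hypothetical example.

```python
import socket


def magic_packet(mac):
    """Build a wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def send_wake_on_lan(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is a common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


def halt_command(host):
    """Argument list for halting a server over ssh (hypothetical command)."""
    return ["ssh", host, "sudo", "shutdown", "-h", "now"]


packet = magic_packet("00:11:22:33:44:55")
print(len(packet), halt_command("node01"))
```

The command list returned by `halt_command` could be passed to `subprocess.run`; it is kept separate here so the sketch does not try to contact any host.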
+Deployment
Deployed on 57 of the 78 active nodes in the psi cluster. Total number of cores = 150
Servers were classified into 5 groups based on features.
HEADROOM_PER_GROUP = 3
MAX_SERVERS_TO_WAKE_AT_A_TIME = 5
LOITER_TIME = 7 minutes
OFFLINE_LOITER_TIME = 3 minutes
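The deployment parameters above could be collected in one configuration object, as sketched below. The class and method names are invented for illustration, and the two derived quantities rest on assumptions: that a server first loiters idle and then loiters offline before it is halted, and that headroom applies to each of the 5 groups independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    headroom_per_group: int = 3
    max_servers_to_wake_at_a_time: int = 5
    loiter_time_s: int = 7 * 60
    offline_loiter_time_s: int = 3 * 60

    def min_idle_before_halt_s(self):
        """Shortest idle stretch after which a server could be halted."""
        return self.loiter_time_s + self.offline_loiter_time_s

    def max_reserved_servers(self, n_groups):
        """Upper bound on servers kept on purely as spinning reserve."""
        return self.headroom_per_group * n_groups


cfg = DeploymentConfig()
print(cfg.min_idle_before_halt_s(), cfg.max_reserved_servers(n_groups=5))
```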
+Average Statistics
Deployed since last week
~800 jobs analyzed
Avg utilization of cluster = 40%
% Energy saved = 49%
+Results:
+HVAC power savings
+Number of servers powered on at a time:
Headroom
+Expected vs Actual savings
+Submission vs Execution profile
[Figure: Submission profile vs. execution profile. x-axis: Time (5/17/12 12:00 to 5/21/12 0:00); y-axis: Number of Active Cores (0 to 120)]
+CDF of job queue time as a percentage of job length
+Conclusions – what we achieved
Power proportionality is easy to achieve for torque without changing any source code at all
The script could be run on any standard torque cluster to save energy.
Switching servers back on in a consistent state is the single biggest roadblock to deployment of the script.
We saved a max of ~17kW of power in Soda Hall (~3%). This was with only half the psi cluster!