
POWER MANAGEMENT AND ENERGY EFFICIENCY

2016 Operating Systems Design
Euiseong Seo ([email protected])

* Adapted from “Power Management for Embedded Systems,” Minsoo Ryu

Need for Power Management

• Power consumption matters
• PCs
  - Energy cost
  - Thermal dissipation
• Mobile devices
  - Battery lifetime
  - Thermal dissipation
• Server systems
  - Energy cost
  - Electrical infrastructure
  - Power usage effectiveness

Power and Performance

• Power ∝ voltage² × clock
• Clock ∝ voltage
• Therefore, Power ∝ clock³
• Server processors have already reached a TDP of 150 W
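As a rough illustration of the cubic relation, the sketch below computes the relative power at a scaled clock. It assumes only the idealized proportionalities on this slide (no static or leakage power); the function name `relative_power` is ours and is not part of the lecture.

```python
def relative_power(freq_ratio: float) -> float:
    """Relative dynamic power when the clock is scaled by freq_ratio.

    Assumption (idealized, from the slide): voltage scales linearly with
    the clock, so power scales with the cube of the clock.
    """
    voltage_ratio = freq_ratio               # Clock is proportional to voltage
    return voltage_ratio ** 2 * freq_ratio   # Power ~ voltage^2 * clock


if __name__ == "__main__":
    for ratio in (1.0, 0.8, 0.5):
        print(f"clock x{ratio:.1f} -> power x{relative_power(ratio):.3f}")
```

Under this model, halving the clock cuts dynamic power to one eighth, which is why lowering the frequency can pay off even though the work takes longer.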

Power Consumers in a System

• Processors
  - Dominate power consumption
  - Usually consume 100 watts out of 300 watts
• Memory
  - Significant contributor

Source: Real-Time Computing and Communications Lab., Hanyang University (http://rtcc.hanyang.ac.kr)

Power Consumption of Memory

• In a server system
  - Memory consumes 19% of system power on average
  - Some work notes up to 40% of total system power

Power Consumers in a System

• Storage
  - A server HDD consumes 5 to 10 watts
  - A laptop HDD consumes 1 to 5 watts
  - A SATA SSD consumes 1 to 5 watts
  - An NVMe SSD consumes 30 watts or more
• NIC
  - A 10 Gbps NIC consumes 5 to 20 watts
• Peripherals
  - Insignificant

Idle Power Consumption

• Only 30% of servers in data centers are fully utilized, while the other 70% are kept in an idle state
  - Idle servers consume between 60% and 66% of the peak load power consumption




Two Dimensions on Power Management

• Power management when the system is idle
  - Select the most efficient idle state
• Power management when the system is active
  - Dynamically change operating frequency and/or voltage

APM and ACPI

• APM (Advanced Power Management)
  - Activated when the system becomes idle
    - Screen saver → sleep → suspend
  - Controlled by firmware (BIOS)
    - Needs a reboot for reconfiguration
  - OS has no knowledge
• ACPI (Advanced Configuration and Power Interface)
  - Controlled by OS
  - First released in 1996 by Intel, Microsoft, and Toshiba (Compaq, HP, and Phoenix joined later revisions)

ACPI

• Standard interface specification
  - Brings power management under the control of the operating system
  - The specification is central to Operating System-directed configuration and Power Management (OSPM)




[Figure: block diagram with the labels Applications, OS Power Management, Software drivers, ACPI, and Hardware (CPU, BIOS, etc.)]

ACPI Functions

• System power management
  - The entire computer
• Processor power management
  - When the OS is idle but not sleeping, it puts processors in low-power states
• Device power management
  - ACPI tables describe motherboard devices, their power states, and the power planes the devices are connected to

Firmware-Level ACPI Architecture

• Three components
  - ACPI tables
    - Contain definition blocks that describe all the hardware that can be managed through ACPI
    - Include both data and machine-independent bytecode (AML)
    - The OS must have an interpreter for the AML bytecode
  - ACPI BIOS
    - Performs basic management operations on the hardware
    - Includes code to help boot the system and to put the system to sleep or wake it up
  - ACPI registers
    - A set of hardware management registers defined by the ACPI specification

ACPI States




Global States

• G0: Working (S0)
  - Processor power states (C-states): C0, C1, C2, C3
• G1: Sleeping (e.g., suspend, hibernate)
  - Sleep states (S-states): S1, S2, S3, S4
• G2: Soft off (S5)
  - Almost the same as G3 Mechanical Off, except that the power supply unit (PSU) still supplies a minimum of power
  - Other components may remain powered so the computer can "wake" on input from the keyboard, clock, modem, LAN, or USB device
• G3: Mechanical off

Processor States (C-State)

• Global state is G0 (working)
• Four processor states
  - C0: Operating
    - Performance states (P-states)
    - P0: highest performance, highest power
    - P1 ~ Pn: lower performance, lower power
  - C1: Halt
    - The processor is not executing instructions, but can return to an executing state essentially instantaneously
  - C2: Stop-Clock (optional)
    - The processor maintains all software-visible state, but may take longer to wake up
  - C3: Sleep (optional)
    - The processor does not need to keep its cache coherent, but maintains other state

Processor States (C-State)

• Intel Pentium M at 1.6 GHz

[Figure: Performance States (P-State) of the Intel Pentium M at 1.6 GHz]

Device States (D-State)

• The device states D0–D3 are device-dependent
  - D0: Fully On
    - The operating state
  - D1 and D2
    - Intermediate power states whose definition varies by device
  - D3: Off
    - The device is powered off and unresponsive to its bus
    - D3 Hot: Aux power is provided
    - D3 Cold: No power provided

Sleeping States (S-State)

• Four sleeping states
  - S1: Power on Suspend (POS)
    - All the processor caches are flushed
    - The power to the CPU(s) and RAM is maintained
    - Wakeup takes about 1 ~ 2 seconds on desktops
  - S2: CPU powered off
    - Dirty cache is flushed to RAM (often not used)
  - S3: Suspend to RAM (STR), or Standby, Sleep
    - RAM remains powered
    - Wakeup takes about 3 ~ 5 seconds on desktops
  - S4: Suspend to Disk (STD), or hibernation
    - All content of the main memory is saved to non-volatile memory such as a hard drive, and the system is powered down

Dynamic Voltage and Frequency Scaling

• Adjusting clock speed and operating voltage dynamically
• Most modern processors provide DVFS
• Low clock switching overhead
  - Usually within a few µs

[Figure: two plots of performance over time, each marked with a deadline]

Four Considerations for DVFS

• Workload amount
  - Adjust the processor frequency depending on the load
• Workload characteristics
  - Compute-intensive vs. memory-intensive
• Deadline constraints
  - Lowest possible frequency for meeting deadlines
• Load balancing
  - Migrate or scale?
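The deadline consideration can be sketched as a small selection routine. This is only an illustration under simplifying assumptions: the job is purely CPU-bound (execution time = cycles / frequency), DVFS transition overhead is ignored, and the frequency levels in the usage example are hypothetical.

```python
from typing import Optional, Sequence


def lowest_feasible_freq(cycles: float, deadline_s: float,
                         freqs_hz: Sequence[float]) -> Optional[float]:
    """Return the lowest frequency that finishes `cycles` within the deadline.

    Assumption: execution time = cycles / frequency (CPU-bound job),
    with no DVFS transition overhead.
    """
    for f in sorted(freqs_hz):
        if cycles / f <= deadline_s:
            return f
    return None  # no available frequency meets the deadline


if __name__ == "__main__":
    levels = [1.0e9, 2.0e9, 3.0e9]  # hypothetical frequency levels in Hz
    print(lowest_feasible_freq(3e7, 0.020, levels))  # 30M cycles, 20 ms deadline
```

Returning `None` instead of the maximum frequency makes a missed deadline explicit, so the caller can decide how to handle it.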

Workload Amount and DVFS

• Static approaches
  - Performance policy
    - CPU runs at the maximum frequency regardless of load
  - Power save policy
    - CPU runs at the minimum frequency regardless of load
• Dynamic approaches
  - On-demand policy
    - Increase the clock speed to the maximum frequency when the system load goes above the predefined threshold
    - Decrease the clock speed gradually when the system load falls below the predefined threshold
  - Conservative policy
    - Gradually increase the CPU speed rather than jumping to the maximum speed
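The difference between the two dynamic policies can be sketched as one sampling step each. This is a simplified model, not the Linux implementation: the threshold defaults (95% for on-demand, 80%/20% for conservative) follow the governor figures later in these slides, while the 10% step size and the GHz values are assumptions made for illustration.

```python
def ondemand_step(freq, load, f_min, f_max, up_threshold=0.95, step=0.1):
    """One sampling step of an ondemand-like policy (simplified sketch)."""
    if load > up_threshold:
        return f_max                                   # jump straight to the maximum
    return max(f_min, freq - step * (f_max - f_min))   # otherwise decrease gradually


def conservative_step(freq, load, f_min, f_max,
                      up_threshold=0.80, down_threshold=0.20, step=0.1):
    """One sampling step of a conservative-like policy (simplified sketch)."""
    delta = step * (f_max - f_min)   # assumed step: 10% of the frequency range
    if load > up_threshold:
        return min(f_max, freq + delta)   # step up gradually
    if load < down_threshold:
        return max(f_min, freq - delta)   # step down gradually
    return freq                           # load within the band: keep frequency


if __name__ == "__main__":
    # Hypothetical range of 1.0 to 3.0 GHz with a sudden load spike.
    print(ondemand_step(1.0, 0.99, 1.0, 3.0))
    print(conservative_step(1.0, 0.99, 1.0, 3.0))
```

With the same load spike, the on-demand step reaches the maximum in one sample, whereas the conservative step needs several samples to get there.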

Workload Characteristics and DVFS

• Two types of workload
  - Compute-intensive
    - The program execution is exclusively bound to the processor
  - Memory-intensive
    - The program makes heavy access to memory
    - The processor would spend a significant fraction of the time waiting for memory
• A simple solution
  - High processor frequency and low memory frequency for compute-intensive load
  - Low processor frequency and high memory frequency for memory-intensive load
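Why the two workload types react differently to CPU frequency can be seen with a first-order timing model. The sketch assumes execution time splits into on-chip cycles (which scale with the CPU clock) and memory stall time (which does not); the workload numbers are hypothetical, and 733 MHz and 333 MHz are used only as sample endpoints.

```python
def exec_time_s(cpu_cycles: float, mem_stall_s: float, freq_hz: float) -> float:
    """Execution time = on-chip cycles at the CPU clock + memory stall time.

    Assumption: memory stall time is independent of the CPU frequency.
    """
    return cpu_cycles / freq_hz + mem_stall_s


def slowdown(cpu_cycles: float, mem_stall_s: float,
             f_high_hz: float, f_low_hz: float) -> float:
    """Ratio of the execution time at f_low to the execution time at f_high."""
    return (exec_time_s(cpu_cycles, mem_stall_s, f_low_hz)
            / exec_time_s(cpu_cycles, mem_stall_s, f_high_hz))


if __name__ == "__main__":
    # Hypothetical compute-bound job: no memory stalls.
    print(slowdown(1e9, 0.0, 733e6, 333e6))
    # Hypothetical memory-bound job: 0.1 s of compute at 733 MHz, 0.9 s of stalls.
    print(slowdown(7.33e7, 0.9, 733e6, 333e6))
```

In this model the compute-bound job slows down by the full frequency ratio, while the memory-bound job barely changes, which is the motivation for running memory-intensive loads at a low processor frequency.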

CPU- vs. Memory-Intensive

• Execution time variation
  - CPU frequency ranging from 733 MHz to 333 MHz




GPU and Memory-Intensive

• Compute-intensive applications
  - Dense matrix multiplication
  - Run on NVIDIA GeForce GTX 280 GPU




GPU and Memory-Intensive

• Memory-intensive applications
  - Dense matrix transpose
  - Run on NVIDIA GeForce GTX 280 GPU




Load Balancing and DVFS

• DVFS can be independently applied to each processor on multicore hardware
  - But this may not lead to optimal power saving from a global point of view
• A simple scenario
  - We need to decide whether to transfer a thread from processor A to an idle processor B, or increase the frequency of A
  - Compute P_migrate_from_A_to_B and P_increase_A_freq
  - Transfer if P_migrate_from_A_to_B < P_increase_A_freq
  - Otherwise, increase the frequency of A
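The migrate-or-scale rule can be sketched with a toy power model. Everything numeric here is an assumption for illustration: the active-core model (a static part plus a cubic dynamic part), the idle power of core B, and the frequencies. The slides only specify the comparison between the two power estimates.

```python
def core_power_w(freq_ghz: float, k: float = 1.0, p_static_w: float = 2.0) -> float:
    """Hypothetical active-core power: static part + cubic dynamic part."""
    return p_static_w + k * freq_ghz ** 3


def decide(f_low_ghz: float, f_high_ghz: float, p_idle_w: float = 0.5) -> str:
    """Return 'migrate' or 'scale' by comparing the two system power estimates."""
    # Option 1: move a thread to B; A and B both run at the low frequency.
    p_migrate_from_a_to_b = 2 * core_power_w(f_low_ghz)
    # Option 2: keep the thread on A and raise A's frequency; B stays idle.
    p_increase_a_freq = core_power_w(f_high_ghz) + p_idle_w
    return "migrate" if p_migrate_from_a_to_b < p_increase_a_freq else "scale"


if __name__ == "__main__":
    print(decide(1.0, 2.0))  # large frequency increase needed
    print(decide(1.0, 1.2))  # small frequency increase needed
```

With these assumed numbers, migration wins when A would need a large frequency increase, and scaling wins when a small increase suffices, because waking B adds its static power.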

Case Study: Intel Core2 Duo E6850

TABLE II. Voltage (Vcpu, in V) and clock frequency (fcpu, in GHz) levels for the Intel Core2 Duo E6850 processor.

DVFS level   Vcpu (V)   fcpu (GHz)
Level 1      1.30       3.074
Level 2      1.25       2.852
Level 3      1.20       2.588
Level 4      1.15       2.281
Level 5      1.10       1.932
Level 6      1.05       1.540

TABLE III. Measured and analytical models of Intel Core2 Duo E6850 power consumption.

Vcpu (V)   fcpu (GHz)   Measurement (W)   Analytical model (W)
1.056      1.776        21.520            21.212
1.080      1.888        24.000            23.956
1.104      2.004        26.320            26.856
1.160      2.338        33.760            34.838
1.224      2.672        43.200            44.409
1.280      3.006        55.440            54.236

reflects the supply voltage and frequency changes. We choose a high-end DVFS-enabled microprocessor, i.e., the Intel Core2 Duo E6850 processor, along with the LTC3733 3-phase synchronous step-down DC–DC converter that supports discontinuous mode, which is a representative setup of a modern high-performance DVFS-enabled microprocessor.

The microprocessor power consumption model is described in (18). The parameters Ce, a1, and a2 are obtained from actual measurements. We insert a shunt monitor circuit right in front of the DC–DC converter of the Intel Core2 Duo E6850 processor, and measure the power supply current with an Agilent A34401 digital multimeter. We compensate for the DC–DC converter efficiency in the measured current values, and characterize IO. We run the PrimeZ benchmark and change Vcpu and fcpu by direct access to the BIOS (basic input/output system) as described in Table II, because Intel SpeedStep supports only two voltage levels. We finally derive the following power consumption model:

$$P_{cpu} = 8.4503\,V_{cpu}^{2}\,f_{cpu} + (36.3851\,V_{cpu} - 33.9503), \qquad (30)$$

where the units of Pcpu, Vcpu, and fcpu are W, V, and GHz, respectively. The difference between the analytical model and measurement results is less than 4.6% as shown in Table III. The DC–DC converter parameters are given in Table IV. The values are chosen according to guidelines in the datasheet and reference designs offered by the vendor.
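As a quick check of the model, the sketch below evaluates equation (30) at the operating points of Table III and compares the result with the measured power. The coefficients and data points are copied from the text; the function and variable names are ours.

```python
def p_cpu_w(v_cpu: float, f_cpu_ghz: float) -> float:
    """Equation (30): CPU power in W from voltage (V) and frequency (GHz)."""
    return 8.4503 * v_cpu ** 2 * f_cpu_ghz + (36.3851 * v_cpu - 33.9503)


# (Vcpu in V, fcpu in GHz, measured power in W), copied from Table III.
TABLE_III = [
    (1.056, 1.776, 21.520),
    (1.080, 1.888, 24.000),
    (1.104, 2.004, 26.320),
    (1.160, 2.338, 33.760),
    (1.224, 2.672, 43.200),
    (1.280, 3.006, 55.440),
]


def relative_errors():
    """Relative error of the model against each measurement in Table III."""
    return [abs(p_cpu_w(v, f) - measured) / measured
            for v, f, measured in TABLE_III]


if __name__ == "__main__":
    for (v, f, measured), err in zip(TABLE_III, relative_errors()):
        print(f"V={v:.3f} f={f:.3f} model={p_cpu_w(v, f):.3f} W "
              f"measured={measured:.3f} W error={err:.1%}")
```

Every computed error stays below the 4.6% bound stated in the text, and the model values agree with the analytical column of Table III to within rounding.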

The delay overhead of DVFS transition is given in Table V.

TABLE IV. DC–DC converter parameters of the LTC3733 3-phase converter for the Intel Core2 Duo E6850.

Parameter   Value        Parameter   Value
VIN         12 (V)       VOUT        VO in Table II
C           8840 (µF)    L           1 (µH) per phase
RL          2.3 (mΩ)     fDC         530 (kHz) per phase
max(IL)     75 (A)

TABLE V. DVFS transition delay overhead for the Intel Core2 Duo E6850 processor with the LTC3733 converter (Tpll = 5 µs; the numbers of cycles are calculated at f = 3.074 GHz).

            Actual value (µs)           Proposed model (µs)
Level       Tuc     Total   Cycles      Tuc     Total   Cycles
2→1         4.77    9.77    30018       4.11    9.11    28011
3→1         12.29   17.29   53141       12.21   17.21   52890
3→2         5.95    10.95   33672       5.72    10.72   32950
4→1         22.29   27.29   83894       24.24   29.24   89894
4→2         14.69   19.69   60531       16.21   21.21   65201
4→3         7.33    12.33   37921       8.06    13.06   40150
5→1         34.81   39.81   122389      40.37   45.37   139457
5→2         26.47   31.47   96733       31.44   36.44   112025
5→3         27.33   32.33   99383       21.87   26.87   82606
5→4         9.43    14.43   44361       11.49   16.49   50694
6→1         57.68   62.68   192684      60.90   65.90   202590
6→2         49.50   54.50   167525      51.65   56.65   174157
6→3         31.89   36.89   113409      41.52   46.52   142994
6→4         28.35   33.35   102531      30.14   35.14   108006
6→5         12.14   17.14   52688       16.84   21.84   67141
Downscale   0       5       15370       0       5       15370

The value of Tpll, 5 µs, is specified in the Intel Core2 Duo E6850 datasheet. The actual values are obtained from SPICE simulation results. We obtain TX by observing the settling time of VO(t) from SPICE results, and substitute it into the equations in Section IV to calculate the delay overhead. The estimated overhead from the proposed macro model follows the trend of the actual values well. For upscaling, the delay overhead is the sum of the underclocking-related overhead Tuc and the PLL lock time loss Tpll. Unlike the assumption of previous works, the underclocking-related overhead is the dominant factor in most cases, as we have discussed in Section III. For downscaling, the PLL lock time is the only delay overhead, and thus the overhead values are the same for all cases.
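The delay composition described above can be sketched as follows. The constants (Tpll = 5 µs and f = 3.074 GHz) come from the text and the Table V caption; the function names are ours, and the conversion to cycles is a plain multiplication, so upscaling cycle counts may differ slightly from the tabulated ones, presumably because the tabulated Tuc values are rounded.

```python
T_PLL_US = 5.0      # PLL lock time in microseconds, as stated in the text
F_REF_GHZ = 3.074   # frequency used for the cycle counts in Table V


def transition_delay_us(t_uc_us: float, upscaling: bool) -> float:
    """Total delay: Tuc + Tpll for upscaling, Tpll only for downscaling."""
    return (t_uc_us if upscaling else 0.0) + T_PLL_US


def delay_in_cycles(delay_us: float, f_ghz: float = F_REF_GHZ) -> int:
    """Convert a delay in microseconds to clock cycles (1 GHz = 1000 cycles/µs)."""
    return round(delay_us * f_ghz * 1000)


if __name__ == "__main__":
    up = transition_delay_us(4.77, upscaling=True)     # level 2 -> 1, actual Tuc
    down = transition_delay_us(0.0, upscaling=False)   # any downscale
    print(up, delay_in_cycles(up))
    print(down, delay_in_cycles(down))
```

The downscale case reproduces the 15370 cycles listed in the last row of Table V.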

The energy overhead values of a DVFS transition for continuous- and discontinuous-mode operations are given in Tables VI and VII, respectively. For the actual value, we obtain IL(t), IO(t), and VO(t) from SPICE simulation and substitute them into (14), (16), and (19). There is no Ecap in the tables as it is implied in Eir. The value of Eir for the case 1 → 6 in Table VI is large because it drains a significant amount of charge from the bulk capacitor to the ground. On the other hand, Eir for the same case in Table VII is much smaller because it uses most of the stored charge to supply the load. This result is very different from previous models such as [6], as they simply calculate the overhead based on the charge transfer to and from the bulk capacitor.

B. Case 2: ARM Cortex-A8 Processor

The second target DVFS system is the ARM Cortex-A8 processor with the LTC3446 converter. The ARM Cortex-A8 processor is an application processor targeting high-end mobile products such as smartphones, tablets, and netbooks. It exhibits power consumption of 600 mW at full speed. We perform a procedure






Case Study: Exynos 4210



DVFS Overhead (Exynos 4210)

Linux Power Management Architecture



[Figure: Linux power management architecture, including the Policy Management Layer (Governors) and the Device Driver Layer. Source: Amit Kucheria, 2011 Embedded Linux Conference]

CPUidle Architecture




CPUidle Governors

• Ladder governor
  - Takes a simple, step-wise approach to selecting an idle state
  - Enters the lightest state first, and will only move on to the next deeper state if a sleep was long enough
• Menu governor
  - Picks the deepest possible idle state straight away
  - Considers the expected sleep time, latency requirements, previous C-state residency, etc.
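The two selection styles can be contrasted with a heavily simplified sketch. It is not the kernel code: the idle-state table, its latency and residency numbers, and the function names are all hypothetical, and the ladder variant here only promotes (the demotion path is omitted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: float      # time needed to wake up from the state
    target_residency_us: float  # minimum sleep length that makes the state worthwhile


# Hypothetical states, ordered from lightest to deepest.
STATES = [
    IdleState("C1", 1, 1),
    IdleState("C2", 20, 100),
    IdleState("C3", 100, 1000),
]


def menu_select(predicted_sleep_us, latency_limit_us, states=STATES):
    """Menu-like: deepest state that fits the predicted sleep and the latency limit."""
    chosen = states[0]
    for s in states:
        if (s.target_residency_us <= predicted_sleep_us
                and s.exit_latency_us <= latency_limit_us):
            chosen = s
    return chosen


def ladder_next(current_idx, last_sleep_us, states=STATES):
    """Ladder-like: go one step deeper only if the last sleep was long enough."""
    nxt = current_idx + 1
    if nxt < len(states) and last_sleep_us >= states[nxt].target_residency_us:
        return nxt
    return current_idx


if __name__ == "__main__":
    print(menu_select(5000, 1000).name)       # long predicted sleep, loose latency
    print(STATES[ladder_next(0, 5000)].name)  # ladder moves only one step deeper
```

For the same long sleep, the menu-style choice jumps directly to the deepest state, while the ladder-style choice needs one long sleep per step to get there.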

Idle Task

• When there are no runnable processes, CFS schedules the idle task (PID 0)




Tickless Idle

• Traditional systems use a periodic interrupt ("tick")
  - Updates the system clock
  - Each tick requires a wakeup from the idle state
• Tickless idle eliminates the periodic timer tick when the CPU is idle
  - The CPU can remain in power-saving states for a longer period of time, reducing the overall system power consumption

CPUFreq Architecture




CPUFreq Governor




Governor       Operations
Performance    • Always set CPU to the highest frequency between scaling_min_freq and scaling_max_freq
Powersave      • Always set CPU to the lowest frequency between scaling_min_freq and scaling_max_freq
Ondemand       • Set frequency depending on the current usage
               • Rapidly increase the frequency and gradually decrease the frequency
Conservative   • Basically operates like ondemand
               • Gradually increase and decrease the frequency
Userspace      • Set CPU to the frequency written to scaling_setspeed by the user
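On a Linux system, the attributes named in this table are exposed as files under sysfs. The sketch below reads a few of them for one CPU; it assumes the usual location `/sys/devices/system/cpu/cpu0/cpufreq`, which may be absent on machines without a cpufreq driver, and the helper name `read_cpufreq` is ours.

```python
from pathlib import Path

# Assumed sysfs location of the cpufreq policy for CPU 0.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")

ATTRIBUTES = (
    "scaling_governor",
    "scaling_min_freq",
    "scaling_max_freq",
    "scaling_setspeed",
)


def read_cpufreq(directory: Path = CPUFREQ_DIR) -> dict:
    """Read a few cpufreq attributes; unreadable or missing files map to None."""
    result = {}
    for name in ATTRIBUTES:
        try:
            result[name] = (directory / name).read_text().strip()
        except OSError:
            result[name] = None
    return result


if __name__ == "__main__":
    for key, value in read_cpufreq().items():
        print(f"{key}: {value}")
```

Reading is side-effect free; changing the governor or writing `scaling_setspeed` requires root privileges and is deliberately left out of this sketch.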

Ondemand Governor



[Figure: frequency (between min and max) and load over time under the ondemand governor, with up_threshold = 95%]

Conservative Governor



[Figure: frequency (between min and max) and load over time under the conservative governor, with up_threshold = 80% and down_threshold = 20%]

Basic Operations of CPUfreq

• Sample the processor utilization periodically
• Adjust frequency based on the utilization
• Adjust voltage based on frequency



