
Ramazan Bitirgen, Engin Ipek and Jose F. Martinez, MICRO'08

Presented by PAK, EUNJI

Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach

Resource sharing problem in CMPs
Increasing levels of pressure on shared system resources
Efficient sharing is necessary for high utilization and performance
Multiple interacting resources: cache space, DRAM bandwidth, and power budget
Allocating one resource affects an application's demand for the other resources

Proposed resource allocation framework
At runtime, the framework monitors the execution of each application, learns a predictive model of performance as a function of resource allocation decisions, and periodically allocates resources to each core using that model

Introduction

Per-application hardware performance model
Uses Artificial Neural Networks (ANNs) to predict each application's performance as a function of the resources allocated to it

Global resource manager
At every interval, searches the space of possible resource allocations by querying the application performance models (see the sketch after this slide)

Resource Allocation Framework
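The loop below is a minimal sketch of this interaction, assuming hypothetical `predict_ipc` and `apply_allocation` hooks; the names and the sum-of-IPCs objective are illustrative, not taken from the paper:

```python
# Sketch of the global resource manager loop: every interval it queries the
# per-application performance models to score candidate allocations and
# applies the best one found. All names here are illustrative.

def manage_resources(apps, candidate_allocations, predict_ipc, apply_allocation):
    """Pick the allocation whose predicted aggregate performance is highest."""
    best_alloc, best_score = None, float("-inf")
    for alloc in candidate_allocations:          # e.g. generated by hill climbing
        # Sum of predicted IPCs; other objective functions could be plugged in.
        score = sum(predict_ipc(app, alloc[app]) for app in apps)
        if score > best_score:
            best_alloc, best_score = alloc, score
    apply_allocation(best_alloc)                 # program cache ways, DVFS level, bandwidth share
    return best_alloc
```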

Use ANNs
Input units, hidden units, and an output unit connected via a set of weighted edges
Each hidden (or output) unit calculates a weighted sum of its inputs (or of the hidden values) based on the edge weights
Edge weights are trained with training examples (data sets); a minimal forward pass is sketched after this slide

How to Predict a Performance?(Artificial Neural Networks)
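A minimal forward-pass sketch for such a network, using the sigmoid activation described on the following slide; the layer sizes and weights here are arbitrary placeholders:

```python
import math

def sigmoid(x):
    # Squashes a weighted sum into (0, 1), matching the activation on the next slide.
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """One hidden layer and one output unit; each weight row ends with a bias term."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
              for row in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights[:-1], hidden))
                   + output_weights[-1])

# Example: 3 inputs, 2 hidden units, 1 output (weights are arbitrary placeholders).
print(forward([0.2, 0.5, 0.1],
              [[0.4, -0.3, 0.8, 0.0], [0.1, 0.9, -0.2, 0.1]],
              [0.7, -0.5, 0.2]))
```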

Input units
L2 cache space, off-chip bandwidth, and power budget allocated to the application
Number of read hits, read misses, write hits, and write misses over the last 20K instructions and over the last 1.5M instructions
Fraction of cache ways that are dirty (a proxy for the amount of write-back traffic)

Activation function
Sigmoid (maps a real-valued weighted sum to a value in [0, 1])

Models performance as a function of the application's allocated resources and its recent behavior
Trained during the first 1.2 billion cycles with randomly allocated resources
Always keeps a training set of 300 points
Retrained every 2,500,000 cycles (see the sketch after this slide)

How to Predict a Performance?(Adaptation to per-APP Performance Model)
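A rough sketch of that sampling and retraining schedule; the `OnlineModel` class and its `trainer` callback are assumptions made for illustration:

```python
from collections import deque

RETRAIN_PERIOD = 2_500_000   # cycles between retraining passes
TRAINING_SET_SIZE = 300      # most recent (features, measured IPC) samples kept

class OnlineModel:
    """Keeps a sliding window of samples and periodically refits the ANN ensemble."""
    def __init__(self, trainer):
        self.samples = deque(maxlen=TRAINING_SET_SIZE)  # old points fall off the end
        self.trainer = trainer                          # callable: samples -> model
        self.model = None
        self.last_train_cycle = 0

    def record(self, cycle, features, measured_ipc):
        self.samples.append((features, measured_ipc))
        if cycle - self.last_train_cycle >= RETRAIN_PERIOD and self.samples:
            self.model = self.trainer(list(self.samples))
            self.last_train_cycle = cycle
```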

Optimization: prevent memorizing outliers in the sample data
Cross validation
The data set is divided into N equal-sized folds (N-1 folds for training, 1 fold as the test set)
The ensemble consists of N ANN models
Performance is predicted by averaging the predictions of all ANNs in the ensemble
Prediction error is estimated as a function of the coefficient of variation (CoV) of the predictions of the individual ANNs (this estimate is used later for resource allocation; see the sketch after this slide)

How to Predict a Performance?(Adaptation to per-APP Performance Model)

[Figure: the sample window split into cross-validation folds; each ANN is trained on N-1 folds and tested on the remaining one]
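A minimal sketch of ensemble prediction with a CoV-based confidence estimate; the 9% threshold is the one quoted on the resource-allocation slide, while the function names and the exact error formula are assumptions:

```python
import statistics

ERROR_THRESHOLD = 0.09   # applications above this estimated error are excluded later

def ensemble_predict(models, features):
    """Average the per-ANN predictions; use their spread as an error estimate."""
    preds = [m(features) for m in models]          # one prediction per fold's ANN
    mean = statistics.fmean(preds)
    cov = statistics.pstdev(preds) / mean if mean else float("inf")
    return mean, cov

def is_confident(models, features):
    _, cov = ensemble_predict(models, features)
    return cov <= ERROR_THRESHOLD
```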

Makes a resource allocation decision every 500,000 cycles using the trained per-application performance models

Discards queries involving an application with a high error estimate
Fairly distribute resources to the running applications, predict the performance, and compute the prediction error
If the performance estimate is judged inaccurate (error > 9%), the application is excluded from global resource allocation

Searches the space with stochastic hill climbing (sketched after this slide)
The search starts with a random solution and iteratively makes small changes to it, each time improving it a little; when no further improvement can be found, it terminates
2,000 trials produce the best tradeoff between search quality and overhead

Resource Allocation
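A sketch of such a stochastic hill-climbing search over allocations; the `neighbor` move and the toy scoring function are made-up examples, not the paper's actual search moves:

```python
import random

TRIALS = 2_000   # number of candidate moves evaluated per interval

def hill_climb(initial_alloc, neighbor, score, trials=TRIALS):
    """Start from a (random) allocation and keep any small change that improves it."""
    current = initial_alloc
    current_score = score(current)
    for _ in range(trials):
        candidate = neighbor(current)        # e.g. shift one cache way or one DVFS step
        candidate_score = score(candidate)
        if candidate_score > current_score:  # keep only improving moves
            current, current_score = candidate, candidate_score
    return current

# Toy usage: split 12 free cache ways between two apps, preferring app 0.
def neighbor(alloc):
    a, b = alloc
    if a > 0 and random.random() < 0.5:
        return (a - 1, b + 1)
    return (a + 1, b - 1) if b > 0 else alloc

best = hill_climb((6, 6), neighbor, score=lambda alloc: 2 * alloc[0] + alloc[1])
print(best)   # converges toward (12, 0) for this toy objective
```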

Hardware implementation
A single hardware ANN whose edge weights are multiplexed on the fly, yielding 16 'virtual' ANNs
12 * 4 + 4 multipliers (as many as weighted edges)
A 50-entry table-based quantized sigmoid function (see the sketch after this slide)
Computation is pipelined; a prediction (search query) takes 16 cycles across the 16 virtual ANNs

Area, power, and delay
About 3% of the chip's area and 3 W of power
2,000 queries can be made within 5% of an interval

OS interface
The training set and the ANN weights are embedded in the process state
The OS communicates the desired objective function through a control register (CR)

Implementation & Overhead
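A software model of a 50-entry table-based quantized sigmoid like the one described above; the input range and the nearest-entry lookup are assumptions:

```python
import math

TABLE_SIZE = 50
X_MIN, X_MAX = -6.0, 6.0   # assumed input range; sigmoid saturates outside it

# Precompute one sigmoid sample per table entry.
SIGMOID_TABLE = [1.0 / (1.0 + math.exp(-(X_MIN + (X_MAX - X_MIN) * i / (TABLE_SIZE - 1))))
                 for i in range(TABLE_SIZE)]

def quantized_sigmoid(x):
    """Clamp to the table range and return the nearest precomputed sample."""
    if x <= X_MIN:
        return SIGMOID_TABLE[0]
    if x >= X_MAX:
        return SIGMOID_TABLE[-1]
    idx = round((x - X_MIN) / (X_MAX - X_MIN) * (TABLE_SIZE - 1))
    return SIGMOID_TABLE[idx]

print(quantized_sigmoid(0.0))   # roughly 0.5, within the table's quantization error
```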

Tools & architecture
Heavily modified version of SESC, with Wattch (power) and HotSpot (temperature)
Baseline: Intel Core 2 Quad-style 4-core CMP with DDR2-800; per-core frequency from 0.9 GHz to 4.0 GHz in 0.1 GHz steps; 4 MB, 16-way shared L2 cache

Resource partitioning (see the configuration sketch after this slide)
A 60 W power budget is distributed among the 4 applications via per-core DVFS; the distributable budget is limited to 57 W, with 5 W statically allocated
The L2 cache space is partitioned at the granularity of cache ways: one way is allocated to each application and the remaining 12 ways are distributed
Each application is statically allocated 800 MB/s of off-chip DRAM bandwidth, and the remaining 3.2 GB/s is distributed

Experimental Setup
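A small configuration sketch of these budgets and static reservations; the field names (and the 6.4 GB/s DDR2-800 peak, which is derived rather than stated on the slide) are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ResourceConfig:
    # Totals for the 4-core CMP described above.
    power_budget_w: float = 60.0       # distributed via per-core DVFS (0.9-4.0 GHz, 0.1 GHz steps)
    l2_ways_total: int = 16            # 4 MB, 16-way shared L2
    dram_bw_total_gbs: float = 6.4     # assumed DDR2-800 peak: 4 * 0.8 GB/s reserved + 3.2 GB/s free

    # Static per-application reservations; the remainder is distributed by the manager.
    l2_ways_reserved_per_app: int = 1           # 4 apps -> 12 ways left to distribute
    dram_bw_reserved_per_app_gbs: float = 0.8   # 800 MB/s each -> 3.2 GB/s left to distribute

cfg = ResourceConfig()
free_ways = cfg.l2_ways_total - 4 * cfg.l2_ways_reserved_per_app
print(free_ways)   # 12 ways available to the global resource manager
```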

Metrics (computed in the sketch below)
Weighted speedup
Sum of IPCs
Harmonic mean of normalized IPCs
Weighted sum of IPCs
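A sketch of the first three metrics for toy per-application IPC values; the definitions follow the usual conventions (IPCs normalized to each application's single-run IPC), and the numbers are made up:

```python
# Per-app IPC when running in the shared CMP vs. when running alone (toy numbers).
shared_ipc = [1.2, 0.8, 1.5, 0.6]
alone_ipc  = [1.5, 1.0, 1.6, 1.2]

# Weighted speedup: sum of per-app IPCs normalized to their single-run IPCs.
weighted_speedup = sum(s / a for s, a in zip(shared_ipc, alone_ipc))

# Sum of IPCs: raw throughput.
ipc_sum = sum(shared_ipc)

# Harmonic mean of normalized IPCs: penalizes unfairness across applications.
norm = [s / a for s, a in zip(shared_ipc, alone_ipc)]
hmean_norm_ipc = len(norm) / sum(1.0 / n for n in norm)

print(weighted_speedup, ipc_sum, hmean_norm_ipc)
```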

Workloads
9 quad-core multi-programmed workloads from the SPEC2000 and NAS suites
Applications are classified into 3 categories: CPU-bound, memory-bound, and cache-sensitive

Experimental Setup

Configurations
Unmanaged
Isolated cache management (Cache): utility-based cache partitioning [MICRO 2006]; distributes L2 cache ways to minimize the miss rate
Isolated power management (Power): an analysis of efficient multi-core global power management policies, maximizing performance for a given power budget [MICRO 2006]
Isolated bandwidth management (BW): fair queuing memory system [MICRO 2006]
Uncoordinated: Cache + Power, Cache + BW, Power + BW, Cache + Power + BW
Continuous stochastic hill climbing (Coordinated-HC): learning-based SMT processor resource distribution (issue queue, ROB, and register file) [ISCA 2006]
Fair-share
Proposed scheme (Coordinated-ANN): ANN-based models of each application's IPC response to resource allocation guide a stochastic hill-climbing search

Experimental Setup

Performance
Results are normalized to Fair-Share
14% average speedup over Fair-Share; similar results for the other metrics

Evaluation Results

[Figure: speedup per workload mix, normalized to Fair-Share; labels give each application's class (C = cache-sensitive, M = memory-bound, P = CPU-bound)]

Sensitivity to the confidence threshold
Results are normalized to Fair-Share

Evaluation Results

[Figure: sensitivity of speedup to the confidence threshold, per workload mix]

Confidence estimation mechanism
Fraction of the total execution time during which the ANN predictions were usable for the resource allocation optimization, for each application

Evaluation Results

[Figure: fraction of execution time with usable ANN predictions, per application and workload mix]

Proposed a resource allocation framework that manages multiple shared CMP resources in a coordinated fashion, using ANN-based performance models and a periodic resource allocation scheme

A coordinated approach to managing multiple resources is key to delivering high performance on multi-programmed workloads

Conclusions

Extras
[Backup slides: additional per-workload result charts, with workload labels as above]