DLL-Conscious Instruction Fetch Optimization for SMT Processors

Fayez Mohamood, Mrinmoy Ghosh, Hsien-Hsin (Sean) Lee

School of Electrical and Computer Engineering, Georgia Institute of Technology


Page 1: Title Slide

DLL-Conscious Instruction Fetch Optimization for SMT Processors

Fayez Mohamood, Mrinmoy Ghosh, Hsien-Hsin (Sean) Lee

School of Electrical and Computer Engineering, Georgia Institute of Technology

Page 2: Dynamically Linked Libraries

An efficient way to develop software on a common platform

Modules that provide a set of services to application software

System DLLs help manage system functionality

Application DLLs enable flexibility and modularity

Name          Functionality
KERNEL32.DLL  Memory, I/O and interrupt functions
NTDLL.DLL     Core operating system functions
USER32.DLL    User interface functionality such as window handling and message passing
GDI32.DLL     Functions for creating 2-D graphics
MFC42.DLL     Contains the Microsoft Foundation Classes used by many Windows applications

Page 3: Shared Libraries

DLLs house major system and application functionality

A typical Microsoft Windows application uses 30 DLLs on average

On average, 20 DLLs are shared among different applications

Different applications share system DLLs on the same virtual page

[Figure: Process 0 and Process 1 address spaces. Labels: Operating System, Application, Application DLL, DLL, System DLL, Application Code.]

Page 4: Simultaneous Multithreading

Boost instruction throughput with minimal hardware increase

Bottleneck due to resource sharing

I-Cache, branch predictor, LSQ, ROB, etc. are shared

Commercial processors: IBM Power5, Intel Pentium 4, Alpha 21464

Presence of DLLs exacerbates the I-Cache bottleneck

[Figure: SMT processor pipeline. Stage labels: Rename, Queue, Scheduler, Register Read, Execute, L1 Cache, Register Write, Retire. Structure labels: Register Rename, Allocate, Registers, L1 D-Cache, Store Buffer, Reorder Buffer, Instruction Queue.]

Page 5: DLL Thrashing and Duplication

Virtual Memory is supported by common desktop platforms

Virtually-Indexed instruction caches accelerate lookup

Aliasing needs to be resolved in the I-Cache and the I-TLB

How can homonym aliasing be prevented?

Non-SMT processors can flush the cache/TLB upon a context switch

SMT processors require a Process or Address Space Identifier to prevent access violation

PID or ASID induces false misses when a different process looks up an instruction that is part of a shared DLL

Page 6: DLL Thrashing and Duplication

DLL Thrashing: In a direct-mapped I-Cache, shared DLL instructions will result in an increased number of conflict misses

DLL Duplication: In a set-associative I-Cache, shared DLL instructions will exist in multiple locations resulting in wasted space

Direct-mapped example (false eviction):

Process 0: 0x1000 0x3453
Process 1: 0x1000 0x3453

PID  Valid  Tag    Data
X    0      X      X
0    1      0x100  0x3453
1    1      0x100  0x3453    FALSE EVICTION

Set-associative example (duplication):

Process 0: 0x1000 0x3453
Process 1: 0x1000 0x3453

PID  Valid  Tag    Data
X    0      X      X

PID  Valid  Tag    Data
0    1      0x100  0x3453
1    1      0x100  0x3453    DUPLICATION
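The two effects on this slide can be reproduced with a toy model. The sketch below is not the authors' simulator: it models a single cache set whose lines hit only when both PID and tag match, and it assumes FIFO replacement. The class and function names are hypothetical. Two processes alternately fetch the same shared-DLL instruction (tag 0x100, data 0x3453).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    pid: int
    tag: int
    data: int


class PidTaggedSet:
    """One cache set whose lines hit only when both PID and tag match."""

    def __init__(self, ways: int) -> None:
        self.ways = ways
        self.lines: List[Line] = []  # oldest first; FIFO replacement is an assumption

    def fetch(self, pid: int, tag: int, data: int) -> bool:
        """Return True on a hit; on a miss, fill a line and return False."""
        for line in self.lines:
            if line.pid == pid and line.tag == tag:
                return True
        if len(self.lines) == self.ways:
            self.lines.pop(0)  # evict the oldest line
        self.lines.append(Line(pid, tag, data))
        return False


def run(ways: int) -> Tuple[List[bool], List[Line]]:
    """Processes 0 and 1 alternately fetch the same shared-DLL instruction."""
    cache_set = PidTaggedSet(ways)
    accesses = [(0, 0x100, 0x3453), (1, 0x100, 0x3453)] * 2
    hits = [cache_set.fetch(pid, tag, data) for pid, tag, data in accesses]
    return hits, cache_set.lines


dm_hits, dm_lines = run(ways=1)  # direct-mapped: one line per set
sa_hits, sa_lines = run(ways=2)  # 2-way set-associative

print(dm_hits)        # [False, False, False, False]: every fetch evicts the other copy
print(sa_hits)        # [False, False, True, True]: hits only after both copies are filled
print(len(sa_lines))  # 2: the same instruction occupies two ways
```

In the direct-mapped case every access misses (thrashing); in the 2-way case both processes eventually hit, but only because the set holds two identical copies (duplication).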

Page 7: DLL-Conscious Instruction Fetch

Program locality in the presence of DLLs is disturbed by PID matching

Alleviate the DLL thrashing and/or duplication effect

We propose making the micro-architecture DLL-aware, with the capability to distinguish DLL from non-DLL instructions

DLL-Conscious Instruction Fetch:

A DLL (or L) bit in the page table and the I-TLB

A modified OS page fault handler that sets the L bit for DLLs

For VIVT caches, an L bit in each line of the I-Cache to facilitate faster translation
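One way to read the L bit is as a condition that waives the PID comparison for shared DLL lines. The sketch below is an assumption about how such a hit check could look for a VIVT I-Cache line; the slides do not give the exact logic, and the names `ICacheLine`, `baseline_hit`, and `dll_conscious_hit` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ICacheLine:
    valid: bool
    pid: int     # process that filled the line
    l_bit: bool  # True when the line comes from a shared DLL page
    tag: int


def baseline_hit(line: ICacheLine, pid: int, tag: int) -> bool:
    """PID-tagged lookup: the requesting process must own the line."""
    return line.valid and line.tag == tag and line.pid == pid


def dll_conscious_hit(line: ICacheLine, pid: int, tag: int) -> bool:
    """Assumed L-bit lookup: the PID check is waived for shared DLL lines."""
    return line.valid and line.tag == tag and (line.l_bit or line.pid == pid)


shared = ICacheLine(valid=True, pid=0, l_bit=True, tag=0x100)
private = ICacheLine(valid=True, pid=0, l_bit=False, tag=0x200)

print(baseline_hit(shared, pid=1, tag=0x100))        # False: a false miss
print(dll_conscious_hit(shared, pid=1, tag=0x100))   # True: the DLL line is reused
print(dll_conscious_hit(private, pid=1, tag=0x200))  # False: private code still needs a PID match
```

Non-DLL lines keep the PID check, so the protection that the PID provides against homonyms is preserved for private code.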

Page 8: VIVT I-Cache Optimization

[Figure: VIVT I-Cache lookup. Labels: I-TLB for Thread 1 and I-TLB for Thread 2 (entry fields VALID, SHARED, VPN, PPN; V, L, PID, PPN); PID; Instruction Cache (line fields PID, V, L, TAG, DATA); Virtual Page Number and Page Offset; L1 Cache Index and Block Offset; I-L1 Tag Compare; HIT.]

I-TLB lookup is necessary only upon an I-Cache miss

Page 9: VIPT I-Cache Optimization

[Figure: VIPT I-Cache lookup. Labels: I-TLB for Thread 1 and I-TLB for Thread 2 (entry fields VALID, SHARED, VPN, PPN; V, L, PID, PPN); PID; Instruction Cache (line fields V, TAG, DATA); Virtual Address of Instruction, split into Virtual Page Number and Page Offset; L1 Cache Index and Block Offset; I-L1 Tag Compare; HIT.]
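The figure splits a virtual address into a virtual page number and page offset, and takes the L1 cache index and block offset from the low-order bits. The sketch below illustrates that split under assumed parameters that do not come from the slides: 4 KB pages, 32-byte lines, and 128 sets, so that index plus block offset fit exactly inside the page offset. The function and type names are hypothetical.

```python
from typing import NamedTuple


class AddressFields(NamedTuple):
    vpn: int           # virtual page number, translated by the I-TLB
    page_offset: int   # identical in the virtual and physical address
    set_index: int     # selects the I-Cache set
    block_offset: int  # selects the byte within the line


def split_virtual_address(vaddr: int, page_size: int = 4096,
                          line_size: int = 32, num_sets: int = 128) -> AddressFields:
    """Split a virtual address into the fields used by a VIPT lookup."""
    return AddressFields(
        vpn=vaddr // page_size,
        page_offset=vaddr % page_size,
        set_index=(vaddr // line_size) % num_sets,
        block_offset=vaddr % line_size,
    )


a = split_virtual_address(0x1234)
b = split_virtual_address(0x7234)  # different page, same page offset

print(a)                           # vpn=1, page_offset=564, set_index=17, block_offset=20
print(a.set_index == b.set_index)  # True: the index depends only on the page offset
```

Because the set index comes entirely from the page offset in this configuration, the cache can be indexed in parallel with the I-TLB translation, and the physical tag compare follows once the PPN is available.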

Page 10: VIPT Illustration

[Figure: VIPT lookup walk-through. Labels: I-TLB for Thread 1 and I-TLB for Thread 2 (entry fields VALID, SHARED, VPN, PPN; V, L, PID, PPN); Process Identifier; Instruction Cache (line fields V, TAG, DATA); Virtual Page Number and Page Offset; L1 Cache Index and Block Offset; I-L1 Tag Compare; HIT; MISS. Example accesses: Process 0: 0x1000 0x3453; Process 1: 0x1000 0x3453.]

Page 11: Simulation Methodology

Studying DLLs required the modeling of an entire platform

TAXI: Trace Analysis for x86 Interpretation (by Vlaovic et al.)

Bochs System Emulator

Modified SimpleScalar with x86 front end

Kernel Debugger to capture DLL behavior

[Figure: simulation flow. Labels: Bochs System Emulator; Instruction Traces; Memory Traces; x86 Out-Of-Order Performance Simulator; x86 SMT Out-Of-Order Performance Simulator.]

Page 12: Simulation Parameters

Parameter            Value
Fetch/Decode width   4
Issue/Commit width   4
Branch Predictor     2-Level GAg, 512 entries
BTB                  4-Way, 128 sets
L1 I-Cache           DM, 2-Way and 4-Way; 16KB and 8KB; 32B line
L1 D-Cache           DM, 16KB, 32B line
L2 Cache             4-Way, Unified, 256KB, 64B line
L1/L2 Latency        1 cycle / 6 cycles
Main Memory Latency  120 cycles
ROB Size             48 entries
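The cache parameters above determine how many sets each configuration has, which is what governs how often shared DLL lines from different processes collide. The sketch below is plain arithmetic on the listed sizes (capacity divided by line size and associativity); the helper name `cache_geometry` is hypothetical and the code is not part of the authors' tooling.

```python
from typing import Dict

KB = 1024


def cache_geometry(size_bytes: int, line_bytes: int, ways: int) -> Dict[str, int]:
    """Derive line count, set count, and address bit widths for a cache.

    Assumes all three arguments are powers of two.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2 of the line size
        "index_bits": sets.bit_length() - 1,         # log2 of the set count
    }


# L1 I-Cache configurations from the table: 16KB and 8KB, 32B lines, DM/2-way/4-way
for size in (16 * KB, 8 * KB):
    for ways in (1, 2, 4):
        print(f"{size // KB}KB {ways}-way:", cache_geometry(size, 32, ways))

# Unified L2: 256KB, 4-way, 64B lines
l2 = cache_geometry(256 * KB, 64, 4)
print("L2:", l2)
```

For example, the 16KB direct-mapped I-Cache has 512 sets of one line each, while the 8KB 4-way I-Cache has only 64 sets.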

Page 13: DLL Instruction Percentage

Application                Total Instructions (millions)  System DLL Instructions
Adobe Acrobat Reader 6.0   410                            14.6%
MS PowerPoint 97           366                            20.8%
MS Word 97                 378                            16.4%
MS Internet Explorer 5.0   446                            15.3%
MS Visual C++ 6.0          398                            11.4%
Netscape Communicator 4.7  432                            17.4%

Page 14: DLL Usage Distribution

[Figure: Normalized DLL Usage Distribution. Vertical axis from 0% to 80%. Series: Adobe Acrobat Reader 6.0, Microsoft Internet Explorer 5.0, Netscape Navigator 4.7, Microsoft PowerPoint 97, Visual C++ 6.0, Microsoft Word 97.]

Page 15: 2-Way DLL I-Cache Misses

[Figure: 2-Way I-Cache Misses. Vertical axis: Number of Misses (millions), 0 to 16. Series: DLL-Conscious, Baseline. Homogeneous threads: (Acroread, Acroread), (PowerPoint, PowerPoint), (Netscape, Netscape). Heterogeneous threads: (Word, Acroread), (Visual C++, PowerPoint), (Internet Explorer, Visual C++).]

Misses per thread decrease by a factor of 3.3 to 5.0 for homogeneous threads

Misses decrease by up to 2.5 times for heterogeneous threads

Page 16: 2-Way I-Cache Hit Rate

[Figure: 2-Way I-Cache Hit Rate. Vertical axis: Hit Rate, 0.0% to 100.0%. Series: 8K DMap DLL-Conscious, 8K DMap Baseline. Homogeneous threads: (Acroread, Acroread), (PowerPoint, PowerPoint), (Netscape, Netscape). Heterogeneous threads: (Word, Acroread), (Visual C++, PowerPoint), (Internet Explorer, Visual C++).]

Overall I-Cache hit rate increased by 50% (from 30% to 47% for Netscape Communicator)

Homogeneous threads show promise for more performance benefits

Page 17: 4-Way I-Cache Misses and Hit Rate

[Figure: 4-Way I-Cache DLL Misses. Vertical axis: Number of Misses (millions), 0 to 16. Series: DLL-Conscious, Baseline. Workloads: Acroread (4 instances); Acroread and PowerPoint (2 instances each); Acroread, PowerPoint, Word and Visual C++.]

[Figure: 4-Way I-Cache Hit Rate. Vertical axis: Hit Rate, 0.0% to 90.0%. Series: DLL-Conscious, Baseline. Workloads: Acroread (4 instances); Acroread and PowerPoint (2 instances each); Acroread, PowerPoint, Word and Visual C++.]

Misses per thread decrease by up to 5.5 times for homogeneous threads

I-Cache hit rate improves by as much as 62% (from 28% to 47% for 4 instances of Acrobat Reader)

Page 18: 4-Way DLL IPC Improvement

[Figure: 4-Way DLL IPC. Vertical axis: DLL IPC, 0 to 0.7. Series: DLL-Conscious 4-Wide, Baseline 4-Wide, DLL-Conscious 8-Wide, Baseline 8-Wide, DLL-Conscious High Latency, Baseline High Latency. Workloads: Adobe(1), Adobe(2), Adobe(3), Adobe(4); Adobe(1), Adobe(2), PowerPoint(1), PowerPoint(2); Adobe, PowerPoint, Word, Visual C++.]

4-Wide Machine: Up to 21% improvement

8-Wide Machine: Up to 24% improvement

High Latency Machine: Up to 30% improvement

Page 19: 4-Way IPC Improvement

[Figure: 4-Way IPC. Vertical axis: IPC, 0 to 0.9. Series: DLL-Conscious 4-Wide, Baseline 4-Wide, DLL-Conscious 8-Wide, Baseline 8-Wide, DLL-Conscious High Latency, Baseline High Latency. Workloads: Adobe(1), Adobe(2), Adobe(3), Adobe(4); Adobe(1), Adobe(2), PowerPoint(1), PowerPoint(2); Adobe, PowerPoint, Word, Visual C++.]

4-Wide Machine: Up to 10% improvement

8-Wide Machine: Up to 14% improvement

High Latency Machine: Up to 15% improvement

Page 20: Related Work

Execution Trace Characteristics of Windows NT Applications (Lee et al., ISCA 1998)

DLL BTB proposed by Vlaovic et al. (MICRO 2000)

OS techniques including Page Coloring and Bin Hopping (Lo et al., ISCA 1998)

Commercial implementations of a global bit for reducing the burden of context switches:

MIPS: (G)lobal bit in the TLB

ARM 1176: nG bit in the TLB for global data

Intel P6: PGE bit in the CR4 register

Page 21: Conclusions & Contributions

Current and future generations of Operating Systems will be highly modular

Analyzed and quantified the effect of DLL thrashing and duplication

Devised a light-weight technique to reinstate DLL sharing in processor micro-architecture

Evaluated the benefits using a complete system level simulation methodology

2-Way IPC improved up to 10%

4-Way IPC improved up to 15%

Exploiting system features is yet another way to continue providing performance boosts in processors at the system level

Page 22: That's All Folks!

Questions & Answers