Flexible Architectural Support for Fine-Grain Scheduling

Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis. March 16th, 2010. Stanford University.


Page 1: ...web.stanford.edu/group/mast/cgi-bin/drupal/system/files/2010.adm...

Flexible Architectural Support for Fine-Grain Scheduling

Daniel Sanchez
Richard M. Yoo
Christos Kozyrakis

March 16th, 2010
Stanford University

Page 2

Overview

• Our focus: User-level schedulers for parallel runtimes
  – Cilk, TBB, OpenMP, …
• Trends:
  – More cores/chip → Need to exploit finer-grain parallelism
  – Deeper memory hierarchies, costlier cache coherence → Communication through shared memory increasingly inefficient
• Existing fine-grain schedulers:
  – Software-only: Slow, do not scale
  – Hardware-only: Fast, but inflexible
• Our contribution: Hardware-aided approach
  – HW: Fast, asynchronous messages between threads (ADM)
  – SW: Scalable message-passing schedulers
  – ADM schedulers scale like HW, flexible like SW schedulers

Page 3

Outline

• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation

Page 4

Fine-grain parallelism

• Fine-grain parallelism: Divide the work in a parallel phase into small tasks (~1K-10K instructions)
• Potential advantages:
  – Expose more parallelism
  – Reduce load imbalance
  – Adapt to a dynamic environment (e.g. changing # cores)
• Potential disadvantages:
  – Large scheduling overheads
  – Poor locality (if application has inter-task locality)

Page 5

Task-stealing schedulers

[Figure: threads T0, T1, …, Tn, each with its own task queue; arrows labeled Enqueue, Dequeue, and Steal]

• One task queue per thread
• Threads dequeue and enqueue tasks from queues
• When a thread runs out of work, it tries to steal tasks from another thread
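The three bullets above can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the names `Worker`, `run`, and `make_task` are hypothetical, the victim is chosen at random, and the owner dequeues LIFO while thieves take the oldest task. The slides do not prescribe any of these choices.

```python
import random
from collections import deque


class Worker:
    """One thread's state: a private task queue and a count of executed tasks."""

    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()
        self.executed = 0

    def enqueue(self, task):
        self.queue.append(task)

    def dequeue(self):
        # Assumed policy: the owner takes the most recently enqueued task (LIFO).
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        # Assumed policy: a thief takes the oldest task from the victim's queue.
        return victim.queue.popleft() if victim.queue else None


def run(workers, rng):
    """Round-robin simulation: each worker runs one task per step, stealing when empty."""
    while any(w.queue for w in workers):
        for w in workers:
            task = w.dequeue()
            if task is None:
                victims = [v for v in workers if v is not w and v.queue]
                if victims:
                    task = w.steal_from(rng.choice(victims))
            if task is not None:
                task(w)  # a task may enqueue more tasks on the worker running it
                w.executed += 1


def make_task(depth):
    """Hypothetical workload: each task spawns two children until depth reaches 0."""
    def task(worker):
        if depth > 0:
            worker.enqueue(make_task(depth - 1))
            worker.enqueue(make_task(depth - 1))
    return task


workers = [Worker(i) for i in range(4)]
workers[0].enqueue(make_task(5))  # all work starts on T0
run(workers, random.Random(0))
total = sum(w.executed for w in workers)  # 2**6 - 1 = 63 tasks in the binary tree
```

All work starts on one worker, so the others only make progress by stealing, which is the load-balancing effect the slide describes.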

Page 6

Task-stealing: Components

1. Queues
2. Policies
3. Communication

[Figure: threads T0, T1, …, Tn with task queues; arrows labeled Enq/deq and Steal]

• In software schedulers:
  – Queues and policies are cheap
  – Communication through shared memory increasingly expensive!

[Figure: chart with legend App, Queues, Stealing, Starved]

Page 7

Hardware schedulers: Carbon

• Carbon [ISCA '07]: HW queues, policies, communication
  – One hardware LIFO task queue per core
  – Special instructions to enqueue/dequeue tasks
• Implementation:
  – Centralized queues for fast stealing (Global Task Unit)
  – One small task buffer per core to hide GTU latency (Local Task Units)

[Figure: two charts with legend App, Queues, Stealing, Starved; bars labeled 31x and 26x]

• Large benefits if app matches HW policies
• Useless if app doesn't match HW policies

Page 8

Approaches to fine-grain scheduling

• Software-only: OpenMP, Cilk, TBB, X10, …
  – SW queues & policies, SW communication
  – High-overhead, flexible, no extra HW
• Hardware-only: Carbon, GPUs, …
  – HW queues & policies, HW communication
  – Low-overhead, inflexible, special-purpose HW
• Hardware-aided: Asynchronous Direct Messages
  – SW queues & policies, HW communication
  – Low-overhead, flexible, general-purpose HW

Page 9

Outline

• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation

Page 10

Asynchronous Direct Messages

• ADM: Messaging between threads tailored to scheduling and control needs:
  – Low-overhead, short messages → Send from/receive to registers, independent from coherence
  – Overlap communication and computation → Asynchronous messages with user-level interrupts
  – General-purpose → Generic interface, allows reuse

Page 11

ADM Microarchitecture

• One ADM unit per core:
  – Receive buffer holds messages until dequeued by thread
  – Send buffer holds sent messages pending acknowledgement
  – Thread ID Translation Buffer translates TID → core ID on sends
  – Small structures (16-32 entries), don't grow with # cores

Page 12

ADM ISA

Instruction        Description
adm_send r1, r2    Sends a message of (r1) words (0-6) to thread with ID (r2)
adm_peek r1, r2    Returns source and message length at head of rx buffer
adm_rx r1, r2      Dequeues message at head of rx buffer
adm_ei / adm_di    Enable / disable receive interrupts

• Send and receive are atomic (single instruction)
  – Send completes when message is copied to send buffer
  – Receive blocks if buffer is empty
  – Peek doesn't block, enables polling
• ADM unit generates a user-level interrupt on the running thread when a message is received
  – No stack switching, handler code partially saves context (used registers) → fast
  – Interrupts can be disabled to preserve atomicity w.r.t. message reception
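The send, peek, and receive semantics above can be mimicked in software to make them concrete. The sketch below is a hypothetical model, not the hardware design: `AdmUnit` and its method names are invented for illustration, a Python dict stands in for the Thread ID Translation Buffer, and the send buffer, acknowledgements, and user-level interrupts are left out.

```python
import threading
from collections import deque

MAX_WORDS = 6  # the ISA table gives a message length of 0-6 words


class AdmUnit:
    """Software stand-in for one thread's ADM receive buffer (hypothetical model)."""

    def __init__(self, tid, network):
        self.tid = tid
        self.network = network  # assumed lookup table: thread ID -> AdmUnit
        self._rx = deque()
        self._cv = threading.Condition()
        network[tid] = self

    def send(self, words, dest_tid):
        # Models adm_send: deliver a short message to the destination thread.
        if len(words) > MAX_WORDS:
            raise ValueError("ADM messages carry at most 6 words")
        dest = self.network[dest_tid]
        with dest._cv:
            dest._rx.append((self.tid, tuple(words)))
            dest._cv.notify()

    def peek(self):
        # Models adm_peek: never blocks; returns (source, length) or None if empty.
        with self._cv:
            if not self._rx:
                return None
            src, words = self._rx[0]
            return src, len(words)

    def receive(self):
        # Models adm_rx: blocks while the buffer is empty, then dequeues the head.
        with self._cv:
            while not self._rx:
                self._cv.wait()
            return self._rx.popleft()


network = {}
t0 = AdmUnit(0, network)
t1 = AdmUnit(1, network)

empty_peek = t1.peek()        # nothing queued yet, so polling sees None
t0.send([42, 7], dest_tid=1)  # two-word message from thread 0 to thread 1
head = t1.peek()              # (0, 2): source thread and length, not dequeued
message = t1.receive()        # (0, (42, 7)): dequeues the message
```

Polling with `peek` before calling `receive` avoids blocking, which matches the role the slide gives to adm_peek.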

Page 13

Outline

• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation

Page 14

ADM Schedulers

• Message-passing schedulers
• Replace parallel runtime's (e.g. TBB) scheduler
  – Application programmer is oblivious to this
• Threads can perform two roles:
  – Worker: Execute parallel phase, enqueue & dequeue tasks
  – Manager: Coordinate task stealing & parallel phase termination
• Centralized scheduler: Single manager coordinates all

[Figure: manager T0 above workers T0, T1, T2, T3; callout: "T0 is manager and worker!"]

Page 15

Centralized Scheduler: Updates

[Figure: manager T0 holding approximate task counts; workers T0, T1, T2, T3 with task queues send UPDATE <4> and UPDATE <8> messages to the manager]

• Manager keeps approximate task counts of each worker
• Workers only notify manager at exponential thresholds
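A minimal sketch of the update rule follows, assuming the exponential thresholds are 0 and the powers of two. That assumption is consistent with the UPDATE <4> and UPDATE <8> messages in the figure, but the slides do not state the exact thresholds. `Manager`, `Worker`, and `is_threshold` are hypothetical names, and a direct method call stands in for an ADM message.

```python
def is_threshold(n):
    # Assumed thresholds: 0 and the powers of two (1, 2, 4, 8, ...).
    return (n & (n - 1)) == 0


class Manager:
    """Keeps an approximate task count per worker, refreshed only by UPDATE messages."""

    def __init__(self, num_workers):
        self.approx_counts = [0] * num_workers
        self.updates_received = 0

    def on_update(self, worker_id, count):
        self.approx_counts[worker_id] = count
        self.updates_received += 1


class Worker:
    """Notifies the manager only when its queue length crosses a threshold."""

    def __init__(self, wid, manager):
        self.wid = wid
        self.manager = manager
        self.tasks = []

    def _maybe_notify(self):
        n = len(self.tasks)
        if is_threshold(n):
            self.manager.on_update(self.wid, n)  # stands in for an UPDATE <n> message

    def enqueue(self, task):
        self.tasks.append(task)
        self._maybe_notify()

    def dequeue(self):
        task = self.tasks.pop()
        self._maybe_notify()
        return task


manager = Manager(num_workers=1)
worker = Worker(0, manager)
for i in range(10):
    worker.enqueue(i)

# Ten enqueues produce updates only at queue lengths 1, 2, 4, and 8.
updates = manager.updates_received  # 4
approx = manager.approx_counts[0]   # 8, while the real queue length is 10
```

The manager's view lags the true count, but the number of messages grows only logarithmically with queue length, which is the trade-off the slide points at.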

Page 16

Centralized Scheduler: Steals

[Figure: manager T0 and workers T0, T1, T2, T3 with task queues; messages labeled UPDATE <1>, STEAL_REQ <T1->T2, 1>, and TASK]

• Manager requests a steal from the worker with most tasks
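The manager's steal decision can be sketched as below. This is an illustrative guess at the control flow, not the authors' implementation: `pick_victim` and `handle_starved_worker` are hypothetical names, the tuple layout of the steal request is invented, and direct queue manipulation stands in for the STEAL_REQ and TASK messages.

```python
from collections import deque


def pick_victim(approx_counts, thief):
    """Return the worker with the highest approximate task count, or None."""
    candidates = [w for w in approx_counts if w != thief and approx_counts[w] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda w: approx_counts[w])


def handle_starved_worker(approx_counts, queues, thief, num_tasks=1):
    """Manager logic: choose a victim, then the victim ships tasks to the thief."""
    victim = pick_victim(approx_counts, thief)
    if victim is None:
        return None
    steal_req = ("STEAL_REQ", victim, thief, num_tasks)  # manager -> victim
    for _ in range(min(num_tasks, len(queues[victim]))):
        # victim -> thief: stands in for a TASK message
        queues[thief].append(queues[victim].popleft())
    return steal_req


# The manager's counts are approximate: worker 1 really holds three tasks.
queues = {0: deque(), 1: deque(["a", "b", "c"]), 2: deque(), 3: deque(["d"])}
approx_counts = {0: 0, 1: 2, 2: 0, 3: 1}

request = handle_starved_worker(approx_counts, queues, thief=2)
```

Because the manager sees all counts, it can direct the steal at the most loaded worker instead of probing victims at random.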

Page 17

Hierarchical Scheduler

• Centralized scheduler:
  – Does all communication through messages
  – Enables directed stealing, task prefetching
  – Does not scale beyond ~16 threads
• Solution: Hierarchical scheduler
  – Workers and managers form a tree

[Figure: tree with 2nd-level manager T1, 1st-level managers T0 and T4, and workers T0 through T7]

Page 18

Hierarchical Scheduler: Steals

[Figure: tree of task counts at the 2nd-level manager, 1st-level managers, and workers; steal messages labeled TASK, TASK (x2), and TASK (x4)]

• Steals can span multiple levels
  – A single steal rebalances two partitions at once
  – Scales to hundreds of threads

Page 19

Outline

• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation

Page 20

Evaluation

• Simulated machine: Tiled CMP
  – 32, 64, 128 in-order dual-thread SPARC cores (64 – 256 threads)
  – 3-level cache hierarchy, directory coherence
• Benchmarks:
  – Loop-parallel: canneal, cg, gtfold
  – Task-parallel: maxflow, mergesort, ced, hashjoin
  – Focus on representative subset of results, see paper for full set

[Figure: CMP tile; 64-core, 16-tile CMP]

Page 21

Results

[Figure: charts with legend App, Queues, Stealing, Starved]

• SW scalability limited by scheduling overheads
• Carbon and ADM: Small overheads that scale
• ADM matches Carbon → No need for HW scheduler

Page 22

Flexible policies: gtfold case study

• In gtfold, FIFO queues allow tasks to clear critical dependences faster
  – FIFO queues trivial in SW and ADM
  – Carbon (HW) stuck with LIFO

[Figure: chart with bars labeled 31x and 26x]

• ADM achieves 40x speedup over Carbon
• Can't implement all scheduling policies in HW!
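To illustrate why a FIFO policy is trivial when queues live in software, here is a minimal sketch of a task queue whose dequeue order is a constructor argument. `TaskQueue` and its `policy` parameter are hypothetical and not taken from the slides or from any runtime's API.

```python
from collections import deque


class TaskQueue:
    """Software task queue whose dequeue order is chosen at construction time."""

    def __init__(self, policy="LIFO"):
        if policy not in ("LIFO", "FIFO"):
            raise ValueError("policy must be 'LIFO' or 'FIFO'")
        self.policy = policy
        self._tasks = deque()

    def enqueue(self, task):
        self._tasks.append(task)

    def dequeue(self):
        if not self._tasks:
            return None
        # LIFO takes the newest task; FIFO takes the oldest.
        return self._tasks.pop() if self.policy == "LIFO" else self._tasks.popleft()


def drain(policy):
    """Enqueue three tasks and return them in dequeue order."""
    q = TaskQueue(policy)
    for name in ("a", "b", "c"):
        q.enqueue(name)
    return [q.dequeue() for _ in range(3)]


lifo_order = drain("LIFO")  # ['c', 'b', 'a']
fifo_order = drain("FIFO")  # ['a', 'b', 'c']
```

Switching policies here is a one-argument change, whereas a fixed hardware LIFO queue offers no equivalent knob.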