Parallel Real-Time Scheduling from Theory to Implementation

Jing Li
Department of Computer Science, Ying Wu College of Computing
New Jersey Institute of Technology


Page 2: Outline

- Why do we demand parallelism?
- My research: exploiting parallelism in real-time systems
- Two examples of my research

Page 3: Future of Computing Performance

[Figure: 40 Years of Microprocessor Trend Data. Frequency (MHz) on a log scale from 10^0 to 10^4, plotted against year, 1970 to 2020.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

[Image: the fastest sloth working at the DMV (from Zootopia).]

Page 4: Power Density

- Clock frequency hits the power wall.

Page 5: Multi-Core Systems

[Figure: 40 Years of Microprocessor Trend Data. Number of logical cores and frequency (MHz) on a log scale from 10^0 to 10^4, plotted against year, 1970 to 2020.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

Page 6: Multi-Core = Parallelism?

- Concurrent execution of jobs

[Figure: jobs (Job 1, Job 2, Job 3) on the cores of a single multi-core machine.]

- Parallel execution of a job

[Figure: jobs on the cores of a single multi-core machine.]

Page 7: Lane Keeping Assist System in Cars

Since the system interacts with the physical world, its computation must be completed under a time constraint.

Page 8: Applications Benefit from Intra-Job Parallelism

- Motion planning program in a self-driving car

[Figure from Kim et al. [ICCPS13].]

Page 9: Cyber-Physical Systems (CPS)

CPS are built from, and depend upon, the seamless integration of computational algorithms and physical components.

E.g., robotics, drones, autonomous vehicles, etc.

Page 10: Applications Benefit from Intra-Job Parallelism

- Web search needs to respond within 100 ms for users to find it responsive.*

[Figure: "Search the web" pipeline with stages labeled Query, Doc. index search, 2nd phase ranking, Snippet generator, and Response.]

* Jeff Dean et al. (Google), "The tail at scale," Communications of the ACM 56.2 (2013).

Page 11: Interactive Cloud Services (ICS)

E.g., web search, online gaming, stock trading, etc.

Page 12: Real-Time Systems

The performance of the systems depends not only upon their functional aspects, but also upon their temporal aspects.

Real-time performance:

1) Provide hard guarantees of meeting jobs' deadlines (e.g., CPS)
2) Optimize latency-related objectives for jobs (e.g., ICS)

[Figure: jobs (Job 1, Job 2, Job 3) on the cores of a single multi-core machine.]

Page 13: New Generation of Real-Time Systems

Characteristics:

- New classes of applications with complex functionalities
- Increasing computational demand of each application
- Consolidating multiple applications onto a shared platform
- Rapid increase in the number of cores per chip

Demand: leverage parallelism within the applications to improve real-time performance and system efficiency

[Figure: jobs (Job 1, Job 2, Job 3) on the cores of a single multi-core machine.]

My research: parallel real-time scheduling from theory to implementation

Page 14: Outline

- Why do we demand parallelism?
- My research: exploiting parallelism in real-time systems
- Example 1: parallel real-time scheduling for meeting deadlines
- Example 2: parallel real-time scheduling for a target latency

Page 15: Real-Time Hybrid Simulation (RTHS)

Since the numerical simulation interacts with the physical specimen, its computation must be completed by its deadline.

[Figure: numerical simulation and physical specimen, separated by the cyber-physical boundary. Robert L. and Terry L. Bowen Large Scale Structures Laboratory at Purdue University.]

Page 16: How to Allocate Cores to Multiple Parallel Jobs?

- A real-world system consists of many parallel jobs
- Each job demands different real-time performance
  - E.g., jobs need to meet their deadlines in RTHS
- The goal of parallel real-time scheduling: smartly allocate cores to parallel jobs to meet their deadlines

[Figure: jobs on the cores of a single multi-core machine.]

Page 17: How to Analyze a Parallel Job?

- An example: computation on an array A with m elements (the figure shows m = 3)

    // The m iterations are independent and may run in parallel.
    parallel_for (int i = 0; i < m; i++) {
        compute(A[i]);
    }
    // bar() runs only after every iteration has finished.
    bar();

Page 18: Directed Acyclic Graph (DAG) Model

Naturally captures programs generated by parallel languages such as Cilk Plus, Intel TBB, and OpenMP.

Node: sequential computation

Edge: dependence between nodes

Unit node: single instruction

Page 19: Directed Acyclic Graph (DAG) Model

[Figure legend: available node, completed node, unavailable node.]
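The three node states in the legend can be tracked with a small data structure. The following is a minimal sketch, assuming a node becomes available once all of its predecessors have completed. The names `Dag`, `NodeState`, and `make_parallel_for_dag` are hypothetical, and modeling the parallel loop from Page 17 with a separate loop-entry node is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; an illustrative sketch, not code from the talk.
enum class NodeState { Unavailable, Available, Completed };

struct Dag {
    // successors[u] lists the nodes that depend on u (edges u -> v).
    std::vector<std::vector<std::size_t>> successors;
    std::vector<std::size_t> pending;  // unfinished predecessors per node
    std::vector<NodeState> state;

    // Nodes start out available; adding an incoming edge makes them unavailable.
    explicit Dag(std::size_t n)
        : successors(n), pending(n, 0), state(n, NodeState::Available) {}

    // Add a dependence edge: `to` cannot start until `from` completes.
    // All edges are expected to be added before any node is completed.
    void add_edge(std::size_t from, std::size_t to) {
        assert(from < successors.size() && to < successors.size());
        successors[from].push_back(to);
        ++pending[to];
        state[to] = NodeState::Unavailable;
    }

    // Complete an available node and release every successor whose
    // predecessors have now all finished.
    void complete(std::size_t u) {
        assert(state[u] == NodeState::Available);
        state[u] = NodeState::Completed;
        for (std::size_t v : successors[u]) {
            if (--pending[v] == 0) state[v] = NodeState::Available;
        }
    }
};

// DAG of a parallel loop with m iterations followed by bar() (assumed shape):
// node 0 = loop entry, nodes 1..m = compute(A[i]), node m+1 = bar().
Dag make_parallel_for_dag(std::size_t m) {
    Dag g(m + 2);
    for (std::size_t i = 1; i <= m; ++i) {
        g.add_edge(0, i);      // each iteration waits for the loop entry
        g.add_edge(i, m + 1);  // bar() waits for every iteration
    }
    return g;
}
```

For m = 3, completing node 0 makes the three iteration nodes available at once, while the `bar()` node stays unavailable until all three have completed.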


Page 23: Work and Span

Work $T_1$: execution time on one core

Span $T_\infty$: execution time on $\infty$ cores (critical-path length)

Example DAG in the figure: $T_1 = 18$, $T_\infty = 9$
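Work and span can be computed directly from a DAG whose nodes carry execution times. The sketch below is one way to do it, assuming non-negative node costs: work is the sum of all costs, and span is the longest path, found by visiting nodes in topological order. The struct and function names are hypothetical, and the small graph in the usage note is not the DAG from the slide's figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch; the struct and function names are assumptions.
struct WeightedDag {
    std::vector<long> cost;                      // execution time of each node
    std::vector<std::vector<std::size_t>> succ;  // succ[u] = nodes that depend on u
};

// Work T_1: total execution time of all nodes (the time on one core).
long work(const WeightedDag& g) {
    long total = 0;
    for (long c : g.cost) total += c;
    return total;
}

// Span T_inf: length of the longest path through the DAG, computed by
// processing nodes in topological order (Kahn's algorithm).
long span(const WeightedDag& g) {
    const std::size_t n = g.cost.size();
    std::vector<std::size_t> indeg(n, 0);
    for (const auto& out : g.succ)
        for (std::size_t v : out) ++indeg[v];

    std::vector<long> finish(n, 0);  // longest path ending at each node, inclusive
    std::vector<std::size_t> ready;
    for (std::size_t u = 0; u < n; ++u) {
        if (indeg[u] == 0) {
            finish[u] = g.cost[u];
            ready.push_back(u);
        }
    }

    long longest = 0;
    std::size_t visited = 0;
    while (!ready.empty()) {
        const std::size_t u = ready.back();
        ready.pop_back();
        ++visited;
        longest = std::max(longest, finish[u]);
        for (std::size_t v : g.succ[u]) {
            finish[v] = std::max(finish[v], finish[u] + g.cost[v]);
            if (--indeg[v] == 0) ready.push_back(v);
        }
    }
    assert(visited == n && "graph must be acyclic");
    return longest;
}
```

As a usage example, a loop-entry node feeding three unit-cost iterations that all feed one final node has work 5 and span 3: every node counts toward the work, but only one iteration lies on any single path.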

Page 24: Federated Scheduling

For parallel tasks, FS has the best bound in terms of schedulability.

FS assigns $n_i$ dedicated cores to each parallel task, where $n_i$ is the minimum number of cores needed for the task to meet its deadline:

$$n_i = \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil$$

Here $D_i$ is the deadline (equal to the period), $L_i$ is the worst-case span, and $C_i$ is the worst-case work.

[Figure: tasks assigned to dedicated cores.]
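The core assignment can be worked through numerically. The sketch below assumes the result is rounded up to a whole number of cores, n_i = ceil((C_i - L_i) / (D_i - L_i)), and clamps it to at least one core when the formula yields zero; that clamp is an assumption of this sketch, not something stated on the slide. The check `completion_bound` uses the classic greedy-scheduling bound L_i + (C_i - L_i) / n, which is brought in here for illustration. All names are hypothetical.

```cpp
#include <cassert>

// Illustrative sketch; the struct and function names are assumptions.
struct ParallelTask {
    long work;      // C_i: worst-case work
    long span;      // L_i: worst-case span (critical-path length)
    long deadline;  // D_i: relative deadline (equal to the period)
};

// Dedicated cores for one task: n_i = ceil((C_i - L_i) / (D_i - L_i)).
// Requires L_i < D_i; otherwise no number of cores can meet the deadline.
long dedicated_cores(const ParallelTask& t) {
    assert(t.span > 0 && t.span <= t.work && t.span < t.deadline);
    const long num = t.work - t.span;
    const long den = t.deadline - t.span;
    const long n = (num + den - 1) / den;  // integer ceiling for num >= 0, den > 0
    return n > 0 ? n : 1;                  // assumption: a sequential task still gets one core
}

// Greedy-scheduler bound on the completion time with n cores:
// L_i + (C_i - L_i) / n.
double completion_bound(const ParallelTask& t, long n) {
    assert(n > 0);
    return static_cast<double>(t.span) +
           static_cast<double>(t.work - t.span) / static_cast<double>(n);
}
```

With the work and span values from the Work and Span slide (18 and 9) and a hypothetical deadline of 12, the sketch gives ceil(9 / 3) = 3 cores, and the bound 9 + 9 / 3 = 12 meets the deadline exactly. With only 2 cores the bound would be 13.5, which exceeds it.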

Page 25: FS Platform

- Middleware platform providing FS service in Linux
- Works with the GNU OpenMP runtime system
- Runs OpenMP programs with minimal modification

[Figure: three OpenMP Runtime instances running on top of Federated Scheduling (FS), which runs on Linux and the cores.]

Page 26: Parallelism Improves RTHS Accuracy

An RTHS simulates a nine-story building with a first-story damper.

- Previously, sequential processing power limited the rate to 575 Hz
- Parallel execution now allows a rate of 3000 Hz
- Reduction in error for acceleration and displacement
- Parallelism increases accuracy via faster actuation and sensing

[Figure: Normalized Error (%) versus Time (sec) for Sequential (575 Hz) and Parallel (3000 Hz).]

Page 27: Outline

- Why do we demand parallelism?
- My research: exploiting parallelism in real-time systems
- Example 1: parallel real-time scheduling for meeting deadlines
- Example 2: parallel real-time scheduling for a target latency

Page 28: System for Interactive Cloud Services

Online system: job arrival times are not known in advance

Objective: maximize the number of jobs that meet a target latency T

[Figure: web search system with components labeled Query, Doc. index search, 2nd phase ranking, Snippet generator, and Aggregator.]

Page 29: Workload Distribution Has a Long Tail

[Figure: Bing search workload. Distribution of job sequential execution time (ms), i.e., work, with the target latency marked.]

- Large jobs must run in parallel to meet target latency
- Always run large jobs in full parallelism?

Page 30: Parallelize Large Jobs According to Load

Tail-Control Strategy: when load is low, run all jobs in parallel; when load is high, run large jobs sequentially.

Latency = Processing Time + Waiting Time

At low load: processing time dominates latency

At high load: waiting time dominates latency
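The strategy can be written as a small decision rule. This is only a sketch of the idea as stated above, not the tail-control algorithm implemented in the talk: how load and job size are measured, and both thresholds, are hypothetical choices made here for illustration.

```cpp
#include <cassert>

// Illustrative sketch only; the names and both thresholds are hypothetical
// tuning knobs, not values from the talk.
enum class Mode { Parallel, Sequential };

struct TailControlPolicy {
    int high_load_jobs;        // active-job count at or above which load counts as high
    double large_job_work_ms;  // work (sequential execution time) at or above which a job counts as large

    // Low load: every job runs in parallel.
    // High load: large jobs run sequentially, other jobs still run in parallel.
    Mode choose(int active_jobs, double work_ms) const {
        assert(active_jobs >= 0 && work_ms >= 0.0);
        const bool high_load = active_jobs >= high_load_jobs;
        const bool large_job = work_ms >= large_job_work_ms;
        return (high_load && large_job) ? Mode::Sequential : Mode::Parallel;
    }
};

// Latency = processing time + waiting time.
double latency_ms(double processing_ms, double waiting_ms) {
    return processing_ms + waiting_ms;
}

// A job counts toward the objective if its latency is within the target.
bool meets_target(double processing_ms, double waiting_ms, double target_ms) {
    return latency_ms(processing_ms, waiting_ms) <= target_ms;
}
```

For example, with hypothetical thresholds of 8 active jobs and 50 ms of work, a 200 ms job runs in parallel when 2 jobs are active but sequentially when 10 are. Running it sequentially under high load leaves cores free for the other waiting jobs, whose latency is dominated by waiting time.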

[Figure: two schedules on core1, core2, and core3 over time, each with the target latency marked; one schedule misses 0 requests and the other misses 1 request.]

Page 31: The Inner Workings of Tail-Control

We implement the tail-control algorithm in the runtime system of Intel Threading Building Blocks and evaluate it on a Bing search workload.

Page 32: The Inner Workings of Tail-Control

[Figure labels: Target Latency; default work-stealing; ≥.]


Page 35: Conclusion

Exploit the untapped efficiency in parallel computing platforms and drastically improve the real-time performance of applications.

- System Guaranteed to Meet Deadlines for CPS
  - Develop provably good schedulers for parallel applications
  - Incorporate real-time scheduling into parallel runtime system
- System Optimized to Meet Target Latency for ICS
  - Design and implement strategy to optimize real-time performance