Scenario-Oriented Design for Single Chip Heterogeneous Multiprocessors

JoAnn M. Paul, Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA

Presented by: Mohammad Farsakh

What is this paper about

The challenges of selecting, programming, and coordinating processing elements (PEs) of heterogeneous types in future single chip computer designs.

The differences between next-generation single chip system designs and traditional designs.

The Scenario-Oriented Design (SOD) strategy:
– Applications, schedulers, and hardware are viewed as a system, leveraging one against the other.
– The modeling detail of each design domain within the system is reduced in high-level simulation.

Introduction

The design process for digital computation involves three categories: models, tools, and strategies.

Existing models, tools and strategies are failing to permit designers of single chip computers to efficiently capture the design space at hand.

– Tools do not capture software as part of the system model.
– Instruction Set Simulators (ISSs) are too detailed and too slow to capture systems with many processors.
– Designers are left to their own devices, which limits the effective realization of many potentially significant designs.

Models, tools and strategies for both the software and the hardware of single chip heterogeneous multiprocessors are required.

Introduction (cont)

In the future, individual programmable processors will be like registers in a larger framework.

– Processor blocks will be differentiated by:
  – The capabilities of the hardware in the processors
  – The way they are programmed
  – Their manner of interconnection

A design should maximize the ratio DQ/DE for a given design to initiate the next design level, where DQ = Design Quality and DE = Development Effort.

A common basis for design at the modeling level is required to manipulate the design decisions that have the most impact.
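As a rough illustration of the DQ/DE ratio, the sketch below ranks a few candidate modeling approaches by design quality per unit of development effort. The `DesignOption` class, the option names, and all numeric values are made-up placeholders for illustration, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignOption:
    """A candidate design decision (hypothetical; values are illustrative)."""
    name: str
    quality: float  # DQ, on an arbitrary made-up scale
    effort: float   # DE, e.g. person-months (assumed unit)

    @property
    def ratio(self) -> float:
        """Return DQ/DE for this option."""
        if self.effort <= 0:
            raise ValueError("development effort must be positive")
        return self.quality / self.effort


def best_option(options):
    """Pick the option with the highest DQ/DE ratio."""
    return max(options, key=lambda option: option.ratio)


# Illustrative numbers only.
options = [
    DesignOption("detailed ISS model", quality=9.0, effort=12.0),
    DesignOption("high-level scheduler model", quality=7.0, effort=4.0),
    DesignOption("hand analysis", quality=3.0, effort=2.0),
]
chosen = best_option(options)
print(chosen.name, chosen.ratio)
```

With these placeholder values, the option with the highest absolute quality is not the one with the best ratio, which is the trade-off the ratio is meant to expose.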

Programmable Heterogeneous Multiprocessors (PHMs)

Collections of PEs must be considered programmable: the chip is a programmable collection of processors, grouped dynamically.

Heterogeneous multiprocessors on a single chip pose different design challenges for three primary reasons:

– A single chip is a finite resource, unlike wide-area networks.
– The design will be semi-custom.
  – The underlying hardware is more customized to the application space than in a traditional programmable system.
  – Traditional heterogeneous multiprocessors provide transaction-like services on a diverse collection of resources.
  – Single chip devices such as SoCs are customized to meet fixed latency requirements as a reactive system.
  – PHMs will be semi-custom and have aspects of both design styles.
– Coordination of system resources is required.
  – The large differential between on-chip and off-chip communications will force efficient utilization and management of on-chip system resources, including processing elements, memory, communications bandwidth and chip I/O.

Design Environment of Single Chip PHM

H: the single chip heterogeneous multiprocessor.

Data inputs
– DP: time-stamped system inputs that are conceptually presented to the system hardware on I/O pins.
– DM: data values that reside in some external memory, analogous to jobs, packets or other requests in a queue waiting to be processed by H.

Programs
– BC: clocked benchmarks, programs whose required latency is specified relative to a fixed time reference. They are designed to meet the worst-case demands presented to the system by DP, so these programs have fixed performance requirements.
– BI: programmatic-input benchmarks, for which performance is calculated from the internal timing of the processing capabilities of the design. These run over many PEs.
– BX: scheduler programs, which act as a means of resolving the other benchmarks to the architecture.

Design Output

A single output, Q, is the quality metric of the design, including the performance for the two classes of behaviors (BC and BI).

The general form of such an environment is E = {D, B, H, Q}.

In the case of E = {BC, DP, Q} (Hardware Description Language, HDL):
– Pass/fail quality metric.
– Fully specified by DP and BC, not a separately performance-evaluated architecture.

In the case of E = {BC, BX, DP, H, Q}, where H is a single processor (RTOS):
– Pass/fail quality metric.
– The kind of analytical modeling typical of research in real-time operating systems (RTOSs).

In the case of E = {BI, DM, H, Q} (simple ISS):
– H is a single processor executing at the instruction set simulator level or below.
– Typical of simulators such as SimpleScalar, used to model a microarchitecture or ISA.

Complexity of the application space
– The current-day approach, ISS, cannot permit effective exploration of the design space.
– A complete level of detail is required in the model.
– It takes a long time to generate any single value of Q.
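The environment E = {D, B, H, Q} and its special cases can be sketched as a small data structure. This is only an illustrative encoding: the `Environment` class, the `classify` function, and the returned label strings are assumptions made here, not notation from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """E = {D, B, H, Q}; Q is the output, so only the inputs are stored."""
    data: frozenset      # subset of {"DP", "DM"}
    programs: frozenset  # subset of {"BC", "BI", "BX"}
    hardware: bool       # True when H is modeled as a separate architecture


def classify(env: Environment) -> str:
    """Map an environment to the special case it resembles (labels are ours)."""
    d, b = set(env.data), set(env.programs)
    if d == {"DP"} and b == {"BC"} and not env.hardware:
        return "HDL-style, pass/fail Q"
    if d == {"DP"} and b == {"BC", "BX"} and env.hardware:
        return "RTOS-style, pass/fail Q"
    if d == {"DM"} and b == {"BI"} and env.hardware:
        return "ISS-style"
    if d == {"DP", "DM"} and {"BC", "BI"} <= b and env.hardware:
        return "mixed BC/BI single chip PHM"
    return "other"


hdl = Environment(frozenset({"DP"}), frozenset({"BC"}), hardware=False)
rtos = Environment(frozenset({"DP"}), frozenset({"BC", "BX"}), hardware=True)
iss = Environment(frozenset({"DM"}), frozenset({"BI"}), hardware=True)
phm = Environment(frozenset({"DP", "DM"}),
                  frozenset({"BC", "BI", "BX"}), hardware=True)

for env in (hdl, rtos, iss, phm):
    print(classify(env))
```

The last case, with both kinds of data input and both kinds of benchmark, is the one that none of the three existing modeling styles covers on its own.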

Scenario-Oriented Design

A novel design strategy that:
– Orients heterogeneous multiprocessor single chip design according to a blend of performance requirements, implemented in new chip-wide programmer's views.
– Leverages increased heterogeneity in the future application space.
– Results in greater efficiency in the design process and in resource utilization.

Fixed performance (FP)

To meet the requirements of a system with DP and BC, current systems must be overdesigned, for two reasons:

– Capacity of system resources is wasted, along with the time taken to match functionality to available processing power, in order to make sure that the worst-case (WC) behavior is met.
– Irregular loading and data-dependent processing times leave processing resources underutilized except in peak, WC loading situations.

Throughput performance (TP)

BI is designed to be a broad, representative set of program types used to evaluate and optimize a programmable device's throughput performance (TP).

The goal is to optimize a common case (CC) instead of ensuring that WC behaviors are met.

Example: network switches drop packets, which are presumed to be resent.

The same approach applies to caches, branch predictors, and OS scheduling strategies.
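The contrast between sizing for the worst case and sizing for the common case can be shown with a toy calculation. The load trace, the capacity values, and the function names below are hypothetical; the sketch assumes that work above capacity in a period is simply not served in that period (dropped or deferred).

```python
from statistics import mean


def wc_capacity(loads):
    """Capacity needed so that even the worst-case period is fully served."""
    return max(loads)


def utilization(loads, capacity):
    """Average fraction of capacity actually used; excess work is not served."""
    served = [min(load, capacity) for load in loads]
    return mean(served) / capacity


def overload_fraction(loads, capacity):
    """Fraction of periods in which demand exceeds capacity."""
    return sum(1 for load in loads if load > capacity) / len(loads)


# Hypothetical work units arriving per period, with one peak.
loads = [2, 3, 2, 10, 3, 2, 3, 2]

wc = wc_capacity(loads)  # sized for the peak period
cc = 3                   # sized for the common case (assumed choice)

print("WC utilization:", utilization(loads, wc))
print("CC utilization:", utilization(loads, cc))
print("CC overloaded periods:", overload_fraction(loads, cc))
```

Sizing for the peak leaves most of the capacity idle in ordinary periods, while sizing for the common case keeps utilization high at the cost of occasionally not meeting demand, which is acceptable only for TP-style behaviors.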

Future vs. Current Designs

The two design strategies are worlds apart:
– Worst case (WC) with fixed performance (FP)
– Common case (CC) with throughput performance (TP)

Future single chip designs
– Execute a mix of BC and BI to handle a mix of DP and DM
– FP behaviors are met
– CC behaviors are optimized

Currently, FP-oriented and TP-oriented designs are separated into different devices:
– General purpose programming resides on the general purpose processor.
– Other processors use individual RTOSs to ensure WC behaviors are met, or WC behaviors are ensured by implementation in custom hardware.

Layered, SOD approach to SoC Design

SOD can satisfy performance for FP functionality and provide a basis for a TP-optimized remainder architecture.

Hardware architecture and a remainder architecture are co-designed.

Map the FP functionality across the entire chip, consuming part of the proposed architecture.

Leverage the presence of both classes of behavior to:
– Optimize design time
– Optimize design quality

Measuring exact execution times for FP is not required at the start of design.

SoC Hardware View

Different Processing Elements (PE)

Different functionality

Common communication channel

SoC With Remainder Architecture

Software partitioning
– Each PE is divided into two parts: PE-i = {F-i, R-i}
– COMM = {F-COMM, R-COMM}
– The functional overlay, {F-i}, maps BC to processing resources.
– The remainder architecture, R = {R-i, R-COMM}, carries BI.
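One way to picture the split of each PE into a functional part and a remainder part is as a capacity budget. The sketch below is an assumption-laden toy: the PE names, the capacity numbers, their unit, and the `remainder` function are all hypothetical, and real partitioning would be done by schedulers rather than by simple subtraction.

```python
def remainder(capacities, fp_overlay):
    """Return R-i for each PE: the capacity left after the FP overlay F-i.

    capacities: total capacity per PE (arbitrary, assumed units)
    fp_overlay: capacity consumed on each PE by fixed-performance work (F-i)
    """
    result = {}
    for pe, capacity in capacities.items():
        used = fp_overlay.get(pe, 0)
        if used > capacity:
            raise ValueError(f"FP overlay exceeds capacity of {pe}")
        result[pe] = capacity - used
    return result


# Hypothetical chip with three heterogeneous PEs.
capacities = {"PE1": 100, "PE2": 200, "PE3": 50}
fp_overlay = {"PE1": 60, "PE2": 50}  # PE3 carries no FP functionality

r = remainder(capacities, fp_overlay)
print(r)
print("capacity available for BI:", sum(r.values()))
```

The remainder spans every PE, including those partly consumed by FP work, which is why it is treated as an architecture in its own right rather than as a set of leftover processors.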

Layered, SOD approach to SoC Design

A new layer between R-i and F-i
– Enlarges the boundary between performance group partitions
– Reduces design time
  – The FP mapped to the chip need not be known beforehand
  – Optimizes TP

SOD partitioning produces a chip-wide, horizontal view:
– Hardware resources in the bottom layer
– Schedulers in the middle layer (permit the co-operation of ….)
– General software at the top layer
– The last two layers could have multiple internal layers

The layering concept leverages schedulers as a basis for a soft partitioning of a hardware design.

Simulation Foundation — MESH

The Modeling Environment for Software and Hardware (MESH) is a simulator suited to this design strategy. It can:

– Provide a layered modeling basis above ISS models.
– Use schedulers to model concurrent, high-level software running on high-level models of processor resources.
– Resolve timing through design layers, where unrestricted software executes on hardware models without relying upon an ISS.

Modeling Environment for Software and Hardware (MESH)

ThLij — One of j logical threads (software) that will execute on processor i.

ThPi — A model of the ith physical resource in the system, such as a processor.

UPi — A scheduler that selects logical threads intended to execute on resource ThPi.

ULi — A logical scheduler that can schedule M threads to N resources. e.g., a pthread scheduler

How MESH Works

– A dynamic number of logical threads is supported.
– Execution is scheduled onto a single resource.
– Scheduling decisions are based on the state of the threads and other system state.
– The scheduler resolves the logical events of the software threads to physical timing.

Schedulers serve two roles:
– Modeling scheduling decisions
– Resolving logical computation to physical time

Complex systems have many resources (ThPi), which gives two dimensions of scheduling:
– Based on physical time
– Based on logical state

M threads may be dynamically mapped to N resources.

Simulation is 2.5 times faster than an internal ISS-level simulator.
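The two scheduler roles, choosing which logical thread runs and resolving its logical events to physical time, can be sketched in a few lines. This is not the MESH API: the class and function names, the round-robin policy, and the "work units per tick" speed model are all assumptions made for illustration.

```python
from collections import deque


def logical_thread(work_items):
    """A logical thread (ThL role): yields abstract work units per logical event."""
    for units in work_items:
        yield units


class PhysicalResource:
    """A physical resource model (ThP role) with an assumed speed."""

    def __init__(self, name, units_per_tick):
        self.name = name
        self.units_per_tick = units_per_tick
        self.time = 0.0  # physical time accumulated on this resource


def run_scheduler(resource, threads):
    """Round-robin scheduler (UP role) for one resource.

    Picks the next ready thread, runs one logical event, and advances
    physical time by the event's work divided by the resource speed.
    Returns the physical time at which each thread's last event completed.
    """
    ready = deque(threads.items())
    finish = {}
    while ready:
        name, thread = ready.popleft()
        units = next(thread, None)
        if units is None:
            continue  # thread has no more logical events
        resource.time += units / resource.units_per_tick
        finish[name] = resource.time
        ready.append((name, thread))
    return finish


cpu = PhysicalResource("ThP0", units_per_tick=2)
threads = {"A": logical_thread([4, 4]), "B": logical_thread([2])}
finish = run_scheduler(cpu, threads)
print(finish)
```

No instruction-level detail appears anywhere in the sketch: the threads only announce how much work each logical event represents, and the scheduler turns that into time. Mapping M threads onto N resources would add a second scheduler that assigns threads to resources, in the spirit of the logical scheduler UL.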

Conclusion

Challenges for future designs: performance, power, and chip size.

Future computer designs should be evaluated as a system.

SOD is a strategy that results from considering applications, schedulers, and hardware as they interact to form a system:
– Leveraging each against the other
– Reducing modeling detail

Thanks
