Tera MTA (Multi-Threaded Architecture) Thriveni Movva (CMPS 5433)

Upload: randell-lamb

Post on 27-Dec-2015

Tera MTA (Multi-Threaded Architecture)

Thriveni Movva

(CMPS 5433)

Presentation Contains

Evolution of Tera MTA

Design goals of Tera MTA

Tera MTA Architecture

Interconnection Network

Applications

Advantages & Drawbacks

Current MTA Status

Evolution Of Tera MTA

1987: Tera Computer Company was established by Burton Smith in Washington, USA

1988: Software development starts

1991: Hardware development starts

1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

Tera MTA: Design Goals

To solve the two major problems then faced by high-performance parallel computers:
• Scalability
• Programmability

To be suitable for very high-speed implementations

To be applicable to a wide spectrum of problems

To ease compiler implementation

To overcome the von Neumann bottleneck (the limited rate at which data can move between processor and memory)

About Tera MTA

The Tera MTA is a high-performance system with:
• Scalar multithreaded processors with synchronization among threads
• Uniform-access shared memory, i.e., all data is accessible with equal ease: no locality, no cache, no mapping
• Simple programming
• Zero-cost context switching

About Multi-Threading architecture (MTA)

The MTA uses a technique called multithreading that lets multiprocessors share memory without using caches

Because every processor always has other threads ready to run, a multithreaded machine can scale to many processors that stay almost constantly busy instead of idling on slow memory accesses

Multi-threading allows each processor to switch thread contexts between execution cycles and as a result the processor stays busy

Whenever a processor starts a slow memory or I/O instruction, rather than waiting tens of cycles for the stalled instruction to complete, the processor executes its next instruction from a different thread using different registers

Each processor has many copies of the programming and pipeline control registers, one copy for each execution thread that it can support
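The effect described on this slide can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the slides: every instruction is a memory reference, a thread cannot issue again until `memory_latency` cycles after its previous issue, and the function name `utilization` is hypothetical.

```python
def utilization(num_threads: int, memory_latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which the processor issues an instruction.

    Simplified model (an assumption, not the real MTA pipeline): every
    instruction is a memory reference, and a thread may not issue again
    until `memory_latency` cycles after its previous issue.
    """
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    issued = 0
    next_thread = 0  # round-robin pointer
    for cycle in range(cycles):
        # Scan for a ready thread, starting at the round-robin pointer.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + memory_latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break
        # If no thread was ready, the cycle is wasted.
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 50, 128):
        print(n, "threads:", utilization(n, memory_latency=100))
```

In this model a single thread keeps the processor busy only 1% of the time when the latency is 100 cycles, while 128 threads keep it busy on every cycle.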

Tera MTA Overview

Up to 256 processors, each running at 260 MHz

Up to 128 active threads per processor

Up to 256 I/O processors

Peak performance of 256 GFlop/sec

Processors and memory modules populate a sparse 3D torus interconnection network

4096 interconnection network nodes

Flat, shared main memory ranging from 16 to 512 GB

Cost: $5 million to $40 million

A View of the Tera Multiprocessor

Key Architecture Details

Each MTA processor has 128 “streams”, each of which is hardware (including 32 registers and a program counter) devoted to running a single thread of control

The processor executes instructions from the streams that are not blocked, in a fair round-robin fashion

A stream can issue at most one instruction every 21 cycles (the length of the instruction pipeline), so at least 21 ready threads are required to keep a processor fully busy

The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute

A ‘rich’ (high-bandwidth) interconnection network, together with enough ready threads, keeps the delays caused by references to data in memory hidden

Randomized memory mapping and a highly connected network provide near-uniform access time from any processor to any memory location.
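The 21-cycle issue spacing translates directly into a utilization bound. The sketch below only restates the slide's arithmetic; the function names are hypothetical, and pairing the bound with the 260 MHz clock quoted in the overview is an assumption made for illustration.

```python
PIPELINE_DEPTH = 21  # cycles between two instructions of the same stream (from the slide)


def issue_slot_utilization(ready_streams: int, pipeline_depth: int = PIPELINE_DEPTH) -> float:
    """Upper bound on the fraction of issue slots that can be filled.

    Each stream fills at most one slot per `pipeline_depth` cycles, so
    `ready_streams` streams fill at most ready_streams / pipeline_depth
    of the slots, capped at 1.0.
    """
    if ready_streams < 0:
        raise ValueError("ready_streams must be non-negative")
    return min(1.0, ready_streams / pipeline_depth)


def single_stream_rate_mhz(clock_mhz: float, pipeline_depth: int = PIPELINE_DEPTH) -> float:
    """Instruction issue rate of one stream running alone, in MHz."""
    return clock_mhz / pipeline_depth


if __name__ == "__main__":
    for n in (1, 7, 21, 128):
        print(n, "ready streams:", issue_slot_utilization(n))
    print("one stream alone at 260 MHz:", single_stream_rate_mhz(260.0), "MHz")
```

The last line shows why single-thread performance is low on this design: one stream alone uses only 1/21 of the issue slots.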

Key Architecture Details

Hardware multithreading is used to tolerate high latencies to memory. This latency is typically on the order of 150 clock cycles

Expected benefits of the MTA include high processor utilization, near-linear scalability, and reduced programming effort, especially compared to distributed-memory machines using explicit message passing

The current MTA interconnect network is a 3–D toroidal mesh

Tera MTA’s Interconnection Network

The interconnection network is a three-dimensional sparsely populated torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors

Each link can transport a packet containing source and destination addresses, an operation, and 64 data bits in both directions simultaneously on every clock tick.

Some of the nodes are also linked to resources, i.e., processors, data memory units, I/O processors, and I/O cache units.

Instead of locating the processors on one side of the network and the memories on the other, the resources are distributed more-or-less uniformly throughout the network.
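A minimal sketch of the packet and the torus topology follows. It assumes a fully populated torus, whereas the real network is sparse and links each node to only some of its neighbors. It also takes the 16-node side from the 256-processor configuration given elsewhere in these slides. The names `Packet`, `torus_neighbors`, and `torus_distance` are hypothetical.

```python
from dataclasses import dataclass

SIDE = 16  # nodes per dimension in the 256-processor configuration


@dataclass(frozen=True)
class Packet:
    """Fields named on the slide; the concrete types are assumptions."""
    source: tuple[int, int, int]
    destination: tuple[int, int, int]
    operation: str
    data: int  # 64-bit payload

    def __post_init__(self) -> None:
        if not 0 <= self.data < 2**64:
            raise ValueError("data must fit in 64 bits")


def torus_neighbors(node: tuple[int, int, int], side: int = SIDE) -> list[tuple[int, int, int]]:
    """Six neighbors of a node in a fully populated 3D torus (wrap-around links)."""
    x, y, z = node
    return [
        ((x + 1) % side, y, z), ((x - 1) % side, y, z),
        (x, (y + 1) % side, z), (x, (y - 1) % side, z),
        (x, y, (z + 1) % side), (x, y, (z - 1) % side),
    ]


def torus_distance(a: tuple[int, int, int], b: tuple[int, int, int], side: int = SIDE) -> int:
    """Minimum hop count between two nodes, using the shorter way around each ring."""
    return sum(min((p - q) % side, (q - p) % side) for p, q in zip(a, b))


if __name__ == "__main__":
    pkt = Packet(source=(0, 0, 0), destination=(15, 15, 15), operation="load", data=0)
    print(torus_neighbors(pkt.source))
    print("hops:", torus_distance(pkt.source, pkt.destination))
```

Because of the wrap-around links, opposite corners of the cube are only three hops apart, and no two nodes in a full 16 × 16 × 16 torus are more than 24 hops apart.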

Tera MTA’s Interconnection Network

The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh

As the Tera architecture scales to larger numbers of processors $p$, the number of network nodes grows as $p^{3/2}$ rather than as the $p \log p$ associated with the more commonly used multistage networks. For example, a 1024-processor system would have $1024^{3/2} = 32{,}768$ nodes
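The node counts quoted here can be checked with a few lines of arithmetic. The sketch assumes a cubic torus whose side is the square root of the processor count, which matches both examples on the slide; the function names are hypothetical, and the base-2 logarithm is an assumption since the slide does not give a base.

```python
import math


def torus_nodes(processors: int) -> int:
    """Nodes in a cubic torus with side sqrt(processors), i.e. processors ** 1.5."""
    side = math.isqrt(processors)
    if side * side != processors:
        raise ValueError("this sketch assumes the processor count is a perfect square")
    return side ** 3


def multistage_growth(processors: int) -> float:
    """p * log2(p), the growth rate associated with multistage networks."""
    return processors * math.log2(processors)


if __name__ == "__main__":
    for p in (256, 1024):
        print(p, "processors:", torus_nodes(p), "torus nodes;",
              "p log2 p =", multistage_growth(p))
```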

Multithreading on one processor

[Figure: programs running in parallel (serial code; subproblem A with loop iterations i = 1 … n; subproblem B with loop iterations i = 0 … n) are split into concurrent threads of computation. The threads are mapped onto the processor's 128 hardware streams, some of which remain unused. Ready streams feed the instruction ready pool and the pipeline of executing instructions.]

Multithreading on multiple processors

[Figure: programs running in parallel (serial code; subproblem A with loop iterations i = 1 … n; subproblem B with loop iterations i = 0 … n) are split into concurrent threads of computation, which are multithreaded across multiple processors.]

Latency Tolerance In Tera MTA

The latency incurred in memory references is hidden by multithreading

As there may be up to 128 instruction streams (threads) and each can issue up to 8 memory references without waiting for the preceding ones, a latency of 128 × 8 = 1024 cycles can be tolerated

This lookahead lets a thread keep issuing instructions while earlier memory references are still outstanding, which helps threads achieve peak performance

Three operations (M, A, C: memory, arithmetic, and control) can be executed simultaneously per instruction per processor
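The latency figure above is a simple product, and the same relation can be inverted to estimate how many streams a given latency requires. The sketch follows the slide's own arithmetic and ignores everything else, including the separate requirement of at least 21 ready streams to fill the pipeline. The function names are hypothetical.

```python
import math


def tolerable_latency(streams: int, outstanding_refs: int) -> int:
    """Latency in cycles that can be hidden, per the slide's streams x references rule."""
    return streams * outstanding_refs


def streams_needed(latency_cycles: int, outstanding_refs: int) -> int:
    """Smallest stream count whose tolerable latency covers `latency_cycles`."""
    return math.ceil(latency_cycles / outstanding_refs)


if __name__ == "__main__":
    print("128 streams, 8 references each:", tolerable_latency(128, 8), "cycles")
    print("streams for 150 cycles with lookahead 8:", streams_needed(150, 8))
    print("streams for 150 cycles without lookahead:", streams_needed(150, 1))
```

Under this model, the typical 150-cycle memory latency quoted earlier needs far fewer streams when each stream can keep 8 references in flight than when it must wait for every reference.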

The Tera Idea: Higher investment in hardware yields improved utilization and reduces software overhead

Tera MTA Applications

PULSE 3D, used for simulating real-time heartbeats to better treat heart diseases.

MSC Software’s NASTRAN, a structural analysis code used extensively by the automobile and aerospace industries.

Livermore Software's LS-DYNA, which can simulate physical occurrences such as car crashes and metal stamping.

GAUSSIAN 98, a computational chemistry application used in molecular modeling.

MPIRE (for Massively Parallel Interactive Rendering Environment), a powerful graphics and animation application that visualizes complex phenomena.

The MTA is also used in seismic analysis, national security, and weather forecasting.

Advantages of Tera MTA

Tera MTA uses multiple contexts to hide latency

Tera machines perform a context switch every clock cycle

Both pipeline latency and memory latency are hidden in the Tera approach

Thread creation is very cheap

With 128 contexts per processor, a large register file (128 × 32 = 4,096 general-purpose registers, at 32 per stream) must be divided finely among the threads

As long as there is plenty of parallelism in user programs to hide latency and plenty of compiler support, the performance is potentially very high.

The advantages of Tera's architecture are available to users via minimal changes to their application code.

Drawbacks of Tera MTA

Performance is poor when parallelism is limited. In particular, single-context performance is guaranteed to be low, because one stream can issue at most one instruction every 21 cycles.

A large number of contexts demands lots of registers and other hardware resources which in turn implies higher cost and complexity.

Finally, because the design does little to reduce latency and has no caches, it needs lots of slack parallelism to hide latency as well as lots of memory bandwidth; both raise the cost of building the machine.

Bandwidth (not latency) limits practical MTA system size, and large MTA systems will have expensive memory networks.

Tera MTA: Tools

Tera provides two powerful tools, Traceview and Canal, that allow the programmer to understand:
• How the compiler has multithreaded a program
• How effectively the program actually utilizes the hardware

Customers

San Diego Supercomputer Center (SDSC)

Logicon, under a Naval Research Lab

Tera Computer Company

Tera MTA Macro Architecture

Problems Solved using Tera MTA

Irregular memory access patterns

Synchronization among threads

Load balancing

Current Industry Status: Cray Inc (ex-Tera)

Cray Inc. (Nasdaq NM: CRAY)

Est.: April 1, 2000 (Tera Computer + Cray Research)

HQ: Seattle, WA, USA

Products: Supercomputers (vector, microprocessor, multithreaded)

Market: Government, industry, academic research

Cray Research timeline:

1972: Est. by Seymour Cray in Minnesota, USA

1976: First Cray-1 shipment to Los Alamos

1980s: Ship follow-on products: Cray XMP, Cray YMP, Cray-2

1990s: More follow-on products: Cray C90, Cray J90, Cray T3D, Cray T90, Cray T3E, Cray SV1

1996: Merged with Silicon Graphics (SGI)

Tera Computer timeline:

1987: Est. by Burton Smith in Washington, USA

1988: Software development starts

1991: Hardware development starts

1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

2000: Purchased Cray business unit from SGI

Cray Inc. (2000–present; result of merger between Tera Computer and Cray Research)

Cray SX-6

Cray MTA-2

Cray SV1

Cray Red Storm

Cray X1

Cray XD1

Cray MTA-2, Multi-threaded Architecture

128 virtual processors in a CPU module

Zero-overhead thread switching

Up to 1 TB scalable shared memory

Cray MTA-2 Overview

Multithreaded system: Cray MTA-2

CPUs                            16       64       256
Hardware streams                2,048    8,192    32,768
Peak GFlops                     12+      48+      192+
Memory size (GB)                64       256      1,024 (1 TB)
Bisection bandwidth (GB/sec)    125      500      2,000
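The three configurations scale linearly, which a short per-CPU calculation makes visible. This sketch copies the figures from the table, drops the "+" suffix on peak GFlops, and treats 1 TB as 1,024 GB. It assumes 128 hardware streams per CPU, as stated elsewhere in these slides. The names `CONFIGS` and `per_cpu` are hypothetical.

```python
STREAMS_PER_CPU = 128  # hardware streams per processor, as stated in the slides

# Figures copied from the table; the "+" on peak GFlops is dropped, 1 TB = 1024 GB.
CONFIGS = {
    16: {"peak_gflops": 12, "memory_gb": 64, "bisection_gb_s": 125},
    64: {"peak_gflops": 48, "memory_gb": 256, "bisection_gb_s": 500},
    256: {"peak_gflops": 192, "memory_gb": 1024, "bisection_gb_s": 2000},
}


def per_cpu(cpus: int) -> dict[str, float]:
    """Total stream count plus per-CPU peak, memory, and bisection bandwidth."""
    cfg = CONFIGS[cpus]
    return {
        "streams": cpus * STREAMS_PER_CPU,
        "peak_gflops": cfg["peak_gflops"] / cpus,
        "memory_gb": cfg["memory_gb"] / cpus,
        "bisection_gb_s": cfg["bisection_gb_s"] / cpus,
    }


if __name__ == "__main__":
    for cpus in CONFIGS:
        print(cpus, "CPUs:", per_cpu(cpus))
```

Every configuration works out to the same per-CPU peak rate, memory, and bisection bandwidth, so the larger systems are straightforward multiples of the smallest one.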

Unique capability of Cray MTA

Visualization of a nebula using the MPIRE application on a Cray MTA system
