tera mta (multi-threaded architecture) thriveni movva (cmps 5433)

Tera MTA (Multi-Threaded Architecture)

Thriveni Movva

(CMPS 5433)

Presentation Contains

• Evolution of Tera MTA
• Design goals of Tera MTA
• Tera MTA Architecture
• Interconnection Network
• Applications
• Advantages & Drawbacks
• Current MTA Status

Evolution Of Tera MTA

1987: Tera Computer Company was established by Burton Smith in Washington, USA

1988: Software development starts

1991: Hardware development starts

1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

Tera MTA: Design Goals

To solve the two major problems then faced by high-performance parallel computers:

• Scalability
• Programmability

To be suitable for very high-speed implementations

To be applicable to a wide spectrum of problems

To ease compiler implementation

To overcome the von Neumann bottleneck (a memory-access problem)

About Tera MTA

The Tera MTA is a high-performance system with:

• scalar multithreaded processors with synchronization among threads
• uniform-access shared memory, i.e., all data accessible with equal ease (no locality, no cache, no mapping)
• simple programming
• zero-cost context switching

About Multi-Threading architecture (MTA)

Uses a technique called multithreading that lets multiprocessors share memory without using caches

These multithreaded computers can have thousands of processors that stay almost constantly busy, because slow memory accesses are overlapped with work from other threads instead of being waited on

Multi-threading allows each processor to switch thread contexts between execution cycles and as a result the processor stays busy

Whenever a processor starts a slow memory or I/O instruction, rather than waiting tens of cycles for the stalled instruction to complete, the processor executes its next instruction from a different thread using different registers

Each processor has many copies of the programming and pipeline control registers, one copy for each execution thread that it can support
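The switching behaviour described above can be illustrated with a toy cycle-by-cycle model. This is only a sketch under simplifying assumptions that are not from the slides: every instruction is a memory reference, a thread may have only one reference outstanding, and the names `utilization`, `num_threads`, and `memory_latency` are hypothetical.

```python
def utilization(num_threads, memory_latency, cycles=10_000):
    """Fraction of cycles in which the toy processor issues an instruction."""
    ready_at = [0] * num_threads   # cycle at which each thread may issue again
    issued = 0
    next_thread = 0
    for cycle in range(cycles):
        # Scan threads round-robin, starting after the last one that issued.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                # Assumption: every instruction is a memory reference that
                # blocks only its own thread for `memory_latency` cycles.
                ready_at[t] = cycle + memory_latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break
        # If no thread was ready, the cycle is simply lost (processor idle).
    return issued / cycles


for n in (1, 50, 100, 128):
    print(n, "threads:", utilization(n, memory_latency=100))
```

In this model a single thread keeps the processor busy for only a small fraction of the cycles, while enough threads to cover the latency keep it busy on every cycle.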

Tera MTA Overview

• Up to 256 processors, each running at 260 MHz
• Up to 128 active threads per processor
• Up to 256 I/O processors
• Peak performance of 256 GFlop/sec
• Processors and memory modules populate a sparse 3D torus interconnection network
• 4096 interconnection network nodes
• Flat, shared main memory ranging from 16 to 512 GB
• Cost: $5 million to $40 million

A View of the Tera Multiprocessor

Key Architecture Details

Each MTA processor has 128 “streams”, each of which is hardware (including 32 registers and a program counter) devoted to running a single thread of control

The processor executes instructions from streams that are not blocked, in a fair round-robin fashion

A stream can issue an instruction every 21 cycles (the length of the instruction pipeline) so at least 21 ready threads are required to keep a processor fully busy

The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute
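The 21-stream requirement can be written as a one-line bound. The sketch below assumes the simplest possible model (one issue slot per cycle, each stream eligible once every 21 cycles, no memory stalls); the function name `issue_slot_utilization` is hypothetical.

```python
PIPELINE_DEPTH = 21  # cycles between successive instructions of one stream (from the text)


def issue_slot_utilization(ready_streams, pipeline_depth=PIPELINE_DEPTH):
    """Upper bound on the fraction of issue slots that can be filled.

    Each ready stream can fill at most one slot out of every
    `pipeline_depth` cycles, so `ready_streams` streams fill at most
    ready_streams / pipeline_depth of the slots, capped at 1.
    """
    if ready_streams < 0:
        raise ValueError("ready_streams must be non-negative")
    return min(1.0, ready_streams / pipeline_depth)


for n in (1, 7, 21, 128):
    print(n, "ready streams:", round(issue_slot_utilization(n), 3))
```

Under this model a single thread can use at most 1/21 of the processor, which is the "low single-context performance" trade-off of the design.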

A ‘rich’ interconnection network is used so that potential delays caused by references to data in memory can be hidden

Randomized memory mapping and a highly interconnected network provide near-uniform access time from any processor to any memory location.
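Why randomized mapping helps can be shown with a small experiment. The sketch below is not the MTA's actual address-scrambling scheme: the module count, the multiplicative hash, and its constant are all assumptions chosen for illustration. It compares plain low-order interleaving with a hashed mapping for a strided access pattern.

```python
NUM_MODULES = 64                       # hypothetical module count (a power of two)
MODULE_BITS = 6                        # log2(NUM_MODULES)
HASH_MULTIPLIER = 0x9E3779B97F4A7C15   # arbitrary odd 64-bit constant (illustrative only)
MASK64 = (1 << 64) - 1


def interleaved_module(addr):
    """Plain low-order interleaving: consecutive words go to consecutive modules."""
    return addr % NUM_MODULES


def scrambled_module(addr):
    """Hash the address first, then use the top bits as the module number."""
    hashed = (addr * HASH_MULTIPLIER) & MASK64
    return hashed >> (64 - MODULE_BITS)


# A stride equal to the module count is the worst case for plain interleaving:
# every reference lands on the same module.
strided = [i * NUM_MODULES for i in range(1024)]
plain_hits = {interleaved_module(a) for a in strided}
scrambled_hits = {scrambled_module(a) for a in strided}
print("modules touched, plain interleaving:", len(plain_hits))
print("modules touched, scrambled mapping: ", len(scrambled_hits))
```

With plain interleaving the strided pattern hits a single module, whereas the hashed mapping spreads the same references over more than one module, so no access pattern chosen by the programmer is systematically penalized.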

Key Architecture Details

Hardware multithreading is used to tolerate high latencies to memory. This latency is typically on the order of 150 clock cycles

Expected benefits of the MTA include high processor utilization, near-linear scalability, and reduced programming effort, especially compared to distributed-memory machines using explicit message passing

The current MTA interconnect network is a 3–D toroidal mesh

Tera MTA’s Interconnection Network

The interconnection network is a three-dimensional sparsely populated torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors

Each link can transport a packet containing source and destination addresses, an operation, and 64 data bits, in both directions simultaneously, on every clock tick.

Some of the nodes are also linked to resources, i.e., processors, data memory units, I/O processors, and I/O cache units.

Instead of locating the processors on one side of the network and the memories on the other, the resources are distributed more-or-less uniformly throughout the network.

Tera MTA’s Interconnection Network

The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh
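The wraparound structure of a 16 × 16 × 16 torus can be sketched as follows. Note the simplification: the real network is sparsely populated and each node is linked to only some of its neighbors, whereas this sketch assumes a full torus with all six links present. The helper names are hypothetical.

```python
SIDE = 16  # nodes per dimension, giving 16 * 16 * 16 = 4096 nodes


def torus_neighbors(x, y, z, side=SIDE):
    """Six wraparound neighbors of node (x, y, z) in a full 3D torus."""
    return [
        ((x + 1) % side, y, z), ((x - 1) % side, y, z),
        (x, (y + 1) % side, z), (x, (y - 1) % side, z),
        (x, y, (z + 1) % side), (x, y, (z - 1) % side),
    ]


def torus_distance(a, b, side=SIDE):
    """Minimum hop count between nodes a and b, using wraparound links."""
    return sum(min(abs(p - q), side - abs(p - q)) for p, q in zip(a, b))


print(SIDE ** 3)                                # total node count
print(torus_neighbors(0, 0, 0))                 # includes (15, 0, 0) via wraparound
print(torus_distance((0, 0, 0), (8, 8, 8)))     # farthest node from the origin
```

In the full-torus model no node is more than 8 hops away along any one dimension, which bounds the worst-case path length.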

As the Tera architecture scales to larger numbers of processors p, the number of network nodes grows as p^(3/2) rather than as the p log p associated with the more commonly used multistage networks. For example, a 1024-processor system would have 32,768 nodes
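The node counts quoted above follow directly from the p^(3/2) rule: a cubic torus with side length sqrt(p) has (sqrt(p))^3 nodes. The short check below assumes perfect-square processor counts, and the function names are hypothetical.

```python
import math


def torus_side(p):
    """Side length of the cubic torus for p processors (p must be a perfect square)."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("this sketch only handles perfect-square processor counts")
    return side


def torus_nodes(p):
    """Number of network nodes, p ** (3/2), computed exactly as side ** 3."""
    return torus_side(p) ** 3


for p in (256, 1024):
    print(p, "processors:", torus_side(p), "per side,", torus_nodes(p), "nodes")
```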

Multithreading on one processor

[Figure: programs running in parallel (serial code, subproblem A with iterations i = 1 … n, subproblem B with iterations i = 0 … n) are divided into concurrent threads of computation, which are mapped onto the 128 hardware streams of one processor; some streams remain unused. Ready instructions enter the instruction ready pool and then the pipeline of executing instructions.]

Multithreading on multiple processors

[Figure: the same programs running in parallel (serial code, subproblem A with iterations i = 1 … n, subproblem B with iterations i = 0 … n) and their concurrent threads of computation, multithreaded across multiple processors.]

Latency Tolerance In Tera MTA

• The latency incurred in memory references is hidden by multithreading
• As there may be up to 128 instruction streams (threads), and each can issue 8 memory references without waiting for the preceding ones, a latency of 1024 cycles can be tolerated
• The lookahead allows threads to achieve peak performance
• Three operations (M, A, C: memory, arithmetic, control) can be executed simultaneously per instruction per processor

The Tera Idea: Higher investment in hardware yields improved utilization and reduces software overhead
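The 1024-cycle figure above is the product of the stream count and the per-stream lookahead. The sketch below reproduces that arithmetic and, under an additional simplifying assumption (one instruction issued per cycle, every instruction a memory reference), estimates how many streams are needed to cover a given latency. The function names are hypothetical.

```python
import math

STREAMS_PER_PROCESSOR = 128   # from the text
LOOKAHEAD_REFERENCES = 8      # outstanding memory references per stream (from the text)


def tolerable_latency(streams=STREAMS_PER_PROCESSOR, refs=LOOKAHEAD_REFERENCES):
    """Memory latency (in cycles) that can be fully overlapped."""
    return streams * refs


def streams_to_cover(latency_cycles, refs=LOOKAHEAD_REFERENCES):
    """Streams needed so outstanding references span the whole latency.

    Assumption: one instruction issues per cycle and each is a memory reference.
    """
    return math.ceil(latency_cycles / refs)


print(tolerable_latency())      # 128 * 8
print(streams_to_cover(150))    # for a latency on the order of 150 cycles
```

For the roughly 150-cycle latency mentioned earlier, this model needs about 19 streams, which is below the 21 already required to fill the instruction pipeline.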

Tera MTA Applications

PULSE 3D, used for simulating real-time heartbeats to better treat heart diseases.

MSC Software’s NASTRAN, a structural analysis code used extensively by the automobile and aerospace industries.

Livermore Software's LS-DYNA, which can simulate physical occurrences such as car crashes and metal stamping.

GAUSSIAN 98, a computational chemistry application used in molecular modeling.

MPIRE (for Massively Parallel Interactive Rendering Environment), a powerful graphics and animation application that visualizes complex phenomena.

Used in seismic analysis, national security and weather forecasting.

Advantages of Tera MTA

• Tera MTA uses multiple contexts to hide latency
• Tera machines perform a context switch every clock cycle
• Both pipeline latency and memory latency are hidden in the Tera approach
• Thread creation is very cheap
• With 128 contexts per processor, a large number (2k) of registers must be shared finely between threads
• As long as there is plenty of parallelism in user programs to hide latency and plenty of compiler support, the performance is potentially very high

The advantages of Tera's architecture are available to users via minimal changes to their application code.

Drawbacks of Tera MTA

Performance is poor when parallelism is limited; in particular, single-context performance is guaranteed to be low.

A large number of contexts demands lots of registers and other hardware resources which in turn implies higher cost and complexity.

Finally, the limited focus on latency reduction and caching entails lots of slack parallelism to hide latency as well as lots of memory bandwidth; both require a higher cost for building the machine.

Bandwidth (not latency) limits practical MTA system size and large MTA systems will have expensive memory networks.

Tera MTA: Tools

Tera provides two powerful tools, Traceview and Canal, that allow the programmer to understand:

• how the compiler has multithreaded a program
• how effectively the program actually utilizes the hardware

Customers

• San Diego Supercomputer Center (SDSC)
• Logicon, under a Naval Research Lab
• Tera Computer Company

Tera MTA Macro Architecture

Problems Solved using Tera MTA

• Irregular memory access patterns
• Synchronization among threads
• Load balancing

Current Industry Status: Cray Inc (ex-Tera)

Cray Inc. (Nasdaq NM: CRAY)

Est.: April 1, 2000

(Tera Computer + Cray Research)

HQ: Seattle, WA, USA

Products: Supercomputers

(Vector, Microprocessor, Multithread)

Market: Government, Industry, Academic Research

Cray Research:

1972: Est. by Seymour Cray in Minnesota, USA

1976: First Cray-1 shipment to Los Alamos

1980s: Ship follow-on products (Cray XMP, Cray YMP, Cray-2)

1990s: More follow-on products (Cray C90, Cray J90, Cray T3D, Cray T90, Cray T3E, Cray SV1)

1996: Merged with Silicon Graphics (SGI)

Tera Computer:

1987: Est. by Burton Smith in Washington, USA

1988: Software development starts

1991: Hardware development starts

1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

2000: Purchased Cray business unit from SGI

Cray Inc. (2000–present; result of merger between Tera Computers and Cray Research)

Cray SX-6, Cray MTA-2, Cray SV1, Cray Red Storm, Cray X1, Cray XD1

Cray MTA-2, Multi-Threaded Architecture

• 128 virtual processors in a CPU module
• Zero-overhead thread switching
• Up to 1 TB scalable shared memory

Cray MTA-2 Overview

Multithread system: Cray MTA-2

CPUs                            16       64       256
Hardware streams                2,048    8,192    32,768
Peak GFlops                     12+      48+      192+
Memory size                     64 GB    256 GB   1 TB
Bisection bandwidth (GB/sec)    125      500      2,000

Unique capability of Cray MTA

Visualization of a nebula using the MPIRE application on a Cray MTA system

