Tera MTA(Multi-Threaded Architecture)
Thriveni Movva
(CMPS 5433)
Presentation Contains
• Evolution of Tera MTA
• Design goals of Tera MTA
• Tera MTA Architecture
• Interconnection Network
• Applications
• Advantages & Drawbacks
• Current MTA Status
Evolution Of Tera MTA
1987: Tera Computer Company was established by Burton Smith in Washington, USA
1988: Software development starts
1991: Hardware development starts
1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)
Tera MTA: Design Goals
• To solve the two major problems then faced by high-performance parallel computers:
  - Scalability
  - Programmability
• To be suitable for very high-speed implementations
• To be applicable to a wide spectrum of problems
• To ease compiler implementation
• To overcome the von Neumann bottleneck (a problem of memory usage)
About Tera MTA
The Tera MTA is a high-performance system with:
• scalar multithreaded processors with synchronization among threads
• uniform-access shared memory, i.e., all data accessible with equal ease
  - No locality
  - No cache
  - No mapping
• simple programming
• zero-cost context switching
About Multi-Threading architecture (MTA)
Uses a new technique called Multi-threading that lets multiprocessors share memory without using caches
Because each processor always has other threads to run, these computers can have thousands of processors that stay almost constantly busy instead of waiting on slow memory accesses
Multi-threading allows each processor to switch thread contexts between execution cycles and as a result the processor stays busy
Whenever a processor starts a slow memory or I/O instruction, rather than waiting tens of cycles for the stalled instruction to complete, the processor executes its next instruction from a different thread using different registers
Each processor has many copies of the programming and pipeline control registers, one copy for each execution thread that it can support
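The effect of switching threads instead of stalling can be sketched with a toy cycle-by-cycle model. This is an illustrative Python sketch, not Tera's actual scheduling logic: the function name `simulate`, the 50-cycle latency, and the assumption that every instruction is a slow memory access are all hypothetical.

```python
def simulate(num_threads: int, mem_latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which the toy processor issues an instruction."""
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    next_thread = 0               # round-robin pointer
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                # Toy assumption: every instruction is a slow memory access,
                # so this thread cannot issue again for mem_latency cycles.
                ready_at[t] = cycle + mem_latency
                next_thread = (t + 1) % num_threads
                issued += 1
                break  # at most one instruction issues per cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 25, 50, 128):
        print(f"{n:3d} threads -> utilization {simulate(n, mem_latency=50):.2f}")
```

Under these assumptions a single thread keeps the processor busy about 2% of the time, 25 threads about half the time, and 50 or more threads keep it issuing on every cycle.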
Tera MTA Overview
• Up to 256 processors, each running at 260 MHz
• Up to 128 active threads per processor
• Up to 256 I/O processors
• Peak performance of 256 GFlop/sec
• Processors and memory modules populate a sparse 3D torus interconnection network
• 4096 interconnection network nodes
• Flat, shared main memory ranging from 16 to 512 GB
• Cost: $5 million to $40 million
A View of the Tera Multiprocessor
Key Architecture Details
Each MTA processor has 128 “streams”, each of which is hardware (including 32 registers and a program counter) devoted to running a single thread of control
The processor executes instructions from streams, that are not blocked, in a fair round robin fashion
A stream can issue an instruction every 21 cycles (the length of the instruction pipeline) so at least 21 ready threads are required to keep a processor fully busy
The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute
A ‘rich’ interconnection network, combined with enough ready threads, is intended to completely hide any potential delays caused by references to data in memory
Randomized memory mapping and high interconnectivity network provide near-uniform access time from any processor to any memory location.
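The 21-cycle issue rule above implies two simple figures: how busy a processor is for a given number of ready streams, and how fast a single stream can run. The sketch below works through that arithmetic in Python. The function names are hypothetical, and the model is an assumption that ignores memory stalls and considers only the pipeline spacing.

```python
PIPELINE_DEPTH = 21  # cycles between successive issues from one stream (from the slide)
CLOCK_HZ = 260e6     # 260 MHz processor clock (from the overview slide)


def issue_utilization(ready_streams: int, depth: int = PIPELINE_DEPTH) -> float:
    """Fraction of issue slots filled when each ready stream issues once per `depth` cycles."""
    return min(1.0, ready_streams / depth)


def single_stream_rate(clock_hz: float = CLOCK_HZ, depth: int = PIPELINE_DEPTH) -> float:
    """Upper bound on instructions per second for one stream running alone."""
    return clock_hz / depth


if __name__ == "__main__":
    for n in (1, 7, 21, 128):
        print(f"{n:3d} ready streams -> {issue_utilization(n):.0%} of issue slots used")
    print(f"one stream alone: at most {single_stream_rate() / 1e6:.1f} M instructions/s")
```

In this model a single stream can use only 1/21 of the processor, roughly 12.4 million instructions per second at 260 MHz, which is why the design depends on having many ready threads.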
Key Architecture Details
Hardware multithreading is used to tolerate high latencies to memory. This latency is typically on the order of 150 clock cycles
Expected benefits of the MTA include high processor utilization, near-linear scalability, and reduced programming effort, especially compared to distributed-memory machines using explicit message passing
The current MTA interconnect network is a 3–D toroidal mesh
Tera MTA’s Interconnection Network
The interconnection network is a three-dimensional sparsely populated torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors
Each link can transport a packet containing source and destination addresses, an operation, and 64 data bits in both directions simultaneously on every clock tick.
Some of the nodes are also linked to resources, i.e., processors, data memory units, I/O processors, and I/O cache units.
Instead of locating the processors on one side of the network and the memories on the other, the resources are distributed more-or-less uniformly throughout the network.
Tera MTA’s Interconnection Network
The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh
As the Tera architecture scales to larger numbers of processors p, the number of network nodes grows as p^(3/2) rather than as the p log p associated with the more commonly used multistage networks. For example, a 1024-processor system would have 32,768 nodes
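The node counts quoted above follow directly from the p^(3/2) rule. The short Python sketch below reproduces them. The function names are hypothetical, the cubic torus with side sqrt(p) is inferred from the 16 × 16 × 16 example, and the p·log2(p) column is only the bare growth term, since the slides give no constant factors for multistage networks.

```python
import math


def torus_nodes(processors: int) -> int:
    """Nodes in a cubic torus whose side is sqrt(p), i.e. p^(3/2) nodes in total."""
    side = math.isqrt(processors)
    if side * side != processors:
        raise ValueError("this sketch assumes the processor count is a perfect square")
    return side ** 3


def multistage_term(processors: int) -> float:
    """Bare p * log2(p) growth term, with no constant factor (an assumption)."""
    return processors * math.log2(processors)


if __name__ == "__main__":
    for p in (256, 1024):
        print(f"p = {p:5d}: torus nodes = {torus_nodes(p):6d}, "
              f"p*log2(p) = {multistage_term(p):.0f}")
```

For p = 256 this gives 16³ = 4096 nodes, and for p = 1024 it gives 32³ = 32,768 nodes, matching the figures on the slide.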
Multithreading on one processor
[Figure: programs running in parallel (serial code; subproblems A and B, each with loop iterations up to i = n) are split into concurrent threads of computation. The threads occupy some of the processor's 128 hardware streams, leaving the rest unused; ready streams feed the instruction ready pool and the pipeline of executing instructions.]
Multithreading on multiple processors
[Figure: programs running in parallel (serial code; subproblems A and B, each with loop iterations up to i = n) are split into concurrent threads of computation, which are multithreaded across multiple processors.]
Latency Tolerance In Tera MTA
• The latency incurred in memory references is hidden by multithreading
• With up to 128 instruction streams (threads) per processor, each able to issue 8 memory references without waiting for the preceding ones, a latency of 1024 cycles (128 × 8) can be tolerated
• The lookahead allows threads to achieve peak performance
• Three operations (M, A, C: memory, arithmetic, control) can be executed simultaneously per instruction per processor
The Tera Idea: Higher investment in hardware yields improved utilization and reduces software overhead
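The 1024-cycle figure on this slide is the product of the stream count and the number of memory references each stream may have in flight. The Python sketch below spells out that arithmetic and inverts it for the roughly 150-cycle memory latency mentioned earlier. The function names are hypothetical, and the model is the slide's simplified one, not a measured result.

```python
import math

STREAMS_PER_PROCESSOR = 128  # hardware streams per processor (from the slide)
REFS_IN_FLIGHT = 8           # memory references a stream can issue without waiting (from the slide)


def tolerated_latency(streams: int = STREAMS_PER_PROCESSOR,
                      refs_in_flight: int = REFS_IN_FLIGHT) -> int:
    """Latency in cycles that can be covered under the slide's simplified model."""
    return streams * refs_in_flight


def streams_needed(latency_cycles: int, refs_in_flight: int = REFS_IN_FLIGHT) -> int:
    """Streams required to cover a given latency under the same model."""
    return math.ceil(latency_cycles / refs_in_flight)


if __name__ == "__main__":
    print("tolerated latency:", tolerated_latency(), "cycles")
    print("streams needed for a 150-cycle latency:", streams_needed(150))
```

Under this model 128 × 8 = 1024 cycles can be tolerated, and a 150-cycle latency needs about 19 streams at full lookahead, which is below the 21 ready streams already required to keep the pipeline full.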
Tera MTA Applications
PULSE 3D, used for simulating real-time heartbeats to better treat heart diseases.
MSC Software’s NASTRAN, a structural analysis code used extensively by the automobile and aerospace industries.
Livermore Software's LS-DYNA, which can simulate physical occurrences such as car crashes and metal stamping.
GAUSSIAN 98, a computational chemistry application used in molecular modeling.
MPIRE (for Massively Parallel Interactive Rendering Environment), a powerful graphics and animation application that visualizes complex phenomena.
Used in seismic analysis, national security and weather forecasting.
Advantages of Tera MTA
• Tera MTA uses multiple contexts to hide latency
• Tera machines perform a context switch every clock cycle
• Both pipeline latency and memory latency are hidden in the Tera approach
• Thread creation is very cheap
• With 128 contexts per processor, a large number (2K) of registers must be shared finely between threads
• As long as there is plenty of parallelism in user programs to hide latency and plenty of compiler support, the performance is potentially very high
• The advantages of Tera's architecture are available to users via minimal changes to their application code
Drawbacks of Tera MTA
Performance is poor when parallelism is limited; in particular, single-context (single-thread) performance is guaranteed to be low.
A large number of contexts demands lots of registers and other hardware resources which in turn implies higher cost and complexity.
Finally, because the design puts little emphasis on latency reduction and caching, it needs lots of slack parallelism to hide latency as well as lots of memory bandwidth; both raise the cost of building the machine.
Bandwidth (not latency) limits practical MTA system size and large MTA systems will have expensive memory networks.
Tera MTA: Tools
Tera provides two powerful tools, Traceview and Canal, that allow the programmer to understand:
• how the compiler has multithreaded a program
• how effectively the program actually utilizes the hardware
Customers
• San Diego Supercomputer Center (SDSC)
• Logicon, under a Naval Research Lab
• Tera Computer Company
Tera MTA Macro Architecture
Problems Solved using Tera MTA
• Irregular memory access patterns
• Synchronization among threads
• Load balancing
Current Industry Status: Cray Inc (ex-Tera)
Cray Inc. (Nasdaq NM: CRAY)
Est.: April 1, 2000 (Tera Computer + Cray Research)
HQ: Seattle, WA, USA
Products: Supercomputers (Vector, Microprocessor, Multithread)
Market: Government, Industry, Academic Research
Cray Research
1972: Est. by Seymour Cray in Minnesota, USA
1976: First Cray-1 shipment to Los Alamos
1980s: Ship follow-on products: Cray X-MP, Cray Y-MP, Cray-2
1990s: More follow-on products: Cray C90, Cray J90, Cray T3D, Cray T90, Cray T3E, Cray SV1
1996: Merged with Silicon Graphics (SGI)
Tera Computer Company
1987: Est. by Burton Smith in Washington, USA
1988: Software development starts
1991: Hardware development starts
1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)
2000: Purchased Cray business unit from SGI
Cray Inc. (2000–present; result of the merger between Tera Computer Company and Cray Research)
• Cray SX-6
• Cray MTA-2
• Cray SV1
• Cray Red Storm
• Cray X1
• Cray XD1
Cray MTA-2, Multi-threaded Architecture
• 128 virtual processors in a CPU module
• Zero-overhead thread switching
• Up to 1 TB scalable shared memory
Cray MTA-2 Overview
Multithread system
Cray MTA-2 CPUs                  16       64       256
Hardware streams                 2,048    8,192    32,768
Peak GFlops                      12+      48+      192+
Memory size                      64 GB    256 GB   1 TB
Bisection bandwidth (GB/sec)     125      500      2,000
Unique capability of Cray MTA
Visualization of a nebula using the MPIRE application on a Cray MTA system
References
• http://www.hoise.com/vmw/00/articles/vmw/JH-VM-01-00-1.html
• http://www.cs.njit.edu/pact/eight/tutorial/tera.html
• http://techreports.larc.nasa.gov/icase/1998/icase-1998-interim33.pdf
• http://www.bearcave.com/misl/misl_tech/venture_capital.html