dynamically programmable array architecture
Confidential
Dynamically Programmable Array Architecture
Robert Heaton
Obsidian Technology

Mesh of Trees
Busses are bi-directional
2 cycles to exchange data
Separate X and Y dimensions
Diagonal routing not directly supported
PUs difficult to program to take advantage of structure
[Figure: PUs arranged as a mesh of trees]

Two Dimensional Mesh
[Figure: PUs arranged in a two-dimensional mesh]

4x4 Hierarchical Cluster
[Figure: four groups of four PUs, each group served by its own RU, with the four RUs joined by one higher-level RU]

Simple 4x4 Cluster Wiring
Bus width = 140u for 16 bit busses
That is a lot of wires!
Budget 4x4 cluster area is 1 mm2
[Figure: row of four PUs wired through a Switch and a Joint. Labels: N, Hin1, Hadr12L-2, Hout1, 6*N Wires, 1.4, M2 Pitch]
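The 140u figure can be roughly cross-checked from the diagram labels. The sketch below assumes the labels mean 6*N wires per bus channel laid out at a 1.4u metal-2 pitch; that reading of the labels, and the function name, are assumptions rather than something the slide states.

```python
def bus_channel_width_um(bits, wires_per_bit=6, m2_pitch_um=1.4):
    """Channel width = number of wires times wire pitch (assumed reading of the labels)."""
    return bits * wires_per_bit * m2_pitch_um

# 16 bit busses: 16 * 6 = 96 wires at the assumed 1.4u pitch
width = bus_channel_width_um(16)
print(f"{width:.1f}u")
```

Under these assumptions the result lands a little below the quoted 140u, which is consistent with the slide's rounded budget figure.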

Routing Hierarchy
256 PUs
4 levels of hierarchy
Hadr: up level till
L0adr: local address
L1adr: level 1 address
L2adr: level 2 address
L3adr: level 3 address
[Figure: 256 PUs in groups of four, each group with an RU; four RUs share an RU1, four RU1s share an RU2, and the four RU2s connect to a single RU3]
Address fields: Hadr L0adr L1adr L2adr L3adr
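A minimal sketch of how a 256-PU index might split into the four 2-bit level addresses, and how far up the tree a route has to go. The field names follow the slide; treating Hadr as "number of RU levels to climb", the bit ordering, and the function names are assumptions.

```python
LEVELS = 4   # L0adr .. L3adr
FANOUT = 4   # four children per routing unit, so 2 bits per level

def level_addresses(pu):
    """Split a PU index (0..255) into (L0adr, L1adr, L2adr, L3adr), 2 bits each."""
    if not 0 <= pu < FANOUT ** LEVELS:
        raise ValueError("PU index out of range")
    return tuple((pu >> (2 * level)) & 0b11 for level in range(LEVELS))

def levels_to_climb(src, dst):
    """RU levels a route passes through before src and dst share an ancestor.

    1 = same 4-PU group (local RU only), 4 = route goes via RU3.
    """
    a, b = level_addresses(src), level_addresses(dst)
    # The highest level at which the addresses differ decides the climb.
    for level in range(LEVELS - 1, 0, -1):
        if a[level] != b[level]:
            return level + 1
    return 1

print(level_addresses(0b11100100))
print(levels_to_climb(0, 3), levels_to_climb(0, 4), levels_to_climb(0, 255))
```

Two PUs in the same group of four differ only in L0adr, so the route stays in the local RU; PUs in different quarters of the array differ in L3adr and must go through RU3.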

Week's Investigation (9/12/97)
Investigate routing structures
Dynamic routing assignment/programming
Compromise between area and flexibility
Support for tree of trees
Not a complete story yet!

Routing Unit
Full Duplex connect busses
Each PU node controls its source port via a 2 bit local or 6 bit hierarchical address
Broadcast support
Any node may listen to any other input to the cluster
Hierarchical node addressing must not clash
[Figure: four Process Units (PU) connected to one Routing Unit (RU)]

Routing Unit PU Port Detail
Port numbering is clockwise & relative to each PU port
HBUS port is always at port 3
[Figure: PU port multiplexer. Inputs: from port 0, from port 1, from port 2, from port H. Signals: PU Input, PU Output, PU Input address, to other ports; select lines s0, s1; labelled widths 6, N, N, 2, 4]
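The port scheme can be illustrated with a small model of one 4-PU cluster. This is a sketch only: it assumes ports 0 to 2 select the other three PUs in clockwise order starting from the next neighbour, which is one possible reading of "clockwise & relative"; the function name is hypothetical.

```python
HBUS_PORT = 3  # per the slide, the hierarchical bus is always port 3

def route(outputs, hbus_value, selects):
    """Return the value each of the 4 PUs receives.

    outputs    : the four PU output values, indexed by PU number
    hbus_value : value currently driven on the hierarchical bus
    selects    : each PU's 2 bit source-port choice (0..3)
    """
    inputs = []
    for pu, port in enumerate(selects):
        if not 0 <= port <= 3:
            raise ValueError("port select is 2 bits")
        if port == HBUS_PORT:
            inputs.append(hbus_value)
        else:
            # Assumed mapping: port 0 is the next PU clockwise, and so on.
            inputs.append(outputs[(pu + 1 + port) % 4])
    return inputs

# PU0, PU2 and PU3 all listen to PU1 (a broadcast); PU1 listens to the HBUS.
print(route([10, 11, 12, 13], 99, [0, 3, 2, 1]))
```

Because every PU chooses its own source, broadcast needs no special mechanism in this model: several PUs simply select the same port.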

PU Overview
Simple data path functionality
Primitive control options
Wide instructions control data path function and operand routing
Conditions may be inverted for “repeat until” or “Branch If” control
Very primitive address arithmetic
32 or fewer instructions in program

N Bit Functional Unit
Logic functions: OR, XOR, AND, 0, 1
Arithmetic: Add, subtract, Multiply
Shifts: single bit left and right
Conditional detection: 0, -1, <0, >0
More optimization needed
Routing issues need more work
[Figure: functional unit data path. Blocks: ALU/MULT, DFF, Bit Shift, Carry Logic, Const bit, mux0, mux1, mux2. Signals: A, F, Cin, Cout, LSin, RSin, ALUCTL, SFTCTL]

N Bit Functional Unit (V2)
Logic functions: OR, XOR, AND, 0, 1
Arithmetic: Add, subtract
Shifts: right and left shifts
Conditional detection: 0, <0, >0, OF
Memory mapped RAM access to operands
[Figure: V2 data path. Blocks: ALU, DFF, B Shift, Carry Logic, mux0, mux1, mux2, two N bit RAM operand stores, Multiply Sequencer. Signals: Out, Cin, Cout, LSin, RSin, ALUCTL, SFTCTL]
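A behavioural sketch of the V2 function set, to make the condition outputs concrete. It assumes two's-complement operands and models only OR, XOR, AND, add and subtract; the constant functions, the shifter and the multiply sequencer are left out, and the function and flag names are illustrative.

```python
def alu(op, a, b, n=16):
    """Return (result, flags) for an n-bit two's-complement operation."""
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a &= mask
    b &= mask
    overflow = False
    if op == "add":
        raw = a + b
        # Overflow: operands share a sign and the result's sign differs.
        overflow = bool(~(a ^ b) & (a ^ raw) & sign)
    elif op == "sub":
        raw = a - b
        # Overflow: operands differ in sign and the result's sign differs from a.
        overflow = bool((a ^ b) & (a ^ raw) & sign)
    elif op == "or":
        raw = a | b
    elif op == "xor":
        raw = a ^ b
    elif op == "and":
        raw = a & b
    else:
        raise ValueError(f"unsupported op: {op}")
    result = raw & mask
    zero = result == 0
    negative = bool(result & sign)
    flags = {"zero": zero, "<0": negative, ">0": not zero and not negative, "OF": overflow}
    return result, flags

print(alu("add", 0x7FFF, 1))
print(alu("sub", 5, 5))
```

The first call wraps from the largest positive value to the most negative one, so both the "<0" and "OF" conditions are raised; the second raises only "zero".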

Instruction Fields
?? + XN Bits per context
Field          Comment                                      Bits
ALU_CTL        Control of Basic ALU Functions               5
SHIFT_CTL      Control of the operand shift                 2
MUX_CTL        Control operand muxes                        3
BRANCH_ADR     Next address if condition true               2
COND_MSK       Condition mask                               5
COND_FLD       Condition field                              5
EXT_COND_SRC   Select source for external condition inputs  2
HEIR_ADDR      Hierarchical routing level address           2
L0_ADDR        Level 0 source address                       2
L1_ADDR        Level 1 source address                       2
L2_ADDR        Level 2 source address                       2
L3_ADDR        Level 3 source address                       2
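The field list can be exercised with a simple pack/unpack sketch. The names and widths are taken from the table; the bit ordering (first field in the most significant bits) and the helper names are assumptions, and the listed widths add up to 34 bits, which this sketch packs as-is without trying to reconcile it with any particular instruction width.

```python
FIELDS = [
    ("ALU_CTL", 5), ("SHIFT_CTL", 2), ("MUX_CTL", 3), ("BRANCH_ADR", 2),
    ("COND_MSK", 5), ("COND_FLD", 5), ("EXT_COND_SRC", 2), ("HEIR_ADDR", 2),
    ("L0_ADDR", 2), ("L1_ADDR", 2), ("L2_ADDR", 2), ("L3_ADDR", 2),
]
TOTAL_BITS = sum(width for _, width in FIELDS)

def pack(values):
    """Pack a {field: value} dict into one integer; missing fields are 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word):
    """Inverse of pack(): split an integer back into its fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out

word = pack({"ALU_CTL": 17, "COND_MSK": 21, "L3_ADDR": 3})
print(TOTAL_BITS, hex(word), unpack(word)["COND_MSK"])
```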

PU Instruction Types
Type               Opcode  Fields
Data Process       00      ALU_CTL, SFT_CTL, MUX_CTL, ROUTE_CTL
Move               01
Immediate Operand
Multiply           100
Attention          101     Options, Flag, Condition, Branch_Adr
Branch             110     Options, Link, Condition, Branch_Adr
Instruction width: 32 Bits
Other field labels shown: Operand_Value, OP_SEL, R/W, Options, OP_SEL
Condition Field (15 Bits): Invert, +ve, OF, -ve, zero, X1, X0 | Condition Mask | Ext’ Source Sel
ROUTE_CTL Field: Hadr, L0adr, L1adr, L2adr, L3adr

Condition Field
X[1:0] are external condition bits & may be sourced from:
– Operand bits
– Global synchronization bus
– Nearest neighbour condition outputs
Condition Mask is ANDed with flag bits
Condition Field (15 Bits): Invert, +ve, OF, -ve, zero, X1, X0 | Condition Mask | Ext’ Source Sel
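A sketch of how the mask and invert bit might combine into a single branch decision. The slide only says the mask is ANDed with the flag bits; OR-reducing the masked bits and then applying Invert is an assumption, as are the function name and the use of a set for the mask.

```python
FLAG_BITS = ("+ve", "OF", "-ve", "zero", "X1", "X0")

def condition_true(flags, mask, invert=False):
    """AND each flag with its mask bit, OR the results, then optionally invert.

    flags : dict of flag name -> bool (missing flags count as False)
    mask  : set of flag names whose mask bit is 1
    """
    hit = any(flags.get(name, False) for name in FLAG_BITS if name in mask)
    return hit != invert

# "Branch If zero": taken when the zero flag is set.
print(condition_true({"zero": True}, {"zero"}))
# "repeat until zero": keep looping while the inverted condition holds.
print(condition_true({"zero": False}, {"zero"}, invert=True))
```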

Static Program
PU never changes function
Branch is set to always true
Just two instructions
[Figure: program listing. Data Process; Branch: Always, Adr +1]

More Typical Program

Open Issues
PU data path width
Complexity of shift operations
RU trunking
Number of contexts per PU
Flexible context RAM partitioning
Improve PU synchronization

Shifter Instructions

Design Tools
PU Assembler
Architecture mapping
Global resource allocation

Conditional N Bit PU Cell
[Figure: PU cell data path. Blocks: ALU/MULT, DFF, Bit Shift, Carry Logic, Const bit, mux0, mux1, mux2, RAM, ColSel, Condition Logic, Address Logic. Signals: A, B, F, Cin, Cout, LSin, RSin, ALUCTL, SFTCTL, EXT[1:0], Branch, Input, Out, Port address]

Commercial Viability
5x performance improvement over conventional solutions (mix of cost & power)
Conceptually simple
Clearly defined target applications
Simple systems connections
Scalable
Support hardware & software standards

Conditional N Bit DPA Cell
[Figure: DPA cell data path surrounded by four Routing Matrix blocks. Blocks: ALU, DFF, Bit Shift, Carry Logic, Const bit, mux0, mux1, mux2, RAM, ColSel, Condition Logic, Address Logic. Signals: A, B, F, Cin, Cout, LSin, RSin, ALUCTL, SFTCTL, EXT[1:0], Branch]
4 Bit Cell: 180 Gates, 112 Bits RAM

N Bit Wide DPA
[Figure: three repeated stages. Blocks per stage: N bit wide FU, Condition Logic, Status Reg, FU Decode, M Plane RAM, Program Storage. Signals: A, B, C]

N Bit Wide PU Block
[Figure: PU block. Blocks: N bit wide ALU, Status Reg, Condition Logic, I Decode, Addr Logic, Inst RAM, N Bit wide Shift, Local RAM, Arbit (x2), State. Busses: HierBus, BusW, BusX, PipeBus (x2). Inputs: A, B]
Instruction Format: Status Msk | Source A | Source B | Shift Op | OP Code
NOTES/QUESTIONS
– Inst has no const, but has offsets
– Inst RAM can be small. 64 words?
– Note counter takes 3 instructions
– How much subroutine support? None?
– Simplified 16 bit or full 32 bit instructions
– 2 or 4 local area busses?
– Synchronization issue: Master states accessible, Cond mask use
– Option to break or combine N bit DP elements?
– Resource pool on busses? E.g. MULT?
– Approx. size of 32 bit FU 800u x 500u?
– If so a 16x8 processor array is possible
– I.e. 128 processors at 100MHz = 12800MIPS
– Turn off till global state instruction for power reduction
– Handling of interrupts (if at all)
– Handle global signal interrupts how?
– Multiple bit wide segmentation through masks? E.g. 2 counters in one PU?

Potential Configuration
128 32 Bit “Pico” Process Units
12800MIPS @ 100MHz
80mm2 in 0.35u CMOS
Concept of hierarchical hardware scope
Very fast streaming operations
Simple PU programming model
Applications:
– Video processing
– LAN Routing
– DSP
– Fast Prototyping
[Figure: chip block diagram. 16 x 8 PU ARRAY, MUX/DMA/FIFO, RAMBUS Interface, Controller, 256 Global Ram]
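The headline numbers follow from simple arithmetic, sketched here. The MIPS figure assumes one instruction per PU per clock; the per-PU area uses the tentative 800u x 500u estimate from the notes, which the deck itself marks with a question mark. Variable names are illustrative.

```python
rows, cols = 16, 8
pus = rows * cols                      # PUs in the array
clock_mhz = 100

# Assumes every PU retires one instruction per cycle.
mips = pus * clock_mhz

# Tentative per-PU footprint: 800u x 500u, converted to mm2.
fu_area_mm2 = 0.8 * 0.5
array_area_mm2 = pus * fu_area_mm2

die_area_mm2 = 80
print(pus, mips, array_area_mm2, die_area_mm2 - array_area_mm2)
```

If the per-PU estimate holds, the array itself takes roughly two thirds of the 80mm2 budget, leaving the remainder for the MUX/DMA/FIFO, RAMBUS interface, controller and global RAM blocks.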

PU Program Environment
Operands: BusW, BusX, Accumulator, HierBus, PipeBus, Local Ram. Use
PU typically runs a small program
– May be as little as two instructions
– 64 words of code maximum
Instruction types:
– Arithmetic, logical
– Data moving
– Interrupt
Function              Instructions
Arithmetic            1
Counter               1-2
Mux                   1
Multiply Accumulate   3
FIFO Stage            3
Multiport Register    1
Shift Register        2

Architecture Figures of Merit
Average density vs application specific cells
Speed of applications vs hardwired logic
Percentage reuse

Next Steps
VHDL modeling of architecture
Primitive assembler tools for PUs
Selection, coding and simulation of applications
Architecture tuning
Layout and verification of complete DPA

Design Tools
Tanner:
– Schematic entry, logic simulation, custom layout, layout verification
– Circuit simulation
– PC & Sun platforms
– MOSIS libraries
Mentor Graphics:
– VHDL compilation and simulation

Basic FU Routing
[Figure: grid of FUs with routing between them]