mlcs architecture new version3...
Mirai Ltd. & Meisei University 1
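A New Concept: the Memory-Logic Conjugated System and Related Topics
- A dynamically reconfigurable system built from memory only -
大塚寛治 (Kanji Otsuka)†, 佐藤陽一†, 河西純一‡
†Meisei University, Collaborative Research Center; ‡Mirai Ltd.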
How to Get High-Speed, Flexible, Robust, and Low-Power Processing
Note: strong in Chinese network (Huawei, ZTE, Cisco, ALU, IBM, Ericsson)
Simple picture of a von Neumann-type processor: memory and CPU joined by a bandwidth bottleneck
There are two major limitations:
• Bandwidth bottleneck
• Growing power consumption
Improvement concept: multi-core & time sharing (several CPUs sharing one memory)
• Still a bottleneck
• Needs huge power!
Further improvement: functional logic clusters in a memory sea (CPUs grouped around several memories)
• Needs the same power!
• Complicated software accompanies it
Intel: 71 cores
Qualcomm: 24 cores
AMD: Heterogeneous System Architecture
Our MLCS: Memory-Logic Conjugated System
Eliminating the limitations with a non-von Neumann processor: one big approach
Von Neumann: software download methodology
FPGA: hardware download methodology
FPGA (LUT-based logic blocks plus memory): escapes the bottleneck through distributed LUTs, but each LUT is too small to implement a function within a block or cell, so it needs switches and wires. Power: 1/10 of von Neumann's.
MLCS (LUT-based architecture): a functional cluster array composed purely of memories. Any cluster can be configured as either memory or functional logic, depending on needs. Dynamic reconfiguration. Power: <1/20.
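Why is the LUT architecture superior in performance and power? Comparison of a 4-bit adder built from half and full adders with a 4-bit LUT:
[Figure: 4-bit conventional logic (binary adder). A half adder (HA: inputs X, Y; outputs S, Cout) and full adders (FA: inputs X, Y, Cin; outputs S, Cout) are chained by carries over X0 to X3 and Y0 to Y3 to produce S0 to S3 and Cout.]
[Figure: 4-bit LUT base. X0 to X3 and Y0 to Y3 address the LUT, which outputs S0 to S3 and Cout directly.]
Legend: box = 4-bit register; X: X0~X3, Y: Y0~Y3, S: S0~S3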
Meisei University Confidential, 2015/11/18
High speed, low power
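The comparison above can be sketched in a few lines of Python. This is only an illustration, not the authors' circuit: the function names and the table layout (address = Y in the upper 4 bits, X in the lower 4 bits; data = Cout in bit 4, S in bits 0 to 3) are assumptions. With no carry-in, the 8 address bits and 5 result bits fit a single 256-word x 8-bit table.

```python
# Sketch: a 4-bit adder as a carry chain versus a single table lookup.
# Names and table layout are illustrative assumptions, not from the slides.

def half_adder(x, y):
    """1-bit half adder: returns (S, Cout)."""
    return x ^ y, x & y

def full_adder(x, y, cin):
    """1-bit full adder: returns (S, Cout)."""
    s = x ^ y ^ cin
    cout = (x & y) | (cin & (x ^ y))
    return s, cout

def ripple_carry_add4(x, y):
    """Conventional logic: one HA for bit 0, then three chained FAs."""
    s, carry = half_adder(x & 1, y & 1)
    total = s
    for i in range(1, 4):
        # Each stage must wait for the carry of the previous stage.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry  # (S[3:0], Cout)

# LUT base: 256 words, index = (y << 4) | x, word = (Cout << 4) | S.
ADD4_LUT = [x + y for y in range(16) for x in range(16)]

def lut_add4(x, y):
    """LUT logic: one memory read replaces the whole carry chain."""
    word = ADD4_LUT[(y << 4) | x]
    return word & 0xF, word >> 4  # (S[3:0], Cout)
```

Both functions return the same (S, Cout) pair for every input; the difference is that the ripple version passes through four dependent stages while the LUT version is a single read.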
8-bit multiplier on wired logic/FPGA and on MLCS
Wired logic/FPGA: 12 RCA steps, which limit the processing speed; 1358 gates (x 1.25 um2 = 1.70 kum2)
Our MLCS: latency = 2; 256 W = 4096-bit memory (x 0.5 um2 = 2.05 kum2)
Process: TSMC 65 nm
LUT-based wired logic and FPGAs are still complicated.
Then ---6-14
LUT architecture of Xilinx Virtex-5
Because an FPGA provides LUT unit cells that are too small, the switch blocks need many selectors.
[Figure: FPGA fabric of logic blocks, connecting blocks, and switching blocks (switch state 0: off, 1: on), arranged as LUT-based logic stripes and memory stripes with I/O, forming the FPGA logic block network.]
[Figure: unit cell. Inputs B1 to B6 feed a 6-LUT built from two 5-LUTs and a MUX, with CIN, BX, an FF, and outputs B, BQ, BMUX, and COUT.]
A function built from unit cells needs logic-memory communication with the memory stripe.
When the job capacity increases or the protocol changes: reconfiguration, but not dynamic!
An improvement has been made (Xilinx) that groups 8 unit cells into 1 slice to achieve partially dynamic reconfiguration (clock operation is temporarily halted).
LUT: 6 inputs, 2 outputs + FF
LUT blocks that are too small require a huge amount of wiring in an FPGA!
One example from Xilinx uses a Si interposer.
Basic structure of an FPGA
Processing speed: 11.5 times
Power: 1/5
Performance: 38 times
One benchmark study by OpenCL
http://www.electronicproducts.com/Digital_ICs/Standard_and_Programmable_Logic/Compiling_OpenCL_to_FPGAs.aspx
To be efficient, an LUT cluster must be built for each fundamental function.
[Figure: processing efficiency for a function (vertical axis) versus LUT size (horizontal axis, marked 2b and 128b), with regions labeled FPGA, MLCS, and Large LUT.]
Dynamic reconfiguration algorithm using a unified function array
Efficient communication between neighboring clusters, with high bandwidth and a high processing rate
[Figure: a mat formed by a cluster array, shown in three situations.]
• Logic with cache surrounding it: the cache grows and shrinks depending on the cache hit ratio.
• When the job capacity increases: the logic expands, and cache is added around the newly generated logic.
• Multi-task with shared cache.
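The idea that the cache around a logic cluster grows and shrinks with the cache hit ratio can be sketched as a simple policy. Everything here is a hypothetical illustration: the slides give no thresholds or algorithm details, so the `low` and `high` thresholds, the mode names, and the one-cluster-per-step rule are assumptions, and the neighbor placement shown in the figure is ignored.

```python
# Sketch of a hit-ratio-driven reconfiguration policy.
# Thresholds, mode names, and the step size are assumptions for illustration.

def rebalance(modes, hit_ratio, low=0.80, high=0.95):
    """Return a new list of cluster modes ('logic', 'cache', or 'free').

    - hit ratio below `low`: turn one free cluster into cache
    - hit ratio above `high`: release the most recently added cache cluster
    - otherwise: leave the allocation unchanged
    """
    modes = list(modes)  # do not modify the caller's list
    if hit_ratio < low and "free" in modes:
        modes[modes.index("free")] = "cache"
    elif hit_ratio > high and "cache" in modes:
        last_cache = len(modes) - 1 - modes[::-1].index("cache")
        modes[last_cache] = "free"
    return modes

mat = ["logic", "cache", "free", "free"]
mat = rebalance(mat, hit_ratio=0.50)   # poor hit ratio: cache grows
mat = rebalance(mat, hit_ratio=0.99)   # very good hit ratio: cache shrinks
```

Because every cluster has the same memory structure, changing a cluster's role in this model is only a change of label; no data path is rebuilt.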
The unified structure of the basic cluster
Simple operations can be programmed by using the rich set of internal registers.
Bus wiring can be routed over the memory area (about 70%), which saves area.
[Figure: basic cluster block diagram. SRAM (LUT), 256 W x 8 bit, with R/W, CK, CE, DIN, D, and ADD (write) signals; channel set register; mode set register; input control circuit (mode change control & channel control); output control circuit (register, switch, etc. control) with 4-bit REG x 8; control bus (CY etc.); sub control bus (8 bit); internal buses of 4 bit x 4 and 4 bit x 2; control signals of 1 bit each.]
Buses: address bus, write command bus, reconfiguration bus, data bus
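A minimal behavioral model may help to see how one 256 W x 8 bit SRAM serves as both memory and logic. This is a sketch under assumptions: the class name, the method names, and the way two 4-bit operands form the 8-bit address are illustrative and not taken from the slides; registers, buses, and control circuits are left out.

```python
# Behavioral sketch of a basic cluster: one 256-word x 8-bit SRAM
# that works either as plain memory or as a look-up table (logic).
# Class and method names are hypothetical.

class BasicCluster:
    WORDS = 256

    def __init__(self):
        self.sram = [0] * self.WORDS
        self.mode = "memory"

    def write(self, addr, data):
        """Memory access: store one 8-bit word."""
        self.sram[addr & 0xFF] = data & 0xFF

    def read(self, addr):
        """Memory access: load one 8-bit word."""
        return self.sram[addr & 0xFF]

    def load_function(self, fn):
        """Reconfigure as logic: fill the SRAM with the truth table of
        fn(a, b) for 4-bit a and b (result truncated to 8 bits)."""
        for addr in range(self.WORDS):
            self.write(addr, fn(addr >> 4, addr & 0xF))
        self.mode = "logic"

    def evaluate(self, a, b):
        """Logic access: the operands form the address, the word is the result."""
        if self.mode != "logic":
            raise RuntimeError("cluster is not in logic mode")
        return self.read(((a & 0xF) << 4) | (b & 0xF))

cluster = BasicCluster()
cluster.load_function(lambda a, b: a * b)   # becomes a 4x4-bit multiplier
product = cluster.evaluate(15, 15)
cluster.load_function(lambda a, b: a ^ b)   # same hardware, now a 4-bit XOR
```

In this model, switching function means rewriting 256 words; the structure of the cluster itself never changes.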
Outlook of the MLCS structure using the basic block
Clusters are allocated to match the required performance and memory size.
[Figure: a basic cluster array of m rows x n columns with CX and CY decoders and a control circuit + bus I/F, connected to other cluster arrays through a multiple bus carrying addresses, clk + control signals, and data (8 bit x n).]
Address space of the cluster memory: an 8-bit memory address of the basic block plus a q-bit extension address
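The address split between the 8-bit basic-block address and the q-bit extension address can be sketched as follows. The bit order (extension address in the upper bits) and the row-major mapping of the extension address onto the m x n array are assumptions made for illustration; the slides do not specify them.

```python
# Sketch: decoding a cluster-memory address into (extension, word) parts.
# Bit order and row-major cluster numbering are assumptions.

def split_address(addr, q):
    """Split an (8 + q)-bit address into (extension address, 8-bit word address)."""
    if not 0 <= addr < (1 << (8 + q)):
        raise ValueError("address does not fit in 8 + q bits")
    return addr >> 8, addr & 0xFF

def cluster_coords(extension, n_columns):
    """Map an extension address to (row, column) in the cluster array."""
    return divmod(extension, n_columns)

ext, word = split_address(0x3A7, q=2)    # ext = 3, word = 0xA7
row, col = cluster_coords(5, n_columns=4)  # row = 1, col = 1
```

With q extension bits, up to 2^q basic blocks of 256 words each share one linear address space.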
Operation mode
Through Access mode (= initial mode)
System mode
Arithmetic operation mode
Combinational Circuit mode
Internal memory mode
External memory mode
S/R=“L”(reset mode)
S/R=“H”
Memory mode
Logic mode
External memory mode
Logic library mode (Macro-cell)
Operation mode of basic block (Memory-logic conjugate cell)
Route Configuration Register Mode (making LUT)
Information Update mode for Route Configuration Register
Route Configuration mode by Mode Register
Route Configuration Register mode (making LUT for dynamic reconfiguration)
The rich set of operation modes allows flexible and variable systems to be constructed.
For dynamic reconfiguration
Memory space of LSI and memory space of MLCS
The memory space is adjustable for the dynamic reconfiguration function.
[Figure: the MLCS memory space divided into cluster memory 1, 2, 3, ..., n. Each cluster memory is made of 256-word basic cells, each in either memory mode or logic mode. A channel set register and a bus switch connect to other cluster spaces.]
Cluster memory layout example for a single 8-bit CPU
● Area is about 330 x 330 um2 at a 90 nm process (one cluster)
[Figure: basic cluster array addressed by X and Y (00, 01, 10, 11) through decoders, containing: program memory (512 w x 8 b), logical judgment circuit, instruction decoder, shifter (8 bit), PC adder & 8-bit ALU (one shared resource), decoder control, and a reserve part.]
(Note)
(1) Program counter: 16 bit
. 2-cycle operation in case of overflow in the address operation
. 1-cycle operation without overflow (by using the 8-bit ALU)
(2) Structure of the 8-bit ALU
. To enable 2-cycle 16-bit addition, a new type of adder with carry code input is introduced (which uses 4 basic cells).
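The 1-cycle/2-cycle program counter update described in the note can be sketched like this. It is a simplified model under assumptions: the function names are hypothetical, the offset is treated as an unsigned 8-bit value, and the carry code input of the actual adder is reduced to a plain carry bit.

```python
# Sketch: updating a 16-bit program counter with a shared 8-bit ALU.
# One pass handles the low byte; a second pass is needed only when the
# low byte overflows. Names and the unsigned offset are assumptions.

def alu8_add(a, b, carry_in=0):
    """8-bit addition: returns (8-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8

def pc_add(pc, offset=1):
    """Return (new 16-bit PC, number of ALU cycles used)."""
    low, carry = alu8_add(pc & 0xFF, offset & 0xFF)
    high = (pc >> 8) & 0xFF
    cycles = 1
    if carry:
        # Overflow of the low byte: second cycle propagates the carry.
        high, _ = alu8_add(high, 0, carry)
        cycles = 2
    return (high << 8) | low, cycles

print(pc_add(0x0010))  # (17, 1): no overflow, one cycle
print(pc_add(0x00FF))  # (256, 2): low byte overflows, two cycles
```

For sequential code the low byte overflows only once in every 256 increments, so the 2-cycle case is the exception.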
Actual design of a configuration with four basic clusters
[Figure: chip layout showing the four basic blocks, the area for TSVs, and the memory (SRAM) for testing.]
256 W x 8 bit x 4 cells
Unfortunately, fabrication was abandoned because of our budget.
The Outlook of the Memory - Logic Conjugated System
1. The bandwidth and power consumption problems can be solved by LUTs with a functional block architecture and neighboring allocation.
2. Functional blocks can be changed dynamically within a few clocks.
3. The resulting performance is high-speed, flexible, robust, and low-power.
4. It is suitable for 3D-TSV assembly design and scales from small to large configurations.
[Figure: evolution of high-performance processing systems.]
Present-day high-performance processing system: many-core CPU, cache, I/O, off-chip cache, main memory, other connections
Near-future high-performance processing system: CPU and GPU with cache and I/O, high-speed FPGA, NAND with cache, NVM, other connections
Final-destination processing system: MLCS with an FPGA-like structure (LUT-based logic and cache, sized according to needs) with I/O and other connections, for processing many applications
Performance comparison between pure logic and MLCS

Area consumption on the same logic with different peripheral circuits
[Figure: area ratio of pure logic, MLCS, and FPGA. Legend: constant size with some allowance in the design; dynamic size with minimum design.]

Power consumption on the same logic with one thread
Power            Pure logic    MLCS     FPGA
Relative ratio   1             <0.05    0.1

Operation speed of processor mode
Band frequency   Pure logic**   MLCS/FPGA (8 bit)              MLCS/FPGA (32 bit)
                 (8/32 bit)     Non-parallel  Four parallel*   Non-parallel  Four parallel*
Maximum          4 GHz          1 GHz         4 GHz            1 GHz         4 GHz
Mean rate        ?              (1 GHz)       (3 GHz)          (1 GHz)       (4 GHz)
Note: * In the case of 50% independence among the four threads
** One thread in pure logic, which is faster than the SRAM-based MLCS
Pure logic would give the fastest processing; however, MLCS can operate in dynamic reconfiguration mode and eliminates the bandwidth bottleneck.
[Figure: four multi-thread processing. Program commands + data of threads α, β, γ (α+1, β+2, γ+3) are rearranged.]
Implemented in a regular FPGA
• The LUT-based strips and memory strips in the FPGA are used for emulation.
[Figure: FPGA chip whose LUT-based logic strips and memory strips emulate the MLCS basic clusters and the cluster decoder.]
An emulated verification of MLCS performance on an FPGA is as follows:
• Functions implemented to verify the dynamic reconfiguration algorithm:
  function 1: adder
  function 2: shifter
  function 3: RAM
  function 4: T.B.D. (multiplier etc.)
• In this FPGA emulation, connections between functional clusters are realized by F/Fs.
[Figure: our final algorithm with neighbor wiring, compared with the F/F connection on the FPGA.]
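An example demonstration: perceptron learning model on a neuro-computer
• The perceptron learning model of a neuro-computer is the proposed subject.
• It can be realized with a relatively simple configuration (addition, multiplication, RAM, LUT, sequencer).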
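For reference, the classic perceptron learning rule needs only the operations listed above (multiply, add, and a stored weight table). The sketch below is a generic textbook version in integer arithmetic, not the authors' demonstration design; the function names, the AND-gate training data, and the learning rate are assumptions.

```python
# Sketch: single-layer perceptron learning with integer multiply-add only.
# Generic textbook rule; data set and parameters are illustrative assumptions.

def predict(weights, bias, x):
    """Step activation on the weighted sum (a product-sum operation)."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, rate=1):
    """Perceptron rule: w += rate * (target - output) * x."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            delta = target - predict(weights, bias, x)
            if delta:
                errors += 1
                weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
                bias += rate * delta
        if errors == 0:  # every sample classified correctly
            break
    return weights, bias

# Two-input AND function as a small, linearly separable training set.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
```

The weights would sit in RAM, the multiply-add in LUT-based clusters, and the training loop in the sequencer, which matches the list of building blocks named above.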
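Product-sum operation circuit by MLCS
[Figure: an 8-bit multiplier with an 8-bit X input (X0, X1) and an 8-bit Y input (Y0, Y1) forms the partial products P00a/P00b, P01a/P01b, P10a/P10b, and P11a/P11b, which are summed into S0 to S3. A 16-bit adder with a 16-bit register (four 4-bit registers RG, with Reset) accumulates the 16-bit output PS (PS0 to PS3) with carry output C3.]
RG: 4-bit register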
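The structure in the figure can be sketched as follows: an 8-bit multiplication is split into four 4-bit x 4-bit partial products, each of which fits one 256-word x 8-bit look-up table, and the result is accumulated in a 16-bit register. The mapping of the names P00 to P11 onto nibble pairs, the class name, and the method names are assumptions made for illustration.

```python
# Sketch: 8-bit multiply from four 4x4-bit LUT reads, plus a 16-bit
# product-sum accumulator. The P-index convention is an assumption.

# 256 words x 8 bits: index = (a << 4) | b, word = a * b (at most 225).
MUL4_LUT = [a * b for a in range(16) for b in range(16)]

def mul8(x, y):
    """8-bit x 8-bit -> 16-bit product via four partial products."""
    x0, x1 = x & 0xF, x >> 4   # low and high nibble of X
    y0, y1 = y & 0xF, y >> 4   # low and high nibble of Y
    p00 = MUL4_LUT[(x0 << 4) | y0]
    p01 = MUL4_LUT[(x0 << 4) | y1]
    p10 = MUL4_LUT[(x1 << 4) | y0]
    p11 = MUL4_LUT[(x1 << 4) | y1]
    # Weight each partial product by the position of its nibbles.
    return p00 + ((p01 + p10) << 4) + (p11 << 8)

class ProductSum:
    """16-bit accumulator PS with a carry output."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.ps = 0
        self.carry = 0

    def accumulate(self, x, y):
        total = self.ps + mul8(x, y)
        self.ps = total & 0xFFFF
        self.carry = total >> 16
        return self.ps

mac = ProductSum()
mac.accumulate(3, 4)
mac.accumulate(10, 20)   # mac.ps is now 3*4 + 10*20 = 212
```

The four table reads are independent of each other, so in hardware they can proceed in parallel and only the final additions are sequential.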
Major function of picture processing
High-speed processing with our Axonerve™ IP and MLCS
○ CPU execution steps with a 4-stage pipeline
[Figure: the instruction is split into opcode and operand. Axonerve decodes the opcode, with quick access by using the search engine of Axonerve™, while the operand is fetched into a register (timing adjust register). The SRAM (MLCS) provides command access (CPU command), the execution unit (MLCS) processes the data, and the execution output goes to a register.]
①~④: execution steps
Career: Dr. Kanji Otsuka, IEEE Fellow
1959 – 1973: Design and development of semiconductors, LSIs, and modules at Hitachi Ltd.
1970 – 1993: Design and development of mainframe computers at Hitachi Ltd.
1993 – 2004: Professor in the Faculty of Information Science, Meisei University, including two years as Director of the master's course and 4 years as Dean.
2004 – present: Emeritus professor and Executive Researcher, including 4 years as Invited Professor at Osaka University and 1 year as Guest Lecturer at the University of Tokyo.
[Figure: M680 one-board computer.]
A large shared cache placed at the center let the many gate-array CPUs communicate easily with each other over the shortest wiring. Its performance was preeminent compared with IBM's at the time.
This was one of the successful designs implementing my idea, in 1984.