mlcs architecture new version3...

Mirai Ltd. & Meisei University

A New-Concept Memory-Logic Conjugated System and Related Topics

A dynamically reconfigurable system built only from memory

大塚寛治†、佐藤陽一†、河西純一‡

†Meisei University, Collaborative Research Center; ‡Mirai Ltd.


How to Achieve High-Speed, Flexible, Robust, and Low-Power Processing

Note: strong in Chinese network (Huawei, ZTE, Cisco, ALU, IBM, Ericsson)

Simple picture of a von Neumann-type processor

[Figure: one CPU connected to one memory.]

There are two major limitations:
• Bandwidth bottleneck
• Growing power consumption

Improvement concept: multi-core and time sharing

[Figure: four CPUs sharing one memory.]

Still a bottleneck. Needs huge power!

Further improvement: functional logic clusters in a memory sea

[Figure: several groups of CPUs, each group attached to its own memory.]

Needs the same power! Complicated software comes with it.

Intel: 71 cores
Qualcomm: 24 cores
AMD: Heterogeneous System Architecture


Elimination of the limitations by a non-von Neumann type processor: one big approach

Von Neumann: software download methodology
FPGA: hardware download methodology

FPGA (LUT-based architecture: LUT-based logic blocks plus memory)
Escapes the bottleneck by using distributed LUTs, but each LUT is too small to implement a function within one block or cell, so it needs switches and wires.
Power: 1/10 of von Neumann's.

Our MLCS (Memory Logic Conjugated System)
The functional cluster array is composed of pure memories. Any cluster can be configured as either memory or functional logic depending on needs (dynamic reconfiguration).
Power: <1/20 of von Neumann's.

Why is the LUT architecture superior in performance and power? Features of a 4-bit adder built from half and full adders versus a LUT:

[Figure: a 4-bit conventional logic binary adder built from a half adder (HA) and full adders (FA), with inputs X0-X3, Y0-Y3, and Cin and outputs S0-S3 and Cout, shown next to a 4-bit LUT-based adder with inputs X (X0-X3) and Y (Y0-Y3), address inputs A0 and A1, and outputs S (S0-S3) and Cout. Legend: 4-bit register.]
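To make the comparison concrete, here is a minimal Python sketch of the two styles of 4-bit adder. It is an illustration only, not the authors' circuit: the table layout (an 8-bit address formed as `(x << 4) | y`, with the carry-out stored in bit 4 of each word) and all function names are assumptions made for this example.

```python
"""4-bit addition two ways: a ripple-carry chain and a single table lookup."""


def full_adder(x, y, cin):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = x ^ y ^ cin
    cout = (x & y) | (cin & (x ^ y))
    return s, cout


def ripple_carry_add4(x, y, cin=0):
    """Add two 4-bit values bit by bit; the carry passes through 4 stages."""
    result = 0
    carry = cin
    for i in range(4):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


# Hypothetical LUT layout: the address is (x << 4) | y, and each word holds
# the 5-bit result (carry-out in bit 4, sum in bits 3..0).
ADD4_LUT = [(a >> 4) + (a & 0xF) for a in range(256)]


def lut_add4(x, y):
    """Add two 4-bit values with one memory read: returns (sum, carry_out)."""
    word = ADD4_LUT[(x << 4) | y]
    return word & 0xF, word >> 4


if __name__ == "__main__":
    # Both implementations agree on every input pair.
    assert all(
        lut_add4(x, y) == ripple_carry_add4(x, y)
        for x in range(16)
        for y in range(16)
    )
    print(lut_add4(9, 8))  # (1, 1), since 9 + 8 = 17 = 0b1_0001
```

The ripple-carry version needs the carry to pass through four stages in sequence, while the table version produces the sum and carry in one read, at the cost of storing all 256 results.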

Meisei University Confidential, 2015/11/18

High speed Low power


8-bit multiplier calculation on wired logic/FPGA and MLCS

RCA steps = 12, which restrict the processing speed
Our MLCS: latency = 2

1358 gates (x 1.25 µm² = 1.70 k µm²)
256 W = 4096-bit memory (x 0.5 µm² = 2.05 k µm²)

By 65 nm TSMC
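The area figures above can be checked with a few lines of Python. This sketch only reproduces the arithmetic on the slide; the variable and function names are made up for the example, and the 16-bit word width is simply what 4096 bits divided by 256 words gives.

```python
GATE_AREA_UM2 = 1.25  # area per gate, as quoted on the slide
BIT_AREA_UM2 = 0.5    # area per memory bit, as quoted on the slide


def area_kum2(count, unit_area_um2):
    """Total area in thousands of square micrometres."""
    return count * unit_area_um2 / 1000.0


logic_area = area_kum2(1358, GATE_AREA_UM2)   # 1358 gates
memory_area = area_kum2(4096, BIT_AREA_UM2)   # 4096 memory bits
bits_per_word = 4096 // 256                   # 256 words of memory

if __name__ == "__main__":
    print(round(logic_area, 2), round(memory_area, 2), bits_per_word)
```

The two areas come out at about 1.70 and 2.05 k µm², matching the slide, so the memory version is slightly larger in area.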

LUT-based wired logic and FPGAs are still complicated.

Then ---6-14


Basic structure of an FPGA

[Figure: FPGA logic block network. Logic blocks and I/O are joined by connecting blocks and switching blocks; each switch is set by an FF (0: off, 1: on). LUT-based logic stripes alternate with memory stripes. Labels: unit cell, a function.]

[Figure: LUT architecture of Xilinx Virtex-5. Labels: inputs B1-B6, BX, and CIN; 6-LUT, two 5-LUTs, MUX, FF; outputs B, BMUX, BQ, and COUT. LUT: 6 inputs, 2 outputs, plus FF.]

Because an FPGA provides LUT unit cells that are too small, the switch blocks need a lot of selectors.

LUT blocks that are too small require a huge amount of wiring in an FPGA! One example is Xilinx using a Si interposer.

The LUT-based logic stripes and memory stripes need logic-memory communication.

When the job capacity increases or the protocol changes: reconfiguration, but not dynamic! Xilinx has made an improvement that groups 8 unit cells into 1 slice to realize partially dynamic reconfiguration (the clock operation is temporarily halted).

One benchmark study using OpenCL:
Processing speed: 11.5 times
Power: 1/5
Performance: 38 times

http://www.electronicproducts.com/Digital_ICs/Standard_and_Programmable_Logic/Compiling_OpenCL_to_FPGAs.aspx


A LUT cluster must be built to fit a fundamental function to be efficient.

[Figure: processing efficiency for a function (vertical axis) versus LUT size (horizontal axis, marked 2b and 128b), with regions labeled FPGA, MLCS, and large LUT.]


Dynamic reconfiguration algorithm using a unified function array

Efficient communication between neighboring clusters with high bandwidth and a high processing rate.

[Figure: a mat formed by a cluster array, shown in numbered stages.]
• Logic with cache surrounding it. The cache grows and shrinks depending on the cache hit ratio.
• When the job capacity increases, the logic expands, and cache is added around the newly generated logic so that cache still surrounds the logic.
• Multiple tasks run with a shared cache.


The unified structure of the basic cluster

Simple operations can be programmed by using the rich internal registers. Bus wiring can be routed over the memory area (about 70%), which saves area.

[Figure: block diagram of the basic cluster. An SRAM (LUT) of 256 W x 8 bit with R/W, CK, CE, DIN, D, ADD, and ADD (Write) signals sits between an input control circuit (mode change control and channel control) and an output control circuit (control of registers, switches, etc.; 4-bit REG x 8). Also shown: the channel set register, the mode set register, the control bus (CY etc.), the sub control bus (8 bit), and port groups of 4 bit x 4 and 4 bit x 2. Buses: address bus, write command bus, reconfiguration bus, data bus. Control signals are 1 bit each.]


Outlook of the MLCS structure using the basic block

Cluster allocation is matched to the required performance and memory size.

[Figure: a basic cluster array of m rows by n columns of basic clusters, with decoders (CX, CY), a control circuit plus bus I/F, and a multiple bus carrying addresses, clock plus control signals, and data (8 bit x n). The address space of the cluster memory consists of the 8-bit memory address of the basic block plus a q-bit extension address. Other cluster arrays attach to the same bus.]
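The address split shown in the figure (an 8-bit address inside a basic block plus a q-bit extension address) can be sketched in Python as follows. This is a guess at one possible arrangement, not the actual decoder design: placing the extension in the upper bits, the row-major mapping to rows and columns, and the function names are all assumptions for illustration.

```python
LOCAL_BITS = 8  # each basic block holds 256 words, as in the slides


def split_address(address, q):
    """Split a (q + 8)-bit address into (extension, local) parts.

    Assumption: the q-bit extension address occupies the upper bits.
    """
    if not 0 <= address < (1 << (q + LOCAL_BITS)):
        raise ValueError("address out of range")
    return address >> LOCAL_BITS, address & ((1 << LOCAL_BITS) - 1)


def cluster_position(extension, n_columns):
    """Map an extension address to (row, column), assuming row-major order."""
    return divmod(extension, n_columns)


if __name__ == "__main__":
    ext, local = split_address(0x3A7, q=4)
    print(ext, hex(local))            # 3 0xa7
    print(cluster_position(ext, 4))   # (0, 3)
```

With q extension bits, up to 2^q basic blocks of 256 words each can share one flat address space.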


Operation mode of the basic block (memory-logic conjugate cell)

[Figure: tree of operation modes, selected by the S/R signal and the mode register.]
S/R = "L" (reset mode)
S/R = "H"
• Through Access mode (= initial mode)
• System mode
• Memory mode: internal memory mode, external memory mode
• Logic mode: arithmetic operation mode, combinational circuit mode, logic library mode (macro-cell)
• Route Configuration mode by Mode Register
• Route Configuration Register mode (making LUT for dynamic reconfiguration)
• Information Update mode for Route Configuration Register

Rich operation modes can construct flexible and variable systems.


Memory space of an LSI and memory space of MLCS

Memory space is adjustable for the dynamic reconfiguration function.

[Figure: the MLCS memory space is divided into cluster memory 1, cluster memory 2, cluster memory 3, ..., cluster memory n. Each cluster memory is made of 256 w basic cells, each marked as memory mode or logic mode. Channel set registers and bus switches lead to other cluster spaces.]
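The idea that one 256-word cell can serve either as plain memory or as logic can be sketched in Python. This is a toy model under stated assumptions, not the real cell: the class name, the method names, and the choice of two 4-bit operands packed into the 8-bit address are all hypothetical, and the model says nothing about how many clocks a real reconfiguration takes.

```python
class BasicCell:
    """Toy model of one 256-word x 8-bit cell (hypothetical interface)."""

    WORDS = 256

    def __init__(self):
        self.sram = [0] * self.WORDS
        self.mode = "memory"

    def write(self, address, value):
        """Store one 8-bit word."""
        self.sram[address] = value & 0xFF

    def read(self, address):
        """Fetch one 8-bit word."""
        return self.sram[address]

    def configure_logic(self, func):
        """Fill the SRAM with a truth table so the cell acts as logic.

        Assumption: the address is two 4-bit operands, (x << 4) | y.
        """
        for address in range(self.WORDS):
            self.write(address, func(address >> 4, address & 0xF))
        self.mode = "logic"

    def evaluate(self, x, y):
        """In logic mode, one read returns the function value."""
        if self.mode != "logic":
            raise RuntimeError("cell is in memory mode")
        return self.read((x << 4) | y)

    def release_to_memory(self):
        """Hand the cell back for use as ordinary storage."""
        self.mode = "memory"


if __name__ == "__main__":
    cell = BasicCell()
    cell.configure_logic(lambda x, y: x * y)   # act as a 4x4 multiplier
    print(cell.evaluate(7, 9))                 # 63
    cell.configure_logic(lambda x, y: x ^ y)   # same cell, new function
    print(cell.evaluate(7, 9))                 # 14
    cell.release_to_memory()
    cell.write(0x10, 0xAB)
    print(hex(cell.read(0x10)))                # 0xab
```

The same storage array is reused throughout; only its contents and the way the address is interpreted change between modes.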


Cluster memory layout example for a single 8-bit CPU

● Area is about 330 x 330 µm² in a 90 nm process (one cluster)

[Figure: a basic cluster array addressed by X and Y (00, 01, 10, 11), made of basic clusters that hold the program memory (512 w x 8 b), logical judgment circuit, instruction decoder, shifter (8 bit), decoder (decoder control), PC adder and 8-bit ALU (one shared resource), and a reserve part.]

(Note)
(1) Program counter: 16 bit
. 1-cycle operation without overflow (by using the 8-bit ALU)
. 2-cycle operation in case of overflow in the address operation
(2) Structure of the 8-bit ALU
. To enable 2-cycle 16-bit addition, a new type of adder with carry code input is introduced (which uses 4 basic cells).
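The note on the 16-bit program counter can be illustrated with a short Python sketch: the low byte is added first, and a second pass through the same 8-bit adder is needed only when that addition overflows. This is a simplified reading of the note; the slide's "carry code input" adder is not modeled, and the function names are assumptions.

```python
def add8(a, b, carry_in=0):
    """8-bit adder: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def pc_add(pc, offset):
    """Add an 8-bit offset to a 16-bit PC using only the 8-bit adder.

    Returns (new_pc, cycles). A second cycle is spent only when the
    low-byte addition overflows and the high byte must be updated.
    """
    low, carry = add8(pc & 0xFF, offset & 0xFF)
    high = pc >> 8
    cycles = 1
    if carry:
        high, _ = add8(high, 0, carry)  # second pass through the same adder
        cycles = 2
    return (high << 8) | low, cycles


if __name__ == "__main__":
    print(pc_add(0x12FE, 1))  # (4863, 1): 0x12FF, no overflow
    print(pc_add(0x12FF, 1))  # (4864, 2): 0x1300, low byte overflowed
```

Most increments stay within the low byte, so the shared 8-bit resource handles the common case in a single cycle.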


Actual design of a four-basic-cluster configuration

[Figure: chip layout showing the four basic blocks (256 W x 8 bit x 4 cells), the area for TSVs, and the memory (SRAM) for testing.]

Unfortunately, production was abandoned because of our budget.


The Outlook of the Memory - Logic Conjugated System

1. The problems of bandwidth and power consumption can be solved by LUTs with a functional block architecture and neighboring allocation.
2. Functional blocks can be changed dynamically within a few clocks.
3. The resulting performance is high speed, flexible, robust, and low power.
4. It is suitable for 3D-TSV assembly design and scales from small to large configurations.

[Figure: three processing systems, each with other connections.
Present-day high-performance processing system: many-core CPU, cache, I/O, off-chip cache, main memory.
Near-future high-performance processing system: CPU, GPU, cache, I/O, high-speed FPGA, NAND, NVM.
Final-destination processing system, for many-application processing: FPGAs with I/O and NAND with cache, labeled MLCS.
FPGA structure: LUT-based logic and cache, with size depending on needs.]


Performance comparison between pure logic and MLCS

Area consumption for the same logic with different peripheral circuits

Area     Pure logic   MLCS   FPGA
Ratio    (values not recoverable)
Legend: constant size with some design allowance; dynamic size with minimum design.

Power consumption for the same logic with one thread

Power            Pure logic   MLCS     FPGA
Relative ratio   1            <0.05    0.1

Operation speed of processor mode

Band frequency   Pure logic**   MLCS/FPGA (8 bit)               MLCS/FPGA (32 bit)
                 (8/32 bit)     Non-parallel   Four parallel*   Non-parallel   Four parallel*
Maximum          4 GHz          1 GHz          4 GHz            1 GHz          4 GHz
Mean rate        ?              (1 GHz)        (3 GHz)          (1 GHz)        (4 GHz)

Note:
* In the case of 50% independence between the four threads
** One thread in pure logic, which is superior to the SRAM-based MLCS

[Figure: four multi-thread processing. Program commands plus data for threads α, β, and γ (shown as α+1, β+2, γ+3) are rearranged.]

Pure logic would be the fastest, but MLCS can operate in dynamic reconfiguration mode and eliminates the bandwidth bottleneck.


An emulated verification of MLCS performance on an FPGA is as follows:

Implemented in a regular FPGA
• The LUT-based strips and memory strips in the FPGA are used for emulation.

[Figure: FPGA chip with LUT-based logic strips and memory strips; MLCS basic clusters and a cluster decoder are mapped onto it.]

• Functions implemented to verify the dynamic reconfiguration algorithm:
function 1: adder
function 2: shifter
function 3: RAM
function 4: T.B.D. (multiplier etc.)

• Connections between functional clusters are realized by F/Fs in this FPGA emulation.

[Figure: our final algorithm with neighbor wiring, compared with the F/F connection used on the FPGA.]


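An example of demonstration: perceptron learning model on a neurocomputer

• The perceptron learning model of a neurocomputer is proposed as the subject.

• It can be realized with a relatively simple configuration (addition, multiplication, RAM, LUT, sequencer).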
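For readers unfamiliar with the perceptron learning model, a minimal Python sketch follows. It shows only the generic textbook learning rule, which needs nothing beyond multiplication, addition, stored weights, and a loop; it is not the demonstration design itself, and the training data (a two-input AND function), the integer weights, and the function names are assumptions chosen for the example.

```python
def predict(weights, bias, inputs):
    """Multiply-accumulate followed by a threshold at zero."""
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x
    return 1 if acc > 0 else 0


def train(samples, epochs=20, rate=1):
    """Classic perceptron rule with integer weights.

    samples is a list of (inputs, target) pairs with target 0 or 1.
    Training stops early once an epoch produces no errors.
    """
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            delta = target - predict(weights, bias, inputs)
            if delta:
                errors += 1
                weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
                bias += rate * delta
        if errors == 0:
            break
    return weights, bias


if __name__ == "__main__":
    and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    weights, bias = train(and_gate)
    print([predict(weights, bias, x) for x, _ in and_gate])  # [0, 0, 0, 1]
```

Each step is an addition, a multiplication, a memory update, or a sequencing decision, which is why the model maps onto the small set of building blocks listed above.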


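Product-sum operation circuit by MLCS

[Figure: an 8-bit multiplier followed by an accumulator. The 8-bit X input is split into X0 and X1 and the 8-bit Y input into Y0 and Y1; the partial products P00, P01, P10, and P11 (each with parts a and b) are combined into sums S0-S3. A 16-bit adder and a 16-bit register built from 4-bit registers (RG), with a reset input, accumulate the result. Outputs: 16-bit PS (PS0-PS3) and carry output C3.]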
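The product-sum structure in the figure can be approximated in Python as an 8-bit multiplier assembled from four 4x4 table lookups, followed by a 16-bit accumulator. This is a sketch under assumptions rather than the actual circuit: the pairing of partial-product names with operand halves (for example, P01 as X0 times Y1), the sticky carry flag, and the function names are guesses made for illustration.

```python
# One 256-word x 8-bit table holds every 4-bit x 4-bit product.
MUL4_LUT = [(a >> 4) * (a & 0xF) for a in range(256)]


def mul4(x, y):
    """4x4 multiply by table lookup."""
    return MUL4_LUT[(x << 4) | y]


def mul8(x, y):
    """8x8 multiply from four 4x4 partial products."""
    x0, x1 = x & 0xF, x >> 4   # low and high halves of X
    y0, y1 = y & 0xF, y >> 4   # low and high halves of Y
    p00 = mul4(x0, y0)
    p01 = mul4(x0, y1)         # assumed naming
    p10 = mul4(x1, y0)         # assumed naming
    p11 = mul4(x1, y1)
    return p00 + ((p01 + p10) << 4) + (p11 << 8)


def product_sum(pairs):
    """Accumulate x * y over all pairs in a 16-bit register.

    Returns (ps, carry), where carry is set if any addition overflowed
    16 bits (a sticky flag, assumed for this sketch).
    """
    acc = 0
    carry = 0
    for x, y in pairs:
        total = acc + mul8(x, y)
        acc = total & 0xFFFF
        carry |= total >> 16
    return acc, carry


if __name__ == "__main__":
    assert all(mul8(x, y) == x * y for x in range(256) for y in range(256))
    print(product_sum([(3, 4), (5, 6)]))          # (42, 0)
    print(product_sum([(255, 255), (255, 255)]))  # (64514, 1)
```

Four lookups and a few additions replace the gate-level multiplier, and the same lookup table serves all four partial products.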

High-speed processing with our IP, Axonerve™, and MLCS

Major function of picture processing

○ CPU execution steps with 4 pipeline steps (① to ④: execution step)

Quick access by using the search engine of Axonerve™

[Figure: a CPU command consisting of an opcode and an operand passes through execution steps ① to ④. Blocks shown: Axonerve (decoding the opcode), SRAM (MLCS) (command access), register (fetch operand), register (timing adjust register), execution unit (MLCS), execution output, and data.]


Career: Dr. Kanji Otsuka, IEEE Fellow
1959 – 1973: Design and development of semiconductors, LSIs, and modules at Hitachi Ltd.
1970 – 1993: Design and development of mainframe computers at Hitachi Ltd.
1993 – 2004: Professor at Meisei University, Faculty of Information Science, including two years as director of the master's course and 4 years as dean.
2004 – present: Emeritus professor and executive researcher, including 4 years as invited professor at Osaka University and 1 year as guest lecturer at the University of Tokyo.

A large shared cache placed at the center let the many gate-array CPUs communicate with each other easily over the shortest wiring. Its performance was preeminent against IBM's at that time.

This was one of my successful designs, with my idea implemented in 1984.

M680

One board computer
