in-memory associative computing

IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY [email protected]


Page 1: IN-MEMORY ASSOCIATIVE COMPUTING


Page 2: IN-MEMORY ASSOCIATIVE COMPUTING

AGENDA

The AI computational challenge
Introduction to associative computing
Examples
An NLP use case
What's next?

Page 3: IN-MEMORY ASSOCIATIVE COMPUTING

THE CHALLENGE IN AI COMPUTING

AI Requirement: Use Case Example
• 32-bit FP: Neural network learning
• Multi-precision: Neural network inference, data mining, etc.
• Scaling: Data center
• Sort-search: Top-K, recommendation, speech, image/video classification
• Heavy computation: Nonlinearity, Softmax, exponent, normalize
• Bandwidth: Required for speed and power

Page 4: IN-MEMORY ASSOCIATIVE COMPUTING

CURRENT SOLUTION

Diagram: a general-purpose CPU (tens of cores) or a GPU (thousands of cores) answers queries against DRAM over a very wide bus.

• Bottleneck when register file data needs to be replaced on a regular basis
  ➢ Limits performance
  ➢ Increases power consumption

• Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that require functions like Top-K and Softmax.

Page 5: IN-MEMORY ASSOCIATIVE COMPUTING

GPU VS CPU VS FPGA

Page 6: IN-MEMORY ASSOCIATIVE COMPUTING

GPU VS CPU VS FPGA VS APU

Page 7: IN-MEMORY ASSOCIATIVE COMPUTING

GSI’S SOLUTION APU—ASSOCIATIVE PROCESSING UNIT

• Computes in-place directly in the memory array—removes the I/O bottleneck

• Significantly increases performance

• Reduces power

Diagram: a simple CPU sends questions to the APU (associative memory) over a simple & narrow bus; millions of processors answer.

Page 8: IN-MEMORY ASSOCIATIVE COMPUTING

IN-MEMORY COMPUTING CONCEPT

Page 9: IN-MEMORY ASSOCIATIVE COMPUTING

THE COMPUTING MODEL FOR THE PAST 80 YEARS

Diagram: the classic model reads operands (e.g., 0101 and 1000) out of the memory array through the address decoder and sense amp/IO drivers into an ALU, then writes the result (0010) back.

Page 10: IN-MEMORY ASSOCIATIVE COMPUTING

THE CHANGE—IN-MEMORY COMPUTING

• Patented in-memory logic using only Read/Write operations

• Any logic/arithmetic function can be generated internally

Diagram: a simple controller sequences only read and write operations; two reads followed by a NOR-write compute directly in the memory array (e.g., 0101 NOR 1000 = 0010).
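Since NOR is functionally complete, the claim that any logic function can be generated from read/NOR-write alone can be checked with a tiny sketch. This is plain Python standing in for bit-line operations, not GSI's API:

```python
def nor(a, b):
    return int(not (a or b))

def not_(a):          # NOT from a single NOR
    return nor(a, a)

def or_(a, b):        # OR = NOT(NOR)
    return not_(nor(a, b))

def and_(a, b):       # AND via De Morgan, NOR-only
    return nor(not_(a), not_(b))

def xor_(a, b):       # XOR built from the NOR-based gates above
    return nor(and_(a, b), nor(a, b))
```

Every gate above bottoms out in `nor`, mirroring how the array needs nothing beyond read and NOR-write.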

Page 11: IN-MEMORY ASSOCIATIVE COMPUTING

CAM/ASSOCIATIVE SEARCH

Diagram: records (1101, 1000, 1001, 1001, 0110, 0010, 0111, 0110) are stored along the bit lines; searching for the key 0110 flags each matching record with 1.
• Duplicate the values with their inverse data.
• Duplicate the key with its inverse, and move the original key next to the inverse data.
• Each 1 in the combined key drives a read enable (RE); a match reads out as 1.

Page 12: IN-MEMORY ASSOCIATIVE COMPUTING

TCAM SEARCH WITH STANDARD MEMORY CELLS

Diagram: the key 0110 is compared against stored patterns containing don't-care bits (e.g., 010x, 11xx, 0110); a don't-care position matches either value.

Page 13: IN-MEMORY ASSOCIATIVE COMPUTING

TCAM SEARCH WITH STANDARD MEMORY CELLS

Diagram: records (0101, 0000, 1001, 1001, 0110, 0010, 0110, 0110) are searched with the key 0110; each match is flagged with 1.
• Duplicate the data, inverting only the bits that are not don't-care; insert zero instead of a don't-care.
• Duplicate the key with its inverse, and move the original key next to the inverse data.
• Each 1 in the combined key drives a read enable (RE).
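In software, a ternary match reduces to a value plus a care mask per entry. The sketch below models the idea only; the entries are illustrative, not the slide's, and this is not how the cells themselves are wired:

```python
def tcam_match(key, entries):
    # entries: (value, care_mask) pairs; a 0 bit in care_mask is a don't-care.
    # An entry matches when key and value agree on every cared-about bit.
    return [i for i, (v, m) in enumerate(entries) if (key ^ v) & m == 0]

hits = tcam_match(0b0110, [(0b0110, 0b1111),   # exact pattern 0110
                           (0b0100, 0b1110),   # pattern 010x
                           (0b0010, 0b0011)])  # pattern xx10
```

Here `hits` flags entries 0 and 2: the exact pattern and the one whose cared-about low bits agree with the key.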

Page 14: IN-MEMORY ASSOCIATIVE COMPUTING

COMPUTING IN THE BIT LINES

Vector A

Vector B

Each bit line becomes a processor and storage; millions of bit lines = millions of processors

C=f(A,B)

Page 15: IN-MEMORY ASSOCIATIVE COMPUTING

NEIGHBORHOOD COMPUTING

Parallel shift of bit lines between sections @ 1 cycle

Enables neighborhood operations such as convolutions

C=f(A,SL(B,1))

Shift vector
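The shift-then-combine pattern can be sketched with an array position standing in for a bit line. This is an illustrative NumPy model of C = f(A, SL(B, 1)), not APU code:

```python
import numpy as np

# A and B live on the same "bit lines"; shifting B by one position lets each
# line combine its own value with its neighbor's, as in C = f(A, SL(B, 1)).
A = np.array([1, 2, 3, 4, 5])
B = np.array([10, 20, 30, 40, 50])
C = A + np.roll(B, -1)   # f chosen as addition for illustration
```

Chaining such shifted combinations is what makes neighborhood operations like convolutions possible.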

Page 16: IN-MEMORY ASSOCIATIVE COMPUTING

SEARCH & COUNT

Diagram: searching for the value 20 across records (e.g., 5, 3, 20, 54, 20, 8, 20, 17) flags the three matching bit lines; Count = 3.

• Search (binary or ternary) all bit lines in 1 cycle
• 128M bit lines => 128 Peta search/sec

• Key applications of search and count for predictive analytics:
  • Recommender systems
  • K-nearest neighbors (using cosine similarity search)
  • Random forest
  • Image histogram
  • Regular expressions
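The search-and-count primitive behaves like a parallel compare followed by a population count. A NumPy sketch of the concept (one vectorized compare standing in for the single-cycle search):

```python
import numpy as np

records = np.array([5, 3, 20, 54, 20, 8, 20, 17])
matches = records == 20          # conceptually one parallel compare across all bit lines
count = int(matches.sum())       # popcount of the match flags
```

The match-flag vector can then feed further in-place operations (ranking, aggregation) without ever leaving the array.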

Page 17: IN-MEMORY ASSOCIATIVE COMPUTING

DATABASE SEARCH AND UPDATE

Content-based search: a record can be placed anywhere. Update, modify, insert, and delete are immediate.

Exact Match: CAM/TCAM

Similarity Match: In-Place Aggregate

Page 18: IN-MEMORY ASSOCIATIVE COMPUTING

TRADITIONAL STORAGE CAN DO MUCH MORE

A standard memory cell can hold:
• 1 bit
• 2 bits, a 2-input NOR, or 1 TCAM cell
• 3 bits, a 3-input NOR, a 2-input NOR + 1 output, or a 4-state CAM

Page 19: IN-MEMORY ASSOCIATIVE COMPUTING

CPU/GPGPU VS APU

Page 20: IN-MEMORY ASSOCIATIVE COMPUTING

ARCHITECTURE

Page 21: IN-MEMORY ASSOCIATIVE COMPUTING

SECTION COMPUTING TO IMPROVE PERFORMANCE


Diagram: memory is organized into MLB sections (24 rows each) joined by connecting muxes; a controller with an instruction buffer drives all sections.

Page 22: IN-MEMORY ASSOCIATIVE COMPUTING

Store, compute, search and move data anywhere

Shifts between sections enable neighborhood operations (filters, CNNs, etc.)

COMMUNICATION BETWEEN SECTIONS


Page 23: IN-MEMORY ASSOCIATIVE COMPUTING

APU CHIP LAYOUT

2M bit processors, or 128K vector processors, running at 1 GHz with up to 2 Peta OPS peak performance

Page 24: IN-MEMORY ASSOCIATIVE COMPUTING

EVALUATION BOARD PERFORMANCE

• Precision: unlimited, from 1 bit to 160 bits or more
• 6.4 TOPS (FP); 8 Peta OPS for one-bit computing or 16-bit exact search
• Similarity search, Top-K, min, max, Softmax: O(1) complexity in μs for any size of K, compared to ms with current solutions
• In-memory IO: 2 Petabit/sec, > 100X GPGPU/CPU/FPGA
• Sparse matrix multiplication: > 100X GPGPU/CPU/FPGA

Page 25: IN-MEMORY ASSOCIATIVE COMPUTING

APU SERVER

• 64 APU chips, 256–512 GByte DDR
• From 100 TFLOPS up to 128 Peta OPS, with peak performance of 128 TOPS/W
• O(1) Top-K, min, max
• 32 Petabit/sec internal IO
• < 1K Watts
• > 1000X GPGPUs on average
• Linearly scalable
• Currently a 28nm process, scalable to 7nm or less
• Well suited to advanced memory technologies such as non-volatile ReRAM and more

Page 26: IN-MEMORY ASSOCIATIVE COMPUTING

EXAMPLE APPLICATIONS

Page 27: IN-MEMORY ASSOCIATIVE COMPUTING

K-NEAREST NEIGHBORS (K-NN)

Simple example: N = 36, 3 groups, 2 dimensions (D = 2) for X and Y

K = 4; group Green is selected as the majority

For actual applications: N = billions, D = tens, K = tens of thousands

Page 28: IN-MEMORY ASSOCIATIVE COMPUTING

K-NN USE CASE IN AN APU

Diagram: the features and label of each item (Item 1 … Item N) are stored along the bit lines next to a computing area; the query Q is broadcast to all.

• Distribute data: 2 ns (to all)
• Compute cosine distances for all N in parallel (≤ 10 μs, assuming D = 50 features)
• K mins at O(1) complexity (≤ 3 μs), with in-place ranking and majority calculation

With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K.
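The per-query flow (broadcast, parallel cosine similarity, top-K, majority) maps onto a few array operations. A NumPy sketch with made-up data, making no claim about APU timing or API:

```python
import numpy as np

def knn_cosine(query, items, k):
    # One conceptually parallel pass: cosine similarity of the query against
    # all N items at once, then the k most similar.
    sims = items @ query / (np.linalg.norm(items, axis=1) * np.linalg.norm(query))
    topk = np.argsort(-sims)[:k]
    return topk, sims[topk]

items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # N=3 items, D=2 features
topk, scores = knn_cosine(np.array([1.0, 0.0]), items, k=2)
```

A majority vote over the labels of `topk` would then complete the classifier.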

Page 29: IN-MEMORY ASSOCIATIVE COMPUTING

LARGE DATABASE EXAMPLE USING APU SERVERS

• Number of items: billions

• Features per item: tens to hundreds

• Latency: ≤ 1 msec

• Throughput: Scales to 1M similarity searches/sec

• k-NN: Top 100,000 nearest neighbors

Page 30: IN-MEMORY ASSOCIATIVE COMPUTING

EXAMPLE K-NN FOR RECOGNITION

Diagram: a feature extractor (neural network) feeds a K-NN classifier (associative memory). Text enters via BOW or word embeddings; images enter via convolution layers.

Page 31: IN-MEMORY ASSOCIATIVE COMPUTING

K-MINS: O(1) ALGORITHM

KMINS(int K, vector C) {
  M := 1; V := 0;
  FOR b = msb TO lsb:
    D := not(C[b]);
    N := M & D;
    cnt := COUNT(N | V);
    IF cnt > K:
      M := N;
    ELIF cnt < K:
      V := N | V;
    ELSE:  // cnt == K
      V := N | V;
      EXIT;
    ENDIF
  ENDFOR
}

Diagram: the bit columns C0, C1, C2, … of the records are scanned from MSB to LSB, producing the candidate marker N at each step.

Page 32: IN-MEMORY ASSOCIATIVE COMPUTING

K-MINS: THE ALGORITHM

Diagram: 16 six-bit records (110101, 010100, 000101, 000111, 000001, 111011, 000101, 010110, 001100, 011100, 111100, 101101, 000101, 111101, 000101, 000101), with M initialized to all 1s and V to all 0s. Processing the MSB column C[0]: D = not(C[0]), N = M & D, and cnt = COUNT(N|V) = 11.

Page 33: IN-MEMORY ASSOCIATIVE COMPUTING

K-MINS: THE ALGORITHM

Diagram: processing the next column C[1]: D = not(C[1]), N = M & D, and cnt = COUNT(N|V) = 8; M and V are updated per the algorithm.

Page 34: IN-MEMORY ASSOCIATIVE COMPUTING

K-MINS: THE ALGORITHM

Diagram: after the last bit column of C has been processed, V flags the selected records; this is the final output.

O(1) Complexity
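The pseudocode above can be checked with a direct Python transcription, using one integer bitmask per marker vector. This is a software model of the algorithm, not APU code:

```python
def k_mins(values, k, nbits):
    """Software model of the K-MINS bit-serial algorithm.

    Returns a bitmask over item indices flagging the k smallest values.
    """
    n = len(values)
    M = (1 << n) - 1          # marker of remaining candidates (all items)
    V = 0                     # marker of confirmed winners
    for b in range(nbits - 1, -1, -1):      # msb -> lsb
        # D marks items whose bit b is 0, i.e. not(C[b])
        D = 0
        for i, v in enumerate(values):
            if not (v >> b) & 1:
                D |= 1 << i
        N = M & D
        cnt = bin(N | V).count("1")
        if cnt > k:
            M = N             # too many: narrow candidates to the 0-bit items
        elif cnt < k:
            V |= N            # all of these are certain winners; keep scanning
        else:
            return N | V      # exactly k winners found
    return V                  # ties at the lsb can leave fewer than k flagged
```

For example, `k_mins([5, 1, 7, 3, 2], 2, 3)` flags the values 1 and 2 (indices 1 and 4). On the APU every bitwise step acts on all records at once, so the loop length depends only on the word width, not on the number of records.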

Page 35: IN-MEMORY ASSOCIATIVE COMPUTING

DENSE (1XN) VECTOR BY SPARSE NXM MATRIX

Input vector (dense, 1×N): 4 −2 3 −1

Sparse matrix nonzeros, stored as (row, column, value) triplets:
value:  3 5 9 17
row:    1 2 2 4
column: 2 3 4 4

APU representation and computing:
• Search all columns for row = 2; distribute −2: 2 cycles
• Search all columns for row = 3; distribute 3: 2 cycles
• Search all columns for row = 4; distribute −1: 2 cycles
• Multiply in parallel: 100 cycles
• Shift and add all products belonging to the same column to form the output vector

Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix.

N << M in general for recommender systems
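The distribute-multiply-accumulate scheme corresponds to a COO-style multiply. A plain-Python sketch with illustrative 0-indexed triplets (not the slide's exact data), where the per-column accumulation stands in for the shift-and-add step:

```python
def dense_times_sparse(x, triplets, m):
    # x: dense 1xN input vector; triplets: (row, col, val) nonzeros of an NxM matrix.
    # Distribute x[row] to each nonzero of that row, multiply, accumulate per column.
    y = [0] * m
    for r, c, v in triplets:
        y[c] += x[r] * v
    return y

y = dense_times_sparse([4, -2, 3, -1],
                       [(0, 1, 3), (1, 2, 5), (1, 3, 9), (3, 3, 17)], m=4)
```

On the APU the distribute and multiply steps run for all nonzeros of a row at once, which is where the claimed O(N + log β) behavior comes from.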

Page 36: IN-MEMORY ASSOCIATIVE COMPUTING

SPARSE MATRIX MULTIPLICATION PERFORMANCE ANALYSIS

• G3 circuit matrix: 1.5M × 1.5M sparse matrix

• Roughly 8M nonzero elements

• 20–100 GFLOPS with a GPGPU solution

• The APU solution provides 64 TFLOPS using the same amount of power as the GPGPU solution above

• > 500x improvement with the APU solution

Page 37: IN-MEMORY ASSOCIATIVE COMPUTING

ASSOCIATIVE MEMORY FOR NATURAL LANGUAGE PROCESSING (NLP)

Q&A, dialog, language translation, speech recognition, etc. require learning things from the past, which needs memory. More memory, more accuracy.

E.g., "Dan put the book in his car, … long story here … Mike took Dan's car … long story here … He drove to SF."
Q: Where is the book now?
A: The car, in SF

Page 38: IN-MEMORY ASSOCIATIVE COMPUTING

END-TO-END MEMORY NETWORKS

End-To-End Memory Networks (Sukhbaatar et al., NIPS 2015). (a): single hop, (b): 3 hops

Page 39: IN-MEMORY ASSOCIATIVE COMPUTING

Q&A : END TO END NETWORK

Page 40: IN-MEMORY ASSOCIATIVE COMPUTING

REQUIREMENTS FOR AUGMENTED MEMORY

• Cosine similarity search + Top-K

• Compute softmax on the selected columns

• Vertical multiplication of the selected columns and horizontal sum for the output

• Embed input features to the next column, to any other location, or to a location based on content

Page 41: IN-MEMORY ASSOCIATIVE COMPUTING

APU MEMORY FOR NLP

Diagram: the controller broadcasts the input I to columns 0 … N−1 holding memories m_i, with tags T_i selecting columns; the output is the weighted sum Σ_i p_i m_i over the selected memories, as in memory networks.

• Broadcast input I to selected columns
• Compute any function at selected columns
• Generate output
• Generate tags for selection
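The broadcast-score-softmax-sum read described here is the memory-network attention step. A NumPy sketch under the assumption that the rows of M hold the memories m_i (illustrative model, not APU code):

```python
import numpy as np

def memory_read(u, M):
    # p_i = softmax(u . m_i) over all memory rows; output o = sum_i p_i * m_i
    scores = M @ u
    p = np.exp(scores - scores.max())   # numerically stable softmax
    p /= p.sum()
    return p @ M

# With a strongly matching first memory, the read returns roughly that memory.
o = memory_read(np.array([10.0, 0.0, 0.0]), np.eye(3))
```

On the APU, the dot products, the softmax over the selected columns, and the weighted sum all run in place, one column per memory.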

Page 42: IN-MEMORY ASSOCIATIVE COMPUTING

GSI SOLUTION FOR END TO END

Constant time of 3 µsec per iteration, any memory size.

Page 43: IN-MEMORY ASSOCIATIVE COMPUTING

PROGRAMMING MODEL

Page 44: IN-MEMORY ASSOCIATIVE COMPUTING

PROGRAMMING MODEL

HOST:
• Application (C++, Python)
• Framework (TensorFlow)
• Graph execution and task scheduling

Device:
• APU (Associative Processing Unit hardware)

Page 45: IN-MEMORY ASSOCIATIVE COMPUTING


A TF EXAMPLE: MATMUL

a = tf.placeholder(tf.int32, shape=[3,4])

b = tf.placeholder(tf.int32, shape=[4,6])

c = tf.matmul(a, b) # shape = [3,6]

Diagram: graph with inputs a and b feeding a matmul node that produces c.

Page 46: IN-MEMORY ASSOCIATIVE COMPUTING

A TF EXAMPLE: MATMUL GRAPH PREPARATION

apu device (tf + eigen)

a = tf.placeholder(tf.int32, shape=[3, 4])

b = tf.placeholder(tf.int32, shape=[4, 6])

c = tf.matmul(a, b)

gnl_create_array(a)

gnl_create_array(b)

gnl_create_array(c)

Diagram: the arrays a, b, and c are allocated in APU device space, which spans the host, L4, the APUC's L1, and the MMB.

Page 47: IN-MEMORY ASSOCIATIVE COMPUTING

A TF EXAMPLE: MATMUL GVML_SET, GVML_MUL

with tf.Session() as sess:
    result = sess.run(c, feed_dict={
        a: [[ 2, -4, 15,  3],
            [11,  1, -30,  4],
            [-8, 23,  -9,  7]],
        b: [[ 27, -8, -14,  2,  9, -32],
            [ -7, 52,  -6, 21,  0,   4],
            [-81,  1,  20,  6, 19,  -3],
            [-38, 90,   5,  2, 13,  77]]})

Diagram: a and b are loaded into APU device space (L4), and gnlpd_mat_mul(c, a, b) runs on the controller. gnlpd_dma_16b_start(GNLPD_SYS_2_VMR, …) copies data from L4 to L1 while gvml_set_16(…) and gvml_mul_s16(…) compute in the APUC, so the DMA keeps loading data while the matmul is being computed (e.g., the row 27 -8 -14 2 9 -32 times 2 yields 54 -16 -28 4 18 -64).

Page 48: IN-MEMORY ASSOCIATIVE COMPUTING

TENSORFLOW ENHANCEMENT: FUSED OPERATIONS

a = tf.placeholder(tf.int32, shape=[3,4])

b = tf.placeholder(tf.int32, shape=[4,6])

c = tf.matmul(a, b) # shape = [3,6]

d = tf.nn.top_k(c, k=2) # shape = [3,2]

Diagram: the matmul and top_k nodes of the graph are fused into a single fused(matmul, top_k) node.

• The two operations are computed inside the APUC
• Data stays in L1
• No IO operations between them
• Saves valuable data transfer time and power
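The fused behavior can be emulated in NumPy; here the "fusion" is only conceptual (one function, no intermediate handed back to the caller), and `fused_matmul_topk` is a hypothetical helper name, not a GSI or TensorFlow API:

```python
import numpy as np

def fused_matmul_topk(a, b, k):
    # Hypothetical fused op: matmul immediately followed by per-row top-k,
    # so the full product matrix is never returned to the caller.
    c = a @ b
    idx = np.argsort(-c, axis=1)[:, :k]
    return np.take_along_axis(c, idx, axis=1), idx

vals, idx = fused_matmul_topk(np.array([[1, 0], [0, 1]]),
                              np.array([[3, 1, 2], [0, 5, 4]]), k=2)
```

In the APU version, keeping c in L1 between the two steps is exactly what removes the IO cost this slide calls out.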

Page 49: IN-MEMORY ASSOCIATIVE COMPUTING


A TF EXAMPLE: MATMUL CODE EXAMPLE


APL_FRAG add_u16(RN_REG x, RN_REG y, RN_REG t_xory, RN_REG t_ci0)
{
    SM_0XFFFF: RL = SB[x];                        // RL[0-15] = x
    SM_0XFFFF: RL |= SB[y];                       // RL[0-15] = x|y
    {
        SM_0XFFFF: SB[t_xory] = RL;               // t_xory[1-15] = x|y
        SM_0XFFFF: RL = SB[x, y];                 // RL[0-15] = x&y
    }
    // Add init state:
    //   0:     RL = co[0]
    //   1..15: RL = x&y
    {
        (SM_0X1111 << 1): SB[t_ci0] = NRL;        // t_ci0[5,9,13] = x&y
        (SM_0X1111 << 1): RL |= SB[t_xory] & NRL; // RL[1] = Cout[1] = x&y | ci(x|y)
                                                  // 5,9,13: RL = Cout0[5,9,13] = x&y | ci(x|y)
        (SM_0X1111 << 4): RL |= SB[t_xory];       // RL[4,8,12] = Cout1[4,8,12] = x&y | 1&(x|y)
    }
    {
        (SM_0X1111 << 2): SB[t_ci0] = NRL;        // Propagate Cin0
        (SM_0X1111 << 2): RL |= SB[t_xory] & NRL; // Propagate Cout0
        (SM_0X1111 << 5): RL |= SB[t_xory] & NRL; // Propagate Cout1
    }
    {
        (SM_0X1111 << 3): SB[t_ci0] = NRL;        // Propagate Cin0
        (SM_0X1111 << 3): RL |= SB[t_xory] & NRL; // Propagate Cout0
        (SM_0X1111 << 6): RL |= SB[t_xory] & NRL; // Propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    {
        (SM_0X1111 << 4): SB[t_ci0] = NRL;        // t_ci0[8,12,16] = Cout0[7,11,15]
        SM_0X0001: SB[t_ci0] = GL;                // t_ci0[8,12,16] = Cout0[7,11,15]
        (SM_0X1111 << 7): RL |= SB[t_xory] & NRL; // Propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    . . .

Page 50: IN-MEMORY ASSOCIATIVE COMPUTING

FUTURE APPROACH – NON VOLATILE CONCEPT

Page 51: IN-MEMORY ASSOCIATIVE COMPUTING

SOLUTIONS FOR FUTURE DATA CENTERS

Diagram: alongside the conventional hierarchy (CPU register file, L1/L2/L3, DRAM, Flash, HDD), associative computing spans several memory technologies:

• Standard SRAM based (volatile), high endurance: full computing (floating point etc.), which requires read & write
• STT-RAM based (non-volatile), mid endurance: machine learning, malware detection, etc., with much more read and much less write
• PC-RAM and ReRAM based (non-volatile), low endurance: data search engines (read most of the time)

Page 52: IN-MEMORY ASSOCIATIVE COMPUTING

THANK YOU