In-Place Associative Computing
ee380.stanford.edu/Abstracts/171108-slides.pdf


Page 1:

In-Place Associative Computing

Avidan Akerib Ph.D.

Vice President Associative Computing BU

aakerib@gsitechnology.com

All images are public on the Web

Page 2:

Agenda

• Introduction to associative computing

• Use case examples

  • Similarity search

  • Large-scale attention computing

  • Few-shot learning

• Software model

• Future approaches

Page 3:

The Challenge in AI Computing (Matrix Multiplication Is Not Enough!)

AI Requirement                    Use Case Example

High-precision floating point     Neural network learning
Multi-precision                   Real-time inference, saving memory
Linearly scalable                 Big data
Sort/search                       Top-K, recommendation, speech, image/video classification
Heavy computation                 Non-linearity, softmax, exponent, normalization
Bandwidth/power tradeoff          High speed at low power

Page 4:

Von Neumann Architecture

[Diagram: CPU connected to memory. Memory: high density (repeated cells), slower. CPU: lower density (lots of logic), faster.]

Leveraging Moore's Law

Page 5:

Von Neumann Architecture

[Diagram: CPU connected to memory. Memory: high density (repeated cells), slower. CPU: lower density (lots of logic), faster.]

CPU frequency outpaced memory, so caches were added; this continued to leverage Moore's Law.

Page 6:

Since 2006, clock speeds have flattened sharply …

Source: Intel

Page 7:

Thinking Parallel: 2 Cores and More

[Diagram: multiple CPU cores sharing memory]

However, memory utilization becomes an issue …

Page 8:

More and more memory to solve the utilization problem

[Diagram: CPU with local and global memory]

Page 9:

Memory still growing rapidly

[Diagram: CPU with memory occupying most of the chip]

Memory becomes a larger part of each chip.

Page 10:

The Same Concept, Even with GPGPUs

[Diagram: GPGPU surrounded by memories]

Very high power, large die, expensive …

What's next?

Page 11:

Most of the Power Goes to Bandwidth

Source: Song Han, Stanford University

Page 12:

Changing the Rules of the Game!

Standard memory cells are smarter than we thought!

Page 13:

APU: Associative Processing Unit

• Computes in-place, directly in the memory array, removing the I/O bottleneck

• Significantly increases performance

• Reduces power

[Diagram: a simple CPU sends a question over a simple, narrow bus to the APU; the associative-processing array, with millions of processors, returns the answer]

Page 14:

How Computers Work Today

[Diagram: a conventional memory array. An address decoder selects a single row; the sense amps / IO drivers move its contents (e.g., 0101) to the ALU under read/write enable (RE/WE)]

Page 15:

Accessing Multiple Rows Simultaneously

[Diagram: several stored rows (1000, 1100, 1001) are read-enabled at once; the bit lines evaluate their NOR (0010), which can be write-enabled into another row]

Bus contention is not an error! It is a simple NOR/NAND satisfying De Morgan's laws.

Page 16:

Truth Table Example

A B C | D
0 0 0 | 1
0 0 1 | 0
0 1 0 | 1
0 1 1 | 1
1 0 0 | 0
1 0 1 | 0
1 1 0 | 0
1 1 1 | 1

Karnaugh map (columns AB = 00, 01, 11, 10; rows C = 0, 1):

C=0:  1 1 0 0
C=1:  0 1 1 0

D = !A!C + BC

!A!C + BC = !!( !A!C + BC ) = !( !(!A!C) !(BC) ) = NAND( NAND(!A,!C), NAND(B,C) )

Read (!A,!C); WRITE T1 (1 clock)
Read (B,C); WRITE T2 (1 clock)
Read (T1,T2); WRITE D (1 clock)

• Every minterm takes one clock

• All bit lines execute Karnaugh tables in parallel
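The three-clock sequence above can be simulated in plain Python (a sketch for illustration, not vendor code): each int below stands for one row of the memory array, with bit i holding that row's value on bit line i, so a single bitwise operation acts on every bit line at once.

```python
# Toy model of in-place evaluation of D = !A!C + BC across all bit lines.
# Each variable is an int used as a bit vector: bit i is the value stored
# on bit line i, and one bitwise operation is one "clock".

WIDTH = 8                     # number of bit lines in this toy example
MASK = (1 << WIDTH) - 1

def nand(x, y):
    """One multi-row access: reading two rows together yields their NAND."""
    return ~(x & y) & MASK

# Rows A, B, C hold one input bit per bit line (all 8 combinations appear).
A = 0b00001111
B = 0b00110011
C = 0b01010101

T1 = nand(~A & MASK, ~C & MASK)   # clock 1: T1 = NAND(!A, !C)
T2 = nand(B, C)                   # clock 2: T2 = NAND(B, C)
D  = nand(T1, T2)                 # clock 3: D  = !A!C + BC on every bit line

assert D == ((~A & ~C) | (B & C)) & MASK
print(f"{D:08b}")  # 10110001
```

Three clocks produce all eight rows of the truth table at once; widening WIDTH to millions of bit lines changes nothing in the step count.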

Page 17:

Vector Add Example

A[] + B[] = C[]

vector A(8, 32M)
vector B(8, 32M)
vector C(9, 32M)

C = A + B

No. of clocks = 4 × 8 = 32
Clocks per byte = 32 / 32M = 1/1M
OPS = 1 GHz × 1M = 1 PetaOPS
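The clock arithmetic above can be made concrete with a bit-plane sketch in plain Python (the helper names are made up, not the APU's instruction set): vectors are stored transposed, one bit-plane per row, so an 8-bit add costs a fixed number of plane-wide logic operations regardless of how many elements the vector holds.

```python
# Bit-serial vector addition on bit-planes: each int below is one bit-plane,
# i.e. bit i of plane b belongs to element i of the vector.

BITS = 8

def to_planes(values, bits=BITS):
    """Transpose a list of ints into bit-planes (plane b holds bit b of every element)."""
    return [sum(((v >> b) & 1) << i for i, v in enumerate(values)) for b in range(bits)]

def from_planes(planes, n):
    """Transpose bit-planes back into a list of n ints."""
    return [sum(((planes[b] >> i) & 1) << b for b in range(len(planes))) for i in range(n)]

def vector_add(a_planes, b_planes):
    """Ripple-carry add: a few logic ops per bit, applied to all elements at once."""
    carry, out = 0, []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)                      # sum bit for every element
        carry = (a & b) | (a & carry) | (b & carry)    # carry bit for every element
    out.append(carry)                                  # 9th bit-plane of the result
    return out

A = [3, 100, 255, 17]
B = [5, 28, 1, 40]
C = from_planes(vector_add(to_planes(A), to_planes(B)), len(A))
print(C)  # [8, 128, 256, 57]
```

The loop body issues a handful of logic operations per bit, which is where the slide's 4 × 8 = 32 clocks for an 8-bit add comes from; the element count only widens the bit-planes.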

Page 18:

CAM / Associative Search

Key: search 0110

[Diagram: records such as 1101, 1000, 1001 stored along the bit lines, together with their inverses 0010, 0111, 0110]

• Duplicate the values with inverse data

• Duplicate the key with its inverse; place the original key next to the inverse data

• Every 1 in the combined key drives a read enable

• A bit line that reads back 1 is a match

Page 19:

TCAM Search by Standard Memory Cells

[Diagram: stored ternary words such as 0 1 1 0, with don't-care positions; searching the key 0110 matches, since a don't-care bit matches anything]

Page 20:

TCAM Search by Standard Memory Cells

Key: search 0110

[Diagram: stored words 0101, 0000, 1001 along the bit lines with their modified inverses]

• Duplicate the data; invert only the bits that are not don't-care

• Insert zero in place of each don't-care bit

• Duplicate the key with its inverse; place the original key next to the inverse data

• Every 1 in the combined key drives a read enable

• A bit line that reads back 1 is a match
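The recipe above can be checked with a small Python model (an illustration; the patterns and helper names are made up): don't-care positions store 0 in both the data copy and the inverse copy, so they can never raise a mismatch.

```python
# TCAM matching with two standard-memory copies per entry.

def make_entry(pattern):
    """pattern: string over {'0','1','x'} ('x' = don't care).
    Returns (data, inverse) bit vectors, with 0 stored at don't-care positions."""
    data = int(''.join('1' if c == '1' else '0' for c in pattern), 2)
    inv  = int(''.join('1' if c == '0' else '0' for c in pattern), 2)
    return data, inv

def matches(pattern, key):
    """key: string over {'0','1'}. A mismatch exists iff a 1-bit of the key hits
    a stored inverse bit, or a 0-bit of the key hits a stored data bit."""
    data, inv = make_entry(pattern)
    width = len(key)
    k = int(key, 2)
    kn = ~k & ((1 << width) - 1)
    return (k & inv) == 0 and (kn & data) == 0

print(matches("01x0", "0110"))  # True: the 'x' position matches anything
print(matches("01x0", "0111"))  # False: the last bit conflicts
```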

Page 21:

Computing in the Bit Lines

Vector A: a0 a1 a2 a3 a4 a5 a6 a7
Vector B: b0 b1 b2 b3 b4 b5 b6 b7

C = f(A, B)

Each bit line becomes a processor plus storage.
Millions of bit lines = millions of processors.

Page 22:

Neighborhood Computing

Parallel shift of bit lines between sections in 1 cycle enables neighborhood operations such as convolutions.

C = f(A, SL(B, 1))   (SL = shift left)

Page 23:

Search & Count

[Diagram: searching for 20 in the vector (5, 20, -17, 3, 20, 54, 20, 8) marks three bit lines with 1; Count = 3]

• Search (binary or ternary) all bit lines in 1 cycle; 128M bit lines => 128 Peta searches/sec

• Key applications of search and count for predictive analytics:

  • Recommender systems

  • K-nearest neighbors (using cosine similarity search)

  • Random forest

  • Image histogram

  • Regular expressions
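In plain Python (a sketch, not APU microcode), search-and-count over the slide's example vector looks like this:

```python
# Compare every element against the key "at once", then count matching lines.

def search_and_count(records, key):
    matches = [int(r == key) for r in records]   # 1 = match, one flag per bit line
    return matches, sum(matches)

records = [5, 20, -17, 3, 20, 54, 20, 8]
marks, count = search_and_count(records, 20)
print(marks, count)  # [0, 1, 0, 0, 1, 0, 1, 0] 3
```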

Page 24:

CPU vs GPU vs FPGA vs APU

Page 25:

CPU/GPGPU (current solution) vs. In-Place Computing (APU)

• Send an address to memory → Search by content

• Fetch the data from memory and send it to the processor → Mark in place

• Compute serially per core (thousands of cores at most) → Compute in place on millions of processors (the memory itself becomes millions of processors)

• Write the data back to memory, further wasting IO resources → No need to write data back; the result is already in the memory

• Send data to each location that needs it → If needed, distribute or broadcast at once

Page 26:

ARCHITECTURE

Page 27:

Store, compute, search, and transport data anywhere.

[Diagram: communication between sections; shifts between sections enable neighborhood operations (filters, CNNs, etc.)]

Page 28:

Section Computing to Improve Performance

[Diagram: memory organized as MLB sections of 24 rows each (MLB section 0, MLB section 1, …), joined by connecting muxes, driven by a control unit with an instruction buffer]

Page 29:

APU Chip Layout

2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 PetaOPS peak performance.

Page 30:

APU Layout vs. GPU Layout

[Diagram: the APU die is made of multi-functional, programmable blocks; the GPU die is dominated by blocks accelerating FP operations]

Page 31:

EXAMPLE APPLICATIONS

Page 32:

K-Nearest Neighbors (k-NN)

Simple example: N = 36 points, 3 groups, 2 dimensions (D = 2: X and Y), K = 4.
Group green is selected as the majority.
For actual applications: N = billions, D = tens, K = tens of thousands.

Page 33:

k-NN Use Case in an APU

Cosine similarity of the query Q against each stored item D_p:

C_p = (D_p · Q) / (|D_p| |Q|) = Σ_{i=0..n} D_{p,i} Q_i / ( √(Σ_{i=0..n} D_{p,i}²) √(Σ_{i=0..n} Q_i²) )

[Diagram: the features and label of each item (item 1 … item N) are stored along the bit lines in the computing area; a majority calculation runs over the top-K labels, e.g., counts 4 1 3 2]

• Distribute the query data: 2 ns (to all)

• Compute cosine distances for all N in parallel (10 µs, assuming D = 50 features)

• K mins at O(1) complexity (3 µs)

• In-place ranking

With the database in an APU, computation for all N items is done in 0.05 ms, independent of K (1000X improvement over current solutions).
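The whole k-NN flow (cosine score against every item, then top-K, then majority label) can be sketched in plain Python; the `items`, `labels`, and query values below are made-up stand-ins.

```python
# Cosine-similarity k-NN with a majority vote over the K nearest labels.
import math
from heapq import nlargest

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_labels(items, labels, query, k):
    """items: list of feature vectors; returns the majority label among the K nearest."""
    nearest = nlargest(k, range(len(items)), key=lambda i: cosine(items[i], query))
    top_labels = [labels[i] for i in nearest]
    return max(set(top_labels), key=top_labels.count)

items = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = ["red", "red", "green", "green"]
print(knn_labels(items, labels, (1.0, 0.2), 3))  # red
```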

Page 34:

K-MINS: O(1) Algorithm

KMINS(int K, vector C){
    M := 1; V := 0;
    FOR b = msb to lsb:
        D := not(C[b]);
        N := M & D;
        cnt := COUNT(N | V);
        IF cnt > K:
            M := N;
        ELIF cnt < K:
            V := N | V;
        ELSE:  // cnt == K
            V := N | V;
            EXIT;
        ENDIF
    ENDFOR
}

[Diagram: the values stored as bit columns C0, C1, C2, …, scanned from MSB to LSB]
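The pseudocode above runs unchanged as plain Python once the bit vectors are modeled; here M, V, N, and D are sets of element indices, which keeps the bitwise flavor readable. The values are the sixteen 6-bit numbers from the worked example on the next slides.

```python
# K-MINS: find the indices of the k smallest values, scanning MSB -> LSB.

def k_mins(values, k, bits):
    idx = range(len(values))
    M = set(idx)                # candidates still in play (the "1" vector)
    V = set()                   # confirmed minimum elements (the "0" vector)
    for b in reversed(range(bits)):                          # msb down to lsb
        D = {i for i in idx if not (values[i] >> b) & 1}     # D = not(C[b])
        N = M & D                                            # candidates with a 0 at bit b
        cnt = len(N | V)                                     # COUNT(N | V)
        if cnt > k:
            M = N
        elif cnt < k:
            V = N | V
        else:                   # cnt == k: done early
            return N | V
    return M | V

vals = [53, 20, 5, 7, 1, 59, 5, 22, 12, 28, 60, 45, 5, 61, 5, 5]
winners = k_mins(vals, 6, 6)
print(sorted(winners))  # [2, 4, 6, 12, 14, 15]
```

Note that the loop runs once per bit, independent of the number of elements; that fixed iteration count is what the slides call O(1).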

Page 35:

K-MINS: The Algorithm (worked example, first iteration)

[Diagram: sixteen 6-bit values (110101, 010100, 000101, 000111, 000001, 111011, 000101, 010110, 001100, 011100, 111100, 101101, 000101, 111101, 000101, 000101). M starts as all 1s and V as all 0s. For the MSB column C[0], D = not(C[0]) and N = M & D mark the eleven values whose MSB is 0; COUNT(N|V) = 11]

Page 36:

K-MINS: The Algorithm (worked example, second iteration)

[Diagram: for the next column C[1], N = M & not(C[1]) narrows the candidates; COUNT(N|V) = 8, still above K, so M := N]

Page 37:

K-MINS: The Algorithm (worked example, final state)

[Diagram: after the last bit, V marks the K minimum values; this is the final output]

O(1) complexity

Page 38:

Similarity Search and Top-K for Recognition

[Diagram: images and text flow through a feature extractor (neural network): a convolution layer for images, word/sentence/doc embedding for text. The database stores the result, and every image/sentence/doc has a label]

Page 39:

Dense (1×N) Vector by Sparse N×M Matrix

[Diagram: the input vector (4, -2, 3, -1) multiplies a sparse matrix stored as (row, column, value) triples laid out along the bit lines; products in the same output column are combined]

APU representation and computing:

• Search all columns for row = 2; distribute -2: 2 cycles

• Search all columns for row = 3; distribute 3: 2 cycles

• Search all columns for row = 4; distribute -1: 2 cycles

• Multiply in parallel: 100 cycles

• Shift and add all products belonging to the same column

Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix. N << M in general for recommender systems.
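A plain-Python sketch of the flow above (the triples and sizes are made-up stand-ins, not the slide's exact matrix): the sparse matrix is kept as (row, column, value) triples, each input nonzero is broadcast to its row's triples, and products are accumulated per output column, which plays the role of the shift-and-add step.

```python
# Sparse vector-times-matrix with the matrix in (row, col, value) triple form.
from collections import defaultdict

def spvm(x, triples, m):
    """x: dense 1 x N input; triples: (row, col, value) entries; returns a 1 x M output."""
    out = defaultdict(int)
    for r, c, v in triples:       # "search" all entries of row r, broadcast x[r]
        if x[r] != 0:
            out[c] += x[r] * v    # multiply, then accumulate per output column
    return [out[c] for c in range(m)]

x = [4, -2, 3, -1]
triples = [(0, 0, 3), (1, 1, 5), (2, 1, 9), (2, 3, 17)]   # hypothetical matrix
print(spvm(x, triples, 4))  # [12, 17, 0, 51]
```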

Page 40:

Two N×N Sparse Matrix Multiplication

1. Choose the next free entry from In-DB1
2. Read its row value
3. Search and mark similar rows
4. For all marked rows:
   • Search where Col(In-DB1) = Row(In-DB2)
   • Broadcast the selected value to the output-table bit lines
   • Multiply in parallel
   • Shift and add all products belonging to the same column
   • Update Out-DB
5. Go back to step 1 if there are more free entries
6. Exit

[Diagram: the two sparse input matrices and the output matrix, stored as (row, column, value) triples in In-DB1, In-DB2, and Out-DB]

Complexity including IO: O(β + log β), compared to O(β^0.7 N^1.2 + N^2) on a CPU (> 1000X improvement).

Page 41:

Softmax

• Used in many neural-network applications, especially attention networks

• The softmax function takes an N-dimensional vector of scores and generates probabilities between 0 and 1:

S_i = e^{Z_i} / Σ_{j=1..N} e^{Z_j}

• where Z_i is the dot product between a query vector and a feature vector (for example, a word embedding of an English vocabulary)
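A plain-Python softmax sketch, using the standard max-subtraction trick; it addresses the dynamic-range difficulty (scores overflow exp quickly) without changing the result, since e^{z-m}/Σe^{z-m} = e^z/Σe^z.

```python
import math

def softmax(z):
    m = max(z)                              # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])   # naive math.exp(1000.0) would raise OverflowError
print(probs)                                # probabilities summing to 1; largest score wins
```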

Page 42:

The Difficulties in Softmax Computing

1. Dot products for millions of vectors

2. A non-linear function (exp)

3. Dependency: every score depends on all others in the database

4. Dynamic range: fast overflow; requires high-precision calculation

5. Speed and latency

Page 43:

Taylor Series

e^x = 1 + x + x²/2! + x³/3! + ⋯

Very expensive: requires more than 20 coefficients and double precision for good accuracy.

Page 44:

1M Softmax Performance

• A proprietary algorithm leverages the APU's lookup capability

• Provides 1M high-accuracy, exact softmax values in ≤ 5 µsec, vs. 10-100 msec on a GPU

• > 3 orders of magnitude improvement

Page 45:

Associative Memory for Natural Language Processing (NLP)

• Q&A, dialog, language translation, speech recognition etc.

• Requires learning past events

• Needs large array with attention capabilities

Page 46:

Examples

• Q&A:

“Dan put the book in his car, … long story here … Mike took Dan's car … long story here … He drove to SF”

• Q : Where is the book now?

• A: Car, SF

• Language Translation:

• “The cow ate the hay because it was delicious”.

• “The cow ate the hay because it was hungry”.

Attention Computing

Source: Łukasz Kaiser

Page 47:

Example of Associative Attention Computing

[Diagram: input data (e.g., an English sentence, for translation or Q&A) passes through an encoder (NN) into a feature-vector embedding; the sentence-feature representations are stored as keys]

Page 48:

Example of Associative Attention Computing

[Diagram: a query passes through the encoder (NN) into a feature vector, which is dot-producted against all stored keys (values V1, V2, …, V6, …) at once, yielding scores such as 0.1, 0.01, 0.03, 0.07, 0.2, …. The softmax of the dot-product results and a Top-K selection produce the attention output for the next stage (encoder or decoder)]

• Dot product: O(1)

• Softmax: O(1)

• Top-K: O(1)

Page 49:

Q&A: End-to-End Network (Weston)

Source: Weston et al.

Page 50:

GSI Associative Solution for End-to-End

Constant time of 3 µsec per iteration, for any memory size.
> a few orders of magnitude improvement

Source: Weston et al.

Page 51:

Associative Computing for Low-Shot Learning

• Gradient-Based Optimization has achieved impressive results

on supervised tasks such as image classification

• These models need a lot of data

• People can learn efficiently from few examples

• Associative Computing

• Like people, can measure similarity to features stored in memory

• Can also create a new label for similar features in the future

Page 52:

Zero-Shot Learning with k-NN

• Extract features using any pre-trained CNN, for example VGG/Inception on ImageNet

• The new data set is embedded using the pre-trained model and stored in memory with its labels

• Queries (test images) are input without labels, and a cosine similarity search over their features predicts the label

[Diagram: input images with labels → feature extractor (convolution layer) → embedded input features stored in memory; an unlabeled query image goes through the same pipeline, and cosine similarity search + Top-K returns the label of the most similar image]

Page 53:

Dimension Reduction

• The output of the convolution layer is large (20,000 features in VGG, very sparse)

• A simple matrix or multi-layer non-linear transformation reduces the 20,000 features to 200 associative features

• Learned simply, with a loss function based on:

  • the cosine distance between any two records

  • the difference between that distance at the input and at the output

• The transform thus learns to preserve cosine distance
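The loss described above can be sketched in a few lines of plain Python (toy sizes stand in for the slide's 20,000 → 200, and W is a made-up linear transform): compute the cosine similarity of a pair of records before and after projection and penalize the difference.

```python
# Distance-preserving loss for a linear dimension-reduction matrix W.
import math, random

def cos(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def project(W, x):
    """W: out_dim x in_dim matrix applied to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def pair_loss(W, x1, x2):
    """Squared difference between input-space and output-space cosine similarity."""
    return (cos(x1, x2) - cos(project(W, x1), project(W, x2))) ** 2

random.seed(0)
in_dim, out_dim = 8, 3          # toy stand-ins for 20,000 -> 200
W = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
x1 = [random.gauss(0, 1) for _ in range(in_dim)]
x2 = [random.gauss(0, 1) for _ in range(in_dim)]
print(pair_loss(W, x1, x2))     # the quantity a trainer would minimize
```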

Page 54:

Low-Shot: Train the network on distance

• Start with untrained network

• Output of network is already reduced-dimension keys for k-NN DB

• Train the network only to keep similar-valued keys close

k-NN Data Base

Page 55:

Cut Short

• Stop training when system starts to converge (Cut Short)

• Use similarity search instead of Fully Connected

• Requires less complete training

k-NN Data Base

(Associative)

Page 56:

PROGRAMMING MODEL

Page 57:

Programming Model

1. Write the application on a standard host using the TensorFlow / Tensor2Tensor framework
2. The framework generates a TensorFlow graph for execution in the device memory
3. The APU chip/card executes the graph using fused capabilities

Page 58:

PCIe Development Boards

• 4 APU chips

• 8 million bit-line processors

• 8 Peta Boolean OPS

• 6.4 TFLOPS

• 2 Petabit/sec internal IO

• 16-64 GB device memory

• TensorFlow framework (basic functions)

• GNL (GSI Numeric Lib)

Page 59:

FUTURE APPROACH: NON-VOLATILE CONCEPT

Page 60:

Computing in Non-Volatile Cells

[Diagram: non-volatile bit cells on a bit line, with select and REF inputs, feeding a sense unit & write control]

Read:

• Select multiple lines for read (as NOR/NAND inputs)

• REF = V-read

• The sense unit senses the bit line for logic “1” or “0”

Write:

• Select one or multiple lines for write (NOR/NAND results)

• REF = V-write

• The write control generates logic “1” or “0” for the bit line

Page 61:

Solutions for Future Data Centers

[Diagram: the memory hierarchy from CPU register file through L1/L2/L3, DRAM, flash, and HDD, with associative options alongside]

• Standard SRAM-based, volatile, high endurance: full computing (floating point, etc.), which requires both read and write

• STT-RAM-based, non-volatile, mid endurance: machine learning, malware detection, etc. (much more read, much less write)

• PC-RAM- or ReRAM-based, non-volatile, low endurance: data search engines (read most of the time)

Page 62:

Summary

The APU enables state-of-the-art, next-generation machine learning:

• In-place computing, from basic Boolean algebra to complex algorithms

• O(1) dot-product computation

• O(1) min/max

• O(1) Top-K

• O(1) softmax

• Ultra-high internal bandwidth: 0.5 Petabit/sec

• Up to 2 PetaOPS of Boolean algebra in a single chip

• Fully scalable

• Fully programmable

• Efficient TensorFlow-based capabilities

Page 63:

Summary

Extending Moore's Law and leveraging advanced memory technology growth.

M.Sc./Ph.D. students who would like to collaborate on research are welcome to contact me: aakerib@gsitechnology.com

Page 64:

Thank You!

Any Questions?

APU Page – http://www.gsitechnology.com/node/123377