TRANSCRIPT
Page 1
ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
Tae Jun Ham*, Yejin Lee*, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, Jae W. Lee
[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]
The 48th ACM/IEEE International Symposium on Computer Architecture (ISCA) ARC Lab @ Seoul National University
* These authors contributed equally to this work
Page 2
ARC Lab @ Seoul National University | ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
Self-Attention: A Key Primitive of Transformer-based Models
• Transformer-based models achieve state-of-the-art performance in many domains
  • Natural Language Processing: QA, translation, language modeling
  • Computer Vision: image classification, object detection
• Self-attention is the key primitive of Transformer-based models
  • Self-attention identifies relations among entities and enables models to effectively handle sequential data
Page 3
Self-Attention: A Key Primitive of Transformer-based Models
• Transformer-based models often come with a large computational cost
  • Self-attention often accounts for more than 30% of the runtime of Transformers
  • The portion grows even larger with longer input sequences

[Figure: runtime breakdown (Self-Attention vs. Others) for BERT, RoBERTa, ALBERT, SASRec, and BERT4Rec, at the default sequence length and at a 4x larger sequence length]
Page 4
Self-Attention Mechanism

Q. What is self-attention?
A. Find the key vectors relevant to a query vector, then return the weighted sum of the value vectors corresponding to those relevant keys.

Step 1. Attention Score: compute the dot product of the query with each row of the key matrix.
  Query Matrix (n × d) · Key Matrix (n × d)ᵀ → Attention Score (n × n)

[Figure: example 3×3 query, key, and value matrices and the resulting attention-score matrix]
Page 5
Self-Attention Mechanism

Step 2. Softmax-Normalized Attention Score: apply softmax normalization to each row of the attention-score matrix.
  Attention Score (n × n) → Normalized Score (n × n)

[Figure: the same example matrices, now with the softmax-normalized score matrix (each row summing to 1)]
Page 6
Self-Attention Mechanism

Step 3. Weighted Sum of Value Vectors: compute the weighted sum of the value-matrix rows, using the normalized scores as weights.
  Normalized Score (n × n) × Value Matrix (n × d) → Output (n × d)

[Figure: the example normalized-score matrix multiplied by the value matrix to produce the output matrix]
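Putting the three steps together, a minimal NumPy sketch of plain (dense) self-attention, before any of ELSA's approximations, looks like this:

```python
import numpy as np

def self_attention(Q, K, V):
    """Plain dense self-attention over (n, d) matrices Q, K, V.

    A minimal sketch of the three steps on the slides (no sqrt(d) scaling,
    no multi-head); this is the baseline that ELSA approximates.
    """
    scores = Q @ K.T                                        # Step 1: attention scores (n, n)
    scores = scores - scores.max(axis=1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # Step 2: softmax normalization
    return weights @ V                                      # Step 3: weighted sum of values
```

Step 1 alone already costs n² · d multiply-adds, which is why the later slides target it for approximation.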
Page 7
Opportunities for Approximation in Self-Attention Mechanism

• The attention mechanism looks like a series of dense matrix operations, but it is also a content-based search
• The softmax operation filters out data that are NOT relevant to the query, based on the attention scores

Key Idea: What if we could identify the set of candidates that are likely to be relevant to the query, without much computation?
→ Avoid processing irrelevant keys
→ Reduce the amount of computation

[Figure: an example row of attention scores (6.66, 1.74, 5.58, 1.80, -0.85, -0.29, -0.19, -1.03, 10.90, 2.33, 5.54) and the softmax-normalized scores e^score / Σ e^score (0.014, 0.000, 0.005, 0.000, 0.000, 0.000, 0.000, 0.000, 0.976, 0.000, 0.005): almost all of the weight concentrates on a single key]
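The concentration effect is easy to reproduce; the scores below are the slide's example values:

```python
import numpy as np

# Softmax concentrates almost all weight on the highest-scoring keys, so
# low-scoring keys barely affect the output; these are the slide's scores.
scores = np.array([6.66, 1.74, 5.58, 1.80, -0.85, -0.29, -0.19, -1.03, 10.90, 2.33, 5.54])
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))  # the 10.90 entry receives ~0.976 of the total weight
```

Keys whose normalized weight is effectively zero could have been skipped entirely, which is exactly the opportunity ELSA exploits.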
Page 8
Why Specialized Hardware?

Observation: conventional hardware such as GPUs does not benefit from such approximation.
• GPUs are better optimized for the full similarity computation (dot product) than for the presented approximate similarity computation.

Specialized hardware can fully exploit the potential benefits of the presented approximate self-attention mechanism.

[Figure: GPU device vs. ELSA HW accelerator]
Page 9
ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism

Approximate self-attention algorithm (per key): (1) compute the Hamming distance between the query hash and the key hash, (2) convert the Hamming distance to an angle, (3) take the cosine, (4) multiply by the key norm, (5) candidate selection. The key hashes, query hash, and key norms come from efficient hash and norm computation.

[Figure: overview diagram. Left: the approximate self-attention algorithm over the query matrix Q and key matrix K. Right: the specialized hardware for approximation, with HashComp and HashMem, XOR/popcount units, a cosine LUT, multipliers and comparators for candidate selection, an arbiter (Arb), KeyMem, ValMem, multiply-accumulate arrays, and an output-division buffer.]

With a novel approximate self-attention algorithm and specialized hardware, ELSA achieves a significant reduction in the runtime as well as the energy spent on the self-attention mechanism.
Page 10
Estimating Angular Distance

The attention score between a given query vector Q and a key vector K is
  AttentionScore(K, Q) = K · Q
Assuming n entities, there are n² (key, query) pairs in total → n² · d multiply-and-add (MAC) operations.

Key Idea: ELSA maps the query vector Q and each key vector K to short binary vectors and approximates the attention score from them, replacing the d MAC operations per pair with a few cheap bitwise operations and a single multiply.

[Figure: the query Q and each row of the key matrix K are hashed into binary vectors ("Hash for Q", "Hash for K")]
Page 11
Estimating Angular Distance

The attention score between a given query vector Q and a key vector K is
  AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) ∝ ||K|| cos(θ_{K,Q})

Key Idea: approximate cos(θ_{K,Q}) to efficiently approximate the attention score, using binary hashing.

Self-attention finds the keys relevant to a given query Q, so we can ignore ||Q||: it is the same for all keys.
Page 12
Estimating Angular Distance

Initialize k (= 4) random vectors (v1, v2, v3, v4); the figure illustrates the normal vectors they define (n_v1, n_v2, n_v3, n_v4), each splitting the space into a + side and a - side.

[Figure: example vectors x1, x2, x3 together with the four random hyperplanes and their +/- sides]

(As before: AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) ∝ ||K|| cos(θ_{K,Q}); approximate cos(θ_{K,Q}) with binary hashing, and ignore ||Q|| since it is the same for all keys.)
Page 13
Page 14
(Whiteboard notes)
1) In a d-dimensional vector space, pick a random vector v1 ∈ R^d, e.g.
   v1 = [0.5, -0.7, 0.3, 0.9, 1.2, …]
Page 15
2) Create a hyperplane: v1 is the normal vector of a hyperplane through the origin, which splits the space into two sides:
   x · v1 ≥ 0  (one side)
   x · v1 < 0  (the other side)
Page 16
3) Binary hashing (ex: k = 3): each of the k random vectors contributes one bit, 1 if the hashed vector lies on the + side of that hyperplane and 0 otherwise (k is the number of random vectors), e.g.
   h(K) = [1 0 0]
   h(Q) = [1 1 1]
Page 17
4) Hamming distance via XOR:
   h(K) = [1 0 0], h(Q) = [1 1 1]
   h(K) XOR h(Q) = [0 1 1] → hamming(K, Q) = Σ[0 1 1] = 2
Page 18
5) Angle estimate (k = 3):
   θ_{K,Q} ≈ (π / k) · hamming(K, Q) = (π / 3) · 2 = 2π/3
   and the attention score is approximated as Score(Q, K) ≈ ||K|| · cos(θ_{K,Q}).
Page 19
Estimating Angular Distance

We can obtain the angular distance just from the binary embeddings:
  θ_{K,Q} ≈ hamming(h(Q), h(K)) · π / k
so the approximate score is ||K|| · cos(hamming(h(Q), h(K)) · π / k).

[Figure: example with k = 4 hash bits and vectors x1, x2, x3 hashed to h(x1) = [0110], h(x2) = [1110], h(x3) = [0001]. Pairwise angles recovered from the hashes: θ_{x1,x2} ≈ (π/4) · hamming([0110], [1110]) = π/4; θ_{x3,x1} ≈ (π/4) · hamming([0001], [0110]) = 3π/4; θ_{x2,x3} ≈ (π/4) · hamming([1110], [0001]) = π.]

(As before: AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) ∝ ||K|| cos(θ_{K,Q}); ||Q|| is ignored since it is the same for all keys.)
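This sign-random-projection scheme is easy to sketch in NumPy. The estimator below follows the slides (θ ≈ hamming · π/k); function names are illustrative, and ELSA's hardware additionally refines the random vectors (e.g. via orthogonalization), which this sketch omits:

```python
import numpy as np

def srp_hash(x, planes):
    """k-bit sign-random-projection hash: bit i is 1 iff x lies on the
    + side of the hyperplane whose normal vector is planes[i]."""
    return (planes @ x >= 0).astype(np.uint8)

def est_angle(h_a, h_b):
    """Angle estimate from the slides: theta ~= hamming * pi / k."""
    return np.count_nonzero(h_a != h_b) * np.pi / len(h_a)

rng = np.random.default_rng(0)
d, k = 64, 256                    # vector dimension, number of hash bits
planes = rng.normal(size=(k, d))  # k random hyperplane normals

q, key = rng.normal(size=d), rng.normal(size=d)
true_angle = np.arccos(q @ key / (np.linalg.norm(q) * np.linalg.norm(key)))
approx_angle = est_angle(srp_hash(q, planes), srp_hash(key, planes))
# approximate (unnormalized) score: np.linalg.norm(key) * np.cos(approx_angle)
```

More hash bits k sharpen the estimate, at the cost of more XOR/popcount work per pair; the slides use small k in the worked examples only for illustration.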
Page 20
Approximate Attention Score with Binary Hashing

Step 1. Preprocessing: compute the hash values for the keys and the query, and the norm of each key.

Step 2. Approximate Attention Score Computation: compute an approximate attention score for each key.
  Original computation: AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) → d multiply-and-adds
  Approximate computation: ||K|| · cos(hamming(h(Q), h(K)) · π / k), where k = number of hash bits
    (1) bitwise XOR (Hamming distance), (2) lookup table f(x) = cos(x · π / k), (3) one multiplication by ||K||

Step 3. Candidate Selection: filter out potentially irrelevant keys, i.e. those whose approximate attention score falls below a learned threshold t:
  keep key K if ||K|| · cos(hamming(h(Q), h(K)) · π / k) ≥ t
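Steps 2 and 3 can be sketched as follows (names are illustrative; the hash bits, key norms, and threshold t are assumed to come from preprocessing and training):

```python
import numpy as np

def approx_scores(h_q, h_keys, key_norms, k):
    """Step 2: approximate score per key = ||K|| * cos(hamming * pi / k)."""
    lut = np.cos(np.arange(k + 1) * np.pi / k)         # lookup table f(x) = cos(x * pi / k)
    hamming = np.count_nonzero(h_keys != h_q, axis=1)  # bitwise XOR + popcount per key
    return key_norms * lut[hamming]                    # one multiply per key

def select_candidates(scores, t):
    """Step 3: keep only the keys whose approximate score reaches threshold t."""
    return np.flatnonzero(scores >= t)
```

For k = 4 hash bits, a key hashed identically to the query scores ||K|| · cos(0) = ||K||, and a key differing in every bit scores -||K||, so thresholding cheaply discards keys pointing away from the query.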
Page 21
Candidate Selection

Option A: sort all approximate attention scores and select the top-scoring keys.
  Issue #1: sorting is expensive in hardware.
  Issue #2: sorting can only happen after every key's estimated score has been computed.

Option B: select the keys whose estimated score exceeds a certain threshold.
  Pro #1: no expensive sort is needed.
  Pro #2: filtering can happen while each key's estimated score is being computed.

Q. How do we find the right threshold for all of the NN layers (and sub-layers)?
A. Learn it from the training data!
Page 22
Candidate Selection

Step 1. Set hyperparameter θ: the user picks θ for the accuracy-performance tradeoff.
  Assuming n input entities (keys), we treat as relevant the keys whose softmax-normalized attention score s′ exceeds θ/n.
  Example: if θ = 1 and n = 100, the user considers entities with s′ > 0.01 to be relevant.
  Higher θ: more aggressive approximation; lower θ: more conservative approximation.

Step 2. Learn the raw attention score at which the softmax-normalized score crosses θ/n.
  Example: assume θ = 1, n = 4, so keys with s′ < 0.25 are filtered out.
  [Figure: example raw attention scores (1.35, -0.72, 0.86, 0.24) and their softmax-normalized scores e^score / Σ e^score (0.48, 0.06, 0.30, 0.16)]

Step 3. Repeat for all cases in the training data: this yields a different threshold for each layer (and sub-layer).
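The threshold-learning recipe above can be sketched as follows. The per-row crossing point follows the slides; aggregating with a conservative minimum over the training data is an assumption of this sketch (the paper may aggregate differently):

```python
import numpy as np

def layer_threshold(score_rows, theta):
    """Learn a raw-score threshold for one (sub-)layer from training data.

    score_rows: iterable of 1-D arrays of raw attention scores (one per query).
    For each row of n scores, find the smallest raw score whose softmax-
    normalized score still exceeds theta/n, then take the minimum over rows
    as a conservative aggregation (an assumption of this sketch).
    """
    crossings = []
    for scores in score_rows:
        n = len(scores)
        normalized = np.exp(scores) / np.exp(scores).sum()
        relevant = scores[normalized > theta / n]
        if relevant.size:
            crossings.append(relevant.min())
    return min(crossings)
```

On the slide's example row [1.35, -0.72, 0.86, 0.24] with θ = 1 and n = 4, the normalized scores above 0.25 are 0.48 and 0.30, so the learned raw-score threshold is 0.86.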
Page 23
ELSA Hardware Implementation

Phase 1. Preprocessing (key hash computation): the HashComp unit generates the binary hash of each key and stores it in HashMem.

Phase 2. Execution:
  2-1. Compute the query hash
  2-2. Candidate selection & attention computation
  2-3. Output division
Page 24
ELSA Hardware Implementation

Phase 2-1. Compute Query Hash: compute the hash for the query, h(Q), using the HashComp unit and HashMem.
Page 25
ELSA Hardware Implementation

Phase 2-2. Candidate Selection: compute the approximate similarity for each key and filter out keys with low similarity.
  Approximate similarity: ||K|| · cos(hamming(h(Q), h(K)) · π / k)

[Figure: candidate-selection datapath. Key and query hashes from HashMem feed XOR/popcount units; the Hamming distances index a cosine LUT; the results are multiplied by the key norms and compared against the threshold; an arbiter (Arb) emits the selected row IDs.]
Page 26
ELSA Hardware Implementation

Phase 2-2 (continued). Attention Computation: for each selected key, compute the partial attention and accumulate it in the buffer:
  Σ_i exp(K_i · Q) · V_i   (over the selected keys i)

[Figure: attention-computation datapath. The selected row IDs index KeyMem and ValMem; multiply-accumulate arrays compute the dot products, exponentials, and the running weighted sum in the buffer.]
Page 27
ELSA Hardware Implementation

Phase 2-3. Output Division: normalize the accumulated outcome by dividing it by the sum of exponents:
  Output = (Σ_i exp(K_i · Q) · V_i) / (Σ_i exp(K_i · Q))

[Figure: a divider at the end of the pipeline reads the accumulated weighted sum and the exponent sum from the buffer and produces the final output.]
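The whole execution phase for one query can be sketched end to end (function and variable names are illustrative, not from the paper's artifact; the hashes, key norms, and threshold t are assumed to come from preprocessing and training):

```python
import numpy as np

def elsa_attention(q, h_q, keys, values, h_keys, key_norms, t, k):
    """One query through ELSA's execution phase (illustrative sketch).

    Assumes preprocessing already produced the key hashes h_keys, the
    query hash h_q, the key norms, and the learned threshold t.
    """
    # 2-2a: candidate selection (XOR + popcount, cosine LUT, one multiply, compare)
    lut = np.cos(np.arange(k + 1) * np.pi / k)
    hamming = np.count_nonzero(h_keys != h_q, axis=1)
    selected = np.flatnonzero(key_norms * lut[hamming] >= t)
    # 2-2b: exact partial attention, but only for the selected keys
    e = np.exp(keys[selected] @ q)   # exp(K_i . Q)
    acc = e @ values[selected]       # buffer: sum_i exp(K_i . Q) * V_i
    # 2-3: output division by the sum of exponents
    return acc / e.sum()
```

A useful sanity check: with the threshold set low enough that every key is selected, this reduces exactly to softmax attention over all keys.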
Page 28
ELSA Hardware Pipelining

• Query-level pipelining: stages 2-1 (query hash), 2-2 (candidate selection & attention computation), and 2-3 (output division) overlap across consecutive queries.
• Fine-grained pipelining within each pipeline stage.

[Figure: pipeline diagram showing the (i-1)-th, i-th, and (i+1)-th queries flowing through stages 2-1, 2-2, and 2-3 of the execution phase, after the preprocessing phase (key hash computation).]
Page 29
Evaluation: Impact of Approximate Attention

• Line graph: model accuracy metric; bar graph: portion of selected candidates.
• Less than 1% degradation in the accuracy metric is observed while inspecting only ~33% of the keys.

[Figure: accuracy and selected-candidate ratio across values of the accuracy/performance hyperparameter θ]
Page 30
Evaluation: Performance

[Figure: speedup over GPU for Ideal, Ours (no approximation), Ours (<1%), Ours (<2.5%), and Ours (<5%)]

• Proposed HW accelerator (w/o approximation): 8x - 46x speedup over GPU
• Proposed HW accelerator (w/ approximation), <1% accuracy metric degradation: 20x - 157x speedup over GPU
• Proposed HW accelerator (w/ approximation), <2.5% accuracy metric degradation: 32x - 168x speedup over GPU
Page 31
[Figure: performance-per-watt and energy-consumption comparison]
Page 32
(Closing slide: title, authors, and venue repeated from the title slide.)