TRANSCRIPT
Page 1
ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
Tae Jun Ham*, Yejin Lee*, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, Jae W. Lee
[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]
The 48th ACM/IEEE International Symposium on Computer Architecture (ISCA) ARC Lab @ Seoul National University
* These authors contributed equally to this work
Page 2
ARC Lab @ Seoul National University | ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
Self-Attention: A Key Primitive of Transformer-based Models
• Transformer-based models achieve state-of-the-art performance in many domains
  • Natural Language Processing: QA, translation, language modeling
  • Computer Vision: image classification, object detection
• Self-attention is the key primitive of Transformer-based models
  • Self-attention identifies relations among entities and enables models to effectively handle sequential data
Page 3
Self-Attention: A Key Primitive of Transformer-based Models
• Transformer-based models often come with a large computational cost
  • Self-attention often accounts for more than 30% of the runtime of Transformers
  • The portion grows even larger with longer input sequences

[Figure: runtime breakdown (Self-Attention vs. Others) for BERT, RoBERTa, ALBERT, SASRec, and BERT4Rec, at the default sequence length and at a 4x larger sequence length]
Page 4
Self-Attention Mechanism

Q. What is self-attention?
A. Find the key vectors relevant to a query vector, then return the weighted sum of the value vectors corresponding to those relevant keys.

Step 1. Attention Score: compute the dot product of the query with each row of the key matrix.
  Query Matrix (n × d) · Key Matrix (n × d)ᵀ → Attention Score (n × n)

[Figure: example 3×3 query, key, and value matrices and the resulting attention-score matrix]
Page 5
Self-Attention Mechanism

Step 2. Softmax-Normalized Attention Score: apply softmax normalization to each row of the attention-score matrix.
  Attention Score (n × n) → Normalized Score (n × n)

[Figure: the same example matrices, now with the softmax-normalized score matrix (each row summing to 1)]
Page 6
Self-Attention Mechanism

Step 3. Weighted Sum of Value Vectors: compute the weighted sum of the value-matrix rows, using the normalized scores as weights.
  Normalized Score (n × n) × Value Matrix (n × d) → Output (n × d)

[Figure: the example normalized-score matrix multiplied by the value matrix to produce the output matrix]
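Putting the three steps together, a minimal NumPy sketch of plain (dense) self-attention, before any of ELSA's approximations, looks like this:

```python
import numpy as np

def self_attention(Q, K, V):
    """Plain dense self-attention over (n, d) matrices Q, K, V.

    A minimal sketch of the three steps on the slides (no sqrt(d) scaling,
    no multi-head); this is the baseline that ELSA approximates.
    """
    scores = Q @ K.T                                        # Step 1: attention scores (n, n)
    scores = scores - scores.max(axis=1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # Step 2: softmax normalization
    return weights @ V                                      # Step 3: weighted sum of values
```

Step 1 alone already costs n² · d multiply-adds, which is why the later slides target it for approximation.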
Page 7
Opportunities for Approximation in Self-Attention Mechanism

• The attention mechanism looks like a series of dense matrix operations, but it is also a content-based search
• The softmax operation filters out data that are NOT relevant to the query, based on the attention scores

Key Idea: What if we could identify the set of candidates that are likely to be relevant to the query, without much computation?
→ Avoid processing irrelevant keys
→ Reduce the amount of computation

[Figure: an example row of attention scores (6.66, 1.74, 5.58, 1.80, -0.85, -0.29, -0.19, -1.03, 10.90, 2.33, 5.54) and the softmax-normalized scores e^score / Σ e^score (0.014, 0.000, 0.005, 0.000, 0.000, 0.000, 0.000, 0.000, 0.976, 0.000, 0.005): almost all of the weight concentrates on a single key]
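The concentration effect is easy to reproduce; the scores below are the slide's example values:

```python
import numpy as np

# Softmax concentrates almost all weight on the highest-scoring keys, so
# low-scoring keys barely affect the output; these are the slide's scores.
scores = np.array([6.66, 1.74, 5.58, 1.80, -0.85, -0.29, -0.19, -1.03, 10.90, 2.33, 5.54])
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))  # the 10.90 entry receives ~0.976 of the total weight
```

Keys whose normalized weight is effectively zero could have been skipped entirely, which is exactly the opportunity ELSA exploits.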
Page 8
Why Specialized Hardware?

Observation: conventional hardware such as GPUs does not benefit from such approximation.
• GPUs are better optimized for the full similarity computation (dot product) than for the presented approximate similarity computation.

Specialized hardware can fully exploit the potential benefits of the presented approximate self-attention mechanism.

[Figure: GPU device vs. ELSA HW accelerator]
Page 9
ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism

Approximate self-attention algorithm (per key): (1) compute the Hamming distance between the query hash and the key hash, (2) convert the Hamming distance to an angle, (3) take the cosine, (4) multiply by the key norm, (5) candidate selection. The key hashes, query hash, and key norms come from efficient hash and norm computation.

[Figure: overview diagram. Left: the approximate self-attention algorithm over the query matrix Q and key matrix K. Right: the specialized hardware for approximation, with HashComp and HashMem, XOR/popcount units, a cosine LUT, multipliers and comparators for candidate selection, an arbiter (Arb), KeyMem, ValMem, multiply-accumulate arrays, and an output-division buffer.]

With a novel approximate self-attention algorithm and specialized hardware, ELSA achieves a significant reduction in the runtime as well as the energy spent on the self-attention mechanism.
Page 10
Estimating Angular Distance

The attention score between a given query vector Q and a key vector K is
  AttentionScore(K, Q) = K · Q
Assuming n entities, there are n² (key, query) pairs in total → n² · d multiply-and-add (MAC) operations.

Key Idea: ELSA maps the query vector Q and each key vector K to short binary vectors and approximates the attention score from them, replacing the d MAC operations per pair with a few cheap bitwise operations and a single multiply.

[Figure: the query Q and each row of the key matrix K are hashed into binary vectors ("Hash for Q", "Hash for K")]
Page 11
Estimating Angular Distance

The attention score between a given query vector Q and a key vector K is
  AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) ∝ ||K|| cos(θ_{K,Q})

Key Idea: approximate cos(θ_{K,Q}) to efficiently approximate the attention score, using binary hashing.

Self-attention finds the keys relevant to a given query Q, so we can ignore ||Q||: it is the same for all keys.
Page 12
Estimating Angular Distance

Initialize k (= 4) random vectors (v1, v2, v3, v4); the figure illustrates the normal vectors they define (n_v1, n_v2, n_v3, n_v4), each splitting the space into a + side and a - side.

[Figure: example vectors x1, x2, x3 together with the four random hyperplanes and their +/- sides]

(As before: AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) ∝ ||K|| cos(θ_{K,Q}); approximate cos(θ_{K,Q}) with binary hashing, and ignore ||Q|| since it is the same for all keys.)
Page 13
Page 14
(Whiteboard notes)
1) In a d-dimensional vector space, pick a random vector v1 ∈ R^d, e.g.
   v1 = [0.5, -0.7, 0.3, 0.9, 1.2, …]
Page 15
2) Create a hyperplane: v1 is the normal vector of a hyperplane through the origin, which splits the space into two sides:
   x · v1 ≥ 0  (one side)
   x · v1 < 0  (the other side)
Page 16
3) Binary hashing (ex: k = 3): each of the k random vectors contributes one bit, 1 if the hashed vector lies on the + side of that hyperplane and 0 otherwise (k is the number of random vectors), e.g.
   h(K) = [1 0 0]
   h(Q) = [1 1 1]
Page 17
4) Hamming distance via XOR:
   h(K) = [1 0 0], h(Q) = [1 1 1]
   h(K) XOR h(Q) = [0 1 1] → hamming(K, Q) = Σ[0 1 1] = 2
Page 18
5) Angle estimate (k = 3):
   θ_{K,Q} ≈ (π / k) · hamming(K, Q) = (π / 3) · 2 = 2π/3
   and the attention score is approximated as Score(Q, K) ≈ ||K|| · cos(θ_{K,Q}).
Page 19
Estimating Angular Distance

We can obtain the angular distance just from the binary embeddings:
  θ_{K,Q} ≈ hamming(h(Q), h(K)) · π / k
so the approximate score is ||K|| · cos(hamming(h(Q), h(K)) · π / k).

[Figure: example with k = 4 hash bits and vectors x1, x2, x3 hashed to h(x1) = [0110], h(x2) = [1110], h(x3) = [0001]. Pairwise angles recovered from the hashes: θ_{x1,x2} ≈ (π/4) · hamming([0110], [1110]) = π/4; θ_{x3,x1} ≈ (π/4) · hamming([0001], [0110]) = 3π/4; θ_{x2,x3} ≈ (π/4) · hamming([1110], [0001]) = π.]

(As before: AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) ∝ ||K|| cos(θ_{K,Q}); ||Q|| is ignored since it is the same for all keys.)
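This sign-random-projection scheme is easy to sketch in NumPy. The estimator below follows the slides (θ ≈ hamming · π/k); function names are illustrative, and ELSA's hardware additionally refines the random vectors (e.g. via orthogonalization), which this sketch omits:

```python
import numpy as np

def srp_hash(x, planes):
    """k-bit sign-random-projection hash: bit i is 1 iff x lies on the
    + side of the hyperplane whose normal vector is planes[i]."""
    return (planes @ x >= 0).astype(np.uint8)

def est_angle(h_a, h_b):
    """Angle estimate from the slides: theta ~= hamming * pi / k."""
    return np.count_nonzero(h_a != h_b) * np.pi / len(h_a)

rng = np.random.default_rng(0)
d, k = 64, 256                    # vector dimension, number of hash bits
planes = rng.normal(size=(k, d))  # k random hyperplane normals

q, key = rng.normal(size=d), rng.normal(size=d)
true_angle = np.arccos(q @ key / (np.linalg.norm(q) * np.linalg.norm(key)))
approx_angle = est_angle(srp_hash(q, planes), srp_hash(key, planes))
# approximate (unnormalized) score: np.linalg.norm(key) * np.cos(approx_angle)
```

More hash bits k sharpen the estimate, at the cost of more XOR/popcount work per pair; the slides use small k in the worked examples only for illustration.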
Page 20
Approximate Attention Score with Binary Hashing

Step 1. Preprocessing: compute the hash values for the keys and the query, and the norm of each key.

Step 2. Approximate Attention Score Computation: compute an approximate attention score for each key.
  Original computation: AttentionScore(K, Q) = K · Q = ||K|| ||Q|| cos(θ_{K,Q}) → d multiply-and-adds
  Approximate computation: ||K|| · cos(hamming(h(Q), h(K)) · π / k), where k = number of hash bits
    (1) bitwise XOR (Hamming distance), (2) lookup table f(x) = cos(x · π / k), (3) one multiplication by ||K||

Step 3. Candidate Selection: filter out potentially irrelevant keys, i.e. those whose approximate attention score falls below a learned threshold t:
  keep key K if ||K|| · cos(hamming(h(Q), h(K)) · π / k) ≥ t
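Steps 2 and 3 can be sketched as follows (names are illustrative; the hash bits, key norms, and threshold t are assumed to come from preprocessing and training):

```python
import numpy as np

def approx_scores(h_q, h_keys, key_norms, k):
    """Step 2: approximate score per key = ||K|| * cos(hamming * pi / k)."""
    lut = np.cos(np.arange(k + 1) * np.pi / k)         # lookup table f(x) = cos(x * pi / k)
    hamming = np.count_nonzero(h_keys != h_q, axis=1)  # bitwise XOR + popcount per key
    return key_norms * lut[hamming]                    # one multiply per key

def select_candidates(scores, t):
    """Step 3: keep only the keys whose approximate score reaches threshold t."""
    return np.flatnonzero(scores >= t)
```

For k = 4 hash bits, a key hashed identically to the query scores ||K|| · cos(0) = ||K||, and a key differing in every bit scores -||K||, so thresholding cheaply discards keys pointing away from the query.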
Page 21
Candidate Selection

Option A: sort all approximate attention scores and select the top-scoring keys.
  Issue #1: sorting is expensive in hardware.
  Issue #2: sorting can only happen after every key's estimated score has been computed.

Option B: select the keys whose estimated score exceeds a certain threshold.
  Pro #1: no expensive sort is needed.
  Pro #2: filtering can happen while each key's estimated score is being computed.

Q. How do we find the right threshold for all of the NN layers (and sub-layers)?
A. Learn it from the training data!
Page 22
Candidate Selection

Step 1. Set hyperparameter θ: the user picks θ for the accuracy-performance tradeoff.
  Assuming n input entities (keys), we treat as relevant the keys whose softmax-normalized attention score s′ exceeds θ/n.
  Example: if θ = 1 and n = 100, the user considers entities with s′ > 0.01 to be relevant.
  Higher θ: more aggressive approximation; lower θ: more conservative approximation.

Step 2. Learn the raw attention score at which the softmax-normalized score crosses θ/n.
  Example: assume θ = 1, n = 4, so keys with s′ < 0.25 are filtered out.
  [Figure: example raw attention scores (1.35, -0.72, 0.86, 0.24) and their softmax-normalized scores e^score / Σ e^score (0.48, 0.06, 0.30, 0.16)]

Step 3. Repeat for all cases in the training data: this yields a different threshold for each layer (and sub-layer).
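The threshold-learning recipe above can be sketched as follows. The per-row crossing point follows the slides; aggregating with a conservative minimum over the training data is an assumption of this sketch (the paper may aggregate differently):

```python
import numpy as np

def layer_threshold(score_rows, theta):
    """Learn a raw-score threshold for one (sub-)layer from training data.

    score_rows: iterable of 1-D arrays of raw attention scores (one per query).
    For each row of n scores, find the smallest raw score whose softmax-
    normalized score still exceeds theta/n, then take the minimum over rows
    as a conservative aggregation (an assumption of this sketch).
    """
    crossings = []
    for scores in score_rows:
        n = len(scores)
        normalized = np.exp(scores) / np.exp(scores).sum()
        relevant = scores[normalized > theta / n]
        if relevant.size:
            crossings.append(relevant.min())
    return min(crossings)
```

On the slide's example row [1.35, -0.72, 0.86, 0.24] with θ = 1 and n = 4, the normalized scores above 0.25 are 0.48 and 0.30, so the learned raw-score threshold is 0.86.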
Page 23
ELSA Hardware Implementation

Phase 1. Preprocessing (key hash computation): the HashComp unit generates the binary hash of each key and stores it in HashMem.

Phase 2. Execution:
  2-1. Compute the query hash
  2-2. Candidate selection & attention computation
  2-3. Output division
Page 24
ELSA Hardware Implementation

Phase 2-1. Compute Query Hash: compute the hash for the query, h(Q), using the HashComp unit and HashMem.
Page 25
ELSA Hardware Implementation

Phase 2-2. Candidate Selection: compute the approximate similarity for each key and filter out keys with low similarity.
  Approximate similarity: ||K|| · cos(hamming(h(Q), h(K)) · π / k)

[Figure: candidate-selection datapath. Key and query hashes from HashMem feed XOR/popcount units; the Hamming distances index a cosine LUT; the results are multiplied by the key norms and compared against the threshold; an arbiter (Arb) emits the selected row IDs.]
Page 26
ELSA Hardware Implementation

Phase 2-2 (continued). Attention Computation: for each selected key, compute the partial attention and accumulate it in the buffer:
  Σ_i exp(K_i · Q) · V_i   (over the selected keys i)

[Figure: attention-computation datapath. The selected row IDs index KeyMem and ValMem; multiply-accumulate arrays compute the dot products, exponentials, and the running weighted sum in the buffer.]
Page 27
ELSA Hardware Implementation

Phase 2-3. Output Division: normalize the accumulated outcome by dividing it by the sum of exponents:
  Output = (Σ_i exp(K_i · Q) · V_i) / (Σ_i exp(K_i · Q))

[Figure: a divider at the end of the pipeline reads the accumulated weighted sum and the exponent sum from the buffer and produces the final output.]
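The whole execution phase for one query can be sketched end to end (function and variable names are illustrative, not from the paper's artifact; the hashes, key norms, and threshold t are assumed to come from preprocessing and training):

```python
import numpy as np

def elsa_attention(q, h_q, keys, values, h_keys, key_norms, t, k):
    """One query through ELSA's execution phase (illustrative sketch).

    Assumes preprocessing already produced the key hashes h_keys, the
    query hash h_q, the key norms, and the learned threshold t.
    """
    # 2-2a: candidate selection (XOR + popcount, cosine LUT, one multiply, compare)
    lut = np.cos(np.arange(k + 1) * np.pi / k)
    hamming = np.count_nonzero(h_keys != h_q, axis=1)
    selected = np.flatnonzero(key_norms * lut[hamming] >= t)
    # 2-2b: exact partial attention, but only for the selected keys
    e = np.exp(keys[selected] @ q)   # exp(K_i . Q)
    acc = e @ values[selected]       # buffer: sum_i exp(K_i . Q) * V_i
    # 2-3: output division by the sum of exponents
    return acc / e.sum()
```

A useful sanity check: with the threshold set low enough that every key is selected, this reduces exactly to softmax attention over all keys.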
Page 28
ELSA Hardware Pipelining

• Query-level pipelining: stages 2-1 (query hash), 2-2 (candidate selection & attention computation), and 2-3 (output division) overlap across consecutive queries.
• Fine-grained pipelining within each pipeline stage.

[Figure: pipeline diagram showing the (i-1)-th, i-th, and (i+1)-th queries flowing through stages 2-1, 2-2, and 2-3 of the execution phase, after the preprocessing phase (key hash computation).]
Page 29
Evaluation: Impact of Approximate Attention

• Line graph: model accuracy metric; bar graph: portion of selected candidates.
• Less than 1% degradation in the accuracy metric is observed while inspecting only ~33% of the keys.

[Figure: accuracy and selected-candidate ratio across values of the accuracy/performance hyperparameter θ]
Page 30
Evaluation: Performance

[Figure: speedup over GPU for Ideal, Ours (no approximation), Ours (<1%), Ours (<2.5%), and Ours (<5%)]

• Proposed HW accelerator (w/o approximation): 8x - 46x speedup over GPU
• Proposed HW accelerator (w/ approximation), <1% accuracy metric degradation: 20x - 157x speedup over GPU
• Proposed HW accelerator (w/ approximation), <2.5% accuracy metric degradation: 32x - 168x speedup over GPU
Page 31
[Figure: performance-per-watt and energy-consumption comparison]
Page 32
(Closing slide: title, authors, and venue repeated from the title slide.)