streaming similarity search on fpga based on dynamic time ... · department of electronic...
TRANSCRIPT
![Page 1: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/1.jpg)
Department of Electronic Engineering, Tsinghua University
Nano-scale Integrated Circuit and System Lab.
Streaming Similarity Search on FPGA
based on Dynamic Time Warping
Yu WANG
Associate Prof.,
Head, Research Institution of Circuits and Systems,
E.E. Dept, Tsinghua University, Beijing, China
http://nics.ee.tsinghua.edu.cn/people/wangyu/ Joint work by Tsinghua Univ. and IBM China Research Lab
Based on a submitted paper to FPGA 2013
1
![Page 2: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/2.jpg)
Outline
Background and Motivation
Why we need streaming similarity search
Recent achievements and problems to solve
Subsequence Similarity Search on FPGA
Algorithms
Hardware Architectures
Results
Conclusion and future work
2
![Page 3: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/3.jpg)
Alberto Sangiovanni-Vincentelli (Tuesday noon @ ICCAD 2012)
ICCAD at 30 years Where We have been, where we are going
3
![Page 4: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/4.jpg)
4
Internet of Things
Nowadays
Independent Applications
Traditional Database Techniques
Small Scale “small IOTs”
Monitoring only
Future
Fully connected, and correlated Applications
Advanced IT techniques
Large Scale, and large Volume Data (“big IOTs”)
Different realtime or non-realtime applications BIG DATA (Time and Spatial
Correlated Streaming DATA) Volume, Variety, Velocity
IoT DATA Manage System (IBM RODB©) Collection, Publish, Processing,
Storage, and Query for BIG DATA
![Page 5: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/5.jpg)
5
RODB
Different Applications
Realtime Oriented DataBase
Application Specific Data Management Middleware (Collection, Publish, Processing, Storage, and Query )
![Page 6: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/6.jpg)
Data format from IOT (CPS, SoS, ect.)
6
Format of Data
Numerical data streams from various sensors (Timing Series)
Multi-media data and sensor data
Industries in Smarter Planet
Petro E&U
Mineral
Chemistry
Steel Manufactory
Smart building Smart City Environment monitoring
Retail Logistic
Healthcare
RFID
Transportation
![Page 7: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/7.jpg)
Mining Task Dependency (Not Complete)
Similarity Search
Correlation Discovery
Classification Clustering
Motif Discovery
Novelty/Anomaly detection
Rule Discovery
Segmentation
Visualization
Data Privacy
Prediction
Burst Detection
No history data involved
May have real time req History data analyses
![Page 8: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/8.jpg)
Finite filed subsequence exact search
Object: string
e.g. find “pattern” in “we have a pattern here” with K.M.P
Finite filed subsequence similarity search
Object: DNA chain, Protein sequence
e.g. find similar subsequence as “ATGAG” in a DNA
chain “ATGACTGAG…” with Smith-Waterman.
Infinite filed subsequence similarity search
Object: time series data
e.g. next slide
“Similarity” Search
![Page 9: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/9.jpg)
Streaming Subsequence Similarity Search
Time series (electrocardiogram) & pattern (query)
Pick out subsequences with sliding window (totally
N subsequences)
Compare the subsequences with the pattern, under
a certain distance measure, to judge if they are
similar
0 100 200 300 400-3
-2
-1
0
1
2
d
Simple DATA representation Tuple [Sensor, Time, Value]
Time Complexity O(N*O(distance))
![Page 10: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/10.jpg)
Distance Measure
Dynamic Time Warping P= p1, p2, p3…pM; S= s1, s2, s3…sM
DTW(S, P) = D(M, M);
D(i-1, j)
D(i, j) = dist(si, pj) + min D(i, j-1)
D(i-1, j-1)
D(0,0) = 0;
D(i, 0) = D(0, j) = infinite,
1<i<M,1<j<M;
DTW is the best distance measure in most domains. It
allows shrinking, sketching, warping, even different
lengths. Distance Complexity Analysis (O(M*M))
Step1: Calculate the distance of each two points
Step2: Find the shortest accumulated path
![Page 11: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/11.jpg)
Challenges for Streaming Similarity Computing
Challenges (Velocity, Volume, Variety)
Real time Analysis
• Both on the sensor part or cloud
Large Volume Streaming DATA to be compared
• Can not afford to storage on Sensors
• Millions of Sensors may be on the Edges
Various Patterns
• People may want to search for different patterns on
different/same dataset
Previous Work
Software preprocessing to reduce the real DTW
Parallel Hardware: More on task level parallelism,
little was performed for fine-grained parallelism
13
![Page 12: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/12.jpg)
Related Work -- Software
1000+ papers on software speedup techniques:
1. Y. Sakurai et al. proposed a computation-reuse
algorithm called SPRING [3]
Only one tuple is different between the two neighboring
subsequences
Merge N M-by-M matrixes into single N-by-M matrix. N
paths grow at the same time
N* =>
It reduces the time complexity from O(N*M*M) to O(N*M)
The whole sequence can’t be normalized in streaming:
26 19 23 16 12
18 19 19 7 5
15 22 18 3 5
14 21 13 3 5
12 13 7 2 4
26(1) 19(1) 23(2) 16(2) 12(2) 14(2) 12(2) 6(2) 14(2) 17(8)
18(1) 19(1) 19(2) 7(2) 5(2) 9(2) 6(2) 11(2) 10(8) 8(8)
15(1) 22(1) 18(2) 3(2) 5(2) 5(2) 8(2) 17(2) 7(8) 4(8)
14(1) 21(1) 13(2) 3(2) 5(2) 5(2) 8(2) 17(2) 6(8) 4(8)
12(1) 13(2) 7(2) 2(2) 4(2) 4(2) 7(2) 14(8) 4(8) 3(8)
![Page 13: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/13.jpg)
2. Lower bound: A. Fu, E. Keogh et al. tried to
estimate the lower bound of DTW distance in a cheap
way, called LB_Keogh [1].
It constrains the warping path will not deviate more than R*M
cells from the diagonal. Generate an upper envelope and a lower
envelope, and the sum of the subsequence not falling within the
bounding envelope is defined as the LB_Keogh
If the lower bound distance exceeds the threshold, the DTW
distance will also exceed the threshold, and then the
subsequence can be pruned off.
Related Work -- Software
![Page 14: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/14.jpg)
3. S. H. Lim et al. used indexing techniques to speed
up the search [11]
Build a look up table for different patterns;subsequences
searching speed equals to the look up table searching speed,
which is very fast
Look up table construction cost is even larger than DTW, only
suitable for frequent querying in the same sequence.
No one can index on a streaming sequence which may be
infinitely long.
4. There are also some other techniques, like early
abandoning.
All the former software techniques can be seen as pre-
processing techniques, aiming at reducing the calling times
of DTW calculation, instead of accelerating DTW itself
Related Work -- Software
![Page 15: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/15.jpg)
Several works try to exploit parallel hardware, such
as multi-cores[8], computer cluster[6], GPU[4] to
speedup the search.
All these works try to allocate subsequences starting from
different position of the whole sequence to different processing
units, which can be seen as coarse-grained parallelism.
[4] also uses threads to parallel generate the warping matrixes,
but serially does the path searching, which can be seen as
partial fine-grained parallelism.
Lead to a heavy data-transfer burden, as one subsequence may
consist of too many tuples. The [4]’s partial fine-grained parallel
work even needs to transfer a whole matrix between thread.
Related Work -- Parallel Hardware
![Page 16: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/16.jpg)
The first and only work[2] using FPGA is generated
by a C-to-VHDL tool called ROCCC
From the reported performance, we think the tool exploits
the fine-grained parallelism inside DTW.
It does not exploit the coarse-grained parallelism.
The lack of insight into FPGA limits the scalability and
flexibility:
• It can not support patterns of length larger than 128.
• It can not support on-line updating patterns of different lengths. For
example, if a new pattern of length 127 is wanted, it must re-
compiled the system and re-download the FPGA, which may cost
several hours.
Related Work -- Parallel Hardware
![Page 17: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/17.jpg)
Problems we try to solve
Problems
Software can’t accelerate DTW itself
Coarse-grained parallelism may leads to heavy burden on
bandwidth
Fine-grained parallelism requires hard-wired
synchronization
FPGA lacks flexibility as software
Solutions
Turn to parallel hardware to accelerate DTW
Choose and modify streaming parallel algorithms (SPRING)
to reduce bandwidth
Use FPGA with flexible structure for fine grained parallelism
![Page 18: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/18.jpg)
Outline
Background and Motivation
Why we need streaming similarity search
Recent achievements and problems to solve
Subsequence Similarity Search on FPGA
Algorithms
Hardware Architectures
Results
Conclusion and future work
20
![Page 19: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/19.jpg)
Algorithms
Normalization
Enable multiple DTW
Hybrid lower bound
Good Preprocessing to leave very few real DTW
Multiple DTW
Coarse-Grain and Fine-Grain Parallelism
Algorithm Framework
NormalizerHybrid Lower
BoundMultiple
DTW
![Page 20: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/20.jpg)
Normalization
Assumption: the offset or the amplitude can be
approximately seen as time-invariant in a little longer
length of M+C, where M is the length of pattern, and
C is a constant.
500 1000 1500 2000 2500
-0.2
0
0.2
a
500 1000 1500 2000 250090
100
110
120
130
b
500 1000 1500 2000 2500
200
400
600
800
c
0 100 200 300 400-3
-2
-1
0
1
2
d
![Page 21: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/21.jpg)
Hybrid Lower Bound
LB_partial DTW
Stable but time-consuming
LB_Keogh/ reversed LB_Keogh
Efficient but in-stable
Significantly degrade when
R increase Ui = max {Pi-R, Pi-R+1…Pi+R-1, Pi+R};
Li = min { Pi-R, Pi-R+1…Pi+R-1, Pi+R };
Di = Si-Ui if Si>Ui
Li-Si if Li>Si
0 else
LB(P1,Y, S1,Y) = sum{D1, D2…, DY}
This can been seen as a combination of early abandoning technique and lower bounding technique
![Page 22: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/22.jpg)
![Page 23: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/23.jpg)
Multiple DTW – Modified SPRING
SPRING:
C* =>
DTW(Ss,e, P) = D(e,M)
i-R< sp(i-1,j)+j<i+R ? D(i-1, j) : INF
D(i, j) = dist(si, pj) + min i-R< sp(i-1,j-1)+j<i+R? D(i-1, j-1): INF
i-R< sp(i,j)+j<i+R ? D(i, j-1): INF
D(i, 0) = 0, if valid(i)==1;
INF, if valid(i)==0;
D(0, j) = infinite; where 1<i<N, 1<j<M.
Sp(i-1, j) if D(i-1, j) is the minimum
Sp(i, j)= Sp(i-1, j-1) if D(i-1, j-1) is the minimum
Sp(i, j-1) if D(i, j-1) is the minimum
Sp(i, 0)=i; Sp(0, j)=0; where 1<i<N, 1<j<M.
26 19 23 16 12
18 19 19 7 5
15 22 18 3 5
14 21 13 3 5
12 13 7 2 4
26(1) 19(1) 23(2) 16(2) 12(2) 14(2) 12(2) 6(2) 14(2) 17(8)
18(1) 19(1) 19(2) 7(2) 5(2) 9(2) 6(2) 11(2) 10(8) 8(8)
15(1) 22(1) 18(2) 3(2) 5(2) 5(2) 8(2) 17(2) 7(8) 4(8)
14(1) 21(1) 13(2) 3(2) 5(2) 5(2) 8(2) 17(2) 6(8) 4(8)
12(1) 13(2) 7(2) 2(2) 4(2) 4(2) 7(2) 14(8) 4(8) 3(8)
![Page 24: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/24.jpg)
Hardware Framework
NormNorm
FIFO
PCIE
Norm
LowerBound
Join
value(32 bit) time(32 bit)
value(16 bit) time(16 bit)
LB(16 bit) time(16 bit)
HighPrecisiondomain
LowPrecisiondomain
value(32 bit) time(32 bit)valid(1 bit)
FIFO
Norm
LowerBound
Join
value(32 bit) time(32 bit)valid(1 bit)
FIFO
DTW
Join
DTW
Join
value(16 bit) time(16 bit)
DTW distance(16 bit)
time(16 bit)
valid(1 bit)
value(32 bit) time(32 bit)valid(1 bit)flag(1 bit)
value(32 bit) time(32 bit)
flag(1 bit)
Buffer FIFO
Implementation on FPGA
Four Loops to Guarantee Streaming
Two-Phase Precision Reduction
Support for Multi FPGAs
![Page 25: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/25.jpg)
Normalizer
Hybrid lower bound
Normalization
shifter
Updating mean& std Pipeline latency: K cycle
Tuple2*M+1
Tuple1
Tuple = (Tuple-mean)/std
Tuple2*M+K+1
shifter
MeanStd
Tuple in
Tupleout
Hybrid Lower Bound
LB_pDTW
Tuple in
LB_Keogh
ReversedLB_Keogh
Max
envelope
tuple
distance
distance
+
lower bound
distance
shifterdistance
Implementation on FPGA
![Page 26: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/26.jpg)
DTW
Single PE
PE...
PE...
PE...
PEW
Min
singledistance
CurrAcc D
/start time
PrevAcc D
/start time
Min
+
|-|
D out
D in
pattern
valid
P outP in
tuple
tupleenable
result
busy
PE1
FIFO
PatternRAM
INF
PEW-1
PE2
SubsequenceFIFO
Tuplerouter
Resultrouter
P valid P valid
INF
valid
DTW distance
1
1 0
0
1
0
Starting
P7=0 INF 26(1) 19(1) 23(2) 16(2) 12(2) 14(2) 12(2) 6(2) 14(2) 17(8) 11(8) 12(8) 14(8) 12(8)
P6=5 INF 18(1) 19(1) 19(2) 7(2) 5(2) 9(2) 6(2) 11(2) 10(8) 8(8) 5(8) 7(8) 9(8) 11(8)
P5=9 INF 15(1) 22(1) 18(2) 3(2) 5(2) 5(2) 8(2) 17(2) 7(8) 4(8) 7(8) 9(8) 11(8) 17(8)
P4=10 INF 14(1) 21(1) 13(2) 3(2) 5(2) 5(2) 8(2) 17(2) 6(8) 4(8) 7(8) 9(8) 11(8) 17(11)
P3=9 INF 12(1) 13(2) 7(2) 2(2) 4(2) 4(2) 7(2) 14(8) 4(8) 3(8) 6(8) 8(8) 10(11)11(14)
P2=5 INF 11(1) 5(2) 2(2) 6(2) 8(2) 11(5) 7(7) 5(8) 3(8) 7(8) 7(8) 8(11) 9(13) 5(14)
P1=0 INF 8(1) 1(2) 4(3) 9(4) 7(5) 9(6) 6(7) 0(8) 8(9) 9(10) 6(11) 7(12) 7(13) 3(14)
value 8 1 4 9 7 9 6 0 8 9 6 7 7 3
time 1 2 3 4 5 6 7 8 9 10 11 12 13 14
PE PE1 PE2 PE3 PE4 PE5 PE6 PE7 PE1 PE2 PE3 PE4 PE5 PE6 PE7
Implementation on FPGA
D(i-1, j)
D(i, j) = dist(si, pj) + min D(i, j-1)
D(i-1, j-1)
![Page 27: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/27.jpg)
Experimental Setup
CPU: intel i7-930+ 16G RAM + window 7
FPGA:Altera Stratix4s530 Combinational ALUTs 362,568/424,960 (85%)
Dedicated logic registers 230,160/424,960 (54%)
Memory bits 1,902,512/21,233,664 (9%)
Fmax: 167.8MHz
X=10,Y=502,PE number W=512
30
![Page 28: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/28.jpg)
Dataset1: medical Data
This dataset has about 8G points, and we need to
find a pattern of length 421 with R = 5%
Experimental Results
![Page 29: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/29.jpg)
Dataset2: speech recognition
We download the CMU_ARCTIC speech synthesis
databases, and construct a speech of 1 minute(1 million
points) by splicing together the first 21 utterances of all the
1132 utterances
128 256 512 1024 2048 4096 8192 1638410
-3
10-2
10-1
100
101
102
103
104
105
pattern length
Tim
e/se
cond
Time taken to search a speech dataset
R = 0.05
R = 0.1
R = 0.2
R = 0.3
R = 0.4
R = 0.5
our work: R =0.05
our work: R =0.5
Experimental Results
![Page 30: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/30.jpg)
FPGA and GPU:
To GPU & software: we use computation-reuse
technique to exploit the coarse-grained parallelism,
and the fine-grained parallelism can be only exploited
by FPGA.
To FPGA: we use both lower bound technique and
computation-reuse technique.
Experimental Results
![Page 31: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/31.jpg)
Conclusions and Future Work
Conclusions
IOT systems propose a lot of time series data
Sensors and computing clusters (the cloud) have different
requirements on tasks, so the problem is how to design a proper
data manage system in order to help people to use these data
For similarity search, which is a basic task for understanding and
analyzing the streaming time series data, we proposed an FPGA
acceleration architecture.
Future Work
Explore the System Architecture for timing series data
analysis to support the IOT data management system
• Find the system arch patterns, and design the AS-system.
![Page 32: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/32.jpg)
Similarity Search
Correlation Discovery
Classification Clustering
Motif Discovery
Novelty/Anomaly detection
Rule Discovery
Segmentation
Visualization
Data Privacy
Prediction
Burst Detection
No history data involved
May have real time req History data analyses
Future Work
![Page 33: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/33.jpg)
Reference
1. T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria,
and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time
warping. SIGKDD, 2012.
2. D. Sart, A. Mueen, W. Najjar, V. Niennattrakul, and E. Keogh. Accelerating Dynamic Time
Warping Subsequence Search with GPUs and FPGAs. ICDM 2010.
3. Y. Sakurai, C. Faloutsos and M. Yamamuro, Stream Monitoring under the Time Warping
Distance. ICDE 2007.
4. Y. Zhang, K. Adl, and J. Glass. Fast spoken query detection using lower-bound Dynamic
Time Warping on Graphical Processing Units. ICASSP 2012, 5173 – 5176.
5. H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. J. Keogh. 2008. Querying and
mining of time series data: experimental comparison of representations and distance
measures. PVLDB 1, 2, 1542-52.
6. S. Srikanthan, A.Kumar, and R. Gupta. 2011. Implementing the dynamic time warping
algorithm in multithreaded environments for real time and unsupervised pattern discovery.
IEEE ICCCT, 394-398.
7. M. Grimaldi, D. Albanese, G. Jurman and C. Furlanello. Mining Very Large Databases of
Time-Series: Speeding up Dynamic Time Warping using GPGPU. NIPS 2009 Workshop
8. N. Takhashi, T. Yoshihisa, Y. Sakurai and M. Kanazawa. A Parallelized Data Stream
Processing System using Dynamic Time Warping Distance. CISIS 2009
9. A. Fu, E. Keogh, L. Lau, C. Ratanamahatana, and R. Wong. 2008. Scaling and time
warping in time series querying. VLDB J. 17, 4, 899-921
![Page 34: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/34.jpg)
Department of Electronic Engineering, Tsinghua University
Nano-scale Integrated Circuit and System Lab.
Thank you !
37
For other domain specific accelerations, such as graph theoretic algorithms, sparse matrix
decomposition, search apps, video apps. Please refer to my webpage:
http://nics.ee.tsinghua.edu.cn/people/wangyu/
![Page 35: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/35.jpg)
John D. Davis Researcher Microsoft Research Silicon Valley In Collaboration with Chuck Thacker, Eric Chung, Srinidhi Kestur, Lintao Zhang, Fang Yu, Zhangxi Tan, & Ollie Williams
![Page 36: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/36.jpg)
2
Doubling of transistors
every 18-24 months
2X Compute Capability &
Efficiency
Innovative Applications
Miniaturization Lowered Costs
etc…
![Page 37: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/37.jpg)
3
10
100
1000
10000
1982 1985 1988 1991 1995 1998 2001 2005 2008 2011 2014
Clo
ck F
requency (
MH
z)
15 GHz Processor (100 Watts)
“The Multicore Revolution”
Microprocessor Trends
Source: http://cpudb.stanford.edu
![Page 38: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/38.jpg)
4
1
10
100
2004 2005 2006 2008 2009 2010 2012
Co
re C
ou
nt
x Fr
equ
ency
(n
x G
Hz)
Circa 2005 Multicore Trends
16 cores @ 3.6GHz (100W)
The Power Wall
![Page 39: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/39.jpg)
5
Hardware Specialization
Capabilities (Battery Life, Performance, Killer Apps)
Today Future Past No Longer Can Rely on General Purpose Hardware
Improvements to Enable More Capabilities
![Page 40: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/40.jpg)
6 * Source: Ning Zhang and Bob Brodersen, ISSCC data
10-100X Gap in Efficiency Between General Purpose Processors and Dedicated Hardware
![Page 41: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/41.jpg)
7 7
Motivation
HW Accelerators & Goals
Parallel SAT Solver
Matrix-Vector Multiplication Engine
Conclusions
![Page 42: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/42.jpg)
8 8
HW accelerators Applications (PSAT)
Libraries (MVM)
Language
Common computation architectures to broaden accelerator utility.
Customized memory architecture
Compressed data representations
Precision vs. energy efficiency
![Page 43: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/43.jpg)
9 9
Transistors are abundant, power is scarce Utilize abundant silicon for FPGA fabrics Energy efficient and post-silicon flexibility
Challenges
FPGAs incur large reconfiguration overheads Must provide significant advantages over other architectures (many-core, GPGPU) What are the right applications?
Exploration enabled by the BEE3
![Page 44: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/44.jpg)
10 10
We built it! Vehicle for research in computer system architecture
“BEE3”: Berkeley Emulation Engine, version 3 4 FPGAs (3 types of FPGAs)
Logic-focused, DSP-focused or Embedded Processor-focused
64 GB DDR2 DRAM 2 DRAM channels per FPGA, 2 DIMMs per channel
FPGA Ring Interconnect Plenty of I/O to connect to the BEE3
10 GbE, 1 GbE, PCI-Express, QSH
![Page 45: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/45.jpg)
11
Two design styles
Directly translate SW → HW (generally FSMs)
Easy to debug and compare to SW system
Composable building blocks
Leverage domain (App + HW) expertise
Target FPGA hard macros
General requirements for FPGAs
No reconfiguration
Generalized solution → library-like functionality
![Page 46: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/46.jpg)
12 12
![Page 47: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/47.jpg)
13 13
Determine whether a given boolean formula can be true
3SAT is the first known NP-complete problem Often used to prove that other problems are NP-complete
Applications of SAT: Formal verification of circuit design
Cryptography attacks
Solve other NP-complete problems
SAT solver can take a long time Some times hours, days, or even weeks
)()( 128951 XXXXXX
![Page 48: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/48.jpg)
14 14
CPU FPGA
Loop{ }
10% time
90% time >1000 CPU
cycle/Inference
Branch decision
--set a variable Deduce --loop through all related clauses, obtain inferred
variables; Conflict Analysis
-backtrack, or finish
Software Solver
FSB/HT/PCIe
951 XXX
X1=0 (decision), X5=0 => X9=0
![Page 49: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/49.jpg)
15 15
Previous work Map clauses to logic directly
Hours of reconfiguration time
This is no longer only a logic problem! It’s an architecture problem!
Design for FPGA fabric
Push computation close to storage (state)
Transform logic problem into a memory indexing problem
![Page 50: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/50.jpg)
16 16
1: CPU communication module
2: Implication queue
3: Parallel inference engines
4: Inference multiplexer
5: Conflict inference detection
![Page 51: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/51.jpg)
17 17
An application-specific architecture Reprogram memories for new instance
Avoid global signal wires and careful pipelining
Support tens of thousands of variables and clauses per FPGA
Learned clause support
BCP 5~16 times faster than the conventional software based approach
![Page 52: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/52.jpg)
18 18
![Page 53: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/53.jpg)
19 19
![Page 54: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/54.jpg)
20 20
𝒚 = 𝑨𝒙
Matrix-Vector-Multiply is Critical HPC Kernel 10s of papers published/year on this topic
Existing works on GPU/CPU/FPGA
Performance sensitive to matrix sparsity and formats Processor-centric data formats High power consumption (GPU/CPU)
FPGA opportunities
Exploit custom variable-length formats Low power, large memory configurations Efficient, robust resource utilization
![Page 55: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/55.jpg)
21 21
Build single FPGA bitfile library for 𝒚 = 𝑨𝒙
Handle large-scale inputs (>GB)
Avoid costly run-time reconfiguration
Exploit bit-level manipulation
Dense and sparse inputs
Process multiple sparse matrix formats COO, CSR, Dense, DIA, ELL, etc.
![Page 56: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/56.jpg)
22 22
A00 A01 A02 A03
A10 A11 A12 A13
A20 A21 A22 A23
A30 A31 A32 A33
A40 A41 A42 A43
x0
x1
x2
x3
y0
y1
y2
y3
y4
= ×
𝒚 = 𝑨𝒙
![Page 57: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/57.jpg)
23 23
PE
PE
PE
Universal Format
Decoder
Gaxpy PIPE 0
PE
Rows
Matrix Memory
(A)
Vector Memory
(y)
Tiled DMA Engine
Vector Memory (x)
Gaxpy Control
Gaxpy PIPE 3
Gaxpy PIPE 2
Gaxpy PIPE 1
![Page 58: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/58.jpg)
24 24
A00 A03
A12
A21
A32
A41 A43
x0
x1
x2
x3
y0
y1
y2
y3
y4
= ×
𝒚 = 𝑨𝒙
![Page 59: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/59.jpg)
25 25
PE
PE
PE
Universal Format
Decoder
Gaxpy PIPE 0
PE
Gaxpy PIPE 1
Gaxpy PIPE 2
Gaxpy PIPE 3
Rows
Vector Memory
(y)
Tiled DMA Engine
Private Cache (x)
Private Cache (x)
Private Cache (x)
Private Cache (x) A streams Gaxpy Control
A data streams
![Page 60: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/60.jpg)
26 26
1 4 3 7 2 9 5 8 7 Data Array
1 0 4 0
3 7 0 0
0 0 2 9
5 8 0 7
1 4
3 7
2 9
5 8 7
![Page 61: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/61.jpg)
27 27
COO Overhead = 𝑵𝒐𝒏𝒛𝒆𝒓𝒐𝒔 × (𝟒𝑩 + 𝟒𝑩)
Data Array 1 4 3 7 2 9 5 8 7
Row Index
Column Index
0 0 1 1 2 2 3 3 3
0 2 0 1 2 3 0 1 3
![Page 62: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/62.jpg)
28 28
CSR Overhead = 𝑵𝒐𝒏𝒛𝒆𝒓𝒐𝒔 × 𝟒𝑩 + 𝑹𝒐𝒘𝒔 × 𝟒𝑩
Data Array 1 4 3 7 2 9 5 8 7
Row Pointer
Column Index
0 2 4 6 9
0 2 0 1 2 3 0 1 3
![Page 63: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/63.jpg)
29 29
Column Metadata
1 4 *
3 7 *
2 9 *
5 8 7
Data w/ Padding
0 2 *
0 1 *
2 3 *
0 1 3
ELL Overhead = 𝟒𝑩 × 𝑹𝒐𝒘𝒔 × 𝒌 + 𝑫𝒂𝒕𝒂𝑷𝒂𝒅 × 𝟖𝑩
k=3
![Page 64: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/64.jpg)
30 30
![Page 65: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/65.jpg)
31 31
Data Array
[ 1 (0,1) 1 (0,1) 1 (0,4) 1 1 1 1 (0,1) 1 ]
[ 1 (0,1) 1 (0,1) 1 (0,4) 1 1 1 1 (0,1) 1 ]
[ 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1 ]
Bit Vector (BV)
BV Overhead = 𝑹𝒐𝒘𝒔 × 𝑪𝒐𝒍𝒔 𝒙 𝟏𝒃𝒊𝒕
1 4 3 7 2 9 5 8 7
1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1
![Page 66: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/66.jpg)
32 32
Data Array
Bit Vector (BV) Compressed BV (CBV)
1 4 3 7 2 9 5 8 7
1 (0,1) 1 (0,1) 1 1 (0,4) 1 1 1 1 (0,1) 1
1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1
32-bit “zero” fields
CBV Overhead = 𝑵𝒐𝒏𝒛𝒆𝒓𝒐𝒔 × 𝟏𝒃𝒊𝒕 +𝒁𝒆𝒓𝒐𝑪𝒍𝒖𝒔𝒕𝒆𝒓𝒔 × 𝟑𝟐𝒃𝒊𝒕
![Page 67: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/67.jpg)
33 33
Data Array 1 4 3 7 2 9 5 8 7
1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1
4-bit header + {4,8,…,32}-bit zero field
CVBV Overhead ~ input-dependent
1 (0,1) 1 (0,1) 1 1 (0,4) 1 1 1 1 (0,1) 1
Compressed Variable BV (CVBV)
![Page 68: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/68.jpg)
34 34
![Page 69: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/69.jpg)
35 35
Universal Format/
CVBV Decoder
Gaxpy PIPE 0
Gaxpy PIPE 1
Gaxpy PIPE 2
Gaxpy PIPE 3
Rows
Vector
Memory
(y)
private cache (x)
private cache (x)
private cache (x)
private cache (x) Gaxpy
Control
A data streams
![Page 70: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/70.jpg)
36 36
Specify matrix format descriptors Fixed/variable length, padding, index/ptr, etc.
Translates row/column into sequence #
Generate *BV (reduce storage/BW)
Generate modified COO (consumed by PEs) Row index, nonzero count per row, col indices
![Page 71: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/71.jpg)
37 37
![Page 72: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/72.jpg)
38 38
PEs LUT
(% area)
RAM
(% area)
DSP
(% area)
GFLOP
s
(Peak)
GFLOPs
(off-chip)
BW
(% peak)
Dense V5-LX155T 16 72% 86% 88% 3.1 0.92 64.7
Dense V6-LX240T 32 71% 63% 56% 6.4 1.14 80
Dense+Sparse V5 16 74% 87% 91% 3.1 - -
Sparse
Inputs
V5-LX155T (Ours) HC-1 (32 PE)1 Tesla S10702
GFLOPS / BWUsed GFLOPS /
BWUsed
GFLOPS /
BWUsed
dw8192 0.10 / 10.3% 1.7 / 13.2% 0.5 / 3.1%
t2d_q9 0.15 / 14.4% 2.5 / 19.3% 0.9 / 5.7%
epb1 0.17 / 17.1% 2.6 / 20.2% 0.8 / 4.9%
raefsky1 0.20 / 18.5% 3.9 / 29.0% 2.6 / 15.3%
psmigr_2 0.20 / 18.6% 3.9 / 29.6% 2.8 / 16.7%
torso2 0.04 / 4.0% 1.2 / 9.1% 3.0 / 18.3% [1] Nagar et al., A Sparse Matrix Personality for the Convey HC-1, FCCM’11 [2] Bell et al., Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, SC’09
![Page 73: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/73.jpg)
39 39
We defined CVBV/CBV sparse format 25% reduction in storage/bandwidth compared to well-known CSR
Exploits bit-level manipulation of FPGA
Single bit file for dense AND sparse MVM Universal matrix format decoder
DMA and caches for memory management
Stall-free accumulator
Scalable design, implemented on multiple platforms
![Page 74: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/74.jpg)
40 40
Demonstrated HW as SW library replacement Bottom up approach (Time consuming/not scalable)
Pros: Common computation architecture and input insensitive
Customized memory architecture
Compressed data representations
Other energy efficiency tools to exploit
Cons: Time consuming: requires designers
![Page 75: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/75.jpg)
41 41
Moving beyond manycore and GPUs
Need tools to automate HW/SW co-design Granularity?
Algorithm specification?
HW building blocks? IP Integration issues
Future FPGA Architecture? More custom building blocks?
Software/OS support Fast communication and synchronization
Accelerators as 1st class building blocks
![Page 76: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/76.jpg)
42 42
![Page 77: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/77.jpg)
43
© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
![Page 78: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/78.jpg)
44
![Page 79: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/79.jpg)
45 45
Inference Engine
Clause Index
Walk
(ieng.vhdl)
Literal Value
Inference
(ieng.vhdl)
Walk Table RAM
18 *2K BRAM
(ieng_ram.vhdl)
Clause Status
Table RAM
18 * 1K BRAM
(ieng_ram.vhdl)
Inference Engine
Clause Index
Walk
(ieng.vhdl)
Literal Value
Inference
(ieng.vhdl)
Walk Table RAM
18 *2K BRAM
(ieng_ram.vhdl)
Clause Status
Table RAM
18 * 1K BRAM
(ieng_ram.vhdl)
Inference
Result
Multiplexer
(ibus.vhd)
Conflict Inference Detection
2-stage Pipeline
(conflict_detect.vhdl)
2 * 8K bits BRAM (xN)
Global Variable Status
(Conflict_detect.vhdl)
Literal to Variable
(External RAM)
To CPU
Communication
Buffer TX
BRAM
36*1K
BRAM
36*1K
36 x 2 bits
Communication
Buffer RX
From CPU
18 * 4 bits
36 / 36*2 bit
(Enqueue / to CPU)
Demux FIFO
Overflow
Buffer
(DRAM)
36*1K
Undo
Undo
Enqueue
New Decision/
Undo
16-Entry FIFO
(distributed RAM)
#1
#64
Inference Engine
Clusters (16 x 4)
36*1K
18 * 1K FIFO (x4)
16-Entry FIFO
(distributed RAM)
Demux Bus
(ibus.vhd)
CPU Communication
Dispatch Unit
Parallel Inference Engine Clusters
Conflict Detection
X1=1 X1=1
X1=1 X1=1
X3=1
X5=0
X3=1 X5=0
X3=1
X5=0
X3=1
X5=0 X5=0
X3=1
X5=0
![Page 80: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/80.jpg)
46 46
0
2000
4000
6000
8000
10000
12000
14000
3-SAT 4-SAT 5-SAT 6-SAT
Co
nvert
ed
CP
U c
ycle
s
CPU
FPGA (HT)
FPGA (PCIe)
BCP 6.7 – 38.6 times faster than the conventional software based approach
![Page 81: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming](https://reader030.vdocuments.net/reader030/viewer/2022041222/5e0c84a09c0b1a68e8260750/html5/thumbnails/81.jpg)
47 47
BCP 5~16 times faster than the conventional software based approach
0
500
1000
1500
2000
2500
3000
3500
Co
nvert
ed
CP
U c
ycle
s p
er
imp
licati
on
Software
FPGA (tree in BRAM)
FPGA( (tree in BRAM and distributed RAM)