Accelerating a Two-Stage Query by Singing/Humming System Using GPUs (加速以 GPU 為運算核心的二階段哼唱選歌系統)
Student: Andy Chuang (莊詠翔)
Advisors: Prof. Jyh-Shing Roger Jang (張智星)
          Prof. Jason S. Chang (張俊盛)
Department of Computer Science, National Tsing Hua University
2
Outline
‧ Introduction
‧ Related work
‧ System flow
‧ Methods
‧ Experimental results
‧ Conclusions and future work
3
Introduction
QBSH (Query by Singing and Humming)
‧ Description: the user sings or hums, and the system returns the most similar song from the database.
‧ Problem: the system usually takes too long to respond when the database is huge.
Strategies
‧ Implement a new linear scaling method that matches the properties of the GPU.
‧ Reduce the database to avoid unnecessary comparisons.
‧ Combine linear scaling with dynamic time warping on multiple devices (GPUs) rather than one to speed up the computation.
4
Related Work (1/2)
MIRACLE (Music Information Retrieval Acoustically with Clustered and paralleL Engines)
‧ Jang, Chen, and Kao, “MIRACLE: A Music Information Retrieval System with Clustered Computing Engines”, ISMIR 2001.
Linear scaling (LS)
‧ Wang, Chen, Kuo, Chiu, and Jang, “Accelerating Query by Singing/Humming on GPU: Optimization for Web Deployment”, ICASSP 2012.
‧ Lin, “Speeding Up Query-by-Singing/Humming Systems Based on Linear Scaling”, National Tsing Hua Univ., 2012.
5
Related Work (2/2)
Dynamic time warping (DTW)
‧ Ferraro, Hanna, Imbert, and Izart, “Accelerating Query-by-Humming on GPU”, ISMIR 2009.
‧ Kuo, “Accelerating Query By Singing/Humming on GPU”, National Tsing Hua Univ., 2013.
Hybrid LS+DTW
‧ Zou, “Query By Singing/Humming Using Combination of Classifiers”, National Tsing Hua Univ., 2008.
‧ Kao, “A Two-Stage Query by Singing/Humming System on GPU”, National Tsing Hua Univ., 2013.
6
System Flow
[Flowchart with three lanes: User, CPU, GPU]
‧ Sing/hum the song
‧ Detect endpoints and preprocess audio
‧ Convert to frame-based data
‧ Load database
‧ Perform linear scaling
‧ Reserve different amounts of candidate songs
‧ Perform dynamic time warping
‧ Post-process the ranking
‧ Show top-N song info.
7
Linear Scaling
Example: 十年 by 陳奕迅 (Eason Chan)
[Figure: plot with a Time axis]
Perform key transposition before using linear scaling.
8
LS Implementation Detail (1/3)
Processing steps:
‧ Scale the input pitch vector to 31 versions of different sizes
‧ Put the input pitch vector into constant memory
‧ Compute distance
‧ Sort the results and return
Our research focuses on the “Compute distance” step.
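The steps on this slide (scale the query to 31 lengths, then compute a distance against each song) can be sketched on the CPU as follows. This is a minimal Python sketch under stated assumptions, not the authors' CUDA kernel: the scaling range of 0.5 to 2.0, mean subtraction as the key transposition, the mean absolute difference as the distance, and all function names are hypothetical. Only the number of scaled versions (31) comes from the slide.

```python
def resample(vec, new_len):
    """Linearly interpolate vec to new_len samples."""
    if new_len == 1:
        return [vec[0]]
    step = (len(vec) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        pos = k * step
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out


def shift_to_zero_mean(vec):
    """Assumed key transposition: remove the mean pitch."""
    mean = sum(vec) / len(vec)
    return [v - mean for v in vec]


def linear_scaling_distance(query, song, n_scales=31, lo=0.5, hi=2.0):
    """Smallest mean absolute difference over all scaled versions of the query.

    Each scaled query is compared with the beginning of the song.
    The range lo..hi is an assumption; the slide only gives n_scales = 31.
    """
    best = float("inf")
    for s in range(n_scales):
        factor = lo + (hi - lo) * s / (n_scales - 1)
        length = round(len(query) * factor)
        if length < 2 or length > len(song):
            continue  # scaled query would not fit this song
        q = shift_to_zero_mean(resample(query, length))
        ref = shift_to_zero_mean(song[:length])
        dist = sum(abs(a - b) for a, b in zip(q, ref)) / length
        best = min(best, dist)
    return best


query = [60, 60, 62, 62, 64, 64]
# A song that starts with the query sung 1.5x slower and 5 semitones higher.
song = [v + 5 for v in resample(query, 9)] + [70] * 5
print(linear_scaling_distance(query, song))
```

In the GPU version described on the following slides, the inner comparison is what each thread performs for its own segment of a song.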
9
LS Implementation Detail (2/3)
Each block computes one song; each thread in a block computes a different segment of the song.
Example of a single block: block dimension = 64, segment size = 375, frame rate = 31.25.
[Figure: pitch vector of one song with segments assigned to threads 0 to 63. Thread 0 covers pitches 0 to 374, thread 1 covers 46 to 420, thread 2 covers 157 to 531, ..., thread 63 covers 2034 to 2408.]
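The thread-to-segment mapping in this example can be checked with a few lines of Python. The start indices are the ones shown on the slide; how the starts are chosen is not stated here, so the sketch takes them as given. The function name is hypothetical.

```python
FRAME_RATE = 31.25   # pitch frames per second (from the slide)
SEGMENT_SIZE = 375   # pitch frames per segment (from the slide)


def segment_ranges(starts, segment_size=SEGMENT_SIZE):
    """Return the inclusive (first, last) pitch index handled by each thread."""
    return [(s, s + segment_size - 1) for s in starts]


# Start indices of threads 0, 1, 2 and 63 as shown in the figure.
print(segment_ranges([0, 46, 157, 2034]))
# Duration covered by one segment, in seconds.
print(SEGMENT_SIZE / FRAME_RATE)
```

With these settings each thread compares the query against 375 / 31.25 = 12 seconds of the song.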
10
LS Implementation Detail (3/3)
[Figure: the database pitch vector stores all songs back to back (Song 0, Song 1, ..., Song 999). Block b processes song b with threads 0 to 63: block 0 starts its segments at 0, 46, ..., 2034; block 1 at 5521, 5703, 5817, ...; block 999 at 6124567, 6124683, 6124784, ...]
11
Implementation of LS - method 1
Each thread copies its part of the database pitch vector from global memory to its own local memory, then reads pitches from local memory while computing.
[Figure: in block 0, thread 0 copies pitches 0 to 374, thread 1 copies 46 to 420, ..., and thread 63 copies 2034 to 2408 from global memory into its own local memory.]
12
Implementation of LS - method 2
Threads in the same block compute the same song, so the pitch vector of that song is copied from global memory to shared memory, and each thread reads pitches from shared memory when needed.
[Figure: block 0 holds the shared pitch vector (0 ... 46 ... 157 ... 2034 ...) in shared memory; thread 0 reads positions 0, 1, 2, ..., thread 1 reads 46, 47, 48, ..., and thread 63 reads 2034, 2035, 2036, ...]
13
Comparison of the Two LS Methods
Method 1 (using local memory)
‧ Advantages: intuitive; easy to implement
‧ Drawbacks: local memory is slower (off-chip); the same data must be copied from global memory several times
Method 2 (using shared memory)
‧ Advantages: shared memory is faster (on-chip); the same data is copied from global memory only once
‧ Drawback: bank conflicts
14
Bank Conflict (1/2)
Bank conflict: threads in the same half-warp (shown in the same color in the figure) simultaneously access the same bank but different addresses of shared memory.
[Figure: successive 4-byte data words map to bank ids 0 to 31 and then wrap around to bank 0. Three access patterns by thread id are shown: two with no bank conflict and one with a bank conflict.]
15
Bank Conflict (2/2)
Bank conflict between thread 2 and thread 5
Thread id:      0    1    2    3    4    5
Pitch vector:   0    46   157  315  482  573
Bank id:        0    14   29   27   2    29
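The bank ids in this example follow from taking each pitch index modulo the number of banks. The sketch below reproduces them in Python, assuming 32 banks (the figure shows bank ids 0 to 31) and one 4-byte pitch value per bank word; the function names are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption based on the bank ids 0..31 in the figure


def bank_id(index, num_banks=NUM_BANKS):
    """Bank holding the 4-byte word at the given shared-memory index."""
    return index % num_banks


def find_conflicts(indices, num_banks=NUM_BANKS):
    """Map each conflicting bank to the threads that hit it.

    indices[t] is the word index read by thread t. Threads reading the
    same address are not counted as a conflict.
    """
    by_bank = defaultdict(list)
    for thread, idx in enumerate(indices):
        by_bank[bank_id(idx, num_banks)].append(thread)
    return {
        bank: threads
        for bank, threads in by_bank.items()
        if len({indices[t] for t in threads}) > 1
    }


starts = [0, 46, 157, 315, 482, 573]  # threads 0..5 on the slide
print([bank_id(i) for i in starts])
print(find_conflicts(starts))
```

Threads 2 and 5 both land in bank 29 (157 mod 32 = 573 mod 32 = 29), which matches the conflict marked on the slide.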
16
Implementation of LS - method 3
Append pitches after some notes to shift banks
Trade-off between performance and recognition rate
Thread id:      0    1    2    3    4    5
Pitch vector:   0    46   157  315  482  574
Bank id:        0    14   29   27   2    30
No bank conflict: a pitch is appended after index 573, so thread 5 now starts at 574.
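One way to picture the padding idea is a greedy pass over the segment starts that appends one pitch at a time until a start lands in a free bank. This is only an illustrative Python sketch: the slides do not say how the appended positions are chosen, so the greedy rule, the 32-bank assumption, and the function name are all hypothetical.

```python
NUM_BANKS = 32  # assumption based on the bank ids 0..31 in the figure


def pad_starts(starts, num_banks=NUM_BANKS):
    """Shift segment starts so that no two of them share a bank.

    Every appended pitch moves all later starts by one position.
    Returns the new starts and the total number of appended pitches.
    """
    if len(starts) > num_banks:
        raise ValueError("more starts than banks: conflicts cannot all be removed")
    used = set()
    shift = 0
    padded = []
    for s in starts:
        pos = s + shift
        while pos % num_banks in used:
            pos += 1     # append one pitch before this segment
            shift += 1
        used.add(pos % num_banks)
        padded.append(pos)
    return padded, shift


print(pad_starts([0, 46, 157, 315, 482, 573]))
```

For the slide's example a single appended pitch is enough: the last start moves from 573 (bank 29) to 574 (bank 30). Because appended pitches change the stored melody, this is where the trade-off with recognition rate comes from.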
17
Database Reduce (1/3)
Remove unnecessary comparisons caused by duplicates of the same song: saves 6.2% of computing time (1265 of 20395 songs removed).
First stage
‧ Song length difference is less than 5 seconds
‧ Pitch difference (after one-shot key transposition) is less than 2 semitones per frame of frame-based pitch data
Do the two songs have the same song name and singer?
‧ Yes: second stage A (definitely the same song); 832 songs must be removed
‧ No: second stage B (several cases); 433 songs must be removed
18
Database Reduce (2/3)
Cases in the first-stage results where the song name or singer differs
Handles the “Song name error”, “Singer error”, and “Abbreviation” cases
Case              Example 1           Example 2
Song name error   A Day In The Life   Day In The Life
Singer error      Procal Harum        Procol Harum
Abbreviation      1st Of 5th          First Of Fifth
Cover             想你想斷腸          補破網
19
Database Reduce (3/3)
Remove unnecessary comparisons caused by repeating patterns: reduces the database size by 22.01%.
Uses a correlative matrix T, where S is the pitch vector and i, j are the matrix indices.
For the first row:
$$T_{1,j} = \begin{cases} 1, & \text{if } S_1 = S_j \\ 0, & \text{otherwise} \end{cases}$$
For the other rows:
$$T_{i,j} = \begin{cases} T_{i-1,j-1} + 1, & \text{if } S_i = S_j \\ 0, & \text{otherwise} \end{cases}$$
Example for S = (72, 68, 72, 68, 72, 60); X marks the diagonal and only the upper triangle is shown:
      72  68  72  68  72  60
72    X   0   1   0   1   0
68        X   0   2   0   0
72            X   0   3   0
68                X   0   0
72                    X   0
60                        X
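The recurrence on this slide can be written directly as a small Python function. It is a sketch for checking the example only: the function name is hypothetical, the diagonal (marked X on the slide) is stored as 0, and how the matrix is then used to cut repeated patterns from the database is not covered here.

```python
def correlative_matrix(s):
    """Upper-triangular correlative matrix T for pitch sequence s.

    T[i][j] (i < j) is the length of the matching run that ends at
    positions i and j, following T[i][j] = T[i-1][j-1] + 1 when
    s[i] == s[j] and 0 otherwise. Indices are 0-based here.
    """
    n = len(s)
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] == s[j]:
                previous = t[i - 1][j - 1] if i > 0 else 0
                t[i][j] = previous + 1
    return t


pitches = [72, 68, 72, 68, 72, 60]
for row in correlative_matrix(pitches):
    print(row)
```

The largest entry, 3, sits in the third row and fifth column, which says that the three-pitch pattern 72, 68, 72 ending at position 3 occurs again ending at position 5.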
20
Multiple Device
Each device receives 1/n of the database and computes the similarity to the query input, where n is the number of devices.
[Figure: system flow with multiple devices]
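A simple way to realize the 1/n split is to give each device a contiguous range of songs and merge the per-device results on the host. The Python sketch below assumes such a contiguous, near-equal split and a merge by ascending distance; the slides only state that each device gets 1/n of the database, so both choices and the function names are hypothetical.

```python
def partition(n_songs, n_devices):
    """Split song indices 0..n_songs-1 into n_devices contiguous ranges.

    Returns half-open (start, end) ranges whose sizes differ by at most one.
    """
    base, extra = divmod(n_songs, n_devices)
    ranges = []
    start = 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def merge_top_n(per_device, n):
    """Merge per-device lists of (distance, song_id) and keep the n closest."""
    merged = sorted(pair for results in per_device for pair in results)
    return merged[:n]


# The reduced database of 19130 songs on the 3 GPUs of the test machine.
print(partition(19130, 3))
print(merge_top_n([[(0.5, 1), (0.9, 2)], [(0.1, 7)]], 2))
```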
21
Experimental Environment (1/2)
Experimental environment (NCHC Formosa 5 Cluster)
OS: CentOS 6 x86_64
RAM: 96 GB DDR3 ECC
CPU: Intel Xeon X5670, six cores, 2.93 GHz, x2
GPU: NVIDIA Tesla M2070 (448 cores), x3
Language: C/C++
CUDA version: 5.5
Compute capability: 2.0
22
Experimental Data
Database: 19130 songs (after reduction)
Corpus: CHT Pop Song; NTHU & NTU student recordings in 2013
Format: WAV, 16 kHz, 16 bits, mono
Number of files: 2183 (1614 from NTU, 318 from NTHU, and 251 from CHT)
WAV file length: 8-9 seconds
23
Experiment 1 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 2
When the pitch vector size in shared memory is 2000 and the block dimension is 128, the computation time is the shortest, 1.427 sec.
[Figure: computation time (sec.) versus shared pitch size (K), with one curve per block dimension: 64, 128, 256, 512, and 1024 (device: 3).]
24
Experiment 2 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 3
When the pitch vector size in shared memory is 5000 and the block dimension is 512, the computation time is the shortest, 1.213 sec.
[Figure: computation time (sec.) versus shared pitch size (K), with one curve per block dimension: 64, 128, 256, 512, and 1024 (device: 3).]
25
Experiment 3 – LS with Different Numbers of Threads Using the Three Methods
For each block dimension, the best case of LS methods 2 and 3 (with a suitable shared pitch size) is faster than method 1.
[Figure: computation time per song (sec.) versus LS block dimension (number of threads per block: 64, 128, 256, 512, 1024) for LS 1, LS 2, and LS 3 (device: 3). The chart is annotated with shared pitch sizes 2000, 2000, 3000, 5000, and 11000.]
26
Experiment 4 – LS+DTW with Different Numbers of Devices
With more devices, the computation time drops because each GPU only needs to search about 1/n of the database.
Computation time per song (sec.) by number of devices:
Method                                                 1 device   2 devices   3 devices
LS 1, block dimension: 128                             6.76       3.52        2.41
LS 2, block dimension: 128, shared pitch size: 2000    5.72       3.08        2.12
LS 3, block dimension: 512, shared pitch size: 5000    5.13       2.69        1.84
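To see how close these timings come to the ideal 1/n behavior, the speed-up relative to one device can be computed from the chart values. The short Python sketch below assumes the nine data labels belong to LS 1, LS 2, and LS 3 in that order, each listed for 1, 2, and 3 devices; that grouping is inferred from the chart layout, not stated on the slide.

```python
# Computation time per song (sec.) for 1, 2 and 3 devices.
# Assumed grouping of the chart's data labels by method.
times = {
    "LS 1": [6.76, 3.52, 2.41],
    "LS 2": [5.72, 3.08, 2.12],
    "LS 3": [5.13, 2.69, 1.84],
}


def speedups(series):
    """Speed-up of each measurement relative to the single-device time."""
    return [round(series[0] / t, 2) for t in series]


for method, series in times.items():
    print(method, speedups(series))
```

Under this reading, two devices give a speed-up of roughly 1.9 and three devices roughly 2.7 to 2.8, slightly below the ideal values of 2 and 3.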
27
Conclusions and Future Work (to be completed)
Conclusions
‧ Computation time
  ‧ LS method 2 is faster than method 1, even though bank conflicts exist.
  ‧ The computation time is almost 1/n when using n devices (GPUs).
Future work
‧ Advanced database purification to remove bad songs
  ‧ Abnormal melody (e.g., instrumental only)
  ‧ Wrong melody or song name
‧ Improve LS method 2 to reduce bank conflicts on the Kepler architecture
  ‧ The definition of a bank conflict differs from that of the Tesla and Fermi architectures
  ‧ Use a different method for appending pitches
28
Thank you!! & DEMO (http://miracle.mirlab.org:8080/miracle)
29
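Methods: Dynamic Time Warping
t: input pitch vector
r: reference pitch vector
Local paths: 27-45-63 degrees
DTW recurrence, where the three predecessors correspond to the 27-45-63 degree local paths:
$$D(i,j) = |t(i) - r(j)| + \min \{ D(i-2, j-1),\ D(i-1, j-1),\ D(i-1, j-2) \}$$
[Figure: DTW grid with t(i-1), t(i) along the i axis and r(j-1), r(j) along the j axis, showing the local paths into D(i,j).]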
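A direct table-filling version of DTW with 27-45-63 degree local paths is sketched below in Python. It assumes the usual predecessor cells (i-2, j-1), (i-1, j-1), and (i-1, j-2) for those three path angles, and it anchors the path at the first and last elements of both vectors; the slide does not state the boundary conditions, so the anchoring and the function name are assumptions.

```python
import math


def dtw_27_45_63(t, r):
    """DTW distance between pitch vectors t and r with 27-45-63 local paths.

    Uses 0-based indices. Returns math.inf when no allowed path connects
    the first cell to the last cell (for example when one vector is more
    than about twice as long as the other).
    """
    n, m = len(t), len(r)
    d = [[math.inf] * m for _ in range(n)]
    d[0][0] = abs(t[0] - r[0])
    for i in range(1, n):
        for j in range(1, m):
            best = d[i - 1][j - 1]                 # 45-degree step
            if j >= 2:
                best = min(best, d[i - 1][j - 2])  # one step in i, two in j
            if i >= 2:
                best = min(best, d[i - 2][j - 1])  # two steps in i, one in j
            if best < math.inf:
                d[i][j] = abs(t[i] - r[j]) + best
    return d[n - 1][m - 1]


print(dtw_27_45_63([60, 62, 62], [60, 62]))  # a locally stretched note
print(dtw_27_45_63([60, 64], [60, 62]))
```

Unlike linear scaling, the warping can differ from frame to frame, which is why DTW copes with non-uniform tempo variation at a higher computational cost.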
30
Note & Pitch data charts
Average number of notes: 913.32, σ: 450.99
Average number of pitch data: 5849.94, σ: 1878.42
31
Note & Pitch data charts (After Repeating Pattern Removal)
Average number of notes: 694.7, σ: 386
Average number of pitch data: 4562.54, σ: 1773.6
32
Detail of Second-stage B
Song name   Singer      Handling
Different   Same        Manually check
Different   Unknown     1. Compute the edit distance between the two song names. 2. Check the songs whose edit distance is smaller than a threshold.
Same        Different   Manually check
Same        Unknown     Remove the data with the unknown singer
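The edit-distance check for the "Different / Unknown" row can be illustrated with the standard Levenshtein distance. This Python sketch is an assumption about the details: the slides do not say which edit-distance variant or threshold was used, so the threshold value of 3, the case folding, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def needs_check(name_a, name_b, threshold=3):
    """Flag a pair of song names for checking (hypothetical threshold)."""
    return edit_distance(name_a.lower(), name_b.lower()) < threshold


print(edit_distance("A Day In The Life", "Day In The Life"))
print(edit_distance("Procal Harum", "Procol Harum"))
print(needs_check("1st Of 5th", "First Of Fifth"))
```

With this threshold, the "Song name error" and "Singer error" examples from the earlier slide are flagged, while the "Abbreviation" example is not, since the two spellings differ in too many characters for a small edit-distance threshold.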
33
System Flow with Multiple Devices (to be completed)
[Flowchart: same boxes and lanes (User, CPU, GPU) as the System Flow slide.]
35
The following are 高瑋's backup slides.
41
Method: Borda Count
Borda count
‧ R: number of candidate songs
‧ M: number of melody recognition methods
‧ r_{ik}: the rank of the k-th candidate song in the i-th melody recognition method
$$D_k = \sum_{i=1}^{M} (R - r_{ik})$$
Borda count example
Song   Score in ranking 1 (ABCD)   Score in ranking 2 (DCAB)   D_k (total score)   Rank
A      3                           1                           4                   1
B      2                           0                           2                   4
C      1                           2                           3                   2
D      0                           3                           3                   2
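The example table can be reproduced with a few lines of Python. Each ranking gives a song R minus its rank, and the totals are then ranked with ties sharing the better position, as in the table. The function names are hypothetical, and the tie handling is inferred from the example (C and D both receive rank 2).

```python
def borda_count(rankings):
    """Total Borda score per song.

    rankings: one sequence of songs per recognition method, best first.
    A song at rank r (1-based) among R candidates scores R - r.
    """
    n_candidates = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for rank, song in enumerate(ranking, start=1):
            scores[song] = scores.get(song, 0) + (n_candidates - rank)
    return scores


def final_ranks(scores):
    """Rank songs by total score; tied songs share the same rank."""
    return {
        song: 1 + sum(other > score for other in scores.values())
        for song, score in scores.items()
    }


scores = borda_count(["ABCD", "DCAB"])
print(scores)
print(final_ranks(scores))
```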
42
Methods: Comparison of Two Methods
The proposed system combines linear scaling with dynamic time warping to accelerate computation.
Type                Linear scaling (LS)                Dynamic time warping (DTW)
Computation time    Faster                             Slower
Tempo variation     Handles uniform tempo variation    Handles non-uniform tempo variation
Key transposition   One-shot                           Heuristic search