Accelerating a Two-Stage Query by Singing/Humming System Using GPUs (加速以 GPU 為運算核心的二階段哼唱選歌系統)

Student: Andy Chuang (莊詠翔)

Advisors: Prof. Jyh-Shing Roger Jang (張智星) and Prof. Jason S. Chang (張俊盛)

Department of Computer Science, National Tsing Hua University

Outline

‧ Introduction
‧ Related work
‧ System flow
‧ Methods
‧ Experimental results
‧ Conclusions and future work

Introduction

QBSH (Query by Singing/Humming)
‧ Description: the user sings or hums, and the system returns the most similar song from the database
‧ Problem: the system usually takes too long to respond when the database is huge

Strategies
‧ Implement a new linear scaling that matches the properties of the GPU
‧ Reduce the database to avoid unnecessary comparisons
‧ Combine linear scaling with dynamic time warping on multiple devices (GPUs) rather than one to speed up the computation

Related Work (1/2)

MIRACLE (Music Information Retrieval Acoustically with Clustered and paralleL Engines)
‧ Jang, Chen, and Kao, “MIRACLE: A Music Information Retrieval System with Clustered Computing Engines”, ISMIR 2001

Linear scaling (LS)
‧ Wang, Chen, Kuo, Chiu, and Jang, “Accelerating Query by Singing/Humming on GPU: Optimization for Web Deployment”, ICASSP 2012
‧ Lin, “Speeding Up Query-by-Singing/Humming Systems Based on Linear Scaling”, National Tsing Hua Univ., 2012

Related Work (2/2)

Dynamic time warping (DTW)
‧ Ferraro, Hanna, Imbert, and Izart, “Accelerating Query-by-Humming on GPU”, ISMIR 2009
‧ Kuo, “Accelerating Query By Singing/Humming on GPU”, National Tsing Hua Univ., 2013

Hybrid LS+DTW
‧ Zou, “Query By Singing/Humming Using Combination of Classifiers”, National Tsing Hua Univ., 2008
‧ Kao, “A Two-Stage Query by Singing/Humming System on GPU”, National Tsing Hua Univ., 2013

System Flow

[Figure: flowchart with three lanes (User, CPU, GPU) containing these steps:]
‧ Sing/hum the song
‧ Detect endpoints and preprocess audio
‧ Convert to frame-based data
‧ Load database
‧ Perform linear scaling
‧ Reserve different amounts of candidate songs
‧ Perform dynamic time warping
‧ Post-process the ranking
‧ Show top-N song info.

Linear Scaling

Example: 十年 by 陳奕迅

[Figure: pitch curves plotted against time for the example song]

Perform key transposition before applying linear scaling

LS Implementation Detail (1/3)

Our research focuses on the “Compute distance” step.

1. Scale the input pitch vector to 31 versions of different sizes
2. Put the input pitch vector into constant memory
3. Compute distance
4. Sort the results and return
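The steps above can be sketched on the CPU in plain Python. This is only an illustration, not the GPU kernel from the talk: the scaling range (0.5 to 2.0), the linear interpolation in `resample`, the mean-shift form of key transposition, and the mean absolute difference used as the distance are all assumptions, and every function name is hypothetical. Only the count of 31 scaled versions comes from the slide.

```python
def resample(pitch, new_len):
    """Stretch or compress a pitch vector to new_len frames (linear interpolation)."""
    if new_len == 1 or len(pitch) == 1:
        return [pitch[0]] * new_len
    step = (len(pitch) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        x = k * step
        lo = min(int(x), len(pitch) - 2)
        frac = x - lo
        out.append(pitch[lo] * (1 - frac) + pitch[lo + 1] * frac)
    return out


def shift_to_zero_mean(pitch):
    """Assumed form of one-shot key transposition: remove the mean pitch."""
    mean = sum(pitch) / len(pitch)
    return [p - mean for p in pitch]


def ls_distance(query, segment, n_scales=31, min_ratio=0.5, max_ratio=2.0):
    """Smallest mean absolute difference over all scaled versions of the query.

    n_scales=31 follows the slide; the ratio range is an assumption.
    """
    best = float("inf")
    for s in range(n_scales):
        ratio = min_ratio + (max_ratio - min_ratio) * s / max(1, n_scales - 1)
        new_len = max(2, round(len(query) * ratio))
        if new_len > len(segment):
            continue  # this scaled query is longer than the database segment
        q = shift_to_zero_mean(resample(query, new_len))
        r = shift_to_zero_mean(segment[:new_len])
        dist = sum(abs(a - b) for a, b in zip(q, r)) / new_len
        best = min(best, dist)
    return best


# A query sung 1.5 times slower and 5 semitones higher still matches closely.
query = [60, 62, 64, 62]
segment = [p + 5 for p in resample(query, 6)] + [70, 70]
closest = ls_distance(query, segment)
```

In this sketch the scale loop is the part that maps naturally onto a GPU thread, since every scaled version is compared against the same database segment independently.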

LS Implementation Detail (2/3)

Each block computes one song; each thread in a block computes a different segment of that song.

Example of a single block: block dimension = 64, segment size = 375, frame rate = 31.25

[Figure: one block over a song's pitch vector. Thread 0 covers frames 0-374, thread 1 covers 46-420, thread 2 covers 157-531, ..., thread 63 covers 2034-2408.]

LS Implementation Detail (3/3)

Database Pitch Vector

[Figure: the database pitch vector is the concatenation of Song 0, Song 1, ..., Song 999. Block 0 handles Song 0, block 1 handles Song 1, ..., block 999 handles Song 999, and each block runs threads 0-63. The segment start indices shown are 0, 46, ..., 2034 for Song 0; 5521, 5703, 5817, ... for Song 1; and 6124567, 6124683, 6124784 for Song 999.]
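The block and thread layout can be expressed as simple index arithmetic. The sketch below is a CPU-side illustration in Python rather than the actual kernel; `thread_segment` and its parameters are hypothetical names, and the song start offset 5521 is only read off the figure.

```python
SEGMENT_SIZE = 375  # frames per segment, from the slide
BLOCK_DIM = 64      # threads per block, from the slide


def thread_segment(song_start, segment_starts, thread_id, segment_size=SEGMENT_SIZE):
    """Return (first, last) global frame indices compared by one thread.

    song_start:     index of the song's first frame in the concatenated database
    segment_starts: start of each thread's segment, relative to the song
    """
    first = song_start + segment_starts[thread_id]
    return first, first + segment_size - 1


# Block 0 (Song 0 starts at index 0): threads 0, 1, 2 from the figure.
block0 = [thread_segment(0, [0, 46, 157], t) for t in range(3)]

# Block 1 (Song 1 is assumed to start at index 5521): its second segment
# starts 182 frames into the song, i.e. at global index 5703.
block1_thread1 = thread_segment(5521, [0, 182], 1)
```

On the GPU, `song_start` would be looked up by block index and `segment_starts[thread_id]` by thread index.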

Implementation of LS - method 1

Each thread copies its part of the database pitch vector from global memory to its own local memory, then reads pitches from local memory while computing.

[Figure: block 0. Thread 0 copies frames 0-374, thread 1 copies frames 46-420, ..., thread 63 copies frames 2034-2408 from global memory into its local memory.]

Implementation of LS - method 2

Threads in the same block compute the same song, so the pitch vector of that song is copied from global memory to shared memory once; each thread then reads pitches from shared memory when needed.

[Figure: block 0. The song's pitch vector is held in shared memory. Thread 0 reads frames 0, 1, 2, ...; thread 1 reads frames 46, 47, 48, ...; thread 63 reads frames 2034, 2035, 2036, ...]

Comparison of the Two LS Methods

Method 1 (using local memory)
‧ Advantages: intuitive; easy to implement
‧ Drawbacks: local memory is slower (off-chip); the same data must be copied from global memory several times

Method 2 (using shared memory)
‧ Advantages: shared memory is faster (on-chip); the same data is copied from global memory only once
‧ Drawback: bank conflicts
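To put a rough number on "several times", the sketch below counts how many frames each method copies out of global memory for one block. It uses the block dimension (64) and segment size (375) from the earlier example; the helper names are hypothetical, and the 60 filler start positions are made up, because the slides show only a few of the 64 segment starts.

```python
def frames_copied_method1(segment_starts, segment_size):
    """Method 1: every thread copies its own full segment."""
    return len(segment_starts) * segment_size


def frames_copied_method2(segment_starts, segment_size):
    """Method 2: the span covered by all segments is copied once per block."""
    return max(segment_starts) + segment_size - min(segment_starts)


# 64 segment starts: 0, 46, 157 and 2034 come from the slides,
# the 60 values in between are hypothetical filler.
starts = [0, 46, 157] + list(range(200, 200 + 60 * 30, 30)) + [2034]

method1 = frames_copied_method1(starts, 375)  # 64 * 375 frames
method2 = frames_copied_method2(starts, 375)  # frames 0 through 2408
```

Under these assumptions method 1 moves about ten times as many frames as method 2, because overlapping segments are fetched again by every thread that needs them.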

Bank Conflict (1/2)

Bank conflict: threads in the same half-warp (drawn in the same color in the figure) simultaneously access different addresses that fall in the same shared-memory bank.

[Figure: successive 4-byte words of shared memory are assigned to banks 0, 1, ..., 31 and then wrap around to bank 0 again. Three access patterns for threads 0-20 are shown: two without a bank conflict and one with a bank conflict.]

Bank Conflict (2/2)

Bank conflict between thread 2 and thread 5:

Thread id      0    1    2    3    4    5
Pitch vector   0   46  157  315  482  573
Bank id        0   14   29   27    2   29
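The bank ids in this example follow from taking each pitch index modulo the number of banks. A small Python check, assuming 32 banks (the bank ids in the previous figure run from 0 to 31) and one 4-byte word per pitch value; the function names are hypothetical:

```python
from collections import defaultdict

NUM_BANKS = 32  # assumed from the bank ids 0..31 shown in the figure


def bank_of(word_index, num_banks=NUM_BANKS):
    """Bank that holds the 4-byte word at this index of shared memory."""
    return word_index % num_banks


def find_conflicts(indices, num_banks=NUM_BANKS):
    """Map bank id -> thread ids that hit that bank at different addresses.

    Threads reading the very same word are not reported as a conflict.
    """
    by_bank = defaultdict(dict)  # bank -> {word index -> [thread ids]}
    for tid, idx in enumerate(indices):
        by_bank[bank_of(idx, num_banks)].setdefault(idx, []).append(tid)
    return {
        bank: sorted(t for tids in words.values() for t in tids)
        for bank, words in by_bank.items()
        if len(words) > 1
    }


starts = [0, 46, 157, 315, 482, 573]       # segment starts of threads 0..5
banks = [bank_of(i) for i in starts]       # bank id of each start
conflicts = find_conflicts(starts)         # threads 2 and 5 share bank 29
```

If all threads step through their segments in lockstep, each reads index start + k at step k, so two threads whose starts share a bank keep colliding at every step.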

Implementation of LS - method 3

‧ Append pitches after some notes to shift banks
‧ Trade-off between performance and recognition rate

No bank conflict once one pitch is appended, which moves the segment of thread 5 from index 573 to 574:

Thread id      0    1    2    3    4    5
Pitch vector   0   46  157  315  482  574
Bank id        0   14   29   27    2   30
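One way to decide where to append pitches is a greedy pass over the segment starts. The slides do not say how the padding positions are chosen, so the strategy below is an assumption, as are the function name, the 32-bank figure, and the group size (32 threads here; 16 would correspond to a half-warp).

```python
def pad_offsets(starts, num_banks=32, group=32):
    """Greedy sketch: shift segment starts so that no two threads within the
    same group of `group` consecutive threads start in the same bank.

    Returns the shifted starts and the total number of appended pitches.
    """
    if group > num_banks:
        raise ValueError("a group cannot be larger than the number of banks")
    padded, shift, used = [], 0, set()
    for tid, start in enumerate(starts):
        if tid % group == 0:
            used = set()              # a new group starts with all banks free
        pos = start + shift           # earlier padding moves later segments too
        while pos % num_banks in used:
            pos += 1                  # one appended pitch shifts by one bank
            shift += 1
        used.add(pos % num_banks)
        padded.append(pos)
    return padded, shift


# The example from the slide: only thread 5 has to move, from 573 to 574.
padded, appended = pad_offsets([0, 46, 157, 315, 482, 573])
```

Every appended pitch alters the stored melody slightly, which is the performance versus recognition rate trade-off named on the slide.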

Database Reduce (1/3)

Remove unnecessary comparisons caused by duplicates of the same song
‧ Saves 6.2% of the computing time (1265 of 20395 songs removed)

First stage: a pair of songs is a candidate duplicate when
‧ the song length difference is less than 5 seconds, and
‧ the pitch difference (after one-shot key transposition) is less than 2 semitones per frame on the frame-based pitch data

Second stage: do the two songs have the same song name and singer?
‧ Yes: second-stage A. Definitely the same song; 832 songs must be removed
‧ No: second-stage B. Several cases; 433 songs must be removed
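The first-stage test can be sketched as a pairwise check on two frame-based pitch vectors. The thresholds (5 seconds, 2 semitones per frame) come from the slide and the frame rate of 31.25 from the LS example; comparing only the overlapping frames, using a mean shift as the one-shot key transposition, and the function name are assumptions.

```python
FRAME_RATE = 31.25  # frames per second, from the LS implementation slide


def is_candidate_duplicate(pitch_a, pitch_b, max_len_diff_sec=5.0,
                           max_pitch_diff=2.0, frame_rate=FRAME_RATE):
    """First-stage check: similar length and similar melody after transposition."""
    len_diff_sec = abs(len(pitch_a) - len(pitch_b)) / frame_rate
    if len_diff_sec >= max_len_diff_sec:
        return False
    n = min(len(pitch_a), len(pitch_b))
    if n == 0:
        return False
    # Assumed one-shot key transposition: align the mean pitch of both songs.
    mean_a = sum(pitch_a[:n]) / n
    mean_b = sum(pitch_b[:n]) / n
    # zip stops at the shorter vector, so exactly n frames are compared.
    diff = sum(abs((a - mean_a) - (b - mean_b))
               for a, b in zip(pitch_a, pitch_b)) / n
    return diff < max_pitch_diff


song = [60, 62, 64, 65] * 50                 # 200 frames
transposed = [p + 3 for p in song]           # same melody, 3 semitones higher
much_longer = song + [60] * 200              # 6.4 seconds longer
different = [72, 48] * 100                   # unrelated melody
```

Pairs that pass this check would then go to the second stage, where song name and singer are compared.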

Database Reduce (2/3)

Cases in which the first-stage result has a different song name or singer. We deal with the “Song name error”, “Singer error”, and “Abbreviation” cases.

Case             Example 1          Example 2
Song name error  A Day In The Life  Day In The Life
Singer error     Procal Harum       Procol Harum
Abbreviation     1st Of 5th         First Of Fifth
Cover            想你想斷腸         補破網

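Database Reduce (3/3)

Remove unnecessary comparisons caused by repeating patterns
‧ Saves 22.01% of the database size
‧ Uses the correlative matrix T

For the first row:

$$T_{1,j} = \begin{cases} 1, & \text{if } S_1 = S_j \\ 0, & \text{otherwise} \end{cases}$$

For the other rows:

$$T_{i,j} = \begin{cases} T_{i-1,j-1} + 1, & \text{if } S_i = S_j \\ 0, & \text{otherwise} \end{cases}$$

where T is the correlative matrix, S is the pitch vector, and i, j are the indices of the matrix.

Example for S = (72, 68, 72, 68, 72, 60); X marks the diagonal and only the upper triangle is filled:

      72  68  72  68  72  60
72     X   0   1   0   1   0
68         X   0   2   0   0
72             X   0   3   0
68                 X   0   0
72                     X   0
60                         X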
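The recurrence for T can be checked against the 72 68 72 68 72 60 example on this slide with a few lines of Python. Indices are 0-based here, so row 0 plays the role of the first row; `longest_repeat` is a hypothetical helper that only reads the largest entry back out, and how the repeated pattern is then removed from the database is not covered.

```python
def correlative_matrix(s):
    """Upper-triangular correlative matrix T for the pitch sequence s."""
    n = len(s)
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] == s[j]:
                # First row: 1 on a match. Other rows: extend the diagonal run.
                t[i][j] = 1 if i == 0 else t[i - 1][j - 1] + 1
    return t


def longest_repeat(s):
    """Return (length, first_start, second_start) of the longest repeated run."""
    best = (0, None, None)
    for i, row in enumerate(correlative_matrix(s)):
        for j, length in enumerate(row):
            if length > best[0]:
                best = (length, i - length + 1, j - length + 1)
    return best


pitches = [72, 68, 72, 68, 72, 60]
T = correlative_matrix(pitches)
repeat = longest_repeat(pitches)  # 72 68 72 occurs at positions 0 and 2
```

An entry T[i][j] = L means the L pitches ending at position i occur again ending at position j; in this example the two occurrences overlap.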

Multiple Devices

Each device is assigned 1/n of the database and computes the similarity between its share and the query input, where n is the number of devices.
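A minimal sketch of this partitioning in Python, with the per-device search left out. The contiguous split, the merge by smallest distance, and both function names are assumptions; the slides only state that each device receives 1/n of the database.

```python
def split_database(song_ids, n_devices):
    """Split the song list into n contiguous shares whose sizes differ by at most one."""
    base, extra = divmod(len(song_ids), n_devices)
    shares, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        shares.append(song_ids[start:start + size])
        start += size
    return shares


def merge_top_n(per_device_results, n):
    """Merge (distance, song_id) lists from all devices and keep the n closest."""
    merged = [r for results in per_device_results for r in results]
    return sorted(merged)[:n]


# 19130 songs (the reduced database) spread over 3 devices.
shares = split_database(list(range(19130)), 3)
sizes = [len(s) for s in shares]

# Hypothetical partial results from two devices, merged into a top-2 list.
top2 = merge_top_n([[(0.5, 1), (0.9, 2)], [(0.1, 7)]], 2)
```

Because every song is compared with the query independently, the shares need no communication until the final merge.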


Experimental Environment (1/2)

Experimental environment (NCHC Formosa 5 Cluster)

OS                  CentOS 6 x86_64
RAM                 96 GB DDR3 ECC
CPU                 Intel Xeon X5670, six cores, 2.93 GHz, x2
GPU                 NVIDIA Tesla M2070 (448 cores), x3
Language            C/C++
CUDA version        5.5
Compute capability  2.0

Experimental Data

Database            19130 songs (after reduction)
Corpus              CHT Pop Song
Corpus information  Recordings by NTHU and NTU students in 2013
Format              WAV, 16 kHz, 16 bits, mono
Number of files     2183 (1614 from NTU, 318 from NTHU, and 251 from CHT)
WAV file length     8-9 seconds

Experiment 1 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 2

When the pitch vector size in shared memory is 2000 and the block dimension is 128, the computation time is the shortest, 1.427 sec.

[Figure: computation time (sec.) versus shared pitch size (K), one curve per block dimension (64, 128, 256, 512, 1024), all on 3 devices.]

Experiment 2 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 3

When the pitch vector size in shared memory is 5000 and the block dimension is 512, the computation time is the shortest, 1.213 sec.

[Figure: computation time (sec.) versus shared pitch size (K), one curve per block dimension (64, 128, 256, 512, 1024), all on 3 devices.]

Experiment 3 – LS with Different Numbers of Threads Using the Three Methods

For each block dimension, the best case of LS methods 2 and 3 (with a suitable shared pitch size) is faster than method 1.

[Figure: computation time per song (sec.) versus LS block dimension (64, 128, 256, 512, 1024 threads per block) for LS 1, LS 2, and LS 3 on 3 devices; the chart is annotated with the shared pitch sizes 2000, 2000, 3000, 5000, and 11000.]

Experiment 4 – LS+DTW with Different Numbers of Devices

With more devices the computation time drops, since each GPU only needs to search roughly 1/n of the database.

Computation time per song (sec.) versus number of devices:

Configuration                                        1 device  2 devices  3 devices
LS 1, block dimension: 128                           6.76      3.52       2.41
LS 2, block dimension: 128, shared pitch size: 2000  5.72      3.08       2.12
LS 3, block dimension: 512, shared pitch size: 5000  5.13      2.69       1.84
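The "roughly 1/n" behaviour can be quantified as speedup and parallel efficiency. The Python sketch below uses the nine timings from this slide and assumes that the three series are listed in the same order as the legend (LS 1, LS 2, LS 3); the function name is hypothetical.

```python
def speedup_and_efficiency(times):
    """times[k] is the time with k + 1 devices.

    Returns one (speedup, efficiency) pair per device count, where
    speedup = t_1 / t_n and efficiency = speedup / n.
    """
    t1 = times[0]
    return [(t1 / t, t1 / t / (k + 1)) for k, t in enumerate(times)]


# Seconds per song for 1, 2, and 3 devices (series order is an assumption).
timings = {
    "LS 1": [6.76, 3.52, 2.41],
    "LS 2": [5.72, 3.08, 2.12],
    "LS 3": [5.13, 2.69, 1.84],
}
results = {name: speedup_and_efficiency(t) for name, t in timings.items()}
three_device_speedups = {name: r[2][0] for name, r in results.items()}
```

With these numbers the three-device speedup lands at roughly 2.7x to 2.8x for every method, which corresponds to an efficiency of about 90% or better.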

Conclusions and Future Work (to be completed)

Conclusions
‧ Computation time
  ‧ LS method 2 is faster than method 1, even though bank conflicts exist
  ‧ The computation time drops to almost 1/n when using n devices (GPUs)

Future work
‧ Advanced database purification to remove bad songs
  ‧ Abnormal melody (e.g., instrumental only)
  ‧ Wrong melody or song name
‧ Improve LS method 2 to reduce bank conflicts on the Kepler architecture
  ‧ Its definition of a bank conflict differs from that of the Tesla and Fermi architectures
  ‧ Use different methods for appending pitches

Thank you!

Demo: http://miracle.mirlab.org:8080/miracle

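Methods: Dynamic Time Warping

‧ t: input pitch vector
‧ r: reference pitch vector
‧ Local paths: 27-45-63 degrees
‧ DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min \begin{cases} D(i-1,\, j-2) \\ D(i-1,\, j-1) \\ D(i-2,\, j-1) \end{cases}$$

[Figure: the DTW grid, with input frames t(i-1), t(i) along the i axis, reference frames r(j-1), r(j) along the j axis, and the cell D(i, j).]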
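A compact Python sketch of DTW with the 27-45-63 degree local paths, i.e. predecessors (i-2, j-1), (i-1, j-1), and (i-1, j-2). The boundary conditions are assumptions: the path is anchored at the first and last frame of both vectors, and unreachable cells stay at infinity. The function name is hypothetical, and both inputs are assumed to be non-empty.

```python
import math


def dtw_distance(t, r):
    """DTW distance between input t and reference r (27-45-63 degree paths)."""
    n, m = len(t), len(r)
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = abs(t[0] - r[0])          # assumed: both sequences start together
    for i in range(1, n):
        for j in range(1, m):
            best = D[i - 1][j - 1]                  # 45 degrees
            if j >= 2:
                best = min(best, D[i - 1][j - 2])   # 63 degrees
            if i >= 2:
                best = min(best, D[i - 2][j - 1])   # 27 degrees
            if best < math.inf:
                D[i][j] = abs(t[i] - r[j]) + best
    return D[n - 1][m - 1]              # assumed: both sequences end together


same = dtw_distance([1, 2, 3], [1, 2, 3])
stretched = dtw_distance([1, 2, 3], [1, 1, 2, 3])   # reference holds a note longer
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

With these local paths one sequence can run at most twice as fast as the other, so a length ratio outside that range leaves the end cell unreachable.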

Note & Pitch Data Charts

‧ Average number of notes: 913.32, σ: 450.99
‧ Average number of pitch data: 5849.94, σ: 1878.42

Note & Pitch Data Charts (After Removing Repeating Patterns)

‧ Average number of notes: 694.7, σ: 386
‧ Average number of pitch data: 4562.54, σ: 1773.6

Detail of Second-stage B

Song name  Singer     Handling
Different  Same       Manually check
Different  Unknown    1. Compute the edit distance between the two song names
                      2. Check the songs whose edit distance is smaller than a threshold
Same       Different  Manually check
Same       Unknown    Remove the data with the unknown singer
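For the "Different / Unknown" row, the edit distance between two song names can be computed with the standard Levenshtein dynamic program. Whether the original system used exactly this variant (unit cost for insert, delete, and substitute; case-sensitive) is an assumption, and no threshold value is given in the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character inserts,
    deletes, and substitutions that turn string a into string b."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Examples taken from the cases listed for the first-stage results.
name_error = edit_distance("A Day In The Life", "Day In The Life")
singer_error = edit_distance("Procal Harum", "Procol Harum")
abbreviation = edit_distance("1st Of 5th", "First Of Fifth")
```

The first two pairs differ by only one or two edits, while the abbreviation pair differs by at least four, so a small threshold alone would miss abbreviations.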

System Flow with Multiple Devices (to be completed)

[Figure: flowchart with the same lanes (User, CPU, GPU) and the same steps as the System Flow slide.]

The following are 高瑋's backup slides

Method: Borda Count

‧ R: number of candidate songs
‧ M: number of melody recognition methods
‧ r_ik: the rank of the k-th song in the result of the i-th melody recognition method
‧ D_k: total score of the k-th song

$$D_k = \sum_{i=1}^{M} \left( R - r_{ik} \right)$$

Borda Count example (R = 4, M = 2)

Song  Ranking 1: ABCD (score)  Ranking 2: DCAB (score)  D_k (total score)  Rank
A     3                        1                        4                  1
B     2                        0                        2                  4
C     1                        2                        3                  2
D     0                        3                        3                  2
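The Borda Count example can be reproduced in a few lines of Python. Each ranking lists the candidates from best to worst, and a song at rank r earns R - r points. The function names are hypothetical, and giving tied songs the same rank (with the following rank skipped) is inferred from the example table.

```python
def borda_count(rankings):
    """Total Borda score per song: D_k = sum over rankings of (R - rank)."""
    candidates = rankings[0]
    R = len(candidates)
    scores = {c: 0 for c in candidates}
    for ranking in rankings:
        for rank, song in enumerate(ranking, start=1):
            scores[song] += R - rank
    return scores


def final_ranks(scores):
    """Rank by total score; tied songs share a rank (1, 2, 2, 4 style)."""
    return {song: 1 + sum(other > s for other in scores.values())
            for song, s in scores.items()}


# Two recognition methods ranking the same four candidate songs.
scores = borda_count(["ABCD", "DCAB"])
ranks = final_ranks(scores)
```

Here the two rankings stand in for the outputs of the two melody recognition methods being combined.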

Methods: Comparison of Two Methods

The proposed system combines linear scaling with dynamic time warping to accelerate computation.

Type               Linear scaling (LS)              Dynamic time warping (DTW)
Computation time   Faster                           Slower
Tempo variation    Handles uniform tempo variation  Handles non-uniform tempo variation
Key transposition  One-shot                         Heuristic search
