
加速以 GPU 為運算核心的二階段哼唱選歌系統
Accelerating a Two-Stage Query by Singing/Humming System Using GPUs

Student: Andy Chuang (莊詠翔), andy.chuang@mirlab.org

Advisors: Prof. Jyh-Shing Roger Jang (張智星), Prof. Jason S. Chang (張俊盛)

Department of Computer Science, National Tsing Hua University


Outline

‧ Introduction
‧ Related work
‧ System flow
‧ Methods
‧ Experimental results
‧ Conclusions and future work


Introduction

QBSH (Query by Singing and Humming)
‧ Description: the user sings or hums, and the system returns the most similar song from the database
‧ Problem: the system usually takes too long to respond when the database is huge

Strategies
‧ Implement a new linear scaling method that matches the properties of the GPU
‧ Reduce the database to avoid unnecessary comparisons
‧ Combine linear scaling with dynamic time warping on multiple devices (GPUs) rather than one to speed up the computation


Related Work (1/2)

MIRACLE (Music Information Retrieval Acoustically with Clustered and paralleL Engines)
‧ Jang, Chen, and Kao, “MIRACLE: A Music Information Retrieval System with Clustered Computing Engines”, ISMIR 2001.

Linear scaling (LS)
‧ Wang, Chen, Kuo, Chiu, and Jang, “Accelerating Query by Singing/Humming on GPU: Optimization for Web Deployment”, ICASSP 2012.
‧ Lin, “Speeding Up Query-by-Singing/Humming Systems Based on Linear Scaling”, National Tsing Hua Univ., 2012.


Related Work (2/2)

Dynamic time warping (DTW)
‧ Ferraro, Hanna, Imbert, and Izart, “Accelerating Query-by-Humming on GPU”, ISMIR 2009.
‧ Kuo, “Accelerating Query By Singing/Humming on GPU”, National Tsing Hua Univ., 2013.

Hybrid LS+DTW
‧ Zou, “Query By Singing/Humming Using Combination of Classifiers”, National Tsing Hua Univ., 2008.
‧ Kao, “A Two-Stage Query by Singing/Humming System on GPU”, National Tsing Hua Univ., 2013.


System Flow

[Flowchart with User, CPU, and GPU lanes. The boxes read:]
‧ Sing/hum the song
‧ Detect endpoints and preprocess audio
‧ Convert to frame-based data
‧ Load database
‧ Perform linear scaling
‧ Reserve different amounts of candidate songs
‧ Perform dynamic time warping
‧ Post-process the ranking
‧ Show top-N song info.


Linear Scaling

Example: 十年 (“Ten Years”), 陳奕迅 (Eason Chan)

[Figure: linear scaling example; horizontal axis: Time]

Perform key transposition before using linear scaling


LS Implementation Detail (1/3)

This work focuses on the “Compute distance” step.

LS pipeline:
1. Scale the input pitch vector to 31 versions of different sizes
2. Put the input pitch vector into constant memory
3. Compute distance
4. Sort the results and return
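The scaling and distance steps can be sketched on the CPU in a few lines of Python. This is only an illustration under stated assumptions: the slides give the number of scaled versions (31) but not the scaling range, so the 0.5 to 2.0 range, the mean-shift key transposition, and the mean absolute difference used as the distance are all hypothetical choices, as are the function names.

```python
def resample(pitch, new_len):
    """Linearly interpolate `pitch` to `new_len` samples (new_len >= 2)."""
    step = (len(pitch) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        pos = k * step
        lo = int(pos)
        hi = min(lo + 1, len(pitch) - 1)
        frac = pos - lo
        out.append(pitch[lo] * (1 - frac) + pitch[hi] * frac)
    return out


def ls_distance(query, segment, n_scales=31, lo=0.5, hi=2.0):
    """Smallest distance between any scaled version of `query` and the
    beginning of `segment`. Scale range and distance are assumptions."""
    best = float("inf")
    for s in range(n_scales):
        factor = lo + (hi - lo) * s / (n_scales - 1)
        n = max(2, round(len(query) * factor))
        if n > len(segment):
            continue  # scaled query would run past the segment
        scaled = resample(query, n)
        ref = segment[:n]
        # One-shot key transposition: shift the query to the reference mean.
        shift = sum(ref) / n - sum(scaled) / n
        dist = sum(abs(q + shift - r) for q, r in zip(scaled, ref)) / n
        best = min(best, dist)
    return best


query = [60.0, 62.0, 64.0, 62.0, 60.0]
# Reference: the same melody sung twice as slowly and 5 semitones higher.
segment = [p + 5.0 for p in resample(query, 10)]
print(ls_distance(query, segment))
```

Because the reference is an exact stretched and transposed copy of the query, the printed distance is zero up to floating-point rounding.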


LS Implementation Detail (2/3)

Each block computes one song; each thread in the block computes different segments of that song.

Example of a single block:
‧ Block dimension = 64
‧ Segment size = 375
‧ Frame rate = 31.25

[Figure: one song's pitch vector divided among the threads of a block. Thread 0 covers frames 0 to 374, thread 1 covers 46 to 420, thread 2 covers 157 to 531, ..., thread 63 covers 2034 to 2408.]
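The thread-to-segment mapping in this example can be mimicked with a short Python sketch. The segment start positions (0, 46, 157, ...) are taken from the figure; the strided assignment for songs with more than 64 start positions and the function name are assumptions, not something the slides state.

```python
BLOCK_DIM = 64       # threads per block, from the slide
SEGMENT_SIZE = 375   # frames per segment; 375 / 31.25 frames per second = 12 s


def segments_for_thread(starts, thread_id, block_dim=BLOCK_DIM,
                        segment_size=SEGMENT_SIZE):
    """Inclusive (first, last) frame ranges handled by one thread.

    Assumption: thread t takes start positions t, t + block_dim, ...
    """
    return [(s, s + segment_size - 1) for s in starts[thread_id::block_dim]]


starts = [0, 46, 157]  # first three start positions shown in the figure
print(segments_for_thread(starts, 0))  # [(0, 374)]
print(segments_for_thread(starts, 1))  # [(46, 420)]
print(segments_for_thread(starts, 2))  # [(157, 531)]
```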


LS Implementation Detail (3/3)

[Figure: the database pitch vector holds Song 0, Song 1, ..., Song 999. Block 0, Block 1, ..., Block 999 each run Thread 0, Thread 1, ..., Thread 63 over the database pitch vector. Index labels shown along the vector include 0, 46, 2034, 5521, 5703, 5817, 6124567, 6124683, and 6124784.]


Implementation of LS - method 1

Each thread copies its part of the database pitch vector from global memory to its own local memory, then reads pitches from local memory while computing.

[Figure: Block 0. Thread 0, Thread 1, ..., Thread 63 each copy their own segment (frames 0 to 374, 46 to 420, ..., 2034 to 2408) from global memory into per-thread local memory.]


Implementation of LS - method 2

Threads in the same block compute the same song, so we copy the pitch vector of that song from global memory to shared memory; each thread then reads pitches from shared memory when needed.

[Figure: Block 0. The song's pitch vector is copied from global memory into a shared pitch buffer in shared memory; Thread 0, Thread 1, ..., Thread 63 read from it starting at positions 0, 46, ..., 2034.]


Comparison of the Two LS Methods

Method 1 (using local memory)
‧ Advantages: intuitive; easy to implement
‧ Drawbacks: local memory is slower (off-chip); the same data must be copied from global memory several times

Method 2 (using shared memory)
‧ Advantages: shared memory is faster (on-chip); the same data is copied from global memory only once
‧ Drawback: bank conflicts


Bank Conflict (1/2)

Bank conflict: threads in the same half-warp (drawn in the same color in the figure) simultaneously access different addresses that fall in the same bank of shared memory.

[Figure: successive 4-byte data words map to bank ids 0, 1, ..., 31 and then wrap around to 0; thread ids 0 to 20 are shown accessing these words in three cases labeled “No bank conflict”, “Bank conflict”, and “No bank conflict”.]


Bank Conflict (2/2)

Bank conflict between thread 2 and thread 5:

Thread id              | 0 | 1  | 2   | 3   | 4   | 5
Pitch vector position  | 0 | 46 | 157 | 315 | 482 | 573
Bank id                | 0 | 14 | 29  | 27  | 2   | 29

Threads 2 and 5 both access bank 29, so their accesses conflict.
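The bank ids in this example are consistent with position mod 32. A minimal Python sketch, assuming 32 banks and one 4-byte word per pitch value (the function names are hypothetical), reproduces the bank ids and finds the conflicting threads:

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: bank ids 0..31, one 4-byte word per pitch


def bank_of(position):
    """Bank id of the shared-memory word at `position`."""
    return position % NUM_BANKS


def find_conflicts(positions):
    """Map each bank that is hit at more than one distinct address
    to the list of thread ids that hit it."""
    threads_by_bank = defaultdict(list)
    for thread_id, pos in enumerate(positions):
        threads_by_bank[bank_of(pos)].append(thread_id)
    return {bank: threads
            for bank, threads in threads_by_bank.items()
            if len({positions[t] for t in threads}) > 1}


starts = [0, 46, 157, 315, 482, 573]   # positions read by threads 0..5
print([bank_of(p) for p in starts])    # [0, 14, 29, 27, 2, 29]
print(find_conflicts(starts))          # {29: [2, 5]}
```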


Implementation of LS - method 3

‧ Append pitches after some notes to shift banks
‧ Trade-off between performance and recognition rate

Thread id              | 0 | 1  | 2   | 3   | 4   | 5
Pitch vector position  | 0 | 46 | 157 | 315 | 482 | 574
Bank id                | 0 | 14 | 29  | 27  | 2   | 30

No bank conflict: the position read by thread 5 moves from 573 to 574, so its bank changes from 29 to 30.
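One way to picture the padding is a greedy pass over the segment start positions. The slides do not describe the actual padding rule, so the following Python sketch is a hypothetical strategy: it assumes 32 banks, checks conflicts within groups of 16 threads (a half-warp), and appends one pitch at a time until the start lands in a free bank.

```python
def pad_starts(starts, num_banks=32, group=16):
    """Shift segment starts so that no two starts in the same group of
    `group` threads share a bank. Assumes group <= num_banks."""
    shifted = []
    shift = 0          # total number of pitches appended so far
    used = set()
    for i, start in enumerate(starts):
        if i % group == 0:
            used = set()            # new half-warp, banks are free again
        while (start + shift) % num_banks in used:
            shift += 1              # append one pitch before this segment
        used.add((start + shift) % num_banks)
        shifted.append(start + shift)
    return shifted


print(pad_starts([0, 46, 157, 315, 482, 573]))
# [0, 46, 157, 315, 482, 574]
```

On the six positions from the bank-conflict example this moves only the last start, from 573 to 574. Every appended pitch also alters the stored melody, which is one way to read the performance versus recognition-rate trade-off.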


Database Reduction (1/3)

‧ Remove unnecessary comparisons caused by duplicate entries of the same song
‧ Saves 6.2% of computing time (1265 of 20395 songs removed)

First stage (on frame-based pitch data):
‧ Song length difference is less than 5 seconds
‧ Pitch difference (after one-shot key transposition) is less than 2 semitones per frame

Do the songs have the same song name and singer?
‧ Yes: second stage A. Definitely the same song; 832 songs must be removed
‧ No: second stage B. Several cases; 433 songs must be removed


Database Reduction (2/3)

Cases in which first-stage matches have a different song name or singer:

Case            | Example 1         | Example 2
Song name error | A Day In The Life | Day In The Life
Singer error    | Procal Harum      | Procol Harum
Abbreviation    | 1st Of 5th        | First Of Fifth
Cover           | 想你想斷腸        | 補破網

The “Song name error”, “Singer error”, and “Abbreviation” cases are dealt with.


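Database Reduction (3/3)

‧ Remove unnecessary comparisons caused by repeating patterns
‧ Saves 22.01% of the database size
‧ Uses the correlative matrix T

For the first row:

$$T_{1,j} = \begin{cases} 1, & \text{if } S_1 = S_j \\ 0, & \text{otherwise} \end{cases}$$

For the other rows:

$$T_{i,j} = \begin{cases} T_{i-1,j-1} + 1, & \text{if } S_i = S_j \\ 0, & \text{otherwise} \end{cases}$$

where T is the correlative matrix, S is the pitch vector, and i, j are the indices of the matrix.

Example for S = (72, 68, 72, 68, 72, 60):

     | 72 | 68 | 72 | 68 | 72 | 60
  72 | X  | 0  | 1  | 0  | 1  | 0
  68 |    | X  | 0  | 2  | 0  | 0
  72 |    |    | X  | 0  | 3  | 0
  68 |    |    |    | X  | 0  | 0
  72 |    |    |    |    | X  | 0
  60 |    |    |    |    |    | X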
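The recurrence can be checked against the example matrix with a short Python sketch. It fills only the upper triangle (j > i), uses 0-based indices instead of the 1-based indices in the formulas, and the function name is hypothetical.

```python
def correlative_matrix(S):
    """Upper-triangular correlative matrix of pitch sequence S.

    T[i][j] (j > i) is the length of the run of equal pitches that
    ends at positions i and j; entries on or below the diagonal stay 0.
    """
    n = len(S)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if S[i] == S[j]:
                previous = T[i - 1][j - 1] if i > 0 else 0
                T[i][j] = previous + 1
    return T


S = [72, 68, 72, 68, 72, 60]
T = correlative_matrix(S)
for row in T:
    print(row)
print(max(max(row) for row in T))  # 3
```

The largest entry, 3, marks the repeated pattern 72 68 72, which occurs at positions 1 to 3 and again at positions 3 to 5 of the example.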


Multiple Device

Each device is assigned 1/n of the database and computes the similarity between its share and the query input, where n is the number of devices.

[Figure: system flow with multiple devices]
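The split across devices can be sketched on the host side in Python. The contiguous partitioning and the final merge of per-device candidate lists are assumptions about how the work might be divided; the function names and the sample distances are hypothetical.

```python
import heapq
from itertools import chain


def partition(num_songs, num_devices):
    """Split song indices 0..num_songs-1 into contiguous, nearly equal
    half-open ranges, one per device."""
    base, extra = divmod(num_songs, num_devices)
    ranges, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def merge_top_n(per_device_results, n):
    """Merge per-device lists of (distance, song_id) and keep the n best."""
    return heapq.nsmallest(n, chain.from_iterable(per_device_results))


print(partition(19130, 3))
# [(0, 6377), (6377, 12754), (12754, 19130)]

results = [[(0.3, "a"), (0.9, "b")], [(0.1, "c")], [(0.5, "d")]]
print(merge_top_n(results, 2))
# [(0.1, 'c'), (0.3, 'a')]
```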


Experimental Environment (1/2)

Experimental environment (NCHC Formosa 5 Cluster)

OS                 | CentOS 6 x86_64
RAM                | 96 GB DDR3 ECC
CPU                | Intel Xeon X5670, six cores, 2.93 GHz, x2
GPU                | NVIDIA Tesla M2070 (448 cores), x3
Language           | C/C++
CUDA version       | 5.5
Compute capability | 2.0


Experimental Data

Corpus information | NTHU & NTU students' recordings in 2013
Database           | 19130 songs (after reduction)
Corpus             | CHT Pop Song
Format             | WAV, 16 kHz, 16 bits, mono
File amount        | 2183 (1614 from NTU, 318 from NTHU, and 251 from CHT)
WAV file length    | 8-9 seconds


Experiment 1 – LS with Different Shared Pitch Sizes & Block Dimension using Method 2

The shortest computation time, 1.427 sec., occurs when the pitch vector size in shared memory is 2000 and the block dimension is 128.

[Figure: computation time (sec.) versus shared pitch size (K), with one curve each for block dimensions 64, 128, 256, 512, and 1024, all on 3 devices.]


Experiment 2 – LS with Different Shared Pitch Sizes & Block Dimension using Method 3

The shortest computation time, 1.213 sec., occurs when the pitch vector size in shared memory is 5000 and the block dimension is 512.

[Figure: computation time (sec.) versus shared pitch size (K), with one curve each for block dimensions 64, 128, 256, 512, and 1024, all on 3 devices.]


Experiment 3 – LS with Different Numbers of Threads Using the Three Methods

For every block dimension, the best case of LS methods 2 and 3 (at a certain shared pitch size) is faster than method 1.

[Figure: computation time per song (sec.) versus LS block dimension (# of threads per block: 64, 128, 256, 512, 1024), with one series each for LS 1, LS 2, and LS 3, all on 3 devices. Data points are annotated with the values 2000, 2000, 3000, 5000, and 11000.]


Experiment 4 – LS+DTW with Different Numbers of Devices

With more devices, the computation time drops, because each GPU only needs to search about 1/n of the database.

Computation time per song (sec.):

Method                                                  | 1 device | 2 devices | 3 devices
LS 1, block dimension: 128                              | 6.76     | 3.52      | 2.41
LS 2, block dimension: 128, shared pitch size: 2000     | 5.72     | 3.08      | 2.12
LS 3, block dimension: 512, shared pitch size: 5000     | 5.13     | 2.69      | 1.84


Conclusions and Future Work (to be completed)

Conclusions
‧ Computation time
  ‧ LS method 2 is faster than method 1, even though bank conflicts exist
  ‧ The computation time is almost 1/n when using n devices (GPUs)

Future work
‧ Advanced database purification to remove bad songs
  ‧ Abnormal melody (e.g., instrumental only)
  ‧ Wrong melody or song name
‧ Improve LS method 2 to reduce bank conflicts on the Kepler architecture
  ‧ Bank conflicts are defined differently there than on the Tesla and Fermi architectures
  ‧ Use a different method for appending pitches


Thank you!! & DEMO (http://miracle.mirlab.org:8080/miracle)


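Methods: Dynamic Time Warping

‧ t: input pitch vector
‧ r: reference pitch vector
‧ Local paths: 27-45-63 degrees
‧ DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min\{\, D(i-2,\,j-1),\ D(i-1,\,j-1),\ D(i-1,\,j-2) \,\}$$

[Figure: DTW grid with axes i and j, showing t(i-1), t(i), r(j-1), r(j), and the cell D(i, j).]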
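A minimal CPU sketch of DTW with the 27-45-63 degree local paths, written in Python for illustration. The three allowed predecessors of cell (i, j) are taken to be (i-1, j-1), (i-1, j-2), and (i-2, j-1); the boundary conditions (the path is anchored at the first and last frames of both vectors) are an assumption, since the slide does not state them.

```python
import math


def dtw_distance(t, r):
    """DTW distance between input pitch vector t and reference r,
    using 27-45-63 degree local paths and anchored endpoints."""
    n, m = len(t), len(r)
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = abs(t[0] - r[0])
    for i in range(1, n):
        for j in range(1, m):
            best = D[i - 1][j - 1]                   # 45 degrees
            if j >= 2:
                best = min(best, D[i - 1][j - 2])    # 63 degrees
            if i >= 2:
                best = min(best, D[i - 2][j - 1])    # 27 degrees
            if best < math.inf:
                D[i][j] = abs(t[i] - r[j]) + best
    return D[n - 1][m - 1]


print(dtw_distance([1, 2, 3], [1, 2, 3]))        # 0
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3]))  # 0
print(dtw_distance([1, 2], [1, 2, 3, 4]))        # inf
```

The last call returns infinity because these local paths allow at most a factor-of-two tempo change, so a two-frame query cannot be stretched over four reference frames.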


Note & Pitch data charts

‧ Average number of notes: 913.32, σ: 450.99
‧ Average number of pitch data: 5849.94, σ: 1878.42


Note & Pitch data charts (After Removing Repeating Patterns)

‧ Average number of notes: 694.7, σ: 386
‧ Average number of pitch data: 4562.54, σ: 1773.6


Detail of Second-stage B

Song name | Singer    | Handling
Different | Same      | Manually check
Different | Unknown   | 1. Compute the edit distance between the two song names. 2. Check the songs whose edit distance is smaller than a threshold
Same      | Different | Manually check
Same      | Unknown   | Remove the entry with the unknown singer
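The edit-distance check for the “Different / Unknown” case can be illustrated with the standard Levenshtein distance in Python. The slides do not say which edit-distance variant or threshold was used, so both the variant and the threshold value below are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


THRESHOLD = 3  # hypothetical value, for illustration only

for x, y in [("A Day In The Life", "Day In The Life"),
             ("Procal Harum", "Procol Harum")]:
    d = edit_distance(x, y)
    print(x, "|", y, "|", d, "| check" if d < THRESHOLD else "| skip")
```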


System Flow with multiple devices (to be completed)

[Flowchart with User, CPU, and GPU lanes. The boxes read:]
‧ Sing/hum the song
‧ Detect endpoints and preprocess audio
‧ Convert to frame-based data
‧ Load database
‧ Perform linear scaling
‧ Reserve different amounts of candidate songs
‧ Perform dynamic time warping
‧ Post-process the ranking
‧ Show top-N song info.


The following are 高瑋's backup slides.


Method: Borda Count

Borda Count
‧ R: number of candidate songs
‧ M: number of melody recognition methods
‧ r_{ik}: the rank of the k-th song in the i-th melody recognition method
‧ Total score of song k:

$$D_k = \sum_{i=1}^{M} (R - r_{ik})$$

Borda Count example:

Song | Rank 1 - ABCD (score) | Rank 2 - DCAB (score) | D_k (total score) | Rank
A    | 3                     | 1                     | 4                 | 1
B    | 2                     | 0                     | 2                 | 4
C    | 1                     | 2                     | 3                 | 2
D    | 0                     | 3                     | 3                 | 2
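The example table can be reproduced with a few lines of Python. This sketch assumes every method ranks the same R candidate songs and that tied songs share the better rank, as the table does for C and D; the function names are hypothetical.

```python
def borda_totals(rankings):
    """Total Borda score D_k per song: a song at rank r (1-based) among
    R candidates earns R - r points from each ranking."""
    R = len(rankings[0])
    totals = {}
    for order in rankings:
        for rank, song in enumerate(order, start=1):
            totals[song] = totals.get(song, 0) + (R - rank)
    return totals


def final_ranks(totals):
    """Rank songs by total score; ties share the better rank."""
    return {song: 1 + sum(1 for other in totals.values() if other > score)
            for song, score in totals.items()}


totals = borda_totals(["ABCD", "DCAB"])
print(totals)               # {'A': 4, 'B': 2, 'C': 3, 'D': 3}
print(final_ranks(totals))  # {'A': 1, 'B': 4, 'C': 2, 'D': 2}
```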


Methods: Comparison of Two Methods

The proposed system combines linear scaling with dynamic time warping to accelerate computation.

Type              | Linear scaling (LS)              | Dynamic time warping (DTW)
Computation time  | Faster                           | Slower
Tempo variation   | Handles uniform tempo variation  | Handles non-uniform tempo variation
Key transposition | One-shot                         | Heuristic search