
Page 1

Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise

Soumitro Chakrabarty and Emanuël Habets

ML4Audio@NIPS 2017 08/12/2017

Page 2

© AudioLabs, 2017 | Soumitro Chakrabarty, Multi-speaker Localization with CNN

Direction-of-arrival (DOA)

[Figure: microphone array geometry, with microphone spacings labeled 5 mm and 3 mm]

Page 3

Motivation: Signal processing methods

§ Cross-correlation-based methods
  • GCC-PHAT
  • SRP-PHAT
  • MCCC ...
§ Subspace-based methods
  • MUSIC ...
§ Model-based methods
  • Maximum-likelihood estimation ...
§ ...

Challenges
- Performance degradation in the presence of noise and reverberation
- High computational cost

Page 4

Motivation: Supervised learning methods

§ Recently, deep neural network (DNN) based supervised learning methods have been successful across a range of applications:
  • Automatic speech recognition
  • Object recognition in images
  • Machine translation ...
§ Some DNN-based methods that estimate DOAs of sound sources from the observed signals have been proposed
§ Advantage: supervised learning methods can be adapted to different acoustic environments

Page 5

Previous work: CNN-based speaker localization

§ A CNN-based supervised learning method for single DOA estimation per time frame was proposed [1]
§ A simple input representation, the "phase map", was used
§ Trained using synthesized white noise signals
§ The proposed method was shown to clearly outperform a steered-response power based method

[1] S. Chakrabarty and E. A. P. Habets, "Broadband DOA Estimation using Convolutional Neural Networks Trained with Noise Signals," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017

Page 6

Motivation: CNN-based multi-speaker localization

Aim: A CNN-based supervised learning method for DOA estimation that
§ Estimates multiple DOAs per time frame given the STFT representation of the observed signals
§ Uses a simple input representation so that relevant features are learned during training
§ Extends the idea of training with synthesized noise signals to the multi-speaker case

Page 7

Problem Formulation: DOA estimation as classification

§ Multi-speaker DOA estimation is formulated as an I-class multi-label classification problem
§ Discretize the whole DOA range into I discrete values to obtain the set of possible DOA values: Θ = {θ1, . . . , θI}
§ Each class i corresponds to a possible DOA value θi ∈ Θ
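The formulation above can be sketched in a few lines, assuming a linear array whose DOA range spans 0° to 180° and the 5° resolution used later in the evaluation (the `encode_label` helper and nearest-class mapping are my illustration, not the authors' code):

```python
import numpy as np

# Discretize the DOA range into I classes: Theta = {theta_1, ..., theta_I}.
doa_grid = np.arange(0, 181, 5)   # 0, 5, ..., 180 degrees
I = len(doa_grid)                 # 37 classes at 5-degree resolution

def encode_label(active_doas):
    """Multi-label target for one time frame: 1 for every active DOA class."""
    label = np.zeros(I)
    for theta in active_doas:
        label[np.argmin(np.abs(doa_grid - theta))] = 1.0  # nearest class
    return label

label = encode_label([30, 120])   # two active speakers
print(int(label.sum()))           # 2
```

Because the target is a multi-hot vector rather than a one-hot vector, each of the I outputs acts as an independent binary classifier, which is what allows multiple DOAs per frame.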

Page 8

Problem Formulation: DOA estimation as classification

§ Given the input at each time frame, compute the probability for each of the I classes (class i corresponding to θi ∈ Θ = {θ1, . . . , θI}) using I binary classifiers
§ Apply suitable post-processing to estimate the DOAs of the sources

Page 9

System overview: Supervised learning framework

Training: training data → STFT → input feature (+ true DOA labels) → train DOA classifier → trained parameters

Inference/Test: test data → STFT → input feature → DOA classifier (with trained parameters) → posterior probabilities → DOA estimate

Page 10


Page 11

Input feature representation: STFT magnitude and phase components

Ym(n, k) = Am(n, k) e^{jφm(n, k)}

§ Magnitude component: Am(n, k); phase component: φm(n, k)
§ The STFT representation spans K frequency bins, N time frames and M microphones

Page 12

Input feature representation: Phase map

§ For each time frame n, the phase components φm(n, k) of all M microphones and K frequency bins are arranged into an M × K matrix: the phase map Φn
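A minimal sketch of how the phase map is formed from a multichannel STFT (array shapes follow the slides' notation; the random STFT here is only a stand-in for real microphone signals):

```python
import numpy as np

# Stand-in multichannel STFT: M microphones, K frequency bins, N time frames.
M, K, N = 4, 257, 100
rng = np.random.default_rng(1)
stft = rng.standard_normal((M, K, N)) + 1j * rng.standard_normal((M, K, N))

def phase_map(stft, n):
    """Phi_n: M x K matrix of phase components for time frame n."""
    return np.angle(stft[:, :, n])  # magnitude is discarded, phase kept

phi = phase_map(stft, 0)
print(phi.shape)  # (4, 257)
```

Discarding the magnitude keeps the input representation simple: the inter-microphone phase differences, which encode the DOA, are all that reaches the network.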

Page 13

System overview: Supervised learning framework

§ In both training and inference, the STFT of the data is converted to the phase map Φn, which serves as the input feature to the DOA classifier

Page 14

CNN architecture

Input: M × K phase map
Conv1: 2 × 1 filters, 64 feature maps → size (M − 1) × K
Conv2: 2 × 1 filters, 64 feature maps → size (M − 2) × K
Conv3: 2 × 1 filters, 64 feature maps → size (M − 3) × K
FC1: 512 units
FC2: 512 units
Output: I × 1
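The architecture can be sketched in PyTorch (my reconstruction, not the authors' code), assuming the 2 × 1 filters convolve along the microphone axis without padding, which matches the listed feature-map sizes; the dropout placement and sigmoid output follow the training-parameters slide:

```python
import torch
import torch.nn as nn

# M microphones, K frequency bins (512-point STFT), I DOA classes,
# following the evaluation setup.
M, K, I = 4, 257, 37

model = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=(2, 1)), nn.ReLU(),   # -> (M-1) x K
    nn.Conv2d(64, 64, kernel_size=(2, 1)), nn.ReLU(),  # -> (M-2) x K
    nn.Conv2d(64, 64, kernel_size=(2, 1)), nn.ReLU(),  # -> (M-3) x K
    nn.Dropout(0.5),                                   # after Conv3
    nn.Flatten(),
    nn.Linear(64 * (M - 3) * K, 512), nn.ReLU(),       # FC1
    nn.Dropout(0.5),
    nn.Linear(512, 512), nn.ReLU(),                    # FC2
    nn.Dropout(0.5),
    nn.Linear(512, I),
    nn.Sigmoid(),  # I independent binary classifiers (multi-label output)
)

phase_maps = torch.randn(8, 1, M, K)  # batch of 8 phase maps
probs = model(phase_maps)
print(probs.shape)  # torch.Size([8, 37])
```

Convolving only along the microphone axis means each filter compares the phases of adjacent microphones per frequency bin, i.e. it learns phase-difference features directly.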

Page 15

Training with synthesized noise signals: Motivation

§ In a mixture of simultaneously active speakers, not all speakers need to be active in each time frame
§ Training with speech therefore requires accurate detection of speaker activity per time frame
§ Detection errors lead to inconsistent labels during training

Page 16

Training with synthesized noise signals: Challenges

§ A mixture of directional white noise signals cannot be used directly as a training signal
§ Example, for two sources:

Ym(n, k) = Am(n, k) e^{jφm(n, k)} = Am,1(n, k) e^{jφm,1(n, k)} + Am,2(n, k) e^{jφm,2(n, k)}

§ Since both noise sources are active in every TF bin, the mixture phase corresponds to neither DOA

Page 17

Training with synthesized noise signals: Idea

§ Assumption: speakers are not simultaneously active per time-frequency unit
§ Also known as W-disjoint orthogonality
§ Shown to hold approximately for speech signals [1]

[1] S. Rickard and O. Yilmaz, "On the approximate W-disjoint orthogonality of speech," ICASSP 2002

Page 18

Training with synthesized noise signals: Generating training data

§ Consider a specific acoustic setup

[Figure: microphone array with two directional noise sources at DOAs θ1 and θ2]

Page 19

Training with synthesized noise signals: Generating training data

[Figure: two M × K × N STFT blocks (M microphones, K frequency bins, N time frames), one for the directional noise signal from θ1 and one for θ2]

Page 20

Training with synthesized noise signals: Generating training data

§ Concatenate the signals along the time axis

[Figure: the two M × K × N STFT blocks joined into one block along the time axis]

Page 21

Training with synthesized noise signals: Generating training data

§ Randomize the TF bins across the time axis for each sub-band

[Figure: M × K × N STFT block after per-sub-band randomization of the time frames]

Page 22

Training with synthesized noise signals: Generating training data – Notes

1. Randomization of the TF bins is done separately for each frequency sub-band
   Reason: the order of the frequency sub-bands remains the same across time frames
2. For each frequency sub-band, the TF bins of all the microphones are randomized together
   Reason: the phase relations between the microphones within each TF bin are preserved
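The two notes above can be sketched as follows (my reconstruction of the procedure, not the authors' code; the random complex arrays stand in for the STFTs of the two directional noise signals):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 257, 100  # microphones, frequency bins, time frames

# Stand-ins for the STFTs of directional noise from theta_1 and theta_2.
stft_theta1 = rng.standard_normal((M, K, N)) + 1j * rng.standard_normal((M, K, N))
stft_theta2 = rng.standard_normal((M, K, N)) + 1j * rng.standard_normal((M, K, N))

# 1. Concatenate along the time axis -> shape (M, K, 2N).
mixed = np.concatenate([stft_theta1, stft_theta2], axis=2)

# 2. Per frequency sub-band, shuffle the time frames; one permutation is
#    shared by all M microphones so each TF bin keeps its inter-microphone
#    phase relations intact.
for k in range(K):
    perm = rng.permutation(mixed.shape[2])
    mixed[:, k, :] = mixed[:, k, perm]

print(mixed.shape)  # (4, 257, 200)
```

After the shuffle, most frames contain bins from both DOAs, yet each individual TF bin still carries the phase pattern of exactly one source, which is precisely the W-disjoint-orthogonal structure the network should learn.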

Page 23

Preliminary evaluation

§ Task: estimate the DOAs of two speakers over a 2 s mixture
§ Performance of the CNN compared to SRP-PHAT
§ Evaluation measure: Mean Absolute Error (MAE)
§ Post-processing:
  1. The posterior probabilities for each DOA class, obtained from the CNN output at each time frame, are averaged over all frames
  2. The final DOA estimates are the DOAs of the classes with the two highest averaged posterior probabilities
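The two post-processing steps above can be sketched as (the `estimate_doas` helper is my illustration; the 5° grid matches the evaluation setup):

```python
import numpy as np

doa_grid = np.arange(0, 181, 5)  # theta_1 ... theta_I, I = 37

def estimate_doas(posteriors, num_speakers=2):
    """posteriors: (num_frames, I) per-frame class probabilities."""
    mean_post = posteriors.mean(axis=0)            # 1. average over frames
    top = np.argsort(mean_post)[-num_speakers:]    # 2. top-k classes
    return np.sort(doa_grid[top])

# Toy example: posteriors peaked at classes 6 (30 deg) and 24 (120 deg).
post = np.full((50, 37), 0.05)
post[:, 6] = 0.9
post[:, 24] = 0.8
print(estimate_doas(post))  # [ 30 120]
```

Note that this simple top-k rule assumes the number of speakers is known; handling an unknown speaker count is listed under further work.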

Page 24

Evaluation: Experimental parameters

§ Uniform linear array (ULA)
§ Number of microphones: 4
§ Inter-microphone distance: 8 cm
§ STFT length: 512 samples, 50% overlap
§ Class resolution: 5 degrees, I = 37 classes
§ Training data is simulated using the RIR generator [2]

[2] https://github.com/ehabets/RIR-Generator

Page 25

Evaluation: CNN training parameters

§ Training data: 12.4 million time frames
§ Validation data: 20% split from the training data
§ Loss: binary cross-entropy
§ Activations: ReLU; sigmoid (final layer)
§ Optimizer: Adam
§ Batch size: 512
§ Number of epochs: 10
§ Regularization: dropout (rate 0.5) after the Conv3 layer and after each FC layer
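A single training step under these hyperparameters might look as follows (a minimal sketch: the tiny linear model is a stand-in for the CNN, purely to keep the example self-contained):

```python
import torch
import torch.nn as nn

I = 37
# Stand-in for the CNN: any model ending in a sigmoid over I classes.
model = nn.Sequential(nn.Linear(64, I), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCELoss()  # binary cross-entropy: one classifier per class

features = torch.randn(512, 64)  # one batch of 512 frames (stand-in features)
labels = torch.zeros(512, I)
labels[:, 6] = 1.0               # multi-hot targets: two active DOA classes
labels[:, 24] = 1.0

optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```

The binary cross-entropy loss with sigmoid outputs is what makes this multi-label rather than multi-class: no softmax competition forces the classes to sum to one, so several DOAs can be active at once.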

Page 26

Evaluation: Test conditions

§ Database: TIMIT test set
§ Mixtures: 666 (all angular combinations); 0 dB
§ Duration: 2 s each

Simulated test data:
Signal: speech signals from TIMIT
Room size: (9 × 4 × 3) m
Array positions in room: 1 arbitrary position
Source-array distance: 1.8 m
RT60: 0.70 s

Page 27

Evaluation: Unmatched acoustic conditions

§ Clearly outperforms SRP-PHAT
§ The proposed method has considerably lower errors

Mean Absolute Error (°):

SNR        10 dB   20 dB   30 dB
Proposed    14.3     6.1     1.8
SRP-PHAT    27.1    21.6    18.2

Page 28

Evaluation: Qualitative result – SNR 30 dB

[Figure: averaged probability vs. DOA (0° to 180°); legend: CNN, SRP-PHAT, True DOAs]

Page 29

Evaluation: Qualitative result – SNR 30 dB

[Figure: frame-wise DOA probability maps (DOA vs. time frames) for the CNN and SRP-PHAT, the true DOAs, and the spectrogram of the speech segment]

Page 30

Evaluation: Qualitative result – 3 speakers

[Figure: averaged probability vs. DOA (0° to 180°); legend: CNN, True DOAs]

Page 31

Evaluation: Qualitative result – 3 speakers

[Figure: frame-wise DOA probability maps (DOA vs. time frames) for the CNN, the true DOAs, and the spectrogram of the speech segment]

Page 32

Conclusions

§ Extended our previous work on CNN-based supervised speaker localization to the multi-speaker case
§ The CNN is trained using synthesized noise signals, utilizing the assumption of disjoint activity of speech sources in the STFT domain
§ Preliminary results show clearly superior performance compared to SRP-PHAT in unmatched acoustic conditions

Page 33

Further work

§ Systematic experiments for a detailed evaluation of the proposed method in both simulated and real conditions
§ Formulate a suitable post-processing method to obtain the final DOA estimates for an unknown number of speakers
§ Identify limitations (if any) of training with synthesized noise signals compared to speech signals
§ Explore the possibility of source number estimation using the proposed method

Page 34

Thank you for your attention!