TRANSCRIPT
Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise
Soumitro Chakrabarty and Emanuël Habets
ML4Audio@NIPS 2017 08/12/2017
© AudioLabs, 2017
Soumitro Chakrabarty Multi-speaker Localization with CNN
Direction-of-arrival (DOA)

[Figure: a microphone array receiving sound from a source; the direction-of-arrival θ is the angle of the incoming wavefront relative to the array.]
Motivation: Signal processing methods
§ Cross-correlation-based methods: GCC-PHAT, SRP-PHAT, MCCC, …
§ Subspace-based methods: MUSIC, …
§ Model-based methods: maximum-likelihood estimation, …

Challenges:
- Performance degradation in the presence of noise and reverberation
- High computational cost
Motivation: Supervised learning methods
§ Recently, deep neural network (DNN) based supervised learning methods have been successful across a range of applications: automatic speech recognition, object recognition in images, machine translation, …
§ Some DNN-based methods that estimate DOAs of sound sources from the observed signals have been proposed
§ Advantage: supervised learning methods can be adapted to different acoustic environments
Previous work: CNN-based speaker localization
§ A CNN-based supervised learning method for single DOA estimation per time frame was proposed [1]
§ A simple input representation, the "phase map", was used
§ Trained using synthesized white noise signals
§ The proposed method was shown to clearly outperform a steered-response power based method
[1] S. Chakrabarty and E. A. P. Habets, "Broadband DOA Estimation using Convolutional Neural Networks Trained with Noise Signals," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
Motivation: CNN-based multi-speaker localization
Aim: a CNN-based supervised learning method for DOA estimation that
§ estimates multiple DOAs per time frame given the STFT representation of the observed signals,
§ uses a simple input representation so that relevant features are learned during training, and
§ extends the idea of training with synthesized noise signals to the multi-speaker case.
Problem formulation: DOA estimation as classification
§ Multi-speaker DOA estimation is formulated as an I-class multi-label classification problem
§ Discretize the whole DOA range into I discrete values to obtain the set of possible DOA values: Θ = {θ_1, …, θ_I}
§ Each class i corresponds to a possible DOA value θ_i ∈ Θ
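The discretization can be sketched in a few lines; the 5° resolution and the 0°–180° range are taken from the experimental setup described later in the deck.

```python
# Discretize the DOA range [0°, 180°] into I classes at 5° resolution
# (resolution and range taken from the evaluation setup later in the talk).
resolution = 5
theta = [i * resolution for i in range(0, 180 // resolution + 1)]
I = len(theta)  # I = 37 classes; class i corresponds to DOA theta[i]
```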
Problem formulation: DOA estimation as classification
§ Given the input at each time frame, compute the probability of each class θ_i ∈ Θ using I binary classifiers
§ Apply suitable post-processing to estimate the DOAs of the sources
System overview: Supervised learning framework

[Diagram: Training — training data → STFT → input feature + true DOA labels → train DOA classifier. Inference/Test — test data → STFT → input feature → trained DOA classifier → posterior probabilities → DOA estimate.]
Input feature representation: STFT magnitude and phase components

Y_m(n, k) = A_m(n, k) e^{jφ_m(n, k)}

where A_m(n, k) is the magnitude and φ_m(n, k) the phase of the STFT coefficient at time frame n and frequency bin k for microphone m, with N time frames, K frequency bins, and M microphones.
Input feature representation: Phase map

For each time frame n, the phase components φ_m(n, k) of all M microphones over all K frequency bins form an M × K matrix, called the phase map Φ_n.
CNN architecture

Input: M × K phase map
Conv1: 2 × 1 filters, 64 feature maps, output size (M − 1) × K
Conv2: 2 × 1 filters, 64 feature maps, output size (M − 2) × K
Conv3: 2 × 1 filters, 64 feature maps, output size (M − 3) × K
FC1: 512 units
FC2: 512 units
Output: I × 1
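A minimal sketch of the layer shape arithmetic above, assuming "valid" (unpadded) 2 × 1 convolutions over the microphone axis, and using the M = 4, K = 257 (STFT length 512 → 257 bins), I = 37 values given later in the talk.

```python
# Each 2x1 filter spans 2 microphones and 1 frequency bin, so with no
# padding the microphone axis shrinks by 1 per convolutional layer.
def conv2x1_out(m, k):
    return m - 1, k

M, K, I = 4, 257, 37         # mics, frequency bins, DOA classes
m, k = M, K
for _ in range(3):           # Conv1, Conv2, Conv3
    m, k = conv2x1_out(m, k)
flattened = 64 * m * k       # 64 feature maps feed FC1
# m == 1, k == 257, flattened == 16448
```

With M = 4 the third convolution reduces the microphone axis to a single row, so the network aggregates phase information across all microphone pairs before the fully connected layers.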
Training with synthesized noise signals: Motivation
§ In a mixture of simultaneously active speakers, not all speakers need be active in every time frame
§ Accurate detection of per-frame speaker activity would therefore be required
§ Detection errors lead to inconsistent labels during training
Training with synthesized noise signals: Challenges
§ A mixture of directional white noise signals cannot be used directly as a training signal
§ Example: for two sources,

Y_m(n, k) = A_m(n, k) e^{jφ_m(n, k)} = A_{m,1}(n, k) e^{jφ_{m,1}(n, k)} + A_{m,2}(n, k) e^{jφ_{m,2}(n, k)}

so the observed phase φ_m(n, k) corresponds to neither source's phase.
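A small numeric check of the problem: the phase of a sum of two complex exponentials generally matches neither component's phase. The magnitudes and phases below are illustrative values, not from the talk.

```python
import cmath

# Two hypothetical TF-bin contributions A1*e^{j*phi1} and A2*e^{j*phi2}
s1 = cmath.rect(1.0, 0.3)    # A1 = 1.0, phi1 = 0.3 rad
s2 = cmath.rect(0.8, -1.2)   # A2 = 0.8, phi2 = -1.2 rad
y = s1 + s2                  # observed mixture TF bin
phi = cmath.phase(y)         # the phase the network would observe
# phi is roughly -0.35 rad: it equals neither 0.3 nor -1.2
```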
Training with synthesized noise signals: Idea
§ Assumption: the speakers are not simultaneously active per time-frequency unit
§ Also known as W-disjoint orthogonality
§ Has been shown to hold approximately for speech signals [1]

[1] S. Rickard and O. Yilmaz, "On the approximate W-disjoint orthogonality of speech," ICASSP, 2002.
Training with synthesized noise signals: Generating training data
§ Consider a specific acoustic setup with two source positions at DOAs θ_1 and θ_2
Training with synthesized noise signals: Generating training data

[Figure: two multichannel STFT representations (M microphones × K frequency bins × N time frames), one of a directional white noise signal from θ_1 and one from θ_2.]
Training with synthesized noise signals: Generating training data

Step 1: Concatenate the two signals along the time axis.

[Figure: the two STFT representations joined into a single M microphones × K frequency bins representation over the combined time frames.]
Training with synthesized noise signals: Generating training data

Step 2: Randomize the TF bins across the time axis, separately for each frequency sub-band.
Training with synthesized noise signals: Generating training data – Notes
1. Randomization of the TF bins is done separately for each frequency sub-band.
   Reason: the order of the frequency sub-bands remains the same across time frames.
2. For each frequency sub-band, the TF bins of all microphones are randomized together.
   Reason: the phase relations between the microphones are preserved for each individual TF bin.
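The two steps can be sketched as follows; the array sizes are illustrative, not the ones used in the talk, and the STFT tensors are stand-ins for the actual directional noise signals.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 8, 10                     # mics, sub-bands, frames per source
# Hypothetical multichannel STFTs of directional noise from theta1, theta2
X1 = rng.standard_normal((M, K, N)) + 1j * rng.standard_normal((M, K, N))
X2 = rng.standard_normal((M, K, N)) + 1j * rng.standard_normal((M, K, N))

# Step 1: concatenate the two signals along the time axis
X = np.concatenate([X1, X2], axis=2)   # shape (M, K, 2N)

# Step 2: per sub-band, shuffle the TF bins across time; the SAME
# permutation is applied to all microphones, so the inter-microphone
# phase relations of every individual TF bin are preserved.
Y = np.empty_like(X)
for k in range(K):
    perm = rng.permutation(X.shape[2])
    Y[:, k, :] = X[:, k, perm]
```

Shuffling per sub-band keeps the sub-band ordering along the frequency axis intact, while the shared permutation across microphones keeps each TF bin attributable to a single direction, mimicking W-disjoint orthogonal mixtures.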
Preliminary evaluation
§ Task: estimate the DOAs of two speakers over a 2 s mixture
§ Performance of the CNN compared to SRP-PHAT
§ Evaluation measure: mean absolute error (MAE)
§ Post-processing:
  1. The posterior probabilities for each DOA class, obtained from the CNN output at each time frame, are averaged over all frames
  2. The final DOA estimates are the DOAs of the classes with the two highest averaged posterior probabilities
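The two post-processing steps can be sketched as below; the per-frame posteriors are simulated stand-ins for actual CNN outputs, with peaks injected at two hypothetical classes.

```python
import numpy as np

rng = np.random.default_rng(1)
I, T = 37, 100                 # DOA classes, time frames
# Hypothetical per-frame sigmoid outputs, peaked at classes 6 and 20
post = rng.uniform(0.0, 0.2, size=(T, I))
post[:, 6] += 0.7
post[:, 20] += 0.6

avg = post.mean(axis=0)               # step 1: average over all frames
top2 = np.sort(np.argsort(avg)[-2:])  # step 2: two highest averaged classes
doas = top2 * 5                       # class i corresponds to DOA i * 5°
# -> DOAs 30° and 100°
```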
Evaluation: Experimental parameters
§ Uniform linear array (ULA): 4 microphones, 8 cm inter-microphone distance
§ STFT length 512, 50% overlap
§ Class resolution: 5 degrees, I = 37 classes
§ Training data simulated using the RIR generator [2]

[2] https://github.com/ehabets/RIR-Generator
Evaluation: CNN training parameters
§ Training data: 12.4 million time frames
§ Validation data: 20% split from the training data
§ Loss: binary cross-entropy
§ Activations: ReLU (hidden layers), sigmoid (output layer)
§ Optimizer: Adam
§ Batch size: 512
§ Epochs: 10
§ Regularization: dropout (rate 0.5) after the Conv3 layer and after each FC layer
Evaluation: Test conditions
§ Database: TIMIT test set
§ Mixtures: 666 (all angular combinations); 0 dB
§ Duration: 2 s each

Simulated test data:
  Signal: speech signals from TIMIT
  Room size: (9 × 4 × 3) m
  Array position in room: 1 arbitrary position
  Source–array distance: 1.8 m
  RT60: 0.70 s
Evaluation: Unmatched acoustic conditions
§ Clearly outperforms SRP-PHAT
§ The proposed method has considerably lower errors

Mean absolute error (°):
  SNR       10 dB  20 dB  30 dB
  Proposed   14.3    6.1    1.8
  SRP-PHAT   27.1   21.6   18.2
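The talk does not spell out how the MAE is computed for multiple speakers; a common convention, sketched here as an assumption, is to pair each estimated DOA with a true DOA under the assignment that minimizes the mean absolute difference.

```python
from itertools import permutations

def mae(est, true):
    # Hypothetical multi-source MAE (degrees): best one-to-one assignment
    # of estimated DOAs to true DOAs, then mean absolute difference.
    return min(
        sum(abs(e - t) for e, t in zip(p, true)) / len(true)
        for p in permutations(est)
    )

# Example: estimates 30° and 100° against true DOAs 100° and 35°
# pair 30<->35 and 100<->100, giving (5 + 0) / 2 = 2.5°
```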
Evaluation: Qualitative result – SNR 30 dB

[Figure: frame-averaged probability (0 to 1) versus DOA (0° to 180°) for the CNN and for SRP-PHAT, with the true DOAs marked.]
Evaluation: Qualitative result – SNR 30 dB

[Figure: per-frame results over about 110 time frames — the spectrogram of the speech segment (frequency bins vs. time frames), the true DOAs, and the per-frame DOA probabilities (0° to 180°) for the CNN and for SRP-PHAT.]
Evaluation: Qualitative result – 3 speakers

[Figure: frame-averaged probability (0 to 1) versus DOA (0° to 180°) for the CNN, with the three true DOAs marked.]
Evaluation: Qualitative result – 3 speakers

[Figure: per-frame results over about 100 time frames — the spectrogram of the speech segment (frequency bins vs. time frames), the true DOAs, and the per-frame CNN DOA probabilities (0° to 180°).]
Conclusions
§ Extended our previous work on CNN-based supervised speaker localization to the multi-speaker case
§ The CNN is trained using synthesized noise signals, exploiting the assumption of disjoint activity of speech sources in the STFT domain
§ Preliminary results show clearly superior performance compared to SRP-PHAT in unmatched acoustic conditions
Further work
§ Systematic experiments for a detailed evaluation of the proposed method in both simulated and real conditions
§ Formulate a suitable post-processing method to obtain the final DOA estimates for an unknown number of speakers
§ Identify limitations (if any) of training with synthesized noise signals compared to speech signals
§ Explore the possibility of source number estimation using the proposed method
Thank you for your attention!