TRANSCRIPT
Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise
Soumitro Chakrabarty and Emanuël Habets
ML4Audio@NIPS 2017 08/12/2017
© AudioLabs, 2017
Soumitro Chakrabarty Multi-speaker Localization with CNN
Direction-of-arrival (DOA)

[Figure: a microphone array receiving sound from a source; the direction-of-arrival θ is the angle of the incoming wavefront relative to the array.]
Motivation: Signal processing methods
§ Cross-correlation-based methods: GCC-PHAT, SRP-PHAT, MCCC, …
§ Subspace-based methods: MUSIC, …
§ Model-based methods: maximum-likelihood estimation, …

Challenges:
- Performance degradation in the presence of noise and reverberation
- High computational cost
Motivation: Supervised learning methods
§ Recently, deep neural network (DNN) based supervised learning methods have been successful across a range of applications: automatic speech recognition, object recognition in images, machine translation, …
§ Some DNN-based methods that estimate DOAs of sound sources from the observed signals have been proposed
§ Advantage: supervised learning methods can be adapted to different acoustic environments
Previous work: CNN-based speaker localization
§ A CNN-based supervised learning method for single DOA estimation per time frame was proposed [1]
§ A simple input representation, the "phase map", was used
§ Trained using synthesized white noise signals
§ The proposed method was shown to clearly outperform a steered-response power based method
[1] S. Chakrabarty and E. A. P. Habets, "Broadband DOA Estimation using Convolutional Neural Networks Trained with Noise Signals," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
Motivation: CNN-based multi-speaker localization
Aim: a CNN-based supervised learning method for DOA estimation that
§ estimates multiple DOAs per time frame given the STFT representation of the observed signals,
§ uses a simple input representation so that relevant features are learned during training, and
§ extends the idea of training with synthesized noise signals to the multi-speaker case.
Problem formulation: DOA estimation as classification
§ Multi-speaker DOA estimation is formulated as an I-class multi-label classification problem
§ Discretize the whole DOA range into I discrete values to obtain the set of possible DOA values: Θ = {θ_1, …, θ_I}
§ Each class i corresponds to a possible DOA value θ_i ∈ Θ
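The discretization can be sketched in a few lines; the 5° resolution and the 0°–180° range are taken from the experimental setup described later in the deck.

```python
# Discretize the DOA range [0°, 180°] into I classes at 5° resolution
# (resolution and range taken from the evaluation setup later in the talk).
resolution = 5
theta = [i * resolution for i in range(0, 180 // resolution + 1)]
I = len(theta)  # I = 37 classes; class i corresponds to DOA theta[i]
```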
Problem formulation: DOA estimation as classification
§ Given the input at each time frame, compute the probability of each class θ_i ∈ Θ using I binary classifiers
§ Apply suitable post-processing to estimate the DOAs of the sources
System overview: Supervised learning framework

[Diagram: Training — training data → STFT → input feature + true DOA labels → train DOA classifier. Inference/Test — test data → STFT → input feature → trained DOA classifier → posterior probabilities → DOA estimate.]
Input feature representation: STFT magnitude and phase components

Y_m(n, k) = A_m(n, k) e^{jφ_m(n, k)}

where A_m(n, k) is the magnitude and φ_m(n, k) the phase of the STFT coefficient at time frame n and frequency bin k for microphone m, with N time frames, K frequency bins, and M microphones.
Input feature representation: Phase map

For each time frame n, the phase components φ_m(n, k) of all M microphones over all K frequency bins form an M × K matrix, called the phase map Φ_n.
CNN architecture

Input: M × K phase map
Conv1: 2 × 1 filters, 64 feature maps, output size (M − 1) × K
Conv2: 2 × 1 filters, 64 feature maps, output size (M − 2) × K
Conv3: 2 × 1 filters, 64 feature maps, output size (M − 3) × K
FC1: 512 units
FC2: 512 units
Output: I × 1
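A minimal sketch of the layer shape arithmetic above, assuming "valid" (unpadded) 2 × 1 convolutions over the microphone axis, and using the M = 4, K = 257 (STFT length 512 → 257 bins), I = 37 values given later in the talk.

```python
# Each 2x1 filter spans 2 microphones and 1 frequency bin, so with no
# padding the microphone axis shrinks by 1 per convolutional layer.
def conv2x1_out(m, k):
    return m - 1, k

M, K, I = 4, 257, 37         # mics, frequency bins, DOA classes
m, k = M, K
for _ in range(3):           # Conv1, Conv2, Conv3
    m, k = conv2x1_out(m, k)
flattened = 64 * m * k       # 64 feature maps feed FC1
# m == 1, k == 257, flattened == 16448
```

With M = 4 the third convolution reduces the microphone axis to a single row, so the network aggregates phase information across all microphone pairs before the fully connected layers.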
Training with synthesized noise signals: Motivation
§ In a mixture of simultaneously active speakers, not all speakers need be active in every time frame
§ Accurate detection of per-frame speaker activity would therefore be required
§ Detection errors lead to inconsistent labels during training
Training with synthesized noise signals: Challenges
§ A mixture of directional white noise signals cannot be used directly as a training signal
§ Example: for two sources,

Y_m(n, k) = A_m(n, k) e^{jφ_m(n, k)} = A_{m,1}(n, k) e^{jφ_{m,1}(n, k)} + A_{m,2}(n, k) e^{jφ_{m,2}(n, k)}

so the observed phase φ_m(n, k) corresponds to neither source's phase.
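A small numeric check of the problem: the phase of a sum of two complex exponentials generally matches neither component's phase. The magnitudes and phases below are illustrative values, not from the talk.

```python
import cmath

# Two hypothetical TF-bin contributions A1*e^{j*phi1} and A2*e^{j*phi2}
s1 = cmath.rect(1.0, 0.3)    # A1 = 1.0, phi1 = 0.3 rad
s2 = cmath.rect(0.8, -1.2)   # A2 = 0.8, phi2 = -1.2 rad
y = s1 + s2                  # observed mixture TF bin
phi = cmath.phase(y)         # the phase the network would observe
# phi is roughly -0.35 rad: it equals neither 0.3 nor -1.2
```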
Training with synthesized noise signals: Idea
§ Assumption: the speakers are not simultaneously active per time-frequency unit
§ Also known as W-disjoint orthogonality
§ Has been shown to hold approximately for speech signals [1]

[1] S. Rickard and O. Yilmaz, "On the approximate W-disjoint orthogonality of speech," ICASSP, 2002.
Training with synthesized noise signals: Generating training data
§ Consider a specific acoustic setup with two source positions at DOAs θ_1 and θ_2
Training with synthesized noise signals: Generating training data

[Figure: two multichannel STFT representations (M microphones × K frequency bins × N time frames), one of a directional white noise signal from θ_1 and one from θ_2.]
Training with synthesized noise signals: Generating training data

Step 1: Concatenate the two signals along the time axis.

[Figure: the two STFT representations joined into a single M microphones × K frequency bins representation over the combined time frames.]
Training with synthesized noise signals: Generating training data

Step 2: Randomize the TF bins across the time axis, separately for each frequency sub-band.
Training with synthesized noise signals: Generating training data – Notes
1. Randomization of the TF bins is done separately for each frequency sub-band.
   Reason: the order of the frequency sub-bands remains the same across time frames.
2. For each frequency sub-band, the TF bins of all microphones are randomized together.
   Reason: the phase relations between the microphones are preserved for each individual TF bin.
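The two steps can be sketched as follows; the array sizes are illustrative, not the ones used in the talk, and the STFT tensors are stand-ins for the actual directional noise signals.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 8, 10                     # mics, sub-bands, frames per source
# Hypothetical multichannel STFTs of directional noise from theta1, theta2
X1 = rng.standard_normal((M, K, N)) + 1j * rng.standard_normal((M, K, N))
X2 = rng.standard_normal((M, K, N)) + 1j * rng.standard_normal((M, K, N))

# Step 1: concatenate the two signals along the time axis
X = np.concatenate([X1, X2], axis=2)   # shape (M, K, 2N)

# Step 2: per sub-band, shuffle the TF bins across time; the SAME
# permutation is applied to all microphones, so the inter-microphone
# phase relations of every individual TF bin are preserved.
Y = np.empty_like(X)
for k in range(K):
    perm = rng.permutation(X.shape[2])
    Y[:, k, :] = X[:, k, perm]
```

Shuffling per sub-band keeps the sub-band ordering along the frequency axis intact, while the shared permutation across microphones keeps each TF bin attributable to a single direction, mimicking W-disjoint orthogonal mixtures.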
Preliminary evaluation
§ Task: estimate the DOAs of two speakers over a 2 s mixture
§ Performance of the CNN compared to SRP-PHAT
§ Evaluation measure: mean absolute error (MAE)
§ Post-processing:
  1. The posterior probabilities for each DOA class, obtained from the CNN output at each time frame, are averaged over all frames
  2. The final DOA estimates are the DOAs of the classes with the two highest averaged posterior probabilities
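The two post-processing steps can be sketched as below; the per-frame posteriors are simulated stand-ins for actual CNN outputs, with peaks injected at two hypothetical classes.

```python
import numpy as np

rng = np.random.default_rng(1)
I, T = 37, 100                 # DOA classes, time frames
# Hypothetical per-frame sigmoid outputs, peaked at classes 6 and 20
post = rng.uniform(0.0, 0.2, size=(T, I))
post[:, 6] += 0.7
post[:, 20] += 0.6

avg = post.mean(axis=0)               # step 1: average over all frames
top2 = np.sort(np.argsort(avg)[-2:])  # step 2: two highest averaged classes
doas = top2 * 5                       # class i corresponds to DOA i * 5°
# -> DOAs 30° and 100°
```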
Evaluation: Experimental parameters
§ Uniform linear array (ULA): 4 microphones, 8 cm inter-microphone distance
§ STFT length 512, 50% overlap
§ Class resolution: 5 degrees, I = 37 classes
§ Training data simulated using the RIR generator [2]

[2] https://github.com/ehabets/RIR-Generator
Evaluation: CNN training parameters
§ Training data: 12.4 million time frames
§ Validation data: 20% split from the training data
§ Loss: binary cross-entropy
§ Activations: ReLU (hidden layers), sigmoid (output layer)
§ Optimizer: Adam
§ Batch size: 512
§ Epochs: 10
§ Regularization: dropout (rate 0.5) after the Conv3 layer and after each FC layer
Evaluation: Test conditions
§ Database: TIMIT test set
§ Mixtures: 666 (all angular combinations); 0 dB
§ Duration: 2 s each

Simulated test data:
  Signal: speech signals from TIMIT
  Room size: (9 × 4 × 3) m
  Array position in room: 1 arbitrary position
  Source–array distance: 1.8 m
  RT60: 0.70 s
Evaluation: Unmatched acoustic conditions
§ Clearly outperforms SRP-PHAT
§ The proposed method has considerably lower errors

Mean absolute error (°):
  SNR       10 dB  20 dB  30 dB
  Proposed   14.3    6.1    1.8
  SRP-PHAT   27.1   21.6   18.2
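The talk does not spell out how the MAE is computed for multiple speakers; a common convention, sketched here as an assumption, is to pair each estimated DOA with a true DOA under the assignment that minimizes the mean absolute difference.

```python
from itertools import permutations

def mae(est, true):
    # Hypothetical multi-source MAE (degrees): best one-to-one assignment
    # of estimated DOAs to true DOAs, then mean absolute difference.
    return min(
        sum(abs(e - t) for e, t in zip(p, true)) / len(true)
        for p in permutations(est)
    )

# Example: estimates 30° and 100° against true DOAs 100° and 35°
# pair 30<->35 and 100<->100, giving (5 + 0) / 2 = 2.5°
```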
Evaluation: Qualitative result – SNR 30 dB

[Figure: frame-averaged probability (0 to 1) versus DOA (0° to 180°) for the CNN and for SRP-PHAT, with the true DOAs marked.]
Evaluation: Qualitative result – SNR 30 dB

[Figure: per-frame results over about 110 time frames — the spectrogram of the speech segment (frequency bins vs. time frames), the true DOAs, and the per-frame DOA probabilities (0° to 180°) for the CNN and for SRP-PHAT.]
Evaluation: Qualitative result – 3 speakers

[Figure: frame-averaged probability (0 to 1) versus DOA (0° to 180°) for the CNN, with the three true DOAs marked.]
Evaluation: Qualitative result – 3 speakers

[Figure: per-frame results over about 100 time frames — the spectrogram of the speech segment (frequency bins vs. time frames), the true DOAs, and the per-frame CNN DOA probabilities (0° to 180°).]
Conclusions
§ Extended our previous work on CNN-based supervised speaker localization to the multi-speaker case
§ The CNN is trained using synthesized noise signals, exploiting the assumption of disjoint activity of speech sources in the STFT domain
§ Preliminary results show clearly superior performance compared to SRP-PHAT in unmatched acoustic conditions
Further work
§ Systematic experiments for a detailed evaluation of the proposed method in both simulated and real conditions
§ Formulate a suitable post-processing method to obtain the final DOA estimates for an unknown number of speakers
§ Identify limitations (if any) of training with synthesized noise signals compared to speech signals
§ Explore the possibility of source number estimation using the proposed method
Thank you for your attention!