Transcript · 2020-02-24
Speaker recognition system adaptation to unseen and
mismatched recording devices in the NFI-FRIDA database
Finnian Kelly1, Anil Alexander1, Oscar Forth1, and David van der Vloed2
1 Oxford Wave Research Ltd, UK
2 Speech and Audio Research, Netherlands Forensic Institute, The Hague, Netherlands
Motivation
Current automatic speaker recognition systems are trained on large
quantities of diverse speaker recordings:
• performance is good for forensic casework material involving
typical microphone or telephone recordings
• for unseen recording types, such as those involving a new
covert surveillance recorder, or a new transmission condition,
for example, performance may be impacted negatively
How can we adapt a well-trained automatic system to the unseen and mismatched conditions of a new case?
Levels of mismatch
Increasing difficulty:
• Matched, seen conditions
• Mismatched, seen conditions
• Matched, unseen conditions
• Mismatched, unseen conditions
The challenge of mismatch
[Figure: score distributions under matched, seen conditions vs mismatched, unseen conditions]
H0: same-speaker scores
H1: different-speaker scores
Existing solutions in VOCALISE
• Train the system from scratch with relevant data
• Data hungry: 1000s of speakers required
• Re-train the system's LDA/PLDA stages with relevant data
• Data hungry: 100s of speakers required
• Apply score normalisation
• Will help, but is limited
Here we introduce a new method of adapting a well-trained
system to unseen conditions on the fly using small* quantities
of data => forensically realistic
*10s of speakers
VOCALISE i-vector framework
[Diagram: speech → feature extraction → UBM (high-dimensional, universal speaker space) → i-vector extraction → i-vector (low-dimensional, speaker-specific space)]
Comparing i-vectors
[Diagram: i-vector A and i-vector B → LDA/PLDA → comparison score for i-vectors A and B]
Post-processing i-vectors
We could compare ‘raw’ i-vectors directly, but
it is beneficial to first post-process i-vectors to
increase their discriminatory power
LDA (linear discriminant analysis) is an
important post-processing step that:
1. Increases inter-speaker separability
2. Reduces dimensionality
Linear Discriminant Analysis (LDA)
• LDA projects i-vectors into a new space in which:
• within-speaker variability is minimised
• between-speaker separation is maximised
• Requires a set of training i-vectors and their speaker labels
LDA training
• The LDA transformation is generally learned using the same training data as the other models in the i-vector framework (UBM and TV).
• Can we leverage LDA for adapting a system to new conditions?
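To make the two post-processing effects above concrete, here is a minimal NumPy sketch of LDA training on labelled i-vectors (an illustrative reimplementation, not VOCALISE's code): it accumulates within- and between-speaker scatter, whitens the within-speaker scatter, and keeps the top discriminant directions.

```python
import numpy as np

def train_lda(ivectors, labels, n_components):
    """Learn an LDA projection from labelled i-vectors (illustrative sketch)."""
    X = np.asarray(ivectors, dtype=float)
    y = np.asarray(labels)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-speaker scatter
    Sb = np.zeros((d, d))  # between-speaker scatter
    for spk in np.unique(y):
        Xs = X[y == spk]
        mu_s = Xs.mean(axis=0)
        Sw += (Xs - mu_s).T @ (Xs - mu_s)
        Sb += len(Xs) * np.outer(mu_s - mu, mu_s - mu)
    # Whiten Sw, then diagonalise the whitened Sb (keeps the problem symmetric)
    vals, vecs = np.linalg.eigh(Sw + 1e-6 * np.eye(d))
    W = vecs / np.sqrt(vals)                      # within-scatter whitening
    vals_b, vecs_b = np.linalg.eigh(W.T @ Sb @ W)
    order = np.argsort(vals_b)[::-1]              # largest separation first
    return W @ vecs_b[:, order[:n_components]]    # columns = LDA directions

# Toy usage: two "speakers" in 3-D, projected to 1-D
rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0, 0.0], 0.1, size=(5, 3))
B = rng.normal([1.0, 1.0, 0.0], 0.1, size=(5, 3))
X = np.vstack([A, B])
y = np.array([0] * 5 + [1] * 5)
proj = X @ train_lda(X, y, n_components=1)
# After projection the two speakers separate cleanly along one axis
```

Both properties from the slide fall out of this construction: the dimensionality drops to `n_components`, and the retained axes maximise between-speaker separation relative to within-speaker spread.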
Condition adaptation via LDA
[Diagram: system development i-vectors (N ≈ 50,000) yield the well-trained LDA transformation; adaptation i-vectors (N ≈ 100) yield the adapted LDA transformation]
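The slides do not spell out how the ~50,000 development i-vectors and the ~100 adaptation i-vectors are combined into the adapted transformation. Purely as an assumed sketch, one common adaptation recipe interpolates the scatter statistics of the two sets before re-deriving LDA; the mixing weight `alpha` is hypothetical, not a value from the presentation.

```python
import numpy as np

def adapt_scatter(S_dev, S_adapt, alpha=0.5):
    """Blend development and adaptation scatter matrices.

    alpha = 0 keeps the well-trained statistics; alpha = 1 trusts only the
    small in-condition adaptation set. (Hypothetical recipe, for illustration.)
    """
    return (1.0 - alpha) * S_dev + alpha * S_adapt

S_dev = np.eye(2) * 4.0            # stand-in development within-speaker scatter
S_adapt = np.array([[2.0, 1.0],    # stand-in scatter from the new condition
                    [1.0, 2.0]])
S_mix = adapt_scatter(S_dev, S_adapt, alpha=0.25)
# S_mix = 0.75 * S_dev + 0.25 * S_adapt = [[3.5, 0.25], [0.25, 3.5]]
```

The blended within- and between-speaker scatter matrices would then replace the originals when solving for the LDA directions.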
Probabilistic LDA (PLDA)
• PLDA compares two post-LDA i-vectors and returns a comparison score
• The score is calculated based on the most discriminative parts of an
i-vector:
• Achieved by learning a subspace that describes the dominant
directions of change in the i-vectors of different speakers
• PLDA therefore requires a set of post-LDA training i-vectors and
their speaker labels
We supplement our LDA condition adaptation by re-training
PLDA with all adapted i-vectors
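The presentation does not give the PLDA parameterisation it uses; as a simplified stand-in, the "two-covariance" formulation below scores a pair of post-LDA i-vectors as a log-likelihood ratio between the same-speaker and different-speaker Gaussian models (B and W are toy between- and within-speaker covariances, not trained values).

```python
import numpy as np

def gauss_logpdf(z, cov):
    """Log density of a zero-mean Gaussian at z."""
    d = len(z)
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + z @ np.linalg.solve(cov, z))

def plda_llr(x, y, B, W):
    """Two-covariance PLDA score: log p(x, y | same) - log p(x, y | different)."""
    d = len(x)
    T = B + W                                 # total covariance of one i-vector
    same = np.block([[T, B], [B, T]])         # shared speaker factor couples x, y
    diff = np.block([[T, np.zeros((d, d))],   # independent speakers: no coupling
                     [np.zeros((d, d)), T]])
    z = np.concatenate([x, y])
    return gauss_logpdf(z, same) - gauss_logpdf(z, diff)

B = np.eye(2) * 2.0   # toy between-speaker covariance
W = np.eye(2) * 0.5   # toy within-speaker covariance
close = plda_llr(np.array([1.0, 1.0]), np.array([1.1, 0.9]), B, W)
far = plda_llr(np.array([1.0, 1.0]), np.array([-1.0, -1.0]), B, W)
# the nearby pair scores higher than the distant pair
```

The "most discriminative parts" of an i-vector, in this formulation, are the directions where B is large relative to W; re-training on adapted i-vectors re-estimates exactly those statistics.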
Reference normalisation
Reference (or score) normalisation is an established technique for adapting the output of a system to new conditions
[Diagram: i-vector A and i-vector B produce a raw comparison score; each is also scored against a set of reference i-vectors, and the resulting reference scores A and reference scores B are used to turn the raw score into a normalised comparison score]
Can only shift scores up or down; less powerful than LDA/PLDA adaptation…
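The slides leave the normalisation formula implicit; a common symmetric recipe consistent with the diagram (an "S-norm"-style sketch, assumed here) z-normalises the raw score against each side's reference scores and averages the two:

```python
import numpy as np

def s_norm(raw, ref_scores_a, ref_scores_b):
    """Symmetric score normalisation (assumed recipe, for illustration).

    ref_scores_a: scores of i-vector A against the reference i-vectors
    ref_scores_b: scores of i-vector B against the reference i-vectors
    """
    za = (raw - np.mean(ref_scores_a)) / np.std(ref_scores_a)
    zb = (raw - np.mean(ref_scores_b)) / np.std(ref_scores_b)
    return 0.5 * (za + zb)

ref_a = np.array([0.1, 0.2, 0.0, -0.1, 0.3])  # A vs reference cohort
ref_b = np.array([0.5, 0.4, 0.6, 0.3, 0.7])   # B vs reference cohort
score = s_norm(0.9, ref_a, ref_b)
```

This shows why the slide calls the technique limited: each side's transform is a single shift and scale, with no change to the underlying i-vector space.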
Mismatched condition experiments
System: iVOCALISE 2017B
• TEL-only and TEL-MIC sessions
• Condition adaptation
• Reference normalisation
Alexander, A., Forth, O., Atreya, A. A. and Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Speaker Odyssey 2016, Bilbao, Spain.
Outline
NFI-FRIDA recap
Experiments with NFI-FRIDA
• 40 test speakers
• 3 recordings per speaker from each of the following devices
• d1: Headset microphone
• d2: Close microphone A
• d3: Close microphone B
• d4: Far microphone
• d5: Telephone intercept
• 1 additional recording per speaker from device d1
• Cross-device (mismatched) performance, relative to d1, was evaluated
for all devices d1—d5
• #H0 (same speaker) comparisons = 120
• #H1 (different speaker) comparisons = 4680
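Given the H0 and H1 score lists just described, the EER reported in the next slides can be estimated with a simple threshold sweep. This is an illustrative sketch, not the evaluation code behind the results:

```python
import numpy as np

def eer(same_scores, diff_scores):
    """Equal error rate: the operating point where the false-reject rate on
    same-speaker (H0) pairs matches the false-accept rate on different-speaker
    (H1) pairs, found by sweeping candidate thresholds."""
    same = np.sort(np.asarray(same_scores, float))
    diff = np.sort(np.asarray(diff_scores, float))
    thresholds = np.concatenate([same, diff])
    gaps = [abs(np.mean(same < t) - np.mean(diff >= t)) for t in thresholds]
    t = thresholds[int(np.argmin(gaps))]
    return 0.5 * (np.mean(same < t) + np.mean(diff >= t))

same = [1.0, 2.0, 3.0, 0.0]    # H0: same-speaker scores (toy values)
diff = [-1.0, -2.0, 0.5, 2.5]  # H1: different-speaker scores (toy values)
rate = eer(same, diff)         # → 0.25 (25% EER)
```

In the experiments above the same computation would run over 120 H0 and 4680 H1 scores per device pair.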
Condition adaptation & reference normalisation experiments
• 15 training speakers (no overlap with the 40 test speakers)
• 2 recordings per speaker from each of the following devices
• d1: Headset microphone
• d2: Close microphone A
• d3: Close microphone B
• d4: Far microphone
• d5: Telephone intercept
• For condition adaptation, 2 recordings from each of the devices
under comparison were used => 2 recordings x 2 devices x 15
speakers = 60 recordings*
• For reference normalisation, 2 recordings from the other (not d1)
device were used => 2 recordings x 1 device x 15 speakers = 30
recordings
* With the exception of d1-d1, where only 30 recordings were used
Cross-condition performance (EER%): Telephone-only session data
[Figure: EER% for cross-device comparisons d1-d5 (d1: headset mic, d2: close mic A, d3: close mic B, d4: far mic, d5: tel intercept); 15 speakers for adaptation/normalisation]
Cross-condition performance (EER%): Telephone+Microphone session data
[Figure: EER% for cross-device comparisons d1-d5; 15 speakers for adaptation/normalisation]
Revisiting matched comparison EER%
[Figure: matched-comparison EER% for devices d1-d5; 15 speakers for adaptation/normalisation]
Cross-condition performance (Cllr-min): Telephone+Microphone session data
[Figure: Cllr-min for cross-device comparisons d1-d5]
• Cllr-min, like the EER, measures the ability of the system to discriminate
between speakers
• Unlike the EER, it considers the discriminatory power of the system across
all possible score thresholds
• Cllr-min, or minimum log-likelihood ratio cost, is the optimal Cllr value
achievable by a system
• Like the EER, lower is better:
• Cllr-min = 0 for a perfect system
• Cllr-min = 1 for a useless system
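Plain Cllr can be computed directly from target (H0) and non-target (H1) log-likelihood ratios with the standard formula; Cllr-min would additionally apply optimal (PAV) calibration to the scores first, which this sketch omits:

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Log-likelihood-ratio cost: average information loss, in bits, of the
    system's LLRs on target (H0) and non-target (H1) trials."""
    tar = np.asarray(tar_llrs, float)
    non = np.asarray(non_llrs, float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-tar)))  # penalty on H0 trials
    c_non = np.mean(np.log2(1.0 + np.exp(non)))   # penalty on H1 trials
    return 0.5 * (c_tar + c_non)

# Well-separated LLRs give a cost near 0; uninformative LLRs give 1
good = cllr([5.0, 6.0, 4.0], [-5.0, -6.0, -4.0])
useless = cllr([0.0, 0.0], [0.0, 0.0])  # → exactly 1.0
```

Because Cllr-min evaluates the best calibration achievable, it isolates discrimination from calibration, which is why the slides pair it with the EER.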
Cross-condition performance (Cllr-min): Telephone+Microphone session data
[Figure: Cllr-min for cross-device comparisons d1-d5; 15 speakers for adaptation/normalisation]
Revisiting matched comparison Cllr-min
[Figure: matched-comparison Cllr-min for devices d1-d5; 15 speakers for adaptation/normalisation]
Varying the number of adaptation & normalisation speakers
• The original set of 15 training speakers was increased to 38 speakers (again, no overlap with the 40 test speakers)
• 2 recordings from each device d1—d5 were used
• Condition adaptation and reference normalisation proceeded as
before, increasing the number of training speakers in increments of 5,
from 5 to 38.
• Results presented for d1—d4 and d1—d5 only
d1-d4: Close mic - Far mic (Telephone+Microphone session data)
[Figure: Cllr-min vs number of adaptation/reference speakers]
d1-d5: Close mic - Telephone intercept (Telephone+Microphone session data)
[Figure: Cllr-min vs number of adaptation/reference speakers]
d1-d5: Close mic - Telephone intercept (condition adaptation variance with 5 speakers)
[Figure: Cllr-min vs number of adaptation/reference speakers]
d1-d5: Close mic - Telephone intercept (reference normalisation variance with 5 speakers)
[Figure: Cllr-min vs number of adaptation/reference speakers]
d1-d5: Close mic - Telephone intercept (condition adaptation variance with 20 speakers)
[Figure: Cllr-min vs number of adaptation/reference speakers]
d1-d5: Close mic - Telephone intercept (reference normalisation variance with 20 speakers)
[Figure: Cllr-min vs number of adaptation/reference speakers]
Conclusions
The baseline performance of a well-trained automatic system on unseen and mismatched conditions is good: <4% EER
Condition adaptation can provide a consistent and stable performance improvement with a very small number of speakers (≈30) => applicable to forensic casework
Condition adaptation has the scope to exploit additional speakers and recordings if they are available
Here we have used condition adaptation and reference normalisation in isolation; they can also be used in combination
Thanks!