speaker recognition system · 2020-02-24 · speaker recognition system adaptation to unseen and...

Speaker recognition system adaptation to unseen and mismatched recording devices in the NFI-FRIDA database

Finnian Kelly 1, Anil Alexander 1, Oscar Forth 1, and David van der Vloed 2

1 Oxford Wave Research Ltd, UK
2 Speech and Audio Research, Netherlands Forensic Institute, The Hague, Netherlands



Motivation

Current automatic speaker recognition systems are trained on large quantities of diverse speaker recordings:

• performance is good for forensic casework material involving typical microphone or telephone recordings

• for unseen recording types, such as those involving a new covert surveillance recorder or a new transmission condition, performance may be impacted negatively

How can we adapt a well-trained automatic system to the unseen and mismatched conditions of a new case?

Levels of mismatch

In order of increasing difficulty:

1. Matched seen conditions

2. Mismatched seen conditions

3. Matched unseen conditions

4. Mismatched unseen conditions

The challenge of mismatch

[Figure: H0 and H1 score distributions under matched, seen conditions and under mismatched, unseen conditions]

H0: same speaker scores

H1: different speaker scores

Existing solutions in VOCALISE

• Train the system from scratch with relevant data

  • Data hungry: 1000s of speakers required

• Re-train the system LDA/PLDA stages with relevant data

  • Data hungry: 100s of speakers required

• Apply score normalisation

  • Will help, but is limited

Here we introduce a new method of adapting a well-trained system to unseen conditions on the fly using small* quantities of data => forensically realistic

*10s of speakers

VOCALISE i-vector framework

[Diagram: speech -> feature extraction -> UBM (high-dimensional, universal speaker space) -> i-vector extraction -> i-vector (low-dimensional, speaker-specific space)]

Comparing i-vectors

[Diagram: i-vector A and i-vector B -> LDA / PLDA -> comparison score for i-vectors A and B]

Post-processing i-vectors

We could compare ‘raw’ i-vectors directly, but it is beneficial to first post-process i-vectors to increase their discriminatory power

LDA (linear discriminant analysis) is an important post-processing step that:

1. Increases inter-speaker separability

2. Reduces dimensionality

Linear Discriminant Analysis (LDA)

• LDA projects i-vectors into a new space in which:

  • within-speaker variability is minimised

  • between-speaker separation is maximised

• Requires a set of training i-vectors and their speaker labels
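The LDA step described above can be sketched in a few lines of NumPy. This is a minimal textbook illustration on synthetic data, not the VOCALISE implementation; the function names `fit_lda` and `scatter_ratio`, the toy dimensions, and the small ridge term are assumptions made for the example.

```python
import numpy as np


def fit_lda(ivectors, labels, n_components):
    """Return a (dim, n_components) projection from labelled training vectors."""
    ivectors = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    dim = ivectors.shape[1]
    global_mean = ivectors.mean(axis=0)
    s_within = np.zeros((dim, dim))
    s_between = np.zeros((dim, dim))
    for speaker in np.unique(labels):
        x = ivectors[labels == speaker]
        centred = x - x.mean(axis=0)
        s_within += centred.T @ centred
        offset = (x.mean(axis=0) - global_mean)[:, None]
        s_between += len(x) * (offset @ offset.T)
    # Small ridge keeps the within-speaker scatter invertible (an assumption).
    s_within += 1e-6 * np.eye(dim)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_within, s_between))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real


def scatter_ratio(x, labels):
    """Between-speaker over within-speaker scatter (trace ratio); higher is better."""
    global_mean = x.mean(axis=0)
    between = within = 0.0
    for speaker in np.unique(labels):
        xs = x[labels == speaker]
        between += len(xs) * np.sum((xs.mean(axis=0) - global_mean) ** 2)
        within += np.sum((xs - xs.mean(axis=0)) ** 2)
    return between / within


# Synthetic stand-ins for i-vectors: 10 speakers, 5 recordings each, 8 dimensions.
rng = np.random.default_rng(0)
n_speakers, n_per_speaker, dim = 10, 5, 8
labels = np.repeat(np.arange(n_speakers), n_per_speaker)
speaker_means = rng.normal(size=(n_speakers, dim))
noise_scale = np.array([0.1] + [2.0] * (dim - 1))  # dimension 0 is the reliable one
ivectors = speaker_means[labels] + rng.normal(size=(len(labels), dim)) * noise_scale

projection = fit_lda(ivectors, labels, n_components=2)
projected = ivectors @ projection  # reduced from 8 to 2 dimensions
```

The projection both reduces dimensionality and favours directions in which speakers differ more than recordings of the same speaker do, which is why speaker labels are needed at training time.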

LDA training

• The LDA transformation is generally learned using the same training data as the other models in the i-vector framework: the UBM and the total variability (TV) matrix.

• Can we leverage LDA for adapting a system to new conditions?

Condition adaptation via LDA

[Diagram labels: system development i-vectors (N ≈ 50,000); adaptation i-vectors (N ≈ 100); well-trained LDA transformation; adapted LDA transformation]
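The slides do not state how the development and adaptation i-vectors are combined, so the following is only one plausible sketch, not the authors' method. It assumes the per-recording scatter matrices of the two sets are interpolated with a hypothetical `weight` before solving the usual LDA eigenproblem, so that about 100 adaptation vectors are not swamped by about 50,000 development vectors. All names and the synthetic data are illustrative.

```python
import numpy as np


def scatter_matrices(x, labels):
    """Within- and between-speaker scatter, divided by the number of recordings."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    dim = x.shape[1]
    global_mean = x.mean(axis=0)
    within = np.zeros((dim, dim))
    between = np.zeros((dim, dim))
    for speaker in np.unique(labels):
        xs = x[labels == speaker]
        centred = xs - xs.mean(axis=0)
        within += centred.T @ centred
        offset = (xs.mean(axis=0) - global_mean)[:, None]
        between += len(xs) * (offset @ offset.T)
    return within / len(x), between / len(x)


def adapted_lda(dev_x, dev_labels, adapt_x, adapt_labels, n_components, weight=0.5):
    """LDA from interpolated scatter; weight=0 uses development data only (assumed scheme)."""
    w_dev, b_dev = scatter_matrices(dev_x, dev_labels)
    w_ad, b_ad = scatter_matrices(adapt_x, adapt_labels)
    within = (1 - weight) * w_dev + weight * w_ad
    between = (1 - weight) * b_dev + weight * b_ad
    within += 1e-6 * np.eye(within.shape[0])  # ridge for numerical stability
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(within, between))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real


rng = np.random.default_rng(1)


def simulate(n_speakers, n_recordings, noise_scale):
    """Synthetic labelled vectors; noise_scale sets per-dimension within-speaker noise."""
    labels = np.repeat(np.arange(n_speakers), n_recordings)
    means = rng.normal(size=(n_speakers, len(noise_scale)))
    x = means[labels] + rng.normal(size=(len(labels), len(noise_scale))) * noise_scale
    return x, labels


# Development condition: dimension 0 is reliable. New condition: dimension 1 is.
dev_x, dev_labels = simulate(200, 5, np.array([0.1, 2.0, 2.0, 2.0]))
adapt_x, adapt_labels = simulate(15, 4, np.array([2.0, 0.1, 2.0, 2.0]))
projection = adapted_lda(dev_x, dev_labels, adapt_x, adapt_labels, n_components=2)
```

Normalising each scatter matrix by its own recording count is the design choice that lets a small adaptation set influence the result; how much influence it should have is controlled by `weight`, which would need tuning in practice.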

Probabilistic LDA (PLDA)

• PLDA compares two post-LDA i-vectors and returns a comparison score

• The score is calculated based on the most discriminative parts of an i-vector:

  • Achieved by learning a subspace that describes the dominant directions of change in the i-vectors of different speakers

• PLDA therefore requires a set of post-LDA training i-vectors and their speaker labels

We supplement our LDA condition adaptation by re-training PLDA with all adapted i-vectors
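To make the idea of a PLDA comparison score concrete, here is a minimal sketch using the simplified two-covariance form of PLDA, which may differ from the subspace model used in VOCALISE. It assumes the between-speaker covariance `between`, the within-speaker covariance `within`, and the global `mean` have already been estimated from labelled post-LDA i-vectors; the values below are made up for illustration.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal distribution."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)


def plda_llr(a, b, mean, between, within):
    """Log-likelihood ratio: same speaker (H0) versus different speakers (H1)."""
    total = between + within
    zeros = np.zeros_like(between)
    pair = np.concatenate([a, b])
    pair_mean = np.concatenate([mean, mean])
    # Same speaker: the two vectors share the speaker component, so they co-vary.
    same_cov = np.block([[total, between], [between, total]])
    # Different speakers: the two vectors are independent.
    diff_cov = np.block([[total, zeros], [zeros, total]])
    return gaussian_logpdf(pair, pair_mean, same_cov) - gaussian_logpdf(pair, pair_mean, diff_cov)


# Illustrative model parameters (assumed, not estimated from real data).
mean = np.zeros(2)
between = 4.0 * np.eye(2)
within = 0.5 * np.eye(2)

llr_same = plda_llr(np.array([2.0, 1.0]), np.array([2.1, 0.9]), mean, between, within)
llr_diff = plda_llr(np.array([2.0, 1.0]), np.array([-2.0, -1.0]), mean, between, within)
```

Two nearby vectors give a positive score and two distant ones a negative score, because the model expects recordings of one speaker to differ only by the smaller within-speaker variation.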


Reference normalisation

Reference (or score) normalisation is an established technique for adapting the output of a system to new conditions

[Diagram: i-vector A and i-vector B give a raw comparison score; comparing each against a set of reference i-vectors gives reference scores A and reference scores B, which are used to produce the normalised comparison score]

Can only shift scores up or down; less powerful than LDA/PLDA adaptation…
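The slides do not specify which normalisation formula is used. As a hedged illustration, the sketch below assumes symmetric normalisation (S-norm), a common choice in which the raw score is standardised against the reference scores of each side and the two results are averaged. The function name `s_norm` and the example scores are hypothetical.

```python
import numpy as np


def s_norm(raw_score, ref_scores_a, ref_scores_b):
    """Symmetric score normalisation against two sets of reference scores."""
    ref_a = np.asarray(ref_scores_a, dtype=float)
    ref_b = np.asarray(ref_scores_b, dtype=float)
    z_a = (raw_score - ref_a.mean()) / ref_a.std()
    z_b = (raw_score - ref_b.mean()) / ref_b.std()
    return 0.5 * (z_a + z_b)


# Reference scores A: i-vector A compared against each reference i-vector.
# Reference scores B: i-vector B compared against each reference i-vector.
normalised = s_norm(2.0, ref_scores_a=[0.0, 1.0, 2.0], ref_scores_b=[-1.0, 0.0, 1.0])
```

Because the transformation is monotonic in the raw score for a given pair, it rescales and shifts scores to suit the new condition but cannot change which features the system relies on, which is consistent with the slide's remark that it is less powerful than LDA/PLDA adaptation.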

Mismatched condition experiments

System:

iVOCALISE 2017B

• TEL only and TEL-MIC session

• Condition adaptation

• Reference normalisation

Alexander, A., Forth, O., Atreya, A. A. and Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Speaker Odyssey 2016, Bilbao, Spain.


outline

NFI-FRIDA recap

Experiments with NFI-FRIDA

• 40 test speakers

• 3 recordings per speaker from each of the following devices

  • d1: Headset microphone

  • d2: Close microphone A

  • d3: Close microphone B

  • d4: Far microphone

  • d5: Telephone intercept

• 1 additional recording per speaker from device d1

• Cross-device (mismatched) performance, relative to d1, was evaluated for all devices d1 to d5

• #H0 (same speaker) comparisons = 120

• #H1 (different speaker) comparisons = 4680
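The comparison counts follow from the design above under one reading that is consistent with the numbers: for each device, the additional d1 recording of every speaker is compared with the 3 recordings of every speaker from that device. The variable names below are illustrative.

```python
n_speakers = 40
recordings_per_device = 3

# Same speaker (H0): each speaker's extra d1 recording against their own 3 recordings.
n_same = n_speakers * recordings_per_device

# Different speaker (H1): each extra d1 recording against the 3 recordings
# of each of the other 39 speakers.
n_diff = n_speakers * (n_speakers - 1) * recordings_per_device

print(n_same, n_diff)  # 120 4680
```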

Condition adaptation & reference normalisation experiments

• 15 training speakers (no overlap with the 40 test speakers)

• 2 recordings per speaker from each of the following devices

  • d1: Headset microphone

  • d2: Close microphone A

  • d3: Close microphone B

  • d4: Far microphone

  • d5: Telephone intercept

• For condition adaptation, 2 recordings from each of the devices under comparison were used => 2 recordings x 2 devices x 15 speakers = 60 recordings*

• For reference normalisation, 2 recordings from the other (not d1) device were used => 2 recordings x 1 device x 15 speakers = 30 recordings

* With the exception of d1-d1, where only 30 recordings were used

Cross-condition performance (EER%): Telephone-only session data

[Figure: EER% for d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept; 15 speakers for adaptation/normalisation]

Cross-condition performance (EER%): Telephone+Microphone session data

[Figure: EER% for d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept; 15 speakers for adaptation/normalisation]

Revisiting matched comparison EER%

[Figure: EER% for d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept; 15 speakers for adaptation/normalisation]

Cross-condition performance (Cllr-min): Telephone+Microphone session data

• Cllr-min, like the EER, measures the ability of the system to discriminate between speakers

• Unlike the EER, it considers the discriminatory power of the system across all possible score thresholds

• Cllr-min, or minimum log-likelihood ratio cost, is the optimal Cllr value achievable by a system

• Like the EER, lower is better:

  • Cllr-min = 0 for a perfect system

  • Cllr-min = 1 for a useless system
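The two metrics can be sketched as follows. This is an illustrative implementation, not the evaluation code behind the slides: `eer` uses a simple threshold sweep, and `cllr` computes the plain log-likelihood ratio cost from natural-log likelihood ratios. Cllr-min is the same cost after an optimal monotonic recalibration of the scores (typically via the pool-adjacent-violators algorithm), a step this sketch omits.

```python
import numpy as np


def eer(same_scores, diff_scores):
    """Equal error rate from a sweep over the observed scores as thresholds."""
    same = np.asarray(same_scores, dtype=float)
    diff = np.asarray(diff_scores, dtype=float)
    thresholds = np.sort(np.concatenate([same, diff]))
    miss = np.array([(same < t).mean() for t in thresholds])
    false_alarm = np.array([(diff >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])


def cllr(same_llrs, diff_llrs):
    """Log-likelihood ratio cost; inputs are natural-log likelihood ratios."""
    same = np.asarray(same_llrs, dtype=float)
    diff = np.asarray(diff_llrs, dtype=float)
    cost_same = np.mean(np.logaddexp(0.0, -same)) / np.log(2)  # log2(1 + 1/LR)
    cost_diff = np.mean(np.logaddexp(0.0, diff)) / np.log(2)   # log2(1 + LR)
    return 0.5 * (cost_same + cost_diff)


useless = cllr([0.0, 0.0], [0.0, 0.0])          # LR = 1 everywhere carries no information
near_perfect = cllr([20.0, 20.0], [-20.0, -20.0])
separated_eer = eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```

A system that always outputs a likelihood ratio of 1 scores exactly 1, and a system with large, correctly signed log-likelihood ratios approaches 0, matching the two reference points on the slide.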

Cross-condition performance (Cllr-min): Telephone+Microphone session data

[Figure: Cllr-min for d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept; 15 speakers for adaptation/normalisation]

Revisiting matched comparison Cllr-min

[Figure: Cllr-min for d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept; 15 speakers for adaptation/normalisation]

Varying the number of adaptation & normalisation speakers

• The original set of 15 training speakers was increased to 38 speakers (again, no overlap with the 40 test speakers)

• 2 recordings from each device d1 to d5 were used

• Condition adaptation and reference normalisation proceeded as before, increasing the number of training speakers in increments of 5, from 5 to 38.

• Results presented for d1-d4 and d1-d5 only

d1-d4: Close mic - Far mic, Telephone+Microphone session data

[Figure: Cllr-min against #adapt/reference speakers]

d1-d5: Close mic - Telephone intercept, Telephone+Microphone session data

[Figure: Cllr-min against #adapt/reference speakers]

d1-d5: Close mic - Telephone intercept, Condition adaptation variance with 5 speakers

[Figure: Cllr-min against #adapt/reference speakers]

d1-d5: Close mic - Telephone intercept, Reference normalisation variance with 5 speakers

[Figure: Cllr-min against #adapt/reference speakers]

d1-d5: Close mic - Telephone intercept, Condition adaptation variance with 20 speakers

[Figure: Cllr-min against #adapt/reference speakers]

d1-d5: Close mic - Telephone intercept, Reference normalisation variance with 20 speakers

[Figure: Cllr-min against #adapt/reference speakers]

Conclusions

The baseline performance of a well-trained automatic system in unseen and mismatched conditions is good: < 4% EER

Condition adaptation can provide consistent and stable performance improvement with a very small number of speakers (≈30) => applicable to forensic casework

Condition adaptation has the scope to exploit additional speakers and recordings if they are available

Here we have used condition adaptation and reference normalisation in isolation; they can be used in combination


Thanks!