

Page 1: LTFD - IAFPA Poster (1) · Title: LTFD - IAFPA Poster (1) Author: Vince Hughes Created Date: 7/24/2018 10:08:56 AM

The individual and the system: assessing the stability of the output of a semi-automatic forensic voice comparison system

Vincent Hughes, Philip Harrison, Paul Foulkes, Peter French

[1] Gold, E., French, P. & Harrison, P. (2014) Examining long-term formant distributions as a discriminant in forensic speaker comparisons under a likelihood ratio framework. Proc. Meetings on Acoustics 19.
[2] Becker, T., Jessen, M. & Grigoras, C. (2008) Forensic speaker verification using formant features and Gaussian mixture models. Proc. Interspeech. pp. 1505-1508.
[3] Jessen, M., Alexander, A. & Forth, O. (2014) Forensic voice comparisons in German with phonetic and automatic features using Vocalise software. Proc. AES. pp. 28-35.
[4] Nolan, F., McDougall, K., de Jong, G. & Hudson, T. (2009) The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. IJSLL 16: 31-57.
[5] Alzqhoul, E., Nair, B. & Guillemin, B. (2015) An alternative approach for investigating the impact of mobile phone technology on speech. Proc. World Congress on Engineering and Computer Science vol. 1.
[6] Reynolds, D., Quatieri, T. & Dunn, R. (2000) Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10: 19-41.
[7] Brümmer, N. et al. (2007) Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST SRE 2006. IEEE Trans. Audio Speech and Lang. 15: 2072-2084.
[8] Alexander, A., Forth, O., Nash, J. & Yager, N. (2014) Speaker recognition with tall and fat animals. Paper at IAFPA 2014. University of Zurich, Switzerland.

Introduction
• Semi-automatic (SASR) forensic voice comparison (FVC):
  o manual feature extraction: usually formants (long-term formant distributions, LTFDs)
  o automatic modelling, scoring, evaluation
• Benefit = features are easier to explain to courts than those used in ASR
  o clearer relationship between articulation and acoustics
• Previous work has shown good performance (e.g. [1,2,3]), but…
  o it generally focuses on matched conditions
  o analysis is based on overall system performance

Research questions
1. How is the performance of a formant-based SASR system affected by mismatched conditions?
2. To what extent is performance affected by degradation in transmission quality?
3. How are individual comparisons affected? Can we predict which speakers will be more/less sensitive to mismatch and degraded quality?

Methods

Corpus
• 97 DyViS [4] speakers: suspect = Task 1, offender = Task 2
• Four versions of the offender sample:
  o high quality (HQ): original near-end sample
  o landline telephone (TEL): original far-end sample
  o high bit-rate mobile telephone (MOBHQ)
  o low bit-rate mobile telephone (MOBLQ)

Formant extraction
• 60-second samples of vowel-only material created
• 9-dimensional feature vector extracted from 20 ms frames with 10 ms shift: F1, F2 and F3 frequencies, deltas (∆s) and bandwidths

System testing
• Cross-validated same-speaker (SS) and different-speaker (DS) scores computed using GMM-UBM [6] (using 8 Gaussians)
• Score-level logistic regression calibration [7] using cross-validation
• System validity: log LR cost (Cllr) and equal error rate (EER)
• Individuals analysed using means and standard deviations (SDs) of SS and DS LLRs: visualised using zooplots (see [8])
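The two validity metrics named under System testing can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the calibrated scores are log10 LRs, the function names `cllr` and `eer` are hypothetical, and the EER is approximated by a simple threshold sweep over the observed scores.

```python
import math


def cllr(ss_llrs, ds_llrs):
    """Log LR cost for lists of log10 LRs (assumed base; not stated on the poster)."""
    # Same-speaker comparisons are penalised for low LRs ...
    ss_penalty = sum(math.log2(1 + 10 ** (-x)) for x in ss_llrs) / len(ss_llrs)
    # ... and different-speaker comparisons for high LRs.
    ds_penalty = sum(math.log2(1 + 10 ** x) for x in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss_penalty + ds_penalty)


def eer(ss_llrs, ds_llrs):
    """Approximate equal error rate (as a proportion) by sweeping observed thresholds."""
    best_gap, best_rate = None, None
    for threshold in sorted(set(ss_llrs) | set(ds_llrs)):
        # False rejections: same-speaker scores below the threshold.
        frr = sum(x < threshold for x in ss_llrs) / len(ss_llrs)
        # False acceptances: different-speaker scores at or above the threshold.
        far = sum(x >= threshold for x in ds_llrs) / len(ds_llrs)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (frr + far) / 2
    return best_rate


# Toy scores for illustration only (not data from the study).
toy_ss = [2.0, 3.0]
toy_ds = [-2.0, -3.0]
print(cllr(toy_ss, toy_ds), eer(toy_ss, toy_ds))
```

A system that always outputs an LLR of 0 gives a Cllr of exactly 1 under this definition, which is why values below 1 are read as the system providing useful information.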

Results

[5]

System performance

      Suspect  Offender  EER (%)  Cllr
(1)   HQ       HQ        10.33    0.37
(2)   HQ       TEL       25.95    0.73
(3)   HQ       MOBHQ     31.71    0.81
(4)   HQ       MOBLQ     31.99    0.83

Individuals

[Figure: zooplots of individual speakers]

Discussion

• Matched HQ condition provides the best overall performance
  o compare with 6.45% (EER) and 0.255 (Cllr) with the same recordings in [1] using F1, F2, F3 and F4
  o ∴ F4 provides useful information (where available)
• Decrease in performance as quality degraded
  o EER and Cllr increase in the order HQ < TEL < MOBHQ < MOBLQ
  o effect of bit-rate = relatively small
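The size of each step in the degradation can be read directly off the system performance table. The short sketch below only does that arithmetic; the dictionary layout and the helper name `change` are illustrative choices, and the numbers are the EER and Cllr values reported above.

```python
# (EER in %, Cllr) per offender condition, copied from the system performance table.
results = {
    "HQ": (10.33, 0.37),
    "TEL": (25.95, 0.73),
    "MOBHQ": (31.71, 0.81),
    "MOBLQ": (31.99, 0.83),
}


def change(from_cond, to_cond):
    """Return (EER change in percentage points, Cllr change) between two conditions."""
    eer_a, cllr_a = results[from_cond]
    eer_b, cllr_b = results[to_cond]
    return round(eer_b - eer_a, 2), round(cllr_b - cllr_a, 2)


print(change("HQ", "TEL"))       # matched vs landline mismatch
print(change("TEL", "MOBHQ"))    # landline vs high bit-rate mobile
print(change("MOBHQ", "MOBLQ"))  # high vs low bit-rate mobile
```

On these figures, moving from the matched HQ condition to TEL adds 15.62 percentage points of EER, whereas lowering the mobile bit-rate adds only 0.28, which is the sense in which the bit-rate effect is relatively small.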

Predicting individual performance
• Some comparisons affected more/less by mismatch and degradation in quality (SDs as large as two orders of magnitude)
• Linear mixed effects models fitted to predict speakers’ positions in the zoo space
  o high SS mean == high DS mean
  o high means == high SDs
  o high mean F3 == high SS SD
  *voice quality and other formants were not significant
• F3 effect possibly due to default settings used
  o four formants tracked (LPC order = 12)
  o may have caused F3 measurement errors for speakers with high F3 (where F4 is outside the upper threshold for telephone transmission)
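A speaker's position in the zoo space is defined by the means and SDs of their SS and DS LLRs. The sketch below shows one way such per-speaker summaries could be computed. It is an assumption-laden illustration rather than the procedure used for the poster: the `(suspect, offender, llr)` tuple layout, the function names, the choice to credit each DS LLR to both speakers involved, and the use of the sample SD are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise(values):
    """Return (mean, sample SD); SD is 0.0 for a single value, None if no values."""
    if not values:
        return None, None
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


def zoo_coordinates(comparisons):
    """Map each speaker to the mean and SD of their SS and DS LLRs.

    comparisons: iterable of (suspect_id, offender_id, llr) tuples.
    """
    ss, ds = defaultdict(list), defaultdict(list)
    for suspect, offender, llr in comparisons:
        if suspect == offender:
            ss[suspect].append(llr)
        else:
            # Assumption: a DS comparison counts towards both speakers involved.
            ds[suspect].append(llr)
            ds[offender].append(llr)
    coords = {}
    for speaker in sorted(set(ss) | set(ds)):
        ss_mean, ss_sd = summarise(ss[speaker])
        ds_mean, ds_sd = summarise(ds[speaker])
        coords[speaker] = {
            "ss_mean": ss_mean, "ss_sd": ss_sd,
            "ds_mean": ds_mean, "ds_sd": ds_sd,
        }
    return coords


# Toy LLRs for illustration only (not data from the study).
toy = [("A", "A", 2.0), ("A", "A", 4.0), ("A", "B", -1.0),
       ("B", "A", -3.0), ("B", "B", 1.0)]
for speaker, stats in zoo_coordinates(toy).items():
    print(speaker, stats)
```

With several offender conditions per speaker, each speaker contributes more than one SS LLR, so the SS SD captures how much that speaker's evidence varies across conditions.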

Conclusions
• Transmission mismatch between suspect and offender can have a substantial effect on SASR performance
• Considerable effect on LRs for individuals in terms of strength of evidence and variability
• Difficult to predict which comparisons will be most affected
  o but it may be possible to reduce effects by using channel- and speaker-specific (and possibly vowel-specific) formant settings