The NIST Speaker Recognition Evaluations
Alvin F. Martin, [email protected]
Odyssey 2012 @ Singapore, 27 June 2012
Outline
• Some Early History
• Evaluation Organization
• Performance Factors
• Metrics
• Progress
• Future
Some Early History

• Success of speech recognition evaluation
  – Showed benefits of independent evaluation on common data sets
• Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard
  – Multi-purpose corpus collected (~1991) with speaker recognition in mind
  – Followed by Switchboard-2 and similar collections
• Linguistic Data Consortium created in 1992 to support further speech (and text) collections in the US
• The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
• Earlier NIST speaker evaluations in ’92 and ’95
  – ’92 evaluation had several sites as part of a DARPA program
  – ’95 evaluation with 6 sites used some Switchboard-1 data
  – Emphasis was on speaker identification rather than open-set recognition
Martigny 1994
Varying corpora and performance measures made meaningful comparisons difficult
Avignon 1998
19th February 1998: WORKSHOP RLA2C - Speaker Recognition
Speaker Recognition and its Commercial and Forensic Applications
(La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques)
AVIGNON, 20-23 April 1998
Sponsored by GFCP - SFA - ESCA - IEEE
TIMIT was the preferred corpus
Sometimes bitter debate over forensic capabilities
Avignon Papers
Crete 2001
2001: A Speaker Odyssey - The Speaker Recognition Workshop
June 18-22, 2001, Crete, Greece
First official “Odyssey”
More emphasis on evaluation
Toledo 2004
ODYSSEY 2004 – The Speaker and Language Recognition Workshop
May 31 – June 3, 2004, Toledo, Spain
First Odyssey with the NIST SRE Workshop held in conjunction at the same location.
First to include language recognition.
Two notable keynotes on forensic recognition.
Well attended. Odyssey has been held biennially since 2004.
Etc. – Odyssey 2006, 2008, 2010, 2012, …
Odyssey 2008: The Speaker and Language Recognition Workshop
Stellenbosch, South Africa, January 21-24, 2008
Odyssey 2010: The Speaker and Language Recognition Workshop
Brno, Czech Republic, 28 June – 1 July 2010
Organizing Evaluations
• Which task(s)?
• Key principles
• Milestones
• Participants
Which Speaker Recognition Problem?
• Access Control?
  – Text independent or dependent?
  – Prior probability of target high
• Forensic?
  – Prior not clear
• Person Spotting?
  – Prior probability of target low
  – Text independent
• NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of the performance curve
Some Basic Evaluation Principles
• Speaker spotting primary task
• Research system oriented
• Pooling across target speakers
• Emphasis on low false alarm rate operating point, with scores and decisions (calibration matters)
Organization Basics
• Open to all willing participants
• Research-oriented
  – Commercialized competition discouraged
• Written evaluation plans
  – Specified rules of participation
• Workshops limited to participants
  – Each site/team must be represented
• Evaluation data sets subsequently published by the LDC
1996 Evaluation Plan (cont’d)
1. PROC plots are ROCs plotted on normal probability error (miss versus false alarm) plots
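These normal probability plots became the DET curve: both error rates are warped by the inverse standard normal CDF (probit), so a system with roughly Gaussian score distributions traces a near-straight line. A minimal sketch of such a plot, assuming numpy/scipy/matplotlib and synthetic scores; all names here are illustrative, not from any NIST tooling:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_points(target_scores, nontarget_scores):
    """Sweep a decision threshold over all observed scores and return
    the (miss, false alarm) rate at each threshold."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

# Hypothetical, synthetic scores purely for illustration
rng = np.random.default_rng(0)
tgt = rng.normal(1.0, 1.0, 1_000)     # target-trial scores
non = rng.normal(-1.0, 1.0, 10_000)   # non-target-trial scores

p_miss, p_fa = det_points(tgt, non)
eps = 1e-6  # avoid the infinite normal deviates at 0% and 100%
plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)),
         norm.ppf(np.clip(p_miss, eps, 1 - eps)))
ticks = [0.001, 0.01, 0.05, 0.2, 0.5]
plt.xticks(norm.ppf(ticks), [f"{100 * t:g}%" for t in ticks])
plt.yticks(norm.ppf(ticks), [f"{100 * t:g}%" for t in ticks])
plt.xlabel("False alarm probability")
plt.ylabel("Miss probability")
plt.title("DET curve (normal deviate scale)")
plt.show()
```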
DET Curve Paper – Eurospeech ‘97
Wikipedia DET Page
Some Milestones
• 1992 – DARPA program limited speaker identification evaluation
• 1995 – Small identification evaluation
• 1996 – First SRE in current series
• 2000 – AHUMADA Spanish data, first non-English speech
• 2001 – Cellular data
• 2001 – ASR transcripts provided
• 2002 – FBI “forensic” database
• 2002 – SuperSid Workshop following SRE
• 2005 – Multiple languages with bilingual speakers
Some Milestones (cont’d)
• 2005 – Room mic recordings, cross-channel trials
• 2008 – Interview data
• 2010 – New decision cost function metric stressing even lower FA rate region
• 2010 – High and low vocal effort, aging
• 2010 – HASR (Human-Assisted Speaker Recognition) Evaluation
• 2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb
• 2012 – Target Speakers Defined Beforehand
Participation
• Grew from fewer than a dozen to 58 sites in 2010
• MIT (Doug) provided workshop notebook covers listing participants
• Big increase in participants after 2001
• Handling scores of participating sites becomes a management problem
NIST 2004 Speaker Recognition Workshop
Taller de Reconocimiento de Locutor (Spanish: Speaker Recognition Workshop)
Participating Sites
[Chart: Number of Sites participating per evaluation year – 92*, 95*, 96, 97, 98, 99, 00, 01, 02, 03, 04, 05, 06, 08, 10, 11*, 12#]
* Not in SRE series # Incomplete
This slide is from 2001: A Speaker Odyssey in Crete
NIST Evaluation Data Set (cont’d)

Year | Common Condition(s) | Evaluation Features
2002 | One-session training on conv. phone data | Cellular data; alternative tests of extended training, speaker segmentation, and a limited corpus of simulated forensic data
2003 | One-session training on conv. phone data | Cellular data, extended training
2004 | Handheld landline conv. phone speech, English only | Multi-language data with bilingual speakers
2005 | English only with handheld tel. set | Included cross-channel trials with mic. test; both sides of 2-channel convs. provided
2006 | English only trials (including mic. test trials) | Included cross-channel trials with mic. test
NIST Evaluation Data Set (cont’d)

Year | Common Condition(s) | Evaluation Features
2008 | 8 – contrasting English and bilingual speakers, interview and conv. phone speech, along with cross-condition trials | Interview speech recorded over multiple mic channels and conv. phone speech recorded over mic and tel channels; multiple languages
2010 | 9 – contrasting tel and mic channels, interview and conversational phone speech, and high, low, and normal vocal effort | Multiple microphones; phone calls with high, low, and normal vocal effort; aging data (Greybeard); HASR
2012 | 5 – interview test without noise, conv. phone test without noise, interview test with added noise, conv. phone test with added noise, conv. phone test collected in noisy environment | Target speakers specified in advance (from previous evals) with large amounts of training; some test calls collected in noisy environments; phone test data with added noise
Performance Factors
• Intrinsic
• Extrinsic
• Parametric
Intrinsic Factors

Relate to the speaker
– Demographic factors
  • Sex
  • Age
  • Education
– Mean pitch
– Speaking style
  • Conversational telephone
  • Interview
  • Read text
– Vocal effort
  • Some questions about definition and how to collect
– Aging
  • Hard to collect sizable amounts of data with years of time separation
Extrinsic Factors
Relate to the collection environment
– Microphone or telephone channel
– Telephone channel type
  • Landline, cellular, VOIP
  • In earlier times, carbon vs. electret
– Telephone handset type
  • Handheld, headset, earbud, speakerphone
– Microphone type – matched, mismatched
– Placement of microphone relative to speaker
– Background noise
– Room reverberation
“Parametric” Factors
• Train/test speech duration
  – Have tested 10 s up to ~half an hour, more in ’12
• Number of training sessions
  – Have tested 1 to 8, more in ’12
• Language – English has been predominant, but a variety of others included in some evaluations
  – Is better performance for English due to familiarity and quantity of development data?
  – Cross-language trials a separate challenge
Metrics
• Equal Error Rate
  – Easy to understand
  – Not operating point of interest
  – Calibration matters
• Decision Cost Function
• CLLR
• FA rate at fixed miss rate
  – E.g. 10% (lower for some conditions)
Decision Cost Function CDet
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)

• Weighted sum of miss and false alarm error probabilities
• Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget
• Normalize by the best possible cost of a system doing no processing (the minimum of the cost of always deciding “yes” and of always deciding “no”)
Decision Cost Function CDet (cont’d)
• Parameters 1996-2008: CMiss = 10, CFalseAlarm = 1, PTarget = 0.01
• Parameters 2010: CMiss = 1, CFalseAlarm = 1, PTarget = 0.001
• Change in 2010 (for core and extended tests) met with some skepticism, but the outcome appeared satisfactory
CLLR
Cllr = 1/(2 log 2) × [ (Σ log(1 + 1/s)) / NTT + (Σ log(1 + s)) / NNT ]

where the first summation is over target trials and the second over non-target trials, NTT and NNT are the numbers of target and non-target trials, respectively, and s represents a trial’s likelihood ratio

• Information-theoretic measure made popular in this community by Niko
• Covers a broad range of performance operating points
• George has suggested limiting the range to low FA rates
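A minimal sketch of this computation, assuming each trial's system output is a natural-log likelihood ratio (so s = exp(llr)); the 1/(2 log 2) factor converts natural logs to bits:

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits; inputs are natural-log LRs."""
    s_tgt = np.exp(np.asarray(target_llrs))     # likelihood ratios, target trials
    s_non = np.exp(np.asarray(nontarget_llrs))  # likelihood ratios, non-target trials
    c_tgt = np.mean(np.log1p(1.0 / s_tgt))     # penalizes small LRs on target trials
    c_non = np.mean(np.log1p(s_non))           # penalizes large LRs on non-target trials
    return (c_tgt + c_non) / (2.0 * np.log(2.0))

# Sanity check: a calibrated but uninformative system (llr = 0 on every
# trial) scores exactly 1 bit.
print(cllr(np.zeros(5), np.zeros(5)))  # -> 1.0
```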
Fixed Miss Rate
• Suggested in ’96; was the primary metric in BEST 2012: FA rate corresponding to a 10% miss rate
• Easy to understand
• Practical for applications of interest
• May be viewed as the cost of listening to false alarms
• For easier conditions, a 1% miss rate is now more appropriate
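A minimal sketch of the metric over raw score arrays; production scoring tools interpolate the threshold more carefully, so treat this as illustrative:

```python
import numpy as np

def fa_at_fixed_miss(target_scores, nontarget_scores, miss_rate=0.10):
    """False alarm rate at the threshold yielding the requested miss rate."""
    # Threshold below which the desired fraction of target scores fall
    threshold = np.quantile(np.asarray(target_scores), miss_rate)
    return float(np.mean(np.asarray(nontarget_scores) >= threshold))
```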
Recording Progress

• Difficult to assure test set comparability
  – Participants encouraged to run prior systems on new data
• Technology changes
  – In ’96 landline phones predominated, with carbon button or electret microphones
  – Need to explore VOIP
• With progress, want to make the test harder
  – Always want to add new evaluation conditions, new bells and whistles
    • More channel types, more speaking styles, languages, etc.
  – Externally added noise and reverb explored in 2011 with BEST
• Doug’s history slide – updated
History Slide
Future
• SRE12
• Beyond
SRE12 Plans

• Target speakers specified in advance
  – Speakers in recent past evaluations (in the thousands)
  – All prior speech data available for training
  – Some new targets with training provided at evaluation time
  – Test segments will include non-target speakers
• New interview speech provided in 16-bit linear PCM
• Some test calls collected in noisy environments
• Artificial noise added to some test segment data
• Will this be an effectively easier id task?
  – Will the provided set of known targets change system approaches?
• Optional conditions include
  – Assume test speaker is one of the known targets
  – Use no information about targets other than that of the trial
SRE12 Metric

• Log-likelihood ratios will now be required
  – Therefore, no hard decisions are asked for
• Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on next slide)
  – Adds to the stability of the cost measure
  – Emphasizes the need for good score calibration over a wide range of log-likelihoods
• Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which PMiss > 10%
SRE12 Primary Cost Function

• Niko noted that estimated LLRs making good decisions at a single operating point may not be effective at other operating points; therefore an average over two operating points is used
• Writing the DCF as
    PMiss + β × PFA
  where
    β = (CFA / CMiss) × (1 − PTarget) / PTarget
• We take as the cost function
    (DCF1 + DCF2) / 2
  where PTarget-1 = 0.01 and PTarget-2 = 0.001, with always CMiss = CFA = 1
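A minimal sketch of this average, assuming calibrated natural-log LLRs and hard decisions taken at the Bayes threshold log β for each operating point; the official SRE12 evaluation plan is the authoritative definition, and this only illustrates the formula above:

```python
import numpy as np

def dcf(llr_tgt, llr_non, p_target):
    """PMiss + beta * PFA with CMiss = CFA = 1, deciding at the Bayes
    threshold log(beta), which is optimal for well-calibrated LLRs."""
    beta = (1.0 - p_target) / p_target
    threshold = np.log(beta)
    p_miss = np.mean(np.asarray(llr_tgt) < threshold)
    p_fa = np.mean(np.asarray(llr_non) >= threshold)
    return p_miss + beta * p_fa

def sre12_primary(llr_tgt, llr_non):
    """Average of the two DCFs at PTarget = 0.01 and PTarget = 0.001."""
    return 0.5 * (dcf(llr_tgt, llr_non, 0.01) + dcf(llr_tgt, llr_non, 0.001))
```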
Future Possibilities
• SRE12 outcome will determine whether pre-specified targets will be further explored
  – Does this make the problem too easy?
• Artificially added noise and reverb may continue
• HASR12 will indicate whether human-in-the-loop evaluation gains traction
• SRE’s have become bigger undertakings
  – Fifty or more participating sites
  – Data volume approaching terabytes (as in BEST)
  – Tens or hundreds of millions of trials
  – Schedule could move to every three years