
Page 1

The NIST Speaker Recognition Evaluations

Alvin F. Martin (alvinfmartin@gmail.com)

Odyssey 2012 @ Singapore, 27 June 2012

Page 2

Outline

• Some Early History
• Evaluation Organization
• Performance Factors
• Metrics
• Progress
• Future

Page 3

Some Early History

• Success of speech recognition evaluation
  – Showed benefits of independent evaluation on common data sets
• Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard
  – Multi-purpose corpus collected (~1991) with speaker recognition in mind
  – Followed by Switchboard-2 and similar collections
• Linguistic Data Consortium created in 1992 to support further speech (and text) collections in the US
• The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
• Earlier NIST speaker evaluations in ‘92, ’95
  – ‘92 evaluation had several sites as part of a DARPA program
  – ‘95 evaluation with 6 sites used some Switchboard-1 data
  – Emphasis was on speaker ID rather than open-set recognition


Page 5

Martigny 1994


Varying corpora and performance measures made meaningful comparisons difficult

Page 6

Avignon 1998


19 February 1998: WORKSHOP RLA2C – Speaker Recognition

RLA2C: Speaker Recognition and its Commercial and Forensic Applications
(la Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques)

AVIGNON, 20-23 April 1998

Sponsored by GFCP – SFA – ESCA – IEEE

TIMIT was the preferred corpus

Sometimes bitter debate over forensic capabilities

Page 7

Avignon Papers


Page 8

Crete 2001


2001: A Speaker Odyssey - The Speaker Recognition Workshop

June 18-22, 2001, Crete, Greece

First official “Odyssey”

More emphasis on evaluation



Page 11

Toledo 2004

ODYSSEY 2004 – The Speaker and Language Recognition Workshop
May 31 – June 3, 2004
Toledo, Spain

First Odyssey with NIST SRE Workshop held in conjunction at same location

First to include language recognition.

Two notable keynotes on forensic recognition.

Well attended. Odyssey held biennially since 2004.


Page 13

Etc. – Odyssey 2006, 2008, 2010, 2012, …

Odyssey 2008: The Speaker and Language Recognition Workshop

Stellenbosch, South Africa, January 21-24, 2008

Odyssey 2010: The Speaker and Language Recognition Workshop
Brno, Czech Republic, 28 June – 1 July 2010

Page 14

Organizing Evaluations

• Which task(s)?
• Key principles
• Milestones
• Participants

Page 15

Which Speaker Recognition Problem?

• Access Control?
  – Text independent or dependent?
  – Prior probability of target high
• Forensic?
  – Prior not clear
• Person Spotting?
  – Prior probability of target low
  – Text independent
• NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of the performance curve

Page 16

Some Basic Evaluation Principles

• Speaker spotting primary task
• Research system oriented
• Pooling across target speakers
• Emphasis on low false alarm rate operating point, with scores and decisions (calibration matters)

Page 17

Organization Basics

• Open to all willing participants
• Research-oriented
  – Commercialized competition discouraged
• Written evaluation plans
  – Specified rules of participation
• Workshops limited to participants
  – Each site/team must be represented
• Evaluation data sets subsequently published by the LDC


Page 19

1996 Evaluation Plan (cont’d)


Page 20

1996 Evaluation Plan (cont’d)


Note 1: PROC plots are ROCs plotted on normal probability error scales (miss versus false alarm)
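As an illustration of what such a plot involves (a sketch under stated assumptions, not code from the evaluations): error rates are obtained by sweeping a decision threshold over the trial scores, and both axes are then warped by the inverse normal CDF (probit), which is what distinguishes a DET plot from an ordinary ROC. The toy data and all names below are illustrative.

```python
# Sketch of a DET plot: miss and false-alarm rates from a threshold sweep,
# drawn on probit-warped axes. Toy data and names are illustrative.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_points(target_scores, nontarget_scores):
    """Return (P_miss, P_fa) over a sweep of all observed score thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    tgt = np.sort(target_scores)
    non = np.sort(nontarget_scores)
    p_miss = np.searchsorted(tgt, thresholds, side="left") / len(tgt)   # targets below threshold
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / len(non)  # non-targets at/above threshold
    return p_miss, p_fa

rng = np.random.default_rng(0)
p_miss, p_fa = det_points(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 10000))

eps = 1e-6  # keep probabilities strictly inside (0, 1) before the probit warp
plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)), norm.ppf(np.clip(p_miss, eps, 1 - eps)))
plt.xlabel("False alarm probability (probit scale)")
plt.ylabel("Miss probability (probit scale)")
plt.show()
```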

Page 21

DET Curve Paper – Eurospeech ‘97


Page 22

Wikipedia DET Page


Page 23

Some Milestones

• 1992 – DARPA program limited speaker identification evaluation
• 1995 – Small identification evaluation
• 1996 – First SRE in current series
• 2000 – AHUMADA Spanish data, first non-English speech
• 2001 – Cellular data
• 2001 – ASR transcripts provided
• 2002 – FBI “forensic” database
• 2002 – SuperSID Workshop following SRE
• 2005 – Multiple languages with bilingual speakers

Page 24

Some Milestones (cont’d)

• 2005 – Room mic recordings, cross-channel trials
• 2008 – Interview data
• 2010 – New decision cost function metric stressing even lower FA rate region
• 2010 – High and low vocal effort, aging
• 2010 – HASR (Human-Assisted Speaker Recognition) Evaluation
• 2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb
• 2012 – Target speakers defined beforehand

Page 25

Participation

• Grew from fewer than a dozen to 58 sites in 2010

• MIT (Doug) provided workshop notebook covers listing participants

• Big increase in participants after 2001
• Handling scores of participating sites becomes a management problem

Page 26

NIST 2004 Speaker Recognition Workshop
Taller de Reconocimiento de Locutor (Speaker Recognition Workshop)


Page 28

Participating Sites

[Bar chart: number of participating sites for each evaluation year – 92*, 95*, 96, 97, 98, 99, 00, 01, 02, 03, 04, 05, 06, 08, 10, 11*, 12#]

* Not in SRE series   # Incomplete

Page 29

This slide is from 2001: A Speaker Odyssey in Crete

Page 30

NIST Evaluation Data Set (cont’d)

Year – Common Condition(s) – Evaluation Features

• 2002 – One-session training on conv. phone data – Cellular data; alternative tests of extended training, speaker segmentation, and a limited corpus of simulated forensic data
• 2003 – One-session training on conv. phone data – Cellular data, extended training
• 2004 – Handheld landline conv. phone speech, English only – Multi-language data with bilingual speakers
• 2005 – English only with handheld tel. set – Included cross-channel trials with mic. test; both sides of 2-channel convs. provided
• 2006 – English only trials (including mic. test trials) – Included cross-channel trials with mic. test

Page 31

NIST Evaluation Data Set (cont’d)

Year – Common Condition(s) – Evaluation Features

• 2008 – 8, contrasting English and bilingual speakers, interview and conv. phone speech along with cross-condition trials – Interview speech recorded over multiple mic channels and conv. phone speech recorded over mic and tel channels; multiple languages
• 2010 – 9, contrasting tel and mic channels, interview and conversational phone speech, and high, low and normal vocal effort – Multiple microphones; phone calls with high, low, and normal vocal effort; aging data (Greybeard); HASR
• 2012 – 5: interview test without noise, conv. phone test without noise, interview test with added noise, conv. phone test with added noise, conv. phone test collected in noisy environment – Target speakers specified in advance (from previous evals) with large amounts of training; some test calls collected in noisy environments; phone test data with added noise

Page 32

Performance Factors

• Intrinsic
• Extrinsic
• Parametric

Page 33

Intrinsic Factors

Relate to the speaker
– Demographic factors
  • Sex
  • Age
  • Education
– Mean pitch
– Speaking style
  • Conversational telephone
  • Interview
  • Read text
– Vocal effort
  • Some questions about definition and how to collect
– Aging
  • Hard to collect sizable amounts of data with years of time separation

Page 34

Extrinsic Factors

Relate to the collection environment
– Microphone or telephone channel
– Telephone channel type
  • Landline, cellular, VOIP
  • In earlier times, carbon vs. electret
– Telephone handset type
  • Handheld, headset, earbud, speakerphone
– Microphone type – matched, mismatched
– Placement of microphone relative to speaker
– Background noise
– Room reverberation

Page 35

“Parametric” Factors

• Train/test speech duration
  – Have tested 10 s up to ~half hour, more in ‘12
• Number of training sessions
  – Have tested 1 to 8, more in ‘12
• Language – English has been predominant, but a variety of others included in some evaluations
  – Is better performance for English due to familiarity and quantity of development data?
  – Cross-language trials a separate challenge

Page 36

Metrics

• Equal Error Rate
  – Easy to understand
  – Not operating point of interest
  – Calibration matters
• Decision Cost Function
• CLLR
• FA rate at fixed miss rate
  – E.g. 10% (lower for some conditions)
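For completeness, here is a sketch of how the equal error rate can be read off a threshold sweep of raw scores (illustrative only, not the evaluations' scoring code; all names are assumptions of this sketch):

```python
# Equal error rate sketch: find the point on a discrete threshold sweep
# where P_miss and P_fa are closest, and report their average.
import numpy as np

def eer(scores, is_target):
    thresholds = np.sort(scores)
    tgt = np.sort(scores[is_target])
    non = np.sort(scores[~is_target])
    p_miss = np.searchsorted(tgt, thresholds, side="left") / len(tgt)
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / len(non)
    i = np.argmin(np.abs(p_miss - p_fa))   # closest crossing on the sweep
    return 0.5 * (p_miss[i] + p_fa[i])
```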

Page 37

Decision Cost Function CDet

CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)

• Weighted sum of miss and false alarm error probabilities
• Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget
• Normalize by the best possible cost of a system doing no processing (minimum of the cost of always deciding “yes” or always deciding “no”)
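The following is a minimal sketch of the normalized cost defined above, not the official NIST scoring software; it assumes per-trial hard decisions and ground-truth labels as boolean arrays, with the 1996-2008 parameters as defaults.

```python
# Normalized detection cost (CDet) sketch, assuming boolean arrays of
# system decisions and target labels. Not the official scoring tool.
import numpy as np

def normalized_cdet(decisions, is_target, c_miss=10.0, c_fa=1.0, p_target=0.01):
    p_miss = np.mean(~decisions[is_target])   # miss rate over target trials
    p_fa = np.mean(decisions[~is_target])     # false-alarm rate over non-target trials
    cdet = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Best cost of a system doing no processing: always "no" vs. always "yes".
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cdet / c_default
```

With the 1996-2008 defaults, always answering “no” costs 10 × 0.01 = 0.1 and always answering “yes” costs 1 × 0.99 = 0.99, so the normalizer is 0.1.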

Page 38

Decision Cost Function CDet (cont’d)

• Parameters 1996-2008: CMiss = 10, CFalseAlarm = 1, PTarget = 0.01
• Parameters 2010: CMiss = 1, CFalseAlarm = 1, PTarget = 0.001
• Change in 2010 (for core and extended tests) met with some skepticism, but outcome appeared satisfactory

Page 39

CLLR

Cllr = 1/(2·log 2) × [ (Σ log(1 + 1/s)) / NTT + (Σ log(1 + s)) / NNT ]

where the first summation is over target trials, the second is over non-target trials, NTT and NNT are the numbers of target and non-target trials, respectively, and s represents a trial’s likelihood ratio

• Information-theoretic measure made popular in this community by Niko
• Covers a broad range of performance operating points
• George has suggested limiting the range to low FA rates
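A minimal numerical sketch of the formula above, assuming scores are supplied as log-likelihood ratios (so s = exp(llr)); illustrative only, not the scoring code used in the evaluations.

```python
# Cllr sketch: assumes per-trial log-likelihood ratios and truth labels.
import numpy as np

def cllr(llr, is_target):
    """llr: array of log-likelihood ratios; is_target: boolean array."""
    tgt = llr[is_target]
    non = llr[~is_target]
    # log(1 + 1/s) = log1p(exp(-llr)); log(1 + s) = log1p(exp(llr))
    target_term = np.mean(np.logaddexp(0.0, -tgt))      # averaged over target trials
    nontarget_term = np.mean(np.logaddexp(0.0, non))    # averaged over non-target trials
    return (target_term + nontarget_term) / (2.0 * np.log(2.0))

# A system that outputs llr = 0 for every trial gets Cllr exactly 1.
print(cllr(np.zeros(100), np.arange(100) < 50))   # -> 1.0
```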

Page 40

Fixed Miss Rate

• Suggested in ‘96, was primary metric in BEST 2012: FA rate corresponding to 10% miss rate
• Easy to understand
• Practical for applications of interest
• May be viewed as the cost of listening to false alarms
• For easier conditions, a 1% miss rate now more appropriate
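A sketch of how this can be computed from raw scores, under the assumption that the threshold is set so that exactly the specified fraction of target trials is missed; names and the toy data are illustrative, and this is not the official metric implementation.

```python
# FA rate at a fixed miss rate: set the threshold so that the given
# fraction of target trials is missed, then measure false alarms.
import numpy as np

def fa_at_fixed_miss(scores, is_target, miss_rate=0.10):
    tgt = scores[is_target]
    non = scores[~is_target]
    threshold = np.quantile(tgt, miss_rate)   # miss_rate of target scores fall below this
    return np.mean(non >= threshold)

rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(2, 1, 1000), rng.normal(0, 1, 100000)])
is_target = np.concatenate([np.ones(1000, bool), np.zeros(100000, bool)])
print(fa_at_fixed_miss(scores, is_target, 0.10))
```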

Page 41

Recording Progress

• Difficult to assure test set comparability
  – Participants encouraged to run prior systems on new data
• Technology changes
  – In ‘96 landline phones predominated, with carbon button or electret microphones
  – Need to explore VOIP
• With progress, want to make the test harder
  – Always want to add new evaluation conditions, new bells and whistles
    • More channel types, more speaking styles, languages, etc.
  – Externally added noise and reverb explored in 2011 with BEST
• Doug’s history slide – updated

Page 42

History Slide


Page 43

Future

• SRE12
• Beyond

Page 44

SRE12 Plans

• Target speakers specified in advance
  – Speakers in recent past evaluations (in the thousands)
  – All prior speech data available for training
  – Some new targets with training provided at evaluation time
  – Test segments will include non-target speakers
• New interview speech provided in 16-bit linear PCM
• Some test calls collected in noisy environments
• Artificial noise added to some test segment data
• Will this be an effectively easier id task?
  – Will the provided set of known targets change system approaches?
  – Optional conditions include
    • Assume test speaker is one of the known targets
    • Use no information about targets other than that of the trial

Page 45

SRE12 Metric

• Log-likelihood ratios will now be required
  – Therefore, no hard decisions are asked for
• Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on next slide)
  – Adds to stability of cost measure
  – Emphasizes need for good score calibration over a wide range of log likelihoods
• Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which PMiss > 10%

Page 46

SRE12 Primary Cost Function

• Niko noted that estimated llr’s making good decisions at a single operating point may not be effective at other operating points; therefore an average of two points is used
• Writing DCF as
    PMiss + β × PFA
  where
    β = (CFA / CMiss) × (1 − PTarget) / PTarget
• We take as cost function
    (DCF1 + DCF2) / 2
  where PTarget-1 = 0.01, PTarget-2 = 0.001, with always CMiss = CFA = 1
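A minimal sketch of this average cost, under the added assumption (not stated on the slide) that the submitted log-likelihood ratios are thresholded at the Bayes threshold log β for each operating point; illustrative only, not the official SRE12 scoring code, and all names are of this sketch.

```python
# Average of two DCFs computed from log-likelihood ratios, with each DCF
# evaluated by thresholding the llr at its Bayes threshold log(beta).
import numpy as np

def dcf_at_prior(llr, is_target, p_target, c_miss=1.0, c_fa=1.0):
    beta = (c_fa / c_miss) * (1.0 - p_target) / p_target
    decisions = llr >= np.log(beta)            # Bayes decision for this operating point
    p_miss = np.mean(~decisions[is_target])
    p_fa = np.mean(decisions[~is_target])
    return p_miss + beta * p_fa                # DCF in the form PMiss + beta * PFA

def sre12_primary_cost(llr, is_target):
    return 0.5 * (dcf_at_prior(llr, is_target, 0.01) +
                  dcf_at_prior(llr, is_target, 0.001))
```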

Page 47

Future Possibilities

• SRE12 outcome will determine whether pre-specified targets will be further explored
  – Does this make the problem too easy?
• Artificially added noise and reverb may continue
• HASR12 will indicate whether human-in-the-loop evaluation gains traction
• SREs have become bigger undertakings
  – Fifty or more participating sites
  – Data volume approaching terabytes (as in BEST)
  – Tens or hundreds of millions of trials
  – Schedule could move to every three years