
Page 1

The NIST Speaker Recognition Evaluations

Alvin F. Martin (alvinfmartin@gmail.com)

Odyssey 2012 @ Singapore, 27 June 2012

Page 2

Outline

• Some Early History
• Evaluation Organization
• Performance Factors
• Metrics
• Progress
• Future

Page 3

Some Early History

• Success of speech recognition evaluation
  – Showed benefits of independent evaluation on common data sets
• Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard
  – Multi-purpose corpus collected (~1991) with speaker recognition in mind
  – Followed by Switchboard-2 and similar collections
• Linguistic Data Consortium created in 1992 to support further speech (and text) collections in the US
• The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
• Earlier NIST speaker evaluations in ‘92, ’95
  – ‘92 evaluation had several sites as part of a DARPA program
  – ‘95 evaluation with 6 sites used some Switchboard-1 data
  – Emphasis was on speaker ID rather than open-set recognition


Page 5

Martigny 1994


Varying corpora and performance measures made meaningful comparisons difficult

Page 6

Avignon 1998


19 February 1998: WORKSHOP RLA2C – Speaker Recognition

RLA2C: Speaker Recognition and its Commercial and Forensic Applications
(la Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques)

AVIGNON, 20-23 April 1998

Sponsored by GFCP – SFA – ESCA – IEEE

TIMIT was the preferred corpus

Sometimes bitter debate over forensic capabilities

Page 7

Avignon Papers


Page 8

Crete 2001


2001: A Speaker Odyssey - The Speaker Recognition Workshop

June 18-22, 2001, Crete, Greece

First official “Odyssey”

More emphasis on evaluation



Page 11

Toledo 2004

ODYSSEY 2004 – The Speaker and Language Recognition Workshop
May 31 – June 3, 2004
Toledo, Spain

First Odyssey with NIST SRE Workshop held in conjunction at same location

First to include language recognition.

Two notable keynotes on forensic recognition.

Well attended. Odyssey held biennially since 2004.


Page 13

Etc. – Odyssey 2006, 2008, 2010, 2012, …

Odyssey 2008: The Speaker and Language Recognition Workshop

Stellenbosch, South Africa, January 21-24, 2008

Odyssey 2010: The Speaker and Language Recognition Workshop
Brno, Czech Republic, 28 June – 1 July 2010

Page 14

Organizing Evaluations

• Which task(s)?
• Key principles
• Milestones
• Participants

Page 15

Which Speaker Recognition Problem?

• Access Control?
  – Text independent or dependent?
  – Prior probability of target high
• Forensic?
  – Prior not clear
• Person Spotting?
  – Prior probability of target low
  – Text independent
• NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of the performance curve

Page 16

Some Basic Evaluation Principles

• Speaker spotting primary task
• Research system oriented
• Pooling across target speakers
• Emphasis on low false alarm rate operating point, with scores and decisions (calibration matters)

Page 17

Organization Basics

• Open to all willing participants
• Research-oriented
  – Commercialized competition discouraged
• Written evaluation plans
  – Specified rules of participation
• Workshops limited to participants
  – Each site/team must be represented
• Evaluation data sets subsequently published by the LDC


Page 19

1996 Evaluation Plan (cont’d)


Page 20

1996 Evaluation Plan (cont’d)


Note 1: PROC plots are ROCs plotted on normal probability error scales (miss versus false alarm)
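As an illustration of what such a plot involves (a sketch under stated assumptions, not code from the evaluations): error rates are obtained by sweeping a decision threshold over the trial scores, and both axes are then warped by the inverse normal CDF (probit), which is what distinguishes a DET plot from an ordinary ROC. The toy data and all names below are illustrative.

```python
# Sketch of a DET plot: miss and false-alarm rates from a threshold sweep,
# drawn on probit-warped axes. Toy data and names are illustrative.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_points(target_scores, nontarget_scores):
    """Return (P_miss, P_fa) over a sweep of all observed score thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    tgt = np.sort(target_scores)
    non = np.sort(nontarget_scores)
    p_miss = np.searchsorted(tgt, thresholds, side="left") / len(tgt)   # targets below threshold
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / len(non)  # non-targets at/above threshold
    return p_miss, p_fa

rng = np.random.default_rng(0)
p_miss, p_fa = det_points(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 10000))

eps = 1e-6  # keep probabilities strictly inside (0, 1) before the probit warp
plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)), norm.ppf(np.clip(p_miss, eps, 1 - eps)))
plt.xlabel("False alarm probability (probit scale)")
plt.ylabel("Miss probability (probit scale)")
plt.show()
```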

Page 21

DET Curve Paper – Eurospeech ‘97


Page 22

Wikipedia DET Page


Page 23

Some Milestones

• 1992 – DARPA program limited speaker identification evaluation
• 1995 – Small identification evaluation
• 1996 – First SRE in current series
• 2000 – AHUMADA Spanish data, first non-English speech
• 2001 – Cellular data
• 2001 – ASR transcripts provided
• 2002 – FBI “forensic” database
• 2002 – SuperSID Workshop following SRE
• 2005 – Multiple languages with bilingual speakers

Page 24

Some Milestones (cont’d)

• 2005 – Room mic recordings, cross-channel trials
• 2008 – Interview data
• 2010 – New decision cost function metric stressing even lower FA rate region
• 2010 – High and low vocal effort, aging
• 2010 – HASR (Human-Assisted Speaker Recognition) Evaluation
• 2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb
• 2012 – Target speakers defined beforehand

Page 25

Participation

• Grew from fewer than a dozen to 58 sites in 2010

• MIT (Doug) provided workshop notebook covers listing participants

• Big increase in participants after 2001
• Handling scores of participating sites becomes a management problem

Page 26

NIST 2004 Speaker Recognition Workshop
Taller de Reconocimiento de Locutor (Speaker Recognition Workshop)


Page 28

Participating Sites

[Bar chart: number of participating sites for each evaluation year – 92*, 95*, 96, 97, 98, 99, 00, 01, 02, 03, 04, 05, 06, 08, 10, 11*, 12#]

* Not in SRE series   # Incomplete

Page 29

This slide is from 2001: A Speaker Odyssey in Crete

Page 30

NIST Evaluation Data Set (cont’d)

Year – Common Condition(s) – Evaluation Features

• 2002 – One-session training on conv. phone data – Cellular data; alternative tests of extended training, speaker segmentation, and a limited corpus of simulated forensic data
• 2003 – One-session training on conv. phone data – Cellular data, extended training
• 2004 – Handheld landline conv. phone speech, English only – Multi-language data with bilingual speakers
• 2005 – English only with handheld tel. set – Included cross-channel trials with mic. test; both sides of 2-channel convs. provided
• 2006 – English only trials (including mic. test trials) – Included cross-channel trials with mic. test

Page 31

NIST Evaluation Data Set (cont’d)

Year – Common Condition(s) – Evaluation Features

• 2008 – 8, contrasting English and bilingual speakers, interview and conv. phone speech along with cross-condition trials – Interview speech recorded over multiple mic channels and conv. phone speech recorded over mic and tel channels; multiple languages
• 2010 – 9, contrasting tel and mic channels, interview and conversational phone speech, and high, low and normal vocal effort – Multiple microphones; phone calls with high, low, and normal vocal effort; aging data (Greybeard); HASR
• 2012 – 5: interview test without noise, conv. phone test without noise, interview test with added noise, conv. phone test with added noise, conv. phone test collected in noisy environment – Target speakers specified in advance (from previous evals) with large amounts of training; some test calls collected in noisy environments; phone test data with added noise

Page 32

Performance Factors

• Intrinsic
• Extrinsic
• Parametric

Page 33

Intrinsic Factors

Relate to the speaker
– Demographic factors
  • Sex
  • Age
  • Education
– Mean pitch
– Speaking style
  • Conversational telephone
  • Interview
  • Read text
– Vocal effort
  • Some questions about definition and how to collect
– Aging
  • Hard to collect sizable amounts of data with years of time separation

Page 34

Extrinsic Factors

Relate to the collection environment
– Microphone or telephone channel
– Telephone channel type
  • Landline, cellular, VOIP
  • In earlier times, carbon vs. electret
– Telephone handset type
  • Handheld, headset, earbud, speakerphone
– Microphone type – matched, mismatched
– Placement of microphone relative to speaker
– Background noise
– Room reverberation

Page 35

“Parametric” Factors

• Train/test speech duration
  – Have tested 10 s up to ~half hour, more in ‘12
• Number of training sessions
  – Have tested 1 to 8, more in ‘12
• Language – English has been predominant, but a variety of others included in some evaluations
  – Is better performance for English due to familiarity and quantity of development data?
  – Cross-language trials a separate challenge

Page 36

Metrics

• Equal Error Rate
  – Easy to understand
  – Not operating point of interest
  – Calibration matters
• Decision Cost Function
• CLLR
• FA rate at fixed miss rate
  – E.g. 10% (lower for some conditions)
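For completeness, here is a sketch of how the equal error rate can be read off a threshold sweep of raw scores (illustrative only, not the evaluations' scoring code; all names are assumptions of this sketch):

```python
# Equal error rate sketch: find the point on a discrete threshold sweep
# where P_miss and P_fa are closest, and report their average.
import numpy as np

def eer(scores, is_target):
    thresholds = np.sort(scores)
    tgt = np.sort(scores[is_target])
    non = np.sort(scores[~is_target])
    p_miss = np.searchsorted(tgt, thresholds, side="left") / len(tgt)
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / len(non)
    i = np.argmin(np.abs(p_miss - p_fa))   # closest crossing on the sweep
    return 0.5 * (p_miss[i] + p_fa[i])
```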

Page 37

Decision Cost Function CDet

CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)

• Weighted sum of miss and false alarm error probabilities
• Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget
• Normalize by the best possible cost of a system doing no processing (minimum of the cost of always deciding “yes” or always deciding “no”)
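The following is a minimal sketch of the normalized cost defined above, not the official NIST scoring software; it assumes per-trial hard decisions and ground-truth labels as boolean arrays, with the 1996-2008 parameters as defaults.

```python
# Normalized detection cost (CDet) sketch, assuming boolean arrays of
# system decisions and target labels. Not the official scoring tool.
import numpy as np

def normalized_cdet(decisions, is_target, c_miss=10.0, c_fa=1.0, p_target=0.01):
    p_miss = np.mean(~decisions[is_target])   # miss rate over target trials
    p_fa = np.mean(decisions[~is_target])     # false-alarm rate over non-target trials
    cdet = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Best cost of a system doing no processing: always "no" vs. always "yes".
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cdet / c_default
```

With the 1996-2008 defaults, always answering “no” costs 10 × 0.01 = 0.1 and always answering “yes” costs 1 × 0.99 = 0.99, so the normalizer is 0.1.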

Page 38

Decision Cost Function CDet (cont’d)

• Parameters 1996-2008: CMiss = 10, CFalseAlarm = 1, PTarget = 0.01
• Parameters 2010: CMiss = 1, CFalseAlarm = 1, PTarget = 0.001
• Change in 2010 (for core and extended tests) met with some skepticism, but outcome appeared satisfactory

Page 39

CLLR

Cllr = 1/(2·log 2) × [ (Σ log(1 + 1/s)) / NTT + (Σ log(1 + s)) / NNT ]

where the first summation is over target trials, the second is over non-target trials, NTT and NNT are the numbers of target and non-target trials, respectively, and s represents a trial’s likelihood ratio

• Information-theoretic measure made popular in this community by Niko
• Covers a broad range of performance operating points
• George has suggested limiting the range to low FA rates
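A minimal numerical sketch of the formula above, assuming scores are supplied as log-likelihood ratios (so s = exp(llr)); illustrative only, not the scoring code used in the evaluations.

```python
# Cllr sketch: assumes per-trial log-likelihood ratios and truth labels.
import numpy as np

def cllr(llr, is_target):
    """llr: array of log-likelihood ratios; is_target: boolean array."""
    tgt = llr[is_target]
    non = llr[~is_target]
    # log(1 + 1/s) = log1p(exp(-llr)); log(1 + s) = log1p(exp(llr))
    target_term = np.mean(np.logaddexp(0.0, -tgt))      # averaged over target trials
    nontarget_term = np.mean(np.logaddexp(0.0, non))    # averaged over non-target trials
    return (target_term + nontarget_term) / (2.0 * np.log(2.0))

# A system that outputs llr = 0 for every trial gets Cllr exactly 1.
print(cllr(np.zeros(100), np.arange(100) < 50))   # -> 1.0
```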

Page 40

Fixed Miss Rate

• Suggested in ‘96, was primary metric in BEST 2012: FA rate corresponding to 10% miss rate
• Easy to understand
• Practical for applications of interest
• May be viewed as the cost of listening to false alarms
• For easier conditions, a 1% miss rate now more appropriate
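A sketch of how this can be computed from raw scores, under the assumption that the threshold is set so that exactly the specified fraction of target trials is missed; names and the toy data are illustrative, and this is not the official metric implementation.

```python
# FA rate at a fixed miss rate: set the threshold so that the given
# fraction of target trials is missed, then measure false alarms.
import numpy as np

def fa_at_fixed_miss(scores, is_target, miss_rate=0.10):
    tgt = scores[is_target]
    non = scores[~is_target]
    threshold = np.quantile(tgt, miss_rate)   # miss_rate of target scores fall below this
    return np.mean(non >= threshold)

rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(2, 1, 1000), rng.normal(0, 1, 100000)])
is_target = np.concatenate([np.ones(1000, bool), np.zeros(100000, bool)])
print(fa_at_fixed_miss(scores, is_target, 0.10))
```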

Page 41

Recording Progress

• Difficult to assure test set comparability
  – Participants encouraged to run prior systems on new data
• Technology changes
  – In ‘96 landline phones predominated, with carbon button or electret microphones
  – Need to explore VOIP
• With progress, want to make the test harder
  – Always want to add new evaluation conditions, new bells and whistles
    • More channel types, more speaking styles, languages, etc.
  – Externally added noise and reverb explored in 2011 with BEST
• Doug’s history slide – updated

Page 42

History Slide


Page 43

Future

• SRE12
• Beyond

Page 44

SRE12 Plans

• Target speakers specified in advance
  – Speakers in recent past evaluations (in the thousands)
  – All prior speech data available for training
  – Some new targets with training provided at evaluation time
  – Test segments will include non-target speakers
• New interview speech provided in 16-bit linear PCM
• Some test calls collected in noisy environments
• Artificial noise added to some test segment data
• Will this be an effectively easier id task?
  – Will the provided set of known targets change system approaches?
  – Optional conditions include
    • Assume test speaker is one of the known targets
    • Use no information about targets other than that of the trial

Page 45

SRE12 Metric

• Log-likelihood ratios will now be required
  – Therefore, no hard decisions are asked for
• Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on next slide)
  – Adds to stability of cost measure
  – Emphasizes need for good score calibration over a wide range of log likelihoods
• Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which PMiss > 10%

Page 46

SRE12 Primary Cost Function

• Niko noted that estimated llr’s making good decisions at a single operating point may not be effective at other operating points; therefore an average of two points is used
• Writing DCF as
    PMiss + β × PFA
  where
    β = (CFA / CMiss) × (1 − PTarget) / PTarget
• We take as cost function
    (DCF1 + DCF2) / 2
  where PTarget-1 = 0.01, PTarget-2 = 0.001, with always CMiss = CFA = 1
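A minimal sketch of this average cost, under the added assumption (not stated on the slide) that the submitted log-likelihood ratios are thresholded at the Bayes threshold log β for each operating point; illustrative only, not the official SRE12 scoring code, and all names are of this sketch.

```python
# Average of two DCFs computed from log-likelihood ratios, with each DCF
# evaluated by thresholding the llr at its Bayes threshold log(beta).
import numpy as np

def dcf_at_prior(llr, is_target, p_target, c_miss=1.0, c_fa=1.0):
    beta = (c_fa / c_miss) * (1.0 - p_target) / p_target
    decisions = llr >= np.log(beta)            # Bayes decision for this operating point
    p_miss = np.mean(~decisions[is_target])
    p_fa = np.mean(decisions[~is_target])
    return p_miss + beta * p_fa                # DCF in the form PMiss + beta * PFA

def sre12_primary_cost(llr, is_target):
    return 0.5 * (dcf_at_prior(llr, is_target, 0.01) +
                  dcf_at_prior(llr, is_target, 0.001))
```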

Page 47

Future Possibilities

• SRE12 outcome will determine whether pre-specified targets will be further explored
  – Does this make the problem too easy?
• Artificially added noise and reverb may continue
• HASR12 will indicate whether human-in-the-loop evaluation gains traction
• SREs have become bigger undertakings
  – Fifty or more participating sites
  – Data volume approaching terabytes (as in BEST)
  – Tens or hundreds of millions of trials
  – Schedule could move to every three years