
The NIST speaker recognition evaluation – Overview, methodology, systems, results, perspective

George R. Doddington a,b, Mark A. Przybocki b, Alvin F. Martin b,*, Douglas A. Reynolds c

a SRI International, Menlo Park, CA, USA
b Technology Building 225, National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA

c MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02173, USA

Received 1 October 1998; received in revised form 22 September 1999

Abstract

This paper, based on three presentations made in 1998 at the RLA2C Workshop in Avignon, discusses the evaluation of speaker recognition systems from several perspectives. A general discussion of the speaker recognition task and the challenges and issues involved in its evaluation is offered. The NIST evaluations in this area and specifically the 1998 evaluation, its objectives, protocols and test data, are described. The algorithms used by the systems that were developed for this evaluation are summarized, compared and contrasted. Overall performance results of this evaluation are presented by means of detection error trade-off (DET) curves. These show the performance trade-off of missed detections and false alarms for each system and the effects on performance of training condition, test segment duration, the speakers' sex and the match or mismatch of training and test handsets. Several factors that were found to have an impact on performance, including pitch frequency, handset type and noise, are discussed and DET curves showing their effects are presented. The paper concludes with some perspective on the history of this technology and where it may be going. © 2000 Elsevier Science B.V. All rights reserved.

Résumé

This article, based on the three presentations given in 1998 at the RLA2C conference in Avignon, presents methods for the evaluation of speaker recognition. After an overview of what speaker recognition is and of the problems it raises, we present the objectives, methods, data and results of the evaluation proposed by NIST in 1998. For this we use detection error trade-off (DET) curves. These curves have the advantage of showing the trade-off between misses and false alarms and the effect on performance of the training conditions, the duration of the test segments, the speaker's sex and the type of handset. We then present, compare and synthesize the different algorithms used by the systems submitted by the participants. We found that several factors strongly influenced system performance, such as the pitch of the voice, the type of handset used and the ambient noise. We present these factors and illustrate their effects with DET curves. We conclude with a history of speaker recognition techniques and with some projections for the future of this field. © 2000 Elsevier Science B.V. All rights reserved.

Speech Communication 31 (2000) 225–254 (www.elsevier.nl/locate/specom)

* Corresponding author. Tel.: +1-301-975-3169; fax: +1-301-670-0939 (http://www.nist.gov/speech/martinal.htm).

E-mail address: [email protected] (A.F. Martin).

0167-6393/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.

PII: S0167-6393(99)00080-1


Keywords: Speaker recognition; Identification; Verification; Performance evaluation; NIST evaluations; Detection error trade-off (DET) curve

1. Introduction

Language is the engine of civilization and speech is its most powerful and natural form, despite the misappropriation of the term ``natural language processing'' to mean the processing of text. Textual language has certainly become extremely important in modern life, but speech has dimensions of richness that text cannot approximate. For example, the health, the sex and the attitude of a person are all naturally and subliminally communicated by that person's speech. Such extra-linguistic information has social value and serves important communicative functions in our everyday lives.

Cues to a person's identity are also part of the extra-linguistic information communicated by the sound of a speaker's voice. With the complexities of modern life, this ability to identify people by the sound of their voices finds value beyond the realm of personal living. With the advent of the information age and the high-powered but low-cost computers that fuel it, speaker recognition has become an attractive potential capability – albeit one that largely remains to be fulfilled through research and technology development efforts.

1.1. NIST evaluations

The National Institute of Standards and Technology (NIST), located in Gaithersburg, MD, USA, has played a vital role in working with industry to develop and apply technology, measurements and standards for several years. The Spoken Natural Language Processing group of NIST's Information Technology Laboratory conducts annual benchmark tests to evaluate the state of the art in the areas surrounding core speech recognition technologies.

NIST has coordinated an ongoing series of speaker recognition evaluations (Przybocki and Martin, 1998b), which have provided an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluations are designed to be simple, to focus on core technology issues, to be fully supported and to be accessible. A follow-up workshop for the participants to discuss research findings is held after each evaluation. Participation in these evaluations is solicited for all research sites that find the task and the evaluations of interest.

1.2. Outline

The authors have been involved in the NIST speaker recognition evaluations as planners, organizers, scorers, sponsors and participants. Section 2 (first author Doddington) of this paper offers a somewhat personal overview of evaluation methodology for speaker recognition based on long experience in the field. Section 3 (authors Przybocki and Martin) discusses the objectives and protocols of the 1998 evaluation. Section 4 (author Reynolds) describes the various systems from laboratories in the United States and Europe that were included in the evaluation. Section 5 (authors Przybocki and Martin) presents a number of charts of the performance results of the evaluation, with some contrast with the results from the evaluation of the previous year. We discuss two specific factors that we found to significantly affect the level of performance, namely, speakers' average pitch and the handset types of the phones used. Section 6 summarizes the NIST evaluation process and suggests future research directions for where this technology is headed.

2. Overview of evaluation methodology

The increase in application opportunities has resulted in a heightened interest in speaker recognition research. A key element in this research is the identification of key research tasks and the establishment of evaluation methodology to support them. The purpose of this section is to provide an interpretive overview of speaker recognition tasks and evaluation methodology as previously presented at the ``La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques'' (RLA2C) Workshop (Doddington, 1998).

2.1. Applications

There are numerous possible ways of categorizing different applications of speaker recognition. One meaningful way is to divide the applications into two groups according to the immediate beneficiary of the process. Specifically, the question is whether it is the speaker who benefits or someone else. This split is particularly useful because it also has a profound impact on the task definition, the technology, the system design and system performance.

Examples of speaker recognition applications that serve the speaker are security systems that provide control of physical entry and information access. Such systems are in general more intricate in design and capable of better performance than other kinds of systems. For example, because security system users are cooperative, they can collaborate with the system by saying what is asked of them and by proffering their identities to make the recognition task easier. Also, special care is taken in the design of these systems to ensure good speaker recognition performance – e.g., by using high quality uniform microphones and by controlling the ambient noise level, where possible.

Examples of speaker recognition applications that serve someone other than the speaker include forensic applications. While the use of speaker recognition in such cases is less encumbered by design details, performance is usually significantly worse – the speaker may be using an unknown microphone (often a carbon button telephone handset) or may be particularly emotional, and the recording is typically of low quality for a variety of reasons. The one advantage that forensic type applications often do have is in the amount of speech data. Whereas in security applications, where time is of the essence, speech segments are measured in a few seconds, forensic applications may have minutes of speech to use.

2.2. Task definitions

In order to conduct a productive applied research effort directed toward speaker recognition technology, it is helpful to have a clear understanding and expression of the research objective. This research objective is properly expressed in terms of a formal definition of the speaker recognition task.

Ideally, a formal task definition of speaker recognition will serve more than to organize research within a single research effort. It can also serve to share resources and results across a number of different research sites. This tendency toward globalization of speech research has accelerated strongly during the past 10 years, due in part to a gradual realization of the immense challenge of understanding speech and creating useful speech technology.

There are logically two parts to defining the speaker recognition task. These are to define the nature of the speaker recognition decisions and to define the nature of the speech data on which these decisions are to be made.

2.2.1. Recognition decisions
Speaker recognition is the general term used to include all of the many different tasks of discriminating people based on the sound of their voices. There are many terms used to distinguish different tasks, including speaker identification, speaker verification, speaker spotting and speaker detection. But it will be most useful to focus primary discussion on just two different kinds of tasks, identification and verification.

2.2.1.1. Identification. Speaker identification is the task of deciding, given a sample of speech, who among many candidate speakers said it. This is an N-class decision task, where N is the number of candidate speakers. Of course N has a powerful influence on the ultimate performance for an identification task – performance might be extremely good for just a few speakers (especially if they happen to have distinctly different voices), but as N becomes arbitrarily large, the probability of correctly identifying the speaker becomes arbitrarily small and approaches zero.

There are variants of the identification task, including most notably the situation where the actual speaker may be none of the candidate speakers. The identification task is called a ``closed set'' task when the actual speaker is always one of the candidate speakers. When the actual speaker may not be one of the candidate speakers, it is called an ``open set'' task.

In most cases, the speaker identification task appears to be more of a scientific game than an operationally useful capability. This is especially true for security system type applications, because here the speaker has good reason to cooperate with the system by proffering an identity, thus reducing the N-class task down to a 2-class task. Even for forensic type applications, the problem most often is one of evaluating each candidate separately rather than choosing one from among many candidates. Thus, it seems that, though the speaker identification task may garner considerable scientific interest, it is the speaker verification task that has the greatest application potential.

2.2.1.2. Verification. Speaker verification is the task of deciding, given a sample of speech, whether a specified candidate speaker said it. This is a 2-class task and is sometimes referred to as a speaker detection task. The two classes for the verification task are: the specified speaker (known variously as the ``true speaker'' or the ``target speaker''); and some speaker other than the specified speaker (known variously as an ``impostor'' or a ``non-target speaker'').

One very nice characteristic of the speaker verification task, deriving from the inherent nature of a 2-class task, is that the performance is independent of the number of potential impostor speakers. Thus, the measured performance of a verification system is independent of the number of impostor speakers in the test set. (This assumes that the verification system does not ``know'' the impostor speakers. Improved impostor resistance can be achieved for small numbers of impostors if the system has explicit knowledge of the impostors' speech.) This does not imply, however, that performance is independent of the choice of impostors – if impostors similar to the target speaker are selected, then naturally performance will be worse.

2.2.2. Operating modes
Generally speaking, there are two different modes of speech used in speaker recognition, which correspond to the two different kinds of applications. When the speaker is cooperative, the system may know what the speaker is supposed to say and better performance may be achieved. When the speaker is uncooperative or unaware, then the system will be handicapped by lack of this knowledge.

2.2.2.1. Text-dependent recognition. A speaker recognition system is called ``text-dependent'' if it knows what the speaker is supposed to say. In the typical text-dependent recognition scenario, the speaker will either say a predefined utterance or will be prompted to say an utterance by the system. In either case, the target speaker will have explicitly spoken these words during an enrollment session.

This explicit knowledge can be used to good effect in building detailed dynamic models of the speaker. These models can be word- and phone-specific and thus can calibrate the target speaker with great accuracy – sometimes too accurately, if the speaker does not produce the appropriate utterance properly! But, with nominal speaker cooperation, text-dependent recognition improves recognition performance while at the same time minimizing the amount of speech data required.

2.2.2.2. Text-independent recognition. A speaker recognition system is called ``text-independent'' if it does not have foreknowledge of what the speaker is saying. This forces the technology to deal with whatever speech data happens to be available, both for training and for recognition decisions. Because of this, text-independent systems typically exhibit worse performance than do text-dependent systems, at least per unit of speech.

There is a significant research advantage to working on text-independent recognition, beyond just the ability to use significantly larger amounts of speech data. This is, namely, the general absence of idiosyncrasies in the definition of the task and the specification of the speech data. This allows simple and general task and data specification, which in turn supports much broader collaboration and sharing of results than has been realized in text-dependent recognition. Partially as a result of this advantage, interest in text-independent recognition has risen significantly.

2.3. The technical challenges

People's voices are distinctive. That is, a person's speech exhibits distinctive characteristics that indicate the identity of the speaker. We are all familiar with this and we all use it in our everyday lives to help us interact with others. Of course, from time to time we might notice that one person sounds very much like another person we know. Or we might even momentarily mistake one person as another because of the sound of the person's voice. But this similarity between voices of different individuals is not what the technical challenge in speaker recognition is all about.

The challenge in speaker recognition is variance, not similarity. That is, the challenge is to decode a highly variable speech signal into the characteristics that indicate the speaker's identity. These variations are formidable and myriad. The principal source of variance is the speaker.

2.3.1. The speaker
While it is the speaker's voice that provides the recognition capability, it is also the speaker's variability that makes the problem so difficult. An explanation for why the speaker's variability is such a vexing problem is that the use of speech, unlike fingerprints or handprints or retinal patterns, is to a very large degree a result of what the person does, rather than who the person is – speech is a performing art and each performance is unique.

Some of the sources of variability are within the speaker's control. Some are not. Here is a list of some of the factors that have been offered as explanations for speaker variability leading to less than perfect speaker recognition performance:
· The session. Early experiments showed good speaker recognition performance when training and testing were both conducted in the same session, but a speaker's voice characteristics change from session to session. The single-session experiment successfully avoids these problems, but unfortunately does not contribute to solving them.

· Health. Respiratory infections are a definite problem and of course laryngitis is the ultimate health problem for speaker recognition. Other factors that might be considered part of a person's health include emotional state and metabolic rate.

· Educational level and intelligence. Although this subject is taboo, general intellectual keenness may play a role, at least in cooperative systems where speaker control can have a beneficial effect on performance.

· Speech effort level and speaking rate. These factors are at least superficially controllable, though people are not typically conscious of them. This is especially true for the Lombard effect (where people unconsciously talk louder when exposed to auditory noise). Such changes in speech production cause complex changes in the speech signal beyond simple energy and rate changes.

· Experience. This applies only to cooperative systems where the speaker has the opportunity to interact with the speaker recognition system repeatedly. In these cases, it has been observed that there is a striking learning effect. However, how this learning effect is divided between the machine learning the speaker and the speaker learning to talk to the machine is not totally clear.

2.3.2. Other challenges
There are other factors, beyond speaker variability, that present a challenge to speaker recognition technology. These deal with problems in getting the acoustical signal intact from the speaker to the recognition system:
· Acoustical noise. This may be a large or small problem, depending on the microphone used to transduce the speech signal and the acoustical environments within which the system is required to perform.


· The microphone. Second only to the speaker, the microphone and associated sound pick-up issues probably represent the next biggest challenge to speaker recognition. This is especially true for applications that use the telephone, because of the wide variety of telephone handsets and microphone elements. People also have a wide range of ways that they hold telephone handsets and may continually change positions while talking, presenting problems to human interlocutors as well as speaker recognition systems.

· Electrical transmission. This has been a problem in the past and current mobile telephone systems continue to pose a challenge. However, improvement in line quality and the rapid deployment of high performance digital communication systems may largely eliminate transmission from the palette of problems that face speaker recognition in the future.

2.4. Performance measures

Performance measures serve a number of purposes. These include, most importantly, a means for evaluating research ideas and making consistent long-term technical progress. Other reasons include comparing different systems, evaluating the effectiveness of technology for specific applications, marketing research to sponsors and selling products to customers.

Performance measures need to be easy to understand and clear. In deciding how to express performance figures, there is good reason to choose error over accuracy. The reason is that error lends itself to better intuition than does accuracy. For example, it is easy to see that reducing the error rate from 10% to 5% is in a very real sense a doubling of performance, which is of course the same as increasing the accuracy from 90% to 95%.

2.4.1. Identification
The speaker identification task is straightforward and lends itself to a simple bottom-line identification error rate, E_ID, as the performance measure:

$$E_{\rm ID} = n_{\rm err} / n_{\rm tot},$$

where n_tot and n_err are the total number of trials and the number of trials in error, respectively. The error rate may vary with speaker and other conditions, but the basic performance measure is quite straightforward.

2.4.2. Verification
Speaker verification is a detection task and therefore there exists considerable support for it, in terms of existing conventions and procedures for measuring the performance of detection systems.

2.4.2.1. Miss/false alarm. Detection system performance is usually characterized in terms of two error measures, namely, the miss and false alarm error rates. These correspond respectively to the probability of not detecting the target speaker when present, E_miss, and the probability of falsely detecting the target speaker when not present, E_fa. These measures are calculated as

$$E_{\rm miss} = n_{\rm miss} / n_{\rm target},$$

where n_target and n_miss are the number of target trials and the number of those where the target speaker was not detected, respectively, and

$$E_{\rm fa} = n_{\rm fa} / n_{\rm impostor},$$

where n_impostor and n_fa are the number of impostor trials and the number of those where the target speaker was falsely detected, respectively.
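For concreteness, these two definitions can be computed directly from per-trial labels and decisions, as in the following minimal sketch (illustrative code only, not the evaluation's scoring software; the trial format is hypothetical):

```python
# Minimal sketch of E_miss and E_fa from per-trial outcomes.
# `trials` is a hypothetical list of (is_target, decided_target) pairs.
def error_rates(trials):
    n_target = sum(1 for is_target, _ in trials if is_target)
    n_impostor = len(trials) - n_target
    n_miss = sum(1 for is_target, decided in trials if is_target and not decided)
    n_fa = sum(1 for is_target, decided in trials if not is_target and decided)
    return n_miss / n_target, n_fa / n_impostor   # (E_miss, E_fa)

# Example: three target trials (one missed), four impostor trials (one false alarm).
trials = [(True, True), (True, False), (True, True),
          (False, False), (False, True), (False, False), (False, False)]
print(error_rates(trials))   # (0.333..., 0.25)
```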

2.4.2.2. Equal error rate. Miss and false alarm rates, while properly characterizing the performance, still do not produce a single number performance figure. The use of equal error rate essentially combines miss and false alarm rates into a single number by finding the decision threshold at which they are equal. This only works, of course, if the decision threshold is adjustable.
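A threshold sweep over the raw scores is one simple way to locate the equal error point; the sketch below assumes higher scores favour the target hypothesis and uses synthetic scores (names and data are illustrative only):

```python
# Approximate equal error rate by sweeping the threshold over the observed scores.
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best = (1.0, 0.0, 0.0)                         # (|E_miss - E_fa|, E_miss, E_fa)
    for t in thresholds:
        e_miss = np.mean(target_scores < t)        # targets rejected at threshold t
        e_fa = np.mean(impostor_scores >= t)       # impostors accepted at threshold t
        if abs(e_miss - e_fa) < best[0]:
            best = (abs(e_miss - e_fa), e_miss, e_fa)
    return (best[1] + best[2]) / 2                 # EER taken at the crossover

rng = np.random.default_rng(1)
print(equal_error_rate(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 5000)))
```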

2.4.2.3. Geometric mean error. While equal error rate may be a reasonable performance measure for laboratory systems, this does not extend so easily to fielded systems. Operational systems do not exhibit equal miss and false alarm error rates and usually these are purposely adjusted to be quite different from each other. Nonetheless, it would be desirable to compare different systems in terms of a single error figure that represents their underlying ability to discriminate among speakers.

The geometric mean error, E_GM, is a statistic that offers the desired ability to compare systems at other than the equal error point. It is defined simply as

$$E_{\rm GM} = \sqrt{E_{\rm miss} \cdot E_{\rm fa}}.$$

E_GM will be fairly constant as miss and false alarm rates vary, at least over a modest range of values. While it may be used as a rough comparison of different systems, probably a better use for E_GM is as a diagnostic measure of performance during research or system upgrade. By this means, e.g., small variations in the miss/false alarm trade-off need not prevent a meaningful comparison.

However, E_GM is not perfectly constant. Typically, it tends to increase as the miss rate is decreased. This effect usually becomes more pronounced when the miss rate gets down to about 1% or 2%. This gives rise to the adage that ``it's easier to reject impostors than it is to accept true speakers''. This droll folk wisdom reflects the behavioral statistics (e.g., state of health and emotions) of the true speakers – an interesting statistical commentary on the inherent variability of the human voice.

2.4.2.4. Detection cost. Another good approach to representing system performance as a single number is to formulate a detection cost function. This is a weighted arithmetic mean of the miss and false alarm rates. This measure has the advantage that it models the application and produces a number which is directly meaningful to the application. Detection cost, C_det, is typically modeled as a weighted sum of the posterior probabilities of miss and false alarm:

$$C_{\rm det} = c_{\rm miss} \cdot E_{\rm miss} \cdot P_{\rm target} + c_{\rm fa} \cdot E_{\rm fa} \cdot (1 - P_{\rm target}),$$

where c_miss and c_fa are the costs of a miss and a false alarm, respectively, and P_target is the a priori probability of a target.

The detection cost function as defined above seems to be an excellent evaluation measure, because it quantifies the value (cost) of the technology to the application. For many different applications, the speaker recognition value/cost is arguably well represented by this detection cost function. Yet there appears to be little interest in the scientific community for such abstract measures. Researchers prefer to understand the miss and false alarm error trade-off.
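Both single-number summaries are trivial to compute once E_miss and E_fa are known; the sketch below is illustrative only, with arbitrary example error rates (the official 1998 cost parameters appear in Section 3.2):

```python
# Geometric mean error and detection cost as defined above.
import math

def geometric_mean_error(e_miss, e_fa):
    return math.sqrt(e_miss * e_fa)

def detection_cost(e_miss, e_fa, c_miss, c_fa, p_target):
    return c_miss * e_miss * p_target + c_fa * e_fa * (1.0 - p_target)

print(geometric_mean_error(0.10, 0.01))                              # 0.0316...
print(detection_cost(0.10, 0.01, c_miss=10, c_fa=1, p_target=0.01))  # 0.0199
```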

2.4.3. The DET plot
The trade-off between miss and false alarm error rates has traditionally been shown on so-called ROC curves, which plot detection probability as a function of false alarm probability. An improvement to this visualization aid, called the detection error trade-off (DET) plot, has been introduced by NIST recently (Martin et al., 1997). The DET plot improves the visual presentation of the trade-off by plotting miss and false alarm probabilities according to their corresponding Gaussian deviates, rather than the probabilities themselves. This results in a non-linear probability scale, but the advantage is that the plots are visually more intuitive. In particular, if the underlying score distributions are Gaussian, then the resulting trade-off curves are straight lines. The distance between curves depicts performance differences more meaningfully. Examples of DET plots may be seen in Section 5. In particular, Fig. 3 contrasts a DET plot and a traditional ROC plot for the same data.
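The essential step behind a DET plot is the probit (Gaussian deviate) transformation of the two error probabilities before plotting. A minimal sketch, assuming scipy is available and reusing the threshold-sweep idea from Section 2.4.2 (all names are illustrative):

```python
# Map (E_fa, E_miss) pairs to Gaussian deviates for DET-style plotting.
import numpy as np
from scipy.stats import norm

def det_points(target_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    e_miss = np.array([np.mean(target_scores < t) for t in thresholds])
    e_fa = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    finite = (e_miss > 0) & (e_miss < 1) & (e_fa > 0) & (e_fa < 1)   # probit is finite here
    return norm.ppf(e_fa[finite]), norm.ppf(e_miss[finite])

# x, y = det_points(target_scores, impostor_scores)
# Plotting y against x on linear axes then gives the DET curve; Gaussian score
# distributions yield approximately straight lines.
```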

2.4.4. Decision making
Speaker recognition technology deals with making decisions. The task is quite simple – to make a decision about the identity of the speaker, based on the sound of the person's voice. That seems clear enough. Yet the actual decision making process, specifically the setting of decision thresholds, has often been neglected in speaker recognition research.

Making these decisions has often been dismissed as an unchallenging problem to be addressed during application development. Yet for those who actually have had to deploy real operational systems, the problem has been found to be quite challenging, indeed. For the case of cooperative speaker systems, performance is invariably found to vary widely among different speakers. But homogeneous performance, with nice low miss rates for everyone, is a system requirement. So researchers endeavor to calibrate each speaker separately, only to find that what is observed in training may be destructively misleading in usage – speaker normalization sometimes results in worse performance!

The important point here is that the actual decision process must be considered to be part of any comprehensive speaker recognition research program and that the ability of a system to make good decisions should be an integral part of the evaluation.

2.4.5. Pooling data
The issue of how to pool data is related to the problem of setting thresholds and making decisions. Pooling data for different target speakers can be especially painful, because the good performance given by speaker-dependent thresholds may dissolve into poor performance from a muddled threshold after results have been pooled.

One suggested solution to this problem is to find the equal error rate for each target speaker and then to average these error rates over all speakers. While this procedure invariably shows superior performance, the result is illusory. There are several reasons why this is a bad practice. The first is that knowledge of the best decision thresholds is inadmissible information – decisions must be made without the artificial luxury of knowing the answers. The second is that inference of the best decision thresholds from the test data is false knowledge – random sampling produces the illusion of good decision points that do not really exist. This can introduce a strong bias into results whenever there is a relatively small amount of data for each of the target speakers, which is usually the case.

2.5. Evaluation

Having a measure of performance is certainly desirable, even necessary, in order to evaluate a system. But performance is a function of many factors. Consideration of these factors, and how they influence performance, is necessary in order to understand the technology and apply it successfully.

2.5.1. Evaluation factors
Evaluation factors are factors that influence the performance of the system. Ideally we would like to vary each of these factors and evaluate their effect so as to understand how they relate to performance. Here are some of the most important evaluation factors:
· Amount of target training data. The more training data that a system has in order to learn a speaker's voice characteristics, the better will be the performance of the system. In particular, data from a number of different sessions is desirable, because a speaker's voice characteristics change significantly from session to session. But for cooperative speaker systems there is usually an overriding need to minimize the demands on the speaker and thus the amount of training data. While multi-session training would be very helpful, this is definitely a disadvantage to the application.

· Test segment duration. This is probably the most studied factor in speaker recognition performance, with longer segment durations providing significantly better performance. This is discussed in Section 5.2 for the 1998 NIST evaluation.

· Microphone differences. Microphone differences are one of the most serious problems facing speaker recognition, especially when dealing with the telephone, where Edison's old non-linear carbon-button microphone still represents a significant fraction of all transducers. The importance of microphone differences may be illustrated by comparing the performance obtained when training and testing are performed on a single type of microphone with that obtained when training and testing are performed on different types of microphones. This may be done easily enough when using telephone data, because there are two different types of microphone elements that are commonly used in telephone handsets. These are, namely, the traditional carbon-button element and the now more common electret element, which is rapidly replacing carbon buttons because of its lower cost. This contrast is discussed in Section 5.3 for the 1998 NIST evaluation.

· Noise. Noise and distortion will obviously corrupt performance, but there has been little systematic study of this issue. See Section 5.5 for a discussion of NIST's effort to correlate subjective noise with system performance in the 1998 evaluation.

· Temporal drift. Temporal factors play an important role in speaker recognition performance. Depending on the needs of the intended application, evaluation may need to stretch out over months or even years. For systems that include adaptation to speaker changes, this may not be a critical factor. However, it may also be that different age groups exhibit different performance characteristics. For cooperative speaker systems, special attention needs to be paid to the period immediately following a speaker's enrollment. This is an especially difficult time, because the system must bridge the gap between an unadapted and relatively inaccurate model and a well-adapted model of the speaker at the same time that the speaker is learning how to speak to the system.

· Sex differences. Male and female voices are generally quite different from each other, both in physical characteristics (e.g., pitch and vocal tract length) and in linguistic and stylistic use. For this reason, it is usually informative to measure performance separately for men and women. This is discussed in Section 5.4 for the 1998 NIST evaluation.

2.5.2. Evaluation methodology
The two different types of speaker recognition applications, namely, cooperative (text-dependent) recognition and tacit (text-independent) recognition, have a profound influence on the design of an evaluation. For the text-dependent case, the task is invariably system-specific and usually very idiosyncratic, making system evaluation specific to the system. This in turn inhibits the comparison of performance among different systems.

2.5.2.1. Statistical significance. The overarching consideration in designing an evaluation is statistical significance. An evaluation is pointless if it is not significant. Furthermore, if the evaluation is to be helpful to research, then statistical significance must be with respect to multiple and various selected conditions. For example, if only one target speaker is used, then statistically significant results may be obtained, but unfortunately they will be valid only for that particular speaker. Clearly, if statistical significance across a population in general is required, then an adequate number of speakers must be sampled. Regardless of the number of speakers sampled, however, care must be exercised to ensure that the sample population represents the population of interest. For example, it might be particularly easy to recruit college students and so an experiment might be limited to college students. Are the results of this experiment valid for populations of different age and educational demographics? Perhaps. Perhaps not. Assertions of statistical significance are dangerous in such circumstances.

One of the difficulties facing evaluation is the large number of factors that influence performance and the complex way that they may interact. Requiring meaningful (i.e., statistically significant) results for many different combinations of evaluation conditions easily results in an unmanageably large test.

2.5.2.2. The rule of 30. In determining the required size of a corpus, a helpful rule is what might be called ``the rule of 30''. This comes directly from the binomial distribution, assuming independent trials. Here is the rule:

To be 90% confident that the true error rate is within ±30% of the observed error rate, there must be at least 30 errors.

This confidence interval and proportional bound on error rate are reasonable values that yield reasonable requirements for the size of an evaluation corpus. The rule may be applied by considering the performance goals or expectations for the evaluation. For example, suppose that the performance goals are 1% miss and 0.1% false alarm. Thirty errors at 1% miss implies a total of 3000 true speaker trials and 30 errors at 0.1% false alarm implies a total of 30,000 impostor trials.
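The rule can be sanity-checked with the normal approximation to the binomial: with k observed errors, the relative 90% half-width is roughly 1.645/sqrt(k), which is about 30% at k = 30. A small sketch (function names are illustrative, not from the paper):

```python
# Quick check of "the rule of 30" and the corpus sizes it implies.
import math

def relative_halfwidth_90(n_errors):
    """Approximate 90% relative confidence half-width for an observed error count."""
    return 1.645 / math.sqrt(n_errors)      # two-sided 90% normal quantile

def trials_needed(error_rate, n_errors=30):
    """Independent trials needed to expect `n_errors` at the given error rate."""
    return math.ceil(n_errors / error_rate)

print(relative_halfwidth_90(30))   # ~0.30, i.e. about +/-30%
print(trials_needed(0.01))         # 3000 target trials at 1% miss
print(trials_needed(0.001))        # 30000 impostor trials at 0.1% false alarm
```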


Note, however, that a key assumption in these calculations is that the trials are independent. The implications of this can be daunting – does this mean 3000 true speakers and 30,000 impostors? Strictly speaking, it probably does. The alternative is to compromise on independence and not to take assertions of statistical significance too seriously.

2.5.2.3. Speaker selection. Target speakers should be chosen so as to best represent the intended application. This includes selection of sex, age, dialect and other demographic factors of potential relevance; and of course, there must be enough of them. This is especially important because of performance inhomogeneities among speakers. Specifically, there will be differences in performance among speakers for which there are no clear explanations or underlying mechanisms. This has led to the jocular characterization of the target population as being composed of ``sheep'' and ``goats''. In this characterization, the sheep are well behaved and dominate the population, whereas the goats, though in a minority, tend to determine the performance of the system through their disproportionate contribution of errors.

The impostor population should model the target population, which is typically done in faultless style by having the targets serve also as impostors. Like targets, impostors also have barnyard appellations, which follow from inhomogeneities in impostor performance across the speaker population. Specifically, there are some impostors who have unusually good success at impersonating many different target speakers. These are called ``wolves''. Also, there are some target speakers who seem unusually susceptible to many different impostors. These are called ``lambs''. The purpose of these appellations is to point to the need to study and understand these speaker inhomogeneities. For an initial effort in this direction, see (Doddington et al., 1998).

Another profitable direction for speaker selection would be to focus research on idiosyncratic voice differences. Without special selection, many if not most of the salient differences in voices will be broad differences such as age, size and dialect. In order to focus research effort on fine idiosyncratic voice differences, it would be helpful to select a population of speakers who all share the same gross physical and linguistic characteristics.

2.5.2.4. Data collection. For target speakers, test data should be collected in many different sessions. Performing multiple tests from data collected in a single session is of limited value at best, because of the strong correlation of speech characteristics within a single session. Test data should also be processed chronologically, so that test data always follows training data.

Impostor trials are easier to come by if it is possible to use a single impostor test segment to test against multiple target speakers. Because of this multiplicative effect, it may be possible to run an extremely large number of impostor trials. The impulse to do this should be resisted, because the statistical significance of the results is determined in a fundamental way by the number of target speakers, not simply by the number of trials performed on them. So running an excessive number of impostor trials is just wasteful.

By convention and common sense, cross-sex impostor trials are to be avoided. Although allowing cross-sex trials will improve the apparent performance of a system somewhat, it confounds comparison of performance with other systems and it risks straining the credulity of sponsors, colleagues and bystanders.

3. The 1998 evaluation: objectives, data, participants

This section summarizes the protocols for the 1998 NIST evaluation (NIST, 1998) and offers some contrasts with those for the 1997 evaluation (NIST, 1997). We discuss the overall technical objective, the evaluation metric and how results were presented, the evaluation data set, the training and test conditions evaluated and the participating sites. The official plans for these evaluations are posted on the NIST Spoken Natural Language Processing group's website (NIST, 1997, 1998).


3.1. Technical objective

The NIST evaluations have focused on the task of speaker detection (or equivalently, speaker verification). That is, the task has been to determine whether a specified target speaker is speaking during a given speech segment.

This task has been posed in the context of conversational telephone speech and for limited training data. The evaluations are designed to foster research progress with the following goals:
· exploring promising new ideas in speaker recognition;
· developing advanced technology incorporating these ideas;
· measuring the performance of this technology.
Speaker detection performance has been evaluated by measuring the correctness of detection decisions for an ensemble of speech segments. These were segments selected to represent a statistical sampling of the conditions of evaluation interest. For each of these segments a set of target speaker identities was assigned as test hypotheses. Every hypothesis was to be judged as either true or false. Each decision had to be based only upon the specified test segment and target speaker. Use of information about other test segments and/or other target speakers has not been allowed.

3.2. Evaluation measure

The formal evaluation measure was the detection cost function (see Section 2.4.2.4), defined as a weighted sum of the miss and false alarm error probabilities:

$$C_{\rm det} = c_{\rm miss} \cdot E_{\rm miss} \cdot P_{\rm target} + c_{\rm fa} \cdot E_{\rm fa} \cdot (1 - P_{\rm target}).$$

The parameters of this cost function are the relative costs of detection errors, c_miss and c_fa, and the a priori probability of the target, P_target. The evaluations have used the following parameter values:

$$c_{\rm miss} = 10, \quad c_{\rm fa} = 1, \quad P_{\rm target} = 0.01.$$

In addition to the (binary) detection decision, a decision score was also required for each test hypothesis. The decision scores were used to produce DET curves (see Section 2.4.3), showing the trade-off of misses and false alarms.
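Substituting these parameter values, the cost actually being minimized reduces to a fixed weighting of the two error rates (a simple arithmetic expansion of the formula above, not an additional definition from the evaluation plan):

$$C_{\rm det} = 10 \cdot E_{\rm miss} \cdot 0.01 + 1 \cdot E_{\rm fa} \cdot 0.99 = 0.1\,E_{\rm miss} + 0.99\,E_{\rm fa}.$$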

3.3. Evaluation data

The speech data came from phase 1 of the Switchboard-2 corpus in 1997 and from phase 2 in 1998. Switchboard-2 is a corpus of conversational telephone speech that was created using a collection paradigm similar to that of the original Switchboard corpus. These corpora are available from the Linguistic Data Consortium.¹

As with Switchboard, Switchboard-2 speakers were assigned topics to discuss with another speaker, whom they did not know. In contrast to the earlier corpus, however, they were given permission to ignore the assigned topic and in most cases they chose to do so. These speakers were generally college students, considerably younger on average than the Switchboard speakers, and much of the dialogue is dominated by what may be characterized as ``college chit-chat''. The phase 1 speakers were primarily from the northeastern United States, while the phase 2 speakers were primarily from the American midwest.

The evaluation data contained about 400 target speakers in 1997 and about 500 in 1998. There were about 5000 test segments per duration both years. Ten or eleven targets were specified for each segment to be tested against.

New in 1998 was the creation process of the test and training segments. An automatic speech detector determined where the speech signal was present in a conversation. NIST manually audited these segments in 1997, rejecting ones that did not meet the acceptance criteria (rules designed to eliminate non-speech segments). Then the appropriate amount of speech was concatenated together, eliminating silences. It was determined in a special test after the 1997 evaluation that removing the auditing for non-speech segments had little effect on recognition performance. For 1998, therefore, the segment selection process was essentially entirely automatic, with human verification only that the speakers were correct. Segments that were noisy or lacking in speech (see Section 5.5) were noted for possible later analysis, but were not removed.

¹ Linguistic Data Consortium (LDC), Philadelphia, PA, USA (www.ldc.upenn.edu).

The development data set for the 1997 evaluation was the data used in the 1996 evaluation, while for the 1998 evaluation the 1997 data served this role. These three data sets, which may be used as kits for research efforts, are now available from the LDC.

3.4. Evaluation conditions

The 1998 evaluation officially consisted of nine different tests. Participating sites could choose to do some or all of these, but were required to process all segments and targets specified in each test. The nine tests were defined by three different training conditions and by test segments of three different durations. Each test was run separately for male speakers and test segments and for female speakers and test segments. There were no cross-gender tests.

3.4.1. Training
There were three training conditions for each target speaker in the 1998 evaluation. The three conditions were:
· ``One-session'' training. The training data consists of 2 minutes of speech data taken from only one conversation.
· ``Two-session'' training. The training data consists of 1 minute of speech data taken from each of two different conversations collected using the same telephone number (and presumably the same telephone handset).
· ``Two-session-full'' training. The training data consists of all speech data available in the two conversations used for two-session training.

The 1997 evaluation, it may be noted, used the first two of these training conditions. It did not use the two-session-full training condition, but it also included the following condition:
· ``Two-handset'' training. The training data consists of 1 minute of speech data taken from each of two different conversations collected using different telephone numbers (and thus presumably different telephone handsets).
Table 1 includes a summary of the training conditions.

3.4.2. Test
Performance was computed and evaluated separately for female and male target speakers and for the three training conditions. The test segments included in the evaluation were organized by sex and by duration. The evaluation plan specified three factors of interest with respect to which performance would be examined. They were:
· Test segment duration. Performance was computed separately for the 3, 10 and 30-s test segments.

Table 1
Training and test conditions in NIST 1997 and 1998 speaker recognition evaluations

Condition                     | Description                                                         | Notes
Training (a)
  One-session                 | 2 minutes of speech from one conversation                           | 1997 and 1998
  Two-session                 | 1 minute of speech from each of two same-number conversations       | 1997 and 1998
  Two-session full            | All speech data from above two conversations                        | 1998 only
  Two-handset                 | 1 minute of speech from each of two different-number conversations  | 1997 only
Test (a)
  Sex                         | Male and female                                                     | No cross-sex tests
  Duration (approximate)      | 3, 10 and 30-s segments                                             | 3-s is a subsegment of the 10-s, which is a subsegment of the 30-s
  Same/different phone number | True speaker training versus test segment                           | Phone number serves as proxy for handset
  Same/different handset type | Model speaker training versus test segment                          | Based on MIT handset labeler

(a) Data consists of concatenated speech segments derived from an energy-based speech detector.


· Same/different number. Performance was computed separately for test segments that used the same phone number as was used for training the true speaker versus those segments which used a different phone number.

· Same/different handset type. Performance was computed separately for different number test segments with the same handset microphone type label (either electret or carbon-button, as discussed below) as the true speaker training data versus those segments with a different handset microphone type label.

The test segments came from conversation sides different from those used for training. Only one segment of each duration was selected from a conversation side. The 10-s segment was a subsegment of the 30-s, and the 3-s segment was a subsegment of the 10-s. For convenience, however, they were provided as separate files in the test data.

The 1997 evaluation had focused on different number tests and the same/different type of each test segment was kept unknown to the system. The 1998 evaluation sought to de-emphasize handset differences and thus focused on same number tests. The system was informed of whether each test segment involved a same or different number test with respect to the true speaker. Note that all tests involving a speaker other than the true speaker of a test segment (non-target tests) necessarily involved different handsets in training and test.

Also in 1998, and in contrast to 1997, systems were told the classification of all training and test handsets as being of either carbon-button or electret microphone type, as determined by the MIT Lincoln Laboratory handset classifier. The importance of handset type to recognition performance had been observed in the previous evaluation (Section 5.3).

Table 1 includes a summary of the test conditions.

3.5. Participants

Twelve research sites participated in the 1998 evaluation. Five of the European sites, while submitting results for separate systems, worked cooperatively in what was called the ELISA Consortium. The types of systems developed by many of these sites are discussed in Section 4. The performance results presented in Section 5 do not, however, identify the individual sites. The sites and their designations (* indicates ELISA Consortium participant) were:
· A2RT: Department of Language and Speech, Nijmegen University, Netherlands.
· BBN: BBN Technologies, GTE, Cambridge, MA, USA.
· CIRC*: Circuits and Systems Group, Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland.
· Dragon: Dragon Systems, Newton, MA, USA.
· ENST*: Ecole Nationale Superieure des Telecommunications, Paris, France.
· IDIAP*: Institut Dalle Molle d'Intelligence Artificielle Perceptive, Martigny, Switzerland.
· IRISA*: Institut de Recherche en Informatique et Systemes Aleatoires, Rennes, France.
· LIA*: Laboratoire Informatique, Universite d'Avignon et des Pays de Vaucluse, France.
· LIP6: Laboratoire d'Informatique de Paris 6, Universite Pierre et Marie Curie, France.
· MIT-LL: MIT Lincoln Laboratory, Lexington, MA, USA.
· OGI: Oregon Graduate Institute of Science and Technology, Portland, OR, USA.
· SRI: Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA.
While not a participant in the 1998 evaluation, Ensigma, located in Chepstow, UK, was represented at the review workshop following the evaluation since it had participated in previous evaluations and presented some performance results on its then current speaker recognition system. Its system is discussed along with those of the evaluation participants in Section 4.

4. The 1998 evaluation: technology overview

In this section we provide a high-level overview of the recognition technology employed by participants in the 1998 NIST evaluation. The aim is to outline the general technology trends of state-of-the-art, text-independent speaker recognition systems in the areas of features, speaker models and score normalization. We also discuss the use of fusion systems and potential future research directions. It is outside the scope of this article to describe the different approaches in detail, for which the reader is directed to the referenced papers or cited evaluation participant. In addition, association of a site with a particular technique does not imply that they invented the approach, merely that this is what they used in their eval98 system.

Based on a presentation at the RLA2C Workshop, the material in this section has not been published previously.

4.1. Canonical recognizer structure

From a general perspective, the approach used by all systems in the NIST speaker recognition evaluation is that of a likelihood ratio detector. The canonical structure of this general system is shown in Fig. 1 and consists of three main components. The first is the front-end processing, which covers the signal processing used to extract features for model training and recognition, together with any signal-related channel compensation to minimize the effects of different channels on the features. The second component is the target speaker and background models used in forming the likelihood ratio statistic for a test utterance. The final component is a post-processing step of score normalization, usually applied to stabilize detection scores so that speaker-independent thresholds are more effective. Although described as distinct components, in some systems these components can be integrated, thus blurring the category to which they belong. For example, channel compensation techniques can and do occur in front-end processing, in model training and in score normalization. We describe the trends in these different areas.
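To make this three-component structure concrete, the following is a minimal sketch of such a likelihood ratio detector in Python. The function names, the model scoring interface and the normalization parameters are illustrative assumptions, not the implementation used by any evaluation participant.

```python
import numpy as np

def detection_score(features, target_model, background_model, norm_mean=0.0, norm_std=1.0):
    """Canonical likelihood-ratio detector score for one test utterance.

    features         : (n_frames, n_dims) array produced by the front-end processing
    target_model     : object with score_samples(X) -> per-frame log-likelihoods
    background_model : object with the same interface (e.g. a speaker-independent GMM)
    norm_mean/std    : score-normalization parameters (e.g. from znorm/hnorm)
    """
    # Average per-frame log-likelihoods under the target and background models.
    ll_target = np.mean(target_model.score_samples(features))
    ll_background = np.mean(background_model.score_samples(features))

    # Log-likelihood ratio statistic.
    llr = ll_target - ll_background

    # Post-processing: normalize so a speaker-independent threshold can be applied.
    return (llr - norm_mean) / norm_std

def decide(score, threshold=0.0):
    """Accept the target-speaker hypothesis if the normalized score exceeds the threshold."""
    return score > threshold
```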

4.2. Speech features

Of the major system components presented in Fig. 1, the greatest commonality of approach among the participating sites was seen in the front-end processing. A majority of systems employed the following standard set of front-end processing steps (a minimal sketch of this standard front end follows the list):
· Signal bandlimiting. Systems used only spectral information from the frequency range 300–3400 Hz, the voice band of the telephone signal.
· Cepstral feature extraction. Most systems used filterbank magnitude spectral representations followed by transformation to the cepstral domain using the discrete cosine transform (DCT). A few used linear predictive coefficient (LPC) spectral representations followed by recursive expansion of the cepstral coefficients.
· Cepstral derivatives. Generally, first-order delta cepstral features were appended to the static cepstral feature vector. One system used delta cepstra as an independent feature stream. Second-order derivatives were not found to provide measurable improvements.
· Cepstral mean subtraction. Primarily non-causal cepstral mean subtraction was performed over the entire training or test file. A few systems used causal cepstral mean subtraction (e.g., in the form of RASTA filtering).
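As a concrete illustration of these steps, here is a hedged, minimal sketch of such a front end in Python (NumPy/SciPy). The frame length, the direct use of FFT bins instead of a mel-spaced filterbank, and the cepstral order are arbitrary illustrative choices, not the settings of any eval98 system.

```python
import numpy as np
from scipy.fft import dct

def telephone_cepstra(signal, fs=8000, frame_len=200, hop=80, n_ceps=12):
    """Toy cepstral front end: bandlimited log spectrum -> DCT cepstra -> deltas -> CMS."""
    window = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))

        # Signal bandlimiting: keep only the 300-3400 Hz telephone voice band.
        band = spectrum[(freqs >= 300) & (freqs <= 3400)]

        # Cepstral feature extraction: log magnitude spectrum followed by a DCT.
        frames.append(dct(np.log(band + 1e-10), norm='ortho')[:n_ceps])
    ceps = np.array(frames)

    # Cepstral derivatives: first-order deltas appended to the static features.
    deltas = np.gradient(ceps, axis=0)
    feats = np.hstack([ceps, deltas])

    # Non-causal cepstral mean subtraction over the whole file.
    return feats - feats.mean(axis=0)
```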

A few different and new approaches related to front-end processing were used in eval98 systems. These included (with the site using them in square brackets):

Fig. 1. Canonical structure of a speaker recognition system. The general approach used by all systems in the NIST eval98 was that of a likelihood ratio comparison detector consisting of three main components: front-end processing, target and background modeling, and score normalization.



· Nonlinear discriminant analysis (NLDA) features [SRI] (Konig et al., 1998). Here features are derived from the hidden node outputs of a neural network trained to separate target and non-target cepstral features.
· Pitch prosodics [SRI]. In addition to using the pitch frequency derived from each short-time window as an additional feature for the speaker model, the approach was aimed at modeling both the static and dynamic pitch contour information.
· Modulation spectral filtering [OGI] (van Vuuren and Hermansky, 1998). This approach filtered the time sequence from each spectral filterbank output to suppress the speaker-independent information in the spectral features.
· Speaker-independent features and speaker mapping [OGI] (Hermansky and Malayath, 1998). This approach is a tight integration between the feature extraction and the speaker modeling components. In general, a form of principal component analysis was used to derive features conveying speaker-independent, linguistic information from cepstral features, and both were used to train a speaker-specific mapping function to characterize the speaker-dependent portion of the feature space.

Overall, little new work on features was deployed in eval98 systems. The NLDA features did provide a small but consistent gain over baseline cepstral features. Pitch prosodics, while not powerful by themselves, did boost performance when combined with cepstral feature scores. The other new features showed no improvement over standard cepstra.

4.3. Speaker models

The approaches to speaker modeling used by the various eval98 systems can be broadly categorized into two approaches based on the representation detail employed. The first category is termed unsupervised acoustic representation, in which a speaker's acoustic characteristics (as reflected in the sequence of feature vectors extracted from his/her speech signal) are modeled by a single model with only implicit representation of underlying acoustic classes (such as broad-category sounds like vowels or fricatives). This is the case, e.g., when a speaker representation is a model of the probability distribution of his/her features; the model represents the agglomeration of all speaker-specific sounds with no explicit modeling of individual sounds or sound classes.

The second category is termed supervised acoustic representation, in which explicit segmentation, labeling and representation of underlying acoustic classes are used to model a speaker's acoustic characteristics. A system of this type would use an external labeler, such as a phone recognizer, to segment and label a speaker's training speech so that speaker-specific models of each acoustic class (phoneme) could be trained. During recognition, the same external labeler would segment and label the test speech so that labeled test speech segments could be compared to the speaker-specific label models.

In addition to the speaker model representation, how the background model is constructed and how the target speaker model and background model scores are combined to produce the likelihood ratio statistic are key to system performance. In Sections 4.3.1 and 4.3.2 we outline the specific modeling approaches for both types of representations, detailing the construction of the background model and the likelihood ratio statistic. As above, we note the specific site using each approach in square brackets.

4.3.1. Unsupervised acoustic representations

As expected for a text-independent task, unsupervised acoustic models were the predominant models used in systems. Of these, Gaussian mixture models (GMMs), especially adapted GMMs, were the models most often used, primarily due to their modest computational requirements and consistently high performance.
· Adapted GMMs [MITLL, Dragon, OGI, SRI] (Reynolds, 1997b). For this approach, the background model is a single, speaker-independent GMM with 1024–2048 mixtures, and a target speaker model is derived from the background model using Bayesian adaptation. The likelihood ratio statistic is then simply the ratio of the target to background model likelihood scores for a test utterance. This model has been the most widely used by sites over the past few NIST evaluations. (A minimal sketch of this adapted-GMM likelihood ratio approach is given after this list.)

· Unadapted Gaussian mixture models [BBN, CIRC, ENST, IRISA] (Reynolds, 1995). In this approach, each speaker is represented by an independent GMM, typically with 32–128 mixtures, derived via maximum likelihood estimation from the speaker's training data. The background model is either a single, speaker-independent GMM or a collection of speaker-dependent GMMs termed cohorts, likelihood-ratio sets or background sets. When using background sets, the background score for a test utterance is typically the arithmetic or geometric average of each set member's likelihood score. The likelihood ratio statistic is the ratio of the target to background likelihood scores for a test utterance.

· Ergodic hidden Markov models [A2RT] (Jaboulet et al., 1998). For this approach, the background model is a single, speaker-independent, 4-state, 32-mixture-per-state ergodic HMM, and a target speaker model is derived from the background model using adaptation. The main difference between this model and the adapted GMM is that the ergodic HMM explicitly models transition probabilities between hidden states. The likelihood ratio statistic is then simply the ratio of the target to background model likelihood scores for a test utterance.

· Unimodal Gaussian models [LIA] (Besacier and Bonastre, 1998). In this approach, each speaker is represented by an independent, unimodal, full-covariance Gaussian model, with the background model being a speaker-independent, unimodal, full-covariance Gaussian as well. In the implementation used in the eval98 system, this simple model was augmented by more complicated processing, such as using a sequence of fixed-length segments from each test utterance and a parallel bank of multi-band recognizers. The per-segment, per-band likelihood ratio was simply the ratio of target to background likelihood scores; the final test utterance score was a more complicated merging of the individual likelihood ratio scores.

· Autoregressive vector models [LIP6] (Montacie and Le Floch, 1992). In this approach, each speaker is represented by an independent ARVM, which is a predictive model of the sequence of spectral feature vectors. A collection of ARVMs from non-target speakers is used as the background model. The likelihood ratio score is the difference between the target score and the best-scoring background ARVM score on a test utterance.
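The following is a minimal sketch of the adapted-GMM idea, assuming a speaker-independent background GMM trained with scikit-learn and a simple mean-only Bayesian (MAP) adaptation with a fixed relevance factor; the mixture count, relevance factor and the choice to adapt only the means are illustrative simplifications rather than the configuration of any particular eval98 system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_background_model(background_features, n_mix=256):
    """Speaker-independent background GMM (eval98 systems used 1024-2048 mixtures)."""
    ubm = GaussianMixture(n_components=n_mix, covariance_type='diag')
    ubm.fit(background_features)
    return ubm

def adapt_target_model(ubm, target_features, relevance=16.0):
    """Mean-only MAP adaptation of the background GMM to a target speaker."""
    post = ubm.predict_proba(target_features)              # (n_frames, n_mix) responsibilities
    n_k = post.sum(axis=0)                                  # soft frame counts per mixture
    ex_k = post.T @ target_features / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]              # data-dependent adaptation weight

    target = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    target.weights_ = ubm.weights_                          # weights and covariances copied
    target.covariances_ = ubm.covariances_
    target.precisions_cholesky_ = ubm.precisions_cholesky_
    target.means_ = alpha * ex_k + (1.0 - alpha) * ubm.means_
    return target

def llr_score(test_features, target, ubm):
    """Average-log-likelihood ratio of target to background model for a test utterance."""
    return np.mean(target.score_samples(test_features)) - np.mean(ubm.score_samples(test_features))
```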

4.3.2. Supervised acoustic representations

The use of supervised acoustic models has made a strong showing in this and previous evaluations. The major limitation for text-independent applications is in the accuracy and consistency of the labeler. Under the more controlled conditions of text-dependent or vocabulary-constrained applications, such systems are indeed the approach of choice. With the addition of better labelers and new scoring techniques, these approaches may outperform the unsupervised approaches for text-independent tasks.
· Large vocabulary continuous speech recognizer [Dragon] (Gillick et al., 1993). In this approach, the labeler is a speaker-independent LVCSR system segmenting and labeling phoneme units. For each label, a speaker-independent monophone unit (3 states/128 mixtures) is used as a background model, and target label models are derived from these background label models using Bayesian adaptation. Likelihood ratios for each label are computed as the ratio of target to background label model likelihood scores, and the sequence of label likelihood ratios is integrated over the entire test utterance. In addition to using standard HMM-based techniques to compare training and test label segments, Dragon also examined a model-free, sequential, non-parametric (SNP) technique to compare label segments from training and testing speech. When combined with their adapted GMM system scores, the SNP approach showed a significant improvement in performance.

· Broad phonetic class recognizer [BBN, Ensigma] (Carey and Parris, 1998). In the Ensigma system, a speaker-independent, ergodic, broad-class phonetic recognizer with 28 phonetic classes is used for segmentation and labeling. Background label models are 3-state/3-mixture HMMs, and target label models are derived from the background label models via adaptation. The likelihood ratio score is computed not from the acoustic likelihood scores, but from the ratio of the number of target label matches to the number of background label matches from a Viterbi decoding of the test utterance. In addition to the standard technique of only scoring test speech against models derived from training speech, Ensigma examined a symmetric scoring technique that also scored training speech against models derived from test speech, with some significant improvement. In the BBN system, a full speech recognition system was used to transcribe the speech into words, and the words were then expanded into 53 phonemes using a dictionary expansion. The output phone sequence was then used as the input to a secondary discrete HMM to model the speaker-dependent characteristics of the phoneme sequence. By itself this approach was not very effective, and even when combined with GMM acoustic scores it showed no improvement in performance.

· Temporal decomposition with neural nets [CIRC] (Hennebert and Petrovska-Delacretaz, 1998). In this approach, the labeler is a blind temporal-decomposition segmentation followed by a vector quantization labeling of segments into 8 label classes. Unlike the LVCSR or phonetic class approaches, the labels in this approach are acoustically rather than linguistically defined units. Label models are 3-layer multi-layer perceptrons (5-frame input, 20 hidden nodes and 2 output nodes) trained to discriminate between label-class features from target and non-target speakers. This system represents the only pure discriminant classifier used in the evaluation (SRI fielded an adapted GMM system that used secondary discriminant training). The test utterance score is the average of the MLP outputs over all observed label segments in the utterance.

4.4. Score normalization/microphone compensation

One of the major challenges for the NIST speaker recognition evaluations is dealing with the channel variability found in telephone speech, primarily variability and distortions imposed by different microphones. While standard channel compensation techniques applied to spectral features, such as cepstral mean subtraction, do help, there is still a significant gap between performance under matched and mismatched microphone conditions. Three general approaches have emerged from the NIST evaluations to address this problem.

4.4.1. Znorm/hnorm

For the NIST evaluation, systems were required to produce scores for which speaker-independent thresholds could be applied. Although likelihood ratio scoring does provide relatively stable speaker-independent scores, there still exist biases from the data and models that impart speaker dependency on model scores. In the znorm compensation approach, these speaker-dependent score biases and spreads are estimated by observing the distribution of scores produced when a speaker model scores non-target development speech. During testing, a model score is adjusted using its estimated bias and spread parameters. It has further been observed that the microphone type (carbon-button or electret) used during training of a model can also induce strong biases on scores. For example, a model trained using speech from an electret microphone will tend to produce higher likelihood ratio scores for speech from electret microphones regardless of the speaker. In an extension of znorm, a technique called hnorm was developed which estimates speaker- and handset-dependent bias and spread parameters for compensation. The application of hnorm requires that putative handset labels be available for development, training and testing data. For eval98, NIST supplied handset labels to participants for all data. A majority of eval98 systems used either znorm or hnorm score normalization (Reynolds, 1997a,b).
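This normalization step can be summarized with the following minimal sketch, assuming each speaker model has already been scored against a set of non-target development utterances (for hnorm, one such set per handset type); the function names and the dictionary layout are illustrative assumptions.

```python
import numpy as np

def estimate_norm_params(model, dev_utterances, score_fn):
    """Znorm: estimate a model's score bias (mean) and spread (std) on non-target speech."""
    scores = np.array([score_fn(model, utt) for utt in dev_utterances])
    return scores.mean(), scores.std()

def estimate_hnorm_params(model, dev_utterances_by_handset, score_fn):
    """Hnorm: estimate bias/spread separately for each handset type (e.g. 'carbon', 'electret')."""
    return {handset: estimate_norm_params(model, utts, score_fn)
            for handset, utts in dev_utterances_by_handset.items()}

def hnorm(raw_score, handset_label, hnorm_params):
    """Normalize a raw likelihood-ratio score using the handset-dependent parameters."""
    mean, std = hnorm_params[handset_label]
    return (raw_score - mean) / max(std, 1e-10)
```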

4.4.2. Handset type mapping

To compensate for handset differences in the speech signal domain, MITLL (Quatieri et al., 1998) developed and applied a non-linear mapper for mapping electret speech into carbon-button speech and vice versa. The mapper was applied during training to synthetically augment single-handset training data, and during testing to map test speech from one microphone into speech from the training microphone. While showing improvement over a baseline system, this approach did not work as well as hnorm on the eval98 test set.

4.4.3. Handset type consideration in background models

Several sites used the provided handset labels to select the data for background model training or the members of cohort sets (Heck and Weintraub, 1997). For a single speaker-independent background model, a balance of electret and carbon-button speech was used to derive a microphone-independent model, or separate microphone-dependent models were trained and associated with target models of the same microphone type. For cohort sets, microphone-balanced speakers were selected and microphone-dependent sets were used for target speakers.

4.5. Fusion systems

In this evaluation, several sites showed some improvement over a baseline system by simple linear combination of scores from different systems. For example, Dragon showed substantial performance improvement by combining scores from their adapted GMM system and their SNP system; SRI found additional gain over baseline performance by combining scores from their systems using cepstral, NLDA and prosodic features; and Ensigma showed performance improvement by combining forward and backward scores in their symmetric scoring system.

Both the ELISA consortium and NIST performed true cross-site system fusion. IDIAP and ENST collaborated in using a Dempster–Shafer-based fusion system to combine scores from consortium systems. The combined system, unfortunately, showed no improvement over the best consortium system. NIST examined two fusion systems combining scores from all eval98 participants. The first NIST fusion system was a simple voting system which looked at the hard decision outputs from each system and produced a score which was the average number of ``true'' decisions.

This fused system did show improved performance over the best individual system in the low false-alarm region of the DET curve. The second NIST fusion system utilized an MLP (multi-layer perceptron). Since no development data was available to find optimal MLP parameter settings, a jackknife experiment was conducted for robust parameter selection while testing on the entire test set. This fused system too had mixed results compared to the best individual system, doing better in some operating regions and worse in others (see Section 5.1 and Fiscus, 1997).

5. The 1998 evaluation: results

The 12 research sites participating in the 1998 evaluation submitted results for a total of 27 systems. NIST produced scoring results consisting of DET curves for the specified training and test conditions and for portions of the test data corresponding to various factors of interest. Some of these are described below. These and some additional results were presented at the RLA2C Workshop (Przybocki and Martin, 1998a) and are available on the NIST website (NIST, 1998).
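For readers unfamiliar with DET curves, the sketch below shows one common way to compute the miss and false-alarm rates that such curves plot (on normal-deviate scales), given arrays of target and non-target detection scores; it is an illustrative computation, not the NIST scoring software.

```python
import numpy as np

def det_points(target_scores, nontarget_scores):
    """Miss and false-alarm probabilities swept over all decision thresholds.

    A score at or above the threshold counts as an 'accept'. DET plots show
    these two error rates against each other on normal-deviate (probit) axes.
    """
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

# Example with synthetic scores: targets score higher on average than non-targets.
rng = np.random.default_rng(0)
p_miss, p_fa = det_points(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 5000))
```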

NIST analyzed various conditions that could be affecting recognition performance, including sex, age, pitch, handset type, noise, numbers of calls made, dialect region and channel. For each of these conditions, we partitioned the training, target test and/or non-target test data based on condition-relevant values and analyzed the results.

It has been NIST policy in the speaker recognition evaluations not to publicly rank the performance results of the different participating sites. It is for this reason that the DET curves presented in this section do not identify the systems being considered. Specific information about the performance of individual systems may be available from researchers at the participating sites.

5.1. Overall performance

Fig. 2 shows the DET curves for one of the test conditions, namely, that of two-session training, 30-s durations, with same number tests. (In general, we concentrate on the two-session training, 30-s duration results in what follows.) In addition to results for the 12 primary systems (system identities have been suppressed), it also includes two NIST-created fusion systems.

The NIST12 system is a simple voting system. We combine the hard decisions of the 12 primary systems, assigning the segment score as the number of TRUES minus the number of FALSES. The hard decision is true if the segment score is greater than zero. This approach produces a modest benefit compared to the other best systems, as may be seen in Fig. 2. The NIST12 system is plotted as a set of discrete points, since it has only a limited number of operating points.
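A minimal sketch of this kind of hard-decision voting fusion is shown below; the input format (one Boolean accept decision per system per trial) is an illustrative assumption.

```python
import numpy as np

def vote_fusion(decisions):
    """Hard-decision voting fusion.

    decisions : (n_trials, n_systems) Boolean array of per-system accept decisions.
    Returns the fused segment scores (TRUEs minus FALSEs) and the fused decisions.
    """
    decisions = np.asarray(decisions, dtype=bool)
    n_systems = decisions.shape[1]
    scores = decisions.sum(axis=1) - (n_systems - decisions.sum(axis=1))
    return scores, scores > 0

# Example: three trials fused across four hypothetical systems.
scores, fused = vote_fusion([[True, True, False, True],
                             [False, False, True, False],
                             [True, False, True, False]])
```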

The NISTmlp system is an attempt to extend the ideas used in NIST's system (Fiscus, 1997) for combining results from multiple speech recognition systems. It uses the MIT Lincoln Laboratory LINKnet software to create a multi-layer perceptron that combines the likelihood scores from the 12 primary systems. As may be seen in both the bar chart and DET curve of Fig. 2, the NISTmlp system outperforms the other systems represented.

5.2. Training conditions and duration

Figs. 3–12 show DET plots of performance restricted to certain conditions for one of the systems in the 1998 evaluation. For all of these plots, the one system was chosen to be broadly representative of the results for all systems included in the evaluation. It should be understood, however, that in each case there is considerable variation in performance results across systems. The general trends that are noted are stronger for some systems than for others, but are not strongly reversed in the performance of any of the systems.

Fig. 3 shows the variation in performance by training condition for one system.

Fig. 2. DET curves for the primary systems of the 12 participants and two NIST fusion systems, processing the 1998 NIST speaker recognition evaluation data. Results shown are from same number, 30-s test segments using the two-session training condition.



Fig. 4. System performance is shown as a function of test segment duration under the same training condition (two-session). It is seen that the longer the test segment, the better the performance, however with diminishing returns.

Fig. 3. System performance is shown for each training condition processing the same test data (same number tests of 30-s durations). The DET plot is on the left. On the right for contrast is a traditional ROC plot of the same data. It is seen that multiple training sessions using the same amount of data improves performance. Additional training data improves performance slightly.



The performance differences are in the expected direction. Using training data from two different conversations improves performance over using the same amount of data from only one conversation. With these same two conversations, but using all available data, thus increasing the total amount of training data by an average factor of 2.7, there is a rather smaller improvement. There is some variation across systems, with at least one showing no gain from the increased amount of training data. Thus, multiple session training, if the human factors of the application permit it, may significantly benefit performance.

Fig. 4 shows the variation in performance by duration for one system, with longer durations producing better performance, as expected. This effect was more pronounced in some systems and less in others, but was always present. There is more gain going from 3 to 10 s than from 10 to 30 s, suggesting that the benefits from increased duration are limited beyond a certain point.

5.3. Handset effects

Fig. 5 shows the performance variation for one system on same number and different number tests. Since specific handset information was not available, same or different number is assumed to correspond to same or different handset, though there are undoubtedly exceptions. Note the very large performance difference between same and different numbers. A large difference occurs for all systems.

Analysis following the 1997 evaluation showed that one key source of difficulty for systems is the use of telephone handsets of different types (carbon-button or electret microphone). This analysis was made possible by one participant, MIT Lincoln Laboratory, making available to NIST its software

Fig. 5. System performance is shown for same and different phone number test segments for the two-session training condition, 30-s test segments. It is seen that phone number (and presumably handset) matching between training and test greatly improves performance.



that attempts to classify speech segments as coming from either carbon or electret handsets. In 1998, systems were given this handset classification information along with the test data. They could use this information as they wished, aware that it could be less than totally accurate. The results presented here, however, assume the correctness of this MIT classifier.

Fig. 6 shows the 1998 DET curves for the system used in Fig. 5, with the different number tests broken down according to whether the training handset microphone type (electret or carbon-button) of the speaker (true or otherwise) being tested matches that of the test segment. Note the large gap in performance between the matched and mismatched cases. Again, this is a performance variation found in all systems in the evaluation.

Fig. 7 breaks apart the two curves of Fig. 6 according to the specific handset types involved. Thus, for the matched case it shows results for electret training and test and for carbon training and test. For the mismatched case, it shows results for carbon training and electret test and for electret training and carbon test. The numbers of tests included, both target (true speaker) and non-target (non-true speaker), are indicated. The major point to note here is that performance is better when the test segments are electret than when they are carbon. This held true for all systems.

5.4. Sex and pitch

It has been NIST policy in recent evaluations, as discussed in Section 2, not to include any cross-sex tests, i.e., male hypothesized speakers when a female is actually talking or vice versa. Thus, each test is really two separate tests, one on male speakers and one on female speakers.

Fig. 6. System performance is shown for the two-session training condition, 30-s test segments, when the test segment comes from a phone number that was not used for training. It is seen that handset type matching between training and test greatly improves performance.



It is natural to ask if there is a performance difference between the two.

The ``all'' lines in Figs. 9 (for 1998) and 10 (for 1997) suggest that the performance difference between the sexes is fairly small, but with a trend toward better results for males. These figures show results for one particular site in each of these evaluations, but the same trend was observed for almost all systems.

We also examined in several ways the effect of speaker average pitch on performance. Since there were some differences, we here look at both 1997 and 1998 results. We obtained average pitch estimates using the Entropic 2 software ``get_f0'' and ``pplain'' functions.

The switchboard-2 corpus contains a larger percentage of high-pitched (and young) speakers, especially among females in phase 1 (1997 data). Fig. 8 shows the distributions of average pitch frequencies of the training data among the chosen female and male target speakers in 1997 (approximately 400 total speakers) and 1998 (approximately 500 total speakers).

We chose to look at specific high- and low-pitched subsets of the speakers for both males and females. We examined performance on same number tests when both targets and non-targets were restricted to the high or low 25% of average training pitch frequencies for each sex, as well as overall performance for each sex. (The vertical lines in Fig. 8 show these 25th and 75th percentile pitch values for the distributions.) Fig. 9 shows these results for 1998, and Fig. 10 gives similar results for 1997, for the systems of one particular site.

2 Entropic Research Laboratories, 600 Pennsylvania Ave. SE, Suite 202, Washington, DC 20003, USA.

Fig. 7. System performance is shown as a function of handset type for both matched and mismatched tests. The test was from the 1998 NIST speaker recognition evaluation using the two-session training condition with different number 30-s test segments. The general trend is that performance is best with electret test segments.



It may first be noted that performance over all female or male speakers is not consistently better or worse than over the restricted pitch subsets. One might have imagined that restricting the pitch range would make the recognition problem more difficult, but this does not appear to be the case. Average pitch differences thus do not appear to be a major cue for distinguishing speakers.

There are performance differences, however, between high- and low-pitched speakers. In both years performance was better for low-pitched than for high-pitched males. For females, however, the low-pitched speakers did better in 1997 while the high-pitched ones did better in 1998. Performance in these matters tended to be consistent across different systems while being inconsistent between years. Perhaps this has something to do with the large number of high-pitched females in the 1997 test set, or perhaps these are just random effects. More investigation would be of interest.

We also examined partitioning the targets or the non-targets by ``closeness'' in pitch between the training data and the test segment. The more interesting results came from partitioning the targets in this way. Fig. 11 shows these results for same number tests for one system in 1998.

The plot shows performance for all tests and for when target tests (true speaker tests) are restricted to the high or low 10% or 25% of all such tests based on pitch closeness. The results are all in the expected direction, but the big difference of note is the considerable degradation in performance when tests are restricted to those where the model and test pitch values are far apart. For example, for a fixed miss rate in the 2–20% range, the false alarm rate generally doubles between the ``all'' case and the ``25% far'' case for the system used in Fig. 11. This presumably corresponds to situations when, because of a cold or stress or other factors, a speaker does not talk quite as he or she normally does.

Fig. 8. Distributions of average pitch over training data for female and male target speakers in the 1997 (switchboard-2, phase 1) and 1998 (switchboard-2, phase 2) evaluations. The 25th and 75th percentile pitch values of each distribution are shown by the vertical lines. The test set for 1997 contained an unusually large number of high-pitched females. There is not much disparity in the male pitch distribution from year to year.



While varying in magnitude, such a degradation occurs for all systems.

5.5. Other factors

Noise was, not too surprisingly, found to also be a factor that moderately affected speaker recognition performance. Speakers were encouraged to initiate calls from different phones, resulting in a fair number from pay phones in outdoor or other noisy locations. The 10-s segments were manually labeled as ``good'', ``bad'' or ``really bad'', and true speaker test scores were conditioned on this label. (Impostor scores were held the same for all of the conditions.) The results, shown in Fig. 12, demonstrate a modest correlation, with about a factor of 2 increase in error rate for ``really bad'' data over ``good'' data. Further study is needed to categorize which types of noise most impact performance.

Corpus speakers provided their age as part of the enrollment process. Most were of college age (late teens or early twenties), but there were some older speakers. As might be expected, false alarm rates were lower when the target and model speakers were further apart in age, but this effect was quite modest. Factors not found to have a significant effect on performance included how many calls a speaker made, place of birth (perhaps reflecting dialect) and conversation side, i.e., whether the speaker initiated or received the call.

6. Summary and perspective

NIST has supported regular speaker recognition evaluations, open to all, with announced schedules, written evaluation plans and follow-up workshops. These can be an effective means to encourage research and develop state-of-the-art

Fig. 9. System performance is shown as a function of sex and for high- and low-pitched speakers. The upper and lower 25% distributions (shown in Fig. 8) are plotted for the two-session training condition and the same number 30-s test segments.



systems in core technology areas such as speaker recognition. Furthermore, appropriate evaluation methodologies and analysis of results can help illuminate the progress that has been made and identify the factors limiting further advance. NIST intends to use the results of such analysis in designing future evaluations, where these and other factors will be studied further.

Any site or research group desiring to participate in future evaluations should contact Alvin Martin at NIST ([email protected]) and should obtain data from previous evaluations from the LDC to develop and test their systems for the evaluation tasks.

6.1. Future research directions

From an examination in Section 4 of the technologies used in the 1998 NIST evaluation (NIST, 1998), there are some observations we can make about the current status of speaker recognition research and possible future research directions.

In general, new features have provided small but consistent gains in performance. The main goal is to find features that are more immune to the variability and degradations we know affect speaker recognition, such as channel, microphones and acoustic environment. It is highly unlikely that these new features will be derived from the spectrum, since the spectrum is obviously highly affected by the above factors. Non-spectral features, such as pitch, have shown some promise of variability immunity, but are not very robust to other factors like emotion and are in general much less effective in speaker separation.

To date, supervised acoustic representations have not outperformed unsupervised acoustic representations for text-independent applications.

Fig. 10. System performance is shown as a function of sex and for high- and low-pitched speakers. The upper and lower 25% of the distributions (shown in Fig. 8) are plotted for the two-session training condition and the same number 30-s test segments.



These techniques generally require complicated labelers with large computational demands and offer limited returns in performance. Gains are likely to remain limited as long as the labeler is only used as an elaborate means to compute the acoustic likelihood of the data. The real pay-off of these approaches is likely to lie in using the label sequence output to learn about higher levels of information not currently found in, and complementary to, the acoustic score. Exploitation of such high-level information may require some form of event-based scoring technique, since higher levels of information, such as indicative word usage, will likely not occur as regularly as acoustic information does.

One of the largest robustness challenges for a speaker recognition system is dealing with mismatched conditions, especially microphone mismatches. Since most systems rely primarily on acoustic features, such as spectra, they are too dependent on channel information. While this can be a plus for applications where speaker and microphone are very likely to be linked (e.g., facility access control), it is a large impediment to more general application deployment and a fundamental research challenge. It is likely that decoupling of the speaker and channel will come from a better understanding of specific channel effects on the speech signal, since this would lead to the immediate pay-off of better features and compensation techniques.

Fusion of systems may be a means to build on a solid baseline approach and provide the best attributes of different systems. This of course requires the search for complementary information to model: what indications are strong when the

Fig. 11. System performance is shown as a function of pitch closeness between the true speaker training data and the test segments. The results are shown for the two-session training condition on same number 30-s test segments. The high and low 10% and 25% in pitch closeness of all target tests are plotted, along with the curve for all such tests. It is shown that performance degrades when target models and test segments are not close in pitch.



baseline is weak? Furthermore, successful fusion will require ways to adjudicate between conflicting signals and to combine systems producing continuous scores with systems producing event-based scores.

6.2. A technology perspective

Speaker recognition technology has made tremendous strides forward since the initial work in the field some 30 years ago. The perspective then was that, while the research was interesting, the technology would never be practical, because it would take a whole computer to perform the task! Of course this was back when computers cost real money, orders of magnitude more money than now, for a computer that by today's standards would be totally inept. If researchers then could possibly have imagined, and believed in, today's computing and information infrastructure, what would they have thought? Probably that making a business success of speaker recognition would be trivial! But since we are not there yet, it might be good to reflect on why not.

One of the problems that must be faced in the application of speaker recognition for access control is the rejection of valid users. In real systems, it is utterly unacceptable to reject valid users. Therefore, there must be acceptable backup procedures to ensure that valid users are not rejected.

Another stumbling block for automatic speaker recognition is the discrepancy between human and machine performance. Historically, humans have outperformed machines by a wide margin. However, humans and machines give different results and seem to operate quite differently. For example, humans are more robust than machines to distortions and noise. On the other hand, machines have demonstrated greater ability than

Fig. 12. A DET plot for text-independent speaker detection, contrasting true-speaker performance for test segments subjectively classified as ``good'', ``bad'' and ``really bad''.



humans to distinguish the voices of identical twins. Perhaps we are approaching the point where machines overtake humans in speaker recognition ability.

Fig. 13 shows the results of a recent comparison of human versus machine performance on 3-s test segments, made during the 1998 NIST evaluation of text-independent speaker recognition (NIST, 1998). While neither humans nor machines did an impressive job, the difference was not great, about a factor of 2 in error rate. (Note that both same number and different number tests are included in Fig. 13. This is why this figure shows poorer machine performance than that shown in Fig. 4.) For a further discussion of human versus machine performance on a limited subset of the full evaluation test set, see Schmidt-Nielsen and Crystal (1998).

Future directions in speaker recognition technology are not totally clear, but several observations might be helpful. First, computer power will continue to grow exponentially for at least the near (and foreseeable) future. The challenge is clear: figure out how to exploit that power, because it is a safe bet that major advances will demand ever greater computing power and that leading researchers will have figured out how to use it productively. Second, human listeners have a relatively keen ability to recognize familiar voices. (People apparently are much more accurate in classifying familiar voices than unfamiliar voices.) It might therefore be worthwhile to try to understand the nature of this capability and to begin to create models to perform this function by computer. This sounds like a formidable challenge. But at least we have an existence proof to encourage us onward!

Fig. 13. A DET plot for text-independent speaker detection, contrasting human listener performance with machine performance for 3-s test segments. Both same number and different number tests are included in the curves shown.



References

Besacier, L., Bonastre, J.-F., 1998. Time and frequency pruning for speaker identification. In: RLA2C, April, pp. 106–110.
Carey, M., Parris, E., 1998. Cross-validation in speaker recognition. In: RLA2C, April, pp. 161–164.
Doddington, G., 1998. Speaker recognition evaluation methodology: a review and perspective. In: Proc. RLA2C, 20–23 April, Avignon, pp. 60–66.
Doddington, G. et al., 1998. Sheep, goats, lambs and wolves: a statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. In: Proc. ICSLP '98.
Fiscus, J., 1997. Recognizer voting error reduction. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, December.
Gillick, L. et al., 1993. Application of large vocabulary continuous speech recognition to topic and speaker identification using telephone speech. In: ICASSP, April, pp. II-471–II-474.
Heck, L., Weintraub, M., 1997. Handset-dependent background models for robust text-independent speaker recognition. In: ICASSP, April, pp. 1071–1073.
Hennebert, J., Petrovska-Delacretaz, D., 1998. Phoneme-based text-prompted speaker verification with multi-layer perceptrons. In: RLA2C, April, pp. 55–58.
Hermansky, H., Malayath, N., 1998. Speaker verification using speaker-specific mappings. In: RLA2C, April, pp. 111–114.
Jaboulet, C., Koolwaaij, J., Lindberg, J., Pierrot, J.-B., Bimbot, F., 1998. The CAVE-WP4 generic speaker verification system. In: RLA2C, April, pp. 202–205.
Konig, Y., Heck, L., Weintraub, M., Sommez, K., 1998. Non-linear discriminant feature extraction for robust text-independent speaker recognition. In: RLA2C, April, pp. 72–75.
Martin, A. et al., 1997. The DET curve in assessment of detection task performance. In: Proceedings of EuroSpeech '97, September, pp. 1895–1898.
Montacie, C., Le Floch, J., 1992. AR vector models for free-text speaker recognition. In: ICSLP, Banff, Canada, pp. 611–614.
NIST, 1997. Speaker recognition evaluation plan (www.nist.gov/speech/sp_v1p1.htm).
NIST, 1998. Speaker recognition evaluation plan (www.nist.gov/speech/spkrec98.htm).
Przybocki, M., Martin, A., 1998a. NIST speaker recognition evaluation – 1997. In: Proceedings RLA2C, Avignon, 20–23 April, pp. 120–123.
Przybocki, M., Martin, A., 1998b. NIST speaker recognition evaluations. In: Proceedings LREC, 28–30 May, Granada, Spain, Vol. 1, pp. 331–335.
Quatieri, T., Reynolds, D., O'Leary, G., 1998. Magnitude-only estimation of handset non-linearity with application to speaker recognition. In: ICASSP, May, pp. 745–748.
Reynolds, D.A., 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication 17, 91–108.
Reynolds, D., 1997a. HTIMIT and LLHDB: speech corpora for the study of handset transducer effects. In: ICASSP, April, pp. 1535–1538.
Reynolds, D., 1997b. Comparison of background normalization methods for text-independent speaker verification. In: Eurospeech, pp. 963–967.
Schmidt-Nielsen, A., Crystal, T., 1998. Human vs. machine speaker identification with telephone speech. In: Proceedings ICSLP '98.
van Vuuren, S., Hermansky, H., 1998. !MESS: a modular, efficient speaker verification system. In: RLA2C, April, pp. 198–201.
