Papers on Speaker Recognition
5. COMPARISON TO OTHER BIOMETRICS
It is commonly asked, how does speaker verification compare to
other biometrics, such as iris, fingerprint or face recognition?
There really is no complete way to compare different biometrics
since there are so many dimensions on which to evaluate a
biometric (accuracy, suitability for application, ease of use,
recognition time, cost, etc). However, in this section we discuss
some of the strengths and weaknesses of speaker verification and
point to one study which attempted to compare several
biometrics based on accuracy.
The main strength of speaker verification technology is that it
relies on a signal that is natural and unobtrusive to produce and
can be obtained easily from almost anywhere using the familiar
telephone network (or internet) with no special user equipment or
training. This technology has prime utility for applications with
remote users and applications already employing a speech
interface. Additionally, speaker verification is easy to use, has
low computation requirements (can be ported to smartcards and
handhelds) and, given appropriate constraints, has high accuracy.
Some of the flexibility of speech, however, also contributes to its
weaknesses. First, speech is a behavioral signal that may not be
consistently reproduced by a speaker and can be affected by the
speaker's health (a cold or laryngitis). Second, the varied microphones and
channels that people use can cause difficulties since most speaker
verification systems rely on low-level spectrum features
susceptible to transducer/channel effects. Also, the mobility of
telephones means that people are using verification systems from
more uncontrolled and harsh acoustic environments (cars,
crowded airports), which can stress accuracy. Robustness to
channel variability is the biggest challenge to current systems.
Spoofing of systems is often cited as a weakness, but many
approaches have been developed to thwart such attempts
(prompted phrases, knowledge verification). Efforts are
currently underway to address these known weaknesses, and
some of them may be overcome by combining speaker
verification with a complementary biometric, such as face
recognition.
Finally, we show some results from a study by the United
Kingdom's Communications-Electronics Security Group (CESG)
that attempted to compare the performance of several biometrics.
The complete report can be found in [10]. In Figure 5 we show a
DET plot for eight systems (1 face, 3 fingerprint, 1 hand, 1 iris, 1
vein and 1 voice). While it is debatable whether a single test can
fairly compare all of these biometrics, it is interesting to note
that voice verification performed quite well. Readers should
consult the report for the full details of the test.
Figure 5: DET curves from the CESG study comparing
several biometrics; the voice system's curve is labeled in the
original plot. (Best of three attempts; Figure 6 in [10].)
6. FUTURE TRENDS
In this section we briefly outline some of the trends in speaker
recognition research and development.
Exploitation of higher levels of information: In addition to the
low-level spectrum features used by current systems, there are
many other sources of speaker information in the speech signal that can be used. These include
idiolect (word usage), prosodic
measures and other long-term signal measures. This work will be
aided by the increasing use of reliable speech recognition
systems for speaker recognition R&D. High-level features not
only offer the potential to improve accuracy, they may also help
improve robustness since they should be less susceptible to
channel effects.
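As a rough, hypothetical illustration of one such higher-level feature, the sketch below compares two made-up transcripts by their word-bigram frequency profiles, a crude stand-in for idiolect; the function names and data are invented for illustration and are not from the paper.

```python
from collections import Counter

def bigram_profile(words):
    """Relative frequencies of adjacent word pairs (a crude idiolect feature)."""
    pairs = Counter(zip(words, words[1:]))
    total = sum(pairs.values())
    return {bg: n / total for bg, n in pairs.items()}

def profile_distance(p, q):
    """L1 distance between two bigram frequency profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical transcripts from an enrolled speaker and a test utterance.
enrolled = "you know i mean it is you know what i mean".split()
test = "you know what i mean i mean it".split()

score = profile_distance(bigram_profile(enrolled), bigram_profile(test))
print(f"idiolect distance: {score:.3f}")  # lower = more similar word usage
```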
In recent work, Doddington has shown that a speaker's idiolect
can be used to successfully verify a person [11], and Andrews et
al. [12] have used n-grams of phonetic sequences for verifying
speakers.

(The preceding sections are excerpted from: Automatic Speaker
Recognition: Current Approaches and Future Trends, Douglas A.
Reynolds, MIT Lincoln Laboratory, Lexington, MA, USA.)

(The following sections are excerpted from:
http://www.cs.joensuu.fi/pages/tkinnu/research/)
3. Vector Quantization
Speaker recognition is the task of comparing an
unknown speaker with a set of known speakers in a
database and finding the best matching speaker.
Vector quantization (VQ) is a lossy data compression
method based on the principle of block coding. In
vector quantization, a large set of feature vectors is
reduced to a smaller set of representative vectors, the
centroids of the distribution.
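As a minimal illustration (mine, not the paper's) of what quantizing against a codebook looks like, with a toy 2-D codebook and NumPy; the values are invented:

```python
import numpy as np

# Toy codebook of M = 3 codewords in 2-D feature space (hypothetical values).
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [2.0, 0.0]])

# A few feature vectors to compress.
X = np.array([[0.1, -0.2],
              [0.9, 1.2],
              [1.8, 0.1]])

# Quantize: each vector is replaced by the index of its nearest codeword.
dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
indices = dists.argmin(axis=1)          # lossy representation: one index per vector
reconstructed = codebook[indices]       # decode back to the codewords

print(indices)
print(np.mean(np.sum((X - reconstructed) ** 2, axis=1)))  # quantization distortion
```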
3.1 Speaker Database
The first step is to build a speaker database,
C_database = {C_1, C_2, ..., C_N}, consisting of N codebooks,
one per speaker. For each speaker, the raw input signal is first
converted into a sequence of feature vectors X = {x_1, ..., x_T}.
These feature vectors are then clustered into a set of M
codewords, C = {c_1, ..., c_M}. The set of codewords is called
a codebook. The clustering is done by a clustering algorithm;
here the K-means algorithm is used for this purpose.
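A rough sketch of this enrollment step (not the paper's code; it assumes scikit-learn's KMeans for the clustering and stubs out feature extraction with random data in place of real features such as MFCCs):

```python
import numpy as np
from sklearn.cluster import KMeans

M = 16  # codewords per codebook

def train_codebook(features, m=M):
    """Cluster one speaker's feature vectors into an M-codeword codebook."""
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(features)
    return km.cluster_centers_  # shape (M, feature_dim)

# Stand-in for real feature extraction: random 12-D vectors per speaker.
rng = np.random.default_rng(0)
training_data = {name: rng.normal(size=(500, 12)) for name in ["alice", "bob"]}

# C_database = {C_1, ..., C_N}: one codebook per enrolled speaker.
database = {name: train_codebook(feats) for name, feats in training_data.items()}
```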
3.2 K-means
The K-means algorithm partitions the feature vectors
X into M clusters, each represented by a centroid. The
algorithm first chooses M initial centroids from among
the feature vectors. Then each feature vector is assigned
to the nearest centroid, and the centroids are recomputed.
This procedure is repeated until a stopping criterion is
met: either the mean squared error between the feature
vectors and their cluster centroids falls below a certain
threshold, or there is no further change in the cluster
assignments.
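A minimal NumPy rendering of the loop just described (a sketch; the initialization scheme, tolerance, and iteration cap are my own choices, not the paper's):

```python
import numpy as np

def kmeans(X, M, tol=1e-4, max_iter=100, seed=0):
    """Partition feature vectors X (shape T x d) into M clusters; return centroids."""
    rng = np.random.default_rng(seed)
    # Choose M initial centroids from among the feature vectors themselves.
    centroids = X[rng.choice(len(X), size=M, replace=False)]
    prev_assign = None
    for _ in range(max_iter):
        # Assign each feature vector to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Stop when the cluster assignments no longer change...
        if prev_assign is not None and np.array_equal(assign, prev_assign):
            break
        # ...otherwise recompute each centroid as the mean of its assigned vectors.
        centroids = np.array([X[assign == m].mean(axis=0) if np.any(assign == m)
                              else centroids[m] for m in range(M)])
        prev_assign = assign
        # ...or stop when the mean squared error falls below the threshold.
        if np.mean(d.min(axis=1) ** 2) < tol:
            break
    return centroids
```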
3.3 Speaker Matching
In the recognition phase an unknown speaker,
represented by a sequence of feature vectors X = {x_1,
..., x_T}, is compared with the codebooks in the database.
For each codebook a distortion measure is computed,
and the speaker with the lowest distortion is chosen.
One way to define the distortion measure is the
average of the Euclidean distances. The Euclidean
distance is the ordinary straight-line distance between
two points, obtained by repeated application of the
Pythagorean theorem. Thus, each feature vector in the
sequence X is compared with all the codebooks, and the
codebook with the minimum average distance is chosen
as the best match.
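Putting the pieces together, the decision rule amounts to averaging, over the T test vectors, the Euclidean distance to the nearest codeword, and minimizing that average over codebooks. A minimal sketch (assuming a `database` dictionary of codebooks like the one built in the Section 3.1 sketch; the function names are mine):

```python
import numpy as np

def avg_distortion(X, codebook):
    """Average Euclidean distance from each vector in X to its nearest codeword."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(X, database):
    """Return the enrolled speaker whose codebook yields the lowest distortion."""
    return min(database, key=lambda name: avg_distortion(X, database[name]))

# Hypothetical usage, with the `database` dict from the Section 3.1 sketch:
# best_speaker = identify(test_features, database)
```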