Papers on Speaker Recognition
5. COMPARISON TO OTHER BIOMETRICS
It is commonly asked, how does speaker verification compare to
other biometrics, such as iris, fingerprint or face recognition?
There really is no complete way to compare different biometrics
since there are so many dimensions on which to evaluate a
biometric (accuracy, suitability for application, ease of use,
recognition time, cost, etc). However, in this section we discuss
some of the strengths and weaknesses of speaker verification and
point to one study which attempted to compare several
biometrics based on accuracy.
The main strength of speaker verification technology is that it
relies on a signal that is natural and unobtrusive to produce and
can be obtained easily from almost anywhere using the familiar
telephone network (or internet) with no special user equipment or
training. This technology has prime utility for applications with
remote users and applications already employing a speech
interface. Additionally, speaker verification is easy to use, has
low computation requirements (can be ported to smartcards and
handhelds) and, given appropriate constraints, has high accuracy.
Some of the flexibility of speech, however, also contributes to its
weaknesses. First, speech is a behavioral signal that may not be
consistently reproduced by a speaker and can be affected by the
speaker's health (a cold or laryngitis). Second, the varied microphones and
channels that people use can cause difficulties since most speaker
verification systems rely on low-level spectrum features
susceptible to transducer/channel effects. Also, the mobility of
telephones means that people are using verification systems from
more uncontrolled and harsh acoustic environments (cars,
crowded airports), which can stress accuracy. Robustness to
channel variability is the biggest challenge to current systems.
Spoofing of systems is often cited as a weakness, but many
approaches have been developed to thwart such attempts
(prompted phrases, knowledge verification). Efforts are
currently underway to address these known weaknesses, and
some of them may be overcome by combining speaker
verification with a complementary biometric, such as face
recognition.
Finally, we show some results from a study by the United
Kingdom's Communications-Electronics Security Group (CESG)
that attempted to compare the performance of several biometrics.
The complete report can be found in [10]. In Figure 5 we show a
DET plot for eight systems (1 face, 3 fingerprint, 1 hand, 1 iris, 1
vein and 1 voice). While it is debatable whether a single test can
fairly compare all of these biometrics, it is interesting to note
that voice verification performed quite well. Readers should
consult the report for the full details of the test.
Figure 5: DET curves from the CESG study comparing
several biometrics; the voice system's curve is labeled in the
original plot. (Best of three attempts; Figure 6 in [10].)
6. FUTURE TRENDS
In this section we briefly outline some of the trends in speaker
recognition research and development.
Exploitation of higher levels of information: In addition to the
low-level spectrum features used by current systems, there are
many other sources of speaker information in the speech signal that can be used. These include
idiolect (word usage), prosodic
measures and other long-term signal measures. This work will be
aided by the increasing use of reliable speech recognition
systems for speaker recognition R&D. High-level features not
only offer the potential to improve accuracy, they may also help
improve robustness since they should be less susceptible to
channel effects.
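As a rough, hypothetical illustration of one such higher-level feature, the sketch below compares two made-up transcripts by their word-bigram frequency profiles, a crude stand-in for idiolect; the function names and data are invented for illustration and are not from the paper.

```python
from collections import Counter

def bigram_profile(words):
    """Relative frequencies of adjacent word pairs (a crude idiolect feature)."""
    pairs = Counter(zip(words, words[1:]))
    total = sum(pairs.values())
    return {bg: n / total for bg, n in pairs.items()}

def profile_distance(p, q):
    """L1 distance between two bigram frequency profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical transcripts from an enrolled speaker and a test utterance.
enrolled = "you know i mean it is you know what i mean".split()
test = "you know what i mean i mean it".split()

score = profile_distance(bigram_profile(enrolled), bigram_profile(test))
print(f"idiolect distance: {score:.3f}")  # lower = more similar word usage
```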
In recent work, Doddington has shown that a speaker's idiolect
can be used to successfully verify a person [11], and Andrews et
al. [12] have used n-grams of phonetic sequences for verifying
speakers.

(The preceding sections are excerpted from: Automatic Speaker
Recognition: Current Approaches and Future Trends, Douglas A.
Reynolds, MIT Lincoln Laboratory, Lexington, MA, USA.)

(The following sections are excerpted from:
http://www.cs.joensuu.fi/pages/tkinnu/research/)
3. Vector Quantization
Speaker recognition is the task of comparing an
unknown speaker with a set of known speakers in a
database and finding the best matching speaker.
Vector quantization (VQ) is a lossy data compression
method based on the principle of block coding. In
vector quantization, a large set of feature vectors is
reduced to a smaller set of representative vectors, the
centroids of the distribution.
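As a minimal illustration (mine, not the paper's) of what quantizing against a codebook looks like, with a toy 2-D codebook and NumPy; the values are invented:

```python
import numpy as np

# Toy codebook of M = 3 codewords in 2-D feature space (hypothetical values).
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [2.0, 0.0]])

# A few feature vectors to compress.
X = np.array([[0.1, -0.2],
              [0.9, 1.2],
              [1.8, 0.1]])

# Quantize: each vector is replaced by the index of its nearest codeword.
dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
indices = dists.argmin(axis=1)          # lossy representation: one index per vector
reconstructed = codebook[indices]       # decode back to the codewords

print(indices)
print(np.mean(np.sum((X - reconstructed) ** 2, axis=1)))  # quantization distortion
```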
3.1 Speaker Database
The first step is to build a speaker database,
C_database = {C_1, C_2, ..., C_N}, consisting of N codebooks,
one per speaker. For each speaker, the raw input signal is first
converted into a sequence of feature vectors X = {x_1, ..., x_T}.
These feature vectors are then clustered into a set of M
codewords, C = {c_1, ..., c_M}. The set of codewords is called
a codebook. The clustering is done by a clustering algorithm;
here the K-means algorithm is used for this purpose.
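A rough sketch of this enrollment step (not the paper's code; it assumes scikit-learn's KMeans for the clustering and stubs out feature extraction with random data in place of real features such as MFCCs):

```python
import numpy as np
from sklearn.cluster import KMeans

M = 16  # codewords per codebook

def train_codebook(features, m=M):
    """Cluster one speaker's feature vectors into an M-codeword codebook."""
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(features)
    return km.cluster_centers_  # shape (M, feature_dim)

# Stand-in for real feature extraction: random 12-D vectors per speaker.
rng = np.random.default_rng(0)
training_data = {name: rng.normal(size=(500, 12)) for name in ["alice", "bob"]}

# C_database = {C_1, ..., C_N}: one codebook per enrolled speaker.
database = {name: train_codebook(feats) for name, feats in training_data.items()}
```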
3.2 K-means
The K-means algorithm partitions the feature vectors
X into M clusters, each represented by a centroid. The
algorithm first chooses M initial centroids from among
the feature vectors. Then each feature vector is assigned
to the nearest centroid, and the centroids are recomputed.
This procedure is repeated until a stopping criterion is
met: either the mean squared error between the feature
vectors and their cluster centroids falls below a certain
threshold, or there is no further change in the cluster
assignments.
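A minimal NumPy rendering of the loop just described (a sketch; the initialization scheme, tolerance, and iteration cap are my own choices, not the paper's):

```python
import numpy as np

def kmeans(X, M, tol=1e-4, max_iter=100, seed=0):
    """Partition feature vectors X (shape T x d) into M clusters; return centroids."""
    rng = np.random.default_rng(seed)
    # Choose M initial centroids from among the feature vectors themselves.
    centroids = X[rng.choice(len(X), size=M, replace=False)]
    prev_assign = None
    for _ in range(max_iter):
        # Assign each feature vector to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Stop when the cluster assignments no longer change...
        if prev_assign is not None and np.array_equal(assign, prev_assign):
            break
        # ...otherwise recompute each centroid as the mean of its assigned vectors.
        centroids = np.array([X[assign == m].mean(axis=0) if np.any(assign == m)
                              else centroids[m] for m in range(M)])
        prev_assign = assign
        # ...or stop when the mean squared error falls below the threshold.
        if np.mean(d.min(axis=1) ** 2) < tol:
            break
    return centroids
```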
3.3 Speaker Matching
In the recognition phase an unknown speaker,
represented by a sequence of feature vectors X = {x_1,
..., x_T}, is compared with the codebooks in the database.
For each codebook a distortion measure is computed,
and the speaker with the lowest distortion is chosen.
One way to define the distortion measure is the
average of the Euclidean distances. The Euclidean
distance is the ordinary straight-line distance between
two points, obtained by repeated application of the
Pythagorean theorem. Thus, each feature vector in the
sequence X is compared with all the codebooks, and the
codebook with the minimum average distance is chosen
as the best match.
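Putting the pieces together, the decision rule amounts to averaging, over the T test vectors, the Euclidean distance to the nearest codeword, and minimizing that average over codebooks. A minimal sketch (assuming a `database` dictionary of codebooks like the one built in the Section 3.1 sketch; the function names are mine):

```python
import numpy as np

def avg_distortion(X, codebook):
    """Average Euclidean distance from each vector in X to its nearest codeword."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(X, database):
    """Return the enrolled speaker whose codebook yields the lowest distortion."""
    return min(database, key=lambda name: avg_distortion(X, database[name]))

# Hypothetical usage, with the `database` dict from the Section 3.1 sketch:
# best_speaker = identify(test_features, database)
```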