
Speech Processing – SE 07, University of Fribourg

Speech Processing

Lecture 6: Speaker Recognition

26/04/2007


Speaker Recognition

Objectives: introduction to speaker verification (SV) technologies and their applications to telephone services

OUTLINE

• Introduction
• What identifies a speaker?
• Advantages and drawbacks of SV
• Fundamental algorithms for SV
• Performance of state-of-the-art SV systems
• Taxonomy of SV systems
• Applications to telephone services
• Conclusions and future trends


Introduction

SV is a biometric technology
Types of biometrics
The aim of the game


[Figure: one speech signal feeds three recognizers. Speech recognition yields the textual content ("read my mail"), language recognition yields the spoken language (English), and speaker recognition yields the speaker ID (John Smith).]


SV is a biometric

• Who you are: fingerprints, voice, iris, ...
• What you know: account number, passwords, ...
• What you have: keys, card, CLI, ...


Types of biometrics

[Figure: biometrics arranged from physical properties (rigid / passive) to behavioural characteristics (plastic / dynamic): fingerprint and iris, then speech, then signature.]


Types of tasks in speaker recognition

• Identification (1:N), closed-set or open-set: "Whose voice is this?"
• Verification (1:2): "Is this the voice of Mr Smith?"
• Speaker tracking (multispeaker): "When did Mr Smith speak?"
• Segmentation and clustering (multispeaker, unknown number of speakers n): "Who speaks? Who spoke when?"
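The difference between these tasks can be sketched in a few lines. This is only an illustration: the per-speaker scores, the speaker names and the threshold values are hypothetical, and a real system would compute the scores from speech.

```python
from typing import Dict, Optional

def identify_closed_set(scores: Dict[str, float]) -> str:
    """1:N closed-set identification: return the best-scoring enrolled speaker."""
    return max(scores, key=scores.get)

def identify_open_set(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Open-set identification: answer None ("unknown speaker") when even
    the best-scoring enrolled speaker stays below the threshold."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None

def verify(scores: Dict[str, float], claimed: str, threshold: float) -> bool:
    """Verification: only the score of the claimed identity is tested."""
    return scores[claimed] >= threshold

# Hypothetical scores of one test utterance against three enrolled speakers
# (higher means a better match).
scores = {"smith": 1.8, "jones": -0.4, "martin": 0.3}

print(identify_closed_set(scores))      # smith
print(identify_open_set(scores, 2.5))   # None
print(verify(scores, "smith", 1.0))     # True
```

Identification searches over all enrolled speakers, whereas verification only tests the claimed identity, which is why its cost does not grow with the number of enrolled users.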


History of speaker recognition

• 5 decades of activity
• 1941: the Bell Telephone laboratories in New Jersey produced a machine able to display spectrograms of voice signals.
  – During the Second World War, the work on the spectrograph was classified as a military project. Acoustic scientists used it to try to identify enemy voices from intercepted telephone and radio communications.



• 1950s and 1960s: so-called expert testimony in forensic applications started.
  – These experts (such as Lawrence Kersta) claimed that spectrograms were a precise way to identify individuals, which is not true in most conditions.
  – They attached the term "voiceprint" to spectrograms, by direct analogy with fingerprints.
  – The claimed ability of experts to identify people from spectrograms has been heavily disputed in forensic applications for many years, and still is.
• 1960s and 1970s: the introduction of the first computers and mini-computers triggered more thorough and applied research in speaker recognition.
  – Speaker recognition started to be applied to more realistic access control applications.
  – Real-life issues were uncovered, such as the need to build systems with single-session enrolment.



• 1980s: speaker verification began to be applied in the telecom area.
  – Other application issues were then uncovered, such as unwanted variability due to microphone and channel.
  – More complex statistical modelling techniques were also introduced, such as Hidden Markov Models.
• 1990s: common speaker verification databases were made available.
  – First through the Linguistic Data Consortium (LDC).
  – This was a major step that triggered more intensive collaborative research and common assessment.
  – In 1997, the National Institute of Standards and Technology (NIST) also started to organize open evaluations of speaker verification systems on common shared tasks.
  – The central theme of research in the 1990s was to increase the robustness of speaker recognition against all sorts of variability.



• In the present decade:
  – Recent advances in computer performance and the proliferation of automated systems for accessing information and services have pulled speaker recognition out of the laboratories and into robust commercial products.
  – Currently, the technology remains expensive, and deployment still needs a lot of customization according to the context of use.
    • These two factors may be holding service providers back from rolling out such systems.
  – From a research point of view, new trends are also appearing.
    • Extraction of higher-level information such as word usage or pronunciation is increasingly studied for applications where large quantities of training data are available.
    • New systems attempt to combine speaker verification with other modalities such as face or handwriting.


What identifies a speaker?

3 sources of variation between speakers
Modelling algorithms capture only part of these variations


3 sources of variation between speakers

1. Physiological properties
  • Shape of the vocal tract
  • Length of the vocal cords
2. Behavioural characteristics
  • Speaking rate
  • Prosody
  • Coarticulation
3. Higher-level linguistic information
  • Vocabulary selection
  • Grammatical constructions
  • Hesitations and "filler sounds"
  • Context of the conversation


Current SV algorithms capture only part of these sources of variation.


Advantages and drawbacks of SV

3 advantages
3 drawbacks


Warning

We now focus exclusively on speaker verification!


Speaker verification

Speaker verification = authenticating the claimed identity of an individual on the basis of their voice

[Figure: the voice sample and the claimed identity produce a SCORE, which is compared with a threshold value. "True speaker": access granted. "Impostor": access denied.]


3 advantages

1. Good user acceptance
  • Considered unintrusive
  • Speaking is a natural gesture
  • No physical contact with the sensors
2. Relatively low technology cost
  • A simple microphone is enough
  • Potentially usable from any telephone
3. Good security against attacks
  • A "challenge-response" strategy can be used
  • Imitations capture behavioural characteristics; they capture physiological properties with much more difficulty


3 drawbacks

1. Enrolment session
  • A single recording session does not capture all the variability
  • Several enrolment sessions are needed, or even incremental enrolment
  • The shortest enrolment is best for the user
  • The longest enrolment is best for the algorithms
  • Enrolment must be secured
2. Average performance
  • Typically, SV is rated below other biometrics such as fingerprints or iris scans
  • Variability is the cause!
  • Limited uniqueness is also a cause (e.g. members of the same family)
3. Poor security against attacks
  • If the system is not well designed, a simple recording of the speaker's voice is enough to "get into" the system


Fundamental algorithms

Overview
Detection problem
Threshold computation
ROC / DET curves


SV = detection problem

• SV produces a SCORE ("log-likelihood ratio")
  – The decision is made by comparing the score with a threshold T
• Detection problem: 2 types of errors:
  1. False Acceptance (FA) ("false alarm")
  2. False Rejection (FR) ("missed detection")

The FA and FR rates depend on the threshold value
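As a rough illustration of a log-likelihood ratio score, the sketch below compares a claimed-speaker model with a background ("everyone else") model. It assumes, purely for simplicity, that each model is a single diagonal Gaussian over feature vectors; the slides do not specify the models, and all names and numbers here are hypothetical.

```python
import math
from typing import Sequence, Tuple

# Hypothetical model type: per-dimension (means, variances) of a diagonal Gaussian.
Gaussian = Tuple[Sequence[float], Sequence[float]]

def log_gauss(x: Sequence[float], model: Gaussian) -> float:
    """Log-density of the feature vector x under a diagonal Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )

def llr_score(frames: Sequence[Sequence[float]],
              speaker: Gaussian, background: Gaussian) -> float:
    """Average per-frame log-likelihood ratio: speaker model vs background model."""
    total = sum(log_gauss(f, speaker) - log_gauss(f, background) for f in frames)
    return total / len(frames)

def accept(score: float, threshold: float) -> bool:
    """Decision rule: accept the claimed identity if the score reaches T."""
    return score >= threshold

# Toy two-dimensional models and test frames (assumed values).
speaker_model = ([1.0, 0.0], [1.0, 1.0])
background_model = ([0.0, 0.0], [1.0, 1.0])
client_frames = [[1.0, 0.0], [0.8, 0.1]]
impostor_frames = [[-1.0, 0.0]]

client_score = llr_score(client_frames, speaker_model, background_model)
impostor_score = llr_score(impostor_frames, speaker_model, background_model)
print(accept(client_score, 0.0), accept(impostor_score, 0.0))  # True False
```

A positive score means the frames are better explained by the claimed speaker's model than by the background model; moving T up or down trades false acceptances against false rejections.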


Computing the threshold T

For a given threshold value T:

• The system has a certain probability of falsely rejecting a client
• The system has a certain probability of accepting an impostor

A security level must be chosen for the system! This is done by minimizing a cost function.
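The cost function itself is not preserved in these notes. A common choice, assumed here, is a weighted sum of the two error rates in the style of the NIST detection cost: C(T) = C_FR · P_FR(T) · P_client + C_FA · P_FA(T) · (1 − P_client). The sketch below picks the threshold that minimizes this cost on a set of development scores; the scores, costs and prior are hypothetical.

```python
from typing import Sequence, Tuple

def error_rates(client_scores: Sequence[float],
                impostor_scores: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (FA rate, FR rate) when accepting every score >= threshold."""
    fr = sum(s < threshold for s in client_scores) / len(client_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

def detection_cost(fa: float, fr: float, c_fa: float = 1.0,
                   c_fr: float = 1.0, p_client: float = 0.5) -> float:
    """Assumed cost: weighted sum of both error rates (NIST-style)."""
    return c_fr * fr * p_client + c_fa * fa * (1.0 - p_client)

def best_threshold(client_scores: Sequence[float],
                   impostor_scores: Sequence[float],
                   **cost_params: float) -> float:
    """Try every observed score as a threshold and keep the cheapest one."""
    candidates = sorted(set(client_scores) | set(impostor_scores))
    candidates.append(candidates[-1] + 1.0)  # also try "reject everything"
    return min(
        candidates,
        key=lambda t: detection_cost(
            *error_rates(client_scores, impostor_scores, t), **cost_params
        ),
    )

# Hypothetical development scores.
clients = [2.0, 1.5, 0.5, 1.0]
impostors = [-1.0, 0.0, 0.8, -0.5]

print(best_threshold(clients, impostors))             # 0.5
print(best_threshold(clients, impostors, c_fa=10.0))  # 1.0
```

Making false acceptances ten times more costly pushes the threshold up: the system becomes more secure and rejects more genuine clients.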


Performance of state-of-the-art systems

Performance depends on ...
Some figures
Ways to fool the system


Performance depends on ...

• Amount of data
  – Duration of enrolment and test
  – Number of enrolment sessions
• Quality of the speech signal
  – Channel: bandwidth, recording device, reverberation, ...
  – Ambient noise, multiple speech sources
• Modelling strategy
• User
  – Level of cooperation
  – State of health / stress
  – Performance intrinsically differs from one user to another
  – "Aging effect": the voice changes with age
• The time of day!


Some figures

EER = equal error rate, the operating point at which the FA and FR rates are equal.

• EER = 50%
  – Upper bound on the SV error rate (chance level)
• EER < 0.5%
  – What technology vendors claim
• EER ~ 1%
  – What can reasonably be expected from a well-trained, well-designed system
• EER ~ 20%
  – Extreme conditions: little enrolment data, little test data, noisy environment, mismatched recording conditions
• NIST organizes an SV "competition" every year, open to research institutes and industry
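The EER quoted above can be estimated from two lists of scores by sweeping the threshold until the FA and FR rates meet. The sketch below is a simple approximation under assumed data: it only tries observed scores as thresholds and reports the point where the two rates are closest.

```python
from typing import Sequence, Tuple

def equal_error_rate(client_scores: Sequence[float],
                     impostor_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (EER estimate, threshold) at the point where FA and FR are closest."""
    best_gap = None
    best_eer = 0.0
    best_t = 0.0
    for t in sorted(set(client_scores) | set(impostor_scores)):
        # Accept every score >= t, then measure both error rates.
        fr = sum(s < t for s in client_scores) / len(client_scores)
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(fa - fr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_t = gap, (fa + fr) / 2.0, t
    return best_eer, best_t

# Hypothetical scores: one impostor (0.8) scores higher than one client (0.5).
clients = [2.0, 1.5, 0.5, 1.0]
impostors = [-1.0, 0.0, 0.8, -0.5]

eer, threshold = equal_error_rate(clients, impostors)
print(eer, threshold)  # 0.25 0.8
```

With only four scores per class the estimate is very coarse; meaningful EER figures such as those listed above need many client and impostor trials.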


Ways to fool an SV system

1. Pre-recording
  • The recording must be of good quality, because systems are generally sensitive to recording conditions
  • "Challenge-response" systems prevent this kind of attack
2. Forcing the client to speak in front of the system
  • May not work, because stress changes the voice
3. Imitating the client's voice
  • Unlikely to work, because imitations capture behavioural characteristics but not physiological ones
4. Building a voice conversion (or synthesis) tool targeting the client's voice
  • This attack is the most likely to succeed, but it is expensive.


Taxonomy of SV systems

Text-dependent
Text-independent
Text-prompted


Text-dependent

• System-selected password
  – A priori fixed phrases, PIN
  – Identity claim and SV can be done at the same time
• User-selected password
  – Technology is much more difficult
  – Recovery infrastructure

Text-independent

• User is free to say anything they want
• User does not have to remember anything
• More vulnerable (any recording can be used to break into the system)


Text-prompted

• Challenge-response
• User just has to repeat something prompted (easier for user and computer)
• System must first check what has been said
• Randomness in the prompts prevents the use of recorded speech
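The two-step logic of a text-prompted system can be sketched as follows. Everything here is a hypothetical illustration: the digit vocabulary, the prompt length and the function names are assumptions, and the recognized words and the SV score would in practice come from an ASR engine and an SV engine.

```python
import secrets
from typing import List, Sequence

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def make_prompt(n_words: int = 4) -> List[str]:
    """Draw an unpredictable digit sequence for the user to repeat."""
    return [secrets.choice(DIGITS) for _ in range(n_words)]

def text_prompted_verify(prompt: Sequence[str],
                         recognized_words: Sequence[str],
                         sv_score: float,
                         threshold: float) -> bool:
    """Accept only if the right words were said AND the voice matches."""
    # Step 1: check WHAT was said (ASR output must match the prompt).
    if [w.lower() for w in recognized_words] != list(prompt):
        return False
    # Step 2: only then check WHO said it (SV score against the threshold).
    return sv_score >= threshold

prompt = make_prompt()
print("Please say:", " ".join(prompt))
# A replayed recording of an older prompt would almost surely fail step 1.
print(text_prompted_verify(prompt, prompt, sv_score=1.2, threshold=0.0))
```

Because the prompt changes at every attempt, an attacker cannot know in advance which words to record, which is the point of the randomness mentioned above.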


Example of text-prompted dialog


Applications in telephony services

Dialog machines
Some applications
Advantages for phone banking
In practice…


Dialog machine

[Figure: an IVR system runs dialogs built on DTMF, TTS, ASR and SV components, serving applications such as information services, v-commerce and automatic attendants.]

IVR = Interactive Voice Response
DTMF = Dual-Tone Multi-Frequency
TTS = Text-to-Speech
ASR = Automatic Speech Recognition
SV = Speaker Verification


Telephony commercial applications

[Figure: applications placed between speech recognition and speaker recognition: information retrieval, voice dialing, booking/purchase services, phone banking, teleshopping, credit/calling card validation, home incarceration, password renewal, automated attendant/receptionist.]


Other applications: speech data management

• Audio browser:
  – Speaker segmentation
    • Who has been speaking, and when?
    • Also known as speaker diarization
  – How many speakers have been speaking?
  – Speaker change for subtitles
• Intelligent answering machines
  – "Hello Mr Smith."
• Broad characteristics recognition
  – Marketing purposes.


Use case: in a banking environment, what are the advantages?

• SV is a biometric: it verifies who you are.
• SV is convenient for the user: it reduces the need for PINs / strike lists.
• SV does not require elaborate installation or hardware on the user side.
• SV can be used as a gatekeeper or an alarm bell.
• SV can complement additional security measures (passwords, strike lists, other biometrics, ...).


Conclusions


Critical Success Factors

- Cooperation with customer absolutely necessary.

- Technology must make life easier and safer.

- Combine other protection techniques with SV.


Conclusions

• Speech technologies will be used in many applications.

• Dialog systems with voice recognition applications are around the corner.

• SV will be available very soon for deployment:
  – there is good potential for SV.
  – the arguments are both security and ease of use.
  – the technology is continuously being improved.


What's next from a research point of view?

• Modeling higher-level sources of information
  – weighting phoneme contributions differently
  – "going beyond the atomic acoustic features"
    • usage of words
    • pronunciation
    • duration of phonemes
    • language models
• Multimodality
  – Speech and face biometrics: talking faces
  – Speech and handwriting: CHASM project of the University of Fribourg
