age and gender recognition from speech patterns … · · 2012-12-06age and gender recognition...

Age and Gender Recognition from Speech Patterns Based on Supervised Non-Negative Matrix Factorization

July 2011

Mohamad Hasan Bahari

Hugo Van hamme

Outline

• Introduction and Motivations
• Age and Gender Recognition
• Corpora
• Supervised Non-negative Matrix Factorization
• Proposed Method
• Results
• Conclusions and Future Research

Introduction

• Confirming the identity of individuals
• Biometric characteristics
  • Fingerprint
  • Face
  • Iris
  • Hand geometry
  • Ear shape
  • Voice pattern
  • …
• Choosing a characteristic
  • Availability
  • Reliability

Motivation

• In many real-world cases, only speech patterns are available (kidnapping, threatening calls, …)
• Speech patterns can carry a lot of interesting information:
  • Gender
  • Age
  • Dialect (original or previous regions)
  • Membership of a particular social group
  • …

To help identify a criminal
To narrow down the number of suspects

Goal

To extract different physical and psychological characteristics of the speaker from his/her voice patterns (speaker profiling).

Physical:
1. Gender
2. Age
3. Accent
4. …

Psychological:
1. Anxiousness
2. Stress
3. Confidence
4. …


Three approaches:
I. Directly from the speech signal.
II. Modeling the speech generation system.
III. Modeling the hearing system.

I. Directly from the speech signal.
• Different acoustic features vary with age:
  1) Fundamental frequency
  2) Speech rate
  3) Sound pressure level
  4) …
• Approach: find all acoustic features that vary with age and their exact relation to the speaker's age.
• Conceptually simple and computationally inexpensive
x These features are affected by many other parameters, such as weight, height, voice quality, emotional condition, …

Effect of Age and Gender on speech (Fundamental frequency) [1]


• Age is only one of the inputs affecting speech and, consequently, the acoustic features.
• It is impossible to estimate age without considering the other inputs.
• Perceptions of gender and age have a significant mutual impact on each other.

[1] W. S. Brown, R. J. Morris, H. Hollien, and E. Howell, Journal of Voice, vol. 5, pp. 310–315, 1991.

II. Modeling the speech generation system.

• It is an input estimation problem.
x Modeling the speech generation system of the speaker is very difficult.


III. Modeling the hearing system

• To solve the speech recognition problem, the hearing system is modeled using Hidden Markov Models (HMMs).
• This approach uses the tools applied in speech recognition problems (HMMs).
• Well established.
• Accurate in recognizing content.
x There is a difference between the perceived age of a speaker and their actual age.
x Computationally complex

Corpora

• 555 speakers from the N-best evaluation corpus [1]
• The corpus contains live and read commentaries, news, interviews, and reports broadcast in Belgium
• Different age groups and genders

Category Name        Young   Young    Middle   Middle   Senior   Senior
                     Male    Female   Male     Female   Male     Female
Age                  18-35   18-35    36-45    36-45    46-81    46-81
Number of Speakers   85      53       160      41       191      25

[1] D. A. Van Leeuwen, J. Kessens, E. Sanders, and H. van den Heuvel, In proc. Interspeech, pp. 2571-2574, 2009.
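As a rough illustration of what chance level means for this corpus, the class priors can be derived from the speaker counts in the table above. This is only a sketch; the names `speaker_counts`, `priors` and `majority_class` are chosen here for illustration and do not come from the slides.

```python
# Speaker counts per category, copied from the corpus table.
speaker_counts = {
    "Young Male": 85, "Young Female": 53,
    "Middle Male": 160, "Middle Female": 41,
    "Senior Male": 191, "Senior Female": 25,
}

total = sum(speaker_counts.values())

# Prior of each class = its share of all speakers.
priors = {name: n / total for name, n in speaker_counts.items()}

# Always guessing the largest class gives a simple baseline to compare against.
majority_class = max(priors, key=priors.get)

for name, p in priors.items():
    print(f"{name}: {100 * p:.1f}%")
```

The classes are strongly unbalanced, so per-class accuracies are more informative when read next to these priors.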

SNMF

• Non-negative Matrix Factorization (NMF) is a popular machine learning algorithm [1]
• It can be used in supervised or unsupervised mode.
• Supervised NMF (SNMF) is a pattern recognition method [1]
• It is very effective for high-dimensional input spaces.
• It is a generative classifier.
• It can classify patterns into multiple classes directly (no need to recast the problem as multiple binary classifications).

[1] H. Van hamme, In proc. Interspeech, Australia, pp. 2554-2557, 2008.

SNMF

Problem Statement:
Given a training data set $S^{tr} = \{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\}$, where
• $x_n$ is a vector of observed characteristics for the n-th data item
• $y_n$ is a label vector that represents the class to which $x_n$ belongs

Goal:
Approximate a classifier function $g$ such that $\hat{y} = g(x^{tst})$ is as close as possible to the true label, where $x^{tst}$ is an unseen observation.

SNMF

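SNMF in Training Phase:

First step:
$$V_S^{tr} = [\,y_1 \cdots y_N\,], \qquad V_B^{tr} = [\,x_1 \cdots x_N\,], \qquad V^{tr} = \begin{bmatrix} V_S^{tr} \\ V_B^{tr} \end{bmatrix}$$

Second step:
$$V^{tr} = \begin{bmatrix} V_S^{tr} \\ V_B^{tr} \end{bmatrix} \approx W^{tr} H^{tr} = \begin{bmatrix} W_S^{tr} \\ W_B^{tr} \end{bmatrix} H^{tr}$$

Extended Kullback-Leibler divergence:
$$D_{KL}(V^{tr} \,\|\, W^{tr}H^{tr}) = \sum_{m,n} \left( [V^{tr}]_{mn} \log \frac{[V^{tr}]_{mn}}{[W^{tr}H^{tr}]_{mn}} + [W^{tr}H^{tr}]_{mn} - [V^{tr}]_{mn} \right) + \rho \sum_{z,n} [H^{tr}]_{zn}$$

Multiplicative updating formula:
$$H^{tr} \leftarrow H^{tr} \circ \frac{(W^{tr})^{T} \, \dfrac{V^{tr}}{W^{tr}H^{tr}}}{(W^{tr})^{T} \, \mathbf{1}_{M \times N} + \rho}$$
$$W^{tr} \leftarrow W^{tr} \circ \frac{\dfrac{V^{tr}}{W^{tr}H^{tr}} \, (H^{tr})^{T}}{\mathbf{1}_{M \times N} \, (H^{tr})^{T}}$$
Here $\circ$ and the fraction bars denote element-wise multiplication and division, and $\mathbf{1}_{M \times N}$ is an $M \times N$ matrix of ones.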
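The training phase can be sketched in NumPy as follows. This is a minimal illustration of multiplicative updates for the extended Kullback-Leibler divergence with a sparsity weight `rho`, not the authors' implementation. The function names, the random initialization, the fixed iteration count and the small constant `eps` (added to avoid division by zero) are all assumptions made here.

```python
import numpy as np

def extended_kl(V, W, H, rho=0.0, eps=1e-12):
    """Extended KL divergence between V and WH plus the sparsity term on H."""
    WH = W @ H
    # Entries with V == 0 contribute nothing to the log term.
    log_term = np.where(V > 0, V * np.log((V + eps) / (WH + eps)), 0.0)
    return float(np.sum(log_term + WH - V) + rho * np.sum(H))

def snmf_train(V_S, V_B, n_components, rho=0.0, n_iter=200, seed=0, eps=1e-12):
    """Factorize V = [V_S; V_B] as W H with multiplicative updates.

    V_S: (K, N) label vectors as columns; V_B: (D, N) non-negative features.
    Returns W_S (K, Z), W_B (D, Z) and H (Z, N).
    """
    V = np.vstack([V_S, V_B]).astype(float)
    M, N = V.shape
    rng = np.random.default_rng(seed)          # assumed: random positive start
    W = rng.random((M, n_components)) + eps
    H = rng.random((n_components, N)) + eps
    ones = np.ones((M, N))
    for _ in range(n_iter):
        # Update H: element-wise scaling by W^T (V / WH) over (W^T 1 + rho).
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + rho + eps)
        # Update W: element-wise scaling by (V / WH) H^T over 1 H^T.
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    K = V_S.shape[0]
    return W[:K], W[K:], H

# Toy data (hypothetical): 6 items, 2 classes, 5 non-negative features.
rng = np.random.default_rng(1)
labels = np.array([0, 0, 1, 1, 0, 1])
V_S = np.eye(2)[:, labels]      # one-hot label vectors y_n as columns
V_B = rng.random((5, 6))        # feature vectors x_n as columns
W_S, W_B, H = snmf_train(V_S, V_B, n_components=3)
```

Splitting the learned `W` back into `W_S` and `W_B` is what makes the factorization usable as a classifier: the feature part explains an observation, and the label part maps the resulting activations to class scores.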

SNMF

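SNMF in Testing Phase:

First step:
$$x^{tst} \approx W_B^{tr} H^{tst}$$

Second step:
$$\hat{y} = g(x^{tst}) = W_S^{tr} H^{tst}$$

Combined:
$$\hat{y} = g(x^{tst}) = W_S^{tr} \, \arg\min_{H^{tst}} D_{KL}(x^{tst} \,\|\, W_B^{tr} H^{tst})$$

Extended Kullback-Leibler divergence:
$$D_{KL}(x^{tst} \,\|\, W_B^{tr}H^{tst}) = \sum_{m} \left( [x^{tst}]_{m} \log \frac{[x^{tst}]_{m}}{[W_B^{tr}H^{tst}]_{m}} + [W_B^{tr}H^{tst}]_{m} - [x^{tst}]_{m} \right) + \rho \sum_{z} [H^{tst}]_{z}$$

Multiplicative updating formula:
$$H^{tst} \leftarrow H^{tst} \circ \frac{(W_B^{tr})^{T} \, \dfrac{x^{tst}}{W_B^{tr}H^{tst}}}{(W_B^{tr})^{T} \, \mathbf{1}_{M \times 1} + \rho}$$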
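The two testing steps can be sketched as a single function: keep the feature part of the dictionary fixed, iterate the multiplicative update for the activations of the test item, then map those activations to label scores. This is an illustrative sketch only; `snmf_classify`, the uniform initialization, the iteration count and `eps` are assumptions, and the toy matrices below are hypothetical.

```python
import numpy as np

def snmf_classify(x_tst, W_S, W_B, rho=0.0, n_iter=200, eps=1e-12):
    """Estimate activations h for x_tst with W_B fixed, then score classes.

    x_tst: (D,) non-negative feature vector; W_B: (D, Z); W_S: (K, Z).
    Returns the label-score vector y_hat (K,) and the index of its maximum.
    """
    x_tst = np.asarray(x_tst, dtype=float)
    M, Z = W_B.shape
    h = np.full(Z, 1.0 / Z)                      # assumed: uniform start
    denom = W_B.T @ np.ones(M) + rho + eps       # W_B^T 1 + rho
    for _ in range(n_iter):
        # Multiplicative update for h; W_B is not changed at test time.
        h *= (W_B.T @ (x_tst / (W_B @ h + eps))) / denom
    y_hat = W_S @ h
    return y_hat, int(np.argmax(y_hat))

# Hypothetical dictionary: component 0 explains features 0-1 (class 0),
# component 1 explains features 2-3 (class 1).
W_B = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
W_S = np.eye(2)
y_hat, label = snmf_classify(np.array([2.0, 2.0, 0.0, 0.0]), W_S, W_B)
print(y_hat, label)
```

Taking the arg max of `y_hat` is one simple decision rule for turning the score vector into a class; the slides do not state which rule was used.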

Proposed Method

1. Feature selection
2. Acoustic modeling
3. Supervector making procedure
4. Training phase
5. Testing phase

Proposed Method

1. Feature selection
• Mel spectra
• Mean normalization
• Vocal tract length normalization
• Augmented with their first- and second-order time derivatives

Diagram: Speech Signal → Feature selection → Feature Vectors
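Two of the feature steps listed above, mean normalization and augmentation with time derivatives, can be sketched as follows. The slides do not say how the derivatives are computed, so `np.gradient` (a simple finite difference) is used here purely as an assumption; vocal tract length normalization is left out, and the function name `normalize_and_augment` is hypothetical.

```python
import numpy as np

def normalize_and_augment(mel):
    """mel: (T, D) array of Mel-spectral frames, one row per frame.

    Returns a (T, 3 * D) array: mean-normalized features followed by
    their first- and second-order time derivatives.
    """
    mel = np.asarray(mel, dtype=float)
    static = mel - mel.mean(axis=0, keepdims=True)   # mean normalization
    delta = np.gradient(static, axis=0)              # first-order derivative
    delta2 = np.gradient(delta, axis=0)              # second-order derivative
    return np.hstack([static, delta, delta2])

# Toy input (hypothetical): 4 frames, 3 bands, values rising linearly in time.
mel = np.arange(12, dtype=float).reshape(4, 3)
feats = normalize_and_augment(mel)
print(feats.shape)
```

Speech front ends often estimate the derivatives by regression over a window of several frames; the finite difference above is just the shortest way to show the idea.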

Proposed Method

2. Acoustic modeling

Diagram: Speaker-Independent Model → Speaker Adaptation Method → Model of the Speaker

Speaker-independent model:
• An HMM with a shared pool of 49740 Gaussians to model the observations in 3873 cross-word context-dependent tied triphone states.

Adaptation method:
• The speaker-dependent mixture weights for each speaker result from re-estimating the speaker-independent weights, based on a forced alignment of that speaker's training data using the speaker-independent acoustic model.

The result of this step is 555 speaker-adapted models.

Proposed Method

3. Supervector making procedure

The Gaussian Mixture Model (GMM) of each speaker-adapted HMM is:
$$f(o_t \mid s) = \sum_{j=1}^{J} w_j^{s} \, \mathcal{N}(o_t, \mu_j^{s}, \Sigma_j^{s})$$

Three types of supervectors:
1. Means
2. Variances
3. Weights

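Weights supervectors:
$$\lambda^{s} = [\,w_1^{s} \cdots w_q^{s} \cdots w_Q^{s}\,]^{T}, \qquad \chi = [\,(\lambda^{1})^{T} \cdots (\lambda^{s})^{T} \cdots (\lambda^{S})^{T}\,]^{T}$$

The result of this step is 555 supervectors, one for each of the 555 speakers.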
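A weights supervector can be pictured as the mixture weights of every state of a speaker's adapted model, stacked into one long vector. The sketch below assumes the weights are available as one array per state; that data layout, the function name `weight_supervector` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def weight_supervector(state_weights):
    """Concatenate per-state mixture-weight vectors into one supervector.

    state_weights: list of 1-D arrays, one per HMM state (assumed layout).
    """
    return np.concatenate([np.asarray(w, dtype=float) for w in state_weights])

# Two hypothetical speakers, each with two states (2 and 3 Gaussians).
speaker_a = [np.array([0.2, 0.8]), np.array([0.5, 0.3, 0.2])]
speaker_b = [np.array([0.6, 0.4]), np.array([0.1, 0.1, 0.8])]

# One supervector per speaker, placed as a column of the feature matrix.
V_B = np.column_stack([weight_supervector(s) for s in (speaker_a, speaker_b)])
print(V_B.shape)
```

Because mixture weights are non-negative by definition, a matrix built this way already satisfies the non-negativity that NMF requires of its input.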

Proposed Method

4. Training phase

5. Testing phase

Results

Evaluation Methodology

• 5-fold cross-validation (five independent runs)
• In each of the five runs:
  • The training set is the speech data of 444 speakers
  • The testing set is the speech data of 111 speakers

Database partition per run (TR = training fold, TST = testing fold):
Run 1: TST TR TR TR TR
Run 2: TR TST TR TR TR
…
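The partitioning described above can be sketched with the standard library. The slides do not say how speakers were assigned to folds, so the seeded random shuffle here is an assumption, as is the function name `five_fold_splits`.

```python
import random

def five_fold_splits(n_speakers=555, n_folds=5, seed=0):
    """Yield (train_ids, test_ids) for each run of the cross-validation."""
    ids = list(range(n_speakers))
    random.Random(seed).shuffle(ids)            # assumed: random assignment
    fold_size = n_speakers // n_folds           # 555 // 5 = 111
    folds = [ids[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for k in range(n_folds):
        test_ids = folds[k]
        # All folds except the k-th form the training set.
        train_ids = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train_ids, test_ids

splits = list(five_fold_splits())
for run, (train_ids, test_ids) in enumerate(splits, start=1):
    print(f"Run {run}: {len(train_ids)} training, {len(test_ids)} testing speakers")
```

Splitting by speaker rather than by utterance keeps every test speaker unseen during training, which is what the 444/111 figures imply.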

Results

Gender recognition accuracy is 96%.

Age group recognition: relative confusion matrix (%), rows = actual class (AC), columns = classified as (CL)

AC \ CL   YM   YF   MM   MF   SM    SF
YM        13   03   58   0    26    0
YF        02   77   04   11   057   0
MM        06   01   44   01   47    0
MF        0    54   02   24   17    02
SM        03   01   19   0    76    0
SF        0    2    08   28   28    16

Category Name   Young   Young    Middle   Middle   Senior   Senior
                Male    Female   Male     Female   Male     Female
Prior (%)       15      10       29       7        34       4
Accuracy (%)    13      77       44       24       76       16
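A relative confusion matrix of this kind is obtained by counting (actual, classified) pairs and normalizing each row to percentages; the diagonal then gives the per-class accuracy. The sketch below uses a tiny hypothetical two-class example, not the values from the slides, and the function name `relative_confusion` is an assumption.

```python
import numpy as np

def relative_confusion(actual, predicted, n_classes):
    """Row-normalized confusion matrix in percent (rows = actual class)."""
    counts = np.zeros((n_classes, n_classes))
    for a, p in zip(actual, predicted):
        counts[a, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Guard against empty rows so that a missing class does not divide by zero.
    return 100.0 * counts / np.maximum(row_sums, 1)

# Hypothetical labels for six test items.
actual = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 1]
conf = relative_confusion(actual, predicted, n_classes=2)
per_class_accuracy = np.diag(conf)
print(conf)
```

Row normalization removes the effect of class size, which matters here because the six categories contain very different numbers of speakers.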

Conclusions and Future Research

Conclusions:
1. A new age and gender recognition method based on SNMF was proposed
2. Supervectors of GMM weights were used as features
3. The method was evaluated on the N-best corpus
4. Gender recognition accuracy is 96%
5. Age group recognition accuracy is significantly higher than chance level

Future Research:
1. Age estimation instead of age group recognition.
2. Using supervectors of GMM means and variances, and combining these features.

Thank You for Your Attention
