Speaker Identification Based on the Statistical Analysis of F0

Speech Technology Center

Pavel Labutin, Sergey Koval, Andrey Raev

St. Petersburg, Russia

[email protected]

24.07.2007 www.speechpro.com

Report overview

The problem of F0 usage in forensic speaker identification

Main challenges

Proposed method

Results

Conclusion


The problem of the F0 usage in forensic speaker identification

F0 analysis is an obligatory stage of forensic speaker identification.

Remedial legislation requires that the forensic investigation of speech evidence be comprehensive.

Because pitch reflects important properties of the human voice, it must be examined in the forensic analysis of a speech recording.

Typical F0 usage in speaker identification:

Automatic F0 detection

Some data smoothing

Simple F0 statistics comparison
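The typical pipeline listed above can be sketched roughly as follows. This is only an illustration of the conventional baseline, not the authors' method: the function names, the 5-point median window, and the example F0 tracks are all assumptions.

```python
import numpy as np

def median_smooth(f0, width=5):
    """Median-smooth an F0 track (Hz); width is an assumed odd window size."""
    f0 = np.asarray(f0, dtype=float)
    half = width // 2
    padded = np.pad(f0, half, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(f0))])

def mean_f0_difference(f0_a, f0_b):
    """Absolute difference of the mean F0 (Hz) of two smoothed tracks."""
    return abs(median_smooth(f0_a).mean() - median_smooth(f0_b).mean())

# Hypothetical tracks; the 240 Hz value imitates an octave error of the detector.
track_a = [120, 122, 240, 121, 119, 118, 120]
track_b = [150, 152, 151, 149, 150, 148, 151]
print(mean_f0_difference(track_a, track_b))
```

Smoothing removes the isolated outlier, but the comparison still rests on a single number per recording, which is the limitation the rest of the talk addresses.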


Main challenges in F0 usage for forensic speaker identification

Fig.1. F0 curve for a telephone conversation of the suspect. At the 15th second he received important information, and the average F0 rose by 70 Hz. Vertical axis – frequency (Hz), horizontal axis – time (sec).

Low speech quality for real police records

SNR is usually below 15 dB

Frequency range is limited

Speech signal distortions (compression, nonlinear frequency response of channel equipment, tape recorders, etc.)

High intra-speaker F0 variability

Strong dependence of F0 statistics on the speaker's state and speaking style


The method discussed

Three stages:

1. Reliable F0 detection

2. F0 detection control and correction

3. F0 statistics analysis and comparison

F0 detection algorithm:

a two-pass method;

summation of multiple harmonics in the spectral domain;

noise cancellation and adaptation to speech signals of very low quality;

good results in field applications;

implemented in the expert software (SIS) and used in real forensic examinations.
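The idea of summing multiple harmonics in the spectrum can be illustrated with a minimal single-frame sketch. It is not the SIS algorithm: the two-pass logic and noise cancellation are omitted, and the function name, search range, number of harmonics, and FFT size are assumptions chosen for the example.

```python
import numpy as np

def harmonic_sum_f0(frame, sr, fmin=60.0, fmax=400.0, n_harm=5, n_fft=8192):
    """Estimate F0 (Hz) of one frame by summing spectral magnitude at harmonics."""
    windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft))
    bin_hz = sr / n_fft
    candidates = np.arange(fmin, fmax, 1.0)      # 1 Hz candidate grid (assumed)
    scores = []
    for f in candidates:
        # Spectrum bins nearest to the first n_harm multiples of the candidate.
        bins = np.round(np.arange(1, n_harm + 1) * f / bin_hz).astype(int)
        bins = bins[bins < len(spectrum)]
        scores.append(spectrum[bins].sum())
    return candidates[int(np.argmax(scores))]

# Synthetic test frame: 150 Hz fundamental with four weaker overtones.
sr = 8000
t = np.arange(2048) / sr
frame = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 6))
print(harmonic_sum_f0(frame, sr))
```

Summing several harmonics makes the estimate less dependent on the fundamental itself, which matters on telephone channels where the lowest frequencies are attenuated.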

Fig.2. Waveform (upper window) and F0 curve (thin yellow curve) superimposed on the cepstrogram (bottom window). In the cepstrogram [7], the shading intensity corresponds to the degree of signal periodicity at the given frequency and time. Vertical axis – frequency (Hz), horizontal axis – time (sec).


F0 detection accuracy control and correction

Fig.3. Waveform (upper window) and F0 curve (thin yellow line in the bottom window). The correspondence between the real F0 and the calculated curve is unknown and cannot be verified. Vertical axis – frequency (Hz), horizontal axis – time (sec).



Fig.4. Waveform (upper window), cepstrogram (signal periodicity function, middle window) and F0 curve (thin yellow curve) superimposed on the cepstrogram (bottom window). In the cepstrogram [7], the shading intensity corresponds to the degree of signal periodicity at the given frequency and time. Vertical axis – frequency (Hz), horizontal axis – time (sec).



Fig.5. Waveform (upper window), initially detected F0 curve (yellow curve) superimposed on the cepstrogram (middle window), and the F0 curve graphically corrected by the expert, superimposed on the cepstrogram (bottom window). In the cepstrogram [7], the shading intensity corresponds to the degree of signal periodicity at the given frequency and time. Vertical axis – frequency (Hz), horizontal axis – time (sec).


Statistical F0 features used

Values of pitch are transformed to a logarithmic scale, and then statistical pitch features are calculated.

The typical set of statistical parameters:

Average value, Hz
Maximum, Hz
Minimum, Hz
Maximum -3%, Hz *
Minimum +1%, Hz
Median, Hz
Percentage of areas with rising pitch, % *
Pitch logarithm variation *
Pitch logarithm distribution asymmetry *
Pitch logarithm distribution excess
Average velocity of pitch change, %/sec
Pitch logarithm derivative variation
Pitch logarithm derivative distribution asymmetry
Pitch logarithm derivative distribution excess
Average velocity of pitch rise, %/sec *
Average velocity of pitch fall, %/sec

* The asterisk indicates the statistical features that are weighted more heavily in the common metric for speaker identification.
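A few of the listed features could be computed along the following lines. This is a sketch under stated assumptions, not the SIS implementation: the slides do not define the features precisely, so "variation" is taken here as the standard deviation, "asymmetry" as skewness, "excess" as excess kurtosis, and the input is assumed to be a contiguous voiced F0 track with a fixed frame step. "Maximum -3%" and "Minimum +1%" are omitted because their definition is not given.

```python
import numpy as np

def log_f0_features(f0_hz, frame_step_s=0.01):
    """Statistics of a voiced F0 track (Hz) sampled every frame_step_s seconds."""
    f0 = np.asarray(f0_hz, dtype=float)
    log_f0 = np.log(f0)
    # Derivative of ln(F0): relative pitch change per second (x100 gives %/sec).
    d = np.diff(log_f0) / frame_step_s

    def spread_skew_excess(x):
        z = (x - x.mean()) / x.std()
        return x.std(), (z ** 3).mean(), (z ** 4).mean() - 3.0

    var, asym, exc = spread_skew_excess(log_f0)
    d_var, d_asym, d_exc = spread_skew_excess(d)
    rising, falling = d[d > 0], d[d < 0]
    return {
        "mean_hz": f0.mean(),
        "max_hz": f0.max(),
        "min_hz": f0.min(),
        "median_hz": float(np.median(f0)),
        "rising_percent": 100.0 * (d > 0).mean(),
        "log_variation": var,
        "log_asymmetry": asym,
        "log_excess": exc,
        "change_rate_pct_per_s": 100.0 * np.abs(d).mean(),
        "dlog_variation": d_var,
        "dlog_asymmetry": d_asym,
        "dlog_excess": d_exc,
        "rise_rate_pct_per_s": 100.0 * rising.mean() if rising.size else 0.0,
        "fall_rate_pct_per_s": -100.0 * falling.mean() if falling.size else 0.0,
    }

# Hypothetical track: pitch rises by 10% per frame twice, then falls back.
features = log_f0_features([100, 110, 121, 110, 100])
print(features["rising_percent"], features["median_hz"])
```

Working on ln(F0) makes a 10% rise count the same for a low and a high voice, which is one reason to move to a logarithmic scale before computing statistics.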


General identification metric

The deviation of every statistical parameter was calculated for every file pair from the corpus.

The distributions of the deviations were built for the “same–different” and “same–same” pairs.

The False Acceptance (FA) and False Rejection (FR) functions and the Equal Error Rate (EER) were calculated for every statistical parameter.

The general identification metric was constructed as a weighted sum of separate statistical parameters.

The weights were selected to minimize EER for the given speech database.

For the general weighted metric, the FR and FA curves and the EER were calculated.
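The steps above can be sketched as follows. The code assumes that a deviation is a non-negative distance between the feature values of two files and that a pair is accepted as "same speaker" when its deviation does not exceed a threshold; the function names and the toy numbers are hypothetical, and the weight optimization itself is not shown.

```python
import numpy as np

def fa_fr_curves(same_dev, diff_dev, thresholds):
    """FR: share of same-speaker pairs above the threshold (rejected).
    FA: share of different-speaker pairs at or below it (accepted)."""
    same_dev = np.asarray(same_dev, dtype=float)
    diff_dev = np.asarray(diff_dev, dtype=float)
    fr = np.array([(same_dev > t).mean() for t in thresholds])
    fa = np.array([(diff_dev <= t).mean() for t in thresholds])
    return fa, fr

def equal_error_rate(same_dev, diff_dev):
    """EER: error rate at the threshold where FA and FR are closest."""
    thresholds = np.sort(np.concatenate([same_dev, diff_dev]))
    fa, fr = fa_fr_curves(same_dev, diff_dev, thresholds)
    i = int(np.argmin(np.abs(fa - fr)))
    return 0.5 * (fa[i] + fr[i])

def weighted_metric(deviations, weights):
    """Weighted sum of per-feature deviations; deviations is (n_pairs, n_features)."""
    return np.asarray(deviations, dtype=float) @ np.asarray(weights, dtype=float)

# Toy deviations for one feature (hypothetical numbers).
same = [0.1, 0.2, 0.3, 0.4]
diff = [0.35, 0.5, 0.6, 0.7]
print(equal_error_rate(same, diff))

# Toy two-feature deviations combined with assumed weights.
print(weighted_metric([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5]))
```

Selecting the weights then amounts to searching for the weight vector whose combined deviations give the lowest `equal_error_rate` on the training pairs.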


Speech database used for training and testing

The speaker identification algorithm was developed and trained using the STC corpus RUSTEN.

RUSTEN includes:

126 speakers (67 women and 59 men) in 5 sessions over 5 different analog telephone lines (including public telephones on noisy streets and in underground stations), real spontaneous dialogs;

and 130 speakers (61 women and 69 men) in 2–10 sessions over different digital telephone lines, about 1000 files of high-quality digital telephone channel conversations.

RUSTEN: Russian Switched Telephone Network speech database (STC), 2003. S0050, ELDA - Evaluations and Language resources Distribution Agency.


An example of F0 feature detection in SIS software

Fig.6. An example of the SIS software working window showing the results of the F0 statistics comparison for two speakers.

Such screenshots are typically inserted into the expert examination conclusion to illustrate F0 statistical analysis results.


Pitch of two files with different average values

Fig.7. Cepstrograms of the two compared speech files: the same speaker with different speaking styles. According to the statistical pitch analysis the speakers are the same, although the average pitch values differ significantly: 154 Hz and 135 Hz, respectively.


Results of method testing

Tonal speech duration   Group   10 sec template   20 sec template   40 sec template   80 sec template
10 sec test             All     17.7
                        Men     25.2
                        Women   26.6
20 sec test             All     16.7              15.2
                        Men     23.7              21.7
                        Women   24.9              22.6
40 sec test             All     16.1              14.4              13.2
                        Men     23.0              20.6              19.1
                        Women   23.8              21.1              19.0
80 sec test             All     15.6              13.6              12.3              10.9
                        Men     22.1              19.5              17.8              16.2
                        Women   23.1              19.8              17.5              15.0

Table 1 shows the results of speaker identification using F0 statistics analysis. The test database includes about 1600 speech files from 256 speakers: real dialogs over the public telephone network, on both analog and digital channels.


Results of speaker discrimination using only the average F0 value

Tonal speech duration   Group   10 sec template   20 sec template   40 sec template   80 sec template
10 sec test             Men     32.0
20 sec test             Men     31.1              30.1
40 sec test             Men     30.5              30.1
80 sec test             All                                                           17.4
                        Men     30.1              28.8              27.9              27.5

Table 2 shows the results of speaker identification using only one commonly used F0 feature: the average F0 value. The test database includes about 1600 speech files from 256 speakers: real dialogs over the public telephone network, on both analog and digital channels.


An example of FA and FR curves. Average F0


An example of FA and FR curves. F0 min +3%


An example of FA and FR curves. General metric


CONCLUSION

The method based upon the statistical analysis of F0 for forensic speaker identification is described.

The reliability of the method is tested on a large amount of real speech material of telephone conversations.

A reliable method is described for detecting F0 and for checking and correcting the detected F0 curve in real forensic speech recordings.

The method is implemented in the expert software (SIS) and is used in everyday forensic examination practice.


PERSPECTIVES

The same method of statistical F0 analysis is used to estimate anthropometric features of an unknown speaker, such as age, height, weight, etc.

Preliminary results are promising.

In addition to the statistical F0 analysis, we propose that experts perform a detailed structural analysis of the F0 curve.

In particular, experts should measure the maximum, minimum, range, and timing of the F0 movement within the accented syllable of a phrase or within voiced hesitation pauses.


Thank you for your attention
