
IT 13 001

Examensarbete 30 hp
Januari 2013

Methods for Improving Voice Activity Detection in Communication Services

Amardeep

Institutionen för informationsteknologi
Department of Information Technology


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web: http://www.teknat.uu.se/student

Abstract

Methods for Improving Voice Activity Detection in Communication Services

Amardeep

Due to its limited display area, a video conferencing application has to display only the active sites, which are identified using a voice activity detector (VAD) that maintains a list of the most vocally active sites. In a typical video conferencing room there will be people typing on their computers or laptops, and this can cause problems when the VAD classifies the keyboard typing signals as speech activity even though nobody is talking in the room. As a result, a vocally inactive site is not removed from the list of active sites and blocks another vocally active site from being added to the list, creating a very bad user experience in the video conference. Current VADs often classify keyboard typing as active speech.

In this thesis work, we explore two main approaches to solving the problem. The first approach is based on identifying keystroke signals in the mixed audio data (speech and keyboard signals). Here we explore various audio signal classification approaches based on temporal and spectral features of speech and keystroke signals, as well as classification based on a prediction model. We evaluate and compare these approaches by varying their parameters, maximizing the percentage of true-keystroke frames correctly classified as keystroke frames while minimizing the percentage of non-keystroke frames falsely classified as keystroke frames. The evaluated keystroke identification approach, based on thresholding the model error, achieved 85% accuracy using one previous and one future frame. The frames falsely classified as keystroke frames in this approach are mainly due to plosive sounds in the audio signal, which have characteristics similar to those of keystroke signals.

The second approach is based on finding a mechanism that complements the VAD so that it does not trigger on keystroke signals. For this purpose we explore different methods for improving the pitch detection functionality of the VAD. We evaluate a new pitch detector that computes pitch using the autocorrelation of normalized signal frames. We then design a new speech detector, consisting of the new pitch detector together with hangover addition, that separates the mixed audio data into speech and non-speech regions in real time. The new speech detector does not trigger on keystroke frames, i.e. it places them in the non-speech region, and hence solves the problem.

Printed by: Reprocentralen ITC

Sponsor: Multimedia Division, Ericsson Research, EAB
IT 13 001
Examiner (Examinator): Lisa Kaati
Subject reviewer (Ämnesgranskare): Magnus Lundberg Nordenvaad
Supervisor (Handledare): Erlendur Karlsson


Ericsson Internal
MSC THESIS REPORT
Prepared (Subject resp): Amardeep    Approved (Document resp): Amardeep
Date: 2012-09-11    Rev: PA1    Pages: 89

Acknowledgement:

This Master’s thesis is the final achievement for my degree of Master of Science in Computer Science at Uppsala University. The work has been performed at the Audio Technology Department of Ericsson Research in Stockholm.

I would like to acknowledge and thank my supervisor Erlendur Karlsson for his tremendous support during the work and writing of this thesis. I would also like to thank Magnus Lundberg Nordenvaad for his valuable feedback and suggestions during the report writing. Finally, I would like to express my gratitude to all the others who either directly or indirectly helped me in achieving this goal.


Contents

1  Introduction
   1.1  Thesis outline
2  Voice Activity Detector
   2.1  Sub-band Filter Bank
   2.2  Pitch detection
   2.3  Tone detection
   2.4  Complex signal analysis
   2.5  VAD decision
        2.5.1  SNR computation
        2.5.2  Background Noise Estimation
        2.5.3  Threshold Adaption
        2.5.4  Comparison block
        2.5.5  Hangover block
   2.6  Problems with VAD
3  Signal Characteristics
   3.1  Signal Recording
        3.1.1  Keystroke signals
        3.1.2  Speech signals
   3.2  Keystroke Signal Characteristics
        3.2.1  Keypress signal
        3.2.2  Keyrelease signal
        3.2.3  Low frequency component of keystroke signal
   3.3  Speech Signal
        3.3.1  Voiced sounds
        3.3.2  Fricative or unvoiced sounds
        3.3.3  Plosive sounds
   3.4  Acoustic Phonetics
        3.4.1  Vowels
        3.4.2  Diphthongs
        3.4.3  Semivowels
        3.4.4  Nasals
        3.4.5  Unvoiced Fricatives
        3.4.6  Voiced Fricatives
        3.4.7  Voiced Stops
        3.4.8  Unvoiced Stops
4  Alternative Signal Classification Approaches
   4.1  Feature Vector Based Classification
        4.1.1  Frame Based Feature Vector
        4.1.2  Texture Based Feature Vector
   4.2  Temporal and Spectral Feature Extraction
        4.2.1  Zero Crossing Rate (ZCr)
        4.2.2  Cepstral feature
        4.2.3  Short Time Fourier Transform (STFT)
   4.3  Temporal Prediction Based Signal Classification
        4.3.1  Prediction Model for smooth speech signals
        4.3.2  Classification based on thresholding the norm of the modeling error vector
        4.3.3  Computation of the variance σ²n,k
        4.3.4  Weight computation
   4.4  Pitch detection based on Autocorrelation of the Normalized Signal
5  Performance Evaluation
   5.1  Test Signals
   5.2  Evaluation Terminology
   5.3  Evaluation of keystroke signal classification approaches
        5.3.1  Approach based on combined prediction of the previous and next frame
        5.3.2  Approach based on prediction using previous and next frame separately
   5.4  Evaluation of new speech detector based on the new pitch detection algorithm
   5.5  Conclusion
        5.5.1  Similarities and differences between Speech and Keystrokes
        5.5.2  Comparison of the classification approaches
        5.5.3  Future work
6  Appendix A.1: Automatic detection of keystroke in a keystroke-only file
7  Appendix A.2: Data collection of audio classification approach using prediction model (by varying parameters)
   7.1  Plots for how the classification criterion works using variance based on the average of the previous and next frame
   7.2  Plots for how the classification criterion works using variance based on the filtering of squared model error
   7.3  Plots and tables for hit rate, false alarm, and false alarm in speech region


Abbreviations:

3GPP 3rd Generation Partnership Project

AMR Adaptive Multi Rate

AMR-WB+ Adaptive MultiRate WideBand Plus

ASC Audio Signal Classification

CCF Cross-Correlation Function

DFT Discrete Fourier Transform

FFT Fast Fourier Transform

ITU International Telecommunication Union

SNR Signal to Noise Ratio

STFT Short Time Fourier Transform

VAD Voice Activity Detector


1 Introduction

The use of video conferencing is rapidly growing in domains like government, law, education, health, medicine and business. This development is enabled by the availability of affordable high speed internet connections (fixed and mobile) and the recent technological innovations in high quality video coding and affordable high resolution displays, and it is driven by a vision of sustainable growth. The strain on the environment is relieved by a drastic reduction in travel (saving airplane fuel and decreasing CO2 emissions), and the meetings become more efficient and more affordable, as the participants do not need to spend time on long distance travel and the companies and organizations can make big savings on travel and hotel expenses.

A video conferencing site usually has a limited display area to show the videos from the other participating sites and very often the number of the other participating sites is so large that it is impossible to display them all at the same time. This situation is typically handled by identifying the most active sites and only displaying those. The identification of the most active sites is usually achieved by using a voice activity detector (VAD) on the audio signals from each of the sites and maintaining a list of the most active sites, where a site is taken from the list when it becomes vocally inactive and a site is added to the list when it goes from vocally inactive to vocally active. For this to work it is imperative that the VAD does a good job of properly identifying the voice activity in each audio signal.

In a typical video conferencing room there will be people typing on their computers or laptops, and this can cause problems when the VAD classifies the keyboard typing signals as speech activity while nobody is talking in the room. This can result in a vocally inactive site not being removed from the list of active sites, blocking another vocally active site from being added to the list and creating a very bad user experience in the video conference.

Current VADs often classify keyboard typing as active speech. In this thesis work we will be looking into methods for complementing those VADs with a mechanism that will enable them to distinguish between true speech and keyboard typing. To be able to achieve this we will of course need to understand the inner functions of the current VADs and the main signal characteristics of speech and keyboard typing signals.

1.1 Thesis outline

In the first chapter we study the narrowband AMR VAD (voice activity detector) as it relates to our purpose. The problems with the VAD are also illustrated using some test data.

The second chapter focuses on the signal characteristics of keystroke and speech signals.

In the third chapter, we look into the frame based signal processing used in real time classification.

The fourth chapter describes the various alternative approaches for signal classification. Feature vector based and maximum likelihood based approaches are described in detail, as well as a derived approach based on a combination of the above features.


In the fifth chapter, the various approaches described earlier are evaluated and the obtained results are summarized.

The appendix contains one section on automatic detection of keystroke signals and another with data from the tests of the evaluation chapter.

Following are the main modules of the thesis work:

- VAD (voice activity detection)
  o Functionality
  o Problems related to our purpose
- Signal characteristics
  o Speech and keyboard signals
- Frame-based signal processing
  o Multirate signal processing
  o Effect of windowing on the signal
- Alternative signal processing approaches
  o Feature vector based classification
  o Maximum likelihood criteria based classification
  o A new approach to solve our problem, based on the above methods and data analysis
  o Improved pitch detection
- Classification results and conclusion
  o Hit rate and false alarm comparison for the above described methods


2 Voice Activity Detector

In this chapter we describe the functionality of the VAD. The VAD we have used in this study is based on the narrowband AMR VAD that is used in the speech coder so our functional description will focus on this particular VAD. For more details, we refer the reader to technical documentation on AMR by 3GPP [1].

A VAD is a technique used in speech processing to detect the presence or absence of speech. Its main uses are in speech coding and speech recognition. In speech coding it is used to avoid unnecessary coding and transmission of silent frames, saving both computational effort and network bandwidth.

The input signal to the VAD is sampled at 8 kHz, has a bandwidth of 4 kHz, and the VAD processes the signal in 20ms signal frames. The VAD consists of four analysis (feature extraction) blocks that provide signal features to the VAD decision block, which outputs a speech flag for each frame, as illustrated in figure 1.
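As a concrete illustration of this framing, a 20 ms frame at 8 kHz holds 160 samples. A minimal sketch follows; the function name and the drop-incomplete-tail behaviour are choices of this sketch, not part of the AMR code:

```python
import numpy as np

FRAME_MS = 20          # frame length used by the VAD
SAMPLE_RATE = 8000     # narrowband input, 4 kHz bandwidth
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 160 samples per frame

def split_into_frames(signal):
    """Split a 1-D signal into consecutive 20 ms frames (160 samples each).

    Trailing samples that do not fill a whole frame are dropped, which is
    an assumption of this sketch, not something the VAD specifies."""
    n_frames = len(signal) // FRAME_LEN
    return np.reshape(signal[:n_frames * FRAME_LEN], (n_frames, FRAME_LEN))

# one second of audio yields 50 frames of 160 samples
frames = split_into_frames(np.zeros(8000))
```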

Figure 1: Block diagram of the VAD. The input signal S(i) is fed to the four analysis blocks (sub-band filter bank, pitch detection, tone detection, complex signal analysis), whose outputs level[n], pitch, tone, complex_warning and complex_timer feed the VAD decision block, which outputs VAD_flag.


The analysis blocks are:

- A sub-band filter bank delivering the signal levels in 9 sub-bands.
- A pitch detection block that delivers a pitch flag for each frame.
- A tone detection block that delivers a tone flag for each frame.
- A complex signal analysis block that delivers a complex flag for each frame.

2.1 Sub-band Filter Bank

The input signal to the sub-band filter bank is sampled at 8 kHz, has a bandwidth of 4 kHz, and is processed in 20ms signal frames. The sub-band filter bank uses 8 LP/HP QMF filter blocks organized in 4 stages to generate 9 sub-band signals as illustrated in figure 2.

Since most of the speech energy is contained in the lower frequencies, the frequency band resolution is higher at the lower frequencies than at the higher ones.

Figure 2: Sub-band filter bank using LP/HP QMF filter blocks. The 0-4 kHz input passes through four stages of 5th order and 3rd order filter blocks (LP[0-2 k]/HP[2-4 k], then LP[0-1 k]/HP[1-2 k] and LP[2-3 k], then LP[0-500]/HP[500-1000], and so on), splitting it into nine sub-bands: 0-250 Hz, 250-500 Hz, 500-750 Hz, 750-1000 Hz, 1-1.5 kHz, 1.5-2 kHz, 2-2.5 kHz, 2.5-3 kHz and 3-4 kHz.


A 2-channel LP/HP alias-free QMF filter block [2] is shown in figure 3, where H0 and H1 are the low pass and high pass filters, respectively. For an alias-free realization, the high pass filter is made the mirror image of the low pass filter about the cutoff frequency π/2, as shown in figure 4, so H0 and H1 are related as

    H1(e^jω) = H0(e^j(ω-π))

Figure 3: An LP/HP QMF block. The input x[n] is filtered by H0 and H1, and each filter output is downsampled by 2 to give the two output bands.

Figure 4: High pass and low pass filters, mirror images about the cutoff frequency π/2.

In the z-domain the mirror relation can be written as H1(z) = H0(-z). An efficient way to implement the 2-channel alias-free QMF block is the polyphase form. Using the alias-free condition, the 2-band Type 1 polyphase representations of H0 and H1 are

    H0(z) = E0(z^2) + z^-1 E1(z^2)
    H1(z) = E0(z^2) - z^-1 E1(z^2)


Now figure 3 can be redrawn using the above equations, as shown in figure 5.

Figure 5: A 2-channel alias-free polyphase QMF block. The input x[n] and its one-sample delayed version pass through the polyphase components E0(z^2) and E1(z^2); the sum and the difference of the two branch outputs, each downsampled by 2, give the two output bands.

The cascade implementation of the filter block in figure 5 is shown in figure 6. The cascade implementation saves a lot of processing time by cutting the computation in each channel in half, because the signal is downsampled first. If we downsample later, we process every sample and then throw away every other one; downsampling first avoids this wasted work.

Figure 6: Cascade implementation of the QMF block. The downsamplers are moved in front of the polyphase components, which then operate as E0(z) and E1(z) at the lower sampling rate.

For the 5th order filter block, E0(z) = A1(z) and E1(z) = A2(z), where A1(z) and A2(z) are all-pass filters. For the 3rd order filter block, E0(z) = 1 and E1(z) = A3(z), where A3(z) is an all-pass filter.

An all-pass filter introduces only phase distortion; the magnitude is preserved. The filters A1(z), A2(z) and A3(z) are first order direct form all-pass filters, whose transfer function is given by equation (1):

    A_k(z) = (C_k + z^-1) / (1 + C_k * z^-1)    (1)

where C_k is the filter coefficient.
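The polyphase/all-pass structure above can be sketched in a few lines. This is a hedged illustration: the all-pass coefficients c0 and c1 below are arbitrary stand-ins for the C_k of equation (1), and the 0.5 output scaling is a normalization choice of the sketch, not taken from the AMR filter bank:

```python
import numpy as np

def allpass(x, c):
    """First order direct-form all-pass A(z) = (c + z^-1) / (1 + c z^-1),
    cf. equation (1): y[n] = c*x[n] + x[n-1] - c*y[n-1]."""
    y = np.zeros(len(x))
    x_prev = y_prev = 0.0
    for n, xn in enumerate(x):
        y[n] = c * xn + x_prev - c * y_prev
        x_prev, y_prev = xn, y[n]
    return y

def qmf_split(x, c0=0.3, c1=0.7):
    """Downsample-first (cascade) polyphase QMF split into low and high band.

    u0[m] = x[2m] feeds E0(z) = A1(z); u1[m] = x[2m-1] feeds E1(z) = A2(z);
    their half-sum and half-difference give the two bands, as in figures
    5 and 6."""
    x = np.asarray(x, dtype=float)
    u0 = x[0::2]                                  # even samples of x
    u1 = np.concatenate(([0.0], x))[:-1][0::2]    # x delayed by one, then even
    a0, a1 = allpass(u0, c0), allpass(u1, c1)
    return 0.5 * (a0 + a1), 0.5 * (a0 - a1)       # low band, high band

# a DC input ends up in the low band, an alternating input in the high band
```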

As shown in the figure 2, after each stage, the signal is downsampled by a factor of 2. So the final 9 output sub-band signals are sampled at different sampling frequencies and have different bandwidths as summarized in table 1. Table 1 also shows the number of samples per frame for each of the sub-band signals.

For each frame, the signal level is summed over a 24 ms time interval [1]. Since the input frame size is 20 ms, the remaining 4 ms must be taken from the previous frame. To express this in numbers of samples, consider sub-band 9: 48 samples are required, the current frame provides 40, and the remaining 8 samples are taken from the previous frame.

Table 1: Frequency distribution in sub-bands

    Band no | Frequency    | Output from stage | Sampling rate | Samples in 20 ms frame
    1       | 0-250 Hz     | 4                 | 500 Hz        | 10
    2       | 250-500 Hz   | 4                 | 500 Hz        | 10
    3       | 500-750 Hz   | 4                 | 500 Hz        | 10
    4       | 750-1000 Hz  | 4                 | 500 Hz        | 10
    5       | 1-1.5 kHz    | 3                 | 1 kHz         | 20
    6       | 1.5-2 kHz    | 3                 | 1 kHz         | 20
    7       | 2-2.5 kHz    | 3                 | 1 kHz         | 20
    8       | 2.5-3 kHz    | 3                 | 1 kHz         | 20
    9       | 3-4 kHz      | 2                 | 2 kHz         | 40
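The 24 ms level window can be checked per band with a few lines; SUB_BANDS maps band number to the sampling rate from table 1, and the function name is illustrative:

```python
# band number -> sampling rate in Hz, from table 1
SUB_BANDS = {1: 500, 2: 500, 3: 500, 4: 500,
             5: 1000, 6: 1000, 7: 1000, 8: 1000, 9: 2000}

def level_window_samples(band, window_ms=24, frame_ms=20):
    """Samples needed for the 24 ms level window in a sub-band, and how
    many of them must be carried over from the previous 20 ms frame."""
    fs = SUB_BANDS[band]
    total = fs * window_ms // 1000
    from_previous = total - fs * frame_ms // 1000
    return total, from_previous

# sub-band 9: 48 samples in total, 8 of them from the previous frame
```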

2.2 Pitch detection

This block computes the autocorrelation of the current signal frame and decides whether the frame is pitched or not. Examples of pitched speech signals are vowel sounds and other periodic signals. In the AMR VAD [1], pitch detection is done by comparing the open-loop lags (delays) computed by the open-loop pitch analysis function, which computes the autocorrelation maxima and the delays corresponding to them.

In the AMR VAD, the delays are divided into three ranges and one autocorrelation maximum is selected in each range. Instead of simply choosing the global autocorrelation maximum, selection logic is used that favors the autocorrelation maxima corresponding to the lower delay ranges.

In the AMR VAD, if the bit rate mode is 5.15 or 4.75 kbit/s, open-loop pitch analysis is performed once per frame (every 20 ms). The autocorrelation is computed as in equation (2):

    O_k = sum_{n=0}^{159} x(n) * x(n-k)    (2)

where O_k is the autocorrelation of the 20 ms signal frame x(0:159) at delay k. The delays are divided into three ranges as shown in table 2.


Table 2: Delay ranges for open-loop pitch analysis

    Range number (i) | Delay range (k)
    3                | 20, ..., 39
    2                | 40, ..., 79
    1                | 80, ..., 143
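Equation (2) and the per-range normalized maxima can be sketched as below. The normalization by the power of the delayed frame follows the description in the text; the function names and the brute-force search are choices of this sketch:

```python
import numpy as np

DELAY_RANGES = {3: range(20, 40), 2: range(40, 80), 1: range(80, 144)}

def autocorr(x, k):
    """O_k = sum_{n=0}^{159} x(n)*x(n-k), equation (2); x(n-k) for n < k
    reaches back before the frame, so x must carry at least 143 samples
    of history in front of the current 160-sample frame."""
    x = np.asarray(x, dtype=float)
    frame = x[-160:]                                    # current frame x(0:159)
    return sum(frame[n] * x[len(x) - 160 + n - k] for n in range(160))

def range_maxima(x):
    """Return {i: (M_i, t_i)}: the normalized autocorrelation maximum and
    its delay in each of the three delay ranges of table 2."""
    x = np.asarray(x, dtype=float)
    out = {}
    for i, delays in DELAY_RANGES.items():
        best_val, best_k = -np.inf, None
        for k in delays:
            delayed = x[len(x) - 160 - k:len(x) - k]    # frame delayed by k
            power = np.dot(delayed, delayed) or 1.0     # avoid divide-by-zero
            val = autocorr(x, k) / power
            if val > best_val:
                best_val, best_k = val, k
        out[i] = (best_val, best_k)
    return out

# for a 200 Hz tone (period 40 samples at 8 kHz), the range 2 maximum
# lies at delay 40 with a normalized value close to 1
```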

Maxima are computed in each range and normalized by dividing by the signal power of the correspondingly delayed frame. The normalized maxima and the corresponding delays are denoted (M_i, t_i), i = 1, 2, 3. An autocorrelation maximum M(T_op) corresponding to the delay T_op is then selected, favoring the lower delay ranges over the higher ones. The autocorrelation maximum of the largest delay range is assigned first, and it is then updated with the autocorrelation maxima of the lower delay ranges using the following logic:

    T_op = t_1
    M(T_op) = M_1
    if M_2 > 0.85 * M(T_op)
        M(T_op) = M_2
        T_op = t_2
    end
    if M_3 > 0.85 * M(T_op)
        M(T_op) = M_3
        T_op = t_3
    end

A counter variable lagcount stores the number of lags for the current frame. The difference between consecutive open-loop lags is computed, and if it is smaller than a threshold, lagcount is incremented. If the sum of the lagcounts of two consecutive frames is higher than the pitch threshold, the pitch flag is set.
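The selection logic above can be written out as a small function; the inputs are the per-range normalized maxima and delays (M_i, t_i), and 0.85 is the weighting used in the comparisons above:

```python
def select_open_loop_pitch(maxima):
    """Select the open-loop pitch delay T_op, favoring lower delay ranges.

    `maxima` maps the range number i (1 = longest delays, 3 = shortest)
    to the normalized autocorrelation maximum and its delay (M_i, t_i)."""
    M_op, T_op = maxima[1]            # start from the largest delay range
    for i in (2, 3):                  # then favor the lower delay ranges
        M_i, t_i = maxima[i]
        if M_i > 0.85 * M_op:         # a slightly smaller short-lag max wins
            M_op, T_op = M_i, t_i
    return T_op, M_op

# a strong short-lag maximum wins even though it is slightly smaller
T_op, M_op = select_open_loop_pitch({1: (1.0, 100), 2: (0.9, 50), 3: (0.8, 25)})
# here T_op = 25: 0.9 > 0.85*1.0 selects range 2, then 0.8 > 0.85*0.9 range 3
```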

2.3 Tone detection

The main functionality of this block is to detect sinusoidal signals such as information tones. One way is to use a second order AR model and examine the poles of the model relative to the unit circle: if the poles are close to the unit circle, the signal is classified as tonal.

In the AMR VAD [1], a pitched signal is classified as tonal if the pitch gain is higher than the tone threshold (TONE_THR), and otherwise as non-tonal. The normalized autocorrelation maximum from the open-loop pitch analysis function is compared to the tone threshold; if it is higher, the tone flag is set to 1, otherwise to 0.
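The AR-model variant mentioned above can be sketched as follows. This is an illustrative implementation, not the AMR code: the order-2 Levinson-Durbin recursion is standard, but the 0.95 pole-radius threshold is an assumed value:

```python
import numpy as np

def tone_flag_ar(frame, radius_thr=0.95):
    """Flag a frame as tonal if the poles of a fitted second order AR
    model lie close to the unit circle (radius above radius_thr)."""
    x = np.asarray(frame, dtype=float)
    r = [float(np.dot(x[:len(x) - k], x[k:])) for k in range(3)]  # r0, r1, r2
    if r[0] <= 0.0:
        return 0                      # silent frame: nothing tonal to find
    k1 = r[1] / r[0]                  # Levinson-Durbin, order 1
    e1 = r[0] * (1.0 - k1 * k1)
    if e1 <= 1e-12 * r[0]:
        return 1                      # perfectly predictable: treat as tonal
    k2 = (r[2] - k1 * r[1]) / e1      # order 2 reflection coefficient
    a1, a2 = k1 * (1.0 - k2), k2      # AR polynomial 1 - a1 z^-1 - a2 z^-2
    poles = np.roots([1.0, -a1, -a2])
    return int(np.max(np.abs(poles)) > radius_thr)
```

A pure sinusoid drives the pole pair onto the unit circle (AR polynomial close to 1 - 2cos(ω)z^-1 + z^-2), while white noise yields small coefficients and poles near the origin.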


2.4 Complex signal analysis

This block detects highly correlated signals in the high pass weighted speech domain. One example of a signal with a high correlation value is music.

2.5 VAD decision

The block diagram of the VAD decision algorithm [1] is shown in figure 7.

Figure 7: Block diagram for VAD decision

As shown in figure 7, the inputs to this block are the signal levels from the sub-band filter bank, the pitch flag from the pitch detection block, the tone flag from the tone detection block and the complex warning flag from the complex signal analysis block. The SNR (signal to noise ratio) is computed from the signal level of the current frame and the background noise estimate; background noise estimation is described in sub-section 2.5.2. The computed SNR is compared to an adaptive threshold that depends on the noise level. If the SNR is higher than the threshold, an intermediate VAD flag (vad_reg) is set, which together with the hangover logic determines the VAD flag (vad_flag). How the hangover works with the intermediate VAD flag is described in sub-section 2.5.5.



Each block in figure 7 is described in the following sub-sections, starting with the SNR computation block.

2.5.1 SNR computation

For the SNR computation, the ratio between the signal level of the input frame and the background noise estimate is computed in each band; the squared ratios are summed into the output variable snr_sum for the current frame, as shown in equation (3):

    snr_sum = sum_{n=1}^{9} MAX(1.0, level[n] / bckr_est[n])^2    (3)

where level[n] and bckr_est[n] are the signal level and the background noise estimate in band n, respectively.
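Equation (3) translates directly into code; the array names follow the text:

```python
import numpy as np

def snr_sum(level, bckr_est):
    """Equation (3): sum over the 9 bands of MAX(1.0, level[n]/bckr_est[n])^2,
    so each band contributes at least 1.0 to the sum."""
    ratio = np.maximum(1.0, np.asarray(level, float) / np.asarray(bckr_est, float))
    return float(np.sum(ratio ** 2))

# with every band at the noise floor the sum is 9.0, its smallest possible value
```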

2.5.2 Background Noise Estimation

Background noise estimation is updated using the amplitude levels of the previous frame in the non-speech region only. In the VAD, the background noise is estimated using a first-order IIR filter in each band, as shown in equation (4):

bckr\_est_{m+1}[n] = (1 - \alpha) \cdot bckr\_est_{m}[n] + \alpha \cdot level_{m-1}[n]    (4)

where m is the current frame, n is the band number and level_{m-1}[n] is the signal level of the previous frame.

The variable \alpha is set by comparing the background noise estimate of the current frame with the signal level of the previous frame. Pseudo-code for this is as below:

if (bckr\_est_{m}[n] < level_{m-1}[n])
    \alpha = UP
else
    \alpha = DOWN
end

where the variables UP and DOWN are set according to the complex signal hangover, the pitch and the intermediate VAD decision.
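The update in equation (4) can be sketched as below. The smoothing constants here are illustrative placeholders; as stated above, the real VAD derives UP and DOWN from the complex signal hangover, the pitch flag and the intermediate VAD decision:

```python
import numpy as np

# Illustrative smoothing constants (assumed values, not the codec's)
ALPHA_UP = 0.03    # slow upward adaptation
ALPHA_DOWN = 0.85  # fast downward adaptation

def update_bckr_est(bckr_est, prev_level):
    """First-order IIR update of the per-band noise estimate, equation (4)."""
    alpha = np.where(bckr_est < prev_level, ALPHA_UP, ALPHA_DOWN)
    return (1.0 - alpha) * bckr_est + alpha * prev_level
```

The asymmetry makes the estimate rise slowly while the level is above it (so speech does not inflate the noise floor) but fall quickly when the level drops.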

2.5.3 Threshold Adaption

The threshold is tuned according to the background noise level. The threshold is tuned to a lower value when the noise level is high, so that speech is detected reliably, although some noise frames may then be classified as speech frames.

The average background noise level is computed by adding the noise estimates of all bands, as shown in equation (5).


noise\_level = \sum_{n=1}^{9} bckr\_est[n]    (5)

The VAD threshold is calculated from the average noise level, as shown in equation (6).

vad\_thr = VAD\_SLOPE \cdot (noise\_level - VAD\_PI) + VAD\_THR\_HIGH    (6)

where VAD_SLOPE, VAD_PI and VAD_THR_HIGH are constants.

2.5.4 Comparison block

The inputs to this block are snr_sum from the SNR Computation block and vad_thr from the Threshold Adaption block. The intermediate VAD decision (vad_reg) is made by comparing snr_sum to vad_thr; vad_reg is set to 1 if snr_sum is higher than vad_thr and to 0 otherwise:

if (snr\_sum > vad\_thr)
    vad\_reg = 1
else
    vad\_reg = 0
end
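Equations (5), (6) and the comparison above combine into a few lines of Python. The constants below are placeholders for illustration, not the tuned codec values of VAD_SLOPE, VAD_PI and VAD_THR_HIGH:

```python
import numpy as np

# Placeholder constants; the codec defines the real tuned values
VAD_SLOPE = -0.05
VAD_PI = 50.0
VAD_THR_HIGH = 20.0

def vad_decision(snr_sum, bckr_est):
    """Adaptive threshold (equations (5), (6)) and intermediate decision."""
    noise_level = float(np.sum(bckr_est))                         # equation (5)
    vad_thr = VAD_SLOPE * (noise_level - VAD_PI) + VAD_THR_HIGH   # equation (6)
    vad_reg = 1 if snr_sum > vad_thr else 0                       # comparison
    return vad_reg, vad_thr
```

With a negative VAD_SLOPE, a higher noise level lowers the threshold, matching the tuning behavior described in section 2.5.3.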

2.5.5 Hangover block

Figure 8: Hangover addition (the intermediate VAD flag is turned into the actual VAD flag once the burst length exceeds burst_len).


Hangover is added to avoid detecting the silence between two spoken words as non-speech frames. It combines two conditions: if a certain number of consecutive frames (burst_len) have the intermediate VAD decision set to 1, the hangover flag is set, and the following hang_len frames keep the VAD flag set to 1 even though their intermediate VAD flags become 0. This is illustrated in figure 8.

A green dot in figure 8 indicates an intermediate VAD flag set to 1 and an orange dot a flag set to 0. After hangover addition, hang_len orange dots become green.
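A minimal state machine matching this description can be sketched as below; the burst_len and hang_len values are illustrative, not the tuned codec constants:

```python
def add_hangover(vad_reg_seq, burst_len=3, hang_len=5):
    """Turn a sequence of intermediate VAD flags into final VAD flags.

    After burst_len consecutive 1-frames, the next hang_len 0-frames are
    still reported as speech, bridging short pauses between words.
    """
    out = []
    burst = 0   # consecutive intermediate 1s seen so far
    hang = 0    # remaining hangover frames
    for flag in vad_reg_seq:
        if flag:
            burst += 1
            if burst >= burst_len:
                hang = hang_len
            out.append(1)
        else:
            burst = 0
            if hang > 0:
                hang -= 1
                out.append(1)  # hangover keeps the flag up
            else:
                out.append(0)
    return out

print(add_hangover([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]))
# → [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Note that a burst shorter than burst_len triggers no hangover, so isolated one-frame detections do not stretch into long false speech segments.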

2.6 Problems with VAD

We tested the VAD with a test signal that is a mixture of clean speech and keyboard typing. The plot of the mixed signal along with the speech flag from the VAD is shown in figure 9. The first subplot shows the mixed signal; the second shows the speech flag per frame number, where a flag of one indicates that the frame contains speech and a flag of zero that it does not.

Figure 9: Test run using VAD

Figure 9 makes clear that the VAD marks almost all keystroke frames as speech frames. In the next section we therefore look into the signal characteristics to find out why keystroke signals are classified as speech by the VAD.


3 Signal Characteristics

This chapter describes the general characteristics of keystroke and speech signals in the time and frequency domains, such as the duration and frequency range of a keystroke and the phonetic components of speech signals.

To study the signal characteristics of these signals we had to record a representative collection of both types of signal. How we went about this is described in section 3.1.

In section 3.2 we describe the signal characteristics of the keystroke signals and section 3.3 covers the signal characteristics of speech signals.

3.1 Signal Recording

In this section we describe how we covered the variation of the keystroke signals (3.1.1) and the speech signals (3.1.2).

3.1.1 Keystroke signals

Keystroke signals can vary quite a bit, depending on the keyboard used, the person typing and that person's mood. It is therefore important to obtain keystroke signals that cover this variation space reasonably well. Keyboards from different manufacturers have different mechanics that affect the characteristics of the keystroke signal. To cover that variation we used the following keyboards in our recordings:

HP keyboard

Logitech Keyboard

Logitech wireless keyboard

Mac laptop keyboard

Another factor that influences the signal characteristics is the person typing on the keyboard: different persons have different typing styles, which are also affected by mood. Some people type very fast, so the duration between keystrokes is short compared to people typing slowly. In our work, we recorded two minutes of typing by ten people on each of the above keyboards. The recording room was noise free and insulated from the outside environment.

3.1.2 Speech signals

Speech signals vary from person to person and with the phonetic sequences of the sentences being spoken. To cover this variation we used recordings made by Ericsson of different English-speaking persons speaking different sentences with well-chosen phonetic content.


There were 160 speech files, each 8 seconds long. To cover the phonetic variation, a group of seven female and nine male speakers was recorded, each speaking ten files of two sentences.

3.2 Keystroke Signal Characteristics

A keystroke begins with a key press signal component and ends with a key release signal component [3][4]. The duration of a keystroke is between 60 and 200 ms [5]. The key press and key release components are high frequency components, and the middle component between the two is a low frequency component.

Considering the typing mechanism on a keyboard, a person first touches the button, then hits it and finally releases it, all in rapid succession. This mechanism results in three peaks: a touch peak, a hit peak and a release peak [3]. Generally the touch and hit peaks are very close and sometimes overlapping. The touch peak and hit peak form the key press component; the release peak forms the key release sound. The duration of the key press and key release components varies from 10 to 35 ms, as shown in figures 10 and 11.

Figure 10: Keypress and keyrelease duration


Figure 11: Keystroke duration

Spectrally, keystroke signals are highly random due to typing style, key sequences and keyboard mechanics. The signal power of a key press is larger than that of the key release. The main frequency range of the key press lies between 1 and 8 kHz. The spectrum of a typical keystroke is shown in figure 12.

Figure 12: Spectrum of a keystroke signal

Sub-section 3.2.1 describes the variations of the keypress signal in more detail. The keyrelease and the middle component of the keystroke signal are described in sub-sections 3.2.2 and 3.2.3 respectively.


3.2.1 Keypress signal

As mentioned earlier, the keystroke signal varies with the keyboard mechanics and the typing style of the person. It was observed that the number of peaks in a keypress varies with the keyboard mechanics. Mainly one-peak and two-peak keypress signals were observed in our data set.

A one-peak key press component was observed on the HP keyboard. Here the touch peak and hit peak are not distinguishable. The key press is followed by the low frequency middle component and the key release, which may have more than one peak. The key press component is stronger than the key release component. The average width of the key press is 10-15 ms. A typical keystroke on the HP keyboard looks like figure 13.

A two-peak key press component was observed on the Logitech wired, Logitech wireless and Mac laptop keyboards. Here the touch peak and hit peak are somewhat apart, so the key press has two peaks, which are followed by the low frequency middle component and the key release. The width of the key press varies from 15 to 25 ms. A typical keystroke on a Logitech keyboard looks like figure 14.

Sometimes the keypress also exhibits multiple peaks, as shown in figure 15, due to the mechanics of the key; this behavior is found for specific keys such as the spacebar and the enter key. In this case the width of the keypress varies between 20 and 35 ms.

Figure 13: One peak key press

3.2.2 Keyrelease signal

Keyrelease signals have weak signal strength compared to keypress signals, and the keyrelease peaks are not as sharp as the keypress peaks. The width of a keyrelease peak varies between 10 and 25 ms. A typical keyrelease is shown in figure 10.


3.2.3 Low frequency component of Keystroke signal

The part of the keyboard signal between the keypress and the keyrelease contains low frequency components. The spectrum of the low frequency component is shown in figure 16.

Figure 14: Two peak key press

Figure 15: Multipeak key press in spacebar of MAC laptop keyboard


Figure 16: Spectrum of high and low frequency component of keyboard signal

3.3 Speech Signal

In this section a brief description of speech signals is given. The speech signal can be divided into three distinct classes according to the mode of excitation [6]: voiced sounds (3.3.1), fricative sounds (3.3.2) and plosive sounds (3.3.3). Figure 17 shows a cross-sectional view of the vocal tract system.

Figure 17: Cross sectional view of vocal tract system


3.3.1 Voiced Sounds

Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation. These periodic pulses excite the vocal tract. Examples of voiced sounds are /U/, /d/, /w/, /i/ and /e/ [6], shown in figure 18.

Figure 18: /i/ in finished

3.3.2 Fricative or unvoiced sounds

Fricative sounds are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at a high enough velocity to produce turbulence. This creates a broad-spectrum noise source that excites the vocal tract. An example of a fricative sound is /sh/, labeled /∫/ [6], shown in figure 19.

Figure 19: /sh/ in finished


3.3.3 Plosive sounds

Plosive sounds result from a complete closure of the front of the vocal tract, a build-up of pressure behind the closure, and its abrupt release. A typical example of a plosive sound is /t∫/ [6], shown in figure 20.

Figure 20: /t/ in town

3.4 Acoustic Phonetics

Most languages can be described in terms of a set of distinctive sounds or phonemes. American English has 42 phonemes, including vowels, diphthongs, semivowels and consonants [6]. There are many ways to study phonetics, e.g. the study of distinctive features or characteristics of the phonemes.

3.4.1 Vowels

Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords. The variation of cross-sectional area along the vocal tract determines the resonant frequencies of the tract (also called formants). The dependence of cross-sectional area upon distance along the tract is called the area function of the vocal tract. The position of the tongue determines the area function of a particular vowel, but the positions of the jaw, lips and, to a small extent, the velum also influence the resulting sound. Each vowel is characterized by its vocal tract configuration. Examples of vowels are /a/ in "father" and /i/ in "eve" [6].

3.4.2 Diphthongs

A diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another. There are six diphthongs in American English: /eI/ (as in bay), /oU/ (as in boat), /aI/ (as in buy), /aU/ (as in how), /oI/ (as in boy) and /ju/ (as in you) [6].


3.4.3 Semivowels

Semivowels are characterized by a gliding transition of the vocal tract area function between adjacent phonemes; their acoustic characteristics are therefore strongly influenced by the context in which they occur. Examples of semivowels are /w/, /l/, /r/ and /y/. They are called semivowels because they sound like vowels [6].

3.4.4 Nasals

Nasals are produced by glottal excitation with the vocal tract constricted at some point. Due to the lowering of the velum, the air flows through the nasal tract, and the mouth serves as a resonant cavity. Nasals are characterized by resonances that are spectrally broader than those of vowels. Examples are /m/ and /n/ [6].

3.4.5 Unvoiced Fricatives

Unvoiced fricatives are produced by exciting the vocal tract with a steady air flow through a constriction at some location in the vocal tract. The constriction location determines the fricative sound. Examples of unvoiced fricatives and their constriction places are shown in table 3 [6]. As shown in figure 19, unvoiced fricative sounds are non-periodic.

Table 3: Constriction place for unvoiced fricatives

Unvoiced fricative    Constriction place
/f/                   Lips
/θ/                   Teeth
/s/                   Middle of the oral tract
/sh/                  Back of the oral tract

3.4.6 Voiced Fricatives

In the production of voiced fricatives, the constriction place is similar to that of the unvoiced fricatives, but two excitation sources are involved. One excitation source is at the glottis; the other is the turbulent flow at the constriction [6].

3.4.7 Voiced Stops

These sounds are produced by building up pressure at some constriction point in the oral tract and releasing it suddenly. Examples of voiced stops and their constriction points are shown in table 4 [6].

Table 4: Constriction place for voiced stops

Voiced stops Constriction point

/b/ Lips

/d/ Back of teeth

/g/ Near velum

These sounds are highly dynamic and their properties depend on the vowel that follows them. Their waveforms give little information about the particular voiced stop.


3.4.8 Unvoiced Stops

Unvoiced stops are very similar to the voiced stops, except that the vocal cords do not vibrate. Examples are /p/, /t/ and /k/. Their duration and frequency content vary with the stop consonant [6].


4 Alternative Signal Classification Approaches

Chapter 2 described the VAD that is currently used for classifying/detecting speech content in an audio signal. As described there, the current VAD is prone to classifying non-speech content as speech. In this chapter we describe two approaches that can be used to classify/detect keystrokes in audio signals, as well as an improved pitch detection method, which can be integrated into the VAD to greatly improve its speech detection performance by strongly reducing the number of incorrectly classified non-speech frames.

Audio signal classification is explained in sections 4.1-4.3. Feature vector based classification is explained in section 4.1. Section 4.2 describes basic approaches used for window-based signal processing. Temporal prediction based classification is explained in section 4.3, where we present two approaches based on a prediction model: the first predicts the current frame from a combination of the previous and following frames (combined backward and forward prediction), whereas the second uses separate backward and forward predictions.

The improved pitch detection method is explained in section 4.4, where we present an approach based on the autocorrelation of the normalized signal to compute the pitch of the audio signal.

4.1 Feature Vector Based Classification

A feature vector is used to classify signals into their corresponding classes. The selection of features plays an important role in obtaining reliable and robust signal classification. There are two different approaches to selecting these features [7]:

Frame based feature vectors

Texture based feature vectors.

4.1.1 Frame Based Feature Vector

In the frame based feature vector approach, the input signal is broken into small blocks, called analysis windows, and a feature vector is computed for each block. The feature vector is typically computed over time intervals of 10-40 ms [7]. The frame based approach is widely used in real-time classification of audio signals.
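Breaking a signal into analysis windows can be sketched as below; the function name and the 20 ms / 10 ms framing at 16 kHz are illustrative choices, not values prescribed by the thesis:

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping analysis windows."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# Example: 20 ms windows with a 10 ms hop at 16 kHz sampling rate
frames = frame_signal(np.zeros(16000), frame_len=320, hop_len=160)
print(frames.shape)  # (99, 320)
```

A per-frame feature vector would then be computed from each row of the returned matrix.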


4.1.2 Texture Based Feature Vector

The major drawback of the frame-based approach is that it does not take into account long-term characteristics that can improve the classification result. For example, in music classification the rhythmic structure can help in detecting the genre, but within a 40 ms frame the rhythmic structure cannot be found. Similarly, envelope detection also helps in classification. For these purposes a structural description, and thus a longer time interval, is required. In music classification not only a feature but also its variation helps considerably. A texture window is used for this purpose, as it contains a long-term segment (in the range of seconds) spanning many analysis windows [7]. In this approach, statistical measures over the analysis windows, such as the mean, the standard deviation, the mean of the derivative and the standard deviation of the derivative, are computed for classification.

The texture based feature vector is not suitable for real-time classification because it needs processing of a large number of frames and may introduce a large delay, which defeats the classification purpose.
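The statistics named above, computed over one texture window of per-frame feature values, can be sketched as follows (the function name is mine; the input is assumed to be a sequence of one scalar feature per analysis window):

```python
import numpy as np

def texture_stats(frame_features):
    """Summarize one texture window of per-frame feature values.

    Returns the mean, standard deviation, mean of the first derivative
    and standard deviation of the first derivative, as used for
    texture-based feature vectors.
    """
    f = np.asarray(frame_features, dtype=float)
    d = np.diff(f)  # frame-to-frame derivative
    return f.mean(), f.std(), d.mean(), d.std()
```

For a vector-valued frame feature, the same statistics would be taken per dimension and concatenated.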

4.2 Temporal and Spectral Feature Extraction

In this section various temporal and spectral features are explained that are widely used in classification of audio signals.

4.2.1 Zero Crossing Rate (ZC_r)

Zero crossing (ZC) is the number of times the signal crosses zero during one analysis window and is often used to obtain a rough estimate of the fundamental frequency of voiced signals [8][12]. For a complex signal it gives a measure of noisiness.

The short-time ZC_r is generally helpful in differentiating between voiced and unvoiced segments of speech due to their differing spectral energy concentration. If the signal is spectrally deficient, like a sinusoid, it crosses the zero line twice per cycle; if it is spectrally rich, it may cross the zero line many more times per cycle. The zero crossings of a speech signal, a keystroke signal and a mixed signal are shown in figures 21, 22 and 23 respectively.

Zero Crossing (ZC) is defined as the number of times the signal amplitude changes sign during one analysis window, as shown in equation (7). The Zero Crossing Rate (ZC_r) is defined as the change of the zero crossing count of the current frame with respect to the previous frame, as shown in equation (8) [7].

ZC = \frac{1}{2} \sum_{n=1}^{N} \left| sign(x(n)) - sign(x(n-1)) \right|    (7)

where the sign function is defined by

sign(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}


ZC_r = ZC_{current} - ZC_{prev}    (8)

where ZC_r is the zero crossing rate, ZC_{current} the zero crossing count of the current analysis window and ZC_{prev} that of the previous analysis window.
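Equations (7) and (8) can be sketched directly in Python (function names are mine; the example array is illustrative):

```python
import numpy as np

def zero_crossings(x):
    """Equation (7): half the sum of |sign(x[n]) - sign(x[n-1])|."""
    s = np.sign(x)  # 1 for positive, 0 for zero, -1 for negative
    return int(np.sum(np.abs(np.diff(s))) // 2)

def zero_crossing_rate(zc_current, zc_prev):
    """Equation (8): change in zero-crossing count between frames."""
    return zc_current - zc_prev

x = np.array([0.3, -0.5, 0.2, 0.4, -0.1])
print(zero_crossings(x))  # signs + - + + -  →  3 crossings
```

Since only signs and differences are needed, the feature is computable without any transform, which is the speed advantage noted below.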

Figure 21: Zero crossing of clean speech signal

As shown in figure 21, the ZC of voiced speech is lower than that of unvoiced speech and keystrokes, so this feature can help in differentiating among them. One important advantage of ZC is that it is very fast to calculate: it is a time domain feature, so the spectrum does not need to be computed [8].

Figure 22 shows the zero crossings of keystroke signal frames; the zero crossing count of the keystroke signal stays below 350.

Figure 23 shows zero crossing of mixed speech signal frames.


Figure 22: Zero crossing of keystroke signal

Figure 23: Zero crossing of mixed signal


4.2.2 Cepstral feature

Spectral features are more general-purpose than temporal features, which are limited to tasks such as instrument recognition, genre recognition or speaker recognition. The cepstral feature is a commonly used spectral feature in speech processing. The idea behind the cepstrum is to find the range of resonant frequencies, which can be computed by extracting a smooth envelope from the lower cepstral coefficients. The following three types of cepstrum are commonly used:

Power Cepstrum

Real Cepstrum

Complex Cepstrum

The power cepstrum is often used to determine the pitch of a human speech signal. It is defined as the squared magnitude of the Fourier transform of the log of the squared magnitude of the Fourier transform of the signal [11][18].

power\_ceps = \left| fft\left( \log_{10} \left| fft(signal) \right|^{2} \right) \right|^{2}    (9)

The complex cepstrum is used in homomorphic signal processing. It is defined as the Fourier transform of the log of the Fourier transform of the signal. The complex cepstrum allows complete reconstruction of the signal, as it includes the phase information along with the magnitude information [75].

complex\_ceps = fft\left( \log_{10}\left( fft(signal) \right) \right)    (10)

Homomorphic system theory states that if two signals are convolved in the time domain, one having high frequency components and the other low frequency components, then the signals can be extracted separately through a selection of cepstral coefficients: the lower cepstral coefficients represent the low frequency signal and the higher cepstral coefficients the high frequency signal [6].

A very important property of the complex cepstral domain is that the convolution of two signals can be expressed as the addition of their cepstra [6][9]. Suppose the signal x is the convolution of two signals x_1 and x_2 in the time domain; then the Fourier transform of x is the multiplication of the Fourier transforms of x_1 and x_2:

x = x_1 * x_2 \;\xrightarrow{FFT}\; X = X_1 \cdot X_2    (11)

where X, X_1 and X_2 are the Fourier coefficients of the signals x, x_1 and x_2 respectively.

The complex cepstrum of equation (11) is shown in equation (12).

X_{cc} = fft\left( \log_{10}\left( fft(x) \right) \right) = fft\left( \log_{10} X_1 + \log_{10} X_2 \right) = X_{1cc} + X_{2cc}    (12)


where X_{cc}, X_{1cc} and X_{2cc} are the complex cepstral coefficients of the signals x, x_1 and x_2 respectively.

If the signal x_1 is a low frequency signal and x_2 a high frequency signal, then the lower cepstral coefficients of X_{cc} will be dominated by X_{1cc} and the higher cepstral coefficients of X_{cc} will be dominated by X_{2cc}.
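The additivity in equation (12) is easy to verify numerically. The sketch below uses the real cepstrum (log magnitude only) to sidestep complex-logarithm phase unwrapping, and two decaying exponentials whose spectra have no zeros; all names and signal choices are mine:

```python
import numpy as np

def real_cepstrum(x):
    # real(ifft(log10(|fft(x)|))); magnitudes only, phase discarded
    return np.real(np.fft.ifft(np.log10(np.abs(np.fft.fft(x)))))

N = 64
n = np.arange(N)
x1 = 0.9 ** n  # slowly decaying exponential
x2 = 0.5 ** n  # faster decaying exponential

# Circular convolution of x1 and x2, computed in the frequency domain
xc = np.real(np.fft.ifft(np.fft.fft(x1) * np.fft.fft(x2)))

# The cepstrum of the convolution equals the sum of the cepstra
print(np.allclose(real_cepstrum(xc), real_cepstrum(x1) + real_cepstrum(x2)))
```

Because |X_1 X_2| = |X_1| |X_2| and the logarithm turns the product into a sum, the equality holds exactly up to floating-point roundoff.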

Another important application is the extraction of a smooth envelope of the log of the Fourier transform of the signal, so that resonant frequencies are easier to detect. As the log of the FT contains both low and high frequency components, the high frequency components can be removed by selecting only the lower coefficients and estimating the spectrum by taking the inverse Fourier transform of these lower coefficients only. The cutoff index should, however, be chosen carefully.

The real cepstrum is defined as the real part of the inverse Fourier transform of the log of the absolute value of the Fourier transform of the signal [10]. Only the real part is taken because the computation produces a very small imaginary part. As the phase information is removed in the computation, the amount of information being processed is significantly reduced.

real\_ceps = real\left( ifft\left( \log_{10} \left| fft(signal) \right| \right) \right)    (13)

For a signal x, its real cepstrum is defined as below:

X_{rc} = real\left( ifft\left( \log_{10} \left| fft(x) \right| \right) \right) = real\left( ifft(X) \right)    (14)

where X_{rc} is the real cepstrum of the signal x and X is the log of the absolute value of the Fourier transform of x, as shown in equation (15):

X = \log_{10} \left| fft(x) \right|    (15)

We are interested in the smooth part of the real cepstrum, i.e. its envelope X_{rc\_envelope}, which can be computed by selecting only the lower coefficients X_{smooth} of X and taking the real part of the inverse Fourier transform of these coefficients. This works because X is real and symmetric.

X_{rc\_envelope} = real\left( ifft(X_{smooth}) \right)    (16)

where X_{rc\_envelope} is the estimated envelope of the real cepstrum and X_{smooth} contains the lower coefficients of X.
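A sketch of the envelope estimation in equations (14)-(16) follows. The low-quefrency liftering below (keeping the first n_coeff cepstral coefficients and their symmetric mirror, since X is real and symmetric) is one common way to realize the selection described above; function and variable names are mine:

```python
import numpy as np

def cepstral_envelope(x, n_coeff=50):
    """Estimate a smooth log-spectral envelope from the lowest cepstral
    coefficients (a sketch of equations (14)-(16); assumes n_coeff >= 2)."""
    X = np.log10(np.abs(np.fft.fft(x)))   # equation (15)
    c = np.real(np.fft.ifft(X))           # real cepstrum, equation (14)
    keep = np.zeros_like(c)
    keep[:n_coeff] = 1.0                  # low-quefrency coefficients
    keep[-(n_coeff - 1):] = 1.0           # their symmetric mirror
    return np.real(np.fft.fft(c * keep))  # smoothed version of X
```

With n_coeff covering all coefficients the envelope reproduces X exactly; smaller values give progressively smoother spectra, which is the trade-off behind the cutoff choice mentioned above.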


In this thesis work we computed the smooth real cepstrum coefficients. The plot of X is shown by the red line in the first window of figure 24. The green line shows the estimate using the first 50 cepstral coefficients. As seen from the figure, the plot is not very smooth, so it may be difficult to find the resonant frequency in the general case. In the particular keystroke case shown in figure 24, the resonant frequency is approximately 6 kHz.

Figure 24: Estimated real cepstrum of the keypress signal and its estimate

The log of the FT of voiced and unvoiced speech is shown in figures 25 and 26 respectively. As shown in figure 25, the resonant frequencies of voiced speech lie below 1 kHz. Fricative sound is spectrally flat compared to voiced speech; its resonant frequencies lie in a higher frequency range (10-15 kHz), as shown in figure 26.


Figure 25 : Estimated real cepstrum of the voiced speech signal and its estimate

Figure 26 : Estimated real cepstrum of the fricative sound signal and its estimate

4.2.3 Short Time Fourier Transform (STFT)

An FFT-based feature is widely used in signal classification. The Short Time Fourier Transform is obtained by taking the Fourier transform of the windowed signal. The width of the window function determines the resolution: a longer window gives better frequency resolution, whereas a shorter window gives better time resolution.


There are many spectral features based on the STFT of the signal, such as the spectral flux, where the change of the STFT of the current frame relative to the previous frame is compared with a threshold.
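A minimal sketch of such a spectral-flux test is shown below. The Euclidean norm and the function names are our choices; other distance measures are common in practice.

```python
import math

def spectral_flux(prev_mag, cur_mag):
    # Euclidean distance between consecutive magnitude spectra.
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_mag, cur_mag)))

def is_transient(prev_mag, cur_mag, threshold):
    # Flag the current frame when its spectrum changed by more than
    # `threshold` since the previous frame, as in the thresholding above.
    return spectral_flux(prev_mag, cur_mag) > threshold
```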

4.3 Temporal Prediction Based Signal Classification

Speech signals are smooth and highly correlated in the temporal domain (frame to frame) and can, therefore, be modeled with an AR model. This model is explained in section 4.3.1. The classification, which is based on a smoothness criterion, is explained in section 4.3.2. We describe the computation of the variance in section 4.3.3 and the weight computation, used to optimize the classification algorithm, in section 4.3.4.

4.3.1 Prediction Model for smooth speech signals

As discussed earlier, the speech signal is smooth, so the STFT of a speech signal can be modeled with the autoregressive model [17][18] shown in equation (17) [5]:

Y(n,k) = Σ_{m=1}^{M} α_{n,k,m} Y(n+δ_m, k) + X_{n,k}    (17)

where n and k represent the frame index and the frequency index respectively, δ_m is the delay, α_{n,k,m} are the prediction coefficients for the frames used in the prediction, and X_{n,k} is the model error. The model error is modeled here as a white Gaussian stochastic process in the temporal domain (over n) with zero mean and variance σ²_{n,k}, i.e. N(0, σ²_{n,k}), with the probability density function given by equation (18).

p(X_{n,k}) = (1 / √(2π σ²_{n,k})) exp(−X²_{n,k} / (2σ²_{n,k}))    (18)

and independence between the subbands given by equation (19):

E[X_{n,k} X_{m,q}] = σ²_{n,k}   if n = m and k = q
                   = 0          otherwise    (19)

The model error vector for each frame is:

X_n = [X_{n,1}, X_{n,2}, ..., X_{n,k}]^T


and the normalized model error vector is given as

X̄_n = [X̄_{n,1}, X̄_{n,2}, ..., X̄_{n,k}]^T,   where X̄_{n,k} = X_{n,k} / σ_{n,k}

are the normalized model errors.

If we assume that there is no correlation between the frequency components of a given frame [5], the pdf for the normalized vector X̄_n is given by equation (20):

p(X̄_n) = p(X̄_{n,1}) p(X̄_{n,2}) ... p(X̄_{n,k})
       = (2π)^{−k/2} exp(−(X̄²_{n,1} + X̄²_{n,2} + ... + X̄²_{n,k}) / 2)
       = (2π)^{−k/2} exp(−(1/2) Σ_k X̄²_{n,k})    (20)

Using this pdf, we can evaluate the probability that the normalized model error vector lies within a k-dimensional sphere of radius R as:

P(‖X̄_n‖ ≤ R) = ∫_{‖X̄_n‖ ≤ R} p(X̄_n) dX̄_{n,1} dX̄_{n,2} ... dX̄_{n,k}
             = (2π)^{−k/2} ∫_{‖X̄_n‖ ≤ R} exp(−(1/2) Σ_k X̄²_{n,k}) dX̄_{n,1} ... dX̄_{n,k}    (21)

where P(‖X̄_n‖ ≤ R) is a monotonically increasing function of R. For a given probability value in (0,1), we can find the R for which P(‖X̄_n‖ ≤ R) equals that value and use the criterion shown in equation (22) to distinguish between smooth and impulsive signals. With a probability value of 0.95 we would then classify the smooth signals correctly with probability 0.95.

‖X̄_n‖ ≤ R    (22)

P(‖X̄_n‖ ≤ R) as a function of R can be expressed in terms of the known gamma function as shown below. First the normalized model error vector is mapped to polar coordinates:


X̄_{n,1} = r cos φ_1,
X̄_{n,2} = r sin φ_1 cos φ_2,
X̄_{n,3} = r sin φ_1 sin φ_2 cos φ_3,
......
X̄_{n,k−1} = r sin φ_1 sin φ_2 ... sin φ_{k−2} cos φ_{k−1},
X̄_{n,k}   = r sin φ_1 sin φ_2 ... sin φ_{k−2} sin φ_{k−1}    (23)

where r can be computed as shown in equation (24):

r² = X̄²_{n,1} + X̄²_{n,2} + ... + X̄²_{n,k}    (24)

and the limits of the polar coordinates vary as shown below:

0 ≤ r ≤ R
0 ≤ φ_1 ≤ π
0 ≤ φ_2 ≤ π
...
0 ≤ φ_{k−2} ≤ π
0 ≤ φ_{k−1} ≤ 2π    (25)

The mapping of the differential volume from the Cartesian to the polar format is done using the above equations as shown below:

dX̄_{n,1} dX̄_{n,2} ... dX̄_{n,k} = r^{k−1} sin^{k−2}φ_1 sin^{k−3}φ_2 ... sin φ_{k−2} dr dφ_1 dφ_2 ... dφ_{k−1}    (26)

Using the above polar transformations, equation (21) can now be mapped from Cartesian to polar coordinates as shown in equation (27):

P(‖X̄_n‖ ≤ R) = (2π)^{−k/2} ∫₀^R ∫₀^π ... ∫₀^{2π} e^{−r²/2} r^{k−1} sin^{k−2}φ_1 sin^{k−3}φ_2 ... sin φ_{k−2} dr dφ_1 ... dφ_{k−1}    (27)

As the function inside the integral in equation (27) is separable in the polar variables, we can integrate over each polar variable separately. First we integrate over all the angular variables. To simplify the above equation, let us define

I(k) = ∫₀^π sin^k φ dφ    (28)

then the integral (27) reduces to equation (29):


P(‖X̄_n‖ ≤ R) = (2π)^{−k/2} I(k−2) I(k−3) ... I(1) I(0) ∫₀^R e^{−r²/2} r^{k−1} dr    (29)

The first three integrals of (28) are computed as below:

I(0) = ∫₀^{2π} dφ = 2π   (the last angle φ_{k−1} runs over [0, 2π])
I(1) = ∫₀^π sin φ dφ = 2
I(2) = ∫₀^π sin² φ dφ = π/2

The rest of the integrals over the angular variables can be computed recursively from the previous integrals through the following recursion formula:

I(n) = [−cos φ sin^{n−1} φ / n]₀^π (1st term) + ((n−1)/n) ∫₀^π sin^{n−2} φ dφ (2nd term)    (30)

For integration of φ from 0 to π, the 1st term in equation (30) is zero, so the recursion reduces to equation (31):

I(n) = ((n−1)/n) I(n−2)    (31)

The integration over all the angular dimensions of (29) using (31) results in a constant term, as shown in equation (32):

C = I(k−2) I(k−3) ... I(1) I(0) = 2 π^{k/2} / (k/2 − 1)!    (32)

Using equation (32) in equation (29), we get:

P(‖X̄_n‖ ≤ R) = C (2π)^{−k/2} ∫₀^R e^{−r²/2} r^{k−1} dr
             = (1 / (2^{k/2−1} (k/2 − 1)!)) ∫₀^R e^{−r²/2} r^{k−1} dr    (33)

We can write the integral in equation (33) in terms of the incomplete gamma function, which is defined as:


γ(m, x) = ∫₀^x t^{m−1} e^{−t} dt

Using the following integration formula [15]

∫ x^{2m−1} e^{−x²/2} dx = 2^{m−1} γ(m, x²/2) + const

where γ(m, x) is the incomplete gamma function. For an integer m, the incomplete gamma function can be written as:

γ(m, x) = (m−1)! (1 − e^{−x} Σ_{n=0}^{m−1} x^n / n!)

Equation (33) then reduces to equation (34):

P(‖X̄_n‖ ≤ R) = (1 / (k/2 − 1)!) [γ(k/2, x)]₀^{R²/2}
             = (1 / (k/2 − 1)!) (γ(k/2, R²/2) − γ(k/2, 0))
             = γ(k/2, R²/2) / (k/2 − 1)!    (34)

where the left-hand side of equation (34) is the probability that the norm of the k-dimensional normalized vector X̄_n is less than the threshold R.

Equation (34) shows how this probability can be represented in terms of the gamma function of the dimension k and the threshold R. In our case k is the number of frequency components. So, for example, if the window length is 15 ms and the sampling rate is 48 kHz, then k will be 360. The plot of the probability P versus the threshold R is shown in figure 27 for k = 360.
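Equation (34) is easy to evaluate numerically using the finite-sum form of the incomplete gamma function given above. The sketch below assumes k is even, as it is for the frame sizes considered here; the function name is ours.

```python
import math

def p_norm_le(k, R):
    # P(||Xbar_n|| <= R) from eq. (34): gamma(k/2, R^2/2) / (k/2 - 1)!,
    # which with the finite-sum form of the incomplete gamma function
    # equals 1 - e^{-x} * sum_{n=0}^{k/2-1} x^n / n!  with  x = R^2/2.
    m = k // 2
    x = R * R / 2.0
    term = math.exp(-x)      # n = 0 term: e^{-x} x^0 / 0!
    total = term
    for n in range(1, m):
        term *= x / n
        total += term
    return 1.0 - total

# Thesis examples: k = 360 bins (15 ms at 48 kHz) and the k = 2 case.
print(p_norm_le(360, 19.56))  # close to the 0.8023 quoted for figure 27
print(p_norm_le(2, 1.83))     # close to the 81% quoted for figure 29
```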


Figure 27 Plot of Probability P vs Threshold R

The above figure shows that to correctly identify impulsive signals with 80.23% accuracy, the threshold must be greater than 19.56. In other words, signals whose criterion falls below the threshold are classified as smooth signals.

To illustrate equation (34) we consider the case k = 2. The inequality from equation (22) can then be written as equation (35):

‖X̄_n‖ = √(X̄²_{n,1} + X̄²_{n,2}) ≤ R    (35)

This is the equation of a circle: the norm of X̄_n lies inside or on the circle of radius R, shown by the green region in figure 28.


Figure 28 Area of circle showing limit for pdf

The probability for a frame having only two spectral coefficients can be given by equation (36) using equation (21):

P(‖X̄_n‖ ≤ R) = (1/2π) ∫∫_{‖X̄_n‖ ≤ R} e^{−(X̄²_{n,1} + X̄²_{n,2})/2} dX̄_{n,1} dX̄_{n,2}    (36)

Equation (36) can be written in polar form as equation (37):

P(‖X̄_n‖ ≤ R) = (1/2π) ∫₀^{2π} ∫₀^R e^{−r²/2} r dr dφ    (37)

where 0 ≤ r ≤ R and 0 ≤ φ ≤ 2π.

Using the limits of integration, equation (37) reduces to equation (38):

P(‖X̄_n‖ ≤ R) = (2π / 2π) ∫₀^R e^{−r²/2} r dr = ∫₀^R e^{−r²/2} r dr    (38)

To solve equation (38), let s = r²/2 so that ds = r dr; the equation then reduces to equation (39).


P(‖X̄_n‖ ≤ R) = ∫₀^{R²/2} e^{−s} ds = 1 − e^{−R²/2}    (39)

where the left-hand side of equation (39) is the probability that the norm of the 2-dimensional normalized vector X̄_n is less than the threshold R.

Equation (39) shows how the probability of the norm of the 2-dimensional normalized vector can be represented in terms of the threshold R. The plot of the probability against the threshold is shown in figure 29.

Figure 29 Probability vs Threshold for k =2

From figure 29, to identify a signal having two frequency components with 81% probability, the threshold must be at most 1.83.

Another illustration, for k = 1, is as below:

P(‖X̄_n‖ ≤ R) = (1/√(2π)) ∫_{|X̄_{n,1}| ≤ R} e^{−X̄²_{n,1}/2} dX̄_{n,1}
             = (2/√(2π)) ∫₀^R e^{−y²/2} dy,   where y = X̄_{n,1}
             = erf(R/√2)

where the error function (erf) is defined as

erf(x) = (2/√π) ∫₀^x e^{−t²} dt

As explained earlier, we determined the threshold R corresponding to the probability that the norm of the k-dimensional vector is less than the threshold. Graphically it can be interpreted as the area under the pdf between the limits −R and R. Figure 30 shows that the area under the pdf from −R(95%) to R(95%) gives the probability for the norm of the 1-dimensional vector, where R(95%) corresponds to the threshold for a probability of 95%:

∫_{−R(95%)}^{R(95%)} p(X̄_n) dX̄_n = 0.95

Figure 30 pdf vs threshold R; the area under the curve from −R(95%) to R(95%) is 0.95 and represents the probability.

An alternative approach would be to show that ‖X̄_n‖² has a chi-square distribution [14].

4.3.2 Classification based on thresholding the norm of the modeling error vector

As explained in section 4.3.1, we use equations (20) and (22) for the classification of smooth signals. Combining these two equations we get equation (40), which states that if the model error is less than the threshold R then the audio frame is a smooth signal, otherwise an impulsive signal.


Σ_k (1/σ²_{n,k}) (Y(n,k) − Σ_{m=1}^{M} α_{n,k,m} Y(n+δ_m, k))² ≤ R²    (40)

In the above equation, we used Σ_{m=1}^{M} α_{n,k,m} = 1.

We explain the computation of the variance σ²_{n,k} in the next section. We now discuss two approaches for this model:

Combined prediction using the previous and next frame

Prediction using the previous and next frame separately

In the first approach, we take two frames for the prediction of the current frame: one previous frame and one look-ahead frame, i.e. M = 2, δ_1 = −1, δ_2 = 1, α_{n,k,m} = 1/2.

So, equation (40) reduces to the following criterion:

Σ_k (1/σ²_{n,k}) (Y(n,k) − (1/2)(Y(n+δ_1, k) + Y(n+δ_2, k)))² ≤ R²    (41)

In the second approach, we consider the forward and backward predictions separately. The first criterion considers only the previous frame for the prediction, whereas the second criterion considers a look-ahead frame. We define the first criterion as the backward criterion:

bckd_criteria = Σ_k (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_1, k))²    (42)

and the second as the forward criterion:

fwd_criteria = Σ_k (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_2, k))²    (43)

Finally, if both the backward and forward criteria are lower than their thresholds then the frame is classified as a smooth frame, otherwise as an impulsive frame:

if (bckd_criteria < R_bckd && fwd_criteria < R_fwd)
    frame = 'smooth'
else
    frame = 'impulsive'
end

where R_bckd and R_fwd are the thresholds for the backward and forward criteria respectively.


Further normalization of these criteria is done by dividing by the number of frequencies (K):

bckd_criteria = (1/K) Σ_k (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_1, k))²

fwd_criteria = (1/K) Σ_k (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_2, k))²    (44)

In classifying STFTs of speech and keyboard typing signals, some frequencies are more important than others. It is, therefore, worth investigating whether better classification results can be obtained by weighting the frequency components differently. For computing the weight for the criteria, we need to take into consideration different types of keystroke data that cover all kinds of variation, such as typing style and keyboard mechanics. We compute the weight using the Frobenius-Perron theorem, which is explained in section 4.3.4.

4.3.3 Computation of the variance σ²_{n,k}

We analysed the speech data to check whether the assumed model (20) of a smooth stochastic signal having zero mean and varying variance is valid. From the speech data we also tried to find the best variance estimate for a speech signal. As plosive and fricative speech signals have impulsive behaviour, they should be classified as outliers and not considered part of a smoothly varying signal.

The first variance estimate we considered is based on low-pass filtering of the square of the model error. In the case of combined prediction using the previous and next frame, the model error e(n,k) is given as:

e(n,k) = Y(n,k) − (1/2)(Y(n+δ_1, k) + Y(n+δ_2, k)),   where δ_1 = −1 (previous frame), δ_2 = 1 (next frame)

A low-pass estimate of the variance can be computed as

Z(n+1, k) = λ Z(n,k) + (1 − λ) e²(n,k)

where Z(n,k) is the variance and λ is a constant. A plot of the estimated variance along with the square of the model error is shown in figure 31. The first subplot in the figure clearly indicates that the mean of the speech signal is zero. The blue curve in subplot 2 is the estimated variance, whereas the green curve is the squared model error. Subplot 3 contains only the estimated variance. The figure clearly shows that the variance gives a good estimate of the smoothly varying part while excluding the impulsive part. A good variance estimate for a smooth signal should exclude the impulsive part of the signal, treating it as an outlier.


Figure 31 Model error vs Variance in case of combined prediction
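One step of the recursive variance estimate above can be sketched per frequency bin as below. The smoothing constant value of 0.95 is an assumption for illustration; the thesis does not state the constant it used, and the function name is ours.

```python
def update_variance(Z, e, lam=0.95):
    # Z(n+1,k) = lam * Z(n,k) + (1 - lam) * e(n,k)^2 for every bin k.
    # lam is the smoothing constant (0.95 here is an assumed value).
    return [lam * z + (1.0 - lam) * err ** 2 for z, err in zip(Z, e)]
```

Iterating this over frames yields a variance track like the one in figure 31: a large smoothing constant makes the estimate slow to follow isolated impulses, which is exactly why impulsive frames stand out against it.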

The model errors in the case of prediction using the previous and next frame separately are given as:

e_b(n,k) = Y(n,k) − Y(n+δ_1, k)
e_f(n,k) = Y(n,k) − Y(n+δ_2, k)

where e_b and e_f are the model errors using the previous and the next frame respectively. The estimated variances can be computed from these model errors as:

Z_b(n+1, k) = λ Z_b(n,k) + (1 − λ) e_b²(n,k)
Z_f(n+1, k) = λ Z_f(n,k) + (1 − λ) e_f²(n,k)

where Z_b and Z_f are the variances for the backward and forward criteria respectively. Figures 32 and 33 show plots of the variance estimate and the squared model error using the previous and the next frame respectively.


Figure 32 Model error vs Variance in case of prediction using previous frame

Figure 33 Model error vs Variance in case of prediction using next frame

The first subplots of the above figures clearly show that the mean of the speech signal is zero and that the variance captures the smoothly varying part of the signal.


Another variance estimate that we tried is based on the basis frames, following the research work of A. Subramanya [5]. For combined prediction using the previous and the next frame, this variance is defined as:

σ²_{n,k} = (1/2)(Y²(n+δ_1, k) + Y²(n+δ_2, k)),   where δ_1 = −1, δ_2 = 1

Figure 34 Model error vs estimated Variance used in the combined prediction

Figure 34 shows that the estimated variance does not follow the model error in this case.

4.3.4 Weight computation

Weighting of the frequency components can help in signal classification because not all frequencies are equally relevant. For example, if we have to detect keystroke signals, then the frequency range in which keystrokes lie is more important than the other frequencies. So weighting can be used to enhance the relevant part of the data.

As frequency weighting of signals can give good results, we computed weights for the keystroke data. For computing the weights, we prepared a matrix whose rows are the frequency components of a windowed frame and whose columns represent the different frames containing keystroke signals. According to the Frobenius-Perron theorem [16] for a positive matrix, the eigenvector corresponding to the largest eigenvalue will represent the matrix. The condition for a positive dominant eigenvalue of a positive matrix is that the number of rows should be greater than the number of columns.


Let the matrix F have m rows and k columns; then for a positive and dominant eigenvalue we need m > k. In our case the rows represent the frames whereas the columns represent the spectral coefficients. So, the number of frames used for the weight computation must be larger than the number of spectral coefficients to obtain a positive and dominant eigenvalue, whose eigenvector will represent the weight.

F = | F_11  ...  F_1k |
    | ...   ...  ...  |
    | F_m1  ...  F_mk |

In Matlab, the weight can be computed by

[W, E] = eig(F' * F)

where W (the weight) is the eigenvector corresponding to the largest eigenvalue in the diagonal matrix E.
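Outside Matlab the same dominant eigenvector can be obtained by power iteration, sketched below in Python. F is a list of rows as above; the helper names are ours and this is an illustration, not the thesis code.

```python
import math

def dominant_eigvec(M, iters=200):
    # Power iteration: repeatedly multiply and renormalize; converges to
    # the eigenvector of the largest eigenvalue, which is positive for a
    # positive matrix by the Frobenius-Perron theorem.
    n = len(M)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def weight_vector(F):
    # Weight = dominant eigenvector of F'F (a k x k matrix), cf. eig(F'*F).
    k = len(F[0])
    FtF = [[sum(row[i] * row[j] for row in F) for j in range(k)] for i in range(k)]
    return dominant_eigvec(FtF)
```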

The matrix F is the forward-backward criteria matrix. Using the above theorem, the weights WB and WF are computed from the backward and forward matrices respectively. Using these weights, the criteria equations can be modified as below:

bckd_criteria = (1/K) Σ_k WB(k) (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_1, k))²

fwd_criteria = (1/K) Σ_k WF(k) (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_2, k))²    (45)

where WF and WB denote the weights for the forward and backward criteria respectively.

Further optimization of the criteria can be achieved by selecting a frequency range, as not all frequencies are relevant. Considering only a range of frequencies of interest, the criteria become:

bckd_criteria = (1/(en − st + 1)) Σ_{k=st}^{en} WB(k) (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_1, k))²

fwd_criteria = (1/(en − st + 1)) Σ_{k=st}^{en} WF(k) (1/σ²_{n,k}) (Y(n,k) − Y(n+δ_2, k))²    (46)

where st and en are the starting and ending indices of the spectral coefficients under consideration.


4.4 Pitch detection based on Autocorrelation of the Normalized Signal

As explained in chapter 2, the current VAD does not work very well for pitch detection; sometimes it detects a keystroke signal as a pitched speech signal. In this section we describe an improved pitch detection method based on the autocorrelation of the normalized audio signal.

To define the autocorrelation [17], we use the inner product of two vectors X and Y, defined as:

⟨X, Y⟩ = Σ_{n=0}^{N} x(n) y(n)    (47)

The inner product of the normalized vectors X and Y is defined as:

⟨X, Y⟩_n = ⟨X, Y⟩ / (‖X‖ ‖Y‖)    (48)

where ‖X‖ and ‖Y‖ are the norms of the vectors X and Y. The inner product of normalized vectors lies between −1 and 1.

Equation (47) is the standard inner product of two vectors X and Y. Autocorrelation is defined as the inner product of a vector with a delayed version of itself, as given by equation (49). The "zero lag" autocorrelation is the same as the mean-square signal power. Autocorrelation helps in determining the features of a signal buried in noise, and it estimates the periodicity of a signal in a very convenient way.

X_AC(k) = ⟨X_0, X_k⟩ = Σ_{n=0}^{N} x(n) x(n+k)    (49)

where X_AC(k) is the autocorrelation of the signal x with delay or lag k, and

X_0 = [x(0), ..., x(N)]^T   and   X_k = [x(k), ..., x(k+N)]^T

In the case of a random or a periodic signal, equation (49) becomes equation (50):

X_AC(k) = lim_{N→∞} (1/(2N+1)) Σ_{m=−N}^{N} x(m) x(m+k)    (50)

The following are a few properties of the autocorrelation of a signal [6]:

If the signal is periodic with period P samples, then the autocorrelation is also periodic with the same period: X_AC(k) = X_AC(k + P)


Autocorrelation is an even function: X_AC(k) = X_AC(−k)

Autocorrelation attains its maximum value at k = 0 (without any lag): |X_AC(k)| ≤ X_AC(0) for all k

X_AC(0) is the energy of a deterministic signal, or the average power of a random or periodic signal.

So the autocorrelation of a periodic signal attains a maximum at its sample period regardless of the time origin. The period can be estimated by looking at the first maximum of the autocorrelation function.

The delay or lag k should be chosen so that it covers the pitch period of human speech, whose fundamental frequency lies in the range 100 Hz to 400 Hz. For larger delays, the peaks in the autocorrelation function decrease; the autocorrelation of a normalized signal removes this dependency.

The autocorrelation of the normalized signal is based on the inner product of the normalized signal with itself:

X̂_AC(k) = ⟨X_0, X_k⟩ / (‖X_0‖ ‖X_k‖)    (51)

where X̂_AC(k) is the autocorrelation of the normalized signal x with itself delayed by k, and

X_0 = [x(0), ..., x(N)]^T   and   X_k = [x(k), ..., x(k+N)]^T
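Equation (51) can be sketched directly in Python. This is a minimal version operating within a single frame (the function name is ours):

```python
import math

def normalized_autocorr(x, k):
    # Inner product of x[0:N-k] with x[k:N], each divided by its norm
    # (eq. (51)); the result always lies between -1 and 1.
    x0, xk = x[:len(x) - k], x[k:]
    n0 = math.sqrt(sum(v * v for v in x0))
    nk = math.sqrt(sum(v * v for v in xk))
    if n0 == 0.0 or nk == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(x0, xk)) / (n0 * nk)
```

For a sinusoid of period 20 samples the value is close to +1 at lag 20 and close to −1 at lag 10, independent of the frame energy, which is the lag dependency removal mentioned above.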

The autocorrelation model error can provide further information on whether the signal is tonal or non-tonal. We can generalize a P-order AR model for an n-dimensional vector. Let

Y = [y(1), ..., y(n)]^T    (52)

be the autocorrelation of the signal frame whose P-order, k-step model is to be estimated.

For the vector Y, the P-order, k-step model is given below:

[y(1), ..., y(n)]^T = Σ_{p=1}^{P} a_p [y(1 − kp), ..., y(n − kp)]^T + e    (53)

where e is the model error.


For the indices to be positive, we need

n − kp ≥ 1  ⟹  n_min = 1 + kp    (54)

Using (54) in (53), we get the P-order, k-step AR model:

[y(1 + kP), ..., y(n)]^T = Σ_{p=1}^{P} a_p [y(1 + (P − p)k), ..., y(n − kp)]^T + e    (55)

The above model can be written as:

Y = Σ_{p=1}^{P} a_p X_p + e

where

Y = [y(1 + kP), ..., y(n)]^T   and   X_p = [y(1 + (P − p)k), ..., y(n − kp)]^T

The projection of Y onto the space spanned by X_1, ..., X_P is Σ_{p=1}^{P} a_p X_p, and the model error is orthogonal to this space: ⟨e, X_p⟩ = 0 for p = 1, ..., P.

So the AR model coefficients can be computed as:

AR_coeff = [a_1, ..., a_P]^T = (X^T X)^{−1} (X^T Y)    (56)

where X = [X_1, ..., X_P] is the matrix with the vectors X_p as its columns.

The estimated autocorrelation model is computed using (55) and (56) as below:


Ŷ = Σ_{p=1}^{P} AR_coeff(p) [y(1 + (P − p)k), ..., y(n − kp)]^T    (57)

The normalized model error is given, using (52) and (57), as:

Model_err = norm(Ŷ − Y) / norm(Y)    (58)
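The fit of equations (55)-(58) can be sketched with ordinary least squares in pure Python (0-based indexing; the helper names are ours and this is an illustration under those conventions, not the thesis implementation):

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting, for small systems.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ar_model_error(y, k, P):
    # Least-squares fit of the P-order, k-step model (eqs. (55)-(56)):
    # y(i) ~ sum_p a_p * y(i - k*p), then the normalized error of eq. (58).
    n = len(y)
    start = k * P                  # first index with all regressors valid
    Y = y[start:]
    X = [[y[i - k * p] for p in range(1, P + 1)] for i in range(start, n)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(P)] for i in range(P)]
    XtY = [sum(r[i] * yy for r, yy in zip(X, Y)) for i in range(P)]
    a = solve(XtX, XtY)
    Yhat = [sum(a[p] * r[p] for p in range(P)) for r in X]
    num = math.sqrt(sum((u - v) ** 2 for u, v in zip(Yhat, Y)))
    den = math.sqrt(sum(v * v for v in Y))
    return num / den
```

A sequence that is periodic with period equal to the lag step k fits almost perfectly, giving an error near 0, while an aperiodic sequence gives a large error.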

If the model error is less than 0.05 then the signal is classified as tonal, otherwise as non-tonal. As shown in figure 35, the model error of the speech signal is 0, so it is a tonal signal, whereas the model error for the keystroke signal (shown in figure 36) is larger than 0.05, hence the keystroke signal is non-tonal.

Similarly, if the model error is less than 0.15 then we say the signal is noise, otherwise it is not noise.

In the case of a periodic signal such as speech, the autocorrelation function has peaks. We can compute the number of peaks per frame by setting a threshold on the autocorrelation; if there is more than one peak then the signal is periodic. Figure 35 shows the autocorrelation of a speech signal, and figure 36 shows the autocorrelation of a keystroke signal. We evaluate these methods in the next chapter.
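The peak-counting rule can be sketched as below. The threshold value is application-specific and the function names are ours:

```python
def count_peaks(ac, threshold):
    # Local maxima of the autocorrelation sequence that exceed `threshold`;
    # more than one such peak suggests a periodic (pitched) frame.
    peaks = 0
    for i in range(1, len(ac) - 1):
        if ac[i] > threshold and ac[i] >= ac[i - 1] and ac[i] > ac[i + 1]:
            peaks += 1
    return peaks

def is_periodic(ac, threshold=0.5):
    return count_peaks(ac, threshold) > 1
```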

Figure 35 Normalized Autocorrelation of Speech signal


Figure 36 Normalized Autocorrelation of Keystroke Signal


5 Performance Evaluation

In this chapter we evaluate the audio signal classification and the improved pitch detection methods that were described in detail in chapter 4.

In audio signal classification we evaluate two approaches that are based on a prediction model. The first approach of audio signal classification is based on a combined prediction of the current frame from the previous and the following frames (combined backward and forward prediction), whereas the second approach uses separate backward and forward predictions.

For each approach there are four parameters that we can vary to tune the performance. These variables are:

The threshold R in the norm inequality of the STFTs

The temporal step size between the STFTs, N_s

The window length of the STFTs, N_wl

The frequency weight vector for the STFTs, W_f

The evaluation is done by looking at the hit rate (the percentage of correct keystroke classifications) and the number of false alarms (the number of false keystroke classifications), with the objective of maximizing the hit rate while keeping the number of false alarms low.

The improved pitch detection method is based on the normalized autocorrelation of the audio signal, as explained in chapter 4. The proper way to evaluate this method would be to incorporate the pitch detector into the VAD and use the new VAD for classification. Due to lack of time, however, we have evaluated the new speech detector by combining the pitch flag of the audio frame from the new pitch detector with the speech flag of the audio frame from the original VAD. As the minimum length of a pitched human audio signal is 20 ms, we use logic to filter out pitched runs shorter than 20 ms. For example, if the temporal step length is 5 ms and the window length is 20 ms, then the new speech detector classifies an audio frame as a pitched speech frame only when at least the previous three consecutive frames were also detected as pitched by the new pitch detector.
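The minimum-duration logic can be sketched as below. With 5 ms steps, four consecutive pitched frames correspond to 20 ms; the function name is ours.

```python
def filter_short_pitch(pitch_flags, min_frames=4):
    # Keep a frame's pitch flag only when it ends a run of at least
    # min_frames consecutive pitched frames (e.g. 4 x 5 ms = 20 ms).
    out = []
    run = 0
    for flag in pitch_flags:
        run = run + 1 if flag else 0
        out.append(run >= min_frames)
    return out
```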

To keep the speech level up between speech frames, we added logic that keeps the speech level up for a certain number of frames after a speech frame. If another speech frame appears within that number of frames then the speech level remains up; otherwise it goes down, indicating the end of the speech region. This process is called hangover addition, and the number of waiting frames is called the hangover length. The average hangover length for human speech is 500 ms.
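The hangover addition can be sketched as below. With 5 ms frames, a 500 ms hangover corresponds to roughly 100 frames; the function name is ours.

```python
def apply_hangover(speech_flags, hangover_frames):
    # Keep the speech level up for hangover_frames frames after the
    # last detected speech frame.
    out = []
    countdown = 0
    for flag in speech_flags:
        if flag:
            countdown = hangover_frames
            out.append(True)
        else:
            out.append(countdown > 0)
            countdown = max(0, countdown - 1)
    return out
```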

For the improved pitch detection method, the tuning parameters are the following:

Autocorrelation threshold, AC_th,

Page 60: Methods for Improving Voice Activity Detection in ...588802/FULLTEXT01.pdf · IT 13 001 Examensarbete 30 hp Januari 2013 Methods for Improving Voice Activity Detection in Communication

Ericsson Internal

MSC THESIS REPORT

56 (89) Prepared (Subject resp) No.

Amardeep Amardeep Approved (Document resp) Checked Date Rev Reference

2012-09-11 PA1

The temporal step size between the STFTs, sN

The window length of the STFTs, wlN

In section 5.1 we describe the construction of the test signals used in the evaluation, and in section 5.2 we introduce the terminology used in the evaluation process. In section 5.3 we describe the evaluation of the two signal classification approaches. In section 5.4 we describe the performance of the improved pitch detection algorithm, and in section 5.5 we summarize the conclusions of our research. Readers who want to go through the data are referred to appendix A.2 for further details.

5.1 Test Signals

The test signals used in the evaluation are composed by mixing high quality speech recordings and keystroke recordings.

Speech signals vary from person to person and with the phonetic sequences of the sentences being spoken. To cover this variation we used recordings made by Ericsson in 2005 of different English-speaking persons speaking different sentences with well-balanced phonetic content.

Each speech file contains two sentences, is 8 seconds long and is sampled at 48 kHz. From this database we used seven female and nine male speakers, with ten files per speaker, giving a total of 160 speech files.

Keystroke signals can vary quite a bit, depending on the keyboard used, the person typing on the keyboard and the mood of that person. It is, therefore, important to obtain keystroke signals that cover this variation space reasonably well. Keyboards manufactured by different companies have different keyboard mechanics that affect the signal characteristics of the keystroke signal. To cover this variation we recorded keyboard typing by 10 persons, where each person typed on the four different keyboards listed below:

HP keyboard

Logitech Keyboard

Logitech wireless keyboard

Mac laptop keyboard

There are 40 keystroke files in total (10 persons times 4 keyboards), and each keystroke file is 8 seconds long.

Two sets of data were prepared for the evaluation purpose by mixing the high quality speech and the keystroke files; one for optimization (threshold computation) and weight computation and the other set of data for verification.

For the optimization dataset, the keystroke recordings of the first 5 people on the first two keyboards (10 files) and the speech files spoken by 5 male and 4 female speakers (90 files) are mixed to form a set of 90 test files: the first 10 speech files are mixed with the 10 keystroke files, the next 10 speech files are mixed with the same 10 keystroke files, and so on.
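The cyclic pairing scheme above can be sketched as follows. The file names are hypothetical placeholders for illustration only, not the actual recording names.

```python
def pair_files(n_speech=90, n_keystroke=10):
    """Pair each of the 90 speech files with one of the 10 keystroke files,
    cycling through the keystroke files in blocks of 10 (speech file i is
    mixed with keystroke file i mod 10)."""
    return [(f"speech_{i:02d}.wav", f"keys_{i % n_keystroke:02d}.wav")
            for i in range(n_speech)]

pairs = pair_files()
```

Each keystroke file is therefore reused nine times across the optimization set, paired with a different speech file each time.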


Similarly, the verification dataset was put together by mixing the keyboard typing of the last 5 people on the remaining 2 keyboards (10 files) with the speech files of the remaining 4 male and 3 female speakers (70 files), in the same manner as described above for the optimization dataset.

5.2 Evaluation Terminology

In the beginning of this chapter we introduced the terms hit rate and false alarm. These terms need a more formal definition, which we provide here.

We assume that there exists an algorithm that can accurately classify an audio frame in terms of its speech and keystroke content: it sets a true-speech flag if the frame contains speech and a true-keystroke flag if the frame contains a keystroke. Furthermore, we define a true-keystroke frame as a frame that has its true-keystroke flag set and its true-speech flag not set.

We also assume that the keystroke classification algorithm under evaluation sets a keystroke flag for each frame it classifies as containing a keystroke. A correctly-classified keystroke frame is then a true-keystroke frame that the algorithm under evaluation has also classified as a keystroke frame.

With these definitions in place, the hit rate can now be defined as shown in equation (59).

    HIT_RATE = (number of correctly-classified keystroke frames / number of true-keystroke frames) × 100        (59)

The algorithm used to set the true-keystroke and true-speech flags is explained in appendix A.2. It exploits the fact that each test signal is a mix of a clean speech signal and a clean keyboard-typing signal, and runs the original VAD on the clean speech signal to set the true-speech flag for the mixed file.

A frame is said to be a falsely-classified keystroke frame if it is not a true-keystroke frame but the algorithm under evaluation classifies it as one. The false alarm rate is then defined as shown in equation (60).

    FALSE_ALARM_RATE = (number of falsely-classified keystroke frames / number of true non-keystroke frames) × 100        (60)

The true-keystroke frames are determined by computing the energy of the keystroke frames and selecting a threshold, as explained in the appendix; we also verified the keystroke frames manually. These reference labels may be off by a few frames, because a keystroke has no single fixed frame location but spans several frames, so the classification results may vary slightly due to this approximation.
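The hit-rate and false-alarm-rate computations of this section can be sketched as below; the flag lists are per-frame booleans as defined above, and the function names are our own.

```python
def hit_rate(true_keystroke, classified_keystroke):
    """Percentage of true-keystroke frames that the classifier also marked
    as keystroke frames (equation (59))."""
    hits = sum(1 for t, c in zip(true_keystroke, classified_keystroke) if t and c)
    total = sum(1 for t in true_keystroke if t)
    return 100.0 * hits / total if total else 0.0

def false_alarm_rate(true_keystroke, classified_keystroke):
    """Percentage of true non-keystroke frames that the classifier falsely
    marked as keystroke frames (equation (60))."""
    fa = sum(1 for t, c in zip(true_keystroke, classified_keystroke) if not t and c)
    total = sum(1 for t in true_keystroke if not t)
    return 100.0 * fa / total if total else 0.0
```

The tuning objective of this chapter is then to maximize hit_rate while keeping false_alarm_rate low.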


5.3 Evaluation of keystroke signal classification approaches

In this section we evaluate the two keystroke classification approaches (explained in chapter 4) while varying the parameters (threshold in the norm inequality, window length, temporal step length and frequency weighting).

5.3.1 Approach based on combined prediction of the previous and next frame

This subsection covers the evaluation of the keystroke classification method that uses the combined prediction of the current frame from the previous and the following frame. The approach is inspired by the work of A. Subramanya et al. [5], where the variance estimate is based on the average of the previous and next frames. We also tried another variance estimate based on low-pass filtering of the squared model error. We evaluated the method by varying the parameters and plotting the hit rate against the corresponding false alarm rate; the best result was selected by maximizing the hit rate while keeping the false alarm rate low.

The criterion for classification, as explained in chapter 4, is

    Σ_k (1/σ²_{n,k}) Σ_{m=1..M} ( Y(n,k) − (1/2)[Y(n−m,k) + Y(n+m,k)] )²  >  R

and we evaluated the normalized statistic

    (1/K) Σ_k (1/σ²_{n,k}) Σ_{m=1..M} ( Y(n,k) − (1/2)[Y(n−m,k) + Y(n+m,k)] )²  >  Th

The relation between the threshold Th and R is thus given by

    Th = R² / K        (61)
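A minimal sketch of the normalized decision statistic, using one previous and one next frame (M = 1) and assuming the per-bin variance estimate σ²_{n,k} is supplied externally. The normalization follows our reading of the extraction-damaged equation, so treat it as illustrative rather than a verified transcription.

```python
import numpy as np

def combined_prediction_stat(Y_prev, Y_cur, Y_next, var):
    """Variance-weighted squared error between the current STFT frame and the
    combined (averaged) prediction from the previous and next frames,
    normalized by the number of frequency bins K. All arguments are length-K
    arrays for one analysis frame; `var` holds the per-bin variances."""
    K = len(Y_cur)
    pred = 0.5 * (Y_prev + Y_next)        # combined backward/forward prediction
    err2 = np.abs(Y_cur - pred) ** 2      # squared model error per bin
    return float(np.sum(err2 / var) / K)  # frame is keystroke-like if > Th
```

A frame whose statistic exceeds Th is flagged as containing an impulsive (keystroke-like) component, since impulses are poorly predicted from the neighbouring frames.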

First we describe the results obtained using variance estimates computed by filtering the squared model error, as explained in section 4.3.3 of chapter 4.
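One simple way to realize the filtered variance estimate is a first-order low-pass (IIR) smoother over the squared model error; the smoothing constant below is an assumption for illustration, not a value taken from the thesis.

```python
import numpy as np

def variance_from_model_error(err2_frames, alpha=0.9):
    """Per-bin variance estimate obtained by first-order low-pass filtering
    of the squared model error across frames. `err2_frames` is an
    (n_frames, K) array of squared prediction errors; `alpha` is an assumed
    smoothing constant (closer to 1 = smoother estimate)."""
    var = np.empty_like(err2_frames, dtype=float)
    acc = err2_frames[0].astype(float)        # initialise with the first frame
    for n, e in enumerate(err2_frames):
        acc = alpha * acc + (1.0 - alpha) * e  # recursive smoothing per bin
        var[n] = acc
    return var
```

The smoothed error tracks the local signal variability, so sudden impulsive errors stand out against it in the classification statistic.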

Table 5 shows the hit rate, the total number of false alarms, and the false alarms in the speech region for data set 1, for varying threshold values; these results are plotted in Figures 37, 38 and 39. We want the false alarm count minimized and the hit rate maximized. If we allow 5% false alarms in the speech region of an audio file, then at most 20 false alarms are allowed in the speech section, since each audio file contains 400 speech frames. In this case the hit rate obtained was between 75% and 80%.

A more detailed view of how the method works on a mixed audio file is given in Figure 40; plots for clean-speech-only and keystroke-only files are attached in the appendix. The upper graph of Figure 40 shows the mixed signal itself, and the lower graph shows the weighted norm of the STFT for each frame, which is used in the classification decision. The red stars indicate the true-keystroke frames in the sound file, and the notches in the curve show how the method identifies the keystroke frames. The question is then where to put the threshold for setting the keystroke flag.


Figure 37 Hit rate using variance estimation based on squared model error

Table 5 Hit rate and False alarm using variance based on squared model error

Thr   Hit rate   False Alarm   False Alarm (SP)
0.1   95         22            22
0.2   95         21            22
0.3   95         21            22
0.4   95         20            21
0.5   94         18            20
0.6   93         16            18
0.7   91         14            16
0.8   90         12            13
0.9   88         9             10
1.0   84         7             8
1.1   81         6             6
1.2   75         5             5
1.3   69         4             4
1.4   64         3             3
1.5   58         3             2
1.6   50         2             2
1.7   39         2             1
1.8   31         2             1

Next we describe the results obtained using the variance estimate that is the average power of the previous and the next frame, following the work of A. Subramanya et al. [5]. The hit rate, the total false alarms per file and the false alarms in the speech region are shown in Figures 41, 42 and 43, and Table 6 summarizes the data. Figure 44 shows how the method works on a mixed audio file; plots for clean-speech-only and keystroke-only files are attached in the appendix. In this case the hit rate at a 5% false alarm rate was around 90%-97%, the best overall result obtained.
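The averaged-neighbour variance estimate can be written directly from its description; a one-line sketch, assuming the previous and next STFT frames have already been normalized as in [5]:

```python
import numpy as np

def variance_avg_neighbours(Y_prev, Y_next):
    """Variance estimate used in the better-performing variant [5]: the
    average power of the (normalized) previous and next STFT frames,
    computed per frequency bin."""
    return 0.5 * (np.abs(Y_prev) ** 2 + np.abs(Y_next) ** 2)
```

Plugging this variance into the classification statistic reproduces the configuration that gave the 90%-97% hit rate above.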


Figure 38: False alarm rate

Figure 39 False alarm rate in the speech region


Figure 40 Graphical view of the method for a mixed audio file

Figure 41 Hit rate using variance estimate based on average of prev and next frame


Figure 42 False alarm rate

Figure 43 False alarms in the speech region


Table 6 Hit rate and False alarm using variance based on average of previous and next frame

Thr   Hit rate   False Alarm   False Alarm (SP)
1     98         10            12
2     97         5             6
3     96         3             4
4     95         2             3
5     93         2             2
6     92         2             2
7     91         1             2
8     90         1             2
9     89         1             1
10    88         1             1
11    87         1             1
12    85         1             1
13    84         1             1
14    83         1             1
15    82         1             1
16    81         1             1
17    79         1             1
18    77         1             1

Figure 44 Graphical view of how the method works for a mixed audio file


5.3.2 Approach based on prediction using previous and next frame separately

This subsection covers the audio signal classification based on predicting the current frame from the previous and the next frame separately (backward and forward prediction), as explained in detail in chapter 4. The criteria explained in chapter 4 are:

    bckd_criteria:  (1/K) Σ_k (1/σ²_{n,k}) ( Y(n,k) − Y(n−1,k) )²  >  R_bk

and

    fwd_criteria:   (1/K) Σ_k (1/σ²_{n,k}) ( Y(n,k) − Y(n+1,k) )²  >  R_fw

We evaluated

    (1/K) Σ_k (1/σ²_{n,k}) ( Y(n,k) − Y(n−1,k) )²  >  Th_bk

Thus the relation between Th_bk and R_bk is given below:

    Th_bk = R_bk² / K

and similarly

    (1/K) Σ_k (1/σ²_{n,k}) ( Y(n,k) − Y(n+1,k) )²  >  Th_fw

The relation between Th_fw and R_fw is given below:

    Th_fw = R_fw² / K

Then simple logic was used to select the peaks of the curve above the threshold, which indicate the presence of an impulsive signal such as a keystroke.
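The peak-selection logic can be sketched as below; it marks local maxima of the per-frame statistic that also exceed the threshold. The function is illustrative, not the thesis implementation.

```python
def peaks_above(stat, th):
    """Flag frames whose statistic is a local maximum above the threshold;
    such isolated peaks indicate an impulsive (keystroke-like) event rather
    than a sustained speech region."""
    flags = [False] * len(stat)
    for i in range(1, len(stat) - 1):
        if stat[i] > th and stat[i] >= stat[i - 1] and stat[i] >= stat[i + 1]:
            flags[i] = True
    return flags
```

Requiring a local maximum, rather than any value above the threshold, keeps a broad speech-induced rise in the statistic from producing a run of false keystroke flags.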

We first describe the results obtained using the variance computed by filtering the squared model error of the previous and next frames, as explained in section 4.3.3. Tables 7, 8 and 9 show the total false alarms, the false alarms in the speech region and the hit rate for a 15 ms Hamming window with 5 ms step length; the corresponding plots are shown in Figures 45, 46 and 47. For at most 5% false alarms the backward and forward thresholds are 0.9 and 0.7, and the corresponding hit rate is 74%.


The second evaluation was done with a 20 ms Hamming window and 5 ms step length. Tables 10, 11 and 12 summarize the hit rate, the total false alarms and the false alarms in the speech region, plotted in Figures 48, 49 and 50. For at most 5% false alarms the backward and forward thresholds are 0.8 and 0.8, and the corresponding hit rate is 73%.

Table 7 Hit rate corresponding to forward and backward threshold

Fwd thr   0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
(row labels 1-18 correspond to backward thresholds 0.0-1.7 in steps of 0.1)

1 84 84 84 84 84 84 82 79 77 69 61 49 38 28 19 11 6 3

2 84 84 84 84 84 84 82 79 77 69 61 49 38 28 19 11 6 3

3 83 83 83 83 84 84 82 79 77 69 61 49 38 28 19 11 6 3

4 83 83 83 83 84 84 82 79 77 69 61 49 38 28 19 11 6 3

5 83 83 83 83 83 83 82 79 77 69 61 49 38 28 19 11 6 3

6 82 82 82 82 82 82 82 79 76 69 61 49 38 28 19 11 6 3

7 80 80 80 80 80 80 80 78 76 69 61 49 38 28 19 11 6 3

8 79 79 79 79 79 79 79 77 76 69 61 49 38 28 19 11 6 3

9 78 78 78 78 78 78 77 76 75 69 61 49 38 28 19 11 6 3

10 76 76 76 76 77 76 76 74 73 68 61 49 37 28 19 11 6 3

11 72 72 72 72 72 72 71 70 69 65 59 48 37 28 19 11 6 3

12 69 69 69 69 69 69 68 67 65 62 57 47 37 27 19 11 6 3

13 63 63 63 63 63 63 63 62 60 57 52 43 35 27 19 11 6 3

14 60 60 60 60 60 60 59 58 56 53 48 40 33 26 18 11 6 3

15 56 56 56 56 56 56 55 54 53 49 44 37 30 25 18 10 5 3

16 52 52 52 52 52 52 52 50 49 46 41 34 28 23 16 10 5 2

17 50 50 50 50 50 50 49 48 47 44 39 32 27 22 16 9 5 2

18 45 45 45 45 45 45 44 43 43 39 35 29 24 20 15 8 5 2

Table 8 False Alarm rate corresponding to forward and backward threshold

Fwd thr   0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
(row labels 1-18 correspond to backward thresholds 0.0-1.7 in steps of 0.1)

1 20 20 19 18 16 13 11 10 8 5 4 3 2 2 2 1 1 1

2 20 20 19 18 16 13 11 10 8 5 4 3 2 2 2 1 1 1

3 20 20 19 18 15 13 11 10 8 5 4 3 2 2 2 1 1 1

4 19 19 18 17 15 13 11 9 7 5 4 3 2 2 1 1 1 1

5 18 18 17 16 15 13 11 9 7 5 4 3 2 2 1 1 1 1

6 16 16 16 15 13 12 11 9 7 5 3 3 2 2 1 1 1 1

7 15 15 14 13 12 11 10 9 7 5 3 2 2 1 1 1 1 1

8 13 13 13 12 11 10 9 8 6 4 3 2 2 1 1 1 1 1

9 11 11 11 10 9 8 7 7 6 4 3 2 1 1 1 1 1 0

10 9 9 9 8 7 6 6 5 4 3 2 2 1 1 1 1 0 0

11 7 7 7 6 5 5 4 4 3 3 2 1 1 1 1 0 0 0

12 6 6 5 5 4 3 3 3 2 2 2 1 1 1 1 0 0 0

13 4 4 4 4 3 3 2 2 2 2 1 1 1 1 0 0 0 0

14 3 3 3 3 2 2 2 2 2 1 1 1 1 0 0 0 0 0

15 2 2 2 2 2 2 2 1 1 1 1 1 1 0 0 0 0 0

16 2 2 2 2 1 1 1 1 1 1 1 1 0 0 0 0 0 0

17 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0

18 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0


Table 9 False Alarm rate in speech region corresponding to forward and backward threshold

Fwd thr   0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
(row labels 1-18 correspond to backward thresholds 0.0-1.7 in steps of 0.1)

1 20 20 20 19 17 15 13 11 9 6 4 3 2 2 1 1 1 1

2 20 20 20 19 17 15 13 11 9 6 4 3 2 2 1 1 1 1

3 20 20 20 19 17 15 13 11 9 6 4 3 2 2 1 1 1 1

4 20 20 19 18 17 15 13 11 9 6 4 3 2 2 1 1 1 1

5 19 19 19 18 16 15 13 11 8 6 4 3 2 2 1 1 1 0

6 18 18 18 17 16 14 13 11 8 6 4 3 2 2 1 1 1 0

7 17 17 16 16 15 13 12 10 8 6 4 3 2 1 1 1 1 0

8 15 15 15 14 13 12 11 9 7 5 4 3 2 1 1 1 0 0

9 13 13 13 12 11 10 9 8 7 5 3 2 2 1 1 1 0 0

10 11 10 10 10 9 8 7 6 5 4 3 2 1 1 1 0 0 0

11 8 8 8 7 7 6 5 5 4 3 2 2 1 1 1 0 0 0

12 6 6 6 5 5 5 4 4 3 3 2 1 1 1 0 0 0 0

13 5 5 4 4 4 3 3 3 3 2 2 1 1 1 0 0 0 0

14 3 3 3 3 3 3 3 2 2 2 1 1 1 0 0 0 0 0

15 3 3 2 2 2 2 2 2 2 1 1 1 1 0 0 0 0 0

16 2 2 2 2 2 2 2 2 1 1 1 1 0 0 0 0 0 0

17 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0

18 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0

Figure 45 Hit rate in case of estimated variance using squared model error


Figure 46: False alarm rate

Figure 47: False alarm rate in the speech region


Table 10 Hit rate corresponding to forward and backward thresholds (20ms frame)

Fwd thr   0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
(row labels 1-18 correspond to backward thresholds 0.0-1.7 in steps of 0.1)

1 83 83 83 84 84 84 83 81 77 68 60 47 35 23 13 8 4 1

2 83 83 83 84 84 84 83 81 77 68 60 47 35 23 13 8 4 1

3 83 83 83 83 84 84 83 81 77 68 60 47 35 23 13 8 4 1

4 83 83 83 83 84 84 83 81 77 68 60 47 35 23 13 8 4 1

5 81 81 81 81 82 82 82 81 77 68 60 47 35 23 13 8 4 1

6 80 80 80 80 80 81 81 80 76 68 60 47 35 22 13 8 4 1

7 78 78 78 78 79 79 80 79 76 68 60 47 35 22 13 8 4 1

8 77 77 77 77 77 78 78 78 75 67 60 47 35 22 13 8 4 1

9 74 74 74 74 75 75 76 75 73 66 59 47 34 22 13 8 4 1

10 70 70 70 70 71 71 72 71 70 64 57 46 34 22 13 8 4 1

11 67 67 67 67 67 67 67 67 66 61 55 45 34 22 13 8 4 1

12 62 62 62 62 62 62 62 62 61 57 52 42 33 21 13 8 4 1

13 56 56 56 56 57 57 57 57 56 52 48 39 31 21 12 7 4 1

14 53 53 53 53 53 53 54 54 52 49 45 37 30 20 12 7 4 1

15 50 50 50 50 50 50 51 50 49 46 43 35 28 19 11 7 4 1

16 46 46 46 46 46 46 46 46 44 42 38 32 25 17 10 6 4 1

17 41 41 41 41 41 41 41 41 40 37 34 29 23 16 10 6 3 1

18 36 36 36 36 37 37 37 37 35 33 30 26 21 14 9 6 3 1

Table 11 False alarm rate corresponding to forward and backward thresholds (20ms frame)

Fwd thr   0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
(row labels 1-18 correspond to backward thresholds 0.0-1.7 in steps of 0.1)

1 18 17 17 16 14 12 10 9 7 5 4 3 2 2 2 1 1 1

2 18 17 17 16 14 12 10 9 7 5 4 3 2 2 2 1 1 1

3 17 17 17 16 14 12 10 9 7 5 4 3 2 2 1 1 1 1

4 17 17 16 15 14 12 10 9 7 5 4 3 2 2 1 1 1 1

5 16 16 15 14 13 12 10 8 7 5 3 3 2 2 1 1 1 1

6 15 14 14 13 12 11 10 8 6 4 3 2 2 2 1 1 1 1

7 13 13 13 12 11 10 9 8 6 4 3 2 2 1 1 1 1 1

8 12 12 12 11 10 9 8 7 6 4 3 2 2 1 1 1 1 1

9 10 10 10 10 9 8 7 6 5 3 2 2 1 1 1 1 1 0

10 8 8 8 7 7 6 5 4 4 3 2 1 1 1 1 0 0 0

11 7 7 6 6 5 4 4 3 3 2 2 1 1 1 0 0 0 0

12 5 5 5 5 4 3 3 3 2 2 1 1 1 1 0 0 0 0

13 4 4 4 4 3 3 2 2 2 1 1 1 1 0 0 0 0 0

14 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 0 0 0

15 2 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 0 0

16 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0

17 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

18 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0


Table 12 False alarm rate in speech region corresponding to forward and backward thresholds (20ms frame)

Fwd thr   0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
(row labels 1-18 correspond to backward thresholds 0.0-1.7 in steps of 0.1)

1 18 18 18 17 15 14 12 10 8 6 4 3 2 2 1 1 1 1

2 18 18 18 17 15 14 12 10 8 6 4 3 2 2 1 1 1 1

3 18 18 17 17 15 14 12 10 8 6 4 3 2 2 1 1 1 1

4 18 18 17 16 15 14 12 10 8 6 4 3 2 2 1 1 1 1

5 17 17 17 16 15 13 12 10 8 6 4 3 2 2 1 1 1 0

6 16 16 16 15 14 13 11 10 8 6 4 3 2 1 1 1 1 0

7 15 15 15 14 13 12 11 9 7 5 4 3 2 1 1 1 1 0

8 14 14 14 13 12 11 10 9 7 5 3 2 2 1 1 1 0 0

9 12 12 12 11 10 9 8 7 6 4 3 2 1 1 1 1 0 0

10 10 10 9 9 8 7 7 6 5 4 2 2 1 1 1 0 0 0

11 7 7 7 7 6 6 5 4 4 3 2 1 1 1 0 0 0 0

12 6 6 5 5 5 4 4 3 3 2 2 1 1 1 0 0 0 0

13 4 4 4 4 4 3 3 3 2 2 1 1 1 0 0 0 0 0

14 3 3 3 3 3 2 2 2 2 1 1 1 0 0 0 0 0 0

15 2 2 2 2 2 2 2 2 1 1 1 1 0 0 0 0 0 0

16 2 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 0 0

17 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0

18 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

Figure 48: Hit rate (20ms frame)


Figure 49: False alarm rate (20ms)

Figure 50: False alarm rate in the speech region (20ms frame)

The evaluation results for the remaining cases are tabulated in appendix A.2. A comparison of the configurations is shown in the following table:


Window        Length (ms)   Zero crossing   Backward thr   Forward thr   Hit rate (%)   False alarm rate (%)   False alarm rate SP (%)
Rectangular   10            No              1.1            0.5           51             5                      7
Hamming       10            No              0.9            0.6           81             5                      7
Hamming       15            No              0.9            0.7           74             5                      6
Hamming       20            No              0.8            0.8           73             5                      6
Hamming       10            Yes             0.4            0.3           86             5                      5
Hamming       15            Yes             0.5            0.5           82             5                      5
Hamming       20            Yes             0.6            0.6           78             5                      5

5.4 Evaluation of new speech detector based on the new pitch detection algorithm

The new speech detector uses an improved pitch detector together with new hangover logic and a new filter, as explained in chapter 4. The new pitch detector is based on the autocorrelation of the normalized signal. We evaluate the performance of the speech detector by varying parameters such as the window length, the temporal step size and the autocorrelation threshold constant. Hangover was added to separate the speech regions from the keystroke regions.

In the evaluation, the best case was found with a 20 ms window and a 5 ms step size, with the autocorrelation threshold set to 0.95.

Figure 51 shows the detection of the speech and keystroke regions using the new speech detector, and demonstrates that the new speech detector based on improved pitch detection can be used successfully for signal classification.
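One plausible reading of "autocorrelation of the normalized signal" is a per-lag normalized autocorrelation, which makes the 0.95 threshold attainable for periodic frames. The sketch below follows that reading; the 60-400 Hz pitch search range is our assumption, not a value from the thesis.

```python
import numpy as np

def pitch_flag(frame, fs=48000, ac_th=0.95, fmin=60, fmax=400):
    """Flag a frame as pitched when the per-lag normalized autocorrelation
    of the zero-mean frame, searched over plausible human pitch lags,
    exceeds the threshold (0.95 in the best case found)."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    lo, hi = fs // fmax, fs // fmin           # lag range for 400 Hz down to 60 Hz
    best = 0.0
    for lag in range(lo, min(hi, len(x) - 1) + 1):
        a, b = x[:-lag], x[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        if denom > 0:
            best = max(best, np.dot(a, b) / denom)  # normalization keeps the peak near 1
    return bool(best > ac_th)
```

Because the correlation is normalized per lag, the peak height depends on periodicity rather than signal level or delay, which is why a keystroke transient does not produce a high pitch peak.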


Figure 51 The new speech detector

5.5 Conclusion

In this research we explored features of speech and keystroke signals. Based on the problem and data analysis, we tried two approaches. The first is based on identification of keystrokes, whereas the second consists of improving the VAD so that it does not trigger on keystroke signals. Identification of keystroke frames is based on classifying a mixed audio signal into speech and keystroke frames using a prediction model. The second approach improves the pitch detection using the autocorrelation of the normalized signal. The first approach worked well but still fell short of our requirements, whereas the second approach fulfilled them. We found that the second approach can be implemented successfully and incorporated in the VAD as a speech detector for the video-switching functionality in a video conferencing application.

Section 5.5.1 explains the similarities and the differences between the speech and the keystrokes based on the data analysis. We compare the approaches used in signal classification in section 5.5.2.

5.5.1 Similarities and differences between Speech and Keystrokes

In this section we explain briefly the similarities and differences between speech and keystroke signals.


5.5.1.1 Keystrokes with plosive sounds

Plosive sounds (classified by mode of excitation; voiced and unvoiced stops in acoustic phonetics) resemble keystrokes in both behavior and spectrum. A plosive begins with a high-frequency component, followed by a short low-frequency component and then another high-frequency component, so within a short analysis window it behaves very much like a keystroke. Although the duration of a plosive varies from 100 ms to 300 ms, its onset looks almost identical to a keystroke in a short analysis window. Plosives can therefore generate false alarms, i.e. frames containing plosive sounds can be wrongly classified as keystrokes.

5.5.1.2 Keystrokes with fricative sounds

Fricative sounds have a noise-like broad spectrum, varying between 2 and 15 kHz. Although they consist mainly of high-frequency components and can last up to 300 ms, the problem arises when a fricative overlaps with a keystroke: it then becomes difficult to separate the keystroke from the fricative, and the mixed signal resembles a fricative. This is also due to the low strength of keystrokes relative to the speech signal, so the fricative dominates the keystroke in the mixed signal. Separating such mixed signals would require signal reconstruction, which is out of scope of the current objective.

5.5.2 Comparison of the classification approaches

First we compare the audio classification approaches. We evaluated the two prediction models in section 5.3 with varying parameters such as the variance estimate, window length, frequency weights and ZCR. The classification approach of the research paper [5] works better: it uses one previous and one next frame in the prediction model, with a variance based on the average power of the normalized previous and next frames. The alternative variance estimate we tried, based on low-pass filtering of the squared model error, did not perform as well.

Since we derived the theoretical relation between the probabilities and the corresponding threshold, shown in equation (61), we verified that the theoretical threshold coincides with the practical threshold in the case of the variance based on low-pass filtering of the squared model error. We computed the probability and the corresponding theoretical threshold for a case with a 15 ms window length and a 48 kHz sampling rate. As shown in Figure 27, for 80% accuracy the value of the threshold R must be at least 19.81. Per our practical evaluation, the theoretical threshold is scaled as below:

    Th = R² / K = (19.81)² / 360 ≈ 1.1

where Th is the scaled threshold, K is half the number of frequency components in a 15 ms window, and R is the theoretical threshold. K is computed as below:

    K = (1/2) (window_length × sampling_rate) = (1/2) (15/1000 × 48000) = 360
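As a numeric sanity check of the threshold scaling as we have reconstructed it from the damaged equation (Th = R²/K, with K equal to half the window length in samples):

```python
window_length_ms = 15
fs = 48000                               # sampling rate in Hz

# Half of the frequency components in a 15 ms window (integer arithmetic
# avoids floating-point rounding in the sample count).
K = (window_length_ms * fs) // (2 * 1000)

R = 19.81                                # theoretical threshold for 80% accuracy
Th = R ** 2 / K                          # scaled threshold, approximately 1.1
```

The result matches the practical threshold region of Table 5, where the hit rate crosses roughly 80% around Thr = 1.0 to 1.1.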


However, the threshold obtained from our reproduction of the research paper [5] is 16 for an 81% hit rate, which is not theoretically justified.

Effect of variance: the result obtained with the variance estimate based on filtering of the squared model error agrees with the theory, whereas the variance estimate based on the average power of the normalized previous and next frame does not, even though the latter gives better results. The research paper [5] does not motivate why this variance was chosen.

Effect of window length: the optimum result was obtained with a 10 ms window length, which matches the average keypress length of about 10 ms. Increasing or decreasing the window length decreases the hit rate.

Effect of zero crossings (ZC): the ZC feature helps improve the result. The ZC count of a keystroke lies between 50 and 350, as shown in Figure 21, whereas fricative sounds have higher ZC counts than keystroke signals. The prediction model sometimes detects a voiced speech signal as an impulsive signal when it occurs at the beginning of an audio frame; ZC helps resolve this.
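The ZC feature itself is a simple sign-change count per analysis frame; a minimal sketch:

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples. In the analysed data,
    keystroke frames showed roughly 50-350 crossings, while fricatives tend
    to lie higher."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
```

A candidate impulsive frame whose ZC count falls outside the keystroke range can then be rejected, suppressing false alarms from fricatives and voiced onsets.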

The second approach, improving the pitch detection of the VAD so that it does not trigger on keystroke signals, works quite well. The new speech detector is based on a new pitch detector plus hangover addition. The pitch detection algorithm uses the autocorrelation of the normalized signal, so the autocorrelation peaks are not affected by delay, since the normalization takes care of this. The new speech detector successfully classifies the audio data into speech and non-speech regions in real time, which fulfills our requirement. The parameters were obtained using the optimization dataset and verified using the verification dataset.

5.5.3 Future work

The variance based on filtering of the squared model error did not give good results in the audio signal classification approach. One reason may be that this variance does not cover the smooth parts of the signal well. A possible solution is to try a non-linear filter such as a median filter. Such an approach could be highly beneficial for locating the exact position of the keystrokes in a mixed signal with high probability.

[Thesis page header: Methods for Improving Voice Activity Detection in Communication Services, IT 13 001, Examensarbete 30 hp, Januari 2013. Ericsson Internal, MSC THESIS REPORT, page 75 (89). Prepared (Subject resp): Amardeep. Date: 2012-09-11. Rev: PA1.]

References:

[1] ETSI EN 301 708; "Voice Activity Detector (VAD) for Adaptive Multi-Rate (AMR) speech traffic channels; General description" (GSM 06.94 version 7.1.0, Release 1998)
[2] Mitra, S. K.; Babic, H.; Somayazulu, V. S.; "A modified perfect reconstruction QMF bank with an auxiliary channel", IEEE International Symposium on Circuits and Systems, May 1989
[3] Zhuang, L.; Zhou, F.; Tygar, J. D.; "Keyboard acoustic emanations revisited", ACM Trans. Inf. Syst. Security, Nov 2009
[4] Asonov, D.; Agrawal, R.; "Keyboard acoustic emanations", Proceedings of the 2004 IEEE Symposium on Security and Privacy, May 2004
[5] Subramanya, A.; Seltzer, M. L.; Acero, A.; "Automatic Removal of Typed Keystrokes From Speech Signals", IEEE Signal Processing Letters, May 2007
[6] Rabiner, L. R.; Schafer, R. W.; "Digital Processing of Speech Signals", Prentice Hall, US edition, Sept 1978
[7] Burred, J. J.; "An Objective Approach to Content-Based Audio Signal Classification", Master's thesis, Technische Universität Berlin, May 2003. http://www.jjburred.com/research/pdf/burred_da.pdf
[8] Gerhard, D.; "Audio Signal Classification: History and Current Techniques", Technical Report, Department of Computer Science, University of Regina, Canada, Nov 2003
[9] Benesty, J.; Sondhi, M. M.; Huang, Y.; "Springer Handbook of Speech Processing", Springer, Dec 2007
[10] Klapuri, A.; "Audio signal classification", ISMIR Graduate School, Oct 2004. http://mtg.upf.edu/ismir2004/graduateschool/people/Klapuri/classification.pdf
[11] Childers, D. G.; Skinner, D. P.; Kemerait, R. C.; "The cepstrum: A guide to processing", Proceedings of the IEEE, Oct 1977
[12] Saunders, J.; "Real-time discrimination of broadcast speech/music", Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), May 1996
[13] Huber, G.; "Gamma function derivation of n-sphere volumes", The American Mathematical Monthly, May 1982
[14] Abramowitz, M.; Stegun, I. A.; "Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables", 9th printing, Dover, New York, 1972
[15] WolframAlpha; http://www.wolframalpha.com/input/?i=integration+of+x%5Em*e%5E-x%5E2&x=0&y=0


[16] Berman, A.; Plemmons, R. J.; "Nonnegative Matrices in the Mathematical Sciences", Society for Industrial and Applied Mathematics, Jan 1987
[17] Proakis, J. G.; Manolakis, D. G.; "Digital Signal Processing", Pearson Prentice Hall, 2007
[18] Stoica, P.; Moses, R. L.; "Spectral Analysis of Signals", Prentice Hall, 1st edition, April 2005
[19] Bogert, B. P.; Healy, M. J. R.; Tukey, J. W.; "The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking", Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, Ed.), Wiley, New York


6 Appendix A.1: Automatic detection of keystrokes in a keystroke-only file

This section explains an algorithm that sets a typing flag at the keystroke frames of a keystroke-only file.

It is based on the spectral energy of the signal. The algorithm computes the fast Fourier transform (FFT) of the windowed signal frame, and the sum of all the FFT coefficients of the frame is used as the energy criterion. A threshold is used to detect the keystroke.

The energy criterion is computed as below:

energy_criteria = sum(fft(signal))

The parameters for the algorithm are defined as below:

- the threshold,
- the temporal step size between the STFTs, N_s,
- the window length of the STFTs, N_wl.

Figure 53 shows the performance of the algorithm. The algorithm produces two peaks for each keystroke. The resolution and level of the keypress peak are better than those of the key-release peak, so we chose the threshold such that it captures all keypresses. The position of the keypress peak is taken as the position of the keystroke. The evaluation of the algorithm was also verified manually.

This algorithm locates the keystroke via its major energy source, which is the keypress. The algorithm computes the energy of the current window frame and two lookahead window frames; the current frame is flagged as a keystroke if its energy is the maximum among the two previous and two lookahead frames and exceeds the keystroke threshold. The keystroke threshold was decided by plotting the keystroke data and was found to lie between 0.4 and 0.5. The position of the keypress is taken as the position of the keystroke. For verification purposes we designed a GUI to verify the major keystrokes. Figure 54 shows the GUI capturing the impulsive signal.
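
The steps above can be sketched as follows (a simplified sketch: the sample rate, window and step defaults, and the synthetic click are assumptions; on real recordings the 0.4-0.5 threshold applies to suitably scaled data):

```python
import numpy as np

def frame_energy(frame):
    """Energy criterion: sum of the FFT magnitude coefficients of the
    Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    return float(np.sum(np.abs(np.fft.rfft(windowed))))

def detect_keystrokes(x, fs, win_ms=10, step_ms=5, threshold=0.45):
    """Flag a frame as a keypress when its energy is the strict maximum
    over the two previous and two lookahead frames and exceeds the
    threshold; return keystroke positions in samples."""
    wl, step = int(fs * win_ms / 1000), int(fs * step_ms / 1000)
    energies = [frame_energy(x[i:i + wl])
                for i in range(0, len(x) - wl + 1, step)]
    hits = []
    for i in range(2, len(energies) - 2):
        neighbours = energies[i - 2:i] + energies[i + 1:i + 3]
        if energies[i] > max(neighbours) and energies[i] > threshold:
            hits.append(i * step)          # keypress position in samples
    return hits

fs = 16000
x = np.zeros(8000)
x[3000] = 1.0                              # synthetic keypress click
print(detect_keystrokes(x, fs, threshold=1.0))   # one hit near sample 3000
```

The two-frame lookahead implements the "maximum among the 2 previous and 2 lookahead frames" rule from the text; the strict inequality prevents flat silence from triggering.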


Figure 52 Keystroke detection by the algorithm

Figure 53 Keystroke detection and verification in a Keystroke file


7 Appendix A.2: Data collection for the audio classification approach using the prediction model (by varying parameters)

7.1 Plots of how the classification criterion works using variance based on the average of the previous and next frames

This section shows plots of how the method performs on a speech-only file, a keystroke-only file, and a mixed-signal file.

Figure 54: Plot of audio signal containing only speech

Figure 55 clearly shows that the classification criterion generates some false alarms in the speech region.

Figure 56 is a plot of a keystroke-only file. All keystrokes appear to be captured by the classification criterion.

Figure 57 is the plot for a mixed audio file.


Figure 55 Plot of audio signal containing only keystroke

Figure 56 Case of audio signal containing mixed signal


7.2 Plots of how the classification criterion works using variance based on the filtering of the squared model error

Figure 57: Plot of audio signal containing only speech

Figure 58 Plot of audio signal containing only keystrokes


Figure 59: Plot of audio signal containing mixed signal

7.3 Plots and tables for hit rate, false alarm rate, and false alarm rate in the speech region

This section provides data obtained while evaluating the algorithms with different parameters.

1. This sub-section describes the test results obtained using the prediction model based on the previous and next frames separately. Here the variance is chosen as the maximum over the basis frame and the predicted frame. The other parameters are a 15 ms Hamming window and a 5 ms step length. The hit rate, total false alarm rate, and false alarm rate in the speech region are shown in tables 13, 14, and 15, respectively; the corresponding plots are in figures 61, 62, and 63. From the tables, the hit rate obtained for forward threshold 0.65 and backward threshold 0.55 is 68% with at most 19 false alarms. Figure 64 shows how the method works on a mixed audio file.


Figure 60: Forward and backward prediction approach with variance based on the max of the basis frame and the predicted frame.

2. Forward-backward approach

2.1 Test criteria: 10 ms Hamming window, 5 ms step, no ZCR, no weight, sigma based on the filtering of the squared model error

Hit Rate

Thr 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1 90 90 90 89 88 87 84 81 75 69 58 50 38 27 18 14 9 4
2 90 90 90 89 88 87 84 81 75 69 58 50 38 27 18 14 9 4
3 90 90 90 89 88 87 84 81 75 69 58 50 38 27 18 14 9 4
4 90 90 90 90 89 87 84 81 75 69 59 50 38 27 18 14 9 4
5 90 90 90 90 89 88 85 82 76 70 59 51 38 27 18 14 9 4
6 89 89 89 88 88 87 85 82 76 70 59 51 38 27 18 14 9 4
7 88 88 88 88 87 86 85 83 76 70 59 51 38 27 18 14 9 4
8 87 87 87 86 86 84 84 82 77 70 59 51 38 27 18 14 8 4
9 85 85 85 84 84 83 82 81 77 71 60 51 38 27 18 14 8 4
10 83 83 83 82 82 81 81 79 75 70 60 52 38 27 19 14 8 4
11 80 80 80 80 79 79 78 77 74 69 61 52 39 28 19 14 8 4
12 78 78 78 77 77 76 76 75 72 69 60 52 39 28 19 14 8 4
13 73 73 73 73 72 72 72 70 68 64 57 50 38 28 19 14 8 4
14 69 69 69 69 69 68 68 67 64 61 55 49 38 27 19 14 8 4


15 66 66 65 65 65 64 64 63 61 58 52 47 37 27 19 14 8 4
16 60 60 60 59 59 59 59 58 56 52 47 43 34 26 18 13 8 4
17 55 55 55 55 55 54 54 53 51 48 43 40 32 25 18 13 8 4
18 52 52 52 52 51 51 51 50 48 45 41 38 30 23 17 12 8 4

False Alarm Rate

Thr 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1 18 18 17 16 14 12 10 8 7 5 3 3 2 2 1 1 1 1
2 18 18 17 16 14 12 10 8 7 5 3 3 2 2 1 1 1 1
3 18 18 17 16 14 12 10 8 7 5 3 3 2 2 1 1 1 1
4 17 17 16 15 13 12 10 8 6 5 3 3 2 2 1 1 1 1
5 15 15 15 14 13 11 10 8 6 5 3 2 2 1 1 1 1 1
6 14 14 13 13 12 11 9 8 6 4 3 2 2 1 1 1 1 1
7 12 12 12 11 10 10 9 8 6 4 3 2 2 1 1 1 1 1
8 11 11 10 10 9 9 8 7 6 4 3 2 2 1 1 1 1 0
9 9 9 9 8 8 7 7 6 5 4 3 2 1 1 1 1 1 0
10 7 7 7 6 6 5 5 5 4 3 2 2 1 1 1 1 0 0
11 5 5 5 5 4 4 4 4 3 3 2 2 1 1 1 1 0 0
12 4 4 4 4 3 3 3 3 3 2 2 1 1 1 1 0 0 0
13 3 3 3 3 3 2 2 2 2 2 2 1 1 1 1 0 0 0
14 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 0 0 0
15 2 2 2 2 2 2 2 2 1 1 1 1 1 1 0 0 0 0
16 2 2 2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
17 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
18 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0

False Alarm Rate in speech region

Thr 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1 19 19 18 17 15 13 12 10 8 6 4 3 2 2 1 1 1 1
2 19 19 18 17 15 13 12 10 8 6 4 3 2 2 1 1 1 1
3 19 19 18 17 15 13 12 10 8 6 4 3 2 2 1 1 1 1
4 18 18 17 16 15 13 12 10 8 6 4 3 2 2 1 1 1 1
5 17 17 17 16 15 13 11 10 8 6 4 3 2 2 1 1 1 0
6 16 16 16 15 14 13 11 9 7 6 4 3 2 2 1 1 1 0
7 15 15 14 14 13 12 11 9 7 5 4 3 2 2 1 1 1 0
8 13 13 13 12 11 11 10 8 7 5 4 3 2 1 1 1 1 0
9 11 11 11 10 10 9 8 7 6 5 3 2 2 1 1 1 1 0
10 9 9 9 8 8 7 7 6 5 4 3 2 2 1 1 1 0 0
11 7 7 7 6 6 6 5 5 4 4 3 2 1 1 1 1 0 0
12 5 5 5 5 5 4 4 4 3 3 2 2 1 1 1 0 0 0
13 4 4 4 4 4 3 3 3 3 2 2 2 1 1 1 0 0 0
14 3 3 3 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0
15 3 3 2 2 2 2 2 2 2 2 1 1 1 1 0 0 0 0
16 2 2 2 2 2 2 2 2 2 1 1 1 1 1 0 0 0 0


17 2 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0
18 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0


2.2 Test criteria: 20 ms Hamming window, 5 ms step, no ZCR, no weight, sigma based on the filtering of the squared model error

Hit Rate

Thr 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1 83 83 83 84 84 84 83 81 77 68 60 47 35 23 13 8 4 1
2 83 83 83 84 84 84 83 81 77 68 60 47 35 23 13 8 4 1
3 83 83 83 83 84 84 83 81 77 68 60 47 35 23 13 8 4 1
4 83 83 83 83 84 84 83 81 77 68 60 47 35 23 13 8 4 1
5 81 81 81 81 82 82 82 81 77 68 60 47 35 23 13 8 4 1
6 80 80 80 80 80 81 81 80 76 68 60 47 35 22 13 8 4 1
7 78 78 78 78 79 79 80 79 76 68 60 47 35 22 13 8 4 1
8 77 77 77 77 77 78 78 78 75 67 60 47 35 22 13 8 4 1
9 74 74 74 74 75 75 76 75 73 66 59 47 34 22 13 8 4 1
10 70 70 70 70 71 71 72 71 70 64 57 46 34 22 13 8 4 1
11 67 67 67 67 67 67 67 67 66 61 55 45 34 22 13 8 4 1
12 62 62 62 62 62 62 62 62 61 57 52 42 33 21 13 8 4 1
13 56 56 56 56 57 57 57 57 56 52 48 39 31 21 12 7 4 1
14 53 53 53 53 53 53 54 54 52 49 45 37 30 20 12 7 4 1
15 50 50 50 50 50 50 51 50 49 46 43 35 28 19 11 7 4 1
16 46 46 46 46 46 46 46 46 44 42 38 32 25 17 10 6 4 1
17 41 41 41 41 41 41 41 41 40 37 34 29 23 16 10 6 3 1
18 36 36 36 36 37 37 37 37 35 33 30 26 21 14 9 6 3 1


False Alarm Rate

Thr 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1 18 17 17 16 14 12 10 9 7 5 4 3 2 2 2 1 1 1
2 18 17 17 16 14 12 10 9 7 5 4 3 2 2 2 1 1 1
3 17 17 17 16 14 12 10 9 7 5 4 3 2 2 1 1 1 1
4 17 17 16 15 14 12 10 9 7 5 4 3 2 2 1 1 1 1
5 16 16 15 14 13 12 10 8 7 5 3 3 2 2 1 1 1 1
6 15 14 14 13 12 11 10 8 6 4 3 2 2 2 1 1 1 1
7 13 13 13 12 11 10 9 8 6 4 3 2 2 1 1 1 1 1
8 12 12 12 11 10 9 8 7 6 4 3 2 2 1 1 1 1 1
9 10 10 10 10 9 8 7 6 5 3 2 2 1 1 1 1 1 0
10 8 8 8 7 7 6 5 4 4 3 2 1 1 1 1 0 0 0
11 7 7 6 6 5 4 4 3 3 2 2 1 1 1 0 0 0 0
12 5 5 5 5 4 3 3 3 2 2 1 1 1 1 0 0 0 0
13 4 4 4 4 3 3 2 2 2 1 1 1 1 0 0 0 0 0
14 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 0 0 0
15 2 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 0 0
16 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
18 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

False Alarm Rate in speech region

Thr 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1 18 18 18 17 15 14 12 10 8 6 4 3 2 2 1 1 1 1
2 18 18 18 17 15 14 12 10 8 6 4 3 2 2 1 1 1 1
3 18 18 17 17 15 14 12 10 8 6 4 3 2 2 1 1 1 1
4 18 18 17 16 15 14 12 10 8 6 4 3 2 2 1 1 1 1
5 17 17 17 16 15 13 12 10 8 6 4 3 2 2 1 1 1 0
6 16 16 16 15 14 13 11 10 8 6 4 3 2 1 1 1 1 0
7 15 15 15 14 13 12 11 9 7 5 4 3 2 1 1 1 1 0
8 14 14 14 13 12 11 10 9 7 5 3 2 2 1 1 1 0 0
9 12 12 12 11 10 9 8 7 6 4 3 2 1 1 1 1 0 0
10 10 10 9 9 8 7 7 6 5 4 2 2 1 1 1 0 0 0
11 7 7 7 7 6 6 5 4 4 3 2 1 1 1 0 0 0 0
12 6 6 5 5 5 4 4 3 3 2 2 1 1 1 0 0 0 0
13 4 4 4 4 4 3 3 3 2 2 1 1 1 0 0 0 0 0
14 3 3 3 3 3 2 2 2 2 1 1 1 0 0 0 0 0 0
15 2 2 2 2 2 2 2 2 1 1 1 1 0 0 0 0 0 0
16 2 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
18 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

