
The Development of Excellence of the Telecommunication Research Team in

Relation to International Cooperation - CZ.1.07/2.3.00/20.0217

A review on speech processing

and use of LPC

(Linear Predictive Coding)

Keynote Talk

Prof. H. Gökhan İlk

Telecommunication

Educational Seminar



Contact details

Address : Ankara University,

Faculty of Engineering

Electrical & Electronics Eng. Department

Gölbaşı, Ankara TURKEY



[email protected]

What is speech signal?

• The speech signal represents the change of acoustic pressure with respect to time: s(t) in continuous time or s[n] in discrete time

• It has a bandwidth from 20 Hz to 20 kHz (the audible band, or audio band)

• Speech, for communication purposes, has a bandwidth of up to 3.4 kHz



What is sampling frequency fs ?

• In general speech is sampled at 8000 Hz (Narrow band speech)

• 11025 Hz (Forensic Quality)

• 16000 Hz (Wide band speech, ITU standard just like 8 kHz)

• 44100 Hz for CD quality

• A word on quantization:

• Depth is usually 16 bits
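The sampling rates and bit depth above fix the raw PCM bit rate as sampling frequency times bits per sample. A minimal sketch of that arithmetic follows; the function name `pcm_bit_rate` is illustrative and a single (mono) channel is assumed.

```python
def pcm_bit_rate(fs_hz, bits_per_sample=16, channels=1):
    """Raw (uncompressed) PCM bit rate in bits per second."""
    return fs_hz * bits_per_sample * channels

# Sampling rates listed on the slide, all at 16 bits per sample
for label, fs in [("narrow band", 8000), ("forensic", 11025),
                  ("wide band", 16000), ("CD quality", 44100)]:
    print(f"{label:12s} {fs:6d} Hz -> {pcm_bit_rate(fs) / 1000:.1f} kb/s")
```

At 8000 Hz and 16 bits this gives 128 kb/s, which is the reference rate a low bit rate speech coder has to beat.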

A more complicated, but more informative waveform:

Spectrogram: Time, Frequency and magnitude, all at once

Q1: What is the sampling frequency?

Q2: What does the spectrum look like?
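A spectrogram of this kind can be sketched as the magnitude of a short-time Fourier transform. The example below is only an illustration: the frame length, hop size, Hamming window and the synthetic 1 kHz test tone are all assumptions, not values taken from the slide.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude STFT: rows are frequency bins, columns are time frames."""
    x = np.asarray(x, dtype=float)
    window = np.hamming(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    # Cut overlapping frames, window each one, stack them as columns
    frames = np.stack([x[s:s + frame_len] * window for s in starts], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))

fs = 8000                                  # assumed sampling frequency
t = np.arange(fs) / fs                     # one second of samples
x = np.sin(2 * np.pi * 1000 * t)           # synthetic 1 kHz tone, not real speech
S = spectrogram(x)

# Frequency of the strongest bin in the first frame
peak_hz = np.argmax(S[:, 0]) * fs / 256
print(S.shape, peak_hz)
```

Each column of `S` is the spectrum of one short frame, so time runs along one axis, frequency along the other, and magnitude is the value stored in the array.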

Confucius: 'Learning without thought is labor lost'



Speech Production Mechanism

LUNGS

What does it look like in the time domain?

Short-term correlation (i.e., sample-to-sample correlation)

Long-term correlation



Redundancy:

• The speech signal possesses a lot of redundancy: short-term correlation, i.e. from sample to sample, and long-term correlation, i.e. from one pitch period (fundamental period) to another

• It is an almost periodic waveform



What does it look like?

• Speech can be generally classified as Voiced or Unvoiced.

• The voiced part is a quasi-periodic (almost periodic) signal with higher energy and fewer zero crossings.

• The unvoiced part is a noise-like signal
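The two cues named above, energy and zero crossings, can be measured per frame. The sketch below uses synthetic stand-ins (a 120 Hz tone for a voiced frame, low-level noise for an unvoiced one); these signals and the function name `short_time_features` are assumptions for illustration only.

```python
import numpy as np

def short_time_features(frame):
    """Return (mean energy, zero-crossing rate) of one frame."""
    frame = np.asarray(frame, dtype=float)
    energy = float(np.mean(frame ** 2))
    signs = np.sign(frame)
    signs[signs == 0] = 1                   # treat exact zeros as positive
    zcr = float(np.mean(signs[1:] != signs[:-1]))
    return energy, zcr

fs = 8000
n = np.arange(240)                          # 30 ms frame at 8 kHz
rng = np.random.default_rng(0)
voiced_like = np.sin(2 * np.pi * 120 * n / fs)      # quasi-periodic stand-in
unvoiced_like = 0.1 * rng.standard_normal(n.size)   # noise-like stand-in

e_v, z_v = short_time_features(voiced_like)
e_u, z_u = short_time_features(unvoiced_like)
print(f"voiced-like:   energy={e_v:.3f}, zcr={z_v:.3f}")
print(f"unvoiced-like: energy={e_u:.3f}, zcr={z_u:.3f}")
```

A real voiced/unvoiced detector would set thresholds on these two numbers, usually together with a pitch estimate.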

Areas of speech processing:

• Speech processing has several applications

• Speech recognition (What is said?)

• Speaker identification (Who says it?)

• Speech compression (Fewer samples)

• Speech coding (Save or send it digitally)

• Speech analysis (Medical or forensic applications)

• Speech synthesis (From parameters or from text to speech signal, TTS)

• Speech enhancement (Noisy to clean)

• Combination of more than one application



Speech recognition

• Continuous speech recognition: “I would like to buy a ticket from Ostrava to Prague for tomorrow”

• Isolated speech recognition: “One, Two, …”

• Word spotting: “… going out for a pub and need cash for this weekend …”

• Speech to text applications: for dictation purposes (transcribing medical reports, limited vocabulary for the disabled)

Artificial Intelligence

Speaker Identification

• Speaker identification, closed set 1:N or open set 1:(N+1): identify “who” is speaking from a known set of people

• Speaker verification 1:1: determine whether the speaker is the person of interest

• Text dependent / text independent: it does / does not matter what is being said

• Speaker transformation: Barack Obama speaking in Czech

Speech coding

• Waveform coders: try to preserve the waveform in the time domain (A-law, CELP)

• Vocoders (VOice CODERS): try to preserve the entropy (information) content of speech, usually in the spectral domain (MBEV)

• Transform coders: adaptive transform coders (ATC)

• Speech coding for VoIP: from circuit switching technology to packet switching technology

• Military vocoders, NATO and DoD standards: MELP (1.2 – 2.4 kbps)

Speech analysis

• Voicing analysis / pitch estimation, jitter and shimmer analysis: to identify effects of surgery, also in forensic analysis

• Cepstral analysis / spectral analysis: widely used “feature” in speech recognition

• Formant analysis: to study phonetics and vowel structures

• Time domain analysis: to obtain features in speech applications

Speech synthesis

• Text to speech synthesis: newspapers speaking, robots talking

• Speech coders: during decoding, speech synthesis is required

• Artificial production of human sound

• Prosodic modification and emotional content

Current State of Speech Processing

Windowing (Stationary)

Postprocessing

Preprocessing

Feature selection / extraction

Classifier design

Classification result

Speech recognized with speaker ID

Text-to-text translation

TTS: prosody, emotion modification

Output speech (in another language)

Input speech (in one language)

Father of all parameters, mother of all features: LPC (Linear Predictive Coding)

A little bit of history first!

How slowly the human brain works: evolution from the simplest to the most advanced

How would you code these samples, speech[n]? OF COURSE: PCM



Is it not easier to code the difference signal? DPCM (Differential Pulse Code Modulation)
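The idea of coding the difference signal can be sketched with the simplest possible predictor, the previous sample. The signal below is a synthetic, slowly varying stand-in for speech, and the quantizer is deliberately left out, so this is only an illustration of the principle and not a complete DPCM codec.

```python
import numpy as np

fs = 8000
n = np.arange(800)
# Synthetic 200 Hz tone scaled to a 16-bit-like range (an assumption, not real data)
x = np.round(10000 * np.sin(2 * np.pi * 200 * n / fs)).astype(np.int64)

# Encoder: d[n] = x[n] - x[n-1], taking x[-1] = 0
d = np.diff(x, prepend=0)

# Decoder: a running sum rebuilds the samples exactly (no quantization here)
x_rec = np.cumsum(d)

print("peak |x| :", np.abs(x).max())
print("peak |d| :", np.abs(d).max())
print("lossless :", np.array_equal(x_rec, x))
```

Because neighbouring samples are strongly correlated, the difference signal spans a much smaller range than the samples themselves and can be coded with fewer bits. A practical DPCM coder predicts from the quantized, reconstructed samples so that encoder and decoder stay in step.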



A major problem! Differences can be quite different: ADPCM (Adaptive Differential Pulse Code Modulation)





Evolving from PCM to ADPCM

The human brain works in a different, more greedy way!

Is it possible to adapt NOT only for one sample difference, but for many sample differences?

LPC equations

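$$e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{k=1}^{p} a_k\, x[n-k]$$

$$\hat{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k] = a_1 x[n-1] + a_2 x[n-2] + \dots + a_p x[n-p]$$

$$e[n] = x[n] - \hat{x}[n] = -\sum_{k=0}^{p} a_k\, x[n-k], \quad a_0 = -1$$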

A simple, linear model. Just addition and multiplication with constants: LPC
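The "just addition and multiplication" claim can be checked numerically. The sketch below assumes the convention that the prediction is the weighted sum of the p previous samples and that the residual is the sample minus the prediction; the second-order coefficients `a_true` and the function name `lpc_residual` are hypothetical, chosen only for the demonstration.

```python
import numpy as np

def lpc_residual(x, a):
    """Return (e, x_hat) with x_hat[n] = sum_{k=1..p} a[k-1] * x[n-k], x[n] = 0 for n < 0."""
    x = np.asarray(x, dtype=float)
    x_hat = np.zeros_like(x)
    for k, a_k in enumerate(a, start=1):
        x_hat[k:] += a_k * x[:-k]           # delayed copy of x scaled by a_k
    return x - x_hat, x_hat

# Build a test signal that obeys x[n] = 1.3 x[n-1] - 0.6 x[n-2] + e[n]
rng = np.random.default_rng(1)
e_true = rng.standard_normal(500)
a_true = [1.3, -0.6]
x = np.zeros(500)
for i in range(500):
    x[i] = e_true[i]
    if i >= 1:
        x[i] += a_true[0] * x[i - 1]
    if i >= 2:
        x[i] += a_true[1] * x[i - 2]

e, x_hat = lpc_residual(x, a_true)
print("residual power / signal power:", np.var(e) / np.var(x))
```

With the right coefficients the residual is exactly the white excitation that generated the signal, and its power is well below that of the signal itself.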

Engineering point of view LPC

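[Block diagram: $x[n]$ feeds the linear prediction filter, which outputs $\hat{x}[n]$; a summing node subtracts $\hat{x}[n]$ from $x[n]$ to give $e[n]$.]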



Building bridges

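$$e[n] = y[n] - \hat{y}[n] = y[n] - \sum_{k=0}^{N-1} w[k]\, x[n-k]$$

Anyone heard of Wiener filter theory, optimal filtering?

$$e[n] = y[n] - \hat{y}[n] = y[n] - \sum_{k=0}^{N-1} h[k]\, x[n-k]$$

Convolution sum

The Wiener filter turns out to be an FIR filter with N coefficients

Auto-regression

Error is the difference between our signal and its optimal estimate

Now $y[n] = x[n]$:

$$e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{k=0}^{N-1} a_k\, x[n-k]$$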

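Let's summarize!

$$e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{k=1}^{N-1} a_k\, x[n-k]$$

$$\hat{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k] = a_1 x[n-1] + a_2 x[n-2] + \dots + a_p x[n-p]$$

$$e[n] = x[n] - \hat{x}[n] = -\sum_{k=0}^{p} a_k\, x[n-k], \quad a_0 = -1$$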

LPC Analysis Filter

Input $x[n]$, output:

$$e[n] = x[n] - \hat{x}[n] = -\sum_{k=0}^{p} a_k\, x[n-k], \quad a_0 = -1$$

Now we are using p previous samples

LPC Analysis, AR (Auto Regressive) Model

$$x[n] = a_1 x[n-1] + a_2 x[n-2] + \dots + a_p x[n-p] + e[n]$$

Consider the difference between optimum filter theory and regression analysis: here both the independent and the dependent variables belong to the same random process x, so x[n] is called an autoregressive (AR) process. That is why LPC (linear predictive analysis) is also called AR analysis

How do you minimize a VECTOR?

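It is now possible to determine the estimates $\alpha_j$ by minimising the mean squared error, i.e.

$$\text{Error} = E\{e^2(n)\} = E\Big\{\Big[s(n) - \sum_{j=1}^{p} \alpha_j\, s(n-j)\Big]^2\Big\}$$

where E{.} is the expectation operator. Here s(n) denotes the speech signal (x[n] above) and $\alpha_j$ the predictor coefficients ($a_k$ above).

Setting the partial derivatives of Error with respect to $\alpha_i$ to zero for i = 1,2,...,p, we get

$$E\Big\{\Big[s(n) - \sum_{j=1}^{p} \alpha_j\, s(n-j)\Big]\, s(n-i)\Big\} = 0, \quad i = 1, 2, \dots, p$$

Solving the linear equations

$$\sum_{j=1}^{p} \alpha_j\, \phi_n(i,j) = \phi_n(i,0), \quad i = 1, 2, \dots, p$$

where

$$\phi_n(i,j) = E\{s(n-i)\, s(n-j)\}, \quad i = 1, 2, \dots, p, \quad j = 1, 2, \dots, p$$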

Is this autocorrelation?

Are we good at linear algebra? That is, e(n) is orthogonal to s(n-i) for i = 1,2,...,p

A x = b



Auto-Correlation Method



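$$\phi_n(i,j) = \sum_{m=0}^{N-1+p} s_n(m-i)\, s_n(m-j), \quad 1 \le i \le p, \quad 0 \le j \le p$$

$$R_n(j) = \sum_{m=0}^{N-1-j} s_n(m)\, s_n(m+j)$$

$$\sum_{j=1}^{p} \alpha_j\, R_n(|i-j|) = R_n(i), \quad 1 \le i \le p$$

$$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & \cdots & R_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix}$$

Levinson-Durbin recursion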
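The Toeplitz system of the autocorrelation method can be solved order by order with the Levinson-Durbin recursion instead of a general matrix inverse. The following is a minimal sketch in its textbook form; the function names, the order p = 10, the Hamming window and the random frame standing in for speech are assumptions for illustration.

```python
import numpy as np

def autocorr(s, p):
    """R(j) = sum_{m=0}^{N-1-j} s[m] * s[m+j] for j = 0..p."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    return np.array([np.dot(s[:N - j], s[j:]) for j in range(p + 1)])

def levinson_durbin(R, p):
    """Solve sum_j alpha_j R(|i-j|) = R(i), i = 1..p. Returns (alpha, final error)."""
    alpha = np.zeros(p)
    err = R[0]                                   # prediction error of order 0
    for i in range(1, p + 1):
        # Reflection coefficient of order i
        k = (R[i] - np.dot(alpha[:i - 1], R[i - 1:0:-1])) / err
        prev = alpha[:i - 1].copy()
        alpha[i - 1] = k
        alpha[:i - 1] = prev - k * prev[::-1]    # update lower-order coefficients
        err *= (1.0 - k * k)                     # error shrinks at every order
    return alpha, err

rng = np.random.default_rng(0)
frame = np.hamming(240) * rng.standard_normal(240)   # synthetic windowed frame
p = 10
R = autocorr(frame, p)
alpha, err = levinson_durbin(R, p)

# Cross-check against the explicit Toeplitz system
toeplitz = np.array([[R[abs(i - j)] for j in range(p)] for i in range(p)])
print(np.allclose(toeplitz @ alpha, R[1:]), err)
```

The recursion needs on the order of p squared operations, compared with p cubed for a general solver, which is one reason the autocorrelation method is attractive in speech coders.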



Finally

• Now that we have the LPC coefficients $a_i$, we can represent speech compactly

• This further requires an efficient representation of the excitation (residual, error) signal. For example, computing optimum magnitudes for regularly spaced excitation pulses is the basis of speech coding in GSM (Global System for Mobile Communications)



Why is LPC so popular? LPC in the frequency domain

Why is LPC so popular? LPC in the time domain

Innovations representation

$x[n]$ → $H(z) = A(z)$ → $e[n]$ → $H^{-1}(z) = 1/A(z)$ → $x[n]$

IF you assume the residual (error) signal to be white noise!

The inverse system has many advantages.

1. In communications (left and right systems are apart)

2. The system on the right does not need any input???



Innovations representation

Innovations representation is basically an inverse system.

Why called innovations??

Assume that x, our discrete random signal is speech.

e[n] is white noise

We do not need e[n], because the inverse filter itself represents the information. That is why the representation is called INNOVATIONS.
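The analysis filter and its inverse can be sketched as a round trip. This example assumes the convention A(z) = 1 - a_1 z^-1 - ... - a_p z^-p, so the analysis filter is FIR and the synthesis filter 1/A(z) is all-pole; the coefficients and function names are hypothetical.

```python
import numpy as np

def analysis_filter(x, a):
    """FIR filter A(z): e[n] = x[n] - sum_k a_k x[n-k]."""
    x = np.asarray(x, dtype=float)
    e = x.copy()
    for k, a_k in enumerate(a, start=1):
        e[k:] -= a_k * x[:-k]
    return e

def synthesis_filter(e, a):
    """All-pole filter 1/A(z): x[n] = e[n] + sum_k a_k x[n-k]."""
    e = np.asarray(e, dtype=float)
    x = np.zeros_like(e)
    for n in range(len(e)):
        acc = e[n]
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc += a_k * x[n - k]       # feedback from past outputs
        x[n] = acc
    return x

a = [1.3, -0.6]                              # hypothetical stable coefficients
rng = np.random.default_rng(2)
white = rng.standard_normal(300)

x = synthesis_filter(white, a)               # "right" system driven by white noise
e = analysis_filter(x, a)                    # "left" system recovers the excitation
x_rec = synthesis_filter(e, a)               # and the round trip rebuilds x

print(np.allclose(e, white), np.allclose(x_rec, x))
```

If the residual really is white noise, the receiver can generate its own noise locally and only the filter coefficients need to be transmitted; the output then matches the original in spectral shape rather than sample by sample.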



More on LPC ! An expert view

Since 1/A(z) is a causal filter (does everybody see that???), this implies that it is minimum phase (it is causal and stable, with a causal and stable inverse)

Since A(z) is an FIR filter, it is always stable and we know that it is causal. We also know that 1/A(z) is causal. BUT IS IT ALWAYS STABLE???

When the LPC coefficients $a_k$ are found by solving the normal equations with a positive definite autocorrelation matrix, the poles of 1/A(z) always lie within the unit circle, and therefore the filter is ALWAYS STABLE!
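The stability claim can be checked for any given coefficient set by locating the poles of 1/A(z). The sketch below assumes A(z) = 1 - a_1 z^-1 - ... - a_p z^-p; the function name and both test coefficient sets are hypothetical.

```python
import numpy as np

def synthesis_is_stable(a):
    """True if all poles of 1/A(z) lie strictly inside the unit circle."""
    # Poles are the roots of z^p - a_1 z^(p-1) - ... - a_p
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return bool(np.all(np.abs(np.roots(poly)) < 1.0))

print(synthesis_is_stable([1.3, -0.6]))   # complex pole pair with |z|^2 = 0.6
print(synthesis_is_stable([2.5, -1.0]))   # real poles at z = 2 and z = 0.5
```

Coefficients produced by the autocorrelation method pass this test by construction, whereas coefficients that have been quantized or interpolated afterwards are worth re-checking.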



How well does the theory work?

• Steve wore a bright red cashmere sweater (Male speaker)

• Before Thursday’s exam, review every formula (Female speaker)



Samples

• Male, “Steve wore a bright red cashmere sweater”: 2.4 kb/s, 1.0 kb/s, 128 kb/s PCM

• Female, “Before Thursday’s exam, review every formula”: 2.4 kb/s, 1.0 kb/s, 128 kb/s PCM

Thank you for listening,

any questions?


