speech coding - physionetphysionet.cps.unizar.es/.../docencia/tvoz/ingles/coding.pdf ·...

Speech Technologies – Speech Coding

Speech Coding
1. Introduction
2. LPC Vocoder
3. Analysis-by-Synthesis Coding


Speech Coding Classification

1. Waveform Coding
   To reconstruct a signal waveform similar to the original one
   PCM G.711 64 kb/s, ADPCM G.721 32 kb/s, SBC G.722

2. Source Coding
   To reconstruct a signal based on the production model
   LPC Vocoder FS1015 2.4 kb/s, MELP 2.4 kb/s

3. Hybrid coders - Analysis-by-Synthesis
   Waveform coding based on the production model
   ETSI GSM, CELP G.729


Coders Comparison

1. Bit rate (kb/s)
2. Voice quality: MOS (Mean Opinion Score)
3. Complexity
4. Delay
5. Channel error sensitivity
6. Bandwidth

Coder      Bit rate (kb/s)   MOS              BW (kHz)
CD Audio   1411              5.0              44.1
PCM        64                4.3              8
ADPCM      40, 32, 24, 16    4.2 (32 kb/s)    8
SBC        64, 56, 48        >4.5             16
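
The bit rates in the table follow from sampling rate times bits per sample. A minimal sketch, assuming 8-bit samples at 8 kHz for G.711, 16-bit stereo at 44.1 kHz for CD audio, and 5, 4, 3 or 2 bits per sample at 8 kHz for ADPCM (the sample sizes are not stated on the slide; the function name is illustrative):

```python
def pcm_bit_rate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate in kb/s for a stream of fixed-size samples."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


# Assumed sample sizes (not given on the slide).
g711_kbps = pcm_bit_rate_kbps(8000, 8)
cd_kbps = pcm_bit_rate_kbps(44100, 16, channels=2)
adpcm_kbps = [pcm_bit_rate_kbps(8000, bits) for bits in (5, 4, 3, 2)]

print(g711_kbps, cd_kbps, adpcm_kbps)
```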


Comparison


Speech Coding: LPC Vocoder


LPC Analysis

LPC Synthesis:

[Block diagram with signals s(n), ŝ(n), e(n), the predictor P(z) and the synthesis filter H(z) = 1/A(z)]

$$P(z) = -\sum_{i=1}^{p} a_i \cdot z^{-i}$$

H(z): Vocal tract transfer function
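
A minimal sketch of LPC analysis and synthesis, assuming the autocorrelation method with the Levinson-Durbin recursion (the slides do not say which estimation method is used). The list `A` holds the coefficients of A(z) directly, as `[1, c1, ..., cp]`, so the code does not depend on the sign convention chosen for the a_i. All function names are illustrative.

```python
import random


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (A, err): A = [1, c1, ..., cp] for A(z), err = residual energy."""
    A = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(A[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_A = A[:]
        for j in range(1, i):
            new_A[j] = A[j] + k * A[i - j]
        new_A[i] = k
        A = new_A
        err *= 1.0 - k * k
    return A, err


def analysis_filter(s, A):
    """Inverse filter A(z): e(n) = sum_k A[k] * s(n - k)."""
    return [sum(A[k] * s[n - k] for k in range(len(A)) if n - k >= 0)
            for n in range(len(s))]


def synthesis_filter(e, A):
    """All-pole filter H(z) = 1/A(z) driven by the excitation e(n)."""
    out = []
    for n in range(len(e)):
        past = sum(A[k] * out[n - k] for k in range(1, len(A)) if n - k >= 0)
        out.append(e[n] - past)
    return out


# Toy signal: a second-order autoregressive process (hypothetical test data).
rng = random.Random(0)
s = [0.0, 0.0]
for _ in range(4000):
    s.append(1.3 * s[-1] - 0.6 * s[-2] + rng.gauss(0.0, 1.0))

A, err = levinson_durbin(autocorrelation(s, 2), 2)
e = analysis_filter(s, A)        # prediction error e(n)
s_hat = synthesis_filter(e, A)   # reconstruction from the unquantized error
```

With the unquantized error as excitation, the synthesis filter returns the original signal; a vocoder replaces `e` with a much simpler excitation.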


LPC Vocoder

Simplification of the excitation in the synthesis:

- Train of periodic impulses for voiced segments
- White Gaussian noise in the unvoiced segments
- Maintenance of the power in the new synthetic excitation

Examples:


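LPC Vocoder

[Block diagram of the LPC vocoder. Labelled blocks: LPC analysis, pitch and U/V (voiced/unvoiced) analysis, predictor P(z), synthesis filter H(z). Labelled signals and parameters: s(n), r(n), ŝ(n), gain G, reflection coefficients, F0, pulse spacing 1/F0, V/U switch.]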
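
The simplified excitation of the LPC vocoder (impulse train for voiced segments, white Gaussian noise for unvoiced segments, power preserved) can be sketched as follows. This is an illustration only: the function names, the 180-sample frame and the 60-sample pitch period are assumptions chosen for the example.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence."""
    return sum(v * v for v in x) / len(x)


def scale_to_power(x, power):
    """Scale x so that its average power equals `power`."""
    gain = math.sqrt(power / mean_power(x))
    return [gain * v for v in x]


def voiced_excitation(n_samples, pitch_period, power):
    """Impulse train: one pulse every `pitch_period` samples (Fs / F0)."""
    train = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return scale_to_power(train, power)


def unvoiced_excitation(n_samples, power, rng):
    """White Gaussian noise with the requested average power."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return scale_to_power(noise, power)


target_power = 0.25   # power measured on the original residual (assumed value)
voiced = voiced_excitation(180, 60, target_power)
unvoiced = unvoiced_excitation(180, target_power, random.Random(0))
```

Either sequence would then drive the synthesis filter H(z) in place of the true prediction residual.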


LPC10E/FS1015 Vocoder

54 bits/frame

Pitch + U/V -> 7 bits
G -> 5 bits
K1 to K4 -> 5 bits each
K5 to K8 -> 4 bits each
K9 -> 3 bits
K10 -> 2 bits

Fs = 8000 samples/s
54 bits/frame
180 samples/frame (22.5 ms/frame)

54*8000/180 = 2400 bits/s
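
The frame arithmetic above can be checked with a few lines. The fields listed on the slide add up to 53 bits; the sketch assumes the remaining bit of the 54-bit frame is a synchronization bit, which the slide does not list. The dictionary keys are illustrative names.

```python
# Bits per field as listed on the slide.
field_bits = {"pitch_uv": 7, "gain": 5, "K9": 3, "K10": 2}
field_bits.update({f"K{i}": 5 for i in range(1, 5)})   # K1 to K4
field_bits.update({f"K{i}": 4 for i in range(5, 9)})   # K5 to K8

listed_bits = sum(field_bits.values())
sync_bits = 1                       # assumption: not listed on the slide
frame_bits = listed_bits + sync_bits

sample_rate = 8000                  # samples per second
samples_per_frame = 180

frame_duration_ms = 1000.0 * samples_per_frame / sample_rate
bit_rate = frame_bits * sample_rate / samples_per_frame

print(listed_bits, frame_bits, frame_duration_ms, bit_rate)
```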


LPC10E Vocoder

Examples:
- Original signal
- Reconstructed signal LPC10E
- Reconstructed signal LPC10E (satellite radio transmission)

Features:
- Nasality: pole-zero model
- Simple voiced excitation (train of impulses): buzzing
- Frame size: problems with fast transitions (p, t, k…)


MELP: Mixed-Excitation Linear Predictive Vocoder

2400 bps Federal Standard speech coder

The excitation signal is generated by means of a mixture of noise and train of impulses in different frequency bands



Examples:
- Original signal “clean”
- LPC-10
- Reconstructed signal MELP “clean”
- Original signal “noisy”
- Reconstructed signal “noisy”

Data rate: 2400 bps (54 bits × 44.44 frames/second)
Sampling rate: 8 kHz
Bit stream format: for each 22.5 ms frame of input speech, the following 54 bits are placed into the bit-stream (in this order):

Description                                 Number of bits
Pitch index                                 7
Jitter flag                                 1
Bandpass voicing decision                   4 x 1
Gain for second half of frame               5
Gain for first half of frame                3
LSP frequencies (10 line spectrum pairs)    25
Fourier magnitudes (10 harmonics)           8
Sync bit                                    1
Total                                       54


Hybrid Coders
Predictive Coders based on Analysis-by-Synthesis


Hybrid Coders
Depending on the excitation, they are classified into three basic types:
1. MultiPulse Excitation (MPE)
2. Regular Pulse Excitation (RPE)
3. Code Excited Linear Prediction (CELP)


Hybrid Coders
CELP: Code Excited Linear Prediction


Short-Time Analysis

Typical values:
- Analysis frame: 25 ms (200 samples)
- Speech frame: 20 ms (160 samples)
- Subframe: 5 ms (40 samples)
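
The three sizes above can be illustrated with a small framing routine. The sketch assumes the 200-sample analysis window is centred on its 160-sample speech frame, with 20 samples of overlap on each side and zero padding at the signal edges; the slides give only the sizes, not the alignment. Names are illustrative.

```python
ANALYSIS = 200   # samples in the LPC analysis window (25 ms at 8 kHz)
FRAME = 160      # samples per speech frame (20 ms)
SUBFRAME = 40    # samples per subframe (5 ms)


def split_frames(signal):
    """Return a list of (analysis_window, subframes), one per complete frame."""
    pad = (ANALYSIS - FRAME) // 2                # assumed centred window
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    frames = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        window = padded[start:start + ANALYSIS]  # covers start-20 .. start+179
        frame = signal[start:start + FRAME]
        subframes = [frame[i:i + SUBFRAME] for i in range(0, FRAME, SUBFRAME)]
        frames.append((window, subframes))
    return frames


# Toy input: sample values equal to their own index, to make slices visible.
frames = split_frames(list(range(480)))
```

The short-term predictor would be estimated once per analysis window, while the excitation parameters are typically updated once per subframe.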


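Synthesis Filter
Based on short-term and long-term Linear Prediction

[Block diagram. Analysis: the short-term analysis stage subtracts the P(z) prediction from s(n) to give r(n); the long-term analysis stage subtracts the PL(z) prediction from r(n) to give rL(n). Synthesis: rL(n) plus the PL(z) prediction gives r̂(n); r̂(n) plus the P(z) prediction gives ŝ(n).]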


Synthesis Filter
Long-Term predictor

Estimation of r̂(n):

$$\hat{r}(n) = \beta\, r(n-D)$$

or

$$\hat{r}(n) = \beta_1\, r(n-(D+1)) + \beta_2\, r(n-D) + \beta_3\, r(n-(D-1))$$

Parameter estimation, minimum mean square error:

$$e(n) = r(n) - \beta\, r(n-D)$$

$$E = \sum_{n=0}^{N-1} e^2(n) = \sum_{n=0}^{N-1} \left[ r(n) - \beta\, r(n-D) \right]^2$$

$$\frac{\partial E}{\partial \beta} = 0 \quad\Rightarrow\quad \beta = \frac{\sum_{n=0}^{N-1} r(n)\, r(n-D)}{\sum_{n=0}^{N-1} r^2(n-D)}$$


Synthesis Filter
Find the value of D that minimizes the error power E:

$$E = \sum_{n=0}^{N-1} r^2(n) - \frac{\left[ \sum_{n=0}^{N-1} r(n)\, r(n-D) \right]^2}{\sum_{n=0}^{N-1} r^2(n-D)}$$
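
The long-term predictor search can be sketched as an exhaustive loop over candidate delays: for each D, compute the optimal gain β and the resulting error power E, and keep the delay with the smallest E. This is a minimal single-tap sketch; the function name, the delay range 40 to 90 and the synthetic test signal are assumptions made for the example.

```python
import random


def ltp_search(current, history, d_min, d_max):
    """Return (D, beta, E) minimizing E for the single-tap predictor.

    current: the N residual samples r(0..N-1) of the present subframe
    history: past residual samples, at least d_max of them
    """
    buf = list(history) + list(current)
    offset = len(history)
    n_samples = len(current)
    energy = sum(v * v for v in current)
    best = None
    for delay in range(d_min, d_max + 1):
        past = [buf[offset + n - delay] for n in range(n_samples)]
        num = sum(c * p for c, p in zip(current, past))   # sum r(n) r(n-D)
        den = sum(p * p for p in past)                    # sum r^2(n-D)
        if den == 0.0:
            continue
        beta = num / den
        error = energy - num * num / den
        if best is None or error < best[2]:
            best = (delay, beta, error)
    return best


# Synthetic residual that repeats every 50 samples with gain 0.8.
rng = random.Random(1)
sig = [rng.gauss(0.0, 1.0) for _ in range(50)]
while len(sig) < 190:
    sig.append(0.8 * sig[-50])
history, current = sig[:150], sig[150:]

best_delay, best_gain, best_error = ltp_search(current, history, 40, 90)
```

On this synthetic signal the search recovers the true delay and gain, since the error power vanishes only at the repetition period.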


Perceptual Weighting Filter

Function: to shape the frequency characteristics of the error being minimized, giving more weight to the frequency regions where the ear is more sensitive and less weight to the regions where it is less sensitive. Based on frequency masking:

- In the formants, more error can be tolerated.
- The filter transfer function will be inversely proportional to the spectral envelope of the coded speech signal.
- Transfer function proposed: W(z) = A(z)/A(γ⁻¹z)
- The parameter γ ∈ [0, 1] controls the level of weighting.
- Must be updated jointly with the predictor.

$$W(z) = \frac{A(z)}{A(z/\gamma)} = \frac{1 - \sum_{k=1}^{P} a_k z^{-k}}{1 - \sum_{k=1}^{P} a_k \gamma^{k} z^{-k}} = \frac{1 - \sum_{k=1}^{P} a_k z^{-k}}{\prod_{k=1}^{P} \left( 1 - \gamma\, p_k z^{-1} \right)}$$

where the p_k are the roots of A(z), that is, the poles of 1/A(z).

Normally: 0.8 ≤ γ ≤ 0.9
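
A minimal sketch of the weighting filter W(z) = A(z)/A(z/γ). Replacing z by z/γ multiplies the k-th coefficient of A(z) by γ^k, which pulls the roots towards the origin and widens the formant peaks. The list `A` holds the coefficients of A(z) as `[1, c1, ..., cP]`; with the convention A(z) = 1 − Σ a_k z^{-k} this means c_k = −a_k. Function names and the example coefficients are illustrative.

```python
def bandwidth_expand(A, gamma):
    """Coefficients of A(z/gamma): the k-th coefficient is scaled by gamma**k."""
    return [c * gamma ** k for k, c in enumerate(A)]


def weighting_filter(x, A, gamma=0.9):
    """Filter x through W(z) = A(z) / A(z/gamma)."""
    den = bandwidth_expand(A, gamma)
    y = []
    for n in range(len(x)):
        acc = sum(A[k] * x[n - k] for k in range(len(A)) if n - k >= 0)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    return y


A = [1.0, -1.3, 0.6]                 # hypothetical second-order A(z)
expanded = bandwidth_expand(A, 0.9)  # coefficients of A(z/0.9)
impulse = [1.0] + [0.0] * 9
h = weighting_filter(impulse, A, gamma=0.9)      # impulse response of W(z)
identity = weighting_filter(impulse, A, gamma=1.0)
```

With γ = 1 the numerator and denominator cancel and no weighting is applied; smaller values of γ give stronger de-emphasis of the error in the formant regions.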


GSM 06.XX: RPE-LTP
GSM 1982: "Groupe Spécial Mobile"; now: "Global System for Mobile communications"
RPE-LTP: Regular Pulse Excitation – Long Term Prediction

GSM 06.XX: RPE-LTP

SID – Silence Descriptor frame
BFI – Bad Frame Indicator


GSM 06.XX: RPE-LTP
Frame lost:
1) Speech frames
   a) First lost frame -> repetition of the last good frame
   b) Following lost frames -> decrease the output level until silence is reached in 320 ms
2) SID frames
   a) First lost frame -> repetition of the last good frame
   b) Following lost frames -> decrease the output level until silence is reached in 320 ms
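
The frame-loss behaviour described above can be sketched as a gain schedule driven by the Bad Frame Indicator. This is an illustration under stated assumptions, not the procedure of the standard: it works on decoded sample frames rather than on coder parameters, and it assumes a linear decrease of the level over 320 ms (16 frames of 20 ms). All names are hypothetical.

```python
FRAME_MS = 20
MUTE_MS = 320
MUTE_FRAMES = MUTE_MS // FRAME_MS    # 16 frames to reach silence


def concealment_gain(lost_count):
    """Gain for the lost_count-th consecutive lost frame (1 = first loss)."""
    if lost_count <= 1:
        return 1.0                   # first loss: plain repetition
    # Assumed linear ramp down to zero over the following 16 frames.
    return max(0.0, 1.0 - (lost_count - 1) / MUTE_FRAMES)


def conceal(frames, bfi_flags):
    """Substitute frames flagged as bad with an attenuated last good frame."""
    out = []
    last_good = None
    lost = 0
    for frame, bad in zip(frames, bfi_flags):
        if not bad:
            last_good = frame
            lost = 0
            out.append(list(frame))
        else:
            lost += 1
            base = last_good if last_good is not None else [0.0] * len(frame)
            gain = concealment_gain(lost)
            out.append([gain * v for v in base])
    return out


# One good frame followed by 19 bad frames.
frames = [[1.0] * 160 for _ in range(20)]
flags = [False] + [True] * 19
concealed = conceal(frames, flags)
```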



GSM 06.XX: RPE-LTP – The coder


GSM 06.XX: RPE-LTP – The decoder


GSM 06.10

For each 160 samples (20 ms):
- LAR1, LAR2 -> 6 bits each
- LAR3, LAR4 -> 5 bits each
- LAR5, LAR6 -> 4 bits each
- LAR7, LAR8 -> 3 bits each
- Total LARs -> 36 bits

For each 40 samples (5 ms):
- Long-term predictor delay -> 7 bits
- Long-term predictor gain -> 2 bits
- Grid position (k) -> 2 bits
- Block amplitude -> 6 bits
- Pulse amplitudes (13) -> 3 bits each
- Total subframe excitation -> 56 bits

36 + 56·4 = 260 bits / 20 ms

Bit rate = 13 kbps
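
The bit budget above can be verified directly from the per-field allocations. The dictionary keys are illustrative names for the fields listed on the slide.

```python
# Bits for the eight log-area ratios, sent once per 20 ms frame.
lar_bits = [6, 6, 5, 5, 4, 4, 3, 3]

# Bits sent once per 5 ms subframe (four subframes per frame).
subframe_bits = {
    "ltp_delay": 7,
    "ltp_gain": 2,
    "grid_position": 2,
    "block_amplitude": 6,
    "pulse_amplitudes": 13 * 3,      # 13 pulses, 3 bits each
}

bits_per_subframe = sum(subframe_bits.values())
bits_per_frame = sum(lar_bits) + 4 * bits_per_subframe
frame_ms = 20
bit_rate_kbps = bits_per_frame / frame_ms    # bits per ms equals kb/s

print(bits_per_subframe, bits_per_frame, bit_rate_kbps)
```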

GSM 06.XX
Comfort Noise

SID – Background Acoustic Noise Evaluation

SID codeword with 95 bits equal to zero

Over 4 consecutive frames with VAD=0, the mean of the LAR parameters and of Xmax is computed

The regular pulses are replaced by a sequence of random integers uniformly distributed between 1 and 6.
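
A minimal sketch of the comfort-noise parameters described above: the LAR and Xmax values are averaged over four frames with VAD=0, and the pulse values are drawn uniformly from 1 to 6. The frame layout (a dict with `lar` and `xmax` lists), the use of a single Xmax average over all subframes, and the absence of any quantization are assumptions made for the example.

```python
import random


def sid_parameters(frames):
    """Average LAR (per coefficient) and Xmax over the given frames."""
    n_lar = len(frames[0]["lar"])
    mean_lar = [sum(f["lar"][i] for f in frames) / len(frames)
                for i in range(n_lar)]
    all_xmax = [x for f in frames for x in f["xmax"]]
    mean_xmax = sum(all_xmax) / len(all_xmax)
    return mean_lar, mean_xmax


def comfort_noise_pulses(rng, n_pulses=13):
    """Random integers uniformly distributed between 1 and 6 (inclusive)."""
    return [rng.randint(1, 6) for _ in range(n_pulses)]


# Four hypothetical frames classified as noise (VAD=0).
noise_frames = [
    {"lar": [float(k + i) for i in range(8)], "xmax": [10.0 + k] * 4}
    for k in range(4)
]
mean_lar, mean_xmax = sid_parameters(noise_frames)
pulses = comfort_noise_pulses(random.Random(0))
```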



GSM 06.10

Examples:
- Original signal
- Reconstructed signal GSM
- Difference between original and reconstructed (transcoding noise)
- White noise with the same power
- Original signal + white noise (without the perceptual weighting)


GSM 06.10

[Figure: original and reconstructed waveforms]

GSM tries to keep the same waveform
