speech coding - physionetphysionet.cps.unizar.es/.../docencia/tvoz/ingles/coding.pdf ·...


Upload: dinhkhanh

Post on 13-Jun-2018


Page 1: Speech Coding - PhysioNetphysionet.cps.unizar.es/.../docencia/tvoz/ingles/coding.pdf · 2007-01-17 · Speech Technologies – Speech Coding GSM 06.10 For each 160 samples (20 ms.)

Speech Technologies – Speech Coding

Speech Coding
1. Introduction
2. LPC Vocoder
3. Analysis-by-Synthesis Coding


Speech Coding Classification
1. Waveform Coding
   To reconstruct a signal waveform similar to the original one.
   PCM G.711 64 kb/s, ADPCM G.721 32 kb/s, SBC G.722
2. Source Coding
   To reconstruct a signal based on the speech production model.
   LPC Vocoder FS1015 2.4 kb/s, MELP 2.4 kb/s
3. Hybrid Coders - Analysis-by-Synthesis
   Waveform coding based on the production model.
   ETSI GSM, CELP G.729


Coders Comparison
1. Bit rate (kb/s)
2. Voice quality: MOS (Mean Opinion Score)
3. Complexity
4. Delay
5. Sensitivity to channel errors
6. Bandwidth

Coder      Bit rate (kb/s)    MOS              Sampling rate (kHz)
CD Audio   1411               5.0              44.1
PCM        64                 4.3              8
ADPCM      40, 32, 24, 16     4.2 (32 kb/s)    8
SBC        64, 56, 48         >4.5             16
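The bit rates of the two uncompressed entries follow directly from sampling rate times bits per sample times channels. The sketch below is only an illustration: it reads the last column as a sampling rate in kHz and assumes 8 bits per sample for G.711 PCM and 16-bit stereo for CD audio, which the slide does not state; `pcm_bit_rate_kbps` is a hypothetical helper name.

```python
def pcm_bit_rate_kbps(fs_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM in kb/s."""
    return fs_hz * bits_per_sample * channels / 1000


# Assumed formats (not given on the slide): G.711 uses 8-bit samples,
# CD audio uses 16-bit samples on two channels.
g711_kbps = pcm_bit_rate_kbps(8000, 8)
cd_kbps = pcm_bit_rate_kbps(44100, 16, channels=2)
```

Under these assumptions the results are 64 kb/s and 1411.2 kb/s, matching the table rows for PCM and CD Audio.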


Comparison


Speech Coding: LPC Vocoder


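LPC Analysis
[Block diagram: the predictor P(z) forms the prediction ŝ(n) from s(n); the residual is e(n) = s(n) − ŝ(n).]

LPC Synthesis:
[Block diagram: e(n) drives the filter H(z) = 1/A(z).]

$$P(z) = \sum_{i=1}^{p} a_i \cdot z^{-i}$$

H(z): Vocal tract transfer function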
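As a rough illustration of the analysis and synthesis filters above, the following sketch estimates the predictor coefficients with the autocorrelation method, computes the residual e(n) = s(n) − Σ a_i s(n−i), and rebuilds the signal through H(z) = 1/A(z) with A(z) = 1 − P(z). The estimation method, the zero initial history and all function names are assumptions made for the example; the slides do not specify them.

```python
import numpy as np


def lpc_coefficients(s, p):
    """Autocorrelation-method estimate of a_1..a_p (assumed method)."""
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])


def lpc_residual(s, a):
    """Analysis: e(n) = s(n) - sum_i a_i s(n - i), zero initial history."""
    p = len(a)
    padded = np.concatenate([np.zeros(p), s])
    e = np.empty(len(s))
    for n in range(len(s)):
        past = padded[n:n + p][::-1]      # s(n-1), ..., s(n-p)
        e[n] = s[n] - np.dot(a, past)
    return e


def lpc_synthesis(e, a):
    """Synthesis through H(z) = 1/A(z): s(n) = e(n) + sum_i a_i s(n - i)."""
    p = len(a)
    out = np.zeros(len(e) + p)
    for n in range(len(e)):
        past = out[n:n + p][::-1]         # previously synthesized samples
        out[n + p] = e[n] + np.dot(a, past)
    return out[p:]
```

Feeding the exact residual back through the synthesis filter reproduces the input frame; a vocoder replaces that residual with a much simpler excitation.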


LPC Vocoder

Simplification of the excitation in the synthesis:
   Train of periodic impulses for voiced segments
   White Gaussian noise in the unvoiced segments
   Maintenance of the power in the new synthetic excitation
Examples:
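The simplified excitation can be sketched as follows: an impulse train for voiced frames, white Gaussian noise for unvoiced frames, and a final scaling so that the mean power matches a target value. The function name, its parameters and the way the power is matched (a single gain per frame) are illustrative assumptions.

```python
import numpy as np


def synthetic_excitation(n_samples, voiced, pitch_period, target_power, rng=None):
    """Impulse train (voiced) or white Gaussian noise (unvoiced), power-matched."""
    if voiced:
        x = np.zeros(n_samples)
        x[::pitch_period] = 1.0           # one impulse every pitch period
    else:
        if rng is None:
            rng = np.random.default_rng()
        x = rng.standard_normal(n_samples)
    power = np.mean(x ** 2)
    return x * np.sqrt(target_power / power)   # keep the power of the excitation
```

For example, `synthetic_excitation(180, True, 60, 2.0)` gives a 180-sample frame with three impulses and a mean power of 2.0.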


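LPC Vocoder
[Block diagram. Analysis: LPC analysis of s(n) gives the predictor P(z), sent as reflection coefficients; the prediction is subtracted from s(n) to give the residual r(n), which feeds the pitch and voiced/unvoiced (U/V) analysis to obtain F0, the U/V decision and the gain G. Synthesis: an impulse train of period 1/F0 (V) or noise (U) is multiplied by G and passed through H(z), built from P(z), to give ŝ(n).]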


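LPC10E / FS1015 Vocoder

54 bits/frame:
   Pitch + U/V -> 7 bits
   G -> 5 bits
   K1 to K4 -> 5 bits each
   K5 to K8 -> 4 bits each
   K9 -> 3 bits
   K10 -> 2 bits
   (these fields total 53 bits; the remaining bit of the FS1015 frame is a synchronization bit)

Fs = 8000 samples/s
54 bits/frame
180 samples/frame (22.5 ms/frame)

54 · 8000 / 180 = 2400 bits/s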
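The frame budget and the resulting bit rate can be checked with a few lines of arithmetic. The dictionary keys are hypothetical labels, the per-coefficient reading (5 bits for each of K1 to K4, 4 bits for each of K5 to K8) is an interpretation of the slide, and the one bit left over is taken to be a synchronization bit.

```python
# Bits per field, reading the K entries as "bits per coefficient" (assumption).
FIELD_BITS = {
    "pitch_uv": 7,
    "gain": 5,
    "k1_to_k4": 4 * 5,
    "k5_to_k8": 4 * 4,
    "k9": 3,
    "k10": 2,
}
FRAME_BITS = 54
FS_HZ = 8000
SAMPLES_PER_FRAME = 180

listed_bits = sum(FIELD_BITS.values())          # fields listed on the slide
spare_bits = FRAME_BITS - listed_bits           # assumed to be the sync bit
frame_ms = 1000 * SAMPLES_PER_FRAME / FS_HZ     # frame duration in ms
bit_rate = FRAME_BITS * FS_HZ / SAMPLES_PER_FRAME
```

With these numbers the listed fields account for 53 bits, the frame lasts 22.5 ms and the rate is 2400 bits/s.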


LPC10E Vocoder

Examples:
   Original signal
   Reconstructed signal LPC10E
   Reconstructed signal LPC10E (satellite radio transmission)

Features:
   Nasality: nasal sounds need a pole-zero model, which the all-pole LPC filter does not provide
   Simple voiced excitation (train of impulses): buzzing
   Frame size: problems with fast transitions (p, t, k…)


MELP: Mixed-Excitation Linear Predictive Vocoder

2400 bps Federal Standard speech coder

The excitation signal is generated as a mixture of noise and impulse trains in different frequency bands.
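The idea of mixing pulses and noise per frequency band can be sketched in the frequency domain: each band takes its spectrum either from the impulse train or from the noise, depending on a per-band voicing decision. This is only an illustration of the principle. The uniform band edges, the FFT-based band selection and the name `mixed_excitation` are assumptions, not the filters of the actual standard.

```python
import numpy as np


def mixed_excitation(n_samples, pitch_period, band_voiced, rng=None):
    """Mix an impulse train and noise band by band (uniform bands, assumed)."""
    if rng is None:
        rng = np.random.default_rng()
    pulses = np.zeros(n_samples)
    pulses[::pitch_period] = 1.0
    noise = rng.standard_normal(n_samples)

    pulse_spec = np.fft.rfft(pulses)
    noise_spec = np.fft.rfft(noise)
    n_bins = len(pulse_spec)
    edges = np.linspace(0, n_bins, len(band_voiced) + 1).astype(int)

    mixed = np.empty(n_bins, dtype=complex)
    for b, voiced in enumerate(band_voiced):
        lo, hi = edges[b], edges[b + 1]
        # Voiced bands take the pulse spectrum, unvoiced bands the noise spectrum.
        mixed[lo:hi] = pulse_spec[lo:hi] if voiced else noise_spec[lo:hi]
    return np.fft.irfft(mixed, n=n_samples)
```

A call such as `mixed_excitation(180, 60, [True, True, False, False, False])` keeps the periodic structure in the two lowest bands and fills the rest with noise.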


MELP: Mixed-Excitation Linear Predictive Vocoder


MELP: Mixed-Excitation Linear Predictive Vocoder

Examples:
   Original signal "clean"
   LPC-10
   Reconstructed signal MELP "clean"
   Original signal "noisy"
   Reconstructed signal "noisy"

Data rate: 2400 bps (54 bits × 44.44 frames/second)
Sampling rate: 8 kHz
Bit stream format: for each 22.5 ms frame of input speech, the following 54 bits are placed into the bit stream (in this order):

Description                                  Number of bits
Pitch index                                  7
Jitter flag                                  1
Bandpass voicing decision                    4x1
Gain for second half of frame                5
Gain for first half of frame                 3
LSP frequencies (10 line spectrum pairs)     25
Fourier magnitudes (10 harmonics)            8
Sync bit                                     1
Total                                        54


Hybrid Coders
Predictive Coders based on Analysis-by-Synthesis


Hybrid Coders
Depending on the excitation, they are classified into three basic types:
1. Multi-Pulse Excitation (MPE)
2. Regular Pulse Excitation (RPE)
3. Code Excited Linear Prediction (CELP)


Hybrid Coders
CELP: Code Excited Linear Prediction


Short-Time Analysis

Typical values:
   Analysis frame: 25 ms (200 samples)
   Speech frame: 20 ms (160 samples)
   Subframe: 5 ms (40 samples)
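The sample counts above correspond to a sampling rate of 8 kHz. A minimal sketch of the framing, assuming that rate and non-overlapping speech frames, is shown below; `ms_to_samples` and `split_frames` are hypothetical helper names.

```python
import numpy as np

FS_HZ = 8000  # sampling rate implied by 200 samples in 25 ms


def ms_to_samples(ms, fs_hz=FS_HZ):
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(ms * fs_hz / 1000))


def split_frames(s, frame_ms=20, subframe_ms=5):
    """Cut s into whole speech frames and subframes; trailing samples are dropped."""
    frame_len = ms_to_samples(frame_ms)
    sub_len = ms_to_samples(subframe_ms)
    n_frames = len(s) // frame_len
    frames = np.asarray(s[:n_frames * frame_len]).reshape(n_frames, frame_len)
    return frames.reshape(n_frames, frame_len // sub_len, sub_len)
```

For a 100 ms signal (800 samples) this yields an array of shape (5, 4, 40): five speech frames of four 40-sample subframes each.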


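Synthesis Filter
Based on short-term and long-term linear prediction

[Block diagram. Analysis: the short-term analysis subtracts the P(z) prediction from s(n) to give r(n); the long-term analysis subtracts the PL(z) prediction from r(n) to give rL(n). Synthesis: rL(n) plus the PL(z) prediction gives r̂(n); r̂(n) plus the P(z) prediction gives ŝ(n).]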


Synthesis Filter
Long-Term predictor

Estimation of r̂(n):

$$\hat{r}(n) = \beta\, r(n - D)$$

or

$$\hat{r}(n) = \beta_1\, r(n - (D+1)) + \beta_2\, r(n - D) + \beta_3\, r(n - (D-1))$$

Parameter estimation, minimum mean square error:

$$e(n) = r(n) - \beta\, r(n - D)$$

$$E = \sum_{n=0}^{N-1} e^2(n) = \sum_{n=0}^{N-1} \left[ r(n) - \beta\, r(n - D) \right]^2$$

$$\frac{\partial E}{\partial \beta} = 0 \;\Rightarrow\; \beta = \frac{\sum_{n=0}^{N-1} r(n)\, r(n - D)}{\sum_{n=0}^{N-1} r^2(n - D)}$$


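Synthesis Filter
Find the value of D that minimizes the error power E:

$$E = \sum_{n=0}^{N-1} r^2(n) - \frac{\left[ \sum_{n=0}^{N-1} r(n)\, r(n - D) \right]^2}{\sum_{n=0}^{N-1} r^2(n - D)}$$

Since the first term does not depend on D, minimizing E is equivalent to maximizing the second term.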
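A one-tap long-term predictor search along these lines can be sketched as an exhaustive loop over candidate lags: for each D the optimal gain β and the error power E are computed, and the lag with the smallest E is kept. The lag range, the buffer layout (past residual stored before index `start`) and the function name are assumptions for the example.

```python
import numpy as np


def long_term_predictor(r, start, N, d_min, d_max):
    """Return (D, beta, E) minimizing E over the block r[start:start+N].

    Requires start >= d_max so that r(n - D) is available for every lag.
    """
    target = r[start:start + N]
    energy = np.dot(target, target)               # sum of r^2(n)
    best = None
    for D in range(d_min, d_max + 1):
        past = r[start - D:start - D + N]         # r(n - D)
        cross = np.dot(target, past)              # sum of r(n) r(n - D)
        past_energy = np.dot(past, past)          # sum of r^2(n - D)
        if past_energy == 0.0:
            continue
        beta = cross / past_energy
        E = energy - cross ** 2 / past_energy
        if best is None or E < best[2]:
            best = (D, beta, E)
    return best
```

On a residual built so that r(n) = 0.8 r(n − 50), a search over lags 20 to 90 returns D = 50 with β close to 0.8.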


Perceptual Weighting Filter

Function: to shape the frequency characteristics of the error being minimized, giving more weight to the frequency regions where the ear is more sensitive and less weight to the regions where it is less sensitive.

Based on frequency masking:
   In the formants, more error can be tolerated.
   The filter transfer function is inversely proportional to the spectral envelope of the coded speech signal.
   Proposed transfer function: W(z) = A(z) / A(γ⁻¹z)
   The parameter γ ∈ [0, 1] controls the level of weighting.
   It must be updated jointly with the predictor.


Perceptual Weighting Filter

$$W(z) = \frac{A(z)}{A(z/\gamma)} = \frac{1 - \sum_{k=1}^{P} a_k z^{-k}}{1 - \sum_{k=1}^{P} a_k \gamma^{k} z^{-k}} = \frac{1 - \sum_{k=1}^{P} a_k z^{-k}}{\prod_{k=1}^{P} \left( 1 - \gamma\, p_k z^{-1} \right)}$$

where the p_k are the poles of 1/A(z).

Normally 0.8 ≤ γ ≤ 0.9.
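In coefficient form, the weighting filter only requires scaling each predictor coefficient a_k by γ^k in the denominator. The sketch below builds both polynomials, assuming the convention A(z) = 1 − Σ a_k z^{-k}; `weighting_filter` is a hypothetical name and the default γ = 0.9 is simply one value from the usual range.

```python
import numpy as np


def weighting_filter(a, gamma=0.9):
    """Coefficients (in powers of z^-1) of W(z) = A(z) / A(z/gamma)."""
    a = np.asarray(a, dtype=float)
    k = np.arange(1, len(a) + 1)
    num = np.concatenate([[1.0], -a])                 # A(z)
    den = np.concatenate([[1.0], -a * gamma ** k])    # A(z/gamma)
    return num, den
```

The roots of the denominator are the roots of the numerator multiplied by γ, which is the product form of W(z): the poles of the weighting filter sit at γ·p_k, closer to the origin than the formant poles.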


Perceptual Weighting Filter


Perceptual Weighting Filter


GSM 06.XX: RPE-LTP
GSM (1982): "Groupe Spécial Mobile", now "Global System for Mobile communications"
RPE-LTP: Regular Pulse Excitation – Long Term Prediction


GSM 06.XX: RPE-LTP

SID – Silence Descriptor frame
BFI – Bad Frame Indicator


GSM 06.XX: RPE-LTP
Frame loss:
1) Speech frames
   a) First loss -> repetition of the last good frame
   b) Subsequent losses -> decrease the output level until silence is reached in 320 ms
2) SID frames
   a) First loss -> repetition of the last good frame
   b) Subsequent losses -> decrease the output level until silence is reached in 320 ms
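The concealment rule can be pictured as a gain schedule over consecutive lost frames: the first lost frame repeats the last good frame at full level, and the following ones are attenuated until silence is reached 320 ms (16 frames of 20 ms) later. The linear attenuation used below is purely an assumption for illustration; the slides only state the end points, and `concealment_gains` is a hypothetical name.

```python
def concealment_gains(n_lost, frame_ms=20, mute_ms=320):
    """Gain applied to the repeated frame for each consecutive lost frame."""
    steps = mute_ms // frame_ms            # 16 frames of 20 ms in 320 ms
    gains = []
    for i in range(n_lost):
        if i == 0:
            gains.append(1.0)              # first loss: plain repetition
        else:
            # Assumed linear fade; reaches 0.0 at i == steps.
            gains.append(max(0.0, 1.0 - i / steps))
    return gains
```

With 20 consecutive lost frames, the gain starts at 1.0, decreases frame by frame, and stays at 0.0 from the 17th lost frame onward.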


GSM 06.XX: RPE-LTP – The coder


GSM 06.XX: RPE-LTP – The coder


GSM 06.XX: RPE-LTP – The decoder


GSM 06.10

For each 160 samples (20 ms):
   LAR1, LAR2 -> 6 bits each
   LAR3, LAR4 -> 5 bits each
   LAR5, LAR6 -> 4 bits each
   LAR7, LAR8 -> 3 bits each
   Total LARs -> 36 bits

For each 40 samples (5 ms):
   Long-term predictor delay -> 7 bits
   Long-term predictor gain -> 2 bits
   Grid position (k) -> 2 bits
   Block amplitude -> 6 bits
   Pulse amplitudes (13) -> 3 bits each
   Total subframe excitation -> 56 bits

36 + 56·4 = 260 bits / 20 ms

Bit rate = 13 kbps
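The bit budget can be verified with a short calculation. The dictionary keys are hypothetical labels for the fields listed above, and the LAR and pulse entries are read as bits per parameter.

```python
# Bits per LAR coefficient, LAR1..LAR8, sent once per 20 ms frame.
LAR_BITS = [6, 6, 5, 5, 4, 4, 3, 3]

# Bits per 5 ms subframe (four subframes per frame).
SUBFRAME_BITS = {
    "ltp_delay": 7,
    "ltp_gain": 2,
    "grid_position": 2,
    "block_amplitude": 6,
    "pulse_amplitudes": 13 * 3,
}

lar_total = sum(LAR_BITS)
subframe_total = sum(SUBFRAME_BITS.values())
frame_bits = lar_total + 4 * subframe_total
bit_rate_kbps = frame_bits / 20            # bits per millisecond equals kb/s
```

This reproduces the totals on the slide: 36 bits for the LARs, 56 bits per subframe, 260 bits per frame and 13 kb/s.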


GSM 06.XX
Comfort Noise

SID – Background Acoustic Noise Evaluation

SID codeword with 95 bits equal to zero

Over 4 consecutive frames with VAD=0, the mean of the LAR parameters and of Xmax is computed.

The regular pulses are replaced by a sequence of random integers uniformly distributed between 1 and 6.
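A minimal sketch of these two steps, averaging the parameters over four non-speech frames and drawing random pulse values, could look as follows. The array shapes (8 LARs per frame, 4 Xmax values per frame, 13 pulses per subframe) are taken from the GSM 06.10 bit allocation, while the function names and the plain arithmetic mean are assumptions.

```python
import numpy as np


def sid_parameters(lar_frames, xmax_frames):
    """Average LARs (shape 4 x 8) and Xmax (shape 4 x 4) over 4 frames with VAD=0."""
    mean_lar = np.mean(np.asarray(lar_frames, dtype=float), axis=0)
    mean_xmax = float(np.mean(xmax_frames))
    return mean_lar, mean_xmax


def comfort_noise_pulses(n_subframes=4, pulses_per_subframe=13, rng=None):
    """Random integers uniformly distributed in 1..6 in place of the regular pulses."""
    if rng is None:
        rng = np.random.default_rng()
    # rng.integers excludes the upper bound, so 7 gives values 1..6.
    return rng.integers(1, 7, size=(n_subframes, pulses_per_subframe))
```

The decoder side would then run its usual synthesis with the averaged parameters and these pulse values instead of decoded ones.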


GSM 06.10

Examples:
   Original signal
   Reconstructed signal GSM
   Difference between original and reconstructed signals (transcoding noise)
   White noise with the same power
   Original signal + white noise (without the perceptual weighting)


GSM 06.10

Original

Reconstructed

GSM tries to keep the same waveform