
Dept. for Speech, Music and Hearing

Quarterly Progress and Status Report

Evaluation of speech compression systems

Fant, G. and Risberg, A.

journal: STL-QPSR, volume: 4, number: 2, year: 1963, pages: 015-021

http://www.speech.kth.se/qpsr

http://www.speech.kth.se


Vocalics (nasals included)

l

y as in yet

w

Consonant articulation tests were based on randomized lists of these syllables, prepared from a recording of a male speaker and a female speaker, each reading the basic list once. In the final version presented to the listeners each word occurred 5 times. Subjects were exposed to a few lists before entering the test in order to practice the transcription.

The subjective evaluation of quality was performed on each of three sentences comprising a telephone conversation:

Male voice: "Hello, is Docent Fant there?"

Female voice: "No, he is not. He is out right now. Can I take a message?"

Male voice: "Yes please, would you ask him to call 23 65 PO."

Systems were compared pairwise in all possible combinations. Listeners were instructed to judge which of the two within a pair sounded more natural.
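The pairwise design can be sketched in a few lines of Python. With the nine systems A-I, "all possible combinations" gives 9 × 8 / 2 = 36 pairs, and a system's score is the per cent of its comparisons in which it was judged more natural. This is a minimal sketch, assuming each judgment is stored as a (winner, loser) tuple and that ties are not allowed; the function names and data layout are hypothetical, not the authors' procedure.

```python
from collections import Counter
from itertools import combinations

SYSTEMS = "ABCDEFGHI"  # the nine systems, A-I


def all_pairs(systems):
    """Every unordered pair of systems, each presented once."""
    return list(combinations(systems, 2))


def percent_preferred(judgments, systems):
    """Per cent of its comparisons in which each system was judged more natural.

    `judgments` is a list of (winner, loser) tuples, one per presented pair
    (a hypothetical layout). Ties are assumed not to occur.
    """
    wins = Counter(winner for winner, _ in judgments)
    appearances = Counter()
    for winner, loser in judgments:
        appearances[winner] += 1
        appearances[loser] += 1
    return {s: 100.0 * wins[s] / appearances[s] for s in systems if appearances[s]}


if __name__ == "__main__":
    pairs = all_pairs(SYSTEMS)
    print(len(pairs))  # 36
    # Toy judgments (hypothetical): the alphabetically earlier system always wins.
    scores = percent_preferred(pairs, SYSTEMS)
    print(scores["A"], scores["I"])  # 100.0 0.0
    print(sum(scores.values()) / len(scores))  # 50.0
```

Because every comparison yields exactly one preferred system and each system appears equally often, the nine scores always average 50 per cent under these assumptions.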

Ten American students at the University of Stockholm were employed as listeners in all tests. The consonant articulation tests were carried out with headphones (TDH-39, frequency response up to 5 kc/s), whereas the quality test was made under three different acoustic conditions: headphone listening, loudspeaker listening in a medium-size laboratory room, and loudspeaker listening in an anechoic chamber. These specific conditions proved to have a marked influence on the acceptability of some of the systems.

The various compression systems will be referred to by the following alphabetic code.

System A. 20-channel vocoder. Frequency range 200-7000 c/s. Analog transmission over 20 c/s low-pass channels.

System B. 16-channel vocoder, quantized to 2300 bits/sec. Frequency range covered is 200-3820 c/s.

System C. 15-channel vocoder, quantized at 2400 bits/sec, covering the frequency range 200-6000 c/s.

System D. 16-channel vocoder, quantized at 2400 bits/sec.

System E. 18-channel vocoder, quantized at 1400 bits/sec.

System F. 18-channel vocoder, quantized at 900 bits/sec by means of spectrum matching techniques.

System G. Mixed channel and formant vocoder utilizing channel coding of the F1 region and otherwise formant coding with parallel formant generators. Analog transmission.

System H. Mixed channel and formant vocoder utilizing channel coding of unvoiced sounds and formant coding of voiced sounds and of unvoiced sounds in the frequency range above 2300 c/s. Formant generators in parallel. Analog transmission.

System I. Speech synthesizer with formant circuits in series and separate synthesis branches for vocalics, nasals, and fricatives.
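The bit rates and channel counts above can be turned into a rough per-frame budget. This is a minimal sketch: the 50 frames/s analysis rate and the equal-width division of each band into channels are hypothetical assumptions (the text states neither), so the figures it prints are illustrative only.

```python
# Channel counts, bit rates and bands are taken from the system list;
# the frame rate and the equal-width channel split are hypothetical assumptions.
ASSUMED_FRAME_RATE = 50  # frames per second (not stated in the text)

# name: (number of channels, bits per second or None if analog, band in c/s or None)
SYSTEMS = {
    "A": (20, None, (200, 7000)),
    "B": (16, 2300, (200, 3820)),
    "C": (15, 2400, (200, 6000)),
    "D": (16, 2400, None),
    "E": (18, 1400, None),
    "F": (18, 900, None),
}


def bits_per_frame(bits_per_sec, frame_rate=ASSUMED_FRAME_RATE):
    """Total bits available per analysis frame at the given frame rate."""
    return bits_per_sec / frame_rate


def uniform_channel_width(f_lo, f_hi, channels):
    """Bandwidth per channel if the band were divided into equal-width channels."""
    return (f_hi - f_lo) / channels


if __name__ == "__main__":
    for name, (channels, rate, band) in SYSTEMS.items():
        parts = [f"System {name}: {channels} channels"]
        if rate is not None:
            parts.append(f"{bits_per_frame(rate):.0f} bits/frame at {ASSUMED_FRAME_RATE} frames/s")
        if band is not None:
            width = uniform_channel_width(*band, channels)
            parts.append(f"about {width:.0f} c/s per channel if equal width")
        print(", ".join(parts))
```

At the assumed 50 frames/s, system F's 900 bits/sec amounts to 18 bits per frame for 18 channels, i.e. about one bit per channel before any excitation (pitch, voicing) information is counted.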

Results

The consonant articulation scores are summarized in the following tabulation.

Table III-1. Per cent correct consonant identification

System                Male speaker   Female speaker
Unprocessed speech    95.5           99.6

[Figure: upper panel, consonant test, headphone listening (per cent correct response); lower panel, pair comparison test, headphone listening (per cent preferred, mean of three sentences). Male and female voice for systems A-I, grouped as channel vocoders, pattern matching vocoder, formant vocoders, and synthesis system.]

Fig. III-1. Consonant articulation score (above) and paired comparisons of quality (below) under headphone listening conditions. Male speaker indicated with filled contours.

The scores for the unprocessed speech are quite high. The departure from ideal speaker performance for the male voice lies in the consonant f, which was received as θ in 20 per cent of the presentations, and in a consonant which was received as z in 50 per cent of the presentations.

The channel vocoders A, B, and C are approximately equal with regard to consonant identification. The observed scores are lower than what would have been obtained as word scores from lists of phonetically balanced words. In this sense the rhyme test is more difficult, and it has the benefit of a large span between good and poor systems. However, the spread among listeners is also large: within the 10 subjects the scores varied from, for instance, 50 to 80 per cent correct identification.
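One way to judge whether a 50 to 80 per cent spread reflects genuine listener differences is to compare it with the spread that item sampling alone would produce. This is a minimal sketch, assuming independent items and a common true accuracy for all listeners; the ten scores and the 100-item list length are hypothetical placeholders, not the test data.

```python
import math
from statistics import mean, pstdev


def listener_summary(percent_scores):
    """Mean, range and population standard deviation of per-listener per cent correct."""
    return {
        "mean": mean(percent_scores),
        "min": min(percent_scores),
        "max": max(percent_scores),
        "sd": pstdev(percent_scores),
    }


def binomial_sd_percent(p, n_items):
    """Spread (s.d., in per cent) expected from item sampling alone if every
    listener had the same true accuracy p on n_items independent items."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


if __name__ == "__main__":
    # Hypothetical scores for ten listeners, spanning the 50-80 % range quoted above.
    scores = [50, 55, 60, 62, 65, 68, 70, 72, 75, 80]
    print(listener_summary(scores))
    # Hypothetical list length of 100 items; the actual count is not given here.
    print(round(binomial_sd_percent(0.65, 100), 1))  # 4.8
```

If the observed standard deviation across listeners clearly exceeds the binomial figure, the spread is more likely due to differences between listeners than to chance in the item sample.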

Results from the paired-comparison quality tests are summarized below in terms of the per cent of the tests in which a system was judged to be superior to the other system of the pair.

Table III-2. Subjective quality test (per cent preferred)

H = headphone listening
A = loudspeaker in anechoic chamber
R = loudspeaker in reverberant room
? = illegible entry

          Female sentence       Male speaker, sentence I    Male speaker, sentence II
System    H      A      R       H      A      R             H      A      R
A         87.8   90.5   82.9    66.2   67.1   50.9          80.6   84.6   68.9
B         70.6   70.3   79.2    65.2   59.0   68.9          75.2   73.0   71.1
C         49.5   57.2   71.6    32.0   33.3   64.0          43.7   47.3   64.8
D         66.5   62.6   62.6    61.2   66.2   52.7          ?      70.2   53.6
E         16.7   15.7   17.5    26.6   ?      22.9          36.6   20.7   26.6
F         12.6   4.9    3.7     ?      18.0   33.6          20.7   24.3   25.6
G         45.3   50.9   41.5    71.6   73.8   70.2          55.9   61.7   ?
H         33.3   33.3   29.2    37.8   43.2   22.1          52.7   52.7   57.6
I         66.6   65.2   62.6    70.6   63.5   64.9          11.7   15.3   5.2
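Because each comparison yields exactly one preferred system and every system appears equally often, the nine entries of any column of Table III-2 should average close to 50 per cent. The sketch below applies that check (an added consistency test, not one reported by the authors) to the columns whose entries are all legible, and computes how far system C moves between headphone (H) and reverberant-room (R) listening. The values are read from the table; the function names are hypothetical.

```python
from statistics import mean

# Columns of Table III-2 (per cent preferred, systems A-I in order)
# whose nine entries are all legible.
COLUMNS = {
    "female sentence, H": [87.8, 70.6, 49.5, 66.5, 16.7, 12.6, 45.3, 33.3, 66.6],
    "female sentence, A": [90.5, 70.3, 57.2, 62.6, 15.7, 4.9, 50.9, 33.3, 65.2],
    "female sentence, R": [82.9, 79.2, 71.6, 62.6, 17.5, 3.7, 41.5, 29.2, 62.6],
    "male sentence I, R": [50.9, 68.9, 64.0, 52.7, 22.9, 33.6, 70.2, 22.1, 64.9],
    "male sentence II, A": [84.6, 73.0, 47.3, 70.2, 20.7, 24.3, 61.7, 52.7, 15.3],
}

# System C under headphones (H) and in the reverberant room (R), per sentence.
SYSTEM_C = {"female": (49.5, 71.6), "male I": (32.0, 64.0), "male II": (43.7, 64.8)}


def column_mean(values):
    """Mean per cent preferred; close to 50 in a balanced all-pairs design without ties."""
    return mean(values)


def room_gain(headphone, reverberant):
    """Change in per cent preferred when moving from headphones to the reverberant room."""
    return round(reverberant - headphone, 1)


if __name__ == "__main__":
    for name, values in COLUMNS.items():
        print(f"{name}: mean {column_mean(values):.1f}")
    for sentence, (h, r) in SYSTEM_C.items():
        print(f"system C, {sentence}: {room_gain(h, r):+.1f} points from H to R")
```

All five complete columns average within about 0.2 of 50, which supports the row and column alignment of the table as laid out above.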

Discussions

In terms of male-speech consonant articulation scores, all systems except system H lie between 65 % and 78 %. Systems E and F perform relatively poorly on female speech. As far as system F is concerned, this is explained by the fact that the pattern inventory of this system was designed for male speech.

An overall view of the performance of the various systems under headphone listening conditions is provided by Fig. III-1. Here it may be seen that system C and the two highly compressed channel vocoders E and F performed better in terms of consonant articulation than in terms of quality, whereas the poorest system from a consonant articulation point of view, system H, performed as well in quality as system C.

The apparent difference between the two hybrid formant vocoders G and H is probably attributable not so much to the difference in spectrum coding as to the reliability of the formant frequency tracking. The subjective quality of system G and of the synthesis system I is at the same level as that of the group of channel vocoders A, B, C, D.

Fig. III-2 is devoted to a more detailed analysis of the subjective quality testing. The most striking impression was the influence of room acoustics on the quality of system C. At the SCS demonstration, which took place in a reverberant auditorium, system C received spontaneous applause for excellent quality. In our recent listening test in an ordinary laboratory room, system C ranked among the best but not at the top, and when listening over headphones it sounded very rough and noisy and was accordingly rank-ordered among the worst of the systems. The effect of a lack of reverberation on system C was verified by paired-comparison tests carried out by loudspeaker listening in our anechoic chamber. The results obtained under these conditions were quite similar to those obtained in the headphone listening, as seen from Table III-2. The particular characteristics of system C underlying this effect are an over-emphasis of the high-frequency part of the spectrum together with the lack of any smoothing of the steps in channel

[Figures: stimulus-response confusion matrices for the individual systems, male and female voice.]

Fig. III-3. Confusion matrix system A.

Fig. III-4. Confusion matrix system B (male voice: 76.0 per cent correct).

Fig. III-6. Confusion matrix system D (female voice: 58.5 per cent correct).

Fig. III-7. Confusion matrix system E.

Fig. III-8. Confusion matrix system F.

Fig. III-11. Confusion matrix system I.

[Figure: per cent correct response for manner features, male voice and female voice, systems A-I (channel vocoders, pattern matching vocoder, formant vocoders, synthesis system). Panels: affricate; fricative; vocalic (r l m n y w); stop (p t k b d g).]

Fig. III-13. Per cent correct manner features.

[Figure: per cent correct response for place features, male voice and female voice, systems A-I. Panels: interdental (θ ð); labial (p b v f m w); dental (t s d z l n); palatal; retroflex (r).]

Fig. III-14. Per cent correct place features.
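The per cent correct figures in the confusion-matrix and feature plots can be computed from stimulus-response counts as sketched below: overall per cent correct counts only identical responses, while a feature score counts a response as correct if it falls in the same class as the stimulus. This is a minimal sketch, assuming counts keyed by (stimulus, response); the class lists keep only the members legible in Figs. III-13 and III-14, and the toy counts are hypothetical, not data from the paper.

```python
from collections import defaultdict

# Feature classes restricted to members legible in Figs. III-13 and III-14
# (the incomplete lists and single-symbol labels are assumptions).
MANNER = {
    "stop": {"p", "t", "k", "b", "d", "g"},
    "vocalic": {"r", "l", "m", "n", "y", "w"},
}
PLACE = {
    "labial": {"p", "b", "v", "f", "m", "w"},
    "dental": {"t", "s", "d", "z", "l", "n"},
    "interdental": {"θ", "ð"},
    "retroflex": {"r"},
}


def percent_correct(confusions):
    """Overall per cent correct from counts keyed by (stimulus, response)."""
    total = sum(confusions.values())
    correct = sum(n for (stim, resp), n in confusions.items() if stim == resp)
    return 100.0 * correct / total


def feature_percent_correct(confusions, classes):
    """For each class, per cent of responses to its stimuli that stay within the class."""
    label = {c: name for name, members in classes.items() for c in members}
    hits, totals = defaultdict(int), defaultdict(int)
    for (stim, resp), n in confusions.items():
        cls = label.get(stim)
        if cls is None:  # stimulus not covered by these classes
            continue
        totals[cls] += n
        if label.get(resp) == cls:
            hits[cls] += n
    return {cls: 100.0 * hits[cls] / totals[cls] for cls in totals}


if __name__ == "__main__":
    # Toy counts (hypothetical).
    toy = {("p", "p"): 8, ("p", "t"): 1, ("p", "m"): 1, ("m", "m"): 9, ("m", "n"): 1}
    print(percent_correct(toy))                  # 85.0
    print(feature_percent_correct(toy, MANNER))  # {'stop': 90.0, 'vocalic': 100.0}
    print(feature_percent_correct(toy, PLACE))   # {'labial': 90.0}
```

The toy case shows why feature scores exceed overall scores: p heard as t is wrong overall but still correct for manner (both stops).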

[Figure: spectrograms (time in msec) of original speech, vocoder speech, system C vocoder speech, and system C vocoder speech picked up by a microphone in a reverberant room.]

Fig. III-15. Spectrograms illustrating various processings of one and the same sentence. A = original, B = vocoder system B, C = vocoder system C, D = vocoder system C plus loudspeaker-microphone link in a medium-size room.
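The explanation offered for system C, that a reverberant room smooths the abrupt steps between successive channel values, can be illustrated with a toy model. Here the room is approximated by a one-pole low-pass filter (an exponential decay) applied to a stepwise channel envelope, and the largest sample-to-sample change is compared before and after. The decay model, sampling rate, frame length and time constant are all hypothetical assumptions, and the sketch does not model the high-frequency over-emphasis also attributed to system C.

```python
import math


def staircase(frame_values, hold):
    """Hold each frame value for `hold` samples: an unsmoothed, stepwise channel envelope."""
    return [v for v in frame_values for _ in range(hold)]


def exponential_smooth(signal, tau_samples):
    """One-pole low-pass y[n] = a*y[n-1] + (1-a)*x[n] with a = exp(-1/tau_samples).

    Used here as a crude stand-in for the exponential decay a reverberant
    room adds to the intensity in each channel (an assumption, not a room model).
    """
    a = math.exp(-1.0 / tau_samples)
    y = signal[0]
    out = []
    for x in signal:
        y = a * y + (1.0 - a) * x
        out.append(y)
    return out


def max_jump(signal):
    """Largest sample-to-sample change, a simple measure of how abrupt the steps are."""
    return max(abs(nxt - cur) for cur, nxt in zip(signal, signal[1:]))


if __name__ == "__main__":
    # Hypothetical: envelope sampled at 1 kHz, a new frame value every 20 ms,
    # decay time constant 50 ms (roughly a 0.7 s reverberation time for intensity).
    envelope = staircase([0.0, 10.0, 2.0, 8.0], hold=20)
    smoothed = exponential_smooth(envelope, tau_samples=50)
    print(max_jump(envelope))            # 10.0
    print(round(max_jump(smoothed), 2))  # 0.2
```

In this toy model the room turns each abrupt step into a gradual glide, which is one way to picture why system C sounded rough over headphones and in the anechoic chamber but acceptable in a reverberant space.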

research along these lines. Several institutions revised or improved their systems shortly after the SCS meeting. The data reported here on systems B, D, and H pertain to processings from revised systems.

G. Fant and A. Risberg

