Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report
Evaluation of speech compression systems
Fant, G. and Risberg, A.
journal: STL-QPSR, volume: 4, number: 2, year: 1963, pages: 015-021
http://www.speech.kth.se/qpsr
Vocalics (nasals included)
l
y as in yet
w
Consonant articulation tests were based on randomized lists of those [illegible] syllables prepared from a recording of a male speaker and a female speaker, each reading the basic list once. In the final version presented to the listeners each word occurred 5 times. Subjects were exposed to a few lists before entering the test in order to practice the transcription.
The subjective evaluation of quality was performed on each of three sentences comprising a telephone conversation.
Male voice: "Hello, is Docent Fant there?"
Female voice: "No, he is not. He is out right now. Can I take a message?"
Male voice: "Yes please, would you ask him to call 23 65 PO."
Systems were compared pairwise in all possible combinations. Listeners were instructed to judge which of the two within a pair sounded more natural.
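As a small illustration of the size of this design, the sketch below enumerates every unordered pair among the nine systems labeled A to I. The labels come from the report; the presentation order within a pair and the number of repetitions per listener are not specified there and are not modeled here.

```python
from itertools import combinations

# System labels as used in the report (A to I).
SYSTEMS = list("ABCDEFGHI")

def all_pairs(systems):
    """Return every unordered pair of distinct systems."""
    return list(combinations(systems, 2))

pairs = all_pairs(SYSTEMS)
print(len(pairs))  # 36 pairs for 9 systems
```

Each system takes part in 8 of the 36 pairs, which is why a per-system preference percentage can be formed over all its comparisons.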
Ten American students at the University of Stockholm were employed as listeners in all tests. The consonant articulation tests were carried through with headphones (TDH-39), frequency response up to 5 kc/s, whereas the quality test was made under three different acoustic conditions: headphone listening, loudspeaker listening in a medium sized laboratory room, and loudspeaker listening in an anechoic chamber. The specific conditions proved to have a marked influence on the acceptability of some of the systems.
The various compression systems will be referred to by the following alphabetic code.
System A. 20-channel vocoder. Frequency range 200-7000 c/s. Analog transmission over 20 c/s low-pass channels.
System B. 16-channel vocoder, quantized to 2300 bits/sec. Frequency range covered is 200-3820 c/s.
System C. 15-channel vocoder, quantized at 2400 bits/sec, covering the frequency range 200-6000 c/s.
System D. 16-channel vocoder, quantized at 2400 bits/sec.
System E. 18-channel vocoder, quantized at 1400 bits/sec.
System F. 18-channel vocoder, quantized at 900 bits/sec by means of spectrum matching techniques.
System G. Mixed channel and formant vocoder utilizing channel coding of the F1-region and otherwise formant coding with parallel formant generators. Analog transmission.
System H. Mixed channel and formant vocoder utilizing channel coding of unvoiced sounds and formant coding of voiced sounds and of unvoiced sounds in the frequency range above 2300 c/s. Formant generators in parallel. Analog transmission.
System I. Speech synthesizer with formant circuits in series and separate synthesis branches for vocalics, nasals, and fricatives.
Results
The consonant articulation scores are summarized in the following tabulation.
Table III-1. Per cent correct consonant identification
System               Male speaker   Female speaker
Unprocessed speech   95.5           99.6
Systems A to I       [illegible]    [illegible]
Fig. III-1. Consonant articulation score (above) and paired comparisons of quality (below) under headphone listening conditions. Male speaker indicated with filled contours. Upper panel: per cent correct response, consonant test. Lower panel: per cent preferred in the pair comparison test, mean of three sentences. Systems A to I are grouped as channel vocoders, pattern matching, formant vocoder, and synthesis system.
The scores for the unprocessed speech are quite high. The departure from ideal speaker performance for the male voice lies in the consonant f, which in 20 % of the cases was received as θ, and the consonant [illegible], which in 50 % of the cases was received as z.
The channel vocoders A, B, and C are approximately equal with regard to consonant identification. The observed scores are lower than what would have been obtained as word scores from lists of phonetically balanced words. In this sense the rime test is more difficult, and it has the benefit of a large span between good and poor systems. However, the spread among listeners is also large. Within the 10 subjects the scores varied from, for instance, 50 to 80 per cent correct identification.
Results from the paired comparison quality tests are summarized below in terms of the per cent of the tests in which a system was judged to be superior to any other system.
Table III-2. Subjective quality test
R = loudspeaker in reverberant room
A = loudspeaker in anechoic chamber
H = headphone listening
? = value illegible in the source

          Female sentence      Male speaker, sentence I   Male speaker, sentence II
System    H     A     R        H     A     R              H     A     R
A         87.8  90.5  82.9     66.2  67.1  50.9           80.6  84.6  68.9
B         70.6  70.3  79.2     65.2  59.0  68.9           75.2  73.0  71.1
C         49.5  57.2  71.6     32.0  33.3  64.0           43.7  47.3  64.8
D         66.5  62.6  62.6     61.2  66.2  52.7           ?     70.2  53.6
E         16.7  15.7  17.5     26.6  ?     22.9           36.6  20.7  26.6
F         12.6   4.9   3.7     ?     18.0  33.6           20.7  24.3  25.6
G         45.3  50.9  41.5     71.6  73.8  70.2           55.9  61.7  ?
H         33.3  33.3  29.2     37.8  43.2  22.1           52.7  52.7  57.6
I         66.6  65.2  62.6     70.6  63.5  64.9           11.7  15.3   5.2
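The scoring rule used for Table III-2, the per cent of comparisons in which a system was preferred over any other system, can be sketched as follows. The win-count layout (`wins[a][b]` is the number of trials in which `a` was preferred to `b`), the function name, and the toy counts are assumptions made for illustration; the report does not describe how the raw judgments were tallied. One consequence of the rule is that, with n systems each compared equally often with every other, the percentages sum to 50 n, that is 450 for nine systems. The female sentence headphone column, as read from Table III-2, is used as a rough check.

```python
def percent_preferred(wins):
    """Per cent of comparisons won by each system.

    wins[a][b] is the number of trials in which a was preferred to b
    (hypothetical layout; every pair must appear in both directions).
    """
    scores = {}
    for a, row in wins.items():
        won = sum(row.values())
        lost = sum(wins[b][a] for b in row)
        scores[a] = 100.0 * won / (won + lost)
    return scores

# Toy example with three systems and 10 trials per pair (made-up counts).
toy = {
    "X": {"Y": 7, "Z": 9},
    "Y": {"X": 3, "Z": 6},
    "Z": {"X": 1, "Y": 4},
}
scores = percent_preferred(toy)
print(scores)
# With equal trials per pair the scores sum to 50 * n.
print(sum(scores.values()))

# Female sentence, headphone column, as read from Table III-2 (systems A to I).
female_h = [87.8, 70.6, 49.5, 66.5, 16.7, 12.6, 45.3, 33.3, 66.6]
print(round(sum(female_h), 1))  # close to 50 * 9 = 450
```

A column sum far from 450 would point to a misread entry, which makes this a convenient consistency check when transcribing such tables.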
Discussions
In terms of male speech consonant articulation scores all systems except system H lie between 65 % and 78 %. Systems E and F perform relatively poorly on female speech. As far as system F is concerned this is explained by the fact that the pattern inventory of this system was designed for male speech.
An overall view of the performance of the various systems under headphone listening conditions is provided by Fig. III-1. Here it may be seen that system C and the two highly compressed channel vocoders E and F performed better in terms of consonant articulation than in terms of quality, whereas the poorest system from a consonant articulation point of view, system H, performed as well in quality as system C.
The apparent difference between the two hybrid formant vocoders G and H is probably attributable less to the difference in spectrum coding than to the reliability of the formant frequency tracking. The subjective quality of system G and of the synthesis system I is of the same level as for the group of channel vocoders A, B, C, D.
Fig. III-2 is devoted to a more detailed analysis of the subjective quality testing. The most striking impression was the influence of room acoustics on the quality of system C. At the SCS demonstration, which took place in a reverberant auditorium, system C received spontaneous applause for excellent quality. In our recent listening test in an ordinary laboratory room system C ranked among the best but not top, and when listening over headphones it sounded very rough and noisy and was accordingly rank-ordered among the worst of the systems. The effect of a lack of reverberation on system C was verified by paired comparison tests carried out by loudspeaker listening in our anechoic chamber. The results obtained under these conditions were quite similar to those obtained in the headphone listening, as seen from Table III-2. The particular characteristic of system C underlying this effect is an over-emphasis of the high frequency part of the spectrum together with the lack of any smoothing of the steps in channel
Fig. III-3. Confusion matrix, system A. Male voice, per cent correct 68.[illegible].
Fig. III-4. Confusion matrix, system B. Male voice, per cent correct 76.0. Female voice, per cent correct [illegible].
Fig. III-6. Confusion matrix, system D. Male voice, per cent correct [illegible]. Female voice, per cent correct 58.5.
Fig. III-7. Confusion matrix, system E. Male voice, per cent correct [illegible]. Female voice, per cent correct 38.3.
Fig. III-8. Confusion matrix, system F. Male voice, per cent correct 65.2. Female voice, per cent correct 31.7.
Fig. III-11. Confusion matrix, system I. Male voice, per cent correct 76.[illegible].
Fig. III-13. Per cent correct manner features, male and female voice, for systems A to I. Classes shown: affricate, fricative, vocalic (r l m n y w), and stop (p t k b d g).
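A feature score of this kind can be derived from a consonant confusion matrix by counting a response as correct whenever it falls in the same class as the stimulus, even if the consonant itself is wrong. The sketch below assumes that procedure; the report does not spell out how the feature scores were computed. The stop and vocalic classes follow the figure labels, the other classes are left out, and the confusion counts are invented for illustration.

```python
# Manner classes as labeled in the figure (only the two classes whose
# members are legible in the source are included here).
MANNER = {}
for c in "ptkbdg":
    MANNER[c] = "stop"
for c in "rlmnyw":
    MANNER[c] = "vocalic"

def feature_percent_correct(confusions, feature_of):
    """Per cent of responses whose feature class matches the stimulus.

    confusions maps (stimulus, response) -> count.
    Returns {class: per cent correct} over stimuli of that class.
    """
    hits, totals = {}, {}
    for (stim, resp), n in confusions.items():
        cls = feature_of[stim]
        totals[cls] = totals.get(cls, 0) + n
        if feature_of.get(resp) == cls:
            hits[cls] = hits.get(cls, 0) + n
    return {cls: 100.0 * hits.get(cls, 0) / totals[cls] for cls in totals}

# Invented counts: (stimulus, response) -> number of responses.
confusions = {
    ("p", "p"): 6, ("p", "t"): 3, ("p", "m"): 1,
    ("m", "m"): 5, ("m", "n"): 3, ("m", "b"): 2,
}
scores = feature_percent_correct(confusions, MANNER)
print(scores)
```

In this toy case a p heard as t still counts as a correct stop, so the manner score is higher than the consonant identification score for the same data. The same function would serve for place features given a different class mapping.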
Fig. III-14. Per cent correct place features, male and female voice, for systems A to I. Classes shown: labial (p b v f m w), interdental, dental (t s d z l n), palatal, and retroflex (r).
Fig. III-15. Spectrograms illustrating various processings of one and the same sentence (time axis in msec). A = original, B = vocoder system B, C = vocoder system C, D = vocoder system C plus loudspeaker-microphone link in a medium size room.
research along these lines. Several institutions revised or improved their systems shortly after the SCS meeting. The data reported here on systems B, D, and H pertain to processings from revised systems.
G. Fant and A. Risberg