Towards a Universal Quality Scale for Narrowband, Wideband and Fullband Speech Services
Sebastian Möller (1), Jens Berger (2)
(1) Quality and Usability Lab, Telekom Innovation Laboratories, TU Berlin, Germany
(2) SwissQual AG – A Rohde & Schwarz Company, Solothurn, Switzerland
Agenda
Motivation
Influence Factors
  Modelling a telephony situation
  Bandwidth
Establishment of the Universal Scale
  Integration of different types of degradations
  Scale requirements
  Proposed procedure
Conclusions and Next Steps
Motivation Today’s situation.
Problem statement:
Many different subjective experiments use a single scale from 1 to 5.
The interpretation of a score depends strongly on the experimental context:
  Listening-only vs. conversation
  Bandwidth limitation (e.g. only one bandwidth in the test, or several)
  Length of the stimuli (short sentences vs. long passages or emulated calls)
In practice, two questions dominate the discussion:
1) How does the score for a typical sentence relate to the quality of a longer call? (measurement episode, conversational mode)
2) How does a narrowband score relate to a super-wideband score? (bandwidth)
Idea: Bandwidth- and situation-independent “universal” scale
Influence Factors Modeling a telephony situation.
Real human conversation
Free conversation: free conversation between two persons
Controlled conversation: scripted dialog between two persons
Emulated conversation: listening to pre-recorded samples; own speech activity emulated by keyword spotting
3rd party listening test: listening to a pre-recorded conversation; no own activity
Listening-only test: listening to pre-recorded short speech samples; no own activity
Recommendations cited on the slide: ITU P.805, ITU P.800, ITU P.1302, ITU P.800
Influence Factors Bandwidth.
Figure label: “Noisiness” (Wältermann et al., JAES 2010)
The result is a Mean Opinion Score representing overall listening quality (e.g. ITU-T P.800)
This integral score reflects all degradations perceived by the users, including individual preferences and cross-masking effects
Result: one score for each presented speech sample, regardless of length and bandwidth, addressing the listening mode only
Rating scale: excellent (5), good (4), fair (3), poor (2), bad (1)
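As a minimal sketch of how such a score is obtained, the following averages category ratings on the 5-point scale into a Mean Opinion Score. The votes are hypothetical, and the confidence interval uses a plain normal approximation, which is an assumption and not something the slides prescribe.

```python
from math import sqrt
from statistics import stdev

# Category labels of the 5-point scale shown on the slide.
ACR_LABELS = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in ACR_LABELS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    mos = sum(ratings) / len(ratings)
    # Normal approximation (assumption): 1.96 * standard error of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical votes from eight listeners for one speech sample.
ratings = [4, 5, 3, 4, 4, 3, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The single number returned per sample carries no information about stimulus length, bandwidth context or conversational mode, which is the limitation the slide points out.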
Influence Factors Example: Test according to ITU P.800.
E-model approach:
Establishment of the Universal Scale Integration of different types of degradations.
Degradations along the transmission path (figure labels):
  Background noise, acoustic coupling
  Linear distortion, delay
  Codec, packet loss
  Jitter buffer, VAD
  Talker echo, listener echo
  Circuit noise
  IP WAN
Overall quality: R = Ro - Is - Id - Ie,eff
Estimated user judgment: MOS = f(R)
Impairment types: SNR (Ro), simultaneous (Is), delayed (Id), nonlinear/time-variant (Ie,eff)
Input parameters shown in the figure: Ps, Ds, STMR, SLR, RLR, Ta, Ie, qdu, Ppl, Bpl, TELR, T, WEPL, Tr, Nc, Nfor, Pr, Dr, LSTR
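The additive structure can be sketched in a few lines. The slide does not spell out f, so the sketch assumes the narrowband R-to-MOS mapping from ITU-T Rec. G.107; the input values are hypothetical and are not derived from the parameters listed above.

```python
def transmission_rating(ro, i_s, i_d, ie_eff):
    """R = Ro - Is - Id - Ie,eff, as written on the slide."""
    return ro - i_s - i_d - ie_eff


def r_to_mos(r):
    """Narrowband R-to-MOS mapping of ITU-T Rec. G.107 (assumed here for f)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# Hypothetical values: basic SNR term, a small simultaneous and delay
# impairment, and an effective equipment impairment for codec plus packet loss.
r = transmission_rating(ro=95.0, i_s=1.5, i_d=0.5, ie_eff=11.0)
mos = r_to_mos(r)
print(f"R = {r:.1f}, MOS = {mos:.2f}")
```

Because the impairment terms are simply subtracted, degradations of different types can be combined on the R scale before any mapping to MOS is applied.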
E-model approach:
Problem: Quality judgment depends on the test context, i.e. the conditions included in the test corpus → “corpus effect”
Definition of an “absolute quality scale” (R-scale) that should be independent of the judgment context
Relationship between judgment scale and quality scale is then context-dependent
[Figure: MOS (axis from 1 to 4.5) plotted over R (axis from 0 to 150)]
Telephony situation:
Scale should reflect conversational quality, measured e.g. according to ITU-T Rec. P.800 and P.805
Listening-only tests may be used when no “conversational impairments” are present; however, the ends of the scale might be used more frequently than in conversation tests
Conversations may be approximated by presenting selected stretches of speech (4…8 s) in “emulated conversation tests” according to ITU-T Rec. P.1302
Listening-only tests according to ITU-T Rec. P.800 may be used for evaluating the individual stretches of speech
Establishment of the Universal Scale Scale requirements.
Bandwidth:
Scale should correctly rank narrowband, wideband, super-wideband and fullband signals, as well as a “per call quality”
It must be possible to transform individual (P.800) experiments from different bandwidth contexts onto the universal scale
Procedure:
Conduct different tests according to ITU P-series Recommendations in any mode:
  Listening-only tests
  Conversation tests
  Emulated conversation tests
Transform results onto the universal scale using fixed anchor conditions
Use the transmission rating scale rather than the MOS scale as a first guess
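One way to move a test result from the MOS scale to the transmission rating scale is to invert an R-to-MOS curve numerically. The sketch below assumes the narrowband G.107 curve and restricts the search to R between 6.5 and 100, where that curve is monotonically increasing; the slides do not state which mapping would be used, so both choices are assumptions.

```python
def g107_curve(r):
    """Unclamped narrowband R-to-MOS curve of ITU-T Rec. G.107."""
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def mos_to_r(mos, lo=6.5, hi=100.0, tol=1e-6):
    """Invert g107_curve by bisection on [lo, hi], clamping outside the range."""
    if mos <= g107_curve(lo):
        return lo
    if mos >= g107_curve(hi):
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The curve increases on this interval, so a too-small MOS at mid
        # means the solution lies to the right of mid.
        if g107_curve(mid) < mos:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical test result: a condition rated MOS 3.5.
r_estimate = mos_to_r(3.5)
print(f"MOS 3.5 corresponds to R of about {r_estimate:.1f}")
```

The result is only a first guess in the sense of the slide: it does not yet account for the context in which the MOS was collected.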
Establishment of the Universal Scale Proposed procedure.
Transformation relative to anchor conditions
Regardless of the original experimental context, a condition receives the same score on the universal scale
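A minimal sketch of a transformation relative to anchors, assuming a linear law and two anchor conditions with fixed positions on the universal scale. The slides leave both the anchor conditions and the transformation law as next steps, so the function name, the anchor positions and all scores below are hypothetical.

```python
def make_anchor_transform(score_low, score_high, target_low, target_high):
    """Build a linear map sending the two anchor scores of one experiment
    to their fixed positions on the universal scale (assumed linear law)."""
    if score_high == score_low:
        raise ValueError("anchor scores must differ")
    slope = (target_high - target_low) / (score_high - score_low)

    def to_universal(score):
        return target_low + slope * (score - score_low)

    return to_universal


# Hypothetical fixed positions of the low and high anchor condition.
TARGET_LOW, TARGET_HIGH = 20.0, 80.0

# Experiment A (narrowband-only context): anchors rated 1.5 and 4.5.
to_universal_a = make_anchor_transform(1.5, 4.5, TARGET_LOW, TARGET_HIGH)
# Experiment B (mixed-bandwidth context): the same anchors rated 1.0 and 3.0.
to_universal_b = make_anchor_transform(1.0, 3.0, TARGET_LOW, TARGET_HIGH)

# The same condition, rated 3.0 in experiment A and 2.0 in experiment B.
score_a = to_universal_a(3.0)
score_b = to_universal_b(2.0)
print(score_a, score_b)
```

In this constructed example the condition sits halfway between the anchors in both experiments, so both transforms place it at the same point on the universal scale even though its raw scores differ.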
Conclusions:
Requirements for the new scale have been set up regarding
length of the measurement episode
conversational mode
bandwidth
Establishment of the scale requires top-down and bottom-up considerations
Next steps:
Define anchor conditions
Conduct subjective tests
Transform results and adjust
Define transformation laws also for instrumental models
Conclusions and Next Steps Universal quality scale.
Thank you for your attention!
Visit www.qu.tu-berlin.de for more information.