
Towards a Universal Quality Scale for Narrowband, Wideband and Fullband Speech Services

Sebastian Möller (1), Jens Berger (2)

(1) Quality and Usability Lab, Telekom Innovation Laboratories, TU Berlin, Germany

(2) SwissQual AG – A Rohde & Schwarz Company, Solothurn, Switzerland

Agenda

Motivation

Influence Factors: modelling a telephony situation; bandwidth

Establishment of the Universal Scale: integration of different types of degradations; scale requirements; proposed procedure

Conclusions and Next Steps


Motivation Today’s situation.

Problem statement: Many different subjective experiments use a single scale from 1 to 5, but the interpretation of a score depends strongly on the experimental context:

Listening-only vs. conversation

Bandwidth limitation (e.g. only one bandwidth in the test, or different ones)

Length of the stimuli (short sentences vs. long passages or emulated calls)

In practice, two questions dominate the discussion:

1) How does a score for a typical sentence relate to the quality of a longer call? (Factors: measurement episode, conversational mode)

2) How does a narrowband score relate to a super-wideband score? (Factor: bandwidth)

Idea: Bandwidth- and situation-independent “universal” scale


Influence Factors Modeling a telephony situation.

Real human conversation

Free conversation: free conversation between two persons

Controlled conversation: scripted dialog between two persons

Emulated conversation: listening to pre-recorded samples; own speech activity is emulated by keyword spotting

3rd party listening test: listening to a pre-recorded conversation; no own activity

Listening-only test: listening to pre-recorded short speech samples; no own activity

Recommendations named on the slide (in slide order): ITU P.805, ITU P.800, ITU P.1302, ITU P.800


Influence Factors Bandwidth.

[Figure with the label “Noisiness” (Wältermann et al., JAES 2010)]

The result is a Mean Opinion Score (MOS) representing overall listening quality (e.g. ITU-T P.800)

This integral score reflects all degradations perceived by the users, including individual preferences and cross-masking effects

Result: one score for each presented speech sample, regardless of length and bandwidth, addressing only the listening mode

[Figure: five-point rating scale: excellent (5), good (4), fair (3), poor (2), bad (1)]

Influence Factors Example: Test according to ITU P.800.


E-model approach:

Establishment of the Universal Scale Integration of different types of degradations.

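[Figure: transmission path across an IP WAN with the degradation sources covered by the E-model: background noise and acoustic coupling (labelled twice in the diagram), linear distortion and delay, codec and packet loss, jitter buffer and VAD, talker echo and listener echo, circuit noise]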

E-model approach:



Overall quality: R = Ro - Is - Id - Ie,eff

Estimated user judgment: MOS = f(R)

Impairment types: SNR, simultaneous, delayed, nonlinear / time-variant

Input parameters shown in the diagram: Ps, Ds, STMR; SLR, RLR, Ta; Ie, qdu, Ppl, Bpl; TELR, T, WEPL, Tr; Nc, Nfor; Pr, Dr, LSTR
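A minimal sketch of the two relations R = Ro - Is - Id - Ie,eff and MOS = f(R), for illustration only. The function names and the input values are hypothetical. The conversion used for f is the narrowband R-to-MOS mapping from ITU-T Rec. G.107; that the slide means exactly this mapping is an assumption. G.107 additionally adds an advantage factor A, which the formula on the slide omits, so it is left out here as well.

```python
def transmission_rating(ro, i_s, i_d, ie_eff):
    """Combine the impairment terms into R as written on the slide.

    ro: basic signal-to-noise term, i_s: simultaneous impairments,
    i_d: delayed impairments, ie_eff: effective equipment impairment.
    """
    return ro - i_s - i_d - ie_eff


def r_to_mos(r):
    """Narrowband R-to-MOS mapping from ITU-T Rec. G.107 (assumed for f)."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# Hypothetical impairment values, chosen only to exercise the functions.
r = transmission_rating(ro=93.0, i_s=1.5, i_d=10.0, ie_eff=11.0)
print(r, round(r_to_mos(r), 2))
```

Because the impairment terms are simply subtracted on the R scale, each type of degradation can be estimated separately and then combined, which is the property the slides rely on when integrating different types of degradations.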


E-model approach:

Problem: Quality judgment depends on the test context, i.e. the conditions included in the test corpus → “corpus effect”

Definition of an “absolute quality scale” (R-scale) which should be independent of the judgment context

Relationship between judgment scale and quality scale is then context-dependent


[Figure: MOS plotted against the transmission rating R; R axis from 0 to 150, MOS axis from 1 to 4.5]


Telephony situation:

Scale should reflect conversational quality, measured e.g. according to ITU-T Rec. P.800 and P.805

Listening-only tests may be used when no “conversational impairments” are present; however, the scale end points might be used more frequently than in conversation tests

Conversations may be approximated by presenting selected stretches of speech (4…8 s) in “emulated conversation tests” according to ITU-T Rec. P.1302

Listening-only tests according to ITU-T Rec. P.800 may be used for evaluating the single stretches of speech

Establishment of the Universal Scale Scale requirements.


Bandwidth:

The scale should correctly rank narrowband, wideband, super-wideband and fullband signals, as well as a “per call quality”

It must be possible to transform individual (P.800) experiments from different bandwidth contexts to the universal scale

Establishment of the Universal Scale Scale requirements.


Bandwidth:

Conduct different tests according to ITU P-series Recommendations in any mode

Listening-only tests

Conversation tests

Emulated conversation tests

Transform results onto the universal scale using fixed anchor conditions

Use the transmission rating scale rather than the MOS scale as a first guess
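One possible reading of the last step, sketched under assumptions: test results given as MOS are first moved onto the transmission rating scale by numerically inverting the narrowband R-to-MOS mapping of ITU-T Rec. G.107. The slides do not say which conversion is meant, and the function names below are hypothetical. The inversion is restricted to R between 6.5 and 100, where the mapping is monotonically increasing.

```python
def r_to_mos(r):
    """Narrowband R-to-MOS mapping from ITU-T Rec. G.107."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


def mos_to_r(mos, tol=1e-6):
    """Invert r_to_mos by bisection on the interval [6.5, 100]."""
    if not 1.0 <= mos <= 4.5:
        raise ValueError("MOS must lie between 1.0 and 4.5")
    lo, hi = 6.5, 100.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if r_to_mos(mid) < mos:
            lo = mid  # the sought R lies above mid
        else:
            hi = mid  # the sought R lies at or below mid
    return (lo + hi) / 2.0


# Hypothetical test results (MOS per condition), converted to the R scale.
test_results = {"condition_a": 2.4, "condition_b": 3.1, "condition_c": 4.2}
on_r_scale = {name: mos_to_r(mos) for name, mos in test_results.items()}
print(on_r_scale)
```

Bisection is used instead of a closed-form inverse to keep the sketch short and easy to check; any monotone root finder would serve the same purpose.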

Establishment of the Universal Scale Proposed procedure.

Transformation relative to anchor conditions

Independent of original experimental context, the score on the universal scale is the same
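The idea of a transformation relative to anchor conditions can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the authors: the slides leave the anchor conditions and the transformation law to future work. Assumed here are a linear transformation per experiment, fitted by least squares to the anchor conditions, and purely hypothetical anchor scores and universal-scale values.

```python
def fit_anchor_mapping(anchor_test_scores, anchor_universal_scores):
    """Least-squares line mapping one experiment's scores onto the universal scale.

    Both lists hold the same anchor conditions in the same order.
    Returns (slope, intercept).
    """
    n = len(anchor_test_scores)
    if n < 2 or n != len(anchor_universal_scores):
        raise ValueError("need at least two anchors, given on both scales")
    mean_x = sum(anchor_test_scores) / n
    mean_y = sum(anchor_universal_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in anchor_test_scores)
    if sxx == 0.0:
        raise ValueError("anchor scores must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(anchor_test_scores, anchor_universal_scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def to_universal(score, mapping):
    """Apply a fitted (slope, intercept) mapping to one test score."""
    slope, intercept = mapping
    return slope * score + intercept


# Hypothetical fixed positions of two anchor conditions on the universal scale.
universal_anchors = [30.0, 70.0]

# The same anchors as scored in two hypothetical experiments.
narrowband_test = fit_anchor_mapping([2.0, 4.0], universal_anchors)
mixed_band_test = fit_anchor_mapping([1.5, 3.0], universal_anchors)

# One condition that received different scores in the two contexts.
print(to_universal(3.0, narrowband_test), to_universal(2.25, mixed_band_test))
```

In this constructed example the same condition is rated differently in the two contexts, yet both ratings land on the same universal-scale value after the anchor-based transformation. Real data would only approximate this, and a nonlinear law may be needed.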

Establishment of the Universal Scale Proposed procedure.


Conclusions:

Requirements for the new scale have been set up regarding

length of the measurement episode

conversational mode

bandwidth

Establishment of the scale requires top-down and bottom-up considerations

Next steps:

Define anchor conditions

Conduct subjective tests

Transform results and adjust

Define transformation laws also for instrumental models

Conclusions and Next Steps Universal quality scale.

Thank you for your attention!

Visit www.qu.tu-berlin.de for more information.

http://www.qu.tlabs.tu-berlin.de/
