
AUTOMATED QUALITY MONITORING IN THE CALL CENTER WITH ASR AND MAXIMUM ENTROPY

G. Zweig, O. Siohan, G. Saon, B. Ramabhadran, D. Povey, L. Mangu and B. Kingsbury

IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

ABSTRACT

This paper describes an automated system for assigning quality scores to recorded call center conversations. The system combines speech recognition, pattern matching, and maximum entropy classification to rank calls according to their measured quality. Calls at both ends of the spectrum are flagged as "interesting" and made available for further human monitoring. In this process, pattern matching on the ASR transcript is used to answer a set of standard quality control questions such as "did the agent use courteous words and phrases," and to generate a question-based score. This is interpolated with the probability of a call being "bad," as determined by maximum entropy operating on a set of ASR-derived features such as "maximum silence length" and the occurrence of selected n-gram word sequences. The system is trained on a set of calls with associated manual evaluation forms. We present precision and recall results from IBM's North American Help Desk indicating that for a given amount of listening effort, this system triples the number of bad calls that are identified, over the current policy of randomly sampling calls.

1. INTRODUCTION

Every day, tens of millions of help-desk calls are recorded at call centers around the world. As part of a typical call center operation, a random sample of these calls is normally re-played to human monitors who score the calls with respect to a variety of quality-related questions, e.g.

- Was the account successfully identified by the agent?
- Did the agent request error codes/messages to help determine the problem?
- Was the problem resolved?
- Did the agent maintain appropriate tone, pitch, volume and pace?

This process suffers from a number of important problems: first, the monitoring at least doubles the cost of each call (first an operator is paid to take it, then a monitor to evaluate it). Second, and as a direct consequence of this cost, only a very small sample of calls, e.g. a fraction of a percent, is typically evaluated. The third problem arises from the fact that most calls are ordinary and uninteresting; with random sampling, the human monitors spend most of their time listening to uninteresting calls.

This paper describes an automated quality-monitoring system that addresses these problems. Automatic speech recognition is used to transcribe 100% of the calls coming in to a call center, and default quality scores are assigned based on features such as key-words, key-phrases, the number and type of hesitations, and the average silence durations. The default score is used to rank the calls from worst-to-best, and this sorted list is made available to the human evaluators, who can thus spend their time listening only to calls for which there is some a-priori reason to expect that there is something interesting.

The automatic quality-monitoring problem is interesting in part because of the variability in how hard it is to answer the questions. Some questions, for example "Did the agent use courteous words and phrases?", are relatively straightforward to answer by looking for key words and phrases. Others, however, require essentially human-level knowledge to answer; for example, one company's monitors are asked to answer the question "Did the agent take ownership of the problem?" Our work focuses on calls from IBM's North American call centers, where there is a set of 31 questions that are used to evaluate call-quality. Because of the high degree of variability found in these calls, we have investigated two approaches:

1. Use a partial score based only on the subset of questions that can be reliably answered.

2. Use a maximum entropy classifier to map directly from ASR-generated features to the probability that a call is bad (defined as belonging to the bottom 20% of calls).

We have found that both approaches are workable, and we present final results based on an interpolation between the two scores. These results indicate that for a fixed amount of listening effort, the number of bad calls that are identified approximately triples with our call-ranking approach. Surprisingly, while there has been significant previous scholarly research in automated call-routing and classification in the call center, e.g. [1, 2, 3, 4, 5], there has been much less in automated quality monitoring per se.

2. ASR FOR CALL CENTER TRANSCRIPTION

2.1. Data

The speech recognition systems were trained on approximately 300 hours of 6kHz, mono audio data collected at one of the IBM call centers located in Raleigh, NC. The audio was manually transcribed, and speaker turns were explicitly marked in the word transcriptions, but not the corresponding times. In order to detect speaker changes in the training data, we did a forced alignment of the data and chopped it at speaker boundaries.

The test set consists of 50 calls with 113 speakers, totaling about 3 hours of speech.

2.2. Speaker Independent System

The raw acoustic features used for segmentation and recognition are perceptual linear prediction (PLP) features.


Segmentation/Clustering   Adaptation      WER
Manual                    Off-line        30.2%
Manual                    Incremental     31.3%
Manual                    No adaptation   35.9%
Automatic                 Off-line        33.0%
Automatic                 Incremental     35.1%

Table 1. ASR results depending on segmentation/clustering and adaptation type.

Accuracy   Top 20%   Bottom 20%
Random     20%       20%
QA         41%       30%

Table 2. Accuracy for the Question Answering system.

For the speaker-independent system, the features are mean-normalized on a per-speaker basis. Every 9 consecutive 13-dimensional PLP frames are concatenated and projected down to 40 dimensions using LDA+MLLT. The SI acoustic model consists of 50K Gaussians trained with MPE and uses a quinphone cross-word acoustic context. The techniques are the same as those described in [6].

2.3. Incremental Speaker Adaptation

In the context of speaker-adaptive training, we use two forms of feature-space normalization: vocal tract length normalization (VTLN) and feature-space MLLR (fMLLR, also known as constrained MLLR) to produce canonical acoustic models in which some of the non-linguistic sources of speech variability have been reduced. To this canonical feature space, we then apply a discriminatively trained transform called fMPE [7]. The speaker-adapted recognition model is trained in this resulting feature space using MPE.

We distinguish between two forms of adaptation: off-line and incremental adaptation. For the former, the transformations are computed per conversation-side using the full output of a speaker-independent system. For the latter, the transformations are updated incrementally using the decoded output of the speaker-adapted system up to the current time. The speaker-adaptive transforms are then applied to the future sentences. The advantage of incremental adaptation is that it only requires a single decoding pass (as opposed to two passes for off-line adaptation), resulting in a decoding process which is twice as fast. In Table 1, we compare the performance of the two approaches. Most of the gain of full offline adaptation is retained in the incremental version.
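As a rough illustration of the single-pass control flow (a sketch only, not the actual IBM decoder; `decode` and `estimate_fmllr` are hypothetical stand-ins for the real decoder and transform estimator):

```python
def decode_with_incremental_adaptation(utterances, decode, estimate_fmllr):
    """Single-pass decoding with incrementally updated adaptation transforms.

    `decode(utt, transform)` returns a word hypothesis; `estimate_fmllr(history)`
    re-estimates the feature transform from all decoded output so far.
    """
    transform = None      # start unadapted, as with the SI system
    history = []          # (features, decoded words) up to the current time
    hypotheses = []
    for utt in utterances:
        words = decode(utt, transform)       # only one decoding pass per utterance
        hypotheses.append(words)
        history.append((utt, words))
        # The refreshed transform is applied to future sentences only.
        transform = estimate_fmllr(history)
    return hypotheses
```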

2.3.1. Segmentation and Speaker Clustering

We use an HMM-based segmentation procedure for segmenting the audio into speech and non-speech prior to decoding. The reason is that we want to eliminate the non-speech segments in order to reduce the computational load during recognition. The speech segments are clustered together in order to identify segments coming from the same speaker, which is crucial for speaker adaptation. The clustering is done via k-means, each segment being modeled by a single diagonal-covariance Gaussian. The metric is given by the symmetric K-L divergence between two Gaussians.

Accuracy   Top 20%   Bottom 20%
Random     20%       20%
ME         49%       36%

Table 3. Accuracy for the Maximum Entropy system.

Accuracy   Top 20%   Bottom 20%
Random     20%       20%
ME + QA    53%       44%

Table 4. Accuracy for the combined system.

The impact of the automatic segmentation and clustering on the error rate is indicated in Table 1.
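The symmetric K-L divergence has a cheap closed form for diagonal-covariance Gaussians; a minimal NumPy sketch (variable names are ours):

```python
import numpy as np

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal-covariance Gaussians, summed over dimensions."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetrized divergence used as the k-means distance between segments."""
    return (kl_diag_gaussian(mu_p, var_p, mu_q, var_q)
            + kl_diag_gaussian(mu_q, var_q, mu_p, var_p))
```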

3. CALL RANKING

3.1. Question Answering

This section presents automated techniques for evaluating call quality. These techniques were developed using a training/development set of 676 calls with associated manually generated quality evaluations. The test set consists of 195 calls.

The quality of the service provided by the help-desk representatives is commonly assessed by having human monitors listen to a random sample of the calls and then fill in evaluation forms. The form for IBM's North American Help Desk contains 31 questions. A subset of the questions can be answered easily using automatic methods, among those the ones that check that the agent followed the guidelines, e.g.

- Did the agent follow the appropriate closing script?
- Did the agent identify herself to the customer?

But some of the questions require human-level knowledge of the world to answer, e.g.

- Did the agent ask pertinent questions to gain clarity of the problem?
- Were all available resources used to solve the problem?

We were able to answer 21 out of the 31 questions using pattern matching techniques. For example, if the question is "Did the agent follow the appropriate closing script?", we search for "THANK YOU FOR CALLING", "ANYTHING ELSE" and "SERVICE REQUEST". Any of these is a good partial match for the full script, "Thank you for calling, is there anything else I can help you with before closing this service request?" Based on the answer to each of the 21 questions, we compute a score for each call and use it to rank them. We label a call in the test set as being bad/good if it has been placed in the bottom/top 20% by human evaluators. We report the accuracy of our scoring system on the test set by computing the number of bad calls that occur in the bottom 20% of our sorted list and the number of good calls found in the top 20% of our list. The accuracy numbers can be found in Table 2.
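A minimal sketch of this style of matcher (the phrase list for the closing-script question comes from the text; combining the 21 answers as a fraction of "yes" answers is our assumption, since the paper does not spell out the scoring formula):

```python
# Partial matches for "Did the agent follow the appropriate closing script?"
CLOSING_SCRIPT_PATTERNS = ("THANK YOU FOR CALLING", "ANYTHING ELSE", "SERVICE REQUEST")

def followed_closing_script(transcript: str) -> bool:
    """Any one pattern counts as a good partial match for the full script."""
    text = transcript.upper()
    return any(pattern in text for pattern in CLOSING_SCRIPT_PATTERNS)

def question_based_score(answers: list[bool]) -> float:
    """Combine the 21 automatically answered questions into one call score
    (assumed here to be the fraction answered 'yes')."""
    return sum(answers) / len(answers)
```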

3.2. Maximum Entropy Ranking

Another alternative for scoring calls is to find arbitrary features in the speech recognition output that correlate with the outcome of a call being in the bottom 20% or not.


Fig. 1. Display of selected calls.

The goal is to estimate the probability of a call being bad based on features extracted from the automatic transcription. To achieve this, we build a maximum entropy based system which is trained on a set of calls with associated transcriptions and manual evaluations. The following equation is used to determine the score of a call $C$ using a set of $N$ predefined features:

$$P(\mathrm{bad} \mid C) \;=\; \frac{1}{Z(C)} \exp\left( \sum_{i=1}^{N} \lambda_i \, f_i(\mathrm{bad}, C) \right) \qquad (1)$$

where $C$ is the automatic transcription of the call, $Z(C)$ is a normalizing factor, $f_i(\mathrm{bad}, C)$ are indicator functions, and $\lambda_i$, $i \in \{1, \ldots, N\}$, are the parameters of the model, estimated via iterative scaling [8].

Due to the fact that our training set contained under 700 calls, we used a hand-guided method for defining features. Specifically, we generated a list of VIP phrases as candidate features, e.g. "THANK YOU FOR CALLING" and "HELP YOU". We also created a pool of generic ASR features, e.g. "number of hesitations", "total silence duration", and "longest silence duration". A decision tree was then used to select the most relevant features and the threshold associated with each feature. The final set of features contained 5 generic features and 25 VIP phrases. If we take a look at the weights learned for different features, we can see that if a call has many hesitations and long silences, then most likely the call is bad.
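For concreteness, a sketch of evaluating a score in the form of Equation 1 for the two-class (bad/good) case, where the normalizer reduces to a logistic form; feature names and weights are illustrative, not the ones learned by the system:

```python
import math

def p_bad(features: dict[str, float], weights: dict[str, float]) -> float:
    """Two-class maximum entropy score: taking lambda as the difference
    between the 'bad' and 'good' class weights, Z(C) reduces to a sigmoid."""
    s = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-s))

# Illustrative feature vector: binary VIP-phrase indicators plus generic ASR
# features thresholded by the decision tree (names are hypothetical).
example_call = {"vip:THANK_YOU_FOR_CALLING": 0.0,  # phrase absent
                "num_hesitations>10": 1.0,
                "longest_silence>5s": 1.0}
```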

We use $P(\mathrm{bad} \mid C)$ as shown in Equation 1 to rank all the calls. Table 3 shows the accuracy of this system for the bottom and top 20% of the test calls.

At this point we have two scoring mechanisms for each call: one that relies on answering a fixed number of evaluation questions, and a more global one that looks across the entire call for hints. These two scores are both between 0 and 1, and therefore can be interpolated to generate one unique score. After optimizing the interpolation weights on a held-out set, we obtained a slightly higher weight (0.6) for the maximum entropy model. It can be seen in Table 4 that the accuracy of the combined system is greater than the accuracy of each individual system, suggesting the complementarity of the two initial systems.
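The combination is then a simple convex interpolation; a one-line sketch, under the assumption that the question-based score is flipped so that, like $P(\mathrm{bad} \mid C)$, larger means worse:

```python
def combined_badness(p_bad_me: float, qa_score: float, w_me: float = 0.6) -> float:
    """Interpolate the two per-call scores (both in [0, 1]); 0.6 is the
    held-out weight reported for the maximum entropy model."""
    return w_me * p_bad_me + (1.0 - w_me) * (1.0 - qa_score)
```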

Fig. 2. Interface to listen to audio and update the evaluation form.

4. END-TO-END SYSTEM PERFORMANCE

4.1. User Interface

This section describes the user interface of the automated quality-monitoring application. As explained in Section 1, the evaluator scores calls with respect to a set of quality-related questions after listening to the calls. To aid this process, the user interface provides an efficient mechanism for the human evaluator to select calls, e.g.

- All calls from a specific agent, sorted by score
- The top 20% or the bottom 20% of the calls from a specific agent, ranked by score
- The top 20% or the bottom 20% of all calls from all agents

The automated quality-monitoring user interface is a J2EE web application that is supported by back-end databases and content management systems.¹ The displayed list of calls provides a link to the audio, the automatically filled evaluation form, the overall score for this call, the agent's name, server location, call id, and the date and duration of the call (see Figure 1). This interface now gives the agent the ability to listen to interesting calls and update the answers in the evaluation form if necessary (audio and evaluation form illustrated in Figure 2). In addition, this interface provides the evaluator with the ability to view summary statistics (average score) and additional information about the quality of the calls.

4.2. Precision and Recall

This section presents precision and recall numbers for the identification of "bad" calls. The test set consists of 195 calls that were manually evaluated by call center personnel. Based on these manual scores, the calls were ordered by quality, and the bottom 20% were deemed to be "bad." To retrieve calls for monitoring, we sort the calls based on the automatically assigned quality score and return the worst. In our summary figures, precision and recall are plotted as a function of the number of calls that are selected for monitoring. This is important because in reality only a small number of calls can receive human attention. Precision is the ratio of bad calls retrieved to the total number of calls monitored, and recall is the ratio of the number of bad calls retrieved to the total number of bad calls in the test set.

¹ In our case, the backend consists of DB2 and IBM's Websphere Information Integrator for Content, and the application is hosted on Websphere 5.1.


Fig. 3. Precision for the bottom 20% of the calls as a function of the number of calls retrieved. (Curves: Observed, Ideal, Random.)

Fig. 4. Recall for the bottom 20% of the calls. (Curves: Observed, Ideal, Random.)

Three curves are shown in each plot: the actually observed performance, the performance of random selection, and oracle or ideal performance. Oracle performance shows what would happen if a perfect automatic ordering of the calls was achieved.
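In code, the two quantities at each operating point are straightforward; a minimal sketch with hypothetical call-id inputs:

```python
def precision_recall_at_k(ranked_call_ids, bad_call_ids, k):
    """Precision and recall when the k worst-ranked calls are monitored."""
    monitored = set(ranked_call_ids[:k])   # calls a human would listen to
    bad = set(bad_call_ids)
    hits = len(monitored & bad)            # bad calls actually caught
    return hits / k, hits / len(bad)
```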

Figure 3 shows precision performance. We see that in the monitoring regime where only a small fraction of the calls are monitored, we achieve over 60% precision. (Further, if 20% of the calls are monitored, we still attain over 40% precision.)

Figure 4 shows the recall performance. In the regime of low-volume monitoring, the recall is midway between what could be achieved with an oracle and the performance of random selection.

Figure 5 shows the ratio of the number of bad calls found with our automated ranking to the number found with random selection. This indicates that in the low-monitoring regime, our automated technique triples efficiency.

4.3. Human vs. Computer Rankings

As a final measure of performance, in Figure 6 we present a scatter plot comparing human to computer rankings. We do not have calls that are scored by two humans, so we cannot present a human-human scatter plot for comparison.

5. CONCLUSION

This paper has presented an automated system for quality monitoring in the call center. We propose a combination of maximum-entropy classification based on ASR-derived features, and question answering based on simple pattern-matching. The system can either be used to replace human monitors, or to make them more efficient.

Fig. 5. Ratio of bad calls found with QTM to random selection as a function of the number of bad calls retrieved. (Curves: Observed, Ideal.)

Fig. 6. Scatter plot of Human vs. Computer Rank.

Our results show that we can triple the efficiency of human monitors, in the sense of identifying three times as many bad calls for the same amount of listening effort.

6. REFERENCES

[1] J. Chu-Carroll and B. Carpenter, "Vector-based natural language call routing," Computational Linguistics, 1999.

[2] P. Haffner, G. Tur, and J. Wright, "Optimizing SVMs for complex call classification," 2003.

[3] M. Tang, B. Pellom, and K. Hacioglu, "Call-type classification and unsupervised training for the call center domain," in ASRU-2003, 2003.

[4] D. Hakkani-Tur, G. Tur, M. Rahim, and G. Riccardi, "Unsupervised and active learning in automatic speech recognition for call classification," in ICASSP-04, 2004.

[5] C. Wu, J. Kuo, E. E. Jan, V. Goel, and D. Lubensky, "Improving end-to-end performance of call classification through data confusion reduction and model tolerance enhancement," in Interspeech-05, 2005.

[6] H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig, "The IBM 2004 conversational telephony system for rich transcription," in Eurospeech-2005, 2005.

[7] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, "fMPE: Discriminatively trained features for speech recognition," in ICASSP-2005, 2005.

[8] A. Berger, S. Della Pietra, and V. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, 1996.