ACOUSTIC FEATURE-BASED SENTIMENT ANALYSIS OF CALL CENTER DATA

A Thesis
Presented to
The Faculty of the Graduate School
At the University of Missouri-Columbia

In Partial Fulfillment
Of the Requirements for the Degree
Master of Science

By
Zeshan Peng
Dr. Yi Shang, Thesis Supervisor
DEC 2017
The undersigned, appointed by the dean of the Graduate School, have examined the thesis entitled

ACOUSTIC FEATURE-BASED SENTIMENT ANALYSIS OF CALL CENTER DATA

Presented by Zeshan Peng
A candidate for the degree of
Master of Science
And hereby certify that, in their opinion, it is worthy of acceptance.

Dr. Yi Shang
Dr. Detelina Marinova
Dr. Dong Xu
ACKNOWLEDGEMENTS

I would like to thank my academic advisor, Dr. Yi Shang, for all the help and great suggestions I received from him. I would not have been able to complete this thesis work without his help.

I would also like to thank all the people on our team for this project, especially Nickolas Wergeles and Wenbo Wang. They always provided me the best support and critical feedback throughout the whole process, which let me make it this far with this work.

Finally, I would like to thank Dr. Detelina Marinova and her student Bitty Balducci for their great efforts in providing datasets for this project, as well as their business insights that directed my initial work.

I would like to give my last thanks to Dr. Dong Xu for being my committee member and helping me defend my thesis work.

- Zeshan Peng
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
1. INTRODUCTION
2. BACKGROUND AND RELATED WORK
   2.1 Text-based approach
   2.2 Acoustic feature-based approach
   2.3 Multimodal approach
3. PROPOSED METHODS
   3.1 Acoustic Feature-Based Sentiment Recognition Using Classic Machine Learning Algorithms
      3.1.1 Gathering Audio Data
      3.1.2 Data Cleaning and Pre-processing
      3.1.3 Feature Extraction and Selection
      3.1.4 Train Classification Model
      3.1.5 Test Classification Model
   3.2 Acoustic Feature-Matrix-Based Sentiment Recognition Using Deep Convolutional Neural Network
      3.2.1 Deep Learning
      3.2.2 Feature Matrix
      3.2.3 Architecture
4. FEATURE SETS AND CLASSIFICATION ALGORITHMS
   4.1 Feature Sets
      4.1.1 Fundamental Features
      4.1.2 Shimmer and Jitter
      4.1.3 Short-term Features
      4.1.4 Emotion Features
      4.1.5 Spectrograms
   4.2 Classification Algorithms
5. EXPERIMENT RESULTS
   5.1 Dataset
   5.2 Feature Extraction Libraries
   5.3 Training Tools
   5.4 Per-Segment Results
   5.5 Per-Record Results
   5.6 Deep Learning Results
6. CONCLUSION
7. FUTURE WORK
8. REFERENCES
LIST OF FIGURES

Figure 1. Problem Illustration
Figure 2. Sentiment Recognition Process Overview
Figure 3. Speaker Diarization Process
Figure 4. Data Flow Diagram for Model Training
Figure 5. Majority Vote Method for Model Testing
Figure 6. Feature Matrix Illustration
Figure 7. Deep Convolutional Neural Network Architecture
Figure 8. Comparison of Customer Model and Representative Model on Different Feature Sets
Figure 9. Model Comparison Based on Majority Votes Per Audio Record
Figure 10. Prediction Accuracy vs. Number of Convolutional Layers
Figure 11. 1-D CNN vs. 2-D CNN on Different Pooling Kernel Sizes
Figure 12. Comparison on Different Window Sizes
Figure 13. Comparison on Different Machine Learning Algorithms
LIST OF TABLES

Table 1. Short-term Feature Descriptions
Table 2. Dataset Summary
ABSTRACT

With the advancement of machine learning methods, audio sentiment analysis has become an active research area in recent years. For example, business organizations are interested in persuasion tactics drawn from vocal cues and acoustic measures in speech. A typical approach is to find a set of acoustic features from audio data that can indicate or predict a customer's attitude, opinion, or emotional state. For audio signals, acoustic features have been widely used in many machine learning applications, such as music classification, language recognition, and emotion recognition. For emotion recognition, previous work shows that pitch and speech-rate features are important. This thesis focuses on determining sentiment from call center audio records, each containing a conversation between a sales representative and a customer. The sentiment of an audio record is considered positive if the conversation ended with an appointment being made, and negative otherwise. In this project, a data processing and machine learning pipeline for this problem has been developed. It consists of three major steps: 1) an audio record is split into segments by speaker turns; 2) acoustic features are extracted from each segment; and 3) classification models are trained on the acoustic features to predict sentiment. Different sets of features have been used, and different machine learning methods, including classical machine learning algorithms and deep neural networks, have been implemented in the pipeline. In our deep neural network method, the feature vectors of audio segments are stacked in temporal order into a feature matrix, which is fed into deep convolutional neural networks as input. Experimental results on real data show that acoustic features, such as Mel-frequency cepstral coefficients, timbre, and chroma features, are good indicators of sentiment. Temporal information in an audio record can be captured by deep convolutional neural networks for improved prediction accuracy.
1. INTRODUCTION

Speech analytics allows people to extract information from audio data. It has been widely used by many companies to gather business intelligence through the analysis of recorded calls in contact centers. Sentiment analysis is one form of speech analytics that tries to infer subjective feelings about products or services in a conversation. This analysis can also be used to help agents who talk to customers over the phone build customer relationships and solve any issues that may emerge.

Two major methods have been used for audio sentiment analysis: acoustic modeling and linguistic modeling. Linguistic modeling requires transcribing audio records into text files and conducting analyses based on the text content. It assumes that specific words or phrases are used with higher probability in certain contexts, so one can expect certain outcomes if some words or phrases appear frequently in the transcripts. Many researchers have shown strong evidence that linguistic modeling performs well for audio sentiment analysis. In their work, lexical features, disfluencies, and semantic features are most commonly used when building models. However, acquiring a good transcript for each audio record can be tedious and financially costly. A good automatic transcription system that accurately transcribes audio records into text is also hard to build, and even a small error in the text can result in a big difference in a linguistic model. Acoustic modeling, on the other hand, relies on acoustic features of the audio data. These features, including pitch, intensity, speech rate, and so on, can be easily computed by programs, and together they can provide basic indicators of sentiment to some degree. Acoustic features have been used in many other business and psychological analyses as well, which is another reason to use them in sentiment analysis. However, one problem with acoustic modeling is that the quality of the audio records has a strong influence on the final result. Since we rely on the acoustic characteristics of the audio data, poor audio quality can significantly impact the ability to get accurate values for those acoustic features, which in turn leads to poor models. The other problem is that in a real-world setting there is always some random background noise. Model training data may not capture this random noise, which would challenge the sentiment analysis result if we want to monitor a live conversation. Still, this method is worth exploring further because of its convenience and efficiency.

The remaining structure of this thesis is as follows: chapter 2 gives background and related work on sentiment analysis, chapter 3 introduces the two methods that we propose and our sentiment recognition pipeline in detail, chapter 4 explains our selection of feature sets and classification models, chapter 5 shows our experiment results, chapter 6 is the conclusion, chapter 7 discusses potential problems and future work, and the last chapter lists the references.
2. BACKGROUND AND RELATED WORK

Sentiment analysis can be divided into three categories: the text-based approach, the acoustic feature-based approach, and the multimodal approach, which combines the previous two. Different features are used for each approach. Classification models are trained with different learning algorithms and their performances are compared. Many sentiment classification techniques have been proposed in recent years [4]. However, the overall performance is not yet as good as one might hope, and much more work could be done on this problem in the future.

2.1 Text-based approach

Text is one of the most immense corpora generated daily, and researchers are eager to take advantage of it; many data mining works have been done for this reason. For example, Twitter, one of the most popular social media platforms in the world, generates tons of text every day, and different mining methods based on Twitter text have been proposed for sentiment analysis [23][24]. In 2010, it was shown that recurrent neural networks can exploit temporal features in text content for language modeling [13]. Even in the music field, mining lyrics directly can produce models for song sentiment classification [20] and music mood classification [21].
2.2 Acoustic feature-based approach

Many different acoustic features can be generated from an audio file: pitch, intensity, speech rate, articulation rate, and so on. Short-term acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), have proven useful for music modeling in [5]. MFCCs are good for musical genre classification, with evidence shown in [6]. The authors of [9] take MFCCs as the major acoustic features for the emotion recognition problem. MFCC features are also used to train a single deep neural network for speaker and language recognition in [12]. Timbre and chroma are other acoustic features that have interested many researchers in past years. They can be used to generate music from text in [7]: in that work, emotions are extracted from text, and then a combination of timbre and chroma features is used to compose music that expresses those emotions. Chroma and timbre features can be used to classify music as well in [8]. The authors also claim that "Chroma features are less informative for classes such as artist, but contain information that is almost entirely independent of the spectral features." Therefore, both are equally important to include when conducting sentiment analysis on acoustic features. Rhythm, timbre, and intensity of music are used for the music mood classification problem as well [22]. The acoustic feature-based approach can be extended to speech recognition in [10] and automatic language identification in [11] when modeling with deep neural networks.
2.3 Multimodal approach

With the emergence of social media, beyond text corpora and audio records, videos and images bring new opportunities for sentiment analysis. Facial and vocal features are extracted from these new streams, and multimodal sentiment analysis seems to have much more potential in the future [14][15]. YouTube, one of the biggest video hosts, has a huge number of videos, and researchers are trying to identify sentiment in those videos based on linguistic, audio, and visual features together [25][26]. Fusing the audio, visual, and text modalities as sources of sentiment information is another direction, proposed in [27]. For linguistic analysis, language differences can be a problem; it has been shown that language-independent audio-visual analysis is competitive with purely linguistic analysis [28]. Works in [16][17][18] show that systems combining lyric features and audio features significantly outperform systems that use lyrics or audio alone for music mood classification. Fusing lyric features and melody features of music improves the performance of a music information retrieval system [19]. Fusion of multiple modalities is the current trend, since we can often find new information when we view an existing object from a new angle. Sometimes, however, access to visual or linguistic sources can be hard and costly; for example, in our data source it is hard to get facial expressions from a customer over the phone. But in other circumstances, when all the information is available, the multimodal approach would be the first choice.
3. PROPOSED METHODS

The problem this thesis addresses is to identify whether a meeting has been scheduled between a customer and a representative in a phone call conversation. We propose two methods for this kind of problem. The first method is to split each audio record into segments by speaker turns, extract a feature vector from each segment, and then use classic machine learning algorithms to classify those feature vectors according to sentiment. All segments that come from the same audio record share the same sentiment label as that record. The second method is to cut each audio record into short audio clips of the same length, extract a feature vector from each clip, stack those feature vectors into feature matrices in time order, and then feed the feature matrices into deep convolutional neural networks to build classification models. All feature matrices that come from the same audio record share the same sentiment label as that record. The next two subsections describe these two methods in more detail.
Figure 1. Problem Illustration
3.1 Acoustic Feature-Based Sentiment Recognition Using Classic Machine Learning Algorithms

The whole sentiment recognition process consists of five steps. First, we gather useful audio data from call center records. We select audio data that come with labels, since we are mainly interested in supervised machine learning algorithms. It is quite possible that unsupervised machine learning could provide interesting results for audio sentiment analysis; this is left to future exploration, either combined with supervised learning or on its own. The second step is cleaning and pre-processing the chosen audio files. Any unrelated parts of the audio are removed, such as ringtones, receptionist talk, music, advertisements, etc. Several methods were tried to remove those unrelated parts automatically; however, none of them met our expectations, so we decided to remove them manually, since the dataset we currently have is fairly small. After that, we apply a speaker diarization algorithm to split each audio record into segments by speaker turns. The third step is feature extraction and selection. We consider different combinations of feature sets and select different features to train on in the next stage. The fourth step is classification model training. We use different training algorithms, techniques, and models, then compare their performance in terms of prediction accuracy. The last step is testing the trained classification model.
Figure 2. Sentiment Recognition Process Overview (stages: Pre-processing/Cleaning, Feature Extraction/Feature Selection, Model Training, Model Testing)
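The five steps above can be sketched as a small driver program. This is only an illustration: every function, field name, and data layout below is a placeholder we invented for the sketch, not the thesis code.

```python
# Placeholder sketch of the five-step pipeline. Each stage function
# stands in for the corresponding step described above.

def gather_audio_data(records):
    # Step 1: keep only records that come with a sentiment label,
    # since the pipeline uses supervised learning.
    return [r for r in records if r.get("label") is not None]

def clean_and_segment(record):
    # Step 2: in the real pipeline, unrelated audio is removed and the
    # record is split by speaker turns; here each record already lists
    # its speaker-turn segments.
    return record["turns"]

def extract_features(segment):
    # Step 3: stand-in for acoustic feature extraction; returns a
    # trivial one-element "feature vector" (the segment length).
    return [len(segment["audio"])]

def run_pipeline(records):
    # Steps 4-5 (model training and testing) would consume this
    # dataset of (feature vector, label) pairs.
    dataset = []
    for record in gather_audio_data(records):
        for segment in clean_and_segment(record):
            # Every segment inherits its record's sentiment label.
            dataset.append((extract_features(segment), record["label"]))
    return dataset

records = [
    {"label": 1, "turns": [{"audio": [0.1, 0.2]}, {"audio": [0.3]}]},
    {"label": None, "turns": []},  # unlabeled record is discarded
]
print(run_pipeline(records))  # [([2], 1), ([1], 1)]
```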
3.1.1 Gathering Audio Data

Every training example impacts the final classification model. We do not want bad data to affect the model's performance, and we want an effective training dataset that generalizes well to all plausible scenarios rather than to many unlikely events. For audio datasets, the quality of the audio file can be one metric used when selecting files. Call center audio records are not always well recorded: due to internet latency and background noise, some files are very blurred and hard even for a human to understand. We need audio data recorded with good quality. There are also some files that do not come with a label, and we want to discard those as well for supervised learning purposes.
3.1.2 Data Cleaning and Pre-processing

An audio file usually contains a conversation between two speakers: a customer and a salesperson. In some scenarios, a receptionist answers the phone call and transfers the salesperson to whomever he or she wants to reach. The features of the receptionist are not our main interest here, so we need to cut out any conversation between the salesperson and the receptionist when it occurs in the audio file. After that, we split each audio file into segments by speaker turns. Knowing how certain features change between consecutive segments helps the whole sentiment analysis, which is why it is important to split audio files into segments by speaker turns before extracting features from them. We also want to study the temporal relations of features across consecutive segments in an audio file; that would help monitor live conversations in call centers. There are plenty of publicly available speaker diarization algorithms. As described in [1], [2], and [3], a lot of research has been conducted on this type of problem, and many of these methods perform well at splitting audio files by speaker turns automatically.

In our research, we have used the IBM Watson and Google cloud computing services to split our audio files into segments by speaker turns. Both systems can identify who speaks at what time and transcribe the conversation into text. Even though the transcription is not as accurate as we had hoped, the timestamps of the start and end of each speaker turn are good enough for us to separate each audio file into segments.

Since we only have a small amount of audio data for now, and the transcripts of all audio files are available to us, we split each audio file into segments based on the timestamps in the corresponding transcript. However, the speaker diarization procedure will still be crucial when more and more data comes in and access to full transcripts becomes very costly; that is when we may want to use automatic speaker diarization techniques.
Figure 3. Speaker Diarization Process

Pseudo-code:

function speaker_diarization():
    open audio file
    open transcription file
    for each timestamp pair (start_time, stop_time) in transcription file:
        segment = cut_audio_file(start_time, stop_time)
        segment_label = audio_file_label
        if segment belongs to customer:
            add segment to customer_cluster
        if segment belongs to representative:
            add segment to representative_cluster
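A runnable sketch of the same procedure in Python: the audio is represented as a NumPy sample array, and we assume the transcript yields (start_sec, stop_sec, speaker) tuples. Both layouts are our assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def speaker_diarization(audio, sample_rate, turns, record_label):
    """Split a 1-D sample array into customer/representative segments.

    `turns` is assumed to be a list of (start_sec, stop_sec, speaker)
    tuples read from the transcript. Every segment inherits the
    sentiment label of the whole record, as in the pipeline above.
    """
    clusters = {"customer": [], "representative": []}
    for start_sec, stop_sec, speaker in turns:
        start = int(start_sec * sample_rate)
        stop = int(stop_sec * sample_rate)
        clusters[speaker].append((audio[start:stop], record_label))
    return clusters

# Tiny usage example: 3 seconds of fake 1 kHz audio, two speaker turns.
sr = 1000
audio = np.zeros(3 * sr)
turns = [(0.0, 1.5, "customer"), (1.5, 3.0, "representative")]
clusters = speaker_diarization(audio, sr, turns, record_label=1)
print(len(clusters["customer"][0][0]))  # 1500 samples
```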
3.1.3 Feature Extraction and Selection

After we have split each audio file into segments by speaker turns, we extract features from each segment. Many features can be extracted from an audio file; for our research purposes, we focus on acoustic features. There are mainly two groups of features that interest us. One group is prosody features, such as pitch, intensity, speech rate, number of pauses, etc. The other group is short-term features. We explain our feature sets in detail in chapter 4.

Pseudo-code:

function feature_extraction():
    for each segment in customer_cluster:
        prosody, short_term = calc_algorithm(segment)
        add prosody to prosody_feature_cluster for customer
        add short_term to short_term_cluster for customer
    for each segment in representative_cluster:
        prosody, short_term = calc_algorithm(segment)
        add prosody to prosody_feature_cluster for representative
        add short_term to short_term_cluster for representative
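To make the extraction step concrete, here is a minimal NumPy sketch that computes two classic short-term features per fixed-length frame: frame energy and zero-crossing rate. The thesis itself uses a 34-feature short-term set (described in chapter 4); this two-feature version only illustrates the frame-by-frame idea.

```python
import numpy as np

def short_term_features(signal, frame_len):
    """Return one [energy, zero-crossing rate] row per frame."""
    n_frames = len(signal) // frame_len
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))  # mean squared amplitude
        # Fraction of consecutive sample pairs whose sign flips.
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append([energy, zcr])
    return np.array(feats)

# An alternating +1/-1 signal: maximal zero-crossing rate, unit energy.
sig = np.tile([1.0, -1.0], 50)          # 100 samples
F = short_term_features(sig, frame_len=20)
print(F.shape)  # (5, 2)
```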
3.1.4 Train Classification Model

Once we have features for each segment, we split the data into a training set, a validation set, and a testing set, then train a classification model on these features and their associated labels. Every segment in the same audio file has the same label as that audio file. We also partition segments into two groups: a customer group and a salesperson group. Since customers may behave differently from salespersons, we train two separate models, one for each. We take all features in the customer group as the training data for the customer model, and all features in the salesperson group as the training data for the salesperson model. The validation set is used to tune the hyperparameters of the learning algorithms, and the testing set is used to measure the performance of the trained model. More details on the classification models and learning algorithms are given in chapter 4.
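The split can be sketched in NumPy as below. The 70/15/15 proportions are our illustrative choice, since the exact ratios are not specified here; in the pipeline, this split would be applied separately to the customer group and the salesperson group.

```python
import numpy as np

def split_dataset(X, y, seed=0, train_frac=0.7, val_frac=0.15):
    """Shuffle (X, y) and split into train / validation / test;
    whatever remains after the train and validation fractions
    becomes the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(len(X) * train_frac)
    n_val = int(len(X) * val_frac)
    parts = np.split(order, [n_train, n_train + n_val])
    return [(X[idx], y[idx]) for idx in parts]

# 100 segment feature vectors (34 short-term features each).
X = np.zeros((100, 34))
y = np.zeros(100, dtype=int)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 70 15 15
```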
Figure 4. Data Flow Diagram for Model Training
3.1.5 Test Classification Model

After the model is built in the previous step, we use it to predict labels for any audio file on which we want to perform sentiment analysis. The audio file is still split into segments by the timestamps in its transcript. The model predicts a label for each segment in the file, and we take the majority vote of the segment labels as the final label for that test audio file:

y = (Σ_{i=1}^{n} y_i) / n ≥ 0.5 ? 1 : 0

where n is the number of segments in the audio record and y_i is the predicted label of segment i.
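The majority-vote rule translates directly into code (a minimal sketch):

```python
def majority_vote(segment_labels):
    """Record-level label: 1 if at least half of the per-segment
    predictions are positive, 0 otherwise."""
    n = len(segment_labels)
    return 1 if sum(segment_labels) / n >= 0.5 else 0

print(majority_vote([1, 0, 1, 1]))  # 1 (3 of 4 segments positive)
print(majority_vote([0, 0, 1]))     # 0 (only 1 of 3 positive)
```

Note that, per the formula, an exact tie counts as positive.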
Figure 5. Majority Vote Method for Model Testing
3.2 Acoustic Feature-Matrix-Based Sentiment Recognition Using a Deep Convolutional Neural Network
3.2.1 Deep Learning
It has been shown in [29] that deep convolutional neural networks can be applied to character-level feature matrices for text classification. It has
achieved competitive results against current state-of-the-art methods. Evidence shows that temporal information can be captured by applying a deep convolutional neural network to the feature matrix of a time-series dataset. With the advancement of machine learning, especially deep learning methods, audio signal analysis can take advantage of these techniques to build models that are competitive with, or even better than, other approaches.
3.2.2 Feature Matrix
Convolutional neural networks usually take a two-dimensional matrix as input, so we stack audio feature vectors into a feature matrix to meet that requirement. Short-term features are extracted every 50 milliseconds from each audio record and then stacked in time order to build the feature matrix for that record. The window size is 300, which means we cut the matrix every 300 rows. Every matrix generated from the same audio record shares the sentiment label of that record. If the number of rows, or the number of remaining rows after cutting, is less than 300, we pad with a zero-valued patch. Each training matrix therefore has a size of 300 by 34, since we have 34 short-term features in total. The figure below illustrates the process for building feature matrices.
Pseudo-code:

    function feature_matricization():
        for each audio record:
            cut it into equal-length (50 ms) segments
            for each segment:
                short_term = calc_algorithm(segment)
            stack the short-term features into a feature matrix in time order
            cut the feature matrix every 300 rows
            matrix_label = audio_record_label
            add the matrices to the training dataset
Figure 6. Feature Matrix Illustration
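As a hedged numpy sketch of the stacking-and-padding step above (not the thesis code), given one 34-dimensional short-term feature vector per 50 ms frame:

```python
import numpy as np

WINDOW_ROWS = 300   # rows per feature matrix
N_FEATURES = 34     # short-term features per 50 ms frame

def build_feature_matrices(frames):
    """Split a (n_frames, 34) array of short-term feature vectors into
    300x34 matrices, zero-padding the final partial window."""
    frames = np.asarray(frames, dtype=float)
    n = len(frames)
    n_windows = -(-n // WINDOW_ROWS)            # ceiling division
    padded = np.zeros((n_windows * WINDOW_ROWS, N_FEATURES))
    padded[:n] = frames
    return padded.reshape(n_windows, WINDOW_ROWS, N_FEATURES)
```

Every matrix produced this way would carry the sentiment label of its source audio record.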
3.2.3 Architecture
Extensive experiments have been conducted to compare performance under different network settings, including the number of convolutional layers, the convolution kernel size, the max-pooling kernel size, and the window size used when cutting the matrix. The figure below shows a typical network architecture with four convolutional layers. This architecture design is based on pre-trained neural networks that work extremely well on image datasets. Every convolutional layer is followed by a max-pooling layer. We use 1-D convolutional kernels here since the features are distinct from each other, and our experimental results show that 1-D convolutional kernels perform better than 2-D convolutional kernels.
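To make the 1-D-kernel idea concrete, here is a minimal numpy sketch of one convolution-plus-pooling stage applied along the time axis of a 300x34 feature matrix; the kernel values and pooling size are illustrative assumptions, and a real Keras model would learn many kernels per layer:

```python
import numpy as np

def conv1d_time(matrix, kernel):
    """Valid 1-D convolution along the time axis, applied to each
    feature column independently (the kernel spans time, not features)."""
    t, f = matrix.shape
    k = len(kernel)
    out = np.empty((t - k + 1, f))
    for i in range(t - k + 1):
        out[i] = kernel @ matrix[i:i + k]   # weighted sum of k frames
    return out

def maxpool1d_time(matrix, pool):
    """Non-overlapping max pooling along the time axis."""
    t, f = matrix.shape
    t_out = t // pool
    return matrix[:t_out * pool].reshape(t_out, pool, f).max(axis=1)
```

Because the kernel never mixes the 34 feature columns, this is the 1-D setting described above; a 2-D kernel would additionally span neighboring feature columns.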
Figure 7. Deep Convolutional Neural Network Architecture
4. FEATURE SETS AND CLASSIFICATION ALGORITHMS
4.1 Feature Sets
4.1.1 Fundamental Features
The first feature set that we consider consists of fundamental features of any audio data, such as the decibel level, mean of pitches, standard deviation of pitches, covariance of pitches, speech rate, number of pauses in a segment, and length of the segment in milliseconds. These are general features that can represent an audio segment to some extent, and the literature of the past decade has shown them to be crucially important for audio signal analysis.
4.1.2 Shimmer and Jitter
Jitter measures the cycle-to-cycle variation in the frequency of vocal fold vibration, while shimmer measures the cycle-to-cycle variation in its amplitude. These two features characterize an audio segment from another perspective.
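As an illustration of these perturbation measures (a simplified sketch, not Praat's algorithm), local jitter and local shimmer can be computed from cycle-level measurements:

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    normalized by the mean period (frequency perturbation)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same perturbation measure applied to cycle peak
    amplitudes (amplitude perturbation)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
```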
4.1.3 Short-term Features
Mel-frequency cepstral coefficients, chroma, and timbre features are the short-term features used in this work. The mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, and mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up that cepstrum.
Different combinations of these coefficients carry different indications for sentiment analysis, so we include them in our feature set. Chroma features represent the spectral energy of different pitch classes. Timbre features capture the character or quality of a sound as distinct from its pitch and intensity. These three groups of features have been used frequently in many classification problems, as discussed in the related work in Section 2.
Feature   Description
MFCCs     Mel-frequency cepstral coefficients from a cepstral representation where the frequency bands are not linear but distributed according to the mel scale
Chroma    A representation of the spectral energy where the bins represent equal-tempered pitch classes
Timbre    The character or quality of a sound or voice as distinct from its pitch and intensity

Table 1. Short-term Feature Descriptions
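To make the MFCC pipeline concrete, here is a simplified single-frame sketch (frame, power spectrum, mel filterbank, log, DCT-II). It is an illustration, not the pyAudioAnalysis implementation, and the filter and coefficient counts are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """MFCCs of one windowed frame: power spectrum -> mel energies
    -> log -> DCT-II (computed with an explicit cosine matrix)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, n_fft, fs) @ power
    log_e = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```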
4.1.4 Emotion Features
We also find that emotion features can be critical indicators for some sentiments. Positive emotions, such as happiness and satisfaction, can lead to a good outcome at the end of the conversation, while negative emotions, such as being upset or angry, can turn things around. We want to train our model on these features as well.
4.1.5 Spectrograms
Besides valued inputs, we also generate image features, namely spectrograms, from each segment. A spectrogram is a visual representation of the spectrum of frequencies of a sound as it varies with time. Some sentiments may have specific patterns in the spectrograms, and we want our model to be able to identify those patterns if they exist.
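A magnitude spectrogram can be sketched with a short-time Fourier transform as follows; this illustrates the underlying computation (matplotlib's specgram implements a similar procedure), with the frame and hop sizes chosen arbitrarily:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: split the signal into overlapping
    Hann-windowed frames and take |FFT| of each one."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, time)
```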
4.2 Classification Algorithms
In our experiments, we use both classic machine learning algorithms and neural networks. For the classic algorithms, we use a support vector machine with a cubic kernel and a k-nearest-neighbor classifier. For neural networks, we use classic shallow feed-forward neural networks and a deep convolutional neural network. Different learning algorithms have different advantages and characteristics, and we want to experiment with all of them and compare their results.
5. EXPERIMENT RESULTS
5.1 Dataset
Our dataset consists of 88 audio recording files from cooperating Penske offices. The duration of these files ranges from a few minutes to ten minutes, with an average length of about three minutes. After splitting these audio files into segments by speaker turns, we have 2859 segments in total; 1438 of them are customer segments and 1421 are salesperson segments. Features extracted from the customer group are used to train the customer model, and likewise the salesperson group is used for the salesperson model.
                Positive  Negative  Total
Audio Records      31        55       86
Segments         1284      1304     2588

Table 2. Dataset Summary
5.2 Feature Extraction Libraries
Praat is the main software that we used to extract the fundamental audio features as well as the shimmer and jitter values. We use a Python library named pyAudioAnalysis, available on GitHub, to extract the MFCC features, and the OpenSMILE package to extract the emotion features. The pyplot module of the matplotlib package is used to generate a spectrogram for each audio segment.
5.3 Training Tools
We use the machine learning toolbox in Matlab to train models with the classic machine learning algorithms. The neural networks are implemented in Keras, running on top of TensorFlow.
5.4 Per-Segment Results
Since customer behavior differs from representative behavior, we want two separate models, one for each. The speaker diarization process helps with that: features extracted from customer segments are grouped into one cluster, while features extracted from representative segments are grouped into another. Features in the customer cluster are used to train the customer model, and likewise for the representative model. We compare model performance when training on different feature sets. The figure below summarizes the comparison based on per-segment prediction accuracy:
Figure 8. Comparison of Customer Model and Representative Model on Different Feature Sets
From the figure above, we can see that short-term features are stronger indicators than prosody features. When both are used for training, accuracy improves slightly, but that does not necessarily mean the combination has stronger indication strength; more experiments on a larger dataset should be conducted before we can draw that conclusion.
5.5 Per-Record Results
Segment-wise prediction is not our final goal here. We want to predict whether the representative is persuasive over a whole audio record. To accomplish this, we predict each segment of a test audio record using the previously trained models
0.646
0.82 0.822
0.570.684 0.694
0
0.2
0.4
0.6
0.8
1
Prosody Short-term Together
CustomerModelvs.RepresentativeModel
Customer Representative
![Page 38: Acoustic Feature-Based Sentiment Analysis of Call Center Datadslsrv1.rnet.missouri.edu/~shangy/Thesis/ZeshanPeng2017.pdf · Zeshan Peng Dr. Yi Shang, Thesis Supervisor DEC 2017 The](https://reader033.vdocuments.net/reader033/viewer/2022050511/5f9c6a643e8f324cdb296d2f/html5/thumbnails/38.jpg)
29
and take the majority vote of all the predictions as our final prediction for that audio record. We run experiments using different models for prediction, with three settings: 1) use the customer model only to predict the customer segments and take their majority vote as the final prediction for the record; 2) use the representative model only to predict the representative segments and take their majority vote as the final prediction; 3) use both the customer and the representative models to predict all segments in the record and take the overall majority vote as the final prediction. Below is the comparison result.
Figure 9. Model Comparison Based on Majority Votes Per Audio Record
(Data for Figure 9, per-record prediction accuracy:)

                Prosody  Short-term  Together
Customer         0.583     0.773      0.786
Representative   0.536     0.685      0.691
Together         0.571     0.739      0.750
The figure above shows that using the customer model alone gives the best performance. This matches the intuition that representatives behave similarly regardless of customer behavior: representatives stay consistent no matter how customers respond, while customers show different characteristics depending on whether their sentiment is positive or negative. Using both models gives intermediate performance.
5.6 Deep Learning Results
The figure below compares results for different numbers of convolutional layers when using either feature vectors or feature matrices as input to the neural network.
Figure 10. Prediction Accuracy vs. Number of Convolutional Layers
(Data for Figure 10, prediction accuracy. Feature vector: 0.623, 0.636, 0.647, 0.632; feature matrix: 0.673, 0.691, 0.658, 0.644.)
It can be seen that the four-convolutional-layer network performs best, and that using feature matrices gives better performance than using feature vectors, which suggests that temporal information between audio clips can be captured by a deep convolutional neural network.
Figure 11. 1-D CNN vs. 2-D CNN on Different Pooling Kernel Sizes
The figure above compares 1-D convolution against 2-D convolution for different pooling kernel sizes. From the result, we find that the 1-D convolutional neural network performs better than the 2-D one. This might be because there is no temporal relationship between each
individual short-term feature, since the columns of the matrix represent different short-term features. It also shows that prediction accuracy can be improved by applying a larger max-pooling kernel size.
Figure 12. Comparison on Different Window Sizes
This figure compares results for different window sizes used when extracting features from the audio records. It appears that a larger window size yields better performance. However, a larger window size also results in a smaller training dataset for the neural network, so whether this trend holds needs to be verified by conducting more experiments on a larger dataset.
(Figure 12 compares the 1-D CNN at window sizes of 50 ms, 100 ms, 300 ms, and 1000 ms.)
Figure 13. Comparison on Different Machine Learning Algorithms
This figure compares model performance across different machine learning algorithms. The deep convolutional neural network performs slightly better than the other methods listed, and short-term features again indicate sentiment more strongly than prosody features.
(Data for Figure 13, prediction accuracy:)

                    Prosody  Short-term  Together
SVM                  0.583     0.773      0.786
K-nearest neighbor   0.592     0.735      0.749
Feed-forward NN      0.614     0.792      0.796
Deep CNN             0.647     0.826      0.813
6. CONCLUSION
Audio sentiment analysis is a method of inferring the opinions of customers in conversations with customer-support representatives. Having a mature system to analyze sentiment would be very beneficial to the business world. This thesis presents two methods to address this problem. One is based on segment-wise feature-vector classification: every audio recording is split into segments by speaker turns, different sets of features are extracted from each segment, and a classification model is trained on these features. The other feeds feature matrices into a deep convolutional neural network and lets it learn the temporal information between consecutive audio clips; each feature matrix is constructed by stacking up feature vectors extracted from equal-length audio clips of the records. We compare the performance of models trained with different machine learning algorithms and show that there are strong connections between the sentiment we analyze and the features we extract. Short-term features, such as MFCCs, chroma, and timbre features, are better indicators than prosody features. Our results also show that deep convolutional neural networks can be applied not only to image datasets but also to non-image datasets such as feature matrices, and that they can capture the temporal information between consecutive audio clips within a feature matrix. We also present the problems that we encountered during our research; further work can be conducted to address them and improve the overall system performance.
7. FUTURE WORK
An automatic speaker diarization process could be embedded into this audio sentiment analysis pipeline, since we need to split each audio record into segments by speaker turns; this is especially important when transcripts are costly and time-consuming to obtain. Accurate text transcription could be another direction, since much work has been done on text-based audio sentiment analysis; if we can obtain a good text transcription of an audio file automatically, we can combine text-based techniques with our acoustic-feature-based techniques to improve the performance of the analysis system. The feature matrix is not the end: there may exist a better representation of features to feed into deep neural networks. Moreover, we can look for other acoustic features and better indicators to characterize audio data. Finally, more experiments should be conducted on a larger dataset to build more reliable models.
8. REFERENCES
[1] I. Lapidot, L. Aminov, T. Furmanov, and A. Moyal. Speaker diarization in commercial calls. 2014.
[2] A. X. Miro, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):356-370, Feb. 2012.
[3] P. Cyrta, T. Trzcinski, and W. Stokowiec. Speaker diarization using deep recurrent convolutional neural networks for speaker embeddings. ISAT, 2017.
[4] W. Medhat, A. Hassana, and H. Korashy. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, pages 1093-1113, 2014.
[5] B. Logan. Mel frequency cepstral coefficients for music modeling. In Proc. International Symposium on Music Information Retrieval, 2000.
[6] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293-302, 2002.
[7] H. Davis and S. M. Mohammad. Generating music from literature. In Proc. 3rd Workshop on Computational Linguistics for Literature, pages 1-10, 2014.
[8] D. P. W. Ellis. Classifying music audio with timbral and chroma features. In Proc. 8th International Conference on Music Information Retrieval (ISMIR), pages 339-340, 2007.
[9] D. Ververidis, C. Kotropoulos, and I. Pitas. Automatic emotional speech classification. In Proc. ICASSP, volume 1, pages 593-596, 2004.
[10] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82-97, 2012.
[11] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno. Automatic language identification using deep neural networks. In Proc. ICASSP, pages 5337-5341, 2014.
[12] F. Richardson, D. Reynolds, and N. Dehak. A unified deep neural network for speaker and language recognition. In Proc. INTERSPEECH, pages 1146-1150, 2015.
[13] T. Mikolov, M. Karafiat, L. Burget, J. H. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In Proc. INTERSPEECH, pages 1045-1048, 2010.
[14] S. J. Fulse, R. Sugandhi, and A. Mahajan. A survey on multimodal sentiment analysis. International Journal of Engineering Research and Technology (IJERT), 3(4):1233-1238, Nov. 2014.
[15] M. Sikandar. A survey for multimodal sentiment analysis methods. International Journal of Computer Technology and Applications (IJCTA), 5(4):1470-1476, July 2014.
[16] X. Hu and J. S. Downie. Improving mood classification in music digital libraries by combining lyrics and audio. In Proc. Joint Conference on Digital Libraries (JCDL), pages 159-168, 2010.
[17] J. Zhong, Y. Cheng, S. Yang, and L. Wen. Music sentiment classification integrating audio with lyrics. Information and Computational Science, 9(1):35-54, 2012.
[18] A. Jamdar, J. Abraham, K. Khanna, and R. Dubey. Emotion analysis of songs based on lyrical and audio features. International Journal of Artificial Intelligence and Applications (IJAIA), 6(3):35-50, 2015.
[19] T. Wang, D. Kim, K. Hong, and J. Youn. Music information retrieval system using lyrics and melody information. In Proc. Asia-Pacific Conference on Information Processing, pages 601-604, 2009.
[20] Y. Xia, L. Wang, K.-F. Wong, and M. Xu. Sentiment vector space model for lyric-based song sentiment classification. In Proc. ACL-08: HLT, Short Papers, pages 133-136, 2008.
[21] X. Hu, J. S. Downie, and A. F. Ehmann. Lyric text mining in music mood classification. In Proc. 10th International Conference on Music Information Retrieval (ISMIR), pages 411-416, 2009.
[22] B. G. Patra, D. Das, and S. Bandyopadhyay. Unsupervised approach to Hindi music mood classification. In Proc. Mining Intelligence and Knowledge Exploration (MIKE), pages 62-69, 2013.
[23] A. Kumar and T. M. Sebastian. Sentiment analysis on Twitter. International Journal of Computer Science (IJCSI), 9(4):372-378, July 2012.
[24] P. Gamallo and M. Garcia. Citius: A Naive-Bayes strategy for sentiment analysis on English tweets. In Proc. 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 171-175, Aug. 2014.
[25] R. Mihalcea. Multimodal sentiment analysis. In Proc. 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, July 2012.
[26] V. Rosas, R. Mihalcea, and L.-P. Morency. Multimodal sentiment analysis of Spanish online videos. IEEE Intelligent Systems, 28(3):38-45, 2013.
[27] S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain. Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing, 174(A):50-59, Jan. 2016.
[28] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46-53, 2013.
[29] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015.