automatic summarization of highly spontaneous …

AUTOMATIC SUMMARIZATION OF HIGHLY SPONTANEOUS SPEECH

András Beke – György Szaszák

ResearchInstitutefor Linguistics

Hungarian Academy of

Sciences

Dept.ofTelecommunicationsandMediaInformatics

University of Technology and Economics

DATAEXPLOSION:HUMANVSMACHINEPROCESSING

2006 2007 2008 2009 2010 2011

180EXABYTE

1,800EXABYTE

The UNSTUCTURED data explosion- Growing 10x every 5 years and 100x every 10 years- Requires a new approach

The human not able toread every documentumor to listen every audiofile, or watch everymovies!

STATISTICAL MACHINE LEARNING

STUCTURED data

AUTOMATICSUMMARIZATION

• informationretrieval• documentclustering• informationextraction• visualization• questionanswering• textsummarization

POSSIBLEAPPROACHES

TYPESOFSUMMARIZATION

• Indicative• Describesthedocumentanditscontents

• Informative• ‘Replaces’thedocument

• Extractive• Concatenatepiecesofexistingdocument

• Generative• Createsanewdocument

• Documentcompression

COMPARINGSPEECHANDTEXTSUMMARIZATION

• Identifyingimportantinformation

• Somelexical,discoursefeatures

• Extractionorgenerationorcompression

– Speech Signal– Prosodic features– NLP tools?– Segments –

sentences?– Generation?– Errors– Data size

ALIKE DIFFERENT

SentenceExtraction/Similaritymeasures(Salton,etal.1995)

• Extractsentencesbytheirsimilaritytoatopicsentenceandtheirdissimilaritytosentencesalreadyinsummary(MaximalMarginalRelativity)

• Similaritymeasures• CosineMeasure• VocabularyOverlap• Topicwordoverlap• ContentSignaturesOverlap

• Automaticsentencesegmentation(tokenization)iscrucialbeforesuchasentencebasedextractivesummarization(Liu andXi 2008).Thedifficultycomesnotonlyfromrecognitionerrors,butalsofrommissingpunctuationmarks,whichwouldbefundamentalinsyntacticparsingandPOStagging(disambiguation).

Our work

• This worksonextractivesummarizationusetwomajorsteps:

1) thefirststepisrankingthesentencesbasedontheirscoreswhicharecomputedbycombiningfeaturessuchastermfrequency(TF),positionalinformationandcuephrases;

2) thesecondstepconsistsinselectingafewtoprankedsentencestopreparethesummary

Our work• InthisworkwepresentaninitialefforttodevelopaHungarianspeechsummarizationsystem.

• Summarizationwillalsobecomparedtoabaselineversionusingtokensavailablefromhumanannotation.

• Incurrentworkweproposeaprosodybasedautomatictokenizerwhichrecoversintonational phrases(IP)anduseIPsassentencelikeunitsinfurtheranalysis.

• Thebaselinetokenizationreliesonacoustic(silence)andsyntactic-semantic(syntactically orsemanticallycloselytogetherbelonging)axes.

• InHungarian,bothspeechrecognitionandtext-basedsyntacticalanalysisaredifficultcomparedtoEnglishduetotheveryrichmorphologyofthelanguage.

MATERIALAND

SPEECH-TO-TEXT

SPEECHMATERIAL

• 4interviewsfromtheBEAHungarianSpontaneousSpeechdatabase

• Participantstalkabouttheirjobs,family,andhobbies.• Threeofthespeakersaremaleandoneofthemisfemale.

• AllspeakersarenativeHungarian,livinginBudapest(agedbetween30and60).

• Thetotalmaterialis28minuteslong(averagedurationwas7minutesperparticipant).

TheSpeech-to-TextSystem• Weuse160interviewsfromBEA,accountingfor120hoursofspeech(theinterviewerdiscarded)totrainSpeech-to-text(S2T)acousticmodels.

• Speakersinvolvedinthe4interviewsusedforsummarizationareheldout.

• UsingtheKalditoolkitwetrain3hiddenlayerDNNacousticmodelswith2048neuronsperlayerandtanh non-linearityon160interviewsfromBEA(Hungarian).

• Inputdatais9xsplicedMFCC13+CMVN+LDA/MLLT.• Atrigramlanguagemodelistrainedontranscriptsofthe160interviewsaftertextnormalization,withKneser-Neysmoothing.Dictionariesareobtainedusingarule-basedphonetizer (spokenHungarianisveryclosetothewrittenform).

WordError Rate (WER)was found around 44%for this task.This relative highWERisjustified by the high spontaneity ofspeech.Stem error rate was found to besomewhat smaller,39%.

UTTERANCESEGMENTATION(THEIP-TOKENIZER)• ThissystemusesphonologicalphrasemodelsandalignsthemtotheinputspeechbasedonprosodicfeaturesF0andmeanenergy. (Szaszák andBeke2012).

• Inthisworkweuseittoobtainsentence-liketokensfromspeech-to-textoutput.

WeusetheIPtokenizerinanoperatingpointwithhighprecision(96%onreadspeech)andlowerrecall(80%onreadspeech).

THE SUMMARIZATION APPROACH

BLOCKDIAGRAM

PRE-PROCESSING

• Stopwordsareremovedfromthetokensandstemmingisperformed.Stop-wordsarecollectedintoalist,whichcontains

• allwordstaggedasfillersbytheS2Tcomponent(speakernoise)and

• apredefinedsetofnon-contentwordssuchasarticles,conjunctionsetc.

• Themagyarlánc toolkit(Zsibrita etal.2013) wasusedforthestemmingandPOS-taggingoftheHungariantext.

• Thewordsarefilteredtokeeponlynouns.

TEXTUALFEATUREEXTRACTION(WORDLEVEL)

• TF-IDF(TermFrequency- InverseDocumentFrequency)reflectstheimportanceofasentenceandisgenerallymeasuredbythenumberofkeywordspresentinit.TheimportancevalueofasentenceiscomputedasthesumofTF-IDFvaluesofitsconstituentwords(inthiswork:nouns)dividedbythesumofallTF-IDFvaluesfoundinthetext.

• LatentSemanticAnalysis(LSA)exploitscontexttotrytofindwordswithsimilarmeaning.LSAisabletoreflectbothwordandsentenceimportance.SingularValueDecomposition(SVD)isusedtoassesssemanticsimilarity.

TEXTUALFEATUREEXTRACTION(SENTENCELEVEL)• PositionalValue:themoremeaningfulsentencescanbefoundatthebeginningofthedocument.

• Thisisevenmoretrueincaseofspontaneousnarratives,astheintervieweraskstheparticipanttotellsomethingabouther/hislife,job,hobbies

Pk =1/√kwherethePk isthepositionalscoreofkth sentence.

• Sentenceranking:TherankingscoreRSK iscalculatedasthelinearcombinationoftheso-calledthematictermbasedscoreSk andpositionalscorePk.Thefinalscoreofasentencekis:

whereαisthelower,βistheuppercut-offforthesentenceposition (0≤α,β≤1)andLL isthelowerandLU istheuppercut-offonthesentencelengthLk.

• Sentencelength: Usuallyashortsentenceislessinformativethanalongeroneandhence,readersorlistenersaremorepronetoselectalongersentencethanashortonewhenaskedtofindgoodsummarizingsentencesindocuments.Ifasentenceistooshortortoolong,itisassignedarankingscoreof0.

SUMMARYGENERATION

Thelaststepistogeneratethesummary.InthisprocesstheN-toprankedsentencesareselectedfromthetext(Sarkar 2012).WesetNto10,sothefinaltextsummarycontainsthetop10sentences.

EVAULATION

METRICS

Compare reference summary andsystem summary• OBJECTIVE

• F1-measure:soft comparison• ROUGE:hard comparison

Create reference summary

• 10participantswereaskedtoselectupto10sentencesthattheyfindtobethemostinformativeforagivendocument(presentedalsoinspokenandinwrittenform).

• Participantsused6.8sentencesonaveragefortheirsummaries.Foreachnarrative,asetofreferencesentenceswascreated:sentenceschosenbyatleast1/3oftheparticipantswereaddedtothereferencesummary.

EXPERIMENTS

EXPERIMENST3setups:• OT-H:Usetheoriginaltranscribedtextassegmentedbythehumanannotatorsintosentence-likeunits.

• S2T-H:Usespeech-to-textconversiontoobtaintext,butusethehumanannotatedtokens.

• S2T-IP:Usespeech-to-textconversiontoobtaintextandtokenizeitbasedonIPboundarydetectionfromspeech.

Soft comparison Hard comparison

Setup Approach Recall % Precision % F1 Recall Precision F1

OT-HTF-IDF 0.51 0.76 0.61 0.36 0.28 0.32LSA 0.36 0.71 0.46 0.36 0.3 0.32

S2T-HTF-IDF 0.51 0.8 0.61 0.34 0.29 0.31LSA 0.49 0.77 0.56 0.39 0.27 0.32

S2T-IPTF-IDF 0.62 0.79 0.68 0.33 0.28 0.30LSA 0.59 0.78 0.65 0.33 0.32 0.32

CONCLUSION

CONCLUSION

• ThispaperaddressedspeechsummarizationforhighlyspontaneousHungarian.GiventhishighdegreeofspontaneityandalsotheheavyagglutinatingpropertyofHungarian,webeleive theobtanied resultsarepromisingastheyarecomparabletoresultspublishedforotherlanguages(Campr,M.,Ježek 2015).

• TheproposedIPdetectionbasedtokenizationwasassuccessfulastheavailablehumanone.Theoverallbestresultswere62%recalland79%precision(F1=0.68).

automatic summarization of highly spontaneous …

Documents