
Automatic transcription quality analysis on the AudioBNC corpus
Michael Goodale, Montreal Language Modeling Lab (Morgan Sonderegger), McGill University

What is AudioBNC?

AudioBNC is a corpus of spoken, spontaneous speech that was compiled in the 1990s by the British National Corpus (BNC) Consortium. It features recordings of different speakers of British English across dialects, ages, and social backgrounds. The dialogues were then transcribed and "aligned"; that is, the transcriptions were matched in time with the part of the recording they're in.
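To make "aligned" concrete: a time-aligned transcription pairs each word with the start and end times at which it occurs in the audio. The snippet below is only an illustrative sketch with invented words and timestamps, not data from AudioBNC.

```python
# Illustrative sketch: a time-aligned utterance as a list of
# (word, start_seconds, end_seconds) entries. All values are invented.
aligned_utterance = [
    ("well",  12.34, 12.58),
    ("i",     12.58, 12.66),
    ("think", 12.66, 13.01),
    ("so",    13.01, 13.40),
]

# A "bad" alignment is one where these timestamps no longer match
# where the words are actually spoken in the recording.
for word, start, end in aligned_utterance:
    print(f"{word:>6}  {start:6.2f}-{end:6.2f} s")
```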

What was the problem?

The AudioBNC was aligned automatically, and since the recordings were made unprofessionally, the alignment is plagued with problems: the audio and the transcription can be off by more than 30 seconds, which is untenable for any analysis. Since the corpus contains hundreds of hours of dialogue, an automatic method for determining recording quality (and by extension transcription quality) had to be found.

[Figure: examples of bad alignment vs. good alignment]

SPADE Project

The SPeech Across Dialects of English (SPADE) project is a collaboration between five universities seeking to develop user-friendly software that allows linguists to compare multiple English speech corpora and see how English has changed over time and space. AudioBNC was one such corpus, included to capture British speech as it was spoken in the 1990s, although it needed a significant overhaul to be ready for analysis.

ISCAN

ISCAN is a web interface that allows user-friendly interaction with multiple corpora. While the backend analysis is fairly stable and well developed, the frontend web interface is still under development, and contributing to it was one aspect of this project.

Classification

Ultimately, the classification was performed with a decision tree, chosen for its high degree of interpretability. Additionally, it was tuned to prefer high precision over high recall, i.e., to ensure that the utterances labeled "good" really were good, rather than to produce a very large "good" subset containing more false positives.
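The poster does not spell out the training setup; the sketch below shows one common way (using scikit-learn on toy placeholder data) to bias a decision tree toward high precision on the "good" class by penalising false "good" labels more heavily. The features, labels, class weights, and tree depth are all assumptions for illustration, not the project's actual configuration.

```python
# Hypothetical sketch: a decision tree tuned to prefer precision over
# recall on the "good" class. Toy data and weights, not the poster's
# actual setup.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Toy stand-in features per utterance: [alignment difference (s),
# duration (s), mean word length (chars), harmonic-to-noise ratio (dB)].
X = rng.random((200, 4))
# Toy labels: 1 = good alignment, 0 = bad alignment.
y = (X[:, 0] < 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Up-weighting the "bad" class makes the tree reluctant to call an
# utterance "good" unless the evidence is strong, trading recall for
# precision on the "good" label.
clf = DecisionTreeClassifier(max_depth=4, class_weight={0: 5.0, 1: 1.0})
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("precision (good):", precision_score(y_test, pred))
print("recall    (good):", recall_score(y_test, pred))
```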

Features

First, each conversation was split into utterances: continuous segments of speech separated by 150 ms or more of silence. Then each utterance was re-aligned with the Montreal Forced Aligner (MFA) so that the new alignment could be compared to the original one; large differences between the two alignments implied that the original audio was difficult to align. The other features used were utterance duration, word length, and harmonic-to-noise ratio.
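As a rough illustration of these features, the sketch below compares the original and re-aligned word boundaries of one utterance and collects the other per-utterance features. It assumes each alignment is available as a list of (word, start, end) tuples and takes the harmonic-to-noise ratio as a precomputed value, since the poster does not describe the actual data structures.

```python
# Hypothetical sketch of per-utterance feature extraction, assuming each
# alignment is a list of (word, start_seconds, end_seconds) tuples.
# The HNR is treated as a precomputed number here; in practice it would
# come from an acoustic analysis tool.

def alignment_difference(original, realigned):
    """Mean absolute difference (s) between matching word boundaries."""
    diffs = []
    for (_, s1, e1), (_, s2, e2) in zip(original, realigned):
        diffs.append(abs(s1 - s2))
        diffs.append(abs(e1 - e2))
    return sum(diffs) / len(diffs)

def utterance_features(original, realigned, hnr_db):
    words = [w for w, _, _ in original]
    return {
        "alignment_difference": alignment_difference(original, realigned),
        "duration": original[-1][2] - original[0][1],
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "harmonic_to_noise_ratio": hnr_db,
    }

# Invented example: the re-alignment drifts by roughly 0.2 s.
original  = [("well", 0.00, 0.25), ("i", 0.25, 0.32), ("think", 0.32, 0.70)]
realigned = [("well", 0.21, 0.45), ("i", 0.45, 0.53), ("think", 0.53, 0.91)]
print(utterance_features(original, realigned, hnr_db=12.3))
```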

Data Quality

About 30 minutes of dialogue was annotated by hand with notes as to whether it had a "good" or "bad" alignment. The causes of bad data included extremely quiet speech, overlapping conversations, loud background noise, and people playing with the audio equipment.
