
    EuroVis Workshop on Visual Analytics (2020)
    K. Vrotsou and C. Turkay (Editors)

    Interactive Visualization of AI-based Speech Recognition Texts

    Tsung Heng Wu and Ye Zhao† and Md Amiruzzaman

    Kent State University

    Abstract
    Speech recognition technology has recently achieved impressive success with AI techniques based on deep learning networks. Speech-to-text tools are becoming prevalent in many social applications such as field surveys. However, the speech transcription results are far from perfect for direct use by domain scientists and practitioners, which prevents users from fully leveraging the AI tools. In this paper, we show that interactive visualization can play important roles in post-AI understanding, editing, and analysis of speech recognition results by presenting a task characterization and case examples.

    1. Introduction

    Speech recognition is an important field in computational linguistics [CRS05, CFL13]. For many years, researchers have developed a variety of technologies and tools to identify words and phrases in spoken language [JM14, BMG∗16, HM15a]. Recently, AI techniques, especially deep learning networks, have become revolutionary as they outperform previous methods and lead to high quality and low error rates in speech-to-text results [HDY∗12, MLJ∗14]. Cloud-based speech-to-text services have been provided by many big companies, such as Microsoft [Mic] and Google [Goo19], using deep learning models. Users from multiple domains are eager to utilize these AI tools for real-world applications such as conducting field surveys and collecting user opinions [BZK12, HM15b, Muh15]. However, the transcription results still have a set of practical problems, including: (1) a full speech is recognized as a group of fragments, which usually do not represent natural sentences or paragraphs from the speaker; (2) errors of audio recognition are inevitable and the quality varies greatly; (3) the confidence scores of words and fragments given by the speech recognition algorithms sometimes do not reflect the real probability of misrecognition. These problems have already hindered the more widespread use of speech-to-text tools [KRS17]. Domain scientists face challenges in effectively completing the following tasks when collecting lengthy audio recordings from multiple speakers:

    • Understanding the characteristics of speeches as well as speakers;

    • Manually editing the massive speech-to-text results;
    • Analyzing and comparing multiple speeches and speakers.

    † Corresponding author: [email protected]

    Figure 1: Visual exploration loop in the speech-to-text pathway.

    Interactive visual exploration can help users with these tasks in post-AI processing [KAKC17, KCK∗18]. Fig. 1 illustrates the pathway of data processing, where a visual exploration loop is shown in red. It can enhance the usability of AI-based speech recognition tools. This short paper contributes to this pathway in two facets:

    • We identify the characteristics and shortcomings of AI-based speech-to-text results. Then, we propose major directions in which visualization techniques can promote their usability.
    • We develop a prototype to showcase the usefulness of visualization tools, which combines visual metaphors with semantic and sentiment analysis.

    The purpose of this paper is to show how visual analytics can help to address the uncertainty issues in AI-based speech recognition.

    2. Related Work

    Traditional speech recognition systems use hidden Markov and Gaussian mixture models to represent the acoustic input in speech recognition [JM14]. Well-trained deep neural networks with many hidden layers, possibly in combination with hidden Markov models, have outperformed previous approaches on speech recognition benchmarks [GMH13]. Recurrent neural networks (RNNs) model speech as a dynamic time process, whose hidden state is a function of all previous hidden states [HDY∗12]. Many research and commercial products, such as those from Microsoft, Google, Apple, and IBM, as well as TensorFlow, have become usable for acoustic modeling and speech recognition [HDY∗12].


    Figure 2: Speech-to-text result of the Google Cloud tool. Multiple fragments (separated by background color) have different lengths and confidence values.

    Visualization of neural networks has been an emerging topic for inspecting the training, improving the models, and understanding the results [RFFT17, LSL∗17, ZZ18, HKPC18]. While most techniques work on vision datasets with a focus on CNNs (convolutional neural networks), RNN visualization tools have been developed for linguistic, biological, and vision tasks. They focus on analyzing hidden state properties, phrase structures, chord progressions, and so on [KJFF15, SGPR18, WPW∗11].

    Visualization has also been applied to temporal event sequence data, specifically to identify potential privacy issues in event-based or time-varying data [CWM19]. Moreover, researchers have presented various approaches for visual analytics of text corpora [LWC∗18]. Visual sentiment analysis has been addressed, where word clouds and other techniques are employed [KK15]. However, visualization techniques have not been applied to speech-to-text recognition results. This paper identifies the unique features and requirements of these AI outputs and visualizes them in a few examples, which can help end-users better utilize these AI tools.

    3. AI-based Speech-to-Text Outcome and Attributes

    3.1. Text Fragments and Confidence

    AI tools transcribe an audio speech into a text document (i.e., transcript) consisting of a list of speech fragments. Each fragment is a natural language segment of the speaker's narration based on their talking speed, stops, and other attributes. It consists of multiple terms (i.e., keywords), while each term has a unique term speaking length (audio length), which is different from its text word length. As shown in the example in Fig. 2, these fragments do not necessarily represent sentences as in written documents. The list is not easy to read and understand in comparison to a written document.

    Second, the AI model usually provides a term confidence score for each term. Moreover, a fragment confidence score is also given for each speech fragment. These confidence scores assess the reliability of automatic speech transcriptions. They provide cues to audio recognition errors for applications [RLGW18]. This confidence information and the related errors need to be addressed with a human in the loop. Moreover, each term may be marked with a speaker tag, which indicates that the AI recognized different speakers.

    An important feature is that recognition errors may not be well indicated by the confidence scores. Other features, such as audio length, need to be used to help users address the errors; for example, a term recognized with a very long audio length usually indicates a failed word recognition.
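    As a concrete illustration, the sketch below shows one possible way to model such a transcript in code. It is a minimal sketch with assumed field names (e.g., speakerTag, startTime) and an illustrative 2-second threshold; it is not the authors' implementation or any specific vendor's API.

```typescript
// Illustrative data model for an AI speech-to-text result with the
// attributes described above; all names and values are assumptions.
interface Term {
  text: string;          // recognized keyword
  confidence: number;    // term confidence score in [0, 1]
  startTime: number;     // seconds into the audio
  endTime: number;       // seconds into the audio
  speakerTag?: number;   // optional speaker id from diarization
}

interface Fragment {
  terms: Term[];
  confidence: number;    // fragment-level confidence score
}

// Audio (speaking) length of a term, as opposed to its text length.
function audioLength(t: Term): number {
  return t.endTime - t.startTime;
}

// Heuristic from the text: an unusually long recognized term often
// signals a failed recognition even when its confidence looks fine.
function suspiciouslyLong(t: Term, thresholdSec = 2.0): boolean {
  return audioLength(t) > thresholdSec;
}
```

    With such a model, the fragment and term attributes discussed above (audio length, confidence, speaker tag) can be attached directly to the visual encodings described in the following sections.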

    3.2. Semantic and Sentiment Attributes

    The attributes of term/fragment length and confidence can be visualized together with semantic and sentiment information, so as to promote deep understanding and allow quick revision. We have computed the following attributes (to be extended by using more NLP and text mining tools):

    • Term frequency: The keywords in the text are ranked by their appearance frequency to identify the top terms in a speech;
    • Term and fragment sentiment: Sentiment analysis can identify whether the expressed opinion in a document or a sentence is positive, negative, or neutral (e.g., AFINN [Nie11], MAN [JTL∗20], SentiDiff [WNY19]). This is very helpful for discovering speakers' attitudes and emotions. In this paper, we apply a sentiment analysis tool to compute the sentiment score of each fragment and each keyword [NPM]. The AFINN English word list is utilized, where each term is rated as an integer between minus five (negative) and plus five (positive) [Nie11]. Each fragment is given a sentiment score by summing up the sentiment integers of all its terms;
    • Term entropies: The same term in a speech can appear multiple times (N). Each appearance may have different attribute values d_i, i ∈ (1..N), such as audio length and confidence. The entropy of one term attribute shows the diversity of this attribute throughout the speech. For example, a high term entropy of audio length indicates that a speaker uses diverse vocal lengths for the same word. We compute the entropies of different attributes as H = −∑_{i∈(1..N)} p(d_i) log p(d_i), where p(d_i) = d_i / ∑_{j∈(1..N)} d_j (see the code sketch after this list).
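    A minimal sketch of these attribute computations is given below, assuming an AFINN-style lookup table has been loaded elsewhere (e.g., from the word list of [Nie11]); the function and type names are illustrative, not the authors' code.

```typescript
// Sketch of the attribute computations described above.
type Afinn = Map<string, number>;   // term -> integer rating in [-5, 5]

// Fragment sentiment: sum of the AFINN ratings of its terms.
function fragmentSentiment(terms: string[], afinn: Afinn): number {
  return terms.reduce((sum, t) => sum + (afinn.get(t.toLowerCase()) ?? 0), 0);
}

// Term frequency over the whole transcript.
function termFrequency(terms: string[]): Map<string, number> {
  const freq = new Map<string, number>();
  for (const t of terms) freq.set(t, (freq.get(t) ?? 0) + 1);
  return freq;
}

// Entropy of one attribute (e.g., audio length) over the N appearances
// of the same term: H = -sum_i p(d_i) log p(d_i), p(d_i) = d_i / sum_j d_j.
function attributeEntropy(values: number[]): number {
  const total = values.reduce((a, b) => a + b, 0);
  if (total <= 0) return 0;
  return -values.reduce((h, d) => {
    const p = d / total;
    return p > 0 ? h + p * Math.log(p) : h;
  }, 0);
}
```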

    4. Identification of Visualization Tasks and Functions

    We talked with three social scientists (in public health, crime, urban study, and disaster management) who have used speech recognition in their work. From this requirement analysis, several tasks are identified for interactive visualization systems working on the recognized results (i.e., text broken down into speech fragments):

    • T1: Helping users quickly understand the structures and characteristics of the recognized text;
    • T2: Allowing users to discover and compare semantics of speeches and sentiments of speakers;
    • T3: Guiding the revision of the transcript to achieve better quality for downstream applications.

    Addressing each task of T1 to T3 with visualization techniques and systems is challenging; for example, for T3, developing an interactive interface for users to quickly and effectively edit and revise a large number of transcripts requires intensive system design, development, and evaluation. In this short paper, we coordinate the following visualization functions in an integrated prototype:

    Transcript and Fragment Visualization: The speech-to-text results are visualized based on the recognized speech fragments with multiple attributes. Users can study the structure and patterns, and then drill down to details by listening to the audio clip and visualizing the terms in the fragments.

    Term Visualization: The keywords are visualized to show and integrate important attributes such as frequency, audio length, and confidence. Users can drill down to study and compare related audio clips and their details.

    Figure 3: Visual interface with Obama's Civil Rights Speech, 2015. (A) A bar chart view showing the confidence values of the fragments in this speech (green means high confidence and red means low confidence). Hovering over a fragment f shows its text; (B) Histogram charts of speech attributes; (C) The word bars of f: each bar's height shows the word's audio length and its color shows the word confidence. The word "privilege" has low confidence; (D) The text view of f, where "privilege" is highlighted (yellow text and yellow bar) while listening to the audio. (E) Control panel of visualization functions.

    Sentiment Visualization: A speaker's sentiments in a speech are visualized, together with audio features, so that users can identify the speaker's attitude and find important parts of the speech.

    Comparative Visualization: Multiple speeches from the same or different speakers can be compared for insight discovery.

    These functions do not completely cover the visual analytics tasks T1-T3; instead, they are shown as a modest spur to encourage further valuable research.

    5. Prototype Visualization Design

    The text results from AI recognition are visualized according to the following design requirements: (1) The whole picture of the transcription structure should be displayed, with respect to the fragments and keywords; (2) The confidence and other attributes of fragments and keywords should be easily discerned; (3) The screen space should be well utilized to show this information.

    Transcript Overview: A bar chart view is designed for the overview of a speech, as shown in Fig. 3(A). It maps each fragment to a bar, and all the bars of one transcribed text are sequentially visualized, in order to represent as many data items as possible on the screen at the same time. The bars are colored and highlighted by different attributes selected by users. This view allows users to easily "browse" the document, where they can hover over and click the bars to hear the raw audio and read the text. Instead of combining the full text with confidence, the bar-chart visualization can present the full document in limited screen real estate. Fig. 3(A) visualizes the 2015 speech of President Obama about civil rights. The bar length is mapped to the audio length of a fragment, and its color represents the fragment confidence. Meanwhile, a set of histograms of fragment confidence, audio length, and sentiment is shown in Fig. 3(B). Users can click on a histogram to highlight specific fragments of interest.

    Figure 4: Visualizing words in Obama's speech with their various attributes. Word size is the entropy of a word's audio lengths over its multiple appearances, and color is the minimum confidence (green means high confidence and red means low confidence).
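    A minimal sketch of this fragment-to-bar encoding is given below; the pixel scale, the clamping, and the red-to-green color ramp are assumptions for illustration rather than the prototype's actual implementation.

```typescript
// Illustrative fragment-to-bar encoding for the transcript overview:
// bar length ~ fragment audio length, bar color ~ confidence
// (red = low, green = high). All constants are assumed.
interface FragmentSummary { audioLengthSec: number; confidence: number; }

function fragmentBar(f: FragmentSummary, pixelsPerSecond = 20) {
  const c = Math.max(0, Math.min(1, f.confidence));   // clamp to [0, 1]
  return {
    widthPx: f.audioLengthSec * pixelsPerSecond,      // length encodes audio length
    color: `rgb(${Math.round(255 * (1 - c))}, ${Math.round(200 * c)}, 60)`, // red -> green
  };
}
```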

    Fragment Drill-down: Users can click a fragment bar to study its details and play its audio clip. In Fig. 3(A), a fragment f is clicked, so that a fragment bar view is shown in Fig. 3(C). Here, each bar refers to one word in this fragment. The words are visualized in Fig. 3(D). The terms are highlighted in yellow dynamically while the audio is playing. For example, the word "privilege" has low confidence, and users can click it to listen to the raw audio repeatedly. This view allows users to identify erroneous recognitions and correct them as needed.

    Figure 5: Comparing two speeches: President Obama on civil rights (2015) vs. President Trump at the United Nations (2018).

    Figure 6: Finding speech recognition errors in the well-known speech "I have a dream" by Martin Luther King Jr., 1963.

    Term Visualization: The attributes of transcribed terms (keywords) are visualized to (1) find suspicious recognitions and (2) discover important semantic information together with audio attributes. The visual design thus should give intuitive and fast cues for users to extract this information. A word cloud view is utilized to visualize the keywords in a speech, whose size and color can be assigned according to different attributes. As shown in Fig. 4, terms are shown when they appear multiple times and have a large entropy of audio lengths. A large word indicates that it is spoken with very different lengths by the speaker(s), and the color of a word indicates its confidence. Here, the term "president" is clicked for drill-down study due to its large entropy. The scatter plot shows all of its instances, where the x-axis shows confidence and the y-axis audio length. Users can click any instance to highlight it (in brown) and listen to the corresponding fragment.
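    The per-term aggregates behind this word cloud can be computed as in the following sketch, which assumes a flat list of term instances similar to the model sketched in Section 3; the names and the "at least two appearances" cutoff are illustrative.

```typescript
// Word-cloud attributes for Fig. 4: size = entropy of a term's audio
// lengths across its appearances, color = its minimum confidence.
interface TermInstance { audioLength: number; confidence: number; }

function entropy(values: number[]): number {
  const total = values.reduce((a, b) => a + b, 0);
  return total <= 0 ? 0
    : -values.reduce((h, d) => (d > 0 ? h + (d / total) * Math.log(d / total) : h), 0);
}

function wordCloudAttributes(instances: Map<string, TermInstance[]>) {
  const out = new Map<string, { size: number; minConfidence: number }>();
  for (const [term, list] of instances) {
    if (list.length < 2) continue;          // only terms that appear multiple times
    out.set(term, {
      size: entropy(list.map(i => i.audioLength)),
      minConfidence: Math.min(...list.map(i => i.confidence)),
    });
  }
  return out;
}
```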

    Sentiment and Comparative Visualization: It is important to show the sentiments of speakers together with their speaking characteristics. The sentiment scores between -5 and 5 can be mapped to the color of a fragment or word, as shown in Fig. 5. Different speeches can be compared to visually discover speakers' similarities and differences quickly and intuitively (see Fig. 5 for side-by-side views of two speeches). More details are discussed in the case studies (see Sec. 6).
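    One possible color encoding of the sentiment scores is sketched below; the specific colors are illustrative assumptions (the figure legend uses dark colors for negative and yellow-orange for positive scores), not the authors' palette.

```typescript
// Illustrative diverging color scale for sentiment scores in [-5, 5]:
// negative scores move toward a dark purple, positive toward orange.
function sentimentColor(score: number): string {
  const t = Math.max(-5, Math.min(5, score)) / 5;     // normalize to [-1, 1]
  const lerp = (a: number, b: number) => Math.round(a + (b - a) * Math.abs(t));
  return t < 0
    ? `rgb(${lerp(128, 60)}, ${lerp(128, 30)}, ${lerp(128, 90)})`    // toward dark purple
    : `rgb(${lerp(128, 250)}, ${lerp(128, 160)}, ${lerp(128, 30)})`; // toward orange
}
```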

    6. Case Studies

    Comparing two speeches: Two speeches are visualized for comparison in Fig. 5. They are the Civil Rights Speech (2015) by President Obama (Fig. 5(A)) and the Speech at the United Nations (2018) by President Trump (Fig. 5(B)). From the overview of fragments colored by sentiment scores, it can be seen that Trump's talk is more positive than Obama's, since its fragment bars have more yellow-orange colors (see the color legend in Fig. 5(C)). But his talk also has very dark, negative fragments. This information can be discovered from the statistics in Fig. 5(F). In Fig. 5(D), Obama's speech has larger confidence values (green arrow), which shows that his talk is relatively clearer for the speech-to-text tool. Fig. 5(E) indicates that Trump tends to use longer audio terms (purple arrow), which may be slow and/or emphasized.

    Finding speech recognition errors: Users can define combined conditions over different attributes to investigate recognized fragments or terms. In the example of Fig. 6, the well-known speech "I have a dream" is transcribed by the Google Speech-to-Text tool. It is a public speech delivered by American civil rights activist Martin Luther King Jr. on August 28, 1963. From the histogram views in Fig. 6(B), it can be seen that there exist some low-confidence terms below 0.7. By selecting this range and finding terms with a single appearance in the speech (Fig. 6(C)), several suspicious keywords are shown in Fig. 6(A). By checking them in a detail view or listening to the audio, errors can be recognized and later corrected manually.
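    The combined condition used in this case can be expressed as a simple filter, sketched below under the assumption of a flat list of recognized terms with confidence scores; the 0.7 threshold follows the example above.

```typescript
// Select terms whose confidence falls below a threshold AND that
// appear only once in the speech, as candidates for manual checking.
interface RecognizedTerm { text: string; confidence: number; }

function suspiciousTerms(terms: RecognizedTerm[], maxConfidence = 0.7): RecognizedTerm[] {
  const counts = new Map<string, number>();
  for (const t of terms) {
    const key = t.text.toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return terms.filter(
    t => t.confidence < maxConfidence && counts.get(t.text.toLowerCase()) === 1
  );
}
```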

    7. Conclusion and Discussion

    Speech-to-text AI tools are prevalent, but the results often need to be revised and investigated to discover errors and understand speech/speaker features. A set of visualizations is presented for this type of emerging data, which is useful for many post-processing tasks in a variety of applications. This paper does not present a complete design and user study for all potential visualization functions; instead, we show a preliminary prototype in this short paper. In the future, it will be extended in several facets: (1) the visual design will be further improved with alternatives; (2) effective guided interaction will be explored; and (3) audio signal processing attributes can be integrated. The system will also be evaluated by domain users with a formal user study.

    Acknowledgement

    The authors would like to thank the anonymous reviewers. This work was supported in part by the U.S. NSF under Grant 1739491.


    References

    [BMG∗16] BURGOON J., MAYEW W. J., GIBONEY J. S., ELKINS A. C., MOFFITT K., DORN B., BYRD M., SPITZLEY L.: Which spoken language markers identify deception in high-stakes settings? Evidence from earnings conference calls. Journal of Language and Social Psychology 35, 2 (2016), 123–157.

    [BZK12] BUMBALEK Z., ZELENKA J., KENCL L.: Cloud-based assistive speech-transcription services. In International Conference on Computers for Handicapped Persons (2012), Springer, pp. 113–116.

    [CFL13] CLARK A., FOX C., LAPPIN S.: The Handbook of Computational Linguistics and Natural Language Processing. John Wiley & Sons, 2013.

    [CRS05] COLLINS M., ROARK B., SARACLAR M.: Discriminative syntactic language modeling for speech recognition. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (2005), Association for Computational Linguistics, pp. 507–514.

    [CWM19] CHOU J.-K., WANG Y., MA K.-L.: Privacy preserving visualization: A study on event sequence data. In Computer Graphics Forum (2019), vol. 38, Wiley Online Library, pp. 340–355.

    [GMH13] GRAVES A., MOHAMED A. R., HINTON G.: Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013), pp. 6645–6649. arXiv:1303.5778, doi:10.1109/ICASSP.2013.6638947.

    [Goo19] GOOGLE: Google Cloud Speech-to-Text, 2019. URL: https://cloud.google.com/speech-to-text/.

    [HDY∗12] HINTON G., DENG L., YU D., DAHL G. E., MOHAMED A., JAITLY N., SENIOR A., VANHOUCKE V., NGUYEN P., SAINATH T. N., KINGSBURY B.: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29 (2012), 82–97. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6296526&isnumber=6296521, arXiv:1207.0580.

    [HKPC18] HOHMAN F. M., KAHNG M., PIENTA R., CHAU D. H.: Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics (2018). doi:10.1109/TVCG.2018.2843369.

    [HM15a] HIRSCHBERG J., MANNING C. D.: Advances in natural language processing. Science 349, 6245 (2015), 261–266.

    [HM15b] HOSSAIN M. S., MUHAMMAD G.: Cloud-assisted speech and face recognition framework for health monitoring. Mobile Networks and Applications 20, 3 (2015), 391–399.

    [JM14] JURAFSKY D., MARTIN J.: Speech and Language Processing, vol. 3. 2014.

    [JTL∗20] JIANG N., TIAN F., LI J., YUAN X., ZHENG J.: MAN: Mutual attention neural networks model for aspect-level sentiment classification in SIoT. IEEE Internet of Things Journal (2020).

    [KAKC17] KAHNG M., ANDREWS P. Y., KALRO A., CHAU D. H. P.: ActiVis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2017), 88–97.

    [KCK∗18] KWON B. C., CHOI M.-J., KIM J. T., CHOI E., KIM Y. B., KWON S., SUN J., CHOO J.: RetainVis: Visual analytics with interpretable and interactive recurrent neural networks on electronic medical records. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 299–309.

    [KJFF15] KARPATHY A., JOHNSON J., FEI-FEI L.: Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078 (2015). URL: http://arxiv.org/abs/1506.02078.

    [KK15] KUCHER K., KERREN A.: Text visualization techniques: Taxonomy, visual survey, and community insights. In IEEE Pacific Visualization Symposium (2015), pp. 117–121. doi:10.1109/PACIFICVIS.2015.7156366.

    [KRS17] KISLER T., REICHEL U., SCHIEL F.: Multilingual processing of speech via web services. Computer Speech & Language 45 (2017), 326–347.

    [LSL∗17] LIU M., SHI J., LI Z., LI C., ZHU J., LIU S.: Towards better analysis of deep convolutional neural networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 91–100. doi:10.1109/TVCG.2016.2598831.

    [LWC∗18] LIU S., WANG X., COLLINS C., DOU W., OUYANG F., EL-ASSADY M., JIANG L., KEIM D. A.: Bridging text visualization and mining: A task-driven survey. IEEE Transactions on Visualization and Computer Graphics 25, 7 (2018), 2482–2504.

    [Mic] MICROSOFT: Microsoft Azure Speech to Text.

    [MLJ∗14] MOU L., LI G., JIN Z., ZHANG L., WANG T.: TBCNN: A tree-based convolutional neural network for programming language processing. arXiv preprint arXiv:1409.5718 (2014).

    [Muh15] MUHAMMAD G.: Automatic speech recognition using interlaced derivative pattern for cloud based healthcare system. Cluster Computing 18, 2 (2015), 795–802.

    [Nie11] NIELSEN F. A.: A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on Making Sense of Microposts: Big Things Come in Small Packages (2011), pp. 93–98.

    [NPM] NPM: AFINN-based sentiment analysis for Node.js. URL: https://www.npmjs.com/package/sentiment.

    [RFFT17] RAUBER P. E., FADEL S. G., FALCÃO A. X., TELEA A. C.: Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 101–110. doi:10.1109/TVCG.2016.2598838.

    [RLGW18] RAGNI A., LI Q., GALES M., WANG Y.: Confidence estimation and deletion prediction using bidirectional recurrent neural networks. In IEEE Workshop on Spoken Language Technology (2018). URL: http://arxiv.org/abs/1810.13025.

    [SGPR18] STROBELT H., GEHRMANN S., PFISTER H., RUSH A. M.: LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 667–676. doi:10.1109/TVCG.2017.2744158.

    [WNY19] WANG L., NIU J., YU S.: SentiDiff: Combining textual information and sentiment diffusion patterns for Twitter sentiment analysis. IEEE Transactions on Knowledge and Data Engineering (2019).

    [WPW∗11] WU Y., PROVAN T., WEI F., LIU S., MA K.-L.: Semantic-preserving word clouds by seam carving. In Computer Graphics Forum (2011), vol. 30, Wiley Online Library, pp. 741–750.

    [ZZ18] ZHANG Q., ZHU S.-C.: Visual interpretability for deep learning: A survey, 2018. URL: http://arxiv.org/abs/1802.00614.
