Design and Implementation of the Note‐taking Style Haptic Voice Recognition for Mobile Devices
Seungwhan Moon
Franklin W. Olin College of Engineering
1000 Olin Way, Needham, MA
[email protected]
Khe Chai Sim
National University of Singapore
Computing 1, 13 Computing Drive, Singapore, Singapore 117417
14th ACM International Conference on Multimodal Interaction
DoubleTree Suites, Santa Monica, California
October 22‐26, 2012
Introduction
• Haptic Voice Recognition (HVR)
Haptic Input
Speech Input
Introduction
• Haptic Voice Recognition (HVR)
• Boundary of Sentence (BoS)
• Boundary of Word (BoW)
• First Letter of Word (FLoW)
…
• Synchronous
• Asynchronous
Note‐taking Style HVR
Motivation
Lecture Note
Haptic voice recognition
- combine speech / touch
- increases accuracy
Semantically meaningful keywords
Natural to write & take notes
Note‐taking Style HVR
meeting tom at 6 pm
Haptic Note Sequence
Haptic Input
Speech Input
Note‐taking Style HVR
1. An element in a haptic note sequence refers to a partially or fully spelled word in the decoded word sequence.
2. The number and the order of keywords in a haptic note sequence do not need to match those of words in the actual word sequence.
3. The exact time at which a haptic event occurs is ignored.
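The three properties above can be sketched as a small consistency check. This is only an illustration, not the authors' implementation: the function name `notes_consistent` is hypothetical, "partially spelled" is assumed to mean a prefix of the word, and each note is assumed to refer to a distinct word. No timing information is used, in line with property 3.

```python
def notes_consistent(notes, words):
    """Return True if every note can be assigned to a distinct word it prefixes.

    Assumptions (not from the slides): a partial spelling is a prefix,
    matching is case-insensitive, and two notes never share one word.
    """
    notes = [n.lower() for n in notes]
    words = [w.lower() for w in words]

    def assign(i, used):
        # Try to place note i on any unused word; the order of notes is irrelevant.
        if i == len(notes):
            return True
        return any(
            j not in used and w.startswith(notes[i]) and assign(i + 1, used | {j})
            for j, w in enumerate(words)
        )

    return assign(0, frozenset())


spoken = "meeting tom at 6 pm".split()
print(notes_consistent(["tom", "mee"], spoken))  # notes out of order, one partially spelled
print(notes_consistent(["tim"], spoken))         # no spoken word starts with "tim"
```

The backtracking search is exponential in the worst case, which is acceptable here because a note sequence only holds a handful of keywords.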
Note‐taking Style HVR
3 Types of Haptic Input Methods
1. Longhand Handwriting
2. Shorthand Handwriting
3. Virtual Keyboard
n o t e
N O T E
Note‐taking Style HVR
(Adapted) Gregg Shorthand Handwriting Recognition
1. Facilitates much faster and more effective input
2. Introduces ambiguity between letters that are phonetically similar
3. Adapted to HVR – uses isolated letters to spell a word.
Note‐taking Style HVR
Demo
Algorithm Design
W: Word sequence
L: Partial Lexical Information (PLI) sequence
O: Sequence of observed acoustic features
H: Sequence of observed haptic features
Haptic Voice Recognition: finding the jointly optimal solution for W, L given O, H.
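The objective stated above can be written out as follows. This is a reconstruction from the slide text, not the slide's own equation, which did not survive extraction. The second line is an assumption: it holds if the acoustic features O depend only on W and the haptic features H depend only on L.

```latex
\begin{align*}
(\hat{W}, \hat{L}) &= \arg\max_{W, L} \; P(W, L \mid O, H) \\
                   &= \arg\max_{W, L} \; p(O \mid W)\, p(H \mid L)\, P(L \mid W)\, P(W)
\end{align*}
```

Under that assumption the four factors play the roles of an acoustic model, a haptic model, a PLI model, and a language model respectively.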
Algorithm Design
: Lattice of multiple word sequence hypotheses
: PLI model
: Lattice of permutations of haptic note sequence
Shortest path of Eq. (2) → optimal solution for Eq. (1)
Weighted Finite State Transducer (WFST)
Algorithm Design
Using OpenFST …
fstcompile
fstcompose
fstshortestpath
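The compose-then-shortest-path idea can be illustrated without OpenFst. The sketch below is a toy, pure-Python stand-in and not the authors' system. It assumes epsilon-free machines with non-negative additive costs (the tropical semiring), reduces the word lattice to a list of scored hypotheses, and uses a single keyword constraint in place of the PLI model and the permutation lattice. All function names and example costs are hypothetical.

```python
import heapq
from collections import defaultdict

END = object()  # sentinel "super-final" state used by shortest_path


def compose(a, b):
    """Compose two epsilon-free WFSTs; weights add along matched arcs."""
    start = (a["start"], b["start"])
    arcs, finals = defaultdict(list), {}
    stack, seen = [start], {start}
    while stack:
        qa, qb = state = stack.pop()
        if qa in a["finals"] and qb in b["finals"]:
            finals[state] = a["finals"][qa] + b["finals"][qb]
        for i, o, w1, na in a["arcs"].get(qa, []):
            for i2, o2, w2, nb in b["arcs"].get(qb, []):
                if o == i2:  # output of a must match input of b
                    nxt = (na, nb)
                    arcs[state].append((i, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return {"start": start, "finals": finals, "arcs": dict(arcs)}


def shortest_path(fst):
    """Dijkstra search; returns (cost, output labels), or None if no path accepts."""
    tie = 0  # unique tie-breaker so the heap never compares states
    heap = [(0.0, tie, fst["start"], [])]
    done = set()
    while heap:
        cost, _, state, out = heapq.heappop(heap)
        if state is END:
            return cost, out
        if state in done:
            continue
        done.add(state)
        if state in fst["finals"]:
            tie += 1
            heapq.heappush(heap, (cost + fst["finals"][state], tie, END, out))
        for _i, o, w, nxt in fst["arcs"].get(state, []):
            tie += 1
            heapq.heappush(heap, (cost + w, tie, nxt, out + [o]))
    return None


def linear_lattice(hyps):
    """Acceptor with one path per (cost, words) hypothesis; cost sits on the first arc."""
    arcs, finals, next_id = defaultdict(list), {}, 1
    for cost, words in hyps:
        prev = 0
        for k, w in enumerate(words):
            arcs[prev].append((w, w, cost if k == 0 else 0.0, next_id))
            prev, next_id = next_id, next_id + 1
        finals[prev] = 0.0
    return {"start": 0, "finals": finals, "arcs": dict(arcs)}


def keyword_constraint(vocab, prefix):
    """Acceptor for word sequences containing at least one word starting with prefix."""
    arcs = {0: [], 1: []}
    for w in vocab:
        arcs[0].append((w, w, 0.0, 0))      # before the keyword is seen
        arcs[1].append((w, w, 0.0, 1))      # after the keyword is seen
        if w.startswith(prefix):
            arcs[0].append((w, w, 0.0, 1))  # this word satisfies the note
    return {"start": 0, "finals": {1: 0.0}, "arcs": arcs}


# Hypothetical recognizer output: the wrong name has the lower cost.
lattice = linear_lattice([(1.5, ["meeting", "tim", "at", "six"]),
                          (2.0, ["meeting", "tom", "at", "six"])])
vocab = {"meeting", "tim", "tom", "at", "six"}
print(shortest_path(lattice))                                           # speech only
print(shortest_path(compose(lattice, keyword_constraint(vocab, "tom"))))  # with note "tom"
```

With speech alone the cheaper hypothesis wins; composing with the note constraint removes every path that lacks a word starting with "tom", so the shortest path switches to the hypothesis the user intended.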
Experimental Results (1) Simulation
‐ Single user, 72 sentences, 100 iterations.
‐ N words (partially / fully spelled) are randomly chosen as artificial haptic events: NW3L / NW
‐ Under two signal‐to‐noise ratio (SNR) conditions: clean and 15 dB (artificially corrupted)
‐ Compared with FLoW and the Oracle Error Rate
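The artificial haptic events described above could be generated along these lines. This is a minimal sketch, assuming the keywords are drawn uniformly without replacement from the reference transcript; the function name `simulate_notes` and its parameters are hypothetical and not taken from the slides.

```python
import random


def simulate_notes(words, n, letters=None, rng=random):
    """Draw n distinct words at random and optionally keep only a prefix.

    letters=3 mimics the partially spelled condition (first 3 letters);
    letters=None keeps each word fully spelled. The returned order is random.
    """
    n = min(n, len(words))
    picked = rng.sample(range(len(words)), n)
    return [words[i][:letters] for i in picked]


reference = "meeting tom at 6 pm".split()
rng = random.Random(0)  # fixed seed so a run can be repeated
print(simulate_notes(reference, 3, letters=3, rng=rng))  # partially spelled keywords
print(simulate_notes(reference, 3, rng=rng))             # fully spelled keywords
```

Repeating the draw many times per sentence, as in the 100 iterations above, gives a distribution of error rates rather than a single number.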
Experimental Results
Figure: Simulation results (a) when performed without any additional noise and (b) when performed with artificial noise at SNR = 15dB. x‐axis denotes the number of randomly chosen keywords (N), whereas y‐axis denotes the word error rate (WER). The red and the blue lines refer to the Note‐taking‐style HVR performance with the first 3 letters of N randomly chosen words (N‐W3L), and the Note‐taking‐style HVR performance with N fully‐spelled words (N‐W). The error bars indicate the standard deviations of the 100 iterations.
(1) Simulation
Experimental Results (1) Simulation
‐ Notable improvement in the Word Error Rate (WER) for both NW3L and NW under both SNR conditions.
‐ Larger improvement for larger N, with a diminishing rate of improvement.
‐ Bottleneck at the Oracle Error Rate: performance depends on the quality of the speech recognizer.
‐ Large standard deviation of WER: the choice of keywords significantly affects the performance.
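For readers unfamiliar with the two metrics discussed above, the following sketch computes WER as word-level edit distance divided by reference length, and an oracle error rate as the lowest WER among a set of hypotheses. It follows the standard textbook definition and is not code from the study; the function names are hypothetical, and the oracle is taken over a plain list of hypotheses rather than a full lattice.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def oracle_error_rate(reference, hypotheses):
    """WER of the best hypothesis available among the candidates."""
    return min(word_error_rate(reference, h) for h in hypotheses)


print(word_error_rate("meeting tom at 6 pm", "meeting tim at 6 pm"))
```

Because haptic input can only select among hypotheses the recognizer already produced, the oracle error rate is a floor on what rescoring can achieve, which is the bottleneck noted above.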
Experimental Results (2) Preliminary User Studies
‐ Single user (72 sentences for each)
‐ 3 keywords (partially spelled: only the first 3 characters) are chosen: 3W3L
‐ 3 different input methods: Shorthand / Longhand / Keyboard
‐ Compared with BoS and FLoW
Experimental Results (2) Preliminary User Studies
Table: Five haptic methods were applied in this experiment: Boundary of Sentence (BoS), 3 Words and 3 Letters (3W3L) via Shorthand, Longhand, and Keyboard input, and First Letter of Word (FLoW). The table reports the Word Error Rate (WER), the Keyword Error Rate (KER), and the absolute improvement in the error rate from the Automatic Speech Recognition (ASR) results to the Haptic Voice Recognition (HVR) results.
Experimental Results (2) Preliminary User Studies
‐ Notable improvement in the Word Error Rate (WER) and the Keyword Error Rate (KER)
‐ Greater improvement in KER can enhance the user experience with the speech recognition system.
‐ Increased duration of speech: minimized by the use of partially spelled words and Gregg shorthand.
Conclusion
• Summary
– Improvement in WER, and particularly in KER
– Smaller increase in the duration of speech (Gregg Shorthand, partial spelling)
– Large standard deviations of WER
• Future Work
– HVR API
– Application in Spoken Document Retrieval (for Online Lectures, e‐Learning, Conferences, etc.)
‐ The End ‐