Speech User Interface

Pervasive Information Access

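Motivation
As devices become smaller, their input and output methods become correspondingly constrained.
– Input: physical keyboards are limited in size; virtual keyboards have the same problem and also lack tactile feedback.
– Output: screen size is limited (the largest phone screen currently on the market is the Samsung Note at 5.3 inches).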

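Application examples
Telephone voice systems (customer service hotlines)
Text input
In-car voice navigation
Voice search
Dialogue systems
Voice memos
Interfaces for visually impaired users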

Application example: voice search
For example: Google Voice Search

Application example: text input
Dragon Dictation

http://itunes.apple.com/us/app/dragon-dictation/id341446764?mt=8



Application example: dialogue systems
Siri: a virtual personal assistant based on speech recognition, launched by Apple in October 2011 (official Apple video)

http://www.youtube.com/watch?v=rNsrl86inpo&feature=related

Application example: voice memos
reQall

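Advantages of speech interfaces
Input speed: most people can speak at about 100 words per minute (provided that recognition accuracy is adequate)
The number of possible commands is almost unlimited
The rest of the body stays free for other tasks: a driver can chat with passengers and listen to music while driving
Natural: speech is the primary means of communication between people (a result of evolution)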

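Limitations of speech interfaces
Speech recognition is still imperfect
– When the error rate exceeds 5%, detecting and correcting errors may take longer than typing on a keyboard
– Recognition accuracy is easily degraded by noise
Speech interfaces have no visible state
Speech interfaces are hard to learn
– How does the user know which commands are available?
– How does the user know what the interface covers?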

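Architecture of a complete spoken dialogue system
[Figure: pipeline of Automatic Speech Recognition, Natural Language Understanding, Dialogue Management (Planning), Natural Language Generation, and Text-to-Speech; the arrows between stages are labeled signal, words, logical form, and words.]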
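The stages in the figure can be sketched as a chain of functions, each consuming the previous stage's output. This is only an illustration under stated assumptions: every function body below is a hypothetical stand-in (the "signal" is simply UTF-8 text, the forecast is hard-coded), and the assignment of the arrow labels to particular stages is a guess from the figure residue, not something the slides spell out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogicalForm:
    """Hypothetical meaning representation passed between NLU, DM, and NLG."""
    intent: str
    slots: Dict[str, str]


def recognize(signal: bytes) -> List[str]:
    # Stand-in for ASR: pretend the audio signal is already UTF-8 text.
    return signal.decode("utf-8").lower().split()


def understand(words: List[str]) -> LogicalForm:
    # Stand-in for NLU: one hand-written pattern, "... weather ... in <city>".
    if "weather" in words and "in" in words[:-1]:
        city = words[words.index("in") + 1]
        return LogicalForm("get_weather", {"city": city})
    return LogicalForm("unknown", {})


def manage_dialogue(form: LogicalForm) -> LogicalForm:
    # Stand-in for dialogue management and planning: decide what to say next.
    if form.intent == "get_weather":
        # The forecast is a fixed, made-up value; a real system would query a service.
        return LogicalForm("inform_weather", {**form.slots, "forecast": "sunny"})
    return LogicalForm("ask_repeat", {})


def generate(form: LogicalForm) -> List[str]:
    # Stand-in for NLG: fill a sentence template from the logical form.
    if form.intent == "inform_weather":
        return f"it is {form.slots['forecast']} in {form.slots['city']}".split()
    return "sorry please say that again".split()


def synthesize(words: List[str]) -> bytes:
    # Stand-in for TTS: return the text as bytes instead of a waveform.
    return " ".join(words).encode("utf-8")


def run_turn(signal: bytes) -> bytes:
    """One user turn: signal -> words -> logical form -> words -> signal."""
    return synthesize(generate(manage_dialogue(understand(recognize(signal)))))
```

Calling `run_turn(b"What is the weather in Taipei")` walks one utterance through all five stages; an utterance the toy NLU does not cover falls through to the "ask_repeat" branch.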

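Main components
Speech recognition
– The computer must recognize (understand) the user's speech input
Speech synthesis (text-to-speech, TTS)
– The computer must be able to convert text into speech in order to communicate with the user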

Types of speech recognition
Continuous vs. non-continuous speech
Speaker-dependent vs. speaker-independent
Spontaneous vs. read speech
Keyword spotting vs. continuous recognition of spoken words
Small vs. large vocabulary

Speech recognition technology
Hidden Markov Model (HMM)
Reference paper: A tutorial on hidden Markov models and selected applications in speech recognition
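One of the basic HMM computations covered in the tutorial is the evaluation problem: given a model, how likely is an observation sequence? A minimal sketch of the forward algorithm for a discrete HMM follows. The function name, the list-based parameter layout, and any numbers used with it are assumptions made for illustration; a real recognizer would use continuous acoustic features and log-probabilities to avoid underflow.

```python
from typing import List, Sequence


def forward_likelihood(
    observations: Sequence[int],
    initial: List[float],
    transition: List[List[float]],
    emission: List[List[float]],
) -> float:
    """Return P(observations | model) for a discrete HMM.

    initial[s]        : probability of starting in state s
    transition[p][s]  : probability of moving from state p to state s
    emission[s][o]    : probability that state s emits symbol o
    """
    if not observations:
        raise ValueError("observations must not be empty")
    n_states = len(initial)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * emission[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)
```

In isolated-word recognition, one such model can be trained per word, and the word whose model assigns the highest likelihood to the input is chosen.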

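Evaluating speech recognition systems
The performance of a speech recognition system is measured by its word error rate (WER):
ErrorRate = 100 * (Subs + Ins + Dels) / Nwords
where Nwords is the number of words in the reference transcript.

REF: I  WANT  TO   GO  HOME  ***
REC: *  WANT  TWO  GO  HOME  NOW
SC:  D  C     S    C   C     I
100 * (1S + 1I + 1D) / 5 = 60%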
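The error counts in the formula come from a minimum-edit-distance alignment between the reference and the recognized word sequences. A minimal sketch is shown here; the function name is an assumption, and it treats words as whitespace-separated, case-sensitive tokens, which real scoring tools usually refine with text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return 100 * (Subs + Ins + Dels) / Nwords using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs many extra words.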

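Technical challenges in speech recognition
How can recognition accuracy be improved?
How can noise interference be overcome?
How should filler words, pauses, and interjections be handled?
How can recognition be made faster?
– Speed is no longer a major problem on desktop and laptop computers, but there is still room for improvement on smartphones; the usual approach is to upload the speech to a server for further processing and recognition.
Segmentation (silly versus sill lea)
Homophones (mail vs. male)
Moving from speech recognition to semantic understanding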

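Speech synthesis
Also known as text-to-speech (TTS).
The input text must first be analyzed (for example, word segmentation for Chinese) to determine the pronunciation and tone of each unit; the result is then passed to the waveform synthesis unit, which produces the speech.
In general, waveform synthesis works by concatenating many pre-recorded speech units stored in a database.
Systems differ according to the size of the stored speech units; storing phones and diphones, for example, requires the system to provide a large amount of storage.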

Demonstration (NTHU MIR Lab)

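Online Chinese TTS demos
NTHU MIR Lab (National Tsing Hua University): http://mir.cs.nthu.edu.tw/demo/TTS/
NTU CSIE (National Taiwan University): http://nlg.csie.ntu.edu.tw/systems/TWLLMT/
GUTTS (National Taiwan University of Science and Technology): http://guhy.csie.ntust.edu.tw/gutts/
ITRI Information and Communications Research Laboratories: http://atc.ccl.itri.org.tw/
iFLYTEK: http://www.iflytek.com/TtsDemo/viviVoiceShow.aspx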

Online English TTS demo
AT&T Natural Voices
Sample text: "Good evening, class. Today we are going to discuss an important type of human-computer interface: speech UI, also known as voice UI. We will demonstrate a TTS engine developed by AT&T, which, in my opinion, is the best TTS so far."

http://www2.research.att.com/~ttsweb/tts/demo.php#top

Speech synthesis technology
Raw text in
Text Analysis
– Text Normalization
– Part-of-Speech tagging
– Homonym Disambiguation
Phonetic Analysis
– Dictionary Lookup
– Grapheme-to-Phoneme (LTS)
Prosodic Analysis
– Boundary placement
– Pitch accent assignment
– Duration computation
Waveform synthesis
Speech out
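The first two stages of this pipeline can be illustrated with a toy front end that normalizes text and then looks up pronunciations. Everything here is a simplified assumption: the tiny `LEXICON`, its ARPAbet-style symbols, the digit-by-digit reading of numbers, and the one-symbol-per-letter fallback are placeholders for the much richer rules and letter-to-sound models a real system uses.

```python
import re
from typing import Dict, List, Tuple

# Spoken forms for single digits (numbers are read digit by digit in this sketch).
DIGITS: Dict[str, str] = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Hypothetical pronunciation dictionary with ARPAbet-style phoneme symbols.
LEXICON: Dict[str, List[str]] = {
    "room": ["R", "UW", "M"],
    "two": ["T", "UW"],
    "four": ["F", "AO", "R"],
}


def normalize(text: str) -> List[str]:
    """Text normalization: lowercase, drop punctuation, spell out digits."""
    tokens = re.findall(r"[a-z]+|\d", text.lower())
    return [DIGITS.get(token, token) for token in tokens]


def to_phonemes(word: str) -> List[str]:
    """Phonetic analysis: dictionary lookup with a crude letter-based fallback."""
    if word in LEXICON:
        return LEXICON[word]
    # Placeholder for a grapheme-to-phoneme model: one symbol per letter.
    return [letter.upper() for letter in word]


def front_end(text: str) -> List[Tuple[str, List[str]]]:
    """Return (word, phonemes) pairs ready for prosodic analysis."""
    return [(word, to_phonemes(word)) for word in normalize(text)]
```

For example, `front_end("Room 42!")` yields the words "room", "four", and "two" with their dictionary pronunciations; prosodic analysis and waveform synthesis would then operate on that phoneme sequence.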

Waveform synthesis methods
Concatenative synthesis: based on the concatenation (or stringing together) of segments of recorded speech.
Formant synthesis: created using additive synthesis and an acoustic model with various fundamental frequency, voicing, and noise levels.
Articulatory synthesis: synthesizing speech based on models of the human vocal tract.

Waveform synthesis: concatenative synthesis
All current commercial speech synthesis systems use concatenative synthesis, which can be divided into three types:
Diphone synthesis
– Units are diphones: from the middle of one phone to the middle of the next.
– Why? The middle of a phone is its steady state.
– Record one speaker saying each diphone.
Unit selection synthesis
– Larger units (record 10 hours or more, so that there are multiple copies of each unit).
– Use search to find the best sequence of units.
Domain-specific synthesis: concatenates prerecorded words and phrases to create complete utterances.
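The "search" in unit selection is commonly formulated as a dynamic-programming (Viterbi-style) search that minimizes a target cost per unit plus a join cost between neighbors. The sketch below assumes that formulation; the function name and the cost functions passed to it are hypothetical, and real systems derive their costs from spectral, pitch, and duration features.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Unit = TypeVar("Unit")


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Pick one unit per position so that total target + join cost is minimal.

    candidates[t] holds the recorded copies available for position t.
    """
    if not candidates or any(len(options) == 0 for options in candidates):
        raise ValueError("every position needs at least one candidate unit")

    # best[t][k] = (cost of the cheapest path ending in candidates[t][k], back pointer)
    best: List[List[Tuple[float, int]]] = [
        [(target_cost(0, unit), -1) for unit in candidates[0]]
    ]
    for t in range(1, len(candidates)):
        row = []
        for unit in candidates[t]:
            cost, previous = min(
                (best[t - 1][k][0] + join_cost(prev_unit, unit), k)
                for k, prev_unit in enumerate(candidates[t - 1])
            )
            row.append((cost + target_cost(t, unit), previous))
        best.append(row)

    # Trace back from the cheapest final unit.
    k = min(range(len(best[-1])), key=lambda index: best[-1][index][0])
    total = best[-1][k][0]
    path: List[Unit] = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][k])
        k = best[t][k][1]
    path.reverse()
    return path, total
```

As a usage illustration, units could be `(diphone, pitch_in_hz)` tuples with a join cost equal to the pitch difference between neighbors, so the search prefers sequences whose pitch changes smoothly at the joins.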

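Technical challenges in speech synthesis
How can text be segmented into words correctly? (Chinese natural language processing)
How can correct prosody be synthesized?
With concatenative synthesis, how can the joins between syllables be made smoother?
How can emotional expression be added to the voice?
How can a distinctive, easily recognizable voice be produced?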

Speech conversational systems
Siri: based on the U.S. Department of Defense's Cognitive Assistant that Learns and Organizes (CALO) project
A speech-based virtual personal assistant
http://en.wikipedia.org/wiki/Siri_(software)

Demo video
A conversation with Siri on the iPhone 4S
http://www.youtube.com/watch?v=5mNcnj2l6RE

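Main technologies
Conversational Interface: the speech recognition core is provided by Nuance.
Personal Context Awareness: technology from the CALO project.
Service Delegation: information search and service provision, with many companies participating.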

Data and service search
OpenTable, Gayot, CitySearch, BooRah, Yelp, Yahoo Local, ReserveTravel, and Localeze for restaurant and business questions and actions;
Eventful, StubHub, and LiveKick for events and concert information;
MovieTickets, RottenTomatoes, and the New York Times for movie information and reviews;
True Knowledge, Bing Answers, and Wolfram Alpha for factual question answering;
Bing, Yahoo, and Google for web search.


ChatterBot
For questions it cannot understand, the system responds in the manner of a dialogue generator such as ELIZA.
Siri meets ELIZA

http://jordanmechner.com/blog/2011/10/siri/
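An ELIZA-style responder works by matching the user's utterance against patterns and echoing part of it back inside a canned template, with a generic reply when nothing matches. A minimal sketch follows; the rules and wording are invented for illustration and are not taken from ELIZA's original script or from Siri, and pronoun reflection (turning "my" into "your") is omitted for brevity.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical pattern/template pairs; {0} is filled with the captured text.
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bbecause\b", re.IGNORECASE), "Is that the real reason?"),
]

# Generic reply used when no pattern matches the utterance.
FALLBACK = "Please tell me more."


def respond(utterance: str) -> str:
    """Return a canned reply built from the first rule that matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return FALLBACK


print(respond("I need a vacation."))  # Why do you need a vacation?
```

The fallback reply plays the role described on the slide: it keeps the conversation going even though the system has not understood the question.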

Speech interfaces: practical issues
Major problems:
– modes (no feedback)
• certain commands only work when in specific states
– deep hierarchies (also known as voice mail hell)
Verbose feedback wastes time and patience
– only confirm consequential things
– use meaningful, short cues
Interruption
– half-duplex communication (i.e., no barge-in support)
Too much speech on the part of the customer is tiring
Speech takes up space in working memory
– can cause problems when problem solving

Speech interface development standard
VoiceXML (VXML) is the W3C's standard XML format for specifying interactive voice dialogues between a human and a computer.
Current version: VoiceXML 2.1, http://www.w3.org/TR/voicexml21/
In progress: VoiceXML 3.0 (working draft), http://www.w3.org/TR/voicexml30/

Speech interface development tools
Speech recognition: CMU Sphinx, Open Source Toolkit For Speech Recognition, http://cmusphinx.sourceforge.net/
Speech synthesis: festvox, http://festvox.org/index.html
Speech interface APIs:
– Microsoft Speech API (SAPI 5.3), http://msdn.microsoft.com/en-us/library/ms723627(v=vs.85).aspx
– Java Speech API, http://java.sun.com/products/java-media/speech/

References
X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, 2001. http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165
Rabiner and Schafer, Theory and Applications of Digital Speech Processing, 2010. http://www.amazon.com/Theory-Applications-Digital-Speech-Processing/dp/0136034284/ref=pd_sim_b_7
Why is Siri Important? http://www.quora.com/Siri-product/Why-is-Siri-important


