© 2007 Ken Rehor. All Rights Reserved. 1
Introduction to VoiceXML and Voice Web Architecture
Ken Rehor
Session Overview
• Voice Web Architecture
– Components of a Voice Web Application
• Voice Standards
– W3C Speech Interface Framework
• VoiceXML
– Language features
– Execution model: Form Interpretation Algorithm (FIA)
• Application Design Techniques
– Static vs. dynamic VoiceXML
– Performance Considerations
• CCXML, VoiceXML and VoIP
• Application Deployment Models
• New Technologies
– Speaker Biometrics, Video, Multimodal, VoiceXML 3.0
Simplifying Voice Services Programming
• Web-based architecture for interactive speech services
– Exploit web technologies to simplify voice service creation and deployment
– Enable consolidation of voice and web services
– Separate service logic from user interaction
• High-level programming languages
– Control speech and telephony resources in a uniform manner
– Shield application programmers from implementation details
• No need to know ASR, TTS, or telephony APIs
– Create portable applications
• Run on an enterprise system or in the telephone network
• Run on a variety of platforms, ASR agnostic
Voice Web Application Architecture
Key Ideas
• Standard/common high-level language
– Designed for the task
• Leverage open, known technology
– Web protocols, servers, networks, development tools, expertise
• Phone number mapped to URL
– Phone number associated with URL of voice service
Voice / Web Application Architecture
[Diagram: a web browser and a VoiceXML browser both reach the same application (web) server over HTTP, across the Internet or an intranet. Callers reach the VoiceXML browser from any phone over the PSTN or VoIP.]
• Application (web) server: application logic, content and data, transaction processing, database interface
• Served to the web browser: <html> pages, images, audio files, scripts
• Served to the VoiceXML browser: <vxml> documents, <grxml> grammars, .wav audio files, scripts
Voice Application Architecture and Components
[Diagram: a caller on the PSTN says "Customer service, please…" and hears "Welcome to Acme products…". The call terminates on a VoiceXML platform made up of a VoiceXML interpreter, middleware, and ASR, TTS, audio, DTMF, and telephony resources, plus OA&M. The platform fetches <vxml> documents, <grxml> grammars, and .wav audio from a web server over HTTP, across the Internet or an intranet.]
Application Backend Architecture
[Diagram: the application (web) server delivers <vxml> documents, grammars, audio files, and scripts over HTTP, across the Internet or an intranet. Behind it, reached over an intranet or the Internet, sit a database (content), a transaction server, and a web service.]
• Application (web) server: application logic, content and data, transaction processing, database interface
Components of a Voice Solution
• Traditional phone, VoIP phone, mobile phone, or multimodal device
• Telephone network
– Circuit-switched PSTN or packet-switched VoIP
– Connects caller's telephone with telephony server
• Voice User Interface
– Dialog structure / flow
– Prompts: what the application says to the user
– Speech grammars: what the user can say
• Application logic that executes on an application server
– Web "back-end"
– Database, or database interface
• VoiceXML server that executes dialogs
– Controls resources such as ASR, SIV, TTS, etc.
• Data network to connect application server and VoiceXML server
Inbound or Outbound Calls
• VoiceXML application works the same for inbound and outbound calls
– Additional call progress detection generally required for outbound
• Simple protocol for initiating outbound calls
– No firm standards, but most vendors follow similar techniques
– HTTP, Web Services, etc.
Standards
Value of Open Standards
• Non-proprietary interfaces between components
• Allow choice of best components for the task
• User interface languages
– W3C Speech Interface Framework: VoiceXML, SRGS, SSML, SI
– W3C: HTML, XHTML, SMIL, X+V
– OMA: WAP
• Communication protocols
– W3C: CCXML for 3rd-party telephony call control
– W3C: HTTP, HTTPS, SOAP, WSDL
– IETF: SIP, MRCP, MSCP
– 3GPP: IMS
– ITU: T1, ISDN
Visual vs. Voice markup
Web app UI
• HTML
– Structure
– Layout
– Input declaration
– Transitions
• Images
• Audio files / streams
• Video
• Text
• Scripts
Voice Web app UI
• VoiceXML
– Structure
– Dialog flow
– Input declaration
– Transitions
• Audio files
• Video, Images
• Text (for TTS)
• Scripts
Protocols
Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• …
Voice Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• SIP
• …
Voice Standards Activities
• Speech Interface Framework
• Network protocols
– SIP, MRCP v2, etc.
• Platform Certification, Developer Certification, Speaker Biometrics, Architecture, Tools
Voice Application Standards
[Diagram: the standards that connect the components of a voice platform. The original marks standards still in progress with **.]
• Caller, phone network, VoIP gateway: T1 / E1, ISDN, SS7
• VoIP gateway to platform: SIP, RTP, RFC 2833
• CCXML browser: fetches the CCXML call control application (CCXML, scripts) over HTTP / HTTPS
• VoiceXML browser: fetches the VoiceXML application (VXML, GRXML, SSML, audio, scripts) over HTTP / HTTPS
• Application languages: VoiceXML 2.0, VoiceXML 2.1, ECMAScript (ECMA-262)
• Telephony control interface: SIP, etc.
• Dialog control interface: SIP, MSCP, etc.
• Media control interface to the conference / media server (media mixer): SIP Netann, MSCML, MOML / MSML, MSCP, DMSP, MGCP, etc.
• MRCP client to ASR, TTS, and SIV servers: MRCP v1, MRCP v2, carrying GRXML, SSML, audio, and DTMF
• Audio formats: G.711, WAV, .au, mp3, etc.
• SOAP
W3C Speech Interface Framework
Voice Application Components
• Dialog: flow control of the inputs, outputs, next steps
• Input grammars
– Control input constraints for DTMF and speech recognition
• Output formatting
– Pronunciation, timing, sequencing
W3C Speech Interface Framework
• VoiceXML
• SRGS
• SSML
• Semantic Interpretation
• Pronunciation Lexicon
• Call Control
For more information, see: W3C Voice Browser Working Group http://www.w3.org/Voice/
Voice User Interface - Dialog
• W3C VoiceXML 2.0
– W3C Recommendation March 2004
– Widely implemented
• Approximately 4 dozen platforms
• Many service providers worldwide
– VoiceXML Forum certification program
• Nearly two dozen certified platforms, more coming
• W3C VoiceXML 2.1
– Candidate Recommendation Sept 2006
– Test suite under development; Certification Program to follow
– Many platform vendors are implementing
• W3C VoiceXML 3.0
– Early stages of development
– SCXML: state chart markup language designed as a controller for VoiceXML 3.0 and CCXML 2.0 ("Working Draft" Jan 2006)
User Interaction – Input / Output Control
• Input grammars: W3C SRGS 1.0
– W3C Recommendation
– Widely implemented
• Output formatting: W3C SSML 1.0
– W3C Recommendation
– Widely implemented, but with limited practical support (most TTS engines ignore the SSML instructions)
• Semantic Interpretation for Speech Recognition: W3C SISR 1.0
– Nearing Candidate Recommendation
– Implementation gaining acceptance
W3C Speech Interface Framework: Semantic Interpretation
W3C Speech Recognition Grammar Specification
• Markup language to control input constraints
– Finite-state speech recognition
– DTMF recognition
• Two variations
– XML (GRXML)
– ABNF
• Version 1.0: W3C Recommendation – March 2004
• Implemented and supported by numerous vendors
GRXML ASR example
<grammar type="application/srgs+xml" root="r2" version="1.0">
  <rule id="r2" scope="public">
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>milk</item>
      <item>nothing</item>
    </one-of>
  </rule>
</grammar>
GRXML DTMF example
<?xml version="1.0"?>
<grammar mode="dtmf" version="1.0" root="pin"
         xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd">
  <!-- Any single DTMF digit -->
  <rule id="digit">
    <one-of>
      <item> 0 </item> <item> 1 </item> <item> 2 </item> <item> 3 </item> <item> 4 </item>
      <item> 5 </item> <item> 6 </item> <item> 7 </item> <item> 8 </item> <item> 9 </item>
    </one-of>
  </rule>
  <!-- Four digits followed by the # key; root="pin" makes this the default rule -->
  <rule id="pin" scope="public">
    <one-of>
      <item>
        <item repeat="4"><ruleref uri="#digit"/></item>
        #
      </item>
    </one-of>
  </rule>
</grammar>
W3C Speech Synthesis Markup Language
• Markup language to control spoken and audio output
• Version 1.0: W3C Recommendation – Sept 2004
• Implemented and supported by numerous vendors
• Version 1.1: under development
– Adds support for tonal languages
– First public Working Draft published January 2007
SSML Functions
• Audio output
– <audio>
• Text-to-Speech output
– Contained within SSML constructs
• Pronunciation controls
– <say-as>
• interpret-as
• format
• detail
– <emphasis>
• Timing
– <break>
SSML Functions (cont'd)
• Spoken language
– xml:lang
• Prosody and style: voice control
– <voice>
• gender
• age
• name
• Prosody
– <prosody>
• pitch
• contour
• range
• rate
• duration
• volume
SSML Functions (cont'd)
• Sentence structure
– <p>
– <s>
• Modify text
– <phoneme>: specify a pronunciation
– <sub>: substitute text
• Location identification
– <mark>
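The SSML functions listed on these slides can be combined in a single document. The following is a minimal sketch, assuming an SSML 1.0 processor. The wording, the audio file name, and the mark name are invented for illustration, and the interpret-as value is platform-dependent because SSML 1.0 does not standardize those values.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>
      <!-- Prerecorded audio with TTS fallback text (file name is hypothetical) -->
      <audio src="welcome.wav">Welcome to Acme.</audio>
    </s>
    <s>
      Your order ships on
      <!-- interpret-as values are not standardized in SSML 1.0 -->
      <say-as interpret-as="date" format="mdy">3/14/2007</say-as>.
    </s>
  </p>
  <!-- Timing and location identification -->
  <break time="500ms"/>
  <mark name="after_date"/>
  <!-- Voice selection and prosody control -->
  <voice gender="female">
    <prosody rate="slow" volume="loud">
      Please have your <emphasis level="strong">order number</emphasis> ready.
    </prosody>
  </voice>
  <!-- <sub> speaks the alias instead of the written text -->
  <sub alias="World Wide Web Consortium">W3C</sub>
</speak>
```

How much of this is honored depends on the TTS engine, so it is worth listening to the rendered output on the target platform rather than relying on the markup alone.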
VoiceXML 2.x
VoiceXML Scope
• Human-machine interaction provided by voice response systems:
– Output
• play audio files
• produce synthesized speech
– Input
• record spoken input
• recognize spoken input
• collect character input
– Control flow
– Telephony
• transfer a user to another destination, such as a live agent
• disconnect a user
VoiceXML Goals
• Separate user interaction from service logic
– Creates new possible business models
• Service developer can be separate from telephony platform provider
• Enable service portability across implementation platforms
– Assume common set of platform capabilities
– Provide common language for:
• Content providers, Tool providers, Platform providers
• Safely handle shared network-based applications
– Deterministic behavior
• Easy to build common types of applications
• Features to build complex types of applications
• Shield application authors from low-level platform-specific details
– Promotes portability, ease of service creation
VoiceXML 2.0 Basic Functions
• Input
– <field>, <menu>: recognition
– <record>: audio recording
• Output
– <prompt>: container for TTS or prerecorded audio
– <audio>: prerecorded audio
• Control Flow
– <if>, <else>, <elseif>: basic conditional logic
– <script>: complex scripts using ECMAScript
– <goto>: transition to another form item, dialog, or document
– <submit>: submit data to a web application
• Telephony
– <disconnect>
– <transfer>
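Several of these basic functions fit into one short document. The sketch below is illustrative only: the file names (sales.vxml, repair.vxml, status.jsp) and the six-digit rule are assumptions, and the built-in digits type may not be available on every platform.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Input: <menu> recognizes one of its choices; dtmf="true" also assigns keys 1, 2, 3 -->
  <menu id="main" dtmf="true">
    <prompt>Welcome to Acme. Say one of: <enumerate/></prompt>
    <choice next="sales.vxml">sales</choice>
    <choice next="repair.vxml">repair</choice>
    <choice next="#status">order status</choice>
    <noinput>Please say sales, repair, or order status. <reprompt/></noinput>
  </menu>

  <form id="status">
    <!-- Input: <field> collects a digit string using a built-in grammar type -->
    <field name="order_number" type="digits">
      <prompt>What is your order number?</prompt>
    </field>
    <!-- Control flow: validate, then either re-ask or submit to the web application -->
    <filled>
      <if cond="order_number.length != 6">
        <prompt>Order numbers have six digits.</prompt>
        <clear namelist="order_number"/>
      <else/>
        <submit next="status.jsp" namelist="order_number" method="get"/>
      </if>
    </filled>
  </form>
</vxml>
```

Clearing the field variable is what causes the field to be visited again, which leads into the execution model described next.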
VoiceXML Execution Model
• Form Interpretation Algorithm (FIA) governs <form> execution
• Execution is synchronous (mostly)
– Disconnect events are handled (somewhat) asynchronously
• Audio is queued
– Played only when encountering a waiting state
• Processing is always in one of two states:
– Waiting for input in an input item
• such as <field>, <record>, or <transfer>
– Transitioning between input items in response to an input
• Event-driven
– <catch>, <throw>: generalized event mechanism
– <nomatch>, <noinput>: short-hand user-input event handling
– <error>: short-hand error event handling
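To make the execution model concrete, here is a hedged sketch of a form with two fields. The grammar file names and wording are hypothetical. The comments describe what the FIA is expected to do at each step: select the next unfilled item, queue prompts, and play them only on reaching a waiting state.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <!-- Queued, not played yet: the FIA keeps going until it must wait for input -->
    <block>
      <prompt>Let's take your drink order.</prompt>
    </block>

    <!-- First waiting state: queued prompts play, then recognition starts -->
    <field name="drink">
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
      <!-- Event handling with a counter: give up on the third nomatch in this field -->
      <nomatch count="3">
        <prompt>Sorry, I am having trouble understanding you.</prompt>
        <exit/>
      </nomatch>
    </field>

    <!-- Selected next because it is the first item that is still unfilled -->
    <field name="size">
      <prompt>Small, medium, or large?</prompt>
      <grammar src="size.grxml" type="application/srgs+xml"/>
    </field>

    <!-- Runs once both fields are filled -->
    <filled mode="all" namelist="drink size">
      <prompt>One <value expr="size"/> <value expr="drink"/>, coming up.</prompt>
    </filled>

    <!-- Form-level handler: applies to any field that does not define its own -->
    <noinput>
      <prompt>I did not hear anything.</prompt>
      <reprompt/>
    </noinput>
  </form>
</vxml>
```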
Key Points
• Architecture leverages all things "internet"
– Languages, protocols, servers, developers, etc.
• Separation of concerns
– Application logic / database vs. telephony / speech resources
– Enables new business models
• Voice ASP
• Prepackaged applications
• URL (application) associated with phone number
– Calling party or called party
– Share resources among many applications (Voice ASP)
• High-level languages, specific to domain / task
– Simplify development and maintenance
VoiceXML <form> and <field>
• <form>
– Dialog container
– "Form Interpretation Algorithm" (FIA) specifies default behavior
• <field>
– Collect input from caller
– <grammar> specifies input 'constraints'
• <prompt>
– Container for <audio> and text
Example: main.vxml
Note: Code simplified for demonstration purposes…
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <!-- Play the welcome audio (text is the TTS fallback) and listen for a department -->
    <field name="main_menu">
      <prompt>
        <audio src="welcome.wav">
          Welcome to Acme. You can choose sales, repair, or order status.
        </audio>
      </prompt>
      <grammar src="main_menu.grxml"/>
    </field>
    <!-- Once the field is filled, send the result to the web application -->
    <block>
      <submit next="http://acme.com/route... " method="get"/>
    </block>
  </form>
</vxml>
User Input - Grammars
• Grammars can be speech or DTMF (touchtone)
– Both types can be active simultaneously
• Specified by SRGS
– VoiceXML platforms must support the XML form (aka GRXML)
– ABNF grammars are more concise but more complex to author; ABNF support is optional
• Grammars may be specified inline or sourced externally
• External grammars are referenced by URI
• Multiple grammars may be active simultaneously
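As an illustration of the points on this slide, the sketch below activates an external speech grammar and an inline DTMF grammar in the same field. The prompt wording and the key assignments are assumptions made for the example.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="dept">
      <prompt>For sales, say sales or press 1. For repair, say repair or press 2.</prompt>

      <!-- External speech grammar, referenced by URI -->
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>

      <!-- Inline DTMF grammar, active at the same time as the speech grammar -->
      <grammar mode="dtmf" version="1.0" root="key">
        <rule id="key" scope="public">
          <one-of>
            <item>1</item>
            <item>2</item>
          </one-of>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
```

As written, the field holds "sales" after a spoken answer but "1" after a key press. Semantic interpretation tags in the grammars are the usual way to return the same value for both.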
Grammars can get very complicated: there are many ways to say the same thing…
• Sales
– I'd like to place an order
– I need to talk to a salesman
• Repair
– repair department
– service
– service department
– customer service
• Order status
– where's my order?
– track my order
– track my shipment
– where the hell is my stuff?
Basic GRXML grammar example: main_menu.grxml
<grammar … xml:lang="en-US" version="1.0">
  <rule id="dept" scope="public">
    <one-of>
      <item>sales</item>
      <item>repair</item>
      <item>order status</item>
    </one-of>
  </rule>
</grammar>
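When callers can phrase a request many ways, semantic interpretation tags let every phrasing return one value to the application. The following is a sketch only, assuming a recognizer that supports the SISR semantics/1.0 tag format; platforms of this era often used vendor-specific tag syntax instead, and the returned values ("sales", "repair", "order_status") are chosen for illustration.

```xml
<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" mode="voice" root="dept" tag-format="semantics/1.0">
  <rule id="dept" scope="public">
    <one-of>
      <!-- Every phrasing in this group returns "sales" -->
      <item>
        <one-of>
          <item>sales</item>
          <item>I'd like to place an order</item>
          <item>I need to talk to a salesman</item>
        </one-of>
        <tag>out = "sales";</tag>
      </item>
      <!-- Every phrasing in this group returns "repair" -->
      <item>
        <one-of>
          <item>repair</item>
          <item>repair department</item>
          <item>service department</item>
          <item>customer service</item>
        </one-of>
        <tag>out = "repair";</tag>
      </item>
      <!-- Every phrasing in this group returns "order_status" -->
      <item>
        <one-of>
          <item>order status</item>
          <item>where's my order</item>
          <item>track my order</item>
          <item>track my shipment</item>
        </one-of>
        <tag>out = "order_status";</tag>
      </item>
    </one-of>
  </rule>
</grammar>
```

The VoiceXML field then sees only the three values, so the dialog logic does not change as phrasings are added during tuning.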
VoiceXML example – next step: sales.vxml
<form>
  <field name="sales_menu">
    <prompt>
      <audio src="sales_menu.wav">
        You've reached Acme's sales department. To place an order, say sales.
        To speak to an associate, say I'd like to speak to someone.
      </audio>
    </prompt>
    <grammar src="sales_menu.grxml"/>
  </field>
  <block>
    <submit next="http://acme.com/... " method="get"/>
  </block>
</form>
VoiceXML example with error handling: newmain.vxml
<form>
  <field name="main_menu">
    <prompt>
      <audio src="welcome.wav">
        Welcome to Acme. You can choose sales, repair, or order status.
      </audio>
    </prompt>
    <grammar src="main_menu.grxml"/>
  </field>
  <noinput> You must say something. </noinput>
  <block>
    <submit next="http://acme.com/route... " method="get"/>
  </block>
</form>
VoiceXML example with error handling: newmain.vxml
<form>
  <field name="main_menu">
    <prompt>
      <audio src="welcome.wav">
        Welcome to Acme. You can choose sales, repair, or order status.
      </audio>
    </prompt>
    <grammar src="main_menu.grxml"/>
  </field>
  <noinput> You must say something. </noinput>
  <nomatch> I didn't understand you. Please try again. </nomatch>
  <block>
    <submit next="http://acme.com/route... " method="get"/>
  </block>
</form>
VoiceXML example with error handling: newmain.vxml
<form>
  <field name="main_menu">
    <prompt>
      <audio src="welcome.wav">
        Welcome to Acme. You can choose sales, repair, or order status.
      </audio>
    </prompt>
    <grammar src="main_menu.grxml"/>
  </field>
  <help> You can say sales, repair, or order status. </help>
  <noinput> You must say something. </noinput>
  <nomatch> I didn't understand you. Please try again. </nomatch>
  <block>
    <submit next="http://acme.com/route... " method="get"/>
  </block>
</form>
Set platform features via <property>
• Input modes: type of input from a caller
– DTMF-only: <property name="inputmodes" value="dtmf"/>
– Voice-only: <property name="inputmodes" value="voice"/>
– Both: <property name="inputmodes" value="dtmf voice"/>
• Timeouts
– <property name="timeout" value="1450ms"/>
– <property name="termtimeout" value="2500ms"/>
...
Call processing: <transfer>
• Blind
– Go somewhere but don't return
• Bridge
– Add on another party, resume execution when done talking
Call processing: <transfer>
• Blind transfer
<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>
  <transfer name="mycall" dest="tel:+1-555-123-4567">
  </transfer>
</form>
Call processing: <transfer>
• Bridge transfer
<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
  </transfer>
</form>
Call processing: <transfer>
• Bridge transfer with cancel feature
<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>
  </transfer>
</form>
Call processing: <transfer>
<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>
    <filled>
      <assign name="mydur" expr="mycall$.duration"/>
      <if cond="mycall == 'busy'">
        <prompt> Riley's line is busy. Try again later. </prompt>
      <elseif cond="mycall == 'noanswer'"/>
        <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
      </if>
    </filled>
  </transfer>
</form>
Call processing: <transfer>
<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true"
            transferaudio="music.wav" connecttimeout="60s">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>
    <filled>
      <assign name="mydur" expr="mycall$.duration"/>
      <if cond="mycall == 'busy'">
        <prompt> Riley's line is busy. Try back later. </prompt>
      <elseif cond="mycall == 'noanswer'"/>
        <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
      </if>
    </filled>
  </transfer>
</form>
New Features in VoiceXML 2.1
• Dynamically referencing grammars and scripts
– <grammar srcexpr="…"> <script srcexpr="…">
• Detect barge-in during prompt playback: enhance SSML 1.0 <mark>
– Add nameexpr attribute
– Add markname and marktime to application.lastresult$ object
• Fetch (XML) data without transition: <data>
– Uses read-only subset of DOM
• Dynamically concatenate prompts: <foreach>
– Iterate through ECMAScript array and execute content
• Record user's utterance while attempting ASR
– recordutterance property
– Add shadow variables: recording, recordingsize, recordingduration
• Send data upon disconnect
– <disconnect namelist="…">
• Additional <transfer> types
– <transfer type="…" …/>
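A few of these VoiceXML 2.1 additions can be seen together in one hedged sketch. The region variable, the grammar file naming scheme, and the prompts are invented for illustration; the phone number is the one used in the transfer examples.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical per-caller value; a real application would set this at runtime -->
  <var name="region" expr="'west'"/>
  <!-- Keep the caller's audio for each recognition attempt -->
  <property name="recordutterance" value="true"/>

  <form id="route">
    <field name="dept">
      <prompt>Which department would you like?</prompt>
      <!-- srcexpr: the grammar URI is computed when the grammar is needed -->
      <grammar srcexpr="'dept_' + region + '.grxml'" type="application/srgs+xml"/>
      <filled>
        <!-- Shadow variable added in 2.1: duration of the recorded utterance -->
        <log>utterance lasted <value expr="dept$.recordingduration"/> ms</log>
      </filled>
    </field>

    <!-- type="consultation": the caller stays connected if the far end cannot be reached -->
    <transfer name="agent" type="consultation" dest="tel:+1-555-123-4567">
      <filled>
        <if cond="agent == 'busy' || agent == 'noanswer'">
          <prompt>No one is available right now.</prompt>
          <!-- namelist on <disconnect> hands data back to the calling context -->
          <disconnect namelist="dept"/>
        </if>
      </filled>
    </transfer>
  </form>
</vxml>
```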
Dynamic Applications
VoiceXML Application Structure
• Static
– User experience is the same for everyone
• Information doesn't change frequently
• No customization per user, time of day, etc.
• Pages are created once and used many times
• Dynamic
– User experience is customized by:
• User: e.g. my.yahoo.com, amazon.com (especially once you log in)
• Situation: e.g. travel specials on expedia.com
– Data driven, e.g. inventory system, airline reservations
– Generated by a program at runtime
• JSP, ASP
• App servers such as BEA, IBM WebSphere, Oracle 9iAS
VoiceXML 2.1 and AJAX
• VoiceXML + ECMAScript + <data> + XML
• <data> element allows retrieval of arbitrary XML data without document transition
• Static VoiceXML document can fetch user-specific data at runtime
• Decouple presentation layer from business logic
• Performance improvements due to:
– Cache-able VoiceXML
– No need to generate entirely new pages for each dialog when only the content is new
– Less network traffic
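The pattern on this slide can be sketched as a static document that fetches caller-specific XML at runtime. This is an illustration under stated assumptions: the URL, the order and item element names, and the response shape are hypothetical, and the built-in digits type may not be available on every platform.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="status">
    <field name="order_number" type="digits">
      <prompt>What is your order number?</prompt>
    </field>

    <block>
      <!-- Fetch XML without leaving this document. Assumed response shape:
           <order><item>widget</item><item>gadget</item></order> -->
      <data name="order" src="http://acme.example/order.xml" namelist="order_number"/>

      <!-- Copy the text of each <item> into an ECMAScript array via the read-only DOM -->
      <script><![CDATA[
        var nodes = order.documentElement.getElementsByTagName("item");
        var items = [];
        for (var i = 0; i < nodes.length; i++) {
          items.push(nodes.item(i).firstChild.data);
        }
      ]]></script>

      <!-- <foreach> concatenates one prompt segment per array entry -->
      <prompt>
        Your order contains:
        <foreach item="name" array="items">
          <value expr="name"/> <break time="300ms"/>
        </foreach>
      </prompt>
    </block>
  </form>
</vxml>
```

Because the document itself never changes, the platform can cache it; only the small XML payload is fetched per call.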
Dynamic Application Considerations
Execution of VoiceXML is running a program on your server…
• Must guarantee quality of dynamically-generated VoiceXML documents and ASR grammars
– Catch parse errors, execution errors
– What does the caller hear if there is an error?
• not "Could not parse VoiceXML document"
• Runtime performance
– Parse and interpretation time of large documents
– Inefficient scripts and speech grammars
• Security implications
– Exploit a bug in a particular implementation? Make free phone calls?
– Could there be a VoiceXML virus? Will all platforms protect against them?
Careful application design, testing and monitoring are essential
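One way to control what the caller hears on an error is to put catch handlers in an application root document. The sketch below is a minimal illustration; the file names app_root.vxml and sorry.wav and the wording are assumptions.

```xml
<?xml version="1.0"?>
<!-- Application root document (hypothetical file name app_root.vxml).
     Other documents opt in with <vxml application="app_root.vxml" ...> -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Fetch and parse failures for documents, grammars, and scripts.
       Listed first so it is chosen ahead of the general handler below. -->
  <catch event="error.badfetch">
    <log>badfetch: <value expr="_message"/></log>
    <prompt>
      <audio src="sorry.wav">
        We are sorry, we are having technical difficulties. Please call again later.
      </audio>
    </prompt>
    <exit/>
  </catch>

  <!-- Any other error, such as error.semantic from a script bug -->
  <catch event="error">
    <log>error: <value expr="_event"/></log>
    <prompt>We are sorry, something went wrong. Please call again later.</prompt>
    <disconnect/>
  </catch>
</vxml>
```

Handlers like these cover errors raised while a document is running. A document that fails to load at the very start of a call is handled by the platform, so the platform's default behavior needs checking too.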
Dynamic Application Considerations
• A mix of different simultaneous applications means variable platform load and execution profile
– Parse time of VoiceXML document
– Fetching VoiceXML documents, grammars, audio from remote web servers
– Load Balancing
– How to protect platform from harmful application? (intentional or otherwise?)
• Max size of document
• Max size of grammar
• Complexity measurement of document or grammar (statically checked before execution?)
Platforms, networks, and applications must be carefully engineered
Performance Considerations
Load Balancing for Performance and Reliability
• CPU/memory utilization
– Grammar compilation
– ASR load
– TTS load
• Telephony Network
– Channel balancing
– Dead channel
• Incoming/Outgoing channel assignment / mix
Performance: Caching
• Fetched documents, grammars, audio files, streams
• Local or distributed cache?
• Effects of prefetching
• Where to cache generated grammars?
– Per system
– In-network
• Use external grammar compilation server?
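VoiceXML 2.0 exposes some of these caching and prefetch decisions to the application through properties and per-element attributes. The values below are illustrative assumptions, not recommendations; maxage values are in seconds.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-wide defaults -->
  <property name="grammarmaxage" value="3600"/>         <!-- reuse cached grammars for up to an hour -->
  <property name="grammarfetchhint" value="prefetch"/>  <!-- fetch grammars when the page loads -->
  <property name="audiomaxage" value="86400"/>          <!-- prompts rarely change -->
  <property name="documentmaxage" value="0"/>           <!-- always revalidate generated pages -->
  <property name="fetchtimeout" value="5s"/>

  <form>
    <field name="main_menu">
      <prompt>
        <!-- Per-element override: fetch this file only when it is about to play -->
        <audio src="welcome.wav" fetchhint="safe" maxage="600">
          Welcome to Acme.
        </audio>
      </prompt>
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```

These settings work together with the HTTP cache headers sent by the web server, so both sides need to agree for caching to have the intended effect.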
Application Management
Application Monitoring and Maintenance
• Runtime logs
– Web / application server
– Voice server
– Call Detail Reporting
• Utterance recordings and logs
– Useful for grammar and dialog tuning
– Security of recordings may be an issue
– Disk space: full-call recordings may be prohibitively large
Usage data must be continually monitored to improve user experience
Operations, Administration, Maintenance, Provisioning
• System Monitoring
– Interfacing to existing Telco OSSs
– Web-based for ISP environment
• Provisioning
– Application, Customer
• DN-URI mapping
– Telephony
• Call origination/transfer
• Max call timeout
• Max number of concurrent calls
– Platform-specific VoiceXML features
• ECMAScript allowed?
• Telephony control allowed?
• Max grammar size
Billing
Logging and charging for usage of resources
• "Platform time"
– Usage of server resources
• Toll-free usage
– It's toll free, not free
• Transferred calls
– Inbound minutes
– Outbound minutes
– Network features, e.g. Network Redirect
• Outbound calls
Accurate billing information is a critical factor in application cost or profitability
Application Deployment Models
Build-your-own network vs. Outsourcing
Build vs. Outsource? Deployment Options Enable a Variety of Business Models
• Completely in-house
– Maintain complete control for security
– Development and deployment systems can be identical
• Outsourced VoiceXML/Telephony
– Large-scale distributed networks without major capital investment
– Grow quickly and incrementally
• Completely outsourced hosting
– All components and systems managed by 3rd party
• Packaged software
– VoiceXML application integrated with existing apps
Completely In-House
• Local control of all systems
• Voice server, app server, database can be on local network
• Development and deployment systems can be identical
• Physical security: in-house team “owns” it
• Failover, reliability, scalability must be locally managed
• Redundant power, networks, etc. are required
VoiceXML On-premises Deployment using TDM or VoIP carrier connection
[Diagram: calls arrive from the PSTN at a co-location facility over a TDM connection (DS3, multiple PRI, etc.) or a VoIP "pipe". Inside the facility, a VoIP gateway, PBX, etc. and Cisco IPCC connect to VoiceXML browsers backed by ASR servers, which fetch from web applications connected to a database.]
Outsourced VoiceXML / Telephony
• Telephony and VoiceXML servers outsourced to "Voice Service Provider" (VSP)
• Application remains in your data center(s)
– Geographically distributed
– May be dedicated to specific customers
• Many carrier-grade vendors to choose from
Outsourced VoiceXML / Telephony
[Diagram: calls arrive from the PSTN at a Voice Service Provider, a carrier-grade outsourcing facility housing a VoIP gateway, Cisco IPCC, VoiceXML browsers, and ASR servers. Across the Internet, the VoiceXML browsers fetch from web applications and a database in the co-location facility.]
• Architecture is identical to in-house deployment
• Secure IP connection used between facilities
Advantages of Outsourcing to a VSP
• Choice of many vendors: one for all customers, or choose the best one for each customer
• Add capacity by adding multiple vendors
• No capital investment
• Pay-as-you-go pricing models
• Failover, reliability, scalability simplified
• Physical security of equipment and networks managed by VSP
• VPN or dedicated data connection to your backend systems
Distribute Load to Multiple VSPs
[Diagram: calls from the PSTN are spread across several VSP facilities, each with Cisco IPCC, VoiceXML browsers, and ASR servers. Across the Internet, all of them reach the web applications and database in the customer co-location facility.]
Multiple co-lo facilities can be deployed for geographic redundancy and enhanced capacity.
Completely Outsourced
• Deploy hardware & software systems at customer-managed co-location facilities
• Deploy complete systems at co-location facilities managed by 3rd party
• Deploy pre-packaged VoiceXML application integrated with customer's call center (managed by customer)
Combination of In-house and Outsourced
Several ways to balance resources
• Primary in-house, with overflow or failover to a VSP
– Local control of resources
– Overflow to VSP during peak usage
– Backup for failover / disaster recovery
• In-house development, with primary deployment via VSP
– In-house development and trials
– "Push to the network" when ready to deploy
CCXML, VoiceXML, and VoIP
3rd-Party Call Control
Inbound call using TDM connections
[Diagram: the caller reaches the VoiceXML server through the PSTN.]
• 1st-party call control: VoiceXML server handles call routing/setup/answer
Inbound call using VoIP (SIP and RTP)
[Diagram: the customer reaches a VoIP gateway through the PSTN. Between the VoIP gateway and the VoiceXML server: 1. INVITE, 2. RTP.]
• 1st-party call control: VoIP gateway routes call to VoiceXML server, which handles call routing/setup/answer
Why VoIP?
• Flexible network topology
• Simplified integration of voice dialog resources
• Vendor independence for network elements
• Separation of concerns: voice dialog resources vs. call control
Inbound Call using 3rd Party Call Control
[Diagram: the caller reaches a VoIP gateway through the PSTN. The gateway, a call routing application, and the VoiceXML server exchange: 1. INVITE, 2. INVITE, 3. RTP.]
• 3rd party application handles call routing/setup/answer
Outbound call using 3rd Party Call Control
[Diagram: an outbound calling application, the VoiceXML server, and a VoIP gateway exchange: 1. INVITE, 2. INVITE, 3. RTP. The gateway reaches the called party through the PSTN.]
• 3rd party application handles outbound call initiation/setup/routing
• "Attaches" VoiceXML dialog to connection
What is CCXML?
• XML-based language that manages the connections and resources used in phone calls
• Designed for 3rd-party call control applications
• Allows for easy integration into back-end web applications, very similar to VoiceXML's model
• Uses the finite state machine model
– Event handlers move from one state to the next using markup tags
• CCXML provides commands to run a "dialog" on a call leg
Why is CCXML Needed?
• VoiceXML was designed primarily for voice dialogs
– 1st-party call control: <disconnect> and several predefined common <transfer> types
• Connection management requires full asynchronous event handling
– Connection/telephony events can occur any time during a call and must be handled
– VoiceXML specifically limits asynchronous events to simplify the execution and programming model
• 1st-party call control can be useful but has limited flexibility
– VoiceXML 2.1 <transfer> adds "consultation" feature for network redirect
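CCXML System Architecture
[Diagram: a caller on the PSTN reaches a telephony interface. A CCXML server controls it through the telephony control interface, and controls a dialog server and a conference server through the dialog control interface. Media flows between the telephony interface and those servers. The CCXML server fetches CCXML from a telephony web application over HTTP; the dialog server fetches VXML from a voice web application over HTTP.]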
CCXML features
• Telephony channel control: voice paths and signaling
– <createcall>, <accept>, <disconnect>, <reject>, <redirect>
• Media control: Conference Bridges and Mixers
– <join>, <unjoin>, <createconference>, <destroyconference>
• Dialog control: Add a VoiceXML (or other dialog) resource to a connection
– <dialogstart>, <dialogprepare>, <dialogterminate>
Integration of CCXML and VoiceXML
• Dialogs are created using <dialogstart>
– You pass the URL of the document that you want to run
• Dialogs can be ended using <dialogterminate>
– This allows CCXML to end a dialog based on an external event such as someone calling you on a second line
• Dialogs can return data back to the CCXML platform
– In VoiceXML use <exit namelist="a b c"/>
– This is exposed in the CCXML dialog.exit event
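The integration points above can be sketched as a small CCXML state machine that answers a call, runs a VoiceXML dialog on it, and reads the data the dialog returns. This is a hedged illustration: it follows the syntax of the CCXML 1.0 specification as later published (for example the event$ variable), so drafts current when these slides were written may differ, and the state names and the main_menu variable are assumptions.

```xml
<?xml version="1.0"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <!-- The state variable drives which transitions are eligible -->
  <var name="state0" expr="'init'"/>

  <eventprocessor statevariable="state0">
    <!-- Incoming call is ringing: answer it -->
    <transition state="init" event="connection.alerting">
      <accept/>
    </transition>

    <!-- Call is up: attach a VoiceXML dialog to this connection -->
    <transition state="init" event="connection.connected">
      <assign name="state0" expr="'dialog'"/>
      <dialogstart src="'main.vxml'" connectionid="event$.connectionid"/>
    </transition>

    <!-- Dialog finished: values holds what VoiceXML returned via <exit namelist="main_menu"/> -->
    <transition state="dialog" event="dialog.exit">
      <log expr="'caller chose: ' + event$.values.main_menu"/>
      <exit/>
    </transition>

    <!-- Caller hung up in any state -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

The connection.disconnected transition has no state attribute, which is how an asynchronous telephony event gets handled no matter where the call is in the flow.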
W3C CCXML 1.0 status
• Nearing "Candidate Recommendation" status
– Language complete
– Test suite under development
– Certification Program under consideration
• Growing support throughout the world
• Several open source projects underway
– See http://www.sourceforge.net
Next-Generation Technologies
Next-Generation Technologies
• Speaker Biometrics-based authentication
– Speaker Identification
– Speaker Verification
• Video IVR: VoiceXML augmented with video
– Early stages of commercial deployment now
– Simple extension to standard platforms
– Straightforward step towards full multimodal
• Multimodal
– Multiple input modalities: speech recognition, keypad, handwriting, biometrics (voice, fingerprint, iris, etc.), geolocation, motion
– Multiple output modalities: graphics, audio (speech, TTS, music, polyphonic tones)
Speaker Biometrics
Why Speaker Biometrics?
• Identify an individual for remote transactions
• Text / DTMF PINs are inadequate
– Easily compromised
– Easily forgotten
– Do not identify an individual
• US Federal Regulations
– FFIEC guidelines for financial services
Speaker Identification and Verification (SIV)
• Authentication
– The process of confirming one or more identities.
• Speaker Identification (one-to-many)
– Authentication with multiple identity claims.
• Speaker Verification (one-to-one)
– Authentication with a single identity claim.
Types of SIV
• Text independent
– SIV technology that can operate on any freeform or structured spoken input.
• Text dependent
– SIV technology (usually verification technology) that requires the voice input of one or more specific passwords or pass phrases (having been enrolled).
• Text prompted
– SIV technology (usually verification) that randomly selects words and/or phrases and prompts the speaker to repeat them. This is also called challenge-response.
Fundamental Phases of SIV
• Enrollment
– Capture one or more user utterances to 'train' the system
• Verification
– Capture one or more user utterances to make an identity claim
• Adaptation & Scoring
– Judge how close the user's verification utterance is to the enrolled utterance
– Refine the existing enrolled utterance with information from the verification utterance
Video and Multimodal
"Video" VoiceXML
• Video extensions to VoiceXML
– 3G Wireless
– VoIP phones
• VoiceXML is just a dialog language
– Initially only for voice input/output
• Example
– Videomail is a dialog application very similar to voicemail
• Video and audio are somewhat analogous
– VoiceXML can be 'hacked' to handle video now:
• <audio src="foo.au"/> could "play" a video file via <audio src="foo.mpeg4"/>
– VoiceXML 3.0 might add a new language feature
• e.g. <video src="foo.avi"> or <media src="foo.mpeg4">
"Video" VoiceXML Deployment and Standardization
• Simple extension to standard platforms
– Easy integration with current platforms
– Doesn't "break" existing functionality
– Well aligned with "VoiceXML model"
• Early stages of commercial deployment
– Several vendors have deployed large-scale commercial systems
• Step towards full multimodal
Multimodal Applications
• W3C Multimodal Interaction Working Group
– Defining new standards based on extensive industry experience
• IBM / Motorola / Opera X+V 1.2
– Early stages of commercial deployment
– Freely available from Opera: http://dev.opera.com/articles/voice/
For more information, see: W3C Multimodal Interaction Working Group http://www.w3.org/2002/mmi
VoiceXML 3.0
VoiceXML 3.0
• Modularization
– Cleanly separate functions to enable integration with other modalities
– Enables code reuse
• New media processing
– Video
– Voice processing
– Navigation
– Speaker biometrics
• Separation of data, control flow and presentation
– Control flow embodied in new language: SCXML
• Clean data model
References
• W3C Voice Browser Working Group http://www.w3.org/voice
– VoiceXML 2.0 Recommendation: http://www.w3.org/TR/voicexml20/
– VoiceXML 2.1 Working Draft: http://www.w3.org/TR/voicexml21/
– Semantic Interpretation Working Draft: http://www.w3.org/TR/semantic-interpretation/
– SRGS 1.0 Recommendation: http://www.w3.org/TR/speech-grammar/
– SSML 1.0 Recommendation: http://www.w3.org/TR/speech-synthesis/
– SSML 1.1 Working Draft: http://www.w3.org/TR/speech-synthesis11/
– CCXML 1.0: http://www.w3.org/TR/ccxml/
– SCXML: http://www.w3.org/TR/scxml/
• IETF http://www.ietf.org
Ken Rehor
http://www.kenrehor.com
VoiceXML Forum Co-founder and past Chair
Chair, VoiceXML Forum Conformance Committee
Co-Chair, VoiceXML Forum Speaker Biometrics Committee
W3C
Co-editor: VoiceXML 1.0, 2.0, 2.1, 3.0
Co-editor: CCXML 1.0