You Don’t Say? Enriching Human-Computer Interactions through Voice Synthesis

Megan Jeffrey

March 17th, 2010

Com546: Evolutions and Trends in Digital Media

University of Washington: MCDM

Abstract

As computers continue to be an integral part of how individuals communicate, proponents of voice synthesis have claimed that the technology is a way to “humanize” our interactions with machines. Companies seeking to improve their customer service after business hours rely on call centers that use synthetic voices to answer consumer questions, and in-car GPS devices relay instructions to drivers in a safe, personable manner. Furthermore, for those who have lost the ability to speak, synthetic voices offer another chance to be heard and to express their feelings in a way that is far less robotic-sounding than the text-to-speech technology of the 1970s. Through the use of complex voice concatenation engines, technicians are approaching a time when the synthesized voices of our computers will enable us to be understood by our machines not only phonetically, but also culturally and emotionally.


Introduction

Speech is the primary way in which humans engage each other. Shortly after birth, infants become aware that through the manipulation of their vocal cords they can influence others, and this awareness grows alongside an individual’s understanding of rhetoric. By the time many of us enter grade school, we recognize that speech not only enables us to interact with others, but also to access and disseminate information. Hence, it is not surprising that “with the rapid advancement in information technology and communications, computer systems increasingly offer users the opportunity to interact with information through speech” (Al-Said, 2009). After all, as technology continues to increase the ease with which we access data, the apparent convenience of talking to our machines to get what we want must seem appealing. However, effective communication is not one-sided, and if humans want to be clearly understood by machines, machines should be able to respond in a voice of their own.

For decades, scientists have attempted to use technology to synthesize voices that will enable silent machines, and those humans silenced by physical trauma, to communicate. However, Clifford Nass, a professor of communication at Stanford University, argues that 200,000 years of evolution have made us “hard-wired to interpret every voice as if it were human, even when we know it comes from a computer” (Logan, 2007). As a result of this “hard-wiring,” Nass claims consumers find it difficult to respond to artificial voices because they lack personality and social awareness. For instance, due to their monotone inflection, synthesized voices can often sound indifferent, and their overtly phonetic pronunciation too often reminds us that we are talking to a machine, not a human. Nevertheless, the recent success of voice synthesis companies like CereProc Ltd. and AT&T Natural Voices has resulted in machine-generated voices that are capable of expressing emotion and demonstrating the type of nuanced speech that may one day make chatting with one’s computer an enjoyable exercise.

Historical Background

At the 1939 World’s Fair in New York, Homer Dudley debuted the Voder (Voice Operation Demonstrator), a Bell Labs machine that converted white noise into speech using a set of controls that were “played” like a musical instrument by a human operator. These controls enabled the operator to alter the rhythm, pitch, and inflection of each available tone until it roughly resembled human speech. However, it was not until the first digital revolution of the late 1970s that synthesized-speech systems gained widespread attention, thanks to the “Speak & Spell,” an educational toy developed by Texas Instruments in 1978 (Logan, 2007). By using Text-to-Speech (or TTS) software that applied mathematical models of sound moving along the human vocal tract, the toy would “speak” any word typed on its keyboard.

The following year, Dennis Klatt of the Massachusetts Institute of Technology (MIT) utilized similar speech-synthesizing software to give world-renowned physicist Stephen Hawking a new digital voice. Dubbed “Perfect Paul,” the program enabled Hawking to verbally converse with others in spite of his degenerative neuromuscular disease (Logan, 2007). However, both “Perfect Paul” and his deeper, more masculine cousin “Huge Harry” still sounded very robotic, and voice technicians struggled to develop a TTS-based technology that would synthesize more natural-sounding and expressive voices.

Voice Synthesis Technology and Methodology: An Overview

Since Klatt’s pioneering research at MIT, voice synthesis has grown to include two primary components: a TTS engine and a library of pre-recorded voices that enable devices to speak in a variety of languages and accents. On its Website, AT&T Natural Voices describes how modern TTS operates in two stages: during the first, the engine decides how the text should be spoken (pronunciation, pitch, etc.), and in the second, the system generates audio that matches the previously identified specifications (2010). However, it is important to note that TTS systems do not actually understand the human language they mimic. Instead, the process is “more like learning to read a foreign language aloud [;] with a good dictionary, grammar rules, etc. you can get better, but still make mistakes obvious to native speakers” (AT&T, 2010). Therefore, before TTS technology can advance to a level where it is self-correcting, programmers would have to create software that teaches machines the meaning of words (both literal and cultural) so that computers would be capable of understanding the text they read.
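To make this two-stage division concrete, the following sketch separates a toy engine into a text-analysis front end that decides how the text should be spoken and a rendering back end that generates audio to those specifications. It is only an illustration of the idea described above; the class and function names are hypothetical and do not come from AT&T’s actual SDK.

    # Hypothetical sketch of the two-stage TTS pipeline described above.
    # Stage 1 (front end): decide HOW the text should be spoken.
    # Stage 2 (back end): generate audio matching those specifications.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PhoneSpec:
        phoneme: str        # e.g. the hard "d" in "dive"
        duration_ms: int    # how long to hold the sound
        pitch_hz: float     # fundamental frequency target

    def analyze_text(text: str) -> List[PhoneSpec]:
        """Front end: normalize text, look up pronunciations, assign prosody."""
        specs = []
        for word in text.lower().split():
            # A real engine would consult a pronunciation dictionary and
            # letter-to-sound rules; here each letter stands in for a phoneme.
            for ch in word:
                specs.append(PhoneSpec(phoneme=ch, duration_ms=80, pitch_hz=120.0))
        return specs

    def render_audio(specs: List[PhoneSpec]) -> bytes:
        """Back end: turn the specification into an audio buffer (stubbed)."""
        samples = bytearray()
        for spec in specs:
            # A real back end would synthesize or splice recorded units; here we
            # just reserve silence of the requested duration (8 kHz, 8-bit).
            samples.extend(b"\x00" * (8 * spec.duration_ms))
        return bytes(samples)

    if __name__ == "__main__":
        audio = render_audio(analyze_text("You don't say"))
        print(f"Synthesized {len(audio)} bytes of (placeholder) audio")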

Voice Concatenation

Computer scientist Boris M. Lobanov writes that using TTS software to reproduce the human voice resembles “the widely-known biological problem of cloning, whereby on the basis of a comparatively small amount of genetic information, an attempt is made of reproducing a living being copy as a whole” (2004). In the early 1990s, speech synthesis researchers abandoned attempts to create human sounds from scratch and instead began using “voice concatenation,” which broke down recordings of a person’s voice into small units of speech (phonemes, allophones, etc.) and then put them back together to form new words and sentences (Logan, 2007). A phoneme is an abstract unit that speakers of a particular language recognize as a distinctive sound (ex: the hard “d” in the word “dive”). In comparison, an allophone is a variant of a phoneme; changing the allophone will not change the meaning of a word, but the result may sound unnatural or be unintelligible (ex: “night rate” vs. “nitrate”).
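As a toy illustration of this idea, the sketch below assembles utterances by stitching together small recorded units from a single speaker’s inventory rather than storing every word outright. The unit names and file names are invented for the example and do not reflect any vendor’s actual engine.

    # Hypothetical concatenative synthesis sketch: new utterances are assembled
    # from a small inventory of recorded units instead of whole-word recordings.
    from typing import Dict, List

    # Imagine each value is a short audio clip of one unit from a single speaker.
    unit_inventory: Dict[str, str] = {
        "n": "n.wav", "ai": "ai.wav", "t": "t.wav", "r": "r.wav", "ei": "ei.wav",
    }

    def concatenate(units: List[str]) -> List[str]:
        """Look up each requested unit and return the clips to splice, in order."""
        missing = [u for u in units if u not in unit_inventory]
        if missing:
            raise KeyError(f"No recording for units: {missing}")
        return [unit_inventory[u] for u in units]

    # "night rate" and "nitrate" reuse the same units; only the phrase break differs.
    print(concatenate(["n", "ai", "t"]) + ["<pause>"] + concatenate(["r", "ei", "t"]))
    print(concatenate(["n", "ai", "t", "r", "ei", "t"]))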


If we were to think of voice synthesis in terms of cloning, as Lobanov suggests, this use of “speech DNA” makes the task of synthesizing a new voice more manageable for a system; it has only to store the smaller units, rather than a copy of every word in a given language. Moreover, CereProc Ltd. researcher Matthew Aylett argues concatenation is effective because “a critical element of careful speech is to be able to mark important information bearing sections [;] a large amount of clarity can be added by inserting short phrase breaks appropriately” (2003). Hence, by using these small units, concatenation more closely resembles human speech because the words and sounds do not all run together. Nevertheless, because the quality of a speech synthesizer is judged on its similarity to the human voice, it is essential that companies maintain a database of pre-recorded texts and a large number of phonemes that preserve “the individual acoustic characteristics of a speaker’s voice” (Lobanov, 2004).

According to Aylett, the personal acoustic characteristics of the human voice are determined by a number of physical factors, such as the unique shape of each person’s speech organs: the larynx, vocal cords, mouth, etc. (2003). He and other voice synthesis experts like Tian-Swee Tan believe concatenation engines can compensate for this physical complexity by using sound units to generate precise speech. However, Tan argues that a select string of “continuous phoneme from the same source, instead of individual phoneme from different sources,” will reduce the number of concatenation points and tonal distortion, and will result in a more natural-sounding synthesized voice (2008). This is why the best synthesized voices use audio from a single database. In the most basic TTS systems at the turn of the century, computers stored more than 100,000 bits of sound data associated with written words (Rae-Dupree, 2001). However, because human communication is more than a string of sounds, voice synthesis researchers have also developed natural language processors (NLPs).


An NLP applies to whatever speech is being synthesized the prosody rules that individuals use to give grammatical meaning to a sentence (Economist, 1999). The aptly named “prosodic processor” divides a sentence into accentual units and then determines the amplitude and frequency (i.e., volume and pitch) of each unit. One possible end result of this process would be a spoken text that ends in the upward inflection that differentiates a question from a statement (Vasilopoulos, 2007). Nevertheless, despite these technological refinements, communication experts like Nass continued to denounce even the best artificial voice systems for their lack of convincing emotions; for want of a personality, computers remained silent.
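A bare-bones version of such a prosodic processor might look like the sketch below, which splits a sentence into rough accentual units and assigns each a volume and pitch target, raising the pitch of the final unit when the sentence is a question. It is a simplification built on my own assumptions, not the system described by Vasilopoulos.

    # Hypothetical prosodic processor: split a sentence into accentual units
    # and assign each unit an amplitude (volume) and frequency (pitch) target.
    from typing import List, Tuple

    def assign_prosody(sentence: str) -> List[Tuple[str, float, float]]:
        """Return (unit, amplitude 0-1, pitch in Hz) for each accentual unit."""
        is_question = sentence.strip().endswith("?")
        units = sentence.strip(" ?.!").split()   # crude stand-in for accentual units
        targets = []
        for i, unit in enumerate(units):
            amplitude = 0.8 if i == 0 else 0.6   # stress the opening unit slightly
            pitch = 120.0 - 2.0 * i              # gentle declination over the sentence
            if is_question and i == len(units) - 1:
                pitch += 40.0                    # upward inflection on the final unit
            targets.append((unit, amplitude, pitch))
        return targets

    print(assign_prosody("You don't say?"))
    print(assign_prosody("You don't say."))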

Emotional Expression and Voice Synthesis

In 1999, a student from the University of Florida named D’Arcy Haskins Truluck created the GALE system in an early attempt to make synthesized voices more personable. Truluck developed a series of prosody rules that described how humans sounded when they were angry, sad, happy, or fearful and then coded these rules into a TTS program. For example, if someone using a synthesized voice wanted to express anger, there would be a marked increase in the “frication” of the speech, meaning that consonants would be heavily stressed and clipped, and the pitch would fall at the end of a sentence to demonstrate assertiveness (Economist, 1999).

Taking inspiration from Truluck’s research, CereProc Ltd. provides clients with a set of emotional tags that can be entered alongside text to indicate how it should be intoned. For instance: <voice emotion="cross">Get that out of my face</voice>. However, the CereProc Ltd. software also uses a combination of pre-recorded voice styles and digital signal processing to simulate a fuller range of emotions, even though the company admits that there is a certain point at which strongly emotional speech can sound artificial and unnatural (2010). Therefore, as of this writing, it is far more difficult to synthesize “homicidal rage” than a vocal tone indicating that one is “cross.” Still, CereProc Ltd. claims that it can simulate “a wide variation in the underlying emotion of our voices,” as most emotional states can be categorized along two spectrums: positive-negative and active-passive.1 An active state requires a faster speech rate, and higher volume and pitch, whereas a passive state is slower and lower. CereProc Ltd. has also coded for emotions that are tied to the content of an exchange, such as surprise or disappointment (2010).

1 Anger, for example, would be described as an active-negative emotion. See Appendix A for a graphic representation and Table 1 for more information.
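As an illustration of how emotional tags and the active-passive, positive-negative dimensions could drive synthesis parameters, the sketch below parses a tagged phrase and maps the emotion’s position in that space onto speech rate, volume, and pitch. The tag parsing and the numeric values are assumptions made for the example; they mirror the idea, not CereProc’s actual implementation.

    # Hypothetical sketch: parse an emotion tag and map the emotion's position in
    # the active/passive, positive/negative space onto rate, volume, and pitch.
    import re

    # (activation, valence): +1 = active/positive, -1 = passive/negative (illustrative values)
    EMOTION_SPACE = {
        "cross":   (+1.0, -0.6),   # active-negative, milder than "angry"
        "angry":   (+1.0, -1.0),
        "happy":   (+1.0, +1.0),
        "sad":     (-1.0, -1.0),
        "relaxed": (-0.5, +0.5),
    }

    def parse_tagged(text: str):
        """Extract (emotion, spoken text) from e.g. '<voice emotion="cross">...</voice>'."""
        m = re.match(r'<voice emotion="(\w+)">(.*)</voice>', text.strip())
        return (m.group(1), m.group(2)) if m else ("neutral", text)

    def prosody_for(emotion: str):
        activation, valence = EMOTION_SPACE.get(emotion, (0.0, 0.0))
        return {
            "rate":   1.0 + 0.25 * activation,   # active states speak faster
            "volume": 0.7 + 0.20 * activation,   # ... and louder
            "pitch":  120.0 + 15.0 * activation + 5.0 * valence,  # ... and higher
        }

    emotion, text = parse_tagged('<voice emotion="cross">Get that out of my face</voice>')
    print(emotion, text, prosody_for(emotion))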

For those individuals like famed movie critic Roger Ebert who rely on TTS as a substitute voice for the one they have lost, the ability to add emotional intonations to their speech is of paramount importance. After the removal of his lower jaw, Ebert wrote in August 2009 of his frustration with having to converse with others through the use of TTS during business meetings: “I came across as the village idiot. I sensed confusion, impatience and condescension. I ended up having conversations with myself, just sitting there.” Speaking in public or on TV was also unpleasant, as the critic felt he sounded “like Robby the Robot” (2010).

Nass would say that Ebert’s business associates responded negatively to his computer voice because that is how humans react to a voice that sounds bored or insincere, as most machines do (Logan, 2007); we respond in kind to what we hear. Furthermore, Nass found that humans are more likely to respond positively to a voice that demonstrates qualities similar to their own, rather than one that sounds alien and false. In 2007, he and his team discovered that participants were more likely to follow the advice of a computer voice whose gender matched their own, and that if an artificial voice is to be trusted as a salesperson, its “personality” can be more important than what it actually says. For instance, those who identified themselves as extroverts preferred a voice that constantly asked them if they needed help, while the introverts preferred the “salesperson” who only offered advice when asked (Logan, 2007). Therefore, Nass’ research indicates that businesses that rely heavily on computer voices would do well to choose a “digital workforce” with which their customers can identify.

Current Applications and Limitations of Voice Synthesis

Traditionally, consumers have viewed TTS as an assistive technology, used on personal devices owned by the visually impaired, by those who have no voice of their own, or by individuals who need to proofread a document. In recent years, synthesized voices have been employed by GPS devices that recite directions and helpful information to drivers who need to keep their eyes on the road and not on a screen. Companies like AT&T Natural Voices offer a C-based software development kit for engineers seeking to “humanize” their programs, while CereProc Ltd. promises to quickly build voices that “not only sound real, but have character, making them suitable for any application that requires speech output” (2010). Drawing on Nass’ communication research, CereProc Ltd. asks why companies wouldn’t welcome a chance to “talk to customers in their own accent? Or communicate with younger or older customers in a voice they can identify with?” (CereProc, 2010).

Nathan Muller, author of the Desktop Encyclopedia of Telecommunications, feels that help desks and voice response systems are the most commercially important businesses for this type of technology, especially when customers need to access information after business hours (2002). The CereProc Ltd. Website also touts the corporation’s “full voice branding and selection service,” which will profile a business’ target market, and then use the research to “cast” and “test” the voice that would be the most appealing to customers (2010). If CereProc Ltd. and its competitors are to be believed, now that digital voices are friendlier, human callers will be more inclined to interact with a machine they feel is actually responsive to their needs. However, before we can feel completely at ease with chatting to computers, there are numerous issues which voice synthesis researchers will have to address.

In 2001, Adam Greenhalgh, cofounder and CEO of Speaklink, a software company that creates voice-centered applications, predicted that “we're probably two to five years away from having a synthesized voice that will be entirely undetectable by the human ear” (Rae-Dupree, 2001). However, nine years later, voice synthesis companies have yet to produce such a product. Even CereProc Ltd., the Scottish company that promised to give Roger Ebert’s own voice back to him, can only generate a halting, albeit expressive, replica that the critic says “still needs improvement, but at least sounds like me” (2010). Aylett admits that synthetic speech generated from pre-recorded audio still has a “buzzy” quality that results from the vocoding of speech waveforms (2003). Such tonal distortion can be disastrous in a synthesized voice that needs to be able to speak a language like Mandarin Chinese or Thai where “the meaning of words with the same sequence of phonemes can be different if they have different tones” (Chomphan, 2009).

Furthermore, New York Times journalist Keith Bradsher writes that “researchers have made slow progress in understanding how language works, how human beings speak, and how to program computers with this understanding” (1991). While we have advanced somewhat in our understanding of how humans process and understand language, there have been few efforts to “teach” computers how to think critically and respond appropriately to human speech.

In 2005, Voxify Inc., a voice recognition software company in California, attempted to correct what it saw as one glaring communication oversight: the lack of “cultural affirmative behavior traits” (ex: unconsciously muttering “uh-huh” in response to a statement) in a computer’s speech database. Chief technology officer Amit Desai says that most TTS systems get confused by such “chatter,” which can sour human-computer interactions. His feeling is that “if voice technology is going to get an expanded role in self service business applications, it has to adapt to what people utter” (Hall, 2005). Finally, before voice synthesis technology can be used to enrich our experiences with our personal computers, it is going to have to win over those individuals who can type faster than they talk, and read faster than they listen.
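One simple way a voice application might cope with such conversational “chatter” is to filter recognized filler tokens out of a transcript before trying to interpret it, as in the sketch below. The filler list and function are illustrative assumptions, not Voxify’s actual approach.

    # Hypothetical pre-processing step: strip affirmative "chatter" from a
    # recognized transcript before the application tries to interpret it.
    import re

    FILLERS = {"uh-huh", "mm-hmm", "um", "uh", "yeah", "okay"}

    def strip_chatter(transcript: str) -> str:
        """Remove filler tokens so a request like 'book a flight' survives intact."""
        tokens = re.findall(r"[\w'-]+", transcript.lower())
        return " ".join(t for t in tokens if t not in FILLERS)

    print(strip_chatter("Uh-huh, yeah, um, book a flight to Seattle, okay?"))
    # -> "book a flight to seattle"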

The Future… and What Needs to Happen Before We Can Get There

According to journalist Janet Rae-Dupree, “unrestricted use of the human voice--both to be understood by the computer and to vocalize the computer's output--has long been the holy grail of computing interfaces” (2001). However, Bill Meisel, a veteran of the speech-recognition market, believes that the main use of speech synthesis technology at the moment, and for the next couple of years, will instead be in specialized fields like medicine (Gomes, 2007).

In two years, the declining cost and increasing speed of microprocessors (as first stated in Moore’s Law2) will continue to help TTS systems synthesize smoother sentences (Guernsey, 2001). As voice synthesis technology becomes more widely available and cost-effective, patients suffering from aphasia, ALS, throat cancer, or other diseases that rob them of speech will turn to companies like CereProc Ltd. and ModelTalker for a chance to regain some semblance of their old voice and once more be heard. No longer will high-quality voices require “a good voice talent, a soundproof room, professional audio equipment, and hours of written material with thorough coverage of phoneme combinations” (AT&T, 2010). Instead, interested individuals will be able to create their own vocal databases using a personal computer, a microphone, and a previously determined set of “expressive” phrases that will result in an effective inventory of words and emotions. From this relatively small amount of speech data, a voice synthesis company will be able to create a voice that, although still imperfect, will sound like the subject in question. Finally, the “success” of these computer voices will be publicized by organizations like the NIH’s National Institute on Deafness and Other Communication Disorders, which will continue funding these companies and researching the effectiveness of synthesized speech and how social groups (the elderly, men, women, etc.) react to the still-developing technology.

2 “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer” (Moore, 1965).

In terms of mobile technology, there will also be an increased use of voice commands to access information stored on a computer. Currently, both CereProc Ltd.’s and AT&T Natural Voices’ software is licensable for nearly any use. Already it has been incorporated into “FeedMe,” an iPhone application that will read the news and other content to phone owners while they drive or have their eyes engaged elsewhere (Herrman, 2010). Similarly, mobile-phone users will be able to search the Web with their voices, and hear the selections spoken back to them in the voice of their choice3 (Gomes, 2007). For instance, on the iPhone, users can choose Brian, the wise-confident-navigator, Jerry, the laid-back-young-entrepreneur-who-still-knows-his-stuff, Kate, the perky-clever-fashionista, or Mary, the-voice-of-reason. Furthermore, as we train our devices to respond to the sound of our voices, we can securely access content stored within the cloud, as it will still be difficult to clone voices that are not our own.

3 Select voices will be available on select brands/models/carriers, based on market research that indicates which voices (read: personalities) would be most popular amongst a target audience.

As indicated by Nass’ research, when it comes to artificial voices, being able to match the synthesized voice to the user is a vital part of the technology’s success and subsequent adoption. However, any system capable of this will need to be able to detect and respond appropriately to human moods, so that the voice the computer selects is the closest possible match (Logan, 2007). In five years, the mood-detection software pioneered by Microsoft’s Project Natal for the Xbox will be applied to a variety of personal technologies with which people daily interact, including mobile phones, personal computers, and (of course) video games.

By being able to detect when users are upset, excited, or stressed, companies will be better able to respond to their clients’ needs and stay on their good side. For example, if while navigating the Internet an individual attempts to complete a transaction and runs into technical difficulties, their computer can guide them through the site’s trouble-shooting process and read them suggestions from the site’s FAQ or a similar discussion occurring on the site’s forum. If nothing is working and the computer recognizes that the user is growing angrier, it can first attempt to placate the individual by apologizing for the difficulty, and then direct the user to an actual human as a last resort. Although in five years’ time human assistance will still be necessary in extreme cases, the newly-empathetic synthesized voice of the computer can still be used to maintain a mutually beneficial relationship between users and online organizations. Moreover, recordings of these sessions will be beneficial in helping an organization determine at what point in the help process clients start losing control of their negative feelings, as “there are stress characteristics common to all speakers” (Logan, 2007); if some pattern or trend can be established, businesses will have a better understanding of what needs to be done to solve the problem and when.

In ten years, voice synthesis will be used to develop speech-to-speech (STS) translation, where a subject’s speech in one language can be used to produce corresponding speech in another language while continuing to sound like the user’s voice. In the world of entertainment, this technology can be used to render subtitles and clunky voice-overs in international media obsolete (Aylett, 2003). For example, with this technology, Christopher Walken will still sound like Christopher Walken even after his latest film has been translated into Hindi. However, in order to accomplish this, voice synthesis engines would have to make significant advancements, both in their processing abilities and in their handling of foreign languages and the cultural nuances of meaning in the words themselves. For instance, computer-speech expert Ruediger Hoffmann stresses that a “decisive factor in creating authentic voices is completeness in the resources and databases used… including vocabulary and grammar” (2004). Despite this, his colleague Lobanov points out that although there are roughly 2,736 Russian vowel allophones, voice synthesizers currently only have the power to process fewer than 1,500 (2004). Hence, it is extremely difficult to create an authentic-sounding Russian voice because the vocabulary of current databases is limited by an incomplete allophone collection. This means that it would be difficult to develop natural-sounding English-to-Russian STS translations.

Therefore, based on its previous record of advancement, perhaps a more feasible prediction for voice synthesis in ten years would be that companies like CereProc Ltd. will perfect their voice-cloning technology and produce computer speech that is practically “as good as the real thing.” In fact, Muller already foresees a future where celebrities’ contracts will have to include voice-licensing clauses. New York Times journalist Robert Frank argues that “voice cloning is just one of many technologies that expand the market reach of the economy's most able performers [and] creates a winner-take-all market — one in which even small differences in performance give rise to large differences in economic reward” (2001). For example, if Sephora wanted to license Adam Lambert’s voice for a promo about their new line of metallic eye shadow, they could obtain his permission to dump his vocals into a TTS engine and churn out the audio for the commercial. All this would be done at a fraction of the cost of flying Lambert into town to lay down a 30-second track in the recording studio. However, Frank warns, although “cloning frees up resources […] the downside is that the monetary value of these gains is distributed so unequally” (2001).

Moreover, if voice synthesis technology does become more widespread and voice-cloning is made simpler and more convincing, synthesized voices may be used to perpetrate fraud. Similar to the e-mail and social media scams of today, criminals could obtain sensitive information by tricking people into thinking they were getting phone calls from someone they know (Guernsey, 2001). For instance, if the law did not protect against the misappropriation of an individual’s voice, companies like VoxChange could obtain recordings of a person’s voice, synthesize it using software similar to CereVoice or ModelTalker, and then use it for whatever purpose they like. After all, “the best way to check up woman’s fidelity or to prove man’s infidelity is to talk to them with a voice of common acquaintance or relative whom they let into secrets4” (VoxChange, 2010).

4 Direct, unaltered quote from the VoxChange Website.

Conclusion

In his original research, Aylett wrote that humans regard vocal mimicry by computers with both awe and suspicion. He went on to say that this is in part due to the fact that “perfect vocal mimicry is also the mimicry of our own sense of individuality” (2003). Hence, although we may talk with, scream at, or supplicate our machines, the idea that in the future they may answer us with their own voice both fascinates and frightens us. For the time being, we seem quite content to use TTS systems to give a voice to those who literally have none. However, if humans do grow desirous of a meaningful relationship with the technology that already seems like such an integral part of their lives, perhaps both they and their machines will have to learn how to listen to what the other has to say.

Appendix A:

Table 1 shows how various emotions can be arranged in the evaluation/activation space continuum. The '+' sign means a more extreme value. The (+Content) means that the emotion will be simulated if appropriate content is used.

Table 1:

Active Negative:
++ Angry
++ Frightened/Scared/Panicked
+ Tense/Frustrated/Stressed/Anxious
Authoritative/Proud (+Content)

Active Positive:
++ Happy
+ Upbeat/Surprised (+Content)/Interested (+Content)

Passive Negative:
++ Sad
Disappointed (+Content)/Bored

Passive Positive:
+ Relaxed
Concerned/Caring


Bibliography

Al-Said, G., & Abdallah, M. (2009). An Arabic text-to-speech system based on artificial neural networks. Journal of Computer Science, 5(3), 207.

AT&T Labs Research. (2010). Text-to-Speech (TTS): Frequently asked questions. Retrieved 2/20/2010, from http://www2.research.att.com/~ttsweb/tts/faq.php#TechWhat.

Aylett, M., & Yamagishi, J. (2003). Combining statistical parametric speech synthesis and unit-selection for automatic voice cloning. Centre for Speech Technology Research, University of Edinburgh, U.K. Retrieved 2/16/2010, from http://www.cstr.ed.ac.uk/downloads/publications/2008/03_AYLETT.pdf.

Bradsher, K. (1991). Computers, having learned to talk, are becoming more eloquent. New York Times, D6. Retrieved 2/16/2010, from ProQuest Historical Newspapers.

CereProc. (2010). CereProc research and development. Retrieved 2/20/2010, from http://www.cereproc.com/about/randd.

Chomphan, S. (2009). Towards the development of speaker-dependent and speaker-independent hidden Markov model-based Thai speech synthesis. Journal of Computer Science, 5(12), 905. Retrieved 2/20/2010.

Dutton, G. (1991). Breaking communications barriers. Compute!, 13(9), 28. Retrieved 2/20/2010, from Academic Search Elite database.

Ebert, R. (2010). Finding my own voice. Roger Ebert's Journal. Retrieved 2/20/2010, from http://blogs.suntimes.com/ebert/2009/08/finding_my_own_voice.html.

The Economist. (1999). Once more, with feeling. The Economist, 350(8108), 78. Retrieved 2/20/2010, from http://search.ebscohost.com.ezproxy.lib.calpoly.edu:2048/login.aspx?direct=true&db=afh&AN=1584087&site=ehost-live.

Frank, R. (2001). The downside of hearing Whoopi at the mall. New York Times. Retrieved 2/16/2010, from http://www.robert-h-frank.com/PDFs/NYT.8.7.01.pdf.

Gomes, L. (2007). After years of effort, voice recognition is starting to work. Wall Street Journal (Eastern Edition), B1. Retrieved 2/21/2010, from ABI/INFORM Global.

Guernsey, L. (2001). Voice cloning: Software recreates voices of living and dead. New York Times. Retrieved 2/16/2010, from http://www.rense.com/general12/ld.htm.

Hall, M. (2005). Speech-recognition apps behave…. Computerworld, 39(48), 6. Retrieved 2/21/2010, from http://offcampus.lib.washington.edu/login?url=http://search.ebscohost.com.offcampus.lib.washington.edu/login.aspx?direct=true&db=a9h&AN=19004476&site=ehost-live.

Herrman, J. (2010). How Ebert will get his voice back. Gizmodo. Retrieved 2/20/2010, from http://gizmodo.com/5474950/how-roger-ebert-will-get-his-voice-back?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+gizmodo/full+(Gizmodo).

Hoffmann, R., Shpilewsky, E., Lobanov, B., & Ronzhin, A. (2004). Development of multi-voice and multi-language text-to-speech (TTS) and speech-to-text (STT) conversion system. Retrieved 2/16/2010.

Lobanov, B., & Tsirulnik, L. I. (2004). Phonetic-acoustical problems of personal voice cloning by TTS. United Institute of Informatics Problems, National Academy of Sciences of Belarus. Retrieved 2/16/2010.

Logan, T. (2007). A little more conversation; Ever enjoyed talking to a machine? One day you might. New Scientist, 34-37. Retrieved 2/20/2010.

ModelTalker. (2010). ModelTalker speech synthesis system. Retrieved 2/20/2010, from http://www.modeltalker.com/.

Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics Magazine, 4. Retrieved 2/2/2010, from ftp://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf.

Muller, N. (2002). Desktop encyclopedia of telecommunications (3rd ed.). McGraw-Hill Professional, 1134.

Rae-Dupree, J. (2001). A bit of drawl, and a byte of baritone. U.S. News & World Report, 131(60), 44.

Tan, T., & Sh-Hussain. (2008). Implementation of phonetic context variable length unit selection module for Malay text to speech. Journal of Computer Science, 4(7), 550.

Vasilopoulos, I., Prayati, A. S., & Athanasopoulos, A. V. (2007). Implementation and evaluation of a Greek text to speech system based on a harmonic plus noise model. IEEE Transactions on Consumer Electronics, 53(2). Retrieved 2/16/2010.

VoxChange. (2010). “100% imitation of another person's voice.” VOXCHANGE.COM. Retrieved 2/20/2010, from http://www.voxchange.com/voting/imitation-of-another-persons-voice.