
TOPIC 4: SPEECH PROCESSING SYSTEMS

NATURAL LANGUAGE PROCESSING (NLP)

CS-724

Wondwossen Mulugeta (PhD) email: [email protected]


Topics

Topic 5: Speech Processing Systems

• Introduction, Challenges
• Automatic Speech Recognition (Approaches, Acoustic Modeling, Lexical Modeling)
• Text to Speech System (Text Analysis, Waveform Synthesis)
• Evaluations

INTRODUCTION

Language is the ability to express one’s thoughts by means of a set of signs, whether graphical, gestural, acoustic, or even musical.

It is a distinctive feature of human beings, who use such a structured system.

Speech

Speech is a major component of language

Speech is the oldest means of communication

Levels of speech:

1. Acoustic

2. Phonetic

3. Phonological

4. Morphological

5. Syntactic

6. Semantic

7. Pragmatic

Why Speech Processing?

No visual contact required… helpful in many situations

No special equipment required… the human voice serves directly as input to the computer

Can be done while doing other things… people can talk while doing other tasks… productivity

Available Services

Windows Operating System

Mobile Phones (Smart Phones)

GPS

Wave Forms

Waveforms for the utterances “THE SPACE NEARBY” and “THE AREA AROUND”

Identifying word boundaries is one of the major challenges in speech processing systems. It also depends on how the speaker speaks.

What is a Dialog System?

Dialog systems seek to provide a natural conversational interaction between the user and the computer system, e.g.,

User: “Is there a way I can get to Bole International Airport from here?”

Domains for Dialog Systems

Possible Applications

Travel reservation

Weather forecasting

In-vehicle driver assistance

Call routing

On-line learning environments

Dialog Systems: Information Flow

Must model a two-way flow of information

User-to-system:

Makes inquiries

Provides clarification

Gives confirmation

System-to-user:

Asks for clarification

Gives responses

Tasks in Speech Processing

1) Speech Coding

Compress a Speech File

Making storage of audio files more compact

2) Speech Synthesis

Construct a speech waveform from words with good speaker quality and accent. Prosody?

Tasks in Speech Processing

3) Speech Recognition

Convert a sound waveform to words

The most relevant and important task in the industry

Tools: Sphinx, ViaVoice & SDK

4) Speaker Recognition/Verification

Concerned with Biometrics

Concerned with:

Speaker Quality

Prosody

Pitch, Accent etc.

ASR: Application

ASR: Development Method

Challenges of ASR

Co-articulation:

Neighbouring sounds overlap and influence one another, blurring phone and word boundaries

Speaker Variation

Many speakers for the system at various times (call centers)

Spontaneity

Naturalness of the speech

Language Modeling

Representation of the language

Noise Robustness

Tolerating and dealing with noise (natural environment)

Research Issues

Many fundamental problems must be solved for these systems to mature.

Three general areas include:

Automatic Speech Recognition (ASR)

Natural Language Processing (NLP)

Human-computer Interaction (HCI)

The Noisy Channel Model

Search through the space of all possible sentences and pick the one that is most probable given the waveform.

Dealing with Noise

What is the most likely sentence out of all sentences in the language L, given some acoustic input O?

Treat the acoustic input O as a sequence of individual observations:

O = o_1, o_2, o_3, …, o_t

Define a sentence as a sequence of words:

W = w_1, w_2, w_3, …, w_n

Dealing with Noise

Probabilistic implication: pick the highest-probability sentence W:

\hat{W} = \arg\max_{W \in L} P(W \mid O)

We can use Bayes’ rule to rewrite this:

\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}

Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)

Noisy channel model

\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}} \; \underbrace{P(W)}_{\text{prior}}

The noisy channel model

Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source).
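The argmax above can be made concrete with a toy decoder. The sketch below is a minimal illustration, not a real recognizer: the candidate sentences, their language-model priors, and the acoustic likelihoods are made-up numbers. It scores each candidate W by P(O|W)·P(W) in log space and returns the best one.

```python
import math

# Toy noisy-channel decoding: pick W maximizing P(O|W) * P(W).
# The candidates and probabilities below are invented for illustration.
candidates = {
    # sentence: (acoustic likelihood P(O|W), language-model prior P(W))
    "the space nearby":   (0.0020, 0.000009),
    "the area around":    (0.0018, 0.000012),
    "thee spays near by": (0.0021, 0.0000001),
}

def decode(candidates):
    """Return the sentence with the highest log P(O|W) + log P(W)."""
    best_sentence, best_score = None, float("-inf")
    for sentence, (likelihood, prior) in candidates.items():
        score = math.log(likelihood) + math.log(prior)  # log space avoids underflow
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence

print(decode(candidates))  # -> "the area around"
```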

Speech Architecture

Noisy Channel

NLP Issue: Semantic Representation

Two approaches:

Hand-craft the grammar for the application, using robust parsing to understand meaning

Problem: time, expense

Use a statistical approach, generating initial rules and using annotated, tree-banked data to discover the full rule set

Problem: annotated training data

NLP Issue:

Resolving Meaning Using Context

Must maintain knowledge of the conversational context.

“I’m at Dembel. How do I get to a gas station close to it and a café close to it?”

After a request for the nearest gas station, the user says, “What is it close to?”

Resolving “it” … more on this later

Another follow-up by the user: “How about …restaurant?”

Resolving “…” with “nearest”: ellipsis

Resolving Meaning: Discourse Analysis

To resolve such requests, the system must track the context of the conversation.

The system needs to keep track of long-distance relationships between words.

This is typically handled by a discourse analysis component in the Dialog Manager.

How is this resolved in a speech system:

“Kill him, not leave him.” vs. “Kill him not, leave him.”

Dialog Manager: Discourse Analysis

Anaphora resolution approach:

Use a focus mechanism, assuming the conversation has a focus.

For instance, “gas station” is the current focus.

But how about:

“I’m at Dembel. How do I get to a gas station close to it and then a café close to it?”

Problem: resolving the two “it”s.
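A minimal sketch of such a focus mechanism, assuming a simple “most recently mentioned entity” focus stack; the class name and resolution rule are illustrative, not the algorithm of any particular dialog manager.

```python
# Toy focus-based anaphora resolution: "it" resolves to the entity
# currently in focus (here, simply the most recently mentioned one).
class FocusTracker:
    def __init__(self):
        self.focus_stack = []

    def mention(self, entity):
        """Record an entity mention and make it the current focus."""
        self.focus_stack.append(entity)

    def resolve(self, pronoun):
        """Resolve 'it' to the current focus; returns None if nothing is in focus."""
        if pronoun.lower() == "it" and self.focus_stack:
            return self.focus_stack[-1]
        return None

tracker = FocusTracker()
tracker.mention("Dembel")
tracker.mention("gas station")
print(tracker.resolve("it"))  # -> "gas station" (current focus)
# The second "it" in the example above is the hard case: it may refer to
# "Dembel" or to the gas station, which this naive rule cannot decide.
```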

Dialog Manager: Clarification

Often cannot satisfy request in one iteration.

The previous example may require clarification from the user,

“Do you want to go to the gas station first?”

TEXT TO SPEECH SYNTHESIS

Reading

For human beings, the reading process involves:

Seeing, Thinking, Saying, Hearing

These are highly complex processes that cannot be fully imitated.

(Figure: the human speech production system)

Architecture of a TTS system:

Text Analysis (Document Structure Detection, Text Normalization, Linguistic Analysis) → Phonetic Analysis (Grapheme-to-Phoneme Conversion) → Prosodic Analysis (Pitch & Duration Attachment) → Speech Synthesis (Voice Rendering)

Intermediate representations: raw or tagged text → tagged text → tagged phones → controls → speech

TTS Synthesizer System

A text-to-speech synthesizer is a computer-based system that should be able to read any text aloud, and the resulting speech should be intelligible and natural.

“Text-to-Speech software is used to convert words from a computer document (e.g. word processor document, web page) into audible speech spoken through the computer speaker.”

Applications

1. Talking Calculator

2. Smart Phone Features

SMS Reader

Caller Reader

3. Computer-generated wiring instructions

4. Aids for the blind

5. Telephone inquiry service (Ethio Telecom 994)

6. Teaching machines

Typical TTS Components

Text → NATURAL LANGUAGE PROCESSING (Linguistic Formalisms, Inference Engines, Logical Inferences) → Phonemes + Prosody → DIGITAL SIGNAL PROCESSING (Mathematical Models, Algorithms, Computations) → Speech

(Block diagram of a TEXT-TO-SPEECH SYNTHESIZER)

Typical TTS Components

TTS has two components:

1. Natural Language Processing Module (NLP): linguistic formalisms, inference engines, logical inferences

2. Digital Signal Processing Module (DSP): mathematical models, algorithms, computations

The interface between the two modules is the phonetic transcription (phones plus prosody).

NLP and DSP Modules

The NLP module is capable of producing a phonetic transcription of the text to be read, together with the desired intonation and rhythm.

It takes the text as input and gives a narrow phonetic transcription as output, which is then forwarded to the DSP module.

The DSP module transforms the symbolic information it receives into natural-sounding speech. The “narrow phonetic transcription” used as the intermediate representation varies from one synthesizer system to another.
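A minimal sketch of this two-module pipeline, assuming hypothetical nlp_module and dsp_module functions; the names, data types, and hard-coded transcription are illustrative, not part of any real TTS toolkit.

```python
from typing import List, Tuple

# A "tagged phone": phone symbol plus toy prosodic controls (duration in ms, pitch in Hz).
TaggedPhone = Tuple[str, int, float]

def nlp_module(text: str) -> List[TaggedPhone]:
    """Hypothetical NLP module: text -> narrow phonetic transcription with prosody."""
    # A real module would perform text analysis, grapheme-to-phoneme conversion and
    # prosody prediction; here we return a hard-coded transcription for illustration.
    return [("h", 80, 120.0), ("e", 120, 130.0), ("l", 70, 125.0), ("o", 150, 110.0)]

def dsp_module(phones: List[TaggedPhone]) -> bytes:
    """Hypothetical DSP module: tagged phones -> speech waveform (empty audio here)."""
    return b""  # a real module would render audio samples from the controls

waveform = dsp_module(nlp_module("hello"))
```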

NLP Module of typical TTS system

Text Analyzer (Morpho Syntactic Analysis)

Pre-processor

Morphological Analyzer

Contextual Analyzer

Syntactic-Prosodic parser

Letter to Sound Module

Preprocessor

Takes in text as strings of ASCII characters.

Transforms the text into Broad Segmentation Units (BSUs) drawn from the following set: a sequence of characters, a sequence of digits, a single punctuation mark or other special character, or a sequence of white-space characters.

E.g., sentence: I know 1,000 words, Dr. Jones.

BSUs: (I)( )(know)( )(1)(,)(000)( )(words)(,)( )(Dr)(.)( )(Jones)(.)

The BSUs are then rewritten into a list of word-like units and syntax-bearing punctuation marks, called Final Segmentation Units (FSUs).
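A minimal regex sketch of this segmentation step, assuming the four BSU classes listed above; the pattern is illustrative, not a particular TTS implementation.

```python
import re

# Broad Segmentation Units: letters, digits, white space, or a single other character.
BSU_PATTERN = re.compile(r"[A-Za-z]+|\d+|\s+|.", re.DOTALL)

def to_bsus(text: str):
    """Split raw text into Broad Segmentation Units (BSUs)."""
    return BSU_PATTERN.findall(text)

print(to_bsus("I know 1,000 words, Dr. Jones."))
# ['I', ' ', 'know', ' ', '1', ',', '000', ' ', 'words', ',', ' ', 'Dr', '.', ' ', 'Jones', '.']
```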

Preprocessor

Sentence-end detection (the semicolon and period are ambiguous: they may mark a ratio, a time, a decimal point, or a sentence ending)

Abbreviations (e.g. → “for instance”): changed to their full form with the help of lexicons

Acronyms (I.B.M. can be read as a sequence of characters, while NASA can be read in the default way, as a word)

Numbers (once detected, they are interpreted as rationals, times of day, dates, or ordinals depending on their context)

Idioms (e.g. “in spite of”, “as a matter of fact”: these are combined into a single FSU using a special lexicon)
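A minimal sketch of abbreviation and number handling, assuming a tiny hypothetical lexicon; the entries and rules are illustrative only.

```python
import re

# Hypothetical abbreviation lexicon; a real system would use much larger resources.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for instance", "etc.": "et cetera"}

def normalize(text: str) -> str:
    """Expand known abbreviations and strip grouping commas from integers."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Very rough number handling: "1,000" becomes "1000"; a real normalizer
    # would verbalize it ("one thousand") based on context.
    text = re.sub(r"(\d),(\d{3})\b", r"\1\2", text)
    return text

print(normalize("I know 1,000 words, Dr. Jones."))
# -> "I know 1000 words, Doctor Jones."
```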

Morphological Analysis

Its task is to propose all possible part-of-speech categories for each word, taken individually, on the basis of its spelling.

The part of speech may affect the way a word is pronounced.

Words are divided into function words and content words.

Contextual Analysis

Considers words in their context.

Reduces the list of part-of-speech categories of each word to a very restricted number of highly probable hypotheses, given the possible parts of speech of the neighboring words.

Achieved by N-grams, multi-layer perceptrons (neural networks), local stochastic grammars (provided by expert linguists), etc.
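A minimal bigram sketch of this disambiguation step, assuming made-up candidate tags and transition probabilities that are not trained on any real corpus.

```python
# Toy contextual disambiguation: choose the tag of an ambiguous word using
# bigram tag-transition probabilities. All numbers below are invented.
TRANSITIONS = {                 # P(tag | previous tag)
    ("PRON", "VERB"): 0.60,
    ("PRON", "NOUN"): 0.10,
    ("DET",  "NOUN"): 0.70,
    ("DET",  "VERB"): 0.05,
}

def disambiguate(prev_tag: str, candidate_tags):
    """Pick the candidate tag that is most probable after prev_tag."""
    return max(candidate_tags, key=lambda t: TRANSITIONS.get((prev_tag, t), 0.0))

# "record" can be a NOUN or a VERB; the previous word's tag decides.
print(disambiguate("PRON", ["NOUN", "VERB"]))  # after "I"   -> VERB
print(disambiguate("DET",  ["NOUN", "VERB"]))  # after "the" -> NOUN
```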

Letter to Sound Module

The LTS module is responsible for the automatic determination of the phonetic transcription of the incoming text.

We cannot just look every word up in a pronunciation dictionary, and languages do not follow the rule “one character = one phoneme”. Examples:

A single character corresponding to two phonemes: x as /ks/

Several characters producing one phoneme: gh in thought

A single character pronounced in different ways: c in ancestor, ancient, epic

A single phoneme written with several spellings: sh in dish, t in action, c in ancient

Two Basic Strategies

There are two commonly used strategies for converting text to phonemes:

1. Dictionary-based and

2. Rule-based

Dictionary Based approach

The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciation is stored by the program.

Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary.
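A minimal sketch of dictionary-based lookup, assuming a tiny hypothetical pronunciation dictionary; real systems use lexicons with tens of thousands of entries.

```python
# Hypothetical pronunciation dictionary mapping words to phoneme lists.
PRON_DICT = {
    "speech": ["S", "P", "IY", "CH"],
    "the":    ["DH", "AH"],
    "area":   ["EH", "R", "IY", "AH"],
}

def transcribe(sentence: str):
    """Look each word up in the dictionary; None marks out-of-vocabulary words."""
    return [PRON_DICT.get(word.lower()) for word in sentence.split()]

print(transcribe("The area"))         # [['DH', 'AH'], ['EH', 'R', 'IY', 'AH']]
print(transcribe("The spectrogram"))  # [['DH', 'AH'], None]  <- needs rules instead
```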

Rule based approach

The other approach used for text-to-phoneme conversion is the rule-based approach, where rules for the pronunciation of words are applied to the words to work out their pronunciations based on their spellings.

This is similar to the “sounding out” approach to learning to read.
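A minimal sketch of rule-based letter-to-sound conversion, illustrative only: the rules below cover just a few English spelling patterns and the phoneme symbols are simplified.

```python
# Toy letter-to-sound rules, checked in order (longer patterns listed first).
# Each rule maps a spelling pattern to a phoneme string; far from complete.
RULES = [
    ("tion", "SH AH N"),
    ("gh",   ""),          # silent, as in "thought"
    ("sh",   "SH"),
    ("th",   "TH"),
    ("ou",   "AO"),
    ("x",    "K S"),
    ("c",    "K"),
    ("t",    "T"),
    ("a",    "AE"),
    ("o",    "AA"),
    ("i",    "IH"),
    ("n",    "N"),
]

def letter_to_sound(word: str) -> str:
    """Convert a word to phonemes by matching spelling patterns left to right."""
    word, i, phones = word.lower(), 0, []
    while i < len(word):
        for pattern, phoneme in RULES:
            if word.startswith(pattern, i):
                if phoneme:
                    phones.append(phoneme)
                i += len(pattern)
                break
        else:
            i += 1  # skip letters no rule covers
    return " ".join(phones)

print(letter_to_sound("thought"))  # -> "TH AO T"
print(letter_to_sound("action"))   # -> "AE K SH AH N"
```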

Synthesizer technologies

There are two main technologies used for generating synthetic speech waveforms:

1. concatenative synthesis and

2. formant synthesis (a.k.a. parametric speech synthesis)

Formant Synthesis

Formant synthesis does not use any human speech samples at runtime. Instead, the output synthesized speech is created using an acoustic model.

Parameters such as frequency, amplitude, etc. are varied over time to create a waveform of artificial speech.
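A toy sketch of the idea, assuming made-up formant frequencies roughly in the range of an /a/-like vowel; this is a crude illustration of varying parameters over time, not a usable formant synthesizer.

```python
import math, struct, wave

SAMPLE_RATE = 16000
DURATION = 0.5  # seconds

def vowel_like(formants=(700, 1200, 2600), f0=120.0):
    """Sum sinusoids at formant frequencies, amplitude-modulated at pitch f0."""
    samples = []
    for n in range(int(SAMPLE_RATE * DURATION)):
        t = n / SAMPLE_RATE
        envelope = 0.5 * (1 + math.sin(2 * math.pi * f0 * t))  # crude voicing
        value = sum(math.sin(2 * math.pi * f * t) / (i + 1)
                    for i, f in enumerate(formants))
        samples.append(envelope * value / len(formants))
    return samples

with wave.open("vowel.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)              # 16-bit samples
    wav.setframerate(SAMPLE_RATE)
    frames = b"".join(struct.pack("<h", int(32767 * 0.3 * s)) for s in vowel_like())
    wav.writeframes(frames)
```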

Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech.

Generally, concatenative synthesis gives the most natural sounding synthesized speech.

However, natural variation in speech and automated techniques for segmenting the waveforms sometimes result in audible glitches in the output, detracting from the naturalness.

Concatenative Synthesis

Record basic inventory of sounds

Retrieve appropriate sequence of units at run time

Concatenate and adjust durations and pitch

Synthesize waveform
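A minimal sketch of the run-time concatenation step, assuming the recorded units are already available as NumPy arrays in a hypothetical unit_inventory dict; unit selection and duration/pitch adjustment are omitted, and only a short crossfade is shown.

```python
import numpy as np

SAMPLE_RATE = 16000

# Hypothetical inventory: unit name -> recorded waveform (placeholder noise here).
rng = np.random.default_rng(0)
unit_inventory = {name: rng.uniform(-0.1, 0.1, SAMPLE_RATE // 10)
                  for name in ["dh-ah", "ah-s", "s-p", "p-iy"]}

def concatenate(unit_names, fade=64):
    """Join recorded units with a short linear crossfade to reduce audible glitches."""
    out = unit_inventory[unit_names[0]].copy()
    ramp = np.linspace(0.0, 1.0, fade)
    for name in unit_names[1:]:
        unit = unit_inventory[name]
        out[-fade:] = out[-fade:] * (1 - ramp) + unit[:fade] * ramp  # crossfade
        out = np.concatenate([out, unit[fade:]])
    return out

waveform = concatenate(["dh-ah", "ah-s", "s-p", "p-iy"])
```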

Phonetic Post Processing

In order to increase the intelligibility and naturalness of synthetic speech, some kind of phonetic post-processing is required.

After the first phonemic transcription of each word has been obtained, this post-processing is applied so as to account for co-articulatory smoothing. This smoothing results in higher-quality speech.

Prosody refers to certain properties of the speech signal that are related to audible changes in pitch, loudness, and syllable length. This is also referred to as intonation.

DSP Module

Digital signal processing (DSP) is the numerical manipulation of signals, usually with the intention to measure, filter, produce, or compress continuous analog signals.

The DSP module takes in the narrow phonetic transcription and gives out speech as output.

This is more of a mathematical computation and system development issue.

Evaluating Speech Systems

System-based evaluation:

Total system initiative provides low usability.

User-based evaluation:

Total user initiative introduces a higher error rate.

Thus, a mixed-initiative approach, balancing usability and error rate, is most often taken.

Evaluating Speech Systems

Task Success

Was the necessary information exchanged?

Efficiency/Cost

Number of dialog turns, task completion time

Qualitative

ASR rejections, timeouts, helps

Usability

User satisfaction with ASR, task ease, interaction pace, system response
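A minimal sketch of computing such metrics from hypothetical dialog logs; the log format and field names are invented for illustration.

```python
# Hypothetical per-dialog logs; field names are invented for this sketch.
dialogs = [
    {"task_success": True,  "turns": 6,  "seconds": 48.0, "asr_rejections": 0, "timeouts": 1},
    {"task_success": False, "turns": 11, "seconds": 95.5, "asr_rejections": 3, "timeouts": 2},
    {"task_success": True,  "turns": 4,  "seconds": 31.2, "asr_rejections": 1, "timeouts": 0},
]

def summarize(dialogs):
    """Aggregate task success, efficiency, and qualitative counts across dialogs."""
    n = len(dialogs)
    return {
        "task_success_rate": sum(d["task_success"] for d in dialogs) / n,
        "avg_turns": sum(d["turns"] for d in dialogs) / n,
        "avg_completion_seconds": sum(d["seconds"] for d in dialogs) / n,
        "total_asr_rejections": sum(d["asr_rejections"] for d in dialogs),
        "total_timeouts": sum(d["timeouts"] for d in dialogs),
    }

print(summarize(dialogs))
```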

End of Topic 5