Language
13 Language
13.1 Linguistics
13.2 Grammar
13.3 Syntactic analysis
13.4 Processing
13.5 Practical systems∗
Linguistics
Natural language understanding (NLU) or natural language processing (NLP) (computational linguistics, psycholinguistics) concerns the interactions between computers and human natural languages
– extracting meaningful information from natural language input
– producing natural language output
A brief history of NLU
1940-60s Foundational Insights
  automaton, McCulloch-Pitts neuron
  probabilistic or information-theoretic models
  formal language theory (Chomsky, 1956)
1957-70 The Two Camps
  symbolic and stochastic (parsing algorithms)
  Bayesian method (text recognition)
  the first on-line corpora (Brown corpus of English)
1970-83 Four Paradigms
  stochastic paradigm: Hidden Markov Model
  logic-based paradigm: Prolog (Definite Clause Grammars)
  natural language understanding: SHRDLU (Winograd, 1972)
  discourse modeling paradigm: speech acts, BDI
1983-93 Empiricism and Finite State Models Redux
A brief history of NLU
1994-99 The Field Comes Together
  probabilistic and data-driven models
2000-07 The Rise of Machine Learning
  big data (spoken and written)
  statistical learning
  resurgence of probabilistic and decision-theoretic methods
2008- Deep Learning
  high-performance computing
  NLP as recognition
Ref Grosz et al. (1986), Readings in Natural Language Processing
Communication
"Classical" view (pre-1953):
  language consists of sentences that are true/false (cf. logic)

"Modern" view (post-1953):
  language is a form of action
Wittgenstein (1953), Philosophical Investigations
Austin (1962), How to Do Things with Words
Searle (1969), Speech Acts
Speech acts
[Diagram: in a Situation, a Speaker produces an Utterance for a Hearer]
Speech acts achieve the speaker's goals:
  Inform       "There's a pit in front of you"
  Query        "Can you see the gold?"
  Command      "Pick it up"
  Promise      "I'll share the gold with you"
  Acknowledge  "OK"
Speech act planning requires knowledge of
– Situation
– Semantic and syntactic conventions
– Hearer's goals, knowledge base, and rationality
Stages in communication (informing)
Intention       S wants to inform H that P
Generation      S selects words W to express P in context C
Synthesis       S utters words W
Perception      H perceives W′ in context C′
Analysis        H infers possible meanings P1, …, Pn
Disambiguation  H infers intended meaning Pi
Incorporation   H incorporates Pi into KB
How could this go wrong?
– Insincerity (S doesn't believe P)
– Speech wreck ignition failure
– Ambiguous utterance
– Differing understanding of current context (C ≠ C′)
Knowledge representation in language
Engaging in complex language behavior requires various kinds of knowledge of language

• Phonetics and phonology: the linguistic sounds
• Morphology: the meaningful components of words
• Syntax: the structural relationships between words
• Semantics: meaning
• Pragmatics: the relationship of meaning to the goals and intentions of the speaker
• Discourse: the linguistic units larger than a single utterance

and

• World knowledge: common knowledge, commonsense knowledge
– language cannot be understood without the everyday knowledge that all speakers share about the world
Grammar
Vervet monkeys, antelopes etc. use isolated symbols for sentences
⇒ restricted set of communicable propositions, no generative capacity

Chomsky (1957): Syntactic Structures

Grammar specifies the compositional structure of complex messages
e.g., speech (linear), text (linear), music (two-dimensional)

A formal language is a set of strings of terminal symbols

Each string in the language can be analyzed/generated by the grammar
Grammar
The grammar is a set of rewrite rules, e.g.,
S → NP VP
Article → the | a | an | . . .
Here S is the sentence symbol, NP and VP are nonterminals
Grammar types
Regular: nonterminal → terminal [nonterminal]
  S → a S
  S → Λ

Context-free: nonterminal → anything
  S → a S b

Context-sensitive: more nonterminals on right-hand side
  A S B → A A a B B

Recursively enumerable: no constraints

Related to Post systems and Kleene systems of rewrite rules
Natural languages probably context-free, parsable in real time
Wumpus lexicon
Noun → stench | breeze | glitter | nothing
| wumpus | pit | pits | gold | east | . . .
Verb → is | see | smell | shoot | feel | stinks
| go | grab | carry | kill | turn | . . .
Adjective → right | left | east | south | back | smelly | . . .
Adverb → here | there | nearby | ahead
| right | left | east | south | back | . . .
Pronoun → me | you | I | it | . . .
Name → John | Mary | Beijing | UCB | PKU | . . .
Article → the | a | an | . . .
Preposition → to | in | on | near | . . .
Conjunction → and | or | but | . . .
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Wumpus grammar

S → NP VP                  I + feel a breeze
  | S Conjunction S        I feel a breeze + and + I smell a wumpus

NP → Pronoun               I
  | Noun                   pits
  | Article Noun           the + wumpus
  | Digit Digit            3 4
  | NP PP                  the wumpus + to the east
  | NP RelClause           the wumpus + that is smelly

VP → Verb                  stinks
  | VP NP                  feel + a breeze
  | VP Adjective           is + smelly
  | VP PP                  turn + to the east
  | VP Adverb              go + ahead

PP → Preposition NP        to + the east
RelClause → that VP        that + is smelly
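Read as rewrite rules, the grammar can also generate sentences. A minimal Python sketch over a small fragment of the grammar and lexicon above; uniform rule choice and the exact rule table are assumptions of the sketch:

import random

# Random sentence generation from a fragment of the wumpus grammar;
# symbols absent from the table are treated as terminal words.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Pronoun"], ["Article", "Noun"]],
    "VP": [["Verb"], ["VP", "NP"]],
    "Pronoun": [["I"], ["it"]],
    "Verb": [["feel"], ["smell"], ["shoot"]],
    "Article": [["the"], ["a"]],
    "Noun": [["wumpus"], ["breeze"], ["pit"]],
}

def generate(symbol="S"):
    if symbol not in GRAMMAR:                   # terminal word
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])  # uniform choice of rule
    return [word for s in expansion for word in generate(s)]

print(" ".join(generate()))   # e.g., "I feel a breeze"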
Grammaticality judgements
Formal language L1 may differ from natural language L2
[Venn diagram: formal language L1 and natural language L2 overlap; strings in L1 but not L2 are false positives, strings in L2 but not L1 are false negatives]
Adjusting L1 to agree with L2 is a learning problem
* the gold grab the wumpus
* I smell the wumpus the gold
I give the wumpus the gold

Intersubjective agreement reliable, independent of semantics
Real grammars 10–500 pages, insufficient even for "proper" English
Syntactic analysis
Exhibit the grammatical structure of a sentence
I shoot the wumpus
Parse trees

Exhibit the grammatical structure of a sentence

I shoot the wumpus

[S [NP [Pronoun I]]
   [VP [VP [Verb shoot]]
       [NP [Article the] [Noun wumpus]]]]
Parsing
Bottom-up: replacing any substring that matches the RHS of a rule with the rule's LHS
function BottomUpParse(words, grammar) returns a parse tree
  forest ← words
  loop do
    if Length(forest) = 1 and Category(forest[1]) = Start(grammar) then
      return forest[1]
    else
      i ← choose from {1 … Length(forest)}
      rule ← choose from Rules(grammar)
      n ← Length(Rule-RHS(rule))
      subsequence ← Subsequence(forest, i, i+n−1)
      if Match(subsequence, Rule-RHS(rule)) then
        forest[i … i+n−1] ← [Make-Node(Rule-LHS(rule), subsequence)]
      else fail
end
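A runnable Python sketch of this procedure, with backtracking search standing in for the nondeterministic "choose" steps; the toy grammar and tuple-based tree nodes are assumptions of the sketch:

# Tree nodes are (category, children) tuples; leaves are (category, word).
GRAMMAR = {
    ("NP", "VP"): "S",
    ("Pronoun",): "NP", ("Article", "Noun"): "NP",
    ("Verb",): "VP", ("VP", "NP"): "VP",
}
LEXICON = {"I": "Pronoun", "shoot": "Verb", "the": "Article", "wumpus": "Noun"}

def bottom_up_parse(forest):
    """Return a parse tree of category S covering the whole forest, or None."""
    if len(forest) == 1 and forest[0][0] == "S":
        return forest[0]
    for i in range(len(forest)):                    # choose a position
        for rhs, lhs in GRAMMAR.items():            # choose a rule
            n = len(rhs)
            if tuple(node[0] for node in forest[i:i+n]) == rhs:
                reduced = forest[:i] + [(lhs, forest[i:i+n])] + forest[i+n:]
                tree = bottom_up_parse(reduced)     # recurse; backtrack on failure
                if tree is not None:
                    return tree
    return None   # no rule applies anywhere: fail (grammars with unary
                  # cycles would additionally need a visited-set guard)

words = "I shoot the wumpus".split()
print(bottom_up_parse([(LEXICON[w], w) for w in words]))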
Context-free parsing
Efficient algorithms (e.g., chart parsing), O(n^3) for context-free grammars, run at several thousand words/sec for real grammars

Context-free parsing ≡ Boolean matrix multiplication
⇒ unlikely to find faster practical algorithms
Logical grammars
BNF notation for grammars too restrictive:
– difficult to add "side conditions" (number agreement, etc.)
– difficult to connect syntax to semantics

Idea: express grammar rules as logic

X → Y Z    becomes  Y(s1) ∧ Z(s2) ⇒ X(Append(s1, s2))
X → word   becomes  X(["word"])
X → Y | Z  becomes  Y(s) ⇒ X(s)    Z(s) ⇒ X(s)

Here, X(s) means that string s can be interpreted as an X
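The same idea can be written in Python rather than logic: one predicate per nonterminal, with Append replaced by trying every split point. The toy lexicon is an assumption of the sketch:

# X(s) is true iff the word list s can be interpreted as an X
def Pronoun(s): return s in (["I"], ["it"])
def Verb(s):    return s in (["shoot"], ["stinks"])
def Article(s): return s in (["the"], ["a"])
def Noun(s):    return s in (["wumpus"], ["pit"])

def NP(s):  # NP -> Pronoun | Article Noun
    return Pronoun(s) or any(Article(s[:k]) and Noun(s[k:])
                             for k in range(1, len(s)))

def VP(s):  # VP -> Verb | VP NP
    return Verb(s) or any(VP(s[:k]) and NP(s[k:])
                          for k in range(1, len(s)))

def S(s):   # S -> NP VP
    return any(NP(s[:k]) and VP(s[k:]) for k in range(1, len(s)))

print(S("I shoot the wumpus".split()))   # True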
Logical grammars
It's easy to augment the rules

NP(s1) ∧ EatsBreakfast(Ref(s1)) ∧ VP(s2)
  ⇒ NP(Append(s1, ["who"], s2))

NP(s1) ∧ Number(s1, n) ∧ VP(s2) ∧ Number(s2, n)
  ⇒ S(Append(s1, s2))

Parsing is reduced to logical inference:
  Ask(KB, S(["I", "am", "a", "wumpus"]))

Can add extra arguments to return the parse structure, semantics
– semantic interpretations
Logical grammars
Generation simply requires a query with uninstantiated variables:
  Ask(KB, S(x))

If we add arguments to nonterminals to construct sentence semantics, NLP generation can be done from a given logical sentence:

  Ask(KB, S(x, At(Robot, [1, 1])))

Montague grammar
  R. Montague, English as a Formal Language, 1970 (Formal Philosophy, 1974)
  I. Heim, A. Kratzer, Semantics in Generative Grammar, 1998
  C. Potts, Logic of Conventional Implicatures, 2005
– Chomsky: Minimalist Program
– Discourse Representation Theory
– Situation Semantics/Situation Theory
– Game-theoretic Semantics
Probabilistic grammar
Probabilistic context-free grammar (PCFG): the grammar assigns a probability to every string

VP → Verb    [0.70]
   | VP NP   [0.30]

With probability 0.70 a verb phrase consists solely of a verb, and with probability 0.30 it is a VP followed by an NP

Also assign a probability to every word (lexicon)

Chart parsers: to avoid the inefficiency of repeated parsing, every time we analyze a substring, store the results so we won't have to reanalyze it later
– such a bottom-up (PCFG) version is called a chart parsing algorithm
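A probabilistic chart (CKY) parser in this spirit, as a minimal sketch assuming a toy PCFG in Chomsky normal form; the probabilities below are made up for illustration:

from collections import defaultdict

# Binary rules (B, C) -> (A, prob) plus a probabilistic lexicon;
# chart[i, j] maps a category to the best probability of deriving words[i:j].
BINARY = {("NP", "VP"): [("S", 0.90)],
          ("Article", "Noun"): [("NP", 0.25)],
          ("Verb", "NP"): [("VP", 0.30)]}
LEXICON = {"I": [("NP", 0.10)], "shoot": [("Verb", 1.0)],
           "the": [("Article", 1.0)], "wumpus": [("Noun", 0.15)]}

def cky(words):
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):             # lexical entries
        for cat, p in LEXICON.get(w, []):
            chart[i, i + 1][cat] = p
    for span in range(2, n + 1):              # each substring analyzed once,
        for i in range(n - span + 1):         # then stored: no re-analysis
            j = i + span
            for k in range(i + 1, j):         # split point
                for b, pb in chart[i, k].items():
                    for c, pc in chart[k, j].items():
                        for a, pr in BINARY.get((b, c), []):
                            p = pr * pb * pc
                            if p > chart[i, j].get(a, 0.0):
                                chart[i, j][a] = p
    return chart[0, n].get("S", 0.0)

print(cky("I shoot the wumpus".split()))   # best probability of an S parse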
Syntax in NLP
Most view syntactic structure as an essential step towards meaning:
  "Mary hit John" ≠ "John hit Mary"

"And since I was not informed—as a matter of fact, since I did not know that there were excess funds until we, ourselves, in that checkup after the whole thing blew up, and that was, if you'll remember, that was the incident in which the attorney general came to me and told me that he had seen a memo that indicated that there were no more funds."
"Wouldn't the sentence 'I want to put a hyphen between the words Fish and And and And and Chips in my Fish-And-Chips sign' have been clearer if quotation marks had been placed before Fish, and between Fish and and, and and and And, and And and and, and and and And, and And and and, and and and Chips, as well as after Chips?"
Problems
Real human languages provide many problems for NLP
• ambiguity
• anaphora
• indexicality
• vagueness
• discourse structure
• metonymy
• metaphor
• noncompositionality etc.
Ambiguity
Ambiguity at all levels
• Lexical
  "You held your breath and the door for me"
• Syntactic
  "Put the book in the box on the table"
    [the book] in the box
    [the book in the box]
• Semantic: a sentence can have more than one meaning
  "Alice wants a dog like Bob's"
• Pragmatic
  "Alice: Do you know who's going to the party?
   Bob: Who?"
Understanding
Levels of understanding
1. Keyword processing: limited knowledge of particular words or phrases
   e.g., chatbots, information retrieval, Web searching
2. Limited linguistic ability: appropriate response to simple, highly constrained sentences
   e.g., database queries in NL, simple NL interfaces
3. Full text comprehension: multi-sentence text and its relation to the real world
   e.g., conversational dialogue, automatic knowledge acquisition
4. Emotional understanding/generation
   e.g., responding to literature, poetry, story narration
Understanding
Why is understanding hard?
– Ambiguity: mapping is one-to-many
– Richer structures than strings: often hierarchical or scope-bearing
– Strong expressiveness: mapping from surface form to meaning is many-to-one

Debate: empiricism vs. rationalism
  Gold showed that it is not possible to reliably learn a correct context-free grammar
  Chomsky argued that there must be an innate universal grammar that all children have from birth
  Horning showed that it is possible to learn a probabilistic context-free grammar (by PAC algorithms)
Understanding
Goal: a scientific theory of communication by language
• To understand the structure of language and its use as a complex computational system
• To develop the data structures and algorithms that can implement that system
Long way to go
Processing
• Probabilistic models of language
• Text classification
• Information retrieval
• Information extraction
Ref Steven Bird, Ewan Klein, and Edward Loper (2009), Natural Language Processing with Python, O'Reilly Media Inc.
(Python has good string-handling functionality, besides LISP)
Probabilistic models of language
Define an (approximate) natural language model as a probability distribution over sentences and possible meanings

A corpus is a body of text

N-gram (letter or word units) model P(c_{1:N}): probability distribution of n-letter (or word) sequences, defined as a Markov chain of order n−1
Say, a trigram (3-gram) model:

  P(c_i | c_{1:i−1}) = P(c_i | c_{i−2:i−1})

In a language with 100 characters, the trigram distribution has a million entries, and can be accurately estimated by counting character sequences in a corpus with 10 million characters
With a vocabulary of 10^5 words, there are 10^15 trigram probabilities to estimate
e.g., books.google.com/ngram
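A minimal sketch of estimating a character trigram model by counting, as described above; the tiny corpus is illustrative:

from collections import Counter

# Estimate P(c_i | c_{i-2} c_{i-1}) from trigram and bigram counts
def train_trigram(corpus):
    tri, bi = Counter(), Counter()
    for i in range(len(corpus) - 2):
        bi[corpus[i:i+2]] += 1
        tri[corpus[i:i+3]] += 1
    return lambda c, ctx: tri[ctx + c] / bi[ctx] if bi[ctx] else 0.0

corpus = "the wumpus is in the pit and the gold is in the pit"
P = train_trigram(corpus)
print(P("e", "th"))   # P('e' | 'th') = 1.0 in this tiny corpus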
Text classification
Text classification (categorization): given a text of some kind, decide which of a predefined set of classes it belongs to

E.g., language identification, spam detection etc.

Language identification: determine what natural language a text is written in (likewise spelling correction, genre classification, named-entity recognition etc.)

– N-gram models are well suited for the task of language identification with small n ≥ 3
– the counts from the training corpora need smoothing so that unseen sequences get nonzero probability
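A language-identification sketch along these lines: one character trigram model per language, add-k smoothing for nonzero probabilities, and the highest-scoring model wins. The corpora and constants are toy assumptions:

import math
from collections import Counter

def counts(text):
    tri, bi = Counter(), Counter()
    for i in range(len(text) - 2):
        bi[text[i:i+2]] += 1
        tri[text[i:i+3]] += 1
    return tri, bi

def loglik(text, model, k=0.5, v=30):
    # add-k smoothing keeps unseen trigrams at nonzero probability
    tri, bi = model
    return sum(math.log((tri[text[i:i+3]] + k) / (bi[text[i:i+2]] + k * v))
               for i in range(len(text) - 2))

corpora = {"en": "the cat sat on the mat with the hat",
           "de": "der hund und die katze und das haus"}
models = {lang: counts(text) for lang, text in corpora.items()}
sample = "the hat and the cat"
print(max(models, key=lambda lang: loglik(sample, models[lang])))   # en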
Information retrieval
Information retrieval (IR): find documents that are relevant to a query
E.g., search engines
IR system:
1. A corpus of documents
2. Queries posed in a query language
3. A result set
4. A presentation of the result set
IR techniques have moved from the Boolean keyword model to statistical models
IR evaluation
Precision: the proportion of documents in the result set that are actually relevant

Recall: the proportion of all the relevant documents in the collection that appear in the result set

In a large document collection, such as the Web, recall is difficult to compute

There are only some refinement techniques to improve the performance of a search engine
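A worked example, with an assumed result set and relevant set:

# Precision and recall from a result set and the set of relevant documents
relevant = {"d1", "d2", "d3", "d4"}       # all relevant docs in the collection
result = {"d1", "d2", "d9"}               # what the engine returned

precision = len(result & relevant) / len(result)     # 2/3 ≈ 0.67
recall = len(result & relevant) / len(relevant)      # 2/4 = 0.50
print(precision, recall)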
PageRank algorithm
PageRank is a link analysis algorithm based on the Web graph (pages as nodes and hyperlinks as edges). It assigns each page a rank value indicating the importance of that page; the value is defined recursively and depends on the number of pages that link to it (in-links)
A page p that is linked to by many pages with high PageRank receivesa high rank itself
PR(p_i) = (1−d)/N + d ( Σ_{p_j links to p_i} PR(p_j)/C(p_j) + Σ_{p_j has no out-links} PR(p_j)/N )

PR(p_i) — the PageRank value of page p_i
N — the total number of pages in the corpus
p_j — the pages that link to p_i
C(p_j) — the count of the total number of out-links on page p_j
d — a damping factor (random surfer)
PageRank algorithm

function PageRank(G, k) returns PageRank values
  G: an in-link file; k: the number of iterations
  persistent: N, the number of pages from G
              ho, hi, out-link count / in-link hashes from G, respectively
              d, a damping factor, initially 0.85
  d ← 0.85; ho, hi, N ← G
  for all p in the graph do
    opg[p] ← 1/N
  while k > 0 do
    dp ← 0
    for all p that have no out-links do
      dp ← dp + d × opg[p]/N          // dangling mass, shared by every page
    for all p in the graph do
      npg[p] ← dp + (1−d)/N
      for all pi in hi(p) do
        npg[p] ← npg[p] + d × opg[pi]/ho[pi]
    opg ← npg
    k ← k − 1
  return opg
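A runnable Python sketch of the same iteration, assuming the graph is given as in-link lists plus out-link counts; the three-page graph is purely illustrative:

def pagerank(inlinks, outdeg, k=50, d=0.85):
    """Iterate the PageRank equation; inlinks[p] lists pages linking to p,
    outdeg[q] is the number of out-links of q (0 marks a sink page)."""
    pages = list(inlinks)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(k):
        sink = sum(pr[q] for q in pages if outdeg[q] == 0)
        pr = {p: (1 - d) / N + d * sink / N +
                 d * sum(pr[q] / outdeg[q] for q in inlinks[p])
              for p in pages}
    return pr

# Tiny illustrative graph: A -> B, A -> C, B -> C; C has no out-links
inlinks = {"A": [], "B": ["A"], "C": ["A", "B"]}
outdeg = {"A": 2, "B": 1, "C": 0}
print(pagerank(inlinks, outdeg))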
Question answering
Question answering (QA): answering a question not with a ranked list of documents but with a short response (a sentence or phrase)

A QA program may use either a pre-structured database or a collection of natural language documents (a text corpus such as the Web)

Question types: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions

AskMSR: Web-based QA system (2002)

Watson: IBM's DeepQA
– In 2011, competed on the quiz show Jeopardy!
– had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage, including the full text of Wikipedia, but was not connected to the Internet during the game
Information extraction
Information extraction: acquiring knowledge by skimming a text and looking for occurrences of a particular class of object and for relationships among objects
– also known as information filtering: removing redundant information to manage information overload

E.g., extracting addresses from Web pages

Some methods: finite state automata (regular expressions), probabilistic models, machine learning (deep learning), machine reading, ontology
Ontology extraction
Ontology extraction from large corpora, say, KG (knowledge graph)
– all types of domains, not just one specific domain
– dominated by precision, not recall
– the results gathered from multiple sources

E.g., general templates for categories

NP such as NP (, NP)* (,)? ((and | or) NP)?
Learn templates from a few examples, then use the templates to learn more examples, from which more templates can be learned
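A crude sketch of applying the template with a regular expression; real extractors match parsed noun phrases rather than raw words, so this is only illustrative:

import re

# Grab the category word before "such as" and split the list that follows
text = "They visited cities such as Beijing, Boston, and Paris."
m = re.search(r"(\w+) such as ([^.;!?]+)", text)
if m:
    category = m.group(1)
    members = [w.strip() for w in re.split(r",|\band\b|\bor\b", m.group(2))
               if w.strip()]
    print(category, members)   # cities ['Beijing', 'Boston', 'Paris']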
WordNet: dictionary of about 100,000 words and phrases
– parts of speech, semantic relations (synonym, antonym)

Penn Treebank: parse trees for a 3-million-word corpus (English)
The British National Corpus: 100 million words
Web: trillions of words
Machine reading
Machine reading: extracting knowledge by reading on its own, building up its own database without human input

E.g., TextRunner
– using syntactic templates extracted from the Penn Treebank

Watson, say, reading one million medical documents within one hour
Recommender system
Recommender system: to predict the "rating" or "preference" that a user would give to an item

E.g., movies, music, news, books, articles, search queries, social tags, products and online dating etc.

Collaborative filtering
– building a model from users' past behavior (say, ratings)
– using the model to predict items (say, ratings for items) that the user may have an interest in

Machine learning models: matrix computation, probabilistic models etc.

Netflix (2006-2009) and other prizes for RS competitions
Practical systems
• Machine translation
• Speech recognition
• Conversational agent
Machine translation
MT: automatic translation of text from one natural language (the source) to another (the target)

Try translating a passage of a page in a browser with Google translator from the source Chinese into the target English, and then translating back from English to Chinese

What can you find??
A translator (human or machine) requires in-depth understanding of the bilingual text

A representation language that makes all the distinctions necessary for a set of languages is called an interlingua
– creating a complete knowledge representation of everything
– parsing into that representation
– generating sentences from that representation
Machine translation
NMT (Neural MT): end-to-end (deep) learning approach for MT
– regard MT as a sequence-to-sequence prediction task and, without using any information from standard MT systems
– design two deep neural networks ⇒ viewing MT as recognition
– – an encoder: to learn continuous representations of source language sentences
– – a decoder: to generate the target language sentence from the source sentence representation

Currently the best MT systems, better than conventional or statistical phrase-based systems
Speech recognition
Speech recognition: identify a sequence of words uttered by a speaker, given the acoustic signal

"It's not easy to wreck a nice beach" (recognize speech)

Speech signals are noisy, variable, ambiguous

Since the mid-1970s, speech recognition has been formulated as probabilistic inference

What is the most likely word sequence, given the speech signal?
I.e., choose Words to maximize P(Words | signal)
Speech recognition
Use Bayes’ rule:
P(Words | signal) = α P(signal | Words) P(Words)
I.e., decomposes into acoustic model + language model
Words are the hidden state sequence, signal is the observationsequence
Phones
All human speech is composed from 40-50 phones, determined by the configuration of articulators (lips, teeth, tongue, vocal cords, air flow)

Form an intermediate level of hidden states between words and signal
⇒ acoustic model = pronunciation model + phone model
ARPAbet designed for American English
[iy] beat    [b] bet    [p] pet
[ih] bit     [ch] Chet  [r] rat
[ey] bet     [d] debt   [s] set
[ao] bought  [hh] hat   [th] thick
[ow] boat    [hv] high  [dh] that
[er] Bert    [l] let    [w] wet
[ix] roses   [ng] sing  [en] button
...          ...        ...
Speech sounds
Raw signal is the microphone displacement as a function of time; processed into overlapping 30ms frames, each described by features

[Figure: analog acoustic signal → sampled, quantized digital signal → frames with feature vectors, e.g., (10 15 38), (52 47 82), (22 63 24), (89 94 11), (10 12 73)]

Frame features are typically formants—peaks in the power spectrum
Phone models
Frame features in P(features | phone) summarized by
– an integer in [0 … 255] (using vector quantization); or
– the parameters of a mixture of Gaussians

Three-state phones: each phone has three phases (Onset, Mid, End)
E.g., [t] has silent Onset, explosive Mid, hissing End
⇒ P(features | phone, phase)

Triphone context: each phone becomes n^2 distinct phones, depending on the phones to its left and right
E.g., [t] in "star" is written [t(s,aa)] (different from "tar"!)

Triphones useful for handling coarticulation effects: the articulators have inertia and cannot switch instantaneously between positions
E.g., [t] in "eighth" has tongue against front teeth
Phone model example
Phone HMM for [m] (states Onset, Mid, End):
  Onset: self-loop 0.3, → Mid 0.7
  Mid:   self-loop 0.9, → End 0.1
  End:   self-loop 0.4, → FINAL 0.6

Output probabilities for the phone HMM:
  Onset: C1 0.5, C2 0.2, C3 0.3
  Mid:   C3 0.2, C4 0.7, C5 0.1
  End:   C4 0.1, C6 0.5, C7 0.4
Word pronunciation models
Each word is described as a distribution over phone sequences
Distribution represented as an HMM transition model
[HMM for "tomato": [t] → [ow] (0.2) or [ah] (0.8) → [m] → [ey] (0.5) or [aa] (0.5) → [t] → [ow]; all remaining transitions 1.0]

P([towmeytow] | "tomato") = P([towmaatow] | "tomato") = 0.1
P([tahmeytow] | "tomato") = P([tahmaatow] | "tomato") = 0.4
Structure is created manually, transition probabilities learned from data
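The sequence probabilities above come from multiplying the two choice points; a short check in Python:

# The two choice points of the "tomato" HMM multiply out to the values above
branch1 = {"[ow]": 0.2, "[ah]": 0.8}   # after the initial [t]
branch2 = {"[ey]": 0.5, "[aa]": 0.5}   # after [m]
print(branch1["[ow]"] * branch2["[ey]"])   # [towmeytow]: 0.2 * 0.5 = 0.1
print(branch1["[ah]"] * branch2["[aa]"])   # [tahmaatow]: 0.8 * 0.5 = 0.4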
Isolated words
Phone models + word models fix the likelihood P(e_{1:t} | word) for an isolated word

P(word | e_{1:t}) = α P(e_{1:t} | word) P(word)

Prior probability P(word) obtained simply by counting word frequencies

P(e_{1:t} | word) can be computed recursively: define

  ℓ_{1:t} = P(X_t, e_{1:t})

and use the recursive update

  ℓ_{1:t+1} = Forward(ℓ_{1:t}, e_{t+1})

and then P(e_{1:t} | word) = Σ_{x_t} ℓ_{1:t}(x_t)
Isolated-word dictation systems with training reach 95–99% accuracy
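A self-contained sketch of the forward recursion, assuming a toy two-state phone HMM with quantized feature labels C1/C2; all numbers are illustrative:

# Forward recursion for P(e_{1:t} | word) on a toy two-state phone HMM
states = ["Onset", "End"]
prior = {"Onset": 1.0, "End": 0.0}
trans = {"Onset": {"Onset": 0.3, "End": 0.7},
         "End": {"Onset": 0.0, "End": 1.0}}
emit = {"Onset": {"C1": 0.5, "C2": 0.5},
        "End": {"C1": 0.1, "C2": 0.9}}

def likelihood(evidence):
    l = {s: prior[s] * emit[s][evidence[0]] for s in states}   # l_{1:1}
    for e in evidence[1:]:                                     # Forward update
        l = {s: emit[s][e] * sum(l[r] * trans[r][s] for r in states)
             for s in states}
    return sum(l.values())                                     # sum over x_t

print(likelihood(["C1", "C2", "C2"]))   # P(e_{1:3} | this word's model)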
Continuous speech
Not just a sequence of isolated-word recognition problems
– Adjacent words highly correlated
– Sequence of most likely words ≠ most likely sequence of words
– Segmentation: there are few gaps in speech
– Cross-word coarticulation—e.g., "next thing"
Continuous speech systems manage 60–80% accuracy on a good day
Language model
Prior probability of a word sequence is given by the chain rule:

P(w_1 … w_n) = ∏_{i=1}^{n} P(w_i | w_1 … w_{i−1})

Bigram model:

P(w_i | w_1 … w_{i−1}) ≈ P(w_i | w_{i−1})

Train by counting all word pairs in a large text corpus

More sophisticated models (trigrams, grammars, etc.) help a little bit
Combined HMM
States of the combined language+word+phone model are labelled by the word we're in + the phone in that word + the phone state in that phone

Viterbi algorithm finds the most likely phone state sequence

Does segmentation by considering all possible word sequences and boundaries

Doesn't always give the most likely word sequence because each word sequence is the sum over many state sequences

Jelinek invented A∗ search in 1969 as a way to find the most likely word sequence, where the "step cost" is −log P(w_i | w_{i−1})
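A log-space Viterbi sketch over a toy two-phone HMM; the tables are illustrative assumptions, not a real acoustic model:

import math

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden state sequence, computed in log space."""
    V = [{s: (math.log(start[s] * emit[s][obs[0]]), [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            p, path = max((V[-1][r][0] + math.log(trans[r][s]), V[-1][r][1])
                          for r in states)
            row[s] = (p + math.log(emit[s][o]), path + [s])
        V.append(row)
    return max(V[-1].values())[1]

states = ["[m]", "[ey]"]                      # toy phone states
start = {"[m]": 0.9, "[ey]": 0.1}
trans = {"[m]": {"[m]": 0.6, "[ey]": 0.4},
         "[ey]": {"[m]": 0.1, "[ey]": 0.9}}
emit = {"[m]": {"C1": 0.7, "C2": 0.3},        # quantized feature labels
        "[ey]": {"C1": 0.2, "C2": 0.8}}
print(viterbi(["C1", "C1", "C2", "C2"], states, start, trans, emit))
# ['[m]', '[m]', '[ey]', '[ey]']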
DBNs for speech recognition
[Figure: a dynamic Bayesian network for speech recognition: phoneme index, phoneme, transition, and articulator (tongue, lips) variables, with deterministic fixed and stochastic learned dependencies; end-of-word observation node with P(OBS | index = 2) = 1 and P(OBS | index ≠ 2) = 0]
DNNs for speech recognition
DNN architecture for speech recognition:
  Speech signal → Feature extraction → Decoder (acoustic models + pronunciation dictionary + language models) → Recognized words

Deep belief networks: combined DNNs + DBNs

Currently the best practice, better than DBNs
– various speech assistants (Cortana/Siri/GAssistant/Alexa/FMessenger/…)
Conversational agent
Conversational systems (conversational agents, dialogue systems, (chat)bots): converse with humans via natural language ⇐ Turing test

A dialogue contains the following phases:
1. Input recognizer: converting the user's input to plain text
   e.g., speech, gesture, handwriting etc.
2. NLP units
3. Dialog manager: analyzing the semantic information, keeping the history and state of the dialog and managing the general flow of the conversation
4. Task managers: the dialog manager contacts one or more, that have knowledge of the specific task domain
5. Output generator: producing output
   e.g., natural language generator, layout engine etc.
6. Output renderer
   e.g., text-to-speech engine (TTS), talking head, robot, avatar etc.
Conversational software
Pipeline architecture for a spoken dialogue system
Context
• Situational context
  – specific setting, time and location
• Discourse context
  – local: the preceding sentence
  – global: world knowledge
    e.g., representing mental states: beliefs, desires, intentions, and plans
ELIZA
The ELIZA program (Weizenbaum, 1966): imitates the responses of a Rogerian psychotherapist

User: You are like my father in some ways.
ELIZA: WHAT RESEMBLANCE DO YOU SEE?
User: You are not very aggressive but I think you don't want me to notice that.
ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE?
User: You don't argue with me.
ELIZA: WHY DO YOU THINK I DON'T ARGUE WITH YOU?
User: You are afraid of me.
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU?
ELIZA
Use of pattern-matching to process the input and translate it into suitable outputs

Word   Rank  Pattern  Output
alike  10    ?X       In what way?

ELIZA doesn't actually need to know anything to mimic a Rogerian psychotherapist
– so-called ELIZA-style fooling
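A two-rule ELIZA-style responder as a sketch, using regular expressions; the rules are illustrative, while Weizenbaum's program used ranked keywords and richer transformations:

import re

# Rules: a regex pattern plus a response template that reflects the capture
RULES = [
    (r"I am (.*)", "WHY DO YOU SAY YOU ARE {0}?"),
    (r"You are (.*)", "WHAT MAKES YOU THINK I AM {0}?"),
    (r".*", "PLEASE GO ON."),
]

def respond(line):
    line = line.strip().rstrip(".")
    for pattern, template in RULES:
        m = re.fullmatch(pattern, line)
        if m:
            return template.format(*(g.upper() for g in m.groups()))

print(respond("You are not very aggressive."))
# WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE?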
Dialogue
Try practical dialogue systems
  say, Microsoft Xiao Bing (in Chinese) ⇒ say something??
How long can you keep the dialogue going??
The dream
Trend: deep learning is better than statistical learning in both machine translation and speech recognition, but not yet in conversation

Combines: language processing + machine learning (deep learning)

Linguistics + psycholinguistics + knowledge representation and reasoning + machine learning + information science + signal processing

Learning: models of how children learn their language just from what they hear and observe
– apply machine learning to show how children can learn
– to map words in a sentence to real-world objects
– the relation between verbs and their arguments
⇐ Understanding??

The dream: "The linguistic computer"
  Human-like competence in language ⇐ strong AI