Two Paradigms for Natural-Language Processing
Robert C. Moore, Senior Researcher, Microsoft Research
Why is Microsoft interested in natural-language processing?
Make computers/software easier to use.
Long-term goal: just talk to your computer (Star Trek scenario).
Some of Microsoft’s near(er) term goals in NLP Better search
Help find things on your computer. Help find information on the Internet.
Document summarization Help deal with information overload.
Machine translation
Why is Microsoft interested in machine translation? Internal: Microsoft is the world’s largest
user of translation services. MT can help Microsoft Translate documents that would otherwise not
be translated – e.g., PSS knowledge base (http://support.microsoft.com/default.aspx?scid=fh;ES-ES;faqtraduccion).
Save money on human translation by providing machine translations as a starting point.
External: Sell similar software/services to other large companies.
Knowledge engineering vs. machine learning in NLP Biggest debate over the last 15 years in
NLP has been knowledge engineering vs. machine learning.
KE approach to NLP usually involves hand-coding of grammars and lexicons by linguistic experts.
ML approach to NLP usually involves training statistical models on large amounts of annotated or un-annotated text.
Central problems in KE-based NLP Parsing – determining the syntactic
structure of a sentence. Interpretation – deriving formal
representation of the meaning of a sentence.
Generation – deriving a sentence that expresses a given meaning representation.
Simple examples of KE-based NLP notations
Phrase-structure grammar:
S → Np Vp;  Np → Sue;  Np → Mary;  Vp → V Np;  V → sees
Syntactic structure:[[Sue]Np [[sees]V [Mary]Np]Vp]S
Meaning representation:[see(E), agt(E,sue), pat(E,mary)]
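The toy grammar and bracketed syntactic structure above can be produced by a few lines of code. A minimal sketch of recursive-descent parsing with that grammar (the representation choices here are illustrative, not from the talk):

```python
# Recursive-descent parsing with the toy grammar above:
# S -> Np Vp; Np -> Sue | Mary; Vp -> V Np; V -> sees.
RULES = {
    "S":  [["Np", "Vp"]],
    "Np": [["Sue"], ["Mary"]],
    "Vp": [["V", "Np"]],
    "V":  [["sees"]],
}

def parse(cat, words, pos):
    """Yield (tree, next_pos) for category `cat` starting at `pos`."""
    if cat not in RULES:                      # terminal symbol
        if pos < len(words) and words[pos] == cat:
            yield cat, pos + 1
        return
    for rhs in RULES[cat]:
        for children, end in parse_seq(rhs, words, pos):
            yield [cat] + children, end

def parse_seq(cats, words, pos):
    if not cats:
        yield [], pos
        return
    for child, mid in parse(cats[0], words, pos):
        for rest, end in parse_seq(cats[1:], words, mid):
            yield [child] + rest, end

def bracket(tree):
    """Render a tree in the labeled-bracket notation used above."""
    if isinstance(tree, str):
        return tree
    cat, children = tree[0], tree[1:]
    return "[" + " ".join(bracket(c) for c in children) + "]" + cat

trees = [t for t, end in parse("S", ["Sue", "sees", "Mary"], 0) if end == 3]
print(bracket(trees[0]))   # [[Sue]Np [[sees]V [Mary]Np]Vp]S
```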
Unification Grammar: the pinnacle of the NLP KE paradigm Provides a uniform declarative
formalism. Can be used to specify both
syntactic and semantic analyses. A single grammar can be used for
both parsing and generation. Supports a variety of efficient
parsing and generation algorithms.
Background: Question formation in English
To construct a yes/no question: Place the tensed auxiliary verb from the corresponding statement at the front of the clause.
John can see Mary. Can John see Mary?
If there is no tensed auxiliary, add the appropriate form of the semantically empty auxiliary do.
John sees Mary. John does see Mary. Does John see Mary?
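The yes/no-question rule above can be sketched as a string transformation. This is a toy illustration only: the auxiliary list and the do-support table are invented stand-ins for a real lexicon, and subject noun phrases are assumed to be a single word.

```python
# Toy sketch of English yes/no-question formation, assuming one-word
# subjects. AUXILIARIES and DO_SUPPORT are illustrative, not a real lexicon.
AUXILIARIES = {"can", "will", "must", "may", "should"}
DO_SUPPORT = {"sees": ("does", "see"), "likes": ("does", "like")}

def yes_no_question(statement):
    words = statement.rstrip(".").split()
    if words[1] in AUXILIARIES:
        # Front the tensed auxiliary: "John can see Mary." -> "Can John see Mary?"
        q = [words[1].capitalize(), words[0]] + words[2:]
    else:
        # No tensed auxiliary: add the right form of semantically empty "do".
        do_form, base = DO_SUPPORT[words[1]]
        q = [do_form.capitalize(), words[0], base] + words[2:]
    return " ".join(q) + "?"

print(yes_no_question("John can see Mary."))  # Can John see Mary?
print(yes_no_question("John sees Mary."))     # Does John see Mary?
```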
Question formation in English (continued)
To construct a who/what question: For a non-subject who/what question, form a
corresponding yes/no question. Does John see Mary?
Replace the noun phrase in the position being questioned with a question noun phrase and move to the front of the clause.
Who does John see ? For a subject who/what question, simply replace
the subject with a question noun phrase. Who sees Mary?
Example of a UG grammar rule involved in who/what questions:

S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL, whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[], whgap_out=[]),
    S2::(cat=s, stype=ynq, whgap_in=NP/NP_sem, whgap_out=[], vgap=[]).

Reading the rule:
Context-free backbone: S1 ---> [NP, S2].
Category subtype features: stype=whq, wh=y, stype=ynq.
Features for tracking long-distance dependencies: whgap_in, whgap_out, vgap.
Semantic features: the /S_sem and /NP_sem annotations, including whgap_in=NP/NP_sem.
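The operation at the heart of UG parsing is unification of feature structures. A minimal sketch, assuming flat feature structures represented as dicts with atomic values, and variables written as strings beginning with "?" (real UG systems unify recursive structures):

```python
# A minimal sketch of feature-structure unification over flat dicts.
# Variables are strings starting with "?"; features present in only one
# structure are simply carried over (no clash possible).

def unify(fs1, fs2, bindings=None):
    """Return variable bindings if the structures unify, else None."""
    bindings = dict(bindings or {})
    for feat in set(fs1) | set(fs2):
        if feat in fs1 and feat in fs2:
            v1 = bindings.get(fs1[feat], fs1[feat])   # dereference once
            v2 = bindings.get(fs2[feat], fs2[feat])
            if isinstance(v1, str) and v1.startswith("?"):
                bindings[v1] = v2          # bind variable to the other value
            elif isinstance(v2, str) and v2.startswith("?"):
                bindings[v2] = v1
            elif v1 != v2:
                return None                # atomic clash: unification fails
    return bindings

# The np daughter in the rule above must carry wh=y; "?G" is a variable:
rule_np = {"cat": "np", "wh": "y", "whgap_in": "?G"}
lexical_np = {"cat": "np", "wh": "y", "whgap_in": "[]"}
print(unify(rule_np, lexical_np))          # {'?G': '[]'}
print(unify({"cat": "np"}, {"cat": "s"}))  # None
```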
Parsing algorithms for UG Virtually any CFG parsing algorithm can
be applied to UG by replacing identity tests on nonterminals with unification of nonterminals.
UG grammars are Turing complete, so grammars have to be written appropriately for parsing to terminate.
“Reasonable” grammars generally can be parsed in polynomial time, often n3.
Generation algorithms for UG Since grammar is purely declarative,
generation can be done by “running the parser backwards.”
Efficient generation algorithms are more complicated than that, but still polynomial for “reasonable” grammars and “exact generation.”
Generation taking into account semantic equivalence is worst-case NP-hard, but still can be efficient in practice.
A Prolog-based UG system to play with
Go to http://www.research.microsoft.com/research/downloads/
Download "Unification Grammar Sentence Realization Algorithms," which includes:
A simple bottom-up parser.
Two sophisticated generation algorithms.
A small sample grammar and lexicon.
A paraphrase demo that:
Parses sentences covered by the grammar into a semantic representation.
Generates all sentences that have that semantic representation according to the grammar.
A paraphrase example
?- paraphrase(s(_,'CAT'([]),'CAT'([]),'CAT'([])),
              [what,direction,was,the,cat,chased,by,the,dog,in]).
in what direction did the dog __ chase the cat __
in what direction was the cat __ chased __ by the dog
in what direction was the cat __ chased by the dog __
what direction did the dog __ chase the cat in __
what direction was the cat __ chased in __ by the dog
what direction was the cat __ chased by the dog in __
generation_elapsed_seconds(0.0625)
Whatever happened to UG-based NLP? UG-based NLP is elegant, but lacks
robustness for broad-coverage tasks. Hard for human experts to
incorporate enough details for broad coverage, unless grammar/lexicon are very permissive.
Too many possible ambiguities arise as coverage increases.
How machine-learning-based NLP addresses these problems Details are learned by processing
very large corpora. Ambiguities are resolved by
choosing most likely answer according to a statistical model.
Increase in stat/ML papers at ACL conferences over 15 years
[Chart: percent of stat/ML papers at ACL by year, 1985–2005; y-axis 0–100 percent, with data points at 1988, 1993, 1998, and 2003 showing a steady rise.]
Characteristics of ML approach to NLP compared to KE approach
Model-driven rather than theory-driven.
Uses shallower analyses and representations.
More opportunistic and more diverse in range of problems addressed.
Often driven by availability of training data.
Differences in approaches to stat/ML NLP Type of training data
Annotated – supervised training Un-annotated – unsupervised training
Type of model Joint model – e.g., generative probabilistic Conditional model – e.g., conditional
maximum entropy Type of training
Joint – maximum likelihood training Conditional – discriminative training
Statistical parsing models Most are:
Generative probabilistic models, Trained on annotated data (e.g., Penn
Treebank), Using maximum likelihood training.
The simplest such model would be a probabilistic context-free grammar.
Probabilistic context-free grammars (PCFGs) A PCFG is a CFG that assigns to each
production a conditional probability of the right-hand side given the left-hand side.
The probability of a derivation is simply the product of the conditional probabilities of all the productions used in the derivation.
PCFG-based parsing chooses, as the parse of a sentence, the derivation of the sentence having the highest probability.
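The PCFG scoring just described can be shown concretely. A small sketch in which a derivation's probability is the product of the conditional rule probabilities it uses; the grammar and its probabilities are invented for illustration:

```python
# Each entry maps (lhs, rhs) to P(rhs | lhs); probabilities for a given
# lhs sum to 1. The numbers here are invented for illustration.
PCFG = {
    ("S",  ("Np", "Vp")): 1.0,
    ("Np", ("Sue",)):     0.5,
    ("Np", ("Mary",)):    0.5,
    ("Vp", ("V", "Np")):  1.0,
    ("V",  ("sees",)):    1.0,
}

def derivation_probability(rules):
    """Product of P(rhs | lhs) over all productions used in the derivation."""
    p = 1.0
    for lhs, rhs in rules:
        p *= PCFG[(lhs, rhs)]
    return p

# Derivation of "Sue sees Mary":
rules = [("S", ("Np", "Vp")), ("Np", ("Sue",)),
         ("Vp", ("V", "Np")), ("V", ("sees",)), ("Np", ("Mary",))]
print(derivation_probability(rules))   # 0.25
```

PCFG-based parsing would compute this for every derivation of the sentence and keep the highest-scoring one.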
Problems with simple generative probabilistic models Incorporating more features into
the model splits data, resulting in sparse data problems.
Joint maximum likelihood training “wastes” probability mass predicting the given part of the input data.
A currently popular technique: conditional maximum entropy models
Basic models are of the form:

    p(y|x) = (1/Z(x)) exp( Σ_i λ_i f_i(x,y) )

Advantages:
Using more features does not require splitting data.
Training maximizes conditional probability rather than joint probability.
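A sketch of evaluating the conditional maximum-entropy form p(y|x) = (1/Z(x)) exp(Σ_i λ_i f_i(x,y)). The two binary features and their weights are invented for illustration (a toy part-of-speech decision for the word "book"):

```python
import math

def maxent_prob(x, y, labels, features, weights):
    """p(y|x) under a conditional maxent model with the given features/weights."""
    def score(label):
        return math.exp(sum(w * f(x, label) for f, w in zip(features, weights)))
    z = sum(score(lab) for lab in labels)   # normalizer Z(x)
    return score(y) / z

# Two toy binary features; weights are made up, not trained:
features = [
    lambda x, y: 1.0 if y == "NOUN" and x["prev"] == "the" else 0.0,
    lambda x, y: 1.0 if y == "VERB" and x["prev"] == "to" else 0.0,
]
weights = [2.0, 1.5]

x = {"word": "book", "prev": "the"}
p_noun = maxent_prob(x, "NOUN", ["NOUN", "VERB"], features, weights)
print(round(p_noun, 3))   # e^2 / (e^2 + 1) ≈ 0.881
```

Note that adding a third feature would just lengthen the sum in the exponent; it does not partition the training data the way adding a conditioning context to a generative model would.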
Unsupervised learning in NLP Tries to infer unknown parameters and alignments
of data to “hidden” states that best explain (i.e., assign highest probability to) un-annotated NL data.
Most common training method is Expectation Maximization (EM):
Assume initial distributions for joint probability of alignments of hidden states to observable data.
Compute joint probabilities for observed training data and all possible alignments.
Re-estimate probability distributions based on probabilistically weighted counts from previous step.
Iterate last two steps until desired convergence is reached.
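The EM loop above can be sketched on a classic toy problem: two coins with unknown head-probabilities, where each session of flips comes from one coin but we never observe which (the "hidden state"). The data and starting values are invented for illustration:

```python
def em_two_coins(sessions, theta_a=0.6, theta_b=0.5, iterations=20):
    """EM for two coins with unknown biases; sessions are (heads, tails) pairs."""
    for _ in range(iterations):
        counts = {"A": [0.0, 0.0], "B": [0.0, 0.0]}   # expected [heads, tails]
        for heads, tails in sessions:
            # E-step: probability of each alignment (session -> coin),
            # under the current parameter estimates.
            like_a = theta_a ** heads * (1 - theta_a) ** tails
            like_b = theta_b ** heads * (1 - theta_b) ** tails
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            counts["A"][0] += w_a * heads; counts["A"][1] += w_a * tails
            counts["B"][0] += w_b * heads; counts["B"][1] += w_b * tails
        # M-step: re-estimate from the probabilistically weighted counts.
        theta_a = counts["A"][0] / sum(counts["A"])
        theta_b = counts["B"][0] / sum(counts["B"])
    return theta_a, theta_b

sessions = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]   # (heads, tails) per session
theta_a, theta_b = em_two_coins(sessions)
print(round(theta_a, 2), round(theta_b, 2))
```

The loop is exactly the four steps listed above: assume initial distributions, compute alignment probabilities, re-estimate from weighted counts, iterate.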
Statistical machine translation A leading example of unsupervised
learning in NLP. Models are trained from parallel
bilingual, but otherwise un-annotated corpora.
Models usually assume a sequence of words in one language is produced by a generative probabilistic process from a sequence of words in another language.
Structure of stat MT models
Often a noisy-channel framework is assumed:

    p(e|f) ∝ p(e) p(f|e)

In basic models, each target word is assumed to be generated by one source word.
A simple model: IBM model 1
A sentence e produces a sentence f assuming:
The length m of f is independent of the length l of e.
Each word of f is generated by one word of e (including an empty word e0).
Each word in e is equally likely to generate the word at any position in f, independently of how any other words are generated.
Mathematically:

    p(f|e) = ε / (l+1)^m  ×  ∏_{j=1..m} Σ_{i=0..l} t(f_j | e_i)
More advanced models
Most approaches:
Model how words are ordered (but crudely).
Model how many words a given word is likely to translate into.
Best-performing approaches model word-sequence-to-word-sequence translations.
Some initial work has been done on incorporating syntactic structure into models.
Examples of machine-learned English/Italian word translations:
PROCESSOR → PROCESSORE
APPLICATIONS → APPLICAZIONI
SPECIFY → SPECIFICARE
NODE → NODO
DATA → DATI
SERVICE → SERVIZIO
THREE → TRE
IF → SE
SITES → SITI
TARGET → DESTINAZIONE
RESTORATION → RIPRISTINO
ATTENDANT → SUPERVISORE
GROUPS → GRUPPI
MESSAGING → MESSAGGISTICA
MONITORING → MONITORAGGIO
THAT → CHE
FUNCTIONALITY → FUNZIONALITÀ
PHASE → FASE
SEGMENT → SEGMENTO
CUBES → CUBI
VERIFICATION → VERIFICA
ALLOWS → CONSENTE
TABLE → TABELLA
BETWEEN → TRA
DOMAINS → DOMINI
MULTIPLE → PIÙ
NETWORKS → RETI
A → UN
PHYSICALLY → FISICAMENTE
FUNCTIONS → FUNZIONI
How do KE and ML approaches to NLP compare today? ML has become the dominant paradigm in
NLP. (“Today’s students know everything about maxent modeling, but not what a noun phrase is.”)
ML results are easier to transfer than KE results.
We probably now have enough computer power and data to learn more by ML than a linguistic expert could encode in a lifetime.
In almost every independent evaluation, ML methods outperform KE methods in practice.
Do we still need linguistics in computational linguistics? There are still many things we are not
good at modeling statistically. For example, stat MT models based on
single-words or strings are good at getting the right words, but poor at getting them in the right order.
Consider: La profesora le gusta a tu hermano. Your brother likes the teacher. The teacher likes your brother.
Concluding thoughts If forced to choose between a pure ML approach
and a pure KE approach, ML almost always wins. Statistical models still seem to need a lot more
linguistic features for really high performance. A lot of KE is actually hidden in ML approaches, in
the form of annotated data, which is usually expensive to obtain.
The way forward may be to find methods for experts to give advice to otherwise unsupervised ML methods, which may be cheaper than annotating enough data to learn the content of the advice.