CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

TRANSCRIPT

Page 1: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

CARTIC RAMAKRISHNAN, MEENAKSHI NAGARAJAN, AMIT SHETH

Page 2: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

How little you really know

A great way to find out…

Page 3: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

We have used material from several popular books, papers, course notes and presentations made by experts in this area. We have provided all references to the best of our knowledge. This list, however, serves only as a pointer to work in this area and is by no means a comprehensive resource.

Page 4: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

KNO.E.SIS knoesis.org

Director: Amit Sheth knoesis.wright.edu/amit/

Graduate Students: Meena Nagarajan

knoesis.wright.edu/students/meena/ [email protected]

Cartic Ramakrishnan knoesis.wright.edu/students/cartic/ [email protected]

Page 5: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

An Overview of Empirical Natural Language Processing, Eric Brill, Raymond J. Mooney

Page 6: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Traditional (Rationalist) Natural Language Processing
Main insight: using rule-based, hand-coded representations of knowledge and grammar for language study

KB

Text

NLP System

Analysis

Page 7: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Empirical Natural Language Processing
Main insight: using the distributional environment of a word as a tool for language study

KB

Text

NLP System

Analysis

Corpus

Learning System

Page 8: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

The two approaches are not incompatible; several systems use both.

Many empirical systems make use of manually created domain knowledge.

Many empirical systems use representations of rationalist methods replacing hand-coded rules with rules acquired from data.

Page 9: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Several algorithms and methods exist for each task, in both rationalist and empiricist approaches

What does an NL processing task typically entail? How do systems, applications and tasks perform these tasks?
Syntax: POS tagging, parsing
Semantics: meaning of words; using context/domain knowledge to enhance tasks

Page 10: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Finding more about what we already know
Ex. patterns that characterize known information
The search/browse or ‘finding a needle in a haystack’ paradigm

Discovering what we did not know
Deriving new information from data
▪ Ex. previously unknown relationships between known entities
The ‘extracting ore from rock’ paradigm

Page 11: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Information Extraction – techniques that operate directly on the text input
▪ this includes entity, relationship and event detection

Inferring new links and paths between key entities
▪ sophisticated representations of information content, beyond the "bag-of-words" representations used by IR systems

Scenario detection techniques
▪ discover patterns of relationships between entities that signify some larger event, e.g. money laundering activities

Page 12: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

They all make use of knowledge of language (exploiting syntax and structure, to different extents)
Named entities begin with capital letters
Morphology and meanings of words

They all use some fundamental text analysis operations
Pre-processing, parsing, chunking, part-of-speech tagging, lemmatization, tokenization

To some extent, they all deal with language understanding challenges
Ambiguity, co-reference resolution, entity variations etc.

Use of a core subset of theoretical models and algorithms
State machines, rule systems, probabilistic models, vector-space models, classifiers, EM etc.

Page 13: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Wikipedia like text (GOOD) “Thomas Edison invented the light bulb.”

Scientific literature (BAD) “This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest.”

Text from Social Media (UGLY) "heylooo..ano u must hear it loadsss bu your propa faabbb!!"

Page 14: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Illustrate analysis of and challenges posed by these three text types throughout the tutorial

Page 15: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

WHAT CAN TM DO FOR HARRY POTTER?

A bag of words

Page 16: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

[Figure: an entity-relationship graph connecting Leonardo Da Vinci, The Da Vinci Code, The Louvre, Victor Hugo, The Vitruvian Man, Santa Maria delle Grazie, Et in Arcadia Ego, Holy Blood Holy Grail, Harry Potter, The Last Supper, Nicolas Poussin, the Priory of Sion, The Hunchback of Notre Dame, The Mona Lisa and Nicolas Flammel through relations such as painted_by, member_of, written_by, mentioned_in, displayed_at and cryptic_motto_of]

Discovering connections hidden in text: UNDISCOVERED PUBLIC KNOWLEDGE

Page 17: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Undiscovered Public Knowledge (UPK) [Swanson] – as mentioned in [Hearst99]

Search is no longer enough
▪ Information overload – prohibitively large number of hits
▪ UPK increases with increasing corpus size

Manual analysis is very tedious

Examples [Hearst99]
▪ Example 1 – Using Text to Form Hypotheses about Disease
▪ Example 2 – Using Text to Uncover Social Impact

Page 18: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Swanson’s discoveries
▪ Associations between Migraine and Magnesium [Hearst99]
▪ stress is associated with migraines
▪ stress can lead to loss of magnesium
▪ calcium channel blockers prevent some migraines
▪ magnesium is a natural calcium channel blocker
▪ spreading cortical depression (SCD) is implicated in some migraines
▪ high levels of magnesium inhibit SCD
▪ migraine patients have high platelet aggregability
▪ magnesium can suppress platelet aggregability
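Swanson's chains have a simple computational core: if the literature asserts A→B and B→C but never A→C directly, then A→C is a candidate hypothesis. A minimal sketch, with invented relation pairs paraphrasing the chain above:

```python
# Sketch of Swanson-style literature-based discovery: if A->B and B->C are
# both reported but A->C is not, propose A->C as a hypothesis. The pairs
# below are illustrative paraphrases, not extractions from real abstracts.
known = {
    ("stress", "migraine"),
    ("stress", "magnesium loss"),
    ("magnesium", "calcium channel blocking"),
    ("calcium channel blocking", "migraine prevention"),
    ("magnesium", "SCD inhibition"),
    ("SCD", "migraine"),
}

def candidate_links(pairs):
    """Return (a, c) pairs connected via an intermediate b but not directly."""
    out = set()
    for a, b in pairs:
        for b2, c in pairs:
            if b == b2 and (a, c) not in pairs and a != c:
                out.add((a, c))
    return out

print(candidate_links(known))  # {('magnesium', 'migraine prevention')}
```

The chain magnesium → calcium channel blocking → migraine prevention surfaces as the one undiscovered link in this toy set.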

Page 19: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Mining popularity from Social Media
Goal: Top-X artists from MySpace artist comment pages
Traditional Top-X lists come from radio plays and CD sales; an attempt at creating a list closer to listeners’ preferences

Mining positive/negative affect and sentiment
Slang and casual text necessitate transliteration
▪ ‘you are so bad’ == ‘you are good’

Page 20: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Mining text to improve existing information access mechanisms
Search [Storylines], IR [QA systems], Browsing [Flamenco]

Mining text for discovery & insight [Relationship Extraction]
Creation of new knowledge
▪ Ontology instance-base population
▪ Ontology schema learning

Page 21: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Web search – aims at optimizing for top k (~10) hits

Beyond the top 10
Pages expressing related latent views on a topic
Possible reliable sources of additional information
Storylines in search results [3]

Page 22: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH
Page 23: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

TextRunner [4]
A system that uses the results of dependency parses of sentences to train a Naïve Bayes classifier for Web-scale extraction of relationships

Does not require parsing for extraction – parsing is only required for training

Trains on features such as POS tag sequences, whether the object is a proper noun, the number of tokens to the right or left etc.

This system is able to respond to queries like "What did Thomas Edison invent?"
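The training idea can be sketched with a tiny Bernoulli Naive Bayes over shallow features of a candidate (entity1, relation phrase, entity2) triple. The features and training examples below are invented placeholders for the POS-sequence and token-count features the real system uses:

```python
import math
from collections import Counter

# Sketch of TextRunner-style classification: a Naive Bayes model judges
# whether a candidate extraction is trustworthy from shallow features,
# so no parser is needed at extraction time. Features/examples are invented.
def featurize(cand):
    e1, rel, e2 = cand
    return {
        "short_relation": len(rel.split()) <= 3,
        "e1_capitalized": e1[0].isupper(),
        "e2_capitalized": e2[0].isupper(),
    }

def train(examples):
    """Count, per label, how often each boolean feature fires."""
    counts = {True: Counter(), False: Counter()}
    totals = Counter()
    for cand, label in examples:
        totals[label] += 1
        for feat, fires in featurize(cand).items():
            if fires:
                counts[label][feat] += 1
    return counts, totals

def predict(cand, counts, totals):
    n = sum(totals.values())
    best, best_logp = None, float("-inf")
    for label in counts:
        logp = math.log(totals[label] / n)                       # prior
        for feat, fires in featurize(cand).items():
            p = (counts[label][feat] + 1) / (totals[label] + 2)  # Laplace
            logp += math.log(p if fires else 1 - p)
        if logp > best_logp:
            best, best_logp = label, logp
    return best

training = [
    (("Edison", "invented", "Phonograph"), True),
    (("Google", "acquired", "YouTube"), True),
    (("Paris", "is capital of", "France"), True),
    (("he", "said that it would be", "nice"), False),
    (("It", "was not clear whether the", "plan"), False),
]
counts, totals = train(training)
print(predict(("Tesla", "invented", "AC motor"), counts, totals))  # True
```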

Page 24: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Castanet [1]
Semi-automatically builds faceted hierarchical metadata structures from text

This is combined with Flamenco [2] to support faceted browsing of content

Page 25: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Documents → Select terms → Build core tree (using WordNet) → Augment core tree → Remove top-level categories → Compress tree → Divide into facets
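The "compress tree" step can be illustrated by splicing out internal nodes that have only one child, so the long single-parent hypernym chains pulled from WordNet collapse. A simplification of what Castanet actually does, using the dessert hierarchy from the following slide:

```python
# Sketch of the "compress tree" step: splice out internal nodes with a
# single child, collapsing WordNet's long hypernym chains. The tree is
# encoded as {node: [children]}; leaves simply have no entry.
def compress(tree, node):
    """Return (new_root, new_tree) for the subtree at node, chains collapsed."""
    kids = tree.get(node, [])
    while len(kids) == 1:                 # hop past single-child nodes
        node = kids[0]
        kids = tree.get(node, [])
    new_tree = {node: []}
    for child in kids:
        head, sub = compress(tree, child)
        new_tree[node].append(head)
        new_tree.update(sub)
    return node, new_tree

hypernyms = {
    "entity": ["substance,matter"],
    "substance,matter": ["nutriment"],
    "nutriment": ["dessert"],
    "dessert": ["frozen dessert", "sherbet,sorbet"],
    "frozen dessert": ["ice cream sundae"],
    "sherbet,sorbet": ["sherbet sundae"],
}
root, compressed = compress(hypernyms, "entity")
print(root, compressed[root])  # dessert ['ice cream sundae', 'sherbet sundae']
```

The four-level chain from "entity" down to "dessert" collapses into a single branching node.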

Page 26: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

[Figure: WordNet hypernym paths for dessert terms, e.g. entity → substance,matter → nutriment → dessert → frozen dessert → ice cream sundae, and a parallel path through sherbet,sorbet to sherbet sundae, merged into a single core tree]

Domains used to prune applicable senses in WordNet (e.g. “dip”)

Page 27: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

[Figure: the Fish Oils / Raynaud’s Disease example over the UMLS Semantic Network and MeSH: Lipid and Biologically active substance affect, cause or complicate Disease or Syndrome; Fish Oils and Raynaud’s Disease are instances of these classes; PubMed document counts shown: 9284, 4733, and only 5 mentioning both]

Page 28: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

[Hearst92] Finding class instances

[Nguyen07] Finding attribute-like relation instances

[Ramakrishnan et. al. 08]

Page 29: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Automatic acquisition of: class labels, class hierarchies, attributes, relationships, constraints, rules

Page 30: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

[Figure: the entity-relationship graph from Page 16 repeated: the Da Vinci Code entities connected by painted_by, member_of, written_by, mentioned_in, displayed_at and cryptic_motto_of relations]

Page 31: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

SYNTAX, SEMANTICS, STATISTICAL NLP, TOOLS, RESOURCES, GETTING STARTED

Page 32: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

[Hearst 97]
Abstract concepts are difficult to represent

“Countless” combinations of subtle, abstract relationships among concepts

Many ways to represent similar concepts
E.g. space ship, flying saucer, UFO

Concepts are difficult to visualize
High dimensionality: tens or hundreds of thousands of features

Page 33: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Ambiguity (sense)
Keep that smile playin’ (Smile is a track)
Keep that smile on!

Variations (spellings, synonyms, complex forms)
Ileal Neoplasm vs. Adenomatous lesion of the Ileal wall

Coreference resolution
“John wanted a copy of Netscape to run on his PC on the desk in his den; fortunately, his ISP included it in their startup package.”

Page 34: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

[Hearst 97]
Highly redundant data… most of the methods count on this property

Just about any simple algorithm can get “good” results for simple tasks:
Pull out “important” phrases
Find “meaningfully” related words
Create some sort of summary from documents

Page 35: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Concerned with processing documents in natural language
Computational Linguistics, Information Retrieval, Machine Learning, Statistics, Information Theory, Data Mining etc.

TM is generally concerned with practical applications
As opposed to, for example, lexical acquisition in CL

Page 36: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Computing resources
Faster disks, CPUs, networked information

Data resources
Large corpora, tree banks, lexical data for training and testing systems

Tools for analysis
NL analysis: taggers, parsers, noun-chunkers, tokenizers; statistical text analysis: classifiers, NL model generators

Emphasis on applications and evaluation
Practical systems experimentally evaluated on real data

Page 37: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Computational Linguistics
Syntax: parts of speech, morphology, phrase structure, parsing, chunking
Semantics: lexical semantics, syntax-driven semantic analysis, domain model-assisted semantic analysis (WordNet)

Getting your hands dirty
Text encoding, tokenization, sentence splitting, morphology variants, lemmatization
Using parsers, understanding outputs
Tools, resources, frameworks

Page 38: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

POS Tags, Taggers, Ambiguities, Examples

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

Page 39: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Assigning a POS or syntactic class marker to a word in a sentence/corpus
Word classes, syntactic/grammatical categories

Usually preceded by tokenization
Delimit sentence boundaries, tag punctuation and words

Publicly available tree banks: documents tagged for syntactic structure

Typical input and output of a tagger
▪ Cancel that ticket. Cancel/VB that/DT ticket/NN ./.

Page 40: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Lexical ambiguity
Words have multiple usages and parts-of-speech
▪ A duck in the pond; Don’t duck when I bowl
▪ Is duck a noun or a verb?
▪ Yes, we can; Can of soup; I canned this idea
▪ Is can an auxiliary, a noun or a verb?

The problem in tagging is resolving such ambiguities

Page 41: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Information about a word and its neighbors has implications for language models
▪ Possessive pronouns (mine, her, its) are usually followed by a noun

Understanding new words
▪ Toves did gyre and gimble.

On IE
▪ Nouns as cues for named entities
▪ Adjectives as cues for subjective expressions

Page 42: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Rule-based
Database of hand-written/learned rules to resolve ambiguity – EngCG

Probabilistic / stochastic taggers
Use a training corpus to compute the probability of a word taking a tag in a specific context – HMM tagger

Hybrids – transformation-based
The Brill tagger

A comprehensive list of available taggers
http://www-nlp.stanford.edu/links/statnlp.html#Taggers

Page 43: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Not a complete representation
EngCG is based on the Constraint Grammar approach
Two-step architecture:
Use a lexicon of words and likely POS tags to first tag words
Use a large list of hand-coded disambiguation rules that assign a single POS tag to each word

Page 44: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Sample lexicon

Word   | POS | Additional POS features
Slower | ADJ | COMPARATIVE
Show   | V   | PRESENT
Show   | N   | NOMINATIVE

Sample rules

Page 45: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

What is the best possible tag given this sequence of words?
Takes context into account; global

Example: HMM (hidden Markov models)
A special case of Bayesian inference
The likely tag sequence is the one that maximizes the product of two terms:
▪ the probability of the sequence of tags and the probability of each tag generating its word

Page 46: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Peter/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

to/TO race/???
ti = argmaxj P(tj | ti-1) P(wi | tj), here P(VB|TO) × P(race|VB)

Based on the Brown Corpus: the probability that you will see this POS transition, and that the word will take this POS

P(VB|TO) × P(race|VB) = 0.34 × 0.00003 ≈ 0.00001
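The decision above can be scored directly: each candidate tag gets transition probability times emission probability. The VB numbers are the Brown-corpus estimates quoted on the slide; the NN numbers are invented placeholders for the competing noun reading:

```python
# Sketch of the bigram-HMM tagging decision for "to/TO race/???".
# score(tag) = P(tag | previous_tag) * P(word | tag).
# VB values are the slide's Brown-corpus estimates; NN values are
# illustrative placeholders, not corpus statistics.
probs = {
    "VB": {"trans": 0.34,  "emit": 0.00003},   # P(VB|TO), P(race|VB)
    "NN": {"trans": 0.021, "emit": 0.00041},   # assumed P(NN|TO), P(race|NN)
}

def best_tag(candidates):
    return max(candidates,
               key=lambda t: candidates[t]["trans"] * candidates[t]["emit"])

score_vb = probs["VB"]["trans"] * probs["VB"]["emit"]
print(best_tag(probs))  # VB: 0.34 * 0.00003 ~ 0.00001 beats the NN score
```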

Page 47: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Be aware of possible ambiguities
One may have to normalize content before sending it to the tagger

Pre/post transliteration
▪ “Rhi you were da coolest last eve”
▪ Rhi/VB you/PRP were/VBD da/VBG coolest/JJ last/JJ eve/NN
▪ “Rhi you were the coolest last eve”
▪ Rhi/VB you/PRP were/VBD the/DT coolest/JJ last/JJ eve/NN

Page 48: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Understanding Phrase Structures, Parsing, Chunking

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

Page 49: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Words don’t just occur in some order
Words are organized in phrases: groupings of words that clump together

Major phrase types
Noun phrases, prepositional phrases, verb phrases

Page 50: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Deriving the syntactic structure of a sentence based on a language model (grammar)

Natural language syntax described by a context-free grammar:
the start symbol S ≡ sentence
non-terminals NT ≡ syntactic constituents
terminals T ≡ lexical entries/words
productions P: NT → (NT ∪ T)+ ≡ grammar rules

http://www.cs.umanitoba.ca/~comp4190/2006/NLP-Parsing.ppt

Page 51: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

S ∈ NT; parts of speech ∈ NT; constituents ∈ NT; words ∈ T

Rules:
S → NP VP (statement)
S → Aux NP VP (question)
S → VP (command)
NP → Det Nominal
NP → Proper-Noun
Nominal → Noun | Noun Nominal | Nominal PP
VP → Verb | Verb NP | Verb PP | Verb NP PP
PP → Prep NP
Det → that | this | a
Noun → book | flight | meal | money

Page 52: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Bottom-up parsing, or data-driven
Top-down parsing, or goal-driven

[Figure: parse tree for “does this flight include a meal”: S → Aux NP VP, with NP → Det Nominal → Noun and VP → Verb NP, NP → Det Nominal → Noun]

does this flight include a meal
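The toy grammar above is small enough to parse by brute-force span search. A minimal recognizer sketch (the grammar and lexicon are transcribed from the preceding slide; this is not any particular parser's algorithm):

```python
# Minimal span-based recognizer for the toy grammar. Each RHS symbol must
# cover a non-empty span, so even the left-recursive rule
# (Nominal -> Nominal PP) terminates: its left part is strictly shorter.
GRAMMAR = {
    "S":       [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP":      [["Det", "Nominal"], ["Proper-Noun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"], ["Nominal", "PP"]],
    "VP":      [["Verb"], ["Verb", "NP"], ["Verb", "PP"], ["Verb", "NP", "PP"]],
    "PP":      [["Prep", "NP"]],
}
LEXICON = {"does": "Aux", "this": "Det", "flight": "Noun",
           "include": "Verb", "a": "Det", "meal": "Noun"}

def parses(sym, words, i, j):
    """True if symbol sym derives the span words[i:j]."""
    if sym not in GRAMMAR:                       # pre-terminal (POS) symbol
        return j - i == 1 and LEXICON.get(words[i]) == sym
    return any(covers(rhs, words, i, j) for rhs in GRAMMAR[sym])

def covers(rhs, words, i, j):
    """True if the RHS symbol sequence can tile the span words[i:j]."""
    if not rhs:
        return i == j
    head, rest = rhs[0], rhs[1:]
    for k in range(i + 1, j + 1):                # non-empty span for head
        if (j - k >= len(rest) and parses(head, words, i, k)
                and covers(rest, words, k, j)):
            return True
    return False

sent = "does this flight include a meal".split()
print(parses("S", sent, 0, len(sent)))  # True, via S -> Aux NP VP
```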

Page 53: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Natural Language Parsers, Peter Hellwig, Heidelberg

Constituency Parse - Nested Phrasal Structures

Dependency parse - Role Specific Structures

Page 54: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Tagging
John/NNP bought/VBD a/DT book/NN ./.

Constituency parse: nested phrasal structure
▪ (ROOT (S (NP (NNP John)) (VP (VBD bought) (NP (DT a) (NN book))) (. .)))

Typed dependencies: role-specific structure
▪ nsubj(bought-2, John-1)
▪ det(book-4, a-3)
▪ dobj(bought-2, book-4)

Page 55: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Grammar checking: sentences that cannot be parsed may have grammatical errors

Using the results of a dependency parse
Word sense disambiguation (dependencies as features or co-occurrence vectors)

Page 56: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

MINIPAR: http://www.cs.ualberta.ca/~lindek/minipar.htm
Link Grammar parser: http://www.link.cs.cmu.edu/link/
Standard “CFG” parsers like the Stanford parser: http://www-nlp.stanford.edu/software/lex-parser.shtml
ENJU’s probabilistic HPSG grammar: http://www-tsujii.is.s.u-tokyo.ac.jp/enju/

Page 57: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Some applications don’t need the complex output of a full parse

Chunking / shallow parse / partial parse
Identifying and classifying flat, non-overlapping contiguous units in text
▪ Segmenting and tagging

Example of chunking a sentence
▪ [NP The morning flight] from [NP Denver] [VP has arrived]
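Chunking is often implemented as regular expressions over the POS-tag sequence rather than full parsing. A sketch for the sentence above (the single-character tag coding is an assumption made to keep regex positions aligned with tokens):

```python
import re

# Sketch of a rule-based NP chunker: encode each POS tag as one character,
# then match chunk rules as a regex over the tag string instead of parsing.
tagged = [("The", "DT"), ("morning", "NN"), ("flight", "NN"),
          ("from", "IN"), ("Denver", "NNP"),
          ("has", "VBZ"), ("arrived", "VBN")]

CODE = {"DT": "D", "NN": "N", "NNP": "N", "IN": "I", "VBZ": "V", "VBN": "V"}
tag_string = "".join(CODE[t] for _, t in tagged)

# NP chunk rule: optional determiner followed by one or more nouns.
spans = [(m.start(), m.end()) for m in re.finditer(r"D?N+", tag_string)]
noun_phrases = [" ".join(w for w, _ in tagged[s:e]) for s, e in spans]
print(noun_phrases)  # ['The morning flight', 'Denver']
```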

Page 58: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

From Hearst 97

Page 59: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Entity recognition
▪ people, locations, organizations

Studying linguistic patterns (Hearst 92)
▪ gave NP
▪ gave up NP in NP
▪ gave NP NP
▪ gave NP to NP

Page 60: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Stanford and Enju parser demos; analyzing results
http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html
http://nlp.stanford.edu:8080/parser/

If you want to know how to run them stand-alone, talk to one of us or see their very helpful help pages

Page 61: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

COLORLESS GREEN IDEAS SLEEP FURIOUSLY

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

Page 62: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

When neither the raw linguistic inputs nor any structures derived from them facilitate the required semantic processing

When we need to link linguistic information to non-linguistic real-world knowledge

Typical sources of knowledge
Meanings of words, grammatical constructs, discourse, topic…

Page 63: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Lexical semantics
The meanings of individual words

Formal semantics (compositional semantics or sentential semantics)
How those meanings combine to make meanings for individual sentences or utterances

Discourse or pragmatics
How those meanings combine with each other and with other facts about various kinds of context to make meanings for a text or discourse

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 64: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Lexeme: the set of forms taken by a single word
run, runs, ran and running are forms of the same lexeme RUN

Lemma: a particular form of a lexeme chosen to represent a canonical form
▪ Carpet for carpets; sing for sing, sang, sung

Lemmatization: the meaning of a word approximated by the meaning of its lemma
Mapping a morphological variant to its root
▪ Derivational and inflectional morphology

Page 65: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Word sense: the meaning of a word (lemma)
Varies with context

Significance
Lexical ambiguity
▪ consequences for tasks like parsing and tagging
▪ implications for results of machine translation, text classification etc.

Word sense disambiguation
Selecting the correct sense for a word

Page 66: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Homonymy, Polysemy, Synonymy, Antonymy, Hypernymy, Hyponymy, Meronymy

Why do we care?
http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 67: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Homonymy: senses share a form but are relatively unrelated
Bank (a financial institution; a sloping mound)

Polysemy: senses are semantically related
Bank as a financial institution, as a blood bank
Verbs tend more to polysemy

Page 68: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Synonymy: different words/lemmas that have the same sense
Couch/chair

One sense more specific than the other (hyponymy)
Car is a hyponym of vehicle

One sense more general than the other (hypernymy)
Vehicle is a hypernym of car

Page 69: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Meronymy
Engine is part of car; engine is a meronym of car

Holonymy
Car is a holonym of engine

Page 70: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Semantic fields
Cohesive chunks of knowledge
Air travel:
▪ flight, travel, reservation, ticket, departure…

Page 71: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

WordNet models these sense relations
A hierarchically organized lexical database
An on-line thesaurus plus aspects of a dictionary
▪ Versions for other languages are under development

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 72: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 73: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Verbs and Nouns in separate hierarchies

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 74: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

The set of near-synonyms for a WordNet sense is called a synset (synonym set)
Their version of a sense or a concept

Duck as a verb, meaning
▪ to move (the head or body) quickly downwards or away
▪ dip, douse, hedge, fudge, evade, put off, circumvent, parry, elude, skirt, dodge, duck, sidestep

Page 75: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

IR and QnA
Indexing using similar (synonymous) words for a query, or specific-to-general words (hyponymy/hypernymy), improves text retrieval

Machine translation, QnA
Need to know if two words are similar, to know if we can substitute one for another

Page 76: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Most well developed: synonymy, or similarity
Synonymy – a binary relationship between words, or rather their senses

Approaches
Thesaurus-based: measuring word/sense similarity in a thesaurus
Distributional methods: finding other words with similar distributions in a corpus

Page 77: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Thesaurus-based
Path-based similarity – two words are similar if they are close in the thesaurus hierarchy

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
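Path-based similarity can be sketched over a toy hypernym hierarchy: walk each word up to the root, find the lowest common ancestor, and score by path length. The mini-hierarchy is invented, not real WordNet data:

```python
# Sketch of thesaurus path-based similarity: similarity = 1 / (1 + path
# length through the lowest common ancestor). Invented mini-hierarchy.
HYPERNYM = {  # child -> parent
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
    "sofa": "furniture", "furniture": "artifact",
}

def ancestors(word):
    """Map each ancestor (including the word itself) to its distance."""
    out, dist = {word: 0}, 0
    while word in HYPERNYM:
        word = HYPERNYM[word]
        dist += 1
        out[word] = dist
    return out

def similarity(a, b):
    pa, pb = ancestors(a), ancestors(b)
    common = pa.keys() & pb.keys()
    if not common:
        return 0.0
    return 1 / (1 + min(pa[c] + pb[c] for c in common))

print(similarity("car", "truck"), similarity("car", "sofa"))
# car-truck share the close parent "vehicle", so they score higher
```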

Page 78: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

We don’t have a thesaurus for every language, and even if we do, many words are missing
WordNet: strong for nouns, but lacking for adjectives and even verbs
Expensive to build

They rely on hyponym info for similarity
Car is a hyponym of vehicle

Alternative – distributional methods for word similarity

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 79: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Firth (1957): “You shall know a word by the company it keeps!”

Similar words appear in similar contexts – Nida’s example, noted by Lin:
▪ A bottle of tezgüino is on the table
▪ Everybody likes tezgüino
▪ Tezgüino makes you drunk
▪ We make tezgüino out of corn
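The intuition can be sketched with sentence-level co-occurrence vectors and cosine similarity. The corpus extends the tezgüino sentences with invented "wine" and "corn" sentences for contrast (spelled without the diacritic for simplicity):

```python
import math
from collections import Counter

# Sketch of distributional similarity: represent each word by the bag of
# words co-occurring with it in the same sentence, compare with cosine.
corpus = [
    "a bottle of tezguino is on the table",
    "everybody likes tezguino",
    "tezguino makes you drunk",
    "we make tezguino out of corn",
    "a bottle of wine is on the table",   # invented contrast sentences
    "everybody likes wine",
    "wine makes you drunk",
    "we grow corn in the field",
]

def context_vector(word):
    v = Counter()
    for sent in corpus:
        toks = sent.split()
        if word in toks:
            v.update(t for t in toks if t != word)
    return v

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb)

sim_wine = cosine(context_vector("tezguino"), context_vector("wine"))
sim_corn = cosine(context_vector("tezguino"), context_vector("corn"))
print(sim_wine > sim_corn)  # True: tezguino keeps the same company as wine
```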

Partial material from http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 80: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

Page 81: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

So you want to build your own text miner!

Page 82: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Infrastructure intensive

Luckily, plenty of open source tools, frameworks, resources…
http://www-nlp.stanford.edu/links/statnlp.html
http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html

Page 83: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Mining opinions from casual text
Data – user comments on artist pages from MySpace
“Your musics the shit,…lovve your video you are so bad”
“Your music is wicked!!!!”

Goal
Popularity lists generated from listeners’ comments, to complement radio plays / CD sales lists

Page 84: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

“Your musics the shit,…lovve your video you are so bad”

Pre-processing
▪ strip HTML, normalize text from different sources…
Tokenization
▪ splitting text into tokens: word tokens, number tokens, domain-specific requirements
Sentence splitting
▪ ! . ? … ; harder in casual text
Normalizing words
▪ stop word removal, lemmatization, stemming, transliterations (da == the)
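The normalization steps can be sketched as a tiny pipeline: tokenize, collapse stretched letters, and transliterate slang through a lookup table. The slang map is an invented placeholder, not a real resource:

```python
import re

# Sketch of casual-text normalization: crude word tokenization, collapsing
# runs of 3+ repeated letters, then slang transliteration via a small
# illustrative lookup table (a real system would use a curated resource).
SLANG = {"da": "the", "u": "you", "ur": "your"}

def normalize(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [re.sub(r"(.)\1{2,}", r"\1", t) for t in tokens]  # loadsss -> loads
    return " ".join(SLANG.get(t, t) for t in tokens)

print(normalize("Rhi u were da coolest last eve!!"))
# rhi you were the coolest last eve
```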

Page 85: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

‘The smile is so wicked!!’

Syntax: marking sentiment expressions from syntax or a dictionary
▪ The/DT smile/NN is/VBZ so/RB wicked/JJ !/. !/.

Semantics: surrounding context
▪ On Lily Allen’s MySpace page – cues for co-reference resolution
▪ Smile is a track by Lily Allen – ambiguity

Background knowledge / resources
▪ Using urbandictionary.com for the semantic orientation of ‘wicked’

Page 86: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

GATE – General Architecture for Text Engineering, since 1995 at the University of Sheffield, UK

UIMA – Unstructured Information Management Architecture, IBM

Document processing tools, components: syntactic tools, NLP tools, integrating framework

Page 87: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

TO COME: USAGE EXAMPLES OF WHAT WE COVERED THUS FAR

Page 88: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

SAMPLE APPLICATIONS, SURVEY OF EFFORTS IN TWO SAMPLE AREAS


Page 89: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest.

*MEK dependency ISA Dependency_on_an_Organic_chemical
*BRAF mutant cells ISA Cell_type
*downregulation of cyclin D1 protein expression ISA Biological_process
*tissue lineage ISA Biological_concept
*induction of G1 arrest ISA Biological_process

Information Extraction = segmentation+classification+association+mining

Text mining = entity identification+named relationship extraction+discovering association chains….

Segmentation → Classification → Named Relationship Extraction:
MEK dependency -[observed in]-> BRAF mutant cells
MEK dependency -[correlated with]-> downregulation of cyclin D1 protein expression
MEK dependency -[correlated with]-> induction of G1 arrest

Page 90: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

MEK dependency -[observed in]-> BRAF mutant cells
MEK dependency -[correlated with]-> downregulation of cyclin D1 protein expression
MEK dependency -[correlated with]-> induction of G1 arrest

Page 91: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

The task of classifying token sequences in text into one or more predefined classes

Approaches
Look up a list
▪ sliding window
Use rules
Machine learning
Compound entities

Applied to
Wikipedia-like text
Biomedical text

Page 92: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

The simplest approach
Proper nouns make up the majority of named entities

Look up a gazetteer
▪ CIA fact book for organizations, country names etc.

Poor recall
▪ coverage problems

Page 93: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Rule-based [Mikheev et al. 1999]
Frequency-based
"China International Trust and Investment Corp"
"Suspended Ceiling Contractors Ltd"
"Hughes" when "Hughes Communications Ltd." is already marked as an organization

Scalability issues:
• expensive to create manually
• leverages domain-specific information – domain specific
• tends to be corpus-specific – due to the manual process

Page 94: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Machine learning approaches
Ability to generalize better than rules
Can capture complex patterns
Require training data
▪ often the bottleneck

Techniques [list taken from Agichtein2007]
Naive Bayes
SRV [Freitag 1998], Inductive Logic Programming
Rapier [Califf and Mooney 1997]
Hidden Markov Models [Leek 1997]
Maximum Entropy Markov Models [McCallum et al. 2000]
Conditional Random Fields [Lafferty et al. 2001]

Page 95: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Orthographical features
CD28, a protein

Context features
Window of words
▪ fixed
▪ variable

Part-of-speech features
Current word; adjacent words within a fixed window

Word shape features
Kappa-B replaced with Aaaaa-A

Dictionary features
Inexact matches

Prefixes and suffixes
“~ase” = protein
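The word-shape feature is a one-line mapping: uppercase letters become A, lowercase become a, digits become 0, everything else is kept. A sketch:

```python
# Sketch of the word-shape feature for NER: "Kappa-B" -> "Aaaaa-A",
# so orthographically similar tokens share one feature value.
def word_shape(token):
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)           # keep hyphens, slashes, etc.
    return "".join(out)

print(word_shape("Kappa-B"), word_shape("CD28"))  # Aaaaa-A AA00
```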

Page 96: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

HMMs
▪ a powerful tool for representing sequential data
▪ probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities
▪ the observation probabilities are typically represented as a multinomial distribution over a discrete, finite vocabulary of words
▪ training is used to learn parameters that maximize the probability of the observation sequences in the training data

Generative
▪ find parameters to maximize P(X,Y)
▪ when labeling Xi, future observations are taken into account (forward-backward)

Problems
▪ feature overlap in NER
▪ e.g. to extract previously unseen company names from a newswire article, the identity of a word alone is not very predictive; knowing that the word is capitalized, that it is a noun, that it is used in an appositive, and that it appears near the top of the article would all be quite predictive
▪ would like the observations to be parameterized with these overlapping features
▪ feature independence assumption: several features about the same word can affect parameters

Page 97: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

MEMMs [McCallum et al., 2000]
Discriminative
▪ find parameters to maximize P(Y|X)

No longer assume that features are independent
▪ f<Is-capitalized,Company>(“Apple”, Company) = 1

Do not take future observations into account (no forward-backward)

Problems
▪ label bias problem

Page 98: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

CRFs [Lafferty et al., 2001]
Discriminative
Don’t assume that features are independent
When labeling Yi, future observations are taken into account
Global optimization – label bias prevented
The best of both worlds!

Page 99: CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH

Example
[ORG U.S. ] general [PER David Petraeus ] heads for [LOC Baghdad ] .

Token     POS   Chunk  Tag
---------------------------------
U.S.      NNP   I-NP   I-ORG
general   NN    I-NP   O
David     NNP   I-NP   B-PER
Petraeus  NNP   I-NP   I-PER
heads     VBZ   I-VP   O
for       IN    I-PP   O
Baghdad   NNP   I-NP   I-LOC
.         .     O      O

CONLL format – Mallet
The major bottleneck is training data
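The CONLL-style column format above can be read with a few lines of Python. This is a minimal sketch; real toolkits such as Mallet ship their own readers:

```python
# Minimal reader for the 4-column CONLL-style format
# (token, POS, chunk, NE tag), with blank lines between sentences.
def read_conll(lines):
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:          # blank line separates sentences
            if sentence:
                yield sentence
                sentence = []
            continue
        token, pos, chunk, tag = line.split()
        sentence.append({"token": token, "pos": pos,
                         "chunk": chunk, "tag": tag})
    if sentence:
        yield sentence

data = """U.S. NNP I-NP I-ORG
general NN I-NP O
Baghdad NNP I-NP I-LOC
. . O O"""
sent = next(read_conll(data.splitlines()))
print([t["tag"] for t in sent])  # ['I-ORG', 'O', 'I-LOC', 'O']
```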

Page 100

Context Induction approach [Talukdar2006]

▪ Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy.

▪ New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists.

▪ Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.

[Figure: pipeline – induced extraction patterns are pruned, then used for feature generation for a CRF]
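A toy sketch of the seed-driven induction loop: collect context patterns around seed mentions, keep the frequent ones, and use them to extract new candidates from unlabeled text. The corpus, seeds, pattern shape (two preceding words), and frequency threshold are all invented for illustration; the actual method uses much richer patterns and pruning:

```python
from collections import Counter

corpus = [
    "the mayor of Boston said",
    "the mayor of Springfield said",
    "the mayor of Toronto spoke",
]
seeds = {"Boston", "Springfield"}

# Induce left-context patterns (the two preceding words) around seeds.
patterns = Counter()
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        if w in seeds and i >= 2:
            patterns[(words[i - 2], words[i - 1])] += 1

# Keep high-precision patterns (here: seen at least twice), then use
# them to extract new candidate entities from the unlabeled text.
good = {p for p, c in patterns.items() if c >= 2}
found = set()
for sent in corpus:
    words = sent.split()
    for i in range(2, len(words)):
        if (words[i - 2], words[i - 1]) in good:
            found.add(words[i])

print(found - seeds)  # a previously unseen candidate: {'Toronto'}
```

Token membership in the extended list (`found`) is exactly the kind of feature that can then be fed to a CRF tagger.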

Page 101

Machine learning
▪ Best performance
Problem
▪ Training-data bottleneck
Pattern induction
▪ Reduces training-data creation time

Page 102

Knowledge Engineering approach
▪ Manually crafted rules

▪ Over lexical items <person> works for <organization>

▪ Over syntactic structures – parse trees

▪ GATE
Machine learning approaches

▪ Supervised
▪ Semi-supervised
▪ Unsupervised

Page 103

Supervised
▪ BioText – extraction of relationships between diseases and their treatments [Rosario et al., 2004]

▪ Rule-based supervised approach [Rinaldi et al., 2004]

▪ Semantics of specific relationships encoded as rules

▪ Identify a set of relations along with their morphological variants (bind, regulate, signal etc.)

subj(bind,X,_,_), pobj(bind,Y,to,_), prep(Y,to,_,_) => bind(X,Y).

▪ Axiom formulation was, however, a manual process involving a domain expert.
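The Prolog-style axiom above can be approximated in plain Python over dependency triples. The triples and entity names here are hand-written stand-ins for real parser output, just to show the shape of the rule:

```python
# A relation rule fires when the verb "bind" has both a subject and a
# "to"-prepositional object in the dependency parse.
deps = [
    ("subj", "bind", "ProteinA"),      # ProteinA is the subject of "bind"
    ("pobj_to", "bind", "ProteinB"),   # "to ProteinB" attaches to "bind"
]

def extract_bind(deps):
    subs = [d for (rel, h, d) in deps if rel == "subj" and h == "bind"]
    objs = [d for (rel, h, d) in deps if rel == "pobj_to" and h == "bind"]
    return [("bind", x, y) for x in subs for y in objs]

print(extract_bind(deps))  # [('bind', 'ProteinA', 'ProteinB')]
```

Writing one such rule per relation (and per morphological variant) is what makes the manual axiom-formulation step expensive.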

Page 104

Hand-coded, domain-specific rules that encode extraction patterns

▪ Molecular pathways [Friedman et al., 2001]
▪ Protein interactions [Saric et al., 2005]

All of the above are in the biomedical domain
▪ Notice the specificity of the relationship types
▪ Notice the amount of effort required
▪ Also notice the types of entities involved in the relationships

Page 105

IMPLICIT

EXPLICIT

Page 106

Semantic Role Labeling
▪ Features

A detailed tutorial on SRL by Yih & Toutanova is available

Page 107

Other approaches
Discovering concept-specific relations
▪ Davidov et al., 2007
Preemptive IE approach
▪ Rosenfeld & Feldman, 2007
Open Information Extraction
▪ Banko et al., 2007
▪ Self-supervised approach
▪ Uses dependency parses to train extractors
On-demand information extraction
▪ Sekine, 2006
▪ IR-driven
▪ Pattern discovery
▪ Paraphrasing

Page 108

Rule- and heuristic-based method
▪ YAGO [Suchanek et al., 2007]
▪ Pattern-based approach
▪ Uses WordNet

Subtree mining over dependency parse trees
▪ Nguyen et al., 2007

Page 109

• Entities (MeSH terms) in sentences occur in modified forms
• “adenomatous” modifies “hyperplasia”
• “An excessive endogenous or exogenous stimulation” modifies “estrogen”

• Entities can also occur as composites of 2 or more other entities
• “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”

Page 110

[Figure: dependency parse with the relationship head, subject head, and object heads marked]

A small set of rules over dependency types deals with modifiers (amod, nn), subjects and objects (nsubj, nsubjpass), etc.

Since dependency types are arranged in a hierarchy, we use this hierarchy to generalize the more specific rules

There are only 4 rules in our current implementation
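A toy sketch of rules of this kind over typed dependencies: amod/nn edges yield modified entities, and "of"-attachment yields a composite entity. The dependency triples are hand-written stand-ins for parser output on "adenomatous hyperplasia of the endometrium"; the rule set here is a simplified illustration, not the actual 4-rule implementation:

```python
# Typed-dependency triples: (relation, head, dependent).
deps = [
    ("amod", "hyperplasia", "adenomatous"),
    ("prep_of", "hyperplasia", "endometrium"),
]

triples = []
for rel, head, dep in deps:
    if rel in ("amod", "nn"):          # modifier rule -> modified entity
        triples.append((head, "hasModifier", dep))
    elif rel == "prep_of":             # composite-entity rule
        whole = f"{head} of the {dep}"
        triples.append((whole, "hasPart", head))
        triples.append((whole, "hasPart", dep))

for t in triples:
    print(t)
```

This yields the hasModifier and hasPart structure shown in the RDF graph on the next slide.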

Carroll, J., G. Minnen and E. Briscoe (1999) `Corpus annotation for parser evaluation'. In Proceedings of the EACL-99 Post-Conference Workshop on Linguistically Interpreted Corpora, Bergen, Norway. 35-41. Also in Proceedings of the ATALA Workshop on Corpus Annotés pour la Syntaxe - Treebanks, Paris, France. 13-20.

Page 111

[Figure: extracted RDF graph for the example sentence. modified_entity1 (“estrogen” hasModifier “An excessive endogenous or exogenous stimulation”) induces composite_entity1; composite_entity1 hasPart modified_entity2 (“adenomatous hyperplasia”) and hasPart “endometrium”. Legend: modifiers, modified entities, composite entities]

Page 112

Manual Evaluation
▪ Test whether the RDF conveys the same “meaning” as the sentence
▪ Juxtapose the triple with the sentence
▪ Allow the user to assess the correctness/incorrectness of the subject, object and triple

Page 113
Page 114
Page 115

Discovering informative subgraphs (Harry Potter)
▪ Given a pair of end-points (entities)
▪ Produce a subgraph with relationships connecting them, such that
▪ The subgraph is small enough to be visualized
▪ And contains relevant, “interesting” connections

We defined an interestingness measure based on the ontology schema
▪ In the future, the biomedical scientist will control this with the help of a browsable ontology
Our interestingness measure takes into account
▪ Specificity of the relationships and entity classes involved
▪ Rarity of relationships, etc.

Cartic Ramakrishnan, William H. Milnor, Matthew Perry, Amit P. Sheth: Discovering informative connection subgraphs in multi-relational graphs. SIGKDD Explorations 7(2): 56-63 (2005)

Page 116

Two factors influencing interestingness

Page 117

• Bidirectional lock-step growth from S and T
• Choice of the next node based on the interestingness measure
• Stop when there are enough connections between the frontiers
• This is treated as the candidate graph

Page 118

Model the candidate graph as an electrical circuit
▪ S is the source and T the sink
▪ Edge weights derived from the ontology schema are treated as conductance values
▪ Using Ohm’s and Kirchhoff’s laws, we find maximum current-flow paths through the candidate graph from S to T

▪ At each step we add this path to the output graph to be displayed, repeating the process until a predefined number of nodes is reached
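The circuit analogy can be illustrated on a single intermediate node. Fix V(S) = 1 and V(T) = 0, treat schema-derived weights as conductances, and apply Kirchhoff's current law; the graph and conductance values below are invented for illustration (the real system solves the full linear system over the candidate graph):

```python
# One two-edge path S -> a -> T with conductances g_sa and g_at.
def branch_current(g_sa, g_at):
    # KCL at node a: g_sa * (1 - Va) = g_at * (Va - 0)
    #            =>  Va = g_sa / (g_sa + g_at)
    va = g_sa / (g_sa + g_at)
    return g_at * va          # current a -> T (equals current S -> a)

# Two candidate paths with different schema-derived edge weights:
i_a = branch_current(g_sa=2.0, g_at=2.0)   # specific, rare relationships
i_b = branch_current(g_sa=0.5, g_at=0.5)   # generic, common relationships
print(i_a, i_b)   # the higher-current path is added to the output first
```

Repeatedly extracting the highest-current path and adding it to the displayed subgraph gives the small, interesting connection subgraph described above.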

Results
▪ Arnold Schwarzenegger, Edward Kennedy

Other related work
▪ Semantic Associations

Page 119
Page 120
Page 121

Text mining and analysis
▪ Understanding, utilization in decision making, knowledge discovery

Entity identification
▪ Focus change from simple to compound entities

Relationship extraction
▪ Implicit vs. explicit

Need more unsupervised approaches
Need to think of incentives for evaluation

Page 122

Existing corpora
▪ GENIA, BioInfer and many others
▪ Narrow focus

Precision and recall vs. utility
▪ How useful is the extracted information? How do we measure utility?
▪ Swanson’s discovery, enrichment of the browsing experience

Text types and mining
▪ Systematically compensating for (in)formality

Page 123

http://www.cs.famaf.unc.edu.ar/~laura/text_mining/

http://www.stanford.edu/class/cs276/cs276-2005-syllabus.html

http://www-nlp.stanford.edu/links/statnlp.html

http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html

Page 124

1. Automating Creation of Hierarchical Faceted Metadata Structures Emilia Stoica, Marti Hearst, and Megan Richardson, in the proceedings of NAACL-HLT, Rochester NY, April 2007

2. Finding the Flow in Web Site Search, Marti Hearst, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ping Yee, Communications of the ACM, 45 (9), September 2002, pp.42-49.

3. R. Kumar, U. Mahadevan, and D. Sivakumar, "A graph-theoretic approach to extract storylines from search results",  in Proc. KDD, 2004, pp.216-225.

4. Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676

5. Hearst, M. A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 (Nantes, France, August 23 - 28, 1992).

6. Dat P. T. Nguyen, Yutaka Matsuo, Mitsuru Ishizuka: Relation Extraction from Wikipedia Using Subtree Mining. AAAI 2007: 1414-1420

7. "Unsupervised Discovery of Compound Entities for Relationship Extraction" Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns

8. Mikheev, A., Moens, M., and Grover, C. 1999. Named Entity recognition without gazetteers. In Proceedings of the Ninth Conference on European Chapter of the Association For Computational Linguistics (Bergen, Norway, June 08 - 12, 1999).

9. McCallum, A., Freitag, D., and Pereira, F. C. 2000. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proceedings of the Seventeenth international Conference on Machine Learning

10. Lafferty, J. D., McCallum, A., and Pereira, F. C. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth international Conference on Machine Learning

Page 125

11. Rosario, B. and Hearst, M. A., Classifying semantic relations in bioscience texts, in Proceedings of the 42nd ACL. 2004, Association for Computational Linguistics: Barcelona, Spain.

12. M.A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING‘ 92, pages 539–545

13. M. Hearst, "Untangling text data mining," 1999. [Online]. Available: http://citeseer.ist.psu.edu/563035.html

14. Friedman, C., et al., GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 2001. 17 Suppl 1: p. 1367-4803.

15. Saric, J., et al., Extraction of regulatory gene/protein networks from Medline. Bioinformatics, 2005.

16. Ciaramita, M., et al., Unsupervised Learning of Semantic Relations between Concepts of a Molecular Biology Ontology, in 19th IJCAI. 2005.

17. Dmitry Davidov, Ari Rappoport, Moshe Koppel. Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining. Proceedings, ACL 2007, June 2007, Prague.

18. Rosenfeld, B. and Feldman, R. 2007. Clustering for unsupervised relation identification. In Proceedings of the Sixteenth ACM Conference on Conference on information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007).

19. Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676

20. Sekine, S. 2006. On-demand information extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions (Sydney, Australia, July 17 - 18, 2006). Annual Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 731-738.

21. Suchanek, F. M., Kasneci, G., and Weikum, G. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international Conference on World Wide Web (Banff, Alberta, Canada, May 08 - 12, 2007). WWW '07.