
Sequence Classification: Chunking & NER

Shallow Processing Techniques for NLP

Ling570, November 23, 2011

Roadmap

Named Entity Recognition

Chunking

HW #9

Named Entity Recognition

Roadmap: Named Entity Recognition

Definition

Motivation

Challenges

Common Approach

Named Entity Recognition

Task: identify named entities in (typically) unstructured text

Typical entities:
Person names
Locations
Organizations
Dates
Times


Example

Microsoft released Windows Vista in 2007.

<ORG>Microsoft</ORG> released <PRODUCT>Windows Vista</PRODUCT> in <YEAR>2007</YEAR>

Entities are often application/domain specific:
Business intelligence: products, companies, features
Biomedical: genes, proteins, diseases, drugs, …

Example due to F. Xia

Named Entity Types

Common categories

Named Entity Examples

For common categories:


Why NER?

Machine translation:
Person names are typically not translated, though possibly transliterated: Waldheim
Numbers: 9/11 could be a date or a ratio; 911 could be the emergency phone number or a simple number


Why NER?

Information extraction:
MUC task: joint ventures/mergers
Focus on company names, person names (CEO), valuations

Information retrieval:
Named entities are the focus of retrieval; in some data sets, 60+% of queries target NEs

Text-to-speech: 206-616-5728
Phone numbers are read differently from other digit strings, and conventions differ by language


Challenges

Ambiguity:
Washington chose: D.C.? the state? George? etc.
Most digit strings are ambiguous
cat (95 results): CAT(erpillar) stock ticker, Computerized Axial Tomography, Chloramphenicol Acetyl Transferase, small furry mammal

Context & Ambiguity

Evaluation

Precision

Recall

F-measure

Resources

Online:
Name lists: baby names, who's who, newswire services, census.gov
Gazetteers
SEC listings of companies

Tools: LingPipe, OpenNLP, Stanford NLP toolkit


Approaches to NER

Rule/regex-based:
Match names/entities in lists
Regex: e.g. \d\d/\d\d/\d\d matches 11/23/11; currency: \$\d+\.\d+

Machine learning via sequence labeling:
Better for names, organizations

Hybrid
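As an illustration of the rule/regex approach, here is a minimal Python sketch of regex tagging for dates and currency amounts; the patterns mirror the slide's examples, and a real system would add many more rules plus list lookups.

    import re

    # Patterns from the slide: dates like 11/23/11 and currency amounts.
    # Note the escaped \$; a bare $ anchors at end of string.
    DATE_RE = re.compile(r"\b\d\d/\d\d/\d\d\b")
    CURRENCY_RE = re.compile(r"\$\d+\.\d+")

    def regex_ner(text):
        """Return (label, string, span) triples for regex-matchable entities."""
        entities = []
        for label, pattern in (("DATE", DATE_RE), ("MONEY", CURRENCY_RE)):
            for m in pattern.finditer(text):
                entities.append((label, m.group(), m.span()))
        return entities

    print(regex_ner("Windows Vista shipped on 01/30/07 for $239.00."))
    # [('DATE', '01/30/07', (25, 33)), ('MONEY', '$239.00', (38, 45))]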

NER as Sequence Labeling


NER as Classification Task

Instance: token

Labels:
Position: B(eginning), I(nside), O(utside)
NER types: PER, ORG, LOC, NUM
Label: Type-Position, e.g. PER-B, PER-I, O, …

How many tags? (|NER types| × 2) + 1
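A quick sanity check of the tag count in Python (the type list follows the slide):

    # B- and I- variants for each NER type, plus a single O tag.
    NER_TYPES = ["PER", "ORG", "LOC", "NUM"]
    tags = ["O"] + [f"{t}-{pos}" for t in NER_TYPES for pos in ("B", "I")]
    print(tags)       # ['O', 'PER-B', 'PER-I', 'ORG-B', ..., 'NUM-I']
    print(len(tags))  # (4 * 2) + 1 = 9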


NER as Classification: Features

What information can we use for NER?

Predictive tokens: e.g. MD, Rev, Inc., …

How general are these features? Language? Genre? Domain?


NER as Classification: Shape Features

Shape types:
lower: e.g. cumming (all lower case)
capitalized: e.g. Washington (first letter uppercase)
all caps: e.g. WHO (all letters capitalized)
mixed case: e.g. eBay (mixed upper and lower case)
capitalized with period: e.g. H.
ends with digit: e.g. A9
contains hyphen: e.g. H-P
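A minimal sketch of one way to compute these shape classes in Python; the class names and test order are my own, not a standard.

    import re

    def word_shape(token):
        # More specific patterns are tested before the general case checks.
        if re.fullmatch(r"[A-Z]\.", token):
            return "capitalized-period"   # H.
        if token[-1].isdigit() and not token.isdigit():
            return "ends-with-digit"      # A9
        if "-" in token:
            return "contains-hyphen"      # H-P
        if token.isupper():
            return "all-caps"             # WHO
        if token.islower():
            return "lower"                # cumming
        if token[0].isupper() and token[1:].islower():
            return "capitalized"          # Washington
        return "mixed-case"               # eBay

    for tok in ["cumming", "Washington", "WHO", "eBay", "H.", "A9", "H-P"]:
        print(tok, "->", word_shape(tok))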

Example Instance Representation

Example

Sequence Labeling Example


Evaluation

System: output of automatic tagging
Gold standard: true tags

Precision: # correct chunks / # system chunks
Recall: # correct chunks / # gold chunks
F-measure: F1 = 2PR / (P + R)

F1 balances precision & recall
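A minimal sketch of these chunk-level measures in Python, assuming chunks are represented as (label, start, end) spans:

    def prf(system_chunks, gold_chunks):
        system, gold = set(system_chunks), set(gold_chunks)
        correct = len(system & gold)              # exact-match chunks only
        p = correct / len(system) if system else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    gold = [("NP", 0, 3), ("PP", 3, 4), ("NP", 4, 5), ("VP", 5, 7)]
    system = [("NP", 0, 3), ("NP", 4, 5), ("VP", 5, 6)]
    print(prf(system, gold))  # roughly (0.667, 0.5, 0.571)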


Evaluation

Standard measures:
Precision, recall, F-measure, computed on entity types (CoNLL evaluation)

Classifiers vs. evaluation measures:
Classifiers optimize tag accuracy
Most common tag? O, since most tokens aren't NEs
Evaluation measures focus on NEs

State of the art:
Standard tasks: PER, LOC: 0.92; ORG: 0.84


Hybrid Approaches

Practical systems exploit lists, rules, learning…

Multi-pass:
Early passes: high precision, low recall
Later passes: noisier sequence learning

Hybrid system:
High-precision rules tag unambiguous mentions
Use string matching to capture substring matches
Tag items from domain-specific name lists
Apply sequence labeler
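A minimal sketch of such a multi-pass pipeline in Python; the toy gazetteer and the stub sequence labeler are placeholders, not the systems the slides describe.

    GAZETTEER = {"Microsoft": "ORG", "Denver": "LOC"}  # toy domain name list

    def rule_pass(tokens, labels):
        # Early pass: high-precision list lookup on still-untagged tokens.
        for i, tok in enumerate(tokens):
            if labels[i] == "O" and tok in GAZETTEER:
                labels[i] = GAZETTEER[tok] + "-B"
        return labels

    def sequence_pass(tokens, labels):
        # Later pass: a learned sequence labeler (MaxEnt, CRF, ...) would
        # re-tag the remaining O tokens here; stubbed out in this sketch.
        return labels

    def hybrid_ner(tokens):
        labels = ["O"] * len(tokens)
        for tagger in (rule_pass, sequence_pass):  # high-precision passes first
            labels = tagger(tokens, labels)
        return list(zip(tokens, labels))

    print(hybrid_ner("Microsoft released Windows Vista in 2007 .".split()))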

Chunking

Roadmap: Chunking

Definition

Motivation

Challenges

Approach


What is Chunking?

Form of partial (shallow) parsing: extracts major syntactic units, but not full parse trees

Task: identify and classify flat, non-overlapping segments of a sentence
Basic non-recursive phrases
Correspond to major POS categories
May ignore some categories, e.g. base NP chunking

Creates a simple bracketing:
[NP The morning flight] [PP from] [NP Denver] [VP has arrived]
[NP The morning flight] from [NP Denver] has arrived


Why Chunking?

Used when a full parse is unnecessary, or infeasible or impossible (when?)

Extraction of subcategorization frames: identify verb arguments
e.g. VP -> NP; VP -> NP NP; VP -> NP to NP

Information extraction: who did what to whom

Summarization: base information, remove modifiers

Information retrieval: restrict indexing to base NPs


Processing Example

Tokenization: The morning flight from Denver has arrived
POS tagging: DT JJ N PREP NNP AUX V
Chunking: NP PP NP VP
Extraction: NP NP VP
etc.


Approaches

Finite-state approaches:
Grammatical rules encoded in FSTs
Cascade to produce more complex structure

Machine learning:
Similar to POS tagging


Finite-State Rule-Based Chunking

Hand-crafted rules model phrases; typically application-specific

Left-to-right longest match (Abney 1996):
Start at beginning of sentence
Find longest matching rule
Greedy approach, not guaranteed optimal


Finite-State Rule-Based Chunking

Chunk rules cannot contain recursion:
NP -> Det Nominal: okay
Nominal -> Nominal PP: not okay

Examples:
NP -> (Det) Noun* Noun
NP -> Proper-Noun
VP -> Verb
VP -> Aux Verb

Consider: Time flies like an arrow
Is this what we want?
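A minimal Python sketch of left-to-right longest-match chunking, using regexes over space-joined POS tag strings as stand-ins for FST rules; the rule set loosely follows the examples above.

    import re

    # Each rule is a label plus a pattern over space-joined POS tags.
    RULES = [
        ("NP", re.compile(r"(DT )?(NN )*NN\b|NNP\b")),
        ("VP", re.compile(r"(AUX )?VB\b")),
    ]

    def chunk(tags):
        chunks, i = [], 0
        while i < len(tags):
            best = None  # (label, length in tags) of longest match at i
            for label, pat in RULES:
                m = pat.match(" ".join(tags[i:]))
                if m and (best is None or len(m.group().split()) > best[1]):
                    best = (label, len(m.group().split()))
            if best:
                chunks.append((best[0], tags[i:i + best[1]]))
                i += best[1]
            else:
                chunks.append(("O", [tags[i]]))  # no rule matches: pass one tag
                i += 1
        return chunks

    print(chunk("DT NN NN IN NNP AUX VB".split()))
    # [('NP', ['DT', 'NN', 'NN']), ('O', ['IN']), ('NP', ['NNP']), ('VP', ['AUX', 'VB'])]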


Cascading FSTs

Richer partial parsing: pass the output of one FST to the next

Approach:
First stage: base phrase chunking
Next stage: larger constituents (e.g. PPs, VPs)
Highest stage: sentences

Example


Chunking by Classification

Model chunking as a task similar to POS tagging

Instance: tokens

Labels: simultaneously encode segmentation & identification
IOB (or BIO) tagging (also BIOE or BIOSE)
Segment: B(eginning), I(nside), O(utside)
Identity: phrase category: NP, VP, PP, etc.

The morning flight from Denver has arrived
NP-B NP-I NP-I PP-B NP-B VP-B VP-I
NP-B NP-I NP-I O NP-B O O (base NP chunking)
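A minimal Python sketch of decoding such Type-Position tags back into (label, start, end) chunk spans:

    def bio_to_chunks(tags):
        chunks, start, label = [], None, None
        for i, tag in enumerate(tags + ["O"]):      # sentinel flushes last chunk
            typ, _, pos = tag.partition("-")
            if pos != "I" and label is not None:    # close any open chunk
                chunks.append((label, start, i))
                start, label = None, None
            if pos == "B":                          # open a new chunk
                start, label = i, typ
        return chunks

    tags = ["NP-B", "NP-I", "NP-I", "PP-B", "NP-B", "VP-B", "VP-I"]
    print(bio_to_chunks(tags))
    # [('NP', 0, 3), ('PP', 3, 4), ('NP', 4, 5), ('VP', 5, 7)]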


Features for Chunking

What are good features?

Preceding chunk tags: for the 2 preceding words
Words: 2 preceding, current, 2 following
Parts of speech: 2 preceding, current, 2 following

The vector includes those features + the true label
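A minimal Python sketch of extracting that feature window for token i, assuming parallel lists of words, POS tags, and the chunk tags already predicted for earlier positions (the feature names are mine):

    def chunk_features(words, pos, prev_chunk_tags, i):
        # Out-of-range positions get a boundary symbol.
        pad = lambda seq, j: seq[j] if 0 <= j < len(seq) else "<S>"
        feats = {}
        for k in (-2, -1, 0, 1, 2):
            feats[f"w{k:+d}"] = pad(words, i + k)  # word window
            feats[f"p{k:+d}"] = pad(pos, i + k)    # POS window
        for k in (-2, -1):
            feats[f"t{k:+d}"] = pad(prev_chunk_tags, i + k)  # preceding tags
        return feats

    words = "The morning flight from Denver has arrived".split()
    pos = ["DT", "NN", "NN", "IN", "NNP", "AUX", "VB"]
    print(chunk_features(words, pos, ["NP-B", "NP-I"], 2))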

Chunking as Classification: Example

Evaluation

System: output of automatic tagging
Gold standard: true tags, typically extracted from a parsed treebank

Precision: # correct chunks / # system chunks
Recall: # correct chunks / # gold chunks
F-measure: F1 = 2PR / (P + R)

F1 balances precision & recall


State-of-the-Art

Base NP chunking: 0.96

Complex phrases:
Learning: 0.92-0.94 (most learners achieve similar results)
Rule-based: 0.85-0.92

Limiting factors:
POS tagging accuracy
Inconsistent labeling (parse tree extraction)
Conjunctions:
Late departures and arrivals are common in winter
Late departures and cancellations are common in winter

HW #9

Building a MaxEnt POS Tagger

Q1: Build feature vector representations for POS tagging in SVMlight format

maxent_features.* training_file testing_file rare_wd_threshold rare_feat_threshold outdir

training_file, testing_file: same format as HW #7: w1/t1 w2/t2 … wn/tn

Filter rare words and infrequent features

Store vectors & intermediate representations in outdir
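For reference, SVMlight format puts one instance per line: the class label, then feature index:value pairs in ascending index order, optionally followed by a # comment. A hypothetical line (the indices and comment are made up, not the assignment's actual mapping):

    3 101:1 205:1 1024:1 # arrived/VB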

Feature Representations

Features: Ratnaparkhi, 1996, Table 1 (duplicated in the MaxEnt slides)

Character issues:
Replace "," with "comma"
Replace ":" with "colon"
Mallet and svmlight formats use these characters as delimiters
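A one-line sketch of that substitution in Python (the feature name shown is a made-up example):

    def escape_feature(name):
        # ',' and ':' are delimiters in Mallet/svmlight files.
        return name.replace(",", "comma").replace(":", "colon")

    print(escape_feature("curW=,"))  # curW=comma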

Q2: Experiments

Run MaxEnt classification using your training and test files

Compare the effects of different thresholds on feature count, accuracy, and runtime

Note: Big files

This assignment will produce even larger sets of results than HW #8. Please gzip your tar files. If the DropBox won't accept the files, you can store them on patas; just let Sanghoun know where to find them.
