WP2: Named Entity Recognition and Classification
Claire Grover, University of Edinburgh
Final Review, 31 October 2003
Multilingual IE Architecture

[Architecture diagram: web pages are processed by the four monolingual NERC components (ENERC, FNERC, HNERC, INERC), then by the Demarcator and Fact Extraction, which populates a database; all components draw on a shared domain ontology.]
WP2: Objectives
• Specification of a language-neutral NERC architecture (month 6: D2.1)
• NERC v.1: adaptation and integration of the four existing NERC modules (month 12: D2.2)
• NERC v.2: improvement of NERC v.1, incorporation of name matching (month 18: D2.3)
• NERC v.3: improvement of NERC v.2, incorporation of rapid adaptation mechanisms, porting to the 2nd domain (month 26: D2.4)
• Specification of the Corpus Collection Methodology
• NERC-based Demarcation
Features Specific to CROSSMARC NERC
• Multilinguality: currently four languages (English, French, Greek, Italian), with the ability to add new ones.
• Web pages as input: conversion of HTML to XHTML and use of XML as the common exchange format, with a specific DTD per domain (see the sketch below).
• Extensible to new domains: there is a need to add new domains rapidly.
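The slides only state that HTML is converted to XHTML before XML processing; the tool used is not named here. A minimal sketch of that normalisation step, using lxml as an illustrative (assumed) choice:

```python
# Hypothetical sketch of the HTML -> XHTML normalisation step.
# lxml is an assumed stand-in; the deck does not name the actual converter.
from lxml import etree, html

def html_to_xhtml(raw_html: str) -> str:
    """Leniently parse possibly malformed HTML, then serialise as XHTML."""
    root = html.fromstring(raw_html)      # tolerant HTML parser
    html.html_to_xhtml(root)              # move elements into the XHTML namespace
    return etree.tostring(root, pretty_print=True, encoding="unicode")

page = "<html><body><p>Laptop X100 <b>999 EUR</body></html>"
print(html_to_xhtml(page))                # well-formed XHTML, all tags closed
```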
Shared Features of the NERC ComponentsShared Features of the NERC Components
• XHTML input and output, shared DTD• Shared domain ontology• Each reuses existing NLP tools and linguistic
resources• Stepwise transformation of the XHTML to
incrementally add mark-up, e.g. tokenisation, sentence identification, part-of-speech tagging, entity recognition.
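To make the stepwise idea concrete, here is a minimal sketch of one pass of such a pipeline; the element names (<page>, <p>, <w>) are illustrative assumptions, not the project's actual DTD:

```python
# Sketch of one incremental mark-up pass: a tokenisation step that wraps each
# token of a <p> in a <w> element. Later passes would add POS and NE mark-up.
import re
from lxml import etree

def tokenise(root: etree._Element) -> None:
    for p in root.iter("p"):
        words = re.findall(r"\S+", p.text or "")
        p.text = None                          # text is re-expressed as <w> children
        for w in words:
            etree.SubElement(p, "w").text = w

root = etree.fromstring("<page><p>Intel Pentium 4 2.4GHz</p></page>")
tokenise(root)
print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```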
NERC Version 2
• Final version of NERC for the 1st domain.
• All four monolingual systems use hand-coded rule sets:
– HNERC uses the Ellogon Text Engineering Platform.
– ENERC uses the LT TTT and LT XML tools and adds XML annotations incrementally.
– INERC is implemented as a sequence of XSLT transformations of the XML document.
– FNERC uses Lingway's XTIRP Extraction Tool, which applies a sequence of rule-based modules.
NERC Version 3
• Reported in D2.4.
• Final version of NERC, dealing with the 2nd domain.
• Main focus is the customisation methodology and experimentation to allow rapid adaptation to new domains.
• Because the monolingual components of the NERC architecture differ from one another, customisation methods are defined per component.
ENERC Customisation Methodology
• Retain the XML pipeline architecture.
• Replace the named entity rule sets with a maximum entropy tagger (see the sketch below).
• Experiments with the C&C Tagger and OpenNLP.
• Limited human intervention (selection of appropriate features).
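A hedged sketch of the idea: the project used the C&C Tagger and OpenNLP, whereas this stand-in trains scikit-learn's LogisticRegression (a maximum entropy model) over a few assumed token features; the feature set and tags are illustrative only.

```python
# Illustrative maxent NE tagging; NOT the C&C or OpenNLP implementation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    """Per-token features; choosing these was the limited manual step."""
    t = tokens[i]
    return {
        "word": t.lower(),
        "is_capitalised": t[:1].isupper(),
        "has_digit": any(c.isdigit() for c in t),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

train_tokens = ["Toshiba", "Satellite", "with", "Intel", "processor"]
train_tags   = ["B-MANUF", "B-MODEL", "O", "B-MANUF", "O"]   # toy training data
X = [features(train_tokens, i) for i in range(len(train_tokens))]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, train_tags)
test = ["Compaq", "laptop"]
print(model.predict([features(test, i) for i in range(len(test))]))
```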
FNERC Customisation Methodology
• Retain the XTIRP-based architecture and modules.
• Use machine learning to assist in the acquisition of regular-expression named entity rules (see the sketch below).
• The machine-learning module produces a first version of human-readable rules, plus lists of examples and counter-examples.
• The human expert modifies the rule set appropriately.
• This method reduces rule set development time to about a third.
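The learner inside XTIRP is not described on the slide; the following sketch only illustrates the general idea of drafting human-readable regex rules from positive examples and filtering them against counter-examples:

```python
# Hypothetical regex-rule drafting with counter-example filtering.
# Not the actual XTIRP learning module.
import re

def draft_rule(example: str) -> str:
    """Generalise digits, letters and spaces into readable character classes."""
    parts = []
    for tok in re.findall(r"\d+|[A-Za-z]+|\s+|.", example):
        if tok.isdigit():
            parts.append(r"\d+")
        elif tok.isalpha():
            parts.append("[A-Za-z]+")
        elif tok.isspace():
            parts.append(r"\s+")
        else:
            parts.append(re.escape(tok))
    return "".join(parts)

positives = ["1.5 GHz", "866 MHz"]          # example SPEED mentions
negatives = ["Pentium 4"]                   # counter-examples
rules = sorted({draft_rule(p) for p in positives})
rules = [r for r in rules if not any(re.fullmatch(r, n) for n in negatives)]
for r in rules:                             # drafts for the expert to edit
    print(r)                                # \d+\.\d+\s+[A-Za-z]+  and  \d+\s+[A-Za-z]+
```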
HNERC Customisation Methodology
• ML-HNERC comprises:
• Token-based HNERC
– operates over word tokens, treating NERC as a tagging problem.
– word-token classification is performed by five independent taggers, with the final tag chosen by a simple majority voter (see the sketch below).
• Phrase-based HNERC
– operates over phrases which have been identified using a grammar automatically induced from the training corpus.
– uses a C4.5 decision tree classifier to recognize phrases that describe entities.
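A minimal sketch of the token-level majority vote; the five taggers themselves are not reproduced here, only the combination step:

```python
# Majority voting over per-token tags from several independent taggers.
from collections import Counter

def majority_vote(tag_sequences):
    """tag_sequences: one tag list per tagger, aligned by token.
    Ties go to the tag seen first among the taggers."""
    voted = []
    for token_tags in zip(*tag_sequences):
        tag, _ = Counter(token_tags).most_common(1)[0]
        voted.append(tag)
    return voted

tagger_outputs = [                      # five toy taggers, three tokens
    ["B-MANUF", "O", "B-MODEL"],
    ["B-MANUF", "O", "O"],
    ["O",       "O", "B-MODEL"],
    ["B-MANUF", "O", "B-MODEL"],
    ["B-MANUF", "O", "O"],
]
print(majority_vote(tagger_outputs))    # ['B-MANUF', 'O', 'B-MODEL']
```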
INERC Customisation Methodology
• INERC is modular, with components that are general and reusable in new domains; customisation can be restricted to the lexical knowledge bases.
• A statistically driven process generalises from the annotated corpus material to derive broader lexical resources.
• A frequency score is computed to decide which candidates expand the lexical resources (see the sketch below).
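The slide does not define the score; this sketch assumes a plain frequency threshold over words occurring inside annotated entities, as one simple instantiation:

```python
# Hypothetical frequency-based lexicon expansion. The real scoring
# function and threshold are not specified on the slide.
from collections import Counter

def expand_lexicon(annotated_entities, lexicon, min_freq=2):
    """Promote words frequent inside gold entity annotations into the lexicon."""
    counts = Counter(word.lower()
                     for entity in annotated_entities
                     for word in entity.split())
    return lexicon | {w for w, c in counts.items() if c >= min_freq}

entities = ["Comune di Milano", "Comune di Torino", "Regione Lombardia"]
# A real system would also filter function words such as "di" with a stop list.
print(expand_lexicon(entities, lexicon={"regione"}))
```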
Evaluation Methodology
• For both domains we have a hand-annotated corpus of 100 pages per language, split 50-50 into training and testing material.
• Each monolingual NERC is evaluated against the testing corpus.
• The standard measures of precision, recall and F-measure are used (computed as in the sketch below).
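For completeness, the standard definitions over annotated entities, with entities represented as (start, end, type) triples (an assumed encoding):

```python
# Precision, recall and F-measure over gold vs. predicted entity sets.
def prf(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall    = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = {(0, 7, "MANUF"), (8, 17, "MODEL")}
pred = {(0, 7, "MANUF"), (20, 25, "MODEL")}
print(prf(gold, pred))                               # (0.5, 0.5, 0.5)
```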
Evaluation Summary

         Domain 1 F-score   Domain 2 F-score
ENERC    0.73                0.59
FNERC    0.77                0.75
HNERC    0.86                0.68
INERC    0.82                0.77
Conclusions
• The rule-based approach gives better results, but it is knowledge-intensive and requires significant resources for customisation to each new domain.
• The FNERC approach to rule induction is promising.
• In our experiments the machine learning approaches give lower results, but:
– they allow easy adaptation to new domains;
– there is scope to improve performance;
– more training material would give better performance.
Other WP2 Activities
• Collection and annotation of corpora for each language and domain.
• NERC-based Demarcation
Corpus Collection Methodology
• For each domain the process follows two steps:
– identification of interesting characteristics of product descriptions, and the collection of statistics relevant to these characteristics from at least 50 different sites per language;
– collection of pages and their separation into training and testing corpora.
Corpus Collection Principles
Domain-independent principles:
• Training and testing corpora have the same number of pages.
• Corpus size is fixed for all languages.
• Corpora are representative of the statistics found per language in the site classification step.
Domain-specific principles:
• The maximum number of pages from one site allowed in a corpus must be decided depending on the domain.
• The testing corpus must contain X pages that come from sites not represented in the training corpus (see the sketch below).
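A sketch of a split respecting these principles, assuming pages are (site, page_id) pairs; the "X pages from unseen sites" requirement is realised here by holding out whole sites, an assumption for illustration:

```python
# Hypothetical corpus split: equal-sized halves, with the testing half
# containing pages from sites absent from the training half.
import random

def split_corpus(pages, held_out_sites=2, seed=0):
    rng = random.Random(seed)
    sites = sorted({site for site, _ in pages})
    unseen = set(rng.sample(sites, held_out_sites))   # sites reserved for testing
    test = [p for p in pages if p[0] in unseen]
    rest = [p for p in pages if p[0] not in unseen]
    rng.shuffle(rest)
    half = len(pages) // 2
    return rest[:half], rest[half:] + test            # training, testing

pages = [(f"site{i % 5}", i) for i in range(20)]      # 5 sites, 4 pages each
train, test = split_corpus(pages)
print(len(train), len(test))                          # 10 10
```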
Annotation
• Annotation performed using NCSR's annotation tool.
• Annotation guidelines drawn up per domain.
• Each corpus annotated by two separate annotators, with inter-annotator agreement checked (see the sketch below).
• The final corpus is the result of correcting the cases of disagreement.
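The agreement metric is not specified on the slide; one common choice for entity annotation is the F-measure between the two annotators, sketched here over (start, end, type) triples (an assumed encoding):

```python
# Hypothetical inter-annotator agreement check: symmetric F-measure
# between two annotators' entity sets.
def agreement(ann_a, ann_b):
    a, b = set(ann_a), set(ann_b)
    if not a and not b:
        return 1.0                       # trivially perfect agreement
    return 2 * len(a & b) / (len(a) + len(b))

annotator_1 = {(0, 7, "MANUF"), (8, 17, "MODEL"), (20, 28, "MONEY")}
annotator_2 = {(0, 7, "MANUF"), (8, 17, "MODEL")}
print(round(agreement(annotator_1, annotator_2), 2))   # 0.8
```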
NERC-Based Demarcator
• Operates after NERC and before Fact Extraction.
• Locates the different product descriptions inside a web page.
• The current version is heuristics-based (see the sketch below).
• Characteristic information:
– 1st domain: manufacturer, model, price
– 2nd domain: job_title, organization, education title
• Output: a Product_No attribute on entities.
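One plausible heuristic of this kind, purely for illustration (the component's actual rules are richer): a fresh occurrence of a characteristic entity type starts a new product description, and every entity is stamped with the current Product_No.

```python
# Hypothetical demarcation heuristic; not the actual component's rule set.
def demarcate(entities, boundary_type="MANUF"):
    """entities: (entity_type, text) pairs in document order.
    Returns the same entities with a Product_No value attached."""
    product_no = 0
    out = []
    for etype, text in entities:
        if etype == boundary_type:
            product_no += 1              # a new product description begins
        out.append((etype, text, product_no))
    return out

page = [("MANUF", "Toshiba"), ("MODEL", "Satellite 1410"), ("MONEY", "999 EUR"),
        ("MANUF", "Compaq"), ("MODEL", "Evo N160"), ("MONEY", "1099 EUR")]
for row in demarcate(page):
    print(row)                           # first three get 1, last three get 2
```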
Demarcator Evaluation

1st domain   Greek   Italian   English   French
NE           0.77    0.91      0.63      0.52
NUMEX        0.75    0.84      0.54      0.52
TIMEX        0.59    0.72      0.44      0.41

2nd domain   Greek   Italian   English   French
NE           0.77    0.64      0.47      0.62
Results Overview
• A successful multilingual NERC system which is an integral part of a research platform for extracting information from web pages.
• An architecture that allows for new languages and swift adaptation to new domains.
• Four independent approaches, each of which provides good results.
• A well-motivated corpus collection methodology.
• Publicly distributed corpora for all languages and both domains.
Shared DTDs

Domain 1
NE:    MANUF, MODEL, PROCESSOR, SOFT_OS
TIMEX: TIME, DATE, DURATION
NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY, RESOLUTION, MONEY, PERCENT

Domain 2
NE:    MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
TIMEX: DATE, DURATION
NUMEX: MONEY
TERM:  SCHEDULE, ORG_UNIT
1st Domain Evaluation Results

                    ENERC   FNERC   HNERC   INERC
NE     MANUF         0.52    0.68    0.86    0.93
       MODEL         0.70    0.58    0.71    0.70
       SOFT_OS       0.76    0.90    0.80    0.94
       PROCESSOR     0.91    0.93    0.91    0.96
NUMEX  SPEED         0.78    0.84    0.90    0.88
       CAPACITY      0.90    0.85    0.84    0.96
       LENGTH        0.85    0.61    0.88    0.89
       RESOLUTION    0.96    0.83    0.75    0.89
       MONEY         0.62    0.80    0.80    0.74
       PERCENT       0.67    0.75    0.77    0.86
       WEIGHT        0.96    0.93    1.00    0.88
TIMEX  DATE          0.45    0.84    0.96    0.57
       DURATION      0.73    0.85    0.87    0.41
       TIME          0.47    0.69    1.00    -
Overall (approx.)    0.73    0.77    0.86    0.82
2nd Domain Evaluation Results

                      ENERC   FNERC   HNERC   INERC
NE     MUNICIPALITY    0.70    0.77    0.82    0.92
       REGION          0.65    0.81    0.40    0.94
       COUNTRY         0.87    0.73    0.84    0.86
       ORGANIZATION    0.56    0.58    0.50    0.71
       JOB_TITLE       0.55    0.71    0.50    0.78
       EDU_TITLE       0.36    0.57    0.67    0.82
       LANGUAGE        0.67    0.69    0.95    0.83
       S/W             0.55    0.82    0.70    0.75
NUMEX  MONEY           0.25    0.93    0.00    0.00
TIMEX  DATE            0.79    0.61    0.93    0.77
       DURATION        0.83    0.88    0.91    0.74
TERM   ORG_UNIT        0.37    0.66    0.39    0.51
       SCHEDULE        0.00    0.57    0.00    0.40
Overall                0.59    0.75    0.68    0.77