28/02/02-01/03/02 4th Meeting Athens
ENERC v.2
Updates
• Change in early tokenisation: identification of words now a two-stage process.
• Updated lexical resources based on new version of LexiconEn.xml.
• Current version does not include statistical classifier or POS tagger.
• Non-GUI version of NERC-based Demarcator added at end of pipeline.
ENERC Pipeline
egrep -v '^<\!DOCTYPE'
  \| $EN/SCRIPTS/entsout.pl
  \| $bin/fsgmatch -q ".*" $EN/GRAM/char/pretok.gr
  \| $EN/SCRIPTS/openangle.pl
  \| $bin/xmlperl2 $EN/SCRIPTS/findels-s.rule
  \| $bin/xmlperl2 $EN/SCRIPTS/nobold.rule
  \| $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/tok.gr
  \| $bin/xmlperl $EN/SCRIPTS/dels.rule
  \| $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/numbers.gr
  \| $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/numex-sf.gr
  \| $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/timex.gr
  \| $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/prodex-ll.gr
  \| $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/prodex-sf.gr
  \| $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/attribex.gr
  \| $bin/xmlperl $EN/SCRIPTS/delete-tags.rule
  \| $EN/SCRIPTS/tidyup-dem.pl
  \> $EN/ddiri/current.xhtml
wish8.4 $EN/Demarc/CROSSMARC_Demarcation_Tool.tcl -source $EN/ddiri -destination $EN/ddiro -gui 0 -language english > /dev/null
cat $EN/ddiro/current.xhtml
  \| $bin/xmlperl $EN/SCRIPTS/del-npno.rule
Name Matching
• LexiconEn.lex derived from LexiconEn.xml: each synonym of a concept becomes a lexical entry.
• 1st stage of name matching performs lexical look-up to find matches. (Case insensitive and entities such as ® ignored.)
• 2nd stage of name matching uses a fuzzy matching program. This uses a list of target strings also derived from synonyms in LexiconEn.xml.
• Name matching operates on entities and encodes the ontology ID as the value of an attribute.
• Can be performed after NERC, Demarcation or FE.
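The two stages above can be sketched roughly as follows. This is only an illustration in Python, not the ENERC implementation: the function names, the entity-stripping pattern, the use of `difflib` for the fuzzy stage and the 0.8 cutoff are all assumptions. The lexicon entries are taken from the LexiconEn.lex example in this deck.

```python
import difflib
import re

# Synonym -> (type, ontology ID); entries copied from the LexiconEn.lex slide.
LEXICON = {
    "microsoft office xp": ("SOFT", "OV-d0e594"),
    "windows xp": ("OS", "OV-d0e522"),
    "w98": ("OS", "OV-d0e521"),
    "win98": ("OS", "OV-d0e521"),
    "win 98": ("OS", "OV-d0e521"),
}

# Assumed pattern: character/entity references such as &reg;, plus literal marks.
ENTITY_RE = re.compile(r"&#?\w+;|[®™]")


def normalise(text):
    """Lower-case the string, drop entities such as ® and collapse whitespace."""
    text = ENTITY_RE.sub(" ", text)
    return " ".join(text.lower().split())


def match_name(text, cutoff=0.8):
    """Return (ontology ID, stage) or (None, None) if nothing matches."""
    key = normalise(text)
    # Stage 1: exact, case-insensitive lexical look-up.
    if key in LEXICON:
        return LEXICON[key][1], "lexical"
    # Stage 2: fuzzy match against the same list of target strings.
    close = difflib.get_close_matches(key, list(LEXICON), n=1, cutoff=cutoff)
    if close:
        return LEXICON[close[0]][1], "fuzzy"
    return None, None
```

For instance, `match_name("Windows&reg; XP")` is resolved by the first stage, while a misspelling such as `"Microsoft Ofice XP"` falls through to the fuzzy stage. The returned ID is what would be written onto the entity as an attribute value.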
Normalisation
• We use an xmlperl program to match particular facts containing certain NUMEXes, e.g.
<processorSpeed> <NUMEX type='SPEED'>1.7 GHz</NUMEX> </processorSpeed>
• Perl action in rule performs normalisation using a list of conversion rates.
• Normalised version appears as attribute value on NUMEX which can then be inherited by the fact.
• Normalisation could be performed before FE but fact type is useful in determining the conversion.
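As a rough illustration of the steps above, the following Python sketch normalises the NUMEX inside a fact and copies the result up to the fact. It is not the xmlperl rule itself: the conversion table, the choice of MHz as the target unit and the attribute name `norm` are hypothetical. The table is keyed on the fact type, reflecting the point that the fact type helps determine the conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion rates: (fact type, unit) -> (factor, target unit).
CONVERSIONS = {
    ("processorSpeed", "ghz"): (1000.0, "MHz"),
    ("processorSpeed", "mhz"): (1.0, "MHz"),
}


def normalise_fact(xml_text):
    """Add an assumed 'norm' attribute to each convertible NUMEX and to the fact."""
    fact = ET.fromstring(xml_text)
    for numex in fact.iter("NUMEX"):
        parts = (numex.text or "").split()
        if len(parts) != 2:
            continue  # only handle the simple "<number> <unit>" shape
        value, unit = parts
        rate = CONVERSIONS.get((fact.tag, unit.lower()))
        if rate is None:
            continue  # no known conversion for this fact type and unit
        factor, target = rate
        norm = "%g %s" % (float(value) * factor, target)
        numex.set("norm", norm)  # normalised value on the NUMEX
        fact.set("norm", norm)   # inherited by the enclosing fact
    return fact
```

Applied to the `processorSpeed` example, this would record `1700 MHz` on both the NUMEX and the fact.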
Evaluation Results: just NERC
            Precision  Recall  F-measure
MANUF       0.36       0.95    0.52
MODEL       0.61       0.82    0.70
SOFT_OS     0.74       0.79    0.76
PROCESSOR   0.85       0.98    0.91
SPEED       0.80       0.77    0.78
CAPACITY    0.88       0.93    0.90
LENGTH      0.94       0.77    0.85
RESOLUTION  0.93       1.00    0.96
MONEY       0.46       0.97    0.62
PERCENT     0.63       0.71    0.67
WEIGHT      0.97       0.95    0.96
DATE        0.30       0.88    0.45
DURATION    0.72       0.74    0.73
TIME        0.33       0.80    0.47
Evaluation Results: NERC+Demarcator
            Precision  Recall  F-measure
MANUF       0.33       0.72    0.45
MODEL       0.54       0.59    0.56
SOFT_OS     0.76       0.65    0.70
PROCESSOR   0.81       0.59    0.68
SPEED       0.80       0.53    0.64
CAPACITY    0.83       0.57    0.68
LENGTH      0.92       0.53    0.67
RESOLUTION  0.94       0.90    0.92
MONEY       0.32       0.52    0.40
PERCENT     0.80       0.57    0.67
WEIGHT      0.98       0.79    0.87
DATE        0.29       0.50    0.37
DURATION    0.71       0.55    0.62
TIME        0.?        0.?     0.?
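The F-measure columns appear to be the balanced harmonic mean of precision and recall, F = 2PR / (P + R); the slides do not state the formula, so this is an assumption, but recomputing from the rounded values reproduces, for example, the MANUF and PROCESSOR rows. A minimal Python check:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are 0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the "just NERC" table: (precision, recall).
rows = {"MANUF": (0.36, 0.95), "PROCESSOR": (0.85, 0.98)}
for name, (p, r) in rows.items():
    print(name, round(f_measure(p, r), 2))
```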
Microsoft Office
LexiconEn.lex
Microsoft Office XP :: SOFT OV-d0e594
Windows XP :: OS OV-d0e522
W98 OS OV-d0e521
W 98 :: OS OV-d0e521
Win98 OS OV-d0e521
Win 98 :: OS OV-d0e521
Microsoft SOFT OV-d0e594
Office SOFT OV-d0e594
XP SOFT OV-d0e594
98 OS OV-d0e521
Win OS OV-d0e521
W OS OV-d0e521
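The entries above could be read with a small parser along the following lines. This is a sketch under an assumed reading of the format, which the slides do not specify: the last token is taken to be the ontology ID, the token before it the type, and everything earlier the entry string, with any `::` marker skipped. The function name is hypothetical.

```python
def parse_lex_line(line):
    """Split one lexicon line into (entry, type, ontology ID), or None if malformed."""
    tokens = line.split()
    if len(tokens) < 3:
        return None  # need at least an entry word, a type and an ID
    ont_id, ent_type = tokens[-1], tokens[-2]
    # Assumption: "::" is a separator marker, not part of the entry string.
    words = [t for t in tokens[:-2] if t != "::"]
    if not words:
        return None
    return " ".join(words), ent_type, ont_id
```

Under that reading, `Win 98 :: OS OV-d0e521` and `Win98 OS OV-d0e521` both resolve to the same type and ontology ID, differing only in the entry string.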