28/02/02-01/03/02 4th Meeting Athens

ENERC v.2

Updates

• Change in early tokenisation: identification of words is now a two-stage process.
• Updated lexical resources based on the new version of LexiconEn.xml.
• Current version does not include the statistical classifier or POS tagger.
• Non-GUI version of the NERC-based Demarcator added at the end of the pipeline.

ENERC Pipeline

    egrep -v '^<\!DOCTYPE' \|
    $EN/SCRIPTS/entsout.pl \|
    $bin/fsgmatch -q ".*" $EN/GRAM/char/pretok.gr \|
    $EN/SCRIPTS/openangle.pl \|
    $bin/xmlperl2 $EN/SCRIPTS/findels-s.rule \|
    $bin/xmlperl2 $EN/SCRIPTS/nobold.rule \|
    $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/tok.gr \|
    $bin/xmlperl $EN/SCRIPTS/dels.rule \|
    $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/numbers.gr \|
    $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/numex-sf.gr \|
    $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/timex.gr \|
    $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/prodex-ll.gr \|
    $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/prodex-sf.gr \|
    $bin/fsgmatch -q ".*/.[PROC='yes']" $EN/GRAM/xml/attribex.gr \|
    $bin/xmlperl $EN/SCRIPTS/delete-tags.rule \|
    $EN/SCRIPTS/tidyup-dem.pl \> $EN/ddiri/current.xhtml

    wish8.4 $EN/Demarc/CROSSMARC_Demarcation_Tool.tcl -source $EN/ddiri -destination $EN/ddiro -gui 0 -language english > /dev/null

    cat $EN/ddiro/current.xhtml \|
    $bin/xmlperl $EN/SCRIPTS/del-npno.rule

Name Matching

• LexiconEn.lex is derived from LexiconEn.xml: each synonym of a concept becomes a lexical entry.
• The 1st stage of name matching performs lexical look-up to find matches. (Look-up is case-insensitive and ignores entities such as &reg;.)
• The 2nd stage of name matching uses a fuzzy matching program. This uses a list of target strings also derived from the synonyms in LexiconEn.xml.
• Name matching operates on entities and encodes the ontology ID as the value of an attribute.
• It can be performed after NERC, Demarcation or FE.
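The two stages described above can be sketched roughly as follows. This is only an illustration, not the ENERC implementation: the dictionary holds a handful of entries in the style of LexiconEn.lex, the helper names `normalise` and `match_name` are hypothetical, and Python's `difflib` stands in for the fuzzy matching program, which the slides do not identify.

```python
import difflib
import re

# Hypothetical in-memory form of a few lexical entries:
# synonym -> (category, ontology ID).
LEXICON = {
    "Microsoft Office XP": ("SOFT", "OV-d0e594"),
    "Windows XP": ("OS", "OV-d0e522"),
    "W98": ("OS", "OV-d0e521"),
    "Win98": ("OS", "OV-d0e521"),
    "Win 98": ("OS", "OV-d0e521"),
}

# Matches character entities such as &reg; or &#174;.
ENTITY = re.compile(r"&#?\w+;")


def normalise(text):
    """Drop entities, collapse whitespace and lower-case the string."""
    text = ENTITY.sub("", text)
    return " ".join(text.split()).lower()


# Look-up index keyed on the normalised synonym.
INDEX = {normalise(name): entry for name, entry in LEXICON.items()}


def match_name(entity_text, cutoff=0.8):
    """Return (ontology ID, stage) for an entity string.

    Stage 1 is an exact look-up on the normalised string; stage 2 falls
    back to difflib's similarity ratio (an assumed stand-in for the
    fuzzy matcher). The cutoff value is an arbitrary assumption.
    """
    key = normalise(entity_text)
    if key in INDEX:
        return INDEX[key][1], "exact"
    close = difflib.get_close_matches(key, list(INDEX), n=1, cutoff=cutoff)
    if close:
        return INDEX[close[0]][1], "fuzzy"
    return None, "none"


print(match_name("MICROSOFT&reg; Office XP"))
print(match_name("Windws XP"))
print(match_name("Linux"))
```

In a pipeline setting the returned ID would be written onto the entity element as an attribute value; the sketch returns it instead so that the matching logic stays visible.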

Normalisation

• We use an xmlperl program to match particular facts containing certain NUMEXes, e.g.

    <processorSpeed> <NUMEX type='SPEED'>1.7 GHz</NUMEX> </processorSpeed>

• The Perl action in the rule performs normalisation using a list of conversion rates.
• The normalised version appears as an attribute value on the NUMEX, which can then be inherited by the fact.
• Normalisation could be performed before FE, but the fact type is useful in determining the conversion.
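The normalisation step can be illustrated with a small sketch. It is written in Python rather than xmlperl, and several details are assumptions made for the example: the `norm` attribute name, the conversion table and its MHz target unit, and the function name `normalise_fact` do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion rates: (fact type, unit) -> (factor, target unit).
# Keying on the fact type reflects the point that the fact helps to
# determine which conversion applies.
CONVERSIONS = {
    ("processorSpeed", "GHz"): (1000.0, "MHz"),
    ("processorSpeed", "MHz"): (1.0, "MHz"),
}


def normalise_fact(fact):
    """Add a 'norm' attribute to each convertible NUMEX inside the fact.

    The fact element then inherits the normalised value. 'norm' is an
    assumed attribute name, not one taken from the ENERC markup.
    """
    for numex in fact.iter("NUMEX"):
        parts = (numex.text or "").split()
        if len(parts) != 2:
            continue  # expect "<number> <unit>"; skip anything else
        value, unit = parts
        rate = CONVERSIONS.get((fact.tag, unit))
        if rate is None:
            continue  # no conversion rate known for this fact/unit pair
        try:
            number = float(value)
        except ValueError:
            continue
        factor, target = rate
        numex.set("norm", f"{number * factor:g} {target}")
        fact.set("norm", numex.get("norm"))
    return fact


fact = ET.fromstring(
    "<processorSpeed><NUMEX type='SPEED'>1.7 GHz</NUMEX></processorSpeed>"
)
normalise_fact(fact)
print(ET.tostring(fact, encoding="unicode"))
```

For the example fact, both the NUMEX and the enclosing `processorSpeed` element end up with `norm="1700 MHz"`.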

Evaluation Results: just NERC

              Precision   Recall   F-measure
MANUF         0.36        0.95     0.52
MODEL         0.61        0.82     0.70
SOFT_OS       0.74        0.79     0.76
PROCESSOR     0.85        0.98     0.91

SPEED         0.80        0.77     0.78
CAPACITY      0.88        0.93     0.90
LENGTH        0.94        0.77     0.85
RESOLUTION    0.93        1.00     0.96
MONEY         0.46        0.97     0.62
PERCENT       0.63        0.71     0.67
WEIGHT        0.97        0.95     0.96

DATE          0.30        0.88     0.45
DURATION      0.72        0.74     0.73
TIME          0.33        0.80     0.47
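The slides do not say how the F-measure column was computed. The reported figures are consistent with the balanced F-measure (the harmonic mean of precision and recall), so the sketch below assumes that definition; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumed definition; the slides report the values without a formula.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MANUF row of the NERC-only results: precision 0.36, recall 0.95.
print(round(f_measure(0.36, 0.95), 2))  # 0.52
```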

Evaluation Results: NERC+Demarcator

              Precision   Recall   F-measure
MANUF         0.33        0.72     0.45
MODEL         0.54        0.59     0.56
SOFT_OS       0.76        0.65     0.70
PROCESSOR     0.81        0.59     0.68

SPEED         0.80        0.53     0.64
CAPACITY      0.83        0.57     0.68
LENGTH        0.92        0.53     0.67
RESOLUTION    0.94        0.90     0.92
MONEY         0.32        0.52     0.40
PERCENT       0.80        0.57     0.67
WEIGHT        0.98        0.79     0.87

DATE          0.29        0.50     0.37
DURATION      0.71        0.55     0.62
TIME          0.?         0.?      0.?

Microsoft Office

LexiconEn.lex

Microsoft Office XP :: SOFT OV-d0e594
Windows XP :: OS OV-d0e522
W98 OS OV-d0e521
W 98 :: OS OV-d0e521
Win98 OS OV-d0e521
Win 98 :: OS OV-d0e521
Microsoft SOFT OV-d0e594
Office SOFT OV-d0e594
XP SOFT OV-d0e594
98 OS OV-d0e521
Win OS OV-d0e521
W OS OV-d0e521
