Flexible Interfaces in the Application of Language Technology
to an eScience Corpus
C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron
Computer Laboratory, University of Cambridge
Outline
Two key interfaces:
  SciXML: XML markup for the logical structure of research papers
  SAF: Standoff Annotation Formalism for diverse linguistic information
Both are coded in XML and designed for flexibility,
but what that means is distinct in the two cases.
SciBorg Architecture
[Diagram labels: RSC papers, Nature papers, IUCr papers, Biology and CL (PDF); SciXML; POS tagging; OSCAR; RASP; ERG/PET; WSD; anaphora; tasks; standoff annotation; rhetorical analysis; RMRS merge]
SciBorg Corpus
A corpus of Chemistry research papers from three publishers: the Royal Society of Chemistry (RSC), the Nature Publishing Group (NPG), and the International Union of Crystallography (IUCr).
Provided in the publishers' XML markup, each with a distinct markup scheme.
Conversion to SciXML
[Diagram labels: RSC papers, Nature papers, IUCr papers, Biology and CL (PDF), PLOS Biology papers; SciXML]
SciXML Interface Requirements
Extensible
  So we can add additional publications
Neutral
  So as not to compromise any IP issues
Compatible with existing software
Expressive enough
  For adequate rendering in applications
Rendering Issues
We assume the application will display the paper
  Probably in hypertext
We must retain enough information to do this effectively
Previous versions of SciXML have focused on the logical structure of scientific papers.
The Development of SciXML
Developed for a medical corpus (2000)
  Extracted from HTML web pages
Extended for a Computational Linguistics corpus
  First from LaTeX
  Then from PDF via OCR
Now defined as a RELAX NG schema
Legacy Issues
The original SciXML schema had to interpret formatting.
  Lacking any organisation by function
  Dictating a flat paragraph structure
  Collecting all floats and notes in end lists
  But excluding text formatting
Adapted from Publishers’ Markup
List and table formats
Inline text formatting
Functional paragraph types (e.g. Theorem)
Position markers for floats
Conversion by XSLT
Most constructs can be handled quite simply:
<xsl:template match="sec">
  <DIV DEPTH="{@level}">
    <xsl:apply-templates/>
  </DIV>
</xsl:template>
Making the script virtually a stylesheet
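For readers who do not use XSLT, the same mapping can be sketched in Python with the standard library. This is only an illustration of the rule on the slide (a `sec` element with a `level` attribute becomes a `DIV` with a `DEPTH` attribute), not the project's conversion script. The function name `sec_to_div` and the sample input are assumptions, and unlike an XSLT stylesheet relying on built-in rules, this sketch copies all other elements unchanged.

```python
import xml.etree.ElementTree as ET

def sec_to_div(elem):
    """Recursively copy elem, renaming <sec level="n"> to <DIV DEPTH="n">."""
    if elem.tag == "sec":
        # {@level} in the XSLT yields an empty string when the attribute is absent
        out = ET.Element("DIV", {"DEPTH": elem.get("level", "")})
    else:
        # Assumption: every other element is copied as-is
        out = ET.Element(elem.tag, dict(elem.attrib))
    out.text, out.tail = elem.text, elem.tail
    out.extend([sec_to_div(child) for child in elem])
    return out

# Hypothetical publisher markup, for illustration only
source = ET.fromstring(
    '<sec level="1"><title>Intro</title><sec level="2"><p>Text</p></sec></sec>'
)
print(ET.tostring(sec_to_div(source), encoding="unicode"))
```

Each further construct would need one more branch of this kind, which is the sense in which the conversion stays close to a stylesheet.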
Schema Development
Both the XSLT stylesheet and the RNG schema have been developed on a naïve basis.
  Coding conversion for constructs that occur in the corpus
Eventually we have a big enough bag of tricks to make extension quite painless.
SciXML Constructs
Paper identifiers
  Unique identifiers, titles and authors
Sections
  Divisions embed recursively with headers
Inline text markup
  Font settings and LaTeX inclusion
Paragraph structure
  Paragraph elements and sub-paragraph boundaries in lists, abstracts, captions, etc.
SciXML Constructs
Citations and cross references
  Citations are significant, but we also need textual cross references, compound references, footnote markers and float markers.
Equations and examples
  (Linguistic) examples and equation environments
Lists, tables and figures
  Lists, including definition lists, tables, figures, and various other sections for (external) data.
Bibliography
  The bibliography section is important for citation tracking
RNG Schema (Fragment)
<define name="PAPER.ELEMENT">
  <element name="PAPER">
    <ref name="METADATA.ELEMENT" />
    <optional><ref name="PAGE.ELEMENT" /></optional>
    <ref name="TITLE.ELEMENT" />
    <optional><ref name="AUTHORLIST.ELEMENT" /></optional>
    <optional><ref name="ABSTRACT.ELEMENT" /></optional>
    <element name="BODY">
      <zeroOrMore><ref name="DIV.ELEMENT" /></zeroOrMore>
    </element>
    <optional>
      <element name="ACKNOWLEDGMENTS">
        <zeroOrMore>
          <choice>
            <ref name="REF.ELEMENT" />
            <ref name="INLINE.ELEMENT" />
          </choice>
        </zeroOrMore>
      </element>
    </optional>
    <optional><ref name="REFERENCELIST.ELEMENT" /></optional>
    <optional><ref name="AUTHORNOTELIST.ELEMENT" /></optional>
    <optional><ref name="FOOTNOTELIST.ELEMENT" /></optional>
    <optional><ref name="FIGURELIST.ELEMENT" /></optional>
    <optional><ref name="TABLELIST.ELEMENT" /></optional>
  </element>
</define>
<define name="REFERENCELIST.ELEMENT">
  <element name="REFERENCELIST">
    <zeroOrMore><ref name="REFERENCE.ELEMENT" /></zeroOrMore>
  </element>
</define>
Language Technology in SciBorg
The goal is Information Extraction from Chemistry research papers.
Various analysis components interfacing
  Different levels of analysis
  Different analysis methods
  Specialised and general analysers
But a common semantic representation: RMRS (Robust Minimal Recursion Semantics)
And a common interface structure: SAF
Multiple Analysis Components
PET/ERG: "deep" analysis using detailed (HPSG) grammars and lexicons
RASP: robust shallow parsing with a statistically trained grammar
Each strand has a tokeniser, tagger and parser
OSCAR-3 analyses Chemistry terms and notation
Getting the Text out of SciXML
Only some spans of marked-up text contain linguistic text.
Using SciXML we can divide elements into: Text (<P>), Markup (<IT>), Non-Text elements (<SUP>).
The analysers process, ignore and skip these, respectively.
We also use OSCAR-3 to detect data sections without significant text portions.
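The three-way split can be sketched as follows. This is a minimal illustration under stated assumptions, not the SciBorg implementation: only `<P>`, `<IT>` and `<SUP>` are named on the slide, so the sets below are hypothetical, and the reading of "ignore" (drop the tag, keep its content) versus "skip" (drop tag and content) is an interpretation.

```python
import xml.etree.ElementTree as ET

# Hypothetical classification; a real one would list many more element names.
TEXT = {"P"}        # processed: contains linguistic text
MARKUP = {"IT"}     # ignored: the tag is transparent, its content is kept
NON_TEXT = {"SUP"}  # skipped: tag and content are both dropped

def linguistic_text(elem):
    """Return the text under elem, dropping the content of non-text elements."""
    if elem.tag in NON_TEXT:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(linguistic_text(child))
        parts.append(child.tail or "")  # text after the child belongs to elem
    return "".join(parts)

def paragraphs(root):
    """Collect the linguistic text of every text element below root."""
    return [linguistic_text(p) for p in root.iter() if p.tag in TEXT]

# Hypothetical SciXML fragment
doc = ET.fromstring(
    "<DIV><P>The <IT>cis</IT> isomer<SUP>12</SUP> was isolated.</P></DIV>"
)
print(paragraphs(doc))
```

In this sketch any element not listed as non-text is treated as transparent, which is one possible default among several.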
SciBorg Parsing Architecture
[Diagram labels: SciXML; sentence splitter; OSCAR; POS tagging; tokeniser for RASP; tokeniser for ERG; RASP parser; PET parser; SAF lattice]
SAF Interface Requirements
Support results from different analysis components.
Allow the combination of complementary results
  But they will assign conflicting structures
  Ambiguity is common
  Analyses will form a graph or lattice (cf. chart parsing and word lattices)
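A lattice of this kind can be illustrated with a small sketch in which each analysis is an edge between two vertices, so that competing analyses are alternative paths. The vertex names, the edge list and the function `all_paths` are assumptions made for the example; the disagreement shown (one token versus three for a formula) is hypothetical.

```python
from collections import defaultdict

def all_paths(edges, start, end):
    """Enumerate every sequence of edge labels from start to end (acyclic lattice)."""
    outgoing = defaultdict(list)
    for source, target, label in edges:
        outgoing[source].append((target, label))

    def walk(node):
        if node == end:
            yield []
            return
        for target, label in outgoing[node]:
            for rest in walk(target):
                yield [label] + rest

    return list(walk(start))

# Hypothetical lattice: two tokenisers disagree about a chemical formula.
edges = [
    ("v0", "v1", "calculated"),
    ("v1", "v2", "for"),
    ("v2", "v5", "C11H18O3"),  # one analyser: a single token
    ("v2", "v3", "C11"),       # another analyser: three tokens
    ("v3", "v4", "H18"),
    ("v4", "v5", "O3"),
]

for path in all_paths(edges, "v0", "v5"):
    print(path)
```

Both readings coexist in the one structure, and a later component can choose between them or use both.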
Motivating Standoff
XML can only combine linguistic and formatting markup if they share the same tree structure
calculated for C₁₁H₁₈O₃
<IT>calculated for</IT> C<SB>11</SB>H<SB>18</SB>O<SB>3</SB>
<v>calculated</v> <pp>for <ne>C11H18O3</ne></pp>
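The conflict in this example can be checked mechanically: two markup spans fit in one XML tree only if they are disjoint or one contains the other. The sketch below tests that condition on the plain text of the example. The function name `crosses` and the half-open span convention are assumptions for illustration.

```python
def crosses(a, b):
    """True if half-open spans a and b overlap without one containing the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = a_start < b_end and b_start < a_end
    nested = (a_start <= b_start and b_end <= a_end) or (
        b_start <= a_start and a_end <= b_end
    )
    return overlap and not nested

text = "calculated for C11H18O3"
it_span = (0, len("calculated for"))      # formatting: <IT>calculated for</IT>
pp_span = (text.index("for"), len(text))  # linguistic: <pp>for C11H18O3</pp>
v_span = (0, len("calculated"))           # linguistic: <v>calculated</v>

print(crosses(it_span, pp_span))  # the italic span and the PP cross
print(crosses(it_span, v_span))   # the verb nests inside the italic span
```

Crossing spans are what force the annotations out of the document tree and into standoff form.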
Standoff Annotation
A common solution is to separate the flow of text from the annotations representing its analysis
The connection is formed by indexing at some consistent common level
SAF supports character offset indexing and XPoint indexing
Character Offset Indexing
Formatted text: Come here!
Raw text: "<p>Come <i>here</i>!</p>"
Unicode character points:
.<.p.>.C.o.m.e. .<.i.>.h.e.r.e.<./.i.>.!.<./.p.>.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Tokens
<token from='3' to='7' value='Come'/>
<token from='11' to='15' value='here'/>
<token from='19' to='20' value='!'/>
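Offsets of this kind can be computed directly on the raw string. The sketch below counts character points from 0 before the first character and takes `to` as the point after a token's last character. The tokenisation rule (word characters or single punctuation marks) and the function name are assumptions, and the tag pattern assumes well-formed markup with no `>` inside attribute values.

```python
import re

def tokens_with_offsets(raw):
    """Return (from, to, value) for tokens found outside markup tags."""
    result = []
    pos = 0
    # The capture group keeps the tags in the split output, so pos stays in step.
    for piece in re.split(r"(<[^>]*>)", raw):
        if not piece.startswith("<"):
            for match in re.finditer(r"\w+|[^\w\s]", piece):
                result.append((pos + match.start(), pos + match.end(), match.group()))
        pos += len(piece)
    return result

for start, end, value in tokens_with_offsets("<p>Come <i>here</i>!</p>"):
    print(f"<token from='{start}' to='{end}' value='{value}'/>")
```

Because the offsets refer to the raw file, markup characters are counted even though they never appear in a token value.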
XPoint Indexing
Root (/)
  'P' (/1)
    text (/1/1): .C.o.m.e. .
    'I' (/1/2)
      text (/1/2/1): .h.e.r.e.
    text (/1/3): .!.
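The child-index paths in this tree can be reproduced with a short sketch using `xml.dom.minidom`. It is an illustration of the numbering only, with children counted from 1 and text nodes counted alongside elements. The function name is an assumption, and comments or processing instructions are not given special treatment.

```python
from xml.dom import minidom

def text_node_paths(raw):
    """Map a child-index path such as '/1/2/1' to the text node found there."""
    paths = {}

    def visit(node, path):
        for index, child in enumerate(node.childNodes, start=1):
            child_path = f"{path}/{index}"
            if child.nodeType == child.TEXT_NODE:
                paths[child_path] = child.data
            else:
                visit(child, child_path)

    visit(minidom.parseString(raw), "")
    return paths

for path, data in text_node_paths("<p>Come <i>here</i>!</p>").items():
    print(path, repr(data))
```

A point inside a text node is then the node's path plus an offset into its characters.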
Index Conversion
We currently use both character offset and XPoint indexing.
The choice is influenced by the XML parser.
This implies maintaining a conversion table for a (SciXML) file.
/1/3/0 <-> 19
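A conversion table of this kind could be built in a single pass over the raw file. The sketch below is an assumption-laden illustration rather than the project's code: it uses a simple regular expression instead of a real XML parser, so it assumes well-formed markup with no comments, CDATA sections, entities or `>` inside attribute values, and it writes a point as the text node's path followed by `/offset`.

```python
import re

def conversion_table(raw):
    """Map XPoint-style 'path/offset' strings to character offsets in raw."""
    table = {}
    path = []     # child indices of the currently open elements
    counts = [0]  # number of children seen so far at each open level
    pos = 0
    for piece in re.split(r"(<[^>]*>)", raw):
        if piece.startswith("</"):      # closing tag
            path.pop()
            counts.pop()
        elif piece.startswith("<"):     # opening or self-closing tag
            counts[-1] += 1
            if not piece.endswith("/>"):
                path.append(counts[-1])
                counts.append(0)
        elif piece:                     # text node
            counts[-1] += 1
            node = "/" + "/".join(map(str, path + [counts[-1]]))
            for offset in range(len(piece) + 1):
                table[f"{node}/{offset}"] = pos + offset
        pos += len(piece)
    return table

table = conversion_table("<p>Come <i>here</i>!</p>")
print(table["/1/3/0"])
```

The table is not invertible everywhere, since character offsets that fall inside a tag have no corresponding point in a text node.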
Standards for Standoff Annotation
MAF: ISO standard for morphological annotation
SMAF: an emergent standard extending this to the sentence level, e.g. for parser input
SAF: includes all annotations for a paper in one file
Types of SAF Annotation
Sentence segments
<annot type='sentence' id='s133' from='42065' source='v4987' target='v5154' to='43039' value='…calculated for C11H18O3….'/>
Tokens
<annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>
Types of SAF Annotation
Part of Speech (POS) tags
<annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>
<annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>
<annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>
OSCAR (NER) markup
<annot from="/1/5/6/27/51/2/83.1" to="/1/5/6/27/51/2/88/1.1" type="oscar" id="o554"><slot name="type">compound</slot><slot name="surface">C11H18O3</slot><slot name="provenance">formulaRegex</slot></annot>
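The `deps` attribute is what ties one annotation layer to another. As a rough sketch, the token and POS annotations shown on these slides can be loaded and joined as follows. The wrapping `<saf>` root element, the function name, and the assumption that each `deps` value holds exactly one id are all made for the example.

```python
import xml.etree.ElementTree as ET

# Token and POS annotations copied from the slides, inside a hypothetical root.
SAF_FRAGMENT = """<saf>
<annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>
<annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>
<annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>
<annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>
</saf>"""

def tagged_tokens(saf_xml):
    """Pair each POS annotation with the token it depends on."""
    annots = {a.get("id"): a for a in ET.fromstring(saf_xml).iter("annot")}
    pairs = []
    for annot in annots.values():
        if annot.get("type") == "pos":
            token = annots[annot.get("deps")]  # assumes a single id in deps
            pairs.append((token.get("value"), annot.get("value")))
    return pairs

print(tagged_tokens(SAF_FRAGMENT))
```

The `source` and `target` attributes could be read in the same way to recover the lattice edges between vertices.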
Types of SAF Annotation
RMRS analyses:
<rmrs cfrom='42329' cto='43303'>
<label vid='420'/>
…
<ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred><label vid='409'/><var sort='x' vid='410'/></ep>
<ep cfrom='43258' cto='43288'><gpred>named_rel</gpred><label vid='411'/><var sort='x' vid='410'/></ep>
…
<rarg><rargname>RSTR</rargname><label vid='409'/><var sort='h' vid='412'/></rarg>
<rarg><rargname>BODY</rargname><label vid='409'/><var sort='h' vid='413'/></rarg>
<rarg><rargname>CARG</rargname><label vid='411'/><constant>c11h18o3</constant></rarg>
…
<hcons hreln='qeq'><hi><var sort='h' vid='412'/></hi><lo><label vid='411'/></lo></hcons>
</rmrs>
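To show how such a structure might be consumed, the sketch below reads the elementary predications and attaches the arguments that share their label. It uses only the lines visible on the slide, with the elided parts simply left out, so it is not a complete RMRS. The function name and output shape are assumptions, and only `<gpred>` predicates are handled.

```python
import xml.etree.ElementTree as ET

# The visible part of the slide's RMRS; elided material is omitted.
RMRS_FRAGMENT = """<rmrs cfrom='42329' cto='43303'>
<label vid='420'/>
<ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred><label vid='409'/><var sort='x' vid='410'/></ep>
<ep cfrom='43258' cto='43288'><gpred>named_rel</gpred><label vid='411'/><var sort='x' vid='410'/></ep>
<rarg><rargname>RSTR</rargname><label vid='409'/><var sort='h' vid='412'/></rarg>
<rarg><rargname>BODY</rargname><label vid='409'/><var sort='h' vid='413'/></rarg>
<rarg><rargname>CARG</rargname><label vid='411'/><constant>c11h18o3</constant></rarg>
<hcons hreln='qeq'><hi><var sort='h' vid='412'/></hi><lo><label vid='411'/></lo></hcons>
</rmrs>"""

def predications(rmrs_xml):
    """Return (predicate, cfrom, cto, arguments) for each <ep>, joined by label."""
    root = ET.fromstring(rmrs_xml)
    args = {}
    for rarg in root.findall("rarg"):
        label = rarg.find("label").get("vid")
        const = rarg.find("constant")
        var = rarg.find("var")
        value = const.text if const is not None else var.get("sort") + var.get("vid")
        args.setdefault(label, {})[rarg.find("rargname").text] = value
    result = []
    for ep in root.findall("ep"):
        label = ep.find("label").get("vid")
        result.append(
            (ep.find("gpred").text, int(ep.get("cfrom")), int(ep.get("cto")), args.get(label, {}))
        )
    return result

for item in predications(RMRS_FRAGMENT):
    print(item)
```

The `cfrom` and `cto` values are character offsets, which is how an RMRS is anchored to the same text as the other annotations.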
SAF Flexibility
The standoff supports a variety of annotation types
Which communicate between different levels of analysis
And between different analysis paths
Hence it is also the main route for communication in the architecture
SciXML Flexibility
A common representation for the logical structure and essential formatting of research papers
Conversion from various publishers’ markup schemes
And, also, from HTML, LaTeX and PDF
Applied to several disciplines