Extracting Information Science Concepts Based on JAPE Regular Expression
Ahlam Sawsaa & Joan Lu
International Journal of Advanced Computer Science, Vol. 3, No. 4, Pp. 191-197, Apr., 2013.
Manuscript received: 3 Jun. 2012; Revised: 25 Jun. 2012; Accepted: 30 Jan. 2013; Published: 15 Mar. 2013
Keywords: ontology, regular expression, information extraction, Natural Language Processing
Abstract: Recently, unstructured data on the World Wide Web has generated significant interest in extracting information from text, emails, web pages, reports and research papers in their raw form. More interestingly, extracting information from a specific domain using distributed corpora from the World Wide Web is a vital step towards creating corpus annotation. This paper describes a method of annotation, based on concepts from Information Science (IS), to build a domain ontology using Natural Language Processing (NLP) technology. We used Java Annotation Patterns Engine (JAPE) grammars to support regular expression matching and thus annotate IS concepts using the GATE Developer tool. This speeds up the time-consuming development of the ontology, which is important for domain experts facing time constraints and high workloads. The rules provide significant results: the pattern matching of IS concepts based on the lookup list produced 403 correct concepts, and accuracy was generally high, with 0 partially correct, missing and false positive results. Using NLP techniques is a good approach to reducing the domain experts' work, and the experts can then evaluate the results.
1. Introduction
Recently, Information Extraction (IE) has received significant interest due to the number of web pages emerging on the internet containing unstructured data. Given the amount of information available on the internet, it is necessary to have a tool for extracting it. Many specialists in the field of IE have worked to find suitable tools, such as Wrappers, that classify interesting data and map them onto appropriate formats such as XML or relational databases. Furthermore, some HTML-aware tools rely on the structural features of documents to extract the data. On the other hand, Natural Language Processing (NLP) is a technique used by many tools to extract data from natural language documents. Tools such as GATE use techniques such as part-of-speech tagging, filtering, or lexical semantic tagging to link relevant information and identify relationships among phrases and sentence elements within text. In fact, each of these tools has advantages and disadvantages. A comparative analysis of the existing tools for data extraction is needed to assess their capabilities. This is done in the next section.
Ahlam Sawsaa, Joan Lu. Informatics, University of Huddersfield.
In this paper, first we provide a brief background of IE tools to justify why we feel the NLP technique should be used to speed up the building of an Ontology of Information Science (OIS). To extract concepts in the field, we used CREOLE plug-ins from GATE in the IE system. We also show how the JAPE grammar has been implemented by detailing the rules we use to annotate IS concepts.
The paper is structured as follows: In section 2, we discuss the background of IE. In section 3, we discuss the methods used to extract Information Science (IS) concepts and how they were constructed. In section 4, we present how the domain knowledge is acquired for creating the corpus and Gazetteer, and how the JAPE rule is implemented. Our discussion and evaluation are in section 5. Finally, we draw conclusions and make suggestions for future work.
2. Background
It is a shared belief that ontology receives a lot of recognition from various research fields. Although there are some well-known domain ontologies, such as CYC, the Standardized Nomenclature for Medicine (SNOMED, a clinical terminology), Toronto Virtual Enterprise (TOVE), and the GENE ontology (GO), study of the ontology area is still immature and improvements are needed [6].
IS is a multidisciplinary field, including branches such as Library Science, Archival Science and Computer Science, and therefore lacks a unified model of domain knowledge. The inconsistencies in the structure of the domain make it difficult to use and share data at the syntactic and semantic levels. It is thus necessary to develop an OIS to represent the domain knowledge [9].
The growing amount of unstructured data appearing on the internet makes it extremely difficult to extract knowledge from it. The IS domain includes a huge number of documents made up of web-based knowledge that is inaccessible. Building an OIS thus requires us to set up an appropriate knowledge description module for the intended ontology [11].
A number of studies have shown that applications of IE can be used to annotate documents that are written in natural language. Certainly, the growing number of IE
Sawsaa et al .: Extracting Information Science Concepts Based on JAPE Regular Expression.
International Journal Publishers Group (IJPG) ©
and concepts from a specific text effectively and efficiently. For this work, we annotate text belonging to members of Ontocop. Ontocop is a virtual community of practice within the IS domain, designed to support group interaction and communication across diverse locations. It was intended as a tool for creating an OIS by extracting the main concepts from the members' outputs, documents and discussions. Professional bodies can be good resources for the process of building a conceptual model of Information Science (OIS) [8].
3. Methods Employed
Our method is based on creating a corpus of documents and a Gazetteer of Information Science, with JAPE rules used to extract IS concepts. GATE provides facilities for loading corpora for annotation from a URL or uploading from a file. The process generally proceeded as follows:
• We compiled IS knowledge from different resources, such as the Ontocop website forum and various publications on the web by members of Ontocop.
• We analyzed the data to ensure it covered all branches of the field.
• We transferred the information resources into an XML file to form the corpus.
• We uploaded the corpus into the GATE software so as to start running ANNIE.
• We annotated the concepts based on JAPE grammar, which is run within ANNIE.
• We tested and evaluated the results.
The workflow is illustrated in Fig. 1.
4. Implementation
A. Knowledge acquisition
Before creating the ontology, we had to collect the IS knowledge for the domain model. Our approach consisted of annotating IS concepts based on the JAPE grammar, using the GATE software. The annotation process began as follows:
• We collected discussion threads from the Ontocop website in a MySQL database, using the URL of the Ontocop website.
List (1) Transfer of discussion data obtained from the Ontocop forum to XML files.
Figure 2 shows the methods we used to annotate concepts from the embedded knowledge that had emerged from experts in the field on the Ontocop forum. The discussion topics were collected in the MySQL database before being transferred to XML files.
For example, the concept of Information Science was annotated and defined in the OWL ontology language to start the lifecycle of the ontology process.
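The transfer of discussion topics from the database into XML files can be sketched as follows. This is a minimal illustration in Python, not the authors' listing: the record keys (`id`, `topic`, `body`), the element names (`corpus`, `document`) and the sample rows are all assumptions standing in for the result of a database query.

```python
import xml.etree.ElementTree as ET

def threads_to_xml(threads):
    """Build an XML corpus element from discussion-thread records.

    Each record is a dict with the hypothetical keys 'id', 'topic' and 'body'.
    """
    corpus = ET.Element("corpus")
    for thread in threads:
        doc = ET.SubElement(corpus, "document", id=str(thread["id"]))
        ET.SubElement(doc, "topic").text = thread["topic"]
        ET.SubElement(doc, "body").text = thread["body"]
    return corpus

# Illustrative rows, standing in for rows fetched from the forum database.
sample_threads = [
    {"id": 1, "topic": "Information science", "body": "What is information science?"},
    {"id": 2, "topic": "Knowledge organization", "body": "Cataloguing & classification."},
]

corpus = threads_to_xml(sample_threads)
# Serialising escapes reserved characters such as '&', so the file stays well-formed.
xml_text = ET.tostring(corpus, encoding="unicode")
print(xml_text)
```

Building the tree with an XML library rather than by string concatenation keeps the corpus well-formed even when forum posts contain characters such as `&` or `<`.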
Next, we collected IS publications by Ontocop
members to speed up the process, and then
Fig. 1 Annotation workflow (stages shown: documents of Information Science; analysis process; transfer to XML; corpus; upload to GATE framework; running ANNIE; annotation of concepts and evaluation)
Rule: concept2
Priority: 20
(
  ({Token.string == "information"})
  {Token.string == "service"}
  ({Lookup.majorType == "concept"})
):information
-->
:information.concept = {rule = "concept2"}
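To make the matching behaviour of such a rule concrete, the following is a minimal Python sketch of matching a fixed sequence of tokens and recording the span as an annotation. It is not GATE code: the helper names `tokenize` and `match_sequence`, the sample sentence and the dictionary layout of the annotation are assumptions, and the Lookup constraint of the rule is left out for brevity.

```python
import re

def tokenize(text):
    """Split text into (string, start, end) word tokens with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def match_sequence(tokens, pattern):
    """Return (start, end) spans where consecutive tokens equal the pattern words."""
    spans = []
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(tok[0] == word for tok, word in zip(window, pattern)):
            # The span runs from the first token's start to the last token's end.
            spans.append((window[0][1], window[-1][2]))
    return spans

text = "The library offers an information service and an information desk."
tokens = tokenize(text)
spans = match_sequence(tokens, ["information", "service"])
annotations = [{"type": "concept", "start": s, "end": e, "rule": "concept2"}
               for s, e in spans]
print(annotations)
```

Only the first occurrence of "information" is annotated here, because the second one is not followed by "service".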
More precisely, we apply a regular expression to match strings of text:
Phase: Concept
Input: Lookup Token
Options: control = appelt

Rule: Glossary
(
  ({Token.string ==~ "catalog(ue)?"})
):concept
-->
:concept.concept = {rule = "Glossary"}
In these rules we specify a string of text, {Token.string == "..."}, that must be matched, specify the attributes of the annotation by using operators such as “==” (exact match) or “==~” (whole-string regular expression match), and then annotate the entities with the correct labels. Furthermore, choosing a suitable control style, such as all, appelt or brill, gives the right results. The next example shows how regular expression metacharacters (dot, *, [ ], |) can be used to annotate concepts related to abstract:
{Token.string ==~ "abstract(ing|or)?"}
This captures the words abstract, abstracting, or abstractor. If we want to annotate the acquisition concept followed by another word, we match a sequence of two tokens, because each Token covers a single word:
{Token.string ==~ "[Aa]cquisition"} {Token.kind == word}
This could annotate:
Acquisition policy
Acquisition service
The pattern {Token.string ==~ "[Aa]rchival"} {Token.kind == word} will annotate archival library, archival journal, archival processing, archival software, and archival studies, for example.
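The behaviour of these patterns can be checked outside GATE. The sketch below uses Python's `re` module as a stand-in for the Java regular expressions that JAPE relies on; the helper names `matches_whole` and `head_plus_word` and the sample sentences are assumptions made for illustration. It shows an optional suffix group matched against whole tokens, and a head word followed by one further word.

```python
import re

# Whole-token match: an optional suffix group covers several word forms.
abstract_re = re.compile(r"abstract(ing|or)?")

def matches_whole(pattern, token):
    """True when the pattern matches the entire token, not just a part of it."""
    return pattern.fullmatch(token) is not None

def head_plus_word(text, head):
    """Find the head word (first letter in either case) followed by one more word."""
    first = "[" + head[0].upper() + head[0].lower() + "]"
    pattern = re.compile(r"\b" + first + re.escape(head[1:]) + r" [A-Za-z]+\b")
    return [m.group() for m in pattern.finditer(text)]

candidates = ["abstract", "abstracting", "abstractor", "abstraction"]
print([w for w in candidates if matches_whole(abstract_re, w)])
print(head_plus_word("Archival studies and archival software differ.", "archival"))
```

Without the trailing `?`, the group would be mandatory and only the suffixed forms would match, so the bare word abstract would be missed.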
We could also choose one term from a choice of two terms. For example, we could choose either Data or mining from the phrase Data mining, Data or processing from Data processing, Data or storage from Data storage, or Data or representation from Data representation by applying the pipe symbol (|) operator:
{Token.string ==~ "Data|mining"}
{Token.string ==~ "Data|processing"}
{Token.string ==~ "Data|storage"}
{Token.string ==~ "Data|representation"}
{Token.string ==~ "Book|art"}
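As a small check of how alternation behaves on single tokens, the following Python sketch (again using `re` as a stand-in for the Java regular expressions used by JAPE; the sample phrase is an assumption) selects every token that is either alternative. It also shows that whitespace written inside a pattern is matched literally.

```python
import re

# Alternation: a token matches if it equals either alternative in full.
choice_re = re.compile(r"Data|mining")
# Spaces inside a pattern are literal characters, so this expects "Data " or " mining".
spaced_re = re.compile(r"Data | mining")

tokens = "Data mining extends Data storage".split()
selected = [t for t in tokens if choice_re.fullmatch(t)]
print(selected)
print(spaced_re.fullmatch("Data"))
```

Because a token never contains a space, the spaced variant cannot match any single token in full, which is why the alternatives are written without padding.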
5. Discussion and evaluation
Our extraction of IS concepts using JAPE grammar and regular expressions in GATE Developer for automated IE provides significant results. The main idea behind using JAPE and regular expressions is to identify IS terminology as tokens, for example Computing, Libraries and Information technology, within a large text. The term identification relies on looking up a list of IS terms from the Gazetteer. For example, we could look up book art, book card, book guidance or book catalogue, or computer application, computer science, computer experts, computer file, or computer image. These concepts can be collected to form the main component of the IS glossary and structured in a semi-formal hierarchy before creating the computational model of the OIS.
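The gazetteer lookup described above can be approximated in a few lines of Python. This is a simplified sketch, not the GATE Gazetteer: the function name `lookup`, the short term list (including the bare entry "computer", added only to show the longest-match preference) and the sample sentence are assumptions.

```python
import re

def lookup(text, gazetteer):
    """Return (start, end, term) for gazetteer terms found in text, preferring longer terms."""
    # Sort longest first so 'computer science' wins over a shorter entry such as 'computer'.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    return [(m.start(), m.end(), m.group().lower()) for m in pattern.finditer(text)]

gazetteer = ["book art", "book card", "book catalogue",
             "computer science", "computer file", "computer"]
text = "Computer science students consulted the book catalogue on a computer."
results = lookup(text, gazetteer)
for start, end, term in results:
    print(start, end, term)
```

Each hit carries its character offsets, which is the information a later rule needs in order to attach a concept annotation to the matched span.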
We extracted the IS concepts from a corpus of 300 documents, obtained specifically for this purpose. We ran the ANNIE application, using Document Reset, Tokeniser, Sentence Splitter, Gazetteer, POS Tagger, JAPE Transducer and OrthoMatcher. The annotation sets appear in the display pane, and the concepts are highlighted in the default annotation set; each annotation type has a different colour, as can be seen in Fig. 4.
Fig. 4 Annotation concepts in GATE
Figure 4 presents the results of annotating the IS concepts after running ANNIE and highlighting the matching concepts. The results show that our approach successfully annotates concepts. We retrieved 541 instances of the Knowledge concept, 275 of the Information concept and 35 of the Organization concept (see Fig. 5). Each annotation starts at a specific offset and ends at a different offset, depending on how many tokens it spans. For example, the Knowledge concept starts at offset 557 and ends at 566, while the Organization concept starts at 624 and ends at 636, with the feature {majorType=concept}.
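The reported offsets are consistent with character positions: 566 − 557 = 9, the length of "knowledge", and 636 − 624 = 12, the length of "organization". The Python sketch below illustrates the idea of start and end offsets with attached features; the function name `annotate`, the sample sentence and the dictionary layout are assumptions, not the GATE annotation model itself.

```python
def annotate(text, term, features):
    """Return annotations (start, end, features) for each occurrence of term in text."""
    annotations = []
    position = text.find(term)
    while position != -1:
        annotations.append({
            "start": position,
            "end": position + len(term),   # end offset is exclusive
            "features": dict(features),
        })
        position = text.find(term, position + len(term))
    return annotations

text = "Libraries support knowledge sharing within an organization."
for term in ("knowledge", "organization"):
    for ann in annotate(text, term, {"majorType": "concept"}):
        print(term, ann["start"], ann["end"], ann["end"] - ann["start"])
```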
Fig. 5 Result of the annotation of the IS domain
The data were evaluated using the Annotation Diff tool, which is based on evaluation metrics for precision, recall and the F-measure. Annotation Diff is used to compare two different annotation sets based on the same document. We compared the key type feature and the response feature.
The tests showed that the accuracy rates are equal to the manual output of IS experts. The statistics of the corpus show that the pattern matching of IS concepts based on the IS lookup list produced 403 correct concepts, and there were no (0) partially correct, missing or false positive results.
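For reference, precision, recall and the F-measure can be computed from these counts as in the Python sketch below. It assumes the strict variant, in which partially correct annotations count as errors; the function name and the formula layout are illustrative rather than taken from the Annotation Diff implementation.

```python
def precision_recall_f1(correct, partial, missing, spurious):
    """Strict precision, recall and F1, counting partial matches as errors."""
    response_total = correct + partial + spurious   # annotations the system produced
    key_total = correct + partial + missing         # annotations in the reference set
    precision = correct / response_total if response_total else 0.0
    recall = correct / key_total if key_total else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Counts reported for the corpus: 403 correct, no partial, missing or false positive results.
print(precision_recall_f1(correct=403, partial=0, missing=0, spurious=0))
```

With no partial, missing or spurious annotations, all three measures equal 1.0, which matches the statement that the output agrees with the manual annotation.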
We used GATE because it is open source software and because it contains multi-language NLP models that can be reused to develop other resources.
6. Conclusion
A. Achievement
This paper has described a method using NLP techniques to extract concepts for the purpose of speeding up the development of an OIS. Furthermore, the development of the IE system should save domain experts time and effort in labelling the most common concepts. In total we extracted 664 concepts that are classes of the OIS, and 650 subclasses, making up the main components of the ontology skeleton. The IE technique can be applied to many different formats, such as XML, HTML documents, URLs or emails.
B. Future work
Ontology is at the heart of the semantic web. It defines concepts and relations that make global interoperability possible. In future work, we plan to enhance these concepts so as to develop an OIS to create the taxonomy of IS as a domain. The next step will be to code the process using Protégé as the ontology editor. Additionally, a generic model of the OIS will be evaluated.
References
[1] ALBERTO, H. F., BERTHIER, A. L.-. & RIBEIRO-NETO (2002) A brief survey of web data extraction tools. SIGMOD Record. http://annotation.semanticweb.org/tools/.
[2] CHANG, C.-H., KAYED, M., GIRGIS, M. R. & SHAALA, K. (2000) A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, 13.
[3] CRESCENZI, V. & MECCA, G. (2004) Automatic Information Extraction from Large Websites. Journal of the ACM, 51, pp. 731–779.
[4] GATE (2010) Developing Language Processing Components with GATE Version 6 (a User Guide). http://gate.ac.uk/sale/tao/splitch13.html#x18-32300013.2.
[5] HANDSCHUH, S. & STAAB, S. (2002) Authoring and Annotation of Web Pages in CREAM. Honolulu, Hawaii, USA.
[6] LABORATORY, E. I. (2011) TOVE Ontology Project. University of Toronto. Toronto, http://www.eil.utoronto.ca/enterprise-modelling/tove/.
[7] MOENS, M.-F. (2006) Information Extraction: algorithms and prospects in a retrieval context, Springer.
[8] SAWSAA, A. & LU, J. (2010a) Ontocop: A virtual community of practice to create ontology of Information Science. ICOMP'10. Las Vegas.
[9] SAWSAA, A. & LU, J. (2010b) Ontology of Information Science Based On OWL for the Semantic Web. In: International Arab Conference on Information Technology (ACIT'2010). University of Garyounis, Benghazi, Libya.
[10] SRIHARI, R. & LI, W. (2002) Information Extraction Supported Question Answering. In Proceedings of the Eighth Text REtrieval Conference (TREC-8).
[11] SUI, Z., ZHAO, J., KANG, W. & ZHAO, Q. (2008) The Building of a CBD-Based Domain Ontology in Chinese. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
[12] TURMO, J., AGENO, A. & CATALÀ, N. (2006) Adaptive Information Extraction. ACM Computing Surveys, 38.
Fig. 6 Accuracy results
Ahlam Sawsaa was born in Libya. She received her B.S. and M.S. degrees in Library and Information Science from Garyounis University. She serves as a lecturer in the Department of Library and Information Science at Garyounis University and supervises many graduate degree projects. She is the author of a book and more than seven international publications, and a reviewer for international conferences. She is currently a PhD researcher in ontologies and the semantic web at the University of Huddersfield.
Joan Lu: Professor Lu is in the Department of Informatics. She was a Team Leader of the IT Department in an industrial company before she joined the university. Her research interests include XML technology, object-oriented system development, agent technology, data management systems, information access/retrieval/visualization/representation, security issues and Internet computing. She serves as Editor in Chief for the International Journal of Information Retrieval Research.