Extracting Information Science Concepts Based on JAPE Regular Expression



International Journal of Advanced Computer Science, Vol. 3, No. 4, Pp. 191-197, Apr., 2013.

Manuscript received: 3 Jun. 2012; revised: 25 Jun. 2012; accepted: 30 Jan. 2013; published: 15 Mar. 2013

Keywords: ontology, regular expression, information extraction, Natural Language Processing

Abstract

Recently, unstructured data on the World Wide Web has generated significant interest in extracting information from text, emails, web pages, reports and research papers in their raw form. More interestingly, extracting information from a specific domain using distributed corpora from the World Wide Web is a vital step towards creating corpus annotation. This paper describes a method of annotation, based on concepts from Information Science (IS), to build a domain ontology using Natural Language Processing (NLP) technology. We used Java Annotation Patterns Engine (JAPE) grammars to support regular expression matching and thus annotate IS concepts using the GATE Developer tool. This speeds up the time-consuming development of the ontology, which is important for domain experts facing time constraints and high workloads. The rules provide significant results: pattern matching of IS concepts based on the lookup list produced 403 correct concepts, and accuracy was high, with 0 partially correct, missing or false positive results. Using NLP techniques is a good approach to reducing the domain experts' work, and the experts can then evaluate the results.

1.  Introduction

Recently, Information Extraction (IE) has received significant interest due to the number of web pages emerging on the internet that contain unstructured data. Given the amount of information available on the internet, a tool for extracting it is necessary. Many specialists in the field of IE have worked to find suitable tools, such as wrappers, that classify interesting data and map them onto appropriate formats such as XML or relational databases. Furthermore, some HTML-aware tools rely on the inherent structural features of documents to extract the data. On the other hand, Natural Language Processing (NLP) is a technique used by many tools to extract data from natural language documents. Tools such as GATE use techniques such as part-of-speech tagging, filtering, or lexical semantic tagging to link relevant information and identify relationships among phrases and sentence elements within text. In fact, each of these tools has advantages and disadvantages. A comparative analysis of the existing tools for data extraction is needed to assess their capabilities. This is done in the next section.

Ahlam Sawsaa, Joan Lu. Informatics, University of Huddersfield.

In this paper, we first provide a brief background of IE tools to justify why we feel the NLP technique should be used to speed up the building of an Ontology of Information Science (OIS). To extract concepts in the field, we used CREOLE plug-ins from GATE in the IE system. We also show how the JAPE grammar has been implemented by detailing the rules we use to annotate IS concepts.

The paper is structured as follows. In section 2, we discuss the background of IE. In section 3, we discuss the methods used to extract Information Science (IS) concepts and how they were constructed. In section 4, we present how the domain knowledge is acquired for creating the corpus and the Gazetteer, and how the JAPE rules are implemented. Our discussion and evaluation are in section 5. Finally, we draw conclusions and make suggestions for future work.

2.  Background

Ontologies have received considerable recognition across various research fields. Although there are some well-known ontologies, such as CYC, the Systematized Nomenclature of Medicine (SNOMED, a clinical terminology), the Toronto Virtual Enterprise (TOVE), and the Gene Ontology (GO), the study of ontologies is still immature and improvements are needed [6].

IS is a multidisciplinary field, including branches such as Library Science, Archival Science and Computer Science, and therefore lacks a unified model of domain knowledge. The inconsistencies in the structure of the domain make it difficult to use and share data at the syntactic and semantic levels. It is thus necessary to develop an OIS to represent the domain knowledge [9]. The growing amount of unstructured data appearing on the internet makes it extremely difficult to extract knowledge from it. The IS domain includes a huge number of documents made up of web-based knowledge that is inaccessible. Building an OIS thus requires us to set up an appropriate knowledge description module for the intended ontology [11].

A number of studies have shown that applications of IE can be used to annotate documents that are written in natural language. Certainly, the growing number of IE


International Journal Publishers Group (IJPG) © 


and concepts from a specific text effectively and efficiently. For this work, we annotate text belonging to members of Ontocop. Ontocop is a virtual community of practice within the IS domain, designed to support group interaction and communication across diverse locations. It was intended as a tool for creating an OIS by extracting the main concepts from its outputs, such as documents and discussions. Professional bodies can be good resources for the process of building a conceptual model of Information Science (OIS) [8].

3.  Methods Employed

Our method is based on creating a corpus of documents and a Gazetteer of Information Science, with JAPE rules used to extract IS concepts. GATE provides facilities for loading corpora for annotation from a URL or uploading them from a file. The process ran as follows:

• We compiled IS knowledge from different resources, such as the Ontocop website forum and various publications on the web by members of Ontocop.
• We analyzed the data to ensure it covered all branches of the field.
• We transferred the information resources into an XML file to form the corpus.
• We uploaded the corpus into the GATE software so as to start running ANNIE.
• We annotated the concepts based on JAPE grammar, which is run within ANNIE.
• We tested and evaluated the results.

The workflow is illustrated in Fig. 1.

4.  Implementation

 A.  Knowledge acquisition

Before creating the ontology, we had to collect the IS knowledge for the domain model. Our approach consisted of annotating IS concepts based on the JAPE grammar, using the GATE software. The annotation process began as follows:

• We collected discussion threads from the Ontocop website in a MySQL database, using the URL of the Ontocop website.

List (1) Transfer of discussion data obtained from the Ontocop forum to XML files.

Figure 2 shows the methods we used to annotate concepts from the embedded knowledge that had emerged from experts in the field on the Ontocop forum. The discussion topics were collected in the MySQL database before being transferred to XML files.
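The transfer from database rows to XML files can be sketched as follows. This is a minimal illustration in Python, not the authors' listing: the element names (`corpus`, `document`, `title`, `body`), the function name and the sample row are all assumptions, and the rows are supplied as plain dictionaries standing in for the result of a database query.

```python
import xml.etree.ElementTree as ET


def threads_to_xml(threads):
    """Serialise discussion threads (dicts with id, title, body) to an XML string."""
    root = ET.Element("corpus")  # hypothetical root element name
    for thread in threads:
        doc = ET.SubElement(root, "document", id=str(thread["id"]))
        ET.SubElement(doc, "title").text = thread["title"]
        ET.SubElement(doc, "body").text = thread["body"]
    # ElementTree escapes special characters such as & and < automatically.
    return ET.tostring(root, encoding="unicode")


# Hypothetical row, standing in for one record fetched from the forum database.
sample_threads = [
    {
        "id": 1,
        "title": "What is information science?",
        "body": "Information science covers libraries & archives.",
    },
]
xml_text = threads_to_xml(sample_threads)
```

Building the tree with an XML library rather than string concatenation keeps the output well-formed even when forum posts contain markup characters, which matters because the resulting files are loaded as a corpus.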

For example, the concept of Information science was annotated and defined in the OWL ontology language to start the lifecycle of the ontology process.

• Next, we collected IS publications by Ontocop members to speed up the process, and then

Fig. 1 Annotation workflow (boxes: Documents of Information Science; Analysis process; Transfer to XML; Corpus; Upload to GATE framework; Running ANNIE; Annotation of concepts & evaluation)


    Rule: concept2
    Priority: 20
    (
      ({Token.string == "information"})
      {Token.string == "service"}
      ({Lookup.majorType == "concept"})
    ):information
    -->
    :information.concept = {Rule = "concept2"}

More precisely, we apply a regular expression to match strings of text:

    Phase: Concept
    Input: Lookup Token
    Options: control = appelt

    Rule: Glossary
    (
      ({Token.string ==~ "catalog(ue)?"})
    ):concept
    -->
    :concept.concept = {Rule = "Glossary"}

In these rules we specify a string of text that must be matched, {Token.string == ...}, constraining the attributes of the annotation with operators such as "==", and then annotate the matched entities with the correct labels. In JAPE, "==" tests exact string equality, whereas "==~" requires the whole token string to match a regular expression, so the patterns below use "==~". Furthermore, the control option must be set to a suitable matching style, such as all, appelt or brill, to give the right results. The next example shows how regular expression metacharacters (dot, *, [ ], |) can be used to annotate concepts related to abstract:

    {Token.string ==~ "abstract(ing|or)?"}

This captures the words abstract, abstracting and abstractor. If we want to annotate the acquisition concept followed by another word, we match the acquisition token and then any following token:

    {Token.string == "acquisition"} {Token}

This could annotate:

    acquisition policy
    acquisition service

In the same way, {Token.string == "archival"} {Token} will annotate archival library, archival journal, archival processing, archival software, and archival studies, for example.
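The effect of whole-token regular expression matching can be checked outside GATE. The sketch below uses Python's `re.fullmatch` as a stand-in for a whole-string match on `Token.string`; the helper name and the token list are illustrative assumptions, and Python's regex dialect is assumed to agree with Java's for these simple patterns. It shows that a mandatory group such as `abstract(ing)` matches only abstracting, while an optional group with alternation covers abstract, abstracting and abstractor.

```python
import re


def matching_tokens(pattern, tokens):
    """Return the tokens whose entire string matches the regular expression."""
    compiled = re.compile(pattern)
    return [tok for tok in tokens if compiled.fullmatch(tok)]


# Illustrative tokens, not taken from the paper's corpus.
tokens = ["abstract", "abstracting", "abstractor", "abstraction", "catalog", "catalogue"]

# Optional group with alternation: the bare stem plus two suffixed forms.
abstract_forms = matching_tokens(r"abstract(ing|or)?", tokens)

# Optional suffix: both spellings of the same term.
catalog_forms = matching_tokens(r"catalog(ue)?", tokens)

# Mandatory group: only the form that actually contains "ing".
literal_only = matching_tokens(r"abstract(ing)", tokens)
```

Here `abstract_forms` holds the three intended word forms and excludes abstraction, which is the behaviour one would want from a rule meant to capture a fixed family of terms.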

We could also match either one of two terms. For example, we could match either Data or mining from the phrase Data mining, Data or processing from Data processing, Data or storage from Data storage, or Data or representation from Data representation by applying the pipe operator (|). The alternatives are written without surrounding spaces, since a space inside the pattern would have to appear in the token itself:

    {Token.string ==~ "Data|mining"}
    {Token.string ==~ "Data|processing"}
    {Token.string ==~ "Data|storage"}
    {Token.string ==~ "Data|representation"}
    {Token.string ==~ "Book|art"}
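A small Python sketch illustrates how alternation behaves when applied to individual tokens, and why whitespace inside the pattern matters. The function name and sample sentence are assumptions made for illustration; whitespace splitting stands in for a real tokeniser, and case-insensitive matching is switched on explicitly.

```python
import re


def tag_tokens(text, pattern):
    """Split text on whitespace and keep tokens that fully match pattern (ignoring case)."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [tok for tok in text.split() if compiled.fullmatch(tok)]


sentence = "Data mining and data storage"

# Either alternative may match a whole token.
hits = tag_tokens(sentence, r"data|mining")

# With spaces around the pipe, the alternatives are "Data " and " mining",
# which no whitespace-free token can match.
spaced = tag_tokens(sentence, r"Data | mining")
```

In this sketch `hits` contains Data, mining and data, while `spaced` is empty, so a pattern written with spaces around the pipe silently matches nothing.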

5.  Discussion and evaluation

Our extraction of IS concepts using JAPE grammar and regular expressions, based on GATE Developer for automated IE, provides significant results. The main idea behind using JAPE and regular expressions is to identify IS terminology as tokens, for example Computing, Libraries and Information technology, in a large text. Term identification relies on looking up a list of IS terms in the Gazetteer. For example, we could look up book art, book card, book guidance or book catalogue, or computer application, computer science, computer experts, computer file, or computer image. These concepts can be collected to form the main component of an IS glossary and structured in a semi-formal hierarchy before creating the computational model of the OIS.
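Term identification against a lookup list can be sketched as a longest-match search over the token sequence. The Python below is an illustrative approximation, not GATE's gazetteer implementation: the function name, the term list and the sample sentence are assumptions, tokens are produced by whitespace splitting, and the term list is assumed to be non-empty.

```python
def gazetteer_lookup(tokens, terms):
    """Return (start, end, text) token spans for the longest listed term at each position."""
    term_tuples = {tuple(term.lower().split()) for term in terms}
    max_len = max(len(term) for term in term_tuples)
    lowered = [tok.lower() for tok in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so "book catalogue" wins over "book".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in term_tuples:
                match_len = n
                break
        if match_len:
            spans.append((i, i + match_len, " ".join(tokens[i:i + match_len])))
            i += match_len
        else:
            i += 1
    return spans


# Illustrative term list, using terms mentioned in the text.
terms = ["book", "book catalogue", "computer science", "computer file"]
tokens = "The book catalogue lists computer science titles".split()
found = gazetteer_lookup(tokens, terms)
```

Preferring the longest match means a multi-word term such as book catalogue is annotated once as a whole, rather than as the shorter term book followed by an unannotated word.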

We extracted the IS concepts from a corpus of 300 documents, obtained specifically for this purpose. We ran the ANNIE application, using Document Reset, Tokeniser, Sentence Splitter, Gazetteer, POS Tagger, JAPE Transducer and OrthoMatcher. The annotation sets appear in the display pane, and the concepts are highlighted in the default annotation set, with each annotation type shown in a different colour, as can be seen in Fig. 4.

Fig. 4 Annotation concepts in GATE

Figure 4 presents the results of annotating the IS concepts after running ANNIE and highlighting the matching concepts. The results show that our approach successfully annotates concepts. We retrieved 541 instances of the Knowledge concept, 275 of the Information concept and 35 of the organization concept (see Fig. 5). Each annotation starts at a specific offset and ends at a different offset, depending on the length of the matched text. The knowledge concept starts at offset 557 and ends at 566, while the organization concept starts at 624 and ends at 636, with the feature {majorType=concept}.
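The reported offsets are consistent with character positions: 566 - 557 = 9, the length of "knowledge", and 636 - 624 = 12, the length of "organization". The sketch below shows how such start and end offsets arise for whole-word matches. It is a Python illustration under stated assumptions: the function name and sample sentence are invented, and end offsets are exclusive, which matches the arithmetic above.

```python
import re


def concept_offsets(text, concept):
    """Return (start, end) character offsets of each whole-word occurrence of concept."""
    pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


# Illustrative sentence, not taken from the corpus.
text = "Knowledge organization supports knowledge sharing."
knowledge_spans = concept_offsets(text, "knowledge")
organization_spans = concept_offsets(text, "organization")

# Each span length equals the length of the matched term.
span_lengths = [end - start for start, end in knowledge_spans]
```

Because offsets index characters in the document text, the same concept always produces spans of the same length, whatever its position.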


Fig. 5 Result of the annotation of the IS domain 

The data were evaluated using the Annotation Diff tool, which is based on the evaluation metrics precision, recall and F-measure. Annotation Diff is used to compare two different annotation sets on the same document. We compared the key type feature with the response feature.

The tests showed that the accuracy rates are equal to those of the manual output of IS experts. The statistics of the corpus show that pattern matching of IS concepts based on the IS lookup list produced 403 correct concepts; accuracy was generally high, with no partially correct, missing or false positive results (0).
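For the reported counts, the standard metrics can be worked through directly. The sketch below is a Python illustration using the usual strict definitions, in which partially correct annotations count as errors; the function and parameter names are assumptions, and this is not a reproduction of the Annotation Diff tool's code.

```python
def annotation_scores(correct, partial, missing, spurious):
    """Strict precision, recall and F1, counting partially correct matches as errors."""
    response_total = correct + partial + spurious  # annotations the system produced
    key_total = correct + partial + missing        # annotations in the reference set
    precision = correct / response_total if response_total else 0.0
    recall = correct / key_total if key_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts reported in the text: 403 correct, no partial, missing or false positive results.
precision, recall, f1 = annotation_scores(correct=403, partial=0, missing=0, spurious=0)
```

With zero partial, missing and spurious annotations, all three scores evaluate to 1.0, which agrees with the statement that the rules match the experts' manual output on this corpus.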

We used GATE because of its benefits as open-source software and because it contains multi-language NLP models that can be reused to develop other resources.

6.  Conclusion

 A.  Achievement

This paper has described a method using NLP techniques to extract concepts in order to speed up the development of an OIS. Furthermore, the development of the IE system should save domain experts time and effort in labelling the most common concepts. In total we extracted 664 concepts that are classes of the OIS, and 650 subclasses, making up the main components of the ontology skeleton. The IE technique can be applied to many different formats, such as XML, HTML documents, URLs or emails.

 B.  Future work

Ontology is at the heart of the semantic web. It defines concepts and relations that make global interoperability possible. In future work, we plan to enhance these concepts so as to develop an OIS that captures the taxonomy of IS as a domain. The next step will be to code the process using Protégé as the ontology editor. Additionally, a generic model of the OIS will be evaluated.

References

[1] ALBERTO, H. F., BERTHIER, A. L.-. & RIBEIRO-NETO (2002) A brief survey of web data extraction tools. SIGMOD Record. http://annotation.semanticweb.org/tools/.

[2] CHANG, C.-H., KAYED, M., GIRGIS, M. R. & SHAALA, K. (2000) A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, 13.

[3] CRESCENZI, V. & MECCA, G. (2004) Automatic Information Extraction from Large Websites. Journal of the ACM, 51, pp. 731-779.

[4] GATE (2010) Developing Language Processing Components with GATE Version 6 (a User Guide). http://gate.ac.uk/sale/tao/splitch13.html#x18-32300013.2.

[5] HANDSCHUH, S. & STAAB, S. (2002) Authoring and Annotation of Web Pages in CREAM. Honolulu, Hawaii, USA.

[6] LABORATORY, E. I. (2011) TOVE Ontology Project. University of Toronto, Toronto. http://www.eil.utoronto.ca/enterprise-modelling/tove/.

[7] MOENS, M.-F. (2006) Information Extraction: Algorithms and Prospects in a Retrieval Context, Springer.

[8] SAWSAA, A. & LU, J. (2010a) Ontocop: A virtual community of practice to create ontology of Information Science. ICOMP'10. Las Vegas.

[9] SAWSAA, A. & LU, J. (2010b) Ontology of Information Science Based On OWL for the Semantic Web. In: International Arab Conference on Information Technology (ACIT'2010). University of Garyounis, Benghazi, Libya.

[10] SRIHARI, R. & LI, W. (2002) Information Extraction Supported Question Answering. In Proceedings of the Eighth Text REtrieval Conference (TREC-8).

[11] SUI, Z., ZHAO, J., KANG, W. & ZHAO, Q. (2008) The Building of a CBD-Based Domain Ontology in Chinese. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.

[12] TURMO, J., AGENO, A. & CATALÀ, N. (2006) Adaptive Information Extraction. ACM Computing Surveys, 38.

Fig. 6 Accuracy results


Ahlam Sawsaa was born in Libya. She received her B.S. and M.S. degrees in Library and Information Science from Garyounis University. She serves as a lecturer in the Department of Library and Information Science at Garyounis University and supervises many graduate degree projects. She is the author of a book and more than seven international publications, and a reviewer for international conferences. She is currently a PhD researcher in ontologies and the semantic web at the University of Huddersfield.

Joan Lu. Professor Lu is in the Department of Informatics. She was a team leader in the IT department of an industrial company before she joined the university. Her research interests include XML technology, object-oriented system development, agent technology, data management systems, information access/retrieval/visualization/representation, security issues and Internet computing. She serves as Editor in Chief of the International Journal of Information Retrieval Research.