[ieee 2007 innovations in information technologies (iit) - dubai, united arab emirates...


Page 1: [IEEE 2007 Innovations in Information Technologies (IIT) - Dubai, United Arab Emirates (2007.11.18-2007.11.20)] 2007 Innovations in Information Technologies (IIT) - Enhanced Arabic

Enhanced Arabic Information Retrieval System based on Arabic Text Classification

Sameh Ghwanmeh, Yarmouk University, Jordan

Ghassan Kanaan, Arab Academy for Banking and Financial Sciences, Jordan

Riyad Al-Shalabi, Arab Academy for Banking and Financial Sciences, Jordan

Ahmad Ababneh, Jordan

Abstract

This paper presents an enhanced, effective, and simple approach to text classification. The approach uses an algorithm to classify documents automatically. The main idea of the algorithm is to select feature words from each document such that those words cover all the ideas in the document. The result of the algorithm is a list of the main subjects found in the document. The paper also investigates the effect of Arabic text classification on information retrieval. The system was evaluated in two cases using precision/recall criteria: without Arabic text classification and with Arabic text classification. A series of experiments was carried out to test the algorithm using 242 Arabic abstracts. Additionally, automatic phrase indexing was implemented. The experiments revealed that the system with text classification performs better than the system without text classification.

1. Introduction

Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes of records and information requests. An Information Retrieval System is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects [1, 2].

Classification is the means whereby we order knowledge. Classification is fundamentally a problem of finding sameness. When we classify, we seek to group things that have a common structure or exhibit a common behavior. Text classification mainly uses information retrieval techniques. Traditional information retrieval mainly retrieves relevant documents by using keyword-based or statistic-based techniques [1]. A standard approach to text categorization makes use of the classical text representation technique that maps a document to a high-dimensional feature vector, where each entry of the vector represents the presence or absence of a feature [3].
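The presence/absence representation described above can be illustrated with a minimal Python sketch. The function name, the toy vocabulary, and the English tokens are assumptions made for illustration only; they are not part of the system described in this paper.

```python
def binary_feature_vector(document_tokens, vocabulary):
    """Return a 0/1 vector: entry k is 1 if vocabulary[k] occurs in the document."""
    present = set(document_tokens)
    return [1 if feature in present else 0 for feature in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["retrieval", "classification", "arabic", "kernel"]
tokens = "arabic text classification improves arabic retrieval".split()

vector = binary_feature_vector(tokens, vocabulary)
print(vector)  # [1, 1, 1, 0]
```

Note that repeated occurrences ("arabic" appears twice) do not change the entry: only presence or absence is recorded.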

This paper presents an enhanced, effective, and simple approach to text classification. The approach uses an algorithm to classify documents automatically. The main idea of the algorithm is to select feature words from each document such that those words cover all the ideas in the document. The result of the algorithm is a list of the main subjects found in the document. The paper also investigates the effect of Arabic text classification on information retrieval. The goal was to improve the convenience and effectiveness of information access. A series of experiments was carried out to test the algorithm using 242 Arabic abstracts from the Saudi Arabian National Computer Conference. Additionally, automatic phrase indexing was implemented. The system was evaluated for the two cases based on precision/recall evaluation.

2. General Approaches to Classification

2.1 Classical Categorization

In this approach all entities that have a given property or a collection of properties in common form a category. Classical categorization comes to us from Plato, then from Aristotle through his classification of plants and animals [4].

978-1-4244-1841-1/08/$25.00 ©2008 IEEE.

2.2 Prototype Theory

A category is represented by a prototypical object. An object is considered to be a member of this category if and only if it resembles this prototype in significant ways.

2.3 Conceptual Clustering

This is a more modern variation of the classical approach, in which we generate categories as follows:

* Formulate conceptual descriptions of these categories.

* Classify the entities according to the descriptions.

In conceptual clustering, we group things according to distinct concepts. In prototype theory, we group things according to the degree of their relationship to concrete prototypes [4, 5]. In text classification, the main task is to choose the correct category for each text document in the corpus. For example, one might need to categorize sports texts by type of sport, or to divide sentences into questions and statements. Two main types of text classification are considered. In single-category text classification, there is a predefined set of categories and each text belongs to exactly one category. In multi-category text classification, each text can have zero or more categories [9].

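Table 1. Examples of the feature dictionary.

Phrase       Word 1       Word 2       Doc.  Freq  Title_F  End_F  Field
[illegible]  [illegible]  [illegible]  1     7     1        1      [illegible]
[illegible]  [illegible]  [illegible]  1     1     0        0      [illegible]
[illegible]  [illegible]  [illegible]  1     5     0        2      [illegible]
[illegible]  [illegible]  [illegible]  1     4     1        0      [illegible]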

3. Applications of Text Classification

3.1 E-mail filtering

Text classification can be used to filter e-mails into folders set up by the user, which makes the e-mails easier to search.

3.2 Knowledge-Based Creation

Company web sites provide large amounts of information about products, marketing contact persons, etc. Categorization can be used to find companies' web pages and organize them by industrial sector.

3.3 E-Commerce

Users locate products in two basic ways: search and browsing. Browsing is best when the user does not know exactly what he or she wants. Text classification can be used to organize products into a hierarchy according to their descriptions.

4. Classification Algorithm

The classification algorithm can be described by the main steps shown below.

a) Test Database
b) Select one document (Dx)
c) Select feature word fti from Dx
d) Compute the weight of fti
e) Compute the weight for each subject
f) Select one subject as the main subject and show all other subjects sorted by their weight

4.1 Build Feature Dictionary

The feature dictionary is mainly used to store terms that can illustrate a topic feature concept. The data structure of the feature dictionary consists of the word or phrase, document number, frequency, title frequency, end frequency, and field attribute. Frequency is the number of times the feature term occurs in the document, title frequency is the number of times the feature term occurs in the title, and end frequency is the number of times the feature term occurs in the end sentence of a paragraph. Table 1 shows examples of the feature dictionary.
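As a rough illustration of this data structure, the following Python sketch models one feature dictionary entry and groups entries by term. The class name, field names, and the placeholder terms and subjects are hypothetical; only the numeric values are taken from the rows of Table 1.

```python
from dataclasses import dataclass


@dataclass
class FeatureEntry:
    """One row of the feature dictionary (field names are illustrative)."""
    term: str             # word or two-word phrase
    doc_number: int       # document in which the term occurs
    frequency: int        # occurrences in the document
    title_frequency: int  # occurrences in the title
    end_frequency: int    # occurrences in end sentences
    field: str            # subject (field attribute) of the term


def build_dictionary(entries):
    """Group entries by term so that a term can be looked up directly."""
    dictionary = {}
    for entry in entries:
        dictionary.setdefault(entry.term, []).append(entry)
    return dictionary


# Placeholder terms and subjects; the numbers follow the rows of Table 1.
entries = [
    FeatureEntry("term_a", 1, 7, 1, 1, "subject_x"),
    FeatureEntry("term_b", 1, 1, 0, 0, "subject_y"),
    FeatureEntry("term_c", 1, 5, 0, 2, "subject_x"),
    FeatureEntry("term_d", 1, 4, 1, 0, "subject_z"),
]
feature_dictionary = build_dictionary(entries)
print(feature_dictionary["term_c"][0].end_frequency)  # 2
```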

4.2 Computation of Topic Feature Distribution

According to the field attributes, frequencies, and positions of the feature terms, we can compute the topic feature distribution. The weight of a feature term fti is computed from its frequency and position as follows [1]:


$$p(ft_i) = \frac{freq(ft_i) + N_{title} + 0.5 \times N_{end}}{\sum_{j} freq(ft_j)} \qquad (1)$$

Where p(ft_i) is the weight of the feature term ft_i, freq(ft_i) is the frequency of the feature term ft_i, N_title is the number of times the feature term ft_i occurs in the title, N_end is the number of times the feature term ft_i occurs in the end sentence of a paragraph, and the sum of freq(ft_j) is the total frequency of all feature terms in the text.

Figure 1. Phrase Index (columns: Phrase, Word1, Word2, Document #)

The weight p(f_i) of a topic feature f_i is computed as follows:

$$p(f_i) = \sum_{ft_j \in Ft_j\_f_i} p(ft_j) \qquad (2)$$

Where p(f_i) is the weight of the topic feature f_i, p(ft_j) is the weight of the feature term ft_j, and Ft_j_f_i is the set of feature terms illustrating the same topic feature f_i [1].
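The two weights can be sketched in Python as follows. This is a sketch under a stated assumption: the term-weight formula is printed illegibly in this transcript, and the code assumes the reading (freq + N_title + 0.5 × N_end) divided by the total frequency of all feature terms, which follows the symbol definitions given in the text. The function names, the tuple layout, and the subject labels are hypothetical.

```python
from collections import defaultdict


def term_weight(freq, n_title, n_end, total_freq):
    """Weight of one feature term (assumed reading of the term-weight formula)."""
    if total_freq <= 0:
        raise ValueError("total_freq must be positive")
    return (freq + n_title + 0.5 * n_end) / total_freq


def topic_weights(terms):
    """Sum term weights per topic and return (topic, weight) sorted by weight.

    terms: list of (freq, n_title, n_end, topic) tuples for one document.
    """
    total_freq = sum(freq for freq, _, _, _ in terms)
    weights = defaultdict(float)
    for freq, n_title, n_end, topic in terms:
        weights[topic] += term_weight(freq, n_title, n_end, total_freq)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Illustrative terms of one document; the subject labels are made up.
terms = [
    (7, 1, 1, "subject_x"),
    (1, 0, 0, "subject_y"),
    (5, 0, 2, "subject_x"),
    (4, 1, 0, "subject_z"),
]
ranked = topic_weights(terms)
main_subject = ranked[0][0]
print(main_subject)  # subject_x
```

The first element of the sorted list plays the role of the main subject, and the remaining elements are the other subjects in descending order of weight, as in the last step of the classification algorithm.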

4.3 Traditional Information Retrieval System

To conduct the necessary comparison, a traditional Information Retrieval (IR) system has been used, which applies simple information retrieval techniques based on word and phrase matching. A table stores, for each phrase, its component words and the number of the document in which the phrase appears. The steps involved in this algorithm are as follows:

a) Exact Match
In this step, the system retrieves the documents that have an exact match between the user query and the phrases in the table, and ranks them at the top of the answer set.

b) Partial Match
Partial match is based on the number of words matched. For example, if the query is the Arabic phrase meaning "Artificial Intelligence Approaches", a document in which the phrase meaning "Artificial Intelligence" appears is ranked before a document that includes only the word meaning "Artificial". The results of this step are ranked in the middle of the answer set.

c) One-Word Match
For example, if the query is the Arabic phrase meaning "Artificial Intelligence Approaches", we search for each individual word in the query. The results of this step are ranked at the bottom of the answer set. The two-word phrase index that was built is shown in Figure 1.

Table 2. Phrase Index for the query meaning "Artificial Intelligence"

Phrase                   Word1       Word2         Doc. #
Artificial intelligence  Artificial  Intelligence  10
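The three matching tiers of the traditional system can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name, the two index structures (`phrase_index` mapping a two-word phrase to document numbers, `word_index` mapping a single word to document numbers), and the English stand-ins for the Arabic terms are all assumptions.

```python
def rank_traditional(query, phrase_index, word_index):
    """Return document numbers ordered as exact, then partial, then one-word matches."""
    query_words = query.split()
    ranked, seen = [], set()

    def add(docs):
        # Append unseen documents in a stable (sorted) order.
        for doc in sorted(docs):
            if doc not in seen:
                seen.add(doc)
                ranked.append(doc)

    # a) Exact match: the whole query equals an indexed phrase.
    add(phrase_index.get(query, set()))

    # b) Partial match: indexed phrases sharing words with the query,
    #    phrases with more shared words first.
    partial = []
    for phrase, docs in phrase_index.items():
        if phrase == query:
            continue
        overlap = len(set(phrase.split()) & set(query_words))
        if overlap:
            partial.append((overlap, phrase, docs))
    for _, _, docs in sorted(partial, key=lambda item: (-item[0], item[1])):
        add(docs)

    # c) One-word match: each query word looked up on its own.
    for word in query_words:
        add(word_index.get(word, set()))
    return ranked


phrase_index = {
    "artificial intelligence": {3, 10},
    "intelligence tests": {5},
    "expert systems": {7},
}
word_index = {"artificial": {10, 12}, "approaches": {4}}

print(rank_traditional("artificial intelligence approaches", phrase_index, word_index))
# [3, 10, 5, 12, 4]
```

In this toy run no phrase matches the three-word query exactly, so the answer set starts with the partial matches; documents 12 and 4 are found only through single words and therefore come last.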

4.4 Enhanced IR System Supported by Text Classification

The traditional IR system has been enhanced. After retrieving the documents that have an exact match between the query and a phrase, the system examines the subjects of these documents. The steps involved in this algorithm are as follows:

a) Exact Match
In this step, the system retrieves the documents that have an exact match between the user query and the phrases in the table, and ranks them at the top of the answer set. It then uses the subjects of the returned documents to search for and return the documents that have the same subjects.

b) Ranking
Rank the documents that have an exact match at the top of the answer set. Below them, rank the documents that have the same subject as the top-ranked (exact-match) documents.
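The two steps above can be sketched as a small Python function. The function name and the `doc_subjects` mapping (document number to its main subject, as produced by the classification algorithm) are assumptions made for illustration, and the subject labels are English stand-ins.

```python
def rank_with_classification(query, phrase_index, doc_subjects):
    """Exact matches first, then other documents sharing a subject with them."""
    # a) Exact match between the query and an indexed phrase.
    exact = sorted(phrase_index.get(query, set()))

    # Subjects of the exact-match documents drive the second search.
    subjects = {doc_subjects[doc] for doc in exact if doc in doc_subjects}

    # b) Ranking: same-subject documents are placed below the exact matches.
    same_subject = sorted(
        doc for doc, subject in doc_subjects.items()
        if subject in subjects and doc not in exact
    )
    return exact + same_subject


phrase_index = {"artificial intelligence": {10}}
doc_subjects = {
    10: "artificial intelligence",
    21: "artificial intelligence",  # e.g. a document about expert systems
    33: "networks",
    45: "artificial intelligence",  # e.g. a document about knowledge bases
}

print(rank_with_classification("artificial intelligence", phrase_index, doc_subjects))
# [10, 21, 45]
```

Documents 21 and 45 do not contain the query phrase but are returned because they share the subject of the exact-match document, which is how the classification step can raise recall.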

Table 3 shows an example of the queries used, the system output for each query, and the subject of both the query and the system output.

Table 3. Example of system output

Query        Subject      System Output  Subject
[illegible]  [illegible]  [illegible]    [illegible]

The system begins the search with an exact match between the user query (Artificial Intelligence) and the phrases in the feature dictionary. The system then searches again using the subject of the returned documents (Artificial Intelligence). The system retrieves phrases whose subject is Artificial Intelligence, for example Natural Languages, Expert Systems, and Knowledge Bases.

5. System Implementation

The first component of the system builds the Feature Dictionary. It uses automatic phrase indexing, in which each phrase consists of two words, and computes the necessary frequencies. The subject field of the feature dictionary has been filled manually. In the main form window the user can enter or select a query. After the user types the query, the system automatically retrieves the documents relevant to it.

5.1 Performance Evaluation

The performance of the system has been evaluated by running the system and observing the results (relevant retrieved documents) for each query. From these runs, we calculate the precision and recall evaluation measures for each query in two cases: with text classification and without text classification.

5.2 Precision/Recall Measurements

The following formulas have been used in calculating the precision-recall measurements [12].

$$\text{Precision} = \frac{\text{No. of relevant retrieved docs}}{\text{No. of total retrieved docs}}$$

$$\text{Recall} = \frac{\text{No. of relevant retrieved docs}}{\text{No. of relevant existing docs}}$$

The average recall precision has been calculated using the following formula:

$$\bar{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}$$

where

$$P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$$

Here \bar{P}(r) is the average precision at the recall level r, N_q is the number of queries used, and P_i(r) is the precision at recall level r for the i-th query.
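The measures above can be sketched in Python as follows. The function names and the toy data are hypothetical. Two choices in the sketch are assumptions not fixed by the text: a recall level whose interval contains no observed point is given precision 0.0, and the last recall level uses only points at exactly that level.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Fraction of the existing relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant


def interpolated_precision(points, levels):
    """Max precision observed between each recall level and the next one.

    points: list of (recall, precision) pairs for one query.
    levels: increasing list of standard recall levels.
    """
    result = []
    for j, r_j in enumerate(levels):
        r_next = levels[j + 1] if j + 1 < len(levels) else r_j
        in_range = [p for r, p in points if r_j <= r <= r_next]
        result.append(max(in_range) if in_range else 0.0)  # 0.0 is an assumption
    return result


def average_precision_at_levels(per_query_points, levels):
    """Average the interpolated precision over all queries at each recall level."""
    n_q = len(per_query_points)
    sums = [0.0] * len(levels)
    for points in per_query_points:
        for k, p in enumerate(interpolated_precision(points, levels)):
            sums[k] += p
    return [s / n_q for s in sums]


levels = [0.0, 0.5, 1.0]
query_1 = [(0.25, 1.0), (0.75, 0.5)]
query_2 = [(0.25, 0.5), (0.6, 0.25)]
print(average_precision_at_levels([query_1, query_2], levels))
# [0.75, 0.375, 0.0]
```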

To measure the differences and observe the enhancements, the two system runs (without text classification and with text classification) are compared using the precision-recall measurements. The system has been tested using a set of 242 Arabic abstracts from the proceedings of the Saudi Arabian National Computer Conferences. A total of 59 Arabic queries have been used in the tests, and relevance judgments were made automatically and manually for each query.

5.3 Experiments and Results

The testing process involved 40 queries. These queries were applied in two cases: with and without text classification. Table 4 shows the recall and precision values obtained over the retrieved documents for the Arabic query meaning "Arabic Language".

Table 4. Resulting recall/precision (R = 1 if the retrieved document is relevant, 0 if not)

Recall  Precision  Counter  R  Doc. #
0.01    1.00       1        1  2
0.02    1.00       2        1  4
0.03    1.00       3        1  8
0.04    1.00       4        1  9
0.05    1.00       5        1  242
0.06    1.00       6        1  198
0.07    1.00       7        1  86
0.07    0.88       7        0  26
0.07    0.78       7        0  18
0.07    0.70       7        0  42
0.07    0.64       7        0  41
0.07    0.58       7        0  22
0.08    0.62       8        1  39
0.09    0.64       9        1  87
0.10    0.67       10       1  88
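The Recall and Precision columns of Table 4 can be reproduced from the R column with a short Python sketch. The function name is hypothetical. The total of 100 relevant documents is not stated in the text; it is inferred here from the table, where the first relevant document yields a recall of 0.01.

```python
def running_precision_recall(relevance_flags, total_relevant):
    """Return (recall, precision, counter) after each retrieved document.

    relevance_flags: 1 if the document at that rank is relevant, else 0.
    total_relevant: number of relevant documents that exist for the query.
    """
    rows, counter = [], 0
    for rank, flag in enumerate(relevance_flags, start=1):
        counter += flag
        rows.append((counter / total_relevant, counter / rank, counter))
    return rows


# R column of Table 4, in rank order.
flags = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
TOTAL_RELEVANT = 100  # inferred from the table, not given by the paper

rows = running_precision_recall(flags, TOTAL_RELEVANT)
for rec, prec, counter in rows:
    print(f"{rec:.2f}  {prec:.2f}  {counter}")
```

For instance, at rank 8 seven relevant documents have been seen, so the precision is 7/8 = 0.875 and the recall is 7/100 = 0.07, which agrees with the eighth row of the table.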

Based on the resulting precision and recall values, the average precision/recall for the system without Arabic text classification is Precision = 31% and Recall = 78%, and with Arabic text classification is Precision = 37% and Recall = 93%. Figure 2 shows the performance of the system for both cases and the effects of Arabic text classification on the retrieval process. The performance of the system is better with Arabic text classification.


Table 5. Recall/precision table with and without Arabic text classification (TC)

           With TC            Without TC
Intervals  Recall  Precision  Recall  Precision
0 - 0.1    0.0562  0.7529     0.0560  0.765
0.1 - 0.2  0.1525  0.5325     0.1522  0.4937
0.2 - 0.3  0.2515  0.4384     0.2522  0.4296
0.3 - 0.4  0.3518  0.3890     0.3470  0.3879
0.4 - 0.5  0.4556  0.3596     0.4405  0.3451
0.5 - 0.6  0.5525  0.31781    0.5263  0.2910
0.6 - 0.7  0.6521  0.2734     0.6067  0.2224

Figure 2. Recall/precision with text classification and without text classification. (Two curves: without text classification, with text classification.)

6. Conclusions

An Arabic information retrieval system has been described in which specific documents within a collection are retrieved in response to a query and then ranked. The precision/recall measures calculated for both situations show that the system performance and the Arabic information retrieval efficiency are better with text classification than without it. The effects of text classification on the system performance have been observed. The results of the experiments show that good precision and recall measures are achieved. The system key benefits can be summarized as follows:

7. References

[1] Zhu Jingho, Yao Tianshun, A Knowledge-Based Approach to Text Classification, 2002.

[2] Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.

[3] Huma Lodhi, Text Classification using String Kernels, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK, 2002.

[4] T. Joachims, Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 169-184, MIT Press, Cambridge, MA, 1999.

[5] Rayid Ghani, Using Error-Correcting Codes for Text Classification, Center for Automated Learning & Discovery, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1999.

[6] www.dcs.gla.ac.uk/keith/chapter.1/ch.1.html

[7] Moustafa M. Ghanem, Yike Guo, Huma Lodhi, Yong Zhang, Automatic Scientific Text Classification Using Local Patterns, 2002.

[8] Nicholas Kushmerick, Edward Johnston, Stephen McGuinness, Information Extraction by Text Classification, Smart Media Institute, Computer Science Department, University College Dublin, 2000.

[9] NLTK Tutorial: Text Classification, nltk.sourceforge.net/tutorial/classifying/, 2003.

[10] William A. Woods, Aggressive Morphology for Robust Lexical Coverage, December 1999.

[11] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley / ACM Press, 1999.

[12] Rada Mihalcea, Information Retrieval and Web Search (Teaching Assistant: Hasina Aziz), 2004.

[13] William A. Woods, Natural Language Technology in Precision Content Retrieval, December 1998.

[14] Grady Booch, Object-Oriented Analysis and Design (chapter 4), 1994.
