
WORD SENSE DISAMBIGUATION (WSD) FOR HINDI LANGUAGE WEB INFORMATION RETRIEVAL

Dr. S.K. Dwivedi (1) and Parul Rastogi (2)

(1) Reader and Head, Computer Science Dept., BabaSaheb Bhim Rao Ambedkar University, Lucknow, UP, [email protected]

(2) Research Scholar, Computer Science Dept., BabaSaheb Bhim Rao Ambedkar University, Vidya Vihar, Rai Barreliy Road, Lucknow, UP, [email protected]

ABSTRACT

Word Sense Disambiguation (WSD) is the problem of computationally determining the correct sense of a word in a particular context. WSD is imperative for information retrieval systems. Like web information retrieval in other languages, Hindi language web information retrieval faces the problem of sense ambiguity in addition to its other technical setbacks. Sense ambiguity degrades the performance of every natural language processing (NLP) application, and Hindi language web information retrieval is no exception. We have formalized an approach to sense disambiguation that improves the performance of Hindi language web information retrieval, in particular the precision of the results retrieved from the web. We took a test sample of 100 queries, of which 43% were detected as unambiguous. Ambiguity detection is followed by disambiguation based on HSC (Highest Sense Count), and disambiguation is in turn followed by query expansion. The expanded query generates a new result set with higher precision and a higher similarity score. The 57 expanded queries were tested against 1000 test document instances. The overall improvement is 45% in average precision and 23% in non-interpolated average precision, together with a significant improvement in the average similarity score of the newly generated result set. The overall accuracy of our approach is 61.4%, and it improves the performance of the system by 45%.

Keywords: Word Sense Disambiguation, Hindi language, Web Information Retrieval, Query Expansion, Highest Sense Count.


1. INTRODUCTION

The unhampered growth of the web as a complete reservoir of knowledge has led to an era of Information Revolution. Today, the Internet is the foremost source of information for the human population. English is the dominant and preferred language for web access. In recent times, the rapid growth in the popularity of computers and the Internet in non-English speaking countries like India has increased the need for, and the importance of, reaching out to non-English speaking users. With the increase in content written in native languages on the Internet, a proper mechanism is needed to make this content noticeable and available wherever and whenever necessary.

A major part of the population of India uses Hindi as a first language. Hindi is an official language of India, with roughly 300 million native speakers. Another 100 million or more use Hindi as a second language. It is the language of dozens of major newspapers, magazines, radio and television stations, and of other media. The Hindi IT market seems to have taken off silently during the past couple of years. Today, not only Hindi but also other Indian languages such as Tamil have begun to be noticed by IT bigwigs, as have the languages of other countries, such as Chinese and Japanese.

Internet penetration in India is growing at one of the fastest rates in the world, but it is likely to hit a plateau if the main Internet language remains English. Portals such as Rediff, Yahoo, MSN, Google and others have started offering content in Hindi and other languages. However, when hunting for any real information, the amount of activity that a person can carry out on the Internet in Hindi or any other local language seems limited. The Indian IT industry should take serious steps to promote web usage in the rural areas of India. This is possible only when the web provides information to native users in their own languages.

Various independent search engine sites are available on the Internet in English, but very few search engines (Google.com, Raftaar.com, Webkhoj.iiit.ac.in, etc.) support Hindi. The search engines that do support Hindi language search are not able to provide quality results, and they face various problems with Hindi language information retrieval. Sense ambiguity is one of the major problems in Hindi language information retrieval on the web. Many words are polysemous in nature, and identifying the appropriate sense of a word in the given context is a difficult job for search engines. Word sense disambiguation offers a solution for many natural language processing systems, including information retrieval.

2. RELATED WORK

Few researchers have worked on Hindi language word sense disambiguation. Pushpak Bhattacharyya and colleagues [1] proposed a statistical approach which was very close to the Lesk [2] approach. Another unsupervised approach to Hindi language WSD was given by Neetu Mishra and Shashi Yadav [3]. Our work on disambiguation, however, is motivated by Klapaftis and Manandhar [4], who used the Total Sense Score (TSS) for disambiguation.

Motivated by their hypothesis, our disambiguation method follows a similar approach to find the context for a particular document snippet. In addition to their TSS, we calculate the phrase frequency of the query for a particular snippet, since we use the approach for query disambiguation on the web. Various other researchers have also used web documents for disambiguation [5, 6].

For query expansion, our work is related to the Vector Space Model (VSM). The common specification of the VSM is term frequency and inverse document frequency. In our study, we emphasize that query expansion should draw on the terms in the relevant documents themselves. Using WordNet or the web as a whole is not a feasible basis for query expansion in WSD. Since sense disambiguation depends on the context of the terms, it is quite justified that the context of the query terms can be identified well from the relevant document set alone.

We have taken as a base Pinto's approach to query expansion for WSD [7], which used the VSM for WSD. The success rate of Pinto's method in improving the performance of an IR system is low. One of the reasons, as mentioned earlier, is that the researchers used WordNet, a lexical database of contextual relations, as the base for query expansion. Since the web is a huge pool of information, it is often not feasible to find the context of the key terms from the existing examples of a lexical database. It is important to find the current context of a query (that is, its context in the particular query) from the set of relevant documents retrieved.

Nie and Jin [8] used an approach to query expansion which is different from most previous studies. They argue that an appropriate combination of the expansion terms with the original terms is an important problem in query expansion, whereas other research work [9, 10] has taken for granted that expansion terms should be added as additional dimensions in the resulting vector. Nie and Jin’s preliminary results seem to support the claim that considering the expansion terms as logical alternatives is a better solution. They also supported the view that WordNet is not a suitable resource for query expansion in information retrieval.

Comparable work is that of Wong et al. [11], which tries to create more complex relationships within vectors. Wong et al. observed that the underlying independence assumption in vector representation is not reasonable, and they suggest considering dependencies between dimensions in a Generalized Vector Space Model. However, the method of Wong et al. suffers from a complexity problem; in practice, it is difficult to implement fully.

3. SENSE AMBIGUITY AND WEB IR

Every language faces the ambiguity problem. Many words are polysemous in nature. In human interaction we are able to find the sense of an ambiguous word simply by relating it to its context, but for automated systems this is a difficult task.

Sense ambiguity in Hindi language queries can be understood from the example query “गुलाब की कलम” {Gulaab ki kalam} (Rose branch), which consists of three terms, as shown in Figure 1.

Terms in query | Senses from Hindi WordNet | POS (part of speech)
गुलाब {Gulaab} (Rose) | गुलाब {Gulaab} (Rose) | Noun
की {Ki} (of) | | Preposition
कलम {Kalam} (Pen) | पेन {Pen} (Pen), अँखिया {Ankhiyaan}, कलम {Kalam}, तूलिका {Tulika} (Brush), कलम {Kalam} (Branch), लेखनी {Lekhni} | Noun

Figure 1: Sense ambiguity and Web IR in Hindi language

http://www.webdunia.com/homepage/default.htm

http://news.google.com/nwshp?hl=hi&ned=hi_in

http://in.hindi.yahoo.com/

http://www.rediff.com/hindi/

http://www.readwriteweb.com/archives/world_internet_penetration_sept06.php

Sense ambiguity is a common problem in other languages as well. English, for example, has numerous polysemous words. The query “Red Bat” consists of two terms, one of which is ambiguous. The ambiguity in English can be understood from Figure 2.

Figure 2: Sense ambiguity in English language

4. AMBIGUITY DETECTION

The focus of the ambiguity detection method is to measure the ambiguity of a query term qi from a query Q. A term whose sense tagging has low probability is likely to be ambiguous. In an early phase of our research we formulated an approach to ambiguity detection based on two parameters, “entropy” and “threshold” [12]. If the value of the entropy exceeds the threshold, the query is an ambiguous query. Detecting ambiguity using entropy and a threshold was found to be quite successful for Hindi language information retrieval. Ambiguity detection improves the performance of WSD-based applications. It reduces the load on the system by avoiding useless effort to disambiguate unambiguous queries, so that disambiguation is attempted only for ambiguous queries (that is, queries having polysemous words). Ambiguity resolution provides a robust mechanism for presenting results to a user for better comprehension of the contents of the result set.
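The entropy test described above can be sketched as follows. This is a minimal illustration rather than the method of [12]: it assumes that a probability distribution over the senses of a query term is already available, and both the function names and the threshold value of 0.5 are hypothetical.

```python
import math


def sense_entropy(sense_probs):
    """Shannon entropy (in bits) of a probability distribution over senses."""
    return -sum(p * math.log2(p) for p in sense_probs if p > 0)


def is_ambiguous(sense_probs, threshold=0.5):
    """Flag a term as ambiguous when its sense entropy exceeds the threshold.

    ASSUMPTION: the threshold value is illustrative; the source does not state it.
    """
    return sense_entropy(sense_probs) > threshold


# Hypothetical sense distributions for three query terms.
one_sense = [1.0]        # a single sense: entropy 0
two_senses = [0.5, 0.5]  # two equally likely senses: entropy 1 bit
skewed = [0.95, 0.05]    # one dominant sense: low entropy
```

With these toy inputs only `two_senses` passes the threshold, so only that term would be sent on to disambiguation.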

5. WORD SENSE DISAMBIGUATION

Subsequent to ambiguity detection, the next step is word sense disambiguation. The query consists of various terms. To disambiguate an ambiguous query we take the first ten snippets retrieved for the query. Our approach aims to automatically find the correct sense of a query term. It is based on the hypothesis of Agirre et al. [13], which is as follows:

1. The meaning of a word can be discovered from the words around it.

2. Semantically related words that impose constraints on each other are expected to be topically related.

The disambiguation of a query term consists of the following steps:

Step 1
a) Take a query term.
b) Retrieve all synsets for the term from Hindi WordNet.

Step 2
a) Retrieve the first snippet from the snippets file.
b) Tokenize the snippet. (Stop words, namely कारक (prepositions) such as ने, को (to), से (from), के लिए (for), में (in), and योजक (conjunctions) such as या (or), किन्तु (but), परन्तु (but), क्योंकि (because), तथा (and), अन्यथा (otherwise), and special characters such as *, &, #, ||, etc., have already been removed from the snippets file.)
c) Take each term from the retrieved snippet and count its occurrences in the hypernyms, gloss and test corpus against the first synset. This value is the Sense Count (SC) for the ith snippet.
d) Find the phrase (complete query) frequency by counting the occurrences of the phrase in the hypernyms, gloss, test corpus and snippet. This value is the phrase frequency.

Step 3
a) Evaluate the Sense Count for the snippet.
b) Repeat this process against each synset for the ith snippet.

Step 4
a) Finally calculate the HSC (Highest Sense Count), which is the highest value of the Sense Count over all the senses s of the query term:

$$\mathrm{HSC} = \max_{s} \mathrm{SC}_{s}$$

Equation 1

[Figure 2 content: query terms with their senses from English WordNet and POS (part of speech). Red (Noun): Red Color, Red River, Bolshy, Loss. Bat (Noun): Chiropteran, At-bat, Squash Racket, Cricket Bat.]

b) The sense having the highest Sense Count is the correct sense for the snippet.

c) Assign the sense number to the snippet.

Step 5
a) Find the frequency of each sense number assigned to the snippets.
b) The sense having the highest frequency is the correct sense for the query term.

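The following quantities are used in the above method:

Sense Count: the score evaluated for a snippet against a particular sense.

Phrase frequency: the count of the query phrase in the snippet and also in the hypernyms, gloss and test corpus.

Term frequency: the frequency of a snippet term in the hypernyms, gloss and test corpus.

Total terms: the total number of terms in the snippets.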

We use the concept of the Highest Sense Count (HSC), which is motivated by the approach of Klapaftis and Manandhar [14]. Their approach used the sum of the frequencies of the meronym, synonym and homonym terms to evaluate the Total Sense Score (TSS). In our approach we use the Sense Count (SC), which simply counts the occurrences of the terms of a snippet, in the context of a particular sense, in the hypernyms, gloss sentences and test corpus. We designed our own test corpus of example sentences in Hindi. We also calculate the phrase frequency, which counts the occurrences of the query phrase in the snippets and in the hypernyms, gloss and test corpus.

The HSC helps in disambiguating the sense of a query and also helps to select the relevant document snippets from the top ten document snippets retrieved by the system.
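The five steps of this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the way the Sense Count combines its ingredients (phrase frequency plus summed term frequencies, divided by the number of terms in the snippet) is an assumption, because the source names the quantities but not their exact combination. The toy sense contexts stand in for the hypernyms, gloss and test corpus of each Hindi WordNet synset, and transliterated or English tokens are used for readability.

```python
from collections import Counter


def count_phrase(tokens, phrase):
    """Count contiguous occurrences of the token list `phrase` in `tokens`."""
    n = len(phrase)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def sense_count(snippet, phrase, sense_context):
    """Assumed Sense Count of one snippet against one sense.

    ASSUMPTION: (phrase frequency + summed term frequencies) / snippet length.
    """
    if not snippet:
        return 0.0
    context_freq = Counter(sense_context)
    term_freq = sum(context_freq[t] for t in snippet)
    phrase_freq = count_phrase(snippet, phrase) + count_phrase(sense_context, phrase)
    return (phrase_freq + term_freq) / len(snippet)


def disambiguate(snippets, phrase, senses):
    """Pick the sense with the highest Sense Count per snippet, then vote."""
    if not senses:
        return None
    votes = Counter()
    for snippet in snippets:
        scores = {sid: sense_count(snippet, phrase, ctx) for sid, ctx in senses.items()}
        best = max(scores, key=scores.get)  # Highest Sense Count for this snippet
        if scores[best] > 0:
            votes[best] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical sense contexts (hypernyms + gloss + example sentences) for "kalam".
senses = {
    "pen": ["write", "ink", "paper"],
    "branch": ["plant", "graft", "rose", "garden"],
}
# Hypothetical tokenized snippets with stop words already removed.
snippets = [
    ["rose", "plant", "graft", "kalam"],
    ["rose", "kalam", "garden"],
    ["ink", "kalam", "paper", "write"],
]
chosen = disambiguate(snippets, ["rose", "kalam"], senses)
```

In this toy run two of the three snippets vote for the "branch" sense, so `chosen` is `"branch"`; the snippets that voted for the winning sense are the ones that would be kept as the relevant set.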

6. QUERY EXPANSION

Once the relevant set of document snippets has been selected with the help of HSC, the next step is to expand the query. Our query expansion technique is based on the Vector Space Model (VSM). The VSM has been a standard model for representing documents in information retrieval for almost three decades [15] [16]. In the VSM, all documents and queries are represented as vectors. It is a fundamental model for web search engines. Query expansion is one of the techniques used to improve the performance of a web information retrieval system. When query expansion is implemented in a web search engine, it involves evaluating the user's input and expanding the search query to match additional documents.
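As a reminder of what representing queries and documents as vectors means in practice, the sketch below compares a query and a document by the cosine of the angle between their term vectors. It is a generic VSM illustration with raw term counts as vector components (an assumption made for brevity; practical systems usually use tf-idf weights), and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(query_tokens, doc_tokens):
    """Cosine similarity between raw term-count vectors of a query and a document."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


same = cosine_similarity(["gulaab", "kalam"], ["gulaab", "kalam"])
disjoint = cosine_similarity(["gulaab"], ["syahi"])
```

Adding an expansion term to the query adds a component to its vector, which is how expansion lets the query match additional documents.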

Another technique for query expansion, especially in the case of WSD, is to find contextual key terms with which to expand the query. The problem of sense ambiguity degrades the precision of the retrieved result set. Using contextual terms for query expansion is therefore one feasible solution, and it results in an increased number of relevant documents compared with the original query. It is well known that terms related to the query occur in the same relevant documents and in close proximity to the query terms. A complete web document may be very large. Therefore, we take snippets of the returned document set to identify the key terms that can be used to expand the original query.

We have utilized the snippets returned by the search engine as a source of document summaries. The snippets served the ambiguity detection approach well, and the same set of snippets is used for query expansion. Our approach is quite close to Rocchio’s [17] vector space relevance feedback method for query expansion. Similar to Rocchio’s approach, we add expansion terms from the relevant documents to the query.

Here the document vector D denotes the collection of snippets returned for the top ten documents [18]. The next step after disambiguation is query expansion. For the selected sense, a new document vector is created which is a subset of the document vector D. This subset is the set of relevant document snippets selected with the help of HSC, and it is used further for query expansion.

In the next step, we calculate the weight of each term in the document snippets of this document vector. The initial weight is calculated by the simple formula Ctf*idf, where Ctf is the cumulative term frequency and idf is the inverse document frequency.

Hence the formula for the initial weight of term ti is as follows:

$$\text{initial weight}(t_i) = \mathrm{Ctf}_i \times \mathrm{idf}_i$$

Equation - 2

Here Ctfi denotes the total term frequency, that is, the number of times term ti occurs in the document vector. To reduce noise, only a single occurrence of a term is counted in an individual document snippet. Following the concept of “One Sense per Discourse” [19], a term holds the same sense throughout a document, so counting multiple occurrences of a term within a single document snippet is not justified and would only add noise. The idf is computed over the subset of the document vector D from the document frequency of ti, that is, the number of document snippets that contain the term ti.

In many cases various terms are observed to have similar weights. It is therefore not feasible to select the terms for query expansion simply by calculating their weights and selecting the highest weighted term. Another aspect of query expansion is the context of the terms in relation to the query. The context of a term is calculated from the distance between the query terms and the document key terms. To calculate the average distance of the highest weighted terms from the query terms, we used the following formula, where:

$\bar{d}_i$ denotes the average distance of the highest weighted term i in the document vector,

$d_{i,n}$ denotes the distance of the highest weighted term i in the nth document of the document vector,

$N_i$ denotes the total number of documents of the document vector which contain the highest weighted term i.

$$\bar{d}_i = \frac{1}{N_i} \sum_{n=1}^{N_i} d_{i,n}$$

Equation - 3

To calculate the distance of a term, we declare a window of size x around the query terms. If an individual term falls within this window, the term closest to the query terms is assigned distance weight 2, the next closest is assigned 1.5, and a term that falls at the edge of the window is assigned 1. The average distance is then calculated by Equation 3.
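The window-based distance weights and their average can be sketched as below. Several details are assumptions made for illustration: the window size of 3, the treatment of every position strictly between the nearest one and the window edge as "next" (weight 1.5), a weight of 0 for a term outside the window, and taking the best weight when a term occurs more than once in a snippet. The function names are hypothetical.

```python
def distance_weight(term_pos, query_positions, window=3):
    """Distance weight of one term occurrence relative to the nearest query term.

    ASSUMPTIONS: gap 1 -> 2.0, gap == window -> 1.0, anything in between -> 1.5,
    outside the window (or the term being a query term itself) -> 0.0.
    """
    gap = min(abs(term_pos - q) for q in query_positions)
    if gap == 0 or gap > window:
        return 0.0
    if gap == 1:
        return 2.0
    if gap == window:
        return 1.0
    return 1.5


def average_distance(term, query_terms, snippets, window=3):
    """Average distance weight over the snippets that contain the term."""
    weights = []
    for tokens in snippets:
        term_pos = [i for i, t in enumerate(tokens) if t == term]
        if not term_pos:
            continue  # only snippets containing the term are counted
        query_pos = [i for i, t in enumerate(tokens) if t in query_terms]
        if not query_pos:
            weights.append(0.0)
            continue
        weights.append(max(distance_weight(p, query_pos, window) for p in term_pos))
    return sum(weights) / len(weights) if weights else 0.0


# Hypothetical tokenized snippets (stop words removed, transliterated).
snippets = [
    ["paudhe", "gulaab", "kalam"],           # "paudhe" is adjacent to a query term
    ["kalam", "naya", "ugana", "paudhe"],    # "paudhe" sits at the window edge
    ["gulaab", "phool"],                     # "paudhe" absent, so not counted
]
avg = average_distance("paudhe", {"gulaab", "kalam"}, snippets)
```

Here the term receives weight 2 in the first snippet and 1 in the second, so `avg` is 1.5.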

The final weight is calculated by Equation 4, which gives the final weight of the highest weighted term i in terms of the following two quantities:

the initial weight of the highest weighted term i, and

the average distance of the highest weighted term i in the document vector.

Equation - 4

The term with the highest final weight is selected for query expansion. The new expanded query is submitted to the search engine, and relevance is assessed by calculating the precision and the similarity score of the results. The resulting result set has improved precision and similarity score.
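The selection of the expansion term can be sketched as follows. The source specifies only that the initial weight is Ctf*idf and that the final weight depends on the initial weight and the average distance, so three choices here are assumptions: the logarithmic form of idf, computing the document frequency over all retrieved snippets, and combining initial weight and average distance by multiplication. All names and the toy data are hypothetical.

```python
import math


def initial_weights(relevant_snippets, all_snippets):
    """Ctf * idf per term, counting a term at most once per snippet.

    ASSUMPTION: idf = log(number of snippets / document frequency), with the
    document frequency taken over all retrieved snippets.
    """
    n_docs = len(all_snippets)
    vocabulary = {t for s in relevant_snippets for t in s}
    weights = {}
    for term in vocabulary:
        ctf = sum(1 for s in relevant_snippets if term in s)
        df = sum(1 for s in all_snippets if term in s)
        if df == 0:
            continue  # relevant snippets are expected to be a subset of all snippets
        weights[term] = ctf * math.log(n_docs / df)
    return weights


def select_expansion_term(weights, avg_distance, query_terms):
    """Return the non-query term with the highest assumed final weight."""
    final = {
        term: w * avg_distance.get(term, 0.0)  # ASSUMPTION: product of the two
        for term, w in weights.items()
        if term not in query_terms
    }
    return max(final, key=final.get) if final else None


all_snippets = [
    ["paudhe", "gulaab", "kalam"],
    ["gulaab", "kalam", "paudhe", "mitti"],
    ["gulaab", "kalam", "likhna"],
    ["kalam", "syahi"],
]
relevant = all_snippets[:2]  # snippets assigned the selected sense
weights = initial_weights(relevant, all_snippets)
# Average distance weights as they might come out of the windowing step.
avg_distance = {"paudhe": 2.0, "mitti": 1.5}
expansion_term = select_expansion_term(weights, avg_distance, {"gulaab", "kalam"})
```

In this toy data "paudhe" and "mitti" tie on initial weight, which is exactly the situation the text describes; the average distance breaks the tie and `expansion_term` is `"paudhe"`.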

7. EXPERIMENTS AND RESULTS

Our experiments were conducted on Google’s database. Google provides an AJAX API that gives researchers access to its database. Since TREC queries are not available for Hindi, we collected a large set of queries from user logs and from the FIRE test collection (footnote 1) and finalized a sample set of 100 queries after discussion with Hindi language linguists. The web is a huge repository of information, so it is not possible to determine recall; we therefore used P@10 to evaluate the performance of the search engines.

To check the accuracy of the results we used three parameters: conventional precision, non-interpolated average precision and a three-level similarity score. Precision is simply the ratio of relevant results to the total results retrieved; we calculated it on the top ten results only. To calculate the similarity score we used the three-level similarity score formula of Li and Shang [20]. Given a query phrase q and a Web page x, a raw score A(q,x) is calculated from the number of occurrences in x of the sub-phrases of q of each length, where the length of a sub-phrase is the number of terms it contains, together with a constant corresponding to the weight for longer sub-phrases. The order of the terms in a sub-phrase is exactly the same as in the original query phrase q.

A(q,x) is then converted to a three-level similarity score S(q,x) through thresholds, namely 2 for most relevant, 1 for partially relevant and 0 for irrelevant. The raw score is converted to the three-level similarity score in order to judge the relevance level of the documents against the query more appropriately. However, this score is used for manual evaluation of relevance judgments.

The two constants used in the conversion have the values 0.1 and 1.0 respectively. The average similarity score is calculated for the ambiguous queries. All three parameters (average precision, non-interpolated average precision and average similarity score) are calculated for both the original query and the expanded query. The results in Tables 1 and 2 show a remarkable improvement in the precision and the similarity score of the queries.
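Two ingredients of the three-level similarity score can be illustrated without guessing the raw score formula of [20], which is not legible in this text: counting the ordered sub-phrases of each length, and mapping a raw score to the levels 2, 1 and 0. The sketch assumes that sub-phrases are contiguous runs of query terms, that 0.1 is the lower and 1.0 the upper threshold, and that the comparisons are inclusive; the function names are hypothetical.

```python
def subphrase_counts(query_tokens, page_tokens):
    """Occurrences in the page of the query's sub-phrases, grouped by length.

    ASSUMPTION: a sub-phrase is a contiguous run of query terms in query order.
    """
    m = len(query_tokens)
    counts = {}
    for length in range(1, m + 1):
        total = 0
        for start in range(m - length + 1):
            sub = query_tokens[start:start + length]
            total += sum(
                1
                for i in range(len(page_tokens) - length + 1)
                if page_tokens[i:i + length] == sub
            )
        counts[length] = total
    return counts


def three_level_score(raw_score, low=0.1, high=1.0):
    """Map a raw score to 2 (most relevant), 1 (partially relevant) or 0 (irrelevant)."""
    if raw_score >= high:
        return 2
    if raw_score >= low:
        return 1
    return 0


page = ["a", "red", "bat", "and", "a", "cricket", "bat"]
counts = subphrase_counts(["red", "bat"], page)
```

For this toy page the single terms occur three times in total and the full phrase "red bat" once, so `counts` is `{1: 3, 2: 1}`.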

A. Ambiguity Detection

The ambiguity detection approach was applied to the set of 100 queries, of which 57 queries were detected as ambiguous and 43 as unambiguous. The disambiguation approach was then applied to the 57 queries detected as ambiguous.

B. Results with Original Query

Once a query is detected as ambiguous, the next step is to disambiguate it. After disambiguation, query expansion follows. The average precision and average similarity score are calculated over the 57 queries detected as ambiguous. The results are given in Tables 1 and 2.

The query mentioned in Section 3, “गुलाब की कलम” {Gulaab ki Kalam} (Rose branch), is detected as ambiguous, and its original results show a P@10 of .40.
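For reference, P@10 and a non-interpolated average precision over the top ten results can be computed as in the sketch below. The relevance judgments are hypothetical and were chosen only so that P@10 comes out at .40, as for the example query. Because recall cannot be determined on the web, the average is taken over the relevant results actually retrieved, which is an assumption about how the measure was applied here.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top k results judged relevant (1 = relevant, 0 = not)."""
    return sum(relevance[:k]) / k


def average_precision(relevance, k=10):
    """Non-interpolated average precision over the top k results.

    ASSUMPTION: normalized by the number of relevant results retrieved.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / hits if hits else 0.0


# Hypothetical judgments for the top ten results of one query.
judgments = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
p10 = precision_at_k(judgments)
ap = average_precision(judgments)
```

With these judgments `p10` is 0.4, while `ap` is about 0.65 because relevant results near the top of the ranking contribute more.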

Before WSD and Query Expansion: Average P@10, Non-Interpolated Average [email protected] .72

After WSD and Query Expansion: Average P@10, Non-Interpolated Average [email protected] .95

Table 1: Average precision in the top 10 results on Google

Similarity scores | Relevant | Partially Relevant | Irrelevant
Before WSD and Query Expansion | 0% | 25% | 75%
After WSD and Query Expansion | 80% | 20% | 0%

Table 2: Similarity scores in top 10 results on Google

C. Results with Expanded Query

After disambiguation, we expanded the query by finding the highest weighted term closest to the query terms. With the expanded query we calculated the average precision and average similarity score again. The results are given in Tables 1 and 2.

1 www.isical.ac.in/~fire

For the example query, the P@10 value reached .80 after disambiguation and query expansion. The new query after expansion is “पौधे गुलाब की कलम” {Paudhe Gulaab ki Kalam} (Plant Rose Branch). Here the word “पौधे” {Paudhe} (Plant) is the highest weighted term found in the relevant snippets of this query.

The results show a clear improvement in precision. The dramatic increase in the three-level similarity score is consistent with the increase in precision. In the three-level relevance score method, we not only give higher scores to occurrences of sub-phrases with more distinct terms, but also consider the order of the query terms, because a different order may carry a different meaning. This is the reason for the 0% of relevant documents in the results before query expansion. However, we obtain 25% partially relevant documents, which are then used for query expansion. This is justified because the terms used to expand the query are selected from the results of the original query itself.

We compared our results with the original results of Google. There is a clear improvement in the relevance of the results, as is evident from Table 2. The improvement in the results after query expansion demonstrates its significance for improving the performance of web information retrieval.

Our approach overcomes the bottleneck of the approach of Pinto [7], who used WordNet for query expansion and obtained satisfactory results for short queries. That approach, however, also depends on correct disambiguation. In our approach this dependency is somewhat reduced by ambiguity detection. Our approach also gives good results for long queries. For example, the query “सबसे लंबी पद यात्रा” {Sabse lambi pad yatraa} (Longest tramp), which consists of four key terms, also benefits from this approach.

Our approach can easily be transferred to other Indian languages and to English. Since the approach simply uses the snippets returned for the query for disambiguation and query expansion, it can be implemented for other languages as well. However, the success rate for other languages may vary with the availability of content in a particular language.

8. CONCLUSION

The results show an overall increase of 45% in precision and 23% in non-interpolated average precision, as well as a significant improvement in the similarity score of the new result set. This is remarkable, especially for Hindi language information retrieval on the web. WSD is an imperative issue in information retrieval. Researchers have proposed numerous algorithms that show a significant improvement in the performance of Web IR for English. In general, all these algorithms use statistical formulae to find the contextual relation of the polysemous word with the other terms in the query. Some researchers have tried to find the context using pre-specified phrases for the words. In our approach, we utilize the retrieved result set itself for query expansion. Experiments have shown that WSD along with query expansion gives good results in improving the relevance of the results for Hindi language web information retrieval. However, our method faces a constraint: when the retrieved set of relevant documents is very small, it is difficult to find contextual keywords for the expansion of the query.

REFERENCES

[1] M. Sinha, M. Kumar, P. Pande, L. Kashyap and P. Bhattacharyya, Hindi Word Sense Disambiguation, in: International Symposium on Machine Translation, Natural Language Processing and Translation Support Systems, Delhi, India, 2004. http://www.cse.iitb.ac.in/~pb/papers/HindiWSD.pdf

[2] M. Lesk, Automated Sense Disambiguation Using Machine-readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone, in: Proceedings of the 1986 SIGDOC Conference, Toronto, Canada, 1986, pp. 24-26.

[3] N. Mishra, S. Yadav and T. J. Siddiqui, An Unsupervised Approach to Hindi Word Sense Disambiguation, in: Proceedings of the First International Conference on Intelligent Human Computer Interaction (IHCI 2009), organized by the Indian Institute of Information Technology, Allahabad, India, 2009, pp. 327-335.

[4] I. P. Klapaftis and S. Manandhar, Google and WordNet based Word Sense Disambiguation, in: Proceedings of the 22nd ICML Workshop on Learning & Extending Ontologies, Bonn, Germany, 2005.

[5] M. Á. R. Gaona, A. Gelbukh and S. Bandyopadhyay, Web-based variant of the Lesk approach to Word Sense Disambiguation, in: Proceedings of the Eighth Mexican International Conference on Artificial Intelligence, Guanajuato, Mexico, 2009, pp. 103-107.

[6] P. Katsiouli and T. Kalamboukis, An Evaluation of Greek-English Cross Language Retrieval within the CLEF Ad-Hoc Bilingual Task, in: CLEF 2009.

[7] F. Pinto and C. Sanjulián, Automatic query expansion and word sense disambiguation with long and short queries using WordNet under vector model, in: SISTEDES, 2008, pp. 17-23.

[8] J. Nie and F. Jin, Integrating Logical Operators in Query Expansion in Vector Space Model, in: Proceedings of the ACM SIGIR-2002 Workshop on Mathematical and Formal Methods in Information Retrieval, Tampere, Finland, 2002, pp. 77-88.

[9] R. Mandala and T. Tokunaga, Combining multiple evidence from different types of thesaurus for query expansion, in: Proceedings of ACM-SIGIR, 1999, pp. 191-197.

[10] E. M. Voorhees, Using WordNet to disambiguate word senses for text retrieval, in: Proceedings of ACM-SIGIR, Pittsburgh, 1993, pp. 171-180.

[11] S.K.M. Wong, W. Ziarko and P.C.N. Wong, Generalized vector space model in information retrieval, in: Proceedings of ACM-SIGIR, 1985, pp. 18-25.

[12] S. Dwivedi and P. Rastogi, An Entropy Based Method for Removing Web Query Ambiguity in Hindi Language, J. of Comp. Sci. 4 (2008) 762-767.

[13] E. Agirre, O. Ansa, E. Hovy and D. Martinez, Enriching Very Large Ontologies using the WWW, in: ECAI 2000 Workshop on Ontology Learning, Berlin, Germany, 2000.

[14] I. Klapaftis and S. Manandhar, Google & WordNet based Word Sense Disambiguation, in: Proceedings of the Workshop on Learning and Extending Ontologies by using Machine Learning methods, Bonn, Germany, 2005.

[15] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[16] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

[17] J. Rocchio, Relevance feedback in information retrieval, in: G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, NJ: Prentice-Hall, 1971, pp. 313-323.

[18] T. Nykiel and H. Rybinski, Word Sense Discovery for Web Information Retrieval, in: Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, 2008, pp. 267-274.

[19] W. A. Gale, K. W. Church and D. Yarowsky, One Sense Per Discourse, in: Proceedings of the Human Language Technology Conference, 1992, pp. 233-237.

[20] L. Li and Y. Shang, New statistical method for performance evaluation of search engines, in: Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence, Vancouver, BC, Canada, 2000, pp. 208-215.
