A Classifier-based Text Mining Approach for Evaluating Semantic
Relatedness Using Support Vector Machines
Chung-Hong Lee
Department of Electrical Engineering, National Kaohsiung University of Applied Sciences
Kaohsiung, TAIWAN
[email protected]
Hsin-Chang Yang
Department of Information Management, Chang Jung University
Tainan, TAIWAN
Abstract
The quantification of semantic relatedness among texts has been a challenging issue that pervades much of machine learning and natural language processing. This paper presents a hybrid text-mining approach for measuring semantic relatedness among texts. In this work we develop several text classifiers using the Support Vector Machine (SVM) method to support the acquisition of relatedness among texts. First, we utilize our previously developed text mining algorithms, including techniques based on the classification of texts in several text collections. We then employ various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research problems, such as information filtering and the re-categorization of documents returned by search engine queries.
1. Introduction
The analysis and organization of large document repositories is one of today's great challenges in machine learning, a key issue being the quantitative assessment of document relatedness. A sensible relatedness measure would offer answers to questions like: how related are two documents, and which documents match a given query best? As anyone who has done information retrieval or web searches using search engines will attest, it is rather discouraging to be told that a search has found thousands of documents when in fact most of the documents on the first screen (the highest ranked documents) are not relevant to the user. Eliminating the gap between the query results and the documents that satisfy the user's true information needs would allow research effort to be directed toward further enhancement. Examples of situations in which the acquisition of textual semantic relatedness can be employed are:
Email filtering. The user wishes to establish a personalized automatic junk email filter. In the learning phase the classifier has access to the user's past email files. It interactively brings up a past email and asks the user whether the displayed email is junk or not. Based on the user's judgment it brings up another email and queries the user. The process is repeated several times, and the result is an email filter tailored to that specific person.
Relevance feedback. The user wishes to sort through an internet search engine or database for items (articles, images, etc.) that are of personal interest: an "I'll know it when I see it" type of search. The search engine displays a list of resulting documents and the user indicates whether each item is interesting or not. Based on the user's judgments, the search engine brings up another item list from the internet. After several iterations, the system has learned to locate documents more precisely with the support of the classifier, and then returns a new list of items that it believes will be of interest to the user.
Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), 0-7695-2315-3/05 $20.00 IEEE
Both the filtering and relevance feedback problems are classification problems in which documents are assigned to one of two classes (relevant or not). Since which documents are considered relevant is user dependent, every user must construct a different training set. Generally speaking, in these cases the acquisition of textual relatedness can be achieved with the support of intelligent classifiers designed to meet specific information requirements (or topics) given by end users. Therefore, the focus of this work is on the development of a novel classifier-based technique that computes the relatedness of documents based on a specific training corpus of text documents, without requiring domain-specific knowledge.
1.1 Techniques applied to acquisition of
semantic relatedness
It is easy to confuse the several terminologies related to this topic; in our survey at least three different terms are used by different authors: semantic relatedness, semantic similarity, and semantic distance. The distinction between semantic relatedness and semantic similarity can be described by way of examples: "cars" and "gasoline" would seem to be more closely related than, say, "cars" and "bicycles", but the latter pair are certainly more similar. Similarity is thus a special case of semantic relatedness, and we adopt this viewpoint in this paper. Among the other relationships that the notion of relatedness encompasses are the various kinds of meronymy, antonymy, functional association, and other non-classical relations. The term semantic distance may cause even more confusion, as it can be used when talking about either similarity alone or relatedness in general. In this work, we focus on the issues associated with measuring semantic relatedness among texts.
The majority of approaches to measuring semantic relatedness operate over a semantic network, such as WordNet [4], [14]. WordNet is a
broad coverage semantic network established as an
attempt to model the lexical knowledge of a native
speaker of English [18]. In WordNet, English nouns, verbs, adjectives, and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept; the synsets are interlinked with a
variety of relations. A natural way to evaluate semantic relatedness in a WordNet taxonomy, given its graphical representation, is to evaluate the distance between the nodes corresponding to the items being compared: the shorter the path from one node to another, the more similar they are. Given multiple paths, one takes the length of the shortest one [16].
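The shortest-path idea can be sketched with a breadth-first search over a toy taxonomy. This is a minimal stand-in for a WordNet-style network: the concept graph below is illustrative and is not taken from any real lexical database.

```python
from collections import deque

# Toy is-a taxonomy standing in for a WordNet-style semantic network.
TAXONOMY = {
    "entity": ["vehicle", "substance"],
    "vehicle": ["car", "bicycle"],
    "substance": ["gasoline"],
    "car": [], "bicycle": [], "gasoline": [],
}

def undirected_edges(tree):
    """Treat each taxonomic link as an undirected edge for path finding."""
    edges = {node: set(children) for node, children in tree.items()}
    for parent, children in tree.items():
        for child in children:
            edges.setdefault(child, set()).add(parent)
    return edges

def shortest_path_length(tree, a, b):
    """Breadth-first search: length of the shortest path between two concepts."""
    edges = undirected_edges(tree)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for neighbor in edges.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None  # the concepts are not connected

# The shorter the path, the more similar the items:
print(shortest_path_length(TAXONOMY, "car", "bicycle"))   # 2 (via "vehicle")
print(shortest_path_length(TAXONOMY, "car", "gasoline"))  # 4 (via the root)
```

Under this measure "car" and "bicycle" (path length 2) come out more similar than "car" and "gasoline" (path length 4), matching the similarity-versus-relatedness distinction drawn above.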
Instead of WordNet, Rada's central knowledge source is MeSH (Medical Subject Headings) [15]. The network's 15,000 terms form a nine-level hierarchy that includes high-level nodes such as anatomy, organism, and disease, and is based on the BROADER-THAN relationship. The principal
assumption put forward by Rada is that the number of
edges between terms in the MeSH hierarchy is a
measure of conceptual distance between terms.
Despite its apparent simplicity, a widely acknowledged problem with the edge counting approach mentioned above is that it typically relies on the notion that links in the taxonomy represent uniform distances. This is not always true, since there is wide variability in the distance covered by a single taxonomic link, particularly when certain sub-taxonomies are much denser than others. In addition, the edge counting approaches must rely on a well-established lexical knowledge base as a source of the semantic network (e.g., WordNet) for computing semantic relatedness, so they are not well suited to applications in specific domains in which standard lexical knowledge bases are not available.
Recent work in computational linguistics suggests that large amounts of semantic information can be extracted automatically from large text corpora on the basis of lexical co-occurrence information. Such semantic information has become an important and useful representation of the content of each web page, particularly with the increasing availability of digital documents from all around the world. In this work we attempt to develop a novel algorithmic approach for extracting semantic information from web text corpora. Using a variation of automatic text categorization that applies Support Vector Machines (SVM), we have conducted several experiments using several text classifiers associated with related topics, based on SVM learning processes. Furthermore, when exposed to the classified texts, we employ a novel algorithm to measure the implicit semantic relatedness among them.
2. Automatic text categorization based on
support vector machines
The Support Vector Machine (SVM) is a relatively new learning technique for data classification. The goal of an SVM is to find a decision surface separating the training data samples into two classes, and to make decisions based on the support vectors, which are selected as the only effective elements of the training set. For text classification, an SVM makes decisions based on a globally optimized separating hyperplane: it simply finds out on which side of the hyperplane the test
pattern is located (see Figure 1). This characteristic makes SVMs highly competitive, compared with other pattern classification methods, in terms of predictive accuracy and efficiency. Various quadratic programming methods have been proposed and extensively studied to solve the SVM optimization problem. In particular, Joachims has done much research on the application of SVMs to text categorization [7].
Figure 1. SVM Classifier Structure [19]
2.1 How support vector machines work
When an SVM is constructed, two parallel hyperplanes are formed: one going through one or more examples of the non-relevant vectors and one going through one or more examples of the relevant vectors. Vectors lying on these hyperplanes are termed support vectors and in fact define the two hyperplanes. If we define the margin as the orthogonal distance between the two hyperplanes, then an SVM maximizes this margin. Equivalently, the optimal hyperplane is the one for which the distance to the nearest training vector is maximal.
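The margin maximization described above can be illustrated with a small sketch. This uses scikit-learn's `SVC` on a toy two-class data set; the library choice and the data are assumptions of this example, since the paper does not name its SVM implementation.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Two linearly separable classes of toy 2-D "document" vectors.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],   # non-relevant
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])  # relevant
y = np.array([0, 0, 0, 1, 1, 1])

# A hard-margin linear SVM (large C) maximizes the distance between the
# two bounding hyperplanes; the vectors lying on them are the support
# vectors, and they alone define the decision surface.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # orthogonal distance between the hyperplanes
print("support vectors:\n", clf.support_vectors_)
print("margin width: %.3f" % margin)
print("prediction for [4.2, 3.8]:", clf.predict([[4.2, 3.8]])[0])
```

A test pattern is classified purely by which side of the learned hyperplane it falls on, as the text above describes.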
3. System implementation
In this work we develop an approach applying a classifier-based technique with the Support Vector Machine (SVM) method to support the acquisition of relatedness among texts. We utilize our previously developed text mining algorithms and platforms [10], [11], [12], [20], including text mining techniques based on Support Vector Machines (SVM) and Self-Organizing Maps (SOM), for performing clustering and classification of texts in several text collections. After that, we employ SVM methods to acquire the relatedness of the target documents of the text mining process, in order to find the semantic connections and relatedness among the mined texts, as shown in Figure 2.
Figure 2. System Framework
The implementation of semantic relatedness
measures includes two subtasks concerning preparation
of the information sources. Our approach begins with a
standard practice in information retrieval to encode
documents with vectors, in which each component
corresponds to a different word, and the value of the
component reflects the frequency of word occurrence
in the document. Subsequently, we employed the SVM
technique for developing text classifiers.
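A minimal sketch of this document encoding (raw term frequencies over a corpus-wide vocabulary; the helper names and sample sentences are our own, not the paper's):

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct term in the corpus to a vector dimension."""
    vocab = sorted({term for doc in docs for term in doc.lower().split()})
    return {term: i for i, term in enumerate(vocab)}

def encode(doc, vocab):
    """Encode a document as a raw term-frequency vector."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    for term, freq in counts.items():
        if term in vocab:
            vec[vocab[term]] = freq
    return vec

docs = ["the market rallied", "the market fell the most"]
vocab = build_vocabulary(docs)
print(vocab)                    # each distinct word gets one dimension
print(encode(docs[1], vocab))   # component = frequency of that word
```

Each vector component corresponds to one distinct word, and its value is the number of times the word occurs in the document, exactly as described above.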
3.1 The reason for choosing SVM and not other classification techniques
In practice, the resulting dimensionality of the space is often extremely large, since the number of dimensions is determined by the number of distinct indexed terms in the corpus. As a result, techniques for controlling the dimensionality of the vector space are often required. Since SVM techniques can manage high-dimensional input spaces more effectively than other classification techniques, the need for time-consuming linguistic preprocessing (i.e., reduction of the dimensionality of the feature space) can be largely eliminated. Therefore, in this work we choose SVM methods as the major approach to classification. A comparison of the performance of SVM classifiers against other classifiers has also been carried out in our experimental work.
4. Acquisition of semantic relatedness
using support vector machines
In this section we introduce the implemented algorithms for the acquisition of semantic relatedness from text corpora by means of a machine learning technique, namely the Support Vector Machine. As stated above, the Support Vector Machine is one of the major statistical learning models. It basically provides a way to perform text categorization by producing a decision surface that separates the training data samples into two classes (Figure 1). As such, the resulting categories are capable of grouping semantically related texts, and further computing the degree of
semantic relatedness among the texts by means of our
developed algorithm.
Figure 3. Analyzing semantic relatedness among texts using SVM classifiers with the One-Against-All (OAA) technique
Again, the algorithm employed for the acquisition of semantic relatedness among texts is based on multiple categorizing processes using SVM classifiers with the One-Against-All (OAA) method (see Figure 3). Instead of a numerical relatedness or similarity value, each of the measures that we tested returns one of two classes, indicating the related/unrelated judgment required by the algorithm. We therefore set, for each measure, the threshold of relatedness at which it separates the more semantically related texts from the less related ones. According to the results of the measures, the degree of semantic relatedness of the tested texts can be obtained and recorded in order to produce the final report of the acquisition of semantic relatedness among texts in the evaluated corpora. Figure 4 shows the resulting map of the textual relatedness evaluation using the SVM categorizing process. The most related texts were categorized into the S3 group, and in turn the less related texts were mapped into the S2, S1, and S0 text collections.
Figure 4. Resulting map of textual relatedness evaluation using the SVM categorizing process
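The One-Against-All scheme can be sketched as follows: one binary SVM is trained per topic, and a document is assigned the topic whose classifier gives the largest decision value. This sketch assumes scikit-learn (`LinearSVC`, `CountVectorizer`) and a hypothetical four-document corpus; the paper's actual training data and feature selection are not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny labeled corpus; the topic labels mirror two of the paper's classes.
train_docs = [
    "stocks and bonds rallied on the market today",
    "the market fell as investors sold stocks",
    "the election results surprised the party leaders",
    "parliament debated the new election law",
]
train_labels = ["Finance", "Finance", "Politics", "Politics"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)

# One-Against-All: one binary SVM per topic, where that topic's documents
# are the positive class and everything else is the negative class.
classifiers = {}
for topic in set(train_labels):
    y = np.array([1 if lab == topic else 0 for lab in train_labels])
    classifiers[topic] = LinearSVC(C=1.0).fit(X, y)

def classify(doc):
    """Assign the topic whose binary SVM gives the highest decision value."""
    x = vectorizer.transform([doc])
    scores = {t: clf.decision_function(x)[0] for t, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(classify("investors watched the stocks market"))  # Finance
```

Thresholding the per-topic decision values, rather than taking only the arg-max, is what yields the graded S0-S3 relatedness groups described above.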
5. Related work
Text mining is a new interdisciplinary field. It
combines the disciplines of data mining, information
extraction, information retrieval, text categorization,
machine learning, and computational linguistics to
discover structure, patterns, and knowledge in large
textual corpora. With the huge amount of information available online, the World Wide Web has become a fertile area for text mining research. Web content includes unstructured data such as free text, semi-structured data such as HTML documents, and more structured data such as tabular data in databases. However, much of the Web content is unstructured text. The research on applying knowledge discovery techniques to unstructured text is termed knowledge discovery in texts (KDT) [1], text data mining [5], or simply text mining.
Advances in computational resources and new
statistical algorithms for text analysis have helped text
mining develop as a field. Recently there have been
some innovative techniques developed for text mining.
For example, Feldman uses text category labels (associated with the Reuters newswire) to find unexpected patterns among text articles [1], [2], [3]. Text mining using self-organizing map (SOM) techniques has already gained some attention in the knowledge discovery and information retrieval fields. The paper of [13] perhaps marks the first attempt to utilize the SOM (an unsupervised neural network) for information retrieval. In that paper, however, the document representation is built from 25 manually selected index terms and is thus not really realistic.
In addition, among the most influential work we certainly have to mention WEBSOM [6], [8], [9]. That work aims at constructing methods for exploring full-text document collections; WEBSOM started from Honkela's suggestion of using self-organizing semantic maps [17] as a preprocessing stage for encoding documents. Such maps are, in turn, used to automatically organize (i.e., cluster) documents according to the words that they contain. When the documents are organized, following the steps in the preprocessing stage, on a map in such a way that nearby locations contain similar documents, exploration of the collection is facilitated by the
intuitive neighborhood relations. Thus, users can easily
navigate a word category map and zoom in on groups
of documents related to a specific group of words.
6. Experimental results
In this work we develop several text classifiers, including Support Vector Machine (SVM) methods, to support relatedness measurement among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employed various SVM classifiers to categorize the target documents for evaluating relatedness. The experiments (Figure 5 and Figure 6)
used a random set of relevant and non-relevant
documents in a corpus. In Figure 5, we show the recall
ratios for five topics (classes), including Finance,
Politics, Movies, Sports and Tech. In the testing
process, we first assume that topic number one is the relevant topic and all others are non-relevant; then we assume topic number two is the relevant topic and all others non-relevant, and so on. Recall is defined as the number of relevant documents actually retrieved, as a function of the iteration, divided by the number of relevant documents in the collection. In order to compare the performance of various classifiers on the same topic, it is reasonable to compare SVM-based classifiers with various kernel functions (i.e., Gaussian, exponential, and polynomial functions) against other classifiers, including an artificial neural network (ANN) and the kNN algorithm. For each topic (class), we use a trained classifier to perform text classification, i.e., to separate relevant from non-relevant documents. From the results shown in Figure 5, the SVM classifier with a Gaussian kernel was superior to the other classifiers, including the SVM classifiers with other kernel functions.
Figure 5. Resulting recall ratios
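Recall as defined above can be computed in a few lines; the document ids below are hypothetical, not drawn from the paper's corpus.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hit = len(set(retrieved) & relevant)
    return hit / len(relevant)

# Hypothetical document ids: 5 relevant documents, 4 of them retrieved.
relevant_docs = {1, 2, 3, 4, 5}
retrieved_docs = {2, 3, 4, 5, 9, 12}
print(recall(retrieved_docs, relevant_docs))  # 0.8
```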
Also, the resulting F1 values (Eq. 1) could be obtained as shown in Figure 6.

F1 = (2 × Precision × Recall) / (Precision + Recall)    (1)
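Eq. 1 in code form; the precision/recall values used in the call are illustrative, not taken from the paper's results.

```python
def f1_score(precision, recall):
    """F1: the harmonic mean of precision and recall (Eq. 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9968, 0.80))  # ~0.8876: F1 is pulled toward the lower of the two
```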
[Figure 6 plots F1 (30-100) against the dimension of the feature space (0-2000) for the Gaussian RBF, polynomial, and exponential RBF kernels.]
Figure 6. Resulting F1 ratios
7. Conclusions
This paper has presented a hybrid text-mining approach for evaluating relatedness among texts. In this work we developed several text classifiers using Support Vector Machine (SVM) methods to support the acquisition of relatedness among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employed various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research problems, such as information filtering and the re-categorization of documents returned by search engine queries. Experimental results show that the technique performs well in practice, successfully adapting the classification function of SVM-based classifiers to the acquisition of relatedness among texts.
8. References
[1] R. Feldman and I. Dagan, "KDT - Knowledge Discovery in Texts," In Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[2] R. Feldman, W. Klosgen, and A. Zilberstein, "Visualization Techniques to Explore Data Mining Results for Document Collections," In Proceedings of the Third Annual Conference on Knowledge Discovery and Data Mining (KDD), Newport Beach, 1997.
[3] R. Feldman, I. Dagan, and H. Hirsh, "Mining Text Using Keyword Distributions," Journal of Intelligent Information Systems, Vol. 10, pp. 281-300, 1998.
[4] C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, MA, 1998.
Recall ratios (%) for each class and classifier (data of Figure 5):

Class    | ANN   | kNN   | SVM (Gaussian) | SVM (Exponential) | SVM (Polynomial)
---------|-------|-------|----------------|-------------------|-----------------
Finance  | 91.13 | 78.57 | 99.68          | 54.43             | 86.70
Politics | 92.37 | 84.75 | 97.46          | 39.24             | 82.28
Movies   | 91.74 | 79.63 | 98.37          | 39.87             | 73.51
Sports   | 90.02 | 76.54 | 94.18          | 40.56             | 74.69
Tech.    | 89.97 | 86.31 | 98.86          | 38.60             | 75.94
[5] M.A. Hearst, "Untangling Text Data Mining," In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
[6] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "Newsgroup Exploration with WEBSOM Method and Browsing Interface," Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.
[7] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," In Proceedings of the 10th European Conference on Machine Learning (ECML), Lecture Notes in Computer Science, Number 1398, pp. 137-142, Springer Verlag, 1998.
[8] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM--Self-Organizing Maps of Document Collections," Neurocomputing, Vol. 21, pp. 101-117, 1998.
[9] T. Kohonen, "Self-Organization of Very Large Document Collections: State of the Art," In Niklasson, L., Boden, M., and Ziemke, T., editors, Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks, Vol. 1, London, pp. 65-74, Springer, 1998.
[10] C.H. Lee and H.C. Yang, "A Web Text Mining Approach Based on Self-Organizing Map," In Proceedings of the ACM CIKM'99 2nd Workshop on Web Information and Data Management (WIDM'99), Kansas City, Missouri, USA, pp. 59-62, 1999.
[11] C.H. Lee and H.C. Yang, "A Text Data Mining Approach Using a Chinese Corpus Based on Self-Organizing Map," In Proceedings of the 4th International Workshop on Information Retrieval with Asian Languages, Taipei, Taiwan, pp. 19-22, 1999.
[12] C.H. Lee and H.C. Yang, "A Multilingual Text Mining Approach Based on Self-Organizing Maps," Applied Intelligence, Vol. 18(3), pp. 295-310, 2003.
[13] X. Lin, D. Soergel, and G. Marchionini, "A Self-Organizing Semantic Map for Information Retrieval," In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'91), Chicago, IL, 1991.
[14] G.A. Miller, et al., "Five Papers on WordNet," CSL Report 43, Princeton University, 1990; revised August 1993.
[15] R. Rada, et al., "Development and Application of a Metric on Semantic Nets," IEEE Transactions on Systems, Man, and Cybernetics, 19(1): 17-30, February 1989.
[16] P. Resnik, "Using Information Content to Evaluate Semantic Similarity," In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, pp. 448-453, 1995.
[17] H. Ritter and T. Kohonen, "Self-Organizing Semantic Maps," Biological Cybernetics, Vol. 61, pp. 241-254, 1989.
[18] R. Richardson and A.F. Smeaton, "Using WordNet in a Knowledge-based Approach to Information Retrieval," Working Paper CA-0395, School of Computer Applications, Dublin University, 1995.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer, N.Y., ISBN 0-387-94559-8, 1995.
[20] H.C. Yang and C.H. Lee, "Automatic Hypertext Construction through a Text Mining Approach by Self-Organizing Maps," In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001), Hong Kong, China, pp. 108-113, April 2001.