A Classifier-based Text Mining Approach for Evaluating Semantic
Relatedness Using Support Vector Machines
Chung-Hong Lee
Department of Electrical Engineering, National Kaohsiung University of Applied Sciences
Kaohsiung, TAIWAN
[email protected]
Hsin-Chang Yang
Department of Information Management, Chang Jung University
Tainan, TAIWAN
Abstract
The quantification of semantic relatedness among texts has been a challenging issue that pervades much of machine learning and natural language processing. This paper presents a hybrid text-mining approach for measuring semantic relatedness among texts. In this work we develop several text classifiers using the Support Vector Machine (SVM) method to support the acquisition of relatedness among texts. First, we utilize our previously developed text mining algorithms, including techniques based on the classification of texts in several text collections. We then employ various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research problems, such as information filtering and the re-categorization of documents returned by search engine queries.
1. Introduction
The analysis and organization of large document repositories is one of today's great challenges in machine learning, a key issue being the quantitative assessment of document relatedness. A sensible relatedness measure would offer answers to questions like: how related are two documents, and which documents match a given query best? As anyone who has done information retrieval or web searches using search engines will attest, it is rather discouraging to be told that a search has found thousands of documents when in fact most of the documents on the first screen (the highest ranked documents) are not relevant to the user. Eliminating the gap between the query results and the documents that satisfy the user's true information needs would allow research effort to be directed toward further enhancement. Examples of situations in which the acquisition of textual semantic relatedness can be employed are:
Email filtering. The user wishes to establish a personalized automatic junk email filter. In the learning phase the classifier has access to the user's past email files. It interactively brings up a past email and asks the user whether the displayed email is junk or not. Based on the user's judgment it brings up another email and queries the user. The process is repeated several times, and the result is an email filter tailored to that specific person.
Relevance feedback. The user wishes to sort through an internet search engine or database for items (articles, images, etc.) that are of personal interest: an "I'll know it when I see it" type of search. The search engine displays a list of resulting documents and the user indicates whether each item is interesting or not. Based on the user's judgments, the search engine brings up another item list from the internet. After several iterations, the system has learned to locate documents more precisely with the support of the classifier, and then returns a new list of items that it believes will be of interest to the user.
Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), 0-7695-2315-3/05 $20.00 IEEE
Both the filtering and relevance feedback problems are classification problems in which documents are assigned to one of two classes (relevant or not). Since which documents are considered relevant is user dependent, every user must construct a different training set. Generally speaking, in these cases the acquisition of textual relatedness can be achieved with the support of intelligent classifiers designed to meet specific information requirements (or topics) given by end users. Therefore, the focus of this work is on the development of a novel classifier-based technique that computes the relatedness of documents based on a specific training corpus of text documents, without requiring domain-specific knowledge.
1.1 Techniques applied to acquisition of
semantic relatedness
It is easy to confuse the several terminologies related to this topic; in our survey at least three different terms are used by different authors: semantic relatedness, semantic similarity, and semantic distance. The distinction between semantic relatedness and semantic similarity can be described by way of examples: "cars" and "gasoline" would seem to be more closely related than, say, "cars" and "bicycles", but the latter pair are certainly more similar. Similarity is thus a special case of semantic relatedness, and we adopt this viewpoint in this paper. Among the other relationships that the notion of relatedness encompasses are the various kinds of meronymy, antonymy, functional association, and other non-classical relations. The term semantic distance may cause even more confusion, as it can be used when talking about either similarity alone or relatedness in general. In this work, we focus on the issues associated with measuring semantic relatedness among texts.
The majority of approaches to measuring semantic relatedness operate over a semantic network, such as WordNet [4], [14]. WordNet is a
broad coverage semantic network established as an
attempt to model the lexical knowledge of a native
speaker of English [18]. In WordNet, English nouns, verbs, adjectives, and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept; the synsets are interlinked with a
variety of relations. A natural way to evaluate semantic relatedness in a WordNet taxonomy, given its graphical representation, is to evaluate the distance between the nodes corresponding to the items being compared: the shorter the path from one node to another, the more similar they are. Given multiple paths, one takes the length of the shortest one [16].
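The shortest-path idea can be sketched with a breadth-first search over a toy taxonomy. This is a minimal stand-in for a WordNet-style network: the concept graph below is illustrative and is not taken from any real lexical database.

```python
from collections import deque

# Toy is-a taxonomy standing in for a WordNet-style semantic network.
TAXONOMY = {
    "entity": ["vehicle", "substance"],
    "vehicle": ["car", "bicycle"],
    "substance": ["gasoline"],
    "car": [], "bicycle": [], "gasoline": [],
}

def undirected_edges(tree):
    """Treat each taxonomic link as an undirected edge for path finding."""
    edges = {node: set(children) for node, children in tree.items()}
    for parent, children in tree.items():
        for child in children:
            edges.setdefault(child, set()).add(parent)
    return edges

def shortest_path_length(tree, a, b):
    """Breadth-first search: length of the shortest path between two concepts."""
    edges = undirected_edges(tree)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for neighbor in edges.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None  # the concepts are not connected

# The shorter the path, the more similar the items:
print(shortest_path_length(TAXONOMY, "car", "bicycle"))   # 2 (via "vehicle")
print(shortest_path_length(TAXONOMY, "car", "gasoline"))  # 4 (via the root)
```

Under this measure "car" and "bicycle" (path length 2) come out more similar than "car" and "gasoline" (path length 4), matching the similarity-versus-relatedness distinction drawn above.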
Instead of WordNet, Rada's central knowledge source is MeSH (Medical Subject Headings) [15]. The network's 15,000 terms form a nine-level hierarchy that includes high-level nodes such as anatomy, organism, and disease, and is based on the BROADER-THAN relationship. The principal
assumption put forward by Rada is that the number of
edges between terms in the MeSH hierarchy is a
measure of conceptual distance between terms.
Despite its apparent simplicity, a widely acknowledged problem with the edge counting approach mentioned above is that it typically relies on the notion that links in the taxonomy represent uniform distances. This is not always true, since there is wide variability in the distance covered by a single taxonomic link, particularly when certain sub-taxonomies are much denser than others. In addition, the edge counting approaches must rely on a well-established lexical knowledge base as a source of the semantic network (e.g., WordNet) for computing semantic relatedness, so they are not well suited to applications in specific domains in which standard lexical knowledge bases are not available.
Recent work in computational linguistics suggests that large amounts of semantic information can be extracted automatically from large text corpora on the basis of lexical co-occurrence information. Such semantic information has become an important and useful representation of the content of each web page, particularly with the increasing availability of digital documents from all around the world. In this work we attempt to develop a novel algorithmic approach for extracting semantic information from web text corpora. Using a variation of automatic text categorization that applies Support Vector Machines (SVM), we have conducted several experiments using several text classifiers associated with related topics, based on SVM learning processes. Furthermore, when exposed to the classified texts, we employ a novel algorithm to measure the implicit semantic relatedness among them.
2. Automatic text categorization based on
support vector machines
The Support Vector Machine (SVM) is a relatively new learning technique for data classification. The goal of an SVM is to find a decision surface separating the training data samples into two classes, and to make decisions based on the support vectors, which are selected as the only effective elements of the training set. For text classification, an SVM makes decisions based on a globally optimized separating hyperplane: it simply finds out on which side of the hyperplane the test
pattern is located (see Figure 1). This characteristic makes SVMs highly competitive, compared with other pattern classification methods, in terms of predictive accuracy and efficiency. Various quadratic programming methods have been proposed and extensively studied to solve the SVM optimization problem. In particular, Joachims has done much research on the application of SVMs to text categorization [7].
Figure 1. SVM Classifier Structure [19]
2.1 How support vector machines work
When an SVM is constructed, two parallel hyperplanes are formed: one going through one or more examples of the non-relevant vectors and one going through one or more examples of the relevant vectors. Vectors lying on these hyperplanes are termed support vectors and in fact define the two hyperplanes. If we define the margin as the orthogonal distance between the two hyperplanes, then an SVM maximizes this margin. Equivalently, the optimal hyperplane is the one for which the distance to the nearest training vector is maximal.
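The margin maximization described above can be illustrated with a small sketch. This uses scikit-learn's `SVC` on a toy two-class data set; the library choice and the data are assumptions of this example, since the paper does not name its SVM implementation.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Two linearly separable classes of toy 2-D "document" vectors.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],   # non-relevant
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])  # relevant
y = np.array([0, 0, 0, 1, 1, 1])

# A hard-margin linear SVM (large C) maximizes the distance between the
# two bounding hyperplanes; the vectors lying on them are the support
# vectors, and they alone define the decision surface.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # orthogonal distance between the hyperplanes
print("support vectors:\n", clf.support_vectors_)
print("margin width: %.3f" % margin)
print("prediction for [4.2, 3.8]:", clf.predict([[4.2, 3.8]])[0])
```

A test pattern is classified purely by which side of the learned hyperplane it falls on, as the text above describes.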
3. System implementation
In this work we develop an approach applying a classifier-based technique with the Support Vector Machine (SVM) method to support the acquisition of relatedness among texts. We utilize our previously developed text mining algorithms and platforms [10], [11], [12], [20], including text mining techniques based on Support Vector Machines (SVM) and Self-Organizing Maps (SOM), for performing clustering and classification of texts in several text collections. After that, we employ SVM methods to acquire the relatedness of the target documents of the text mining process, in order to find the semantic connections and relatedness among the mined texts, as shown in Figure 2.
Figure 2. System Framework
The implementation of semantic relatedness
measures includes two subtasks concerning preparation
of the information sources. Our approach begins with a
standard practice in information retrieval to encode
documents with vectors, in which each component
corresponds to a different word, and the value of the
component reflects the frequency of word occurrence
in the document. Subsequently, we employed the SVM
technique for developing text classifiers.
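A minimal sketch of this document encoding (raw term frequencies over a corpus-wide vocabulary; the helper names and sample sentences are our own, not the paper's):

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct term in the corpus to a vector dimension."""
    vocab = sorted({term for doc in docs for term in doc.lower().split()})
    return {term: i for i, term in enumerate(vocab)}

def encode(doc, vocab):
    """Encode a document as a raw term-frequency vector."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    for term, freq in counts.items():
        if term in vocab:
            vec[vocab[term]] = freq
    return vec

docs = ["the market rallied", "the market fell the most"]
vocab = build_vocabulary(docs)
print(vocab)                    # each distinct word gets one dimension
print(encode(docs[1], vocab))   # component = frequency of that word
```

Each vector component corresponds to one distinct word, and its value is the number of times the word occurs in the document, exactly as described above.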
3.1 The reason for choosing SVM and not other classification techniques
In practice, the resulting dimensionality of the space is often extremely large, since the number of dimensions is determined by the number of distinct indexed terms in the corpus. As a result, techniques for controlling the dimensionality of the vector space are often required. Since SVM techniques can manage high-dimensional input spaces more effectively than other classification techniques, the need for time-consuming linguistic preprocessing (i.e., reduction of the dimensionality of the feature space) can be largely eliminated. Therefore, in this work we choose SVM methods as the major approach to classification. A comparison of the performance of SVM classifiers against other classifiers has also been carried out in our experimental work.
4. Acquisition of semantic relatedness
using support vector machines
In this section we introduce the implemented algorithms for the acquisition of semantic relatedness from text corpora by means of a machine learning technique, namely the Support Vector Machine. As stated above, the Support Vector Machine is one of the major statistical learning models. It basically provides a way to perform text categorization by producing a decision surface that separates the training data samples into two classes (Figure 1). As such, the resulting categories are capable of grouping semantically related texts, and further computing the degree of
semantic relatedness among the texts by means of our
developed algorithm.
Figure 3. Analyzing semantic relatedness among texts using SVM classifiers with the One-Against-All (OAA) technique
Again, the algorithm employed for the acquisition of semantic relatedness among texts is based on multiple categorizing processes using SVM classifiers with the One-Against-All (OAA) method (see Figure 3). Instead of a numerical relatedness or similarity value, each of the measures that we tested returns one of two classes, indicating the related/unrelated judgment required by the algorithm. We therefore set, for each measure, the threshold of relatedness at which it separates the more semantically related texts from the less related ones. According to the results of the measures, the degree of semantic relatedness of the tested texts can be obtained and recorded in order to produce the final report of the acquisition of semantic relatedness among texts in the evaluated corpora. Figure 4 shows the resulting map of the textual relatedness evaluation using the SVM categorizing process. The most related texts were categorized into the S3 group, and in turn the less related texts were mapped into the S2, S1, and S0 text collections.
Figure 4. Resulting map of textual relatedness evaluation using the SVM categorizing process
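The One-Against-All scheme can be sketched as follows: one binary SVM is trained per topic, and a document is assigned the topic whose classifier gives the largest decision value. This sketch assumes scikit-learn (`LinearSVC`, `CountVectorizer`) and a hypothetical four-document corpus; the paper's actual training data and feature selection are not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny labeled corpus; the topic labels mirror two of the paper's classes.
train_docs = [
    "stocks and bonds rallied on the market today",
    "the market fell as investors sold stocks",
    "the election results surprised the party leaders",
    "parliament debated the new election law",
]
train_labels = ["Finance", "Finance", "Politics", "Politics"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)

# One-Against-All: one binary SVM per topic, where that topic's documents
# are the positive class and everything else is the negative class.
classifiers = {}
for topic in set(train_labels):
    y = np.array([1 if lab == topic else 0 for lab in train_labels])
    classifiers[topic] = LinearSVC(C=1.0).fit(X, y)

def classify(doc):
    """Assign the topic whose binary SVM gives the highest decision value."""
    x = vectorizer.transform([doc])
    scores = {t: clf.decision_function(x)[0] for t, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(classify("investors watched the stocks market"))  # Finance
```

Thresholding the per-topic decision values, rather than taking only the arg-max, is what yields the graded S0-S3 relatedness groups described above.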
5. Related work
Text mining is a new interdisciplinary field. It
combines the disciplines of data mining, information
extraction, information retrieval, text categorization,
machine learning, and computational linguistics to
discover structure, patterns, and knowledge in large
textual corpora. With the huge amount of information available online, the World Wide Web has become a fertile area for text mining research. Web content includes unstructured data such as free text, semi-structured data such as HTML documents, and more structured data such as tabular data in databases. However, much of the Web content is unstructured text. The research on applying knowledge discovery techniques to unstructured text is termed knowledge discovery in texts (KDT) [1], text data mining [5], or simply text mining.
Advances in computational resources and new
statistical algorithms for text analysis have helped text
mining develop as a field. Recently there have been
some innovative techniques developed for text mining.
For example, Feldman uses text category labels (associated with the Reuters newswire) to find unexpected patterns among text articles [1], [2], [3]. Text mining using self-organizing map (SOM) techniques has already gained some attention in the knowledge discovery and information retrieval fields. The paper of [13] perhaps marks the first attempt to utilize the SOM (an unsupervised neural network) for information retrieval. In that paper, however, the document representation is built from 25 manually selected index terms and is thus not really realistic.
In addition, among the most influential work we certainly have to mention WEBSOM [6], [8], [9]. That work aims at constructing methods for exploring full-text document collections; WEBSOM started from Honkela's suggestion of using self-organizing semantic maps [17] as a preprocessing stage for encoding documents. Such maps are, in turn, used to automatically organize (i.e., cluster) documents according to the words that they contain. When the documents are organized, following the steps in the preprocessing stage, on a map in such a way that nearby locations contain similar documents, exploration of the collection is facilitated by the
intuitive neighborhood relations. Thus, users can easily
navigate a word category map and zoom in on groups
of documents related to a specific group of words.
6. Experimental results
In this work we develop several text classifiers, including Support Vector Machine (SVM) methods, to support relatedness measurement among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employed various SVM classifiers to categorize the target documents for evaluating relatedness. The experiments (Figure 5 and Figure 6)
used a random set of relevant and non-relevant
documents in a corpus. In Figure 5, we show the recall
ratios for five topics (classes), including Finance,
Politics, Movies, Sports and Tech. In the testing
process, we first assume that topic number one is the relevant topic and all others are non-relevant; then we assume topic number two is the relevant topic and all others non-relevant, and so on. Recall is defined as the number of relevant documents actually retrieved, as a function of the iteration, divided by the number of relevant documents in the collection. In order to compare the performance of various classifiers on the same topic, it is reasonable to compare SVM-based classifiers with various kernel functions (i.e., Gaussian, exponential, and polynomial functions) against other classifiers, including an artificial neural network (ANN) and the kNN algorithm. For each topic (class), we use a trained classifier to perform text classification, i.e., to separate relevant from non-relevant documents. From the results shown in Figure 5, the SVM classifier with a Gaussian kernel was superior to the other classifiers, including the SVM classifiers with other kernel functions.
Figure 5. Resulting recall ratios
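Recall as defined above can be computed in a few lines; the document ids below are hypothetical, not drawn from the paper's corpus.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hit = len(set(retrieved) & relevant)
    return hit / len(relevant)

# Hypothetical document ids: 5 relevant documents, 4 of them retrieved.
relevant_docs = {1, 2, 3, 4, 5}
retrieved_docs = {2, 3, 4, 5, 9, 12}
print(recall(retrieved_docs, relevant_docs))  # 0.8
```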
Also, the resulting F1 values (Eq. 1) could be obtained as shown in Figure 6.

F1 = (2 × Precision × Recall) / (Precision + Recall)    (1)
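Eq. 1 in code form; the precision/recall values used in the call are illustrative, not taken from the paper's results.

```python
def f1_score(precision, recall):
    """F1: the harmonic mean of precision and recall (Eq. 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9968, 0.80))  # ~0.8876: F1 is pulled toward the lower of the two
```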
[Figure 6 plots F1 (30-100) against the dimension of the feature space (0-2000) for the Gaussian RBF, polynomial, and exponential RBF kernels.]
Figure 6. Resulting F1 ratios
7. Conclusions
This paper has presented a hybrid text-mining approach for evaluating relatedness among texts. In this work we developed several text classifiers using Support Vector Machine (SVM) methods to support the acquisition of relatedness among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employed various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research problems, such as information filtering and the re-categorization of documents returned by search engine queries. Experimental results show that the technique performs well in practice, successfully adapting the classification function of SVM-based classifiers to the acquisition of relatedness among texts.
8. References
[1] R. Feldman and I. Dagan, "KDT - Knowledge Discovery in Texts," In Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[2] R. Feldman, W. Klosgen, and A. Zilberstein, "Visualization Techniques to Explore Data Mining Results for Document Collections," In Proceedings of the Third Annual Conference on Knowledge Discovery and Data Mining (KDD), Newport Beach, 1997.
[3] R. Feldman, I. Dagan, and H. Hirsh, "Mining Text Using Keyword Distributions," Journal of Intelligent Information Systems, Vol. 10, pp. 281-300, 1998.
[4] C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, MA, 1998.
Recall ratios (%) for each class and classifier (data of Figure 5):

Class    | ANN   | kNN   | SVM (Gaussian) | SVM (Exponential) | SVM (Polynomial)
---------|-------|-------|----------------|-------------------|-----------------
Finance  | 91.13 | 78.57 | 99.68          | 54.43             | 86.70
Politics | 92.37 | 84.75 | 97.46          | 39.24             | 82.28
Movies   | 91.74 | 79.63 | 98.37          | 39.87             | 73.51
Sports   | 90.02 | 76.54 | 94.18          | 40.56             | 74.69
Tech.    | 89.97 | 86.31 | 98.86          | 38.60             | 75.94
[5] M.A. Hearst, "Untangling Text Data Mining," In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
[6] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "Newsgroup Exploration with WEBSOM Method and Browsing Interface," Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.
[7] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," In Proceedings of the 10th European Conference on Machine Learning (ECML), Lecture Notes in Computer Science, Number 1398, pp. 137-142, Springer Verlag, 1998.
[8] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM--Self-Organizing Maps of Document Collections," Neurocomputing, Vol. 21, pp. 101-117, 1998.
[9] T. Kohonen, "Self-Organization of Very Large Document Collections: State of the Art," In Niklasson, L., Boden, M., and Ziemke, T., editors, Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks, Vol. 1, London, pp. 65-74, Springer, 1998.
[10] C.H. Lee and H.C. Yang, "A Web Text Mining Approach Based on Self-Organizing Map," In Proceedings of the ACM CIKM'99 2nd Workshop on Web Information and Data Management (WIDM'99), Kansas City, Missouri, USA, pp. 59-62, 1999.
[11] C.H. Lee and H.C. Yang, "A Text Data Mining Approach Using a Chinese Corpus Based on Self-Organizing Map," In Proceedings of the 4th International Workshop on Information Retrieval with Asian Languages, Taipei, Taiwan, pp. 19-22, 1999.
[12] C.H. Lee and H.C. Yang, "A Multilingual Text Mining Approach Based on Self-Organizing Maps," Applied Intelligence, Vol. 18(3), pp. 295-310, 2003.
[13] X. Lin, D. Soergel, and G. Marchionini, "A Self-Organizing Semantic Map for Information Retrieval," In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'91), Chicago, IL, 1991.
[14] G.A. Miller, et al., "Five Papers on WordNet," CSL Report 43, Princeton University, 1990; revised August 1993.
[15] R. Rada, et al., "Development and Application of a Metric on Semantic Nets," IEEE Transactions on Systems, Man, and Cybernetics, 19(1): 17-30, February 1989.
[16] P. Resnik, "Using Information Content to Evaluate Semantic Similarity," In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, pp. 448-453, 1995.
[17] H. Ritter and T. Kohonen, "Self-Organizing Semantic Maps," Biological Cybernetics, Vol. 61, pp. 241-254, 1989.
[18] R. Richardson and A.F. Smeaton, "Using WordNet in a Knowledge-based Approach to Information Retrieval," Working Paper CA-0395, School of Computer Applications, Dublin University, 1995.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer, N.Y., ISBN 0-387-94559-8, 1995.
[20] H.C. Yang and C.H. Lee, "Automatic Hypertext Construction through a Text Mining Approach by Self-Organizing Maps," In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001), Hong Kong, China, pp. 108-113, April 2001.