
    A Classifier-based Text Mining Approach for Evaluating Semantic

    Relatedness Using Support Vector Machines

    Chung-Hong Lee

Department of Electrical Engineering, National Kaohsiung University of Applied Sciences

Kaohsiung, TAIWAN

[email protected]

    Hsin-Chang Yang

Department of Information Management, Chang Jung University

    Tainan, TAIWAN

    [email protected]

    Abstract

Quantifying semantic relatedness among texts is a challenging issue that pervades much of machine learning and natural language processing. This paper presents a hybrid text-mining approach for measuring semantic relatedness among texts. In this work we develop several text classifiers using the Support Vector Machines (SVM) method to support the acquisition of relatedness among texts. First, we utilized our previously developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employ various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research work, such as information filtering and re-categorizing the documents returned by search engine queries.

    1. Introduction

The analysis and organization of large document repositories is one of today's great challenges in machine learning, and a key issue is the quantitative assessment of document relatedness. A sensible relatedness measure would offer answers to questions like: how related are two documents, and which documents best match a given query? As anyone who has done information retrieval or web searches using search engines will attest, it is rather discouraging to be told that a search has found thousands of documents when in fact most of the documents on the first screen (the highest-ranked documents) are not relevant to the user. Eliminating the gap between query results and the documents that satisfy users' true information needs is therefore a worthwhile target for further research. Examples of situations in which acquisition of textual semantic relatedness can be employed are:

Email filtering. The user wishes to establish a personalized automatic junk email filter. In the learning phase the classifier has access to the user's past email files. It interactively brings up a past email and asks the user whether the displayed email is junk or not. Based on the user's judgment it brings up another email and queries the user again. The process is repeated several times, and the result is an email filter tailored to that specific person.

Relevance feedback. The user wishes to sort through an internet search engine or database for items (articles, images, etc.) that are of personal interest, an "I'll know it when I see it" type of search. The search engine displays a list of resulting documents and the user indicates whether each item is interesting or not. Based on the user's judgments, the search engine brings up another item list from the internet. After several iterations, the system has learned to locate documents more precisely with the support of the classifier, and returns a new list of items that it believes will be of interest to the user.


Both the filtering and relevance feedback problems are classification problems in which documents are assigned to one of two classes (relevant or not). Since what counts as a relevant document is user dependent, every user must construct a different training set. Generally speaking, in these cases acquisition of textual relatedness can be achieved with the support of intelligent classifiers designed to meet specific information requirements (or topics) given by end users. Therefore, the focus of this work is the development of a novel classifier-based technique that computes the relatedness of documents based on a specific training corpus of text documents, without requiring domain-specific knowledge.

    1.1 Techniques applied to acquisition of

    semantic relatedness

It is easy to become confused about the terminology related to this topic; in our survey at least three different terms are used by different authors: semantic relatedness, semantic similarity, and semantic distance. The distinction between semantic relatedness and semantic similarity can be described by way of example: cars and gasoline, it has been noted, would seem to be more closely related than, say, cars and bicycles, but the latter pair are certainly more similar. Similarity is thus a special case of semantic relatedness, and we adopt this viewpoint in this paper. Among the other relationships that the notion of relatedness encompasses are the various kinds of meronymy, antonymy, functional association, and other non-classical relations. The term semantic distance may cause even more confusion, as it can be used when talking about either similarity alone or relatedness in general. In this work, we focus on the issues associated with measuring semantic relatedness among texts.

The majority of approaches to measuring semantic relatedness work through a semantic network such as WordNet [4], [14]. WordNet is a broad-coverage semantic network established as an attempt to model the lexical knowledge of a native speaker of English [18]. In WordNet, English nouns, verbs, adjectives, and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept, and these synsets are interlinked by a variety of relations. A natural way to evaluate semantic relatedness in a WordNet taxonomy, given its graphical representation, is to evaluate the distance between the nodes corresponding to the items being compared: the shorter the path from one node to another, the more similar they are. Given multiple paths, one takes the length of the shortest one [16].
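
To make the edge-counting idea concrete, the following sketch (our illustration, not part of the surveyed systems) computes path-based relatedness between two words over WordNet using NLTK; the example word pairs are simply the ones discussed above.

# A minimal sketch of path-based WordNet relatedness, assuming NLTK is installed
# and its WordNet corpus has been downloaded (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def shortest_path_relatedness(word1, word2):
    # Return the best path similarity over all synset pairs, i.e. the score
    # corresponding to the shortest connecting path in the taxonomy.
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            score = s1.path_similarity(s2)  # equals 1 / (shortest path length + 1)
            if score is not None and score > best:
                best = score
    return best

# Cars and bicycles come out as more *similar* than cars and gasoline,
# even though cars and gasoline are closely *related*.
print(shortest_path_relatedness("car", "bicycle"))
print(shortest_path_relatedness("car", "gasoline"))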

Instead of WordNet, Rada's central knowledge source is MeSH (Medical Subject Headings) [15]. The network's 15,000 terms form a nine-level hierarchy that includes high-level nodes such as anatomy, organism, and disease, and is based on the BROADER-THAN relationship. The principal assumption put forward by Rada is that the number of edges between terms in the MeSH hierarchy is a measure of the conceptual distance between those terms.

Despite its apparent simplicity, a widely acknowledged problem with the edge-counting approach mentioned above is that it typically relies on the notion that links in the taxonomy represent uniform distances. This is not always true, because there is wide variability in the distance covered by a single taxonomic link, particularly when certain sub-taxonomies are much denser than others. In addition, edge-counting approaches must rely on a well-established lexical knowledge base (e.g., WordNet) as the source of the semantic network for computing semantic relatedness. They are therefore not well suited to applications in specific domains for which standard lexical knowledge bases are not available.

Recent work in computational linguistics suggests that large amounts of semantic information can be extracted automatically from large text corpora on the basis of lexical co-occurrence information. Such semantics is becoming an important and useful representation of the content of each web page, particularly with the increasing availability of digital documents from all around the world. In this work we attempt to develop a novel algorithmic approach for extracting semantic information from web text corpora. Using a variation of automatic text categorization that applies Support Vector Machines (SVM), we have conducted several experiments with text classifiers for a set of related topics, trained through SVM-based learning processes. Furthermore, once the texts have been classified, we employ a novel algorithm to measure the implicit semantic relatedness among them.

    2. Automatic text categorization based on

    support vector machines

Support Vector Machine (SVM) is a relatively new learning technique for data classification. The goal of SVM is to find a decision surface that separates the training data samples into two classes and to make decisions based on the support vectors, which are selected as the only effective elements of the training set. For text classification, SVM makes decisions based on the globally optimized separating hyperplane. It simply finds out on which side of the hyperplane the test


pattern is located (see Figure 1). This characteristic makes SVM highly competitive with other pattern classification methods in terms of predictive accuracy and efficiency. Various quadratic programming methods have been proposed and extensively studied to solve the SVM optimization problem. In particular, Joachims has done much research on the application of SVM to text categorization [7].

    Figure 1. SVM Classifier Structure [19]

    2.1 How support vector machines work

When an SVM is constructed, two parallel hyperplanes are formed, one going through one or more of the non-relevant vectors and one going through one or more of the relevant vectors. Vectors lying on these hyperplanes are termed support vectors and in fact define the two hyperplanes. If we define the margin as the orthogonal distance between the two hyperplanes, then an SVM maximizes this margin. Equivalently, the optimal separating hyperplane is the one for which the distance to the nearest vector is maximum.
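
To make this concrete, the sketch below (our illustration, with invented toy data) fits a linear SVM on a tiny two-class sample and reads back the support vectors that define the two bounding hyperplanes.

# Minimal sketch of margin maximization with a linear SVM, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: two separable clusters in the plane.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],    # "non-relevant" class
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])   # "relevant" class
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors lie on the two parallel hyperplanes that bound the margin;
# the learned separating hyperplane sits halfway between them.
print("support vectors:\n", clf.support_vectors_)
print("margin width:", 2.0 / np.linalg.norm(clf.coef_))
# Classification simply asks on which side of the hyperplane a test point falls.
print("prediction for [3, 3]:", clf.predict([[3.0, 3.0]]))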

    3. System implementation

In this work we develop an approach that applies a classifier-based technique with the Support Vector Machines (SVM) method to support the acquisition of relatedness among texts. We utilized our previously developed text mining algorithms and platforms [10], [11], [12], [20], including text mining techniques based on Support Vector Machines (SVM) and Self-Organizing Maps (SOM), for performing clustering and classification of texts in several text collections. After that, we employ SVM methods to deal with the acquisition of relatedness among the target documents of the text mining process, in order to find the semantic connections and relatedness among the mined texts, as shown in Figure 2.

Figure 2. System framework: text corpus -> feature selection of input texts -> SVM classifier-based categorization -> acquisition of textual semantic relatedness (Support Vector Machine algorithm)

The implementation of the semantic relatedness measures includes two subtasks concerning preparation of the information sources. Our approach begins with a standard practice in information retrieval: encoding documents as vectors, in which each component corresponds to a different word and its value reflects the frequency of that word's occurrence in the document. Subsequently, we employed the SVM technique to develop the text classifiers.
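
A minimal sketch of this encode-then-classify step is given below, assuming scikit-learn; the sample documents and topic labels are invented for illustration and are not from our corpus.

# Minimal sketch: term-frequency document vectors fed to a linear SVM classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["stocks fell as markets reacted to the rate decision",
        "the striker scored twice in the championship final",
        "the central bank raised interest rates again",
        "the home team won the match in extra time"]
labels = ["Finance", "Sports", "Finance", "Sports"]   # hypothetical topic labels

# Each vector component corresponds to a distinct word; its value is the
# frequency of that word in the document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse term-frequency matrix

classifier = LinearSVC()
classifier.fit(X, labels)

query = vectorizer.transform(["rates and markets moved sharply today"])
print(classifier.predict(query))            # expected: ['Finance']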

3.1 The reason for choosing SVM and not other classification techniques

In practice, the resulting dimensionality of the space is often extremely high, since the number of dimensions is determined by the number of distinct indexed terms in the corpus. As a result, techniques for controlling the dimensionality of the vector space are often required. Since SVM techniques can manage high-dimensional input spaces more effectively than other classification techniques, the need for time-consuming linguistic preprocessing (i.e., reduction of the dimensionality of the feature space) can be largely eliminated. Therefore, in this work we choose SVM methods as the major approach for classification. The comparison of performance between SVM classifiers and other classifiers has also been examined in our experimental work.

    4. Acquisition of semantic relatedness

    using support vector machines

In this section we introduce the implemented algorithms for the acquisition of semantic relatedness from text corpora by means of a machine learning technique, namely Support Vector Machines. As stated above, the Support Vector Machine is one of the major statistical learning models. It basically provides a way to perform text categorization by producing a decision surface that separates the training data samples into two classes (Figure 1). As such, the resulting categories are capable of grouping semantically related texts and, further, of computing the degree of


    semantic relatedness among the texts by means of our

    developed algorithm.

    Figure 3. Analyzing semantic relatedness

    among texts using SVM classifiers with One-Against-All (OAA) technique

Again, the algorithm employed for the acquisition of semantic relatedness among texts is based on multiple categorizing processes using SVM classifiers with the One-Against-All (OAA) method (see Figure 3). Instead of a numerical relatedness or similarity value, each of the measures that we tested returns one of two classes, indicating the related/unrelated judgment required by the algorithm. We therefore set, for each measure, the relatedness threshold at which it separates the more semantically related texts from the less related ones. According to the results of the measures, the degree of semantic relatedness of the tested texts can be obtained and recorded in order to produce the final report of the acquisition of semantic relatedness among texts in the evaluated corpora. Figure 4 shows the resulting map of textual relatedness evaluation using the SVM categorizing process. The most related texts were categorized into the S3 group, while less related texts were mapped into the S2, S1 and S0 text collections.

Figure 4. Resulting map of textual relatedness evaluation using the SVM categorizing process
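
The sketch below is our interpretation of how one-against-all SVM decision scores can be bucketed into discrete relatedness levels such as S0 to S3; the thresholds, documents, and topic names are assumptions made for illustration, not values from this paper.

# Minimal sketch: one-against-all SVM scores mapped to relatedness levels S0-S3.
# Assumes scikit-learn; the cut points below are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

train_docs = ["market shares and interest rates", "election and parliament vote",
              "film premiere and box office", "league match and final score"]
train_topics = ["Finance", "Politics", "Movies", "Sports"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)
oaa = OneVsRestClassifier(LinearSVC()).fit(X, train_topics)

def relatedness_level(doc, topic):
    # Signed distance to the topic's OAA hyperplane, bucketed into S0..S3.
    scores = oaa.decision_function(vec.transform([doc]))[0]
    score = scores[list(oaa.classes_).index(topic)]
    thresholds = [-0.5, 0.0, 0.5]                 # hypothetical cut points
    return "S" + str(sum(score > t for t in thresholds))

print(relatedness_level("rates rise as markets fall", "Finance"))  # most related
print(relatedness_level("the striker scored a goal", "Finance"))   # least related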

    5. Related work

In this paper we introduce the implemented algorithms for the acquisition of semantic relatedness from text corpora by means of a machine learning technique, namely Support Vector Machines. As stated above, the Support Vector Machine is one of the major statistical learning models. It basically provides a way to perform text categorization by producing a decision surface that separates the training data samples into two classes. As such, the resulting categories are capable of grouping semantically related texts and, further, of computing the degree of semantic relatedness among the texts by means of our developed algorithm.

Text mining is a new interdisciplinary field. It combines the disciplines of data mining, information extraction, information retrieval, text categorization, machine learning, and computational linguistics to discover structure, patterns, and knowledge in large textual corpora. With the huge amount of information available online, the World Wide Web has become a fertile area for text mining research. Web content data include unstructured data such as free text, semi-structured data such as HTML documents, and more structured data such as tabular data in databases. However, much of the Web content is unstructured text. The research on applying knowledge discovery techniques to unstructured text is termed knowledge discovery in texts (KDT) [1], text data mining [5], or simply text mining.

Advances in computational resources and new statistical algorithms for text analysis have helped text mining develop as a field. Recently, a number of innovative techniques have been developed for text mining. For example, Feldman uses text category labels (associated with Reuters newswire) to find unexpected patterns among text articles [1], [2], [3]. Text mining using self-organizing map (SOM) techniques has already gained some attention in the knowledge discovery and information retrieval fields. The paper of [13] perhaps marks the first attempt to utilize the SOM (an unsupervised neural network) for information retrieval. In that work, however, the document representation is built from 25 manually selected index terms and is thus not really realistic.

In addition, among the most influential work we certainly have to mention WEBSOM [6], [8], [9]. That work aims at constructing methods for exploring full-text document collections; WEBSOM started from Honkela's suggestion of using self-organizing semantic maps [17] as a preprocessing stage for encoding documents. Such maps are, in turn, used to automatically organize (i.e., cluster) documents according to the words that they contain. When the documents are organized on a map, following the steps of the preprocessing stage, in such a way that nearby locations contain similar documents, exploration of the collection is facilitated by the


    intuitive neighborhood relations. Thus, users can easily

    navigate a word category map and zoom in on groups

    of documents related to a specific group of words.

    6. Experimental results

In this work we develop several text classifiers, including Support Vector Machines (SVM) methods, to support relatedness measurement among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employ various SVM classifiers to categorize the target documents for evaluating relatedness. The experiments (Figure 5 and Figure 6) used a random set of relevant and non-relevant documents in a corpus. In Figure 5, we show the recall ratios for five topics (classes): Finance, Politics, Movies, Sports and Tech. In the testing process, we first assume that topic number one is the relevant topic and all others are non-relevant; then we assume that topic number two is the relevant topic and all others non-relevant, and so on. Recall is defined as the number of relevant documents actually retrieved, as a function of iteration, divided by the number of relevant documents in the collection. In order to compare the performance of various classifiers on the same topic, we cover SVM-based classifiers with various kernel functions (i.e., Gaussian, exponential and polynomial functions) as well as other classifiers, including artificial neural network (ANN) and kNN algorithms. For each topic (class), we use a trained classifier to classify relevant and non-relevant documents. From the results shown in Figure 5, the SVM classifier with a Gaussian kernel was superior to the other classifiers, including SVM classifiers with other kernel functions.

    Figure 5. Resulting recall ratios
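
As a hedged sketch of this kind of comparison (not our actual experimental code), the snippet below measures per-class recall for SVMs with a Gaussian (RBF), polynomial, and exponential (Laplacian) kernel; the labelled corpus (docs and topics) must be supplied by the caller and is assumed here.

# Minimal sketch: comparing SVM kernels by per-class recall, assuming scikit-learn.
# `docs` and `topics` stand in for the labelled corpus and are not provided here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score
from sklearn.metrics.pairwise import laplacian_kernel    # exponential-type kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def compare_kernels(docs, topics):
    X = CountVectorizer().fit_transform(docs).toarray()
    X_tr, X_te, y_tr, y_te = train_test_split(X, topics, test_size=0.3, random_state=0)
    kernels = {"Gaussian RBF": SVC(kernel="rbf"),
               "Polynomial": SVC(kernel="poly", degree=2),
               "Exponential RBF": SVC(kernel=laplacian_kernel)}
    for name, clf in kernels.items():
        clf.fit(X_tr, y_tr)
        recalls = recall_score(y_te, clf.predict(X_te), labels=clf.classes_, average=None)
        print(name, dict(zip(clf.classes_, recalls.round(3))))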

Also, the F1 values (Eq. 1) could be obtained, as shown in Figure 6:

F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (1)
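
As a small worked example of these definitions (the counts below are invented, not taken from our experiments):

# Minimal sketch of the precision, recall, and F1 definitions used above.
relevant_in_collection = 50   # relevant documents in the collection (hypothetical)
retrieved = 40                # documents retrieved by the classifier
retrieved_relevant = 35       # retrieved documents that are actually relevant

recall = retrieved_relevant / relevant_in_collection        # 0.70
precision = retrieved_relevant / retrieved                  # 0.875
f1 = 2 * precision * recall / (precision + recall)          # ~0.778
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")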

Figure 6. Resulting F1 ratios as a function of feature-space dimension for the Gaussian RBF, polynomial, and exponential RBF kernels

    7. Conclusions

This paper presents a hybrid text-mining approach for evaluating relatedness among texts. In this work we develop several text classifiers using Support Vector Machines (SVM) methods to support the acquisition of relatedness among texts. First, we utilized our developed text mining algorithms, including text mining techniques based on the classification of texts in several text collections. After that, we employ various SVM classifiers to evaluate the relatedness of the target documents. The results indicate that this approach can also be applied to other research work, such as information filtering and re-categorizing the documents returned by search engine queries. Experimental results show that the technique performs well in practice, successfully adapting the classification function of SVM-based classifiers to the acquisition of relatedness among texts.

    8. References

[1] R. Feldman and I. Dagan, "KDT - Knowledge Discovery in Texts," In Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.

[2] R. Feldman, W. Klosgen, and A. Zilberstein, "Visualization Techniques to Explore Data Mining Results for Document Collections," In Proceedings of the Third Annual Conference on Knowledge Discovery and Data Mining (KDD), Newport Beach, 1997.

[3] R. Feldman, I. Dagan and H. Hirsh, "Mining Text Using Keyword Distributions," Journal of Intelligent Information Systems, Vol. 10, pp. 281-300, 1998.

[4] C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, MA, 1998.

Recall ratios for each class and classifier (data shown in Figure 5):

Class      ANN     kNN     SVM (Gaussian)   SVM (Exponential)   SVM (Polynomial)
Finance    91.13   78.57   99.68            54.43               86.70
Politics   92.37   84.75   97.46            39.24               82.28
Movies     91.74   79.63   98.37            39.87               73.51
Sports     90.02   76.54   94.18            40.56               74.69
Tech.      89.97   86.31   98.86            38.60               75.94


[5] M.A. Hearst, "Untangling Text Data Mining," In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.

[6] T. Honkela, S. Kaski, K. Lagus and T. Kohonen, "Newsgroup Exploration with WEBSOM Method and Browsing Interface," Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.

[7] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," In Proceedings of the 10th European Conference on Machine Learning (ECML), Lecture Notes in Computer Science, Number 1398, Springer Verlag, pp. 137-142, 1998.

[8] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM - Self-Organizing Maps of Document Collections," Neurocomputing, Vol. 21, pp. 101-117, 1998.

[9] T. Kohonen, "Self-Organization of Very Large Document Collections: State of the Art," In L. Niklasson, M. Boden, and T. Ziemke, editors, Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks, Vol. 1, London, pp. 65-74, Springer, 1998.

[10] C.H. Lee and H.C. Yang, "A Web Text Mining Approach Based on Self-Organizing Map," In Proceedings of the ACM CIKM'99 2nd Workshop on Web Information and Data Management (WIDM'99), Kansas City, Missouri, USA, pp. 59-62, 1999.

[11] C.H. Lee and H.C. Yang, "A Text Data Mining Approach Using a Chinese Corpus Based on Self-Organizing Map," In Proceedings of the 4th International Workshop on Information Retrieval with Asian Languages, Taipei, Taiwan, pp. 19-22, 1999.

[12] C.H. Lee and H.C. Yang, "A Multilingual Text Mining Approach Based on Self-Organizing Maps," Applied Intelligence, Vol. 18(3), pp. 295-310, 2003.

[13] X. Lin, D. Soergel, and G. Marchionini, "A Self-Organizing Semantic Map for Information Retrieval," In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'91), Chicago, IL, 1991.

[14] G.A. Miller, et al., "Five Papers on WordNet," CSL Report 43, Princeton University, 1990; revised August 1993.

[15] R. Rada, et al., "Development and Application of a Metric on Semantic Nets," IEEE Transactions on Systems, Man, and Cybernetics, 19(1), pp. 17-30, February 1989.

[16] P. Resnik, "Using Information Content to Evaluate Semantic Similarity," In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, pp. 448-453, 1995.

[17] H. Ritter and T. Kohonen, "Self-Organizing Semantic Maps," Biological Cybernetics, Vol. 61, pp. 241-254, 1989.

[18] R. Richardson and A.F. Smeaton, "Using WordNet in a Knowledge-based Approach to Information Retrieval," Working Paper CA-0395, School of Computer Applications, Dublin City University, 1995.

[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995. ISBN 0-387-94559-8.

[20] H.C. Yang and C.H. Lee, "Automatic Hypertext Construction through a Text Mining Approach by Self-Organizing Maps," In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001), Hong Kong, China, April 16-18, 2001, pp. 108-113.
