

    Sample document collection

This is the initial stage of the proposed architecture. In this stage the sample documents are collected. These documents belong to four different subjects of the computer science domain, namely database management systems, computer networks, operating systems and software engineering. The extracted documents are arranged in a directory to facilitate the training process.

    Indexing of Documents

Indexing is the process of preparing the raw document collection into an easily accessible representation of the documents. This transformation from document text into a representation of the text is known as indexing of the documents. To transform a document into an indexed form, the Text to Matrix Generator (TMG) toolbox for MATLAB is used, which involves the following steps:

Input: filename / directory, OPTIONS
Output: tdm (term-to-document matrix), dictionary and several optional outputs

parse files or input directory;
read the stoplist;

[Figure: flow of the indexing process: read the stoplist, parse each file, normalize the dictionary, filter terms, and output the TDM and dictionary.]


for each input file,
    parse the file (construct dictionary);
end
normalize the dictionary (remove stopwords and too long or too short terms, stemming);
construct tdm;
remove terms as per frequency parameters;
compute global weights;
apply local weighting function;
form final tdm;
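The text describes this pipeline in terms of the TMG toolbox, whose exact calls are not shown. As a hypothetical illustration only, a comparable term-to-document matrix can be assembled with MATLAB's Text Analytics Toolbox; the function names below (tokenizedDocument, removeStopWords, normalizeWords, bagOfWords, tfidf) belong to that toolbox, not to TMG, and the corpus directory and length/frequency thresholds are assumptions:

% Hypothetical sketch: term-to-document matrix with the Text Analytics
% Toolbox (not the TMG toolbox the text describes); 'corpus' is assumed.
files = dir(fullfile('corpus', '*.txt'));        % parse input directory
raw = strings(numel(files), 1);
for i = 1:numel(files)
    raw(i) = string(fileread(fullfile(files(i).folder, files(i).name)));
end
docs = tokenizedDocument(lower(raw));            % parse each file
docs = erasePunctuation(docs);
docs = removeStopWords(docs);                    % apply the stoplist
docs = removeShortWords(docs, 2);                % drop too-short terms
docs = removeLongWords(docs, 15);                % drop too-long terms
docs = normalizeWords(docs, 'Style', 'stem');    % stemming
bag = bagOfWords(docs);                          % dictionary + raw counts
bag = removeInfrequentWords(bag, 2);             % frequency-based filtering
tdm = tfidf(bag)';                               % weighted term-to-document matrix
dictionary = bag.Vocabulary;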

    Document Linearization

Document Linearization is the process by which a document is reduced to a stream of terms. This is usually done in two steps, as follows.

1. Markup and Format Removal: during this phase, the markup tags and formatting tags are removed from the document.
2. Tokenization: during this phase, the text remaining from the previous phase is parsed, lowercased and stripped of all punctuation.
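Where the inputs are web pages, a minimal MATLAB sketch of both linearization steps might look as follows (extractHTMLText is a Text Analytics Toolbox function; the file name is an assumption):

% Hypothetical sketch of document linearization for an assumed HTML file.
html = fileread('sample.html');
txt = extractHTMLText(string(html));      % 1. markup and format removal
tokens = tokenizedDocument(lower(txt));   % 2. tokenization and lowercasing
tokens = erasePunctuation(tokens);        %    punctuation removal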

    Filtration

Filtration is the process of deciding which terms or attributes should be used to represent the documents, so that these terms can

1. describe the content of the document, and
2. discriminate the document from the other documents in the collection.

Frequently used terms cannot serve this purpose, for two reasons. First, the number of documents that are relevant to a topic or subtopic is likely to be a small proportion of the collection. A term that is effective in separating the relevant documents from the non-relevant documents is therefore likely to be a term that appears in a small number of documents. This means that high-frequency terms are poor discriminators. The second reason is that terms appearing in many contexts do not define a topic or sub-topic of a document.

    "The more documents in which a term appears (the more contexts in which it is used) then the

    less likely it is to be a content-bearing term. Consequently it is less likely that the term is one of

  • 8/6/2019 Indexing Repaired) Repaired)

    3/6

    those terms that contribute to the user's relevance assessment. Hence, terms that appear in many

    documents are less likely to be the ones used by a searcher to discriminate between relevant and

    non-relevant documents."

For these reasons, frequently used terms, or stopwords, are removed from text streams. Stopwords are words whose frequency exceeds some user-specified threshold. Special care is taken so that important words that occur more frequently are not removed. The stop-word removal is done with the aid of a publicly available list of stop-words []. Using a public list of stop-words is category independent and ensures that important words within a category that occur more frequently are not removed. The disadvantage is that there are many different public lists of stop-words, not all of which are the same. Nevertheless, a number of the lists can be compared and the appropriate one chosen.

However, removing stopwords from one document at a time is time consuming. A cost-effective approach consists of removing all terms which appear commonly in the document collection and which will not improve retrieval of relevant material. This can be accomplished with a stopword library, a stop-list of terms to be removed. These lists contain words which are assumed to have no impact on the meaning of a document. Such a list usually contains words like the, is, a, etc. During preprocessing, all words matching the stop-word list are removed from the document.

[Figure: overall architecture: a download agent fetches HTML files from the World Wide Web; indexing then applies document linearization, filtration, stemming, transformation and weighting to produce document vectors.]


These lists can be either generic (applied to all collections) or specific (created for a given collection). For instance, in some IR systems terms that appear in more than 5% of a collection are removed. In others, terms that are not in the stop-list but appear in more than 50% of a collection are deemed "negative terms" and are also removed to avoid weighting complications.
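Both strategies can be sketched in a few lines of MATLAB (a hypothetical illustration; the tiny corpus is made up and the 50% cutoff is the example threshold mentioned above):

% Hypothetical sketch: generic stop-list removal plus removal of
% collection-specific "negative terms" appearing in > 50% of documents.
docs = tokenizedDocument(lower(["the cat sat"; "the dog ran"; "a cat ran"]));
docs = removeStopWords(docs);                    % generic stop-list
bag = bagOfWords(docs);
df = sum(bag.Counts > 0, 1);                     % document frequency per term
negative = bag.Vocabulary(df > 0.5 * bag.NumDocuments);
bag = removeWords(bag, negative);                % drop the negative terms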

    Stemming

A stemming algorithm is a process of linguistic normalization, in which the variant forms of a word are reduced to a common form, for example:

connection
connections
connective   --->   connect
connected
connecting

It refers to the process of reducing terms to their stems or root variants. Thus, "computer", "computing" and "compute" are reduced to "compute", and "walks", "walking" and "walker" are reduced to "walk".

Stemming can be strong stemming (e.g. houses, mice to the word stems hous, mic) or weak stemming (e.g. houses, mice to the word stems house, mouse). For pre-processing of English documents the Porter stemming algorithm is often used; other algorithms include, for example, the n-Gram stemmer, the Snowball stemming algorithms, Lovins' English stemmer, etc.
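In MATLAB, Porter stemming is available through normalizeWords in the Text Analytics Toolbox, as a brief sketch:

% Hypothetical sketch: Porter stemming of the variant forms above.
words = tokenizedDocument("connection connections connective connected connecting");
stems = normalizeWords(words, 'Style', 'stem')   % variants reduce to a common stem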

    Weighting

Weighting is the final stage in the indexing application. Terms are weighted according to a given weighting model, which may include local weighting, global weighting or both. If only local weights are used, then term weights are normally expressed as term frequencies, tf. If global weights are used, the weight of a term is given by its IDF value. The most common (and basic) weighting scheme is one in which local and global weights are combined (weight of a term = tf*IDF). This is commonly referred to as tf*IDF weighting.
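As a small worked illustration, assuming the common formulation weight = tf * log(N/df), where N is the collection size and df is the number of documents containing the term, MATLAB's tfidf applies such a scheme directly (the three-document collection is made up):

% Hypothetical sketch: tf*IDF weighting of a tiny three-document collection.
docs = tokenizedDocument(["database stores data"; "network routes data"; "database index"]);
bag = bagOfWords(docs);
W = tfidf(bag);     % rows are documents, columns are terms
full(W)             % inspect the weights; terms shared by more documents weigh less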

    Clustering of documents

Clustering is the process of grouping similar documents from a given set of documents. It puts documents with similar content or related topics into the same cluster (group). Each cluster is assigned a label based on the content of the documents belonging to it.


k-means clustering

K-means clustering is a partitioning method. It partitions data into k mutually exclusive clusters and returns the index of the cluster to which it has assigned each observation.

K-means treats each document in the collection as an object having a location in space. It finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible.

Each cluster in the partition is defined by its member objects and by its centroid, or center. The centroid of each cluster is the point to which the sum of distances from all objects in that cluster is minimized. kmeans computes cluster centroids differently for each distance measure, minimizing the sum with respect to the measure that you specify.

K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until this sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible.
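A minimal sketch with MATLAB's kmeans (Statistics and Machine Learning Toolbox); the cosine distance, the number of clusters and the tiny corpus are illustrative assumptions, since the text does not fix a distance measure:

% Hypothetical sketch: k-means over tf*IDF document vectors.
docs = tokenizedDocument(lower(["sql joins tables"; "tcp routes packets"; ...
    "select rows from tables"; "packet switching networks"]));
X = full(tfidf(bagOfWords(docs)));   % documents-by-terms matrix
[idx, C] = kmeans(X, 2, 'Distance', 'cosine', 'Replicates', 5);
% idx(i) is the cluster index of document i; C holds the two centroids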


    Text mining

NMF can be used for text mining applications. In this process, a document-term matrix is constructed with the weights of various terms (typically weighted word frequency information) from a set of documents. This matrix is factored into a term-feature matrix and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents.
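A minimal sketch with MATLAB's nnmf (Statistics and Machine Learning Toolbox); the matrix values and the rank k = 2 are illustrative assumptions:

% Hypothetical sketch: NMF of a small term-by-document weight matrix A.
A = [2 0 1 0;                  % rows: terms, columns: documents
     1 0 2 0;
     0 3 0 1;
     0 1 0 2];
[W, H] = nnmf(A, 2);           % A ~ W*H: W is term-feature, H is feature-document
[~, cluster] = max(H, [], 1)   % each document's dominant feature acts as its cluster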