

    Sample document collection

This is the initial stage of the proposed architecture. In this stage the sample documents are collected. These documents belong to four different subjects of the computer science domain, namely database management systems, computer networks, operating systems and software engineering. The extracted documents are arranged in a directory to facilitate the training process.

    Indexing of Documents

Indexing is the process of preparing the raw document collection into an easily accessible representation of the documents. This transformation from document text into a representation of the text is known as indexing of the documents. To transform a document into an indexed form, the Text to Matrix Generator (TMG) toolbox for MATLAB is used, which involves the following steps:

Input: filename / directory, OPTIONS
Output: tdm (term-to-document matrix), dictionary and several optional outputs

parse files or input directory;
read the stoplist;

[Figure: flow of the indexing process: read the stoplist, parse each file, normalize the dictionary, filter terms, and output the TDM and dictionary.]


for each input file,
    parse the file (construct dictionary);
end
normalize the dictionary (remove stopwords and too long or too short terms, stemming);
construct tdm;
remove terms as per frequency parameters;
compute global weights;
apply local weighting function;
form final tdm;
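The text describes this pipeline in terms of the TMG toolbox, whose exact calls are not shown. As a hypothetical illustration only, a comparable term-to-document matrix can be assembled with MATLAB's Text Analytics Toolbox; the function names below (tokenizedDocument, removeStopWords, normalizeWords, bagOfWords, tfidf) belong to that toolbox, not to TMG, and the corpus directory and length/frequency thresholds are assumptions:

% Hypothetical sketch: term-to-document matrix with the Text Analytics
% Toolbox (not the TMG toolbox the text describes); 'corpus' is assumed.
files = dir(fullfile('corpus', '*.txt'));        % parse input directory
raw = strings(numel(files), 1);
for i = 1:numel(files)
    raw(i) = string(fileread(fullfile(files(i).folder, files(i).name)));
end
docs = tokenizedDocument(lower(raw));            % parse each file
docs = erasePunctuation(docs);
docs = removeStopWords(docs);                    % apply the stoplist
docs = removeShortWords(docs, 2);                % drop too-short terms
docs = removeLongWords(docs, 15);                % drop too-long terms
docs = normalizeWords(docs, 'Style', 'stem');    % stemming
bag = bagOfWords(docs);                          % dictionary + raw counts
bag = removeInfrequentWords(bag, 2);             % frequency-based filtering
tdm = tfidf(bag)';                               % weighted term-to-document matrix
dictionary = bag.Vocabulary;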

    Document Linearization

Document Linearization is the process by which a document is reduced to a stream of terms. This is usually done in two steps, as follows.

1. Markup and Format Removal: during this phase, the markup tags and formatting tags are removed from the document.
2. Tokenization: during this phase, the text remaining from the previous phase is parsed, lowercased and stripped of all punctuation.
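Where the inputs are web pages, a minimal MATLAB sketch of both linearization steps might look as follows (extractHTMLText is a Text Analytics Toolbox function; the file name is an assumption):

% Hypothetical sketch of document linearization for an assumed HTML file.
html = fileread('sample.html');
txt = extractHTMLText(string(html));      % 1. markup and format removal
tokens = tokenizedDocument(lower(txt));   % 2. tokenization and lowercasing
tokens = erasePunctuation(tokens);        %    punctuation removal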

    Filtration

Filtration is the process of deciding which terms or attributes should be used to represent the documents, so that these terms can

1. describe the content of the document, and
2. discriminate the document from the other documents in the collection.

Frequently used terms cannot serve this purpose, for two reasons. First, the number of documents that are relevant to a topic or subtopic is likely to be a small proportion of the collection. A term that is effective in separating the relevant documents from the non-relevant documents is therefore likely to be a term that appears in a small number of documents. This means that high-frequency terms are poor discriminators. The second reason is that terms appearing in many contexts do not define a topic or sub-topic of a document.

    "The more documents in which a term appears (the more contexts in which it is used) then the

    less likely it is to be a content-bearing term. Consequently it is less likely that the term is one of

  • 8/6/2019 Indexing Repaired) Repaired)

    3/6

    those terms that contribute to the user's relevance assessment. Hence, terms that appear in many

    documents are less likely to be the ones used by a searcher to discriminate between relevant and

    non-relevant documents."

For these reasons, frequently used terms, or stopwords, are removed from text streams. Stopwords are words whose frequency exceeds some user-specified threshold. Special care is taken so that important words that occur more frequently are not removed. The stop-word removal is done with the aid of a publicly available list of stop-words []. Using a public list of stop-words is category independent and ensures that important words within a category that occur more frequently are not removed. The disadvantage is that there are many different public lists of stop-words, not all of which are the same. Nevertheless, a number of the lists can be compared and the appropriate one chosen.

However, removing stopwords from one document at a time is time consuming. A cost-effective approach consists of removing all terms which appear commonly in the document collection and which will not improve retrieval of relevant material. This can be accomplished with a stopword library, a stop-list of terms to be removed. These lists contain words which are assumed to have no impact on the meaning of a document. Such a list usually contains words like the, is, a, etc. During preprocessing, all words matching the stop-word list are removed from the document.

[Figure: overall architecture: a download agent fetches HTML files from the World Wide Web; indexing then applies document linearization, filtration, stemming, transformation and weighting to produce document vectors.]


These lists can be either generic (applied to all collections) or specific (created for a given collection). For instance, in some IR systems terms that appear in more than 5% of a collection are removed. In others, terms that are not in the stop-list but appear in more than 50% of a collection are deemed "negative terms" and are also removed to avoid weighting complications.
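Both strategies can be sketched in a few lines of MATLAB (a hypothetical illustration; the tiny corpus is made up and the 50% cutoff is the example threshold mentioned above):

% Hypothetical sketch: generic stop-list removal plus removal of
% collection-specific "negative terms" appearing in > 50% of documents.
docs = tokenizedDocument(lower(["the cat sat"; "the dog ran"; "a cat ran"]));
docs = removeStopWords(docs);                    % generic stop-list
bag = bagOfWords(docs);
df = sum(bag.Counts > 0, 1);                     % document frequency per term
negative = bag.Vocabulary(df > 0.5 * bag.NumDocuments);
bag = removeWords(bag, negative);                % drop the negative terms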

    Stemming

A stemming algorithm is a process of linguistic normalization, in which the variant forms of a word are reduced to a common form, for example:

connection
connections
connective   --->   connect
connected
connecting

It refers to the process of reducing terms to their stems or root variants. Thus, "computer", "computing" and "compute" are reduced to "compute", and "walks", "walking" and "walker" are reduced to "walk".

Stemming can be strong stemming (e.g. houses, mice to the word stems hous, mic) or weak stemming (e.g. houses, mice to the word stems house, mouse). For pre-processing of English documents the Porter stemming algorithm is often used; other algorithms include, for example, the n-Gram stemmer, the Snowball stemming algorithms, Lovins' English stemmer, etc.
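In MATLAB, Porter stemming is available through normalizeWords in the Text Analytics Toolbox, as a brief sketch:

% Hypothetical sketch: Porter stemming of the variant forms above.
words = tokenizedDocument("connection connections connective connected connecting");
stems = normalizeWords(words, 'Style', 'stem')   % variants reduce to a common stem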

    Weighting

Weighting is the final stage in the indexing application. Terms are weighted according to a given weighting model, which may include local weighting, global weighting or both. If only local weights are used, then term weights are normally expressed as term frequencies, tf. If global weights are used, the weight of a term is given by its IDF value. The most common (and basic) weighting scheme is one in which local and global weights are combined (weight of a term = tf*IDF). This is commonly referred to as tf*IDF weighting.
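As a small worked illustration, assuming the common formulation weight = tf * log(N/df), where N is the collection size and df is the number of documents containing the term, MATLAB's tfidf applies such a scheme directly (the three-document collection is made up):

% Hypothetical sketch: tf*IDF weighting of a tiny three-document collection.
docs = tokenizedDocument(["database stores data"; "network routes data"; "database index"]);
bag = bagOfWords(docs);
W = tfidf(bag);     % rows are documents, columns are terms
full(W)             % inspect the weights; terms shared by more documents weigh less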

    Clustering of documents

Clustering is the process of grouping similar documents from a given set of documents. It puts documents with similar content or related topics into the same cluster (group). Each cluster is assigned a label based on the content of the documents belonging to it.


k-means clustering

K-means clustering is a partitioning method. It partitions data into k mutually exclusive clusters and returns the index of the cluster to which it has assigned each observation.

K-means treats each document in the collection as an object having a location in space. It finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible.

Each cluster in the partition is defined by its member objects and by its centroid, or center. The centroid of each cluster is the point to which the sum of distances from all objects in that cluster is minimized. kmeans computes cluster centroids differently for each distance measure, minimizing the sum with respect to the measure that you specify.

K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until this sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible.
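A minimal sketch with MATLAB's kmeans (Statistics and Machine Learning Toolbox); the cosine distance, the number of clusters and the tiny corpus are illustrative assumptions, since the text does not fix a distance measure:

% Hypothetical sketch: k-means over tf*IDF document vectors.
docs = tokenizedDocument(lower(["sql joins tables"; "tcp routes packets"; ...
    "select rows from tables"; "packet switching networks"]));
X = full(tfidf(bagOfWords(docs)));   % documents-by-terms matrix
[idx, C] = kmeans(X, 2, 'Distance', 'cosine', 'Replicates', 5);
% idx(i) is the cluster index of document i; C holds the two centroids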


    Text mining

NMF can be used for text mining applications. In this process, a document-term matrix is constructed with the weights of various terms (typically weighted word frequency information) from a set of documents. This matrix is factored into a term-feature matrix and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents.
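A minimal sketch with MATLAB's nnmf (Statistics and Machine Learning Toolbox); the matrix values and the rank k = 2 are illustrative assumptions:

% Hypothetical sketch: NMF of a small term-by-document weight matrix A.
A = [2 0 1 0;                  % rows: terms, columns: documents
     1 0 2 0;
     0 3 0 1;
     0 1 0 2];
[W, H] = nnmf(A, 2);           % A ~ W*H: W is term-feature, H is feature-document
[~, cluster] = max(H, [], 1)   % each document's dominant feature acts as its cluster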