Software Requirements Specification

for

Document Clustering based on Similarity Measure Using Multi-Reference points

Version 1.0

Prepared by

Maram Nagarjuna Reddy

(11491A5811)

QIS College of Engineering And Technology

Under the esteemed guidance of

Prof. G. Lakshmi Tulasi, M.Tech., (Ph.D)

26 February 2013

Table of Contents

Revision History
1. Introduction
    1.1 Purpose
    1.2 Document Conventions
    1.3 Intended Audience and Reading Suggestions
    1.4 Product Scope
    1.5 References
2. Overall Description
    2.1 Product Perspective
    2.2 Product Functions
    2.3 User Classes and Characteristics
    2.4 Operating Environment
    2.5 Design and Implementation Constraints
    2.6 User Documentation
    2.7 Assumptions and Dependencies
3. External Interface Requirements
    3.1 User Interfaces
    3.2 Hardware Interfaces
    3.3 Software Interfaces
    3.4 Communications Interfaces
4. System Features
    4.1 System Feature 1
5. Other Nonfunctional Requirements
    5.1 Performance Requirements
    5.2 Safety Requirements
    5.3 Security Requirements
    5.4 Software Quality Attributes
    5.5 Business Rules
6. Other Requirements
Appendix A: Glossary
Appendix B: Analysis Models
Appendix C: To Be Determined List

Revision History

Name | Date | Reason For Changes | Version

1. Introduction

The main goal of the requirements phase is to produce the software requirements specification (SRS), which accurately captures the client's requirements. The SRS is a document that describes what the software should do. Its basic purpose is to bridge the communication gap between the clients, the end users, and the software developers. Another purpose is to help users understand their own needs.

Clustering is the classification of objects into different groups, or more precisely, the

partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share

some common trait - often proximity according to some defined distance measure. Data clustering

is a common technique for statistical data analysis, which is used in many fields, including

machine learning, data mining, pattern recognition, image analysis and bioinformatics. The

computational task of classifying the data set into k clusters is often referred to as k-clustering.

In recent years, due to the increased availability of large document collections and the need to operate on them efficiently (e.g., navigate, analyze, query, and summarize), there has been an increased emphasis on developing efficient and effective clustering algorithms for large document collections. To a large extent, this research has assumed that each document is part of a single topic. This assumption is in general true for short documents (e.g., web pages), but it does not hold for many of the large documents to which clustering algorithms have been increasingly applied.

1.1 Purpose

The purpose of this Software Requirements Specification (SRS) document is to describe the external behavior of the Document Clustering based on Similarity Measure Using Multi-Reference points system.

The SRS contains a brief description of the project. The purpose of the requirements document is to specify all the information required to design, develop, and test the software.

The purpose of this project is to group, in an unsupervised way, a given document set into clusters such that documents within each cluster are more similar to each other than to those in different clusters.

The main purpose of this project is to organize a collection of patterns into clusters based on a similarity measure using multi-reference points.

1.2 Document Conventions

In general, this document follows the IEEE formatting requirements. Use Arial font, size 11 or 12, throughout the document for text. Use italics for comments. Document text should be single spaced and maintain the 1” margins found in this template. For section and subsection titles, please follow the template. The template standards are published in the “IEEE Standards Collection” and can be downloaded from

www.csc.villanova.edu/~tway/courses/csc4181/.../srs_template-1.doc

1.3 Intended Audience and Reading Suggestions

This SRS document is intended for users, developers, testers, and documentation writers.

The rest of the SRS is organized as follows.

Section 2 gives an overall description of the product and describes the design constraints to be considered when the system is designed, along with other factors necessary to provide a complete and comprehensive description of the requirements for the software. Section 3 describes the external interface requirements, covering the user, hardware, software, and communications interfaces. Section 4 presents the system features and their descriptions. Section 5 describes the nonfunctional requirements, such as performance, safety, and security requirements.

Together, these sections form the requirements specification, which defines and describes the operations, interfaces, performance, and quality assurance requirements of Document Clustering Using Multi-Reference points.

1.4 Product Scope

The aim of clustering is to find intrinsic structures in documents and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year.

The main work is to develop two similarity measures for document clustering which provide maximum efficiency and performance.

Our first objective is to derive a novel method for measuring similarity between data

objects in sparse and high-dimensional domain, particularly text documents.

From the proposed similarity measure, we then formulate new clustering criterion functions

and introduce their respective clustering algorithms, which are fast and scalable like k-

means, but are also capable of providing high-quality and consistent performance.

The main goal is to perform document clustering by optimizing the two similarity measures.

Document clustering is an enabling technique for a wide range of information retrieval tasks such as efficient organization, browsing, and summarization of large volumes of text documents. Cluster analysis aims to organize a collection of patterns into clusters based on similarity. Clustering has its roots in many fields, such as mathematics, computer science, statistics, biology, and economics. In different application domains, a variety of clustering techniques have been developed, depending on the methods used to represent data, the measures of similarity between data objects, and the techniques for grouping data objects into clusters.

1.5 References

[1] Duc Thang Nguyen, Lihui Chen and Chee Keong Chan, “Clustering with Multiviewpoint-Based

Similarity Measure”, IEEE Transactions on Knowledge and Data Engineering, 2012.

[2] Y. Zhao and G. Karypis, “Criterion Functions for Document Clustering: Experiments and

Analysis,” technical report, Dept. of Computer Science, Univ. of Minnesota, 2002.


[3] S. Zhong and J. Ghosh, “A Comparative Study of Generative Models for Document

Clustering,” Proc. SIAM Int’l Conf. Data Mining Workshop Clustering High Dimensional Data and

Its Applications, 2003.

2. Overall Description

2.1 Product Perspective

We are facing an ever-increasing volume of text documents. The abundant texts flowing over the Internet, huge collections of documents in digital libraries and repositories, and digitized personal information such as blog articles and emails are piling up quickly every day. These have brought challenges for the effective and efficient organization of text documents.

Clustering in general is an important and useful technique that automatically organizes a collection with a substantial number of data objects into a much smaller number of coherent groups. In the particular scenario of text documents, clustering has proven to be an effective approach for quite some time, and an interesting research problem as well. It is becoming even more interesting and demanding with the development of the World Wide Web and the evolution of Web 2.0. For example, results returned by search engines are clustered to help users quickly identify and focus on the relevant set of results. Customer comments are clustered in many online stores, such as Amazon.com, to provide collaborative recommendations. In collaborative bookmarking or tagging, clusters of users that share certain traits are identified by their annotations.

Document clustering has become an increasingly important technique for unsupervised

document organization, automatic topic extraction, and fast information retrieval or filtering. For

example, a web search engine often returns thousands of pages in response to a broad query,

making it difficult for users to browse or to identify relevant information. Clustering methods can

be used to automatically group the retrieved documents into a list of meaningful categories, as is

achieved by search engines such as Google News. Similarly, a large database of documents can

be pre-clustered to facilitate query processing by searching only the cluster that is closest to the

query. The concepts this project relies on are explained briefly in the sections that follow.


2.2 Product Functions

The main purpose of this project is to maximize user utility. The product works with the following elements.

Documents: The abundant texts flowing over the Internet, huge collections of documents in digital libraries and repositories, and digitized personal information such as blog articles and emails are piling up quickly every day. These have brought challenges for the effective and efficient organization of text documents.

Type of similarity measure: The system uses a similarity measure based on multi-reference points for document clustering.

Clustering algorithm: The clustering algorithm uses the above similarity measure to group documents into clusters.
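One possible reading of a similarity measure using multi-reference points follows the general form in reference [1]: the similarity of two documents is the average, over a set of reference documents, of the dot product of the two difference vectors. The sketch below assumes that reading. The class name MultiReferenceSimilarity and its method names are hypothetical, and the choice of reference points (for example, documents outside the cluster under consideration) is left to the caller.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not taken from the SRS.
class MultiReferenceSimilarity {

    // Dot product of (a - ref) and (b - ref): how similar a and b look from ref.
    static double relativeDot(double[] a, double[] b, double[] ref) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - ref[i]) * (b[i] - ref[i]);
        }
        return sum;
    }

    // Average of relativeDot over all reference points.
    static double similarity(double[] a, double[] b, List<double[]> references) {
        if (references.isEmpty()) {
            throw new IllegalArgumentException("at least one reference point is required");
        }
        double total = 0.0;
        for (double[] ref : references) {
            total += relativeDot(a, b, ref);
        }
        return total / references.size();
    }

    public static void main(String[] args) {
        double[] a = {1.0, 0.0};
        double[] b = {0.0, 1.0};
        List<double[]> refs = Arrays.asList(new double[]{0.0, 0.0}, new double[]{1.0, 1.0});
        System.out.println(similarity(a, a, refs));
        System.out.println(similarity(a, b, refs));
    }
}
```

With a single reference point at the origin, this reduces to the plain dot product, which equals cosine similarity when the document vectors are normalized to unit length.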

2.3 User Classes and Characteristics

The users are assumed to have basic knowledge of computers and a working knowledge of data mining. The user needs to know the exact nature of the submitted job, such as its execution time and the resources it requires, and must possess the technical know-how to use the interface for submitting jobs.

To maintain the system, users should be able to rectify small problems that may arise due to disk crashes, power failures, and other catastrophes. The user interface, user's manual, online help, and the guide to installing and maintaining the system must be sufficient to educate the users on how to use the system without any problems.

2.4 Operating Environment

The target operating system is Windows XP Professional. The system also requires the Java runtime and compile-time environments, along with the NetBeans IDE. The hardware requirements are the minimal requirements for running the application. A database is needed to store the collection of documents.


2.5 Design and Implementation Constraints

Hardware limitations: The developers do not have enough storage to keep the document dataset.

They may also have timing constraints.

The current constraints on the project are related to the provision of hardware resources to implement and test a high-performance cluster. At present, a network of four Pentium IV workstations, each with 128 MB of RAM, serves as the cluster. For better performance analysis, a larger number of dedicated workstations would be beneficial.

2.6 User Documentation

A user manual and guide will be made available for troubleshooting and help. The user manual will contain detailed information about the usage of the product, written for users ranging from novices to expert network/system users. The manual and a summary of the application shall also be made available online.

2.7 Assumptions and Dependencies

The following assumptions are made:

● The users have sufficient knowledge of computers.
● The users have sufficient knowledge of information retrieval.
● The computer has all the tools needed to run the application.
● The users know the English language, as the user interface will be provided in English.
● The product can access the document database.


3. External Interface Requirements

3.1 User Interfaces

The user interface should be simple and easy to understand and use. It should also be interactive. The system should prompt the user for proper input criteria.

The software provides a graphical interface through which the user can operate the system and perform the required tasks, such as uploading documents and viewing the details of the result.

The minimal requirement is that the cluster user is able to interact with the system through the prompt or through the interface provided by the system. There will be a different command for each of the following actions:

● Submit text documents
● Display the clusters as the result

The input design considers the following:

● What data should be given as input?
● How should the data be arranged or coded?
● The dialog that guides the operating personnel in providing input.


3.2 Hardware Interfaces

The system requires various hardware components and also involves the following hardware interfaces:

● CPU usage
● Memory usage
● Text file creation

3.3 Software Interfaces

This product requires the following specific software components:

● Java language
● NetBeans IDE 7.0.1
● Windows XP/Windows 2000
● Large databases

3.4 Communications Interfaces

The web browser performs the following tasks.

Parsing is the first step when the document enters the process state.

Parsing is defined as the separation or identification of meta tags in an HTML document.

Here, the raw HTML file is read and parsed by traversing all the nodes in its tree structure.
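The identification of meta tags described above can be illustrated with a small sketch. It is a deliberate simplification: it scans the raw HTML with regular expressions instead of building the node tree, and it only handles double-quoted attributes. The class name MetaTagExtractor and its method are hypothetical and not part of the specified system.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; a production parser would walk the HTML node tree.
class MetaTagExtractor {

    // Matches a whole <meta ...> tag and captures its attribute text.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches one attribute of the form key="value".
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([a-zA-Z-]+)\\s*=\\s*\"([^\"]*)\"");

    // Returns name -> content for every meta tag that carries both attributes.
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String key = attr.group(1).toLowerCase();
                if (key.equals("name")) {
                    name = attr.group(2);
                } else if (key.equals("content")) {
                    content = attr.group(2);
                }
            }
            if (name != null && content != null) {
                result.put(name.toLowerCase(), content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"keywords\" content=\"clustering, text\">"
                + "<meta charset=\"utf-8\"></head><body></body></html>";
        System.out.println(extract(html));
    }
}
```

Meta tags without both a name and a content attribute, such as the charset declaration in the example, are skipped.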

4. System Features

Document pre-processing steps

Tokenization: A document is treated as a string (or bag of words) and then partitioned into a list of tokens.

Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.

Stemming: This step conflates tokens to their root form (connection -> connect).
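The three pre-processing steps above can be sketched as a small pipeline. The stop-word list and the suffix-stripping rules below are tiny illustrative assumptions (a real system would likely use a full stop-word list and an established stemmer such as Porter's), and the class name Preprocessor is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; the stop words and suffix rules are assumptions.
class Preprocessor {

    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "is", "are", "of", "and", "to", "in"));

    // Naive suffix list, checked in order; not a real stemming algorithm.
    private static final String[] SUFFIXES = {"ing", "ion", "ed", "s"};

    // Tokenization: lower-case the text and split on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stemming: strip the first matching suffix, keeping at least three characters.
    static String stem(String token) {
        for (String suffix : SUFFIXES) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Full pipeline: tokenize, drop stop words, then stem what remains.
    static List<String> preprocess(String text) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) {
                result.add(stem(token));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The connection of documents is clustering"));
    }
}
```

Stop words are removed before stemming so that the stop-word list can be written in its natural, unstemmed form.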

Document representation

Generate the N distinct words from the corpus and call them index terms (or the vocabulary). Each document in the collection is then represented as an N-dimensional vector in term space.

Computing term weights

● Compute the term frequency (TF).
● Compute the inverse document frequency (IDF).
● Compute the TF-IDF weighting.
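The representation and weighting steps above can be sketched as follows. The SRS does not fix a particular TF-IDF formula, so this sketch assumes one common variant: TF is the raw count of a term in a document and IDF is ln(N / df), where N is the number of documents and df is the number of documents containing the term. The class name TfIdfSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; assumes tf = raw count and idf = ln(N / df).
class TfIdfSketch {

    // Index terms: the sorted set of distinct tokens across all documents.
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (List<String> doc : docs) {
            terms.addAll(doc);
        }
        return new ArrayList<>(terms);
    }

    // Returns one weight vector per document, with one entry per index term.
    static double[][] weights(List<List<String>> docs, List<String> vocab) {
        int n = docs.size();
        double[][] w = new double[n][vocab.size()];
        for (int t = 0; t < vocab.size(); t++) {
            String term = vocab.get(t);
            int df = 0; // number of documents that contain the term
            for (List<String> doc : docs) {
                if (doc.contains(term)) {
                    df++;
                }
            }
            double idf = (df == 0) ? 0.0 : Math.log((double) n / df);
            for (int d = 0; d < n; d++) {
                int tf = Collections.frequency(docs.get(d), term);
                w[d][t] = tf * idf;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("cluster", "document"),
                Arrays.asList("cluster", "vector"));
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        System.out.println(Arrays.deepToString(weights(docs, vocab)));
    }
}
```

Under this variant, a term that occurs in every document receives weight zero, which reflects the idea that such a term does not help tell documents apart.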

Measuring similarity between two documents

The similarity of two documents is captured using the cosine similarity measure, which is calculated as the cosine of the angle between the two document vectors.
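A minimal sketch of the cosine similarity computation follows: the dot product of the two vectors divided by the product of their Euclidean lengths. The class name CosineSimilarity is hypothetical, and returning 0 for an all-zero vector is an assumption made here to avoid division by zero.

```java
// Illustrative sketch only; the class and method names are hypothetical.
class CosineSimilarity {

    // cos(a, b) = (a . b) / (|a| * |b|); returns 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[]{1, 2}, new double[]{2, 4}));
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1}));
    }
}
```

For non-negative term weights such as TF-IDF, the result lies between 0 (no shared terms) and 1 (same direction in term space).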


Figure: Use case diagram for the functioning of the system

Clustering

Clustering is a division of data into groups of similar objects.

Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification.

Similar documents are grouped together in a cluster if their similarity measure is at or above a specified threshold (equivalently, if the distance between them is less than a threshold).
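Threshold-based grouping can be illustrated with a simple single-pass scheme, in which each document joins the first cluster whose first member (its leader) is similar enough, and otherwise starts a new cluster. This is a stand-in chosen for brevity, not the clustering algorithm of the project. It treats a higher cosine value as more similar, so a document joins a cluster when the similarity is at or above the threshold. The class name ThresholdClustering is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: single-pass "leader" clustering with a cosine threshold.
class ThresholdClustering {

    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns clusters as lists of document indices, in order of first appearance.
    static List<List<Integer>> cluster(double[][] vectors, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            boolean placed = false;
            for (List<Integer> members : clusters) {
                int leader = members.get(0); // first document of the cluster
                if (cosine(vectors[i], vectors[leader]) >= threshold) {
                    members.add(i);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Integer> fresh = new ArrayList<>();
                fresh.add(i);
                clusters.add(fresh);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] vectors = {{1.0, 0.0}, {0.9, 0.1}, {0.0, 1.0}};
        System.out.println(cluster(vectors, 0.8));
    }
}
```

The result of a single-pass scheme depends on the order in which documents are presented, which is one reason criterion-function approaches such as k-means iterate until the assignment stabilizes.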


5. Other Nonfunctional Requirements

5.1 Performance Requirements

The performance of the software depends on the capability of the computer on which it runs. The software can take any number of inputs provided the storage is large enough; this depends on the available memory space.

Response Time

A page or information page should load within a few seconds.

The system shall respond to the user in not more than two seconds from the time the request is submitted. The system shall be allowed to take more time when doing large processing tasks.

Throughput

The number of clusters is directly dependent on the number of users.

Resource Utilization

The resources are modified according to the user requirements and also according to the latest similarity measures.

5.2 Safety Requirements

There are no specific safety requirements associated with the proposed system. The Document Clustering based on Similarity Measure Using Multi-Reference points system is composed of well-known and commonly used hardware and software which do not cause any safety hazards.

As a safeguard, the product does not allow users to modify any parameters of the similarity measure used by the algorithm.


5.3 Security Requirements

Only expert personnel who have gone through the selection procedures are allowed to use the product. Similarly, the features and data within the documents of the corpus must not be changed at runtime.

5.4 Software Quality Attributes

• Maintainability: There will be no maintenance requirement for the software. The database is provided by the end user and is therefore maintained by that user.

• Portability: The system is portable.

• Availability: The system is available only while the machine on which it is installed is running.

• Scalability: Applicable.

Usability

• The system shall allow users to access the system from the Internet using HTML or its derivative technologies. The system uses a web browser as an interface.

• Since all users are familiar with the general usage of browsers, no specific training is required.

• The system is user friendly and self-explanatory.

Reliability

The system has to be very reliable due to the importance of the similarity measure used.

Availability

The system shall be operational and available to the user 24 hours a day, 7 days a week, 365 days a year.

Mean Time between Failures (MTBF)

The system shall be developed in such a way that it fails no more than once in 2 years.

Accuracy

The accuracy of the system is expected to be higher than that of the remaining similarity measures, such as the cosine similarity measure used in spherical k-means.

Access Reliability

The system shall provide 100% access reliability.

5.5 Business Rules

Document Clustering based on Similarity Measure Using Multi-Reference points is most suitable for marketing managers and knowledge analysts of large enterprises who need to analyze the most frequently retrieved documents from huge data repositories. The product should be used carefully so that no data is lost. A major advantage is that it gives more accurate results than other similarity measures.

6. Other Requirements

There are no other requirements.

Appendix A: Glossary

Clustering: A common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data.

Cluster Analysis: In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups, or, more precisely, groups whose members are all close to one another on various dimensions being measured. In cluster analysis, one does not start with any a priori notion of group characteristics. The definition of clusters emerges entirely from the cluster analysis, i.e., from the process of identifying "clumps" of objects.

Similarity: A method which determines the strength of the relationship between variables, and/or a means to test whether the relationship is stronger than expected under the null hypothesis. Usually, we are interested in the relationship between two variables, x and y. The correlation coefficient r is one measure of the strength of the relationship.

Data: The raw material of a system, supplied by data producers and used by information consumers to create information.

Data mining: A technique using software tools geared for the user who typically does not know exactly what he is searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing.

MTBF: Mean Time Between Failures

TF-IDF: Term Frequency-Inverse Document Frequency


Appendix B: Analysis Models

The data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

Figure: Data flow diagram for document clustering

Appendix C: To Be Determined List
