Software Requirements Specification
for
Document Clustering based on Similarity Measure Using Multi-Reference points
Version 1.0
Prepared by
Maram Nagarjuna Reddy
(11491A5811)
QIS College of Engineering And Technology
Under the esteemed guidance of
Prof. G. Lakshmi Tulasi, M.Tech., (Ph.D)
26 February 2013
Software Requirements Specification for Document Clustering based on Similarity Measure Using Multi-Reference points
Page ii
Table of Contents
Table of Contents
Revision History
1. Introduction
    1.1 Purpose
    1.2 Document Conventions
    1.3 Intended Audience and Reading Suggestions
    1.4 Product Scope
    1.5 References
2. Overall Description
    2.1 Product Perspective
    2.2 Product Functions
    2.3 User Classes and Characteristics
    2.4 Operating Environment
    2.5 Design and Implementation Constraints
    2.6 User Documentation
    2.7 Assumptions and Dependencies
3. External Interface Requirements
    3.1 User Interfaces
    3.2 Hardware Interfaces
    3.3 Software Interfaces
    3.4 Communications Interfaces
4. System Features
    4.1 System Feature 1
5. Other Nonfunctional Requirements
    5.1 Performance Requirements
    5.2 Safety Requirements
    5.3 Security Requirements
    5.4 Software Quality Attributes
    5.5 Business Rules
6. Other Requirements
Appendix A: Glossary
Appendix B: Analysis Models
Appendix C: To Be Determined List
Revision History
Name Date Reason For Changes Version
1. Introduction
The main goal of the requirements phase is to produce the software requirements
specification (SRS), which accurately captures the client’s requirements. An SRS is a document that
describes what the software should do. The basic purpose of an SRS is to bridge the communication
gap between the clients, the end users, and the software developers. Another purpose is to help
users understand their own needs.
Clustering is the classification of objects into different groups, or more precisely, the
partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share
some common trait - often proximity according to some defined distance measure. Data clustering
is a common technique for statistical data analysis, which is used in many fields, including
machine learning, data mining, pattern recognition, image analysis and bioinformatics. The
computational task of classifying the data set into k clusters is often referred to as k-clustering.
In recent years, due to the increased availability of large document collections and the
need to efficiently operate on them (e.g., navigate, analyze, query, and summarize), there has
been an increased emphasis on developing efficient and effective clustering algorithms for large
document collections. To a large extent, this research has assumed that each
document covers a single topic. This assumption is generally true for short documents
(e.g., web pages), but it does not hold for many of the large documents to which clustering
algorithms have been increasingly applied.
1.1 Purpose
The purpose of this Software Requirements Specification (SRS) document is to describe the
external behavior of the Document Clustering based on Similarity Measure Using Multi-Reference
points system.
The SRS contains a brief description of the project. The purpose of the
requirements document is to specify all the information required to design, develop, and test the
software.
The purpose of this project is to group, in an unsupervised way, a given document set into
clusters such that documents within each cluster are more similar to each other than to those
in different clusters.
The main purpose of this project is to organize a collection of patterns into clusters based
on a similarity measure using multi-reference points.
1.2 Document Conventions
In general, this document follows the IEEE formatting requirements. Use Arial font, size 11 or
12, throughout the document for text. Use italics for comments. Document text should be single
spaced and maintain the 1” margins found in this template. For section and subsection titles, please
follow the template. The template standards are published in “IEEE Standards Collection,” and can
be downloaded from
www.csc.villanova.edu/~tway/courses/csc4181/.../ srs _ template -1.doc
1.3 Intended Audience and Reading Suggestions
This SRS document is intended for users, developers, testers, and documentation writers.
The rest of the SRS is organized as follows.
Section 2 gives the overall description and the design constraints
that are to be considered when the system is designed, along with other factors necessary to
provide a complete and comprehensive description of the requirements for the software. Section
3 describes the external interface requirements, such as the user, hardware, software, and
communications interfaces. Section 4 presents the system features and their descriptions.
Section 5 describes the nonfunctional requirements, such as performance, safety, and
security requirements.
Together, these sections form the requirements specification, which defines and describes
the operations, interfaces, performance, and quality assurance requirements of Document
Clustering Using Multi-Reference points.
1.4 Product Scope
The aim of clustering is to find intrinsic structures in documents, and organize them into
meaningful subgroups for further study and analysis. Many clustering
algorithms are published every year.
The main work is to develop two similarity measures for document clustering which
provide maximum efficiency and performance.
Our first objective is to derive a novel method for measuring similarity between data
objects in sparse and high-dimensional domain, particularly text documents.
From the proposed similarity measure, we then formulate new clustering criterion functions
and introduce their respective clustering algorithms, which are fast and scalable like k-
means, but are also capable of providing high-quality and consistent performance.
The main goal is to perform document clustering by optimizing the two similarity
measures.
Document clustering is an enabling technique for a wide range of information retrieval tasks such as efficient
organization, browsing, and summarization of large volumes of text documents. Cluster analysis
aims to organize a collection of patterns into clusters based on similarity. Clustering has its roots in
many fields, such as mathematics, computer science, statistics, biology, and economics. In
different application domains, a variety of clustering techniques have been developed, depending
on the methods used to represent data, the measures of similarity between data objects, and the
techniques for grouping data objects into clusters.
1.5 References
[1] Duc Thang Nguyen, Lihui Chen and Chee Keong Chan, “Clustering with Multiviewpoint-Based
Similarity Measure”, IEEE Transactions on Knowledge and Data Engineering, 2012.
[2] Y. Zhao and G. Karypis, “Criterion Functions for Document Clustering: Experiments and
Analysis,” technical report, Dept. of Computer Science, Univ. of Minnesota, 2002.
[3] S. Zhong and J. Ghosh, “A Comparative Study of Generative Models for Document
Clustering,” Proc. SIAM Int’l Conf. Data Mining Workshop Clustering High Dimensional Data and
Its Applications, 2003.
2. Overall Description
2.1 Product Perspective
We are facing an ever increasing volume of text documents. The abundant texts flowing
over the Internet, huge collections of documents in digital libraries and repositories, and digitized
personal information such as blog articles and emails are piling up quickly every day. These have
brought challenges for the effective and efficient organization of text documents.
Clustering in general is an important and useful technique that automatically organizes a
collection with a substantial number of data objects into a much smaller number of coherent
groups. In the particular scenario of text documents, clustering has proven to be an effective
approach for quite some time—and an interesting research problem as well. It is becoming even
more interesting and demanding with the development of the World Wide Web and the evolution
of Web 2.0. For example, results returned by search engines are clustered to help users quickly
identify and focus on the relevant set of results. Customer comments are clustered in many online
stores, such as Amazon.com, to provide collaborative recommendations. In collaborative
bookmarking or tagging, clusters of users that share certain traits are identified by their
annotations.
Document clustering has become an increasingly important technique for unsupervised
document organization, automatic topic extraction, and fast information retrieval or filtering. For
example, a web search engine often returns thousands of pages in response to a broad query,
making it difficult for users to browse or to identify relevant information. Clustering methods can
be used to automatically group the retrieved documents into a list of meaningful categories, as is
achieved by search engines such as Google News. Similarly, a large database of documents can
be pre-clustered to facilitate query processing by searching only the cluster that is closest to the
query. Certain concepts used in this project need to be explained briefly.
2.2 Product Functions
The main purpose of this project is to maximize user utility.
Documents: The input is a collection of text documents, such as texts from the Internet,
documents in digital libraries and repositories, and digitized personal information such as
blog articles and emails. Their growing volume makes effective and efficient organization
a challenge.
Similarity measure: The system uses a similarity measure based on multi-reference points
for document clustering.
Clustering algorithm: The clustering algorithm uses the above similarity measure to form clusters.
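This specification does not define the similarity measure itself. As a rough sketch only, the code below assumes it follows the multiviewpoint formulation of reference [1], where the similarity of two documents in the same cluster is the average, over every document outside that cluster, of the dot product of the two difference vectors. The class and method names are hypothetical and chosen for illustration.

```java
final class MultiViewpointSimilarity {
    /** Dot product of (a - h) and (b - h), i.e. a and b as seen from viewpoint h. */
    static double relativeDot(double[] a, double[] b, double[] h) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += (a[k] - h[k]) * (b[k] - h[k]);
        }
        return sum;
    }

    /**
     * Averages relativeDot over all viewpoints. The viewpoints are assumed
     * to be the document vectors outside the cluster that holds a and b.
     */
    static double similarity(double[] a, double[] b, double[][] viewpoints) {
        if (viewpoints.length == 0) {
            throw new IllegalArgumentException("at least one viewpoint is required");
        }
        double total = 0.0;
        for (double[] h : viewpoints) {
            total += relativeDot(a, b, h);
        }
        return total / viewpoints.length;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 0.0};
        double[] b = {1.0, 0.0};
        double[][] others = {{0.0, 1.0}, {0.0, 0.0}};
        System.out.println(similarity(a, b, others));
    }
}
```

Using several reference points instead of the single origin used by cosine similarity is what the "multi-reference points" in the project title appears to refer to.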
2.3 User Classes and Characteristics
The users are assumed to have basic knowledge of computers and a deeper knowledge of
data mining. The user needs to know the exact nature of the submitted job, such as its
execution time and the resources it requires, and must possess the technical know-how to use
the interface for submitting jobs.
Users should be able to rectify small problems that may arise from disk crashes, power failures, and
other catastrophes in order to maintain the system. The user interface, user’s manual, online help,
and the installation and maintenance guide must be sufficient to teach users how to
use the system without any problems.
2.4 Operating Environment
The target operating system is Windows XP Professional. The system also requires the Java runtime and
compile-time environments, along with the NetBeans IDE. The hardware requirements
are the minimum needed to run the application. A database is needed to store the
collection of documents.
2.5 Design and Implementation Constraints
Hardware limitations: The developers do not have enough storage to keep the document
dataset.
The developers may also face timing constraints.
The current constraints on the project relate to the provision of hardware resources to
implement and test a high-performance cluster. At present, a network of four Pentium IV
workstations, each with 128 MB of RAM, serves as the cluster. For better performance analysis, a larger
number of dedicated resources would be beneficial.
2.6 User Documentation
A user manual and guide will be made available for troubleshooting and help. The user
manual will contain detailed information about the usage of the product, written for readers ranging
from a novice to an expert network/system user. The manual and a summary of the application shall
also be made available online.
2.7 Assumptions and Dependencies
It is assumed that:
● The users have sufficient knowledge of computers.
● The users have sufficient knowledge of information retrieval.
● The computer has all the tools needed to run the application.
● The users know the English language, as the user interface will be provided in English.
● The product can access the document database.
3. External Interface Requirements
3.1 User Interfaces
The user interface should be simple, interactive, and easy to understand and use.
The system should prompt the user for proper input criteria.
The software provides a graphical interface through which the user can operate the system and
perform the required tasks, such as uploading documents and viewing the details of the results.
The minimal requirement is that the cluster user is able to interact with the
system through the prompt or through the interface provided by the system. There will be a
different command for each of the following actions:
● submit text documents
● display the clusters as the result
Input design considers the following:
● What data should be given as input?
● How should the data be arranged or coded?
● What dialog should guide the operating personnel in providing input?
3.2 Hardware Interfaces
The system requires various hardware components and includes the following hardware
interfaces:
● CPU usage
● Memory usage
● Text file creation
3.3 Software Interfaces
This product requires the following specific software components:
● Java language
● NetBeans IDE 7.0.1
● Windows XP/Windows 2000
● Large databases
3.4 Communications Interfaces
The web browser performs the following tasks.
Parsing is the first step when a document enters the processing state.
Parsing is defined as the separation or identification of meta tags in an HTML document.
Here, the raw HTML file is read and parsed by visiting all the nodes in its tree structure.
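The identification of meta tags can be illustrated with a small sketch. Note that it swaps in a regular-expression scan for the tree traversal described above, purely to keep the example short; a real implementation would likely use a proper HTML parser. The class name `MetaTagScanner` and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaTagScanner {
    // Matches "<meta ...>" in any letter case. This is a simplification:
    // a regex cannot handle every legal HTML construct.
    private static final Pattern META =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns every meta tag found in the raw HTML, in document order. */
    static List<String> findMetaTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"keywords\" content=\"clustering\">"
                + "<title>Sample</title></head><body>text</body></html>";
        System.out.println(findMetaTags(html));
    }
}
```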
4. System Features
Document pre-processing steps
Tokenization: A document is treated as a string (or bag of words) and then partitioned into
a list of tokens.
Removing stop words: Stop words are frequently occurring, insignificant words. This step
eliminates the stop words.
Stemming: This step is the process of conflating tokens to their root form (connection
-> connect).
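The three pre-processing steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and the stemmer is naive suffix stripping rather than an established algorithm such as Porter's. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class Preprocessor {
    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    /** Lower-cases the text and splits it on runs of non-letter characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Drops every token that appears in STOP_WORDS. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Naive suffix stripping that keeps at least three letters; NOT the Porter stemmer. */
    static String stem(String token) {
        for (String suffix : new String[] {"ions", "ion", "ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    /** Runs tokenization, stop-word removal, and stemming in order. */
    static List<String> preprocess(String text) {
        List<String> stems = new ArrayList<>();
        for (String t : removeStopWords(tokenize(text))) {
            stems.add(stem(t));
        }
        return stems;
    }

    public static void main(String[] args) {
        // prints [connect, cluster]
        System.out.println(preprocess("The connection of clusters"));
    }
}
```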
Document representation
The N distinct words of the corpus are collected and called index terms (or the
vocabulary). Each document in the collection is then represented as an N-dimensional vector in term
space.
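A minimal sketch of this representation follows, assuming documents have already been reduced to token lists and that vector entries are raw term counts (the specification does not say which entry type is used at this stage). The class `TermSpace` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class TermSpace {
    /** Collects the distinct terms of the corpus in sorted order (the index terms). */
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (List<String> doc : docs) {
            terms.addAll(doc);
        }
        return new ArrayList<>(terms);
    }

    /** Maps one document to an N-dimensional vector of raw term counts. */
    static double[] toVector(List<String> doc, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (String term : doc) {
            int idx = vocab.indexOf(term); // linear scan; acceptable for a sketch
            if (idx >= 0) {
                v[idx] += 1.0;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("data", "cluster", "data"),
                Arrays.asList("cluster", "text"));
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        System.out.println(Arrays.toString(toVector(docs.get(0), vocab)));
    }
}
```

For a large vocabulary, a map from term to index would replace the linear `indexOf` scan.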
Computing Term weights
Term Frequency.
Inverse Document Frequency.
Compute the TF-IDF weighting.
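The weighting steps listed above can be sketched as follows. The specification does not fix a TF-IDF variant, so this example assumes one common choice: term frequency is the raw count, and inverse document frequency is the natural logarithm of (number of documents / number of documents containing the term). The class name `TfIdf` is hypothetical.

```java
final class TfIdf {
    /**
     * counts[d][t] is the raw count of term t in document d.
     * Returns weights[d][t] = counts[d][t] * ln(nDocs / df(t)),
     * where df(t) is the number of documents containing term t.
     */
    static double[][] weigh(double[][] counts) {
        int nDocs = counts.length;
        int nTerms = nDocs == 0 ? 0 : counts[0].length;

        // Inverse document frequency per term.
        double[] idf = new double[nTerms];
        for (int t = 0; t < nTerms; t++) {
            int df = 0;
            for (double[] doc : counts) {
                if (doc[t] > 0) {
                    df++;
                }
            }
            idf[t] = df == 0 ? 0.0 : Math.log((double) nDocs / df);
        }

        // TF-IDF weight per document and term.
        double[][] weights = new double[nDocs][nTerms];
        for (int d = 0; d < nDocs; d++) {
            for (int t = 0; t < nTerms; t++) {
                weights[d][t] = counts[d][t] * idf[t];
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        double[][] w = weigh(new double[][] {{1, 2, 0}, {1, 0, 1}});
        System.out.println(java.util.Arrays.deepToString(w));
    }
}
```

Under this variant, a term that occurs in every document receives weight zero, which reflects that it does not help tell documents apart.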
Measuring similarity between two documents
The similarity of two documents is captured using the cosine similarity measure. The cosine
similarity is calculated as the cosine of the angle between the two document
vectors.
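As an illustration, the cosine of the angle between two vectors is their dot product divided by the product of their lengths. The sketch below assumes dense vectors of equal length; returning 0 for an all-zero vector is a convention chosen here, not something the specification states. The class name `Cosine` is hypothetical.

```java
final class Cosine {
    /** Cosine of the angle between a and b: dot(a, b) / (|a| * |b|). */
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention for empty documents (an assumption)
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(similarity(new double[] {1, 0}, new double[] {0, 1}));
        System.out.println(similarity(new double[] {1, 2}, new double[] {2, 4}));
    }
}
```

Orthogonal vectors (no shared terms) score 0, and vectors pointing in the same direction score 1 regardless of document length.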
Use Case Diagram for functioning
Clustering
Clustering is a division of data into groups of similar objects.
Representing the data by fewer clusters necessarily loses certain fine details, but achieves
simplification.
Similar documents are grouped together in a cluster if their similarity measure is greater
than a specified threshold.
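Threshold-based grouping can be sketched with a simple single-pass scheme. This is an assumption made for illustration, since the specification does not name the clustering algorithm: each document joins the first cluster whose first member (its "leader") is similar enough, and otherwise starts a new cluster. The threshold is treated as a minimum similarity, cosine similarity stands in for the similarity measure, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ThresholdClustering {
    /** Cosine similarity; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns clusters as lists of document indices, in order of creation. */
    static List<List<Integer>> cluster(double[][] docs, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            List<Integer> home = null;
            for (List<Integer> candidate : clusters) {
                double[] leader = docs[candidate.get(0)];
                if (cosine(docs[i], leader) >= threshold) {
                    home = candidate;
                    break;
                }
            }
            if (home == null) {
                // No existing cluster is similar enough, so start a new one.
                home = new ArrayList<>();
                clusters.add(home);
            }
            home.add(i);
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] docs = {{1.0, 0.0}, {0.9, 0.1}, {0.0, 1.0}};
        // prints [[0, 1], [2]]
        System.out.println(cluster(docs, 0.8));
    }
}
```

The result of such a single-pass scheme depends on the order of the documents, which is one reason criterion-function approaches like those in references [1] and [2] are often preferred.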
5. Other Nonfunctional Requirements
5.1 Performance Requirements
The performance of the software depends on the capability of the computer. The software can
take any number of inputs provided the storage size is large enough. This depends on the
available memory space.
Response Time
A page or information view should load within a few seconds.
The system shall respond to the user in no more than two seconds from the time the
request is submitted. The system shall be allowed to take more time when doing large processing
tasks.
Throughput
The number of clusters is directly dependent on the number of users.
Resource Utilization
The resources are modified according to the user requirements and also according to the latest
similarity measures.
5.2 Safety Requirements
There are no specific safety requirements associated with the proposed system. The
Document Clustering based on Similarity Measure Using Multi-Reference points system is
composed of well-known and commonly used hardware and software, which do not cause any
safety hazards.
As a safeguard, the product does not allow users to modify any
parameters of the similarity measure used by the algorithm.
5.3 Security Requirements
Only expert personnel who have gone through the selection procedures are allowed to use
the product. Similarly, the features and data within the documents of the corpus must not be changed
at runtime.
5.4 Software Quality Attributes
• Maintainability: There will be no maintenance requirement for the software. The
database is provided by the end user and is therefore maintained by that user.
• Portability: The system is portable.
• Availability: The system is available only while the machine on which it is installed
is running.
• Scalability: Applicable.
Usability
• The system shall allow the users to access the system from the Internet using HTML or its
derivative technologies. The system uses a web browser as an interface.
• Since all users are familiar with the general usage of browsers, no specific training is required.
• The system is user friendly and self-explanatory.
Reliability
The system has to be very reliable due to the importance of the similarity measure used.
Availability
The system shall be operational and available to the user 24 hours a day, 7 days a week,
365 days a year.
Mean Time between Failures (MTBF)
The system will be developed in such a way that it fails no more than once in 2 years.
Accuracy
The system is more accurate than systems based on other similarity measures,
such as cosine similarity and spherical k-means.
Access Reliability
The system shall provide 100% access reliability.
5.5 Business Rules
Document Clustering based on Similarity Measure Using Multi-Reference points is most
suitable for marketing managers and knowledge analysts of large enterprises who need to analyze the
most frequently retrieved documents from large data repositories. The product should be
used carefully so that no data is lost. A major advantage is that it gives more accurate results than
other similarity measures.
6. Other Requirements
There are no other requirements.
Appendix A: Glossary
Clustering: A common descriptive task where one seeks to identify a finite set of
categories or clusters to describe the data.
Cluster Analysis: In multivariate analysis, cluster analysis refers to methods
used to divide up objects into similar groups, or, more precisely, groups
whose members are all close to one another on the various dimensions being
measured. In cluster analysis, one does not start with any a priori notion of
group characteristics. The definition of clusters emerges entirely from the
cluster analysis, i.e. from the process of identifying "clumps" of objects.
Similarity: A method which determines the strength of the relationship between
variables, and/or a means to test whether the relationship is stronger than
expected under the null hypothesis. Usually, we are interested in the
relationship between two variables, x and y. The correlation coefficient r is
one measure of the strength of the relationship.
Data: The raw material of a system, supplied by data producers and
used by information consumers to create information.
Data mining: A technique using software tools geared for the user who typically does not
know exactly what he is searching for, but is looking for particular patterns
or trends. Data mining is the process of sifting through large amounts of
data to produce data content relationships. It can predict future trends and
behaviors, allowing businesses to make proactive, knowledge-driven
decisions. This is also known as data surfing.
MTBF: Mean Time Between Failures
TF-IDF: Term Frequency-Inverse Document Frequency
Appendix B: Analysis Models
The data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that can be used
to represent a system in terms of the input data to the system, the various processing carried out on
these data, and the output data generated by the system.
Data flow Diagram for Document clustering
Appendix C: To Be Determined List
<Collect a numbered list of the TBD (to be determined) references that remain in the SRS so they can be tracked to closure.>