

Proceedings of the

International Conference on

Web Sciences (ICWS-2009)


www.excelpublish.com


Proceedings of the

International Conference on

Web Sciences (ICWS-2009)

(10–11 January, 2009)

Editors

L.S.S. Reddy P. Thrimurthy

H.R. Mohan K. Rajasekhara Rao

Kodanada Rama Sastry J.

Co-editors

K. Thirupathi Rao M. Vishnuvardhan

Organised by

School of Computing

Koneru Lakshmaiah College of Engineering Green Fields Vaddeswaram, Guntur-522502

Andhra Pradesh, India

In Association with

CSI Koneru Chapter & Division II of CSI

EXCEL INDIA PUBLISHERS New Delhi


First Impression: 2009

© K.L. College of Engineering, Andhra Pradesh

Proceedings of the International Conference on Web Sciences (ICWS-2009)

ISBN: 978-81-907839-9-6

No part of this publication may be reproduced or transmitted in any form by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the copyright owners.

DISCLAIMER

The authors are solely responsible for the contents of the papers compiled in this volume. The publishers or editors do not take any responsibility for the same in any manner.

Published by

EXCEL INDIA PUBLISHERS

61/28, Dalpat Singh Building, Pratik Market, Munirka, New Delhi-110067 Tel: +91-11-2671 1755/ 2755/ 5755 Fax: +91-11-2671 6755 E-mail: [email protected] Website: www.excelpublish.com

Typeset by

Excel Publishing Services, New Delhi - 110067 E-mail: [email protected]

Printed by

Excel Seminar Solutions, New Delhi - 110067 E-mail: [email protected]


Preface

The term "Web Sciences" was coined by the internationally reputed Professor P. Thrimurthy, presently the Chief Advisor of the ANU College of Engineering & Technologies at Acharya Nagarjuna University, Guntur, India, and Shri H.R. Mohan, Chairman, Division II, CSI. The Web is the interconnection of various networks spread across the world, primarily known for hosting and accessing information in various forms. The younger generation now cannot imagine doing anything without the Internet and the Web. The Web is synonymous with information management over the Internet, and it has made the world a global village. Today no field exists (engineering, physics, zoology, biology, etc.) that does not use the Internet as its backbone. The Internet is used for various purposes, including research, consultancy, academics, and business, and has become the most important medium for communication.

The Internet is used predominantly by information managers, and it is high time that all professionals were made aware of how important it is to use the Internet and the Web in their day-to-day endeavours. It is in this pursuit that Dr. Thrimurthy and Shri H.R. Mohan advocated and advised conducting an international conference on Web Sciences.

Koneru Lakshmaiah College of Engineering (KLCE), known for its quality of education, research, and consultancy and offering diversified courses in the fields of engineering, humanities, and management, has taken up the idea and decided to host the conference under the aegis of the Computer Society of India, one of the oldest premier societies of India. KLCE was formed by the Koneru Lakshmaiah Educational Foundation (KLEF). KLCE is an autonomous engineering college and is poised to become a deemed university (K.L. University) soon.

KLCE is one of the premier institutes of India, established in the year 1980. It has state-of-the-art infrastructure and has implemented very high standards in imparting technical and management courses. KLCE is known all over the world through its alumni and through various tie-ups with industry and other peer organizations situated in India and abroad. The driving force behind the success of KLCE is Mr. Koneru Satyanarayana, Chairman of KLEF.

KLCE is situated adjacent to the Buckingham Canal, Green Fields, Vaddeswaram, Guntur District, Andhra Pradesh, India, PIN 522502 (near Vijayawada).

The Computer Society of India (CSI) is a premier professional body of India committed to the advancement of the theory and practice of computer science, computer engineering, and computer technology. CSI has been tirelessly helping India in promoting computer literacy, national policy, and business.

The main aim of the International Conference on Web Sciences is to bring together industry, academia, scientists, sociologists, entrepreneurs, and decision makers from around the world. This initiative to combine engineering, management, and the social sciences would lead to creating and using new knowledge needed for the benefit of society at large.

In association with the Computer Society of India, the conference is being conducted by the School of Computing of KLCE, which comprises four departments: Computer Science and Engineering, Information Science and Technology, Electronics and Computers Engineering, and Master of Computer Applications. The School is strategically positioned to conduct this international conference, with state-of-the-art infrastructure and eminent faculty drawn from academia, industry, and research organizations. The School publishes a half-yearly journal (ISSN 0974-2107), the "International Journal of Systems and Technologies" (IJST), which publishes papers submitted by scholars from all over the world following an international adjudication system.

The conference is planned to deliver knowledge through keynote addresses, invited talks, and paper presentations, and the proceedings of the conference shall be delivered through a separate publication. Some of the well-received papers delivered at the conference shall be published in IJST. The published papers shall also be hosted at http://www.klce.ac.in.

We anticipate that excellent knowledge shall emanate from the discussions, forums, and conclusions on the future course of developments in making use of the Web by all disciplines of engineering, technology, and the sciences.

We would like to place on record our sincere thanks to the Chairman of KLCE, Sri Koneru Satyanarayana, and to Prof. L.S.S. Reddy, the Principal of KLCE, for their continuous help and appreciation and for making available all the infrastructural facilities needed for organizing an international conference of such great magnitude. We thank all the national and international members who served on the technical and organizing committees. We also thank the management, faculty, staff, and students of KLCE for their excellent support and cooperation. All the faculty and staff of the School of Computing deserve appreciation for making the conference a grand success and for their efforts in bringing out the proceedings of the conference on time, with high quality standards.

January 2009

Dr. K. Rajasekhara Rao


Organising Committee

Chief Patron

K. Satyanarayana, Chairman, KLCE

Correspondent and Secretary

K. Siva Kanchana Latha KLCE

Director

K. Lakshman Havish

KLCE

Patron

L.S.S. Reddy

Principal, KLCE

Convener

Dr. K. Rajasekhara Rao Vice-Principal, KLCE

CONFERENCE ADVISORY COMMITTEE

Hara Gopal Reddy, ANU-Guntur
K.K. Aggarwal, President, CSI
S. Mahalingam, Vice President, CSI
Raj Kumar Gupta, Secretary, CSI
Saurabh H. Sonawala, Treasurer, CSI
Lalit Sawhney, Immd. Past President, CSI
M.P. Goel, Region I Vice President, CSI
Rabindra Nath Lahiri, Region II Vice President, CSI
S.G. Shah, Region III Vice President, CSI
Sanjay Mohapatra, Region IV Vice President, CSI
Sudha Raju, Region V Vice President, CSI
V.L. Mehta, Region VI Vice President, CSI
S. Arumugam, Region VII Vice President, CSI
S.V. Raghavan, Region VIII (Intrnl) Vice President, CSI
Dr. Swarnalatha R. Rao, Division-I Chairperson, CSI
H.R. Mohan, Division-II Chairperson, CSI
Deepak Shikarpur, Division-III Chairperson, CSI
C.R. Chakravarthy, Division-IV Chairperson, CSI
H.R. Vishwakarma, Division-V Chairperson, CSI
P.R. Rangaswami, Chairman, Nomination Committee, CSI
Satish Doshi, Member, Nomination Committee, CSI
Shivraj Kumar (Retd.), Member, Nomination Committee, CSI


COLLEGE ADVISORY COMMITTEE

P. Srinivasa Kumar, OSD, KLCE
Y. Purandar, Dean IRP
G. Rama Krishna, KLCE
C. Naga Raju, KLCE
K. Balaji, HOD ECM
V. Srikanth, HOD IST
N. Venkatram, HOD ECM
V. Chandra Prakash, IST
M. Seeta Ram Prasad, CSE
K. Thirupathi Rao, CSE

Conference Chair

P. Thrimurthy ANU, Guntur

Conference Co-Chair

J.K.R. Sastry KLCE

Technical Programme Chair

H.R. Mohan, Chairman, Division-II, CSI

Technical Programme Committee

Allam Appa Rao, VC, JNTU, Kakinada
M. Chandwani, Indore
B. Yagna Narayana, IIIT Hyderabad
S.N. Patel, USA
R.V. Raja Kumar, IIT Kharagpur
Wimpie Van den Berg, South Africa
K. Suzuki, Japan
Trevol Moulden, USA
Ignatius John, Canada
Ranga Vemuri, USA
N.N. Jani, India
Yasa Karuna Ratne, Sri Lanka
Prasanna, Sri Lanka
Sukumar Nandi, IIT Guwahati
Viswanath, Nandyal


Contents

Preface v

Committees vii

Session-I: Web Technology

Semantic Extension of Syntactic Table Data

V. Kiran Kumar and K. Rajasekhara Rao 3

e-Learning Portals: A Semantic Web Services Approach Balasubramanian V., David K. and Kumaravelan G. 8

An Efficient Architectural Framework of a Tool for Undertaking Comprehensive Testing of Embedded Systems

V. Chandra Prakash, J.K.R. Sastry, K. Rajashekara Rao and J. Sasi Bhanu 13

Managed Access Point Solution

Radhika P. 21

Autonomic Web Process for Customer Loan Acquiring Process

V.M.K. Hari, G. Srinivas, T. Siddartha Varma and Rukmini Ravali Kota 29

Performance Evaluation of Traditional Focused Crawler and Accelerated

Focused Crawler N.V.G. Sirisha Gadiraju and G.V. Padma Raju 39

A Semantic Web Approach for Improving Ranking Model of Web Documents

Kumar Saurabh Bisht and Sanjay Chaudhary 46

Crawl Only Dissimilar Pages: A Novel and Effective Approach for Crawler Resource Utilization

Monika Mangla 52

Enhanced Web Service Crawler Engine (A Web Crawler that Discovers

Web Services Published on Internet)

Vandan Tewari, Inderjeet Singh, Nipur Garg and Preeti Soni 57

Session-II: Data Warehouse Mining

Web Intelligence: Applying Web Usage Mining Techniques to Discover Potential Browsing Problems of Users

D. Vasumathi, A. Govardhan and K. Suresh 67

Fuzzy Classification to Discover On-line User Preferences Using Web Usage Mining

Dharmendra T. Patel and Amit D. Kothari 71

Data Obscuration in Privacy Preserving Data Mining

Anuradha T., Suman M. and Arunakumari D. 76

Mining Full Text Documents by Combining Classification

and Clustering Approaches

Y. Ramu 83


Discovery of Semantic Web Using Web Mining

K. Suresh, P. Srinivas Rao and D. Vasumathi 90

Performance Evolution of Memory Mapped Files on Dual Core Processors

Using Large Data Mining Data Sets

S.N. Tirumala Rao, E.V. Prasad, N.B. Venkateswarlu and G. Sambasiva Rao 101

Steganography Based Embedded System Used for Bank Locker System:

A Security Approach

J.R. Surywanshi and K.N. Hande 109

Audio Data Mining Using Multi-Perceptron Artificial Neural Network

A.R. Ebhendra Pagoti, Mohammed Abdul Khaliq and Praveen Dasari 117

A Practical Approach for Mining Data Regions from Web Pages

K. Sudheer Reddy, G.P.S. Varma and P. Ashok Reddy 125

Session-III: Computer Networks

On the Optimality of WLAN Location Determination Systems

T.V. Sai Krishna and T. Sudha Rani 139

Multi-Objective QoS Based Routing Algorithm for Mobile Ad-hoc Networks

Shanti Priyadarshini Jonna and Ganesh Soma 148

A Neural Network Based Router

D.N. Mallikarjuna Rao and V. Kamakshi Prasad 156

Spam Filter Design using HC, SA, TA Feature Selection Methods

M. Srinivas, Supreethi K.P. and E.V. Prasad 161

Analysis & Design of a New Symmetric Key Cryptography Algorithm

and Comparison with RSA

Sadeque Imam Shaikh 168

An Adaptive Multipath Source Routing Protocol for Congestion Control

and Load Balancing in MANET

Rambabu Yerajana and A.K. Sarje 174

Spam Filtering using Statistical Bayesian Intelligence Technique

Lalji Prasad, RashmiYadav and Vidhya Samand 180

Ensure Security on Untrusted Platform for Web Applications

Surendrababu K. And Surendra Gupta 186

A Novel Approach for Routing Misbehavior Detection in MANETs Shyam Sunder Reddy K. and C. Shoba Bindu 195

Multi Layer Security Approach for Defense Against MITM

(Man-in-the-Middle) Attack

K.V.S.N. Rama Rao, Shubham Roy Choudhury, Manas Ranjan Patra

and Moiaz Jiwani 203


Video Streaming Over Bluetooth

M. Siddique Khan, Rehan Ahmad, Tauseef Ahmad and Mohammed A. Qadeer 209

Role of SNA in Exploring and Classifying Communities within B-Schools

through Case Study

Dhanya Pramod, Krishnan R. and Manisha Somavanshi 216

Smart Medium Access Control (SMAC) Protocol for Mobile Ad Hoc Networks

Using Directional Antennas

P. Sai Kiran 226

Implementation of TCP Peach Protocol in Wireless Network

Rajeshwari, S. Patil Satyanarayan and K. Padaganur 234

A Polynomial Perceptron Network for Adaptive Channel Equalisation

Gunamani Jena, R. Baliarsingh and G.M.V. Prasad 239

Implementation of Packet Sniffer for Traffic Analysis and Monitoring

Arshad Iqbal, Mohammad Zahid and Mohammed A. Qadeer 251

Implementation of BGP Using XORP

Quamar Niyaz, S. Kashif Ahmad and Mohammad A. Qadeer 260

Voice Calls Using IP enabled Wireless Phones on WiFi / GPRS Networks

Robin Kasana, Sarvat Sayeed and Mohammad A. Qadeer 266

Internet Key Exchange Standard for: IPSEC

Sachin P. Gawate, N.G. Bawane and Nilesh Joglekar 273

Autonomic Elements to Simplify and Optimize System Administration

K. Thirupathi Rao, K.V.D. Kiran, S. Srinivasa Rao, D. Ramesh Babu

and M. Vishnuvardhan 283

Session-IV: Image Processing

A Multi-Clustering Recommender System Using Collaborative Filtering

Partha Sarathi Chakraborty 295

Digital Video Broadcasting in an Urban Environment an Experimental Study

S. Vijaya Bhaskara Rao, K.S. Ravi, N.V.K. Ramesh, J.T. Ong, G. Shanmugam

and Yan Hong 301

Gray-level Morphological Filters for Image Segmentation and Sharpening Edges

G. Anjan Babu and Santhaiah 308

Watermarking for Enhancing Security of Image Authentication Systems

S. Balaji, B. Mouleswara Rao and N. Praveena 313

Unsupervised Color Image Segmentation Based on Gaussian Mixture Model

and Uncertainity K-Means

Srinivas Yarramalle and Satya Sridevi P. 322


Recovery of Corrupted Photo Images Based on Noise Parameters

for Secured Authentication Pradeep Reddy C.H., Srinivasulu D. and Ramesh R. 327

An Efficient Palmprint Authentication System

K. Hemantha Kumar 333

Speaker Adaptation Techniques

D. Shakina Deiv, Pradip K. Das and M. Bhattacharya 338

Text Clustering Based on WordNet and LSI

Nadeem Akhtar and Nesar Ahmad 344

Cheating Prevention in Visual Cryptography

Gowriswara Rao G. and C. Shoba Bindu 351

Image Steganalysis Using LSB Based Algorithm for Similarity Measures

Mamta Juneja 359

Content Based Image Retrieval Using Dynamical Neural Network (DNN)

D. Rajya Lakshmi, A. Damodaram, K. Ravi Kiran and K. Saritha 366

Development of New Artificial Neural Network Algorithm for Prediction

of Thunderstorm Activity

K. Krishna Reddy, K.S. Ravi, V. Venu Gopalal Reddy and Y. Md. Riyazuddiny 376

Visual Similarity Based Image Retrieval for Gene Expression Studies

Ch. Ratna Jyothi and Y. Ramadevi 383

Review of Analysis of Watermarking Algorithms for Images in the Presence

of Lossy Compression

N. Venkatram and L.S.S. Reddy 393

Session-V: Software Engineering

Evaluation Metrics for Autonomic Systems

K. Thirupathi Rao, B. Thirumala Rao, L.S.S. Reddy,

V. Krishna Reddy and P. Saikiran 399

Feature Selection for High Dimensional Data: Empirical Study on the Usability

of Correlation & Coefficient of Dispersion Measures

Babu Reddy M., Thrimurthy P. and Chandrasekharam R. 407

Extreme Programming: A Rapidly Used Method in Agile Software Process Model

V. Phani Krishna and K. Rajasekhara Rao 415

Data Discovery in Data Grid Using Graph Based Semantic Indexing Technique

R. Renuga, Sudha Sadasivam, S. Anitha, N.U. Harinee, R. Sowmya

and B. Sriranjani 423

Design of Devnagari Spell Checker for Printed Document: A Hybrid Approach

Shaikh Phiroj Chhaware and Latesh G. Mallik 429


Remote Administrative Suite for Unix-Based Servers

G. Rama Koteswara Rao, G. Siva Nageswara Rao and K. Ram Chand 435

Development of Gui Based Software Tool for Propagation Impairment

Predictions in Ku and Ka Band-Traps

Sarat Kumar K., Vijaya Bhaskara Rao S. and D. Narayana Rao H. 443

Semantic Explanation of Biomedical Text Using Google

B.V. Subba Rao and K.V. Sambasiva Rao 452

Session-VI: Embedded Systems

Smart Image Viewer Using Nios II Soft-Core Embedded Processor

Based on FPGA Platform

Swapnili A. Dumbre, Pravin Y. Karmore and R.W. Jasutkar 461

SMS Based Remote Monitoring and Controlling of Electronic Devices

Mahendra A. Sheti and N.G. Bawane 464

An Embedded System Design for Wireless Data Acquisition and Control

K.S. Ravi, S. Balaji and Y. Rama Krishna 473

Bluetooth Security

M. Suman, P. Sai Anusha, M. Pujitha and R. Lakshmi Bhargavi 480

Managing Next Generation Challenges and Services through Web

Mining Techniques

Rajesh K. Shuklam, P.K. Chande and G.P. Basal 486

Internet Based Production and Marketing Decision Support System

of Vegetable Crops in Central India Gigi A. Abraham, B. Dass, A.K. Rai and A. Khare 495

Fault Tolerant AODV Routing Protocol in Wireless Mesh Networks

V. Srikanth, T. Sai Kiran, A. Chenchu Jeevan and S. Suresh Babu 500

Author Index 505


Web Technology


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Semantic Extension of Syntactic Table Data

V. Kiran Kumar
Dravidian University, Kuppam
[email protected]

K. Rajasekhara Rao
KL College of Engineering, Vijayawada
[email protected]

Abstract

This paper explains how to convert syntactic HTML tables into RDF documents. Most of the data on web pages is laid out in HTML tables. Because this data carries no semantic markup, machines are unable to use it. This paper discusses the issues involved in converting syntactic HTML tables into semantic content (RDF files). HTML tables can be created with any HTML editor available today. The advantage of this process is that, with little knowledge of RDF, one can easily create an RDF document by designing one's data as an HTML table.

Keywords: Semantic Web, RDF, RDFS, WWW

1. Introduction

The Semantic Web is an emerging technology intended to transform 'documents' on the World Wide Web (WWW) into 'knowledge' that can be processed by machines. RDF is a language for representing resources on the web so that they can be processed by machines rather than merely displayed. Most of the data available in web pages is laid out in HTML tables. Because an HTML table lacks semantics, it cannot be processed by machines. This paper discusses the issues involved in converting syntactic HTML tables into semantic content (RDF files).

This paper contains seven sections: Section 2 discusses HTML tables, and Section 3 introduces RDF and RDFS. Section 4 discusses the conceptual view of mapping HTML tables to RDF files. Section 5 presents a scenario based on a population survey table and an employee table. Section 6 lists the key issues in the mapping process. Section 7 concludes and outlines future work on the tool.

2. HTML Tables

A table represents relationships between data. Before the creation of the HTML table model, the only method available for the relative alignment of text was the PRE element. Though it was useful in some situations, the effects of PRE were very limited. Tables were introduced in HTML 3.0, and since then a great deal of refinement has occurred. A table may have an associated caption that provides a short description of the table's purpose. Table rows may be grouped into head, body, and foot sections (THEAD, TFOOT, and TBODY elements). The THEAD and TFOOT elements contain header and footer rows, respectively, while TBODY elements supply the table's main row groups. A row group contains TR elements for individual rows, and each TR contains TH or TD elements for header cells or data cells, respectively. This paper classifies HTML tables as regular or irregular.

A regular table is, briefly, a table in which the metadata for the data items are represented in the table header and stub, which can be organized into one or more nested levels. The hierarchical structure of the column headers should be top-down. The table may contain additional metadata in the stub header cell, and it may optionally contain footnotes. For any data cell, the cell's metadata are positioned directly in either its column header or its row header. An irregular table is a table that breaks one or more of the rules of regular tables. This paper uses regular tables for conversion into an RDFS ontology.

3. RDF and RDF Schema

In February 2004, the World Wide Web Consortium released the Resource Description Framework (RDF) as a W3C Recommendation. RDF is used to represent information and to exchange knowledge on the Web, and it is especially useful when information is to be processed by machines. RDF uses a general method of decomposing knowledge into pieces called triples. A triple consists of a subject, a predicate, and an object. In RDF, the English statement

“Tim Berners-Lee invented World Wide Web”

could be represented by an RDF statement having

• A subject Tim Berners-Lee

• A predicate invented

• And an object World Wide Web

RDF statements may be encoded using various serialization syntaxes. The RDF statement above would be represented by the graph model as shown below

[Figure: RDF graph with subject node "Tim Berners-Lee", predicate arc "invented", and object node "World Wide Web"]

Subjects and objects are represented by nodes, and predicates are represented by arcs. RDF's vocabulary description language, RDF Schema, is a semantic extension of RDF. It provides mechanisms for describing groups of related resources and the relationships between these resources. RDF Schema vocabulary descriptions are written in RDF using the terms defined in the RDF Schema specification. These resources are used to determine characteristics of other resources, such as the domains and ranges of properties.
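To make the triple model concrete, the following minimal sketch (an editorial illustration, not part of the original paper) builds the "Tim Berners-Lee invented World Wide Web" statement with the Python rdflib library; the namespace and resource URIs are placeholders chosen for the example.

    from rdflib import Graph, Literal, Namespace, URIRef

    # Hypothetical namespace used only for this illustration
    EX = Namespace("http://example.org/terms#")

    g = Graph()
    g.bind("ex", EX)

    subject = URIRef("http://example.org/people/TimBernersLee")
    predicate = EX.invented              # the "invented" property
    obj = Literal("World Wide Web")      # the object of the statement

    # One RDF triple: (subject, predicate, object)
    g.add((subject, predicate, obj))

    # Serialize the single-triple graph, e.g. in Turtle syntax
    print(g.serialize(format="turtle"))

Serializing the same graph in RDF/XML instead of Turtle only requires changing the format argument; the triple itself is unchanged.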

4. The Conceptual View of Mapping

Based on the regular table concept, the table may have a caption, a table header, a stub header, and optional footnotes. In general, the table header and stub header are organized as <TH> cells. A class in RDF Schema is somewhat like the notion of a class in object-oriented programming languages; it allows resources to be defined as instances of classes and subclasses of classes. The caption of a table is mapped to an RDFS class. Similarly, the stub, table header, and data cells of a table are mapped to the subject, predicate, and object of RDF triples. Each row of the table is treated as an instance of the class. For a property in RDF, the domain is the class derived from the table caption and the range is given by the data type of the data cell. If the table header and stub header are organized in nested levels, those headers are collapsed into a single header by appending them one after another. If the table does not have a stub, a stub should be created before converting to RDF. The following diagram clearly shows the mapping process.

5. The Scenario – A Population Survey and Employee Table

For example, Fig. 1 is a population survey table; the figure clearly shows the parts of the table. As the table header and stub are organized into multiple levels, a conversion from multiple levels to a single level is needed, done by appending the headers one after another. The resulting table is shown in Fig. 2.

Fig. 1: Table with multiple levels of stub and table header

Fig. 2: Converted Table of Fig.1: with single level of stub and table header

Let us consider another type of regular table, one that does not have a stub. For such tables, a stub carrying some information about the table data should be added; the stub can be created by the user. For example, Fig. 3 represents an employee table that has no stub. As the table does not have a stub, one was created from the table caption and the row number and appended as an extra column of the table. The resulting table is shown in Fig. 4. The user is free to create a stub of his or her choice. Once an RDF file is created, the stub plays a crucial role in answering queries on that RDF file.


Fig. 3: A regular table without stub Fig. 4: Converted table of fig. 3 with user-defined stub

Fig. 5 shows the mapping of the population table to an RDF graph. This representation uses the table stub as the subject, the column header as the predicate, and the corresponding data cell as the object of the RDF graph. Each row represents an instance; as there are four rows in the 'population' table, the RDF graph contains four instances of the RDFS class. The conversion uses 'http://www.dravidianuniversity.ac.in#' as the namespace in which all the user-defined terms are placed.

Fig. 5: RDF graph representation of the population table

6. Key Steps Involved in Mapping Process

Based on the scenario outlined above, the following key steps are identified (a code sketch follows the list):

• Use the necessary namespace and URI in the RDF file.

• Map the caption of the table to an RDFS class.

• Treat each row of the HTML table as an instance of that RDFS class.

• Map the stub, table header, and data cells to the subject, predicate, and object of RDF triples.

• The domain and range of each property are defined by the table caption and the type of the data cells in the HTML table.
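The sketch below (an editorial illustration under the stated assumptions, not the authors' tool) applies these steps to a tiny table that has already been reduced to a single-level header with a stub column; the namespace, class, and property names are placeholders, and HTML parsing is omitted for brevity.

    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    NS = Namespace("http://www.example.org/table#")     # placeholder namespace/URI

    caption = "Population"                               # table caption
    header = ["Stub", "Year-Male", "Year-Female"]        # flattened, single-level headers
    rows = [["Region-A", "1200", "1150"],
            ["Region-B", "980", "1010"]]

    g = Graph()
    g.bind("t", NS)

    table_class = NS[caption]
    g.add((table_class, RDF.type, RDFS.Class))           # caption -> RDFS class

    for row in rows:
        stub, *cells = row
        instance = NS[stub]                              # stub cell names the instance
        g.add((instance, RDF.type, table_class))         # each row -> instance of the class
        for col_name, value in zip(header[1:], cells):
            prop = NS[col_name]                          # column header -> predicate
            g.add((prop, RDFS.domain, table_class))      # domain from the table caption
            g.add((prop, RDFS.range, RDFS.Literal))      # range from the data cell type
            g.add((instance, prop, Literal(value)))      # data cell -> object

    print(g.serialize(format="turtle"))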


7. Conclusion and Future Plan

This conversion process helps machines process data that exists in syntactic format by converting it into semantic form. A tool is being developed for the conversion of syntactic table data into semantic RDF files. The tool has two main advantages. First, syntactic data existing on the web in table format can easily be extended to semantic content (an RDF file); hence table data can be given a semantic treatment so that machines are able to process it. Second, a layman with little knowledge of RDF can create RDF documents by simply creating HTML tables. This process is possible only when the table contains textual data; further research is necessary to see how non-textual data can best be handled by such a conversion. Another limitation of this process is that it cannot handle nested tables.



Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

e-Learning Portals: A Semantic

Web Services Approach

Balasubramanian V., David K. and Kumaravelan G.
Department of Computer Science, Bharathidasan University, Tiruchirappalli
[email protected]

Abstract

In recent years there has been a movement from conventional learning to e-Learning. The Semantic Web can be used to bring e-Learning to a higher level of collaborative intelligence by creating a new educational model. The Semantic Web has opened new horizons for Internet applications in general and for e-Learning in particular. However, several challenges are involved in improving this e-Learning model. This paper discusses the motivation for a Semantic Web Service (SWS) based e-Learning model by addressing the major issues, and proposes an SWS-based conceptual architecture for e-Dhrona, an ongoing e-Learning project of the Department of Computer Science, Bharathidasan University, India.

1 Introduction

e-Learning commonly refers to the intentional use of networked information and communications technology in teaching and learning. e-Learning is not just concerned with providing easy access to learning resources, anytime and anywhere, via a repository of learning resources. It is also concerned with supporting features such as the personal definition of learning goals, synchronous and asynchronous communication, and collaboration between learners and instructors. Many institutions and universities in India have initiated e-Learning projects. A few United States based companies, such as QUALCOMM and Microsoft, have already committed funding to those e-Learning projects. A few Indian universities have associations with American universities such as California, Cornell, Carnegie Mellon, Harvard, Princeton, Yale, and Purdue for creating e-content.

1.1 The Technological Infrastructure of e-Learning Portals

Today many e-Learning applications achieve high standards in enabling instructors to manage online courses via web technologies and database systems. WebCT and Blackboard are the two most advanced and popular e-Learning systems; they provide very comprehensive sets of tools and have the capability to support sophisticated Internet-based learning environments [Muna, 2005]. The International Data Corporation (IDC) predicted in its January 2005 report that e-Learning would be a $21 billion market in 2008 [Vladan, 2004]. A truly effective e-Learning solution must be provided to meet the growing demands of students, employees, researchers, and lifelong learners. Efficient management of the information available on the web can lead to an e-Learning environment that provides learners with interaction and the most relevant materials.


2 Issues in Developing an e-Learning Portal

The traditional learning process can be characterized by centralized authority (content is selected by the educator), strong push delivery (the teacher pushes knowledge to students), lack of personalization (content must fulfil the needs of many), and a static, linear learning process (unchanged content). In e-Learning the course contents are distributed and student oriented. A learner can customize and personalize the contents based on his or her requirements, and the learning process can be carried out at any time and in any place, with asynchronous interaction [Naidu, 2006].

The learning materials are scattered across applications, and the user finds it very hard to construct a user-centred course. Due to the lack of a commonly agreed service language, there is no coordination between the software agents that locate the resources for a specific piece of content. Selecting the exact learning materials is also a big issue, since the resources are not properly described with metadata. The contents are not delivered to the learner in a personalized way, as there is no infrastructure to find out the real requirements of the user. If different university portals are implemented with different tools, then sharing of contents between them is not feasible due to interoperability issues.

3 Need for Semantic Portals

The heterogeneity and the distributed nature of the web have led to the need for web portals and web sites providing access to collections of interesting URLs and information that can be retrieved using search. The Semantic Web, with its powerful features, enables content publishers to express at least a crude meaning of a page instead of merely dumping HTML text. Autonomous agent software can then use this information to organize and filter the data to meet the user's needs. Current research in the Semantic Web area should eventually enable web users to have intelligent access to web services and resources [Berners Lee, 2001].

The key property of the Semantic Web architecture (common shared meaning and machine-processable metadata), enabled by a set of suitable agents, establishes a powerful approach to satisfying the e-Learning requirements: efficient, just-in-time and task-relevant learning. Learning material is semantically annotated, and for a new learning demand it may easily be combined into a new learning course. According to their preferences, users can find and combine useful learning materials easily. The process is based on semantic querying and navigation through learning materials, enabled by the ontological background, thus making semantic portals more effective than traditional web portals [Vladan, 2004].

3.1 The Role of Semantic Web Services (SWS) in e-Learning Portals

The advent of the Semantic Web and its related technologies, tools, and applications provides a new context for exploitation. The 'expression of meaning' relates directly to numerous open issues in e-Learning [Muna, 2005]. Semantic web services are aimed at enabling automatic discovery, composition, and invocation of available web services. Based on semantic descriptions of the functional capabilities of available web services, an SWS broker automatically selects and invokes the web services appropriate to achieve a given goal. The benefits of this approach are semantics-based browsing and semantic search. Semantic browsing locates metadata and assembles point-and-click interfaces from a combination of relevant information. Semantic searching is based on the metadata tagging process, which enables content providers to describe, index, and search their resources. The metadata tagging process helps in effective retrieval of relevant content for a specific search term. By adding semantic information with the help of metadata tagging, the search process goes beyond superficial keyword matching, allowing easy removal of non-relevant information from the result set.

4 The e-Dhrona

The e-Learning portal e-Dhrona is a virtual environment for the various on-line and regular courses of the University, as well as a space for teachers and students for academic exchange. Through e-Dhrona, effective educational and training courses can be brought to the PCs of students of Bharathidasan University's internal departments, constituent colleges, and affiliated colleges. The objective is to provide academic information and centralized knowledge that is customizable, accessible 24/7, flexible, convenient, and user-centric. With the abundance of courses and the shortage of faculty support, the e-Dhrona project helps in providing a standard for academic content.

4.1 The e-Dhrona Architecture with Semantic Web Services

The e-Dhrona project can cater to the needs of its users if it is enabled with Semantic Web Services (SWS) technology. The major benefit of SWS technology is the ability to compose web services that are located at different sources. Semantic Web Services are the result of the evolution of the syntactic definition of web services and the semantic web. One way to create semantic web services is to map concepts in a web service description to ontological concepts [Fensel, 2007]. Using this approach, a user can explicitly define the semantics of a web service for a given domain. The role of an ontology is to formally describe the shared meaning of the vocabulary (set of symbols) used; the ontology contains the set of possible mappings between symbols and their meanings. But the shared-understanding problem in e-Learning occurs on several ontological levels, which describe several aspects of document usage. When a student searches for learning material, the most important things to be considered are the content of the course, the context of the course, and the materials associated with it. The figure (Fig. 1) shows a conceptual semantic e-Learning architecture which provides high-level services to people looking for appropriate online courses.

The process of building this multi-step architecture comprises the following:

Knowledge Warehouse: This is the basic and core element of the architecture. It is a repository where ontologies, metadata, inference rules, educational resources, course descriptions, and user profiles are stored. The metadata is placed within an external metadata repository (e.g. the RDF repository). Building the knowledge base includes creating the e-Dhrona ontology and building the RDF repository.

e-Dhrona Ontology: The knowledge engineer creates and maintains the e-Dhrona ontology. An ontology editor (such as OntoEdit, Protégé, or OI-modeler) can be used for creating the initial ontology, and the knowledge engineer updates the ontology at later stages using the appropriate editor. In this way, the development of the ontology is an iterative process, centred on the architecture and driven by use cases, where each stage refines the previous one.
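As a hedged illustration of what a small slice of such an ontology might look like (the actual e-Dhrona ontology is not reproduced here, and every class and property name below is invented for the example), the same kind of RDFS statements an editor would produce can also be created programmatically with the Python rdflib library:

    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EDH = Namespace("http://example.org/e-dhrona#")   # hypothetical ontology namespace

    g = Graph()
    g.bind("edh", EDH)

    # Two illustrative classes: a course and a learning resource
    for cls in (EDH.Course, EDH.LearningResource):
        g.add((cls, RDF.type, RDFS.Class))

    # A property linking a learning resource to the course it belongs to
    g.add((EDH.partOfCourse, RDF.type, RDF.Property))
    g.add((EDH.partOfCourse, RDFS.domain, EDH.LearningResource))
    g.add((EDH.partOfCourse, RDFS.range, EDH.Course))
    g.add((EDH.partOfCourse, RDFS.label, Literal("part of course")))

    print(g.serialize(format="turtle"))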


Fig. 1: The Semantic e-Learning Architecture of e-Dhrona

RDF Repository: The RDF repository includes the metadata in the form of RDF triples of every web page that can be provided by any of the subsequent services. The context and content parser will be used to generate the RDF repository. The Web services engine captures new RDF and updates the RDF repository. Off-line updates on any of the databases involved can be imported and processed by the context and contents parser on a regular basis.

The Web Services Interface: It represents the Semantic Web portal. The dynamic web generator displays the portal page for each user, and e-Dhrona web users access their pages via the Common User Interface.

Search Engine: It provides an API with methods for querying the knowledge base. RDQL (RDF Data Query Language) can be used as an ontology query language.

Inference Engine: The inference engine answers queries and it performs derivations of new knowledge by an intelligent combination of facts in the knowledge warehouse with the ontology.

Software Agents: The software application which accesses the e-Dhrona Knowledge Base repository and Web resources.

Common Access Interface: It provides an integrated interface through which readers as well as authors/administrators of academic institutions can access the contents, upload or modify the data with particular authority.
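A minimal sketch of how the search engine's API might query the knowledge base is shown below. The paper names RDQL; this illustration uses its successor SPARQL through the Python rdflib library instead, and the repository file name and vocabulary are assumptions carried over from the ontology sketch above, not part of e-Dhrona.

    from rdflib import Graph

    g = Graph()
    g.parse("edhrona_repository.rdf")    # hypothetical RDF repository file

    # Find learning resources attached to a given course (vocabulary assumed above)
    query = """
        PREFIX edh:  <http://example.org/e-dhrona#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?resource ?label
        WHERE {
            ?resource edh:partOfCourse ?course ;
                      rdfs:label ?label .
            ?course rdfs:label "Semantic Web" .
        }
    """
    for resource, label in g.query(query):
        print(resource, label)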


From a pedagogical perspective, semantic portals are an “enabling technology” allowing students to determine the learning agenda and be in control of their own learning. In particular, they allow students to perform semantic querying for learning materials (linked to shared ontologies) and construct their own courses, based on their own preferences, needs and prior knowledge [Biswanath, 2006].

5 Related Work

An ontology-based intelligent authoring tool can be used for building the ontologies [Apted, 2002]. The tool uses four ontologies (domain, teaching strategies, learner model, and interface ontologies) for the construction of the learning model and the teaching strategy model, but it fails to exploit modern web technologies. Our proposed framework has been developed only in part. Future work concerns the implementation of complete ontological representations of the introduced semantic layers, as well as of current e-Learning metadata standards and their mappings. Nevertheless, the availability of appropriate web services aimed at supporting specific process objectives has to be seen as an important prerequisite for developing SWS-based applications.

6 Conclusion

In developing interactive learning environments the Semantic Web is playing a big role, and with ontological engineering, XML, and RDF it is possible to build practical systems for learners and teachers. In this paper we have presented a case study of e-Dhrona, an e-Learning scenario that exploits ontologies for describing the semantics (content), for defining the context, and for structuring the learning materials. This three-dimensional, semantically structured space enables easier and more comfortable search and navigation through the learning material. The proposed model can provide useful related information for searching and sequencing learning resources in web-based e-Learning systems.

References

[1] [Adelsberger et al., 2001] Adelsberger H., Bick M., Körner F. and Pawlowski J.M., Virtual Education in Business Information Systems (VAWI) - Facilitating Collaborative Development Processes Using the Essen Learning Model, in Proceedings of the 20th ICDE World Conference on Open Learning and Distance Education, Düsseldorf, Germany, 2001.
[2] [Apted et al., 2002] Apted, T. and Kay, J., Automatic Construction of Learning Ontologies, in Proceedings of the ICCE Workshop on Concepts and Ontologies in Web-based Educational Systems, pp. 57-64, Auckland, New Zealand, 2002.
[3] [Berners-Lee et al., 2001] Berners-Lee, T., Hendler, J. and Lassila, O., The Semantic Web, Scientific American, 284, pp. 34-43, 2001.
[4] [Biswanath, 2006] Biswanath Dutta, Semantic Web Based E-learning, in Proceedings of the International Conference on ICT for Digital Learning Environment, Bangalore, India, 2006.
[5] [Fensel et al., 2007] Fensel, D., Lausen, H. et al., Enabling Semantic Web Services, Springer-Verlag, Berlin Heidelberg, 2007.
[6] [Muna et al., 2005] Muna S. Hatem, Haider A. Ramadan and Daniel C. Neagu, e-Learning Based on Context Oriented Semantic Web, Journal of Computer Science, 1(4): 500-504, 2005.
[7] [Naidu, 2006] Som Naidu, E-Learning: A Guide Book of Principles, Procedures and Practices, Commonwealth Educational Media Center for Asia, New Delhi, 2006.
[8] [Vladan, 2004] Vladan Devedzic, Education and the Semantic Web, International Journal of Artificial Intelligence in Education, Vol. 14, pp. 39-65, 2004.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

An Efficient Architectural Framework of a Tool

for Undertaking Comprehensive Testing

of Embedded Systems

V. Chandra Prakash, J.K.R. Sastry, K. Rajashekara Rao and J. Sasi Bhanu
Koneru Lakshmaiah College of Engineering
Green Fields, Vaddeswaram, Guntur District - 522501

Abstract

Testing and debugging embedded systems is difficult and time consuming, for the simple reason that embedded systems have neither storage nor a user interface. Users are extremely intolerant of buggy embedded systems. Embedded systems deal with the external environment by sensing physical parameters, and they must also provide outputs that control that environment.

In the case of embedded systems, testing must consider both hardware and software. The malfunctioning of hardware is detected through software failures. Cost-effective testing of embedded software is a critical concern in maintaining a competitive edge. Testing an embedded system manually is quite time consuming and is also a costly proposition. Tool-based testing of an embedded system has to be considered, and put into use, to reduce the cost of testing and to complete testing of the system as quickly as possible.

Tools [5, 6, 7] are available in the market for testing embedded systems, but each covers only fragments of the required testing, and even those fragments are not addressed in a unified manner. The tools fail to address the integration testing of the software components, the hardware components, and the interface between them. In this paper, an efficient architectural framework of a testing tool that helps in undertaking comprehensive testing of embedded systems is presented.

1 Introduction

In the case of embedded systems, testing must consider both hardware and software. The malfunctioning of hardware is detected through software failures. The target embedded system does not support the hardware and software platform needed for developing and testing the software, so software development cannot be done on the target machine. The software is developed on a host machine, then installed on the target machine and executed there. The testing of an embedded system must broadly meet the following testing goals [1]:

• Find bugs early in the development process, even though the target machine is often not available early in development, or the hardware being developed in parallel with the software is unstable or buggy.

• Exercise all of the code, including the code dealing with exceptional conditions, on the target machine; this is difficult because most of the code in an embedded system relates to uncommon or unlikely situations or to events occurring in particular sequences and timings.

• Overcome the difficulty of developing reusable and repeatable tests, which require repeatable event-occurrence sequences on the target machine.

• Maintain an audit trail of test results, event sequences, code traces, core dumps, etc., which are required for debugging.

To realize these testing goals, it is necessary that testing be carried out on the host machine first and then along with the target. Embedded software must be of the highest quality and must adopt excellent strategies for testing. In order to decide on the testing strategy, the type of testing carried out, and the phases in which it is carried out, it is necessary to analyse the different types of test cases that must be used [2].

Every embedded application comprises two different types of tasks. One type of task deals with emergent processing requirements, while the other type undertakes input/output processing and various housekeeping processing. The tasks that deal with emergent requirements are initiated for execution on interrupt.

Sufficient test cases must be identified so that all types of tasks that comprise the application are tested. Several types of testing, such as integration testing, regression testing, functional testing, and module testing, must also be conducted.

Several types of testing are to be carried out. These include unit testing; black-box testing, to test the behaviour of the system during and after the occurrence of external events; environment testing, to test the user interface through LCD output, push-button input, etc.; integration testing, to test the integration of hardware components, software components, and the interface between the hardware and the software; and regression testing, to test the behaviour of the system after changes are incorporated into the code. The testing system must support testing of the hardware, of the software, and of the software together with the hardware.

2 Testing Approaches

Several authors have proposed different approaches to testing embedded systems. Jason [8] and others have suggested testing modules of embedded systems by isolating the modules at run time and improving the integration of testing into the development environment. This method, however, fails to support the regression of events. Nancy [9] and others suggested an approach of carrying out unit testing of embedded systems using agile methods and multiple strategies. Testing of embedded software is bound up with testing of hardware. Even with evolving hardware in the picture, agile methods work well provided multiple testing strategies are used. This has powerful implications for improving the quality of high-reliability systems, which commonly have embedded software at their heart. Tsai [14] and others have suggested end-to-end integration testing of embedded systems by specifying test scenarios as thin threads, each thread representing a single function. They have also developed a web-based tool for carrying out end-to-end integration testing.

Nam Hee Lee [11] suggested a different approach to integration testing based on interaction scenarios, since integration testing must consider sequences of external input events and internal interactions. Regression testing [10] has been a popular quality testing technique. Most regression testing is based on code or software design; Tsai and others have suggested regression testing based on test scenarios, and the approach suggested is functional regression testing. They have also developed a web-based tool to undertake regression testing. Jakob [16] and others have suggested testing embedded systems by simulating the hardware on the host and combining the software with the simulators. This approach, however, cannot deal with all kinds of hardware-related test scenarios; the complete behaviour of hardware, especially unforeseen behaviour, cannot be simulated on a host machine. Tsai [17] and others have suggested a testing approach based on verification patterns, the key concept being to group scenarios into patterns and apply the testing approach whenever similar patterns are recognized in any embedded application. The key to this approach, however, is the ability to identify all test scenarios that occur across all types of embedded applications.

While each of these approaches no doubt addresses a particular type of testing, none covers comprehensive testing of the hardware, of the software, and of both together.

3 Testing Requirement Analysis

Looking at software development, hardware development, integration and migration of code into the target system, and then testing of the target system, the following testing scenarios exist [19]. The entire embedded application code is divided primarily into two components: hardware-independent code and hardware-dependent code. Hardware-independent code consists of tasks that carry out mundane housekeeping and data processing and tasks that control processing on a particular device, whereas hardware-dependent code consists of the interrupt service routines or drivers that control the operation of the device. It is necessary to identify the different types of test cases that test all these types of code segments.

Unit testing, integration testing, and regression testing of the hardware-independent code can be carried out by scaffolding the code, that is, by simulating the hardware. Some of the testing related to response time, throughput, portability, and the built-in peripherals such as ROM, RAM, DMA, and UART requires the use of an instruction set simulator within the testing tool.

Some of the testing, such as checking the existence of preconditions, can also be done using assert macros. The testing system should have the ability to insert inline assert macros to test for the existence of a particular condition before a piece of code is executed, and the result of executing such a macro must also be recorded as a test case result. Tests such as null-pointer evaluation, validation of ranges of values, verification of whether a function is called by an ISR (Interrupt Service Routine) or a task, and checking and resetting event bits can be carried out using assert macros.
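To make the scaffolding and assert-macro ideas concrete, here is a small, language-neutral sketch, written in Python purely for illustration (host-side tests of this kind would normally be written in C against the real driver code); the sensor, driver function, and limits are all hypothetical.

    class SimulatedSensor:
        """Hardware stub: stands in for the real temperature sensor on the host."""
        def __init__(self, readings):
            self.readings = list(readings)

        def read(self):
            # Simulate the register read the real driver would perform
            return self.readings.pop(0)

    def average_temperature(sensor, samples):
        # Assert-macro style precondition checks, recorded as part of the test run
        assert sensor is not None, "null pointer: sensor"
        assert 1 <= samples <= 16, "samples out of range"
        total = sum(sensor.read() for _ in range(samples))
        return total / samples

    def test_average_temperature_on_host():
        sensor = SimulatedSensor([20, 22, 24, 26])
        assert average_temperature(sensor, 4) == 23

    if __name__ == "__main__":
        test_average_temperature_on_host()
        print("host-side scaffold test passed")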

Although about 80% of code testing can be done on the host, the following types of testing cannot be carried out on the host alone; they require a testing process that uses both the host machine and the target machine, or the target machine alone.

Logic analyzers help in testing the hardware. Tests related to the timing of signals, the triggering of signals, the occurrence of events, the computation of response times, patterns of occurrence of signals, and so on, can only be carried out with the help of logic analyzers. This kind of testing is done on its own, without any integration with the software; therefore a logic analyzer driven by testing software would provide a good platform for testing.

The real testing of hardware along with software can be achieved through in-circuit emulators, which replace the microprocessor in the target machine. The other chips in the embedded system are connected in the same way as they are connected to the microprocessor. The emulator has software built into a separate memory called overlay memory, which is additional memory distinct from either ROM or RAM. The emulator software provides support for debugging, recording memory contents, recording the trace of program execution during dumps, and tracing program execution for a test case. In the event of any failure, the emulator still helps in interacting with the host machine: the entire dump of overlay memory can be viewed and the reason for the failure can be investigated. The testing software should therefore be interfaced with an in-circuit emulator, particularly to support the debugging mechanism during failures and breakdowns.

Monitors are software that reside on the host and provide a debugging and testing interface with the monitor on the target. Monitors also provide a communication interface with the target to place the code in RAM or flash; if necessary, some of the locator functions are executed in the process. Users can interact with the monitor on the host to set breakpoints and run the program, and the data are communicated to the monitor on the target to facilitate execution. Monitors can be used to test memory leakage, perform function usage analysis, test for weak code, test for changes in data at specified memory locations, and test for heavily used functions. Monitors also help in testing inter-task communication through mailboxes, queues, and pipes, including overflow and underflow conditions. Thus the testing tool should have built-in monitors.

4 Architectural Analysis of Existing Tools

In the literature, the most important tools in use are those developed by companies such as Tornado [6], Windriver [5], and Tetware [7]. Tsai [10, 14, 17] and others have introduced tools which provide limited testing on the host for the purpose of carrying out unit testing, integration testing, regression testing, or end-to-end testing, but not all of them together.

4.1 Tornado Architecture

The Tornado architecture provides a testing process at both the target machine and the host machine, with the required interface between them. Fig. 4.1 shows the architecture of the Tornado tool. The model provides for a simulator on the target side, which adds too much code and may hamper achievement of the intended response time and throughput. The architecture has provision to test the hardware under the control and initiation of software. The model has no provision for scaffolding, instruction set simulation, etc. The testing of hardware using a logic analyzer is undertaken at the target, which again adds a heavy overhead on the target. This architecture does not support third-party tools that help in identifying and testing memory leakages, functional coverage, etc.

Fig. 4.1: Tornado Tool Architecture

4.2 Windriver Architecture

This architecture is an improvement over the Tornado architecture and is shown in Fig. 4.2. It uses interfaces to third-party tools, provides a user interface through which testing is carried out, and uses simulator software on the host side. It also uses an emulator on the target side, thus helping testing and debugging under failure conditions. This architecture has no support for scaffolding or for hardware testing initiated either at the host or at the target.

4.3 Tetware Architecture

The Tetware architecture is based on test cases that are fed as input at the host. Tetware provides a huge API library that interfaces with a library resident on the target. This architecture is shown in Fig. 4.3. It relies on heavy code being resident on the target, which hampers the response time and throughput very heavily. This architecture also has no support for scaffolding, instruction simulation or environment checking through assert macros.

Fig. 4.2: Windriver Tool Architecture

Fig. 4.3: Tetware Tool Architecture


5 Proposed Architectural Framework for a Comprehensive Testing Tool

Considering the different testing requirements from the point of view of scaffolding, instruction set simulation and the assert macros needed for testing at the host, a host-based architecture is proposed; it is shown in Fig. 5.1. The proposed architecture provides a user interface and a database to store the test data, the test results and the historical data transmitted by the target. The host side also provides a communication interface. The most important advantage of this model is the provision of an interface to test the hardware through probes that connect to the target through either a USB or a serial interface. The host side also provides a scaffolding facility to test the hardware-independent code on the host itself.

On the TARGET side, testing and debugging are done through an in-circuit emulator, and a flash programmer is provided for burning the program into either flash or ROM. The communication interface resident on the target provides the link with the HOST.

The proposed architecture reduces the size of the code on the target, thereby preserving the originally intended response time and throughput, and does not demand any extra hardware on the target side, thus providing a cost-effective solution.

Fig. 5.1: Proposed Architecture

6 Summary and Conclusions

Testing an embedded system is complex because the target machine has limited resources and, as such, no user interface. The testing goals cannot be achieved when testing is done with the target machine only; if testing is done using the host machine alone, the hardware-dependent code can never be tested. It is therefore evident that testing an embedded system requires an architecture that considers both the host machine and the target machine. Comprehensive testing can only be carried out by using a suite of tools which includes scaffolding software, simulators, assert macros, logic analyzers and in-circuit emulators, and all the tools must function in an integrated manner so that comprehensive testing can be carried out. The architectural framework proposed here meets the entire set of functional requirements for testing an embedded system comprehensively.

References

Books

[1] David E. Simon, "An Embedded Software Primer", Pearson Education.
[2] Raj Kamal, "Embedded Systems: Architecture, Programming and Design", Tata McGraw-Hill Publishing Company.
[3] Prasad KVKK, "Software Testing Tools", DreamTech Press, India.
[4] Frank Vahid and Tony Givargis, "Embedded Systems Design: A Unified Hardware/Software Introduction", John Wiley and Sons.

WEB Sites

[5] Windriver, http://www.windriver.com
[6] Open Group, http://www.opengroup.com
[7] Real Time Inc, http://www.rti.com

Journal Articles

[8] Jayson McDonald et al., "Module Testing Embedded Software – An Industrial Project", Proceedings of the Seventh International Conference on Engineering of Complex Computer Systems, 2001.
[9] Nancy Van et al., "Taming the Embedded Tiger: Agile Test Techniques for Embedded Software", Proceedings of the Agile Development Conference (ADC-04).
[10] Wei-Tek Tsai et al., "Scenario Based Functional Regression Testing", Proceedings of the 25th Annual International Computer Software and Applications Conference, 2002.
[11] Nam Hee Lee et al., "Automated Embedded Software Testing Using Task Interaction Scenarios".
[12] D. Deng et al., "Model Based Testing and Maintenance", Proceedings of the IEEE Sixth International Symposium on Multimedia Software Engineering, 2004.
[13] Raymond Paul, "END-TO-END Integration Testing", Proceedings of the Second Asia-Pacific Conference on Quality Software, 2001.
[14] W.T. Tsai et al., "END-TO-END Integration Testing Design", Proceedings of the 25th Annual International Computer Software and Applications Conference, 2001.
[15] Jerry Gao et al., "Testing Coverage Analysis for Software Component Validation", Proceedings of the 29th Annual International Computer Software and Applications Conference, 2005.
[16] Jakob et al., "Testing Embedded Software Using Simulated Hardware", ERTS 2006, 25-27 January 2006.
[17] W.T. Tsai, L. Yu et al., "Rapid Verification of Embedded Systems Using Patterns", Proceedings of the 27th Annual International Computer Software and Applications Conference, 2003.
[18] Dr. Sastry JKR, Dr. K. Rajashekara Rao, Sasi Bhanu J, "Comprehensive Requirement Specification of a Cost-Effective Embedded Testing Tool", paper presented at the National Conference on Software Engineering (NCSOFT), CSI, May 2007.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Managed Access Point Solution

Radhika P., Vignan's Nirula Institute of Technology & Science for Women, Pedpalakalur, Guntur-522 005

e-mail: [email protected]

Abstract

This paper highlights the general framework of Wireless LAN access point software called Managed Access Point Solution (MAPS). It is a software package that combines the latest 802.11 wireless standards with networking and security components. MAPS enables Original Equipment Manufacturers (OEMs) / Original Design Manufacturers (ODMs), to deliver leading-edge Wi-Fi® devices, such as business-class wireless gateways, broadband access points / routers and hot-spot infrastructure nodes, for the small-to-medium business (SMB) market. The software is designed with security for Wi-Fi use and secure client software supporting personal and enterprise security modes.

1 Introduction

Today's embedded systems are increasingly built by integrating pre-existing software modules, allowing OEMs to focus their efforts on their core competitive advantage: the embedded device's application. The first wave was the move towards standard commercial and open-source operating systems replacing home-grown ones. The next is the move to well-designed, configurable building blocks of software IP which simply plug into the operating environment for the application. The focus is on building modular software products that are pre-integrated with the hardware and operating systems they support, which ensures that you can spend more time using the functionality in the way that best suits your needs, rather than "porting" it to your environment. The components can be fitted into an embedded software application with a minimum amount of effort, using a streamlined and simplified licensing model that includes royalty-free distribution and full source code.

With ever-increasing cost and time-to-market pressures, building leading-edge embedded devices is a high-risk proposition. Until now, OEMs and ODMs had to go with monolithic software packages or use in-house resources to engineer device software to their requirements. As a result, customizing the devices to meet customers' specific requirements has increased time to market and added to the cost. Therefore true turnkey solutions are needed that combine:

• A rich set of field-proven, standard components.

• An array of customizable options.

• A team of professional services experts to provide all hardware/software integration, porting, testing, and validation.

• Flexible licensing options.


1.1 Wireless Technology

Wireless LAN technology is spreading into various embedded application domains, and WLAN security standards have become a key concern for many organizations. New standards address many of the initial security concerns while still maintaining and enhancing the mobility and untethered aspects of a wireless LAN. New applications in industrial networks, M2M and the consumer space demand more secure, more standardized middleware to hook up to traditional wired LANs, rather than just a monolithic wireless access box for PCs.

2 MAPS Product Overview

MAPS provides a production-ready solution for building secure, managed access point devices to OEMs/ODMs, while reducing development cost, risk, and time to market. With MAPS, OEMs can easily differentiate their Wi-Fi® Access Point products by choosing from a wide range of advanced networking and security modules.


2.1 Field-proven Software Modules

Specific instances of the Managed Access Point Solution are created by leveraging pre-existing software blocks that have proven their merit in thousands of deployments, which also minimizes risk for OEMs by keeping licensing terms flexible. Only the Managed Access Point Solution can offer such a comprehensive set of features with completely modular packaging that allows for full customization to meet an OEM's specific requirements.

2.2 SMBware Software Modules

The SMBware (Small-to-Medium Business ware) family of embedded software solutions gives OEMs/ODMs the ability to bring differentiated, leading-edge devices to the small-to-medium business (SMB) market segment. To create a fully customized device, OEMs first select from a comprehensive set of SMBware software modules. In addition to SMBware modules, OEMs can select from third-party modules or modules developed in-house. These modules are then integrated to create validated software packages that meet an OEM's specific needs.

The final custom touches are then added: specific features such as BSPs, bootloaders, drivers and hardware accelerators are developed for the OS platforms running a Managed Access Point Solution, the software modules are integrated, and the end-user device management interfaces are customized. The result is a standard, field-tested software solution in a production-ready custom package, with all hardware integration, porting, testing and validation completed.


2.3 Features of MAPS

• Dual-band, multimode networking (5 GHz and 2.4 GHz, 54 Mbps) capable of delivering high-performance throughput.

• Power over Ethernet (PoE), which eliminates extra cabling and the necessity to locate a device near a power source.

• Wireless distribution system (WDS), which extends a network’s wireless range without additional cabling.

• Advanced 802.11 security standards, including WEP, WPA and WPA2 (802.11i), covering every generation of wireless security, in Personal and Enterprise modes.

• 802.11n MIMO technology, which uses multiple radios to create a robust signal that travels farther with fewer dead spots at high data rates.

• Wi-Fi Multimedia (WMM), which provides improved quality of service over wireless connections for better video and voice performance.

2.4 IEEE 802.11

The IEEE 802.11 is a set of standards for wireless local area network (WLAN) computer communication, developed by the IEEE LAN/MAN Standards Committee (IEEE 802) in the 5 GHz and 2.4 GHz public spectrum bands. The 802.11 family includes over-the-air modulation techniques that use the same basic protocol. The most popular are those defined by the 802.11b and 802.11g protocols, which are amendments to the original standard. 802.11-1997 was the first wireless networking standard, but 802.11b was the first widely accepted one, followed by 802.11g and 802.11n. Security was originally purposefully weak due to export requirements of some governments [1], and was later enhanced via the 802.11i amendment after governmental and legislative changes.

802.11n is a new multi-streaming modulation technique that is still under draft development, but products based on its proprietary pre-draft versions are being sold. Other standards in the family (c–f, h, j) are service amendments and extensions or corrections to previous specifications. 802.11b and 802.11g use the 2.4 GHz ISM band, operating in the United States under Part 15 of the US Federal Communications Commission Rules and Regulations. Because of this choice of frequency band, 802.11b and g equipment may occasionally suffer interference from microwave ovens and cordless telephones. Bluetooth devices, while operating in the same band, in theory do not interfere with 802.11b/g because they use a frequency hopping spread spectrum signaling method (FHSS) while 802.11b/g uses a direct sequence spread spectrum signaling method (DSSS). 802.11a uses the 5 GHz U-NII band, which offers 8 non-overlapping channels rather than the 3 offered in the 2.4GHz ISM frequency band.

The segment of the radio frequency spectrum used varies between countries. In the US, 802.11a and 802.11g devices may be operated without a license, as allowed in Part 15 of the FCC Rules and Regulations. Frequencies used by channels one through six (802.11b) fall within the 2.4 GHz amateur radio band. Licensed amateur radio operators may operate 802.11b/g devices under Part 97 of the FCC Rules and Regulations, allowing increased power output but not commercial content or encryption.[2]

2.5 KEY 802.11 standards

MAPS supports key IEEE standards for WLANs including:

2.5.1 802.11e

Full Wi-Fi Multimedia standard plus MAC enhancements for QoS. Improves audio, video (e.g., MPEG-2), and voice applications over wireless networks and allows network administrators to give priority to time-sensitive traffic such as voice.

2.5.2 802.11i

Strengthens wireless security by incorporating stronger encryption techniques, such as the Advanced Encryption Standard (AES), into the MAC layer. Adds pre-authentication support for fast roaming between APs.

2.5.3 802.11n

Uses multiple-input, multiple-output (MIMO) techniques to boost wireless bandwidth and range. Multiple radios create a robust signal that travels farther, with fewer dead spots.

3 Benefits of MAPS

• Complete turnkey solution for building Wi-Fi devices with secure, managed access points lessens OEMs’ development costs, risk, and time to market.

• Selected MAPS and SMBware networking and security modules enable OEMs to easily differentiate products.


• Adherence to standards enables:

• 802.11 a/b/g/n support for maximum flexibility and high performance.

• PoE for simplified power requirements.

• WEP, WPA, WPA2 (802.11i) for advanced security.

• WDS, using a wireless medium, for a flexible and efficient distribution mechanism.

• MIMO technology for stronger signals and fewer dead spots.

• Comprehensive management capabilities including secure remote management.

• Support for a broad range of Wi-Fi chipsets.

• Branding options offer a cost-effective, customized look and feel.

4 Technical Specifications of MAPS

The deployment scenario of MAPS is characterized by the following interfaces and capabilities:

4.1 Interfaces

• Ethernet connection to wired LAN (single or multiple)

• DSL/Cable/Dialup/WWAN connection to ISP

• Optional Ethernet LAN switch (managed/unmanaged)

• Wi-Fi Supplicant upstream connection


4.2 Protocol Support

• IP routing

• Bridging

• TCP/IP, UDP, ICMP

• PPPoE, PPTP client

• DHCP, NTP

• RIP v1, v2

• Optional IPSec (ESP, AH), IKE, IKEv2

• IEEE 802.11 standards

4.3 Networking Capabilities

• Static routing, dynamic routing

• Unlimited users (subject to capacity)

• Static IP address assignment

• DHCP client for device IP configuration

4.4 DHCP Address Reservation

• NAT or classical routing

• Port triggering

• UPnP

• Configurable MTU

• Multiple LAN sub-nets

• 802.11 MIB support

4.5 Device Management

• Intuitive, easily brandable browser-based GUI

• SNMP v2.c and v3 support

• Advanced per-client, AP and radio statistics

• Telnet and serial console CLI support

• Remote management restricted to IP address or range

• Custom remote management port

• GUI-based firmware upgrade

• SMTP authentication for email


5 Conclusion

In this paper, the general framework of the Wireless LAN access point software called Managed Access Point Solution (MAPS) has been discussed. It has been explained how this software package combines the latest 802.11 wireless standards with networking and security components. MAPS enables OEMs/ODMs to deliver leading-edge Wi-Fi® devices. The technical specifications, features and benefits of MAPS have been highlighted. The software is designed with security for Wi-Fi use and includes secure client software supporting personal and enterprise security modes.

References

[1] http://books.google.co.in/books?hl=en&id=uEc4njiIXhYC&dq=ieee+802.11+handbook&printsec=frontcover&source=web&ots=qH5LuA0v2y&sig=j_baDrrbtCrbEJZuoXT4mMpKk1s
[2] http://en.wikipedia.org/wiki/IEEE_802.11
[3] IEEE 802.11 Handbook: A Designer's Companion, by Bob O'Hara and Al Petrick.
[4] http://www.teamf1.com


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Autonomic Web Process for Customer Loan Acquiring Process

V.M.K. Hari, Department of MCA, Adikavi Nannaya University; G. Srinivas, Department of Information Technology, GITAM University; T. Siddartha Varma, Department of Information Technology, GITAM University; Rukmini Ravali Kota, Department of Information Technology, GITAM University

Abstract

Web services were developed to address common problems associated with RPC (Remote Procedure Calls). As web services are agnostic about implementation details, operating system, programming language and platform, they will play a very important role in distributed computing, inter-process communication and B2B interactions. Due to rapid developments in web service technologies and semantic web services, it is possible to achieve automated B2B interactions. Using these services, we apply an autonomic web process to the customer loan acquiring process. In this work we adopt the framework for dynamic configuration of web services and process adaptation in case of events proposed by Kunal Verma, to provide improved services to the loan acquiring process. We give details of how to achieve autonomic computing features for the loan acquiring process. Because of this autonomic loan acquiring process, a customer service can configure itself automatically with optimal bank loan services to request loans, and if loan approvals are delayed, it can take optimal actions by changing loan service providers for risk avoidance.

1 Introduction

As web services provide an agnostic environment for B2B communication, by adding semantics to web service standards the publication, discovery and composition of web services can be automated [1][2]. We provide details about applying an autonomic web process to the loan acquiring process for customers. In order to achieve loan acquiring process automation, we define loan service requirements, policies and constraints semantically. To add semantics to loan services we suggest a Bank Ontology for standard interfaces (operations, inputs, outputs, exceptions). Using this Bank Ontology, each Bank's loan service interfaces are semantically annotated and a WSDL-S file is generated using tools like MWSAF. These WSDL-S files are published in UDDI structures so that their interfaces can be accessed globally through queries. To search the UDDI semantically and syntactically, we provide a domain ontology for the loan service interfaces of Banks.

Loan service policies are described using WS-Policy. Quantitative constraints are given as input to an ILP solver in LINDO API matrix format. Logical constraints are represented as SWRL rules and stored in the form of ontology rules [4][5].


In a dynamic configuration environment for the loan acquisition process, not all services are known at configuration time. So, to adapt the process at run time, an abstract process with a well-defined process flow and controls is needed. To create the abstract process, WS-BPEL constructs are used [6]. This abstract process can be deployed in the IBM BPWS4J SOAP engine server. During process execution, the service templates of the abstract process are replaced by actual services. By analyzing the process constraints and policies, the loan acquiring process services are selected for execution. During process execution, to handle events like delayed loan approval or loan request cancellation, process state maintenance is required. Maintaining the process state across the various services can be done with a service manager for each service. The service manager is modeled as an MDP (Markov Decision Process) to handle the decision framework. Runtime changes are efficiently handled by the METEOR-S architecture by using service managers [7].

2 Autonomic Loans Acquiring Process

In general, an online LOANS ACQUIRE PROCESS involves two modules: the Company Loans Requirement module and the Loan Acquiring module.

Company Loans Requirement module functions:

It sets the constraints (quantitative and logical) for the LOANS ACQUIRE PROCESS.

Quantitative constraints include:

a. Maximum rate of interest of the whole loan acquiring process.
b. Maximum rate of interest of each loan.
c. Maximum approval time of loan acquisition.
d. Maximum installment rate (EMI) for each loan.
e. Maximum number of equal monthly installments (EMIs).

Logical constraints include:

a. Faithfulness of the Bank service.
b. Restrictions on acquiring more than one type of loan, etc.

Loan Acquiring module functions:

a. It gets loan details from different Bank services.
b. It selects the best Bank services.
c. It satisfies the constraints of the Company Loans Requirement module.
d. It places loan requests with the optimal Bank services.

In this LOANS ACQUIRE PROCESS the Loan Acquiring module functions are done by humans.

In this whole process following events may occur:

• Loans approvals are delayed or cancelled because the Bank services may not have required capital.

• Physical failures like trusted bank services are not available.

• Logical failures like some loan requests are cancelled due to delay of other Bank services.

To react to the above events optimally, an autonomic web process is used. An autonomic web process is a web process with autonomic computing. By adding autonomic computing features like self-configuration, self-healing and self-optimization to the web process, it provides improved services to customers. [5]


3 Autonomic Computing Features for LOANS ACQUIRE PROCESS

3.1 Self Configuration

Whenever a new optimal Bank service is registered, the LOANS ACQUIRE PROCESS should configure itself with the new optimal service without violating the constraints of the Company Loans Requirement module.

3.2 Self Healing

The LOANS ACQUIRE PROCESS should continuously detect and diagnose various failures (e.g. preferred services are not available, or there is a delay in the loan approval time).

3.3 Self Optimization

The LOANS ACQUIRE PROCESS should monitor the optimality criteria attributes (e.g. rate of interest, loan approval time) and reconfigure the process with new optimal bank loan services. While reconfiguring the process it should obey the constraints, policies and requirements of the Company Loans Requirement module.

In order to achieve the autonomic computing features, the LOANS ACQUIRE PROCESS requirements, policies and constraints are represented semantically. To get an autonomic computing environment, we can use existing web service standards after adding semantics. Web services provide inter-process communication for the distributed computing paradigm, and by adding semantics to web service standards we can provide automated capabilities related to web service composition, publication and discovery [7]. The LOANS ACQUIRE PROCESS requires a dynamic configuration environment with event handling capability. To provide this environment we follow Kunal Verma's research work [7].

First we represent the policies, constraints and requirements of the LOANS ACQUIRE PROCESS semantically.

As the LOANS ACQUIRE PROCESS requires details of various Bank services, we have to provide a Bank services ontology so that a loan requirement process can use standard messages and protocols for implementing the B2B interactions of the LOANS ACQUIRE PROCESS. We define the loan acquiring process in terms of four steps (we define the loan acquiring process in favor of the customer, so we do not consider the bank's loan approval process):

1. Customers register with their details and the amount of loan required (assumption: the bank services may analyze the customers' details and then give details about the bank's loans).
2. The process gets loan details from the Bank services.
3. It analyzes the loan details from the different bank services.
4. It places loan requests with a faithful (trusted) Bank service.

A LOANS ACQUIRE PROCESS requesting loan services should have the following capabilities:

1. Get Loan Details.
2. Loan Request.
3. Cancel Loan.


The Bank Domain ontology standard provides standard messages for the above capabilities. It also provides standard inputs, outputs, faults and assertions for the messages. All Bank loan services should be semantically annotated with this ontology.

Fig. 1: LOANS ACQUIRE PROCESS steps.

Fig. 2: Service templates annotated using Bank Loan service ontology standard. [9][10]

4 Semantics about Bank Loan Web Services

Web services have primarily been designed for providing interoperability between business applications. WSDL is an XML standard for defining web services. As the number of Bank supplier services increases, interoperability among these services becomes difficult because the names of service inputs, outputs and functions differ among services. We can solve this by relating the service elements to ontological concepts of the Bank ontology standard.

The semantic annotations on WSDL elements [2] used for annotating Bank loan web service interfaces are listed below; a small annotation sketch follows the list:

• Annotating Loan service message types (XSD complex types and elements) can use the extension attribute modelReference (semantic association) and the extension attribute schemaMapping.

• Annotating Loan service operations can use:

• Extension elements precondition and effect (child elements of the operation element).

• Extension attribute category (on the interface element).

• Extension attribute modelReference (action) (on the operation element).
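As an illustration of the annotation mechanism described above, the following sketch attaches a SAWSDL modelReference to a WSDL 2.0 operation using only the Python standard library; the WSDL fragment, the operation name and the ontology URI are illustrative and not taken from this paper.

import xml.etree.ElementTree as ET

wsdl = ET.fromstring("""
<description xmlns="http://www.w3.org/ns/wsdl">
  <interface name="BankLoanInterface">
    <operation name="getLoanDetails"/>
  </interface>
</description>
""")

for op in wsdl.iter("{http://www.w3.org/ns/wsdl}operation"):
    if op.get("name") == "getLoanDetails":
        # relate the operation to the Bank ontology concept it implements
        op.set("{http://www.w3.org/ns/sawsdl}modelReference",
               "http://example.org/BankOntology#GetLoanDetails")

print(ET.tostring(wsdl, encoding="unicode"))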

A LOANS ACQUIRE PROCESS service template should consist of service level metadata, semantic operations and service level policy assertions. To represent the service request template we can use WSDL-S. WSDL-S provides semantic representations for inputs, outputs, operations, preconditions, effects and faults using its attributes and elements. [11][12]

Service requesting template WSDL-S=< service level meta data, semantic operations, service level policy assertions> (semantic template)

To provide autonomic communication and interoperability among bank web services, semantics are added to the bank services. We add semantics to the web services by using the various attributes of WSDL-S to relate Bank loan service elements to ontology concepts of the Bank ontology standard. [11]

Web service descriptions have two principal entities.

1. Functions provided by service.

2. Data (input, output) exchange by service.

Adding semantics to Data

Loan web service input, output can be semantically associated with Bank Ontology standard input, output using model Reference attribute of WSDL-S.[12][19][2]

Mappings of the actual invocation details of a Bank web service are needed to indicate the exact correspondence between the data types of two services. To solve this problem, SAWSDL provides two attributes for schema mapping:

1. Lifting Schema Mapping (XML instance to ontology instance)

2. Lowering Schema Mapping (ontology instance to XML instance)

Adding semantics to Functions: A Bank service function should be semantically defined so that it can be invoked from other services.

A semantic operation is represented by the following tuple [12][19][2]:

<Operation: FunctionalConcept, input: SemanticType, output: SemanticType, fault: SemanticFault, pre: SemanticPrecondition, effect: SemanticEffect>

Well-defined interfaces (URIs) are provided by the WSDL-S services.

When a semantic operation is invoked, the service manager has to perform the following checks (a sketch of these checks follows the list):

1. Is input of semantic type?

2. Are all preconditions satisfied?

3. Execute the operation.

4. Is output of semantic type?

5. After completion of operation are all effects satisfied?

6. If any fault is thrown during operation execution, then throw a Fault.
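A minimal sketch of these checks is given below; the reasoner object, its predicates and the operation object are illustrative placeholders rather than an actual METEOR-S API.

class SemanticFault(Exception):
    pass

def invoke_semantic_operation(op, message, reasoner):
    # 1. Is the input of the declared semantic type?
    if not reasoner.is_instance_of(message, op.input_type):
        raise SemanticFault("input is not of the expected semantic type")
    # 2. Are all preconditions satisfied?
    if not all(reasoner.holds(pre) for pre in op.preconditions):
        raise SemanticFault("precondition not satisfied")
    try:
        # 3. Execute the operation (e.g. a SOAP call to the bank service)
        result = op.execute(message)
    except Exception as fault:
        # 6. If any fault is thrown during execution, propagate it as a fault
        raise SemanticFault(str(fault))
    # 4. Is the output of the declared semantic type?
    if not reasoner.is_instance_of(result, op.output_type):
        raise SemanticFault("output is not of the expected semantic type")
    # 5. After completion, are all effects satisfied?
    if not all(reasoner.holds(eff) for eff in op.effects):
        raise SemanticFault("effect not satisfied")
    return result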

The non-functional requirements of Loan web services can be specified using WS-Policy [13]:

The Web Services Policy Framework (WS-Policy) provides a general purpose model and corresponding syntax to describe the policies of a web service. WS-Policy defines a base set of constructs that can be used and extended by other Web services specifications to describe a broad range of service requirements and capabilities.[13]

The goal of WS-Policy is to provide the mechanisms needed to enable Web services applications to specify policy information. Specifically, this specification defines the following:

• An XML Infoset called a policy expression that contains domain-specific Web service policy information, e.g. for loan services.

• A core set of constructs to indicate how choices and/or combinations of domain specific policy assertions apply in a (e.g. loan) Web services environment.

A policy is a collection of assertions. Each assertion can be defined using the following tuple:

Policy (P)=Union of Assertions(A)

A=<Domain attribute, operator, value, unit, assertion type, assertion category>

A Bank Loan Service policy is expressed using WS-Policy. The policy contains information about the expected delay probability and the penalties in various states. The Bank Loan Service policy has the following information (a sketch of this policy as assertion tuples follows the list):

• The Loan service gives a probability of 85% for loan approval.

• The Loan can be cancelled at any time based on the terms given below.

• If the Loan has not been delayed but has not yet been approved, it can be cancelled with a penalty of 5% to the customer.

• If the Loan has been approved without a delay, it can be cancelled with a penalty of 20% to the customer.
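The sketch below shows one way this example policy could be represented as assertion tuples of the form defined earlier and queried for the applicable cancellation penalty; the attribute names are invented for illustration.

from collections import namedtuple

Assertion = namedtuple(
    "Assertion",
    ["domain_attribute", "operator", "value", "unit", "assertion_type", "category"],
)

bank_loan_policy = [
    Assertion("loanApprovalProbability", "=", 85, "percent", "guarantee", "approval"),
    Assertion("cancelPenaltyBeforeApproval", "=", 5, "percent", "penalty", "cancellation"),
    Assertion("cancelPenaltyAfterApproval", "=", 20, "percent", "penalty", "cancellation"),
]

def cancellation_penalty(policy, loan_approved):
    # pick the penalty assertion that applies to the current loan state
    wanted = "cancelPenaltyAfterApproval" if loan_approved else "cancelPenaltyBeforeApproval"
    for assertion in policy:
        if assertion.domain_attribute == wanted:
            return assertion.value
    return None

print(cancellation_penalty(bank_loan_policy, loan_approved=False))  # 5
print(cancellation_penalty(bank_loan_policy, loan_approved=True))   # 20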


To create a dynamic configuration environment for the autonomic web process LOANS ACQUIRE PROCESS [14][7], we adopt the three steps proposed by Kunal Verma [7]:

• Abstract process creation.

• Semantic web service discovery.

• Constraints Analysis.

4.1 Abstract Process Creation for Loan Acquiring Process: [14][7][6]

To create the abstract process, all the constructs of WS-BPEL can be used [6]. WS-BPEL provides a language for the specification of Executable and Abstract business processes. By doing so, it extends the Web Services interaction model and enables it to support business transactions. WS-BPEL defines an interoperable integration model that should facilitate the expansion of automated process integration in both the intra-corporate and the business-to-business spaces.

Business processes can be described in two ways. Executable business processes model actual behavior of a participant in a business interaction. Abstract business processes are partially specified processes that are not intended to be executed.[6]

Abstract Processes serve a descriptive role, with more than one use case. One such use case might be to describe the observable behavior of some or all of the services offered by an executable Process. Another use case would be to define a process template that embodies domain-specific best practices. Such a process template would capture essential process logic in a manner compatible with a design-time representation, while excluding execution details to be completed when mapping to an Executable Process.

The advantage of WS-BPEL is that the process can be configured by replacing the semantic templates with actual services at a later time. Since WSDL-S adds semantics to WSDL by using extensibility attributes, it allows us to capture all the information in semantic templates and also makes the abstract process executable.

4.2 Semantic Web Service Discovery: [14][15][7]

How does LOANS ACQUIRE PROCESS find out what loan web services are available that meet its particular needs?

To answer this question we can use a UDDI registry. UDDI is a central, replicable registry of information about web services, based on a catalog of services. It supports lookup by both humans and machines. UDDI catalogs three types of registrations [16]:

Yellow Pages-Let you find services by various industry categories.

White Pages.-Let you find business by its name or other characteristics.

Green Pages-Provides information model for how an organization does business electronically.

• Identifies Business process as well as how to use them.

In UDDI

• Bank Loan Organizations populate registry with information about their web services.

• UDDI Registry assigns a unique identifier to each service and business registration.


While these organizations store their services in the UDDI registry, we can impose a technical note on how the services are stored. [4]

That is semantic template information of the form:

WSDL-S=< service level metadata, union of semantic operations, service level policy assertions>

The service level metadata is stored in the Business Service template. Semantic operations are stored in Binding templates under category bags, and the parts of a semantic operation are stored in the key references of the category bags.

While UDDI implementations only search for string matches, we can incorporate a SNOBASE-based ontology inference search mechanism to also consider the Bank Loan Service Process domain ontological relationships for matching. This discovery module can be implemented using the UDDI4J API. [15]

To search the UDDI, WSDL and SOAP messages can be used.
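A minimal sketch of the intended semantic matching is shown below; the registry entries, the ontology concept names and the trivial subsumption check stand in for a real UDDI registry and a SNOBASE-style reasoner.

def matches(template_ops, service_ops, is_subconcept):
    # every operation concept required by the template must be offered by the
    # service, judged by ontology concepts rather than by string names
    return all(
        any(is_subconcept(offered, required) for offered in service_ops)
        for required in template_ops
    )

# hypothetical registry: business key -> annotated operation concepts
registry = {
    "uddi:sbbi-loans": {"bank:GetLoanDetails", "bank:LoanRequest", "bank:LoanCancel"},
    "uddi:icii-loans": {"bank:GetLoanDetails", "bank:LoanRequest"},
}

template = {"bank:GetLoanDetails", "bank:LoanRequest", "bank:LoanCancel"}
same_or_sub = lambda offered, required: offered == required  # trivial check
candidates = [key for key, ops in registry.items()
              if matches(template, ops, same_or_sub)]
print(candidates)  # ['uddi:sbbi-loans']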

4.3 Constraints Analysis [14][1][7]

In order to perform constraint analysis for the Loan Acquire Process, its constraints should be represented in a consistent form for the ILP solver. The quantitative constraints of the loan acquiring process can be described as follows. [4]

Equations for Set up

1. Set the bounds on i and j, where i iterates over the number of different loans (M) for which operations are to be selected and j iterates over the number of candidate loan services for each loan, N(i). For example, M = 2, as operations have to be selected for two activities, "Personal Loan Request" and "Business Loan Request". Also, since there are two candidate services for both operations, N(1) = 2 and N(2) = 2.

2. Create a binary variable for each selectable operation of a candidate service. The candidate services for "Personal Loan Request" (i = 1) are assigned X11 and X12, and the candidate services for "Business Loan Request" (i = 2) are assigned X21 and X22.

3. Set up constraints that state that only one operation must be chosen for each activity:

Σ_{j=1}^{N(1)} X1j = 1, i.e. X11 + X12 = 1    (a)

Σ_{j=1}^{N(2)} X2j = 1, i.e. X21 + X22 = 1    (b)

Equations for Quantitative Constraints

4. It is also possible to have constraints on particular loan service activities. There is a constraint on activity 1 that the number of installments (NEMIs) must be at least 30. This can be expressed as the following constraint:

Σ_{j=1}^{N(1)} NEMIs_1j · X1j ≥ 30

5. There is an entire-process constraint that the loan approval time of the process should be at most 8 days:

Σ_{i=1}^{M} Σ_{j=1}^{N(i)} LoanApprovalTime_ij · Xij ≤ 8

6. There is an entire-process constraint that the loan installment amount (EMI) of the process should be at most 1200 rupees:

Σ_{i=1}^{M} Σ_{j=1}^{N(i)} EMI_ij · Xij ≤ 1200

7. Create the objective function. In this case, the interest should be minimized:

Minimize Σ_{i=1}^{M} Σ_{j=1}^{N(i)} INTEREST_ij · Xij

These equations can be given as input to the LINDO API for solving the constraints, which yields the optimal services.
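The same formulation can be sketched with an open-source ILP modeller such as PuLP in place of the LINDO API; all numeric values below (interest rates, approval times, EMIs, instalment counts) are made-up examples.

import pulp

M, N = 2, 2                                   # 2 loan activities, 2 candidates each
interest = [[11.5, 10.75], [12.0, 11.25]]     # INTEREST_ij (hypothetical)
approval = [[2, 3], [4, 6]]                   # loan approval time in days
emi      = [[550, 600], [620, 580]]           # monthly instalment in rupees
nemis    = [[36, 24], [48, 30]]               # number of instalments

prob = pulp.LpProblem("loan_service_selection", pulp.LpMinimize)
x = [[pulp.LpVariable("x_%d_%d" % (i, j), cat="Binary") for j in range(N)]
     for i in range(M)]

# objective: minimise the total interest of the selected services
prob += pulp.lpSum(interest[i][j] * x[i][j] for i in range(M) for j in range(N))

# exactly one candidate service per activity (constraints (a) and (b))
for i in range(M):
    prob += pulp.lpSum(x[i][j] for j in range(N)) == 1

# activity-level constraint: at least 30 instalments for activity 1
prob += pulp.lpSum(nemis[0][j] * x[0][j] for j in range(N)) >= 30
# process-level constraints: approval time <= 8 days, total EMI <= 1200 rupees
prob += pulp.lpSum(approval[i][j] * x[i][j] for i in range(M) for j in range(N)) <= 8
prob += pulp.lpSum(emi[i][j] * x[i][j] for i in range(M) for j in range(N)) <= 1200

prob.solve()
chosen = [(i, j) for i in range(M) for j in range(N) if pulp.value(x[i][j]) == 1]
print(chosen)  # the optimal candidate service per activity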

To create logical constraints we first need to provide the LOANS ACQUIRE PROCESS domain knowledge using ontology rules. Ontology rules are represented using SWRL (Semantic Web Rule Language) and stored in the form of an ontology. There are two aspects of logical constraint analysis: Step 1) creating the rules based on the constraints at design time, and Step 2) applying the SWRL reasoner to see if the constraints are satisfied at configuration time [5]. Let us first examine creating the rules. These rules are created with the help of the ontology shown in Figure 3. Here is a sample rule that captures the requirements outlined in the motivating scenario.

1. The SBBI Bank Loan Service 1 should be a trusted service. This is expressed in SWRL abstract syntax using the following expression:

BankService(?S1) ∧ faithfulness(?S1, "trusted") => trustedService(?S1)

Fig. 3: Bank Domain Ontology (Bank Services such as sbbi, sbhi, abbi and icii provide Home, Personal and Business Loans, each described by NEMIs, Instalment and ROI attributes).


5 Conclusion and Future Work

In this work we provide a framework to create an autonomic loan acquiring process. We give a loan acquiring process domain ontology standard for Bank loan service messages and interfaces, and we present the work with examples. This work explores ideas for an autonomic loan acquiring process for improving customer services.

In future work we intend to evaluate this autonomic loan process under the METEOR-S environment and provide the results of the autonomic loan acquiring process.

References

[1] R. Aggarwal, K. Verma, J. Miller and W. Milnor, Constraint Driven Web Service Composition in METEOR-S, Proceedings of the 2004 IEEE International Conference on Services Computing (SCC 2004), Shanghai, China, pp. 23-30, 2004.
[2] SAWSDL, Semantic Annotations for Web Services Description Language Working Group, 2006, http://www.w3.org/2002/ws/sawsdl/
[3] A. Patil, S. Oundhakar, A. Sheth, K. Verma, METEOR-S Web Service Annotation Framework, Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), New York, pp. 553-562, 2004.
[4] LINDO API for Optimization, http://www.lindo.com/
[5] J. Colgrave, K. Januszewski, L. Clément, T. Rogers, Using WSDL in a UDDI Registry, Version 2.0.2, http://www.oasis-open.org/committees/uddi-spec/doc/tn/uddi-spec-tc-tn-wsdl-v202-20040631.htm; http://lsdis.cs.uga.edu/projects/METEOR-S
[6] SWRL, http://www.daml.org/2003/11/swrl/
[7] L. Lin and I. B. Arpinar, Discovery of Semantic Relations between Web Services, IEEE International Conference on Web Services (ICWS 2006), Chicago, Illinois, 2006 (to appear).
[8] wsbpel-specification-draft-01, Web Services Business Process Execution Language Version 2.0, Public Review Draft, 23rd August 2006, http://docs.oasis-open.org/wsbpel/2.0/
[9] RosettaNet eBusiness Standards for the Global Supply Chain, http://www.rosettanet.org/; Configuration and Adaptation of Semantic Web Processes, Kunal Verma, Doctor of Philosophy, Athens, Georgia, 2006.
[10] RosettaNet Ontology, http://lsdis.cs.uga.edu/projects/meteor-s/wsdl-/ontologies/rosetta.owl
[11] K. Sivashanmugam, K. Verma, A. P. Sheth, J. A. Miller, Adding Semantics to Web Services Standards, Proceedings of the International Conference on Web Services (ICWS 2003), Las Vegas, Nevada, pp. 395-401, 2003.
[12] Web Service Description Language (WSDL), www.w3.org/TR/ws
[13] Web Service Policy Framework (WS-Policy), available at http://www106.ibm.com/developerworks/library/ws-polfram/, 2003.
[14] K. Verma, K. Gomadam, J. Lathem, A. P. Sheth, J. A. Miller, Semantics Enabled Dynamic Process Configuration, LSDIS Technical Report, March 2006.
[15] M. Paolucci, T. Kawamura, T. Payne and K. Sycara, Semantic Matching of Web Services Capabilities, Proceedings of the First International Semantic Web Conference, Sardinia, Italy, pp. 333-347, 2002.
[16] Universal Description, Discovery and Integration (UDDI), http://www.uddi.org
[17] K. Verma, R. Akkiraju, R. Goodwin, Semantic Matching of Web Service Policies, Proceedings of the Second International Workshop on Semantic and Dynamic Web Processes (SDWP 2005), Orlando, Florida, pp. 79-90, 2005.
[18] K. Verma, A. Sheth, Autonomic Web Processes, Proceedings of the Third International Conference on Service Oriented Computing (ICSOC 2005), Vision Paper, Amsterdam, The Netherlands, pp. 1-11, 2005.
[19] K. Verma, P. Doshi, K. Gomadam, J. A. Miller, A. P. Sheth, Optimal Adaptation in Web Processes with Coordination Constraints, Proceedings of the Fourth IEEE International Conference on Web Services (ICWS 2006), Chicago, IL, 2006 (to appear).
[20] R. Bellman, Dynamic Programming and Stochastic Control Processes, Information and Control 1(3), pp. 228-239, 1958.
[21] WSDL-S, W3C Member Submission on Web Service Semantics, http://www.w3.org/Submission/WSDL-S/
[22] Web Service Modeling Language (WSML), http://www.wsmo.org/wsml/


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Performance Evaluation of Traditional Focused Crawler and Accelerated Focused Crawler

N.V.G. Sirisha Gadiraju, S.R.K.R Engineering College, Bhimavaram, [email protected]; G.V. Padma Raju, S.R.K.R Engineering College, Bhimavaram, [email protected]

Abstract

Search engines collect data from the Web by crawling it. In spite of consuming enormous amounts of hardware and network resources, these general-purpose crawlers end up fetching only a fraction of the visible web. When the information need concerns only a specific topic, special crawlers called topical crawlers complement search engines. In this paper we compare and evaluate the performance of two topical crawlers, the Traditional Focused Crawler and the Accelerated Focused Crawler. A Bayesian classifier guides these crawlers in fetching topic-relevant documents. The crawlers are evaluated using two methods: one based on the number of topic-relevant target pages found and retrieved, and the second based on the lexical similarity between crawled pages and the topic descriptions provided for the topic by the editors of Dmoz.org. Due to the limited amount of resources consumed by these crawlers, they have applications in niche search engines and business intelligence.

1. Introduction

The size of the publicly indexable World-Wide-Web has exceeded 23.68 billion pages in the year 2008. This is very large compared to the one billion pages in the year 2000. Dynamic content on the web is also growing day by day. Search engines are therefore increasingly challenged when trying to maintain current indices using exhaustive crawling. Exhaustive crawls also consume vast storage and bandwidth resources.

Focused crawlers [Chakrabarti et al., 1999] aim to search and retrieve only the subset of the World-Wide Web that pertains to a specific topic of relevance. Due to the limited resources used by a good focused crawler, users can even run one on their own PCs.

The major problem in focused crawling is that of properly assigning credit to all pages along a crawl route that yields highly relevant documents. A classifier can be used to assign credit (priority) to unvisited URLs. The classifier is trained with a specific topic's positive and negative example pages and is then used to predict the relevance of an unvisited URL to the specific topic. The Naïve Bayesian Classifier is a popular classifier used to automatically tag documents. It is based on the fact that if we know the probabilities of words (features) appearing in a certain category of documents, then, given the set of words (features) in a new document, we can predict the relevance of the new document to the given category or topic. Relevance can take any value between 0 and 1.
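A minimal sketch of such a classifier is given below, assuming a multinomial Naive Bayes model with add-one smoothing; it is an illustration, not the classifier used in the experiments.

import math
from collections import Counter

def train(docs_by_class):
    # docs_by_class: {"relevant": [token lists], "irrelevant": [token lists]}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        denom = sum(counts.values()) + len(vocab)
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            "log_like": {w: math.log((counts[w] + 1) / denom) for w in vocab},
            "log_unseen": math.log(1.0 / denom),
        }
    return model

def relevance(model, tokens, topic="relevant"):
    # posterior probability (between 0 and 1) that the tokens belong to the topic
    log_scores = {
        label: m["log_prior"] + sum(m["log_like"].get(w, m["log_unseen"]) for w in tokens)
        for label, m in model.items()
    }
    z = max(log_scores.values())
    exp_scores = {label: math.exp(s - z) for label, s in log_scores.items()}
    return exp_scores[topic] / sum(exp_scores.values())

The value returned by relevance() lies between 0 and 1 and can be used directly as the priority assigned to an unvisited URL.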

A major characteristic, and difficulty, of the text classification problem is the high dimensionality of the feature space. The native feature space consists of the unique terms (words or phrases) that occur in the documents, which can be tens or hundreds of thousands of terms for even a moderate-sized text corpus. This is prohibitively high for the classification algorithm, so it is highly desirable to reduce the number of features automatically without sacrificing classification accuracy. Stop word removal [Sirotkin et al., 1992] and stemming [Porter, 1980], along with a term goodness criterion like document frequency, help in achieving the desired degree of term elimination from the full vocabulary of a document corpus.

The remainder of this paper is structured as follows: Section 2 describes the two crawling methods. Section 3 describes the document frequency feature selection method. Section 4 states the procedure used to obtain the test data; the evaluation schemes and a comparison of results are also given in Section 4. Section 5 summarizes our conclusions.

2. Traditional Focused Crawling and Accelerated Focused Crawling

The Traditional Focused Crawler and the Accelerated Focused Crawler start from a set of topic-relevant URLs called seed URLs. The documents represented by these seed URLs are fetched and the links embedded in these seed documents are collected. The relevancy of these links to the target topic is found by means of a classifier. The links are then added to a priority queue of unvisited links with a priority equal to the relevancy calculated above. The document representing the link at the front of the queue is fetched, and the process repeats until a predefined goal is attained.

In determining the relevancy of a link to the target topic, the Traditional Focused Crawler uses features of the parent page, whereas the Accelerated Focused Crawler uses features around the link itself. Both crawlers are trained on a set of topic-relevant documents known as seed pages; the Accelerated Focused Crawler is additionally trained on documents retrieved by the Traditional Focused Crawler. A sketch of the shared crawl loop follows.
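The sketch below is a minimal illustration of the best-first crawl loop; fetching and link extraction are stubbed out, and score stands for the Bayesian classifier applied to the parent page (Traditional Focused Crawler) or to the features around the link (Accelerated Focused Crawler).

import heapq

def crawl(seed_urls, score, fetch, extract_links, max_pages=1000):
    # the frontier is a priority queue keyed by negated relevance, so that
    # heapq pops the most relevant unvisited URL first
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited, crawled = set(seed_urls), []
    while frontier and len(crawled) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)                          # download the document
        if page is None:
            continue
        crawled.append((url, page))
        for link, context in extract_links(page):  # context = parent page or link text
            if link not in visited:
                visited.add(link)
                heapq.heappush(frontier, (-score(context), link))
    return crawled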

3. Feature Selection Method

Features are gathered from the DOM tree [Chakrabarti et al., 2002] representation of the document using a DOM parser. Of all the features gathered from the document corpus, good features are selected using the document frequency criterion.

Document frequency is the number of documents in which the feature appeared. Only the terms that occur in a large number of documents are retained. DF thresholding is the simplest technique for vocabulary reduction. It scales easily to very large corpora with an approximately linear computational complexity [Yang et al., 1997] in the number of training documents. DF of a feature is given by

DF(w) = Σ_{i=1}^{n} (No. of documents in class i containing the word w) / (Total no. of documents in class i)
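A minimal sketch of DF-based feature selection under this definition is shown below; the threshold value and the data layout are illustrative.

def select_features(class_docs, threshold=0.1):
    # class_docs: {class_label: [set of terms in each document, ...]}
    vocab = {t for docs in class_docs.values() for doc in docs for t in doc}
    selected = set()
    for term in vocab:
        # DF(term) summed over classes, as in the formula above
        df = sum(
            sum(1 for doc in docs if term in doc) / len(docs)
            for docs in class_docs.values()
        )
        if df >= threshold:
            selected.add(term)
    return selected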

4. Experiments

4.1. Test Beds Creation

For evaluating the crawlers, the topic-relevant seed URLs, training URLs, target URLs and topic descriptions are collected from the content.rdf file of the Open Directory Project (ODP). Topics that are at a distance of 3 from the ODP root are picked, and all topics with fewer than 100 relevant URLs are removed so that we have topics with a critical mass of URLs for training and evaluation. Among these, the topics Food_Service, Model_Aviation, Nonprofit_Resources, Database_Theory, Mutual_Funds and Roller_Skating are actually crawled. The ODP relevant set for a given topic is divided into two random subsets. The first set is the seeds: this set of URLs was used to initialize the crawl as well as to provide the set of positive examples to train the classifiers. The second set is the targets: it is a holdout set that was not used either to train or to seed the crawlers, and these targets were used only for evaluating the crawlers. The topics were divided into two test beds, and the crawlers have crawled all the topics present in both test beds. The first test bed is Food_Service, Model_Aviation and Nonprofit_Resources. The second test bed is Mutual_Funds, Roller_Skating and Database_Theory. For training a classifier we need both positive and negative examples. The positive examples for a topic are the pages corresponding to the seed URLs for that topic in the test bed. The negative examples are the set of seed URLs of the other two topics in the same test bed.

4.2 Evaluation Scheme

Table 1: Evaluation Schemes. S_c^t is the set of pages crawled by crawler c at time t, T_d is the target set, and D_d and p are the vectors representing the topic description and a crawled page respectively. σ is the cosine similarity function.

Relevance Assessments | Recall                        | Precision
Target Pages          | |S_c^t ∩ T_d| / |T_d|         | |S_c^t ∩ T_d| / |S_c^t|
Target Descriptions   | Σ_{p ∈ S_c^t} σ(p, D_d)       | (Σ_{p ∈ S_c^t} σ(p, D_d)) / |S_c^t|

The crawlers' effectiveness is measured using two measures, recall and precision. Table 1 shows the assessment schemes used in this paper. It consists of two sets of crawler effectiveness measures, differentiated mainly by the source of evidence used to assess relevance.

The first set (precision, recall) focuses only on the target pages that have been identified for the topic. The second set (precision, recall) employs relevance assessments based on the lexical similarity between crawled pages (whether or not they are in target set) and the topic descriptions. All the four measures are dynamic in that they provide a temporal characterization of the crawl strategy. It is suggested that these four measures are sufficient to provide a reasonably complete picture of the crawl effectiveness [Pant et al., 2001].

To find the cosine similarity between a crawled page and the topic description for the specific topic, both of them have to be represented in a mutually compatible format. For this, the features and their term frequency values are extracted from the crawled page as well as from the topic description, and the vectors D and p are constructed. The cosine similarity function σ is given by

σ(p, D) = ( Σ_{i ∈ p ∩ D} p_i · D_i ) / sqrt( (Σ_{i ∈ p} p_i²) · (Σ_{i ∈ D} D_i²) )

where p_i and D_i are the term frequency weights of term i in page p and topic description D respectively. Recall for the full crawl is estimated by summing up the cosine similarity scores over all the crawled pages. For precision, the proportion of retrieved pages that is relevant is estimated as the average similarity of the crawled pages.
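A minimal sketch of these description-based measures, using raw term-frequency weights, is given below; it is an illustration rather than the evaluation code used in the experiments.

import math
from collections import Counter

def cosine(p, D):
    # p, D: term-frequency dictionaries (Counter objects)
    shared = set(p) & set(D)
    num = sum(p[t] * D[t] for t in shared)
    den = math.sqrt(sum(v * v for v in p.values()) * sum(v * v for v in D.values()))
    return num / den if den else 0.0

def description_recall_precision(crawled_pages, description_tokens):
    # crawled_pages: list of token lists for the pages crawled so far
    D = Counter(description_tokens)
    sims = [cosine(Counter(tokens), D) for tokens in crawled_pages]
    recall = sum(sims)                               # summed similarity
    precision = recall / len(sims) if sims else 0.0  # average similarity
    return recall, precision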

4.3 Results Comparison

Offline analysis is done after all the pages have been downloaded and the experiments are over. Precision and recall are measured as shown in Table 1. The horizontal axis in all the plots is time, approximated by the number of pages crawled. The vertical axis shows the performance metric, i.e. recall or precision. The nature of the topic is said to have an impact on crawler performance [Chakrabarti et al., 1999; Pant et al., 2001]. From Figures 1 and 2 we can see that the crawlers have fetched a larger number of topic-relevant target pages when crawling a co-operative topic like Nonprofit_Resources. The effect of the size of the training data and test data on crawler performance is also studied. When crawling the topics Food_Services, Model_Aviation and Nonprofit_Resources the training data is large (nearly 200 to 300 documents relevant to the topic), and the crawlers were able to fetch a larger number of target pages, as shown by the recall values in Figures 1 and 2.

Fig. 1: Target Recall of Traditional Focused Crawler Fig. 2: Target Recall of Accelerated Focused Crawler

Fig. 3: Target recall of Traditional Focused Crawler Fig. 4: Target recall of Accelerated Focused Crawler

The number of URLs specified as targets is only 30 for each of the topics in this test bed. When crawling the topics Database_Theory, Mutual_Funds and Roller_Skating, the crawlers were trained with only 75 documents relevant to the specified topic, and the number of URLs specified as targets is 25 for Database_Theory, 306 for Mutual_Funds and 455 for Roller_Skating. Though the target set is large in the second test bed, the crawlers were able to fetch only a few of them, as shown in Figures 3 and 4. This shows that when a classifier is used to guide the crawlers, good training (a large set of training data) yields good crawler performance. The Accelerated Focused Crawler is trained on the documents fetched by the Traditional Focused Crawler, i.e. it has a large training data set. This once again suggests the use of the Accelerated Focused Crawler instead of the Traditional Focused Crawler for fetching topic-relevant documents.

Fig. 5: DMoz recall for Traditional Focused Crawler. Fig. 6: DMoz recall for Accelerated Focused Crawler

Fig. 7: DMoz recall of Traditional Focused Crawler Fig. 8: DMoz recall of Accelerated Focused Crawler

When the crawlers were evaluated using the lexical similarity between crawled pages (whether or not they are in the target set) and topic descriptions, high recall values were attained by the Accelerated Focused Crawler for all topics, as shown in Figures 5, 6, 7 and 8. In Figure 8 the DMoz recall values are in descending order of magnitude for the topics Mutual_Funds, Roller_Skating and Database_Theory. This is because Mutual_Funds was described more briefly than Roller_Skating, and Roller_Skating more briefly than Database_Theory, by the DMoz editors. This result suggests that when crawlers are driven by keyword queries or topic descriptions, it is better to describe the topic or theme using only, and all, the prominent terms of the topic. This results in collecting a larger number of pages relevant to the topic.

Figure 10 shows that when the available training set of documents is small (as in the case of Database_Theory, Mutual_Funds and Roller_Skating), the Accelerated Focused Crawler has outperformed the Traditional Focused Crawler in average target recall. Figure 9 shows that the performance of the Traditional Focused Crawler and the Accelerated Focused Crawler is nearly the same when the training data set is large. The Traditional Focused Crawler found 0 targets for the topic Database_Theory (Figure 3). This is because the target set size is only 27 for this topic. In the case of the other two topics in the same test bed the number of targets is greater than 300.

Fig. 9: Average target recall of the crawlers. Topics-FMN Fig. 10: Average target recall of crawlers. Topics-DMR

These target sets were not picked that way intentionally; the number of links specified as relevant by the DMoz editors for the topic Database_Theory was simply smaller than for the other two topics. This result may indicate that Database_Theory is a less popular topic on the Web. The Accelerated Focused Crawler (Figure 4) has shown better performance in this case also. This shows that the popularity of a topic also affects crawler performance.

5 Conclusion

The Accelerated Focused Crawler is a simple enhancement of the Traditional Focused Crawler. It assigns better priorities to the unvisited URLs in the crawl frontier. No manual training is given to the crawlers; they are trained with documents relevant to the topic gathered from DMoz.org. Features are extracted from the DOM representation of the parent (source) page, which is simple compared to other techniques. When only a small training data set is available, it is advisable to use the Traditional Focused Crawler only to train the Accelerated Focused Crawler, and not for fetching topic-relevant documents. When the training data is large, the Traditional Focused Crawler can also be used for fetching topic-relevant documents. In either case, the Accelerated Focused Crawler has performed well compared to the Traditional Focused Crawler. The popularity of the topic as well as its nature, i.e. whether it is competitive or collaborative, also has an effect on crawler performance. When crawlers are driven by keyword queries or topic descriptions, describing the topic or theme using only, and all, the prominent terms of the topic enhances the performance of the crawl.

References

[1] [Chakrabarti et al., 1999] S. Chakrabarti, M. van den Berg and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery", In Proceedings of the Eighth International Conference on World Wide Web, Toronto, Canada, pages 1623–1640, 1999.

[2] [Chakrabarti et al., 2002] S. Chakrabarti, K. Punera and M. Subramanyam, "Accelerated focused crawling through online relevance feedback", In Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, pages 148–159, 2002.

[3] [Pant et al., 2001] F. Menczer, G. Pant, M.E. Ruiz and P. Srinivasan, "Evaluating topic-driven Web crawlers", In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, USA, pages 241–249, 2001.

[4] [Porter, 1980] M.F. Porter, "An algorithm for suffix stripping", Program, Volume 14, Issue 3, pages 130–137, 1980.

[5] [Sirotkin et al., 1992] J.W. Wilbur and K. Sirotkin, "The automatic identification of stop words", Journal of Information Science, Volume 18, Issue 1, pages 45–55, 1992.

[6] [Yang et al., 1997] Y. Yang and J.O. Pedersen, "A comparative study on feature selection in text categorization", In Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pages 412–420, 1997.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

A Semantic Web Approach for Improving Ranking Model of Web Documents

Kumar Saurabh Bisht, DA-IICT, Gandhinagar, Gujarat – 382007, [email protected]
Sanjay Chaudhary, DA-IICT, Gandhinagar, Gujarat – 382007, [email protected]

Abstract

Ranking models are used by Web search engines to answer user queries based on keywords. Traditionally, ranking models are based on a static snapshot of the Web graph, which is essentially the link structure of the Web documents. A visitor's browsing activity is directly related to the importance of a document; however, in this traditional static model the document importance arising from interactive browsing is neglected. Thus the model lacks the ability to take advantage of user interaction for document ranking.

In this paper we propose a model based on the Semantic Web to improve the local ranking of Web documents. The model uses the Ant Colony algorithm to enable Web servers to interact with Web surfers and thus improve the local ranking of Web documents. The local ranking can then be used to generate the global Web ranking.

1 Introduction

In today's ICT era, information seeking has become a part of social behaviour. With the plethora of information available on the Web, an efficient mechanism for information retrieval is of primal importance. Search engines are an important tool for finding information on the Web, and all of them try to retrieve data based on the ranking of Web documents. However, traditional ranking models are based on a static snapshot of the Web graph, including the document content and the information conveyed. An important missing part of this static model is that information based on users' interactive browsing is not accounted for. This affects the relevancy and importance metrics of a document, as the judgments collected from users at run time can be very important.

In this paper we propose a model based on the Semantic Web that enables a web server to record the interactive experience of users.

The given approach works on three levels:

1. To make the existing model flexible, i.e. the metrics related to relevance and importance can be modified according to run-time/browsing-time user judgments.

2. The new enhancement should automate the processing of the metrics recorded during browsing time.

3. The model should enable the web server to play an active role in the user's choice of highly ranked pages.


Our model has two components:

1. An ontology that keeps the interactive experience of users in machine-understandable form.

2. A processing module based on Ant algorithm.

Preliminary experiments have shown encouraging results in the improvement of local document ranking.

The following sections give details about the Semantic Web and related terminology; afterwards there is a brief discussion of the Ant algorithm, followed by the proposed approach, experimental results and conclusion.

2 Semantic Web

The Semantic Web is an evolving extension of the Web in which the semantics of information and services on the Web are defined [Lee 2007], enabling information to be processed not only by people but also by machines. At its core, the Semantic Web framework comprises a set of design standards and technologies.

Fig. 1: A typical RDF triple in ontology from proposed approach

The formal specifications for information representation in the Semantic Web are the Resource Description Framework (RDF), a metadata model for modeling information through a variety of syntax formats, and the Web Ontology Language (OWL). The RDF metadata model is based upon the idea of making statements about Web resources in the form of subject-predicate-object expressions called RDF triples. Figure 1 shows a typical RDF triple from the proposed approach, formally describing concepts, terms and relationships within a given knowledge domain.

2.1 Terminologies

2.1.1 Ontology

An ontology is an explicit and formal specification of a conceptualization [Antoniou and Harmelen, 2008]. An ontology formally describes a domain of discourse: it is a formal representation of a set of concepts within a domain and the relationships between those concepts. Typically, an ontology consists of a finite list of terms and the relationships between these terms. The terms denote important concepts (classes of objects) of the domain, and the relationships typically include hierarchies of classes. See Figure 1, where hasImportance is a relationship between the two concepts "document" and "importance"; "document1" and "X" are instances of these concepts. The major advantage of ontologies is that they support semantic interoperability and hence provide a shared understanding of concepts. Ontologies can be developed using data models such as RDF and OWL.

[Figure 1 depicts the RDF triple in which the resource …/system_document/world/document ("document1") is linked by the hasImportance relationship to …/system_document/world/document1/importance ("X").]


2.1.2 OWL

The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies, and is endorsed by the World Wide Web Consortium [Dean et al., 2004]. The data described by an OWL ontology is interpreted as a set of "individuals" and a set of "property assertions" which relate these individuals to each other. An OWL ontology consists of a set of axioms which place constraints on sets of individuals (called "classes") and the types of relationships permitted between them. These axioms provide semantics by allowing systems to infer additional information based on the data explicitly provided [Baader et al., 2003].

3 Ant Colonies

Ant colonies are a highly distributed and structured social organization [Dorigo et al., 1991]. On account of this structure, these colonies can perform complex tasks, and this has formed the basis of various models for the design of algorithms for optimization and distributed control problems. Several aspects of ant colonies have inspired different ant algorithms suited for different purposes; these kinds of algorithms are good for dealing with distributed problems. One of these is Ant Colony, which works on the principle of ants coordinating by depositing a chemical on the ground. These chemicals, called pheromones, are used for marking paths on the ground, which increases the probability that other ants will follow the same path.

The functioning of an ACO algorithm can be summarized as follows. A set of computational concurrent and asynchronous agents (a colony of ants) moves through paths looking for food. Whenever an ant encounters an obstacle it moves either left or right based on a decision policy. The decision policy is based on two parameters, called trail (τ) and attractiveness (Ω). The trail refers to the pheromone deposited by preceding ants; an ant following that path increases the attractiveness of the path by depositing more pheromone on it. Each path's attractiveness decreases with time as the trail evaporates (update) [Colorni et al., 1991]. As more and more ants follow the shortest path to food, the pheromone trail of that path keeps increasing and hence the shortest (optimal) path is found.

4 Proposed Model

In our model we emulate web surfing with the Ant Colony model. The pheromone counter of a link represents the attractiveness of the path to the desired Web document. The more hits a link receives, the more important it is; at the same time, hits are necessary to maintain that level of importance, because the pheromone counter dwindles with time. The Web surfers are the ants that navigate through the links of the Web documents to reach particular information. The Web server in this model is not a passive listener that merely caters to user requests; it is also the maintaining agent that records the pheromone of web links and ensures that updating of the pheromone is taken care of by the processing module.

4.1 Model Working

People looking for information visit web pages through various links/pages. Every visit is converted by the server into a pheromone count and recorded. If a person does not find a page useful, he/she will not visit that page again and the pheromone count of that page will dwindle with time, reducing its attractiveness, whereas repeated visits will increase the attractiveness of the page by increasing its pheromone count. The web server records the pheromone count and also other interaction counts (more detail in the following section on server-side enhancement).

4.1.1 Server Side Enhancement

The server maintains the pheromone count and other interactions corresponding to a page in an ontology and periodically updates them; a sample ontology is shown in Figure 2. Currently three interaction metrics are recorded in the ontology:

1. Number of hits.

2. Visitor evaluation (1 = informative or 0 = not informative) of the relevance of the page.

3. Time stamp (last visit).

<owl:Class rdf:ID="hits"> <rdfs: rdf:resource="h_200 "/>

</owl:Class>

<owl:Class rdf:ID="evaluation">

<rdfs: rdf:resource="e_1 "/>

<owl:Class rdf:ID="Date">

<rdfs:subClassOf>

<owl:Class rdf:ID="time_stamp"/>

</rdfs:subClassOf>

</owl:Class>

<owl:Class rdf:ID="time_hr">

<rdfs:subClassOf rdf:resource="#time_stamp"/>

</owl:Class>

<Date rdf:ID="d_11_7_2008"/>

<time_hr rdf:ID="t_1320"/>

Fig. 2: Server ontology

The above ontology shows that the document was visited on 11 July 2008 at 1:20 pm and it was the 200th hit.

4.1.2 Pheromone Representation

Now we can sum up the entire picture by describing how the pheromone count works. The role of the pheromone is to record the trail and thus indicate the importance of the link/document. The count changes continuously based on the time stamp of the last visit (i.e. the time elapsed since the count was last changed by a visit).

Here is how it works:

The pheromone associated with the link/model is defined as:

Pcount : D → V, T (1)

Page 65: Web_Sciences Complete Book

50 ♦ A Semantic Web Approach for Improving Ranking Model of Web Documents

Copyright © ICWS-2009

where V is the pheromone density at a particular time and T is the time stamp of the last visit. The value of V can be updated in two ways:

Positive update: when a user visits the page, or when the user gives a positive evaluation of the page.

Negative update: with time, the pheromone count decreases (evaporation); a negative user evaluation of the page also decreases it.

It may be noted that equal weightage is given to a user visit and to user input, to avoid malicious degradation of the pheromone: the visit cancels out a malicious negative input. For example, if a user repeatedly visits a page and gives a negative input of 0, the update also accounts for the visit, so the net negative update is 0; evaporation, however, continues to take place from the pheromone count at the last time stamp.

The pheromone accumulation of a page at the (n+1)th visit is done as follows:

Pnew = Pcurrent + 1 (2)

The negative pheromone mechanism is realized by using the radioactive degradation formula:

Pcount(t) = Pcount(T) × (1/2)^((t − T)/η) (3)

η is the degradation parameter, set heuristically, and T is the time stamp of the last update, so Pcount at time t depends on Pcount at the last update.

5 Experiment Results and Conclusion

We used the proposed model to ascertain the local page rank on a server set up using Apache Tomcat 5.5, for a collection of 70 web documents. Tables 1 and 2 show the observed results.

Table 1: Result for η = 2

Percentage of pages ranked within error margin of 10%: 48*
Percentage of pages ranked within error margin of 25%: 56*
Percentage of pages ranked within error margin of 40%: 73*

Table 2: Result for η = 4

Percentage of pages ranked within error margin of 10%: 21*
Percentage of pages ranked within error margin of 25%: 62*
Percentage of pages ranked within error margin of 40%: 87*

* Result value is approximated

The results clearly show the potential of this model. More than 50% of the pages were ranked within an error margin of 25%, which is encouraging in view of the sandbox environment of the experiment. In this paper we have presented an idea based on the Ant Colony algorithm in the context of learning and web data mining. The proof-of-concept implementation of the proposed model shows improvements over the existing system.

Future work also holds promise in improving the current Ant Colony algorithm by making it more relevant to the information gained from user experience. Other optimizations lie in fine-tuning the parameters and the increment strategy of pheromone accumulation. We expect better results with more fine-tuning of the approach in future.

Page 66: Web_Sciences Complete Book

A Semantic Web Approach for Improving Ranking Model of Web Documents ♦ 51

Copyright © ICWS-2009

References

[1] [Antoniou and Harmelen, 2008] Grigoris Antoniou and Frank van Harmelen, A Semantic Web Primer, p. 11, MIT Press, Cambridge, Massachusetts, 2008.

[2] [Dorigo, Maniezzo and Colorni, 1991] M. Dorigo, V. Maniezzo and A. Colorni, The ant system: an autocatalytic optimizing process, Technical Report TR91-016, Politecnico di Milano, 1991.

[3] [Lee 2007] Tim Berners-Lee, MIT Technology Review, 2007.

[4] [Dean et al., 2004] M. Dean and G. Schreiber, W3C reference on OWL, W3C document, 2004.

[5] [Colorni et al., 1991] A. Colorni, M. Dorigo and V. Maniezzo, Distributed optimization by ant colonies, In Proceedings of ECAL'91, European Conference on Artificial Life, Elsevier Publishing, Amsterdam, 1991.

[6] [Baader et al., 2003] F. Baader, D. Calvanese, D. McGuinness, D. Nardi and P.F. Patel-Schneider (Eds.), The Description Logic Handbook: Theory, Implementation, and Applications, Cambridge University Press, 2003.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Crawl Only Dissimilar Pages: A Novel and Effective Approach for Crawler Resource Utilization

Monika Mangla Terna Engineering College, Navi Mumbai

[email protected]

Abstract

Usage of search engines has become a significant part of today's life [Page and Brin, 1998]. While using search engines, we come across many web documents that are replicated on the internet [Brin and Page, 1998] [Bharat and Broder, 1999]. Identification of such replicated sites is an important task for search engines, because replication limits crawler performance (processing time, data storage cost) [Burner, 1997]. Sometimes even entire collections (such as JAVA FAQs, Linux manuals) are replicated, which wastes system resources [Dean and Henzinger, 1999]. In this paper, the use of graphs is discussed to avoid crawling a web page if a mirror version of that page has been crawled earlier. Crawling only dissimilar web pages enhances the effectiveness of the web crawler. The paper discusses how to represent web sites in the form of a graph, and how this graph representation is to be exploited for crawling non-mirrored web pages only, so that similar web pages are not crawled multiple times. The proposed method is capable of handling the challenge of finding replicas among an input set of several million web pages containing hundreds of gigabytes of textual data.

Keywords: Site replication; Mirror; Search engines; Collection

1 Introduction

The World Wide Web (WWW) is a vast and continuously growing source of information organized in the form of a large distributed hypertext system [Slattery and Ghani, 2001]. The web has more than 350 million pages and is growing to the tune of one million pages per day. Such enormous growth and flux necessitates the creation of highly efficient crawling systems [Smith, 1997] [Pinkerton, 1998]. The World Wide Web depends upon crawlers (also known as robots or spiders) for acquiring relevant web pages [Miller and Bharat, 1998]. A crawler follows the hyperlinks present in documents to move from one web page to another, and sometimes from one web site to another. Many of the documents across the web are replicated; sometimes entire collections are replicated over multiple sites. For example, the documents containing the JAVA Frequently Asked Questions (FAQs) are found replicated over many sites, which results in accessing the same documents a number of times, thus limiting the resource utilization and effectiveness of the crawler [Cho, Shivkumar and Molina]. Other examples of replicated collections are C tutorials, C++ tutorials and Windows manuals. Even the same job opening is advertised on multiple sites linking to the same web page. Replicated collections consist of thousands of pages which are mirrored on several sites, in the order of tens or sometimes even hundreds. A considerable amount of crawling resources is used for crawling these mirrored or similar pages multiple times.

If a method is devised to crawl a page if and only if no similar page has been crawled earlier, then visiting similar web pages several times can be avoided and resources can be utilized in an effective and efficient manner. This paper suggests a methodology that utilizes a graph representation of web sites to perform this task. Section 2 focuses on the general structure of hypertext documents and their representation in the form of a graph, and the working of the crawler is discussed in Section 3. In Section 4, a methodology is proposed that crawls a page if and only if no similar page has been crawled so far.

2 Representing Web in form of Graph

The World Wide Web is a communication system for the retrieval and display of multimedia documents with the help of hyperlinks. The hyperlinks present among web pages are employed by the web crawler for moving from one web page to another, and the crawled web pages are stored in the database of the search engine [Chakrabarti, Berg and Dom]. A hyperlink in a web page may point somewhere within the same web page, to some other web page, or to another web site [Sharma, Gupta and Aggarwal]. Thus hyperlinks may be divided into three types:

Book Mark: A link to a place within the same document. It is also known as a Page link.

Internal: A link to a different document within the same website.

External: A link to a web-site outside the site of the document.

While representing the web in the form of a graph, the hyperlinks among web pages are used. In this graph representation, all web pages are represented by vertices of the graph. A hyperlink from page Pi to page Pj is represented by an edge from vertex Vi to Vj, where Vi and Vj represent pages Pi and Pj respectively. The graph representation does not differentiate between internal and external links, while book marks are not represented at all.
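For illustration only (not from the paper), a minimal Python sketch of building such a graph from a list of hyperlinks; the helper names and the use of urllib.parse for link classification are assumptions:

from collections import defaultdict
from urllib.parse import urldefrag, urlparse, urljoin

def classify_link(page_url, href):
    """Classify a hyperlink as 'bookmark', 'internal' or 'external'."""
    target = urljoin(page_url, href)
    if urldefrag(target)[0] == urldefrag(page_url)[0]:
        return "bookmark"          # link to a place within the same document
    if urlparse(target).netloc == urlparse(page_url).netloc:
        return "internal"          # different document on the same site
    return "external"              # document on another web site

def build_web_graph(links):
    """links: iterable of (page_url, href) pairs -> adjacency-list graph.

    Pages become vertices and hyperlinks become directed edges; bookmarks are
    dropped, and internal/external links are not distinguished."""
    graph = defaultdict(set)
    for page_url, href in links:
        if classify_link(page_url, href) == "bookmark":
            continue
        graph[urldefrag(page_url)[0]].add(urldefrag(urljoin(page_url, href))[0])
    return graph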

Documents on the WWW can be moved or deleted, and the referenced information may change, resulting in broken hypertext links. The flexible and dynamic nature of the WWW thus necessitates regular maintenance of the graph representation in order to prevent structural collapse. A module may be run at some fixed interval to reflect any changed hyperlinks in the graph representation.

3 The Crawler

A web crawler is a program that automatically navigates the web, downloads web pages and indexes them. Web crawlers utilize the graph structure of the web to move from page to page. A crawler picks up a seed URL and downloads the corresponding robots.txt file, which contains downloading permissions and information about the files that should be excluded by the crawler. The crawler stores a web page and then extracts any URLs appearing in that page; the same process is repeated for all web pages whose URLs have been extracted from earlier pages. The architecture of the web crawler is shown in Figure 1. The key purpose of designing web crawlers is to retrieve web pages and add them to a local repository. Crawlers are utilized by search engines to prepare the repository of web pages in the search engine. The different actions performed by a web crawler are as follows:

Fig. 1: Web crawler Architecture

1. The downloaded document is marked as having been visited.

2. The external links are extracted from the document and put into a queue.

3. The contents of the document are indexed.

4 The Proposed Solution

In the suggested methodology, clustering techniques are used to group all similar pages. Different algorithms are available for deciding whether two web pages are similar. A web page can be divided into a number of chunks, each of which is then converted to a hash value; if two web pages share more than some threshold number of hash values, the pages are considered to be similar. Many variations of finding similar web pages are available, and some algorithms consider web pages similar based on the structure of the pages only. The proposed approach does not emphasize how to decide whether pages are similar; it ensures that a page is crawled only if no similar page has been crawled so far.
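For illustration only (this is not the paper's algorithm), a minimal Python sketch of the chunk-hashing idea, using overlapping word chunks; the chunk size and threshold are assumed parameters:

import hashlib

def chunk_hashes(text, chunk_size=8):
    """Hash overlapping word chunks (shingles) of a page's text."""
    words = text.split()
    chunks = (" ".join(words[i:i + chunk_size])
              for i in range(max(1, len(words) - chunk_size + 1)))
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def are_similar(page_a, page_b, threshold=0.5):
    """Pages are 'similar' if they share more than a threshold fraction of hashes."""
    ha, hb = chunk_hashes(page_a), chunk_hashes(page_b)
    if not ha or not hb:
        return False
    shared = len(ha & hb)
    return shared / min(len(ha), len(hb)) > threshold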

In the proposed modus operandi, a color tag is associated with every web page; initially the color tag of every web page is set to white. The color tag can take a value from the set {white, gray, black}. The color tag gray signifies that a similar page is being crawled; thus, if the color of any web page is set to gray, the crawling process for that page (or a similar one) is in progress. Therefore, if any similar web page is fetched for crawling, it will not be crawled, so that only dissimilar pages are crawled.

In the proposed method, a cluster consists of the set of all similar pages. Each cluster is associated with additional information: the number of web pages in the cluster whose color tag is set to gray (represented by Gray(Ci) for cluster Ci) and the number of web pages in the cluster whose color tag is set to black (represented by Black(Ci) for cluster Ci). Before the crawling process starts, the color tag of all web pages is set to white; therefore the values of Gray(Ci) and Black(Ci) for every cluster are initialized to zero. To cope with the dynamic nature of the web, a refresh tag is also associated with every web page [Cho and Molina, 2000]. In the beginning the value of the refresh tag is set to zero; when a web page is modified, its refresh tag is set to the maximum refresh tag in the cluster incremented by one. Thus the web page with the maximum refresh tag value is the one that was modified most recently. The refresh tag ensures that the most updated web page is crawled at any time, so the proposed method is able to crawl the freshest data.

In the suggested approach, whenever a web page is crawled, no other page in the same cluster should be crawled afterwards. The cluster to which a web page Pi belongs is represented by the function Cluster(Pi).

Proposed Algorithm

Step 1: Initialize Gray(Ci) = 0, Black(Ci) = 0 for every cluster Ci
Step 2: For each cluster Ci
    Step 2.1: Pick the page Pi with Color(Pi) = White and refresh(Pi) = maximum refresh value
    Step 2.2: Color(Pi) = Gray
    Step 2.3: Gray(Ci) = Gray(Ci) + 1
    Step 2.4: Scan the page and find the list of adjacent pages of page Pi
    Step 2.5: For all adjacent pages of page Pi
        If Gray(Cluster(Adj(Pi))) = 0 AND Black(Cluster(Adj(Pi))) = 0
            Crawl Adj(Pi)
            Color(Adj(Pi)) = Gray
            Gray(Cluster(Adj(Pi))) = Gray(Cluster(Adj(Pi))) + 1
Step 3: Color(Pi) = Black
Step 4: Black(Ci) = Black(Ci) + 1

In the suggested algorithm, the adjacent pages of a web page Pi are the web pages linked from Pi by hyperlinks. While following the hyperlinks present in a web page, the algorithm tests whether any similar web page has been crawled earlier (Step 2.5): it checks the cluster to which the adjacent page of Pi belongs and finds the number of pages with color tag Gray and with color tag Black. If no page in the cluster has its color tag set to Gray or Black, then all its pages still have the color tag White, which means no similar page has been crawled; therefore the web page is downloaded, its hyperlinks are extracted, and its color tag is set to Gray. While picking a web page from a cluster, the page that was refreshed most recently is selected so that the results cope with the changing nature of the World Wide Web.
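As an illustrative sketch only (not the authors' implementation), the cluster bookkeeping of the proposed algorithm might look as follows in Python; crawl_page and the input dictionaries are hypothetical stand-ins, and placing Steps 3-4 at the end of each cluster's iteration is an interpretation:

WHITE, GRAY, BLACK = "white", "gray", "black"

def crawl_dissimilar(clusters, cluster_of, adjacency, refresh, crawl_page):
    """Sketch of the colour-tag algorithm.

    clusters:   {cluster_id: [page_id, ...]} groups of similar pages
    cluster_of: {page_id: cluster_id}
    adjacency:  {page_id: [page_id, ...]} pages linked by hyperlinks
    refresh:    {page_id: int} refresh tag (higher = modified more recently)
    crawl_page: callable that downloads and indexes a page
    """
    color = {p: WHITE for pages in clusters.values() for p in pages}
    gray = {c: 0 for c in clusters}
    black = {c: 0 for c in clusters}

    for ci, pages in clusters.items():                        # Step 2
        candidates = [p for p in pages if color[p] == WHITE]
        if not candidates:
            continue
        pi = max(candidates, key=lambda p: refresh.get(p, 0))  # Step 2.1
        color[pi] = GRAY                                       # Step 2.2
        gray[ci] += 1                                          # Step 2.3
        crawl_page(pi)                                         # Step 2.4: scan the page
        for adj in adjacency.get(pi, []):                      # Step 2.5
            cj = cluster_of[adj]
            if gray[cj] == 0 and black[cj] == 0:               # no similar page crawled yet
                crawl_page(adj)
                color[adj] = GRAY
                gray[cj] += 1
        color[pi] = BLACK                                      # Step 3
        black[ci] += 1                                         # Step 4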

In the proposed technique, finding the sets of similar web pages needs to be a continuous procedure so that any changes can be reflected in the clusters used by the algorithm. Just as the content of the World Wide Web changes, the structure of web pages is also prone to change: some web pages are deleted, new web pages are inserted daily, and the structure of hyperlinks may change as well. Implementing all these structural changes requires the creation of the web graph to be a continuous procedure, repeated at regular intervals.

5 Conclusion

It has been observed that many web pages are replicated over a number of web sites. A crawler, while crawling the web, crawls through these replicated pages multiple times, and thus resources are under-utilized. These resources could be utilized more efficiently and effectively if visits to replicated web pages were restricted to once. In this paper, an approach for crawling only dissimilar web pages has been suggested. In order to ensure the quality and freshness of downloaded pages from a set of similar pages, the inclusion of a refresh tag has been proposed. Following the suggested approach, the effort of the crawler can be reduced by a significant amount, while producing results that are better organized, more up to date and more relevant when presented to the user.

References

[1] [Page and Brin, 1998] L. Page and S. Brin, "The anatomy of a search engine", Proc. of the 7th International WWW Conference (WWW 98), Brisbane, 1998.

[2] [Brin and Page, 1998] Sergey Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine", Proceedings of the Seventh International World Wide Web Conference, pages 107–117, April 1998.

[3] [Bharat and Broder, 1999] Krishna Bharat and Andrei Z. Broder, "Mirror, mirror on the Web: A study of host pairs with replicated content", In Proceedings of the 8th International Conference on World Wide Web (WWW'99), May 1999.

[4] [Burner, 1997] Mike Burner, "Crawling towards Eternity: Building an archive of the World Wide Web", Web Techniques Magazine, 2(5), May 1997.

[5] [Dean and Henzinger, 1999] J. Dean and M. Henzinger, "Finding related pages in the World Wide Web", Proceedings of the 8th International World Wide Web Conference (WWW8), pages 1467–1479, 1999.

[6] [Slattery and Ghani, 2001] Y. Yang, S. Slattery and R. Ghani, "A study of approaches to hypertext categorization", Journal of Intelligent Information Systems, Kluwer Academic Press, 2001.

[7] [Smith, 1997] Z. Smith, "The Truth About the Web: Crawling towards Eternity", Web Techniques Magazine, 2(5), May 1997.

[8] [Pinkerton, 1998] Brian Pinkerton, "Finding what people want: Experiences with the web crawler", Proc. of WWW Conf., Australia, April 14–18, 1998.

[9] [Miller and Bharat, 1998] Robert C. Miller and Krishna Bharat, "SPHINX: A framework for creating personal, site-specific Web crawlers", Proceedings of the Seventh International World Wide Web Conference, pages 119–130, April 1998.

[10] [Cho, Shivkumar and Molina] Junghoo Cho, Narayanan Shivakumar and Hector Garcia-Molina, "Finding replicated web collections".

[11] [Chakrabarti, Berg and Dom] S. Chakrabarti, M. van den Berg and B. Dom, "Distributed hypertext resource discovery through examples", Proceedings of the 25th International Conference on Very Large Databases (VLDB), pages 375–386.

[12] [Sharma, Gupta and Aggarwal] A.K. Sharma, J.P. Gupta and D.P. Agarwal, "Augmented hypertext documents suitable for parallel retrieval of information".

[13] [Cho and Molina, 2000] Junghoo Cho and Hector Garcia-Molina, "Synchronizing a database to improve freshness", In Proceedings of the 2000 ACM SIGMOD, 2000.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Enhanced Web Service Crawler Engine (A Web Crawler that Discovers Web Services Published on Internet)

Vandan Tewari, SGSITS, Indore, [email protected]
Inderjeet Singh, SGSITS, Indore, [email protected]
Nipur Garg, SGSITS, Indore, [email protected]
Preeti Soni, SGSITS, Indore, [email protected]

Abstract

As Web Services proliferate, the size and magnitude of UDDI Business Registries (UBRs) are likely to increase. The ability to discover web services of interest across multiple UBRs then becomes a major challenge, especially when using the primitive search methods provided by existing UDDI APIs. Also, UDDI registration is voluntary, and therefore web services can easily become passive. For a client, finding services of interest should be time effective and highly productive (i.e. a discovered service of interest should also be active, else the whole searching time and effort is wasted). If a service explored from the UBRs turns out to be passive, a lot of processing power and time of both the service provider and the client is wasted. Previous research work reports the intriguing result that only 63% of available web services are active. This paper proposes the "Enhanced Web Service Crawler Engine" (EWSCE), which provides more relevant results and gives output within acceptable time limits. The proposed EWSCE is intelligent: it performs verification and validation tests on discovered Web Services to ensure that they are active before presenting them to the user. Further, this crawler is able to work with federated UBRs. If some web services fail the validation test during discovery, EWSCE stores them in a special database for further reuse and automatically deletes them from the corresponding UBR.

1 Background

1.1 Web Service

A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP-messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards.

1.2 WSDL

The Web Services Description Language (WSDL) is an XML-based language that provides a model for describing web services. WSDL is often used in combination with SOAP and XML Schema to provide web services over the Internet. A client program connecting to a web service can read the WSDL to determine what functions are available on the server. Any special data types used are embedded in the WSDL file in the form of XML Schema. The client can then use SOAP to actually call one of the functions listed in the WSDL.

1.3 UDDI

Universal Description, Discovery and Integration (UDDI) is a platform-independent, XML-based registry for businesses worldwide to list themselves on the Internet. UDDI was originally proposed as a core Web service standard. It is designed to be interrogated by SOAP messages and to provide access to Web Services Description Language documents describing the protocol bindings and message formats required to interact with the web services listed in its directory.

1.4 Web Crawler

A web crawler (also known as a web spider, web robot, or web scutter) is a program or automated script which browses the World Wide Web in a methodical, automated manner. This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses.

2 Related Work

Web services discovery is the first step towards the usage of SOA for business applications over the internet and is an interesting area of research in ubiquitous computing. Many researchers have proposed discovering Web services through a centralized UDDI registry [M Paolucci et al., 2002; U Thaden et al., 2003]. Although centralized registries can provide effective methods for the discovery of Web services, they suffer from problems associated with centralized systems, such as a single point of failure and other bottlenecks. Other approaches [C. Zhou et al., 2003; K. Sivashanmugam et al., 2004] focused on having multiple public/private registries grouped into registry federations. However, similar to the centralized registry environment, a federated registry does not provide any means for the advanced search techniques which are essential for locating appropriate business applications. In addition, a federated registry environment can potentially allow inconsistent policies to be employed, which significantly affects the practicability of conducting inquiries across the federated environment and can at the same time significantly affect the productiveness of discovering Web services in a real-time manner across multiple registries. Some other approaches focused on a peer-to-peer framework architecture for service discovery and ranking [E. Al-Masri et al., 2007], providing a conceptual model based on Web service reputation, and providing a keyword-based search engine for querying Web services. Finding relevant services on the web is still an active area of research, and [E. Al-Masri et al., 2008] provides some details and statistics about Web services on the web. In that previous work, where web services were discovered through search engines, web search engines treated web services and general web documents in the same way for their search criteria, which introduces irrelevancy into the fetched information about web services. Further, the limited search methods used by these crawlers also limit the relevancy of the fetched data and lengthen the search time due to the large search space. We have tried to mitigate these problems in the proposed EWSCE by using a local cache and a refined search mechanism. Providing an example binding instance also increases the effectiveness of discovery. In this paper we propose a search engine capable of discovering web services effectively on the web.

3 Our Proposal

As Web Services proliferate, the size and magnitude of UDDI Business Registries (UBRs) are likely to increase. UDDI registration is voluntary for service providers, and therefore a Web service can easily become passive, i.e. the provider has revoked the Web Service but an entry for it still remains in the UDDI. If a client asks to search for a particular web service, the search engine can return a service which does not exist. To overcome this deficit, we propose an Enhanced Web Service Crawler Engine for discovering web services across multiple UBRs. The EWSCE automatically refreshes the UDDI/UBRs for web service updates. The refresh rate of EWSCE is very high, which ensures that none of the services existing in the UBRs can be passive.

This crawler has the following basic modules; an illustrative sketch of the validation step appears after the list.

• UBRs Crawl Module: This module maintains a table of IP addresses of all federated UBRs as initial seeds in its local memory area, and also maintains a cache for fast processing.

• Validation Module: This module checks the validity of the searched web services. If the WSDL of a web service exists, this indicates that it is an active service, and the Find module then fetches the access point URL and WSDL document corresponding to that web service. The WSDL document is parsed to find its method names, which confirms the validity of the discovered web service.

• Find Module: This module takes the initial seeds from the local IP table on the local disk and finds the web services corresponding to the keyword entered by the user.

• Modify Module: The access point URLs of services that fail the validation test are sent to this module. The Modify module deletes the entries of those passive web services from the UBRs.
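The sketch below is an illustration only, not the authors' code: it checks whether a discovered access point is active by fetching its WSDL and listing the operation names. The "?wsdl" URL convention, the namespace handling and the function names are assumptions:

import urllib.request
import xml.etree.ElementTree as ET

WSDL_NS = "{http://schemas.xmlsoap.org/wsdl/}"

def validate_service(access_point_url, timeout=10):
    """Return the list of operation names if the service's WSDL is reachable,
    or None if the service appears to be passive."""
    wsdl_url = access_point_url + "?wsdl"   # common convention; an assumption here
    try:
        with urllib.request.urlopen(wsdl_url, timeout=timeout) as resp:
            tree = ET.parse(resp)
    except Exception:
        return None                          # unreachable or not parseable: treat as passive
    operations = [op.get("name")
                  for op in tree.iter(WSDL_NS + "operation")
                  if op.get("name")]
    return operations or None                # no operations found: treat as passive

# Example use: services failing validation would be handed to the Modify module.
# ops = validate_service("http://example.com/ws/PowerService")   # hypothetical URL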

3.1 Proposed Algorithm for Search Engine

For the implementation of the proposed search engine, the following algorithm is proposed:


Step 1: START
Step 2: Accept the keyword from the end user and initialize the IP table of initial seeds
Do
    Step 3: Visit each seed and find the access point URLs against the requested keyword, and store them locally
    Step 4: Parse the WSDL document against the access point URL for each discovered service
    Step 5: If the web service is active then
                store it locally
            Else
                fetch the business key against that access point URL from the UBR,
                pass it to the Modify module, which deletes the web service from the UBR,
                and store it locally for future reference
Until all seeds in the WSlist to crawl have been visited
Step 6: Display the list of access point URLs to the end user in the form of hyperlinks that show a binding instance of the web services
Step 7: END

3.2 Proposed Architecture

Following is the proposed architecture of EWSCE.


4 Results

4.1 Scenario I for Testing: the user wants to find only the web services related to "calculation".

Fig. 4.1 shows the form in which the user entered the keyword "Calculation" to search for Web Services related to calculations.

Fig. 4.1: User Entered the keyword to perform the search

Now the user has the list of Web Services related to the keyword "Calculation", as shown in Fig. 4.2.

Fig. 4.2: Search Results- List of Access Point URL

4.2 Results for Scenario I

As shown in Fig. 4.2, the crawler gives the list of web services as hyperlinks to their actual definitions, i.e. links to the web services deployed at the provider side. The user can check the validity of the search results by binding to the given access point URL. This means that only a relevant list of web services (i.e. active Web Services) is given to the user; there are no irrelevant links. Suppose the user clicks powerservice from the above list: an input window with two text boxes opens for the user to calculate a to the power b, as shown in Fig. 4.3, and the output is displayed at the client side as shown in Fig. 4.4.

Fig. 4.3: Input Screen against hyperlink

Fig. 4.4: Final Result : After executing web service

5 Conclusions & Discussions

In this paper an EWSCE has been presented for the purpose of effective and fruitful discovery of web services. The proposed solution provides an efficient Web service discovery model in which the client neither has to search multiple UBRs nor has to suffer from the problem of handling passive web services. As the number of web services increases, the success of a business depends on both the speed and the accuracy of getting information about the relevant required web services. In ensuring accuracy, EWSCE has an edge over other WSCEs. The crawler update rate of the proposed engine is high; besides this, the engine periodically refreshes the repository whenever idle and automatically deletes passive web services from it.

Page 78: Web_Sciences Complete Book

Enhanced Web Service Crawler Engine ♦ 63

Copyright © ICWS-2009

In future, the proposed search engine can be made intelligent by extending the current framework with AI techniques such as service rating for returning relevant services. Further, the response time for discovering a required service can be improved by using a local cache. Also, to make virtual UBRs smarter, a procedure for dealing with newly popped-up services can be included.

References

[1] [C. Zhou et al., 2003] C. Zhou, L. Chia, B. Silverajan and B. Lee, UX: an architecture providing QoS-aware and federated support for UDDI, In Proceedings of ICWS, pp. 171–176, 2003.

[2] [E. Al-Masri et al., 2007] E. Al-Masri and Q.H. Mahmoud, A framework for efficient discovery of Web services across heterogeneous registries, In Proceedings of the IEEE Consumer Communications and Networking Conference (CCNC), 2007.

[3] [E. Al-Masri et al., 2008] Eyhab Al-Masri and Qusay H. Mahmoud, Investigating Web services on the World Wide Web, In Proceedings of WWW 2008, pp. 795–804, 2008.

[4] [K. Sivashanmugam et al., 2004] K. Sivashanmugam, K. Verma and A. Sheth, Discovery of web services in a federated environment, In Proceedings of ISWC, pp. 270–278, 2004.

[5] [M Paolucci et al., 2002] M. Paolucci, T. Kawamura, T. Payne and K. Sycara, Semantic matching of web service capabilities, In Proceedings of ISWC, pp. 1104–1111, 2002.

[6] [U Thaden et al., 2003] U. Thaden, W. Siberski and W. Nejdl, A semantic web based peer-to-peer service registry network, Technical report, Learning Lab Lower Saxony, 2003.


Data Warehouse Mining


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Web Intelligence: Applying Web Usage Mining Techniques to Discover Potential Browsing Problems of Users

D. Vasumathi, Dept. of CSE, JNTU Hyderabad, [email protected]
A. Govardhan, Dept. of CSE, JNTU Hyderabad, [email protected]
K. Suresh, Dept. of I.T., VCE, Hyderabad, [email protected]

Abstract

In this paper, a web usage mining based approach is proposed to discover potential browsing problems. Two web usage mining techniques in the approach are introduced: Automatic Pattern Discovery (APD) and Co-occurrence Pattern Mining with Distance Measurement (CPMDM). A combination method is also discussed to show how potential browsing problems can be identified.

1 Introduction

Website design is an important criterion for the success of a website. In order to improve website design, it is essential to understand how the website is used by analyzing users' browsing behaviour. Currently there are many ways to do this, and analysis of clickstream data is claimed to be the most convenient and cheapest method [3]. Web usage mining is a tool that applies data mining techniques to analyze web usage data [1], and it is a suitable technique for discovering potential browsing problems. However, traditional web usage mining techniques, such as clustering, classification and association rules, are not sufficient for discovering potential browsing problems. In this paper, we propose an approach which is based on the concept of web usage mining and follows the KDD (Knowledge Discovery in Databases) process.

Two main techniques are included: Automatic Pattern Discovery, and a co-occurrence pattern mining technique improved from traditional traversal pattern mining. These techniques can be used to discover potential browsing problems.

2 An Approach for Applying Web Usage Mining Techniques

In this paper, we propose an approach for applying web usage mining techniques to discover potential browsing problems. Figure 1 presents the proposed approach, which is based on the KDD process [2]. In this approach, the KDD process is run as a normal process, from data collection and preprocessing to pattern discovery and analysis, recommendation and action. The second step (pattern discovery and analysis) is the main focus of this paper.


Fig. 1: A KDD based Approach for Discovering Potential Browsing Problems

3 Automatic Pattern Discovery (APD)

In our previous work [4], some interesting patterns have already been identified, including Upstairs and Downstairs pattern, Mountain pattern and Fingers pattern.

The Upstairs pattern is found when the user moves forward in the website and never returns to a web page visited before. The Downstairs pattern is when the user moves backward, that is, the user returns to visited pages. The Mountain pattern occurs when a Downstairs pattern immediately follows an Upstairs pattern. The Fingers pattern occurs when a user moves from one web page to browse another web page and then immediately returns to the first web page. These patterns are claimed to be very useful for discovering potential browsing problems (see [4] for further detail). The APD method is based on the concept of sequential mining to parse the browsing routes of users, and is performed by a three-level browsing route transformation algorithm. The level-1 elements include Same, Up and Down; the level-2 elements are Peak and Trough; and the final level discovers the Stairs, Fingers and Mountain patterns (see [5] for more detail about the APD method). Table 1 shows an example of number-based browsing sequences, which are transformed from the browsing routes of users (the number denotes the occurrence sequence of the visited web page in a user's session). Table 2 shows the final patterns discovered by performing the APD method.

Table 1. Number-based Browsing Sequences

Number    Number-based sequence
1         0,1,2
2         0,0,1,0,2,0,3,0,4,0,5,6,7,6,7,8,6,4,5,0

Table 2. Final Patterns

Number    Patterns
1         Upstairs
2         Finger, Finger, Finger, Finger, Mountain, Mountain, Mountain
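For illustration only (not the authors' APD implementation), a minimal Python sketch of the level-1 transformation and a pure Upstairs check over number-based sequences like those in Table 1; the Peak/Trough (level-2) and Fingers/Mountain (level-3) logic is omitted:

def level1(sequence):
    """Level-1 transformation: turn a number-based browsing sequence
    into Up/Down/Same moves between consecutive pages."""
    moves = []
    for prev, cur in zip(sequence, sequence[1:]):
        moves.append("Up" if cur > prev else "Down" if cur < prev else "Same")
    return moves

def is_upstairs(sequence):
    """Pure Upstairs walk: the user only moves forward, never returns."""
    return all(m == "Up" for m in level1(sequence))

# Sequence 1 from Table 1 is a pure Upstairs walk, as reported in Table 2.
print(level1([0, 1, 2]), is_upstairs([0, 1, 2]))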

[Figure 1 labels: check stream data and data preprocessing; APD and distance-based association rule mining; evaluation; recommendation; action: web page redesign (within the KDD process).]


4 Co-occurrence Pattern Mining with Distance Measurement (CPMDM)

CPMDM is another technique that can be used to analyse the browsing behaviour of users; it improves co-occurrence pattern mining by introducing a Distance measurement. A co-occurrence pattern describes the co-occurrence frequency (or probability) of two web pages in users' browsing routes. The additional measurement, Distance, measures how many browsing steps it takes to get from one page to another in a co-occurrence pattern. There are three different directions of the distance measurement: Forward, Backward and Two-Way. The Forward distance measures the distance from web page A to B of the co-occurrence pattern AB. The Backward distance, on the other hand, measures the distance from B to A of the co-occurrence pattern AB. The Two-Way distance combines forward and backward distance: it ignores the direction of the association rule and takes all co-occurrence patterns involving A and B.
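For illustration only (not the authors' code), a minimal Python sketch of the forward distance for a co-occurrence pattern AB within sessions, averaged over all sessions; how repeats are handled here is an assumption:

def forward_distances(session, a, b):
    """Browsing steps from each occurrence of page a to the next occurrence of b."""
    distances = []
    for i, page in enumerate(session):
        if page == a:
            for j in range(i + 1, len(session)):
                if session[j] == b:
                    distances.append(j - i)
                    break
    return distances

def average_forward_distance(sessions, a, b):
    """Average forward distance of the co-occurrence pattern AB over all sessions."""
    all_d = [d for s in sessions for d in forward_distances(s, a, b)]
    return sum(all_d) / len(all_d) if all_d else None

sessions = [["home", "/uao/ugrad", "/uao/ugrad/courses/", "home"],
            ["home", "/gso/gsp/"]]
print(average_forward_distance(sessions, "home", "/uao/ugrad/courses/"))  # 2.0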

5 Combining APD and CPMDM for Discovering Browsing Problems

The analysis results of the APD and CPMDM are two totally different analyses of users' browsing behaviour. However, there would be some bias if only one of these two methods were used to assess the website's design. Therefore, if the analysis results of the APD and CPMDM can be combined, more concrete indications of potential problems in the website's design can be discovered.

Table 3 shows an example of combining the APD and CPMDM methods for discovering potential browsing problems. In this case, the starting page of the co-occurrence patterns is the home page of the University of York website. In the table, the Support is the probability of the co-occurrence pattern and the Distance is the average forward distance of the pattern. The proportions of Stairs and Fingers patterns are measured by using the APD method. In this case, we consider the Fingers pattern to be a problematic pattern, and a longer distance means it is more difficult for a user to traverse from one page to another. Therefore, the browsing route from the home page to the /uao/ugrad/courses/ page can easily be identified as a route where a potential browsing problem may occur.

Table 3. Combining the APD and CPMDM of "The people who view home page then view"

URL                    Support    Distance (average)    Stairs Pattern    Finger Pattern
/uao/ugrad             0.25       9.1271                44%               39%
/gso/gsp/              0.173      5.3195                52%               26%
/uao/ugrad/courses/    0.127      16.9021               34%               47%
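One possible way (an illustration only, not the paper's rule) to turn Table 3 into a flag for problematic routes is to require a high Finger proportion together with a long average forward distance; the thresholds below are arbitrary:

rows = [  # (URL, support, average forward distance, stairs %, finger %) from Table 3
    ("/uao/ugrad", 0.25, 9.1271, 0.44, 0.39),
    ("/gso/gsp/", 0.173, 5.3195, 0.52, 0.26),
    ("/uao/ugrad/courses/", 0.127, 16.9021, 0.34, 0.47),
]

def flag_problem_routes(rows, max_distance=10.0):
    """Flag routes where Fingers dominate Stairs and the forward distance is long."""
    return [url for url, _, dist, stairs, finger in rows
            if finger > stairs and dist > max_distance]

print(flag_problem_routes(rows))  # ['/uao/ugrad/courses/']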

6 Conclusion

This paper proposed an approach for analysing users' browsing behaviour based on web usage mining techniques. The concepts of APD and CPMDM have been briefly introduced, and the combination method has been discussed as well. The example of the combination method showed that potential browsing problems of users can be discovered easily. The approach proposed in this paper is therefore beneficial for the area of website design improvement.


References

[1] Cooley, R., Mobasher, B. and Srivastava, J. (1997) "Web Mining: Information and Pattern Discovery on the World Wide Web", In Proceedings of the 9th IEEE ICTAI Conference, pp. 558–567, Newport Beach, CA, USA.

[2] Lee, J., Podlaseck, M., Schonberg, E. and Hoch, R. (2001) "Visualization and Analysis of Clickstream Data of Online Stores for Understanding Web Merchandising", Journal of Data Mining and Knowledge Discovery, Vol. 5, pp. 59–84.

[3] Kohavi, R., Mason, L. and Zheng, Z. (2004) "Lessons and Challenges from Mining Retail E-commerce Data", Machine Learning, Vol. 57, pp. 83–113.

[4] Ting, I.H., Kimble, C. and Kudenko, D. (2004) "Visualizing and Classifying the Pattern of User's Browsing Behavior for Website Design Recommendation", In Proceedings of the 1st KDDS Workshop, 20–24 September, Pisa, Italy.

[5] Ting, I., Clark, L., Kimble, C., Kudenko, D. and Wright, P. (2007) "APD - A Tool for Identifying Behavioural Patterns Automatically from Clickstream Data", Accepted to appear in the KES2007 Conference, 12–14 September.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Fuzzy Classification to Discover On-line User Preferences Using Web Usage Mining

Dharmendra T. Patel, Amit D. Kothari
Charotar Institute of Computer Applications, Changa-Gujarat University, Gujarat
[email protected]

Abstract

Web usage mining is an important sub-category of web mining. Every day, users access many web sites for different purposes. Information about on-line users (such as time, date, host name, amount of data transferred, platform, URL, etc.) is recorded in server log files. This information, recorded in server log files, is very important for decision-making tasks in business communities. Web usage mining is an important technique to discover useful user patterns from the information recorded in server log files. In this paper, a method to discover on-line user preferences is suggested. It is based on the vector space model and fuzzy classification of on-line user sessions. Web usage mining concepts such as clustering and classification, based on user interactions with the web, are used in this paper to discover useful usage patterns. It is also shown why fuzzy classification is better than normal classification for several applications such as recommendation.

Keywords: Web Mining, Web Usage Mining, Vector Space Model, Fuzzy Classification Method

1 Introduction and Related Work

Web mining is a data mining technique that is applied to web data. Today the WWW grows at an amazing rate as an information gateway and as a medium for conducting business. When any user interacts with the web, a lot of information (date, time, host name, amount of data transferred, platform, version, etc.) is left in server log files. Web usage mining, the sub-category of web mining, mines this user-related information recorded in server log files to discover important usage patterns. Figure 1 depicts the web usage mining process [3].

Preprocessing is the main task of the WUM process. The inputs of the preprocessing phase, for usage processing, may include the web server logs, referral logs, registration files, index server logs, and optionally usage statistics from a previous analysis. The outputs are the user session file, transaction file, site topology, and page classifications [2]. The first step in usage pattern discovery is session extraction from log files. Various sessionization strategies are described in [4]. Normally, sessions are represented as vectors whose coordinates determine which items have been seen. Once the sessions are obtained they can be clustered. Each cluster groups similar sessions. As a consequence, it is possible to acquire knowledge about typical user visits, treated here as predefined usage patterns. A classification of the on-line users to


one of the predefined classes is typically based on a similarity calculation between each predefined pattern and the current session. The current session is assigned to the most similar cluster [7]. Unfortunately, the majority of clustering algorithms [1] divide the whole vector space into separate groups, which does not work ideally for real-life cases. This problem has been noticed in [5]. Fuzzy clustering may be a solution to the above-mentioned problem, but it does not solve the problem of classifying an on-line session that is situated on the border of two or more clusters. Independently of the clustering type (whether it is fuzzy or not), fuzzy classification is required.

Fig. 1: Web Usage Mining Process

The purpose of this paper is to present a method to discover on-line user preferences based on previous user behaviour and the fuzzy classification of the on-line user's session to one of the precalculated usage patterns. It is assumed that the users enter the web site to visit abstract items (web pages, e-commerce products, etc.) whose features (for example, textual content) and relations between them are not known. As a result, the preference vector is created. Each vector coordinate corresponds to one item and measures the relevance of this item to the user's interests. The obtained vector can be used in recommendation, ordering of search results or personalized advertisements. To apply fuzzy classification to an on-line user, this paper recommends the following steps.

1. Session Clustering to find out Usage Patterns.

2. Classification of Online User to Usage Patterns.

3. Preference Vector Calculation.

2 Session Clustering to Find out Usage Patterns

When a user visits any web site, information is stored in server log files. Web usage mining can be applied to the server log files to find usage patterns. The first step in usage pattern discovery is session extraction from the log files; sessions are extracted using an appropriate sessionization strategy chosen according to the requirements. Historical user sessions are clustered in the vector space model in order to discover typical usage patterns. The clustering process is not fuzzy, unlike the classification. Let h be the historical session vector that corresponds to a particular session; then hj = 1 if the item dj has been visited in the session represented by h, and hj = 0 otherwise.
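This binary session representation can be sketched as follows (a minimal Python illustration; the item catalogue and the example sessions are assumptions, not data from the paper):

# Minimal sketch: building binary historical session vectors h from extracted
# sessions. The item catalogue and the sessions below are example inputs.
items = ["d1", "d2", "d3", "d4", "d5"]          # all items of the site
item_index = {d: k for k, d in enumerate(items)}

def session_to_vector(session):
    """Return h with h[j] = 1 if item d_j was visited in the session, else 0."""
    h = [0] * len(items)
    for d in set(session):
        h[item_index[d]] = 1
    return h

sessions = [["d1", "d3"], ["d2", "d3", "d5"], ["d1", "d3", "d4"]]
historical_vectors = [session_to_vector(s) for s in sessions]
print(historical_vectors)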


Sessions with only one or two visited items, or sessions in which almost all items occur, may worsen the clustering results. For this reason, it is better to cluster only those vectors in which the number of visited items is lower than nmax and greater than nmin. The nmin and nmax parameters are very important for the clustering result. Too low a value of nmin may cause many sessions to be dissimilar to any other, and as a consequence many clusters with a small number of elements will appear. Too high a value of nmin or too low a value of nmax removes valuable vectors. Too high a value of nmax may result in a small number of clusters with many elements.

Once the historical sessions are created and selected, they are clustered using a clustering algorithm [8]. It is recommended to use an algorithm that does not require the number of clusters to be specified explicitly. As a result of clustering, the set C = {c1, c2, c3, ..., cn} of n clusters is created. Each cluster can be regarded as the set of session vectors that belong to it, Cj = {h1, h2, h3, ..., hcard(Cj)}. The clusters can also be represented by their mean vectors, called centroids:

cj := (1 / card(Cj)) · Σ_{h ∈ Cj} h    (2.1)

These calculated centroids will also be called usage patterns. The purpose of a centroid is to measure how often a given item has been visited in the sessions belonging to the cluster.

3 Classification of On-line User to Usage Pattern

The clusters and centroids obtained in the previous section are very valuable for on-line users. The current session vector s is used in order to classify the current user's behaviour to the closest usage pattern. Similarly to the historical session vector, every coordinate corresponds to a particular item. When the user visits the item di, the coordinates of the vector s change according to the following formula:

si := 1 and sk := t · sk for every k ≠ i    (3.1)

The constant t ∈ [0, 1] regulates the influence of the items visited before on the classification process. If the parameter t is set to 0, items seen before will not have any influence. In the case of t = 1, items visited before will have the same impact as the current item. The similarity between the current session vector s and the centroid cj of the jth cluster can be calculated using the Jaccard formula:

sim(s, cj) := (s · cj) / (‖s‖² + ‖cj‖² − s · cj)    (3.2)

The centroid cmax of the closest usage pattern fulfils the following condition:

sim(s, cmax) = max_{j = 1..n} sim(s, cj)    (3.3)


The main reason to use the Jaccard formula is that zero coordinates do not increase the similarity values. Another approach uses fuzzy classification. In this case, the similarity between a given usage pattern and the current session vector is treated as a membership function that measures the grade of membership of the current session vector in the usage pattern (0 – it does not belong to the pattern at all, 0.5 – it belongs partially, 1 – it belongs entirely). The membership function is a fundamental element of fuzzy set theory [6].
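As an illustration only, the fuzzy memberships of the current session in each usage pattern could be computed as follows; the extended Jaccard form of the similarity used here is an assumption about formula (3.2), one common reading of the "Jaccard formula" for real-valued vectors:

# Minimal sketch of fuzzy classification of the current session vector against
# the cluster centroids (usage patterns). The extended Jaccard similarity used
# here is an assumption, not a quotation of the paper's formula.
def extended_jaccard(s, c):
    dot = sum(a * b for a, b in zip(s, c))
    ss = sum(a * a for a in s)
    cc = sum(b * b for b in c)
    denom = ss + cc - dot
    return dot / denom if denom else 0.0

def update_session(s, visited_index, t=0.5):
    """Visit item d_i: decay the old coordinates by t and set the current one to 1."""
    s = [t * v for v in s]
    s[visited_index] = 1.0
    return s

def fuzzy_memberships(s, centroids):
    """One membership value per usage pattern; no hard assignment is made."""
    return [extended_jaccard(s, c) for c in centroids]

centroids = [[0.8, 0.1, 0.6, 0.0, 0.1], [0.1, 0.7, 0.2, 0.1, 0.9]]  # toy patterns
s = [0.0] * 5
s = update_session(s, visited_index=2)   # the user views item d3
print(fuzzy_memberships(s, centroids))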

It is important to emphasize that the preferences of the user can vary even during the same site visit. For this reason the online classification should be recalculated every time the user sees a new item.

4 Preference Vector Calculation

Users enter the web site to visit abstract items (web pages, e-commerce products etc) whose features (for example textual content) and relations between them are not known. As a result, the preference vector is created. Each vector’s coordinate corresponds to one item and measures the relevance of this item for the user interests. In this paper preference vector calculation method is described to discover on-line user preferences based on the previous user behavior and the fuzzy classification of the online user’s session to one of the precalculated usage patterns.

The preference vector p can be obtained by calculating the similarity between the current session vector and the usage patterns. The values of the preference vector's coordinates change iteratively every time a new document or product is visited. Before the user enters the site, p0 = 0 and the session vector s0 = 0; hence the preferences are not known yet and no item has been visited in this session. When the ith item is requested, the preference vector is modified:

(4.1)

where:
1. pi-1 remembers the previous preferences of the user. The parameter a ∈ (0, 1) regulates the influence of the old preference vector on the current one.
2. The second, similarity-weighted term promotes items that were frequently visited in the clusters whose centroids are similar to the current session.
3. The factor (1 − si) weakens the influence of the items that have already been seen in this session.

It is important to underline that all usage patterns influence the preference vector. Instead of classifying the current session vector to the closest usage pattern, the fuzzy classification is used. The introduction of the fuzzy classification is especially profitable when the session is situated at the same distance from the closest clusters. If a user wants information that is common to many clusters, equation 4.1, which is based on fuzzy classification, is very useful. If only the closest usage pattern were used (instead of fuzzy classification), the formula would have the following form:


(4.2)

Preference vector calculation is based on user behaviour. If the user wants information common to more clusters, equation 4.1, which is based on fuzzy classification, is profitable; otherwise equation 4.2 can be used, which determines only the closest usage pattern. The preference vector captures many characteristics, and the preference vector has to be developed based on those characteristics.
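Since formulas 4.1 and 4.2 are not reproduced here, the following sketch only illustrates one plausible reading of the three components listed above (old preferences weighted by a, a similarity-weighted combination of the centroids, and damping of already-seen items); the exact update used by the authors may differ:

# Hypothetical sketch of a preference-vector update combining the three
# components described in the text. The exact form of equations (4.1)/(4.2)
# is an assumption here, not the authors' formula.
def update_preferences(p_prev, s, centroids, memberships, a=0.5):
    """p_prev: previous preference vector; s: current session vector;
    centroids: usage patterns; memberships: fuzzy membership of s in each pattern."""
    total = sum(memberships) or 1.0
    n = len(p_prev)
    # similarity-weighted combination of the usage-pattern centroids
    blended = [sum(m * c[k] for m, c in zip(memberships, centroids)) / total
               for k in range(n)]
    # keep a share of the old preferences, damp items already seen via (1 - s_k)
    return [a * p_prev[k] + (1 - a) * (1 - s[k]) * blended[k] for k in range(n)]

# toy usage
centroids = [[0.8, 0.1, 0.6, 0.0, 0.1], [0.1, 0.7, 0.2, 0.1, 0.9]]
s = [0.25, 0.0, 1.0, 0.0, 0.0]
memberships = [0.45, 0.12]          # e.g. obtained with the similarity of (3.2)
p = update_preferences([0.0] * 5, s, centroids, memberships)
print(p)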

5 Conclusions and Future Work

In this paper, user preference discovery using the fuzzy classification method has been presented. It has been shown that if on-line sessions are situated between two or more usage patterns, the fuzzy classification behaves better than a normal classification, although the preference vector calculation using the fuzzy classification is more time consuming (compare formulas 4.1 and 4.2). It is possible to limit the fuzzy classification to 2 or 3 patterns to alleviate this problem.

Future work will concentrate on the integration of the presented method into real-time applications like recommendation, ordering of search results or personalized advertisements. Another direction is to use graph theory instead of the vector space model, in order to retain more information than vectors do.

References

[1] Data Warehousing, Data Mining and OLAP by Alex Berson and Stephen J. Smith.
[2] Proceedings of the Fifth International Conference on Computational Intelligence and Multimedia Applications.
[3] Srivastava, J., Cooley, R., Deshpande, M., Tan, P. N., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", SIGKDD Explorations, Vol. 1, pp. 12-23, 2000.
[4] Web Usage Mining for E-business Applications, ECML/PKDD-2002 Tutorial.
[5] Mining Web Access Logs Using Relational Competitive Fuzzy Clustering, In: 8th International Fuzzy Systems Association World Congress (IFSA 99).
[6] Fuzzy Thinking: The New Science of Fuzzy Logic, Hyperion, New York, 1993.
[7] Integrating Web Usage and Content Mining for More Effective Personalization, LNCS 1875, Springer-Verlag, pp. 156-76.
[8] Cooley, R., Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data, Ph.D. Thesis, Department of Computer Science, University of Minnesota, 2000.


Data Obscuration in Privacy Preserving Data Mining

Anuradha T., Suman M., Arunakumari D.
K.L. College of Engineering, Vijayawada
[email protected], [email protected], [email protected]

Abstract

There has been increasing interest in the problem of building accurate data mining models over aggregate data while protecting privacy at the level of individual records, by disclosing only randomized (obscured) values. The model is built over the randomized data after first compensating for the randomization (at the aggregate level). The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issues of measuring privacy.

1 Introduction

One approach to privacy in data mining is to obscure or randomize the data: making private data available, but with enough noise added that exact values cannot be determined. Consider a scenario in which two or more parties owning confidential databases wish to run a data mining algorithm on the union of their databases without revealing any unnecessary information. For example, consider separate medical institutions that wish to conduct a joint research while preserving the privacy of their patients. In this scenario it is required to protect privileged information, but it is also required to enable its use for research or for other purposes. In particular, although the parties realize that combining their data has some mutual benefit, none of them is willing to reveal its database to any other party.

In this case, there is one central server, and many clients (the medical institutions), each having a piece of information. The server collects this information and builds its aggregate model using, for example, a classification algorithm or an algorithm for mining association rules. Often the resulting model no longer contains personally identifiable information, but contains only averages over large groups of clients.

The usual solution to the above problem consists in having all clients send their personal information to the server. However, many people are becoming increasingly concerned about the privacy of their personal data. They would like to avoid giving out much more about themselves than is required to run their business with the company. If all the company needs is the aggregate model, a solution is preferred that reduces the disclosure of private data while still allowing the server to build the model.

One possibility is as follows: before sending its piece of data, each client perturbs it so that some true information is taken away and some false information is introduced. This approach is called randomization or data obscuration. Another possibility is to decrease precision of the


transmitted data by rounding, suppressing certain values, replacing values with intervals, or replacing categorical values by more general categories up the taxonomical hierarchy. The usage of randomization for preserving privacy has been studied extensively in the framework of statistical databases. In that case, the server has a complete and precise database with the information from its clients, and it has to make a version of this database public, for others to work with. One important example is census data: the government of a country collects private information about its inhabitants, and then has to turn this data into a tool for research and economic planning.

2 Numerical Randomization

Let each client Ci, i = 1, 2, . . . ,N, have a numerical attribute xi. Assume that each xi is an instance of random variable Xi, where all Xi are independent and identically distributed. The cumulative distribution function (the same for every Xi) is denoted by FX. The server wants to learn the function FX, or its close approximation; this is the aggregate model which the server is allowed to know. The server can know anything about the clients that is derivable from the model, but we would like to limit what the server knows about the actual instances xi.

The paper [4] proposes the following solution. Each client randomizes its xi by adding to it a random shift yi. The shift values yi are independent identically distributed random variables with cumulative distribution function FY; their distribution is chosen in advance and is known to the server. Thus, client Ci sends randomized value zi = xi + yi to the server, and the server’s task is to approximate function FX given FY and values z1, z2, . . . , zN. Also, it is necessary to understand how to choose FY so that

• the server can approximate FX reasonably well, and

• the value of zi does not disclose too much about xi.

The amount of disclosure is measured in [4] in terms of confidence intervals. Given confidence c%, for each randomized value z we can define an interval [z − w1, z + w2] such that for all nonrandomized values x we have

P[Z − w1 ≤ x ≤ Z + w2 | Z = x + Y, Y ~ FY] ≥ c%.

In other words, here we consider an "attack" where the server computes a c%-likely interval for the private value x given the randomized value z that it sees. The shortest width w = w1 + w2 of a confidence interval is used as the amount of privacy at the c% confidence level. Once the distribution function FY is determined and the data is randomized, the server faces the reconstruction problem: given FY and the realizations of N i.i.d. random samples Z1, Z2, ..., ZN, where Zi = Xi + Yi, estimate FX. In [4] this problem is solved by an iterative algorithm based on Bayes' rule. Denote the density of Xi (the derivative of FX) by fX, and the density of Yi (the derivative of FY) by fY; then the reconstruction algorithm is as follows:

1. fX^0 := uniform distribution;
2. j := 0 // iteration number
3. repeat
4.    fX^{j+1}(a) := (1/N) Σ_{i=1..N} [ fY(zi − a) · fX^j(a) ] / [ ∫_{−∞}^{+∞} fY(zi − z) · fX^j(z) dz ];
5.    j := j + 1;
6. until (stopping criterion met).


For efficiency, the density functions fX^j are approximated by piecewise constant functions over a partition of the attribute domain into k intervals I1, I2, ..., Ik. The formula in the algorithm above is approximated by (m(It) is the midpoint of It):

fX^{j+1}(Ip) := (1/N) Σ_{i=1..N} [ fY(m(zi) − m(Ip)) · fX^j(Ip) ] / [ Σ_{t=1..k} fY(m(zi) − m(It)) · fX^j(It) · |It| ]

It can also be written in terms of cumulative distribution functions, where ∆ FX((a, b]) = FX(b) − FX(a) = P[a <X ≤ b] and N(Is) is the number of randomized values zi inside interval Is:

ΔFX^{j+1}(Ip) := Σ_{s=1..k} (N(Is) / N) · [ fY(m(Is) − m(Ip)) · ΔFX^j(Ip) ] / [ Σ_{t=1..k} fY(m(Is) − m(It)) · ΔFX^j(It) ]

Experimental results show that the class prediction accuracy for decision trees constructed over randomized data (using By Class or Local) is reasonably close (within 5%–15%) to the trees constructed over original data, even with heavy enough randomization to have 95%-confidence intervals as wide as the whole range of an attribute. The training set had 100,000 records.
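To make the randomization and reconstruction procedure above concrete, here is a minimal numerical sketch; the data distribution, the noise range and the bin count are illustrative assumptions, not the paper's experimental setup:

# Minimal sketch of additive randomization and the iterative Bayes-rule
# reconstruction over k equal-width bins.
import random

random.seed(0)
N, k = 5000, 20
xs = [random.gauss(50.0, 10.0) for _ in range(N)]          # private values x_i
noise_half_width = 25.0                                     # Y ~ Uniform[-25, 25]
zs = [x + random.uniform(-noise_half_width, noise_half_width) for x in xs]

lo, hi = min(zs), max(zs)
width = (hi - lo) / k
mids = [lo + (t + 0.5) * width for t in range(k)]           # m(I_t)

def f_Y(y):                                                 # density of the noise
    return 1.0 / (2 * noise_half_width) if abs(y) <= noise_half_width else 0.0

f = [1.0 / (k * width)] * k                                 # f_X^0: uniform density
for _ in range(50):                                         # fixed iteration budget
    new_f = [0.0] * k
    for z in zs:
        denom = sum(f_Y(z - mids[t]) * f[t] * width for t in range(k))
        if denom == 0.0:
            continue
        for p in range(k):
            new_f[p] += f_Y(z - mids[p]) * f[p] / denom
    f = [v / N for v in new_f]

print([round(v, 4) for v in f])   # reconstructed piecewise-constant density of X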

3 Itemset Randomization

Papers [6, 7] consider randomization of categorical data, in the context of association rules. Suppose that each client Ci has a transaction ti, which is a subset of a given finite set of items I, |I| = n. For any subset A ⊆ I, its support in the dataset of transactions T = {ti, i = 1...N} is defined as the fraction of transactions containing A as their subset:

suppT(A) := |{ ti | A ⊆ ti, i = 1...N }| / N;

an itemset A is frequent if its support is at least a certain threshold smin. An association rule A ⇒ B is a pair of disjoint itemsets A and B; its support is the support of A ∪ B, and its confidence is the fraction of transactions containing A that also contain B:

confT(A ⇒ B) := suppT(A ∪ B) / suppT(A).

An association rule holds for T if its support is at least smin and its confidence is at least cmin, which is another threshold. Association rules were introduced in [2], and [3] presents the efficient Apriori algorithm for mining association rules that hold for a given dataset. The idea of Apriori is to make use of the antimonotonicity property:

∀ A ⊆ B : suppT(A) ≥ suppT(B).

Conceptually, it first finds frequent 1-item sets, then checks the support of all 2-item sets whose 1-subsets are frequent, then checks all 3-item sets whose 2-subsets are frequent, etc. It stops when no candidate itemsets (with frequent subsets) can be formed. It is easy to see that the problem of finding association rules can be reduced to finding frequent itemsets. A natural way to randomize a set of items is by deleting some items and inserting some new items. A select-a-size randomization operator is defined for a fixed transaction size |t| = m

and has two parameters: a randomization level 0≤ρ ≤ 1 and a probability distribution (p[0],


p[1], ..., p[m]) over the set {0, 1, ..., m}. Given a transaction t of size m, the operator generates a randomized transaction t1 as follows:

1. The operator selects an integer j at random from the set {0, 1, ..., m} so that P[j is selected] = p[j].

2. It selects j items from t, uniformly at random (without replacement). These items, and no other items of t, are placed into t1.

3. It considers each item a ∉ t in turn and tosses a coin with probability ρ of "heads" and 1 − ρ of "tails". All those items for which the coin faces "heads" are added to t1.

If different clients have transactions of different sizes, then select-a-size parameters have to be chosen for each transaction size. So, this (nonrandomized) size has to be transmitted to the server with the randomized transaction.
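A minimal sketch of the select-a-size operator follows; the item universe, the example transaction, and the concrete values of ρ and p[0..m] are assumptions made for illustration:

# Minimal sketch of the select-a-size randomization operator.
import random

def select_a_size(t, item_universe, rho, p):
    """t: original transaction (set of items); p: probabilities p[0..len(t)]."""
    m = len(t)
    # step 1: pick j with probability p[j]
    j = random.choices(range(m + 1), weights=p, k=1)[0]
    # step 2: keep exactly j items of t, chosen uniformly without replacement
    t1 = set(random.sample(sorted(t), j))
    # step 3: every item not in t is inserted independently with probability rho
    for a in item_universe:
        if a not in t and random.random() < rho:
            t1.add(a)
    return t1

items = {f"i{n}" for n in range(100)}
t = {"i1", "i2", "i3", "i4", "i5"}
p = [1 / 6] * 6                  # uniform over 0..5 retained items (illustrative)
print(select_a_size(t, items, rho=0.05, p=p))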

4 Limiting Privacy Breaches

Consider the following simple randomization R: given a transaction t, we consider each item in turn, and with probability 80% replace it with a new random item; with probability 20% we leave the item unchanged. Since most of the items get replaced, we may suppose that this randomization preserves privacy well. However, it is not so, at least not all the time. Indeed, let A = {x, y, z} be a 3-item set with partial supports

s3 = suppT(A) = 1%;  s2 = 5%;  s1 + s0 = 94%.

Assume that overall there are 10,000 items and 10 million transactions, all of size 10. Then 100,000 transactions contain A, and 500,000 more transactions contain all but one item of A. How many of these transactions contain A after they are randomized? The following is a rough average estimate:

A ⊂ t and A ⊂ R(t):  100,000 · 0.2³ = 800
|A ∩ t| = 2 and A ⊂ R(t):  500,000 · 0.2² · (8 · 0.8 / 10,000) = 12.8
|A ∩ t| ≤ 1 and A ⊂ R(t):  < 10⁷ · 0.2 · (9 · 0.8 / 10,000)² ≈ 1.04

So, there will be about 814 randomized transactions containing A, out of which about 800, or 98%, contained A before randomization as well. Now, suppose that the server receives from client Ci a randomized transaction R(t) that contains A. The server now knows that the actual, nonrandomized transaction t at Ci contains A with probability about 98%. On the other hand, the prior probability of A ⊂ t is just 1%. The disclosure of A ⊂ R(t) has caused a probability jump from 1% to 98%. This situation is a privacy breach.
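This rough estimate can be reproduced with a few lines of arithmetic; the figures (10,000 items, 10 million transactions of size 10, and 80% replacement) are those of the example above:

# Reproducing the rough privacy-breach estimate from the example: 80% of the
# items of each transaction are replaced by random items out of 10,000.
n_items = 10_000
n_trans = 10_000_000
keep, repl = 0.2, 0.8

full = 100_000 * keep ** 3                                  # A fully inside t
two = 500_000 * keep ** 2 * (8 * repl / n_items)            # 2 of the 3 items in t
at_most_one = n_trans * keep * (9 * repl / n_items) ** 2    # upper bound

total = full + two + at_most_one
print(round(full), round(two, 1), round(at_most_one, 2))    # 800 12.8 1.04
print(f"posterior P[A in t | A in R(t)] ~ {full / total:.0%}")   # about 98%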

Intuitively, a privacy breach with respect to some property P(t) occurs when, for some possible outcome of randomization (= some possible view of the server), the posterior probability of P(t) is higher than a given threshold called the privacy breach level. Of course, there are always some properties that are likely; so, we have to only look at “interesting” properties, such as the presence of a given item in t. In order to prevent privacy breaches from happening, transactions are randomized by inserting many “false” items, as well as deleting some “true” items. So many “false” items should be inserted into a transaction that one is as likely to see a “false” itemset as a “true” one. In select-a-size randomization


operator, it is the randomization level ρ that determines the probability of a “false” item to be inserted.

The other parameters, namely the distribution (p[0], p[1], . . . , p[m]), are set in [11] so that, for a certain “cutoff” integer K, any number of items from 0 to K is retained from the original transaction with probability 1/(K + 1), while the rest of the items are inserted independently

with probability ρ. The question of optimizing all select-a-size parameters to achieve maximum recoverability for a given breach level is left open.

The parameters of randomization are checked for privacy as follows. It is assumed that the server knows the maximum possible support of an itemset for each itemset size, among transactions of each transaction size, or their upper bounds. Based on this knowledge, the server computes partial supports for (imaginary) privacy-challenging itemsets, and tests randomization parameters by computing the posterior probabilities P[A ⊆ t | A ⊆ R(t)] from the definition of privacy breaches. The randomization parameters are selected to keep variance low while preventing privacy breaches for the privacy-challenging itemsets.

Graphs and experiments with real-life datasets show that, given several million transactions, it is possible to find randomization parameters so that the majority of 1-item, 2-item, and 3-item sets with support at least 0.2% can be recovered from randomized data, for a privacy breach level of 50%. However, long transactions (longer than about 10 items) have to be discarded, because the privacy-preserving randomization parameters for them must be "too randomizing," saving too little for support recovery. Those itemsets that were recovered incorrectly ("false drops" and "false positives") were usually close to the support threshold, i.e. there were few outliers. The standard deviation of the 3-itemset support estimator was at most 0.07% for one dataset and less than 0.05% for the other; for 1-item and 2-item sets it is smaller still.

5 Measures of Privacy

Privacy is measured in terms of confidence intervals. The nonrandomized numerical attribute xi is treated as an unknown parameter of the distribution of the randomized value Zi = xi + Yi. Given an instance zi of the randomized value Zi, the server can compute an interval I(zi) = [x−(zi), x+(zi)] such that xi ∈ I(zi) with at least a certain probability c%; this should be true for all xi. The length |I(zi)| of this confidence interval is treated as a privacy measure of the randomization. One problem with this method is that the domain of the nonrandomized value and its distribution are not taken into account. Consider an attribute X with the following density function:

fX(x) = 0.5 if 0 ≤ x ≤ 1 or 4 ≤ x ≤ 5, and fX(x) = 0 otherwise.

Assume that the perturbing additive Y is distributed uniformly in [−1, 1]; then, according to the confidence interval measure, the amount of privacy is 2 at confidence level 100%. However, if we take into account the fact that X must lie either in [0, 1] or in [4, 5], we can compute a confidence interval of size 1 (not 2). The interval is computed as follows:

I(z) = [0, 1] if −1 ≤ z ≤ 2, and I(z) = [4, 5] if 3 ≤ z ≤ 6.


Moreover, in many cases the confidence interval can be even shorter: for example, for z = −0.5 we can give the interval [0, 0.5] of size 0.5. Privacy can also be measured using Shannon's information theory. The average amount of information in the nonrandomized attribute X depends on its distribution and is measured by its differential entropy

h(X) := −E_{x~X}[ log2 fX(x) ] = −∫_{ΩX} fX(x) · log2 fX(x) dx.

The average amount of information that remains in X after the randomized attribute Z is disclosed can be measured by the conditional differential entropy

h(X|Z) := −E_{(x,z)~(X,Z)}[ log2 f_{X|Z=z}(x) ] = −∫_{Ω_{X,Z}} f_{X,Z}(x, z) · log2 f_{X|Z=z}(x) dx dz.

The average information loss for X that occurs by disclosing Z can be measured in terms of the difference between the two entropies:

I(X; Z) = h(X) − h(X|Z) = E_{(x,z)~(X,Z)}[ log2 ( f_{X|Z=z}(x) / fX(x) ) ].

This quantity is also known as the mutual information between the random variables X and Z. It is proposed in [1] to use the following functions to measure the amount of privacy (Π(X)) and the amount of privacy loss (P(X|Z)):

Π(X) := 2^{h(X)};    P(X|Z) := 1 − 2^{−I(X;Z)}.

In the example above we have

Π(X) = 2;   Π(X|Z) = 2^{h(X|Z)} ≈ 0.84;   P(X|Z) ≈ 0.58.

A possible interpretation of these numbers is that, without knowing Z, we can localize X within a set of size 2; when Z is revealed, we can (on average) localize X within a set of size 0.84, which is less than 1. However, even this information-theoretic measure of privacy is not without some difficulties. Suppose that clients would not like to disclose the property

"X ≤ 0.01." The prior probability of this property is 0.5%; however, if the randomized value Z happens to be in [−1, −0.99], the posterior probability P[X ≤ 0.01 | Z = z] becomes 100%. Of course, Z ∈ [−1, −0.99] is unlikely: it occurs for about 1 in 100,000 records. But every time it occurs, the property "X ≤ 0.01" is fully disclosed, i.e. it becomes 100% certain.

The mutual information, being an average measure, does not notice this rare disclosure. Nor does it alert us to the fact that whether X ∈ [0, 1] or X ∈ [4, 5] is fully disclosed for every record; this time it is because the prior probability of each of these properties is high (50%).

The notion of privacy breaches, on the other hand, captures these disclosures. Indeed, for any privacy breach level ρ < 100% and for some randomization outcome (namely, for Z ≤ −0.99) the posterior probability of the property "X ≤ 0.01" is above the breach level. The downside of the privacy breach definition is that we have to specify which properties are privacy-sensitive, i.e. whose probabilities must be kept below the breach level. Specifying too many privacy-sensitive properties may require too destructive a randomization, leading to a very imprecise aggregate model at the server. Thus, the question of the right privacy measure is still open.


6 Conclusion

The research in using randomization for preserving privacy has shown promise and has already led to interesting and practically useful results. This paper looks at privacy from a different angle than the conventional cryptographic approach. It raises an important question of measuring privacy, which should be addressed in the purely cryptographic setting as well, since the disclosure through legitimate query answers must also be measured. Randomization does not rely on intractability hypotheses from algebra or number theory, and does not require costly cryptographic operations or sophisticated protocols. It is possible that future studies will combine the statistical approach to privacy with cryptography and secure multiparty computation, to the mutual benefit of all of them.

References

[1] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th Symposium on Principles of Database Systems, Santa Barbara, California, USA, May 2001.
[2] A. Evfimievski. Randomization in privacy preserving data mining. SIGKDD Explorations, Volume 4, Issue 2, pages 43-47.
[3] R. J. A. Little. Statistical analysis of masked data. Journal of Official Statistics, 9(2):407-426, 1993.
[4] R. Agrawal and R. Srikant. Privacy preserving data mining. In Proceedings of the 19th ACM SIGMOD Conference on Management of Data, Dallas, Texas, USA, May 2000.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press, Boca Raton, Florida, USA, 1984.
[6] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 2002.
[7] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, pages 217-228, Edmonton, Alberta, Canada, July 23-26, 2002.


Mining Full Text Documents by Combining Classification and Clustering Approaches

Y. Ramu S.V. Engg. College for Women, Bhimavaram–534204

[email protected]

Abstract

The area of Knowledge Discovery in Text (KDT) and Text Mining (TM) is growing rapidly, mainly because of the strong need for analyzing the vast amount of textual data that resides on internal file systems and the Web. Most present-day search engines aid in locating relevant documents based on keyword matches. However, to provide the user with more relevant information, we need a system that also incorporates the conceptual framework of the queries. Training the search engine to retrieve documents based on a combination of keyword and conceptual matching is therefore essential. An automatic classifier is used to determine the concepts to which new documents belong. Currently, the classifier is trained by selecting documents randomly from each concept's training set, and it also ignores the hierarchical structure of the concept tree. In this paper, I present a novel approach to selecting these training documents by using document clustering within the concepts. I also exploit the hierarchical structure in which the concepts themselves are arranged. By combining these approaches to text classification, I can achieve an improvement in accuracy over the existing system.

1 Introduction

A vast amount of the data found in an organization, by some estimates as much as 80%, is textual, such as reports, emails, etc. This type of unstructured data usually lacks metadata, and as a consequence there is no standard means to facilitate search, query and analysis. Today, the Web has developed into a medium of documents for people rather than of data and information that can be processed automatically.

A human editor can only recognize that a new event has occurred by carefully following all the web pages or other textual sources. This is clearly inadequate for the volume and complexity of the information involved. The need for automated extraction of useful knowledge from huge amounts of textual data in order to assist human analysis is therefore apparent.

Knowledge Discovery and Text Mining are mostly automated techniques that aim to discover high-level information in huge amounts of textual data and present it to the potential user (analyst, decision-maker, etc.).

Knowledge Discovery in Text (KDT) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in unstructured textual data.

Unstructured textual data is a set of documents. In this paper, I use the term document to refer to a logical unit of text. This could be a Web page, a status memo, an invoice, an email, etc. It


can be complex and long, and is often more than text and can include graphics and multimedia content.

2 Motivation

Search engines often provide too many irrelevant results. This is mostly because of the fact that a single word might have multiple meanings [Krovetz 92]. Thus current day search engines that match documents only based on keywords prove inaccurate.

To overcome this problem, it is better to take into account both the keyword for which the user is searching and the meaning, or the concept, in which the user is interested. The conceptual arrangement of information can be found on the Internet in the form of directory services such as Yahoo!

These arrange Web pages conceptually in a hierarchical browsing structure. While it is possible that a lower level concept may belong to more than one higher-level concept, in our study we consider the hierarchy as a classification tree. In this case, each concept is a child of at most one parent concept.

During indexing, we can use an automatic classifier to assign newly arriving documents to one or more of the preexisting classes or concepts. However, the most successful paradigm for organizing large amounts of information is by categorizing the different documents according to their topic, where topics are organized in a hierarchy of increasing specificity [Koller 97]. By utilizing known hierarchical structure, the classification problem can be decomposed into a smaller set of problems corresponding to hierarchical splits in the tree.

For any classifier, the performance will improve if the documents that are used to train the classifier are the best representatives of the categories. Clustering can be used to select the documents that best represents a category.

3 Related Work

3.1 Text Classification

Text classification is the process of matching a document with the best possible concept(s) from a predefined set of concepts. Text classification is a two step process: Training and Classification.

i) Training: The system is given a set of pre-classified documents. It uses these to learn the features that represent each of the concepts.

ii) Classification: A classifier uses the knowledge that it has already gained in the training phase to assign a new document to one or more of the categories. Feature selection plays an important role in document classification.

3.2 Hierarchical Text Classification

In ‘flat text classification’, categories are treated in isolation of each other and there is no structure defining the relationships among them. A single huge classifier is trained which categorizes each new document as belonging to one of the possible basic classes.


In 'hierarchical text classification' we can address this large classification problem using a divide-and-conquer approach [Sun 01]. [Koller 97] proposed an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree.

At each level in the category hierarchy, a document can be first classified into one or more sub-categories using some flat classification method. We can use features from both the current level as well as its children to train this classifier. The following are the motivations for taking hierarchical structure into account [D'Alessio 00]:

1. The flattened classifier loses the intuition that topics that are close to each other in hierarchy have more in common with each other, in general, than topics that are spatially far apart. These classifiers are computationally simple, but they lose accuracy because the categories are treated independently and relationship among the categories is not exploited.

2. Text categorization in a hierarchical setting provides an effective solution for dealing with very large problems. By treating the problem hierarchically, it can be decomposed into several problems, each involving a smaller number of categories. Moreover, decomposing a problem can lead to more accurate specialized classifiers.

Category structures for hierarchical classification can be classified into [Sun 03]:

• Virtual category tree

• Category tree

• Virtual directed acyclic category graph

• Directed acyclic category graph

3.3 Document Clustering

There are many different clustering algorithms, but they fall into a few basic types [Manning 99]. One way to group the algorithms is into hierarchical clustering and flat (non-hierarchical) clustering.

i) Hierarchical Clustering: Produces a hierarchy of clusters with the usual interpretation that each node stands for a subclass of its mother node. There are two basic approaches to generating a hierarchical clustering: Agglomerative and Divisive.

ii) Flat Clustering (Non-Hierarchical): It simply creates a certain number of clusters, and the relationships between clusters are often undetermined. Most algorithms that produce flat clustering are iterative. They start with a set of initial clusters and improve them by iterating a reallocation operation that reassigns objects. Non-hierarchical algorithms often start out with a partition based on randomly selected seeds (one seed per cluster), and then refine this initial partition [Manning 99].

3.4 Document’s Indexing

The indexing process comprises two phases: classifier training and document collection indexing.


i) Classifier training: During this phase a fixed number of sample documents for each concept are collected and merged, and the resulting super-documents are preprocessed and indexed using the TF * IDF method. This essentially represents each concept by the centroid of the training set for that concept.

ii) Document Collection Indexing: New documents are indexed using a vector space method to create a traditional word-based index. Then, the document is classified by comparing the document vector to the centroid for each concept. The similarity values thus calculated are stored in the concept-based index.

4 Implementation (Approach)

4.1 Incorporating Clustering

Feature selection for text classification plays a primary role towards improving the classification accuracy and computational efficiency. With any large set of classes, the boundaries between the categories are fuzzy. The documents that are near the boundary line will add noise if used for training and confuse the classifier. Thus, we want to eliminate documents, and the words they contain, from the representative vector for the category. It is important for us to carefully choose the documents from each category on which the feature selection algorithms operate during training. Hence, in order to train the classifier, we need to:

• Identify within-category clusters (Cluster – Mining), and

• Extract the cluster(s)’ representative pages

Here, cluster mining differs from the conventional use of clustering techniques to compute a partition of a complete set of data (web documents in our case). Its aim is to identify only some representative clusters of Web pages within a Web structure. So we have to use clustering techniques to get some kind of information about the arrangement of documents within each category space, and select the best possible representative documents from those clusters. In essence, we are doing document mining within the framework of cluster mining.
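As a rough sketch of this idea (the use of scikit-learn's TF-IDF vectorizer and k-means, and all parameter values, are assumptions made for illustration, not the paper's implementation):

# Illustrative sketch: pick training documents for one category by clustering
# the category's documents and keeping those closest to the cluster centroids,
# discarding likely boundary/outlier pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

def select_representatives(docs, n_clusters=2, per_cluster=15):
    """docs: raw texts of one category; returns indices of selected documents."""
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = km.transform(X)                # distance of each doc to each centroid
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # the members closest to the centroid serve as the cluster's representatives
        order = members[np.argsort(dist[members, c])]
        selected.extend(order[:per_cluster].tolist())
    return selected

category_docs = ["course admission requirements overview", "undergraduate courses list",
                 "campus visit days schedule", "tuition and fees information"]  # toy examples
print(select_representatives(category_docs, n_clusters=2, per_cluster=2))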

4.2 Incorporating Hierarchical Classification

There are two approaches adopted by existing hierarchical classification methods [Sun 01]:

i) Big Bang approach: In this, only a single classifier is used in the classification process.

ii) Top-Down Level Based approach: In this one or more classifiers are constructed at each category level and each classifier works as a flat classifier.

In our approach, we adopt a top-down level based approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems. The classifiers we use are based on the vector space model.

4.3 System Architecture

We study and evaluate the performance of the text classifier when it is trained by using documents selected with the help of clustering from each category and by using a top-down,


level-based, hierarchical approach of text classification. To do this, we need the following components:

i) A system to perform document clustering within each category. We then choose the documents based on the result of clustering so that the documents that are the best representatives of the category are selected.

ii) Automatic classifier(s) that will be trained for evaluation purposes. We will have one comprehensive classifier for flat classification, and one classifier for each non-leaf node in the tree in the case of hierarchical classification.

iii) A mechanism to test the classifier with documents that it has not seen before, to evaluate its classification accuracy. We use the accuracy of classification as an evaluation measure.

4.4 Classification Phase (Testing)

Similar to the processing of the training documents, a term vector is generated for the document to be classified. This vector is compared with all the vectors in the training inverted index, and the category vectors most similar to the document vector are the categories to which the document is assigned. The similarity between the vectors is determined by the cosine similarity measure, i.e., the inner product of the vectors. This gives a measure of the degree of similarity of the document with a particular category. The results are then sorted to identify the top matches. A detailed discussion on tuning the various parameters, such as the number of tokens per document, the number of categories per document to be considered, etc., to improve the performance of the categorizer can be found in [Gauch 04].

5 Experimental Observations and Results

5.1 Experimental Set up

Source of Training Data: Because the Open Directory Project hierarchy [ODP 02] is readily available for download from their web site in a compact format, it was chosen as the source for the classification tree. In our work with hierarchical text classification, the top few levels of the tree are sufficient. We decided to classify documents into classes from the top three levels only.

5.2 Experiment: Determining the Baseline

Currently, KeyConcept uses a flat classifier. Documents are randomly selected from the categories. To evaluate our experiments, we must first establish a baseline level of performance with the existing classifier.

Chart 1 provides us with the baseline with which we can compare our future work. It shows the percentage of documents within the top n (n = 1, 2, ..., 10) concepts plotted against the rank n. 46.6% of the documents are correctly classified as belonging to their true category, and the correct answer appears within the top 10 selections over 80% of the time.


Chart 1: Baseline. Performance of the Flat Classifier when it is trained using 30 documents randomly selected from each concept

5.3 Experiment – Effect of Clustering on Flat Classification

Chart 2: Using within-category clustering to select the documents to train the flat classifier.

Chart 2 shows the comparison of the results obtained from the six experiments. It is clear from the graph above that the experiment selecting documents that are farthest from the centroid yields the poorest of the results. The percentage of exact matches in this case is just 29.6%. This denotes a fall of 36% as compared to our random baseline of 46.6%.

The experiment that involves selecting documents closest to the centroid from each concept gives us 49.5% exact matches. This translates to an improvement of 3% in exact terms over random training. In experiment-3, we choose 30 documents that are farthest from each other in each concept to train the classifier. The results of this experiment show that the percentage of exact matches is 48.6%, an improvement of 2% over our baseline.

The accuracy of the classifier is 51.6%, 52.2% and 52.9% for experiments 4, 5 and 6


respectively. The best observed performance among these 6 trials is for experiment-6, selecting from two clusters after discarding outliers, which shows an improvement of 6.3% over baseline.

6 Future Work

In this paper, we presented a novel approach to text classification by combining within-concept clustering with a hierarchical approach. We are going to conduct experiments to determine how deeply we need to traverse the concept tree to collect training documents.

References

[1] [Krovetz 92] Robert Krovetz and Bruce W. Croft. Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems, 10(2), April 1992, pages 115-141.
[2] [Koller 97] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proceedings of the 14th International Conference on Machine Learning, 1997.
[3] [Sun 01] A. Sun and E. Lim. Hierarchical Text Classification and Evaluation. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), California, USA, November 2001, pages 521-528.
[4] [D'Alessio 00] S. D'Alessio, K. Murray, R. Schiaffino, and A. Kershenbaum. The effect of using hierarchical classifiers in text categorization. In Proceedings of the 6th International Conference "Recherche d'Information Assistee par Ordinateur", Paris, FR, 2000, pages 302-313.
[5] [Sun 03] A. Sun, E. Lim, and W. Ng. Performance Measurement Framework for Hierarchical Text Classification. Journal of the American Society for Information Science and Technology, 54(11), 2003, pages 1014-1028.
[6] [Manning 99] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[7] [Gauch 04] S. Gauch, J. M. Madrid, S. Induri, D. Ravindran, and S. Chadalavada. KeyConcept: A Conceptual Search Engine. Information and Telecommunication Technology Center, Technical Report ITTC-FY2004-TR-8646-37, University of Kansas, 2004.


Discovery of Semantic Web Using Web Mining

K. Suresh, I.T., VCE, Hyderabad ([email protected])
P. Srinivas Rao, JNTU Hyderabad ([email protected])
D. Vasumathi, JNTUCEH ([email protected])

Abstract

The Semantic Web is the second-generation WWW, enriched by machine-processable information which supports the user in his tasks. The main idea of the Semantic Web is to enrich the current Web by machine-processable information in order to allow for semantic-based tools supporting the human user. Semantic Web Mining aims at combining the two fast-developing research areas Semantic Web and Web Mining. Web Mining aims at discovering insights about the meaning of Web resources and their usage. Given the primarily syntactical nature of the data Web Mining operates on, the discovery of meaning is impossible based on these data only. In this paper, we discuss the interplay of the Semantic Web with Web Mining, with a specific focus on usage mining.

1 Introduction

Web Usage Mining is the application of data mining methods to the analysis of recordings of Web usage, most often in the form of Web server logs. One of its central problems is the large number of patterns that are usually found: among these, how can the interesting patterns be identified? For example, an application of association rule analysis to a Web log will typically return many patterns like the observation that 90% of the users who made a purchase in an online shop also visited the homepage, a pattern that is trivial because the homepage is the site's main entry point. Statistical measures of pattern quality like support and confidence, and measures of interestingness based on the divergence from prior beliefs, are a primarily syntactical approach to this problem. They need to be complemented by an understanding of what a site and its usage patterns are about, i.e. a semantic approach. A popular approach for modeling sites and their usage is related to OLAP techniques: a modeling of the pages in terms of (possibly multiple) concept hierarchies, and an investigation of patterns at different levels of abstraction, i.e. a knowledge discovery cycle which iterates over various "roll-ups" and "drill-downs". Concept hierarchies conceptualize a domain in terms of taxonomies such as product catalogs, topical thesauri, etc. The expressive power of this form of knowledge representation is limited to is-a relationships. However, for many applications, a more expressive form of knowledge representation is desirable, for example ontologies that allow arbitrary relations between concepts.

A second problem facing many current analyses that take semantics into account is that the conceptualizations often have to be hand-crafted to represent a site that has grown independently of an overall conceptual design, and that the mapping of individual pages to this conceptualization may have to be established. It would thus be desirable to have a rich semantic model of a site, of its content and its (hyperlink) structure, a model that captures the


complexity of the manifold relationships between the concepts covered in a site, and a model that is “built into” the site in the sense that the pages requested by visitors are directly associated with the concepts and relations treated by it.

The Semantic Web is just this: today's Web enriched by a formal semantics in the form of ontologies that capture the meaning of pages and links in a machine-understandable form. The main idea of the Semantic Web is to enrich the current Web by machine-processable information in order to allow for semantic-based tools supporting the human user. In this paper, we discuss on the one hand how the Semantic Web can improve Web usage mining, and on the other hand how usage mining can be used to build up the Semantic Web.

2 Web Usage Mining

Web mining is the application of data mining techniques to the content, structure, and usage of Web resources. This can help to discover global as well as local structure within and between Web pages. Like other data mining applications, Web mining can profit from given structure on data (as in database tables), but it can also be applied to semi-structured or unstructured data like free-form text. This means that Web mining is an invaluable help in the transformation from human-understandable content to machine-understandable semantics. A distinction is generally made between Web mining that operates on the Web resources themselves (often further differentiated into content and structure mining), and mining that operates on visitors' usage of these resources. These techniques, and their application for understanding Web usage, will be discussed in more detail in section 5. In Web usage mining, the primary Web resource that is being mined is a record of the requests made by visitors to a Web site, most often collected in a Web server log [5]. The content and structure of Web pages, and in particular those of one Web site, reflect the intentions of the authors and designers of the pages and the underlying information architecture. The actual behavior of the users of these resources may reveal additional structure.

First, relationships may be induced by usage where no particular structure was designed. For example, in an online catalog of products, there is usually either no inherent structure (different products are simply viewed as a set), or one or several hierarchical structures given by product categories, manufacturers, etc. Mining the visits to that site, however, one may find that many of the users who were interested in product A were also interested in product B. Here, “interest” may be measured by requests for product description pages, or by the placement of that product into the shopping cart (indicated by the request for the respective pages). The identified association rules are at the center of cross-selling and up-selling strategies in E-commerce sites: When a new user shows interest in product A, she will receive a recommendation for product B (cf. [3, 4]).
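As a concrete illustration of how such rules can be obtained, the following sketch counts co-occurrences of products within sessions and reports pairs that exceed simple support and confidence thresholds; the session contents, product identifiers, and thresholds are illustrative assumptions rather than part of the studies cited above.

// Count how often pairs of products co-occur in the same session and report
// pairs whose support and confidence exceed simple thresholds (illustrative data).
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // Each session is the set of products a visitor showed interest in.
    std::vector<std::set<std::string>> sessions = {
        {"productA", "productB"}, {"productA", "productB", "productC"},
        {"productA", "productC"}, {"productB"}, {"productA", "productB"}};

    std::map<std::string, int> singleCount;                          // supports of single items
    std::map<std::pair<std::string, std::string>, int> pairCount;    // supports of item pairs

    for (const auto& s : sessions) {
        for (const auto& a : s) {
            ++singleCount[a];
            for (const auto& b : s)
                if (a < b) ++pairCount[{a, b}];
        }
    }

    const double minSupport = 0.3, minConfidence = 0.6;
    const double n = static_cast<double>(sessions.size());
    for (const auto& [p, cnt] : pairCount) {
        double support = cnt / n;
        double confidence = static_cast<double>(cnt) / singleCount[p.first];  // conf(A -> B)
        if (support >= minSupport && confidence >= minConfidence)
            std::cout << p.first << " -> " << p.second
                      << "  support=" << support << "  confidence=" << confidence << '\n';
    }
}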

Second, relationships may be induced by usage where a different relationship was intended. For example, sequence mining may show that many of the users who visited page C later went to page D, along paths that indicate a prolonged search (frequent visits to help and index pages, frequent backtracking, etc.) [1, 2]. This can be interpreted to mean that visitors wish to reach D from C, but that this was not foreseen in the information architecture, hence that there is at present no hyperlink from C to D. This insight can be used for static site improvement for all users (adding a link from C to D), or for dynamic recommendations personalized for the subset of users who go to C ("you may wish to also look at D"). It is useful to combine Web usage mining with content and structure analysis in order to "make


sense" of observed frequent paths and the pages on these paths. This can be done using a variety of methods. Many of these methods rely on a mapping of pages into an ontology. An underlying ontology and the mapping of pages into it may already be available, the mapping of pages into an existing ontology may need to be learned, and/or the ontology itself may have to be inferred first. In the following sections, we will first investigate the notions of semantics (as used in the Semantic Web) and ontologies in more detail. We will then look at how the use of ontologies, and other ways of identifying the meaning of pages, can help to make Web mining go semantic. Lastly, we will investigate how ontologies and their instances can be learned.

3 Semantic Web

The Semantic Web is based on a vision of Tim Berners-Lee, the inventor of the WWW. The great success of the current WWW leads to a new challenge: a huge amount of data is interpretable by humans only; machine support is limited. Berners-Lee suggests enriching the Web with machine-processable information which supports the user in his tasks. For instance, today's search engines are already quite powerful, but still too often return too large or inadequate lists of hits. Machine-processable information can point the search engine to the relevant pages and can thus improve both precision and recall. For instance, it is today almost impossible to retrieve information with a keyword search when the information is spread over several pages. The process of building the Semantic Web is still going on today. Its structure has to be defined, and this structure then has to be filled with life. In order to make this task feasible, one should start with the simpler tasks first. The following steps show the direction in which the Semantic Web is heading:

1. Providing a common syntax for machine understandable statements.

2. Establishing common vocabularies.

3. Agreeing on a logical language.

4. Using the language for exchanging proofs.

Berners-Lee suggested a layer structure for the Semantic Web: (i) Unicode/URI, (ii) XML/Namespaces/XML Schema, (iii) RDF/RDF Schema, (iv) Ontology vocabulary, (v) Logic, (vi) Proof, (vii) Trust.

This structure reflects the steps listed above. It follows the understanding that each step alone will already provide added value, so that the Semantic Web can be realized in an incremental fashion. On the first two layers, a common syntax is provided. Uniform resource identifiers (URIs) provide a standard way to refer to entities, while Unicode is a standard for exchanging symbols. The Extensible Markup Language (XML) fixes a notation for describing labeled trees, and XML Schema allows one to define grammars for valid XML documents. XML documents can refer to different namespaces to make explicit the context (and therefore the meaning) of different tags. The formalizations on these two layers are nowadays widely accepted, and the number of XML documents is increasing rapidly.

The Resource Description Framework (RDF) can be seen as the first layer which is part of the Semantic Web. According to the W3C recommendation [40], RDF "is a foundation for processing metadata; it provides interoperability between applications that exchange machine understandable information on the Web." RDF documents consist of three types of entities: resources, properties, and statements. Today, the Semantic Web community considers the ontology and logic levels rather as one single level, as most ontologies allow for logical axioms. Following [2], an ontology is "an explicit formalization of a shared understanding of a conceptualization". This high-level definition is realized differently by different research communities. However, most of them have a certain understanding in common, as most of them include a set of concepts, a hierarchy on them, and relations between concepts. Most of them also include axioms in some specific logic. To give a flavor, we present here just the core of our own definition [3], as it is reflected by the Karlsruhe Ontology framework KAON. It is built in a modular way, so that different needs can be fulfilled by combining parts.

Fig. 1: The Relation between the WWW, Relational Metadata, and Ontologies.

Definition 1. A core ontology with axioms is a tuple O := (C, ≤_C, R, σ, ≤_R, A) consisting of: two disjoint sets C and R whose elements are called concept identifiers and relation identifiers, respectively; a partial order ≤_C on C, called concept hierarchy or taxonomy; a function σ: R → C+ called signature (where C+ is the set of all finite tuples of elements in C); a partial order ≤_R on R, called relation hierarchy, where r1 ≤_R r2 implies |σ(r1)| = |σ(r2)| and π_i(σ(r1)) ≤_C π_i(σ(r2)) for each 1 ≤ i ≤ |σ(r1)|, with π_i being the projection on the i-th component; and a set A of logical axioms in some logical language L.

This definition constitutes a core structure that is quite straightforward, well agreed upon, and that may easily be mapped onto most existing ontology representation languages. Step by step, the definition can be extended by taking into account axioms, lexicons, and knowledge bases [1]. As an example, have a look at the top of Figure 1. The set C of concepts is the set {Top, Project, Person, Researcher, Literal}, and the concept hierarchy ≤_C is indicated by the arrows with a bold head. The set R of relations is the set {works-in, researcher, cooperates-with, name}. The relation 'works-in' has (Person, Project) as signature, and the relation 'name' has (Person, Literal) as signature. In this example, the hierarchy on the relations is flat, i.e., ≤_R is just the identity relation. For an example of a non-flat relation, have a look at Figure 2.
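To make the structure of Definition 1 concrete, the following C++ sketch models a core ontology with the concepts and relations of the Figure 1 example; representing the partial orders by direct super-concept links and the axioms as plain strings are implementation assumptions, not part of the definition itself.

// Core ontology structure: concept identifiers with a taxonomy (partial order
// given here by direct super-concept links), relation identifiers with
// signatures over concepts, and a placeholder for logical axioms.
#include <map>
#include <set>
#include <string>
#include <vector>

struct CoreOntology {
    std::set<std::string> concepts;                        // C
    std::map<std::string, std::set<std::string>> superOf;  // direct edges of <=_C
    std::map<std::string, std::vector<std::string>> sig;   // sigma: R -> C+
    std::vector<std::string> axioms;                       // A, e.g. "symmetric(cooperates-with)"

    // Reflexive-transitive check of the concept hierarchy <=_C.
    bool isSubConceptOf(const std::string& c, const std::string& d) const {
        if (c == d) return true;
        auto it = superOf.find(c);
        if (it == superOf.end()) return false;
        for (const auto& p : it->second)
            if (isSubConceptOf(p, d)) return true;
        return false;
    }
};

int main() {
    CoreOntology o;
    o.concepts = {"Top", "Project", "Person", "Researcher", "Literal"};
    o.superOf["Project"] = {"Top"};
    o.superOf["Person"] = {"Top"};
    o.superOf["Researcher"] = {"Person"};
    o.sig["works-in"] = {"Person", "Project"};
    o.sig["cooperates-with"] = {"Person", "Person"};
    o.sig["name"] = {"Person", "Literal"};
    o.axioms.push_back("symmetric(cooperates-with)");
    return o.isSubConceptOf("Researcher", "Top") ? 0 : 1;  // Researcher <=_C Top holds
}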


Fig. 2: Parts of the ontology of the content. (The figure shows a taxonomy rooted at root with concepts such as facility, accommodation and food provider; hotel, youth hostel, family hotel and wellness hotel; fast-food restaurant, Italian, German, vegetarian-only and regular restaurants; and sports facilities such as tennis court and minigolf; relations shown include belongs_to and is_sports_facility.)

The objects of the metadata level can be seen as instances of the ontology concepts. For example, 'URI-SWMining' is an instance of the concept 'Project', and thus by inheritance also of the concept 'Top'. Up to here, RDF Schema would be sufficient for formalizing the ontology. Often, ontologies also contain logical axioms. By applying logical deduction, one can then infer new knowledge from the information which is stated implicitly. The axiom in Figure 1 states, for instance, that the 'cooperates-with' relation is symmetric. From it, one can logically infer that the person addressed by 'URI-AHO' is cooperating with the person addressed by 'URI-GST' (and not only the other way around).

A priori, any knowledge representation mechanism can play the role of a Semantic Web language. Frame Logic (or F-Logic; [2]), for instance, provides a semantically founded knowledge representation based on the frame and slot metaphor. Probably the most popular framework at the moment is Description Logics (DL). DLs are subsets of first order logic which aim at being as expressive as possible while still being decidable. The description logic SHIQ provides the basis for the web language DAML+OIL. Its latest version is currently being established by the W3C Web Ontology Working Group (WebOnt) under the name OWL. Several tools are in use for the creation and maintenance of ontologies and metadata, as well as for reasoning within them. Our group has developed OntoEdit [4, 5], an ontology editor which is connected to Ontobroker [1], an inference engine for F-Logic. It provides means for semantics-based query handling over distributed resources. In this paper, we will focus our interest on the XML, RDF, ontology and logic layers.

4 Using Semantics for Usage Mining and Mining the Usage of the Semantic Web

Semantics can be utilized for Web mining for different purposes. Some of the approaches presented in this section rely on a comparatively ad hoc formalization of semantics, while others exploit the full power of the Semantic Web. The Semantic Web offers a good basis to enrich Web mining: the types of (hyper)links are now described explicitly, allowing the


knowledge engineer to gain deeper insights into Web structure mining; and the contents of the pages come along with a formal semantics, allowing her to apply mining techniques which require more structured input. Because the distinction between the use of semantics for Web mining and the mining of the Semantic Web itself is anything but sharp, we will discuss both in an integrated fashion. Web usage mining benefits from including semantics in the mining process for the simple reason that the application expert, as the end user of mining results, is interested in events in the application domain, in particular user behavior, while the data available (Web server logs) are technically oriented sequences of HTTP requests.

A central aim is therefore to map HTTP requests to meaningful units of application events. Application events are defined with respect to the application domain and the site, a non-trivial task that amounts to a detailed formalization of the site's business model. For example, relevant E-business events include product views and product click-throughs, in which a user shows specific interest in a specific product by requesting more detailed information (e.g., from the Beach Hotel to a listing of its prices in the various seasons). Web server logs generally contain at least some information on an event that was marked by the user's request for a specific Web page, or the system's generating a page to acknowledge the successful completion of a transaction. For example, consider a tourism Web site that allows visitors to search hotels according to different criteria, to look at detailed descriptions of these hotels, to make reservations, and so on. In such a Web site, a hotel room reservation event may be identified by the recorded delivery of the page reserve.php?user=12345&hotel=BeachHotel&people=2&arrive=01May&depart=04May, which was generated after the user chose "room for 2 persons" in the "Beach Hotel" and typed in the arrival and departure dates of his desired stay. What information the log contains, and whether this is sufficient, will depend on the technical set-up of the site as well as on the purposes of the analysis. So what are the aspects of application events that need to be reconstructed using semantics? In the following sections, we will show that a requested Web page is, first, about some content, second, the request for a specific service concerning that content, and third, usually part of a larger sequence of events. We will refer to the first two as atomic application events, and to the third as complex application events.

4.1 Atomic Application Events: Content

A requested Web page is about something, usually a product or other object described in the page. For example, search hotel.html?facilities=tennis may be a page about hotels, more specifically a listing of hotels, with special attention given to a detailed indication of their sports facilities. To describe content in this way, URLs are generally mapped to concepts. The concepts are usually organized in taxonomies (also called "concept hierarchies", see [1] and the definition in Section 3). For example, a tennis court is a facility. Introducing relations, we note that a facility belongs-to an accommodation, etc. (see Fig. 2).
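A minimal sketch of such a mapping: a small rule table assigns concepts to URL patterns, and each concept is then rolled up along the content taxonomy of Fig. 2; the URL patterns and the taxonomy fragment are illustrative assumptions.

// Map a requested URL to content concepts and roll each concept up in the
// content taxonomy (e.g. tennis_court -> sports_facility -> facility).
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Direct parent of each concept in the (assumed) content taxonomy fragment.
    std::map<std::string, std::string> parent = {
        {"tennis_court", "sports_facility"}, {"minigolf", "sports_facility"},
        {"sports_facility", "facility"},     {"wellness_hotel", "hotel"},
        {"family_hotel", "hotel"},           {"hotel", "accommodation"}};

    // Very small rule table: if the URL contains the pattern, add the concept.
    std::vector<std::pair<std::string, std::string>> rules = {
        {"search_hotel.html", "hotel"}, {"facilities=tennis", "tennis_court"}};

    std::string url = "search_hotel.html?facilities=tennis";
    for (const auto& [pattern, mapped] : rules) {
        if (url.find(pattern) == std::string::npos) continue;
        // Print the matched concept and its ancestors up to the taxonomy root.
        for (std::string c = mapped; !c.empty();) {
            std::cout << c << ' ';
            auto it = parent.find(c);
            c = (it == parent.end()) ? std::string() : it->second;
        }
        std::cout << '\n';
    }
}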

4.2 Atomic Application Events: Service

A requested Web page reflects a purposeful user activity, often the request for a specific service. For example, search hotel.html?facilities=tennis was generated after the user had initiated a search by hotel facilities (stating tennis as the desired value). This way of analyzing requests gives a better sense of what users wanted and expected from the site, as opposed to what they received in terms of the eventual content of the page.


To a certain extent, the requested service is associated with the request's URL stem and the delivered page's content (e.g., the URL search hotel.html says that the page was a result of a search request). However, the delivered page's content may also be meaningless for the understanding of user intentions, as is the case when the delivered page was a "404 File not found". More information is usually contained in the specifics of the user query that led to the creation of the page. This information may be contained in the URL query string, which is recorded in the Web server log if the common request method GET is used.
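The following sketch shows how the query string of a logged GET request can be split into parameters from which the requested service can be read; the URL and parameter names are illustrative, not taken from an actual log.

// Split the query string of a logged GET request into key/value pairs,
// from which the requested service (here: a search by facility) can be read.
#include <iostream>
#include <map>
#include <sstream>
#include <string>

std::map<std::string, std::string> parseQuery(const std::string& url) {
    std::map<std::string, std::string> params;
    auto qpos = url.find('?');
    if (qpos == std::string::npos) return params;
    std::istringstream qs(url.substr(qpos + 1));
    std::string kv;
    while (std::getline(qs, kv, '&')) {              // parameters are separated by '&'
        auto eq = kv.find('=');
        if (eq != std::string::npos)
            params[kv.substr(0, eq)] = kv.substr(eq + 1);
    }
    return params;
}

int main() {
    // URL as it might appear in the request field of a server log entry.
    auto p = parseQuery("search_hotel.html?facilities=tennis&sort=price");
    for (const auto& [key, value] : p)
        std::cout << key << " = " << value << '\n';  // facilities = tennis, sort = price
}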

The query string may also be recorded by the application server in a separate log. As an example, we have used an ontology to describe a Web site which operates on relational databases and also contains a number of static pages, together with an automated classification scheme that relies on mapping the query strings for dynamic page generation to concepts [5]. Pages are classified according to multiple concept hierarchies that reflect content (type of object that the page describes), structure (function of pages in object search), and service (type of search functionality chosen by the user). A path can then be regarded as a sequence of (more or less abstract) concepts in a concept hierarchy, allowing the analyst to identify strategies of search. This classification can make Web usage mining results more comprehensible and actionable for Web site redesign or personalization: The semantic analysis has helped to improve the design of search options in the site, and to identify behavioral patterns that indicate whether a user is likely to successfully complete a search process, or whether he is likely to abandon the site. The latter insights could be used to dynamically generate help messages for new users.

Oberle [4] develops a scheme for application server logging of user queries with respect to a full-blown ontology (a "knowledge portal" in the sense of [2]). This allows the analyst to utilize the full expressiveness of the ontology language, which enables a wide range of inferences going beyond the use of taxonomy-based generalizations. He gives examples of possible inferences on queries to a community portal, which can help support researchers in finding potential cooperation partners and projects. A large-scale evaluation of the proposal is under development. The ontologies of content and services of a Web site, as well as the mapping of pages into them, may be obtained in various ways. At one extreme, ontologies may be handcrafted ex post; at the other extreme, they may be the generating structure of the Web site (in which case the mapping of pages to ontology elements is also already available). In most cases, mining methods themselves must be called upon to establish the ontology (ontology learning) and/or the mapping (instance learning), for example by using methods of learning relations (e.g., [3]) and information extraction (e.g., [1, 3]).

4.3 Complex Application Events

A requested Web page, or rather, the activity or activities behind it, is generally part of a more extended behavior. This may be a problem-solving strategy consciously pursued by the user (e.g., to narrow down a search by iteratively refining search terms), a canonical activity sequence pertaining to the site type (e.g., catalog search/browse, choose, add-to-cart, pay in an E-commerce setting [37]), or a description of behavior identified by application experts in


exploratory data analysis. An example of the latter is the distinction of four kinds of online shopping strategies by [4]: directed buying, search/deliberation, hedonic browsing, and knowledge building. The first group is characterized by focused search patterns and immediate purchase.

The second is more motivated by a future purchase and therefore tends to browse through a particular category of products rather than directly proceed to the purchase of a specific product. The third is entertainment- and stimulus-driven, which occasionally results in spontaneous purchases. The fourth also shows exploratory behavior, but for the primary goal of information acquisition as a basis for future purchasing decisions. Moe characterized these browsing patterns in terms of product and category pages visited on a Web site. Spiliopoulou, Pohle, and Teltzrow [4] transferred this conceptualization to the analysis of a non-commercial information site. They formulated regular expressions that capture the behavior of search/deliberation and knowledge building, and used sequence mining to identify these behaviors in the site’s logs.

Fig. 3: Parts of the ontology of the complex application events of the example site (search/deliberation and knowledge building, each sketched as a pattern over Home, category and product page requests).

4.4 How is Knowledge about Application Events used in Mining?

Once requests have been mapped to concepts, the question arises how knowledge is gained from these transformed data. We will investigate the treatment of atomic and of complex application events in turn. Mining using multiple taxonomies is related to OLAP data cube techniques: objects (in this case, requests or requested URLs) are described along a number of dimensions, and concept hierarchies or lattices are formulated along each dimension to allow more abstract views. The analysis of data abstracted using taxonomies is crucial for many mining applications to generate meaningful results: In a site with dynamically generated pages, each individual page will be requested so infrequently that no regularities may be found in an analysis of navigation behavior. Rather, regularities may exist at a more abstract level, leading to rules like "visitors who stay in Wellness Hotels also tend to eat in restaurants". Second, patterns mined in past data are not helpful for applications like recommender systems when new items are introduced into the product catalog and/or site structure: The new Pier Hotel cannot be recommended simply because it was not in the


tourism site before and thus could not co-occur with any other item, be recommended by another user, etc.

A knowledge of regularities at a more abstract level could help to derive a recommendation of the Pier Hotel because it too is a Wellness Hotel (and there are criteria for recommending Wellness Hotels). After the preprocessing steps in which access data have been mapped into taxonomies, two main approaches are taken in subsequent mining steps. In many cases, mining operates on concepts at a chosen level of abstraction: for example, on sessions transformed into points in a feature space [3], or on sessions transformed into sequences of content units at a given level of description (for example, association rules can be sought between abstract concepts such as Wellness Hotels, tennis courts, and restaurants). This approach is usually combined with interactive control of the software, so that the analyst can re-adjust the chosen level of abstraction after viewing the results (e.g., in the miner WUM; see [5] for a case study). Alternatively to this 'static' approach, other algorithms identify the most specific level of relationships by choosing concepts dynamically. This may lead to rules like "People who stay at Wellness Hotels tend to eat at vegetarian-only Italian restaurants", linking hotel-choice behavior at a comparatively high level of abstraction with restaurant-choice behavior at a comparatively detailed level of description.
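As a simple illustration of the first approach, the sketch below transforms sessions into count vectors over a fixed set of abstract concepts, i.e. points in a feature space; the concept set and the sessions are illustrative assumptions.

// Transform sessions (sequences of requested concepts) into count vectors
// over a fixed set of abstract concepts, as a preprocessing step for mining.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Chosen level of abstraction: every request is assumed to be already
    // generalized to one of these concepts via the content taxonomy.
    std::vector<std::string> concepts = {"wellness_hotel", "tennis_court", "restaurant"};

    std::vector<std::vector<std::string>> sessions = {
        {"wellness_hotel", "restaurant", "restaurant"},
        {"tennis_court", "wellness_hotel"}};

    for (const auto& s : sessions) {
        std::map<std::string, int> count;
        for (const auto& c : s) ++count[c];
        // Emit the session as a point in the feature space spanned by `concepts`.
        for (const auto& c : concepts) std::cout << count[c] << ' ';
        std::cout << '\n';
    }
}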

Semantic Web Usage Mining for complex application events involves two steps of mapping requests to events. As discussed in Section 4.3 above, complex application events are usually defined by regular expressions in atomic application events (at some given level of abstraction in their respective hierarchies). Therefore, in a first step, URLs are mapped to atomic application events at the required level of abstraction. In a second step, a sequence miner can then be used to discover sequential patterns in the transformed data.

The shapes of sequential patterns sought, and the mining tool used, determine how much prior knowledge can be used to constrain the patterns identified: They range from largely unconstrained first-order or k-th order Markov chains [7], to regular expressions that specify the atomic activities completely (the name of the concept) or partially (a variable matching a set of concepts) [4, 2]. Examples of the use of regular expressions describing application-relevant courses of events include search strategies [5], a segmentation of visitors into customers and non-customers [7], and a segmentation of visitors into different interest groups based on the customer buying cycle model from marketing [4]. To date, few commonly agreed-upon models of Semantic Web behavior exist.
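A minimal sketch of the second step: once a session has been mapped to a sequence of atomic application events, a complex event given as a regular expression can be detected with std::regex; the event alphabet and the pattern are illustrative stand-ins for patterns such as those sketched in Fig. 3.

// Detect a complex application event, given as a regular expression over
// atomic events, in a session that has already been mapped to concepts.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    // Session as a sequence of atomic application events, joined by spaces.
    std::vector<std::string> session = {"Home", "category", "product",
                                        "category", "product", "product"};
    std::string s;
    for (const auto& e : session) s += e + ' ';

    // Illustrative "browse a category" pattern: Home, then one or more
    // category visits, each followed by at least one product view.
    std::regex browse("^Home (category (product )+)+");
    std::cout << (std::regex_search(s, browse) ? "knowledge-building-like pattern"
                                               : "no match")
              << '\n';
}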

5 Extracting Semantics from Web Usage

The effort behind the Semantic Web is to add semantic annotation to Web documents in order to access knowledge instead of unstructured material. The purpose is to allow knowledge to be managed in an automatic way. Web Mining can help to learn definitions of structures for knowledge organization (e. g., ontologies) and to provide the population of such knowledge structures. All approaches discussed here are semi-automatic. They assist the knowledge engineer in extracting the semantics, but cannot completely replace her. In order to obtain high quality results, one cannot replace the human in the loop, as there is always a lot of tacit knowledge involved in the modeling process [5].

A computer will never be able to fully consider background knowledge, experience, or social conventions. If this were the case, the Semantic Web would be superfluous, since then


machines like search engines or agents could operate directly on conventional Web pages. The overall aim of our research is thus not to replace the human, but rather to provide him with more and more support. In [6], we have discussed how content, structure, and usage mining can be used for creating Semantics. Here we focus on the contribution of usage mining. In the World Wide Web as in other places, much knowledge is socially constructed.

This social behavior is reflected by the usage of the Web. One tenet related to this view is that navigation is not only driven by formalized relationships or the underlying logic of the available Web resources, but that it "is an information browsing strategy that takes advantage of the behavior of like-minded people". Recommender systems based on "collaborative filtering" have been the most popular application of this idea. In recent years, the idea has been extended to consider not only ratings, but also Web usage as a basis for the identification of like-mindedness ("People who liked/bought this book also looked at ..."); see [3] for a recent mining-based system; see also [6] for a classic, although not mining-based, application. Web usage mining by its definition always creates patterns that structure pages in some way.

6 Conclusions and Outlook

In this paper, we have studied the combination of the two fast-developing research areas Semantic Web and Web Mining, especially usage mining. We discussed how Semantic Web Usage Mining can improve the results of ‘classical’ usage mining by exploiting the new semantic structures in the Web; and how the construction of the Semantic Web can make use of Web Mining techniques. A truly semantic understanding of Web usage needs to take into account not only the information stored in server logs, but also the meaning that is constituted by the sets and sequences of Web page accesses. The examples provided show the potential benefits of further research in this integration attempt. One important focus is to make search engines and other programs able to better understand the content of Web pages and sites. This is reflected in the wealth of research efforts that model pages in terms of an ontology of the content. Overall, three important directions for further interdisciplinary cooperation between mining and application experts in Semantic Web Usage Mining have been identified:

1. the development of ontologies of complex behavior,

2. the deployment of these ontologies in Semantic Web description and mining tools and

3. continued research into methods and tools that allow the integration of both experts’ and users’ background knowledge into the mining cycle. Web mining methods should increasingly treat content, structure, and usage in an integrated fashion in iterated cycles of extracting and utilizing semantics, to be able to understand and (re)shape the Web.

References

[1] C.C. Aggarwal. Collaborative crawling: Mining user experiences for topical resource discovery. In KDD-2002 – Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, CA, July 23-26, 2002, pages 423–428, New York, 2002. ACM.

[2] M. Baumgarten, A.G. Büchner, S.S. Anand, M.D. Mulvenna, and J.G. Hughes. User-driven navigation pattern discovery from internet data. In M. Spiliopoulou and B. Masand, editors, Advances in Web Usage Analysis and User Profiling, pages 74–91. Springer, Berlin, 2000.

[3] B. Berendt. Detail and context in web usage mining: Coarsening and visualizing sequences. In R. Kohavi, B.M. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 – Mining Web Log Data Across All Customer Touch Points, pages 1–24. Springer-Verlag, Berlin Heidelberg, 2002.

[4] B. Berendt, B. Mobasher, M. Nakagawa, and M. Spiliopoulou. The impact of site structure and user environment on session reconstruction in web usage analysis. In Workshop Notes of the Fourth WEBKDD Workshop: Web Mining for Usage Patterns & User Profiles at KDD-2002, July 23, 2002, pages 115–129, 2002.

[5] B. Berendt and M. Spiliopoulou. Analysing navigation behavior in web sites integrating multiple information systems. The VLDB Journal, 9(1):56–75, 2000.

[6] B. Berendt, A. Hotho, and G. Stumme. Towards semantic web mining. In [22], pages 264–278.

[7] J.L. Borges and M. Levene. Data mining of user navigation patterns. In M. Spiliopoulou and B. Masand, editors, Advances in Web Usage Analysis and User Profiling.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Performance Evolution of Memory Mapped Files on Dual Core Processors Using Large Data Mining Data Sets

S.N. Tirumala Rao (RNEC, Ongole)    E.V. Prasad (J.N.T.U.C.E)
N.B. Venkateswarlu (AITAM, Tekkali)    G. Sambasiva Rao (SACET, Chirala)

Abstract

In recent years, major CPU designers have shifted from ramping up clock speeds to adding on-chip multi-core processors. A study is carried out with data mining (DM) algorithms to explore the potential of multi-core hardware architecture with OpenMP. The concept of memory mapped files is widely supported by most modern operating systems, and the performance of memory mapped files on multi-core processors is also studied. In our experiments, popular clustering algorithms such as k-means and max-min are used, and experiments are carried out with both serial and parallel versions. Experimental results with both simulated and real data demonstrate the scalability of our implementation and the effective utilization of parallel hardware, which benefits DM problems involving large data sets.

Keywords: OpenMP, mmap(), fread(), k-means and max-min

1 Introduction

The goal of data mining is to discover knowledge hidden in data repositories. This activity has recently attracted a lot of attention. High energy physics experiments produce hundreds of terabytes of data, the credit card and banking sectors hold large databases of customer transactions, and web search engines collect web documents worldwide. Regardless of the application field, Data Mining (DM) allows one to 'dig' into huge datasets to reveal patterns and correlations useful for high level interpretation. Finding clusters, association rules, classes and time series are the most common DM tasks. Evidently, classification algorithms and clustering algorithms are employed for this purpose. All require the use of algorithms whose complexity, both in time and in space, grows at least linearly with the dataset size. Because of the size of the data and the complexity of the algorithms, DM algorithms are reported to be time-consuming and to hinder quick policy decision making. There have been many attempts to reduce the CPU time requirements of DM applications [Venkateswarlu et al., 1995; Gray and More, 2004].

Many DM algorithms require a computation to be iteratively applied to all records of a dataset. In order to guarantee scalability, even on a serial or a small scale parallel platform


(workstation cluster), the increase in I/O activity must be carefully taken into account. The work of [Palmerini, 2001] recognized two main categories of algorithms with respect to the patterns of their I/O activities: Read and Compute (R&C) algorithms, which reuse the same dataset at each iteration, and Read, Compute and Write (RC&W) ones, which at each iteration rewrite the dataset to be used at the next step. It also suggested the employment of 'Out-of-Core' (OOC) techniques, which explicitly take care of data movements and are reported to show low I/O overhead. An important OS feature is time-sharing among processes, widely known as multi-threading, with which one can overlap I/O actions with useful computations. [Stoffel et al., 1999; Bueherg, 2006] demonstrated the advantage of such features for designing efficient DM algorithms.

Most of the commercial data mining tools and public domain tools such as Clusta, Xcluster, Rosetta, FASTLab, Weka, etc., support DM algorithms which accept data sets in flat file or CSV form only. Thus, they use standard I/O functions such as fgetc() and fscanf(). However, fread() is also in wide use with many DM algorithms [chen et al., 2002; Islam, 2003]. Moreover, earlier studies [Islam, 2003] indicated that kernel-level I/O fine tuning is very important in getting better throughput from the system while running DM algorithms. In recent years, many network and other applications which involve huge I/O overhead are reported to be using a special I/O feature known as mmap() to improve their performance; for example, the performance of the Apache server was addressed in [www.isi.edu]. In addition, a CPU time benefit has been reported from using memory mapping rather than conventional I/O in the Mach operating system. [Carig and Leroux] have reported that effective utilization of multi-core technology will profoundly improve the performance and scalability of networking equipment, video game platforms, and a host of other embedded applications.

2 Parallel processing

Parallel processing used to be reserved for supercomputers. With the rise of the Internet, many companies came to need web and database servers capable of handling thousands of requests per second. These servers used a technology known as Symmetric Multi-Processing (SMP), which is still the most common form of parallel processing. Requests for web pages, however, are atomic, and so, if you have a mainframe with 4 CPUs, you can run four copies of the web server (one on each CPU) and dispatch incoming requests to whichever CPU is the least busy. Now, parallel computing is becoming extremely common. Dual-CPU systems (using SMP) are much cheaper than they used to be, putting them within the reach of many consumers. Moreover, many single-CPU machines now have parallel capabilities: Intel's Hyper-Threading CPUs are capable of running multiple threads simultaneously under certain conditions. Now, with Intel's and AMD's dual-core processors, many people who buy a single-CPU system actually have the functionality of a dual-CPU system.

2.1 Parallelization by Compiler

The first survey of parallel algorithms for hierarchical clustering using distance-based metrics is given in [Olson, 1995]. A parallelizing compiler generally works in two different ways: fully automatic and programmer directed. Fully automatic parallelization has several important caveats: wrong results may be produced, performance may actually degrade, and it is much less flexible than manual parallelization [parallel computing].


3 Traditional File I/O

The traditional way of accessing files is to first open them with the open system call and then use read, write and lseek calls to do sequential or random access I/O. Detailed experimental results of traditional file I/O on DM algorithms showed that fread() gives better performance than fgetc() on single-core machines; for this, refer to Figure 1 (Annexure) in [Tirumala Rao et al., 2008].

3.1 Memory Mapping

A memory mapping of a file is a special file access technique that is widely supported in popular operating systems such as Unix and Windows; it has been reported that mapping a large file into memory (the address space) can significantly enhance I/O system performance. Detailed experimental results of traditional file I/O and memory mapping (mmap()) on DM algorithms showed that mmap() gives better performance than fread() on single-core machines; refer to Figure 2 (Annexure) in [Tirumala Rao et al., 2008].
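For illustration, the following Linux sketch contrasts the two access paths compared above: reading a binary file of floats with fread() versus mapping the whole file with mmap(); the file name and record layout are assumptions, not the exact setup of [Tirumala Rao et al., 2008].

// Read a binary file of floats either with fread() or by mapping it into the
// address space with mmap(); both paths sum the values for comparison.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

double sum_fread(const char* path) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return 0.0;
    double s = 0.0;
    float buf[4096];
    size_t n;
    while ((n = std::fread(buf, sizeof(float), 4096, f)) > 0)
        for (size_t i = 0; i < n; ++i) s += buf[i];
    std::fclose(f);
    return s;
}

double sum_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0.0;
    struct stat st;
    fstat(fd, &st);
    // Map the whole file read-only; the kernel pages the data in on demand.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    double s = 0.0;
    if (p != MAP_FAILED) {
        const float* data = static_cast<const float*>(p);
        size_t count = st.st_size / sizeof(float);
        for (size_t i = 0; i < count; ++i) s += data[i];
        munmap(p, st.st_size);
    }
    close(fd);
    return s;
}

int main() {
    const char* path = "dataset.bin";   // assumed binary data set
    std::printf("fread: %f  mmap: %f\n", sum_fread(path), sum_mmap(path));
}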

4 OpenMP

OpenMP is an API that provides a portable, scalable model for developers of shared-memory parallel applications. The API supports C/C++ and FORTRAN on multiple architectures, including UNIX and Windows. Previously, writing a shared-memory parallel program required the use of vendor-specific constructs, which raised a lot of portability issues; this problem was solved by OpenMP [www.OpenMP.org]. The OpenMP API consists of a set of compiler directives for expressing parallelism, work sharing, data environment and synchronization. These directives are added to an existing serial program in such a way that they can be safely discarded by compilers which do not understand the API, so OpenMP extends the base program without requiring any change to it. It supports incremental parallelism and unified code for both serial and parallel applications, and it supports both coarse-grained and fine-grained parallelism.
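A minimal illustration of this directive style: the same loop compiles and runs serially when the compiler ignores the pragma, and its iterations are shared among threads when built with OpenMP support (e.g., -fopenmp); the loop itself is only a placeholder computation.

// The same loop compiles and runs serially if the compiler does not know
// OpenMP; with OpenMP enabled, the iterations are shared among the threads.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> x(1000000, 1.0);
    double sum = 0.0;

    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(x.size()); ++i)
        sum += x[i] * x[i];

    std::printf("%f\n", sum);
}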

4.1 OpenMP vs POSIX

Explicit threading methods, such as Windows threads or POSIX threads, use library calls to create, manage, and synchronize threads. Use of explicit threads requires an almost complete restructuring of the affected code. On the other hand, OpenMP is a set of pragmas, API functions, and environment variables that enable threads to be incorporated into applications at a relatively high level. The OpenMP pragmas are used to denote regions in the code that can be run concurrently. An OpenMP-compliant compiler transforms the code and inserts the proper function calls to execute these regions in parallel. In most cases, the serial logic of the original code can be preserved and is easily recovered by ignoring the OpenMP pragmas at compilation time.

4.2 OpenMP vs MPI

In the past, OpenMP has been confined to Symmetric Multi-Processing (SMP) machines and teamed with Message Passing Interface (MPI) technology to make use of multiple SMP systems. Most parallel data clustering approaches target distributed memory multiprocessors


and their implementation is based on message passing. The message passing would require significant programming effort. A new system, Cluster OpenMP, is an implementation of OpenMP that can make use of multiple SMP machines without resorting to MPI. This advance has the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP is maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary [OpenMp, 2006].

4.3 OpenMP vs Traditional Parallel Programming

OpenMP is a set of extensions that makes it easy for programmers to take full advantage of a system. It has been possible to write parallel programs for a long time. Historically, this has been done by forking the main program into multiple processes or multiple threads manually. This strategy has two major drawbacks. First, spawning processes is extremely platform dependent. Second, it creates a lot of overhead for both the CPU and the programmer, as it can be quite complicated to keep track of what is going on in all of the threads. OpenMP takes most of this work out of the programmer's hands. Most importantly, OpenMP makes it much easier to parallelize computationally intensive mathematical calculations.

4.4 Our Contribution

Previously, [Hadjidoukas, 2008] reported that OpenMP provides a means of transparent management of the asymmetry and non-determinism in CURE (clustering using representatives). This paper aims to develop efficient parallelized clustering algorithms, such as k-means and max-min, that target shared-memory multi-core processors by employing the mmap() facility under popular operating systems such as Windows XP and Linux. This work mainly focuses on the shared-memory architecture under an OpenMP environment that supports multiple levels of parallelism. Thus, we are able to satisfy the need for nested parallelism exploitation in order to achieve load balancing. Our experimental results demonstrate significant performance gains in the parallelized versions of the above algorithms. These experiments were carried out with both synthetic and real data.
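The following simplified sketch combines the two ingredients, showing the assignment step of k-means over an mmap()-ed binary data set parallelized with an OpenMP directive; the dimensionality, number of clusters, file name and layout are illustrative, and centroid initialization, the update step and the convergence loop are omitted.

// Assignment step of k-means over a memory-mapped data set: the OpenMP
// directive shares the records among threads, and each record is assigned
// to the nearest centroid.
#include <cfloat>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

int main() {
    const int D = 10, K = 2;                    // illustrative dimensionality / clusters
    int fd = open("dataset.bin", O_RDONLY);     // assumed binary file of float records
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const float* data = static_cast<const float*>(p);
    long n = st.st_size / (D * sizeof(float));  // number of records

    std::vector<float> centroid(K * D, 0.0f);   // initialized elsewhere in a full run
    std::vector<int> label(n);

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const float* rec = data + i * D;
        float best = FLT_MAX;
        for (int k = 0; k < K; ++k) {
            float dist = 0.0f;
            for (int d = 0; d < D; ++d) {
                float diff = rec[d] - centroid[k * D + d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; label[i] = k; }
        }
    }
    std::printf("assigned %ld records\n", n);
    munmap(p, st.st_size);
    close(fd);
}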

5 Experimental Set-Up

In this study, a randomly generated data set and the Poker hand data set [cattral and Oppacher, 2007] are used with the selected algorithms, k-means and max-min. The random data set is generated to have 10 million records with a dimensionality of 1026. The Poker hand data set has 1 million records with ten attributes (dimensions). It is in ASCII format with comma-separated values, which is converted to binary format before applying our algorithms. Thus, experiments are carried out with the converted binary data.

The k-means and max-min algorithms are tested with a file size of 2 GB. The computational time requirements of these algorithms with fread() and with mmap(), and of the versions parallelized with OpenMP, are observed with both data sets under various conditions. An Intel Pentium dual-core 2.80 GHz processor with 1 GB RAM and 1 MB cache memory is used in our study. Fedora 9 Linux (kernel 2.6.25-14, red hat version 6.0.52) equipped with GNU C++ (gcc version 4.3), and Windows XP with VC++ 2008,


environments are installed on a machine with a dual-boot option to study the performance of the parallelized DM algorithms with OpenMP and mmap() under the same hardware setup.

These experiments were carried out in order to check the performance of the parallelized algorithms against the sequential algorithms with varying dimensionality; Figures 1 to 3 demonstrate our observations. From here onwards, algorithms implemented with fread() and OpenMP are termed FODM, and algorithms implemented with mmap() and OpenMP are termed MODM. It can be observed that parallelized FODM and MODM consistently take less time than the sequential algorithms, and that MODM algorithms are more scalable than FODM. Our observations also show that the parallelized DM algorithms give better performance under Linux than under Windows XP.

Fig. 1: k-means algorithm with random data for 10 million records and 2 clusters under Linux (clock ticks vs. dimensionality, 10-20, for serial and parallel fread() and mmap() versions; N = 10000000, K = 2).

Fig. 2: k-means algorithm with Poker hand data for 1 million records and 10 clusters under Linux (clock ticks vs. dimensionality, 2-10, for serial and parallel fread() versions; N = 1000000, K = 10).

Fig. 3: k-means algorithm with random data for 10 million records and 10 clusters under Windows XP (clock ticks vs. dimensionality, 2-10, for serial and parallel fread() and mmap() versions; N = 10000000, K = 10).


To see the benefit of the parallelized DM algorithms over the sequential algorithms with a varying number of samples, further experiments were conducted; the results are presented in Figures 4 and 5. The advantage of MODM and FODM over the sequential algorithms can be observed. It is also seen that parallelized MODM gives better performance than parallelized FODM irrespective of the data set size.

Fig. 4: k-means algorithm with random data for dimensionality 10 and 2 clusters under Linux (clock ticks vs. number of records, 1-10 million, for serial and parallel fread() and mmap() versions; D = 10, K = 2).

Fig. 5: k-means algorithm with random data for dimensionality 10 and 2 clusters under Windows XP (clock ticks vs. number of records, 1-10 million, for serial and parallel fread() and mmap() versions; D = 10, K = 2).

Fig. 6: k-means algorithm with random data for 10 million records and dimensionality 10 under Linux (clock ticks vs. number of clusters, 2-10, for serial and parallel fread() versions; D = 10, N = 10000000).

Experiments are also carried out to verify the performance of the parallelized DM algorithms against the sequential DM algorithms with a varying number of clusters. The observations from Figures 6 and 7 show that the parallelized DM algorithms have an advantage over the sequential ones. It is also


observed that the parallelized DM algorithm with mmap() shows a greater benefit than the parallelized DM algorithm with fread(), independent of the number of samples.

Figures 8 and 9 demonstrate that the benefit of mmap() is greater on a dual-core than on a single-core processor. In all these experiments, N denotes the number of records, D the dimensionality of the data set, and K the number of clusters.

Fig. 7: k-means algorithm with random data for 10 million records and dimensionality 10 under Windows XP (clock ticks vs. number of clusters, 2-10, for serial and parallel fread() and mmap() versions; N = 10000000, D = 10).

Fig. 8: k-means algorithm with Poker hand data for dimensionality 10 and 2 clusters under Linux (% benefit of mmap() in clock ticks vs. number of records, 0.3-1 million, on single core and dual core; D = 10, K = 2).

Fig. 9: k-means algorithm with Poker hand data for dimensionality 10 under Linux (% benefit of mmap() in clock ticks vs. number of clusters, 2-10, on single core and dual core).


6 Conclusion

The parallelization of the k-means and max-min algorithms with OpenMP is studied on selected operating systems. Experiments show that parallelized DM algorithms are more scalable than sequential algorithms on personal computers as well. They also show that parallelized algorithms with mmap() are more scalable than parallelized algorithms with fread(), irrespective of the number of samples, dimensions and clusters. Our observations also reveal that the computational benefit of mmap() over fread()-based algorithms is independent of the number of dimensions, samples and clusters. The advantage of mmap() is higher on dual-core than on single-core processors.

References

[1] [Bueherg, 2006] Gregery Bueherg, "Towards Data mining an enlights, Architectures", SIAM Conference on Data Mining, April 2006.

[2] [Carig and Leroux] Robert Carig and Paul N. Leroux, "Leveraging Multi-Core Processors for High-Performance Embedded Systems", www.qnx.com.

[3] [cattral and Oppacher, 2007] Robert Cattral and Franz Oppacher, Carleton University, Department of Computer Science, Intelligent Systems Research Unit, Canada, http://archive.ics.uci.edu/ml/datasets/Poker+Hand

[4] [chen et al., 2002] Yen-Yu Chen, Dingquing Gasu and Torsten Suel, "I/O Efficient Techniques for Computing PageRank", Department of Computer and Information Science, Polytechnic University, Brooklyn, Technical Report CIS-2002-03.

[5] [Gray and More, 2004] A. Gray and A. More, "Data Structures for Fast Statistics", International Conference on Machine Learning, Alberta, Canada, July 2004.

[6] [Hadjidoukas, 2008] Panagiotis E. Hadjidoukas and Laurent Amsaleg, "Parallelization of a Hierarchical Data Clustering Algorithm using OpenMP", Department of Computer Science, University of Ioannina, Ioannina, Greece, pages 289-299, 2008.

[7] [Islam, 2003] Tuba Islam, "An Unsupervised Approach for Automatic Language Identification", Master's Thesis, Bogazici University, Istanbul, Turkey, 2003.

[8] [Olson, 1995] C.F. Olson, "Parallel Algorithms for Hierarchical Clustering", Parallel Computing, pages 1313-1325, 1995.

[9] [OpenMp, 2006] "Processors White Papers: Extending OpenMP to Clusters", Intel, May 2006.

[10] [palmerini, 2001] Paolo Palmerini, "Design of Efficient Input/Output Intensive Data Mining Applications", ERCIM News, No. 44, January 2001.

[11] [parallel computing] "Introduction to Parallel Computing", https://computing.llnl.gov/tutorials/parallel_comp/

[12] [Stoffel et al., 1999] Kilian Stoffel and Abdelkader Belkoniene, "Parallel k-means Clustering for Large Data Sets", Proceedings of EuroPar, 1999.

[13] [Tirumala Rao et al., 2008] S.N. Tirumala Rao, E.V. Prasad, N.B. Venkateswarlu and B.G. Reddy, "Significant Performance Evaluation of Memory Mapped Files with Clustering Algorithms", IADIS International Conference on Applied Computing, Portugal, pages 455-460, 2008.

[14] [Venkateswarlu et al., 1995] N.B. Venkateswarlu, M.B. Al-Daoud and S.A. Roberts, "Fast k-means Clustering Algorithms", University of Leeds, School of Computer Studies Research Report Series, Report 95.18.

[15] [www.isi.edu] "Optimized Performance Analysis of Apache-1.0.5 Server", www.isi.edu.

[16] [www.OpenMP.org] OpenMP Architecture Review Board, OpenMP Specifications, available at http://www.openmp.org.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Steganography Based Embedded System used for Bank Locker System: A Security Approach

J.R. Surywanshi (G.H. Raisoni College of Engineering, Nagpur)    K.N. Hande (G.H. Raisoni College of Engineering, Nagpur)

Abstract

Steganography literally means "covered message" and involves transmitting secret messages through seemingly innocuous files. The goal is not only that the message remains hidden, but also that the very fact that a hidden message was sent goes undetected. In this project we apply this concept to a hardware-based application. We developed an embedded system that automatically opens and closes a bank locker. There is no role for a key in opening and closing the locker; instead of a key, security is provided through steganography. The secret code is provided to the bank locker through a simple mobile device. This is a totally wireless application; a Bluetooth device is used for sending and receiving the signals. Operating a locker through mobile signals alone is a new achievement. The bank locker owner can use his locker without a time-consuming process, and any kind of misuse of the locker is avoided.

1 Introduction

Everyone knows the general process of a bank locker system: it is a totally manual process. If we modernize this process with the help of steganography, we obtain a new system that provides tight security. In steganographic communication, senders and receivers agree on a steganographic system and a shared secret key that determines how a message is encoded in the cover medium. To send a hidden message, for example, Alice creates a new image with a digital camera. Alice supplies the steganographic system with her shared secret key and her message. The steganographic system uses the shared secret key to determine how the hidden message should be encoded in the redundant bits. The result is a stego image that Alice sends to Bob. When Bob receives the image, he uses the shared secret and the agreed-upon steganographic system to retrieve the hidden message. Figure [1] shows an overview of the encoding step.

As mentioned above, the roles of Alice and Bob are performed by the locker owner and the bank's higher authority, who has the right to provide the security policy to the bank's locker owners. We use here a simple mobile device to provide the stego message for locking and unlocking the bank locker, without any key.

The locker automatically turns on and off without the use of an actual key; it is an automatically operating system. A microcontroller is used to provide the signals to the DC motor, and the DC motor changes its position as required for opening and closing the lock. A Bluetooth device handles this whole process very well: it receives and sends the signals from


the PC to the mobile device and vice versa. The software is developed such that if anyone knows the image and tries to guess the secret code, only three attempts are allowed; exceeding these attempts automatically deactivates the locker system.

Fig. 1: Modern steganographic communication. The encoding step of a steganographic system identifies redundant bits and then replaces a subset of them with data from a secret message

Provision is also made for the case where someone steals the mobile and tries to access the locker system: every entry related to accessing the locker system automatically updates the bank server's data, so the person will not be able to deny it. The working process of this project is very simple. At design time, the bank authority provides a simple image to the locker owner, and the locker owner tells his secret code to the bank authority. The secret code may be the signature or any other identity of that person. This secret code is then embedded steganographically into the given image.

This image is stored on the bank's server and is also transferred to the locker owner's personal mobile. During the stego process, the pairing address of the mobile is also inserted. This provides security at the device level.

In short, three kinds of security are provided:

• Image based

• Personal secret Code

• Device based (Pairing address of mobile)

This means that at decoding time the process checks for the same image, the same secret code, and the same mobile device. This is the complete design-level process, which has to be performed when a new owner wants to open his new locker. After this setup, the locker owner can operate his locker without needing the bank authority staff. For example, when he or she wants to open the locker, he or she goes to the bank and simply sends the signals to the locker through his or her mobile device. This is a totally wireless application, and Bluetooth performs this activity. The signals are first checked


by the computer: the PC checks all three security levels. If all the information is correct, it sends the signals to the microcontroller, the microcontroller signals the DC motor, and the DC motor locks or unlocks the system. After the work is completed, the locker user again transfers the signals to the PC, and the bank server transfers the corresponding signals to the microcontroller. Figure [2] shows the signals transferred by the user for operating his locker.
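The following schematic sketch outlines the three checks the PC could perform before signalling the microcontroller; the stored values, the way the hidden code is extracted, and the three-attempt counter are placeholders for illustration, not the actual implementation.

// Schematic check of the three security levels before the PC signals the
// microcontroller: the stego image, the extracted secret code, and the
// pairing address of the sending mobile device. All values are placeholders.
#include <iostream>
#include <string>

struct UnlockRequest {
    std::string imageHash;       // hash of the received stego image
    std::string extractedCode;   // secret code recovered by the decoder
    std::string pairingAddress;  // Bluetooth pairing address of the mobile
};

bool authorize(const UnlockRequest& req, int& failedAttempts) {
    // Values registered for this locker when it was set up (placeholders).
    const std::string storedImageHash = "IMG_HASH";
    const std::string storedCode = "SECRET_CODE";
    const std::string storedAddress = "00:11:22:33:44:55";

    bool ok = req.imageHash == storedImageHash &&
              req.extractedCode == storedCode &&
              req.pairingAddress == storedAddress;
    if (!ok && ++failedAttempts >= 3)
        std::cout << "locker deactivated after three failed attempts\n";
    return ok;
}

int main() {
    int failed = 0;
    UnlockRequest req{"IMG_HASH", "SECRET_CODE", "00:11:22:33:44:55"};
    std::cout << (authorize(req, failed) ? "signal motor to unlock" : "reject") << '\n';
}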

2 System Overview

This project is developed in two phases:

• Software phase

• Hardware phase

2.1 Software Development Phase

First we develop the software phase. For this we consider a single image, which can be any JPEG image. This image is openly available to everyone who uses the mobile, but nobody else has the secret code, so there is no harm if the mobile is left anywhere.

Different algorithms are available for steganography:

• Discrete Cosine Transform

• Sequential

• Pseudo random

• Subtraction

• Statistic – aware embedding

We use the Discrete Cosine Transform (DCT) here.

Fig. 2: The complete process, in which the computer embeds (stego) the secret code within the image and transfers it to the mobile device


2.1.1 DCT Based Information Hiding Process

Transform coding is simply the compression of images in the frequency domain, and it constitutes an integral component of contemporary image processing applications. Transform coding relies on the premise that pixels in an image exhibit a certain level of correlation with their neighboring pixels. A transformation is therefore defined to map the spatial (correlated) data into transformed (uncorrelated) coefficients. The transformation exploits the fact that the information content of an individual pixel is relatively small, i.e., to a large extent the visual contribution of a pixel can be predicted from its neighbors. The Discrete Cosine Transform (DCT) is an example of transform coding. JPEG is an image compression standard proposed by the Joint Photographic Experts Group. JPEG transforms information from the color domain into the frequency domain by applying the DCT. The image is divided into blocks of 8×8 pixels, each of which is transformed into the frequency domain. Each block of an image is represented by 64 components, called DCT coefficients. The global and important information of an image block is carried by the lower DCT coefficients, while detailed information is carried by the upper coefficients; compression of an image is achieved by omitting the upper coefficients. The following equation is used for quantization, with the result rounded to the nearest integer.

    c_q = round(c_i / q)                                                   (1)

where

c_i is the original transform coefficient (real number)

q is the quantization factor (integer between 1 and 255)

The reverse process, dequantization of the quantized coefficients, is carried out with the following formula:

    c_i' = c_q × q                                                         (2)

The algorithm used for this overall approach is given here. For each color component, the JPEG image format uses a discrete cosine transform (DCT) to transform successive 8×8 pixel blocks of the image into 64 DCT coefficients each. The DCT coefficients F(u,v) of an 8×8 block of image pixels f(x,y) are given by:

    F(u,v) = (1/4) C(u) C(v) Σ_{x=0}^{7} Σ_{y=0}^{7} f(x,y) cos[(2x+1)uπ/16] cos[(2y+1)vπ/16]     (3)

where C(x) = 1/√2 when x equals 0 and C(x) = 1 otherwise. Afterwards, the following operation quantizes the coefficients:

    F^Q(u,v) = round( F(u,v) / Q(u,v) )                                    (4)


where Q(u,v) is a 64-element quantization table. We can use the least-significant bits of the quantized DCT coefficients as redundant bits in which to embed the hidden message. The modification of a single DCT coefficient affects all 64 image pixels. In some image formats (such as GIF), an image's visual structure exists to some degree in all of the image's bit layers, and steganographic systems that modify the least-significant bits of such formats are often susceptible to visual attacks. This is not true for JPEGs: the modifications are made in the frequency domain instead of the spatial domain, so there are no visual attacks against the JPEG format.

Input: message, cover image
Output: stego image
while data left to embed do
    get next DCT coefficient from cover image
    if DCT ≠ 0 and DCT ≠ 1 then
        get next LSB from message
        replace DCT LSB with message LSB
    end if
    insert DCT into stego image
end while

Fig. 3. The JSteg algorithm. As it runs, the algorithm sequentially replaces the least-significant bit of discrete cosine transform (DCT) coefficients with message data. It does not require a shared secret.
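A compact Python sketch of the JSteg rule in Fig. 3 is given below. It is illustrative only and operates on a plain list of quantized DCT coefficients; obtaining those coefficients from an actual JPEG codec is assumed and not shown.

def jsteg_embed(coeffs, message_bits):
    """Replace the LSB of every coefficient that is neither 0 nor 1 with a message bit."""
    bits = iter(message_bits)
    stego = []
    for c in coeffs:
        if c not in (0, 1):
            try:
                b = next(bits)
                c = (c & ~1) | b          # works for negative ints too (two's-complement semantics)
            except StopIteration:
                pass                      # message exhausted; copy remaining coefficients unchanged
        stego.append(c)
    return stego

def jsteg_extract(coeffs, n_bits):
    """Read back the LSBs of coefficients that are neither 0 nor 1."""
    return [c & 1 for c in coeffs if c not in (0, 1)][:n_bits]

stego = jsteg_embed([12, -3, 0, 1, 7, -9, 4], [1, 0, 1, 1])
print(jsteg_extract(stego, 4))            # -> [1, 0, 1, 1]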

Figure [4] shows two images with a resolution of 640×480 in 24-bit color. The uncompressed original image is almost 1.2 Mbytes (the two JPEG images shown are about 0.3 Mbytes each). Figure [4a] is unmodified; Figure [4b] contains the first chapter of Lewis Carroll's The Hunting of the Snark. After compression, the chapter is about 15 Kbytes. The human eye cannot detect which image holds steganographic content.

Embedding Process

• Compute the DCT coefficients for each 8x8 block

• Quantize the DCT coefficients by standard JPEG quantization table

• Modify the coefficients according to the bit to hide

• If bit=1, all coefficients are modified to odd numbers

• If bit=0, all coefficients are modified to even numbers

• All coefficients quantized to 0 remain intact

• Inverse quantization

• Inverse DCT

Extracting Process

• Compute the DCT coefficients for each 8x8 block

• Quantize the DCT coefficients by standard JPEG quantization table

• Count the numbers of coefficients quantized to odd and even

• If odd coefficients are more, then bit = 1

• If even coefficients are more, then bit = 0
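The embedding and extracting procedures above can be sketched as follows. The sketch works on one block of already-quantized DCT coefficients and hides a single bit per block; the JPEG transform and quantization steps are assumed to be handled elsewhere, and the sample block is made up for illustration.

import numpy as np

def embed_bit(qblock, bit):
    q = qblock.copy()
    for idx, c in np.ndenumerate(q):
        if c != 0 and (c % 2 != 0) != bool(bit):     # non-zero coefficient with the wrong parity
            q[idx] += 1 if c > 0 else -1             # shift away from zero to the requested parity
    return q

def extract_bit(qblock):
    nz = qblock[qblock != 0]
    odd = np.sum(nz % 2 != 0)
    return int(odd > len(nz) - odd)                  # majority vote over non-zero coefficients

q = np.array([[12, -3, 0, 5], [0, 2, -1, 0], [7, 0, 0, 1], [0, 0, 4, -6]])
print(extract_bit(embed_bit(q, 1)), extract_bit(embed_bit(q, 0)))   # -> 1 0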


Fig. 4: Embedded information in a JPEG. (a) The unmodified original picture; (b) the picture with the first chapter of The Hunting of the Snark embedded in it.

2.2 Hardware Development

In the hardware phase we designed a microcontroller kit containing a PIC microcontroller and a DC motor interface. The motor rotates in the clockwise and anticlockwise directions to open or close the bank locker. The motor supports run, stop, accelerate and decelerate operations, as described in the block descriptions below.

2.2.1 Block Descriptions

PC: This block is the only point where the system accepts user input. It is a Windows-based software application that can run on any Windows PC or laptop. Here the user is able to manipulate the various functions of the motor (run, stop, accelerate and decelerate) using easy-to-learn onscreen controls.

USB Bluetooth Adapter: To bridge the connection between the PIC and the PC, Bluetooth modules are connected on both sides. The PC side is implemented using this common adapter, which lets a Bluetooth connection be made as a serial link. The USB adapter can be installed easily in Windows, just as any other USB device would be. The signals are then received and manipulated by the motor control software.

Bluetooth Adapter: This is the other end of the Bluetooth wireless connection; it is the module that receives the wireless signals from the USB transmitter and sends them to the control unit.

Control Unit: The control unit consists of the PIC microcontroller and the pulse width modulator. The PIC will be used to receive feedback from the motor to determine speed and adjust the signal accordingly. The PWM will be used to control the duty cycle of the motor by regulating the power output.

Step-down DC to DC Converter: The step-down converter is used to supply lower voltages than are available from the voltage source. In this case, we are using a 12 V battery as our power supply, so the step-down converter can output voltages from 0 to 12 V, depending on the duty ratio designated by the control unit. This varies the output voltage to the motor and thus the speed at which the motor runs. The module is designed using several resistors, an inductor, a capacitor, a diode, and a MOSFET transistor used as a switch.
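As a rough, idealized check of this stage (the standard buck-converter relation, ignoring switching and conduction losses), the average output voltage is the supply voltage scaled by the duty ratio D commanded by the control unit: V_out ≈ D × V_in. With the 12 V battery, D = 0.5 therefore delivers about 6 V to the motor, and D = 1 passes the full 12 V.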


Fig. 5: Complete process of transferring signals (start, stop) from the PC to the DC motor through computer commands. Wireless control is provided via Bluetooth.

12 V Battery: This is the power supply for the circuit. It will be used to power the PIC, as well as supply power to the step-down converter which will be varied from 0-12 V depending on the voltage needed for the requested speed. A 12 V lead acid battery will be used.

H-Bridge: The purpose of this unit is to allow the motor to come to a complete stop if requested, as well as change the direction the motor is running in. This receives its commands from the control unit.

DC Motor: This is a 12 V permanent magnet DC motor. It will be powered by the 12V battery through the step-down dc to dc converter and controlled via the control unit and H-Bridge.

3 Conclusion

This steganography-based embedded system allows users to operate the bank locker without any interaction with bank authority personnel and without any time-consuming process. It is a new demonstration that steganography can be used in a hardware-based application, and it has been worked out successfully in this project. In the future it could be used in general-purpose applications such as home security. Through this work we have developed a steganography-based embedded system that provides tight security for a hardware-based application.


References

[1] A. Westfeld and A. Pfitzmann, “Attacks on Steganographic Systems,” Proc. Information Hiding—3rd Int’l Workshop, Springer Verlag, 1999, pages. 61–76.

[2] B. Chen and G.W. Wornell, “Quantization Index Modulation: A Class of Provably Good Methods for Digital Watermarking and Information Embedding,” IEEE Trans. Information Theory, vol. 47, no. 4, 2001, pages. 1423–1443

[3] F.A.P. Petitcolas, R.J. Anderson, and M.G. Kuhn, “Information Hiding—A Survey,” Proc. IEEE, vol. 87, no. 7,1999, pages. 1062–1078.

[4] Farid, “Detecting Hidden Messages Using Higher- Order Statistical Models,” Proc. Int’l Conf. Image Processing, IEEE Press, 2002.

[5] J. Fridrich and M. Goljan, “Practical Steganalysis—State of the Art,” Proc. SPIE Photonics Imaging 2002, Security and Watermarking of Multimedia Contents, vol. 4675, SPIE Press, 2002, pages. 1–13.

[6] N.F. Johnson and S. Jajodia, “Exploring Steganography: Seeing the Unseen,” Computer, vol. 31, no. 2, 1998, pages. 26–34.

[7] N.F. Johnson and S. Jajodia, “Steganalysis of Images Created Using Current Steganographic Software,” Proc. 2nd Int’l Workshop in Information Hiding, Springer-Verlag, 1998, pages. 273–289.

[8] R.J. Anderson and F.A.P. Petitcolas, “On the Limits of Steganography,” J. Selected Areas in Comm., vol. 16, no. 4, 1998, pages. 474–481.

[9] Wireless Bluetooth Controller for DC Motor ECE 445Project Proposal February 5, 2007.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Audio Data Mining Using Multi-Perceptron

Artificial Neural Network

A.R. Ebhendra Pagoti DIT, GITAM University, Rushikonda, Visakhapatnam-530046, Andhra Pradesh, India

Mohammed Abdul Khaliq DIT, GITAM University, Rushikonda, Visakhapatnam-530046, Andhra Pradesh, India

Praveen Dasari DIT, GITAM University, Rushikonda, Visakhapatnam-530046, Andhra Pradesh, India

Abstract

Data mining is the activity of analyzing a given set of data; it is the process of finding patterns in large relational databases. Data mining includes extracting, transforming, and loading transaction data onto the data warehouse system, storing and managing the data in a multidimensional database system, providing data access, analyzing the data with application software, and presenting it visually. Audio data contains information about each audio file, such as signal processing components (power spectrum, cepstral values) that are representative of that particular file. The relationships among patterns provide information, which can be converted into knowledge about historical patterns and future trends. This work implements an artificial neural network (ANN) approach for audio data mining. Acquired audio is preprocessed to remove noise, followed by feature extraction using the cepstral method. The ANN is trained with the cepstral values to produce a set of final weights. During the testing process (audio mining), these weights are used to mine the audio file. In this work, 50 audio files have been used as an initial attempt to train the ANN. The ANN is able to produce only about 90% mining accuracy due to the low correlation of the audio data.

Keywords: ANN, Backpropagation Algorithm, Cepstrum, Feature Extraction, FFT, LPC, Perceptron, Testing, Training, Weights.

1 Introduction

Data mining is concerned with meaningfully discovering patterns from data. It has deep roots in the fields of statistics, artificial intelligence, and machine learning. With the advent of inexpensive storage space and faster processing over the past decade, research has started to penetrate new grounds in areas of speech and audio processing as well as spoken language dialog. It has gained interest because audio data are available in plenty. Algorithmic advances in automatic speech recognition have also been a major enabling technology behind the growth in data mining. Currently, large-vocabulary, continuous speech recognizers are trained on record amounts of data, such as several hundreds of millions of words and thousands of hours of speech. Pioneering research in robust speech processing, large-scale discriminative training, finite state automata, and statistical hidden Markov modeling has resulted in real-time recognizers that are able to transcribe spontaneous


speech. The technology is now highly attractive for a variety of speech mining applications. Audio mining research includes many ways of applying machine learning, speech processing, and language processing algorithms [1]. It helps in the areas of prediction, search, explanation, learning, and language understanding. These basic challenges are becoming increasingly important in revolutionizing business processes by providing essential sales and marketing information about services, customers, and product offerings. A new class of learning systems can be created that infer knowledge and trends automatically from data, analyze and report application performance, and adapt and improve over time with minimal or zero human involvement. Effective techniques for mining speech, audio, and dialog data can impact numerous business and government applications. The technology for monitoring conversational audio to discover patterns, capture useful trends, and generate alarms is essential for intelligence and law enforcement organizations as well as for enhancing call center operation. It is useful for analyzing, monitoring, and tracking customer preferences and interactions to better establish customized sales and technical support strategies. It is also an essential tool in media content management for searching through large volumes of audio warehouses to find information, documents, and news.

2 Technical Work Preparation

2.1 Problem Statement

Audio files are to be mined properly with high accuracy given partial audio information. This can be achieved effectively using an ANN. This work implements the supervised backpropagation algorithm (BPA). The BPA is trained with the features of the audio data for different numbers of nodes in the hidden layer. The layer with the optimal number of nodes has to be chosen for proper audio mining.

2.2 Overview of Audio Mining

Audio recognition is a classic example of a task that the human brain does well but digital computers do poorly. Digital computers can store and recall vast amounts of data, perform mathematical calculations at blazing speeds, and do repetitive tasks without becoming bored or inefficient, yet they perform very poorly when faced with raw sensory data. Teaching a computer to understand audio is a major undertaking. Digital signal processing generally approaches the problem of audio recognition in two steps: 1) feature extraction and 2) feature matching. Each word in the incoming audio signal is isolated and then analyzed to identify the type of excitation and the resonant frequencies [2]. These parameters are then compared with previous examples of spoken words to identify the closest match. Often, such systems are limited to a few hundred words, can only accept signals with distinct pauses between words, and must be retrained. While this is adequate for many commercial applications, these limitations are humbling when compared to the abilities of human hearing. There are two main approaches to audio mining. 1. Text-based indexing: Text-based indexing, also known as large-vocabulary continuous speech recognition, converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries. 2. Phoneme-based indexing: Phoneme-based indexing does not convert speech to text but instead works only with sounds. The system first analyzes and identifies sounds in a piece of audio content to create a phonetic-based index. It then uses a dictionary of several dozen


phonemes to convert a user’s search term to the correct phoneme string. (Phonemes are the smallest unit of speech in a language, such as the long “a” sound that distinguishes one utterance from another. All words are sets of phonemes). Finally, the system looks for the search terms in the index. A phonetic system requires a more proprietary search tool because it must phoneticize the query term, and then try to match it with the existing phonetic string output. Although audio mining developers have overcome numerous challenges, several important hurdles remain. Precision is improving but it is still a key issue impeding the technology’s widespread adoption, particularly in such accuracy-critical applications as court reporting and medical dictation. Audio mining error rates vary widely depending on factors such as background noise and cross talk. Processing conversational speech can be particularly difficult because of such factors as overlapping words and background noise [3][4]. Breakthroughs in natural language understanding will eventually lead to big improvements. The problem of audio mining is an area with many different applications. Audio identification techniques include Channel vocoder, linear prediction, Formant vocoding, Cepstral analysis. There are many current and future applications for audio mining. Examples include telephone speech recognition systems, or voice dialers on car phones.

2.3 Schematic Diagram

The sequence of audio mining is shown schematically below.

Fig. 1: Sequence of audio processing

2.4 Artificial Neural Network

A neural network is constructed by highly interconnected processing units (nodes or neurons) which perform simple mathematical operations [5]. Neural networks are characterized by their topologies, weight vectors and activation function which are used in the hidden layers and output layer [6]. The topology refers to the number of hidden layers and connection between nodes in the hidden layers. The activation functions that can be used are sigmoid, hyperbolic tangent and sine [7]. A very good account of neural networks can be found in [11]. The network models can be static or dynamic [8]. Static networks include single layer


perceptrons and multilayer perceptrons. A perceptron, or adaptive linear element (ADALINE) [9], refers to a computing unit; this forms the basic building block for neural networks. The input to a perceptron is the weighted summation of the input pattern vector. In Figure 2, the basic function of a single-layer perceptron is shown.

Fig. 2: Operation of a neuron

In Figure 3, a multilayer perceptron is shown schematically. Information flows in a feed-forward manner from input layer to the output layer through hidden layers. The number of nodes in the input layer and output layer is fixed. It depends upon the number of input variables and the number of output variables in a pattern. In this work, there are six input variables and one output variable. The number of nodes in a hidden layer and the number of hidden layers are variable. Depending upon the type of application, the network parameters such as the number of nodes in the hidden layers and the number of hidden layers are found by trial and error method.

Fig. 3: Multilayer Perceptron

In most applications one hidden layer is sufficient. The activation function used to train the ANN is the sigmoid function, given by:

f(x) = 1 / (1 + exp(-x)), where f(x) is a non-linear differentiable function,        (1)

x = Σ_{i=1}^{N_n} W_ij(p) x_i^n(p) + Θ(p), where N_n is the total number of nodes in the nth layer,

W_ij is the weight vector connecting the ith neuron of a layer with the jth neuron in the next layer, Θ is the threshold applied to the nodes in the hidden layers and output layer, and p is the


pattern number. In the first hidden layer, x_i is treated as an input pattern vector, and for the successive layers, x_i is the output of the ith neuron of the preceding layer. The output x_i of a neuron in the hidden layers and in the output layer is calculated by:

x_i^(n+1)(p) = 1 / (1 + exp(-(x + Θ(p))))        (2)

For each pattern, the error E(p) in the output layer is calculated by:

E(p) = (1/2) Σ_{i=1}^{N_M} (d_i(p) - x_i^M(p))^2        (3)

where M is the total number of layers, which includes the input layer and the output layer, N_M is the number of nodes in the output layer, d_i(p) is the desired output of a pattern, and x_i^M(p) is the calculated output of the network for the same pattern at the output layer. The total error E over all patterns is calculated by:

E = Σ_{p=1}^{L} E(p), where L is the total number of patterns.        (4)
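A minimal sketch of this training loop is shown below. It is an illustration, not the authors' code: the cepstral feature matrix and target labels are random placeholders, while the 6 × 6 × 1 topology, the sigmoid activation, the 7350-iteration budget and the 0.0125 mean-squared-error target are taken from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.random((50, 6))          # 50 patterns, 6 cepstral features F1..F6 (placeholder data)
d = rng.random((50, 1))          # target labels in (0, 1) (placeholder data)

W1 = rng.normal(scale=0.5, size=(6, 6))   # input -> hidden weights
b1 = np.zeros(6)
W2 = rng.normal(scale=0.5, size=(6, 1))   # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5                                 # learning rate

for epoch in range(7350):                 # iteration count reported in the paper
    h = sigmoid(X @ W1 + b1)              # hidden-layer outputs, Eq. (2)
    y = sigmoid(h @ W2 + b2)              # network outputs
    err = d - y
    mse = np.mean(err ** 2)               # error measure, Eqs. (3)-(4)
    delta2 = err * y * (1 - y)            # backpropagated output-layer error
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    W2 += eta * h.T @ delta2 / len(X)
    b2 += eta * delta2.mean(axis=0)
    W1 += eta * X.T @ delta1 / len(X)
    b1 += eta * delta1.mean(axis=0)
    if mse < 0.0125:                      # stopping criterion mentioned in the paper
        break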

2.5 Implementation

The flowchart in Fig. 4 explains the sequence of implementation of audio mining. Fifty audio files were chosen, and the feature extraction procedure is applied after pre-emphasizing and windowing.

Pre-emphasizing and windowing. Audio is intrinsically a highly non-stationary signal. Signal analysis, whether FFT-based or based on Linear Predictor Coefficients (LPC), must be carried out on short segments across which the audio signal is assumed to be stationary. The feature extraction is performed on 20 to 30 ms windows with a 10 to 15 ms shift between two consecutive windows. To avoid problems due to the truncation of the signal, a weighting window with appropriate spectral properties must be applied to the analyzed chunk of signal. Commonly used windows are the Hamming, Hanning and Blackman windows.

Normalization. Feature normalization can be used to reduce the mismatch between signals recorded in different conditions. Normalization consists of mean removal and possibly variance normalization. Cepstral mean subtraction (CMS) is a good compensation technique for convolutive distortions. Variance normalization consists of normalizing the feature variance to one and, in signal recognition, helps to deal with noise and channel mismatch. Normalization can be global or local: in the first case, the mean and standard deviation are computed globally, while in the second case they are computed on a window centered on the current time.

Feature extraction. The LPC method starts with the assumption that an audio signal is produced by a buzzer at the end of a tube, with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation to the reality of signal production. LPC analyzes the signal by estimating the formants, removing their effects from the signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue. The numbers which describe the intensity and frequency of the buzz, the formants, and the residue signal can be stored or transmitted elsewhere. LPC synthesizes the signal by reversing the process: use the buzz parameters and the residue to create a source signal, use the formants to create a filter, and run the source through the filter, resulting in audio.

Steps:

1. Audio files, in mono or stereo, are recorded naturally or inside the lab, or taken from a standard database.

2. Features are extracted after removing noise, provided it is a fresh recording; for existing audio, noise removal is not required.

3. Two phases are adopted: a training phase and a testing phase.

4. Training phase: In this phase, a set of representative numbers is obtained from an initial set of numbers. The BPA is used for learning the audio files.


5. Testing phase: In this phase, the representative numbers obtained in step 4 are used along with the features obtained from a test audio file to obtain an activation value. This value is compared with a threshold, and a final decision is taken to retrieve an audio file or to take further action, which can be activating a system in a mobile phone, etc.

2.6 Results and Discussion

Cepstrum analysis is a nonlinear signal processing technique with a variety of applications in areas such as speech and image processing. The complex cepstrum of a sequence x is calculated by finding the complex natural logarithm of the Fourier transform of x and then taking the inverse Fourier transform of the resulting sequence. The complex cepstrum transformation is central to the theory and application of homomorphic systems, that is, systems that obey certain general rules of superposition. The real cepstrum of a signal x, sometimes called simply the cepstrum, is calculated by determining the natural logarithm of the magnitude of the Fourier transform of x and then obtaining the inverse Fourier transform of the resulting sequence. It is difficult to reconstruct the original sequence from its real cepstrum transformation, as the


real cepstrum is based only on the magnitude of the Fourier transform of the sequence. Table 1 gives the cepstral coefficients for 25 sample audio files. Each row is a pattern used for training the ANN with the BPA. The topology of the ANN used is 6 × 6 × 1; that is, 6 nodes in the input layer, 6 nodes in the hidden layer and 1 node in the output layer are used for proper training of the ANN, followed by audio mining.
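As an illustration of the feature extraction described above, the following sketch computes real-cepstrum coefficients frame by frame (log magnitude of the FFT followed by an inverse FFT). The frame length, shift and the choice of six retained coefficients (F1–F6) are assumptions for illustration, not values from the paper.

import numpy as np

def real_cepstrum_features(signal, frame_len=512, shift=256, n_coeffs=6):
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.fft(frame)
        log_mag = np.log(np.abs(spectrum) + 1e-12)    # avoid log(0)
        cepstrum = np.fft.ifft(log_mag).real          # real cepstrum of the frame
        feats.append(cepstrum[:n_coeffs])             # keep the first coefficients (F1..F6)
    return np.array(feats)

# example: features for one second of a synthetic 8 kHz tone
t = np.arange(8000) / 8000.0
feats = real_cepstrum_features(np.sin(2 * np.pi * 440 * t))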

Table 1 Cepstral Features Obtained from Sample Audio Files

F1–F6 are the cepstral values; more than 6 values per audio file can be chosen. Target labels should be less than 1 and greater than zero. When the number of audio files increases, more decimal places have to be incorporated.

3 Conclusion

Audio of common birds and pet animals has been recorded casually. Each audio file is suitably preprocessed, followed by cepstral analysis and training of the ANN using the BPA. A set of final weights with the 6 × 6 × 1 configuration is obtained after 7350 iterations to reach a mean squared error of 0.0125. Fifty patterns have been used for training the ANN, and thirty patterns were used for testing (audio mining). The results are close to 90% mining accuracy, as the audio was recorded in the open. The recognition and audio mining accuracy has to be tested with a large number of new audio files from the same set of birds and pet animals.

References

[1] Lie Lu and Hong-Jiang Zhang, “Content analysis for audio classification and segmentation.”, IEEE Transactions on Speech and Audio Processing, 10:504–516, October 2002.

[2] T. Tolonen and M. Karjalainen, “A computationally efficient multipitch analysis model,” IEEE Transactions on Speech and Audio Processing, Vol. 8(No. 6):708–716, November 2000.


[3] Haleh Vafaie and Kenneth De Jong, “Feature space transformation using genetic algorithms,” IEEE Intelligent Systems, 13(2):57–65, March/April 1998.

[4] Usama M. Fayyad, “Data Mining and Knowledge Discovery: Making Sense Out of Data,” IEEE Expert, October 1996, pp. 20-25.

[5] Fortuna L, Graziani S, LoPresti M and Muscato G (1992), “Improving back-propagation learning using auxiliary neural networks,” Int. J of Cont. , 55(4), pp. 793-807.

[6] Lippmann R P (1987), “An introduction to computing with neural nets,” IEEE Trans. on Acoustics, Speech and Signal Processing Magazine, V35, N4, pp. 4–22.

[7] Yao Y L and Fang X D (1993), “Assessment of chip forming patterns with tool wear progression in machining via neural networks”, Int.J. Mach. Tools & Mfg, 33 (1), pp 89 -102.

[8] Hush D R and Horne B G (1993), “Progress in supervised neural networks”, IEEE Signal Proc. Mag., pp 8-38.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

A Practical Approach for Mining Data

Regions from Web Pages

K. Sudheer Reddy, Infosys Technologies Ltd., Hyd., [email protected]

G.P.S. Varma, S.R.K.R. Engg College, [email protected]

P. Ashok Reddy, L.B.R. College of Engineering, [email protected]

Abstract

In recent years, government agencies and industrial enterprises have been using the web as their medium of publication. Hence, a large collection of documents, images, text files and other forms of data, in structured, semi-structured and unstructured forms, is available on the web. It has become increasingly difficult to identify relevant pieces of information, since pages are often cluttered with irrelevant content such as advertisements and copyright notices surrounding the main content. This paper deals with techniques that help us mine such data regions in order to extract information from them and provide value-added services. We propose an effective automatic technique to perform this task, based on three important observations about data regions on the web.

1 Introduction

Web information extraction is an important problem for information integration, because multiple web pages may present the same or similar information using completely different formats or syntaxes, which makes integration of information a challenging task. Due to the heterogeneity and lack of structure of web data, automated discovery of targeted information becomes a complex task. A typical web page consists of many blocks or areas, e.g., main content areas, navigation areas, advertisements, etc. For a particular application, only part of the information is useful, and the rest is noise. Hence it is useful to separate these areas automatically for several practical applications. Pages in data-intensive web sites are usually generated automatically from the back-end DBMS using scripts. Hence, the structured data on the web are often very important, since they represent their host page's essential information, e.g., details about the list of products and services.

In order to extract and make use of information from multiple sites to provide value-added services, one needs to semantically integrate information from multiple sources. There are several approaches to structured data extraction, which is also called wrapper generation. The first approach is to manually write an extraction program for each web site based on observed format patterns of the site. This manual approach is very labor intensive and time consuming, and thus does not scale to a large number of sites. The second approach is wrapper induction, or wrapper learning, which is currently the main technique. Wrapper learning works as follows: the user first manually labels a set of training pages; a learning system then generates rules from the training pages; the resulting rules are then applied to extract target items from web pages. These methods either require prior syntactic knowledge or substantial manual effort. Example wrapper induction systems include WIEN.


The third approach is the automatic approach. Structured data objects on the web are normally database records retrieved from underlying web databases and displayed in web pages with some fixed templates; automatic methods aim to find patterns or grammars from the web pages and then use them to extract data. Examples of automatic systems are IEPAD and ROADRUNNER.

Another problem with the existing automatic approaches is their assumption that the relevant information of a data record is contained in a contiguous segment of HTML code, which is not always true. MDR (Mining Data Records) basically exploits the regularities in the HTML tag structure directly. However, it is often very difficult to derive accurate wrappers based entirely on HTML tags. The MDR algorithm makes use of the HTML tag tree of the web page to extract data records from the page, but an incorrect tag tree may be constructed due to the misuse of HTML tags, which in turn makes it impossible to extract data records correctly. MDR has several other limitations, which will be discussed in the latter half of this paper. We propose a novel and more effective method to mine the data region in a web page automatically. The algorithm is called VSAP (Visual Structure based Analysis of web Pages). It finds the data regions formed by all types of tags using visual cues.

2 Related Work

Extracting the regularly structured data records from web pages is an important problem. So far, several attempts have been made to deal with the problem. Related work, mainly in the area of mining data records in a web page automatically, is MDR (Mining Data Records).

MDR automatically mines all data records formed by table and form related tags i.e., <TABLE>, <FORM>, <TR>, <TD>, etc. assuming that a large majority of web data records are formed by them.

The algorithm is based on two observations:

(a) A group of data records is always presented in a contiguous region of the web page and is formatted using similar HTML tags. Such a region is called a data region.

(b) The nested structure of the HTML tags in a web page usually forms a tag tree, and a set of similar data records is formed by some child sub-trees of the same parent node.

The algorithm works in three steps:

Step 1 Building the HTML tag tree by following the nested blocks of the HTML tags in the web page.

Step 2 Identifying the data regions by finding the existence of multiple similar generalized nodes of a tag node. A generalized node (or a node combination) is a collection of child nodes of a tag node, with the following two properties:

(i) All the nodes have the same parent.

(ii) The nodes are adjacent.

Then each generalized node is checked to decide whether it contains multiple records or only one record. This is done by string comparison of all possible combinations of component nodes using the normalized edit distance method. A data region is a collection of two or more generalized nodes with the following properties:


(i) The generalized nodes all have the same parent.

(ii) The generalized nodes all have the same length.

(iii) The generalized nodes are all adjacent.

(iv) The normalized edit distance (string comparison) between adjacent generalized nodes is less than a fixed threshold.

To find the relevant data region, MDR makes use of content mining.

Step 3 Identifying the data records involves finding the data records from each generalized node in a data region. All three steps of MDR have certain serious limitations, which will be discussed in the latter half of the paper.

2.1 How to Use MDR

In running MDR, we used its default settings. The MDR system was downloaded from: http://www.cs.uic.edu/~liub/WebDataExtraction/MDR-download.html

1. Click on "mdr.exe". You will get a small interface window.

2. You can type or paste a URL (including http://) or a local path into the Combo Box; the Combo Box contains a list of URLs which you have added. At the beginning it may be empty.

3. If you are interested in extracting tables (or with rows and columns of data), Click on "Extract" in the Table section.

4. If you are interested in extracting other types of data records, click on "Extract" in the "Data Records (other types)" section. We separate the two functions for efficiency reasons.

5. After the execution, the output file will be displayed in an IE window. The extracted tables or data regions and data records are there.

Options: Only show the data regions with the "$" sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, MDR only outputs those data regions in which the data records are merchandise (here we assume every piece of merchandise has a price with a "$" sign). In this way, some data regions that also contain regular-pattern data records will not be displayed.

3 The Proposed Technique

We propose a novel and more effective method to mine the data region in a web page automatically. The algorithm is called VSAP (Visual Structure based Analysis of web Pages). The visual information (i.e., the locations on the screen at which tags are rendered) helps the system in three ways:

a) It enables the system to identify gaps that separate records, which helps to segment data records correctly, because the gaps within a data record (if any) are typically smaller than those between data records.

b) The visual and display information also contains information about the hierarchical structure of the tags.


c) Visual structure analysis of web pages shows that the relevant data region tends to occupy the major central portion of the web page.

The system model of the VSAP technique is shown in Fig. 1.

It consists of the following components.

• Parsing and Rendering Engine

• Largest Rectangle Identifier

• Container Identifier

• Data Region Identifier

The output of each component is the input of the next component.

Fig. 1: System Model

The VSAP technique is based on three observations:

a) A group of data records that contains descriptions of a set of similar objects is typically presented in a contiguous region of a page.

b) The area covered by a rectangle that bounds the data region is more than the area covered by the rectangles bounding other regions, e.g. Advertisements and links.

c) The height of an irrelevant data record within a collection of data records is less than the average height of relevant data records within that region.

Definition 1: A data region is defined as the most relevant portion of a webpage.

E.g. A region on a product related web site that contains a list of products forms the data region.

Definition 2: A data record is defined as a collection of data that forms a meaningful independent entity. E.g. A product listed inside a data region on a product-related web site is a data record.

Fig. 2 illustrates an example: a segment of a web page that shows a data region containing a list of four books. The full description of each book is a data record.


Fig. 2: An Example of a data region containing 4 data records

The overall algorithm of the proposed technique is as follows:

Algorithm VSAP (HTML document)

a) Set maxRect=NULL

b) Set dataRegion=NULL

c) FindMaxRect (BODY);

d) FindDataRegion (maxRect);

e) FilterDataRegion (dataRegion);

End

Lines (a) and (b) specify initializations. Line (c) finds the largest rectangle within a container. Line (d) identifies the data region, which consists of the relevant data region and possibly some irrelevant regions. Line (e) identifies the actual relevant data region by filtering out the bounding irrelevant regions. As mentioned earlier, the proposed technique has two main steps; this section presents them in turn.

3.1 Determining the Co-ordinates of All Bounding Rectangles

In the first step of the proposed technique, we determine the coordinates of all the bounding rectangles in the web page. The VSAP approach uses the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0. This parsing and rendering engine of the web browser gives us the coordinates of a bounding rectangle. We scan the HTML file for tags, and for each tag encountered, we determine the coordinates of the top left corner, the height and the width of the bounding rectangle of the tag.

Definition: Every HTML tag specifies a method for rendering the information contained within it. For each tag, there exists an associated rectangular area on the screen. Any


information contained within this rectangular area obeys the rendering rules associated with the tag. This rectangle is called the bounding rectangle for that particular tag.

A bounding rectangle is constructed by obtaining the coordinates of the top left corner of the tag, and the height and the width of that tag. The left and top coordinates of the tag are obtained from the offsetLeft and offsetTop properties of the HTMLObjectsElement; these values are relative to its parent tag. The height and width of that tag are available from the offsetHeight and offsetWidth properties of the HTMLObjectsElement class.
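The coordinate computation can be pictured with a small sketch. It is an illustration only, modelling the parent-relative offsets in plain Python rather than calling the MSHTML engine; the class and field names are assumptions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TagBox:
    name: str
    offset_left: int      # left offset relative to the parent tag (offsetLeft)
    offset_top: int       # top offset relative to the parent tag (offsetTop)
    width: int            # offsetWidth
    height: int           # offsetHeight
    children: List["TagBox"] = field(default_factory=list)

def bounding_rect(tag, parent_left=0, parent_top=0):
    """Return (left, top, width, height) of the tag in absolute page coordinates."""
    return (parent_left + tag.offset_left, parent_top + tag.offset_top, tag.width, tag.height)

def area(rect):
    return rect[2] * rect[3]

# example: a <TD> placed 10 px right and 20 px down inside its parent at (100, 50)
td = TagBox("TD", 10, 20, 300, 120)
print(bounding_rect(td, parent_left=100, parent_top=50))   # -> (110, 70, 300, 120)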

Fig. 3 shows a sample web page of a product-related website, which contains a number of books and their descriptions; these form the data records inside the data region.

Fig. 3: A Sample Web page of a product related website

For each HTML tag on a web page, there exists an associated rectangle on the screen, which forms the bounding rectangle for that specific tag. Fig. 4 shows the bounding rectangles for the <TD> tags of the web page shown in Fig. 3.

Fig. 4: Bounding Rectangles for <TD> tag corresponding to the web page in Fig 3


3.2 Identifying the Data Regions

The second step of the proposed technique is to identify the data region of the web page. The data region is the most relevant portion of a web page that contains a list of data records. The three steps involved in identifying the data region are:

• Identify the largest rectangle.

• Identify the container within the largest rectangle.

• Identify the data region containing the data records within this container.

3.2.1 Identification of the Largest Rectangle

Based on the height and width of the bounding rectangles obtained in the previous step, we determine the area of the bounding rectangle of each child of the BODY tag. We then determine the largest rectangle amongst these bounding rectangles. The reason for doing this is the observation that the largest bounding rectangle will always contain the most relevant data in that web page. In Fig. 5 the largest rectangle is shown with a dotted border.

The procedure FindMaxRect identifies the largest rectangle amongst all the bounding rectangles of the children of the BODY tag. It is as follows.

Procedure FindMaxRect (BODY)
    for each child of the BODY tag
    Begin
        Find the coordinates of the bounding rectangle for the child
        If the area of the bounding rectangle > area of maxRect then
            maxRect = child
        endif
    end

Fig. 5: Largest Rectangle amongst bounding rectangles of children of BODY tag


3.2.2 Identification of the Container within the Largest Rectangle

Once we have obtained the largest rectangle, we form a set of all the bounding rectangles within it whose area is more than half the area of the largest rectangle. The rationale behind this is that the most important data of the web page must occupy a significant portion of the web page. Next, we determine the bounding rectangle having the smallest area in this set; the reason for determining the smallest rectangle within the set is that the smallest rectangle will contain only data records. Thus a container is obtained. It contains the data region and possibly some irrelevant data.

Fig. 6: The container identified from the sample web page in Fig. 3

Definition: A container is a superset of the data region, which may or may not contain irrelevant data. For example, the irrelevant data contained in the container may include advertisements on the right and bottom of the page and the links on the left side. Fig. 6 shows the container identified from the web page shown in Fig. 3.

The procedure FindDataRegion identifies the container in the web page, which contains the relevant data region along with some irrelevant data. It is as follows:

Procedure FindDataRegion (maxRect)
    ListChildren = depth-first listing of the children of the tag associated with maxRect
    For each tag in ListChildren
    Begin
        If area of bounding rectangle of tag > half the area of maxRect then
            If area of bounding rectangle of dataRegion > area of bounding rectangle of tag then
                dataRegion = tag
            Endif
        Endif
    End


Fig. 7 shows the enlarged view of the container shown in Fig. 6. We note that there is some irrelevant data both at the top and at the bottom of the actual data region containing the data records.

Fig. 7: The enlarged view of the container shown in Fig. 6

3.2.3 Identification of Data Region Containing Data Records within the Container

To filter the irrelevant data from the container, we use a filter. The filter determines the average height of the children within the container; those children whose heights are less than the average height are identified as irrelevant and are filtered out. Fig. 8 shows the filter applied to the container in Fig. 7 in order to obtain the data region. We note that the irrelevant data, in this case at the top and bottom of the container, is removed by the filter.

The procedure FilterDataRegion filters the irrelevant data from the container and gives the actual data region as the output. It is as follows:

Procedure FilterDataRegion (dataRegion)
    totalHeight = 0
    For each child of dataRegion
        totalHeight += height of the bounding rectangle of the child
    avgHeight = totalHeight / number of children of dataRegion
    For each child of dataRegion
    Begin
        If height of the child's bounding rectangle < avgHeight Then
            Remove the child from dataRegion
        Endif
    End
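Putting the three procedures together, a standalone toy sketch is given below. It illustrates the logic only, on a hand-built tag tree rather than a rendered page; each node is a tuple of name, width, height and children, and the sample tree is invented for the example.

def area(node):
    return node[1] * node[2]

def find_max_rect(body):
    return max(body[3], key=area)                        # largest child of <BODY>

def descendants(node):
    for child in node[3]:
        yield child
        yield from descendants(child)

def find_data_region(max_rect):
    # smallest descendant covering more than half of the largest rectangle
    candidates = [t for t in descendants(max_rect) if area(t) > area(max_rect) / 2]
    return min(candidates, key=area) if candidates else max_rect

def filter_data_region(container):
    children = container[3]
    avg_h = sum(c[2] for c in children) / len(children)
    return [c for c in children if c[2] >= avg_h]        # drop short (irrelevant) children

# toy page: BODY with a nav column and a large content table holding 3 records + an ad
records = [("TR", 600, 80, []), ("TR", 600, 90, []), ("TR", 600, 85, []), ("AD", 600, 20, [])]
content = ("TABLE", 600, 275, records)
body = ("BODY", 800, 300, [("NAV", 150, 300, []), ("DIV", 620, 290, [content])])
print(filter_data_region(find_data_region(find_max_rect(body))))   # keeps the three TR records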


Fig. 8: Data Region obtained after filtering the container in Fig 7

The VSAP technique, as described above, is able to mine the relevant data region containing data records from the given web page efficiently.

4 MDR Vs VSAP

In this section we evaluate the proposed technique. We also compare it with MDR. The evaluation consists of three aspects as discussed in the following:

4.1 Data Region Extraction

We compare the first step of MDR with our system for identifying the data regions.

MDR is dependent on certain tags like <TABLE>, <TBODY>, etc. for identifying the data region. But a data region need not always be contained only within table-related tags like <TABLE> and <TBODY>; it may also be contained within other tags such as <P>, <LI>, <FORM>, etc. In the proposed VSAP system, the data region identification is independent of specific tags and forms. Unlike MDR, where an incorrect tag tree may be constructed due to the misuse of HTML tags, there is no such possibility of incorrect tag tree construction in VSAP, because the hierarchy of tags is constructed based on the visual cues on the web page. In MDR, the entire tag tree needs to be scanned in order to mine data regions, whereas VSAP does not scan the entire tag tree but only the largest child of the <BODY> tag. Hence, this method proves very efficient in improving the time complexity compared to other contemporary algorithms.

4.2 Data Record Extraction

We compare the record extraction step of MDR with VSAP. MDR identifies the data records based on keyword search (e.g. “$”), but VSAP depends purely on the visual structure of


the web page. It does not make use of any text or content mining, which proves to be very advantageous as it avoids the additional overhead of performing keyword search on the web page.

MDR not only identifies the relevant data region containing the search result records but also extracts records from all the other sections of the page, e.g. some advertisement records, which are irrelevant. In MDR, comparison of generalized nodes is based on string comparison using the normalized edit distance method. However, this method is slow and inefficient compared to VSAP, where the comparison is purely numeric, since we are comparing the coordinates of bounding rectangles; it scales well across web pages. A single data record may be composed of multiple sub-trees, and due to noisy information MDR may find a wrong combination of sub-trees. In the VSAP system, the visual gaps between data records help to deal with this problem.

4.3 Overall Time Complexity

The complexity of VSAP is much lower than that of the existing algorithms. The existing algorithm MDR has a complexity of the order O(NK) without considering string comparison, where N is the total number of nodes in the tag tree and K is the maximum number of tag nodes that a generalized node can have (which is normally a small number, < 10). Our algorithm VSAP has a complexity of the order of O(n), where n is the number of tag comparisons made.

5 Conclusion

In this paper, we have proposed a new approach to extract structured data from web pages. Although the problem has been studied by several researchers, existing techniques are either inaccurate or make many strong assumptions. A novel and effective method, VSAP, is proposed to mine the data region in a web page automatically. It is a purely visual-structure-oriented method that can correctly identify the data region, and it is unaffected by errors due to the misuse of HTML tags. Most current algorithms fail to determine the data region correctly when the data region consists of only one data record; many also fail in the case where a series of data records is separated by an advertisement, followed again by a single data record. VSAP works correctly in both of these cases. The number of comparisons done in VSAP is significantly smaller than in other approaches; further, the comparisons are made on numbers, unlike other methods where strings or trees are compared. Thus VSAP overcomes the drawbacks of existing methods and performs significantly better than these methods.

Scope for Future Work

Extraction of data fields from the data records contained in these mined data regions will be considered in future work, taking into account complexities such as web pages featuring dynamic HTML. The extracted data can be put into a suitable format and eventually stored back into a relational database; data extracted from each web page can then be integrated into a single collection. This collection of data can be further used for various knowledge discovery applications, e.g., making a comparative study of products from various companies, smart shopping, etc.


References

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.

[2] Arun K. Pujari, Data Mining Techniques.

[3] Pieter Adriaans and Dolf Zantinge, Data Mining.

[4] George M. Marakas, Modern Data Warehousing, Mining, and Visualization: Core Concepts, 2003.

[5] R. Baeza-Yates, Algorithms for string matching: A survey.

[6] J. Hammer, H. Garcia-Molina, J. Cho, and A. Crespo, Extracting semi-structured information from the web.

[7] A. Arasu and H. Garcia-Molina, Extracting structured data from web pages.


Computer Networks


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

On the Optimality of WLAN Location

Determination Systems

T.V. Sai Krishna, B.V.C. Engineering College, Odalarevu, J.N.T.U Kakinada, [email protected]

T. Sudha Rani, Aditya Engineering College, Surampalem, J.N.T.U Kakinada, [email protected]

Abstract

This paper presents a general analysis for the performance of WLAN location determination systems. In particular, we present an analytical method for calculating the average distance error and probability of error of WLAN location determination systems. These expressions are obtained with no assumptions regarding the distribution of signal strength or the probability of the user being at a specific location, which is usually taken to be a uniform distribution over all the possible locations in current WLAN location determination systems. We use these expressions to find the optimal strategy to estimate the user location and to prove formally that probabilistic techniques give more accuracy than deterministic techniques, which has been taken for granted without proof for a long time. The analytical results are validated through simulation experiments and we present the results of testing actual WLAN location determination systems in an experimental testbed.

Keywords: Analytical analysis, optimal WLAN positioning strategy, simulation experiments, WLAN location determination.

1 Introduction

WLAN location determination systems use the popular 802.11 [10] network infrastructure to determine the user location without using any extra hardware. This makes these systems attractive in indoor environments where traditional techniques, such as the Global Positioning System (GPS) [5], fail to work or require specialized hardware. Many applications have been built on top of location determination systems to support pervasive computing. This includes [4] location-sensitive content delivery, direction finding, asset tracking, and emergency notification.

In order to estimate the user location, a system needs to measure a quantity that is a function of distance. Moreover, the system needs one or more reference points to measure the distance from. In case of the GPS system, the reference points are the satellites and the measured quantity is the time of arrival of the satellite signal to the GPS receiver, which is directly proportional to the distance between the satellite and the GPS receiver. In case of WLAN location determination systems, the reference points are the access points and the measured quantity is the signal strength, which decays logarithmically with distance in free space. Unfortunately, in indoor environments, the wireless channel is very noisy and the radio frequency (RF) signal can suffer from reflection, diffraction, and multipath effect [9], [12],


which makes the signal strength a complex function of distance. To overcome this problem, WLAN location determination systems tabulate this function by sampling it at selected locations in the area of interest. This tabulation has been known in literature as the radio map, which captures the signature of each access point at certain points in the area of interest.

WLAN location determination systems usually work in two phases: offline phase and location determination phase. During the offline phase, the system constructs the radio-map. In the location determination phase, the vector of samples received from each access point (each entry is a sample from one access point) is compared to the radio-map and the “nearest” match is returned as the estimated user location. Different WLAN location determination techniques differ in the way they construct the radio map and in the algorithm they use to compare a received signal strength vector to the stored radio map in the location determination phase.
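As a toy illustration of the two matching styles compared later in the paper, the sketch below contrasts a deterministic nearest-neighbour match against per-location mean signal strengths with a probabilistic match that picks the location whose stored signal-strength distribution makes the observation most likely. The radio-map values are made-up placeholders, not measurements from the paper.

import math

# radio map: location -> per-access-point (mean, std) of signal strength in dBm
radio_map = {
    (0, 0): [(-40, 4), (-70, 5), (-80, 6)],
    (0, 5): [(-55, 4), (-60, 5), (-75, 6)],
    (5, 5): [(-72, 4), (-50, 5), (-62, 6)],
}

def nearest_neighbour(sample):
    # deterministic: smallest squared distance to the stored mean vector
    return min(radio_map,
               key=lambda loc: sum((s - m) ** 2 for s, (m, _) in zip(sample, radio_map[loc])))

def max_likelihood(sample):
    # probabilistic: highest Gaussian log-likelihood of the received vector
    def log_lik(loc):
        return sum(-0.5 * ((s - m) / sd) ** 2 - math.log(sd) for s, (m, sd) in zip(sample, radio_map[loc]))
    return max(radio_map, key=log_lik)

sample = [-53, -61, -74]          # received signal-strength vector
print(nearest_neighbour(sample), max_likelihood(sample))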

In this paper, we present a general analysis of the performance of WLAN location determination systems. In particular, we present general analytical expressions for the average distance error and the probability of error of WLAN location determination systems. These expressions are obtained with no assumptions regarding the distribution of signal strength or the user movement profile. We use these expressions to find the optimal strategy to use during the location determination phase to estimate the user location. These expressions also help to prove formally that probabilistic techniques give more accuracy than deterministic techniques, which has been taken for granted without proof for a long time. We validate our analysis through simulation experiments and discuss how well it models actual environments. For the rest of the paper we refer to the probability distribution of the user location as the user profile. To the best of our knowledge, our work is the first to analyze the performance of WLAN location determination systems analytically and to provide the optimal strategy for selecting the user location.

The rest of this paper is structured as follows. Section 2 summarizes the previous work in the area of WLAN location determination systems. Section 3 presents the analytical analysis for the performance of the WLAN location determination systems. In section 4, we validate our analytical analysis through simulation and measurement experiments. Section 5 concludes the paper and presents some ideas for future work.

2 Related Work

Radio map-based techniques can be categorized into two broad categories: deterministic techniques and probabilistic techniques. Deterministic techniques, such as [2], [8], represent the signal strength of an access point at a location by a scalar value, for example the mean value, and use non-probabilistic approaches to estimate the user location. For example, in the Radar system [2] the authors use nearest-neighbor techniques to infer the user location. On the other hand, probabilistic techniques, such as [3], [6], [7], [13], [14], store information about the signal strength distributions from the access points in the radio map and use probabilistic techniques to estimate the user location. For example, the Horus system from the University of Maryland [14], [15] uses the stored radio map to find the location that has the maximum probability given the received signal strength vector.

All these systems base their performance evaluation on experimental testbeds, which may not give a good idea of the performance of the algorithm in different environments. The authors in [7], [14], [15] showed that their probabilistic technique outperformed the deterministic technique of the Radar system [2] in a specific testbed and conjectured that probabilistic techniques should outperform deterministic techniques. This paper presents a general analytical method for analyzing the performance of different techniques. We use this analysis method to provide a formal proof that probabilistic techniques outperform deterministic techniques. Moreover, we show the optimal strategy for selecting locations in the location determination phase.

3 Analytical Analysis

In this section, we give an analytical method to analyze the performance of WLAN location determination techniques. We start by describing the notations used throughout the paper. We provide two expressions: one for calculating the average distance error of a given technique and the other for calculating the probability of error (i.e. the probability that the location technique will give an incorrect estimate).

3.1 Notations

We consider an area of interest whose radio map contains N locations. We denote the set of locations as L. At each location, we can get the signal strength from k access points. We denote the k-dimensional signal strength space as S. Each element in this space is a k-dimensional vector whose entries represent the signal strength reading from different access points. Since the signal strength returned from the wireless cards are typically integer values,

the signal strength space S is a discrete space. For a vector s ∈ S, f *A(s) represents the

estimated location returned by the WLAN location determination technique A when supplied with the input s. For example, in the Horus system [14], [15], f *Horus(s) will return the

location l ∈ L that maximizes P (l /s).Finally, we use Euclidean (l1, l2) to denote the

Euclidean distance between two locations l1 and l2.

3.2 Average Distance Error

We want to find the average distance error (denoted by E(DErr)). Using conditional probability, this can be written as:

E(DErr) = Σ_{l∈L} E(DErr / l is the correct user location) · P(l is the correct user location)   (1)

where P (l is the correct user location) depends on the user profile.

We now proceed to calculate E(DErr/l is the correct user location). Using conditional probability again:

E(DErr / l is the correct user location)
= Σ_{s∈S} E(DErr / s, l is the correct user location) · P(s / l is the correct user location)   (2)
= Σ_{s∈S} Euclidean(f*A(s), l) · P(s / l is the correct user location)

where Euclidean(f*A(s), l) represents the Euclidean distance between the estimated location and the correct location. Equation 2 says that to get the expected distance error given we are at location l, we need to take the weighted sum, over all the possible signal strength values s ∈ S, of the Euclidean distance between the estimated user location f*A(s) and the actual location l.

Substituting equation 2 in equation 1 we get:

E(DErr) = Σ_{l∈L} Σ_{s∈S} Euclidean(f*A(s), l) · P(s / l is the correct user location) · P(l is the correct user location)   (3)

Note that the effect of the location determination technique is summarized in the function f*A. We seek to find the function that minimizes the probability of error. We defer the optimality analysis until after we present the probability of error analysis.

3.3 Probability of Error

In this section, we want to find an expression for the probability of error, which is the probability that the location determination technique will return an incorrect estimate. This can be obtained from equation 3 by noting that every non-zero distance error (represented by the function Euclidean(f*A(s), l)) is considered an error.

More formally, we define the function:

g(x) = 0 if x = 0; g(x) = 1 otherwise.

The probability of error can be calculated from equation 3 as:

P(Error) = Σ_{l∈L} Σ_{s∈S} g(Euclidean(f*A(s), l)) · P(s / l is the correct user location) · P(l is the correct user location)   (4)

In the next section, we present a property of the term g(Euclidean(f*A(s), l)) and use this property to obtain the optimal strategy for selecting the location.
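As a purely illustrative aid (the radio map, user profile, and decision rule below are invented, not data or code from the paper), the following Python sketch evaluates equations (3) and (4) for an arbitrary decision function over a small discrete radio map:

```python
import math

# Toy example: two radio map locations, one access point,
# discrete signal strength distributions P(s|l) and a user profile P(l).
radio_map = {
    (0.0, 0.0): {-60: 0.6, -65: 0.4},
    (5.0, 0.0): {-65: 0.5, -70: 0.5},
}
profile = {(0.0, 0.0): 0.5, (5.0, 0.0): 0.5}

def evaluate(decide):
    """Evaluate equations (3) and (4) for a decision function decide(s) -> location."""
    e_derr, p_err = 0.0, 0.0
    for l, p_l in profile.items():                   # sum over l in L
        for s, p_s_given_l in radio_map[l].items():  # sum over s in S
            d = math.dist(decide(s), l)              # Euclidean(f*A(s), l)
            e_derr += d * p_s_given_l * p_l          # equation (3)
            p_err += (d > 0) * p_s_given_l * p_l     # equation (4): g(d) = 1 iff d > 0
    return e_derr, p_err

# The optimal rule of Theorem 1: pick the location maximizing P(s|l)P(l).
optimal = lambda s: max(profile, key=lambda l: radio_map[l].get(s, 0.0) * profile[l])
print(evaluate(optimal))
```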

3.4 Optimality

We will base our optimality analysis on the probability of error.

Lemma 1: For a given signal strength vector s, g(Euclidean(f*A(s), l)) will be zero for only one location l ∈ L and one for the remaining N − 1 locations.

Proof: The proof can be found in [11] and has been omitted for space constraints. The lemma states that only one location will give a value of zero for the function g(Euclidean(f*A(s), l)) in the inner sum. This means that the optimal strategy should select this location in order to minimize the probability of error. This leads us to the following theorem.

Theorem 1 (Optimal Strategy): Selecting the location that maximizes the probability P(s/l)·P(l) is both a necessary and sufficient condition to minimize the probability of error.

Proof: The proof can be found in [11].


Theorem 1 suggests that the optimal location determination technique should store the signal strength distributions in the radio map to be able to calculate P(s/l). Moreover, the optimal technique needs to know the user profile in order to calculate P(l).

Corollary 1: Deterministic techniques are not optimal.

Proof: The proof can be found in [11]. Note that we did not make any assumption about the independence of access points, the user profile, or the signal strength distribution in order to obtain the optimal strategy.

A major assumption made by most current WLAN location determination systems is that all user locations are equi-probable. In this case, P(l) = 1/N and Theorem 1 can be rewritten as:

Theorem 2: If the user is equally likely to be at any of the radio map locations in L, then selecting the location l that maximizes the probability P(s/l) is both a necessary and sufficient condition to minimize the probability of error.

Proof: The proof is a special case of the proof of Theorem 1.

This means that, for this special case, it is sufficient for the optimal technique to store the histogram of signal strength at each location. This is exactly the technique used in the Horus system [14], [15].
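For illustration only, a histogram-based selection of the location maximizing P(s/l) under a uniform profile might look like the sketch below; the data and names are hypothetical, and multiplying per-access-point probabilities implicitly assumes independent access points, as Horus itself does:

```python
# Hypothetical radio map: location name -> one signal-strength histogram per access point,
# each mapping a reading (dBm) to its estimated probability at that location.
radio_map = {
    "room_140": [{-60: 0.7, -62: 0.3}, {-75: 1.0}],
    "room_142": [{-66: 0.5, -68: 0.5}, {-70: 0.6, -72: 0.4}],
}

def estimate_location(sample, floor=1e-6):
    """Return the location maximizing P(s|l) for a signal-strength vector `sample`
    (one reading per access point), assuming a uniform user profile."""
    best_loc, best_p = None, -1.0
    for loc, histograms in radio_map.items():
        p = 1.0
        for hist, reading in zip(histograms, sample):
            p *= hist.get(reading, floor)   # unseen readings get a small floor probability
        if p > best_p:
            best_loc, best_p = loc, p
    return best_loc

print(estimate_location([-60, -75]))        # -> "room_140"
```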

Figure 1 shows a simplified example illustrating the intuition behind the analytical expressions and the theorems. In the example, we assume that there are only two locations in the radio map and that at each location only one access point can be heard, whose signal strength, for simplicity of illustration, follows a continuous distribution. The user can be at either of the two locations with equal probability. For the Horus system (Figure 1.a), consider the line that passes through the point of intersection of the two curves.

Fig. 1: Expected error for the special case of two locations


Since for a given signal strength the technique selects the location that has the maximum probability, the error if the user is at location 1 is the area of curve 1 to the right of this line. If the user is at location 2, the error is the area of curve 2 to the left of this line. The expected error probability is half the sum of these two areas, as the two locations are equi-probable. This is the same as half the area under the minimum of the two curves (shaded in the figure). For the Radar system (Figure 1.b), consider the line that bisects the signal strength space between the two distribution averages. Since for a given signal strength the technique selects the location whose average signal strength is closer to the signal strength value, the error if the user is at location 1 is the area under curve 1 to the right of this line. If the user is at location 2, the error is the area under curve 2 to the left of this line. The expected error probability is half the sum of these two areas, as the two locations are equi-probable (half the shaded area in the figure). From Figure 1, we can see that the Horus system outperforms the Radar system, since the expected error for the former is less than that for the latter (by the hashed area in Figure 1.b). The two systems would have the same expected error only if the line bisecting the signal strength space between the two averages passed through the intersection point of the two curves, which is not true in general. This has been proved formally in the above theorems. We provide simulation and experimental results to validate our analysis in Section 4.
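The shaded-area argument can be checked numerically; the sketch below uses two made-up Gaussian signal-strength densities and compares the error probability of the probabilistic rule (half the area under the minimum of the two curves) with that of the nearest-average rule (the tail areas around the midpoint of the two means):

```python
import numpy as np
from scipy.stats import norm

# Illustrative (made-up) signal-strength densities at the two locations.
p1 = norm(loc=-60, scale=3)        # location 1
p2 = norm(loc=-66, scale=5)        # location 2

s = np.linspace(-90, -40, 20001)
f1, f2 = p1.pdf(s), p2.pdf(s)
ds = s[1] - s[0]

# Probabilistic (Horus-like) rule: error = half the area under the minimum of the two curves.
p_err_prob = 0.5 * np.minimum(f1, f2).sum() * ds

# Deterministic (Radar-like) rule: decide by the midpoint between the two means.
mid = (p1.mean() + p2.mean()) / 2
p_err_det = 0.5 * (p1.cdf(mid) + (1.0 - p2.cdf(mid)))

print(round(p_err_prob, 4), round(p_err_det, 4))   # the probabilistic rule's error is the smaller
```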

4 Experiments

4.1 Testbed

We performed our experiment on a floor covering a 20,000 square foot area. The layout of the floor is shown in Figure 2.

Fig. 2: Plan of the floor where the experiment was conducted. Readings were collected in the corridors (shown in gray).


Both techniques were tested in the Computer Science Department wireless network. The entire wing is covered by 12 access points installed on the third and fourth floors of the building. For building the radio map, we took the radio map locations on the corridors on a grid with cells placed 5 feet apart (the corridor's width is 5 feet). We have a total of 110 locations along the corridors. On average, each location is covered by 4 access points. We used the mwvlan driver and the MAPI API [1] to collect the samples from the access points.

4.2 Simulation Experiments

In this section, we validate our analytical results through simulation experiments. For this purpose, we chose to implement the Radar system [2] from Microsoft as a deterministic technique, and the Horus system [14], [15] from the University of Maryland as a probabilistic technique that satisfies the optimality criterion described in Theorem 2.

We start by describing the experimental testbed that we use to validate our analytical results and evaluate the systems.

4.2.1 Simulator

We built a simulator that takes the following parameters as input:

• The coordinates of the radio map locations.
• The signal strength distributions at each location from each access point.
• The distribution over the radio map locations that represents the steady-state probability of the user being at each location (the user profile).

The simulator then chooses a location based on the user location distribution and generates a signal strength vector according to the signal strength distributions at this location. The simulator feeds the generated signal strength vector to the location determination technique. The estimated location is compared to the generated location to determine the distance error.
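A minimal sketch of such a Monte Carlo loop is given below, assuming discrete per-location, per-access-point distributions like those in the earlier sketch and a coords mapping from location names to coordinates; it is not the authors' simulator:

```python
import math
import random

def simulate(radio_map, coords, profile, estimate, trials=10000):
    """Monte Carlo estimate of the average distance error and the probability of error
    for a location determination technique estimate(sample) -> location."""
    locations = list(profile)
    weights = [profile[l] for l in locations]
    total_dist, errors = 0.0, 0
    for _ in range(trials):
        true_loc = random.choices(locations, weights=weights)[0]        # draw l from the user profile
        sample = [random.choices(list(h), weights=list(h.values()))[0]  # draw s from P(s|l)
                  for h in radio_map[true_loc]]
        guess = estimate(sample)                                        # technique under test
        d = math.dist(coords[guess], coords[true_loc])
        total_dist += d
        errors += d > 0
    return total_dist / trials, errors / trials
```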

The next sections analyze the effect of the uniform user profile on the performance of the location determination systems and validate our analytical results. The results for heterogeneous profiles can be found in [11].

4.2.2 Uniform User Location Distribution

This is similar to the assumption made by the Horus system; therefore, the Horus system should give optimal results. Figure 3 shows the probability of error and the average distance error (analytical and simulation results), respectively, for the Radar and Horus systems. The error bars represent the 95% confidence interval for the simulation experiments. The figure shows that the analytical expressions obtained are consistent with the simulation results. Moreover, the Horus system performance is better than that of the Radar system, as predicted by Theorem 2. The Horus system performance is optimal under the uniform distribution of user location.


Fig. 3: Performance of the Horus and Radar systems under a uniform user profile (profile 1).

4.3 Measurement Experiments

In our simulations, we assumed that the test data follows the signal strength distributions exactly. This can be considered the ideal case since, in a real environment, the received signal may differ slightly from the stored signal strength distributions. Our results, however, are still valid and can be considered an upper bound on the performance of the simulated systems. In order to confirm that, we tested the Horus system and the Radar system in an environment where the test set was collected on different days, at different times of day, and by different persons than the training set. Figure 4 shows the CDF of the distance error for the two systems. The figure shows that the Horus system (a probabilistic technique) significantly outperforms the Radar system (a deterministic technique), which confirms our results.

Fig. 4: CDF for the Distance Error for the Two Systems.


5 Conclusions and Future Work

We presented an analysis method for studying the performance of WLAN location determination systems. The method can be applied to any WLAN location determination technique and does not make any assumptions about the signal strength distributions at each location, the independence of access points, or the user profile. We also studied the effect of the user profile on the performance of WLAN location determination systems.

We used the analytical method to obtain the optimal strategy for selecting the user location. The optimal strategy must take into account the signal strength distributions at each location and the user profile. We validated the analytical results through simulation experiments. In our simulations, we assumed that the test data follows the signal strength distributions exactly. This can be considered as the ideal case since in a real environment, the received signal may differ slightly from the stored signal strength distributions. Our results however are still valid and can be considered as an upper bound on the performance of the simulated systems. We confirmed that through actual implementation in typical environments. For future work, the method can be extended to include other factors that affect the location determination process such as averaging multiple signal strength vectors to obtain better accuracy, using the user history profile, usually taken as the time average of the latest location estimates, and the correlation between samples from the same access points.

References

[1] http://www.cs.umd.edu/users/moustafa/Downloads.html.
[2] P. Bahl and V. N. Padmanabhan. RADAR: An In-Building RF-based User Location and Tracking System. In IEEE Infocom 2000, volume 2, pages 775–784, March 2000.
[3] P. Castro, P. Chiu, T. Kremenek, and R. Muntz. A Probabilistic Location Service for Wireless Network Environments. Ubiquitous Computing 2001, September 2001.
[4] G. Chen and D. Kotz. A Survey of Context-Aware Mobile Computing Research. Dartmouth Computer Science Technical Report TR2000-381, 2000.
[5] P. Enge and P. Misra. Special Issue on GPS: The Global Positioning System. Proceedings of the IEEE, pages 3–172, January 1999.
[6] A. M. Ladd, K. Bekris, A. Rudys, G. Marceau, L. E. Kavraki, and D. S. Wallach. Robotics-Based Location Sensing using Wireless Ethernet. In 8th ACM MOBICOM, Atlanta, GA, September 2002.
[7] T. Roos, P. Myllymaki, H. Tirri, P. Misikangas, and J. Sievanen. A Probabilistic Approach to WLAN User Location Estimation. International Journal of Wireless Information Networks, 9(3), July 2002.
[8] A. Smailagic, D. P. Siewiorek, J. Anhalt, D. Kogan, and Y. Wang. Location Sensing and Privacy in a Context Aware Computing Environment. Pervasive Computing, 2001.
[9] W. Stallings. Wireless Communications and Networks. Prentice Hall, first edition, 2002.
[10] The Institute of Electrical and Electronics Engineers, Inc. IEEE Standard 802.11 – Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 1999.
[11] M. Youssef and A. Agrawala. On the Optimality of WLAN Location Determination Systems. Technical Report UMIACS-TR 2003-29 and CS-TR 4459, University of Maryland, March 2003.
[12] M. Youssef and A. Agrawala. Small-Scale Compensation for WLAN Location Determination Systems. In WCNC 2003, March 2003.
[13] M. Youssef and A. Agrawala. Handling Samples Correlation in the Horus System. In IEEE Infocom 2004, March 2004.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Multi-Objective QoS Based Routing Algorithm

for Mobile Ad-hoc Networks

Shanti Priyadarshini Jonna, JNTU College of Engineering, Anantapur, India
Ganesh Soma, JNTU College of Engineering, Anantapur, India

Abstract

A Mobile Ad-hoc NETwork (MANET) is a collection of wireless nodes that can dynamically be set up anywhere and anytime without using any pre-existing network infrastructure. The dynamic topology of the nodes poses more routing challenges in MANETs than in infrastructure-based networks. Most current routing protocols in MANETs try to achieve a single routing objective using a single route selection metric. As the various routing objectives in mobile ad-hoc networks are not completely independent, an improvement in one objective can only be achieved at the expense of others. Therefore, efficient routing in MANETs requires selecting routes that meet multiple objectives. In addition, the routing algorithm must be able to give different priorities to different QoS parameters as needed by the application, since these priorities vary from one application to another. We develop a hybrid routing algorithm for MANETs which uses the advantages of both reactive and proactive routing approaches to find stable routes, reduce initial route delay, and minimize bandwidth usage. We propose a generic Multi-Objective Hybrid Algorithm that finds the best available routes by considering multiple QoS parameters and evaluating the different alternatives, thereby achieving multiple objectives. The algorithm supports a variable number of QoS parameters, which is needed to support any kind of application, since each application has different priorities for the QoS parameters.

Keywords: Mobile Ad-hoc Networks.

1 Introduction

Future mobile ad-hoc networks are expected to support applications with diverse Quality of Service (QoS) requirements. QoS routing is an important component of such networks. The objective of QoS routing is two-fold: to find a feasible path for each transaction, and to optimize the usage of the network by balancing the load.

Routing in mobile ad-hoc networks depends on many factors, including modeling of the topology, selection of routers, initiation of requests, and specific underlying characteristics that could serve as heuristics in finding a path efficiently. The routing problem in mobile ad-hoc networks concerns how mobile nodes can communicate with one another over the wireless medium without any support from infrastructure-based network components. Several routing algorithms have been proposed in the literature for mobile ad-hoc networks with the goal of achieving efficient routing.

These algorithms can be classified into three main categories based on the way the algorithm finds a path to the destination:

1. Proactive Routing Algorithms
2. Reactive Routing Algorithms
3. Hybrid Routing Algorithms

Proactive protocols perform routing operations between all source-destination pairs periodically, irrespective of the need for such routes, whereas reactive protocols are designed to minimize routing overhead. Instead of tracking changes in the network topology to continuously maintain shortest-path routes to all destinations, reactive protocols determine routes only when necessary. Hybrid routing is an approach often used to obtain a better balance between adaptability to varying network conditions and routing overhead. These protocols use a combination of reactive and proactive principles, each applied under different conditions, places, or regions.

Fig. 1: Classification and examples of ad hoc routing protocols.

In this paper we propose a generic Multi-Objective Hybrid Routing Algorithm which uses the advantages of both reactive and proactive routing approaches to find the best available routes by considering multiple QoS parameters and achieving multiple objectives.

2 Proposed Algorithm

The proposed algorithm is a Multi-Objective Hybrid Routing Algorithm for mobile ad-hoc networks that tries to achieve multiple objectives, each of which depends upon one or more QoS parameters. We consider n QoS parameters, which together account for achieving the multiple objectives. Depending upon the parameters considered and how they are used, different objectives can be achieved, and the parameters can be varied depending upon the application, since every application has different QoS requirements and thus different priorities for each parameter. We therefore propose a flexible, generic scheme in which the user can select a different set of n QoS parameters for achieving multiple objectives.


Three-dimensional Cartesian co-ordinates are used in this algorithm to determine the expected and request zones, by introducing the third co-ordinate z of the geographic (earth-centered) Cartesian co-ordinate system. Route recovery with local route repair, based on a distance metric over the path length, is also added to the algorithm to support real-time applications. The algorithm has five phases: Neighbor Discovery, Route Discovery, Route Selection, Route Establishment, and Route Recovery. The Route Discovery phase consists of two sub-modules: Intra-Zone Routing and Inter-Zone Routing.

A Neighbor Discovery Phase

The Neighbor Discovery algorithm looks after the maintenance of the Neighbor Tables and Zone Routing Tables, which every node maintains. Along with the neighbor node addresses, the Neighbor Table also stores the available QoS parameter values of the link between the node and each of its neighbors. These parameters are used for selecting the best available routes by the Intra-Zone Routing protocol (used to select routes within the zone). In this phase, every node periodically transmits beacons to its neighbors. On reception of these packets, every node updates its Neighbor Table with the appropriate values. Each node then exchanges its Neighbor Table with its neighbors and constructs its Zone Routing Table from these Neighbor Tables using a link-state algorithm. A zone is the set of nodes which lie within a limited region at a 2-hop distance from a given node.
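A simplified sketch of this bookkeeping is shown below; the beacon fields and table layouts are assumptions made for illustration, not the authors' packet formats:

```python
import time

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbor_table = {}   # neighbor_id -> {"qos": {...}, "last_seen": timestamp}
        self.zone_table = {}       # destination within 2 hops -> next hop

    def on_beacon(self, beacon):
        """Update the Neighbor Table from a periodic beacon carrying link QoS values."""
        self.neighbor_table[beacon["node_id"]] = {
            "qos": beacon["link_qos"],     # e.g. {"bw": 2.0, "delay": 5, "ticks": 12}
            "last_seen": time.time(),
        }

    def on_neighbor_table(self, neighbor_id, their_neighbors):
        """Build the 2-hop Zone Routing Table from a neighbor's Neighbor Table (link-state style)."""
        for dest in their_neighbors:
            if dest != self.node_id and dest not in self.neighbor_table:
                self.zone_table[dest] = neighbor_id   # reach dest through this neighbor
```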

B Route Discovery Phase

This phase is used to find all the available alternate routes. It has two sub-modules: the Intra-Zone Routing protocol and the Inter-Zone Routing protocol. Any node that requires a route to a destination constructs a Route Request packet (RREQ) containing the desired QoS metrics [Q1, Q2, ..., Qn] and the set of parameters [P1, P2, ..., Pn] to be calculated during the Route Discovery phase. The source node S first checks whether the destination node D belongs to its zone. If it does, S finds the desired route using the Intra-Zone Routing module.

The Intra-Zone Routing protocol selects the path to a destination that is present within the zone. The source node selects a path from the Zone Routing Table only when the desired QoS metrics are satisfied. The Inter-Zone Routing protocol selects all the available routes to a destination node that lies outside the zone. S broadcasts the RREQ packets. On reception of an RREQ, every node checks whether it is a member of the request zone. If it is, it checks whether the link between itself and its predecessor node satisfies the QoS constraints; if so, it processes the parameters according to the metric, adds its details, and broadcasts the request further, otherwise it discards the request, thereby reducing unnecessary routing traffic. The destination D may receive RREQ packets along alternate paths; these are the different alternatives available at D. The Route Selection algorithm is used to select the best available route, and a Route Reply packet (RREP) is constructed at D and sent back along the chosen path. Each intermediate node processes the RREQ packets and stores the route request details in its route table, along with a pointer to a Local Route Repair Table (LRRT) in which the QoS parameter values attained up to that node are stored for use in local route recovery if needed. A node retains this LRRT only if it may have to repair the route locally, as decided by the distance metric in the Route Establishment phase.
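The forwarding decision can be sketched as follows; the QoS combination rules (bandwidth as a path minimum, delay as a path sum) and the helper methods in_request_zone and store_in_lrrt are assumptions of this sketch:

```python
def handle_rreq(node, rreq):
    """Decide whether to re-broadcast a route request: the node must be in the request
    zone and the incoming link must keep the accumulated path QoS within the desired bounds."""
    if not node.in_request_zone(rreq["request_zone"]):
        return None                                     # outside the request zone: discard
    link = node.neighbor_table[rreq["predecessor"]]["qos"]
    acc = {
        "bw": min(rreq["acc"]["bw"], link["bw"]),       # bandwidth: bottleneck along the path
        "delay": rreq["acc"]["delay"] + link["delay"],  # delay: additive along the path
    }
    desired = rreq["desired"]
    if acc["bw"] < desired["bw"] or acc["delay"] > desired["delay"]:
        return None                                     # QoS violated: discard, limiting routing traffic
    updated = dict(rreq, acc=acc, predecessor=node.node_id,
                   path=rreq["path"] + [node.node_id])
    node.store_in_lrrt(updated)                         # keep attained values for possible local repair
    return updated                                      # caller broadcasts the updated RREQ
```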


C Route Selection Phase

In this phase, the best available route among the k alternatives has to be selected based on m attributes (the multiple QoS parameters). Not all of the k alternatives are optimal; the Pareto-optimal solutions have to be found. Finding Pareto-minimum vectors among r given vectors, each of dimension m, is a fundamental problem in multi-objective optimization. Multi-objective optimization is used in the Route Selection phase, where the multi-objective problem is transformed into a single-objective problem by the weighting method. The goal of such a single-objective optimization problem is to find the best solution, which corresponds to the minimum or maximum value of an objective function. In this algorithm, the multiple objectives are reformulated as a single-objective problem by combining the different objectives into one (that is, by forming a weighted combination of the different objectives). First, all the objectives need to be either minimized or maximized; this is done by multiplying some of them by -1 (i.e., max f2 is equivalent to min(-f2) = min f2'). Next, these objectives are lumped together to create a single objective function, using weighting (conversion) factors w1, w2, ..., wn to obtain a single, combined objective function.

Maximize F = ±w1·f1(x) ± w2·f2(x) ± ... ± wn·fn(x).

To find the relative performance of each objective, each objective function value obtained is divided by the corresponding desired QoS value. The relative efficiency of each route is then obtained by calculating the F value of all valid paths (those which satisfy the QoS requirements) from source to destination. Finally, given this single objective function, one can find a single optimal solution (the optimal route).
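A sketch of this weighted-sum scoring is shown below: each objective is normalized by its desired QoS value, objectives to be minimized enter with a negative sign, and the feasible route with the largest combined score F is chosen; the weights and sample routes are illustrative:

```python
def route_score(route_qos, desired, weights, maximize):
    """Combine normalized objectives into a single score F (weighting method)."""
    score = 0.0
    for name, value in route_qos.items():
        relative = value / desired[name]            # relative performance of this objective
        sign = 1.0 if maximize[name] else -1.0      # objectives to minimize are multiplied by -1
        score += sign * weights[name] * relative
    return score

def select_route(candidates, desired, weights, maximize):
    """Pick the QoS-feasible candidate route with the maximum combined score."""
    return max(candidates, key=lambda r: route_score(r["qos"], desired, weights, maximize))

# Example with hypothetical values: bandwidth should be high, delay and hop count low.
desired  = {"bw": 1.0, "delay": 50.0, "hops": 10.0}
weights  = {"bw": 0.5, "delay": 0.3, "hops": 0.2}
maximize = {"bw": True, "delay": False, "hops": False}
routes = [
    {"path": ["S", "A", "D"], "qos": {"bw": 2.0, "delay": 30.0, "hops": 2}},
    {"path": ["S", "B", "C", "D"], "qos": {"bw": 3.0, "delay": 45.0, "hops": 3}},
]
print(select_route(routes, desired, weights, maximize)["path"])
```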

D Route Establishment Phase

This phase sets up the reverse path, i.e., the route is established from the destination back to the source. After the optimal route has been selected in the Route Selection phase, Route Reply packets are sent along the selected path, backtracking from destination to source, setting the status field of the corresponding route table entry from Not Established (NE) to Established and updating NextNode_ID to P.Current_ID (the node from which the RREP packet was received). The RREP packets are then sent back towards the source, selecting the next hop from the route table stored during forward path set-up. This phase also decides whether an intermediate node is capable of handling local route recovery for this path, depending upon the distance metric explained in the next section. If the node is capable of local recovery, it retains its LRRT entries; otherwise it clears them, thus saving space.

E Route Recovery Phase

Every routing algorithm intended to support real-time applications must have an efficient route recovery mechanism. This algorithm uses local route repair; this feature requires extra overhead (since each node has to store QoS requirements per route), but it is necessary. To trade off efficient route recovery against space overhead, the path length is used as a distance metric and the nodes are divided into two categories: those which can handle local route recovery, and those which notify the source or destination to handle route recovery. The path length is divided into three portions: (0 to 25)%, (25 to 75)%, and (75 to 100)%. The middle 50% of the nodes, which lie in the (25 to 75)% portion of the path length, handle local route recovery, while nodes in the remaining portions notify the source or destination. In the Route Establishment phase, every node calculates in which portion of the path length it lies, so that it knows how to handle route recovery. Every node which receives an RERR message checks whether it is capable of handling route recovery locally by checking whether LRRT entries are available. If they are available, it constructs RREQ packets locally with the entries available from the LRRT and broadcasts them; otherwise it sends RERR packets towards the source or destination.
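The 25–75% rule can be captured in a few lines; counting the hop index from the source is an assumption of this sketch:

```python
def handles_local_repair(hop_index, path_length):
    """A node handles local route repair only if it lies in the middle 50% of the path.
    hop_index counts hops from the source (0 = source, path_length = destination)."""
    position = hop_index / path_length       # fraction of the path length covered so far
    return 0.25 <= position <= 0.75

# On an 8-hop path, the nodes at hops 2..6 keep their LRRT entries; the rest clear them.
print([h for h in range(9) if handles_local_repair(h, 8)])
```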

3 Complexity Analysis of the Algorithm

Let N be the total number of nodes in the network, n be the number of neighbor nodes of a particular node, z be the number of nodes in its zone and N1 be the number of nodes in the request zone.

A Space Complexity

For each node, a Neighbor Table and a Zone Routing Table are required. The size of each entry in the Neighbor Table is (9 + k) bytes, where k is the number of QoS parameters considered. The size of each entry in the Zone Routing Table is also (9 + k) bytes. The total size required by each node is therefore ((9+k)*n + (9+k)*z), and the total amount of space required by the overall network is N*(9*n + 9*z + k*(n+z)).

Case 1: Average Case

The space complexity
= O(N*(9*n + 9*z + k*(n+z)))
= O(N*(c1*n + c2*z + c3*(n+z))), where c1, c2, c3 are constants
= O(N*(c2+c3)*z), since z ≥ n
= O(N*z), which is at most O(N²).

This is the average-case space complexity.

Case 2: Best Case

In the best case, the number of nodes in the zone equals the number of neighbor nodes. So the best case space complexity of the network becomes O (N*n).

Case 3: Worst Case

In the worst case, the number of neighbor nodes or the number of nodes in the zone equals the total number of nodes in the network. In that case the overall complexity becomes O(N²) (since z = N).

B Time Complexity

For neighbor table maintenance, the proposed algorithm uses the link-state algorithm. Each node receives the neighbor tables from all of its neighboring nodes and computes the zone routing table. As the number of neighboring nodes is n, the complexity of computing the zone routing table is of order O(n²).


Case 1: Average Case

In the average case, the route is found from the routing table using binary search in O(log z) time. The average-case time complexity of the algorithm for the entire network is therefore
N*(O(n²) + O(log z))
= O(N*log z), if log z > n²
= O(N*n²), otherwise.

Case 2: Best Case

In the best case the required route is directly found from the routing table in one step, i.e. in O(1) time. So, the best case time complexity of the algorithm for the entire network is

N*(O(1) + O(n²)) = O(N*n²).

Case 3: Worst Case

In the worst case, the route request has to go through the entire request zone, so the complexity becomes O(N1*log z). Let m be the number of possible routes satisfying the QoS constraints at the destination node. Selecting the k Pareto-optimal solutions from the m alternatives takes O(m²) time, and selecting the best route from the k alternatives takes O(k²) time. In the worst case k = m, so this step is O(m²). The total time complexity for selecting the route is therefore O(m²) + O(m²) = O(m²).

The worst-case time complexity therefore becomes
N*(O(N1*log z) + O(m²))
= N*O(N1*log z), since m << N1
= O(N*N1*log z).
If the request zone becomes the entire network, then the complexity becomes O(N²*log z).

4 Communication Complexity

Case 1: Average Case

This complexity is considered under steady-state conditions of the network. The amount of data transferred between the nodes is O(n²), as the nodes have to exchange their neighbor tables with their neighbors.

Case 2: Best Case

In the best case, the route is found from the routing table. So the communication complexity becomes

O(n²) + O(1) = O(n²).

Case 3: Worst Case

In the worst case, the route request and reply have to go through the entire request zone, so the complexity becomes

O(n²) + O(N1²) = O(N1²).

If the request zone is the entire network, then it is O(N²). For on-demand algorithms, this is always O(N²).


Complexity Type             Best Case    Average Case    Worst Case
Space complexity            O(N*n)       O(N*z)          O(N²)
Time complexity             O(N)         O(N*log z)      O(N²*log z)
Communication complexity    O(n²)        O(N1²)          O(N²)

5 Conclusion

MOHRA is an algorithm built on top of the New Hybrid Routing Algorithm (NHRA). NHRA is a single-objective routing protocol, whereas this algorithm addresses multiple objectives and has the advantages of both reactive and proactive routing approaches. The algorithm selects the optimal available route while achieving multiple objectives, such as minimum delay and highly stable routes with the desired bandwidth, which depend upon one or more QoS parameters such as delay, associativity ticks (a metric used to measure link stability), and bandwidth. Depending upon the parameters considered and how they are used, different objectives can be achieved.

The algorithm supports a variable number of QoS parameters for achieving multiple objectives, making it a flexible scheme that can support any kind of real-time application, where each application has its own priorities, and all of this is possible with modest computational effort.

As we use the associativity count, long-lived routes are selected, ensuring a lower packet loss rate arising from the movement of intermediate nodes and fewer route failures. This accounts for the increase in packet delivery fraction and the reduction in end-to-end delay, and by using location co-ordinates the search space for route discovery can be reduced.

Another major contribution of this work is an efficient route recovery mechanism with local route repair, based on a distance metric over the path length, to support real-time applications.

Based on the simulation study and a comparative analysis of the routing algorithms, it is observed that NHP performs well with respect to end-to-end delay compared with the other algorithms, while DSDV performs well with respect to packet delivery fraction.

6 Future Scope

We have analyzed GPS and its use in finding location co-ordinates; various alternative positioning systems can be studied in order to suggest a positioning system that is cost-efficient.

The proposed Multi-Objective Hybrid Routing Algorithm can be implemented using Network Simulator 2.

Our QoS-aware hybrid routing algorithm only searches for a path with enough resources but does not reserve them. This job of reservation, via a QoS signaling mechanism, can be incorporated into our algorithm.

The implemented New Hybrid Routing Algorithm uses only two-dimensional location co-ordinates. It can be extended to three-dimensional co-ordinates to give completeness to the algorithm.


References

[1] C.E. Perkins and P. Bhagwat, "Highly Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) for Mobile Computers", Computer Communications Review, Oct 1994, pp. 234–244.
[2] C.C. Chiang, "Routing in Clustered Multihop, Mobile Wireless Networks with Fading Channel," Proceedings of IEEE SICON, pp. 197–211, April 1997.
[3] P.F. Tsuchiya, "The Landmark Hierarchy: a new hierarchy for routing in very large networks," Computer Communication Review, vol. 18, no. 4, Aug. 1988, pp. 35–42.
[4] C.E. Perkins and E.M. Royer, "Ad-hoc on-demand distance vector routing", WMCSA '99, Second IEEE Workshop on Mobile Computing Systems and Applications, pp. 90–100, 1999.
[5] D.B. Johnson and D.A. Maltz, "Dynamic Source Routing Algorithm in Ad-Hoc Wireless Networks", Mobile Computing, Chapter 5, Kluwer Academic, Boston, MA, 1996, pp. 153–181.
[6] V.D. Park and M.S. Corson, "A Highly Adaptive Distributed Routing Algorithm for Mobile and Wireless Networks," Proceedings of IEEE INFOCOM '97, Kobe, Japan, pp. 103–112, April 1997.
[7] Nicklas Beijar, "Zone Routing Protocol (ZRP)," www.netlab.tkk.fi/opetus/s38030/k02/Papers/08 Nicklas.pdf
[8] John Schaumann, "Analysis of the Zone Routing Protocols," Dec 8, 2002, http://www.netmeister.org/misc/zrp/zrp.pdf
[9] Navid Nikaein, Houda Labiod and Christian Bonnet, "DDR-Distributed Dynamic Routing Algorithm for Mobile Ad hoc Networks," International Symposium on Mobile Ad Hoc Networking & Computing, pp. 19–27, 2000.
[10] M. Joa-Ng and I-Tai Lu, "A peer-to-peer zone-based two-level link state routing for mobile ad hoc networks," IEEE Journal on Selected Areas in Communications, vol. 17, no. 8, pp. 1415–1425, 1999.
[11] Young-Bae Ko and Nitin H. Vaidya, "Location-Aided Routing in mobile ad hoc networks", Wireless Networks 6, 2000, pp. 307–321.
[12] S. Basagni, I. Chlamtac, V. Syrotiuk and B. Woodward, "A Distance Routing Effect Algorithm for Mobility."


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

A Neural Network Based Router

D.N. Mallikarjuna Rao, Jyothishmathi Institute of Technology and Science, Karimnagar
V. Kamakshi Prasad, Jawaharlal Nehru Technological University, Hyderabad

Abstract

In this paper we describe a router (in a communication network) that takes routing decisions using a neural network, which has to be trained regularly. We construct a multi-layer feed-forward neural network and train it using data collected by the ACO (Ant Colony Optimization) algorithm [2]. Given the destination node as input, the neural network gives as output the next node to which the packet has to be forwarded. This experiment shows that routing tables can be replaced with a neural network, so that no search algorithm is required to find the next node for a given destination.

1 Introduction

Routing has a profound effect on the performance of communication networks, as it involves a decision-making process that consults a routing table. The size of the routing table is proportional to the number of routers in the network. A routing algorithm should take minimum average response time to find the optimum path(s) for transporting data or messages, and in doing so it must satisfy the users' demands for fast service. In today's world, networks are growing by leaps and bounds; therefore, storing and updating information about the routers is a tedious task. The routers must also adapt to changes in the network environment.

Research on neural network based routers has used global information about the communication network (Hopfield and Tank, 1989). Lee and Chang (1993) used a complex neural network to make routing decisions. Chiu-Che Tseng and Max Garzon used local information for updating the neural network.

The rest of the paper is organized as follows. Section 2 describes the model and the Neural Network component. Section 3 describes the JavaNNS simulator. Section 4 describes the experimental setup and the results.

2 The Model

The model consists of a neural network router which has to be trained using routing table information. We use the routing table information obtained from the Ant algorithm simulation [2]; the neural network has been trained using this information.

The Neural Network

In our communication network, a neural network is part of every router and replaces the routing table. The destination node address is given as input to this neural network, and it provides the next node as output. The Ant algorithm provides routing table information to every node; this information, taken offline, is used to train the multi-layer feed-forward neural network. First we randomly initialize the weights of the neural network; then, using the information provided by the Ant algorithm, the weights are updated to reflect the patterns (routing table information).

The Ant Algorithm

ACO algorithms take inspiration from the behaviour of real ants in finding paths to food sources. The ants leave pheromone (a chemical substance) along the path to the destination, and other ants follow the path which contains the higher concentration of pheromone. This behaviour of ants has been applied to solving heuristic problems, in which a colony of artificial ants is collectively used to communicate indirectly and arrive at a solution. Although we have not implemented the Ant algorithm ourselves, we have used its routing table information [2] to train the neural network.

In [2] the authors have simulated the ACO algorithm where in a forward ant is launched from every node periodically. This forward ant pushes the address of the nodes it visits on to the memory stack carried by it. When it reaches the destination a Backward Ant is generated which follows the same path as that of the Forward Ant. The Backward Ant updates the routing table information while moving from the destination to the source. This routing table information has been used for our simulation purpose.

3 JavaNNS Simulator

The Stuttgart Neural Network Simulator (SNNS) was developed by a team at the University of Stuttgart. SNNS, which was developed for Unix workstations and Unix PCs, is an efficient universal simulator of neural networks. The simulator kernel is written in ANSI C, and the graphical user interface was written for X11R6.

JavaNNS is the successor of SNNS. Its graphical user interface, written in Java, is much more comfortable and user-friendly, and platform independence is also increased.

JavaNNS is available for the following operating systems: Windows NT/Windows 2000, Solaris, and RedHat Linux. It is freely available and can be downloaded from the link provided in the references [3].

4 Experimental Setup and the Results

A 12-node communication network has been used for the Ant Algorithm simulation [2]. The authors of that simulation used ns-2 to simulate the ACO algorithm; the algorithm updates the routing table every time a Backward Ant traces back to the source. The routing table contains multiple paths with different probabilities (pheromone values). We have taken one such routing table, for node 3, and normalized the paths, i.e., we have kept only the best path. We constructed the neural network in the JavaNNS simulator, which provides a graphical user interface for constructing the neural network. In the interface we can specify the number of layers, the type of each layer, the number of nodes in each layer, the activation function, and the type of connections (feed-forward, auto-associative, etc.). After constructing the neural network, we initialized the weights randomly, again using a control function, between -1 and +1. We converted the normalized routing table information file into an input/output pattern file; the same file has been used for validation as well. The neural network was then trained for 100 cycles, after which the error value fell below our desired value.
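As a hedged illustration of the pattern-file step, the sketch below one-hot encodes the destination and next-hop node addresses of the Node 3 table shown in Figure 2; the one-hot encoding itself is an assumption, since the paper does not state how node addresses were presented to the network:

```python
# Best-path routing table at node 3 (destination -> next hop), from Fig. 2.
routing_table = {0: 6, 1: 2, 2: 2, 4: 6, 5: 6, 6: 6, 7: 7, 8: 6, 9: 6, 10: 6, 11: 7}
NUM_NODES = 12

def one_hot(node_id, size=NUM_NODES):
    v = [0] * size
    v[node_id] = 1
    return v

# Each training pattern: input = one-hot destination, target = one-hot next hop.
patterns = [(one_hot(dest), one_hot(next_hop)) for dest, next_hop in routing_table.items()]
print(len(patterns), patterns[0])
```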

Various figures from the neural network simulation are included here. The 12-node communication network used for the Ant Algorithm simulation [2] is shown below.

Fig. 1: Network Topology used for Simulation on ns-2[2]

Once the Ant Algorithm has been simulated, it generates a routing table at every node. We have taken one such routing table (for node 3) for implementing the neural router. The routing table for node 3 is given below; it actually specifies multiple paths for every destination, and we have taken the best path for the purpose of simulating the neural network.

Dest Node    Next Node
0            6
1            2
2            2
4            6
5            6
6            6
7            7
8            6
9            6
10           6
11           7

Fig. 2: Routing table at Node 3

The following figure shows the three layer feed forward Neural Network after initializing with random weights but before training.


Fig. 3: Neural Network after initializing the weights.

The following figure shows the three-layer feed-forward neural network after it has been trained. We used the backpropagation algorithm for training the network, with a learning rate of 0.3.

Fig. 4: The three-layer feed-forward neural network after training


In the figure, the upper layer is the input layer, the middle one is the hidden layer, and the bottom one is the output layer. After initializing the weights randomly, we converted the routing table information into an input-output pattern file compatible with the JavaNNS simulator. We used the same file for training as well as validation.

While training, the simulator has the facility to plot the error graph. This graph indicates whether the error is decreasing or increasing, in other words whether the neural network is converging or not. The figure below indicates that the neural network has indeed converged and the error has fallen well below the desired value.

Fig. 5: The error graph

We trained the network for 100 cycles, and the error fell below 0.02 by the 100th cycle.
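For readers without JavaNNS, a rough Python analogue of the training setup described above (a feed-forward network trained by backpropagation with learning rate 0.3 for about 100 cycles) is sketched below using scikit-learn; the hidden-layer size is an assumption, as the paper does not report it:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# One-hot destination -> next-hop label, built from the Node 3 table (Fig. 2).
routing_table = {0: 6, 1: 2, 2: 2, 4: 6, 5: 6, 6: 6, 7: 7, 8: 6, 9: 6, 10: 6, 11: 7}
X = np.eye(12)[list(routing_table)]            # one-hot encoded destination addresses
y = np.array(list(routing_table.values()))     # next-hop node IDs as class labels

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    solver="sgd", learning_rate_init=0.3, max_iter=100)
mlp.fit(X, y)
print((mlp.predict(X) == y).mean())            # validated on the training patterns, as in the paper
```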

5 Conclusions and Future Work

We conclude that a feed-forward neural network can replace a routing table. In this paper we used the simulated routing table of the Ant Algorithm for training the neural network. The two can be combined so that the information is given to the neural network dynamically, allowing it to adapt to changes in the communication network.

References

[1] Chiu-Che Tseng and Max Garzon, Hybrid Distributed Adaptive Neural Router, Proceedings of ANNIE '98.
[2] V. Laxmi, Lavina Jain and M.S. Gaur, Ant Colony Optimization based Routing on ns-2, International Conference on Wireless Communication and Sensor Networks (WCSN), India, December 2006.
[3] University of Tubingen, JavaNNS: Java Neural Network Simulator, http://www.ra.cs.uni-tuebingen.de/software/JavaNNS/welcome_e.html.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Spam Filter Design Using HC, SA, TA

Feature Selection Methods

M. Srinivas, Dept. of CSE, JNTUCEA
Supreethi K.P., JNTUCE, Anantapur
E.V. Prasad, JNTUCE, Kakinada

Abstract

Feature selection is an important research problem in different statistical learning problems, including text categorization applications such as spam email classification. In designing spam filters, we often represent the email by the vector space model (VSM), i.e., every email is considered as a vector of word terms. Since there are many different terms in the email, and not all classifiers can handle such a high dimension, only the most powerful discriminatory terms should be used. Another reason is that some of these features may not be influential and might carry redundant information which may confuse the classifier. Thus feature selection, and hence dimensionality reduction, is a crucial step to get the best out of the constructed features. There are many feature selection strategies that can be applied to produce the resulting feature set. In this paper, we investigate the use of the Hill Climbing, Simulated Annealing, and Threshold Accepting optimization techniques as feature selection algorithms, and compare their performance with Linear Discriminant Analysis. Our experimental results show that all these techniques can be used not only to reduce the dimensionality of the e-mail representation, but also to improve the performance of the classification filter.

1 Introduction

The junk email problem is rapidly becoming unmanageable and threatens to destroy email as a useful means of communication. A tide of unsolicited emails floods into corporate and consumer inboxes every day. Most spam is commercial advertising, often for dubious products, get-rich-quick schemes, or quasi-legal services. People waste increasing amounts of their time reading and deleting junk emails. According to a recent European Union study, junk email costs all of us billions of (US) dollars per year, and many major ISPs say that spam adds to the cost of their service. There is also the fear that such emails could hide viruses which can then infect the whole network. Future mailing systems will require more capable filters to help us select what to read and to keep us from spending more time processing incoming messages.

Many commercial and open-source products exist to accommodate the growing need for spam classifiers, and a variety of techniques have been developed and applied toward the problem, both at the network and user levels. The simplest and most common approaches use filters that screen messages based upon the presence of words or phrases common to junk e-mail. Other simplistic approaches include black-listing (i.e., automatic rejection of messages received from the addresses of known spammers) and white-listing (i.e., automatic acceptance of messages received from known and trusted correspondents). In practice, effective spam filtering uses a combination of these three techniques. In this paper, we only discuss how to classify junk emails and legitimate emails based on words or features. From the machine learning viewpoint, spam filtering based on the textual content of email can be viewed as a special case of text categorization, with the categories being spam or non-spam. In text categorization [5], the text can be represented by the vector space model (VSM). Each email can be transformed into this model, meaning that every email is considered as a vector of word terms. Since there are many different words in the email and not all classifiers can handle such a high dimension, we should choose only the most powerful discriminatory terms from the email. Another reason for applying feature selection is that the reduction of the feature space dimension may improve the classifier's prediction accuracy by alleviating the data sparseness problem.

In this paper, we investigate the use of the Hill Climbing (HC), Simulated Annealing (SA), and Threshold Accepting (TA) local search optimization techniques [8] as feature selection algorithms. We also compare the performance of the above three techniques with Linear Discriminant Analysis (LDA) [3]. Our results indicate that, using a K-Nearest Neighbor (KNN) classifier [1], the accuracy of spam filters using any of the above strategies outperforms that obtained without feature selection. Among the four approaches, SA reaches the best performance. The rest of the paper is organized as follows. Section 2 introduces the experimental settings and the related feature selection strategies. In Section 3, we report the experimental results obtained, and Section 4 concludes the paper.

2 Experimental Settings

In our experiments, we first transform the emails into vectors using the TF-IDF formulas [5]. We then apply the proposed feature selection strategies and finally compare the accuracy obtained with the four strategies.

2.1 Data Sets

Unlike general text categorization tasks, where many standard benchmark collections exist, it is very hard to collect legitimate e-mails, for the obvious reason of protecting personal privacy. In our experiment we use the PU1 corpus [10]. This corpus consists of 1099 messages, 481 of which are marked as spam and 618 of which are labeled as legitimate, giving a spam rate of 43.77%. The messages in the PU1 corpus have header fields and HTML tags removed, leaving only the subject line and mail body text. To address privacy, each token was mapped to a unique integer. The corpus comes in four versions: with or without stemming and with or without stop-word removal. In our experiment we use the lemmatizer-enabled, stop-list-enabled version of the PU1 corpus. This corpus has already been parsed and tokenized into individual words, with binary attachments and HTML tags removed. We randomly chose 62 legitimate e-mails and 48 spam e-mails for testing and used the remaining e-mails for training.

2.2 Classifiers

K-nearest neighbor classification is an instance-based learning algorithm that has shown to be very effective in text classification. The success of this algorithm is due to the availability of effective similarity measure among the K nearest neighbor. The algorithm starts by
calculating the similarity between the test e-mail and all e-mails in the training set. It then picks the K closest instances and assigns the test e-mail to the most common class among these nearest neighbors. Thus, after transforming the training e-mails and test e-mails into vectors, the second step is to find the K vectors from the training set that are most similar to the test vector. In this work, we used the Euclidean distance as the measure of similarity between vectors.
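To make the classification step concrete, the following minimal Python sketch shows KNN with the Euclidean distance as described above. The function and array names (knn_classify, train_vectors, train_labels) are illustrative and not taken from the paper, and labels are assumed to be 1 for spam and 0 for legitimate.

import numpy as np
from collections import Counter

def knn_classify(test_vec, train_vectors, train_labels, k=30):
    """Assign test_vec to the most common class among its k nearest
    training vectors, using Euclidean distance as the similarity measure."""
    # Euclidean distance from the test e-mail to every training e-mail
    dists = np.linalg.norm(train_vectors - test_vec, axis=1)
    # indices of the k closest training e-mails
    nearest = np.argsort(dists)[:k]
    # majority vote among the k neighbours (1 = spam, 0 = legitimate)
    votes = [train_labels[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]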

2.3 Transfer E-mails to Vectors

In text categorization, the text can be represented by the vector space model. Since terms appearing frequently in many e-mails have limited discrimination power, we use the term frequency and inverse document frequency (TF-IDF) representation to represent each e-mail [5]: the more often a term appears in an e-mail, the more important that term is for the e-mail. In our experiment, we sorted the features by document frequency (DF), i.e., the number of e-mails that contain the i-th feature, and chose 100 features whose DF lies in the range 0.02 to 0.5 [2]. Thus the input to the feature selection algorithms is a feature vector of length 100. We then applied the feature selection algorithms to find the most powerful discriminatory terms among the 100 features and tested the performance of the resulting e-mail filter.
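The DF pre-selection and TF-IDF weighting can be sketched as follows. The exact TF-IDF variant of [5] is not reproduced in the paper, so the common tf * log(N/df) form is assumed here, and all names are illustrative.

import math
from collections import Counter

def select_by_df(docs, low=0.02, high=0.5, max_features=100):
    """Keep terms whose document frequency (fraction of e-mails containing
    the term) lies in [low, high], then take the max_features most frequent."""
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    kept = [(t, c) for t, c in df.items() if low <= c / n_docs <= high]
    kept.sort(key=lambda tc: tc[1], reverse=True)
    return [t for t, _ in kept[:max_features]], df

def tfidf_vector(doc, vocab, df, n_docs):
    """TF-IDF representation of one tokenized e-mail over the selected vocabulary."""
    tf = Counter(doc)
    return [tf[t] * math.log(n_docs / df[t]) for t in vocab]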

2.4 Performance Measures

We now introduce the performance measures used in this paper. Let N=A+B+C+D be the total number of test e-mails in our corpus.

Table 1: Confusion Matrix

                               Actual: Spam    Actual: Non-Spam
Filter Decision: Spam               A                 B
Filter Decision: Non-Spam           C                 D

If table 1 denotes the confusion matrix of the e-mail classifier, then we define the accuracy, precision, recall, and F1 for spam e-mails as follows:

ACCURACY = (A + D) / N,   PRECISION (P) = A / (A + B),
RECALL (R) = A / (A + C),   F1 = 2PR / (P + R)

Similar measures can be defined for legitimate e-mails.
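For reference, the four measures follow directly from the counts of Table 1; a small Python helper (the function name is ours, not the paper's):

def spam_metrics(A, B, C, D):
    """A, B, C, D as in Table 1 (rows: filter decision, columns: true class)."""
    N = A + B + C + D
    accuracy = (A + D) / N
    precision = A / (A + B)
    recall = A / (A + C)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1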

2.5 Feature Selection Strategies

The output of vector space modeling (VSM) is a relatively long feature vector that may contain redundant and correlated features (the curse of dimensionality). This is the main motivation for using feature selection techniques. The proposed feature selection algorithms are classifier dependent: different possible feature subsets are examined by the algorithm, the performance of a prespecified classifier is tested for each subset, and finally the best discriminatory feature subset is chosen. There are many feature selection strategies (FSS) that can be applied to produce the resulting feature set. In what follows, we describe and report results obtained with our proposed FSS.

2.5.1 Hill Climbing (HC)

The basic idea of HC is to choose, from the neighborhood of a given solution, the solution that improves it most, and to stop if the neighborhood does not contain an improving solution [8]. The hill climbing procedure used in this paper can be summarized as follows:

1. Randomly create an initial solution S1. This solution corresponds to a binary vector of length equal to the total number of features in the feature set under consideration; the positions set to 1 denote the features selected by this particular solution. Set I* = S1 and calculate its corresponding accuracy Y(S1).

2. Generate a random neighboring solution S2 based on I* and calculate its corresponding accuracy Y(S2).

3. Compare the two accuracies. If the accuracy of the neighboring solution Y(S2) is higher than Y(S1), set I* = S2.

4. Repeat steps 2 to 3 for a pre-specified number of iterations (or until a certain criterion is reached).

Although hill climbing has been applied successfully to many optimization problems, it has one main drawback. Since only improving solutions are chosen from the neighborhood, the method stops as soon as the first local optimum with respect to the given neighborhood has been reached. Generally, this solution is not globally optimal and no information is available on how much its quality differs from the global optimum. A first attempt to overcome the problem of getting stuck in a local optimum is to restart iterative improvement several times from different initial solutions (multiple restart). All the resulting solutions are still only locally optimal, but one can hope that the next local optimum found improves on the best solution found so far. In our experiments, we used 10 different initial solutions.
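A compact Python sketch of the hill-climbing loop with multiple restarts described above. Y stands for the evaluation of a feature subset (for example, the accuracy of the KNN filter trained on the selected features) and is passed in as a function; the single-bit-flip neighbourhood is a common choice assumed here rather than specified in the paper.

import random

def hill_climb(n_features, Y, iterations=1000, restarts=10):
    """Binary vector of length n_features; positions set to 1 mark selected features.
    Y(solution) must return the filter accuracy for that feature subset."""
    best, best_acc = None, -1.0
    for _ in range(restarts):                              # multiple restart
        current = [random.randint(0, 1) for _ in range(n_features)]
        acc = Y(current)
        for _ in range(iterations):
            neighbour = current[:]                         # random neighbouring solution:
            neighbour[random.randrange(n_features)] ^= 1   # flip one feature in or out
            n_acc = Y(neighbour)
            if n_acc > acc:                                # accept only improving moves
                current, acc = neighbour, n_acc
        if acc > best_acc:
            best, best_acc = current, acc
    return best, best_acc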

2.5.2 Simulated Annealing (SA)

Kirkpatrick et al. [6] proposed SA, a local search technique inspired by the cooling process of molten metals. It merges HC with the probabilistic acceptance of non-improving moves. Like HC, SA iteratively constructs a sequence of solutions in which two consecutive solutions are neighbors. However, in SA the next solution does not necessarily have a better objective value than the current solution, which makes it possible to escape local optima. First, a solution is chosen from the neighborhood of the current solution. Then, depending on the difference between the objective values of the chosen and the current solution, it is decided whether we move to the chosen solution or stay with the current one. If the chosen solution has a better objective value, we always move to it; otherwise we move to it with a probability that depends on the difference between the two objective values. More precisely, if S1 denotes the current solution and S2 the chosen solution, we move to S2 with probability:

p(S1, S2) = exp(-max{Y(S1) - Y(S2), 0} / T)        (1)

The parameter T is a positive control parameter (the temperature) which decreases with an increasing number of iterations and converges to 0. As the temperature is lowered, it becomes ever more difficult to accept worsening moves; eventually only improving moves are allowed and the process becomes 'frozen'. The algorithm terminates when the stopping criterion is met [7]. Furthermore, the probability above has the property that large
deteriorations of the objective function are accepted with lower probability than small deteriorations. The simulated annealing used in this paper can be summarized as follows:

1. Randomly create an initial solution S1. This solution corresponds to a binary vector of length equal to the total number of features in the feature set under consideration; the positions set to 1 denote the features selected by this particular solution. Set I* = S1 and calculate its corresponding accuracy Y(S1).

2. Choose an initial temperature T and a constant cooling factor a, 0 < a < 1.

3. Generate a random neighboring solution S2 based on I* and calculate its corresponding accuracy Y(S2).

4. Compare the two accuracies. If the accuracy of the neighboring solution Y(S2) is higher than Y(S1), set I* = S2. Otherwise, generate U = rand(0, 1) and compare U with p(S1, S2); if U <= p(S1, S2), set I* = S2.

5. Decrease the temperature by T = T * a.

6. Repeat steps 3 to 5 for a pre-specified number of iterations (or until a certain criterion is reached).

For comparison with HC, we used the same 10 initial solutions and recorded the best solutions found.
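The same loop with the acceptance rule of Eq. (1) gives simulated annealing; the following Python sketch is illustrative, and the initial temperature, cooling factor and bit-flip neighbourhood are assumptions rather than the paper's settings.

import math
import random

def simulated_annealing(n_features, Y, iterations=1000, temp=1.0, alpha=0.95):
    current = [random.randint(0, 1) for _ in range(n_features)]
    acc = Y(current)
    best, best_acc = current[:], acc
    for _ in range(iterations):
        neighbour = current[:]
        neighbour[random.randrange(n_features)] ^= 1      # random neighbour
        n_acc = Y(neighbour)
        # always accept improvements; accept deteriorations with
        # probability exp(-max(Y(S1) - Y(S2), 0) / T), as in Eq. (1)
        if n_acc > acc or random.random() <= math.exp(-max(acc - n_acc, 0.0) / temp):
            current, acc = neighbour, n_acc
        if acc > best_acc:
            best, best_acc = current[:], acc
        temp *= alpha                                     # cooling schedule T = T * a
    return best, best_acc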

2.5.3 Threshold Accepting (TA)

A variant of simulated annealing is the threshold accepting method. It was designed by Dueck and Scheuer [8] as a partially deterministic version of simulated annealing. The only difference between simulated annealing and threshold accepting is the mechanism for accepting the neighboring solution. Where simulated annealing uses a stochastic model, threshold accepting uses a static one: if the difference between the objective values of the chosen and the current solution is smaller than a threshold T, we move to the chosen solution; otherwise we stay at the current solution [8]. Again, the threshold is a positive control parameter which decreases with an increasing number of iterations and converges to 0. Thus, in each iteration we allow moves which do not deteriorate the current solution by more than the current threshold T, and finally we only allow improving moves. The steps of the threshold accepting algorithm used in this paper are identical to those of SA, except that step 4 above is replaced by the following:

4. Compare the two accuracies Y(S2) and Y(S1). If the accuracy of the neighboring solution Y(S2) is higher than Y(S1), set I* = S2. Otherwise, set β = Y(S1) - Y(S2) and compare β with T; if β < T, set I* = S2.
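Relative to the SA sketch earlier, only the acceptance test changes; a one-function illustration (names are ours, not the paper's):

def ta_accept(acc_current, acc_neighbour, threshold):
    """Threshold-accepting rule: move to the neighbour if it is better, or if the
    deterioration Y(S1) - Y(S2) is smaller than the current threshold T."""
    return acc_neighbour > acc_current or (acc_current - acc_neighbour) < threshold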

2.5.4 Linear Discriminant Analysis (LDA)

LDA is a well-known technique for the class separation problem. LDA can be used to determine the set of the most discriminant projection axes: after projecting all the samples onto these axes, the projected samples have maximum between-class scatter and minimum within-class scatter in the projected feature space [3].

Let X^1 = {x^1_1, ..., x^1_l1} and X^2 = {x^2_1, ..., x^2_l2} be samples from two different classes and, with some abuse of notation, let X = X^1 ∪ X^2 = {x_1, ..., x_l}. The linear discriminant is given by the vector W that maximizes [9]
J(W) = (W^T S_B W) / (W^T S_W W)

and S_B = (m1 - m2)(m1 - m2)^T,  S_W = sum_{i=1,2} sum_{x in X^i} (x - m_i)(x - m_i)^T

are the between-class and within-class scatter matrices, respectively, and m_i is the mean of class i. The intuition behind maximizing J(W) is to find a direction that maximizes the separation between the projected class means (the numerator) while minimizing the class variance in this direction (the denominator). After the linear transformation W is found, the data set can be transformed by y = W^T x. For the c-class problem, the natural generalization of the linear discriminant involves c-1 discriminant functions; thus, the projection is from a d-dimensional space to a (c-1)-dimensional space [1].
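A short NumPy sketch of the two-class Fisher discriminant defined above; X1 and X2 hold the class samples as rows, and the small ridge term added to S_W for numerical stability is an implementation choice, not part of the paper.

import numpy as np

def fisher_lda(X1, X2, ridge=1e-6):
    """Return the projection vector w maximizing J(W) = (w^T S_B w) / (w^T S_W w)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum over both classes of (x - m_i)(x - m_i)^T
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    S_W += ridge * np.eye(S_W.shape[0])
    # for two classes the maximizer is proportional to S_W^{-1} (m1 - m2)
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

# projected (one-dimensional) samples: y = X @ w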

3 Experiment Results

We conducted experiments to compare the performance of the proposed feature selection algorithms. Throughout our experiments, we used a KNN classifier with K=30. It is clear that the system performance with the feature selection strategies is better than the system performance without them. For LDA, only 9 spam e-mails among the 48 spam cases and 2 legitimate e-mails of the 62 legitimate cases were misclassified. For HC, only 4 spam e-mails among the 48 spam cases and 3 legitimate e-mails of the 62 legitimate cases were misclassified. For TA, only 3 spam e-mails among the 48 spam cases and 3 legitimate e-mails of the 62 legitimate cases were misclassified. For SA, only 4 spam e-mails among the 48 spam cases and 1 legitimate e-mail of the 62 legitimate cases were misclassified. Among all four strategies, SA reached the best performance. Accuracy ordering: SA > TA > HC > LDA.

4 Conclusion

In this paper, we proposed the use of three different local search optimization techniques as feature selection strategies for spam e-mail filtering. The experimental results show that the proposed strategies not only reduce the dimensionality of the e-mail representation, but also improve the performance of the classification filter. We obtained a classification accuracy of 90.0% for LDA, 93.6% for HC, 94.6% for TA and 95.5% for SA, as compared to 88.1% for the system without feature selection.

References

[1] R. Duda, P. Hart and D. Stork, "Pattern Classification," John Wiley and Sons, 2001.
[2] N. Soonthornphisaj, K. Chaikulseriwat and P. Tang-On, "Anti-Spam Filtering: A Centroid-Based Classification Approach," 6th IEEE International Conference on Signal Processing, pp. 1096-1099, 2002.
[3] L.F. Chen, H.Y.M. Liao, M.T. Ko, J.C. Lin and G.J. Yu, "A new LDA-based face recognition system which can solve the small sample size problem," Pattern Recognition, vol. 33, pp. 1713-1726, 2000.
[4] C. Lai and M. Tsai, "An empirical performance comparison of machine learning methods for spam e-mail categorization," Proceedings of the 4th International Conference on Hybrid Intelligent Systems (HIS'04), 2004.
[5] J.F. Pang, D. Bu and S. Bai, "Research and Implementation of Text Categorization System Based on VSM," Application Research of Computers, 2001.
[6] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, "Optimization by Simulated Annealing," Science, pp. 671-680, May 1983.
[7] J.A. Clark, J.L. Jacob and S. Stepney, "The Design of S-Boxes by Simulated Annealing," Evolutionary Computation, vol. 2, pp. 1533-1537, June 2004.
[8] J. Hurink, "Introduction to Local Search."
[9] S. Mika, G. Ratsch, J. Weston, B. Scholkopf and K. Mullers, "Fisher discriminant analysis with kernels," Neural Networks for Signal Processing IX, pp. 41-48, Madison, WI, Aug. 1999.
[10] http://iit.demokritos.gr/skel/i-config/downloads


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Analysis & Design of a New Symmetric Key

Cryptography Algorithm and Comparison with RSA

Sadeque Imam Shaikh Dept. of CSE, University of Science & Technology Chittagong(USTC), Bangladesh

[email protected]

Abstract

Networking is the main technology for communication. There are various types of networks, but all of them are vulnerable to attacks that threaten valuable information. Cryptography is so far the main weapon that can reduce unauthorized attacks on valuable information. The first part of this paper presents a literature review of various types of cryptographic algorithms. In the second part, as a way of generating new ideas, a new symmetric key cryptography algorithm is developed. Although this algorithm is based on a symmetric key, it has a few similarities with RSA, and consequently the two algorithms are compared using an example in the third part. Finally, the source code of the algorithm has been written using the Turbo C++ compiler, and it successfully encrypts and decrypts information.

Keywords: Information Security, Cryptography, Algorithm, Keys, Encryption, Decryption.

1 Proposed Methodologies

Mathematics and programming are the main components of this paper, since cryptography depends entirely on mathematics. Turbo C++ programming has been used to implement the encryption and decryption of the algorithm in practice. Books and research papers have also been consulted in developing the new concept.

2 Literature Review

In 1994, the Internet Architecture Board (IAB) published a report clearly indicating that information carried over networks and the Internet requires more effective security. It also described the vulnerabilities of information systems to unauthorized access and control of network traffic. Those reports were borne out by the Computer Emergency Response Team Coordination Centre (CERT/CC), which reported that attacks on the Internet and on networks had increased rapidly over the previous ten years. That is why a wide range of technologies and tools is needed to face this growing threat, and a strong cryptographic algorithm is the main weapon that can meet the challenge. Cryptographic systems are basically of two types: public-key and secret-key cryptography. Asymmetric cryptography manipulates two separate keys for encoding and decoding and provides a robust mechanism for key transportation. On the other hand, private key cryptography uses an identical key for both encoding and decoding, which is more efficient for large amounts of data [Shin and Huang, 2007]. Suppose there are 4 entities; then there are 6 pairwise relationships. From the symmetric point of view, to
maintain security, this system will require 6 secret keys, whereas from the asymmetric point of view the 4 entities will require only 4 key pairs. From a networking point of view, each network may have many pairs of relationships, and for symmetric key cryptography it is a big challenge to maintain security for so many secret keys compared with asymmetric key cryptography. The inventors of public key algorithms stated that the role of public key cryptography for key management and signature submission is almost universally established [Diffie, 1988]. In 1976, Diffie and Hellman explained the new approach of public key cryptography and challenged mathematicians to devise better public key methods [Diffie and Hellman, 1976]. The first response to this challenge came in 1978 from Ron Rivest, Adi Shamir and Len Adleman of MIT, who introduced a new public key technique known as RSA, which remains one of the best public key cryptographic techniques [Rivest et al., 1978].

Cryptanalysis of RSA is completely based on factoring n into two prime numbers; determining φ(n) for n is equivalent to factoring n. Using the currently available algorithms to calculate d from e and n is as time-consuming as the factoring problem [Kaliski and Robshaw, 1995]. The problem with RSA, however, is that it always requires large prime numbers; with small prime numbers it may not be effective. This is another important point that has been considered in designing a new algorithm to compare with RSA. Stronger security for public key distribution can be achieved by providing tighter control over the allocation of public keys from the directory [Popek and Kline, 1979]. The first alternative approach was suggested by Kohnfelder, who proposed the use of certificates that users can employ to exchange keys without contacting a public key authority; the key transfer takes place so reliably that it is as if the keys were obtained directly from the public key authority. Although one of the main advantages of secret key cryptography is its speed and efficiency, in July 1998 the strength of DES failed when the Electronic Frontier Foundation (EFF) announced that a DES-encrypted message had been broken [Sebastopol and Reilly, 1998]. In November 2001, the National Institute of Standards and Technology (NIST) declared the Advanced Encryption Standard (AES) as the replacement for the Data Encryption Standard (DES) [Mucci et al., 2007]. A crucial point underlying RSA-based cryptographic schemes is the assumption that it is difficult to factor big values which are the product of prime factors. A list of challenge numbers documents the capabilities of known factoring algorithms, and the current world record is 193 decimal digits, factored in 2005. Common minimum requirements suggest the use of numbers with at least 1,024 bits, which corresponds to 309 decimal digits [Geiselmann and Steinwandt, 2007].

3 Designing New Algorithm

Encryption

1. Choose two prime numbers P and Q.
2. Calculate N = P * Q.
3. Find the set S of numbers relatively prime to N.
4. Randomly choose one number from S, say S1.
5. Calculate S1^P and S1^Q.
6. Find the largest prime number between S1^P and S1^Q, say X.
7. Let the plain text be TEXT.
8. Find the next prime number greater than X, say Y.
9. Calculate V = X + Y.
10. For encryption, calculate the transition value T = TEXT XOR Y.
11. Find the cipher text CT = T XOR V.
12. The key would be PK = X * strlen(CT).

Decryption

1. Calculate the length L of the cipher text.
2. Compute H = PK / L.
3. Find the next prime number greater than H, i.e. H1.
4. Compute the sum SUM = H + H1.
5. Calculate CT = CT XOR SUM.
6. Finally, the receiver obtains the plain text as PT = CT XOR H1.

4 Comparison with RSA for Same Input Prime Numbers (5 and 3)

RSA Algorithm [Kahate, 2003]

1. Choose two prime numbers, say P = 5 and Q = 3.
2. Calculate N = P * Q = 5 * 3 = 15.
3. Select the public key (i.e. the encryption key) E such that it is not a factor of (P-1) and (Q-1). As we can see, (P-1)*(Q-1) = 4*2 = 8; the factors of 8 are 2, 2 and 2, so the public key E must not have 2 as a factor. Let us choose E = 5.
4. Select the private key D such that (D*E) mod (P-1)*(Q-1) = 1. Let us choose D = 5, because (5*5) mod 8 = 1, which satisfies the condition.
5. Let the plain text be PT = 688.
6. For encryption, calculate the cipher text CT = PT^E mod N = 688^5 mod 15 = 13.
7. Send CT as the cipher text to the receiver.
8. For decryption, calculate the plain text PT = CT^D mod N = 13^5 mod 15 = 13.
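The arithmetic of this toy example can be checked with Python's built-in modular exponentiation; note that because PT = 688 is larger than N = 15, decryption can only recover PT mod N = 13, which is exactly what step 8 shows.

P, Q, E, D = 5, 3, 5, 5
N = P * Q                  # 15
PT = 688
CT = pow(PT, E, N)         # 688^5 mod 15 = 13
print(CT, pow(CT, D, N))   # 13 13  (i.e. 688 mod 15, not 688 itself)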

New Algorithm (Encryption)

1. Choose two distinct prime numbers, say P = 5 and Q = 3.
2. Calculate N = P * Q = 5 * 3 = 15.
3. Find the set S of numbers relatively prime to N = 15: S = {1, 2, 4, 7, 8, 11, 13, 14}.
4. Randomly choose one number from S, say S1 = 13.
5. Calculate S1^P = 13^5 = 371293 and S1^Q = 13^3 = 2197.
6. Find the largest prime number X between S1^P and S1^Q: X = 371291.
7. Let the plain text be 688.
8. Find the next prime number Y greater than X: Y = 371299.
9. Calculate V = X + Y = 371291 + 371299 = 742590.
10. For encryption, calculate the transition value T = TEXT XOR Y = 688 XOR 371299 = 370899.
11. Find the cipher text CT = T XOR V = 370899 XOR 742590 = 982125.
12. The private key would be PK = X * strlen(CT) = 371291 * 3 = 1113873.

New Algorithm (Decryption)

1. Calculate the length L of the cipher text CT, i.e. L = 3.
2. Compute H = PK / L = 1113873 / 3 = 371291.
3. Find the next prime number H1 greater than H, i.e. H1 = 371299.
4. Compute the sum S = H + H1 = 371291 + 371299 = 742590.
5. Calculate CT XOR S = 982125 XOR 742590 = 370899.
6. Finally, the receiver obtains the plain text: PT = 370899 XOR 371299 = 688.
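The XOR round trip of this worked example can be checked in a few lines of Python, taking X = 371291 and Y = 371299 as given above; the next-prime searches are not re-implemented here, and strlen(CT) is taken as 3, as in the example.

X, Y = 371291, 371299        # primes chosen in encryption steps 6 and 8
TEXT = 688
V = X + Y                    # 742590
T = TEXT ^ Y                 # 370899
CT = T ^ V                   # cipher text
PK = X * 3                   # key, with strlen(CT) taken as 3

# receiver side
H = PK // 3                  # 371291
H1 = Y                       # next prime after H, as given in the example
S = H + H1                   # 742590
PT = (CT ^ S) ^ H1
print(CT, PT)                # PT == 688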

5 Advantages of New Algorithm Over RSA

1. RSA recommends using large prime numbers, i.e. it is not very effective with small prime numbers. For example, with two small prime numbers, say 5 and 3, both the encryption and the decryption key become 5, which one can obtain with no difficulty.

On the other hand, with the same prime numbers given above, the new algorithm provides the private key 1113873. In this case the new algorithm is better.

2. The RSA algorithm uses the public key directly to encode the plain text, whereas the new algorithm does not use the key directly but rather uses a transitional private key. Taking the example above, RSA encrypts the plain text PT as follows:

CT = PT^E mod N, where N is 15 and E is 5.

In the new algorithm, however, the private key depends upon the length of the encrypted TEXT:

PK = X * strlen(CT)

where PK is the private key, X is a prime number and CT is the cipher text. So, finally, in the new algorithm one cannot obtain the actual private key directly, which is possible with RSA.

3. If we study the RSA algorithm, we find that the effectiveness of RSA mostly depends upon the size of the two prime numbers, a dependence that is absent in the new algorithm. With the new algorithm, even small prime numbers yield effectively encrypted data.

5.1 Disadvantages of New Algorithm Over RSA

1. The new algorithm is a symmetric key algorithm whereas RSA is an asymmetric key algorithm, so it has some limitations compared with RSA. Moreover, RSA often accepts two identical prime numbers, while the new algorithm never accepts two identical prime numbers as input.


6 Output of Encryption Window

Fig. 1

6.1 Output of Decryption Window

Fig. 2

7 Conclusion

Cryptography, especially public key cryptography, is one of the hot topics in information security. If the security, privacy and integrity of an information system are to be maintained, there is no alternative to cryptography; that is why even in satellite communication both the ground stations and the satellite in its distant orbit transfer and receive information using encryption and decryption, to ensure security and privacy for all subscribers. With the passage
of time, cryptographic techniques keep changing, because ciphers that were earlier considered effective later become insecure. There is therefore always scope for developing and researching cryptographic algorithms. From this point of view, the new symmetric algorithm described in this paper may be helpful for further research towards maximum information security.

References

[1] [Diffie, 1988] Diffie, W., "The first ten years of public key cryptography", Proceedings of the IEEE, May 1988.
[2] [Diffie and Hellman, 1976] Diffie, W. and Hellman, M., "Multi-user cryptographic technique", IEEE Transactions on Information Theory, November 1976.
[3] [Geiselmann and Steinwandt, 2007] Willi Geiselmann and Rainer Steinwandt, "Special-Purpose Hardware in Cryptanalysis: The Case of 1024-bit RSA", IEEE Computer Society, 2007.
[4] [Kahate, 2003] Atul Kahate, "Cryptography and Network Security", Tata McGraw-Hill Publishing Company Limited, 2003, pp. 115-119.
[5] [Kaliski and Robshaw, 1995] Kaliski, B. and Robshaw, M., "The secure use of RSA", CryptoBytes, Autumn 1995.
[6] [Mucci et al., 2007] C. Mucci, L. Vanzolini, A. Lodi, A. Deledda, R. Guerrieri, F. Campi and M. Toma, "Implementation of AES/Rijndael on a dynamically reconfigurable architecture", Design, Automation & Test in Europe Conference & Exhibition, IEEE, 2007.
[7] [Popek and Kline, 1979] Popek, G. and Kline, C., "Encryption and secure computer networks", ACM Computing Surveys, December 1979.
[8] [Rivest et al., 1978] Rivest, R., Shamir, A. and Adleman, L., "A method for obtaining digital signatures and public key cryptosystems", Communications of the ACM, February 1978.
[9] [Sebastopol and Reilly, 1998] Electronic Frontier Foundation, "Cracking DES: Secrets of Encryption Research, Wiretap Politics and Chip Design", O'Reilly, Sebastopol, CA, 1998.
[10] [Shin and Huang, 2007] Shin-Yi Lin and Chih-Tsun Huang, "A High-Throughput Low-Power AES Cipher for Network Applications", Asia and South Pacific Design Automation Conference 2007, IEEE Computer Society.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

An Adaptive Multipath Source Routing

Protocol for Congestion Control and Load

Balancing in MANET

Rambabu Yerajana, Department of ECE, Indian Institute of Technology Roorkee, [email protected]
A. K. Sarje, Department of Computer ECE, Indian Institute of Technology Roorkee, [email protected]

Abstract

In this paper, we propose a new multipath routing protocol for ad hoc wireless networks, based on the DSR (Dynamic Source Routing) on-demand routing protocol. Congestion is the main reason for packet loss in mobile ad hoc networks. If the workload is distributed among the nodes in the system based on the congestion of the paths, the average execution time can be minimized and the lifetime of the nodes can be maximized. We propose a scheme to distribute load between multiple paths according to the congestion status of each path. Our simulation results confirm that the proposed protocol, CCSR, improves throughput and reduces the number of collisions in the network.

Keywords: Ad hoc networks, congestion control and load balancing, routing protocols.

1 Introduction

A mobile ad hoc network is a collection of wireless mobile hosts forming a temporary network without the aid of any fixed infrastructure or centralized administration. All nodes can function, if needed, as relay stations for data packets to be routed to their final destination. Routing in mobile environments is challenging due to the constraints on resources (transmission bandwidth, CPU time, and battery power) and the required ability of the protocol to effectively track topological changes.

Routing protocols for ad hoc networks can be classified into three categories: proactive, on-demand (also called reactive), and hybrid protocols [7, 8]. The primary characteristic of proactive approaches is that each node in the network maintains a route to every other node in the network at all times. In reactive routing techniques, also called on-demand routing, routes are only discovered when they are actually needed. When a source node needs to send data packets to some destination, it checks its route table to determine whether it has a route to that destination. If no route exists, it performs a route discovery procedure to find a path to the destination; hence, route discovery becomes on-demand. Dynamic Source Routing (DSR) and Ad hoc On-demand Distance Vector (AODV) are on-demand routing protocols; our proposed protocol is based on DSR [1, 3, 4, 5].


The rest of this paper is organized as follows. Section 2 gives a brief introduction to the DSR protocol, and Section 3 describes our proposed routing protocol, CCSR. In Section 4, the performance of CCSR is compared with DSR and AODV, and Section 5 concludes the paper.

2 Dynamic Source Routing Protocol

In the DSR protocol, if a node has a packet to transmit to another node, it checks its route cache for a source route to the destination [1, 6, 7, 8]. If there is already an available route, the source node simply uses that route immediately; if there is more than one source route, it chooses the route with the shortest hop count. The 'Source Route' included by the source node in the packet header lists all intermediate nodes to be traversed when the packet is sent to its destination in the ad hoc network. The source node initiates route discovery if there are no routes in its cache. Each route request may discover multiple routes, and all routes are cached at the source node.

The Route Reply packet is sent back by the destination to the source by reversing the node list accumulated in the Route Request packet; the reversed node list forms the 'Source Route' for the Route Reply packet. The DSR design includes loop-free discovery of routes, and discovering multiple paths in DSR is possible because paths are stored in the cache [6, 8]. Due to the dynamic topology of ad hoc networks, a single path is easily broken and the route discovery process has to be performed again. In ad hoc networks, multipath routing is therefore better suited than single-path routing in terms of stability and load balance.

3 Cumulative Congestion State Routing Protocol Based on Delimiters

Our motivation is that congestion is a dominant cause of packet loss in MANETs. Unlike well-established networks such as the Internet, in a dynamic network like a MANET it is expensive, in terms of time and overhead, to recover from congestion. Our proposed CCSR protocol therefore tries to prevent congestion from occurring in the first place. CCSR uses the congestion status of the whole path (the congestion status of all nodes participating in the route), and the source node maintains a table called the Congestion Status Table (Cst) that contains the congestion status of every path from the source node to the destination node.

(Figure: a network with source S, intermediate nodes 1-5 and destination D; the routes are annotated with cumulative congestion status values cs(D), cs(D)+cs(5) and cs(D)+cs(4)+cs(3).)

Fig. 1: Using Ccsp Packets

A simplified example is illustrated in Fig. 1. Three possible routes, S->1->2->D, S->5->D and S->3->4->D, are multipath routes between the source node S and the destination node D.


Source node S maintains a special table called the Congestion Status Table, which stores the congestion status of every path; note that the congestion status is calculated not for a single node but over all nodes of the path (cumulative congestion status).

3.1 Load Distribution

In CCSR, the destination node periodically sends cumulative congestion status packets (Ccsp) towards the source node. After receiving the Ccsp packets, the source node updates the Cst table. The distribution procedure at the source node then distributes the available packets according to the delimiters used. The CCSR protocol uses three delimiters to decide how many packets should be sent over congested paths. According to the Cst table, the source node distributes the packets such that more packets go towards paths with lower congestion status and fewer packets towards paths with higher congestion status. Table 1 shows the Cst table of the source node S; the congestion status of a path is calculated as follows:

Cs(A): the congestion status of node A.

Ccs(B): the cumulative congestion status at node B, calculated as the congestion status of node B plus the congestion status of its previous nodes on the path.

The congestion status of a particular node is calculated from the available buffer size (queue length) and the number of packets: the ratio of data to the available queue length gives the congestion status of that node.

Ccs(D): cumulative congestion status at node D of the path S,1,2,D = Cs(D).

Ccs(1): cumulative congestion status at node 1 of the path S,1,2,D = Cs(D) + Cs(1).

Ccs(3): cumulative congestion status at node 3 of the path S,3,4,D = Cs(D) + Cs(4) + Cs(3).

A typical Cst table of the source node S is shown in Table 1, where Path ID indicates the nodes involved in routing and Congestion Status indicates the cumulative congestion status of all nodes involved in the route path. After updating the latest congestion status, the source node chooses the paths and distributes the packets.

The load distribution procedure is shown below.

/*
a, b and c denote the numbers of packets queued at the nodes of the corresponding paths A, B and C, and x, y and z denote the queue lengths of those paths; a/x, b/y and c/z then give the congestion status of the paths.

NOPACK is the number of data packets available at the source node.

L, M and U are congestion status delimiters: a low congestion status means traffic towards the path is low, and a high congestion status means traffic towards the path is very high. L = Low, M = Medium and U = High are the delimiters used for load distribution.

CL = minimum value in the congestion status table and
CU = maximum value in the congestion status table.
*/

// Begin load distribution procedure
Procedure LoadDIST (NOPACK, A, B, C, L, M, U)
    Repeat until NOPACK = 0
        For each path X in the path list with congestion status CsX:
            IF CsX <= L: send more packets towards this path
            IF CsX >= U: stop sending packets towards this path
            IF L < CsX < M: send CU / CsX packets towards path X
            ELSE IF M < CsX < U: send CU / CsX packets towards path X
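The pseudocode above leaves the exact packet counts implicit; the following Python sketch shows one way the proportional idea could be realised, assuming the Cst table is available as a dictionary. Function and variable names are illustrative, the concrete delimiter values are not fixed by the paper, and the two middle bands are collapsed because both send CU / CsX packets.

def distribute(nopack, cst, high):
    """Split nopack packets over the paths in the Cst table.

    cst  : dict mapping path id -> cumulative congestion status CsX
    high : upper delimiter U; paths at or above it are not used
    Less congested paths receive proportionally more packets (weight CU / CsX)."""
    cu = max(cst.values())                                   # worst congestion seen
    weights = {pid: cu / cs for pid, cs in cst.items() if 0 < cs < high}
    total = sum(weights.values())
    return {pid: round(nopack * w / total) for pid, w in weights.items()}

# Example with hypothetical congestion values for the three paths of Fig. 1:
# distribute(100, {"S,1,2,D": 0.8, "S,5,D": 0.2, "S,3,4,D": 0.5}, high=0.9)
# -> roughly {"S,1,2,D": 15, "S,5,D": 61, "S,3,4,D": 24}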

3.2 Additional Analysis

If the congestion status CsX of a path remains very high for a long period, that path is removed (deleted) from the list. This reduces overhead, since maintaining many such multipaths is difficult, and deleting paths whose congestion status stays high for a long time also reduces the processing time at the source node.

3.3 Congestion State Table

Source node maintains a separate table to keep track of congestion status of the available paths. The Congestion State Table of the Source node S is shown in Table 1.

Table 1: Congestion State Table of Source Node S

Path ID        Congestion Status
S,1,2,D        Cs(S+1+2+D)
S,5,D          Cs(S+5+D)
S,3,4,D        Cs(S+3+4+D)
...            ...

4 Simulation

The CCSR protocol was simulated in the GloMoSim network simulator. The number of nodes in the network was varied from 20 to 60. Nodes moved in an area of 1000 x 300 m^2 in accordance with the random waypoint mobility model, with a velocity of 20 m/s and a pause time of 0 seconds. The simulation time was set to 700 seconds.

We considered the following important metrics for the evaluation: packet delivery ratio, number of collisions and end-to-end delay.

Data Throughput (kilobits per second –Kbps) - describes the average number of bits received successfully at the destination per unit time (second). This metric was chosen to measure the resulting network capacity in the experiments.


End-to-end delay (seconds): the average of the sum of delays (including latency) at each destination node during route discovery from the source to the destination. The CCSR protocol gives better results in terms of delay and throughput than DSR and AODV. Figures 2 and 3 compare the results with DSR, and Figures 4 and 5 compare the proposed protocol with AODV. DSR and AODV suffer the worst end-to-end delay, because of high load congestion at the network nodes and the absence of a load balancing mechanism. The simulation shows a 5 to 25 percent improvement in packet delivery ratio and delay.

Fig. 2: End-to-End Delay

Fig. 3: Throughput

(Plot data: average end-to-end delay in seconds versus number of nodes, 20-60, for AODV and CCSR over a 600-second simulation run.)

Fig. 4: End–to-End Delay


Fig. 5: Throughput

5 Conclusion

In this paper, we proposed a new routing protocol called CCSR to improve the performance of multipath routing for ad hoc wireless networks. CCSR uses the cumulative congestion status of the whole path rather than the congestion status of the neighborhood. Based on the congestion status values of each path, stored in a separate table maintained by the source node, the source node distributes the packets so that more packets go to paths with less congestion. It is evident from the simulation results that CCSR outperforms both AODV and DSR, because it balances the load according to the state of the network and the source node adaptively changes its decisions.

References

[1] David B. Johnson, David A. Maltz and Yih-Chun Hu, "The Dynamic Source Routing Protocol for Mobile Ad Hoc Networks (DSR)", Internet Draft, draft-ietf-manet-dsr-09.txt, April 2003. URL: http://www.ietf.org/internet-drafts/draft-ietf-manet-dsr-09.txt.
[2] Yashar Ganjali and Abtin Keshavarzian, "Load Balancing in Ad Hoc Networks: Single-path Routing vs. Multi-path Routing", IEEE INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, March 2004, pp. 1120-1125.
[3] Salma Ktari, Houda Labiod and Mounir Frikha, "Load Balanced Multipath Routing in Mobile Ad hoc Networks", 10th IEEE Singapore International Conference on Communication Systems (ICCS 2006), Oct. 2006, pp. 1-5.
[4] Wen Song and Xuming Fang, "Routing with Congestion Control and Load Balancing in Wireless Mesh Networks", 6th International Conference on ITS Telecommunications Proceedings, 2006, pp. 719-724.
[5] Neeraj Nehra, R.B. Patel and V.K. Bhat, "Routing with Load Balancing in Ad Hoc Network: A Mobile Agent Approach", 6th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2007), pp. 480-486.
[6] Mahesh K. Marina and Samir R. Das, "Performance of Route Caching Strategies in Dynamic Source Routing", Distributed Computing Systems Workshop, 2001 International Conference, 16-19 April 2001, pp. 425-432.
[7] A. Nasipuri and S. R. Das, "On-demand Multipath Routing for Mobile Ad Hoc Networks", Proc. IEEE ICCCN, Oct. 1999, pp. 64-70.
[8] S.J. Lee, C.K. Toh and M. Gerla, "Performance Evaluation of Table-Driven and On-Demand Ad Hoc Routing Protocols", Proc. IEEE Symp. Personal, Indoor and Mobile Radio Comm., Sept. 1999, pp. 297-301.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Spam Filtering Using Statistical Bayesian

Intelligence Technique

Lalji Prasad, SIMS (RGPV) University, Indore-453002, [email protected]
Rashmi Yadav, SIMS (RGPV) University, Indore-453002, [email protected]
Vidhya Samand, SIMS (RGPV) University, Indore-453002, [email protected]

Abstract

This paper describes how Bayesian mathematics can be applied to the spam problem, resulting in an adaptive, 'statistical intelligence' technique that is much harder for spammers to circumvent. It also explains why the Bayesian approach is the best way to tackle spam once and for all, as it overcomes the obstacles faced by more static technologies such as blacklist checking, databases of known spam and keyword checking. Spam is an ever-increasing problem, and the number of spam mails grows daily. Techniques currently used by anti-spam software are static, meaning that they are fairly easy to evade by tweaking the message a little. To effectively combat spam, an adaptive new technique is needed: one that keeps up with spammers' tactics as they change over time and that can adapt to the particular organization it is protecting from spam. The answer lies in Bayesian mathematics.

1 Introduction

Every day we receive many times more spam than legitimate correspondence while checking mail; on average, we probably get ten spam messages for every appropriate e-mail. The problem with spam is that it tends to swamp desirable e-mail. Junk e-mail courses through the Internet, clogging our computers and diverting attention from the mail we really want. Spammers waste the time of millions of people. In the future, spam should, like OS crashes, viruses and popups, become one of those plagues that only afflict people who don't bother to use the right software.

The problem of unsolicited e-mail has been increasing for years. Spam encompasses all the e-mail that we do not want and that is only very loosely directed at us. Unethical e-mail senders bear little or no cost for mass distribution of messages; yet normal e-mail users are forced to spend time and effort purging fraudulent and otherwise unwanted mail from their mailboxes. Bayesian filters are advantageous because they take the whole context of a message into consideration. Unlike other filtering techniques that look for spam-identifying words in subject lines and headers, a Bayesian filter uses the entire context of an e-mail when it looks
for words or character strings that will identify the e-mail as spam. A Bayesian filter is constantly self-adapting: it can train itself to identify new patterns of spam. The Bayesian technique learns the e-mail habits of the organization and lets each user define spam, so that the filter is highly personalized. Bayesian filters also update automatically and are self-correcting as they process new information and add it to the database.

2 What is Spam?

Spam is somewhat broader than the category "unsolicited commercial automated e-mail"; Spam encompasses all the e-mail that we do not want and that is only very loosely directed at us.

2.1 How Spam Creates Problem

The problem of unsolicited e-mail has been increasing for years. Spam encompasses all the e-mail that we do not want and that is very loosely directed at us. Normal e-mail users are forced to spend time and effort purging fraudulent mail from their mailboxes. The problem with spam is that it tends to swamp desirable e-mail.

2.2 Looking at Filtering Algorithm

2.2.1 Basic Structured Text Filters

The e-mail client has the capability to sort incoming e-mail based on simple strings found in specific header fields, the header in general, and/or in the body. Its capability is very simple and does not even include regular expression matching. Almost all e-mail clients have this much filtering capability.

2.2.2 White List Filter

The "white list plus automated verification" approach. There are several tools that implement a white list with verification: TDMA is a popular multi-platform open source tool; Choice Mail is a commercial tool for Windows. A white list filter connects to an MTA and passes mail only from explicitly approved recipients on to the inbox. Other messages generate a special challenge response to the sender. The white list filter's response contains some kind of unique code that identifies the original message, such as a hash or sequential ID. This challenge message contains instructions for the sender to reply in order to be added to the white list (the response message must contain the code generated by the white list filter.

2.2.3 Distributed Adaptive Blacklists

Spam is delivered to a large number of recipients, and as a matter of practice there is little if any customization of spam messages to individual recipients. Each recipient of a spam, however, in the absence of prior filtering, must press his own "Delete" button to get rid of the message. Distributed blacklist filters let one user's Delete button warn millions of other users about the spamminess of the message. Tools such as Razor and Pyzor operate around servers that store digests of known spam. When a message is received by an MTA, a distributed blacklist filter is called to determine whether the message is a known spam. These tools use clever statistical techniques for creating digests, so that spam with minor or automated mutations can still be recognized. In addition, maintainers of distributed blacklist servers frequently create "honey-pot" addresses specifically for the purpose of attracting spam (but never for any legitimate correspondence).

2.2.4 Rule-Based Rankings

The most popular tool for rule-based spam filtering, by a good margin, is SpamAssassin. SpamAssassin (and similar tools) evaluates a large number of patterns, mostly regular expressions, against a candidate message. Some matched patterns add to a message's score, while others subtract from it. If a message's score exceeds a certain threshold, it is filtered as spam; otherwise it is considered legitimate.

2.2.5 Bayesian Word Distribution Filters

The general idea is that some words occur more frequently in known spam, and other words occur more frequently in legitimate messages. Using well-known mathematics, it is possible to generate a "spam-indicative probability" for each word. It can generate a filter automatically from corpora of categorized messages rather than requiring human effort in rule development. It can be customized to individual users' characteristic spam and legitimate messages.

2.2.6 Bayesian Trigram Filters

Bayesian techniques built on a word model work rather well. One disadvantage of the word model is that the number of "words" in e-mail is virtually unbounded: the number of "word-like" character sequences possible is nearly unlimited, and new text keeps producing new sequences. This is particularly true of e-mails, which contain random strings in Message-IDs, content separators, UU and base64 encodings, and so on. There are various ways to throw out words from the model; a trigram filter uses trigrams, a smaller unit than words, for the probability model. Among all the techniques described above, we have chosen the Bayesian approach for implementation.

2.3 Algorithm for Bayesian Probability Model of Spam and Non-Spam Words

When the user logs in, the administrator checks all new mails for spam and sets the status as Spam, Non-Spam, Blacklist (for a blacklisted sender) or White list (for a white-listed sender). The sender is checked against a sender XML file which maintains the status of each sender (blacklisted, white-listed, or no status). If the sender is blacklisted or white-listed, no spam check is applied. If the sender has no status, we use the following algorithm to check for spam. We have used Graham's Bayesian statistics-based approach. The steps of statistical filtering are:

We started with a corpus of spam and non-spam tokens, mapping each token to the probability that an e-mail containing it is spam; these probabilities are kept in a Probability XML file.

We scanned the entire text, including the subject header, of each message. We currently consider alphanumeric characters, exclamation marks and dollar signs to be part of tokens, and everything else to be a token separator.

When a new e-mail arrives, we extract all the tokens and find at most fifteen with probabilities p1...p15 furthest (in either direction) from 0.5. The factor used for extracting the 15 interesting words is calculated as follows:

a. For words having probability greater than 0.5 in probability XML –Factor = Token probability -0.5

b. For words having probability less than 0.5 in probability XML – Factor = 0.5 - Token probability
c. One question that arises in practice is what probability to assign to a token we have never seen, i.e. one that does not occur in the Probability XML file. We assign 0.4 to such a token.

d. The probability that the mail is spam is

p1 p2 ... p15 / (p1 p2 ... p15 + (1 - p1)(1 - p2) ... (1 - p15))

e. We treat a mail as spam if the algorithm above gives it a probability of more than 0.9 of being spam.

At this stage we maintain two XML files (a Spam XML file and a Non-spam XML file), one for each corpus, mapping tokens to their numbers of occurrences.

f. According to the combined probability, if the message is spam we count the number of times each token (ignoring case) occurs in the message and update the Spam XML file; if it is not spam we update the Non-spam XML file.

g. The number of spam or non-spam mails is also updated based on the message status.

h. We look through the entire user's e-mail and, for each token, calculate the ratio of spam occurrences to total occurrences: Pi = spam occurrences / total occurrences.

For example, if "cash" occurs in 200 of 1000 spam e-mails and 3 of 500 non-spam e-mails, its spam probability is (200/1000) / (3/500 + 200/1000), or 0.971.

i. Whenever the number of spam and ham mails reaches 1000, the Probability XML is updated according to the formula above. We want to bias the probabilities slightly to avoid false positives.

Bias Used

There is the question of what probability to assign to words that occur in one corpus but not the other. We choose 0.01 (for words not occurring in the Spam XML) and 0.99 (for words not occurring in the Non-spam XML). We consider each corpus to be a single long stream of text for the purpose of counting occurrences, and we use their combined length when calculating probabilities; this adds another slight bias to protect against false positives. The token probability is calculated only when the number of both spam and ham mails reaches 1000. We have used 1000 here, but an even larger corpus of messages can be used. Until the number of both spam and ham mails reaches 1000, messages with a probability greater than 0.6 are treated as spam; afterwards, 0.9 is used.

We use a very large corpus of token probabilities in the Probability XML rather than a corpus of spam and ham messages. If the user marks a non-spam mail as spam, the Spam and Non-spam XML files are updated accordingly; in addition, all of its words are assigned a high probability in the Probability XML until the number of mails reaches 1000.
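A condensed Python sketch of the scoring step described above: take the 15 token probabilities furthest from 0.5 (using 0.4 for unseen tokens) and combine them with the formula from step d. Here token_prob stands for the per-token probabilities held in the Probability XML and is assumed to be an ordinary dictionary; all names are illustrative.

def spam_score(tokens, token_prob, unseen=0.4, n_interesting=15):
    """Combined spam probability of a message, Graham-style."""
    probs = [token_prob.get(t, unseen) for t in set(tokens)]
    # keep the probabilities furthest (in either direction) from 0.5
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    num = den = 1.0
    for p in probs[:n_interesting]:
        num *= p
        den *= (1.0 - p)
    return num / (num + den)

# a message is treated as spam when spam_score(...) > 0.9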

3 Bayesian Model of Spam and Non Spam Words

The spam filtering technique implemented in the software is a Bayesian statistical probability model of spam and non-spam words. The general idea is that some words occur more frequently in known spam, and other words occur more frequently in legitimate messages. Using well-known mathematics, it is possible to generate a "spam-indicative probability" for each word. Another simple mathematical formula can be used to determine the overall "spam
probability" of a novel message based on the collection of words it contains. Bayesian e-mail filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an e-mail is spam, given that it contains certain words, is equal to the probability of finding those words in spam e-mail, times the probability that any e-mail is spam, divided by the probability of finding those words in any e-mail; the per-token probability is estimated as

Pi = spam occurrences / total occurrences

3.1 Process

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word Viagra in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up.

To train the filter, the user must manually indicate whether a new e-mail is spam or not. For all words in each training e-mail, the filter adjusts the probability that each word will appear in spam or in legitimate e-mail in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate e-mail, such as the names of friends and family members. After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an e-mail with a particular set of words belongs to either category. Each word in the e-mail contributes to the e-mail's spam probability; this contribution is called the posterior probability and is computed using Bayes' theorem. The e-mail's spam probability is then computed over all words in the e-mail, and if the total exceeds a certain threshold (say 95%), the filter marks the e-mail as spam. E-mail marked as spam can then be automatically moved to a "Spam" folder, or even deleted outright.

3.2 Advantages

1. A statistical model basically just works better than a rule-based approach.

2. Feature-recognizing filters like Spam Assassin assign a spam "score" to email. The Bayesian approach assigns an actual probability.

3. Makes the filters more effective.

4. Lets each user decide their own precise definition of spam.

5. Perhaps best of all makes it hard for spammers to tune mails to get through the filters.

4 Conclusion

As more and more people use e-mail as their everyday communication tool, more and more spam, virus, phishing and fraudulent e-mails are sent to our inboxes. Several e-mail systems use filtering techniques that seek to identify e-mails and classify them by some simple rules. However, these filters employ conventional database techniques for pattern matching to achieve the objective of junk e-mail detection. There are several fundamental shortcomings to this kind of junk e-mail identification technique, for example the lack of a learning mechanism, ignorance of the temporal localization concept, and poor description of the e-mail data.


Spam Filter Express is a powerful spam filter that quickly identifies and separates hazardous and annoying spam from your legitimate e-mail. Based on Bayesian filtering technology, Spam Filter Express adapts itself to your e-mail automatically, filtering out all of the junk mail with close to 100% accuracy: no adding rules, no complex training, no forcing your friends and colleagues to jump through hoops to communicate with you.

Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Ensure Security on Untrusted Platform for Web Applications

Surendrababu K. and Surendra Gupta
Computer Engineering Department, SGSITS, Indore-452003
[email protected]

Abstract

The web is an indispensable part of our lives. Every day, millions of users purchase items, transfer money, retrieve information, and communicate over the web. Although the web is convenient for many users because it provides anytime, anywhere access to information and services, it has also become a prime target for miscreants who attack unsuspecting web users with the aim of making an easy profit. Recent years have shown a significant rise in the number of web-based attacks, highlighting the importance of techniques and tools for increasing the security of web applications. An important web security research problem is how to enable a user on an untrusted platform (e.g., a computer that has been compromised by malware) to securely transmit information to a web application. Solutions that have been proposed to date are mostly hardware-based and require (often expensive) peripheral devices such as smartcard readers and chip cards. In this paper, we discuss some common aspects of client-side attacks (e.g., Trojan horses) against web applications and present two simple techniques that can be used by web applications to enable secure user input. We also conducted two usability studies to examine whether the techniques that we propose are feasible.

1 Introduction

Since the advent of the web, our lives have changed irreversibly. Web applications have quickly become the most dominant way to provide access to online services. For many users, the web is easy to use and convenient because it provides anytime, anywhere access to information and services. Today, a significant amount of business is conducted over the web, and millions of web users purchase items, transfer money, retrieve information, and communicate via web applications. Unfortunately, the success of the web and the lack of technical sophistication and understanding of many web users have also attracted miscreants who aim to make easy financial profits. The attacks these people have been launching range from simple social engineering attempts (e.g., using phishing sites) to more sophisticated attacks that involve the installation of Trojan horses on client machines (e.g., by exploiting vulnerabilities in browsers in so-called drive-by attacks [19]).

An important web security research problem is how to effectively enable a user who is running a client on an untrusted platform (i.e., a platform that may be under the control of an attacker) to securely communicate with a web application. More precisely, can we ensure the confidentiality and integrity of sensitive data that the user sends to the web application even if
the user’s platform is compromised by an attacker? Clearly, this is an important, but difficult problem. Ensuring secure input to web applications is especially relevant for online services such as banking applications where users perform money transfers and access sensitive information such as credit card numbers. Although the communication between the web client and the web application is typically encrypted using technologies such as Transport Layer Security [9] (TLS) to thwart sniffing and man-in-the-middle attacks, the web client is the weakest point in the chain of communication. This is because it runs on an untrusted platform, and thus, it is vulnerable to client side attacks that are launched locally on the user’s machine. For example, a Trojan horse can install itself as a browser plugin and then easily access, control, and manipulate all sensitive information that flows through the browser.

Malware that manipulates bank transactions already appears in the wild. This year, for example, several Austrian banks were explicitly targeted by Trojan horses that were used by miscreants to perform illegal money transactions [13, 21]. In most cases, the victims did not suspect anything, and the resulting financial losses were significant. Note that even though the costs of such an attack are covered by insurance companies, it can still easily harm the public image of the targeted organization. A number of solutions have been proposed to date to enable secure input on untrusted platforms for web-based applications. The majority of these solutions are hardware-based and require integrated or external peripheral devices such as smart-card readers [10, 23] or mobile phones [15]. Such hardware-based solutions have several disadvantages: they impose a financial and organizational burden on users and on service providers, they eliminate the anytime, anywhere advantage of web applications, and they often depend on the integrity of underlying software components which may be replaced with tampered versions [12, 24, 25].

In this paper, we discuss some common aspects of client-side attacks against web applications and present two simple techniques that can be used by web applications to enable secure input, at least for a limited quantity of sensitive information (such as financial transaction data). The main advantage of our solutions is that they do not require any installation or configuration on the user's machine. Additionally, in order to evaluate the feasibility of our techniques for mainstream deployment, we conducted usability studies. The main contributions of this paper are as follows:

• We present a technique that extends graphical input with CAPTCHAs [3] to protect the confidentiality and integrity of the user input even when the user platform is under the control of an automated attack program (such as a Trojan horse).

• We present a technique that makes use of confirmation tokens that are bound to the sensitive information that the user wants to transmit. This technique helps to protect the integrity of the user input even when the user platform is under the control of the attacker.

• We present usability studies that demonstrate that the two techniques we propose in this paper are feasible in practice.

2 A Typical Client-Side Attack

In a typical client-side web attack, the aim of the attacker is to take control of the user's web client in order to manipulate the client's interaction with the web application. Such an attack
typically consists of three phases. In the first phase, the attacker’s objective is to install malware on the user’s computer. Once this has been successfully achieved, in the second phase, the installed malware monitors the user’s interaction with the web application. The third phase starts once the malware detects that a security critical operation is taking place and attempts to manipulate the flow of sensitive information to the web application to fulfill the attacker’s objectives.

Imagine, for example, that John Smith receives an email with a link to a URL. This email has been sent by attackers to thousands of users. John is naive and curious, so he clicks on the link. Unfortunately, he has not regularly updated his browser (Internet Explorer in this case), which contains a serious parsing-related vulnerability that allows malicious code to be injected and executed on his system just by visiting a hostile web site. As a result, a Trojan horse is automatically installed on John's computer when his browser parses the contents of the web page. The Trojan horse that the attackers have prepared is a Browser Helper Object (BHO) for Internet Explorer (IE). This BHO is automatically loaded every time IE is started. With the BHO, the attackers have access to all events (i.e., interactions) and HTML components (i.e., DOM objects) within the browser. Hence, they can easily check which web sites the user is surfing, and they can also modify the contents of web pages. In our example, the attackers are interested in web sessions with a particular bank (Bank Austria). Whenever John is online and starts using the Bank Austria online banking web application, the Trojan browser plugin is triggered. It then starts analyzing the contents of the bank web pages. When it detects that he is about to transfer money to another account, it silently modifies the target account number.

Note that the imaginary attack we described previously is actually very similar to the attacks that have been recently targeting Austrian banks. Clearly, there can be many technical variations of such an attack. For example, instead of using a BHO, the attackers could also inject Dynamic Link Libraries (DLLs) into running applications or choose to intercept and manipulate Operating System (OS) calls. The key observation here is that the online banking web application has no way to determine whether the client it is interacting with has been compromised. Furthermore, when the client has indeed been compromised, all security precautions the web application can take to create a secure communication channel to the client (e.g., TLS encryption) fail. That is, the web application cannot determine whether it is directly interacting with a user, or with a malicious application performing illegitimate actions on behalf of a user.

3 Our Solution

As described in the previous section, the web application must assume that the user's web client (and platform) is under the control of an attacker. There are two aspects of the communication that an attacker could compromise: the confidentiality or the integrity of input sent from the client to the web application. The confidentiality of the input is compromised when the attacker is able to eavesdrop on the entered input and intercept sensitive information. Analogously, the integrity of the input is compromised when the attacker is able to tamper with, modify, or cancel the input the user has entered. As far as the user is concerned, there are cases in which the integrity of the input may be more important than its confidentiality. For example, as described in Section 2, it is only when the attacker can effectively modify the account number that has been typed that an illegitimate money transaction causing
financial damage can be performed. In this section, we present two techniques that web applications can apply to protect sensitive user input. We assume a threat model in which the attacker has compromised a machine and installed malicious code. This code has complete control of the client’s machine, but must perform its task in an autonomous fashion (i.e., without being able to consult a human). Our solutions are implemented on the server and are client-independent. The first solution we discuss aims to protect the integrity of user input. The second solution we discuss aims to protect the confidentiality and integrity of the user input, but only against automated attacks (i.e., the adversary is not a human).

3.1 Solution 1: Binding Sensitive Information to Confirmation Tokens

3.1.1 Overview

The first solution is based on confirmation tokens. In principle, the concept of a confirmation token is similar to a transaction number (TAN) commonly used in online banking. TANs are randomly generated numbers that are sent to customers as hardcopy letters via regular (snail) mail. Each time a customer would like to confirm a transaction, she selects a TAN entry from her hardcopy list and enters it into the web application. Each TAN entry can be used only once. The idea is that an attacker cannot perform transactions just by knowing a customer's user login name and password. Obviously, TAN-based schemes rely on the assumption that an attacker will not have access to a user's TAN list and hence will not be able to perform illegitimate financial transactions at a time of his choosing. Unfortunately, TAN-based schemes are easily defeated when an attacker performs a client-side attack (e.g., using a Trojan horse as described in Section 2). Furthermore, such schemes are also vulnerable to phishing attempts in which victims are prompted to provide one (or more) TAN numbers on the phishing page. The increasing number of successful phishing attacks prompted some European banks to switch to so-called indexed TAN (i-TAN) schemes, where the bank server requests a specific i-TAN for each transaction. While this partially mitigated the phishing threat, i-TANs are as vulnerable to client-side attacks as traditional TANs. In general, the problem with regular transaction numbers is that there is no relationship between the data that is sent to the web application and the (a-priori shared) TANs. Thus, when the bank requests a certain TAN, malicious code can replace the user's input without invalidating this transaction number. To mitigate this weakness and to enforce the integrity of the transmitted information, we propose to bind the information that the user wants to send to our confirmation token. In other words, we propose to use confirmation tokens that (partially) depend on the user data. Note that when using confirmation tokens, our focus is not the protection of the confidentiality, but the integrity of this sensitive information.

3.1.2 Details

Imagine that an application needs to protect the integrity of some input data x. In our solution, the idea is to specify a function f(.) that the user is requested to apply to the sensitive input x. The user then submits both her input data x and, as a confirmation token, f(x). Suppose that in an online banking scenario, the bank receives the account number n together with a confirmation token t from the user. The bank will then apply f(.) to n and verify that f(n) = t. If the value x, which the user desires to submit, is the same as the input n that the bank receives (x = n), then the computation of f(n) by the bank will equal the computation of f(x) by the user. That is, f(x) = f(n) holds. If, however, the user input is modified, then the bank's computation will yield f(n) ≠ f(x), and the bank will know that the integrity of the user's
input is compromised. An important question that needs to be answered is how f(.) should be defined. Clearly, f(.) has to be defined in a way such that malicious software installed on a user's machine cannot easily compute it. Otherwise, the malware could automatically compute f(x) for any input x that it would like to send, and the proposed solution fails. Also, f(.) has to remain secret from the attacker.

We propose two schemes for computing f(x). For both schemes, the user will require a code book. This code book will be delivered via regular mail, similar to the TAN letters described in the previous section. In the first scheme, called token calculation, the code book contains a collection of simple algorithms that can be used by users to manually compute confirmation tokens (similar to the obfuscation and challenge-response idea presented in [4] for secure logins). All algorithms are based on the input that the user would like to transmit.

Fig. 1: Sample token calculation code book (excerpt):
… Token ID 5: Create a number using the 3rd and 4th digits of the target account and add 262 to it.
Token ID 6: Create a number using the 2nd and 8th digits of the target account and add 540 to it. …

Suppose that the user has entered the account number 980.243.276, but a Trojan horse has actually sent the account number 276.173.862 to the bank (unnoticed by the user). In the first scheme, the bank would randomly choose an algorithm from the user's code book. Clearly, in order to make the scheme more resistant against attacks, a different code book would have to be created for each user (just like different TANs are generated for different users). Figure 1 shows an excerpt from our sample token calculation code book. Suppose the bank asks the user to apply algorithm ID 6 to the target account number. That is, the user would have to multiply the 4th and 8th digits of the account number and add 17 to the result. Hence, the user would type 31 as the confirmation token. The bank, however, would compute 23 and, because these confirmation values do not match, it would not execute the transaction, successfully thwarting the attack.
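A minimal sketch of the bank-side check for the token calculation scheme is given below, using the algorithm from the example above (ID 6: multiply the 4th and 8th digits and add 17). The code-book representation and the function names are illustrative assumptions; a real deployment would hold a separate, secret code book per user.

# Sketch of server-side verification for the token calculation scheme.
# Each code-book entry is modelled as a function over the digits of the
# received account number; algorithm ID 6 follows the example in the text.

def digits(account_number):
    return [int(c) for c in account_number if c.isdigit()]

CODE_BOOK = {
    6: lambda d: d[3] * d[7] + 17,   # 4th and 8th digits (1-based in the text)
}

def verify_transaction(received_account, algorithm_id, confirmation_token):
    expected = CODE_BOOK[algorithm_id](digits(received_account))
    return expected == confirmation_token

# The user computed f(x) = 2 * 7 + 17 = 31 for the account she typed
# (980.243.276), but the Trojan horse sent 276.173.862 instead:
print(verify_transaction("276.173.862", 6, 31))   # False -> transaction rejected
print(verify_transaction("980.243.276", 6, 31))   # True  -> token matches the typed account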

For our second scheme to implement f(.), called token lookup, users are not required to perform any computation. In this variation, the code book would consist of a large number of random tokens that are organized in pages. The bank and the user previously and secretly agree on which digits of the account number are relevant for choosing the correct page. The
bank then requests the user to confirm a transaction by asking her to enter the value of a specific token on that page. For example, suppose that the relevant account digits are 2 and 7 for user John and that the bank asks John to enter the token with the ID 20. In this case, John would determine the relevant code page by combining the 2nd and 7th digits of the account number and look up the token on that page that has the ID 20. Suppose that the user is faced with the same attack that we discussed previously. That is, the user enters 980.243.276, but the malicious application sends 276.173.862 to the bank. In this case, the user would look up the token with ID 20 on page 82, while the bank would consult page 78. Thus, the transmitted token would not be accepted as valid.
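The token lookup variant can be sketched in the same way. The page-selection rule follows the example above (the 2nd and 7th digits choose the page); the toy code-book contents and token values below are invented purely for illustration.

# Sketch of the token lookup scheme: the relevant digits of the account
# number select a code-book page, and the bank asks for a token ID on it.

def digits(account_number):
    return [int(c) for c in account_number if c.isdigit()]

def page_for(account_number, relevant_positions=(2, 7)):
    # Combine the 2nd and 7th digits (1-based, as in the example) into a page number.
    d = digits(account_number)
    return int("".join(str(d[p - 1]) for p in relevant_positions))

# A toy code book: page number -> {token ID -> token value}.
code_book = {
    82: {20: "K7F3Q"},
    78: {20: "ZP91D"},
}

user_account = "980.243.276"      # what the user typed
received_account = "276.173.862"  # what the Trojan horse actually sent

user_token = code_book[page_for(user_account)][20]          # looked up on page 82
expected_token = code_book[page_for(received_account)][20]  # bank consults page 78
print(user_token == expected_token)   # False -> the transaction is not accepted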

3.2 Solution 2: Using CAPTCHAs for Secure Input

3.2.1 Overview

Graphical input is used by some banks and other institutions to prevent eavesdropping of passwords or PINs. Instead of using the keyboard to enter sensitive information, an image of a keypad is displayed, and the user enters data by clicking on the corresponding places in the image. Unfortunately, these schemes are typically very simple. For example, the letters and numbers are always located at the same window coordinates, or the fonts can be easily recognized with optical character recognition (OCR). As a result, malware can still recover the entered information. The basic idea of the second solution is to extend graphical input with CAPTCHAs [3]. A CAPTCHA, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart, is a type of challenge-response test that is used in computing to determine whether or not the user is human. Hence, a CAPTCHA test needs to be solvable by humans, but not solvable (or very difficult to solve) for computer applications. CAPTCHAs are widely employed for protecting online services against automated (mis)use by malicious programs or scripts. For example, such programs may try to influence online polls, or register for free email services with the aim of sending spam. Figure 2 shows a graphical CAPTCHA generated by Yahoo when a user tries to subscribe to its free email service.

Fig. 2: A graphical CAPTCHA generated by Yahoo.

An important characteristic of a CAPTCHA is that it has to be resistant to attacks. That is, it should not be possible for an algorithm to automatically solve the CAPTCHA. Graphical CAPTCHAs, specifically, need to be resistant to optical character recognition [18]. OCR is used to translate images of handwritten or typewritten text into machine-editable text. To defeat OCR, CAPTCHAs generally use background clutter (e.g., thin lines, colors, etc.), a large range of fonts, and image transformations. Such properties have been shown to make OCR analysis difficult [3]. Usually, the algorithm used to create a CAPTCHA is made public. The reason for this is that a good CAPTCHA needs to demonstrate that it can only be broken by advances in OCR (or general pattern recognition) technology and not by the discovery of a "secret" algorithm. Note that although some commonly used CAPTCHA algorithms have already been defeated (e.g., see [17]), a number of more sophisticated CAPTCHA algorithms [3, 7] are still considered resistant against OCR and are currently being widely used by companies such as Yahoo and Google.

3.2.2 Details

Although CAPTCHAs are frequently used to protect online services against automated access, to the best of our knowledge, no one has considered their use to enable secure input to web applications. In our solution, whenever a web application needs to protect the integrity and confidentiality of user information, it generates a graphical input field with randomly placed CAPTCHA characters. When the user wants to transmit input, she simply uses the mouse to click on the area that corresponds to the first character that should be sent. Clicking on the image generates a web request that contains the coordinates on the image where the user has clicked with the mouse. The key idea here is that only the web application knows which character is located at these coordinates. After the first character is transmitted, the web application generates another image with a different placement of the characters, and the process is repeated. By using CAPTCHAs to communicate with the human user, a web application can mitigate client-side attacks that intercept or modify the sensitive information that users type. Because the CAPTCHA characters cannot be identified automatically, a malware program has no way to know which information was selected by the user, nor does it have a way to meaningfully select characters of its own choosing.
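The following sketch illustrates the server-side bookkeeping that such a scheme would need: the server alone knows where each character was placed and resolves every click back to a character. The cell-based layout and the helper names are assumptions made for the example; an actual implementation would render the characters as a distorted CAPTCHA image at the chosen positions.

# Server-side sketch of CAPTCHA-based character input: the server places the
# characters at random positions, remembers the layout for the session, and
# resolves each click (x, y) back to a character. Only the layout stored on
# the server links coordinates to characters; the client sees only the image.
import random

CELL = 40  # size of one character cell in pixels (illustrative)

def new_keypad(charset="0123456789"):
    # Shuffle the characters over a row of cells; a real implementation would
    # render them as a distorted CAPTCHA image at these positions.
    chars = list(charset)
    random.shuffle(chars)
    layout = {}
    for i, ch in enumerate(chars):
        layout[(i * CELL, 0, (i + 1) * CELL, CELL)] = ch  # (x1, y1, x2, y2) -> char
    return layout

def resolve_click(layout, x, y):
    for (x1, y1, x2, y2), ch in layout.items():
        if x1 <= x < x2 and y1 <= y < y2:
            return ch
    return None

# For each character the user enters, the server generates a fresh layout,
# so the same screen position maps to a different character every time.
layout = new_keypad()
print(resolve_click(layout, 5, 10))   # whichever digit happens to sit in the first cell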

4 Related Work

Client-side sensitive information theft (e.g., spyware, keyloggers, Trojan horses, etc.) is a growing problem. In fact, the Anti-Phishing Working Group has reported over 170 different types of keyloggers distributed on thousands of web sites [1]. Hence, the problem has been increasingly gaining attention and a number of mitigation ideas have been presented to date. Several client-side solutions have been proposed that aim to mitigate spoofed-web-site-based phishing attacks. PwdHash [22] is an Internet Explorer plug-in that transparently converts a user's password into a domain-specific password.

A side-effect of the tool is some protection from phishing attacks. Because the generated password is domain-specific, the password that is phished is not useful. SpoofGuard [5] is a plug-in solution specifically developed to mitigate phishing attacks. The plug-in looks for “phishing symptoms” such as similar sounding domain names and masked links. Note that both solutions focus on the mitigation of spoofed web-site-based phishing attacks. That is, they are vulnerable against client-side attacks as they rely on the integrity of the environment they are running in. Similarly, solutions such as the recently introduced Internet Explorer antiphishing features [16] are ineffective when an attacker has control over the user’s environment. Spyblock [11] aims to protect user passwords against network sniffing and dictionary attacks. It proposes to use a combination of password-authenticated key exchange and SSL. Furthermore, as additional defense against pharming, cookie sniffing, and session hijacking, it proposes a form of transaction confirmation over an authenticated channel. The tool is distributed as a client-side system that consists of a browser extension and an authentication agent that runs in a virtual machine environment that is “protected” from spyware. A disadvantage of Spyblock is that the user needs to install and configure it, as opposed to our purely server-side solution.

A number of hardware-based solutions have been proposed to enable secure input on untrusted platforms. Chip cards and smart-card readers [10, 23], for example, are popular choices. Unfortunately, it might be possible for the attacker to circumvent such solutions if the implementations rely on untrusted components such as drivers and operating system calls
[12, 24, 25]. As an alternative to smart-card-based solutions, several researchers have proposed using handhelds as a secure input medium [2, 15]. Note that although hardware-based solutions are useful, unfortunately, they are often expensive and have the disadvantage that they have to be installed and available to users.

A popular anti-keylogger technique that is already being deployed by certain security-aware organizations is the graphical keyboard. Similar to our graphical input technique, the idea is that the user types in sensitive data using a graphical keyboard. As a result, she is safe from keyloggers that record the keys that are pressed. However, there have been increasing reports of so-called "screen scrapers" that capture the user's screen and send the screenshot to a remote phishing server for later analysis [6]. Also, with many graphical keyboard solutions, sensitive information can be extracted from user interface elements that show the entered data to provide feedback for the user. Finally, to the best of our knowledge, no graphical keyboard solution uses CAPTCHAs. Thus, the entered information can be determined in a straightforward fashion using simple OCR schemes.

The cryptographic community has also explored different protocols to identify humans over insecure channels [8, 14, 27]. In one of the earliest papers [14], a scheme is presented in which users have to respond to a challenge, having memorized a secret of the modest amount of ten characters and five digits. The authors present a security analysis, but no usability study is provided (actually, the authors defer the implementation of their techniques to future work). The importance of usability studies is shown in a later paper by Hopper and Blum [8]. In their work, the authors develop a secure scheme for human identification, but after performing user studies with 54 persons, conclude that their approach “is impractical for use by humans.” In fact, a transaction takes on average 160 seconds, and can only be performed by 10% of the population. Our scheme, on the other hand, takes less than half of this time, and 95% of the transactions completed successfully.

Finally, client-side attacks could be mitigated if the user could easily verify the integrity of the software running on her platform. Trusted Computing (TC) [20] initiatives aim to achieve this objective by means of software and hardware. At this time, however, TC solutions largely remain prototypes that are not widely deployed in practice.

5 Conclusion

Web applications have become the most dominant way to provide access to online services. A growing class of problems is client-side attacks in which malicious software is automatically installed on the user's machine. This software can then easily access, control, and manipulate all sensitive information in the user's environment. Hence, an important web security research problem is how to enable a user on an untrusted platform to securely transmit information to a web application.

Previous solutions to this problem are mostly hardware-based and require peripheral devices such as smart-card readers and mobile phones. In this paper, we present two novel server-side techniques that can be used to enable secure user input. The first technique uses confirmation tokens that are bound to sensitive data to ensure data integrity. Confirmation tokens can either be looked up directly in a code book or they need to be calculated using simple algorithms. The second technique extends graphical input with CAPTCHAs to protect the confidentiality and integrity of user input against automated attacks. The usability studies that
we conducted demonstrate that, after an initial learning step, our techniques are understood and can also be applied by a non-technical audience.

Our dependency on the web will certainly increase in the future. At the same time, client-side attacks against web applications will most likely be continuing problems as the attacks are easy to perform and profitable. We hope that the techniques we present in this paper will be useful in mitigating such attacks.

References

[1] Anti-phishing Working Group. http://www.antiphishing.org.
[2] D. Balfanz and E. Felten. Hand-Held Computers Can Be Better Smart Cards. In Proceedings of the 8th Usenix Security Symposium, 1999.
[3] Carnegie Mellon University. The CAPTCHA Project. http://www.captcha.net.
[4] W. Cheswick. Johnny Can Obfuscate: Beyond Mother's Maiden Name. In Proceedings of the 1st USENIX Workshop on Hot Topics in Security (HotSec), 2006.
[5] N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell. Client-side defense against web-based identity theft. In Proceedings of the Network and Distributed Systems Security (NDSS), 2004.
[6] FinExtra.com. Phishers move to counteract bank security programmes. http://www.finextra.com/fullstory.asp?id=14149.
[7] S. Hocevar. PWNtcha - Captcha Decoder. http://sam.zoy.org/pwntcha.
[8] N. Hopper and M. Blum. Secure Human Identification Protocols. In AsiaCrypt, 2001.
[9] IETF Working Group. Transport Layer Security (TLS). http://www.ietf.org/html.charters/tls-charter.html, 2006.
[10] International Organization for Standardization (ISO). ISO 7816 Smart Card Standard. http://www.iso.org/.
[11] C. Jackson, D. Boneh, and J. C. Mitchell. Stronger Password Authentication Using Virtual Machines. http://crypto.stanford.edu/SpyBlock/spyblock.pdf.
[12] A. Josang, D. Povey, and A. Ho. What You See is Not Always What You Sign. In Annual Technical Conference of the Australian UNIX and Open Systems User Group, 2002.
[13] I. Krawarik and M. Kwauka. Attacken aufs Konto (in German). http://www.ispa.at/www/getFile.php?id=846, Jan 2007.
[14] T. Matsumoto and H. Imai. Human Identification Through Insecure Channel. In EuroCrypt, 1991.
[15] J. M. McCune, A. Perrig, and M. K. Reiter. Bump in the Ether: A Framework for Securing Sensitive User Input. In Proceedings of the USENIX Annual Technical Conference, June 2006.
[16] Microsoft Corporation. Internet Explorer 7 features. http://www.microsoft.com/windows/ie/ie7/about/features/default.mspx.
[17] G. Mori and J. Malik. Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference (CVPR). IEEE Computer Society Press, 2003.
[18] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Document image analysis, pages 244–273, 1995.
[19] A. Moshchuk, T. Bragin, S. D. Gribble, and H. M. Levy. A Crawler-based Study of Spyware on the Web. In Proceedings of the 13th Annual Network and Distributed System Security Symposium (NDSS), February 2006.
[20] S. Pearson. Trusted Computing Platforms. Prentice Hall, 2002.
[21] Pressetext Austria. Phishing-Schäden bleiben am Kunden hängen (in German). http://www.pressetext.at/pte.mc?pte=061116033, Nov 2006.
[22] B. Ross, C. Jackson, N. Miyake, D. Boneh, and J. C. Mitchell. Stronger Password Authentication Using Browser Extensions. In Proceedings of the 14th Usenix Security Symposium, 2005.
[23] Secure Information Technology Center Austria (A-SIT). The Austrian Citizen Card. http://www.buergerkarte.at/index_en.html, 2005.
[24] A. Spalka, A. Cremers, and H. Langweg. Protecting the Creation of Digital Signatures with Trusted Computing Platform Technology Against Attacks by Trojan Horse. In IFIP Security Conference, 2001.

Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

A Novel Approach for Routing Misbehavior

Detection in MANETs

Shyam Sunder Reddy K.
Dept. of Computer Science, JNTU University, Anantapur
[email protected]

C. Shoba Bindu
Dept. of Computer Science, JNTU University, Anantapur
[email protected]

Abstract

A mobile ad hoc network (MANET) is a temporary, infrastructureless network, formed by a set of mobile hosts that dynamically establish their own network without relying on any central administration. By definition, the nature of ad hoc networks is dynamically changing. However, due to the open structure and scarcely available battery-based energy, node misbehaviors may exist. The network is vulnerable to routing misbehavior due to faulty or malicious nodes. Misbehavior detection systems aim at removing this vulnerability. In this approach, we built a system to detect misbehaving nodes in a mobile ad hoc network. Each node in the network monitored its neighboring nodes and collected one DSR protocol trace per monitored neighbor. The network simulator GloMoSim is used to implement the system. The parameters collected for each node in the network represent the normal behavior of the network. In the next step we incorporate misbehavior into the system and capture the behavior of the network, which serves as input to our detection system. The detection system is implemented based on the 2ACK concept. Simulation results show that the system has good detection capabilities in finding malicious nodes in the network.

Keywords: Mobile Ad Hoc Networks, routing misbehavior, network security.

1 Introduction

A Mobile Ad Hoc Network (MANET) is a collection of mobile nodes (hosts) which communicate with each other via wireless links, either directly or by relying on other nodes as routers. In some MANET applications, such as battlefield or rescue operations, all nodes have a common goal and their applications belong to a single authority, thus they are cooperative by nature. However, in many civilian applications, such as networks of cars and the provision of communication facilities in remote areas, nodes typically do not belong to a single authority and they do not pursue a common goal. In such self-organized networks, forwarding packets for other nodes is not in the direct interest of anyone, so there is no good reason to trust nodes and assume that they always cooperate. Indeed, each node tries to save its resources, particularly its battery power, which is a precious resource. Recent studies show that most of a node's energy in MANETs is likely to be devoted to forwarding packets for other nodes. For instance, the simulation studies of Buttyan and Hubaux show that when the average number of hops from a source to a destination is around 5, almost 80% of the transmission energy will be devoted to packet forwarding. Therefore, to save energy, nodes may misbehave and tend to be selfish. A selfish node, with regard to the packet forwarding process, is a node which takes advantage of the forwarding service and asks others to forward its own packets but does not actually participate in providing this service. Several techniques have been proposed to detect and alleviate the effects of such selfish nodes in MANETs. In particular, two techniques were introduced, namely watchdog [4] and pathrater [3], to detect and mitigate the effects of routing misbehavior, respectively. The watchdog technique identifies the misbehaving nodes by overhearing on the wireless medium. The pathrater technique allows nodes to avoid the use of the misbehaving nodes in any future route selections. The watchdog technique is based on passive overhearing. Unfortunately, it can only determine whether or not the next-hop node sends out the data packet. The reception status of the next-hop link's receiver is usually unknown to the observer. In order to mitigate the adverse effects of routing misbehavior, the misbehaving nodes need to be detected so that these nodes can be avoided by all well-behaved nodes. In this paper, we focus on the following problem:

Misbehavior Detection and Mitigation

In MANETs, routing misbehavior can severely degrade the performance at the routing layer. Specifically, nodes may participate in the route discovery and maintenance processes but refuse to forward data packets. How do we detect such misbehavior? How can we make such detection processes more efficient (i.e., with less control overhead) and accurate (i.e., with low false alarm rate and missed detection rate)?

We propose the 2ACK scheme to mitigate the adverse effects of misbehaving nodes. The basic idea of the 2ACK scheme is that, when a data packet has been transmitted successfully over the next hop, the destination node of the next-hop link will send back a special two-hop acknowledgment called 2ACK to indicate that the data packet has been received successfully. Such a 2ACK transmission takes place for only a fraction of data packets, but not all. Such a "selective" acknowledgment is intended to reduce the additional routing overhead caused by the 2ACK scheme. Judgment on node behavior is made after observing its behavior for a certain period of time.

In this paper, we present the details of the 2ACK scheme and our evaluation of the 2ACK scheme as an add-on to the Dynamic Source Routing (DSR) protocol.

2 Related Work

Malicious network nodes that participate in routing protocols but refuse to forward messages may corrupt a MANET. These problems can be circumvented by implementing a reputation system. The reputation system is used to instruct correct nodes about those that should be avoided in message routes. However, as is, the system rewards selfish nodes, who benefit from not forwarding messages while being able to use the network. In modern society, services are usually provided in exchange for an amount of money, previously agreed between both parties. The Terminodes project defined a virtual currency named beans, used by nodes to pay for the messages. Those beans would be distributed among the intermediary nodes that forwarded the message. Implementations of digital cash systems supporting fraud detection require several different participants and the exchange of a significant number of messages. To reduce this overhead, Terminodes assumes that hosts are equipped with a tamper-resistant security module, responsible for all the operations over the beans counter, that would refuse to forward messages whenever the number of beans available is not sufficient to pay for the service. The modules use a Public Key Infrastructure (PKI) to ensure the authentication of the tamper-resistant modules. This infrastructure can be used with two billing models. In the Packet Purse Model, the sender pays every intermediary node for the message, while in the Packet Trade Model it is the receiver that is charged. In both models, hosts are charged as a function of the number of hops traveled by the message.

The CONFIDANT protocol implements a reputation system for the members of MANETs. Nodes with a bad reputation may see their requests ignored by the remaining participants, thus excluding them from the network. When compared with the previous system, CONFIDANT shows two interesting advantages. It does not require any special hardware and avoids the "self-inflicted punishment" that could be the exploitation point for malicious users. The system tolerates certain kinds of attacks by being suspicious of the incoming selfishness alerts that other nodes broadcast and by relying mostly on its own experience.

These systems show two approaches that conflict in several aspects. The number of requests received by hosts depends on their geographical position. Hosts may become overloaded with requests because they are positioned at a strategic point in the MANET. A well-behaved node that temporarily supports a huge number of requests should later be rewarded for this service. CONFIDANT has no memory, in the sense that the services provided by some host are quickly forgotten by the reputation system. On the other hand, beans can be kept indefinitely by hosts. In MANETs, it is expected that hosts move frequently, thereby changing the network topology. The number of hops that a message must travel is a function of the instantaneous positions of the sender and the receiver and varies with time. Terminodes charges the sender or the receiver of a message based on the number of hops traveled, which may seem unfair since either of them will pay based on a factor that is outside their control.

3 Routing Misbehavior Model

We present the routing misbehavior model [1] considered in this paper in the context of the DSR protocol. Due to DSR’s popularity, we use it as the basic routing protocol to illustrate our proposed add-on scheme. We focus on the following routing misbehavior: A selfish node does not perform the packet forwarding function for data packets unrelated to itself. However, it operates normally in the Route Discovery and the Route Maintenance phases of the DSR protocol. Since such misbehaving nodes participate in the Route Discovery phase, they may be included in the routes chosen to forward the data packets from the source. The misbehaving nodes, however, refuse to forward the data packets from the source. This leads to the source being confused.

In guaranteed services such as TCP, the source node may either choose an alternate route from its route cache or initiate a new Route Discovery process. The alternate route may again contain misbehaving nodes and, therefore, the data transmission may fail again. The new Route Discovery phase will return a similar set of routes, including the misbehaving nodes. Eventually, the source node may conclude that routes are unavailable to deliver the data packets. As a result, the network fails to provide reliable communication for the source node even though such routes are available. In best-effort services such as UDP, the source simply sends out data packets to the next-hop node, which forwards them on. The existence of a misbehaving node on the route will cut off the data traffic flow. The source has no knowledge of this at all.

In this paper, we propose the 2ACK technique to detect such misbehaving nodes. Routes containing such nodes will be eliminated from consideration. The source node will be able to choose an appropriate route to send its data. In this work, we use both UDP and TCP to demonstrate the adverse effect of routing misbehavior and the performance of our proposed scheme.

The attackers (misbehaving nodes) are assumed to be capable of performing the following tasks:

• dropping any data packet,

• masquerading as the node that is the receiver of its next-hop link,

• sending out fabricated 2ACK packets,

• sending out fabricated hn, the key generated by the 2ACK packet senders, and

• claiming falsely that its neighbor or next-hop links are misbehaving.

4 The New Approach

4.1 Solution Overview

To mitigate the watchdog problem related to power control usage, we propose a new approach. Like the watchdog, we suggest that each node in the route monitors the forwarding of each packet it sends. To explain the concepts, we suppose, without loss of generality, that A sends packets to B and monitors its forwarding to C. A source routing protocol is also assumed to be used.

We define a new kind of feedback we call the two-hop ACK [4]; it is an ACK that travels two hops. Node C acknowledges packets sent by A by sending the latter a special ACK via B. Node B could, however, escape from the monitoring without being detected by sending A a falsified two-hop ACK. Note that behaving in this way is power-economic for B, since sending a short packet like an ACK consumes much less energy than sending a data packet. To avoid this vulnerability we use an asymmetric-cryptography-based strategy as follows:

Fig. 1: Solution framework

Node A generates a random number and encrypts it with C's public key (PK), then appends it to the packet's header along with A's address. When C receives the packet, it gets the number back, decrypts it using its secret key (SK), encrypts it using A's PK, and puts it in a two-hop ACK which is sent back to A via B. When A receives the ACK, it decrypts the random number and checks whether the number within the packet matches the one it has generated, in order to validate B's forwarding of the corresponding packet. However, if B does not forward the packet, A will not receive the two-hop ACK, and it will be able to detect this misbehavior after a timeout. This strategy needs a security association between each pair of nodes to ensure that nodes share their PKs with each other. This requires a key distribution mechanism, which is outside the scope of this paper. Another problem would take place when node C misbehaves. If C neither forwards the packet nor sends the two-hop ACK back to A, A could suspect that B did not forward the packet even if it actually did. To overcome this problem, we propose that the sending of the two-hop ACKs is performed implicitly upon the reception of the packet at the MAC layer, and we assume that the lower layers (the MAC and physical layers) are robust and tamper resistant. This can be ensured by the hardware and the operating system of each node; that is, the operations of the lower layers cannot be modified by any node, and node C cannot avoid sending the two-hop ACK back to A upon reception of the packet, so B's monitoring is performed accurately. However, the upper layers, including the network layer, may be tampered with by a selfish or malicious node, and falsified packets can be sent. Our solution is composed of two parts: the first one is located at the network layer and can be viewed as a sub-layer at the bottom of this layer, whereas the second one is located over the MAC layer and is a sub-layer at the top of the latter. Figure 1 illustrates this framework.
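A rough sketch of this challenge exchange is given below. It uses RSA-OAEP from the Python cryptography package merely as a stand-in for whatever asymmetric primitives and key distribution the nodes actually employ; it is not the implementation described in the paper.

# Sketch of the challenge carried in the two-hop ACK, using RSA-OAEP from the
# Python "cryptography" package as an illustrative stand-in.
import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Key pairs of A and C (assumed to have been exchanged beforehand).
a_priv = rsa.generate_private_key(public_exponent=65537, key_size=2048)
c_priv = rsa.generate_private_key(public_exponent=65537, key_size=2048)
a_pub, c_pub = a_priv.public_key(), c_priv.public_key()

# Node A: generate a random number, encrypt it with C's public key, and put it
# in the packet header together with A's address.
nonce = os.urandom(16)
packet = {"src": "A", "challenge": c_pub.encrypt(nonce, OAEP), "payload": b"data"}

# Node C: recover the number, re-encrypt it with A's public key, and send it
# back to A via B inside the two-hop ACK.
recovered = c_priv.decrypt(packet["challenge"], OAEP)
two_hop_ack = {"response": a_pub.encrypt(recovered, OAEP)}

# Node A: credit B's forwarding only if the decrypted number matches the one
# it generated; a missing or mismatching ACK is detected after a timeout.
print(a_priv.decrypt(two_hop_ack["response"], OAEP) == nonce)   # True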

Fig. 1.1: The 2ACK scheme.

4.2 Details of the 2ACK Scheme

The 2ACK scheme is a network-layer technique to detect misbehaving links and to mitigate their effects. It can be implemented as an add-on to existing routing protocols for MANETs, such as DSR. The 2ACK scheme detects misbehavior through the use of a new type of acknowledgment packet, termed 2ACK. A 2ACK packet is assigned a fixed route of two hops (three nodes) in the opposite direction of the data traffic route.

Fig. 1.1 illustrates the operation of the 2ACK scheme. Suppose that N1, N2, and N3 are three consecutive nodes (triplet) along a route. The route from a source node, S, to a destination node, D, is generated in the Route Discovery phase of the DSR protocol. When N1 sends a data packet to N2 and N2 forwards it to N3, it is unclear to N1 whether N3 receives the data packet successfully or not. Such an ambiguity exists even when there are no misbehaving nodes. The problem becomes much more severe in open MANETs with potential misbehaving nodes.

The 2ACK scheme requires an explicit acknowledgment to be sent by N3 to notify N1 of its successful reception of a data packet: When node N3 receives the data packet successfully, it sends out a 2ACK packet over two hops to N1 (i.e., the opposite direction of the routing path as shown), with the ID of the corresponding data packet. The triplet [N1 N2 N3] is
derived from the route of the original data traffic. Such a triplet is used by N1 to monitor the link N2 → N3. For convenience of presentation, we term N1 in the triplet [N1 N2 N3] the 2ACK packet receiver or the observing node and N3 the 2ACK packet sender.

Such a 2ACK transmission takes place for every set of triplets along the route. Therefore, only the first router from the source will not serve as a 2ACK packet sender. The last router just before the destination and the destination will not serve as 2ACK receivers.

To detect misbehavior, the observing node maintains a list of IDs of data packets that have been sent out but have not been acknowledged. For example, after N1 sends a data packet on a particular path, say, [N1 N2 N3] in Fig. 1.1, it adds the data ID to LIST (refer to Fig. 2, which illustrates the data structure maintained by the observing node), i.e., on its list corresponding to N2 → N3. A counter of forwarded data packets, Cpkts, is incremented simultaneously. At N1, each ID will stay on the list for τ seconds, the timeout for 2ACK reception. If a 2ACK packet corresponding to this ID arrives before the timer expires, the ID will be removed from the list. Otherwise, the ID will be removed at the end of its timeout interval and a counter called Cmis will be incremented.

When N3 receives a data packet, it determines whether it needs to send a 2ACK packet to N1. In order to reduce the additional routing overhead caused by the 2ACK scheme, only a fraction of the data packets will be acknowledged via 2ACK packets. Such a fraction is termed the acknowledgment ratio, Rack. By varying Rack, we can dynamically tune the overhead of 2ACK packet transmissions. Node N1 observes the behavior of link N2 → N3 for a period of time termed Tobs. At the end of the observation period, N1 calculates the ratio of missing 2ACK packets as Cmis/Cpkts and compares it with a threshold Rmis. If the ratio is greater than Rmis, link N2 → N3 is declared misbehaving and N1 sends out an RERR (or misbehavior report) packet. The data structure of the RERR packet is shown in Fig. 3. Since only a fraction of the received data packets are acknowledged, Rmis should satisfy Rmis > 1 - Rack in order to eliminate false alarms caused by such a partial acknowledgment technique.
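The observing node's bookkeeping can be sketched as follows. The parameter values (Rack, Rmis, and the timeout) are illustrative only; as discussed in the text, they would have to be tuned so that Rmis > 1 - Rack.

# Sketch of the bookkeeping done by the observing node N1 for the link N2 -> N3.
import time

class LinkObserver:
    def __init__(self, rack=0.2, rmis=0.85, tau=2.0):
        assert rmis > 1 - rack          # condition from the text to avoid false alarms
        self.rack, self.rmis, self.tau = rack, rmis, tau
        self.pending = {}               # LIST: packet ID -> time it was sent
        self.c_pkts = 0                 # Cpkts: forwarded data packets
        self.c_mis = 0                  # Cmis: packets whose 2ACK never arrived

    def data_sent(self, packet_id):
        self.pending[packet_id] = time.time()
        self.c_pkts += 1

    def ack_received(self, packet_id):
        self.pending.pop(packet_id, None)

    def expire_timeouts(self):
        now = time.time()
        for pid, sent in list(self.pending.items()):
            if now - sent > self.tau:   # no 2ACK within tau seconds
                del self.pending[pid]
                self.c_mis += 1

    def link_misbehaving(self):
        # Evaluated at the end of the observation period Tobs: declare the link
        # misbehaving (and send an RERR / misbehavior report) if the ratio of
        # missing 2ACKs exceeds the threshold Rmis.
        self.expire_timeouts()
        return self.c_pkts > 0 and (self.c_mis / self.c_pkts) > self.rmis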

Fig. 2: Data structure maintained by the observing node.

Fig. 3: Data structure of the RERR packet

Each node receiving or overhearing such an RERR marks the link N2 → N3 as misbehaving and adds it to the blacklist of misbehaving links that it maintains. When a node starts its own data traffic later, it will avoid using such misbehaving links as a part of its route.

The 2ACK scheme can be summarized in the pseudocode provided in the appendix for the 2ACK packet sender side (N3) and the observing node side (N1).

5 Simulation Results

GloMoSim (Tool for Simulating Misbehavior in Wireless Ad Hoc Networks)

Global Mobile Information System Simulator (GloMoSim) [7] provides a scalable simulation environment for large wireless and wireline communication networks. Its scalable architecture supports up to a thousand nodes linked by a heterogeneous communications capability that includes multihop wireless communications using ad hoc networking. Provisions exist for setting the general simulation parameters, scenario topology, mobility, radio and propagation models, MAC protocol, and routing protocol. Using the application configuration file, the following traffic generators are supported: TELNET and CBR. The following parameters are used in the simulation: simulation time: 150 seconds; area: 1000 x 1000 m^2; number of nodes: 30; number of connections: 8; transmission power: 15 dBm; number of malicious nodes: variable (1-10). In defining the degree of membership function for each input parameter of the fuzzy inference system, these values have been taken into account. The MAC layer protocol used in the simulations was the IEEE 802.11 standard. Traffic is generated at a constant bit rate, with packets of length 512 B sent every 0.21 s.

Misbehavior Implementation

Malicious nodes simulate the following types of active attacks:

1. Modification Attack: These attacks are carried out by adding, altering, or deleting IP addresses in the ROUTE REQUEST and ROUTE REPLY packets that pass through the malicious nodes.

2. No-forwarding Attack: This attack is carried out by dropping control packets or data packets that pass through the malicious nodes. A minimal sketch of both behaviors follows.
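The following fragment is a purely illustrative Python model of these two attack behaviors, not GloMoSim code; the packet representation and the fake address value are assumptions made for the example.

# Illustrative model of the two simulated attacks (not GloMoSim code).
import random

def modification_attack(packet, fake_addr=99):
    # Attack 1: add, alter, or delete an address in the route record of
    # ROUTE REQUEST / ROUTE REPLY packets passing through the malicious node.
    if packet["type"] in ("ROUTE_REQUEST", "ROUTE_REPLY") and packet["route"]:
        packet["route"][random.randrange(len(packet["route"]))] = fake_addr
    return packet

def no_forwarding_attack(packet):
    # Attack 2: silently drop control or data packets instead of forwarding them.
    return None

# Example: a tampered ROUTE_REPLY keeps flowing, while a data packet is dropped.
print(modification_attack({"type": "ROUTE_REPLY", "route": [1, 2, 3]}))
print(no_forwarding_attack({"type": "DATA", "route": [1, 2, 3]}))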

6 Conclusion

MANETs are particularly sensitive to unexpected behaviors. The generalization of wireless devices will soon turn MANETs into one of the most important connection methods to the Internet. However, the lack of a common goal in MANETs without a centralized human authority will make them difficult to maintain: each user will attempt to get the most out of the network while expecting to pay as little as possible. In human communities, this kind of behavior is called selfishness. While prohibiting selfishness proves to be impossible over a decentralized network, applying punishments to those that present this behavior may be beneficial. As we have seen, the watchdog technique, used by almost all the solutions currently proposed to detect nodes that misbehave in packet forwarding in MANETs, fails when power control is employed. In this paper, we have proposed a new approach that overcomes this problem. We have proposed and evaluated a technique, termed 2ACK, to detect and mitigate the effect of such routing misbehavior. The 2ACK technique is based on a simple two-hop acknowledgment packet that is sent back by the receiver of the next-hop link. Compared with other approaches to combat the problem, such as the overhearing technique, the 2ACK scheme overcomes several problems including ambiguous collisions, receiver collisions, and limited transmission powers. The 2ACK scheme can be used as an add-on technique to routing protocols such as DSR in MANETs.


Simulation results also show that there is always a possibility of false detection. Consequently, a monitoring node cannot immediately accuse another node of being selfish when it detects that a packet has been dropped at the latter. Instead, a threshold should be used, as in the watchdog, and the monitored node will be considered selfish as soon as the number of packets dropped at it exceeds this threshold, whose value should be configured carefully to tolerate drops caused by collisions and node mobility.

These results show that we can gain the benefits of an increased number of routing nodes while minimizing the effects of misbehaving nodes. In addition we show that this can be done without a priori trust or excessive overhead.

References

[1] Kejun Liu, Jing Deng, Pramod K. Varshney and Kashyap Balakrishnan, "An Acknowledgment-Based Approach for the Detection of Routing Misbehavior in MANETs", IEEE Transactions on Mobile Computing, Vol. 6, No. 5, May 2007.
[2] H. Miranda and L. Rodrigues, "Preventing Selfishness in Open Mobile Ad Hoc Networks", October 2002.
[3] S. Marti, T. Giuli, K. Lai and M. Baker, "Mitigating Routing Misbehavior in Mobile Ad Hoc Networks", Aug. 2000.
[4] Djamel Djenouri and Nadjib Badache, "New Approach for Selfish Nodes Detection in Mobile Ad hoc Networks".
[5] J.-P. Hubaux, T. Gross, J.-Y. LeBoudec and M. Vetterli, "Toward Self-Organized Mobile Ad Hoc Networks: The Terminodes Project", IEEE Comm. Magazine, Jan. 2001.
[6] K. Balakrishnan, J. Deng and P.K. Varshney, "TWOACK: Preventing Selfishness in Mobile Ad Hoc Networks", Proc. IEEE Wireless Comm. and Networking Conf. (WCNC '05), Mar. 2005.
[7] GloMoSim. Available at: http://pcl.cs.ucla.edu/projects/glomosim


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Multi Layer Security Approach for Defense Against

MITM (Man-in-the-Middle) Attack

K.V.S.N. Rama Rao, Satyam Computer Services Ltd
Shubham Roy Choudhury, Satyam Computer Services Ltd
Manas Ranjan Patra, Berhampur University
Moiaz Jiwani, Satyam Computer Services Ltd

Abstract

Security threats are a major deterrent to the widespread acceptance of web applications. Web applications have become a universal channel used by many people, which has introduced potential security risks and challenges. Hardening web server security is one way to secure data on servers, but it does not stop attackers who target the client side by tapping the connection between client and server and thereby gain access to sensitive data, an attack commonly known as Man-in-the-Middle (MITM). This paper provides a multi-layer security approach to further protect HTTPS from MITM, which is the only attack possible on an HTTPS connection. We propose security measures at different OSI layers and also provide an approach to designing various topologies on LAN and WAN to enhance the security mechanism.

1 Introduction

Web applications are becoming the dominant way to provide access to on-line services such as webmail and e-commerce. Unfortunately, not all users use the Internet in a positive way, and with the growth of Internet usage security issues are increasing every day, so there is an urgent need for tighter security measures. Web server security has been tightened nowadays, so attackers are targeting the client side: they try to obtain sensitive data by intruding into the connection between client and server. HTTP communication is fine for the average web server, which just contains informational pages, but for an e-commerce site that requires secure transactions, the connection between the client and the web server should be secure. The most common means is to use HTTPS over the Secure Sockets Layer (SSL), which uses public key cryptography to protect confidential user information. However, HTTPS provides security only at the top layers of the OSI protocol stack (the application and presentation layers) and ignores security at the lower layers. Hence attackers can use the lower layers of OSI to gain access to the connection through a MITM (Man-in-the-Middle) attack. Users on a LAN as well as on a WAN are vulnerable to MITM attacks. This paper provides a multi-layer security approach to protect HTTPS from MITM, which is the only attack possible on an HTTPS connection. It also discusses security concerns at different OSI layers and provides an approach to designing various topologies on LAN and WAN to enhance the security mechanism.

The rest of the paper is organized as follows. Section 2 introduces several attacks on HTTP and HTTPS. Section 3 describes the MITM attack on a LAN. Section 4 presents our multi-layer security approach to prevent the MITM attack on a LAN. Section 5 describes the MITM attack on a WAN and presents our approach to prevent such an attack. Section 6 briefly concludes.

2 Attacks on Http/Https

Attacks on HTTP: Attacks on the HTTP protocol can be broadly classified into three types.

1. The basic attack is sniffing the request and response parameters over the network. With this attack, an attacker can get access to confidential information such as credit card numbers and passwords, as this information can be retrieved as plain text.

2. Secondly, one can manipulate request and response parameters.

3. An attacker can get access to a user's account without knowing the username and password through session hijacking and cookie cloning.

In order to circumvent these attacks HTTPS was introduced which was considered to be secure.

Attacks on HTTPS

There are two ways to attack any communication secured via HTTPS.

1. By sniffing the HTTPS packets over the network using software such as Wireshark. The sniffed packets are then decrypted and the attacker can extract the hidden information if the encryption is weak; when the information is encrypted with a strong cipher (e.g., 128-bit keys), it is difficult to decrypt.

2. The most prevalent attack on HTTPS is MITM (Man in the Middle Attack) which is described in the next section.

3 Man in the Middle Attack (MITM) on SSL

To access any secure website (HTTPS) over the Internet, a secure connection is first established by exchanging public keys. During this exchange of public keys, the chances of the client being exposed to a MITM attack are highest. Protocols that rely on the exchange of public keys to protect communications are often the target of these types of attacks.

3.1 MITM on LAN

A person on a LAN is more prone to MITM because the victim is on the same physical network as the attacker. The steps involved in the attack are given below.

Step 1 – ARP Cache Poisoning/ARP Spoofing: Consider three hosts in a switched environment, as shown in Figure 1, where one of the hosts is an attacker.


Fig. 1: Three Hosts in a switched environment

In a switched network, when Host A sends data to Host B, the switch, on receiving the packet from Host A, reads the destination address from the header and then delivers the packet to Host B by establishing a temporary connection between Host A and Host B. Once the transfer of data is complete, the connection is terminated. Because of this behavior of a switch, sniffing the traffic flowing between Host A and Host B is not possible, so the attacker uses the ARP poisoning technique to capture the traffic.

Step 2 – Giving the Client a Fake Certificate: Since all the traffic flows through the attacker, he has full access to the victim's requests. Whenever the victim requests a secure connection via SSL (HTTPS) and waits for a digital certificate (public key), the attacker generates and sends a fake certificate to the victim and makes the victim believe that a secure connection has been established. From the above steps it is clear that these attacks take advantage of protocols that work at different OSI layers, i.e. HTTP at layer 7 and TCP at layer 4, whereas ARP works at layer 2. Hence we use a multi-layer security approach to secure the vulnerable layers.

4 Multi Layer Security Approach to Prevent Mitm Attack on Lan

Generally, the security approach concentrates mainly on the application layer, without giving much emphasis to the lower layers. So the first step is to prevent ARP cache poisoning/ARP spoofing, which occurs at layer 2. To protect a host's ARP cache from being poisoned it is possible to make it static. If an ARP cache has been made static it will not process any ARP replies and will not broadcast any ARP requests, unlike a dynamic ARP cache. Static ARP entries are not practical for large networks, so for larger networks we propose the following steps to secure against ARP spoofing. The first step is to change the network topology, i.e., when designing the network, add more subnets if feasible. If we subnet the LAN further, fewer static ARP entries need to be applied. Also, at each entry/exit node of a subnet, place an IDS (Intrusion Detection System). The IDS will monitor each subnet (a small network) for any change in the MAC-address-to-IP-address association and raise an alert, as shown in figure 3.
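To make the IDS idea concrete, the sketch below (Java, with hypothetical class and method names) watches the IP-to-MAC associations observed on a subnet and raises an alert when a known IP suddenly maps to a different MAC address, which is the signature of ARP cache poisoning. It is a sketch of the detection logic only, not of any particular IDS product.

  import java.util.HashMap;
  import java.util.Map;

  /** Minimal sketch of the subnet IDS logic: flag changes in the IP-to-MAC association. */
  public class ArpWatch {
      private final Map<String, String> ipToMac = new HashMap<>();

      /** Feed every ARP reply observed on the subnet into this method. */
      public void onArpReply(String ip, String mac) {
          String known = ipToMac.get(ip);
          if (known == null) {
              ipToMac.put(ip, mac);          // first time we see this IP
          } else if (!known.equalsIgnoreCase(mac)) {
              alert(ip, known, mac);         // possible ARP spoofing
          }
      }

      private void alert(String ip, String oldMac, String newMac) {
          System.out.printf("ALERT: %s moved from %s to %s (possible ARP poisoning)%n",
                  ip, oldMac, newMac);
      }
  }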

This is how we protect layer 2. The second layer involved is layer 3, i.e. the network layer. To secure this layer, we use the IPSec (IP Security) protocol. IPSec protocols can supply access control, authentication, data integrity, and confidentiality for each IP packet between two participating network nodes. After securing layer 3, the other layer involved is layer 7, which can be secured using HTTPS. However, users should be careful about accepting/installing certificates, verifying that certificates are signed by a trusted Certificate Authority and paying attention to the browser's warnings.


Fig. 3: IDS on each subnet

5 MITM on WAN and Defense

Generally, MITM on a WAN is used for traffic analysis. In traffic analysis an attacker intercepts and examines packets over a public network. This lets the attacker learn the victim's surfing profile and track their behaviour over the Internet. The data payload consists of the actual message, whereas the header consists of information about the source, destination, size and other details of the packet. Even if the data payload is encrypted, traffic analysis reveals a lot of information about the data, source and destination contained in the header, which is not encrypted. Traffic analysis can be performed using tools such as i2, Visual Analytics, etc. In order to minimize the risk of traffic being intercepted and analyzed, we propose the following solution.

5.1 Defence Against MITM in WAN

Random Routing: To protect against MITM on a WAN we propose the concept of random routing. Here the gateway also acts as a directory server which maintains a list of different routes through which packets can be routed to the destination. Figure 5 demonstrates the MITM attack on a WAN, where traffic between the gateway and the service/server is intercepted by the MITM.

Fig. 5: MITM on WAN


We first find all available paths from our gateway to the destination server. These paths are ranked on the basis of network congestion (least congested first). The traffic is then divided into small chunks, and each chunk is sent through a different path according to this ranking.

Algorithm

Step 1: Let the total number of nodes on the network be N. The number of possible paths (P) that can be taken by traffic from source to destination is then P = N! (where N is the total number of nodes).

Let the time taken through path 1 (pt1) be t1, the time taken through path 2 (pt2) be t2, and so on up to the time taken through path n (ptn), tn. Hence:

Time_taken_through_each_path[] = t1, t2, t3, t4, ..., tn;
Available_Paths[] = pt1, pt2, pt3, pt4, ...;

Step 2: Using a sorting algorithm, sort the times in ascending order and arrange the paths according to the time taken; the path that takes the least time will be at the top. The output of Step 2 is the array of paths sorted by roundtrip time.

Step 3: The outgoing traffic (T) from the source or destination is divided into segments T1, T2, T3, T4, ..., Ts such that Traffic[] = T1, T2, T3, T4, ..., Ts;

Step 4: Send the traffic segments through the optimized paths obtained in Step 2. For example, segment T1 will be sent through the path having the least roundtrip time, T2 through the path having the next least roundtrip time, and so on (see the sketch below).
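A compact sketch of Steps 1–4 in Java is given below. The Path type, its round-trip-time field and the fixed chunk size are illustrative assumptions, and path discovery itself is assumed to be performed elsewhere (e.g., by the gateway/directory server).

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;

  /** Sketch of the segmented routing idea: sort paths by RTT, spread traffic segments over them. */
  public class SegmentedRouter {
      static class Path {
          final String id;
          final double rttMs;            // measured round-trip time of the path
          Path(String id, double rttMs) { this.id = id; this.rttMs = rttMs; }
      }

      /** Step 2: order the discovered paths by ascending round-trip time. */
      static List<Path> sortByRtt(List<Path> paths) {
          List<Path> sorted = new ArrayList<>(paths);
          sorted.sort(Comparator.comparingDouble(p -> p.rttMs));
          return sorted;
      }

      /** Steps 3-4: split the traffic into chunks and assign chunk i to path i (wrapping around). */
      static void dispatch(byte[] traffic, List<Path> sortedPaths, int chunkSize) {
          for (int off = 0, i = 0; off < traffic.length; off += chunkSize, i++) {
              int len = Math.min(chunkSize, traffic.length - off);
              Path p = sortedPaths.get(i % sortedPaths.size());
              System.out.printf("segment T%d (%d bytes) -> %s%n", i + 1, len, p.id);
              // a real implementation would hand the bytes to the forwarding layer here
          }
      }
  }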

Example scenario: Assume that the number of nodes available is 6, which gives 720 possible paths (6! = 720). Assume that we divide our outgoing traffic into four segments (sg1, sg2, sg3, sg4) and send them through four different paths. The four paths selected are those with the least roundtrip times.

Fig 6: Demonstrates the above scenario


Table 1. Tabular Representation of Figure 6

Traffic Path N1 N2 N3 N4 N5 N6

T1 Path1 * * *

T2 Path2 *

T3 Path3 * *

T4 Path4 * *

Therefore, in order to protect the connection from being intercepted or analysed, we divert the traffic via random routes as directed by the directory server. The directory server has a list of the available nodes and of the routes each traffic segment should take, thus reducing the chance of the attacker obtaining complete information about the traffic and hence the data.

6 Conclusion

Since web server security has been hardened, attackers are targeting the client side and trying to obtain sensitive data by intruding into the connection between client and server. The most common way to secure the connection is to use HTTPS, but it provides security only at the top layers of OSI and ignores lower-layer security. Hence attackers can use the lower layers of OSI to gain access to the connection through a MITM (Man-in-the-Middle) attack. Users on a LAN as well as on a WAN are vulnerable to MITM attacks. In this paper we proposed an approach for LAN and WAN to protect the connection from MITM attacks and ensure that the data is secured at the lower layers as well. In the case of a LAN, we believe that the use of network topology, along with an IDS to monitor changes in static ARP entries, can reduce the chances of ARP poisoning and hence prevent MITM attacks on the LAN. In the case of a WAN, we can divide the traffic and ensure that each segment takes a different optimized path, so that we minimize the risk of the traffic being analyzed.

References

[1] David Endler, "The Evolution of Cross-Site Scripting Attacks", http://www.cgisecurity.com/lib/XSS.pdf
[2] "Analysis of the SSL 3.0 Protocol", http://www.schneier.com/paper-ssl.pdf
[3] Peter Burkholder, "SSL Man-in-the-Middle Attacks", http://www.sans.org/reading_room/whitepapers/threats/480.php
[4] Nick Parlente, "Security3", http://www.stanford.edu/class/cs193i/handouts2002/39Security3.pdf
[5] IETF, RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1, http://www.ietf.org/rfc/rfc2616.txt
[6] IETF, RFC 2109: HTTP State Management Mechanism, http://www.ietf.org/rfc/rfc2109.txt
[7] The Open Web Application Security Project, "Cross Site Scripting", http://www.owasp.org/asac/input_validation/css.shtml
[8] The Open Web Application Security Project, "Session Hijacking", http://www.owasp.org/asac/auth-session/hijack.shtml


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Video Streaming Over Bluetooth

M. Siddique Khan, DCE, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India
Rehan Ahmad, DCE, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India
Tauseef Ahmad, DCE, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India
Mohammed A. Qadeer, DCE, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India

Abstract

The Bluetooth specification describes a robust and powerful technology for short-range wireless communication. Unfortunately, the specification is immense and complicated, presenting a formidable challenge for novice developers. This paper is concerned with recording video from handhelds (mobile phones) to desktop computers, and playing video on handhelds from servers using real-time video streaming. Users can record large amounts of data and store it on computers within range of the Bluetooth dongle. The videos on the server can be played on handhelds through real-time streaming over the Bluetooth network. We can create a Bluetooth PAN (piconet) in which mobile computers can dynamically connect to a master and communicate with other slaves; we can dynamically select any mobile computer and transfer data to it. Handhelds have limited storage capacity compared with computers, so computers are preferred for storing the recorded data.

1 Introduction

1.1 Problem Statement

A Bluetooth network has no fixed networking infrastructure [Bluetooth.com]. It consists of multiple mobile nodes which maintain network connectivity through wireless communication, and it is completely dynamic, so such networks are easily deployable. A mobile phone has limited storage compared with a computer. Therefore, efforts have been made to transfer recorded video to a computer and also to play prerecorded video from a PC on handhelds by real-time streaming over Bluetooth.

1.2 Motivation

The widespread use of Bluetooth and mobile devices has generated the need to provide services which are currently possible only in wired networks. The services that are provided over wired networks need to be explored for Bluetooth, and we expect that in the near future a Bluetooth PAN will provide all these services. Mobile phones are very common gadgets and most of them have a camera and audio recording facility, so nowadays they can be used for multiple purposes. However, a phone has limited storage; if the data can be transferred to a computer, recording can continue for many hours and the recorded data can also be played back on the phone. This increases the usability of mobile phones.

1.3 Approach

Transferring video data between a mobile device and a personal computer is not trivial. On the mobile side, J2ME [Prabhu and Reddi, 2004] is used to continuously take the camera input for video and the microphone input for audio. The recorded audio and video are converted into a byte array and the byte stream is written to an output stream over Bluetooth. On the PC side, J2SE [Deitel and Deitel, 2007] is used for the server program. The server opens an input stream connected to the client's output stream; whatever data is written arrives at the server as bytes and is redirected to a file, which can later be saved in any desired video format. Later, this saved data can be streamed back in real time for playback on the mobile in the same manner.
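A minimal sketch of the PC-side (J2SE) receiver using the JSR-82 Bluetooth API is shown below. Running JSR-82 on J2SE assumes an implementation such as BlueCove on the classpath, and the UUID, service name and output file name are arbitrary examples rather than values used by the authors.

  import java.io.FileOutputStream;
  import java.io.InputStream;

  import javax.bluetooth.UUID;
  import javax.microedition.io.Connector;
  import javax.microedition.io.StreamConnection;
  import javax.microedition.io.StreamConnectionNotifier;

  /** Sketch of the J2SE server: accept an RFCOMM (btspp) connection and save the byte stream to a file. */
  public class VideoReceiver {
      public static void main(String[] args) throws Exception {
          // Arbitrary example UUID and service name for the video-upload service.
          UUID uuid = new UUID("11111111111111111111111111111111", false);
          StreamConnectionNotifier notifier = (StreamConnectionNotifier)
                  Connector.open("btspp://localhost:" + uuid + ";name=VideoServer");

          StreamConnection conn = notifier.acceptAndOpen();   // blocks until a handset connects
          try (InputStream in = conn.openInputStream();
               FileOutputStream out = new FileOutputStream("recording.3gp")) {
              byte[] buf = new byte[4096];
              int n;
              while ((n = in.read(buf)) != -1) {
                  out.write(buf, 0, n);   // raw bytes exactly as sent by the handset
              }
          }
          conn.close();
          notifier.close();
      }
  }

The mobile (J2ME) side would open the matching client connection ("btspp://" URL of the server) and write the captured bytes to its output stream.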

2 Mobile System Architecture

2.1 Overview

The convergence of computing, multimedia and mobile communications is well underway. Mobile users are now able to benefit from a broad spectrum of multimedia features and services, including capturing, sending and receiving images, videos and music. To deliver such data-heavy, processing-intensive services, portable handheld systems must be optimized for high performance but low power, space and cost. Of the several processors used in mobile phones today, the STn8815 processor platform from STMicroelectronics is a culmination of breakthroughs in video coding efficiency, inventive algorithms and chip implementation schemes, and is used in most NOKIA mobile phones and PDAs. It enables smart phones, wireless PDAs, Internet appliances and car entertainment systems to play back media content, record pictures and video clips, and perform bidirectional audio-visual communication with other systems in real time. The general architecture of a mobile device using such a processor is shown in figure 1.

Fig. 1: typical system architecture using STn8815


3 Video Streaming Over Bluetooth

Traditional video streaming over wired/wireless networks typically has bandwidth, delay and loss requirements due to its real-time nature. Moreover, there are many potential factors, including time-varying link quality, out-of-range devices, and interference from other devices or external sources, that make Bluetooth links more challenging for video streaming. Recent research has been conducted to address these challenges. To present the various issues and give a clear picture of the field of video streaming over Bluetooth, we discuss three major areas, namely video compression, Quality-of-Service (QoS) control and intermediate protocols [Xiaohang]. Each of these areas is one of the basic components in building a complete architecture for streaming video over Bluetooth. The relations among them are illustrated in Figure 3, which shows the functional components for video streaming over Bluetooth links [Xiaohang]; the layer or layers at which each component works are also indicated. The aim of video compression is to remove redundant information from a digitized video sequence. Raw data must be compressed before transmission to achieve efficiency; this is critical for wireless video streaming since the bandwidth of wireless links is limited. Upon the client's request, the media server retrieves the compressed video, and the QoS control module adapts the media bit-streams, or adjusts the transmission parameters of the intermediate layer, based on the current link status and the QoS requirements [Xiaohang]. After this adaptation, the compressed video stream is partitioned into packets of the chosen intermediate layer (e.g., L2CAP, HCI, IP), where packets are packetized and segmented, and the segmented packets are sent to the Bluetooth module for transmission. On the receiving side, the Bluetooth module receives the media packets from the air, reassembles them in the intermediate protocols, and sends them to the decoder for decompression.

As shown in figure 3, QoS control can be further categorized into congestion control and error control [Feamstear and Balakrishnan]. Congestion control in Bluetooth is employed to prevent packet loss and reduce delay by regulating transmission rate or reserving bandwidth according to changing link status and QoS requirements. Error control, on the other hand, is to improve video quality in the presence of packet loss.


4 USB Programming

For USB port programming we use an open-source API called the jUSB API, since no API for USB programming is available in any Java SDK, not even in j2sdk1.5.0.02. The design approach to implementing the usb.windows package for the Java USB API is separated into two parts: one part deals with the enumeration and monitoring of the USB, while the other part looks into the aspects of communicating with USB devices in general. Both parts are implemented using the Java Native Interface (JNI) to access native operations on the Windows operating system. The jUSB dynamic link library (DLL) provides the native functions that realize the JNI interface of the Java usb.windows package.

Fig. 3: Architecture for Streaming Over Bluetooth

Communication with a USB device is managed by the jUSB driver. The structure and important aspects of the jUSB driver are introduced in section 5, which is only a summary and covers only a fraction of the driver implementation; a lot of useful information about driver writing and the internal structures can be found in Walter Oney's book "Programming the Microsoft Driver Model" [Oney's]. What we have explained is shown in Figure 5: the original USB driver stack is as shown in Figure 4, but the other drivers cannot be accessed by the programmer. Once the Java USB API is installed you are ready to program your own USB ports to detect USB devices as well as read from and write to these devices. The Java USB API is an open-source project carried out at the Institute for Information Systems, ETH Zürich, by Michael Stahl; for details of how to write code against it, refer to "Java USB API for Windows" by Michael Stahl [Stahl]. The basic classes used in this API are listed below.

DeviceImpl class – basic methods used:
Open Handle, Close Handle, Get Friendly Device Name, Get Attached Device Name, Get Num Ports, Get Device Description, Get Unique Device ID

jUSB class – basic methods used:
JUSBReadControl, getConfigurationBuffer, doInterruptTransfer

Fig. 4: USB driver stack for Windows
Fig. 5: Java USB API layer for Windows

5 Design

Fig. 6: Architectural design and data flow diagram


Fig. 7: Interface Design

6 Conclusion

In this paper, we have presented a system for compressing and streaming live video over networks, with the objective of designing an effective solution for mobile access. We developed a J2ME application on the mobile side and a J2SE application on the PC side. Three major aspects have to be taken into consideration, namely video compression, Quality-of-Service (QoS) control and intermediate protocols. Video compression removes redundancy to achieve efficiency in a limited-bandwidth network. QoS control includes congestion control and error control; it serves to check packet loss, reduce delay and improve video quality. On the server side, the USB port has to be programmed for enumeration, monitoring and communication with USB devices.

7 Future Enhancement

The developed application uses Bluetooth as the transmission medium. In the future, EDGE/GPRS [Fabri et al., 2000] or Wi-Fi could also be used as the medium, although EDGE/GPRS provides less bandwidth while Wi-Fi provides much more bandwidth than Bluetooth.

References

[1] [Bluetooth.com] Specification of the Bluetooth System – Core, Vol. 1, Ver. 1.1, www.bluetooth.com
[2] [Chia and Salim, 2002] Chong Hooi Chia and M. Salim Beg, "MPEG-4 video transmission over Bluetooth links", Proc. IEEE International Conf. on Personal Wireless Communication, New Delhi, 15-18 Dec. 2002.
[3] [Deitel and Deitel, 2007] Deitel & Deitel, "Java How to Program", sixth edition, Prentice Hall (2007).
[4] [Fabri et al., 2000] Simon N. Fabri, Stewart Worrall, Abdul Sadka and Ahmet Kondoz, "Real-Time Video Communications over GPRS", 3G Mobile Communication Technologies, Conference Publication No. 471, IEE 2000.
[5] [Feamstear and Balakrishnan] Nick Feamster and Hari Balakrishnan, "Packet Loss Recovery for Streaming Video", http://nms.lcs.mit.edu/projects/videocm/
[6] [Johansson et al., 2001] P. Johansson, M. Kazantzidls, R. Kapoor and M. Gerla, "Bluetooth: An Enabler for Personal Area Networking", IEEE Network, Vol. 15, Issue 5, Sept.-Oct. 2001, pp. 28-37.
[7] [Lansford and Stephens, 2001] J. Lansford, A. Stephens and R. Nevo, "Wi-Fi (802.11b) and Bluetooth: enabling coexistence", IEEE Network, Vol. 15, Issue 5, Sept.-Oct. 2001, pp. 20-27.
[8] [Oney's] Walter Oney, "Programming the Microsoft Driver Model".
[9] [Prabhu and Reddi, 2004] C.S.R. Prabhu and A. Prathap Reddi, "Bluetooth Technology and its Application with Java and J2ME", Prentice Hall India (2004).
[10] [Stahl] Michael Stahl, "Java USB API for Windows".
[11] [Xiaohang] Wang Xiaohang, "Video Streaming over Bluetooth: A Survey".


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Role of SNA in Exploring and Classifying Communities

within B-Schools through Case Study

Dhanya Pramod, IIM, Pune
Krishnan R., IIM, Pune
Manisha Somavanshi, IIM, Pune

Abstract

The facets of organizational behavior have changed since the advent of the Internet. The World Wide Web has become not only a platform for communication but also a facilitator of knowledge sharing. This paper focuses on how people within an academic organization behave as communicators. Social network analysis (SNA) has emerged as a powerful method for understanding the importance of relationships in networks. This paper presents a study that examines the mode and frequency of communication and the use of web technology for communication within and between departments of academic institutes. Few studies have examined B-Schools' current organization models and the use of social networks to improve the relationships and communication among members and between different communities. Here we identify communities and main roles in a B-School using SNA. Formal and informal communications are found to have a great influence on the social network.

1 Introduction

We have created a model to describe the relationships among the different departments in a B-school, their communities, and how these communities affect the productivity and the working environment of the organization. The model considers different types of relationships among the members of different communities: competition, communication, exchange of information, and education.

Social network analysis is about understanding the flows of communication between people, groups, organizations and other information- or knowledge-processing entities. The social network is one of the most important real-life networks in real-world scenarios. A typical feature of a social network is its dense structure, which is essential for understanding the network's internal structure and function. Traditional social network analysis usually focuses on the principle of centralization and the power of a single individual or entity; however, in daily life a group or an organization often holds a more influential position and plays a more important role. Therefore, in this paper we first present a scenario where social networks of academic institutes are useful for investigating and identifying the structure of the institute. A typical academic institute consists of departments such as administration, accounts, library, examination, placements and canteen. Cross-functional flow of information between the departments is essential for the smooth functioning of the institute and for providing updated information on various day-to-day matters. There are various ways by which people communicate with each other, viz. emails, intranet, meetings or MIS reports. The number, size, and connections among the sub-groupings in a network can tell us a lot about the likely behavior of the network as a whole, for example how fast things will move across the actors in the network.

The rest of the paper is organized as follows. Section 2 discusses the flow of communication in B-Schools and the different categories observed. In Section 3 SNA is applied to a case study and the findings are reported. We then propose a social network model, touch upon related work, and end the paper with conclusions and future work.

2 Study of Communication Flow in B-Schools

We have studied the organizational structure and processes of some top B-Schools in Pune and analyzed their communication flow. According to the communication pattern, the institutes are categorized from A to F. The following parts of this section describe the various types of communication flows.

Fig. 1: Category A

In a category A organization the power of decision making is centralized at the top of the organization's structure and subsequently delegated to various departmental heads, who in turn communicate with the respective departments under their purview. [Fig 1] shows the organization chart for this category. The various departments are divided under 7 main responsibility centers, viz. Administration, Academics, Library, Training & Placements, Hostel, Gymkhana and Alumni. The means of communication are meetings, emails, interoffice memos, intraoffice memos and phone calls. Regular feedback is communicated to the concerned departmental heads in the form of reports, which are then communicated to the top-level management in meetings. Since the activities of the various departments are interdependent, a horizontal flow of information exists between departmental heads.

In category B the policies and rules governing the functioning of the organization are decided by the management, and the director is in charge of planning, execution and control of the process. Academics and administration responsibilities are delegated to the HOD and the registrar respectively, while library and placement activities are carried out separately. In this category all administrative responsibilities lie with the registrar, and for academics the HOD has to coordinate with him for the daily functioning of the departments. The category C institute does not have a separate academic head; the director is involved in academic activities. The category F organization has more vertical levels, as each type of responsibility has a coordinating officer for monitoring; all academics-related activities such as seminars, workshops and faculty development programs have different coordinators, who in turn delegate work to faculty members. Horizontal levels are more numerous in category E, and vertical communication is coordinated by separate cells.


Fig. 2: Category F

The category F organization follows the principle of decentralization: every department has its own head and its own working procedures, which suits organizations that have diverse activities and require a skilled workforce for the respective tasks. After analyzing the communication patterns of the A to F organization structures above, we found that the category F organization [Fig 2] has the most decentralized organizational structure, and hence a social network is best suited for this kind of organization. Therefore we have considered a category F type of organization for our case study.

3 Social Network Model

For our case study we considered a B-School which falls under category F and has IT and Management departments. In this organization the faculty community has various sub-communities for handling activities: to list a few, reception committees, hall committees, food committees, technical committees, transport committees etc. for different events. To handle day-to-day activities, coordinators and staff liaise with each other and thus form a community. The learning-facilitator community, class teachers or mentors etc. are other communities found. These kinds of organization end up with a large number of communities, and it is interesting to note that a person is part of many communities. We found that there is tremendous scope for a social network in this kind of organization, so that members can communicate on a standard, common platform and retain the information exchanged for further reference. It would also enable easy handing over of responsibility when a person leaves the organization.

3.1 Data Analysis

The IT department of the specified organization has 14 faculty members and the Management department has 20 faculty members. The Management department is further divided into 5 specializations having 5, 4, 6, 3 and 2 faculty members respectively. According to the analysis done using SNA, the organization has very strong communication among faculty members within each department. We have calculated the inside communication degree and the inside communication strength, with the frequency of communication as an additional parameter.


Inside communication degree (ICD) = No. of edges / (No. of vertices × (No. of vertices − 1))

Inside communication strength (ICS) = ∑ weight of edges / (No. of vertices × (No. of vertices − 1))

where weight of an edge = frequency of communication (average no. of communications per month)

Outside communication degree (OCD) = No. of edges between A and B / (No. of vertices of A × No. of vertices of B)

Outside communication strength (OCS) = ∑ weight of edges between A and B / (No. of vertices of A × No. of vertices of B)

where weight of an edge = frequency of communication (average no. of communications per month)

For the IT department [Table 1]:
ICD = 173 / 182 = 0.95
ICS = 443 / (14 × (14 − 1)) = 443 / 182 = 2.43

For the Management department [Table 2]:
ICD = 380 / 380 = 1
ICS = 801 / (20 × (20 − 1)) = 801 / 380 = 1.80

Interdepartmental communication between IT and Management [Table 3]:
OCD = 12 / 280 = 0.04
OCS = 180 / 280 = 0.64
OCS of IT & Admin = 123 / 42 = 2.92
OCS of IT & Placement = 60 / 28 = 2.14
OCS of IT & Library = 17 / 28 = 0.60
OCS of IT & Lab = 65 / 42 = 1.54
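For reproducibility, the four measures can be computed directly from an edge list. The Java sketch below simply applies the formulas above; the class and method names are illustrative, and the edge weights are assumed to be the monthly communication frequencies.

  import java.util.List;

  /** Sketch: compute the intra/inter-departmental communication measures from a weighted edge list. */
  public class SnaMeasures {
      /** Inside communication degree: edges / (V * (V - 1)). */
      static double icd(int edges, int vertices) {
          return edges / (double) (vertices * (vertices - 1));
      }

      /** Inside communication strength: sum of edge weights / (V * (V - 1)). */
      static double ics(List<Integer> edgeWeights, int vertices) {
          double sum = edgeWeights.stream().mapToInt(Integer::intValue).sum();
          return sum / (vertices * (vertices - 1));
      }

      /** Outside communication degree between departments A and B: edges / (|A| * |B|). */
      static double ocd(int edgesBetween, int verticesA, int verticesB) {
          return edgesBetween / (double) (verticesA * verticesB);
      }

      /** Outside communication strength between A and B: sum of weights / (|A| * |B|). */
      static double ocs(List<Integer> weightsBetween, int verticesA, int verticesB) {
          double sum = weightsBetween.stream().mapToInt(Integer::intValue).sum();
          return sum / (verticesA * verticesB);
      }

      public static void main(String[] args) {
          // Example with the IT department figures reported above: 173 edges, 14 faculty members.
          System.out.printf("ICD(IT) = %.2f%n", icd(173, 14));   // ~0.95
      }
  }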

Table 1: IT Department Communication Chart


Fig. 3: IT Department Social Network

Fig. 4: It Dept- Central Nodes’ Community


Fig. 5: IT-Mgt-Community

Fig. 6: Mgt Dept- Community


Fig. 7: IT-Mgt- Community Density

Table 2: Management Department Communication Chart

Sub Community   Node 1   Node 2   Edge   Density
ACADEMICS       MD 1     MD 2     E 22   100
                         MD 3     E 23   60
                         MD 6     E 24   80
                         MD 5     E 25   15
                         MD 4     E 26   10
MKTG            MD 2     MD 9     E 27   10
                         MD 10    E 28   10
                         MD 11    E 29   10
                         MD 12    E 30   10
FIN             MD 3     MD 18    E 31   10
                         MD 19    E 32   5
                         MD 20    E 33   0
ECO             MD 5     MD 17    E 34   9
IT              MD 7     MD 8     E 35   30
HR              MD 4     MD 13    E 36   6
                         MD 14    E 37   10
                         MD 15    E 38   6
                         MD 16    E 39   20
                         MD 21    E 40   0

Table 3: Interdepartmental Communication Strength

     Library   Admin   Placement   Lab   Mgt Dept
IT   17        123     60          65    180


According to the above figures, it is clear that within a specific department the strength of communication is greater than that of interdepartmental communication. The various interdepartmental communication strengths are shown in [Table 3]; the data we have considered include all communication media, such as email, phone calls and memos. The various factors that affect interdepartmental communication are shown in [Table 4].

Table 4: Community-wise Communication

Community                  Contribution %
Common event organizers    7.69
Friendship                 46.15
Common Interest            46.15

3.2 Web based communication Analysis

We have identified the percentage utilization of web as the communication media [Table 5] and also the formal [Table 6] and informal [Table 7] communication that happens on the web. Table 8 shows the communication of IT department with the other departments on the web.

Table 5: Web Based Communication Analysis

Total Communities on Web    % of web based communication
IT department               30
Management department       15
Interdepartmental           30

Table 6: Formal Web Based Communication

Departments                 % of formal web based communication
IT department               20
Management department       10
Interdepartmental           15

Table 7: Informal Web Based Communication Analysis

Departments                 % of informal web based communication
IT department               10
Management department       5
Interdepartmental           35

Table 8: Interdepartmental Web Based Analysis

     Library   Admin   Placement
IT   80%       10%     90%

3.3 Major Findings from the Above Statistical Analysis

• SNA proved to be very powerful in identifying the centrality [Fig 7] of the social network existing in the B-School; the roles identified from the network are Node 1 as HOD, Node 14 as Director, and Node 15 as Deputy Director.

• The bridge between the IT and Management departments is the edge between Node 1 and Node 15 [Fig 7], i.e. the IT HOD and the Management department's Deputy Director.

• Nodes 1, 14 and 15 hold dense communication in the whole network.

• The informal communication between peer departments (IT & Management) is strong due to friendship and common interests.

• Web based communication is low within a department.

• The IT and Management departments use the web for 50% of their total communication.

• Web based communication from the IT department to the non-academic departments is strong.

• Informal web based communication is slightly larger due to the common-interest communities that exist across departments.


4 Proposed Model

We have proposed a social network model where the communication can happen in much structured and powerful manner. This model would also provide archives of exchanged information and thus enhances traceability.

We have identified three categories of communities

• Role based community: In every academic organization there exist communities of people who play the same role. In the above-mentioned case study, directors, learning facilitators, class teachers, coordinators etc. fall in this category. The same community may exist throughout the lifetime of an organization even though its members change as people leave or join the organization. Thus the proposed model allows role based communities to be created, and at any point of time a community can evolve as the organization undergoes structural changes. This is a formal community.

• Friends: Friendship, as in any other organization, can be the reason for communication. This community is useful for understanding the kinds of information people share. This is an informal community, and a new community can evolve at any time.

• Special/common interest groups: The organization may have groups of faculty members who teach similar subjects and thus share knowledge. This could be a formal community or an informal one.

5 Related Work

Zuoliang Chen and Shigeyoshi Watanabe, in their paper "A Case Study of Applying SNA to Analyze CSCL Social Network" [7], discuss and show that group structure, members' physical location distribution and members' social position have a great impact on web based social networks. "Enhanced Professional Networking and its Impact on Personal Development and Business Success" by J. Chen [3] describes how professional networking events cultivate new cross-divisional business collaborations and help to improve individual skills. Chung-Yi Weng and Wei-Ta Chu conducted an analysis of the social networks existing in movies in their paper "Movie Analysis Based on Roles' Social Network" [2]. They show that, based on the roles' social network communities, the storyline can be detected. The framework they propose for determining the leading roles and identifying communities does not address role recognition for all characters but is efficient in community identification.

6 Conclusion and Future Work

There is tremendous scope for SNA to identify the patterns of communication within an academic institute. The current mode of communication is not easily archivable, as it is not used much in the organization-wide processes. We have come to the conclusion that informal communication should be encouraged, as there is little of it in the organization, so that more knowledge sharing and unanimity can be achieved. Our proposed social network model provides a common, standardized communication framework for the organization. Our work will be further extended to analyze how individual productivity is affected and how the accomplishment of individual and organizational goals is influenced. Implementation of the framework will be done subsequently.

References

[1] [Breslin, 2007] J. Breslin and S. Decker, "The Future of Social Networks on the Internet: The Need for Semantics", IEEE Internet Computing, Volume 11, Issue 6, Nov.-Dec. 2007, pp. 86-90.
[2] [Chung, 2007] Chung-Yi Weng, Wei-Ta Chu and Ja-Ling Wu, "Movie Analysis Based on Roles' Social Network", IEEE International Conference on Multimedia and Expo, 2-5 July 2007, pp. 1403-1406.
[3] [Chen, 2006] J. Chen and C.-H. Chen-Ritzo, "Enhanced Professional Networking and its Impact on Personal Development and Business Success".
[4] http://domino.research.ibm.com/cambridge/research.nsf/
[5] [Hussain, 2007] D. M. Akbar Hussain, "Terrorist Networks Analysis through Argument Driven Hypotheses Model", The Second International Conference on Availability, Reliability and Security (ARES 2007), 10-13 April 2007, pp. 480-492.
[6] [Jamali, 2006] M. Jamali and H. Abolhassani, "Different Aspects of Social Network Analysis", IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), 18-22 Dec. 2006, pp. 66-72.
[7] [Saltz, 2007] J.S. Saltz, S.R. Hiltz, M. Turoff and K. Passerini, "Increasing Participation in Distance Learning Courses", IEEE Internet Computing, Volume 11, Issue 3, May-June 2007, pp. 36-44.
[8] [Zuoliang 2007] Zuoliang Chen and S. Watanabe, "A Case Study of Applying SNA to Analyze CSCL Social Network", Seventh IEEE International Conference on Advanced Learning Technologies (ICALT 2007), 18-20 July 2007, pp. 18-20.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Smart Medium Access Control (SMAC)

Protocol for Mobile Ad Hoc Networks

Using Directional Antennas

P. Sai Kiran
School of Computer Science & Informatics, SreeNidhi Institute of Science and Technology, Hyderabad, Andhra Pradesh, India

Abstract

This paper proposes a Smart Medium Access Control (SMAC) protocol for Mobile Ad Hoc Networks (MANETs) using directional antennas. The SMAC protocol exploits the directional transmission and sensing capabilities of directional antennas, thereby increasing the performance of the MANET. SMAC proposes a dual-channel approach for data and control information to overcome the deafness and hidden-terminal problems, and a new node-mobility update model for addressing node mobility. SMAC also uses an alternative to the backoff timer: when the data packet at the front of the transmission queue finds the channel busy in its direction, packets queued for other directions are processed instead. SMAC has an advantage over other MAC protocols proposed for directional antennas, as it addresses all the issues such as node mobility, deafness and the hidden terminal problem.

1 Introduction

According to the IEEE 802.11 definition, an ad hoc network is a network composed solely of stations within mutual communication range of each other via the wireless medium (WM).

Traditional MAC protocols such as IEEE 802.11 DCF (Distributed Coordination Function) and IEEE 802.11 Enhanced DCF are designed for omni-directional antennas and cannot achieve high throughput in ad hoc networks, as they waste a large portion of the network capacity. Smart antenna technology, on the other hand, can improve spatial reuse of the wireless channel, which allows nodes to communicate simultaneously without interference.

The capabilities of directional antennas are not exploited by conventional MAC protocols such as IEEE 802.11; in fact, network performance may even deteriorate due to issues specific to directional antennas. Many protocols have been proposed that exploit directional antenna capabilities while addressing the issues specific to MAC protocols using directional antennas. We propose a MAC protocol using directional antennas that concentrates not only on spatial reuse but also on the throughput and performance of the protocol.

This paper is organized as follows: Section 2 deals with the design considerations for the proposed SMAC protocol, Section 3 gives the working of the proposed SMAC protocol, and Section 4 concludes the paper.


2 Design Considerations for SMAC Protocol

2.1 Antenna Model

The preferred antenna model for a MAC protocol using directional antennas is the smart antenna. Although this protocol considers the smart antenna as the design choice, it also supports nodes with other directional antenna models such as switched-beam antennas.

2.2 Directionality

If the antenna model is a smart antenna, transmission is very accurate in the direction of transmission. In this protocol, the 360° coverage is divided into a number of segments, based on the directionality, numbered in the clockwise direction.

If the antenna type is a switched-beam antenna, then the number of segments is equal to the number of directional antennas in the switched beam, and the segments are again numbered in the clockwise direction.

If we consider an example of a switched-beam antenna with 6 directional antennas, the segment numbering would be as indicated in Figure 1. Figure 2 considers the use of smart antennas, where the number of segments is chosen based on the level of spatial reuse required and the need to reduce interference.
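To make the segment numbering concrete, the small helper below maps a transmission bearing (0-360 degrees, measured clockwise) to a segment number for a node with N equal segments. It is an illustrative sketch, since the paper itself does not fix how bearings are obtained.

  /** Sketch: map a clockwise bearing in degrees to a 1-based segment number for N equal segments. */
  public class SegmentMapper {
      static int segmentFor(double bearingDegrees, int numSegments) {
          double normalized = ((bearingDegrees % 360) + 360) % 360;   // fold into [0, 360)
          double width = 360.0 / numSegments;                         // angular width of one segment
          return (int) (normalized / width) + 1;                      // segments numbered 1..N clockwise
      }

      public static void main(String[] args) {
          System.out.println(segmentFor(75, 6));   // 6 segments of 60 degrees -> segment 2
      }
  }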

2.3 Sensing the Medium

The design consideration for sensing the medium in this protocol is directional carrier sensing. If data is to be transmitted towards a segment, the protocol requires not only that the segment in the direction of the target be free or idle, but also that the immediate neighbouring segments be free from transmission.

For example, considering Figure 1, if the direction of the intended transmission is segment 2, then the node would initiate transmission only when segments 1, 2 and 3 are found to be free of transmission. This design option accounts for node mobility (of the source or the destination), which was neglected by many previous protocols for MANETs using directional antennas.

Fig. 1: Segment Numbering for Switched Beam Antenna

Fig. 2: Segment Numbering for Smart Antenna


2.4 Information at Each Node

Every node maintains information indicating the direction of its neighbouring nodes as well as the status of the segments, i.e. whether each segment can be used for transmission and is free from any ongoing transmission in that direction.

This information will be maintained in two tables at each node.

Table 1: Neighbor Node Information

Node   Segment Number (Source)   Segment Number (Destination)   Status

2.4.1 Neighboring Node Information Table

This table would indicate the direction of the nodes in the form of segment numbers and this information will be used to communicate with the nodes using the segment numbers indicated. The Source Segment number indicates the segment being used by the source to reach the destination node. The destination segment number indicates the segment through which the destination node receives this packet. The status would indicate the state of the node whether the node is busy in transmission or not. This status information is very much needed and used in this protocol so as to avoid the Deafness problem. This would indicate that the node is busy in transmission, so it is not responding to the RTS request of a node.

An Entry of the node in this table would indicate that the node is within the coverage Area of the current node.

2.4.2 Segment Table

This table will have the information about the segment and the status of the segment (Busy/Idle).

The structure of the table is as shown in Table 2.

Table 2: Segment Table

Segment Number Status Waiting

The segment number indicates the segment, and the status is either Busy or Idle. The Waiting field is a single bit (0 or 1); a value of 1 indicates that one or more packets are waiting in the backoff queue for that segment to become free.

Need for two tables: We maintain two different tables, one to indicate the node status and another to indicate the segment status. The segment table is needed because we must check which segments are free from transmission, while the waiting bit reduces the time needed to determine whether any packets are waiting for a segment to become free.

In Figure 1, as mentioned, if we are using segment 2 for transmission we also block transmission in segments 1 and 3, and thus mark segments 1, 2 and 3 as busy in the segment table.
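The two tables can be represented with simple records. The Java sketch below uses illustrative names and also shows the adjacency rule (marking a segment and its immediate neighbours busy), under the assumption of a circular segment numbering 1..N.

  import java.util.HashMap;
  import java.util.Map;

  /** Sketch of the per-node state: neighbour table (Table 1) and segment table (Table 2). */
  public class SmacNodeState {
      static class NeighborEntry {
          int sourceSegment;        // segment this node uses to reach the neighbour
          int destinationSegment;   // segment on which the neighbour receives from us
          boolean busy;             // neighbour currently engaged in a transmission
      }

      static class SegmentEntry {
          boolean busy;             // segment reserved by an ongoing transmission
          boolean waiting;          // at least one packet queued for this segment
      }

      final Map<String, NeighborEntry> neighbors = new HashMap<>(); // node id -> entry
      final SegmentEntry[] segments;                                // index 0..N-1 for segments 1..N

      SmacNodeState(int numSegments) {
          segments = new SegmentEntry[numSegments];
          for (int i = 0; i < numSegments; i++) segments[i] = new SegmentEntry();
      }

      /** Mark segment s and its two immediate neighbours busy (e.g. s=2 blocks 1, 2 and 3). */
      void reserve(int s) {
          int n = segments.length;
          segments[(s - 1 + n) % n].busy = true;      // segment s itself
          segments[(s - 2 + n) % n].busy = true;      // previous segment
          segments[s % n].busy = true;                // next segment
      }
  }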

2.5 Channel

The channel is divided into two sub-channels: a control channel and a data channel. The control channel is used to send control information through node updates for maintaining the tables.


The Data channel will be used to transmit RTS-CTS-Data etc. The Node can simultaneously transmit in both channels without any interference.

2.6 Mobility Updates

All the nodes in the network transmit an update message to all their neighbouring one-hop nodes in directional mode.

The update packets will be transmitted to one segment at a time. The update packet format is as shown in figure 3.

The Fields in the packet are

Source node ID: The Node ID transmitting the packet.

Segment Number: The Segment Number used by the sender node to transmit the update packet.

Status: This field is used by the node to inform its one-hop neighbours about the status of transmission in the different segments of the node. For example, if the sender node is transmitting data using segment 1 of a 6-segment node, then since the immediate neighbouring segments 2 and 6 are also to be reserved, the status field would indicate that segments 1, 2 and 6 are busy by setting a 1 in those bit positions.

2.6.1 Sensing the Medium

A node will sense the medium periodically in a particular segment direction and would transmit the update packet in that direction. If the Medium is found to be busy, then it would switch on to the next segment after waiting in omni-directional mode for a certain period of time. The node shifts to omni directional mode to listen to other update messages if any transmitted by other nodes through any other segments.

A node maintains single bit information for each segment, whether it had transmitted update packet in a particular segment direction or not. If it had already transmitted the update packet in a segment direction then it would make the bit position of the segment number to be 1. If it had failed to sense the medium to be idle then it would keep the bit value to zero.

Source Node ID | Segment Number (M bits to represent N segments) | No. of Segments | Status (N bits to represent the status of N segments)

Fig. 3: Update Packet Format
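A possible C rendering of this packet layout is sketched below; the fixed field widths (and the 8-bit status bitmap, which suffices only for N <= 8 segments) are assumptions made to keep the sketch concrete.

#include <stdint.h>

/* Assumed packed layout for the update packet of Fig. 3:
 * the protocol itself only specifies M bits for the segment number
 * and N bits for the per-segment status bitmap. */
struct update_packet {
    uint16_t source_node_id;   /* ID of the transmitting node            */
    uint8_t  segment_number;   /* segment used to send this update       */
    uint8_t  num_segments;     /* N, number of segments of the sender    */
    uint8_t  status_bitmap;    /* bit i = 1 -> segment i busy (N <= 8)   */
} __attribute__((packed));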

2.6.2 Transmission

When the node finds the medium to be idle in a particular direction, it transmits a Request to Send (RTS) packet in that direction. Here the node does not address any particular node, as it cannot be sure about the identity of the nodes in that direction. Any node receiving this RTS packet responds with a Clear to Send (CTS) packet containing its ID. The RTS packet carries the sender ID, so the receiving node updates its neighbor node table with the sender node's location. The CTS packet transmitted by the receiving node carries its ID, allowing the sender node to update its table with that node's location. After receiving the CTS, the sender transmits the update packet describing the status of the node in that direction.


The receiving node after receiving the update packet would piggyback the Acknowledgement with the Update Message to the sender. Thus, the two nodes would update their tables when one node successfully accesses the channel.

2.6.3 Delivery of Update Messages

For each segment, a node maintains a single bit to indicate the transmission of the update packet, setting the bit to 1 after the transmission. The transmission can occur either by the node initiating an update packet in that segment direction or by piggybacking a response to an update packet received from another node. Before sensing the next segment in increasing order, the node checks the delivery status of update messages for the previous segments and gives them another chance. A new round starting from segment 1 is initiated only after update messages have been successfully transmitted in all directions.

2.6.4 Timeouts and Retransmission

When a node senses the medium in a particular direction or segment and finds it to be free, it transmits an RTS packet. After transmitting an RTS packet, a node will not receive a CTS packet if there is no node in that direction or if the node in that direction does not respond because of the deafness problem. In that case the sender node times out without receiving any CTS packet. The node retransmits the RTS message up to three consecutive times before giving up; after three attempts it sets the update bit to 1.

Because a node not only initiates the transmission of update packets but also piggybacks on update packets received from other nodes, the probability of an update packet missing transmission in a particular direction is very low.

2.7 Queue Model

The SMAC protocol uses several queues in addition to a ready queue that holds the packets ready for transmission. SMAC maintains a queue called nodenotfound to store packets whose destination node is not found, and N backoff queues, where N is the number of segments of a node. For example, a node with the antenna of Figure 1 maintains 6 backoff queues apart from the nodenotfound queue and the ready queue.

3 Working of Proposed Protocol

The proposed SMAC uses a directional transmission scheme. The SMAC protocol uses an alternative backoff method [1] when the direction in which a packet is to be transmitted is found busy.

3.1 Carrier Sensing and Backoff

The SMAC protocol uses an alternative method that processes data packets queued for transmission in other directions when the packet at the front of the queue finds the channel busy in its direction.

Initially a node reads the packets that are ready for transmission from the ready queue. It reads the Destination Node ID from the packet header and checks for the existence of the node in the neighbor node table. SMAC processes the packet further if it finds the node in the table; otherwise it places the packet in the nodenotfound queue. The packets placed in the nodenotfound queue get two more chances for


transmission before informing the routing protocol about the unavailability of the node. For this, a single bit is added to the header part of the packet and initialized to 0.

If SMAC finds the destination node information, it obtains the destination node's segment and checks the status of that segment and its neighboring segments for transmission. If the segments are idle, SMAC proceeds with packet transmission after the RTS/CTS exchange. If the segments are busy, the packet is placed into the backoff queue for that segment and the waiting bit is then set to 1.

SMAC then proceeds to the next packet in the ready queue and checks whether it belongs to the same destination as the previous packet. If the previous packet was transmitted successfully then this packet is also transmitted; if the previous packet was placed into the backoff queue then this packet is placed into the same backoff queue without further processing.
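The ready-queue handling just described can be summarised in a short C-style sketch. This is only an illustration of the logic under stated assumptions: the queue objects (ready_q, backoff_q, nodenotfound_q) and helper routines (dequeue, lookup_neighbor, segment_idle, neighbors_idle, exchange_rts_cts, transmit_data) are hypothetical names standing in for the actual implementation, and the table types are those sketched in Section 2.4.

#include <stdbool.h>

struct packet { int dst_id; int retry_bit; /* payload omitted */ };

/* Assumed helpers, provided elsewhere in the implementation. */
struct packet *dequeue(void *queue);
void enqueue(void *queue, struct packet *p);
struct node_entry *lookup_neighbor(int node_id);     /* neighbor node table lookup */
int  segment_idle(int seg);                          /* segment table status       */
int  neighbors_idle(int seg);                        /* adjacent segments idle?    */
void exchange_rts_cts(int seg, int dst_id);
void transmit_data(int seg, struct packet *p);

extern void *ready_q, *nodenotfound_q, *backoff_q[];

void smac_process_ready_queue(void)
{
    struct packet *p;

    while ((p = dequeue(ready_q)) != NULL) {
        struct node_entry *n = lookup_neighbor(p->dst_id);

        if (n == NULL) {                      /* destination not in neighbor table */
            p->retry_bit = 0;                 /* packet will get two more chances  */
            enqueue(nodenotfound_q, p);
            continue;
        }

        int seg = n->dst_segment;
        if (segment_idle(seg) && neighbors_idle(seg)) {
            exchange_rts_cts(seg, p->dst_id); /* RTS/CTS, then data transmission   */
            transmit_data(seg, p);
        } else {                              /* direction busy: defer the packet  */
            enqueue(backoff_q[seg], p);
            segment_table[seg].waiting = true;
        }
    }
}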

3.2 Data Transmission

Once the carrier is sensed to be idle in the intended direction, the sender node will transmit a RTS (Request to Send) message indicating the transmission. The Destination node would update its segment table about the transmission and respond with a CTS message. The Sender before transmitting an RTS message would update its segment table with the segment for transmission. Once the sender receives the CTS message, it would start data transmission. Sender and destination nodes will update the segment tables after completing the transmission.

3.3 Processing Backoff Queues

The SMAC protocol processes the backoff queues when the next packet in the ready queue has a different destination node from that of the previous packet. SMAC checks the segment table for any segment whose waiting bit is set to 1 and whose status is idle, which means that the segment that was previously busy is now idle and there are packets in the backoff queue for that segment.

SMAC identifies those segments, processes the packets in each such segment's backoff queue and transmits all the packets of the segment. Before transmitting the packets, SMAC checks the neighbor node table to see whether the destination node is still in that segment or has moved to a new segment. If SMAC does not find the node information in the neighbor node table, the packets for that node are shifted to the nodenotfound queue. If the node has moved to a new segment because of node mobility, the segment to which the node moved is checked to see whether its status is idle. If the status of the segment is idle, the packets are transmitted in that segment direction; otherwise they are placed in the new segment's backoff queue. Once all the packets in a segment's backoff queue are processed, the waiting bit of the segment is set to 0.

3.4 Processing Nodenotfound Queue

SMAC protocol will also process the nodenotfound queue after processing the Backoff queues. A packet in the nodenotfound queue will be tried for transmission twice before


initiating the node-not-found message to the routing protocol. All the packets in the nodenotfound queue are processed before leaving the queue. If the destination node of a packet is found in the neighbor node table, the packet is either transmitted or placed in the backoff queue for that segment, depending on the status of the destination node. If the node information is still not available, the node bit is set to 1 if it is zero. If the node bit is already 1, indicating that the packet is being tried for transmission for the second time, the packet is discarded from the nodenotfound queue and the network layer is informed that the node is not available.

4 Conclusion

This paper proposed a new MAC protocol using directional antennas. We tried to exploit the directional transmission capabilities of directional antennas to improve throughput, spatial reuse and the overall performance of communication. The paper introduced a node mobility model based on control information transmitted through the control channel; this model is designed to be used not only by SMAC but also by routing protocols and transport layer services such as TCP, Quality of Service connection establishment and IP address configuration. The paper also included a method alternative to the backoff timer to increase throughput. The protocol overcomes problems such as the hidden terminal problem, deafness, information staleness and node mobility that are specific to MAC protocols for MANETs using directional antennas.

This protocol can be further extended to include service differentiation for providing Quality of Service.

References

[1] [P. Sai Kiran, 2006] P. Sai Kiran, “Increasing throughput using directional antennas in Wireless Ad Hoc Networks”, IEEE ICSCN’06.

[2] [P. Sai Kiran, 2006] P. Sai Kiran, “A survey on mobility support by MAC protocols using directional antennas for wireless ad hoc networks”, in Proc, IEEE ISAHUC2006.

[3] [P. Sai Kiran, 2007] P. Sai Kiran, “Statefull Addressing Protocol (SAP) for Mobile Ad Hoc Networks”, in proc, IASTED CIIT’07,

[4] [C. Siva Ram Murthy, 2004] C. Siva Ram Murthy & B. Manoj; “Ad Hoc Wireless Networks, Architectures and Protocols”, Prentice Hall, 2004

[5] [Hongning Dai, 2006] Hongning Dai, Kam-Wing Ng and Min-You Wu, An “Overview of MAC Protocols with Directional Antennas in Wireless ad hoc Networks”, ICWMC 2006, Bucharest, Romania, July 29-31, 2006

[6] [Masanori Takata, 2005]Masanori Takata, Masaki Bandai and Takashi Watanabe, “Performance Analysis of a Directional MAC for Location Information Staleness in Ad Hoc Networks”, ICMU 2005 pp.82-87 April 2005.

[7] [Ram Ramanathan, 2005] Ram Ramanathan, Jason Redi, Cesar Santivanez, David Wiggins, and Stephen Polit, “Ad Hoc Networking with Directional Antennas: A Complete System Solution” IEEE Communications, March 2005.

[8] [Jungmin So, 2004] Jungmin So, Nitin Vaidya, “Multi-Channel MAC for Ad Hoc Networks: Handling Multi-Channel Hidden Terminals Using A Single Transceiver”, in proc, MobiHoc’04, May 24–26, 2004.

[9] [Romit Roy Choudhury, 2004] Romit Roy Choudhury and Nitin H. Vaidya, “Deafness: A MAC Problem in Ad Hoc Networks when using Directional Antennas”, in proc ICNP’04.

[10] [Tetsuro Ueda, 2004] Tetsuro Ueda, Shinsuke Tanaka, Siuli Roy, Dola Saha and Somprakash Bandyopadhyay, “Location-Aware Power-Efficient Directional MAC Protocol in Ad Hoc Networks Using Directional Antenna”, IEICE 2004.


[11] [Michael Neufeld] Michael Neufeld and Dirk Grunwald, “Deafness and Virtual Carrier Sensing with Directional Antennas in 802.11 Networks”, Technical Report CU-CS-971-04, University of Colorado.

[12] [Ajay Chandra V, 2000] Ajay Chandra V. Gummalla and John O. Limb, “Wireless Medium Access Control Protocols”, IEEE Communications Surveys and Tutorials, Second Quarter 2000

[13] [Romit Roy Choudhury, 2002] Romit Roy Choudhury, Xue Yang, Ram Ramanathan and Nitin H. Vaidya, “Using Directional Antennas for Medium Access Control in Ad Hoc Networks”, in proc. MOBICOM ’02, Sep 23-28, 2002.

[14] [Z. Huang, 2002] Z. Huang, C.-C. Shen, C. Srisathapornphat and C. Jaikaeo, “A busy-tone based directional MAC protocol for ad hoc networks,” in Proc. IEEE Milcom, 2002.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Implementation of TCP Peach

Protocol in Wireless Network

Rajeshwari S. Patil Satyanarayan K. Padaganur Dept of CSE Dept of CSE Dept of ECE Bagalkot Bagalkot B.L.D.E.A’s CET Bijapur, Karnataka

Abstract

Throughput, or goodput, improvement in an IP network is one of the most significant issues in data communication and networking. Various IP congestion control protocols have been suggested over the years. However, in networks where the bit error rate is very high due to link failures, as in satellite or ad hoc networks, congestion control at the network layer is not advisable; in such networks, transport layer control of the frame transfer rate is adopted instead. One of the preliminary classes of transmission control adopted to improve goodput in such networks is TCP-Peach.

The objective of this work is to improve TCP-Peach congestion control scheme, and extend it to TCP-Peach+ to further improve the goodput performance for satellite IP networks. In TCP-Peach+, two new algorithms, Jump Start and Quick Recovery, are proposed for congestion control in networks.

These algorithms are based on low-priority segments, called NIL segments, which are used to probe the availability of network resources as well as for error recovery. The objective is to develop a simulation environment to test the algorithms.

Keywords: TCP-Peach, Jump start, Quick recovery

1 Introduction

This work implements a strategy to improve, through transport layer rate control, the performance of networks that are more vulnerable to link failures. The causes of link failures could be fading, noise, interference, energy loss or external signals. We therefore discuss the network architecture and the transport layer in detail in this section.

1.1 End-to-End Network Concept

The data communication requirements of many advanced space missions involve seamless, transparent connectivity between space-based instruments, investigators, ground-based instruments and other spacecraft. The key to an architecture that can satisfy these requirements is the use of applications and protocols that run on top of the Internet Protocol (IP). IP is the technology that drives the public Internet and therefore draws billions of dollars


annually in research and development funds. Most private networks also utilize IP as their underlying protocol. IP provides a basic standardized mechanism for end-to-end communication between applications across a network. The protocol provides for automated routing of data through any number of intermediate network nodes without affecting the endpoints.

2 Role of TCP

As discussed, TCP provides flow control through acknowledgements and frame processing. However, the extra bandwidth these mechanisms consume introduces considerable channel delay, which in turn becomes a cause of congestion in the network. TCP applies a common policy to congestion, delay and loss: whenever data is lost, even due to a link failure, the node retransmits the packet, assuming the loss was caused by congestion. Window size management is purely assumption based, and no bandwidth or link status feedback is employed. Therefore, when the probability of error is high, the performance of the network degrades. Hence we are going to utilize the NIL segments of the TCP packets to carry suitable information about the link status and adjust the transmission rate based on a link status calculation obtained through channel condition estimation. Normally, at the beginning of a transmission the window size is small and increases slowly, so even if enough bandwidth is available, nodes cannot utilize it appropriately; hence a sudden increase in the window size is facilitated. Under link error conditions, lost-packet retransmission is prohibited by monitoring the channel status, because retransmitted packets would still be lost; instead, a quick recovery by decrementing the window size is proposed. Overall, the problem can be stated as “Simulation of TCP-Peach+ for better network resource management and performance improvement for networks with high loss probability”.
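As a rough illustration of this idea only (not the authors' actual implementation), a sender-side handler could adjust the congestion window according to how many of the low-priority NIL probe segments are acknowledged; every name and adjustment factor below is an assumption made for the sketch.

/* Hypothetical sketch: adjust cwnd from NIL-segment feedback. */
typedef struct {
    double cwnd;        /* congestion window, in segments            */
    double rwnd;        /* receiver-advertised maximum               */
} peach_state_t;

/* nil_acked: NIL segments acknowledged in the last RTT,
 * nil_sent:  NIL segments probed in the last RTT.        */
void on_nil_feedback(peach_state_t *s, int nil_acked, int nil_sent)
{
    if (nil_sent == 0)
        return;

    if (nil_acked == nil_sent) {
        /* Spare capacity reported: jump the window up quickly. */
        s->cwnd *= 2.0;                       /* assumed growth factor   */
    } else if (nil_acked == 0) {
        /* Losses look like link errors: quick recovery, mild decrease. */
        s->cwnd *= 0.8;                       /* assumed recovery factor */
    }
    if (s->cwnd > s->rwnd) s->cwnd = s->rwnd;
    if (s->cwnd < 1.0)     s->cwnd = 1.0;
}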

3 Methodology

For any protocol implementation it is important to first implement a network architecture. Our network architecture contains N static nodes spread over a 400 x 400 m area. A source and a destination are randomly selected and a shortest path between them is obtained. All the nodes participating in routing are allocated the available bandwidth. The more channel bandwidth available, the lower the probability of error; hence the probability of error can be considered a state function of the link bandwidth between the nodes.

Conceptually the transport layer and the network layers are implemented. The transport layer will prepare the queue where the packet buffers would be stored. Transmitting node would create a transmission window or the congestion window and the buffer of the size of the window would be transferred. Routing is done through the Network layer and the acknowledgement and retransmission policies are managed in the transport layer.

The receiver acknowledges the reception of the packet; once the acknowledgement is received, the packet is removed from the buffer. In TCP-Peach+, some null segments are made low-priority NIL segments through which a transmitter can ask the receiver to report the channel state. If there is enough bandwidth left, the receiver notifies the


transmitter about the state; otherwise the transmitter is not notified. Based on the reply to the NIL segment, the transmitting node adjusts its window size.

The TCP-Peach and TCP-Peach+ protocols are implemented in order to provide QoS support to the transmission control protocol; Jump Start and Quick Recovery are implemented on top of TCP-Peach to extend the work. For implementing the protocols, we have adopted a data structure for the protocol packets, and a priority queue with status is implemented for data buffering.

Custom data length and link error probability are given as inputs by the user. The simulation model depicts the usage of NIL segments in the case of Jump Start and Quick Recovery, and finally the loss, retransmission and delay are calculated. Load vs. throughput and link error vs. throughput are plotted for both cases in order to analyze their behavior.

4 TCP-Peach

TCP protocols have performance problems in satellite networks because:

1. The long propagation delays cause longer duration of the Slow Start phase during which the TCP sender may not use the available bandwidth.

2. The congestion window, cwnd, cannot exceed a certain maximum value, rwnd, provided by the TCP receiver. Thus, the transmission rate of the sender is bounded. Note that the higher the round trip time, RTT, the lower is the bound on the transmission rate for the sender.

3. The TCP protocol was initially designed to work in networks with low link error rates, i.e., all segment losses were mostly due to network congestions. As a result, the TCP sender decreases its transmission rate. However, this causes unnecessary throughput degradation if segment losses occur due to link errors.

TCP-Peach is a new flow control scheme for satellite networks. The TCP-Peach is an end-to-end solution whose main objective is to improve the throughput performance in satellite networks. The TCP-Peach assumes that all the routers on the connection path apply some priority mechanism. In fact, it is based on the use of low priority segments, called dummy segments. TCP-Peach senders transmit dummy segments to probe the availability of network resources on the connection path in the following cases:

a. In the beginning of a new connection, when the sender has no information about the current traffic load in the network.

b. When a segment loss is detected, and the sender has no information about the nature of the loss, i.e., whether it is due to network congestion or link errors.

c. When the sender needs to detect network congestion before it actually occurs. As dummy segments have low priority, their transmission does not affect the transmission of traditional data segments. TCP-Peach contains the following algorithms: Sudden Start, Congestion Avoidance, Fast Retransmit, Rapid Recovery and Over Transmit. Sudden Start, Rapid Recovery and Over Transmit are new algorithms; Fast Retransmit is the same as in TCP-Reno, and Congestion Avoidance has some modifications.


5 System Design

6 Result


7 Conclusion

The performance of TCP-Peach+ shows that the algorithm performs very well under trying network conditions, such as the high link error probability found in satellite or ad hoc networks. Normally TCP does not provide independent rules for congestion and link errors, so recovery from link breakdowns is very slow. Our proposed technique shows that the protocol performs well and that recovery from failure is very fast. The retransmission results show that under severe link breaks and segment losses the nodes do not retransmit the lost segments, because the failures are identified, and hence the resources are not unnecessarily wasted. If congestion control algorithms such as sliding windows are integrated with the proposed protocol, we may obtain better results as far as overall performance is concerned.



Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

A Polynomial Perceptron Network

for Adaptive Channel Equalisation

Gunamani Jena

R. Baliarsingh

G.M.V. Prasad BVC Engg College, (JNTU) CSE, NIT, Rourkela BVCEC [email protected] [email protected]

Abstract

Application of artificial neural network (ANN) structures to the problem of channel equalisation in a digital communication system has been considered in this paper. The difficulties associated with channel nonlinearities can be overcome by equalisers employing ANNs. Because of the nonlinear processing of signals in an ANN, it is capable of producing arbitrarily complex decision regions; for this reason, the ANN has been utilized for the channel equalisation problem. A scheme based on polynomial perceptron network (PPN) structures has been proposed for this task. The performance of the proposed PPN network, along with another ANN structure (FLANN), has been compared with the conventional LMS based channel equaliser. The effect of the eigenvalue ratio of the input correlation matrix on the performance of the equalisers has been studied. From the simulation results, it is observed that the proposed PPN based equaliser outperforms the other two in terms of bit error rate (BER) and attainable MSE level over a wide range of eigenvalue spread, signal to noise ratio and channel nonlinearities.

Keywords: PPN: polynomial perceptron network, FLANN: functional link artificial neural network, MSE, SNR and EVR

1 Introduction

The rapidly increasing need for digital communication has been met primarily by higher speed data transmission over the widespread network of voice bandwidth channels. These channels deliver at the receiver corrupted and transformed versions of their input waveforms. The corruption of the waveform may be statistically additive, multiplicative or both, because of possible background thermal noise, impulse noise and fading. Transformations performed in the channels include frequency translation, nonlinear and harmonic distortion and time dispersion. By using an adaptive channel equaliser at the front end of the receiver, the noise introduced in the channel is nullified and hence the signal to noise ratio of the receiver improves. This paper deals with the design of an adaptive channel equaliser based on the polynomial perceptron network (PPN) architecture and the study of its performance. The performance of a linear channel equaliser employing a linear filter with FIR or lattice structure and using a least mean square (LMS) or recursive least squares (RLS) algorithm is limited, especially when the nonlinear distortion is severe. In such cases nonlinear equaliser structures may be conveniently employed, with added advantages in terms of lower bit error rate (BER), lower mean square error (MSE) and higher convergence rate than those of a linear equaliser. Artificial neural networks (ANNs) can perform complex mapping between their input and output


space and are capable of forming complex regions with nonlinear decision boundaries. Further, because of the nonlinear characteristics of ANNs, networks of different architectures have found successful application in the channel equalisation problem. One of the earliest applications of ANNs to digital communication channel equalisation is reported by Siu et al. [7]. They proposed a multilayer perceptron (MLP) structure for channel equalisation with decision feedback and showed that the performance of this network is superior to that of a linear equaliser trained with the LMS algorithm. Using MLP structures for the channel equalisation problem, quite satisfactory results have been reported for Pulse Amplitude Modulation (PAM) and Quadrature Amplitude Modulation (QAM) signals [1-5]. The PPN structure is well suited for developing an efficient equaliser for nonlinear channels. The performance of the PPN equaliser has been obtained through computer simulation for different nonlinear channels. The convergence characteristics and bit error rate (BER) are obtained through simulation study and the results are analyzed. It is observed that the PPN equaliser offers superior performance in terms of BER compared to its LMS counterpart and outperforms the LMS equaliser, particularly for nonlinear channels, with a higher convergence rate and lower mean square error (MSE).

2 Data Transmission System

Consider a synchronous data communication link with 4 QAM signal constellations to

transmit a sequence of complex-valued symbols t(k) = x_{2k} + j x_{2k+1} at time kT, where 1/T denotes the symbol rate and x_λ, λ = 0, 1, ..., represents the unmodulated information sequence having statistically independent and equiprobable values +1 and -1. The transmitted symbol t(k) may be written in terms of its in-phase and quadrature components as t(k) = t_{k,I} + j t_{k,Q}. A discrete-time model for the digital transmission system with equaliser is shown in Fig. 1. The combined effect of the transmitter filter, the transmission medium and other components is included in the “channel”. A widely used model for a linear dispersive channel is an FIR model whose output at time instant k may be written as

a(k) = Σ_{i=0}^{Nh-1} h(i) t(k-i)    (1)

Fig. 1: Digital Transmission system with Equaliser


where h(i) are the channel tap values and Nh is the length of the FIR channel. If the nonlinear distortion caused by the channel is to be considered, then the channel model should be treated as nonlinear and its output may be expressed as

a'(k) = ϕ(t(k), t(k-1), …., t(k-Nh+1); h(0), h(1), …, h(Nh-1)) (2)

where ϕ(.) is some nonlinear function generated by the “NL” block. The channel output is corrupted with additive white Gaussian noise q(k) of variance σ² to produce r(k), the signal received at the receiver. The received signal r(k) may be represented by its in-phase and quadrature components as r(k) = r_{k,I} + j r_{k,Q}. The purpose of the equaliser is to recover the transmitted symbol t(k), or t(k - τ), from the knowledge of the received signal samples without any error, where τ is the transmission delay associated with the physical channel.

3 Channel Equaliser as a Pattern Classifier

The channel output is passed through a time delay to produce the equaliser input vector as shown in Fig. 1.

For this section, consider a K-ary PAM system with signal constellation given by si = 2i - K - 1, 1 ≤ i ≤ K. The arguments for the channel equaliser as a pattern classifier may be extended to QAM signal constellations. The equaliser input at the k-th time instant is denoted as Uk and is given by

Uk = [u1 u2 ... uM]^T

where ui = r(k - i + 1) for i = 1, 2, ..., M. The equaliser order is denoted by M and [.]^T denotes matrix transpose. The ANN utilizes this information to produce an output ŷ(k), which is an estimate of the desired output y(k) = t(k - τ). The delay parameter of the equaliser is denoted by τ. Depending on the channel output vector Uk, the equaliser tries to estimate an output which is close to one of the transmitted values si, i = 1, 2, ..., K. In other words, the equaliser seeks to classify the vector Uk into any one of the K classes. The classification decision of the equaliser is affected by the values of M and Nh and by the current and past J - 1 transmitted symbols. The J-dimensional transmitted symbol vector at time instant k is given by

Γk = [t(k) t(k-1) ... t(k-J+1)]^T    (3)

where J = M + Nh - 1. The total number of possible combinations of Γk is given by Nt = K^J.

When the additive Gaussian noise q(k) is zero, let the M-dimensional channel output vector be given by

Bk = [b(k) b(k-1) ... b(k-M+1)]^T    (4)

Corresponding to each transmitted symbol vector Γk there will be one channel output vector Bk. Thus, Bk will also have Nt possible combinations, called desired channel states. These Nt states are to be partitioned into K classes Ci, i = 1, 2, ..., K, depending on the value of the desired signal y(k). The states belonging to class Ci are given by Bk ∈ Ci if y(k) = si.

When the white Gaussian noise is added to the channel, Bk becomes Uk, which is a stochastic vector. Since each si is assumed to be equiprobable, the number of channel states in each class is given by Nt/K. The observation vectors form clusters around the desired channel states and thus the statistical means of these data clusters are the desired states.


Therefore, determining the transmitted symbol t(k) with knowledge of the observation vector Uk is basically a classification problem. For this purpose, a decision function may be formed as follows [8]

DF(Uk) = w0 + w1 u1 + w2 u2 + ... + wM uM    (5)

Here, wi, i = 0, 1, ..., M, are the weight parameters. Ignoring the time index k, the decision function may be written as DF(U) = W^T U, where U is the current channel observation vector augmented by 1 and W = [w0 w1 ... wM]^T is the weight parameter vector. For the K classes, K decision functions are found with the property

DFi(U) = Wi^T U ≥ 0 if U ∈ Ci, and < 0 otherwise    (6)

for i = 1, 2, ..., K. Here, Wi is the weight vector associated with the i-th decision function. A generalized nonlinear decision function is needed to take care of many practical linearly non-separable situations, and it can be formed as

DF(U) = Σ_{i=0}^{Nf} wi φi(U)    (7)

where φi(U), i = 1, 2, ..., Nf, are real, single-valued functions of the input pattern U, φ0(U) = 1, and Nf is the number of terms used in the expansion of the decision function DF(U).

Let us define a vector U* whose components are the functions φi(U), given by U* = [1 φ1(U) φ2(U) ... φNf(U)]^T. The decision function may then be expressed as

DF(U) = W^T U*    (8)

Thus using φi(U), the (M+1) – dimensional augmented channel observation vector U may be transformed into a (Nf + 1) – dimensional vector U*. Using this decision function, complex decision boundaries can be formed to carry out nonlinear classification problems. This may be achieved by employing different ANN structures for the channel equalisation problem, which is described in the following sections.

4 ANN Structures for Equalisation

In this paper we have employed two ANN structures (FLANN and PPN) for the channel equalisation problem and their performance in terms of convergence speed, MSE level, and BER is compared by taking different linear and nonlinear channel models. Brief description of each of the ANN structures is given below.

4.1 Functional Link ANN

The FLANN is a single layer network in which the hidden layers are removed. In contrast to the linear weighting of the input pattern by the linear links of an MLP, the functional link acts on an element of a pattern or on the entire pattern itself by generating a set of linearly independent functions, and then evaluating these functions with the pattern as the argument. Thus, separability of input patterns is possible in the enhanced space [5]. Further, the FLANN structure offers less computational complexity and higher convergence speed than those of MLP because of its single layer structure. The FLANN structure considered for the channel


equalisation problem is depicted in Fig. 2. Here, the functional expansion block makes use of a functional model comprising a subset of orthogonal sin and cos basis functions and the original pattern along with its outer products. For example, considering a two-dimensional input pattern X = [x1 x2]^T, the enhanced pattern is obtained by using the trigonometric functions as X* = [x1 cos(πx1) sin(πx1) ... x2 cos(πx2) sin(πx2) ... x1x2]^T, which is then used by the network for the equalisation purpose. The BP algorithm, which is used to train the network, becomes very simple because of the absence of any hidden layer.

Fig. 2: The FLANN Structure
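A small C routine of the kind that could implement such a trigonometric expansion for a two-dimensional pattern is sketched below; the exact number and order of basis functions used in the paper's equaliser is an assumption here.

#include <math.h>

/* Expand X = [x1, x2] into a FLANN pattern using the original inputs,
 * a few sin/cos harmonics of each, and the outer product x1*x2.       */
int flann_expand(double x1, double x2, double *out /* size >= 9 */)
{
    int k = 0;
    out[k++] = x1;
    out[k++] = cos(M_PI * x1);
    out[k++] = sin(M_PI * x1);
    out[k++] = cos(2.0 * M_PI * x1);
    out[k++] = x2;
    out[k++] = cos(M_PI * x2);
    out[k++] = sin(M_PI * x2);
    out[k++] = cos(2.0 * M_PI * x2);
    out[k++] = x1 * x2;
    return k;   /* number of expanded features */
}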

4.2 The PPN Structure

The Weierstrass approximation theorem states that any function which is continuous in a closed interval can be uniformly approximated within any prescribed tolerance over that interval by some polynomial. Based on this, the PPN structure was proposed; it is shown in Fig. 3. Here, the original input pattern dimension is enlarged, and this enhancement is carried out by a polynomial expansion in which higher order and cross-product terms of the elements of the input pattern are used. This enhanced pattern is then used for channel equalisation. The PPN is a single layer ANN structure possessing a higher rate of convergence and a lower computational load than an MLP structure.

Fig. 3: The PPN Structure

The behavior and mapping ability of a PPN and its application to channel equalisation are reported by Xiang et al. [10]. A PPN of degree d with a weight vector W produces an output y given by


y = ρ(P_w^d(X))    (9)

where ρ is the nonlinear tanh function, X = [x1, x2, ..., xn]^T is the n-dimensional input pattern vector, and P_w^d is a polynomial of degree d with the weight vector W = [w0 w1 w2 ...], given by

P_w^d(X) = w0 + Σ_{i1=1}^{n} w_{i1} x_{i1} + Σ_{i1=1}^{n} Σ_{i2=i1}^{n} w_{i1i2} x_{i1} x_{i2} + ... + Σ_{i1=1}^{n} Σ_{i2=i1}^{n} ... Σ_{id=i(d-1)}^{n} w_{i1i2...id} x_{i1} x_{i2} ... x_{id}    (10)

When d → ∞, P_w^d(X) becomes the well known Volterra series. A structure of a PPN with degree d = 2 and pattern dimension n = 2 is shown in Fig. 3.

The same BP algorithm may be used to train the network. However, in this network the number of terms needed to describe a polynomial decision function grows rapidly as the polynomial order and the pattern dimension increase, which in turn increases the computational complexity.
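For the d = 2, n = 2 case of Fig. 3, the polynomial of Eq. (10) and the output of Eq. (9) can be written out directly; the ordering of the six weights below is an assumption made for illustration.

#include <math.h>

/* PPN output for degree d = 2, pattern dimension n = 2:
 * terms: 1, x1, x2, x1*x1, x1*x2, x2*x2  (6 weights).   */
double ppn_output_d2(const double w[6], double x1, double x2)
{
    double p = w[0]
             + w[1] * x1 + w[2] * x2
             + w[3] * x1 * x1
             + w[4] * x1 * x2
             + w[5] * x2 * x2;
    return tanh(p);          /* rho(.) of Eq. (9) */
}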

Fig.4: The generalized FLANN Structure.

Fig. 5: Structure of an ANN based Equaliser


5 Computational Complexity

A comparison of the computational complexity of the FLANN and PPN structures, both trained by the BP algorithm, is presented. The computational complexity of the FLANN and the PPN is similar if the dimension of the enhanced pattern is the same in both cases. However, in the case of the FLANN with trigonometric functions, extra computations are required for calculating the sin and cos functions, whereas in the PPN only multiplications are needed for the computation of the higher order and outer product terms. Consider an L-layer MLP with n_l nodes (excluding the threshold unit) in layer l (l = 0, 1, ..., L), where n0 and nL are the number of nodes in the input layer and output layer, respectively. Three basic computations, i.e., addition, multiplication and computation of tanh(.), are involved in updating the weights of an MLP and a PPN. The major computational burden of the MLP is due to error propagation for the calculation of the squared-error derivative of each node in all hidden layers. In one iteration, all computations in the network take place in three phases, i.e.,

a. Forward calculation to find the activation value of all nodes of the entire network;

b. Back error propagation for calculation of square error derivatives;

c. Updating of the weights of the entire network.

The total number of weights to be updated in one iteration in an MLP structure is given by

Σ_{l=0}^{L-1} (n_l + 1) n_{l+1}, whereas in the FLANN and the PPN it is given by (n0 + 1). Since hidden layers do not exist in the FLANN and PPN, their computational complexity is drastically reduced compared with the MLP structure. A comparison of the computational complexity of the FLANN and the PPN using the BP algorithm in one iteration is provided in Table 1.

Table 1: Computational Complexity Comparison per Iteration (with n0+ = n0 + 1)

Operation          FLANN              PPN
Addition           2·n1·n0+ + n1      2·n1·n0+ + n1
Multiplication     3·n1·n0+ + n0      3·n1·n0+ + n0
tanh(.)            n1                 n1
cos(.)/sin(.)      n0                 -

6 ANN-Based Channel Equalisation

An equalisation scheme for 4-QAM signals is shown in Fig. 6. Each of the in-phase and quadrature components of the received signal at time instant k, r_{k,I} and r_{k,Q}, passes through a tapped delay line. These delayed signals constitute the input pattern to the ANN and are given by U(k) = [u1 u2 ... u_{M-1}]^T = [r_{k,I} r_{k,Q} r_{k-1,I} r_{k-1,Q} ...]^T. At time instant k, U(k) is applied to the ANN and the network produces two outputs, ŷ1 and ŷ2, corresponding to the estimated values of the in-phase and quadrature components of the transmitted symbol or its delayed version, respectively. For the equalisation problem, two ANN structures, i.e., a PPN and a FLANN structure, along with a linear one trained with the LMS algorithm, are employed for the simulation studies. The BP algorithm is employed for all ANN-based equalisers. Further, in all the ANN structures, all the nodes except those of the input layer have tanh(.) nonlinearity as the activation function. Since equalisation of 4-QAM signals is basically a four-category classification problem and nonlinear boundaries can be formed by using an ANN, it may be conveniently employed to form discriminant functions to classify the input pattern into any one of the four categories.


Fig 6.a: Equalisers for Channel 6 at SNR of 15dB, NL=0 Fig 6.b: Equalisers for Channel 6 at SNR of 15dB, NL=1

7 Simulation Studies

Simulation studies have been carried out for the channel equalisation problem as described by Fig. 1, using the two discussed ANN structures (PPN, FLANN) with the BP algorithm and a linear FIR equaliser with the LMS algorithm. Here the impulse response of the channel considered is given by [4]

h(i) = (1/2) [1 + cos(2π(i - 2)/Λ)],  i = 1, 2, 3;  h(i) = 0 otherwise    (11)

The parameter Λ determines the eigenvalue ratio (EVR) of the input correlation matrix R = E[U(k)U^T(k)], where E is the expectation operator. The EVR is defined as λmax/λmin, where λmax and λmin are the largest and the smallest eigenvalues of R, respectively.

The digital message was a 4-QAM signal constellation of the form ±1 ± j, in which each symbol was obtained from a uniform distribution. A zero-mean Gaussian noise was added to the channel output, and the received signal power was normalized to unity so as to make the SNR equal to the reciprocal of the noise variance at the input of the equaliser. To study the performance of the equaliser under different EVR conditions of the channel, the parameter Λ was varied from 2.9 to 3.5 in steps of 0.2; the corresponding EVR values are 6.08, 11.12, 21.71 and 46.82 for Λ values of 2.9, 3.1, 3.3 and 3.5, respectively. In selecting the detailed structure of the ANNs, the various parameter values, including the learning rate µ, the momentum rate γ, the polynomial order and the number of functions used in the FLANN and PPN, were determined by numerous experiments to give the best result in the respective ANN structures. Polynomial and trigonometric functions were used for the functional expansion of the input pattern in the equalisers based on the PPN and FLANN, respectively. To have a fair comparison between the PPN and FLANN based equalisers, in both cases the input pattern was expanded to an 18-dimensional pattern from r(k) and r(k - 1). Thus both the FLANN and the PPN have 19 and two nodes in the input and output layers, respectively.


Further, a linear FIR equaliser of order eight trained with the LMS algorithm was also simulated. In the case of the FLANN and PPN, the µ and γ values were 0.3 and 0.5, respectively. The MSE floor corresponds to the steady state value of the MSE, which was obtained after averaging over 500 independent runs each consisting of 3000 iterations. To study the BER performance, each of the equaliser structures was trained with 3000 iterations for the optimal weight solution; after completion of the training, iterations of the equaliser were carried out. The BER was calculated over 100 independent runs each consisting of 10^4 data samples. Six different channels were studied with the following normalized transfer functions given in z-transform form:

CH = 1: 1.0
CH = 2: 0.447 + 0.894 z^-1
CH = 3: 0.209 + 0.995 z^-1 + 0.209 z^-2
CH = 4: 0.260 + 0.930 z^-1 + 0.260 z^-2
CH = 5: 0.304 + 0.903 z^-1 + 0.304 z^-2
CH = 6: 0.341 + 0.876 z^-1 + 0.341 z^-2    (12)

CH = 1 corresponds to a channel without any ISI since it has a unity impulse response. CH = 2 corresponds to a non-minimum phase channel [1]. CH = 3, CH = 4, CH = 5 and CH = 6 correspond to Λ values of 2.9, 3.1, 3.3 and 3.5, respectively. Three different nonlinear channel models with the following types of nonlinearity were introduced:

NL = 0: b(k) = a(k)
NL = 1: b(k) = tanh(a(k))
NL = 2: b(k) = a(k) + 0.2 a^2(k) - 0.1 a^3(k)
NL = 3: b(k) = a(k) + 0.2 a^2(k) - 0.1 a^3(k) + 0.5 cos(π a(k))    (13)

NL = 0 corresponds to a linear channel model. NL = 1 corresponds to a nonlinear channel which may occur in the channel due to saturation of amplifiers used in the transmitting system. NL = 2 and NL = 3 are two arbitrary nonlinear channels.
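To make the simulation setup concrete, the sketch below computes the raw raised-cosine taps of Eq. (11) for a given Λ and applies the NL = 2 nonlinearity of Eq. (13) to the linear channel output; note that the CH = 3-6 transfer functions above are the normalized versions of these taps, and noise addition and the complex 4-QAM case are omitted here.

#include <math.h>

/* Raised-cosine channel taps of Eq. (11) for a given Lambda (unnormalized). */
static void channel_taps(double lambda, double h[3])
{
    for (int i = 1; i <= 3; i++)
        h[i - 1] = 0.5 * (1.0 + cos(2.0 * M_PI * (i - 2) / lambda));
}

/* Linear channel output a(k) = sum_i h(i) t(k-i), followed by NL = 2.
 * t[0..2] holds the last three transmitted symbols t(k), t(k-1), t(k-2). */
static double channel_nl2(const double h[3], const double t[3])
{
    double a = h[0] * t[0] + h[1] * t[1] + h[2] * t[2];
    return a + 0.2 * a * a - 0.1 * a * a * a;   /* b(k) of Eq. (13), NL = 2 */
}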

8 MSE Performance Study

Here, the MSE performances of the two ANN structures (FLANN and PPN), along with a linear LMS-based equaliser, are reported considering different channels with linear as well as nonlinear channel models. The convergence characteristics for CH = 6 at an SNR level of 15 dB are plotted in Fig. 6.a-6.d. It may be observed that the ANN based equalisers show a much better convergence rate and lower MSE floor than those of the linear equaliser for linear as well as nonlinear channel models. Out of the two ANN structures, the PPN based equaliser maintains its superior performance over the other two structures in terms of convergence speed and steady state MSE level for the linear and nonlinear channel models.

8.1 BER Performance Study

The bit error rate (BER) provides the true picture of performance of an equaliser. The computation of BER was carried out for the channel equalisation with 4 QAM signal constellations using the three ANN based and the linear LMS based structures.


Fig 6.c: Equalisers for Channel 6 at SNR of 15dB, NL=2 Fig 6.d: Equalisers for Channel 6 at SNR of 15dB, NL=3

8.2 Variation of BER with SNR

The BER performance for CH = 2 is plotted in Fig. 7.a-7.d. The performance of the PPN based equaliser is superior to that of the FLANN based equaliser for both linear and nonlinear channel models. Especially for the severe nonlinear channel model (NL = 3), the PPN based equaliser outperforms the other structures.

9 Effect of EVR on BER Performance

The BER was computed for channels with different EVR values, for linear as well as nonlinear channel models, at an SNR value of 12 dB. The results obtained are plotted in Fig. 8.a-8.d. As the EVR increases, the BER performance of all three equalisers degrades. However, the performance degradation due to the increase in EVR is much less in the ANN based equalisers in comparison to the linear LMS based equaliser. The performance degradation is the least in the PPN based equaliser for the linear and the three nonlinear channel models over a wide variation of EVR from 1 to 46.8.




Fig. 7: BER performance of FLANN based, PPN based and LMS based equalisers for Channel 2 with variation of SNR: (a) NL=0, (b) NL=1, (c) NL=2, (d) NL=3.



Fig. 8: Effect of EVR on the BER performance of the three ANN based and linear LMS based equalisers for CH=2 with variation of SNR: (a) NL=0, (b) NL=1, (c) NL=2, (d) NL=3.


10 Conclusion

It is shown that performance of ANN based equalisers provides substantial improvement in terms of convergence rate, MSE floor level and BER. In a linear equaliser the performance degrades drastically with increase in EVR, especially when the channel is nonlinear. However, it is shown that, in the ANN based equaliser the performance degradation with increase in EVR is not so severe. A PPN based equaliser structure for adaptive channel equalisation has been studied. Because of its single layer structure it offers advantages over the other two. Out of the two ANN equaliser structures (PPN and FLANN), the performance of the PPN is found to be the best in terms of MSE level, convergence rate, BER, effect of EVR and computational complexity for linear as well as nonlinear channel models over a wide range of SNR and EVR variations. Performance of PPN and MLP are similar but the single layer PPN structure is preferable to FLANN as it offers less computational complexity and may be used in other signal processing applications.

References

[1] [Chen, et al., 1990] S. Chen, G. J. Gibson, C. F. N. Cowan, and P. M. Grant, “Adaptive channel equalisation of finite nonlinear channels using multilayer perceptrons,” Signal Processing, vol. 20, pp. 107-119, 1990.

[2] [Soraghan, et al., 1992] W. S. Gan, J. J. Soraghan, and T. S. Durrani, “A new functional-link based equaliser,” Electron. Lett., vol. 28, pp. 1643-1645, Aug. 1992.

[3] [Gibson, et al., 1991] G. J. Gibson, S. Siu, and C. F. N. Cowan, “The application of nonlinear structures to the reconstruction of binary signals,” IEEE Trans. Signal Processing, vol. 39, pp. 1877-1884, Aug. 1991.

[4] [Haykin, 1991] S. Haykin, Adaptive Filter Theory, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1991.

[5] [Meyer, et al., 1993] M. Meyer and G. Pfeiffer, “Multilayer perceptron based decision feedback equalisers for channels with intersymbol interference,” Proc. IEE, vol. 140, pt. 1, pp. 420-424, Dec. 1993.

[6] [Pao, 1989] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley, 1989.

[7] [Siu, et al., 1990] S. Siu, G. J. Gibson, and C. F. N. Cowan, “Decision feedback equalisation using neural network structures and performance comparison with standard architecture,” Proc. Inst. Elect. Eng., vol. 137, pt. 1, pp. 221-225, Aug. 1990.

[8] [Tou, et al., 1981] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1981.

[9] [Widrow, et al., 1990] B. Widrow and M. A. Lehr, “30 years of adaptive neural networks: Perceptron, madaline and back propagation,” Proc. IEEE, vol. 78, pp. 1415-1442, Sept. 1990.

[10] [Xiang, et al., 1994] Z. Xiang, G. Bi, and T. Le-Ngoc, “Polynomial perceptrons and their applications to fading channel equalisation and co-channel interference suppression,” IEEE Trans. Signal Processing, vol. 42, pp. 2470-2479, Sept. 1994.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Implementation of Packet Sniffer

for Traffic Analysis and Monitoring

Arshad Iqbal and Mohammad Zahid
Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India

[email protected] [email protected]

Mohammed A Qadeer Department of Computer Engineering, Zakir Husain College of Engineering & Technology

Aligarh Muslim University, Aligarh-202002, India [email protected]

Abstract

Computer software that can intercept and log traffic passing over a digital network or part of a network is better known as a PACKET SNIFFER. The sniffer captures these packets by setting the NIC card in promiscuous mode and eventually decodes them. The decoded information can be used in any way depending upon the intention of the person who decodes the data (i.e., for malicious or beneficial purposes). Depending on the network structure (hub or switch), one can sniff all or just part of the traffic from a single machine within the network; however, there are some methods to avoid traffic narrowing by switches and so gain access to traffic from other systems on the network. This paper focuses on the basics of a packet sniffer and its working, and on the development of such a tool by an individual on the Linux platform. It also discusses ways to detect the presence of such software on the network and to handle it efficiently. Focus has also been laid on analysing bottleneck scenarios arising in the network using this self-developed packet sniffer. Before the development of this indigenous software, careful observation was made of the working behavior of existing sniffer software such as WIRESHARK (formerly known as ETHEREAL), TCPDUMP and SNORT, which served as the base for the development of our sniffer. For the capture of the packets, a library known as LIBPCAP has been used. The development of such software gives the developer a chance to incorporate additional features that are not in the existing tools.

1 Introduction

A packet sniffer is a program running on a network-attached device that passively receives all data link layer frames passing by the device's network adapter. It is also known as a Network or Protocol Analyzer or Ethernet Sniffer. The packet sniffer captures the data that is


addressed to other machines, saving it for later analysis. It can be used legitimately by a network or system administrator to monitor and troubleshoot network traffic. Using the information captured by the packet sniffer, an administrator can identify erroneous packets and use the data to pinpoint bottlenecks and help maintain efficient network data transmission. Packet sniffers were never made to hack or steal information; they had a different goal, to make things secure. Figure 1 shows how the data travels from the application layer to the network interface card.

Fig. 1: Flow of packets

2 Library: LIBPCAP

Pcap consists of an application programming interface (API) for capturing packets in the network. UNIX-like systems implement pcap in the libpcap library; Windows uses a port of libpcap known as WinPcap. LIBPCAP is a widely used standard packet capture library that was developed for use with the BPF (Berkeley Packet Filter) kernel device. BPF can be considered an OS kernel extension; it is BPF that enables communication between the operating system and the NIC. Libpcap is a C language library that extends the BPF library constructs and is used to capture packets on the network directly from the network adapter. The library provides packet capturing and filtering capability and is usually bundled with the operating system. It was originally developed by the tcpdump developers in the Network Research Group at Lawrence Berkeley Laboratory [Libpcap]. If this library is missing from the operating system, it can be installed later, as it is available as open source.

3 Promiscuous Mode

The network interface card works in two modes

a. Non promiscuous mode (normal mode)

b. Promiscuous mode

When a packet is received by a NIC, it first compares the MAC address of the packet to its own. If the MAC address matches, it accepts the packet; otherwise it filters it out. This is due to the network card discarding all the packets that do not contain its own MAC address, an operation mode called non-promiscuous, which basically means that each network card minds its own business and reads only the frames directed to it. In order to capture the


packets, the NIC has to be set into promiscuous mode. A packet sniffer does its sniffing by setting the NIC card of its own system into promiscuous mode, and hence receives all packets even when they are not intended for it. To set a network card to promiscuous mode, all we have to do is issue a particular ioctl() call on an open socket on that card, and the packets are then passed to the kernel.

4 Sniffer Working Mechanism

When packets are sent from one node to another (i.e., from source to destination) on the network, a packet has to pass through many intermediate nodes. A node whose NIC is set into promiscuous mode receives the packet. The packets arriving at the NIC are copied into the device driver memory and then passed to the kernel buffer, from where they are used by the user application. In the Linux kernel, libpcap uses a “PF_PACKET” socket, which bypasses most packet protocol processing done by the kernel [Dabir and Matrawy, 2007]. Each socket has two kernel buffers associated with it, for reading and writing; by default in Fedora Core 6, the size of each buffer is 109568 bytes. In our packet sniffer, at user level the packets are copied from the kernel buffer into a buffer created by libpcap when a live capture session is created. A single packet is handled by the buffer at a time for application processing before the next packet is copied into it [Dabir and Matrawy, 2007]. The new approach taken in the development of our packet sniffer to improve its performance with libpcap is to share the same buffer space between kernel space and the application. Figures 2 and 3 show the interface of our packet sniffer while capturing packets.

Fig. 2: Packet sniffer while capturing session.

Fig. 3: Shows the details of selected packet
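The size of the per-socket kernel buffer mentioned above can be inspected and, if needed, enlarged from user space with getsockopt()/setsockopt(); the helper below is a small sketch under the assumption that sock is an already created packet socket, and the target size is purely illustrative:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Print the current kernel receive buffer size and ask for a larger one. */
    static void tune_receive_buffer(int sock)
    {
        int size = 0;
        socklen_t len = sizeof(size);

        if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, &len) == 0)
            printf("current kernel receive buffer: %d bytes\n", size);

        size = 2 * 109568;   /* illustrative target: twice the Fedora Core 6 default */
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
    }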


5 Basic Steps for the Development of Packet Sniffer on Linux Platform

We now discuss the basic steps that we took during the development of our packet sniffer. The remaining steps only deal with interpreting the headers and formatting the data. The steps we have taken are as follows.

5.1 Socket Creation

A socket is a bi-directional communication abstraction via which an application can send and receive data.

There are several types of sockets:

SOCK_STREAM: TCP (connection oriented, guaranteed delivery)

SOCK_DGRAM: UDP (datagram based communication)

SOCK_RAW: allows access to the network layer. This can be used to build ICMP messages or custom IP packets.

SOCK_PACKET: allows access to the link layer (e.g. Ethernet). This can be used to build entire frames (for example, to build a user-space router).

When a socket is created, a socket stream, similar to a file stream, is created, through which data is read [Ansari et al., 2003].
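On Linux, this kind of link-layer access is obtained with a packet socket; the sketch below (interface binding and error handling are kept to a minimum, and the buffer size is an assumption) creates such a socket and reads one raw Ethernet frame. On current kernels the PF_PACKET family with SOCK_RAW is used in place of the older SOCK_PACKET type:

    #include <arpa/inet.h>        /* htons */
    #include <linux/if_ether.h>   /* ETH_P_ALL */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char frame[65536];

        /* A PF_PACKET/SOCK_RAW socket delivers complete link-layer frames;
           creating it requires root privileges. */
        int sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (sock < 0) {
            perror("socket");
            return 1;
        }

        ssize_t n = recvfrom(sock, frame, sizeof(frame), 0, NULL, NULL);
        if (n > 0)
            printf("received a frame of %zd bytes\n", n);

        close(sock);
        return 0;
    }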

5.2 To Set NIC in Promiscuous Mode

To enable the packet sniffer to capture packets, the NIC of the node on which the sniffer software is running has to be set into promiscuous mode. In our packet sniffer this was implemented by issuing an ioctl() call to an open socket on that card. The ioctl system call takes three arguments:

a. The socket stream descriptor.

b. The function that the ioctl call is supposed to perform; here the macros used are SIOCGIFFLAGS (to read the interface flags) and SIOCSIFFLAGS (to write them back).

c. A reference to an ifreq structure [Ansari et al., 2003].

Since this is a potentially security-threatening operation, the call is only allowed for the root user. Supposing that "sock" contains an already open socket, the following instructions will do the trick:

    ioctl(sock, SIOCGIFFLAGS, &ethreq);
    ethreq.ifr_flags |= IFF_PROMISC;
    ioctl(sock, SIOCSIFFLAGS, &ethreq);

The first ioctl (SIOCGIFFLAGS) reads the current value of the Ethernet card flags; the flags are then ORed with IFF_PROMISC, and the second ioctl (SIOCSIFFLAGS) writes them back to the card, enabling promiscuous mode.
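A slightly more complete form of the fragment above, with the interface name filled into the ifreq structure, might look as follows (the helper name and the interface-name argument are ours; sock is assumed to be an already open socket):

    #include <net/if.h>       /* struct ifreq, IFF_PROMISC, IFNAMSIZ */
    #include <string.h>
    #include <sys/ioctl.h>    /* ioctl, SIOCGIFFLAGS, SIOCSIFFLAGS */

    /* Put the interface named ifname (e.g. "eth0") into promiscuous mode.
       Returns 0 on success, -1 on failure; requires root privileges. */
    static int set_promiscuous(int sock, const char *ifname)
    {
        struct ifreq ethreq;

        memset(&ethreq, 0, sizeof(ethreq));
        strncpy(ethreq.ifr_name, ifname, IFNAMSIZ - 1);

        if (ioctl(sock, SIOCGIFFLAGS, &ethreq) < 0)   /* read the current flags */
            return -1;
        ethreq.ifr_flags |= IFF_PROMISC;              /* set the promiscuous bit */
        return ioctl(sock, SIOCSIFFLAGS, &ethreq);    /* write the flags back */
    }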

5.3 Protocol Interpretation

In order to interpret a protocol, the developer should have some basic knowledge of the protocol that he wishes to sniff. In our sniffer, developed on the Linux platform, we interpreted the IP, TCP, UDP and ICMP protocols by including the headers <linux/ip.h>, <linux/tcp.h>, <linux/udp.h> and <linux/icmp.h>. Figures 4, 5 and 6 below show some of the packet header formats.
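As an example of such interpretation, the following sketch (the fixed 14-byte Ethernet header offset and the printed fields are our own choices for illustration) overlays the Linux header structures on a captured frame:

    #include <linux/ip.h>     /* struct iphdr */
    #include <linux/tcp.h>    /* struct tcphdr */
    #include <netinet/in.h>   /* IPPROTO_TCP, ntohs */
    #include <stdio.h>

    /* Interpret an Ethernet frame: skip the 14-byte Ethernet header, then
       overlay the IP header and, for TCP packets, the TCP header. */
    static void interpret_frame(const unsigned char *frame)
    {
        const struct iphdr *ip = (const struct iphdr *)(frame + 14);

        printf("IP: ttl=%u protocol=%u\n", ip->ttl, ip->protocol);

        if (ip->protocol == IPPROTO_TCP) {
            const struct tcphdr *tcp =
                (const struct tcphdr *)(frame + 14 + ip->ihl * 4);
            printf("TCP: src port=%u dst port=%u\n",
                   ntohs(tcp->source), ntohs(tcp->dest));
        }
    }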

6 Linux Filter

As network traffic increases, the sniffer will start losing packets since the PC will not be able to process them quickly enough. The solution to this problem is to filter the packets you receive, and print out information only on those you are interested in. One idea would be to insert an "if statement" in the sniffer's source; this would help polish the output of the sniffer, but it would not be very efficient in terms of performance. The kernel would still pull up all the packets flowing on the network, thus wasting processing time, and the sniffer would still examine each packet header to decide whether to print out the related data or not. The optimal solution to this problem is to put the filter as early as possible in the packet-processing chain (it starts at the network driver level and ends at the application level, see Figure 7). The Linux kernel allows us to put a filter, called an LPF, directly inside the PF_PACKET protocol-processing routines, which are run shortly after the network card reception interrupt has been served. The filter decides which packets shall be relayed to the application and which ones should be discarded.
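When the sniffer is built on libpcap, such a kernel-level filter can be installed by compiling a tcpdump-style expression and attaching it to the capture handle; a small sketch is shown below (the expression passed by the caller is arbitrary, and the netmask argument is left as 0 for simplicity):

    #include <pcap.h>
    #include <stdio.h>

    /* Compile a tcpdump-style expression and attach it as a kernel-level filter. */
    static int attach_filter(pcap_t *handle, const char *expression)
    {
        struct bpf_program prog;

        if (pcap_compile(handle, &prog, expression, 1 /* optimize */, 0) == -1) {
            fprintf(stderr, "pcap_compile: %s\n", pcap_geterr(handle));
            return -1;
        }
        if (pcap_setfilter(handle, &prog) == -1) {
            fprintf(stderr, "pcap_setfilter: %s\n", pcap_geterr(handle));
            pcap_freecode(&prog);
            return -1;
        }
        pcap_freecode(&prog);
        return 0;
    }

For example, attach_filter(handle, "tcp port 80") would restrict the capture to HTTP traffic before the packets ever reach the application.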

Fig. 4: TCP protocol header fields

Fig. 5: UDP protocol header fields


Fig. 6: IP protocol header fields

Fig. 7: Filter processing chain

7 Methods to Sniff On Switch

An Ethernet environment in which the hosts are connected to a switch instead of a hub is called a Switched Ethernet. The switch maintains a table keeping track of each computer's MAC address and delivers packets destined for a particular machine to the port on which that machine is connected. The switch is an intelligent device that sends packets to the destined computer only and does not broadcast to all the machines on the network, as in the previous case.

7.1 ARP Spoofing

ARP is used to obtain the MAC address of the destination machine with which we wish to communicate. ARP is stateless: we can send an ARP reply even if one has not been asked for, and such a reply will be accepted. Ideally, when you want to sniff the traffic originating from a machine, you need to ARP spoof the gateway of the network. The ARP cache of that machine will then have a wrong entry for the gateway and is said to be "poisoned". This way all the traffic from that machine destined for the gateway will pass through your machine. Another trick that can be used is to poison a host's ARP cache by setting the gateway's MAC address to FF:FF:FF:FF:FF:FF (also known as the broadcast MAC). There are various utilities available for ARP spoofing. An excellent tool for this is the arpspoof utility that comes with the dsniff suite.

7.2 MAC Flooding

Switches keep a translation table that maps MAC addresses to the physical ports on the switch. As a result, a switch can intelligently route packets from one host to another, but it has a limited amount of memory for this work. MAC flooding makes use of this limitation to bombard the switch with fake MAC addresses until the switch can't keep up. The switch then enters what is known as "fail-open mode", wherein it starts acting as a hub by broadcasting packets to all the machines on the network. Once that happens, sniffing can be performed easily. MAC flooding can be performed with macof, a utility which comes with the dsniff suite.

8 Bottleneck Analysis

With the increase of traffic in the network, the rate of packets being received by the node also increases. On arrival at the NIC, packets have to be transferred to main memory for processing, and a single packet is transferred over the bus at a time. The PCI bus achieves an actual transfer rate of no more than 40 to 50 Mbps, because a device can hold the bus only for a certain number of cycles, after which it has to give up control of the bus. Moreover, the slowest component of a PC is the disk drive, so a bottleneck is created when writing the packets to disk on a traffic-intensive network. To handle this bottleneck we can use buffering in the user-level application: some amount of RAM is used as a buffer to absorb bursts that the disk cannot keep up with.
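One simple way to realize the RAM buffering suggested above is a fixed-size ring buffer placed between the capture loop and the slower disk writer; the sketch below uses illustrative sizes and, for brevity, omits the locking or atomic operations a multi-threaded implementation would need:

    #include <string.h>

    #define SLOT_SIZE 2048   /* bytes kept per packet (illustrative)        */
    #define NUM_SLOTS 4096   /* roughly 8 MB of RAM used as a packet buffer */

    /* Allocate statically or with malloc(); too large for the stack. */
    struct packet_ring {
        unsigned char data[NUM_SLOTS][SLOT_SIZE];
        unsigned int  len[NUM_SLOTS];
        unsigned int  head;  /* next slot to write */
        unsigned int  tail;  /* next slot to read  */
    };

    /* Capture path: copy a packet into RAM instead of writing it to disk.
       Returns 0 on success, -1 if the buffer is full. */
    static int ring_put(struct packet_ring *r, const unsigned char *pkt, unsigned int n)
    {
        unsigned int next = (r->head + 1) % NUM_SLOTS;
        if (next == r->tail)
            return -1;                 /* full: caller may drop or block */
        if (n > SLOT_SIZE)
            n = SLOT_SIZE;
        memcpy(r->data[r->head], pkt, n);
        r->len[r->head] = n;
        r->head = next;
        return 0;
    }

    /* Disk-writer path: fetch the oldest buffered packet, if any. */
    static int ring_get(struct packet_ring *r, unsigned char *out, unsigned int *n)
    {
        if (r->tail == r->head)
            return -1;                 /* empty */
        *n = r->len[r->tail];
        memcpy(out, r->data[r->tail], *n);
        r->tail = (r->tail + 1) % NUM_SLOTS;
        return 0;
    }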

9 Detection of Packet Sniffer

The packet sniffer was designed as a solution to many network problems, but one cannot ignore its malicious use. Sniffers are very hard to detect because of their passive nature, but there is always a way, and some detection techniques are described below.

9.1 ARP Detection Technique

As we know, a sniffing host receives all packets, including those that are not destined for it. A sniffing host makes mistakes by responding to packets that should have been filtered out. So, if an ARP packet is sent to every host, configured so that it does not have the broadcast address as its destination address, and some host responds to it, then that host has its NIC set into promiscuous mode [Sanai]. Since Windows is not an open source OS, we cannot analyze its software filter behavior the way we can in Linux, where the behavior of the filter can be examined in the kernel source code. So, here we present some of the addresses that can be used to perform this test against Windows hosts. They are as follows.


a. FF-FF-FF-FF-FF-FF broadcast address: A packet with this destination address is received by all nodes and responded to by them.

b. FF-FF-FF-FF-FF-FE fake broadcast address: This is a fake broadcast address in which the last bit differs. With this address we check whether the filter examines all the bits of the address before responding.

c. FF-FF-00-00-00-00 fake broadcast 16-bit address: In this address only the first 16 bits are the same as the broadcast address.

d. FF-00-00-00-00-00 fake broadcast 8-bit address: This is a fake broadcast address whose first 8 bits are the same as the broadcast address [Sanai].
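For reference, these probe destination addresses can be written down as byte arrays that a detector would place into the Ethernet destination field of its ARP probes (the array names below are ours, chosen only for illustration):

    #include <stdint.h>

    /* Destination MAC addresses used as ARP probes: a host that answers a
       request sent to one of the fake broadcast addresses is likely to have
       its NIC in promiscuous mode, because a normal hardware filter would
       have dropped the frame. */
    static const uint8_t MAC_BROADCAST[6]     = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
    static const uint8_t MAC_FAKE_BCAST_47[6] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFE};
    static const uint8_t MAC_FAKE_BCAST_16[6] = {0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00};
    static const uint8_t MAC_FAKE_BCAST_8[6]  = {0xFF, 0x00, 0x00, 0x00, 0x00, 0x00};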

9.2 RTT Detection

RTT stands for Round Trip Time: the time a packet takes to reach the destination plus the time taken by the response to return to the source. In this technique, packets are first sent to the host in normal mode and the RTT is recorded. The same host is then set to promiscuous mode, the same set of packets is sent, and the RTT is recorded again. The idea behind this technique is that the RTT measurement increases when the host is in promiscuous mode, since it captures all packets, compared to a host in normal mode [Trabelsi et al., 2004].
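The timing part of this test is straightforward; a sketch of the measurement skeleton is given below (the probe callback is a placeholder for whatever packet train is sent to the suspect host):

    #include <time.h>

    /* Measure, in milliseconds, how long an arbitrary probe takes, e.g. sending
       a burst of packets to the suspect host and waiting for its replies. */
    static double measure_rtt_ms(void (*probe)(void))
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        probe();   /* send the packets and block until the responses arrive */
        clock_gettime(CLOCK_MONOTONIC, &end);

        return (end.tv_sec - start.tv_sec) * 1000.0 +
               (end.tv_nsec - start.tv_nsec) / 1e6;
    }

Comparing the value returned before and after the suspect NIC is believed to have entered promiscuous mode gives the increase the technique looks for.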

10 Future Enhancement

This packet sniffer can be enhanced in the future by incorporating features such as making the program platform independent, filtering packets using a filter table, filtering suspect content from the network traffic, and gathering and reporting network statistics.

11 Conclusion

A packet sniffer is not just a hacker’s tool. It can be used for network traffic monitoring, traffic analysis, troubleshooting and other useful purposes. However, a user can employ a number of techniques to detect sniffers on the network as discussed in this paper and protect the data from being sniffed.

References

[1] [Ansari et al., 2003] S. Ansari, Rajeev S.G. and Chandrasekhar H.S., "Packet Sniffing: A Brief Introduction", IEEE Potentials, Dec 2002-Jan 2003, Volume 21, Issue 5, pp. 17-19.

[2] [Combs, 2007] G. Combs, "Ethereal". Available at http://www.wireshark.org (Aug 15, 2007).

[3] [Dabir and Matrawy, 2007] A. Dabir, A. Matrawy, "Bottleneck Analysis of Traffic Monitoring Using Wireshark", 4th International Conference on Innovations in Information Technology (IEEE Innovations '07), 18-20 Nov. 2007, pp. 158-162.

[4] [Drury, 2000] J. Drury, "Sniffers: What are they and how to protect from them", November 11, 2000, http://www.sans.org/infosecFAQ/switchednet/sniffers.htm

[5] [Kurose, 2005] Kurose, James & Ross, Keith, "Computer Networking", Pearson Education, 2005.

[6] [Libpcap] Libpcap, http://wikepedia.com

[7] [Sanai] Daiji Sanai, "Detection of Promiscuous Nodes Using ARP Packet", http://www.securityfriday.com/

[8] [Sniffing FAQ] Sniffing FAQ, http://www.robertgraham.com

[9] [Sniffer] Sniffer resources, http://packetstorm.decepticons.org

[10] [Stevens, 2001] Richard Stevens, "TCP/IP Illustrated", 2001.

[11] [Stevens and Richard, 2001] Stevens, Richard, "UNIX Network Programming", Prentice Hall India, 2001.

[12] [Stones et al., 2004] Stones, Richard & Matthew, Neil, "Beginning Linux Programming", Wrox Publishers, 2004.

[13] [Trabelsi et al., 2004] Zouheir Trabelsi, Hamza Rahmani, Kamel Kaouech, Mounir Frikha, "Malicious Sniffing System Detection Platform", Proceedings of the 2004 International Symposium on Applications and the Internet (SAINT'04), IEEE Computer Society.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Implementation of BGP Using XORP

Quamar Niyaz, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, [email protected]
S. Kashif Ahmad, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, [email protected]

Mohammad A. Qadeer Department of Computer Engineering, Zakir Husain College of Engineering & Technology

Aligarh Muslim University, Aligarh-202002, India [email protected]

Abstract

In this paper, we present an approach to implementing BGP and discuss an open source routing software package, XORP (eXtensible Open Routing Platform), which we use for routing in our designed networks. In our project we use Linux-based PCs, on which XORP runs, as routers. Each Linux-based PC must have two or more NICs (network interface cards). We implement the BGP routing protocol, which is used for routing between different Autonomous Systems.

1 Introduction

With the continuous growth of the Internet, efficient routing, traffic engineering and QoS (Quality of Service) have become challenges for the network research community. In the current scenario, routers available from established vendors such as Cisco are so architecture dependent that they do not provide APIs that allow third-party applications to run on their hardware. This raises the need, for the network research community, for open source routing software that allows researchers to access the APIs and documentation, to develop their own EGP (Exterior Gateway Protocol) and IGP (Interior Gateway Protocol) routing protocols and new techniques for QoS. XORP [xorp.org] is one such effort in this direction. The goal of XORP is to develop an open source router platform that is stable and fully featured enough for production use, and flexible and extensible enough to enable network research. Currently XORP implements routing protocols for IPv4 and IPv6, including BGP, OSPF, RIP, PIM-SM and IGMP, and a unified means to configure them. The best part of XORP is that it also provides an extensible programming API. XORP runs on many UNIX flavors. In this paper we first discuss the design and architecture of XORP, and after that we discuss the implementation of routing protocols in our designed network using XORP.

2 Architecture of XORP

The XORP design philosophy stresses extensibility, performance, robustness and traditional router features. For the routing and management modules, the primary goals are extensibility and robustness. These goals are achieved by carefully separating functionality into independent modules, running in separate UNIX processes, with well-defined APIs between them.

2.1 Design Overview

XORP can be divided into two subsystems. The higher-level ("user-space") subsystem consists of the routing protocols and management mechanisms. The lower-level ("kernel") subsystem provides the forwarding path and provides APIs for the higher level to access. User-level XORP uses a multi-process architecture with one process per routing protocol, and a novel inter-process communication mechanism known as XORP Resource Locators (XRLs) [Xorp-ipc]. XRL communication is not limited to a single host, so XORP can in principle run in a distributed fashion. For example, we can have a distributed router, with the forwarding engine running on one machine and each of the routing protocols that update that forwarding engine running on a separate control processor system. The lower-level subsystem can use traditional UNIX kernel forwarding, the Click modular router [Kohler et al., 2000] or Windows kernel forwarding (Windows Server 2003). The modularity and minimal dependencies between the lower-level and user-level subsystems allow for many future possibilities for forwarding engines. Figure 1 shows the processes in XORP, which we describe in section 2.2, although it should be noted that some of these modules use separate processes to handle IPv4 and IPv6. For simplicity, the arrows show only the main communication flows used for routing information. [Handley et al. 2002]

Fig. 1: XORP High-level Processes

2.2 XORP Process Description

As shown in Figure 1, there are several processes in the XORP system, some of which implement routing protocols (e.g. OSPF, BGP4+, RIP) and some of which provide management and forwarding mechanisms (e.g. FEA, RIB, SNMP). Among these there are four core processes, FEA, RIB, Router Manager (rtrmgr) and IPC finder, which we describe in the following sections.


2.2.1 FEA (Forwarding Engine Abstraction)

The role of the Forwarding Engine Abstraction (FEA) in XORP is to provide a uniform interface to the underlying forwarding engine. It shields XORP processes from concerns over variations between platforms. The FEA performs four distinct roles: interface management, forwarding table management, raw packet I/O, and TCP/UDP socket I/O[Xorp-fea].

2.2.2 RIB (Routing Information Base)

The RIB process takes routing information from multiple routing protocols, stores these routes, and decides which routes should be propagated on to the forwarding engine. The RIB performs the following tasks:

• Stores routes provided by the routing protocols running on a XORP router.

• If more than one routing protocol provides a route for the same subnet, the RIB decides which route will be used.

• Protocols such as BGP may supply to the RIB routes that have a next-hop that is not an immediate neighbor. Such next hops are resolved by the RIB so as to provide a route with an immediate neighbor to the FEA.

• Protocols such as BGP need to know routing metric and reachability information to next hops that are not immediate neighbors. The RIB provides a way to register interest in such routing information, in such a way that the routing protocol will be notified if a change occurs [Xorp-rib].

2.2.3 rtrmgr (Router Manager)

XORP tries to hide the internal structure of the software from the operator, so that the operator only needs to know the right commands to configure the router. The operator should not need to know that XORP is internally composed of multiple processes, nor what those processes do. All the operator needs to see is a single router configuration file that determines the startup configuration, and a single command line interface that can be used to configure XORP. There is a single XORP process that manages the whole XORP router: the rtrmgr (XORP Router Manager). The rtrmgr is responsible for starting all components of the router, configuring each of them, and monitoring and restarting any failing process. It also provides a CLI (Command Line Interface) to change the router configuration [Xorp-rtrmgr].

2.2.4 IPC (Inter Process Communication) Finder

The IPC finder is needed by the communication method used among all XORP components; each of the XORP components registers with the IPC finder. The main goals of XORP's IPC scheme are:

• To provide all of the IPC communication mechanisms that a router is likely to need, e.g. sockets, ioctls, System V messages, shared memory.

• To provide a consistent and transparent interface irrespective of the underlying mechanism used.

• To provide an asynchronous interface.

• To potentially wrap communication with non-XORP processes, e.g. HTTP and SNMP servers.


• To be renderable in human-readable form, so XORP processes can read and write commands from configuration files.

3 Implementation of BGP Routing Protocol

Routing protocols are classified into two categories: intra-autonomous-system routing protocols and inter-autonomous-system routing protocols, the latter also known as Exterior Gateway Protocols (EGP). An AS (Autonomous System) corresponds to a routing domain that is under one administrative authority and which implements its own routing policies. BGP is an inter-AS routing protocol used for routing between different ASs. To implement and test the protocol we use Linux-based PCs, on which XORP is installed, as routers. Each of these PCs must have two or more NICs. To start XORP a configuration file is needed. The XORP router manager process can be started with the command rtrmgr -b my_config.boot, where my_config.boot is the configuration file. On startup, XORP explicitly configures the specified interfaces and starts all the required XORP components such as the FEA, RIP, OSPF and BGP.

Figure 2 shows the interaction between the configuration files, The Router Manager, FEA etc. In the following section we will discuss the BGP protocol, network topology for the protocol and syntax for it in the configuration file.

Fig. 2: Interaction Between Modules [Xorp-fea]

3.1 BGP (Border Gateway Protocol)

The Border Gateway Protocol is the routing protocol used to exchange routing information across the Internet. It makes it possible for ISPs to connect to each other and for end-users to connect to more than one ISP. BGP is the only protocol that is designed to deal with a network of the Internet's size, and the only protocol that can deal well with having multiple connections to unrelated routing domains [Kurose and Ross, 2007].


3.2 BGP Working

The main concept used in BGP is that of the Autonomous System (AS) which we described earlier.

BGP is used in two different ways:

• eBGP is used to exchange routing information between routers that are in different ASs.

• iBGP is used to exchange information between routers that are in the same AS. Typically these routes were originally learned from eBGP.

Each BGP route carries with it an AS Path, which essentially records the autonomous systems through which the route has passed between the AS where the route was originally advertised and the current AS. When a BGP router passes a route to a router in a neighbouring AS, it prepends its own AS number to the AS path. The AS path is used to prevent routes from looping, and also can be used in policy filters to decide whether or not to accept a route.

When a router receives a route from an iBGP peer, if the router decides this route is the best route to the destination, then it will pass the route on to its eBGP peers, but it will not normally pass the route on to another iBGP peer. This prevents routing information looping within the AS, but it means that by default every BGP router in a domain must be peered with every other BGP router in the domain.

Routers typically have multiple IP addresses, with at least one for each interface, and often an additional routable IP address associated with the loopback interface. When configuring an iBGP connection, it is good practice to set up the peering between the IP addresses on the loopback interfaces. This makes the connection independent of the state of any particular interface. However, most eBGP peerings will be configured using the IP address of the router that is directly connected to the eBGP peer router. Thus if the interface to that peer goes down, the peering session will also go down, causing the routing to correctly fail over to an alternative path.

3.3 Network Topology

In our design we have created three Autonomous Systems: AS65030, AS65020 and AS65040. There are two end systems: one is attached to AS65020 and the other is attached to AS65040.

Configuration on End Systems:

End System 1: IP address 45.230.20.2, subnet mask 255.255.255.0

End System 2: IP address 45.230.30.2, subnet mask 255.255.255.0


In our topology, shown in Figure 3, all the routers are simple PCs on which XORP processes run, enabling them to work as routers. The router in AS65030 has a BGP identifier of 45.230.10.10, which is the IP address of one of its interfaces. This router has two BGP peerings configured, with peers at IP addresses 45.230.10.20 and 45.230.1.10. These peerings are eBGP connections because the peers are in different ASs (65020 and 65040).

[Figure 3 depicts the three autonomous systems AS 65020, AS 65030 and AS 65040, interconnected through the router addresses 45.230.10.10/24, 45.230.10.20/24, 45.230.1.10/24, 45.230.1.20/24, 45.230.20.1/24 and 45.230.30.1/24, with end systems at 45.230.20.2/24 and 45.230.30.2/24.]

Fig. 3: Network Topology for BGP
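To give a feel for the configuration file mentioned in Section 3, the fragment below sketches how the AS65030 router of Figure 3 might be described. It is modelled on the layout of XORP sample configurations; the exact keywords can differ between XORP releases, and the interface names eth0/eth1 are assumptions:

    interfaces {
        interface eth0 {
            vif eth0 {
                address 45.230.10.10 {
                    prefix-length: 24
                }
            }
        }
        interface eth1 {
            vif eth1 {
                address 45.230.1.20 {
                    prefix-length: 24
                }
            }
        }
    }

    protocols {
        bgp {
            bgp-id: 45.230.10.10
            local-as: 65030

            peer 45.230.10.20 {
                local-ip: 45.230.10.10
                as: 65020
                next-hop: 45.230.10.10
            }

            peer 45.230.1.10 {
                local-ip: 45.230.1.20
                as: 65040
                next-hop: 45.230.1.20
            }
        }
    }

Starting the router manager with rtrmgr -b my_config.boot would then bring up the interfaces and the BGP process with these two eBGP peerings.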

References

[1] [Handley et al., 2002] M. Handley, O. Hodson, E. Kohler, "XORP: An Open Platform for Network Research", in Proc. First Workshop on Hot Topics in Networks, Oct. 2002.

[2] [Kohler et al., 2000] E. Kohler, R. Morris, B. Chen, J. Jannotti, M. F. Kaashoek, "The Click Modular Router", ACM Trans. on Computer Systems, vol. 18, no. 3, Aug. 2000.

[3] [Kurose and Ross, 2007] James F. Kurose and Keith W. Ross, "Computer Networking: A Top Down Approach Featuring the Internet", Pearson Education, 2007.

[4] [Xorp-fea] XORP Forwarding Engine Abstraction, XORP technical document, http://www.xorp.org/

[5] [Xorp-ipc] XORP Inter-Process Communication Library, XORP technical document, http://www.xorp.org/

[6] [Xorp-rib] XORP Routing Information Base, XORP technical document, http://www.xorp.org/

[7] [Xorp-rtrmgr] XORP Router Manager Process (rtrmgr), XORP technical document, http://www.xorp.org/

[8] [xorp.org] eXtensible Open Routing Platform, www.xorp.org


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Voice Calls Using IP enabled Wireless Phones

on WiFi / GPRS Networks

Robin Kasana, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, [email protected]
Sarvat Sayeed, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, [email protected]

Mohammad A Qadeer Department of Computer Engineering

Zakir Husain College of Engineering & Technology Aligarh Muslim University, Aligarh-202002, India

[email protected]

Abstract

Research on the related technology and implementation of an IP phone based on a WiFi network is discussed in this paper; it includes the network structure and the technology used in designing the IP phone terminal. This technology is a form of telecommunication that allows data and voice transmissions to be sent across a wide range of interconnected networks. A WiFi-enabled IP phone preinstalled with the Symbian operating system is used, and a software application developed in J2ME allows free and secure communication between selected IP phones in the WiFi network. This communication makes use of routing tables maintained in the WiFi routers. Communication channels are established using the free 2.4 GHz bandwidth. Being a free band, the communication channel is vulnerable to external attacks and hacking. This challenge of creating a secure communication channel is addressed by using two different encryption mechanisms: the payload and the header of the voice data packets are encrypted using two different algorithms. Hence the communication system is made almost fully secure. The WiFi server can also tunnel calls to the GPRS network using a UNC. The system is cost effective, allows easier communication, is well suited to international usage, and can be very useful for large corporations. In time this will become a cheap and secure way to communicate and will have a large effect on university, business and personal communication.

1 Introduction

As humans became more civilized, a great need for more advanced equipment arose. Many things that in their early phase were considered part of leisure have become among the most necessary things of daily life. One such invention is the telephone. However, the development of conventional telephony systems is far behind the development of today's Internet. Centralized architectures with dumb terminals make exchange of data very complex but provide very limited functions. Closed and hardware-proprietary systems hinder enterprises in choosing products from different vendors and deploying a voice function to meet their business needs. Consequently, a Web-like distributed IP phone architecture [Collateral] has been proposed to facilitate enterprises and individuals in providing their own phone services. The advent of Voice over Internet Protocol (VoIP) has been fundamentally transforming the way telecommunication evolves [Yu et al., 2003]. This technology is a form of telecommunication that allows data and voice transmissions to be sent across a wide variety of networks. VoIP allows businesses to talk to other branches, using a PC phone, over corporate Intranets. Driven by the ongoing deployment of broadband infrastructure and the increasing demand for telecommunication services, VoIP technologies and applications have led to the development of economical IP phone equipment for the ever-rising VoIP communication market [Metcalfe, 2000]. Based on embedded systems, IP phone applications can satisfactorily provide the necessary interfaces between telephony signals and IP networks [Ho et al., 2003]. Although IP phone communication over data networks such as LANs exists, these IP phones are of the fixed type. We have tried to implement wireless IP phone communication using the WiFi network. This network, being in the free bandwidth channel, is considered insecure and vulnerable to security threats and hacking. So the areas of concern are the security and running cost of a communication system. As a lot of sensitive information can be lost because of an insecure communication system, a lot of work is required in this field to fill the lacuna. The basic idea is to unify voice and data onto a single network infrastructure by digitizing the voice signals, converting them into IP packets and sending them through an IP network together with the data, instead of using a separate telephony network.

2 Related Work

The primary characteristic of a voice application is that it is extremely delay-sensitive rather than error-sensitive. Several approaches have been developed to support delay-sensitive applications on IP networks. In the transport layer, UDP can be used to carry voice packets while TCP may be used to transfer control signals, since long delays are caused by TCP's retransmission and three-way handshake mechanisms. The Real-Time Transport Protocol (RTP) [Casner et al., 1996] compensates for the real-time deficiencies of packet networks by operating on top of UDP and providing mechanisms for real-time applications to process voice packets. The Real-Time Control Protocol (RTCP) [Metcalfe, 2000] provides quality feedback for the quality improvement and management of the real-time network. Several signaling protocols have been proposed for IP phone applications. SIP is a peer-to-peer protocol. Being simple and similar to HTTP, SIP [Rosenberg et al., 2002] brings the benefits of the WWW architecture into IP telephony and readily runs wherever HTTP runs. It supports a gradual evolution from existing circuit-switched networks to IP packet-switched networks. A lot of work has been done to implement IP phones over data networks, even on the Internet (Skype), but almost all of it has used secure communication channels and fixed IP phones. Work has also been done to connect different heterogeneous networks; for example, UMA (Unlicensed Mobile Access) technology allows the use of both the GPRS network and WiFi networks (indoors) for calling [Arjona and Verkasalo, 2007].


3 IP Phone Communication Over WiFi

IP-based phone communication within a particular WiFi network is free. Moreover, the communication is secured, as the existing WiFi network is used rather than the services of any other carrier. 128-bit encrypted voice communication takes place between authorized and authenticated IP phone users. If a user wants to call the outside world, he has to prefix the number with a symbol, in this case '*'; the call is then routed to the outside world. Also, if the user moves out of WiFi range, handover takes place and the mobile unit again starts working on the GPRS network.

3.1 Architecture

IP enabled cell phones are the mobile units capable of accessing the WiFi network. WiFi routers have routing tables which are used to route the calls to the desired IP phone. A J2ME application was developed which provides access to the IP phone in the WiFi network.

3.2 Connection Mechanism

• Each IP phone registers its fixed IP with the WiFi router, which updates its routing table to mark this IP phone as active (Figure 1).

• The name and number of the phone with the particular IP are searched in the database and IP is replaced with the name of the user in the WiFi routing table.

• If the number starts with a special symbol say asterisk ' * ' then the router tunnels the call to GPRS network using UNC.

When the WiFi signal fades out, handover takes place and the mobile unit starts working on the GPRS network.

Fig. 1: Registering of IP phone in the Routing table

3.3 Management of Call Between WiFi to WiFi

A number (user 2) is dialed using the J2ME application from user 1's mobile unit. The application then sends the number in 128-bit encrypted form to the router, requesting a call to be placed (Figure 2(a)). The router in the WiFi network searches its routing table for the desired number, and if the number is active then a packet of data signaling an incoming call is sent to the corresponding IP on the WiFi network. The J2ME application on user 2's mobile unit alerts the user of an incoming call. The routing table is updated to mark both IP mobile units as busy (Figure 2(b)). When user 2 accepts the incoming call, real-time transfer of voice data packets starts between the two mobile units. The header of each packet is encrypted in such a way that the router can decrypt it and route the packet to the required mobile unit, while the actual voice data is encrypted in such a way that only the other mobile unit can decrypt it; it cannot be decrypted at the router end. When the call is torn down, the routing table is again modified and the busy status is cleared. If the user at the other end does not want to take the call and presses the hang-up button, then the user at the first end is sent a message that the dialed user is busy.

Fig. 2(a): User 1 dialing User 2’s number Fig. 2(b): User 2 receiving a call from User 1

3.4 Management of Call Between WiFi to Public Network

When a user wants to dial a call to the outside world (that is, to the public network), he has to prefix the number he wants to call with an '*'. If he dials "*1234567890", the WiFi router identifies the '*' and routes the call via the broadband connection to the UNC (UMA Network Controller). Up to the UNC, IP is used to carry the voice data packets (refer to Figure 3). After that point it depends upon the UNC which technology is used to carry the packets. Also, if the call has to be routed to the outside world, the packets have to be decrypted, as the UNC is unaware of the encryption used by the WiFi network. Moreover, the packets have to be organized and decrypted according to the needs of the UNC.

3.5 WiFi to GPRS Handover

In case of WiFi to GPRS handover, first the mobile unit has to detect that the WiFi signal has completely faded out. Also now the WiFi service is no longer acceptable. At this stage the mobile unit sends a handover request to a neighboring GPRS cell. The selection of mobile cell depends upon the SIM card present in the mobile unit at that time. Then the core network of the service provider has to handle the resource allocation procedure with the base station controller (BSC) for the GPRS calls. Once the allocation is complete a signal is sent to the mobile unit that the handover has taken place.


Fig. 3: Encryption –Decryption mechanism of the channel

3.6 Implementation

• A cell phone with the Symbian 60 ver. 9 operating system and Java capabilities; it should also be equipped with WLAN 802.11b/g support.

• J2ME software is required to place the calls and allow the encryption to take place for a secured communication.

• A router with the routing tables is required to route the calls to specific online users. The router should be authenticated in the WiFi environment and should also be WLAN 802.11 b/g supported.

3.7 Security

Security is one of the main areas of concern, especially if we are communicating over the free 2.4 GHz WiFi bandwidth. This is taken care of by using two different encryption methods. One is used for encrypting the header of the data packets and can be decrypted by both the WiFi router and the mobile unit, while the payload is encrypted using a different method which can only be decrypted at the other mobile unit (refer to Figure 3). There is very little chance of the signals being tapped, as this whole communication system operates on a private network with authenticated and limited connectivity. Also, the area of coverage being limited, the signals cannot be tapped easily. The system also addresses the new threats that employers face, especially in defence and other sensitive organizations concerned with security and privacy, because of highly sophisticated mobile devices capable of audio and video recording. This communication system solves the problem by confiscating the employees' own mobile phones when they enter the organization and giving them mobile units which are Java enabled and capable of accessing only the WiFi network.
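To make the split between the two keys concrete, the schematic C sketch below shows the idea only: the structure layout, the helper names and the XOR "cipher" are placeholders of our own (the actual system is written in J2ME and would use a real 128-bit cipher). The header key is shared with the router, while the payload key is known only to the two handsets:

    #include <stddef.h>

    /* Illustrative packet layout: a small routing header plus a voice payload. */
    struct voice_packet {
        unsigned char header[16];
        unsigned char payload[160];
    };

    /* Placeholder "cipher": XOR with the key. Used only to keep the sketch
       self-contained; a real implementation would use a proper 128-bit cipher. */
    static void xor_with_key(unsigned char *buf, size_t n, const unsigned char key[16])
    {
        for (size_t i = 0; i < n; i++)
            buf[i] ^= key[i % 16];
    }

    /* Header and payload are protected under different keys: the router holds
       header_key (so it can route the packet), while payload_key is shared only
       between the two mobile units. */
    static void protect_packet(struct voice_packet *p,
                               const unsigned char header_key[16],
                               const unsigned char payload_key[16])
    {
        xor_with_key(p->header,  sizeof(p->header),  header_key);
        xor_with_key(p->payload, sizeof(p->payload), payload_key);
    }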

3.8 Cost Efficient

The cost involved in the setup and running of a communication system is a major issue, and this method of communication deals very effectively with this aspect. The only major cost involved is the setup of the communication system, which also comes out to be much less than that of conventional GPRS and CDMA networks. The running cost of the network consists only of the calls routed through the UNC to the GPRS network, which is the cost levied by the service provider, while calls made within the WiFi network are free of cost. Hence the running cost can be assumed to be nil compared to the running cost of GPRS and CDMA networks, making this a very cost-effective and cost-efficient communication system.

3.9 Coverage

The coverage area of the network depends upon the WiFi router coverage. Unlike a GPRS network, we cannot simply deploy a number of WiFi hotspots to increase the coverage; this is mainly due to the problems faced in handover. The mobile unit will not attempt a handoff until the quality of the signal deteriorates quite considerably. This is a problem when continuous coverage is built, because the client will not attempt to change cells even if another cell is providing better signal strength. This ends up in very late cell changes, poor voice quality and dropped connections. Furthermore, in some cases even if the handover break is short, the perceived voice quality can be very poor for several seconds due to low signal quality prior to the handover. Several methods have been developed to address the handover time, but in practice they are not widely implemented nor supported by current devices [Arjona and Verkasalo, 2007][IEEE HSP5,1999][IEEE HSP2.4,1999][IEEE QoS,2004].

Fig. 4: WiFi Network access to Terrestrial and Cellular Network

3.10 Future Prospects

Through this paper we have tried to establish a new way of communication between two wireless IP phones over the WiFi network. However, there are many areas which remain untouched and demand attention. There is high potential for the development of applications for this communication system, which in turn will transform it into a full-fledged communication system. Applications like Short Messaging Service (SMS) can also be developed. This service would function between two IP phones on the same WiFi network or even on a series of interconnected networks. Data exchange, i.e. sharing and transfer of information and files between two IP phones, is another application waiting to be developed; again this service can function on the same WiFi network or a series of interconnected networks. Accessing and surfing the Internet on the wireless IP phone through a single access point would be very cost efficient. Moreover, by acquiring a list of all the users that are logged on to the network, a real-time chat application can be developed. The interconnected IP phones can also be linked to a server such as Asterisk, making it possible to dial calls outside their native network. This will be quite preferable, as only a single line outside the network is needed, which will allow access to all the connected IP phones.


4 Conclusion

In this paper we have described a new way to provide communication within a specified area. We have proposed the use of IP enabled mobile units which are able to communicate with each other via the WiFi network. With the help of a simple Java application the allowed IP phones can automatically log on to the network and communicate among themselves. The WiFi bandwidth of 2.4 GHz acts as the communication channel between the mobile unit and the router. The same bandwidth is used as a communication channel between the different WiFi networks, thereby treating the whole network as one and creating a huge data cloud. Since the bandwidth of the WiFi network is free, the only cost involved in this communication system is the initial setup cost, hence making it very viable. Although it limits the communication area, it also provides the flexibility to dial calls to the outside world by tunneling the calls through the UNC to the public (terrestrial, GPRS and CDMA) networks. At the same time it addresses the security issues and is an answer to no-mobile zones. These are basically zones where organizations have prohibited the use of mobile phones because of certain security constraints, such as the fear of leakage of sensitive information outside a desired area. Security in the communication channel is maintained as the data packets are 128-bit encrypted.

References

[1] [Arjona and Verkasalo, 2007] Andres Arjona, Hannu Verkasalo, "Unlicensed Mobile Access (UMA) Handover and Packet Data Analysis", Second International Conference on Digital Telecommunications (ICDT'07).

[2] [Casner et al., 1996] Schulzrinne H., Casner S., Fredrick R. and Jacobson V., "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.

[3] [Collateral] Pingtel Corp., "Next Generation VoIP Services and Applications Using SIP and Java", Technology Guide, http://www.pingtel.com/docs/collateral_techguide_final.pdf

[4] [Ho et al., 2003] Chian C. Ho, Tzi-Chiang Tang, Chin-Ho Lee, Chih-Ming Chen, Hsin-Yang Tu, Chin-Sung Wu, Chao-His Chang, Chin-Meng Huan, "H.323 VoIP Telephone Implementation Embedding a Low Power SOC Processor", 0-7803-7749-4/03 IEEE, pp. 163-166.

[5] [IEEE HSP2.4, 1999] IEEE, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer Specifications - High Speed Physical Layer in the 2.4 GHz Band", IEEE Standard 802.11b, 1999.

[6] [IEEE HSP5, 1999] IEEE, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer Specifications - High Speed Physical Layer in the 5 GHz Band", IEEE Standard 802.11a, 1999.

[7] [IEEE QoS, 2004] IEEE, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer Specifications - Amendment: Medium Access Control (MAC) Enhancements of Quality of Service", IEEE Standard P802.11e/D12.0, November 2004.

[8] [Metcalfe, 2000] B. Metcalfe, "The Next Generation Internet", IEEE Internet Computing, vol. 4, pp. 58-59, Jan-Feb 2000.

[9] [Rosenberg et al., 2002] Rosenberg J., Schulzrinne H., Camarillo G., Johnston A., Peterson J., Sparks R., Handley M. and Schooler E., "SIP: Session Initiation Protocol", RFC 2543, The Internet Society, February 21, 2002.

[10] [Yu et al., 2003] Jia Yu, Jan Newmarch, Michael Geisler, "JINI/J2EE Bridge for Large-scale IP Phone Services", Proceedings of the Tenth Asia-Pacific Software Engineering Conference (APSEC'03), 1530-1362/03.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Internet Key Exchange Standard for: IPSEC

Sachin P. Gawate, G.H.R.C.E, Nagpur, [email protected]
N.G. Bawane, G.H.R.C.E., Nagpur, [email protected]
Nilesh Joglekar, IIPL, Nagpur

Abstract

This paper describes the purpose, history, and analysis of IKE [RFC 2409], the current standard for key exchange for the IPSec protocol. We discuss some issues with the rest of IPSec, such as what services it can offer without changing the applications, and whether the AH header is necessary. Then we discuss the various protocols of IKE, and make suggestions for improvement and simplification.

1 Introduction

IPSec is an IETF standard for real-time communication security. In such a protocol, Alice initiates communication with a target, Bob. Each side authenticates itself to the other based on some key that the other side associates with it, either a shared secret key between the two parties, or a public key. Then they establish secret session keys (four keys: one for integrity protection and one for encryption, for each direction). The other major real-time communication protocol is SSL [Roll], standardized with minor changes by the IETF as TLS. IPSec is said to operate at "layer 3" whereas SSL operates at "layer 4". We discuss what this means, and the implications of these choices, in section 1.2.

1.1 ESP vs. AH

There are several pieces to IPSec. One is the IPSec data packet encodings, of which there are two: AH (authentication header), which provides integrity protection, and ESP (encapsulating security payload), which provides encryption and optional integrity protection. Many people argue [FS99] that AH is unnecessary, given that ESP can provide integrity protection. The integrity protection provided by ESP and AH is not identical, however. Both provide integrity protection of everything beyond the IP header, but AH provides integrity protection for some of the fields inside the IP header as well. It is unclear why it is necessary to protect the IP header. If it were necessary, this could be provided by ESP in "tunnel mode" (where a new IP header with ESP is prepended to the original packet, and the entire original packet including the IP header is considered payload, and therefore cryptographically protected by ESP). Intermediate routers cannot enforce AH's integrity protection, because they do not know the session key for the Alice-Bob security association, so AH can at best be used by Bob to check that the IP header was received as launched by Alice. Perhaps an attacker could change the QoS fields, so that the packet would have gotten preferential or discriminatory treatment unintended by Alice, but Bob would hardly wish to discard a packet from Alice if the contents were determined cryptographically to be properly received, just because it traveled by a different path, or according to different handling, than Alice intended. The one function that AH offers that ESP does not provide is that with AH, routers and firewalls know the packet is not encrypted, and can therefore make decisions based on fields in the layer 4 header, such as the ports. (Note: even if ESP is using null encryption, there is no way for a router to know this conclusively on a packet-by-packet basis.) This "feature" of having routers and firewalls look at the TCP ports can only be used with unencrypted IP traffic, and many security advocates argue that IPSec should always be encrypting the traffic. Information such as TCP ports does divulge some information that should be hidden, even though routers have become accustomed to using that information for services like differential queuing. Firewalls also base decisions on the port fields, but a malicious user can disguise any traffic to fit the firewall's policy database (e.g., if the firewall allows HTTP, then run all protocols on top of HTTP), so leaving the ports unencrypted for the benefit of firewalls is also of marginal benefit. The majority of our paper will focus on IKE, the part of IPSec that does mutual authentication and establishes session keys.

1.2 Layer 3 vs. Layer 4

The goal of SSL was to deploy something totally at the user level, without changing the operating systems, whereas the goal of IPSec was to deploy something within the OS and not require changes to the applications. Since everything from TCP down is generally implemented in the OS, SSL is implemented as a process that calls TCP. That is why it is said to be at the "transport layer" (layer 4 in the OSI Reference Model). IPSec is implemented at layer 3, which means it considers everything above layer 3 as data, including the TCP header. The philosophy behind IPSec is that if only the OS needed to change, then by deploying an IPSec-enhanced OS all the applications would automatically benefit from IPSec's encryption and integrity protection services. There is a problem with operating above TCP. Since TCP will not be participating in the cryptography, it will have no way of noticing if malicious data is inserted into the packet stream. TCP will acknowledge such data and send it up to SSL, which will discard it because the integrity check will indicate the data is bogus, but there is no way for SSL to tell TCP to accept the real data at this point. When the real data arrives, it will look to TCP like duplicate data, since it will have the same sequence numbers as the bogus data, so TCP will discard it. So in theory, IPSec's approach of cryptographically protecting each packet independently is a better approach. However, if only the operating system changes, and the applications and the API to the applications do not change, then the power of IPSec cannot be fully utilized. The API just tells the application what IP address is on a particular connection. It cannot inform the application of which user has been authenticated. That means that even if users have public keys and certificates, and IPSec authenticates them, there is no way for it to inform the application. Most likely, after IPSec establishes an encrypted tunnel, the user will have to type a name and password to authenticate to the application. So it is important that eventually the APIs and applications change so that IPSec can inform the application of something more than the IP address of the tunnel endpoint, but until they do, IPSec accomplishes the following:

It encrypts traffic between the two nodes. As with firewalls, IPSec can access a policy database that specifies which IP addresses are allowed to talk to which other IP addresses.

Some applications do authentication based on IP addresses, and the IP address from which information is received is passed up to the application. With IPSec, this form of authentication becomes much more secure, because one of the types of endpoint identifiers IPSec can authenticate is an IP address, in which case the application would be justified in trusting the IP address asserted by the lower layer as the source.

2 Overview of IKE

IKE is incredibly complex, not because there is any intrinsic reason why authentication and session key establishment should be complex, but due to unfortunate politics and the inevitable result of years of work by a large committee. Because it is so complex, and because the documentation is so difficult to decipher, IKE has not received significant review. The IKE exchange consists of two phases. We argue that the second phase is unnecessary. The phase 1 exchange is based on identities such as names, and secrets such as public key pairs or pre-shared secrets between the two identities. The phase 1 exchange happens once, and then allows subsequent setup of multiple phase 2 connections between the same pair of identities. The phase 2 exchanges rely on the session key established in phase 1 to do mutual authentication and establish a phase 2 session key used to protect all the data in the phase 2 security association. It would certainly be simpler and cheaper to just set up a security association in a single exchange, and do away with the phases, but the theory is that although the phase 1 exchange is necessarily expensive (if based on public keys), the phase 2 exchanges can then be simpler and less expensive because they can use the session key created from the phase 1 exchange. This reasoning only makes sense if there will be multiple phase 2 setups inside the same phase 1 exchange. Why would there be multiple phase 2-type connections between the same pair of nodes? Here are the arguments in favor of having two phases:

• It is a good idea to change keys periodically. You can do key rollover of a phase 2 connection by doing another phase 2 connection setup, which would be cheaper than restarting the phase 1 connection setup.

• You can set up multiple connections with different security properties, such as integrity-only, encryption with a short (insecure, snooper-friendly) key, or encryption with a long key.

• You can set up multiple connections between two nodes because the connections are application-to application, and you’d like each application to use its own key, perhaps so that the IPSEC layer can give the key to the application.

We argue against each of these points:

• If you want perfect forward secrecy when you do a key rollover, then the phase 2 exchange is not significantly cheaper than doing another phase 1 exchange. If you are simply rekeying, either to limit the amount of data encrypted with a single key, or to prevent replay after the sequence number wraps around, then a protocol designed specifically for rekeying would be simpler and less expensive than the IKE phase 2 exchanges.

• It would be logical to use the strongest protection needed by any of the traffic for all the traffic, rather than having separate security associations in order to give weaker protection to some traffic. There might be some legal or performance reasons to want to use different protection for different forms of traffic, but we claim that this should be a relatively rare case that we should not be optimizing for. A cleaner method of doing this would be to have completely different security associations rather than multiple security associations loosely linked together with the same phase 1 security association.

• This case (wanting to have each application have a separate key) seems like a rare case, and setting up a totally unrelated security association for each application would suffice. In some cases, different applications use different identities to authenticate. In that case they would need to have separate Phase 1 security associations anyway. In this paper we concentrate on the properties of the variants of Phase 1 IKE. Other than arguably being unnecessary, we do not find any problems with security or functionality with Phase 2 IKE.

3 Overview of Phase I IKE

There are two "modes" of IKE exchange. "Aggressive mode" accomplishes mutual authentication and session key establishment in 3 messages. "Main mode" uses 6 messages and has additional functionality, such as the ability to hide endpoint identifiers from eavesdroppers and to negotiate cryptographic parameters. Also, there are three types of keys upon which a phase 1 IKE exchange might be based: a pre-shared secret key, a public encryption key, or a public signature key. The originally specified protocols based on public encryption keys were replaced with more efficient protocols. The original ones separately encrypted each field with the other side's public key, instead of using the well-known technique of encrypting a randomly chosen secret key with the other side's public key and encrypting all the rest of the fields with that secret key. Apparently a long enough time elapsed before anyone noticed this that they felt they needed to keep the old-style protocol in the specification, for backward compatibility with implementations that might have been deployed during this time. This means there are 8 variants of Phase 1 of IKE! That is because there are 4 types of keys (old-style public encryption key, new-style public encryption key, public signature key, and pre-shared secret key), and for each type of key, a main mode protocol and an aggressive mode protocol. The variants have surprisingly different characteristics. In main mode there are 3 pairs of messages. In the first pair Alice sends a "cookie" (see section 3.2) and requested cryptographic algorithms, and Bob responds with his cookie value and the cryptographic algorithms he will agree to. Messages 3 and 4 consist of a Diffie-Hellman exchange. Messages 5 and 6 are encrypted with the Diffie-Hellman value agreed upon in messages 3 and 4, and here each side reveals its identity and proves it knows the relevant secret (e.g., private signature key or pre-shared secret key). In aggressive mode there are only 3 messages. The first two messages consist of a Diffie-Hellman exchange to establish a session key, and in the 2nd and 3rd messages each side proves it knows both the Diffie-Hellman value and its secret.

3.1 Key Types

We argue one simplification that can be made to IKE is to eliminate the variants based on public encryption keys. It’s fairly obvious why in some situations the pre-shared secret key variants make sense: secret keys give higher performance. But why two variants based on public keys? There are several reasons we can think of for the signature-key-only variant: Each side knows its own signature key, but may not know the other side’s encryption key until the other side sends a certificate. If Alice’s encryption key was escrowed, and her signature key was not, then using the signature keys offers more assurance that you’re talking to Alice rather than the escrow agent. In some scenarios people would not be allowed to have encryption keys, but it is very unlikely that anyone who would have an encryption key would not also have a signature key. But there are no plausible reasons we can come up with that would require variants based on encryption keys. So one way of significantly simplifying IKE is to eliminate the encryption public key variants.

3.2 Cookies

Stateless cookies were originally proposed in Photuris [K94] as a way of defending against denial of service attacks. The server, Bob, has finite memory and computation capacity. In order to prevent an attacker from initiating connections from random IP addresses, and using up all of the state Bob needs in order to keep track of connections in progress, Bob will not keep any state or do any significant computation unless the connect request is accompanied by a number, known as a “cookie”, that consists of some function of the IP address from which the connection is made and a secret known to Bob. In order to connect to Bob, Alice first makes an initial request, and is given a cookie. After telling Alice the cookie value, Bob does not need to remember anything about the connect request. When Alice contacts Bob again with a valid cookie, Bob will be able to verify, based on Alice’s IP address, that Alice’s cookie value is the one Bob would have given Alice. Once he knows that Alice can receive from the IP address she claims to be coming from, he is willing to devote state and significant computation to the remainder of the authentication. Cookies do not protect against an attacker, Trudy, launching packets from IP addresses at which she can receive responses. But in some forms of denial of service attacks the attackers choose random IP addresses as the source, both to make it harder to catch them, and to make it harder to filter out these attacking messages. So cookies are of some benefit.

If computation were the only problem, and Bob had sufficient state to keep track of the maximum number of connect requests that could possibly arrive within the time window before he is allowed to give up and delete the state for the uncompleted connection, it would not be necessary for the cookie to be stateless. But memory is a resource at Bob that can be swamped during a denial of service attack, so it is desirable for Bob not to need to keep any state until he receives a valid cookie. OAKLEY [O98] allowed the cookies to be optional: if Bob was not being attacked and therefore had sufficient resources, he could accept connection requests without cookies, saving a round trip delay and two messages. In Photuris the cookie (and the extra two messages) was always required. The idea behind the OAKLEY stateless cookies, then, is that Bob returns a cookie computed from the requester’s address and a secret of his own, and commits no state until that cookie is echoed back.

In the “main mode” variants, none of the IKE variants allows Bob to be forced into doing a significant amount of computation. However, IKE requires Bob to keep state from the first message, before he knows whether the other side would be able to return a cookie. It would be straightforward to add two messages to IKE to allow for a stateless cookie. However, we claim that stateless cookies can be implemented in IKE main mode without additional messages, by repeating in message 3 the information in message 1. Furthermore, it might be nice, in aggressive mode, to allow cookies to be optional, turned on only by the server when it is experiencing a potential denial of service attack, using the OAKLEY technique.
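As a concrete illustration of the stateless-cookie idea summarized above, the sketch below computes a cookie as a keyed hash of the requester's address and a server secret; the hashed fields, the cookie length, and the names are our own assumptions, not the Photuris or IKE cookie definition.

```python
import hashlib
import hmac
import os

# The cookie is a keyed hash of the requester's address and a secret known
# only to Bob, so Bob keeps no per-request state until the cookie comes back.
SERVER_SECRET = os.urandom(32)            # would be rotated periodically

def make_cookie(src_ip: str, src_port: int) -> bytes:
    data = f"{src_ip}:{src_port}".encode()
    return hmac.new(SERVER_SECRET, data, hashlib.sha256).digest()[:8]

def cookie_is_valid(src_ip: str, src_port: int, cookie: bytes) -> bool:
    return hmac.compare_digest(cookie, make_cookie(src_ip, src_port))

# Bob hands out a cookie without remembering anything about the request...
c = make_cookie("192.0.2.7", 500)
# ...and only commits state and computation once the same address echoes it.
assert cookie_is_valid("192.0.2.7", 500, c)
assert not cookie_is_valid("203.0.113.9", 500, c)
```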

3.3 Hiding Endpoint Identities

One of the main intentions of main mode was the ability to hide the endpoint identifiers. Although it’s easy to hide the identifier from a passive attacker, with some key types it is difficult to design a protocol to prevent an active attacker from learning the identity of one end or the other. If it is impossible to hide one side’s identity from an active attacker, we argue it would be better for the protocol to hide the initiator’s identity rather than the responder’s (because the responder is likely to be at a fixed IP address so that it can be easily found, while the initiator may roam and arrive from a different IP address each day). Keeping that in mind, we’ll summarize how well the IKE variants do at hiding endpoint identifiers. In all of the aggressive mode variants, both endpoint identities are exposed, as would be expected. Surprisingly, however, we noticed that the signature key variant of aggressive mode could have easily been modified, with no technical disadvantages, to hide both endpoint identifiers from an eavesdropper, and the initiator’s identity even from an active attacker! In the relevant portion of that protocol, the endpoint identifiers are carried in the clear in the first two messages.

The endpoint identifiers could have been hidden by removing them from messages 1 and 2 and including them, encrypted with the Diffie-Hellman shared value, in messages 2 (Bob’s identifiers) and 3 (Alice’s identifiers). In the next sections we discuss how the main mode protocols hide endpoint identifiers.

3.3.1 Public Signature Keys

In the public signature key main mode, Bob’s identity is hidden even from an active attacker, but Alice’s identity is exposed to an active attacker impersonating Bob’s address to Alice. The relevant part of the protocol is the final pair of messages: Alice reveals her identity and proves knowledge of her private signature key in message 5, and Bob does so in message 6.

An active attacker impersonating Bob’s address to Alice will negotiate a Diffie-Hellman key with Alice and discover her identity in msg 5. The active attacker will not be able to complete the protocol since it will not be able to generate Bob’s signature in msg 6.

The protocol could be modified to hide Alice’s identity instead of Bob’s from an active attacker. This would be done by moving the information from msg 6 into msg 4. This even completes the protocol in one fewer message. And as we said earlier, it is probably in practice more important to hide Alice’s identity than Bob’s.

3.3.2 Public Encryption Keys

In this variant both sides’ identities are protected even against an active attacker. Although the protocol is much more complex, the main idea is that the identities (as well as the Diffie-Hellman values in the Diffie-Hellman exchange) are transmitted encrypted with the other side’s public key, so they will be hidden from anyone that doesn’t know the other side’s private key. We offer no optimizations to the public encryption key variants of IKE other than suggesting their removal.

3.3.3 Pre-Shared Key

In this variant, both endpoints’ identities are revealed, even to an eavesdropper! The reason is as follows.

Since the endpoint identifiers are exchanged encrypted, it would seem as though both endpoint identifiers would be hidden. However, Bob has no idea who he is talking to after message 4, and the key with which messages 5 and 6 are encrypted is a function of the pre-shared key between Alice and Bob. So Bob can’t decrypt message 5, which reveals Alice’s identity, unless he already knows, based on messages 1-4, who he is talking to!

The IKE spec recognizes this property of the protocol, and specifies that in this mode the endpoint identifiers have to be the IP addresses! In which case, there’s no reason to include them in messages 5 and 6 since Bob (and an eavesdropper) already knows them!

Main mode with pre-shared keys is the only required protocol. One of the reasons you’d want to use IPSec is in the scenario in which Alice, an employee traveling with her laptop, connects into the corporate network from across the Internet. IPSec with pre-shared keys would seem a logical choice for implementing this scenario. However the protocol as designed is completely useless for this scenario since by definition Alice’s IP address will be unpredictable if she’s attaching to the Internet from different locations. It would be easy to fix the protocol. The fix is to encrypt messages 5 and 6 with a key which is a function of the shared Diffie-Hellman value, and not also a function of the pre-shared key. Proof of knowledge of the pre-shared key is already done inside messages 5 and 6. In this way an active attacker who is acting as a man-in-the-middle in the Diffie-Hellman exchange would be able to discover the endpoint identifiers, but an eavesdropper would not. And more importantly than whether the endpoint identifiers are hidden, it allows use of true endpoint identifiers, such as the employee’s name, rather than IP addresses. This change would make this mode useful in the scenario (road warrior) in which it would be most valuable.
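The following minimal sketch contrasts the two key derivations just discussed; the PRF and the placeholder values are illustrative assumptions, not the actual IKE key-derivation formulas.

```python
import hashlib
import hmac
import os

# Contrast of the two key derivations discussed above. The PRF and the
# placeholder values are illustrative assumptions, not the IKE formulas.
def prf(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

dh_shared = os.urandom(32)        # stands in for g^xy from messages 3 and 4
preshared = b"pre-shared secret between Alice and Bob"
nonces = os.urandom(16)           # stands in for the exchanged nonces

# As specified: the key protecting messages 5 and 6 depends on the pre-shared
# key, so Bob must already know who Alice is in order to decrypt her identity.
key_as_specified = prf(preshared, dh_shared + nonces)

# Suggested fix: derive that key from the Diffie-Hellman value alone, and
# carry proof of knowledge of the pre-shared key inside the encrypted payload.
key_fixed = prf(dh_shared, nonces)
proof_of_psk = prf(preshared, dh_shared + b"proof")   # sent inside message 5/6
```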

4 Negotiating Security Parameters

IKE allows the two sides to negotiate which encryption, hash, integrity protection, and Diffie-Hellman parameters they will use. Alice makes a proposal of a set of algorithms and Bob chooses. Bob does not get to choose 1 from column A, 1 from column B, 1 from column C, and 1 from column D, so to speak. Instead Alice transmits a set of complete proposals. While this is more powerful in the sense that it can express the case where Alice can only support certain combinations of algorithms, it greatly expands the encoding in the common case where Alice is capable of using the algorithms in any combination. For instance, if Alice can support 3 of each type of algorithm, and would be happy with any combination, she’d have to specify 81 (3^4) sets of choices to Bob in order to tell Bob all the combinations she can support! Each choice takes 20 bytes to specify: 4 bytes for a header and 4 bytes for each of encryption, hash, authentication, and Diffie-Hellman, so the full proposal set comes to 1620 bytes.
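A quick back-of-the-envelope check of the encoding blow-up described above; the per-proposal byte counts are the ones quoted in the text, the rest is simple arithmetic.

```python
# Encoding blow-up when Alice supports 3 algorithms of each of the 4 types
# (encryption, hash, authentication, DH group) and accepts any combination.
choices_per_type = 3
attribute_types = 4
bytes_per_proposal = 4 + 4 * attribute_types          # header + 4 attributes

proposals = choices_per_type ** attribute_types
print(proposals, "proposals,", proposals * bytes_per_proposal, "bytes")
# -> 81 proposals, 1620 bytes
```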

5 Additional Functionality

Most of this paper dealt with simplifications we suggest for IKE. But in this section we propose some additional functionality that might be useful.

5.1 Unidirectional Authentication

In some cases only one side has a cryptographic identity. For example, a common use case for SSL is where the server has a certificate and the user does not. In this case SSL creates an encrypted tunnel. The client side knows it is talking to the server, but the server does not know who it is talking to. If the server needs to authenticate the user, the application typically asks for a name and password. The one-way authentication is vital in this case because the user has to know he is sending his password to the correct server, and the protocol also ensures that the password will be encrypted when transmitted. In some cases security is useful even if it is only one-way. For instance, a server might be disseminating public information, and the client would like to know that it is receiving this information from a reliable source, but the server does not need to authenticate the client. Since this is a useful case in SSL, it would be desirable to allow for unidirectional authentication within IPSec. None of the IKE protocols allow this.

5.2 Weak Pre-shared Secret Key

The IKE protocol for pre-shared secrets depends on the secret being cryptographically strong. If the secret were weak, say because it was a function of a password, an active attacker (someone impersonating one side to the other) could obtain information with which to do an off-line dictionary attack. The relevant portion of the IKE protocols is that first the two sides generate a Diffie-Hellman key, and then one side sends the other something which is encrypted with a function of the Diffie-Hellman key and the shared secret. If someone were impersonating the side that receives this quantity, they know the Diffie-Hellman value, so the encryption key is a function of a known quantity (the Diffie-Hellman value) and the weak secret. They can test a dictionary full of values and recognize when they have guessed the user’s secret. The variant we suggest at the end of section 3.3.3 improves on the IKE pre-shared secret protocol by allowing identities other than IP addresses to be authenticated, but it is still vulnerable to dictionary attack by an active attacker, in the case where the secret is a weak secret. Our variant first establishes an anonymous Diffie-Hellman value, and then sends the identity, and some proof of knowledge of the pre-shared secret, encrypted with the Diffie-Hellman value. Whichever side receives this proof first will be able to do a dictionary attack and verify when they’ve guessed the user secret.

There is a family of protocols [BM92], [BM94], [Jab96], [Jab97], [Wu98], [KP01], in which a weak secret, such as one derived from a password, can be used in a cryptographic exchange in a way that is invulnerable to dictionary attack, either by an eavesdropper or by someone impersonating either side. The first such protocol, EKE, worked by encrypting a Diffie-Hellman exchange with a hash of the weak secret, and then authenticating based on the strong secret created by the Diffie-Hellman exchange. The ability to use a weak secret such as a password in a secure way is very powerful in the case where it is a user being authenticated. The current IKE pre-shared secret protocol could be replaced with one of these protocols at no loss in security or performance. For instance, a 3-message protocol based on EKE would look like the exchange described next.

The user types her name and password at the client machine, so that it can compute W. Alice sends her name, and her Diffie-Hellman value encrypted with W. Bob responds with his Diffie-Hellman value, and a hash of the Diffie-Hellman key, which could only agree with the one computed by Alice if Alice used the same W as Bob has stored. In the third message, Alice authenticates by sending a different hash of the Diffie-Hellman key. This protocol does not hide Alice’s identity from a passive attacker. Hiding Alice’s identity could be accomplished by adding two additional messages at the beginning, in which a separate Diffie-Hellman exchange is done, and the remaining 3 messages are encrypted with that initially established Diffie-Hellman key.
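A minimal, toy sketch of the exchange just described, under our own assumptions about the group, the "encryption with W" construction and the message layout; this is not the EKE or IKE specification and the parameters are far too small for real use.

```python
import hashlib
import hmac
import secrets

# Toy group parameters -- chosen only so the sketch runs instantly.
P = 2 ** 61 - 1
G = 3

def kdf(*parts: bytes) -> bytes:
    return hashlib.sha256(b"|".join(parts)).digest()

def mask(w: bytes, value: int) -> int:
    # "Encryption with W" modelled as XOR with a W-derived pad (involutive).
    return value ^ int.from_bytes(kdf(w, b"pad"), "big")

# Both sides hold W, a hash of Alice's password (Bob stores it).
W = kdf(b"alice's password")

# Message 1, Alice -> Bob: her name plus her Diffie-Hellman value masked with W.
a = secrets.randbelow(P - 2) + 2
msg1 = ("Alice", mask(W, pow(G, a, P)))

# Message 2, Bob -> Alice: his Diffie-Hellman value and a hash of the DH key,
# which can only match Alice's computation if Bob holds the same W.
b = secrets.randbelow(P - 2) + 2
k_bob = pow(mask(W, msg1[1]), b, P)
msg2 = (pow(G, b, P),
        hmac.new(kdf(str(k_bob).encode()), b"bob", hashlib.sha256).digest())

# Message 3, Alice: check Bob's hash, then authenticate with a different hash.
k_alice = pow(msg2[0], a, P)
assert hmac.compare_digest(
    msg2[1], hmac.new(kdf(str(k_alice).encode()), b"bob", hashlib.sha256).digest())
msg3 = hmac.new(kdf(str(k_alice).encode()), b"alice", hashlib.sha256).digest()
```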

References

[1] [BM92] S. Bellovin and M. Merritt, “Encrypted Key Exchange: Password-Based Protocols Secure Against Dictionary Attacks”, Proceedings of the IEEE Symposium on Research in Security and Privacy, May 1992.
[2] [BM94] S. Bellovin and M. Merritt, “Augmented Encrypted Key Exchange: A Password-Based Protocol”, 1994.
[3] [FS99] N. Ferguson and B. Schneier, “A Cryptographic Evaluation of IPSec”, http://www.counterpane.com, April 1999.
[4] [Jab96] D. Jablon, “Strong Password-Only Authenticated Key Exchange”, ACM Computer Communications Review, October 1996.
[5] [Jab97] D. Jablon, “Extended Password Protocols Immune to Dictionary Attack”, Enterprise Security Workshop, June 1997.
[6] [K94] P. Karn, “The Photuris Key Management Protocol”, Internet Draft draft-karn-photuris-00.txt, December 1994.
[7] [KP01] C. Kaufman and R. Perlman, “PDM: A New Strong Password-Based Protocol”, USENIX Security Conference, 2001.

[8] [O98] H. Orman, “The OAKLEY Key Determination Protocol”, RFC 2412, Nov 1998.
[9] [PK00] R. Perlman and C. Kaufman, “Key Exchange in IPSec: Analysis of IKE”, IEEE Internet Computing, Nov/Dec 2000.
[10] [R01] E. Rescorla, SSL and TLS: Designing and Building Secure Systems, Addison Wesley, 2001.
[11] [RFC2402] S. Kent and R. Atkinson, “IP Authentication Header”, RFC 2402, Nov 1998.
[12] [RFC2406] S. Kent and R. Atkinson, “IP Encapsulating Security Payload (ESP)”, RFC 2406, Nov 1998.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Autonomic Elements to Simplify

and Optimize System Administration

K. Thirupathi Rao, K.V.D. Kiran, S. Srinivasa Rao, D. Ramesh Babu and M. Vishnuvardhan
Department of Computer Science and Engineering,
Koneru Lakshmaiah College of Engineering, Green Fields, India-522502

Abstract

Most computer systems are becoming increasingly large and complex, thereby compounding many reliability problems. Too often computer systems fail, become compromised, or perform poorly. One of the most interesting methods for improving system reliability is autonomic management, which offers a potential solution to these challenging research problems. It is inspired by nature and by biological systems, such as the autonomic nervous system, that have evolved to cope with the challenges of scale, complexity, heterogeneity and unpredictability by being decentralized, context aware, adaptive and resilient. Today, a significant part of system administration work specifically involves the process of reviewing the results given by the monitoring system and the subsequent use of administration or optimization tools. Due to the sustained trend toward ever more distributed applications, this process is much more complex in practice than it appears in theory. Each additional component increases the number of possible “adjustments” enabling optimal implementation of the services in terms of availability and performance. To master this complexity, in this paper we present a model that describes the chain of actions/reactions needed to achieve a desirable degree of automation through an autonomic element.

1 Introduction

With modern computing, consisting of new paradigms such as planetary-wide computing, pervasive computing, and ubiquitous computing, systems are more complex than before. Interestingly, when chip design became more complex we employed computers to design the chips; today we are at the point where humans have limited input to chip design. With systems becoming more complex, it is a natural progression to have the system not only automatically generate code but also build systems and carry out the day-to-day running and configuration of the live system. Autonomic computing has therefore become inevitable and will become more prevalent.

Dealing with the growing complexity of computing systems requires autonomic computing. Autonomic computing is inspired by biological systems such as the autonomic human nervous system [1, 2] and enables the development of self-managing computing systems and applications. The systems and applications use autonomic strategies and algorithms to handle complexity and uncertainties with minimum human intervention. An autonomic application or system is a collection of autonomic elements, which implement intelligent control loops to monitor, analyze, plan and execute using knowledge of the environment. A fundamental principle of autonomic computing is to increase the intelligence of individual computer components so that they become “self-managing,” i.e., actively monitoring their state and taking corrective actions in accordance with overall system-management objectives. The autonomic nervous system of the human body controls bodily functions such as heart rate, breathing and blood pressure without any conscious attention on our part. The parallel notion when applied to autonomic computing is to have systems that manage themselves without active human intervention. The ultimate goal is to create autonomic computer systems that are self-managing and more powerful; users and administrators will get more benefit from computers, because they can concentrate on their work with little conscious intervention. The paper is organized as follows. Section 2 deals with the characteristics of an autonomic computing system, Section 3 with an architecture for autonomic computing, and Section 4 with the autonomic elements for simplifying and optimizing system administration; Section 5 concludes the paper, followed by references.

2 Characteristics of Autonomic Computing System

The new era of computing is driven by the convergence of biological and digital computing systems. To build tomorrow’s autonomic computing systems we must understand the workings of autonomic systems and exploit their characteristics. Autonomic systems and applications exhibit the following characteristics, some of which are discussed in [3, 4].

Self Awareness: An autonomic system or application “knows itself” and is aware of its state and its behaviors.

Self Configuring: An autonomic system or application should be able to configure and reconfigure itself under varying and unpredictable conditions without any detailed human intervention in the form of configuration files or installation dialogs.

Self Optimizing: An autonomic system or application should be able to detect suboptimal behaviors and optimize itself to improve its execution.

Self-Healing: An autonomic system or application should be able to detect and recover from potential problems and continue to function smoothly.

Self Protecting: An autonomic system or application should be capable of detecting and protecting its resources from both internal and external attack and maintaining overall system security and integrity.

Context Aware: An autonomic system or application should be aware of its execution environment and be able to react to changes in the environment.

Open: An autonomic system or application must function in a heterogeneous world and should be portable across multiple hardware and software architectures. Consequently it must be built on standard and open protocols and interfaces.

Anticipatory: An autonomic system or application should be able to anticipate, to the extent possible, its needs and behaviors and those of its context, and be able to manage itself proactively.

Dynamic: Systems are becoming more and more dynamic in a number of aspects such as dynamics from the environment, structural dynamics, huge interaction dynamics and from a software engineering perspective the rapidly changing requirements for the system. Machine failures and upgrades force the system to adapt to these changes. In such a situation, the system needs to be very flexible and dynamic.

Distribution: Systems are becoming more and more distributed. This includes physical distribution, due to the spread of networks into every system, and logical distribution, because there is more and more interaction between applications on a single system and between entities inside a single application.

Situatedness: There is an explicit notion of the environment in which the system and the entities of the system exist and execute; environmental characteristics affect their execution, and they often explicitly interact with that environment. Such an (execution) environment becomes a primary abstraction that can have its own dynamics, independent of the intrinsic dynamics of the system and its entities. As a consequence, we must be able to cope with uncertainty and unpredictability when building systems that interact with their environment. This situatedness often implies that only local information is available to the entities in the system, or to the system itself as part of a group of systems.

Locality in control: When computing systems and components live and interact in an open world, the concept of a global flow of control becomes meaningless. Independent computing systems have their own autonomous flows of control, and their mutual interactions do not imply any join of these flows. This trend is made stronger by the fact that not only do independent systems have their own flow of control, but also different entities within a system have their own flow of control.

Locality in interaction: Physical laws enforce locality of interaction automatically in a physical environment. In a logical environment, if we want to minimize conceptual and management complexity, we must also favor modeling the system in local terms and limiting the effect of a single entity on the environment. Locality in interaction is a strong requirement when the number of entities in a system increases, or as the dimension of the distribution scale increases. Otherwise, tracking and controlling concurrent and autonomously initiated interactions is much more difficult than in object-oriented and component-based applications, because autonomously initiated interactions imply that we cannot know what kind of interaction is done and have no clue about when a (specific) interaction is initiated.

Need for global autonomy: The characteristics described so far make it difficult to understand and control the global behavior of the system or of a group of systems. Still, there is a need for coherent global behavior. Some functional and non-functional requirements that have to be met by computer systems are so complex that a single entity cannot provide them. We need systems consisting of multiple entities which are relatively simple, and where the global behavior of the system provides the functionality for the complex task.

3 Architecture for Autonomic Computing

Autonomic systems are composed of autonomic elements and are capable of carrying out administrative functions, managing their behaviors and their relationships with other systems and applications with reduced human intervention, in accordance with high-level policies. Autonomic computing systems can make decisions and manage themselves in three scopes. These scopes are discussed in detail in [6].

Resource Element Scope: In resource element scope, individual components such as servers and databases manage themselves.

Group of Resource Elements Scope: In the group of resource elements scope, pools of grouped resources that work together perform self-management. For example, a pool of servers can adjust workload to achieve high performance.

Business Scope: The overall business context can be self-managing. It is clear that increasing the maturity level of autonomic computing will affect the level at which decisions are made.

3.1 Autonomic Element

Autonomic Elements (AEs) are the basic building blocks of autonomic systems and their interactions produce self-managing behavior. Each AE has two parts: a Managed Element (ME) and an Autonomic Manager (AM), as shown in Figure 1.

Fig. 1: Building Blocks of Autonomic Systems

Sensors retrieve information about the current state of the environment of the ME, which the AE then compares with expectations held in its knowledge base. The required action is executed by effectors. Therefore, sensors and effectors are linked together and create a control loop.

The description of Figure 1 is as follows.

Managed Element: It is a component of the system. It can be hardware, application software, or an entire system.

Autonomic Manager: The AM executes according to the administrator’s policies and implements self-management. An AM uses a manageability interface to monitor and control the ME. It has four parts: monitor, analyze, plan, and execute.

Monitor: The monitoring module provides different mechanisms to collect, aggregate, filter, monitor and manage the information collected by its sensors from the environment of the ME.

Analyze: The analyze module performs the diagnosis of the monitoring results and detects any disruptions in the network or system resources. This information is then transformed into events. It helps the AM to predict future states.

Plan: The planning module defines the set of elementary actions to perform according to these events. Plan uses policy information together with the analysis results to achieve goals. Policies can be a set of administrator ideas and are stored as knowledge to guide the AM. Plan assigns tasks and resources based on the policies, and adds, modifies, and deletes policies. AMs can change resource allocation to optimize performance according to the policies.

Execute: It controls the execution of a plan and dispatches the recommended actions to the ME. These four parts provide the control-loop functionality.
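As an illustration of the monitor-analyze-plan-execute loop described above, the following sketch wires the four parts around a hypothetical managed element; the sensor, effector, policy names and thresholds are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AutonomicManager:
    sensors: Callable[[], Dict[str, float]]             # read the ME's state
    effectors: Dict[str, Callable[[], None]]            # actions on the ME
    policies: List[dict] = field(default_factory=list)  # knowledge base

    def monitor(self) -> Dict[str, float]:
        return self.sensors()

    def analyze(self, state: Dict[str, float]) -> List[dict]:
        # An "event" is raised for every policy whose condition holds.
        return [p for p in self.policies if p["condition"](state)]

    def plan(self, events: List[dict]) -> List[str]:
        return [p["action"] for p in events]

    def execute(self, plan: List[str]) -> None:
        for action in plan:
            self.effectors[action]()

    def control_loop(self) -> None:
        self.execute(self.plan(self.analyze(self.monitor())))

# Example policy: if free space in the log partition drops below 10%, clean up.
am = AutonomicManager(
    sensors=lambda: {"log_free_pct": 7.0},
    effectors={"purge_archived_logs": lambda: print("purging archived logs")},
    policies=[{"condition": lambda s: s["log_free_pct"] < 10.0,
               "action": "purge_archived_logs"}],
)
am.control_loop()
```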

3.2 AC Toolkit

IBM assigns autonomic computing maturity levels to its solutions. There are five levels total and they progressively work toward full automation [5].

Basic Level: At this level, each system element is managed by IT professionals. Configuring, optimizing, healing, and protecting IT components are performed manually.

Managed Level: At this level, system management technologies can be used to collect information from different systems. It helps administrators to collect and analyze information. Most analysis is done by IT professionals. This is the starting point of automation of IT tasks.

Predictive Level: At this level, individual components monitor themselves, analyze changes, and offer advice. Therefore, dependency on persons is reduced and decision making is improved.

Adaptive Level: At this level, IT components can individually or group-wise monitor and analyze operations, and offer advice with minimal human intervention.

Autonomic Level: At this level, system operations are managed by business policies established by the administrator. In fact, business policy drives overall IT management, while at the adaptive level there is still interaction between human and system.

4 Autonomic Elements to Simplify and Optimize System Administration

Although computers are one of the main drivers for the automation and acceleration of almost all business processes, maintaining such computer systems is still mostly manual labor. As this seems ironic, new approaches are being presented in order to make the machine take care of such support tasks itself, i.e. automatically. Automation means that predefined actions are independently executed by the machine under specific conditions. Since carrying out specified actions like scripts and programs is the primary task of most computer systems, the challenge obviously is defining the conditions. The essence of such conditions is a set of logical rules referring to measurement data. Yet, which values are to be collected, which relationships must be represented, and how will this create an automatically maintained and documented IT landscape? These are the key questions for automation.

4.1 Present Status

Today, automation is left in the hands of the technician responsible for a specific system. The focus of such an administrator is to relieve himself of repetitive, tedious tasks. Anyone who has repaired the same minor item on a computer 50 times will write an appropriate script. That this results in a positive reduction of workload is indisputable. But the actual benefit of such a private action is usually difficult to plan; normally it cannot be transferred to other systems or environments, and it is rarely documented. Consequently, this procedure cannot be considered automation in the conventional sense. Although such scripts may show good results, it is certainly not possible to develop an IT strategy upon them.

4.2 Ideal Process

To achieve a desirable degree of automation, first the terms of the automation environment, the results of the automation and the remaining manual labor must all be defined. The processes in an automated environment can be described as follows:

1. A measurement process constantly monitors the correct functioning of the IT system.

2. Should a problem occur, a set of rules is activated that classifies the problem.

3. This rule set continues to initiate actions and analyze their results in combination with the measurement data until either:

a. The problem is resolved, or

b. The set of rules can no longer initiate any action and passes the result of the work done up to that point on to an intelligent person who then attempts to solve the problem.

The steps listed above are challenging, and just starting with their execution means first determining the proper functioning of the IT systems through measurement.

4.3 The Model

For the administration of servers, networks, applications, etc., tools have been available from the very beginning to take care of traditional administration tasks. In a technical environment these are divided into monitoring/debugging tools and administration/optimization applications. Monitoring tools monitor the services and functionalities of the respective servers, or aspects of them. They provide the administrator with information about the status and performance of the services and processes. The information flow for administration and optimization tools is usually reversed: the administrator decides on the actions he wants to take. He will make his decisions based on, among other things, observations derived from monitoring. These procedures can then be applied to the services using the available tools. Today, a significant part of system administration work specifically involves this process of reviewing the results given by the monitoring system and the subsequent use of administration or optimization tools. Due to the sustained trend toward ever more distributed applications, this process is much more complex in practice than it appears in theory. Each additional component increases the number of possible “adjustments” enabling optimal implementation of the services in terms of availability and performance. To master this complexity, it is necessary to clarify the dependencies between machines, applications, resources and services. This makes it possible to identify the correct points for intervention and to estimate the likely consequences of changes (Figure 2: M-A-R-S model). If such a dependency model is adequately defined, it is possible to significantly optimize the tasks involved in IT operations using a new class of applications referred to here as “Autonomic Elements”.

As a rule, the function of today’s tools is unidirectional, i.e. the tool either informs the administrator about the need for intervention or the administrator initiates appropriate actions in the target system via another tool. Autonomic Elements have the advantage, just like the human administrator himself, of possessing a model of the dependencies. For example, they can use the information from monitoring to determine which possible intervention options are appropriate and which areas are potentially affected. A preliminary selection like that saves the administrator a good part of his day-to-day work, making it possible to achieve a faster response, and the time saved can be put to good use in other areas. In addition, such a rule set enables the definition of standard actions that make manual intervention in acute situations completely unnecessary.

Fig. 2: M-A-R-S Model

4.4 Case Study

A mail server receives emails from applications, saves them in interim storage and dispatches them to the Internet. This process produces a large number of log files that document the server’s processing. Due to the size of the incoming files and the necessity of archiving them, they are automatically transferred to an archive server during the night but remain on the server itself for research purposes in case of user questions.

Should the available space in the log partitions of the server reach a critical value, there is hopefully a monitoring system that informs the administrator.

The administrator then checks which of the logs have been successfully transferred to the archive server and removes them from the mail server using an appropriate tool. In our example, we conduct two “monitoring” events (available space and transferred logs) for one “administrative” event. Because it is based on a consistent pattern, the same action will be necessary as soon as the level of available space again reaches a critical value. As illustrated in Figure 3, the described chain of actions/reactions can be fully automated through an autonomic element. This element effectively bridges the monitoring and administration tools and thus can access the complete monitoring information (the level of available space and the transferred log files), all administrative options (the removal of files), plus complete information about a model that describes the dependencies between servers, services, etc. The autonomic element “knows” the demand to retain as many log files as possible on the server and knows the archiving conditions for these files on the archive server (and the status of the transfer). It is therefore in a position to intelligently correlate the two monitoring events and automatically delete the required number of log files that have already been transferred.

The administrator first becomes involved when something in this chain of actions does not function as defined, for example when an error occurs during archiving or the deletion process fails.

Fig. 3: Automated Problem Solving

As it is well known that in IT maintenance nothing is as constant as change itself, the system administration team gains valuable time by applying the described approach; this gained time can in turn be invested in the optimization of the dependency model and the set of rules. Under ideal conditions, this would lead to a continuous improvement in IT services without demanding a great effort from the administration team.
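A sketch of the rule an autonomic element might apply in this case study, correlating the two monitoring results (free space and the set of archived logs) and selecting only transferred logs for deletion; the file names, sizes and thresholds are made-up example values.

```python
from typing import Dict, List, Set

CRITICAL_FREE_MB = 500
TARGET_FREE_MB = 2000

def plan_cleanup(free_mb: int,
                 logs_oldest_first: Dict[str, int],   # path -> size in MB
                 archived: Set[str]) -> List[str]:
    """Return the log files that may safely be deleted, or [] if none needed."""
    if free_mb >= CRITICAL_FREE_MB:
        return []
    to_delete: List[str] = []
    for path, size in logs_oldest_first.items():
        if free_mb >= TARGET_FREE_MB:
            break
        if path in archived:                  # the archive server holds a copy
            to_delete.append(path)
            free_mb += size
    return to_delete          # if the problem persists, the admin is alerted

# Example with made-up monitoring data.
print(plan_cleanup(
    free_mb=300,
    logs_oldest_first={"mail-01.log": 900, "mail-02.log": 900, "mail-03.log": 900},
    archived={"mail-01.log", "mail-02.log"},
))   # -> ['mail-01.log', 'mail-02.log']
```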

5 Conclusion

In this paper, we have presented the essence of autonomic computing and the development of such systems, giving the reader a feel for the nature of these types of systems. A significant part of system administration work specifically involves the process of reviewing the results given by the monitoring system and the subsequent use of administration or optimization tools. The model described in this paper simplifies and optimizes system administration. It makes it possible to identify the correct points for intervention and to estimate the likely consequences of changes. The case study presented uses an autonomic element which informs the administrator when critical values are met.

References

[1] S. Hariri and M. Parashar, “Autonomic Computing: An Overview”, Springer-Verlag Berlin Heidelberg, pages 247–259, July 2005.
[2] J. O. Kephart and D. M. Chess, “The Vision of Autonomic Computing”, IEEE Computer, Volume 36, Issue 1, pages 41–50, January 2003.
[3] R. Sterritt and D. Bustard, “Towards an Autonomic Computing Environment”, University of Ulster, Northern Ireland.
[4] D. F. Bantz et al., “Autonomic Personal Computing”, IBM Systems Journal, Vol. 42, No. 1, January 2003.
[5] J. P. Bigus et al., “ABLE: A Toolkit for Building Multiagent Autonomic Systems”, IBM Systems Journal, Vol. 41, No. 3, August 2002.


Image Processing


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

A Multi-Clustering Recommender System

Using Collaborative Filtering

Partha Sarathi Chakraborty University Institute of Technology, The University of Burdwan, Burdwan


Abstract

Recommender systems have proved very useful for handling the information overload on the Internet. Many web sites attempt to help users by incorporating a recommender system that provides users with a list of items and/or web pages that are likely to interest them. Content-based filtering and collaborative filtering are usually applied to predict these recommendations. Hybrids of these two approaches have also been proposed in many research works. In this work a clustering approach is proposed to group users as well as items. For generating the prediction score of an item, similarities between the active user and all other users in the same user cluster are calculated considering only items belonging to the same item cluster as the target item. The proposed system was tested on the MovieLens data set, yielding recommendations of high accuracy.

1 Introduction

In many markets, consumers are faced with a wealth of products and information from which they can choose. To alleviate this problem, many web sites attempt to help users by incorporating a recommender system [Resnick and Varian, 1997] that provides users with a list of items and/or web pages that are likely to interest them. Once the user makes her choice, a new list of recommended items is presented.

E-commerce recommender systems can be classified into three categories: content filtering based; collaborative filtering based; and hybrid content filtering and collaborative filtering based [Ansari et al, 2001]. The first produces recommendations for target users according to the similarity between items. The second provides recommendations based on the purchase behaviors (preferences) of other like-minded users.

Clustering, on the other hand, is a method by which large sets of data are grouped into clusters of smaller sets of similar data. It is a useful technique for the discovery of knowledge from a dataset. K-means clustering is one of the simplest and fastest algorithms, and is therefore widely used. It is a non-hierarchical algorithm that starts by defining k points as cluster centres, or centroids, in the input space. The algorithm clusters the objects of a dataset by iterating over the objects, assigning each object to one of the centroids, and moving each centroid towards the centre of a cluster. This process is repeated until some termination criterion is reached. When this criterion is reached, each centroid is located at a cluster centre, and the objects that are assigned to a particular centroid form a cluster. Thus, the number of centroids determines the number of possible clusters.

In this paper, we consider a collaborative filtering approach where items and users are clustered separately. Neighbors of an active user are chosen from the user cluster to which the active user belongs. On the other hand, similarity between two users is calculated not over the whole set of items, but only over the items of a particular item cluster.

The rest of the paper is organized as follows. Section 2 provides a brief overview of collaborative filtering. In Section 3, we present related works. Next, we describe the details of our approach in Section 4. We present the experimental evaluation that we employ in order to compare the algorithms and end the paper with conclusions and further research in Sections 5 and 6 respectively.

2 Collaborative Filtering

Collaborative filtering systems are usually based on user-item rating data sets, whose format is shown in Table 1. Ui is the ID of a user involved in the recommender system, and Ij is the ID of an item rated by users. There are two general classes of collaborative filtering algorithms: memory-based methods and model-based methods [Breese et al, 1998]. Memory-based algorithms use all the data collected from all users to make individual predictions, whereas model-based algorithms first construct a statistical model of the users and then use that model to make predictions.

Table 1: User-Item Rating Matrix

One major step of collaborative filtering technologies is to compute the similarity between target user and candidate users so as to offer nearest neighbors to produce high-quality recommendations. Two methods often used for similarity computation are: cosine-based and correlation-based [Sarwar, 2001].

Vector Cosine method computes user similarity as the scalar product of rating vectors:

s(a,u) = \frac{\sum_{i \in R(a,u)} r_{a,i}\, r_{u,i}}{\sqrt{\sum_{i \in R(a,u)} r_{a,i}^2}\;\sqrt{\sum_{i \in R(a,u)} r_{u,i}^2}}    (1)

in which s(a,u) is the similarity degree between user a and user u, R(a,u) is the set of items rated by both user a and user u, and r_{x,i} is the rating that user x gives to item i.

Pearson Correlation method is similar to Vector Cosine method, but before the scalar product between two vectors is computed, ratings are normalized as the difference between real ratings and average rating of user:

s(a,u) = \frac{\sum_{i \in R(a,u)} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i \in R(a,u)} (r_{a,i} - \bar{r}_a)^2}\;\sqrt{\sum_{i \in R(a,u)} (r_{u,i} - \bar{r}_u)^2}}    (2)

in which \bar{r}_x is the average rating of user x.

Once the nearest neighbors of a target user are obtained, the following formula [Breese et al, 1998] is used for calculating the prediction score:

P_{C_a,j} = \bar{P}_{C_a} + \frac{\sum_{i} r_{a,i}\,(P_{C_i,j} - \bar{P}_{C_i})}{\sum_{i} |r_{a,i}|}    (3)

where r_{a,i} denotes the correlation between the active user C_a and its neighbor C_i who has rated the product P_j, \bar{P}_{C_a} denotes the average rating of customer C_a, and P_{C_i,j} denotes the rating given by customer C_i on product P_j.
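Straightforward Python renderings of formulas (1)-(3) as described above; the user-to-ratings data layout and the neighbour handling are assumptions made for the example.

```python
import math
from typing import Dict, Iterable

Ratings = Dict[str, Dict[str, float]]     # user -> {item: rating}

def cosine_sim(r: Ratings, a: str, u: str) -> float:
    """Formula (1): cosine similarity over the co-rated items R(a, u)."""
    common = set(r[a]) & set(r[u])
    if not common:
        return 0.0
    num = sum(r[a][i] * r[u][i] for i in common)
    den = math.sqrt(sum(r[a][i] ** 2 for i in common)) * \
          math.sqrt(sum(r[u][i] ** 2 for i in common))
    return num / den if den else 0.0

def pearson_sim(r: Ratings, a: str, u: str) -> float:
    """Formula (2): ratings are first centred on each user's average."""
    common = set(r[a]) & set(r[u])
    if not common:
        return 0.0
    ra = sum(r[a].values()) / len(r[a])
    ru = sum(r[u].values()) / len(r[u])
    num = sum((r[a][i] - ra) * (r[u][i] - ru) for i in common)
    den = math.sqrt(sum((r[a][i] - ra) ** 2 for i in common)) * \
          math.sqrt(sum((r[u][i] - ru) ** 2 for i in common))
    return num / den if den else 0.0

def predict(r: Ratings, a: str, item: str, neighbours: Iterable[str]) -> float:
    """Formula (3): weighted-sum prediction over the chosen neighbours."""
    ra = sum(r[a].values()) / len(r[a])
    num = den = 0.0
    for u in neighbours:
        if item in r[u]:
            w = pearson_sim(r, a, u)
            ru = sum(r[u].values()) / len(r[u])
            num += w * (r[u][item] - ru)
            den += abs(w)
    return ra + num / den if den else ra
```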

3 Related Works

Two popular model-based algorithms are clustering for collaborative filtering [Kohrs and Merialdo, 1999] [Ungar and Foster] and the aspect models [Hofmann and Puzicha, 1999]. Clustering techniques have been used in different ways in designing recommender systems based on collaborative filtering. [Sarwar et al] first partitions the users using a clustering algorithm and then applies collaborative filtering by considering the whole partition to which a user belongs as that user’s neighborhood. In another paper [Zhang and Chang, 2006] a genetic clustering algorithm is introduced to partition the source data, guaranteeing that the intra-similarity will be high but the inter-similarity will be low, whereas [Yang et al, 2004] uses CURE (Clustering Using Representatives) to transform the original user-product matrix into a new user-cluster-product matrix which is much denser and has far fewer rows than the original one. Another attempt [Zhang et al, 2008] partitions the users, discovers the localized preference in each part, and uses the localized preference of users to select neighbors for prediction instead of using all items. The paper [Khanh Quan et al, 2006] by Truong Khanh Quan, Ishikawa Fuyuki and Honiden Shinichi proposes a method of clustering items so that, inside a cluster, similarity between users does not change significantly. After that, when predicting the rating of a user towards an item, only the ratings of users who have a high similarity degree with that user inside the cluster to which that item belongs are aggregated.

4 Our Approach

Usually, in the collaborative filtering approach to recommender system design, the whole set of items is considered when computing similarity between users. As stated in [Khanh Quan et al, 2006], we also think that this does not give good results, because the number and variety of items offered by an online store is very large. As a result, a set of users may be similar to each other for one type of item but not to the same extent for a different type of item.

So, in our approach, we partition items into several groups using a clustering algorithm. Items which are rated similarly by different users are placed in the same cluster. We also partition users into several groups. Both the user partitioning and the item partitioning have been done using the k-means algorithm. The clustering is done offline and the clustering information is stored in a database. For an active user, the system first determines the cluster to which he belongs. Neighbors of this active user will be chosen from this cluster only. For generating the prediction score of an item for the active user, we first determine the cluster to which the item belongs and consider only those items when calculating similarity between the active user and the other users belonging to the same user cluster as the active user. We then take the first N neighbors and calculate the prediction score for that item using formula (3).

The algorithm for calculating the prediction score is as follows:

1. Apply the clustering algorithm to produce p partitions of users using the training data set. Formally, the data set A is partitioned into A1, A2, ..., Ap, where Ai ∩ Aj = φ for 1 ≤ i, j ≤ p, and A1 ∪ A2 ∪ ... ∪ Ap = A.

2. Apply the clustering algorithm to produce q partitions of items using the training data set. Formally, the data set A is partitioned into B1, B2, ..., Bq, where Bi ∩ Bj = φ for 1 ≤ i, j ≤ q, and B1 ∪ B2 ∪ ... ∪ Bq = A.

3. For a given user, find the cluster to which he/she belongs; suppose it is Am.

4. For calculating the prediction score Ra,j for a customer ca on product pj:

   a. Find the cluster to which the item belongs; suppose it is Bn.

   b. Compute the similarity between the given user and all other users belonging to cluster Am, considering only items belonging to cluster Bn.

   c. Take the first N users with the highest similarity values as neighbors.

   d. Calculate the prediction score using formula (3).
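A hedged sketch of the procedure above, assuming the user and item cluster assignments were produced offline by k-means and reusing pearson_sim from the earlier sketch; all names and the data layout are illustrative.

```python
from typing import Dict, List, Set

def prediction_score(r, a: str, item: str,
                     user_cluster: Dict[str, int],
                     item_cluster: Dict[str, int],
                     N: int = 30) -> float:
    # Step 3: candidate neighbours come from the active user's own cluster (Am).
    candidates = [u for u in r if u != a and user_cluster[u] == user_cluster[a]]

    # Step 4a/4b: similarity is computed over items of the target item's cluster (Bn) only.
    bn: Set[str] = {i for i, c in item_cluster.items() if c == item_cluster[item]}
    restricted = {u: {i: x for i, x in r[u].items() if i in bn} for u in r}

    # Step 4c: keep the N most similar users that actually rated the target item.
    scored = sorted(((pearson_sim(restricted, a, u), u)
                     for u in candidates if item in r[u]), reverse=True)[:N]

    # Step 4d: combine the neighbours' ratings with formula (3).
    r_bar_a = sum(r[a].values()) / len(r[a])
    num = den = 0.0
    for w, u in scored:
        r_bar_u = sum(r[u].values()) / len(r[u])
        num += w * (r[u][item] - r_bar_u)
        den += abs(w)
    return r_bar_a + num / den if den else r_bar_a
```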

5 Experimental Evaluation

In this section, we report results of an experimental evaluation of our proposed techniques. We describe the data set used, the experimental methodology, as well as the performance improvement compared with traditional techniques.

5.1 Data Set

We performed experiments on a subset of movie rating data collected from the MovieLens web-based recommender (movielens.umn.edu). MovieLens is a web-based research recommender system that debuted in February 1997. The data set used contains 100,000 ratings from 943 users on 1682 movies (items), with each user rating at least 20 items. The item sparsity is easily computed as 0.9369, which is defined as

\text{sparsity} = 1 - \frac{\text{number of nonzero entries}}{\text{number of users} \times \text{number of items}} = 1 - \frac{100000}{943 \times 1682} \approx 0.9369    (4)

The ratings in the MovieLens data are integers ranging from 1 to 5, entered by users. We selected 80% of the rating data set as the training set and the remaining 20% as the test data.

5.2 Evaluation Metric

Mean Absolute Error (MAE) [Herlocker et al, 2004] is the most commonly applied evaluation metric for collaborative filtering; it evaluates the accuracy of a system by comparing the numerical recommendation scores against the actual user ratings for the user-item pairs in the test dataset. In our experiment, we use MAE as our evaluation metric. We assume p1, p2, ..., pM is the set of predicted ratings for the given active users and q1, q2, ..., qM is the set of actual ratings of the active users; the MAE metric is then given by:

MAE = \frac{1}{M} \sum_{i=1}^{M} |p_i - q_i|    (5)
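A direct rendering of formula (5), with made-up predicted and actual ratings as a usage example.

```python
from typing import Sequence

def mean_absolute_error(p: Sequence[float], q: Sequence[float]) -> float:
    """Formula (5): average absolute deviation of predictions p from ratings q."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / len(p)

print(mean_absolute_error([4.5, 3.0, 2.0], [4, 3, 3]))   # -> 0.5
```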

5.3 Experimental Results

Table 2 shows our experimental results. It can be observed that, although for neighborhood sizes 10 and 20 the result of our approach is not satisfactory, for neighborhood size 30 our approach shows a better result than collaborative filtering without clustering and collaborative filtering with only user clustering. The same result can also be seen in the graph shown in Figure 1, where C identifies clustering of users and IC identifies clustering of items.

Table 2 – Result of Multi Clustering (MAE)

Neighborhood Size   Without Clustering   User Clusters=10, Item Clusters=0   User Clusters=10, Item Clusters=10
10                  0.7746               0.7874                              0.9488
20                  1.0194               0.7821                              0.8214
30                  1.0426               0.8033                              0.8024

[Figure 1 plots MAE (y-axis, 0–1.2) against neighbourhood size (10, 20, 30) for the three cases C=0, C=10, and C=10 IC=10.]

Fig. 1: Comparing Result of Multi Clustering with Other Cases

6 Conclusions and Future Work

In our approach we have shown that, when generating the prediction score of an item for the active user, measuring similarity between the active user and the other users of the same user cluster over only the items that belong to the same item cluster as the target item (the item for which the score is calculated) can produce better results than collaborative filtering without clustering or collaborative filtering with only user clustering. We have, however, used the basic k-means clustering algorithm for clustering both users and items. In future studies we will try to improve the quality of prediction by investigating and using more sophisticated clustering algorithms.

References

[1] [Ansari et al, 2001] S Ansari, R Kohavi, L Mason, Z Zheng, Integrating Ecommerce and data mining: architecture and challenges. In: Proceedings The 2001 IEEE International Conference on Data Mining. California, USA: IEEE Computer Society Press, 2001, pages 27-34.

[2] [Breese et al, 1998] J. Breese, D. Heckerman, C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann, 1998, pages 43-52.

[3] [Herlocker et al,2004] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22 (2004), ACM Press, pages 5-53.

[4] [Hofmann and Puzicha, 1999] T. Hofmann and J. Puzicha, Latent Class Models for Collaborative Filtering. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999, pages 688-693.

[5] [Khanh Quan et al, 2006] Truong Khanh Quan, Ishikawa Fuyuki and Honiden Shinichi, Improving Accuracy of Recommender System by Clustering Items Based on Stability of User Similarity, International Conference on Computational Intelligence for Modelling Control and Automation, and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06),2006

[6] [Kohrs and Merialdo,1999 ] A. Kohrs and B. Merialdo. Clustering for Collaborative Filtering Applications. In Proceedings of CIMCA'99. IOS Press, 1999.

[7] [Resnick and Varian, 1997] P. Resnick and H. R. Varian. Recommender systems. Special issue of Communications of the ACM, pages 56–58, March 1997.

[8] [Sarwar, 2001] B. Sarwar, G. Karypis, J. Riedl. Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international World Wide Web conference. 2001, pages 285-295.

[9] [Sarwar et al] Badrul M. Sarwar, George Karypis, Joseph Konstan, and John Riedl, Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering

[10] [ Ungar and Foster] L. H. Ungar and D. P. Foster. Clustering Methods for Collaborative Filtering. In Proc. Workshop on Recommendation Systems at the 15th National Conf. on Artificial Intelligence. Menlo Park, CA: AAAI Press.

[11] [Yang et al, 2004] Wujian Yang, Zebing Wang and Mingyu You, An Improved Collaborative Filtering Method for Recommendations’ Generation, In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2004

[12] [Zhang and Chang, 2006] Feng Zhang, Hui-you Chang, A Collaborative Filtering Algorithm Employing Genetic Clustering to Ameliorate the Scalability Issue, In Proceedings of IEEE International Conference on e-Business Engineering (ICEBE'06), 2006

[13] [Zhang et al, 2008] Liang Zhang, Bo Xiao, Jun Guo, Chen Zhu, A Scalable Collaborative Filtering Algorithm Based On Localized Preference, Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, 2008


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Digital Video Broadcasting in an Urban Environment: An Experimental Study

S. Vijaya Bhaskara Rao, Sri Venkateswara University, Tirupati – 517 502; K.S. Ravi, K.L. College of Engg., Vijayawada; N.V.K. Ramesh, K.L. College of Engg., Vijayawada

J.T. Ong, G. Shanmugam and Yan Hong, Nanyang Technological University, Singapore

Abstract

Singapore is the first country to have an island-wide DVB-T Single Frequency Network (SFN). Plans are being made to extend DTV services for portable and fixed reception. However, for the planning of future fixed and portable services, the reception of signals and other QoS (Quality of Service) parameters have to be defined. An initial measurement campaign was undertaken in 47 sectors to study digital TV coverage in Singapore. The measurement set-up consists of devices such as a spectrum analyzer and an EFA receiver interfaced to a laptop computer running a LabVIEW program. An on-board GPS receiver is used to locate each measurement point. Using a Geographic Information System (GIS) database, 100 m x 100 m pixels are identified along all the routes and the average field strength is estimated over each pixel. Measurement values are compared with proprietary prediction software. A detailed clutter database has been developed and used with the software. Field strengths at different percentages of probability were estimated to establish 'good' and 'acceptable' coverage. In general, it is found that the signals in the majority of sectors exhibit a log-normal behavior with a standard deviation of about 6 dB.

1 Introduction

Singapore is the first country to have island wide DVB-T Single Frequency Network (SFN). Plans are being made to extend DTV services for portable and fixed reception. However for the planning of future fixed and portable services, the reception of signals and other QoS (Quality of Service) parameters have to be defined. Hence a series of measurements were conducted with this in mind. The initial results of the analyses for mobile DTV reception are presented in this paper.

The main objective of this experimental campaign is to characterize the behavior of fixed DTV signals in different environments in Singapore. Initial measurements are made with antennas at 2 m above the ground because it is logistically more difficult to conduct measurements at 10 m; also, more measurements could be made quickly from a moving vehicle. These measurements are made to tune the prediction model and also to improve the measurement procedures and analytical techniques.


Earlier studies have reported that analogue TV propagation models assume log-normal signal spatial variation. The standard deviation parameter for the distributions varies from 9 to 15 dB [ITU-recommendation PN.370-5]. Broadband signals have been measured in countries like the UK, Sweden and France under the [VALIDATE, 1998] programme, and the shape of their statistical variation has been reported to be log-normal. The standard deviations observed for such signals are small, typically 2.5 to 3 dB depending on the environment surrounding the receiver location. However, all these measurements are based on reception from a single transmitter. ITU-R P.370 gives a standard deviation for wide-band signals of 5.5 dB. These previous measurements are limited; the methods as well as the procedures for measuring and estimating the standard deviation of the measured signals have not been clearly detailed.

2 Experimental Set-up

The measurement campaign was carried out using the equipment set-up shown in Fig. 1. The video camera, GPS receiver antenna and an omni-directional dual-polarized antenna were fixed on the top of the vehicle. The video camera provides a view of the surrounding area, i.e. the land-use or clutter information around the measurement location. A GPS receiver was used to synchronize location with the measured parameters. The omni-directional antenna was connected to one or more instruments for measuring Quality of Service parameters (spectrum analyzer, R&S EFA TV test receiver, Dibcom DV3000 evaluation kit, etc.). The measuring instruments were interfaced to a laptop computer with a LabVIEW data-logging program preinstalled. Only measurements using the spectrum analyzer are discussed in this paper. The notebook was programmed to sweep through the 8 MHz DTV spectrum and to store the data once every second. It sweeps through the 8 MHz spectrum eight times in one second. The notebook records 401 field-strength points for one 8 MHz sweep. The program also computes the minimum, maximum and average of the 401 points. This provides a record of the individual TV channel spectrum. The information on the road view, location and field strength was monitored in real time and recorded.
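The per-sweep reduction described above can be illustrated with a short sketch; the 401-sample array and the dB averaging are assumptions made only for illustration:

```python
import numpy as np

def reduce_sweep(sweep_dbuv):
    """sweep_dbuv: the 401 field-strength samples (dBuV/m) of one 8 MHz sweep."""
    s = np.asarray(sweep_dbuv, dtype=float)
    # minimum, maximum and average kept once per second, as described above
    # (averaging is done directly in dB here for simplicity)
    return {"min": float(s.min()), "max": float(s.max()), "avg": float(s.mean())}

print(reduce_sweep(60 + 3 * np.random.randn(401)))   # one synthetic sweep
```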

Measurements were carried out in 47 sectors covering the entire island for Channel 37 (Mobile TV) at 602 MHz. The DVB-T standard is COFDM-2K and the modulation is QPSK. The main transmitter is located in Bukit Batok and the transmitting antenna (horizontally polarized) is at an elevation of 214 meters above MSL. There are 10 repeaters (vertically polarized) connected in a single frequency network (SFN). In our measurements the field strength recorded is inclusive of SFN gain. More details about the clutter data-base, measurement pixels and software predictions are presented in the next section.

Fig. 1: Block diagram of the experimental set-up.


3 Methodology

Details about the sectors are given in Fig. 2. The sectors are chosen to provide a good representation of the different land-use or clutter environments of Singapore. The analysis of the data has been carried out in two ways. First, the recorded field-strength values along each sector are processed separately to study the log-normal distribution of the slow fading. The fast-fading component is removed by averaging the received signals over a period or a specified window. The window widths are selected as per Lee's method [Lee, 1985]. A log-normal fit was made for each sector and the coverage at 50, 70, 90, 95 and 99% probability was obtained. The distributions of the raw measured data, with both the slow and fast fading, are also analyzed. Along two sectors, which are approximately radial with reference to the transmitter, the measured field strength was compared with the Hata [Hata, 1980] and Walfisch-Bertoni [Walfisch and Bertoni, 1988] models. In the second method the measured field strengths are compared with the software predictions over pixels of 100 m x 100 m.
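A minimal sketch of the first analysis path, assuming the field strength is available as a sequence of dB samples; the window length and synthetic data are illustrative choices, not the values used in the campaign:

```python
import numpy as np

def slow_fading(field_db, window=40):
    """Running average of the dB samples; the window length is an illustrative choice."""
    kernel = np.ones(window) / window
    return np.convolve(field_db, kernel, mode="valid")

def lognormal_fit(field_db):
    """Log-normal parameters = mean and standard deviation of the dB samples."""
    return float(np.mean(field_db)), float(np.std(field_db))

raw = 60 + 6 * np.random.randn(2000)                 # synthetic dBuV/m samples along a route
mean_db, sigma_db = lognormal_fit(slow_fading(raw))
```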

Using the ArcGIS environment, a hydrologically correct Digital Elevation Model (DEM) at 100 m resolution was created from the 10 m contour data. The clutter factor grid map was created with a 100 m grid interval, using the newly modelled PR-Clutter information. These datasets were then converted to the prediction software's working file format for the signal-strength prediction, and converted back to grid files, with the predicted signal-strength values, for further analysis in the GIS environment. The measured field strength is averaged over 100 m x 100 m pixels and then compared to the software predictions. In an SFN environment, the software predicts and considers the maximum signal strength available over the grid, i.e. it could be from the main transmitter or from a repeater, and it also indicates which transmitter that maximum comes from.

Fig. 2: Details of the Sectors Where the Measurements are Carried Out

The field measurements were taken using our measurement vehicle, integrated with GPS. Upon completion, these log files were plotted in the GIS environment for analysis. Since the predicted-value maps are 100 m grid raster maps, the GPS measurement files were also 'gridded' by taking the average value within the 100 m grids, as shown in Fig. 3.
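The gridding step can be sketched as follows, assuming the GPS coordinates have already been projected to metres (e.g. UTM); the function name and data layout are illustrative:

```python
import numpy as np
from collections import defaultdict

def grid_average(x_m, y_m, field_db, cell=100.0):
    """Average field strength per 100 m x 100 m cell; inputs are metre coordinates."""
    cells = defaultdict(list)
    for x, y, f in zip(x_m, y_m, field_db):
        cells[(int(x // cell), int(y // cell))].append(f)
    return {key: float(np.mean(vals)) for key, vals in cells.items()}
```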

Now that both the predicted signal-strength and the measured signal-strength maps are in the same 100 m topographical grid files, the difference between the measured and the software-predicted field strengths can be analyzed using GIS.


The proprietary prediction software is merely a shell; terrain and clutter data are required for reliable predictions. To compare the measured field strength with the software prediction we have developed a clutter model, "Singapore", based on the plot ratios designated by the Urban Redevelopment Authority of Singapore (URA).

Fig. 3: 100 m pixels with the measurement points

Considering the true land use and the plot ratio, we classified the whole land area into 7 categories, which are used in our prediction model. This clutter model is continuously being fine-tuned with additional information, such as the openness of the area; the model will be adjusted to best suit signal-strength prediction in Singapore.

4 Results

As mentioned earlier, the measured field-strength values are processed to study the log-normal distribution along all the sectors. It is found that almost all sectors show a log-normal distribution. A typical log-normal fit is shown in Fig. 4. A running-average method is used to fit the measured values to a log-normal distribution. From the figure it can be seen that the curve best fits a log-normal with a standard deviation of 6.62 dB. The Cumulative Distribution Function (CDF) of the raw data, with both fast-fading and slow-fading components, is also shown in the figure. There is little difference between the two distributions. This could be due to the sampling rate and the speed of the vehicle: in one second a vehicle travelling at 40 km/h covers approximately 11 m, corresponding to 22λ at 600 MHz.

Fig. 4: Log-normal fit for sector-4


The route distance for sector 4 is 5.9 km. There are two repeater stations near this route. The mean value observed is 60.5 dBuV/m and the standard deviation is 6.6 dB. The log-normal distributions and standard deviations for all the sectors were fitted in the same way. Figure 5 shows the mean and standard deviation observed in all 47 sectors.

Fig. 5: Mean and Standard Deviations Observed for All the Sectors

It can be seen from the figure that the sectors with IDs 8, 27 and 38 have high mean and standard deviation values. The obvious reason for this is that sector 8 is very close to the repeater at Alexandra Point and sector 27 is very close to the main transmitter. Sector 38 (Clementi Ave 6) is partially open for about 1.2 km and the remaining 800 m runs through HDB estates; hence the field strength drops from 80 dBuV/m to about 55 dBuV/m, resulting in a high standard deviation. The variation of the standard deviation in each sector from the mean standard deviation over all sectors (5.36 dB) is shown in Fig. 6.

Fig. 6: Variation of the standard deviation in each sector from the mean.

It can be seen from the figure that sector IDs 8, 12, 15, 16 and 27 have variations from the mean of about 2.5 dB. However, sector 38 has a large deviation from the mean of 5.4 dB. The coverage at 95 and 99% probability is obtained from the log-normal fit and the standard deviation.


Figure 7 shows the field strength computed at 95 and 99% probability at the 2 m level, along with the minimum recommended value (for fixed reception) according to Chester 97 [6]. In the measurements it is observed that the minimum field strength required for reception of a good picture is about 40 dBuV/m (fixed location). This threshold value is also shown in the figure. There are a few sectors in which the 99% values are within 2 or 3 dB of the minimum threshold. In this study it was observed that there is no height gain from 2 m to 10 m at locations where there is no line-of-sight to the transmitter. However, where there is fairly open ground and path clearance to the transmitter, a gain of about 8-9 dB is observed. With hindsight this is to be expected; hence extrapolation between 2 m and 10 m measurements should be carried out with great care.

Fig. 7: Coverage at 95 and 99% of probability observed for all the sectors
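The coverage values follow directly from the log-normal fit; the following sketch shows the computation, using the sector-4 mean and standard deviation quoted earlier only as an example:

```python
from scipy.stats import norm

def field_exceeded(mean_db, sigma_db, probability):
    """Field strength exceeded at the given probability under a log-normal (normal-in-dB) fit."""
    return mean_db + norm.ppf(1.0 - probability) * sigma_db

print(round(field_exceeded(60.5, 6.6, 0.95), 1))   # about 49.6 dBuV/m
print(round(field_exceeded(60.5, 6.6, 0.99), 1))   # about 45.1 dBuV/m
```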

5 Summary of Observations

Measurements carried out in all 47 sectors show that the slow fading can be modeled with a log-normal distribution. The mean standard deviation observed is 5.36 dB. In some sectors the observed standard deviation is higher, about 10 dB. It is observed that whenever the vehicle passes through an HDB environment, large variations in signal strength occur, resulting in high standard deviations. Standard deviations of about 14 dB were obtained inside an 'HDB' environment (building heights of about 50 m). This confirms the need to characterize the variation of the standard deviation with reference to the exact clutter. A standard deviation of 5.5 dB may hold for an ideal situation and a medium-city environment; hence, for further DTV planning in different clutter types, an optimum value of the standard deviation has to be established. The coverage at 95 and 99% probability shows that mobile DTV reception in Singapore is good.

Comparison with the empirical models shows large deviations, as these models do not employ local clutter data. The Walfisch-Bertoni model assumes a uniform building height for prediction; in the Singapore HDB environment the spacings between the buildings are not uniform, so large variations and errors occur in the predictions. Comparison with the proprietary software predictions is reasonably good.

This initial measurement campaign highlighted the many challenging problems encountered in the measurement, prediction and analysis of quality-of-service parameters for digital TV reception in built-up areas in Singapore. Future work will therefore concentrate on very careful, detailed, smaller-scale spatial measurements in HDB areas using transmissions from one individual transmitter at a time.


References

[1] [CEPT, 1997] The Chester 1997 Multilateral Coordination Agreement relating to technical criteria, coordination principles and procedures for the introduction of Terrestrial Digital Video Broadcasting (DVB-T); ITU-R Recommendation PN.370-5.

[2] [Lee, W.C.Y., 1985] Estimation of local average power of a mobile radio signal, IEEE Trans. Veh. Technol., Vol. VT-34, No. 1, pages 2-27.

[3] [M. Hata, 1980] Empirical Formula for Propagation Loss in Land Mobile Radio Services, IEEE Trans. on Veh. Technol., Vol. VT-29, no. 3, pages 317-325.

[4] [VALIDATE, 1998] Final project report.

[5] [J. Walfisch and H.L. Bertoni, 1988] A theoretical model of UHF propagation in urban environments, IEEE Trans. Antennas & Propagation, Vol. 36, No. 12, pages 1788-1796.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Gray-level Morphological Filters for Image Segmentation and Sharpening Edges

G. Anjan Babu, Dept. of Computer Science, S.V. University, Tirupati; Santhaiah, Dept. of Computer Science, ACET, Allagadda, Kurnool

Abstract

The aim of the present study is to propose a new method for tracking the edges of images. The present study involves edge detection and morphological operations for sharpening edges. The detection criterion expresses the fact that important edges should not be missed. It is of paramount importance to preserve, uncover or detect the geometric structure of image objects. Thus morphological filters, which are more suitable than linear filters for shape analysis, play a major role in geometry-based enhancement and detection.

A new method for image segmentation and sharpening edges based on morphological transformations is proposed. The algorithm uses the morphological transformations dilation and erosion. A gradient-determined gray-level morphological procedure for edge increase and decrease is presented. First, the maximum gradient in the local neighborhood forms the contribution to the erosion of the center pixel of that neighborhood. The gradients of the transformed image are then used as contributions to a subsequent dilation of the eroded image. The edge-sharpening algorithm is applied to various sample images. The proposed algorithm segments the image while preserving important edges.

Keywords: Dilation, Erosion, Peak, Valley, Edge, Toggle contrast.

1 Introduction

Mathematical morphology stresses the role of "shape" in image pre-processing, segmentation and object description. Morphology is usually divided into binary mathematical morphology, which operates on binary images, and gray-level mathematical morphology, which operates on gray-level images.

The two fundamental operations are dilation and erosion. Dilation expands the object to the closest pixels of the neighborhood; it combines two sets using vector addition:

X ⊕ B = { p ∈ E² : p = x + b, x ∈ X and b ∈ B }

where X is the binary image and B is the structuring element.

Erosion shrinks the object. Erosion Θ combines two sets using vector subtraction of set elements and is the dual operation of dilation:

X Θ B = { p ∈ E² : p + b ∈ X for every b ∈ B }


where X is the binary image and B is the structuring element.

Morphological operators can be extended from binary to gray-level images by using set representations of signals and transforming these input sets by means of morphological set operators. Thus, consider an image signal f(x) defined on the continuous or discrete plane ID = R² or Z² and assuming values in R̄ = R ∪ {−∞, +∞}. Thresholding f at all amplitude levels v produces an ensemble of binary images represented by the threshold sets

Θ_v(f) ≡ { x ∈ ID : f(x) ≥ v },   −∞ < v < +∞.

The image can be exactly reconstructed from all its threshold sets, since

f(x) = sup{ v ∈ R : x ∈ Θ_v(f) },

where "sup" denotes supremum. Transforming each threshold set of the input signal f by a set operator Ψ and viewing the transformed sets as threshold sets of a new image creates a flat image operator, whose output signal is

Ψ(f)(x) = sup{ v ∈ R : x ∈ Ψ[Θ_v(f)] }.

For example, if Ψ is the set dilation or erosion by B, the above procedure creates the two most elementary morphological image operators, the flat dilation and erosion of f(x) by a set B:

(f ⊕ B)(x) ≡ ∨_{y∈B} f(x − y),   (f Θ B)(x) ≡ ∧_{y∈B} f(x + y),

where '∨' denotes supremum (or maximum for finite B) and '∧' denotes infimum (or minimum for finite B). Flat erosion (dilation) of a function f by a small convex set B reduces (increases) the peaks (valleys) and enlarges the minima (maxima) of the function. The flat opening f ∘ B = (f Θ B) ⊕ B of f by B smooths the graph of f from below by cutting down its peaks, whereas the closing f • B = (f ⊕ B) Θ B smooths it from above by filling up its valleys. The most general translation-invariant morphological dilation and erosion of a gray-level image signal f(x) by another signal g are:

(f ⊕ g)(x) ≡ ∨_y { f(x − y) + g(y) },   (f Θ g)(x) ≡ ∧_y { f(x + y) − g(y) }.

Note that signal dilation is a nonlinear convolution in which the sum-of-products of the standard linear convolution is replaced by a max-of-sums. Dilations and erosions can be combined in many ways to create more complex morphological operators that can solve a broad variety of problems in image analysis and nonlinear filtering. Their versatility is further strengthened by a theory that represents a broad class of nonlinear and linear operators as a minimal combination of erosions and dilations. Here we summarize the main results of this theory, restricting our discussion to discrete 2-D image signals.
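As an illustration of the flat operators defined above, the following sketch uses SciPy's grayscale morphology; the 3x3 square used in place of a small disk-like B and the random test image are assumptions made only for illustration:

```python
import numpy as np
from scipy import ndimage

f = np.random.randint(0, 256, (64, 64)).astype(float)  # stand-in gray-level image
size = (3, 3)                                           # small flat structuring element B

dilation = ndimage.grey_dilation(f, size=size)          # f dilated by B: local maximum over B
erosion = ndimage.grey_erosion(f, size=size)            # f eroded by B: local minimum over B
opening = ndimage.grey_opening(f, size=size)            # erosion then dilation: cuts down peaks
closing = ndimage.grey_closing(f, size=size)            # dilation then erosion: fills up valleys
```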

Any translation-invariant set operator Ψ is uniquely characterized by its kernel,

ker(Ψ) ≡ { X ⊆ Z² : 0 ∈ Ψ(X) }.

The kernel representation requires an infinite number of erosions or dilations. A more efficient representation uses only a substructure of the kernel, its basis Bas(Ψ), defined as the collection of kernel elements that are minimal with respect to the partial ordering ⊆. If Ψ is also increasing and upper semi-continuous, then Ψ has a nonempty basis and can be represented exactly as a union of erosions by its basis sets:

Ψ(X) = ∪_{A ∈ Bas(Ψ)} (X Θ A).

The morphological basis representation has also been extended to gray-level signal operators that are translation invariant and commute with thresholding.


2 Morphological Peak/Valley Feature Detection

Residuals between openings or closings and the original image offer an intuitively simple and mathematically formal way for peak or valley detection. Specifically, subtracting from an input image f its opening by a compact convex set B yields an output consisting of the image peaks whose support cannot contain B. This is the top-hat transformation,

Peak(f) = f − (f ∘ B),

which has found numerous applications in geometric feature detection. It can detect bright blobs, i.e. regions with significantly brighter intensities relative to the surroundings. The shape of a detected peak's support is controlled by the shape of B, whereas the scale of the peak is controlled by the size of B. Similarly, to detect dark blobs, modeled as image intensity valleys, we can use the valley detector

Valley(f) = (f • B) − f.

The morphological peak/valley detectors are simple, efficient, and have some advantages over curvature-based approaches. Their applicability in situations in which the peaks or valleys are not clearly separated from their surroundings is further strengthened by generalizing them in the following way: the conventional opening is replaced by a general lattice opening, such as an area opening or an opening by reconstruction. This generalization allows a more effective estimation of the image background around the peak and hence a better detection of the peak.
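A minimal sketch of the peak and valley detectors of this section; the 5x5 structuring element and the threshold are illustrative choices, not values prescribed by the paper:

```python
import numpy as np
from scipy import ndimage

def peak(f, size=(5, 5)):
    return f - ndimage.grey_opening(f, size=size)     # Peak(f) = f - (f opened by B)

def valley(f, size=(5, 5)):
    return ndimage.grey_closing(f, size=size) - f     # Valley(f) = (f closed by B) - f

f = np.random.randint(0, 256, (64, 64)).astype(float)
bright_blobs = peak(f) > 40                           # illustrative threshold
dark_blobs = valley(f) > 40
```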

3 Edge or Contrast Enhancement

3.1 Morphological Gradients

Consider the difference between the flat dilation and erosion of an image f by a symmetric disk-like set B containing the origin, whose diameter diam(B) is very small:

Edge(f) = [(f ⊕ B) − (f Θ B)] / diam(B).

If f is binary, Edge(f) extracts its boundary. If f is gray-level, the above residual enhances its edges by yielding an approximation to ||∇f||, which is obtained in the limit as diam(B) → 0. Further, thresholding this morphological gradient leads to binary edge detection. The symmetric morphological gradient is the average of two asymmetric ones: the erosion gradient f − (f Θ B) and the dilation gradient (f ⊕ B) − f. The symmetric or asymmetric morphological edge-enhancing gradients can be made more robust for edge detection by first smoothing the input image with a linear blur. These hybrid edge-detection schemes, which largely contain morphological gradients, are computationally more efficient and perform comparably or in some cases better than several conventional schemes based only on linear filters.
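The morphological gradient and its thresholding can be sketched as follows; the structuring-element size and threshold are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def morphological_edge(f, size=(3, 3), threshold=30):
    """Edge(f) as dilation minus erosion, plus a thresholded binary edge map."""
    grad = ndimage.grey_dilation(f, size=size) - ndimage.grey_erosion(f, size=size)
    return grad, grad > threshold

f = np.random.randint(0, 256, (64, 64)).astype(float)
gradient, edges = morphological_edge(f)   # ndimage.morphological_gradient computes the same residual
```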

3.2 Toggle Contrast Filter

Consider a gray-level image f[x] and a small symmetric disk-like structuring element B containing the origin. The following discrete nonlinear filter can enhance the local contrast of f by sharpening its edges:

ψ(f)[x] = (f ⊕ B)[x]   if (f ⊕ B)[x] − f[x] ≤ f[x] − (f Θ B)[x],

ψ(f)[x] = (f Θ B)[x]   if (f ⊕ B)[x] − f[x] > f[x] − (f Θ B)[x].

At each pixel x, the output value of this filter toggles between the value of the dilation of f by B at x and the value of its erosion by B, according to which is closer to the input value f[x]. The toggle filter is usually applied not just once but iteratively. The more iterations, the more contrast enhancement. Further, the iterations converge to a limit (fixed point) reached after a finite number of iterations.
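A sketch of the toggle contrast filter as defined above, iterated to a fixed point; the structuring-element size and iteration cap are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def toggle_contrast(f, size=(3, 3), max_iter=20):
    """Replace each pixel by the dilation or erosion value, whichever is closer; iterate."""
    g = f.astype(float).copy()
    for _ in range(max_iter):
        dil = ndimage.grey_dilation(g, size=size)
        ero = ndimage.grey_erosion(g, size=size)
        out = np.where(dil - g <= g - ero, dil, ero)
        if np.array_equal(out, g):                    # fixed point reached
            break
        g = out
    return g
```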

4 Experimental Results

We now turn to several experiments made with the algorithm introduced above. For all tests, this study uses an 8-neighborhood system of order 1. Two example images are used: a Lincoln image (64x64) and a Mona Lisa image (64x64). The resulting figures are shown below.

Toggle Contrast Enhancement: Original Mona Lisa; After Erosion; After Dilation; After Toggle Contrast

Feature Detection Based on Peaks: Original Lincoln Image; After Opening; After Closing; After Peak

Feature Detection Based on Valleys: Original Lincoln; After Closing; After Opening; After Valley

Edge Detection Based on Morphological Gradient: Original Lincoln; After Dilation; After Erosion; After Gradient


5 Conclusion

The present study views image processing as a collection of techniques that improve the quality of a given image in some sense; the techniques developed are mainly problem oriented. In this paper a morphological approach is taken: the edges in the images are thickly marked and are better visible than with primitive operations. The rank-filter algorithm described in the present study has the potential to generate new concepts in the design of constrained filters. In this morphological scheme, dilation is performed if the central value of the kernel is less than 'n', and erosion is performed if it is greater than 'n'. These two are contradictory transformations, and the resulting images require an in-depth study.

A new algorithm for image segmentation and sharpening edges has been implemented using morphological transformations. The edge-sharpening operator illustrates that it can be useful to consider edges as two-dimensional surfaces. This allows the combination of gradient direction and magnitude information. Edge sharpening is useful for the extraction of phase regions, though it does not have much effect when applied to diagonal edges. Sharp edges have been detected by this algorithm. The algorithm has been tested on various images and the results verified.

References

[1] [H.J.A.M. Heijmans] Morphological Image Operators (Academic, Boston, 1994).
[2] [H.P. Kramer and J.B. Buckner] "Iterations of a nonlinear transformation for enhancement of digital images", Pattern Recognition, 7, 53-58 (1975).
[3] [P. Maragos and R.W. Schafer] "Morphological Filters. Part I: Their set-theoretic analysis and relations to linear shift-invariant filters; Part II: Their relations to median, order-statistic and stack filters", IEEE Trans. Acoust. Speech Signal Process., 35, 1153-1184 (1987).
[4] [F. Meyer] "Contrast Feature Extraction", in special issue of Practical Metallography, J.L. Chermant, Ed. (Riederer-Verlag, Stuttgart, 1978), pp. 374-380.
[5] [S. Osher and L.I. Rudin] "Feature-oriented image enhancement using shock filters", SIAM J. Numer. Anal., 27, 919-940 (1990).
[6] [P. Salembier] "Adaptive rank order based filters", Signal Process., 27, 1-25 (1992).
[7] [J. Serra, Ed.] Image Analysis and Mathematical Morphology (Academic, New York, 1982).
[8] [J. Serra, Ed.] Image Analysis and Mathematical Morphology, Vol. 2: Theoretical Advances (Academic, New York, 1988).


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Watermarking for Enhancing Security of Image Authentication Systems

S. Balaji, B. Mouleswara Rao and N. Praveena, K.L. College of Engineering, Green Fields, Vaddeswaram, Guntur – 522502

Abstract

Digital watermarking techniques can be used to embed proprietary information, such as a company logo, in the host data to protect the intellectual property rights of that data. They are also used for multimedia data authentication. Encryption can be applied to biometric templates for increasing security; the templates (which can reside in (i) a central database, (ii) a token such as a smart card, or (iii) a biometric-enabled device such as a cellular phone with a fingerprint sensor) can be encrypted after enrolment. Then, during authentication, these encrypted templates can be decrypted and used for generating the matching result with the biometric data obtained online. As a result, encrypted templates are secured, since they cannot be utilized or modified without decrypting them with the correct key, which is typically secret. However, one problem associated with this system is that encryption does not provide security once the data is decrypted. Namely, if there is a possibility that the decrypted data can be intercepted, encryption does not address the overall security of the biometric data. On the other hand, since watermarking involves embedding information into the host data itself (e.g., no header-type data is involved), it can provide security even after decryption. The watermark, which resides in the biometric data itself and is not related to encryption-decryption operations, provides another line of defense against illegal utilization of the biometric data. For example, it can provide a tracking mechanism for identifying the origin of the biometric data (e.g., FBI). Also, searching for the correct decoded watermark information during authentication can render the modification of the data by a pirate useless, assuming that the watermark embedding-decoding system is secure. Furthermore, encryption can be applied to the watermarked data (but the converse operation, namely applying watermarking to encrypted data, is not logical, as encryption destroys the signal characteristics, such as redundancy, that are typically used during watermarking), combining the advantages of watermarking and encryption into a single system. In this paper we address all the above issues and explore the possibility of utilizing watermarking techniques for enhancing security of image authentication systems.


1 Introduction

While biometric techniques have inherent advantages over traditional personal identification techniques, the problem of ensuring the security and integrity of the biometric data is critical. For example, if a person’s biometric data (e.g., her fingerprint image) is stolen, it is not possible to replace it, unlike replacing a stolen credit card, ID, or password. It is pointed out that a biometrics-based verification system works properly only if the verifier system can guarantee that the biometric data came from the legitimate person at the time of enrolment. Furthermore, while biometric data provide uniqueness, they do not provide secrecy. For example, a person leaves fingerprints on every surface she touches and face images can be surreptitiously observed anywhere that person looks. Hence, the attacks that can be launched against biometric systems have the possibility of decreasing the credibility of a biometric system.

2 Generic Watermarking Systems

Despite the obvious advantages of digital environments for the creation, editing and distribution of multimedia data such as image, video, and audio, there exist important disadvantages: the possibility of unlimited and high-fidelity copying of digital content poses a big threat to media content producers and distributors. Watermarking, which can be defined as embedding information such as origin, destination, and access levels of multimedia data into the multimedia data itself, was proposed as a solution for the protection of intellectual property rights.

The flow chart of a generic watermark encoding and decoding system is given in Fig. 1. In this system, the watermark signal (W) that is embedded into the host data (X) can be a function of watermark information (I) and a key (K) as in

W = f0(I,K),

or it may also be related to host data as in

W = f0(I,K,X).

Fig. 1: Digital watermarking block diagram: (a) watermark encoding, (b) watermark decoding.


The watermark information (I) is the information such as the legitimate owner of the data that needs to be embedded in the host data. The key is optional (hence shown as a dashed line in Fig. 1) and it can be utilized to increase the security of the entire system; e.g., it may be used to generate the locations of altered signal components, or the altered values. The watermark is embedded into host data to generate watermarked data

Y=f1(X,W).

In watermark decoding, the embedded watermark information or some confidence measure indicating the probability that a given watermark is present in the test data (the data that is possibly watermarked) is generated using the original data as

I =g(X,Y,K),

or without using the original data as

I =g(Y,K).

Also, it may be desirable to recover the original, non-watermarked data X in some applications, such as reversible image watermarking. In those cases, an estimate X̂ of the original data is also generated.
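As a toy instance of the generic scheme in Fig. 1 (an assumption made only for illustration, not the scheme used later in this paper), the key K can seed a pseudo-random ±1 pattern W, the watermarked data is Y = X + αW, and blind decoding correlates Y with the key-generated pattern:

```python
import numpy as np

def encode(X, key, alpha=5.0):
    rng = np.random.default_rng(key)
    W = rng.choice([-1.0, 1.0], size=X.shape)   # W = f0(I, K) for a 1-bit message
    return X + alpha * W                        # Y = f1(X, W)

def decode(Y, key):
    rng = np.random.default_rng(key)
    W = rng.choice([-1.0, 1.0], size=Y.shape)
    return float(np.mean(Y * W))                # confidence measure; large => watermark present

X = np.random.randint(0, 256, (128, 128)).astype(float)
Y = encode(X, key=1234)
print(decode(Y, key=1234), decode(X, key=1234))  # watermarked vs. unmarked image
```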

In watermark embedding, it is desired to keep the effects of the watermark signal as imperceptible as possible in invisible watermarking applications: the end user should not experience a quality degradation in the signal (e.g., video) due to watermarking. For this purpose, some form of masking is generally utilized. For example, the frequency-masking properties of the human auditory system (HAS) can be considered in designing audio watermark signals. Similarly, the masking effect of edges can be utilized in image watermarking systems. Conversely, in visible watermarking applications such masking need not be considered, as the actual aim of the application is to robustly mark the data, for instance embedding copyright data as logos in images available over the Internet. An example of visible image watermarking is given in Fig. 2.

Fig. 2: Visible Image Watermark.


Although there exist watermarking methods for almost all types of multimedia data, the number of image watermarking methods is much larger than for the other media types. In text document watermarking, generally the appearance of an entity in the document body is modified to carry watermark data. For example, the words in a sentence can be shifted slightly to the left or right, the sentences themselves can be shifted horizontally, or the features of individual characters can be modified (Fig. 3). Although text document watermarks can be defeated relatively easily by retyping or Optical Character Recognition (OCR) operations, the ultimate aim of making unauthorized copies of the document more expensive in terms of effort, time and money than obtaining the legal rights from the copyright owner can still be achieved.

Fig. 3: Text Watermarking Via Word-Shift Coding.

Some authors claim that image watermarking methods can be applied to video, since a video can be regarded as a sequence of image frames. But the differences that reside in available signal space (much larger in video) and processing requirements (real time processing may be necessary for video) require methods specifically designed for video data. Sample methods modify the motion vectors associated with specific frames or the labeling of frames to embed data. Audio watermarking techniques are generally based on principles taken from spread-spectrum communications. Modifying audio samples with a pseudo-randomly generated noise sequence is a typical example.

In image watermarking, the watermark signal is embedded either into the spatial-domain representation of the image or into one of many transform-domain representations such as DCT, Fourier, and wavelet. It is generally argued that embedding watermarks in transform domains provides better robustness against attacks and leads to less perceptibility of an embedded watermark, due to the spread of the watermark signal over many spatial frequencies and better modeling of the human visual system (HVS) when using transform coefficients. An example of watermarking in the spatial domain is given in Fig. 4(b). Amplitude modulation is applied to the blue-channel pixels to embed the 32-bit watermark data, represented in decimal form as 1234567890. This is a robust watermarking scheme: the watermark data can be retrieved correctly even after the watermarked image is modified. For example, the embedded data 1234567890 is retrieved after the watermarked image is (i) blurred via filtering the image pixels with a 5x5 structuring element (Fig. 4(c)), and (ii) compressed (via the JPEG algorithm with a quality factor of 75) and decompressed (Fig. 4(d)). A specific class of watermarks, called fragile watermarks, is typically used for authenticating multimedia data. Unlike robust watermarks (e.g., the one given in Fig. 5), any attack on the image invalidates the fragile watermark present in the image and helps in detecting and identifying any tampering of the image. Hence, a fragile watermarking scheme may need to possess the following features: (i) detecting tampering with high probability, (ii) being perceptually transparent, (iii) not requiring the original image at the decoding site, and (iv) locating and characterizing modifications to the image.



Fig. 4: Image watermarking: (a) original image (640x480, 24 bpp), (b) watermarked image carrying the data 1234567890, (c) image blurred after watermarking, (d) image JPEG compressed-decompressed after watermarking.

3 Fingerprint Watermarking Systems

There have been only a few published papers on watermarking of fingerprint images. Ratha et al. proposed a data-hiding method applicable to fingerprint images compressed with the WSQ (Wavelet Scalar Quantization) wavelet-based scheme. The discrete wavelet transform coefficients are changed during WSQ encoding, taking into consideration possible image degradation. Fig. 5 shows an input fingerprint image and the image obtained after the data embedding-compressing-decompressing cycle. The input image was obtained using an optical sensor. The compression ratio was set to 10.7:1 and the embedded data (randomly generated bits) size was nearly 160 bytes. As seen from these images, the image quality does not suffer significantly due to data embedding, even though the data size is considerable.


Figure 5: Compressed-domain fingerprint watermarking [51]: (a) input fingerprint, (b) data embedded-compressed-decompressed fingerprint.


Pankanti and Yeung proposed a fragile watermarking method for fingerprint image verification. A spatial watermark image is embedded in the spatial domain of a fingerprint image by utilizing a verification key. Their method can localize any region of the image that has been tampered with after it was watermarked; therefore, it can be used to check the integrity of the fingerprints. Fig. 6 shows a sample watermark image comprised of a company logo, and the watermarked image. Pankanti and Yeung used a database comprised of 1,000 fingerprints (4 images each for 250 fingers). They calculated the Receiver Operating Characteristic (ROC) curves before and after the fingerprints were watermarked. These curves are observed to be very close to each other, indicating that the proposed technique does not lead to a significant performance loss in fingerprint verification.


Fig. 6: Fragile fingerprint watermarking: (a) watermark image, (b) fingerprint image carrying the image in (a).


4 Architecture of the Proposed System

Two application scenarios are considered in this study. The basic data-hiding method is the same in both scenarios, but the scenarios differ in the characteristics of the embedded data, the host image carrying that data, and the medium of data transfer. While a fingerprint feature vector or a face feature vector is used as the embedded data, other information such as a user name (e.g., "John Doe"), a user identification number ("12345"), or an authorizing institution ("FBI") can also be hidden in the images. In this paper we explore the first scenario.

Fig. 7: Fingerprint watermarking results: (a) input fingerprint, (b) fingerprint image watermarked using gradient orientation, (c) fingerprint image watermarked using singular points.


The first scenario involves a steganography-based application (Fig. 8): the biometric data (fingerprint minutiae) that need to be transmitted (possibly via a non-secure communication channel) are hidden in a host (also called cover or carrier) image, whose only function is to carry the data. For example, the fingerprint minutiae may need to be transmitted from a law-enforcement agency to a template database, or vice versa. In this scenario, the security of the system is based on the secrecy of the communication. The host image is not related to the hidden data in any way; as a result, the host image can be any image available to the encoder. In our application, we consider three different types of cover image: a synthetic fingerprint image, a face image, and an arbitrary image (Fig. 9). The synthetic fingerprint image (360x280) is obtained after post-processing of an image generated using the algorithm described by Cappelli. Using such a synthetic fingerprint image to carry actual fingerprint minutiae data provides an increased level of security, since a person who intercepts the communication channel and obtains the carrier image is likely to treat this synthetic image itself as a real fingerprint image and not realize that it is in fact carrying the critical data. The face image (384x256) was captured in our Regional Forensic Science Lab, Vijayawada. The "Sailboat" image (512x512) is taken from the USC-SIPI database.

Fig. 8: Block diagram of application scenario


This application can be used to counter the seventh type of attack (namely, compromising the communication channel between the database and the fingerprint matcher).

(a) (b) (c)

Fig. 9: Sample Cover images: (a) Synthetic Fingerprint, (b) Face, (c) “Sailboat”.

An attacker will probably not suspect that a cover image is carrying the minutiae information. Furthermore, the security of the transmission can be further increased by encrypting the host image before transmission. Here, symmetric or asymmetric key encryption can be utilized, depending on the requirements of the application such as key management, coding-decoding time (much higher with asymmetric key cryptography), etc. The position and orientation attributes of fingerprint minutiae constitute the data to be hidden in the host image.
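A hedged sketch of this scenario (not the exact embedding used in the paper): each minutia (x, y, orientation) is packed into bits and written into the least significant bits of the cover-image pixels; the field widths and helper names below are illustrative assumptions:

```python
import numpy as np

def pack_minutiae(minutiae):                          # [(x, y, theta_deg), ...]
    bits = []
    for x, y, theta in minutiae:
        for value, width in ((x, 9), (y, 9), (int(theta) % 360, 9)):   # 9 bits per field (assumed)
            bits += [int(b) for b in format(int(value), f"0{width}b")]
    return bits

def embed_lsb(cover, bits):
    stego = cover.astype(np.uint8).copy()
    flat = stego.ravel()
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | np.array(bits, dtype=np.uint8)
    return stego

def extract_lsb(stego, n_minutiae):
    bits = stego.ravel()[:n_minutiae * 27] & 1
    fields = bits.reshape(n_minutiae, 3, 9)
    return [tuple(int("".join(map(str, f)), 2) for f in m) for m in fields]

cover = np.random.randint(0, 256, (360, 280), dtype=np.uint8)   # e.g. a synthetic fingerprint
minutiae = [(120, 85, 45), (200, 150, 300)]
print(extract_lsb(embed_lsb(cover, pack_minutiae(minutiae)), len(minutiae)))
```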

5 Conclusion

The ability of biometrics-based personal identification techniques to differentiate between an authorized person and an impostor who fraudulently acquires the access privilege of an authorized person is one of the main reasons for their popularity compared to traditional identification techniques. However, the security and integrity of the biometric data itself raise important issues, which can be ameliorated using encryption, watermarking, or steganography. In addition to watermarking, encryption can also be used to further increase the security of biometric data. Our first application is related to increasing the security of biometric data exchange, and is based on steganography. The verification accuracy based on decoded watermarked images is very similar to that with the original images. The proposed system can be coupled with a fragile watermarking scheme to detect illegitimate modification of the watermarked templates.

References

[1] [D. Maio, D. Maltoni, R. Cappelli, J.L. Wayman and A.K. Jain, FVC 2004] Third Fingerprint Verification Competition, in Proc. International Conference on Biometric Authentication (ICBA).

[2] [D. Maio, D. Maltoni, R. Cappelli, J.L. Wayman and A.K. Jain, FVC 2002] Second Fingerprint Verification Competition, in Proc. International Conference on Pattern Recognition.

[3] [S. Pankanti and M.M. Yeung] Verification watermarks on fingerprint recognition and retrieval. In Proc.SPIE, Security and Watermarking of Multimedia Contents, vol. 3657, pages 66–78, 2006.

[4] [N. K. Ratha, J. H. Connell, and R. M. Bolle] Secure data hiding in wavelet compressed fingerprint images. In Proc. ACM Multimedia, pages 127–130, 2007.


[5] [N. K. Ratha, J. H. Connell, and R. M. Bolle] An analysis of minutiae matching strength. In Proc. AVBPA 2001, Third International Conference on Audio- and Video-Based Biometric Person Authentication, pages 223–228, 2006.

[6] [A. K. Jain, S. Prabhakar, and S. Chen] “Combining multiple matchers for a high security fingerprint verification system,” Pattern Recognition Letters, vol. 20, pp. 1371–1379, 2005.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Unsupervised Color Image Segmentation Based on Gaussian Mixture Model and Uncertainty K-Means

Srinivas Yarramalle, Department of Information Technology, Vignan's IIT, Visakhapatnam-46; Satya Sridevi P., M.Tech (CST) (CL), Acharya Nagarjuna University

Abstract

In this paper we propose a new model of image segmentation based on the Finite Gaussian Mixture Model and the UK-Means algorithm. In the Gaussian mixture model the pixels inside an image region follow a Gaussian distribution and the image is assumed to be a mixture of these Gaussians. The initial components of the image are estimated using the UK-Means algorithm. This method does not depend entirely on random selection of parameters; hence it is reliable and sustainable and can be used for unsupervised image data. The performance of the algorithm is demonstrated by color image segmentation.

Keywords: Gaussian Mixture Model, K-Means Algorithm, UK-Means, Segmentation

1 Introduction

Image segmentation is a key process of image analysis, with applications to pattern recognition, object detection and medical image analysis. A number of image segmentation algorithms based on histograms [1], model-based methods [2], saddle points [3], Markovian approaches [4], etc., have been proposed. Among these, model-based image segmentation has gained importance since the segmentation uses the parameters of each pixel. Segmentation algorithms differ from application to application; there exists no algorithm which suits all purposes [5]. An advantage of image segmentation is that, by compressing some segments, communication can be made possible while saving network resources.

To segment an image one can use models based on Bayesian classifiers, Markov models, graph-cut approaches, etc. Depending on these models, there are three major approaches to segmenting an image: 1) edge based, 2) region based and 3) Gaussian Mixture Model based. Among these, Gaussian mixture model based image segmentation has gained popularity [4][5][6].

Here we assume that each pixel in the image follows a Normal (Gaussian) distribution with a mean and variance, and since each pixel follows a Gaussian distribution, the image is modelled as a Gaussian Mixture Model. To identify the pixel density and to estimate the mixture densities of the image, a joint entropy algorithm is used. The segmentation process is carried out by clustering each pixel of the image data based on homogeneity. This method is stable.


The main disadvantages in image segmentation are, first, that if the number of components of the Gaussian mixture model is assumed to be known a priori, the segments may not be effective, and secondly, the initialization of the parameters, which may greatly affect the segmentation result. Hence, to estimate the parameters efficiently, the UK-Means algorithm is used in our model.

2 Gaussian Mixture Model

Image segmentation is the process of dividing an image so that homogeneous pixels come together. A pixel is treated as a random variable defined on a two-dimensional space. To understand and interpret the pattern of the pixels inside an image region, one has to fit a model. Here image segmentation based on a Gaussian mixture model is considered: each pixel is assumed to follow a Gaussian distribution and the entire image is a mixture of these Gaussian variates. The basic methodology for segmenting an image is to find the number of clusters effectively, so that the homogeneous pixels come together. If a feature, texture or pattern is known, then it is easy to segment the image based on these patterns; however, for realistic data we cannot know the number of clusters in advance, hence the UK-Means algorithm is used to identify the number of clusters inside the image. Once the number of clusters is identified, then for each image region we have to estimate the model parameters, namely µ, σ and Π (where µ is the mean, σ is the standard deviation and Π is the mixing weight).

2.1 The Probability Density Function of Gaussian Mixture Model

An image is a matrix in which each element is a pixel. The value of a pixel is a number that represents the intensity or color of the image. Let X be a random variable that takes these values. To determine a probability model, we can assume a mixture of Gaussian distributions of the following form:

f(x) = Σ_{i=1}^{K} p_i N(x; µ_i, σ_i²)        (1)

where K is the number of regions to be estimated and the p_i > 0 are weights such that Σ_{i=1}^{K} p_i = 1, and

N(x; µ_i, σ_i²) = (1 / √(2π σ_i²)) exp( −(x − µ_i)² / (2σ_i²) )        (2)

where µ_i and σ_i are the mean and standard deviation of region i. The parameters of each region are θ = (p_1, ..., p_K, µ_1, ..., µ_K, σ_1², ..., σ_K²).
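Equations (1)-(2) can be transcribed directly; the following short sketch (an illustration, with example parameter values) evaluates the mixture density of a pixel value:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def mixture_density(x, p, mu, sigma):
    # f(x) = sum_i p_i N(x; mu_i, sigma_i^2), equations (1)-(2)
    return sum(p_i * gaussian(x, m_i, s_i) for p_i, m_i, s_i in zip(p, mu, sigma))

print(mixture_density(120, p=[0.4, 0.6], mu=[100, 180], sigma=[15, 20]))  # two-region example
```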

To estimate the number of image regions, the UK-Means algorithm is used.

3 K-Means Clustering

In Section 3 the K-Means algorithm is discussed, and then in Section 3.1 the UK-Means algorithm is presented.

K-Means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.


The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume K clusters) fixed a priori. The main idea is to define K centroids, one for each cluster. These centroids should be placed in a cunning way because different locations cause different results, so the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate K new centroids as the centers of the clusters resulting from the previous step. After we have these K new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated; as a result of this loop we may notice that the K centroids change their location step by step until no more changes are made. In other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared-error function.

The objective function is

J = Σ_{j=1}^{K} Σ_{i=1}^{n} ||x_i^(j) − c_j||²

where ||x_i^(j) − c_j||² is a chosen distance measure between a data point x_i^(j) and the cluster centre c_j, and is an indicator of the distance of the n data points from their respective cluster centres.

The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids

2. Assign each object to the group that has the closest centroid.

3. When all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Although it can be proved that the procedure will always terminate, the K-means algorithm does not necessarily find the most optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centers. The K-means algorithm can be run multiple times to reduce this effect.
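The four steps above can be written compactly as follows; this sketch is illustrative and is not taken from the paper:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1: initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)                           # step 2: nearest centroid
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])   # step 3: recompute means
        if np.allclose(new, centroids):                         # step 4: stop when unchanged
            break
        centroids = new
    return labels, centroids

labels, centroids = kmeans(np.random.rand(500, 3), k=4)         # e.g. RGB pixel values
```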

3.1 K-Means for Uncertain Data

The clustering algorithm with the goal of minimizing the expected sum of squared errors E(SSE) is called the UK-Means algorithm. The UK-Means algorithm suits best when the data is uncertain and unsupervised. The UK-Means calculations are presented below.

E(SSE) = Σ_{j=1}^{K} Σ_{X_i ∈ C_j} E(||X_i − c_j||²)

where E(||X_i − c_j||²) is the expected distance between a data point X_i and the cluster mean c_j. The distance is measured using the Euclidean distance, given by

||X − Y||² = Σ_{i=1}^{k} (x_i − y_i)²

and the cluster mean is recomputed as

c_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i.

The UK-Means algorithm is as follows:

1. Assign initial values for the cluster means c1 to cK.
2. Repeat:
3. For i = 1 to n do:
4. Assign each data point xi to the cluster Cj for which E(||Cj − Xi||) is smallest.
5. End for.
6. For j = 1 to K do:
7. Recalculate the cluster mean of cluster Cj.
8. End for.
9. Until convergence.
10. Return.

The main difference between UK-Means and K-Means clustering lies in the computation of distances and clusters: UK-Means computes the expected distance and the cluster centroids based on the data uncertainty.
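The point of this paragraph can be sketched as follows: if an uncertain point is modelled by a mean m and a per-dimension variance v (a Gaussian-style uncertainty model assumed here only for illustration), then E||X − c||² = ||m − c||² + Σv, so only the added variance term changes relative to ordinary K-Means:

```python
import numpy as np

def expected_sq_dist(mean, var, centroid):
    mean, var, centroid = map(np.asarray, (mean, var, centroid))
    return float(np.sum((mean - centroid) ** 2) + np.sum(var))   # E||X - c||^2

def assign_uncertain(means, variances, centroids):
    labels = []
    for m, v in zip(means, variances):
        d = [expected_sq_dist(m, v, c) for c in centroids]       # expected distance to each centroid
        labels.append(int(np.argmin(d)))
    return np.array(labels)
```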

4 Performance Evaluation and Conclusion

The performance of the above two methods is evaluated on two images, Sheep and Mountains, using the image quality metrics signal-to-noise ratio (SNR) and mean square error (MSE). The original and segmented images are shown in Fig. 1.

Fig. 1: Segmented image

From the above images it can easily be seen that the segmentation based on UK-Means gives the best results; the edges inside the image are clear. The performance evaluation of the two methods is given in Table 1.


Table 1: Performance evaluation of the two segmentation methods

Image        Gaussian model + K-means     Gaussian model + UK-means
             SNR        MSE               SNR        MSE
Sheep        38.2       0.7               47.3       0.4
Mountains    32.7       0.8               41.7       0.6

From Table 1 it can be seen that the signal-to-noise ratio for the segmentation based on UK-Means is higher and the mean square error is lower, which implies that the output image is much closer to the input image.

References

[1] C.A. Glasbey, “An analysis of histogram based threshold algorithms”, CVGIP, Vol. 55, pp. 532–537.
[2] Michael Chau et al., “Uncertain K-Means algorithm”, Proceedings of the Workshop on Sciences of the Artificial, December 7–8, 2005.
[3] M. Brezeail and M. Sonka, “Edge based image segmentation”, IEEE World Congress on Computational Intelligence, pp. 814–819, 1998.
[4] L. Chan et al., “Image texture classification based on finite Gaussian mixture models”, IEEE Transactions on Image Processing, 2001.
[5] Rahman Farnoosh, Gholamhossein Yari, “Image segmentation using Gaussian mixture model”, Proceedings of Pattern Recognition, 2001.
[6] S.K. Pal, N.R. Pal, “A review of image segmentation techniques”, IEEE Transactions on Image Processing, 1993.
[7] Bo Zhao, Zhongxing Zhu, Enrong Mao and Zhenghe Song, “Image segmentation based on ant colony optimization and K-means clustering”, Proceedings of the IEEE International Conference on Automation and Logistics, 2007.


Recovery of Corrupted Photo Images Based on Noise Parameters for Secured Authentication

Pradeep Reddy CH, Jagan's College of Engg & Tech., Nellore
Srinivasulu D, Narayana Engg. College, Nellore
Ramesh R, Narayana Engg. College, Nellore

[email protected] [email protected] [email protected]

Abstract

Photo-image authentication is an interesting and demanding field in image processing, mainly for reasons of security. In this paper, photo-image authentication refers to the verification of a corrupted facial image on an identification card, passport or smart card based on its comparison with the original image stored in a database. This paper concentrates on noise parameter estimation. In the training phase, a list of corrupted images is generated by adjusting the contrast, brightness and Gaussian noise of the original image stored in the database, and then PCA (Principal Component Analysis) training is applied to the generated images. In the testing phase, the Gaussian noise is removed from the scanned photo image using a Wiener filter. Then, linear coefficients are calculated based on the LSM (Least Square Method) and the noise parameters are estimated. Based on these estimated parameters, corrupted images are synthesized. Finally, authentication is performed by comparing the synthesized image with the scanned photo image using the normalized correlation method. The proposed method can be applied to various fields of image processing such as photo image verification for credit cards and automatic teller machine (ATM) transactions, smart card identification systems, biometric passport systems, etc.

1 Introduction

1.1 Background

Photo-image authentication plays an important role in a wide range of applications such as biometric passport systems, smart card authentication and identification cards that use a photo-image for authentication. This paper mainly concentrates on authenticating corrupted photo-images; the proposed method provides secure authentication against forgery and offers accurate and efficient authentication. Authentication is still a challenge for researchers because corrupted images can easily be forged. The different methodologies and noise parameter estimation give an insight into which algorithm should be used to estimate the noise parameters from the original image and the generated corrupted images.


1.2 Related Work

Research on face authentication has been carried out for a long time, and several researchers have analyzed various methods for dealing with corrupted images. In previous studies, modeling the properties of noise, as opposed to those of the image itself, has been used to remove the noise from the image. These methods use local properties to remove noise, but they cannot completely remove noise that is distributed over a wide area. In addition, similar noise properties in different regions of the image affect the noise-removal process, and these methods cannot recover regions that are damaged due to noise or occlusion. The corruption of photo-images is a commonly occurring phenomenon and a source of serious problems in many practical applications such as face authentication [6]. There are several approaches that address the noise problem without taking multiple training images per person. Robust feature extraction in corrupted images uses polynomial coefficients and mainly works on Gaussian and white noise. Reconstruction of the missing parts in partially corrupted images can be done using Principal Component Analysis and Kernel Principal Component Analysis, but this requires multiple images to produce good results and is not efficient for real-time face-image authentication. In face authentication based on virtual views, a single image is used in the training set; this method performs well for various kinds of poses, but difficulties are encountered in generating virtual views for occluded regions.

2 Proposed Work

Our approach is to estimate the noise parameters using the Least Squares Minimization method. In order to authenticate corrupted photo-images, the proposed method has a training phase and a testing phase. In the training phase, corrupted images are generated by adjusting the contrast, brightness and Gaussian blur parameters of an original photo-image; then basis vectors for the corrupted images and the noise parameters are obtained. In the testing phase, the Gaussian noise is first removed from the test image using the Wiener filter. Linear coefficients are then computed by decomposition of the noise model, and the noise parameters are estimated by applying these coefficients to the linear composition of the noise parameters. Subsequently, a synthesized image is obtained by applying the estimated parameters to the original image contained in the database. Finally, photo-image authentication is performed by comparing the synthesized photo-image and the corrupted photo-image.

2.1 Noise Model

Digital images are prone to various types of noise such as blur, speckles, stains, scratches, folding lines, and salt-and-pepper noise. There are several ways in which noise can be introduced into an image, depending on how the image is created. The main noise types present in digital images are Gaussian noise, Poisson noise, speckle noise and impulse noise. Noise is generally introduced at the time of acquisition and transmission; Gaussian noise is the one most frequently added in the image acquisition process and has a strong effect on the image, so this paper mainly concentrates on Gaussian noise. Three noise parameters, namely contrast, brightness and Gaussian noise, are considered in the current study because the noise in a corrupted image can be synthesized by a combined adjustment of these three parameters. The contrast and brightness of an image are changed for generating corrupted images as follows:

I_CB = c · I(x, y) + b

where c is the contrast parameter, b is the brightness parameter and I_CB is the image corrupted by the changes of contrast and brightness.

Gaussian blur is generated by convolving the original image with a Gaussian blur kernel:

I_G = I_org(x, y) * G_blur(x, y)

Noise model is defined as the combination of corrupted images Ic and noise parameters p.

Noise Model = { (I_c^i, p^i) },  i = 1, …, M

where I_c^i are the corrupted images, M is the number of corrupted images and p^i are the corresponding noise parameter values. The values of the noise parameters are then estimated by applying linear coefficients, calculated using the Least Square Minimization method (discussed in section 2.4) from the trained data set and the Wiener-filtered image, to the linear composition of the noise parameters. The estimated noise parameters are applied to the original image to obtain the synthesized image.
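A minimal sketch of how a list of corrupted training images could be generated by sweeping the three noise parameters; the parameter grid and helper name are hypothetical, and SciPy's gaussian_filter stands in for convolution with G_blur.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def corrupt(image, contrast, brightness, sigma):
    """I_CB = c * I(x, y) + b, followed by Gaussian blurring (convolution with G_blur)."""
    out = contrast * image.astype(float) + brightness
    out = gaussian_filter(out, sigma=sigma)       # Gaussian blur of the contrast/brightness-adjusted image
    return np.clip(out, 0, 255)

# Hypothetical training grid of noise parameters (c, b, sigma)
params = [(c, b, s) for c in (0.8, 1.0, 1.2)
                    for b in (-20, 0, 20)
                    for s in (0.0, 1.0, 2.0)]
original = np.random.default_rng(0).integers(0, 256, size=(64, 64)).astype(float)
corrupted_set = [corrupt(original, *p) for p in params]
```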

2.3 Principal Component Analysis Algorithms

Principal component analysis is a data-reduction method that finds an alternative set of parameters for a set of raw data (or features) such that most of the variability in the data is compressed into the first few parameters. Here it is used as a face algorithm based on 2D eigenfaces. In PCA training, a set of eigenfaces is created; new images are then projected onto the eigenfaces and checked to see whether they are close to the “face space”.

Step 1: Prepare the data

The faces constituting the training set should be prepared for processing.

Step 2: Subtract the mean

Ψ = \frac{1}{M} \sum_{i=1}^{M} Γ_i,  where Γ_i are the original face vectors and Ψ is the average face.

Φ_i = Γ_i − Ψ

Step 3: Calculate the covariance matrix

C = \frac{1}{M} \sum_{n=1}^{M} Φ_n Φ_n^T,  where C is the covariance matrix.

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

The eigenvectors (eigenfaces) u_i and the corresponding eigenvalues λ_i should be calculated.

C = \frac{1}{M} \sum_{n=1}^{M} Φ_n Φ_n^T = A A^T

L = A^T A,  with elements L_{mn} = Φ_m^T Φ_n

u_l = \sum_{k=1}^{M} v_{lk} Φ_k,  l = 1, …, M

where L is an M × M matrix, v_l are the M eigenvectors of L and u_l are the eigenfaces.

Step 5: Select the principal components

Rank the eigenfaces according to their eigenvalues in descending order; from the M eigenvectors (eigenfaces) u_i, only the M' with the highest eigenvalues should be chosen. An eigenvector with a high eigenvalue explains more of the characteristic features of the faces; eigenfaces with low eigenvalues are neglected, as they explain only a very small part of the characteristic features. After the M' eigenfaces u_i are determined, the “training” phase of the algorithm is finished.
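A compact NumPy sketch of the eigenface training steps above (mean subtraction, the small M×M matrix L = A^T A trick, and selection of the M' leading eigenvectors); the function names are ours and the normalization step is an assumption.

```python
import numpy as np

def pca_train(faces, n_components):
    """Eigenface training: `faces` is (M, p) with one flattened face image per row."""
    mean_face = faces.mean(axis=0)                     # Psi (step 2)
    Phi = faces - mean_face                            # Phi_i = Gamma_i - Psi
    L = Phi @ Phi.T                                    # small M x M matrix L = A^T A (step 4)
    eigvals, V = np.linalg.eigh(L)                     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # step 5: keep the highest eigenvalues
    U = Phi.T @ V[:, order]                            # eigenfaces u_l = sum_k v_lk Phi_k
    U /= np.linalg.norm(U, axis=0)                     # normalise each eigenface (assumption)
    return mean_face, U

def project(face, mean_face, U):
    """Project a face onto the eigenface space ('face space')."""
    return U.T @ (face - mean_face)
```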


2.4 Least Square Minimization Method

The Least Square Minimization method is a mathematical optimization technique which, given a series of measured data, attempts to find a function that closely approximates the data (a “best fit”). It is a statistical approach to estimate the expected value or function with the highest probability from observations with random errors, and it is commonly applied in two cases: curve fitting and coordinate transformation. The unknown parameters a and b of the model y = ax + b are determined by minimizing the squared sum of residuals, i.e. by solving the overdetermined system AX = B built from the equations a x_i + b = y_i.

In this paper the Least Square Minimization method is used to estimate the linear coefficients of the corrupted photo-image. The linear coefficients are calculated by the decomposition of the noise models generated using principal component analysis technique.
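For illustration, a small NumPy example of fitting y = ax + b by least squares, as in the AX = B formulation above; the observation values are hypothetical.

```python
import numpy as np

# Observations (x_i, y_i) with random errors (hypothetical data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

# Fit y = a*x + b by minimising the squared sum of residuals: solve A X = B
A = np.column_stack([x, np.ones_like(x)])          # design matrix [x_i, 1]
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
residual_sum = np.sum((A @ np.array([a, b]) - y) ** 2)
print(a, b, residual_sum)
```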

2.5 Normalized Correlation Method

The Normalized Correlation Method is a frequently used approach to perform matching between two images; here it is used to authenticate the test image against the synthesized image. The method calculates the correlation coefficient, which gives the amount of similarity between the synthesized image and the test image: the higher the correlation coefficient, the higher the similarity between the images. The correlation coefficient lies between -1 and 1 and is independent of scale changes in the amplitude of the images.
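A possible implementation of the normalized correlation coefficient and the resulting authentication decision; the threshold value is an assumption, not taken from the paper.

```python
import numpy as np

def normalized_correlation(img_a, img_b):
    """Correlation coefficient in [-1, 1]; insensitive to scale changes in amplitude."""
    a = img_a.astype(float).ravel()
    b = img_b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def authenticate(test_img, synthesized_img, threshold=0.9):
    """Accept when the synthesized image is sufficiently similar to the test image."""
    return normalized_correlation(test_img, synthesized_img) >= threshold
```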

3 Proposed System Architecture

Fig. 1: System Architecture

The Corrupted Image module generates a list of corrupted images by adjusting the three noise parameters (contrast, brightness and Gaussian blur) of the original image stored in the database. The PCA algorithm is used in Noise Parameter Estimation and also to train the list of corrupted images; the Gaussian noise is removed from the scanned corrupted photo-image using a Wiener filter. The Noise Parameter Estimation module uses linear coefficients, from which the noise parameters are estimated. The Synthesized Image module synthesizes a corrupted photo-image by applying the noise parameters estimated in the previous module to the original photo-image. The Authentication module performs photo-image authentication by comparing the test image and the synthesized photo-image.

4 Conclusions and Future Work

Thus, a new method of authenticating corrupted photo-images based on noise parameter estimation has been implemented. In contrast to previous techniques, this method deals with corrupted photo-images based on noise parameter estimation, using only one image per person for training and only a few relevant principal components. The method provides an accurate estimation of the parameters and improves the performance of photo-image authentication. The experimental results show that the noise parameter estimation of the proposed method is quite accurate and that the method can be very useful for authentication.

Further research is needed to develop a method of estimating partial noise parameters in a local region and generating various corrupted images for the purpose of accurate authentication. Thus, it is expected that the proposed method can be utilized for practical applications requiring photo-image authentication, such as biometric passport systems and smart card identification systems.

References

[1] http://www.imageprocessingplace.com/image databases
[2] http://mathworld.wolfram.com/CorrelationCoefficient.html
[3] http://www.wolfram.com
[4] [L.I. Smith, 2006] Lindsay I. Smith, A Tutorial on Principal Component Analysis, http://www.csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[5] [A. Pentland and M. Turk, 1991] A. Pentland, M. Turk, “Eigenfaces for recognition”, Journal of Cognitive Neuroscience, Vol. 2 (1), pp. 71–86, 1991.
[6] [S.W. Lee et al., 2006] S.W. Lee, H.C. Jung, B.W. Hwang, Lee Seong-Whan, “Authenticating corrupted photo images based on noise parameter estimation”, Pattern Recognition, Vol. 39 (5), pp. 910–920, May 2006.
[7] Rafael C. Gonzalez, Richard E. Woods, Steven L. Eddins, “Digital Image Processing Using MATLAB”, 2nd Edition, Pearson, 2004.
[8] [B.W. Hwang and S.W. Lee, 2003] B.-W. Hwang, S.-W. Lee, “Reconstruction of partially damaged face images based on a morphable face model”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25 (3), pp. 365–372, 2003.
[9] [K.K. Paliwal and C. Sanderson, 2003] K.K. Paliwal, C. Sanderson, “Fast features for face authentication under illumination direction changes”, Pattern Recognition Letters, Vol. 24 (14), 2003.
[10] C. Sanderson, S. Bengio, “Robust features for frontal face authentication in difficult image conditions”, in: Proceedings of the International Conference on Audio- and Video-based Biometric Person Authentication, Guildford, UK, pp. 495–504, 2003.
[11] [J. Bigun et al., 2002] J. Bigun, W. Gerstner, F. Smeraldi, “Support vector features and the role of dimensionality in face authentication”, Lecture Notes in Computer Science, Pattern Recognition with Support Vector Machines, 2002.
[12] [A.M. Martinez, 2001] A.M. Martinez, “Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24 (6), pp. 748–763, 2001.
[13] [A.C. Kak and A.M. Martinez, 2001] A.C. Kak, A.M. Martinez, “PCA versus LDA”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23 (2), pp. 229–233, 2001.
[14] [P.N. Belhumeur et al., 1997] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19 (7), pp. 711–720, 1997.
[15] [K.R. Castleman, 1996] K.R. Castleman, “Digital Image Processing”, 2nd Edition, Prentice-Hall, Englewood Cliffs, New Jersey, 1996.
[16] D. Beymer, T. Poggio, “Face recognition from one example view”, in: Proceedings of the International Conference on Computer Vision, Massachusetts, USA, pp. 500–507, 1995.


An Efficient Palmprint Authentication System

K. Hemantha Kumar CSE Dept, Vignan’s Engineering College, Vadlamudi-522213

Abstract

A reliable and robust personal verification approach using palmprint features is presented in this paper. The characteristics of the proposed approach are that no prior knowledge about the objects is necessary and the parameters can be set automatically. In our work, we use the PolyU online palmprint database provided by the Hong Kong Polytechnic University. In the proposed approach the user can place the palm in any direction; finger-webs are automatically selected as the datum points to define the region of interest (ROI) in the palmprint images. A hierarchical decomposition mechanism is applied to extract principal palmprint features inside the ROI, which includes directional decompositions; the directional decompositions extract principal palmprint features from each ROI. A total of 7720 palmprint images were collected from 386 persons to verify the validity of the proposed approach. For palmprint verification we use principal component analysis, and the results are satisfactory with acceptable accuracy (FRR: 0.85% and FAR: 0.75%). Experimental results demonstrate that our proposed approach is feasible and effective in palmprint verification.

Keywords: Palmprint verification, Finger-web, Template generation, Template matching, Principal component analysis.

1 Introduction

Due to the explosive growth and popularity of the Internet in recent years, an increasing number of security access control systems based on personal verification is required. Traditional personal verification methods rely heavily on the use of passwords, personal identification numbers (PINs), magnetic swipe cards, keys, smart cards, etc.; these traditional methods offer only limited security. Many biometric verification techniques dealing with various physiological features, including facial images, hand geometry, palmprint, fingerprint and retina pattern [1], have been proposed to improve the security of personal verification. Important requirements for biometric verification techniques are uniqueness, repeatability, immunity to forgery, operation under controlled light or not, high throughput rate, low false rejection rate (FRR) and false acceptance rate (FAR), and ease of use; there is still no biometric verification technique that satisfies all these needs. In this paper, we present a novel palmprint verification method for personal identification. In general, palmprints consist of some significant textures and many minutiae similar to the creases, ridges and branches of fingerprints. In addition, many different features exist in palmprint images, such as the geometry, the principal lines, the delta points, wrinkle features, etc. [9]. Both palmprint and fingerprint offer stable, unique features for personal identification, and have been used for criminal verification by law enforcement agents for more than 100 years [9]. However, it is a difficult task to extract small unique features (known as minutiae) from the fingers of elderly people as well as manual laborers [5,6]. Many verification technologies using biometric features of palms have been developed recently [7–11].

In this paper, we propose an approach for personal authentication using palmprint verification. The overall system is compact and simple, and the user can place the palm in any direction. Both the principal lines and the wrinkles of palmprints are referred to as principal palmprints in the following. Most users are not willing to give palm images because of the chance that the palm images could be misused. To overcome this problem we generate and store templates for verification, and from these templates it is not possible to regenerate the original palm images. The rest of this paper is organized as follows. The segmentation, the procedure for determining the finger-web locations and the location of the region of interest (ROI) are presented in section 2. Template construction is presented in section 3. User verification is presented in section 4. Finally, concluding remarks are given in section 5.

2 Region of Interest Identification

By carefully analyzing the histograms of palmprint images, we find that the histograms are typically bimodal. Hence, we adopt a mode method to determine a suitable threshold for binarizing palmprint images. Using this threshold, the border of the palm is detected and only the border pixels are assigned a high intensity; the resulting border image is shown in Fig. 1(b).

Fig. 1: (a) Palm image; (b) Border of the palm
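A rough sketch of the binarization and border-extraction step; an Otsu-style between-class-variance threshold is used here as a stand-in for the paper's mode method, and the intensity polarity of the palm (palm brighter than background) is an assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def bimodal_threshold(gray):
    """Otsu-style threshold as a stand-in for the mode method on a bimodal histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (levels[:t] * p[:t]).sum() / w0
        m1 = (levels[t:] * p[t:]).sum() / w1
        var_between = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

def palm_border(gray):
    """Binarise the palm and keep only the border pixels (set to high intensity)."""
    mask = gray > bimodal_threshold(gray)
    border = mask & ~binary_erosion(mask)        # pixels removed by erosion form the border
    return np.where(border, 255, 0).astype(np.uint8)
```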

After identifying the border, the image must be rotated to the proper orientation depending on the fingers' direction, and then the region of interest is identified. Figure 2 shows palm images of the same person placed in different directions.

To increase the verification accuracy and reliability, we compare the principal palmprint features extracted from the same region in different palmprint images. The region to be extracted is known as the ROI and its size is 180×180. It is therefore important to fix the ROI at the same position in different palmprint images to ensure the stability of the extracted principal palmprint features. The ROIs of the above images are given in Figure 3.


Fig. 2: Palm Images of Same Person on Different Directions

Fig. 3: ROIs of the images in Fig. 2, in corresponding order.

3 Principal Palmprint Features Extraction and Template Generation

A line segment is a significant feature for representing principal palmprints. It contains various features, such as end-point locations, direction, width, length and intensity. To extract these features we iteratively apply an edge function and morphological functions; after applying these functions we subsample the images and then generate the templates. The template of the ROI is shown in Figure 4.
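A minimal sketch of template generation along these lines (edge magnitude, morphological closing, subsampling); the edge threshold and subsampling factor are hypothetical choices, not the paper's parameters.

```python
import numpy as np
from scipy.ndimage import sobel, binary_closing

def roi_template(roi, subsample=8, edge_thresh=60.0):
    """Edge extraction + morphological closing + subsampling of an ROI into a small binary template."""
    g = roi.astype(float)
    gx, gy = sobel(g, axis=0), sobel(g, axis=1)
    edges = np.hypot(gx, gy) > edge_thresh        # line-like principal palmprint features
    edges = binary_closing(edges)                 # join broken line segments
    return edges[::subsample, ::subsample].astype(np.uint8)
```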


Fig. 4: Template of the above ROI

4 Verification

We use Principal Component Analysis to verify whether two palmprints are captured from the same person or not. The total number of palmprint images used in our experiment was 7720, collected from 386 persons with 20 palmprint images captured per person. The size of each palmprint image was 284×384, with 100 dpi resolution and 256 gray levels. The first 10 acquired images were used as the template image set and the 10 images acquired afterwards were taken as the test set. A total of 3860 palmprint images were used in constructing 386 templates for the template library, and each template size is 22×22. The other 3860 palmprint images were used as the testing images to verify the validity of the proposed approach. We adopted the statistical pair known as False Rejection Rate (FRR) and False Acceptance Rate (FAR) to evaluate the performance of the experimental results. The results are satisfactory with acceptable accuracy (FRR: 0.85% and FAR: 0.75%). Experimental results demonstrate that our proposed approach is feasible and effective in palmprint verification.

5 Conclusion

In this paper, we present an efficient palmprint authentication system. There are two main advantages of our proposed approach. The first is that the user can place the palm in any direction; our algorithm automatically rotates the palm image and obtains the ROI. The second is that, instead of palm images, templates are stored for verification, so users need not worry that their palm images may be used for other purposes. The algorithm for automatic rotation and detection of the finger-web locations has been tested on 7720 palmprint images captured from 386 different persons. The results show that our technique conforms to the results of manual estimation. We also demonstrate that the use of finger-webs as the datum points to define ROIs is reliable and reproducible. Under normal conditions, the ROIs should cover almost the same region in different palmprint images. Within the ROI, principal palmprint features are extracted by applying the template generation, which consists of edge detectors and sequential morphological operators. Any new palmprint features are matched with those from the template library by PCA to verify the identity of the person. Experimental results demonstrate that the proposed approach can obtain acceptable verification accuracy. Such an approach can be applied in access control systems. In applications demanding high security, a very low FAR (even zero) and an acceptable FRR are mandatory, and it is a conflict to reduce both FAR and FRR using the same biometric features. In order to reduce FAR without increasing FRR, we can combine our techniques with those using palm geometric shapes, finger creases and other biometric features for verification in future research.


References

[1] A.K. Jain, R. Bolle, S. Pankanti, Biometrics Personal Identification in Networked Society, Kluwer Academic Publishers, Massachusetts, 1999.

[2] Y. Yoshitomi, T. Miyaura, S. Tomita, S. Kimura, Face identification thermal image processing, Proceeding 6th IEEE InternationalWorkshop on Robot and Human Communication, RO-MAN’ 97 SENDAI.

[3] J.M. Cross, C.L. Smith, Thermographic imaging of the subcutaneous vascular network of the back of the hand for biometric identification, Institute of Electrical and Electronics Engineers 29thAnnual 1995 International Carnahan Conference, 1995, pp. 20–35.

[4] Chih-Lung Lin, Thomas C. Chuang, Kuo-Chin Fan, Palmprint verification using hierarchical decomposition, Pattern Recognition 38 (2005) 2639 – 2652.

[5] A. Jain, L. Hong, R. Bolle, On-line fingerprint verification, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 302–313.

[6] L. Coetzee, E.C. Botha, Fingerprint recognition in low quality images, Pattern Recogn. 26 (1993) 1441–1460.

[7] C.C. Han, P.C. Chang, C.C. Hsu, Personal identification using hand geometry and palm-print, Fourth Asian Conference on Computer Vision (ACCV), 2000, pp. 747–752.

[8] H.J. Lin, H.H. Guo, F.W. Yang, C.L. Chen, Handprint Identification Using Fuzzy Inference, The 13th IPPR Conference on Computer Vision Graphics and Image Processing, 2000, pp. 164–168.

[9] D. Zhang, W. Shu, Two novel characteristics in palmprint verification: datum point invariance and line feature matching, Pattern Recogn. 32 (1999) 691–702.

[10] J. Chen, C. Zhang, G. Rong, Palmprint recognition using crease, International Conference on Image Processing, vol. 3, 2001, pp. 234–237.

[11] W.K. Kong, D. Zhang, Palmprint texture analysis based on low-resolution images for personal authentication, 16th International Conference on Pattern Recognition, vol. 3, 2002, pp. 807–810.


Speaker Adaptation Techniques

D. Shakina Deiv, ABV-IIITM, Gwalior
Pradip K. Das, Department of CSE, IIT Guwahati
M. Bhattacharya, Department of ICT, ABV-IIITM, Gwalior

Abstract

Speaker Adaptation techniques are used to reduce speaker variability in Automatic Speech Recognition. Speaker dependent acoustic model is obtained by adapting the speaker independent acoustic model to a specific speaker, using only a small amount of speaker specific data. Maximum likelihood transformation based approach is certainly one of the most effective speaker adaptation methods known so far. Some researchers have proposed constraints on the transformation matrices for model adaptation, based on the knowledge gained from Vocal tract length normalization (VTLN) domain. It is proved that VTLN can be expressed as linear transformation of cepstral coefficients. The cepstral domain linear transformations were used to adapt the Gaussian distributions of the HMM models in a simple and straightforward manner as an alternative to normalizing the acoustic feature vectors. The VTLN constrained model adaptation merits exploration as its performance does not vary significantly with the amount of adaptation data.

1 Introduction

In Automatic Speech Recognition (ASR) systems, speaker variation is one of the major causes of performance degradation. The error rate of a well-trained speaker-dependent speech recognition system is three times less than that of a speaker-independent speech recognition system [Huang and Lee, 1991]. Speaker normalization and speaker adaptation are the two commonly used techniques to alleviate the effects of speaker variation.

In speaker normalization, a transformation is applied to the acoustic features of a given speaker's speech so as to better match it to a speaker-independent model. Cepstral mean removal, Vocal Tract Length Normalization (VTLN), feature space normalization based on mixture density Hidden Markov Models (HMM) and signal bias removal estimated by Maximum Likelihood Estimation (MLE) are some techniques employed for speaker normalization.

Most state-of-the-art speech recognition systems make use of the Hidden Markov Model (HMM) as a convenient statistical representation of speech. One way to account for the effects of speaker variability is speaker adaptation, achieved by modifying the acoustic model. Model transformations attempt to map the output distributions of HMMs to a new set of distributions so as to make them a better statistical representation of a new speaker. Mappings of the output distributions of HMMs are very flexible and can provide compensation not only for speaker variability but also for environmental conditions.


2 Speaker Adaptation Techniques

Speaker adaptation is typically undertaken to improve the recognition accuracy of a Large Vocabulary Conversational Speech Recognition (LVCSR) System. In this approach, a speaker dependent acoustic model is obtained by adapting the speaker independent acoustic model to a specific speaker, using only a small amount of speaker specific data, thus enhancing the recognition accuracy close to that of a speaker dependent model.

The following two are the well established methods of Speaker Adaptation:

2.1 Bayesian or MAP Approach

Maximum a Posteriori (MAP) estimation is a general probability distribution estimation approach in which prior knowledge is used in the estimation process; the parameters of the speaker-independent acoustic models form the prior knowledge in this case. This approach requires a large amount of adaptation data and is slow, though optimal.

2.2 Maximum Likelihood Linear Regression

This is a widely used transformation-based speaker adaptation method. The parameters of general acoustic models are adapted to a speaker's voice using a linear regression model estimated by maximum likelihood from adaptation data. However, this method too requires a fairly large amount of adaptation data to be effective.

3 Extension of Standard Adaptation Techniques

The above techniques increase the recognition rate but are computationally intensive. Therefore, efforts are being made to reduce the number of parameters to be computed by exploiting special structure or constraints on the transformation matrices, so that adaptation can be performed with less data.

3.1 Extended MAP

The extended MAP (EMAP) adaptation makes use of information about correlations among parameters [Lasry and Stern, 1984]. Though the adaptation equation makes appropriate use of correlations among adaptation parameters, solution of the equation depends on the inversion of a large matrix, making it computationally intensive.

3.2 Adaptation by Correlation

An adaptation algorithm that used the correlation between speech units, named Adaptation by Correlation (ABC) was introduced [Chen and DeSouza, 1997]. The estimates are derived using least squares theory. It is reported that ABC is more stable than MLLR when the amount of adaptation data is very small.

3.3 Regression Based Model Prediction

Linear regression was applied to estimate parametric relationships among the model parameters and update those parameters for which there was insufficient adaptation data [Ahadi and Woodland, 1997].


3.4 Structured MAP

Structured MAP (SMAP) adaptation was proposed [Shinoda and Lee, 1998], in which the transformation parameters were estimated in a hierarchical structure. The MAP approach helps to achieve a better interpolation of the parameters at each level. Parameters at a given level of the hierarchical structure are used as the priors for the next lower child level. The resulting transformation parameters are a combination of the transformation parameters at all levels. The weights for the combinations are changed according to the amount of adaptation data present. The main benefit of the SMAP adaptation is that automatic control is obtained over the effective cluster size in a fashion that depends on the amount of adaptation data.

3.5 Constraints on Transformation Based Adaptation Techniques

The transformation matrix was constrained to a block diagonal structure, with feature components assumed to have correlation only within a block [Gales et al., 1996]. This reduces the number of parameters to be estimated. However, the block diagonal matrices did not provide better recognition accuracy.

Principal component analysis (PCA) reduces the dimensionality of the data. [Nouza, 1996] used PCA for feature selection in a speech recognition system. [Hu,1999] applied PCA to describe the correlation between phoneme classes for speaker normalization.

The speaker cluster based adaptation approach explicitly uses the characteristics of an HMM set for a particular speaker. [Kuhn et al., 1998] introduced “eigenvoices” to represent the prior knowledge of speaker variation. [Gales et al., 1996] proposed Cluster adaptation training (CAT). The major difference between CAT and Eigenvoice approaches is how the cluster models are estimated.

4 VTLN constrained Model Adaptation

Some researchers have proposed constraints on the transformation matrices for model adaptation, based on the knowledge gained from VTLN domain.

Vocal tract length normalization (VTLN) is one of the most popular methods to reduce inter-speaker variability that arises due to physiological differences in the vocal-tracts. This is especially useful in gender independent systems, since on average the vocal tract is 2-3cm shorter for females than males, causing females’ formant frequencies to be about 15% higher.

VTLN is usually performed by warping the frequency-axis of the spectra of speakers/clusters by appropriate warp factor prior to the extraction of cepstral features. The most common method for finding warp factors for VTLN invokes the maximum likelihood (ML) criterion to choose a warp factor that gives a speaker’s warped observation vectors the highest probability.

However, as the VTLN transformation is typically non-linear, exact calculation of the Jacobian is highly complex and is normally approximated.. Moreover, cepstral features are the predominantly used ones in ASR. This has lead many research groups to explore the possibility of substituting the frequency-warping operation by a linear transformation in the cepstral domain.

[McDonough et al., 1998] proposed a class of transforms which achieve a remapping of the frequency axis similar to conventional VTLN. The bilinear transform (BLT) is a conformal map expressed as

Q(z) = (z − α) / (1 − αz)

where α is real and |α| < 1. The use of the BLT and of a generalization of the BLT known as All-Pass Transforms (APT) was explored for the purpose of speaker normalization. The BLT was found to approximate to a reasonable degree the frequency-domain transformations often used in VTLN. The cepstral-domain linearity of the APT makes speaker normalization easy to implement and produced substantial improvements in the recognition performance of LVCSR.
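A small numerical illustration of the frequency warping induced by the BLT on the unit circle; the example value of α is arbitrary, and the sign and magnitude of α control the direction and amount of warping.

```python
import numpy as np

def blt_warp(omega, alpha):
    """Warped frequency produced by the bilinear transform Q(z) = (z - a)/(1 - a*z),
    evaluated on the unit circle z = exp(j*omega); |a| < 1 keeps Q all-pass."""
    z = np.exp(1j * omega)
    return np.angle((z - alpha) / (1 - alpha * z))

omega = np.linspace(0.0, np.pi, 9)        # original frequencies in [0, pi]
print(np.round(blt_warp(omega, alpha=0.2), 3))
```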

The work was extended to develop a speaker adaptation scheme based on the APT [McDonough and Byrne, 1999], and its performance was compared to the BLT and the MLLR scheme. Using test and training material obtained from the Switchboard corpus, they showed that the performance of the APT-based speaker adaptation was comparable to or better than that of MLLR when 2.5 min of unsupervised data was used for parameter estimation, and that the APT scheme outperformed MLLR when the enrollment data was reduced to 30 sec.

[Claes et al., 1998] devised a method to transform the HMM based acoustic models trained for a particular group of speakers, (say adult male speakers) to be used on another group of speakers (children). The transformations are generated for spectral characteristics of the features from a specific speaker. The warping factors are estimated based on the average third formant. As MFCC involves additional non-linear mapping, linear approximation for the exact mapping was computed by locally linearizing it. With reasonably small warping data, the linear approximation was accurate.

Uebel and Woodland studied the effect of non-linearity on normalization. The linear approximation to the transformation matrix K between the warped and the unwarped cepstra was estimated using the Moore–Penrose pseudo-inverse as

K = (C^T C)^{-1} C^T Ĉ

where Ĉ and C are the column-wise arranged warped and unwarped cepstral feature matrices, respectively. They inferred that the linear-approximation-based and exact-transformation-based VTLN approaches provide similar performance.
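A sketch of estimating K with the pseudo-inverse formula above, assuming the cepstral vectors are arranged one frame per row; the synthetic data is only for illustration.

```python
import numpy as np

def estimate_linear_warp(C, C_hat):
    """Least-squares estimate K = (C^T C)^{-1} C^T C_hat of the linear map between
    unwarped (C) and warped (C_hat) cepstral feature matrices."""
    return np.linalg.pinv(C.T @ C) @ C.T @ C_hat
    # equivalently: np.linalg.lstsq(C, C_hat, rcond=None)[0]

# Hypothetical example: 1000 frames of 13-dimensional cepstra
rng = np.random.default_rng(0)
C = rng.standard_normal((1000, 13))
true_K = np.eye(13) + 0.05 * rng.standard_normal((13, 13))
C_hat = C @ true_K
K = estimate_linear_warp(C, C_hat)          # K recovers true_K up to numerical error
```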

[Pitz et al., 2001] concluded that vocal tract normalization can always be expressed as a linear transformation of the cepstral vector for arbitrary invertible warping functions. They derived the analytic solution for the transformation matrix for the case of piece-wise linear warping function.

It was shown [Pitz and Ney, 2003] that vocal tract normalization can also be expressed as a linear transformation of Mel Frequency Cepstral Coefficients (MFCC). An improved signal analysis in which Mel frequency warping and VTN are integrated into the DCT is mathematically simple and gives comparable recognition performance. Using special properties of typical warping functions, it is shown that the transformation matrix can be approximated by a tridiagonal matrix for MFCC. The computation of the transformation matrix for VTN makes it possible to properly normalize the probability distribution with the Jacobian determinant of the transformation. Importantly, they infer that VTN amounts to a special case of MLLR, which explains the experimental result that the improvements in speech recognition obtained by VTN and MLLR are not additive.

In the above cases [McDonough et al., 1998; Pitz and Ney, 2005], the derivation is done for continuous speech spectra and requires the transformation to be analytically calculated for each warping function. The direct use of the transformation for discrete samples will result in aliasing error.


The above-mentioned relationships are investigated in the discrete frequency space [Cui and Alwan, 2005]. It is shown that, for MFCCs computed with Mel-scaled triangular filter banks, a linear relationship can be obtained if certain approximations are made. Utilizing that relationship as a special case of MLLR, an adaptation approach based on formant-like peak alignment is proposed where the transformation of the means is performed deterministically based on the linearization of VTLN. Biases and adaptation of the variances are estimated statistically by the EM algorithm.

The formant-like peak alignment algorithm is used to adapt adult acoustic models to children’s speech. Performance improvements are reported compared to traditional MLLR and VTLN.

In the APT and piece-wise linear warping based VTLN approaches discussed above, the warping is expressed as a linear transformation in linear cepstra only. In the case of Mel cepstra, the warping function cannot be expressed as a linear transformation unless linear approximations are used [Claes et al., 1998; Pitz and Ney, 2005].

In the shift based speaker normalization approach [Sinha and Umesh, 2007], warping is effected through shifting of the warped spectra, and therefore the warping function can easily be expressed as exact linear transformation in feature (Mel cepstral) domain.

The derivation of the transformation matrices [Sinha, 2007], relating the cepstral features in shift based speaker normalization method is straightforward and much simpler than those suggested in the above said methods.

The cepstral domain linear transformations were used to adapt the Gaussian distributions of the HMM models in a simple and straightforward manner, as an alternative to normalizing the acoustic feature vectors. This approach differs from other methods in that the transformation matrix is not estimated completely from adaptation data but rather selected from a set of matrices which are known a priori and fully determined. Therefore this approach leads to a highly constrained adaptation of the model.

If only the means of the model are adapted, this can be considered a highly constrained version of standard MLLR.

In principle this approach allows different VTLN transformations to be applied to different groups of models, as in regression class approach. Thus the constraint of choosing a single warping function for all speech models is relaxed unlike in VTLN.

Better normalization performance of the model adaptation based speaker compensation is reported for children speech compared to conventional feature- transformation approach.

The experiments conducted by [Sinha, 2007] show that VTLN constrained model adaptation approach does not significantly vary in performance with the amount of adaptation data. Hence the strengths of this approach compared to conventional MLLR merits further exploration.

5 Conclusion

The review of various speaker adaptation techniques has led to the following observations.

• The cepstral domain linearity of APT makes speaker normalization easy to implement and produced substantial improvements in recognition performance of LVCSR.


• Vocal tract normalization can be expressed as a linear transformation of the cepstral vector for arbitrary invertible warping functions.

• Vocal tract normalization can also be expressed as linear transformation of Mel Frequency Cepstral coefficients (MFCC) with the help of some approximations. The transformations are to be analytically calculated for each warping function.

• The shift in the speech scale required for speaker normalization can be equivalently performed through a suitable linear transformation in the cepstral domain. The derivation of the Transformation matrix is simple.

• VTLN amounts to a special case of MLLR, which explains the experimental result that the improvements in speech recognition obtained by VTLN and MLLR are not additive.

• The relative strengths of VTLN constrained model adaptation compared to MLLR, merits exploration as its performance does not significantly vary with the amount of adaptation data.

References

[1] [Huang and Lee, 1991] Huang, X. and Lee, K. On Speaker-Independent, Speaker-Dependent and Speaker-Adaptive Speech Recognition. Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 877–880.

[2] [McDonough and Byrne 1999] McDonough, J. and Byrne, W. Speaker Adaptation with All-Pass Transforms. Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing. 1999.

[3] [Pitz and Ney, 2003] Pitz, M. and Ney, H. Vocal Tract Normalization as Linear Transformation of MFCC. Proc. of EUROSPEECH. 2003.

[4] [Cui and Alwan, 2005] Cui, X. and Alwan, A. MLLR-like Speaker Adaptation based on Linearization of VTLN with MFCC features. Proc. of EUROSPEECH.2005.

[5] [Sinha and Umesh, 2007] Sinha, R. and Umesh, S. A Shift based Approach to Speaker Normalization using Non-linear Frequency Scaling Model. Speech Communication. 2007, doi:10.1016/j.specom.2007.08.002.

[6] [Sinha, 2004] Sinha, R. Front-End Signal Processing for Speaker Normalization for Speech Recognition. Ph.D. thesis, I.I.T. Kanpur, 2004.


Text Clustering Based on WordNet and LSI

Nadeem Akhtar, Aligarh Muslim University, Aligarh-202002, [email protected]
Nesar Ahmad, Aligarh Muslim University, Aligarh-202002, [email protected]

Abstract

Text clustering plays an important role in the retrieval and navigation of documents in several web applications. Text documents are clustered on the basis of the statistical and semantic information they share. This paper presents experiments on text clustering based on semantic word similarity. Semantic similarities among documents are found using both WordNet and Latent Semantic Analysis (LSI). WordNet is a lexical database, which provides semantic relationships such as synonymy and hypernymy among words; two words are taken as semantically similar if at least one WordNet synset of the two is the same. LSI is a technique which brings out the latent statistical semantics in a collection of documents; it uses higher-order term co-occurrence to find the semantic similarity among words and documents. The proposed clustering technique uses WordNet and LSI to find the semantic relationships among words in the document collection. Based on these relationships, sets of strongly related words are selected as keyword sets, which are then used to build the document clusters.

1 Introduction

Clustering [Graepel, 1998] is used in a wide range of topics and areas. Clustering techniques are used in pattern recognition and pattern matching, artificial intelligence, web technology, and learning. Clustering improves the accuracy of search and retrieval of web documents in search engine results. Several algorithms and methods, such as suffix tree, fuzzy C-means and hierarchical clustering [Zamir et al., 1998], [Bezdek, 1981], [Fasulo, 1999], have been proposed for text clustering. In most of them, a document is represented using the Vector Space Model (VSM) as a vector in an n-dimensional space. Different words are given importance according to criteria like term frequency-inverse document frequency (tf-idf). These methods consider the document as a bag of words, where single words are used as features for representing the documents and are treated independently; they ignore the semantic relationships among them. Moreover, document vectors have a very large number of dimensions.

Semantic information can be incorporated by employing ontologies like WordNet [Wang et al.]. In this paper, to cluster a given collection of documents, semantic relationships among all the words present in the document collection are identified. These relationships are found using WordNet and Latent Semantic Analysis (LSI).

Semantic information among words is found using the WordNet [Miller et al., 1990] dictionary. WordNet contains words organized into synonym sets called synsets. The semantic similarity between two words is found by considering their associated synsets: if at least one synset is common, the two words are considered semantically the same.

Latent Semantic Analysis (LSI) [Berry et al., 1996] is a technique which finds the latent semantic relationships among words in the document collection by exploiting higher-order co-occurrence among words, so two words may be related even if they do not occur in the same document. LSI also reduces the dimensions, resulting in a richer word relationship structure that reveals the latent semantics hidden in the document collection.

To find word semantic relationships, we adopt two different approaches. In the first approach, we first run the LSI algorithm to get word–word relationships and then the WordNet dictionary is used to find semantically similar words. In the second approach, we first find the semantically similar words and then LSI is used to get word–word relationships. Using the relationships among words, sets of strongly related words are selected; these are then used to build the document clusters. In the end, both approaches are evaluated against the same document collection and the results are compared.

The paper is organized in five sections. After the first introductory section, second section defines how WordNet and LSI are used separately to find semantic word similarity. Section 3 relates to combined use of LSI and WordNet. Section 4 presents results and discussion. In section 5, conclusion and future work is presented.

2 Word Semantic Relationship

The proposed clustering method combines both statistical and semantic similarity to cluster the document set. The sets of strongly related words in the document collection are identified using LSI and WordNet. LSI derives the semantic relationships among words based on the statistical co-occurrence of the words, and is thus based on knowledge specific to the document collection, whereas WordNet provides general domain knowledge about words. These word-sets are used to identify the document clusters. There are n documents d1, d2, …, dn and m distinct terms t1, t2, …, tm in the document collection.

2.1 Preprocessing

Stop words (both natural, like the, it, where, etc., and domain dependent) are removed from all the documents. Words which are too frequent or too rare do not help in the clustering process, so they are also removed by deleting all words whose frequency of occurrence in the documents is outside a predefined frequency range (fmin–fmax). Each document di is represented as a vector of term frequency-inverse document frequency (tf-idf) weights of the m terms, di = (w1i, w2i, …, wmi), where wji is the weight of term j in document i. We first find the relationships among all the words in the document collection. The relationship between two words is based on the occurrence of those words in the document collection, found using LSI, and on the semantic relationship derived from the WordNet dictionary. This involves the construction of a term correlation matrix R over all the words included in the document vectors; by constructing this matrix, the relationships among different words in the documents are captured.


2.2 WordNet

WordNet is a large lexical database of English, developed under the direction of George A. Miller at Princeton University. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. Despite having several types of lexical relations, WordNet is heavily grounded in its taxonomic structure, which employs the IS-A inheritance (hyponymy/hypernymy) relation.

We can view WordNet as a graph where the synsets are vertices and the relations are edges. Each vertex contains a gloss that expresses its semantics and a list of words that can be used to refer to it. Noun synsets are the most dominant type of synsets; there are 79689 noun synsets, which correspond to almost 70% of the synsets in WordNet.

To find semantic relationship between two words we use noun and verb synsets.

In the word similarity calculation using WordNet, similarity is based on common synonym sets (synsets). If two words share at least one synset, they are considered similar [Chua et al., 2004]. For example, the noun synsets for the words buff, lover and hater are shown in Table 1.

Table 1: WordNet Synset Similarity

Word    Synsets
Buff    Sense 1: fan, buff, devotee, lover
        Sense 2: buff
        Sense 3: buff
        Sense 4: yellowish brown, raw sienna, buff, caramel, caramel brown
        Sense 5: buff, buffer
Lover   Sense 1: lover
        Sense 2: fan, buff, devotee, lover
        Sense 3: lover
Hater   Sense 1: hater

From Table 1, we can see that there are two identical synsets: sense 1 of 'buff' and sense 2 of 'lover' are the same synset with identical synonyms, so the words 'buff' and 'lover' are considered semantically the same. In contrast, there is no synset match for the word pairs buff–hater and lover–hater, so these pairs are not similar.

Word sense disambiguation (WSD) is not performed in this approach. Although WSD would be advantageous for identifying the correct clusters, we have not used it, in order to keep the clustering approach simple.
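A minimal sketch of the shared-synset test using NLTK's WordNet interface (assuming the NLTK WordNet corpus is available); noun and verb synsets are used, as described above.

```python
from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus to be downloaded

def semantically_same(word1, word2):
    """Two words are treated as semantically the same if their noun/verb synsets overlap."""
    synsets1 = set(wn.synsets(word1, pos=wn.NOUN) + wn.synsets(word1, pos=wn.VERB))
    synsets2 = set(wn.synsets(word2, pos=wn.NOUN) + wn.synsets(word2, pos=wn.VERB))
    return len(synsets1 & synsets2) > 0

print(semantically_same('buff', 'lover'))   # True  -- they share the 'fan/buff/devotee/lover' synset
print(semantically_same('buff', 'hater'))   # False -- no common synset
```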

2.3 Latent Semantic Analysis (LSI)

Latent Semantic Analysis is a well-known learning algorithm which is mainly used in searching, retrieval and filtering. LSI is based on a mathematical technique called Singular Value Decomposition (SVD). LSI is used to reduce the dimensions of the documents in the document collection; it removes the less important dimensions which introduce noise into the clustering process.


In our approach, we have used LSI to calculate a term-by-term matrix, which gives the hidden semantic relationships among the words in the document collection. In LSI, the term-by-document matrix M is decomposed into three matrices: a term-by-dimension matrix U, a dimension-by-dimension singular matrix Σ, and a document-by-dimension matrix V:

M = U Σ V^T

Σ is a diagonal matrix whose diagonal represents the eigenvalues. In the dimensionality reduction, the highest k eigenvalues are retained and the rest are ignored:

M_k = U_k Σ_k V_k^T

The value of the constant k must be chosen carefully for each document set. The term-by-term matrix TT is then calculated as [Kontostathis et al., 2006]:

TT = (U_k Σ_k)(U_k Σ_k)^T

The value at position (i, j) of the matrix represents the relationship between word i and word j in the document collection. This method exploits the transitive relations among words: if word A co-occurs with word B and word B co-occurs with word C, then words A and C will be related because there is a second-order co-occurrence between them. The method explores all the higher-order term co-occurrences of the two terms, thus providing relationship values that cover all connectivity paths between them. Words that are used in the same context are given high relationship values even though they are not used together in the same document.

LSI also reduces the dimensionality by neglecting dimensions associated with lower singular values. Only the dimensions associated with high singular values are kept; in this way LSI removes the noisy dimensions.
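As an illustration of the decomposition and the term by term matrix described above, the following numpy sketch computes TT = (Uk Σk)(Uk Σk)T for a toy term by document matrix (the toy matrix and the value of k are arbitrary):

import numpy as np

def term_by_term(M, k):
    # truncated SVD of the term-by-document matrix M (terms are rows, documents are columns)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Uk = U[:, :k]            # keep the k dimensions with the largest singular values
    Sk = np.diag(s[:k])
    T = Uk @ Sk
    return T @ T.T           # entry (i, j) is the relationship between word i and word j

M = np.array([[1.0, 1.0, 0.0],     # toy 4-term x 3-document matrix
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
print(term_by_term(M, k=2).round(2))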

3 Coupling WordNet and LSI

WordNet and LSI both provide semantic relationships among words. WordNet information is purely semantic, that is, synonym sets are used to match the words. LSI semantic information is based on statistical data: it finds the hidden semantic relationships among words by exploiting word co-occurrence in the document collection.

We have adopted two different approaches to couple WordNet and LSI word relationships. In the first approach, WordNet is used before LSI. Every word of a document is compared with every word of the other documents using the WordNet synset approach. If two words are found similar, each word is added to the document vector of the other document. For example, suppose document D1 contains word w1 and document D2 contains word w2. If words w1 and w2 share some synset in WordNet, then w2 is added to the document vector of D1 and w1 is added to the document vector of D2. In this way, the semantic relationship between two words is converted into a statistical relationship, because those two words now co-occur in two different documents. After that we use LSI: the document by term matrix is formed from the document vectors, and transposing it gives the term by document matrix M. From the matrix M, we get the term by term matrix TT using the above-mentioned method. We call this approach WN-LSI.
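A minimal sketch of this WN-LSI enrichment step, assuming documents are represented simply as sets of words (a simplification of the document vectors) and NLTK's WordNet corpus is available:

from itertools import combinations
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def share_synset(w1, w2):
    return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

def wn_enrich(doc_vectors):
    # if a word of one document shares a synset with a word of another document,
    # add each word to the other document's vector
    for dA, dB in combinations(doc_vectors, 2):
        add_A, add_B = set(), set()
        for w1 in dA:
            for w2 in dB:
                if w1 != w2 and share_synset(w1, w2):
                    add_A.add(w2)
                    add_B.add(w1)
        dA |= add_A
        dB |= add_B
    return doc_vectors

docs = [{'car', 'engine'}, {'automobile', 'road'}]
print(wn_enrich(docs))   # 'automobile' joins the first vector, 'car' the second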

In the second approach, we use LSI before WordNet. From the document vectors we form the term by document matrix, which is used to get the matrix TT by applying LSI. After that, for every pair of words, the semantic relationship from WordNet is found. If a match occurs, the corresponding entry in the matrix TT is updated. We call this approach LSI-WN.


Next, sets of strongly related words are found. For this, a depth-first-search graph traversal algorithm is used. The term by term matrix TT is seen as the representation of a graph containing m nodes, where the (i, j) entry in TT represents the label of the edge between nodes i and j. In the traversal of this graph, only those edges are traversed whose label is greater than a predefined value α. The independent components of the graph identify the different keyword sets. By setting the value of α, we can control the number of keyword sets c.

Next, each document is compared with each keyword set. If a document contains more than µ% of the words in the keyword set, the document is assigned to the cluster associated with that keyword set. In this way each cluster has those documents in which the words of the associated keyword set are frequent. Documents generally cover several topics with different strengths; as a result, documents in distinct clusters may overlap. To avoid nearly identical clusters, similar clusters are merged together. For this purpose the similarity among clusters is calculated as:

Sc1,c2 = | Nc1 ∩ Nc2 | / max(Nc1, Nc2)

where Sc1,c2 is the similarity between clusters c1 and c2, Nc1 is the number of documents in cluster c1, Nc2 is the number of documents in cluster c2, and Nc1 ∩ Nc2 is the number of documents common to clusters c1 and c2.

If Sc1,c2 is greater than β, clusters c1 and c2 are merged.
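The keyword-set extraction and cluster merging just described can be sketched as follows (a simplified illustration; the set-based document representation and treating µ as a fraction are assumptions):

import numpy as np

def keyword_sets(TT, words, alpha):
    # depth-first traversal of the TT graph, following only edges whose weight exceeds alpha;
    # each connected component becomes one keyword set
    n = len(words)
    visited = [False] * n
    result = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(words[i])
            for j in range(n):
                if not visited[j] and TT[i, j] > alpha:
                    visited[j] = True
                    stack.append(j)
        result.append(set(component))
    return result

def assign_and_merge(docs, kw_sets, mu, beta):
    # a document joins a keyword set's cluster when more than a fraction mu of the
    # set's words occur in it; clusters whose overlap similarity exceeds beta are merged
    clusters = [{i for i, d in enumerate(docs) if len(d & ks) / len(ks) > mu} for ks in kw_sets]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] and clusters[j]:
                    s = len(clusters[i] & clusters[j]) / max(len(clusters[i]), len(clusters[j]))
                    if s > beta:
                        clusters[i] |= clusters[j]
                        clusters[j] = set()
                        merged = True
    return [c for c in clusters if c]

TT = np.array([[1.0, 0.9, 0.1, 0.0],
               [0.9, 1.0, 0.0, 0.1],
               [0.1, 0.0, 1.0, 0.8],
               [0.0, 0.1, 0.8, 1.0]])
words = ['car', 'auto', 'food', 'drink']
docs = [{'car', 'auto', 'road'}, {'food', 'drink', 'menu'}]
ks = keyword_sets(TT, words, alpha=0.60)             # [{'car', 'auto'}, {'food', 'drink'}]
print(assign_and_merge(docs, ks, mu=0.15, beta=0.20))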

4 Experiments and Results

For evaluation purposes, we performed experiments using the mini 20NewsGroup document collection, which is a subset of the 20NewsGroup document collection. This collection contains 2000 documents categorized into 20 categories. We also downloaded some documents from the Web and performed experiments on them.

The values of the various parameters used in the experiments are specified below:

α   Inter-word similarity                      0.60
β   Inter-cluster similarity                   0.20
µ   Similarity between document and cluster    0.15

For the second experiment, documents are taken from three different categories, two of which are further divided into subcategories (sub-categories are listed within brackets in Table 2).

For the first experiment, we chose documents from 6 different categories of the mini 20NewsGroup data set.

Table 2 Document Sets

Document Set A1                                       Number
Computer                                              08
Food (fish, yoke, loaf, cheese, beverage, water)      25
Automobile (car, bus, roadster, auto)                 17

Document Set A2                                       Number
rec.sport.baseball                                    10
sci.electronics                                       10
sci.med                                               20
talk.politics.guns                                    10
talk.religion.misc                                    10


Experiments on these two document sets are performed by running three different programs named LSI, WN-LSI and LSI-WN. In the LSI, we have used only latent semantic analysis. WN-LSI and LSI-WN are as defined in section 3.

For document set 1, LSI produces 12 clusters, WN-LSI produces 4 clusters and LSI-WN produces 24 clusters. LSI produces 6 clusters that belong correctly to subcategories. The remaining 6 clusters contain a mix of documents from the subcategories, but all documents in a cluster belong to one of the three main categories: computer, food and automobile. WordNet enhances the similarity relationship between the following word pairs: (picture, image), (data, information), (memory, store), (disk, platter), etc. in the computer category; (car, automobile), (transit, transportation), (travel, journey), (trip, travel) in the automobile category; and (drink, beverage), (digest, stomach), (meat, substance), (nutrient, food) in the food category. The first two clusters produced by WN-LSI belong to the food category and the next two belong to the automobile category. 13 documents are placed in wrong clusters.

For the document set 2, LSI produces 9 clusters, WN-LSI produces 5 clusters and LSI-WN produces 30 clusters.

The number of clusters produced by LSI-WN is quite large for both document sets. A possible reason is that WordNet might have introduced noise into the word relationships after the singular value decomposition.

WN-LSI performed somewhat satisfactorily. WordNet enriched the document vectors with the common synsets, which helped LSI obtain strong relationships among relevant words for identifying relevant clusters.

5 Conclusions and Future Work

This paper describes a method of incorporating knowledge from WordNet dictionary and Latent Semantic Analysis into text document clustering task. The method uses the WordNet synonyms and LSI to build the term relationship matrix.

Overall results of the experiments performed on the document sets are quite disappointing. WN-LSI gives somewhat better results than LSI but LSI-WN fails to perform.

It seems that the major weakness of our approach is the keyword set selection procedure. Some of the keyword sets are quite relevant and produce good clusters, but some of the keyword sets contain words that span multiple categories. Also, some documents are not assigned to any keyword set.

Clustering results for a document set are highly dependent on the values of the various parameters (α, β, µ). Changing these parameters severely affects the number of initial clusters, so they must be chosen carefully for each document set.

Results are also affected by the polysemy problem. Two words are held to be similar if they share at least one synset, but the senses of the words may be different in the documents.

There is a lot of scope for improvement in our approach. One area for the future work is the incorporation of other WordNet relations like hypernyms. Use of a word sense disambiguation technique may certainly improve the clustering results.


References

[1] [Berry, et al 1996] Berry, M. et al. SVDPACKC (Version 1.0) User's Guide, University of Tennessee Tech. Report CS-93-194, 1993 (Revised October 1996).

[2] [Bezdek, 1981] J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. New York, 1981.

[3] [Chua et al, 2004] Chua S, Kulathuramaiyer N, “Semantic Feature Selection Using WordNet”, proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI’04).

[4] [Fasulo, 1999] D. Fasulo. An analysis of recent work on clustering algorithms. Technical report # 01-03-02, 1999.

[5] [Graepel, 1998] T. Graepel. Statistical physics of clustering algorithms. Technical Report 171822, FB Physik, Institut fur Theoretische Physic, 1998.

[6] [Hotho et al, 2003] A. Hotho, S. Staab, and G. Stumme, Wordnet improves Text Document Clustering, Proc. of the Semantic Web Workshop at SIGIR-2003, 26th Annual International ACM SIGIR Conference, 2003.

[7] [Kontostathis et al, 2006] Kontostathis A, Pottenger W M, “A Framework for Understanding Latent Semantic Indexing (LSI) performance”, International journal of Information processing and management 42 (2006) 56-73.

[8] [Miller et al., 1990] Miller et al “Introduction to WordNet: An On-line Lexical Database”, International Journal of Lexicography 1990 3(4):235-244.

[9] [Wang et al] Y Wang, J Hodges, “Document Clustering with Semantic Analysis”, Proceedings of the 39th Hawaii international conference on system sciences.

[10] [Zamir et al, 1998] Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility Demonstration SIGIR’98, Melbourne, Australia. 1998 ACM 1-58113-015-5 8/98


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Cheating Prevention in Visual Cryptography

Gowriswara Rao G. C. Shoba Bindu Dept. of Computer Science JNTU College of Engg. JNTU College of Engg. Anantapur-515002 Anantapur-515002 [email protected] [email protected]

Abstract

Visual cryptography (VC) is a method of encrypting a secret image into shares such that stacking a sufficient number of shares reveals the secret image. Shares are usually presented in transparencies. Each participant holds a transparency. VC focuses on improving two parameters: pixel expansion and contrast. In this paper, we studied the cheating problem in VC and extended VC. We considered the attacks of malicious adversaries who may deviate from the scheme in any way. We presented three cheating methods and applied them on attacking existent VC or extended VC schemes. We improved one cheat-preventing scheme. We proposed a generic method that converts a VCS to another VCS that has the property of cheating-prevention.

1 Introduction

Visual Cryptography is a cryptographic technique which allows visual information (pictures, text, etc.) to be encrypted in such a way that the decryption can be performed by the human visual system, without the aid of computers.

Visual cryptography was pioneered by Moni Naor and Adi Shamir in 1994. They demonstrated a visual secret sharing scheme, where an image was broken up into n shares so that only someone with all n shares could decrypt the image, while any n-1 shares revealed no information about the original image. Each share was printed on a separate transparency, and decryption was performed by overlaying the shares. When all n shares were overlaid, the original image would appear.

Using a similar idea, transparencies can be used to implement a one-time pad encryption, where one transparency is a shared random pad, and another transparency acts as the cipher text.

2 Visual Cryptography Scheme

The secret image consists of a collection of black and white pixels. To construct n shares of an image for n participants, we need to prepare two collections, C0 and C1, which consist of n x m Boolean matrices. A row in a matrix in C0 and C1 corresponds to the m subpixels of a pixel, where 0 denotes a white subpixel and 1 denotes a black subpixel. For a white (or black) pixel in the image, we randomly choose a matrix M from C0 (or C1, respectively) and assign row i of M to the corresponding position of share Si, 1 ≤ i ≤ n. Each pixel of the original image will be encoded into n pixels, each of which consists of m subpixels on each share.


Since a matrix in C0 or C1 constitutes only one pixel for each share, for security the number of matrices in C0 and C1 must be huge. For a succinct description and easier realization of the VC construction, we do not construct C0 and C1 directly. Instead, we construct two n x m basis matrices S0 and S1 and then let C0 and C1 be the sets of all matrices obtained by permuting the columns of S0 and S1, respectively.

Let OR(B, X) be the vector of the bitwise OR of rows i1, i2, …, iq of B, where B is an n x m Boolean matrix and X = {Pi1, Pi2, …, Piq} is a set of participants. Let w(v) be the Hamming weight of a row vector v. For brevity, we let w(B, X) = w(OR(B, X)). Let Pb(S) = w(v)/m, where v is a black pixel in share S and m is the dimension of v. Similarly, Pw(S) = w(v)/m, where v is a white pixel in share S. Note that all white (or black) pixels in a share have the same Hamming weight. We use “Si + Sj” to denote “the stacking of shares Si and Sj.” The “stacking” corresponds to the bitwise-OR operation “+” of subpixels in shares Si and Sj.
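For illustration, the following sketch implements the smallest case, a (2,2)-VCS with pixel expansion m = 2, using the standard basis matrices for that case (the basis matrices and the toy secret are illustrative assumptions, not taken from this paper); stacking is simulated by the bitwise OR of subpixels:

import numpy as np

rng = np.random.default_rng()

# basis matrices of a (2,2)-VCS with m = 2 subpixels; row i becomes share i
S0 = np.array([[1, 0],
               [1, 0]])    # white secret pixel: identical subpixels on both shares
S1 = np.array([[1, 0],
               [0, 1]])    # black secret pixel: complementary subpixels

def encrypt_pixel(bit):
    # a random column permutation of the basis matrix, i.e. a random member of C0 or C1
    basis = S1 if bit else S0
    return basis[:, rng.permutation(basis.shape[1])]

def encrypt(secret):
    # secret: 2-D 0/1 array; returns two shares, each expanded m = 2 times horizontally
    h, w = secret.shape
    shares = [np.zeros((h, 2 * w), dtype=int) for _ in range(2)]
    for r in range(h):
        for c in range(w):
            block = encrypt_pixel(secret[r, c])
            for i in range(2):
                shares[i][r, 2 * c: 2 * c + 2] = block[i]
    return shares

def stack(si, sj):
    # "Si + Sj": stacking transparencies is the bitwise OR of subpixels
    return np.bitwise_or(si, sj)

secret = np.array([[0, 1],
                   [1, 0]])
sh1, sh2 = encrypt(secret)
recovered = stack(sh1, sh2)   # black pixels give 2 black subpixels, white pixels only 1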

3 Image Secret Sharing

Unlike visual secret sharing schemes which require the halftoning process to encrypt gray-scale or color visual data, image secret sharing solutions operate directly on the bit planes of the digital input. The input image is decomposed into bit-levels which can be viewed as binary images. Using the k,n threshold concept, the image secret sharing procedure encrypts individual bit-planes into the binary shares which are used to compose the share images with the representation identical to that of the input image. Depending on the number of the bits used to represent the secret (input) image, the shares can contain binary, gray-scale or color random information. Thus, the degree of protection afforded by image secret sharing methods increases with the number of bits used to represent the secret image.

(Figure: Secret gray-scale image = Share 1 + Share 2 = Decrypted image)

The decryption operations are performed on decomposed bit-planes of the share images. Using the contrast properties of the conventional k,n-schemes, the decryption procedure uses shares' bits to recover the original bit representation and compose the secret image. The decrypted output is readily available in a digital format, and is identical to the input image. Because of the symmetry constraint imposed on the encryption and decryption process, image secret sharing solutions hold the perfect reconstruction property. This feature in conjunction with the overall simplicity of the approach make this approach attractive for real-time secret sharing based encryption/decryption of natural images.
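A minimal sketch of the bit-level decomposition and perfect reconstruction described above (the actual encryption of each plane into binary shares is omitted):

import numpy as np

def bit_planes(img):
    # decompose an 8-bit grayscale image into 8 binary images; plane 0 is the LSB
    return [(img >> b) & 1 for b in range(8)]

def recompose(planes):
    # perfect reconstruction property: summing the weighted planes restores the input exactly
    return sum(p.astype(int) << b for b, p in enumerate(planes)).astype(np.uint8)

img = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
assert np.array_equal(recompose(bit_planes(img)), img)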

4 Cheating in Visual Cryptography

For cheating, a cheater presents some fake shares such that the stacking of fake and genuine shares together reveals a fake image. There are two types of cheaters in Visual Cryptography. One is a malicious participant (MP) who is also a legitimate participant and the other is a malicious outsider (MO). A cheating process against a VCS consists of the following two phases:


a. Fake share construction phase: the cheater generates the fake shares;

b. Image reconstruction phase: the fake image appears on the stacking of genuine shares and fake shares.

5 Cheating Methods

A VCS would be more helpful if the shares are meaningful or identifiable to every participant. A VCS with this extended characteristic is called an extended VCS (EVCS). An EVCS is like a VCS except that each share displays a meaningful image, which will be called the share image.

There are three types of cheating methods: the first cheating method is initiated by an MP, while the second is initiated by an MO; both of them apply to attacking VC. Our third cheating method is initiated by an MP and applies to attacking EVC.

A Cheating a VCS by an MP

The cheating method CA-1, depicted in Fig. 1, applies to attack any VCS. Without loss of generality, we assume that P1 is the cheater. Since the cheater is an MP, he uses his genuine share as a template to construct a set of fake shares which are indistinguishable from his genuine share. The stacking of these fake shares and S1 reveals the fake image in perfect blackness. We see that, for Y = {Pi1, Pi2, …, Piq} that does not belong to Q, the stacking of their shares reveals no image. Thus, the stacking of their shares and the fake shares reveals the fake image due to the perfect blackness of the fake image.

Example: Fig. 2 shows how to cheat the participants in a (4,4)-VCS. There are four shares S1, S2, S3, and S4 in the (4,4)-VCS. P1 is assumed to be the MP. By CA-1, one fake share FS1 is generated. Since Y = {P1, P3, P4} (or {P1, P2}) does not belong to Q, we see that S1 + FS1 + S3 + S4 reveals the fake image FI.


Fig. 1: Cheating method CA-1, initiated by an MP.

Fig. 2: Example of cheating a (4,4)-VCS by an MP.

B Cheating a VCS by an MO

Our second cheating method CA-2, depicted in Fig. 3, demonstrates that an MO can cheat even without any genuine share at hand. The idea is as follows. We use the optimal (2,2)-VCS to construct the fake shares for the fake image. Then, we tune the size of the fake shares so that they can be stacked with the genuine shares.

Now, the only problem is to have the right share size for the fake shares. Our solution is to try all possible share sizes. In the case that the MO gets one genuine share, there will be no such problem. It may seem difficult to have fake shares of the same size as that of the genuine shares. We give a reason to show the possibility. The shares of a VCS are usually printed in transparencies. We assume that this is done by a standard printer or copier which accepts only a few standard sizes, such as A4, A3, etc. Therefore, the size of genuine shares is a fraction, such as 1/4, of a standard size. We can simply have the fake shares of these sizes. Furthermore, it was suggested to have a solid frame to align shares in order to solve the alignment problem during the image reconstruction phase. The MO can simply choose the size of the solid frame for the fake shares. Therefore, it is possible for the MO to have the right size for the fake shares.

Example: Fig. 4 shows that an MO cheats a (4, 4)-VCS. The four genuine shares S1, S2, S3, and S4 are those in Fig. 2 and the two fake shares are FS1 and FS2. For clarity, we put S1 here to demonstrate that the fake shares are indistinguishable from the genuine shares. We see that the stacking of fewer than four genuine shares and two fake shares shows the fake image FI.

Fig. 3: Cheating method CA-2, initiated by an MO.


Fig. 4: Example of cheating a (4, 4)-VCS by an MO.

C Cheating an EVCS by an MP

In the definition of VC, it only requires the contrast be nonzero. Nevertheless, we observe that if the contrast is too small, it is hard to “see” the image. Based upon this observation, we demonstrate the third cheating method CA-3, depicted in Fig. 5, against an EVCS. The idea of CA-3 is to use the fake shares to reduce the contrast between the share images and the background. Simultaneously, the fake image in the stacking of fake shares has enough contrast against the background since the fake image is recovered in perfect blackness.

Example: Fig. 6 shows the results of cheating a (T, m)-EVCS, where P = {P1, P2, P3} and Q = {{P1, P2}, {P2, P3}, {P1, P2, P3}}. In this example, P1 is the cheater, who constructs a fake share FS2 with share image B as a substitute for P2's share in order to cheat P3. S1 + FS2 + S3 reveals the fake image FI.

Fig. 5: Cheating method CA-3 against an EVCS.


Fig. 6: Example of cheating a (T,m)-EVCS.

6 Cheat-Preventing Methods

There are two types of cheat-preventing methods. The first type is to have a trusted authority (TA) verify the shares of the participants. The second type is to have each participant verify the shares of the other participants. In this section, we present attacks on, and an improvement of, four existent cheat-preventing methods.

Attack on Yang and Laih’s Cheat-Preventing Methods

The first cheat-preventing method of Yang and Laih needs a TA to hold the special verification share for detecting fake shares. The second cheat-preventing method of Yang and Laih is a transformation of a (T, m)-VCS (but not a (2, n)-VCS) to another cheat-preventing (T, m + n(n-1))-VCS.

Attacks on Horng et al.’s Cheat-Preventing Methods

In the first cheat-preventing method of Horng et al., each participant Pi has a verification share Vi. The shares Si are generated as usual. Each Vi is divided into n−1 regions Ri,j, 1 ≤ j ≤ n, j ≠ i. Each region Ri,j of Vi is designated for verifying share Sj. The region Ri,j of Vi + Sj shall reveal the verification image for Pi verifying the share Sj of Pj. The verification image in Ri,j is constructed by a (2,2)-VCS. Although the method requires that the verification image be confidential, it is still possible to cheat.

Horng et al.'s second cheat-preventing method uses the approach of redundancy. It uses a (2, n + l)-VCS to implement a (2, n)-VCS cheat-preventing scheme. The scheme needs no on-line TA for verifying shares. The scheme generates n + l shares by the (2, n + l)-VCS for some integer l > 0, but distributes only n shares to the participants; the rest of the shares are destroyed. They reason that since the cheater does not know the exact basis matrices even with all shares, the cheater cannot succeed.


7 Generic Transformation for Cheating Prevention

By the attacks and improvement in previous sections, we propose that an efficient and robust cheat-preventing method should have the following properties.

a. It does not rely on the help of an on-line TA. Since VC emphasizes easy decryption with the human eyes only, we should not have to rely on a TA to verify the validity of shares.

b. The increase to pixel expansion should be as small as possible.

c. Each participant verifies the shares of other participants. This is somewhat necessary because each participant is a potential cheater.

d. The verification image of each participant is different and confidential. It spreads over the whole region of the share. We have shown that this is necessary for avoiding the described attacks.

e. The contrast of the secret image in the stacking of shares is not reduced significantly in order to keep the quality of VC.

f. A cheat-preventing method should be applicable to any VCS.

Example: Fig. 8 shows a transformed (T, m + 2)-VCS with cheating prevention, where P = {P1, P2, P3} and Q = {{P1, P2}, {P2, P3}, {P1, P2, P3}}. The verification images for participants P1, P2, and P3 are A, B, and C, respectively.

Fig. 7: Generic transformation for VCS with cheating prevention


Fig. 8: Example of a transformed VCS with cheating prevention.

8 Conclusion

The proposed system explained three cheating methods against VCS and EVCS, examined previous cheat-preventing schemes, and found that they are not robust enough and are still improvable. The system presents an improvement on one of these cheat-preventing schemes and finally proposes an efficient transformation of VCS for cheating prevention. This transformation incurs minimum overhead on contrast and pixel expansion: it adds only two subpixels for each pixel in the image, and the contrast is reduced only slightly.

References

[1] Chih-Ming Hu and Wen-Guey Tzeng, “Cheating Prevention in Visual Cryptography,” IEEE Trans. Image Processing, Vol. 16, No. 1, 2007.
[2] H. Yan, Z. Gan, and K. Chen, “A cheater detectable visual cryptography scheme,” (in Chinese) J. Shanghai Jiaotong Univ., vol. 38, no. 1, 2004.
[3] G.-B. Horng, T.-G. Chen, and D.-S. Tsai, “Cheating in visual cryptography,” Designs, Codes, Cryptog., vol. 38, no. 2, pp. 219–236, 2006.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Image Steganalysis Using LSB Based

Algorithm for Similarity Measures

Mamta Juneja Computer Science and Engineering Department

Rayat and Bahara Institute of Engineering and Technology (RBIEBT) Sahauran(Punjab), India [email protected]

Abstract

A novel technique for steganalysis of images subjected to Least Significant Bit (LSB) type steganographic algorithms is presented in this paper. The seventh and eighth bit planes of an image are used for the computation of several binary similarity measures. The basic idea is that the correlation between the bit planes, as well as the binary texture characteristics within the bit planes, will differ between a stego-image and a cover-image. These telltale marks can be used to construct a steganalyzer, that is, a multivariate regression scheme to detect the presence of a steganographic message in an image.

1 Introduction

Steganography refers to the science of “invisible” communication. Unlike cryptography, where the goal is to secure communications from an eavesdropper, steganographic techniques strive to hide the very presence of the message itself from an observer [G. J. Simmons, 1984]. Given the proliferation of digital images, and given the high degree of redundancy present in a digital representation of an image (despite compression), there has been an increased interest in using digital images for the purpose of steganography. The simplest image steganography techniques essentially embed the message in a subset of the LSB (least significant bit) plane of the image, possibly after encryption [N. F. Johnson; S. Katzenbeisser, 2000]. Popular steganographic tools based on LSB embedding vary in their approach for hiding information. Methods like Steganos and S-Tools use LSB embedding in the spatial domain, while others like Jsteg embed in the frequency domain. Non-LSB steganography techniques include the use of quantization and dithering [N. F. Johnson; S. Katzenbeisser, 2000].
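For illustration, the simplest spatial-domain LSB embedding amounts to overwriting the least significant bits of selected pixels (a generic sketch, not the method of any particular tool):

import numpy as np

def embed_lsb(cover, bits):
    # naive LSB embedding: overwrite the LSB of the first len(bits) pixels with the message bits
    stego = cover.flatten().copy()
    stego[:len(bits)] = (stego[:len(bits)] & 0xFE) | bits
    return stego.reshape(cover.shape)

cover = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
message = np.random.randint(0, 2, 16, dtype=np.uint8)
stego = embed_lsb(cover, message)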

Since the main goal of steganography is to communicate securely in a completely undetectable manner, an adversary should not be able to distinguish in any sense between cover-objects (objects not containing any secret message) and stego-objects (objects containing a secret message). In this context, steganalysis refers to the body of techniques that are conceived to distinguish between cover-objects and stego-objects.

Recent years have seen many different steganalysis techniques proposed in the literature. Some of the earliest work in this regard was reported by Johnson and Jajodia [N. F. Johnson; S. Jajodia, 1998]. They mainly look at palette tables in GIF images and anomalies caused therein by common stego-tools. A more principled approach to LSB steganalysis was presented in [A. Westfield; A. Pfitzmann, 1999] by Westfeld and Pfitzmann. They identify Pairs of Values (PoV’s), which consist of pixel values that get mapped to one another on LSB flipping. Fridrich, Du and Long [J. Fridrich; R. Du, M. Long, 2000] define pixels that are close in color intensity to be a difference of not more than one count in any of the three color planes. They then show that the ratio of close colors to the total number of unique colors increases significantly when a new message of a selected length is embedded in a cover image as opposed to when the same message is embedded in a stego-image. A more sophisticated technique that provides remarkable detection accuracy for LSB embedding, even for short messages, was presented by Fridrich et al. in [J. Fridrich; M. Goljan; R. Du, 2001]. Avcibas, Memon and Sankur [I. Avcibas; N. Memon; B. Sankur, 2001] present a general technique for steganalysis of images that is applicable to a wide variety of embedding techniques including, but not limited to, LSB embedding. They demonstrate that steganographic schemes leave statistical evidence that can be exploited for detection with the aid of image quality features and multivariate regression analysis. Chandramouli and Memon [R. Chandramouli; N. Memon, 2001] do a theoretical analysis of LSB steganography and derive a closed form expression of the probability of false detection in terms of the number of bits that are hidden. This leads to the notion of steganographic capacity, that is, the number of bits one can hide in an image using LSB techniques without causing statistically significant modifications.

In this paper, a new steganalysis technique for detecting stego-images is presented. The technique uses binary similarity measures between successive bit planes of an image to determine the presence of a hidden message. In comparison to previous work, the technique we present differs as follows:

• [N. F. Johnson; S. Jajodia, 1998] present visual techniques and work for palette images. Our technique is based on statistical analysis and works with any image format.

• [A. Westfield; A. Pfitzmann, 1999], [J. Fridrich; R. Du, M. Long, 2000] and [J. Fridrich; M. Goljan ; R. Du, 2001] work only with LSB encoding. Our technique aims to detect messages embedded in other bit planes as well.

• [A. Westfield; A. Pfitzmann, 1999], [J. Fridrich; R. Du, M. Long, 2000] and [J. Fridrich; M. Goljan ; R. Du, 2001] detect messages embedded in the spatial domain. The proposed technique works with both spatial and transform-domain embedding.

• Our technique is more sensitive than [A. Westfield; A. Pfitzmann, 1999], [J. Fridrich; R. Du, M. Long, 2000] and [J. Fridrich; M. Goljan ; R. Du, 2001]. However, in its current form it is not as accurate as [J. Fridrich; M. Goljan ; R. Du, 2001] and cannot estimate the length of the embedded message like [J. Fridrich; M. Goljan ; R. Du, 2001].

Notice that our scheme does not need a reference image for steganalysis. The rest of this paper is organized as follows: in Section 2 we review binary similarity measures; in Section 3 we describe our steganalysis technique; in Section 4 we give simulation results; and we conclude with a brief discussion in Section 5.


2 Binary Similarity Measures

There are various ways to determine similarity between two binary images. Classical measures are based on the bit-by-bit matching between the corresponding pixels of the two images. Typically, such measures are obtained from the scores based on a contingency table (or matrix of agreement) summed over all the pixels in an image. In this study, where we examine lower order bit-planes of images for the presence of hidden messages, we have found that it is more relevant to make a comparison based on binary texture statistics. Let x = {xk, k = 1, …, K} and y = {yk, k = 1, …, K}, with K = 4, be the sequences of bits representing the 4-neighborhood pixels in the 7th and 8th bit planes, where the index i runs over all the image pixels. Let

χ(xr, xs) = 1 if xr = 0 and xs = 0
          = 2 if xr = 1 and xs = 0
          = 3 if xr = 0 and xs = 1
          = 4 if xr = 1 and xs = 1      (1)

Then we can define the agreement variable for the pixel xi as:

αij = Σ (k = 1, …, K) δ(j, χ(xi, xk)),   j = 1, …, 4,   K = 4,   where δ(m, n) = 1 if m = n and 0 otherwise.      (2)

The accumulated agreements can be defined as:

a = (1/MN) Σi αi1,   b = (1/MN) Σi αi2,   c = (1/MN) Σi αi3,   d = (1/MN) Σi αi4.      (3)

These four variables a, b, c, d can be interpreted as the one-step co-occurrence values of the binary images. Normalizing the histogram of the agreement scores for the 7th bit-plane gives:

p7j = Σi αij / Σi Σj αij.      (4)

Similarly, one can define p8j for the 8th bit plane. In addition to these we calculate the Ojala texture measure as follows. For each binary image we obtain a 16-bin histogram based on the weighted neighborhood shown in Fig. 1, where the score is given by S = Σ (i = 0, …, 3) xi 2^i, weighting the four directional neighbors as in Fig. 1.

Fig. 1: The Weighting of the Neighbors (N = 1, E = 2, S = 4, W = 8 around the centre pixel xi) in the Computation of the Ojala Score. S = 4 + 8 = 12 given W, S bits 1 and E, N bits 0.

The resulting Ojala measure is the mutual entropy between the two distributions, that is

m7 = − Σ (n = 1, …, N) S7n log S8n,      (5)

where N is the total number of bins in the histogram, S7n is the count of the n'th histogram bin in the 7th bit plane and S8n is the corresponding one in the 8th plane.

Table 1: Binary Similarity Measures

Similarity Measure                         Description
Sokal & Sneath Similarity Measure 1        m1 = a/(a+b) + a/(a+c) + d/(b+d) + d/(c+d)
Sokal & Sneath Similarity Measure 2        m2 = ad / ((a+b)(a+c)(b+d)(c+d))
Sokal & Sneath Similarity Measure 3        m3 = 2(a+d) / (2(a+d) + b + c)
Variance Dissimilarity Measure             m4 = (b+c) / (4(a+b+c+d))
Dispersion Similarity Measure              m5 = (ad − bc) / (a+b+c+d)^2
Co-occurrence Entropy                      dm6 = Σ (j = 1, …, 4) p7j log p8j
Ojala Mutual Entropy                       dm7 = − Σ (n = 0, …, 15) S7n log S8n

Using the above definitions, various binary image similarity measures are defined as shown in Table 1. The measures m1 to m5 are obtained for the seventh and eighth bit planes separately by adapting the parameters a, b, c, d of (3) to the classical binary string similarity measures, such as Sokal & Sneath. Then their differences dmi = m7thi − m8thi, i = 1, …, 5, are used as the final measures. The measure dm6 is defined as the co-occurrence entropy using the 4-bin histograms of the 7th and 8th bit planes. Finally, the measure dm7 is somewhat different in that we use the neighborhood-weighting mask proposed by Ojala [T. Ojala, M. Pietikainen, D. Harwood]. Thus we obtain a 16-bin histogram for each of the planes and then calculate their mutual entropy.
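A minimal sketch of how the co-occurrence values a, b, c, d and one of the difference measures could be computed, assuming the (0,0)/(1,0)/(0,1)/(1,1) labelling of Eq. (1), the convention that the 8th plane is the least significant bit, and simple border cropping (the paper does not specify its border handling):

import numpy as np

def accumulated_agreements(plane):
    # one-step co-occurrence values a, b, c, d of a binary plane from its 4-neighbourhoods (Eqs. 1-3)
    x = plane.astype(int)
    centre = x[1:-1, 1:-1]
    neighbours = [x[:-2, 1:-1], x[2:, 1:-1], x[1:-1, :-2], x[1:-1, 2:]]   # N, S, W, E
    counts = np.zeros(4)
    for nb in neighbours:
        counts[0] += np.sum((centre == 0) & (nb == 0))   # a
        counts[1] += np.sum((centre == 1) & (nb == 0))   # b
        counts[2] += np.sum((centre == 0) & (nb == 1))   # c
        counts[3] += np.sum((centre == 1) & (nb == 1))   # d
    return counts / centre.size

def sokal_sneath_1(a, b, c, d):
    # Sokal & Sneath similarity measure 1 from Table 1
    return a / (a + b) + a / (a + c) + d / (b + d) + d / (c + d)

def dm1(img):
    # difference of the measure between the 7th and 8th bit planes of an 8-bit image
    p7 = (img >> 1) & 1
    p8 = img & 1
    return sokal_sneath_1(*accumulated_agreements(p7)) - sokal_sneath_1(*accumulated_agreements(p8))

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print(dm1(img))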

3 Steganalysis Technique Based on Binary Measures

This approach is based on the fact that embedding a message in an image has a telltale effect on the nature of correlation between contiguous bit-planes. Hence we hypothesize that binary similarity measures between bit planes will cluster differently for clean and stego-images. This is the basis of our steganalyzer that aims to classify images as marked and unmarked.

We conjecture that hiding information in any bit plane decreases the correlation between that plane and its contiguous neighbors. For example, for LSB steganography, one expects a decreased similarity between the seventh and the eighth bit planes of the image as compared to its unmarked version. Hence, similarity measures between these two LSB’s should yield higher scores in a clean image as compared to a stego-image, as the embedding process destroys the preponderance of bit pair matches.

Since the complex bit pair similarity between bit planes cannot be represented by one measure only, we decided to use several similarity measures to capture different aspects of bit plane correlation. The steganalyzer is based on the regression of the seven similarity measures listed in Table 1:


y = β1 m1 + β2 m2 + … + βq mq      (6)

where m1, m2, …, mq are the q similarity scores and β1, β2, …, βq are their regression coefficients. In other words we try to predict the state y, whether the image contains a stego-message (y = 1) or not (y = −1), based on the bit plane similarity measures. Since we have n observations, we have the set of equations

y1 = β1 m11 + β2 m12 + … + βq m1q + ε1
…
yn = β1 mn1 + β2 mn2 + … + βq mnq + εn      (7)

where mkr is the r'th similarity measure observed in the k'th test image. The corresponding optimal MMSE linear predictor β̂ can be obtained by using the matrix M of similarity measures:

β̂ = (MT M)−1 (MT y).      (8)

Once the prediction coefficients are obtained in the training phase, these coefficients can then be used in the testing phase. Given an image in the test phase, the binary measures are computed and, using the prediction coefficients, these scores are regressed to the output value. If the output exceeds the threshold 0 then the decision is that the image is embedded, otherwise the decision is that the image is not embedded. That is, using the prediction

ŷ = β̂1 m1 + β̂2 m2 + … + β̂q mq,      (9)

the condition ŷ ≥ 0 implies that the image contains a stego-message, and the condition ŷ < 0 signifies that it does not.
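A minimal sketch of the training and decision steps in Eqs. (6)-(9), using a least-squares solver in place of the explicit matrix inverse (the toy scores and labels are hypothetical):

import numpy as np

def train_steganalyzer(M, y):
    # MMSE linear predictor of Eq. (8); lstsq solves (M^T M) beta = M^T y
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return beta

def classify(beta, m):
    # decision rule of Eq. (9): y_hat >= 0 means "stego", otherwise "clean"
    return 1 if m @ beta >= 0 else -1

M = np.array([[0.20, 0.10],   # n x q matrix of similarity scores from the training images
              [0.30, 0.40],
              [0.10, 0.00],
              [0.50, 0.60]])
y = np.array([-1.0, 1.0, -1.0, 1.0])
beta = train_steganalyzer(M, y)
print(classify(beta, np.array([0.40, 0.50])))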

The above shows how one can design a steganalyzer for the specific case of LSB embedding. The same procedure generalizes quite easily to detect messages in any other bit plane. Furthermore, our initial results indicate that we can even build steganalyzer for non-LSB embedding techniques like the recently designed algorithm F5 [Springer-Verlag Berlin, 2001]. This is because a technique like F5 (and many other robust watermarking techniques which can be used for steganography in an active warden framework [I. Avcibas; N. Memon; B. Sankur,2001]) results in the modification of the correlation between bit planes. We note that LSB techniques randomize the last bit plane. On the other hand Jsteg or F5 introduce more correlation between 7th and 8th bit planes, due to compression that filters out the natural noise in a clean image. In other words whereas spatial domain techniques decrease correlation, frequency domain techniques increase it.

4 Simulation Results

The designed steganalyzer is based on a training set and was evaluated using various image steganographic tools. The steganographic tools were Steganos [Springer-Verlag Berlin, 2001], S-Tools [Steganos II Security Suite] and Jsteg [J. Korejwa, Jsteg Shell 2.0], since these were among the most popular and cited tools in the literature. The image database for the simulations was selected to contain a variety of images, such as computer generated images, images with bright colors, images with reduced and dark colors, images with textures and fine details like lines and edges, and well-known images like Lena, peppers, etc.

In the experiments, 12 images were used for training and 10 images for testing. The embedded message size was 1/10 of the cover image size for Steganos and S-Tools, while the message size was 1/100 of the cover image size for Jsteg. The 12 training and 10 test images were embedded with the separate algorithms (Steganos, S-Tools and Jsteg). They were compared against their non-embedded versions in the test and training phases.

The performance of the steganalyzers is given in Table 2. In this table we compare two steganalyzers: the one marked BSM (binary similarity measures) is the scheme discussed in this paper; the one marked IQM is based on the technique developed in [I. Avcibas; N. Memon; B. Sankur, 2001]. That technique likewise uses regression analysis, but it is based on several image quality measures (IQM) such as block spectral phase distance, normalized mean square error, angle mean, etc. The quality attributes are calculated between the test image and its low-pass filtered version. The IQM steganalyzer [I. Avcibas; N. Memon; B. Sankur, 2001] is more laborious in the computation of the quality measures and preprocessing.

Table 2: Performance of the Steganalyzer

False Alarm Rate Miss Rate Detection Rate

IQM BSM IQM BSM IQM BSM

Steganos 2/5 1/5 1/5 1/5 7/10 8/10

Stools 4/10 1/10 1/10 2/10 15/20 17/20

Jsteg 3/10 2/10 3/10 1/10 14/20 17/20

F5 2/10 2/10 16/20

Simulation results indicate that the binary measures form a multidimensional feature space whose points cluster well enough to allow a classification of marked and non-marked images, in a manner comparable to the previous technique presented in [I. Avcibas; N. Memon; B. Sankur, 2001].

5 Conclusion

In this paper, I have addressed the problem of steganalysis of marked images. I have developed a technique for discriminating between cover-images and stego-images that have been subjected to the LSB type steganographic marking. This approach is based on the hypothesis that steganographic schemes leave telltale evidence between 7th and 8th bit planes that can be exploited for detection. The steganalyzer has been instrumented with binary image similarity measures and multivariate regression. Simulation results with commercially available steganographic techniques indicate that the new steganalyzer is effective in classifying marked and non-marked images.

As described above, the proposed technique is not suitable for active warden steganography (unlike [I. Avcibas; N. Memon; B. Sankur, 2001]) where a message is hidden in higher bit depths. But initial results have shown that it can easily generalize for the active warden case by taking deeper bit plane correlations into account. For example, we are able to detect Digimarc when the measures are computed for 3rd and 4th bit planes.

References

[1] [G. J. Simmons, 1984] Prisoners' Problem and the Subliminal Channel (The), CRYPTO83 - Advances in Cryptology, August 22-24. 1984, pages. 51-67.

[2] [N. F. Johnson; S. Katzenbeisser, 2000] “A Survey of steganographic techniques”, in S. Katzenbeisser and F. Petitcolas (Eds.): Information Hiding, pages. 43-78. Artech House, Norwood, MA, 2000.


[3] [N. F. Johnson; S. Jajodia, 1998] “Steganalysis: The Investigation of Hidden Information”, IEEE Information Technology Conference, Syracuse, NY, USA, 1998.
[4] [N. F. Johnson; S. Jajodia, 1998] “Steganalysis of Images Created Using Current Steganography Software”, in David Aucsmith (Ed.): Information Hiding, LNCS 1525, pages 32-47, Springer-Verlag Berlin Heidelberg, 1998.
[5] [A. Westfield; A. Pfitzmann, 1999] “Attacks on Steganographic Systems”, in Information Hiding, LNCS 1768, pages 61-76, Springer-Verlag Heidelberg, 1999.
[6] [J. Fridrich; R. Du, M. Long, 2000] “Steganalysis of LSB Encoding in Color Images”, Proceedings of ICME 2000, July 31-August 2, New York City, USA.
[7] [J. Fridrich; M. Goljan; R. Du, 2001] “Reliable Detection of LSB Steganography in Color and Grayscale Images”, Proc. of the ACM Workshop on Multimedia and Security, Ottawa, CA, October 5, 2001, pages 27-30.
[8] [I. Avcibas; N. Memon; B. Sankur, 2001] “Steganalysis Using Image Quality Metrics”, Security and Watermarking of Multimedia Contents, SPIE, San Jose, 2001.
[9] [R. Chandramouli; N. Memon, 2001] “Analysis of LSB Based Image Steganography Techniques”, Proceedings of the International Conference on Image Processing, Thessaloniki, Greece, October 2001.
[10] [C. Rencher, 1995] Methods of Multivariate Analysis, New York, John Wiley, 1995.
[11] [Springer-Verlag Berlin, 2001] F5—A Steganographic Algorithm: High Capacity Despite Better Steganalysis, Information Hiding Proceedings, LNCS 2137, Springer-Verlag Berlin, 2001.
[12] [Steganos II Security Suite] http://www.steganos.com/english/steganos/download.htm
[13] A. Brown, S-Tools Version 4.0, Copyright © 1996, http://members.tripod.com/steganography/stego/s-tools4
[14] [J. Korejwa, Jsteg Shell 2.0] http://www.tiac.net/users/korejwa/steg.htm
[15] http://www.cl.cam.ac.uk/~fapages2/watermarking/benchmark/image_database.html
[16] [T. Ojala, M. Pietikainen, D. Harwood] A Comparative Study of Texture Measures with Classification Based on Feature Distributions, Pattern Recognition, vol. 29, pages 51-59.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Content Based Image Retrieval Using Dynamical

Neural Network (DNN)

D. Rajya Lakshmi A. Damodaram GITAM University JNTU College of Engg. Visakhapatnam, India Hyderabad, India [email protected] [email protected]

K. Ravi Kiran K. Saritha GITAM University GITAM University

Visakhapatnam, India Visakhapatnam, India

[email protected] [email protected]

Abstract

In content-based image retrieval (CBIR), the content of an image can be expressed in terms of different features such as color, texture, shape, or text annotations. Retrieval based on these features can vary depending on how the feature values are combined. Most of the existing approaches assume a linear relationship between different features, and the usefulness of such systems has been limited due to the difficulty in representing high-level concepts using low-level features. We introduce a Neural Network-based Image Retrieval system, a human-computer interaction approach to CBIR using a Dynamical Neural Network, so that approximate similarity comparison between images can be supported. The experimental results show that the proposed approach captures the user's perception subjectivity more precisely using the dynamically updated weights.

Keywords: Content Based, Neural Network-based Image Retrieval, Dynamical Neural Network, Binary signature, Region-based retrieval, Daubechies compression, Image segmentation.

1 Introduction

With the rapid development of computing hardware, digital acquisition of information has become a popular method in recent years. Every day, gigabytes of images are generated by both military and civilian equipment. Consequently, how to make effective use of this huge amount of images becomes a highly challenging problem. The traditional approach relies on manual annotation of image content and a Database Management System (DBMS) to accomplish image retrieval through keywords. Although simple and straightforward, the traditional approach has two main limitations. First, the descriptive keywords of an image are inaccurate and incomplete. Second, manual annotation is time-consuming and subjective. Users with different backgrounds tend to describe the same object using different descriptive keywords, resulting in difficulties in image retrieval. To overcome the drawbacks of the traditional approach, content-based image retrieval (CBIR) was proposed to retrieve visually similar images from an image database based on automatically derived image features, and it has been a very active research area. There have been many projects performed to develop efficient systems for content-based image retrieval. The best known CBIR system is probably IBM’s QBIC [Niblack, 93]. Other notable systems include MIT’s Photobook [Pentland, 94], Virage’s VisualSEEK [Smith, 96], etc.

Neural networks are relatively simple systems containing general structures that can be directly applied to image analysis and visual pattern recognition problems [Peasso, 95]. They are usually viewed as nonparametric classifiers, although their trained outputs may indirectly produce maximum a posteriori (MAP) classifiers [Peasso, 95]. A central attraction of using neural networks is their computationally efficient decisions based on training procedures.

Due to the variations encountered during image matching (the same image may vary in direction, lighting and orientation, so the features of the same object may not always be the same), associative memory, which facilitates approximate matching instead of exact matching, finds its relevance. Our DNN as an associative memory with enhanced capacity permits overcoming these impediments resourcefully and provides a feasible solution.

The organization of the rest of the paper is as follows. Section 2 describes the detailed system architecture, Section 3 the Dynamical Neural Network (DNN) with reuse, and Section 4 the performance evaluation. In Section 5 the methodology and experimentation are discussed, Section 6 includes results and interpretation, and Section 7 includes discussions and conclusions.

2 Neural Network Approach

Several researchers have explored the application of neural networks [Mighell 89] to content-based image retrieval.

In the present work the Dynamic Neural Network (with reuse) has been used as an associative memory to handle content-based image retrieval. It is found to be efficient due to its special features of high capacity, fast learning and exact recall. The experimental results presented emphasize this efficiency.


Many models of neural networks have been proposed to solve problems of classification, vector quantization, self-organization and associative memory. The associative memory concept is related to the association of stored information with a given input pattern. High storage capacity and accurate recall are the most desired properties of an associative memory network. Many models have been proposed by different researchers [Hilberg 97], [Smith 96], [Sukhaswami 93], [Kang 93] for improving the storage capacity. Most of the associative memory models are some sort of variation of the Hopfield model [Hopfield 82]. These models have been demonstrated to be very useful in a wide range of applications. The Hopfield model has certain limitations of its own: the practical storage capacity is 0.15n, which is low compared to other models; the stability of the stored patterns degrades as their number approaches the maximum capacity; and the accuracy of recall falls with the increase in the number of stored patterns. These limitations have motivated us to come up with an improved model of neural network related to associative memory. The model which we proposed, the Dynamic Neural Network [Rao & Pujari 99], finds its advantages in fast learning, accurate recall and relative pruning, and is being exploited in many important applications such as faster text retrieval and word sense disambiguation [Rao & Pujari 99]. Subsequent to this there are some important amendments to this composite structure. It is often advantageous to have higher capacity with a smaller initial structure. This concept has led to a method which enables reuse of the nodes being pruned, facilitating an increase in the network capacity. This algorithm makes reusability possible and would prove efficient in many applications. One application which we have studied extensively, and for which we found DNN to be most supportive, is content-based image retrieval: retrieving information from a very large image database using approximate queries. The DNN with reuse is well suited to the association problem in image retrieval as it has properties like associative memory with 100% perfect recall, large storage capacity, avoidance of spurious states and convergence only to user-specified states. The image is subjected to multiresolution wavelet analysis for image compression and to capture invariant features of the image.

3 Dynamical Neural Network (DNN) with Reuse

3.1 Dynamical Neural Network (DNN)

A new architecture, the Dynamical Neural Network (DNN), was proposed in [Rao & Pujari 99]. It is called a Dynamical Neural Network in the sense that its architecture gets modified dynamically over time as training progresses. The architecture (Figure 1) of the DNN has a composite structure wherein each node of the network is a Hopfield network by itself. The Hopfield network employs the new learning technique and converges to user-specified stable states without having any spurious states. The capabilities of the new architecture are as follows. The DNN works as an associative memory without spurious stable states, and it also demonstrates a novel idea of order-sensitive learning which gives preference to the chronological order of presentation of the exemplar patterns. The DNN prunes nodes as it progressively carries out associative memory retrieval.

3.2 Training and Relative Pruning of DNN

The underlying idea behind this network structure is the following. If each basic node memorizes p patterns, then p basic nodes are grouped together. When a test pattern is presented to the DNN, assume that it is presented to all the basic nodes simultaneously and all basic nodes are activated at the same time to reach their respective stable states. Within a group of basic nodes one of them is designated as the leader of the group. For simplicity consider the first node as the leader. After the nodes in a group reach their respective stable states, these nodes transmit their stable states to the leader of that group. The corresponding connections among the nodes are shown in Figure 1(a) for p = 2. At this stage the DNN adopts a relative pruning strategy: it retains only the leader of each group and ignores all other basic nodes within a group. In the next pass the DNN consists of a smaller number of nodes, but the structure is retained. These leader nodes are treated as basic nodes and each of them is trained to memorize the p patterns corresponding to the p stable states of the member nodes of its group. These leader nodes are again grouped together, taking p nodes at a time. This process is repeated till a single node remains. Thus in one cycle the nodes carry out state transitions in parallel, keeping the weights unchanged, and in the next cycle the nodes communicate among themselves to change the weights. At this stage half of the network is pruned and the remaining half is available for the next iteration of two cycles. In this process, the network eventually converges to the closest pattern.

It is clear that in the DNN, if each basic node memorizes p patterns then each group memorizes p^2 patterns. Thus one leader node representing a portion of the DNN memorizes p^2 patterns. When p such leader nodes are grouped together, p^3 patterns can be memorized. If the process runs for i iterations to arrive at a single basic node, then the DNN can memorize p^i patterns. On the other hand, if it is required that the DNN memorize K patterns, then we must start with K/p basic nodes.

Fig. 1: The Architecture of DNN Fig. 1(a) First Level Grouping and Leader Nodes

3.3 Reuse of Pruned Nodes in DNN

One of the novel features of the DNN is relative pruning, wherein the network releases half of its neurons at every step. In this process the DNN, though it appears to be a composite, massively connected structure, progressively simplifies itself by shedding a part of its structure. As a result, the DNN makes intelligent use of its resources.

Since the network should have more capacity to tackle realistic applications efficiently, it is advantageous to have higher capacity with a smaller initial structure. In the present work we show that if, instead of discarding the pruned nodes, we reuse them, it is possible to memorize a larger number of patterns even with a smaller initial structure. We emphasize here that by pruning we release the nodes to perform some other task, or we can make use of the pruned nodes for the same task in subsequent iterations. The usage of pruned nodes for processing a fresh set of exemplar patterns is depicted as an algorithm in Figure 2.


INPUT: S1, S2, ..., S2m+2l exemplar patterns, X test pattern (where m is the number of basic nodes in the network; 2l is the number of leftover exemplar patterns to be memorized after the first iteration; l = 1, 2, ...)

OUTPUT: O output pattern.

PROCEDURE DNN ( S1, S2, ..., S2m+1, ..., S2m+2l, X : INPUT; O : OUTPUT )
    j = 1; k = 2; a = 0
    For i = 1 to m, with increment j
        Constantly present the input pattern X to each node Hi
    End for
    For a = 1 to l - 1, with increment 1
        For i = 1 to m, with increment j
            Train-network ( Hi, S2i-1, S2i )    /* node Hi is trained with exemplar patterns S2i-1 and S2i */
        End for
        For i = 1 to m, with increment j
            Hopfield ( Hi, X, Oi )              /* each node Hi stabilizes at a stable state Oi */
        End for
        For i = 1 to m, with increment k
            S2i-1 ← Oi
            S2i ← Oi+j
            S2(i+j)-1 ← S2(2m+a)-1
            S2(i+j) ← S2(2m+a)
        End for
    End for
    Repeat until m = 1
    Begin do
        For i = 1 to m, with increment j
            Train-network ( Hi, S2i-1, S2i )
        End for
        For i = 1 to m, with increment j
            Hopfield ( Hi, X, Oi )
        End for
        For i = 1 to m, with increment k
            S2i-1 ← Oi
            S2i ← Oi+j
            Prune the node Hi+j
        End for
        j = j + j; k = k + k; m = m / 2
    End do
END

Fig. 2: The Algorithm for Training and Pruning of DNN with reuse.

4 Performance Evaluation

The concept of reusing the pruned nodes in each iteration is highly advantageous when the number of exemplar patterns is much larger than twice the number of basic Hopfield nodes in the DNN, because the number of iterations required to get the final output when reuse is employed is much smaller than for the original DNN. The quantitative analysis is presented as follows.


The DNN without reuse of pruned nodes requires n nodes when there are 2n patterns and converges to the final pattern in log2(2n) iterations. If the pruned nodes are reused, then with the same n nodes we can memorize or store (k + 1 - log2 n)·n patterns, where k is the number of iterations; this k equals log2(2n) in the case of the DNN without reuse. Now, if one can tolerate one extra iteration, i.e. log2(2n) + 1 iterations, then n extra patterns can be stored in the DNN. As an example:

                        Nodes    Capacity       Iterations
DNN (without reuse)     8        16 patterns    4
DNN (with reuse)        8        24 patterns    5
DNN (with reuse)        8        32 patterns    6
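The figures in the table follow directly from the two expressions above; the short Python check below (illustrative only) reproduces them for n = 8 basic nodes.

import math

def capacity_without_reuse(n):
    return 2 * n, int(math.log2(2 * n))          # (patterns, iterations)

def capacity_with_reuse(n, k):
    return int((k + 1 - math.log2(n)) * n), k    # (patterns, iterations)

print(capacity_without_reuse(8))   # (16, 4)
print(capacity_with_reuse(8, 5))   # (24, 5)
print(capacity_with_reuse(8, 6))   # (32, 6)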

5 Methodology and Experimentation

CBIR aims at searching image libraries for specific image features such as color, shape and texture; querying is performed by comparing feature vectors (e.g. color histograms) of a query image with the feature vectors of all images in the database. One of the most important challenges when building image-based retrieval systems is the choice and representation of the visual features. Color is the most intuitive and straightforward for the user, while shape and texture are also important visual attributes, but there is no standard way to use them as efficiently as color for image retrieval. Many content-based image retrieval systems use color and texture features.

In the present work a method combining both color and texture features of the image is proposed to improve retrieval performance. Since the DNN operates on binary data, the feature vector is transformed into a binary signature and stored in a signature file. To obtain one signature per image of the database, the following operations are performed.

1. The image is compressed using the Haar or Daubechies wavelet compression technique.

2. A feature vector (color + texture) is extracted for each pixel, image(c1, c2, c3, t1, t2, t3), by applying color and texture feature extraction techniques.

3. Image segmentation is a crucial step for a region-based system to increase performance and accuracy during image similarity distance computation. Images are segmented into objects/regions by grouping pixels with similar descriptions (color and texture): image(c1,c2,c3,t1,t2,t3) = Σi Oi(c1,c2,c3,t1,t2,t3).

4. The final feature vector for each image is calculated by augmenting the feature vectors of each object of the image.

Feature Vector = FVO1 || FVO2 || ... || FVOi

5. The feature vector is converted into a binary signature and stored in a signature file SF. Thus the image database is transformed into signature file storage (a sketch of this step is given below).
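A minimal Python sketch of step 5, under the assumption that the concatenated feature vector is binarized against per-dimension median thresholds computed over the database; the paper does not state its exact quantization rule, so this thresholding is only one reasonable choice and the function names are ours.

import numpy as np

def build_signature_file(feature_vectors):
    # feature_vectors: one concatenated (color + texture) vector per database image
    fv = np.asarray(feature_vectors, dtype=float)
    thresholds = np.median(fv, axis=0)                  # one threshold per dimension
    signatures = (fv >= thresholds).astype(np.uint8)    # binary signature file SF, one row per image
    return signatures, thresholds

def query_signature(feature_vector, thresholds):
    # the same thresholds are applied to the query image's feature vector
    return (np.asarray(feature_vector, dtype=float) >= thresholds).astype(np.uint8)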

Similarly, a signature is extracted for the query image, and this test signature is matched against the reference signatures using the Dynamical Neural Network. In the DNN-based CBIR system, the set of signatures extracted from the image database becomes the training set for the DNN, and the signature of the query


image becomes the test signature (input pattern). The DNN with reuse consists of fully connected basic nodes, each of which is a Hopfield node. When a query image is presented, the binary representation of its signature is used as the test pattern. According to the dynamics of the Hopfield model, the DNN with reuse retrieves the memorized pattern that is closest to the test pattern.

5.1 Image Compression and Feature Extraction

Generally, in CBIR systems it is required to compress the image before extracting features and generating a reference signature for each individual image. This is the preprocessing phase; next is the feature extraction phase, where the signature is computed.

5.2 Wavelet Analysis

During the preprocessing phase, the raw file of each image is subjected to multiresolution wavelet analysis. This is required because the storage and manipulation of scanned images is very expensive owing to the large storage space needed. To make widespread use of digital images practical, some form of data compression must be used. The wavelet transform has become a cutting-edge technology in image compression research.

The wavelet representation encodes the average image and the detail images as we transform the image to coarser resolutions. The detail images encode the directional features in the vertical, horizontal and diagonal directions, whereas the average image retains the average features. Thus the average image of a scanned image can retain its salient structure even at coarser resolution. However, not all types of basis functions are able to preserve these features in the present context; we tried a set of scaling functions and noticed that their behavior differs. Another advantage of using the wavelet representation is that the preprocessing stages of contouring, thinning or edge detection are no longer required. We apply the wavelet transform to the gray-scale image and use the process of finalization on the wavelet coefficients of the average image.

The wavelet representation gives information about the variations in the image at different scales. A high wavelet coefficient at coarse resolution corresponds to a region with high global variation. The idea is to find relevant points to represent this global variation by looking at wavelet coefficients at finer resolutions. A wavelet is an oscillating and attenuated function with zero integral. It is a basis function that has some similarity to both splines and Fourier series. It decomposes the image into different frequency components and analyzes each component with a resolution matching its scale. We study the image at the scales 2^-j, j ∈ Z+. The application of wavelets to compute signatures of images is an area of active research.

The (forward) wavelet transform can be viewed as a form of subband coding with a low-pass filter (H) and a high-pass filter (G) which split a signal's bandwidth in half. The impulse responses of H and G are mirror images, related by

g(n) = (-1)^(1-n) h(1-n)

A one-dimensional signal s can be filtered by convolving the filter coefficients c(k) with the signal values,

s~(n) = Σ (k = 1..M) c(k) s(n-k)

where M is the number of coefficients. The one-dimensional forward wavelet transform of a signal s is performed by convolving s with both H and G and downsampling by 2. The image f(x,y) is first filtered along the x-dimension, resulting in a low-pass image fL(x,y) and a high-pass image fH(x,y). The downsampling is accomplished by dropping every other filtered value. Both fL and fH are then filtered along the y-dimension, resulting in four sub-images: fLL, fLH, fHL and fHH. Once again, we downsample the sub-images by 2, this time along the y-dimension. The 2-D filtering decomposes an image into an average signal fLL and three detail signals which are directionally sensitive: fLH emphasizes the horizontal image features, fHL the vertical features, and fHH the diagonal features. This process reduces a 130x40 image to a set of four 17x5 images, of which we consider only one, namely the average image.
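The one-level 2-D decomposition just described can be sketched as follows; Haar filters are used purely for brevity (the Battle-Lemarie coefficients listed in the next subsection can be substituted for h and g), and the helper names are ours.

import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2)     # low-pass filter H
g = np.array([1.0, -1.0]) / np.sqrt(2)    # high-pass filter G

def filter_and_downsample_rows(img, filt):
    # convolve every row with the filter and keep every other sample
    return np.apply_along_axis(lambda r: np.convolve(r, filt)[1::2], 1, img)

def dwt2_one_level(img):
    fL = filter_and_downsample_rows(img, h)         # low-pass along x
    fH = filter_and_downsample_rows(img, g)         # high-pass along x
    fLL = filter_and_downsample_rows(fL.T, h).T     # average image
    fLH = filter_and_downsample_rows(fL.T, g).T     # horizontal detail
    fHL = filter_and_downsample_rows(fH.T, h).T     # vertical detail
    fHH = filter_and_downsample_rows(fH.T, g).T     # diagonal detail
    return fLL, fLH, fHL, fHH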

The properties of the basis wavelet depend on the proper choice of the filter characteristic H(ω). Several filters have been proposed by researchers working in this field; the specific type of filter to be used depends on the application. The constraints for choosing a filter are good localization in the space and frequency domains on one hand, and smoothness and differentiability on the other. Here we have used the Battle-Lemarie filter coefficients for the wavelet approximation.

5.2.1 Battle-Lemarie Filter [Mallat, 89]

h(n): 0.30683, 0.54173, 0.30683, -0.035498, -0.077807, 0.022684, 0.0297468, -0.0121455, -0.0127154, 0.00614143, 0.0055799, -0.00307863, -0.00274529, 0.00154264, 0.00133087, -0.000780461, 0.000655628, 0.0003955934

g(n): 0.541736, -0.30683, -0.035498, 0.077807, 0.022684, -0.0297468, -0.0121455, 0.0127154, 0.00614143, -0.0055799, -0.00307863, 0.00274529, 0.00154264, -0.00133087, -0.000780461, 0.000655628, 0.0003955934, 0.000655628

5.2.2 Feature Extraction

Since the basic building block of the DNN is the Hopfield network, which learns only binary information, the training set (images) has to be converted into binary form. For preprocessing and feature extraction in the image retrieval system we used MATLAB image processing and statistical tools; for clustering we used k-means clustering. We use a general-purpose image database containing 1000 images from COREL. These images are pre-categorized into 10 groups: African people, beach, buildings, buses, dinosaurs, elephants, flowers, horses, mountains and glaciers, and food. All images have the size 384x256 or 256x384. As explained above, the color and texture features are extracted and fed into a k-means algorithm to obtain the clustered objects, and the binary signature is computed as in step 5. This process is repeated for all the images in the database and these signatures are used as the training set for the DNN. The same process is repeated for the query image and the resulting signature is used as the test set. The DNN with reuse consists of fully connected basic nodes, each of which is a Hopfield node.
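The per-pixel clustering step can be pictured with the small k-means sketch below, which groups the 6-dimensional descriptors (c1, c2, c3, t1, t2, t3) into k objects/regions; this is illustrative only, since the experiments themselves used MATLAB tools.

import numpy as np

def kmeans(features, k, n_iter=50, seed=0):
    # features: N x 6 array of per-pixel (color, texture) descriptors
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every pixel to its nearest cluster centre
        labels = np.argmin(((features[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # move each centre to the mean of its assigned pixels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels, centers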

5.3 Signature Comparison

The DNN with reuse consists of fully connected basic nodes, each of which is a Hopfield node. The binary signatures of the database images are used as exemplar patterns, and the binary representation of the query image is used as the test pattern. According to the dynamics of the Hopfield model, the DNN with reuse retrieves the memorized pattern that is closest to the test pattern.


6 Results and Interpretation

In this experiment, we use 1000 images from the COREL database. These images are pre-categorized into 10 classes: African people, beach, buildings, buses, dinosaurs, elephants, flowers, horses, mountains and glaciers, and food. The first 80 images of each class are used as typical images to train the designed DNN; the feature vector has 200 bits. The next 15 images of each class are used as test images. Fig. 3 shows the retrieval results of the DNN. In order to verify the performance of the proposed system, we also compare the proposed DNN system with other image retrieval systems. Table 1 compares the average precision with the SIMPLicity and BPBIR systems, and shows that the DNN system is more efficient at retrieving images than the other two systems. The reason is that the trained DNN can memorize some prior information about each class.

Fig. 2: Images Categorization

Fig. 3: Retrieved Results of DNN (a) Flowers (b) Elephants


Table 1: Comparison of Average Precision between BPBIR, SIMPLicity and DNN Systems

Classes                          BPBIR    SIMPLicity    DNN
African people and villages      32%      48%           56.99%
Beaches                          40%      32%           54.69%
Buildings                        34%      33%           53.46%
Buses                            43%      37%           81.92%
Dinosaurs                        54%      98%           98.46%
Elephants                        49%      40%           52.52%
Flowers                          36%      40%           76.35%
Horses                           54%      71%           81.21%
Mountains                        34%      32%           48.65%
Food                             50%      31%           66.45%

7 Conclusion

In this paper we presented a novel image retrieval system based on the DNN. It builds on the observation that the images a user needs are often similar to a set of images sharing the same concept rather than to a single query image, and on the assumption that there is a nonlinear relationship between different features. Finally, we compared the performance of the proposed system with other image retrieval systems in Table 1. Experimental results show that it is more effective and efficient.

References

[1] [Pender 91] D.A. Pender, "Neural Networks and Handwritten Signature Verification", Ph.D. Thesis, Department of Electrical Engineering, Stanford University.
[2] [Smith 96] Smith, K., Palaniswami, M. and Krishnamurthy, M., "Hybrid neural approach to combinatorial optimization", Computers & Operations Research, 23, 6, 597-610, 1996.
[3] [Mighell 89] D.A. Mighell, T.S. Wilkinson and J.W. Goodman, "Backpropagation and its application to Handwritten Signature Verification", Advances in Neural Information Processing Systems 1, D.S. Touretzky (ed.), Morgan Kaufmann, pp. 340-347.
[4] [Pessoa 95] F.C. Pessoa, "Multilayer Perceptron versus Hidden Markov Models: Comparison and applications to Image Analysis and Visual Pattern Recognition", Pre-PhD qualifying report, Georgia Institute of Technology, School of Electrical and Computer Engineering, Aug 10, 1995.
[5] [Hilberg 97] Hilberg, W., "Neural Networks in higher levels of abstraction", Biological Cybernetics, 76, 2340, 1997.
[6] [Kang 93] Kang, H., "Multilayer Associative Neural Networks (MANN): Storage capacity vs noise-free recall", International Joint Conference on Neural Networks (IJCNN) 93, 901-907, 1993.
[7] [Rao & Pujari 99] Rao, M.S. and Pujari, A.K., "A New Neural Network architecture with associative memory, pruning and order sensitive learning", International Journal of Neural Systems, 9, 4, 351-370, 1999.
[8] [Sukhaswami 93] Sukhaswami, M.B., "Investigations on some applications of artificial neural networks", Ph.D. Thesis, Dept. of CIS, University of Hyderabad, Hyderabad, India, 1993.
[9] [Hopfield 82] Hopfield, J.J., "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences, USA, 81, 3088-3092, 1994.


Development of New Artificial Neural Network

Algorithm for Prediction of Thunderstorm Activity

K. Krishna Reddy, Y.V. University, Kadapa, [email protected]
K.S. Ravi, K.L. College of Engg., Vijayawada, [email protected]
V. Venu Gopalal Reddy, JNTU College of Engg., Pulivendula, [email protected]
Y. Md. Riyazuddiny, V.I.T. University, Vellore, Tamil Nadu, [email protected]

Abstract

Thunderstorms can cause great damage to human life and property. Hence, prediction of thunderstorms and their associated rainfall is essential for agriculture, household purposes, industries and the construction of buildings. Any weather prediction is extremely complicated because the associated mathematical models are complicated, involving many simultaneous non-linear hydrodynamic equations. On many occasions such models do not give accurate predictions. Artificial neural networks (ANNs) are known to be good at problems where there are no clear-cut mathematical models, and so ANNs have been tried out to make predictions in our application. ANNs are now being used in many branches of research, including the atmospheric sciences. The main contribution of this paper is the development of an ANN to identify the presence of thunderstorms (and their location) based on Automatic Weather Station data collected at the Semi-arid-zonal Atmospheric Research Centre (SARC) of Yogi Vemana University.

1 Introduction

Thunderstorms are a highly destructive force of nature, and timely tracking of the thundercloud direction is of paramount importance to reduce property damage and human casualties. Annually, it is estimated that thunderstorm-related phenomena cause crores of rupees of damage worldwide through forest fires, shutdown of electrical plants and industries, property damage, etc. [Singye et al., 2006]. Although thunderstorm tracking mechanisms are already in place, such systems often deploy complicated radar systems whose cost can only be afforded by bigger institutions. Artificial neural networks have been studied since the nineteen-sixties (Rosenblatt, 1958), but their use for forecasting meteorological events appeared only in the last 15 years [Lee et al., 1990 and Marzban, 2002]. The great advantage of using ANNs is their intrinsic non-linearity, which helps in describing complex meteorological events better than linear methods.

2 Artificial Neural Network

An Artificial Neural Network (ANN) is a computational model that is loosely based on the manner in which the human brain processes information. Specifically, it is a network of highly


interconnected processing elements (neurons) operating in parallel (Figure 1). An ANN can be used to solve problems involving complex relationships between variables. The particular type of ANN used in this study is a supervised one, wherein an output vector (target) is specified and the ANN is trained to minimize the error between the actual output and the target vector, thus resulting in an optimal solution.

Fig. 1: A 2-layer ANN with Multiple Inputs and Single Hidden and Output Neurons

Today, most ANN research and applications are accomplished by simulating ANNs on high-performance computers. ANNs with fewer than 150 elements have been successfully used in vehicular control simulation, speech recognition and undersea mine detection. Small ANNs have also been used in airport explosive detection, expert systems, remote sensing, biomedical signal processing, etc. Figure 2 demonstrates a single-layer perceptron that classifies an analog input vector into two classes denoted A and B. This net divides the space spanned by the input into two regions separated by a hyperplane, or a line in two dimensions, as shown on the top right.

Fig. 2: A Single-Layer Perceptron


Figure 3 depicts a three-layer perceptron with N continuous-valued inputs, M outputs and two layers of hidden units. The nonlinearity can be any of those shown in Fig. 3. The decision rule is to select the class corresponding to the output node with the largest output. In the formulas, X'j and X''k are the outputs of the nodes in the first and second hidden layers, θj and θk are internal offsets in those nodes, Wij is the connection strength from the input to the first hidden layer, and W'ik and W''ij are the connection strengths between the first and second hidden layers and between the second hidden layer and the output layer, respectively.

Fig. 3: A Three-Layer Perceptron

3 Methodology

Consider Figure 4. Here y = actual output and D = desired output.

Error E = 1/2 (y - D)²

Wi' = Wi - µ δE/δWi, where µ is the learning rate and Wi' is the adjusted weight.

Now y = 1/(1 + e^-x), with x = Σ (i = 1..N) Xi Wi, so that

δy/δx = y (1 - y)    and    δx/δWi = δ/δWi ( Σ (i = 1..N) Xi Wi ) = Xi

Therefore

-µ δE/δWi = -µ δE/δy · δy/δx · δx/δWi
          = -µ (y - D) y (1 - y) Xi

For the weights feeding the output node we likewise have

Wjo' = Wjo - µ δE/δWjo
     = Wjo - µ δE/δy · δy/δ(neto) · δ(neto)/δWjo
     = Wjo - µ (y - D) y (1 - y) Ij

∴ Wjo' = Wjo - µ (y - D) y (1 - y) Ij

where Ij is the output of hidden node j and neto = Σj Ij Wjo.


Fig. 4: ANN without Hidden Layer

Fig. 5: ANN with Hidden Layer

Consider Figure 5. For the sake of simplicity, we have taken a 3/2/1 ANN. Our aim is to determine the set of optimum weights between (i) the input and hidden layers, and (ii) the hidden and output layers.

For the hidden-node weights we have

Wijh' = Wijh - µ δE/δWijh

where

δE/δWijh = δE/δy · δy/δ(neto) · δ(neto)/δIj · δIj/δWijh
         = (y - D) y (1 - y) · Wjo · δIj/δWijh
         = (y - D) y (1 - y) · Wjo · Ij (1 - Ij) Xi

We already have

Wjo' = Wjo - µ (y - D) y (1 - y) Ij

Hence

Wijh' = Wijh - µ (y - D) y (1 - y) Wjo Ij (1 - Ij) Xi
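A compact sketch of one training step implementing the update rules derived above, written in Python for illustration (the authors' implementation is in C and FORTRAN). Shapes correspond to a 5/3/1 network: Wh holds the input-to-hidden weights Wijh and Wo the hidden-to-output weights Wjo.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(X, D, Wh, Wo, mu=0.5):
    I = sigmoid(Wh @ X)                    # hidden outputs Ij
    y = sigmoid(Wo @ I)                    # actual output y
    delta_o = (y - D) * y * (1 - y)        # output error term (y - D) y (1 - y)
    Wo_new = Wo - mu * delta_o * I                  # Wjo' = Wjo - mu (y-D) y (1-y) Ij
    delta_h = delta_o * Wo * I * (1 - I)            # propagated back through Wjo
    Wh_new = Wh - mu * np.outer(delta_h, X)         # Wijh' = Wijh - mu (y-D) y (1-y) Wjo Ij (1-Ij) Xi
    return Wh_new, Wo_new, 0.5 * (y - D) ** 2       # also return the error E

# illustrative 5/3/1 shapes with small random initial weights
rng = np.random.default_rng(0)
X, D = rng.random(5), 1.0
Wh, Wo = rng.random((3, 5)) - 0.5, rng.random(3) - 0.5
Wh, Wo, E = train_step(X, D, Wh, Wo)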

4 Software Development

Extensive programming was done to execute the above ANN calculations. FORTRAN programmes were used for training on the data. In addition, a Linux version depicting the training and testing has also been prepared. The programmes were developed with source code written in 'C', and the executable files work under the GNU environment. The programme gives a graphical display of the process of testing and training the ANN, with the results depicted on screen in graphical mode.


5 Meteorological Data

The various input parameters, namely temperature, pressure, relative humidity, wind speed and wind direction, are described below.

5.1 Temperature

Temperature is the manifestation of heat energy. In meteorology, temperature is measured in free air, in shade, at a standard height of 1.2 m above the ground. Measurement is made at standard hours using thermometers; the unit of measurement is degrees Celsius (°C). Ambient temperature is referred to as the dry bulb temperature. This is different from the wet bulb temperature, which is obtained by keeping the measuring thermometer consistently wet. The wet bulb temperature, or dew point, provides a measure of the moisture content of the atmosphere. We have used the dry bulb temperature in our study.

5.2 Pressure

Atmospheric pressure is defined as the force exerted by a vertical column of air of unit cross-section at a given level. It is measured by a barometer; the unit of pressure is the millibar (mb). Atmospheric pressure varies with the time of day and latitude, as well as with altitude and weather conditions. Pressure decreases with height, because the concentration of the constituent gases and the depth of the vertical column decrease as we ascend.

5.3 Wind (Speed and Direction)

The atmosphere reaches equilibrium through winds. Wind is air in horizontal motion. Wind is denoted by the direction from which it blows and is specified by the points of a compass or by degrees from True North (0° to 360°). Wind direction is shown by a wind vane and wind speed is measured by an anemometer. In our study, for computational purposes, the values in Table 1 are assigned to the various directions.

Table 1: Values Assigned to Wind Directions

Direction                        Value Assigned (Degrees)
North (N)                        360
North North East (NNE)           22.5
North East (NE)                  45
East North East (ENE)            67.5
East (E)                         90
East South East (ESE)            112.5
South East (SE)                  135
South South East (SSE)           157.5
South (S)                        180
South South West (SSW)           202.5
South West (SW)                  225
West South West (WSW)            247.5
West (W)                         270
North West (NW)                  315


5.4 Relative Humidity

The measure of the moisture content in the atmosphere is humidity. Air can hold only a certain amount of water at a given time; when the maximum limit is reached, the air is said to be saturated. The ratio of the amount of water vapor present in the atmosphere to the maximum it can hold at that temperature and pressure, expressed as a percentage, is the relative humidity.

5.5 Data Collection

The great advantage of using ANNs is their intrinsic non-linearity, which helps in describing complex (thunderstorm) meteorological events better than linear methods. But this could also turn out to be a drawback, since this intrinsic power allows the ANN to easily fit the database used to train the model. Unfortunately, it is not certain that the good performance obtained by the ANN on the training data will be confirmed on new data (generalization ability). To avoid this over-fitting problem it is crucial to validate the ANN, i.e. to divide the original database into training and validation subsets and choose the ANN which has the best performance on the validation dataset (similar to what was done by Navone and Ceccatto, 1994).

The data related to thunderstorm occurrence were collected from the SARC, Department of Physics, Yogi Vemana University, Kadapa. We collected a total of 100 data sets, of which 45 were used for training and 51 for testing. Each set consists of 5 input parameters and a corresponding output parameter; this output parameter specifies whether a thunderstorm occurred or not.

6 Results

The back propagation approaches described in sections 2 and 3 were tried out to predict thunderstorm occurrences. Initially a straightforward ANN with one hidden layer was tried. The inputs were normalized and the initial weights were optimized considering the numerical limits of the compiler for training on the data. The sigmoid function was used as the activation function. There were 5 input nodes, 3 hidden nodes and one output node; this configuration is called the 5/3/1 ANN. The target or desired output was kept at 0 and 1, corresponding to the "No Thunderstorm" and "Thunderstorm" conditions. A close look at the sigmoid function shows that it saturates beyond -5 on the negative x-axis and +5 on the positive x-axis. Since the sigmoid function reaches 0 and 1 only at -infinity and +infinity, it was decided that for practical purposes the target output could be taken as 0.0067 (x = -5) for 0 and 0.9933 (x = +5) for 1.

In order to attain convergence, error levels were fixed with respect to these values. Initially the error value was fixed at 0.0025. Training was done by taking alternate data sets for thunderstorm and no-thunderstorm conditions so that a better approximation is made by the ANN. However, it was observed that even after many iterations the ANN exhibited oscillatory behavior and finally stagnated at a constant value. Introducing and altering the momentum factor or varying the threshold value and learning rate also did not improve the convergence; the ANN was able to reach error levels of only about 0.25.
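The two target values quoted above are simply the sigmoid evaluated at x = -5 and x = +5, as this short check confirms.

import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
print(round(sigmoid(-5), 4), round(sigmoid(5), 4))   # 0.0067 0.9933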


In order to further improve the ANN, a slightly different configuration, i.e. 5/4/1, was tried out by having 4 hidden units instead of 3. However, there was a marginal decline in performance.

Table 2: Details of the Number of Sweeps (Iterations) Made, the Error Levels Reached and the Efficiency of the ANN for the Different Configurations Attempted

Sl.No. ANN Type 5/1 5/3/1 5/4/1

1 Convergence reached 0.000045 0.000750 0.001

2 Iterations taken for this convergence 43 241 1606

3 Efficiency over learning data 73.469 93.012 92.763

4 Efficiency over testing data 73.009 89.1812 89.14

It can be seen that by using hidden layers, lower error levels and better accuracy could be achieved; however, the convergence rate was slow. Another significant observation was made during the course of the work: optimizing the initial weights of the ANN led to faster convergence and better efficiency in all the above cases, and the results obtained were much better than those obtained with random initial weights. It has been possible to reach a prediction efficiency of up to 90% with good computing facilities but with limited data sets and time constraints. There is a lot of scope for improvement by using more physical parameters and more data sets. With prolonged analysis it should be possible to achieve efficiency exceeding 95%, with the YES and NO cases predicted quite close to their desired values.

7 Conclusion

An artificial neural network without a hidden layer shows limited capability in the prediction of thunderstorms, whereas an ANN with hidden layers gives good prediction results. However, an ANN trained by an algorithm which does not average successive errors does not reach lower error levels. Weight initialization is an important factor: it has been found that proper weight initialization, instead of random initialization, results in better efficiency and faster convergence. Fairly accurate prediction of thunderstorms has been possible in spite of the limited availability of physical input parameters and data sets. Prolonged analysis with more physical input parameters and a larger volume of data sets will yield prediction efficiency greater than 95% and actual ANN outputs conforming closely to the desired outputs.

References

[1] [Lee et al., 1990] A neural network approach to cloud classification, IEEE Transactions on Geoscience and Remote Sensing, 28, pages 846-855.
[2] [Marzban and Stumpf, 1996] A neural network for tornado prediction based on Doppler radar-derived attributes, J. Appl. Meteor., 35, pages 617-626.
[3] [Navone and Ceccatto, 1994] Predicting Indian Monsoon Rainfall: A Neural Network Approach, Climate Dyn., 10, pages 305-312.
[4] [Rosenblatt, F., 1958] The Perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review, 65, pages 386-408.
[5] [Singye et al., 2006] Thunderstorm tracking system using neural networks and measured electric fields from a few field mills, Journal of Electrical Engineering, 57, pages 87-92.


Visual Similarity Based Image Retrieval

for Gene Expression Studies

Ch. Ratna Jyothi, Chaitanya Bharathi Institute of Technology, Hyderabad
Y. Ramadevi, Chaitanya Bharathi Institute of Technology, Hyderabad
[email protected]

Abstract

Content Based Image Retrieval (CBIR) is becoming very popular because of the high demand for searching image databases of ever-growing size. Since speed and precision are important, we need to develop a system for retrieving images that is both efficient and effective.

Content-based image retrieval has shown to be more and more useful for several application domains, from audiovisual media to security. As content-based retrieval has matured, different scientific applications have emerged as clients for such methods.

More recently, botanical applications have generated very large image collections and have become very demanding of "content-based visual similarity computation" [2]. Our implementation describes low-level feature extraction for visual appearance comparison between genetically modified plants in gene expression studies.

1 Introduction

Image database management and retrieval has been an active research area since the 1970s. With the rapid increase in computer speed and decrease in memory cost, image databases containing thousands or even millions of images are used in many application areas such as medicine, satellite imaging, and biometric databases, where it is important to maintain a high degree of precision. With the growth in the number of images, manual annotation becomes infeasible both time and cost-wise.

Content-based image retrieval (CBIR) is a powerful tool since it searches the image database by utilizing visual cues alone. CBIR systems extract features from the raw images themselves and calculate an association measure (similarity or dissimilarity) between a query image and database images based on these features. CBIR is becoming very popular because of the high demand for searching image databases of ever-growing size. Since speed and precision are important, we need to develop a system for retrieving images that is both efficient and effective.

Recent approaches to represent images require the image [3] to be segmented into a number of regions (a group of connected pixels which share some common properties). This is done with the aim of extracting the objects in the image. However, there is no unsupervised segmentation algorithm that is always capable of partitioning an image into its constituent


objects, especially when considering a database containing a collection of heterogeneous images. Therefore, an inaccurate segmentation may result in an inaccurate representation and hence in poor retrieval performance.

We introduce a contour-based CBIR technique [8]. It uses a new approach to describe the shape of a region, inspired by an idea related to the color descriptor. This new shape descriptor, called the Directional Fragment Histogram (DFH), is computed using the outline of the region. One way of improving its efficiency would be to reduce the number of image comparisons done at query time; this can be achieved by using a metric access structure or a filtering technique.

2 A Typical CBIR System

Content Based Image Retrieval is defined as the retrieval of relevant images from an image database based on automatically derived imagery features. Content-based image retrieval [7] uses the visual contents of an image such as color, shape, texture and spatial layout to represent and index the image. In typical content-based image retrieval systems (Figure 1.1), the visual contents of the images in the database are extracted and described by multi-dimensional feature vectors.

Fig. 1.1: Diagram for Content-based Image Retrieval System

The feature vectors of the images in the database form a feature database. To retrieve images, users provide the retrieval system with example images or sketched figures. The system then converts these examples into its internal representation of feature vectors. The similarities/distances between the feature vectors of the query example or sketch and those of the images in the database are then calculated, and retrieval is performed with the aid of an indexing scheme.

The indexing scheme provides an efficient way to search the image database. Recent retrieval systems have incorporated users' relevance feedback to modify the retrieval process in order to generate perceptually and semantically more meaningful retrieval results. In the following sections, we introduce these fundamental techniques for content-based image retrieval.


3 Fundamental Techniques for CBIR

3.1 Image Content Descriptors

Generally speaking, image content [6] may include both visual and semantic content. Visual content can be very general or domain specific. General visual content includes color, texture, shape, spatial relationships, etc. Domain-specific visual content, like human faces, is application dependent and may involve domain knowledge. Semantic content is obtained either by textual annotation or by complex inference procedures based on visual content.

3.1.1 Color

Most commonly used color descriptors include the color histogram, color coherence vector, color correlogram and so on. The color histogram serves as an effective representation of the color content of an image if the color pattern is unique compared with the rest of the data set. The color histogram [4] is easy to compute and effective in characterizing both the global and local distribution of colors in an image. In addition, it is robust to translation and rotation about the view axis and changes only slowly with scale, occlusion and viewing angle.

In color coherence vectors (CCV) spatial information is incorporated into the color histogram. The color correlogram was proposed to characterize not only the color distributions of pixels, but also the spatial correlation of pairs of colors.

3.1.2 Texture

Texture is another important property of images. Various texture representations have been investigated in pattern recognition and computer vision. Basically, texture representation methods can be classified into two categories: structural and statistical. Structural methods, including morphological operators and adjacency graphs, describe texture by identifying structural primitives and their placement rules; they tend to be most effective when applied to textures that are very regular. Statistical methods, including Fourier power spectra, co-occurrence matrices, shift-invariant principal component analysis (SPCA), Tamura features, Wold decomposition, Markov random fields, fractal models, and multi-resolution filtering techniques such as the Gabor and wavelet transforms, characterize texture by the statistical distribution of the image intensity.

3.1.3 Shape


Shape features of objects or regions have been used in many content-based image retrieval systems. Compared with color and texture features, shape features are usually described after images have been segmented into regions or objects. Since robust and accurate image segmentation is difficult to achieve, the use of shape features for image retrieval has been limited to special applications where objects or regions are readily available. State-of-the-art methods for shape description can be categorized into either boundary-based methods (rectilinear shapes, polygonal approximation, finite element models and Fourier-based shape descriptors) or region-based methods (statistical moments). A good shape representation feature for an object should be invariant to translation, rotation and scaling.

3.1.4 Spatial Information

Regions or objects with similar color and texture properties can be easily distinguished by imposing spatial constraints. For instance, regions of blue sky and ocean may have similar color histograms, but their spatial locations in images are different. Therefore, the spatial location of regions (or objects) or the spatial relationship between multiple regions (or objects) in an image is very useful for searching images. The most widely used representation of spatial relationships is the 2D string proposed by Chang et al. It is constructed by projecting images along the x and y directions. Two sets of symbols, V and A, are defined on the projection: each symbol in V represents an object in the image, and each symbol in A represents a type of spatial relationship between objects. In addition to the 2D string, spatial quad-trees and symbolic images are also used for spatial information representation.

4 Similarity Measures and Indexing Schemes

4.1 Similarity Measures

Different similarity/distance measures will affect the retrieval performance of an image retrieval system significantly [6]. In this section, we introduce some commonly used similarity measures. We denote D(I, J) as the distance measure between the query image I and the image J in the database, and fi(I) as the number of pixels in the i-th bin of image I. In the following subsections we briefly introduce some of the commonly used distance measure techniques.

1 Minkowski-Form Distance

If each dimension of the image feature vector is independent of the others and of equal importance, the Minkowski-form distance Lp is appropriate for calculating the distance between two images. This distance is defined as

D(I, J) = ( Σi | fi(I) - fi(J) |^p )^(1/p)

When p = 1, 2 and ∞, D(I, J) is the L1, L2 (also called Euclidean) and L∞ distance, respectively. The Minkowski-form distance is the most widely used metric for image retrieval. For instance, the MARS system used the Euclidean distance to compute the similarity between texture features; Netra used the Euclidean distance for color and shape features and the L1 distance for the texture feature; Blobworld used the Euclidean distance for texture and shape features. In addition, Voorhees and Poggio used the L∞ distance to compute the similarity between texture images.
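A small Python sketch of the Minkowski-form distance between two feature histograms; fI and fJ are assumed to be equal-length histograms such as the fi(I) defined above.

import numpy as np

def minkowski(fI, fJ, p):
    # L1 (p = 1), Euclidean (p = 2) or L-infinity (p = np.inf) distance
    d = np.abs(np.asarray(fI, dtype=float) - np.asarray(fJ, dtype=float))
    return float(d.max()) if np.isinf(p) else float((d ** p).sum() ** (1.0 / p))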


2 Quadratic Form (QF) Distance

The Minkowski distance treats all bins of the feature histogram entirely independently and does not account for the fact that certain pairs of bins correspond to features which are perceptually more similar than other pairs. To solve this problem, the quadratic form distance is introduced:

D(I, J) = sqrt( (FI - FJ)^T A (FI - FJ) )

where A = [aij] is a similarity matrix and aij denotes the similarity between bins i and j; FI and FJ are vectors that list all the entries in fi(I) and fi(J). The quadratic form distance has been used in many retrieval systems for color histogram-based image retrieval. It has been shown that the quadratic form distance can lead to perceptually more desirable results than the Euclidean distance and the histogram intersection method, as it considers the cross-similarity between colors.
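The quadratic form distance itself is a one-liner once the bin-similarity matrix A is chosen; the sketch below pairs it with a similarity matrix that decreases with the distance between bin indices, which is only one illustrative choice of A.

import numpy as np

def quadratic_form_distance(fI, fJ, A):
    diff = np.asarray(fI, dtype=float) - np.asarray(fJ, dtype=float)
    return float(np.sqrt(diff @ A @ diff))

def example_similarity_matrix(n_bins):
    # a_ij = 1 - d_ij / d_max, with d_ij the distance between bin indices
    i, j = np.meshgrid(np.arange(n_bins), np.arange(n_bins), indexing="ij")
    d = np.abs(i - j)
    return 1.0 - d / d.max()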

4.2 Indexing Scheme

After dimension reduction, the multi-dimensional data are indexed. A number of approaches have been proposed for this purpose, including the R-tree (particularly the R*-tree), linear quad-trees, the K-d-B tree and grid files. Most of these multi-dimensional indexing methods show reasonable performance for a small number of dimensions (up to 20), but explode exponentially with increasing dimensionality and eventually reduce to sequential searching. Furthermore, these indexing schemes assume that the underlying feature comparison is based on the Euclidean distance, which is not necessarily true for many image retrieval applications.

4.3 User Interaction

For content-based image retrieval, user interaction with the retrieval system is crucial since flexible formation and modification of queries can only be obtained by involving the user in the retrieval procedure. User interfaces in image retrieval systems typically consist of a query formulation part and a result presentation part.

4.3.1 Query Specification

Specifying what kind of images a user wishes to retrieve from the database can be done in many ways. Commonly used query formations are: category browsing, query by concept, query by sketch and query by example. Category browsing is to browse through the database according to the category of the image; for this purpose, images in the database are classified into different categories according to their semantic or visual content. Query by concept is to retrieve images according to the conceptual description associated with each image in the database. Query by sketch and query by example are to draw a sketch or provide an example image, from which images with similar visual features will be retrieved from the database.

4.4 Relevance Feedback

Relevance feedback is a supervised active learning technique used to improve the effectiveness of information systems. The main idea is to use positive and negative examples from the user to improve system performance. For a given query, the system first retrieves a list of ranked images according to a predefined similarity metrics. Then, the user marks the retrieved images as relevant (positive examples) to the query or not relevant (negative


examples). The system will refine the retrieval results based on the feedback and present a new list of images to the user. Hence, the key issue in relevance feedback is how to incorporate positive and negative examples to refine the query and/or to adjust the similarity measure.

4.5 Performance Evaluation

To evaluate the performance of a retrieval system, two measurements, namely recall and precision [8,7], are borrowed from traditional information retrieval. For a query q, the set of images in the database that are relevant to the query q is denoted R(q), and the retrieval result of the query q is denoted Q(q). The precision of the retrieval is defined as the fraction of the retrieved images that are indeed relevant to the query:

precision = |Q(q) ∩ R(q)| / |Q(q)|

The recall is the fraction of relevant images that is returned by the query:

recall = |Q(q) ∩ R(q)| / |R(q)|
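In code, with Q(q) and R(q) represented as sets of image identifiers, the two measures reduce to the following minimal sketch.

def precision_recall(Q, R):
    # Q: retrieved image ids for query q, R: relevant image ids for query q
    Q, R = set(Q), set(R)
    hits = len(Q & R)
    return hits / len(Q), hits / len(R)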

4.6 Practical Applications of CBIR

A wide range of possible applications for CBIR technology includes:

1. Crime prevention
2. Military
3. Intellectual property
4. Architecture and engineering design
5. Fashion and interior design
6. Journalism and advertising
7. Medical diagnosis
8. Geographical information and remote sensing systems
9. Education and training
10. Home entertainment

5 Our Approach

5.1 Introduction

As content-based image retrieval (CBIR) methods have matured, they have become a potentially useful tool in many fields, including scientific investigation in the life sciences. As an example, to fully exploit the many large image collections now available in botany, scientists need automatic methods to assist them in the study of the visual content. To apply CBIR to these image databases, one must first develop description methods that are adapted both to the specific content and to the objectives of the botanists. In this work, we are interested in issues that are specific to the study of the function of genes in plants [2]. By selectively blocking individual genes, biologists can obtain rather diverse plant phenotypes. They first need a qualitative and quantitative characterization of each phenotype, reflecting the expression of a specific gene. Then, they must find which phenotypes are visually similar; indeed, visual resemblance between phenotypes reflects similarities in the roles of the genes whose expression was blocked when obtaining these phenotypes.


5.2 Overview of the System

For small databases such manipulations can be performed manually. But very large databases obtained as a result of large-scale genetic experiments require robust automatic procedures for characterizing the visual content and for identifying visual similarities. This will be our focus in the following. We use here an image database containing all classes of plants taken in several places in the world, at different periods of the year, under various conditions. All these plants had undergone genetic modifications. In order to satisfy the requirements of the application, we defined a task.

In the retrieval task, the user chooses an image as a query; this image is employed by our system to find all the plant images that are visually similar to the query plant image.

5.3 Feature Extraction

5.3.1 Plant Mask Computation

For the retrieval task, we need to perform plant mask extraction together with its shape and color description. In this study, the plant collection contains images with a homogeneous background (synthetic) as well as a heterogeneous background (earth). To eliminate the strong influence of the background on the retrieval process, we decided to separate the plant from the background and use only the salient region corresponding to the plant to perform partial queries. In order to have a single mask per plant, even if it contains leaves with color alterations, we perform a coarse segmentation. Each pixel is represented by a local histogram of color distributions computed around the pixel, in a quantized color space.

Fig. 1: Coarse segmentation and mask construction

5.3.2 Shape Descriptor for Plant Masks

Once every image is segmented into a set of regions, we find the various connected components, neglect the smallest ones and use a border detection algorithm to obtain the contours of the salient regions. The small neglected regions are represented as hatched areas and the contours of the salient regions as white curves.


Fig. 2: Detection of external plant contours

Directional Fragment Histograms

We introduce a new approach to describe the shape of a region, inspired by an idea related to the color descriptor. This new shape descriptor, called Directional Fragment Histogram

(DFH), is computed using the outline of the region. We consider that each element of the contour has a relative orientation with respect to its neighbors. We slide a segment over the contour of the shape and we identify groups of elements having the same direction (orientation) within the segment. Such groups are called directional fragments.

The DFH codes the frequency distribution and relative length of these groups of elements. The length of the segment defines the scale s of the DFH. Assume that the direction of an element of the contour can take N different values d0, d1, ..., dN-1.

A fragment histogram at scale s is a two-dimensional array of values. Each direction corresponds to a set of bins, and the value of each bin DFHs(i, j) is the number of positions at which the segment contains a certain percentage of contour elements with the orientation di. Supposing that the percentage axis (0%-100%) of the fragment histogram is partitioned into J percentiles p0, p1, ..., pJ-1, the fragment histogram contains N x J bins.

The fragment histogram is computed by visiting each position in the contour, retrieving the directions of all the elements contained in the segment S starting at this position, computing the percentage of elements having each direction and incrementing the histogram bins DFHs(i,j) corresponding to the percentage of elements with a given orientation.

5.3.3 Illustration of Extraction Procedure

Suppose that we have 8 different directions and 4 different fraction ranges as in Fig 3. The segment used is composed of 200 contour elements. Assume that, at a certain position in the contour, the segment contains

20 elements with direction d0 (20/200 = 10% [0-25]),

60 elements with direction d2 (60/200=30% [25-50]),

120 elements with direction d7 (120/200 =60% [50-75]).

Then, the first bin in row d0, the second bin in row d2 and the third bin in row d7 will be incremented, respectively. So, in this case, the segment is counted three times, once for each direction present in the segment, and each time it represents a group of elements of a


different size. The fragment histogram DFHs (i, j) can be normalized by the number of all the possible segments at the end of the procedure.

Fig. 3: Extraction of the Directional Fragment Histogram
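The extraction procedure illustrated above can be sketched as follows, assuming the contour is given as a chain code of quantized directions (N = 8) and the percentage axis is split into J = 4 ranges; the function name and the exact handling of boundary percentages are our own simplifications.

import numpy as np

def directional_fragment_histogram(directions, s, N=8, bounds=(0.25, 0.50, 0.75, 1.0)):
    # directions: chain-coded (closed) contour, each element in {0, ..., N-1}
    # s: segment length, i.e. the scale of the DFH
    directions = np.asarray(directions)
    L = len(directions)
    hist = np.zeros((N, len(bounds)))
    for start in range(L):                           # slide the segment over the contour
        seg = directions[(start + np.arange(s)) % L]
        for d in np.unique(seg):
            frac = np.count_nonzero(seg == d) / s    # fraction of elements with direction d
            j = next(k for k, b in enumerate(bounds) if frac <= b)
            hist[d, j] += 1                          # one count per direction present in the segment
    return hist / L                                  # normalize by the number of segment positions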

5.3.4 Quantization of Leaf Color Alterations

Each pixel is represented by its color space components; the RGB color space was tested. This segmentation, as shown in Fig. 1, allows a quantitative study of the color alterations that are an expression of genetic modifications. For example, we can distinguish several parts of the plant that have undergone color alterations engendered by genetic modifications, compared to the whole plant. The results of this fine segmentation are used to perform quantitative measures of the relative area of the altered parts of leaves. These measures will provide automatic textual annotation.

Fig. 4: Two Examples of Area Quantization of Color Alterations, Based on a Fine Segmentation of Plant Images

References

The following websites and technical papers have been referred to:

[1] H. Frigui and R. Krishnapuram, Clustering by competitive agglomeration, Pattern Recognition, 30(7): 1109-1119, 1997.
[2] Jie Zou and George Nagy, Evaluation of Model-Based Interactive Flower Recognition, ICPR'04, Cambridge, United Kingdom, 2004.
[3] N. Boujemaa, On competitive unsupervised clustering, International Conference on Pattern Recognition (ICPR'00), Barcelona, Spain, 2000.
[4] R.J. Qian, P.L.J. van Beek and M.I. Sezan, Image retrieval using blob histograms, in IEEE Proc. Int'l. Conf. on Multimedia and Expo, New York City, July 2000.
[5] Peter Belhumeur et al., An Electronic Field Guide: Plant Exploration and Discovery in the 21st Century, http://www.cfar.umd.edu/~gaaga/leaf/leaf.html
[6] M.L. Kherfi and D. Ziou, Image Retrieval From the World Wide Web: Issues, Techniques and Systems.
[7] Fuhui Long, Hongjiang Zhang and David Dagan Feng, Fundamentals of Content Based Image Retrieval.
[8] Sia Ka Cheung, Issues on Content Based Image Retrieval.
[9] Nick Efford, Image Processing in Java.
[10] Donn Le Vie, Jr., Writing Software Requirements Specifications, available at http://www.raycomm.com/techwhirl/softwarerequirementspecs.html
[11] Pressman, A Practitioner's Approach to Software Engineering, 5th edition.


Review of Analysis of Watermarking Algorithms

for Images in the Presence of Lossy Compression

N. Venkatram, KL College of Engineering, Vaddeswaram, [email protected]
L.S.S. Reddy, KL College of Engineering, Vaddeswaram, [email protected]

Abstract

In this paper, an analytical study related to the performance of important digital watermarking approaches is presented. Correlation between the embedded watermark and extracted watermark is found to identify the optimal watermarking domain that can help to maximize the data hiding in spread spectrum and Quantization watermarking.

1 Introduction

Information communication continues to attract researchers toward innovative methods in digitization, image processing, compression techniques and data security. The problems associated with self-healing of data, broadcast monitoring and signal tagging can be successfully overcome by digital watermarking. In all these applications, the robustness of watermarking is limited by compression, which introduces distortion. This paper deals with the performance analysis of watermark embedding strategies under perceptual coding. Perceptual coding is lossy compression of multimedia based on human perceptual models; its basis is that minor modifications of the signal representation are not noticeable in the displayed content. Compression uses these modifications to reduce the number of bits required for storage, and watermarking also uses these modifications to embed and detect the watermark. A compromise between perceptual coding and watermarking needs to be found so that both processes can achieve their tasks.

2 Literature survey

Wolfgang et al. [1] investigated color image compression techniques using the discrete cosine transform (DCT) and the discrete wavelet transform (DWT), together with DCT- and DWT-based spread spectrum watermarking. Their assertion is that matching the watermarking and coding transforms improves performance, but there is no theoretical basis for this assertion. Kundur and Hatzinakos [2] argue, using both analytical and simulation results, that the use of the same transform for both compression and watermarking results in suboptimal performance for repetition-code-based quantization watermarking. Ramkumar and Akansu [3], [4] conclude that transforms which have poor energy compaction and are not suitable for compression are useful at high capacities of spread spectrum data hiding. With these inconsistencies in the literature, two questions arise: what is the best embedding transform for robustness against lossy compression, and which of spread spectrum or quantization embedding is superior?


Eggers and Girod [5], [6] provided a detailed analysis of quantization effects on spread spectrum watermarking schemes in the DCT domain. Wu and Yu [7], [8] presented an idea of combining two different watermark embedding strategies for embedding information in the 8x8 block DCT coefficients of a host video: quantization watermarking [7] is used for embedding in the low frequencies and spread spectrum watermarking in the high frequencies. C. Fei et al. [8], [9] proposed a model that incorporates the quantization arising from compression into spread spectrum watermarking. Chen and Wornell [10] and Eggers and Girod [11] developed watermarking schemes that are robust to lossy compression.

3 Quantization effects on Watermarks

Eggers and Girod [6] have analyzed the quantization effects on additive watermarking schemes. Their analysis is based on the computation of statistical dependencies between the quantized watermarked signal and the watermark, derived by extending the theory of dithered quantizers. They obtained expressions for the correlation terms E[eu], E[ev] and E[e^2], where u and v are independent, zero-mean random variables and e is the quantization error defined as

e = q - u - v    (i)

Based on these expressions, Fei et al. [12] proposed a method of finding the expected correlation between the quantized signal q and the signal u itself as follows:

E[uq] = E[u^2] + E[eu]    (ii)

E[q^2] = E[u^2] + E[v^2] + E[e^2] + 2E[eu] + 2E[ev]    (iii)

Based on the above, the watermark correlation and the variance of the extracted watermark for spread spectrum watermarking are derived by Fei et al. [12].

The model by Eggers and Girod [6] shows that the probability density function of the host data to be watermarked has a significant influence on the correlation values between the watermark and the quantized coefficients in spread spectrum watermarking. Their simulations also show that a generalized Gaussian model for the DCT coefficients agrees closely with the experimental results. Using these results, Fei et al. [12] calculated the theoretical correlation coefficients.
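These relations are straightforward to check numerically. The following sketch (ours, not from the reviewed papers) applies a uniform scalar quantizer to a watermarked host and compares the sample moments against equations (ii) and (iii); the step size, Gaussian host variance and binary watermark are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    delta = 4.0                               # quantizer step size (assumed)
    u = rng.normal(0.0, 10.0, 10**6)          # host coefficients, zero mean (assumed Gaussian)
    v = rng.choice([-1.0, 1.0], size=u.size)  # spread-spectrum watermark, zero mean

    q = delta * np.round((u + v) / delta)     # quantized watermarked signal
    e = q - u - v                             # quantization error, equation (i)

    lhs_ii = np.mean(u * q)
    rhs_ii = np.mean(u * u) + np.mean(e * u)                      # equation (ii)
    lhs_iii = np.mean(q * q)
    rhs_iii = (np.mean(u * u) + np.mean(v * v) + np.mean(e * e)
               + 2 * np.mean(e * u) + 2 * np.mean(e * v))         # equation (iii)

    # both sides agree up to the sample estimate of E[uv], which is ~0 here
    print(f"E[uq]  = {lhs_ii:.3f}  vs  E[u^2] + E[eu] = {rhs_ii:.3f}")
    print(f"E[q^2] = {lhs_iii:.3f} vs  sum of terms   = {rhs_iii:.3f}")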

4 Discussion

For two different images, both fully dependent and fully independent watermark sequences are tested using the following techniques.

4.1 Simulation Results Using Expected Average Correlation Coefficient Measure

It is found that, for JPEG (Joint Photographic Experts Group) compression with a quality factor less than 92, the Hadamard transform is much superior to the others. The wavelet transform is a little better than the KLT and DCT, and the KLT and DCT are slightly better than the Slant transform. In all these experiments the performance of the pixel domain remains roughly constant, so that it exceeds that of the wavelet, KLT, DCT and Slant transforms at very low compression quality factors.


4.2 Simulation Results Using Watermark Detection Error Probability Measure

For JPEG compression with a quality factor less than 92, the Hadamard transform has the smallest error probability and is much superior to the others. The wavelet, KLT, DCT and Slant transforms are close in behaviour, and the performance of the pixel domain remains constant at high quality factors but becomes superior at quality factors below 60.

In the case of quantization-based watermarking, for high quality factors greater than 90 the DWT is better than the Slant and Hadamard transforms, while the DCT and KLT give the lowest performance. The pixel domain is acceptable at very high quality factors but deteriorates to the worst at low quality factors.

Although the quantization-based algorithm can extract the original watermark perfectly when the watermarked signal is transmitted without distortion, the watermark is severely damaged at high levels of compression; thus the quantization method is not very robust to JPEG compression.

5 Conclusion

This paper reviewed various analytical techniques and their appropriateness to practical values in the case of watermarking algorithms for improved resistance to compression. The findings show that spread spectrum watermarking with a repetition code and quantization-based embedding perform well when the watermarking is applied in a domain complementary to that of compression. Spread spectrum watermarking using independent watermark elements works well when the same domain is employed. For improved robustness to JPEG compression, a hybrid watermarking scheme that combines the predicted advantages of spread spectrum and quantization-based watermarking should give superior performance.

6 Acknowledgements

This paper has benefited from the inspiration and review by Prof. P.Thrimurthy.

References

[1] R.B. Wolfgang, C.I. Podilchuk, and E.J. Delp, "The effect of matching watermark and compression transforms in compressed color images," in Proc. IEEE Int. Conf. Image Processing, vol. 1, Oct. 1998, pp. 440-455.
[2] D. Kundur and D. Hatzinakos, "Mismatching perceptual models for effective watermarking in the presence of compression," in Proc. SPIE Multimedia Systems and Applications II, vol. 3845, A.G. Tescher, Ed., Sept. 1999, pp. 29-42.
[3] M. Ramkumar and A.N. Akansu, "Theoretical capacity measures for data hiding in compressed images," in Proc. SPIE Voice, Video and Data Communications, vol. 3528, Nov. 1998, pp. 482-492.
[4] M. Ramkumar, A.N. Akansu, and A. Alatan, "On the choice of transforms for data hiding in compressed video," in Proc. IEEE ICASSP, vol. VI, Phoenix, AZ, Mar. 1999, pp. 3049-3052.
[5] J.J. Eggers and B. Girod, "Watermark detection after quantization attacks," in Proc. 3rd Workshop on Information Hiding, Dresden, Germany, 1999.
[6] J.J. Eggers and B. Girod, "Quantization effects on digital watermarks," Signal Processing, vol. 81, no. 2, pp. 239-263, Feb. 2001.
[7] M. Wu and H. Yu, "Video access control via multi-level data hiding," in Proc. IEEE Int. Conf. Multimedia and Expo (ICME '00), New York, 2000.
[8] C. Fei, D. Kundur, and R. Kwong, "The choice of watermark domain in the presence of compression," in Proc. IEEE Int. Conf. on Information Technology: Coding and Computing, Las Vegas, NV, Apr. 2001, pp. 79-84.
[9] C. Fei, D. Kundur, and R. Kwong, "Transform-based hybrid data hiding for improved robustness in the presence of perceptual coding," in Proc. SPIE Mathematics of Data/Image Coding, Compression, and Encryption IV, vol. 4475, San Diego, CA, July 2001, pp. 203-212.
[10] B. Chen and G.W. Wornell, "Quantization index modulation: A class of provably good methods for digital watermarking and information embedding," IEEE Trans. Inform. Theory, vol. 47, pp. 1423-1433, May 2001.
[11] J. Eggers and B. Girod, Informed Watermarking. Norwell, MA: Kluwer, 2002.
[12] C. Fei, D. Kundur, and R.H. Kwong, "Analysis and design of watermarking algorithms for improved resistance to compression," IEEE Trans. Image Processing, vol. 13, pp. 126-144, Feb. 2004.


Software Engineering


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Evaluation Metrics for Autonomic Systems

K. Thirupathi Rao, Koneru Lakshmaiah College of Engineering, [email protected]
B. Thirumala Rao, Koneru Lakshmaiah College of Engineering, [email protected]
L.S.S. Reddy, Koneru Lakshmaiah College of Engineering, [email protected]
V. Krishna Reddy, Lakkireddy BaliReddy College of Engineering, [email protected]
P. Saikiran, Srinidhi Institute of Science and Technology, [email protected]

Abstract

Computer systems are becoming increasingly large and complex, thereby compounding many reliability problems. Too often computer systems fail, become compromised, or perform poorly. One of the most interesting approaches to improving system reliability is autonomic management, which offers a potential solution to these challenging research problems. It is inspired by nature and by biological systems, such as the autonomic nervous system, that have evolved to cope with the challenges of scale, complexity, heterogeneity and unpredictability by being decentralized, context aware, adaptive and resilient. This complexity makes autonomic systems more difficult to evaluate, so to measure their performance and to compare autonomic systems we need to derive metrics and benchmarks. This is a highly important and interesting area. This paper gives an important direction for evaluating autonomic systems. To give the reader a feel for the nature of autonomic computing systems, a review of such systems, their properties, general architecture and importance is also presented.

1 Introduction

With modern computing, consisting of new paradigms such as planetary-wide, pervasive, and ubiquitous computing, systems are more complex than before. Interestingly, when chip design became more complex we employed computers to design chips; today we are at the point where humans have limited input to chip design. With systems becoming more complex, it is a natural progression to have the system not only automatically generate code but also build systems and carry out the day-to-day running and configuration of the live system. Autonomic computing has therefore become inevitable and will become more prevalent. Dealing with the growing complexity of computing systems requires autonomic computing, which is inspired by biological systems such as the autonomic human nervous system [1, 2] and enables the development of self-managing computing systems and applications. These systems and applications use autonomic strategies and algorithms to handle complexity and uncertainties with minimum human


intervention. An autonomic application or system is a collection of autonomic elements, which implement intelligent control loops to monitor, analyze, plan and execute using knowledge of the environment. A fundamental principle of autonomic computing is to increase the intelligence of individual computer components so that they become "self-managing," i.e., actively monitoring their state and taking corrective actions in accordance with overall system-management objectives. The autonomic nervous system of the human body controls bodily functions such as heart rate, breathing and blood pressure without any conscious attention on our part. The parallel notion, when applied to autonomic computing, is to have systems that manage themselves without active human intervention. The ultimate goal is to create autonomic computer systems that are self-managing and more powerful; users and administrators will get more benefit from computers because they can concentrate on their work with little conscious intervention in system management. The paper is organized as follows. Section 2 deals with the characteristics of autonomic computing systems, Section 3 with the architecture for autonomic computing, and Section 4 with the evaluation metrics; Section 5 concludes, followed by the references.

2 Characteristics of Autonomic Computing System

The new era of computing is driven by the convergence of biological and digital computing systems. To build tomorrow's autonomic computing systems we must understand and exploit the characteristics of autonomic systems. Autonomic systems and applications exhibit the following characteristics, some of which are discussed in [3, 4].

Self Awareness: An autonomic system or application “knows itself” and is aware of its state and its behaviors.

Self Configuring: An autonomic system or application should be able to configure and reconfigure itself under varying and unpredictable conditions without any detailed human intervention in the form of configuration files or installation dialogs.

Self Optimizing: An autonomic system or application should be able to detect suboptimal behaviors and optimize itself to improve its execution.

Self-Healing: An autonomic system or application should be able to detect and recover from potential problems and continue to function smoothly.

Self Protecting: An autonomic system or application should be capable of detecting and protecting its resources from both internal and external attack and maintaining overall system security and integrity.

Context Aware: An autonomic system or application should be aware of its execution environment and be able to react to changes in the environment.

Open: An autonomic system or application must function in a heterogeneous world and should be portable across multiple hardware and software architectures. Consequently it must be built on standard and open protocols and interfaces.

Anticipatory: An autonomic system or application should be able to anticipate, to the extent possible, its needs and behaviors and those of its context, and be able to manage itself proactively.


Dynamic: Systems are becoming more and more dynamic in a number of aspects such as dynamics from the environment, structural dynamics, huge interaction dynamics and from a software engineering perspective the rapidly changing requirements for the system. Machine failures and upgrades force the system to adapt to these changes. In such a situation, the system needs to be very flexible and dynamic.

Distribution: systems become more and more distributed. This includes physical distribution, due to the invasion of networks in every system, and logical distribution, because there is more and more interaction between applications on a single system and between entities inside a single application.

Situatedness: Systems become more and more situated: there is an explicit notion of the environment in which the system and its entities exist and execute, environmental characteristics affect their execution, and they often explicitly interact with that environment. Such an (execution) environment becomes a primary abstraction that can have its own dynamics, independent of the intrinsic dynamics of the system and its entities. As a consequence, we must be able to cope with uncertainty and unpredictability when building systems that interact with their environment. This situatedness often implies that only local information is available to the entities in the system, or to the system itself as part of a group of systems.

Locality in control: When computing systems and components live and interact in an open world, the concept of a global flow of control becomes meaningless, so independent computing systems have their own autonomous flows of control, and their mutual interactions do not imply any join of these flows. This trend is made stronger by the fact that not only do independent systems have their own flow of control, but different entities within a system also have their own flow of control.

Locality in interaction: Physical laws enforce locality of interaction automatically in a physical environment. In a logical environment, if we want to minimize conceptual and management complexity, we must also favor modeling the system in local terms and limiting the effect of a single entity on the environment. Locality in interaction is a strong requirement when the number of entities in a system increases, or as the dimension of the distribution scale increases; otherwise tracking and controlling concurrent and autonomously initiated interactions is much more difficult than in object-oriented and component-based applications. The reason for this is that autonomously initiated interactions imply that we cannot know what kind of interaction is performed, and we have no clue about when a (specific) interaction is initiated.

Need for global autonomy: The characteristics described so far make it difficult to understand and control the global behavior of the system or a group of systems. Still, there is a need for coherent global behavior. Some functional and non-functional requirements that have to be satisfied by computer systems are so complex that a single entity cannot provide them. We need systems consisting of multiple entities which are relatively simple, and where the global behavior of the system provides the functionality for the complex task.

3 Architecture for Autonomic Computing

Autonomic systems are composed of autonomic elements and are capable of carrying out administrative functions, managing their behavior and their relationships with other systems and applications with reduced human intervention, in accordance with high-level policies.


Autonomic computing systems can make decisions and manage themselves at three scopes. These scopes are discussed in detail in [6].

Resource Element Scope: In resource element scope, individual components such as servers and databases manage themselves.

Group of Resource Elements Scope: In the group-of-resource-elements scope, pools of grouped resources that work together perform self-management. For example, a pool of servers can adjust its workload to achieve high performance.

Business Scope: In the business scope, the overall business context can be self-managing. It is clear that increasing the maturity level of autonomic computing affects the level at which decisions are made.

3.1 Autonomic Element

Autonomic elements (AEs) are the basic building blocks of autonomic systems, and their interactions produce self-managing behavior. Each AE has two parts: a managed element (ME) and an autonomic manager (AM), as shown in Fig. 1. Sensors retrieve information about the current state of the ME's environment, which is then compared with expectations held in the knowledge base of the AE. The required action is executed by the effectors. Sensors and effectors are therefore linked together and create a control loop.

Fig. 1: An autonomic element: the autonomic manager (monitor, analyze, plan and execute functions around a shared knowledge base) manages the managed element through sensors and effectors. The components of Figure 1 are described as follows.

Managed Element: It is a component from system. It can be hardware, application software, or an entire system.

Autonomic Manager: The autonomic manager executes according to administrator policies and implements self-management. An AM uses a manageability interface to monitor and control the ME. It has four parts: monitor, analyze, plan, and execute.

Monitor: Monitoring Module provides different mechanisms to collect, aggregate, filter, monitor and manage information collected by its sensors from the environment of a ME.

Analyze: The Analyze Module performs the diagnosis of the monitoring results and detects any disruptions in the network or system resources. This information is then transformed into events. It helps the AM to predict future states.

Plan: The Planning Module defines the set of elementary actions to perform according to these events. The planner uses policy information and the analysis results to achieve its goals. Policies can



be a set of administrator directives and are stored as knowledge to guide the AM. The planner assigns tasks and resources based on the policies, and adds, modifies, and deletes policies. AMs can change resource allocation to optimize performance according to the policies.

Execute: The Execute Module controls the execution of a plan and dispatches the recommended actions to the ME. These four parts together provide the control-loop functionality.
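As a concrete, highly simplified illustration of this control loop, the following sketch wires the four modules around a hypothetical managed element whose only sensor reading is load and whose only effector adjusts a worker count. All class, method and parameter names here are our own illustrative choices, not part of any autonomic toolkit.

    import random

    class ManagedElement:
        """Hypothetical ME exposing one sensor (load) and one effector (workers)."""
        def __init__(self):
            self.workers = 2
        def sense_load(self):                # sensor
            return random.uniform(0.0, 1.0) / self.workers
        def set_workers(self, n):            # effector
            self.workers = max(1, n)

    class AutonomicManager:
        def __init__(self, element, target_load=0.25):
            self.me = element
            self.knowledge = {"target_load": target_load}   # policy / knowledge base

        def monitor(self):
            return {"load": self.me.sense_load()}

        def analyze(self, symptoms):
            target = self.knowledge["target_load"]
            if symptoms["load"] > 1.5 * target:
                return "overloaded"
            if symptoms["load"] < 0.5 * target:
                return "underloaded"
            return "ok"

        def plan(self, diagnosis):
            if diagnosis == "overloaded":
                return self.me.workers + 1
            if diagnosis == "underloaded":
                return self.me.workers - 1
            return self.me.workers

        def execute(self, planned_workers):
            self.me.set_workers(planned_workers)

        def control_loop(self, iterations=5):
            for _ in range(iterations):
                symptoms = self.monitor()
                diagnosis = self.analyze(symptoms)
                self.execute(self.plan(diagnosis))
                print(f"load={symptoms['load']:.2f} diagnosis={diagnosis} workers={self.me.workers}")

    AutonomicManager(ManagedElement()).control_loop()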

3.2 AC Toolkit

IBM assigns autonomic computing maturity levels to its solutions. There are five levels total and they progressively work toward full automation [5].

Basic Level: At this level, each system element is managed by IT professionals. Configuring, optimizing, healing, and protecting IT components are performed manually.

Managed Level: At this level, system management technologies can be used to collect information from different systems. It helps administrators to collect and analyze information. Most analysis is done by IT professionals. This is the starting point of automation of IT tasks.

Predictive Level: At this level, individual components monitor themselves, analyze changes, and offer advices. Therefore, dependency on persons is reduced and decision making is improved.

Adaptive Level: At this level, IT components can individually or as a group monitor and analyze operations and offer advice with minimal human intervention.

Autonomic Level: At this level, system operations are managed by business policies established by the administrator. In fact, business policy drives overall IT management, whereas at the adaptive level there is still interaction between the human and the system.

4 Evaluation Metrics

Advances in computing, communication, and software technologies and the evolution of the Internet have resulted in explosive growth of information services and their underlying infrastructures. Operational environments have become more and more complex and unmanageable; with this increase in complexity, their evaluation is increasingly important. Evaluation metrics can be classified into two categories: component-level metrics, which measure each unit's ability to meet its goal, and global-level metrics, which measure overall autonomic system performance [7]. This section lists sets of metrics and means by which we can compare such systems. Each metric falls into one of the two categories, and sometimes into both.

a Scalability: Providing rich services to the users requires the computing systems to be scalable.

b Heterogeneity: A computing system should have the ability to run in heterogeneous operating systems.

c Survivability: A computing system should be aware of its operating environment. The future operating environment is unpredictable, and computing systems should be able to survive even under extreme conditions.


d Reliability: New services are always built on top of existing components; thus the reliability of system components becomes more important.

e Adaptability: Modern computing systems, services and applications require system and software architectures to be adaptive in all their attributes and functionalities. We separate the act of adaptation from the monitoring and intelligence that causes the system to adapt. Some systems are designed to continue execution whilst reconfiguring, while others cannot. Furthermore, the location of the components involved impacts the performance of the adaptation process. That is, a component that is currently local to the system, versus a component (such as a printer driver, for example) that has to be retrieved over the Internet, will have significantly different performance. Perhaps more future systems will have the equivalent of a pre-fetch, where components that are likely to be of use are preloaded to speed up the reconfiguration process.

f Quality of Service (QoS): This is a highly important metric in autonomic systems, as they are typically designed to improve some aspect of a service such as speed, efficiency or performance. QoS reflects the degree to which the system is reaching its primary goal. Some systems aim to improve the user's experience with the system, for instance through self-adaptive or personalized GUI design for disabled people. This metric is tightly coupled to the application area or service that is expected of the system. It can be measured as a global or component-level goal metric.

g Cost: Due to the dynamic computing environment, the number of connected computing devices in the network grows every year. As a result, manually managing and controlling these complex computing systems becomes difficult, and the cost of the human labor required to manage them manually is exceeding the equipment cost [8].

Autonomicity itself has a cost, and the degree of this cost and its measurement are not clear-cut. For many commercial systems the aim is to reduce the cost of running an infrastructure, which primarily includes people costs in terms of systems administrators and maintenance. This means that the reduction in cost for such systems cannot be measured immediately, but only over time as the system becomes more and more self-managing. Cost comparison is further complicated by the fact that adding autonomicity means adding intelligence, monitors and adaptation mechanisms, and these have their own cost. A class of application very well suited to autonomic computing is ubiquitous computing, which typically consists of networks of sensors working together to create intelligent homes or monitor the environment. This sort of application relies on self-reliance, distributed self-configuration, intelligence and monitoring. However, many of the nodes in such a system are limited in resources and may be wireless, which means that the cost of autonomic computing involves resource consumption such as battery power.

h Abstraction: Computing Systems hide their complexity from end users, leveraging the resources to achieve business or personal goals, without involving the user in any implementation details.

i Granularity: The granularity of autonomicity is an important issue when comparing autonomic systems. Fine grained components with specific adaptation rules will be highly flexible and perhaps adapt to situations better, however this may cause more


overhead in terms of the global system. That is, if we assume that each finer-grained component requires environmental data and provides some form of feedback on its performance, then potentially there is more monitoring data, or at least more environmental information, flowing around the global system. Of course, this may not be the case in systems where the intelligence is more centralized. Many current commercial autonomic endeavors are at the thicker-grained service level. Granularity is important where unbinding, loading and rebinding a component takes a few seconds. These few seconds are tolerable in a thick-grained component-based architecture, where the overheads can be hidden in the system's overall operation and change is potentially not that frequent. However, in finer-grained architectures, such as an operating system or ubiquitous computing, where change is more frequent or the components smaller, the hot-swap time is potentially too long.

j Robustness: Typically many autonomic systems are designed to avoid failure at some level. Many are designed to cope with hardware failure such as a node in a cluster system or a component that is no longer responding. Some avoid failure by retrieving a missing component. Either way the predictability of failure is an aspect in comparing such systems. To measure this, the nature of the failure and how predictable that failure is, needs to be varied and the systems’ ability to cope measured.

k Degree of Autonomy: Related to failure avoidance, we can compare how autonomous a system is. This relates primarily to AI and agent-based autonomic systems, as their autonomic process usually provides an autonomous activity. For example, the NASA Pathfinder must cope with unpredicted problems and learn to overcome them without external help. This could be measured by decreasing the degree of predictability in the environment and seeing how the system copes; lower predictability could even force the system to cope with situations it was not designed for. A degree of proactivity could also be used to compare these features.

l Reaction Time: Related to cost and sensitivity, these are measurements concerned with system reconfiguration and adaptation. The time to adapt is the time a system takes to adapt to a change in the environment, that is, the time between the identification that a change is required and the point at which the change has been effected safely and the system resumes normal operation. Reaction time can be seen to partly envelop the adaptation time: it is the time between when an environmental element changes and when the system recognizes that change, decides what reconfiguration is necessary to react to it, and gets ready to adapt. The reaction time further affects the sensitivity of the autonomic system to its environment. (A small sketch of how these timings can be computed from an event log is given after this list.)

m Sensitivity: This is a measurement of how well the self adaptive system fits with the environment it is sitting in. At one extreme a highly sensitive system will notice a subtle change as it happens and adapt to improve itself based on that change. However in reality, depending on the nature of the activity, there is usually some form of delay in the feedback that some part of the environment has changed effecting a change in the autonomic system. Further the changeover takes time. Therefore if a system is highly sensitive to its environment potentially it can cause the system to be constantly changing configuration etc and not getting on with the job itself.


n Stabilization: Another metric related to sensitivity is stabilization. That is the time taken for the system to learn its environment and stabilize its operation. This is particularly interesting for open adaptive systems that learn how to best reconfigure the system. For closed autonomic systems the sensitivity would be a product of the static rule/constraint base and the stability of the underlying environment the system must adapt to.
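As an illustration of the timing metrics above (reaction time and adaptation time), the sketch below computes them from a hypothetical event log; the three timestamps and field names are assumptions made only for this example, and real systems would log richer adaptation events.

    from dataclasses import dataclass

    @dataclass
    class AdaptationEvent:
        env_change: float      # t0: environment actually changed
        detected: float        # t1: change recognized and reconfiguration decided
        reconfig_done: float   # t2: reconfiguration finished, normal operation resumed

    def reaction_time(e: AdaptationEvent) -> float:
        """Time from the environmental change until the system is ready to adapt."""
        return e.detected - e.env_change

    def adaptation_time(e: AdaptationEvent) -> float:
        """Time to carry out the reconfiguration itself."""
        return e.reconfig_done - e.detected

    log = [AdaptationEvent(0.0, 1.2, 3.5), AdaptationEvent(10.0, 10.4, 11.1)]
    for e in log:
        print(f"reaction={reaction_time(e):.1f}s adaptation={adaptation_time(e):.1f}s "
              f"total={e.reconfig_done - e.env_change:.1f}s")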

5 Conclusion

In this paper, we have presented the essence of autonomic computing and the development of such systems. It gives the reader a feel for the nature of these types of systems, and we have also presented some typical examples to illustrate the complexities involved in trying to measure the performance of such systems and to compare them.

This paper lists set of metrics and means for measuring the overall autonomic system performance at global level as well as at component level.

Finally, these metrics together form a rudimentary benchmarking tool that can guide the design of new autonomic systems; alternatively, an existing autonomic system can be augmented by incorporating these metrics, which measure various autonomic characteristics.

References

[1] S. Hariri and M. Parashar, "Autonomic computing: An overview," Springer-Verlag Berlin Heidelberg, pp. 247-259, July 2005.
[2] J.O. Kephart and D.M. Chess, "The vision of autonomic computing," IEEE Computer, vol. 36, no. 1, pp. 41-50, January 2003.
[3] R. Sterritt and D. Bustard, "Towards an autonomic computing environment," University of Ulster, Northern Ireland.
[4] D.F. Bantz et al., "Autonomic personal computing," IBM Systems Journal, vol. 42, no. 1, 2003.
[5] J.P. Bigus et al., "ABLE: A toolkit for building multiagent autonomic systems," IBM Systems Journal, vol. 41, no. 3, 2002.
[6] IBM, "An architectural blueprint for autonomic computing," April 2003.
[7] E.N. Bletsas and J.A. McCann, "AEOLUS: An extensible webserver benchmarking tool," submitted to the 13th IW3C2 and ACM World Wide Web Conference (WWW04), New York City, 17-22 May 2004.
[8] J.A. McCann and J.S. Crane, "Kendra: Internet distribution & delivery system, an introductory paper," in Proc. SCS EuroMedia Conference, Leicester, UK, A. Verbraeck and M. Al-Akaidi, Eds., Society for Computer Simulation International, January 1998, pp. 134-140.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Feature Selection for High Dimensional Data:

Empirical Study on the Usability of Correlation

& Coefficient of Dispersion Measures

Babu Reddy M., LBR College of Engineering, Mylavaram–521 230, [email protected]
Thrimurthy P., Acharya Nagarjuna University, Nagarjuna Nagar–522 510, [email protected]
Chandrasekharam R., LBR College of Engineering, Mylavaram–521 230, [email protected]

Abstract

Databases are, in general, modified to suit new requirements in serving users. In that process, the dimensionality of the data also increases, and each time dimensionality increases, the database suffers severely in terms of redundancy. This paper addresses the usefulness of eliminating highly correlated and redundant attributes for increasing classification performance. An attempt has been made to demonstrate the usefulness of dimensionality reduction by applying the LVQ (Learning Vector Quantization) method on two benchmark datasets of lung cancer patients and diabetic patients. We adopt feature selection methods used in machine learning for reducing dimensionality, removing inappropriate data, increasing learning accuracy, and improving comprehensibility.

1 Introduction

Feature selection is one of the prominent preprocessing steps in machine learning. It is the process of choosing a subset of the original features so that the feature space is condensed according to a certain evaluation criterion. Feature selection has been a significant field of research and development since the 1970s and has proved very useful in removing irrelevant and redundant features, increasing learning efficiency, improving learning performance such as predictive accuracy, and enhancing the comprehensibility of learned results [John & Kohavi, 1997; Liu & Dash, 1997; Blum & Langley, 1997]. In present-day applications such as genome projects [Xing et al., 2001], image retrieval [Rui et al., 1998-99], customer relationship management [Liu & Liu, 2000], and text categorization [Pederson & Yang, 1997], databases have become extremely large. This immensity may cause serious problems for many machine learning algorithms in terms of efficiency and learning performance. For example, high dimensional data can contain a high degree of redundant and irrelevant information which may greatly influence the performance of learning algorithms. Therefore, when dealing with high dimensional data, feature selection becomes highly necessary. Some recent research efforts in feature selection have focused on these challenges [Liu et al., 2002; Das, 2001; Xing et al., 2001]. In the following, basic models of feature selection are reviewed and a supporting justification is given for choosing the filter solution as a suitable method for high dimensional data.

Feature selection algorithms can be divided into two broader categories, namely the filter model and the wrapper model [Das, 2001; John & Kohavi, 1997]. The filter model relies on general characteristics of the training data to select features without involving any learning algorithm. The wrapper model relies on a predetermined learning algorithm and uses its performance to evaluate and select features; for each new subset of features, the wrapper model needs to learn a classifier. It tends to give superior performance, as it finds tailor-made features better suited to the predetermined learning algorithm, but it also tends to be more computationally expensive [Langley, 1994]. For large numbers of features, the filter model is usually the choice due to its computational efficiency.

Feature selection algorithms under the filter model can be further classified into two groups, namely subset search algorithms and feature weighting algorithms. Feature weighting algorithms allocate weights to features individually and grade them based on their relevance to the objective; a feature is selected if its weight of relevance is greater than a threshold value. Relief [Kira & Rendell, 1992] is a well known algorithm that relies on relevance evaluation. The key idea of Relief is to estimate the relevance of features according to their classification capability, i.e. how well their values differentiate between instances of the same and different classes. Relief randomly samples a number p of instances from the training set and updates the relevance estimation of each feature based on the difference between the selected instance and the two nearest instances of the same and opposite classes. The time complexity of Relief for a data set with M instances and N features is O(pMN). By assuming p to be a constant, the time complexity becomes O(MN), which makes it very scalable to high dimensional data sets. But Relief does not help in removing redundant features: as long as features are relevant to the class concept, they will all be selected even though many of them are highly correlated to each other [Kira & Rendell, 1992]. Experiential evidence from the feature selection literature shows that both irrelevant and redundant features affect the efficiency of learning algorithms and thus should be eliminated as well [Hall 2000; John & Kohavi, 1997].

Subset search algorithms recognize subsets directed by an evaluation measure, or goodness measure [Liu & Motoda, 1998], which captures the goodness of each subset.

Some other evaluation measures in removing both redundant and irrelevant features include the correlation measure [Hall, 1999; Hall, 2000], consistency measure [Dash et al., 2000]. In [Hall 2000], a correlation measure is applied to evaluate the goodness of feature subsets based on the hypothesis that a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other. Consistency measure attempts to find an optimum number of features that can separate classes as consistently as the complete feature set can. In [Dash et al, 2000], different search strategies, like heuristic, exhaustive and random search, are combined with this evaluation measure to form hybrid algorithms. The time complexity is exponential in terms of data dimensionality for exhaustive search and quadratic for heuristic search. The complexity can be linear to the number of iterations in a random search, but experiments show that in order to find an optimum feature subset, the number of iterations required is mostly at least quadratic to the number of features [Dash et al., 2000]. Section 2 discusses the required mathematical preliminaries. Section 3 describes the procedure that has been adopted. Section 4 presents the simulation results of an empirical study. Section 5 concludes this work with key findings and future directions.


2 Preliminaries

We adopt the following from the literature:

2.1 Correlation-Based Measures

In general, a feature is good if it is highly correlated with the class but not with any of the other features. To measure the correlation between two random variables, broadly two approaches can be followed. One is based on classical linear correlation and the other is based on information theory. Under the first approach, the most familiar measure is linear correlation coefficient. For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula

r = Σi (xi − x̄)(yi − ȳ) / sqrt( Σi (xi − x̄)²  ·  Σi (yi − ȳ)² )    (1)

where x̄ is the mean of X and ȳ is the mean of Y. The value of the correlation coefficient r lies between -1 and 1, inclusive. If X and Y are completely correlated, r takes the value 1 or -1; if X and Y are totally independent, r is zero. Correlation is a symmetrical measure for two variables. There are several benefits of choosing linear correlation as a feature goodness measure for classification. First, it helps to identify and remove features with near-zero linear correlation to the class. Second, it helps to reduce redundancy among selected features. It is known that if data is linearly separable in the original representation, it is still linearly separable if all but one of a group of linearly dependent features are removed [Das, 1971]. However, it is not safe to always assume linear correlation between features in the real world; linear correlation measures may not be able to capture correlations that are not linear in nature. Another limitation is that the calculation requires all features to contain numerical values.
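A direct transcription of equation (1) might look as follows; numpy is assumed, and in a feature selection setting one would compute r between each feature column and the class attribute.

    import numpy as np

    def linear_correlation(x, y):
        """Equation (1): Pearson's r between two numeric variables."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        dx, dy = x - x.mean(), y - y.mean()
        return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

    print(linear_correlation([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0: completely correlated
    print(linear_correlation([1, 2, 3, 4], [1, -1, -1, 1]))  # 0.0: uncorrelated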

To overcome these problems, correlation measure can be chosen based on Entropy, a measure of uncertainty of a random variable. The entropy of a variable X is defined as

E(X) = − Σi P(xi) log2 P(xi)    (2)

and the entropy of X after observing values of another variable Y is defined as

E(X|Y) = − Σj P(yj) Σi P(xi|yj) log2 P(xi|yj)    (3)

where P(xi) is the prior probability of each value of X, and P(xi|yj) is the posterior probability of X given the values of Y. The amount by which the entropy of X decreases reflects additional information about X provided by Y and is called information gain (Quinlan, 1993), given by

IG(X|Y) = E(X) − E(X|Y)    (4)


According to this measure, a feature Y is more correlated to feature X than to feature Z, if IG (X|Y ) > IG(Z|Y).

The information gain is symmetrical for two random variables X and Y, and symmetry is a desired property for a measure of correlation between features. The problem with the information gain measure is that it is biased in favor of features with more values. A measure of symmetrical uncertainty compensates for the information gain's bias and normalizes its values to [0, 1]. A value of 1 indicates that the value of feature X completely predicts the value of feature Y, and 0 indicates that the two features X and Y are independent.

SU(X, Y) = 2[IG (X|Y) / (E(X)+E(Y))] (5)
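Equations (2) to (5) can be computed from empirical probabilities roughly as follows; this sketch assumes both variables are already discretized and is meant only to make the definitions concrete.

    import numpy as np
    from collections import Counter

    def entropy(xs):                                     # equation (2)
        n = len(xs)
        return -sum((c / n) * np.log2(c / n) for c in Counter(xs).values())

    def conditional_entropy(xs, ys):                     # equation (3)
        n = len(ys)
        total = 0.0
        for y, cy in Counter(ys).items():
            xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
            total += (cy / n) * entropy(xs_given_y)
        return total

    def information_gain(xs, ys):                        # equation (4)
        return entropy(xs) - conditional_entropy(xs, ys)

    def symmetrical_uncertainty(xs, ys):                 # equation (5)
        return 2.0 * information_gain(xs, ys) / (entropy(xs) + entropy(ys))

    x = [0, 0, 1, 1, 0, 1]
    y = [0, 0, 1, 1, 0, 1]        # y completely predicts x
    z = [0, 1, 0, 1, 0, 1]        # z carries little information about x
    print(symmetrical_uncertainty(x, y))   # 1.0
    print(symmetrical_uncertainty(x, z))   # close to 0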

2.2 CFS: Correlation-Based Feature Selection

The key idea of the CFS algorithm is a heuristic evaluation of the merit of a subset of features. This heuristic takes into account the usefulness of individual features for predicting the class label along with the level of inter-correlation among them. If there are n possible features, then there are 2^n possible subsets; to find the optimal subset, all 2^n subsets would have to be tried, which is generally not feasible. Various heuristic search strategies, such as hill climbing and best first search [Rich and Knight, 1991], are therefore often used. CFS starts with an empty set of features and uses a best first forward search (BFFS), terminating when it encounters consecutive non-improving subsets.
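A simplified sketch of such a search is given below. The merit function is the standard CFS heuristic (mean feature-class correlation divided by a term involving the mean feature-feature inter-correlation), into which the symmetrical uncertainty or |r| can be plugged. The class_corr and feat_corr dictionaries of precomputed correlations, and the purely greedy stopping rule, are simplifying assumptions rather than the exact best-first procedure.

    import math

    def merit(subset, class_corr, feat_corr):
        """CFS merit: k * mean(r_cf) / sqrt(k + k*(k-1) * mean(r_ff))."""
        k = len(subset)
        if k == 0:
            return 0.0
        r_cf = sum(class_corr[f] for f in subset) / k
        pairs = [(a, b) for a in subset for b in subset if a < b]
        r_ff = sum(feat_corr[p] for p in pairs) / len(pairs) if pairs else 0.0
        return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

    def forward_select(features, class_corr, feat_corr):
        """Greedy forward search: add the feature that most improves the merit,
        stopping at the first non-improving step (a simplification of BFFS)."""
        selected, best = [], 0.0
        while True:
            candidates = [f for f in features if f not in selected]
            if not candidates:
                break
            f_best = max(candidates,
                         key=lambda f: merit(selected + [f], class_corr, feat_corr))
            new_merit = merit(selected + [f_best], class_corr, feat_corr)
            if new_merit <= best:
                break
            selected, best = selected + [f_best], new_merit
        return selected, best

    # Hypothetical precomputed correlations: f2 is redundant with f1, f3 is weak.
    class_corr = {"f1": 0.8, "f2": 0.75, "f3": 0.1}
    feat_corr = {("f1", "f2"): 0.9, ("f1", "f3"): 0.05, ("f2", "f3"): 0.1}
    print(forward_select(["f1", "f2", "f3"], class_corr, feat_corr))  # (['f1'], 0.8)

Note how the redundant feature f2 is never added even though it is highly correlated with the class, because its strong correlation with f1 lowers the merit of the combined subset.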

3 Process Description

a) In this paper, the usefulness of correlation and variance measures in identifying and removing irrelevant and redundant attributes has been studied by applying the Learning Vector Quantization (LVQ) method on two benchmark micro-array datasets of lung cancer patients and Pima-Indian diabetic patients. The benchmark data sets considered also include the class label as one of the attributes. The performance of the LVQ method in supervised classification has been studied with the original data set and with a reduced dataset from which a few irrelevant and redundant attributes have been eliminated.

b) On the lung cancer data set, features whose coefficient of dispersion is very low have been discarded from further processing and the results are compared.

Let F be the feature matrix

    F = | F11  F21  F31  ...  FN1 |
        | F12  F22  F32  ...  FN2 |
        | F13  F23  F33  ...  FN3 |
        | ...  ...  ...  ...  ... |
        | F1M  F2M  F3M  ...  FNM |

where the feature set contains N features (attributes) and M instances (records).


Coefficient of Dispersion: CDFi = σFi / F̄i, where F̄i is the arithmetic average of a particular feature i and

σFi = sqrt( Σj (Fij − F̄i)² / M ),  for j = 1 to M.

If CDFi < δ, feature Fi can be eliminated from further processing. This requires only linear time, O(M), per feature, whereas other methods such as FCBF or CBF with modified pairwise selection require quadratic time, i.e. O(MN).
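A minimal sketch of this dispersion filter, assuming the data is held as an M x N numeric array and that σFi is the standard deviation of feature i, is shown below; the threshold value used is illustrative.

    import numpy as np

    def low_dispersion_features(data, delta=0.05):
        """Return indices of features whose coefficient of dispersion is below delta."""
        data = np.asarray(data, float)
        means = data.mean(axis=0)
        stds = data.std(axis=0)                                  # sigma_Fi over the M instances
        cd = np.full(data.shape[1], np.inf)
        nonzero = np.abs(means) > 0
        cd[nonzero] = stds[nonzero] / np.abs(means[nonzero])     # CD_Fi = sigma_Fi / mean_Fi
        return np.where(cd < delta)[0]

    X = np.array([[5.0, 1.0, 100.0],
                  [5.1, 9.0, 101.0],
                  [4.9, 2.0,  99.0]])
    print(low_dispersion_features(X))   # -> [0 2]: near-constant features to discard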

4 Simulation Results

LVQ has great significance in feature selection and classification tasks. The LVQ method has been applied to the benchmark datasets of diabetic patients [8] and lung cancer patients [4], and an attempt has been made to identify some of the insignificant or redundant attributes by means of class correlation (C-correlation), inter-feature correlation (F-correlation) and the coefficient of dispersion over all instances of a given attribute. This helps towards better performance in terms of classification efficiency in a supervised learning environment. Classification efficiency has been compared between the original dataset and the corresponding reduced dataset with fewer attributes, and better performance has been observed after eliminating the unnecessary or insignificant attributes.
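For concreteness, a bare-bones LVQ1 training sketch is shown below, using the learning rate of 0.1 and 30 iterations listed with the experiments; the use of a single prototype per class and the random initialization are our simplifying assumptions, not the authors' implementation.

    import numpy as np

    def train_lvq1(X, y, lr=0.1, epochs=30, seed=0):
        """LVQ1 with one prototype per class: the nearest prototype is pulled toward
        correctly classified samples and pushed away from misclassified ones."""
        X, y = np.asarray(X, float), np.asarray(y)
        rng = np.random.default_rng(seed)
        classes = np.unique(y)
        protos = np.array([X[rng.choice(np.where(y == c)[0])] for c in classes])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                k = np.argmin(np.linalg.norm(protos - xi, axis=1))   # nearest prototype
                sign = 1.0 if classes[k] == yi else -1.0             # attract or repel
                protos[k] += sign * lr * (xi - protos[k])
        return protos, classes

    def predict(protos, classes, X):
        X = np.asarray(X, float)
        d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
        return classes[np.argmin(d, axis=1)]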

a) Pima-Indian Diabetic Data Set–Correlation Coefficient as a measure for Dimensionality reduction

Database size: 768; No. of classes: 2 (0 & 1); No. of attributes: 8; Learning rate: 0.1; No. of iterations performed: 30

S.No  Training    Testing     Recognized data items       Efficiency                   Ex. Time (in secs)
      inputs (%)  inputs (%)  Original   Reduced (corr)   Original    Reduced (corr)   Original   Reduced (corr)
1     10          90          458        688              66.2808     99.5658          6.516      6.609
2     20          80          459        689              74.7557     112.215          7.219      6.328
3     30          70          460        689              85.5019     128.0669         7.172      6.812
4     40          60          461        689              100.00      149.4577         7.891      7.812
5     50          50          472        689              122.9167    179.4271         5.859      6.968
6     60          40          460        689              149.8371    224.43           7.672      7.219
7     70          30          470        689              204.3478    299.5652         6.14       5.719
8     80          20          470        689              305.1948    447.4026         6.687      7.218
9     90          10          473        689              614.2857    894.8052         5.25       5.843

b) Lung cancer Data set - Correlation Coefficient as a measure for Dimensionality reduction

Database size: 73 instances; No. of classes: 3; No. of attributes: 326 (class attribute included); Learning rate: 0.1; No. of iterations performed: 30. The reduced data set is obtained using correlation with the class label.

S.No  Training    Testing     Recognized data items       Efficiency                   Ex. Time (in secs)
      inputs (%)  inputs (%)  Original   Reduced          Original    Reduced          Original   Reduced
1     10          90          14         22               21.2121     33.3333          9.109      14.58
2     20          80          14         19               24.1379     32.7586          8.109      26.03
3     30          70          14         25               27.451      49.0196          6.812      46.08
4     40          60          14         23               31.8182     52.2727          6.328      46.55
5     50          50          14         26               37.8378     70.273           5.171      13.71
6     60          40          13         32               44.8276     110.3448         6.328      23.23
7     70          30          14         27               63.6364     122.7273         8.906      32.14
8     80          20          14         27               93.333      180.00           212.672    53.90
9     90          10          14         32               200.00      457.1429         6.0        39.25

c) Lung cancer Data set–Coefficient of Dispersion as a measure for Dimensionality reduction

S.No  Training    Testing     Recognized data items         Efficiency                     Ex. Time (in secs)
      inputs (%)  inputs (%)  Original   Reduced (variance) Original    Reduced (variance) Original   Reduced (variance)
1     10          90          14         22                 21.2121     33.333             9.109      5.359
2     20          80          14         19                 24.1379     32.7586            8.109      7.89
3     30          70          14         25                 27.451      49.0196            6.812      8.016
4     40          60          14         23                 31.8182     52.2727            6.328      7.937
5     50          50          14         26                 37.8378     70.2703            5.171      5.203
6     60          40          13         19                 44.8276     65.5172            6.328      7.281
7     70          30          14         24                 63.6364     109.0909           8.906      8.953
8     80          20          14         26                 93.333      173.333            212.672    8.313
9     90          10          14         33                 200.00      471.4286           6.0        6.515

The following graphs show the advantage of the dimensionality reduction method used on the two benchmark sets in terms of Efficiency of Classification and Execution Time.

[Fig. A1 and A2 (Correlation Measure): Diabetic dataset, classification efficiency and execution time versus % of training inputs, for the original and reduced data sets]

[Fig. B1 and B2 (Correlation Measure): classification efficiency and execution time versus % of training inputs, for the original and reduced data sets]

[Fig. C1 and C2 (Coefficient of Dispersion): classification efficiency and execution time versus % of training inputs, for the original and reduced data sets]

It has been clearly observed that the efficiency of classification improves encouragingly after reducing the dimensionality of the data sets. Because of the dynamic load on the processor at the time of running the program, a few peaks have been observed in the execution time; these can be eliminated by running the program in an ideal standalone environment.

5 Conclusion and Future Directions

An improvement in the efficiency of classification has been observed by using correlation and variance as measures to reduce the dimensionality. The existing correlation-based feature selection methods work around features with an acceptable level of correlation among themselves, but so far very little emphasis has been given to independent feature integration and its effect on C-correlation. Useful models can be identified to study the goodness of the combined feature weight of statistically independent attributes, and pair-wise correlations can also be considered for further complexity reduction. The impact of the learning rate and the threshold value on classification performance can also be studied.

References

[1] P. Langley and S. Sage, "Scaling to domains with many irrelevant features," in R. Greiner (Ed.), Computational Learning Theory and Natural Learning Systems (Vol. 4), Cambridge, MA: MIT Press, 1997.
[2] C. Blake and C. Merz, UCI Repository of Machine Learning Databases, 2006. Available at: http://ics.uci.edu/~mlearn/MLRepository.html
[3] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis: An International Journal, 1(3), pp. 131-157, 1997.
[4] Huan Liu and Lei Yu, "Feature selection for high dimensional data: A fast correlation-based filter solution," in Proceedings of the 20th International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
[5] Vincent Sigillito, UCI Machine Learning Repository: Pima Indians Diabetes Data. Available at: http://archives.uci.edu/ml/datasets/Pima+Indians+Diabetes
[6] Huan Liu and Lei Yu, "Redundancy based feature selection for micro-array data," in Proceedings of KDD'04, pp. 737-742, Seattle, WA, USA, 2004.
[7] Imola K. Fodor, "A survey of dimension reduction techniques," Centre for Applied Scientific Computing, Lawrence Livermore National Laboratories.
[8] Jan O. Pederson and Y. Yang, "A comparative study on feature selection in text categorization," Morgan Kaufmann Publishers, pp. 412-420, 1997.
[9] Ron Kohavi and George H. John, "Wrappers for feature subset selection," AIJ special issue on relevance, 2001.
[10] M. Dash and H. Liu, "Feature selection for classification," Journal of Intelligent Data Analysis, pp. 131-156, 1997.
[11] P. Langley, "Selection of relevant features in machine learning," in Proceedings of the AAAI Fall Symposium on Relevance, 1994, pp. 140-144.
[12] S.N. Sivanandam, S. Sumathi, and S.N. Deepa, Introduction to Neural Networks Using Matlab 6.0, TMH, 2006.
[13] L.A. Rendell and K. Kira, "A practical approach to feature selection," in International Conference on Machine Learning, pp. 249-256, 1992.
[14] S.C. Gupta and V.K. Kapoor, Fundamentals of Mathematical Statistics, Sultan Chand & Sons.
[15] Mark A. Hall, "Correlation-based feature selection for discrete and numeric class machine learning," pp. 359-366, 2000, Morgan Kaufmann.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Extreme Programming: A Rapidly Used Method

in Agile Software Process Model

V. Phani Krishna, S.V. Engg. College for Women, Bhimavaram – 534204, [email protected]
K. Rajasekhara Rao, K. L. College of Engineering, Vijayawada, [email protected]

Abstract

Extreme Programming is a discipline of software development based on values of simplicity, communication, feedback, and courage. It works by bringing the whole team together in the presence of simple practices, with enough feedback to enable the team to see where they are and to tune the practices to their unique situation. This paper discusses how tight coupling, redundancy and interconnectedness became a strong foe to the software development process.

1 Introduction

Pragmatic Dave Thomas has remarked that if someone produced a piece of software as tightly coupled as Extreme Programming, they would be fired. In spite of the several advantages of Extreme Programming, these striking characteristics of XP, redundancy and interconnectedness, are normally decried in software design.

In the present paper we discuss how that tight coupling could have a strong negative impact on a software process. For any process, Extreme or not, to be really useful and successful in a variety of situations for different teams, we have to understand how to tailor it.

Every project team inevitably tailors its process to its own situation.

The problem is that most of the time we do tailoring blindly. We may have an idea what problem we’re trying to solve by adding some new practice, or some reason that we don’t need a particular artifact. But process elements don’t exist in isolation from one another. Typically, each provides input, support, or validation for one or more other process elements, and may in turn depend on other elements for similar reasons.

Is this internal coupling as bad for software processes as it is for software? This is an important question not just for Extreme Programming, but for all software processes. Until we understand how process elements depend upon and reinforce one another, process design and tailoring will continue to be a hit-or-miss black art.

Extreme Programming is an excellent subject for studying internal process dependencies. One reason is that it acknowledges those dependencies and tries to enumerate them [Beck,


99]. Additionally, XP is unusual in covering not just the management of the project, but day-to-day coding practices as well. It provides an unusually broad picture of the software development process.

2 Tightly Coupled

The published literature about Extreme Programming is incomplete in several ways. If we follow discussions of how successful teams actually apply XP, we’ll see that there are many implicit practices, including the physical layout of the team workspace and fixed-length iterations. Likewise, since relationships between practices are more difficult to see than the practices themselves, it’s probable that there are unidentified relationships between the practices—perhaps even strong, primary dependencies.

However, just diagramming the twelve explicit XP practices and the relationships documented in Extreme Programming Explained shows the high degree of interconnectedness, as seen in Figure 1.

Rather than add additional complications to the problem right from the start, it will be better to focus on the relationships Beck described. The change we made from the beginning was to split the “testing” practice into “unit testing” and “acceptance testing.” They are different activities, and the XP literature emphasizes the differences in their purpose, timing, and practice, so it seemed appropriate to treat them as distinct practices. Therefore, instead of the original twelve practices of Extreme Programming, this analysis deals with the thirteen shown in Figure 2.

Fig. 1: The Original 12 Practices And Their Dependencies.

Once the complex web of dependencies is shown so clearly, it's easy to understand Dave Thomas' point and the challenge implicit in it. Can a chosen software process be customized in an XP context? If one of the XP practices has to be modified or omitted, how can we understand what we're really losing? If we notice a problem on our project that XP isn't adequately addressing, how can we fit a new practice into this web? That would be our goal—understanding these dependencies well enough to permit informed adjustment. The point is not to "decouple" Extreme Programming.

Many processes try to deal with the problem of redundancy by strengthening the practices. But such measures are costly in terms of time and effort, and they probably also harm team morale and cohesion. A strength of the XP approach is that the practices play multiple roles. In most cases when an XP practice serves to compensate for the flaws of another practice, the redundant compensation is merely a secondary role of the practice. This helps keep the number of practices to a minimum, and has the added benefit of using core team members in enforcement roles without making them seem like "enforcers."

Fig. 2: The Thirteen Practices.

Without some coupling, even in software designs, nothing will ever get done. The trick is to build relationships between components when they are appropriate and helpful, and avoid them otherwise. The coupling within XP is only harmful if it makes the process difficult to change.

3 Teasing Out the Tangles

Are there strongly connected subcomponents that have weaker connections between them? To answer this question, we have to surf the dependency graphs and move the nodes around in order to find some hint. This process is similar to the metallurgical process of annealing, where a metal is heated and then slowly cooled to strengthen it and reduce brittleness. The process allows the molecules of the metal, as it cools, to assume a tighter, more nearly regular structure. Some automated graph-drawing algorithms employ a process of simulated annealing, jostling the nodes of the graph randomly and adjusting position to reach an equilibrium state that minimizes the total length of the arcs in the graph [Kirkpatrick, 83].
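As a purely illustrative aside (not part of the original study), the following minimal C sketch shows how such a simulated-annealing layout could be coded: node positions are jostled at random, and a move is kept if it shortens the total arc length, or occasionally even if it does not, with a probability that falls as the temperature cools. The five-node graph, the cooling schedule and all constants are invented for the example.

    /*
     * Illustrative sketch: simulated-annealing layout of a small dependency
     * graph, minimizing the total arc length. Graph and constants invented.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define N 5   /* number of practices (nodes) */
    #define E 6   /* number of dependencies (arcs) */

    static double x[N], y[N];
    static const int edge[E][2] = { {0,1}, {1,2}, {2,3}, {3,4}, {4,0}, {1,3} };

    static double total_length(void)
    {
        double sum = 0.0;
        for (int e = 0; e < E; e++) {
            double dx = x[edge[e][0]] - x[edge[e][1]];
            double dy = y[edge[e][0]] - y[edge[e][1]];
            sum += sqrt(dx * dx + dy * dy);
        }
        return sum;
    }

    int main(void)
    {
        srand(42);
        for (int i = 0; i < N; i++) {              /* random initial layout */
            x[i] = (double)rand() / RAND_MAX;
            y[i] = (double)rand() / RAND_MAX;
        }

        double temp = 1.0;
        for (int step = 0; step < 20000; step++, temp *= 0.9995) {
            int i = rand() % N;                    /* pick a node to jostle */
            double ox = x[i], oy = y[i];
            double before = total_length();
            x[i] += ((double)rand() / RAND_MAX - 0.5) * temp;
            y[i] += ((double)rand() / RAND_MAX - 0.5) * temp;
            double delta = total_length() - before;
            /* keep improvements; keep some bad moves while still hot */
            if (delta > 0 && exp(-delta / temp) < (double)rand() / RAND_MAX) {
                x[i] = ox;                         /* reject: restore position */
                y[i] = oy;
            }
        }
        printf("total arc length after annealing: %.3f\n", total_length());
        return 0;
    }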

Attempting this kind of rearrangement with Figure 2 did not give any hint, so it is of no direct use here. Hence we can try to visualize clusters of dependencies by arranging the practices in a circle and changing the order to bring closely related practices together. What did practices that were close to each other on the circle have in common? What distinguished practices on opposite sides of the circle?

Fig. 3: Before and after a "circular topological sort."

The low-level programming practices depend on each other more than they depend on the product-scale practices like the planning game and short releases. There are nine practices that seem to operate at particular scales, as illustrated in Figure 4.

Each of these practices seems to provide feedback about particular kinds of decisions, from very small decisions to the large, sweeping ones. Of course, that leaves four other practices out, which is a problem when we're trying to understand all of the practices and how they relate. Not all of the dependencies within XP are of the same kind.

For example, consider the bidirectional dependency between pair programming and unit testing. How does pair programming help unit testing? It strengthens unit testing by suggesting good tests, and by encouraging the unit-testing discipline. It also helps to ensure that the unit-testing process is dealing with well-designed code, making the testing process itself more efficient and productive.

Now turn it around. How does unit testing support pair programming? It guides the programmers by helping them structure their work, setting short-term goals on which to focus. It guides their design work as well; unit testing has well known benefits as a design technique. It also defends against shortcomings of pair programming (even two minds don't write perfect code) by catching errors.


Fig. 4: Scale-defined practices

Do the relationships at larger scales look similar? Another bidirectional dependency on a larger scale exists between on-site customer and acceptance testing. The relationship between the two is clearly different in details from the one we just explored between pair programming and unit testing, but it seems to me to be similar in terms of the respective roles of the two practices.

Having an on-site customer strengthens acceptance testing by guiding the development of tests, and by helping maintain correspondence between stories and tests. In the opposite direction, acceptance testing guides feature development (again by providing goals) and defends against the weaknesses of on-site customer, providing a concrete, executable record of key decisions the customer made that might otherwise be undocumented. It also provides a test bed for the consistency of customer decisions.

Smaller-scale practices strengthen larger-scale practices by providing high-quality input. In other words, smaller-scale practices take care of most of the small details so that the larger-scale practices can effectively deal with appropriately scaled issues. In the reverse direction, larger-scale practices guide smaller-scale activities, and also defend against the mistakes that might slip through.

Re-factoring, forty-hour weeks, simple design, and coding standards seem to all have a strengthening role. One way of looking at the strengthening dependencies is to see them as noise filters. The “noise” refers to the accidental complexity: the extra complexity in our systems over and above the essential complexity that is inherent in the problem being solved. In a software system, that noise can take many forms: unused methods, duplicate code, misplaced responsibility, inappropriate coupling, overly complex algorithms, and so on. Such noise obscures the essential aspects of the system, making it more difficult to understand, test, and change.

The four practices that operate independent of scale seem to be aimed at reducing noise, improving the overall quality of the system in ways that allow the other practices to be more effective. Refactoring is an active practice that seeks to filter chaotic code from the system whenever it is found. Simple design and coding standards are yardsticks against which the system's quality can be measured, and help guide the other practices to produce a high quality system. Finally, the forty-hour week helps eliminate mistakes by reducing physical and mental fatigue in the team members. The four noise-filtering practices, along with their interdependencies, are shown in Figure 5.

Fig. 5: Noise Filters.

Those four noise-filtering practices help many of the other practices to operate more effectively by maximizing clarity and reducing complexity in the code. They help minimize the accidental complexity in the system in favor of the essential complexity.

4 A Feedback Engine

The nine practices are characterized not only by the scale of entity they work with; additionally, they function primarily within a certain span of time. Not surprisingly, the practices that operate on small-scale things also operate very quickly. The correspondence between practices and time scales is shown in Figure 6.

The nesting of XP’s feedback loops is the fundamental structural characteristic of Extreme Programming. All of the explicit dependencies between individual practices that have been identified by Beck and others are natural consequences of this overall structure.

Fig. 6: Practices and time scales.


5 Cost of Feedback

Boehm's observations of projects led him to conclude that, as projects advance through their lifecycles, the cost of making necessary changes to the software increases exponentially [Boehm, 81]. This observation led to a generation of processes that were designed to make all changes—all decisions—as early in the process as possible, when changes are cheaper.

Many in the agile community have observed that Boehm's study [Boehm, 81] dealt primarily with projects using a waterfall-style process, where decisions were made very early in the project. Those decisions were often carefully scrutinized to identify mistakes, but the only true test of software is to run it. In classic waterfall projects, such empirical verification typically didn't happen until near the end of the project, when everything was integrated and tested. Agile, iterative processes seem to enjoy a shallower cost-of-change curve, suggesting that perhaps Boehm's study was actually showing how the cost of change increases as a function of the length of the feedback loop, rather than merely the point in the project lifecycle.

The analysis of the cost of change curve is not new to the agile process community. Understanding XP's structure sheds new light on how the process manages that curve. With its time- and scale-sensitive practices and dependencies, XP is an efficient feedback engine, and it provides feedback in a very cost-effective way. For smaller decisions, XP projects get that feedback continuously, minute by minute, through interactions within programming pairs and through unit testing.

Larger decisions, such as the selection of features to help solve a business problem and the best way to spend project budget, are quite costly to validate. Therefore XP projects validate those decisions somewhat more slowly, through day-to-day interaction with customers, giving the customer control over each iteration’s feature choice, and by providing a release every few weeks at most. At every scale, XP’s practices provide feedback in a way that balances timeliness and economy [cockburn, 02].

6 Defense in Depth

Another traditional view of the purpose and function of a software process—closely related to managing the cost of change—is that it is defensive, guarding against the introduction of defects into the product.

Our model of XP’s inner structure also makes sense when measured against this view. In fact, it resembles the timeworn security strategy of defense in depth. Extreme Programming can be seen as a gauntlet of checks through which every line of code must pass before it is ultimately accepted for inclusion in the final product.

At each stage, it is likely that most defects will be eliminated but for those that slip through, the next stage is waiting. Furthermore, the iterative nature of XP means that in most cases code will be revisited, run through the gauntlet again, during later iterations.

7 Conclusion

Extreme Programming has some tight coupling between its practices. But the redundant, "organic" interconnectedness of XP is the source of a lot of its robustness and speed. All those dependencies between practices have a structure that is actually fairly simple. That structure, once identified, provides crucial guidance for those who need to tailor and adjust the software process.

The feedback engine, with its nested feedback loops, is an excellent model for a process designed to manage the cost of change and respond efficiently to changing requirements. This is the essence of agility: letting go of the slow, deliberate decision-making process in favor of quick decisions, quickly and repeatedly tested. The feedback loops are optimized to validate decisions as soon as possible while still keeping cost to a minimum.

References

[1] [Beck, 99] Beck, K. Extreme Programming Explained: Embrace Change. Addison-Wesley, Reading, MA, 1999.
[2] [Boehm, 81] Boehm, B. Software Engineering Economics. Prentice Hall, Englewood Cliffs, NJ, 1981.
[3] [Cockburn, 02] Cockburn, A. Agile Software Development. Addison-Wesley, Boston, 2002.
[4] [Kirkpatrick, 83] Kirkpatrick, S., Gelatt Jr., C.D., and Vecchi, M.P. Optimization by Simulated Annealing. Science, 4598 (13 May 1983), 671–680.


Data Discovery in Data Grid Using Graph Based Semantic Indexing Technique

R. Renuga, Coimbatore Institute of Technology, Coimbatore, [email protected]
Sudha Sadasivam, PSG College of Technology, Coimbatore, [email protected]
S. Anitha, Coimbatore Institute of Technology, Coimbatore
N.U. Harinee, Coimbatore Institute of Technology, Coimbatore
R. Sowmya, Coimbatore Institute of Technology, Coimbatore
B. Sriranjani, Coimbatore Institute of Technology, Coimbatore

Abstract

A data grid is a grid computing system that deals with data – the controlled sharing and management of large amounts of distributed data. The process of data discovery aids in retrieval of requested and relevant data from the data source. The quality of the search is improved when semantically related data is retrieved from the grid. The proposed model of data discovery in the data grid uses a graph based semantic indexing technique to provide efficient discovery of data based on time and other retrieval parameters. Since there often exists some semantic correlation among the specified keywords, this paper proposes a model for more effective discovery of data from a data grid by utilizing the semantic correlation to narrow the scope of the search. The indexing phase makes use of two data structures. One is a hash-based index that maps concepts to their associated operations, allowing efficient evaluation of query concepts. The other is a graph-based index that represents the structural summary of the semantic network of concepts, and is used to answer the queries. The grid environment is established using the GridSim simulator.

Keywords: Semantic search, context classes, graph based indexer, gridsim.

1 Introduction

Grid computing [Kesselman and Kauffmann, 1999] applies the resources of many computers in a network to a single problem at the same time - usually to a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data. The usage of the grid has been predominant in the science and engineering research arena for the past decade. It concentrates on providing collaborative problem-solving and resource sharing strategies. There are two basic types of grids: the computational grid and the data grid. This paper focuses on the data grid and the discovery of data from this grid environment. Data discovery is the process by which requested data is retrieved from the grid.


The traditional keyword based search does not guarantee complete result generation for a requested query, whereas semantic search aims at providing exhaustive search results. Keyword search employs an index based mechanism which has a number of disadvantages. Keyword indices suffer because they associate the semantic meaning of web pages with their actual lexical or syntactic content. Hence there has been an inclination towards semantic search methodologies in recent years [Makela, 2005]; [Guha et al., 2003]. Semantic search attempts to augment and improve traditional search results by making efficient use of the ontology [Ontology, 1996].

The search for particular data in a data pool is both one of the most crucial applications on the grid and also an application where there is always significant room for improvement.

The addition of explicit semantics can improve the search. The semantic approach proposed in this paper exploits and augments the semantic correlation among the query keywords and employs an approximate data retrieval technique to discover the requested data. The search relies on graph based indexing (a data concept network) in which semantic relations can be approximately composed, while the graph distance represents the relevance. The data ranking is done based on the certainty of matching a query.

There are a number of approaches for semantic retrieval of data. The tree based approach, for instance, takes into account only the hierarchical relationships between the data when calculating the semantic correspondence, thereby introducing a number of disadvantages. The graph based approach overcomes these disadvantages because it considers both hierarchical and non-hierarchical relationships [Maguitman et al., 2005].

The contributions of this proposed model are

• The model allows the user to query a data using a simple and extensible query language in an interactive way.

• The model provides approximate compositions by deriving semantic relations between the data concepts based on ontology.

• The model can also be extended to use clusters in order to supply a compact representation of the index.

2 System Overview

2.1 Introduction

The search is mainly grounded in the text given by the user. The query is analyzed to extract its meaning, and the information in the documents is explored to find what the user needs. The documents are ranked according to their relevance to the query.

The design criteria are based upon:

• Usability

• Robustness

• Predictability

• Scalability

• Transparency


The user query is converted into a formal query by matching the keywords with concepts, where the concepts are nodes in a network constructed by referring to the ontology.

The mapping is done by adding connectors and annotating the terms in the user query. An abstract query, which is the mathematical representation of the formal query, is then constructed. The semantic relationship between the concepts can be captured by the notion of context classes.

C-Mapping is a hash based index which is used for efficient evaluation of queries. The C-Network is constructed using C-Mapping. Using this network, the related documents are retrieved from the grid and ranked. Knowledge inference is used, which allows the ontology to be updated dynamically.

Thus whenever a query is given, the documents are ranked according to their relevance and returned to the user. The architecture of the data discovery is shown in Figure 1.

Fig.1: The Data Discovery Architecture

3 Data Discovery Methodology

A semantic search methodology using a graph based technique is designed in this paper for data discovery. The user query forms the basis of the search. The proposed technique processes the query and indexes the documents based on their relevance. The basis for searching is the context classes, C-Mapping and C-Network.

3.1 Query Interface

Users communicate with the search engine through a query interface. The keywords are extracted from the query and the concepts are built by referring to the ontology [Tran et al., 2007]. The user queries are then transformed into formal queries by automatically mapping keywords to concepts in the query interface. In order to formalize a simple query into a query expression, each of the keywords is mapped to a concept term by using content matching techniques. If more than one concept matches a keyword, the concept with the highest matching score is used in the query evaluation. Queries entered via the interface undergo two additional processing steps.


Step 1: Query terms are connected automatically using a conjunctive connector.

Step 2: Concept terms are annotated with a property category (input, output or operation) defining which property will be matched with the term.

The result is a virtual data concept which is associated with a certainty, which is determined according to the semantic correspondence between the node’s and operation’s concepts.
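Returning to the keyword-to-concept mapping step described above, the following small C sketch (not from the paper; the candidate concepts and scores are invented) picks, for a single keyword, the matched concept with the highest content-matching score:

    /*
     * Illustrative sketch only: selecting the highest-scoring concept
     * matched to one keyword. Concepts and scores are invented.
     */
    #include <stdio.h>

    struct match { const char *concept; double score; };

    static const char *best_concept(const struct match *m, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)        /* keep the highest-scoring match */
            if (m[i].score > m[best].score)
                best = i;
        return m[best].concept;
    }

    int main(void)
    {
        struct match candidates[] = {
            { "jewelry",  0.62 },
            { "gemstone", 0.81 },
            { "mineral",  0.40 },
        };
        printf("keyword 'diamond' -> concept '%s'\n",
               best_concept(candidates, 3));
        return 0;
    }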

3.2 Context Classes

The proposed method for analyzing relations between concepts depends on the notion of context classes, which form groups of concepts that allow the investigation of relations between them. For any given concept, we define a set of context classes, each of which defines a subset of the concepts in the ontology according to their relation to the concept. Given a query keyword associated with a concept c, we define a set of concepts Exact(c) as c itself and the concepts which are equivalent to c. The Exact class may contain concepts that have identical semantic meaning. The other context classes contain concepts with related meaning. For each concept c in an ontology O, the following sets of classes are defined.

For example, consider the query "A diamond ring". The keywords in this query are diamond and ring. These keywords are referred to the ontology and concepts such as jewelry, occasions, wedding, gift, gold and Kohinoor are abstracted. The keywords and concepts are then matched and the formal query "A ring made up of diamond" is formed, as in Figure 2.

Fig. 2: Formal Query

The semantic relationships between two concepts are based on the semantic distance between them. Given an anchor concept c and some arbitrary concept c', the semantic correspondence function d(c, c') [Toch et al., 2007] is defined as:

d(c, c') = 1, where c' belongs to Exact(c)

d(c, c') = 1 / (2^(log_α n) · log_β(1 + δ)), where c' belongs to General(c), ClassesOf(c) or Properties(c)

d(c, c') = 1 / (2^n · log_β(1 + δ)), where c' belongs to Specific(c), Instances(c) or InvertProperties(c)

d(c, c') = 1 / (2^(log_α(n1 + n2)) · log_β(1 + δ)), where c' belongs to Siblings(c)

d(c, c') = 0, where c' belongs to Unrelated(c)


Here n is the length of the shortest path between c and c', and δ is the difference between the average depth of the ontology and the depth of the upper concept. The log bases α and β are used as parameters in order to set the magnitude of the descent function. Thus the similarity between two concepts is set to 1 when the concepts have the highest similarity, and it is set to 0 when the concepts are unrelated.
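As an illustration only, and assuming the reconstruction of the formulas given above, a C routine evaluating the correspondence function for a given context class might look as follows; the parameter values α, β, n and δ used in the example are arbitrary:

    /*
     * Illustrative sketch: evaluates d(c, c') as reconstructed above for a
     * given context class, path length n and depth difference delta.
     * The parameter values alpha and beta are invented for the example.
     */
    #include <stdio.h>
    #include <math.h>

    enum context_class { EXACT, GENERAL, SPECIFIC, SIBLINGS, UNRELATED };

    static double log_base(double base, double v) { return log(v) / log(base); }

    static double d(enum context_class cls, double n, double n2,
                    double delta, double alpha, double beta)
    {
        switch (cls) {
        case EXACT:     return 1.0;
        case GENERAL:   /* also ClassesOf / Properties */
            return 1.0 / (pow(2.0, log_base(alpha, n)) * log_base(beta, 1.0 + delta));
        case SPECIFIC:  /* also Instances / InvertProperties */
            return 1.0 / (pow(2.0, n) * log_base(beta, 1.0 + delta));
        case SIBLINGS:
            return 1.0 / (pow(2.0, log_base(alpha, n + n2)) * log_base(beta, 1.0 + delta));
        default:        return 0.0;            /* Unrelated(c) */
        }
    }

    int main(void)
    {
        double alpha = 2.0, beta = 2.0, delta = 3.0;   /* illustrative values */
        printf("general, n=2:     %.4f\n", d(GENERAL, 2, 0, delta, alpha, beta));
        printf("sibling, n1+n2=3: %.4f\n", d(SIBLINGS, 2, 1, delta, alpha, beta));
        printf("unrelated:        %.4f\n", d(UNRELATED, 5, 0, delta, alpha, beta));
        return 0;
    }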

3.3 Indexing and Query Evaluation

The objective of the index is to enable efficient evaluation of queries with respect to processing time and storage space. The index is composed of two data structures:

a. C-Mapping

b. C-Network

3.3.1 C-Mapping

C-Mapping is a hash-based index that maps concepts to their associated operations, allowing efficient evaluation of query concepts. Each mapping is associated with a certainty function in [0, 1] reflecting the semantic affinity between the concept and the concepts of the operation. Context classes are used in order to construct the key set of C-Mapping, and to assign the operations associated with each concept.
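A purely illustrative C sketch of such a mapping is shown below. A real C-Mapping would use a hash table keyed on the concept, as described above; this miniature version uses a linear scan to keep the example short, and all concept names, operations and certainty values are invented:

    /*
     * Illustrative sketch: a tiny in-memory mapping from concepts to
     * operations with a certainty value in [0, 1]. A real C-Mapping would
     * be hash-based; all entries below are invented.
     */
    #include <stdio.h>
    #include <string.h>

    struct mapping {
        const char *concept;     /* key: concept name                  */
        const char *operation;   /* value: an operation on the grid    */
        double certainty;        /* semantic affinity in [0, 1]        */
    };

    static const struct mapping cmap[] = {
        { "jewelry",  "searchCatalog",  1.00 },
        { "diamond",  "searchCatalog",  0.80 },
        { "wedding",  "planOccasion",   0.45 },
    };

    /* print every operation mapped to a concept above a certainty threshold */
    static void lookup(const char *concept, double threshold)
    {
        for (size_t i = 0; i < sizeof(cmap) / sizeof(cmap[0]); i++)
            if (strcmp(cmap[i].concept, concept) == 0 && cmap[i].certainty >= threshold)
                printf("%s -> %s (certainty %.2f)\n",
                       concept, cmap[i].operation, cmap[i].certainty);
    }

    int main(void)
    {
        lookup("diamond", 0.5);
        lookup("wedding", 0.5);   /* filtered out by the threshold */
        return 0;
    }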

3.3.2 C-Network

C-Network is a graph-based index that represents the structural summary of the data concept network, and is used to answer queries that require several atomic operations. C-Mapping is expanded with additional concepts whose mapping certainty is higher than a given threshold, in order to retrieve approximate data. C-Network represents the structural summary of the data concept network using a directed graph. Given two operations, the objective of C-Network is to efficiently answer whether a composite concept, starting with the first operation and ending with the second, can be constructed, and to calculate the certainty of the composition. The design of C-Network is based on principles taken from semantic routing in peer-to-peer networks.

Algorithm:

1. Get the query from the user.
2. Extract the keywords from the query and build the concepts by referring to the ontology.
3. Convert the general query into a formal query by matching the keywords in the query with the concepts:
   For each keyword in the query
       Formal query = Match(keyword, concepts);   (using a content matching technique)
4. Construct the context classes and calculate the semantic correlation between the concepts.
5. For each concept c in the formal query
       Associate a certainty value by referring to the C-Mapping table;
       If c' is linked with c in the C-Network then add c' to the final set of related concepts;
6. Rank the final set of related concepts.


4 Conclusion

This paper has presented a method for data discovery in a grid using a semantic search method. The search technique is implemented by the following steps.

• Split the given query into keywords and extract the concepts

• Form a formal query

• Construct the context classes, C-Mapping and C-Network

• Rank the documents

The above proposed model is under implementation.

5 Acknowledgement

The authors would like to thank Dr. R. Prabhakar, Principal, Coimbatore Institute of Technology, Dr. Rudra Moorthy, Principal, PSG College of Technology and Mr. Chidambaram Kollengode, YAHOO Software Development (India) Ltd, Bangalore for providing us the required facilities to do the project. This project is carried out as a consequence of the YAHOO’s University Relation Programme.

References

[1] [Kesselman and Kauffmann, 1999] Foster, I. and Kesselman, C. "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, San Francisco, 1999.
[2] [Makela, 2005] Eetu Makela, Semantic Computing Research Group, Helsinki Institute for Information Technology (HIIT). "Survey of Semantic Search Research", 2005.
[3] [Guha et al., 2003] R. Guha, Rob McCool, Eric Miller. "Semantic Search", WWW2003, May 20-24, 2003, Budapest, Hungary. ACM 1-58113-680-3/03/0005.
[4] [Maguitman et al., 2005] Ana G. Maguitman, Filippo Menczer, Heather Roinestad and Alessandro Vespignani. "Algorithmic Detection of Semantic Similarity", 2005.
[5] [Tran et al., 2007] Thanh Tran, Philipp Cimiano, Sebastian Rudolph and Rudi Studer. "Ontology-based Interpretation of Keywords for Semantic Search", 2007.
[6] [Toch et al., 2007] Eran Toch, Avigdor Gal, Iris Reinhartz-Berger, Dov Dori. "A Semantic Approach to Approximate Service Retrieval", ACM Trans. Intern. Tech. 8, 1, Article 2, November 2007, pages 1-31.
[7] [Ontology, 1996] "Ontology-Based Knowledge Discovery on the World-Wide Web", in Proceedings of the Workshop on Internet-based Information Systems, AAAI-96, Portland, Oregon, 1996.


Design of Devnagari Spell Checker for Printed Document: A Hybrid Approach

Shaikh Phiroj Chhaware, G.H. Raisoni College of Engineering, Nagpur, India, [email protected]
Latesh G. Mallik, G.H. Raisoni College of Engineering, Nagpur, India, [email protected]

Abstract

Natural language processing plays an important role in the accurate analysis of language related issues. Nowadays, with the advances in Information Technology in India, where the majority of people speak Hindi, a good Devnagari spell checker is required for word processing documents in the Hindi language. One of the challenging problems is how to implement a spell checker for the Hindi language that performs spell checking in a printed document, as is generally done for English in Microsoft Word. This paper aims to develop a system for spelling check of Devnagari text. The proposed approach consists of the development of an encrypted, font-specific word database and a spell check engine which matches the printed word against the available database of words; for a non-word, it presents a list of the most appropriate suggestions based on n-gram distance calculation methods. The spell checker can be treated as a stand-alone application capable of operating on a block of text, or as part of a larger application, such as a word processor, email client, electronic dictionary, or search engine.

Keywords: Devnagari script, Devnagari font, word database, N-Gram distance calculation method, spell checker.

1 Introduction

The most common mode of interaction with a computer is through the keyboard. Spell checker systems have a variety of commercial and practical applications in correctly writing documents, reading forms, manuscripts and their archival. Standard Hindi text is written in the Devnagari script. The text, whether printed or handwritten, needs some sort of proofreading to avoid erroneous matter. There are widely available commercial OCR systems in the market which can recognize printed or handwritten documents in the English language. The need now arises for local languages and for languages specific to a particular religion, locality or society. An example, which looks similar to what appears in an English spell checker, is presented for the Devnagari spell checker.

2 Spell Checking Issues

The earliest writing style programs checked for wordy, trite, clichéd or misused phrases in a text. This process was based on simple pattern matching. The heart of the program was a list of many hundreds or thousands of phrases that are considered poor writing by many experts. The list of suspect phrases included alternate wording for each phrase. The checking program would simply break the text into sentences, check for any matches in the phrase dictionary, flag suspect phrases and show an alternative.

These programs could also perform some mechanical checks. For example, they would typically flag doubled words, doubled punctuation, some capitalization errors, and other simple mechanical mistakes.

True grammar checking is a much more difficult problem. While a computer programming language has a very specific syntax and grammar, this is not so for natural languages. Though it is possible to write a somewhat complete formal grammar for a natural language, there are usually so many exceptions in real usage that a formal grammar is of minimal help in writing a grammar checker. One of the most important parts of a natural language grammar checker is a dictionary of all words in the language.

A grammar checker will find each sentence in a text, look up each word in the dictionary, and then attempt to parse the sentence into a form that matches a grammar. Using various rules, the program can then detect various errors, such as agreement in tense, number, word order, and so on.

3 How Does The Spell Checker Work?

Initially the spell checker reads extracted words from the document, one at a time. Each extracted word is examined against the dictionary. If the word is present in the dictionary, it is interpreted as a valid word and the checker moves on to the next word.

If a word is not present in the dictionary, it is forwarded to the error correcting process. The spell checker comprises three phases, namely text parsing, spelling verification and correction, and generation of a suggestion list. To aid in these phases, the spell checker makes use of the following:

i Morphological analyzer for analyzing the given word

ii Morphological generator for generating the suggestions.

In this context, the spell checker for Hindi needs to tackle the rich morphological structure of Hindi. After tokenizing the document into a list of words, each word is passed to the morphological analyzer. The morphological analyzer first tries to split the suffix. It is designed in such a way that it can analyze only the correct words. When it is unable to split the suffix due to a mistake, it passes the word to the spelling verification and correction phase to correct the mistake.

4 Spelling Verification and Correction

a. Correcting Similar Sounding Letters

Similar sounding letters can cause incorrect spelling of words. For example, consider the word 'Thaalam'. Here the letter 'La' may be misspelled as 'la'. Suggestions are generated by examining all the possible similar sounding letters for the erroneous word.

b. Checking the Noun

Tasks in noun correction include Case marker correction, plural marker checking, postposition checking, adjective checking and root word correction.

c. Checking the Verb

Verb checking tasks include Person, Number & Tense marker checking and root word checking.

d. Correcting the Adjacent Key Errors

A user can mistype an adjacent key instead of the intended letter. So we have to consider all the possible adjacent keys of that particular letter. If any adjacent key of the mistyped letter matches the original letter, then that letter is substituted for the mistyped one and the dictionary is checked.

5 Proposed Plan of Work

The errors in the input are made either due to human mistakes or limitations of the software systems. Many spelling checking programs are available for detecting these errors. There are two approaches for judging the correctness of the spelling of a word:

1. Estimate the likelihood of a spelling by its frequency of occurrence, which is derived from the transition probabilities between characters. This requires a priori statistical knowledge of the language.

2. Judge the correctness by consulting the dictionary.

The spelling correction programs also offer suggestions for correct words, based on their similarity with the input word using the word dictionary. Here a mechanism is required to limit the search space. A number of strategies have been suggested for partitioning the dictionary based on the length of the word, envelope and selected characters. In this paper the following work is considered:

1. Word based approach for correction of spelling.

2. Correctness judged by consulting the dictionary

3. For substitution of the correct word, a list of alternative words should be displayed by the spell check program.


4. Then the user will select the word from the available list or he may update the dictionary with a new word.

5. The correctness of the newly inserted word will be judged by the user himself.

6 Research Methodology to be Employed

The correction method presented here uses a partitioned Hindi word dictionary. The partitioning scheme has been designed keeping special problems in mind which Devanagari script poses. An input word is searched in the selected partitions of the dictionary. An exact match stops further search. However while looking for an exact match, the best choices are gathered. The ranking of the words is based on their distances from the input word. If the best match is within a preset threshold distance, further search is terminated. However, for short words, no search terminating threshold is used. Instead, we try various aliases which are formed from the output of classification process. The output of the character classification process is of three kinds:

• A character is classified to the true class - correct recognition.

• A character is classified such that the true class is not the top choice - a substitution error.

• The character is not classified to a known class - reject error.

A general system design for the Devnagari Spell Checker is depicted below.

Operation

Simple spell checkers operate on individual words by comparing each of them against the contents of a dictionary, possibly performing stemming on the word. If the word is not found, it is considered to be an error, and an attempt may be made to suggest a word that was likely to have been intended. One such suggestion algorithm is to list those words in the dictionary having a small Levenshtein distance from the original word.
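As an illustration of this suggestion step (not taken from the proposed system), the following C sketch computes the Levenshtein distance between two words with the classic dynamic-programming recurrence; the example words are arbitrary:

    /*
     * Illustrative sketch: Levenshtein (edit) distance between two words,
     * i.e. the minimum number of single-character insertions, deletions
     * and substitutions needed to turn one into the other.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 64   /* words longer than this are not handled here */

    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    static int levenshtein(const char *s, const char *t)
    {
        int m = (int)strlen(s), n = (int)strlen(t);
        int d[MAXLEN + 1][MAXLEN + 1];

        for (int i = 0; i <= m; i++) d[i][0] = i;   /* delete all of s */
        for (int j = 0; j <= n; j++) d[0][j] = j;   /* insert all of t */

        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i][j] = min3(d[i - 1][j] + 1,         /* deletion     */
                               d[i][j - 1] + 1,         /* insertion    */
                               d[i - 1][j - 1] + cost); /* substitution */
            }
        return d[m][n];
    }

    int main(void)
    {
        printf("%d\n", levenshtein("recieve", "receive"));    /* prints 2 */
        printf("%d\n", levenshtein("speling", "spelling"));   /* prints 1 */
        return 0;
    }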

When a word which is not within the dictionary is encountered most spell checkers provide an option to add that word to a list of known exceptions that should not be flagged.

Design

A spell checker customarily consists of two parts:

1. A set of routines for scanning text and extracting words, and

2. An algorithm for comparing the extracted words against a known list of correctly spelled words (i.e., the dictionary).


The scanning routines sometimes include language-dependent algorithms for handling morphology. Even for a lightly inflected language like English, word extraction routines will need to handle such phenomena as contractions and possessives. It is unclear whether morphological analysis provides a significant benefit.

The word list might contain just a list of words, or it might also contain additional information, such as hyphenation points or lexical and grammatical attributes. As an adjunct to these two components, the program's user interface will allow users to approve replacements and modify the program's operation. One exception to the above paradigm are spell checkers which are based solely on statistical information, for instance using n-grams. This approach usually requires a lot of effort to obtain sufficient statistical information and may require a lot more runtime storage. These methods are not currently in general use. In some cases spell checkers use a fixed list of misspellings and suggestions for those misspellings; this less flexible approach is often used in paper-based correction methods, such as the "see also" entries of encyclopedias.
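For illustration, an n-gram measure of the kind referred to above can be sketched in C as a Dice-style coefficient over character trigrams; the trigram size, the lack of padding and the example words are arbitrary choices, and the routine is not part of the proposed system:

    /*
     * Illustrative sketch of an n-gram (trigram) similarity measure:
     * 2 * |common trigrams| / (|trigrams(a)| + |trigrams(b)|).
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXGRAMS 128

    /* collect the character trigrams of a word into grams[]; returns count */
    static int trigrams(const char *w, char grams[][4])
    {
        int len = (int)strlen(w), count = 0;
        for (int i = 0; i + 3 <= len && count < MAXGRAMS; i++) {
            memcpy(grams[count], w + i, 3);
            grams[count][3] = '\0';
            count++;
        }
        return count;
    }

    static double trigram_similarity(const char *a, const char *b)
    {
        char ga[MAXGRAMS][4], gb[MAXGRAMS][4];
        int na = trigrams(a, ga), nb = trigrams(b, gb), common = 0;

        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++)
                if (strcmp(ga[i], gb[j]) == 0) { common++; break; }

        return (na + nb) ? (2.0 * common) / (na + nb) : 0.0;
    }

    int main(void)
    {
        printf("%.2f\n", trigram_similarity("correction", "corection")); /* high  */
        printf("%.2f\n", trigram_similarity("correction", "creation"));  /* lower */
        return 0;
    }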

References

[1] R.C. Angell, G.E. Freund and P. Willet (1983) "Automatic spelling correction using a trigram similarity measure", Information Processing and Management, 19: 255-261.
[2] V. Cherkassky and N. Vassilas (1989) "Back-propagation networks for spelling correction", Neural Network, 1(3): 166-173.
[3] K.W. Church and W.A. Gale (1991) "Probability scoring for spelling correction", Statistical Computing, 1(1): 93-103.
[4] F.J. Damerau (1964) "A technique for computer detection and correction of spelling errors", Commun. ACM, 7(3): 171-176.
[5] R.E. Gorin (1971) "SPELL: A spelling checking and correction program", Online documentation for the DEC-10 computer.
[6] S. Kahan, T. Pavlidis and H.S. Baird (1987) "On the recognition of characters of any font size", IEEE Trans. Patt. Anal. Machine Intell., PAMI-9, 9: 174-287.
[7] K. Kukich (1992) "Techniques for automatically correcting words in text", ACM Computing Surveys, 24(4): 377-439.
[8] V.I. Levenshtein (1966) "Binary codes capable of correcting deletions, insertions and reversals", Sov. Phys. Dokl., 10: 707-710.
[9] U. Pal and B.B. Chaudhuri (1995) "Computer recognition of printed Bangla script", Int. J. of System Science, 26(11): 2107-2123.
[10] J.J. Pollock and A. Zamora (1984) "Automatic spelling correction in scientific and scholarly text", Commun. ACM-27, 4: 358-368.
[11] P. Sengupta and B.B. Chaudhuri (1993) "A morpho-syntactic analysis based lexical subsystem", Int. J. of Pattern Recog. and Artificial Intell., 7(3): 595-619.
[12] P. Sengupta and B.B. Chaudhuri (1995) "Projection of multi-worded lexical entities in an inflectional language", Int. J. of Pattern Recog. and Artificial Intell., 9(6): 1015-1028.
[13] R. Singhal and G.T. Toussaint (1979) "Experiments in text recognition with the modified Viterbi algorithm", IEEE Trans. Pattern Analysis Machine Intelligence, PAMI-1, 4: 184-193.
[14] E.J. Yannakoudakis and D. Fawthrop (1983) "An intelligent spelling corrector", Information Processing and Management, 19(12): 101-108.
[15] P. Kundu and B.B. Chaudhuri (1999) "Error Pattern in Bangla Text", International Journal of Dravidian Linguistics, 28(2): 49-88.
[16] Naushad UzZaman and Mumit Khan, A Bangla Phonetic Encoding for Better Spelling Suggestions, Proc. 7th International Conference on Computer and Information Technology (ICCIT 2004), Dhaka, Bangladesh, December 2004.
[17] Naushad UzZaman and Mumit Khan, A Double Metaphone Encoding for Bangla and its Application in Spelling Checker, Proc. 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 705-710, Wuhan, China, October 30 - November 1, 2005.


[18] Naushad UzZaman and Mumit Khan, A Comprehensive Bangla Spelling Checker, Proc. International Conference on Computer Processing on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17 February, 2006.

[19] Naushad UzZaman, Phonetic Encoding for Bangla and its Application to Spelling Checker, Name Searching, Transliteration and Cross Language Information Retrieval, Undergraduate Thesis (Computer Science), BRAC University, May 2005.

[20] Munshi Asadullah, Md. Zahurul Islam, and Mumit Khan, Error-tolerant Finite-state Recognizer and String Pattern Similarity Based Spell-Checker for Bengali, to appear in the Proc. of International Conference on Natural Language Processing, ICON 2007, January 2007.


Remote Administrative Suite for Unix-Based Servers

G. Rama Koteswara Rao, Dept. of CS, P.G. Centre, P.B.S. College, [email protected]
G. Siva Nageswara Rao, P.G. Dept., P.B. Siddhartha College, Vijayawada, [email protected]
K. Ram Chand, P.G. Centre, ASN College, Tenali, [email protected]

Abstract

This paper deals with the methodologies that help in enhancing the capabilities of the server. An attempt is made to develop software that eases the burden of routine administrative functions. This results in increasing the overall throughput of the server.

1 Introduction

In this paper, we deal with client-server technology. We develop methods to enhance the capabilities of a client in accessing a server on static and dynamic administrative services. Generally, a server administrator has the privilege of capturing everything that is happening on the server side.

This paper discusses two processes, one running at the server and another at a selected client. The client side process sends an IP packet with a request for the desired service. The process running on the server side acts like a gateway and examines the incoming packet. This "gateway process" processes the request.

2 Client Side Software

Features incorporated in the client side software include the following, among several others.

• User and Group Management

• Remote Script Execution with Feedback

• File System Monitoring

• Monitoring Paging and Swap Space

• Monitoring System Load

• Process Management

• File Locking

• Device Drivers

• Database Administration

3 Roles of Clients

A main feature of the client is to give a convenient user interface, hiding the details of how the server 'talks' to the user. The client first needs to establish a connection with the server, given its address. After the connection is established, the client needs to be able to do two things:

1. Receive commands from the user, translate them to the server's language (protocol) and send them to the server.
2. Receive messages from the server, translate them into human-readable form, and show them to the user.

Some of the messages will be dealt with by the client automatically, and hidden from the user. This is based on the client designer's choice.

4 Algorithm Developed for Client side Software Functions

1.1 get the server's address from a working address that can be used to talk over the Internet.

1.2 connect to the server

1.3 while (not finished) do:

1.3.1 wait until there is information either from the server, or from the user.

1.3.2 If (information from server) do

1.3.2.1 parse information, show to user, update local state information, etc.

1.3.3 else we've got a user command

1.3.3.1 parse command, send to server, or deal with locally.

1.4 done

5 Roles of Servers

A server's main feature is to accept requests from clients, handle them, and send the results back to the clients. The server side process checks the 8-bit unused field of the IP packet to confirm that the request is from a valid client. We discuss two kinds of servers: a single-client server and a multi-client server.

5.1 Single Client Servers

A single-client server responds to only one client at a given time. It acts as follows:

1 Accept connection requests from a Client.

2 Receive requests from the Client and return results.

3 Close the connection when done, or clear it if it's broken from some reason.

Following is the basic algorithm a Single-Client Server performs:

1.1 bind a port on the computer, so Clients will be able to connect

1.2 forever do:

1.2.1 listen on the port for connection requests.

1.2.2 accept an incoming connection request

1.2.3 if (this is an authorized Client)

1.2.3.1 while (connection still alive) do:

1.2.3.2 receive request from client


1.2.3.3 handle request

1.2.3.4 send results of request, or error messages.

1.2.3.5 done

1.2.4 else

1.2.4.1 abort the connection

1.2.5 done

5.2 Multi Client Servers

A multi-client server responds to several clients at a given time. It acts as follows:

1. Accept new connection requests from Clients.

2. Receive requests from any Client and return results.

3. Close any connection that the client wants to end.

Following is the basic algorithm a Multi-Client Server performs:

1.1 bind a port on the computer, so Clients will be able to connect

1.2 listen on the port for connection requests.

1.3 forever do:

1.3.1 wait for either new connection requests, or requests from existing Clients.

1.3.2 if (this is a new connection request)

1.3.2.1 accept connection

1.3.2.2 if (this is an un-authorized Client)

1.3.2.2.1 close the connection

1.3.2.3 else if (this is a connection close request)

1.3.2.3.1 close the connection

1.3.2.4 end if

1.3.3 end if

1.3.4 else this is a request from an existing Client connection

1.3.4.1 receive request from client

1.3.4.2 handle request

1.3.4.3 send results of request, or error messages

1.3.5 end if

1.4 done
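As an illustrative sketch only (it is not the code developed in this paper), the multi-client loop above can be realised with select(), watching the listening socket and all connected clients at once. Port 13 matches the sample programs given later, and the echo-style handler merely stands in for real request handling:

    /*
     * Illustrative sketch: a select()-based multi-client loop. New
     * connections are accepted, each ready client is read once, and the
     * data is echoed back in place of real request handling.
     */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <sys/select.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int listenfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in servaddr;
        int clients[FD_SETSIZE], nclients = 0;
        char buf[1024];

        memset(&servaddr, 0, sizeof(servaddr));
        servaddr.sin_family = AF_INET;
        servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
        servaddr.sin_port = htons(13);
        bind(listenfd, (struct sockaddr *)&servaddr, sizeof(servaddr));
        listen(listenfd, 10);

        for (;;) {
            fd_set rset;
            int maxfd = listenfd;
            FD_ZERO(&rset);
            FD_SET(listenfd, &rset);
            for (int i = 0; i < nclients; i++) {        /* watch every client */
                FD_SET(clients[i], &rset);
                if (clients[i] > maxfd) maxfd = clients[i];
            }
            select(maxfd + 1, &rset, NULL, NULL, NULL);

            if (FD_ISSET(listenfd, &rset) && nclients < FD_SETSIZE - 1)
                clients[nclients++] = accept(listenfd, NULL, NULL);

            for (int i = 0; i < nclients; i++) {
                if (!FD_ISSET(clients[i], &rset)) continue;
                ssize_t n = read(clients[i], buf, sizeof(buf));
                if (n <= 0) {                           /* client closed      */
                    close(clients[i]);
                    clients[i--] = clients[--nclients];
                } else {
                    write(clients[i], buf, (size_t)n);  /* handle the request */
                }
            }
        }
    }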

6 File System Monitoring

Monitoring complete file systems is the most common monitoring task. On different flavors of Unix the monitoring techniques are the same, but the command syntax and the fields in the output vary slightly, depending on the flavour of the Unix system being used.

We have developed a software script for monitoring file system usage.

The outcomes of our software, which is developed using several methods, are as follows:


6.1 Percentage of used space method.

Example:

/dev/hda2 mounted on /boot is 11%
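A minimal C sketch of this percentage-of-used-space check is given below for illustration; it runs the portable df -kP command through popen() and flags any file system above a threshold, with the 85% trigger value chosen arbitrarily for the example:

    /*
     * Illustrative sketch of the percentage-of-used-space method: run the
     * POSIX "df -kP" command and flag any file system above a threshold.
     * The 85% trigger value is an arbitrary choice for the example.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_PERCENT 85

    int main(void)
    {
        char line[512], fs[128], mount[128];
        long total, used, avail;
        int percent;

        FILE *df = popen("df -kP", "r");
        if (df == NULL) {
            perror("popen");
            return 1;
        }
        if (fgets(line, sizeof(line), df) == NULL) {   /* skip the header row */
            pclose(df);
            return 1;
        }
        while (fgets(line, sizeof(line), df) != NULL) {
            if (sscanf(line, "%127s %ld %ld %ld %d%% %127s",
                       fs, &total, &used, &avail, &percent, mount) == 6
                && percent >= MAX_PERCENT)
                printf("%s mounted on %s is %d%% full\n", fs, mount, percent);
        }
        pclose(df);
        return 0;
    }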

6.2 Megabytes of Free Space Method

Example:

Full FileSystem on pbscpg55046.pbscpg

/dev/hda3 mounted on / only as 9295 MB Free Space

/dev/hda2 mounted on /boot only as 79 MB Free Space

6.3 Combining Percentage Used 6.1 and Megabytes of Free Space 6.2.

6.4 Enabling the Combined Script to Execute on AIX, HP_UX, Linux and Solaris.

7 Monitoring Paging and Swap Space

Every Systems Administrator attaches more importance to paging and swap space because they are supposed to be the key parameters to fix a system that does not have enough memory. This misconception is thought to be true by many people, at various levels, in a lot of organizations. The fact is that if the system does not have enough real memory to run the applications, adding more paging and swap space is not going to help. Depending on the applications running on the system, swap space should start at least 1.5 times physical memory. Many high-performance applications require 4 to 6 times real memory so the actual amount of paging and swap space is variable, but 1.5 times is a good place to start.

A page fault happens when a memory segment, or page, is needed in memory but is not currently resident in memory. When a page fault occurs, the system attempts to load the needed data into memory, this is called paging or swapping, depending on the Unix system being used. When the system is doing a lot of paging in and out of memory, this activity needs monitoring. If the system runs out of paging space or is in a state of continuous swapping, such that as soon as a segment is paged out of memory it is immediately needed again, the system is thrashing. If this thrashing condition continues for very long, there is a possible risk of the system crashing. One of the goals of the developed software is to minimise the page faults.

Each of the four Unix flavors, AIX, HP-UX, Linux, and Solaris, uses a different command to list the swap space usage, and the output for each command and OS also varies. The goal of this paper is to create an all-in-one shell script that will run on any of our four Unix flavors. A sample output of the script is presented below.

Paging Space Report for GRKRAO
Thu Oct 25 14:48:16 EDT 2007
Total MB of Paging Space : 33MB
Total MB of Paging Space Used : 33MB
Total MB of Paging Space Free : 303MB
Percent of Paging Space Used : 10%
Percent of Paging Space Free : 90%


8 Monitoring System Load

There are three basic things to look at when monitoring the load on the system:

1. First, look at the load statistics produced.

2. Second, look at the percentages of CPU usage for system/kernel, user/applications, I/O wait state and idle time.

3. The final step in monitoring the CPU load is to find hogs.

Most systems have a top like monitoring tool that shows the CPUs, processes, users in descending order of CPU usage.

9 File Locking

File locking allows multiple programs to cooperate in their access to data. This paper looks at the following two schemes of file locking.

1. A simple binary semaphore scheme

2. A more complex file locking scheme of locking different parts of a file for either shared or exclusive access
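The second scheme can be illustrated with the following C sketch (not the paper's implementation), which takes an exclusive fcntl() record lock on part of a file before updating it; the file name and region size are arbitrary:

    /*
     * Illustrative sketch: an exclusive fcntl() record lock on the first
     * 100 bytes of a file, held while a cooperating update is made.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
        struct flock lock;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        lock.l_type   = F_WRLCK;       /* exclusive (write) lock            */
        lock.l_whence = SEEK_SET;
        lock.l_start  = 0;             /* lock the region ...               */
        lock.l_len    = 100;           /* ... covering the first 100 bytes  */

        if (fcntl(fd, F_SETLKW, &lock) < 0) {   /* wait until the lock is free */
            perror("fcntl lock");
            return 1;
        }
        printf("region locked, updating...\n");
        /* cooperating programs blocked on the same region wait here */

        lock.l_type = F_UNLCK;                  /* release the lock */
        fcntl(fd, F_SETLK, &lock);
        close(fd);
        return 0;
    }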

10 Device Drivers

Device Drivers are needed to control any peripherals connected to a server. This paper focuses on the following aspects of device drivers where an authorized client can control devices connected to the server.

1. Registering the device

2. Reading from a device and Writing to a device

3. Getting memory in device driver

11 Database Administration

"C" language is used to access MySQL. In this paper, the following database administrative features are implemented to be run at an authorized client:

1. Create a new database

2. Delete a database

3. Change a password

4. Reload the grant tables that control permissions

5. Provide the status of the database server

6. Repair any data tables

7. Create users with permissions
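For illustration, one of these administrative actions (creating a new database) issued through the MySQL C API might look like the following sketch; the host name, account, password and database name are placeholders rather than values from the paper:

    /*
     * Illustrative sketch: creating a new database from an authorized
     * client through the MySQL C API. Connection details are placeholders.
     */
    #include <stdio.h>
    #include <mysql/mysql.h>

    int main(void)
    {
        MYSQL *conn = mysql_init(NULL);

        if (conn == NULL || mysql_real_connect(conn, "server.example", "admin",
                                               "secret", NULL, 0, NULL, 0) == NULL) {
            fprintf(stderr, "connect failed: %s\n", conn ? mysql_error(conn) : "init");
            return 1;
        }
        if (mysql_query(conn, "CREATE DATABASE reports") != 0)
            fprintf(stderr, "query failed: %s\n", mysql_error(conn));
        else
            printf("database created\n");

        mysql_close(conn);
        return 0;
    }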



12 Using the algorithm described in Section 4 above, we developed the following C program code:

Sample Client Program

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <strings.h>    /* bzero() */
#include <unistd.h>     /* read(), write() */

int main(int argc, char **argv)
{
    int sockfd, len;
    char buf[10240];
    struct sockaddr_in servaddr;

    if (argc != 2)
        perror("invalid IP");
    if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        perror("socket error");

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_port = htons(13);
    if (inet_pton(AF_INET, argv[1], &servaddr.sin_addr) <= 0)
        perror("SERVER ADDR ");
    if (connect(sockfd, (struct sockaddr *)&servaddr, sizeof(servaddr)) < 0)
        perror("connect error");

    buf[0] = '\0';
    printf("Enter the Directory name \n");
    scanf("%s", buf);
    if (write(sockfd, buf, 100) < 0) {          /* send the path to the server */
        printf("write error ");
        exit(1);
    }

    /* read each attribute sent back by the server, in the agreed order */
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf(" Inode Number = %s\n", buf);

    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf(" No of links = %s\n", buf);

    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("Size of file in bytes = %s\n", buf);

    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("UID = %s\n", buf);

    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("GID = %s\n", buf);

    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("Type and Permissions = %s\n", buf);

    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("Last Modification Time = %s\n", buf);

    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("Last Access Time = %s\n", buf);

    exit(0);
}


13 Using the algorithm described in Section 5 above, we developed the following C program code:

Sample Server Program

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <strings.h>    /* bzero() */
#include <time.h>       /* ctime() */
#include <unistd.h>     /* read(), write(), close() */

#define MAXLINE 10024
#define LISTENQ 10

int main(int argc, char **argv)
{
    int listenfd, connfd, len;
    struct sockaddr_in servaddr;
    struct stat statbuf;
    char buff[MAXLINE];

    listenfd = socket(AF_INET, SOCK_STREAM, 0);
    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(13);
    bind(listenfd, (struct sockaddr *)&servaddr, sizeof(servaddr));
    listen(listenfd, LISTENQ);

    connfd = accept(listenfd, (struct sockaddr *)NULL, NULL);

    /* receive the path requested by the client */
    if ((len = read(connfd, buff, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("%s\n", buff);

    /* look up the file attributes and send them back one field at a time */
    lstat(buff, &statbuf);

    sprintf(buff, "%ld", (long)statbuf.st_ino);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    sprintf(buff, "%ld", (long)statbuf.st_nlink);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    sprintf(buff, "%ld", (long)statbuf.st_size);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    sprintf(buff, "%ld", (long)statbuf.st_uid);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    sprintf(buff, "%ld", (long)statbuf.st_gid);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    sprintf(buff, "%o", (unsigned)statbuf.st_mode);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    sprintf(buff, "%s", ctime(&statbuf.st_mtime));
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    sprintf(buff, "%s", ctime(&statbuf.st_atime));
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    close(connfd);
    exit(0);
}


Sample Outputs

On the server:

[root@grkrao 01Oct]# ./a.out
Message From Client : /etc/passwd

On the client:

[grkrao@grkraoclient 01Oct]# ./a.out 123.0.57.44
Enter the File name : /etc/passwd
Message From Server :
Inode Number = 1798355
No of links = 1
Size of file in bytes = 3263
UID = 0
GID = 0
Type and Permissions = 100644
Last Modification Time = Fri Apr 20 16:12:06 2007
Last Access Time = Tue Apr 24 11:46:33 2007

14 Extensions

• Adding authentication to individual client requests (a sketch of this extension follows the list)

• Restricting clients to make specific requests

• Making a selected client work as a proxy server for administration

• Embedding both the server and client side software
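
As an illustration of the first extension above, a minimal sketch is given below; it assumes a pre-shared secret known to both client and server, and the token value, buffer size and function name are placeholders rather than part of the suite described in this paper.

#include <string.h>
#include <unistd.h>

#define AUTH_TOKEN "icws-2009-demo"   /* placeholder pre-shared secret */

/* Returns 1 if the first message on the connection matches the shared
   secret, 0 otherwise.  The server would call this right after accept(),
   before serving any attribute request. */
static int authenticate_client(int connfd)
{
    char token[64];
    ssize_t n = read(connfd, token, sizeof(token) - 1);
    if (n <= 0)
        return 0;
    token[n] = '\0';
    return strcmp(token, AUTH_TOKEN) == 0;
}

On the client side the corresponding change would be a single write(sockfd, AUTH_TOKEN, strlen(AUTH_TOKEN)) issued before the directory name is sent.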



Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Development of GUI Based Software Tool

for Propagation Impairment Predictions

in Ku and Ka Band - TRAPS

Sarat Kumar K. and Vijaya Bhaskara Rao S.
Advanced Centre for Atmospheric Sciences (ISRO Project), Sri Venkateswara University, Tirupati-522502, India

D. Narayana Rao
HYARC, Nagoya University, Japan

Abstract

The presence of the atmosphere and weather conditions may have a significant detrimental effect on the transmission/reception performance of earth-satellite links operating in the millimetre wavelength range. To attain fine frequency planning of operating or future satellite communication systems, the predictive analysis of propagation effects should be performed with the payload in the suitable orbit slot and, for the earth segment, with the antenna characteristics and meteorological parameters specific to the station site. In line with this methodology, Tropical Rain Attenuation Predictions and Simulations (TRAPS), a MATLAB based GUI software tool, was developed for real-time processing of the input propagation parameters supplied during execution to a comprehensive suite of prediction models. It also includes a database of atmospheric propagation impairments for different locations in India, stored for offline viewing, which helps the engineer in developing reliable communication systems operating in higher frequency bands. The concentration of this paper is more on the software tool than on the propagation impairments themselves; for information on propagation impairments refer to [Crane R K, 1996; W L Stutzman, 1993; R L Olsen, 1978].

Keywords: Ku and Ka band, Rain Attenuation, Propagation Impairments GUI Based Software Tool, TRAPS.

1 Introduction

The prediction of radio propagation factors for radio systems planning is invariably undertaken with the aid of models representing a mathematical simplification of physical reality. In some cases the models may be simply formulae derived empirically as a result of experimental link measurements, but in many cases the models are derived from analysis of the propagation phenomena in media with time-varying physical structure. In all cases the models require input data (usually meteorological data) so as to be able to make a prediction. Usually there are important constraints on situations in which various models may be applied and also the input data will have to meet certain conditions in order to sustain a certain accuracy of prediction. [ITU-R models P 618-7, 837-3, 839-2].


The paper describes an initial design study for an intelligent (computer-based) model management system for radio propagation prediction on earth-space links. The system includes "deep knowledge" about the relationships between the fundamental concepts used by radio propagation experts. It manages application of the models in particular regions of the world and is able to provide intelligent database support. It is intended for use both by the propagation expert as an aid to developing and testing models and also by the radio system planner as means of obtaining routine propagation predictions using state-of-the-art knowledge on appropriate models and data.

In predicting propagation factors for radio system design we are concerned with the following actions:

• Definition of link geometry

• Evaluation of antenna performance on link budget, including footprint calculation and polarisation properties

• Selection of propagation models appropriate to particular frequencies and locations

• Evaluation of propagation factors for specific system requirements (outage probabilities)

Conventionally these activities are carried out with the aid of a computer, usually using a series of separate programs linked together or written by the user. In the case of propagation models we may be concerned with the application of straightforward empirical or semi-empirical formulae. For frequencies above 10 GHz, empirical or semi-empirical formulae are used for propagation prediction, covering the following factors: boresight error, attenuation and fading (including scintillation), cross-polarisation, delay spread, antenna noise, and interference.

The formulae for these factors are generally simple to program and evaluate, but the knowledge associated with the conditions under which each step may be applied and the input data requirements are often complex, the more so because many of the formulae are empirical and only apply within strict limits, for example on frequency, elevation or type of climate.

What is required as a tool for the system designer and propagation expert is an intelligent kind of data management system able to store and retrieve the formulae and associated conditions for propagation prediction, to select those models appropriate to the particular system requirements and local conditions, to select the best available data for use with the models and then to calculate the specific propagation factors.

Conventionally a propagation expert or system designer may retain computer source code or compiled code on his chosen machine for a range of propagation problems, including the antenna and link geometry calculations mentioned earlier. As this collection of formulae becomes larger, the task of maintaining the code, of linking elements together, and of remembering the required data formats and output display possibilities grows multiplicatively. If this type of system provides an in-house consultancy type of service, with many experts contributing to the pool of formulae and data over a period of time, then the situation can get out of hand. What we are proposing is a system for managing these mundane tasks, enhanced with vital knowledge on the conditions for application of formulae and selection of data.

What we require is an intelligent system capable of linking together concepts (e.g. models and data requirements) and applying rule-based reasoning. These requirements lead us to consider the latest generation of intelligent knowledge-based system tools, based on an object-oriented approach with a reasoning toolkit.

A software system which allows us to define the key concepts in a particular field in terms of equations, conditions or rules, plus descriptions or explanations, allows us to define various types of relationships between these concepts, to associate properties via a particular type of relationship and to perform goal oriented or data driven reasoning, should prove to be a powerful tool in addressing our specialised and well contained propagation factor prediction problem. [Pressman RS, 1987]

Fig. 1: Schematic of the principal objects and their dependencies

Fig. 2: Proposed system architecture using all the available model information

The principal objects in a propagation prediction system and their dependencies are illustrated in figure 1. Each block represents an object (or class of objects) and consists of three parts: object name, operations or methods, and data items. We note that objects are defined for the (satellite) system model, for the propagation factors, for antenna footprints and for three databases (for radio, site and meteorological data). The object-oriented environment forms an inner shell, interfacing to the computer operating system and other utilities, as shown in figure 2.

2 Implementation Using Models

Processing a propagation prediction task is much like assembling models into a rather complex tree graph, where each model may rely on one or many other lower-level models, and then executing the models by successive layers of node-points till end results are produced as outputs of the main calculation schema. A model embedded in this structure, although being called in different radio-propagation contexts, may frequently receive unchanged values for some parameters of its argument list. Thus, it seemed beneficial to develop a model implementation that applies to different parameter types for the same general action.

In adopting MATLAB as the programming language for implementing propagation models, MATLAB's inherent ability to perform matrix computations has been exploited at all calculation levels by overloading model functions through interfaces, each interface defining a particular combination of input parameter types. Every in-house subroutine quoted inside a model algorithm is adequately defined in terms of the type of input parameters it supports. The model in turn is built in such a way that the function output can be processed in a consistent and error-free manner, and each variable returned by the model has the expected MATLAB object format. The intended outcome is that model functions are safely executed with different combinations of scalars, vectors, matrices or multi-dimensional arrays as arguments. [P. Marchand et al., 2002]

Furthermore, the interface mechanism allows the same function to operate on a common data abstraction but with multiple operands types, and relieves the programmer of the complexity of assigning specific names to functions that perform the same model for different legitimate use cases of parameter sets.
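
To make the idea of one model routine serving several argument shapes concrete, the following minimal C sketch applies the well-known kR^alpha specific-attenuation relation element-wise, so that a single rain-rate value and a whole vector of values go through the same code path. The function name and the coefficient values are placeholders for illustration only, not the tool's actual data or API.

#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Specific attenuation gamma_R = k * R^alpha (dB/km), applied element-wise.
   Passing n = 1 treats the input as a scalar; larger n treats it as a vector. */
static void specific_attenuation(const double *rain_rate_mm_h, double *gamma_dB_km,
                                 size_t n, double k, double alpha)
{
    for (size_t i = 0; i < n; i++)
        gamma_dB_km[i] = k * pow(rain_rate_mm_h[i], alpha);
}

int main(void)
{
    double R[3] = {10.0, 30.0, 60.0};          /* example rain rates in mm/h   */
    double g[3];
    specific_attenuation(R, g, 3, 0.0188, 1.217);  /* placeholder k and alpha  */
    for (int i = 0; i < 3; i++)
        printf("R = %.1f mm/h -> gamma = %.3f dB/km\n", R[i], g[i]);
    return 0;
}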

3 Functional Description

TRAPS consists of an interface supporting the collection of input parameters to a prediction task, execution of this task and visualization of its results. Numerous models have been implemented from scratch in MATLAB in such a way that they can be integrated into the so-called TRAPS schema, using the integrated features of MATLAB, which supports the full range of prediction tasks for geostationary satellites. The application then provides a task-oriented abstraction on top of this integrated MATLAB schema, supporting the selection of a specific usage scenario (interface) of the TRAPS schema and the collection of its expected input parameters in the proper format. [C. Bachiller et al., 2003] The user interacts with the system through a sequence of screens supporting the selection of satellite(s), site(s) or region, the input of statistical parameters, the choice of effects and models to be calculated, the triggering of the calculation and the visualization of results.

From the end user's viewpoint, TRAPS is an application that runs on the MATLAB platform and supports three modes of operation: Satellite mode, Location mode and Contour mode. The user may select the location mode to consider the influence of link geometry and radio transmission parameters on signal degradations, with tight control of the statistical parameters relating to the activity of atmospheric factors in the surroundings of the ground station. The single-site mode interface offers a set of handy link parameter tables through which any conceivable pattern of slant-path connections from a common earth-station site can be defined. The link-parameters interface page includes a user-held list of parameters, defined during previous sessions and stored there for subsequent reuse. [P Daniel et al., 2004]

In the satellite calculation mode, the TRAPS interface asks for the selection of the required satellite for analysis of propagation impairments. Alternatively, the selection of a single satellite can be combined with an arbitrary number of earth stations. Site data and a topographic database are available for support. For broadcasting applications specifically, the propagation capability of the downlink satellite channel can be examined over any geographical region within the overall satellite coverage area. After the calculations have been performed, the user is able to store the input parameters of the calculation and the numerical and graphical results in the system's database, where they can be accessed at any later point in time.

The window structure is illustrated in figure 3. The location mode, satellite mode and contour mode calculations are the three main courses of action in the interface, and they make use of a set of common pages whose content is dynamically adapted to the context of the calculation in progress.

After opening the TRAPS, the user is presented with a welcome page that proposes a choice between the satellites mode, location mode, and contour mode.

In general, performing each step of a calculation will enable the next icon in the sequence of steps that should be followed. It is possible at any point in time to go back to a previous step by clicking on the icon associated with this step. Modifying the choices made at a certain step may necessitate that the user goes through the following steps again. Figures 4-10 show the windows seen during execution for obtaining the attenuation parameters.

Fig. 3: Start Window of TRAPS

Fig. 4: Satellite Mode Window of TRAPS


Fig. 5: Satellite Mode Window for choosing the location: TRAPS

Fig. 6: Window for choosing the parameter for calculation: TRAPS

Fig. 7: Window for viewing the required parameter plots: TRAPS

Fig. 8: Window showing the plot after choosing the parameter: TRAPS


Fig. 9: Location Mode Main Window TRAPS

Fig. 10: Contour Mode Results Window TRAPS

4 Technical Overview - Traps

TRAPS supports a fully GUI based architecture. TRAPS was developed using the features of MATLAB 2007b. The calculation models are run using built-in functions, not external to the application. If the user has a local copy of MATLAB, he may also download results in MAT file format. However, this is unnecessary if the user is solely interested in viewing results in graphical form, since all the most relevant graphs are generated automatically, without extra user intervention, upon completion of the calculation.

Whenever a user initiates a calculation request, the application generates parameters and invokes the execution of the compiled schema through the MATLAB compiler. When it is scheduled for execution, the compiled schema reads its parameter MAT file, analyzes its contents in order to determine its course of action (which effects should be calculated, with which models and data sets), performs a set of calculations and then stores the numerical results in another MAT file. It also generally produces a set of graph files for results visualization.

The development of TRAPS was based entirely on the software package MATLAB and specifically, on the GUI (Graphical User Interface) environment offered. No knowledge of this software is required from the user. The inputs of the tool are also briefly described. This graphic tool can be used either for the design of Satellite Communications or for research purposes.


5 Summary and Conclusions

The performance of complex wave propagation prediction tasks requires propagation engineers to use a significant number of mathematical models and data sets for estimating the various effects of interest to them, such as attenuation by rain, by clouds, scintillation, etc. Although the models are formally described in the literature and standardized e.g. by the ITU-R, there is no easy-to-use, integrated and fully tested implementation of all the relevant models, on a common platform. On the contrary, the models have usually been implemented using different tools and languages such as MATLAB, IDL, PV-Wave, C/C++ and FORTRAN. It is often necessary to have a good understanding of a model’s implementation in order to use it correctly, otherwise it may produce errors or, worse, wrong results when supplied with parameters outside of an expected validity range. Some models only support scalar values as inputs while others accept vectors or matrices of values, e.g. for performing a calculation over a whole region as opposed to a discrete point. These issues worsen whenever the engineer wishes to combine several models, which is necessary for most prediction tasks. In addition, assessing the validity of results produced by the combination of multiple models is also a complex issue, especially when their implementation originates from various parties. Finally, as no common user interface is supported, the combination of models requires tedious manipulations and transformations of model inputs and outputs.

The paper has outlined the design features and specifications of TRAPS, Tropical Rain Attenuation Predictions and Simulations, used for propagation prediction on slant paths. The software tool achieves the integration of propagation models and radio-meteorological datasets and provides support for the analysis of tropospheric effects on a large variety of earth-space link scenarios. The actual value of the TRAPS software application can be reckoned from the functionality of the GUI interface, its efficiency in performing advanced model calculations and the content of the results being returned. Additional capabilities of TRAPS include offline viewing of results already stored in the database and run-time processing of the required outputs for given inputs. A web based software tool is under development, which will enable propagation engineers to predict the attenuation by executing the models online with specified input parameters.

6 Acknowledgement

The authors would like to thank the Advanced Centre for Atmospheric Sciences project supported by Indian Space Research Organisation (ISRO), and Department of Physics, Sri Venkateswara University, Tirupati.

References

[1] [C. Bachiller, H. Esteban, S. Cogollos, A. San Blas, and V. E. Boria] "Teaching of Wave Propagation Phenomena using MATLAB GUIs at the Universidad Politecnica of Valencia," IEEE Antennas and Propagation Magazine, 45, 1, February 2003, pp. 140-143.
[2] [ITU-R] "Characteristics of Precipitation for Propagation Modeling," Propagation in Non-Ionized Media, Rec. P.837-3, Geneva, 2001.
[3] [ITU-R] "Propagation Data and Prediction Methods Required for the Design of Earth-Space Telecommunication Systems," Propagation in Non-Ionized Media, Rec. P.618-7, Geneva, 2001.
[4] [ITU-R] "Rain Height Model for Prediction Methods," Propagation in Non-Ionized Media, Rec. P.839-3, Geneva, 2001.
[5] [P. Marchand and O. T. Holland] Graphics and GUIs with MATLAB, Third Edition, Boca Raton, CRC Press, 2002.
[6] [Pantelis-Daniel M. Arapoglou, Athanasios D. Panagopoulos, George E. Chatzarakis, John D. Kanellopoulos, and Panayotis G. Cottis] "Diversity Techniques for Satellite Communications: An Educational Graphical Tool," IEEE Antennas and Propagation Magazine, Vol. 46, No. 3, June 2004.
[7] [Pressman R S] Software Engineering, TMH, 2nd Edition, 1987.
[8] [R. K. Crane] Electromagnetic Wave Propagation through Rain, New York, Wiley, 1996.
[9] [R. L. Olsen, D. V. Rogers, and D. B. Hodge] "The aRb relation in the calculation of rain attenuation," IEEE Trans. on Antennas and Propagation, Vol. 26, No. 2, pp. 318-329, 1978.
[10] [W. L. Stutzman] "The special section on propagation effects on satellite communication links," Proceedings of the IEEE, Vol. 81, No. 6, 1993, pp. 850-855.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Semantic Explanation of Biomedical

Text Using Google

B.V. Subba Rao, Department of IT, P.V.P. Siddhartha Institute of Technology, Vijayawada-520007, A.P., [email protected]
K.V. Sambasiva Rao, M.V.R. College of Engineering, Vijayawada, Krishna Dt., A.P., [email protected]

Abstract

With the rapidly increasing quantity of biomedical text, there is a need for automatic extraction of information to support biomedical researchers, and hence a need for effective Natural Language Processing tools to assist in organizing and retrieving this information. Due to incomplete biomedical information databases, the extraction is not straightforward using dictionaries, and several approaches using contextual rules and machine learning have previously been proposed. Our work is inspired by these previous approaches, but is novel in the sense that it uses Google for semantic explanation of the biomedical words. The semantic explanation or annotation accuracy of 52% obtained on words not found in the Brown Corpus, Swiss-Prot or LocusLink (accessed using Gsearch.org) justifies further work in this direction.

Keywords: Biomedical text, Google, Data Mining, Semantic explanation.

1 Introduction

With the increasing importance of accurate and up-to-date databases for biomedical research, there is a need to extract information from biomedical research literature, e.g. those indexed in MEDLINE [8]. Examples of information databases are LocusLink, UniGene and Swiss-Prot [3]. Due to the rapidly growing amounts of biomedical literature, the information extraction process needs to be automated. So far, the extraction approaches have provided promising results, but they are not sufficiently accurate and scalable.

Methodologically all the suggested approaches belong to the information extraction field, and in the biomedical domain they range from simple automatic methods to more sophisticated, but manual, methods. Good examples are: learning relationships between proteins/genes based on co-occurrences in MEDLINE abstracts [10], manually developed information extraction rules [2] (e.g. for protein names), classifiers trained on manually annotated training corpora (e.g. [4]), and our previous work on classifiers trained on automatically annotated training corpora.

Examples of biological named entities in a textual context are i) "duodenum, a peptone meal in the" and ii) "subtilisin plus leucine amino-peptidase plus prolidase followed". An important part of information extraction is to know what the information is, e.g. knowing that the term "gastrin" is a protein or that "Tylenol" is a medication. Obtaining and adding this knowledge to given terms and phrases is called semantic tagging or semantic annotation.


1.1 Research Hypothesis

Fig. 1: Google is among the biggest known information haystacks

Google is probably the world's largest available source of heterogeneous electronically represented information. Can it be used for semantic tagging of textual entities in biomedical literature? And if so, how? The rest of this paper is organized as follows. Section 2 describes the materials used, section 3 presents our method, section 4 presents empirical results, section 5 describes related work, and section 6 presents the conclusion and future work.

2 Materials

The materials used included biomedical (sample of MEDLINE abstract) and general English (Brown) textual corpora, as well as protein databases. See below for a detailed overview.

2.1 Medline Abstracts-Gastrin-Selection

The US National Institutes of Health grants a free academic license for PubMed/MEDLINE [9, 10]. It includes a local copy of 6.7 million abstracts, out of the 12.6 million entries that are available on their web interface. As the subject for the expert validation experiments we used the collection of 12,238 gastrin-related MEDLINE abstracts that were available in October 2005.

2.2 Biomedical Information Databases

As a source for finding already known protein names we used a web search system called Gsearch, developed at Department of Cancer Research and Molecular Medicine at NTNU. It integrates common online protein databases, e.g. Swiss-Prot, LocusLink and UniGene.


2.3 The Brown Corpus

The Brown repository (corpus) is an excellent resource for training a Part Of Speech (POS) tagger. It consists of 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961. All the tokens are manually tagged using an extended Brown Corpus Tagset, containing 135 tags. The Brown corpus is included in the Python NLTK data-package, found at Sourceforge.

3 Our Method

We have taken a modular approach where every sub-module can easily be replaced by other similar modules in order to improve the general performance of the system. There are five modules connected to the data gathering phase, namely data selection, tokenization, POS-tagging, stemming and Gsearch. Then the sixth and last module does a Google search for each extracted term. See Figure 2.

3.1 Data Selection

The data selection module uses the PubMed Entrez online system to return a set of PubMed IDs (PMIDs) for a given protein, in our case "gastrin" (symbol GAS). The PMIDs are matched against our local copy of MEDLINE to extract the specific abstracts.

3.2 Tokenization

The text is tokenized to split it into meaningful tokens, or "words". We use the White Space Tokenizer from NLTK with some extra processing to adapt to the Brown Corpus, where every special character (such as parentheses, quotation marks, apostrophes, hyphens, commas and full stops) is treated as a separate token. Words in parentheses are clustered together and tagged as a single token with the special tag Paren.

3.3 POS Tagging

Next, the text is tagged with Part-of-Speech (POS) tags using a Brill tagger trained on the Brown Corpus. This module acts as an advanced stop-word list, excluding all the everyday common American English words from our protein search. Later, the actually given POS tags are also used as context features for the neighboring words.

3.4 Porter-Stemming

We use the Porter Stemming Algorithm to remove even more everyday words from the "possibly biological term" candidate list. If the stem of a word can be tagged by the Brill tagger, then the word itself is given the special tag "STEM", and thereby transferred to the common word list.


Fig. 2: Overview of Our Methodology (named Biogoogle)

3.5 Gsearch

This module identifies and removes already known entities from the search. After the lookup in Gsearch, there are still some unknown words that are not yet stored in our dictionaries or databases, so in order to do any reasoning about these words it is important to know which class they belong to. Therefore, in the next phase they are subjected to some advanced Google searching in order to determine this.

3.6 Google Class Selections

We have a network of 275 nouns, arranged in a semantic network of the form "X is a kind of Y". These nouns represent the classes that we want to annotate each word with. The input to this phase is a list of hitherto unknown words. From each word a query of the form "Word is a" / "Word is an" is formed. These queries are then fed to the PyGoogle module, which allows 1000 queries to be run against the Google search engine every day with a personal password key. In order to maximize the use of this quota, the results of every query are cached locally, so that each given query will be executed only once.


If a solution to the classification problem is not present among the first 10 results returned, the result set can be expanded by 10 at a time, at the cost of one of the thousand quota-queries every time.

Each returned hit from Google contains a “snippet” with the given query phrase and approximately 10 words on each side of it. We use some simple regular grammars to match the phrase and the words following it. If the next word is a noun it is returned. Otherwise, adjectives are skipped until a noun is encountered, or a “miss” is returned.
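
Although the authors' pipeline is implemented with Python, NLTK and PyGoogle, the snippet-parsing rule itself can be illustrated in C, in keeping with the listings used elsewhere in these proceedings. The following is only a toy sketch under stated assumptions: the small adjective skip-list stands in for a real POS tagger, and the function names and snippet text are hypothetical.

#include <stdio.h>
#include <string.h>

/* Placeholder "adjective" list; a real system would consult a POS tagger. */
static const char *skip_list[] = {"linear", "small", "novel", "important", NULL};

static int is_skipped(const char *w)
{
    for (int i = 0; skip_list[i]; i++)
        if (strcmp(w, skip_list[i]) == 0)
            return 1;
    return 0;
}

/* Find "<word> is a "/"<word> is an " in a snippet and copy the first
   following non-skipped token into out; returns 1 on success, 0 on a miss. */
static int class_from_snippet(const char *snippet, const char *word,
                              char *out, size_t outsz)
{
    char phrase[160], buf[512];
    const char *p;

    snprintf(phrase, sizeof(phrase), "%s is an ", word);
    p = strstr(snippet, phrase);
    if (!p) {
        snprintf(phrase, sizeof(phrase), "%s is a ", word);
        p = strstr(snippet, phrase);
    }
    if (!p)
        return 0;

    strncpy(buf, p + strlen(phrase), sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok(buf, " .,;:()"); tok; tok = strtok(NULL, " .,;:()")) {
        if (!is_skipped(tok)) {                 /* skip adjectives, keep the noun */
            snprintf(out, outsz, "%s", tok);
            return 1;
        }
    }
    return 0;                                   /* only adjectives found: a miss  */
}

int main(void)
{
    char class_noun[64];
    const char *snippet = "... gastrin is a linear peptide hormone produced by ...";
    if (class_from_snippet(snippet, "gastrin", class_noun, sizeof(class_noun)))
        printf("gastrin -> %s\n", class_noun);  /* prints "peptide" for this snippet */
    return 0;
}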

4 Empirical Results

Table 1: Semantic classification of untagged words

Classifier    TP/TN    FP/FN    Precision/Recall    F-Score    CA
Biogoogle     24/80    31/65    43.6/27.0           33.3       52.0

5 Related Work

Our specific approach is to use Google for direct semantic annotation (searching for is-a relations) of tokens (words) in biomedical corpora. We have not been able to find other work that does this, but Dingare et al. use the number of Google hits as input features for a maximum entropy classifier used to detect protein and gene names [1]. Our work differs since we use Google to directly determine the semantic class of a word, searching for is-a relationships and parsing the text (filtering adjectives) after "Word is a/an", as opposed to Dingare et al.'s indirect use of Google search as a feature for the information extraction classifier. A second difference between the approaches is that we search for explicit semantic annotation (e.g. "word is a protein") as opposed to their search for hints (e.g. "word protein"). The third important difference is that our approach does automatic annotation of corpora, whereas they require pre-tagged (manually created) corpora in their approach. Other related works include extracting protein names from biomedical literature and some on semantic tagging using the web. Below, a brief overview of related work is given.

5.1 Semantic Annotation of Biomedical Literature

Other approaches for (semantic) annotation (mainly for protein and gene names) of biomedical literature include: a) rule-based discovery of names (e.g. of proteins and genes); b) methods for discovering relationships of proteins and genes [2]; c) classifier approaches (machine learning) with textual context as features [4, 5]; and d) other approaches, including generating probabilistic rules for detecting variants of biomedical terms.

The paper by Cimiano and Staab [6] shows that a system similar to ours works, and can be taken as a proof that automatic extraction using Google is a useful approach. Our systems differ in that we have 275 different semantic tags, while they only use 59 concepts in their ontology. They also have a table explaining how the number of concepts in a system influences the recall and precision in several other semantic annotation systems.


6 Conclusion and Future Work

This paper presents a novel approach - Biogoogle - using Google for semantic annotation of entities (words) in biomedical literature.

We got empirically promising results: 52% semantic annotation accuracy ((TP+TN)/N, TP=24, TN=80, N=200) in the answers provided by Biogoogle compared to expert classification performed by a molecular biologist. This encourages further work, possibly in combination with other approaches (e.g. rule and classification based information extraction methods), in order to improve the overall accuracy (both with respect to precision and recall). Disambiguation is another issue that needs to be further investigated. Other opportunities for future work include:

• Improve tokenization. Just splitting on whitespace and punctuation characters is not good enough; in biomedical texts non-alphabetic characters such as brackets and dashes need to be handled better.
• Improve stemming. The Porter algorithm for the English language gives mediocre results on biomedical terms (e.g. protein names).
• Do spell-checking before a query is sent to Google, e.g. allowing minor variations of words (using the Levenshtein distance).
• Search for other semantic tags using Google, e.g. "is a kind of" and "resembles", as well as negations ("is not a").
• Investigate whether the Google ranking is correlated with the accuracy of the proposed semantic tag. Are highly ranked pages better sources than lower ranked ones?
• Test our approach on larger datasets, e.g. all available MEDLINE abstracts.
• Combine this approach with more advanced natural language parsing techniques in order to improve the accuracy.

In order to find multiword tokens, one could extend the search query ("X is a/an") to also include neighboring words of X, and then see how this affects the number of hits returned by Google. If there is no reduction in the number of hits, this means that the words are "always" printed together and are likely constituents of a multiword token. If you have only one actual hit to begin with, the certainty of the previous statement is of course very weak, but with an increasing number of hits the confidence also grows.
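
The hit-count comparison behind this heuristic can be written down in a few lines; the sketch below only assumes that the two hit counts have already been obtained from the search API, and the tolerance value is an arbitrary placeholder.

/* Returns 1 if adding the neighbouring word to the query barely changes the
   number of Google hits, suggesting that X and its neighbour form one
   multiword token.  The 0.9 threshold is a placeholder, not a tuned value. */
static int likely_multiword(long hits_x, long hits_x_plus_neighbour)
{
    if (hits_x <= 0)
        return 0;
    return ((double)hits_x_plus_neighbour / (double)hits_x) > 0.9;
}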

References

[1] [Steffen Bickel, Ulf Brefeld, 2004] A Support Vector Machine classifier for gene name recognition. In Proceedings of the EMBO Workshop: A Critical Assessment of Text Mining Methods in Molecular Biology.
[2] [C. Blaschke, C. Ouzounis, 1999] Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 60-67, AAAI.
[3] [B. Boeckmann, Estreicher, Gasteiger, M.J. Martin, K. Michoud, I. Phan, S. Pilbout, and M. Schneider, 2003] The SWISS-PROT protein knowledgebase and its supplement. Nucleic Acids Research, pages 365-370, January 2003.


Embedded Systems


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Smart Image Viewer Using Nios II Soft-Core

Embedded Processor Based on FPGA Platform

Swapnili A. Dumbre, GH Raisoni College, Nagpur, [email protected]
Pravin Y. Karmore, Alagappa University, Karaikudi, [email protected]
R.W. Jasutkar, GH Raisoni College, Nagpur, [email protected]

Abstract

This paper describes the working of an image viewer which uses advanced technologies and automation methods for hardware and software design.

There are two goals for the project. The basic goal is to read the contents of SD card, using the SD Card reader on the DE2[1] board, decode and display all the JPEG images in it on the screen one after the other as a slideshow using the onboard VGA DAC.

The next and more aggressive goal once this is achieved is to have effects in the slide show like fading, bouncing etc.

The main function of the software is to initialize and control the peripherals and also to decode the JPEG image once it is read from the SD card[9]. The top-level idea is to have two memory locations: one where the program sits (mostly SRAM) and the other (mostly SDRAM) where the image buffer is kept so that the video peripheral can read from it. At the top level, the C program reads the JPEG image from the SD card, decompresses it, asks the video peripheral not to read from the SDRAM[2] any more, and starts writing the new decoded image to the SDRAM. After it is done, it informs the video peripheral to go ahead and read from the SDRAM again, and it starts to fetch the next image from the SD card and begins to uncompress it[6][7].

Keywords: SD card reader, file transfer protocol, Quartus II, SOPC Builder, IP core, Nios II[8] soft-core embedded processor, embedded C/C++, FPGA.

1 Proposed Plan of Work

The basic idea is to have two peripherals, 1) to control the onboard SD card reader and 2) to control the VGA DAC.

There are two goals for the project. The basic goal is to read the contents of SD card, using the SD Card reader on the DE2 board, decode and display all the JPEG images in it on the screen one after the other as a slideshow using the onboard VGA DAC. The next and more aggressive goal once this is achieved is to have effects in the slide show like fading, bouncing etc.
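
A rough sketch of the top-level control loop described in the abstract is given below. It is only an illustration under assumptions: the register name VIDEO_CTRL, the frame-buffer address, and the helper functions for the SD card, the JPEG decoder and the delay are hypothetical placeholders, not the actual DE2/Nios II API.

#include <stdint.h>

#define SDRAM_FRAMEBUF ((uint16_t *)0x00800000)            /* placeholder address */
static volatile uint32_t *VIDEO_CTRL = (uint32_t *)0x00900000; /* placeholder reg */

extern int  sd_read_next_jpeg(uint8_t *buf, int maxlen);          /* from SD card */
extern int  jpeg_decode(const uint8_t *jpeg, int len, uint16_t *out); /* to pixels */
extern void delay_ms(int ms);

int main(void)
{
    static uint8_t jpeg[256 * 1024];   /* compressed image read from the SD card */
    int len;

    for (;;) {
        len = sd_read_next_jpeg(jpeg, sizeof(jpeg));
        if (len <= 0)
            continue;                          /* no more files: wrap around      */

        *VIDEO_CTRL = 0;                       /* ask the video peripheral to stop */
        jpeg_decode(jpeg, len, SDRAM_FRAMEBUF);/* write decoded pixels to SDRAM    */
        *VIDEO_CTRL = 1;                       /* let the peripheral display again */

        delay_ms(3000);                        /* hold each slide before the next  */
    }
}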


The main objective is the development of a fast, accurate and efficient system, saving time, cost and effort by using advanced technology.

Altera's powerful development tools provide the following facilities:

• Create custom systems on a programmable chip, making FPGAs the platform of choice.
• Increase productivity: whether you are a hardware designer or a software developer, the tools provide unprecedented time and cost savings.
• Protect your software investment from processor obsolescence: Altera's embedded solutions protect the most expensive and time-consuming part of your embedded design, the software.
• Scale system performance: increase performance at any phase of the design cycle by adding processors, custom instructions and hardware accelerators, and leverage the inherent parallelism of FPGAs.
• Reduce cost: reduce system costs through system-level integration, design productivity, and a migration path to high-volume HardCopy ASICs.
• Establish a competitive advantage with flexible hardware: choose the exact processor and peripherals for your application, deploy your products quickly, and feature-fill

2 Proposed Hardware for System


3 Research Methodology to be Employed

For this project we are using an FPGA and the Nios II[8] soft-core embedded processor from Altera for the development of the system. We are using the DE-2 development platform for physical verification of the proposed application.

4 Conclusion

The traditional snapshot viewer has many disadvantages, such as poor picture quality, poor performance and platform dependencies, and it is unable to stand against changing technologies. The advanced technologies and methodologies used here protect the software investment from processor obsolescence and increase the productivity, performance and efficiency of the software. The approach reduces the cost of the project and removes the problems that occur with a traditional image viewer, and it has the additional facility of adding effects to the slide show, like fading, bouncing etc.

References

[1] Using the SDRAM Memory on Altera's DE2 Board.
[2] 256K x 16 High Speed Asynchronous CMOS Static RAM With 3.3V Supply: Reference Manual.
[3] Avalon Memory-Mapped Interface Specification.
[4] FreeDOS-32: FAT file system driver project page from SourceForge.
[5] J. Jones, JPEG Decoder Design, Sr. Design Document EE175WS00-11, Electrical Engineering Dept., University of California, Riverside, CA, 2000.
[6] Jun Li, Interfacing a MultiMediaCard to the LH79520 System-On-Chip.
[7] Engineer-to-Engineer Note: Interfacing MultiMediaCard with ADSP-2126x SHARC Processors.
[8] www.altera.com/literature/hb/qts/qts_qii54007.pdf
[9] www.radioshack.com/sm-digital-concepts-sd-card-reader
[10] http://focus.ti.com/lit/ds/symlink/pci7620.pdf
[11] ieeexplore.ieee.org/iel5/30/31480/01467967.pdf
[12] www.ams-tech.com.cn/Memory-card


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

SMS Based Remote Monitoring and

Controlling of Electronic Devices

Mahendra A. Sheti, G.H. Raisoni College of Engg., Nagpur (M.S.) 440016, India, [email protected]
N.G. Bawane, G.H. Raisoni College of Engg., Nagpur (M.S.) 440016, India, [email protected]

Abstract

In today's world the mobile phone has become the most popular communication device, as it offers effective methods of communication to its users. The most common service provided by all network service providers is the Short Message Service (SMS). As Short Message Service is a cost-effective way of conveying data, researchers are trying to apply this technology in areas that are not explored by network service providers. One such area is the use of Short Message Service as a remote monitoring and controlling technology. By sending specific SMS messages one can not only monitor and control different electrical/electronic devices from any place in the world, but also get alerts regarding catastrophic events.

A stand-alone embedded system (here called an Embedded Controller) can be developed to monitor and control electrical/electronic devices through specific SMS messages. The same Embedded Controller can detect catastrophic events like fire, earthquake and burglary. Implementation of such a system is possible by using a programmed microcontroller, relays, sensors such as a PIR sensor, vibration sensor and fire sensor, and a GSM modem which can be used to send and receive the SMS. The programmed Embedded Controller acts as a mediator between the mobile phone and the electrical/electronic devices and performs the monitoring and controlling functions as well as catastrophic event detection and notification. In this paper we present the design of an embedded system that will monitor and control electric/electronic devices, will notify catastrophic events (like fire and burglary) by means of SMS, and will provide light when the user arrives home at night.

1 Introduction

In this paper we explore the combination of embedded systems and mobile communication, which can make human life much easier. Imagine that you are driving from your office to your home, and while driving you realize that you forgot to switch off the air conditioner. In this case you have to either go back to the office or, if somebody is there in the office, call him and ask him to switch off the air conditioner. If both options are not possible for you, then what? Does any option exist that enables you to get the status of the air conditioner installed in your office and control it from the location where you are? Yes! In this situation remote monitoring and controlling comes to mind [1, 2].


The purpose behind developing the Embedded Controller is to remotely monitor and control electrical/electronic devices through SMS, which has proven to be a cost-effective method of data communication in recent days [3]. Such a system can be helpful not only in remotely switching devices ON/OFF but also in security or safety applications in industries, detecting catastrophic conditions and alerting the user through an SMS message [5, 8, 4].

Before discussing the actual system, let us briefly review the existing trends for remote monitoring and controlling.

2 Recent Trends for Remote Monitoring & Controlling

Most electrical/electronic devices are provided with their own remote controllers by the manufacturer, but the limitation is the distance. As we are interested in controlling the devices from a long distance, we will discuss only the technologies which enable long-distance monitoring and controlling [1], [9].

After the introduction of Internet in 1990’s it became a popular medium for remote communication. When researchers realized that it could become an effective medium for remote monitoring and controlling then the concept of Embedded Web Server came [6]. Embedded Web Server enabled its users to remotely monitor and control the devices over Internet [6].

But this technology is costlier, as it requires an always-on connection for the Embedded Web Server, and the user also has to pay for Internet access. Though an Embedded Web Server can be useful for complex operations, it is costly for simple controlling and monitoring applications [6]. Apart from the cost, the factor that limits the use of this technology is the accessibility of the Internet: unless Internet access is available, the user cannot access the system.

Alternative to Internet based monitoring is making use of GSM technology, which is almost available in all the part of world even in remote locations like hill areas. GSM service is available in four frequency bands 450 MHz, 900 MHz, 1800 MHz, and 1900 MHz throughout the world. One of the unique benefits of GSM service is its capability for international roaming because of the roaming agreements established between the various GSM operators worldwide [7, 12]. Short Message Service is one of the unique features of GSM technology and can be effectively used to transmit the data from one mobile phone to another mobile phone and it was first defined as a part of GSM standard in 1985 [10].

3 Design of an Embedded Controller

As SMS technology is a cheap, convenient and flexible way of conveying data as compared to Internet technology, it can be used as a cost-effective and more flexible way of remote monitoring and controlling [7, 10]. Hence a stand-alone embedded system, the "Embedded Controller", can be designed and developed to monitor and control electrical/electronic devices through SMS with the following features.

1. Remotely switch ON and OFF any electrical/electronic device by sending a specific SMS message to Embedded Controller.

2. Monitor the status of the device whether ON or OFF. For this purpose the Embedded Controller will generate a reply message to user’s request including the status of the device and will send it to requested mobile phone.


3. The Embedded Controller will send alerts to the user regarding the status of the device. It is essential in cases where the device needs to be switched ON/OFF after a certain time.

4. The Embedded Controller will notify the power cut and power on conditions to the user.

5. Providing security lighting to deter burglars, or providing light when the user comes home late at night; for this purpose a PIR sensor is used.

6. Providing Security alerts in case of catastrophic events like fire and burglary.

7. For security purposes, only the already stored mobile phone numbers are allowed to use the system.

8. Any Mobile Phone with SMS feature can be used with the system.

9. Status of Device is displayed at Embedded Controller through LED’s for local monitoring.

10. Atmel’s AT 89C52 microcontroller is used to filter the information and perform the required functions.

3.1 System Architecture

Fig. 1: System Architecture

Figure 1 shows the system architecture of the remote monitoring and controlling system. The Embedded Controller is the heart of the system; it performs all the system functions. As shown in figure 1, the user who is authorized to use the system sends a specific SMS message to the GSM modem, and the SMS travels through the GSM network to the GSM modem as in [11]. The Embedded Controller periodically reads the first location of the SIM (Subscriber Identity Module) which is present inside the GSM modem as in [11]; as soon as the Embedded Controller finds the SMS message, it starts to process the SMS message, takes the necessary action and gives a reply back to the user as per the software program incorporated in the ROM of the microcontroller.

3.2 Hardware Design

Figure below shows the block diagram of an Embedded Controller which consists of a micro controller, GSM modem, Relays, Devices that are to be controlled, PIR sensors, ADC, Temperature sensor, Buzzer, LCD, LED’s etc.

Fig. 2: Hardware Design of Embedded Controller

Here in our design we are using the following main hardware components:

• The AT 89C52 microcontroller: used for processing the commands and controlling the different external devices connected, as per the SMS received.

• ANALOGIC 900/1800 GSM modem: this GSM/GPRS terminal equipment is a powerful, compact and self-contained unit with standard connector interfaces and an integral SIM card reader. It is used for receiving the SMS from the mobile device and transferring it to the AT 89C52, and also to send the SMS reply back to the user.

• MAX232 chip: this converter chip is needed to convert the TTL logic from the microcontroller (TxD and RxD pins) to the standard serial interface of the GSM modem (RS232).

• ULN 2003A: the IC ULN 2003A is used for driving inductive loads.

• Relay: used to achieve ON/OFF switching.

• PIR sensor: a PIR sensor is a motion detector which detects the heat emitted naturally by humans and animals.

• ADC (Analog to Digital Converter): the ADC0808 data acquisition component is a monolithic CMOS device with an 8-bit analog-to-digital converter, an 8-channel multiplexer and microprocessor-compatible control logic.

• Temperature sensor (LM35): the LM35 series are precision integrated-circuit temperature sensors whose output voltage is linearly proportional to the Celsius (Centigrade) temperature.

• Fire sensors, vibration sensors, etc. can also be used.

• Buzzer: a buzzer is connected to one of the I/O ports of the microcontroller. As soon as the signal about a successful match is received, a logic level from the microcontroller instructs the buzzer to go high, according to the programming, alerting the operator.

• LCD (Liquid Crystal Display): used to display the various responses for cross-checking purposes.

• Power supply: used to provide power to the various hardware components as per the requirement.

• LEDs: used as status indicators.

• Mobile phone: any mobile phone with the SMS feature can be used for sending the commands (SMS).

• Control equipment/device: the control equipment is the device that can be controlled and monitored, e.g. a tube light.

3.3 Software Design

The two software modules excluding hardware module are as follows.

3.3.1 Communication Module

This module is responsible for the communication between the GSM modem and the AT 89C52 microcontroller. The major functionalities implemented in this module include: detecting the connection between the GSM modem and the microcontroller, receiving data from the modem, and sending data to the modem. Additionally, the status of the modem, the receiving/sending process and the status of the controlled device can be displayed on the LCD.

3.3.2 Controlling Module

This module takes care of all the controlling functions. For example, after extracting a particular command like "SWITCH ON FAN", the module will activate the corresponding port of the microcontroller so that the desired output can be achieved. It is also responsible for providing feedback to the user. If catastrophic conditions are detected by the sensors, it alerts the user and turns off the devices to avoid further danger.

4 Internal Operation of the System

The program written to achieve the desired functionality is incorporated in ROM of AT 89C52 microcontroller. For communication with GSM modem and reading, deleting, and sending the SMS messages we are using GSM AT commands [4], [11]. AT Stands for Attention. GSM AT commands are the instructions that are used to control the modem functions. AT commands are of two types Basic AT Commands & Extended AT Commands. Standard GSM AT commands used with this system are as follows:

AT+CMGR Read Message.

AT+CMGS Send Message

AT+CMGD Delete Message

AT+CMGF Select PDU mode or Text Mode

At start-up or on reset of the system, the microcontroller first detects whether the connection with the GSM modem is established by sending the command ATE0V0 to the GSM modem. After a successful connection with the GSM modem, the AT 89C52 microcontroller reads the first location in the GSM modem's SIM (Subscriber Identity Module) card by sending the AT+CMGR=1 command, checking for an incoming SMS every 2 seconds. The SIM is present inside the GSM modem.

The controlling of electrical/electronic devices can be accomplished by decoding the SMS received, comparing it with already stored strings in the microcontroller and accordingly providing an output on the ports of a microcontroller. This output is used to switch on/off a given electrical/electronic device.

For getting status of the device the associated pin of microcontroller is checked for active high or low signal, and as per the signal status of the device is provided to the user of the system by sending an SMS.

The program or software is loaded into the AT 89C52, and then the circuit is connected to the modem. Initially the SMS received at the GSM modem is transferred to the AT 89C52 with the help of a MAX 232 chip. The microcontroller periodically reads the 1st memory location of the GSM modem to check whether an SMS has been received (programmed for every two seconds). Before implementing the control action, the microcontroller extracts the sender's number from the SMS and verifies whether this number has access to control the device. If the message comes from an invalid number then it deletes the message and does not take any action. If the message comes from an authorized number then it takes the necessary action.

Generally, sending and receiving of SMS occurs in two modes: Text mode and Protocol Data Unit (PDU) mode [4], [11]. Here we are using Text mode, in which the message is displayed as plain text; in PDU mode the entire message is given as a string of hexadecimal numbers. The AT 89C52 microcontroller performs all the functions of the system: it reads and extracts control commands from the SMS message and processes them according to the request.


The main functions implemented using GSM AT commands are as shown below:

1. Connection to GSM Modem

To detect whether the GSM modem is connected, the command ATE0V0 is sent to the GSM modem. In response to this the GSM modem sends the reply 0 or 1, which indicates that the connection is established (1) or that it failed to connect (0).

2. Deleting SMS from SIM Memory

To delete the SMS from SIM memory the command AT+CMGD=1,0 is used which deletes the SMS messages which are present in SIM inbox. The same command is used to delete the received SMS messages after processing the message.

3. Setting SMS Mode

To select PDU or text mode, the command

AT+CMGF=<0/1> is used

0 – indicates PDU mode

1 – indicates text mode

AT+CMGF=1 // selects text mode

4. Reading the SMS

To read the SMS message from the SIM, the read command shown below is sent to the GSM modem periodically (every 2 seconds):

AT+CMGR=1

Response from modem is as follows:

+CMGR: "REC UNREAD","9989028959","98/10/01, 18:22:11+00", This is the message

5. Processing the SMS Message

After reading the SMS message, the microcontroller processes it to extract the user's number and the command, as per the program. The microcontroller takes no action if the message is not from a valid user or if it does not contain a recognized command; each command is predefined in the program. Only a matching command is accepted, and the microcontroller then takes the action defined for that command.
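A hedged illustration of this extract-and-match step in standard C is given below; the sample response string, the authorized number and the command strings are assumed examples, not values taken from the paper.

/* Hedged illustration: extract the sender's number from a text-mode +CMGR
   response and match the command text against predefined strings. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *response =
        "+CMGR: \"REC UNREAD\",\"9989028959\",\"98/10/01,18:22:11+00\",DEVICE1 ON";
    const char *authorized = "9989028959";        /* number stored in firmware */

    /* the sender's number is the second quoted field of the response */
    char number[16] = "";
    const char *p = strchr(response, ',');
    if (p != NULL)
        sscanf(p, ",\"%15[^\"]\"", number);

    if (strcmp(number, authorized) != 0) {
        printf("invalid sender %s: delete SMS, take no action\n", number);
        return 0;
    }

    /* match the command text against the predefined commands */
    if (strstr(response, "DEVICE1 ON") != NULL)
        printf("authorized command: switch device 1 ON\n");
    else if (strstr(response, "DEVICE1 OFF") != NULL)
        printf("authorized command: switch device 1 OFF\n");
    else
        printf("unknown command: no action\n");

    return 0;
}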

6. Sending SMS

Command syntax in text mode:

AT+CMGS= <da> <CR> Message is typed here <ctrl-Z / ESC >

For example, to send an SMS to the mobile number 09989028959, the command is sent in the format AT+CMGS="09989028959"<CR> Please call me soon <ctrl-Z>. The fields <CR> and <ctrl-Z> are programmed as characters in the string used while generating the SMS message. On successful transmission the GSM modem responds with "OK"; otherwise it responds with "ERROR".
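A hedged sketch of how firmware might assemble this AT+CMGS sequence is shown below; the destination number and message text are examples, and the serial write is simulated with printf rather than the real UART routine.

/* Hedged sketch of generating the AT+CMGS sequence for a reply SMS. */
#include <stdio.h>

#define CR     "\r"      /* <CR> terminates the AT+CMGS command line */
#define CTRL_Z "\x1A"    /* <ctrl-Z> terminates the message body     */

int main(void)
{
    const char *number  = "09989028959";
    const char *message = "DEVICE1 IS ON";

    /* firmware would write these bytes to the GSM modem's serial port */
    printf("AT+CMGS=\"%s\"%s", number, CR);
    printf("%s%s", message, CTRL_Z);

    /* on success the modem replies "OK", otherwise "ERROR" */
    return 0;
}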

Detection of Catastrophic Events

With the Embedded Controller, one can use various sensors to detect catastrophic events or other special events, as shown in figure 2. Here we use PIR sensors, which enable the system to provide the security feature. A Passive Infra-Red (PIR) sensor is an electronic device which is commonly used to provide light and to detect motion. Whenever a suitably large (and therefore probably human) warm body moves in the field of view of the sensor, a floodlight is switched on automatically and left on for a fixed period of time, typically 30-90 seconds [13]. This can be used to deter burglars as well as to provide lighting when the user arrives home at night [13].

The first PIR sensor is located outside the door; it provides light to the user when he comes home late at night and, by indicating somebody's presence, helps to deter a burglar. Even if a burglar manages to get inside the home/office, another PIR sensor detects the intrusion and sends an alert to the user as well as to the nearest police personnel so that further action can be taken.

The temperature sensor is set to detect a certain temperature level. If the temperature sensor detects a temperature greater than the set point, the Embedded Controller sends an alert message to the user as an indication of fire, and to the fire station, so that further needed action can be taken.

We can call this an intelligent embedded system because the microcontroller is programmed in such a way that whenever a sensor detects a catastrophic event or a burglary attempt, it sends the notification not only to the user but also to the relevant preventive service provider, such as the fire station in case of fire detection or the police station in case of a burglary attempt, along with the location where the event is happening. In addition, the PIR sensor provides light to the user when he comes home late at night and stands in front of the door.
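A minimal sketch of this alert-dispatch logic is shown below; the set point, the stored phone numbers and the send_sms() helper are illustrative placeholders rather than the authors' implementation.

/* Hedged sketch of the alert logic described above. */
#include <stdio.h>

static void send_sms(const char *number, const char *text)
{
    printf("SMS to %s: %s\n", number, text);      /* stands in for AT+CMGS */
}

int main(void)
{
    int temperature = 78;    /* sample reading from the temperature sensor */
    int set_point   = 60;    /* configured fire-detection threshold        */
    int indoor_pir  = 1;     /* 1 = motion detected by the indoor PIR      */

    if (temperature > set_point) {
        send_sms("USER_NUMBER",         "FIRE ALERT AT HOME");
        send_sms("FIRE_STATION_NUMBER", "FIRE ALERT AT <LOCATION>");
    }
    if (indoor_pir) {
        send_sms("USER_NUMBER",           "INTRUDER DETECTED AT HOME");
        send_sms("POLICE_STATION_NUMBER", "INTRUDER ALERT AT <LOCATION>");
    }
    return 0;
}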

5 Conclusion

An Embedded Controller can be designed and developed around the AT89C52 microcontroller to remotely control and monitor electrical/electronic devices through specific SMS messages. The Embedded Controller also notifies the user of catastrophic events such as fire and burglary by means of SMS and provides light when the user arrives home at night.

Such an Embedded Controller can be used in a variety of applications [8], [14]. The Embedded Controller will be useful for the following users or systems:

A. General Public (Home users)

B. Agriculturists (Agricultural users)

C. Industries (Industrial users)

D. Electricity Board – Administrators

E. Electricity Board – Officers

F. Municipalities/ Municipal Corporations/ panchayats.

References

[1] Dr. Mikael Sjodin, "Remote Monitoring and Control Using Mobile Phones", Newline Information White Paper, www.newlineinfo.se
[2] S.R.D. Kalingamudali, J.C. Harambearachchi, L.S.R. Kumara, J.H.S.R. De Silva, R.M.C.R.K. Rathnayaka, G. Piyasiri, W.A.N. Indika, M.M.A.S. Gunarathne, H.A.D.P.S.S. Kumara, M.R.D.B. Fernando, "Remote Controlling and Monitoring System to Control Electric Circuitry through SMS using a Microcontroller", Department of Physics, University of Kelaniya, Kelaniya 11600, Sri Lanka; Sri Lanka Telecom, 5th Floor, Headquarters Building, Lotus Road, Colombo 1, Sri Lanka, [email protected], [email protected]
[3] Guillaume Peersman and Srba Cvetkovic, "The Global System for Mobile Communications Short Message Service", The University of Sheffield; Paul Griffiths and Hugh Spear, Dialogue Communications Ltd.
[4] G. Peersman, P. Griffiths, H. Spear, S. Cvetkovic and C. Smythe, "A tutorial overview of the short message service within GSM", Computing & Control Engineering Journal, April 2000.
[5] Rahul Pandhi, Mayank Kapur, Sweta Bansal, Asok Bhatacharya, "A Novel Approach to Remote Sensing and Control", Delhi University, Delhi, Proceedings of the 6th WSEAS Int. Conf. on Electronics, Hardware, Wireless and Optical Communications, Corfu Island, Greece, February 16-19, 2007.
[6] Eka Suwartadi, Candra Gunawan, "First Step Towards Internet Based Embedded Control System", Laboratory for Control and Computer Systems, Department of Electrical Engineering, Bandung Institute of Technology, Indonesia.
[7] Cisco Mobile Exchange Solution Guide.
[8] http://www.cisco.com/univercd/cc/td/doc/product/wireless/moblwrls/cmx/mmg_sg/cmxsolgd.pdf
[9] Daniel J.S. Lim, Vishy Karri, "Remote Monitoring and Control for Hydrogen Safety via SMS", School of Engineering, University of Tasmania, Hobart, Australia, [email protected] & [email protected]
[10] http://en.wikipedia.org/wiki/Remote_control
[11] http://en.wikipedia.org/wiki/Short_message_service
[12] http://www.developershome.com/sms/atCommandsIntro.asp
[13] http://en.wikipedia.org/wiki/Global_System_for_Mobile_Communications
[14] http://www.reuk.co.uk/PIR-Sensor-Circuits.htm
[15] Dr. Nizar Zarka, Jyad Al-Houshi, Mohanad Akhkobek, "Temperature Control Via SMS", Communication Department, Higher Institute for Applied Sciences and Technology (HIAST), P.O. Box 31983, Damascus, Syria, Phone +963 94954925, Fax +963 11 2237710, e-mail [email protected]


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

An Embedded System Design for Wireless Data Acquisition and Control

K.S. Ravi, K.L. College of Engg., Vijayawada, [email protected]
S. Balaji, K.L. College of Engg., Vijayawada
Y. Rama Krishna, KITE Women's College of Professional Engineering Sciences, Sahbad, Ranga Reddy

Abstract

One of the major problems in industrial automation is the monitoring and controlling of parameters in remote and hard-to-reach areas, as it is difficult for an operator to go there or even to implement and maintain wired systems. In this scenario, Wireless Data Acquisition and Control (DAQC) systems are very useful, because the monitoring and controlling are done remotely through a PC. A small-scale embedded system is designed for wireless data acquisition and control. It acquires temperature data from a sensor and sends the data to a desktop PC wirelessly and continuously at an interval of one minute, and the user can start and control the speed of a DC motor whenever required from this PC using wireless RF communication. The hardware for the Data Acquisition and Control (DAQC) system is designed using an integrated programming and development board, and the software is developed in embedded C using the CCS-PICC IDE.

1 Introduction

Small-scale embedded systems are designed with a single 8-bit or 16-bit microcontroller. They have little hardware and software complexity and involve board-level design. The embedded software is usually developed in embedded C [Rajkamal, 2007]. Data acquisition is a term that encompasses a wide range of measurement applications which require some form of characterization, monitoring or control. All data acquisition systems either measure a physical parameter (temperature, pressure, flow, etc.) or take a specific action (sounding an alarm, turning on a light, controlling actuators, etc.) based on the data received [Heintz, 2002]. A small-scale embedded system is designed here for wireless data acquisition and control using a PIC microcontroller, which is a high-performance RISC processor.

Implementation of a proper communication protocol is very important in a DAQC system for transferring data to and from the DAQC hardware, the microcontroller and the PC. Data communication is generally classified as parallel communication, serial communication and wireless communication; however, most microcontroller-based DAQC systems use serial communication over wired or wireless technologies. The popular wired protocols are RS-232C, I2C, SPI, CAN, FireWire and USB. Wireless communication, in contrast, eliminates the need for devices to be physically connected in order to communicate. The physical layer used in wireless communication is typically either an infrared channel or a radio frequency channel, and typical wireless protocols are RFID, IrDA, Bluetooth and IEEE 802.11. The present DAQC system uses RF wireless communication because it is a widespread technology and has the advantages of immunity to electrical noise interference and no need for line-of-sight propagation [Dimitrov, 2006 and Vemishetty, 2005].

2 Hardware Design and System Components

2.1 Block Diagram of DAQC system

Hardware design is the most important part of the development of a data acquisition system. The present DAQC system transmits data from an external device or sensor to the PC, and from the PC to the DAQC hardware, wirelessly using an RF modem. Figure 1 shows the block diagram connecting the various components of the DAQC system and a remote PC placed in two different labs. The sensor measures a physical quantity and converts it to an analog electrical signal. The microcontroller acquires the analog data from the sensor and converts it into digital format. The digital data is transferred to the wireless RF modem, which modulates the digital data into a wireless signal and transmits it. At the receiving end, the receiver receives the wireless data, demodulates it into digital data and transfers it to the PC through the serial port. The starting and speed control of the DC motor connected to the microcontroller port is carried out remotely from the PC.

Fig. 1: Block diagram of the DAQC System.

Experimental arrangements of the DAQC system in Lab 1 and Lab 2 are shown in figures 2 and 3.

Fig. 2: Experimental Arrangement at Lab1. Fig. 3: Experimental Arrangement at Lab 2.


2.2 LM35 - Precision Centigrade Temperature Sensor

The DAQC system monitors temperature using the LM35, a precision integrated-circuit temperature sensor whose output voltage is linearly proportional to the Celsius (Centigrade) temperature. It does not require any external calibration or trimming to provide typical accuracies of ±1⁄4°C at room temperature and ±3⁄4°C over the full −55 to +150°C temperature range. Low cost is assured by trimming and calibration at the wafer level. The LM35's low output impedance, linear output, and precise inherent calibration make interfacing to readout or control circuitry especially easy. It can be used with a single power supply or with dual power supplies. It draws only 60 µA from its supply and has very low self-heating, less than 0.1°C in still air [www.national.com/pf/LM/LM35.html].

2.3 PIC 16F877A Flash Microcontroller

The present DAQC system uses the PIC 16F877A flash microcontroller because of its various on-chip peripherals and RISC architecture. Apart from the flash memory there is a data EEPROM. Power consumption is very low, typically less than 2 mA at 5 V and 4 MHz, and 20 µA at 3 V and 32 kHz. There are three timers: Timer 0 is an 8-bit timer/counter with a prescaler, and Timer 1 is 16 bits wide with a prescaler. The 10-bit ADC is an interesting feature. A PWM output along with an RC low-pass filter allows an analog output with a maximum resolution of 10 bits, which is sufficient in many applications. There is a synchronous serial port (SSP) with SPI master mode and I2C master/slave mode. Further, a universal synchronous asynchronous receiver transmitter (USART/SCI) is supported. A parallel 8-bit slave port (PSP) with RD, WR and CS control signals is supported in the 40/44-pin version only [www.pages.drexel.edu/~cy56/PIC.htm].

2.4 Low Power Radio Modem

The Low Power Radio Modem is an ultra low power transceiver, mainly intended for 315, 433, 868 and 915 MHz frequency bands. In the present work 915MHz low power radio modem is used. It has been specifically designed to comply with the most stringent requirements of the low power short distance control and data communication applications. The UHF Transceiver is designed for very low power consumption and low voltage operated energy meter reading applications. The product is unique with features like compact, versatile, low cost, short range, intelligent data communication, etc. The product also has 2 / 3 isolated digital inputs and outputs. Necessary command sequences will be supplied to operate these tele commands from the user host. The modem supports maximum data rates up to 19.2 kbps [www.analogicgroup.com]

2.5 Target System Design

The present DAQC system is designed using an integrated programmer and development board as the target system, which supports the PIC 16F877A (40-pin DIP). The target system can be run in two modes, program mode and run mode, selected by a slide switch. In program mode, the flash memory can be programmed with the developed code in hex format from the PC to the target device. In run mode, the code that was just downloaded can be executed from reset as a stand-alone embedded system. The I/O port pins of the target device are accessible at connectors for interfacing to external devices. Figure 4 shows the PIC 16F877A development board.


Fig. 4: PIC 16F877A Development Board.

The target system includes two RS-232C ports: one is controlled by an 89C2051 and used for programming the flash program memory of the PIC 16F877A, and the second is controlled by the on-chip USART of the PIC 16F877A and used for interfacing the RF transceiver for wireless data transmission with the PC. The on-board PIC 16F877A provides 256x8 bytes of EEPROM memory, 8Kx14 words of in-system reprogrammable downloadable flash memory, 368x8 bits of RAM, a 10-bit ADC, 3 timers and a programmable watchdog timer, and 2 capture/compare/PWM modules; an external 6 MHz crystal provides the system clock for the PIC16F877A. The board also has 5 I/O-controlled LEDs, on-board power regulation with an LED power indication, and a termination providing a 5 V DC output at 250 mA.

3 Functioning of the DAQC System

The DAQC system continuously monitors the temperature through the sensor LM35. The sensor output is digitized using the on-chip ADC of the 16F877A. The control registers ADCON0 and ADCON1 of the A/D converter determine which pins of the A/D port are analog inputs, which are digital, and which pins are used for Vref- and Vref+. PCFG3–PCFG0 are the configuration bits in ADCON1; these bits determine which of the A/D port pins are analog inputs. The most common configuration is 0x00, which sets all 8 port pins as analog inputs and uses Vdd and Vss as the reference voltages. In the present system, the clock for the A/D module is derived from the internal RC oscillator and channel 0 is selected, by writing the control word 0xC1 into ADCON0. The digitized value is transferred to the remote desktop for display using the USART of the PIC microcontroller, through the RF module connected to one of the serial ports for transmission.
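As a hedged aside, the control word 0xC1 can be decoded against the standard PIC16F877A ADCON0 bit layout (ADCS1, ADCS0, CHS2, CHS1, CHS0, GO/DONE, unimplemented, ADON); the small host-side C snippet below only illustrates that decoding and is not part of the DAQC firmware.

/* Hedged illustration: decoding ADCON0 = 0xC1, assuming the standard
   PIC16F877A ADCON0 bit layout. */
#include <stdio.h>

int main(void)
{
    unsigned char adcon0 = 0xC1;          /* value used by the DAQC system     */
    unsigned adcs = (adcon0 >> 6) & 0x3;  /* 0b11 -> A/D clock from internal RC */
    unsigned chs  = (adcon0 >> 3) & 0x7;  /* 0b000 -> channel 0 (RA0/AN0)      */
    unsigned adon = adcon0 & 0x1;         /* 1 -> A/D module switched on       */

    printf("ADCS=%u CHS=%u ADON=%u\n", adcs, chs, adon);
    return 0;
}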

The output of the timer-based functions is used to provide the time base for the PWM function and the baud rate for the serial port. Each time the Timer 2 count matches the count in PR2, Timer 2 is automatically reset. The reset (match) signal is used to provide a baud rate clock for the serial module. The frequency of the PWM signal is generated from the Timer 2 output by loading PR2 with the value 127. Timer 2 counts up to the value that matches PR2, Timer 2 is reset, the PWM output bit (port C, bit 2) is set, and the process restarts; thus the PR2 value controls the OFF period of the PWM signal. As Timer 2 counts up again, it will eventually match the value placed into CCPR1L (the low 8 bits of CCPR1), at which time the PWM output is cleared to zero; thus the CCPR1L value controls the ON period of the PWM signal.
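Assuming the standard PIC16F877A CCP/Timer 2 relations (PWM period = (PR2 + 1) x 4 x Tosc x prescale, with the ON time given by CCPR1L x 4 x Tosc x prescale), the values quoted in this section imply a PWM frequency of roughly 11.7 kHz at the 6 MHz clock. The short host-side calculation below is an illustrative check, not code from the project.

/* Hedged check of the PWM timing implied by PR2 = 127 with a 6 MHz clock,
   assuming the standard PIC16F877A CCP/Timer-2 relations. */
#include <stdio.h>

int main(void)
{
    double fosc     = 6000000.0;                 /* 6 MHz crystal         */
    double tosc     = 1.0 / fosc;
    int    pr2      = 127;                       /* value loaded into PR2 */
    int    prescale = 1;                         /* T2_DIV_BY_1           */

    double period = (pr2 + 1) * 4.0 * tosc * prescale;   /* ~85.3 us      */
    printf("PWM period    : %.1f us\n", period * 1e6);
    printf("PWM frequency : %.2f kHz\n", 1.0 / period / 1e3);

    /* duty cycle for an example CCPR1L value (8-bit resolution here) */
    int ccpr1l = 64;
    double on_time = ccpr1l * 4.0 * tosc * prescale;
    printf("ON time at CCPR1L=%d : %.1f us (%.0f%% duty)\n",
           ccpr1l, on_time * 1e6, 100.0 * on_time / period);
    return 0;
}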

During the process, if the user wants to start the DC motor, the user has to hit a key on the keyboard, which starts the motor. To increase the speed of the motor the user presses 'U' on the keyboard, whereas to decrease the speed the user presses 'D'. Accordingly, an appropriate control signal is sent to the microcontroller through RF; this adjusts the PWM signal generated by the microcontroller and thereby controls the speed of the DC motor. When the user wants to quit motor control and revert to temperature monitoring, the user presses the key 'Q'.

4 Software Development of the DAQC System

Microcontrollers can be programmed using assembly language or using a high-level language such as C or BASIC. In the present system, the software is developed in embedded C using the CCS-PICC IDE. The IDE allows the user to build projects, add source code files to the projects, set compiler options for the projects, and compile projects into executable program files. The executable files are then loaded into the target microcontroller. The CCS-PICC compiler uses preprocessor commands and a database of microcontroller information to guide the generation of the software associated with the on-chip peripherals [Barnett and Thomson]. The source code for the DAQC system is given in Figure 5.

#include <16F877A.H>
#use delay(clock=6000000)
#use rs232(baud=9600, xmit=PIN_C6, rcv=PIN_C7)

unsigned char c, adcval, serval, pwmval = 0;

main()
{
   setup_adc_ports(RA0_RA1_RA3_ANALOG);   // RA0, RA1, RA3 as analog inputs
   setup_adc(ADC_CLOCK_INTERNAL);         // A/D clock from internal RC oscillator
   setup_ccp1(CCP_PWM);                   // CCP1 module in PWM mode
   setup_timer_2(T2_DIV_BY_1, 127, 1);    // Timer 2 time base, PR2 = 127
   set_adc_channel(0);                    // channel 0 (LM35 sensor)
   printf("\r\n welcome to sfs \r\n");
   delay_ms(500);
   while(1)
   {
      adcval = read_adc();
      printf("\r\n TEMPERATURE VALUE IS %d ", (adcval*2));
      delay_ms(1000);
      if(kbhit())
      {
         do
         {
            serval = getc();
            if(serval=='U')
            {
               pwmval++;
               set_pwm1_duty(pwmval);
            }
            else if(serval=='D')
            {
               pwmval--;
               set_pwm1_duty(pwmval);
            }
            else
               printf("\r\n PRESS 'U' / 'D' TO INCREASE OR DECREASE SPEED OR 'Q' TO QUIT \r\n");
            printf("\r\n CURRENT PWM VALUE IS %f ", (pwmval*.03));
         } while(serval!='Q');
      }
   }
}

Fig. 5: Source Code for the DAQC System

5 Results and Conclusion

Once the application code is downloaded into the flash program memory of the PIC 16F877A using the device programmer of the CCS-PICC IDE, the system works independently of the PC. It monitors the temperature continuously and sends the temperature data to the PC's HyperTerminal using wireless RF communication. The temperature data is displayed on the PC as shown in Figure 6. DC motor control, such as starting the motor and increasing or decreasing its speed, can be achieved from the PC in interrupt mode. Appropriate control signals for these operations are sent to the target system from the PC through wireless RF communication. The PWM signal is changed as per the control signal and is used to either increase or decrease the speed of the DC motor. The value of the PWM signal is also displayed on the PC's HyperTerminal.

Fig. 6: PC’s Hyper Terminal (Result Window)

The project "Microcontroller Based Wireless Data Acquisition System" is designed to monitor and control, through wireless technology, devices located in remote areas where it is difficult for the user to go and take the data. An added advantage of using wireless technology is that it reduces the number of connections and eliminates the chance of electrical noise. Among the various general-purpose microcontrollers available in the market, this project is implemented with the PIC 16F877A because it has various on-chip peripherals; because of these on-chip peripherals, the additional circuitry required is reduced to a great extent. The present project acquires data from only 2 physical devices; it can be extended to various other devices. This project is implemented with an RF technique; the same can be implemented with IR, Bluetooth, etc. Wireless technology is somewhat limited in bandwidth and range, which sometimes offsets its inherent benefits.

References

[1] Barnett, Cox and O'Cull, Embedded C Programming and the Microchip PIC, Thomson Delmar Learning.
[2] David Heintz (2002), Essential Components of Data Acquisition Systems, Application Note 1386, Agilent Technologies.
[3] Smilen Dimitrov (2006), "A Simple Practical Approach to Wireless Distributed Data Acquisition".
[4] Kalyanramu Vemishetty (2005), Embedded Wireless Data Acquisition System, Ph.D. Thesis.
[5] Raj Kamal (2007), Embedded Systems: Architecture, Programming and Design, McGraw-Hill Education, 2nd edition.
[6] LM35 data sheet, www.national.com/pf/LM/LM35.html
[7] PIC 16F877A Microcontroller Tutorial revA, www.pages.drexel.edu/~cy56/PIC.htm
[8] RF module specification, www.analogicgroup.com


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Bluetooth Security

M. Suman, Dept of ECM, K.L.C.E, [email protected]
P. Sai Anusha, Dept of IST, K.L.C.E, [email protected]
M. Pujitha, Dept of ECM, K.L.C.E, [email protected]
R. Lakshmi Bhargavi, Dept of ECM, K.L.C.E, [email protected]

Abstract

Bluetooth is emerging as a pervasive technology that can support wireless communication in various contexts of everyday life. By installing a Bluetooth network in your office you can do away with the complex and tedious task of networking between computing devices, yet have the power of connected devices. Enhancement of Bluetooth security has become a pressing necessity. To improve the security of the PIN, or simply to curb the possibility of generating a hypothesis of the initialization key (K_init) in less than a second, this paper proposes the inclusion of two algorithms: one in the SAFER+ algorithm and the other in random number generation.

1 Introduction

Bluetooth is a new technology for wireless communication. The target of the design is to connect different devices together wirelessly in a small environment such as an office or a home. The Bluetooth range, which at the moment is about 10 meters, restricts the environment. Before accepting the technology, a close look at its security functions has to be taken. Especially in an office, the information broadcast over a Bluetooth piconet can be sensitive and requires good security. Bluetooth employs several layers of data encryption and user authentication measures. Bluetooth devices use a combination of the Personal Identification Number (PIN) and a Bluetooth address to identify other Bluetooth devices. Data encryption can be used to further enhance the degree of Bluetooth security. Establishment of a channel between two Bluetooth devices occurs through many stages. It includes the use of the E21, E22 and E1 algorithms, which produce keys such as the authentication keys and are based on the SAFER+ algorithm. The new logic included is explained here in Part I and Part II.

2 Bluetooth Security

2.1 Part-I

First consider the random number generation algorithm; an enhancement of that logic is proposed here. The random number is the only plaintext that is sent by the master device to the slave device. A cracker first gets hold of it by sniffing, thereby makes an assumption of the PIN, and generates a hypothesis for K_init. The attacker can then use a brute-force algorithm to find the PIN used: the attacker enumerates all possible values of the PIN. Knowing IN_RAND and the BD_ADDR, the attacker runs E22 with those inputs and the guessed PIN, and obtains a hypothesis for K_init. The attacker can now use this hypothesis of the initialization key to decode messages 2 and 3. Messages 2 and 3, described in Figure 1, contain enough information to perform the calculation of the link key Kab, giving the attacker a hypothesis of Kab. The attacker then uses the data in the last 4 messages to test the hypothesis: using Kab and the transmitted AU_RAND_A (message 4), the attacker calculates SRES and compares it to the data of message 5. If necessary, the attacker can use the values of messages 6 and 7 to re-verify the hypothesis of Kab until the correct PIN is found.

Fig. 1

The basic method of K_init generation can be outlined as follows:

During the initial step, the master generates a 128-bit random number, IN_RAND, which is broadcast to the slave. Both the master and the slave make use of it and, with the help of the Bluetooth device address, the PIN and IN_RAND, generate K_init (the initialization key) using the E22 algorithm. Whenever a cracker eavesdrops and listens to IN_RAND, he can form a hypothesis of K_init by assuming PINs of all possibilities. K_init is then given as input to the E21 algorithm along with LK_RAND, resulting in the generation of Kab. Kab, along with a random number and the Bluetooth address, is given as input to the E1 algorithm, and consequently SRES is produced in both the master and the slave devices. These are checked for equality; if they are equal, the particular PIN guessed by the cracker is taken as correct. This whole process of cracking the Bluetooth PIN can be completed easily with the help of algebraic optimizations. Therefore, a slight change in broadcasting the random number is proposed in this paper.

This can now be modified for better convenience as follows: n random numbers will be generated. The value of n must be selected by the master and the slave by their own choice, but in common. The n random numbers are then generated at an interval of 100 milliseconds. All of them are stored in a database with two fields, one specifying the index and the other the corresponding random number. The 100 millisecond interval is chosen in order to minimize the transfer traffic, and the corresponding data is also stored on the slave device. This set of random numbers enters a logic known as logic-1, which is described in this paper. Finally, FIN_RAND is the output of logic-1, which in turn is given as input to the E22 algorithm, and the entire process thereafter remains as it is.


2.2 Security Modes

Bluetooth has three different security modes built into it, as follows:

Security Mode 1: a device will not initiate any security procedure. This is a non-secure mode [12].

Security Mode 2: a device does not initiate security procedures before channel establishment at the L2CAP level. This mode allows different and flexible access policies for applications, especially for running applications with different security requirements in parallel. This is a service-level enforced security mode [12].

Security Mode 3: a device initiates security procedures before the link set-up at the LMP level is completed. This is a link-level enforced security mode [12].

2.3 Part-II

A small modification of the SAFER+ algorithm is also proposed so that its security level is tightened. SAFER+ is the basic algorithm underlying algorithms such as E22, E21 and E1, which are used in the initialization key, authentication key and encryption key generation techniques.

Outline of the SAFER+ (K-64) algorithm:

The enciphering algorithm consists of r rounds of identical transformations that are applied in sequence to the plaintext, followed by an output transformation, to produce the final ciphertext. The recommendation is to use r = 6 for most applications, but up to 10 rounds can be used if desired. Each round is controlled by two 8-byte subkeys, and the output transformation is controlled by one 8-byte subkey. These 2r+1 subkeys are all derived from the 8-byte user-selected key K1. The output transformation of SAFER K-64 consists of the bit-by-bit XOR ("exclusive or", or modulo-2 sum) of bytes 1, 4, 5 and 8 of the last subkey, K2r+1, with the corresponding bytes of the output from the r-th round, together with the byte-by-byte addition (modulo-256) of bytes 2, 3, 6 and 7 of the last subkey, K2r+1, to the corresponding bytes of the output from the r-th round. After this, the Armenian shuffle and substitution boxes are included, whose output is given to the layer which performs the bit-by-bit addition/XOR operation. The problem with this is that a simple algebraic matrix helps to trace the encryption faster with the help of look-up tables and some optimization techniques. The proposed small change is to include a logic-2 layer, and thereby a layer which performs solitaire encryption, whose output is then given as it is to the pseudo-Hadamard transformation layers. Logic-2 is explained in Algorithm 2 and, for better understanding, illustrated in Figure 3. By encrypting the key itself, working backward to decrypt the key and thereby cracking the Bluetooth PIN becomes extremely difficult for a PIN cracker.

Fig. 3:

The PHT (pseudo-Hadamard transform) used above is a reversible transformation of a bit string that provides cryptographic diffusion. The bit string must be of even length, so it can be split into two bit strings a and b of equal length, each of n bits. The transformed values a' and b' are computed from these using the standard pseudo-Hadamard transform equations a' = a + b (mod 2^n) and b' = a + 2b (mod 2^n).


3 Algorithms

The algorithms for logic-1 and logic-2 are given below.

3.1 Algorithm-1 for Logic-1

• Algorithm for random key generation

• Input: n random numbers, each generated by the RAND function (each 128 bits), [a_i] (i = 1 to n)

• Output: encrypted RAND

// the algorithm starts here
For i = 1 to 128
    (b_i)_1 = 0
For i = 1 to n
    Store a_i in the rand table
For i = 1 to n
    For r = 1 to 128
        If ((b_r)_i != 0)
            (b_r)_(i+1) = (b_r)_i + (-1)^n (a_r)_i
        Else
            (b_r)_(i+1) = (b_r)_i + (a_r)_i
        r++ ; i++
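The notation of Algorithm-1 is compact, so the sketch below shows only one possible reading of it in host-side C: the n buffered 128-bit random numbers are folded bit-by-bit into a single FIN_RAND accumulator, with the per-bit addition taken modulo 2 (i.e. XOR). This interpretation, the value of n and the sample inputs are assumptions, not the authors' implementation.

/* One possible host-side reading of logic-1 (Algorithm-1): fold n stored
   128-bit random numbers into FIN_RAND, treating per-bit addition as XOR. */
#include <stdio.h>
#include <stdlib.h>

#define N_RANDS  4                     /* 'n' agreed by master and slave  */
#define RAND_LEN 16                    /* 128 bits = 16 bytes             */

int main(void)
{
    unsigned char table[N_RANDS][RAND_LEN];   /* indexed table of random numbers */
    unsigned char fin_rand[RAND_LEN] = {0};   /* accumulator b, initially zero   */

    srand(1);                                  /* fixed seed for the demo        */
    for (int i = 0; i < N_RANDS; i++)          /* store the n random numbers     */
        for (int j = 0; j < RAND_LEN; j++)
            table[i][j] = (unsigned char)(rand() & 0xFF);

    for (int i = 0; i < N_RANDS; i++)          /* fold them into FIN_RAND        */
        for (int j = 0; j < RAND_LEN; j++)
            fin_rand[j] ^= table[i][j];

    printf("FIN_RAND = ");
    for (int j = 0; j < RAND_LEN; j++)
        printf("%02X", fin_rand[j]);
    printf("\n");                              /* FIN_RAND then feeds E22        */
    return 0;
}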

3.2 Algorithm-2 for Logic-2

• Algorithm: modified SAFER+ algorithm

• Input: 8 bytes after the bit addition/XOR operation

• Output: encrypted 8-byte number

// the algorithm starts here
Get 64 bits (8 bytes) [p64 p63 p62 ... p1]
Take a 65th bit p65 = 0 and make it the MSB
Group the 65 bits into groups of 5 bits each
Name the groups a13 a12 a11 ... a1; consider a14 = 00000, a15 = 00000
for i = 1 to 15
do
    b_i = a_i mod 26
    c_i = convert b_i to the corresponding alphabet
done
for i = 1 to 15
do
    mask_i = solitaire key generation(c_i)
    addn_i = mask_i + b_i
    d_i = (addn_i mod 26) or (addn_i mod 36)
done
Obtain 75 bits from d15 d14 ... d1 as k75 k74 k73 ... k1
Perform 11 levels of XORing and reduce them to 64 bits
return (encrypted 8 bytes)

4 Conclusion

This work acts as a firewall against Bluetooth PIN crackers who use advanced algebraic optimizations for the attack. Re-pairing attacks will also be minimized, as there will be no need to broadcast all the random numbers again. Bluetooth security is not complete, but it seems it was not meant to be that way; more security can be accomplished easily with additional software that is already available. Further work on Bluetooth security will be presented in other papers.

References

[1] Specification of the Bluetooth System, Volume 1B, December 1st 1999.
[2] Knowledge Base for Bluetooth information, http://www.infotooth.com/
[3] General information on Bluetooth, http://www.mobileinfo.com/bluetooth/
[4] Thomas Muller, Bluetooth White Paper: Bluetooth Security Architecture, Version 1.0, 15 July 1999.
[5] Annikka Aalto, Bluetooth, http://www.tml.hut.fi/Studies/Tik110.300/1999/Essays/
[6] Bluetooth information, http://www.bluetoothcentral.com/
[7] Oraskari, Jyrki, Bluetooth 2000, http://www.hut.fi/~joraskur/bluetooth.html
[8] How Stuff Works, information on BT, http://www.howstuffworks.com/bluetooth3.htm
[9] Information on Bluetooth (Official Homepage), http://www.bluetooth.com/
[10] Bluetooth Baseband, http://www.infotooth.com/tutorial/BASEBAND.htm
[11] Bluetooth Glossary, http://www.infotooth.com/glossary.htm#authentication
[12] Frederik Armknecht, A linearization attack on the Bluetooth key stream generator, Cryptology ePrint Archive, Report 2002/191, available from http://eprint.iacr.org/2002/191/, 2002.
[13] Yaniv Shaked and Avishai Wool, "Cracking the Bluetooth PIN", MobiSys '05: The Third International Conference on Mobile Systems, Applications, and Services, USENIX Association.


Proceedings of the International Conference on Web Sciences ICWS-2009 January 10th and 11th, 2009

Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Managing Next Generation Challenges and Services through Web Mining Techniques

Rajesh K. Shukla, CIST, Bhopal, [email protected]
P.K. Chande, IIM Indore
G.P. Basal, CSE, SATI, Vidisha

Abstract

Web mining technology differs from pure database-based mining due to Web data's semi-structured and heterogeneous (mixed-media) character. With the large size and dynamic nature of the Web and the rapidly growing number of WWW users, hidden information becomes ever more valuable. As a consequence of this phenomenon, the need for continuous support and updating of Web-based information retrieval systems, for mining Web data and for analyzing on-line users' behavior and their on-line traversal patterns has emerged as a new area of research. Web mining is a crossing point of databases, information retrieval and artificial intelligence. Research in web mining is also related to many different research areas, such as databases, information retrieval, artificial intelligence, machine learning, natural language processing and many others. Data mining for Web intelligence is an important research thrust in Web technology, one that makes it possible to fully use the immense information available on the Web.

This paper presents a complete framework for web mining. We present a broad overview, rather than an in-depth analysis, of Web mining: the taxonomy and function of Web mining research issues, techniques and development efforts, as well as emerging work in Semantic Web mining.

Keywords: WWW, Knowledge Discovery, Web Mining, WCM, WSM, WUM, Semantic Web Mining.

1 Introduction

Data mining is used, in the database community, to identify valid, novel, potentially useful and ultimately understandable patterns from data collections. Data mining is an emerging research field based on various kinds of research, such as machine learning, inductive learning, knowledge representation, statistics and information visualization, taking into account the characteristic features of databases. The World Wide Web is also an extensive and rapidly growing source of information; at the same time it is extremely distributed. Web search is one of the most universal and influential applications on the Internet, but searching it exhaustively is inefficient in terms of time complexity. A particular type of data, such as authors' lists, may be scattered across thousands of independent information sources in many different formats. Determining the size of the World Wide Web is extremely difficult; in 1999 it was estimated to contain over 350 million pages, growing at a rate of about 1 million pages a day. The Web can thus be viewed as the largest data source available, and it presents a challenging task for effective design and access. One of the main challenges for large corporations adopting World Wide Web sites is to discover and rediscover useful information from very rich but also diversified sources in the Web environment.

In order to help people utilize these resources, researchers have developed many search engines, which have brought people great convenience. At the same time, the search results often cannot satisfy users' demands perfectly, because the Web is structureless and dynamic, and Web pages are more complex than plain text documents. One way to address these problems is to use Web mining, connecting traditional mining techniques with the Web.

Data mining technology normally adopts data integration methods to generate a data warehouse, on which to mine relation rules and cluster characteristics and to obtain useful model prediction and knowledge evaluation. Web mining can be viewed as the use of data mining techniques to automatically retrieve, extract and evaluate information for knowledge discovery from Web documents and services; the application of data mining techniques to the World Wide Web is thus referred to as Web mining. With the rapid increase of information on the WWW, Web mining has gradually become more and more important within data mining.

Web mining is a new research issue which draws great interest from many communities. It has been the focus of several recent research projects and papers, because people hope to gain knowledge patterns through searching and mining the Web; these useful knowledge patterns can help us in many ways, e.g. to build efficient web sites that serve people better. Web mining is thus a technique that seeks to extract knowledge from Web data, and it combines two prominent research areas: data mining and the World Wide Web (WWW). Web mining can be divided into three classes: web content mining, web structure mining and web usage mining.

2 Process of Web Mining

Web mining is the process of studying and discovering web user behavior from web log data. Usually the web log data collection is done over a long period of time (one day, one month, one year, etc.). Then three steps, namely preprocessing, pattern discovery and pattern analysis, as shown in figure 1, are carried out. Preprocessing is the process of transforming the raw data into a usable data model. The pattern discovery step uses several data mining algorithms to extract the user patterns. Finally, pattern analysis reveals useful and interesting user patterns and trends. Pattern analysis is the final stage of the whole web usage mining process; its goal is to eliminate the irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process, since the raw output of the clustering, association and sequence-analysis algorithms is often not directly usable. These steps are normally executed after the web log data is collected.

Fig. 1: General Process of Web Mining


The objects of Web mining include server logs, Web pages, Web hyperlink structures, on-line market data, and other information. When people browse a Web server, the server produces three kinds of log documents: server logs, error logs, and cookie logs. By analyzing these log documents we can mine access information.

3 Taxonomy of Web Mining

Web mining can use data mining techniques to automatically discover and extract information from Web documents and services, which can help people extract knowledge, improve Web site design, and develop e-commerce better. Web mining is thus the application of data mining and other information processing techniques to the WWW to find useful patterns, which people can take advantage of to access the WWW more efficiently. Like other data mining applications, it can benefit from structure given in the data (as in database tables), but it can also be applied to semi-structured or unstructured data such as free-form text.

Most existing web mining methods work on Web pages written in HTML, and Web pages are all connected by hyperlinks, which carry very important mining information. Web hyperlinks are therefore very authoritative resources, and user registrations can also help mining. Web documents contain semi-structured data including audio (wave), images and text, making Web data multi-dimensional and heterogeneous. Web mining research can be classified into three major categories according to the kind of mined information and the goals that the particular categories set: Web content mining (WCM), Web structure mining (WSM), and Web usage mining (WUM), as shown in figure 2.

Fig. 2: Taxonomy of Web Mining

3.1 Web Content Mining

It is the process of information discovery from sources across the World Wide Web. A well-known problem related to web content mining is experienced by any web user trying to find all, and only, the web pages that interest him from the huge number of available pages. Web content mining therefore refers to the discovery of useful information from web content, including text, images, multimedia, etc.; above all, among these data types, text and hyperlinks are quite useful and information-rich attributes. It focuses on the discovery of knowledge from the content of web pages, and research in web content mining therefore encompasses resource discovery from the web, document categorization and clustering, and information extraction from web pages. Agents search the web for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents may be used for intelligent search, for classification of web pages, and for personalized search by learning user preferences and discovering web sources that meet these preferences.

Web content mining can take advantage of the semi-structured nature of Web page text; it can be used to detect co-occurrences of terms in texts. For example, trends over time may be discovered, indicating a surge or decline in interest in certain topics such as the programming language Java. Another application area is event detection: the identification of stories in continuous news streams that correspond to new or previously unidentified events. However, there are some problems with web content mining:

1. Current search tools suffer from low precision due to irrelevant results.

2. Search engines are not able to index all pages, resulting in imprecise and incomplete searches due to information overload. The overload problem is very difficult to cope with, as information on the web is immense and grows dynamically, raising scalability issues.

3. Moreover, a myriad of text and multimedia data is available on the web, prompting the need for intelligent agents for automatic mining.

3.2 Web Usage Mining

An important area in Web mining is Web usage mining, the discovery of patterns in the browsing and navigation data of Web users. Web usage mining is the application of data mining techniques to large Web data repositories in order to produce results that can be used in design tasks. It is the process of mining for user browsing and access patterns, and it has become an important technology for understanding users' behavior on the Web. The objects of web usage mining include server logs, Web pages, Web hyperlink structures, on-line market data, and other information. Currently, most Web usage mining research has been focusing on the Web server side.

The main purpose of research in web usage mining is to improve a Web site’s service and the server’s performance. Some of the data mining algorithms that are commonly used in Web Usage Mining are association rule generation, sequential pattern generation, and clustering. Association Rule mining techniques discover unordered correlations between items found in a database of transactions. In the context of Web Usage Mining a transaction is a group of Web page accesses, with an item being a single page access.

A Web usage mining system can determine temporal relationships among data items. Web usage mining focuses on analyzing visiting information from logged data in order to extract usage patterns, which can be classified into three categories: similar user groups, relevant page groups and frequent access paths. These usage patterns can be used to improve Web server system performance and enhance the quality of service to end users.

A Web server usually registers a Web log entry for every access of a Web page. There are many types of Web logs due to different servers and different settings, but all Web log files contain the same basic information. A Web log is usually saved as a text (.txt) file. Due to the large amount of irrelevant information in the Web log, the original log cannot be used directly in the Web log mining procedure. Through data cleaning, user identification, session identification and path completion, the information in the Web log can be used as a transaction database for the mining procedure.


The Web site's topological structure is also used in session identification and path completion. Web usage mining focuses on techniques that can predict the behavior of users while they interact with the WWW. It collects data from Web log records to discover user access patterns of Web pages.

There are several available research projects and commercial products that analyze those patterns for different purposes. The applications generated from this analysis can be classified as personalization, system improvement, site modification, business intelligence and usage characterization. Web usage mining has several applications in e-business, including personalization, traffic analysis, and targeted advertising. The development of graphical analysis tools such as Webviz popularized Web usage mining of Web transactions. The main areas of research in this domain are Web log data preprocessing and identification of useful patterns from this preprocessed data using mining techniques.

3.3 Web Structure Mining

Using data mining methods, web pages can be automatically classified into a usable web page classification system organized by hyperlink structure. Web structure mining deals with the connectivity of websites and the extraction of knowledge from the hyperlinks of web sites; it therefore studies the Web's hyperlink structure. It usually involves analysis of the in-links and out-links of a web page, and it has been used for search engine result ranking. Automatic classification of documents allows a search engine to index a mass of otherwise disordered data on the Web. In the beginning, web mining was classified into web content mining and web usage mining by Cooley, and later Kosala and Blockeel added web structure mining.

Web structure mining is an approach based on directory structures and web graph structures of hyperlinks. Web structure mining is closely related to analyzing hyperlinks and link structure on the web for information retrieval and knowledge discovery. Web structure mining can be used by search engines to rank the relevancy between websites classifying them according to their similarity and relationship between them. Personalization and recommendation systems based on hyperlinks are also studied in web structure mining. Web structure mining is used for identifying “authorities”, which are web pages that are pointed to by a large set of other web pages that make them candidates of good sources of information. Web structure mining is also used for discovering community networks by extracting knowledge from similarity links.

Web structure mining is a research field focused on using the analysis of the link structure of the web, and one of its purposes is to identify more preferable documents. Web structure mining exploits the additional information that is (often implicitly) contained in the structure of hypertext. Therefore, an important application area is the identification of the relative relevance of different pages that appear equally pertinent when analyzed with respect to their content in isolation.

Domain applications related to web structure mining that are of social interest include criminal investigations and security on the web, and digital libraries, where authoring, citations and cross-references form the community of academics and their publications. With the growing interest in Web mining, research on structure analysis has increased, and these efforts have resulted in a newly emerging research area called link mining, which is located at the intersection of work in link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. Recently, Getoor and Diehl introduced the term "link mining" to put special emphasis on the links as the main data for analysis, and they provide an extended survey of the work related to link mining. The method is based on building a graph out of a set of related data and applying social network theory to discover similarities. There are many ways to use the link structure of the Web to create notions of authority. The main goal in developing applications for link mining is to make good use of the understanding of this intrinsic social organization of the Web.

3.4 Semantic Web Mining

Related to web content mining is the effort to organize the semi-structured web data into structured collections of resources, leading to more efficient querying mechanisms and more efficient information collection or extraction. This effort is the main characteristic of the "Semantic Web", which is considered the next web generation, and it provides a method for the semantic analysis of web pages. The Semantic Web is based on "ontologies", which are metadata related to the web page content that make the site meaningful to search engines. Analysis of web pages is performed with regard to unwritten and empirically proven agreements between users and web designers, using web patterns. This method is based on the extraction of patterns which are characteristic of a concrete domain. Patterns provide a formalization of the agreement and allow the assignment of semantics to parts of web pages. In the Semantic Web, adding semantics to a Web resource is accomplished through explicit annotation (based on an ontology).

Semantic annotation is the process of adding formal semantics (metadata, knowledge) to web content for the purpose of more efficient access and management; its first goal is to simplify querying, and its second is to improve the relevance of answers. Currently, researchers are working on the development of fully automatic methods for semantic annotation. We consider semantic annotation and the tracing of user behavior when querying in search engines to be important. Currently, there are two trends in the field of semantic analysis: one provides mechanisms for semi-automatic page annotation and the creation of semantic web documents, while the second approach prefers automatic annotation of real internet pages. Web content-mining techniques can accomplish the annotation process through ontology learning, mapping, merging, and instance learning. With the Semantic Web, page ranking is decided not just by the approximated semantics of the link structure, but also by explicitly defined link semantics expressed in OWL. Thus, page ranking will vary depending on the content domain. Data modeling of a complete Web site with an explicit ontology can enhance usage-mining analysis through enhanced queries and more meaningful visualizations.

4 Different Approaches for Information Extraction on the Web

Word-based search in which keyword indices are used to find documents with specified keywords or topics;

Querying deep Web sources where information hides behind searchable database query forms and that cannot be accessed through static URL links;

Web linkage pointers are very useful in recent page ranking algorithms used in search engines.


5 Important Operation on the Web

Pattern discovery is the key component of web mining. It covers algorithms and techniques from several research areas, such as data mining, machine learning, statistics, and pattern recognition, and it has separate subsections as follows.

5.1 Classification

Classification is a method of assigning data items to one of a set of predefined classes. Several algorithms can be used to classify data items or pages; some of them are decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, and Support Vector Machines.

5.2 Clustering

Clustering is the grouping of similar data items or pages. Clustering of user information or pages can facilitate the development and execution of future marketing strategies.

5.3 Association Rules

Association rule mining techniques can be used to discover unordered correlations between items found in a database of transactions.
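As a hedged illustration of the support and confidence measures behind association rules (not an algorithm from this paper), the small C program below counts them for the rule A => B over a toy set of page-access transactions; the page names and transactions are made-up examples.

/* Hedged illustration of support and confidence for the rule A => B over a
   toy transaction set. */
#include <stdio.h>
#include <string.h>

#define N_TRANS 5

int main(void)
{
    /* each transaction is the set of pages accessed in one session */
    const char *transactions[N_TRANS] = {
        "A B C", "A B", "A C", "B C", "A B D"
    };

    int count_a = 0, count_ab = 0;
    for (int i = 0; i < N_TRANS; i++) {
        int has_a = strstr(transactions[i], "A") != NULL;
        int has_b = strstr(transactions[i], "B") != NULL;
        count_a  += has_a;
        count_ab += has_a && has_b;
    }

    /* support(A=>B) = P(A and B); confidence(A=>B) = P(B | A) */
    printf("support    = %.2f\n", (double)count_ab / N_TRANS);
    printf("confidence = %.2f\n", (double)count_ab / count_a);
    return 0;
}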

5.4 Statistical Analysis

Statistical analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file. By analyzing the statistical information contained in the periodic web system report, the extracted report can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.

5.5 Sequential Pattern

This technique intends to find inter-session patterns, such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Sequential patterns also include other types of temporal analysis such as trend analysis, change point detection, and similarity analysis.

6 Problems with Web Mining

1. Due to the lack of a uniform structure of web pages, web page complexity far exceeds the complexity of any traditional text document collection. Moreover, a tremendous number of documents on the Web have not been indexed, which makes searching the data they contain extremely difficult.

2. The Web constitutes a highly dynamic information source. Not only does the Web continue to grow rapidly, the information it holds also receives constant updates. Linkage information and access records also undergo frequent updates.

3. The Web serves a broad spectrum of user communities. The Internet's rapidly expanding user community connects millions of workstations. These users have markedly different backgrounds, interests, and usage purposes. Many lack good knowledge of the information network's structure, are unaware of a particular search's heavy cost, frequently get lost within the Web's ocean of information, and can chafe at the many access hops and lengthy waits required to retrieve search results.

4. Only a small portion of the Web's pages contains truly relevant or useful information. A given user generally focuses on only a tiny portion of the Web, dismissing the rest as uninteresting data that serves only to swamp the desired search results. How can a search identify the portion of the Web that is truly relevant to one user's interests? How can a search find high-quality Web pages on a specified topic?

7 Conclusion and Future Directions

Web mining can be considered the application of general data mining techniques to the Web. Given today's information overload, it is a new and promising research area that helps users gain insight into the overwhelming amount of information on the Web. We have discussed its key component, the mining process itself. In this paper we presented a preliminary discussion of Web mining, including its definition, process, and taxonomy, and introduced semantic Web mining and link mining. Web mining is a new research field with great prospects, and its technology has wide application: text data mining on the Web, time and spatial sequence data mining on the Web, Web mining for e-commerce systems, hyperlink structure mining of Web sites, and so on.

A lot of work still remains to be done in adapting known mining techniques as well as developing new ones. Firstly, even though the Web contains a huge volume of data, that data is distributed across the Internet, so before mining we need to gather the Web documents together. Secondly, Web pages are semi-structured; for easy processing, documents should be extracted and represented in a common format. Thirdly, Web information tends to be diverse in meaning, so training and testing data sets should be large enough. Despite these difficulties, the Web also provides other means to support mining; for example, the links among Web pages are an important resource to exploit. Besides the challenge of finding relevant information, users also face other difficulties when interacting with the Web, such as assessing the quality of the information found, creating new knowledge out of the information available on the Web, personalizing the information found, and learning about other users.

The increasing demand for Web services cannot be matched by increases in server capability and network speed alone. Therefore, many alternative solutions, such as cluster-based Web servers, P2P technologies, and Grid computing, have been developed to reduce the response time observed by Web users. Accordingly, mining distributed Web data is becoming recognized as a fundamental scientific challenge. Web mining technology still faces many challenges, and the following issues must be addressed:

1. There is a continual need to identify new kinds of knowledge about user behavior that should be mined for.

2. There will always be a need to improve the performance of mining algorithms.

3. There is a need to develop new mining algorithms and models in an efficient manner.


4. There is a need for integrated logs in which all the relevant information from the various, diversified sources in the Web environment can be kept, so that knowledge can be mined more comprehensively.

References

[1] http://acsr.anu.edu.au/staff/ackland/papers/political_web_graphs.pdf (15 January 2007).
[2] Facca, F.M., and Lanzi, P.L. (2005). Mining Interesting Knowledge from Weblogs: A Survey. Data & Knowledge Engineering, 53(3):225–241.
[3] Badia, A., and Kantardzic, M. (2005). Graph Building as a Mining Activity: Finding Links in the Small. Proceedings of the 3rd International Workshop on Link Discovery, ACM Press, pages 17–24.
[4] Chen, H., and Chau, M. (2004). Web Mining: Machine Learning for Web Applications. Annual Review of Information Science and Technology (ARIST), 38:289–329.
[5] Baldi, P., Frasconi, P., and Smyth, P. (2003). Modeling the Internet and the Web: Probabilistic Methods.
[6] Getoor, L. (2003). Link Mining: A New Data Mining Challenge. SIGKDD Explorations, 4(2).
[7] Berendt, B., Hotho, A., and Stumme, G. (2002). Towards Semantic Web Mining. Proc. US Nat'l Science Foundation Workshop on Next-Generation Data Mining (NGDM), Nat'l Science Foundation.
[8] Srivastava, J., Desikan, P., and Kumar, V. (2002). Web Mining: Accomplishments and Future Directions. Proc. US Nat'l Science Foundation Workshop on Next-Generation Data Mining (NGDM), Nat'l Science Foundation.
[9] Chakrabarti, S. (2000). Data Mining for Hypertext: A Tutorial Survey. ACM SIGKDD Explorations, 1(2):1–11.
[10] Wang Xiao Yan (2000). Web Usage Mining. Ph.D. thesis.
[11] Cooley, R., Mobasher, B., and Srivastava, J. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. 9th International Conference on Tools with Artificial Intelligence (ICTAI '97), New Port Beach, CA, USA, IEEE Computer Society, pages 558–567.


Proceedings of the International Conference on Web Sciences

ICWS-2009 January 10th and 11th, 2009 Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Internet Based Production and Marketing Decision Support System of Vegetable Crops in Central India

Gigi A. Abraham, KVK, JNKVV, Jabalpur ([email protected])
B. Dass, IDSC, JNKVV, JBP ([email protected])
A.K. Rai, IDSC, JNKVV, Jabalpur ([email protected])
A. Khare, IDSC, JNKVV, Jabalpur

Abstract

Given the ever-growing demand for vegetables for domestic consumption and the enormous scope for exports, the per-hectare yield of vegetables can be increased by using advanced technology. The role of the horticultural research system is to provide technological support for this ever-expanding vegetable production. To do this, agriculturists must be well informed about the available techniques, such as good high-yielding varieties, soil fertility evaluation, fertilizer application, the importance of organic manure, pest management, harvest and post-harvest technologies, and marketing. In spite of successful research on new agricultural practices concerning crop cultivation, the majority of farmers are not obtaining the attainable yield, for several reasons. One reason is that expert advice regarding crop cultivation is not reaching the farming community in a timely manner. Indian farmers need timely expert advice to make them more productive and competitive. By exploiting the recent Internet revolution, we aim to build a web-based information system, an IT-based effort to improve vegetable production and its marketing. It will provide production and marketing decision support to vegetable-growing farmers in Central India through the Internet. The system aims to provide dynamic, functional, web-based vegetable production and marketing support for farmers, agricultural extension agencies, state agricultural departments, agricultural universities, etc. The information system is developed using PHP and MySQL.

Keywords: Internet, vegetable production, marketing, Decision support

1 Introduction

India has emerged as a country with a sound foothold in the field of information technology. Use of the Internet has given the globe a shrinking effect: every kind of information is only a few clicks away, and graphical user interfaces and multimedia have simplified some of the most complex tasks. The time has come to exploit this medium in the best-suited interests of other fields of life, such as agriculture.

Today one can observe that progress in information technology (IT) is affecting all spheres of our lives. In any sector, information is the key to development, and agriculture is no exception. Giving relevant and correct information to farmers at the right time can help agriculture a great deal: it helps them take timely action, prepare strategies for the next season or year, anticipate market changes, and avoid unfavorable circumstances. So the development of agriculture may depend on how fast relevant information is provided to the end users. There are traditional methods of providing information to the end users, but they are mostly untimely and the communication is one-way only; it takes a long time to provide the information and to get feedback from the end users. It is now time to look at new technologies and methodologies.

The application of information technology to agriculture and rural development has to be strengthened. It can help in optimal farm production, pest management, and agro-environmental resource management by way of effective information and knowledge transfer. Vegetables form the most important component of a balanced diet. The recent emphasis on horticulture in our country, consequent to the recognition of the need for attaining nutrition security, provides ample scope for enhancing quality production of vegetable crops. In India there is a wide gap between available technology and its dissemination to end users; because of this, the majority of farmers are still using traditional agricultural practices. The skills of the farmers have to be improved in the field of vegetable production and in acquiring marketing intelligence. This is possible only when complete and up-to-date knowledge of available technologies in all aspects of vegetable crop production is made available to them through easy and user-friendly access to a knowledge-based resource. It is expected that if communication through the Internet and the World Wide Web is taken seriously, the efficiency of delivering information can be increased.

Agricultural extension workers and farmers will be able to use the system to find answers to their problems and also to grow better varieties of vegetable crops with very good marketing potential. Rural people can use two-way communication through the online service for crop information, purchase of agri-inputs and consumer durables, and sale of rural produce online at reasonable prices. The system will be developed in such a way that the user can interact with the software to obtain information for a set of vegetable crops.

2 Material and Methods

Data are collected from farmers by on-site inspection of fields and from secondary sources such as expert interviews, the Internet, literature, and manuals. Data are gathered regarding agro-climatic zones, economic and field information, crop information, recommended varieties for each zone, nursery management details, fertilizer management, irrigation, intercropping, weed management, insect management, and disease management. Photographs, video clips, and audio clips are collected, and marketing details for the last 10 years are also collected.

The information collected is stored in a MySQL database. The decision support system is bilingual (English and Hindi), so that end users can make use of it effectively. The system is developed in PHP (Hypertext Preprocessor) and runs on an Apache server.

Information collected from farmers and experts is compiled and stored in a relational database (MySQL). The decision support system is built on this database and is made available to end users through the Internet.


Fig. 1: Information Flow of the System

3 System Description

This user-friendly decision-making system is developed in both English and Hindi.


The system provides production technologies and guidelines to farmers according to their agro-climatic zone and the crop they cultivate. It also provides marketing information, such as current crop demand and prices, as well as information on pest, disease, weed, and nutrient management.

Nursery management is a very important aspect of vegetable crops, and farmers generally lack this information to a great extent; the system covers this feature in detail by incorporating video clips supported by audio and text. Fertilizer management of vegetable crops is also covered: the system displays the rates of fertilizer application, including organic manure and compost, at the different stages of plant growth, along with the application methods. Irrigation requirements of vegetable crops are incorporated in the software, with emphasis on the timing and method of irrigation as required by each vegetable crop.

The diagnostic module deals with solving the problems of vegetable growers using the intelligence stored in the database. For each vegetable crop, the different problems that occur in that crop are stored along with their best possible solutions and remedies, and this information is used to give answers to vegetable growers and farmers.
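Purely as an illustration of this lookup (the actual system is implemented in PHP with a MySQL database), the following sketch uses Python's built-in sqlite3 module as a stand-in, with hypothetical table and column names and placeholder remedy text.

# Illustrative diagnostic lookup: map a reported crop problem to the
# stored remedy. Table, columns, and remedy text are hypothetical; the
# actual system uses PHP with a MySQL database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE diagnostics (
                crop TEXT, problem TEXT, remedy TEXT)""")
db.executemany(
    "INSERT INTO diagnostics VALUES (?, ?, ?)",
    [("Tomato", "leaf curl", "remedy text supplied by experts"),
     ("Brinjal", "shoot and fruit borer", "remedy text supplied by experts")],
)

def diagnose(crop, problem):
    row = db.execute(
        "SELECT remedy FROM diagnostics WHERE crop = ? AND problem = ?",
        (crop, problem)).fetchone()
    return row[0] if row else "Refer the query to an expert."

print(diagnose("Tomato", "leaf curl"))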


4 Conclusion

In this paper we make an effort to improve the utilization and performance of agricultural technology by exploiting recent progress in information technology (the Internet) and developing a decision support system for production technologies and marketing of vegetables. The system may work as an aid to farmers and experts by furnishing knowledge on various aspects of vegetable production; however, it will not replace the experts. The vegetable sector suffers from a lack of good-quality planting material, low use of hybrid seeds, and poor farm management. Hence the production technology needs improvement both qualitatively and quantitatively, so that the standard of the vegetable produce can be further improved to meet the requirements of the international market.

In this system, the recent technologies available in the country are collected and incorporated into different modules so that the user can easily access all the modules or the module of his or her interest. The decision support system is available for the following vegetables: tomato, brinjal, chilli, cauliflower, cabbage, bottle gourd, cucumber, sponge gourd, bitter gourd, onion, garlic, okra (bhindi), garden pea, cowpea, and French bean. Information technology will play a pivotal role in agricultural extension activities, and this will definitely help extension workers, as most village panchayats may have a personal computer in the near future. The ICAR is going to connect all the Krishi Vigyan Kendras (KVKs) through a network (VSAT) very soon.

References

[1] Chaudhuri, S., Dayal, U., and Ganti, V. (2001). Database Technology for Decision Support Systems. IEEE Computer, pp. 48–55, December 2001.
[2] Ghosh, S.P. (1999). Research Preparedness for Accelerated Growth of Horticulture in India. Reproduced from J. Appl. Hort., 1(1):64–69.
[3] Krishna Reddy, P. (2002). A Novel Framework for an Information Technology Based Agricultural Information Dissemination System to Improve Crop Productivity. Proceedings of the 27th Convention of the Indian Agricultural Universities Association, December 9–11, 2002, Hyderabad, India, pp. 437–459. Acharya N.G. Ranga Agricultural University Press.
[4] Agarwal, P.K. (1999). Building India's National Internet Backbone. Communications of the ACM, 42(6), June 1999.
[5] Indian Council of Agricultural Research website, http://www.nic.in/icar/


Proceedings of the International Conference on Web Sciences

ICWS-2009 January 10th and 11th, 2009 Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA

Fault Tolerant AODV Routing Protocol in Wireless Mesh Networks

V. Srikanth, Department of IST, KLCE, Vaddeswaram ([email protected])
T. Sai Kiran, Department of IST, KLCE, Vaddeswaram ([email protected])
A. Chenchu Jeevan, Department of IST, KLCE, Vaddeswaram ([email protected])
S. Suresh Babu, Department of IST, KLCE, Vaddeswaram ([email protected])

Abstract

Wireless Mesh Networks (WMNs) are believed to be a highly promising technology and will play an increasingly important role in future generations of wireless mobile networks. In any network, finding the destination node is a fundamental task, and it is achieved by various routing protocols. AODV (Ad-hoc On-demand Distance Vector) is one of the most widely used routing protocols and is currently undergoing extensive research and development.

The AODV routing protocol is efficient in establishing a path to the destination node, but when a link on the path crashes or breaks, the protocol takes considerable time to find another efficient path to the destination.

We extend AODV to resolve this problem by making use of sequence numbers and hop counts, which increases the efficiency of the protocol. This paper also explains how the use of sequence numbers has additional advantages in finding the shortest path to the destination node.

1 Introduction

Wireless mesh networks (WMNs) have emerged as a key technology for next-generation wireless networking [1]. WMNs are characterized by dynamic self-organization, self-configuration, and self-healing, which enable quick deployment, easy maintenance, low cost, high scalability, and reliable services, as well as enhanced network capacity, connectivity, and resilience.

Routing plays an important role in any type of network. The main task of routing protocols is path selection between the source node and the destination node. This has to be done reliably, quickly, and with minimal overhead. In general, routing protocols can be classified into topology-based and position-based routing protocols [2]. Topology-based routing protocols select paths based on topological information, such as links between nodes, while position-based routing protocols select paths based on geographical information using geometrical algorithms; there are also routing protocols that combine the two concepts. Topology-based routing protocols are further distinguished into reactive, proactive, and hybrid routing protocols. Reactive protocols, e.g. AODV and DSR, compute a route only when it is needed; this reduces the control overhead but introduces latency for the first packet to be sent, due to the time needed for the on-demand route setup. In proactive routing protocols, e.g. OLSR, every node knows a route to every other node at all times; there is no latency, but permanent maintenance of unused routes increases the control overhead. Hybrid routing protocols try to combine the advantages of both philosophies: proactive routing is used for nearby nodes or often-used paths, while reactive routing is used for more distant nodes or less often used paths.

2 AODV Protocol

AODV is a very popular routing protocol. It is a reactive routing protocol. Routes are set up on demand, and only active routes are maintained [3]. This reduces the routing overhead, but introduces some initial latency due to the on-demand route setup.

AODV uses a simple request–reply mechanism for the discovery of routes. It can use hello messages for connectivity information and signals link breaks on active routes with error messages. Every routing table entry has an associated timeout as well as a sequence number. The use of sequence numbers allows outdated data to be detected, so that only the most current available routing information is used. This ensures freedom from routing loops and avoids problems known from classical distance-vector protocols, such as counting to infinity. When a source node wants to send data packets to a destination node but does not have a route to the destination in its routing table, it must perform a route discovery; the data packets are buffered during the route discovery.

2.1 Working of AODV Protocol

2.1.1 Broadcasting RREQ Packet

The source node broadcasts a route request (RREQ) throughout the network. In addition to several flags, an RREQ packet contains the hop count, an RREQ identifier (RREQ ID), the destination IP address, the destination sequence number, the originator IP address, and the originator sequence number. The hop count gives the number of hops the RREQ has traveled so far. The RREQ ID combined with the originator IP address uniquely identifies a route request; this is used to ensure that a node rebroadcasts a route request only once, even if it receives the RREQ several times from its neighbors, in order to avoid broadcast storms.

The Destination Sequence Number field in the RREQ message is the last known destination sequence number for this destination and is copied from the Destination Sequence Number field in the routing table of the source node [4]. If no sequence number is known, the unknown sequence number (U) flag MUST be set. The Originator Sequence Number in the RREQ message is the node's own sequence number, which is incremented prior to insertion in the RREQ. The RREQ ID field is incremented by one from the last RREQ ID used by the current node; each node maintains only one RREQ ID. The Hop Count field is set to zero. Before broadcasting the RREQ, the originating node buffers the RREQ ID and the Originator IP address (its own address) of the RREQ for PATH_DISCOVERY_TIME, so that when the node receives the packet again from its neighbors, it does not reprocess and re-forward it. The originating node often expects bidirectional communication with the destination node; in this case a two-way path must be discovered between destination and source, so the gratuitous flag (G) in the RREQ packet must be set. This indicates that the destination must send an RREP packet to the source after a route is discovered.
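The RREQ contents described above can be summarized as a simple data structure; the sketch below is illustrative only and does not reflect the on-the-wire packet format.

# Illustrative container for the RREQ fields discussed above; this is a
# conceptual sketch, not the byte-level AODV packet format.
from dataclasses import dataclass

@dataclass
class RREQ:
    rreq_id: int        # incremented for every new request by the originator
    hop_count: int      # hops traveled so far, starts at 0
    dest_addr: str
    dest_seq_num: int   # last known destination sequence number
    unknown_seq: bool   # 'U' flag when no sequence number is known
    orig_addr: str
    orig_seq_num: int   # originator's own sequence number
    gratuitous: bool    # 'G' flag for bidirectional communication

# (rreq_id, orig_addr) uniquely identifies a request, so a node that has
# already seen this pair simply drops the rebroadcast copy.
seen = set()

def already_processed(rreq: RREQ) -> bool:
    key = (rreq.rreq_id, rreq.orig_addr)
    if key in seen:
        return True
    seen.add(key)
    return False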


When an RREQ packet is received by an intermediate node, the node checks the source sequence number in the RREQ packet against the sequence number in its own routing table. If the sequence number in the RREQ packet is greater than the one in the routing table, the intermediate node recognizes that a fresh route is required by the source; otherwise the packet is discarded. If this condition is satisfied, the intermediate node checks its routing table for a valid path to the destination. If such a path exists, it checks the gratuitous flag in the RREQ; if the G flag is set, an RREP packet is sent by the intermediate node to both the source node and the destination. When there is no path to the destination from the intermediate node, or when a link in the active route breaks, an RERR packet is sent to the source node indicating that the destination cannot be reached. The source node can restart the discovery process if it still needs a route; the source node or an intermediate node can rebuild the route by sending out a route request.

2.1.2 Handling RERR Packet

When an RERR is received by the source, the source retries by rebroadcasting the RREQ message, subject to the following conditions:

• The source node should not broadcast more than RREQ_RATELIMIT RREQ messages per second.

• A source node that generates an RREQ message waits NET_TRAVERSAL_TIME milliseconds to receive any control message regarding the route.

• If no control message regarding the path is received, the source node broadcasts the RREQ message again (second try), and it retries at most RREQ_RETRIES times.

• At each new attempt the RREQ ID must be incremented.

• After sending an RREQ packet, the source node buffers data packets in first-in, first-out (FIFO) order.

• To reduce congestion in the network, repeated attempts by a source node use binary exponential backoff. The first time a source node broadcasts an RREQ, it waits NET_TRAVERSAL_TIME milliseconds for the reception of an RREP. If an RREP is not received within that time, the source node sends a new RREQ; when calculating the time to wait for the RREP after sending the second RREQ, the source node must wait 2 * NET_TRAVERSAL_TIME milliseconds, and so on, as shown in the sketch below.
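The retry timing in the last point can be written down directly; in the following sketch the constants are example values, not values mandated by the protocol specification.

# Illustrative binary exponential backoff for repeated route requests.
# The constants are example values chosen for the sketch.
NET_TRAVERSAL_TIME_MS = 2800
RREQ_RETRIES = 2

def rrep_wait_time(attempt: int) -> int:
    """Milliseconds to wait for an RREP after the given RREQ attempt
    (attempt 0 is the first broadcast)."""
    return NET_TRAVERSAL_TIME_MS * (2 ** attempt)

for attempt in range(RREQ_RETRIES + 1):
    print(f"attempt {attempt + 1}: wait {rrep_wait_time(attempt)} ms for an RREP")
# attempt 1: 2800 ms, attempt 2: 5600 ms, attempt 3: 11200 ms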

When a node detects that it cannot communicate with one of its neighbors, it looks in its routing table for routes that use that neighbor as the next hop and marks them as invalid; it then sends out an RERR listing the neighbor and the invalidated routes. When such route errors and link breakages do not occur, the RREQ packets continue towards the destination, and the same procedure is followed until an RREQ packet reaches it.

When a new route to the destination is discovered, the destination sequence number is updated to the maximum of the current sequence number and the destination sequence number in the RREQ packet. The destination sequence number is incremented by one immediately before the RREP packet is sent. When the destination increments its sequence number, it must do so by treating the sequence number value as an unsigned number. To accomplish sequence number rollover, if the sequence number has already been assigned the largest possible number representable as a 32-bit unsigned integer (i.e., 4294967295) [7], then when it is incremented it takes the value zero (0). On the other hand, if the sequence number currently has the value 2147483647, which is the largest possible positive integer if 2's-complement arithmetic is in use with 32-bit integers, the next value will be 2147483648, which is the most negative possible integer in the same numbering system. The representation of negative numbers is not relevant to the increment of AODV sequence numbers; it matters only when the result of comparing two AODV sequence numbers is interpreted. After setting the destination sequence number, the destination sends the RREP packet to the source through the intermediate nodes. Each intermediate node compares the destination sequence number in the RREP packet with the destination sequence number in its routing table and updates its routing table entry with the sequence number in the RREP packet. Finally, a valid path is established from source to destination.
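The rollover and comparison rules described above amount to unsigned 32-bit increment combined with signed 32-bit comparison; the following illustrative sketch reproduces the two numerical examples given in the text.

# Sequence-number arithmetic as described above: increments wrap around
# as unsigned 32-bit values, while freshness comparisons use signed
# 32-bit interpretation so that they survive the wrap.
MASK32 = 0xFFFFFFFF

def increment_seq(seq: int) -> int:
    return (seq + 1) & MASK32          # 4294967295 + 1 -> 0

def to_signed32(value: int) -> int:
    return value - 0x100000000 if value & 0x80000000 else value

def is_newer(seq_a: int, seq_b: int) -> bool:
    """True if seq_a is fresher than seq_b under signed 32-bit comparison."""
    return to_signed32((seq_a - seq_b) & MASK32) > 0

print(increment_seq(4294967295))   # 0
print(increment_seq(2147483647))   # 2147483648
print(is_newer(0, 4294967295))     # True: 0 is the successor of the maximum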

2.2 Drawback

The AODV routing protocol is efficient in establishing a path to the destination node. However, when a link on the path crashes or breaks, this protocol requires the intermediate node to send an RERR packet to the source, and the source again broadcasts an RREQ packet to find another new route to the destination. This process takes considerable time to find another efficient path to the destination.

Another drawback of the AODV protocol is that whenever there is more traffic on the shortest path, the RREQ packet chooses another path to the destination, so the resulting path is not the shortest one.

3 Proposed Protocol

In the AODV protocol explained above, the destination accepts only the first RREQ packet that reaches it and ignores all later copies. This causes the two drawbacks above. To eliminate these drawbacks and make AODV more efficient in case of link failure, we extend the protocol with the following changes.

• The idea is to introduce a new field, 2nd route hop, in every node's routing table.

• As in the protocol above, the RREQ packets are broadcast by the source node in all directions [7].

• Unlike the protocol above, the destination receives all RREQ packets and generates RREP packets for all of them, incrementing the destination sequence number only for the first RREQ packet.

• All the RREP packets generated by the destination are received by the intermediate nodes. An intermediate node that receives more than one RREP packet updates its routing table fields (next hop, 2nd route hop) with the paths having the least and the next-least hop count values, respectively.

• So when a link breaks at an intermediate node, the alternative path to the destination can be found from that intermediate node itself, without sending a route error to the source node. This resolves the two drawbacks above.


• The following example illustrates the proposed idea.

Actual Routing table for node 2:

Node   Next hop   Seq #   Hop count
1      1          120     1
3      3          136     1
4      3          140     2
5      5          115     1
6      5          141     2

If the link from 2 to 5 crashes, an RERR is sent to the source node and an RREQ is generated again. This can be eliminated by using the following routing table.

Proposed Routing table for node 2:

Node   Next hop   Seq #   Hop count   2nd route hop
1      1          120     1           --
3      3          136     1           --
4      3          140     2           --
5      5          115     1           --
6      5          141     2           3

Whenever the link from 2 to 5 crashes, we choose the alternative path, i.e. from 2 to 3, 3 to 4, and 4 to 6, to reach the destination without broadcasting RREQ packets from the source again.
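The fallback behaviour of the proposed routing table can be sketched as follows, using the example entries for node 2 above; this is an illustrative sketch of the idea, not the authors' implementation.

# Sketch of the proposed routing table for node 2: each destination keeps
# a primary next hop and, where known, a second-route hop used when the
# link to the primary next hop breaks.
routing_table = {
    # dest: {"next_hop", "seq", "hop_count", "second_hop"}
    1: {"next_hop": 1, "seq": 120, "hop_count": 1, "second_hop": None},
    3: {"next_hop": 3, "seq": 136, "hop_count": 1, "second_hop": None},
    4: {"next_hop": 3, "seq": 140, "hop_count": 2, "second_hop": None},
    5: {"next_hop": 5, "seq": 115, "hop_count": 1, "second_hop": None},
    6: {"next_hop": 5, "seq": 141, "hop_count": 2, "second_hop": 3},
}

broken_links = {(2, 5)}   # the link from node 2 to node 5 has crashed

def choose_next_hop(dest: int):
    """Return the hop to forward to, falling back to the second route
    instead of sending an RERR back to the source."""
    entry = routing_table[dest]
    if (2, entry["next_hop"]) not in broken_links:
        return entry["next_hop"]
    return entry["second_hop"]    # None would still mean 'send an RERR'

print(choose_next_hop(6))   # -> 3: reach 6 via 3 -> 4 -> 6 without a new RREQ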

4 Conclusion

The fundamental goal of any routing algorithm is to find the destination in the least possible time. The proposed protocol effectively finds the destination in very little time even when a link breaks along the active path. We also try to store more than two paths by constructing a dynamic routing table.

References

[1] Yan Zhang, Jijun Luo, and Honglin Hu, Wireless Mesh Networking, Auerbach Publications.
[2] Luke Klein-Berndt, A Quick Guide to AODV Routing, NIST, US Dept. of Commerce.
[3] Kullberg, "Performance of the Ad hoc On-demand Distance Vector Routing Protocol".
[4] Manel Zapata, Secure Ad hoc On-Demand Distance Vector (SAODV) Routing, Internet Draft (September 2006), draft-guerrero-manet-saodv-06.txt.
[5] C. Perkins (Nokia Research Center) and S. Das (University of Cincinnati), Ad hoc On-Demand Distance Vector (AODV) Routing.
[6] IETF MANET Working Group AODV Draft, http://www.ietf.org/internet-drafts/draft-ietf-manet-aodv-08.txt
[7] Perkins, C.E., "Ad-Hoc Networking," Addison-Wesley Professional, Reading, MA, 2001.


Author Index

A

Abraham, Gigi A., 495 Ahmad, Nesar, 344 Ahmad, Rehan, 209 Ahmad, S. Kashif, 260 Ahmad, Tauseef, 209 Akhtar, Nadeem, 344 Anitha, S., 423 Anuradha, T., 76 Anusha, P. Sai, 480 Arunakumari, D., 76

B Babu, D. Ramesh, 283 Babu, G. Anjan, 308 Babu, S. Suresh, 500

Balaji, S., 313, 473 Balasubramanian, V., 8 Baliarsingh, R., 239 Basal, G.P., 486 Bawane, N.G., 273, 464 Bhanu, J. Sasi, 13 Bhargavi, R. Lakshmi, 480 Bhattacharya, M., 338 Bindu, C. Shoba, 195, 351 Bisht, Kumar Saurabh, 46

C Chakraborty, Partha Sarathi, 295 Chand, K. Ram, 435 Chandrasekharam, R., 407 Chaudhary, Sanjay, 46 Chhaware, Shaikh Phiroj, 429 Choudhury, Shubham Roy, 203

D Damodaram, A., 366 Das, Pradip K., 338 Dasari, Praveen, 117 Dass, B., 495 David, K., 8 Deiv, D. Shakina, 338 Dumbre, Swapnili A., 461

G Gadiraju, N.V.G. Sirisha, 39 Garg, Nipur, 57 Gawate, Sachin P., 273 Govardhan, A., 67 Gupta, Surendra, 186

H

Hande, K.N., 109 Hari, V.M.K., 29 Harinee, N.U., 423 Hong, Yan, 301

I Iqbal, Arshad, 251

J Jasutkar, R.W., 461 Jeevan, A. Chenchu, 500 Jena, Gunamani, 239 Jiwani, Moiaz, 203 Joglekar, Nilesh, 273 Jonna, Shanti Priyadarshini, 148 Juneja, Mamta, 359 Jyothi, Ch. Ratna, 383

K

Karmore, Pravin Y., 461 Kasana, Robin, 266 Khaliq, Mohammed Abdul, 117 Khan, M. Siddique, 209 Khare, A., 495

Kiran, K. Ravi, 366 Kiran, K.V.D., 283 Kiran, P. Sai, 226 Kiran, T. Sai, 500 Kota, Rukmini Ravali, 29 Kothari, Amit D., 71 Krishna, T.V. Sai, 139 Krishna, V. Phani, 415 Krishna, Y. Rama, 473 Krishnan, R., 216 Kumar, K. Sarat, 443 Kumar, K. Hemantha, 333 Kumar, V. Kiran, 3 Kumaravelan, G., 8

L

Lakshmi, D. Rajya, 366

M Mallik, Latesh G., 429 Mangla, Monika, 52 N

Niyaz, Quamar, 260


O Ong, J.T., 301

P

P.K., Chande, 486 Padaganur, K., 234 Pagoti, A.R. Ebhendra, 117 Patel, Dharmendra T., 71 Patra, Manas Ranjan, 203 Prakash, V. Chandra, 13 Pramod, Dhanya, 216 Prasad, E.V., 101, 161 Prasad, G.M.V., 239 Prasad, Lalji, 180 Prasad, V. Kamakshi, 156 Praveena, N., 313 Pujitha, M., 480

Q Qadeer, Mohammad A., 209, 251, 260, 266

R

Radhika, P., 21 Rai, A.K., 495 Rajeshwari, 234 Raju, G.V. Padma, 39 Ramadevi, Y., 383 Ramesh, N.V.K., 301 Ramesh, R., 327 Ramu, Y., 83 Rani, T. Sudha, 139 Rao, G. Gowriswara, 351 Rao, H. D. Narayana, 443 Rao, S. Vijaya Bhaskara, 443 Rao, B. Mouleswara, 313 Rao, B. Thirumala, 399 Rao, B.V. Subba, 452 Rao, D.N. Mallikarjuna, 156 Rao, G. Rama Koteswara, 435 Rao, G. Sambasiva, 101 Rao, G. Siva Nageswara, 435 Rao, K. Rajasekhara, 3, 13, 415 Rao, K. Thirupathi, 283, 399 Rao, K.V. Sambasiva, 452 Rao, K.V.S.N. Rama, 203 Rao, P. Srinivas, 90 Rao, S. Srinivasa, 283 Rao, S. Vijaya Bhaskara, 301

Rao, S.N. Tirumala, 101 Ravi, K.S., 301, 376, 473 Reddy, C.H. Pradeep, 327 Reddy, K. Shyam Sunder, 195 Reddy, M. Babu, 407 Reddy, K. Krishna, 376 Reddy, K. Sudheer, 125 Reddy, L.S.S., 393, 399 Reddy, V. Krishna, 399 Reddy, P. Ashok, 125 Reddy, V. Venu Gopalal, 376 Renuga, R., 423 Riyazuddiny, Y. Md., 376

S Sadasivam, Sudha, 423 Saikiran, P., 399 Samand, Vidhya, 180 Santhaiah, 308 Saritha, K., 366 Sarje, A.K., 174 Sastry, J.K.R., 13 Satya, Sridevi P., 322 Satyanarayan, S. Patil, 234 Sayeed, Sarvat, 266 Shaikh, Sadeque Imam, 168 Shanmugam, G., 301 Sheti, Mahendra A., 464 Shuklam, Rajesh K., 486 Singh, Inderjeet, 57 Soma, Ganesh, 148 Somavanshi, Manisha, 216 Soni, Preeti, 57 Sowmya, R., 423 Srikanth, V., 500 Srinivas, G., 29 Srinivas, M., 161 Srinivasulu, D., 327 Sriranjani, B., 423 Suman, M., 76, 480 Supreethi, K.P., 161 Surendrababu, K., 186 Suresh, K., 67, 90 Surywanshi, J.R., 109

T

Tewari, Vandan, 57 Thrimurthy, P., 407


V Varma, G.P.S., 125 Varma, T. Siddartha, 29 Vasumathi, D., 67, 90 Venkateswarlu, N.B., 101 Venkatram, N., 393 Vishnuvardhan, M., 283

Y Yadav, Rashmi, 180 Yarramalle, Srinivas, 322 Yerajana, Rambabu, 174

Z Zahid, Mohammad, 251
