
Page 1: Advances in Computer and Information Sciences and ... · Editor Dr. Tarek Sobh University of Bridgeport School of Engineering 221 University Avenue Bridgeport CT 06604 USA ISBN:

Advances in Computer and Information Sciences and Engineering


Edited by

Tarek Sobh
University of Bridgeport, CT, USA

Advances in Computer and Information Sciences and Engineering


Editor
Dr. Tarek Sobh
University of Bridgeport
School of Engineering
221 University Avenue
Bridgeport CT 06604
USA

ISBN: 978-1-4020-8740-0 e-ISBN: 978-1-4020-8741-7

Library of Congress Control Number: 2008932465

No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com

© 2008 Springer Science+Business Media B.V.


To Nihal, Omar, Haya, Sami and Adam


Contents

1. A New Technique for Unequal-Spaced Channel-Allocation Problem in WDM Transmission System
   A.B.M. Mozzammel Hossain and Md. Saifuddin Faruk

2. An Algorithm to Remove Noise from Audio Signal by Noise Subtraction
   Abdelshakour Abuzneid et al.

3. Support Vector Machines Based Arabic Language Text Classification System: Feature Selection Comparative Study
   Abdelwadood. Moh’d. Mesleh

4. Visual Attention in Foveated Images
   Abulfazl Yavari, H.R. Pourreza

5. Frequency Insensitive Digital Sampler and Its Application to the Electronic Reactive Power Meter
   Adalet N. Abiyev

6. Non-Linear Control Applied to an Electrochemical Process to Remove Cr(VI) from Plating Wastewater
   Regalado-Méndez, A. et al.

7. Semantics for the Specification of Asynchronous Communicating Systems (SACS)
   A.V.S. Rajan et al.

8. Application Multicriteria Decision Analysis on TV Digital
   Ana Karoline Araújo de Castro et al.

9. A Framework for the Development and Testing of Cryptographic Software
   Andrew Burnett, Tom Dowling

10. Transferable Lessons from Biological and Supply Chain Networks to Autonomic Computing
    Ani Calinescu

11. Experiences from an Empirical Study of Programs Code Coverage
    Anna Derezińska

12. A Secure and Efficient Micropayment System
    Anne Nguyen and Xiang Shao

13. An Empirical Investigation of Defect Management in Free/Open Source Software Projects
    Anu Gupta, Ravinder Kumar Singla

14. A Parallel Algorithm that Enumerates all the Cliques in an Undirected Graph
    A. S. Bavan

Preface

Acknowledgements


15. Agent Based Framework for Worm Detection
    El-Menshawy et al.

16. Available Bandwidth Based Congestion Avoidance Scheme for TCP: Modeling and Simulation
    A. O. Oluwatope et al.

17. On the Modeling and Control of the Cartesian Parallel Manipulator
    Ayssam Y. Elkady et al.

18. Resource Allocation in Market-Based Grids Using a History-Based Pricing Mechanism
    Behnaz Pourebrahimi et al.

19. Epistemic Structured Representation for Legal Transcript Analysis
    Tracey Hughes et al.

20. A Dynamic Approach to Software Bug Estimation
    Chuanlei Zhang et al.

21. Soft Biometrical Students Identification Method for e-Learning
    Deniss Kumlander

22. Innovation in Telemedicine: an Expert Medical Information System Based on SOA, Expert Systems and Mobile Computing
    Denivaldo Lopes et al.

23. Service-Enabled Business Processes: Constructing Enterprise Applications – An Evaluation Framework
    Christos K. Georgiadis, Elias Pimenidis

24. Image Enhancement Using Frame Extraction Through Time
    Elliott Coleshill et al.

25. A Comparison of Software for Architectural Simulation of Natural Light
    Evangelos Christakou and Neander Silva

26. Vehicle Recognition Using Curvelet Transform and Thresholding
    Farhad Mohamad Kazemi et al.

27. Vehicle Detection Using a Multi-Agent Vision-Based System
    Saeed Samadi et al.

28. Using Attacks Ontology in Distributed Intrusion Detection System
    F. Abdoli, M. Kahani

29. Predicting Effectively the Pronunciation of Chinese Polyphones by Extracting the Lexical Information
    Feng-Long Huang et al.

30. MiniDMAIC: An Approach for Causal Analysis and Resolution in Software Development Projects
    Márcia G. S. Gonçalves et al.


31. Light Vehicle Event Data Recorder Forensics
    Jeremy S. Daily et al.

32. Research of Network Control Systems with Competing Access to the Transfer Channel
    G.V. Abramov et al.

33. Service-Oriented Context-Awareness and Context-Aware Services
    H. Gümüşkaya, M. V. Nural

34. Autonomous Classification via Self-Formation of Collections in AuInSys
    Hanh H. Pham

35. Grid Computing Implementation in Ad Hoc Networks
    Aksenti Grnarov et al.

36. One-Channel Audio Source Separation of Convolutive Mixture
    Jalal Taghia, Jalil Taghia

37. Extension of Aho-Corasick Algorithm to Detect Injection Attacks
    Jalel Rejeb and Mahalakshmi Srinivasan

38. Use of Computer Vision During the Process of Quality Control in the Classification of Grain
    Rosas Salazar Juan Manuel et al.

39. Theoretical Perspectives for E-Services Acceptance Model
    Kamaljeet Sandhu

40. E-Services Acceptance Model (E-SAM)
    Kamaljeet Sandhu

41. Factors for E-Services System Acceptance: A Multivariate Analysis
    Kamaljeet Sandhu

42. A Qualitative Approach to E-Services System Development
    Kamaljeet Sandhu

43. Construction of Group Rules for VLSI Application
    Byung-Heon Kang et al.

44. Implementation of an Automated Single Camera Object Tracking System Using Frame Differencing and Dynamic Template Matching
    Karan Gupta, Anjali V. Kulkarni

45. A Model for Prevention of Software Piracy Through Secure Distribution
    Vaddadi P. Chandu et al.

46. Performance Enhancement of CAST-128 Algorithm by Modifying Its Function
    Krishnamurthy G.N et al.

47. A Concatenative Synthesis Based Speech Synthesiser for Hindi
    Kshitij Gupta

48. Legibility on a Podcast: Color and Typefaces
    Lennart Strand


49. The Sensing Mechanism and the Response Simulation of the MIS Hydrogen Sensor
    Linfeng Zhang et al.

50. Visual Extrapolation of Linear and Nonlinear Trends: Does the Knowledge of Underlying Trend Type Affect Accuracy and Response Bias?
    Lisa A. Best

51. Resource Discovery and Selection for Large Scale Query Optimization in a Grid Environment
    Mahmoud El Samad et al.

52. Protecting Medical Images with Biometric Information
    Marcelo Fornazin et al.

53. Power Efficiency Profile Evaluation for Wireless Communication Applications
    Marius Marcu et al.

54. Closing the Gap Between Enterprise Models and Service-Oriented Architectures
    Martin Juhrisch, Werner Esswein

55. Object Normalization as Contribution to the Area of Formal Methods of Object-Oriented Database Design
    Vojtěch Merunka, Martin Molhanec

56. A Novel Security Schema for Distributed File Systems
    Bager Zarei et al.

57. A Fingerprint Method for Scientific Data Verification
    Micah Altman

58. Mobile Technologies in Requirements Engineering
    Gunnar Kurtz et al.

59. Unsupervised Color Textured Image Segmentation Using Cluster Ensembles and MRF Model
    Mofakharul Islam et al.

60. An Efficient Storage and Retrieval Technique for Documents Using Symantec Document Segmentation (SDS) Approach
    Mohammad A. ALGhalayini and ELQasem ALNemah

61. A New Persian/Arabic Text Steganography Using “La” Word
    Mohammad Shirali-Shahreza

62. GTRSSN: Gaussian Trust and Reputation System for Sensor Networks
    Mohammad Momani, Subhash Challa

63. Fuzzy Round Robin CPU Scheduling (FRRCS) Algorithm
    M.H. Zahedi et al.

64. Fuzzy Expert System In Determining Hadith Validity
    M. Ghazizadeh et al.

65. An Investigation into the Performance of General Sorting on Graphics Processing Units
    Nick Pilkington, Barry Irwin


66. An Analysis of Effort Variance in Software Maintenance Projects
    Nita Sarang, Mukund A Sanglikar

67. Design of Adaptive Neural Network Frequency Controller for Performance Improvement of an Isolated Thermal Power System
    Ognjen Kuljaca et al.

68. Comparing PMBOK and Agile Project Management Software Development Processes
    P. Fitsilis

69. An Expert System for Diagnosing Heavy-Duty Diesel Engine Faults
    Peter Nabende and Tom Wanyama

70. Interactive Visualization of Data-Oriented XML Documents
    Petr Chmelar et al.

71. Issues in Simulation for Valuing Long-Term Forwards
    Phillip G. Bradford, Alina Olteanu

72. A Model for Mobile Television Applications Based on Verbal Decision Analysis
    Isabelle Tamanini et al.

73. Gene Selection for Predicting Survival Outcomes of Cancer Patients in Microarray Studies
    Tan Q et al.

74. Securing XML Web Services by using a Proxy Web Service Model
    Quratul-ain Mahesar, Asadullah Shah

75. O-Chord: A Method for Locating Relational Data Sources in a P2P Environment
    Raddad Al King et al.

76. Intuitive Interface for the Exploration of Volumetric Datasets
    Rahul Sarkar et al.

77. New Trends in Cryptography by Quantum Concepts
    SGK MURTHY et al.

78. On Use of Operation Semantics for Parallel iSCSI Protocol
    Ranjana Singh, Rekha Singhal

79. BlueCard: Mobile Device-Based Authentication and Profile Exchange
    Riddhiman Ghosh, Mohamed Dekhil

80. Component Based Face Recognition System
    Pavan Kandepet and Roman W. Swiniarski

81. An MDA-Based Generic Framework to Address Various Aspects of Enterprise Architecture
    S. Shervin Ostadzadeh et al.

82. Simulating VHDL in PSpice Software
    Saeid Moslehpour et al.

83. VLSI Implementation of Discrete Wavelet Transform using Systolic Array Architecture
    S. Sankar Sumanth and K.A. Narayanan Kutty


84. Introducing MARF: A Modular Audio Recognition Framework and its Applications for Scientific and Software Engineering Research
    Serguei A. Mokhov

85. TCP/IP Over Bluetooth
    Umar F. Khan et al.

86. Measurement-Based Admission Control for Non-Real-Time Services in Wireless Data Networks
    Show-Shiow Tzeng and Hsin-Yi Lu

87. A Cooperation Mechanism in Agent Organization
    W. Alshabi et al.

88. A Test Taxonomy Applied to the Mechanics of Java Refactorings
    Steve Counsell et al.

89. Classification Techniques with Cooperative Routing for Industrial Wireless Sensor Networks
    Sudhir G. Akojwar, Rajendra M. Patrikar

90. Biometric Approaches of 2D-3D Ear and Face: A Survey
    S. M. S. Islam et al.

91. Performance Model for a Reconfigurable Coprocessor
    Syed S. Rizvi et al.

92. RFID: A New Software Based Solution to Avoid Interference
    Syed S. Rizvi et al.

93. A Software Component Architecture for Adaptive and Predictive Rate Control of Video Streaming
    Taner Arsan, Tuncay Saydam

94. Routing Table Instability in Real-World Ad-Hoc Network Testbed
    Tirthankar Ghosh, Benjamin Pratt

95. Quality Attributes for Embedded Systems
    Trudy Sherman

96. A Mesoscale Simulation of the Morphology of the PEDT/PSS Complex in the Water Dispersion and Thin Film: the Use of the MesoDyn Simulation Code
    T. Kaevand et al.

97. Developing Ontology-Based Framework Using Semantic Grid
    Venkata Krishna. P, Ratika Khetrapal

98. A Tree Based Buyer-Seller Watermarking Protocol
    Vinu V Das

99. A Spatiotemporal Parallel Image Processing on FPGA for Augmented Vision System
    W. Atabany and P. Degenaar


100. Biometrics of Cut Tree Faces
     W. A. Barrett

101. A Survey of Hands-on Assignments and Projects in Undergraduate Computer Architecture Courses
     Xuejun Liang

102. Predicting the Demand for Spectrum Allocation Through Auctions
     Y. B. Reddy

103. Component-Based Project Estimation Issues for Recursive Development
     Yusuf Altunel, Mehmet R. Tolun


Author Index

Subject Index


Preface

This book includes Volume I of the proceedings of the 2007 International Conference on Systems, Computing Sciences and Software Engineering (SCSS). SCSS is part of the International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 07). The proceedings are a set of rigorously reviewed world-class manuscripts presenting the state of international practice in Innovations and Advanced Techniques in Computer and Information Sciences and Engineering.

SCSS 07 was a high-caliber research conference that was conducted online. CISSE 07 received 750 paper submissions and the final program included 406 accepted papers from more than 80 countries, representing six continents. Each paper received at least two reviews, and authors were required to address review comments prior to presentation and publication.

Conducting SCSS 07 online presented a number of unique advantages, as follows:

• All communications between the authors, reviewers, and conference organizing committee were conducted online, which permitted a short six-week period from the paper submission deadline to the beginning of the conference.

• PowerPoint presentations and final paper manuscripts were available to registrants for three weeks prior to the start of the conference.

• The conference platform allowed live presentations by several presenters from different locations, with the audio and PowerPoint slides transmitted to attendees over the internet, even on dial-up connections. Attendees were able to ask both audio and written questions in a chat room format, and presenters could mark up their slides as they deemed fit.

• The live audio presentations were also recorded and distributed to participants, along with the PowerPoint presentations and paper manuscripts, on the conference DVD.

The conference organizers and I are confident that you will find the papers included in this volume interesting and useful. We believe that technology will continue to infuse education, thus enriching the educational experience of both students and teachers.

Tarek M. Sobh, Ph.D., PE
Bridgeport, Connecticut
June 2008


Acknowledgements

The 2007 International Conference on Systems, Computing Sciences and Software Engineering (SCSS) and the resulting proceedings could not have been organized without the assistance of a large number of individuals. SCSS is part of the International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE). CISSE was founded by Professor Khaled Elleithy and myself in 2005, and we set up mechanisms that put it into action. Andrew Rosca wrote the software that allowed conference management and online interaction between the authors and reviewers. Mr. Tudor Rosca managed the online conference presentation system and was instrumental in ensuring that the event met the highest professional standards. I also want to acknowledge the roles played by Sarosh Patel and Ms. Susan Kristie, our technical and administrative support team.

The technical co-sponsorship provided by the Institute of Electrical and Electronics Engineers (IEEE) and the University of Bridgeport is gratefully appreciated. I would like to express my thanks to Prof. Toshio Fukuda, Chair of the International Advisory Committee, and the members of the SCSS Technical Program Committee, including: Abdelaziz AlMulhem, Alex A. Aravind, Ana M. Madureira, Mostafa Aref, Mohamed Dekhil, Julius Dichter, Hamid Mcheick, Hani Hagras, Marian P. Kazmierkowski, Low K.S., Michael Lemmon, Rafa Al-Qutaish, Rodney G. Roberts, Sanjiv Rai, Samir Shah, Shivakumar Sastry, Natalia Romalis, Mohammed Younis, Tommaso Mazza, and Srini Ramaswamy.

The excellent contributions of the authors made this world-class document possible. Each paper received two to four reviews. The reviewers worked tirelessly under a tight schedule and their important work is gratefully appreciated. In particular, I want to acknowledge the contributions of the following individuals: Ahmad Almunayyes, Alexander Alegre, Ali Abu-El Humos, Amitabha Basuary, Andrew Burnett, Ani Calinescu, Antonio José Balloni, Baba Ahmed Isman Chemch Eddine, Charbel Saber, Chirakkal Easwaran, Craig Caulfield, Cristian Craciun, Emil Vassev, Evangelos Christakou, Francisca Márcia Gonçalves, Geneflides Silva, Hanh Pham, Harish CL, Imran Ahmed, Jose Maria Pangilinan, Khaled Elleithy, Leticia Flores, Ligia Chira Cremene, Madiha Hussain, Michael Horie, Miguel Barron-Meza, Mohammad ALGhalayini, Mohammed Abuhelalh, Phillip Bradford, Rafa Al-Qutaish, Rodney G. Roberts, Seppo Sirkemaa, Srinivasa Kumar Devireddy, Stephanie Chua, Steve Counsell, Uma Balaji, Vibhore Jain, Xiaoquan Gao, and Ying-ju Chen.

Tarek M. Sobh, Ph.D., PE
Bridgeport, Connecticut
June 2008


Abstract - Wavelength-division multiplexing (WDM) is currently being deployed in long-haul fiber-optic transmission systems to support multiple high-speed channels and achieve high capacity. It allows information to be transmitted on different channels at different wavelengths. However, four-wave mixing (FWM) is one of the major problems that must be taken into account when designing a high-capacity long-haul WDM transmission system. Recently, unequal-spaced channel-allocation techniques have been studied and analyzed to reduce FWM crosstalk. Finding a solution with the proposed channel-allocation technique requires two parameters: the minimum channel spacing and the number of channels used in the WDM system. To get better results, the minimum channel spacing has to be chosen carefully.

INTRODUCTION

A WDM system, which allows information on various channels to be transmitted at different wavelengths, fully exploits the vast bandwidth provided by optical fiber. If the frequency separation of any two channels of a WDM system is different from that of any other pair of channels, no FWM waves will be generated at any of the channel frequencies, thereby suppressing FWM crosstalk. When three carrier frequencies f1, f2 and f3 co-propagate in a fiber, they produce a fourth wave with frequency f = f1 + f2 - f3. In addition, two signals (f1, f2) mix to produce two new frequencies: f3 = 2f1 - f2 and f4 = 2f2 - f1.

Two techniques can determine (and mathematically prove) the total number of FWM signals falling onto the operating band and onto each channel for unequal-spaced WDM systems: (1) the Frequency Difference Triangle (FDT) and (2) the Frequency Difference Square (FDS). These techniques are also applicable to equal-spaced systems. Knowing these two numbers, one can adjust the system parameters, such as the minimum channel spacing, in order to reduce the adverse effects of FWM crosstalk and interchannel interference, or avoid assigning channels at the locations with the most severe crosstalk.

A design methodology for channel spacing is presented to satisfy the above requirement. The method is a generalization of what was proposed in the 1950s to reduce the effect of third-order intermodulation interference in radio systems. The use of proper unequal channel spacing keeps FWM waves from coherently interfering with the signals. Nevertheless, the FWM waves are still generated at the expense of the transmitted power, giving rise to pattern-dependent depletion of the channels. In this paper, a “simplified algebraic” framework for finding solutions to the unequal-spaced channel-allocation problem is reported. The proposed algorithms provide a fast and simple alternative for solving the problem.
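The condition stated above (all pairwise channel separations distinct, so that no FWM product lands on a channel) can be checked numerically. The following is a minimal sketch, not the FDT or FDS procedure itself: it assumes channel frequencies are given as integer slot positions, and the function names `has_distinct_separations` and `fwm_hits` are hypothetical names chosen for illustration.

```python
from itertools import combinations


def has_distinct_separations(channels):
    """Return True if every pair of channels has a different separation.

    Assumption: channels are integer slot positions on a frequency grid.
    """
    seps = [b - a for a, b in combinations(sorted(channels), 2)]
    return len(seps) == len(set(seps))


def fwm_hits(channels):
    """Count FWM products f_i + f_j - f_k (k != i, k != j) landing on each channel.

    The case i == j covers the two-signal products 2*f_i - f_k.
    """
    chans = list(channels)
    hits = {f: 0 for f in chans}
    count = len(chans)
    for i in range(count):
        for j in range(i, count):
            for k in range(count):
                if k == i or k == j:
                    continue
                product = chans[i] + chans[j] - chans[k]
                if product in hits:
                    hits[product] += 1
    return hits


equal_spaced = [0, 1, 2, 3]     # illustrative equal-spaced allocation
unequal_spaced = [0, 1, 3, 7]   # illustrative allocation with distinct separations

print(has_distinct_separations(equal_spaced), fwm_hits(equal_spaced))
print(has_distinct_separations(unequal_spaced), fwm_hits(unequal_spaced))
```

For the equal-spaced allocation some products fall on channel frequencies, while for the allocation with distinct separations every per-channel count is zero, in line with the statement in the text.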

PROPOSED CHANNEL ALLOCATION TECHNIQUE:

The proposed channel allocation technique is based on the analytical process of FDT and FDS. Finding a solution with this technique requires two parameters: the minimum channel spacing and the number of channels used in the WDM system. To get better results, the minimum channel spacing has to be chosen carefully. Although the unequal-spaced channel-allocation techniques greatly reduce FWM crosstalk, the number of unequal-spaced WDM channels supported is always less than that of conventional equal-spaced WDM systems when the operating optical bandwidth and the minimum channel spacing of both kinds of systems are the same [1]. In other words, the minimum channel spacing in an unequal-spaced system has to be reduced, and is thus less than the channel spacing in an equal-spaced system, when the same number of channels is accommodated in a fixed operating bandwidth. The effect of interchannel interference gets worse as the minimum channel spacing decreases [2]-[5], so to limit that interference the minimum channel spacing must be large enough. Suppressing FWM crosstalk and limiting interchannel interference therefore counteract each other, and a good balance between them is needed when designing an unequal-spaced WDM system with a fixed operating bandwidth and a pre-determined number of channels. As a result, some FWM crosstalk may be unavoidable in order to keep the interchannel interference within an acceptable level [3]. While the effects of interchannel interference and FWM crosstalk in WDM systems have been studied and understood [6], [4], [7], [8], [9], it would be helpful to system designers to have a fast tool to measure the strength of FWM crosstalk in the operating band as well as in each channel, instead of going through complex analyses. In the present work it is found that, to obtain the optimum solution (that is, to keep the FWM crosstalk falling onto each channel to a minimum) with the proposed algorithm, the following should be kept in mind:

01. The minimum channel spacing for N = 2P channels is as follows:

Minimum channel spacing $n = 2^{p} + p \pm q$ …….. (1)

where $p \geq 2$ and $q = -2, -1, 0, 1, 2, 3, \ldots$

02. The minimum channel spacing n = (3P – 1)/2, the number of unequal-spaced WDM channels N = P+1, and the total number of slots occupied by these channels S = P(2P-1) are constructed algebraically for a given prime number (P) [1].

03. Many other solutions can be obtained by adjusting the minimum channel spacing (n) and the number of channels (N) used in the WDM system.

A.B.M. Mozzammel Hossain and Md. Saifuddin Faruk

Dept. of Electrical & Electronic Engineering, Dhaka University of Engineering & Technology, Gazipur, Bangladesh

T. Sobh (ed.), Advances in Computer and Information Sciences and Engineering, 1–4. © Springer Science+Business Media B.V. 2008

A New Technique for Unequal-Spaced Channel-Allocation Problem in WDM Transmission System


ALLOCATION ALGORITHMS

First choose the number of channels (N) to be used in the WDM system. Then choose an arbitrary minimum channel spacing (n). The maximum channel spacing is M = N + n - 2. Starting with 0, the unequal channel spacing sequence is 0, n, n+1, n+2, n+3, ..., n+N-3, n+N-2. If N > 6 or n > 6 (as in examples 10 to 18 of Table-1), divide the spacing sequence into two equal sets. The second set contains the m smallest non-zero spacings in sequential order, where m = N/2. The first set contains 0 followed by the remaining spacings in sequential order. Otherwise, if N ≤ 6 and n ≤ 6, keep the spacing sequence as a single set. The channel slots are then obtained by accumulating the spacings of the first set followed by those of the second set.

EXAMPLE:

In the first case, let N = 8 and n = 10; then M = 8 + 10 - 2 = 16 and m = 8/2 = 4. The unequal channel spacing sequence is 0, 10, 11, 12, 13, 14, 15, 16. The elements of the first set are [0,14,15,16] and the elements of the second set are [10,11,12,13].

The resultant unequal channels for WDM are as follows:

n1 = 0 + 0 = 0, n2 = n1 + 14 = 14, n3 = n2 + 15 = 29, n4 = n3 + 16 = 45, n5 = n4 + 10 = 55, n6 = n5 + 11 = 66, n7 = n6 + 12 = 78, n8 = n7 + 13 = 91.

In the second case, let N = 4 and n = 4; then M = 4 + 4 - 2 = 6. The unequal channel spacing sequence is 0, 4, 5, 6.

The resultant unequal channels for WDM are as follows:

n1 = 0 + 0 = 0, n2 = n1 + 4 = 4, n3 = n2 + 5 = 9, n4 = n3 + 6 = 15.
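The allocation procedure and the two worked examples above can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the authors' code: the function name `allocate_channels` is hypothetical, the split rule is taken as "N > 6 or n > 6" because rows of Table-1 such as N = 6, n = 7 and N = 8, n = 4 are split into two sets, and integer division is assumed for m = N/2 when N is odd.

```python
def allocate_channels(N, n):
    """Return the channel slot vector for N channels and minimum spacing n.

    Assumption: the spacing sequence is split into two sets when
    N > 6 or n > 6 (inferred from the rows of Table-1).
    """
    if N < 2 or n < 1:
        raise ValueError("need N >= 2 channels and a minimum spacing n >= 1")
    M = N + n - 2                        # maximum channel spacing
    spacings = list(range(n, M + 1))     # n, n+1, ..., n+N-2 (N-1 values)
    if N > 6 or n > 6:
        m = N // 2                       # size of the second set
        # first set: the largest spacings; second set: the m smallest ones
        ordered = spacings[m:] + spacings[:m]
    else:
        ordered = spacings               # a single set, in sequential order
    slots = [0]
    for spacing in ordered:              # accumulate spacings into slot positions
        slots.append(slots[-1] + spacing)
    return slots


if __name__ == "__main__":
    print(allocate_channels(8, 10))
    print(allocate_channels(4, 4))
```

Calling the function with the parameters of the two cases reproduces the slot vectors derived by hand, which makes it easy to try other values of N and n.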

Table-1
Computation of slot vector n by the proposed algorithm

Example  N   n   M   S    An example of slot vector n
01       3   1   2   3    [0,1,3]
02       3   3   4   7    [0,3,7]
03       3   4   5   9    [0,4,9]
04       4   2   4   9    [0,2,5,9]
05       4   3   5   12   [0,3,7,12]
06       4   4   6   15   [0,4,9,15]
07       4   5   7   18   [0,5,11,18]
08       6   5   9   35   [0,5,11,18,26,35]
09       6   6   10  40   [0,6,13,21,30,40]
10       6   7   11  45   [0,10,21,28,36,45]
11       6   8   12  50   [0,11,23,31,40,50]
12       8   1   7   28   [0,5,11,18,19,21,24,28]
13       8   4   10  49   [0,8,17,27,31,36,42,49]
14       8   10  16  91   [0,14,29,45,55,66,78,91]
15       8   12  18  105  [0,16,33,51,63,76,90,105]
16       10  13  21  153  [0,18,37,57,78,91,105,120,136,153]
17       12  7   17  132  [0,13,27,42,58,75,82,90,99,109,120,132]
18       16  20  34  405  [0,28,57,87,118,150,183,217,237,258,280,303,327,352,378,405]


Table-1.1
Number of FWM products falling onto each channel of the equal-spaced and unequal-spaced WDM systems with N = 3 listed in Table-1

Channel  Equal spaced [0,1,2]  Unequal spaced: Example 1  Example 2  Example 3
N1       1                     0                          0          0
N2       1                     0                          0          0
N3       1                     0                          0          0
Total    3                     0                          0          0

Table-1.2
Number of FWM products falling onto each channel of the equal-spaced and unequal-spaced WDM systems with N = 4 listed in Table-1

Channel  Equal spaced [0,2,4,6]  Unequal spaced: Example 4  Example 5  Example 6  Example 7
N1       2                       0                          0          0          0
N2       3                       0                          0          0          0
N3       3                       0                          0          0          0
N4       2                       0                          0          0          0
Total    10                      0                          0          0          0

Table-1.3
Number of FWM products falling onto each channel of the equal-spaced and unequal-spaced WDM systems with N = 6 listed in Table-1

Channel  Equal spaced [0,5,10,15,20,25]  Unequal spaced: Example 8  Example 9  Example 10
N1       6                               0                          0          0
N2       8                               0                          0          0
N3       9                               0                          0          0
N4       9                               0                          0          0
N5       8                               0                          0          0
N6       6                               0                          0          0
Total    46                              0                          0          0

Table-1.4
Number of FWM products falling onto each channel of the equal-spaced and unequal-spaced WDM systems with N = 8 listed in Table-1

Channel  Equal spaced [0,5,10,15,20,25,30,35]  Unequal spaced: Example 13  Example 14  Example 15
N1       12                                    0                           0           0
N2       15                                    1                           0           0
N3       17                                    1                           0           0
N4       18                                    1                           0           0
N5       18                                    0                           0           0
N6       17                                    1                           0           0
N7       15                                    0                           0           0
N8       12                                    0                           0           0
Total    124                                   4                           0           0

Table-1.5
Number of FWM products falling onto each channel of the equal-spaced and unequal-spaced WDM systems with N = 10 and N = 12 listed in Table-1 (× marks channels that do not exist in the 10-channel systems)

Channel  Equal spaced [0,5,10,15,20,25,30,35,40,45]  Unequal spaced: Example 16  Example 17
N1       20                                          0                           2
N2       24                                          0                           1
N3       27                                          0                           2
N4       29                                          1                           3
N5       30                                          0                           2
N6       30                                          0                           4
N7       29                                          1                           2
N8       27                                          0                           2
N9       24                                          0                           2
N10      20                                          1                           1
N11      ×                                           ×                           1
N12      ×                                           ×                           2
Total    260                                         3                           24
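The per-channel counts in Tables 1.1 to 1.5 can be checked with a short brute-force sketch. It assumes the usual counting convention, which is not spelled out in the text: an FWM product sits at slot f_i + f_j - f_k with k different from i and j, the pair (i, j) is unordered, and i = j is allowed. The function name `fwm_products_per_channel` is hypothetical. Under this assumption the equal-spaced columns for N = 3 and N = 4 and the Example 13 column are reproduced.

```python
def fwm_products_per_channel(slots):
    """Count FWM products f_i + f_j - f_k (k != i, k != j) landing on each channel.

    Assumed convention: (i, j) is an unordered pair and i == j is allowed.
    Returns a list with one count per channel, in the order of `slots`.
    """
    position = {f: idx for idx, f in enumerate(slots)}
    counts = [0] * len(slots)
    for i in range(len(slots)):
        for j in range(i, len(slots)):       # unordered pair, i == j allowed
            for k in range(len(slots)):
                if k == i or k == j:
                    continue
                product = slots[i] + slots[j] - slots[k]
                if product in position:      # the product falls onto a channel
                    counts[position[product]] += 1
    return counts


if __name__ == "__main__":
    for vector in ([0, 1, 2], [0, 2, 4, 6], [0, 4, 9, 15]):
        per_channel = fwm_products_per_channel(vector)
        print(vector, per_channel, "total:", sum(per_channel))
```

The triple loop costs on the order of N^3 operations, which is negligible for the channel counts considered here.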


Table-2
Some results obtained from the proposed constructions, where P is a prime number, N is the number of unequal-spaced WDM channels, S is the total number of slots occupied by these channels, n is the minimum channel separation, M is the maximum channel separation and the vector n is the slot vector.

P   N   S    n   M   FWM products  An example of slot vector n
3   4   15   4   6   0             [0,4,9,15]
5   6   45   7   11  0             [0,10,21,28,36,45]
7   8   91   10  16  0             [0,14,29,45,55,66,78,91]
11  12  231  16  26  3             [0,22,45,69,94,120,136,153,171,190,210,231]
13  14  325  19  31  16            [0,26,53,81,110,140,171,190,210,231,253,276,300,325]
17  18  561  25  41  25            [0,34,69,105,142,180,219,259,300,325,351,378,406,435,465,496,528,561]
19  20  703  28  46  12            [0,38,77,117,158,200,243,287,332,378,406,435,465,496,528,561,595,630,666,703]
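The parameter columns of Table-2 follow directly from the formulas stated for a prime P: N = P + 1, n = (3P - 1)/2, M = N + n - 2 and S = P(2P - 1). The sketch below, with the hypothetical name `prime_construction`, only evaluates these formulas; it assumes P is an odd prime so that n is an integer.

```python
def prime_construction(P):
    """Return (N, n, M, S) for an odd prime P, using the formulas in the text."""
    if P < 3 or any(P % d == 0 for d in range(2, int(P ** 0.5) + 1)):
        raise ValueError("P must be an odd prime")
    N = P + 1                  # number of unequal-spaced channels
    n = (3 * P - 1) // 2       # minimum channel spacing (an integer for odd P)
    M = N + n - 2              # maximum channel spacing
    S = P * (2 * P - 1)        # total number of slots occupied
    return N, n, M, S


if __name__ == "__main__":
    for P in (3, 5, 7, 11, 13, 17, 19):
        print(P, prime_construction(P))
```

Because the spacings n, n+1, ..., M are each used once, S is also their sum: (N - 1)(n + M)/2 = P(2P - 1), which agrees with the last entry of every slot vector in the table.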

CONCLUSION

To reduce four-wave mixing (FWM) crosstalk in high-capacity, long-haul, repeaterless WDM transmission systems, the proposed technique is used to find solutions to the unequal-spaced channel-allocation problem. The problem has been formulated algebraically, and a programming code has been provided for a very fast solution. Numerical examples have been given to illustrate the constructions. The construction provides a fast and simple alternative to the existing methods for solving the problem. The proposed technique places no bound on the unequal-spaced channel-allocation problem that can be addressed, but the results are valid only within the limitations noted below.

Although the proposed algorithm is simpler than the other algorithms that exist at present, it has some limitations. For a higher number of channels it is not able to give an optimum solution for the WDM system, although the algorithm described above still gives a solution that keeps FWM crosstalk low. In the present work it is not possible to overcome all of the algorithm's limitations. Moreover, in this paper the algorithm is investigated only by software simulation using FDT and FDS. Overcoming its limitations requires practical implementation facilities instead of an algebraic analytical process, and such facilities are not available for the present research. The authors hope, Allah willing, to overcome all of the algorithm's limitations in future research and to make it a unique channel-allocation algorithm for WDM systems.

REFERENCES
[1] Wing C. Kwong and Guu-Chang Yang, “An algebraic approach to the unequal-spaced channel-allocation problem in WDM lightwave systems,” IEEE Trans. Commun., vol. 45, no. 3, Mar. 1997.
[2] B. Hwang and O. K. Tonguz, “A generalized suboptimum unequally spaced channel allocation technique, Part I: In IM/DD WDM systems,” IEEE Trans. Commun., vol. 46, pp. 1027–1037, Aug. 1998.
[3] O. K. Tonguz and B. Hwang, “A generalized suboptimum unequally spaced channel allocation technique, Part II: In coherent WDM systems,” IEEE Trans. Commun., vol. 46, pp. 1186–1193, Sept. 1998.
[4] L. G. Kazovsky, “Multichannel coherent optical communications systems,” J. Lightwave Technol., vol. LT-5, pp. 1095–1102, Aug. 1987.
[5] M. O. Tanrikulu and O. K. Tonguz, “The impact of crosstalk and phase noise on multichannel coherent lightwave systems,” IEICE Trans. Commun., vol. E78-B, no. 9, pp. 1278–1286, Sept. 1995.
[6] F. Forghieri, R. W. Tkach, A. R. Chraplyvy, and D. Marcuse, “Reduction of four-wave mixing crosstalk in WDM systems using unequally spaced channels,” IEEE Photon. Technol. Lett., vol. 6, no. 6, pp. 754-
[7] F. Forghieri, R. W. Tkach, and A. R. Chraplyvy, “WDM systems with unequally spaced channel,” J. Lightwave Technol., vol. 13, pp. 889–897, May 1995.
[8] K. Inoue, H. Toba, and K. Oda, “Influence of fiber four-wave mixing on multichannel direct detection transmission systems,” J. Lightwave Technol., vol. 10, pp. 350–360, Mar. 1992.
[9] K. Inoue and H. Toba, “Theoretical evaluation of error rate degradation due to fiber four-wave mixing in multichannel FSK heterodyne envelope detection transmissions,” J. Lightwave Technol., vol. 10, pp. 361–366, Mar. 1992.


Abdelshakour Abuzneid, Moeen Uddin, Shaid Ali Naz, Omar Abuzaghleh

Department of Computer Science University of Bridgeport


Abstract- This paper proposes an algorithm for removing noise from an audio signal. Filtering is achieved by recording the pattern of the noise signal. To validate the algorithm, we implemented it in MATLAB 7.0. We investigated the effect of the proposed algorithm on the human voice and compared the results with existing related work, most of which employs simple algorithms that affect the voice signal. The results show that the proposed algorithm makes efficient use of voice over IP communication, with less noise than similar existing algorithms.

I. INTRODUCTION

Whenever we speak into a microphone, it captures not only the waves that come from our mouth but also waves from other sources, such as a fan, a vacuum cleaner or a ringing phone, and combines this noise with the real voice signal. The input signal to the microphone therefore becomes noisy.

To remove this noise we have to know the characteristics of both the noise and the wanted voice signal, so that we can separate the noise from the original voice signal. The three main characteristics of a signal are:

A. Amplitude

This is the strength of the signal. It can be expressed in a number of different ways (for example in volts or decibels). The higher the amplitude, the stronger (louder) the signal. The decibel is a popular measure of signal strength [8].

40 dB normal speech
90 dB lawn mowers
110 dB shotgun blast
120 dB jet engine taking off
120 dB rock concerts

It has been reported that exposure to sounds louder than 90 dB for more than 15 minutes can cause permanent damage to hearing. Our ability to hear high notes is affected. As young babies, we have the ability to hear quite high frequencies. This ability reduces as we age. It can also be affected by too much noise over sustained periods. Ringing in the ears after being exposed to loud noise is an indication that hearing loss may be occurring [8].

B. Frequency

This is the rate of change the signal undergoes every second, expressed in Hertz (Hz), or cycles per second. A 30Hz signal changes thirty times a second. In speech, we also refer to it as the number of vibrations per second. As we speak, the air is forced out of our mouths, being vibrated by our voice box. Men, on average, tend to vibrate the air at a lower rate than women, thus tend to have deeper voices. A cycle is one complete movement of the wave, from its original start position and back to the same point again. The number of cycles (or waves) within one second time interval is called cycles-per-second, or Hertz [8].

T. Sobh (ed.), Advances in Computer and Information Sciences and Engineering, 5–10. © Springer Science+Business Media B.V. 2008

An Algorithm to Remove Noise from Audio Signal by Noise Subtraction


C. Phase

This is the rate at which the signal changes its relationship to time, expressed in degrees. One complete cycle of a wave begins at a certain point and continues until the same point is reached again. Phase shift occurs when the cycle does not complete, and a new cycle begins before the previous one has fully completed [8].

Generating Noise in the Background

We need to generate background noise (BGN). Background noises come in many different shapes and sizes (figuratively speaking). In layman's terms, background noise is often described as “office ventilation noise”, “car noise”, “street noise”, “cocktail noise”, “background music”, etc. Although this classification is practical for human understanding, the algorithms that model and produce comfort noise see things in more mathematical terms. The most basic and intuitive property of background noise is its loudness, which is referred to as the signal's energy level. Another, less obvious property is the frequency distribution of the signal. For example, the hum of a running car and that of a vacuum cleaner can have the same energy level, yet they do not sound the same: the two signals have distinctly different spectra. The third property of BGN is the variability over time of the first two properties. When the energy level and spectrum of a background noise are constant in time, it is said to be stationary. Some environments are prone to contain non-stationary BGN. The best example is street noise, in which cars come and go [10].

Good algorithms must cope well with all types of background noise. The regenerated noise must match the original signal as closely as possible.
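The first and third properties described above, energy level and its variability over time, can be illustrated with a small sketch. It is not part of the proposed algorithm: the function names, the frame length and the 3 dB tolerance are all hypothetical choices, and the spectrum (the second property) is not examined here.

```python
import math


def frame_energies_db(samples, frame_len):
    """Mean-square energy of consecutive non-overlapping frames, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # small offset avoids log10(0) for silent frames
        energies.append(10 * math.log10(mean_square + 1e-12))
    return energies


def is_stationary(samples, frame_len, tolerance_db=3.0):
    """Hypothetical rule: the energy spread across frames stays within tolerance_db."""
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        raise ValueError("signal is shorter than one frame")
    return max(energies) - min(energies) <= tolerance_db


if __name__ == "__main__":
    # steady 50 Hz hum sampled at 1000 Hz: the energy is the same in every frame
    hum = [math.sin(2 * math.pi * 50 * t / 1000) for t in range(1000)]
    # "street noise" stand-in: the same hum, ten times quieter in the first half
    street = [0.1 * x for x in hum[:500]] + hum[500:]
    print(is_stationary(hum, 100), is_stationary(street, 100))
```

A fuller check of stationarity would also compare the spectrum of successive frames, since two noises can share an energy level and still sound different.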

II. SUGGESTED ALGORITHM

Step 1:

We recorded two voice signals in two distinct environments and plotted them. We observed that the continuous noise lies mostly in the range between -1.5 and +1.5, as the following snapshots show.

Figure 1: Signal 1 recorded at home

Figure 2: Signal 2 recorded at computer lab

By magnifying a portion of Signal 1 we can see the noise strip more clearly.

Figure 3: Magnified portion of the Signal in figure 1


We recorded this noise strip in a separate variable and plotted it, as shown in the snapshot.

Figure 4: Recorded noise strip

This noise appears when there is no voice signal. If we subtract it from the noisy signal, we get the output signal shown in Figure 5.

Figure 5: Output signal after subtraction of noise from Signal 1.

Figure 6: Magnified portion of above Signal

Figure 7: Output signal after subtraction of noise from Signal 2.

We played these signals and observed that there was no noise where there was no recorded voice, but noise remained during the voice signal. The reason is that the recorded noise strip has empty spaces during the voice, where the signal was outside the range -1.5 to +1.5.

Step 2:

We filled these empty spaces by copying the noise pattern into the places where the noise strip is 0 (a straight line). The recorded noise strip then becomes:

Figure 8: Noise strip with filled spaces

The whole strip in the above figure is largely similar to the actual noise. By subtracting this noise strip from the originally recorded signals we get the noise-free signals shown in the following snapshots.


Figure 9: Output signal 1 after subtracting noise strip

Figure 10: Output signal 2 after subtracting noise strip

We played these signals and found that there was no noise either during the voice or without it. Unfortunately, the voice signal is slightly affected: it seems to have lost some of its characteristics.

Step 3:

We solved this problem by restricting noise subtraction to specific ranges. We subtracted noise only from samples that lie between -1.5 and +1.5, above +8 or below -10.

Figure 11: Output signal 1

Figure 12: Output signal 2

We played these signals and found that the final output signals were mostly noise free. We chose the limits +8 and -10. This choice is critical, since the clarity of the voice depends on the range chosen.
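The three steps can also be tried without a microphone on a short list of integer samples. The sketch below is a Python port that mirrors the logic of the Module 3 listing in Section IV, under two assumptions that are not in the original: the function names are hypothetical, and the recorded noise pattern is reused cyclically when there are more voice samples than noise samples recorded so far.

```python
def build_noise_strip(signal, low=-1.5):
    """Steps 1-2: copy samples in [low, 0] as noise; elsewhere repeat the pattern."""
    pattern, strip, k = [], [], 0
    for x in signal:
        if low <= x <= 0:
            pattern.append(x)                # sample lies in the noise band
            strip.append(x)
        elif pattern:
            # fill the gap with recorded noise (cyclic reuse is an assumption)
            strip.append(pattern[k % len(pattern)])
            k += 1
        else:
            strip.append(0)                  # no noise recorded yet
    return strip


def subtract_noise(signal, strip, band=1.5, upper=8, lower=-10):
    """Step 3: change only samples inside the noise band or beyond the outer limits."""
    out = []
    for x, b in zip(signal, strip):
        if (-band < x < 0) or x < lower:
            out.append(x - b)
        elif (0 < x < band) or x > upper:
            out.append(x + b)                # b is <= 0, so this reduces x
        else:
            out.append(x)                    # mid-range voice samples stay untouched
    return out


if __name__ == "__main__":
    recorded = [-1, 0, -1, 12, 5, -12, 1]
    strip = build_noise_strip(recorded)
    print(strip)
    print(subtract_noise(recorded, strip))
```

Changing `upper` and `lower` shows how strongly the result depends on the chosen range, which is the critical point noted above.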


III. FLOW CHART

[Flow chart; boxes in order: Input Voice; Add noise to Voice; Separate noise from recorded signal; Noise Signal; Subtract noise from Noisy signal; Noise free signal]

IV. MATLAB CODE

The algorithm has three modules, which are coded in MATLAB 7.0. We used a microphone to record 3 seconds of 22-kHz, 8-bit, mono data.

A. Module 1 code

% Recording audio signal: 3 s, 22050 Hz, 8-bit, mono
micrecorder = audiorecorder(22050, 8, 1);
recordblocking(micrecorder, 3);          % wait until the recording is finished
a = getaudiodata(micrecorder, 'int8');
m = size(a, 1);

% Recording noise: keep samples inside the noise band, zero elsewhere
b = zeros(m, 1, 'int8');
for i = 1 : m
    if (a(i,1) <= 0) && (a(i,1) >= -1.5)
        b(i,1) = a(i,1);
    end
end

% Subtracting noise (samples equal to 0 are left unchanged)
r = a;
for i = 1 : m
    if a(i,1) < 0
        r(i,1) = a(i,1) - b(i,1);
    elseif a(i,1) > 0
        r(i,1) = a(i,1) + b(i,1);
    end
end

% Plotting signals
figure, plot(a); figure, plot(b); figure, plot(r);

% Playing signals one after another
x = audioplayer(a, 22050, 8);
y = audioplayer(b, 22050, 8);
z = audioplayer(r, 22050, 8);
playblocking(x); playblocking(y); playblocking(z);

B. Module 2 code

% Recording audio signal
micrecorder = audiorecorder(22050, 8, 1);
recordblocking(micrecorder, 3);          % wait until the recording is finished
a = getaudiodata(micrecorder, 'int8');
m = size(a, 1);
j = 1; k = 1;

% Recording noise and filling the empty spaces with the recorded pattern
b = zeros(m, 1, 'int8');
c = zeros(0, 1, 'int8');                 % noise pattern collected so far
for i = 1 : m
    if (a(i,1) <= 0) && (a(i,1) >= -1.5)
        b(i,1) = a(i,1);
        c(j,1) = a(i,1);
        j = j + 1;
    elseif j > 1
        % wrap around so that k never runs past the end of c
        b(i,1) = c(mod(k-1, j-1) + 1, 1);
        k = k + 1;
    end
end

% Subtracting noise (samples equal to 0 are left unchanged)
r = a;
for i = 1 : m
    if a(i,1) < 0
        r(i,1) = a(i,1) - b(i,1);
    elseif a(i,1) > 0
        r(i,1) = a(i,1) + b(i,1);
    end
end

% Plotting signals
figure, plot(a); figure, plot(b); figure, plot(c); figure, plot(r);

% Playing signals one after another
x = audioplayer(a, 22050, 8);
y = audioplayer(b, 22050, 8);
z = audioplayer(r, 22050, 8);
playblocking(x); playblocking(y); playblocking(z);

C. Module 3 code

% Recording audio signal
micrecorder = audiorecorder(22050, 8, 1);
recordblocking(micrecorder, 3);          % wait until the recording is finished
a = getaudiodata(micrecorder, 'int8');
m = size(a, 1);
j = 1; k = 1;

% Recording noise and filling the empty spaces (same as Module 2)
b = zeros(m, 1, 'int8');
c = zeros(0, 1, 'int8');
for i = 1 : m
    if (a(i,1) <= 0) && (a(i,1) >= -1.5)
        b(i,1) = a(i,1);
        c(j,1) = a(i,1);
        j = j + 1;
    elseif j > 1
        b(i,1) = c(mod(k-1, j-1) + 1, 1);
        k = k + 1;
    end
end

% Subtracting noise only inside the noise band and beyond the outer limits
r = a;                                   % all other samples are left unchanged
for i = 1 : m
    if (a(i,1) < 0 && a(i,1) > -1.5) || a(i,1) < -10
        r(i,1) = a(i,1) - b(i,1);
    elseif (a(i,1) > 0 && a(i,1) < 1.5) || a(i,1) > 8
        r(i,1) = a(i,1) + b(i,1);
    end
end

% Plotting signals
figure, plot(a); figure, plot(b); figure, plot(c); figure, plot(r);

% Playing signals one after another
x = audioplayer(a, 22050, 8);
y = audioplayer(b, 22050, 8);
z = audioplayer(r, 22050, 8);
playblocking(x); playblocking(y); playblocking(z);

V. CONCLUSION

We presented an algorithm for removing noise from a voice signal, developed through a stepwise study. We observed a significant improvement in the voice signal after filtering with this algorithm. The algorithm works well on uniform noise. We are currently working on other techniques for removing noise from the signal without affecting the characteristics of the original voice, and we are extending our work so that the algorithm can be implemented in voice over IP, messengers or cellular phones.

REFERENCES
[1] Jaroslav Koton, Kamil Vrba, “Pure Current-Mode Frequency Filter for Signal Processing in High-Speed Data Communication,” Issue Date: April 2007, pp. 4.
[2] Yu-Jen Chen, Chin-Chang Wang, Gwo-Jia Jong, Boi-Wei Wang, “The Separation System of the Speech Signals Using Kalman Filter with Fuzzy Algorithm,” Issue Date: August 2006, pp. 603-606.
[3] Jafar Ramadhan Mohammed, “A New Simple Adaptive Noise Cancellation Scheme Based On ALE and NLMS Filter,” Issue Date: May 2007, pp. 245-254.
[4] Jian Zhang, Qicong Peng, Huaizong Shao, Tiange Shao, “Nonlinear Noise Filtering with Support Vector Regression,” Issue Date: October 2006, pp. 172-176.
[5] Xingquan Zhu, Xindong Wu, “Class Noise Handling for Effective Cost-Sensitive Learning by Cost-Guided Iterative Classification Filtering,” Issue Date: October 2006, pp. 1435-1440.
[6] Carlos Sanchez-Lopez, Esteban Tlelo-Cuautle, “Symbolic Noise Analysis in Gm-C Filters,” Issue Date: September 2006, pp. 49-53.
[7] Daniele Rizzetto and Claudio Catania, “A Voice over IP Service Architecture for Integrated Communications,” May-June 1999.
[8] http://www.cs.ntu.edu.au/sit/resources/dc100www/dc1009.htm
[9] http://www.bcae1.com/sig2nois.htm
[10] http://www.commsdesign.com/design_corner/OEG20030303S0036


Abdelwadood. Moh’d. Mesleh

Computer Engineering Department, Faculty of Engineering Technology, Balqa’ Applied University, Amman, Jordan.

Abstract- Feature selection (FS) is essential for effective and more accurate text classification (TC) systems. This paper investigates the effectiveness of five commonly used FS methods for our Arabic language TC system. The evaluation used an in-house collected Arabic TC corpus. The experimental results are presented in terms of macro-averaged precision, macro-averaged recall and macro-averaged F1 measure.

I. INTRODUCTION

It is known that the volume of Arabic information available on the Internet is increasing. This growth motivates researchers to find tools that may help people to better manage, filter and classify these huge Arabic information resources. TC [1] is the task of classifying texts into one of a pre-specified set of categories or classes based on their contents. It is also referred to as text categorization, document categorization, document classification or topic spotting.

TC is among the many important research problems in information retrieval (IR), data mining, and natural language processing. It has many applications [2], such as document indexing, document organization, text filtering, word sense disambiguation and hierarchical categorization of web pages.

TC has been studied as a binary classification approach (a binary classifier is designed for each category of interest), and many TC training algorithms have been reported for binary classification, e.g. the Naïve Bayesian method [3,4], k-nearest neighbors (kNN) [4,5,6], support vector machines (SVM) [7], decision trees [8], etc. On the other hand, it has also been studied as a multi-class classification approach, e.g. boosting [9] and multi-class SVM [10,11].

In TC tasks, supervised learning is a very popular approach that is commonly used to train TC systems (algorithms). TC algorithms learn classification patterns from a set of labeled examples: given a sufficient number of labeled examples (the training set), the task is to build a TC model. We can then use the TC system to predict the category (class) of new (unseen) examples (the testing set). In many cases, the set of input variables (features) of those examples contains redundant features that do not reveal significant input-output (document-category) characteristics. This is why FS techniques are essential to improve classification effectiveness.

The rest of this paper is organized as follows. Section 2 summarizes the Arabic TC and FS related work. Section 3 describes the TC design procedure. Experimental results are shown in section 4. Section 5 draws some conclusions and outlines future work.

II. ARABIC TC AND FS RELATED WORK

Most TC research is designed and tested for English-language articles. However, some TC approaches have been carried out for other European languages such as German, Italian and Spanish [12], and others for Chinese and Japanese [13,14].

Little TC work [15] has been carried out for Arabic articles. To the best of our knowledge, there is only one commercial automatic Arabic text categorizer, referred to as “Sakhr Categorizer” [16].

Compared to other languages (such as English), Arabic has an extremely rich morphology and a complex orthography; this is one of the main reasons [15,17] behind the lack of research in the field of Arabic TC. Nevertheless, many machine learning approaches have been proposed to classify Arabic documents: SVM with the CHI square feature extraction method [18,19], the Naïve Bayesian method [20], k-nearest neighbors (kNN) [21,22,23], maximum entropy [17,24], distance-based classifiers [25,26,27], the Rocchio algorithm [23] and a WordNet knowledge-based approach [28].

It is quite hard to compare the effectiveness of these approaches fairly, for the following reasons:
• Their authors have used different corpora (because there is no publicly available Arabic TC corpus).
• Even for those who have used the same corpus, it is not clear whether they used the same documents for training and testing their classifiers.
• Authors have used different evaluation measures: accuracy, recall, precision and F1 measure.

For English-language TC tasks, the studies [29,30] have presented extensive empirical evaluations of many FS methods with kNN and SVM. It has been reported that the χ² statistic (CHI) and information gain (IG) [29] FS methods performed most effectively with the kNN classifier. On the other hand, it has been shown that mutual information (MI) and term strength (TS) [29] performed very poorly. However, IG [30] is the best choice to improve SVM classifier performance in terms of precision.

T. Sobh (ed.), Advances in Computer and Information Sciences and Engineering, 11–16. © Springer Science+Business Media B.V. 2008

Support Vector Machines Based Arabic Language Text Classification System: Feature Selection Comparative Study


To the best of our knowledge, the only work that has investigated the use of FS methods for Arabic TC tasks is [23]. The IG, CHI, document frequency (DF), odds ratio (OR), GSS and NGL FS methods were evaluated using a hybrid approach of light and trigram stemming. It was shown that using any of those FS methods separately gave similar results; NGL performed better than DF, CHI and GSS with the Rocchio classifier in terms of the F1 measure (it was noticed that, when using IG and OR, the majority of documents contained none of the selected terms). It was concluded that a hybrid approach of DF and IG is a preferable FS method with the Rocchio classifier. However, the authors of [23] did not report the comparison results of these FS methods in terms of recall, precision and F1 measure, and they did not consider SVM, which was already known to be superior to their classifiers.

In this paper, we restrict our study of TC to binary classification methods, in particular SVM, and to Arabic-language articles only. Through fair comparison experiments, we investigate the performance of the well-known FS methods with SVM for Arabic TC tasks.

III. TC DESIGN PROCEDURE

TC system design usually comprises the following three main phases [7]:
• Data pre-processing and FS phase: the text documents are made compact and suitable for training the text classifier.
• Text classifier construction: the text classifier, the core TC learning algorithm, is constructed, trained and tuned using the compact form of the Arabic dataset.
• Text classifier evaluation: the text classifier is evaluated (using some performance measures).

At the end of the above procedure, the TC system can perform document classification. The following subsections are devoted to Arabic dataset preprocessing, FS methods, the text classifier and TC evaluation measures.

A. Arabic Dataset Preprocessing

Since there is no publicly available Arabic TC corpus to test our classifier, we have used an in-house corpus collected from online Arabic newspaper archives, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram, and Ad-Dustour, as well as a few other specialized websites.

The collected corpus contains 1445 documents that vary in length. These documents fall into nine classification categories that vary in the number of documents. In this Arabic dataset, each document was saved in a separate file within the corresponding category's directory, i.e. the documents of this dataset are single-labeled.

Table 1 shows the number of documents for each category. One third of the articles were randomly selected for testing, and the remaining articles were used for training the Arabic text classifier.

TABLE 1
ARABIC DATA SET

Category     Training Articles  Testing Articles  Total Number
Computer     47                 23                70
Economics    147                73                220
Education    45                 23                68
Engineering  77                 38                115
Law          65                 32                97
Medicine     155                77                232
Politics     123                61                184
Religion     152                75                227
Sports       155                77                232
Total        966                479               1445

As mentioned before, Arabic dataset preprocessing aims at transforming the Arabic text documents into a form that is suitable for the classification algorithm. In this phase, the Arabic documents are processed according to the following steps [5,11,28]:
• Each article in the Arabic dataset is processed to remove digits and punctuation marks.
• We have followed [15] in the normalization of some Arabic letters: we have normalized the letters “ء” (hamza), “آ” (aleph mad), “أ” (aleph with hamza on top), “ؤ” (hamza on w), “إ” (alef with hamza on the bottom) and “ئ” (hamza on ya) to “ا” (alef). The reason for this normalization is that all forms of hamza are represented in dictionaries as one form, and people often misspell different forms of aleph. We have normalized the letter “ى” to “ي” and the letter “ة” to “ه”. The reason behind this normalization is that there is no single convention for spelling “ى” or “ي” and “ة” or “ه” when they appear at the end of a word.
• Arabic stop words (such as “آخر”, “أبدا”, “أحد” etc.) were removed. The Arabic function words (stop words) are words that are not useful in IR systems, e.g. pronouns and prepositions.
• All non-Arabic text was removed.
• The vector space model (VSM) representation [31] is used to represent the Arabic documents. In VSM, term frequency (TF) is the number of times term i occurs in document j, while inverse document frequency (IDF) reflects the occurrence of the term in a collection of texts and is calculated as IDF(i) = log(N/DF(i)), where N is the total number of training documents and DF(i) is the number of documents in which term i occurs. In IR, it is known that TF makes frequent terms more important. As a result, TF improves recall (see the TC evaluation measures subsection for the definition of recall). On the other hand, IDF makes terms that occur rarely in a collection of texts more important. As a result, IDF improves precision (see the TC evaluation measures subsection for the definition of precision). Using VSM, [32] showed that combining TF and IDF to weight terms (TF.IDF) gives better performance. In our Arabic dataset, the TF.IDF weights are calculated and each document feature vector is normalized to unit length.
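The preprocessing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, tokens are taken to be runs of characters in the Unicode range U+0621 to U+064A, stop words are assumed to be supplied already normalized, and the natural logarithm is used for IDF because the text does not state the base.

```python
import math
import re
from collections import Counter

# Letter normalization from the text, written with Unicode escapes.
# U+0621..U+0626: hamza, alef madda, alef with hamza above, waw with hamza,
# alef with hamza below, yeh with hamza -> U+0627 (alef)
NORMALIZE = {code: "\u0627" for code in range(0x0621, 0x0627)}
NORMALIZE[0x0649] = "\u064a"   # alef maksura -> yeh
NORMALIZE[0x0629] = "\u0647"   # teh marbuta -> heh


def preprocess(text, stop_words=frozenset()):
    """Keep Arabic-letter tokens only, normalize letters, drop stop words."""
    tokens = re.findall("[\u0621-\u064a]+", text)   # drops digits, punctuation, Latin
    tokens = [token.translate(NORMALIZE) for token in tokens]
    return [token for token in tokens if token not in stop_words]


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one unit-length {term: weight} dict each."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:                       # normalize to unit length
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


if __name__ == "__main__":
    print(preprocess("\u0623\u062d\u0645\u062f 123 abc \u0645\u062f\u0631\u0633\u0629."))
    print(tfidf_vectors([["a", "b"], ["a", "c"]]))
```

A term that occurs in every document gets IDF = log(1) = 0 and therefore carries no weight, which is the behaviour the IDF factor is meant to provide.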


We have not applied stemming, because it is not always beneficial [33] for TC tasks and because it has been shown empirically that it is not beneficial [18,19] for Arabic TC tasks either.

In fact, depending on the context, multiple Arabic words may be derived from the same Arabic root. On the other hand, the same word may be derived from several different roots.

Table 2 shows some words that are derived from the same root ‘ktb’ (كتب). On the other hand, Table 3 shows some roots from which the same word ‘ayman’ (ايمان) may be derived.

B. Feature Selection Methods

FS is a process that chooses a subset from the original feature set, such that the selected subset is sufficient to perform classification tasks. It is one of the important research problems in data mining [34], pattern recognition [35], and machine learning [36]. FS methods have been widely applied to TC tasks [2,37,38,39,40,41].

The basic steps of FS are [36,42,43]:

• Feature generation: in this step, a candidate subset of features is generated by some search process.

• Feature evaluation: using an evaluation criterion, the candidate feature subset is evaluated. (This step measures the goodness of the produced features.)

• Stopping: using some stopping criterion, decide whether to stop or not, e.g. whether a predefined number of features has been selected or a predefined number of iterations has been reached.

• Feature validation: using a validation procedure, a decision is made whether a feature subset is valid or not. (Strictly speaking, this step is not part of the FS process itself, but in practice the validity of the FS outcome needs to be verified.)

Generally, FS algorithms follow one of these common approaches [43,44]:

• A filter-based method [34] selects a subset of features by filtering based on the scores assigned by a specific weighting method.

• A wrapper approach [36] chooses the subset of features based on the accuracy of a given classifier.

• A hybrid method [45] takes advantage of both the filter and the wrapper methods.

The major disadvantage of the wrapper method [34,46] is its computational cost, which makes wrapper methods impractical for large classification problems; filter methods are often used instead. In a TC task, because the number of features is huge, the FS method must be chosen carefully to make the learning task efficient and accurate [42]:

• FS improves the performance of the TC task in terms of learning speed and effectiveness (building the classifier is usually simpler and faster when fewer features are used), and it reduces the data dimension and removes irrelevant, redundant, or noisy data.

• On the other hand, FS may decrease accuracy (over-fitting problem [1,44], which may arise when the number of features is large and the number of training samples is relatively small).

TABLE 2 DIFFERENT ARABIC WORDS FROM THE ROOT ‘KTB’

Arabic Word | Pronunciation | English Meaning
كاتب | katib | writer
كتابه | kitaba | the act of writing
كتاب | kitab | some writing, book

In addition to the classical FS methods [29] (document frequency thresholding (DF), the $\chi^2$ statistic (CHI), term strength (TS), information gain (IG) and mutual information (MI)), other FS methods have been reported in the literature, such as Odds Ratio (OR) [47], NGL [48] and GSS [49].

Table 4 lists the functions of the FS methods commonly used for TC [2], where $t_k$ denotes a term and $c_i$ denotes a category. DF for a term $t_k$ is the number of documents in which $t_k$ occurs. Probabilities are interpreted on an event space of training documents: for example, $P(t_k, \bar{c}_i)$ denotes the probability that a term $t_k$ occurs in a document x that does not belong to class $c_i$, and $P(\bar{c}_i)$ is estimated as the number of documents that do not belong to class $c_i$ divided by the total number of training documents. The functions are specified “locally” to a specific category $c_i$; to assess the value of a term $t_k$ in a global, category-independent sense, either the sum $\sum_{i=1}^{|C|} f(t_k, c_i)$, the weighted sum $\sum_{i=1}^{|C|} P(c_i) f(t_k, c_i)$, or the maximum $\max_{i=1}^{|C|} f(t_k, c_i)$ of the category-specific values $f(t_k, c_i)$ is usually computed.
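As a rough sketch of how category-specific scores might be combined and then used for filter-style selection, the following Python code implements the three globalization options named above (sum, weighted sum, maximum) and keeps the k highest-scoring terms. The function names, the `how` parameter and the dictionary layout are hypothetical choices for illustration; the text does not say which globalization the experiments used.

```python
def globalize(scores, priors=None, how="max"):
    """Combine category-specific scores f(t_k, c_i) into one value per term.

    scores: {term: [f(t_k, c_1), ..., f(t_k, c_|C|)]}
    priors: [P(c_1), ..., P(c_|C|)], needed only for how="wsum".
    """
    combined = {}
    for term, per_category in scores.items():
        if how == "sum":
            combined[term] = sum(per_category)
        elif how == "wsum":
            combined[term] = sum(p * f for p, f in zip(priors, per_category))
        elif how == "max":
            combined[term] = max(per_category)
        else:
            raise ValueError(f"unknown globalization: {how}")
    return combined


def select_top_k(global_scores, k):
    """Filter-style selection: keep the k highest-scoring terms."""
    ranked = sorted(global_scores, key=global_scores.get, reverse=True)
    return ranked[:k]
```

For example, `select_top_k(globalize(scores, how="max"), 180)` would keep the 180 terms with the largest maximum category-specific score.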

In this paper, we restrict our study to five FS methods: CHI, NGL, GSS, OR and MI.

C. Text Classifier

SVM-based classifiers are binary classifiers, originally proposed in [50]. Based on the structural risk minimization principle, an SVM seeks a decision hyperplane that separates the training data points into two classes and makes decisions based on the support vectors, which are selected as the only effective elements in the training dataset. The SVM classifier is formulated for two different cases: the separable case and the non-separable case.

In the separable case, where the training data are linearly separable, SVM training minimizes:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad \forall i,\ y_i (x_i \cdot w + b) - 1 \ge 0 \qquad (1)$$

In the non-separable case, where the training data are not linearly separable, SVM training solves the soft-margin problem given in Eq. (2).

TABLE 3 DIFFERENT ARABIC ROOTS FROM THE WORD ‘ayman’

Arabic Root | Pronunciation | English Meaning
أمن | eman | peace
أيم | Ayyiman | two poor people
مان | Ayama’nu | will he give support


TABLE 4 COMMONLY USED FS METHODS

Method | Function $f(t_k, c_i)$
CHI | $\dfrac{N\,[P(t_k,c_i)\,P(\bar{t}_k,\bar{c}_i) - P(t_k,\bar{c}_i)\,P(\bar{t}_k,c_i)]^2}{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}$
NGL | $\dfrac{\sqrt{N}\,[P(t_k,c_i)\,P(\bar{t}_k,\bar{c}_i) - P(t_k,\bar{c}_i)\,P(\bar{t}_k,c_i)]}{\sqrt{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}}$
GSS | $P(t_k,c_i)\,P(\bar{t}_k,\bar{c}_i) - P(t_k,\bar{c}_i)\,P(\bar{t}_k,c_i)$
OR | $\dfrac{P(t_k \mid c_i)\,(1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i))\,P(t_k \mid \bar{c}_i)}$
MI | $\log \dfrac{P(t_k,c_i)}{P(t_k)\,P(c_i)}$
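To make the functions of Table 4 concrete, the sketch below computes all five scores for one (term, category) pair from a 2x2 table of document counts, estimating each probability as a relative frequency over the training documents. The function name `fs_scores` and the count names `a`, `b`, `c`, `d` are hypothetical. NGL is taken here in its usual form, with square roots over N and over the denominator, and the natural logarithm is assumed for MI. No smoothing is applied, so the sketch assumes all marginals are non-zero.

```python
import math


def fs_scores(a, b, c, d):
    """Score one (term, category) pair from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    Assumes every marginal is non-zero (no smoothing is applied).
    """
    n = a + b + c + d
    # Joint probabilities estimated as relative frequencies.
    p_tc, p_tnc, p_ntc, p_ntnc = a / n, b / n, c / n, d / n
    p_t, p_nt = p_tc + p_tnc, p_ntc + p_ntnc
    p_c, p_nc = p_tc + p_ntc, p_tnc + p_ntnc
    gss = p_tc * p_ntnc - p_tnc * p_ntc
    denom = p_t * p_nt * p_c * p_nc
    p_t_given_c = a / (a + c)
    p_t_given_nc = b / (b + d)
    return {
        "CHI": n * gss ** 2 / denom,
        "NGL": math.sqrt(n) * gss / math.sqrt(denom),
        "GSS": gss,
        "OR": (p_t_given_c * (1 - p_t_given_nc))
              / ((1 - p_t_given_c) * p_t_given_nc),
        "MI": math.log(p_tc / (p_t * p_c)),
    }
```

In this form CHI is the square of NGL, so the two rank terms identically by magnitude; NGL additionally keeps the sign, which distinguishes terms that indicate membership from terms that indicate non-membership.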

$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad \forall i,\ y_i (x_i \cdot w + b) - 1 + \xi_i \ge 0; \quad \forall i,\ \xi_i \ge 0 \qquad (2)$$
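The role of the slack variables in Eq. (2) can be illustrated with a small Python sketch that evaluates the objective for a fixed hyperplane. It does not train an SVM; it only uses the fact that, for fixed w and b, the smallest slack satisfying both constraints is max(0, 1 - y_i(x_i · w + b)). The function name and argument layout are hypothetical. When every slack is zero, the constraints of Eq. (1) hold and the value reduces to the separable-case objective.

```python
def soft_margin_objective(w, b, xs, ys, c=1.0):
    """Return (objective, slacks) of Eq. (2) for a fixed hyperplane (w, b).

    xs: list of feature vectors, ys: list of labels in {+1, -1},
    c: the soft-margin parameter C.
    """
    slacks = []
    for x, y in zip(xs, ys):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        # Smallest xi_i with y_i (x_i . w + b) - 1 + xi_i >= 0 and xi_i >= 0.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wj * wj for wj in w) + c * sum(slacks)
    return objective, slacks
```

A larger C penalizes margin violations more heavily, while a smaller C favours a wider margin at the cost of more violations.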

D. TC Evaluation Measures

TC performance is evaluated in terms of computational efficiency and categorization effectiveness. When categorizing a large number of documents into many categories, the computational efficiency of the TC system must be considered; this includes the FS method and the classifier learning algorithm. TC effectiveness [51] is measured in terms of precision, recall and the $F_1$ measure. Denote the precision, recall and $F_1$ measure for a class $C_i$ by $P_i$, $R_i$ and $F_i$, respectively. We have:

$$P_i = \frac{TP_i}{TP_i + FP_i} \qquad (3)$$

$$R_i = \frac{TP_i}{TP_i + FN_i} \qquad (4)$$

$$F_i = \frac{2 P_i R_i}{R_i + P_i} = \frac{2\,TP_i}{FP_i + FN_i + 2\,TP_i} \qquad (5)$$

where:

• $TP_i$: true positives; the set of documents that both the classifier and the previous judgment (as recorded in the test set) classify under $C_i$.

• $FP_i$: false positives; the set of documents that the classifier classifies under $C_i$, but the test set indicates that they do not belong to $C_i$.

• $TN_i$: true negatives; both the classifier and the test set agree that the documents in $TN_i$ do not belong to $C_i$.

• $FN_i$: false negatives; the classifier does not classify the documents in $FN_i$ under $C_i$, but the test set indicates that they should be classified under $C_i$.

In Eqs. (3) to (5), these symbols stand for the sizes of the corresponding sets.

To evaluate the classification performance for each category, precision, recall and the $F_1$ measure are used. To evaluate the average performance over many categories, the macro-averaging $F_1$ ($F_1^M$), micro-averaging $F_1$ ($F_1^\mu$), macro-averaging precision (macroP), micro-averaging precision (microP), macro-averaging recall (macroR) and micro-averaging recall (microR) are used, defined as follows:

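$$F_1^M = \frac{2\left[\sum_{i=1}^{|C|} R_i \, \sum_{i=1}^{|C|} P_i\right]}{N\left[\sum_{i=1}^{|C|} R_i + \sum_{i=1}^{|C|} P_i\right]} \qquad (6)$$

$$F_1^\mu = \frac{2\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} FP_i + \sum_{i=1}^{|C|} FN_i + 2\sum_{i=1}^{|C|} TP_i} \qquad (7)$$

$$macroP = \frac{\sum_{i=1}^{|C|} P_i}{|C|} \qquad (8)$$

$$macroR = \frac{\sum_{i=1}^{|C|} R_i}{|C|} \qquad (9)$$

$$microP = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \qquad (10)$$

$$microR = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} \qquad (11)$$

Eq. (6) equals the harmonic mean of macroP and macroR when N is taken to be the number of categories $|C|$.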

Macro-averaging $F_1$ treats every category equally, and calculates the global measure as the mean of all categories’ local measures.

On the other hand, micro-averaging $F_1$ computes overall global measures by giving each category’s local performance measures different weights based on its number of positive documents.

To compute macroP and macroR, the precision and recall of each individual category are computed locally and then averaged over the categories. On the other hand, to compute microP and microR, precision and recall are computed globally over all the testing documents for all categories.

In this paper, we will focus on macro-averaging $F_1$, macroP and macroR.
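The measures of Eqs. (3) to (11) can be computed from per-category counts as in the following Python sketch. The function name `evaluate`, the input layout and the result keys are hypothetical. Two assumptions are made: a ratio with a zero denominator is reported as 0.0, and N in Eq. (6) is taken to be the number of categories, so that the macro-averaging $F_1$ is the harmonic mean of macroP and macroR.

```python
def evaluate(counts):
    """Compute per-class and averaged measures.

    counts: {category: (TP, FP, FN)} with set sizes taken from a test run.
    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    per_class = {}
    for category, (tp, fp, fn) in counts.items():
        precision = tp / (tp + fp) if tp + fp else 0.0          # Eq. (3)
        recall = tp / (tp + fn) if tp + fn else 0.0             # Eq. (4)
        f1 = 2 * tp / (fp + fn + 2 * tp) if tp + fp + fn else 0.0  # Eq. (5)
        per_class[category] = (precision, recall, f1)

    n_categories = len(per_class)
    macro_p = sum(p for p, _, _ in per_class.values()) / n_categories  # Eq. (8)
    macro_r = sum(r for _, r, _ in per_class.values()) / n_categories  # Eq. (9)
    # Eq. (6) with N = |C|: harmonic mean of macroP and macroR.
    macro_f1 = (2 * macro_p * macro_r / (macro_p + macro_r)
                if macro_p + macro_r else 0.0)

    tp_sum = sum(tp for tp, _, _ in counts.values())
    fp_sum = sum(fp for _, fp, _ in counts.values())
    fn_sum = sum(fn for _, _, fn in counts.values())
    micro_p = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0   # Eq. (10)
    micro_r = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0   # Eq. (11)
    micro_f1 = (2 * tp_sum / (fp_sum + fn_sum + 2 * tp_sum)
                if tp_sum + fp_sum + fn_sum else 0.0)                  # Eq. (7)

    return {
        "per_class": per_class,
        "macroP": macro_p, "macroR": macro_r, "macroF1": macro_f1,
        "microP": micro_p, "microR": micro_r, "microF1": micro_f1,
    }
```

Because the micro-averaged values pool the counts before dividing, large categories dominate them, whereas the macro-averaged values give a small category the same weight as a large one.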

IV. EXPERIMENTAL RESULTS

In our experiments, we have used the mentioned Arabic dataset for training and testing our Arabic text classifier.

In addition to the preprocessing steps mentioned above, we filtered out all terms whose term frequency (TF) is below a threshold (the threshold is set to three for positive features and to six for negative features in the training documents).

We used an SVM package, TinySVM (downloaded from http://chasen.org/~taku/), with the soft-margin parameter C set to 1.0 (other values of C showed no significant changes in the results).

To study the effect of FS, we ran a classification experiment without any FS method, i.e. using all the features (the result of this experiment is referred to as the original classifier).

To compare the five FS methods (CHI, NGL, GSS, OR and MI) fairly, we conducted three groups of experiments. For each group and for each text category, we randomly selected one third of the articles for testing and used the remaining articles for training the Arabic classifier. For each FS method, we conducted three experiments: the first selects the best 180 features, the second selects the best 160 features, and the third selects the best 140 features.
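The per-category random split described above might look like the following Python sketch. It is only an illustration of the protocol: the function name, the dictionary layout, the rounding rule for categories whose size is not a multiple of three, and the use of a seed to distinguish the three experiment groups are all assumptions that the text does not specify.

```python
import random


def split_by_category(docs_by_category, test_fraction=1 / 3, seed=0):
    """Randomly hold out a fraction of each category's articles for testing.

    docs_by_category: {category: [article, ...]}
    Returns (train, test), two dicts with the same keys.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for category, docs in docs_by_category.items():
        shuffled = list(docs)
        rng.shuffle(shuffled)
        # Assumption: round to the nearest whole number of test articles.
        n_test = round(len(shuffled) * test_fraction)
        test[category] = shuffled[:n_test]
        train[category] = shuffled[n_test:]
    return train, test
```

Calling the function with three different seeds would give three independent splits, one per group of experiments.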

Fig. 1 shows the macroP values for the SVM classifier with the five FS methods at different feature-set sizes. Compared with the original classifier (without feature selection, i.e. all 78699 features are used for training the SVM classifier), only CHI and NGL perform better. CHI is more stable than NGL (CHI outperforms the original classifier at all three feature-set sizes), but the best SVM classification macroP result is obtained with the NGL FS method (93.14 when selecting the best 180 features).

Fig. 2 shows the macroR results. All the FS methods outperform the original classifier, and CHI, NGL and GSS perform much better than OR and MI. The best classification macroR result is obtained with CHI (84.00 when selecting the best 140 features).

Fig. 3 shows the macro-averaging $F_1$ results. All the FS methods outperform the original classifier, and CHI, NGL and GSS perform much better than OR and MI. CHI outperforms NGL and GSS, and achieves its best macro-averaging $F_1$ result when selecting the best 160 features.

V. CONCLUSIONS

We have investigated the performance of five FS methods with SVM, evaluated on an Arabic dataset. CHI, NGL and GSS were the most effective, while OR and MI were less effective. CHI performed best.

In the future, we would like to study more FS methods for our SVM-based Arabic TC system, and to investigate in depth the effect of the FS methods on small categories (such as Computer).

ACKNOWLEDGMENT

Many thanks to Dr. Raed Abu Zitar for kind advice.

REFERENCES

[1] C. Manning, and H. Schütze, “Foundations of Statistical Natural Language Processing”, MIT Press (1999).

[2] F. Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, No. 1, 2002, pp.1-47.

[3] A. McCallum, and K. Nigam, “A comparison of event models for naïve Bayes text classification”, AAAI-98 Workshop on Learning for Text Categorization, 1998, pp.41-48.

[4] Y. Yang, and X. Liu, “A re-examination of text categorization methods”, 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99), 1999, pp. 42-49.

[5] G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “An kNN Model-based Approach and its Application in Text Categorization”, Proceeding of 5th International Conference on Intelligent Text Processing and Computational Linguistic, CICLing-2004, LNCS 2945, Springer-Verlag, 2004, pp. 559-570.

[6] Y.M. Yang, “Expert network: Effective and efficient learning from human decisions in text categorization and retrieval”, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’94), 1994, pp. 13-22.

[7] T. Joachims, “Text categorization with Support Vector Machines: learning with many relevant features”, Proceedings of the European Conference on Machine Learning (ECML’98), Berlin, 1998, pp.137-142, Springer.

Fig. 1. Macro-averaging precision values for SVM classifier with the five FS methods at different sizes of features. (Plot: MacroP versus Number of Features at 140, 160, 180 and Average; series: CHI, NGL, GSS, OR, MI, All features.)

Fig. 2. Macro-averaging recall values for SVM classifier with the five FS methods at different sizes of features. (Plot: MacroR versus Number of Features at 140, 160, 180 and Average; series: CHI, NGL, GSS, OR, MI, All features.)

Fig. 3. Macro-averaging F1 values for SVM classifier with the five FS methods at different sizes of features. (Plot: Macro-averaging F1 versus Number of Features at 140, 160, 180 and Average; series: CHI, NGL, GSS, OR, MI, All features.)


[8] D. Lewis, and M. Ringuette, “A comparison of two learning algorithms for text categorization”, The Third Annual Symposium on Document Analysis and Information Retrieval, 1994, pp.81-93.

[9] R. Schapire, and Y. Singer, “BoosTexter: A boosting-based system for text categorization”, Machine Learning, Vol. 39, No.2-3, 2000, pp.135-168.

[10] S. Gao, W. Wu, C-H. Lee, and T-S. Chua, “A maximal figure-of-merit (MFoM)-learning approach to robust classifier design for text categorization”, ACM Transactions on Information Systems, Vol. 24, No. 2, 2006, pp. 190-218.

[11] J. Zhang, R. Jin, Y.M. Yang, and A. Hauptmann, “A modified logistic regression: an approximation to SVM and its applications in large-scale Text Categorization”, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pp. 888-895.

[12] F. Ciravegna, et.al., “Flexible Text Classification for Financial Applications: the FACILE System”, Proceedings of PAIS-2000, Prestigious Applications of Intelligent Systems sub-conference of ECAI2000, 2000, pp.696-700.

[13] F. Peng, X. Huang, D. Schuurmans, and S. Wang, “Text Classification in Asian Languages without Word Segmentation”, Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages (IRAL 2003), Association for Computational Linguistics, July 7, Sapporo, Japan, 2003, pp. 41-48.

[14] J. He, A-H. TAN, and C-L. TAN, “On Machine Learning Methods for Chinese document Categorization”, Applied Intelligence, 2003, pp. 311-322.

[15] A.M. Samir, W. Ata, and N. Darwish, “A New Technique for Automatic Text Categorization for Arabic Documents”, Proceedings of the 5th Conference of the Internet and Information Technology in Modern Organizations, December, Cairo, Egypt, 2005, pp. 13-15.

[16] Sakhr company website: http://www.sakhr.com.

[17] A.M. El-Halees, “Arabic Text Classification Using Maximum Entropy”, The Islamic University Journal, Vol. 15, No. 1, 2007, pp 157-167.

[18] A.M. Mesleh, “CHI Square Feature Extraction Based SVMs Arabic Language Text Categorization System”, Proceedings of the 2nd international Conference on Software and Data Technologies, (Knowledge Engineering), Volume 1, Barcelona, Spain, July 22-25, 2007, pp. 235-240.

[19] A.M. Mesleh, “CHI Square Feature Extraction Based SVMs Arabic Language Text Categorization System”, Journal of Computer Science, Vol. 3, No. 6, 2007, pp. 430-435.

[20] M. Elkourdi, A. Bensaid, and T. Rachidi, “Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm”, Proceedings of COLING 20th Workshop on Computational Approaches to Arabic Script-based Languages, Geneva, August 23rd-27th , 2004, pp. 51-58.

[21] G. Kanaan, R. Al-Shalabi, and A. AL-Akhras, “KNN Arabic Text Categorization Using IG Feature Selection”, Proceedings of The 4th International Multiconference on Computer Science and Information Technology (CSIT 2006), Vol. 4, Amman, Jordan, April 5-7, 2006, Retrieved March 20, 2007, from http://csit2006.asu.edu.jo/proceedings.

[22] R. Al-Shalabi, G. Kanaan, M. Gharaibeh, “Arabic Text Categorization Using kNN Algorithm”, Proceedings of The 4th International Multiconference on Computer Science and Information Technology (CSIT 2006), Vol. 4, Amman, Jordan, April 5-7, 2006, Retrieved March 20, 2007, from http://csit2006.asu.edu.jo/proceedings.

[23] M. Syiam, Z. Fayed, and M. Habib, “An Intelligent System for Arabic Text Categorization”, International Journal of Intelligent Computing and Information Sciences, Vol.6, No.1, 2006, pp. 1-19.

[24] H. Sawaf, J. Zaplo, and H. Ney, “Statistical Classification Methods for Arabic News Articles”, Paper presented at the Arabic Natural Language Processing Workshop (ACL2001), Toulouse, France. ( Retrieved from Arabic NLP Workshop at ACL/EACL 2001 website: http://www.elsnet.org/acl2001-arabic.html).

[25] R.M. Duwairi, “A Distance-based Classifier for Arabic Text Categorization”, Proceedings of the 2005 International Conference on Data Mining (DMIN2005), Las Vegas, USA, 2005, pp.187-192.

[26] L. Khreisat, “Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study”, Proceedings of the 2006 International Conference on Data Mining (DMIN2006). Las Vegas, USA, 2006, pp.78-82.

[27] R.M. Duwairi, “Machine Learning for Arabic Text Categorization”, Journal of American society for Information Science and Technology, Vol. 57, No. 8, 2006, pp.1005-1010.

[28] M. Benkhalifa, A. Mouradi, and H. Bouyakhf, “Integrating WordNet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization”, International Journal of Intelligent Systems, Vol. 16, No. 8, 2001, pp. 929-947.

[29] Y.M. Yang, and J.O. Pedersen, “A Comparative Study on Feature Selection in Text Categorization”, In J. D. H. Fisher, editor, The 14th International Conference on Machine Learning (ICML’97), Morgan Kaufmann, 1997, pp.412-420.

[30] G. Forman, “An Extensive Empirical Study of Feature Selection Metrics for Text Classification”, Journal of Machine Learning Research, Vol. 3, 2003, pp. 1289-1305.

[31] G. Salton, A. Wong, and C.S. Yang, “A Vector Space Model for Automatic Indexing”, Communications of the ACM, Vol. 18, No. 11, 1975, pp. 613-620.

[32] G. Salton, and C. Buckley, “Term weighting approaches in automatic text retrieval”, Information Processing and Management, Vol. 24, No. 5, 1988, pp. 513-523.

[33] T. Hofmann, “Introduction to Machine Learning”, Draft Version 1.1.5, November 10, 2003.

[34] M. Dash, K. Choi, P. Scheuermann, and H. Liu, “Feature Selection for Clustering-a Filter Solution”, Proceedings of the second International Conference of Data Mining, 2002, pp. 115-122.

[35] P. Mitra, C.-A. Murthy, and S.-K. Pal, “Unsupervised Feature Selection Using Feature Similarity”. IEEE Transaction of Pattern Analysis and Machine Intelligence, Vol. 24, No. 3, 2002, pp. 301-312.

[36] R. Kohavi, G.H. John, “Wrappers for feature subset selection”, Artificial Intelligence, Vol. 97, No. 1-2, 1997, pp.273-324.

[37] E. Leopold, and J. Kindermann, “Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?”, Machine Learning, Vol. 46, 2002, pp. 423-444.

[38] K. Nigam, A.K. Mccallum, S. Thrun, and T. Mitchell, “Text Classification from Labeled and Unlabeled Documents Using EM”, Machine Learning, Vol. 39, 2000, pp. 103—134.

[39] D. Mladenic, “Feature subset selection in text learning”, Proceedings of European Conference on Machine Learning (ECML), 1998, pp. 95-100.

[40] H. Taira, and M. Haruno, “Feature selection in SVM text categorization”, Proceedings of AAAI-99, 16th Conference of the American Association for Artificial Intelligence (Orlando, US, 1999), 1999, pp. 480-486.

[41] D. Lewis, “Feature Selection and Feature Extraction for Text Categorization”, Proceedings of a workshop on speech and natural language, San Mateo, CA: Morgan Kaufmann, 1992, pp. 212-217.

[42] M. Dash, and H. Liu, “Feature selection for classification”, Intelligent Data Analysis, Vol. 1, No. 3, 1997, pp.131-156.

[43] H. Liu, and L. Yu, “Toward integrating feature selection algorithms for classification and clustering”, IEEE Transaction on Knowledge and Data Engineering, Vol. 17, No. 4, 2005, pp. 491-502.

[44] H. Liu, “Evolving feature selection”, IEEE Intelligent Systems, 2005, pp.64-76.

[45] S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection”, Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 74-81.

[46] A.L Blum, and P. Langley, “Selection of Relevant Features and Examples in machine Learning”, Artificial Intelligence, Vol. 97, No. 1-2, 1997, pp. 245-271.

[47] D. Mladenic, and M. Grobelnik, “Feature Selection for Unbalanced Class Distribution and Naïve Bayes”, Proceedings of the Sixteenth International Conference on Machine Learning (ICML), 1999, pp. 258-267.

[48] H.T. Ng, W.B. Goh, and K.L. Low, “Feature Selection, Perceptron Learning, and A usability Case Study for Text Categorization”, Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval, Philadelphia, PA, 1997, pp. 67-73.

[49] L. Galavotti, F. Sebastiani, and M. Simi, “Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization”, Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, Lisbon, Portugal, 2000, pp. 59-68.

[50] V. Vapnik, “The Nature of Statistical Learning Theory”, Springer-Verlag, New York, 1995.

[51] R. Baeza-Yates, and B. Ribeiro-Neto, “Modern Information Retrieval”, Addison-Wesley & ACM Press, 1999.


Visual Attention in Foveated Images

Abulfazl Yavari1, H.R. Pourreza2
1Jahad Daneshgahi Institute of Higher Education of Kashmar, Kashmar, Khorasan Razavi, IRAN
2Ferdowsi University of Mashad, Mashad, Khorasan Razavi, IRAN
E-mails: Abulfazl_yavari@yahoo.com, [email protected]

Abstract – In this paper, we present a new visual attention system that is able to detect attentive areas in images with non-uniform resolution. Since one of the goals of visual attention systems is to simulate human perception, and the human visual system processes foveated images, visual attention systems should be able to identify the salient regions of such images. We test the system on two types of images: real-world and artificial images. The real-world images are cloth images with some defects, and the system should detect these defects. In the artificial images, one element differs from the others in only one feature, and the system should detect it.

Index Terms – Visual Attention, Foveated Images

I. INTRODUCTION

A. Visual Attention

Visual attention is a selective process that enables a person to act effectively in a complicated environment [1]. We all frequently use the word “attention” in our everyday conversations. A definition of attention is given in [2]: “attention defines the mental ability to select stimuli, responses, and memories”. These stimuli can be anything. For instance, think of a situation in which you are studying and concentrating on the text you are reading. Suddenly you are interrupted by a loud sound, or you smell something burning, and it attracts your attention. Similarly, there are stimuli in the environment that affect our vision, for example a moving object, a picture on the wall, or a bright area in a dark room. These are examples of stimuli that automatically attract our attention without any predetermined goal. This type of attention is called bottom-up attention [3] (Fig. 1).

Fig. 1. A picture on the wall attracts our attention (bottom-up attention).

There are other cases in which we are looking for a particular object in the environment, and everything that has features similar to that object attracts our attention. Assume, for instance, that we are looking for a red pen on a table. Anything with a red color or with a shape like a pen will attract us, and so we may find the desired object in the first focus of attention or in one of the next. This type of attention is called top-down attention [3] (Fig. 2).

Fig. 2. When we are looking for a red pen on the table, regions that have a red color attract our attention (top-down attention).

B. Foveated Images

The human vision system (HVS) is a space-variant system [4]. This means that, moving away from the gazing point, the resolution gradually decreases and only the overall impression of the scene survives. Images that have this feature are called foveated images. The area with the highest resolution is called the fovea [5] (Fig. 3).

We can find the source of this behaviour by studying the structure of the eye. There are two kinds of vision cells on the retina: cone cells and rod cells. Cone cells are far fewer in number than rod cells, but they are sensitive to color, and each cell is individually connected to a nerve. Cone cells are gathered in the fovea area of the retina. Rod cells are far more numerous than cone cells and lie in the area around the fovea. Multiple rod cells are connected to a single shared nerve, and they are sensitive to light [6].

Because humans have non-uniform vision and their brains in fact perform special processing on foveated images, in this paper we concentrate on visual attention in foveated images. Up to now, visual attention has been studied only on normal images (with uniform resolution).

T. Sobh (ed.), Advances in Computer and Information Sciences and Engineering, 17–20. © Springer Science+Business Media B.V. 2008