Machine Learning Approach for Efficient Keyword Prediction


ABSTRACT

Internet services and applications have become an inextricable part of daily life, enabling communication and the management of personal information from anywhere. To accommodate this increase in application and data complexity, web services have moved to a multitiered design wherein the webserver runs the application front-end logic and data are outsourced to a database or file server. In this paper, we present Doubleguard, an IDS that models the network behavior of user sessions across both the front-end webserver and the back-end database. By monitoring both web and subsequent database requests, we are able to ferret out attacks that an independent IDS would not be able to identify. Furthermore, we quantify the limitations of any multitier IDS in terms of training sessions and functionality coverage. We implemented Doubleguard using an Apache webserver with MySQL and lightweight virtualization. We then collected and processed real-world traffic over a 15-day period of system deployment in both dynamic and static web applications. Finally, using Doubleguard, we were able to expose a wide range of attacks with 100 percent accuracy while maintaining 0 percent false positives for static web services and 0.6 percent false positives for dynamic web services.

INTRODUCTION

The immense popularity of the Internet has produced a significant stimulus to P2P file sharing systems, where a file requester's query is forwarded to a file provider in a distributed manner. The median file size in these P2P systems is 4 MB, a 1,000-fold increase over the 4 KB median size of typical web objects. Measurement studies also show that access to these files is highly repetitive and skewed toward the most popular ones. In such circumstances, if a server receives many requests at a time, it can become overloaded and consequently cannot respond to the requests quickly. Therefore, highly popular files (i.e., hot files) can exhaust the bandwidth capacity of the servers, leading to low efficiency in file sharing.

File replication is an effective method to deal with the problem of server overload by distributing load over replica nodes. It helps to achieve high query efficiency by reducing server response latency and lookup path length (i.e., the number of hops in a lookup path). A more effective file replication method produces a higher replica hit rate. A replica hit occurs when a file request is resolved by a replica node rather than the file owner; the replica hit rate denotes the percentage of file queries that are resolved by replica nodes among total queries. Recently, numerous file replication methods have been proposed. These methods can generally be classified into three categories, denoted Server Side, Client Side, and Path. Server Side replicates a file close to the file owner; Client Side replicates a file close to or at a file requester; and Path replicates on the nodes along the query path from a requester to a file owner. However, most of these methods either have low effectiveness in improving query efficiency or come at the cost of high overhead. By replicating files on the nodes near the file owners, Server Side enhances the replica hit rate and query efficiency. However, it cannot significantly reduce path length, because replicas remain close to the file owners, and it may overload the replica nodes, since a node has a limited number of neighbors. On the other hand, Client Side can dramatically improve query efficiency when a replica node queries for its replica files, but such a case is not guaranteed to occur, as node interest varies over time. Moreover, these replicas have a low chance of serving other requesters. Thus, Client Side cannot ensure a high hit rate and replica utilization. Path avoids the problems of Server Side and Client Side.

It provides a high hit rate and greatly reduces lookup path length. However, its effectiveness is outweighed by the high cost of replicating and maintaining many more replicas. Furthermore, it may produce underutilized replicas. Since more replicas lead to higher query efficiency but also more maintenance overhead, a challenge for a replication algorithm is how to minimize replicas while still achieving high query efficiency. From a technical perspective, the key to the success of a corporate network is choosing the right data sharing platform: a system which makes the shared data (stored and maintained by different companies) visible network-wide and supports efficient analytical queries over those data. Traditionally, data sharing is achieved by building a centralized data warehouse, which periodically extracts data from the internal production systems (e.g., ERP) of each company for subsequent querying. Unfortunately, such a warehousing solution has some deficiencies in real deployment.
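As a minimal illustration of the replica hit rate metric defined above, the following C# sketch (all names and data here are hypothetical) counts the fraction of queries resolved by replica nodes rather than the file owner:

using System;
using System.Collections.Generic;

class ReplicaHitRateDemo
{
    static void Main()
    {
        // true  = query resolved by a replica node (replica hit);
        // false = query resolved by the file owner.
        var resolvedByReplica = new List<bool> { true, false, true, true, false };

        int replicaHits = 0;
        foreach (bool hit in resolvedByReplica)
            if (hit) replicaHits++;

        // Replica hit rate = replica hits / total queries.
        double hitRate = (double)replicaHits / resolvedByReplica.Count;
        Console.WriteLine("Replica hit rate: " + hitRate); // prints 0.6
    }
}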

Fig. 1.1 A Cloud Peer-to-Peer System

Finally, to maximize the revenues, companies often dynamically adjust their business process and may change their business partners. Therefore, the participants may join and leave the corporate networks at will. The data warehouse solution has not been designed to handle such dynamicity. To address the aforementioned problems, this paper presents Doubleguard, a cloud enabled data sharing platform designed for corporate network applications.

By integrating cloud computing, database, and peer-to-peer (P2P) technologies, Doubleguard achieves query processing efficiency and is a promising approach for corporate network applications, with the following distinguishing features. Doubleguard is deployed as a service in the cloud. To form a corporate network, companies simply register their sites with the Doubleguard service provider, launch Doubleguard instances in the cloud, and finally export data to those instances for sharing. Doubleguard adopts the pay-as-you-go business model popularized by cloud computing.

Fig. 1.2 Doubleguard Cloud Cluster

The total cost of ownership is therefore substantially reduced, since companies do not have to buy any hardware/software in advance. Instead, they pay for what they use in terms of Doubleguard instance hours and storage capacity.

LITERATURE REVIEW

1. Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging

Author: Michael D. Fox and Marcus E. Raichle

This approach is a paradigm that requires subjects to open and close their eyes at fixed intervals. Modulation of the functional magnetic resonance imaging (fMRI) blood oxygen level dependent (BOLD) signal attributable to the experimental paradigm can be observed in distinct brain regions, such as the visual cortex, allowing one to relate brain topography to function. However, spontaneous modulation of the BOLD signal that cannot be attributed to the experimental paradigm, or to any other explicit input or output, is also present. Because it has been viewed as noise in task-response studies, this spontaneous component of the BOLD signal is usually minimized through averaging.

Methods for analyzing spontaneous BOLD data: Spontaneous neuronal activity refers to activity that is not attributable to specific inputs or outputs; it represents neuronal activity that is intrinsically generated by the brain. As such, fMRI studies of spontaneous activity attempt to minimize changes in sensory input and refrain from requiring subjects to make responses or perform specific cognitive tasks. Most studies are conducted during continuous resting-state conditions, such as fixation on a cross-hair or eyes-closed rest. Subjects are usually instructed simply to lie still in the scanner and refrain from falling asleep. After data acquisition, two important data analysis issues must be considered: how to account for non-neuronal noise and how to identify spatial patterns of spontaneous activity.

Advantage

Measure of changes in neuronal activity.

Disadvantage

It doesn't consider different models for different regions of the time series.

Higher time and computational cost.

2. A Wavelet-Based Anytime Algorithm for K-Means Clustering

Author: Michail Vlachos, Jessica Lin, Eamonn Keogh, Dimitrios Gunopulos

The emergence of the field of data mining in the last decade has sparked an increase of interest in the clustering of time series. Such clustering is useful in its own right as a method to summarize and visualize massive datasets. Clustering is also often used as a subroutine in other data mining algorithms, such as similarity search, classification, and the discovery of association rules. Applications of these algorithms cover a wide range of activities found in finance, meteorology, industry, medicine, etc.

Although there has been much research on clustering in general, the unique structure of time series means that most classic machine learning and data mining algorithms do not work well on them. In particular, the high dimensionality, very high feature correlation, and the (typically) large amount of noise that characterize time series data present a difficult challenge.

Anytime algorithms are algorithms that trade execution time for quality of results. In particular, an anytime algorithm always has a best-so-far answer available, and the quality of the answer improves with execution time. The user may examine this answer at any time and choose to terminate the algorithm, temporarily suspend it, or allow it to run to completion. It would be highly desirable to implement the clustering algorithm as an anytime algorithm. This would allow a user to examine the best current answer after an hour or so as a sanity check of all assumptions and parameters. As a simple example, suppose the user had accidentally set the value of K to 50 instead of the desired value of 5. Using a batch algorithm, the mistake would not be noticed for a week, whereas using an anytime algorithm the mistake could be noticed early on and the algorithm restarted with little cost.

Advantage

Works from a very coarse resolution representation of the data.

Deals with noisy (dirty) data quickly.

Improves the execution time.

Disadvantage

fMRI brain activity data collection is not amenable to wavelet analysis.

3. Model-based Classification of Data with Time Series-valued Attributes

Author: Claudia Plant, Andrew Zherdin, Leonhard Laer

Time series data are collected in many applications, including finance, science, natural language processing, medicine, and multimedia. The content of large time series databases cannot be analyzed manually. To support the knowledge discovery process from time series databases, effective and efficient data mining methods are required. Often, the primary goal of the knowledge discovery process is classification, which is the task of automatically assigning class labels to data objects. To learn rules, strategies, or patterns for automatic classification, the classifier needs to be trained on a set of data objects for which the class labels are known. Typically, this so-called training data set has been labeled by human experts. Based on the patterns learned from the training data set, the classifier automatically assigns labels to new, unseen objects.

The classification of the test objects is then simple. To classify an object, we sum up the mean square error over all relevant models of each class, and assign the object to the class with the smallest total mean square error.
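The classification rule above reduces to an argmin over classes. The sketch below is a hedged illustration (the class names and error values are invented; in the real method, the mean square errors would come from fitting the object's time series against each class's models):

using System;
using System.Collections.Generic;
using System.Linq;

class MseClassifierSketch
{
    // Assign the object to the class whose models give the smallest
    // summed mean square error (MSE).
    static string Classify(Dictionary<string, double> mseByClass)
    {
        return mseByClass.OrderBy(kv => kv.Value).First().Key;
    }

    static void Main()
    {
        // Hypothetical per-class MSE totals for one test object.
        var mse = new Dictionary<string, double>
        {
            { "walking", 0.42 }, { "running", 0.17 }, { "jumping", 0.88 }
        };
        Console.WriteLine(Classify(mse)); // prints "running"
    }
}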

Advantage Increasing amounts of captured motion stream.

Motion classification is the task to automatically assign movements.

Time series obtained from the different sensors.

Disadvantage Multiple sensors to capture human movements.

No Interaction Region consideration.

4. CoRE: A Context-Aware Relation Extraction Method for Relation Completion

Author: Zhixu Li, Mohamed A. Sharaf, Laurianne Sitbon, Xiaoyong Du, and Xiaofang Zhou, Senior Member, IEEE

ABSTRACT

Relation completion (RC) is a recurring problem that is central to the success of novel big data applications such as entity reconstruction and data enrichment. Given a semantic relation R, RC attempts to link entity pairs between two entity lists under the relation R. To accomplish the RC goals, we propose to formulate search queries for each query entity a based on some auxiliary information, so as to detect its target entity b from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as the auxiliary information in formulating search queries. However, requiring high-quality patterns may decrease the probability of finding suitable target entities. As an alternative, we propose the CoRE method, which uses context terms, learned from text surrounding the expression of a relation, as the auxiliary information in formulating queries. The experimental results based on several real-world web data collections demonstrate that CoRE reaches a much higher accuracy than PaRE for the purpose of RC.

DISADVANTAGES

This data is typically unstructured and naturally lacks any binding information (i.e., foreign keys). Linking this data clearly goes beyond the capabilities of current data integration systems.

This motivated novel frameworks that incorporate information extraction (IE) tasks such as named entity recognition

(NER) and relation extraction (RE).

Those frameworks have been used to enable some of the emerging data linking applications such as entity reconstruction and data enrichment.

EXISTING SYSTEM

The corporate network needs to scale up to support thousands of participants, while the installation of a large-scale centralized data warehouse system entails nontrivial costs, including huge hardware/software investments (a.k.a. total cost of ownership) and high maintenance costs (a.k.a. total cost of operations). In the real world, most companies are not keen to invest heavily in additional information systems until they can clearly see the potential return on investment (ROI). Second, companies want to fully customize their access control policy to determine which business partners can see which part of their shared data. Unfortunately, most data warehouse solutions fail to offer such flexibility. Finally, to maximize revenues, companies often dynamically adjust their business processes and may change their business partners. Therefore, the participants may join and leave the corporate network at will. The data warehouse solution has not been designed to handle such dynamicity.

Existing P2P search techniques are based on either unstructured hint-based routing or structured Distributed Hash Table (DHT)-based routing; neither of these two paradigms can provide a satisfactory solution to the DPM problem.

Unstructured techniques are not efficient in terms of the generated volume of search messages; moreover, no guarantee on search completeness is provided.

Structured techniques, on the other hand, strive to build an additional layer on top of a cloud protocol for supporting partial-prefix matching.

Cloud peer-to-peer mechanisms cluster keys based on numeric distance. However, for efficient subset matching, keys should be clustered based on Hamming distance.
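To make the distinction concrete: numeric distance compares key magnitudes, while Hamming distance counts differing bit positions. A minimal C# sketch (the keys are illustrative, not the system's actual key format):

using System;

class HammingDemo
{
    // Hamming distance between two keys: the number of differing bit positions.
    static int Hamming(uint a, uint b)
    {
        uint x = a ^ b; // bits that differ
        int count = 0;
        while (x != 0) { count += (int)(x & 1); x >>= 1; }
        return count;
    }

    static void Main()
    {
        // 8 (1000) and 7 (0111) are numerically adjacent but differ in all 4 bits,
        // so Hamming-based clustering groups keys very differently.
        Console.WriteLine(Hamming(8, 7));  // 4
        Console.WriteLine(Hamming(8, 12)); // 1
    }
}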

DISADVANTAGES OF EXISTING SYSTEM:

The corporate network needs to scale up to support thousands of participants, while the installation of a large-scale centralized data warehouse system entails nontrivial costs, including huge hardware/software investments and high maintenance costs.

Most data warehouse solutions fail to offer such flexibility.

The warehousing solution has deficiencies in real deployment.

It is expensive.

PROPOSED SYSTEM

Doubleguard achieves query processing efficiency and is a promising approach for corporate network applications, with the following distinguishing features. Doubleguard is deployed as a service in the cloud. To form a corporate network, companies simply register their sites with the Doubleguard service provider, launch Doubleguard instances in the cloud, and finally export data to those instances for sharing. Doubleguard adopts the pay-as-you-go business model popularized by cloud computing. The total cost of ownership is therefore substantially reduced, since companies do not have to buy any hardware/software in advance.

Instead, they pay for what they use in terms of Doubleguard instance hours and storage capacity. Doubleguard extends role-based access control to the inherently distributed environment of corporate networks. Through a web console interface, companies can easily configure their access control policies and prevent undesired business partners from accessing their shared data. Doubleguard employs P2P technology to retrieve data between business partners. Doubleguard instances are organized as a structured P2P overlay network named BATON. The data are indexed by table name, column name, and data range for efficient retrieval. Doubleguard employs a hybrid design for achieving high-performance query processing. The major workload of a corporate network is simple, low-overhead queries.

Such queries typically involve querying only a very small number of business partners and can be processed in a short time. Doubleguard is mainly optimized for these queries. For infrequent, time-consuming analytical tasks, we provide an interface for exporting the data from Doubleguard to Hadoop and allow users to analyze those data using MapReduce.
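As a hedged sketch of the indexing idea described above (the type and field names are assumptions for illustration, not Doubleguard's actual API), an index entry keyed by table name, column name, and data range might look like this, letting a query be routed to the peer whose range covers the requested value:

using System;

class RangeIndexEntry
{
    public string TableName;
    public string ColumnName;
    public double LowerBound;  // inclusive lower end of the indexed data range
    public double UpperBound;  // inclusive upper end of the indexed data range
    public string PeerAddress; // peer holding the indexed data

    public bool Covers(double value)
    {
        return value >= LowerBound && value <= UpperBound;
    }
}

class IndexDemo
{
    static void Main()
    {
        var entry = new RangeIndexEntry
        {
            TableName = "Orders", ColumnName = "Amount",
            LowerBound = 0, UpperBound = 999, PeerAddress = "peer-17"
        };
        // Route a query for Amount = 250 to the peer covering that range.
        Console.WriteLine(entry.Covers(250) ? entry.PeerAddress : "not here");
    }
}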

The main contribution of this paper is the design of the Doubleguard system, which provides economical, flexible, and scalable solutions for corporate network applications. We demonstrate the efficiency of Doubleguard by benchmarking it against HadoopDB, a recently proposed large-scale data processing system, over a set of queries designed for data sharing applications. The results show that, for simple low-overhead queries, the performance of Doubleguard is significantly better than that of HadoopDB.

We studied the unique challenges posed by sharing and processing data in an inter-business environment and proposed Doubleguard, a system which delivers elastic data sharing services by integrating cloud computing, database, and peer-to-peer technologies.

ADVANTAGES OF PROPOSED SYSTEM

Our system can efficiently handle typical workloads in a corporate network and can deliver near linear query throughput as the number of normal peers grows.

Doubleguard adopts the pay-as-you-go business model popularized by cloud computing. The total cost of ownership is therefore substantially reduced, since companies do not have to buy any hardware/software in advance. Instead, they pay for what they use in terms of Doubleguard instance hours and storage capacity.

Doubleguard extends role-based access control to the inherently distributed environment of corporate networks.

Doubleguard employs P2P technology to retrieve data between business partners.

Doubleguard is a promising solution for efficient data sharing within corporate networks. It provides economical, flexible and scalable solutions for corporate network applications.

It is more efficient.

It prevents undesired business partners from accessing shared data.

SYSTEM SPECIFICATION

HARDWARE REQUIREMENTS

Hard disk: 40 GB

RAM: 512 MB

Processor: Pentium IV

Monitor: 17" color monitor

Input devices: Keyboard, mouse, multimedia

SOFTWARE REQUIREMENTS

Front End: Visual Studio .NET 2010

Platform: ASP.NET

Code Behind: C#.NET

Back End: SQL Server 2008

Operating System: Windows XP SP3, Vista, 7

4.3 SOFTWARE DESCRIPTION

4.3.1 .NET FRAMEWORK OVERVIEW

The .NET technology provides a new approach to software development. This is the first development platform designed from the ground up with the Internet in mind. Previously, Internet functionality was simply bolted on to pre-Internet operating systems like Unix and Windows, which required Internet software developers to understand a host of technologies and integration issues. .NET is designed and intended for highly distributed software, making Internet functionality and interoperability easier and more transparent to include in systems than ever before. .NET was first introduced in 2002 as .NET 1.0 and was intended to compete with Sun's Java. .NET is easy to pick up, though the basics of the C language family are required; with that foundation, you can learn it step by step. Unlike Java, .NET is not free software, yet the source for the Base Class Library is available under the Microsoft Reference License. .NET is designed for ease of creation of Windows programs.

4.3.2 ABOUT DOT NET

Microsoft has invested millions in marketing, advertising, and development to produce what it feels is the foundation of the future Internet. It is a corporate initiative whose strategy was deemed so important that Bill Gates himself, Microsoft Chairman and CEO, decided to personally oversee its development. It is a technology that Microsoft claims will reinvent the way companies carry out business globally for years to come. In his opening speech at the Professional Developers Conference (PDC) held in Orlando, Florida in July 2000, Gates stated that a transition of this magnitude only comes around once every five to six years.

4.3.3 OVERVIEW OF THE .NET FRAMEWORK:

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet. The .NET Framework is designed to fulfill the following objectives:

To provide a code-execution environment that minimizes software deployment and versioning conflicts.

To provide a code-execution environment that guarantees safe execution of code, including code created by an unknown or semi-trusted third party.

To provide a code-execution environment that eliminates the performance problems of scripted or interpreted environments.

To make the developer experience consistent across widely varying types of applications, such as Windows-based applications and Web-based applications.

To build all communication on industry standards to ensure that code based on the .NET Framework can integrate with any other code.

The .NET Framework has two main components: the common language runtime and the .NET Framework class library. The common language runtime is the foundation of the .NET Framework. You can think of the runtime as an agent that manages code at execution time, providing core services such as memory management, thread management, and remoting, while also enforcing strict type safety and other forms of code accuracy that ensure security and robustness. In fact, the concept of code management is a fundamental principle of the runtime. Code that targets the runtime is known as managed code, while code that does not target the runtime is known as unmanaged code. The class library, the other main component of the .NET Framework, is a comprehensive, object-oriented collection of reusable types that you can use to develop applications ranging from traditional command-line or graphical user interface (GUI) applications to applications based on the latest innovations provided by ASP.NET, such as Web Forms and XML Web services.

The .NET Framework can be hosted by unmanaged components that load the common language runtime into their processes and initiate the execution of managed code, thereby creating a software environment that can exploit both managed and unmanaged features. The .NET Framework not only provides several runtime hosts, but also supports the development of third-party runtime hosts.

Sql Server

The RDBMS concept is gaining momentum all over the world. Microsoft SQL Server is an RDBMS for Windows, released in the USA by Microsoft Corporation.

Since processing calls for extensive data input and processing, retrieval of the required information must be quick and efficient. SQL Server supports the event-driven nature of the Windows environment and has many event-trapping features such as On Click, On Open, On Double Click, Before Update, and After Update.

Event procedures are coded and tagged to those events according to the needs of the application. These procedures run at those particular events, so the whole coding is based on an event-driven methodology. The forms of SQL Server help to create tables and screens, queries aid in creating complicated queries, and generating informative reports is made an easy task.

SQL Server stores records in organized lists called tables. One or more tables in SQL Server make up a whole database. A table is just a collection of records with the same structure; all of the records in a table contain the same type of information. SQL Server allows setting up tables and linking them to other tables. Microsoft SQL Server is a relational database: the data in several tables are linked through one or more fields present in the tables. It is this business of linked tables that separates database programs like SQL Server from the other type of database, a flat file database, which allows only a single table in which to store all information. Microsoft SQL Server extends the performance, reliability, quality, and ease of use of Microsoft SQL Server version 7.0. Microsoft SQL Server includes several new features that make it an excellent database platform for large-scale online transaction processing (OLTP), data warehousing, and e-commerce applications. The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server Analysis Services; the term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component.

ABOUT C# .NET

C# (pronounced "see sharp" or "C Sharp") is one of many .NET programming languages. It is object-oriented and allows you to build reusable components for a wide variety of application types. Microsoft introduced C# on June 26th, 2000 and it became a v1.0 product on Feb 13th 2002.

C# is an evolution of the C and C++ family of languages. However, it borrows features from other programming languages, such as Delphi and Java. If you look at the most basic syntax of both C# and Java, the code looks very similar, but then again, the code looks a lot like C++ too, which is intentional. Developers often ask questions about why C# supports certain features or works in a certain way. The answer is often rooted in its C++ heritage.

4.4 SOFTWARE TESTING

System testing provides the final assurance that software, once validated, is properly combined with all other system elements. System testing verifies that all elements have been combined correctly and that overall system function and performance is achieved.

Characteristics of a Good Test

Tests are likely to catch bugs

No redundancy

Not too simple or too complex

TYPES OF TESTING

Unit Testing

In antithesis to the big bang approach, unit testing begins at the vertex of the spiral and concentrates on each unit of the software as implemented in source code. Initially, tests focus on each module individually, ensuring that it functions properly as a unit; hence the name unit testing. Unit testing makes heavy use of white box testing techniques, exercising specific paths in a module's control structure to ensure complete coverage and maximum error detection.

Unit testing focuses verification effort on the smallest unit of software design: the module. Using the procedural design description as a guide, important control paths are tested to uncover errors within the boundary of the module. The relative complexity of the tests and the errors they uncover is limited by the constrained scope established for unit testing.

Unit Test Procedure

Unit testing is considered an adjunct to the coding step. After source-level code has been developed, reviewed, and verified for correct syntax, unit test case design begins. A module is not a standalone program; hence, driver and/or stub software must be developed for each unit test. Stubs serve to replace modules that are subordinate to the module to be tested. Drivers and stubs represent overhead. Unit testing is simplified when a module with high cohesion is designed: when only one function is addressed by a module, the number of test cases is reduced and errors can be more easily predicted and uncovered.

Integration Testing

Integration testing is a systematic technique for constructing a program structure while conducting tests to uncover errors associated with interfacing. The objective is to take unit-tested modules and build the program structure that has been dictated by design. There is often a tendency to attempt non-incremental integration, that is, to construct the program using a big bang approach: all modules are combined in advance, and the entire program is tested as a whole. A set of errors is then encountered, and correction is difficult because isolation of causes is complicated by the vast expanse of the entire program. Incremental integration is the antithesis of the big bang approach: the program is constructed and tested in small segments, where errors are easier to isolate and correct, interfaces are more likely to be tested completely, and a systematic approach may be applied.

Different Incremental Integration Strategies

1. Top-Down Integration.

2. Bottom-up Integration.

3. Regression Testing

Validation Testing

The application is tested to check how it responds to various kinds of input. The user should be informed of any kind of exception in an understandable manner, so that debugging becomes easier. At the culmination of black box testing, software is completely assembled as a package, and interfacing errors have been uncovered and corrected. The next stage is validation testing, which can be defined in many ways, but a simple definition is that validation succeeds when the software functions in the manner that can reasonably be expected by the user. When a user enters incorrect input, the application should not display cryptic error messages; instead, it should display helpful messages enabling the user to use the tool properly.

SYSTEM ARCHITECTURE

MODULES

1. NETWORK FORMATION
2. CLIENT AUTHENTICATION
3. SCAN: A Structural Clustering Algorithm for P2P Networks
4. BESTPEER++ MODULE
5. Carry & Forward Approach
6. PERFORMANCE EVALUATION

CLIENT AUTHENTICATION

Users should register and log in before searching.

User has to be in the specified workgroup.

Admin can login directly to view the reports of all data.

Only authenticated users are taken to the next page.

NETWORK FORMATION

Retrieve the connected systems in the specified workgroup.

The performance of each system is evaluated.

The evaluation classifies systems as long-lived or short-lived to make the search efficient.

SCAN: A STRUCTURAL CLUSTERING ALGORITHM FOR P2P NETWORKS

The list of P2P systems is structured in tree format.

The long-lived systems are taken for the process.

Network performance is classified by this module. From the best-performing system, the data are searched and returned to the client.

The BP++ search makes the system very effective.

The result is indexed on the client system for future use.

We also use Bloom filters (sketched below) to get more results from the P2P network. Specifically, we develop the notion of path- and source-level redundancy.

Given the QoS requirements of a query, we identify optimal path and source redundancy such that not only are the QoS requirements satisfied, but the lifetime of the system is also maximized.
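The Bloom filter mentioned above can be sketched in a few lines. This is an illustrative toy (fixed size, two simple hash functions), not the filter used in the deployed system; it answers "definitely absent" or "possibly present" for keywords:

using System;
using System.Collections;

// Toy Bloom filter: set membership with false positives but no false negatives.
class BloomFilter
{
    private readonly BitArray bits = new BitArray(1024);

    private int Hash1(string key)
    {
        return (key.GetHashCode() & 0x7fffffff) % bits.Length;
    }

    private int Hash2(string key)
    {
        int h = 17;
        foreach (char c in key) h = h * 31 + c;
        return (h & 0x7fffffff) % bits.Length;
    }

    public void Add(string key)
    {
        bits[Hash1(key)] = true;
        bits[Hash2(key)] = true;
    }

    // True means "possibly present"; false means "definitely absent".
    public bool MightContain(string key)
    {
        return bits[Hash1(key)] && bits[Hash2(key)];
    }
}

class BloomDemo
{
    static void Main()
    {
        var filter = new BloomFilter();
        filter.Add("keyword1");
        Console.WriteLine(filter.MightContain("keyword1")); // True
        Console.WriteLine(filter.MightContain("missing"));  // almost surely False
    }
}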

BESTPEER++ MODULE

Replication of files can degrade the performance of the system, so we make the search structure decentralized.

The file name, file size, and file content are searched. If a file matches on more than one system, the replicated copies are deleted automatically (see the sketch below). The data history is maintained on the client side, so memory efficiency is higher than in existing approaches.
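A hedged sketch of the duplicate-detection idea above: treat the combination of file name, size, and a content hash as a file's identity, and flag any later copy (the folder path and helper names are hypothetical):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class DeduplicationSketch
{
    // Identity key for duplicate detection: name, size, and content hash.
    static string FileKey(string path)
    {
        var info = new FileInfo(path);
        using (var md5 = MD5.Create())
        using (var stream = info.OpenRead())
        {
            string hash = BitConverter.ToString(md5.ComputeHash(stream));
            return info.Name + "|" + info.Length + "|" + hash;
        }
    }

    static void Main()
    {
        var seen = new HashSet<string>();
        foreach (string path in Directory.GetFiles(@"C:\shared")) // hypothetical folder
        {
            // HashSet.Add returns false when the key was already present.
            if (!seen.Add(FileKey(path)))
                Console.WriteLine("Duplicate, candidate for deletion: " + path);
        }
    }
}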

Carry & Forward Approach

Carry and forward is a technique in which information is sent to an intermediate station, where it is kept and sent at a later time to the final destination or to another intermediate station.

The intermediate station, or node in a networking context, verifies the integrity of the message before forwarding it.

In general, this technique is used in networks with intermittent connectivity, especially in the wilderness or environments requiring high mobility. It may also be preferable in situations when there are long delays in transmission and variable and high error rates, or if a direct, end-to-end connection is not available.

REPORTS

The reports are all maintained by the administrator.

User information, network information, and data processing are controlled by the administrator.

The indexing is initiated by the administrator.

And the total log files are maintained by the reports module.

From this module, we are able to check and monitor the network for future data processing.

ALGORITHM

DOUBLEGUARD ALGORITHM:

The algorithm developed in this paper employs two forms of redundancy.

The first form is path redundancy: instead of using a single path to connect a source cluster to the processing center, mp disjoint paths may be used. The second is source redundancy: instead of having one sensor node in a source cluster return the requested sensor data, ms sensor nodes may be used to return readings, to cope with data transmission and/or sensor faults. The architecture above illustrates a scenario in which mp = 2 (two paths going from the CH to the processing center) and ms = 5 (five SNs returning sensor readings to the CH). Doubleguard extends role-based access control to the inherently distributed environment of corporate networks. Through a web console interface, companies can easily configure their access control policies and prevent undesired business partners from accessing their shared data. Doubleguard employs P2P technology to retrieve data between business partners. Doubleguard instances are organized as a structured P2P overlay network named BATON. The data are indexed by table name, column name, and data range for efficient retrieval.

Doubleguard employs a hybrid design for achieving high-performance query processing. The major workload of a corporate network is simple, low-overhead queries. Such queries typically involve querying only a very small number of business partners and can be processed in a short time. Doubleguard is mainly optimized for these queries. For infrequent, time-consuming analytical tasks, we provide an interface for exporting the data from Doubleguard to Hadoop and allow users to analyze those data using MapReduce. The analysis performed thus far assumes that a source CH does not aggregate data. The CH may receive up to ms redundant sensor readings due to source redundancy, but will just forward the first one received to the PC; thus, the data packet size is the same. For more sophisticated scenarios, the CH could conceivably also aggregate data for query processing, and the size of the aggregate packet may be larger than the average data packet size. We extend the analysis to deal with data aggregation in two ways. The first is to set a larger size for the aggregated packet that would be transmitted from a source CH to the PC.
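To make the two redundancy forms concrete, the toy sketch below (names and readings are invented) dispatches a query over mp disjoint paths and collects answers from ms sensor nodes, forwarding only the first reading received, as described above:

using System;
using System.Collections.Generic;
using System.Linq;

class RedundancySketch
{
    static void Main()
    {
        int mp = 2; // path redundancy: disjoint paths from the CH to the processing center
        int ms = 5; // source redundancy: sensor nodes returning the requested reading

        // Send the query down every disjoint path.
        foreach (var p in Enumerable.Range(1, mp).Select(i => "path-" + i))
            Console.WriteLine("Query forwarded via " + p);

        // ms sensors answer, but the CH forwards only the first reading received,
        // so the data packet size is unchanged despite the redundancy.
        var readings = Enumerable.Range(1, ms).Select(i => "sensor-" + i + ": 21.5").ToList();
        Console.WriteLine("Forwarded to PC: " + readings.First());
    }
}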

CLUSTERING ALGORITHM:

A clustering algorithm that aims to fairly rotate SNs into the role of CH has been used to organize sensors into clusters for energy conservation purposes. The function of a CH is to manage the network within the cluster, gather sensor reading data from the SNs within the cluster, and relay data in response to a query. The clustering algorithm is executed throughout the system lifetime.

Aggregation of readings

Each cluster has a CH

Users issue queries through any CH.

The CH that receives the query is called the Processing Center (PC).

Each non-CH node selects the CH candidate with the highest residual energy and sends it a cluster join message (which includes the non-CH node's location). The CH acknowledges this message.

Randomly rotating the role of CH among nodes ensures that nodes consume their energy evenly (a sketch of the selection step follows).
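A minimal sketch of the cluster-head selection step in the list above (node identifiers and energy values are made up): each non-CH node picks the candidate with the highest residual energy and sends it a join message:

using System;
using System.Collections.Generic;
using System.Linq;

class NodeInfo
{
    public string Id;
    public double ResidualEnergy; // remaining battery, in arbitrary units
}

class ClusterJoinSketch
{
    static void Main()
    {
        var candidates = new List<NodeInfo>
        {
            new NodeInfo { Id = "CH-A", ResidualEnergy = 0.35 },
            new NodeInfo { Id = "CH-B", ResidualEnergy = 0.80 },
            new NodeInfo { Id = "CH-C", ResidualEnergy = 0.55 }
        };

        // Each non-CH node selects the CH candidate with the highest residual
        // energy and sends it a cluster join message (with its own location).
        NodeInfo chosen = candidates.OrderByDescending(n => n.ResidualEnergy).First();
        Console.WriteLine("cluster join -> " + chosen.Id); // cluster join -> CH-B
    }
}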

DATA FLOW DIAGRAM

Level 0: Cloud Users/Application

Level 1: (diagram not reproduced)

TESTING

SYSTEM TESTING

Testing is done for each module. After testing all the modules, the modules are integrated, and testing of the final system is done with test data specially designed to show that the system will operate successfully under all conditions. Procedure-level testing is done first: by giving improper inputs, the errors that occur are noted and eliminated. Thus, system testing is a confirmation that all is correct and an opportunity to show the user that the system works. The final step involves validation testing, which determines whether the software functions as the user expects. The end user, rather than the system developer, conducts this test; most software developers use a process called alpha and beta testing to uncover errors that only the end user seems able to find.

This is the final step in the system life cycle. Here, we implement the tested, error-free system in a real-life environment and make necessary changes so that it runs in an online fashion. System maintenance is done every month or year, based on company policies; the system is checked for errors such as runtime errors and long-run errors, and other maintenance such as table verification and reports is performed.

UNIT TESTING

Unit testing focuses verification efforts on the smallest unit of software design: the module. This is known as module testing. The modules are tested separately. This testing is carried out during the programming stage itself. In this testing step, each module is found to be working satisfactorily with regard to the expected output from the module.

INTEGRATION TESTING

Integration testing is a systematic technique for constructing tests to uncover errors associated with the interfaces. In this project, all the modules are combined and then the entire program is tested as a whole. In the integration testing step, all the errors uncovered are corrected before the next testing steps.

VALIDATION TESTING

Validation testing uncovers functional errors, that is, it checks whether the functional characteristics conform to the specification.

CONCLUSION

The unique challenges posed by sharing and processing data in an inter-business environment are efficiently addressed by the proposed Doubleguard method, a system which delivers elastic data sharing services by integrating cloud computing, database, and peer-to-peer technologies. Doubleguard is therefore a promising solution for efficient data sharing within corporate networks. Traditional file replication methods for P2P file sharing systems replicate files close to file owners, file requesters, or the query path to relieve the owners' load and, meanwhile, improve file query efficiency. However, replicating files close to the file owner may overload the nodes in the close proximity of the owner and cannot significantly improve query efficiency, since replica nodes are close to the owners. Replicating files close to or at the file requesters only brings benefits when the requester or its nearby nodes always query for the file. In addition, due to non-uniform and time-varying file popularity and node interest variation, the replicas cannot be fully utilized and the query efficiency cannot be improved significantly. Replicating files along the query path improves the efficiency of file query, but it incurs significant overhead.

The proposed Doubleguard file replication algorithm chooses query traffic hubs and frequent requesters as replica nodes to guarantee high utilization of replicas and high query efficiency. Unlike current methods, in which file servers keep track of replicas, it creates and deletes file replicas by dynamically adapting to non-uniform and time-varying file popularity and node interest, in a decentralized manner, based on experienced query traffic. This leads to higher scalability and ensures high replica utilization.


SAMPLE SCREENS

Server

SAMPLE CODINGS

File1.cs

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Data.SqlClient;
using System.Configuration;

namespace SourceMain
{
    public partial class Form1 : Form
    {
        // Connection string is read from the application configuration file.
        string constring = Convert.ToString(ConfigurationSettings.AppSettings["ConnectionString"]);
        int totreqcount;
        string rproceedstatus = "Proceed", empty = "";

        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            // Attach explanatory tooltips to the main controls.
            ToolTip toolTip1 = new ToolTip();
            toolTip1.AutoPopDelay = 5000;
            toolTip1.InitialDelay = 500;
            toolTip1.ReshowDelay = 500;
            toolTip1.ShowAlways = true;
            toolTip1.SetToolTip(this.pictureBox7, "Click To Reload");
            toolTip1.SetToolTip(this.pictureBox1, "Click To Admin Login");
            toolTip1.SetToolTip(this.label3, "Click To Admin Login");
            toolTip1.SetToolTip(this.pictureBox2, "Click To Upload Files");
            toolTip1.SetToolTip(this.label4, "Click To Upload Files");
            toolTip1.SetToolTip(this.pictureBox8, "Click To Verify Requested Files");
            toolTip1.SetToolTip(this.label9, "Click To Verify Requested Files");
            toolTip1.SetToolTip(this.pictureBox3, "Click To Start Transaction");
            toolTip1.SetToolTip(this.label5, "Click To File Start Transaction");

            // Clear any stale upload records from the previous run.
            SqlConnection con = new SqlConnection(constring);
            con.Open();
            SqlCommand cmd = new SqlCommand("Delete From FileUpload", con);
            cmd.ExecuteNonQuery();
            con.Close();

            // Everything except the admin login is disabled until login succeeds.
            groupBox1.Visible = false;
            pictureBox2.Enabled = false;
            label4.Enabled = false;
            pictureBox8.Enabled = false;
            label9.Enabled = false;
            pictureBox3.Enabled = false;
            label5.Enabled = false;
        }

        // Show the admin login panel.
        private void pictureBox1_Click(object sender, EventArgs e)
        {
            pictureBox4.Visible = false;
            pictureBox5.Visible = false;
            groupBox1.Visible = true;
        }

        private void label3_Click(object sender, EventArgs e)
        {
            pictureBox4.Visible = false;
            pictureBox5.Visible = false;
            groupBox1.Visible = true;
        }

        // Reset the form to its logged-out state.
        private void pictureBox7_Click(object sender, EventArgs e)
        {
            pictureBox1.Enabled = true;
            label3.Enabled = true;
            pictureBox4.Visible = true;
            pictureBox5.Visible = true;
            groupBox1.Visible = false;
            pictureBox2.Enabled = false;
            label4.Enabled = false;
            pictureBox3.Enabled = false;
            label5.Enabled = false;
            pictureBox8.Enabled = false;
            label9.Enabled = false;
        }

        // Clear the login fields.
        private void button2_Click(object sender, EventArgs e)
        {
            textBox1.Text = "";
            textBox2.Text = "";
        }

        // Validate the admin credentials and enable the main controls.
        private void button1_Click(object sender, EventArgs e)
        {
            string txt1 = textBox1.Text.ToUpper();
            string txt2 = textBox2.Text.ToUpper();
            if (txt1 == "ADMIN" && txt2 == "ADMIN")
            {
                groupBox1.Visible = false;
                pictureBox1.Enabled = false;
                label3.Enabled = false;
                pictureBox2.Enabled = true;
                label4.Enabled = true;
                pictureBox3.Enabled = true;
                label5.Enabled = true;
                pictureBox8.Enabled = true;
                label9.Enabled = true;
                pictureBox9.Enabled = true;
                label10.Enabled = true;
            }
            textBox1.Text = "";
            textBox2.Text = "";
        }

        // Open the file upload dialog.
        private void pictureBox2_Click(object sender, EventArgs e)
        {
            FileUpload fu = new FileUpload();
            fu.ShowDialog();
        }

        private void label4_Click(object sender, EventArgs e)
        {
            FileUpload fu = new FileUpload();
            fu.ShowDialog();
        }

        // Open the transaction dialog for verifying requested files.
        private void pictureBox8_Click(object sender, EventArgs e)
        {
            Transaction tr = new Transaction();
            tr.ShowDialog();
        }

        private void label9_Click(object sender, EventArgs e)
        {
            Transaction tr = new Transaction();
            tr.ShowDialog();
        }

        // Start the transaction only when exactly three files are marked "Proceed".
        private void pictureBox3_Click(object sender, EventArgs e)
        {
            StartTransactionIfReady();
        }

        private void label5_Click(object sender, EventArgs e)
        {
            StartTransactionIfReady();
        }

        private void StartTransactionIfReady()
        {
            SqlConnection con = new SqlConnection(constring);
            con.Open();
            SqlDataAdapter adp1 = new SqlDataAdapter(
                "Select COUNT(rstatus) as reqstatus from FileUpload where rstatus='" + rproceedstatus + "'", con);
            DataSet ds1 = new DataSet();
            adp1.Fill(ds1);
            totreqcount = Convert.ToInt32(ds1.Tables[0].Rows[0]["reqstatus"].ToString());
            if (totreqcount == 3)
            {
                Transaction tr = new Transaction();
                tr.ShowDialog();
            }
            else
            {
                MessageBox.Show("ERROR - DO NOT PROCEED FILES.", "Message Box",
                    MessageBoxButtons.OK, MessageBoxIcon.Warning);
            }
            con.Close();
        }

        // Open the file request dialog.
        private void pictureBox9_Click(object sender, EventArgs e)
        {
            RequestFiles rf = new RequestFiles();
            rf.ShowDialog();
        }

        private void label10_Click(object sender, EventArgs e)
        {
            RequestFiles rf = new RequestFiles();
            rf.ShowDialog();
        }
    }
}

FileUpload.cs

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Data.SqlClient;
using System.Configuration;
using System.IO;

namespace SourceMain
{
    public partial class FileUpload : Form
    {
        // Connection string is read from the application configuration file.
        string constring = Convert.ToString(ConfigurationSettings.AppSettings["ConnectionString"]);
        Class1 cs = new Class1();
        string fileDes, fileini, empty = "", rstatus = "Start";
        int len;
        string yes, yes1, yes2;

        public FileUpload()
        {
            InitializeComponent();
        }

        private void FileUpload_Load(object sender, EventArgs e)
        {
        }

        // Browse for the first file and show its name, size, and extension.
        private void btnbrowse_Click(object sender, EventArgs e)
        {
            textBox2.Text = "";
            openFileDialog1.ShowDialog();
            fileDes = openFileDialog1.FileName;
            if (fileDes == "openFileDialog1") // dialog's default FileName: nothing chosen
            {
                MessageBox.Show("Select any one File.", "Message Box",
                    MessageBoxButtons.OK, MessageBoxIcon.Warning);
                textBox2.Text = "";
                button1.Enabled = false;
                yes = null;
            }
            else
            {
                yes = "yes";
                textBox2.Text = openFileDialog1.FileName;
                len = fileDes.Length;
                fileini = fileDes.Substring(fileDes.IndexOf("\\") + 1);
                button1.Enabled = true;
                FileInfo fi = new FileInfo(openFileDialog1.FileName);
                label2.Text = fi.Name;
                label3.Text = Convert.ToString(fi.Length) + " bytes";
                label5.Text = Path.GetExtension(openFileDialog1.FileName);
            }
        }

        // Browse for the second file, mirroring btnbrowse_Click.
        private void button3_Click(object sender, EventArgs e)
        {
            textBox3.Text = "";
            openFileDialog2.ShowDialog();
            fileDes = openFileDialog2.FileName;
            if (fileDes == "openFileDialog2") // dialog's default FileName: nothing chosen
            {
                MessageBox.Show("Select any one File.", "Message Box",
                    MessageBoxButtons.OK, MessageBoxIcon.Warning);
                textBox3.Text = "";
                button1.Enabled = false;
                yes1 = null;
            }
            else
            {
                yes1 = "yes";
                textBox3.Text = openFileDialog2.FileName;
                len = fileDes.Length;
                fileini = fileDes.Substring(fileDes.IndexOf("\\") + 1);
                button1.Enabled = true;
                FileInfo fi = new FileInfo(openFileDialog2.FileName);
                label16.Text = fi.Name;
                label13.Text = Convert.ToString(fi.Length) + " bytes";
                label10.Text = Path.GetExtension(openFileDialog2.FileName);
            }
        }
    }
}

[Data flow diagram labels, figures not reproduced. P2P sharing flow: Register, Login, Connected Systems, Work Group, Search, Short-Lived, Long-Lived, Performance Evaluation, File Sharing, Keyword, Keyword Split, Bloom Filter, Possible Keywords, Best From the List, Duplicate Identification, Data Alive in Best-Performing System, Remove Duplication, Indexing, Database, Retrieved Data, Network Tree, Trained Data, File Name, File Size, Content. Workflow recommend system: Framework, Remote Server, DataSet, WorkFlow, User Authentication, Resource Request, Inter-Process Communication, Cluster, Data Broker, Data Set Request, Workflow Scheduler.]