Agent-Based Knowledge Discovery:

Survey and Evaluation

A Term Paper for

EE380L Data Mining

WRITTEN BY:

Austin Bingham

Paul Chan

Dung Lam

SUBMITTED TO:

Dr. Joydeep Ghosh

May 2000

1. Introduction

With the prevalence of networking in the past decade, data has not only grown exponentially in size but also become more decentralized and disordered. In addition, databases, knowledge bases, and online repositories of information (such as dictionaries, user survey results, and server logs) around the world can now interact with one another. These intertwining networks of data sources present a challenge for knowledge discovery, as most existing techniques assume a single source of data. To make matters worse, there is no agreed-upon method for discovering knowledge through distributed information gathering from heterogeneous data sources. Consequently, the rate of knowledge discovery fails to keep up with that of data generation, and the percentage of knowledge relative to the amount of data declines steadily. To remedy the situation, researchers have developed Agent-based Knowledge Discovery (ABKD) as a new paradigm that combines the two fields of Distributed Artificial Intelligence and Machine Learning [Davies and Edwards 1995A]. The purpose of this paper is to examine how we can apply existing agent-based techniques to the knowledge discovery, or data-mining, field. From this evaluation, an ideal agent-based system is proposed, along with the issues that must be considered.

An agent-based data mining system is a natural choice for mining large sets of inherently distributed data. One example application of such systems is military decision-making [Yang, Honavar, Miller, and Wong 1998]. Every day, commanders and intelligence analysts need to access critical information in a timely fashion. Typical day-to-day operations involve intelligence data gathering and analysis, situation monitoring and assessment, and looking for potentially interesting patterns in data, such as relationships between troop movements and significant political developments in a region. This information can be valuable for decision-makers taking both proactive and reactive measures designed to safeguard a nation's security concerns. In a crisis, we need to be able to deliver accurate information to the decision-makers at the right time without overwhelming them with large volumes of irrelevant data. This involves physically distributed data sources, including satellite images, intelligence reports, and records of communication with officers at the front. In such a situation, nobody can afford the delay of sending large volumes of data back for central processing before relevant information can be presented to decision-makers.

Financial institutions and law-enforcement agencies share similar information-processing needs. In order to predict market fluctuations, brokerage houses need to analyze news and financial transactions from all over the world in real time. The continuous growth in the amount of data to process makes centralized analysis impossible. Moreover, the prevalence of electronic commerce demands a secure and trusted inter-banking network with high-speed verification and authentication mechanisms. This requires a widely deployed system that detects local fraudulent transaction attempts and propagates the attack information as soon as possible [Chan and Stolfo 1996]. Likewise, law-enforcement agencies need to obtain case histories and crime patterns from one another to coordinate nationwide or even worldwide efforts to fight crime.

This paper surveys and evaluates ABKD systems. Section 2 introduces the idea of applying agent technology in distributed data mining and describes some metrics that are useful for evaluating the existing ABKD architectures in Section 3. Section 4 proposes the desired characteristics of an ideal ABKD architecture. Section 5 suggests some possible future work in ABKD research, and Section 6 concludes the paper.

2. Agent Technology

2.1 What is an Agent?

Many areas of research employ agent technology, and thus the definition of an agent varies according to the focus of the research. For example, research in multi-agent systems (MAS) commonly characterizes agents as autonomous and able to plan and coordinate within an organization for solving a problem. In ABKD, an agent is a software entity that can 1) interoperate with its data source and/or other agents, 2) receive/gather raw data, 3) process and learn from the data source or from other sources, and 4) coordinate with other agents to produce relevant and useful information. Research in ABKD emphasizes how agents manipulate data and how agents extract information from distributed data sources. Based on this characterization, many aspects of research in ABKD, such as planning, coordination, and communication, overlap with other fields of agent research. This paper, however, limits its description of agent technology to the context of knowledge discovery.

There are two types of agent-based systems: homogeneous systems and heterogeneous systems. Agents in homogeneous systems have the same functionality and capabilities, whereas agents in heterogeneous systems have dissimilar functionalities and capabilities but can still coordinate with one another. In general, heterogeneous systems are useful for processing different kinds of databases using a variety of techniques, but it may be difficult to integrate the resultant heterogeneous information. Agent systems can also be classified by the source of control. In decentralized systems, agents negotiate among themselves to resolve coordination problems. Centralized systems are usually easier to implement but have single points of failure.

In addition, some agent systems allow agents to dynamically change their roles when necessary. Having static agent roles within a system may simplify the coordination mechanism, but the system will be less robust as a whole. Choosing the right characteristics for an ABKD system involves considering what types of data are being mined and what coordination and integration techniques are preferred.

2.2 How does ABKD work?

ABKD systems fit naturally into domains with distributed resources. There are three general methods by which an ABKD system can learn from distributed data. The first method involves collecting all the data into a single repository. This method is impractical and takes no advantage of agents or distributed networks.

Sian researched the second method, which involves information exchange among agents while they learn from their local data [Sian 1991]. In the ideal case, since the agents work as a single algorithm over all the data sources, few or no revisions or integration steps are necessary. However, this method restricts the choice of possible algorithms to those specifically designed for distributed learning. Another drawback of this method is its assumption of consistently reliable communication and secure data channels.

In the third method, agents independently process the data and learn locally. After the agents have completed, they share, refine, and integrate their results. The level of independence in local learning is a design decision that factors into the communication capability of the agents. The third method makes better use of agent technology and is more suitable when the system designers are concerned with network instability and security breaches. It also allows the use of conventional algorithms in the local learning stage. However, problems may arise during the integration phase when agents try to merge different types of results from different local-learning algorithms. Davies and Edwards in particular proposed a high-level model of the third method using multiple distributed agents:

One or more agents per network node are responsible for examining and analyzing a local data source. In addition, an agent may query a knowledge source for existing knowledge (such as rules and predicates). The agents communicate with each other during the discovery process. This allows agents to integrate the new knowledge they produce into a globally coherent theory. A user communicates with the agents via a user-interface. In addition, a supervisory agent responsible for coordinating the discovery agents may exist. … The interface allows the user to assign agents to data sources, and to allocate high-level discovery goals. It allows the user to critique new knowledge discovered by the agents, and to direct the agents to new discovery goals, including ones that might make use of the new knowledge. [Davies and Edwards 1995B]

ABKD systems use software agents to encapsulate the learning functionality of data-mining techniques, as well as to coordinate distributed agents. There is significant interdependence between the integration of gathered information and the coordination mechanism in an ABKD system: if integration is concurrent with the gathering process, the coordination of the agents is critical for accurate knowledge discovery; if integration occurs after agents independently gather information, less coordination effort is required.

Two common techniques for merging or integrating gathered information are theory revision and knowledge integration. Both techniques involve local learning by agents but differ in the way they discover knowledge. Theory revision adopts incremental learning, in which an agent passes the theory it develops to another agent for further refinement with respect to the latter's data sources. In simple knowledge integration, theories are tested against all training examples and the theory that performs best on a test set is selected. ABKD systems can also implement variations of these two techniques. For example, agents can send their theory to every other agent, each of which then adapts the theory to its own local data; the final theory is chosen from the resulting theories based on a test set [Davies and Edwards 1995B].
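To make the contrast concrete, the following Python sketch illustrates the two integration styles under simplifying assumptions of our own (a "theory" is just a set of rules, and the local learner is trivial); the names are illustrative and are not drawn from any of the surveyed systems.

```python
# Minimal illustrative sketch; "theories" are sets of rules and each
# agent holds its own local data. All names here are ours.

class Agent:
    def __init__(self, local_data):
        self.local_data = local_data

    def learn_locally(self, seed_theory=None):
        # Stand-in learner: a "theory" is the set of patterns seen
        # locally, merged with any theory passed in for refinement.
        theory = set(seed_theory or ())
        theory.update(self.local_data)
        return theory

def theory_revision(agents):
    """Incremental learning: each agent refines the theory in turn."""
    theory = None
    for agent in agents:
        theory = agent.learn_locally(seed_theory=theory)
    return theory

def knowledge_integration(agents, test_set):
    """Independent learning: select the local theory that covers the
    most examples in a common test set."""
    theories = [agent.learn_locally() for agent in agents]
    return max(theories, key=lambda t: len(t & test_set))

agents = [Agent({"r1", "r2"}), Agent({"r2", "r3"})]
print(theory_revision(agents))                # {'r1', 'r2', 'r3'}
print(knowledge_integration(agents, {"r3"}))  # {'r2', 'r3'}
```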

2.3 What do Agents Contribute to Data Mining?

With the availability of a wide spectrum of agent systems, ABKD contributes to data mining in a number of ways. First of all, adopting ABKD provides parallelism, which improves the speed, efficiency, and reliability of data mining. The distributed nature of agent systems allows parallel execution of the data-mining process regardless of the number of distant data sources involved. This means that non-parallel data-mining algorithms can still be applied to data that is local relative to an agent, because information about other data sources is not necessary for local operations. It is then the responsibility of the agents to integrate the information from the numerous local sources in collaboration with one another.

Second, agent concepts assist developers in designing distributed data-mining systems. The encapsulation of variables and methods in the object-oriented paradigm leads to the idea of encapsulating data-mining techniques, and thus developers can reuse agent objects that contain existing techniques. After defining the agent objects, the developers can design how the agent objects interact with one another to generate the correct results.

Third, agent concepts give users of a data-mining system the capability to retrieve the discovered knowledge at different stages of progression. For instance, a user may want to view the information gathered by a particular agent before integration takes place. The level of detail retrievable at each stage depends on the implementation of the individual agent-based system.

Another advantage of adopting ABKD is the ability of agents to gather or search for information beyond a single data repository. As an example, we can view the World Wide Web as one large database of web pages with no particular order or organization. An agent can randomly sample from the database (World Wide Web) or it can selectively filter certain items (web pages). The agent can then process the retrieved information or relay the items to other agents for further processing. The rich interactions and coordination among agents distinguish ABKD from conventional techniques.
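A hypothetical sketch of this sampling-filtering-relaying pattern is given below; the functions and the toy "database" of pages are our own illustration, not an API from any surveyed system.

```python
# Hypothetical sketch of an agent sampling or filtering a large,
# unordered collection (standing in for the Web) and relaying the
# result to a peer agent for further processing.
import random

def sample_agent(collection, k, seed=0):
    # Randomly sample k items from the "database" of pages.
    return random.Random(seed).sample(list(collection), k)

def filter_agent(collection, predicate):
    # Selectively keep only the items matching some criterion.
    return [item for item in collection if predicate(item)]

def relay(items, peer):
    # Hand the retrieved items to another agent for further processing.
    return peer(items)

pages = [f"page-{i}" for i in range(100)]
subset = sample_agent(pages, k=10)
relevant = filter_agent(pages, lambda p: p.endswith("7"))
print(relay(relevant, peer=lambda items: f"{len(items)} relevant pages"))
```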

2.4 What are the limitations of ABKD?

Despite all its contributions, ABKD is not a panacea for problems inherent in a particular data-mining technique, such as noise, missing data, or lack of scalability. Moreover, ABKD systems are in many cases more difficult to design and implement than conventional data-mining systems. Hence, ABKD systems are better suited for mining enormous amounts of distributed data, which would otherwise require a complicated conventional data-mining system.

2.5 How to evaluate ABKD?

Several implementations of agent-based knowledge discovery exist (such as SAIRE and JAM) and more are in development (like InfoSleuth and BODHI). Thus, it is important to be able to evaluate and compare various agent architectures and distributed learning techniques. This paper suggests some common metrics for most, if not all, ABKD systems:

1) What type of information or data do agents communicate with one another?

Do they share summarized information or raw data that represents the data source they mine?

2) How often do agents communicate with one another?

Does their communication require high bandwidth?

3) Do agents communicate during or after the learning process?

4) Are both the architecture and the implementation easily scalable?

Are there limitations on the application?

5) Can the system reuse existing machine learning algorithms without extensive modification?

6) What is the integration technique?

Is it efficient, scalable, and practical?

7) What is the coordination technique?

Is it efficient, scalable, and practical?

8) What are the results of experiments, if any?

These metrics provide some clues about the advantages as well as problems involved in ABKD. With these metrics, the next section will evaluate some of the present work in ABKD. Following that, the paper will present the desired characteristics of an ideal ABKD architecture.

3. Existing Agent Architectures for Data Mining

3.1 CAS

The developers of Cooperative Agent Society (CAS) identified the generation of concise, high-quality information in response to a user's needs as the core problem of information gathering (IG). The constant growth in the number of available information sources compounds the problem. The authors presented a sophisticated view of IG as the process of both acquiring and retrieving information, instead of just information retrieval. They based their arguments on the observation that "no single source of information may contain the complete response to a query and hence may necessitate piecing together mutually related partial responses from disparate and heterogeneous sources." Hence, they proposed that the paradigm for supporting flexible and reliable IG applications is a distributed cooperative task in which agents act as intermediaries between a user and the system.

The team suggested an agent-based approach for several reasons. Their main motivation for using intelligent agents was that "the components with which to interact [when gathering information] are not known a priori". Other motivations included the maintenance of data sources by different providers, the difference in creation times of data sources, and the use of different problem-solving paradigms. Because agents can negotiate and cooperate with one another, the team believed that agents are important tools for interacting with heterogeneous data sources.

The team used the Internet as their test source because it provides the kind of environment they were interested in and because results there are generally applicable. They first examined both non-agent-based and partially agent-based approaches for IG, so that they could determine how an agent-based approach should work and what issues it should address. The non-agent-based systems that the authors looked into were mostly navigational systems such as the World Wide Web and gopher. The authors concluded that the main problem with non-agent-based systems was that "although [non-agent-based systems] allow [the] user to search through a large number of information sources, they provide very limited capabilities for locating, combining, and processing information; the user is still responsible for finding the information."

The authors classified the partially agent-based systems they examined into two categories. In the first category, the systems use agents to help users browse, mainly as tools that interactively advise users on which link to pick. This approach easily falls prey to poorly designed agents that constantly make "annoying suggestions." Systems in the second category use agents to help users search for documents on the Internet, with tools like client-based search tools and indexing agents (or search engines). Nevertheless, the team found it difficult to scale systems in this category as the size of the document pool grows, mainly as a result of the stress such systems place on network resources.

Using information from these prototypes, the authors proposed a completely agent-based IG tool called CAS. The main design concepts of CAS are:

1) Search at remote sites with multiple agents - domain-expert agents determine which sites to search and how to optimize the search

2) Cooperation of agents - an agent would consult other agents when facing an unfamiliar situation

3) Abstraction of low level details from users

Based on these concepts, the team developed three types of agents in the CAS system: 1) User Agents, or UA (one per user), 2) Machine Agents, or MA (one per data source), and 3) Managers, or MAN (each uses its domain knowledge to direct search to proper data sources).

The CAS system adopts the following mechanism of agent interaction for IG. Initially, the UA learns the preferences of its user, either directly or through monitoring. The UA also provides an interface for its user to submit queries. With both the query and the profile of its user, the UA can then select a proper MAN for answering the request. This selection process requires the UA to have meta-knowledge about each MAN. The UA can also ask other UAs for advice on picking a MAN. After that, the selected MAN may request further domain-specific information from the user via the UA to process the query. Once all the proper information is gathered, the MAN formulates a plan and contacts the corresponding MAs. Similar to the selection of a MAN by the UA, the MAN uses meta-knowledge about each MA, together with advice from other MANs, to choose the proper MAs for service. Upon receiving their directions from the MAN, the MAs try to retrieve the appropriate data from the system. Again, MAs may consult other MAs if they do not have enough information about the query. The key to this approach is the cooperation between agents: each level of the search requires a high degree of interaction among peer agents for advice and direction.
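The sketch below condenses this UA-to-MAN-to-MA delegation chain into a few Python classes. It is our own illustration of the pattern just described, not actual CAS code (which is not publicly documented in English), and it omits the peer-advice and meta-knowledge mechanisms for brevity.

```python
# Hedged sketch of the UA -> MAN -> MA interaction pattern; all class
# and method names are our illustration only.

class MA:  # Machine Agent: one per data source
    def __init__(self, source):
        self.source = source

    def retrieve(self, query):
        return f"results for '{query}' from {self.source}"

class MAN:  # Manager: uses domain knowledge to direct searches to MAs
    def __init__(self, domain, machine_agents):
        self.domain = domain
        self.machine_agents = machine_agents

    def handle(self, query):
        # A real MAN would use meta-knowledge (and peer advice) to pick
        # the proper MAs; for brevity we simply consult all of them.
        return [ma.retrieve(query) for ma in self.machine_agents]

class UA:  # User Agent: one per user, holds the user profile
    def __init__(self, managers):
        self.managers = managers  # meta-knowledge about each MAN

    def submit(self, query, domain):
        man = self.managers[domain]  # select a proper MAN for the request
        return man.handle(query)

ua = UA({"travel": MAN("travel", [MA("airline-db"), MA("hotel-db")])})
print(ua.submit("flights to Austin", domain="travel"))
```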

The team suggested that CAS solves many of the problems found in other distributed data-mining systems. While CAS does not require users to know exactly where to find data, CAS guides the user by asking appropriate questions about domain-specific topics. In addition, CAS simplifies the maintenance of data among sources by placing an MA at each information source. The topology of CAS allows parallel execution and improves the security of the system.

The authors presented a theoretical example to show how CAS can be used. In this example, a user tries to plan a trip and wants to perform tasks such as booking a flight, renting a car, and finding interesting routes for sight-seeing. CAS handles this request by first obtaining and clarifying the user's request through the UA. Next, the UA dispatches the request to the proper MAN based on meta-knowledge about each MAN's domain. This MAN then dispatches different parts of the request to the MAs best suited for each subtask. After resolving discrepancies between returned values, the MAN returns the final results to the user via the UA.

The implementation of CAS has two phases. The first phase involves the development of cooperating UAs that learn from users, and the design of MANs that plan request fulfillment and develop trust relationships with other agents. The second phase involves the incorporation of more intelligence into the agents so that they can make better plans. As of the writing of their paper, the authors were investigating real-time planning and learning algorithms for this purpose.

The team began their implementation in 1996 using libwww and wrote their code in C. They used Netscape as the user interface and implemented each agent as a separate process, with one UA and several MANs per user. The team used standard web search engines such as Lycos, Infoseek, and Crawler as the data sources and de facto MAs. In their prototype, the UA maintains a log of exchanges as well as a trust table for other agents. After each user query, the UA gets feedback from the user on the usefulness of the information and recalculates its trust in each agent as a result. As the authors adopted a long-range approach to implementation, they appear to be still working on CAS. Unfortunately, as all data on CAS from Tohoku University is in Japanese, information regarding the current status of CAS is unavailable for this paper. The information summarized in this section can be found in [Okada, Lee, and Shiratori 1996]. Further information on CAS (in Japanese only) is available from [CAS web].

3.2 PADMA

PADMA (Parallel Data Mining Agents) is an agent-based system designed to address issues in the data-mining field such as the scalability of algorithms and the distributed nature of data and computation. The team that developed PADMA suggests that "the very distributed nature of the data storage and computing environments is likely to play an important role in the design of the next generation of data mining systems."

In view of the steady growth of research in agent-based information-processing architectures and parallel computing, PADMA uses specialized agents for each specific domain, so that PADMA can evolve into a "flexible system that will exploit data mining agents in parallel, for the particular application at hand."

PADMA consists of three main components: 1) data-mining agents, 2) a facilitator for coordinating agents, and 3) a user interface. The third component is not of interest to this paper.

Specifically, data-mining agents directly access the data to extract high-level useful information, and thus each agent needs to specialize in the particular domain of the data it deals with. Each agent has its own disk subsystem and performs I/O operations on data independently of other agents: this is key to the parallel execution in PADMA. In this way, agents can employ local I/O optimization techniques to increase their speed and improve their accuracy. After extracting information from the data, agents share their mined information through the facilitator module. Other than coordinating agents, the facilitator presents the mining results to the user interface and routes feedback from the user to the agents.

PADMA addresses the scalability issue by reducing the inter-agent and inter-process communication during the mining process. In the initial stage of processing a user request, each agent runs independently and queries the data in its own data set. This independence in the initial phase allows a speedup that is linear with the number of agents involved. Once each agent finishes its local extraction operations, the facilitator merges the information from the agents into a final result.

Similarly, PADMA analyzes data in a parallel fashion. The facilitator instructs the data-mining agents to run a clustering algorithm on their respective local data sources. After analyzing its local data, each agent returns a "concept graph" to the facilitator without interacting with other agents. The concept graph is a null object if no data relevant to the user query exists at a particular data source. The facilitator then combines the concept graphs from the agents and returns the clustering result to the user interface. Note that the mechanisms for detecting and hierarchically merging clusters are largely independent of the way PADMA functions. The system administrator thus needs to provide the clustering mechanisms for each domain to which PADMA is applied.
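The following sketch illustrates this two-phase pattern under simplifying assumptions of our own: a "concept graph" is reduced to a bag of cluster labels (or None when no local data is relevant), and the facilitator simply merges the bags. It is meant only to show the communication pattern, not PADMA's actual clustering mechanism.

```python
# Sketch of PADMA's independent-agents-then-merge pattern.
from collections import Counter

def agent_cluster(local_data, query):
    relevant = [item for item in local_data if query in item]
    if not relevant:
        return None  # null concept graph: no relevant local data
    # Trivial "clustering": group by the item's category prefix.
    return Counter(item.split(":")[0] for item in relevant)

def facilitator(local_datasets, query):
    # In PADMA the agents run in parallel; here we iterate for clarity.
    graphs = [agent_cluster(data, query) for data in local_datasets]
    merged = Counter()
    for graph in graphs:
        if graph is not None:
            merged.update(graph)
    return merged

sites = [["news:alpha report", "sports:beta"], ["news:alpha update"]]
print(facilitator(sites, query="alpha"))  # Counter({'news': 2})
```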

The team tested PADMA for clustering related texts in a corpus. The test involved designing the agents and the facilitator to identify text relationships based on n-grams, so as to alleviate the problems of typographical errors and misspellings in the texts. Their test showed that PADMA could deliver satisfactory clustering results in an acceptable time frame.

The PADMA project is still under active research. The current implementation performs querying and clustering on bodies of text. The team ran experiments against a 36 MB TIPSTER text corpus and showed that PADMA achieved linear speedups for clustering. However, the current implementation did not achieve a reasonable speedup in query operations, and the team is now investigating the bottleneck that prevents this speedup. The next step will be tests with a larger corpus (100 MB). The team is also trying to develop a combination of supervised and unsupervised clustering algorithms for use in PADMA. For more information and detail, see [Kargupta, Hamzaoglu, and Stafford 1999]. Further information can also be found at [Los Alamos National Laboratory web].

3.3 SAIRE

SAIRE (Scalable Agent-based Information Retrieval Engine) is an agent framework for solving the problem of information overload. The authors remarked that because of this problem, the information delivered to users in a data search is "often unorganized and overwhelming." SAIRE attempts to alleviate the problem with a combination of software agents, concept-based search, and natural language (NL) processing. The system provides facilities for tailoring a search to the specific needs of a user. For instance, a user may use a technical word for its very specific meaning instead of its more common meaning. In this case, SAIRE will make sure the search is based on the meaning desired by that user.

SAIRE places more emphasis on domain-specific queries and user-interaction issues than most distributed knowledge-integration or data-mining systems do. Since the team tries to factor users' search objectives and prior activities into the searching process, SAIRE aims to "[provide] an opportunity for non-science users to answer questions and perform data analysis using quality science data". Meeting this goal involves incorporating vast amounts of domain expertise into the agents that interact with users, as well as the agents that extract information from the data sources.

Users interact with SAIRE through a User Interface Agent (UIA). The UIA accepts user inputs and passes the inputs to the Natural Language Parser Agent (NLP). The NLP extracts important phrases from the user input, interprets the inputs, and then generates a request to the SAIRE Coordinator Agent (SCA).

The NLP consists of four agents: 1) a dynamic dictionary, 2) a grammar-checking module, 3) a pre-processor, and 4) a chart parser. Both the dictionary and the grammar-checking module are specific to the domain in which the NLP is working. In addition, the dictionary is split into a main dictionary with words and semantic meanings pertinent to a domain, and a user dictionary that contains words with ambiguous or special meanings. SAIRE interacts with the user to construct the user dictionary and update it with each clarification of a word’s preferred domain meaning.

Figure 1. The architecture of SAIRE.

The SCA first forwards the request from the NLP to a User Modeling Agent (UMA). The UMA monitors the usage patterns of individual users and user groups so that SAIRE can adapt to the requests of frequent users and user groups. The UMA, together with the Concept Search Agent (CSA), provides user-specific interpretations of the request to the SCA. After that, the SCA attempts to resolve any remaining ambiguities with the UMA and the user-specific dictionary. If ambiguities remain, the UMA requests clarification from the user, and this clarification will update the user dictionary.

Once the SCA fully understands a request, it sends the request to the proper data source managers. When the corresponding data source agents return information, the SCA passes the results to a Results Agent (RA). The RA notifies the UIA of the availability of the results and provides tools for presenting this data in different media and various formats.
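The sketch below traces a request through this pipeline (UIA to NLP to SCA to data sources to RA). The class and method names are our shorthand for the roles in Figure 1 and assume much-simplified behavior; they are not SAIRE's actual API.

```python
# Minimal sketch of the SAIRE request pipeline described above.

class NLP:
    def parse(self, text):
        # Stand-in for the four-agent parser: extract key phrases.
        return {"phrases": text.lower().split()}

class SCA:
    def __init__(self, source_managers):
        self.source_managers = source_managers

    def dispatch(self, request):
        # Route the interpreted request to the proper source managers.
        results = []
        for manager in self.source_managers:
            results.extend(manager(request["phrases"]))
        return results

class RA:
    def present(self, results):
        # Notify the UIA and format the results for display.
        return sorted(set(results))

def uia(query, nlp, sca, ra):
    request = nlp.parse(query)       # UIA passes the input to the NLP
    results = sca.dispatch(request)  # SCA coordinates data source agents
    return ra.present(results)       # RA prepares the results for the user

# A toy "data source manager" that returns phrases longer than 4 chars.
docs = lambda phrases: [p for p in phrases if len(p) > 4]
print(uia("ozone measurements 1997", NLP(), SCA([docs]), RA()))
```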

Instead of having each agent maintain local information through direct interaction with other agents, the SCA serves as a centralized coordinator for agents. Since the SCA is aware of the capabilities of every data source agent, it can coordinate the agents in a very sophisticated way. The SCA can also store this information safely in a repository, possibly enhancing the fault tolerance of the system. The SCA keeps track of the locations and skill bases of agent managers in the system and provides this information for the use of all data source agents. An agent manager (AM) controls the command-driven, domain-specific data source agents in a particular domain. Furthermore, by monitoring the request history for each agent, the SCA can control the resource usage of agents by migrating agents from node to node or spawning new agents when necessary. Consequently, SAIRE overloads no single node in the network and uses the available bandwidth efficiently. This multi-agent coordinator architecture of SAIRE is best suited for applications with well-known data sources but no effective means of finding appropriate agents in the agent pool.

The authors evaluated SAIRE with several experiments, and the results were quite promising. In a sample of representative requests, the number of documents retrieved ranged from 8 to 536 per query. The precision, or the percentage of retrieved documents that are relevant to a user query, ranged from 75% to 100%. With these results, the authors claimed that SAIRE has the potential to retrieve only those documents that are relevant to a user's objectives and interests, so that users need not sort through a vast pool of irrelevant documents.

The SAIRE project appears to have been suspended in 1997. As of February 1997, SAIRE could understand 11,000 words and 7,000 phrases and could clarify ambiguous words through user-agent dialogue. SAIRE could also take user context and previous history into account when interpreting a query. The last implementation of SAIRE involved 8 agent groups of 16 agents apiece, and each agent could collaborate with others to fulfill user requests. The implementation also provided visual displays of agent activity along with run-time explanations.

This section summarized work from Lockheed Martin Space Mission Systems & Services presented in [Das and Kocur 1997]. Further information on the SAIRE project can also be found at [SAIRE web].

3.4 InfoSleuth

InfoSleuth is an agent-based system for information retrieval. The team at MCC developed the system for the purpose of extracting and integrating semantic information from diverse sources, as well as providing temporal monitoring of the information network and identifying any patterns that may emerge [Unruh, Martin, and Perry 1998]. The original InfoSleuth project finished by June 30, 1997, and the work is now in its second phase, called InfoSleuth II; the work described here comes under the auspices of both projects. The second project, however, focuses on supporting multimedia information objects and on promoting widespread deployment of data-mining technology in business organizations.

In order to deal with data sources actively joining and leaving the InfoSleuth system while avoiding the need for central coordination, the team developed its own multi-brokering peer-to-peer architecture to coordinate agent actions [Nodine, Bohrer, and Ngu 1998]. The brokering system matches specific requests for services with the agents that can provide those services. This matching process is based on both the syntactic characteristics of the request and the semantic nature of the requested service.

Each data-mining agent in the InfoSleuth system subscribes to agents called brokers. Each broker in turn advertises the capabilities of the agents that subscribe to it, as well as what kinds of broker advertisements it will accept. The brokering system then groups brokers that provide similar agent services into a consortium, but there is enough overlap among different consortia to guarantee interconnectivity among brokers. Brokers belonging to a consortium maintain up-to-date information about other brokers in the consortium as well as general information about the presence of other consortia.

When a broker wants to join the system, it first needs to discover which consortia its services fit within. The new broker then advertises its services and its openness to advertisements. Only those brokers whose openness includes those services will discover the new broker, and they can choose whether to accept the advertisement after assessing the capabilities of the new broker. Conversely, the new broker can query the brokers it advertises to for a list of brokers, and if it is interested in any broker on the list, it can add that broker's advertisement to its own list.

As a data source joins the system, each of its data-mining agents subscribes to one or two brokers. After being in the brokering system for a while, each agent can change its preferred brokers. One way is for the agent to query the related consortia for brokers and, if there is a match, add the broker to its preferred list. Alternatively, if the agent discovers that one of its preferred brokers is always forwarding its service requests to or from another broker, it may simply replace that preferred broker with the intermediate broker.
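The core matching idea can be sketched as follows. The data structures are our own illustration of capability-based advertisement and matching, not MCC's implementation, and they ignore the semantic-matching and consortium layers entirely.

```python
# Hedged sketch of broker-based capability matching: brokers advertise
# the capabilities of their subscribed agents, and a request is matched
# against those advertisements.

class Broker:
    def __init__(self, name):
        self.name = name
        self.advertisements = {}  # agent -> set of advertised capabilities

    def subscribe(self, agent, capabilities):
        self.advertisements[agent] = set(capabilities)

    def match(self, required):
        # Return agents whose advertised capabilities cover the request.
        return [agent for agent, caps in self.advertisements.items()
                if set(required) <= caps]

broker = Broker("finance-consortium")
broker.subscribe("miner-1", {"sql", "deviation-analysis"})
broker.subscribe("miner-2", {"sql"})
print(broker.match({"sql", "deviation-analysis"}))  # ['miner-1']
```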

InfoSleuth uses multiple layers of agents for the task of information gathering and analysis. At each of the data sources, a Resource Agent extracts semantic concepts from the source. Upon receiving a user request, a Multi-resource Query Agent determines whether the request involves more than one Resource Agent, and if so, it integrates the annotated data from the multiple sources. At the same time, Data-mining Agents and Sentinel Agents perform the tasks of intelligent system monitoring and correlation of high-level patterns emerging from the data sources. Data-mining agents provide event notifications that encode statistical analyses and summaries of the retrieved data. Sentinel agents support these activities by organizing inputs to data-mining agents and monitoring for "higher-level" event patterns based on the data-mining agents' output events. Through all these layers of agents, InfoSleuth supports derived requests such as deviation analysis, filtered deviations, and correlated deviations.

Figure 2. The architecture of InfoSleuth.

Even though the team evaluated InfoSleuth with several experiments, they did not publish any data regarding its performance.

With its elaborate brokering system, InfoSleuth requires no central coordination for collaborative agent action. Moreover, the peer-to-peer nature of the brokering system provides an efficient way for a data-mining agent to locate another agent for use. The brokering system also provides mechanisms with which agents can rate the service provided by brokers and switch brokers accordingly. This allows the system to dynamically adapt itself both to network instability and to major categorical shifts in user requests. Nonetheless, the information necessary for brokers and agents to adjust their links may propagate very slowly across the network, in which case InfoSleuth may deliver sub-optimal performance for prolonged periods of time.

Organizations such as the National Institute of Standards and Technology and companies like Texas Instruments and Eastman Chemical Company have adopted InfoSleuth as the infrastructure for their data-mining operations [MCC web A]. In particular, the EDEN (Environmental Data Exchange Network) project recently used InfoSleuth to support integrated access, via web browsers, to environmental information sources provided by agencies in different countries [MCC web B].

3.5 JAM

JAM (Java Agents for Meta-Learning over Distributed Databases) attempts to provide a scalable solution for learning patterns and generating a descriptive representation from large amounts of data in distributed databases [Stolfo, Prodromidis, Tselepis, Lee, Fan, and Chan 1997]. The authors identified the need for scaling algorithms in data mining. They claimed that even though many well-developed data-mining algorithms exist, most of them assume that the total set of data can fit into memory, and this assumption does not hold in many data-mining contexts. The team thus developed JAM as an agent-based framework for handling this scaling problem.

Another motivation for their agent-based data-mining framework is to handle inherently distributed data. The authors noted that data can be inherently distributed because it is stored on physically distributed mobile platforms like ships or cellular phones. Other reasons for the inherently distributed nature of data include, but are not limited to, secure and fault-tolerant distribution of data and services, proprietary concerns (different parts of the data belong to different entities), and statutory constraints imposed by law.

Figure 3. The architecture of a JAM network with 3 Datasites.

The JAM system is a collection of distributed learning and classification programs linked by a network of Datasites. Each JAM Datasite consists of a local database, one or more base-learning agents, one or more meta-learning agents, a local user configuration file, graphical user interfaces, and animation facilities. A learning agent is a machine-learning program for computing classifiers at distributed sites. Base-learning agents at each Datasite first compute base classifiers from a collection of independent and inherently distributed databases in a parallel fashion. Meta-learning agents are learning processes that integrate several base classifiers, which may be generated by different Datasites. In addition, JAM has a central and independent module, called the Configuration File Manager (CFM), which maintains the up-to-date state of the distributed system. The CFM stores a list of participating Datasites and logs events for future reference and evaluation.

At each Datasite, local learning agents operate on the local database to compute the base classifier. Each Datasite may import classifiers from peer Datasites and combine them with its own local classifier using the local meta-learning agent. JAM addresses the scaling problem of data mining by computing a meta-classifier that integrates all the base-classifier and meta-classifier modules once they are computed. The system can then use the resultant meta-classifier module to classify other datasets of interest. Through this ensemble approach, JAM boosts the overall predictive accuracy.
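A toy sketch of the meta-learning idea follows, under heavy simplifying assumptions of our own: each "Datasite" trains a trivial base classifier on its local data only, and the combiner is a plain majority vote. JAM's actual combiners are richer (for example, a learned Bayesian meta-classifier), and its implementation is in Java; this sketch shows only the pattern.

```python
# Toy sketch of JAM-style meta-learning: base classifiers trained on
# local data, integrated by a combining meta-classifier.

def train_base(local_examples):
    # Stand-in base learner: 1-nearest-neighbor on one numeric feature.
    def classify(x):
        nearest = min(local_examples, key=lambda e: abs(e[0] - x))
        return nearest[1]
    return classify

def meta_classifier(base_classifiers):
    # Combine base predictions by majority vote; in JAM, a meta-learning
    # agent would instead train a learner on the base predictions.
    def classify(x):
        votes = [clf(x) for clf in base_classifiers]
        return max(set(votes), key=votes.count)
    return classify

site_a = [(1.0, "legit"), (9.0, "fraud")]  # local data never leaves the site
site_b = [(2.0, "legit"), (8.5, "fraud")]
meta = meta_classifier([train_base(site_a), train_base(site_b)])
print(meta(8.8))  # -> 'fraud'
```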

The CFM assumes a passive role for the configuration maintenance of the system. It maintains a list of active member Datasites for coordination of meta-learning activities. Upon receiving a JOIN request from a new Datasite, the CFM verifies the validity of the request as well as the identity of the site. Similarly, a DEPARTURE request invokes the CFM to verify the request and remove the Datasite from the list of active members. The CFM logs the events between Datasites, stores the links among Datasites, and keeps the status of the system.

JAM implements both the CFM and the Datasites as multi-threaded Java programs. Meta-learning agents are implemented as Java applets because they need to migrate to other sites.

The team initially designed JAM for fraud and intrusion detection in financial information systems. They conducted an experiment using the system to detect fraudulent credit card transactions, which involved processing inherently distributed data from various financial institutions. They obtained the best performance by using a Bayesian network as the meta-classifier: JAM was able to detect 80% of the fraudulent transactions (true positives) with a false-alarm rate of 13%.

Agents in JAM communicate with one another to exchange the classifiers they have developed. The system does not require the dispersion of raw data across the different sites during execution. This allows the participants to share only derived information, without violating the security or proprietary protection of the data.

Since JAM does not specify an implementation, different Datasites may choose different machine-learning algorithm implementations as their learning agents, and some of these algorithms may not scale well for large data sets. Thus, the ability to handle large datasets may vary among Datasites.

Moreover, there may be a limit on how many Datasites can join a JAM system. Even though JAM has no central coordinator, the CFM constantly monitors the global state of the system and contends for more network bandwidth as more Datasites join. The CFM can be both a single point of failure and a bottleneck to reasonable system performance.

The JAM project ended in December 1998. The team posted the evaluation of JAM’s performance in intrusion detection on their website [JAM web A]. For software download and specification, refer to [JAM web B].

3.6 DKN

DKN (Distributed Knowledge Network) is a research project on large-scale automated data extraction, knowledge acquisition, and knowledge discovery from heterogeneous, distributed data sources [Yang, Honavar, Miller, and Wong 1998]. As part of this project, the team implemented a toolkit of machine-learning algorithms, called KADLab, which uses customizable agents for document classification and retrieval from distributed data sources.

Instead of building an agent infrastructure like most projects, the DKN team chose to use the commercially available Voyager platform from ObjectSpace. Voyager uses the Java language object model and allows regular message syntax for constructing and deploying remote objects. Through its Object Request Broker, Voyager provides services to remote objects and autonomous agents. Objects and other agents can send messages to a moving agent, and an agent can continue to execute as it moves. The platform also provides persistence, group communication, and basic directory services.

The team has experimented with their approach for retrieving paper abstracts and news articles on a point-to-point basis [Yang, Pai, Honavar, and Miller 1998]. They first trained the classifiers with user preferences and then incorporated the classifiers into mobile agents on the Voyager platform. When a user queries for a document using their system, a mobile agent (Agent 1) is generated. Agent 1 moves to a remote site to retrieve relevant documents, sends the documents back to the local site, and then dies. Next, the user gives feedback as to whether the documents are interesting. This feedback trains the classifiers, and another agent (Agent 2) is generated. Agent 2 moves to the remote site and runs the classifier to retrieve relevant documents; it sends the relevant documents to the local site and dies. The team claimed that the mobile agents return only a subset of relevant documents, but they did not explain the mechanism through which they incorporate classifiers into agents.
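The two-step workflow can be sketched as below. Since the paper does not explain how classifiers are actually packaged into Voyager agents, every name and the toy relevance-feedback learner here are purely our illustration.

```python
# Purely illustrative sketch of the two-step mobile-agent workflow.

def agent_1(remote_docs, query):
    # First agent: fetch documents matching the raw query, then "die".
    return [doc for doc in remote_docs if query in doc]

def train_classifier(docs, feedback):
    # Learn from the user's relevance feedback on the first batch.
    liked = [doc for doc, ok in zip(docs, feedback) if ok]
    tokens = set(" ".join(liked).split())
    return lambda doc: any(tok in doc for tok in tokens)

def agent_2(remote_docs, classifier):
    # Second agent: run the user-trained classifier at the remote site.
    return [doc for doc in remote_docs if classifier(doc)]

remote = ["neural nets paper", "agent survey", "neural agents note"]
first = agent_1(remote, "agent")   # ['agent survey', 'neural agents note']
clf = train_classifier(first, feedback=[False, True])
print(agent_2(remote, clf))        # documents similar to the liked one
```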

Other than the point-to-point experiment, the team did not publish any experiments regarding the system's performance with distributed data sources or under varying network environments, nor did they publish the characteristics of the data sources in the experiments they conducted.

An important feature of their work is the use of an off-the-shelf agent platform. By not building their own agent platform, developers can proceed to programming the agent activities and launch the agents into the network within a relatively short time frame. On the other hand, most commercially available agent platforms are for general agent usage, so developers that use such platforms do not enjoy the same leverage as with a platform specifically designed for data mining. In the case of DKN, the team found it difficult to keep track of agents once the agents were launched into the network, because Voyager requires a proprietary Java message format for communication among agents. Therefore, instead of updating agents at remote sites with new classifier information, their system has to regenerate and dispatch new agents from a central location every time a user provides feedback to the learning process. Their system may not scale well to distributed data sources due to the considerable overhead in agent generation and garbage collection.

3.7 BODHI

BODHI is an implementation of Kargupta’s Collective Data-Mining (CDM) framework for distributed knowledge discovery using agents. CDM aims at “designing and implementing efficient algorithms that generate models from heterogeneous and distributed data with guaranteed global correctness of the model” [Kargupta web].

An agent in BODHI is an interface between the learning algorithm and the communication module. At each site, an agent station module maintains communication between sites and handles security issues. A facilitator module coordinates inter-agent communication and directs data and control flow among the distributed sites. Most of the BODHI implementation is in Java for flexibility, but the system can still import learning algorithms implemented in native code on local machines.

BODHI uses several learning algorithms specifically developed for distributed data mining: collective decision rule learning using Fourier analysis [Kargupta, Park, Hershberger, and Johnson 1999], collective hierarchical clustering [Johnson and Kargupta 1999], collective multivariate regression using wavelets [Hershberger and Kargupta 1999], and collective principal component analysis [Rannar, MacGregor, and Wold 1998]. The first algorithm uses Fourier analysis to find the Fourier spectrum of the data at each source and then sends the local spectra to a centralized site for merging; BODHI can then transform the resultant spectrum into a decision tree representation. Collective hierarchical clustering requires the transmission of local dendrograms at O(n) communication cost; it then creates a global model from the local models within an O(n^2) time bound and an O(n) space bound. Collective multivariate regression requires only the aggregation of the significant coefficients of the wavelet transformation of the local data at a central site; the algorithm can then reconstruct the model by performing regression on the coefficients. Collective principal component analysis involves the creation of a global covariance matrix from loading matrices and sample score matrices after distributed data analysis.

In general, collective learning algorithms attempt to build the most accurate model with respect to a centralized algorithm while minimizing data communication [Kargupta web].
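The communication pattern shared by these algorithms can be sketched as follows. The trivial per-feature-mean "transform" is entirely our own stand-in, chosen only to show how sites ship significant coefficients rather than raw data; BODHI's actual Fourier-, wavelet-, and PCA-based algorithms are far more refined.

```python
# Illustrative sketch of the collective-learning communication pattern:
# each site transmits only the significant coefficients of a local
# summary (never the raw data) to a central site for merging.

def local_summary(rows, threshold=0.1):
    # Keep only "significant" per-feature means, with sample counts.
    n, dims = len(rows), len(rows[0])
    means = [sum(row[i] for row in rows) / n for i in range(dims)]
    return {i: (m, n) for i, m in enumerate(means) if abs(m) > threshold}

def central_merge(summaries, dims):
    # Central site: weighted merge of the transmitted coefficients.
    merged = []
    for i in range(dims):
        pairs = [s[i] for s in summaries if i in s]
        total = sum(n for _, n in pairs)
        merged.append(sum(m * n for m, n in pairs) / total if total else 0.0)
    return merged

site1 = [[1.0, 0.01], [3.0, -0.01]]
site2 = [[2.0, 0.02]]
print(central_merge([local_summary(site1), local_summary(site2)], dims=2))
# -> [2.0, 0.0]: the insignificant second feature was never transmitted
```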

The use of these algorithms minimizes the amount of communication between the central coordinator and the local data sources, but the type of information communicated varies according to the algorithm used. Such algorithms involve no inter-agent communication before model integration, which may be a drawback for some data-mining applications.

BODHI adopts an agent architecture that provides the necessary infrastructure for the execution and information transmission of collective learning algorithms. The system uses the network bandwidth efficiently. However, if too many distributed agents send their local models concurrently to a central location for merging, the large amount of incoming information may overload the network and thus scalability is an issue. More detail concerning the implementation can be found on Kargupta’s website [Kargupta web].

4. Desired Characteristics of Ideal ABKD Architecture

The survey in the previous section demonstrated the various ways of applying agent technology to distributed data mining. It covered most of the issues that may arise in distributed data mining, ranging from system architecture and network topology to user interaction. While each existing ABKD system addresses certain issues of distributed data mining better than others, one may wonder whether it is possible to extract all the good features from existing ABKD systems to specify an ideal ABKD system that is flexible and robust enough to address all of the issues in distributed data mining. In particular, this section attempts to provide insights, if not final answers, to the following questions:

1) What are the characteristics of an ideal ABKD architecture?

2) What can a user or a developer expect from an ideal ABKD system?

3) Is it possible to build an ideal ABKD system?

4.1 Environment

By definition, any ABKD system needs to work in a networking environment. In some networks, like the Internet, where stability is uncontrollable, an ideal ABKD system needs to allow remote data sources to join and leave dynamically at any given time. Other networks, like corporate intranets, can be fairly stable most of the time; in that case, the ideal ABKD system needs to be aware of the stable network and work with the remote data sources through more efficient means.

An ideal ABKD system scales well with the number of data sources and the size of the data to be mined at each remote site. The ideal system can also handle concurrent queries from a large number of users. Based on the earlier discussion of existing ABKD systems, avoiding centralized coordination and system monitoring is the key to resolving these scalability issues.

4.2 Information Integration

In distributed data mining, ABKD systems need to integrate information from different data sources. These data sources may store data in different formats, belong to different application domains, and support the retrieval of data in different kinds of data structures. Yet there are also cases in which all the distributed data sites are homogeneous. An ideal ABKD system needs to adapt to the different possible natures of remote data sources in order to perform the necessary information integration.

One issue in ABKD research is whether to integrate information during the mining process, or after independent mining at each remote data source. This issue is closely coupled with the mechanisms with which agents are coordinated. In the former choice of integration, central coordination is not necessary and the action of an agent can be influenced by many different entities through extensive communication. In the latter choice, a coordinator at the top of the system hierarchy is necessary to manage agents for proper information integration. An ideal ABKD system should support both types of information integration and adjust itself dynamically according to the network environment and application context.

What is more, an ideal ABKD system should support both any-time algorithms and non-interruptible algorithms. The use of any-time algorithms in a data-mining operation allows the users to interrupt the system at any stage of processing and retrieve an analysis for the results up to that stage. However, any-time analysis may not make sense for certain data-mining operations or techniques.

4.3 Result Processing

Result processing is an aspect of data mining that involves presenting results to users, understanding user queries, and archiving past results efficiently. Even though result processing is usually not a concern in distributed data mining research, agent technology readily addresses issues in result processing.

Since it is important for the data-mining system to thoroughly understand user requests, the ideal system should perform user profiling and take into account the history, experience, and profile of the user before processing that user's requests. Only then can the system verify that it is finding information interesting to the user.

Interactive query clarification is a powerful agent tool for this purpose. It allows the data-mining system to ask the user clarifying questions before executing a query. As a result, the system can ensure that it understands the intended meaning of a particular word in the user request. Incorporating this feature into the ideal ABKD system will extend its applicability to domain-specific operations without limiting its application to a particular domain.

Moreover, the ideal system should be able to measure and report the relevance of the returned information with respect to the user query. The user can then cross-reference with this relevance rating when making a decision based on the returned information.

An ideal ABKD system should not only support advanced user interaction but also make the best use of available resources to resolve user queries. The ideal system performs data source profiling to better direct queries to the proper data sources. Data source profiling requires that the coordinating entities keep track of meta-information for each data source. Besides returning information of higher quality, proper data source profiling also promotes efficient use of system resources, especially network bandwidth.


In addition, the ideal system may cache past data-mining results to reduce the amount of processing for recurrent queries. Since caching works well only in certain domains, the ideal system should provide a means for the system administrator to specify whether caching should take place and, if so, at what points in the system and at what level.
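One plausible realization, sketched below under the assumption that queries and results are plain strings, is an administrator-controlled cache of bounded size; the LRU eviction policy and all names are our own choices, not prescribed by any ABKD system.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A sketch of administrator-controlled result caching for recurrent queries.
    public class ResultCache {
        private final boolean enabled;               // set by the administrator
        private final Map<String, String> cache;

        public ResultCache(boolean enabled, final int maxEntries) {
            this.enabled = enabled;
            // An access-ordered LinkedHashMap gives a simple LRU eviction policy.
            this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > maxEntries;
                }
            };
        }

        public String lookup(String query) {
            return enabled ? cache.get(query) : null;    // null means recompute
        }

        public void store(String query, String result) {
            if (enabled) cache.put(query, result);
        }

        public static void main(String[] args) {
            ResultCache cache = new ResultCache(true, 100);
            cache.store("avg house price in Travis county", "$152,300");
            System.out.println(cache.lookup("avg house price in Travis county")); // cache hit
        }
    }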

4.4 Resource Usage

In many data-mining applications, it is desirable to minimize resource usage without significantly compromising speed or accuracy. Hence, an ideal ABKD system needs to provide facilities for adjusting resource usage to meet different performance requirements. One example is the resource usage of inter-agent communication. An ideal ABKD system should allow agents to communicate with one another to resolve problems, and increasing the communication among agents usually results in more accurate and more meaningful data mining. Yet such an increase invariably leads to more network traffic, and the ideal system may need to limit inter-agent communication to control both bandwidth usage and runtime bounds. Therefore, the ideal system should adjust the level of inter-agent communication dynamically according to the run-time environment.
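For instance, a coordinator could grant each agent a message budget per time window and raise or lower it at run time, as in the hypothetical sketch below; the class and its policy are our own illustration.

    // A sketch of throttled inter-agent communication: each agent gets a
    // message budget per time window, which the coordinator can adjust at
    // run time to trade accuracy against bandwidth.
    public class CommBudget {
        private volatile int messagesPerWindow;   // adjustable at run time
        private int usedThisWindow = 0;

        public CommBudget(int messagesPerWindow) {
            this.messagesPerWindow = messagesPerWindow;
        }

        // Called by the coordinator when the network is congested or idle.
        public void adjust(int newBudget) { messagesPerWindow = newBudget; }

        // An agent asks permission before sending; a denial forces it to
        // fall back on local information instead of consulting its peers.
        public synchronized boolean trySend() {
            if (usedThisWindow < messagesPerWindow) {
                usedThisWindow++;
                return true;
            }
            return false;
        }

        public synchronized void resetWindow() { usedThisWindow = 0; }

        public static void main(String[] args) {
            CommBudget budget = new CommBudget(2);
            System.out.println(budget.trySend()); // true
            System.out.println(budget.trySend()); // true
            System.out.println(budget.trySend()); // false: budget exhausted
            budget.adjust(5);                     // coordinator raises the budget
            System.out.println(budget.trySend()); // true again
        }
    }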

Since querying data sources is an expensive operation, an ideal system should also minimize the time overhead of accessing remote data sites. Much research in this area aims at gathering information with the minimal number of database accesses.

4.5 How realistic is it?

Researchers may wonder whether such an ideal ABKD system can be built. Even though no existing ABKD system possesses all the ideal characteristics, the survey of ABKD systems in the previous section provides some insight into the feasibility of building one.

Since the ideal ABKD system essentially combines the good features of all the existing ABKD systems, the survey demonstrates that each constituent part of the ideal system can be implemented. However, the inherent difficulty of integration means that the ideal system is not a simple union of these constituent parts. As research in distributed data mining shows how hard it is to combine learning approaches, building the ideal system is feasible but non-trivial. It requires an in-depth understanding of the problems in distributed data mining and a familiarity with all possible solutions, so that alternative approaches can be taken whenever necessary.

Perhaps the most difficult aspect of building an ideal ABKD system architecture is allowing users to adapt the system to their domain-specific needs. The ideal system needs to support flexible tradeoffs among many conflicting desired characteristics in order to be useful across a wide array of application domains. It is a challenge for developers to incorporate these conflicting characteristics so that the features work properly by themselves and, more importantly, work together when necessary.

4.6 Proposed ABKD Architecture

We propose an agent-based data-mining architecture that accommodates most of the ideal characteristics described in the previous sections. The proposed architecture resembles SAIRE but places more emphasis on data sources and their management than on the user interface. SAIRE requires a high level of communication among agents, and it does not address the problem of unstable networks: even though SAIRE does not prevent data sites from joining or leaving the system, it does not directly address the issues that arise when they do. Built mainly to allow user-friendly search through a vast amount of data, SAIRE ignores many issues that our proposed architecture deals with.

The proposed system contains three main types of agents: the UI (user interface) agent, the Manager agent, and the KB (knowledge base) agent. The UI agent helps the user obtain specific information via the Manager agent. The Manager agent, an expert in a particular field, interoperates with a number of KB agents or other Manager agents to obtain the requested information. Each KB agent wraps a single data source and uses a conventional data-mining technique to extract information from the data.

The architecture uses a “registration” mechanism that allows an agent to access only those agents that have registered with it. This protocol makes the organization of the agents dynamic; in other words, the UI and KB agents are mutable. A UI agent is created when a user enters the system, and the system destroys it when the user leaves. Similarly, KB agents are created and destroyed as data sources join and leave the system. In general, a UI agent has access to multiple Manager agents that have registered with it.
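The following minimal Java sketch illustrates the registration protocol; the Agent class and its methods are illustrative inventions, not part of the proposed implementation.

    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    // A minimal sketch of the registration protocol: an agent can reach
    // only those agents that have registered with it, and registrations
    // come and go as users and data sources do.
    public class Agent {
        private final String name;
        // Agents that have registered with this agent (its only visible peers).
        private final Set<Agent> registrants = new CopyOnWriteArraySet<>();

        public Agent(String name) { this.name = name; }

        public void register(Agent a)   { registrants.add(a); }
        public void deregister(Agent a) { registrants.remove(a); }

        // Access is restricted to registered agents, which keeps the
        // organization dynamic: when a data source leaves the system,
        // deregistering its KB agent makes it invisible to its managers.
        public boolean canAccess(Agent other) { return registrants.contains(other); }

        @Override
        public String toString() { return name; }

        public static void main(String[] args) {
            Agent manager = new Agent("houses-manager");
            Agent kb = new Agent("county-db-kb");
            manager.register(kb);                       // a data source joins
            System.out.println(manager.canAccess(kb));  // true
            manager.deregister(kb);                     // the data source leaves
            System.out.println(manager.canAccess(kb));  // false
        }
    }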

A Manager agent is the least volatile agent because it has no direct link to an external object such as a user or a database. Manager agents organize themselves into a hierarchy of domain expertise. Each Manager agent has access to a specified number of Manager and/or KB agents that have registered with it. Moreover, a Manager agent can terminate itself if no KB or Manager agent is available to it.

We assume that every KB agent understands the meaning of its data with respect to a hierarchy, so that it can register itself with the correct branch. Nevertheless, researchers can implement a simple organization with a two-level hierarchy and use the prototype to study the possibilities of multi-level hierarchies.

Based on Lam’s research on agent design, we suggest that each agent have six functionalities: sensing, modeling, organizing, planning, acting, and communicating. A KB agent senses (gathers) the data, produces a model or statistic from it, and then registers with Manager agents that are interested in its data. When a KB agent receives an information request from a Manager agent, it queries its data source using the proper query language and then communicates the results back to the Manager agent. The KB agent collaborates with other KB agents when the Manager agent requests high-level information that requires querying more than one data source. KB agents also collaborate during modeling.
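One way to express these six functionalities in code is as a Java interface, sketched below; the method signatures are our own illustrative guesses and are not taken from Lam's work.

    // A sketch of the six per-agent functionalities as a Java interface.
    public interface AbkdAgent {
        void sense();                     // gather data or observe the environment
        void model();                     // build a model/statistic of data or of other agents
        void organize();                  // register with (or deregister from) other agents
        void plan();                      // decide how to satisfy a pending request
        void act();                       // execute the plan, e.g. query the local data source
        void communicate(String message); // exchange requests and results with peers
    }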

A Manager agent is responsible for gaining access to all data sources pertaining to a particular field, which may be modeled as a set of keywords within the agent. If a KB agent finds no Manager agent to register with, a new Manager agent is instantiated. A Manager agent models the KB agents that have registered with it, as well as other Manager agents it may collaborate with. The UI agent sends a model of user preferences along with the user's request to the Manager agent to ensure the retrieval of accurate information.


As an example, suppose a user makes three requests: “find the average price of a house sold for less than $150,000 within 5 miles of my current home,” “what area within my state has had the highest selling price for a house in the past 10 years,” and “what is the expected selling price for my neighborhood in the next 10 years.” This example assumes that the ABKD system has established the Manager and KB agents for the realty domain. The UI agent first translates the request into terms that the agents understand. Then it obtains the user's location from the user model and sends the request, along with the user model, to a Manager agent that handles residential property (as opposed to business property). When the Manager agent receives this request, it locates a Manager agent that deals exclusively with houses (as opposed to apartments) and redirects the request to that agent. Having modeled the data content of each registered KB agent, the house Manager agent selects only those KB agents that have data pertaining to the request; for instance, it may select only the KB agents within the user's state. A query is formed and sent to each selected KB agent, which executes it and returns the results to the Manager agent. Once the Manager agent receives the results from the KB agents, it can send them directly to the user or perform further integration, analysis, or formatting according to the user preferences sent from the UI agent. Manager agents can produce high-level knowledge using conventional data-mining techniques such as clustering, decision trees, and statistical methods. This knowledge is then sent to the UI agent, which displays it according to the user's request.
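The sketch below traces a simplified version of the first request through this flow, assuming the house Manager agent filters registered KB agents by state and that each KB agent returns prices already restricted to sales under $150,000; every name and value is invented for illustration.

    import java.util.ArrayList;
    import java.util.List;

    // A simplified trace of the first realty request through the hierarchy.
    public class RealtyFlowSketch {

        interface Handler { List<Double> handle(String query, String userState); }

        // KB agent: wraps one county database; the < $150,000 filter is
        // assumed applied inside the agent's own query, so it just answers.
        record KbAgent(String state, List<Double> prices) implements Handler {
            public List<Double> handle(String query, String userState) {
                return prices;
            }
        }

        // Manager agent: has modeled which registered KB agents hold
        // relevant data, so it consults only those within the user's state.
        record HouseManager(List<KbAgent> registered) implements Handler {
            public List<Double> handle(String query, String userState) {
                List<Double> all = new ArrayList<>();
                for (KbAgent kb : registered) {
                    if (kb.state().equals(userState)) {  // data source profiling
                        all.addAll(kb.handle(query, userState));
                    }
                }
                return all;
            }
        }

        public static void main(String[] args) {
            KbAgent travis = new KbAgent("TX", List.of(120_000.0, 145_000.0));
            KbAgent kings  = new KbAgent("NY", List.of(130_000.0));
            HouseManager houses = new HouseManager(List.of(travis, kings));

            // The UI agent has resolved the user's state to TX from the user model.
            List<Double> prices = houses.handle("avg-price-under-150000", "TX");
            double avg = prices.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            System.out.printf("average selling price: $%.0f%n", avg); // only TX sources consulted
        }
    }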

The implementation of the proposed architecture needs to focus on flexibility, standardization, and platform independence. Agent modeling and interoperation can be done with KQML (Knowledge Query and Manipulation Language), which “is part of a larger effort, the ARPA Knowledge Sharing Effort, which is aimed at developing techniques and methodologies for building large-scale knowledge bases which are sharable and reusable” [Mayfield web]. Agent communication should follow the FIPA (Foundation for Intelligent Physical Agents) ACL (agent communication language) standard [FIPA web]. The main programming language should be Java, and CORBA can be used to support and interoperate with other languages possibly used by the KB agents. CORBA is a powerful tool for interfacing agents with legacy systems. Very often an agent queries data sources that are older, entrenched systems that do not readily interoperate with modern protocols. By defining CORBA interfaces for both agents and data sources, we can simplify the task of making these data sources accessible. Moreover, because CORBA is an established standard, it also simplifies interfacing with newer systems. The learning time for adding a data site to the agent network can thus be substantially shortened.
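As a flavor of what such agent messages might look like, the sketch below composes a KQML-style ask-one performative as a plain string; :sender, :receiver, :content, :language, :ontology, and :reply-with are standard KQML parameters, while the SQL content language, agent names, and ontology name are our own assumptions for illustration.

    // A sketch of composing a KQML-style performative, the way a UI agent
    // might ask a Manager agent a question.
    public class KqmlSketch {
        static String askOne(String sender, String receiver,
                             String content, String replyWith) {
            return "(ask-one"
                 + " :sender "     + sender
                 + " :receiver "   + receiver
                 + " :language "   + "sql"       // assumption: SQL-like content language
                 + " :ontology "   + "realty"    // assumption: realty-domain ontology
                 + " :reply-with " + replyWith
                 + " :content \""  + content + "\")";
        }

        public static void main(String[] args) {
            System.out.println(askOne("ui-agent-42", "houses-manager",
                    "SELECT AVG(price) FROM sales WHERE price < 150000", "q1"));
        }
    }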

Our proposed ABKD architecture handles distributed, dynamic, heterogeneous data by means of mutable KB agents that are independent of one another. Our overall goal is to provide a flexible framework that supports the implementation of various techniques. Depending on the domain and the type of request made, either Manager agents or UI agents can integrate data. In addition, decentralized coordination allows the system to remain functional despite network or agent failures. What is more, the administrator can choose whether KB agents mine the data beforehand or only when a query is made. A subset of the UI agents can evolve into a system that recommends further topics or items of interest based on the user model.


5. Future Work

Until recently, there was no standard for inter-agent communication across agent systems, so existing ABKD research did not consider the possibility of systems working with one another. In addition, present ABKD systems either support only collective data-mining algorithms specially designed for distributed execution, or provide a general framework for reusing conventional machine-learning algorithms. One potential research direction is to develop an ABKD system supporting both types of algorithms, so that researchers can evaluate the two on the same basis. Researchers may also try to build ABKD systems that work with systems from different research teams.

Another observation is that most existing ABKD systems perform only one type of data-mining task, such as classification. Building multi-purpose ABKD systems would allow researchers to reuse the expensive agent platform implementation.

Moreover, publications in ABKD research rarely detail the implementation of the underlying agent systems. Since editors generally prefer new design ideas to implementation details, most papers describe only the architecture of the agent systems and how agents interact. It would be valuable for researchers to share their experience of building prototype ABKD systems, so that novices in ABKD research need not start from scratch. Better still, researchers could provide technical reports on their web sites detailing the design alternatives they considered but decided not to adopt, so that others can benefit from the previous work.

Last but not least, there is no common methodology for benchmarking the various ad hoc implementations of agent-based systems, and researchers of ABKD systems have in most cases not published performance evaluations of their projects. Building on existing research that establishes metrics for multi-agent systems [Lam and Barber 2000], researchers may develop a generic test bed that helps compare the various ABKD architectures. The development of such a test bed can lead to the identification and recommendation of a suitable architecture for each application domain.

6. Conclusion

This paper introduced the idea of the ABKD system, which allows distributed, heterogeneous data to be mined with relative ease. In addition, the use of agent technology facilitates parallel execution of data-mining processes. Nevertheless, ABKD is not a panacea for the problems inherent in a particular data-mining technique.

Next, the paper presented a set of metrics for evaluating ABKD systems and then evaluated present work in ABKD research. It examined ABKD systems built for a specific application domain as well as systems for general data-mining applications. We also took a look at an architecture that uses an existing commercial package instead of building its own agent infrastructure. In general, most existing ABKD systems use multiple layers of agents to handle different levels of data-mining tasks.

After that, the paper described the desired characteristics of an ideal ABKD architecture in terms of its functionalities, resource requirements, and processing of results. Since some of the desired characteristics conflict, researchers need to make tradeoffs when incorporating them into their own systems.

Finally, the paper identified some potential directions for ABKD research, such as building ABKD systems that support both dedicated and conventional data-mining algorithms, developing ABKD systems for multiple categories of data-mining tasks, and implementing a test bed for evaluating different kinds of ABKD architectures with respect to different application domains.

References

Balakrishnan, K. and Honavar, V. (1998). Intelligent Diagnosis Systems. Journal of Intelligent Systems. In press.

Chan, P. and Stolfo, S. (1996). Sharing learned models among remote database partitions by local meta-learning. In Proceedings Second International Conference on Knowledge Discovery and Data Mining, 2-7.

CAS. Shiratori Lab: New Projects. http://www.shiratori.riec.tohoku.ac.jp/index-e.html.

Das, B., and Kocur, D. (1997). Experiments in Using Agent-Based Retrieval from Distributed and Heterogeneous Databases. In Knowledge and Data Engineering Exchange Workshop, 27-35.

Davies, W. H. E. and Edwards, P. (1995A). Agent-Based Knowledge Discovery. In Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environment.

Davies, W. H. E. and Edwards, P. (1995B). Distributed Learning: An Agent-Based Approach to Data-Mining. In Proceedings of ML95 Workshop on Agents that Learn from Other Agents.


Domingos, P. (1997). Knowledge acquisition from examples via multiple models. In International Conference on Systems, Man and Cybernetics.

FIPA. Foundation for Intelligent Physical Agents. http://www.fipa.org.

Hall, Lawrence O., Chawla, Nitesh, and Bowyer, Kevin W. (1998). Combining Decision Trees Learned in Parallel. In Distributed Data Mining Workshop at KDD-98.

Hayes, Caroline C. (1999). Agents in a Nutshell – A Very Brief Introduction. In IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 1 Jan/Feb 1999.

Hershberger, D. and Kargupta, H. (1999). Distributed Multivariate Regression Using Wavelet-based Collective Data Mining. In Special Issue on Parallel and Distributed Data Mining of the Journal of Parallel Distributed Computing. Kumar, V., Ranka, S., and Singh, V. (Ed.) (In press) (also available as Technical Report EECS-99-02).

Honavar, V. (1994). Toward Learning Systems That Use Multiple Strategies and Representations. In Artificial Intelligence and Neural Networks: Steps Toward Principled Integration. pp. 615-644. Honavar, V. and Uhr, L. (Ed.) New York: Academic Press.

Honavar, V. (1998). Inductive Learning: Principles and Applications. In Intelligent Data Analysis in Science. Cartwright, H. (Ed). London: Oxford University Press.

JAM (A). The JAM Project Overview. http://www.cs.columbia.edu/~sal/JAM/PROJECT/recent-results.html.

JAM (B). Software Download for the JAM Project. http://www.cs.columbia.edu/~andreas/JAM_download.html.

Johnson, E. and Kargupta, H. (1999). Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In Large-Scale Parallel KDD Systems, Lecture Notes in Computer Science, Springer-Verlag. Zaki, M. and Ho, C. (Ed).


Kargupta, H. Distributed Knowledge Discovery from Heterogeneous Sites. http://www.eecs.wsu.edu/~hillol/DKD/ddm_research.html.

Kargupta, H., Hamzaoglu, I. and Stafford, B. (1999). Scalable, Distributed Data Mining Using An Agent Based Architecture. Proceedings of Knowledge Discovery And Data Mining. Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R. AAAI Press. 211-214.

Kargupta, H., Park, B., Hershberger, D., and Johnson, E. (1999). Collective Data Mining: A New Perspective Toward Distributed Data Mining. Submitted for publication in Advances in Distributed Data Mining. Kargupta, H. and Chan, P. (Ed.). AAAI Press.

Los Alamos National Laboratory. Parallel Data Mining Agents. http://www-fp.mcs.anl.gov/ccst/research/reports_pre1998/algorithm_development/padma/kargupta.html.

Lam, D. N. and Barber, K. S. (2000). Tracing Dependencies of Strategy Selections in Agent Design. To be published in Proceedings of AAAI-2000, the 17th National Conference on Artificial Intelligence.

Mayfield, James, Labrou, Yannis, and Finin, Tim. Desiderata for Agent Communication Languages. http://www.cs.umbc.edu/kqml/papers/desiderata-acl/root.html. University of Maryland Baltimore County.

MCC (A). Who Will Use InfoSleuth and For What. http://www.mcc.com/projects/infosleuth/introduction/applications.html, last updated February 10, 1998.

MCC (B). Project Documents. http://www.mcc.com/projects/env/eden/docs/fact.html, last updated October 11, 1999.

Miller, L., Honavar, V. and Barta, T.A. (1997). Warehousing Structured and Unstructured Data for Data Mining. In Proceedings of the American Society for Information Science Annual Meeting (ASIS 97). Washington, D.C.


Nodine, M., Bohrer, W., and Ngu, A. (1998). Semantic brokering over dynamic heterogeneous data sources in InfoSleuth. MCC Technical Report. Submitted to ICDE '99.

Okada, R., Lee, E., and Shiratori, N. (1996). Agent Based Approach for Information Gathering on Highly Distributed and Heterogeneous Environment. In Proc. 1996 International Conference on Parallel and Distributed Systems.

Parekh, R. and Honavar, V. (1998). Constructive Theory Refinement in Knowledge Based Neural Networks. In Proceedings of the International Joint Conference on Neural Networks Anchorage, Alaska.

Parekh, R., Yang, J., and Honavar, V. (1998). Constructive Neural Network Learning Algorithms for Multi-Category Pattern Classification. In IEEE Transactions on Neural Networks.

Rannar, S., MacGregor, J. F., and Wold, S. (1998). Adaptive Batch Monitoring using Hierarchical PCA. In Chemometrics & Intelligent Laboratory Systems.

SAIRE. SAIRE Homepage. http://saire.ivv.nasa.gov/.

Sian, S. (1991). Extending Learning to Multiple Agents: Issues and a Model for Multi-Agent Machine Learning (MA-ML). In Proceedings of the European Working Session on Learning. Y. Kodratroff Ed., Springer-Verlag, 458-472.

Stolfo, S., Prodromidis, A., Tselepis, S., Lee, W., Fan, D., and Chan, P. (1997). JAM: Java agents for meta-learning over distributed databases. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 74-81. Newport Beach, CA. AAAI Press.

Unruh, A., Martin, G., and Perry, B. (1998). Getting only what you want: Data mining and event detection using InfoSleuth agents. Technical Report MCC-INSL-113-98, MCC InfoSleuth Project.

Williams, G. (1990). Inducing and Combining Multiple Decision Trees. Ph. D. Dissertation, Australian National University, Canberra, Australia.


Yang, J. and Honavar, V. (1998). Feature Subset Selection Using a Genetic Algorithm. In Feature Extraction, Construction, and Subset Selection: A Data Mining Perspective. Motoda, H. and Liu, H. (Ed.) New York: Kluwer. 1998. A shorter version of this paper appears in IEEE Intelligent Systems (Special Issue on Feature Transformation and Subset Selection).

Yang, J. and Honavar, V. (1998). DistAl: An Inter-Pattern Distance Based Constructive Neural Network Learning Algorithm. In Intelligent Data Analysis. In press. A preliminary version of this paper appears in [IJCNN98].

Yang, J., Pai, P., Honavar, V., and Miller, L. (1998). Mobile Intelligent Agents for Document Classification and Retrieval: A Machine Learning Approach. In Proceedings of the European Symposium on Cybernetics and Systems Research. In press.

Yang, J., Honavar, V., Miller, L. and Wong, J. (1998). Intelligent Mobile Agents for Information Retrieval and Knowledge Discovery from Distributed Data and Knowledge Sources. In Proceedings of the IEEE Information Technology Conference. Syracuse, NY.