JAM: Java Agents for Meta-Learning over Distributed Databases*

Salvatore Stolfo, Andreas L. Prodromidis†, Shelley Tselepis, Wenke Lee, Dave W. Fan

Department of Computer Science, Columbia University, New York, NY 10027

{sal, andreas, sat, wenke, wfan}@cs.columbia.edu

Philip K. Chan
Computer Science, Florida Institute of Technology, Melbourne, FL 32901

[email protected]

Abstract

In this paper, we describe the JAM system, a distributed, scalable and portable agent-based data mining system that employs a general approach to scaling data mining applications that we call meta-learning. JAM provides a set of learning programs, implemented either as JAVA applets or applications, that compute models over data stored locally at a site. JAM also provides a set of meta-learning agents for combining multiple models that were learned (perhaps) at different sites. It employs a special distribution mechanism which allows the migration of the derived models or classifier agents to other remote sites. We describe the overall architecture of the JAM system and the specific implementation currently under development at Columbia University. One of JAM's target applications is fraud and intrusion detection in financial information systems. A brief description of this learning task and JAM's applicability are also described. Interested users may download JAM from http://www.cs.columbia.edu/~sal/JAM/PROJECT.

Introduction

One means of acquiring new knowledge from databases is to apply various machine learning algorithms that compute descriptive representations of the data as well as patterns that may be exhibited in the data. The field of machine learning has made substantial progress over the years and a number of algorithms have been popularized and applied to a host of applications in diverse fields. There are numerous algorithms, ranging from those based upon stochastic models to algorithms based upon purely symbolic descriptions like rules and decision trees. Thus, we may simply apply the current generation of learning algorithms to very large databases and wait for a response! However, the question is how long might we wait? Indeed, do the current generation of machine learning algorithms scale from tasks common today that include thousands of data items to new learning tasks encompassing as much as two orders of magnitude or more of data that is physically distributed? Furthermore, many existing learning algorithms require all the data to be resident in main memory, which is clearly untenable in many realistic databases. In certain cases, data is inherently distributed and cannot be localized on any one machine (even by a trusted third party) for a variety of practical reasons, including physically dispersed mobile platforms like an armada of ships, security and fault-tolerant distribution of data and services, competitive (business) reasons, as well as statutory constraints imposed by law. In such situations, it may not be possible, nor feasible, to inspect all of the data at one processing site to compute one primary "global" classifier. We call the problem of learning useful new knowledge from large inherently distributed databases the scaling problem for machine learning. We propose to solve the scaling problem by way of a technique we have come to call "meta-learning". Meta-learning seeks to compute a number of independent classifiers by applying learning programs to a collection of independent and inherently distributed databases in parallel. The "base classifiers" computed are then integrated by another learning process. Here meta-learning seeks to compute a "meta-classifier" that integrates in some principled fashion the separately learned classifiers to boost overall predictive accuracy.

*This research is supported by the Intrusion Detection Program (BAA9603) from DARPA (F30602-96-1-0311), NSF (IRI-96-32225 and CDA-96-25374) and NYSSTF (423115-445).

†Supported in part by IBM.

In the following pages, we present an overview of meta-learning and we describe JAM (Java Agents for Meta-Learning), a system that employs meta-learning to address large-scale distributed applications. The JAM architecture reveals a powerful, portable and extensible network agent-based system that computes meta-classifiers over distributed data. JAM is being used in experiments on real-world learning tasks, such as solving key problems in fraud and intrusion detection in financial information systems.

Meta-Learning

Meta-learning provides a unifying and scalable solution that improves the efficiency and accuracy of inductive learning when applied to large amounts of data in wide-area computing networks for a range of different applications.

Our approach to improving efficiency is to execute a number of learning processes (each implemented as a distinct serial program) on a number of data subsets (a data reduction technique) in parallel (e.g., over a network of separate processing sites) and then to combine the collective results through meta-learning. This approach has two advantages: first, it uses the same serial code at multiple sites without the time-consuming process of writing parallel programs, and second, the learning process uses small subsets of data that can fit in main memory. The accuracy of the concepts learned by the separate learning processes might be lower than that of the serial version applied to the entire data set, since a considerable amount of information may not be accessible to each of the independent and separate learning processes. On the other hand, combining these higher-level concepts via meta-learning may achieve accuracy levels comparable to that reached by the aforementioned serial version applied to the entire data set. Furthermore, this approach may use a variety of different learning algorithms on different computing platforms. Because of the proliferation of networks of workstations and the growing number of new learning algorithms, our approach does not rely on any specific parallel or distributed architecture, nor on any particular algorithm, and thus distributed meta-learning may accommodate new systems and algorithms relatively easily. Our meta-learning approach is intended to be scalable as well as portable and extensible.
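To make this concrete, here is a minimal sketch (not JAM's actual code) of the two-step process: base learners are trained on memory-resident subsets in parallel, and a second learner is then trained on their predictions over a held-out validation set. The Learner and Classifier types are hypothetical stand-ins.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    interface Classifier { int classify(double[] record); }
    interface Learner { Classifier learn(List<double[]> data, List<Integer> labels); }

    class ParallelMetaLearning {
        // Train one base classifier per data subset in parallel, then
        // meta-learn over the base predictions on a validation set.
        static Classifier metaLearn(List<List<double[]>> subsets,
                                    List<List<Integer>> subsetLabels,
                                    Learner baseLearner, Learner metaLearner,
                                    List<double[]> validation,
                                    List<Integer> validationLabels) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(subsets.size());
            List<Future<Classifier>> futures = new ArrayList<>();
            for (int i = 0; i < subsets.size(); i++) {
                final int site = i;   // each "site" learns from its own subset
                futures.add(pool.submit(
                    () -> baseLearner.learn(subsets.get(site), subsetLabels.get(site))));
            }
            List<Classifier> base = new ArrayList<>();
            for (Future<Classifier> f : futures) base.add(f.get());
            pool.shutdown();
            // Meta-level records: the base classifiers' predictions become
            // the input features of the meta-learner.
            List<double[]> metaData = new ArrayList<>();
            for (double[] x : validation) {
                double[] row = new double[base.size()];
                for (int j = 0; j < base.size(); j++) row[j] = base.get(j).classify(x);
                metaData.add(row);
            }
            return metaLearner.learn(metaData, validationLabels);
        }
    }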

In prior publications we introduced a number of meta-learning techniques including arbitration, combining (Chan & Stolfo 1993) and hierarchical tree-structured meta-learning systems. Other publications have reported performance results on standard test problems and data sets with discussions of related techniques: Wolpert's stacking (Wolpert 1992), Breiman's bagging (Breiman et al. 1984) and Zhang's combining (Zhang et al. 1989), to name a few. Here we describe the JAM system architecture designed to support these and perhaps other approaches to distributed data mining.

The JAM architecture

JAM is architected as an agent-based system, a distributed computing construct that is designed as an extension of OS environments. It is a distributed meta-learning system that supports the launching of learning and meta-learning agents to distributed database sites. JAM is implemented as a collection of distributed learning and classification programs linked together through a network of Datasites. Each JAM Datasite consists of:

• A local database,

• One or more learning agents, or in other words machine learning programs that may migrate to other sites as JAVA applets, or be locally stored as native applications callable by JAVA applets,

• One or more meta-learning agents,

• A local user configuration file,

• Graphical User Interface and Animation facilities.

The JAM Datasites have been designed to collaborate¹ with each other to exchange classifier agents that are computed by the learning agents.

First, local learning agents operate on the local database and compute the Datasite's local classifiers. Each Datasite may then import (remote) classifiers from its peer Datasites and combine these with its own local classifier using the local meta-learning agent. Finally, once the base and meta-classifiers are computed, the JAM system manages the execution of these modules to classify and label datasets of interest. These actions may take place at all Datasites simultaneously and independently.

The owner of a Datasite administers the local activities via the local user configuration file. Through this file, he/she can specify the required and optional local parameters to perform the learning and meta-learning tasks. Such parameters include the names of the databases to be used, the policy to partition these databases into training and testing subsets, the local learning agents to be dispatched, etc. Besides the static² specification of the local parameters, the owner of the Datasite can also employ JAM's graphical user interface and animation facilities to supervise agent exchanges and administer dynamically the meta-learning process. With this graphical interface, the owner may access more information such as accuracy, trends, statistics and logs, and compare and analyze results in order to improve performance.
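The paper does not give the configuration file's syntax; purely as an illustration, such a file might look like the following, where every key name is hypothetical:

    # Hypothetical Datasite configuration (illustrative key names only)
    cfm.host            = cherry                    # host running the CFM
    dataset.name        = thyroid                   # local database to mine
    partition.policy    = 2-fold-cross-validation   # training/testing split
    learning.agent      = ID3Learner                # local learning agent
    metalearning.agent  = BayesLearner              # local meta-learning agent
    animation.enabled   = true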

The configuration of the distributed system is maintained by the Configuration File Manager (CFM), a central and independent module responsible for keeping the state of the system up to date. The CFM is a server that provides information about the participating Datasites and logs events for future reference and evaluation.

The logical architecture of the JAM meta-learning system is presented in Figure 1. In this example, three JAM Datasites, Marmalade, Mango and Strawberry, exchange their base classifiers to share their local view of the learning task. The owner of the Datasite controls the learning task by setting the parameters of the user configuration file, i.e. the algorithms to be used, the images to be used by the animation facility, the folding parameters, etc. In this example, the CFM runs on Cherry and each Datasite ends up with three base classifiers (one local plus the two imported classifiers).

¹A Datasite may also operate independently without any changes.

²Before the beginning of the learning and meta-learning tasks.


Figure 1: The architecture of the meta-learning system.


We have used JAVA technology to build the infrastructure of the system, to develop the specific agent operators that compose and spawn new agents from existing classifier agents, and to implement the GUI, the animation facilities and most of the machine learning algorithms. The platform-independence of JAVA technology makes it easy to port JAM and delegate its agents to any participating site. The only parts that were imported in their native (C++) form and are not yet platform-independent are some of the machine learning programs; this was done for faster prototype development and proof of concept.

Configuration File Manager

The CFM assumes a role equivalent to that of a name server of a network system. The CFM provides registration services to all Datasites that wish to become members and participate in the distributed meta-learning activity. When the CFM receives a JOIN request from a new Datasite, it verifies both the validity of the request and the identity of the Datasite. Upon success, it acknowledges the request and registers the Datasite as active. Similarly, the CFM can receive and verify a DEPARTURE request; it notes the requesting Datasite as inactive and removes it from its list of members. The CFM maintains the list of active member Datasites to establish contact and cooperation between peer Datasites. Apart from that, the CFM keeps information regarding the groups that are formed (which Datasites collaborate with which Datasites), logs the events and displays the status of the system. Through the CFM, the JAM system administrator may screen the Datasites that participate.
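A minimal sketch of this registration logic, assuming a simple line-oriented request format (the paper does not specify JAM's actual protocol):

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class ConfigurationFileManager {
        private final Set<String> activeDatasites = new HashSet<>();
        private final List<String> eventLog = new ArrayList<>();

        synchronized String handle(String request, String datasite) {
            eventLog.add(new Date() + " " + request + " " + datasite);
            switch (request) {
                case "JOIN":
                    // Validity and identity checks would go here.
                    activeDatasites.add(datasite);
                    return "ACK";
                case "DEPARTURE":
                    activeDatasites.remove(datasite);   // mark inactive
                    return "ACK";
                case "MEMBERS":
                    // Peers use this list to establish contact with each other.
                    return String.join(",", activeDatasites);
                default:
                    return "NACK";
            }
        }
    }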

Datasites

Unlike the CFM, which provides a passive configuration maintenance function, the Datasites are the active components of the meta-learning system. They manage the local databases, obtain remote classifiers, build the local base and meta-classifiers, and interact with the JAM user. Datasites are implemented as multi-threaded Java programs with a special GUI.

Upon initialization, a Datasite starts up the GUI, through which it can accept input and display status and results, registers with the CFM, instantiates the local learning engine/agent³ and creates a server socket for listening for connections⁴ from the peer Datasites. Then, it waits for the next event to occur, either a command issued by the owner via the GUI, or a message from a peer Datasite via the open socket.

In both cases, the Datasite verifies that the input is valid and can be serviced. Once this is established, the Datasite allocates a separate thread and performs the required task. This task can be any of JAM's functions: computing a local classifier, starting the meta-learning process, sending local classifiers to peer Datasites or requesting remote classifiers from them, reporting the current status, or presenting computed results.
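The thread-per-connection pattern just described can be sketched as follows; the one-line message format and handlePeerMessage are hypothetical simplifications:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;

    class DatasiteServer {
        static void listen(int port) throws IOException {
            try (ServerSocket server = new ServerSocket(port)) {
                while (true) {
                    Socket peer = server.accept();                 // a peer connects
                    new Thread(() -> handlePeerMessage(peer)).start();
                }
            }
        }

        static void handlePeerMessage(Socket peer) {
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(peer.getInputStream()))) {
                String message = in.readLine();
                // Validate the request, then service it on this thread,
                // e.g. ship a local classifier or store a remote one.
                System.out.println("peer request: " + message);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }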

Figure 2 presents a snapshot of the JAM system during the meta-learning phase. In this example three Datasites, Marmalade, Strawberry and Mango (see the group panel of the figure), collaborate in order to share and improve their knowledge in diagnosing hypothyroidism. The snapshot is taken from "Marmalade's point of view". Initially, Marmalade consults the Datasite configuration file, where the owner of the Datasite sets the parameters.

³The Datasite consults the local Datasite configuration file (maintained by the owner of the Datasite) to obtain information regarding the central CFM and the types of the available machine learning agents.

⁴For each connection, the Datasite spawns a separate thread.


In this case, the dataset is a medical database with records (Merz & Murphy 1996), noted by thyroid in the Data Set panel. Other parameters include the host of the CFM, the Cross-Validation Fold, the Meta-Learning Fold, the Meta-Learning Level, the names of the local learning agent and the local meta-learning agent, etc. (Refer to (Chan 1996) for more information on the meaning and use of these parameters.) Notice that Marmalade has established that Strawberry and Mango are its peer Datasites, having acquired this information from the CFM.

Then, Marmalade partitions the thyroid database (noted as thyroid.1.bld and thyroid.2.bld in the Data Set panel) for the 2-Cross-Validation Fold and computes the local classifier, noted by Marmalade.1 (here by calling the ID3 (Quinlan 1986) learning agent). Next, Marmalade imports the remote classifiers, noted by Strawberry.1 and Mango.1, and begins the meta-learning process. Marmalade employs this meta-classifier to predict the classes of input data items (in this case unlabelled medical records). Figure 2 displays a snapshot of the system during the animated meta-learning process, where JAM's GUI moves icons within the panel displaying the construction of a new meta-classifier.

Figure 2: Two different snapshots of the JAM system in action. Left: Marmalade is building the meta-classifier (meta-learning stage). Right: An ID3 tree-structured classifier is being displayed in the Classifier Visualization Panel.

Classifier Visualization

JAM provides graph drawing tools to help users understand the learned knowledge (Fayyad, Piatetsky-Shapiro, & Smyth 1996). There are many kinds of classifiers, e.g., a decision tree produced by ID3, that can be represented as graphs. In JAM we have employed major components of JavaDot (Lee & Barghouti 1997), an extensible visualization system, to display the classifier and allow the user to analyze the graph. Since each machine learning algorithm has its own format to represent the learned classifier, JAM uses an algorithm-specific translator to read the classifier and generate a JavaDot graph representation.
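As an illustration of such a translator (JavaDot's real API is not shown in the paper, so the output format below is invented), a recursive walk over an ID3-style tree can emit one node per attribute or class and one labelled edge per attribute value:

    import java.util.LinkedHashMap;
    import java.util.Map;

    class TreeNode {
        String label;                                            // attribute, or class at a leaf
        Map<String, TreeNode> children = new LinkedHashMap<>();  // attribute value -> subtree
    }

    class TreeToGraphTranslator {
        // Emit a dot-like line for every node and edge of the tree.
        static void translate(TreeNode node, String id, StringBuilder out) {
            out.append("node ").append(id).append(" \"").append(node.label).append("\"\n");
            int i = 0;
            for (Map.Entry<String, TreeNode> e : node.children.entrySet()) {
                String childId = id + "." + i++;
                out.append("edge ").append(id).append(" -> ").append(childId)
                   .append(" \"").append(e.getKey()).append("\"\n");
                translate(e.getValue(), childId, out);           // recurse into subtree
            }
        }
    }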

Figure 2 shows the JAM classifier visualization panel with a decision tree, where the leaf nodes represent classes (decisions), the non-leaf nodes represent the attributes under test, and the edges represent the attribute values. The user can select the Attributes command from the Object pull-down menu to see any additional information about a node or an edge. In the figure, the Attributes window shows the classifying information of the highlighted leaf node⁵. It is difficult to view a very large graph (one that has a large number of nodes and edges) clearly due to the limited window size. The classifier visualization panel provides commands for the user to traverse and analyze parts of the graph: the user can select a node and use the Top command from the Graph menu to make the subgraph starting from the selected node be the entire graph in display; use the Parent command to view the enclosing graph; and use the Root command to see the entire original graph.

⁵Thus, visually, we see that for a test data item, if its "p-2" value is 3 and its "p-14" value is 2, then it belongs to class "0" with .889 probability.


Some machine learning algorithms generate concise and very readable textual outputs, e.g., the rule sets from Ripper (Cohen 1995). It is thus counter-intuitive to translate the text to graph form for display purposes. In such cases, JAM simply pretty-prints the text output and displays it in the classifier visualization panel.

Animation

For demonstration and didactic purposes, the meta-learning component of the JAM graphical user interface contains a collection of animation panels which visually illustrate the stages of meta-learning in parallel with execution. When animation is enabled, a transition into a new stage of computation or analysis triggers the start of the animation sequence corresponding to the underlying activity. The animation loops continuously until the given activity ceases.

The JAM program gives the user the option of manually initiating each distinct meta-learning stage (by clicking a Next button), or sending the process into automatic execution (by clicking a Continue button). The manual run option provides a temporary program halt. For "hands free" operation of JAM, the user can start the program with animation disabled and execution set to automatic transition to the next stage in the process.

Agents

JAM's extensible plug-and-play architecture allows snap-in learning agents. The learning and meta-learning agents are designed as objects. JAM provides the definition of the parent agent class, and every instance agent (i.e. a program that implements any of your favorite learning algorithms: ID3, Ripper, Cart (Breiman et al. 1984), Bayes (Duda & Hart 1973), Wpebls (Cost & Salzberg 1993), CN2 (Clark & Niblett 1989), etc.) is then defined as a subclass of this parent class. Among other definitions which are inherited by all agent subclasses, the parent agent class provides a very simple and minimal interface with which all subclasses have to comply. As long as a learning or meta-learning agent conforms to this interface, it can be introduced and used immediately in the JAM system, even during execution. To be more specific, a JAM agent needs to have the following methods implemented (see the sketch after this list):

1. A constructor method with no arguments. JAM can then instantiate the agent, provided it knows its name (which can be supplied by the owner of the Datasite through either the local user configuration file or the GUI).

2. An initialize() method. In most cases, if not all, the agent subclasses inherit this method from the parent agent class. Through this method, JAM can supply the necessary arguments to the agent. Arguments include the names of the training and test datasets, the name of the dictionary file, and the filename of the output classifier.




3. A buildClassifier() method. JAM calls this method to trigger the agent to learn (or meta-learn) from the training dataset.

4. getClassifier() and getCopyOfClassifier() methods. These methods are used by JAM to obtain the newly built classifiers, which are then encapsulated and can be "snapped in" at any participating Datasite! Hence, remote agent dispatch is easily accomplished.
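Putting the four requirements together, a minimal sketch of the parent agent class might look like this; the paper lists only the method names, so the field names, argument types and the Classifier interface are assumptions:

    // Hypothetical rendering of JAM's parent agent class.
    abstract class LearnerAgent {
        protected String trainingSet, testSet, dictionary, outputClassifier;
        protected Classifier classifier;           // set by buildClassifier()

        LearnerAgent() {}                          // (1) no-argument constructor

        // (2) usually inherited as-is: receive the dataset and file names
        void initialize(String trainingSet, String testSet,
                        String dictionary, String outputClassifier) {
            this.trainingSet = trainingSet;
            this.testSet = testSet;
            this.dictionary = dictionary;
            this.outputClassifier = outputClassifier;
        }

        abstract boolean buildClassifier();        // (3) learn from trainingSet

        Classifier getClassifier() {               // (4) hand back the model...
            return classifier;
        }
        abstract Classifier getCopyOfClassifier(); // ...or a copy to ship elsewhere
    }

    interface Classifier {
        int classify(String[] record);             // hypothetical prediction signature
    }

A new algorithm is then snapped in simply by subclassing this class.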

The class hierarchy (only methods are shown) for five different learning agents is presented in Figure 3. ID3, Bayes, Wpebls, CART and Ripper inherit the methods initialize() and getClassifier() from their parent learning agent class. The Meta-Learning, Classifier and Meta-Classifier classes are defined in similar hierarchies.

JAM is designed and implemented independently of the machine learning programs of interest. As long as a machine learning program is defined and encapsulated as an object conforming to the minimal interface requirements (most existing algorithms have similar interfaces already), it can be imported and used directly. In the latest version of JAM, for example, ID3 and CART are full JAVA agents, whereas Bayes, Wpebls and Ripper are stored locally as native applications. This plug-and-play characteristic makes JAM a truly powerful and extensible data mining facility.

Fraud and Intrusion Detection

A secured and trusted interbanking network for electronic commerce requires high-speed verification and authentication mechanisms that allow legitimate users easy access to conduct their business, while thwarting fraudulent transaction attempts by others. Fraudulent electronic transactions are a significant problem, one that will grow in importance as the number of access points in the nation's financial information system grows.

Financial institutions today typically develop custom fraud detection systems targeted to their own asset bases. Recently, though, banks have come to realize that a unified, global approach is required, involving the periodic sharing with each other of information about attacks.

We have proposed another wall to protect the nation's financial systems from threats. This new wall of protection consists of pattern-directed inference systems using models of anomalous or errant transaction behaviors to forewarn of impending threats. This approach requires analysis of large and inherently distributed databases of information about transaction behaviors to produce models of "probably fraudulent" transactions. We use JAM to compute these models.

The key difficulties in this approach are: financial companies don't share their data for a number of (competitive and legal) reasons; the databases that companies maintain on transaction behavior are huge and growing rapidly; real-time analysis is highly desirable to update models when new events are detected; and easy distribution of models in a networked environment is essential to maintain up-to-date detection capability.

JAM is used to compute local fraud detection agents that learn how to detect fraud and provide intrusion detection services within a single corporate information system, and an integrated meta-learning system that combines the collective knowledge acquired by individual local agents. Once derived local classifier agents or models are produced at some Datasite(s), two or more such agents may be composed into a new classifier agent by JAM's meta-learning agents.


Figure 3: The class hierarchy of learning agents (only methods are shown). ID3Learner and CartLearner produce decision trees, BayesLearner a probabilistic classifier, WpeblsLearner a nearest-neighbor classifier, and RipperLearner a rule-based classifier; each defines a constructor, a buildClassifier() method and a getCopyOfClassifier() method.


JAM allows financial institutions to share their models of fraudulent transactions by exchanging classifier agents in a secured agent infrastructure, but without disclosing their proprietary data. In this way their competitive and legal restrictions can be met, yet they can still share information. The meta-learned system can be constructed both globally and locally. In the latter guise, each corporate entity benefits from the collective knowledge by using its privately available data to locally learn a meta-classifier agent from the shared models. The meta-classifiers then act as sentries, forewarning of possibly fraudulent transactions and threats by inspecting, classifying and labeling each incoming transaction.

How Local Detection Components Work

Consider the generic problem of detecting fraudulent transactions, in which we are not concerned with global coordinated attacks. We posit there are two good candidate approaches.

Approach 1:

i) Each bank i, 1 <= i <= N, uses some learning algorithm or other on one or more of their databases, DB_i, to produce a classifier f_i. In the simplest version, all the f_i have the same input feature space. Note that each f_i is just a mapping from the feature space of transactions, x, to a binary fraud label.

ii) All the f_i are sent to a central repository, where there is a new "global" training database, call it DB*.

iii) DB* is used in conjunction with some learning algorithm in order to learn the mapping from <f_1(x), f_2(x), ..., f_N(x), x> to a fraud label (or probability of fraud). That mapping, or meta-classifier, is f*.

iv) f* is sent to all the individual banks to use as a data filter to mark and label incoming transactions with a fraud label.

Approach 2:

i) Same as step (i) of Approach 1.

ii) Every bank i sends its f_i to every other bank. So at the end of this stage, each bank has a copy of all N classifiers, f_1, ..., f_N.

iii) Each bank i had held separate some data, call it T_i, from the DB_i used to create f_i. Each bank then uses T_i and the set of all the f's to learn the mapping from <f_1(x), f_2(x), ..., f_N(x), x> to a fraud label (or probability of fraud, as the case may be). (This is exactly as in step (iii) of Approach 1, except the data set used for combining is T_i, a different one for each bank, rather than DB*.) Each such mapping is F_i.

iv) Each bank uses its F_i as in Approach 1.
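A sketch of step (iii) at a single bank: each held-out transaction x in T_i becomes the meta-level record <f_1(x), ..., f_N(x), x>, and any learning algorithm can then be trained on the resulting table, paired with the known fraud labels, to obtain F_i. The Classifier interface here is a hypothetical stand-in.

    import java.util.ArrayList;
    import java.util.List;

    interface Classifier { int classify(double[] x); }

    class ApproachTwo {
        // Build the meta-level training table from T_i and the N shipped f's.
        static List<double[]> metaTable(List<double[]> heldOutTi,
                                        List<Classifier> baseClassifiers) {
            List<double[]> table = new ArrayList<>();
            for (double[] x : heldOutTi) {
                double[] row = new double[baseClassifiers.size() + x.length];
                for (int j = 0; j < baseClassifiers.size(); j++)
                    row[j] = baseClassifiers.get(j).classify(x);   // f_j(x)
                System.arraycopy(x, 0, row, baseClassifiers.size(), x.length);
                table.add(row);   // pair each row with x's known fraud label
            }
            return table;         // train any learner on this table to get F_i
        }
    }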

Credit Card Fraud Transaction Data

Here we provide a general view of the data schema for the labelled transaction data sets compiled by a bank and used by our system. For purposes of our research and development activity, several data sets are being acquired from several banks, each providing 0.5 million records spanning one year, sampling on average 42,000 per month, from Nov. 1995 to Oct. 1996.

The schema of the database was developed over years of experience and continuous analysis by bank personnel to capture important information for fraud detection. The general schema of this data is provided in such a way that important confidential and proprietary information is not disclosed here. (After all, we seek not to teach "wannabe thieves" important lessons on how to hone their skills.) The records have a fixed length of 137 bytes each and about 30 numeric attributes, including the binary classification (fraud/legitimate transaction). Some of the fields are arithmetic and the rest categorical, i.e. numbers were used to represent a few discrete categories.


We now describe the setting of our experiments. In particular, we split the original data set provided by one bank into random partitions and distributed them across the different sites of the JAM network. Then we computed the accuracy of each model obtained at each such partition.

To be more specific, we sampled 84,000 records from the total of 500,000 records of the data set we used in our experiments, and kept them for the Validation and Test sets to evaluate the accuracy of the resultant distributed models. The learning task is to identify patterns in the 30 attribute fields that can characterize the fraudulent class label.

Let's assume, without loss of generality, that we apply the ID3 learning process to two sites of data (say sites 1 and 2), while two instances of Ripper are applied elsewhere (say at sites 3 and 4), all being initiated as Java agents. The results of these four local computations are four separate classifiers, C_ID3-i(), i = 1, 2, and C_Ripper-j(), j = 3, 4, that are each invocable as agents at arbitrary sites of credit card transaction data.

A sample (and sanitized) Ripper rule-based classifier learned from the credit card data set is depicted in Figure 4: a relatively small set of rules that is easily communicated among distributed sites as needed.⁶ To extract fraud data from a distinct fifth site of data, or any other site, using say C_Ripper-3(), the code implementing this classifier would be transmitted to the fifth site and invoked remotely to extract data. This can be accomplished, for example, using a query of the form:

Select X.* From Credit-card-data
Where C_Ripper-3(X.fraud-label) = "fraud"

Naturally, the select expression, rendered here in SQL, can instead be implemented directly as a data filter applied against incoming transactions at a server site in a fraud detection system.

The end result of this query is a stream of data accessed from some remote source based entirely upon the classifications learned at site 3. Notice that requesting transactions classified as "not fraud" would result in no information being returned at all (rather than streaming all data back to the end-user for their own sifting or filtering operation). Likewise, in a fraud detection system, alarms would be initiated only for those incoming transactions that have been selectively labelled as potentially fraudulent.
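Read as a data filter, the same classification might be applied to a stream of incoming transactions, with only the fraud-labelled ones flowing through (the Classifier interface here is hypothetical):

    import java.util.stream.Stream;

    class FraudFilter {
        interface Classifier { boolean isFraud(double[] transaction); }

        // Only transactions the classifier labels as fraudulent pass through;
        // asking for "not fraud" would simply yield an empty stream rather
        // than shipping all the data back to the requester.
        static Stream<double[]> filterFraud(Stream<double[]> incoming, Classifier c) {
            return incoming.filter(c::isFraud);
        }
    }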

Next, a new classifier, say M, can be computed by combining the collective knowledge of the four classifiers using, for example, the ID3 meta-learning algorithm. M is trained over meta-level training data, i.e. the class predictions from the four base classifiers, as well as the raw training data that generated those predictions.⁷

⁶The specific confidential attribute names are not revealed here.

⁷This meta-learning strategy is denoted class-combiner, as defined in (Chan & Stolfo 1995a; 1995b).

The meta-training data is a small fraction of the total amount of distributed training data.⁸ In order for M to generate its final class predictions, it requires the classifications generated by C_ID3-1(), C_ID3-2(), C_Ripper-3() and C_Ripper-4(). The result is a tree-structured meta-classifier, depicted in Figure 4.

In this figure, the descendant nodes of the decision tree are indented, while the leaves specify the final classifications (fraud label 0 or 1). A (logic-based) equivalent of the first branch at the top of the ID3 decision tree is:

"If (X.Prediction{site- 1} -- 1) and(X.Prediction{site - 2} = 1)

then the transaction is fraudulent i.e.X.Fraud-label = 1."Similarly, with M, we may access credit card fraud

data at any site in the same fashion as CRipp~r-3 usedover site 5.
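The prediction flow just described might be sketched as follows, with hypothetical types: the base classifiers' class predictions for a transaction x, together with x itself (as in the class-combiner strategy above), form the record that M classifies.

    class MetaPredict {
        interface Classifier { int classify(double[] features); }

        // Final prediction for one transaction x: gather the base predictions,
        // append the raw features, and let the meta-classifier M decide.
        static int predict(double[] x, Classifier[] base, Classifier meta) {
            double[] metaRecord = new double[base.length + x.length];
            for (int i = 0; i < base.length; i++)
                metaRecord[i] = base[i].classify(x);   // C_ID3-1(x), ..., C_Ripper-4(x)
            System.arraycopy(x, 0, metaRecord, base.length, x.length);
            return meta.classify(metaRecord);          // M's fraud label for x
        }
    }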

Our experiments for fraud and intrusion detection are ongoing. In one series of measurements, we trained the base classifiers and meta-classifiers over a sample data set with 50% fraudulent and 50% non-fraudulent transactions. Then we tested their accuracy against a different and unseen sample set of data with a 20%/80% distribution. In summary, Ripper and Cart were the best base classifiers, and Bayes the best and most stable meta-classifier. Ripper and Cart were each able to catch 80% of the fraudulent transactions (True Positives, or TP) but also misclassified 16% of the legitimate transactions (False Positives, or FP), while Bayes exhibited 80% TP and 13% FP in one setting and 80% TP and 19% FP in another. In the first setting, Bayes combined the three base classifiers with the least correlated error, and in the second it combined the four most accurate base classifiers. The experiments, settings, rationale and results have been reported in detail in a companion paper (Stolfo et al. 1997), also available from http://www.cs.columbia.edu/~sal/JAM/PROJECT.

Conclusions

We believe the concepts embodied by the term meta-learning provide an important step in developing systems that learn from massive databases and that scale. A deployed and secured meta-learning system will provide the means of using large numbers of low-cost networked computers to collectively learn useful new knowledge from massive databases, knowledge that would otherwise be prohibitively expensive to obtain. We believe meta-learning systems deployed as intelligent agents will be an important contributing technology for deploying intrusion detection facilities in global-scale, integrated information systems.

⁸The section detailing the meta-learning strategies in (Chan & Stolfo 1995b) describes the various empirically determined bounds placed on the meta-training data sets while still producing accurate meta-classifiers.


Fraud :- a >= 148, b >= 695.
Fraud :- a >= 774.
Fraud :- c <= 12, d <= 4814.
Fraud :- e = "8", f = "3".
Fraud :- c <= 200, a >= 4.
NonFraud :- true.

Figure 4: Left: This sample rule-based model covers 1365 non-fraudulent and 290 fraudulent credit card transactions. Right: A portion of the ID3 decision tree meta-classifier, learned from the predictions of the four base classifiers. In this portion, only the classifiers from site-1 and site-2 are displayed.

In this paper we described the JAM architecture, a distributed, scalable, extensible and portable agent-based system that supports the launching of learning and meta-learning agents to distributed database sites and builds upon existing agent infrastructure available over the Internet today. JAM can integrate distributed knowledge and boost the overall predictive accuracy of a number of independently learned classifiers through meta-learning agents. We have engaged JAM in a real, practical and important problem. In collaboration with the FSTC we have populated these database sites with records of credit card transactions, provided by different banks, in an attempt to detect and prevent fraud by combining learned patterns and behaviors from independent sources.

Acknowledgements

We wish to thank David Wolpert, formerly of TXN and presently at IBM Almaden, Hank Vacarro of TXN, Shaula Yemini of Smarts, Inc. and Yechiam Yemini for many useful and insightful discussions. We also wish to thank Dan Schutzer of Citicorp, Adam Banckenroth of Chase Bank, Tom French of First Union Bank and John Doggett of Bank of Boston, all executive members of the FSTC, for their support of this work.

References

Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.

Chan, P., and Stolfo, S. 1993. Toward parallel and distributed learning by meta-learning. In Working Notes AAAI Work. Knowledge Discovery in Databases, 227-240.

Chan, P., and Stolfo, S. 1995a. A comparative evaluation of voting and meta-learning on partitioned data. In Proc. Twelfth Intl. Conf. Machine Learning, 90-98.

Chan, P., and Stolfo, S. 1995b. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. Intl. Conf. Knowledge Discovery and Data Mining, 39-44.

Chan, P. 1996. An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning. Ph.D. Dissertation, Department of Computer Science, Columbia University, New York, NY.

Clark, P., and Niblett, T. 1989. The CN2 induction algorithm. Machine Learning 3:261-285.

Cohen, W. W. 1995. Fast effective rule induction. In Proc. Twelfth Intl. Conf. Machine Learning. Morgan Kaufmann.

Cost, S., and Salzberg, S. 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 10:57-78.

Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. New York, NY: Wiley.

Fayyad, U.; Piatetsky-Shapiro, G.; and Smyth, P. 1996. The KDD process for extracting useful knowledge from data. Communications of the ACM 39(11):27-34.

Lee, W., and Barghouti, N. S. 1997. JavaDot: An extensible visualization environment. Technical Report CUCS-02-97, Department of Computer Science, Columbia University, New York, NY.

Merz, C., and Murphy, P. 1996. UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Dept. of Info. and Computer Sci., Univ. of California, Irvine, CA.

Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1:81-106.

Stolfo, S.; Fan, W.; Lee, W.; Prodromidis, A.; and Chan, P. 1997. Credit card fraud detection using meta-learning: Issues and initial results. In AAAI Workshop on AI Approaches to Fraud Detection and Risk Management.

Wolpert, D. 1992. Stacked generalization. Neural Networks 5:241-259.

Zhang, X.; Mckenna, M.; Mesirov, J.; and Waltz, D. 1989. An efficient implementation of the backpropagation algorithm on the Connection Machine CM-2. Technical Report RL89-1, Thinking Machines Corp.
