
Chapter 3

Data Mining for Counter-Terrorism

Bhavani Thuraisingham

The MITRE Corporation, Burlington Road, Bedford, MA

On leave at the National Science Foundation, Arlington, VA

Abstract: Data mining is becoming a useful tool for detecting and preventing terrorism. This paper first discusses some technical challenges for data mining as applied to counter-terrorism applications. Next it provides an overview of the various types of terrorist threats and describes how data mining techniques could provide solutions for counter-terrorism. Finally, it discusses some privacy concerns and potential solutions that could detect terrorist activities and yet attempt to maintain privacy.

Keywords: Counter-terrorism, Data Mining, Privacy

3.1 Introduction

Data mining is the process of posing queries and extracting useful patterns or trends, often previously unknown, from large amounts of data using various techniques such as those from pattern recognition and machine learning. There have been several developments in data mining, and the technology is being used for a wide variety of applications, from marketing and finance to medicine and biotechnology to multimedia and entertainment. Recently there has been much interest in exploring the use of data mining for counter-terrorism applications. For example, data mining can be used to detect unusual patterns, terrorist activities and fraudulent behavior. While all of these applications of data mining can benefit humans and save lives, there is also a negative side to this technology, since it could be a threat to the privacy of individuals. This is because data mining tools are available on the web or otherwise, and even naive users can apply these tools to extract information from the data stored in various databases and files and consequently violate the privacy of individuals. To carry out effective data mining and extract useful information for counter-terrorism and national security, we need to gather all kinds of information about individuals. However, this information could be a threat to the individuals’ privacy and civil liberties.

In this paper we provide an overview of applying data mining for counter-terrorism. At the workshop on Next Generation Data Mining (NGDM), a panel was conducted on Data Mining for Counter-terrorism. The panel raised many interesting technical challenges. In Section 3.2 of this paper we discuss some of these challenges. To understand how data mining could be applied, we need a good understanding of what the terrorist threats are. We have grouped the threats into several categories and discuss them in Section 3.3. Applying data mining techniques for counter-terrorism is the subject of Section 3.4. There have been many discussions recently of the privacy violations that could occur as a result of data mining. In Section 3.5 we address privacy and discuss data mining solutions that attempt to detect or prevent terrorism and at the same time maintain some level of privacy. The paper is concluded in Section 3.6.

3.2 Research Challenges

The panel on data mining for counter-terrorism at the NGDM workshop discussed several technical challenges. We discuss a few of them in this section. Data mining technologies have advanced a great deal and are now being applied to many applications. The main question is, are they ready for detecting and/or preventing terrorist activities? For example, can we completely eliminate false positives and false negatives? False positives could be disastrous for the individuals wrongly flagged. False negatives could allow terrorist activities to go undetected. The challenge is to find the “needle in the haystack.” We need knowledge directed data mining to eliminate false positives and false negatives as much as possible.
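To make the “needle in the haystack” problem concrete, the following minimal sketch works through the arithmetic of false positives and false negatives when the events sought are very rare. All of the numbers (population size, prevalence, detection rate, false positive rate) are hypothetical values chosen purely for illustration, and the function name `expected_alerts` is our own.

```python
def expected_alerts(population, prevalence, sensitivity, false_positive_rate):
    """Return expected (true positives, false positives, false negatives, precision).

    All inputs are illustrative assumptions, not measured values.
    """
    positives = population * prevalence          # cases that are truly of interest
    negatives = population - positives           # everyone else
    true_positives = positives * sensitivity     # real cases the tool flags
    false_negatives = positives - true_positives # real cases the tool misses
    false_positives = negatives * false_positive_rate  # innocent cases flagged
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, false_negatives, precision


# Hypothetical scenario: 1,000,000 records, 1 in 100,000 truly of interest,
# a tool that catches 99% of real cases and wrongly flags 1% of the rest.
tp, fp, fn, precision = expected_alerts(
    population=1_000_000,
    prevalence=0.00001,
    sensitivity=0.99,
    false_positive_rate=0.01,
)
print(f"true positives:  {tp:.1f}")
print(f"false positives: {fp:.1f}")
print(f"false negatives: {fn:.1f}")
print(f"share of alerts that are real: {precision:.4%}")
```

Under these assumed numbers, roughly ten thousand records are flagged in error for about ten real cases, which is why even a seemingly accurate tool needs additional knowledge to be useful in this setting.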

Another challenge is mining data in real-time. We now have tools to detect credit card violations and calling card violations, and these tools function in real-time. However, can one build models in real-time? The general view among the research community is that real-time model building is a challenge. Furthermore, for detecting terrorist activities we need good training examples. How can we get such examples, especially in an unclassified setting?

A third challenge is multimedia data mining. While we now have tools to mine structured and relational databases, mining unstructured databases is still a challenge. Do we extract structure from unstructured databases and then mine the structured data, or do we apply mining tools directly on unstructured data? Furthermore, while there is progress on text mining, we need work on audio and video as well as on image mining.

Other directions include graph and pattern mining. For example, one has to connect all the dots. Essentially one builds a graph structure based on the information he or she has. If multiple agencies are working on the problem, then each agency will have its own graph. The challenge is to be able to make inferences about missing nodes and links in the graph. Also, the graph could be very large. The question is, how can one reduce the graph to a more manageable size?
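As one way to picture inference about missing links, the sketch below merges two partial graphs and proposes unlinked pairs of nodes that share several neighbours. The common-neighbours heuristic is a simple, well-known link prediction baseline that we have substituted for illustration; the panel did not prescribe a particular method. The node names and the two agency graphs are hypothetical.

```python
from itertools import combinations


def merge_graphs(*graphs):
    """Union several adjacency-set graphs into one undirected graph."""
    merged = {}
    for graph in graphs:
        for node, neighbours in graph.items():
            merged.setdefault(node, set()).update(neighbours)
            for other in neighbours:
                # Store every edge in both directions.
                merged.setdefault(other, set()).add(node)
    return merged


def candidate_links(graph, min_common=2):
    """List unlinked node pairs that share at least min_common neighbours."""
    scored = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already linked, nothing to infer
        common = graph[a] & graph[b]
        if len(common) >= min_common:
            scored.append((a, b, len(common)))
    # Most shared neighbours first, then alphabetical for a stable order.
    return sorted(scored, key=lambda item: (-item[2], item[0], item[1]))


# Hypothetical partial graphs held by two different agencies.
agency_one = {"A": {"B", "C"}, "D": {"B"}}
agency_two = {"D": {"C"}, "E": {"A"}}

merged = merge_graphs(agency_one, agency_two)
suggestions = candidate_links(merged)
print(suggestions)
```

In this toy example neither agency's graph alone yields any candidate, while the merged graph suggests two possible links. Comparing all pairs is quadratic in the number of nodes, so a very large graph would first need the kind of reduction the text asks about.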

Finally, finding the data to test the ideas is still a major challenge. How can we get unclassified data? Is it possible to scrub and clean the classified data and produce reasonable data at the unclassified level? How can we find large data sets consisting of multimedia data types? Is it possible to develop a test-bed where one can apply the various data mining tools to determine their efficiency?

Web mining is a challenge for detecting unusual patterns. In a way web mining encompasses data mining, as one has to mine all the data on the web as well as mine the structure and usage patterns. By mining the usage patterns one could find, for example, that there is an unusual number of visits to a federal web site from Paris around 3 a.m. Data on the web includes structured as well as unstructured data. Therefore the tools developed for data mining apply to web mining also. In addition, we need tools to mine the structure of the web as well as the usage patterns.
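The usage pattern example can be sketched as a simple comparison of today's visit counts against past days. This is only an illustrative baseline under our own assumptions: the visit counts are made up, the log is assumed to be already summarized per (origin, hour) pair, and the z-score threshold of 3.0 is an arbitrary choice.

```python
from collections import Counter
from statistics import mean, pstdev


def unusual_visit_counts(history, today, threshold=3.0):
    """Flag (origin, hour) pairs whose count today is far above the past mean.

    history: non-empty list of Counter objects, one per past day.
    today:   Counter of visit counts for the day being checked.
    """
    flagged = []
    for key, count in today.items():
        past = [day.get(key, 0) for day in history]
        average = mean(past)
        spread = pstdev(past) or 1.0  # fall back to 1.0 when past counts are identical
        z_score = (count - average) / spread
        if z_score >= threshold:
            flagged.append((key, count, round(z_score, 2)))
    return flagged


# Hypothetical daily summaries keyed by (origin city, hour of day).
history = [
    Counter({("Paris", 3): 2, ("Boston", 14): 40}),
    Counter({("Paris", 3): 1, ("Boston", 14): 38}),
    Counter({("Paris", 3): 3, ("Boston", 14): 42}),
]
today = Counter({("Paris", 3): 50, ("Boston", 14): 41})

flagged = unusual_visit_counts(history, today)
for (origin, hour), count, z_score in flagged:
    print(f"unusual: {count} visits from {origin} at hour {hour} (z = {z_score})")
```

Here only the 3 a.m. visits from Paris stand out, because the afternoon traffic from Boston, although larger in absolute terms, is in line with its own history.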

Privacy is a major challenge with respect to data mining for counter-terrorism. The challenge is to extract useful information from data mining but at the same time maintain privacy. Several efforts are under way on privacy preserving data mining. The idea here is to use various techniques such as randomization, cover stories, as well as multi-party policy enforcement for privacy preserving data mining. While there is some progress, the effectiveness of these techniques needs to be determined.
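To illustrate the flavor of randomization, the sketch below uses randomized response, a classic technique in which each individual's yes/no answer is perturbed before it is collected, yet the overall rate can still be estimated. This is one possible randomization scheme chosen by us for illustration; the text does not say which scheme the efforts under way use. The population, the 30% rate and the probability `p_truth` are hypothetical.

```python
import random


def randomize(answer, p_truth, rng):
    """Report the true answer with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return answer
    return rng.random() < 0.5


def estimate_true_rate(observed_rate, p_truth):
    """Invert observed = p_truth * true_rate + (1 - p_truth) * 0.5."""
    return (observed_rate - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population in which 30% of the true answers are "yes".
true_answers = [i < 3000 for i in range(10000)]
reports = [randomize(answer, 0.5, rng) for answer in true_answers]

observed = sum(reports) / len(reports)
estimate = estimate_true_rate(observed, 0.5)
print(f"observed yes-rate in perturbed reports: {observed:.3f}")
print(f"estimated true yes-rate:                {estimate:.3f}")
```

No single perturbed report reveals an individual's true answer with certainty, while the aggregate estimate stays close to the true rate. The price is extra variance in the estimate, which is one aspect of the effectiveness question raised above.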

The above are some of the challenges for data mining for counter-terrorism discussed at the workshop. That is, while data mining could become a useful tool for counter-terrorism, there are many challenges that need to be addressed. They include mining multimedia data, graph mining, building models in real-time, knowledge directed data mining to eliminate false positives and false negatives, web mining, and privacy sensitive data mining. Research is progressing in the right direction. However, there is still much to be done (see also [14]).

Now that we have provided an overview of the challenges in data mining for counter-terrorism, in the next three sections we provide some more details on this topic. To understand how data mining may be applied, we need a good understanding of what the threats are. In Section 3.3 we provide an overview of various threats and protection measures. In Section 3.4 we examine how data mining could provide potential counter-terrorism solutions, especially for the threats discussed in Section 3.3. Because of the importance of privacy and the potential threats to privacy due to data mining, we discuss various privacy issues in Section 3.5.


3.3 Some Information on Terrorism, Security Threats, and Protection Measures

3.3.1 Overview

Now we are ready to embark on a critical application of data mining technologies. This application is counter-terrorism. Counter-terrorism is mainly about developing counter-measures to threats arising from terrorist activities. In this section we focus on the various types of threats that could occur. In Section 3.4 we will discuss how data mining could help prevent and detect the threats.

Our discussion of counter-terrorism is rather preliminary. We are not claiming to be counter-terrorism experts. The information on terrorist threats we have presented here has been obtained entirely from unclassified newspaper articles and news reports that have appeared over the years. Our focus is to illustrate how data mining could help towards combating terrorism. We are not saying that data mining solves all the problems. But because data mining has the capability to extract patterns and trends, often previously unknown, we should certainly explore the various data and web data mining technologies for counter-terrorism. For us, web data mining goes beyond data mining. It not only includes data mining techniques, but also focuses on web traffic and usage mining as well as web structure mining. That is, there are additional challenges for web data mining that are not present for just data mining. Web data mining also includes structured data mining as well as unstructured data mining. Furthermore, we believe that much of the data will eventually be on the web, whether on public networks such as the Internet or private ones such as corporate intranets and classified intranets. Therefore, studying web data mining encompasses studying data mining as well.

Before we embark on a discussion of counter-terrorism, we need to discuss the types of threats. Note that threats could be malicious threats due to terror attacks or non-malicious threats due to inadvertent errors. While our main focus is on malicious attacks, we also cover some of the inadvertent errors, as there may be similar solutions to combat such problems. The types of terrorist threats we discuss include non-information related terrorism, information related terrorism, bio-terrorism and chemical attacks. By non-information related terrorism we mean people attacking others with, say, bombs and guns. For this we need to find out who these people are by analyzing their connections and then develop counter-terrorism solutions. By information related threats we mean threats due to the existence of computer systems and networks. These are unauthorized intrusions and viruses as well as computer related vandalism. Information related terrorism is essentially cyber terrorism. Then there are bio-terrorism and chemical and nuclear attacks. These are terrorist attacks caused by biological substances and chemical/nuclear weapons. These are not all the types of threats that exist, but they are the threats we will be examining. We will discuss how data mining could perhaps be used to help prevent and detect attacks due to such threats.

The organization of this section is as follows. Section 3.3.2 discusses threats from natural disasters as well as human errors. We then focus on malicious threats in the next three sections. Non-information related threats are discussed in Section 3.3.3. These include terrorist attacks as well as insider threats, and border and transportation threats. In Section 3.3.4 we discuss information related threats. Essentially this is about cyber-terrorism. Threats arising from biological, chemical and nuclear weapons are discussed in Section 3.3.5. Attacks on critical infrastructures are given special consideration in Section 3.3.6. Note that infrastructures may also be attacked during information related attacks and non-information related attacks. We group the threats into two categories in Section 3.3.7: non real-time threats and real-time threats. We analyze the threats discussed in Sections 3.3.3 through 3.3.6 to see whether they are non real-time threats or real-time threats. Then we focus on counter-terrorism measures in Section 3.3.8. These include counter-terrorism for non-information related threats, information related threats, and bio-terrorism. We also briefly examine counter-terrorism measures for non real-time threats as well as for real-time threats. Note that when we want to carry out data mining to combat terrorism, we need good data. This means that we need data about terrorists as well as terrorist activities. This also means we will have to gather data about all kinds of people, events and entities, which could be a serious threat to privacy. Therefore, we will address privacy and civil liberties in Section 3.5.

3.3.2 Natural Disasters and Human Errors

As we have stated in Section 3.3.1, threats could occur due to natural disasters and human errors as well as through malicious attacks. While the near-term responses to these threats may not be that different in terms of emergency response, the way to combat them in the longer term will very likely be quite different.

By natural disasters we mean disasters due to hurricanes, earthquakes, fires, power failures and accidents. Some of these disasters may be due to human errors, such as pressing the wrong button in a process plant and causing the plant to explode. Data mining could help predict some natural disasters. That is, by analyzing a lot of geological data, a data mining tool may predict that an earthquake is about to occur, in which case the people in the area could be evacuated beforehand. Similarly, by analyzing weather data, the tool could predict that hurricanes are about to occur. Emergency responses, whether a building catches fire through a natural disaster or a terrorist attack, may not be that different. In both cases there will be intense panic, although if the building explodes due to a bomb the panic may be more intense and the collapse may be more rapid. We need effective emergency response teams to handle such events. Data mining could be used to analyze, say, previous attacks, train various tools, and then give advice on how to handle the emergency situation. Here again we need training examples, some of which may not exist. In this case we may need to train with hypothetical scenarios and simulated examples.

The long-term measures to be taken for natural disasters may be quite different from those for terrorist attacks. It is not every day that we have an earthquake, even in the most earthquake prone regions. It is not often that we have hurricanes, even in the most hurricane prone regions. Therefore we have time to plan and react. This does not mean that a natural disaster is less complex to manage. It could be devastating and take many human lives. Nevertheless, countries usually plan for such disasters mainly through experience.

Human errors are also a source of major concern. We need to continually train, say, the operators and advise them to be cautious and alert. We need to take proper action if humans have been careless. That is, unless there is a very good excuse, human errors should not be treated lightly. This way, humans will be cautious and perhaps not make such errors.

Terrorist attacks are quite different. The problem is that one does not know when an attack will happen or how it will happen. Many of us could never have imagined that airplanes would be used as weapons of mass destruction to bring down the famous World Trade Center towers. Many of us still may not know what the next attack may be. Would it be an attack by suicide bombers, an attack with chemical weapons, or an attack through cyber terrorism? The counter-measures for prevention and detection may need to be quite intense for terrorist attacks. As we have stated, we are not experts on counter-terrorism, nor have we studied the nature of the attacks. Our goal is to examine the various data mining techniques to see how they could be applied to handle the various threats that have been discussed almost daily in the newspapers and on television.

It should however be noted that to develop effective techniques, data mining specialists have to work together with counter-terrorism experts. That is, one cannot use the techniques without a good understanding of what the threats are. Therefore, while the contents of this paper may be used as a reference, I would urge those interested in applying data mining techniques to real world problems and terrorist attacks to work with counter-terrorism specialists. In the next few sections we discuss various types of terrorism and counter-terrorism measures.

3.3.3 Non-Information Related Terrorism

3.3.3.1 Overview

In this section we provide an overview of various types of non-information related terrorism. Note that by information related terrorism we mean attacks essentially on computers and networks. That is, they are threats that damage electronic information. By non-information related terrorism we mean terrorism due to other means, such as terrorist attacks, car bombings, and vandalism such as setting fires.

The organization of this section is as follows. We discuss terrorist attacks and external threats in Section 3.3.3.2. Insider threats are discussed in Section 3.3.3.3. Attacks on borders and transportation are discussed in Section 3.3.3.4. Although border and transportation attacks may be considered part of non-information related attacks, we give them special consideration as there is so much discussion now about securing the borders and transportation mechanisms.

3.3.3.2 Terrorist Attacks and External Threats

When we hear the word terrorism, it is the external threats that come to mind. My earliest recollection of terrorism is “riots,” where one ethnic group attacks another ethnic group by killing, looting, setting fire to houses, and other acts of terrorism and vandalism. Later on we heard of airplane hijackings, where a group of terrorists hijack airplanes and then make demands on governments, such as releasing political prisoners who could possibly be terrorists. Then we heard of suicide bombings, where terrorists carry bombs and blow themselves up as well as others nearby. Such attacks usually occur in crowded places. More recently we have heard of using airplanes to blow up buildings.

While the above acts are all terrorist attacks, we also hear almost daily about someone shooting and killing someone else when neither party belongs to any gang or terrorist group. This in a way is terrorism also, but these acts are more difficult to detect and prevent because there are always what are called “crazy people” in our society. While the technologies should detect and prevent such attacks also, this paper focuses on how to detect attacks from people belonging to terrorist groups.

All of the threats we have discussed above are external threats. These are threats arising from the outside. In general, the terrorists are neither friends nor acquaintances of the victims involved. But there are also other kinds of threats, namely insider threats. We discuss them in the next section.

3.3.3.3 Insider Threats

Insider threats are threats from people inside an organization attacking others around them, perhaps not through bombs and airplanes but through other sinister mechanisms. One example of an insider threat is someone from a corporation giving information about proprietary products to a competitor. Another example is an agent from an intelligence agency committing espionage. A third example is a threat coming from one’s own family, for example betrayal by a spouse who has insider information about assets and gives that information to a competitor to his or her advantage. That is, insider threats can occur at all levels and in all walks of life and could be quite dangerous and sinister because you never know who these terrorists are. They may be your so-called “best friends” or even your spouse or your siblings.

Note that people from the inside could also use guns to shoot people around them. We often hear about office shootings. But these shootings are not in general insider threats, as they are not happening in sinister ways. That is, these shootings are a kind of external threat although they come from people within an organization. We also often hear about domestic abuse and violence, such as husbands shooting wives or vice versa. These are also external threats although they occur from the inside. Insider threats are threats where others around are totally unaware until perhaps something quite dangerous occurs. We have heard that espionage can go on for years before someone gets caught. While both insider threats and external threats are very serious and could be devastating, insider threats can be even more dangerous because one never knows who these terrorists are.

3.3.3.4 Transportation and Border Security Violations

Let us examine border threats first and then discuss transportation threats. Safeguarding the borders is critical for the security of a nation. Threats at borders range from illegal immigration to gun and drug trafficking as well as human trafficking to terrorists entering a country. We are not saying that illegal immigrants are dangerous or are terrorists. They may be very decent people. However, they have entered a country without the proper papers, and that could be a major issue. For official immigration into, say, the USA, one needs to go through interviews at US embassies, medical checkups and X-rays as well as checks for diseases such as tuberculosis, background checks and many more things. It does not mean that people who have entered a country legally are always innocent. They could be terrorists also. But at least there is some assurance that proper procedures have been followed. Illegal immigration can also cause problems for the economy of a society and violate human rights through cheap illegal labor.

As we have stated, drug trafficking occurs frequently at borders. Drugs are a danger to society. They could cripple a nation, corrupt its children, cause havoc in families, and damage the education system. It is therefore critical that we protect the borders from drug trafficking as well as other types of trafficking, including firearms and human slaves. Other threats at borders include prostitution and child pornography, which are serious threats to decent living. It does not mean that everything is safe inside the country and these problems are only at borders. Nevertheless, we have to protect our borders so that no additional problems are brought into a nation.

Transportation systems security violations can also cause serious problems. Buses, trains and airplanes are vehicles that can carry tens or hundreds of people at the same time, and any security violation could cause serious damage and even deaths. A bomb exploding in an airplane or a train or a bus could be devastating. Transportation systems are also the means for terrorists to escape once they have committed crimes. Therefore transportation systems have to be secure. A key aspect of transportation systems security is port security. These ports are responsible for ships of the United States Navy. Since these ships are at sea throughout the world, terrorists may have opportunities to attack these ships and their cargo. Therefore, we need security measures to protect the ports, cargo, and our military bases. In Section 3.3.8 we will discuss various counter-terrorism measures for the threats we have discussed here. The next three sections discuss additional types of terrorism.

3.3.4 Information Related Terrorism

3.3.4.1 Overview

This section discusses information related terrorism. By information related terrorism we mean cyber-terrorism as well as security violations through access control and other means. Trojan horses and viruses are also information related security violations, which we group into information related terrorism activities.

In the next few subsections we discuss various information related terrorist attacks. In Section 3.3.4.2 we give an overview of cyber terrorism and then discuss insider threats and external attacks. Malicious intrusions are the subject of Section 3.3.4.3. Credit card and identity theft are discussed in Section 3.3.4.4. Information security violations such as access control violations are discussed in Section 3.3.4.5. Since the web is a major means of information transportation, we give web security threats special consideration in Section 3.3.4.6. Note that an excellent book on web security discussing various threats and solutions is the one by Ghosh [10]. We also discuss some of the cyber threats and countermeasures in [11].

3.3.4.2 Cyber-terrorism, Insider Threats, and External Attacks

Cyber-terrorism is one of the major terrorist threats posed to our nation today. As we have mentioned earlier, there is now so much information available electronically and on the web. Attacks on our computers as well as networks, databases and the Internet could be devastating to businesses. It is estimated that cyber-terrorism could cost businesses billions of dollars. For example, consider a banking information system. If terrorists attack such a system and deplete accounts of their funds, then the bank could lose millions and perhaps billions of dollars. By crippling the computer system, millions of hours of productivity could be lost, and that equates to money in the end. Even a simple power outage at work through some accident could cause several hours of productivity loss and as a result a major financial loss. Therefore it is critical that our information systems be secure. Next we discuss various types of cyber terrorist attacks. One is spreading viruses and Trojan horses that can wipe away files and other important documents. Another is intruding into computer networks, which we discuss in the next section. Information security violations such as access control violations, as well as various other threats such as sabotage and denial of service, are discussed later.

Note that threats can occur from outside or from the inside of an organization. Outside attacks are attacks on computers from someone outside the organization. We hear of hackers breaking into computer systems and causing havoc within an organization. There are hackers who spread viruses, and these viruses cause great damage to the files in various computer systems. But a more sinister problem is the insider threat. Just as with non-information related attacks, there is an insider threat with information related attacks. There are people inside an organization who have studied the business practices and develop schemes to cripple the organization’s information assets. These people could be regular employees or even those working at computer centers. The problem is quite serious, as someone may be masquerading as someone else and causing all kinds of damage. In the next few sections we examine how data mining could detect and perhaps prevent such attacks.

3.3.4.3 Malicious Intrusions

We have discussed some aspects of malicious intrusions. These could be intrusions into networks, web clients and servers, databases, operating systems, etc. Many of the cyber terrorism attacks that we have discussed in the previous sections are malicious intrusions. We revisit them in this section.

We hear a lot about network intrusions. What happens here is that intruders try to tap into the networks and get the information that is being transmitted. These intruders may be human intruders or Trojan horses set up by humans. Intrusions could also happen on files. For example, one can masquerade as someone else, log into that person’s computer system and access the files. Intrusions can also occur on databases. Intruders posing as legitimate users can pose queries such as SQL queries and access data that they are not authorized to know.

Essentially, cyber terrorism includes malicious intrusions as well as sabotage through malicious intrusions or otherwise. Cyber security consists of security mechanisms that attempt to provide solutions to cyber attacks or cyber terrorism. When we discuss malicious intrusions or cyber attacks, it helps to think about the non-cyber world, that is, non-information related terrorism, and then translate those attacks into attacks on computers and networks. For example, a thief could enter a building through a trap door. In the same way, a computer intruder could enter the computer or network through some sort of trap door that has been intentionally built by a malicious insider or left unattended through perhaps careless design. Another example is a thief entering the bank with a mask and stealing the money. The analogy here is an intruder masquerading as someone else, seemingly legitimately entering the system and taking all the information assets. Money in the real world would translate to information assets in the cyber world. That is, there are many parallels between non-information related attacks and information related attacks. We can proceed to develop counter-measures for both types of attacks. These counter-measures are discussed in Section 3.3.8.

3.3.4.4 Credit Card Fraud and Identity Theft

We are hearing a lot these days about credit card fraud and identity theft. In the case of credit card fraud, others get hold of a person’s credit card and make all kinds of purchases; by the time the owner of the card finds out, it may be too late. The thief may have left the country by then. A similar problem occurs with telephone calling cards. In fact this type of attack has happened to me once. Perhaps while I was making phone calls using my calling card at airports, someone noticed, say, the dial tones and used my calling card. This was my company calling card. Fortunately our telephone company detected the problem and informed my company. The problem was dealt with immediately.

A more serious theft is identity theft. Here one assumes the identity of another person, say by getting hold of the person’s social security number, and essentially carries out all transactions under the other person’s name. This could even include selling houses and depositing the income in a fraudulent bank account. By the time the owner finds out, it will be far too late. It is very likely that the owner may have lost millions of dollars due to the identity theft.

We need to explore the use of data mining both for credit card fraud detection and for identity theft. There have been some efforts on detecting credit card fraud (see citeAFCE). We need to start working actively on detecting and preventing identity thefts.

3.3.4.5 Information Security Violations

In this section we provide an overview of the various information security violations. These violations do not necessarily occur through cyber attacks or cyber terrorism. They could occur through bad security design and practices. Nevertheless we have included this discussion for completeness.


Information security violations typically occur due to access control violations. That is, users are granted access depending on their roles (which is called role-based access control), their clearance level (which is called multilevel access control), or on a need to know basis. Access controls are usually violated due to poor design or designer errors. For example, suppose John does not have access to salary data. By some error this rule may not be enforced and as a result, John gets access to salary values. Access control violations can occur due to malicious attacks also. That is, someone could enter the system by pretending to be the system administrator and delete the access control rule that John does not have access to salaries. Another way is for a Trojan horse to operate on behalf of the malicious users so that each time John makes a request, the malicious code ensures that the access control rule is bypassed.
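The John and salary example can be sketched as a small role-based check. Everything here is hypothetical and for illustration only: the roles, the user mary, the employee alice, the salary figure and the function names are our own. The second function shows the kind of design error described above, where the rule exists but is never consulted.

```python
# Hypothetical role-based access control tables.
ROLE_PERMISSIONS = {
    "manager": {("salary", "read"), ("schedule", "read")},
    "clerk": {("schedule", "read")},
}
USER_ROLES = {"john": "clerk", "mary": "manager"}
SALARIES = {"alice": 50000}  # made-up data


class AccessDenied(Exception):
    """Raised when a user lacks permission for the requested action."""


def is_allowed(user, resource, action):
    """Return True only if the user's role grants (resource, action)."""
    role = USER_ROLES.get(user)
    return (resource, action) in ROLE_PERMISSIONS.get(role, set())


def read_salary(user, employee):
    """Correct path: the access control rule is checked before data is returned."""
    if not is_allowed(user, "salary", "read"):
        raise AccessDenied(f"{user} may not read salary data")
    return SALARIES[employee]


def read_salary_unchecked(user, employee):
    """Design error: the rule is never consulted, so any user gets the data."""
    return SALARIES[employee]


print(read_salary("mary", "alice"))            # permitted for a manager
try:
    read_salary("john", "alice")
except AccessDenied as error:
    print("denied:", error)
print(read_salary_unchecked("john", "alice"))  # the violation: John sees the salary
```

The same leak would result if an attacker posing as the administrator added the salary permission to John's role, which is why the rule tables themselves also need protection.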

3.3.4.6 Security Problems for the Web

As mentioned in Section 3.3.4.1, numerous security attacks can occur through the web. We discuss some of the web security threats in this section. As we have mentioned, Ghosh [10] has provided an excellent introduction to web security and various threats in his book. Note that while we have focused on web threats in this section, the threats discussed are applicable to any information system, such as networks, databases and operating systems. The threats include access control violations, integrity violations, sabotage, fraud, denial of service and infrastructure attacks.

For example, the traditional access control violations could be extended to the web. Users may access unauthorized data across the web. Note that with the web there is so much data all over the place that controlling access to this data will be quite a challenge. Data on the web may be subject to unauthorized modifications, which makes it easier to corrupt the data. Also, data could originate from anywhere and the producers of the data may not be trustworthy. Incorrect data could cause serious damage, such as incorrect bank account data, which could result in incorrect transactions. We hear of hackers breaking into systems and posting inappropriate messages. With so much business and commerce being carried out on the web without proper controls, Internet fraud could cause businesses to lose millions of dollars. An intruder could obtain the identity of legitimate users and, through masquerading, empty their bank accounts. We hear about infrastructures being brought down by hackers. Infrastructures could be the telecommunication system, the power system, and the heating system. These systems are controlled by computers, often through the Internet. Such attacks would cause denials of service.

Other threats include violations of confidentiality, authenticity, and non-repudiation. Confidentiality violations enable intruders to listen in on messages. Authentication violations include using passwords without permission, and non-repudiation violations enable someone to deny having sent a message. The web threats discussed here occur because of insecure clients, servers and networks. To have complete security, one needs end-to-end security; that means secure clients, secure servers, secure operating systems, secure databases, secure middleware and secure networks.

3.3.5 Bio-Terrorism, Chemical and Nuclear Attacks

The previous two sections discussed non-information related as well as information related terrorist attacks. Note that by information related attacks we mean cyber attacks. Non-information related attacks mean everything else. However, we have separated bio-terrorism and chemical weapons attacks from non-information related attacks. We have also given special consideration to critical infrastructure attacks. That is, the non-information related attacks are essentially attacks due to bombs, explosions and other similar activities.

While bio-terrorism and chemical/nuclear weapons attacks have been discussed for at least several decades, it is only after September 11, 2001 that the public has paid a lot of attention to these discussions. The anthrax attacks that occurred during the latter part of 2001 resulted in increased fear and awareness of the potential dangers of bio-terrorism attacks and chemical/nuclear weapons attacks. Such attacks could kill several million people within a short space of time. More recently there is increasing awareness of the dangers of bio-terrorism attacks resulting in the spread of infectious diseases such as smallpox, yellow fever, and similar diseases. These diseases are so infectious that it is critical that their spread is detected as soon as it occurs. Preventing such attacks would be the ultimate goal. One option is to carry out mass vaccination, but this would pose health hazards to various groups of people. Our challenge is to use technology to prevent and detect such deadly attacks. Technologies would include sensor technology as well as data mining and data management technologies.

Attacks using chemical weapons are equally deadly. One could spray poisonous gas and other chemicals into the air, water and food supplies. For example, various dangerous chemical agents could be sprayed from the air on plants and crops. These plants and crops could get into the food supply and kill millions. We have to develop technologies to detect and prevent such deadly attacks. Another form of deadly attack is the nuclear attack. Such attacks could wipe out the entire population of the world. Various nations are developing nuclear weapons without the authorization to develop them. That is, these weapons are being developed illegally. This is what makes the world very dangerous. Again, we have to develop technologies to detect and prevent such attacks.

In this section we have only briefly mentioned the various biological, chemical and nuclear attacks. Some good books are being written about such terrorist activities (see [4] and [5]). As we have stressed, we are not counter-terrorism experts, nor have we studied the various types of terrorist attacks in any depth. Our information is obtained from various newspaper articles and documentaries. Our main goal is to examine various data mining techniques and see how they could be applied to detect and prevent such deadly terrorist attacks. Data mining for counter-terrorism will be discussed in Sections 3.4 and 3.5.

3.3.6 Attacks on Critical Infrastructures

Attacks on critical infrastructures could cripple a nation and its economy. Infrastructure attacks include attacks on telecommunication lines, electric power and gas supplies, reservoirs and water supplies, food supplies and other basic entities that are critical for the operation of a nation.

Attacks on critical infrastructures could occur during any type of attack, whether non-information related, information related or bio-terrorism. For example, one could attack the software that runs the telecommunications industry and close down all the telecommunications lines. Similarly, software that runs the power and gas supplies could be attacked. Attacks could also occur through bombs and explosives. That is, the telecommunication lines could be attacked through bombs. Attacks on transportation lines such as highways and railway tracks are also attacks on infrastructures.

As we have mentioned in Section 3.3.2, infrastructures could also be damaged by natural disasters such as hurricanes and earthquakes. Our main interest here is in malicious attacks on infrastructures, both information related and non-information related. Our goal is to examine data mining and related data management technologies to detect and prevent such infrastructure attacks.

3.3.7 Non Real-time Threats vs. Real-time Threats

The threats that we have discussed so far can be grouped into two categories: non real-time threats and real-time threats. In a way all threats are real-time, as we have to act in real-time once the threats have occurred. However, some threats are analyzed over a period of time while others have to be handled immediately. We discuss the various threats here.

Consider for example the biological, chemical and nuclear threats. These threats have to be handled in real-time. That is, the response to these threats has timing constraints. If the smallpox virus is being spread maliciously, then we have to start vaccinations immediately. Similarly, if networks, say for critical infrastructures, are being attacked, the response has to be immediate. Otherwise we could lose millions of lives and/or millions of dollars.

There are some other threats that do not have to be handled in real-time. For example, consider the behavior of suspicious people such as those belonging to a certain terrorist organization or those enrolling in flight training schools. In a way these people are also planning attacks, but sometimes even they are not sure when they will attack. Therefore, one has to monitor these people, analyze their behavior and predict their actions. While there are timing constraints for these threats, the urgency is not as great as, say, for the spread of the smallpox virus. But one should be vigilant about these non real-time threats also.

In general there is no way to say that A is a real-time threat and B is a non real-time threat. A non real-time threat could turn into a real-time threat. For example, once the terrorists had hijacked the airplanes on September 11, 2001, the threat became a real-time threat, as action had to be taken within, say, an hour.

3.3.8 Aspects of Counter-Terrorism

3.3.8.1 Overview

Now that we have provided some discussion of the various types of terrorist attacks, including non-information related terrorism, information related terrorism, bio-terrorism, etc., we will discuss what counter-terrorism is all about. Counter-terrorism is a collection of techniques used to combat, prevent, and detect terrorism. Our goal in this paper is to examine various data mining techniques to see how we can combat terrorism using these techniques. In this section we will briefly discuss counter-terrorism for the terrorist attacks discussed in the previous sections.

In Section 3.3.8.2 we discuss protecting from non-information related terrorism. In Section 3.3.8.3 we discuss protecting from information related terrorism. In particular, we briefly discuss various web security measures as well as other aspects such as intrusion detection and access control. In Section 3.3.8.4 we discuss protecting from bio-terrorism, chemical attacks and nuclear attacks. In Section 3.3.8.5 we discuss protecting the critical infrastructures. We analyze counter-terrorism measures for non real-time threats as well as for real-time threats in Section 3.3.8.6.

3.3.8.2 Protecting from Non-information Related Terrorism

As we have stated, non-information related counter-terrorism includes protecting from bombings, explosions, vandalism and other kinds of terrorist attacks that do not involve computers. For example, hijacking an airplane and attacking buildings with airplanes is a case of non-information related terrorist activity. The question is, how do we protect against such terrorist attacks?

First of all we need to gather information about various scenarios and examples. That is, we need to identify all kinds of terrorist acts that have occurred in history, from airplane hijackings to bombings of buildings. We also need to gather information about those under suspicion. All of the data that we have gathered needs to be analyzed to see if any patterns emerge.

We also need to ensure there are physical safety measures. For example, we need to check identities at airports and other places. We need to check identities randomly, say in trains, as well as routinely, say at checkpoints. We need to check the belongings of a person either randomly, routinely or if that person arouses suspicion, to see if there are dangerous weapons or chemicals in his/her belongings. We should also use sniffer dogs and sensor devices to see if there are potentially hazardous materials. We need surveillance cameras to see who is entering a building. These cameras should perhaps also capture the facial expressions of various people. The data gathered from the cameras should be analyzed further for suspicious behavior. We also need to enforce access control measures at military bases and seaports.

In summary, several counter-terrorism measures have to be taken to combat non-information related terrorism. These include information gathering and analysis, surveillance, physical security and various other mechanisms. In the next few sections we will examine data mining techniques and see how they can detect and prevent such terrorist attacks.

3.3.8.3 Protecting from Information Related Terrorism

General Discussion

We will first provide an overview of counter-terrorism with respect to information related terrorism. We will give special consideration to security solutions for the web later on. Essentially, protecting from information related terrorism involves detecting and preventing malicious attacks and intrusions. These attacks could be due to viruses, spoofing or masquerading, and could involve stealing, say, information assets. They could also be attacks on databases and malicious corruption of data. That is, terrorist attacks do not necessarily involve stealing or accessing unauthorized information. They could also include malicious corruption and alteration of the data so that the data will be of little or no use. Terrorist attacks also include credit card fraud and identity theft.

Various data mining techniques are being proposed for detecting intrusions as well as credit card fraud. We will discuss them in later sections. Preventing malicious attacks is more challenging. We need to design systems in such a way that malicious attacks and intrusions are prevented. When an intruder attempts to attack the system, the system should figure this out and alert the security officer. Research is being carried out on secure systems design so that such intrusions are prevented. However, there is more focus on detecting such intrusions than on preventing them.

Enforcing appropriate access control techniques is also a way to enforce security. For example, users may have certificates to access the information they need to carry out the jobs that they are assigned to do. The organization should give the users no more and no fewer privileges than they need. There is much research on managing privileges and access rights to various types of systems.

We have briefly discussed cyber security measures. We will discuss security solutions for the web in more detail next. Note that there are also additional problems such as the inference problem, where users pose sets of queries and infer sensitive information. This is also an attack. We will visit the inference problem later when we discuss privacy.

Security Solutions for the Web

We need end-to-end security, and therefore the components include secure clients, secure servers, secure databases, secure operating systems, secure infrastructures, secure networks, secure transactions and secure protocols. One needs good encryption mechanisms to ensure that the sender and receiver communicate securely. Ultimately, whether it be exchanging messages or carrying out transactions, the communication between sender and receiver or between buyer and seller has to be secure. Secure client solutions include securing the browser, securing the Java virtual machine, securing Java applets, and incorporating various security features into languages such as Java. Note that Java is not the only component that has to be secure. Microsoft has come up with a collection of products including ActiveX, and these products have to be secure also. Securing the protocols includes secure HTTP and the secure socket layer. Securing the web server means that the server has to be installed securely and that it has to be ensured that the server cannot be attacked. Various mechanisms that have been used to secure operating systems and databases may be applied here. Notable among them are access control lists, which specify which users have access to which web pages and data. The web servers may be connected to databases at the backend, and these databases have to be secure. Finally, various encryption algorithms are being implemented for the networks, and groups such as OMG (Object Management Group) are envisaging security for middleware such as ORBs (Object Request Brokers).

One of the challenges faced by web managers is implementing security policies. One may have policies for clients, servers, networks, middleware, and databases. The question is, how do you integrate these policies? That is, how do you make these policies work together? Who is responsible for implementing these policies? Is there a global administrator, or are there several administrators that have to work together? Security policy integration is an area that is being examined by researchers.

Finally, one of the emerging technologies for ensuring that an organization’s assets are protected is firewalls. Various organizations now have web infrastructures for internal and external use. To access the external infrastructure one has to go through the firewall. These firewalls examine the information that comes into and goes out of an organization. This way, the internal assets are protected and inappropriate information may be prevented from coming into an organization. We can expect sophisticated firewalls to be developed in the future. Other security mechanisms include cryptography.

3.3.8.4 Protecting from Bio-terrorism and Chemical Attacks

We discussed biological, chemical and nuclear threats in Section 3.3.5. In this section we discuss counter-terrorism measures. First of all, unlike, say, non-information related terrorism, where bombings and shootings are fairly explicit, bio-terrorism and even chemical attacks are not immediately obvious. Suppose a terrorist spreads the smallpox virus. It takes time, at least a few days, before the symptoms surface and a few more days before the diagnosis is made. By then it may be too late, as millions of people may have been infected in trains, planes, large gatherings and meetings. The challenge here is to prevent such attacks as well as to detect them as soon as possible.

Preventing such attacks could mean developing special sensors to sense that there are certain viruses in the air. The sensors may also have to detect what these viruses are. A cold virus may not be as harmful as a smallpox virus. If the disease has spread, then some quick decisions have to be made as to who and how many to vaccinate. Chemical weapons may be treated similarly. One needs sensors to detect who has these weapons. Once the dangerous chemicals are spilt, we need to determine what other agents to spray to limit the damage caused by the chemicals. For example, when one spills acidic material, one counters it by washing with soap-based materials.

In the case of nuclear attacks, we need to determine what nuclear weapons have been used and then decide what actions to take. How do we evacuate the various groups of people in an organized fashion? What medications do we give them? These are very difficult challenges. Research activities are proceeding, but it will take a very long time to find viable solutions.

3.3.8.5 Critical Infrastructure Protection

Next we discuss critical infrastructure protection. Our critical infrastructures are telecommunication lines, networks, water, food, gas and electric lines, etc. Attacking the critical infrastructure could cripple businesses and the country. We need to determine the measures to be taken when the infrastructures are attacked.

Essentially the counter-measures include those developed for non-information related terrorism as well as for information related terrorism. For example, one could bomb the telecommunication lines or create viruses that would affect the telecommunications software. This means that communication through telephones as well as computer communication that occurs through phone lines could be crippled. The counter-measures developed for non-information related terrorism as well as for information related terrorism could be applied here. We need to gather information about the terrorist groups and extract patterns. We also need to detect any unauthorized intrusions. Our ultimate goal is to prevent such disastrous acts.

Even biological, chemical and nuclear weapons could be used to attack the infrastructure of the nation. For example, our food supplies, water supplies and hospitals could be damaged by biological warfare. Here again we need to examine the counter-terrorism measures for biological, chemical and nuclear attacks and apply them.

3.3.8.6 Protecting from Non Real-time and Real-time Threats

In Section 3.3.7 we discussed both non real-time and real-time threats. As we have mentioned, it is difficult to state that A is a real-time threat and B is a non real-time threat. Over time, a non real-time threat could become a real-time threat. Real-time threats have to be handled in real-time. An example of a real-time threat is the spread of the smallpox virus, which has to be detected and prevented immediately.

When it comes to counter-measures for handling these threats, one needs to develop techniques that meet timing constraints to handle real-time threats. For example, if data mining is to be used to detect and prevent malicious intrusions into, say, corporate networks, then these data mining techniques have to give results in real-time. In the case of non real-time threats, the data mining techniques could analyze the data and make predictions that certain threats could occur, say, in July 2003.

In the next section we will revisit non real-time threats and real-time threats from a data mining perspective. While real-time threats need an immediate response, both non real-time threats and real-time threats could be deadly and have to be taken seriously.

3.4 Data Mining Applications in Counter-Terrorism

3.4.1 Overview

In the previous section we discussed various threats and counter-measures. In particular, we discussed non-information related attacks such as bombings and explosions; information related attacks such as cyber terrorism; biological, chemical and nuclear attacks such as the spread of smallpox; and critical infrastructure attacks such as attacks on power and gas lines. Counter-terrorism measures include ways of protecting from non-information related attacks, information related attacks, biological, chemical and nuclear attacks, as well as critical infrastructure attacks.

In this section we will provide a high level overview of how web data mining as well as data mining could help toward counter-terrorism. Note that we have used web data mining and data mining more or less interchangeably, as our definition of web data mining goes beyond just mining structured data. We have included mining unstructured data, mining for business intelligence, web usage mining and web structure mining as part of web data mining. That is, in a way web data mining encompasses data mining.

As we have stated, data mining could contribute towards counter-terrorism. We are not saying that data mining will solve all our national security problems. However, the ability to extract hidden patterns and trends from large quantities of data is very important for detecting and preventing terrorist attacks.

The organization of this section is as follows. Section 3.4.2 provides an overview of web data mining for counter-terrorism. We will analyze the techniques in Section 3.4.3. A particular technique, called link analysis, that may be very important for counter-terrorism applications will be given more consideration in Section 3.4.4. The section is summarized in Section ??.

3.4.2 Data Mining for Handling Threats

3.4.2.1 Overview

In Section 3.3 we grouped threats in different ways. One grouping was whether they were information related or non-information related. It was somewhat artificial, as we need information for all types of threats. However, in our terminology, information related threats were threats dealing with computers. Some of these threats were real-time threats while others were non real-time threats. Even here the grouping was somewhat arbitrary, as a non real-time threat could become a real-time threat. For example, one could suspect that a group of terrorists will eventually perform some act of terrorism. However, when we set time bounds, such as that a threat will likely occur, say, before July 1, 2003, then it becomes a real-time threat and we have to take action immediately. If the time bounds are tighter, such as that a threat will occur within two days, then we cannot afford to make any mistakes in our response.

The purpose of this section is to examine both non real-time threats and real-time threats and see how data mining in general and web data mining in particular could handle such threats. Again we want to stress that web data mining in our terminology encompasses data mining, as it deals with data mining on the web as well as mining structured and unstructured data. Furthermore, we are assuming that much of the data will be on the web, whether on public networks such as the Internet or private networks such as corporate intranets. Therefore, we are using the terms data mining and web data mining interchangeably. In Section 3.4.2.2 we discuss non real-time threats and in Section 3.4.2.3 we discuss real-time threats. We will refer to the specific examples that we have mentioned in the previous section in our discussions as needed. Section 3.4.3 will examine the various data mining outcomes and techniques and see how they can help toward counter-terrorism. Some very good articles on data mining for counter-terrorism were presented at the Security Informatics Workshop held in June 2003 (see [6]).

3.4.2.2 Non Real-time Threats

Non real-time threats are threats that do not have to be handled in real-time. That is, there are no strict timing constraints for these threats. For example, we may need to collect data over months, analyze the data and then detect and/or prevent some terrorist attack, which may or may not occur. The question is, how does data mining help with such threats and attacks? As we have stressed in [14], we need good data to carry out data mining and obtain useful results. We also need to reason with incomplete data. This is the big challenge, as organizations are often not prepared to share their data. This means that the data mining tools have to make assumptions about the data belonging to other organizations. The alternative is to carry out federated data mining under some federated administrator. For example, the Department of Homeland Security could serve as the federated administrator and ensure that the various agencies have autonomy but at the same time collaborate when needed.

Next, what data should we collect? We need to start gathering information about various people. The question is, who? Everyone in the world? This is quite impossible. Nevertheless we need to gather information about as many people as possible, because sometimes even those who seem most innocent may have ulterior motives. One possibility is to group individuals depending on, say, where they come from, what they are doing, who their relatives are, etc. Some people may have more suspicious backgrounds than others. If we know that someone has a criminal record, then we need to be more vigilant about that person.

Again, to have complete information about people, we need to gather all kinds of information about them. This information could include information about their behavior, where they have lived, their religion and ethnic origin, their relatives and associates, their travel records, etc. Yes, gathering such information is a violation of one’s privacy and civil liberties. The question is, what alternative do we have? By omitting information we may not have the complete picture. From a technology point of view, we need complete data not only about individuals but also about various events and entities. For example, suppose I drive a particular vehicle and information is being gathered about me. This will also include information about my vehicle, how long I have been driving, whether I have other hobbies or interests such as flying airplanes, and whether I have enrolled in flight schools and told the instructor that I would like to learn to fly an airplane but do not care about learning take-offs or landings.

Once the data is collected, it has to be formatted and organized. Essentially one may need to build a warehouse to analyze the data. Data may be structured or unstructured. Also, some of the data that is warehoused may not be of much use. For example, the fact that I like ice cream may not help the analysis a great deal. Therefore, we can segment the data into critical data and non-critical data.

Once the data is gathered and organized, the next step is to carry out mining. The question is, what mining tools should we use and what outcomes should we look for? Do we want to find associations or clusters? This will depend on what our goal is. We may want to find anything that is suspicious. For example, the fact that I want to learn flying without caring about take-off or landing should raise a red flag, as in general one would want to take a complete course on flying. In Section 3.4.3 we discuss the various outcomes of interest to counter-terrorism activities. Once we determine the outcomes we want, we determine the mining tools to use and start the mining process.

Then comes the very hard part. How do we know that the mining results are useful? There could be false positives and false negatives. For example, the tool could incorrectly produce the result that John is planning to attack the Empire State Building on July 1, 2003. Then the law enforcement officials will be after John and the consequences could be disastrous. The tool could also incorrectly produce the result that James is innocent when he is in fact guilty. In this case the law enforcement officials may not pay much attention to James. The consequences here could be disastrous also. As we have stated, we need intelligent mining tools. At present we need human specialists to work with the mining tools. If the tool states that John could be a terrorist, the specialist will have to do some more checking before arresting or detaining John. On the other hand, if the tool states that James is innocent, the specialist should do some more checking in this case also.
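The cost of false positives can be made concrete with a little arithmetic. The sketch below is illustrative only: the functions `confusion_counts` and `expected_precision` are hypothetical helpers, and the population size, number of true threats and accuracy figures used in the note that follows are invented for the example rather than drawn from any real system.

```python
def confusion_counts(actual, predicted):
    """Count (true positives, false positives, false negatives, true negatives).

    `actual` and `predicted` are equal-length sequences of booleans, where
    True means "is a threat" and "was flagged by the tool" respectively.
    """
    tp = fp = fn = tn = 0
    for is_threat, flagged in zip(actual, predicted):
        if flagged and is_threat:
            tp += 1
        elif flagged:
            fp += 1  # the "John" case: flagged but innocent
        elif is_threat:
            fn += 1  # the "James" case: guilty but not flagged
        else:
            tn += 1
    return tp, fp, fn, tn


def expected_precision(population, true_threats, sensitivity, specificity):
    """Expected fraction of flagged individuals who are real threats."""
    expected_tp = true_threats * sensitivity
    expected_fp = (population - true_threats) * (1.0 - specificity)
    return expected_tp / (expected_tp + expected_fp)
```

Under the assumed figures of one million people, ten actual threats, and a tool that is 99% sensitive and 99% specific, `expected_precision(1_000_000, 10, 0.99, 0.99)` comes out at roughly 0.001. In other words, about one in a thousand flagged individuals would be a real threat, which is why the follow-up checking by a human specialist matters so much.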

Essentially, with non real-time threats we have time to gather data, build, say, profiles of terrorists, analyze the data and take action. Now, a non real-time threat could become a real-time threat. That is, the data mining tool could state that there could be some potential terrorist attacks. But after a while, with some more information, the tool could state that the attacks will occur between September 10, 2001 and September 12, 2001. Then it becomes a real-time threat. The challenge will then be to find out exactly what the attack will be. Will it be an attack on the World Trade Center, on the Tower of London, or on the Eiffel Tower? We need data mining tools that can continue with the reasoning as new information comes in. That is, as new information comes in, the warehouse needs to be updated and the mining tools should be dynamic and take the new data and information into consideration in the mining process.

3.4.2.3 Real-time Threats

In the previous section we discussed non real-time threats, where we have time to handle the threats. In the case of real-time threats there are timing constraints. That is, such threats may occur within a certain time and therefore we need to respond to them immediately. Examples of such threats are the spread of the smallpox virus, chemical attacks, nuclear attacks, network intrusions, the bombing of a building before 9 a.m., etc. The question is, what type of data mining techniques do we need for real-time threats?

By definition, data mining works on data that has been gathered over a period of time. The goal is to analyze the data, make deductions and predict future trends. Ideally it is used as a decision support tool. However, the real-time situation is entirely different. We need to rethink the way we do data mining so that the tools can give results in real-time.

For data mining to work effectively, we need many examples and patterns. We use known patterns and historical data and then make predictions. Often for real-time data mining as well as terrorist attacks we have no prior knowledge. For example, the attack on the World Trade Center came as a surprise to many of us. As ordinary citizens, we could not have imagined that the buildings would be attacked by airplanes. Another good example is the recent sniper attacks in the Washington DC area. Here again many of us could never have imagined that the sniper would do the shootings from the trunk of a car. So the question is, how do we train data mining tools such as neural networks without historical data? Here we need to use hypothetical data as well as simulated data. We need to work with counter-terrorism specialists and get as many examples as possible. Once we gather the examples and start training the neural networks and other data mining tools, the question is what sort of models do we build? Often the models for data mining are built beforehand. These models are not dynamic. To handle real-time threats, we need the models to change dynamically. This is a big challenge.
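One way to picture a model that changes as data arrives is a baseline that is updated incrementally rather than rebuilt from a warehouse. The sketch below is only an illustration under stated assumptions: it tracks a single numeric signal (for example, event counts from some hypothetical sensor), and the class name `RunningBaseline`, the warm-up length and the threshold of three standard deviations are our own choices, not part of any particular tool. It uses Welford's online algorithm to update the mean and variance one observation at a time.

```python
import math


class RunningBaseline:
    """Mean and variance updated one observation at a time (Welford's algorithm)."""

    def __init__(self, warmup=5, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations from the mean
        self.warmup = warmup        # hypothetical: observations needed before flagging
        self.threshold = threshold  # hypothetical: standard deviations that count as unusual

    def is_unusual(self, x):
        """Compare x with the baseline learned so far, without changing it."""
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return x != self.mean
        return abs(x - self.mean) / std > self.threshold

    def update(self, x):
        """Fold x into the baseline so that the model keeps changing."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def observe(self, x):
        flagged = self.is_unusual(x)
        self.update(x)
        return flagged


# Made-up stream of readings; only the value 40 departs from the pattern.
baseline = RunningBaseline()
stream = [10, 11, 9, 10, 12, 10, 11, 40, 10]
flags = [baseline.observe(x) for x in stream]
```

Each reading is checked against what has been seen so far and then absorbed into the model, so no separate training phase is required. This addresses only the mechanics of updating; it does not solve the harder problem of having no prior examples of an attack.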

Data gathering is also a challenge for real-time data mining. In the case of non real-time data mining, we can collect data, clean data, format the data, build warehouses and then carry out mining. All these tasks may not be possible for real-time data mining as there are time constraints. Therefore, the questions are: what tasks are critical and what tasks are not? Do we have time to analyze the data? Which data do we discard? How do we build profiles of terrorists for real-time data mining? We need real-time data management capabilities for real-time data mining.

From the previous discussion it is clear that a lot has to be done before we can effectively carry out real-time data mining. Some have argued that there is no such thing as real-time data mining and that it will be impossible to build models in real-time. Others have argued that without real world examples and historical data we cannot do effective data mining. These arguments may be true. However, our challenge is then perhaps to redefine data mining and figure out ways to handle real-time threats.

As we have stated, there are several situations that have to be managed in real-time. Examples are the spread of smallpox, network intrusions, and even analyzing data emanating from sensors. For example, there are surveillance cameras placed in various places such as shopping centers and in front of embassies and other public places. The data emanating from the sensors have to be analyzed in many cases in real-time to detect or prevent attacks. For example, by analyzing the data, we may find that there are some individuals at a mall carrying bombs. Then we have to alert the law enforcement officials so that they can take action. This also raises the questions of privacy and civil liberties. The questions are: what alternatives do we have? Should we sacrifice privacy to protect the lives of millions of people? As stated in [12], we need technologists, policy makers and lawyers to work together to come up with viable solutions. We will revisit privacy in section 3.5.

3.4.3 Analyzing the Techniques

In section 3.4.2 we discussed data mining both for non real-time threats and for real-time threats. As we have mentioned, applying data mining for real-time threats is a major challenge. This is because the goal of data mining is to analyze data, make predictions and identify trends. Current tools are not capable of making predictions and identifying trends in real-time, although there are some real-time data mining tools emerging and some of them have been listed in [16]. The challenge is to develop models in real-time as well as to get patterns and trends based on real world examples.

In this section we will examine the various data mining outcomes and discuss how they could be applied for counter-terrorism. Note that the outcomes include making associations, link analysis, forming clusters, classification and anomaly detection. The techniques that result in these outcomes are techniques based on neural networks, decision trees, market basket analysis techniques, inductive logic programming, rough sets, link analysis based on graph theory, and nearest neighbor techniques. As we have stated in [14], the methods used for data mining are top down reasoning, where we start with a hypothesis and then determine whether the hypothesis is true, or bottom up reasoning, where we start with examples and then come up with a hypothesis.

Let us start with association techniques. Examples of these techniques are market basket analysis techniques. The goal is to find which items go together. For example, we may apply a data mining tool to data that has been gathered and find that John comes from Country X and he has associated with James, who has a criminal record. The tool also outputs the result that an unusually large percentage of people from Country X have performed some form of terrorist attacks. Because of the associations between John and Country X, as well as between John and James, and James and criminal records, one may conclude that John has to be under observation. This is an example of an association. Link analysis is closely associated with making associations. While association-rule based techniques are essentially intelligent search techniques, link analysis uses graph theoretic methods for detecting patterns. With graphs (i.e., nodes and links), one can follow the chain and find links. For example, A is seen with B, B is friends with C, C and D travel a lot together, and D has a criminal record. The question is what conclusions can we draw about A? Link analysis is becoming a very important technique for detecting abnormal behavior. Therefore, we will discuss this technique in a little more detail in the next section.
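To make the idea of finding which items go together concrete, the following is a minimal sketch of pairwise association rules in the market basket setting. It is an illustration only: the function name `pair_rules`, the support threshold and the shopping baskets are hypothetical, and practical tools use far more elaborate search and pruning. For each pair of items it computes the support (the fraction of baskets containing both) and the confidence of the rule "a implies b" (the fraction of baskets containing a that also contain b).

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for the rules a -> b."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:  # prune pairs that rarely occur together
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical baskets, purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
rules = pair_rules(transactions)
```

Here every basket containing butter also contains bread, so the rule butter -> bread has confidence 1.0, while the reverse rule has confidence 2/3. An association of this kind describes the data set as a whole; it says nothing certain about any single basket, or any single person.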

Next let us consider clustering techniques. One could analyze the data and form various clusters. For example, people with origins from country X and who belong to a certain religion may be grouped into Cluster I. People with origins from country Y and who are less than 50 years old may form another cluster, Cluster II. These clusters are formed based on their travel patterns or eating patterns or buying patterns or behavior patterns. While clustering divides the population without any pre-specified condition, classification divides the population based on some predefined condition. The condition is found based on examples. For example, we can form a profile of a terrorist. He could have the following characteristics: male, less than 30 years old, of a certain religion and of a certain ethnic origin. This means all males under 30 years belonging to the same religion and the same ethnic origin will be classified into this group and could possibly be placed under observation.
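The difference between clustering and classification can be sketched in a few lines. The example below is a simplified illustration, not a description of any particular tool: it uses a small k-means loop for clustering and a nearest-centroid rule for classification, and the two numeric features (trips per year, purchases per month), the starting centers and the labels "low" and "high" are all hypothetical. In the clustering half no labels are supplied and the groups emerge from the data; in the classification half the groups are fixed in advance by labeled examples.

```python
def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))


def kmeans(points, centers, iterations=10):
    """Clustering: no labels are given; groups are formed from the data alone."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return [nearest(p, centers) for p in points], centers


def train_centroids(labeled):
    """Classification: the condition is learned from labeled examples."""
    by_label = {}
    for p, label in labeled:
        by_label.setdefault(label, []).append(p)
    return {label: mean(ps) for label, ps in by_label.items()}


def classify(p, centroids):
    return min(centroids, key=lambda label: sq_dist(p, centroids[label]))


# Hypothetical features: (trips per year, purchases per month).
points = [(1, 2), (2, 1), (1, 1), (9, 8), (8, 9), (9, 9)]
assignments, centers = kmeans(points, centers=[(0, 0), (10, 10)])

labeled = [((1, 2), "low"), ((2, 1), "low"), ((9, 8), "high"), ((8, 9), "high")]
centroids = train_centroids(labeled)
label = classify((7, 7), centroids)
```

The clustering call separates the six points into two groups without being told what the groups mean, whereas the classifier can only ever answer with one of the labels it was trained on.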

Another data mining outcome is anomaly detection. A good example here is learning to fly an airplane without wanting to learn to take off or land. The general pattern is that people want to get a complete training course in flying. However, there are now some individuals who want to learn flying but do not care about takeoff or landing. This is an anomaly. Another example is that John always goes to the grocery store on Saturdays. But on Saturday, October 26, 2002 he goes to a firearms store and buys a rifle. This is an anomaly and may need some further analysis as to why he is going to a firearms store when he has never done so before. Is it because he is nervous after hearing about the sniper shootings, or is it because he has some ulterior motive? If he is living, say, in the Washington DC area, then one could understand why he wants to buy a firearm, possibly to protect himself. But if he is living in, say, Socorro, New Mexico, then his actions may have to be followed up further.

As we have stated, all of the discussions on data mining for counter-terrorism have consequences when it comes to privacy and civil liberties. As we have mentioned repeatedly, what are our alternatives? How can we carry out data mining and at the same time preserve privacy? We revisit privacy in section 3.5.
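The grocery store example from the anomaly detection discussion in this section can be sketched as a comparison of a new event against a profile built from past behavior. This is a deliberately small illustration: the function names, the `min_seen` parameter and the visit history are hypothetical, and a realistic detector would have to cope with sparse and noisy histories.

```python
from collections import Counter


def build_profile(history):
    """Count how often each (weekday, store type) pair has been observed."""
    return Counter(history)


def is_anomalous(visit, profile, min_seen=1):
    """A visit is anomalous if it was observed fewer than min_seen times before."""
    return profile[visit] < min_seen


# Hypothetical history: a regular Saturday grocery trip and a few other errands.
history = [("Saturday", "grocery")] * 12 + [("Wednesday", "pharmacy")] * 3
profile = build_profile(history)

routine = is_anomalous(("Saturday", "grocery"), profile)
unusual = is_anomalous(("Saturday", "firearms"), profile)
```

The first check returns False and the second True. As the text notes, a flag of this kind only indicates that further analysis may be needed; the context, such as where the person lives, decides what the deviation means.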

3.4.4 Link Analysis

In this section we discuss a particular data mining technique that is especially useful for detecting abnormal patterns. This technique is link analysis. There have been many discussions in the literature on link analysis. In fact, one of the earlier books on data mining by Berry and Linoff [2] discussed link analysis in some detail. As mentioned in the previous section, link analysis uses various graph theoretic techniques. It is essentially about analyzing graphs. Note that link analysis is also used in web data mining, especially for web structure mining. With web structure mining the idea is to mine the links and extract the patterns and structures about the web. Search engines such as Google use some form of link analysis for displaying the results of a search.
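As a minimal sketch of what analyzing a graph can mean, the code below follows the chain from the previous section (A is seen with B, B is friends with C, C and D travel together) using breadth-first search over an undirected graph. The function name `shortest_chain` and the extra link to E are hypothetical, and breadth-first search is only one of many graph theoretic methods that link analysis may use.

```python
from collections import deque


def shortest_chain(links, start, goal):
    """Return the shortest chain of linked entities from start to goal, or None."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    previous = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while node is not None:
                chain.append(node)
                node = previous[node]
            return chain[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Links from the running example, plus one unrelated hypothetical link.
links = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
chain = shortest_chain(links, "A", "D")
```

The search finds the chain A, B, C, D. What conclusion, if any, should be drawn about A from a chain of that length is exactly the open question raised in the text.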

As mentioned in [2], the challenge in link analysis is to reduce the graphs into manageable chunks. As in the case of market basket analysis, where one needs to carry out intelligent searching by pruning unwanted results, with link analysis one needs to reduce the graphs so that the analysis is manageable and not combinatorially explosive. Therefore, results in graph reduction need to be applied to the graphs that are obtained by representing the various associations. The challenge here is to find the interesting associations and then determine how to reduce the graphs. Various graph theoreticians are working on graph reduction problems. We need to determine how to apply the techniques to detect abnormal and suspicious behavior.
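One simple form of reduction, shown here purely as an illustration, is to drop weak links and then keep only the neighborhood within a few hops of a node of interest. The edge weights (say, the number of observed contacts), both thresholds and the function name `reduce_graph` are assumptions of this sketch; the graph reduction results referred to above are considerably more sophisticated.

```python
def reduce_graph(weighted_links, seed, max_hops=2, min_weight=2):
    """Keep strong links whose endpoints lie within max_hops of the seed node."""
    graph = {}
    for a, b, weight in weighted_links:
        if weight >= min_weight:  # prune weak associations first
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)

    kept = {seed}
    frontier = {seed}
    for _ in range(max_hops):  # expand one hop at a time
        frontier = {n for node in frontier for n in graph.get(node, ())} - kept
        kept |= frontier

    return {
        (a, b, weight)
        for a, b, weight in weighted_links
        if weight >= min_weight and a in kept and b in kept
    }


# Hypothetical weighted links: (entity, entity, number of observed contacts).
weighted_links = [
    ("A", "B", 5), ("B", "C", 3), ("C", "D", 4),
    ("D", "E", 2), ("A", "F", 1), ("F", "G", 6),
]
reduced = reduce_graph(weighted_links, seed="A")
```

With these settings only the links A-B and B-C survive: the A-F link is too weak, and D lies three hops from A. The choice of thresholds decides which associations count as interesting, which is the hard part of the problem.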

Another challenge in using link analysis for counter-terrorism is reasoning with partial information. For example, agency A may have a partial graph, agency B another partial graph and agency C a third partial graph. The question is how do you find the associations between the graphs when no agency has the complete picture? One would argue that we need a data miner that would reason under uncertainty and be able to figure out the links between the three graphs. This would be the ideal solution and the research challenge is to develop such a data miner. The other approach is to have an organization above the three agencies that will have access to the three graphs and make the links. One can think of this organization as the Homeland Security agency. In the next section as well as in some of the ensuing sections we will discuss various federated architectures for counter-terrorism.
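The second approach, an organization that sees all three partial graphs, can be sketched as a union of edge sets followed by a connectivity check. The sketch rests on a strong assumption that should be stated loudly: all agencies use the same identifier for the same entity. The identifiers P1 to P4, the variable names and the function name `connected` are hypothetical.

```python
from collections import deque


def connected(links, start, goal):
    """Return True if goal can be reached from start over undirected links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Each agency holds only part of the picture (hypothetical identifiers).
agency_a = {("P1", "P2")}
agency_b = {("P2", "P3")}
agency_c = {("P3", "P4")}

alone = [connected(g, "P1", "P4") for g in (agency_a, agency_b, agency_c)]
merged = agency_a | agency_b | agency_c
together = connected(merged, "P1", "P4")
```

No single agency can link P1 to P4, but the merged graph can. The first approach in the text, a data miner that reasons under uncertainty without pooling the graphs, is the harder research problem and is not attempted here.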

We need to conduct extensive research on link analysis as well as on other data and web data mining techniques to determine how they can be applied effectively for counter-terrorism. For example, by following the various links, one could perhaps trace, say, the financing of the terrorist operations to the president of, say, country X. Another challenge with link analysis as well as with other data mining techniques is having good data. However, for the domain that we are considering much of the data could be classified. If we are to truly get the benefits of the techniques we need to test with actual data. But not all of the researchers have the clearances to work on classified data. The challenge is to find unclassified data that is a representative sample of the classified data. It is not straightforward to do this, as one has to make sure that all classified information, even through implications, is removed. Another alternative is to find as good data as possible in an unclassified setting for the researchers to work on. However, the researchers have to work not only with counter-terrorism experts but also with data mining specialists who have the clearances to work in classified environments. That is, the research carried out in an unclassified setting has to be transferred to a classified setting later to test the applicability of the data mining algorithms. Only then can we get the true benefits of data mining.

3.5 A Note on Privacy

In section 3.4 we briefly mentioned the challenges to privacy due to data mining. There has been much debate recently among counter-terrorism experts, civil liberties unions and human rights lawyers about the privacy of individuals. That is, gathering information about people, mining information about people, conducting surveillance activities and examining, say, e-mail messages and phone conversations are all threats to privacy and civil liberties. However, what are the alternatives if we are to combat terrorism effectively? Today we do not have any effective solutions. Do we wait until privacy violations occur and then prosecute, or do we wait until national security disasters occur and then gather information? What is more important: protecting nations from terrorist attacks or protecting the privacy of individuals? This is one of the major challenges faced by technologists, sociologists and lawyers. That is, how can we have privacy but at the same time ensure the safety of nations? What should we be sacrificing and to what extent?

The challenge is to provide solutions to enhance national security but at the same time ensure privacy. There is now research at various laboratories on privacy enhanced (sometimes called privacy sensitive) data mining (e.g., Agrawal at IBM Almaden, Gehrke at Cornell University and Clifton at Purdue University; see for example [1,3,9]). The idea here is to continue with mining but at the same time ensure privacy as much as possible. For example, Clifton has proposed the use of the multiparty security policy approach for carrying out privacy sensitive data mining. While there is some progress, we still have a long way to go. Some useful references are provided in [3] (see also [8]).

An approach we are proposing is to process privacy constraints in a database management system. Note that one mines the data and extracts patterns and trends. The privacy constraints determine which patterns are private and to what extent. For example, suppose one could extract the names and healthcare records. If we have a privacy constraint that states that names and healthcare records are private, then this information is not released to the general public. If the information is semi-private, then it is released to those who have a need to know. Essentially the inference controller approach we have discussed in [15] is one solution to achieving some level of privacy. It could be regarded as a type of privacy sensitive data mining. In [13] we have proposed an approach to handle privacy constraints during query, update and database design operations. IBM Almaden Research Center has also recently been developing a similar approach to privacy management. They call their approach Hippocratic databases (see [7]).
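The names and healthcare records example can be sketched as a check applied before any result is released. This is a simplified illustration and not the inference controller of [15] or the design in [13]: the three levels, the field names, the function name `release` and the rule that private combinations are withheld even from need-to-know users are all assumptions made for the sketch. A constraint here attaches a privacy level to a combination of fields, so that fields may be releasable separately but not together.

```python
PUBLIC, SEMI_PRIVATE, PRIVATE = 0, 1, 2


def release(record, requested, constraints, need_to_know=False):
    """Return the requested fields, or None if a privacy constraint forbids it."""
    clearance = SEMI_PRIVATE if need_to_know else PUBLIC
    requested = set(requested)
    for fields, level in constraints:
        # A constraint applies when all of its fields would be released together.
        if fields <= requested and level > clearance:
            return None
    return {field: record[field] for field in sorted(requested)}


# Hypothetical constraints and record, purely for illustration.
constraints = [
    ({"name", "diagnosis"}, PRIVATE),
    ({"zip_code", "diagnosis"}, SEMI_PRIVATE),
]
record = {"name": "J. Doe", "zip_code": "00000", "diagnosis": "flu"}

public_view = release(record, ["name", "diagnosis"], constraints)
analyst_view = release(record, ["zip_code", "diagnosis"], constraints, need_to_know=True)
single_field = release(record, ["diagnosis"], constraints)
```

The name together with the diagnosis is withheld, the semi-private combination is released only to a requester with a need to know, and the diagnosis alone is released to anyone. A real system would also have to consider what can be inferred by combining several permitted releases.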

Note that not all approaches to privacy enhanced data mining are the same. Researchers are taking different approaches to such data mining. Some have argued that privacy enhanced data mining may be time consuming and may not be scalable. However, we need to investigate this area more before we can come up with viable solutions.

3.6 Summary and Directions

We first provided an overview of some of the challenges for applying data mining for counter-terrorism. These include eliminating false positives and false negatives, multimedia data mining, real-time data mining and privacy. Next we discussed various threats. That is, we provided a fairly broad overview of various aspects of threats and counter-terrorism measures. First we discussed natural disasters and human errors. Then we divided the threats into various groups including non-information related terrorism, information related terrorism and biological, chemical, and nuclear threats. We also discussed critical infrastructure threats. Next we discussed counter-terrorism measures for all types of threats. For example, we need to gather information about terrorists and terrorist groups, mine the information and extract patterns. In the case of bio-terrorism, we need to prevent terrorist attacks with, say, the use of sensors.

Next we provided a rather broad overview of data mining for counter-terrorism. We have used the terms data mining and web data mining interchangeably. Again, we can expect much of the data to be on the web, whether on the Internet or on corporate intranets, and therefore mining the data sources and databases on the web to detect and prevent terrorist attacks will become a necessity. These databases could be public databases or private databases.

First we discussed data mining for non real-time threats. The idea here is to gather data, build profiles of terrorists, learn from examples and then detect as well as prevent attacks. The challenge here is to find real world examples, as in many cases a particular attack has not happened before. Next we discussed real-time data mining. Here the challenge is to build models in real-time. Finally we discussed data mining outcomes and techniques for counter-terrorism and focused on link analysis for counter-terrorism.

We are not counter-terrorism experts. Our discussions on counter-terrorism are based on various newspaper articles and documentaries. Our goal is to explore how data mining can be exploited for counter-terrorism. We want to raise the awareness that data mining could possibly help detect and prevent terrorist attacks. Again, this is a new area. A lot of research needs to be done. It should be noted that we also need to make sure that the data mining tools produce accurate and useful results. For example, if there are false positives, the effects could be disastrous. That is, we do not want to investigate someone who is innocent. This will raise many privacy concerns. We also do not want the data mining tools to give out false negatives. We hope that this paper will spawn interesting ideas so that researchers and practitioners start or continue to work on data mining and apply the techniques for counter-terrorism.
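A short worked calculation shows why false positives matter so much when the behavior being sought is rare. Every number below is hypothetical and chosen only to make the arithmetic easy to follow: a population of one million records, 100 of which are true cases, a tool that catches 99% of true cases, and a false positive rate of 1%. The function name `screening_outcomes` is likewise our own.

```python
def screening_outcomes(population, actual_positives, sensitivity, false_positive_rate):
    """Expected counts when a screening tool is applied to a whole population."""
    actual_negatives = population - actual_positives
    true_positives = actual_positives * sensitivity
    false_negatives = actual_positives - true_positives
    false_positives = actual_negatives * false_positive_rate
    flagged = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        # Fraction of flagged records that are true cases.
        "precision": true_positives / flagged,
    }


# Hypothetical figures, not estimates of any real system.
outcome = screening_outcomes(
    population=1_000_000,
    actual_positives=100,
    sensitivity=0.99,
    false_positive_rate=0.01,
)
```

Under these made-up rates the tool flags about 99 true cases and about 9,999 innocent records, so roughly 99 out of every 100 flagged records are false positives, while one true case is still missed.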

We also provided an overview of some of the privacy concerns and discussed the directions in privacy preserving data mining and privacy constraint processing. There are many discussions now on privacy preserving approaches, and we need to continue with this research and develop viable solutions that can carry out useful mining and at the same time ensure privacy.

Data mining and web data mining technologies will have a significant impact on counter-terrorism. As we are seeing, one of the major concerns of our nation today is to detect and prevent terrorist attacks. This is also becoming the goal of many nations in the world. We need to examine the various data mining and web mining technologies and see how they can be adapted for counter-terrorism. We also need to develop special web mining techniques for counter-terrorism. As we have stressed in [14], we expect much of the data to be on the web. The web could be the Internet or intranets. Analysts will have to collaborate via the web within an agency or between agencies. Also, the founding of the Homeland Security department may have an impact on how data mining will be carried out.

In addition to improving on data mining and web mining techniques and adapting them for counter-terrorism, we also need to focus on federated data mining. We can expect agencies to work together collaboratively. They will have to share the data as well as mine the data collaboratively. We can expect to see an increased interest in federated data mining. In this paper we have discussed just the high level ideas. We need to explore the details.

Some other areas of interest include multilingual data mining. Terrorism is not confined to one country and it has no borders. There is terrorism everywhere, carried out by people from different countries speaking different languages. We need technologies to understand the various languages as well as mine the text in different languages. We also need translators to translate one language to another before mining. We also need language experts to work with technologists for multilingual data management and mining. Note that terrorists may come from different countries and speak different languages. We need to understand their language without any ambiguity.

As we have stressed in [14], we cannot forget about privacy. National security measures will mean violating privacy and civil liberties. We cannot abandon our quest for eliminating terrorism. However, we also have to be sensitive to the privacy of individuals. This will be a major challenge. We need to develop techniques for privacy sensitive data sharing and data mining.

Disclaimer: The views and conclusions expressed in this paper are those of the author and do not reflect the policies or procedures of the MITRE Corporation or of the National Science Foundation.

Bibliography

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference, Dallas, TX, May 2000.

[2] M. Berry and G. Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. John Wiley, New York, 1997.

[3] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. Technical report, Purdue University, 2002. (See also Next Generation Data Mining Workshop, Baltimore, MD, November 2002.)

[4] H. Ellison. Handbook of Chemical and Biological Warfare Agents. CRC Press, 1999.

[5] F. Bolz et al. The Counterterrorism Handbook: Tactics, Procedures, and Techniques. CRC Press, 2001.

[6] H. Chen et al. In Proceedings of the 1st Conference on Security Informatics, Tucson, AZ, June 2003.

[7] R. Agrawal et al. Hippocratic databases. In Proceedings of VLDB, 2002.

[8] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 2002.

[9] J. Gehrke. Research problems in data stream processing and privacy-preserving data mining. In Proceedings of the Next Generation Data Mining Workshop, Baltimore, MD, November 2002.

[10] A. Ghosh. Ecommerce Security, Weak Links and Strong Defenses. John Wiley, New York, 1998.

[11] B. Thuraisingham. Managing threats to web databases and cyber systems: Issues, solutions and challenges. In V. Kumar et al., editors, Cyber Security: Threats and Countermeasures. Kluwer.

[12] B. Thuraisingham. Data mining, national security, privacy and civil liberties. SIGKDD Explorations, January 2003.

[13] B. Thuraisingham. Privacy constraint processing in a database management system. (Accepted to be published.) Data and Knowledge Engineering Journal, 2003.

[14] B. Thuraisingham. Web Data Mining Technologies and Their Applications in Business Intelligence and Counter-terrorism. CRC Press, June 2003.

[15] B. Thuraisingham and W. Ford. Security constraint processing in a multilevel distributed database management system. IEEE Transactions on Knowledge and Data Engineering, April 1995.

[16] http://www.kdnuggets.com.
