mining criminal networks from unstructured text...

e at SciVerse ScienceDirect

Digital Investigation 8 (2012) 147–160

Contents lists availabl

Digital Investigation

journal homepage: www.elsevier .com/locate/di in

Mining criminal networks from unstructured text documents

Rabeah Al-Zaidy a, Benjamin C.M. Fung a,*, Amr M. Youssef a, Francis Fortin b

aConcordia Institute for Information Systems Engineering, Concordia University, 1455 De Maisonneuve Blvd. West, CIISE (EV7.640), Montreal, Québec,Canada H3G 1M8b Sûreté du Québec, Montreal, Québec, Canada

a r t i c l e i n f o

Article history:Received 14 October 2010Received in revised form 16 October 2011Accepted 26 December 2011

Keywords:Forensic analysisData miningHypothesis generationCriminal networkInformation retrieval

* Corresponding author. Tel.: þ1 514 8482424.E-mail address: [email protected] (B.C.M. F

1742-2876/$ – see front matter ª 2012 Elsevier Ltddoi:10.1016/j.diin.2011.12.001

a b s t r a c t

Digital data collected for forensics analysis often contain valuable information about thesuspects’ social networks. However, most collected records are in the form of unstructuredtextual data, such as e-mails, chat messages, and text documents. An investigator often hasto manually extract the useful information from the text and then enter the importantpieces into a structured database for further investigation by using various criminalnetwork analysis tools. Obviously, this information extraction process is tedious and error-prone. Moreover, the quality of the analysis varies by the experience and expertise of theinvestigator. In this paper, we propose a systematic method to discover criminal networksfrom a collection of text documents obtained from a suspect’s machine, extract usefulinformation for investigation, and then visualize the suspect’s criminal network.Furthermore, we present a hypothesis generation approach to identify potential indirectrelationships among the members in the identified networks. We evaluated the effec-tiveness and performance of the method on a real-life cybercrimine case and some otherdatasets. The proposed method, together with the implemented software tool, hasreceived positive feedback from the digital forensics team of a law enforcement unit inCanada.

ª 2012 Elsevier Ltd. All rights reserved.

1. Introduction

In many criminal cases, computer devices owned by thesuspect, such as desktops, notebooks, and smart phones,are target objects for forensic seizure. These devices maynot only contain important evidences relevant to the caseunder investigation, but they may also have importantinformation about the social networks of the suspect, bywhich other criminals may be identified. In the UnitedStates, the FBI Regional Computer Forensics Laboratory(RCFL) conducted over 6000 examinations on behalf of 689law enforcement agencies across the United States in oneyear (RCFL, 2009). The amount of data they examined in2009 has reached 2334 Tera Bytes (TB), which is a double ofthe size processed in 2007. To accommodate the increasing

ung).

. All rights reserved.

demand, better resources are needed to help investigatorsprocess forensically collected data.

Most collected digital evidence are often in the form oftextual data, such as e-mails, chat logs, blogs, webpages,and text documents. Due to the unstructured nature of suchtextual data, investigators usually employ some off-the-shelf search tools to identify and extract useful informa-tion from the text, and then manually enter the usefulpieces into a well-structured database for further investi-gation. Obviously, this manual process is tedious and error-prone; the completeness of a search and the quality of ananalysis pretty much relies on the experience and expertiseof the investigators. Important information may be missedif a criminal intends to hide it.

In this paper, we propose a data mining method todiscover criminal communities and extract useful infor-mation for investigation from a collection of text docu-ments obtained from a suspect’s machine. The objective is

mailto:[email protected]

www.sciencedirect.com/science/journal/17422876

http://www.elsevier.com/locate/diin

http://dx.doi.org/10.1016/j.diin.2011.12.001

http://dx.doi.org/10.1016/j.diin.2011.12.001

Ben

Text Box

This is the preprint version. See Elsevier for the final official version.

R. Al-Zaidy et al. / Digital Investigation 8 (2012) 147–160148

to help investigators efficiently identify relevant informa-tion from a large volume of unstructured textual data. Themethod is especially useful in the early stage of an inves-tigation when investigators may have little clue to beginwith. The effectiveness of the proposed method, togetherwith the implemented software tool, have received positivefeedback from the digital forensics team of a law enforce-ment unit in Quebec, Canada. Our major contributions canbe summarized as follows.

1. Communities discovery from unstructured textual data.Several social network analysis tools (Getoor and Diehl,2005; Xu and Chen, 2005) are available to assist inves-tigators in the analysis of criminal networks. However,these tools often assume that the input is a structureddatabase. Nonetheless, structured data is often notavailable in real-life investigations. Instead, the availableinput is usually a collection of unstructured textual data.Our first contribution is to provide an end-to-endsolution to automatically discover, analyze, and visu-alize criminal communities from unstructured textualdata.

2. Introduction of the notion of prominent communities. Afterextensive discussions with the digital forensics team ofa Canadian law enforcement unit, we defined thenotions of community and prominent community. In thecontext of this paper, two or more persons forma community if their names appear together in at leastone investigated document. A community is prominent ifits associated names frequently appear together in someminimum number of documents, which is a user-specified threshold. We propose a method to discoverall prominent communities and measure the closenessamong the members in these communities.

3. Generation of indirect relationship hypotheses. Thenotions of prominent community and closeness amongits members capture the direct relationships among thepersons identified in the investigated documents. Ourrecent work (Al-Zaidy et al., 2011) presents a prelimi-nary study on direct relationships. In many cases, indi-rect relationships are also interesting since they mayreveal hidden relationships. For example, person A andperson B are indirectly related if both of them havementioned a meeting at hotel X in their written e-mails,even though they may not have any direct communi-cations. We present a method to generate all indirectrelationship hypotheses with a maximum, user-specified, depth.

4. Scalable computation. The computations of prominentcommunities and closeness from the investigated textdocument set is non-trivial. A naive approach is toenumerate all 2jUj combinations of communities andscan the document set to determine the prominentcommunities and the closeness, where jUj is thenumber of distinct personal names identified in theinput document set. Our proposed method achievesscalable computation by efficiently pruning the non-prominent communities and examining the closenessof the ones that can potentially be prominent. Thescalability of our method is supported by experimentalresults.

The rest of the paper is organized as follows. Section 2reviews the related works. The problems of criminalcommunity discovery and indirect relationship hypothesesgeneration are formally defined in Section 3 and ourproposed method is described in Section 4. Section 5demonstrates the effectiveness of our proposed methodvia a case study on real-life cybercrime investigation.Section 6 shows the performance study of our proposedmethod. Section 7 concludes the paper.

2. Related works

Criminal network analysis has received great attentionfrom researchers. The pioneer work by Chen et al. (2004)demonstrates a successful application of data miningtechniques to extract criminal relations from a largevolume of police department’s incident summaries. Theyuse the co-occurrence frequency to determine the weightof relationships between pairs of criminals. Yang and Ng(2007) present a method to extract criminal networksfrom web sites that provide blogging services by usinga topic-specific exploration mechanism. In their approach,they identify the actors in the network by using webcrawlers that search for blog subscribers who participatedin a discussion related to some criminal topics. After thenetwork is constructed, they use some text classificationtechniques to analyze the content of the documents. Finallythey propose a visualization of the network that allows foreither a concept network view or a social network view.Our work is different from these works in three aspects.First, our study focuses on unstructured textual data ob-tained from a suspect’s hard drive, not from a well-structured police database. Second, our method candiscover prominent communities consisting of any size, i.e.,not limited to pairs of criminals. Third, while most of theprevious works focus on identifying direct relationships,the methods presented in this paper can also identifyindirect relationships.

A criminal network follows a social network paradigm.Thus, the approaches used for social network analysis canbe adopted in the case of criminal networks. Many studieshave introduced various approaches to construct a socialnetwork from text documents. Hope et al. (2006) proposea framework to extract social networks from text documentthat are available on the web. Jin et al. (2009) proposea method to rank companies based on the social networksextracted fromwebpages. These approaches rely mainly onwebmining techniques to search for the actors in the socialnetworks fromweb documents. Another direction of socialnetwork studies targets some specific type of text docu-ments such as e-mails. Zhou et al. (2006) propose a prob-abilistic approach that not only identifies communities inemail messages but also extracts the relationship infor-mation using semantics to label the relationships. However,the method is applicable to only e-mails and the actors inthe network are limited to the authors and recipients of thee-mails.

Researchers in the field of knowledge discovery haveproposed methods to analyze relationships between termsin text documents in a forensic context. Jin et al. (2007)introduce a concept association graph-based approach to

R. Al-Zaidy et al. / Digital Investigation 8 (2012) 147–160 149

search for the best evidence trail across a set of documentsthat connects two given topics. Srinivasan (2004) proposesthe open and closed discovery algorithms to extractevidence paths between two topics that occur in thedocument set but not necessarily in the same document.Skillicorn and Vats (2007) employ the open discoveryapproach to search for keywords provided by the user andreturn documents containing other different but relatedtopics. They further apply clustering techniques to rank theresults and present the user with clusters of new infor-mation that are conceptually related to their initial queryterms. Their open discovery approach searches for novellinks between concepts from the web with the goal ofimproving the results of web queries. In contrast, this paperfocuses on extracting information for investigation fromtext files.

3. Problem description

The problem of criminal networks analysis can bedivided into two problems. The first one is to discover theprominent communities in a document set and extractuseful information from the documents that contribute tothe formation of the prominent communities. The secondone is to generate hypotheses of indirect relationshipsbetween the prominent communities and other peoplenames in the document set. These two problems areformally defined as follows.

3.1. The problem of criminal community discovery

The problem of criminal community discovery is toidentify the hidden communities from a collection of textdocuments obtained from one (or multiple) suspect’s filesystems. In this paper, a text document is generally definedto be a logical unit of textual data, such as an e-mailmessage, a chat session, a webpage, a blog session, anda text file. Let D be a set of input text documents. Let U bethe set of distinct personal names identified in D. Eachdocument doc ˛ D is represented as a set of names suchthat doc4U Let C4U be a set of personal names calleda community. A document doc contains a community C ifC4doc A community having k personal names is a k-community. The support of a community C is the number ofdocuments in D containing C. For example, {Alan, Kim} inTable 1 is a 2-community with support ¼ 1. A community Cis a prominent community in a set of documents D if thesupport of C is greater than or equal to a user-specifiedminimum support threshold. Suppose the threshold is setto 2. Then, {Alan, Kim} is not a prominent community inTable 1, but {Jenny, John,Mike} is a prominent 3-communitywith support ¼ 2.

Table 1Document set (D).

Document Names in doci

doc1 {Alan, John, Kim}doc2 {Jenny, John, Mike}doc3 {Alan, Jenny, John, Mike}doc4 {Jenny, Mike}

Definition 3.1 (Prominent community). Let D be a set oftext documents. Let support(C) be the number of docu-ments in D that contain C, where C4U A community C isa prominent community in D if support(C) �min_sup, wherethe minimum support min_sup is a user-specified positiveinteger threshold.

The identified prominent communities are also calledthe criminal communities in this paper because the docu-ment set is assumed to be obtained from the suspect’s filesystem under investigation. The problem of criminalcommunity discovery is formally defined as follows:

Definition 3.2 (Problem of criminal community discov-ery). Let D be a set of text documents. Let min_sup bea user-specified minimum support threshold. The problemof criminal community discovery is to identify all prominentcommunities from D with respect to min_sup, and toextract useful information from the documents of everyprominent community for crime investigation.

The specific type of information that is useful forinvestigation depends on the specific criminal case in hand.We will elaborate this point in Section 4.

3.2. The problem of indirect relationship hypothesisgeneration

A person is indirectly related to a prominent communityif there exists a sequence of intermediate terms that linksa person and a prominent community through a chain ofdocuments, inwhich the starting document and the endingdocuments contain the prominent community and thepersonal name, respectively. The problem of indirect rela-tionship hypothesis generation is to identify all indirectrelationships. Specifically, an indirect relationship consistsof a sequence of intermediate terms between a prominentcommunity and a personal name identified in the givendocument set. The generated indirect relationships mayreveal some hidden links that the investigator might not beaware of. Yet, they are only hypotheses; the investigator hasto further verify the truthfulness and usefulness of theserelationships.

Definition 3.3 (Indirect relationship). Let D be a set ofdocuments. Let U be a set of distinct names identified in D.Let C4U be a prominent community and p ˛ (U � C) bea person name that is not in C. Let Dð$Þ4D denote the set ofdocuments containing the enclosed argument where theenclosed argument can be a community, a personal name,or a text term. Let D(C) and D(p) be the sets of documents inD that contain C and p, respectively. An indirect relationshipof depth d between C and p is defined by a sequence ofterms [t1,., td] such that

1. D(C) X D(p) ¼2. ðt1˛DðCÞÞ^ðtd˛DðpÞÞ3. ðtr˛Dðtr�1ÞÞ^ðtr˛Dðtrþ1ÞÞ for 1 < r < d4. D(tr � 1) X D(tr þ 1) ¼ for 1 < r < d

Condition (1) requires that a prominent community Cand a personal name p do not co-occur in any document.Condition (2) states that the first term t1 must occur in at


least one document containing C and the last term td mustoccur in at least one document containing p. Condition (3)requires that the intermediate terms tr must co-occur withthe previous term tr � 1 in at least one document, and trmust co-occur with the next term tr þ 1 in at least onedocument. This requirement defines the chain of docu-ments linking C and p. Condition (4) requires that theprevious term tr � 1 and the next term tr þ 1 do not co-occurin any document. The problem of indirect relationshiphypothesis generation is formally defined as follows:

Definition 3.4 (Problem of indirect relationship hypothesisgeneration). Let D be a set of text documents. Let U be theset of distinct personal names identified in D. Let G be theset of prominent communities discovered in D according toDefinition 3.2. The problem of indirect relationship hypoth-esis generation is to identify all indirect relationships ofmaximum depth max_depth between any prominentcommunity C ˛ G and any personal name p ˛ U in D, wheremax_depth is a user-specified positive integer threshold.

4. Our method

Fig. 1 depicts an overview of our proposed CriminalCommunity Mining System (CCMS). The first step is to readthe investigated text documents and extract the personal

Read docum

Tag p

Normalize nam

Extract prom

ncd

Extract contact information from

community’s document set

Construct profiles

Detect names thco

Visualize

HDD

Fig. 1. Criminal commun

names from them. The name extraction task is followed bya normalization process to eliminate duplicate names thatrefer to the same person. The next step is to discover theprominent criminal communities from the extractednames. Then, we extract the profile information that isvaluable to investigators, such as contact information andsummary topics, from the documents that contribute to theprominent communities. Next, we search for indirectrelationships between the criminals across the documentset. Finally, we provide a visual representation of theprominent communities, their related information, and theindirect relationships found in the document set. Below, weelaborate the steps of prominent community discovery,community information extraction, and indirect relation-ship generation.

4.1. Identifying prominent communities

The first step is to identify the personal names from theinput document files. There are many Named EntityRecognition (NER) tools and methods available in themarket to extract personal names. In our system, we adoptthe Stanford Named Entity Tagger (Finkel et al., 2005),which is a promising tool for identifying English names. Foreach document doc ˛ D, we apply the NER tagger to obtaina set of personal names in doc. Variant names of the same

ent files from HDD

erson names

es across documents

inent communities

Extract city ames from ommunity’s ocument set

for prominent communities

at are indirectly linked to mmunities

criminal networks

Perform text summarization on community's document set

ity mining system.


person are merged into one name. For instance, John, J.Smith, and John Smith are transformed into a common formJohn Smith. Our method also allows the user to incorporatehis/her domain knowledge to merge the names. Forinstance, some people use nicknames in a chat log and theirreal name is mentioned in the same session. Other NERtools can be employed if the document files contain non-English names; however, NER is not the focus of this paper.

The next step is to identify the prominent criminalcommunities. When two or more persons interactfrequently or their names appear together frequently, thisindicates a strong direct linkage. Analyzing the strength oflinkages is a key step for effective crime investigation. Thestrength of a linkage can be measured by comparing thefrequency of the interaction between the individuals toa fixed threshold. A linkage is strong if the number ofinteractions passes a given threshold; otherwise, thelinkage is weak or there is no linkage. A community isconsidered to be a prominent community if its support isequal to or greater than a given threshold.

A naive approach to identify all prominent communitiesis to enumerate all possible communities and identify theprominent ones by counting the support of each commu-nity in D. Yet, in case the number of identified personalnames jUj is large, it is infeasible to enumerate all possiblecommunities because there are 2jUj possible combinations.To efficiently extract all prominent communities from theset of identified individuals, we modify the Apriori

algorithm (Agrawal et al., 1993), which is originallydesigned to extract frequent patterns from transactiondata.

Recall that U denotes the universe of all personal namesin D, and each document doc ˛ D is represented as a set ofnames such thatdoc4U. Ourproposedalgorithm,ProminentCommunity Discovery (PCD), is a level-wise iterative searchalgorithm that uses the prominent k-communities toexplore the prominent (kþ1)-communities. The generationof prominent (k þ 1)-communities from prominent k-communities is based on the following PCD property.

Property 4.1 (PCD property). All nonempty subsets ofa prominent community must also be prominent becausesupport(C0) � support(C) if C04C.

By definition, a community C0 is not prominent if sup-port(C0) <min_sup. The above property implies that addinga personal name p to a non-prominent community C0 willnever make it prominent. Thus, if a k-community C0 is notprominent, then there is no need to generate (k þ 1)-community C0W{p} because C0W{p} must not be prominent.The strength of the linkages among the members ina prominent community C is indicated by support(C). Thepresented algorithm can identify all prominent communi-ties by efficiently pruning the communities that cannot beprominent based on the PCD property.

Algorithm 1. Prominent Community Discovery


Algorithm 1 summarizes our Prominent CommunityDiscovery algorithm. The algorithm finds the prominent k-communities from the prominent (k � 1)-communitiesbased on the PCD property. The first step is to find the set ofprominent 1-communities, denoted by G1. This is achievedby scanning the document set once and counting thesupport count for each 1-community Cj. G1 contains allprominent 1-communities Cj with support(Cj) � min_sup.The set of prominent 1-communities is then used to iden-tify the set of candidate 2-communities, denoted byCandidates2. Then the algorithm scans the database once tocount the support of each candidate Cj in Candidates2. Allcandidates Cj that satisfy support(Cj) � min_sup are prom-inent 2-communities, denoted by G2. The algorithm repeatsthe process of generating Gk from Gk � 1 and stops if Can-didatek is empty.

Personal names in a community are sorted by lexi-cographical order. Two prominent (k � 1)-communitiescan be joined together to form a candidate k-communityonly if their first (k � 2) personal names are identical andtheir last (k � 1) personal names are different. Thisoperation is based on the PCD property: A communitycannot be prominent if any of its subsets is not prom-inent. Thus, the only potential prominent communities ofsize k are those that are formulated by joining prominent(k � 1)-communities. Lines 4–8 describe the procedure ofremoving candidates that contain at least one non-prominent (k � 1)-community.

Lines 9–17 describe the procedure of scanning thedatabase and obtaining the support count of eachcommunity Cj in Candidatesk. Each candidate community Cjis looked up in each document doci in the document set.Once a match is found, the value of support(Cj) is incre-mented by 1 and the document doci is added to the setD(Cj). If support(Cj) is greater than or equal to the user-specified minimum threshold min_sup, then Cj is added toGk, the set of prominent k-communities with k members.The algorithm terminates when no more candidates can begenerated or when none of the candidate communitiespass the min_sup threshold. The algorithm returns allprominent communities G ¼ {G1, ., Gk} with supportcounts and sets of associated documents for each prom-inent community.

Example 4.1 (Prominent communities discov-ery). Consider Table 1 withmin_sup¼ 2. First, we scan thetable to find G1¼ {Alan, Jenny, John,Mike}. Next, we performG1 ) G1 to generate Candidates2 ¼ {{Alan, Jenny}, {Alan,John}, {Alan, Mike}, {Jenny, John}, {Jenny, Mike}, {John,Mike}}. Thenwe scan the table once to obtain the support ofevery community in Candidates2 with support � 2, andidentify G2 ¼ {{Alan, John}, {Jenny, John}, {Jenny, Mike},{John, Mike}}. Similarly, we perform G2 ) G2 to generateCandidates3 ¼ {{Jenny, John, Mike}} and scan the table onceto identify the prominent 3-community G3 ¼ {{Jenny, John,Mike}}.

4.2. Extracting information of prominent communities

The next phase is to retrieve useful information forcrime investigation, such as contact information, from the

discovered prominent communities. In the context of thispaper, a group of people are considered to be in the sameprominent community if their names appear togetherfrequently in a minimum number of text documents.Thus, the topics of the set of documents containing theirnames are the “reasons” bringing them together. Byanalyzing the content of the text documents containingthe names of the community members, a crime investi-gator may obtain valuable clues that are useful for furtherinvestigation, especially during the early stages of theinvestigation. For instance, if a set of community membernames are all contained in the same chat sessions, thensummarizing the topics of the discussion can help theinvestigator infer the type of relationship the communitymembers share. To facilitate the crime investigationprocess, we extract the following types of informationfrom the set of documents D(Cj) for each prominentcommunity Cj:

1. Key topics2. Names of other people who are not in Cj3. Locations and addresses4. Phone numbers5. E-mail addresses6. Website URLs.

In some real-life cyber criminal cases, there could bethousands of identified individuals and hundreds ofprominent communities. Even with a data mining soft-ware, an investigator may still find it difficult to cope withsuch a large volume of information. The summarized keytopics from D(Cj) can provide the investigator with anoverview of each community and the related topics. Theextracted key topics can be a link label when thecommunities are visualized on the screen. Some peoplenames may appear only a few times in D(Cj) but may not befrequent enough to be included as a member in Cj. Iden-tifying these infrequent people names may lead to somenew clues for investigation. Locations, addresses, andcontact information, such as phone numbers and e-mailaddresses, are valuable information for crime investigationbecause they may reveal other potential channels ofcommunications among members of the criminalcommunity.

To extract the key topics, we employ an Open TextSummarizer (OTS) (Rotem, 2003). To extract the citynames, we search the documents for the citiesin the GeoWorldMap database (Geobytes Inc.,2003). To extract other addresses, phone number, ande-mail addresses, we use regular expressions (Friedl,2006).

Other useful information may be extracted to furtherdescribe the relationship between the members of anidentified prominent community, such as the duration ofthe relationship which is a key piece of informationregarding the activity of members of a community. It isespecially useful to provide the investigator with a senseof a time line for the relationships that the communitiesshare. To specify the duration of the relationshipbetween a criminal community identified in a set of textdocuments, we make use of the metadata of the docu-ments. The metadata of a file is the data linked to this


file by the hosting system upon creation of the docu-ment. We can define the duration of a relationship as allor some of the values of: (1) the starting date of therelationship, (2) the ending date, (3) and the amount oftime the relationship lasted. We can identify the startingdate of the relationship between members of a prom-inent community Cj, by the oldest of the dates attachedto the documents in D(Cj). The end date of the rela-tionship is the most recent of the dates associated withthe documents. The duration of the relationship is thencalculated as the difference between the start and enddates.

4.3. Discovering indirect relationships

In this section, we present a method to discover theevidential trails between a prominent community iden-tified in a dataset and other people in the document setwho are not in the community. An evidential trailrepresents a relationship between the prominentcommunity and other people through a common topicrather than co-occurrence. This trail is extracted asa chain of intermediate terms that link a community toa person. Thus, for a given prominent community Cj anda personal name p, the indirect relationship discoverymethod identifies a chain of intermediate terms t fromthe dataset that links Cj with p. The length of the chain islimited by a user-specified threshold, denoted bymax_depth.

4.3.1. ProfilesAny term t in a document set D can be profiled by

extracting interesting information about it from thetextual content of documents in D. For example, if thedocument set is obtained from newswire documents, theprofile of a topic such as Microsoft can be: Corporation,Windows, Bill Gates, and Office. In the same sense, theprofile for a prominent community C existing in a harddrive can be city, phone, and email. This information canbe retrieved from the documents in which the prom-inent community occurs. However, this informationshould not be chosen randomly because of the impor-tance of the profile information in the hypothesisdiscovery process. If the profile information is toogeneral, the discovered relationship is unlikely to besignificant and the investigator may be overwhelmed bya large number of false hypotheses. Thus, data must onlybe added to the profile if it satisfies some pre-specifiedconstraints or conditions that are set to ensure theusefulness of this data.

The structure of a profile is based on semantic types.This structure ensures that only a specific type ofinformation is added to the profile. In the criminalnetwork analysis context, we select semantic types thatare significant to investigations. In particular, thefollowing semantic types are selected: (1) summarytopics of the documents representing the prominentcommunity’s interactions, (2) other names of peoplementioned in the documents with the prominentcommunities, (3) cities and locations, (4) emailaddresses, (5) phone numbers, and (6) website URLs.

These semantic types are also used to identify therelationship between the members of prominentcommunities. For the profiles of the prominentcommunities, we use the same information that isretrieved for the prominent communities, as explainedearlier, for several reasons. First, it is less costly in theinformation extraction process. Second, these semantictypes are extracted as forensically valuable informationabout the set of related individuals and consequentlyany other relationships that are found using this infor-mation are likely to be valuable as well. Within eachsemantic type in the profile of a prominent community,each term has a weight associated with it. In order tominimize the computations, we define the profiles forthe prominent communities of maximal size, andcombine all the profiles of the sub-communities.

Definition 4.1 (Profile). A profile for a prominentcommunity C, denoted by P(C), is defined by a collection ofvectors Vx1 ;Vx2 ;.;Vxn , where n denotes the number ofsemantic types considered. Each vector Vxi , of length lxi ,where lxi is the number of terms of semantic type xi, is givenby:

Vxi ¼

2664

t1; fxiðt1Þt2; fxiðt2Þ

«

tlxi ; fxi�tlxi

�

3775

where fxi ðtjÞ is the weight of term tj and is given by

fxi�tj� ¼ f 0xi

�tj�

maxjf 0xi�tj�

and

f 0xi�tj� ¼ nxi ;tj � log

�jDj=ntj

�;

where jDj is the total number of documents, ntj is thefrequency of occurrence for term tj in the document setD, and nxi ;tj is the frequency of term tj of semantic type xiin D(C).

4.3.2. Indirect relationship generation algorithmGiven a prominent community C with profile P(C),

we propose the indirect relationships generation algo-rithm to extract indirect relationships between theprominent community C and other individuals in thedocument set. This algorithm is a hybrid version of boththe open and closed discovery algorithms described inSrinivasan (2004). The closed discovery requires twoterms, A and C, and generates hypothetical relationshipsbetween A and C through intermediate terms B. On theother hand, open discovery requires the entry of onlyone initial term; the B and C are provided by the algo-rithm. We propose a model that is open in the sensethat it requires only one initial term to start thediscovery process. However, the other end of the rela-tionship must be the name of an individual from thedocument set.


Algorithm 2. Indirect Relationship Generation Algorithm

Fig. 2. Example of a profile.

Algorithm 2 shows the steps of the indirect relationshipgeneration algorithm. The method is applied for eachprominent community Cj ˛ G. The algorithm requires theprofile of the prominent community P(C) as input, where atleast one of its term vectors Vx is of the semantic typepersons. Both the number of intermediate terms N and thedepth of the indirect relationship max_depth are set by theuser. If the depth is 1, for example, then the indirect rela-tionship between community a and person c is through oneconnecting term, e.g., a/ b/ c. However, if the depth is 2,then the relationship is of the form a / b / e / c.

The algorithm proceeds with the truncation of the termvectors, Vx, comprising the profile P(C), by selecting the topN ranking terms in each Vx for all values of x ˛ X. The newtruncated vectors are called Bx for each semantic type xaccordingly. Next, for each xi ˛ X, we search the documentset for each term Bx[y], y ¼ 1,., N in order to build itsprofile. The constructed profiles are P(Bx[1]), P(Bx[2]), .,P(Bx[N]). Now, a combined profile is computed where thecombined weight of a term is the sum of its weights in eachof P(Bx[1]), P(Bx[2]), ., P(Bx[N]) in which it occurs. Thiscombined profile is called P(O) and is comprised of vectorsVx for each semantic type x ˛ X. For each term in tj in theprofile P(O), if a search for (C AND tj) returns a nonemptyset, the term tj is removed from P(O). If the depth is set toa value greater than 1, the algorithm iterates again usingthe value of the profile P(O) produced from the previousiteration as the input profile for the next iteration. Finally,the method returns the terms in P(O) for the semantic typepersons ranked by combined weight and terminates.

Example 4.2 (Indirect relationship hypothesis gen-eration). To illustrate the steps of the algorithm, considerthe community C with profile P(C) in Fig. 2 with depth ¼ 1.First, start with the profile P(C) for the community

C ¼ {John, Jenny, Kim} and construct profiles for each termin P(C). The profiles for the terms auction and seattle aredenoted by P(auction) and P(seattle), respectively, withvalues as shown in Fig. 3. Next, the profiles are combinedinto one profile P(O) as shown in Fig. 3. For each name tj inthe persons vector, search for documents containing both tjand C. In this example, the first lookup searches for docu-ments containing all four names: John, Jenny, Kim, and Sam.If no document contains all of them, then it implies thatSam has no co-occurrence relationship with the prominentcommunity. Thus, Sam is indirectly linked to C through theterm auction and Bob is linked to C through the term seattle.Fig. 3 shows the final results of the discovery method whenapplied to this example.

5. Case study on real-life cybercrime

The objective of this case study is to demonstrate theeffectiveness of our proposed notions and methods in

Fig. 3. Indirect relationship hypothesis generation.


a real-life cybercrime investigation. The dataset wasprovided by a Canadian law enforcement unit. In particular,we performed experiments on anMSN chat log from a harddisk that was confiscated from a suspect in a computerhacking case. The chat log, with a size of approximately500 MB, contains the chat messages and file attachmentsfrom 220 distinct chat accounts. The case had already beensolved by the investigator of the law enforcement unit. Tojudge the effectiveness of our proposed method, wecompared the investigation result using our implementedprototype with the result manually obtained by the inves-tigator. We were informed that the nature of the crime wasrelated to computer hacking. No other information wasprovided to us regarding the chat log to be analyzed. This

Fig. 4. Prominent communi

scenario is similar to the early stage of an investigationwhen an investigator has limited prior knowledge aboutthe suspect(s). Due to confidentiality and privacy concerns,some of the information had to be masked and all theidentities, e-mail accounts, and server names had to bereplaced by pseudonyms in the following discussion.

Fig. 4 depicts the prominent communities identifiedfrom the MSN chat log with min_sup ¼ 5. Each node in thefigure represents a distinct chat account. The distancesamong the nodes in a community C represent the closenessof its members, which are computed from the inverse ofsupport(C). Specifically, there are 24 distinct chat accounts,forming 32 prominent 2-communities, 4 prominent 3-communities, and 1 prominent 4-community. The

ties in the case study.

Fig.

5.Prom

inen

tco

mmun

itiesin

EnronS

mall.


Fig. 6. Indirect relationships in EnronSmall.


interactions of the other 196 accounts are not frequentenough to form a prominent community. The two centralnodes, denoted by S1 and S2, represent two chat logaccounts owned by the same suspect, the owner of theconfiscated computer; therefore, the other 22 userscommunicate with at least one of S1 and S2.

Fig. 7. Number of prominent com

Recall from Section 4.2 that our proposed frameworkextracts some information, namely key topics, personnames, locations, addresses, e-mails, and URLs, from thedocuments (the chat messages) of each prominentcommunity. By performing a simple search on the keytopics of each community, we identified a suspicious user,

munities vs. k in EnronFull.

Table 2Description of Filesystem (40 GB).

File type Number of files Size in MB Percentage

All 43562 40000 100html 14045 420 1.05pdf 326 200 0.5txt 434 1.3 0.00325xml 995 60 0.15audio 215 1058 2.645video 105 840 2.1MS office 253 66 0.165other 27022 37354.7 93.38


denoted by C, who discussed about “botnet” with S2. Thisled us to further look into the details of the extractedcommunity information of this particular prominent 2-community. We found that S2 and C had exchangedseveral e-mail addresses in a form similar to this anony-mous form “[email protected]”. The sub-domain and domain of the e-mail address indicate that it isa temporary IP address assigned by a DHCP server. Yet, it isunusual to have an e-mail server running on a dynamicallychanging IP address. Therefore, we alleged that the serverswere not real e-mail servers, but some bots controlled bythe suspect. Furthermore, A and C exchanged severalsuspicious URLs, in a form similar to “http:// <user>. <freehosting company>.com/save.exe”, which point to somebinary executable files. We concluded that these execut-ables were probably used for spreading the malware tovictims. The indirect relationships discovery process alsoillustrates that C is indirectly related to the prominent 4-community {S1, S2, A, B} through the shared URLs.

Among the 37 prominent communities, we identified 6prominent 2-communities and 1 prominent 3-communitythat share suspicious information similar to the afore-mentioned e-mail addresses and URLs. These suspiciouscommunications are indicated by the dark lines in Fig. 4.We confirmed the correctness of these identified criminalcommunities and activities with two cybercrime investi-gators in the law enforcement unit who solved the real caseby manually reading all the chat messages. Thus, theprecision of our method in this analysis is 100%. Yet, ourproposedmethodmissed 1 suspicious community that wasidentified by the investigators. As a result, the recall of ourmethod in this analysis is 7/8 z 88%. Our method failed toidentify such community due to its infrequent communi-cation. There are two ways to further improve the recall.The first obvious solution is to lower themin_sup thresholdat the expense of larger number of prominent communi-ties. The second solution is to identify the suspiciousinformation, e.g., e-mail addresses and URLs, from theprominent communities, and then search for such infor-mation in the rest of the infrequent communities.

Fig. 8. Number of prominent communities vs. minimum support threshold.

6. Performance analysis

The objective of this section is to study the performanceof the prominent community discovery algorithm and theindirect relationship generation algorithm discussed inSection 4. The performance analysis is performed on tworeal-life datasets. The first dataset is the Enron e-mailcorpus (Mark and Perrault, 2004). We used Enron toanalyze the effectiveness of the prominent communitiesdiscovery algorithm. Although Enron is a de facto bench-mark dataset used in the field of e-mail forensics, the ex-pected input of our proposed method should be a largecollection of files obtained from a file system, not only e-mails. Therefore, to further evaluate the performance,especially the scalability, of our proposed method, we usedthe hard drive of the first author’s personal computer as thesecond dataset. Throughout the rest of the section, we referto this dataset as Filesystem. We present the analysis resultsof Enron and Filesystem in Sections 6.1 and 6.2, respectively.

6.1. The Enron dataset

The Enron dataset contains the e-mails of 158employees in Enron Corporation, which was an Americanenergy, commodities, and services company before itsbankruptcy. In this experiment, we created two versions ofdata from the Enron dataset, namely EnronSmall andEnronFull.

EnronSmall contains the e-mails from 30 randomlyselected employees, resulting in 48,618 e-mails and 5481distinct person names. Our proposed method required16 min to complete the entire process, in which 12 min arespent on extracting all prominent communities formin_sup ¼ 8, 4 min are spent on identifying all indirectrelationships, and 3 s are spent on displaying the result.Fig. 5 depicts only a subset of prominent communitiesidentified in EnronSmall. Among the 14 prominentcommunities in the figure, 5 of them are prominent 2-communities and the remaining 9 are prominent 3-communities. When a user clicks on a community, therelevant information, namely other person names, cities, e-mails, phones numbers, and discussed topics, of thecommunity is shown at the bottom of the screen. Weinspected the e-mails manually and compared the resultingcontact information with the actual content of themessages. The system correctly identified all e-mails andphone numbers without false positives in this case. Fig. 6depicts a subset of indirect relationships identified inEnronSmall. In this particular example, when the mousewas hovered on the link between john and {bush, gray,

Fig. 9. Scalability: Runtime vs. data size.


davis, kevin scott}, the email [email protected] poppedup, indicating that john was related to the prominentcommunity {bush, gray, davis, kevin scott} through theemail [email protected].

EnronFull contains the e-mails from all 158 employeeswith 2.53 GB of 515,767 e-mails and 108,835 distinctperson names. Fig. 7 depicts the number of prominentcommunities with respect to the number of members (k) ina community formin_sup ¼ 250. The number of prominentcommunities peaks at k ¼ 7. As k increases, a communitybecomes more difficult to satisfy the minimum supportthreshold; therefore, the number of prominent communi-ties drops. Our method required 193 min to extract allprominent communities and around 3 s to display theresults.

6.2. The Filesystem dataset

The dataset Filesystem contains 40 GB of files obtainedfrom the first author’s personal computer. Table 2 describesthe number of files, size of files, and percentage by filetypes. Fig. 8 shows the number of prominent communitiesfor 4 � min_sup � 12. As the minimum support thresholdincreases, the number of prominent communities quicklydecreases because the number of documents containing allmembers in a community decreases very quickly.

Next, we evaluate the scalability of our proposedmethods by measuring its runtime. The evaluation is con-ducted on a PC with Intel 3 GHz Core2 Duo with 3 GB ofRAM. Fig. 9 shows the runtime of our proposed methodswith respect to the size of the document set which variesfrom 10 GB to 40 GB with min_sup ¼ 8. The program takes

1430 s to complete the entire process for 40 GB of data,excluding the time spent on reading the document filesfrom the hard drive. As shown in the figure, the total run-time is dominated by prominent community discoveryprocedure. The runtime of the indirect relationship gener-ation and visualization procedures is negligible withrespect to the total runtime.

7. Conclusion

We have proposed an approach to discover and analyzecriminal networks in a collection of investigated textdocuments. Previous studies on criminal network analysismainly focus on analyzing links between criminals instructured police data. As a result of extensive discussionswith a digital forensics team of a law enforcement unit inCanada, we have introduced the notion of prominentcriminal communities and an efficient data mining methodto bridge the gap of extracting criminal networks infor-mation and unstructured textual data. Furthermore, ourproposed methods can discover both direct and indirectrelationships among the members in a criminal commu-nity. The developed software tool has been evaluated by anexperienced crime investigator in a Canaidan forensicsteam and has received positive feedback.

Acknowledgments

The authors would like to thank the anonymousreviewers for their constructive comments that greatlyhelped improve this paper. The research is supported inpart by research grants from Le Fonds québécois de la




recherche sur la nature et les technologies (FQRNT) newresearchers start-up program, Concordia ENCS seed fund-ing program, and the National Cyber-Forensics and TrainingAlliance Canada (NCFTA Canada).

References

Agrawal R, Imieli�nski T, Swami A. Mining association rules between setsof items in large databases. ACM SIGMOD Record 1993;22(2):207–16.

Al-Zaidy R, Fung BCM, Youssef AM. Towards discovering criminalcommunities from textual data. In: Proc. of the 26th ACM SIGAPPsymposium on applied computing (SAC); 2011. TaiChung, Taiwan.

Chen H, Chung W, Xu JJ, Wang G, Qin Y, Chau M. Crime data mining:a general framework and some examples. Computer 2004;37(4):50–6.

Finkel JR, Grenager T, Manning C. Incorporating non-local informationinto information extraction systems by gibbs sampling. In: Proc. ofthe 43rd annual meeting on association for computational linguistics(ACL); 2005. p. 363–70.

Friedl JEF. Mastering regular expressions. 3rd ed. O’Reilly Media; 2006.Geobytes Inc. Geoworldmap, http://www.geobytes.com/; 2003.Getoor L, Diehl CP. Link mining: a survey. ACM SIGKDD Explorations

Newsletter 2005;7(2):3–12.Hope T, Nishimura T, Takeda H. An integrated method for social network

extraction. In: Proc. Of the 15th international conference on worldwide web (WWW); 2006. p. 845–6.

Jin W, Srihari RK, Ho HH. A text mining model for hypothesis generation.In: Proc. Of the 19th IEEE international conference on tools withartificial intelligence ICTAI; 2007. p. 156–62.

Jin Y, Matsuo Y, Ishizuka M. Ranking companies on the web using socialnetwork mining. In: Ting IH, Wu HJ, editors. Web mining applicationsin e-commerce and e-services. Studies in computational intelligence,vol. 172. Berlin/Heidelberg: Springer; 2009. p. 137–52.

Mark W, Perrault RC. Enron email dataset, http://www.cs.cmu.edu/wenron/; 2004.

RCFL. Regional computer forensic laboratory annual report 2009. Tech-nical Report. Federal Bureau of Investigation, http://www.rcfl.gov/downloads/documents/RCFL_Nat_Annual09.pdf; 2009.

Rotem N. Open text summarizer, http://libots.sourceforge.net/; 2003.Skillicorn DB, Vats N. Novel information discovery for intelligence

and counterterrorism. Decision Support Systems 2007;43(4):1375–82.

Srinivasan P. Text mining: generating hypotheses frommedline. Journal ofthe American Society for Information Science and Technology 2004;55:396–413.

Xu J, Chen H. Criminal network analysis and visualization. Communica-tions of the ACM 2005;48(6):100–7.

Yang CC, Ng TD. Terrorism and crime related weblog social network:link, content analysis and information visualization. In: IEEE inter-national conference on intelligence and security informatics (ISI);2007. p. 55–8.

Zhou D, Manavoglu R, Li J, Giles CL, Zha H. Probabilistic models fordiscovering e-communities. In: Proc. of the 15th internationalconference on world wide web (WWW); 2006. p. 173–82.

http://www.geobytes.com/

http://www.cs.cmu.edu/~enron/

http://www.cs.cmu.edu/~enron/

http://www.rcfl.gov/downloads/documents/RCFL_Nat_Annual09.pdf

http://www.rcfl.gov/downloads/documents/RCFL_Nat_Annual09.pdf

http://libots.sourceforge.net/

mining criminal networks from unstructured text...

Documents