data mining and cyber threat analysis – five trendsaleks/icdm02w/grossman.pdfsummary: cyber threat...

44
Data Mining and Cyber Threat Analysis – Five Trends Robert Grossman University of Illinois at Chicago & Open Data Partners February, 2003

Upload: others

Post on 12-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Data Mining and Cyber Threat Analysis – Five Trends

Robert GrossmanUniversity of Illinois at Chicago

& Open Data PartnersFebruary, 2003

Page 2: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Trend 1. Alert Management Systems

Focus in deployment is shifting from models to alerts, from data mining

systems to alert management systems.

Page 3: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

What is an Alert Management System?

An Alert Management System (AMS) is a real time system which maintains profilesabout individuals, threats, or other entities and in real time processes events and returns alerts about profiles and their risks.Examples: credit card fraud detection,

threat assessment systems, intrusion detection systems, homeland defense, etc.

Page 4: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

What are the Five Critical Functions of an AMS?

1. Scoring – compute risk scores for transactions, profiles, targets, etc.

2. Linking – social network analysis of targets 3. Matching – against watch lists, e.g. OFAC4. Checking – regulations & policies5. Routing – analysts have finite capacity

Page 5: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Alert Data Flow

Level 1

fulltransaction stream –unknown risk

Level 2

potential threats –med.risk

alertinvesti-gation

needinvestigation –highrisk bad

guysautomatic

scores

& rules

scores using history

manual

Page 6: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

1a. What is Scoring?

Mining data in motion–assigning scores to data in real time.

Page 7: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Data Mining/Statistical ModelsSummarization Models

• tree-based methods

• neural nets

• k-nearest neighbors

• clustering

• associations

• contact chaining

• social network analysis

Scores result from applying models to

data.

Predictive Models Network/Graph

Page 8: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Detection RatesChart 2

Cumulative Percent Captured By Score Percentile (1st 20)

0%

20%

40%

60%

80%

100%

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Percent of Selection

Cum

. Per

cent

Cap

ture

d

ModelRandom

Page 9: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

False Positive

False Positive vs. Detection Rate

0102030405060708090

3 10 25 50 75 100 150

Page 10: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Comparing Different Models

NoNoYesScalable to large data

YesNoYesEasy to interpret

Yes – smallNo – large

No – hard to retrain

YesEasy to Maintain

NoYesYesAccuracy

RulesNeural Networks

Trees

Page 11: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

1b. What is Linking?

Mining data at rest –bad guys tend to hang out with other bad guys.

Page 12: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Social Networks

Source:Valdis Krebs

Page 13: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Social Networks

degree, betweeness, & closeness

Page 14: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

1c. What is Matching?

Living with watch lists and other lists of good and bad guys

Page 15: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

MatchingWall Street Journal May 6, 2002.

OFAC Entry: ADEN, Abdirisak, Skaftingebacken 8, Spanga 163 67, Sweden; DOB 01 Jun 68 (individual) [SDGT]

Page 16: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Similarity Search Is an Important Component of a Matching System

AHMED, Ahmed (a.k.a. ALI, Ahmed Mohammed Hamed; a.k.a. ABDUREHMAN, Ahmed Mohammed; a.k.a. ABU FATIMA; a.k.a. ABU ISLAM; a.k.a. ABU KHADIIJAH; a.k.a. AHMED HAMED; a.k.a. Ahmed The Egyptian; a.k.a AL-MASRI, Ahmad; a.k.a. AL-SURIR, Abu Islam; a.k.a. ALI, Ahmed Mohammed; a.k.a. ALI, Hamed; a.k.a. HEMED, Ahmed; a.k.a. SHIEB, Ahmed; a.k.a. SHUAIB), Afghanistan; DOB 1965; POB Egypt; citizen Egypt (individual) [SDGT]

Page 17: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

1d. What is Checking?

How to stop worrying and learn to live with regulations.

Page 18: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

There will be more and more regulations about what data can be

used and how…

Page 19: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

1e. What is Routing?

0.1% of 30,000 transactions/second = 30/second at 10

minutes/investigation vs. 100 analysts and 8 hours per day.

Page 20: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Routing

Routing is about getting the right information to the right person at the right time.

Page 21: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Trend 2. Real Time Data Mining

Exploiting Events and Profiles.

Page 22: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

What is an Event?

emaillogin

message

credit card transactionscan

phone calls cell phone call

An event is real time information about an entity, eg. person, place, event, threat, opportunity, etc.

Page 23: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

What is a Profile?

calls

# trips, #cc

trips

0.26 0.86 0.94 … 0.70

summarized information

profiles

A profile is the summarized data and attributes about an entity.

Page 24: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

What is Event Driven Data Mining?statistical

model

Real Time System, Op. Systems, …

recommendation

Real Time Events

1

5

2

1. Get an event.2. Get the corresp. profile.3. Update the profile.4. Score the profile.5. Compute recommended

action.

Profile Database

4 Profiles

3Alert

System

Page 25: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Trend 3.

In the past, we have built models and scored data at rest. In the future,

more and more data will be streaming at faster and faster rates.

Page 26: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Premises

Some data sets will be accessible via OC 12, GigE, 2.5 GigE, 10 GigE, etc. wide area networks – Photonic networks.There will be data mining services for high performance networks, as well as forcommodity networks.Many applications will trade accuracy for speed in order to keep up with line speedCall these Photonic Data Services (PDS)

Page 27: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

The Data Stack – Replacing Apps over Operating Systems

6. Data Mining Applications

5a. StorageServices

5b. Data Web Services

5b. Data Grid Services

4. Transport – TCP, UDP, Reliable UDP

3. IP

2. Photonic Path Services

1. Physical

Page 28: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

TCP Data Transport Chicago to Amsterdam over 622 Mb/s Link

0

1

2

3

4

5

6

7

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59

4.5 Mb/s

Page 29: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Best Effort Distributed Merge Over PDS - Bandwidth

0

100

200

300

400

500

600

700

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59

400 Mb/s

Page 30: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Best Effort Distributed Merge over PDS - Accuracy

0.904

0.905

0.906

0.907

0.908

0.909

0.91

0.911

0.912

0.913

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59

90%

Page 31: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Trend 4. Data Webs for Data Exploration

We have developed some good data mining algorithms & systems, we

need better algorithms and systems for data exploration.

Page 32: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Paradigm Today

1. Learn about a new data source from a friend or over the web

2. FTP the data, federal express, or courier the data

3. A consultant or contractor spends a 1-3 months and then tells you whether or not to start a project to build a centralized data warehouse

Page 33: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Emerging Paradigm: Data WebsDesigned for quick

overlays by keyDesigned for quick

correlationsIf interesting,

perform traditional exploratory data analysis, statistical modeling, etc.Reduce number of clicks to correlate two data sets

names

phone numbers

reports

Page 34: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Technologies for Global Data

Action

Object

Semantic Web

Grids

Data Grids

Digital Libraries

Persistent Archives

Knowledge Mining

Distributed Data Mining

Data WebsWeb-based databases

Knowledge

Attributes/Columns

Files

View Mine/Discover Compute

Page 35: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Data Grids vs. Data Webs

Data Grids

Data Webs

• GSI

• GridFTP

• Replica Management

• Exploration

• Remote Analysis

• Distributed correlation

• Universal keys

Browsing & CasualExploration

Collaborations

Distributed Computing

Web BasedComputing

Page 36: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Trend 5.

Standards are maturing.

Page 37: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Maturing Standards

View data mining as a layer in an open infrastructure.PMML is a maturing standard for data mining.Database SQL, JDBC, ODBC, …

Metadata

Datadata management

data mining

pred. models

web services

Predictive Model Markup Language (PMML), XML

Applications and Services: WSDL, UDDI, …

data management

data mining

pred. models

web services

Page 38: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Data Mining Group

Products shipping with PMML Version 1.1PMML Working Group Full Members– IBM, Magnify, Microsoft, MineIt, NCR, Oracle,

Salford Systems, SAS, SPSS, xChange, University of Illinois at Chicago (over 20 vendors)

PMML Working Group Supporting Members– Angoss, Insightful, KXEN, Microsoft, SGI …Part of Source Forge

Page 39: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Problems with Current Techniques

Models are deployed in proprietary formatsModels are application dependentModels are system dependentModels are architecture dependantTime required to integrate models with

other applications can be long.

Page 40: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Predictive Model Markup Language (PMML)

Based on XML Benefits of PMML– Open standard for Data Mining & Statistical Models – Not concerned with the process of creating a model– Provides independence from application, platform,

and operating system– Simplifies use of data mining models by other

applications (consumers of data mining models)

Page 41: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

PMML Producers,Consumers, & Data Flow

Data Mining Warehouse

learning sets Data Mining System

miningFields

dataFields

PMML Consumers

Operational Systems

miningFieldsPMML models

Operational Data

alerts

derivedFields

derivedFields

PMML Producers

Page 42: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Closely Related Standards

OMGCWMDM

Object modelfor representingdata mining metadata:models, model results(UML/DTD/XML)

DMGPMML

Representation of data mining models for inter-vendor exchange(DTD/XML)

SQL/MMPt. 6 DM

SQL objects for defining,creating, and applying data mining models, andobtaining their results(SQL)

JSR-073JDMAPI

Java API for defining,creating, and applyingdata mining models, andobtaining their results(Java)

OLE DBfor DM

SQL-like interfacefor data miningoperations (OLE DB/SQL)

Page 43: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

Summary: Cyber Threat Analysis1. Deployment is more about alert management than

which algorithm.2. Events and Profiles enable event driven

applications.3. There is a fundamental need to design algorithms

for high bandwidth data streams, at 1 Gb/s and higher.

4. The best way to improve a model is to join new from a new source. Data web and data exploration systems are designed to make this easier.

5. Standards for data mining are maturing.

Page 44: Data Mining and Cyber Threat Analysis – Five Trendsaleks/icdm02w/grossman.pdfSummary: Cyber Threat Analysis 1. Deployment is more about alert management than which algorithm. 2

For More InformationRobert Grossman

grossman at uic.edu or rlg at opendata.bizwww.lac.uic.edu, www.opendata.biz, www.rgrossman.com,

Standardswww.dmg.org (PMML, DWTP, etc.)

Data Webswww.dataspaceweb.net or info at ac.uic.edu

TestbedTerra Wide Data Mining Testbed (TWDM) Terabyte Challenge Testbedwww.ncdm.uic.edu