Data Mining Applications - SEDE'07 - Invited Talk


Page 1: Data Mining Applications - SEDE'07 - Invited Talk

7/10/07 - SEDE'07

DATA MINING APPLICATIONS

Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]

This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841

Some slides used by permission from Dr Eamonn Keogh; University of California Riverside; [email protected]

Page 2: Data Mining Applications - SEDE'07 - Invited Talk

The 2000 ozone hole over the Antarctic, as seen by EPTOMS: http://jwocky.gsfc.nasa.gov/multi/multi.html#hole

Page 3: Data Mining Applications - SEDE'07 - Invited Talk

OBJECTIVE

Explore some of the applications of data mining techniques.

Page 4: Data Mining Applications - SEDE'07 - Invited Talk

Data Mining Applications Outline

Introduction – Data Mining Overview
• Classification (Prediction, Forecasting)
• Clustering
• Association Rules (Link Analysis)
Applications
• Fraud Detection & Illegal Activities
• Facial Recognition
• Cheating & Plagiarism
• Bioinformatics
Conclusions

Page 5: Data Mining Applications - SEDE'07 - Invited Talk

Data Mining Overview

Finding hidden information in a database
Fit data to a model
You must know what you are looking for
You must know how to look for it

Page 6: Data Mining Applications - SEDE'07 - Invited Talk

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”

Description      Behavior       Associations
Classification   Clustering     Link Analysis
(Profiling)      (Similarity)

“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

Page 7: Data Mining Applications - SEDE'07 - Invited Talk

Classification Applications

Teachers classify students’ grades as A, B, C, D, or F.
Letter Recognition
Handwriting Recognition
Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254

Page 8: Data Mining Applications - SEDE'07 - Invited Talk

Classification Example

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

[Figure: labeled examples of Grasshoppers and Katydids, plus one unlabeled insect]

(c) Eamonn Keogh, [email protected]
Page 9: Data Mining Applications - SEDE'07 - Invited Talk

[Figure: scatter plot of Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10) for Grasshoppers and Katydids]

(c) Eamonn Keogh, [email protected]
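The grasshopper/katydid task on these two slides (label an unknown insect from its abdomen and antenna lengths) can be sketched with a k-nearest-neighbor vote. This is only one possible classifier, not necessarily the one used in the talk. The measurements below are invented for illustration, since the values plotted on the slide are not recoverable from the transcript, and the names `LABELED` and `classify` are hypothetical.

```python
import math

# Hypothetical (abdomen_length, antenna_length) pairs; NOT the values from the slide.
LABELED = [
    ((2.0, 3.0), "Grasshopper"),
    ((3.0, 2.0), "Grasshopper"),
    ((4.0, 3.0), "Grasshopper"),
    ((5.0, 2.5), "Grasshopper"),
    ((6.0, 3.5), "Grasshopper"),
    ((2.0, 7.0), "Katydid"),
    ((3.0, 8.0), "Katydid"),
    ((4.0, 9.0), "Katydid"),
    ((5.0, 7.5), "Katydid"),
    ((6.0, 8.5), "Katydid"),
]


def classify(sample, labeled=LABELED, k=3):
    """Return the majority label among the k labeled points nearest to sample."""
    nearest = sorted(labeled, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


if __name__ == "__main__":
    print(classify((4.5, 2.0)))
    print(classify((3.5, 8.2)))
```

With an odd `k` and two classes the vote cannot tie; a different `k` or distance measure would be an equally reasonable choice.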

Page 10: Data Mining Applications - SEDE'07 - Invited Talk

Clustering Applications

Targeted Marketing
Determining Gene Functionality
Identifying Species
Clustering vs. Classification
• No prior knowledge: number of clusters, meaning of clusters
• Unsupervised learning

Page 11: Data Mining Applications - SEDE'07 - Invited Talk

http://149.170.199.144/multivar/ca.htm

Page 12: Data Mining Applications - SEDE'07 - Invited Talk

What is Similarity?

(c) Eamonn Keogh, [email protected]

Page 13: Data Mining Applications - SEDE'07 - Invited Talk

Association Rules Applications

People who buy diapers also buy beer
If gene A is highly expressed in this disease then gene B is also expressed
Relationships between people
www.amazon.com
Book Stores
Department Stores
Advertising
Product Placement
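A rule such as "people who buy diapers also buy beer" is usually scored by its support and confidence. The sketch below computes both over a handful of shopping baskets. The baskets are made up for illustration, and the function names `support` and `confidence` are hypothetical helpers rather than anything taken from the talk.

```python
# Hypothetical market baskets, invented for illustration only.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a ratio of supports."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


if __name__ == "__main__":
    print(support(BASKETS, {"diapers", "beer"}))
    print(confidence(BASKETS, {"diapers"}, {"beer"}))
```

Here two of the five baskets contain both items, and two of the three diaper baskets also contain beer, so the rule has support 2/5 and confidence 2/3. `confidence` divides by the antecedent's support, so it assumes the antecedent occurs in at least one basket.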

Page 14: Data Mining Applications - SEDE'07 - Invited Talk

Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DILBERT reprinted by permission of United Feature Syndicate, Inc.

Page 15: Data Mining Applications - SEDE'07 - Invited Talk


Page 16: Data Mining Applications - SEDE'07 - Invited Talk

Page 17: Data Mining Applications - SEDE'07 - Invited Talk

Fraud Detection

Identify fraudulent behavior
Used extensively in the financial, law enforcement, health care, and other sectors
http://www.aaai.org/AITopics/html/fraud.html
SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
Neural Technologies: http://www.neuralt.com/fraud_management.html

Page 18: Data Mining Applications - SEDE'07 - Invited Talk

Law Enforcement

Identify suspect behavior and relationships
I2 Inc.
• Investigative analytic/visualization software
• http://www.i2inc.com
Social Network Analysis – Analyze patterns of relationships
• Relationships: personal, religious, operational, etc.

Page 19: Data Mining Applications - SEDE'07 - Invited Talk

Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman, and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag, vol 3495, 2005, p. 287.

Page 20: Data Mining Applications - SEDE'07 - Invited Talk


Page 21: Data Mining Applications - SEDE'07 - Invited Talk

How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm

Page 22: Data Mining Applications - SEDE'07 - Invited Talk

Facial Recognition

Based upon features in face
Convert face to a feature vector
Less invasive than other biometric techniques
http://www.face-rec.org
http://computer.howstuffworks.com/facial-recognition.htm
SIMS: http://www.casinoincidentreporting.com/Products.aspx

Page 23: Data Mining Applications - SEDE'07 - Invited Talk

(c) Eamonn Keogh, [email protected]

Page 24: Data Mining Applications - SEDE'07 - Invited Talk


Page 25: Data Mining Applications - SEDE'07 - Invited Talk

Cheating on Multiple Choice Tests

Similarity between tests based on number of common wrong answers.
(George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol 27, no 7, 2000, pp 909-923.)

The number of common correct answers is often ignored.

H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol 73, no 4, pp 349-351):

H-H = (Number of exact answers in common) / (Number of different answers)
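As a rough illustration of the H-H index, the sketch below reads it as a ratio: the number of questions on which two students gave the same wrong answer, divided by the number of questions on which their answers differ. Counting only shared wrong answers in the numerator is an assumption, based on the slide's remark that common correct answers are often ignored; the cited paper should be consulted for the exact definition. The function name `hh_index` and the answer strings are hypothetical.

```python
def hh_index(answers_a, answers_b, key):
    """Ratio of identical wrong answers to differing answers for two students.

    Assumption: the numerator counts only answers that match AND are wrong.
    Returns float('inf') when the two answer sheets are identical.
    """
    if not (len(answers_a) == len(answers_b) == len(key)):
        raise ValueError("answer sheets and key must have the same length")
    common_wrong = sum(
        a == b and a != k for a, b, k in zip(answers_a, answers_b, key)
    )
    different = sum(a != b for a, b in zip(answers_a, answers_b))
    if different == 0:
        return float("inf")
    return common_wrong / different


if __name__ == "__main__":
    key = "ABCDABCDAB"        # hypothetical answer key
    student_1 = "ABCDBBCDCC"  # hypothetical answer sheets
    student_2 = "ABCDBBCDCA"
    print(hh_index(student_1, student_2, key))
```

In this made-up example the two students share two wrong answers and differ on one question, so the ratio is 2.0. A larger value means more shared errors relative to disagreements.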

Page 26: Data Mining Applications - SEDE'07 - Invited Talk

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Page 27: Data Mining Applications - SEDE'07 - Invited Talk

No/Little Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Page 28: Data Mining Applications - SEDE'07 - Invited Talk

Rampant Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Page 29: Data Mining Applications - SEDE'07 - Invited Talk


Page 30: Data Mining Applications - SEDE'07 - Invited Talk

DNA

Basic building blocks of organisms
Located in nucleus of cells
Composed of 4 nucleotides
Two strands bound together

http://www.visionlearning.com/library/module_viewer.php?mid=63

Page 31: Data Mining Applications - SEDE'07 - Invited Talk

Central Dogma: DNA -> RNA -> Protein

DNA: CCTGAGCCAACTATTGATGAA
(transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE

www.bioalgorithms.info; chapter 6; Gene Prediction
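The two steps on this slide can be reproduced in a few lines. The sketch below follows the slide's simplification that the RNA is the DNA string with T replaced by U. It uses only the seven codons needed for this example, which are a subset of the standard genetic code; the names `CODONS`, `transcribe` and `translate` are hypothetical.

```python
# Subset of the standard genetic code: only the codons that occur in the slide's example.
CODONS = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}


def transcribe(dna):
    """DNA -> RNA as shown on the slide: replace T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """RNA -> protein: read non-overlapping codons of three nucleotides.

    Raises KeyError for codons missing from the partial CODONS table.
    """
    usable = len(rna) - len(rna) % 3
    return "".join(CODONS[rna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(rna))  # PEPTIDE
```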

Page 32: Data Mining Applications - SEDE'07 - Invited Talk

miRNA

Short (20-25 nt) sequence of noncoding RNA
Known since 1993 but significance not widely appreciated until 2001
Impact / Prevent translation of mRNA
Generally reduce protein levels without impacting mRNA levels (animal cells)
Functions
• Causes some cancers
• Guide embryo development
• Regulate cell Differentiation
• Associated with HIV
• …

Page 33: Data Mining Applications - SEDE'07 - Invited Talk

Questions

If each cell in an organism contains the same DNA –
• How does each cell behave differently?
• Why do cells behave differently during childhood?
• What causes some cells to act differently – such as during disease?
DNA contains many genes, but only a few are being transcribed – why?
One answer - miRNA

Page 34: Data Mining Applications - SEDE'07 - Invited Talk

http://www.time.com/time/magazine/article/0,9171,1541283,00.html

Page 35: Data Mining Applications - SEDE'07 - Invited Talk

Human Genome

Scientists originally thought there would be about 100,000 genes

There appear to be about 20,000. WHY?

Almost identical to that of Chimps. What makes the difference?

Visualization from UCRdnaQT.mov

Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

Page 36: Data Mining Applications - SEDE'07 - Invited Talk

RNAi – Nobel Prize in Medicine 2006

Double stranded RNA

Short Interfering RNA (~20-25 nt)

RNA-Induced Silencing Complex

Binds to mRNA

Cuts RNA

siRNA may be artificially added to cell!

Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

Page 37: Data Mining Applications - SEDE'07 - Invited Talk

Computer Science & Bioinformatics

Algorithms
Data Structures
Improving efficiency
Data Mining
Biologists don’t usually understand or even appreciate what Computer Science can do
Issues:
• Scalability
• Fuzzy
We will look at:
• Microarray Clustering
• TCGR

Page 38: Data Mining Applications - SEDE'07 - Invited Talk

Affymetrix GeneChip® Array

http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

Page 39: Data Mining Applications - SEDE'07 - Invited Talk

Microarray Data Analysis

Each probe location associated with gene
Measure the amount of mRNA
Color indicates degree of gene expression
Compare different samples (normal/disease)
Track same sample over time
Questions
• Which genes are related to this disease?
• Which genes behave in a similar manner?
• What is the function of a gene?
Clustering
• Hierarchical
• K-means
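To make "which genes behave in a similar manner?" concrete, here is a minimal K-means sketch that groups genes by their expression profiles. The expression values and gene names are invented, and the function `kmeans` is a hypothetical helper. For reproducibility it seeds the centroids with the first k points, which is a simplification; real analyses normally use random or smarter initialisation and a library implementation.

```python
import math

# Hypothetical expression levels per gene across three samples (invented values).
PROFILES = {
    "geneA": (1.0, 1.2, 0.9),
    "geneB": (5.0, 5.1, 4.8),
    "geneC": (0.8, 1.1, 1.0),
    "geneD": (5.2, 4.9, 5.0),
}


def kmeans(points, k, iterations=20):
    """Plain K-means; returns (cluster index per point, list of centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic seeding (assumption)
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return assignment, centroids


if __name__ == "__main__":
    genes = list(PROFILES)
    labels, _ = kmeans([PROFILES[g] for g in genes], k=2)
    for gene, label in zip(genes, labels):
        print(gene, "-> cluster", label)
```

On this toy input the low-expression genes (geneA, geneC) end up in one cluster and the high-expression genes (geneB, geneD) in the other.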

Page 40: Data Mining Applications - SEDE'07 - Invited Talk

Microarray Data - Clustering

"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004

Page 41: Data Mining Applications - SEDE'07 - Invited Talk

miRNA Research Issues

Predict / Find miRNA in genomic sequence
Predict miRNA targets
Identify miRNA functions

Page 42: Data Mining Applications - SEDE'07 - Invited Talk

Temporal CGR (TCGR) 2D Array

Each Row represents counts for a particular window in sequence
• First row – first window
• Last row – last window
• We start successive windows at the next character location
Each Column represents the counts for the associated pattern in that window
• Initially we have assumed order of patterns is alphabetic
Size of TCGR depends on sequence length and subpattern length
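The array described on this slide can be sketched as follows: one row per sliding window (each window starts one character after the previous one), and one column per subpattern in alphabetic order. Several details are assumptions not stated in the transcript: the alphabet is taken to be A, C, G, U; subpatterns inside a window are counted with overlap; and windows stop once a full window no longer fits. The function name `tcgr` and the test sequence are hypothetical.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Return (patterns, rows): rows[i][j] counts patterns[j] in the i-th window."""
    # Columns: every pattern of the given length, in alphabetic order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Rows: successive windows, each starting at the next character location.
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        counts = [0] * len(patterns)
        for i in range(len(chunk) - pattern_len + 1):
            counts[column[chunk[i:i + pattern_len]]] += 1  # overlapping counts (assumption)
        rows.append(counts)
    return patterns, rows


if __name__ == "__main__":
    patterns, rows = tcgr("ACGUACG", window=5, pattern_len=3)
    print(len(rows), "windows x", len(patterns), "patterns")
    for row in rows:
        # Show only the patterns that actually occur in each window.
        print({p: n for p, n in zip(patterns, row) if n})
```

With these assumptions the array has (sequence length - window + 1) rows and 4^pattern_len columns, which matches the slide's note that the size depends on the sequence length and the subpattern length. Characters outside the assumed alphabet raise a `KeyError`.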

Page 43: Data Mining Applications - SEDE'07 - Invited Talk

TCGR Example (cont’d)

TCGRs for Sub-patterns of length 1, 2, and 3

Page 44: Data Mining Applications - SEDE'07 - Invited Talk

TCGR – Mature miRNA (Window=5; Pattern=3)

[Figure: TCGR panels for All Mature, Mus Musculus, Homo Sapiens, and C Elegans; labeled patterns: ACG, CGC, GCG, UCG]

Page 45: Data Mining Applications - SEDE'07 - Invited Talk

TCGRs for Xue Training Data

[Figure: TCGR panels labeled POSITIVE and NEGATIVE]

C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Page 46: Data Mining Applications - SEDE'07 - Invited Talk

TCGRs for Xue Test Data

[Figure: TCGR panels labeled POSITIVE and NEGATIVE]

Page 47: Data Mining Applications - SEDE'07 - Invited Talk


Page 48: Data Mining Applications - SEDE'07 - Invited Talk

Conclusions

Not magic
Doesn’t work for all applications
• Stock Market Prediction
Issues
• Privacy
• Data
Here are some infamous examples of failed data mining applications

Page 49: Data Mining Applications - SEDE'07 - Invited Talk

Page 50: Data Mining Applications - SEDE'07 - Invited Talk

Dallas Morning News, October 7, 2005

Page 51: Data Mining Applications - SEDE'07 - Invited Talk

http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236

Page 52: Data Mining Applications - SEDE'07 - Invited Talk

BIG BROTHER ?

Total Information Awareness
• http://infowar.net/tia/www.darpa.mil/iao/index.htm
• http://www.govtech.net/magazine/story.php?id=45918
• http://en.wikipedia.org/wiki/Information_Awareness_Office
Terror Watch List
• http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
• http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
• http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
• http://www.thedenverchannel.com/news/9559707/detail.html
CAPPS
• http://www.theregister.co.uk/2004/04/26/airport_security_failures/
• http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
• http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
• http://en.wikipedia.org/wiki/CAPPS

Page 53: Data Mining Applications - SEDE'07 - Invited Talk

Page 54: Data Mining Applications - SEDE'07 - Invited Talk

Page 55: Data Mining Applications - SEDE'07 - Invited Talk
