adventures in data mining

2/25/09 - GCSU ADVENTURES IN DATA MINING Margaret H. Dunham Southern Methodist University Dallas, Texas 75275 [email protected] This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 Some slides used by permission from Dr Eamonn Keogh; University of California Riverside; [email protected] 1


Page 1: ADVENTURES IN DATA MINING

2/25/09 - GCSU

ADVENTURES IN DATA MINING

Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]

This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841

Some slides used by permission from Dr Eamonn Keogh; University of California Riverside; [email protected]

Page 2: ADVENTURES IN DATA MINING

The 2000 ozone hole over the Antarctic, as seen by EPTOMS

http://jwocky.gsfc.nasa.gov/multi/multi.html#hole

Page 3: ADVENTURES IN DATA MINING

Data Mining Outline

• Introduction
• Techniques
  • Classification
  • Clustering
  • Association Rules
• Examples

Explore some interesting data mining applications

Page 4: ADVENTURES IN DATA MINING

Introduction

• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?

UNCOVER HIDDEN INFORMATION

DATA MINING

Page 5: ADVENTURES IN DATA MINING

But it isn’t Magic

• You must know what you are looking for
• You must know how to look for it

Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner

Page 6: ADVENTURES IN DATA MINING

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”

Description      Behavior       Associations
Classification   Clustering     Link Analysis
(Profiling)      (Similarity)

“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

Page 7: ADVENTURES IN DATA MINING


Page 8: ADVENTURES IN DATA MINING

CLASSIFICATION

Assign data into predefined groups or classes.

Page 9: ADVENTURES IN DATA MINING

Classification Ex: Grading

x >= 90: A
x < 90:
  x >= 80: B
  x < 80:
    x >= 70: C
    x < 70:
      x >= 60: D
      x < 60: F
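The grading tree can be read as a chain of threshold tests applied from the root down. The following is a minimal Python sketch of that reading, not code from the slides; the function name `grade` is our own, and it assumes the last split is at 60 (scores below 60 receive an F).

```python
def grade(score: float) -> str:
    """Walk the decision tree from the root; each test peels off one class."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:  # assumption: the final threshold is 60
        return "D"
    return "F"


if __name__ == "__main__":
    for x in (95, 85, 72, 60, 42):
        print(x, grade(x))
```

Because the classes (A through F) are fixed in advance and each score is assigned to exactly one of them, this is classification rather than clustering.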

Page 10: ADVENTURES IN DATA MINING

Grasshoppers

Katydids

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

(c) Eamonn Keogh, [email protected]

Page 11: ADVENTURES IN DATA MINING

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????

The classification problem can now be expressed as:

• Given a training database, predict the class label of a previously unseen instance

previously unseen instance = insect 11 (abdomen length 5.1, antennae length 7.0, class unknown)

(c) Eamonn Keogh, [email protected]
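One simple way to attack this problem is a nearest-neighbor classifier: give the unseen insect the label of the closest labeled insect. The slides do not say which classifier is intended, so the sketch below is only an illustration; the names `TRAINING` and `nearest_neighbor` and the choice of Euclidean distance are our assumptions. The ten training rows are the labeled insects listed above.

```python
import math

# Labeled insects 1-10: (abdomen length, antennae length, class)
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def nearest_neighbor(abdomen, antennae, training=TRAINING):
    """Return the class of the training row closest in Euclidean distance."""
    def distance(row):
        return math.hypot(row[0] - abdomen, row[1] - antennae)

    return min(training, key=distance)[2]


if __name__ == "__main__":
    # Insect 11, the previously unseen instance
    print(nearest_neighbor(5.1, 7.0))  # prints "Katydid"
```

Under these assumptions insect 11 is closest to insect 7 (6.1, 6.6), so it is labeled a Katydid. A different distance measure or classifier could in principle give a different answer.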

Page 12: ADVENTURES IN DATA MINING

[Figure: scatter plot of Antenna Length (1 to 10) against Abdomen Length (1 to 10) for Grasshoppers and Katydids]

(c) Eamonn Keogh, [email protected]

Page 13: ADVENTURES IN DATA MINING

Facial Recognition

(c) Eamonn Keogh, [email protected]

Page 14: ADVENTURES IN DATA MINING

Handwriting Recognition

George Washington Manuscript

[Figure: plot with horizontal axis from 0 to 450 and vertical axis from 0 to 1]

(c) Eamonn Keogh, [email protected]

Page 15: ADVENTURES IN DATA MINING

Rare Event Detection

Page 16: ADVENTURES IN DATA MINING


Page 17: ADVENTURES IN DATA MINING

Dallas Morning News

October 7, 2005

Page 18: ADVENTURES IN DATA MINING

CLUSTERING

Partition data into previously undefined groups.

Page 19: ADVENTURES IN DATA MINING

http://149.170.199.144/multivar/ca.htm

Page 20: ADVENTURES IN DATA MINING

What is Similarity?

(c) Eamonn Keogh, [email protected]

Page 21: ADVENTURES IN DATA MINING

Two Types of Clustering

• Hierarchical
• Partitional

(c) Eamonn Keogh, [email protected]

Page 22: ADVENTURES IN DATA MINING

Hierarchical Clustering Example: Iris Data Set

Setosa

Versicolor

Virginica

• The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.

• Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster

Page 23: ADVENTURES IN DATA MINING

ASSOCIATION RULES / LINK ANALYSIS

Find relationships between data

Page 24: ADVENTURES IN DATA MINING

ASSOCIATION RULES EXAMPLES

• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
• http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
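A rule such as "people who buy diapers also buy beer" is usually judged by two standard measures that the slides do not name: support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below is only an illustration; the baskets are invented toy data, and the names `BASKETS`, `support`, and `confidence` are ours.

```python
# Hypothetical market baskets, invented purely for illustration.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


if __name__ == "__main__":
    print("support(diapers, beer):", support({"diapers", "beer"}))
    print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}))
```

In this toy data two of the five baskets contain both items, and two of the three diaper baskets also contain beer. Real association rule mining searches for all rules whose support and confidence exceed chosen thresholds.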

Page 25: ADVENTURES IN DATA MINING

Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DILBERT reprinted by permission of United Feature Syndicate, Inc.

Page 26: ADVENTURES IN DATA MINING

Data Mining Outline

• Introduction
• Techniques
• Examples
  • Vision Mining
  • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
  • Bioinformatics

Page 27: ADVENTURES IN DATA MINING

Vision Mining

• License Plate Recognition
  • Red Light Cameras
  • Toll Booths
  • http://www.licenseplaterecognition.com/
• Computer Vision
  • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/

Page 29: ADVENTURES IN DATA MINING

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Page 30: ADVENTURES IN DATA MINING

No/Little Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Page 31: ADVENTURES IN DATA MINING

Rampant Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Page 32: ADVENTURES IN DATA MINING

Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495, 2005, p. 287.

Page 33: ADVENTURES IN DATA MINING

http://www.time.com/time/magazine/article/0,9171,1541283,00.html


Page 34: ADVENTURES IN DATA MINING

DNA

• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together

http://www.visionlearning.com/library/module_viewer.php?mid=63

Page 35: ADVENTURES IN DATA MINING

Central Dogma: DNA -> RNA -> Protein

DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: Amino Acid

www.bioalgorithms.info; chapter 6; Gene Prediction
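The DNA and RNA strings on this slide differ only in that every T becomes a U. The following minimal Python sketch reproduces that step and splits the RNA into three-letter codons, the units read during translation. The function names `transcribe` and `codons` are ours, and the sketch follows the slide's simplified convention of rewriting the given strand directly; it does not model the template strand or map codons to amino acids.

```python
def transcribe(dna: str) -> str:
    """Rewrite a DNA string as RNA by replacing thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def codons(rna: str) -> list:
    """Split an RNA string into consecutive 3-letter codons, dropping any
    incomplete codon at the end."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)          # CCUGAGCCAACUAUUGAUGAA
    print(codons(rna))  # 7 codons: CCU GAG CCA ACU AUU GAU GAA
```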

Page 36: ADVENTURES IN DATA MINING

Human Genome

• Scientists originally thought there would be about 100,000 genes
• Appear to be about 20,000
• WHY?
• Almost identical to that of Chimps. What makes the difference?
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

Page 37: ADVENTURES IN DATA MINING

RNAi – Nobel Prize in Medicine 2006

Double stranded RNA

Short Interfering RNA (~20-25 nt)

RNA-Induced Silencing Complex

Binds to mRNA

Cuts RNA

siRNA may be artificially added to cell!

Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

Page 38: ADVENTURES IN DATA MINING

miRNA

• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993 but significance not widely appreciated until 2001
• Impact / Prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
  • Causes some cancers
  • Guide embryo development
  • Regulate cell differentiation
  • Associated with HIV
  • …

Page 39: ADVENTURES IN DATA MINING

TCGR – Mature miRNA (Window=5; Pattern=3)

[Figure: TCGR panels for All Mature, Mus Musculus, Homo Sapiens, and C Elegans, with the patterns ACG, CGC, GCG, and UCG marked]

Page 40: ADVENTURES IN DATA MINING

TCGRs for Xue Training Data

[Figure: two panels, POSITIVE and NEGATIVE]

C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Page 41: ADVENTURES IN DATA MINING

Affymetrix GeneChip® Array

http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

Page 42: ADVENTURES IN DATA MINING

Microarray Data Analysis

• Each probe location associated with gene
• Measure the amount of mRNA
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
  • Which genes are related to this disease?
  • Which genes behave in a similar manner?
  • What is the function of a gene?
• Clustering
  • Hierarchical
  • K-means
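To make "which genes behave in a similar manner?" concrete, here is a minimal K-means sketch in plain Python. It is an illustration only, not the analysis behind these slides: the expression profiles are invented toy values, the names `kmeans` and `PROFILES` are ours, and seeding the centroids with the first k points is a simplifying assumption (real tools use better initialization and run on normalized data).

```python
import math

# Hypothetical expression levels of six genes across three samples (toy data).
PROFILES = [
    (0.1, 0.2, 0.1),
    (5.0, 5.2, 4.9),
    (0.2, 0.1, 0.3),
    (5.1, 4.8, 5.0),
    (0.0, 0.3, 0.2),
    (4.9, 5.1, 5.2),
]


def kmeans(points, k, iterations=20):
    """Return (cluster index per point, centroids) after a fixed number of
    assignment/update rounds."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    labels, centers = kmeans(PROFILES, k=2)
    print(labels)  # genes with similar profiles share a cluster index
```

Unlike the classification examples, no class labels are supplied here; the groups emerge from the data, which is what distinguishes clustering. `math.dist` requires Python 3.8 or later.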

Page 43: ADVENTURES IN DATA MINING

Microarray Data - Clustering

"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004

Page 44: ADVENTURES IN DATA MINING

BIG BROTHER?

• Total Information Awareness
  • http://infowar.net/tia/www.darpa.mil/iao/index.htm
  • http://www.govtech.net/magazine/story.php?id=45918
  • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
  • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  • http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html
• CAPPS
  • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  • http://en.wikipedia.org/wiki/CAPPS

Page 45: ADVENTURES IN DATA MINING

http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236

Page 46: ADVENTURES IN DATA MINING


Page 47: ADVENTURES IN DATA MINING
