adventures in data mining

Post on 22-Feb-2016


2/25/09 - GCSU

ADVENTURES IN DATA MINING

Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
mhd@engr.smu.edu

This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841

Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside; eamonn@cs.ucr.edu

The 2000 ozone hole over the Antarctic seen by EPTOMS

http://jwocky.gsfc.nasa.gov/multi/multi.html#hole

Data Mining Outline

• Introduction
• Techniques
  • Classification
  • Clustering
  • Association Rules
• Examples

Explore some interesting data mining applications

Introduction

• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?

UNCOVER HIDDEN INFORMATION
DATA MINING

But it isn’t Magic

• You must know what you are looking for
• You must know how to look for it

Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”

Description      Behavior       Associations
Classification   Clustering     Link Analysis
(Profiling)      (Similarity)

“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

CLASSIFICATION

Assign data into predefined groups or classes.


Classification Ex: Grading

[Figure: decision tree on the score x]

x >= 90: A
x < 90:
    x >= 80: B
    x < 80:
        x >= 70: C
        x < 70:
            x >= 60: D
            x < 60: F
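The grading tree can be read as a chain of threshold tests, each one handling a score that failed the test above it. The following is a minimal sketch, assuming the lowest split is at 60 so that every score below 60 maps to F; the function name `grade` is chosen here for illustration and does not come from the slides.

```python
def grade(x: float) -> str:
    """Walk the grading decision tree from the root down to a leaf."""
    if x >= 90:
        return "A"
    if x >= 80:   # reached only when x < 90
        return "B"
    if x >= 70:   # reached only when x < 80
        return "C"
    if x >= 60:   # reached only when x < 70 (assumed threshold)
        return "D"
    return "F"    # everything below 60


for score in (95, 85, 72, 60, 41):
    print(score, grade(score))
```

Because the classes and thresholds are fixed in advance, this is classification in the sense used above: each new score is assigned to one of the predefined groups.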

Grasshoppers

Katydids

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

(c) Eamonn Keogh, eamonn@cs.ucr.edu

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????

The classification problem can now be expressed as:

• Given a training database, predict the class label of a previously unseen instance

previously unseen instance = insect 11 (abdomen length 5.1, antennae length 7.0)

(c) Eamonn Keogh, eamonn@cs.ucr.edu
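One simple way to predict the label of insect 11 is a k-nearest-neighbor vote over the ten labeled rows. The slides do not name a specific classifier at this point, so the choice of k-nearest neighbors, the Euclidean distance, the default k=3, and the names `TRAINING` and `classify` are all assumptions made for this sketch.

```python
import math
from collections import Counter

# Labeled rows from the insect table: (abdomen length, antennae length, class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def classify(abdomen, antennae, k=3):
    """Majority vote among the k training insects closest in Euclidean distance."""
    by_distance = sorted(
        TRAINING,
        key=lambda row: math.dist((abdomen, antennae), row[:2]),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Insect 11 from the table.
print(classify(5.1, 7.0))
```

With these ten rows, the three closest insects to (5.1, 7.0) are numbers 7, 5, and 1, so the vote is two Katydids to one Grasshopper.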

[Figure: scatter plot of Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10) for Grasshoppers and Katydids]

(c) Eamonn Keogh, eamonn@cs.ucr.edu


Facial Recognition

(c) Eamonn Keogh, eamonn@cs.ucr.edu

Handwriting Recognition

[Figure: George Washington Manuscript; plot with horizontal axis 0 to 450 and vertical axis 0 to 1]

(c) Eamonn Keogh, eamonn@cs.ucr.edu


Rare Event Detection


Dallas Morning News

October 7, 2005


CLUSTERING

Partition data into previously undefined groups.

http://149.170.199.144/multivar/ca.htm

What is Similarity?

(c) Eamonn Keogh, eamonn@cs.ucr.edu

Two Types of Clustering

• Hierarchical
• Partitional

(c) Eamonn Keogh, eamonn@cs.ucr.edu
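Partitional clustering can be sketched with k-means, which the deck lists later under microarray analysis. The sketch below reuses the ten insect measurements from the classification example with their labels removed. Seeding the centroids with the first k points, the cap of 100 iterations, and the names `POINTS`, `nearest`, `mean`, and `kmeans` are assumptions made here, not details from the slides.

```python
import math

# Unlabeled (abdomen length, antennae length) pairs for insects 1-10.
POINTS = [
    (2.7, 5.5), (8.0, 9.1), (0.9, 4.7), (1.1, 3.1), (5.4, 8.5),
    (2.9, 1.9), (6.1, 6.6), (0.5, 1.0), (8.3, 6.6), (8.1, 4.7),
]


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, max_iter=100):
    # Assumption: seed with the first k points so the run is deterministic.
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = mean(members)
    return labels, centroids


labels, centroids = kmeans(POINTS, k=2)
print(labels)
print(centroids)
```

On this data the two clusters found without any labels coincide with the Grasshopper and Katydid rows of the table, which is a property of this small example rather than a guarantee of the method.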

Hierarchical Clustering Example
Iris Data Set

Setosa

Versicolor

Virginica

• The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.

• Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster
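Hierarchical clustering builds its groups by repeated merging rather than by fixing the number of groups up front. The Iris measurements are not reproduced in the slides, so this sketch reuses the ten insect measurements from the classification example instead. Single-link distance, stopping at k clusters, and the names `POINTS`, `single_link`, and `agglomerate` are assumptions made for illustration; the Hierarchical Clustering Explorer tool may work differently.

```python
import math
from itertools import combinations

# Insect ID -> (abdomen length, antennae length), labels deliberately omitted.
POINTS = {
    1: (2.7, 5.5), 2: (8.0, 9.1), 3: (0.9, 4.7), 4: (1.1, 3.1), 5: (5.4, 8.5),
    6: (2.9, 1.9), 7: (6.1, 6.6), 8: (0.5, 1.0), 9: (8.3, 6.6), 10: (8.1, 4.7),
}


def single_link(a, b):
    """Cluster distance = distance between the two closest members."""
    return min(math.dist(POINTS[i], POINTS[j]) for i in a for j in b)


def agglomerate(k):
    """Merge the two closest clusters until k remain; also return the merge log."""
    clusters = [frozenset([i]) for i in POINTS]
    merges = []
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b)))
    return clusters, merges


clusters, merges = agglomerate(k=2)
print(sorted(sorted(c) for c in clusters))
```

The merge log is the information a dendrogram draws: reading `merges` from first to last gives the tree from the leaves upward.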


ASSOCIATION RULES/ LINK ANALYSIS

Find relationships between data


ASSOCIATION RULES EXAMPLES

• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement

http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
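A rule such as "diapers, then beer" is usually judged by how often it holds in the data. The slides do not say how rules are scored; support and confidence are a common convention and are used here as an assumption. The baskets below are hypothetical, invented purely for illustration, and the names `BASKETS`, `support`, and `confidence` are likewise made up for this sketch.

```python
# Hypothetical market baskets, invented purely for illustration.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(items):
    """Fraction of baskets that contain every item in items."""
    items = set(items)
    return sum(items <= basket for basket in BASKETS) / len(BASKETS)


def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


print(support({"diapers", "beer"}))        # 2 of 5 baskets
print(confidence({"diapers"}, {"beer"}))   # 2 of the 3 diaper baskets
```

In this toy data the rule "diapers, then beer" holds in two of the three baskets that contain diapers, so its confidence is about 0.67.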

Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DILBERT reprinted by permission of United Feature Syndicate, Inc.

Data Mining Outline

• Introduction
• Techniques
• Examples
  • Vision Mining
  • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
  • Bioinformatics

Vision Mining

• License Plate Recognition
  • Red Light Cameras
  • Toll Booths
  • http://www.licenseplaterecognition.com/
• Computer Vision
  • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

No/Little Cheating


Rampant Cheating


Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495, 2005, p. 287.

http://www.time.com/time/magazine/article/0,9171,1541283,00.html

DNA

• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together

http://www.visionlearning.com/library/module_viewer.php?mid=63

Central Dogma: DNA -> RNA -> Protein

DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: amino acid sequence

www.bioalgorithms.info; chapter 6; Gene Prediction
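The two steps of the central dogma can be traced on the slide's example sequence. This is a minimal sketch under two assumptions: the DNA string is treated as the coding strand, so transcription reduces to replacing T with U (which matches the RNA shown on the slide), and translation uses the standard genetic code. The names `CODON_TABLE`, `transcribe`, and `translate` are chosen here for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Rewrite a DNA coding-strand sequence as RNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The seven codons CCU, GAG, CCA, ACU, AUU, GAU, GAA map to the one-letter amino acid codes P, E, P, T, I, D, E.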

Human Genome

• Scientists originally thought there would be about 100,000 genes
• There appear to be about 20,000. Why?
• Almost identical to that of chimps. What makes the difference?
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

RNAi – Nobel Prize in Medicine 2006


Double stranded RNA

Short Interfering RNA (~20-25 nt)

RNA-Induced Silencing Complex

Binds to mRNA

Cuts RNA

siRNA may be artificially added to cell!

Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

miRNA

• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993 but significance not widely appreciated until 2001
• Impact / Prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
  • Causes some cancers
  • Guide embryo development
  • Regulate cell differentiation
  • Associated with HIV
  • …

TCGR – Mature miRNA (Window=5; Pattern=3)

All Mature

Mus musculus

Homo sapiens

C. elegans

ACG CGC GCG UCG

TCGRs for Xue Training Data


POSITIVE

NEGATIVE

C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Affymetrix GeneChip® Array

http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

Microarray Data Analysis

• Each probe location associated with gene
• Measure the amount of mRNA
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
  • Which genes are related to this disease?
  • Which genes behave in a similar manner?
  • What is the function of a gene?
• Clustering
  • Hierarchical
  • K-means

Microarray Data - Clustering

"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004

BIG BROTHER ?

• Total Information Awareness
  • http://infowar.net/tia/www.darpa.mil/iao/index.htm
  • http://www.govtech.net/magazine/story.php?id=45918
  • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
  • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  • http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html
• CAPPS
  • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  • http://en.wikipedia.org/wiki/CAPPS


http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
