ADVENTURES IN DATA MINING
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
2/25/09 - GCSU
Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside; [email protected]
The 2000 ozone hole over the Antarctic, as seen by EP/TOMS
http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
Data Mining Outline
• Introduction
• Techniques
    • Classification
    • Clustering
    • Association Rules
• Examples
    • Explore some interesting data mining applications
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING
But it isn’t magic
• You must know what you are looking for
• You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
Description      Behavior       Associations
Classification   Clustering     Link Analysis
(Profiling)      (Similarity)
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
CLASSIFICATION
Assign data into predefined groups or classes.
Classification Ex: Grading
Decision tree on the score x:
x >= 90: A
x < 90:
    x >= 80: B
    x < 80:
        x >= 70: C
        x < 70:
            x >= 60: D
            x < 60: F
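The grading tree on this slide can be read as a chain of threshold tests, each of which splits off one predefined class. A minimal Python sketch, assuming the usual 90/80/70/60 cutoffs; the function name `grade` is illustrative and not part of the original slides:

```python
def grade(score):
    """Walk the decision tree from the root; each test splits off one grade."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:  # assumed cutoff between D and F
        return "D"
    return "F"


if __name__ == "__main__":
    for score in (95, 85, 72, 60, 41):
        print(score, grade(score))
```

Because the classes are fixed in advance and every score falls into exactly one of them, this is classification rather than clustering.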
[Figure: example images of Grasshoppers and Katydids]
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
(c) Eamonn Keogh, [email protected]
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????
The classification problem can now be expressed as:
• Given a training database, predict the class label of a previously unseen instance
Previously unseen instance = insect 11 (abdomen length 5.1, antennae length 7.0, class unknown)
(c) Eamonn Keogh, [email protected]
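One simple way to predict the label of insect 11 is a nearest-neighbor vote over the ten labeled rows. The slides do not say which classifier is intended, so the choice of k-nearest neighbors with Euclidean distance is an assumption made for illustration, as are the names `TRAINING` and `classify`. The data values are copied from the table.

```python
from collections import Counter
from math import dist

# (abdomen length, antennae length, class) for insects 1-10 in the table.
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def classify(abdomen, antennae, k=1):
    """Return the majority class among the k closest training insects."""
    query = (abdomen, antennae)
    nearest = sorted(TRAINING, key=lambda row: dist(query, row[:2]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify(5.1, 7.0))       # Katydid (closest labeled insect is number 7)
    print(classify(5.1, 7.0, k=3))  # Katydid
```

With k=1 the closest labeled insect is number 7, and with k=3 two of the three nearest neighbors are Katydids, so both settings give the same answer for this instance.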
[Figure: scatter plot of Antenna Length (vertical axis) against Abdomen Length (horizontal axis), both on a scale of 1 to 10, with Grasshoppers and Katydids as the two groups]
(c) Eamonn Keogh, [email protected]
Handwriting Recognition
George Washington Manuscript
(c) Eamonn Keogh, [email protected]
Rare Event Detection
Dallas Morning News
October 7, 2005
CLUSTERING
Partition data into previously undefined groups.
http://149.170.199.144/multivar/ca.htm
Two Types of Clustering
• Hierarchical
• Partitional
(c) Eamonn Keogh, [email protected]
Hierarchical Clustering Example: Iris Data Set
Setosa
Versicolor
Virginica
• The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
• Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster
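Hierarchical clustering builds its groups bottom-up by repeatedly merging the two closest clusters. The sketch below is a minimal single-linkage version in plain Python. It is not the algorithm used by Hierarchical Clustering Explorer, and the five 2-D points are hypothetical stand-ins rather than Iris measurements; the function name `single_linkage` is likewise illustrative.

```python
from itertools import combinations
from math import dist


def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Clusters are lists of point indices; the distance between two clusters
    is the smallest distance between any pair of their members.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist(points[a], points[b])
                for a in clusters[pair[0]]
                for b in clusters[pair[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical points: two tight pairs and one outlier.
    points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.3, 4.9), (9.0, 1.0)]
    print(single_linkage(points, 3))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree (dendrogram) that hierarchical tools typically display.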
ASSOCIATION RULES / LINK ANALYSIS
Find relationships between data
ASSOCIATION RULES EXAMPLES
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease, then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
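A rule such as "people who buy diapers also buy beer" is usually judged by two numbers: support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both. The five market baskets are invented for illustration and do not come from the slides; the helper names `support` and `confidence` are also illustrative.

```python
# Hypothetical market baskets, invented for illustration only.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Support of the whole rule divided by support of its antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


if __name__ == "__main__":
    print(support({"diapers", "beer"}))       # 2 of 5 baskets
    print(confidence({"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets
```

In this toy data, two of the three baskets containing diapers also contain beer, so the rule diapers -> beer has support 2/5 and confidence 2/3.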
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Data Mining Outline
• Introduction
• Techniques
• Examples
    • Vision Mining
    • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
    • Bioinformatics
Vision Mining
• License Plate Recognition
    • Red Light Cameras
    • Toll Booths
    • http://www.licenseplaterecognition.com/
• Computer Vision
    • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
No/Little Cheating
Rampant Cheating
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman, and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag, vol. 3495, 2005, p. 287.
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
DNA
• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
    (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
    (translation)
Protein: amino acid sequence
Source: www.bioalgorithms.info; chapter 6; Gene Prediction
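The two sequences on this slide differ only in T versus U, so the DNA -> RNA -> Protein flow can be sketched in a few lines. The sketch below makes two simplifying assumptions: transcription is treated as a T-to-U substitution on the coding strand (as the slide's pair of strings suggests), and translation reads codons from position 0 using the standard genetic code. The names `transcribe`, `translate`, and `CODON_TABLE` are illustrative.

```python
# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time; stop at the first stop codon (*)."""
    protein = []
    for start in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[start:start + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(rna))  # PEPTIDE
```

The seven codons of the slide's sequence translate to the one-letter amino acid codes P, E, P, T, I, D, E.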
Human Genome
• Scientists originally thought there would be about 100,000 genes
• There appear to be only about 20,000. WHY?
• Almost identical to that of chimps. What makes the difference?
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
RNAi – Nobel Prize in Medicine 2006
Double stranded RNA
Short Interfering RNA (~20-25 nt)
RNA-Induced Silencing Complex
Binds to mRNA
Cuts RNA
siRNA may be artificially added to cell!
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
miRNA
• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993, but significance not widely appreciated until 2001
• Impact / prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
    • Causes some cancers
    • Guide embryo development
    • Regulate cell differentiation
    • Associated with HIV
    • …
TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure: TCGR panels for All Mature, Mus musculus, Homo sapiens, and C. elegans; labeled patterns: ACG, CGC, GCG, UCG]
TCGRs for Xue Training Data
POSITIVE
NEGATIVE
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
Microarray Data Analysis
• Each probe location associated with gene
• Measure the amount of mRNA
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
    • Which genes are related to this disease?
    • Which genes behave in a similar manner?
    • What is the function of a gene?
• Clustering
    • Hierarchical
    • K-means
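The question "which genes behave in a similar manner?" maps naturally onto K-means: each gene is a vector of expression levels across samples, and genes assigned to the same centroid behave alike. The sketch below is a bare-bones K-means in plain Python. The expression values are hypothetical, and seeding the centroids with the first k profiles is an assumption chosen only to keep the example deterministic; production tools typically use random or smarter initialization.

```python
from math import dist


def kmeans(profiles, k, iterations=10):
    """Assign each profile to its nearest centroid, then recompute centroids."""
    # Assumption: seed with the first k profiles so the result is repeatable.
    centroids = [list(p) for p in profiles[:k]]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in profiles
        ]
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical expression levels for four genes across three samples.
    expression = [
        [1.0, 1.1, 0.9],  # gene 0: low in every sample
        [5.0, 5.2, 4.9],  # gene 1: high in every sample
        [1.2, 0.9, 1.0],  # gene 2: low
        [4.8, 5.1, 5.0],  # gene 3: high
    ]
    labels, centroids = kmeans(expression, k=2)
    print(labels)  # [0, 1, 0, 1]
```

Unlike hierarchical clustering, K-means needs the number of groups up front and returns a flat partition rather than a tree.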
Microarray Data - Clustering
"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004
BIG BROTHER?
• Total Information Awareness
    • http://infowar.net/tia/www.darpa.mil/iao/index.htm
    • http://www.govtech.net/magazine/story.php?id=45918
    • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
    • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
    • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
    • http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html
• CAPPS
    • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
    • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
    • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
    • http://en.wikipedia.org/wiki/CAPPS
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236