1 Peter Fox Data Science – ITEC/CSCI/ERTH-4750/6750 Week 9, October 21, 2014 Intro to Data Mining for Data Science

Upload: melanie-cox

Post on 19-Dec-2015



Peter Fox

Data Science – ITEC/CSCI/ERTH-4750/6750

Week 9, October 21, 2014

Intro to Data Mining for Data Science


Contents

• Data Mining: what it is, what it is not, types
• Distributed applications – modern data mining
• Science example
• A specific toolkit and two examples
  – Classifier
  – Image analysis – clouds
• Week 9 reading


Types of data


Data Mining – What it is

• Extracting knowledge from large amounts of data
• Motivation
  – Our ability to collect data has expanded rapidly
  – It is impossible to analyze all of the data manually
  – Data contains valuable information that can aid in decision making
• Uses techniques from:
  – Pattern Recognition
  – Machine Learning
  – Statistics
  – High Performance Database Systems – OLAP
• Plus techniques unique to data mining (Association rules)
• Data mining methods must be efficient and scalable


Data Mining – What it isn’t

• Small Scale
  – Data mining methods are designed for large data sets
  – Scale is one of the characteristics that distinguishes data mining applications from traditional machine learning applications
• Foolproof
  – Data mining techniques will discover patterns in any data
  – The patterns discovered may be meaningless
  – It is up to the user to determine how to interpret the results
  – “Make it foolproof and they’ll just invent a better fool”
• Magic
  – Data mining techniques cannot generate information that is not present in the data
  – They can only find the patterns that are already there


Data Mining – Types of Mining

• Classification (Supervised Learning)
  – Classifiers are created using labeled training samples
  – Training samples created by ground truth / experts
  – Classifier later used to classify unknown samples
• Clustering (Unsupervised Learning)
  – Grouping objects into classes so that similar objects are in the same class and dissimilar objects are in different classes
  – Discover overall distribution patterns and relationships between attributes
• Association Rule Mining
  – Initially developed for market basket analysis
  – Goal is to discover relationships between attributes
  – Uses include decision support, classification and clustering
• Other Types of Mining
  – Outlier Analysis
  – Concept / Class Description
  – Time Series Analysis


Data Mining in the ‘new’ Distributed Data/Services Paradigm


Science Motivation

• Study the impact of a natural iron fertilization process (such as a dust storm) on plankton growth and subsequent dimethyl sulfide (DMS) production
  – Plankton plays an important role in the carbon cycle
  – Plankton growth is strongly influenced by nutrient availability (Fe/Ph)
  – Dust deposition is an important source of Fe over the ocean
  – Satellite data is an effective tool for monitoring the effects of dust fertilization


Hypotheses

• In remote ocean locations there is a positive correlation between the area-averaged atmospheric aerosol loading and oceanic chlorophyll concentration
• There is a time lag between oceanic dust deposition and the photosynthetic activity
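
Both hypotheses can be explored with a lagged Pearson correlation between two regional time series. The sketch below is illustrative only: the function names `pearson` and `lagged_correlation` and the toy series are assumptions made for this example, not the data or method used in the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(aot, chl, max_lag):
    """Correlate AOT at time t with chlorophyll at time t + lag, for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = aot[: len(aot) - lag]   # drop the last `lag` AOT values
        y = chl[lag:]               # drop the first `lag` chlorophyll values
        result[lag] = pearson(x, y)
    return result

# Toy series (made up): chlorophyll repeats the AOT signal two steps later.
aot = [0.1, 0.5, 0.2, 0.8, 0.3, 0.9, 0.4, 0.6, 0.2, 0.7]
chl = [0.0, 0.0] + aot[:-2]
corr = lagged_correlation(aot, chl, max_lag=3)
best_lag = max(corr, key=corr.get)
print(best_lag, corr)
```

The lag with the largest coefficient is a candidate for the delay between deposition and response; with real satellite series, gaps from cloud cover would have to be handled before this step.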


Primary source of ocean nutrients

[Figure labels: wind-blown dust (Sahara), sediments from river, ocean upwelling]


Factors modulating dust-ocean photosynthetic effect

[Figure labels: Sahara, dust, SST, clouds, nutrients, chlorophyll]


Objectives

• Use satellite data to determine whether atmospheric dust loading and phytoplankton photosynthetic activity are correlated.

• Determine physical processes responsible for observed relationship


Data and Method

• Data sets obtained from two instruments, SeaWiFS and MODIS, during 2000–2006 are employed
• MODIS-derived AOT (Aerosol Optical Thickness)
  – SeaWiFS – Sea-viewing Wide Field-of-view Sensor
  – MODIS – Moderate Resolution Imaging Spectroradiometer
  – AOT – Aerosol Optical Thickness


The areas of study

1 – Tropical North Atlantic Ocean
2 – West coast of Central Africa
3 – Patagonia
4 – South Atlantic Ocean
5 – South Coast of Australia
6 – Middle East
7 – Coast of China
8 – Arctic Ocean

*Figure: annual SeaWiFS chlorophyll image for 2001


Tropical North Atlantic Ocean: dust from the Sahara Desert

[Figure: Chlorophyll and AOT panels. The annotated values are all negative, ranging from about -0.09 to -0.86.]


Arabian Sea: dust from the Middle East

[Figure: Chlorophyll and AOT panels. The annotated values are all positive, ranging from about 0.37 to 0.85.]


Summary …

• Dust affects ocean photosynthetic activity: correlations are positive in some areas and NEGATIVE in others, especially in the Saharan basin
• Hypothesis for explaining observations of negative correlation: in areas that are not nutrient limited, dust reduces photosynthetic activity
• But we also need to consider the effects of clouds and ocean currents, and to isolate the effect of dust: the MODIS AOT product includes contributions from dust, DMS, biomass burning, etc.



Models/types

• Trade-off between Accuracy and Understandability
• Models range from “easy to understand” to incomprehensible (listed here from easier to harder)
  – Decision trees
  – Rule induction
  – Regression models
  – Neural Networks


Qualitative and Quantitative

• Qualitative
  – Provide insight into the data you are working with
    • If city = New York and 30 < age < 35 …
    • Important age demographic was previously 20 to 25
    • Change print campaign from Village Voice to New Yorker
  – Requires interaction capabilities and good visualization
• Quantitative
  – Automated process
  – Score new gene chip datasets with error model every night at midnight
  – Bottom-line orientation


Management

• Creation of logical collections

• Physical data handling

• Interoperability support

• Security support

• Data ownership

• Metadata collection, management and access.

• Persistence

• Knowledge and information discovery

• Data dissemination and publication


Provenance*

• Origin or source from which something comes, intention for use, who or what it was generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility


ADaM – System Overview

• Developed by the Information Technology and Systems Center at the University of Alabama in Huntsville
• Consists of over 75 interoperable mining and image processing components
• Each component is provided with a C++ application programming interface (API) and an executable that supports scripting tools (e.g. Perl, Python, Tcl, Shell)
• ADaM components are lightweight and autonomous, and have been used successfully in a grid environment
• ADaM has several translation components that provide data-level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight)
• Future versions will include Python wrappers and possibly web service interfaces


ADaM 4.0 Components


ADaM Classification - Process

• Identify potential features which may characterize the phenomenon of interest

• Generate a set of training instances where each instance consists of a set of feature values and the corresponding class label

• Describe the instances using ARFF file format
• Preprocess the data as necessary (normalize, sample etc.)
• Split the data into training / test set(s) as appropriate
• Train the classifier using the training set
• Evaluate classifier performance using test set

• K-Fold cross validation, leave one out or other more sophisticated methods may also be used for evaluating classifier performance
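
As a rough illustration of the K-fold idea mentioned above, the sketch below generates train/test index sets in plain Python. The function name `k_fold_indices` is an assumption for this example and is not part of ADaM; setting `k` equal to the number of samples gives leave-one-out.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

A classifier would be trained on each `train_idx` and scored on the matching `test_idx`, with the scores averaged over the folds.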


ADaM Classification - Example

• Starting with an ARFF file, the ADaM system will be used to create a Naïve Bayes classifier and evaluate it

• The source data will be an ARFF version of the Wisconsin breast cancer data from the University of California Irvine (UCI) Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html

• The Naïve Bayes classifier will be trained to distinguish malignant vs. benign tumors based on nine characteristics


Naïve Bayes Classification

• Classification problem with m classes C1, C2, … Cm

• Given an unknown sample X, the goal is to choose a class that is most likely based on statistics from training data

P(Ci | X) can be computed using Bayes’ Theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

[1] Equations from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.


Naïve Bayes Classification

• P(X) is constant for all classes, so finding the most likely class amounts to maximizing P(X | Ci) P(Ci)

• P(Ci ) is the prior probability of class i. If the probabilities are not known, equal probabilities can be assumed.

• Assuming attributes are conditionally independent:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

P(xk | Ci) is the probability density function for attribute k, and n is the number of attributes

[1] Equation from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.


Naïve Bayes Classification

P(xk | Ci) is estimated from the training samples

Categorical attributes (non-numeric attributes)
  – Estimate P(xk | Ci) as percentage of samples of class i with value xk
  – Training involves counting percentage of occurrence of each possible value for each class

Numeric attributes
  – Also use statistics of the sample data to estimate P(xk | Ci)
  – Actual form of density function is generally not known, so Gaussian density is often assumed
  – Training involves computation of mean and variance for each attribute for each class


Naïve Bayes Classification

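Gaussian distribution for numeric attributes:

$$P(x_k \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

– Where $\mu_{C_i}$ is the mean of attribute k observed in samples of class Ci

– And $\sigma_{C_i}$ is the standard deviation of attribute k observed in samples of class Ci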

[1] Equation from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.
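
The training and classification rules on the last few slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ADaM's implementation: the function names and the toy data are invented here, the population variance is used, and log-probabilities replace the raw product to avoid underflow.

```python
import math
from collections import defaultdict

def train_gaussian_nb(samples, labels):
    """Return {class: (prior, [(mean, std) per attribute])} from numeric samples."""
    by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        by_class[c].append(x)
    model = {}
    for c, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, math.sqrt(var) or 1e-9))  # guard against zero spread
        model[c] = (len(rows) / len(samples), stats)
    return model

def gaussian(x, mean, std):
    """Gaussian density with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def classify(model, x):
    """Pick the class maximizing log P(Ci) + sum over k of log P(xk | Ci)."""
    def score(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            math.log(gaussian(v, m, s) + 1e-300) for v, (m, s) in zip(x, stats)
        )
    return max(model, key=score)

# Toy two-attribute data (made up); this is not the breast cancer data set.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
model = train_gaussian_nb(X, y)
prediction = classify(model, [1.1, 2.1])
print(prediction)
```

Here the priors are estimated from class frequencies in the training data; equal priors could be substituted when those frequencies are not trusted.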


Sample Data Set – ARFF Format
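
The slide image is not part of this transcript. As a reminder of the layout, ARFF is the plain-text format used by WEKA: a header with `@relation` and `@attribute` declarations, then `@data` followed by comma-separated rows, with `%` starting a comment. The sketch below parses a tiny example; the attribute names and values are hypothetical stand-ins, and the parser ignores quoting, sparse rows and other parts of the format.

```python
def parse_arff(text):
    """Parse a minimal ARFF document: simple attributes and dense data rows."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Hypothetical excerpt; the real file has nine features plus the class attribute.
EXAMPLE = """\
% toy excerpt, values made up
@relation breast_cancer_toy
@attribute clump_thickness numeric
@attribute cell_size numeric
@attribute class {benign,malignant}
@data
5,1,benign
8,10,malignant
"""

relation, attributes, rows = parse_arff(EXAMPLE)
print(relation, attributes, rows)
```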


Data management

• Metadata?
• Data?
• File naming?
• Documentation?


Splitting the Samples

• ADaM has utilities for splitting data sets into disjoint groups for training and testing classifiers
• The simplest is ITSC_Sample, which splits the source data set into two disjoint subsets


Splitting the Samples

• For this demo, we will split the breast cancer data set into two groups, one with 2/3 of the patterns and another with 1/3 of the patterns:

ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66

• The -i argument specifies the input file name
• The -o and -t arguments specify the names of the two output files (-o = output one, -t = output two)
• The -p argument specifies the portion of data that goes into output one (trn.arff); the remainder goes to output two (tst.arff)
• The -c argument tells the sample program which attribute is the class attribute
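
To make the idea of a disjoint 2/3 and 1/3 split concrete, here is a minimal Python sketch. It is not a description of how ITSC_Sample works internally (the slides do not say, for example, whether it balances classes); the function name `split_patterns` and the stand-in data are assumptions for this example.

```python
import random

def split_patterns(patterns, portion, seed=0):
    """Shuffle and split into two disjoint lists; `portion` goes to the first."""
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

patterns = list(range(100))          # stand-ins for 100 data rows
train, test = split_patterns(patterns, portion=0.66)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which matters for the provenance questions raised on the following slide.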


Provenance?

• For this demo, we will split the breast cancer data set into two groups, one with 2/3 of the patterns and another with 1/3 of the patterns:

ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66

• What needs to be recorded and why?
• What about intermediate files and why?
• How are they logically organized?

ITSC_Sample -c class -i bcw.arff -o train_p.66.arff -t bcw_test.33.arff -p 0.66
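
One possible answer to "what needs to be recorded" is a small machine-readable record written next to each output. The sketch below is only a suggestion: the field names, the choice of SHA-256 checksums and the function name `provenance_record` are assumptions for this example, and the input content is a placeholder rather than the real file.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(command, inputs, outputs, input_bytes):
    """Build a JSON-serializable record of one processing step.

    input_bytes maps each input name to its content so a checksum can be stored.
    """
    return {
        "command": command,
        "inputs": {
            name: hashlib.sha256(input_bytes[name]).hexdigest() for name in inputs
        },
        "outputs": list(outputs),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record(
    command=["ITSC_Sample", "-c", "class", "-i", "bcw.arff",
             "-o", "trn.arff", "-t", "tst.arff", "-p", "0.66"],
    inputs=["bcw.arff"],
    outputs=["trn.arff", "tst.arff"],
    input_bytes={"bcw.arff": b"placeholder content, not the real file"},
)
print(json.dumps(record, indent=2))
```

Because the split is random, the command line alone may not be enough to reproduce the outputs; checksums of the intermediate files, or the tool version and any seed it uses, would probably need to be recorded as well.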


Training the Classifier

• ADaM has several different types of classifiers
• Each classifier has a training method and an application method
• ADaM’s Naïve Bayes classifier has the following syntax:


Training the Classifier

• For this demo, we will train a Naïve Bayes classifier:

ITSC_NaiveBayesTrain -c class -i trn.arff -b bayes.txt

• The -i argument specifies the input file name
• The -c argument specifies the name of the class attribute
• The -b argument specifies the name of the classifier file:


Applying the Classifier

• Once trained, the Naïve Bayes classifier can be used to classify unknown instances
• The syntax for ADaM’s Naïve Bayes classifier is as follows:


Applying the Classifier

• For this demo, the classifier is run as follows:

ITSC_NaiveBayesApply -c class -i tst.arff -b bayes.txt -o res_tst.arff

• The -i argument specifies the input file name
• The -c argument specifies the name of the class attribute
• The -b argument specifies the name of the classifier file
• The -o argument specifies the name of the result file:


Evaluating Classifier Performance

• By applying the classifier to a test set where the correct class is known in advance, it is possible to compare the expected output to the actual output.

• The ITSC_Accuracy utility performs this function:


Confusion matrix

• Gives a guide to accuracy but samples (i.e. bias) are important to take into account

Classified \ Actual    0                  1
0                      TRUE POSITIVES     FALSE POSITIVES
1                      FALSE NEGATIVES    TRUE NEGATIVES
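
The table above can be computed directly from paired label lists. The sketch below follows the slide's layout (rows are classified labels, columns are actual labels); the function names and the ten made-up labels are assumptions for this example.

```python
def confusion_matrix(classified, actual, labels=(0, 1)):
    """Count outcomes; rows are classified labels, columns are actual labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for c, a in zip(classified, actual):
        matrix[index[c]][index[a]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal (classified label equals actual label)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Made-up labels for ten test samples.
actual     = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
classified = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
cm = confusion_matrix(classified, actual)
print(cm, accuracy(cm))
```

When one class dominates the sample, overall accuracy can look good even if the rare class is mostly misclassified, which is why the individual cells are worth inspecting.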


Evaluating Classifier Performance

• For this demo, ITSC_Accuracy is run as follows:

ITSC_Accuracy -c class -t res_tst.arff -v tst.arff -o acc_tst.txt


Python Script for Classification
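
The script shown on this slide is not included in the transcript. As a stand-in, the sketch below chains the four demo commands from the preceding slides with `subprocess`, using only the flags that appear there. The function names are hypothetical, and actually running the commands assumes the ADaM executables are installed and on the PATH.

```python
import subprocess

def classification_commands(source="bcw.arff", portion="0.66"):
    """Return the four demo command lines, in order, as argument lists."""
    return [
        ["ITSC_Sample", "-c", "class", "-i", source,
         "-o", "trn.arff", "-t", "tst.arff", "-p", portion],
        ["ITSC_NaiveBayesTrain", "-c", "class", "-i", "trn.arff", "-b", "bayes.txt"],
        ["ITSC_NaiveBayesApply", "-c", "class", "-i", "tst.arff",
         "-b", "bayes.txt", "-o", "res_tst.arff"],
        ["ITSC_Accuracy", "-c", "class", "-t", "res_tst.arff",
         "-v", "tst.arff", "-o", "acc_tst.txt"],
    ]

def run_pipeline(commands):
    """Run each command in turn; stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)

commands = classification_commands()
for command in commands:
    print(" ".join(command))
# run_pipeline(commands)  # uncomment on a machine with the ADaM executables installed
```

Keeping the command lists as data, separate from execution, also makes it easy to log them for provenance before anything is run.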


How would you modify this?


What is the provenance?


ADaM Image Classification

• Classification of image data is a bit more involved, as there is an additional set of steps that must be performed to extract useful features from the images before classification can be performed.
• In addition, it is also useful to transform the data back into image format for visualization purposes.
• As an example problem, we will consider detection of cumulus cloud fields in GOES satellite images
  – GOES satellites produce a 5 channel image every 15 minutes
  – The classifier must label each pixel as either belonging to a cumulus cloud field or not based on the GOES data
  – Algorithms based on spectral properties often miss cumulus clouds because of the low resolution of the IR channels and the small size of clouds
  – Texture features computed from the GOES visible image provide a means to detect cumulus cloud fields.


GOES Images - Preprocessing

• Segmentation is based only on the high resolution (1km) visible channel.

• In order to remove the effects of the light reflected from the Earth’s surface, a visible reference background image is constructed for each time of the day.

• The reference image is subtracted from the visible image before it is segmented.

• GOES image patches containing cumulus cloud regions, other cloud regions, and background were selected

• Independent experts labeled each pixel of the selected image patches as cumulus cloud or not

• The expert labels were combined to form a single “truth” image for each of the original image patches. In cases where the experts disagreed, the truth image was given a “don’t know” value


GOES Images - Example

[Figure panels: GOES Visible Image; Expert Labels]


Image Quantization

• Some texture features perform better when the image is quantized to some small number of levels before the features are computed.
• ITSC_RelLevel performs local image quantization


Image Quantization

• For this demo, we will reduce the number of levels from 256 to just three using local image statistics:

ITSC_RelLevel -d -s 30 -i src.bin -o q4.bin -k

• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -d argument tells the program to use standard deviation to set the cutoffs instead of a fixed value
• The -k option tells the program to keep values in the range 0, 1, 2 rather than normalizing to 0..1.
• The -s argument indicates the size of the local area used to compute statistics
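
The slides do not spell out the exact rule ITSC_RelLevel applies, so the following is only one plausible reading of "local quantization to three levels using standard deviation": each pixel is compared with the mean and standard deviation of its neighborhood. The function name `quantize_local`, the one-standard-deviation cutoffs and the edge handling are all assumptions of this sketch.

```python
import math

def quantize_local(image, size):
    """Map each pixel to 0, 1 or 2 relative to its local mean and standard deviation."""
    h, w = len(image), len(image[0])
    half = size // 2
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Neighborhood clipped at the image border.
            window = [image[i][j]
                      for i in range(max(0, r - half), min(h, r + half + 1))
                      for j in range(max(0, c - half), min(w, c + half + 1))]
            mean = sum(window) / len(window)
            std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
            if image[r][c] < mean - std:
                out[r][c] = 0      # darker than its surroundings
            elif image[r][c] > mean + std:
                out[r][c] = 2      # brighter than its surroundings
            else:
                out[r][c] = 1      # close to the local mean
    return out

# Made-up 3x3 image with one bright pixel in the middle.
image = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
levels = quantize_local(image, size=3)
print(levels)
```

Because the cutoffs follow local statistics, the same absolute brightness can land in different levels in different parts of the image, which is the point of a relative quantization.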


Computing Texture Features

• ADaM is currently able to compute five different types of texture features: gray level cooccurrence, gray level run length, association rules, Gabor filters, and MRF models

• The syntax for gray level run length computation is:


Computing Texture Features

• For this demo, we will compute gray level run length features using a tile size of 25:

ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25

• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -l argument tells the program the number of levels in the input image
• The -B option tells the program to write a binary version of the ARFF file (default is ASCII)
• The -t argument indicates the size of the tiles used to compute the gray level run length features
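
For readers unfamiliar with gray level run length features, the sketch below builds a horizontal run length matrix for one small tile and derives a single classic feature (short run emphasis) from it. It illustrates the general technique only; which directions and features ITSC_Glrl actually computes is not stated on the slides, and the function names and toy tile are assumptions.

```python
def run_length_matrix(image, levels):
    """Horizontal gray level run length matrix.

    matrix[g][r - 1] counts runs of gray level g that are exactly r pixels long.
    """
    width = max(len(row) for row in image)
    matrix = [[0] * width for _ in range(levels)]
    for row in image:
        start = 0
        for i in range(1, len(row) + 1):
            # A run ends at the end of the row or when the gray level changes.
            if i == len(row) or row[i] != row[start]:
                matrix[row[start]][i - start - 1] += 1
                start = i
    return matrix

def short_run_emphasis(matrix):
    """Weights each run by 1 / length**2, so fine textures score higher."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(count / (r + 1) ** 2
                   for row in matrix for r, count in enumerate(row))
    return weighted / total

# A 3-level toy tile (made up), like the output of a quantization step.
tile = [
    [0, 0, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [0, 1, 0, 1, 0],
]
glrl = run_length_matrix(tile, levels=3)
print(glrl, short_run_emphasis(glrl))
```

In a tiled computation, one such matrix would be built per tile and its derived features written out as one pattern (row) of the ARFF file.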


Provenance alert!

• For this demo, we will compute gray level run length features using a tile size of 25:

ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25

• What needs to be documented here and why?


Converting the Label Images

• Since the labels are in the form of images, it is necessary to convert them to vector form

• ITSC_CvtImageToArff will do this:


Converting ??????

• Since the labels are in the form of images, it is necessary to convert them to vector form
• Consequences?
• Do you save them?
• Discussion?


Converting the Label Images

• The labels can be converted to vector form using:

ITSC_CvtImageToArff -i lbl.bin -o lbl.arff -B

• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -B argument tells the program to write the output file in binary form (default is ASCII)


Labeling the Patterns

• Once the labels are in vector form, they can be appended to the patterns produced by ITSC_Glrl
• ITSC_LabelPatterns will do this:


Labeling the Patterns

• The labels are assigned to patterns as follows:

ITSC_LabelPatterns -i glrl.arff -c class -l lbl.bin -L lbl.arff -o all.arff -B

• The -i argument specifies the input file name (patterns)
• The -o argument specifies the output file name
• The -c argument specifies the name of the class attribute in the pattern set
• The -l argument specifies the name of the label attribute in the label set
• The -L argument specifies the name of the input label file
• The -B argument tells the program to write the output file in binary form (default is ASCII)


Eliminating “Don’t Know” Patterns

• Some of the original pixels were classified differently by different experts and marked as “don’t know”

• The corresponding patterns can be removed from the training set using ITSC_Subset:


Eliminating “Don’t Know” Patterns

• ITSC_Subset is used to remove patterns with unclear class assignment. The subset is generated based on the value of the class attribute:

ITSC_Subset -i all.arff -o subset.arff -a class -r 0 1 -B

• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -a argument tells which attribute to test
• The -r argument tells the legal range of the attribute
• The -B argument tells the program to write the output file in binary form (default is ASCII)


Selecting Random Samples

• Random samples are selected from the original training data using the same ITSC_Sample program shown in the previous demo
• The program is used in a slightly different way:

ITSC_Sample -i subset.arff -c class -o s1.arff -n 2000

• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -c argument specifies the name of the class attribute
• The -n option tells the program to select an equal number of random samples (in this case 2000) from each class.
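
Drawing the same number of samples from each class can be sketched as follows. This illustrates the idea behind the -n option only; it is not ITSC_Sample's code, and the function name `sample_per_class`, the without-replacement choice and the toy patterns are assumptions.

```python
import random
from collections import defaultdict

def sample_per_class(patterns, class_of, n, seed=0):
    """Draw n random patterns from each class, without replacement."""
    groups = defaultdict(list)
    for p in patterns:
        groups[class_of(p)].append(p)
    rng = random.Random(seed)
    selected = []
    for label in sorted(groups):
        if len(groups[label]) < n:
            raise ValueError(f"class {label!r} has fewer than {n} patterns")
        selected.extend(rng.sample(groups[label], n))
    return selected

# Made-up patterns: (feature, class) pairs with an imbalanced class mix.
patterns = [(i, 0) for i in range(50)] + [(i, 1) for i in range(8)]
subset = sample_per_class(patterns, class_of=lambda p: p[1], n=5)
print(subset)
```

Balancing the classes this way keeps the rarer class from being swamped during training, at the cost of discarding data from the more common one.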


Python Script for Sample Creation


What modifications here??


Merging Samples / Multiple Images

• The procedure up to this point has created a random subset of points from a particular image. Subsets from multiple images can be combined using ITSC_MergePatterns:


Merging Samples / Multiple Images

• Multiple pattern sets are merged using the following command:

ITSC_MergePatterns -c class -o merged.arff -i s1.arff s2.arff

• The -i argument specifies the input file names
• The -o argument specifies the output file name
• The -c argument specifies the name of the class attribute


Python Script for Training


Results of Classifier Evaluation

• The results of running this procedure using five sample images of size 500x500 are as follows:


Applying the Classifier to Images

• Once the classifier is trained, it can be applied to segment images. One further program is required at the end to convert the classified patterns back into an image:


Python Function for Segmentation


Sample Image Results

[Figure panels: Expert Labels; Segmentation Result]


Remarks

• The procedure illustrated here is one specific example of ADaM’s capabilities
• There are many other classifiers, texture features and other tools that could be used for this problem
• Since all of the algorithms of a particular type work in more or less the same way, the same general procedure could be used with other tools
• DOWNLOAD the ADaM Toolkit
  – http://datamining.itsc.uah.edu/adam/


Numpy, scipy

• http://scikit-learn.org/stable/
• http://orange.biolab.si


R

• http://www.rdatamining.com


Management

• What did you learn?
• Provenance elements?
• How to deal with both?


What’s next


Project Teams (final)

1: Taoran L., Sumithra, Matt K., Rohan, Sicong

2: Anthony, Paul Z., Sarah, Trilok, Charles, Ashwin

3: Michael Shih, Eric D., Apoorva M., Aritra C., Geoffrey W.

4: Jeff D., Niharika, Anand S., Rahul K., Brenda

5: Daniel S., Renaldo S., Pooja, Ahmed, Apurva S.

6: Ranjani, Mithun K .N., Chenxi P., Eric H., Jiaju

7: Aayush, J. Dean McD., Luying W., Nachiket B., Alexa

8: Anshul K, Nitish, Bo Y., Uzma M., Daniel B-C.,

9: Ashley V., Kevin L., Guomin S., Chetan B., Sameer S.

10: Huey M., Kathleen T., Saurabh S., Michael P., Xueyang, Antwane

Anyone missing?