Final Project- Mining Mushroom World

Upload: wade-ramirez

Post on 01-Jan-2016


TRANSCRIPT

Page 1: Final Project- Mining Mushroom World

Final Project-Mining Mushroom World

Page 2: Final Project- Mining Mushroom World

Agenda

• Motivation and Background

• Determine the Data Set (2)

• 10 DM Methodology steps (19)

• Conclusion

Page 3: Final Project- Mining Mushroom World

Motivation and Background

• To distinguish between edible mushrooms and poisonous ones by how they look

• To know whether we can eat the mushroom, to survive in the wild

• To survive outside the computer world

Page 4: Final Project- Mining Mushroom World

Determine the Data Set (1/2)

Source of data: UCI Machine Learning Repository

Mushrooms Database

• From the Audubon Society Field Guide

• Documentation: complete, but missing statistical information

• Described in terms of physical characteristics

• Classification: poisonous or edible

• All attributes are nominal-valued

*Large database: 8124 instances (2480 missing values for attribute #11, stalk-root)

Page 5: Final Project- Mining Mushroom World

Determine the Data Set (2/2)

1. Past Usage

• Schlimmer, J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19).

• Iba, W., Wogulis, J., & Langley, P. (1988). ICML, 73-79

2. No other mushroom data available

Page 6: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 1. Translate the Business Problem into a Data Mining Problem

a. Data Mining Goal: separate edible mushrooms from poisonous ones

b. How will the Results be Used: to increase the survival rate

c. How will the Results be Delivered: Decision Tree, Naïve Bayes, Ripper, NeuralNet

Page 7: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 2. Select Appropriate Data

a. Data Source

– The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf

– Jeff Schlimmer donated these data on April 27th, 1987

b. Volumes of Data

– Total 8124 instances

– 4208 (51.8%) edible; 3916 (48.2%) poisonous

– 2480 (30.5%) missing in attribute “stalk-root”
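The percentages on this slide follow directly from the counts. A minimal Python check, where the variable and function names are illustrative and not from the slides:

```python
# Counts as reported on the slide for the UCI mushroom data.
total = 8124
counts = {"edible": 4208, "poisonous": 3916, "stalk-root missing": 2480}

def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)

shares = {name: share(n, total) for name, n in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The two class counts add up to the 8124 instances, so the classes are close to balanced before any sampling.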

Page 8: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 2. Select Appropriate Data

c. How Many Variables: 22 attributes (cap-shape, cap-color, odor, population, habitat, and so on)

d. How Much History is Required: none, since there is no seasonality

*As long as we can eat them when we see them

Page 9: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 3. Get to Know the Data

a. Examine Distributions: use “Weka” to visualize all 22 attributes with histograms

b. Class: edible=e, poisonous=p

Page 10: Final Project- Mining Mushroom World

Step 3. Get to Know the Data

a. Examine Distributions: there are 2 types of histograms

– First type: every declared value appears in the data

– Example: (Attribute 21) population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y

Page 11: Final Project- Mining Mushroom World

Step 3. Get to Know the Data

1. Examine Distributions: there are 2 types of histograms

– Second type: only some of the declared values appear in the data

– Example: (Attribute 7) gill-spacing: close=c, crowded=w, distant=d
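The difference between the two histogram types can also be checked without a plot, by comparing the declared codes of an attribute with the codes that actually occur. A minimal sketch: the declared codes come from the slide, while `unseen_values` and the toy column are hypothetical and do not reproduce the real data.

```python
from collections import Counter

def unseen_values(declared, observed):
    """Return the declared category codes that never occur in the observed column."""
    seen = Counter(observed)  # a Counter returns 0 for codes it has not seen
    return sorted(code for code in declared if seen[code] == 0)

# Declared codes are from the slide; the column is a toy sample, not the real data.
gill_spacing_declared = {"c": "close", "w": "crowded", "d": "distant"}
toy_column = ["c", "c", "w", "c", "w"]

print(unseen_values(gill_spacing_declared, toy_column))  # ['d']
```

An empty result corresponds to the first histogram type; a non-empty one corresponds to the second.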

Page 12: Final Project- Mining Mushroom World

Step 3. Get to Know the Data

1. Examine Distributions: there are exceptions

– Exception 1: missing values in the attribute

– (Attribute 11) stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?

– 2480 of the 8124 instances have a missing value for this attribute

Page 13: Final Project- Mining Mushroom World

Step 3. Get to Know the Data

1. Examine Distributions: there are exceptions

– Exception 2: an attribute that does not help distinguish the classes

– (Attribute 16) veil-type: partial=p, universal=u
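One way to spot such an attribute automatically, assuming the exception means that the attribute takes a single value across the whole data set, is to collect the distinct values per attribute. The helper name and the toy rows below are hypothetical.

```python
def constant_attributes(rows):
    """Return the names of attributes that take exactly one value across all rows."""
    values = {}
    for row in rows:
        for name, value in row.items():
            values.setdefault(name, set()).add(value)
    return sorted(name for name, seen in values.items() if len(seen) == 1)

# Toy rows for illustration only; they are not taken from the real data.
toy_rows = [
    {"veil-type": "p", "odor": "n"},
    {"veil-type": "p", "odor": "f"},
    {"veil-type": "p", "odor": "a"},
]

print(constant_attributes(toy_rows))  # ['veil-type']
```

An attribute reported by this check carries no information for classification and could be dropped before modelling.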

Page 14: Final Project- Mining Mushroom World

Step 3. Get to Know the Data

2. Compare Values with Descriptions

– No unexpected values except for missing values

Page 15: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 4. Create a Model Set

– Creating a Balanced Sample: 75% (6093) as training data, 25% (2031) as test data

– Rapid Miner’s “cross-validation” function: k-1 folds as training, 1 fold as test
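The two ways of separating training from test data mentioned here can be sketched in plain Python. This is only an illustration of the idea, not Rapid Miner's implementation: the function names, the seed, and the round-robin fold assignment are assumptions, and the split below is not stratified by class.

```python
import random

def holdout_split(n_rows, train_fraction=0.75, seed=0):
    """Shuffle row indices and split them into training and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]

def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists; each fold serves as the test set exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

train, test = holdout_split(8124)
print(len(train), len(test))  # 6093 2031
```

With a 75/25 holdout every row is used once, either for training or for testing; with k-fold cross-validation every row is tested once and used for training k-1 times.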

Page 16: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 5. Fix Problems with the Data

– Dealing with Missing Values: the attribute “stalk-root” has 2480 missing values

– Because the attribute is nominal, its “average” is the most frequent value (the mode); replace all missing values with it

– We replaced ‘?’ with the most frequent value, ‘b’
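For a nominal attribute, replacing missing values with the "average" amounts to replacing them with the most frequent category. A minimal sketch of that idea: `impute_mode` and the toy column are hypothetical, and the code assumes that at least one value in the column is not missing.

```python
from collections import Counter

def impute_mode(column, missing="?"):
    """Replace the missing marker with the most frequent non-missing value."""
    known = [v for v in column if v != missing]
    mode = Counter(known).most_common(1)[0][0]  # assumes `known` is not empty
    return [mode if v == missing else v for v in column]

# Toy column for illustration only; it is not the real stalk-root data.
toy_stalk_root = ["b", "?", "e", "b", "c", "?", "b"]

print(impute_mode(toy_stalk_root))  # ['b', 'b', 'e', 'b', 'c', 'b', 'b']
```

With roughly 30% of the values missing, this choice shifts the attribute's distribution strongly towards the mode, which is worth keeping in mind when reading the model results.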

Page 17: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 6. Transform Data to Bring Information to the Surface

– All attributes are nominal, so no numerical analysis is done in this step

Page 18: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 7. Build Model

1. Decision Tree

Performance

– Accuracy: 99.11%

– Lift: 189.81%

              True p   True e   Class precision
Pred. p          961        0   100%
Pred. e           18     1052   98.32%
Class recall  98.16%  100.00%
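The accuracy, class precision, and class recall figures in these tables can be recomputed from the four cells of each confusion matrix. A short sketch using the decision tree counts from this slide; the function and argument names are illustrative.

```python
def confusion_metrics(pp, pe, ep, ee):
    """Compute accuracy and per-class precision and recall from a 2x2 confusion matrix.

    pp: predicted p, truly p    pe: predicted p, truly e
    ep: predicted e, truly p    ee: predicted e, truly e
    """
    total = pp + pe + ep + ee
    return {
        "accuracy": (pp + ee) / total,
        "precision_p": pp / (pp + pe),
        "precision_e": ee / (ep + ee),
        "recall_p": pp / (pp + ep),
        "recall_e": ee / (pe + ee),
    }

# Counts from the decision tree table on this slide.
decision_tree = confusion_metrics(pp=961, pe=0, ep=18, ee=1052)
for name, value in decision_tree.items():
    print(f"{name}: {100 * value:.2f}%")
```

For this application the 18 poisonous mushrooms predicted as edible are the costly errors, so recall for class p (98.16%) is the figure to watch alongside overall accuracy.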

Page 19: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 7. Build Model

2. Naïve Bayes

Performance

– Accuracy: 95.77%

– Lift: 179.79%

              True p   True e   Class precision
Pred. p          902        9   99.01%
Pred. e           77     1043   93.12%
Class recall  92.13%   99.14%

Page 20: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 7. Build Model

3. Ripper

Performance

– Accuracy: 100%

– Lift: 193.06%

               True p   True e   Class precision
Pred. p           979        0   100.00%
Pred. e             0     1052   100.00%
Class recall  100.00%  100.00%

Page 21: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 7. Build Model

4. NeuralNet

Performance

– Accuracy: 91.04%

– Lift: 179.35%

              True p   True e   Class precision
Pred. p          907      110   89.18%
Pred. e           72      942   92.90%
Class recall  92.65%   89.54%

Page 22: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 8. Assess Models

– Accuracy: Ripper and Decision Tree perform better

[Bar chart: Accuracy (%) by model. Decision Tree 99.11, Naïve Bayes 95.77, Ripper 100, Neural Net 91.04]

Page 23: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 8. Assess Models

– Lift (to compare the performances of different classification models): Ripper and Decision Tree have higher lifts

[Bar chart: Lift (%) by model, labelled 1 to 4 on the axis. Decision Tree 189.81, Naïve Bayes 179.79, Ripper 193.06, Neural Net 179.35]
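The slides do not define how lift is computed. One reading that reproduces all four reported values is the precision for class e divided by the share of edible mushrooms in the test set (1052 of 2031). The sketch below works under that assumption, with counts taken from the "Pred. e" rows of the Step 7 confusion matrices; the function name is illustrative.

```python
def lift(pred_e_true_p, pred_e_true_e, n_true_e, n_total):
    """Lift for class e: precision of the 'edible' predictions over the base rate of e."""
    precision = pred_e_true_e / (pred_e_true_p + pred_e_true_e)
    base_rate = n_true_e / n_total
    return precision / base_rate

# (predicted e but truly p, predicted e and truly e) from the Step 7 tables.
pred_e_rows = {
    "Decision Tree": (18, 1052),
    "Naive Bayes": (77, 1043),
    "Ripper": (0, 1052),
    "NeuralNet": (72, 942),
}

lifts = {
    name: round(100 * lift(ep, ee, n_true_e=1052, n_total=2031), 2)
    for name, (ep, ee) in pred_e_rows.items()
}
for name, value in lifts.items():
    print(f"{name}: {value}%")
```

Under this definition the highest possible lift on this test set is 2031/1052, about 193.06%, which is the value Ripper reaches with perfect precision.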

Page 24: Final Project- Mining Mushroom World

10 DM Methodology steps

Step 9. Deploy Models

– We have not yet gone out to find real mushrooms

Step 10. Assess Results

Conclusion and questions

– Ripper and Decision Tree may be better models for nominal data

– Question: how does Rapid Miner separate training data from test data?

Page 25: Final Project- Mining Mushroom World