Database Management Systems: Data Mining
Attribute Evaluation
Jerry Post, Copyright © 2003


Page 1: Database Management Systems: Data Mining

Jerry Post, Copyright © 2003

Database Management Systems: Data Mining

Attribute Evaluation

Page 2: Database Management Systems: Data Mining


Multiple Regression

Y = b0 + b1X1 + b2X2 + … + bkXk

Regression estimates the b coefficients.

If a b value is zero, the corresponding X attribute does not influence the Y variable.

The b coefficient also indicates the strength of the relationship: dY/dXi = bi, so a one-unit increase in Xi produces a bi change in Y.
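A quick sketch (not from the slides, which use Excel's regression tool) of estimating the b coefficients by ordinary least squares in Python, on synthetic data where b2 is truly zero:

```python
import numpy as np

# Illustrative sketch: estimate b0, b1, b2 for Y = b0 + b1*X1 + b2*X2.
# The data is noiseless and built with b2 = 0, so X2 has no influence on Y.
rng = np.random.default_rng(0)
n = 200
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 10, n)
Y = 3.0 + 2.0 * X1 + 0.0 * X2

A = np.column_stack([np.ones(n), X1, X2])    # design matrix with intercept
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(b, 3))    # recovers [3.0, 2.0, 0.0]: the X2 coefficient is zero
```

The estimated coefficient on X2 comes back as zero, matching the slide's point that a zero b means the attribute does not influence Y.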

Page 3: Database Management Systems: Data Mining


Regression Example. Query: Sales by Year by City Population:

SELECT Format([orderdate],"yyyy") AS SaleYear, City.Population1990, Sum(Bicycle.SalePrice) AS SumOfSalePrice

FROM City RIGHT JOIN (Customer INNER JOIN Bicycle ON Customer.CustomerID = Bicycle.CustomerID) ON City.CityID = Customer.CityID

GROUP BY Format([orderdate],"yyyy"), City.Population1990

HAVING (((City.Population1990)>0));

Paste the data into Excel, then run Tools / Data Analysis / Regression.
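The same aggregation can be sketched in Python with SQLite and a few hypothetical sample rows (an assumption for illustration; table and column names follow the slide's schema, Access's RIGHT JOIN is rewritten as a LEFT JOIN from the inner join, and Format([orderdate],"yyyy") becomes strftime):

```python
import sqlite3

# Hedged sketch: the Access query adapted to SQLite with tiny made-up data.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE City (CityID INTEGER, Population1990 INTEGER);
CREATE TABLE Customer (CustomerID INTEGER, CityID INTEGER);
CREATE TABLE Bicycle (CustomerID INTEGER, OrderDate TEXT, SalePrice REAL);
INSERT INTO City VALUES (1, 50000), (2, 0);
INSERT INTO Customer VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO Bicycle VALUES
    (10, '1998-03-01', 400), (11, '1998-07-15', 600), (12, '1999-05-02', 500);
""")
# RIGHT JOIN from City == LEFT JOIN to City from the Customer/Bicycle join.
rows = cur.execute("""
    SELECT strftime('%Y', OrderDate) AS SaleYear,
           City.Population1990,
           SUM(Bicycle.SalePrice) AS SumOfSalePrice
    FROM Customer
         INNER JOIN Bicycle ON Customer.CustomerID = Bicycle.CustomerID
         LEFT JOIN City ON City.CityID = Customer.CityID
    GROUP BY SaleYear, City.Population1990
    HAVING City.Population1990 > 0
""").fetchall()
print(rows)   # [('1998', 50000, 1000.0)]
```

The HAVING clause filters out the zero-population group, just as in the slide's query.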

Page 4: Database Management Systems: Data Mining


Regression Results

R Square is about 0.75, so roughly 75% of the variation is explained. The P-values are less than 0.05, so the coefficients are significantly different from zero.

Regression Statistics
Multiple R           0.8647
R Square             0.7476
Adjusted R Square    0.7476
Standard Error       7464.1009
Observations         12081

ANOVA
             df      SS           MS            F
Regression   2       1.9936E+12   9.96799E+11   17891.74218
Residual     12078   6.72899E+11  55712802.45
Total        12080   2.6665E+12

                 Coefficients   Standard Error   t Stat    P-value
Intercept        -708867.855    46760.007        -15.160   0.000
SaleYear         355.889        23.384           15.219    0.000
Population1990   0.033          0.000            188.872   0.000

Each year, sales increase by about $356.

For each additional 1,000 people, sales increase by about $33.
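As a quick arithmetic check, the slide's fitted coefficients can be plugged back in to confirm those two readings (the example year and population values are arbitrary):

```python
# The fitted model from the regression output:
# Sales = -708867.855 + 355.889*SaleYear + 0.033*Population1990
b0, b_year, b_pop = -708867.855, 355.889, 0.033

def predicted_sales(year, population):
    return b0 + b_year * year + b_pop * population

# Marginal effects: one more year, and 1,000 more people.
one_more_year = predicted_sales(1999, 50000) - predicted_sales(1998, 50000)
thousand_more_people = predicted_sales(1998, 51000) - predicted_sales(1998, 50000)
print(round(one_more_year, 2))          # 355.89 -> about $356 per year
print(round(thousand_more_people, 2))   # 33.0   -> about $33 per 1,000 people
```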

Page 5: Database Management Systems: Data Mining


Information Gain: Partitioning

In 1948, Shannon defined information (I) as:

I = -Σi pi log2(pi)

[Chart: -pi log2(pi) plotted for pi from 0.01 to 0.97 (vertical axis 0 to 0.6); the curve rises from zero near pi = 0, peaks around pi ≈ 0.37, and falls back toward zero as pi approaches 1.]

If pi is zero or one, there is no information—since you always know what will happen.
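A minimal sketch of the formula (the function name is mine) showing both extremes: a 50/50 split carries the maximum one bit, while a certain outcome carries none.

```python
import math

# Shannon information for a set split into class counts:
# I = -sum(p_i * log2(p_i)), skipping empty classes (p = 0 contributes 0).
def information(*counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(information(1, 1))   # 1.0 -- p = 0.5 each: maximum uncertainty
print(information(1, 0))   # 0.0 -- p = 1: the outcome is always known
```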

Page 6: Database Management Systems: Data Mining


Information Example

E(A) = Σj=1..v [(s1j + … + smj) / s] × I(s1j, …, smj)

Types of shoppers (m=2): status is high roller or tourist

S is a set of data (rows)

The dataset contains attributes (A), such as: Income, Age_range, Region, and Gender.

Each attribute has many (v) possible values. For example, Income categories are: low, medium, high, and wealthy.

The subset Sij contains the rows of customers in category i who have attribute value j; sij is the number of rows in that subset.

The entropy of attribute A defined from this partitioning is E(A), given above.

The information gain from the partitioning is

Gain(A) = I(s1, s2, …, sm) - E(A)

Find the attribute with the highest gain.

Page 7: Database Management Systems: Data Mining


Data for Information Example

Class 1: High roller
Gender  Income   Age_range  Region     Count
M       High     Middle     Northeast  12
M       Wealthy  Old        West       8
M       Medium   Young      West       21
F       High     Middle     South      32
M       Low      Young      Northeast  17
M       High     Old        Midwest    14
Total: 104

Class 2: Tourist
Gender  Income   Age_range  Region     Count
M       Low      Young      West       25
F       Low      Young      West       10
M       Medium   Middle     Midwest    32
M       High     Young      Northeast  5
F       Medium   Young      West       8
M       Low      Old        Northeast  27
Total: 107

s1 = 104, s2 = 107, s = 211

I(s1, s2) = -(104/211) log2(104/211) - (107/211) log2(107/211) = 0.9999
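This overall information figure can be checked directly in Python:

```python
import math

# I(s1, s2) for s1 = 104 high rollers and s2 = 107 tourists out of s = 211.
s1, s2 = 104, 107
s = s1 + s2
I = -(s1 / s) * math.log2(s1 / s) - (s2 / s) * math.log2(s2 / s)
print(round(I, 4))   # 0.9999 -- the split is nearly 50/50, so almost 1 full bit
```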

Expected information for income categories:

Value     High roller   Tourist   Sum   I(s1j, s2j)   Weighted
Low       17            62        79    0.2262        0.0847
Medium    21            40        61    0.2796        0.0808
High      58            5         63    0.1204        0.0359
Wealthy   8             0         8     0.0000        0.0000
Total     104           107       211                 0.2015

E(income) = 0.2015
Gain(income) = 0.9999 - 0.2015 = 0.7984

(Each weighted value is the category's share times its information; for the Low row it is 79/211 × I(…).)

Page 8: Database Management Systems: Data Mining


Results for Information

Attribute    Gain
Income       0.7984
Gender       0.7048
Age_range    0.7025
Region       0.7549

All values are relatively high, so all attributes are important.

Page 9: Database Management Systems: Data Mining


Dimensionality

Notice the issue of dimensionality in the example. We had to set up groups within the attributes. If there are too many groupings/values:

- The system will take a long time to run.
- Many subgroups will have no observations.

How do you establish the groupings/values?

- Natural hierarchies (e.g., dates)
- Cluster analysis
- Prior knowledge
- Level of detail required for analysis

Page 10: Database Management Systems: Data Mining


Non-Linear Estimation

Regression:

- Polynomial: Y = b0 + b1X + b2X^2 + b3X^3 + b4X^4 + … + u
- Exponential: Y = b0 X^b1 e^u, which gives ln(Y) = ln(b0) + b1 ln(X) + u
- Log-Linear: ln(Y) = b0 + b1 ln(X) + u
- Other: log-log and more

Other Methods:

- Neural networks
- Search
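A brief sketch (not the slides' tooling) of the polynomial form: fitting Y = b0 + b1X + b2X^2 + b3X^3 by least squares with NumPy on noiseless synthetic data, so the known coefficients are recovered exactly.

```python
import numpy as np

# Synthetic data from a known cubic: y = 4 - 2x + 0.5x^2 + x^3 (no noise).
x = np.linspace(-5, 5, 50)
y = 4.0 - 2.0 * x + 0.5 * x**2 + 1.0 * x**3

# np.polyfit solves the least-squares problem for the polynomial coefficients,
# returned highest power first.
coeffs = np.polyfit(x, y, deg=3)
print(np.round(coeffs, 3))   # recovers 1.0, 0.5, -2.0, 4.0 (x^3 term first)
```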

Page 11: Database Management Systems: Data Mining


Example: PolyAnalyst: Find Law for MPG

mpg = (2.59183e+009*power*age + 176465*power*age*weight + 2.41554e+009*power*age*age - 3.54349e+009*power + 7.27281e+007*age*weight - 2.55635e+010) / (power*age*weight + 52028.3*power*age*age*weight)

Best exact rule found:

mpg = (4.71047e+008*power*age*weight - 38783.5*power*age*weight*weight + 2.5987e+009*power*age*age*weight - 7.65205e+009*power*weight + 1.5658e+008*age*weight*weight + 1.15859e+011*power*power - 3.0532e+013*age*age) / (power*age*weight*weight + 52028.3*power*age*age*weight*weight)

Page 12: Database Management Systems: Data Mining


MPG Versus Weight

Page 13: Database Management Systems: Data Mining


Problems with Non-Linear Models

- They can be harder to estimate.
- They are substantially more difficult to optimize.
- They are often unstable, particularly at the ends.

[Chart: Y plotted against X for X from -25 to 23; the vertical axis runs from -60,000 to 140,000.]

Y = 15000 - 850X - 435X^2 + 2X^3 + X^4

Note: (x + 7)(x – 5)(x + 20)(x – 20)
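Evaluating this quartic at a few points illustrates the instability at the ends of the plotted range: the curve stays moderate near the middle but swings to very large values at the edges.

```python
# The quartic from the chart above.
def f(x):
    return 15000 - 850 * x - 435 * x**2 + 2 * x**3 + x**4

print(f(0))     # 15000  -- near the middle of the range
print(f(-25))   # 123750 -- left edge: the x^4 term dominates and Y explodes
print(f(23))    # 69510  -- right edge: another large swing
```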