Database Management Systems: Data Mining
Attribute Evaluation
Jerry Post, Copyright © 2003


Page 1: Database Management Systems: Data Mining

Jerry Post, Copyright © 2003

Database Management Systems: Data Mining

Attribute Evaluation

Page 2: Database Management Systems: Data Mining


Multiple Regression

Y = b0 + b1X1 + b2X2 + … + bkXk

Regression estimates the b coefficients.

If a b value is zero, the corresponding X attribute does not influence the Y variable.

The b coefficient also indicates the strength of the relationship: dY/dXi = bi, so a one-unit increase in Xi produces a bi change in Y.
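A quick sketch (not from the slides, which use Excel's regression tool) of estimating the b coefficients by ordinary least squares in Python, on synthetic data where b2 is truly zero:

```python
import numpy as np

# Illustrative sketch: estimate b0, b1, b2 for Y = b0 + b1*X1 + b2*X2.
# The data is noiseless and built with b2 = 0, so X2 has no influence on Y.
rng = np.random.default_rng(0)
n = 200
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 10, n)
Y = 3.0 + 2.0 * X1 + 0.0 * X2

A = np.column_stack([np.ones(n), X1, X2])    # design matrix with intercept
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(b, 3))    # recovers [3.0, 2.0, 0.0]: the X2 coefficient is zero
```

The estimated coefficient on X2 comes back as zero, matching the slide's point that a zero b means the attribute does not influence Y.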

Page 3: Database Management Systems: Data Mining


Regression Example. Query: Sales by Year by City Population:

SELECT Format([orderdate],"yyyy") AS SaleYear, City.Population1990, Sum(Bicycle.SalePrice) AS SumOfSalePrice

FROM City RIGHT JOIN (Customer INNER JOIN Bicycle ON Customer.CustomerID = Bicycle.CustomerID) ON City.CityID = Customer.CityID

GROUP BY Format([orderdate],"yyyy"), City.Population1990

HAVING (((City.Population1990)>0));

Paste the data into Excel, then run Tools / Data Analysis / Regression.
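The same aggregation can be sketched in Python with SQLite and a few hypothetical sample rows (an assumption for illustration; table and column names follow the slide's schema, Access's RIGHT JOIN is rewritten as a LEFT JOIN from the inner join, and Format([orderdate],"yyyy") becomes strftime):

```python
import sqlite3

# Hedged sketch: the Access query adapted to SQLite with tiny made-up data.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE City (CityID INTEGER, Population1990 INTEGER);
CREATE TABLE Customer (CustomerID INTEGER, CityID INTEGER);
CREATE TABLE Bicycle (CustomerID INTEGER, OrderDate TEXT, SalePrice REAL);
INSERT INTO City VALUES (1, 50000), (2, 0);
INSERT INTO Customer VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO Bicycle VALUES
    (10, '1998-03-01', 400), (11, '1998-07-15', 600), (12, '1999-05-02', 500);
""")
# RIGHT JOIN from City == LEFT JOIN to City from the Customer/Bicycle join.
rows = cur.execute("""
    SELECT strftime('%Y', OrderDate) AS SaleYear,
           City.Population1990,
           SUM(Bicycle.SalePrice) AS SumOfSalePrice
    FROM Customer
         INNER JOIN Bicycle ON Customer.CustomerID = Bicycle.CustomerID
         LEFT JOIN City ON City.CityID = Customer.CityID
    GROUP BY SaleYear, City.Population1990
    HAVING City.Population1990 > 0
""").fetchall()
print(rows)   # [('1998', 50000, 1000.0)]
```

The HAVING clause filters out the zero-population group, just as in the slide's query.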

Page 4: Database Management Systems: Data Mining


Regression Results

R Square is about 0.75, so roughly 75% of the variation is explained. The P-values are less than 0.05, so the coefficients are significantly different from zero.

Regression Statistics
Multiple R           0.8647
R Square             0.7476
Adjusted R Square    0.7476
Standard Error       7464.1009
Observations         12081

ANOVA
             df      SS           MS            F
Regression   2       1.9936E+12   9.96799E+11   17891.74218
Residual     12078   6.72899E+11  55712802.45
Total        12080   2.6665E+12

                 Coefficients   Standard Error   t Stat    P-value
Intercept        -708867.855    46760.007        -15.160   0.000
SaleYear         355.889        23.384           15.219    0.000
Population1990   0.033          0.000            188.872   0.000

Each year, sales increase by about $356.

For each additional 1,000 people, sales increase by about $33.
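As a quick arithmetic check, the slide's fitted coefficients can be plugged back in to confirm those two readings (the example year and population values are arbitrary):

```python
# The fitted model from the regression output:
# Sales = -708867.855 + 355.889*SaleYear + 0.033*Population1990
b0, b_year, b_pop = -708867.855, 355.889, 0.033

def predicted_sales(year, population):
    return b0 + b_year * year + b_pop * population

# Marginal effects: one more year, and 1,000 more people.
one_more_year = predicted_sales(1999, 50000) - predicted_sales(1998, 50000)
thousand_more_people = predicted_sales(1998, 51000) - predicted_sales(1998, 50000)
print(round(one_more_year, 2))          # 355.89 -> about $356 per year
print(round(thousand_more_people, 2))   # 33.0   -> about $33 per 1,000 people
```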

Page 5: Database Management Systems: Data Mining


Information Gain: Partitioning

In 1948, Shannon defined information (I) as:

I = -Σi pi log2(pi)

[Chart: -pi log2(pi) plotted for pi from 0.01 to 0.97 (vertical axis 0 to 0.6); the curve rises from zero near pi = 0, peaks around pi ≈ 0.37, and falls back toward zero as pi approaches 1.]

If pi is zero or one, there is no information—since you always know what will happen.
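A minimal sketch of the formula (the function name is mine) showing both extremes: a 50/50 split carries the maximum one bit, while a certain outcome carries none.

```python
import math

# Shannon information for a set split into class counts:
# I = -sum(p_i * log2(p_i)), skipping empty classes (p = 0 contributes 0).
def information(*counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(information(1, 1))   # 1.0 -- p = 0.5 each: maximum uncertainty
print(information(1, 0))   # 0.0 -- p = 1: the outcome is always known
```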

Page 6: Database Management Systems: Data Mining


Information Example

E(A) = Σj=1..v [(s1j + … + smj) / s] × I(s1j, …, smj)

Types of shoppers (m=2): status is high roller or tourist

S is a set of data (rows)

The dataset contains attributes (A), such as: Income, Age_range, Region, and Gender.

Each attribute has many (v) possible values. For example, Income categories are: low, medium, high, and wealthy.

The subset Sij contains the rows of customers in category i who have attribute value j; sij is the number of rows in that subset.

The entropy of attribute A defined from this partitioning is E(A), given above.

The information gain from the partitioning is

Gain(A) = I(s1, s2, …, sm) - E(A)

Find the attribute with the highest gain.

Page 7: Database Management Systems: Data Mining


Data for Information Example

Class 1: High roller
Gender  Income   Age_range  Region     Count
M       High     Middle     Northeast  12
M       Wealthy  Old        West       8
M       Medium   Young      West       21
F       High     Middle     South      32
M       Low      Young      Northeast  17
M       High     Old        Midwest    14
Total: 104

Class 2: Tourist
Gender  Income   Age_range  Region     Count
M       Low      Young      West       25
F       Low      Young      West       10
M       Medium   Middle     Midwest    32
M       High     Young      Northeast  5
F       Medium   Young      West       8
M       Low      Old        Northeast  27
Total: 107

s1 = 104, s2 = 107, s = 211

I(s1, s2) = -(104/211) log2(104/211) - (107/211) log2(107/211) = 0.9999
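This overall information figure can be checked directly in Python:

```python
import math

# I(s1, s2) for s1 = 104 high rollers and s2 = 107 tourists out of s = 211.
s1, s2 = 104, 107
s = s1 + s2
I = -(s1 / s) * math.log2(s1 / s) - (s2 / s) * math.log2(s2 / s)
print(round(I, 4))   # 0.9999 -- the split is nearly 50/50, so almost 1 full bit
```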

Expected information for income categories:

Value     High roller   Tourist   Sum   I(s1j, s2j)   Weighted
Low       17            62        79    0.2262        0.0847
Medium    21            40        61    0.2796        0.0808
High      58            5         63    0.1204        0.0359
Wealthy   8             0         8     0.0000        0.0000
Total     104           107       211                 0.2015

E(income) = 0.2015
Gain(income) = 0.9999 - 0.2015 = 0.7984

(Each weighted value is the category's share times its information; for the Low row it is 79/211 × I(…).)

Page 8: Database Management Systems: Data Mining


Results for Information

Attribute    Gain
Income       0.7984
Gender       0.7048
Age_range    0.7025
Region       0.7549

All values are relatively high, so all attributes are important.

Page 9: Database Management Systems: Data Mining


Dimensionality

Notice the issue of dimensionality in the example. We had to set up groups within the attributes. If there are too many groupings/values:

- The system will take a long time to run.
- Many subgroups will have no observations.

How do you establish the groupings/values?

- Natural hierarchies (e.g., dates)
- Cluster analysis
- Prior knowledge
- Level of detail required for analysis

Page 10: Database Management Systems: Data Mining


Non-Linear Estimation

Regression:

- Polynomial: Y = b0 + b1X + b2X^2 + b3X^3 + b4X^4 + … + u
- Exponential: Y = b0 X^b1 e^u, which gives ln(Y) = ln(b0) + b1 ln(X) + u
- Log-Linear: ln(Y) = b0 + b1 ln(X) + u
- Other: log-log and more

Other Methods:

- Neural networks
- Search
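A brief sketch (not the slides' tooling) of the polynomial form: fitting Y = b0 + b1X + b2X^2 + b3X^3 by least squares with NumPy on noiseless synthetic data, so the known coefficients are recovered exactly.

```python
import numpy as np

# Synthetic data from a known cubic: y = 4 - 2x + 0.5x^2 + x^3 (no noise).
x = np.linspace(-5, 5, 50)
y = 4.0 - 2.0 * x + 0.5 * x**2 + 1.0 * x**3

# np.polyfit solves the least-squares problem for the polynomial coefficients,
# returned highest power first.
coeffs = np.polyfit(x, y, deg=3)
print(np.round(coeffs, 3))   # recovers 1.0, 0.5, -2.0, 4.0 (x^3 term first)
```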

Page 11: Database Management Systems: Data Mining


Example: PolyAnalyst: Find Law for MPG

mpg = (2.59183e+009*power*age + 176465*power*age*weight + 2.41554e+009*power*age*age - 3.54349e+009*power + 7.27281e+007*age*weight - 2.55635e+010) / (power*age*weight + 52028.3*power*age*age*weight)

Best exact rule found:

mpg = (4.71047e+008*power*age*weight - 38783.5*power*age*weight*weight + 2.5987e+009*power*age*age*weight - 7.65205e+009*power*weight + 1.5658e+008*age*weight*weight + 1.15859e+011*power*power - 3.0532e+013*age*age) / (power*age*weight*weight + 52028.3*power*age*age*weight*weight)

Page 12: Database Management Systems: Data Mining


MPG Versus Weight

Page 13: Database Management Systems: Data Mining


Problems with Non-Linear Models

- They can be harder to estimate.
- They are substantially more difficult to optimize.
- They are often unstable, particularly at the ends.

[Chart: Y plotted against X for X from -25 to 23; the vertical axis runs from -60,000 to 140,000.]

Y = 15000 - 850X - 435X^2 + 2X^3 + X^4

Note: (x + 7)(x – 5)(x + 20)(x – 20)
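Evaluating this quartic at a few points illustrates the instability at the ends of the plotted range: the curve stays moderate near the middle but swings to very large values at the edges.

```python
# The quartic from the chart above.
def f(x):
    return 15000 - 850 * x - 435 * x**2 + 2 * x**3 + x**4

print(f(0))     # 15000  -- near the middle of the range
print(f(-25))   # 123750 -- left edge: the x^4 term dominates and Y explodes
print(f(23))    # 69510  -- right edge: another large swing
```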