

Statistics 202: Data Mining
Week 1

© Jonathan Taylor
Based in part on slides from the textbook and slides of Susan Holmes

October 7, 2011


Part I

Introduction


Data Mining

What is data mining?

Non-trivial extraction of implicit, previously unknown and potentially useful information from data.

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.

A key feature of data mining is that the data sets are larger than those encountered in "classical" statistics. So large that it must be (semi-)automated.


Data Mining

Who uses data mining?

Industry:
1. Netflix
2. Amazon
3. Google (e.g., Google Trends)

Science:
1. Genomics
2. Climate science
3. Astrophysics
4. Neuroimaging


Netflix


Amazon

[Screenshot: Amazon.com product page for "Introduction to Data Mining" by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, showing personalized recommendations and a "Customers Who Bought This Item Also Bought" list (Witten; Hastie; Segaran; Han).]


Google Trends

[Screenshot: Google Trends results for the query "andrew luck", showing search volume over time, related news headlines, and the top regions, cities, and languages for the search.]


Genomics


Neuroimaging


Climate science


Data Mining

Some things that are not data mining

Looking up a record in a database by an identifier such as last name. (No pattern is revealed by this lookup ...)

Searching for "Amazon" on Google. (Google has done some data mining, but you have not ...)

Testing a two-sample hypothesis in a clinical trial. (The data set is typically neither large nor unstructured.)


Data Mining

Some things that are more like data mining

Noting that some last names occur in certain geographical areas.

Taking all query results from Google for "Amazon" and discovering that there are at least two groups: "Amazon river" and "Amazon.com".

When doing multiple tests across many different genes, identifying very strongly significant genes ...


Data Mining

Prediction / Supervised Problems

In such problems there is an outcome or label we want to predict based on many features.

Classification

Regression

Outlier detection


Data Mining

Descriptive / Unsupervised Problems

In such problems, we are seeking to discover hidden "structure" in the data, without an outcome or label.

Clustering

Dimension Reduction

Association Rules

Semisupervised problems

A mix of labelled and unlabelled data is used.


Part II

Examples


A canonical example

Voting records of the House of Representatives

On http://clerk.house.gov, one can view all roll call votes for many years.

We might hypothesize that there is some structure to these votes: the votes should cluster roughly by party.

How can we extract this information from http://clerk.house.gov?


Voting records

Scraping the data

Any practical problem like this requires taking the data off the web.

This can be a large part of the time spent in such a data mining task.
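
As a rough illustration of what this scraping step might look like in R, here is a minimal sketch. The URL pattern and XML node names are assumptions about how clerk.house.gov organizes its roll call files, not something taken from the slides.

    # Minimal scraping sketch (assumed URL pattern and XML layout;
    # the real clerk.house.gov structure may differ).
    library(xml2)

    get_votes <- function(year, rollnum) {
      url <- sprintf("http://clerk.house.gov/evs/%d/roll%03d.xml", year, rollnum)
      doc <- read_xml(url)
      data.frame(
        member = xml_text(xml_find_all(doc, "//recorded-vote/legislator")),
        vote   = xml_text(xml_find_all(doc, "//recorded-vote/vote")),
        stringsAsFactors = FALSE
      )
    }

    # e.g. the vote shown on the next slide: 2010, roll call #134
    # v134 <- get_votes(2010, 134)

Looping something like this over all the roll calls of 2010 and merging by member name would produce the raw vote table used below.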


Example: 2010, Vote # 134


Voting records

Scraping the data

There were 664 roll call votes in the House of Representatives during 2010.

At any given time, there can be up to 435 members in the House of Representatives.

In an idealized setting, this means there exists a matrix \(\mathbf{X}_{435 \times 664}\) of votes.

What are the entries of the matrix? They could be strings ("Aye"/"Nay") or numbers (e.g. "Aye" → 1, "Nay" → −1, "Not Voting" → 0).
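
A sketch of that numeric encoding in R, assuming votes is the raw 435 × 664 character matrix assembled from the scraped roll calls (the object name is hypothetical):

    # Recode the string votes into the {-1, 0, 1} scheme above.
    recode_vote <- function(v) {
      ifelse(v %in% c("Aye", "Yea"), 1,
      ifelse(v %in% c("Nay", "No"), -1, 0))   # "Present" / "Not Voting" -> 0
    }
    X <- matrix(recode_vote(votes), nrow = nrow(votes),
                dimnames = dimnames(votes))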


Voting records

In practice

Not all members voted on all bills, and some members resign mid-term (e.g. Anthony Weiner).

In the data set I scraped, 427 members had a recorded vote for all bills.


Voting records

Forming a “model”

Our hypothesis is that for each row \(\mathbf{X}_i\) of \(\mathbf{X}\) we might be able to decide the party label of the i-th representative.

This rule must be expressed as a function \(f(\mathbf{X}_i)\).

What rules to use?

A function (or rule, if you prefer) that is constant for Democrats and Republicans cannot be useful.

An example of such a bad rule is the number of recorded votes for each person in the dataset – this is just 664.


Voting records

Forming rules

Given our data matrix \(\mathbf{X}\), we can form many rules by taking linear combinations across the \(\{-1, 0, 1\}\) values.

Each \(\beta \in \mathbb{R}^{664}\) determines a linear rule

\[ f_\beta(\mathbf{x}) = \mathbf{x}^T \beta. \]

But, we already decided that constant rules are uninteresting. So, we want rules that vary a lot across the data.
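
In R, evaluating such a linear rule on every member is a single matrix–vector product. A small sketch, with β chosen arbitrarily for illustration (it is not a rule from the slides):

    # f_beta(x) = x^T beta, applied to every row of X at once.
    f_beta <- function(X, beta) drop(X %*% beta)

    beta <- rep(1 / sqrt(664), 664)   # an arbitrary unit-norm rule
    head(f_beta(X, beta))             # one score per member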


Voting records

Principal Component Analysis

Consider the following abstract problem:

\[ \text{maximize}_{\beta} \;\; \text{"Variability"}(f_\beta) \]

For our linear rules, if you scale \(\beta\), the variability scales. So we should constrain \(\|\beta\| = 1\).

With this constraint, and "Variability" being the sample variance, we can actually solve this problem.

The rule with the maximum sample variance is the leading eigenvector of

\[ \mathbf{X}^T H \mathbf{X} \]

where \(H\) is the mean-removing matrix (let's not worry about exactly what it is at the moment).
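
A sketch of this computation in R, continuing from the numeric matrix X above. The later slides refer to an object like pca.votes, and princomp is one standard way to produce it; the sampling of 20 columns mirrors the plots that follow, and the object names are assumptions.

    # PCA on a random sample of 20 vote columns; princomp centers the
    # columns, which plays the role of the mean-removing matrix H.
    vote.sample <- X[, sample(ncol(X), 20)]
    pca.votes   <- princomp(vote.sample)
    beta1  <- pca.votes$loadings[, 1]   # the most "variable" linear rule
    scores <- pca.votes$scores[, 1]     # f_beta1 evaluated at each member
    screeplot(pca.votes)                # cf. the screeplot slide below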


PCA on a sample of 20 votes


PCA on a sample of 20 votes, labelled


PCA on a sample of 100 votes, labelled


PCA on another sample of 20 votes, labelled


PCA on another sample of 20 votes, labelled


Screeplot


Voting records

Predicting party from a rule

Now we have this new rule \(f_{\hat\beta_1}\), where \(\hat\beta_1\) was the most "variable" among the linear rules.

It seems that if we just guess party based on the sign of this rule, we will do pretty well ...


Voting records

Classification

This is an example of a (linear) classifier, which is a rule \(c\) that takes a set of votes \(\mathbf{x}\) and assigns a label \(R\) or \(D\).

We can express this thresholding rule for a given \(\beta\) in terms of

\[ \mathrm{sign}_R(\beta) = \mathrm{sign}\Big( \mathrm{mean}_{i \in \mathrm{sample}(R)} f_\beta(\mathbf{X}_i) \Big). \]

The rule is now

\[ c_\beta(\mathbf{x}) = \begin{cases} R & \text{if } \mathrm{sign}(f_\beta(\mathbf{x})) = \mathrm{sign}_R(\beta) \\ D & \text{otherwise.} \end{cases} \]
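
A sketch of this thresholding classifier in R, using the first principal component scores from the sketch above and a hypothetical vector party of D/R labels for the 427 members:

    # Orient the rule: which sign goes with Republicans in the sample?
    sign_R <- sign(mean(scores[party == "R"]))

    # c_beta: label R when the score has the Republican sign, D otherwise.
    predicted <- ifelse(sign(scores) == sign_R, "R", "D")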


Voting records

Classification accuracy

Given a label \(\mathbf{y}\) we can ask how accurate the rule is:

\[ L(\mathbf{x}, \mathbf{y}, c) = \begin{cases} 1 & c(\mathbf{x}) \neq \mathbf{y} \\ 0 & \text{otherwise.} \end{cases} \]

Not surprisingly, when we apply this rule to our sample of 427 members, we do pretty well.

That is, if \(\mathbf{Y}\) is the vector of \(\{D, R\}\) labels, the following quantity is pretty small:

\[ \sum_{i=1}^{427} L(\mathbf{X}_i, \mathbf{Y}_i, c_{\hat\beta_1}). \]
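
Continuing the sketch, this misclassification count is one comparison in R:

    # Number (and fraction) of the 427 members the sign rule gets wrong.
    sum(predicted != party)
    mean(predicted != party)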


Voting records

Validation

However, we used the information on all 427 to find \(\hat\beta_1\). Could this be cheating?

We can validate this procedure of feature extraction followed by classification by finding \(\hat\beta_1\) based on only, say, 400 members.

Having found \(\hat\beta_1\) on the data from these 400 members, we can then predict the party of the remaining 27 members.

See http://stats202.stanford.edu/voting.html.
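
A sketch of this validation split in R, reusing the hypothetical objects vote.sample and party from the earlier sketches:

    # Hold out 27 members; fit PCA on the remaining 400, then predict their party.
    train <- sample(nrow(vote.sample), 400)
    pca.train   <- princomp(vote.sample[train, ])
    beta1.train <- pca.train$loadings[, 1]

    # Score the held-out members with the rule learned on the training set.
    test.scores <- scale(vote.sample[-train, ], center = pca.train$center,
                         scale = FALSE) %*% beta1.train
    sign_R    <- sign(mean(pca.train$scores[party[train] == "R", 1]))
    test.pred <- ifelse(sign(test.scores) == sign_R, "R", "D")
    mean(test.pred != party[-train])   # held-out misclassification rate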


Voting records

Summary

Preprocessing: Scraping the data off the web. This can be time consuming!

Feature extraction: We derived interesting linear rules by looking at the votes \(\mathbf{X}\) but not the labels \(\mathbf{Y}\). This is an unsupervised task.

Classification: Having found interesting linear rules, we used them to predict the labels \(\mathbf{Y}\). This is a supervised task.

Validation: In order to validate the feature extraction / classification steps together, we extracted features based on a random subset of 400 members and predicted the labels of the remaining 27. This is possible in supervised problems, but generally not in unsupervised problems.


Part III

Types of data


Data

What is data?

Data comes in many forms.

The simplest one to deal with is a "flat file", or data matrix / data frame:

\[ \mathbf{X}_{n \times p} \]

Usually, \(n\) is the number of "cases" and \(p\) is the number of "features".

Example: if we record height, weight, and GPA on 400 Stanford undergraduates, we would represent this as \(\mathbf{X}_{400 \times 3}\).
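
In R, such a flat file is naturally a data frame. A tiny made-up excerpt of the hypothetical 400 × 3 example:

    # Three of the 400 hypothetical cases; columns are the features.
    students <- data.frame(
      height = c(170, 182, 165),   # cm
      weight = c(65, 80, 58),      # kg
      gpa    = c(3.7, 3.2, 3.9)
    )
    dim(students)   # n cases by p features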


Data

Continuous variables

Our previous example had each feature being numeric. Not all data is numeric.

If we add major to our data set, then we have a categorical or discrete variable.

Many categorical variables are unordered, but some are ordered.

For example, if we followed our 400 undergrads 10 years out into their careers, we might make annual income brackets:

(50K, 100K], (100K, 150K], (150K, 200K], etc.
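
In R, an ordered categorical variable such as these income brackets can be stored as an ordered factor; a small sketch with made-up values:

    # Income brackets as an ordered factor: comparisons respect the ordering.
    income <- factor(c("(50K,100K]", "(150K,200K]", "(100K,150K]"),
                     levels  = c("(50K,100K]", "(100K,150K]", "(150K,200K]"),
                     ordered = TRUE)
    income[1] < income[2]   # TRUE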


Example from textbook


Data

Types of data / attributes

Nominal. Examples from text: ID numbers, eye color, zip codes.

Ordinal. Examples from text: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.

Interval. Examples from text: calendar dates, temperatures in Celsius or Fahrenheit.

Ratio. Examples from text: temperature in Kelvin, length, time, counts.


Data

Properties of data / attributes

Distinctness: =, ≠
Order: <, >
Addition: +, −
Multiplication: ∗, /


Data

Discrete vs. continuous

Discrete attribute: Has only a finite or countably infinite set of values. Examples from text: zip codes, counts, or the set of words in a collection of documents; binary data.

Continuous attribute: Real numbers for values. Examples from text: temperature, height, or weight. Floating point representation in the computer.


Data

2010 U.S. Congress example

Let’s look at the variables we used in our previous example.

party: discrete with two values {D, R}

vote1, vote2, ...: discrete with values {Yea, Nay, Aye, Present, Not Voting, No}

numeric vote1, numeric vote2, ...: discrete with values {−1, 0, 1}

pca.votes$scores: continuous

This transformation from the discrete variables vote1, vote2, ... to the continuous variables pca.votes$scores is a recurring one in applied statistics ... Let's take a look at R.
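
A compressed look at that transformation in R, reusing the hypothetical objects from the earlier sketches:

    # From discrete to continuous, step by step.
    table(votes[, 1])                 # discrete strings: "Aye", "Nay", ...
    table(X[, 1])                     # discrete numeric codes: -1, 0, 1
    summary(pca.votes$scores[, 1])    # continuous scores, one per member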
