unsupervised learning

Post on 16-Aug-2015


Unsupervised Learning: Factor & Cluster Analysis

D3M

Factor & Cluster Analysis

Learning Objectives

Unsupervised learning methods: principal component analysis (PCA), factor analysis, and clustering

The objective is dimension reduction:

o Reduce the number of collinear variables (PCA/factor analysis)

o Group your rows (e.g. customers, markets, counties): cluster analysis

Learning Resources

o MIT Open Courses, Lectures 11 & 14

o Data Mining class at U of Chicago (lecture notes 7 & 8)

o Class notes

Basic Idea

Data Exploration

A-theoretical but not mindless

Essentially looking for ‘similarities’

o Between variables (columns): Principal Component/Factor Analysis

o Between subjects (rows): Clustering Algorithm

Examples

Time series of Stock Prices

Items sold in supermarket

Attributes of Fortune 500 companies

Attributes of Brands (Perceptual or Real)

Customer Base of Amazon

Cluster webpages

Biological Attributes of Different Species

Attributes of State/County/Zip Codes

Google searches of keywords

Demographics/Shares of our Brand across stores


Example: Marketing Research

• PRIZM (“Potential Ratings Index for Zip Markets”) by Claritas Inc.

– “Birds of a feather flock together”

– 62 neighborhood (zip code) based groups that are similar on demographic and behavioral characteristics

– Used for store location decisions, direct marketing, media selection, etc.

• http://www.claritas.com/MyBestSegments/Default.jsp


Key Methods

• Two key research tools

o Cluster Analysis: tool for actually constructing segments

o Factor Analysis: tool for “data reduction”

Difference between cluster and factor analysis

[Figure: a data matrix with columns V1, V2, V3, V4, V5, …, V20. Cluster analysis groups the subjects (rows); factor analysis groups the variables (columns).]


Factor Analysis


Factor Analysis can be used for data reduction (i.e., reduce the number of variables needed for analysis).

Factor analysis is able to summarize the information contained in a larger number of variables into a smaller number of ‘factors’ without significant loss of information.

Main use of Factor Analysis

Example: Basis of Moral Foundations

5 underlying factors behind these questions:

• Harm/care

• Authority/respect

• Fairness/reciprocity

• Ingroup/loyalty

• Purity/sanctity

• Data reduction is important when you need to measure “fuzzy” concepts such as ‘love’, ‘trust’ or ‘satisfaction’

• Ask a series of questions that tap into the different components of the concept.

• Too many variables! Factor analysis can help to reduce this dimensionality problem

Factor Analysis


Intuition

• Factor analysis assumes that the correlation between a large number of variables is due to them all being dependent on the same small number of “factors”. Analyze the patterns of correlations to tap into the underlying construct.

• Example: Car ratings

o Attributes: perception of seats, perception of noise, perception of smoothness of ride, perception of AC-system

o Factor: perception of “quality”
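The car-ratings intuition can be sketched with simulated data. This is a minimal illustration, not the analysis from the slides: the sample size, the 0.8 weight on the latent “quality” score, and the noise level are all assumptions, and the first principal component of the correlation matrix is used as a simple stand-in for a one-factor model.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_component(corr, iterations=200):
    """Largest eigenvalue and eigenvector of a matrix via power iteration."""
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, v

random.seed(0)
n_respondents = 500  # assumed sample size

# Hypothetical latent "quality" perception, one value per respondent.
quality = [random.gauss(0, 1) for _ in range(n_respondents)]

# Each observed rating = 0.8 * quality + noise (assumed weights).
attributes = ["seats", "noise", "ride", "ac_system"]
ratings = {
    name: [0.8 * q + random.gauss(0, 0.6) for q in quality]
    for name in attributes
}

columns = [ratings[name] for name in attributes]
corr = [[pearson(a, b) for b in columns] for a in columns]

eigenvalue, weights = first_component(corr)
share = eigenvalue / len(attributes)  # share of total variance captured
loadings = [math.sqrt(eigenvalue) * w for w in weights]

for name, loading in zip(attributes, loadings):
    print(f"{name:10s} loading on first component: {loading:.2f}")
print(f"share of variance captured: {share:.2f}")
```

Because all four ratings depend on the same latent score, they are strongly correlated with each other, and a single component picks up most of their joint variance with a high loading on every attribute.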

MKTG450

Psychology: The “Big Five”

Trait: Characteristics (examples)

o Openness: Imaginative, Insightful

o Conscientiousness: Organized, Thorough

o Extraversion: Energetic, Assertive

o Agreeableness: Sympathetic, Kind, Affectionate

o Neuroticism: Tense, Moody, Anxious

Cluster Analysis


• Cluster analysis is a technique used to identify groups of ‘similar’ customers in a market (i.e., market segmentation).

Cluster analysis encompasses a number of different algorithms and methods for grouping objects of similar kind into categories.


Application Example: Market Segmentation

o Process of dividing a total market into groups of consumers who have similar needs and who respond similarly to marketing mix variables.


• General question: how to organize observed data into meaningful structures

• Examples:

o In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations.

o Biologists have to organize the different species of animals: man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals.

o In medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies.

o In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy.

o Collaborative filtering & Recommendation systems
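As one concrete instance of grouping similar rows, here is a minimal k-means sketch. The slides do not say which clustering algorithm is used, so k-means is an assumption, as are the two simulated customer groups, the two features, and the farthest-first initialization chosen to keep the run deterministic.

```python
import math
import random

def kmeans(points, k, iterations=100):
    """Plain k-means (Lloyd's algorithm) on a list of coordinate tuples."""
    # Farthest-first initialization: start from the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

random.seed(1)
# Hypothetical customers described by two standardized features,
# e.g. (spend, visit frequency); both groups are simulated.
light = [(random.gauss(1.0, 0.3), random.gauss(1.0, 0.3)) for _ in range(30)]
heavy = [(random.gauss(4.0, 0.3), random.gauss(4.0, 0.3)) for _ in range(30)]
customers = light + heavy

labels, centroids = kmeans(customers, k=2)
for j, c in enumerate(centroids):
    size = labels.count(j)
    print(f"segment {j}: {size} customers, center = ({c[0]:.2f}, {c[1]:.2f})")
```

In practice the features would usually be standardized first, since k-means relies on Euclidean distance and a variable measured on a large scale would otherwise dominate the grouping.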

Example 1: Segmenting Stores in Soup Case Study


Demographics Are Highly Correlated

Cluster of Variables (ClustOfVar package in R)

Interpret the Factors

These are called factor “loadings”. Each loading measures the correlation between a demographic variable and the underlying “factor”. It is our job to interpret the factors and put a label on each.

Information Captured

                 Factor1   Factor2   Factor3
SS loadings        3.143     2.961     1.671
Proportion Var     0.314     0.296     0.167
Cumulative Var     0.314     0.610     0.777

Using 3 “factors” instead of 10 demographics, we capture approx. 78% of the variance in the data.
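The “Proportion Var” and “Cumulative Var” rows follow directly from the SS loadings. The short check below assumes the 10 demographic variables are standardized, so the total variance equals the number of variables; the SS loadings are the values reported above.

```python
# SS loadings as reported in the factor analysis output above.
ss_loadings = {"Factor1": 3.143, "Factor2": 2.961, "Factor3": 1.671}
n_variables = 10  # number of demographics; assumed standardized (variance 1 each)

cumulative = 0.0
table = {}
for name, ss in ss_loadings.items():
    proportion = ss / n_variables  # share of total variance for this factor
    cumulative += proportion       # running total across factors
    table[name] = (proportion, cumulative)
    print(f"{name}: proportion = {proportion:.4f}, cumulative = {cumulative:.4f}")
```

The three factors together account for a little under 78% of the total variance, which is the figure quoted in the summary.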

Example 2: Segmenting US Counties


Files Used: US_Counties.csv, Segment_US_County.R

• Suppose we are analyzing data based on US counties:

o Demographic variables

o Health outcomes

o Crime rates

o Voting behavior

o Religion

o Market shares of brands

o Google searches

Hard to even see, let alone understand

Basically, a bunch of variables are highly correlated

Cluster of Variables (ClustOfVar package in R)
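The idea of clustering variables rather than rows can be sketched as follows. This is not the ClustOfVar algorithm; it is a simpler stand-in that applies average-linkage hierarchical clustering to the distance 1 - |correlation|. The county data are simulated, and the variable names and the two latent drivers are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cluster_variables(columns, n_clusters):
    """Average-linkage agglomerative clustering of variables.

    Distance between two variables is 1 - |correlation|, so strongly
    correlated variables (positively or negatively) are merged first.
    """
    p = len(columns)
    dist = [
        [1 - abs(pearson(columns[i], columns[j])) for j in range(p)]
        for i in range(p)
    ]
    clusters = [[i] for i in range(p)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]

random.seed(2)
n_counties = 300  # assumed number of simulated counties
wealth = [random.gauss(0, 1) for _ in range(n_counties)]    # hypothetical driver
age_mix = [random.gauss(0, 1) for _ in range(n_counties)]   # hypothetical driver

def noisy(driver, sign=1.0):
    """Variable that follows one latent driver plus independent noise."""
    return [sign * 0.9 * d + random.gauss(0, 0.45) for d in driver]

names = ["income", "education", "home_value",
         "median_age", "retired_share", "household_size"]
columns = [noisy(wealth), noisy(wealth), noisy(wealth),
           noisy(age_mix), noisy(age_mix), noisy(age_mix, sign=-1.0)]

groups = cluster_variables(columns, n_clusters=2)
for g in groups:
    print([names[i] for i in g])
```

Each resulting group of variables can then be summarized by a single score, which is the same data-reduction goal that the factor analysis in Example 1 pursues.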
