Data Mining Algorithms for Recommendation Systems
Zhenglu Yang, University of Tokyo

TRANSCRIPT

  • Data Mining Algorithms for Recommendation Systems. Zhenglu Yang, University of Tokyo

  • Sample Applications

  • Sample Applications: Corporate Intranets

  • System Inputs
    Interaction data (users × items)
      Explicit feedback: ratings, comments
      Implicit feedback: purchases, browsing
    User/item individual data
      User side: structural attribute information, personal description, social network
      Item side: structural attribute information, textual description/content information, taxonomy of item (category)

  • Interaction between Users and Items: observed preferences (purchases, ratings, page views, bookmarks, etc.)

  • Profiles of Users and Items
    User profile: (1) attributes (nationality, sex, age, hobby, etc.); (2) text (personal description); (3) links (social network)
    Item profile: (1) attributes (price, weight, color, brand, etc.); (2) text (product description); (3) links (taxonomy of item / category)

  • All Information about Users and Items
    Observed preferences (purchases, ratings, page views, bookmarks, etc.)
    User profile: attributes, text (personal description), links (social network)
    Item profile: attributes, text (product description), links (taxonomy of item / category)

  • KDD and Data Mining: data mining is a multi-disciplinary field, drawing on artificial intelligence, statistics, machine learning, KDD, databases, and natural language processing.

  • Recommendation Approaches
    Collaborative filtering: uses interaction data (the user-item matrix); process: identify similar users and extrapolate from their ratings
    Content-based strategies: use profiles of users/items (features); process: generate rules/classifiers that are used to classify new items
    Hybrid approaches

  • A Brief Introduction: collaborative filtering, which is either nearest-neighbor based or model based

  • Recommendation Approaches
    Collaborative filtering
      Nearest-neighbor based: user based, item based
      Model based

  • User-based Collaborative Filtering. Idea: people who agreed in the past are likely to agree again. To predict a user's opinion of an item, use the opinions of similar users; similarity between users is determined by their overlap in opinions on other items.

  • User-based CF (Ratings). Each row holds one user's ratings of the items; high ratings (9-10) are good, low ratings (1-2) are bad.

             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 1     8       1       7       2       9       8
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
    User 4     2       1       1       2       3       1
    User 5     3       1       2       3       2       2
    User 6     1       2       2       1       1       1

  • Similarity between Users
    Only consider items both users have rated.
    Common similarity measures: cosine similarity and Pearson correlation (a sketch of both follows).

             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
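
    A minimal Python sketch of both measures, restricted to the items that User 2 and User 3 have both rated (the data come from the table above; the function names and structure are illustrative):

      import math

      # Ratings of User 2 and User 3 on Items 1-6 (None = unrated), from the slide.
      user2 = [9, 8, 7, None, 1, 2]
      user3 = [8, 9, 8, 9, 3, 1]

      def co_rated(u, v):
          """Pairs of ratings for items both users have rated."""
          return [(a, b) for a, b in zip(u, v) if a is not None and b is not None]

      def cosine_similarity(u, v):
          pairs = co_rated(u, v)
          dot = sum(a * b for a, b in pairs)
          norm_u = math.sqrt(sum(a * a for a, _ in pairs))
          norm_v = math.sqrt(sum(b * b for _, b in pairs))
          return dot / (norm_u * norm_v)

      def pearson_correlation(u, v):
          pairs = co_rated(u, v)
          mean_u = sum(a for a, _ in pairs) / len(pairs)
          mean_v = sum(b for _, b in pairs) / len(pairs)
          num = sum((a - mean_u) * (b - mean_v) for a, b in pairs)
          den = (math.sqrt(sum((a - mean_u) ** 2 for a, _ in pairs)) *
                 math.sqrt(sum((b - mean_v) ** 2 for _, b in pairs)))
          return num / den

      print(cosine_similarity(user2, user3))
      print(pearson_correlation(user2, user3))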

  • Recommendation Approaches
    Collaborative filtering
      Nearest-neighbor based: user based, item based
      Model based
    Content-based strategies
    Hybrid approaches

  • Item-based Collaborative Filtering. Idea: a user is likely to have the same opinion of similar items; similarity between items is determined by how other users have rated them.

  • Example: Item-based CF

             Item 1  Item 2  Item 3  Item 4  Item 5
    User 1     8       1       ?       2       7
    User 2     2       2       5       7       5
    User 3     5       4       7       4       7
    User 4     7       1       7       3       8
    User 5     1       7       4       6       5
    User 6     8       3       8       3       7

  • Similarity between Items
    Only consider users who have rated both items.
    Common similarity measures: cosine similarity and Pearson correlation.

             Item 3  Item 4
    User 1     ?       2
    User 2     5       7
    User 3     7       4
    User 4     7       3
    User 5     4       6
    User 6     8       3

  • Recommendation Approaches
    Collaborative filtering
      Nearest-neighbor based
      Model based: matrix factorization (e.g., SVD)
    Content-based strategies
    Hybrid approaches

  • Singular Value Decomposition (SVD). A general mathematical method applied to many problems. Given any m×n matrix R, find matrices U, Σ, and V such that R = UΣV^T, where U is m×r and orthonormal, Σ is r×r and diagonal, and V is n×r and orthonormal. Removing all but the k largest singular values gives the best rank-k approximation R_k, with k < r.
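
    A minimal numpy sketch of the truncation step just described, applied to a small illustrative rating matrix (the matrix values and variable names are assumptions, not from the slides):

      import numpy as np

      # Small illustrative user-item rating matrix (rows = users, columns = items).
      R = np.array([[8, 1, 7, 2, 9, 8],
                    [9, 8, 7, 5, 1, 2],
                    [8, 9, 8, 9, 3, 1],
                    [2, 1, 1, 2, 3, 1]], dtype=float)

      U, s, Vt = np.linalg.svd(R, full_matrices=False)   # R = U @ diag(s) @ Vt

      k = 2                                              # keep the k largest singular values
      R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of R

      print(np.round(R_k, 2))                            # smoothed matrix used for prediction
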
  • Problems with Collaborative Filtering
    Cold start: there need to be enough other users already in the system to find a match.
    Sparsity: if there are many items to be recommended, the user/ratings matrix is sparse even with many users, and it is hard to find users that have rated the same items.
    First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).
    Popularity bias: cannot recommend items to someone with unique tastes; tends to recommend popular items.

  • Recommendation Approaches: collaborative filtering, content-based strategies, hybrid approaches

  • Profiles of Users and Items
    User profile: (1) attributes (nationality, sex, age, hobby, etc.); (2) text (personal description); (3) links (social network)
    Item profile: (1) attributes (price, weight, color, brand, etc.); (2) text (product description); (3) links (taxonomy of item / category)

  • Advantages of the Content-Based Approach
    No need for data on other users: no cold-start or sparsity problems.
    Able to recommend to users with unique tastes.
    Able to recommend new and unpopular items: no first-rater problem.
    Can provide explanations of recommended items by listing the content features that caused an item to be recommended.

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • Traditional Data Mining Techniques: association rule mining and sequential pattern mining

  • Example: Market Basket Data. Items frequently purchased together (e.g., beer and diapers). Uses: recommendation, placement, sales, coupons. Objective: increase sales and reduce costs.

  • Association Rule Mining. Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. Example association rules over the market-basket transactions below: {Beer} -> {Diaper}, {Coke} -> {Eggs}. Implication means co-occurrence, not causality!

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Some Definitions. An itemset is supported by a transaction if it is included in that transaction. For example, the itemset {Beer, Diaper} is supported by transactions 1 and 3 below, so its support is 2/4 = 50%.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Some Definitions. If the support of an itemset exceeds a user-specified min_support (threshold), the itemset is called a frequent itemset (pattern). With min_support = 50%, for example, {Beer, Diaper} (support 50%) is a frequent itemset, whereas an itemset such as {Beer, Milk} (support 25%) is not.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining

  • Apriori Algorithm
    Proposed by Agrawal et al. [VLDB94]
    The first algorithm for association rule mining
    Candidate generation-and-test
    Introduced the anti-monotone property

  • Apriori Algorithm. Market-basket transactions:

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • A Naive Algorithm (min_sup = 2): enumerate every candidate itemset and count its support over the transactions.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Apriori Algorithm (min_sup = 2). Anti-monotone property: if an itemset is not frequent, then none of its supersets is frequent.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Apriori Algorithm (min_sup = 2): the 1st scan produces C1 and L1, the 2nd scan C2 and L2, the 3rd scan C3 and L3.

    Transaction DB:
    Tid | Items
    10  | A, C, D
    20  | B, C, E
    30  | A, B, C, E
    40  | B, E

    C1 (1st scan):   {A}:2  {B}:3  {C}:3  {D}:1  {E}:3
    L1:              {A}:2  {B}:3  {C}:3  {E}:3
    C2 (candidates): {A,B}  {A,C}  {A,E}  {B,C}  {B,E}  {C,E}
    C2 (2nd scan):   {A,B}:1  {A,C}:2  {A,E}:1  {B,C}:2  {B,E}:3  {C,E}:2
    L2:              {A,C}:2  {B,C}:2  {B,E}:3  {C,E}:2
    C3 (candidate):  {B,C,E}
    L3 (3rd scan):   {B,C,E}:2
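
    A compact Python sketch of the candidate generation-and-test loop on the transaction DB above with min_sup = 2 (function and variable names are illustrative):

      from itertools import combinations

      transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
      min_sup = 2

      def support(itemset):
          return sum(itemset <= t for t in transactions)

      # L1: frequent 1-itemsets
      items = {i for t in transactions for i in t}
      frequent = [{frozenset([i]) for i in items if support({i}) >= min_sup}]

      k = 1
      while frequent[-1]:
          # Generate candidate (k+1)-itemsets by joining frequent k-itemsets, then
          # prune any candidate with an infrequent k-subset (anti-monotone property).
          prev = frequent[-1]
          candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
          candidates = {c for c in candidates
                        if all(frozenset(s) in prev for s in combinations(c, k))}
          # One database scan per level to count candidate supports.
          frequent.append({c for c in candidates if support(c) >= min_sup})
          k += 1

      for level in frequent:
          for itemset in sorted(level, key=sorted):
              print(set(itemset), support(itemset))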

  • Drawbacks of Apriori
    Multiple scans of the transaction database; multiple database scans are costly.
    Huge number of candidates: to find the frequent itemset i1 i2 ... i100, the number of scans is 100 and the number of candidates is 2^100 - 1 ≈ 1.27 * 10^30.

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining

  • FP-Growth
    Proposed by Han et al. [SIGMOD00]
    Uses the Apriori pruning principle
    Scans the DB only twice: once to find the frequent 1-itemsets (single-item patterns), and once to construct the FP-tree (a prefix tree / trie), the data structure of FP-growth

  • FP-Growth: tree construction

    TID | Items bought             | (ordered) frequent items
    100 | f, a, c, d, g, i, m, p   | f, c, a, m, p
    200 | a, b, c, f, l, m, o      | f, c, a, b, m
    300 | b, f, h, j, o, w         | f, b
    400 | b, c, k, s, p            | c, b, p
    500 | a, f, c, e, l, p, m, n   | f, c, a, m, p

    Header table (item: support): f:4, c:4, a:3, b:3, m:3, p:3

    Inserting the first transaction yields the single path {} - f:1 - c:1 - a:1 - m:1 - p:1; after all five transactions the FP-tree contains the paths f:4 - c:3 - a:3 - m:2 - p:2, f:4 - c:3 - a:3 - b:1 - m:1, f:4 - b:1, and c:1 - b:1 - p:1.

  • FP-Growth: ordered frequent items per transaction: 100 {f, c, a, m, p}; 200 {f, c, a, b, m}; 300 {f, b}; 400 {c, b, p}; 500 {f, c, a, m, p}

  • FP-Growth: conditional pattern bases

    Item | Conditional pattern base | Frequent itemsets
    p    | fcam:2, cb:1             | fp, cp, ap, mp, fcp, fap, fmp, cap, cmp, amp, facp, fcmp, famp, fcamp
    m    | fca:2, fcab:1            | fm, cm, am, fcm, fam, cam, fcam
    b    | fca:1, f:1, c:1          |
    a    | fc:3                     |
    c    | f:3                      |

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • Applications. Customer shopping sequences (e.g., notebook computer, then memory, then CD-ROM within 3 days); query/re-query patterns (e.g., Q1 P1 Q2 P2, 80%).

  • Some Definitions. If the support of a sequence exceeds a user-specified min_support, the sequence is called a sequential pattern. In the example, min_support = 50% over three customer sequences (IDs 10, 20, 30); one example sequence is a sequential pattern and another is not.

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • GSP (Generalized Sequential Pattern mining)
    Proposed by Srikant et al. [EDBT96]
    Uses the Apriori pruning principle
    Outline of the method:
      Initially, every item in the DB is a candidate of length 1
      For each level (i.e., sequences of length k): scan the database to collect the support count for each candidate sequence, then generate candidate length-(k+1) sequences from the length-k frequent sequences using Apriori
      Repeat until no frequent sequence or no candidate can be found

  • Finding Length-1 Sequential Patterns. Initial candidates: the eight single items. Scan the database once and count support for each candidate (supports in the example: 3, 5, 4, 3, 3, 2, 1, 1).

  • Generating Length-2 Candidates: 51 length-2 candidates. Without the Apriori property there would be 8*8 + 8*7/2 = 92 candidates; Apriori prunes 44.57% of them.

  • The GSP Mining Process (min_sup = 2)

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • SPADE Algorithm
    Proposed by Zaki et al. [MLJ01]
    Vertical ID-list data representation
    Support counting via temporal joins
    Candidates to be tested are kept in local candidate lists

  • SPADE Algorithm: ID-list database. Each item (a, b, c, d) is stored as a vertical ID-list of (SID, position) pairs taken from the data sequences below.

    ID | Data Sequence
    10 | < b d b c b d a b a d >
    20 | < d c a a b c b d a b >
    30 | < c a d a d c a d c a >

  • Temporal Joins (S-step), min_support = 2. Joining the ID-lists of {a} and {b} yields the ID-lists of the 2-sequences <a b> and <b a>; in the example, Sup(ab) = 2 and Sup(ba) = 2.

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • SPAM Algorithm
    Proposed by Ayres et al. [KDD02]
    Key idea based on SPADE
    Uses a bitmap data representation
    Much faster than SPADE, but more space-consuming

  • SPAM Algorithm: each item (a, b, c, d) is represented as a bitmap over the (SID, position) slots of the data sequences below, with a bit set wherever the item occurs.

    ID | Data Sequence
    10 | < a c (b c) d (a b c) a d >
    20 | < b (c d) a c (b d) >
    30 | < d (b c) (a c) (c d) >

  • SPAM Temporal Joins (min_support = 2). Sequence-extension step (S-step): to extend {a} by {b}, transform the bitmap of {a} so that only the positions strictly after its first occurrence in each sequence remain set, then AND the result with the bitmap of {b}; this yields the bitmap of <a b>, with Sup(ab) = 2 in the example. A sketch follows.
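
    A minimal Python sketch of the S-step join described above, run on the three data sequences from the SPAM example (the helper names are illustrative):

      # Data sequences from the SPAM slide, one list of itemsets per sequence.
      sequences = [
          [{'a'}, {'c'}, {'b', 'c'}, {'d'}, {'a', 'b', 'c'}, {'a'}, {'d'}],   # SID 10
          [{'b'}, {'c', 'd'}, {'a'}, {'c'}, {'b', 'd'}],                      # SID 20
          [{'d'}, {'b', 'c'}, {'a', 'c'}, {'c', 'd'}],                        # SID 30
      ]

      def bitmap(item):
          """Positional bitmap of `item`: one list of booleans per sequence."""
          return [[item in itemset for itemset in seq] for seq in sequences]

      def s_step(prefix_bits, item_bits):
          """Sequence-extension: keep the positions of the item that lie strictly
          after the first position of the prefix, sequence by sequence."""
          result = []
          for p_bits, i_bits in zip(prefix_bits, item_bits):
              if True in p_bits:
                  first = p_bits.index(True)
                  result.append([bit and pos > first for pos, bit in enumerate(i_bits)])
              else:
                  result.append([False] * len(i_bits))
          return result

      def support(bits):
          return sum(any(seq_bits) for seq_bits in bits)

      ab = s_step(bitmap('a'), bitmap('b'))
      print(support(ab))   # number of sequences containing <a b>; 2 here, so it survives min_support = 2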

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • PrefixSpan (projection-based approach)
    Proposed by Pei et al. [ICDE01]
    Based on pattern growth
    Prefix-postfix (projection) representation
    Basic idea: use frequent items to recursively project the sequence database into smaller projected databases and grow patterns in each projected database.

  • Example (min_sup = 2). The first scan of the sequence database (SIDs 10, 20, 30) counts the support of each item (in the example, Sup(a) = 1, Sup(b) = 2, Sup(c) = 3, Sup(d) = 3) to find the length-1 frequent patterns. The database is then projected on each frequent prefix (prefix-projected databases), and each projected database is mined recursively for longer prefixes.

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • All Information about Users and Items
    Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc.)
    User profile: (1) attributes (nationality, sex, age, hobby, etc.); (2) text (personal description); (3) links (social network)
    Item profile: (1) attributes (price, weight, color, brand, etc.); (2) text (product description); (3) links (associated relations between items, e.g., co-purchased by the same user)

  • All Information about Users and Items: example text fields. User text: "I like car, movie, music". Item text: "This car is nicely equipped with auto air conditioning".

  • Profile Representation: Vector Space Model
    User profile: structured data (attributes: book, car, TV) and free text ("I like car, movie, music")
    Item profile: structured data (attributes: name, color, price) and free text ("This car is nicely equipped with auto air conditioning")
    Both are represented as vectors over a common vocabulary (car, book, TV, bike, ...); User A and Item B are then compared with cosine similarity, cos(A, B), or a weighted cosine similarity.

  • TF*IDF Weighting
    Term frequency tf_{t,d} of a term t in a document d: based on n_{t,d}, the number of times t appears in d (often normalized by document length).
    Inverse document frequency idf_t of a term t: idf_t = log(N / df_t), where df_t is the number of documents that contain t and N is the total number of documents.
    The TF*IDF weight of t in d is tf_{t,d} * idf_t. A sketch follows.
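
    A minimal Python sketch of these formulas on a toy two-document collection built from the example sentences (documents and names are illustrative):

      import math
      from collections import Counter

      docs = ["i like car movie music",
              "this car is nicely equipped with auto air conditioning"]
      tokenized = [d.split() for d in docs]
      N = len(tokenized)

      def tf(t, doc_tokens):
          # Raw count of t in the document, normalized by document length.
          return Counter(doc_tokens)[t] / len(doc_tokens)

      def idf(t):
          df = sum(t in doc for doc in tokenized)   # documents containing t
          return math.log(N / df) if df else 0.0

      def tfidf(t, doc_tokens):
          return tf(t, doc_tokens) * idf(t)

      for term in ("car", "music"):
          print(term, [round(tfidf(term, d), 3) for d in tokenized])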

  • Profile Representation: unstructured data, e.g., a text description or review of a restaurant, or news articles. There are no attribute names with well-defined values, and natural language is complex: the same word can have different meanings, and different words can have the same meaning. Structure must be imposed on free text before it can be used in a recommendation algorithm.

  • All Information about Users and Items (text example, continued)
    Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc.); user and item profiles as before.
    User text: "I like automobile, movie, music". Item text: "This car is nicely equipped with auto air conditioning".

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • Knowledge-based Similarity
    Knowledge data: WordNet
    Intuition: two words are similar if they are close to each other in the WordNet hierarchy
    Measure approaches:
      Shortest-path based [Rada, SMC89] [Wu, ACL94] [Leacock98]
      Information-content based [Resnik, IJCAI95] [Jiang, ROCLING97] [Lin, ICML98]

  • Knowledge-based word semantic similarity: the path-based measures of Leacock & Chodorow (1998) and Wu & Palmer (1994), and the gloss-overlap measure of Lesk (1986), which finds the overlap between the dictionary entries of two words.
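
    The two path-based measures above are commonly stated as follows, with len the shortest path length between the two concepts, D the maximum depth of the taxonomy, and LCS the least common subsumer (standard forms, not copied from the slide):

      \mathrm{sim}_{LCh}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D}

      \mathrm{sim}_{WP}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}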

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • Explicit Semantic Analysis (ESA)
    Proposed by Gabrilovich et al. [IJCAI07]
    Maps text to a vector of concepts in Wikipedia
    Calculates the ESA score with a common vector-based measure (i.e., cosine)

  • ESA Process (this figure is from Gabrilovich, IJCAI07)

  • ESA Example
    Text 1: "The dog caught the red ball."  Text 2: "A labrador played in the park."
    Each text is mapped to a weighted vector of Wikipedia concepts (e.g., "Glossary of cue sports terms", "American Football Strategy", "Baseball", "Boston Red Sox"); the cosine of the two vectors gives a similarity score of 14.38%. (This slide is from Rada Mihalcea.)

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • Corpus-based Similarity
    Corpus data: the Web (via a search engine)
    Intuition: two words are similar if they frequently occur on the same page
    PMI-IR [Turney, ECML01]

  • PMI-IR
    Pointwise Mutual Information (Church and Hanks, 1989)
    PMI-IR (Turney, 2001), estimated from search-engine counts, where N is the number of Web pages. The usual formulations follow.
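
    These measures are usually written as follows, with hits(.) denoting search-engine hit counts and N the number of Web pages (standard forms, not copied from the slide):

      \mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}

      \mathrm{PMI\text{-}IR}(w_1, w_2) = \log_2 \frac{N \cdot \mathrm{hits}(w_1 \wedge w_2)}{\mathrm{hits}(w_1)\, \mathrm{hits}(w_2)}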

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • All Information about Users and Items
    Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc.)
    User profile: attributes, text (personal description), links (social network)
    Item profile: attributes, text (product description), links (associated relations between items)
    These data can be grouped by clustering.

  • Clustering: K-means and hierarchical clustering

  • K-means
    Introduced by MacQueen (1967)
    Works when we know k, the number of clusters we want to find
    Idea:
      Randomly pick k points as the centroids of the k clusters
      Loop: put each point in the cluster whose centroid it is closest to, then recompute the cluster centroids
      Repeat the loop until there is no change in the clusters between two consecutive iterations
    This is an iterative improvement of the objective function: the sum of squared distances from each point to the centroid of its cluster. A sketch follows the example below.

  • K-means Example (k = 2): points are reassigned to the nearest centroid and the centroids recomputed until the assignment no longer changes (converged).
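
    A minimal Python sketch of the loop above on a few illustrative 2-D points (the data, seed, and function names are assumptions):

      import random

      def dist2(p, q):
          return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

      def kmeans(points, k, iterations=100, seed=0):
          """Plain k-means on 2-D points; returns (centroids, assignment)."""
          random.seed(seed)
          centroids = random.sample(points, k)          # pick k points as initial centroids
          assignment = [None] * len(points)
          for _ in range(iterations):
              # Assignment step: each point joins the cluster with the closest centroid.
              new_assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                                for p in points]
              if new_assignment == assignment:          # converged: no change in clusters
                  break
              assignment = new_assignment
              # Update step: recompute each centroid as the mean of its points.
              for c in range(k):
                  members = [p for p, a in zip(points, assignment) if a == c]
                  if members:
                      centroids[c] = (sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members))
          return centroids, assignment

      points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
      print(kmeans(points, k=2))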

  • Clustering: K-means and hierarchical clustering

  • Hierarchical Clustering
    Two types: agglomerative (bottom up) and divisive (top down)
    Agglomerative: two groups are merged if the distance between them is less than a threshold
    Divisive: one group is split into two if the inter-group distance is more than a threshold
    The result can be shown in a graphical representation called a dendrogram

  • Hierarchical Clustering: dendrogram (figure)

  • Hierarchical Agglomerative Clustering
    Put every point in a cluster by itself.
    For i = 1 to N-1:
      let C1 and C2 be the most mergeable (closest) pair of clusters;
      create C1,2 as the parent of C1 and C2.
    Example (1-dimensional objects for simplicity): 1, 2, 5, 6, 7.
    Agglomerative clustering repeatedly finds and merges the two closest clusters: first {1, 2}, leaving cluster centers {1.5, 5, 6, 7}; then {5, 6}, leaving {1.5, 5.5, 7}; then {{5, 6}, 7}; merging continues until a single cluster remains. A sketch follows.
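
    A minimal Python sketch of the agglomerative merging above on the 1-D example, using centroid distance between clusters (names are illustrative):

      def centroid(cluster):
          return sum(cluster) / len(cluster)

      def agglomerative(points):
          """Greedy agglomerative clustering; prints each merge until one cluster remains."""
          clusters = [[p] for p in points]              # start: every point is its own cluster
          while len(clusters) > 1:
              # Find the two clusters whose centroids are closest.
              best = None
              for i in range(len(clusters)):
                  for j in range(i + 1, len(clusters)):
                      d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                      if best is None or d < best[0]:
                          best = (d, i, j)
              _, i, j = best
              merged = clusters[i] + clusters[j]
              print("merge", clusters[i], "+", clusters[j], "->", merged)
              clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
          return clusters[0]

      agglomerative([1, 2, 5, 6, 7])   # merges {1,2}, then {5,6}, then {{5,6},7}, then all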

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • Illustrating the Classification Task: a learning algorithm induces a model from the training set; the model is then applied (deduction) to the test set.

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • k-Nearest Neighbor Classification (kNN)
    kNN does not build a model from the training data.
    Approach: to classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d; count the number n of training instances in P that belong to class cj; estimate Pr(cj|d) as n/k (majority vote).
    No training is needed; classification time is linear in the training-set size for each test case.
    k is usually chosen empirically via a validation set or cross-validation, trying a range of k values.
    The distance function is crucial but depends on the application. A sketch follows.
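
    A minimal Python sketch of the majority-vote rule above; the training points and the class names Car, Book, and Clothes (echoing the example slides) are illustrative:

      import math
      from collections import Counter

      # Illustrative labeled training instances: (feature vector, class label).
      training = [((1.0, 1.0), 'Book'), ((1.2, 0.8), 'Book'),
                  ((5.0, 5.0), 'Car'),  ((5.5, 4.5), 'Car'), ((4.8, 5.2), 'Car'),
                  ((9.0, 1.0), 'Clothes')]

      def knn_classify(x, k):
          """Majority vote among the k training instances closest to x."""
          neighbors = sorted(training, key=lambda item: math.dist(x, item[0]))[:k]
          votes = Counter(label for _, label in neighbors)
          return votes.most_common(1)[0][0]

      print(knn_classify((1.1, 0.9), k=1))   # 1NN
      print(knn_classify((4.0, 4.0), k=3))   # 3NN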

  • Example: k = 1 (1NN). Classes: Car, Book, Clothes. The query point (*) is assigned the class of its single nearest neighbor; in the figure, Book.

  • Example: k = 3 (3NN). The query point (*) is assigned the majority class among its three nearest neighbors; in the figure, Car.

  • Discussion
    Advantages: nonparametric architecture, simple, powerful, requires no training time
    Disadvantages: memory intensive, classification/estimation is slow, sensitive to k

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • Example of a Decision Tree. Training data (below); the task is to judge the cheating possibility, Yes/No. Model: a decision tree whose splitting attributes are Refund (Yes -> NO) and then MarSt (Married -> NO; Single or Divorced -> split on TaxInc: < 80K -> NO, > 80K -> YES).

  • Another Example of a Decision Tree. Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat is the class. The same data also fit a tree that splits on MarSt first (Married -> NO; Single or Divorced -> split on Refund and then TaxInc < 80K / > 80K). There can be more than one tree that fits the same data!

    Tid | Refund | Marital Status | Taxable Income | Cheat
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes

  • Decision Tree Construction
    Creating decision trees: manual (based on expert knowledge) or automated (based on training data, i.e., data mining)
    Two main issues:
      Issue #1: which attribute to take for a split?
      Issue #2: when to stop splitting?

  • Classification: k-Nearest Neighbor (kNN), decision tree (CART, C4.5), naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • The CART Algorithm (Classification And Regression Trees)
    Developed by Breiman et al. in the early 1980s
    Introduced tree-based modeling into the statistical mainstream
    A rigorous approach involving cross-validation to select the optimal tree

  • Key Idea: Recursive Partitioning
    Take all of your data.
    Consider all possible values of all variables.
    Select the variable/value (X = t1) that produces the greatest separation in the target; (X = t1) is called a split.
    If X < t1, send the data point to the left; otherwise, send it to the right.
    Now repeat the same process on these two nodes: you get a tree.
    Note: CART only uses binary splits.

  • Key Idea. Let Φ(s|t) be a measure of the goodness of a candidate split s at node t:

    Φ(s|t) = 2 P_L P_R Σ_j |P(j|t_L) - P(j|t_R)|

    where P_L and P_R are the proportions of records sent to the left and right child nodes and P(j|t_L), P(j|t_R) are the class-j proportions in those children. The optimal split maximizes Φ(s|t) over all possible splits at node t.

  • Key Idea. Φ(s|t) is large when both of its main components are large:
    1. 2 P_L P_R: maximal when the child nodes are of equal size (same support), e.g., 0.5 * 0.5 = 0.25 versus 0.9 * 0.1 = 0.09.
    2. Q(s|t) = Σ_j |P(j|t_L) - P(j|t_R)|: maximal when, for each class, the child nodes are completely uniform (pure); the theoretical maximum of Q(s|t) is k, the number of classes of the target variable.

  • CART Example: training set of records for classifying credit risk

    Customer | Savings | Assets | Income ($1000s) | Credit Risk
    1        | Medium  | High   | 75              | Good
    2        | Low     | Low    | 50              | Bad
    3        | High    | Medium | 25              | Bad
    4        | Medium  | Medium | 50              | Good
    5        | Low     | Medium | 100             | Good
    6        | High    | High   | 25              | Good
    7        | Low     | Low    | 25              | Bad
    8        | Medium  | Medium | 75              | Good

  • CART Example: candidate splits for t = root node (CART is restricted to binary splits)

    Candidate Split | Left Child Node t_L  | Right Child Node t_R
    1               | Savings = low        | Savings in {medium, high}
    2               | Savings = medium     | Savings in {low, high}
    3               | Savings = high       | Savings in {low, medium}
    4               | Assets = low         | Assets in {medium, high}
    5               | Assets = medium      | Assets in {low, high}
    6               | Assets = high        | Assets in {low, medium}
    7               | Income <= $25,000    | Income > $25,000
    8               | Income <= $50,000    | Income > $50,000
    9               | Income <= $75,000    | Income > $75,000

  • CART Primer: split 1 (Savings = low goes left, everything else goes right)
    Left child: records 2, 5, 7; right child: records 1, 3, 4, 6, 8
    P_L = 3/8 = 0.375, P_R = 5/8 = 0.625, so 2 P_L P_R = 30/64 = 0.46875
    P(Bad | t_L) = 2/3 = 0.67,  P(Bad | t_R) = 1/5 = 0.2
    P(Good | t_L) = 1/3 = 0.33, P(Good | t_R) = 4/5 = 0.8
    Q(s|t) = |0.67 - 0.2| + |0.8 - 0.33| = 0.934
    Φ(s|t) = 0.46875 * 0.934 ≈ 0.438. A sketch of this computation follows.
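
    A minimal Python sketch of the Φ(s|t) computation above on the credit-risk records; the field layout and function names are illustrative. It reproduces the value for split 1 and for split 4 (the split the tree actually chooses):

      # Credit-risk training records from the CART example: (savings, assets, income, risk).
      records = [('Medium', 'High',   75,  'Good'),
                 ('Low',    'Low',    50,  'Bad'),
                 ('High',   'Medium', 25,  'Bad'),
                 ('Medium', 'Medium', 50,  'Good'),
                 ('Low',    'Medium', 100, 'Good'),
                 ('High',   'High',   25,  'Good'),
                 ('Low',    'Low',    25,  'Bad'),
                 ('Medium', 'Medium', 75,  'Good')]
      classes = ('Good', 'Bad')

      def goodness_of_split(records, goes_left):
          """Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|.
          Assumes the split sends at least one record to each child."""
          left = [r for r in records if goes_left(r)]
          right = [r for r in records if not goes_left(r)]
          p_l, p_r = len(left) / len(records), len(right) / len(records)
          q = sum(abs(sum(r[3] == j for r in left) / len(left) -
                      sum(r[3] == j for r in right) / len(right))
                  for j in classes)
          return 2 * p_l * p_r * q

      # Split 1: Savings = low to the left.
      print(goodness_of_split(records, lambda r: r[0] == 'Low'))   # ~0.438
      # Split 4: Assets = low to the left (the maximum, so it is chosen for the root).
      print(goodness_of_split(records, lambda r: r[1] == 'Low'))   # ~0.625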

  • CART Example: for each candidate split at the root node, examine the components of the measure Φ(s|t)

    Split | P_L   | P_R   | P(j|t_L)            | P(j|t_R)            | 2 P_L P_R | Q(s|t) | Φ(s|t)
    1     | 0.375 | 0.625 | G: 0.333, B: 0.667  | G: 0.8,   B: 0.2    | 0.46875   | 0.934  | 0.4378
    2     | 0.375 | 0.625 | G: 1,     B: 0      | G: 0.4,   B: 0.6    | 0.46875   | 1.2    | 0.5625
    3     | 0.25  | 0.75  | G: 0.5,   B: 0.5    | G: 0.667, B: 0.333  | 0.375     | 0.334  | 0.1253
    4     | 0.25  | 0.75  | G: 0,     B: 1      | G: 0.833, B: 0.167  | 0.375     | 1.667  | 0.6248
    5     | 0.5   | 0.5   | G: 0.75,  B: 0.25   | G: 0.5,   B: 0.5    | 0.5       | 0.5    | 0.25
    6     | 0.25  | 0.75  | G: 1,     B: 0      | G: 0.5,   B: 0.5    | 0.375     | 1      | 0.375
    7     | 0.375 | 0.625 | G: 0.333, B: 0.667  | G: 0.8,   B: 0.2    | 0.46875   | 0.934  | 0.4378
    8     | 0.625 | 0.375 | G: 0.4,   B: 0.6    | G: 1,     B: 0      | 0.46875   | 1.2    | 0.5625
    9     | 0.875 | 0.125 | G: 0.571, B: 0.429  | G: 1,     B: 0      | 0.21875   | 0.858  | 0.1877

    Split 4 (Assets = low) maximizes Φ(s|t) and is therefore chosen for the root.

  • CART Example: decision tree after the initial split. Root node (all records) splits on Assets = low vs. Assets in {medium, high}: Assets = low -> Bad Risk (records 2, 7); Assets in {medium, high} -> Decision Node A (records 1, 3, 4, 5, 6, 8).

  • CART Example: values of the components of Φ(s|t) for each candidate split on Decision Node A

    Split | P_L   | P_R   | P(j|t_L)            | P(j|t_R)            | 2 P_L P_R | Q(s|t) | Φ(s|t)
    1     | 0.167 | 0.833 | G: 1,     B: 0      | G: 0.8,   B: 0.2    | 0.2782    | 0.4    | 0.1112
    2     | 0.5   | 0.5   | G: 1,     B: 0      | G: 0.667, B: 0.333  | 0.5       | 0.6666 | 0.3333
    3     | 0.333 | 0.667 | G: 0.5,   B: 0.5    | G: 1,     B: 0      | 0.4444    | 1      | 0.4444
    5     | 0.667 | 0.333 | G: 0.75,  B: 0.25   | G: 1,     B: 0      | 0.4444    | 0.5    | 0.2222
    6     | 0.333 | 0.667 | G: 1,     B: 0      | G: 0.75,  B: 0.25   | 0.4444    | 0.5    | 0.2222
    7     | 0.333 | 0.667 | G: 0.5,   B: 0.5    | G: 1,     B: 0      | 0.4444    | 1      | 0.4444
    8     | 0.5   | 0.5   | G: 0.667, B: 0.333  | G: 1,     B: 0      | 0.5       | 0.6666 | 0.3333
    9     | 0.833 | 0.167 | G: 0.8,   B: 0.2    | G: 1,     B: 0      | 0.2782    | 0.4    | 0.1112

    Splits 3 and 7 tie for the maximum Φ(s|t) = 0.4444; the tree below uses Savings = high (split 3) at Decision Node A.

  • CART Example: decision tree after the Decision Node A split. Root: Assets = low -> Bad Risk (records 2, 7); Assets in {medium, high} -> Decision Node A (records 1, 3, 4, 5, 6, 8). Node A: Savings = high -> Decision Node B (records 3, 6); Savings in {low, medium} -> Good Risk (records 1, 4, 5, 8).

  • CART Example: fully grown decision tree. Root: Assets = low -> Bad Risk (records 2, 7); Assets in {medium, high} -> Decision Node A. Node A: Savings in {low, medium} -> Good Risk (records 1, 4, 5, 8); Savings = high -> Decision Node B (records 3, 6). Node B: Assets = medium -> Bad Risk (record 3); Assets = high -> Good Risk (record 6).

  • Classification: k-Nearest Neighbor (kNN), decision tree (CART, C4.5), naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • The C4.5 Algorithm
    Proposed by Quinlan in 1993
    An internal node represents a test on an attribute; a branch represents an outcome of the test (e.g., Color = red); a leaf node represents a class label or class-label distribution.
    At each node, one attribute is chosen to split the training examples into distinct classes as much as possible.
    A new case is classified by following a matching path to a leaf node.

  • The C4.5 Algorithm: differences between CART and C4.5
    Unlike CART, C4.5 is not restricted to binary splits; it produces a separate branch for each value of a categorical attribute.
    C4.5's method for measuring node homogeneity is different from CART's.

  • The C4.5 Algorithm: measure. A candidate split S partitions the training data set T into subsets T1, T2, ..., Tk. C4.5 uses the concept of entropy reduction (information gain) to select the optimal split:

    entropy_reduction(S) = H(T) - H_S(T)

    where the entropy is H(X) = - Σ_i p_i log2 p_i (p_i is the proportion of records in class i), and H_S(T) = Σ_{i=1..k} (|T_i| / |T|) H(T_i) is the weighted sum of the entropies of the individual subsets T1, T2, ..., Tk. C4.5 chooses the optimal split as the one with the greatest entropy reduction. A sketch follows.
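
    A minimal Python sketch of the entropy-reduction computation above, reusing the credit-risk class labels from the CART example as illustrative data:

      import math
      from collections import Counter

      def entropy(labels):
          """H = -sum p_i log2 p_i over the class proportions of `labels`."""
          counts = Counter(labels)
          total = len(labels)
          return -sum((c / total) * math.log2(c / total) for c in counts.values())

      def entropy_reduction(labels, subsets):
          """H(T) minus the size-weighted entropy of the subsets produced by a split."""
          total = len(labels)
          weighted = sum(len(s) / total * entropy(s) for s in subsets)
          return entropy(labels) - weighted

      # Illustrative split of the credit-risk labels on Assets = low vs. the rest.
      all_labels = ['Good', 'Bad', 'Bad', 'Good', 'Good', 'Good', 'Bad', 'Good']
      low_assets = ['Bad', 'Bad']                                   # records 2, 7
      other      = ['Good', 'Bad', 'Good', 'Good', 'Good', 'Good']  # records 1, 3, 4, 5, 6, 8
      print(entropy_reduction(all_labels, [low_assets, other]))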

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • Bayes Rule: the recommender-system question
    L_i is the class for item i (i.e., that the user likes item i); A is the set of features associated with item i. We want to estimate p(L_i | A):

    p(L_i | A) = p(A | L_i) p(L_i) / p(A)

    We can always restate a conditional probability in terms of the reverse condition p(A | L_i) and two prior probabilities, p(L_i) and p(A). Often the reverse condition is easier to know: we can count how often a feature appears in items the user liked (frequentist assumption).

  • Naive Bayes
    Independence (naive Bayes assumption): the features a1, a2, ..., ak are independent.
    Joint probability: p(a1, ..., ak) = Π_i p(a_i)
    Conditional probability: p(a1, ..., ak | L) = Π_i p(a_i | L)
    Bayes' rule then gives p(L | a1, ..., ak) ∝ p(L) Π_i p(a_i | L). A sketch follows.
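
    A minimal Python sketch of the naive Bayes computation above on made-up (feature set, liked?) training pairs, with add-one smoothing (data and names are assumptions):

      from collections import Counter, defaultdict

      # Illustrative training data: each item is (set of features, liked?).
      items = [({'car', 'fast'},   True),
               ({'car', 'cheap'},  True),
               ({'book', 'cheap'}, False),
               ({'book', 'long'},  False),
               ({'car', 'long'},   True)]

      def train(items):
          prior = Counter(label for _, label in items)
          feature_counts = defaultdict(Counter)          # feature_counts[label][feature]
          for features, label in items:
              for f in features:
                  feature_counts[label][f] += 1
          return prior, feature_counts

      def classify(features, prior, feature_counts):
          total = sum(prior.values())
          scores = {}
          for label in prior:
              # p(L) times the product of p(a_i | L), with add-one (Laplace) smoothing.
              score = prior[label] / total
              for f in features:
                  score *= (feature_counts[label][f] + 1) / (prior[label] + 2)
              scores[label] = score
          return max(scores, key=scores.get)

      prior, counts = train(items)
      print(classify({'car', 'cheap'}, prior, counts))   # True: the user likely likes this item
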

  • An Example: compute all the probabilities required for classification from the training table. For class C = t and for class C = f, multiply the class prior by the conditional probabilities of the observed attribute values; C = t turns out to be more probable, so t is the final class.

  • Naive Bayesian Classifier
    Advantages: easy to implement; very efficient; good results obtained in many applications
    Disadvantages: the class-conditional independence assumption causes a loss of accuracy when it is seriously violated (highly correlated data sets)

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • References for Machine Learning
    T. Mitchell, Machine Learning, McGraw-Hill, 1997.
    C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
    T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2001.
    V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
    Y. Kodratoff and R. S. Michalski, Machine Learning: An Artificial Intelligence Approach, Volume III, Morgan Kaufmann, 1990.

  • Recommendation Approaches
    Collaborative filtering: nearest-neighbor based, model based
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • The Netflix Prize (the slides here are from Yehuda Koren)

  • Netflix
    Movie rentals by DVD (mail) and online (streaming): 100k movies, 10 million customers
    Ships 1.9 million disks to customers each day from 50 warehouses in the US: a complex logistics problem
    Employees: 2000, but relatively few in engineering/software, and only a few people working on recommender systems
    Moving towards online delivery of content; significant interaction of customers with the Web site

  • The $1 Million Question

  • $1 Million Awarded September 21st, 2009

  • Lessons Learned
    Scale is important, e.g., stochastic gradient descent on sparse matrices
    Latent factor models work well on this problem; they had not previously been explored for recommender systems
    Understanding your data is important, e.g., time effects
    Combining models works surprisingly well, but the final 10% improvement can probably be achieved by judiciously combining about 10 models rather than 1000s; this is likely what Netflix will do in practice

  • Useful References
    Y. Koren, Collaborative filtering with temporal dynamics, ACM SIGKDD Conference, 2009.
    Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009.
    Y. Koren, Factor in the neighbors: scalable and accurate collaborative filtering, ACM Transactions on Knowledge Discovery from Data, 2010.

  • Thank you!

