Data Mining Algorithms for Recommendation Systems
Zhenglu Yang, University of Tokyo

TRANSCRIPT

  • Data Mining Algorithms for Recommendation Systems. Zhenglu Yang, University of Tokyo

  • Sample Applications

  • Sample Applications: Corporate Intranets

  • System Inputs
    Interaction data (users × items)
      Explicit feedback: ratings, comments
      Implicit feedback: purchases, browsing
    User/item individual data
      User side: structural attribute information, personal description, social network
      Item side: structural attribute information, textual description/content information, taxonomy of item (category)

  • Interaction between Users and Items: observed preferences (purchases, ratings, page views, bookmarks, etc.)

  • Profiles of Users and Items
    User profile: (1) attributes (nationality, sex, age, hobby, etc.); (2) text (personal description); (3) links (social network)
    Item profile: (1) attributes (price, weight, color, brand, etc.); (2) text (product description); (3) links (taxonomy of item / category)

  • All Information about Users and Items
    Observed preferences (purchases, ratings, page views, bookmarks, etc.)
    User profile: attributes, text (personal description), links (social network)
    Item profile: attributes, text (product description), links (taxonomy of item / category)

  • KDD and Data Mining: data mining is a multi-disciplinary field, drawing on artificial intelligence, statistics, machine learning, KDD, databases, and natural language processing.

  • Recommendation Approaches
    Collaborative filtering: uses interaction data (the user-item matrix); process: identify similar users and extrapolate from their ratings
    Content-based strategies: use profiles of users/items (features); process: generate rules/classifiers that are used to classify new items
    Hybrid approaches

  • A Brief Introduction: collaborative filtering, which is either nearest-neighbor based or model based

  • Recommendation Approaches
    Collaborative filtering
      Nearest-neighbor based: user based, item based
      Model based

  • User-based Collaborative Filtering. Idea: people who agreed in the past are likely to agree again. To predict a user's opinion of an item, use the opinions of similar users; similarity between users is determined by their overlap in opinions on other items.

  • User-based CF (Ratings). Each row holds one user's ratings of the items; high ratings (9-10) are good, low ratings (1-2) are bad.

             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 1     8       1       7       2       9       8
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
    User 4     2       1       1       2       3       1
    User 5     3       1       2       3       2       2
    User 6     1       2       2       1       1       1

  • Similarity between Users
    Only consider items both users have rated.
    Common similarity measures: cosine similarity and Pearson correlation (a sketch of both follows).

             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
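
    A minimal Python sketch of both measures, restricted to the items that User 2 and User 3 have both rated (the data come from the table above; the function names and structure are illustrative):

      import math

      # Ratings of User 2 and User 3 on Items 1-6 (None = unrated), from the slide.
      user2 = [9, 8, 7, None, 1, 2]
      user3 = [8, 9, 8, 9, 3, 1]

      def co_rated(u, v):
          """Pairs of ratings for items both users have rated."""
          return [(a, b) for a, b in zip(u, v) if a is not None and b is not None]

      def cosine_similarity(u, v):
          pairs = co_rated(u, v)
          dot = sum(a * b for a, b in pairs)
          norm_u = math.sqrt(sum(a * a for a, _ in pairs))
          norm_v = math.sqrt(sum(b * b for _, b in pairs))
          return dot / (norm_u * norm_v)

      def pearson_correlation(u, v):
          pairs = co_rated(u, v)
          mean_u = sum(a for a, _ in pairs) / len(pairs)
          mean_v = sum(b for _, b in pairs) / len(pairs)
          num = sum((a - mean_u) * (b - mean_v) for a, b in pairs)
          den = (math.sqrt(sum((a - mean_u) ** 2 for a, _ in pairs)) *
                 math.sqrt(sum((b - mean_v) ** 2 for _, b in pairs)))
          return num / den

      print(cosine_similarity(user2, user3))
      print(pearson_correlation(user2, user3))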

  • Recommendation Approaches
    Collaborative filtering
      Nearest-neighbor based: user based, item based
      Model based
    Content-based strategies
    Hybrid approaches

  • Item-based Collaborative Filtering. Idea: a user is likely to have the same opinion of similar items; similarity between items is determined by how other users have rated them.

  • Example: Item-based CF

             Item 1  Item 2  Item 3  Item 4  Item 5
    User 1     8       1       ?       2       7
    User 2     2       2       5       7       5
    User 3     5       4       7       4       7
    User 4     7       1       7       3       8
    User 5     1       7       4       6       5
    User 6     8       3       8       3       7

  • Similarity between Items
    Only consider users who have rated both items.
    Common similarity measures: cosine similarity and Pearson correlation.

             Item 3  Item 4
    User 1     ?       2
    User 2     5       7
    User 3     7       4
    User 4     7       3
    User 5     4       6
    User 6     8       3

  • Recommendation Approaches
    Collaborative filtering
      Nearest-neighbor based
      Model based: matrix factorization (e.g., SVD)
    Content-based strategies
    Hybrid approaches

  • Singular Value Decomposition (SVD). A general mathematical method applied to many problems. Given any m×n matrix R, find matrices U, Σ, and V such that R = UΣV^T, where U is m×r and orthonormal, Σ is r×r and diagonal, and V is n×r and orthonormal. Removing all but the k largest singular values gives the best rank-k approximation R_k, with k < r.
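
    A minimal numpy sketch of the truncation step just described, applied to a small illustrative rating matrix (the matrix values and variable names are assumptions, not from the slides):

      import numpy as np

      # Small illustrative user-item rating matrix (rows = users, columns = items).
      R = np.array([[8, 1, 7, 2, 9, 8],
                    [9, 8, 7, 5, 1, 2],
                    [8, 9, 8, 9, 3, 1],
                    [2, 1, 1, 2, 3, 1]], dtype=float)

      U, s, Vt = np.linalg.svd(R, full_matrices=False)   # R = U @ diag(s) @ Vt

      k = 2                                              # keep the k largest singular values
      R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of R

      print(np.round(R_k, 2))                            # smoothed matrix used for prediction
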
  • Problems with Collaborative Filtering
    Cold start: there need to be enough other users already in the system to find a match.
    Sparsity: if there are many items to be recommended, the user/ratings matrix is sparse even with many users, and it is hard to find users that have rated the same items.
    First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).
    Popularity bias: cannot recommend items to someone with unique tastes; tends to recommend popular items.

  • Recommendation Approaches: collaborative filtering, content-based strategies, hybrid approaches

  • Profiles of Users and Items
    User profile: (1) attributes (nationality, sex, age, hobby, etc.); (2) text (personal description); (3) links (social network)
    Item profile: (1) attributes (price, weight, color, brand, etc.); (2) text (product description); (3) links (taxonomy of item / category)

  • Advantages of the Content-Based Approach
    No need for data on other users: no cold-start or sparsity problems.
    Able to recommend to users with unique tastes.
    Able to recommend new and unpopular items: no first-rater problem.
    Can provide explanations of recommended items by listing the content features that caused an item to be recommended.

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • Traditional Data Mining Techniques: association rule mining and sequential pattern mining

  • Example: Market Basket Data. Items frequently purchased together (e.g., beer and diapers). Uses: recommendation, placement, sales, coupons. Objective: increase sales and reduce costs.

  • Association Rule Mining. Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. Example association rules over the market-basket transactions below: {Beer} -> {Diaper}, {Coke} -> {Eggs}. Implication means co-occurrence, not causality!

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Some Definitions. An itemset is supported by a transaction if it is included in that transaction. For example, the itemset {Beer, Diaper} is supported by transactions 1 and 3 below, so its support is 2/4 = 50%.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Some Definitions. If the support of an itemset exceeds a user-specified min_support (threshold), the itemset is called a frequent itemset (pattern). With min_support = 50%, for example, {Beer, Diaper} (support 50%) is a frequent itemset, whereas an itemset such as {Beer, Milk} (support 25%) is not.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining

  • Apriori Algorithm
    Proposed by Agrawal et al. [VLDB94]
    The first algorithm for association rule mining
    Candidate generation-and-test
    Introduced the anti-monotone property

  • Apriori Algorithm. Market-basket transactions:

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • A Naive Algorithm (min_sup = 2): enumerate every candidate itemset and count its support over the transactions.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Apriori Algorithm (min_sup = 2). Anti-monotone property: if an itemset is not frequent, then none of its supersets is frequent.

    TID | Items
    1   | Beer, Diaper, Milk
    2   | Coke, Diaper, Eggs
    3   | Beer, Coke, Diaper, Eggs
    4   | Coke, Eggs

  • Apriori Algorithm (min_sup = 2): the 1st scan produces C1 and L1, the 2nd scan C2 and L2, the 3rd scan C3 and L3.

    Transaction DB:
    Tid | Items
    10  | A, C, D
    20  | B, C, E
    30  | A, B, C, E
    40  | B, E

    C1 (1st scan):   {A}:2  {B}:3  {C}:3  {D}:1  {E}:3
    L1:              {A}:2  {B}:3  {C}:3  {E}:3
    C2 (candidates): {A,B}  {A,C}  {A,E}  {B,C}  {B,E}  {C,E}
    C2 (2nd scan):   {A,B}:1  {A,C}:2  {A,E}:1  {B,C}:2  {B,E}:3  {C,E}:2
    L2:              {A,C}:2  {B,C}:2  {B,E}:3  {C,E}:2
    C3 (candidate):  {B,C,E}
    L3 (3rd scan):   {B,C,E}:2
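
    A compact Python sketch of the candidate generation-and-test loop on the transaction DB above with min_sup = 2 (function and variable names are illustrative):

      from itertools import combinations

      transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
      min_sup = 2

      def support(itemset):
          return sum(itemset <= t for t in transactions)

      # L1: frequent 1-itemsets
      items = {i for t in transactions for i in t}
      frequent = [{frozenset([i]) for i in items if support({i}) >= min_sup}]

      k = 1
      while frequent[-1]:
          # Generate candidate (k+1)-itemsets by joining frequent k-itemsets, then
          # prune any candidate with an infrequent k-subset (anti-monotone property).
          prev = frequent[-1]
          candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
          candidates = {c for c in candidates
                        if all(frozenset(s) in prev for s in combinations(c, k))}
          # One database scan per level to count candidate supports.
          frequent.append({c for c in candidates if support(c) >= min_sup})
          k += 1

      for level in frequent:
          for itemset in sorted(level, key=sorted):
              print(set(itemset), support(itemset))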

  • Drawbacks of Apriori
    Multiple scans of the transaction database; multiple database scans are costly.
    Huge number of candidates: to find the frequent itemset i1 i2 ... i100, the number of scans is 100 and the number of candidates is 2^100 - 1 ≈ 1.27 * 10^30.

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining

  • FP-Growth
    Proposed by Han et al. [SIGMOD00]
    Uses the Apriori pruning principle
    Scans the DB only twice: once to find the frequent 1-itemsets (single-item patterns), and once to construct the FP-tree (a prefix tree / trie), the data structure of FP-growth

  • FP-Growth: tree construction

    TID | Items bought             | (ordered) frequent items
    100 | f, a, c, d, g, i, m, p   | f, c, a, m, p
    200 | a, b, c, f, l, m, o      | f, c, a, b, m
    300 | b, f, h, j, o, w         | f, b
    400 | b, c, k, s, p            | c, b, p
    500 | a, f, c, e, l, p, m, n   | f, c, a, m, p

    Header table (item: support): f:4, c:4, a:3, b:3, m:3, p:3

    Inserting the first transaction yields the single path {} - f:1 - c:1 - a:1 - m:1 - p:1; after all five transactions the FP-tree contains the paths f:4 - c:3 - a:3 - m:2 - p:2, f:4 - c:3 - a:3 - b:1 - m:1, f:4 - b:1, and c:1 - b:1 - p:1.

  • FP-Growth: ordered frequent items per transaction: 100 {f, c, a, m, p}; 200 {f, c, a, b, m}; 300 {f, b}; 400 {c, b, p}; 500 {f, c, a, m, p}

  • FP-Growth: conditional pattern bases

    Item | Conditional pattern base | Frequent itemsets
    p    | fcam:2, cb:1             | fp, cp, ap, mp, fcp, fap, fmp, cap, cmp, amp, facp, fcmp, famp, fcamp
    m    | fca:2, fcab:1            | fm, cm, am, fcm, fam, cam, fcam
    b    | fca:1, f:1, c:1          |
    a    | fc:3                     |
    c    | f:3                      |

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • Applications. Customer shopping sequences (e.g., notebook computer, then memory, then CD-ROM within 3 days); query/re-query patterns (e.g., Q1 P1 Q2 P2, 80%).

  • Some Definitions. If the support of a sequence exceeds a user-specified min_support, the sequence is called a sequential pattern. In the example, min_support = 50% over three customer sequences (IDs 10, 20, 30); one example sequence is a sequential pattern and another is not.

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • GSP (Generalized Sequential Pattern mining)
    Proposed by Srikant et al. [EDBT96]
    Uses the Apriori pruning principle
    Outline of the method:
      Initially, every item in the DB is a candidate of length 1
      For each level (i.e., sequences of length k): scan the database to collect the support count for each candidate sequence, then generate candidate length-(k+1) sequences from the length-k frequent sequences using Apriori
      Repeat until no frequent sequence or no candidate can be found

  • Finding Length-1 Sequential Patterns. Initial candidates: the eight single items. Scan the database once and count support for each candidate (supports in the example: 3, 5, 4, 3, 3, 2, 1, 1).

  • Generating Length-2 Candidates: 51 length-2 candidates. Without the Apriori property there would be 8*8 + 8*7/2 = 92 candidates; Apriori prunes 44.57% of them.

  • The GSP Mining Process (min_sup = 2)

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • SPADE Algorithm
    Proposed by Zaki et al. [MLJ01]
    Vertical ID-list data representation
    Support counting via temporal joins
    Candidates to be tested are kept in local candidate lists

  • SPADE Algorithm: ID-list database. Each item (a, b, c, d) is stored as a vertical ID-list of (SID, position) pairs taken from the data sequences below.

    ID | Data Sequence
    10 | < b d b c b d a b a d >
    20 | < d c a a b c b d a b >
    30 | < c a d a d c a d c a >

  • Temporal Joins (S-step), min_support = 2. Joining the ID-lists of {a} and {b} yields the ID-lists of the 2-sequences <a b> and <b a>; in the example, Sup(ab) = 2 and Sup(ba) = 2.

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • SPAM Algorithm
    Proposed by Ayres et al. [KDD02]
    Key idea based on SPADE
    Uses a bitmap data representation
    Much faster than SPADE, but more space-consuming

  • SPAM Algorithm: each item (a, b, c, d) is represented as a bitmap over the (SID, position) slots of the data sequences below, with a bit set wherever the item occurs.

    ID | Data Sequence
    10 | < a c (b c) d (a b c) a d >
    20 | < b (c d) a c (b d) >
    30 | < d (b c) (a c) (c d) >

  • SPAM Temporal Joins (min_support = 2). Sequence-extension step (S-step): to extend {a} by {b}, transform the bitmap of {a} so that only the positions strictly after its first occurrence in each sequence remain set, then AND the result with the bitmap of {b}; this yields the bitmap of <a b>, with Sup(ab) = 2 in the example. A sketch follows.
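
    A minimal Python sketch of the S-step join described above, run on the three data sequences from the SPAM example (the helper names are illustrative):

      # Data sequences from the SPAM slide, one list of itemsets per sequence.
      sequences = [
          [{'a'}, {'c'}, {'b', 'c'}, {'d'}, {'a', 'b', 'c'}, {'a'}, {'d'}],   # SID 10
          [{'b'}, {'c', 'd'}, {'a'}, {'c'}, {'b', 'd'}],                      # SID 20
          [{'d'}, {'b', 'c'}, {'a', 'c'}, {'c', 'd'}],                        # SID 30
      ]

      def bitmap(item):
          """Positional bitmap of `item`: one list of booleans per sequence."""
          return [[item in itemset for itemset in seq] for seq in sequences]

      def s_step(prefix_bits, item_bits):
          """Sequence-extension: keep the positions of the item that lie strictly
          after the first position of the prefix, sequence by sequence."""
          result = []
          for p_bits, i_bits in zip(prefix_bits, item_bits):
              if True in p_bits:
                  first = p_bits.index(True)
                  result.append([bit and pos > first for pos, bit in enumerate(i_bits)])
              else:
                  result.append([False] * len(i_bits))
          return result

      def support(bits):
          return sum(any(seq_bits) for seq_bits in bits)

      ab = s_step(bitmap('a'), bitmap('b'))
      print(support(ab))   # number of sequences containing <a b>; 2 here, so it survives min_support = 2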

  • Outline
    Association Rule Mining: Apriori, FP-growth
    Sequential Pattern Mining: GSP; SPADE, SPAM; PrefixSpan

  • PrefixSpan (projection-based approach)
    Proposed by Pei et al. [ICDE01]
    Based on pattern growth
    Prefix-postfix (projection) representation
    Basic idea: use frequent items to recursively project the sequence database into smaller projected databases and grow patterns in each projected database.

  • Example (min_sup = 2). The first scan of the sequence database (SIDs 10, 20, 30) counts the support of each item (in the example, Sup(a) = 1, Sup(b) = 2, Sup(c) = 3, Sup(d) = 3) to find the length-1 frequent patterns. The database is then projected on each frequent prefix (prefix-projected databases), and each projected database is mined recursively for longer prefixes.

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • All Information about Users and Items
    Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc.)
    User profile: (1) attributes (nationality, sex, age, hobby, etc.); (2) text (personal description); (3) links (social network)
    Item profile: (1) attributes (price, weight, color, brand, etc.); (2) text (product description); (3) links (associated relations between items, e.g., co-purchased by the same user)

  • All Information about Users and Items: example text fields. User text: "I like car, movie, music". Item text: "This car is nicely equipped with auto air conditioning".

  • Profile Representation: Vector Space Model
    User profile: structured data (attributes: book, car, TV) and free text ("I like car, movie, music")
    Item profile: structured data (attributes: name, color, price) and free text ("This car is nicely equipped with auto air conditioning")
    Both are represented as vectors over a common vocabulary (car, book, TV, bike, ...); User A and Item B are then compared with cosine similarity, cos(A, B), or a weighted cosine similarity.

  • TF*IDF Weighting
    Term frequency tf_{t,d} of a term t in a document d: based on n_{t,d}, the number of times t appears in d (often normalized by document length).
    Inverse document frequency idf_t of a term t: idf_t = log(N / df_t), where df_t is the number of documents that contain t and N is the total number of documents.
    The TF*IDF weight of t in d is tf_{t,d} * idf_t. A sketch follows.
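
    A minimal Python sketch of these formulas on a toy two-document collection built from the example sentences (documents and names are illustrative):

      import math
      from collections import Counter

      docs = ["i like car movie music",
              "this car is nicely equipped with auto air conditioning"]
      tokenized = [d.split() for d in docs]
      N = len(tokenized)

      def tf(t, doc_tokens):
          # Raw count of t in the document, normalized by document length.
          return Counter(doc_tokens)[t] / len(doc_tokens)

      def idf(t):
          df = sum(t in doc for doc in tokenized)   # documents containing t
          return math.log(N / df) if df else 0.0

      def tfidf(t, doc_tokens):
          return tf(t, doc_tokens) * idf(t)

      for term in ("car", "music"):
          print(term, [round(tfidf(term, d), 3) for d in tokenized])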

  • Profile Representation: unstructured data, e.g., a text description or review of a restaurant, or news articles. There are no attribute names with well-defined values, and natural language is complex: the same word can have different meanings, and different words can have the same meaning. Structure must be imposed on free text before it can be used in a recommendation algorithm.

  • All Information about Users and Items (text example, continued)
    Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc.); user and item profiles as before.
    User text: "I like automobile, movie, music". Item text: "This car is nicely equipped with auto air conditioning".

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • Knowledge-based Similarity
    Knowledge data: WordNet
    Intuition: two words are similar if they are close to each other in the WordNet hierarchy
    Measure approaches:
      Shortest-path based [Rada, SMC89] [Wu, ACL94] [Leacock98]
      Information-content based [Resnik, IJCAI95] [Jiang, ROCLING97] [Lin, ICML98]

  • Knowledge-based word semantic similarity: the path-based measures of Leacock & Chodorow (1998) and Wu & Palmer (1994), and the gloss-overlap measure of Lesk (1986), which finds the overlap between the dictionary entries of two words.
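
    The two path-based measures above are commonly stated as follows, with len the shortest path length between the two concepts, D the maximum depth of the taxonomy, and LCS the least common subsumer (standard forms, not copied from the slide):

      \mathrm{sim}_{LCh}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D}

      \mathrm{sim}_{WP}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}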

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • Explicit Semantic Analysis (ESA)
    Proposed by Gabrilovich et al. [IJCAI07]
    Maps text to a vector of concepts in Wikipedia
    Calculates the ESA score with a common vector-based measure (i.e., cosine)

  • ESA Process (this figure is from Gabrilovich, IJCAI07)

  • ESA Example
    Text 1: "The dog caught the red ball."  Text 2: "A labrador played in the park."
    Each text is mapped to a weighted vector of Wikipedia concepts (e.g., "Glossary of cue sports terms", "American Football Strategy", "Baseball", "Boston Red Sox"); the cosine of the two vectors gives a similarity score of 14.38%. (This slide is from Rada Mihalcea.)

  • Text Similarity based Techniques
    Vector Space Model (VSM): TF-IDF
    Semantic resource based: WordNet, Wikipedia, the Web

  • Corpus-based Similarity
    Corpus data: the Web (via a search engine)
    Intuition: two words are similar if they frequently occur on the same page
    PMI-IR [Turney, ECML01]

  • PMI-IR
    Pointwise Mutual Information (Church and Hanks, 1989)
    PMI-IR (Turney, 2001), estimated from search-engine counts, where N is the number of Web pages. The usual formulations follow.
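
    These measures are usually written as follows, with hits(.) denoting search-engine hit counts and N the number of Web pages (standard forms, not copied from the slide):

      \mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}

      \mathrm{PMI\text{-}IR}(w_1, w_2) = \log_2 \frac{N \cdot \mathrm{hits}(w_1 \wedge w_2)}{\mathrm{hits}(w_1)\, \mathrm{hits}(w_2)}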

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • All Information about Users and Items
    Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc.)
    User profile: attributes, text (personal description), links (social network)
    Item profile: attributes, text (product description), links (associated relations between items)
    These data can be grouped by clustering.

  • Clustering: K-means and hierarchical clustering

  • K-means
    Introduced by MacQueen (1967)
    Works when we know k, the number of clusters we want to find
    Idea:
      Randomly pick k points as the centroids of the k clusters
      Loop: put each point in the cluster whose centroid it is closest to, then recompute the cluster centroids
      Repeat the loop until there is no change in the clusters between two consecutive iterations
    This is an iterative improvement of the objective function: the sum of squared distances from each point to the centroid of its cluster. A sketch follows the example below.

  • K-means Example (k = 2): points are reassigned to the nearest centroid and the centroids recomputed until the assignment no longer changes (converged).
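
    A minimal Python sketch of the loop above on a few illustrative 2-D points (the data, seed, and function names are assumptions):

      import random

      def dist2(p, q):
          return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

      def kmeans(points, k, iterations=100, seed=0):
          """Plain k-means on 2-D points; returns (centroids, assignment)."""
          random.seed(seed)
          centroids = random.sample(points, k)          # pick k points as initial centroids
          assignment = [None] * len(points)
          for _ in range(iterations):
              # Assignment step: each point joins the cluster with the closest centroid.
              new_assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                                for p in points]
              if new_assignment == assignment:          # converged: no change in clusters
                  break
              assignment = new_assignment
              # Update step: recompute each centroid as the mean of its points.
              for c in range(k):
                  members = [p for p, a in zip(points, assignment) if a == c]
                  if members:
                      centroids[c] = (sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members))
          return centroids, assignment

      points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
      print(kmeans(points, k=2))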

  • Clustering: K-means and hierarchical clustering

  • Hierarchical Clustering
    Two types: agglomerative (bottom up) and divisive (top down)
    Agglomerative: two groups are merged if the distance between them is less than a threshold
    Divisive: one group is split into two if the inter-group distance is more than a threshold
    The result can be shown in a graphical representation called a dendrogram

  • Hierarchical Clustering: dendrogram (figure)

  • Hierarchical Agglomerative Clustering
    Put every point in a cluster by itself.
    For i = 1 to N-1:
      let C1 and C2 be the most mergeable (closest) pair of clusters;
      create C1,2 as the parent of C1 and C2.
    Example (1-dimensional objects for simplicity): 1, 2, 5, 6, 7.
    Agglomerative clustering repeatedly finds and merges the two closest clusters: first {1, 2}, leaving cluster centers {1.5, 5, 6, 7}; then {5, 6}, leaving {1.5, 5.5, 7}; then {{5, 6}, 7}; merging continues until a single cluster remains. A sketch follows.
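
    A minimal Python sketch of the agglomerative merging above on the 1-D example, using centroid distance between clusters (names are illustrative):

      def centroid(cluster):
          return sum(cluster) / len(cluster)

      def agglomerative(points):
          """Greedy agglomerative clustering; prints each merge until one cluster remains."""
          clusters = [[p] for p in points]              # start: every point is its own cluster
          while len(clusters) > 1:
              # Find the two clusters whose centroids are closest.
              best = None
              for i in range(len(clusters)):
                  for j in range(i + 1, len(clusters)):
                      d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                      if best is None or d < best[0]:
                          best = (d, i, j)
              _, i, j = best
              merged = clusters[i] + clusters[j]
              print("merge", clusters[i], "+", clusters[j], "->", merged)
              clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
          return clusters[0]

      agglomerative([1, 2, 5, 6, 7])   # merges {1,2}, then {5,6}, then {{5,6},7}, then all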

  • Recommendation Approaches
    Collaborative filtering
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • Illustrating the Classification Task: a learning algorithm induces a model from the training set; the model is then applied (deduction) to the test set.

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • k-Nearest Neighbor Classification (kNN)
    kNN does not build a model from the training data.
    Approach: to classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d; count the number n of training instances in P that belong to class cj; estimate Pr(cj|d) as n/k (majority vote).
    No training is needed; classification time is linear in the training-set size for each test case.
    k is usually chosen empirically via a validation set or cross-validation, trying a range of k values.
    The distance function is crucial but depends on the application. A sketch follows.
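
    A minimal Python sketch of the majority-vote rule above; the training points and the class names Car, Book, and Clothes (echoing the example slides) are illustrative:

      import math
      from collections import Counter

      # Illustrative labeled training instances: (feature vector, class label).
      training = [((1.0, 1.0), 'Book'), ((1.2, 0.8), 'Book'),
                  ((5.0, 5.0), 'Car'),  ((5.5, 4.5), 'Car'), ((4.8, 5.2), 'Car'),
                  ((9.0, 1.0), 'Clothes')]

      def knn_classify(x, k):
          """Majority vote among the k training instances closest to x."""
          neighbors = sorted(training, key=lambda item: math.dist(x, item[0]))[:k]
          votes = Counter(label for _, label in neighbors)
          return votes.most_common(1)[0][0]

      print(knn_classify((1.1, 0.9), k=1))   # 1NN
      print(knn_classify((4.0, 4.0), k=3))   # 3NN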

  • Example: k = 1 (1NN). Classes: Car, Book, Clothes. The query point (*) is assigned the class of its single nearest neighbor; in the figure, Book.

  • Example: k = 3 (3NN). The query point (*) is assigned the majority class among its three nearest neighbors; in the figure, Car.

  • Discussion
    Advantages: nonparametric architecture, simple, powerful, requires no training time
    Disadvantages: memory intensive, classification/estimation is slow, sensitive to k

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • Example of a Decision Tree. Training data (below); the task is to judge the cheating possibility, Yes/No. Model: a decision tree whose splitting attributes are Refund (Yes -> NO) and then MarSt (Married -> NO; Single or Divorced -> split on TaxInc: < 80K -> NO, > 80K -> YES).

  • Another Example of a Decision Tree. Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat is the class. The same data also fit a tree that splits on MarSt first (Married -> NO; Single or Divorced -> split on Refund and then TaxInc < 80K / > 80K). There can be more than one tree that fits the same data!

    Tid | Refund | Marital Status | Taxable Income | Cheat
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes

  • Decision Tree Construction
    Creating decision trees: manual (based on expert knowledge) or automated (based on training data, i.e., data mining)
    Two main issues:
      Issue #1: which attribute to take for a split?
      Issue #2: when to stop splitting?

  • Classification: k-Nearest Neighbor (kNN), decision tree (CART, C4.5), naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • The CART Algorithm (Classification And Regression Trees)
    Developed by Breiman et al. in the early 1980s
    Introduced tree-based modeling into the statistical mainstream
    A rigorous approach involving cross-validation to select the optimal tree

  • Key Idea: Recursive Partitioning
    Take all of your data.
    Consider all possible values of all variables.
    Select the variable/value (X = t1) that produces the greatest separation in the target; (X = t1) is called a split.
    If X < t1, send the data point to the left; otherwise, send it to the right.
    Now repeat the same process on these two nodes: you get a tree.
    Note: CART only uses binary splits.

  • Key Idea. Let Φ(s|t) be a measure of the goodness of a candidate split s at node t:

    Φ(s|t) = 2 P_L P_R Σ_j |P(j|t_L) - P(j|t_R)|

    where P_L and P_R are the proportions of records sent to the left and right child nodes and P(j|t_L), P(j|t_R) are the class-j proportions in those children. The optimal split maximizes Φ(s|t) over all possible splits at node t.

  • Key Idea. Φ(s|t) is large when both of its main components are large:
    1. 2 P_L P_R: maximal when the child nodes are of equal size (same support), e.g., 0.5 * 0.5 = 0.25 versus 0.9 * 0.1 = 0.09.
    2. Q(s|t) = Σ_j |P(j|t_L) - P(j|t_R)|: maximal when, for each class, the child nodes are completely uniform (pure); the theoretical maximum of Q(s|t) is k, the number of classes of the target variable.

  • CART Example: training set of records for classifying credit risk

    Customer | Savings | Assets | Income ($1000s) | Credit Risk
    1        | Medium  | High   | 75              | Good
    2        | Low     | Low    | 50              | Bad
    3        | High    | Medium | 25              | Bad
    4        | Medium  | Medium | 50              | Good
    5        | Low     | Medium | 100             | Good
    6        | High    | High   | 25              | Good
    7        | Low     | Low    | 25              | Bad
    8        | Medium  | Medium | 75              | Good

  • CART Example: candidate splits for t = root node (CART is restricted to binary splits)

    Candidate Split | Left Child Node t_L  | Right Child Node t_R
    1               | Savings = low        | Savings in {medium, high}
    2               | Savings = medium     | Savings in {low, high}
    3               | Savings = high       | Savings in {low, medium}
    4               | Assets = low         | Assets in {medium, high}
    5               | Assets = medium      | Assets in {low, high}
    6               | Assets = high        | Assets in {low, medium}
    7               | Income <= $25,000    | Income > $25,000
    8               | Income <= $50,000    | Income > $50,000
    9               | Income <= $75,000    | Income > $75,000

  • CART Primer: split 1 (Savings = low goes left, everything else goes right)
    Left child: records 2, 5, 7; right child: records 1, 3, 4, 6, 8
    P_L = 3/8 = 0.375, P_R = 5/8 = 0.625, so 2 P_L P_R = 30/64 = 0.46875
    P(Bad | t_L) = 2/3 = 0.67,  P(Bad | t_R) = 1/5 = 0.2
    P(Good | t_L) = 1/3 = 0.33, P(Good | t_R) = 4/5 = 0.8
    Q(s|t) = |0.67 - 0.2| + |0.8 - 0.33| = 0.934
    Φ(s|t) = 0.46875 * 0.934 ≈ 0.438. A sketch of this computation follows.
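
    A minimal Python sketch of the Φ(s|t) computation above on the credit-risk records; the field layout and function names are illustrative. It reproduces the value for split 1 and for split 4 (the split the tree actually chooses):

      # Credit-risk training records from the CART example: (savings, assets, income, risk).
      records = [('Medium', 'High',   75,  'Good'),
                 ('Low',    'Low',    50,  'Bad'),
                 ('High',   'Medium', 25,  'Bad'),
                 ('Medium', 'Medium', 50,  'Good'),
                 ('Low',    'Medium', 100, 'Good'),
                 ('High',   'High',   25,  'Good'),
                 ('Low',    'Low',    25,  'Bad'),
                 ('Medium', 'Medium', 75,  'Good')]
      classes = ('Good', 'Bad')

      def goodness_of_split(records, goes_left):
          """Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|.
          Assumes the split sends at least one record to each child."""
          left = [r for r in records if goes_left(r)]
          right = [r for r in records if not goes_left(r)]
          p_l, p_r = len(left) / len(records), len(right) / len(records)
          q = sum(abs(sum(r[3] == j for r in left) / len(left) -
                      sum(r[3] == j for r in right) / len(right))
                  for j in classes)
          return 2 * p_l * p_r * q

      # Split 1: Savings = low to the left.
      print(goodness_of_split(records, lambda r: r[0] == 'Low'))   # ~0.438
      # Split 4: Assets = low to the left (the maximum, so it is chosen for the root).
      print(goodness_of_split(records, lambda r: r[1] == 'Low'))   # ~0.625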

  • CART Example: for each candidate split at the root node, examine the components of the measure Φ(s|t)

    Split | P_L   | P_R   | P(j|t_L)            | P(j|t_R)            | 2 P_L P_R | Q(s|t) | Φ(s|t)
    1     | 0.375 | 0.625 | G: 0.333, B: 0.667  | G: 0.8,   B: 0.2    | 0.46875   | 0.934  | 0.4378
    2     | 0.375 | 0.625 | G: 1,     B: 0      | G: 0.4,   B: 0.6    | 0.46875   | 1.2    | 0.5625
    3     | 0.25  | 0.75  | G: 0.5,   B: 0.5    | G: 0.667, B: 0.333  | 0.375     | 0.334  | 0.1253
    4     | 0.25  | 0.75  | G: 0,     B: 1      | G: 0.833, B: 0.167  | 0.375     | 1.667  | 0.6248
    5     | 0.5   | 0.5   | G: 0.75,  B: 0.25   | G: 0.5,   B: 0.5    | 0.5       | 0.5    | 0.25
    6     | 0.25  | 0.75  | G: 1,     B: 0      | G: 0.5,   B: 0.5    | 0.375     | 1      | 0.375
    7     | 0.375 | 0.625 | G: 0.333, B: 0.667  | G: 0.8,   B: 0.2    | 0.46875   | 0.934  | 0.4378
    8     | 0.625 | 0.375 | G: 0.4,   B: 0.6    | G: 1,     B: 0      | 0.46875   | 1.2    | 0.5625
    9     | 0.875 | 0.125 | G: 0.571, B: 0.429  | G: 1,     B: 0      | 0.21875   | 0.858  | 0.1877

    Split 4 (Assets = low) maximizes Φ(s|t) and is therefore chosen for the root.

  • CART Example: decision tree after the initial split. Root node (all records) splits on Assets = low vs. Assets in {medium, high}: Assets = low -> Bad Risk (records 2, 7); Assets in {medium, high} -> Decision Node A (records 1, 3, 4, 5, 6, 8).

  • CART Example: values of the components of Φ(s|t) for each candidate split on Decision Node A

    Split | P_L   | P_R   | P(j|t_L)            | P(j|t_R)            | 2 P_L P_R | Q(s|t) | Φ(s|t)
    1     | 0.167 | 0.833 | G: 1,     B: 0      | G: 0.8,   B: 0.2    | 0.2782    | 0.4    | 0.1112
    2     | 0.5   | 0.5   | G: 1,     B: 0      | G: 0.667, B: 0.333  | 0.5       | 0.6666 | 0.3333
    3     | 0.333 | 0.667 | G: 0.5,   B: 0.5    | G: 1,     B: 0      | 0.4444    | 1      | 0.4444
    5     | 0.667 | 0.333 | G: 0.75,  B: 0.25   | G: 1,     B: 0      | 0.4444    | 0.5    | 0.2222
    6     | 0.333 | 0.667 | G: 1,     B: 0      | G: 0.75,  B: 0.25   | 0.4444    | 0.5    | 0.2222
    7     | 0.333 | 0.667 | G: 0.5,   B: 0.5    | G: 1,     B: 0      | 0.4444    | 1      | 0.4444
    8     | 0.5   | 0.5   | G: 0.667, B: 0.333  | G: 1,     B: 0      | 0.5       | 0.6666 | 0.3333
    9     | 0.833 | 0.167 | G: 0.8,   B: 0.2    | G: 1,     B: 0      | 0.2782    | 0.4    | 0.1112

    Splits 3 and 7 tie for the maximum Φ(s|t) = 0.4444; the tree below uses Savings = high (split 3) at Decision Node A.

  • CART Example: decision tree after the Decision Node A split. Root: Assets = low -> Bad Risk (records 2, 7); Assets in {medium, high} -> Decision Node A (records 1, 3, 4, 5, 6, 8). Node A: Savings = high -> Decision Node B (records 3, 6); Savings in {low, medium} -> Good Risk (records 1, 4, 5, 8).

  • CART Example: fully grown decision tree. Root: Assets = low -> Bad Risk (records 2, 7); Assets in {medium, high} -> Decision Node A. Node A: Savings in {low, medium} -> Good Risk (records 1, 4, 5, 8); Savings = high -> Decision Node B (records 3, 6). Node B: Assets = medium -> Bad Risk (record 3); Assets = high -> Good Risk (record 6).

  • Classification: k-Nearest Neighbor (kNN), decision tree (CART, C4.5), naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • The C4.5 Algorithm
    Proposed by Quinlan in 1993
    An internal node represents a test on an attribute; a branch represents an outcome of the test (e.g., Color = red); a leaf node represents a class label or class-label distribution.
    At each node, one attribute is chosen to split the training examples into distinct classes as much as possible.
    A new case is classified by following a matching path to a leaf node.

  • The C4.5 Algorithm: differences between CART and C4.5
    Unlike CART, C4.5 is not restricted to binary splits; it produces a separate branch for each value of a categorical attribute.
    C4.5's method for measuring node homogeneity is different from CART's.

  • The C4.5 Algorithm: measure. A candidate split S partitions the training data set T into subsets T1, T2, ..., Tk. C4.5 uses the concept of entropy reduction (information gain) to select the optimal split:

    entropy_reduction(S) = H(T) - H_S(T)

    where the entropy is H(X) = - Σ_i p_i log2 p_i (p_i is the proportion of records in class i), and H_S(T) = Σ_{i=1..k} (|T_i| / |T|) H(T_i) is the weighted sum of the entropies of the individual subsets T1, T2, ..., Tk. C4.5 chooses the optimal split as the one with the greatest entropy reduction. A sketch follows.
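
    A minimal Python sketch of the entropy-reduction computation above, reusing the credit-risk class labels from the CART example as illustrative data:

      import math
      from collections import Counter

      def entropy(labels):
          """H = -sum p_i log2 p_i over the class proportions of `labels`."""
          counts = Counter(labels)
          total = len(labels)
          return -sum((c / total) * math.log2(c / total) for c in counts.values())

      def entropy_reduction(labels, subsets):
          """H(T) minus the size-weighted entropy of the subsets produced by a split."""
          total = len(labels)
          weighted = sum(len(s) / total * entropy(s) for s in subsets)
          return entropy(labels) - weighted

      # Illustrative split of the credit-risk labels on Assets = low vs. the rest.
      all_labels = ['Good', 'Bad', 'Bad', 'Good', 'Good', 'Good', 'Bad', 'Good']
      low_assets = ['Bad', 'Bad']                                   # records 2, 7
      other      = ['Good', 'Bad', 'Good', 'Good', 'Good', 'Good']  # records 1, 3, 4, 5, 6, 8
      print(entropy_reduction(all_labels, [low_assets, other]))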

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • Bayes Rule: the recommender-system question
    L_i is the class for item i (i.e., that the user likes item i); A is the set of features associated with item i. We want to estimate p(L_i | A):

    p(L_i | A) = p(A | L_i) p(L_i) / p(A)

    We can always restate a conditional probability in terms of the reverse condition p(A | L_i) and two prior probabilities, p(L_i) and p(A). Often the reverse condition is easier to know: we can count how often a feature appears in items the user liked (frequentist assumption).

  • Naive Bayes
    Independence (naive Bayes assumption): the features a1, a2, ..., ak are independent.
    Joint probability: p(a1, ..., ak) = Π_i p(a_i)
    Conditional probability: p(a1, ..., ak | L) = Π_i p(a_i | L)
    Bayes' rule then gives p(L | a1, ..., ak) ∝ p(L) Π_i p(a_i | L). A sketch follows.
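
    A minimal Python sketch of the naive Bayes computation above on made-up (feature set, liked?) training pairs, with add-one smoothing (data and names are assumptions):

      from collections import Counter, defaultdict

      # Illustrative training data: each item is (set of features, liked?).
      items = [({'car', 'fast'},   True),
               ({'car', 'cheap'},  True),
               ({'book', 'cheap'}, False),
               ({'book', 'long'},  False),
               ({'car', 'long'},   True)]

      def train(items):
          prior = Counter(label for _, label in items)
          feature_counts = defaultdict(Counter)          # feature_counts[label][feature]
          for features, label in items:
              for f in features:
                  feature_counts[label][f] += 1
          return prior, feature_counts

      def classify(features, prior, feature_counts):
          total = sum(prior.values())
          scores = {}
          for label in prior:
              # p(L) times the product of p(a_i | L), with add-one (Laplace) smoothing.
              score = prior[label] / total
              for f in features:
                  score *= (feature_counts[label][f] + 1) / (prior[label] + 2)
              scores[label] = score
          return max(scores, key=scores.get)

      prior, counts = train(items)
      print(classify({'car', 'cheap'}, prior, counts))   # True: the user likely likes this item
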

  • An Example: compute all the probabilities required for classification from the training table. For class C = t and for class C = f, multiply the class prior by the conditional probabilities of the observed attribute values; C = t turns out to be more probable, so t is the final class.

  • Naive Bayesian Classifier
    Advantages: easy to implement; very efficient; good results obtained in many applications
    Disadvantages: the class-conditional independence assumption causes a loss of accuracy when it is seriously violated (highly correlated data sets)

  • Classification: k-Nearest Neighbor (kNN), decision tree, naive Bayesian, artificial neural network, support vector machine, ensemble methods

  • References for Machine Learning
    T. Mitchell, Machine Learning, McGraw-Hill, 1997.
    C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
    T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2001.
    V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
    Y. Kodratoff and R. S. Michalski, Machine Learning: An Artificial Intelligence Approach, Volume III, Morgan Kaufmann, 1990.

  • Recommendation Approaches
    Collaborative filtering: nearest-neighbor based, model based
    Content-based strategies: association rule mining, text similarity based, clustering, classification
    Hybrid approaches

  • The Netflix Prize (the slides here are from Yehuda Koren)

  • Netflix
    Movie rentals by DVD (mail) and online (streaming): 100k movies, 10 million customers
    Ships 1.9 million disks to customers each day from 50 warehouses in the US: a complex logistics problem
    Employees: 2000, but relatively few in engineering/software, and only a few people working on recommender systems
    Moving towards online delivery of content; significant interaction of customers with the Web site

  • The $1 Million Question

  • $1 Million Awarded September 21st, 2009

  • Lessons Learned
    Scale is important, e.g., stochastic gradient descent on sparse matrices
    Latent factor models work well on this problem; they had not previously been explored for recommender systems
    Understanding your data is important, e.g., time effects
    Combining models works surprisingly well, but the final 10% improvement can probably be achieved by judiciously combining about 10 models rather than 1000s; this is likely what Netflix will do in practice

  • Useful References
    Y. Koren, Collaborative filtering with temporal dynamics, ACM SIGKDD Conference, 2009.
    Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009.
    Y. Koren, Factor in the neighbors: scalable and accurate collaborative filtering, ACM Transactions on Knowledge Discovery from Data, 2010.

  • Thank you!

