

A Deployed People-to-People Recommender System in Online Dating

Wayne Wobcke, Alfred Krzywicki, Yang Sok Kim, Xiongcai Cai, Michael Bain, Paul Compton, and Ashesh Mahidadia

Online dating is a prime application area for recommender systems, as users face an abundance of choice, must act on limited information, and are participating in a competitive matching market. This article reports on the successful deployment of a people-to-people recommender system on a large commercial online dating site. The deployment was the result of thorough evaluation and an online trial of a number of methods, including profile-based, collaborative filtering, and hybrid algorithms. Results taken a few months after deployment show that the recommender system delivered its projected benefits.

Copyright © 2015, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

Recommender systems have become important tools helping users to deal with information overload and the abundance of choice. Traditionally these systems have been used to recommend items to users. The work we describe in this article concerns people-to-people recommendation in an online dating context. People-to-people recommendation is different from item-to-people recommendation: interactions between people are two-way, in that an approach initiated by one person to another can be accepted, rejected, or ignored. This has important implications for recommendation. Most basic, for each recommendation, there are two points in time at which the user must be satisfied. First, when a recommendation is given, the user must like the proposed candidate (as in item-to-people recommendation), but second, if the user contacts a presented candidate, the user should be satisfied by the candidate's reply (the user prefers a positive reply to a negative reply or no reply at all). Thus a people-to-people recommender system needs to take into account the preferences of proposed candidates, who determine whether an interaction is successful, in addition to those of the user. People-to-people recommenders are thus reciprocal recommenders (Pizzato et al. 2013). Or, putting this from the point of view of the user, a people-to-people recommender system must take into account both a user's taste (whom they find attractive) and their own attractiveness, so the presented candidates will find them attractive in return, and give a positive reply (Cai et al. 2010).

Using online dating sites can be difficult for many people. First, there is the stress of entering or reentering the dating market. Second, there are a variety of dating sites with competing or confusing claims as to their effectiveness in matching people with the perfect partner — a group of psychologists have recently drawn attention to the questionable validity of some of these claims (Finkel et al. 2012). Third, once joining a site, users may have to fill in a lengthy questionnaire or state preferences for their ideal partner, which can be daunting because if the user's objective is to find a long-term partner, they surely know they will have to compromise, but do not know at the outset what compromises they are likely to make. Fourth, as is well known, dating is a type of market, a matching market, where people compete for scarce resources (partners) and are, in turn, a resource sought by others — see the recent user perspective of a labor market economist on the dating market that explores this idea in depth (Oyer 2014). Finally, users are asked to provide a brief personal summary that must differentiate them from others on the site and attract interest. It is difficult to stand out from the crowd: many people's summaries appear cliché-ridden and generic, but according to at least one popular account, perhaps the main objective of the summary is simply to project an optimistic, carefree outlook that creates a positive first impression and stimulates further interest (Webb 2013).

Once onsite, users are immediately overwhelmed with an abundance of choice in potential partners. Moreover, they may also be the subject of intense interest from existing users now that their profile is public (often a site will heavily promote newly joined users). On most sites, users can choose from amongst thousands of others by sifting through galleries of photos and profiles. The most common way for users to find potential contacts is by advanced search (search based on attributes, such as age, location, education, and occupation, and other keywords). Searches typically return many results, and it may be difficult for users to decide whom to contact. As a result, some users contact those whom they find most attractive; however, this is a poor strategy for generating success, as those popular users then become flooded with contacts but can reply positively to only a very small fraction of them.

A basic problem for users is that it is hard for them to estimate their own desirability so as to choose contacts who are likely to respond positively, and similarly, it is impossible for them to know how much competition they face when contacting another person, and hence to gauge their likely chance of success. In addition, after narrowing down searches by obvious criteria, profile information typically provides limited information to discriminate one user from another. This is why a recommender system can provide much help to users of a dating site.

The organization of this article is as follows. In the next section, we outline the basic problems of recommendation in online dating and introduce our key metrics. Then we present our recommendation methods, and discuss a live trial conducted with the aim of selecting one method for deployment on site (Selection of Recommender for Deployment). The Recommender Deployment section contains details of the deployment process for the winning algorithm. A postdeployment evaluation of the method is then provided, followed by lessons learned.

Recommender Systems

Online dating is a prime application area for recommender systems, as users face an abundance of choice, must act on limited information, and are participating in a competitive matching market. Interactions are two way (a user contacts another, and either receives a positive or negative reply, or receives no reply at all), and thus the recommendation of a candidate to a user must take into account the taste and attractiveness of both user and candidate.

Due to the limited amount of user information on online dating sites (and, of course, users can misrepresent themselves to some extent to present themselves in a more favorable light), our recommender system does not aim to provide the perfect match. Rather, we adopt a probabilistic stance and aim to present users with candidates whom they will like and who will increase their chances of a successful interaction. Moreover, success is defined narrowly as the user receiving a positive reply to a contact they initiate. The main reason for this restrictive definition is that, for the dating site we were working with, users can make initial contact for no charge by sending a message drawn from a predefined list, and replies (also free and from a predefined list) are predetermined (by the dating site company) as either positive or negative, leaving no room for ambiguity. Thus we have extremely reliable data for this notion of success. Users can initiate paid open communication to others, with or without these initial contacts. However, the dating site company does not have any data on the success of open communications, or any feedback from person-to-person meetings. Thus the typical user interaction sequence we focus on is illustrated in figure 1.

A secondary objective of our recommender system is to increase overall user engagement with the site. If users become frustrated by being unable to find successful interactions, they may leave the site, resulting in an increased attrition rate of users and a reduced candidate pool. By providing opportunities for successful interactions, a recommender system can help maintain a large candidate pool for others to contact. The idea was that this could be achieved by designing the recommender system to promote users who would otherwise not receive much contact, increasing their engagement with the site.

These considerations led us to define two key metrics for the evaluation of recommendation methods: success rate improvement and usage of recommendations. These metrics can apply to an individual, but more commonly we apply them to groups of users. Success rate improvement is a ratio measuring how much more likely users are to have a successful interaction from a recommendation compared to their overall behavior. That is, success rate improvement is a ratio of success rates: the success rate for contacts initiated by users from recommendations divided by the success rate for all contacts of those users. Usage of recommendations measures the proportion of user-initiated contacts that come from recommendations. This is simply the number of user-initiated contacts from recommendations divided by the total number of user-initiated contacts. We also use similar metrics for positive contacts only and for open communications.
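Written as formulas, for a user or group of users over an evaluation period, these definitions amount to:

\[
\text{success rate improvement} =
\frac{\#\{\text{positive replies to contacts from recommendations}\} \;/\; \#\{\text{contacts from recommendations}\}}
     {\#\{\text{positive replies to all contacts}\} \;/\; \#\{\text{all contacts}\}},
\qquad
\text{usage of recommendations} =
\frac{\#\{\text{contacts from recommendations}\}}{\#\{\text{all contacts}\}}.
\]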

[Figure 1. Typical User Interaction Sequence: a free predefined communication ("I'd like to get to know you, are you interested?") and predefined reply ("Thanks! I would like to hear from you."), followed by paid open communication ("Hi there!").]

It is useful to draw an analogy with information retrieval, where precision and recall are metrics used to evaluate document retrieval methods. Success rate improvement is similar to precision, which measures the proportion of documents out of those presented that are relevant to the user. Recall measures the proportion of the relevant documents that are retrieved by the method. We make two observations. First, recall is not the same as usage of recommendations, since typically the recall metric is applied when every document is known as either relevant or not relevant (so the maximum recall is 100 percent). In a recommendation context, not presenting a candidate is not the same as failing to recall a relevant document, since it is unknown whether a candidate not presented would be of interest to the user. A related issue is that only a very small number of candidates are ever presented to users, not (as in information retrieval) the set of all documents returned by the method. Thus the maximum value for usage of recommendations can never be 100 percent, and in reality is much lower. As a consequence of presenting only very few candidates, the ranking of candidates is highly important: it is much more useful for a recommendation method to generate good results for the top N candidates whom users will be shown, rather than being able to produce all candidates of interest. Thus in our historical data analysis, we emphasize success rate improvement and usage of recommendations for the top N candidates, with N typically in a range from 10 to 100.

Our second observation is that, similar to information retrieval, there is a trade-off between success rate improvement and usage of recommendations. Intuitively, to achieve higher success rate improvement, a method should prefer to generate candidates who say yes more often. However, these candidates are typically not liked as much as other candidates, giving a lower usage of recommendations. Similarly, to achieve higher usage of recommendations, a method should prefer to generate the most attractive candidates, whom users are more certain to like; however, these candidates are more likely to respond negatively or not at all, giving a lower success rate improvement. Thus it is crucial for a recommendation method to find a suitable balance between success rate improvement and usage of recommendations, and much of our research was directed towards finding this balance for a variety of methods. A major difficulty we faced, however, was that it was impossible to determine the right balance from historical data alone, which meant that a live trial was necessary in order to properly evaluate our methods.

In the context of online dating, there are a number of other considerations, which while important, are even more difficult to take into account in the design of a recommender system due to the lack of available data for analysis. One is that the cannibalization of search by recommendation may mean that, while overall user experience might be improved, revenue might not increase because users would spend the same amount of money, but find matches through recommendation rather than search. This seemed to be a general concern with online content delivery prevalent in the media industry. To the contrary, our hypothesis was that an improved user experience would result in more users spending more time on site, and lead to increased revenue. Another consideration is that promoting the engagement of women on the site is important, because even if (some) women do not generate revenue directly, they generate revenue indirectly by constituting the candidate pool for the generally more active men to contact. Assessing the likely performance of a recommender on this criterion was impossible with historical data, and difficult even with data from the trial.

Recommendation Methods

We developed and evaluated a number of recommendation methods that provide ranked recommendations to users that help them increase their chances of success and improve engagement with the site. Our methods are the result of several years of research on people-to-people recommenders starting in 2009, when we developed a number of profile-based and collaborative filtering (CF) methods applied to people-to-people recommendation (Kim et al. 2010, Krzywicki et al. 2010). The most important requirement of our method is scalability: in a typical scenario, the recommender system needed to generate around 100 ranked recommendations for hundreds of thousands of active users by processing several million contacts, using standard hardware in a time period of around 2 hours. As an assessment of standard probabilistic matrix factorization at the time indicated that this method could not scale to data of this size, and moreover could not handle the dynamic and incremental nature of recommendation generation, we focused on simpler methods that would be applicable to problems of this size.

Preliminary evaluation of methods was done on historical data provided by the dating site company; however, there was no guarantee that this evaluation would transfer to the setting of a deployed recommender. This is because evaluation on historical data is essentially a prediction task, the objective being to predict the successful interactions that users found by search, that is, in the absence of a recommender. Since a recommender system aims to change user behavior by showing candidates users could perhaps not easily find using search, predicting the positive contacts the user did have is only a partial indication of the quality of the recommendations.

We conducted a first trial of two methods in 2011 over a 9 week period, where recommendations were delivered by email (Krzywicki et al. 2012). The important results of this trial were that: (1) the performance of the methods in the trial setting was consistent with that on historical data, giving us confidence in our methodology, and (2) both methods were able to provide recommendations of consistent quality over a period of time. More precisely, what the first trial showed was that: (1) though success rate improvement and usage of recommendations did not have identical values to those obtained on historical data, broad trends were stable (if one method performed better in the historical setting, it also performed better in the trial setting), and (2) the values of the metrics were roughly similar over different intervals of time within the 9 week trial period.

The CF method used in the first trial did not address the cold start problem (recommendation to and of new users), because we wanted to establish whether a basic form of CF, with similarity based only on common positive contacts, would work in this domain. As the trial results showed that this form of CF worked very well, we developed numerous hybrid CF methods that could recommend to almost all users while maintaining a high success rate (Kim et al. 2012a). This greatly improved the utility of the methods, since providing good recommendations to a user on the day of joining the site was considered an effective means of capturing their interest. This article reports on the trial of our best four methods on the same online dating site in 2012, where users were able to click on their recommendations in the browser.

One serious problem that the CF method used in the first trial did address is the tendency of standard CF to recommend highly popular users. As mentioned above, many users contact the most attractive people, with a low chance of success. More concretely, if we define a popular user as one who has received more than 50 contacts in the previous 28 days, then (1) almost everyone is nonpopular, (2) 38 percent of contacts are to popular users, but (3) the success rate of those contacts is only 11 percent, whereas the success rate of contacts to nonpopular users is 20 percent. Thus there is a serious imbalance in communication patterns that CF-style recommenders tend to reinforce. We addressed this problem by developing a sequential recommendation process, standard CF cascaded (Burke 2002) with a decision tree critic, where the candidates produced by CF are reranked by multiplying their score (an expected success rate improvement derived from their rating) by a weighting less than 1 for those candidates with a strong likelihood of an unsuccessful interaction with the user. The weightings are derived systematically from decision tree rules computed over a large training set of contacts that includes temporal features such as the activity and popularity of users. An effect of this reranking is to demote popular candidates, so that they are not over-recommended. Multiplying scores by rule weights is justified by Bayesian reasoning (Krzywicki et al. 2012).
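To make the cascade concrete, here is a minimal sketch of the critic step: each candidate's CF score is multiplied by a weight below 1 whenever a rule flags the user-candidate pair as likely to be unsuccessful. The function names, the rule representation, and the example weight are schematic illustrations, not the rules actually learned by C5.0 for the deployed system.

# Sketch of the CF + decision tree critic cascade (illustrative only).
def rerank_with_critic(candidates, critic_rules, user):
    """candidates: list of (candidate, cf_score) pairs; critic_rules: list of
    (predicate, weight) pairs, with weight < 1 for likely unsuccessful matches."""
    reranked = []
    for candidate, cf_score in candidates:
        weight = 1.0
        for predicate, rule_weight in critic_rules:
            if predicate(user, candidate):
                weight = min(weight, rule_weight)
        reranked.append((candidate, cf_score * weight))
    # Risky (for example, highly popular) candidates are demoted, not removed.
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical critic rule: demote candidates with > 50 contacts in the last 28 days.
example_rules = [
    (lambda user, cand: cand.get("contacts_last_28_days", 0) > 50, 0.5),
]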

Summary of Methods

After the first trial, we refined the profile-based matching algorithm, called Rules, and developed three new CF methods, all able to generate candidates for almost all users. The CF methods are of increasing complexity, though all designed to meet the stringent computational requirements imposed by the dating site company, more specifically: (1) be feasible to deploy in commercial settings, (2) be able to generate recommendations in near real time, (3) be scalable in the number of users and contacts, and (4) work with a highly dynamic user base. Our four trialled methods are designed to make various trade-offs between computational complexity, success rate improvement, usage of recommendations, and diversity of candidates (the number of distinct candidates and the distribution of their recommendations). The methods are summarized in table 1 for reference, and these design trade-offs are discussed further below. Figure 2 shows the generation of candidates in all four methods pictorially. The diagrams do not show the application of the decision tree critic in SIM-CF and RBSR-CF, nor how the candidates are ranked.


Figure 2. Recommendation Methods.

Candidate C can be recommended to user U if: (Rules) C satisfies the conclusion Q of the rule P ⇒ Q generated for U; (SIM-CF) C has replied positively to a user similar in age and location to U; (RBSR-CF) C has replied positively to a user who initiated a successful interaction with a candidate recommended by Rules to U; (ProCF+) C is a similar candidate to either a Rules recommendation for U or a user who has replied positively to U, taking into account successful and unsuccessful interactions with those potential candidates.


Compatible Subgroup Rules (Rules)

This recommender works by dynamically constructing rules for each user of the form: if u1, …, un (condition) then c1, …, cn (conclusion), where the ui are profile features of the user and the ci are corresponding profile features of the candidate (Kim et al. 2012b). If the user satisfies the condition of such a rule, any candidate satisfying the conclusion can be recommended to the user. Candidates are ranked based on match specificity and their positive reply rate.

Each profile feature is an attribute with a specific value, for example, age = 30–34, location = Sydney (each attribute with a discrete set of possible values). An initial statistical analysis determines, for each possible attribute a and each of its values v (taken as a sender feature), the best matching values for the same attribute a (taken as receiver features), here treating male and female sender subgroups separately. For example, for males the best matching values for senders with feature age = 30–34 might be females with age = 25–29.
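A minimal sketch of how a generated rule could be applied is given below. The attribute names, the best-match table, and the ranking by positive reply rate alone (standing in for the combination of match specificity and reply rate described above) are schematic assumptions rather than the statistical analysis actually used.

# Sketch of applying a compatible subgroup rule (illustrative only).
def rules_candidates(user, users, best_match):
    """best_match maps (gender, attribute, sender value) to the best matching
    receiver value for that attribute, as determined by the initial analysis."""
    conclusion = {
        attr: best_match[(user["gender"], attr, value)]
        for attr, value in user["profile"].items()
        if (user["gender"], attr, value) in best_match
    }
    candidates = [
        c for c in users
        if c["id"] != user["id"]
        and all(c["profile"].get(attr) == value for attr, value in conclusion.items())
    ]
    # Rank candidates by their positive reply rate (higher is better).
    return sorted(candidates, key=lambda c: c.get("positive_reply_rate", 0.0), reverse=True)

# Hypothetical entry: male senders aged 30-34 best match receivers aged 25-29.
best_match_example = {("male", "age_band", "30-34"): "25-29"}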

This method can recommend and provide candidates to a wide range of users, as, in contrast to the pure form of CF used in the first trial, users and candidates are not restricted to have prior interactions. The drawbacks are the high computational cost of computing subgroup rules and the lower success rate improvement.

Profile-Based User Similarity CF (SIM-CF)

In contrast to the CF recommender used in the first trial, where similarity of users was defined using their common positive contacts (Krzywicki et al. 2010), similarity of users in SIM-CF is defined using a very simple profile-based measure based only on age difference and location (Kim et al. 2012a).

Definition 1. For a given user u, the class of similar users consists of those users of the same gender and sexuality who are either in the same 5-year age band as u or one age band either side of u, and who have the same location as u.

Age and location are used as the basis of similarity since these two attributes are the most commonly used in searches on the site. The data shows that successful interactions are far more likely between people with at most a 10-year age difference than between those with a greater age difference. Similarly, location is not arbitrary but designed to capture regions of similar socioeconomic status. Thus there is reason to believe that users similar under this measure will encompass a large number of users with similar behavior and socioeconomic status. Note that we investigated a number of more sophisticated definitions of user similarity, but surprisingly, none of them gave any improvement in the context of SIM-CF so were not pursued further (however, see the discussion of ProCF+ below, where both successful and unsuccessful interactions are used to define user similarity).

SIM-CF is standard user-based CF (figure 2) and works by finding users similar to a given user and recommending the contacts of those users. Candidates are first rated by the number of their successful interactions that are initiated by users similar to the target user, then reranked using the decision tree critic, which demotes candidates with a high likelihood of an unsuccessful interaction. Historical data analysis showed that this would give a higher success rate improvement and diversity of candidates (Krzywicki et al. 2012).
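The following compact sketch covers both steps: similar implements Definition 1, and sim_cf_candidates rates each candidate by the number of similar users who had a successful interaction with them; the decision tree critic (sketched earlier) would then rerank the result. The record fields and the positive_contacts mapping (user id to the set of ids of candidates who replied positively to that user's contacts) are schematic assumptions, not the deployed data model.

from collections import Counter

def similar(u, v):
    """Definition 1: same gender and sexuality, same or adjacent 5-year age band,
    and same location (illustrative field names)."""
    return (u["gender"] == v["gender"]
            and u["sexuality"] == v["sexuality"]
            and abs(u["age_band"] - v["age_band"]) <= 1
            and u["location"] == v["location"])

def sim_cf_candidates(user, users, positive_contacts, top_n=50):
    """Rate candidates by the number of successful interactions initiated
    by users similar to the target user."""
    scores = Counter()
    for other in users:
        if other["id"] != user["id"] and similar(user, other):
            for candidate_id in positive_contacts.get(other["id"], set()):
                scores[candidate_id] += 1
    return scores.most_common(top_n)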

Rule-Based Similar Recipients CF (RBSR-CF)

RBSR-CF calculates similarity for a user u based on the successful interactions of other users with candidates generated by the Rules recommender for u, then applies CF to generate and rank candidates (Kim et al. 2012a), as in a content-boosted method (Melville, Mooney, and Nagarajan 2002). In figure 2, Rules recommendations are shown as a double arrow labeled R. As in SIM-CF, candidates are ranked using the number of successful interactions with users similar to the target user, and the decision tree rules are used for reranking. A strength of this method is that it provides a greater diversity of candidates than SIM-CF with a similar success rate improvement, but with the drawback of a higher computational complexity and lower usage of recommendations.
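Read this way, RBSR-CF can be sketched as the same scoring loop with a different similarity test: another user counts as similar if they had a successful interaction with one of the Rules candidates generated for the target user. As before, the names and data structures are schematic illustrations of this idea, and the decision tree critic is applied to the result afterwards.

from collections import Counter

def rbsr_cf_candidates(user, users, positive_contacts, rules_recs, top_n=50):
    """rules_recs maps a user id to the ids of candidates generated by Rules;
    those pairs are treated as pseudo-successful interactions for similarity."""
    pseudo_positives = set(rules_recs.get(user["id"], []))
    scores = Counter()
    for other in users:
        if other["id"] == user["id"]:
            continue
        # Similar: the other user succeeded with a candidate Rules recommended to us.
        if pseudo_positives & positive_contacts.get(other["id"], set()):
            for candidate_id in positive_contacts.get(other["id"], set()):
                scores[candidate_id] += 1
    return scores.most_common(top_n)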


Rules     Profile matching method optimizing user coverage and diversity.
SIM-CF    Profile-based user similarity with CF, cascaded with decision tree rules that conservatively demote candidates likely to result in unsuccessful interactions.
RBSR-CF   Content-boosted CF method exploiting user-candidate pairs generated by Rules, treated as successful interactions to define user similarity.
ProCF+    Hybrid method combining Rules and ProCF, using a probabilistic user similarity function based on successful and unsuccessful interactions to optimize success rate improvement.

Table 1. Methods Evaluated in Online Trial.


Probabilistic CF+ (ProCF+)

ProCF (Cai et al. 2013) uses a more sophisticated model of user similarity than SIM-CF, derived from successful and unsuccessful interactions (denoted with arrows labeled + and – in figure 2). As with RBSR-CF, ProCF+ uses actual interactions augmented with user-candidate pairs generated by the Rules recommender treated as successful interactions, applying CF to generate and rank candidates (as with SIM-CF and RBSR-CF but without the decision tree critic). The main advantage of ProCF+ is the higher success rate improvement than SIM-CF and RBSR-CF (due to the more accurate calculation of user similarity), but this comes with a higher computational cost and lower usage of recommendations. In addition, ProCF+ generates more user-candidate pairs further apart in geographical distance: the data suggests that these particular matches are likely to be successful, despite the fact that long-distance matches generally are unlikely to be successful.

Baseline Method

In addition to our four methods, a number of profile-based proprietary methods were trialled, built around matching heuristics and individual contact preferences. One method based on profile matching was agreed as a baseline for comparison with our algorithms. However, as recommendations for this method could not be recorded, comparison of our methods to the baseline covers only contacts and open communications.

Selection of Recommender for Deployment

A live trial of recommenders was conducted as a close collaboration between researchers and the dating site company, and treated by the company as a commercial project with strictly defined and documented objectives, requirements, resources, methodology, key performance indicators, and metrics all agreed in advance. The main objective of the trial was to determine if a novel recommender could perform better than the baseline method, and if so, to select one such recommender for deployment. Aside from an increase in revenue, the company was aiming to improve overall user experience on the site, and to respond to competitor site offerings of similar functionality.

Considerable time and effort of the company was dedicated to the proper conduct of the trial, including project management, special software development, and additional computational resources. A whole new environment was created, including a separate database containing generated recommendations, impressions, and clicks for all methods, running alongside the production system so as to minimally impact system performance.

Trial Methodology

Each of the methods described in the Recommendation Methods section received 10 percent of all site users, including existing and new users joining in the period of the trial. To avoid cross-contamination of user groups, once a user was assigned to a group, they remained in the same group for the duration of the trial. Thus the proportion of new users in each group increased over the course of the trial. The recommenders were required to compute recommendations daily, and hence provide recommendations to new users with very limited training data.

After a brief period of onsite testing and tuning, the trial was conducted over 6 weeks, from May to mid-June 2012. In contrast to the first trial, recommendations were allowed to be repeated from day to day, with the restriction not to generate candidates with whom the user had had a prior interaction. Our recommenders generated candidates on the day they were delivered, using an offline copy of the database created that morning; thus training data was one day out of date. In contrast, the baseline method generated and delivered recommendations on the fly. In consequence, the baseline method could recommend users who had joined the site after our recommenders had been run and make use of data unavailable to our recommenders, giving it some advantage. The number of recommendations generated was limited to the top 50 candidates for each user.

Candidates for each user were assigned a score, which, for our recommenders, was the predicted likelihood of the interaction being successful. Candidates were displayed on a number of user pages, four at a time, with probability proportional to their score, and users could see more candidates by clicking on an arrow button on the interface.
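One simple way to show candidates with probability proportional to their score is sequential weighted sampling without replacement, as in the sketch below; this is an illustration of the idea, not necessarily how the site implemented the display logic.

import random

def pick_candidates_to_display(scored_candidates, page_size=4):
    """scored_candidates: list of (candidate, score) with nonnegative scores.
    Draws up to page_size candidates without replacement, weighted by score."""
    pool = list(scored_candidates)
    page = []
    while pool and len(page) < page_size:
        total = sum(score for _, score in pool)
        if total <= 0:
            # All remaining scores are zero: just take the next candidates.
            page.extend(candidate for candidate, _ in pool[: page_size - len(page)])
            break
        r = random.uniform(0, total)
        cumulative = 0.0
        for i, (candidate, score) in enumerate(pool):
            cumulative += score
            if r <= cumulative:
                page.append(candidate)
                pool.pop(i)
                break
    return page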

Trial Metrics

A set of primary metrics was agreed between the research group and the company before the trial in a series of meetings. There are two types of primary metric, corresponding to the results presented in tables 2 and 3. The first set of metrics is used for comparing the user groups allocated the recommendation methods to the baseline user group, and focuses on usage of recommendations. The second set of metrics is used to compare user behavior within the same group, focusing on success rate improvement over search and usage of recommendations, measured as the proportion of contacts or open communications initiated from recommendations.

A third group of metrics (table 4) consists of additional metrics, determined only after the trial to be of relevance in assessing the increase in user engagement with the site due to the recommenders. Note that it is not that these metrics could not have been defined before the trial; rather, their usefulness only became evident after the trial. These metrics cover contacts and open communications to the majority of nonhighly popular users (influencing overall user engagement), contacts and open communications initiated by women (related to maintaining the pool of women on the site), and age/location differences between users and candidates (since some users reacted strongly when receiving candidates very different in age or location from their stated preferences).

Trial Results and Selection of Best Method

Data from the trial for final evaluation of the recommendation methods was collected two weeks after the end of the trial to count responses to messages initiated during the trial and to count open communications resulting from recommendations delivered during the trial. Note that due to the considerable variation in outlier behavior (a small number of highly active members), the top 200 most active users from the whole trial were removed from the analysis.

Table 2 summarizes the results for comparison of the recommender groups with the baseline group. These metrics are percentage lifts in overall behavior, and reflect how much users act on recommendations. Here SIM-CF produced the best results. The increase in open communications is important, because this is directly related to revenue. Even a small increase in this measure was considered significant from the business perspective.

                                        Rules   SIM-CF   RBSR-CF   ProCF+
Lift in contacts per user               3.3%    10.9%     8.4%     -0.2%
Lift in positive contacts per user      3.1%    16.2%    10.4%      5.6%
Lift in open communications per user    4.3%     4.8%     3.7%      0.8%

Table 2. Comparison of Recommender Groups with Baseline Group.

The second set of metrics for comparing user behavior within groups (table 3) includes success rate improvement and usage of recommendations, which here measure the proportion of the behavior of users in the same group produced by recommendations. Again SIM-CF performed the best, while ProCF+ showed the best success rate improvement, consistent with historical data analysis. What is most surprising is the higher than expected usage of recommendations for Rules and the lower than expected performance of ProCF+ on these metrics (that is, expected from historical data analysis), suggesting that ProCF+ is overly optimized to success rate improvement. Also interesting is that, while ProCF+ users make heavy use of the recommendations, their overall increase in behavior is much less, suggesting some cannibalization of search behavior by the recommender, whereas in the other groups, the recommenders result in more additional user behavior. The potential for cannibalization of search by the recommenders was a major concern to the dating site company, because if this happened, overall revenue would not increase.

                                            Rules   SIM-CF   RBSR-CF   ProCF+
Success rate improvement over search        11.2%   94.6%    93.1%     133.5%
Contacts from recommendations                8.1%   11.8%     9.9%       8.2%
Positive contacts from recommendations       8.9%   20.7%    17.5%      17.2%
Open communications from recommendations     8.1%   18.2%    14.8%      13.4%

Table 3. Comparisons Within Groups.

The additional metrics (table 4) relate more to long-term user experience, user satisfaction with the recommendations, and maintaining overall user engagement. It is understood that the metrics do not capture these qualities directly, and that the interpretation of the results is more subjective.

                                                    Rules   SIM-CF   RBSR-CF   ProCF+
Contacts with no reply                              33.0%   26.1%    27.1%     27.3%
Positive contacts to nonpopular users               85.7%   62.0%    64.4%     63.9%
Positive contacts by women                          33.1%   33.4%    30.0%     27.0%
Recommendations with age difference > 10 years       0.1%    3.3%     3.2%      8.3%
Average/median distance in km in recommendations    91/20   106/20   384/40    478/50

Table 4. Additional Metrics.

The first metric is the proportion of contacts from recommendations without any reply; the next relates to contacts to nonpopular users. The importance of these metrics is that many contacts, typically those to popular users, go without a reply, potentially discouraging users. It was felt that even a negative reply would make the user more engaged with the site. On these metrics, all of our CF methods perform very well, as all are designed not to overrecommend popular users (thus avoiding contacts with no response). For SIM-CF and RBSR-CF, this is due to the use of the decision tree rules that demote popular users in the rankings; for ProCF+ due to the use of unsuccessful interactions in the calculation of user similarity. Though SIM-CF is best on the proportion of contacts with no reply, the other CF methods are ahead on contacts to nonpopular users. This may suggest that when popular users are recommended by SIM-CF, they are slightly more likely to generate a reply.

The third metric concerns usage of recommendations by women. Women are often thought of as being passive on online dating sites; however, this is not the case. Women are more selective in their contacts and are thus typically less active than men. SIM-CF is clearly the best method for encouraging actions initiated by women.

The remaining metrics relate to general user perception and trust in the reliability of the recommender system, as candidates shown that were outside a user's stated preferences were sometimes regarded by users as faults in the system (these candidates were generated because data indicates that many of them will be successful). Some simple measures of age and location differences were calculated for the whole set of recommendations generated. Rules is the best method on these metrics, while of the CF methods, RBSR-CF is superior on age difference and SIM-CF on location difference. ProCF+ has the highest proportion of recommendations with an age difference of more than 10 years, which, since it also has the highest success rate lift, may suggest that these recommendations have a high success rate.

On the basis of this evaluation, SIM-CF was selected as the method for deployment. It has the best score on all primary metrics except success rate improvement, the smallest proportion of contacts with no reply, and the best proportion of contacts initiated by women. This method also gives a balanced approach to age and location differences due to how user similarity is calculated. The values for the other metrics were lower than for other methods, but not deemed to be substantially lower. Also of importance was the fact that SIM-CF was the least complex CF method to implement, with no dependencies on other processes, whereas RBSR-CF and ProCF+ both depend on Rules.

Recommender Deployment

The initial SIM-CF implementation shown in figure 3 was used for evaluation on historical data and in the trial. SIM-CF generates ranked recommendations in a two-stage process. The first stage involves generating candidates with a preliminary score using profile-based user similarity; the second stage involves using decision tree rules computed from a larger training set to weight the scores produced in the first stage (Krzywicki et al. 2012). Note that the decision tree rules are the same on each run of the recommender, since retraining the decision tree is done only as needed. Hence this step of the process is comparatively simple.

[Figure 3. SIM-CF Trial Implementation: user-based CF (generate similarity table, generate recommendation table) as an Oracle 11g batch process run overnight, followed by the critic (apply decision tree rules) as Oracle PL/SQL derived from a C5.0 decision tree computed once.]

SIM-CF used an Oracle 11g database to store tables for the user similarity relation and for recommendations. In the trial context, each table had several tens of millions of rows, well within performance requirements. The reason for using Oracle was that this is the database system used by the dating site company. This implementation also enabled us to experiment extensively with variations of the different methods. Our implementation was efficient and robust enough to be used in the trial and in the initial deployment environment. The decision tree was constructed using C5.0 from RuleQuest Research, with rules derived from the decision tree converted into Oracle PL/SQL.

Deployment Process

The initial implementation of SIM-CF was high-quality, robust software, suitable for research and development and a rigorous trial for 10 percent of the users, but it was not suited to the production environment due to its heavy reliance on the database. The company decided that the method could be implemented in Java with the use of in-memory tables to provide near-real-time recommendations on a per user basis as needed. Here near real time means that recommendations are computed and delivered to users after they take an action to move to a new page that contains recommendations, with no appreciable delay in page loading time. This required a reimplementation of SIM-CF and integration with the production system (figure 4). This had to be done with minimal cost and impact on the existing back-end system, which served millions of online customers.

[Figure 4. SIM-CF Deployment Implementation: the same user-based CF and critic pipeline, with Java in-memory tables, real-time candidate generation, and the decision tree rules implemented online in Java.]

Some changes were made to simplify the SIM-CF user similarity function based on age bands and location by calculating the exact age difference and estimating the distance between users. This allowed SIM-CF to provide more candidates, since, if the number of similar users based on the same location as the user was insufficient, users further away could be used to generate candidates. Another change was to augment successful interactions with open communications for generation of candidates, after some experiments confirmed that this would slightly improve the results.

The whole development and deployment process took around 3.5 months and was done using incremental releases and testing. The actual design and development, including decisions about changes, was done by the dating site company. The role of the research group was to advise on the changes and provide detailed information about the SIM-CF method. The following describes the deployment timeline:

Mid-June 2012: Trial ended and analysis and selection of recommender commenced.

August 2012: SIM-CF was selected and was switched to deliver recommendations to all users from the offline database, except for the first day of a user's registration when recommendations were supplied by the baseline method. At the same time, design and development of the in-memory version of SIM-CF started.

September 2012: The first version of in-memory SIM-CF started to deliver recommendations to 60 percent of users, working in parallel with the original trial version for the next couple of weeks and still using the offline database.

October 2012: The in-memory version was switched to 100 percent of users, still running from the offline database.


November 2012: The production version of SIM-CF started to run from the live database. As the recommender started to use the online database, recommendations could be generated more often, covering newly joined users and eliminating the need for recommendations for new users generated by the baseline method. An initial concern was that memory usage might be too high; however, a careful design ensured that the memory requirement was within reasonable limits.

The dating site company indicated that the recommender system is stable and does not require any immediate additional maintenance related to the method itself. The decision tree rules have been tested against several data sets from different periods and have given consistent results. Therefore there is currently no provision to determine when the decision tree rules need to be updated. If such a need occurs in the future, it would not be difficult to update the rules.

Postdeployment Evaluation

In this section, we compare the results from the trial and posttrial deployment to show how the benefits of the SIM-CF recommender established during the trial are maintained in the deployed setting. Our comparison covers the key metrics discussed in the Trial Results and Selection of Best Method section and is based on data from three months collected between November 2012 (after the production version started using the live database) and February 2013. As in the trial analysis, we allowed 2 extra weeks for collecting responses to messages and open communications to candidates recommended during the three months.

Table 5 compares posttrial deployment metrics to those shown previously for the trial (repeated in the first column), except there is one important difference. In the trial setting, the first group of primary metrics compared the recommender group to the baseline group. Now, since all users have SIM-CF, there is no baseline group.

Therefore we calculate the lift in various measures for the deployment setting with respect to data from November 2011 to February 2012, when the baseline recommender was in use. The reason for using this period of time is that there is no need to adjust for seasonal effects (typically the values of such metrics vary throughout the year). Though inexact, this gives us reasonably high confidence that the group metric results are maintained after deployment.
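Concretely, for each measure m in the first block of table 5 (such as contacts per user), the deployment figure can be read as a relative lift with respect to the reference period:

\[
\text{lift}(m) = \frac{m_{\text{Nov 2012--Feb 2013}} - m_{\text{Nov 2011--Feb 2012}}}{m_{\text{Nov 2011--Feb 2012}}} \times 100\%.
\]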

The next set of primary metrics, concerning usage of recommendations, shows a drop in success rate improvement in the posttrial deployment setting but an increase in usage of recommendations. The exact reasons for these changes are unknown, but could be due to the modifications to the original SIM-CF method (see the Recommender Deployment section), which were made with a view to increasing contacts at the cost of slightly lowering success rate improvement.

The final section of table 5 compares the trial and posttrial deployment values for the additional metrics. The values of all metrics improved since the trial. One reason for this could be that recommendations are generated from an online database (as opposed to the offline database used during the trial), thus covering new users soon after they join the site. Providing recommendations to new users at this time is very important as they are less experienced in their own searches and eager to make contacts.

                                              Trial   Deployment
Comparison with Baseline Group
  Lift in contacts per user                   10.9%      10.3%
  Lift in positive contacts per user          16.2%      12.4%
  Lift in open communications per user         4.8%       7.3%
Comparisons Within Groups
  Success rate improvement over search        94.6%      88.8%
  Contacts from recommendations               11.8%      18.6%
  Positive contacts from recommendations      20.7%      30.2%
  Open communications from recommendations    18.2%      28.3%
Additional Metrics
  Contacts with no reply                      26.1%      24.9%
  Positive contacts to nonpopular users       62.0%      63.5%
  Positive contacts by women                  33.4%      39.3%

Table 5. Comparison of Trial and Deployment Metrics.


Lessons Learned

Looking over the whole period of this project from inception to deployment, we identify several major lessons learned during the process of the development and deployment of an AI application in a commercial environment, which we believe to be general but also particularly relevant to the field of recommender systems. These lessons can be summarized as: (1) the results of evaluation on historical data do not necessarily translate directly to the setting of the deployed system, since deployment of the system changes user behavior, (2) commercial considerations go far beyond simple metrics used in the research literature, such as precision, recall, mean absolute error, or root mean squared error, and (3) computational requirements in the deployment environment, especially scalability and runtime performance, determine what methods are feasible for research (in our case, collaborative filtering methods that are popular with researchers, such as types of matrix factorization, were infeasible for the scale of our problem in the deployed setting).

Historical Data Analysis Insufficient

The fundamental problem with evaluation of a recommendation method using historical data is that what is being measured is the ability of the method to predict user behavior without the benefit of the recommender (in our case, behavior based on search). There is no a priori guarantee that such results translate to the setting of deployment, where the objective is to change user behavior using the recommender. Critical was the first trial (Krzywicki et al. 2012), where we learned that, though the values of our metrics from the trial were not the same as those on historical data, overall trends were consistent, meaning that evaluation on historical data was a reliable indicator of future recommender performance.

After we had developed our best methods, the trial reported in this article was essential for selecting the method for deployment, due to the impossibility of choosing between the methods using historical data analysis alone.

Another facet of the problem is that typically evaluations on historical data consider only static data sets. The highly dynamic nature of the deployed system is ignored, in particular the high degree of change in the user pool as users join or leave the site, and the requirement for the recommender to generate candidates over a period of time as users change their overall interaction with the system. Both trials showed that our methods were capable of consistent performance over an extended period of time with a highly dynamic user base.

Overly Simplistic Metrics

Concerning metrics, our basic observation is that the research literature over-emphasizes simple metrics that fail to capture the complexity of the deployment environment. Simple metrics are usually statistical measures that aggregate over a whole user base, so do not adequately account for the considerable variation between individual users. In our case, some measures can be dominated by a minority of highly active users. However, a deployed system has to work well for all users, including inactive users for whom there is little training data. Moreover, often these metrics are used to aggregate over all potential recommendations; however, what matters are only the recommendations the user is ever likely to see (the top N candidates), not how well the method predicts the score of lower ranked candidates.

We found particularly useful a prototype that we developed to enable us to visually inspect the recommendations that would be given to an individual user, to see if those recommendations might be acceptable. In this way, we identified very early the problem of relying only on the simple metric of success rate improvement, which tended to result in recommendations that were all very similar and which may not have been of interest to the user. Thus even considering simple metrics, what was needed was a way of taking into account several metrics simultaneously, involving design trade-offs in the recommendation methods.

Further, the company deploying the recommender was of course interested in short-term revenue, but also in improving the overall user experience which, it was understood, would lead to increased engagement with the site and potentially more revenue in the long term. However, the simple metrics used in the literature can be considered only proxies indirectly related even to short-term revenue, so much interpretation and discussion was needed to understand the impact of the recommenders on user experience (which motivated the additional metrics described above). The company chose the method for deployment by considering a range of metrics covering both short-term revenue and user experience.

Commercial Feasibility

Our final point is that there is often a large gap between typical research methodology and commercial requirements. Our project was successful because we took seriously the requirements of the deployment environment and focused research on those methods that would be feasible to trial and deploy. The alternative approach of developing a research prototype (without considering feasibility), then treating the transfer of that prototype to an industrial context as merely a matter of implementation, would not have worked.

Even so, the research environment has different requirements from the deployment environment, which means that some reimplementation of the research system is almost inevitable for deployment. The research system is focused on experimentation and requires simple, flexible, and easily modifiable software, whereas the emphasis in deployment is on resource constraints and online efficiency. Though our implementation worked in the trial and in a deployed setting where recommendations were up to one day out of date, our implementation would not work in the production environment, and moreover, we could not have built a system in the research laboratory that would work in production since this required integration with the commercial systems.

Conclusion

We have presented the results of a successful deployment of our people-to-people recommender system on a large commercial online dating site with nearly half a million active users sending more than 70,000 messages a day. The recommender had been in use for about 7 months (from August 2012 to March 2013) before these results were obtained. In the period from November 2012 to March 2013, 61 percent of active users clicked recommendations and 33 percent of them communicated with recommended candidates.

The recommender system is a hybrid system combining several AI techniques that all contributed to its overall success. First, collaborative filtering allows recommendations to be based on user behavior rather than profile and expressed preferences. Second, decision tree rules were crucial in addressing the common problem with collaborative filtering in over-recommending popular items, which is particularly acute for people-to-people recommendation. No single AI method, whether decision tree learning, profile-based matching, or collaborative filtering, could alone produce satisfactory results. More generally, we think there is much scope for further research into hybrid recommender systems that combine multiple sources of information when generating and ranking recommendations.

Methods developed for this recommender system can be applied beyond online dating, in other social network contexts and in other reciprocal recommendation settings where there are two-way interactions between entities (people or organizations) with their own preferences. Typical such problems include intern placement and job recommendation. Moreover, our method of using decision tree rules as a critic to reduce the recommendation frequency of popular users can also be applied to item recommendation.

Acknowledgements

This work was funded by Smart Services Cooperative Research Centre. We would also like to thank our industry partner for supporting this research consistently throughout the whole process, and specifically for conducting the trial and giving us details of the method reimplementation and the deployment process.

References

Burke, R. 2002. Hybrid Recommender Systems: Survey and Experiments. User Modeling and User-Adapted Interaction 12(4): 331–370.

Cai, X.; Bain, M.; Krzywicki, A.; Wobcke, W.; Kim, Y. S.; Compton, P.; and Mahidadia, A. 2010. Collaborative Filtering for People to People Recommendation in Social Networks. In AI 2010: Advances in Artificial Intelligence, ed. J. Li, 476–485. Berlin: Springer-Verlag.

Cai, X.; Bain, M.; Krzywicki, A.; Wobcke, W.; Kim, Y. S.; Compton, P.; and Mahidadia, A. 2013. ProCF: Generalising Probabilistic Collaborative Filtering for Reciprocal Recommendation. In Advances in Knowledge Discovery and Data Mining, ed. J. Pei, V. Tseng, L. Cao, H. Motoda, and G. Xu. Berlin: Springer-Verlag.

Finkel, E. J.; Eastwick, P. W.; Karney, B. R.; Reis, H. T.; and Sprecher, S. 2012. Online Dating: A Critical Analysis from the Perspective of Psychological Science. Psychological Science in the Public Interest 13(1): 3–66.

Kim, Y. S.; Mahidadia, A.; Compton, P.; Cai, X.; Bain, M.; Krzywicki, A.; and Wobcke, W. 2010. People Recommendation Based on Aggregated Bidirectional Intentions in Social Network Site. In Knowledge Management and Acquisition for Smart Systems and Services, ed. B.-H. Kang and D. Richards, 247–260. Berlin: Springer-Verlag.

Kim, Y. S.; Krzywicki, A.; Wobcke, W.; Mahidadia, A.; Compton, P.; Cai, X.; and Bain, M. 2012a. Hybrid Techniques to Address Cold Start Problems for People to People Recommendation in Social Networks. In PRICAI 2012: Trends in Artificial Intelligence, ed. P. Anthony, M. Ishizuka, and D. Lukose, 206–217. Berlin: Springer-Verlag.

Kim, Y. S.; Mahidadia, A.; Compton, P.; Krzywicki, A.; Wobcke, W.; Cai, X.; and Bain, M. 2012b. People-to-People Recommendation Using Multiple Compatible Subgroups. In AI 2012: Advances in Artificial Intelligence, ed. M. Thielscher and D. Zhang, 61–72. Berlin: Springer-Verlag.

Krzywicki, A.; Wobcke, W.; Cai, X.; Bain, M.; Mahidadia, A.; Compton, P.; and Kim, Y. S. 2012. Using a Critic to Promote Less Popular Candidates in a People-to-People Recommender System. In Proceedings of the Twenty-Fourth Annual Conference on Innovative Applications of Artificial Intelligence, 2305–2310. Palo Alto, CA: AAAI Press.

Krzywicki, A.; Wobcke, W.; Cai, X.; Mahidadia, A.; Bain, M.; Compton, P.; and Kim, Y. S. 2010. Interaction-Based Collaborative Filtering Methods for Recommendation in Online Dating. In Web Information Systems Engineering, ed. L. Chen, P. Triantafillou, and T. Suel, 342–356. Berlin: Springer-Verlag.

Melville, P.; Mooney, R. J.; and Nagarajan, R. 2002. Content-Boosted Collaborative Filtering for Improved Recommendations. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), 187–192. Menlo Park, CA: AAAI Press.

Oyer, P. 2014. Everything I Ever Needed to Know About Economics I Learned from Online Dating. Boston: Harvard Business Review Press.

Pizzato, L.; Rej, T.; Akehurst, J.; Koprinska, I.; Yacef, K.; and Kay, J. 2013. Recommending People to People: The Nature of Reciprocal Recommenders with a Case Study in Online Dating. User Modeling and User-Adapted Interaction 23(5): 447–488.

Webb, A. 2013. Data, A Love Story. New York: Plume, Penguin Group.

Wayne Wobcke is an associate professor in the School of Computer Science and Engineering at the University of New South Wales, Australia. His research interests span intelligent agent theory, dialogue management, personal assistants, and recommender systems. From 2008 to 2014, he led the Personalisation program within Smart Services Cooperative Research Centre, which provided the environment for this collaboration. He has a Ph.D. in computer science from the University of Essex.

Alfred Krzywicki is a research fellow in the School of Computer Science and Engineering at the University of New South Wales, Australia, with a background in both software engineering and AI. After receiving his Ph.D. from the University of New South Wales in 2012, he worked on social media analysis and contributed to the implementation and commercial deployment of a recommender system. His current research interests include recommender systems, text, and stream mining.

Yang Sok Kim is an assistant professor at the School of Management Information Systems at Keimyung University, Korea. His research interests include knowledge-based systems and machine learning and their hybrid approaches. Kim received his Ph.D. in computing in 2009 from the School of Computing and Information Systems at the University of Tasmania. He worked as a postdoctoral researcher for the Personalisation program within Smart Services Cooperative Research Centre.

Xiongcai Cai is a scientist with a general background in AI and expertise in the fields of machine learning, data mining, and social media analysis. He received a Ph.D. in computer science and engineering from the University of New South Wales, Australia, in 2008. He has spent the past decade researching and developing models to better understand behaviors and patterns that expand the scope of human possibilities.

Michael Bain is a senior lecturer in the School of Computer Science and Engineering at the University of New South Wales, Australia. His research focuses on machine learning, particularly the logical foundations of induction, and he has experience in a wide variety of application areas including computer chess, bioinformatics, and recommender systems. He has a Ph.D. in statistics and modeling science from the University of Strathclyde.

Paul Compton is an emeritus professor in the School of Computer Science and Engineering at the University of New South Wales, Australia. Most of his research has been around the idea of building systems incrementally using a learning or knowledge acquisition strategy known as Ripple-Down Rules.

Ashesh Mahidadia is the chief executive officer of the company smartAcademic, which empowers businesses to add intelligence to the way they operate. He has a Ph.D. in artificial intelligence from the University of New South Wales, Australia. His research interests include knowledge-intensive machine learning, recommender systems, intelligent assistants, and intelligent education systems. He was a senior researcher in the Personalisation program within Smart Services Cooperative Research Centre.
