Value-aware Recommendation based on Reinforced Profit Maximization in E-commerce Systems

arXiv:1902.00851v1 [cs.IR] 3 Feb 2019

Changhua Pei∗ (Alibaba Group, [email protected])
Xinru Yang∗† (Carnegie Mellon University, [email protected])
Qing Cui (Alibaba Group, [email protected])
Xiao Lin (Alibaba Group, [email protected])
Fei Sun (Alibaba Group, [email protected])
Peng Jiang (Alibaba Group, [email protected])
Wenwu Ou (Alibaba Group, [email protected])
Yongfeng Zhang‡ (Rutgers University, [email protected])

ABSTRACT

Existing recommendation algorithms mostly focus on optimizing traditional recommendation measures, such as the accuracy of rating prediction in terms of RMSE or the quality of top-k recommendation lists in terms of precision, recall, MAP, etc. However, an important expectation for commercial recommendation systems is to improve the final revenue/profit of the system. Traditional recommendation targets such as rating prediction and top-k recommendation are not directly related to this goal.

In this work, we blend the fundamental concepts of online advertising and micro-economics into personalized recommendation for profit maximization. Specifically, we propose value-aware recommendation based on reinforcement learning, which directly optimizes the economic value of candidate items to generate the recommendation list. In particular, we generalize the basic concept of click conversion rate (CVR) in computational advertising into the conversion rate of an arbitrary user action (XVR) in E-commerce, where the user actions can be clicking, adding to cart, adding to wishlist, etc. In this way, each type of user action is mapped to its monetized economic value. Economic values of different user actions are further integrated as the reward of a ranking list, and reinforcement learning is used to optimize the recommendation list for the maximum total value. Experimental results on both offline benchmarks and online commercial systems verify the improved performance of our framework, in terms of both traditional top-k ranking tasks and the economic profits of the system.

CCS CONCEPTS
• Information systems → Recommender systems;

KEYWORDS
Recommender Systems; Reinforcement Learning

∗ Equal contribution. Listing order is random.
† This work was done when Xinru Yang was an intern at Alibaba.
‡ Corresponding author.

1 INTRODUCTION

Recommender systems have become a fundamental service in many online applications. Years of research have witnessed great advancement in the development of recommendation systems, and various techniques have been proposed to support better performance in practice, including but not limited to content-based methods [22], user/item-based collaborative filtering methods [8], matrix factorization techniques [16], and the more recent deep learning-based approaches [32]. Most of the existing methods focus on two of the most fundamental tasks, namely, rating prediction and top-k recommendation. They usually aim at the optimization of rating- or ranking-related measures, such as root mean square error (RMSE), mean average precision (MAP), and many others.

However, an important and sometimes the ultimate goal of a commercial recommendation system is to gain revenue and profits for the platform, and traditional recommendation research is not directly related to this goal. It mostly focuses on whether the algorithm can predict accurate ratings (rating prediction), or whether the system can rank the clicked items correctly (top-k recommendation). Nevertheless, researchers have shown that improved performance on rating prediction does not necessarily improve top-k recommendation performance in terms of user purchase [5]. And even better performance on top-k recommendation does not necessarily improve the profit of the system [33], because the purchased recommendations may not convert into the best profit due to their difference in price. For example, by recommending daily necessities, the system has a better chance of getting the user to purchase the recommendation, but the expected profit achieved from the purchase may be smaller than if a luxury good were purchased, though the latter has a smaller chance of being purchased.

In this work, we propose value-aware recommendation for profit maximization, which directly optimizes the expected profit of a recommendation list. The basic idea is to balance the unit profit of a recommendation and the chance that the recommendation is purchased, so that we can find the recommendation list of the maximum expected profit. To achieve this goal, we generalize the key concept of click conversion rate (CVR) [2, 18] in online advertising to the conversion rate of arbitrary user actions (XVR),



which scales each user action into the monetized profit of the system based on large-scale user behavior logs.

Once the user actions are converted into expected profits, we further design an aggregated reward function based on the monetized user actions, and then adopt reinforcement learning to optimize the recommendation list towards the maximum expected profit for the system. For training, we developed a benchmark dataset by collecting user behaviors in a real-world E-commerce system. Furthermore, we conducted online experiments on this E-commerce system to validate the online performance of value-aware recommendation in terms of both ranking performance and system profits.

The key contributions of the paper can be summarized as follows:
• We propose value-aware recommendation to maximize the profit of a recommendation system directly, which, to the best of our knowledge, is the first attempt to explicitly optimize the profit of the personalized recommendation list in a large-scale online system.
• We propose to generalize the basic concept of click conversion rate into the conversion rate of arbitrary user actions, so that each user action can be monetized into the profit of the system.
• By aggregating the monetized user actions into a unified profit reward, we develop a reinforcement learning-based algorithm for value-aware recommendation.
• We conduct both offline experiments with benchmark datasets and online experiments with real-world commercial users to validate the performance of value-aware recommendation on ranking and profit maximization.

2 RELATED WORK

In this section, we introduce the key related work from three perspectives: economic recommendation, computational advertising, and reinforcement learning.

2.1 Economic Recommendation

For a long time, recommendation system research has been working on rating- or ranking-related tasks such as rating prediction, top-k recommendation, sequential recommendation, etc., and a lot of successful models have been proposed, which greatly advanced the performance of recommendation systems. To name a few, this includes content-based methods [22], user/item-based collaborative filtering [8], shallow latent factor models such as matrix factorization [16, 17, 21, 23], and more recent deep learning based methods [13, 30, 32]. However, these methods seldom consider the economic value/profit that a recommendation list brings to the system, although this is one of the most important goals for real-world commercial recommender systems. Some recent research on economic recommendation has begun to take care of the economic value of recommendation [1, 7, 38]. For example, [33] proposed to maximize social surplus for recommendation, and [34] proposed to learn the substitutional and complementary relations between items for utility maximization. However, none of the above methods explicitly maximizes the expected profit of a recommendation list to generate recommendations.

2.2 Computational Advertising

Currently, computational advertising [6] plays a crucial role in online shopping platforms, rendering it essential to apply an intelligent advertising strategy that can maximize profits by ranking numerous advertisements in the best order [3, 10, 11]. In this scenario, the goal of advertisers is to maximize clicks and conversions such as item purchases. Therefore, cost-per-click (CPC) and cost-per-action (CPA) models are often adopted to optimize the click-through rate (CTR) and the conversion rate (CVR), respectively. Although there are many technical analogies between advertising and recommendation, there is not much work trying to bridge these two principles. This work is one of the first attempts to integrate the economic considerations of advertising into recommender systems for value-aware recommendation.

2.3 Reinforcement Learning

Reinforcement Learning (RL) aims at automatically learning an optimal strategy from the sequential interactions between the agent and the environment by trial-and-error [28]. Some existing works have made efforts to exploit reinforcement learning for recommendation systems. For instance, conventional RL methods such as POMDP (Partially Observable Markov Decision Processes) [26] and Q-learning [29] estimate the transition probability and store the Q-value table in modeling. However, they soon become inefficient with the sharp increase in the number of items for recommendation. Meanwhile, other approaches have been proven useful with an effective utilization of deep reinforcement learning. For instance, [37] employs an Actor-Critic framework to learn the optimal strategy with an online simulator. Furthermore, [36] considers both positive and negative feedback from users' recent behaviors to help find the optimal strategy. [9] uses multi-agent reinforcement learning to optimize multi-scenario ranking. Meanwhile, [35] adopts RL to recommend items on a 2-D page instead of showing one single item each time.

To mitigate the performance degradation due to high-variance and biased estimation of the reward, [4] provides a stratified random sampling and an approximate regretted reward to enhance the robustness of the model. Similarly, [15] introduces the DPG-FBE algorithm, which maintains an approximate model of the environment to perform reliable updates of value functions.

3 GENERALIZED CONVERSION RATE

In order to maximize the profit of a recommender system, we need to measure the value of different user actions on the platform. In this section, we introduce how to generalize the basic concept of click conversion rate to rescale each user action into monetized profit. Here we use Gross Merchandise Volume (GMV) to measure the profit of the system. GMV is a measure used in online retailing, which indicates the total sales dollar value of the merchandise sold through a particular marketplace over a certain time frame¹. We use GMV and profit interchangeably hereafter.

¹ https://en.wikipedia.org/wiki/Gross_merchandise_volume

Figure 1: A flow of the user behavior series (Impression → Click → Favorite / Add to cart → Purchase).

3.1 Conversion Rate of Clicks

Click conversion rate (CVR) is widely used in online advertising to estimate the potential profit of an advertising item to the advertising platform [20, 39]. It is usually defined as the number of conversions divided by the number of total clicks that can be tracked to a conversion during the same period of time. In an E-commerce system, it is difficult to learn an optimal recommendation strategy by simply calculating the profits of transactions, because purchasing an item is a relatively sparse user action. Instead, users click items much more frequently. CVR provides a way to map the dense action of clicking into a scaled range of profits. For an E-commerce system, CVR is the first step towards value-aware recommendation [27, 31]. Eq. 1 shows the total estimated GMV brought by clicks, which shows that CVR plays a crucial role in profit maximization,

E[GMV] = \sum_i I(click, i) · CVR(i) · price(i)    (1)

where price(i) is the unit price of item i, and I(click, i) is an indicator of the occurrence of a click on item i: if item i is clicked, I(click, i) = 1, otherwise 0.

3.2 Generalization to other Actions

On an E-commerce platform, a series of user behaviors is often formed step by step with transition probabilities. In Figure 1, the general shopping process is simplified into four steps. First, the user sees something of interest and gets an impression of the item; then the user clicks to see more details and may eventually add it to the cart. The user can also add it to the wishlist so that they can add it to the cart later. Eventually, some items in the cart are purchased together and contribute to the final profit. Adding to wishlist and adding to cart reveal a strong desire for the item even if no transaction is made eventually. Therefore, these actions are also considered to contribute to the value.

Here we generalize the concept of CVR to XVR to map each type of user action x to its monetized economic value. XVR is defined as the probability of transition from a certain action x to purchase. Using XVR, we can estimate the total profit of different actions on the platform. By generalizing Eq. 1, we get Eq. 2,

E[GMV] = \sum_{x,i} V(x, i) = \sum_{x,i} I(x, i) · XVR(i) · price(i)    (2)

where E[GMV] denotes the expected GMV of the platform, and x represents a certain action, including click, add to cart, add to wishlist, and purchase. Here V(x, i) = I(x, i) · XVR(i) · price(i) is the expected profit given that the user performs action x on item i.

4 REINFORCED PROFIT OPTIMIZATION

The main goal of this work is to maximize the total profit of the recommender system in an E-commerce scenario. Here the profit indicates the total sales dollar value of merchandise sold online in a certain time frame. Using the generalized conversion rate, we can convert all user actions to monetized profit and maximize it. Reinforcement learning (RL) aims to learn a policy that maximizes the accumulated reward. The profit can naturally be used as the reward with which we can find better actions (ranking policies). Besides, the view of interaction between the environment (users) and the RL agent fits our recommendation scenario well. Therefore, in this paper, we choose reinforcement learning, more specifically the evolution strategy [24], as our algorithm for its simplicity and effectiveness.

4.1 Reinforcement Learning Modeling

Figure 2 illustrates how our value-based RL agent interacts with the user to maximize the profit. First, the user comes to the system at some time and makes an initial request; the system receives information about the user as well as the candidate items as a state and responds to the request, which virtually means taking an action by generating an order in which to rank the items. Next, the user shows certain behaviors regarding the items, such as click, add to wishlist, add to cart or purchase, which are used to calculate the reward to the system for the action it takes.

In our recommendation platform, items are shown in a cascade on a mobile App, one by one. Each time the user initiates a request, 50 items are recommended to him/her. When the user scrolls down the list and has seen all 50 items, a new request is triggered. This process is repeated until the user leaves the App or returns to the top of the cascade, labelled as "exit" in Figure 2. We use a metric called "pageid" to distinguish different requests in this interaction, similar to the concept of a "page" to a search engine. As the user and the system interact with each other, the system learns how to respond to the state to obtain an optimized accumulative reward.

States: States are used to describe a user's request, including user-side, item-side, and contextual features. The user-side features include age, gender, and purchase power. The item-side features are ctr, cvr, and price. The contextual features include pageid and request time. In total, the state in our model can be formulated as <age, gender, purchase power, ctr_0, ..., ctr_49, cvr_0, ..., cvr_49, price_0, ..., price_49, pageid, request time>, where the subscript i denotes the corresponding feature of the i-th item on the corresponding page.
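As a rough illustration of the state above (field names, encodings and the helper are assumptions, not the production feature schema), one request's state could be assembled like this:

```python
import numpy as np

def build_state(user, items, pageid, request_hour):
    """Assemble the per-request state: user-side, item-side, contextual features.

    `user` is assumed to expose age, gender (already numerically encoded) and
    purchase_power; `items` is the list of 50 candidate items, each carrying
    ctr, cvr and price estimates.
    """
    user_part = [user["age"], user["gender"], user["purchase_power"]]
    item_part = []
    for it in items:                      # 50 items -> ctr_i, cvr_i, price_i
        item_part += [it["ctr"], it["cvr"], it["price"]]
    context_part = [pageid, request_hour]
    return np.array(user_part + item_part + context_part, dtype=np.float32)
```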

All these features work together to influence the final GMV. For example, purchase power is calculated from how many and how expensive the bought items are. The higher the level of purchase power, the more money is spent on the platform. Fig. 3a, 3b, and 3c show the average GMV per user, the number of users and the summed GMV under different levels of purchasing power and degrees of activity. Degree of activity indicates the time a user spends on the platform. Fig. 3a shows that users with higher purchase power have higher average GMV. Thus the model tends to recommend items with higher price to users who have high purchase power. However, for summed GMV, users with a middle level of purchase power contribute most because of the large number of such users, as shown in Fig. 3b.

Request time (labelled as hour in Fig. 4) and pageid are also features used in our model. Fig. 4 shows that the first page, whose pageid equals 0, contributes most to the summed GMV; as the user scrolls downwards, pages with larger pageid are less likely to make an impression on the user. Thus the model should recommend items with higher conversion rate on the first page. Besides, the model will learn different weights at different hours, as Fig. 4 shows that the summed GMV varies with request time.

Actions: The action is the coefficient vector in the ranking formula that decides the order of the candidate items. We design a ranking formula, shown in Eq. 3, to order the items.

Figure 2: Structure of the model. When the user initiates a request, the system generates a recommendation list, and the user interacts with the list through certain actions (favorite, add to cart, purchase). These actions are used to calculate the reward that updates the RL agent.

Figure 3: Average/Summed GMV and the number of users under different levels of purchasing power and degree of activity. (a) Average GMV. (b) Number of Users. (c) Summed GMV.

rankscore(i) = \sum_{x∈A} P(x, i)^{α_x} · XVR(i)^{β_x} · price(i)^{γ}    (3)

where A is the user action set, which contains click, add to cart, and add to wishlist. P(x, i) is the probability that a given user performs action x on item i, and XVR(i) is the generalized conversion rate of item i. In total, the action can be formulated as <α_click, α_cart, α_fav, β_click, β_cart, β_fav, γ>.

If we only consider click in the action set A, rankscore can be simplified to Eq. 4, and the corresponding action becomes <α_click, β_click, γ>:

rankscore(i) = P(click, i)^{α_click} · XVR(i)^{β_click} · price(i)^{γ} = ctr(i)^{α_click} · cvr(i)^{β_click} · price(i)^{γ}    (4)

Changing an action means changing this coefficient vector of the ranking formula.
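As a concrete illustration of Eq. 4 (the click-only case), the following sketch scores and orders candidate items from their ctr, cvr and price estimates; the action is the exponent vector <α_click, β_click, γ>, and all numbers are illustrative assumptions.

```python
import numpy as np

def rank_scores(ctr, cvr, price, action):
    """Eq. 4: score_i = ctr_i^a_click * cvr_i^b_click * price_i^gamma.

    `action` is the coefficient vector <a_click, b_click, gamma> output by the
    RL agent; changing the action changes the resulting ranking.
    """
    a_click, b_click, gamma = action
    return ctr ** a_click * cvr ** b_click * price ** gamma

# Illustrative values only.
ctr = np.array([0.05, 0.02, 0.08])
cvr = np.array([0.10, 0.30, 0.05])
price = np.array([20.0, 120.0, 15.0])

scores = rank_scores(ctr, cvr, price, action=(1.0, 0.8, 0.5))
ranking = np.argsort(-scores)   # candidate indices, best first
```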

Reward: In our value-based method, we use the expected profit as the reward, which contains the monetized profit converted from all kinds of user actions. Based on the definition of the expected GMV in Eq. 2, for a given item i, the reward R_i can be defined as:

R_i = V(click, i) + V(fav., i) + V(cart, i) + V(pay, i)    (5)

In this paper, unless specifically mentioned, we mainly use click and pay in the reward, i.e., R_i = V(click, i) + V(pay, i). In the last section, we evaluate the performance when adding to cart and adding to wishlist are also taken into the reward. The total reward of a recommendation list in a page that contains T items can be defined as R_page = \sum_{i=1}^{T} R_i.

4.2 Model Training

4.2.1 Offline Reward. In the above section, the reward of the model, R_page, is calculated directly from the user's online feedback. This requires that the model is deployed online and learns the policy from the real-time data stream. However, online traffic is expensive, and it is risky to let an RL model learn directly online before we have validated the effectiveness of the value-based method. To overcome this problem, we propose a simulated environment which leverages historical data to approximate the feedback of users. The simulated environment adopts a simple idea from NDCG (normalized discounted cumulative gain): a good policy should rank the items the user actually clicked and paid for at the front. In this way, an offline reward R'_page is used to evaluate actions as follows:

Figure 4: Summed GMV across different pageids and request times (hours).

R'_page = \sum_{i=1}^{T} R_i · W_{π(i)}    (6)

For a given page with T items 1, ..., T, the ranking policy generates an ordered list, where π(i) = k means that item i is ranked at position k. We assign a discounted weight W_{π(i)} to each position, with higher positions receiving greater weight, and represent the offline reward as the weighted sum of the item rewards. In this paper, W_{π(i)} is represented by an exponential function, i.e., W_{π(i)} = exp(π(i)).
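A minimal sketch of the offline reward in Eq. 6 is shown below, assuming per-item rewards R_i from Eq. 5 and a position-weight function. The exp(-k) discount used in the example is only one plausible choice that gives the top positions the largest weight; the paper itself specifies an exponential function of the position.

```python
import math

def offline_page_reward(item_rewards, ranking, position_weight):
    """Eq. 6: R'_page = sum_i R_i * W_pi(i).

    item_rewards[i]     -- R_i from Eq. 5 for item i.
    ranking[i]          -- pi(i), the position the policy assigns to item i.
    position_weight(k)  -- the per-position weight W_k.
    """
    return sum(r * position_weight(k) for r, k in zip(item_rewards, ranking))

# Illustrative only: one plausible discount giving top positions more weight.
discount = lambda k: math.exp(-k)

rewards = [0.0, 6.3, 0.4]     # hypothetical R_i values for three items
positions = [2, 0, 1]         # policy ranks item 1 first, item 2 second, item 0 last
print(offline_page_reward(rewards, positions, discount))
```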

4.2.2 Learning Algorithm. As the main goal of this paper is to evaluate the effectiveness of value-aware recommendation, we choose the simplest model in the reinforcement learning family to exclude the influence of complex models. We finally adopt the evolution strategy algorithm (abbreviated as the ES algorithm). We train the model offline and deploy it online to perform A/B tests. The following introduces how the ES algorithm learns in the training process.

Our task is to learn a policy that maps a state to an action and maximizes the expected reward. To keep it simple, we use a linear model as the policy, and θ denotes the parameters to be optimized. The ES algorithm starts from an initial parameter θ_0 and then updates θ_t in a loop over t = 1, 2, .... At each iteration t, the user requests a new page and the state vector s_t is passed to the model. The algorithm generates n baby-agents, and each baby-agent j samples an increment ε_t^j from a normal distribution. Each model variant with parameters θ_t + ε_t^j maps the state to an action. Each action orders the list in a different way, and the user reacts to the different ranking policies by clicking or purchasing certain items in the list. We calculate the reward of each policy j using Eq. 6 and denote it as R'_page(j) = F(θ_t + ε_t^j). We collect all the rewards together and update θ_t eventually:

θ_{t+1} ← θ_t + α_lr · (1/(nσ)) · \sum_{j=1}^{n} R'_page(j) · ε_t^j    (7)

where σ and α_lr are hyper-parameters: σ is the noise standard deviation and α_lr is the learning rate.
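The sketch below shows one ES iteration as described above, following the standard evolution-strategy update of [24]. The flat linear policy, the dimensions and the hyper-parameter defaults are assumptions for illustration, and reward_fn stands in for the simulated environment that returns R'_page.

```python
import numpy as np

def es_step(theta, state, reward_fn, n=32, sigma=0.5, lr=0.0005):
    """One evolution-strategy iteration (cf. Eq. 7 and [24]).

    theta:     flat parameter vector of the linear policy (state -> action).
    state:     the state vector s_t for the current request.
    reward_fn: maps an action (coefficient vector) to the offline reward
               R'_page, standing in for the simulated environment.
    """
    dim_action = theta.size // state.size
    epsilons = sigma * np.random.randn(n, theta.size)        # increments eps_t^j
    rewards = np.empty(n)
    for j in range(n):
        w = (theta + epsilons[j]).reshape(dim_action, state.size)
        rewards[j] = reward_fn(w @ state)                     # R'_page(j) = F(theta + eps_t^j)
    return theta + lr * epsilons.T @ rewards / (n * sigma)    # Eq. 7 update
```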

Table 1: Overview of the benchmark dataset (all sizes ×10^6).

#distinct users      | 49
#distinct items      | 200
#requests            | 500
#clicks              | 670
#adding to carts     | 60
#adding to wishlists | 30
#purchases           | 3

Figure 5: The training curve (reward vs. epoch) under different parameters <noise standard deviation, learning rate, batch size>: <0.2, 0.001, 200>, <0.5, 0.001, 200>, <0.5, 0.0005, 200>, <0.5, 0.0005, 500>.

5 EXPERIMENTS

In this section, we first introduce the settings of our benchmark dataset used for training and offline evaluation. Then we evaluate our method both online and offline, together with baselines.²

5.1 Benchmark Dataset

Our benchmark dataset is collected from a real-world E-commerce platform and can simulate the interactive environment and users' behaviors. Each piece of data comprises two parts: the states and the users' feedback (click, add to cart, add to wishlist, purchase). Table 1 lists the details of the dataset.

5.2 Experimental Setup

5.2.1 Baselines. We use the following three methods as baselines to compare with our value-based RL method: Item-based Collaborative Filtering, LR-based learning to rank (LTR), and DNN-based learning to rank.

• Item-based CF: The classic item-based CF [25] is widely used for recommendation. Compared to user-based CF [14], item-based CF is more scalable and accurate in E-commerce recommendation because the relationships between items are more stable. In this paper, we use a fine-tuned item-based collaborative filtering method as one of our baselines. Other CF methods such as matrix factorization could also be used as baselines; however, as they are all non-value-aware methods, we omit them for simplicity.

² The code and dataset of the paper are released at https://github.com/rec-agent/rec-rl.

Table 2: Offline Evaluation Results (p-value < 0.005). Numbers in brackets are improvements compared to Item-based CF and LR-based LTR.

Method         | E[GMV]            | Average R'_page     | Precision@20(%)   | Recall@20(%)       | MAP@20(%)
Item-based CF  | 0.40              | 19.73               | 2.96              | 27.93              | 8.69
LR-based LTR   | 0.49 (22.5%/-)    | 25.87 (31.1%/-)     | 3.45 (16.6%/-)    | 32.42 (16.1%/-)    | 11.04 (27.0%/-)
DNN-based LTR  | 0.50 (25.0%/2.0%) | 25.88 (31.2%/0.04%) | 3.65 (23.3%/5.8%) | 34.01 (21.8%/4.9%) | 12.27 (41.2%/11.1%)
Value-based RL | 0.53 (32.5%/8.2%) | 27.78 (40.8%/7.4%)  | 3.74 (26.4%/8.4%) | 34.82 (24.7%/7.4%) | 12.36 (42.2%/12.0%)

Table 3: Online Evaluation Results (p-value < 0.005).

Method         | GMV               | CTR(%)           | IPV
Item-based CF  | 7.57              | 3.04             | 2.48
LR-based LTR   | 8.95 (18.3%/-)    | 3.26 (7.4%/-)    | 2.67 (7.5%/-)
DNN-based LTR  | 9.06 (19.7%/1.2%) | 3.28 (8.1%/0.6%) | 2.69 (8.3%/0.7%)
Value-based RL | 9.68 (27.9%/8.2%) | 3.29 (8.2%/0.9%) | 2.70 (8.8%/1.1%)


• LR-based LTR: The point-wise Learning-To-Rank paradigm [19] is adopted to learn the rankings of recommended items with Logistic Regression (LR) as the ranking model. We use clicked and purchased items as positive samples to learn a binary classification model, and the price is used as the weight of the loss. To reduce the variance of price and make the optimization stable, we actually use log(price) as the weight (a sketch of this price-weighted training is given after this list). For a fair comparison, we employ the same feature set as our proposed value-based RL. There are three hyper-parameters involved in the training of LR-based LTR, i.e., the ℓ2 regularization weight, the learning rate and the batch size. These parameters are set to 0, 0.5 and 128 after tuning. We also tested a regression model for predicting the profit directly; however, it could not beat classification models like LR. This is consistent with previous work [12]: regression models may not be suitable for ranking problems in recommender systems.

• DNN-based LTR: This method is the same as LR-based LTR, except that the ranking model is a Deep Neural Network (DNN) instead. The neural network involves two hidden layers with 32 and 16 neurons respectively. The ℓ2 regularization weight, the learning rate and the batch size are set to 0, 0.05 and 1024. We also tested network structures including RNN, attention networks, and the Wide&Deep network. The optimizer, activation functions, normalization, and other hyper-parameters are tuned by greedy search. However, the improvement from these networks is not significant on a very large dataset.
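The following is a minimal sketch of the price-weighted point-wise LR baseline described above, assuming scikit-learn, a feature matrix X (same features as the RL state) and binary labels y; the helper name and data are illustrative, and only the log(price) loss weighting is taken from the description.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_lr_ltr(X, y, price):
    """Point-wise LR ranker: y=1 for clicked/purchased impressions, 0 otherwise;
    log(price) is used as the per-sample loss weight (prices assumed > 1)."""
    sample_weight = np.log(price)
    model = LogisticRegression(C=1e6)      # large C ~ effectively no l2 penalty
    model.fit(X, y, sample_weight=sample_weight)
    return model

# At serving time, candidates are ranked by the predicted positive-class probability:
# scores = model.predict_proba(X_candidates)[:, 1]
```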

5.2.2 Value-based RL. Different hyper-parameters are tried by a greedy search to find better training performance. In the ES algorithm, under the same training set, the larger the reward a model can achieve, the better the model is. We show several important parameter settings in Fig. 5 to illustrate this. Using a larger noise standard deviation (from 0.2 to 0.5) can greatly improve the maximum reward the model can achieve; however, the model is not stable, so the reward drops after some iterations. We adjust the learning rate in the Adam optimizer from 0.001 to 0.0005 and the training curve becomes stable. We also try a larger batch size (from 200 to 500), but the improvement is marginal. Finally, <0.5, 0.0005, 200> are used as the noise standard deviation, learning rate and batch size in our RL model, respectively.

5.3 Key Results

We evaluate our method with both offline and online metrics; see Table 2 and Table 3³.

Offline Evaluation: In the offline evaluation, we train the model on the data of one week and evaluate it on the data of the following week. Precision, recall and MAP within the top 20 items are used to measure the ranking performance. Besides the traditional metrics, average R'_page (Eq. 6) and E[GMV] (Eq. 2) are used as value-based metrics.

Table 2 shows that the Learning-To-Rank (LTR) baselines outperform item-based CF in all measures, which indicates that the LTR methods can predict the profit more effectively. We leverage price in two ways: on one hand, the LTR methods use price as one of the features; on the other hand, they use price as the weight of the loss in the learning process. Our value-based approach has 6.0% and 7.3% improvements in E[GMV] and R'_page respectively when compared to DNN-based LTR. Besides, Value-based RL achieves better performance in precision (2.5%), recall (2.4%) and MAP (0.7%) than DNN-based LTR with the same features. Our approach can precisely monetize an arbitrary user action into profit by generalizing the basic concept of click conversion rate (CVR) to XVR.
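For reference, a minimal sketch of the top-20 precision and recall used above, with standard definitions (not code from the paper):

```python
def precision_recall_at_k(ranked_items, relevant_items, k=20):
    """Precision@k and Recall@k for one request.

    ranked_items:   item ids in the order produced by the ranking policy.
    relevant_items: set of items the user actually clicked or purchased.
    """
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k, hits / max(len(relevant_items), 1)
```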

Online Evaluation: An online A/B test is used to evaluate the different methods. The metrics we use in the A/B test are CTR, IPV and GMV. IPV is the absolute number of clicked items on the platform, which acts as a supplementary metric to CTR. To make the online evaluation comparable, each bucket in the A/B test has the same number of users and contains millions of users.

Table 3 shows the average results over one week. The online evaluation results are consistent with the offline evaluation. The DNN-based LTR improves GMV, CTR and IPV by 19.7%, 8.1% and 8.3% respectively compared to Item-based CF; even though the DNN-based LTR does not explicitly optimize GMV, it still achieves this 19.7% improvement in GMV. By explicitly mapping the value of different actions into GMV, our value-based RL method brings another 6.8% improvement in GMV compared to DNN-based LTR. The improvements in CTR and IPV are 0.3% and 0.4% respectively. The online test is done in a real-world system (involving two hundred million clicks per day, over 7 days). Based on a significance test (p < 0.005), the improvements are all significant and have business value for an E-commerce platform, because every thousandth of improvement in CTR means hundreds of millions of extra clicks in real-world systems.

³ Normalized relative values instead of the original absolute values of the online measures are shown for business reasons. The relative improvements are consistent with the real experimental results.


Table 4: Offline performance when considering adding to cart and adding to wishlist (p-value < 0.005).

Reward actions   | E[GMV]      | Precision@20(%) | Recall@20(%) | MAP(%)
click            | 42.2        | 3.67            | 34.5         | 12.1
click, cart, fav | 43.5 (3.1%) | 3.88 (5.7%)     | 35.8 (3.8%)  | 12.9 (6.6%)

5.4 Click vs. Add to Cart and Add to Wishlist

In this part, we take adding to cart and adding to wishlist into consideration to evaluate the performance of our value-based model. Table 4 shows the offline performance when the value of the adding-to-cart and adding-to-wishlist actions is also included in the model. The improvement in expected GMV is 3.1%, and the improvements in precision, recall and MAP are 5.7%, 3.8% and 6.6% respectively. This means that the adding-to-cart and adding-to-wishlist actions are useful for value-based profit maximization.

6 CONCLUSIONS AND FUTURE WORK

Great advances have been achieved by existing research in improving the accuracy of rating prediction and the quality of top-k recommendation lists. However, the economic value of recommendation is rarely studied. For large-scale commercial recommendation systems, state-of-the-art algorithms seldom maximize the final revenue/profit of the system. To eliminate this gap, we propose value-aware recommendation to maximize the profit of a commercial recommendation system directly. Specifically, we generalize the basic concept of click conversion rate (CVR) to the conversion rate of an arbitrary user action on the platform (XVR), so that different user actions (click, add to cart, add to wishlist, purchase) can be monetized into the profit of the system. Then we use reinforcement learning (RL) to maximize the profit, whose reward is the aggregated monetized user actions. Both offline and online experiments show that our value-based RL model not only performs better on traditional metrics such as precision, recall and MAP, but also greatly improves the final profit compared to existing methods. This paper is a first step towards value-aware recommendation, and further improvement can be achieved by designing more powerful features and RL models in the future.

REFERENCES
[1] 2017. VAMS2017. (2017). https://vams2017.wordpress.com/
[2] Ashish Agarwal, Kartik Hosanagar, and Michael D Smith. 2011. Location, location, location: An analysis of profitability of position in online advertising markets. Journal of Marketing Research 48, 6 (2011), 1057–1073.
[3] Andrei Broder, Marcus Fontoura, Vanja Josifovski, and Lance Riedel. 2007. A semantic approach to contextual advertising. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 559–566.
[4] Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1187–1196.
[5] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 39–46.
[6] Kushal Dave, Vasudeva Varma, et al. 2014. Computational advertising: Techniques for targeting relevant ads. Foundations and Trends® in Information Retrieval 8, 4–5 (2014), 263–418.
[7] Gediminas Adomavicius and Dietmar Jannach. 2017. Price and Profit Awareness in Recommender Systems. arXiv preprint arXiv:1801.00209 (2017).
[8] Michael D Ekstrand, John T Riedl, Joseph A Konstan, et al. 2011. Collaborative filtering recommender systems. Foundations and Trends® in Human–Computer Interaction 4, 2 (2011), 81–173.
[9] Jun Feng, Heng Li, Minlie Huang, Shichen Liu, Wenwu Ou, Zhirong Wang, and Xiaoyan Zhu. 2018. Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1939–1948.
[10] Anindya Ghose and Sha Yang. 2009. An empirical analysis of search engine advertising: Sponsored search in electronic markets. Management Science 55, 10 (2009), 1605–1622.
[11] Avi Goldfarb and Catherine Tucker. 2011. Online display advertising: Targeting and obtrusiveness. Marketing Science 30, 3 (2011), 389–404.
[12] Asela Gunawardana and Guy Shani. 2009. A survey of accuracy evaluation metrics of recommendation tasks. Journal of Machine Learning Research 10, Dec (2009), 2935–2962.
[13] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
[14] Jonathan L Herlocker, Joseph A Konstan, Al Borchers, and John Riedl. 2017. An algorithmic framework for performing collaborative filtering. In ACM SIGIR Forum, Vol. 51. ACM, 227–234.
[15] Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM.
[16] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
[17] Daniel D Lee and H Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems. 556–562.
[18] Kuang-chih Lee, Burkay Orten, Ali Dasdan, and Wentong Li. 2012. Estimating conversion rate in display advertising from past performance data. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 768–776.
[19] Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval 3, 3 (March 2009), 225–331.
[20] Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018. Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 1137–1140.
[21] Andriy Mnih and Ruslan R Salakhutdinov. 2008. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems.
[22] Michael J Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The Adaptive Web. Springer, 325–341.
[23] Steffen Rendle and Christoph Freudenthaler. 2014. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 273–282.
[24] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).
[25] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. ACM, 285–295.
[26] Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
[27] SIGKDD. 2015. Life-stage Prediction for Product Recommendation in E-commerce. ACM, 1879–1888.
[28] Richard S Sutton, Andrew G Barto, et al. 1998. Reinforcement learning: An introduction. MIT Press.
[29] Nima Taghipour and Ahmad Kardan. 2008. A hybrid web recommender system based on Q-learning. In Proceedings of the 2008 ACM Symposium on Applied Computing. ACM, 1164–1168.
[30] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
[31] Qiaolin Xia, Peng Jiang, Fei Sun, Yi Zhang, Xiaobo Wang, and Zhifang Sui. 2018. Modeling Consumer Buying Decision for Recommendation Based on Multi-Task Deep Learning. In CIKM. ACM, 1703–1706.
[32] Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep learning based recommender system: A survey and new perspectives. arXiv preprint arXiv:1707.07435 (2017).
[33] Yongfeng Zhang, Qi Zhao, Yi Zhang, Daniel Friedman, Min Zhang, Yiqun Liu, and Shaoping Ma. 2016. Economic recommendation with surplus maximization. In WWW.
[34] Qi Zhao, Yongfeng Zhang, Yi Zhang, and Daniel Friedman. 2017. Multi-product utility maximization for economic recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 435–443.
[35] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018. Deep Reinforcement Learning for Page-wise Recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM.
[36] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendation with Negative Feedback via Pairwise Deep Reinforcement Learning. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[37] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. 2017. Deep Reinforcement Learning for List-wise Recommendations. arXiv preprint arXiv:1801.00209 (2017).
[38] Yong Zheng. 2017. Multi-Stakeholder Recommendation: Applications and Challenges. arXiv preprint arXiv:1707.08913 (2017).
[39] Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. Optimized cost per click in Taobao display advertising.