an e-commerce recommender system based on content-based filtering

WUJNS Wuhan University Journal of Natural Sciences

Vol. 11 No. 5 2006 1091-1096

Article ID: 1007-1202(2006)05-1091-06

An E-Commerce Recommender System Based on Content-Based Filtering

HE Weihong 1, CAO Yi 1,2

1. Department of Computer Science and Technology, Hunan Institute of Technology, Hengyang 421101, Hunan, China;

2. School of Business, Central South University, Changsha 410083, Hunan, China

Abstract: A content-based filtering E-commerce recommender system is discussed in this paper. First, users' unique features are explored by means of the vector space model. Then, based on the qualitative value of product information, the recommender lists are obtained. Since the system can adapt to users' feedback automatically, its performance is enhanced comprehensively. Finally, the evaluation of the system and the experimental results are presented.

Key words: E-commerce; recommender system; personalized recommendation; content-based filtering; Vector Space Model (VSM)

CLC number: TP 311

Received date: 2006-02-10

Foundation item: Supported by the Hunan Teaching Reform and Research Project of Colleges and Universities (2003-B72), the Hunan Board of Review on Philosophic and Social Scientific Pay-off Project (0406035), and the Hunan Soft Science Research Project (04ZH6005)

Biography: HE Weihong (1970-), male, Associate professor, research direction: network database and data mining. E-mail: hwh@hnpu.edu.cn

0 Introduction

With the popularization of the Internet and the development of E-commerce, the structure of E-commerce systems becomes more complicated as they provide more and more choices for users. Users therefore often get lost in the vast space of commodity information and cannot find the goods they really want. Under increasingly intense competition, an E-commerce recommender system can effectively retain users, keep them from leaving, and increase cross-selling ability. According to the research, sales improved by 2%-8% [1] when a personalized recommender system was used in E-commerce marketing, especially in industries such as books, movies, CD audio-video products, and articles of daily use, where products are cheap and varied and personalized recommender systems are used extensively. The recommender system can greatly boost sales.

Presently, there is a relatively large gap between the recommender functions of E-commerce in China and those in other countries, and our theoretical research on personalized and automatic recommendation is almost a blank. A search in CNKI (China National Knowledge Infrastructure) for articles containing "recommender system" returns very few results, which shows that our research on recommendation has fallen far behind.

This paper discusses an E-commerce recommender system based on content-based filtering. It mainly covers obtaining the features of users' interests, modeling them, and calculating similarity. The system is divided into two phases, data processing and recommender processing. The recommender model and the initial threshold are generated in the first phase; in the second phase, the system automatically adjusts the recommender model and threshold so as to obtain the best recommender performance.

1 Architecture and Workflow

1.1 Architecture

The architecture of the E-commerce recommender system based on content-based filtering is shown in Fig. 1.

The function of the whole recommender system falls into two parts, data processing and self-adaptation, both of which are invisible to the users. The former mainly collects and processes the features of users' interests, extracts their interests with the vector space model, forms the models, and sets the initial threshold. The latter is mainly used to generate the recommender access list, which is recommended to the users through the Web server; the users' feedback on the recommended content is obtained in time so as to self-adapt the recommender model.

[Figure: block diagram with blocks labeled Current access list, Non-access list, Data processing, Recommender processing, Self-adaptation, and Recommended list]

Fig. 1 Architecture for the recommender system based on content-based filtering

1.2 Workflow

The workflow of this recommender system is as follows. First, the current access list, the historical trade data, the Web log, and so on are preprocessed; the topic vector and feature vectors of users' interests are extracted; and data processing forms the initial recommender model and sets the initial threshold. Then, in recommender processing, the similarity between the initial recommender model and the introduction of each product in the product information base is calculated. If the degree of similarity is greater than or equal to the initial threshold, the product matches the users' interests, and its information is added to the recommender access list provided to the users.

Finally, according to the users' feedback on the recommender access list, the system automatically adapts the recommender model and threshold so as to obtain the best recommender quality.

2 Description of the Feature of Users' Interest

2.1 Users' Interest

In order to provide the personalized recommender service, we first need to collect the individual information of users and form models describing the features of users' interests. At present, two methods are commonly used to collect users' information: explicit (dominant) feature description and implicit (recessive) feature description. With the former, the system asks every visitor to fill in an information form or questionnaire covering gender, age, educational background, interests, etc., so as to obtain the information on users' interests directly. For users, however, this is somewhat troublesome, and the degree of accuracy depends on the design of the form or questionnaire and on the cooperation of users [2]. With the latter, the system tracks the visitors' behavior (which is visible to visitors), records their IP address, inquiry time, and inquiry content, and analyzes their interests through Web mining [3]. It needs no active participation from users and does not interfere with their work, so it is a comparatively convenient method.

The ratio between the time a user spends browsing a certain Web page and the number of characters on that page can effectively reveal the user's interests [4]. These interests are related to the categories of the information, and the categories can be determined and are relatively stable. All the information related to users' browsing, including the clicks on each Web page, residence time, and access order, can be found in the proxy server's log, and all the Web pages browsed by users can be found in the server's cache. Through Web mining of this kind we can obtain users' interests.
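The time-per-character ratio described above can be illustrated with a minimal sketch. The sample page views, the category labels, and the choice to average the ratio within each category are assumptions made for illustration only; the paper does not specify how the ratios are aggregated.

```python
def interest_score(dwell_seconds: float, char_count: int) -> float:
    """Browsing time per character of a page; a higher value suggests more interest."""
    if char_count <= 0:
        return 0.0
    return dwell_seconds / char_count


def interest_by_category(views):
    """Average the per-page ratio within each category (assumed aggregation)."""
    ratios = {}
    for category, seconds, chars in views:
        ratios.setdefault(category, []).append(interest_score(seconds, chars))
    return {category: sum(r) / len(r) for category, r in ratios.items()}


# Hypothetical log entries: (category, seconds on page, characters on page).
page_views = [
    ("databases", 120.0, 3000),
    ("databases", 90.0, 1500),
    ("graphics", 10.0, 2500),
]

profile = interest_by_category(page_views)
```

In this toy log the "databases" category receives a much higher score than "graphics", which is the kind of signal the implicit method relies on.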

2.2 Vector Space Model (VSM)

In this system, the textual messages corresponding to the content of a Web page are obtained by stripping the structures unrelated to that content. VSM is adopted in the system to process these textual messages. The basic idea is as follows: suppose that a word's probability of occurrence, its location, and the content of the document are mutually independent, so that the textual structure and the order in which words occur can be ignored when determining the category of the textual content. VSM [5] is a widely used textual computational model in statistics-based classification systems. It converts a given document into a multidimensional vector. Its prominent feature is the convenient calculation of the similarity between two vectors, i.e. the similarity between the corresponding documents.
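As a minimal sketch of this idea, a document can be turned into a bag-of-words vector and two documents compared through the cosine of the angle between their vectors. Whitespace tokenization and raw term counts as weights are simplifying assumptions here, not the weighting used by the system.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words vector: word order and position are ignored."""
    return Counter(text.lower().split())


def cosine_similarity(v1, v2) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical documents for illustration.
d1 = term_vector("network database mining")
d2 = term_vector("database mining methods")
d3 = term_vector("image rendering")

sim_related = cosine_similarity(d1, d2)    # two of three terms shared
sim_unrelated = cosine_similarity(d1, d3)  # no shared terms
```

Documents that share terms score above zero, while documents with no terms in common score exactly zero.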

In VSM, D stands for document and generally refers to any computer-readable record. T stands for term, usually a word or phrase, and refers to the fundamental language unit representing the content of the document. A document can be expressed as a set of terms, $D(T_1, T_2, \ldots, T_n)$, where $T_k$ is a term, $1 \le k \le n$. For example, a document containing 4 terms a, b, c, d can be expressed as D(a, b, c, d). The terms differ in their value to the content of the document, because their locations and occurrence frequencies are not the same. Therefore, for a document containing n terms, a weight should be attached to each term to indicate its degree of importance, i.e. $D = D(T_1, W_1; T_2, W_2; \ldots; T_n, W_n)$, abbreviated as $D = D(W_1, W_2, \ldots, W_n)$ to represent the vector of document D, where $W_k$ is the weight of $T_k$, $1 \le k \le n$. In VSM, the degree of correlation between two documents $D_1$ and $D_2$, $\mathrm{Sim}(D_1, D_2)$, is expressed by the cosine of the angle between their vectors [6,7]:

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} \times W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^2\right)\left(\sum_{k=1}^{n} W_{2k}^2\right)}}$$

Here, $W_{1k}$ and $W_{2k}$ represent the weights of the kth term in $D_1$ and $D_2$ respectively, $1 \le k \le n$.

2.3 Extraction of Terms

Because the vocabulary in documents is large, the dimension of the vectors representing the documents is large too, reaching tens of thousands of dimensions, so dimension reduction is necessary. Extraction of terms means obtaining the document term vector, i.e. extracting from all possible vocabulary the subset of terms that best expresses the document content. It serves two aims. First, it improves program efficiency by reducing the amount of computation and increasing its speed. Second, the tens of thousands of words differ in their significance to the document, and some common words mean little to it. In order to improve the accuracy of the recommender system, we should remove the words that lack expressive force and retain the best set of interest terms.

The best terms are those that carry the largest information content with respect to the relevant text set rel(Q). The logarithm of the information content between a word and the relevant text set is computed as [8]:

$$\log I(w_i, \mathrm{rel}(Q)) = \log\left[\frac{P(w_i \mid w_i \in \mathrm{rel}(Q))}{P(w_i)}\right]$$

Here, $w_i$ refers to the ith word, $P(w_i \mid w_i \in \mathrm{rel}(Q))$ refers to the proportion of word $w_i$ in the relevant text set rel(Q), and $P(w_i)$ refers to the proportion of word $w_i$ in the data processing text.

The words are ordered by the foregoing information content, and a certain number of the top-ranked words are extracted as terms.
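The term extraction step can be sketched as follows. The sketch assumes that documents are already tokenized into word lists, that the relevant set is a subset of the full collection (so every word in it has a nonzero overall proportion), and that the logarithm is natural; the paper does not state the base. The function names and sample data are hypothetical.

```python
import math
from collections import Counter


def log_information(relevant_docs, all_docs):
    """log I(w, rel(Q)) = log[P(w | rel(Q)) / P(w)] for each word in rel(Q)."""
    rel_counts = Counter(word for doc in relevant_docs for word in doc)
    all_counts = Counter(word for doc in all_docs for word in doc)
    rel_total = sum(rel_counts.values())
    all_total = sum(all_counts.values())
    scores = {}
    for word, count in rel_counts.items():
        p_rel = count / rel_total             # proportion within rel(Q)
        p_all = all_counts[word] / all_total  # proportion in the whole text
        scores[word] = math.log(p_rel / p_all)
    return scores


def extract_terms(relevant_docs, all_docs, k):
    """Order words by information content and keep the top k as terms."""
    scores = log_information(relevant_docs, all_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical tokenized documents; "the" is a common word with little content.
relevant = [["sql", "index", "the"], ["sql", "query", "the"]]
others = [["pixel", "the", "shader"], ["the", "render", "the"]]
corpus = relevant + others

scores = log_information(relevant, corpus)
terms = extract_terms(relevant, corpus, k=3)
```

The common word "the" is more frequent in the whole text than in the relevant set, so its score is negative and it is filtered out, while topic-specific words are kept.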

3 Algorithm of the Recommender System Based on Content Filtering

3.1 Algorithm of Data Processing

3.1.1 Data flow

First, the current access list is converted to interest topic vectors, and the feature vectors extracted from users' historical trade data and the Web log are weighted and combined to obtain the initial recommender model. Then, the similarity between the initial model vector and the current access list is calculated. Finally, the best initial similarity threshold is set for each interest topic. The data flow diagram is shown in Fig. 2.

[Figure: data flow diagram with blocks labeled Users' historical trade database, Web log, Current access list, Feature processing, Topic processing, Feature vectors, Interest topic vectors, Initial recommender model, and Initial threshold]

Fig. 2 Data flow diagram

3.1.2 Forming of initial recommender model

The initial recommender model vector is built from the interest topic vector, the feature vector extracted from users' historical trade data, and the feature vector extracted from the Web log. Suppose their weights are a, b and c respectively; then:

$$P_{r0}(Q) = a \times P_0(Q) + b \times P_1(Q) + c \times P_2(Q)$$

Here, Q refers to the interest topic, $P_{r0}$ refers to the initial recommender model vector of Q, and $P_0$, $P_1$ and $P_2$ refer to its 3 sub-vectors.
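The weighted combination of the three sub-vectors can be written as a short sketch. The vector values and the weights 0.5, 0.3 and 0.2 are hypothetical; the paper does not report the values of a, b and c that were used.

```python
def combine(vectors, weights):
    """Weighted sum of equal-length vectors: sum over j of weights[j] * vectors[j]."""
    if len(vectors) != len(weights):
        raise ValueError("one weight per vector is required")
    size = len(vectors[0])
    if any(len(v) != size for v in vectors):
        raise ValueError("all vectors must share the same dimension")
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(size)]


# Hypothetical sub-vectors over a three-word vocabulary.
p0 = [1.0, 0.0, 2.0]  # interest topic vector
p1 = [0.0, 3.0, 1.0]  # feature vector from historical trade data
p2 = [2.0, 2.0, 0.0]  # feature vector from the Web log

a, b, c = 0.5, 0.3, 0.2  # assumed weights
initial_model = combine([p0, p1, p2], [a, b, c])
```

Raising one weight relative to the others makes the model lean more heavily on the corresponding data source.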

As to the topic vector $P_0(Q)$, $P_0(Q) = (p_{01}, p_{02}, \ldots, p_{0W})$, where W refers to the number of words and $p_{0i}$ refers to the weight of $w_i$. According to the ltc formula in the SMART system by Buckley [9],

$$p_{0i} = \begin{cases} \log(N / df(w_i)), & \text{if } w_i \in Q \\ 0, & \text{otherwise} \end{cases}$$

Here, N refers to the total number of documents and $df(w_i)$ refers to the number of documents containing $w_i$. If $w_i$ does not occur in Q, the weight is 0.
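A minimal sketch of this weighting follows. It implements only the log(N/df) factor given in the text, treats each document as a set of words, and uses the natural logarithm; these choices, along with the function name and sample data, are assumptions for illustration.

```python
import math


def topic_weights(vocabulary, topic_words, documents):
    """Weight log(N / df(w)) for words of the topic Q, and 0 for all other words."""
    n_docs = len(documents)
    weights = []
    for word in vocabulary:
        # Document frequency: how many documents contain the word.
        df = sum(1 for doc in documents if word in doc)
        if word in topic_words and df > 0:
            weights.append(math.log(n_docs / df))
        else:
            weights.append(0.0)
    return weights


# Hypothetical collection of four documents, each a set of words.
documents = [
    {"sql", "index"},
    {"sql", "query"},
    {"pixel", "shader"},
    {"sql", "pixel"},
]
vocabulary = ["sql", "index", "pixel"]
topic_words = {"sql", "index"}

p0 = topic_weights(vocabulary, topic_words, documents)
```

A word that occurs in few documents ("index") receives a larger weight than one that occurs in many ("sql"), and a word outside the topic receives zero.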

Similarly, as to the feature vector $P_1(Q)$ from users' historical trade data, $P_1(Q) = (p_{11}, p_{12}, \ldots, p_{1W})$, where $p_{1i}$ refers to the weight of $w_i$:

$$p_{1i} = \begin{cases} \log I(w_i, \mathrm{rel}(Q)), & \text{if } \log I(w_i, \mathrm{rel}(Q)) \ge 3 \\ 0, & \text{otherwise} \end{cases}$$

As to the feature vector $P_2(Q)$ from the Web log, we have $P_2(Q) = (p_{21}, p_{22}, \ldots, p_{2W})$, where $p_{2i}$ refers to the weight of $w_i$:

$$p_{2i} = \begin{cases} \log I(w_i, \text{pseudo-rel}(Q)), & \text{if } \log I(w_i, \text{pseudo-rel}(Q)) \ge 3 \\ 0, & \text{otherwise} \end{cases}$$

3.1.3 Setting of initial threshold

It is very difficult to determine the similarity threshold. In this system a predetermined initial value is adopted. After recommendations are made on the testing data, the initial value is adjusted according to the accuracy of the recommendations. Once the threshold is set, the data whose similarity to the model vector is greater than or equal to the threshold are taken as relevant to the interest topic and passed to recommender processing.
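One simple way to pick such an initial value is to try several candidate thresholds on labeled testing data and keep the one that scores best. In the sketch below, plain accuracy is used as a stand-in scoring function, and the similarity values, relevance labels, and candidate list are hypothetical; another evaluation indicator could be substituted for `accuracy_at`.

```python
def accuracy_at(threshold, similarities, relevant):
    """Fraction of test items whose recommend-or-skip decision is correct."""
    correct = 0
    for sim, is_relevant in zip(similarities, relevant):
        recommended = sim >= threshold
        if recommended == is_relevant:
            correct += 1
    return correct / len(similarities)


def best_threshold(similarities, relevant, candidates):
    """Return the candidate with the highest accuracy (the first one on ties)."""
    return max(candidates, key=lambda t: accuracy_at(t, similarities, relevant))


# Hypothetical testing data: similarity to the model vector and true relevance.
sims = [0.9, 0.7, 0.4, 0.2]
labels = [True, True, False, False]
candidates = [0.1, 0.3, 0.5, 0.8]

initial_threshold = best_threshold(sims, labels, candidates)
```

With this toy data, a threshold of 0.5 separates the relevant items from the irrelevant ones and is therefore selected.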

This recommender system follows the T9P [10] evaluation indicator of information filtering: by calculating the similarity between the model vector and the preprocessing data, it calculates the T9P value for each candidate threshold and takes the threshold that gives the best performance as the initial threshold. The similarity between the model vector and the preprocessing data can be expressed as:

$$\mathrm{Sim}(d, P_r) = \frac{\sum_{k=1}^{m} d_k \times p_{rk}}{\sqrt{\left(\sum_{k=1}^{m} d_k^2\right)\left(\sum_{k=1}^{m} p_{rk}^2\right)}}$$

Here, d refers to the preprocessing document, $P_r$ refers to the model vector, m refers to the dimension of the feature vector, $d_k$ refers to the weight of the kth word in the document, and $p_{rk}$ refers to the kth component of the model vector.

3.2 Algorithm of Recommender Processing

3.2.1 Data flow diagram of recommender processing

After the initial recommender model is formed and the threshold is set, we can calculate the similarity between the introduction of any product in the product information base and the model vector of a certain user interest topic. If the similarity is greater than or equal to the threshold, the product is regarded as relevant to the user's interest and is added to the recommender list, which is recommended to the users via the Web server. According to users' judgment of its usefulness, the system automatically adjusts the model or threshold so that the recommender performance is continuously improved to meet the needs of users.
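The recommender processing step can be sketched as a filter over the product information base. The product identifiers and vectors are hypothetical, and returning the list ordered by decreasing similarity is an assumption; the paper only states that products at or above the threshold are recommended.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(model_vector, products, threshold):
    """Return (product_id, similarity) pairs with similarity >= threshold, best first."""
    scored = [(pid, cosine(model_vector, vec)) for pid, vec in products.items()]
    hits = [(pid, sim) for pid, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical model vector and product introduction vectors.
model = [1.0, 1.0, 0.0]
products = {
    "A": [1.0, 1.0, 0.0],
    "B": [1.0, 0.0, 0.0],
    "C": [0.0, 0.0, 1.0],
}

recommended = recommend(model, products, threshold=0.6)
```

Here products A and B pass the threshold and product C, which shares no terms with the model, is left out.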

The data flow diagram of self-adjusting recommender processing is shown in Fig. 3.

[Figure: data flow diagram with blocks labeled Feature vectors, Interest topic vectors, Current access list, Threshold, Threshold adaptation, Model modification, User, and Recommended list]

Fig. 3 Data flow diagram of recommender processing

3.2.2 Adjustment of the threshold

The selection of the threshold is the key to the performance of the recommender system model. The threshold can be adjusted according to users' feedback on the recommended access list. If it is set rather low, more information reaches or exceeds the threshold, and the recall rate of the system increases greatly. On the other hand, if the threshold is set rather high, less information meets the recommender condition, and the precision rate increases. So the threshold is adjusted according to the following principles:

① If the recommended information is beyond the need, increase the threshold, which increases the precision rate and decreases the recall rate;

② If the recommended information falls short of the need, decrease the threshold, which decreases the precision rate and increases the recall rate.

3.2.3 Modification of model

If a user finds that the recommended access list meets his interests, he will browse the relevant information, and the recommended access list becomes the current access list. In order to adjust the model vector, we extract the interest topic vector from this list and the feature vectors from users' historical trade data and the Web log (here the Web log has changed correspondingly). By weighting and combining the topic vector and the feature vectors we get the new model vector. Suppose the weights are a', b' and c'; we have:

$$P_r'(Q) = a' \times P_3(Q) + b' \times P_1(Q) + c' \times P_4(Q)$$

Here, $P_3(Q)$ refers to the interest topic vector extracted from the current access list, $P_3(Q) = (p_{31}, p_{32}, \ldots, p_{3W})$; $P_1(Q)$ refers to the feature vector from users' historical trade data, as in Section 3.1.2; and $P_4(Q)$ refers to the users' feature vector extracted from the Web log, $P_4(Q) = (p_{41}, p_{42}, \ldots, p_{4W})$.
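The self-adaptation of Section 3.2 can be sketched as two small functions, one for the threshold and one for the model. The fixed step size, the use of a count of wanted items to represent "the need", and all sample values are assumptions for illustration; the paper does not say how large each adjustment is.

```python
def adjust_threshold(threshold, n_recommended, n_wanted, step=0.05):
    """Raise the threshold when too much is recommended, lower it when too little."""
    if n_recommended > n_wanted:
        threshold += step  # favors precision over recall
    elif n_recommended < n_wanted:
        threshold -= step  # favors recall over precision
    return min(1.0, max(0.0, threshold))  # keep within the cosine range [0, 1]


def update_model(topic_vec, trade_vec, weblog_vec, a, b, c):
    """New model vector as a weighted sum of the three refreshed vectors."""
    return [a * x + b * y + c * z for x, y, z in zip(topic_vec, trade_vec, weblog_vec)]


# Hypothetical feedback round: 12 items were recommended, the user wanted about 8.
new_threshold = adjust_threshold(0.5, n_recommended=12, n_wanted=8)

# Hypothetical refreshed vectors and weights a', b', c'.
new_model = update_model([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], a=0.5, b=0.3, c=0.2)
```

Repeating these two steps after each round of feedback lets the threshold and model drift toward the user's current interests.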

4 Experimental Results and Its Analysis

In order to test this system, we chose 3,200 articles from all the articles published in a computer magazine in 2000-2003 (800 per year) as experimental data. In the experiment, we take these articles as the products, their abstracts as the introductions of the products, and users' download records, which can be found in the Web log on the server, as the users' historical trade data. Because the content of these articles is clear and definite, the result is clear enough.

In the experiment, the initial recommender model and threshold are set according to the current users' access list, the historical download records, and the Web log. Recommendations are then made on the basis of each abstract, and the recommender lists are provided in time order for users to confirm. The system modifies the model according to the feedback and processes the articles of the next year with the modified model. The experimental result is shown in Table 1.

Table 1 Check list of experimental result and actual value

Year    Actual value    Recommended result
                        Right    Wrong
2000    13              8        7
2001    15              7        6
2002    16              7        4
2003    14              6        2

Usually the evaluation standard of the information retrieval field is adopted to judge the recommending quality of a system [11], i.e. the precision rate p and the recall rate r:

$$p = \frac{\text{number of right recommended items}}{\text{number of all recommended items}}$$

$$r = \frac{\text{number of right recommended items}}{\text{number of all relevant items}}$$

Precision and recall are, to a certain degree, conflicting indices: high precision tends to mean low recall. To balance the two, the overall evaluation index F-measure is adopted:

$$F\text{-measure} = \frac{2pr}{p + r}$$
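These indices can be computed directly from the counts in Table 1. The sketch takes recall in its usual information retrieval sense, right recommendations divided by the number of relevant items, and assumes that the "Actual value" column of Table 1 is that number of relevant items.

```python
def evaluate(actual, right, wrong):
    """Return (precision, recall, F-measure) for one row of counts."""
    recommended = right + wrong
    p = right / recommended if recommended else 0.0
    r = right / actual if actual else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Rows of Table 1: year -> (actual value, right, wrong).
table1 = {
    2000: (13, 8, 7),
    2001: (15, 7, 6),
    2002: (16, 7, 4),
    2003: (14, 6, 2),
}

results = {year: evaluate(*row) for year, row in table1.items()}

for year, (p, r, f) in sorted(results.items()):
    print(f"{year}: p={p:.3f} r={r:.3f} F={f:.3f}")
```

Because the F-measure is the harmonic mean of p and r, it stays low whenever either of the two is low.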

[Figure: line chart of p, r, and F-measure (vertical axis from 0 to 0.8) against Year (2000 to 2003)]

Fig. 4 Graph of system performance

Figure 4 shows the system performance. Over time, the precision of the recommender system rose to some extent while the recall dropped, but the F-measure of the system was on the rise. This indicates that the overall performance of the recommender system improved over time.

5 Conclusion

The content-based filtering E-commerce recommender system put forward in this paper explores users' unique interests with the vector space model, recommends products according to the qualitative value of product information, and automatically adapts to users' feedback. In this way its overall performance is enhanced.

The vector space model judges the significance of terms according to their location and occurrence frequency in the document. It is essentially a low-level statistical method for documents. Content-based filtering recommendation can construct a reliable classifier only when it has accumulated adequate evaluations, and it is troublesome to deal with new users and new products [12]. When the features of users' interests change, the recommender system cannot update easily. It must ultimately be optimized through practical application.

References

[1] Lawrence R D, Almasi G S, Kotlyar V, et al. Personalization of Supermarket Product Recommendations [R]. New York: IBM Research Report, 2000.

[2] Li Peng, Wang Dongsheng, Chen Kang. A Personalization Information Recommendation System Based on VSM [J]. Computer Engineering and Design, 2003, 24(10): 19-22 (Ch).

[3] Cooley R, Mobasher B, Srivastava J. Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns [R]. Minneapolis: Technical Report of Computer Science, 1997.

[4] Pretschner A. Ontology Based Personalized Search [D]. Lawrence, KS: University of Kansas, 1999.

[5] Buckley C, Salton G, Allan J, et al. Automatic Query Expansion Using SMART [C]//Overview of the Third Text Retrieval Conference (TREC-3), Donna Harman, 1994.

[6] Cao Yi, He Weihong. An Information Security Filtering System Based on Vector Space Model [J]. Computer Engineering and Design, 2006, 27(2): 224-227 (Ch).

[7] Malone T W, Grant K R, Turbak F A, et al. Intelligent Information Sharing Systems [J]. Communications of the ACM, 1987, 30(5): 390-402.

[8] Huang X J, Xia Y J, Wu L D. A Text Filtering System Based on Vector Space Model [J]. Journal of Software, 2003, 14(3): 435-442 (Ch).

[9] Buckley C, Salton G, Allan J. Automatic Retrieval with Locality Information Using SMART [C]//Proceedings of the 1st Text Retrieval Conference (TREC-1). Gaithersburg: NIST Special Publication, 1992: 59-72.

[10] Robertson S, Hull D. The TREC-9 Filtering Track Final Report [C]//Proceedings of the 9th Text Retrieval Conference (TREC-9). Gaithersburg: NIST Special Publication, 2001: 25-40.

[11] Yu Li, Liu Lu. Research on Personalized Recommendations in E-business [J]. Computer Integrated Manufacturing Systems, 2004, 10(10): 1306-1313 (Ch).

[12] Pazzani M J. A Framework for Collaborative, Content-Based and Demographic Filtering [J]. Artificial Intelligence Review, 1999, 13(5/6): 393-408.
