interest-based user grouping model for collaborative filtering in digital libraries 7 th icadl 2004...

Interest-based User Grouping Model Interest-based User Grouping Model for Collaborative Filtering for Collaborative Filtering

in Digital Librariesin Digital Libraries

7th ICADL 2004

Shanghai, P.R.China

Dec. 15, 2004

Edward A. Fox, Seonho Kim

Digital Library Research Laboratory (DLRL)

Virginia Tech, Blacksburg, VA 26061 USA

ICADL 20042

OverviewOverview

• Introduction – Previous work

– Recommender System for Digital Libraries .

• Proposed User Grouping Model– Collecting User Interests

– System Diagram

– System Analysis by 5S Model

– User Model

– Recommendation Algorithms

– Hypotheses

• Experiment

• Experiment Results– Collected Data

– Hypothesis Test

– User Grouping

• Future Work

• Conclusions

ICADL 20043

Previous WorkPrevious Work

• Collaborative filtering– GroupLens [2] for Usenet news

• Recommender system– Ungar [1] and lots of works for online shopping malls

– PASS [3], Renda [4], DEBORA [5] for Digital Library (DL)

• Standard log for DL– Gonçalves et al. [6] XML log standard for DLs

• Rating data– Explicit rating data – Entered by user during registration or by

answering questionnaires

– Implicit rating data, Nichols [7] – Obtained by analyzing log data

ICADL 20044

Recommender System for Dynamic Recommender System for Dynamic & Complex Information Systems& Complex Information Systems

• Different features of DL– Difficulties in collecting explicit user rating data

– Difficulties in analyzing user log data

– Difficulties in item classification

– Sparseness in rating data

– Dynamic, vast and complex items

– Narrow user interests

• Important aspects of recommender system for DLs – scalability, accommodating new data, comprehensibility [9]

ICADL 20045

Conventional Recommender Conventional Recommender System for Shopping MallSystem for Shopping Mall

UsersUser Classes ItemsItem Classes

recommend

ICADL 20046

Recommender System for DLRecommender System for DL

Users User Classes ItemsItem Classes

?

?

?

?

?

??

dynamic itemshard to classify

ICADL 20047

Proposed User Grouping ModelProposed User Grouping Model

• User grouping is the most critical procedure for a recommender system.

• Suitable for dynamic and complex information systems like DLs

• Overcomes data sparseness

• Uses implicit rating data rather than explicit rating data

• User oriented recommender algorithm

• User interest-based community finding

• User modeling– User model (UM) contains complete statistics for recommender

system.

– Enhanced interoperability

ICADL 20048

Collecting User Interests for User Collecting User Interests for User GroupingGrouping

• Users with similar interests are grouped• Employs a Document Clustering Algorithm,

LINGO [10], to collect document topics• Users’ interests are collected implicitly during

searching and browsing.• A User Model (UM) contains her interests and

document topics.• Interests of a user are subset of document topics

proposed to her by Document Clustering.

ICADL 20049

Interest-based Recommender Interest-based Recommender SystemSystem

ICADL 200410

System Analysis with 5S ModelSystem Analysis with 5S Model

Interest-based Recommender System for DL

Society

Space

Structure

Stream Scenario

User Interface

User Model

PresentationPush service

FilteringRanking

HighlightingPersonalized

pages

Recommendation

Group Selection

Individual Selection

Interest GroupResearcher

Learner

Teacher

Class Group

Probability Space

Vector space

Collaboration space

Community

displays

Text AudioVideo

represented by

UM schema

User description

User interestsDocument topics

User groups

Statistics

participates

generates

refers

composed of

refers

Users

Users

ICADL 200411

User Model (UM)User Model (UM)

User ID

User Description Groups Statistics

Name Document Topic Score

User Interest Score

Group ID Score

E-mail

Address

Publications

User Interests

(implicit data-generated by user interface and recommender)

(implicit data-generated by recommender)

(explicit data-obtained from questionnaire)

ICADL 200412

User Similarity Based on User User Similarity Based on User InterestsInterests

• Derived from Pearson’s correlation coefficient

• Similarity between users ‘a’ and ‘i’

• where j is common interest, va,j is rating score of item j by user a, and

j jijiaja

jijiaja

vvvv

vvvvias

2,

2,

,,

)()(

))((),(

systemthebyproposedtopicsofnumbertotal

selectedusertopicsofnumbertotalva

ICADL 200413

Interest-based Group Interest-based Group RecommendationRecommendation

• , the probability that a user group ‘k’ is affected from a rating which is made by a user ‘a’ to an item ‘j’, can be calculated as follows:

kR

kiCiji

kiCiji

vNT

kv

Nkk PPR :,

:,

)1(1

1

)1(

• where

T : total number of users in the system

Ci : the group that user i belongs to

Vi,j : probability that user i votes for j

N : total number of users in group K

Pk : base rate of group K, observed from DB

ICADL 200414

Interest-based Individual Interest-based Individual RecommendationRecommendation

• Once groups are selected, individual users, who will be affected from the voting, are selected.

• Probability that a user ‘a’ in group ‘k’ likes the item ‘j’ can be obtained by:

))(,(1

,,

n

iijiaja vviasvP

• where

n is the number of users in the selected group k

s(a,i) is the similarity between user ‘a’ and user ‘i’

is average probability of positive voting of user ‘a’

is a normalizing factor

av

ICADL 200415

Interest-based Recommendation Interest-based Recommendation AlgorithmsAlgorithms

• Correct user grouping is critical for correct recommendation.

• User centered algorithm– User model consists of complete statistics for recommender.

– Since no references to statistics of items are needed, is suitable for information systems with dynamic and complex items.

– Enhanced User Model interoperability and reusability

• Group oriented algorithm– Two phase recommendation

• Group selection & Individual selection

– Automatic user’s communities finding

ICADL 200416

HypothesesHypotheses

• Three hypotheses about proper document clustering algorithm behavior– H1 : Any serious user who has his own research interests and

topics, shows consistent output for the document collections referred to by that user.

– H2 : Serious users who share common research interests and topics, show overlapped output for the document collections referred to by them.

– H3 : Serious users who don’t share any research interests and topics, show different output for document collections referred to by them.

ICADL 200417

Experiment - TasksExperiment - Tasks

• Subjects are asked to – answer a questionnaire to collect democratic

information

– list research interests to help us collect explicit rating data which is used for evaluation in the experiment

– search some documents in her research interests and browse the result documents to help us collect implicit rating data

ICADL 200418

Experiment - ParticipantsExperiment - Participants

• 22 Ph.D and MS students majoring in Computer Science

• CITIDEL [8] is used as a DL in “Computing” field• Data from 4 students were excluded as their

research domains are not included in CITIDEL

ICADL 200419

Experiment - InterfacesExperiment - Interfaces

• Specially designed user interfaces are required to capture user’s interactions

• JavaScripts

• Java Application

ICADL 200420

Results - Collected DataResults - Collected Data

• Example<Semi Structured Data<Cross Language Information Retrieval CLIR<Translation Model<Structured English Query<TREC Experiments at Maryland<Structured Document<Evaluation<Attribute Grammars<Learning<Web<Query Processing<Query Optimisers<QA<Disambiguation<Sources<SEQUEL<Fuzzy<Indexing<Inference Problem<Schematically Heterogeneity<Sub Optimization Query Execution Plan<Generation<(Other)(<Cross Language Information Retrieval CLIR)(<Structured English Query)(<TREC Experiments at Maryland)(<Evaluation)(<Query Processing)(<Query Optimisers)(<Disambiguation)

<Cross Language Information Retrieval CLIR<Machine Translation<English Japanese<Based Machine<TREC Experiments at Maryland<Approach to Machine<Natural Language<Future of Machine Translation<Machine Adaptable Dynamic Binary<CLIR Track<Systems<New<Tables Provide<Design<Statistical Machine<Query Translation<Evaluates<Chinese<USA October Proceedings<Interlingual<Technology<Syntax Directed Transduction<Interpretation<Knowledge<Linguistic<Divergences<(Other)(<Cross Language Information Retrieval CLIR)(<Machine Translation)(<English Japanese)(<TREC Experiments at Maryland)(<CLIR Track)(<Query Translation)

• Parenthesized topics mean they are rated positively

ICADL 200421

Results – Hypothesis Test for HResults – Hypothesis Test for H11

• H0 (Null hypothesis of H1) : Mean(μ) of frequency of document topics proposed by Document Clustering Algorithm are NOT consistent (μ0 = 1) for a user, H0 : μ = μ0 vs H1 : μ > μ0

• Conditions : 95% confidence (test size α = 0.05), sample size ‘n’ ≤ 25, standard deviation ‘σ’ unknown, i.i.d. random samples, normal distribution, estimated z-score T-test

• Test statistics : sample mean ‘ỹ’ = 1.1429, sample standard deviation ‘s’ = 0.2277 are observed from the experiment

• Rejection Rule is to reject H0 if the ỹ > μ0+zα/2 σ/√n

• From the experiment, ỹ = 1.1429 > μ0+zα/2 σ/√n = 1.0934

• Therefore decision is to Reject H0 and accept H1

• 95% Confidence Interval for μ is 1.0297 ≤ μ ≤1.2561

• P-value (confidence of H0) = 0.0039

ICADL 200422

Results - UsersResults - Users• All users are assigned a symbol after experiments

according to their explicit data for convenience of analysisUser Symbols User profiles collected from questionnaire

1 dlmember The one who belonged to the Digital Library Research Laboratory

2 softeng The one who has an interest in Software Engineering

3 bio The one who has an interest in Bioinformatics

4 vr_hci The one who has an interest in Virtual Reality and Human Computer Interaction

5 clir_1 The one who has an interest in Cross Language Information Retrieval

6 clir_2 The one who has an interest in Cross Language Information Retrieval

7 nlp_1 The one who has an interest in Natural Language Processing

8 nlp_2 The one who has an interest in Natural Language Processing

9 vr_1 The one who has an interest in Virtual Reality

10 vr_2 The one who has an interest in Virtual Reality

11 EC_agent The one who has an interest in E-Commerce and Agent

12 CybEdu_agt The one who has an interest in Cyber Education and Agent

13 dlandedu_1 The one who has an interest in Digital Library and Education

14 dlandedu_2 The one who has an interest in Digital Library and Education

15 person_1 The one who has an interest in Personalization

16 person_2 The one who has an interest in Personalization

17 se_me The one who has an interest in Software Engineering

18 fuzzy The one who has an interest in Fuzzy Theory

ICADL 200423

Results – User SimilaritiesResults – User Similarities

dlmem

ber

vr_hci

nlp_1

fuzzy

clir_2

person_2 dlmember

dlandedu_1

ec_agent

person_20

0.050.1

0.15

0.2

0.25

0.3

0.35

0.4

dlmembersoftengbiovr_hciclir_1dlandedu_1nlp_1person_1vr_1fuzzyec_agentcybedu_agtclir_2dlandedu_2nlp_2person_2se_mevr_2

ICADL 200424

Results – User Similarity LevelsResults – User Similarity Levels

User ID Level 1 Level 2 Level 3 Level 4

dlmember dlandedu_1, dlandedu_2

softeng se_me person_2

bio

vr_hci vr_2, vr_1 person_1, person_2

clir_1 nlp_1, clir_2 nlp_2

clir_2 clir_1, nlp_1 nlp_2

nlp_1 nlp_2 clir_1, clir_2

nlp_2 nlp_1 clir_1, clir_2

vr_1 vr_2, vr_hci

vr_2 vr_hci vr_1 person_1, person_2

EC_agent CybEdu_agt person_2

CybEdu_agt EC_agent fuzzy

dlandedu_1 dlmember dlandedu_2 vr_hci, CybEdu_agt

dlandedu_2 dlmember dlandedu_1 CybEdu_agt, vr_hci

person_1 person_2

person_2 person_1

se_me softeng

fuzzy CybEdu_agt bio, nlp_1

ICADL 200425

Results - User GroupsResults - User Groups

User Group ID

Member IDs ( assigned after experiment according to their research interests which are answered on the questionnaire)

A dlmember, dlandedu_1, dlandedu_2

B softeng, se_me,

C vr_hci, vr_1, vr_2

D clir_1, nlp_1, clir_2

E nlp_1, nlp_2

F person_1, person_2

G fuzzy, cybedu_agt

H EC_agent, cybedu_agt, fuzzy

I Bio

• User groups are generated by merging a user and other members with the closest similarity level

ICADL 200426

Future WorkFuture Work

• Detailed analyses for accuracy, scalability and efficiency

• Further confirmation of our hypotheses

ICADL 200427

ConclusionsConclusions

• Proposed a user clustering model based on user interests

• Proposed user centered recommendation algorithms which are suitable for DLs

• Proposed a way of collecting and using implicit rating data from DL users

• Proposed a active way of user communities finding

• Verified proposed approaches through designed experiments and hypothesis tests

ICADL 200428

ReferencesReferences

• [1] Lyle H. Ungar and Dean P. Foster: A Formal Statistical Approach to Collaborative Filtering. CONALD ’98, Carnegie Mellon U., Pittsburgh, PA (1998)

• [2] Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon, and John Riedl: GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, Vol. 40 (1997) 77-87

• [3] Chun Zeng, Xiaohui Zheng, Chunxiao Xing, Lizhu Zhou : Personalized Services for Digital Library. In Proc. 5th Int. Conf. on Asian Digital Libraries, ICADL (2002) 252-253

• [4] M. Elena Renda, Umberto Straccia: A Personalized Collaborative Digital Library Environment. In Proc. 5th Int. Conf. on Asian Digital Libraries, ICADL (2002) 262-274

• [5] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael B. Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In Proceedings of the Fourth European Conference on Research and Advanced Technology for Digital Libraries (2000) 239-248

• [6] Marcos A. Gonçalves, Ming Luo, Rao Shen, Mir Farooq, and Edward A. Fox: An XML Log Standard and Tools for Digital Library Logging Analysis, in Proc. of Sixth European Conference on Research and Advanced Technology for Digital Libraries (2002) 16-18

ICADL 200429

ReferencesReferences

• [7] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael B. Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In Proceedings of the Fourth European Conference on Research and Advanced Technology for Digital Libraries (2000) 239-248

• [8] CITIDEL project, Computing and Information Technology Interactive Digital Educational Library, http://www.citidel.org/ (2004)

• [9] John S. Breese, David Heckerman and Carl Kadie: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (1997) 43-52

• [10] Stanisław Osiński and Dawid Weiss: Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data, Advanced in Soft Computing, Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM’04 Conference, Zakopane, Poland (2004) 369-378

• Acknowledgements to NSF for partial support– DUE-0121679, 0121741, 0333531; IIS-0086227, 0307867, 0325579

interest-based user grouping model for collaborative filtering in digital libraries 7 th icadl 2004...

Documents