Interest-based User Grouping Model for Collaborative Filtering in Digital Libraries
7th ICADL 2004
Shanghai, P.R.China
Dec. 15, 2004
Edward A. Fox, Seonho Kim
Digital Library Research Laboratory (DLRL)
Virginia Tech, Blacksburg, VA 24061 USA
Overview
• Introduction
– Previous work
– Recommender System for Digital Libraries
• Proposed User Grouping Model
– Collecting User Interests
– System Diagram
– System Analysis by 5S Model
– User Model
– Recommendation Algorithms
– Hypotheses
• Experiment
• Experiment Results
– Collected Data
– Hypothesis Test
– User Grouping
• Future Work
• Conclusions
Previous Work
• Collaborative filtering
– GroupLens [2] for Usenet news
• Recommender system
– Ungar [1] and many others for online shopping malls
– PASS [3], Renda [4], DEBORA [5] for Digital Library (DL)
• Standard log for DL
– Gonçalves et al. [6] XML log standard for DLs
• Rating data
– Explicit rating data: entered by the user during registration or by answering questionnaires
– Implicit rating data, Nichols [7]: obtained by analyzing log data
Recommender System for Dynamic & Complex Information Systems
• Different features of DL
– Difficulties in collecting explicit user rating data
– Difficulties in analyzing user log data
– Difficulties in item classification
– Sparseness in rating data
– Dynamic, vast and complex items
– Narrow user interests
• Important aspects of recommender system for DLs
– scalability, accommodating new data, comprehensibility [9]
Conventional Recommender System for Shopping Mall
[Diagram with labels: Users, User Classes, Items, Item Classes, "recommend"]
Recommender System for DL
[Diagram with labels: Users, User Classes, Items, Item Classes; several question marks; annotations "dynamic items" and "hard to classify"]
Proposed User Grouping Model
• User grouping is the most critical procedure for a recommender system.
• Suitable for dynamic and complex information systems like DLs
• Overcomes data sparseness
• Uses implicit rating data rather than explicit rating data
• User oriented recommender algorithm
• User interest-based community finding
• User modeling
– User model (UM) contains complete statistics for recommender system.
– Enhanced interoperability
Collecting User Interests for User Grouping
• Users with similar interests are grouped.
• Employs a document clustering algorithm, LINGO [10], to collect document topics.
• Users' interests are collected implicitly during searching and browsing.
• A User Model (UM) contains the user's interests and document topics.
• The interests of a user are a subset of the document topics proposed to that user by document clustering.
Interest-based Recommender System
[Figure: system diagram]
System Analysis with 5S Model
[Diagram: 5S analysis (Society, Space, Structure, Stream, Scenario) of the Interest-based Recommender System for DL. Node labels: User Interface; User Model; Presentation; Push service; Filtering; Ranking; Highlighting; Personalized pages; Recommendation; Group Selection; Individual Selection; Interest Group; Researcher; Learner; Teacher; Class Group; Community; Users; Probability space; Vector space; Collaboration space; Text; Audio; Video; UM schema; User description; User interests; Document topics; User groups; Statistics. Edge labels: displays; represented by; participates; generates; refers; composed of.]
User Model (UM)
• User ID
• User Description (explicit data, obtained from questionnaire)
– Name
– Address
– Publications
– User Interests
• Groups (implicit data, generated by recommender)
– Group ID, Score
• Statistics (implicit data, generated by user interface and recommender)
– Document Topic, Score
– User Interest, Score
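The user model fields can be pictured as a small in-memory record. The following Python sketch is illustrative only: the class names, field names, and the reading of each "score" as a running count are assumptions, not the authors' UM schema.

```python
from dataclasses import dataclass, field


@dataclass
class UserDescription:
    """Explicit data, obtained from the questionnaire."""
    name: str = ""
    address: str = ""
    publications: list[str] = field(default_factory=list)
    user_interests: list[str] = field(default_factory=list)


@dataclass
class UserModel:
    """Hypothetical container mirroring the UM fields on the slide."""
    user_id: str
    description: UserDescription = field(default_factory=UserDescription)
    # Implicit data generated by the recommender: group ID -> score.
    groups: dict[str, float] = field(default_factory=dict)
    # Implicit statistics: document topic -> score, user interest -> score.
    document_topics: dict[str, float] = field(default_factory=dict)
    interests: dict[str, float] = field(default_factory=dict)

    def record_proposed(self, topics: list[str]) -> None:
        """Count topics that document clustering proposed to this user."""
        for topic in topics:
            self.document_topics[topic] = self.document_topics.get(topic, 0.0) + 1.0

    def record_selected(self, topic: str) -> None:
        """Count a proposed topic that the user selected (assumed to signal interest)."""
        self.interests[topic] = self.interests.get(topic, 0.0) + 1.0


um = UserModel(user_id="u1")
um.record_proposed(["Machine Translation", "CLIR Track"])
um.record_selected("CLIR Track")
print(um.document_topics, um.interests)
```

Keeping all statistics inside the user record, with no per-item table, matches the slide's point that the UM alone should be enough for the recommender.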
User Similarity Based on User Interests
• Derived from Pearson's correlation coefficient
• Similarity between users 'a' and 'i':
$$ s(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \, \sum_j (v_{i,j} - \bar{v}_i)^2}} $$
• where j is a common interest, $v_{a,j}$ is the rating score of item j by user a, and
$$ \bar{v}_a = \frac{\text{total number of topics selected by the user}}{\text{total number of topics proposed by the system}} $$
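The similarity measure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each user is represented by the set of topics proposed to them and the subset they selected, the rating of a topic is taken as 1 if selected and 0 otherwise, and "common interest" is read as a topic proposed to both users. The function names are hypothetical.

```python
from math import sqrt


def mean_vote(selected: set[str], proposed: set[str]) -> float:
    """Ratio of topics the user selected to topics the system proposed."""
    return len(selected & proposed) / len(proposed) if proposed else 0.0


def similarity(selected_a: set[str], proposed_a: set[str],
               selected_i: set[str], proposed_i: set[str]) -> float:
    """Pearson-style similarity over topics proposed to both users."""
    common = proposed_a & proposed_i
    if not common:
        return 0.0
    mean_a = mean_vote(selected_a, proposed_a)
    mean_i = mean_vote(selected_i, proposed_i)
    num = den_a = den_i = 0.0
    for topic in common:
        # Assumed binary rating: 1 if the user selected the topic, else 0.
        dev_a = (1.0 if topic in selected_a else 0.0) - mean_a
        dev_i = (1.0 if topic in selected_i else 0.0) - mean_i
        num += dev_a * dev_i
        den_a += dev_a * dev_a
        den_i += dev_i * dev_i
    den = sqrt(den_a * den_i)
    return num / den if den > 0 else 0.0


# Two hypothetical users who share three proposed topics.
print(similarity({"A", "B"}, {"A", "B", "C", "D"},
                 {"A", "C"}, {"A", "B", "C", "E"}))
```

Returning 0 when there is no overlap or no variance is a design choice of this sketch; the slides do not say how those cases are handled.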
Interest-based Group Recommendation
• $R_k$, the probability that user group 'k' is affected by a rating made by user 'a' for item 'j', is calculated from the quantities defined below.
[Equation for $R_k$ in terms of $P_k$, N, T, $C_i$ and $v_{i,j}$; the formula is garbled in this transcript]
• where
T : total number of users in the system
$C_i$ : the group that user i belongs to
$v_{i,j}$ : probability that user i votes for j
N : total number of users in group k
$P_k$ : base rate of group k, observed from DB
Interest-based Individual Recommendation
• Once groups are selected, the individual users who will be affected by the voting are selected.
• The probability that user 'a' in group 'k' likes item 'j' can be obtained by:
$$ P_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} s(a,i)\,(v_{i,j} - \bar{v}_i) $$
• where
n is the number of users in the selected group k
s(a,i) is the similarity between user 'a' and user 'i'
$\bar{v}_a$ is the average probability of positive voting of user 'a'
$\kappa$ is a normalizing factor
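The individual prediction is a similarity-weighted sum of how far each group member's vote deviates from that member's own average, added to the target user's average. A minimal Python sketch, assuming the normalizing factor is one over the sum of absolute similarities (the convention in Breese et al. [9]; the slides do not define it) and using hypothetical names:

```python
def predict(mean_a: float,
            sims: dict[str, float],
            votes: dict[str, float],
            means: dict[str, float]) -> float:
    """Estimate how likely user a is to like item j.

    sims[i]  : similarity s(a, i) to each user i in the selected group
    votes[i] : vote of user i on item j
    means[i] : average vote of user i
    """
    norm = sum(abs(s) for s in sims.values())
    if norm == 0:
        # No informative neighbours: fall back to the user's own average.
        return mean_a
    kappa = 1.0 / norm  # assumed normalizing factor
    weighted = sum(sims[i] * (votes[i] - means[i]) for i in sims)
    return mean_a + kappa * weighted


# Hypothetical group with two other members.
print(predict(0.5,
              sims={"u1": 0.4, "u2": 0.1},
              votes={"u1": 1.0, "u2": 0.0},
              means={"u1": 0.5, "u2": 0.2}))
```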
Interest-based Recommendation Algorithms
• Correct user grouping is critical for correct recommendation.
• User centered algorithm
– User model consists of complete statistics for recommender.
– Since no references to statistics of items are needed, it is suitable for information systems with dynamic and complex items.
– Enhanced User Model interoperability and reusability
• Group oriented algorithm
– Two phase recommendation: group selection & individual selection
– Automatic finding of user communities
Hypotheses
• Three hypotheses about proper document clustering algorithm behavior
– H1 : Any serious user who has their own research interests and topics shows consistent output for the document collections referred to by that user.
– H2 : Serious users who share common research interests and topics show overlapping output for the document collections referred to by them.
– H3 : Serious users who don't share any research interests and topics show different output for the document collections referred to by them.
Experiment - Tasks
• Subjects are asked to
– answer a questionnaire to collect demographic information
– list research interests, to help us collect explicit rating data, which is used for evaluation in the experiment
– search for documents in their research interests and browse the resulting documents, to help us collect implicit rating data
Experiment - Participants
• 22 Ph.D. and M.S. students majoring in Computer Science
• CITIDEL [8] is used as a DL in the "Computing" field
• Data from 4 students were excluded as their research domains are not included in CITIDEL
Experiment - Interfaces
• Specially designed user interfaces are required to capture users' interactions
• JavaScript
• Java Application
Results - Collected Data
• Example<Semi Structured Data<Cross Language Information Retrieval CLIR<Translation Model<Structured English Query<TREC Experiments at Maryland<Structured Document<Evaluation<Attribute Grammars<Learning<Web<Query Processing<Query Optimisers<QA<Disambiguation<Sources<SEQUEL<Fuzzy<Indexing<Inference Problem<Schematically Heterogeneity<Sub Optimization Query Execution Plan<Generation<(Other)(<Cross Language Information Retrieval CLIR)(<Structured English Query)(<TREC Experiments at Maryland)(<Evaluation)(<Query Processing)(<Query Optimisers)(<Disambiguation)
<Cross Language Information Retrieval CLIR<Machine Translation<English Japanese<Based Machine<TREC Experiments at Maryland<Approach to Machine<Natural Language<Future of Machine Translation<Machine Adaptable Dynamic Binary<CLIR Track<Systems<New<Tables Provide<Design<Statistical Machine<Query Translation<Evaluates<Chinese<USA October Proceedings<Interlingual<Technology<Syntax Directed Transduction<Interpretation<Knowledge<Linguistic<Divergences<(Other)(<Cross Language Information Retrieval CLIR)(<Machine Translation)(<English Japanese)(<TREC Experiments at Maryland)(<CLIR Track)(<Query Translation)
• Parenthesized topics are the ones the user rated positively
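A record in this format can be split into proposed and positively rated topics with a few lines of Python. This is a sketch under assumptions: "<" is read as a topic separator, "(Other)" as the marker that ends the proposed-topic list, and each "(<topic)" after it as a positively rated topic. The function name is hypothetical.

```python
import re


def parse_record(record: str) -> tuple[list[str], list[str]]:
    """Split one collected-data record into (proposed, selected) topics."""
    proposed_part, _, selected_part = record.partition("(Other)")
    # Proposed topics are separated by "<"; drop empty fragments.
    proposed = [t.strip() for t in proposed_part.split("<") if t.strip()]
    # Positively rated topics appear as "(<topic)".
    selected = re.findall(r"\(<([^()]*)\)", selected_part)
    return proposed, selected


record = ("<Cross Language Information Retrieval CLIR<Machine Translation"
          "<English Japanese<(Other)"
          "(<Cross Language Information Retrieval CLIR)(<Machine Translation)")
proposed, selected = parse_record(record)
print(len(proposed), "proposed;", len(selected), "selected")
```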
Results – Hypothesis Test for H1
• H0 (null hypothesis of H1) : the document topics proposed by the Document Clustering Algorithm for a user are NOT consistent, i.e. the mean frequency μ of a proposed topic equals μ0 = 1. H0 : μ = μ0 vs H1 : μ > μ0
• Conditions : 95% confidence (test size α = 0.05), sample size n = 18 (n ≤ 25), standard deviation σ unknown and estimated by s, i.i.d. random samples, normal distribution, one-sided t-test
• Test statistics : sample mean ỹ = 1.1429 and sample standard deviation s = 0.2277 are observed from the experiment
• Rejection rule : reject H0 if ỹ > μ0 + t(α, n−1) · s/√n
• From the experiment, ỹ = 1.1429 > μ0 + t(α, n−1) · s/√n = 1.0934
• Therefore the decision is to reject H0 and accept H1
• 95% confidence interval for μ is 1.0297 ≤ μ ≤ 1.2561
• P-value = 0.0039
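The reported figures can be re-derived with a short calculation. The sketch below assumes n = 18 (22 participants minus the 4 excluded) and takes Student-t critical values for 17 degrees of freedom from a standard table (1.740 one-sided, 2.110 two-sided at α = 0.05). Under those assumptions it reproduces the threshold 1.0934 and the confidence interval, and the reported P-value matches a normal approximation of the same statistic.

```python
from math import erfc, sqrt

MU0, MEAN, SD, N = 1.0, 1.1429, 0.2277, 18
T_ONE_SIDED = 1.740  # assumed table value: t(0.05, 17)
T_TWO_SIDED = 2.110  # assumed table value: t(0.025, 17)


def rejection_threshold(mu0: float, sd: float, n: int, t_crit: float) -> float:
    """Smallest sample mean that rejects H0 in the one-sided test."""
    return mu0 + t_crit * sd / sqrt(n)


def confidence_interval(mean: float, sd: float, n: int, t_crit: float) -> tuple[float, float]:
    """Two-sided confidence interval for the mean."""
    half_width = t_crit * sd / sqrt(n)
    return mean - half_width, mean + half_width


def normal_p_value(mean: float, mu0: float, sd: float, n: int) -> float:
    """One-sided P-value using the normal approximation."""
    z = (mean - mu0) / (sd / sqrt(n))
    return 0.5 * erfc(z / sqrt(2))


threshold = rejection_threshold(MU0, SD, N, T_ONE_SIDED)
low, high = confidence_interval(MEAN, SD, N, T_TWO_SIDED)
print(f"threshold={threshold:.4f} reject={MEAN > threshold}")
print(f"CI=({low:.4f}, {high:.4f}) p={normal_p_value(MEAN, MU0, SD, N):.4f}")
```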
Results - Users
• All users are assigned a symbol after the experiments, according to their explicit data, for convenience of analysis
User Symbol User profile collected from questionnaire
1 dlmember The one who belonged to the Digital Library Research Laboratory
2 softeng The one who has an interest in Software Engineering
3 bio The one who has an interest in Bioinformatics
4 vr_hci The one who has an interest in Virtual Reality and Human Computer Interaction
5 clir_1 The one who has an interest in Cross Language Information Retrieval
6 clir_2 The one who has an interest in Cross Language Information Retrieval
7 nlp_1 The one who has an interest in Natural Language Processing
8 nlp_2 The one who has an interest in Natural Language Processing
9 vr_1 The one who has an interest in Virtual Reality
10 vr_2 The one who has an interest in Virtual Reality
11 EC_agent The one who has an interest in E-Commerce and Agent
12 CybEdu_agt The one who has an interest in Cyber Education and Agent
13 dlandedu_1 The one who has an interest in Digital Library and Education
14 dlandedu_2 The one who has an interest in Digital Library and Education
15 person_1 The one who has an interest in Personalization
16 person_2 The one who has an interest in Personalization
17 se_me The one who has an interest in Software Engineering
18 fuzzy The one who has an interest in Fuzzy Theory
Results – User Similarities
[Chart: similarity among the 18 users (dlmember, softeng, bio, vr_hci, clir_1, dlandedu_1, nlp_1, person_1, vr_1, fuzzy, ec_agent, cybedu_agt, clir_2, dlandedu_2, nlp_2, person_2, se_me, vr_2); similarity axis from 0 to 0.4]
Results – User Similarity Levels
User ID Level 1 Level 2 Level 3 Level 4
dlmember dlandedu_1, dlandedu_2
softeng se_me person_2
bio
vr_hci vr_2, vr_1 person_1, person_2
clir_1 nlp_1, clir_2 nlp_2
clir_2 clir_1, nlp_1 nlp_2
nlp_1 nlp_2 clir_1, clir_2
nlp_2 nlp_1 clir_1, clir_2
vr_1 vr_2, vr_hci
vr_2 vr_hci vr_1 person_1, person_2
EC_agent CybEdu_agt person_2
CybEdu_agt EC_agent fuzzy
dlandedu_1 dlmember dlandedu_2 vr_hci, CybEdu_agt
dlandedu_2 dlmember dlandedu_1 CybEdu_agt, vr_hci
person_1 person_2
person_2 person_1
se_me softeng
fuzzy CybEdu_agt bio, nlp_1
Results - User Groups
User Group ID Member IDs (assigned after the experiment according to the research interests given in the questionnaire)
A dlmember, dlandedu_1, dlandedu_2
B softeng, se_me
C vr_hci, vr_1, vr_2
D clir_1, nlp_1, clir_2
E nlp_1, nlp_2
F person_1, person_2
G fuzzy, CybEdu_agt
H EC_agent, CybEdu_agt, fuzzy
I bio
• User groups are generated by merging a user with the other members at the closest similarity level
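The merging step can be sketched as: for each user, form a group from that user plus the users at their closest similarity level, then drop duplicate groups. The Python below is a simplified illustration with hypothetical input data; it is not the authors' procedure, and the neighbour lists are examples rather than values read from the similarity-level table.

```python
def build_groups(closest: dict[str, list[str]]) -> list[frozenset[str]]:
    """Merge each user with their closest-level neighbours; keep unique groups."""
    groups: list[frozenset[str]] = []
    for user, neighbours in closest.items():
        group = frozenset([user, *neighbours])
        if group not in groups:
            groups.append(group)
    return groups


# Hypothetical closest-level neighbours for five users.
closest = {
    "softeng": ["se_me"],
    "se_me": ["softeng"],
    "bio": [],
    "person_1": ["person_2"],
    "person_2": ["person_1"],
}
for group in build_groups(closest):
    print(sorted(group))
```

Because groups are built per user, one user can end up in more than one group, which agrees with the table above, where nlp_1 appears in both D and E.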
Future Work
• Detailed analyses for accuracy, scalability and efficiency
• Further confirmation of our hypotheses
Conclusions
• Proposed a user clustering model based on user interests
• Proposed user centered recommendation algorithms which are suitable for DLs
• Proposed a way of collecting and using implicit rating data from DL users
• Proposed an active way of finding user communities
• Verified proposed approaches through designed experiments and hypothesis tests
References
• [1] Lyle H. Ungar and Dean P. Foster: A Formal Statistical Approach to Collaborative Filtering. CONALD ’98, Carnegie Mellon U., Pittsburgh, PA (1998)
• [2] Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon, and John Riedl: GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, Vol. 40 (1997) 77-87
• [3] Chun Zeng, Xiaohui Zheng, Chunxiao Xing, Lizhu Zhou : Personalized Services for Digital Library. In Proc. 5th Int. Conf. on Asian Digital Libraries, ICADL (2002) 252-253
• [4] M. Elena Renda, Umberto Straccia: A Personalized Collaborative Digital Library Environment. In Proc. 5th Int. Conf. on Asian Digital Libraries, ICADL (2002) 262-274
• [5] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael B. Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In Proceedings of the Fourth European Conference on Research and Advanced Technology for Digital Libraries (2000) 239-248
• [6] Marcos A. Gonçalves, Ming Luo, Rao Shen, Mir Farooq, and Edward A. Fox: An XML Log Standard and Tools for Digital Library Logging Analysis, in Proc. of Sixth European Conference on Research and Advanced Technology for Digital Libraries (2002) 16-18
• [7] David M Nichols, Duncan Pemberton, Salah Dalhoumi, Omar Larouk, Clair Belisle, Michael B. Twidale: DEBORA: Developing an Interface to Support Collaboration in a Digital Library. In Proceedings of the Fourth European Conference on Research and Advanced Technology for Digital Libraries (2000) 239-248
• [8] CITIDEL project, Computing and Information Technology Interactive Digital Educational Library, http://www.citidel.org/ (2004)
• [9] John S. Breese, David Heckerman and Carl Kadie: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (1997) 43-52
• [10] Stanisław Osiński and Dawid Weiss: Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data, Advances in Soft Computing, Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM’04 Conference, Zakopane, Poland (2004) 369-378
• Acknowledgements to NSF for partial support
– DUE-0121679, 0121741, 0333531; IIS-0086227, 0307867, 0325579