Christin Seifert, University of Passau
Hamburg, 2015-03-26
Personalised, Interactive Access to Digital Library Content: Lessons Learned in the EEXCESS Project
Motivation
A.-L. Barabasi, R. Albert, and H. Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A: Statistical Mechanics and its Applications, 281(1–4):69–77, 2000.
Why and how to reduce this distance?
[Figure: unique visitors (%, axis from 0 to 60) against rank of the site (0 to 100). Annotations: "Digital libraries, museums, archives", "specialised resources", "User-content distance"]
Why reduce the user-content distance?
Long-tail resources in context:
• Discover new information
• Verify facts
• Enrich existing information
Motivation
• Content provider strategies:
  A. Dedicated portals
  B. Search engine optimisation
  C. Social network marketing
User finds content. Limited success.
• User strategies:
  A. Use major search engines
  B. Use dedicated portals
  C. Don't know of existence of resources and/or portals
Idea
EEXCESS Approach
Reduce user-content distance: bring content to users
(in a helpful, polite, non-obtrusive manner)
• Locate users: Channel Identification and Injection
• Find out what users need: Context Detection and Personalised Search
• Present resources: Interactive Visualisations
More details, please..
Channel Identification and Injection
• Frequently used channels
  – Social media channels
  – CMS - multiplier effect
  – Online word processors
  – No access → Browser technology
• Challenge:
  – Variety of clients
Lesson 1 [Clients]: Favour clients with multiplier effect
Locate users
D7.1 Test Bed design, deployment plan and mockups
© EEXCESS consortium: all rights reserved page 27
Figure 7: SITOS EasyWiki with EEXCESS recommendations
By clicking on the “use” icon of a recommendation the recommendation snippet is copied to the text.
Figure 8: SITOS EasyWiki with used recommendation
D5.2 First Prototype: User Profile and Context Detection, Usage Analysis
8 Prototype: Twitter Bot
Twitter is used as a distribution channel for cultural content. The bot was implemented to enable Twitter users to access the EEXCESS recommendations. Users can query the bot for specific content, and the bot offers resources to random users to broaden its publicity and form a network within the Twitter environment. The content is distributed via status updates using the Twitter account @RecoRobot.
8.1 Guided Tour
There are three ways to get involved with the EEXCESS Twitter bot:
• Actively question the bot (mention @RecoRobot in your own tweet) to get a one-time answer
• Follow the bot to get continuous recommendations
• The bot can offer recommendations to random users, triggered by keywords
In general, the bot can extract information from tweets and query the EEXCESS service for a recommendation. If a good recommendation is found, the Twitter bot responds with a status update that mentions the user and supplies the recommendation link together with a short description.
Figure 4: Mention the bot for a recommendation.
There are two ways to trigger this content delivery process: pushing or pulling the recommendation. First, the user can mention the Twitter bot in a tweet, and the bot will try to recommend a suitable resource. Figure 4 shows the query and result of a successful recommendation. This abstract process is presented in Figure 5.
TwitterBot queried by user: Crawl @Mentions from Twitter → Get Recommendations → Persist → Update Status and mention user
Figure 5: Abstract process: Mention.
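The mention-triggered flow in Figure 5 (crawl mentions, get recommendations, persist, update status) can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `MentionBot`, `Mention`, `Recommendation`, and the three injected callables are hypothetical names, and no real Twitter or EEXCESS API call is shown. Injecting the callables also illustrates keeping the client separate from the back-end.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Mention:
    user: str  # account that mentioned the bot
    text: str  # text of the tweet


@dataclass
class Recommendation:
    url: str
    description: str


@dataclass
class MentionBot:
    # Hypothetical stand-ins for the channel client and the recommender
    # back-end; the real services are not reproduced here.
    crawl_mentions: Callable[[], list[Mention]]
    recommend: Callable[[str], Optional[Recommendation]]
    update_status: Callable[[str], None]
    answered: list[tuple[str, str]] = field(default_factory=list)  # in-memory log

    def run_once(self) -> int:
        """Process all pending mentions once; return the number of replies."""
        replies = 0
        for mention in self.crawl_mentions():
            rec = self.recommend(mention.text)
            if rec is None:  # no good recommendation: stay silent
                continue
            self.answered.append((mention.user, rec.url))  # persist
            self.update_status(f"@{mention.user} {rec.description} {rec.url}")
            replies += 1
        return replies


def demo() -> list[str]:
    """Run the bot against in-memory fakes and return the posted statuses."""
    posted: list[str] = []
    bot = MentionBot(
        crawl_mentions=lambda: [
            Mention("alice", "storming Bastille 1789"),
            Mention("bob", "asdf"),
        ],
        recommend=lambda text: (
            Recommendation("http://example.org/item/1", "Storming of the Bastille")
            if "Bastille" in text
            else None
        ),
        update_status=posted.append,
    )
    bot.run_once()
    return posted
```

In `demo()`, only the first mention yields a recommendation, so only one status is posted; the second mention is skipped rather than answered with a poor match.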
Lesson 2 [Architecture]: Modularise; use APIs to separate clients from back-end
User Context Detection
• Translate user context to information need
• Example: browser extension
Find out what users need
[Figure: browser extension example with three numbered views. Labels: Results for a page, Results for a paragraph, Results for a selection, User context, Search Backend, Personalised Results]
User Profile Mining
Can we predict manual queries from a text selection?
[1] http://www.britannica.com/EBchecked/topic/219315/French-Revolution
Find out what users need
“... The gathering of troops around Paris and the dismissal of Necker provoked insurrection in the capital. On July 14, 1789, the Parisian crowd seized the Bastille, a symbol of royal tyranny. Again the king had to yield; visiting Paris, he showed his recognition of the sovereignty of the people by wearing the tricolour cockade...” [1]
storming Bastille 1789
User Profile Mining
Christin Seifert, Jörg Schlötterer, Michael Granitzer: “Towards a Feature-Rich Data Set for Personalized Access to Long-Tail Content”, Proc. IAR at ACM SAC, to appear
(a) Ratio of selection terms in query (b) Ratio of query terms in selection
Figure 4: Term analysis for queries and selected text
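The two ratios named in the figure panels can be illustrated with a small sketch. It assumes one plausible reading of the panel titles (the share of selection terms that reappear in the query, and the share of query terms that come from the selection) and a deliberately simple tokenizer; `terms` and `overlap_ratios` are hypothetical helper names, not taken from the paper.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercased word tokens; a simple tokenizer chosen for illustration."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_ratios(selection: str, query: str) -> tuple[float, float]:
    """Return (share of selection terms in query, share of query terms in selection)."""
    sel, qry = terms(selection), terms(query)
    shared = sel & qry
    sel_in_query = len(shared) / len(sel) if sel else 0.0
    query_in_sel = len(shared) / len(qry) if qry else 0.0
    return sel_in_query, query_in_sel


selection = "On July 14, 1789, the Parisian crowd seized the Bastille"
query = "storming Bastille 1789"
sel_ratio, qry_ratio = overlap_ratios(selection, query)
print(f"selection terms in query: {sel_ratio:.2f}")
print(f"query terms in selection: {qry_ratio:.2f}")
```

For this shortened version of the slide example, two of the nine distinct selection terms appear in the query, and two of the three query terms come from the selection ("storming" does not).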
... of a text selection enriched with the aforementioned features and a label for each term of the selection, which indicates if the term is also contained in the corresponding query (and hence considered relevant).
The list of stop words is the one provided by the “tm” package for R (footnote 7), the POS tags were obtained with NLTK [3], and the CRF models were computed with Mallet [16]. We evaluated the performance of 29 feature combinations using 10-fold cross-validation. In order to evaluate the stability across users and tasks, we also performed cross-validation on splits defined by users (all but one user as training and one user for test) and on splits defined by tasks.
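The user-defined splits described above (all but one user for training, one user for test) amount to leave-one-group-out cross-validation. A minimal sketch, assuming each sample carries a user identifier; `leave_one_group_out` is a hypothetical helper, and the same function would produce task-defined splits when given task identifiers instead.

```python
from collections import defaultdict
from typing import Hashable, Iterator, Sequence


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[tuple[Hashable, list[int], list[int]]]:
    """Yield (held_out_group, train_indices, test_indices) per distinct group."""
    by_group: dict[Hashable, list[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    for held_out, test_idx in by_group.items():
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# One entry per sample: which user produced it (illustrative data).
users = ["u1", "u1", "u2", "u3", "u2"]
for held_out, train_idx, test_idx in leave_one_group_out(users):
    print(held_out, train_idx, test_idx)
```

Because no sample from the held-out user is seen during training, the resulting accuracy reflects how well a model transfers to a new user, which is what the "users" rows of the results table measure.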
Table 4: Accuracies [%] for query prediction from selected text. Cross-validated using splits over users, tasks, and 10-fold random.

                  feature set               trivial
split     stat    i, c, t   i, t   c, t     rejector   acceptor
users     mean    76        77     75       51         49
users     SD      15        15     18       35         35
tasks     mean    82        83     82       71         29
tasks     SD      6         6      7        8          8
10-fold   mean    89        88     84       71         29
10-fold   SD      1         2      1        2          2

i - the identity of a term, i.e. the term itself
c - whether the term begins with upper- or lowercase
t - POS tag
The best-performing feature combinations are shown in Table 4. As the CRF model assigns a label to each term in the selection (identifying it as relevant or not relevant), accuracy refers to the ratio of correctly labeled terms to the total number of terms. Incorporating the term itself as a feature (i, c, t and i, t) leads to the best results, but this may not generalize well due to the limited vocabulary in the dataset. Feature combinations without the words provide similar results (e.g., the combination of case identifier and POS tag, c, t) and are thus the better option.
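The term-level accuracy defined above, and the two trivial baselines in Table 4, can be made concrete with a short sketch. It assumes that the trivial rejector labels every term as not relevant and the trivial acceptor labels every term as relevant; this interpretation is not spelled out in the excerpt, but it is consistent with the two baseline columns summing to 100% in every row. All function names are hypothetical.

```python
def accuracy(predicted: list[bool], gold: list[bool]) -> float:
    """Share of terms whose predicted label matches the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def trivial_rejector(n_terms: int) -> list[bool]:
    """Assumed baseline: no term of the selection is relevant."""
    return [False] * n_terms


def trivial_acceptor(n_terms: int) -> list[bool]:
    """Assumed baseline: every term of the selection is relevant."""
    return [True] * n_terms


# Gold labels for a 7-term selection: True if the term also occurs in the query.
gold = [True, False, False, True, False, False, False]
print(accuracy(trivial_rejector(len(gold)), gold))
print(accuracy(trivial_acceptor(len(gold)), gold))
```

Under this reading, the rejector's accuracy equals the share of selection terms that users left out of their queries, so a learned model is only useful to the extent that it beats that share.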
The standard deviations reveal that the query behavior is stable over tasks, but not over users. In fact, half of the users incorporated the major part of the selection into their queries, and the queries of the other half contained only a minority of the selection terms. Thus, prediction performance drops for the evaluation over users.
Footnote 7: http://cran.r-project.org/web/packages/tm/
7. RELATED WORK
Providing long-tail recommendations is a highly challenging task, first of all because of the data sparsity issue: only a few or even no ratings are available for items in the long tail. To overcome this problem, the authors in [18] partition the whole item set into head and tail parts and cluster the items in the tail. In [11], recommendations are obtained by combining the items in a user's personal long tail with users who have those items in their head portion. While these approaches still require the existence of at least a few ratings in the tail or even the existence of dense data in the head, Stickroth et al. [25] aim to provide high-quality recommendations in a network with a small number of users and items (and hence without the presence of a dense head). To this end, the authors propose a multilevel approach, with a decreasing degree of personalization and different recommendation strategies at each level. Their dataset encompasses 60 ratings on 151 items by 175 users and is not published. Closest to our work, Wang et al. [27] conducted a user study in the cultural heritage domain in which they elicited user models with ratings of museum objects of the Rijksmuseum Amsterdam from 39 participants.
Most of the approaches for user data collection for long-tail domains use server-side data logging. A representative example is the smartmuseum approach, where user interests are either given manually or derived from tagging and rating of resources [23]. A game-based approach to server-side collection was pursued by Wang et al. [27], who used an interactive quiz to collect ratings for museum objects. Goecks and Shavlik [6] use client-side data collection in a Web browser for user interest detection based on the text of the webpage, clicked hyperlinks, scrolling, and mouse activity.
All of those data sets capture the features we identified as necessary to collect only partly. To the best of our knowledge, there is no publicly available dataset that accounts for the specific challenges of long-tail recommendations and contains the required data.
Lesson 3 [Data]: Collect ground truth data as early as possible
Lesson 4 [Data]: Collect ground truth data as early as possible
User Profile Mining
Find out what users need
Lesson 5: Information need is entity-based
Visualisations
Present resources
Visualisations allow more engaging access to data and help to deal with information overload by using the power of the human visual system [1]
[1] Ben Shneiderman. 1996. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In IEEE Visual Languages. College Park, Maryland 20742, U.S.A., 336–343
Visualisations
Lesson 6 [UI]: Use mock-ups (fake data)
Present resources
Summary
EEXCESS framework
‣ Inject content in channels
‣ Detect information need
‣ Visualise results
enabling
‣ Discovery, Verification and Enrichment of Information
Reduce user-content distance: bring content to users
Questions?
http://eexcess.eu
https://github.com/EEXCESS/eexcess
christin.seifert@uni-passau.de