
Christin Seifert, University of Passau. Hamburg, 2015-03-26

Personalised, Interactive Access to Digital Library Content: Lessons Learned in the EEXCESS Project

Motivation

A.-L. Barabási, R. Albert, and H. Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A: Statistical Mechanics and its Applications, 281(1–4):69–77, 2000.

Why and how to reduce this distance?

[Figure: unique visitors (%), axis from 0% to 60%, plotted against rank of the site, axis from 0 to 100]

Digital libraries, museums, archives

specialised resources

User-content distance

Long-tail resources in context:
• Discover new information
• Verify facts
• Enrich existing information

Why reduce the user-content distance?

Motivation

• Content provider strategies:
A. Dedicated portals
B. Search engine optimisation
C. Social network marketing

User finds content. Limited success.

• User strategies:
A. Use major search engines
B. Use dedicated portals
C. Don't know of existence of resources and/or portals

Idea

EEXCESS Approach

Reduce user-content distance. Bring content to users (in a helpful, polite, non-obtrusive manner)

Locate users: Channel Identification and Injection

Find out what users need: Context Detection and Personalised Search

Present resources: Interactive Visualisations

More details, please..

Channel Identification and Injection

• Frequently used channels
– Social media channels
– CMS: multiplier effect
– Online word processors
– No access → browser technology

• Challenge:
– Variety of clients

Lesson 1 [Clients]: Favour clients with multiplier effect

Locate users


D7.1 Test Bed design, deployment plan and mockups

© EEXCESS consortium: all rights reserved page 27

Figure 7: SITOS EasyWiki with EEXCESS recommendations

By clicking on the “use” icon of a recommendation the recommendation snippet is copied to the text.

Figure 8: SITOS EasyWiki with used recommendation

D5.2 First Prototype: User Profile and Context Detection, Usage Analysis

8 Prototype: Twitter Bot

Twitter is used as a distribution channel for cultural content. The bot was implemented to enable Twitter users to access the EEXCESS recommendations. Users can query the bot for specific content, and the bot offers resources to random users to broaden its publicity and form a network within the Twitter environment. The content is distributed via status updates using the Twitter account @RecoRobot.

8.1 Guided Tour

There are three ways to get involved with the EEXCESS Twitter bot:

• Actively question the bot (mention @RecoRobot in your own tweet) to get a one-time answer

• Follow the bot to get continuous recommendations

• The bot can offer recommendations to random users, triggered by keywords

In general, the bot can extract information from tweets and query the EEXCESS service for a recommendation. If a good recommendation is found, the TwitterBot responds by updating its status, mentioning the user and supplying the recommendation link together with a short description.

Figure 4: Mention the bot for a recommendation.

There are two ways to trigger this content delivery process: pushing or pulling the recommendation. First, the user can mention the Twitter bot in a tweet, and it will try to recommend a suitable resource. Figure 4 shows the query and result of a successful recommendation. This abstract process is presented in Figure 5.

TwitterBot queried by user: Crawl @Mentions from Twitter → Get Recommendations → Persist → Update Status and mention user

Figure 5: Abstract process: Mention.
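The four steps in Figure 5 can be sketched as a single pass over incoming mentions. This is a minimal sketch under stated assumptions, not the project's implementation: `Mention`, `Recommendation`, `fetch_mentions`, `recommend` and `post_status` are hypothetical stand-ins for the Twitter client and the EEXCESS service, passed in as plain callables so that no real API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass(frozen=True)
class Mention:
    """A tweet that mentions the bot (hypothetical structure)."""
    tweet_id: int
    user: str
    text: str


@dataclass(frozen=True)
class Recommendation:
    """A recommended resource (hypothetical structure)."""
    url: str
    description: str


def handle_mentions(
    fetch_mentions: Callable[[], Iterable[Mention]],
    recommend: Callable[[str], Optional[Recommendation]],
    post_status: Callable[[str], None],
    answered: Set[int],
) -> int:
    """Run one pass and return the number of status updates posted."""
    posted = 0
    for mention in fetch_mentions():        # 1. crawl @mentions
        if mention.tweet_id in answered:    # skip tweets handled earlier
            continue
        rec = recommend(mention.text)       # 2. get a recommendation
        answered.add(mention.tweet_id)      # 3. persist that it was processed
        if rec is None:                     # no good recommendation: stay silent
            continue
        # 4. update status and mention the user
        post_status(f"@{mention.user} {rec.description} {rec.url}")
        posted += 1
    return posted
```

Keeping the Twitter client and the recommender behind callables means the same loop can be exercised with fakes, which fits the modular, API-separated design the slides argue for.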


Lesson 2 [Architecture]: Modularise; use APIs to separate clients from back-end

User Context Detection

• Translate user context to information need
• Example: browser extension


[Figure: browser extension example with three numbered views (results for a page, results for a paragraph, results for a selection). Labels: User context, Search Backend, Personalised Results]
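One way to read "translate user context to information need" is to take the most specific context available (selection, then paragraph, then page) and turn it into query terms. The sketch below is an illustration under assumptions, not the extension's actual logic: `UserContext`, `context_to_query`, the tiny stop-word list and the frequency-based ranking are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List

# Deliberately tiny, illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "by", "had", "he", "his"}


@dataclass
class UserContext:
    """Text the user is currently looking at, at three granularities."""
    page_text: str = ""
    paragraph_text: str = ""
    selection_text: str = ""


def context_to_query(ctx: UserContext, max_terms: int = 5) -> List[str]:
    """Build query terms from the most specific non-empty context."""
    text = ctx.selection_text or ctx.paragraph_text or ctx.page_text
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # most_common orders by frequency; ties keep first-seen order.
    return [term for term, _ in counts.most_common(max_terms)]
```

The fallback order encodes the assumption that an explicit selection says more about the information need than the surrounding paragraph or page.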

User Profile Mining

Can we predict manual queries from a text selection?

[1] http://www.britannica.com/EBchecked/topic/219315/French-Revolution

"... The gathering of troops around Paris and the dismissal of Necker provoked insurrection in the capital. On July 14, 1789, the Parisian crowd seized the Bastille, a symbol of royal tyranny. Again the king had to yield; visiting Paris, he showed his recognition of the sovereignty of the people by wearing the tricolour cockade..." [1]

storming Bastille 1789

User Profile Mining

Christin Seifert, Jörg Schlötterer, Michael Granitzer: "Towards a Feature-Rich Data Set for Personalized Access to Long-Tail Content", Proc. IAR at ACM SAC, to appear

(a) Ratio of selection terms in query (b) Ratio of query terms in selection

Figure 4: Term analysis for queries and selected text
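The two ratios named in the panels of Figure 4 can be computed for a single (selection, query) pair as follows. This is a minimal sketch, assuming lowercase word tokenisation and set-based overlap; the function names are illustrative, and the paper's exact preprocessing is not shown in this excerpt.

```python
import re
from typing import Set


def terms(text: str) -> Set[str]:
    """Lowercased word tokens of a text, as a set."""
    return {t.lower() for t in re.findall(r"\w+", text)}


def selection_terms_in_query(selection: str, query: str) -> float:
    """Share of distinct selection terms that also occur in the query."""
    sel, qry = terms(selection), terms(query)
    return len(sel & qry) / len(sel) if sel else 0.0


def query_terms_in_selection(selection: str, query: str) -> float:
    """Share of distinct query terms that also occur in the selection."""
    sel, qry = terms(selection), terms(query)
    return len(sel & qry) / len(qry) if qry else 0.0
```

For the slide's example query "storming Bastille 1789" and the sentence about July 14, 1789, two of the three query terms ("bastille", "1789") occur in the selection, while "storming" does not.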

of a text selection enriched with the aforementioned features and a label for each term of the selection, which indicates if the term is also contained in the corresponding query (and hence considered relevant).

The list of stop words is the one provided by the "tm" package for R (see footnote 7), the POS tags were obtained with NLTK [3] and the CRF models were computed with Mallet [16]. We evaluated the performance of 29 feature combinations using 10-fold cross-validation. In order to evaluate the stability across users and tasks we also performed cross-validation on splits defined by users (all but one user as training and one user for test), and tasks respectively.
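The user-wise and task-wise splits described above amount to leave-one-group-out cross-validation. A minimal sketch in plain Python, assuming each sample carries a group id (a user or a task); `leave_one_group_out` is a hypothetical helper, not code from the paper.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), holding out one group at a time."""
    # Sort by string form so the split order is deterministic.
    for held_out in sorted(set(groups), key=str):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield train, test
```

Passing the user id of each sample gives the "all but one user" splits; passing the task id gives the task-wise splits.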

Table 4: Accuracies [%] for query prediction from selected text. Cross-validated using splits over users, tasks, and 10-fold random.

                  feature set               trivial
split     stat    i, c, t   i, t    c, t    rejector   acceptor
users     mean    76        77      75      51         49
          SD      15        15      18      35         35
tasks     mean    82        83      82      71         29
          SD      6         6       7       8          8
10-fold   mean    89        88      84      71         29
          SD      1         2       1       2          2

i - the identity of a term, i.e. the term itself
c - whether the term begins with upper- or lowercase
t - POS tag

The best performing feature combinations are shown in Table 4. As the CRF model assigns a label to each term in the selection (identifying it as relevant or not relevant), accuracy refers to the ratio of correctly labeled terms to the total number of terms. Incorporating a term itself as a feature (i, c, t and i, t) leads to the best results, but this may not generalize well due to the limited vocabulary in the dataset. Nevertheless, feature combinations without the words provide similar results as well (e.g., the combination of case-identifier and POS-tag, c, t) and thus are the better option.
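Term-level accuracy and the two trivial baselines in Table 4 can be illustrated as follows. This is a minimal sketch, assuming boolean labels (True means the term also appears in the query) and assuming that the trivial rejector labels every term as not relevant while the trivial acceptor labels every term as relevant. Under that reading the two baseline accuracies always sum to 100%, which matches every row of the table.

```python
from typing import List, Sequence


def accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Percentage of terms whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have equal length")
    if not actual:
        raise ValueError("need at least one term")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def trivial_rejector(n_terms: int) -> List[bool]:
    """Baseline that marks no term as relevant."""
    return [False] * n_terms


def trivial_acceptor(n_terms: int) -> List[bool]:
    """Baseline that marks every term as relevant."""
    return [True] * n_terms
```

A model is only useful to the extent that it beats the better of the two baselines on the same split.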

The standard deviations reveal that the query behavior is stable over tasks, but not over users. In fact half of the users incorporated the major part of the selection into their queries and the queries of the other half contained only a minority of the selection terms. Thus, prediction performance drops for the evaluation over users.

Footnote 7: http://cran.r-project.org/web/packages/tm/

7. RELATED WORK

Providing long-tail recommendations is a highly challenging task, first of all because of the data sparsity issue: only a few or even no ratings are available for items in the long tail. To overcome this problem, the authors in [18] partition the whole item set into head and tail parts and cluster the items in the tail. In [11] recommendations are obtained by combining the items in a user's personal long tail with users who have those items in their head portion. While these approaches still require the existence of at least a few ratings in the tail or even the existence of dense data in the head, Stickroth et al. [25] aim to provide high quality recommendations in a network with a small number of users and items (and hence without the presence of a dense head). Therefore the authors propose a multilevel approach, with a decreasing degree of personalization and different recommendation strategies at each level. Their dataset encompasses 60 ratings on 151 items by 175 users and is not published. Closest to our work, Wang et al. [27] conducted a user study in the cultural heritage domain in which they elicited user models with ratings of museum objects of the Rijksmuseum Amsterdam from 39 participants.

Most of the approaches for user data collection for long-tail domains use server-side data logging. A representative example is the smartmuseum approach, where user interests are either given manually or derived from tagging and rating of resources [23]. A game-based approach to server-side collection was pursued by Wang et al. [27], who used an interactive quiz to collect ratings for museum objects. Goecks and Shavlik [6] use client-side data collection in a Web browser for user interest detection based on the text of the webpage, clicked hyperlinks, scrolling and mouse activity.

All of those data sets capture only part of the features we identified as necessary to collect. To the best of our knowledge, there is no publicly available dataset that accounts for the specific challenges of long-tail recommendations and contains the required data.

Lesson 3 [Data]: Collect ground truth data as early as possible

Lesson 4 [Data]: Collect ground truth data as early as possible


User Profile Mining

Find out what users need

Lesson 5: Information need is entity-based

Visualisations

Present resources

Visualisations allow more engaging access to data and help to deal with information overload by using the power of the human visual system [1]

[1] Ben Shneiderman. 1996. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In IEEE Visual Languages. College Park, Maryland 20742, U.S.A., 336–343

Visualisations

Lesson 6 [UI]: Use mock-ups (fake data)

Present resources


Summary

EEXCESS framework
‣ Inject content in channels
‣ Detect information need
‣ Visualise results

enabling
‣ Discovery, Verification and Enrichment of Information

Reduce user-content distance. Bring content to users

Questions?

http://eexcess.eu

https://github.com/EEXCESS/eexcess

christin.seifert@uni-passau.de
