ferosa - insights

FeRoSA: Faceted Recommendation System for Scientific Articles

Recommendation Engine

Scientific Articles: ACL Anthology, a collection of 20,000 articles in computational linguistics

Faceted: Not just recommendations, but how they are related

www.ferosa.org: Live and running

• Edge labelling task

[Figure: example graph with edge labels]

• Set of Nodes
• Links between similar nodes
• Label the edges

• Analogy

• Nudge user – suggest why one should buy the combo offered in Flipkart

• Type of social ties in a friendship network

CHALLENGES

Quality

Accessibility

Ranking

Scalable

Quality
• High specificity and precision
• Outperforms current systems for scientific article retrieval by a high margin

Ranking
• Individual ranking per facet
• Most relevant entry comes first
• Aggregation of rank lists over content and citation network information

Accessibility
• Categorized into 4 facets
• Easy to streamline as per need and filter results

Scalable
• Random walks (with restarts)
• Independent of domain

Information overload: even for a relatively closed community like ACL

IR tools: rather than text-based indexing

Varying intentions: results are streamlined by intention, so entries may appear that would not appear in flat recommendations

Dataset: ACL Anthology Collection

Statistics                                        Full      Filtered
Number of papers                                  21,212    9,843
Average number of references (within ACL only)    5.23      6.21
Number of unique authors                          17,551    7,892
Number of unique venues                           451       280

• Computational Linguistics

• 1961 – 2013

• text data open to public

Form Citation Network

• Identify citation contexts and section headings using ParsCit

• Section heading to Facet Mapping

• Refinement of facets from prior works
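The heading-to-facet step can be pictured with a minimal sketch. The slides do not give the actual mapping, so the keyword table below is purely a hypothetical assumption for illustration, as is the function name `heading_to_facet`; only the four facet codes (AA, BG, CM, MD) come from the deck.

```python
# Hypothetical keyword table: these keywords are illustrative assumptions,
# NOT the mapping used by FeRoSA. Only the facet codes come from the slides.
HEADING_KEYWORDS = {
    "BG": ("introduction", "background"),
    "MD": ("method", "approach", "model"),
    "CM": ("experiment", "result", "evaluation", "comparison"),
    "AA": ("alternative", "related work"),
}


def heading_to_facet(heading):
    """Return the facet code for a section heading, or None if nothing matches."""
    text = heading.lower()
    # Facets are checked in the insertion order of the table above.
    for facet, keywords in HEADING_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return facet
    return None
```

Each citation context would then inherit the facet of the section it appears in, which gives the citation edge its label.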

Number of citation contexts extracted    61,051
Number of BG edges                       23,022
Number of AA edges                       10,797
Number of MD edges                       8,828
Number of CM edges                       18,404

AA – Alternative Approaches

BG – Background

CM – Comparison

MD – Method

Induced Subgraphs

Nodes
• Query paper
• 2-hop citations in either direction
• Highly similar papers based on cosine similarity

Edges
• Edges belonging to a particular facet
• 4 different subgraphs for each query paper
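A rough sketch of how such per-facet subgraphs could be built is given below. It assumes the citation network is a list of `(citing, cited, facet)` triples and that the set of cosine-similar papers is computed elsewhere; the function names and the data layout are assumptions for illustration, not the FeRoSA implementation.

```python
from collections import deque


def two_hop_neighbourhood(edges, query):
    """Return all nodes within 2 citation hops of `query`, ignoring edge direction."""
    neighbours = {}
    for citing, cited, _facet in edges:
        neighbours.setdefault(citing, set()).add(cited)
        neighbours.setdefault(cited, set()).add(citing)

    hops = {query: 0}  # node -> hop distance from the query paper
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if hops[node] == 2:  # do not expand beyond the second hop
            continue
        for nxt in neighbours.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return set(hops)


def facet_subgraph(edges, nodes, facet):
    """Keep only edges of one facet whose endpoints both lie in `nodes`."""
    return [(u, v) for u, v, f in edges if f == facet and u in nodes and v in nodes]


def induced_subgraphs(edges, query, similar_papers=frozenset()):
    """Build one subgraph per facet (AA, BG, CM, MD) for a query paper."""
    nodes = two_hop_neighbourhood(edges, query) | set(similar_papers)
    return {facet: facet_subgraph(edges, nodes, facet) for facet in ("AA", "BG", "CM", "MD")}
```

Calling `induced_subgraphs(edges, "q")` returns a dictionary with four edge lists, one per facet, matching the "4 different subgraphs for each query paper" on the slide.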

Random Walks

• Random walks with restarts
• The walker iteratively moves to its neighbourhood with a probability proportional to the edge weights.
• Restart probability c = 0.4, to return to the starting node i.
• Teleportation with probability 0.3
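The iteration behind a random walk with restarts can be sketched as follows, using the standard update r = (1 - c) P r + c e, where P is the column-normalised weight matrix and e is the indicator vector of the starting node. The function name, the dense NumPy matrix, and the convergence settings are assumptions; the slide does not say how the 0.3 teleportation probability combines with the restart, so this sketch models the restart only.

```python
import numpy as np


def random_walk_with_restart(weights, start, restart_prob=0.4, tol=1e-10, max_iter=1000):
    """Return RWR scores for all nodes, for a walker restarting at `start`.

    weights[i, j] is the weight of the edge from node j to node i
    (symmetric for an undirected graph).
    """
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]

    # Column-normalise: column j becomes the transition distribution out of node j.
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # nodes without edges keep an all-zero column
    P = W / col_sums

    e = np.zeros(n)
    e[start] = 1.0  # restart vector: all mass on the starting node

    r = e.copy()
    for _ in range(max_iter):
        r_next = (1 - restart_prob) * P @ r + restart_prob * e
        if np.abs(r_next - r).sum() < tol:  # L1 change small enough: converged
            return r_next
        r = r_next
    return r
```

Sorting nodes by descending score gives a ranked list for the facet subgraph the walk was run on.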

Rank Aggregation

Aggregation of ranked lists based on

Content similarity

RWR Values

R package

Optimization problem

Spearman footrule
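To make the optimization view concrete: the Spearman footrule distance between two rankings is the sum of absolute position differences, and aggregation looks for the ordering that minimises the total distance to all input lists. The sketch below is a brute-force Python illustration for very small lists, assuming every list ranks the same items. It is not the R package mentioned on the slide, and the function names are hypothetical.

```python
from itertools import permutations


def footrule_distance(ranking_a, ranking_b):
    """Spearman footrule: sum of absolute position differences between two rankings."""
    position_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - position_b[item]) for i, item in enumerate(ranking_a))


def aggregate_by_footrule(rankings):
    """Return the ordering with the smallest total footrule distance to all rankings.

    Brute force over all permutations, so only practical for a handful of items.
    Ties are broken by lexicographic order of the candidate orderings.
    """
    items = sorted(rankings[0])
    best = min(
        permutations(items),
        key=lambda candidate: sum(footrule_distance(candidate, r) for r in rankings),
    )
    return list(best)
```

In this setting the input rankings would be one list ordered by content similarity and one ordered by RWR values; for realistic list sizes a heuristic or assignment-based solver would replace the exhaustive search.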

EXPERIMENTAL RESULTS

• The most cosine-similar paper appears within 1 or 2 hops
• Edge density decreases as citation count increases (due to single or few edges)
• MD subgraphs have nodes with high degree
• Average path length increases with citation count
• Clustering coefficient correlates with edge density
• 1-hop nodes contribute more to this measurement

EVALUATION

FeRoSA

Google Scholar

Microsoft Academic Search

LDA-based system (Liang et al., 2011)


• All systems perform better in >2 hop
• Cosine similarity: FeRoSA works in all sections, while the others perform marginally better than or equivalent to FeRoSA only in high or mid
• Pr: FeRoSA performs in all 3 buckets, while the others suffer in low-citation buckets

Scalable solution

High specificity

Stratification

Flat recommendation

Multi-hop neighbors

Low citation buckets

THANKS
