ferosa - insights
FeRoSA: Faceted Recommendation System for Scientific Articles
Recommendation Engine
Scientific Articles: ACL Anthology, a collection of 20,000 articles in computational linguistics
Faceted: Not just recommendations, but how they are related
www.ferosa.org: Live and running
• Edge labelling task
• Set of nodes
• Links between similar nodes
• Label the edges
• Analogy
• Nudge user – suggest why one should buy the combo offered in Flipkart
• Type of social ties in a friendship network
CHALLENGES
Quality (Q)
• High specificity and precision
• Outperforms current systems for scientific article retrieval by a high margin
Ranking (R)
• Individual ranking per facet
• Most relevant entry comes first
• Aggregation of ranked lists over content and citation network information
Accessibility (A)
• Categorized into 4 facets
• Easy to streamline as needed and filter results
Scalable (S)
• Random walks (with restarts)
• Independent of domain
Information overload: even for a relatively closed community like ACL
IR tools: rather than text-based indexing
Varying intentions: results are streamlined by intention, so entries can appear that would not appear in flat recommendations
Dataset: ACL Anthology Collection
Statistics                                        Full      Filtered
Number of papers                                  21,212    9,843
Average number of references (within ACL only)    5.23      6.21
Number of unique authors                          17,551    7,892
Number of unique venues                           451       280
• Computational Linguistics
• 1961 – 2013
• text data open to public
Form Citation Network
• Identify citation contexts and section headings (ParsCit)
• Section heading to Facet Mapping
• Refinement of facets from prior works
Number of citation contexts extracted    61,051
Number of BG edges                       23,022
Number of AA edges                       10,797
Number of MD edges                       8,828
Number of CM edges                       18,404
AA – Alternative Approaches
BG – Background
CM – Comparison
MD – Method
Induced Subgraphs
Nodes
• Query paper
• Papers within 2 citation hops in either direction
• Highly similar papers based on cosine similarity
Edges
• Edges belonging to a particular facet
• 4 different subgraphs for each query paper
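The node and edge selection above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the edge-list format `(citing, cited, facet)`, the function names `two_hop_nodes` and `facet_subgraphs`, and the toy paper IDs are all assumptions, and the cosine-similar papers are simply passed in as a precomputed collection.

```python
from collections import defaultdict

FACETS = ("AA", "BG", "CM", "MD")

def two_hop_nodes(edges, query):
    """Return the query plus every paper within 2 citation hops, in either direction."""
    neighbours = defaultdict(set)
    for citing, cited, _facet in edges:
        neighbours[citing].add(cited)   # follow the citation forwards
        neighbours[cited].add(citing)   # and backwards
    seen, frontier = {query}, {query}
    for _ in range(2):                  # expand the frontier twice: 1 hop, then 2 hops
        frontier = {n for p in frontier for n in neighbours[p]} - seen
        seen |= frontier
    return seen

def facet_subgraphs(edges, query, similar=()):
    """Build one edge list per facet, restricted to the selected nodes."""
    nodes = two_hop_nodes(edges, query) | set(similar)
    subgraphs = {facet: [] for facet in FACETS}
    for citing, cited, facet in edges:
        if citing in nodes and cited in nodes and facet in subgraphs:
            subgraphs[facet].append((citing, cited))
    return subgraphs

# Toy citation network (hypothetical paper IDs); P4 is 3 hops away from P1.
edges = [("P1", "P2", "BG"), ("P2", "P3", "MD"), ("P3", "P4", "CM"), ("P5", "P1", "AA")]
subgraphs = facet_subgraphs(edges, "P1")
```

In this toy network the CM subgraph for P1 stays empty, because its only CM edge touches P4, which lies outside the 2-hop neighbourhood.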
Random Walks
• Random walks with restarts
• The walker iteratively moves to a node in its neighbourhood with a probability proportional to the edge weights.
• Restart probability c = 0.4, to return to the starting node i.
• Teleportation with probability 0.3
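A random walk with restarts can be sketched with plain power iteration. This is only an illustration under assumptions: the graph format (a dict of weighted neighbours), the function name, the iteration count, and the handling of nodes without outgoing edges are all hypothetical. The teleportation step (probability 0.3) is left out because the slides do not say how it combines with the restart.

```python
def random_walk_with_restart(weights, start, c=0.4, iters=100):
    """weights: dict node -> {neighbour: edge weight}. Returns dict node -> visiting score."""
    nodes = set(weights) | {v for nbrs in weights.values() for v in nbrs}
    scores = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            nbrs = weights.get(u, {})
            total = sum(nbrs.values())
            if total == 0:
                # Assumption: a node with no outgoing weight sends its mass back to the start.
                nxt[start] += (1 - c) * scores[u]
                continue
            for v, w in nbrs.items():
                # With probability 1 - c, move to a neighbour proportionally to edge weight.
                nxt[v] += (1 - c) * scores[u] * w / total
        nxt[start] += c  # with probability c, restart at the starting node
        scores = nxt
    return scores

# Toy facet subgraph (hypothetical): the query "q" is tied more strongly to "a" than to "b".
graph = {"q": {"a": 2.0, "b": 1.0}, "a": {"q": 1.0}, "b": {"q": 1.0}}
scores = random_walk_with_restart(graph, "q")
ranking = sorted((n for n in scores if n != "q"), key=scores.get, reverse=True)
```

Sorting the non-query nodes by score gives one ranked list per facet subgraph; here "a" ranks above "b" because its edge from the query carries more weight.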
Rank Aggregation
Aggregation of ranked lists based on
• Content similarity
• RWR values
R package
Optimization problem
Spearman footrule
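To make the footrule-based aggregation concrete, here is a small brute-force sketch in Python. It is not the R package mentioned above and says nothing about how that package searches: it enumerates every permutation, so it is only usable for a handful of items. The function names and the two toy rankings are assumptions.

```python
from itertools import permutations

def footrule(rank_a, rank_b):
    """Spearman footrule distance: sum of absolute position differences per item."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(rank_a))

def total_cost(candidate, ranklists):
    """Objective of the optimization problem: summed distance to all input lists."""
    return sum(footrule(candidate, r) for r in ranklists)

def aggregate(ranklists):
    """Return an ordering that minimizes the total footrule distance (exhaustive search)."""
    items = ranklists[0]
    return min(permutations(items), key=lambda cand: total_cost(cand, ranklists))

# Hypothetical ranked lists over the same four papers.
content_rank = ["P1", "P2", "P3", "P4"]   # ordered by content similarity
rwr_rank = ["P2", "P1", "P3", "P4"]       # ordered by RWR value
consensus = aggregate([content_rank, rwr_rank])
```

With only two input lists, ties between equally good orderings are common; a real aggregation would need a tie-breaking rule or item weights, which the slides do not specify.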
EXPERIMENTAL RESULTS
• The most cosine-similar paper is found within 1 or 2 hops
• Edge density decreases as citation count increases (due to single or few edges)
• MD subgraphs have nodes with high degree
• Average path length increases with citation count
• Clustering coefficient correlates with edge density
• 1-hop nodes contribute more to this measurement
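For reference, two of the measures named above can be computed as in the sketch below. The conventions are assumptions, since the slides do not define them: density is taken over directed edges, and the clustering coefficient is the average local coefficient on the undirected view of the graph. The toy subgraph is hypothetical.

```python
def edge_density(nodes, edges):
    """Fraction of possible directed edges that are present."""
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0

def average_clustering(nodes, edges):
    """Average local clustering coefficient, ignoring edge direction."""
    nbrs = {n: set() for n in nodes}
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    total = 0.0
    for n in nodes:
        k = len(nbrs[n])
        if k < 2:
            continue  # nodes with fewer than 2 neighbours contribute 0
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a in nbrs[n] for b in nbrs[n] if a < b and b in nbrs[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0

# Toy subgraph: a triangle P1-P2-P3 plus one pendant paper P4.
nodes = ["P1", "P2", "P3", "P4"]
edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
density = edge_density(nodes, edges)
clustering = average_clustering(nodes, edges)
```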
EVALUATION
FeRoSA
Google Scholar
Microsoft Academic Search
LDA-based system (Liang et al., 2011)
• All systems perform better in >2 hop
• Cosine similarity: FeRoSA works in all sections, while the others are marginally better than or equivalent to FeRoSA only in high or mid
• Pr: FeRoSA holds up in all 3 buckets, while the others suffer in the low-citation buckets
Scalable solution
High specificity
Stratification
Flat recommendation
Multi-hop neighbors
Low citation buckets
THANKS