TRANSCRIPT
Leveraging Social Network Knowledge in Data Mining
Maurizio Marchese Department of Information Engineering and Computer Science,
University of Trento, Italy
Work presented was done in collaboration with Fabio Casati, Alejandro Mussi, Mikalai Krapivin and many more people in the LiquidPub project
(EU FET-Open grant number 213360)
Outline
• Introduction: Social Informatics
• Three Applications
  – Discovering communities
  – Scientific impact evaluation
  – Navigation in scientific publishing
• Conclusions
Introduction
• Social informatics is the study of ICT tools in their cultural and institutional contexts and of their applications in social domains (Kling, Rosenbaum, & Sawyer, 2005).
• It is naturally a trans-disciplinary field and is part of a larger body of socio-technical research that examines the ways in which the technological artifact and the human social context interact.
Social Informatics and Complex Systems
• In this lecture, we will focus on a specific aspect of SI, namely:
  – a set of complex systems approaches
  – how to use methods and tools from IT to mine social network information/knowledge
  – explicit vs. implicit knowledge
• In particular, we will present some current experiments in two scenarios:
  – scientific publishing
  – scientific impact evaluation
Alejandro Mussi, "Discovering and analyzing scientific communities using conference networks", thesis work at Departamento de Electronica e Informatica, Universidad Catolica Nuestra Senora de la Asuncion, in collaboration with the University of Trento, Italy
Discovering and Analyzing Scientific Communities
Scientific Community Discovery
• Researchers write contributions together and publish their advances in events or journals.
• Contributions refer to other contributions, some contributions are organized in collections, and so on.
• The emerging community structure and rich contextual information could be used (among others) to improve two main aspects of research: search and assessment.
  – search for a contribution or group contributions; find people working on similar content, or events related to a contribution
  – measure the impact on a specific community (normalize existing metrics), narrow down the search space to a community structure, and so on
Community Discovery: Process
[Figure: two conferences, Conf A and Conf B, each with six numbered author nodes; an edge of weight 1 links them (one author in common). The resulting conference network is partitioned into Community A and Community B.]
% of Common Authors Overlapping Between Communities
– TELETEACHING / HUMAN INTERACTION (chi, hicss)
– AI / DB (icai, aaai)
– ROBOTICS / MULTIMEDIA (icra, icpr)
– TELECOM (icc, globecom)
– APPLIED COMPUTING / CRYPTO (sac, compsac)
– SOFTWARE ENGINEERING (kbse, icse)
– DISTRIBUTED SYSTEMS / COMPILERS (ipps, iccS)
– GENETIC AND EVOLUTIONARY ALGORITHMS (cec, gecco)
– HUMAN-COMPUTER INTERACTION (icchp, hci)
Overview of the DBLP Network
12,227 conferences · 747,752 contributions · 533,334 authors (as of March 2010)
Community Based Metrics
• Community Impact (Cimp): a community has scientific impact n if n of its authors have an h-index of at least n, and the other authors each have an h-index of at most n.
• Community Health/Diversity (Cht): the health/diversity of a community is defined as the number of communities that share authors with the given community.
• Author Membership Degree (Amd): the community membership degree of an author is defined as the number of scientific publications the author has published in the community, divided by the author's total number of publications.
…
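The three metrics above can be sketched in a few lines. This is an illustrative reading of the definitions, not the tool's actual code; in particular, Cimp is computed here as an h-index taken over the authors' own h-indexes, which matches the "n authors with h-index at least n" wording.

```python
def h_index(values):
    """Hirsch index over a list of counts: per-paper citations for an
    author, or per-author h-indexes for the community-level variant."""
    ranked = sorted(values, reverse=True)
    return sum(1 for rank, v in enumerate(ranked, start=1) if v >= rank)

def community_impact(author_h_indexes):
    """Cimp: the largest n such that n authors have h-index >= n."""
    return h_index(author_h_indexes)

def community_health(community_authors, other_communities):
    """Cht: number of other communities sharing at least one author."""
    return sum(1 for other in other_communities if community_authors & other)

def author_membership_degree(author_pubs, community_pubs):
    """Amd: fraction of the author's publications inside the community."""
    return len(author_pubs & community_pubs) / len(author_pubs)
```

Authors and publications are modeled as plain sets here; the real Community Engine Tool works over the DBLP-derived network instead.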
Community Engine Tool
[Figure: Community Engine Tool architecture. A Clustering Engine and a Network Manager operate over Network Metadata and a Communities Database; CET Services are exposed via REST/HTTP/SOAP to a Community Network Analysis UI and to external services such as Reseval.]
Community Impact
[Chart: per-community average h-index (ranging from 13 to 38.6) and total citations (ranging from 579.8 to 6,676.6) for the discovered communities.]
Community Based Metric Analysis Communities with high Health/Diversity
Communities with low Health/Diversity
Lessons Learned & Future Work
• We developed a model and a tool that implement community discovery mining from existing social network information (conferences, papers, affiliations, ...)
• We propose community-based metrics that aim at improving how scientific content and researchers are searched and assessed
• Future work includes:
  – providing different algorithms for discovering communities using different networks
  – the approach is part of a larger research effort aimed at studying how scientific communities are born, evolve, remain healthy or become unhealthy (e.g., self-referential), and eventually vanish
M. Krapivin, M. Marchese, F. Casati, "Exploring and Understanding Citation-Based Scientific Metrics", Advances in Complex Systems, vol. 13, no. 1, pp. 1-23, 2010.
Exploring and Understanding Scientific Metrics in Citation Networks
Metrics and Indicators: The Context
Web crawlers
• Search for "relevant" papers in a specific domain/topic
• Navigate cited papers in a specific domain/topic
Scientific domain
• Measuring the progress of a researcher, group, or institution
• For hiring and career promotion
• Seeking conferences, committee members, workshop chairs, ...
Standard metrics
• P-index: number of papers
• CC-index: citation count excluding self-citations
• Modifications to improve the metric:
  – CPP: average number of citations per paper
  – "Crown" indicator: the average number of citations per article, normalized by the average CPP in the domain
Citation Network Metrics
• Hypertext-Induced Topic Selection (HITS) (1998)
• PageRank (PR) (1998)
• Hilltop algorithm (1999)
• TrustRank (2004)
Person-centered, but citation-based:
• H-index – Hirsch index (2005)
• G-index
• M-index
PageRank
• PageRank (Brin S., Page L., 1998):
  PR(p) = (1 - d)/N + d · Σ_{q cites p} PR(q)/L(q)
  where d is the damping factor, N the number of papers, and L(q) the number of outgoing links of q (PR)
• Adaptation to citations: the citation network (CN) has no loops
• Focusing (changing the Markov matrix)
Mikalai Krapivin and Maurizio Marchese, "Focused Page Rank in Scientific Papers Ranking", ICADL 2008.
• Potential Weight: the PR of the citing papers
• Dispersed Weight: the number of outgoing links of the citing papers
Related Work
P. Chen, H. Xie, S. Maslov and S. Redner, "Finding scientific gems with Google's PageRank algorithm", Journal of Informetrics, 2007.
Experiments over Physical Review citation network: 353,268 nodes and only internal citations
Average PR versus number of citations
Experiments
• Data set: – ACM portal based: metadata crawled by Citeseer – 260K papers, 240K authors and ca. one million
internal citations – Completeness ?
• internal citations represent between 1/5 to 1/3 of all citations
• “ACM world” vs. Google Scholar ? – Less errors, manually or semi-supervised
processed, trusted origin
Plotting the Difference
• All 266K papers are presented in one plot
• The method may be applied to an arbitrary number of papers (i.e., citation graph nodes)
Gem: Plotting Incoming Citations
• A gem gets its weight from a highly cited paper
• Or it got the attention of a paper that "will" become important
• Is it good or bad? It simply identifies the relevance of a paper with a different metric
Stone (Popular Paper): Plotting Outgoing Links
• A paper with a significant number of citations
• Effect of outgoing links: the more a paper cites, the less weight it brings to each cited paper
PR-Hirsch vs Hirsch
The same band-based plotting to see the difference between H-index and PRH-index
Impact of Different Metrics
• Pragmatic approach: to understand the divergence of two indexes, measure how often, on average, the top t results would contain different papers, with significant values for t = 1, 10, 20
• Divergence: Div_{M1,M2}(t, n, S)
  – t: top results (search window)
  – n: size of the document subset
  – S: the complete set of documents
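One way to read Div_{M1,M2}(t, n, S) in code: average, over random subsets of size n drawn from S, the fraction of the top-t window that differs between the two rankings. The averaging scheme and function names below are illustrative assumptions, not the paper's exact definition.

```python
import random

def divergence(rank1, rank2, t, n, docs, trials=200, seed=0):
    """Average fraction of the top-t results that differ between two
    metrics (rank1, rank2: dicts doc -> score, higher is better),
    over `trials` random subsets of size n drawn from docs."""
    rng = random.Random(seed)
    docs = list(docs)
    total = 0.0
    for _ in range(trials):
        subset = rng.sample(docs, n)
        top1 = set(sorted(subset, key=rank1.get, reverse=True)[:t])
        top2 = set(sorted(subset, key=rank2.get, reverse=True)[:t])
        total += len(top1 - top2) / t  # disagreement in this window
    return total / trials
```

Identical metrics give divergence 0; fully reversed metrics give divergence 1 whenever n ≥ 2t.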
Lessons Learned
• PR and CC are quite different metrics for ranking papers: a typical search would return half of its results different.
• There are a significant number of "gems", while there are relatively few "stones" (to be explored in future work).
• The main factor contributing to the difference is weight dispersion: gems are caused by high weight concentration, while stones are caused by dispersion.
• PR-based metrics may be applied to authors as well. For authors the difference between PRH and H is again very significant, and index selection is likely to have a strong impact on how people are ranked.
Mikalai Krapivin, Aliaksandr Autayeu, Maurizio Marchese, Enrico Blanzieri, and Nicola Segata, "Improving Machine Learning Approaches with Natural Language Processing", International Conference on Asia-Pacific Digital Libraries, ICADL 2010.
Keyphrase Extraction from Scientific Documents of Scientific Communities
Keyphrases Extraction: Motivations
• A keyphrase is a phrase that briefly describes the content of a document
• Stakeholders
  – Librarians
    • content classification/categorization
    • adding more meta-information / support for facets
    • improved navigation
    • ...
  – End users/researchers
    • tagged search
    • state-of-the-art search
    • search for collaborators
    • search for an appropriate venue for publication
    • ...
Data Mining From Scientific Papers: Challenges and Problems
• Explicit information
  – title, references, authors (name, mail, ...), venues, keywords, ...
  – keywords are present in the header after the token "Keyphrases/Keywords:"
• Implicit information
  – keyphrases, keywords (tags), concepts
Data Mining: Explicit Keyphrase Extraction
• State-of-the-art techniques
  – Hidden Markov Model (Seymore et al., 1999)
  – Conditional Random Field (Peng et al., 2004)
  – Support Vector Machine (SVM) (Hui et al., 2003)
• Datasets:
  – a few exist and are publicly available
  – e.g. Rexa: 5000 scholarly document headers
• Results: up to 97% F-measure
Data Mining: Implicit Keyphrase Extraction
• State-of-the-art techniques
  – KEA (naïve Bayes), Witten et al., 1999
  – Support Vector Machine (SVM), Wang et al., 2005
  – decision trees, Turney et al., 2002
  – genetic algorithms + heuristics, Turney, 2002
• Datasets:
  – news, emails, meeting notes, scientific papers, ...
  – no standard set yet!
• Results:
  – vary from 10% to 30% in F-measure
  – it is difficult to compare results from different datasets: small dimensionality (10-200 documents), manual quality control, different types of content
Our Approach: methodology
• Build a high-quality document dataset
  – domain specific, i.e. a community
• Analyze linguistic characteristics of the dataset
• Propose heuristics for potential keyphrase candidates
• Define the feature set
• Analyze and compare different machine learning methods
Linguistic Analysis
OpenNLP tools: part-of-speech (POS) tags
  – NN, NNP, NNS: nouns
  – IN: prepositions
  – JJ: adjectives
  – VBN, VBP, VBG, VBD: verbs
OpenNLP tools: chunks
  – B-NP, I-NP: noun phrases
  – B-PP: prepositional phrases
  – B-VP, I-VP: verbal phrases
Keyphrases Extraction: Heuristic
• Filter by chunk type: only NP chunks
• Filter by POS tags: NN, NNP, JJ, NNS, VBG and VBN
Sentence:
"Therefore, the seat reservation problem is an on-line problem, and a competitive analysis is appropriate."
Candidates:
seat; seat reservation; seat reservation problem; reservation; reservation problem; problem; on-line problem; analysis; competitive analysis
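The filtering heuristic can be sketched as follows, assuming POS tagging and NP chunking have already been done upstream (e.g. with OpenNLP) and one NP chunk arrives as a list of (token, tag) pairs. As a simplification, tokens with disallowed tags are dropped before forming n-grams.

```python
# POS tags admitted by the heuristic above
ALLOWED = {"NN", "NNP", "JJ", "NNS", "VBG", "VBN"}

def candidates(tagged_np_chunk, max_len=3):
    """Extract candidate keyphrases (n-grams up to max_len tokens)
    from one NP chunk given as a list of (token, pos) pairs."""
    tokens = [tok for tok, pos in tagged_np_chunk if pos in ALLOWED]
    out = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            out.append(" ".join(tokens[i:j]))
    return out
```

On the chunk "the seat reservation problem", this yields exactly the candidates listed on the slide for that phrase (the determiner "the" is filtered out by its DT tag).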
Keyphrases Extraction: Feature Set
  – 1: term frequency
  – 2: inverse document frequency
  – 3: position in text
  – 4: quantity of tokens
  – 5: part of text
  – 6-8: i-th token POS tag
  – 9-11: i-th token head POS tag
  – 12, 15, 18: i-th token dependency label
  – 13, 16, 19: distance for i-th incoming arc
  – 14, 17, 20: distance for i-th outgoing arc
Keyphrase Extraction: Machine Learning Methods
• SVM (fast SVM library)
  – universal, but slow; not really scalable to large datasets
• FaLK-SVM (SVM with local search)
  – faster, more scalable
• Random Forest
  – fastest; probabilistic; based on decision trees
• KEA (naïve Bayes + heuristics)
  – fast; only 3 features: not powered by linguistic features
Method         Precision   Recall    F-Measure
FaLK-SVM       24.59%      35.88%    29.18%
SVM            22.78%      38.28%    28.64%
Random Forest  26.40%      34.15%    29.78%
KEA            18.61%      26.96%    22.02%
F = 2 · P · R / (P + R)
where precision P and recall R compare the extracted keyphrases with the actual ACM keyphrases.
[Plots: F-measure as a function of the number of features and of the training dataset size.]
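The evaluation behind the table reduces to a simple set comparison between extracted and gold (ACM) keyphrases; a minimal sketch, with illustrative names:

```python
def prf(extracted, correct):
    """Precision, recall and F-measure of extracted keyphrases
    against the gold (e.g. author-assigned ACM) keyphrases."""
    hits = len(set(extracted) & set(correct))
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(correct) if correct else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

In practice keyphrases are usually stemmed or normalized before matching, which this sketch omits.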
Current Work
• Include tag navigation in community Digital Libraries or Liquid Journals
• Use extracted keyphrases as seeds in clustering algorithms
  – Seeds Affinity Propagation algorithm (Renchu et al., 2010)
• Automatically propose tags to users of DLs
Conclusions
• Social networks contain important information both explicitly (tags, topics, interactions, etc.) and implicitly (in their inner structure)
• This information can be successfully mined with state-of-the-art IT methods and tools and used in various applications:
  – improving digital library navigation
  – innovative ways to assess scientific impact
  – new ways of disseminating knowledge (LiquidJournal, Fabio Casati's lecture)
  – recommendation systems
  – ...
Acknowledgements
• Work presented was done in collaboration with Fabio Casati, Alejandro Mussi, Mikalai Krapivin and many more people in the LiquidPub group
• Part of the work has been supported by the EU ICT project LiquidPub, under FET-Open grant number 213360.