Science 2.0 VU Processing Science 2.0 Data, Content Mining
WS 2015/16
Elisabeth Lex KTI, TU Graz
Agenda
• Repetition from last time: Open Science
• Processing academic resources
• Mining in academic resources (content perspective)
• Example: ContentMine (extraction of scientific facts)
2
Repetition: Open Science
• Open Science
  • Ideas, concepts, benefits, and pitfalls
  • E.g. enhancing collaboration and community building, increasing the efficiency of research vs. no reward system yet
• Open Data
  • Sharing your data influences how often you get cited (Piwowar et al., 2007; Piwowar et al., 2013)
• Different models for Open Access • Green vs. Gold vs. Hybrid
3
Open Science – 5 schools of thought
4
Example: Open Government Data: Eurostat
5
“I’d like to compare the unemployment rate in Austria with other European ones”
Via Google Public Data Explorer, https://www.google.com/publicdata/directory
Open Access in Science: Open Access Journals
● Green ("self-archiving"): the author can self-archive at the time of submission, whether the publication is grey literature (usually internal, non-peer-reviewed), a peer-reviewed journal publication, a peer-reviewed conference proceedings paper, or a monograph
● Gold ("author pays"): the author or the author's institution pays a fee to the publisher at publication time; the publisher then makes the publication available 'free' at the point of access
● Further little-used "roads": hybrid forms, for example platinum open access (does not charge author fees)...
● Both green and gold are compatible and can co-exist
Source: Jeffery, K. Open Access: An Introduction, 2006. http://www.ercim.eu/publication/Ercim_News/enw64/jeffery.html
Processing Academic Resources
7
Motivation
• Aggregate scientific results
• Exploratory search in digital collections
• Find experts in domains
• Make science discoverable
• Improve access to scientific publications
• Extract facts for research
• Discover relationships
• Check for errors => improve science
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in academic resources
  • Information Extraction
  • Topic Modeling
  • Clustering/Classification
  • Linking publications
• Make data and source code available ☺
9
KDD Process
10
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in academic resources
  • Information Extraction
  • Topic Modeling
  • Clustering/Classification
  • Linking publications
• Make data and source code available ☺
11
Datasets
• The European Library Open Dataset
  • Digital collection and 200 million bibliographic records
  • http://www.theeuropeanlibrary.org/tel4/access/data/opendata
• Datahub.io
  • E.g. DBLP Computer Science Bibliography: http://datahub.io/dataset/dblp
  • Metadata of over 1.8 million publications by 1 million authors
12
Repositories and Aggregators
• ISI Web of Science • Scopus • Pubmed • The European Library • Library of Congress • ArXiv • Figshare • Data Citation Index • Mendeley • Google Scholar • CiteSeerX • ...
13
APIs to Repositories ...
• APIs to access scientific publications and research data
• rOpenSci: arXiv, PlosOne, Figshare
• Mendeley: Developer API, http://dev.mendeley.com
  • Python package: pip install mendeley (see the sketch below)
14
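For illustration, a minimal Python sketch of querying the Mendeley catalog with that package. This is a sketch under assumptions: it uses the SDK's client-credentials flow, and MY_CLIENT_ID / MY_CLIENT_SECRET are placeholders for credentials registered at http://dev.mendeley.com; adjust method names to your installed SDK version.

# A sketch, not a definitive recipe: client-credentials flow of the
# mendeley package; the credentials below are placeholders.
from mendeley import Mendeley

MY_CLIENT_ID = 1234            # placeholder: your registered app id
MY_CLIENT_SECRET = 'secret'    # placeholder: your registered app secret

mendeley = Mendeley(MY_CLIENT_ID, MY_CLIENT_SECRET)
session = mendeley.start_client_credentials_flow().authenticate()

# Search the public catalog and print basic metadata
for doc in session.catalog.search('open science').iter(page_size=10):
    print(doc.title, doc.year)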
Example - rOpenSci
15
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in academic resources
  • Information Extraction
  • Topic Modeling
  • Clustering / Classification
  • Linking publications
• Make data and source code available ☺
16
Information Extraction
• IE goal: extract structured information out of unstructured content, e.g.
  • Method names, quantities, temporal expressions
  • Authors from scientific publications
  • Organizations in the acknowledgements section of papers
  • References
  • ...
17
IE Process
18
http://www.nltk.org/book/ch07.html
Input: raw text of a document → Output: list of (entity, relation, entity) tuples
Named entity types: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity)
POS tagging: applying word classes to words within a sentence
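A minimal sketch of this pipeline with NLTK (following the book chapter linked above); it assumes the standard NLTK models have been fetched via nltk.download():

import nltk

# NLTK IE pipeline: sentence segmentation -> tokenization ->
# POS tagging -> named entity chunking.
raw = "Elisabeth Lex works at TU Graz in Austria."

for sent in nltk.sent_tokenize(raw):       # split raw text into sentences
    tokens = nltk.word_tokenize(sent)      # split sentences into tokens
    tagged = nltk.pos_tag(tokens)          # apply word classes (POS tags)
    tree = nltk.ne_chunk(tagged)           # chunk named entities
    for subtree in tree.subtrees():
        if subtree.label() in ('PERSON', 'ORGANIZATION', 'GPE'):
            print(subtree.label(), ' '.join(tok for tok, tag in subtree.leaves()))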
IE Standard Approaches (1/2)
• Regular expressions / rule-based approaches
  • E.g. dates, email addresses, @user, RT@user (see the sketch below)
http://localhost:8888/notebooks/twitterprocessing.ipynb
19
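For illustration, a small Python sketch of such rule-based extraction; the patterns are simplified and not production-grade:

import re

text = "RT @alice: contact bob@example.org by 2015-11-18 or ping @bob"

# Simplified patterns for emails, ISO dates, and (retweeted) user mentions
emails   = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
dates    = re.findall(r'\d{4}-\d{2}-\d{2}', text)
mentions = re.findall(r'(?:RT\s+)?(?<![\w.])@\w+', text)  # lookbehind skips emails

print(emails)    # ['bob@example.org']
print(dates)     # ['2015-11-18']
print(mentions)  # ['RT @alice', '@bob']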
IE as Machine Learning Task
• Supervised: train a model with annotated training data, use the trained model to classify unknown text
  • Choose a class label for a given input
  • Identify features of language data to classify it
  • Construct language models out of them
  • Learn about text/language from these models
• Methods (see the toy sketch below):
  • Classifiers: Naive Bayes, Maxent models
  • Sequence models: Hidden Markov Models, CRFs
20
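A toy sketch of the supervised setting with NLTK's Naive Bayes classifier; the training data and features below are invented purely for illustration:

import nltk

# Invented task: classify tokens as method names vs. other terms
def features(token):
    return {'suffix2': token[-2:],
            'is_upper': token.isupper(),
            'length': len(token)}

train = [(features(tok), label) for tok, label in [
    ('LDA', 'method'), ('CRF', 'method'), ('HMM', 'method'),
    ('Austria', 'other'), ('journal', 'other'), ('author', 'other'),
]]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features('SVM')))       # expected: 'method'
classifier.show_most_informative_features(3)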
Libraries
• NLTK (http://www.nltk.org) • http://localhost:8888/notebooks/science20-ie.ipynb
21
Mining academic documents
• Extraction of structural elements
  • Tables, figures, ...
• Extraction of facts from structural elements and the document
  • Named Entity Recognition (e.g. gene names, ...)
  • Relation extraction (e.g. system A impacts system B)
• Mostly: PDF format
  • Good for presentation, but problems with metadata quality; hard to analyse
  • While PDF analysis tools exist, there is still room for improvement!
22
Approach
• Divide and conquer
  • Extract blocks from the PDF based on structure and layout information
  • Classify the extracted blocks
    • E.g. into title, body, references, abstract, ...
  • Classify the content of the extracted blocks
    • E.g. tables
  • Extract relevant info from the content (named entities, nouns, dates, quantities, ...)
23
Approach
• Extracting blocks
  • Features: layout-specific, such as position, font, font size, ...
• Apply machine learning approaches
  • Unsupervised (clustering)
  • Supervised (classification)
24
Unsupervised Approach
• Clustering: given a set of objects, find groupings such that the similarity within a group is maximized and the similarity between groups is minimized
• Cluster = block
• Successive merge-and-split mechanism
25
Supervised Approach
• Classification: given a set of labeled examples, create a model and use it to predict the labels of unknown examples
• Classify blocks: Maximum Entropy Models (see the sketch below)
  • Create training data by labeling blocks, i.e. assigning blocks to classes
  • Learn a model based on the training data and apply it to classify unknown blocks
  • Features: layout, formatting, word frequencies, ...
26
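A minimal sketch of such a block classifier, with scikit-learn's logistic regression standing in for a maximum entropy model; the blocks and layout features are invented for illustration:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Invented layout features for labeled PDF blocks
train_blocks = [
    ({'font_size': 18, 'y_pos': 0.05, 'bold': 1, 'n_words': 12},  'title'),
    ({'font_size': 10, 'y_pos': 0.40, 'bold': 0, 'n_words': 180}, 'body'),
    ({'font_size': 9,  'y_pos': 0.90, 'bold': 0, 'n_words': 35},  'references'),
    ({'font_size': 11, 'y_pos': 0.15, 'bold': 0, 'n_words': 150}, 'abstract'),
]

vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train_blocks])
y = [label for _, label in train_blocks]

# Logistic regression is a (multinomial) maximum entropy model
clf = LogisticRegression().fit(X, y)

unknown = {'font_size': 17, 'y_pos': 0.06, 'bold': 1, 'n_words': 10}
print(clf.predict(vec.transform([unknown])))  # likely ['title']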
Fact Extraction from Publications
• Extract entities from within the identified blocks
  • E.g. author block: divide further to extract all authors contained in the block
• Extract relations between entities
  • Open Information Extraction
    • Learns a model without needing training data
    • Can extract binary relations from sentences
27
Example: Measuring quality of Wikipedia
28 Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein, and Michael Granitzer. 2012. Measuring the quality of web content using factual information. In Proceedings of WebQuality '12 at WWW '12.
Figure 1: Histograms of Wikipedia corpora for the unbalanced dataset (a) and the balanced dataset (b).
is the word count of t, and t is a Wikipedia article. The same holds for "Factual-density/sentence-count".
The word count measure outperforms the factual density measure normalized to sentence count as well as the word count on the unbalanced corpus. Apparently, word count is a strong feature on the unbalanced corpus.
We then evaluated the factual density measure on the balanced corpus, where both featured/good and non-featured articles are more similar with respect to document length. The results for this experiment are shown in Figure 2(b) as precision-recall curves. On the balanced corpus, factual density normalized to sentence count as well as to word count performs much better than on the unbalanced corpus, while word count, as expected, performs worse. There is not much difference between the normalization to word or sentence count since here, the number of words per document has a smaller influence on the result.
We also analyzed the distributions of featured/good and non-featured articles if factual density is used as measure, as depicted in Figure 3. We found that the distribution of the featured/good articles is clearly separated from the distribution of the non-featured articles, with peaks at two different factual density values (0.06 and 0.03, respectively). This finding is in contrast to the fact that the distributions of featured/good articles and non-featured articles have a high degree of overlap if word count is used, as shown in Figure 1(b). Consequently, on the balanced corpus, factual density clearly outperforms our baseline word count.
In a related experiment, we investigated the relational information contained in the binary relationships ReVerb extracts from sentences. We used the relations, i.e. only the predicates from the extracted triples, as a vocabulary to represent the documents. We then tested the discriminative power of these features by training a classifier to solve the binary classification problem of distinguishing featured/good from non-featured articles. The results reported in Table 1 were obtained using the WEKA implementation of a Naive Bayes classifier in combination with feature selection based on Information Gain (IG). From 40 000 relations, we selected the 10% best features in terms of IG. We achieved similar results for both corpora.
WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
Figure 3: Distribution of articles by factual density.
Table 1: Classification results using relational features on both corpora.

Measure     Unbalanced [%]   Balanced [%]
Accuracy    84.01            87.14
F-Measure   84               86.7
Precision   84               89.2
Recall      84               87.1

Apparently, relational features are more robust when the document length varies. However, we need to investigate this in more detail.
Extract Topics from Publications
• Topic models: algorithms that uncover thematic structure in document collections
• Facilitate searching, browsing, summarizing
• Latent Dirichlet Allocation (LDA)
  • Hierarchical probabilistic model
29
LDA
• Probabilistic model that helps find latent topics for documents
• Probabilistic model: treat data as observations that stem from a generative probabilistic process which involves hidden variables
  • Documents: the thematic structure is the hidden variable
  • Each topic is described by words in the documents
30
LDA
• Infer hidden structure using posterior inference
  • "What are the topics that describe the documents?"
• Classify unknown data using the topic model
  • "How does unknown data fit into the estimated topic structure?"
• The number of topics Z has to be chosen in advance
  • Defines the level of specificity of the topics
31
$P(w_i \mid d) = \sum_{j=1}^{Z} P(w_i \mid z_i = j)\, P(z_i = j \mid d)$
• $P(w_i \mid d)$: probability of the i-th word for doc d
• $P(w_i \mid z_i = j)$: probability of term $w_i$ within topic $j$
• $P(z_i = j \mid d)$: probability of using a word from topic $j$ in the doc
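A minimal sketch of fitting LDA on a toy corpus with gensim (the library choice is an assumption, not prescribed by the slides):

from gensim import corpora, models

# Toy corpus: each document is a list of tokens (in practice: paper abstracts)
texts = [
    ['gene', 'dna', 'sequence', 'genome'],
    ['topic', 'model', 'inference', 'dirichlet'],
    ['gene', 'expression', 'dna', 'cell'],
    ['model', 'probabilistic', 'latent', 'topic'],
]

dictionary = corpora.Dictionary(texts)             # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]    # bag-of-words vectors

# The number of topics Z has to be chosen in advance
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# "How does unknown data fit into the estimated topic structure?"
unseen = dictionary.doc2bow(['gene', 'dna', 'topic'])
print(lda[unseen])    # topic distribution of the unseen document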
Example: Model evolution of topics over time in Science journal
32
• Dataset: pages of Science from 1880-2002, from the JSTOR archive
https://www.cs.princeton.edu/~blei/topicmodeling.html
Validation of extracted information
33
• Crowdsourcing as a way to evaluate mining quality
• Share the extracted information via e.g. a web-based platform
• Enable users to give feedback
  • Accept, reject, suggest new concepts/facts
HowTo: Text Mining using rOpenSci
• Library that facilitates text mining on publications • Search for articles • Fetch articles • Get links for full text articles (xml, pdf) • Extract text from articles / convert formats • Collect bits of articles that you actually need • Download supplementary materials from papers
34 https://ropensci.org/tutorials/fulltext_tutorial.html
Chamberlain Scott (2015). fulltext: Full Text of Scholarly Articles Across Many Data Sources. R package version 0.1.0. https://github.com/ropensci/fulltext
Example: Text Mining using rOpenSci
# include the library
library("fulltext")

# ft_search() - get metadata on a search query
(res1 <- ft_search(query = 'open science', from = 'arxiv'))
(out <- ft_get(res1))
res1$arxiv

# ft_get() - get full or partial text of articles
res <- ft_get('cs/9301113v1', from = 'arxiv')

# extract the fulltext
res2 <- ft_extract(res)
res2$arxiv$data

# extract interesting parts from the fulltext
out %>% chunks("doi")
35
Example: Text Mining using rOpenSci
• fulltext can extract parts of a paper via chunks():
  • "all", "front", "body", "back", "title", "doi", "categories", "authors", "keywords", "abstract", "executive_summary", "refs", "refs_dois", "publisher", "journal_meta", "article_meta", "acknowledgments", "permissions", "history"
• Can do PDF extraction
  • E.g. via GhostScript: (res_gs <- ft_extract(pdf, "gs"))
• ...
36
https://ropensci.org/tutorials/fulltext_tutorial.html
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in academic resources
  • Information Extraction
  • Topic Modeling
  • Clustering/Classification
  • Linking publications
• Make data and source code available ☺
37
Clustering of Academic Resources
• Detect groupings of papers based on content similarity
  • E.g. along topics
• Transform content (e.g. the abstract of a paper) into a machine-readable representation
  • Bag-of-words approach: a document is treated as a bag of words/terms and represented as a vector
  • Document-term matrix: term frequencies across all documents
38
Vector Space Model
• Documents are vectors in term-document space
• Elements of the vectors are weights w_ij corresponding to doc i and term j
• Weights: frequencies of terms in docs, or TF-IDF
• Proximity (similarity) of documents is calculated as the cosine of the angle between document vectors (see the sketch below)
39
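A minimal sketch of the vector space model with scikit-learn (an assumed library choice): TF-IDF document vectors and their pairwise cosine similarities:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "abstracts"; in practice these would come from a repository/aggregator
docs = [
    "open access publishing in science",
    "mining facts from scientific publications",
    "open science and open access policies",
]

# Document-term matrix with TF-IDF weights w_ij
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cosine of the angle between every pair of document vectors
print(cosine_similarity(X).round(2))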
Example: Facilitate exploratory search
• By topic of interest (cluster = topic of interest)
• Setting: social bookmarking dataset, URLs described by tags
• Research questions:
  • What clusters (aka groups of interest) exist?
  • Are they somehow related?
  • How do they evolve over time?
Clustering Algorithms
• KDD lectures!
• Here, briefly: the k-means algorithm (see the sketch below)
  1. Select k points as initial centroids
  2. Repeat
  3. Form k clusters by assigning all points to the closest centroid
  4. Recompute the centroid of each cluster
  5. Until the centroids don't change
41
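A compact NumPy sketch of exactly these five steps, on toy 2-D points; in our setting the points would be the TF-IDF document vectors from the previous slides:

import numpy as np

def kmeans(points, k, rng=np.random.default_rng(0)):
    # 1. select k points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:                                   # 2. repeat
        # 3. assign every point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. recompute the centroid of each cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # 5. until the centroids don't change
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(points, k=2)
print(labels)    # two clusters, e.g. [0 0 1 1]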
Example
Classification of Scientific Publications
• Categorize into an established subject-based taxonomy
  • E.g. Library of Congress
  • UNESCO thesaurus
  • DOAJ subject classification
  • Library of Congress Subject Headings
44
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in academic resources
  • Information Extraction
  • Topic Modeling
  • Clustering/Classification
  • Linking publications
• Make data and source code available ☺
45
Linking Scientific Publications
• Citations (explicitly defined)
• Similarity
  • Statistical similarity: cosine
  • Semantic similarity: more complex, e.g. via topics
• Usage
• Argument support
• Contradiction
• ...
46
Linking via Citations
47
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in academic resources
  • Information Extraction
  • Clustering / Classification
  • Linking publications
  • Search
• Make data and source code available ☺
48
Sharing code
• Github • Bitbucket • iPython Notebooks • ...
49
Example: ContentMine
50 http://contentmine.org
Idea:
• Facts cannot be copyrighted
• Billions of facts sit in copyright-protected research articles
→ Make them publicly accessible!
Possible questions for ContentMine
• Find references to papers by a given author. This is metadata and therefore factual. It is usually trivial to extract references and authors; more difficult, of course, to disambiguate them.
• Find who sponsors research. Extract acknowledgements and perform Named Entity Recognition to detect companies. Link the companies to the papers in whose acknowledgements they are listed.
51
Machine Extraction of scientific facts
1. Crawl scientific literature
2. Scrape each scientific article
3. Extract facts
4. Index
5. Republish (WikiData)
https://github.com/ContentMine
Example: retrieve metadata for specific article
53
Content Mining Problems
• Secondary publishers create walled gardens
  • E.g. the ResearchGate portal
• Publishers' contracts ban content mining
• Publishers may cut off universities who mine
• Publishers lobby governments to require "licences for content mining"
  • UK → "the right to read is the right to mine"
http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-our-digital-future-peter-murray-rust-is-the-problem/
Summary
• Aggregators/repositories for scientific publications
• Mining content/data in publications
  • Information / fact extraction
  • Topic modeling
  • Clustering
    • E.g. exploratory analysis of large datasets
    • Find groups of interest expressed by user-generated tags, and their relations
• ContentMine as example
55
Questions?
See you next week!
56