mashspa
DESCRIPTION
Open Bibliography and standards.
TRANSCRIPT
Open Bibliography, and why it shouldn't have to exist.
Ben O'Steen
“Mashspa” Mashed Libraries, Bath 29/10/2010
CC-By
Morning, (don't worry, I'll be quick...)
Urgh, “Open” - what does that mean?
Publishing bibliographic information under a permissive license to
encourage indexing, re-use, and re-purposing.
But.... why?
In essence, an open bibliography is all about
Advertising
Bibliographic info allows you to
● Identify and find an item you know you want,
● Discover related items or items you believe you want
● Serendipitously discover items you would like without knowing they might exist
● And so on.
Requires Increasing Investment!
Advertising 'proverb'
You never spend money on advertising;
you invest with an expectation of
return on investment
To maximise returns, you maximise the audience.
Should the advertising target 'b2b' or 'consumers'?
One thing I am not saying must be necessary...
But, by not making bibliographic data open, you
limit the audience.
(You also limit the data quality, but more on that later.)
“Can't I just scrape sites and reuse the data anyway? It's just facts
after all...”
“Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal
protection of databases”
http://is.gd/gqkqb
Databases have in the past been defended using Copyright laws.
This new law codifies a new protection based on
“sui generis”* rights, rights earned by the “sweat of the brow”
* http://en.wikipedia.org/wiki/Sui_generis
So far, no one seems to have any evidence that this encouraged
database-based economies.
There is evidence that it 'awarded' unending monopolies on existing
datasets.
Due to fluffy wording, it is a timebomb
It is a right, like copyright, that doesn't need to be defended
and can be assumed for almost any aggregation.
We asked UK PubMedCentral if we could reproduce the bibliographic data they share through
their OAI-PMH service.
They said “Generally, No”*
(*me paraphrasing that they had non-transferable licenses and contracts yada yada. Their 'OA subset' of
1876 journals is available however, mainly BMC.)
From OAI-PMH specification:
* Data Providers administer systems that support the OAI-PMH as a means of exposing metadata; and
* Service Providers use metadata harvested via the OAI-PMH as a basis for building value-added
services.
http://www.openarchives.org/OAI/openarchivesprotocol.html
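For anyone who has not seen a harvest, here is a minimal sketch of what a Service Provider does: build a ListRecords request and pull Dublin Core fields out of the reply. The verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH specification; the base URL (example.org) and the sample response are made up for illustration, and nothing is fetched over the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hand-written sample of what a provider might return (illustrative only).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Article</dc:title>
          <dc:creator>Smith, J.</dc:creator>
          <dc:date>2010</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, titles and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "creator": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

Note that being able to harvest the metadata this way says nothing about whether you are licensed to republish it.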
“… Service Providers use metadata harvested via the OAI-PMH as a basis
for building value-added services.”
And the survey said...
X
Open Bibliographic principles
http://openbiblio.net/2010/10/15/principles-for-open-bibliographic-data/
1 - When publishing data, make an explicit and robust license statement.
2 - Use a recognized waiver or license that is appropriate for metadata.
3 - If you want your data to be effectively used and added to by others it should be open … – in particular non-commercial and other restrictive clauses should not be used.
4 - We strongly recommend explicitly placing bibliographic data in the Public Domain via PDDL or CC0.
5 - We strongly urge creators of bibliographic metadata to explicitly either dedicate this to the public domain or use an open licence.
Identify: Title, Date, Any identifiers, Publisher, Container (eg Journal), Author names etc
Discover: Keywords, Abstract, Author Identifiers, etc
Serendipity: Citations, citing text, Usage data, supplemental data, etc.
Bibliographic Sliding Scale
Identify
Discover
Serendipity
Increasing Investment
BUT
Increased Chance of usage
“So, we just pick a standard and publish and we'll reap all the
benefits, right?”
Erm, no.
For three main reasons.
#1 “Where there is human input, there is interpretation”
Meanings of words and usage of fields have changed
over time.
#1 (cont.) Interchange standards don't make the
information any more understandable.
Someone interprets them.
#2 Data has been entered and curated without large-scale
sharing as a focus.
Lots of implicit, contextual info left out.
#3 Data quality is typically poor with formally closed
datasets.
For #1 - Collisions caused by interpretation can really only be
solved by sharing data and seeing how bad things are.
Standards and interoperability:
“The first follower transforms a lone nut into a leader” -
Derek Sivers' TED Talk
http://www.ted.com/talks/lang/eng/derek_sivers_how_to_start_a_movement.html
Video: http://www.youtube.com/watch?v=GA8z7f7a2Pk
The man dancing is joined by one or two, but he is still doing his own thing.
Eventually a group decides to join him, and the group grows.
The quality of the dance isn't important, but the community dancing along with it is.
And so it is with standards.
For #2 (implicit info), provenance and the source of data gives us
crucial clues.
Due to #1, I remain unconvinced that this information can ever be
totally machine-readable.
And for #3, misleading or incorrect data...
… um.
No easy answers – we just don't have the info.
The data clean-up process is going to be
probabilistic.
(We cannot be sure – by definition - that we are 'accurate' when we de-duplicate or disambiguate.)
Typical methods then:
Natural Language Processing,
Machine learning techniques
and
String Metrics and old skool record deduplication
I <3 String Metrics and old skool record deduplication
(out of the 3)
http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html
http://is.gd/gqOjQ
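As a rough illustration of a string metric at work, here is a small sketch of Levenshtein edit distance turned into a 0..1 similarity score for titles. The normalisation step (lowercase, strip punctuation) is an assumption for this example, not something taken from the libraries linked above.

```python
import re


def normalise(text):
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance scaled to 0..1 (1.0 means identical after normalising)."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("The Origin of Species.", "the origin of species"))
    print(similarity("Origin of Species", "Origins of the Species"))
```

A score like this only says two strings look alike; deciding whether two records are the same work still needs a threshold, and some pairs will need a human.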
Old skool record linkage:
“Fellegi-Sunter” - probabilistic record linkage (PRL).
It's not a great model, but it's achievable.
Machine-learning requires a reasonably large golden set.
(http://en.wikipedia.org/wiki/Record_linkage)
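A minimal sketch of the Fellegi-Sunter idea: each field gets an m-probability (chance the field agrees when two records really are the same item) and a u-probability (chance it agrees when they are not). Agreement adds log2(m/u) to the score, disagreement adds log2((1-m)/(1-u)), and two thresholds split pairs into match, possible match and non-match. The m/u values and thresholds below are invented for illustration; in practice they would be estimated from the data.

```python
import math

# Illustrative (made-up) probabilities: field -> (m, u).
M_U = {
    "title": (0.95, 0.01),
    "year": (0.90, 0.05),
    "first_author": (0.85, 0.02),
}


def field_weight(field, agrees):
    """Log-likelihood-ratio weight contributed by one field comparison."""
    m, u = M_U[field]
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def match_weight(rec_a, rec_b):
    """Sum the field weights; assumes both records carry every field in M_U."""
    return sum(field_weight(f, rec_a[f] == rec_b[f]) for f in M_U)


def classify(weight, lower=0.0, upper=8.0):
    """Two thresholds: the middle band is the 'check by eye' pile."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


if __name__ == "__main__":
    a = {"title": "open bibliography", "year": 2010, "first_author": "osteen"}
    b = {"title": "open bibliography", "year": 2011, "first_author": "o'steen"}
    w = match_weight(a, b)
    print(round(w, 2), classify(w))
```

Exact equality is used for field agreement here to keep the sketch short; a string metric with a cut-off could stand in for it.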
PRL is not great in itself, BUT
It does lend itself to Map-Reduce style operations
And
It's a great way to filter down to those records that really do need to be compared by eye.
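The Map-Reduce fit can be sketched like this: the map step emits a cheap "blocking key" per record, and the reduce step only compares records that share a key, so the expensive pairwise comparison never runs across the whole dataset. The blocking key used here (year plus the first four letters of the title) and the record layout are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Cheap key: records with different keys are never compared."""
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return (record["year"], title[:4])


def map_phase(records):
    """Map: emit (key, record) pairs."""
    for record in records:
        yield blocking_key(record), record


def reduce_phase(keyed_records):
    """Reduce: group by key, then emit candidate pairs within each group."""
    blocks = defaultdict(list)
    for key, record in keyed_records:
        blocks[key].append(record)
    for members in blocks.values():
        for a, b in combinations(members, 2):
            yield a["id"], b["id"]


def candidate_pairs(records):
    return list(reduce_phase(map_phase(records)))


if __name__ == "__main__":
    records = [
        {"id": 1, "title": "Open Bibliography", "year": 2010},
        {"id": 2, "title": "Open bibliography.", "year": 2010},
        {"id": 3, "title": "Record Linkage", "year": 2010},
        {"id": 4, "title": "Open Bibliography", "year": 2009},
    ]
    print(candidate_pairs(records))
```

The trade-off is that a crude key misses true duplicates that land in different blocks (a wrong year, a leading "The"), so real pipelines often run several passes with different keys.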
http://datamining.anu.edu.au/projects/linkage.html
“Record or data linkage techniques are used to link together records which relate to the same entity (e.g.
patient, customer, household) in one or more data sets where a unique identifier for each entity is not available in all or any of the data sets to be linked.”
ANU's Febrl Python code
So far, much effort has been directed at the Works;
We need to put much more effort into their
Networks.
Bibliographic directions
Networks?
● A cites B
● Works by a given (identified) Author
● Works cited by a given Author
● Works citing articles that have since been disproved, retracted or withdrawn.
● Co-authors
● And many more connections we've not even considered yet ('betweenness', 'centrality', etc)
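To make the network idea concrete, here is a small sketch that treats "A cites B" as a directed edge and answers two of the questions above: how often is each work cited, and which works cite something that has since been withdrawn. The edge list and the withdrawn set are toy data invented for the example.

```python
from collections import defaultdict

# Toy data: (citing work, cited work).
CITES = [("A", "B"), ("C", "B"), ("C", "D"), ("E", "D")]
WITHDRAWN = {"D"}


def cited_by(edges):
    """Invert the edge list: cited work -> set of works citing it."""
    index = defaultdict(set)
    for citing, cited in edges:
        index[cited].add(citing)
    return index


def in_degree(edges):
    """Citation count per work (a very simple centrality measure)."""
    return {work: len(citers) for work, citers in cited_by(edges).items()}


def citing_withdrawn(edges, withdrawn):
    """Works that cite at least one withdrawn item."""
    return sorted({citing for citing, cited in edges if cited in withdrawn})


if __name__ == "__main__":
    print(in_degree(CITES))
    print(citing_withdrawn(CITES, WITHDRAWN))
```

None of this works unless the citation data is open enough to be joined across sources in the first place.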
In Summary,
● Accessible Bibliography as Advertising.
● Bibliography authors choose how they wish to invest to gain usage and real impact.
● Closed data has a much slimmer chance of increasing in quality
● Open data makes it easier to find problems and to improve the data
● Benefits will come from developing networks of information
● Don't get hung up on standards! A lone nut with followers doing something copyable is enough!