mashspa

Open Bibliography, and why it shouldn't have to exist.

Ben O'Steen, “Mashspa” Mashed Libraries, Bath, 29/10/2010

CC-By

Morning (don't worry, I'll be quick...)

Urgh, “Open” - what does that mean?

Publishing bibliographic information under a permissive license to encourage indexing, re-use, and re-purposing.

But.... why?

In essence, an open bibliography is all about

Advertising

Bibliographic info allows you to

● Identify and find an item you know you want

● Discover related items or items you believe you want

● Serendipitously discover items you would like without knowing they might exist

● And so on.

Requires Increasing Investment!

Advertising 'proverb'

You never spend money on advertising; you invest with an expectation of return on investment.

To maximise returns, you maximise the audience.

Should the advertising target 'b2b' or 'consumers'?

One thing I am not saying must be necessary...

But, by not making bibliographic data open, you limit the audience.

(You also limit the data quality, but more on that later.)

“Can't I just scrape sites and reuse it anyway? It's just facts after all...”

“Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases”

http://is.gd/gqkqb

Databases have in the past been defended using Copyright laws.

This new law codifies a new protection based on “sui generis”* rights, rights earned by the “sweat of the brow”

* http://en.wikipedia.org/wiki/Sui_generis

So far, no one seems to have any evidence that this encouraged database-based economies.

There is evidence that it 'awarded' unending monopolies on existing datasets.

Due to fluffy wording, it is a timebomb

It is a right, like copyright, that doesn't need to be defended and can be assumed for almost any aggregation.

When we asked UK PubMedCentral if we could reproduce the bibliographic data they share through their OAI-PMH service, they said “Generally, No”*

(*me paraphrasing that they had non-transferable licenses and contracts yada yada. Their 'OA subset' of 1876 journals is available however, mainly BMC.)

From OAI-PMH specification:

* Data Providers administer systems that support the OAI-PMH as a means of exposing metadata; and

* Service Providers use metadata harvested via the OAI-PMH as a basis for building value-added services.

http://www.openarchives.org/OAI/openarchivesprotocol.html
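
As a rough illustration of the Service Provider role quoted above, the sketch below builds a `ListRecords` request and pulls Dublin Core titles out of a response. The base URL and the sample XML are made up for illustration; only the verb name, the `oai_dc` metadata prefix and the XML namespaces come from the OAI-PMH specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Dublin Core element namespace, as used inside oai_dc records.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request for a (hypothetical) data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

def extract_titles(xml_text):
    """Return every dc:title found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//dc:title", NS)]

# Hand-written sample response, trimmed to the parts used here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(extract_titles(SAMPLE))
```

A real harvester would also have to follow `resumptionToken` paging. The harvesting itself is the easy part; whether the harvested metadata may be reused is a licensing question, not a technical one.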

“… Service Providers use metadata harvested via the OAI-PMH as a basis for building value-added services.”

And the survey said...

Open Bibliographic principles

http://openbiblio.net/2010/10/15/principles-for-open-bibliographic-data/

1 - When publishing data make an explicit and robust license statement.

2 - Use a recognized waiver or license that is appropriate for metadata.

3 - If you want your data to be effectively used and added to by others it should be open … – in particular non-commercial and other restrictive clauses should not be used.

4 - We strongly recommend explicitly placing bibliographic data in the Public Domain via PDDL or CC0.

5 - We strongly urge creators of bibliographic metadata explicitly either dedicate this to the public domain or use an open licence.

Identify: Title, Date, Any identifiers, Publisher, Container (e.g. Journal), Author names, etc.

Discover: Keywords, Abstract, Author Identifiers, etc.

Serendipity: Citations, citing text, Usage data, supplemental data, etc.

Bibliographic Sliding Scale

Identify

Discover

Serendipity

Increasing Investment

BUT

Increased Chance of usage
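
One way to make the sliding scale concrete is to check which tiers a given record can support. This is only a sketch: the field groupings follow the three tiers above, but the key names and the sample record are illustrative and not taken from any metadata standard.

```python
# Field groupings follow the three tiers of the sliding scale; the key
# names are illustrative, not taken from any particular metadata standard.
TIERS = [
    ("Identify", {"title", "date", "identifiers", "publisher", "container", "authors"}),
    ("Discover", {"keywords", "abstract", "author_identifiers"}),
    ("Serendipity", {"citations", "citing_text", "usage_data", "supplemental_data"}),
]

def tiers_supported(record):
    """Return the tiers for which the record has at least one non-empty field."""
    present = {key for key, value in record.items() if value}
    return [name for name, fields in TIERS if present & fields]

record = {
    "title": "An example article",
    "authors": ["A. Author"],
    "keywords": ["open data"],
    "citations": [],  # empty, so it does not count towards Serendipity
}
print(tiers_supported(record))
```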

“So, we just pick a standard and publish and we'll reap all the benefits, right?”

Erm, no.

For three main reasons.

#1 “Where there is human input, there is interpretation”

Meanings of words and usage of fields have changed over time.

#1 (cont.) Interchange standards don't make the information any more understandable.

Someone interprets them.

#2 Data has been entered and curated without large-scale sharing as a focus.

Lots of implicit, contextual info left out.

#3 Data quality is typically poor with formally closed datasets.

For #1 - Collisions caused by interpretation can really only be solved by sharing data and seeing how bad things are.

Standards and interoperability:

“The first follower transforms a lone nut into a leader” - Derek Sivers' TED Talk

http://www.ted.com/talks/lang/eng/derek_sivers_how_to_start_a_movement.html

Video: http://www.youtube.com/watch?v=GA8z7f7a2Pk

The man dancing is joined by one or two, but he is still doing his own thing.

Eventually a group decides to join him, and the group grows.

The quality of the dance isn't important, but the community dancing along with it is.

And so it is with standards.

For #2 (implicit info), provenance and the source of data gives us crucial clues.

Due to #1, I remain unconvinced that this information can ever be totally machine-readable.

And for #3, misleading or incorrect data...

… um.

No easy answers – we just don't have the info.

The data clean-up process is going to be probabilistic.

(We cannot be sure – by definition - that we are 'accurate' when we de-duplicate or disambiguate.)

Typical methods then:

Natural Language Processing,

Machine learning techniques

and

String Metrics and old skool record deduplication

I <3 String Metrics and old skool record deduplication

(out of the 3)

http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html

http://is.gd/gqOjQ
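
To give a flavour of what a string metric does, here is a minimal sketch of two common ones: Levenshtein edit distance and a token-based Jaccard overlap. These are generic textbook versions written for this example, not code from the library linked above.

```python
def levenshtein(a, b):
    """Edit distance: minimum single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale edit distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def token_jaccard(a, b):
    """Overlap of lower-cased word sets, which ignores word order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

print(levenshtein("kitten", "sitting"))
print(token_jaccard("Open Bibliography", "bibliography open"))
```

Character-level metrics catch typos in a title or name, while token-level metrics cope with reordered words; in practice you would probably combine several.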

Old skool record linkage:

“Fellegi-Sunter” - probabilistic record linkage (PRL).

It's not a great model, but it's achievable.

Machine-learning requires a reasonably large golden set.

(http://en.wikipedia.org/wiki/Record_linkage)
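
The core of the Fellegi-Sunter model can be sketched in a few lines: each compared field contributes an agreement or disagreement weight, and two thresholds sort record pairs into match, non-match, or "look at it by eye". The m and u probabilities, the thresholds and the field names below are invented for illustration, and fields are compared by exact equality to keep the sketch short.

```python
from math import log2

# Per-field probabilities, invented for illustration:
#   m = P(field agrees | records describe the same work)
#   u = P(field agrees | records describe different works)
FIELDS = {
    "title":  {"m": 0.95, "u": 0.01},
    "year":   {"m": 0.90, "u": 0.05},
    "author": {"m": 0.85, "u": 0.02},
}

def match_weight(a, b):
    """Sum the log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, p in FIELDS.items():
        if a.get(field) == b.get(field):
            total += log2(p["m"] / p["u"])              # agreement weight
        else:
            total += log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return total

def classify(weight, lower=0.0, upper=10.0):
    """Two thresholds split pairs into non-match, clerical review, and match."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"

a = {"title": "Open Bibliography", "year": 2010, "author": "O'Steen"}
b = dict(a, author="OSteen, B")  # same title and year, different author string
print(classify(match_weight(a, a)), classify(match_weight(a, b)))
```

In a real pipeline the exact-equality test would be replaced by a string metric with a cut-off, and the m and u values would be estimated from the data rather than guessed.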

PRL is not great in itself, BUT

It does lend itself to Map-Reduce style operations

And

It's a great way to filter down to those records that really do need to be compared by eye.

http://datamining.anu.edu.au/projects/linkage.html

“Record or data linkage techniques are used to link together records which relate to the same entity (e.g. patient, customer, household) in one or more data sets where a unique identifier for each entity is not available in all or any of the data sets to be linked.”

ANU's Febrl Python code
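
The Map-Reduce point made above can be sketched as a blocking step: the map phase assigns each record a cheap blocking key, and the reduce phase only generates candidate pairs within a block. This is a toy written for this example and does not use Febrl's API; the blocking key and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(records):
    """Emit (blocking_key, record_id) pairs; each record is handled independently."""
    for rec in records:
        # Hypothetical blocking key: publication year plus first letter of the title.
        key = (rec["year"], rec["title"][:1].lower())
        yield key, rec["id"]

def reduce_phase(pairs):
    """Group ids by key, then emit candidate pairs within each block only."""
    blocks = defaultdict(list)
    for key, rec_id in pairs:
        blocks[key].append(rec_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

records = [
    {"id": 1, "year": 2010, "title": "Open Bibliography"},
    {"id": 2, "year": 2010, "title": "Open bibliography."},
    {"id": 3, "year": 2010, "title": "Record linkage"},
    {"id": 4, "year": 2009, "title": "Open data"},
]
candidates = list(reduce_phase(map_phase(records)))
print(candidates)
```

Four records would give six pairs if compared exhaustively; blocking leaves a single candidate pair to score, and only the uncertain scores need a human to look at them.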

So far, much effort has been directed at the Works; we need to put much more effort into their Networks.

Bibliographic directions

Networks?

● A cites B

● Works by a given (identified) Author

● Works cited by a given Author

● Works citing articles that have since been disproved, redacted or withdrawn.

● Co-authors

● And many more connections we've not even considered yet ('betweenness', 'centrality', etc)
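
A few of the connections listed above can be answered with very little code once the data is available as a graph. The sketch below uses plain dictionaries and toy data invented for illustration; measures such as betweenness would realistically call for a graph library.

```python
from collections import defaultdict

# Toy data, invented for illustration: work -> works it cites, work -> authors.
CITES = {"A": ["B", "C"], "B": ["C"], "D": ["C", "E"]}
AUTHORS = {"A": ["Ann", "Bob"], "B": ["Bob"], "D": ["Ann", "Cat"]}
WITHDRAWN = {"E"}

def cited_by(cites):
    """Invert the 'A cites B' edges so we can ask who cites a given work."""
    inverse = defaultdict(set)
    for source, targets in cites.items():
        for target in targets:
            inverse[target].add(source)
    return inverse

def citing_withdrawn(cites, withdrawn):
    """Works whose reference list includes something since withdrawn."""
    return sorted(w for w, targets in cites.items() if withdrawn & set(targets))

def co_authors(authors, person):
    """Everyone who shares at least one work with the given author."""
    found = set()
    for names in authors.values():
        if person in names:
            found.update(names)
    found.discard(person)
    return found

inverse = cited_by(CITES)
print(sorted(inverse["C"]))
print(citing_withdrawn(CITES, WITHDRAWN))
print(sorted(co_authors(AUTHORS, "Ann")))
```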

In Summary,

● Accessible Bibliography as Advertising.

● Bibliography authors choose how they wish to invest to gain usage and real impact.

● Closed data has a much slimmer chance of increasing in quality

● Open data makes it easier to find problems and to improve the data

● Benefits will come from developing networks of information

● Don't get hung up on standards! A lone nut with followers doing something copyable is enough!
