mashspa

Open Bibliography, And why it shouldn't have to exist. Ben O'Steen “Mashspa” Mashed Libraries, Bath 29/10/2010 CC-By

Upload: benosteen

Post on 06-May-2015


DESCRIPTION

Open Bibliography and standards.

TRANSCRIPT

Page 1: Mashspa

Open Bibliography, And why it shouldn't have to exist.

Ben O'Steen “Mashspa” Mashed Libraries, Bath 29/10/2010

CC-By

Page 2: Mashspa

Morning, (don't worry, I'll be quick...)

Page 3: Mashspa

Urgh, “Open” - what does that mean?

Page 4: Mashspa

Publishing bibliographic information under a permissive license to encourage indexing, re-use, and re-purposing.

Page 5: Mashspa

But.... why?

Page 6: Mashspa

In essence, an open bibliography is all about

Advertising

Page 7: Mashspa

Bibliographic info allows you to

● Identify and find an item you know you want

Page 8: Mashspa

Bibliographic info allows you to

● Identify and find an item you know you want,

● Discover related items or items you believe you want

Page 9: Mashspa

Bibliographic info allows you to

● Identify and find an item you know you want,

● Discover related items or items you believe you want

● Serendipitously discover items you would like without knowing they might exist

● And so on.

Page 10: Mashspa

Bibliographic info allows you to

● Identify and find an item you know you want,

● Discover related items or items you believe you want

● Serendipitously discover items you would like without knowing they might exist

● And so on.

Requires Increasing Investment!

Page 11: Mashspa

Advertising 'proverb'

You never spend money on advertising;

you invest with an expectation of

return on investment

Page 12: Mashspa

To maximise returns, you maximise the audience.

Page 13: Mashspa

Should the advertising target 'b2b' or 'consumers'?

Page 14: Mashspa

One thing I am not saying must be necessary...

Page 15: Mashspa
Page 16: Mashspa

But, by not making bibliographic data open, you

limit the audience.

(You also limit the data quality, but more on that later.)

Page 17: Mashspa

“Can't I just scrape sites and reuse it anyway? It's just facts after all...”

Page 18: Mashspa

“Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal

protection of databases”

http://is.gd/gqkqb

Page 19: Mashspa

Databases have in the past been defended using Copyright laws.

This new law codifies a new protection based on

“sui generis”* rights, rights earned by the “sweat of the brow”

* http://en.wikipedia.org/wiki/Sui_generis

Page 20: Mashspa

So far, no one seems to have any evidence that this encouraged database-based economies.

There is evidence that it 'awarded' unending monopolies on existing

datasets.

Page 21: Mashspa

Due to fluffy wording, it is a timebomb

It is a right, like copyright, that doesn't need to be defended

and can be assumed for almost any aggregation.

Page 22: Mashspa

When we asked UK PubMedCentral if we could reproduce the bibliographic data they share through their OAI-PMH service, they said “Generally, No”*

(* My paraphrase: they had non-transferable licenses and contracts, yada yada. Their 'OA subset' of 1876 journals is available, however, mainly BMC.)

Page 23: Mashspa

From OAI-PMH specification:

* Data Providers administer systems that support the OAI-PMH as a means of exposing metadata; and

* Service Providers use metadata harvested via the OAI-PMH as a basis for building value-added

services.

http://www.openarchives.org/OAI/openarchivesprotocol.html
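To make the data provider / service provider split concrete, here is a minimal sketch of the service-provider side: building a ListRecords request and pulling titles out of the response. The verb and argument names come from the OAI-PMH specification; the endpoint URL, the helper names and the sample response are invented for illustration, and nothing is fetched over the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint: substitute the base URL of a real data provider.
BASE_URL = "http://repository.example.org/oai"

# XML namespaces used by OAI-PMH responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response body."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Invented sample response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL))
print(extract_titles(SAMPLE))
```

Harvesting being technically this easy is the point: the protocol assumes reuse, whatever the licence terms turn out to say.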

Page 24: Mashspa

“… Service Providers use metadata harvested via the OAI-PMH as a basis

for building value-added services.”

And the survey said...

Page 25: Mashspa

X

Page 26: Mashspa

Open Bibliographic principles

http://openbiblio.net/2010/10/15/principles-for-open-bibliographic-data/

Page 27: Mashspa

1 - When publishing data make an explicit and robust license statement.

Page 28: Mashspa

2 - Use a recognized waiver or license that is appropriate for metadata.

Page 29: Mashspa

3 - If you want your data to be effectively used and added to

by others it should be open … – in particular non-commercial

and other restrictive clauses should not be used.

Page 30: Mashspa

4 - We strongly recommend explicitly placing bibliographic data in the Public Domain via

PDDL or CC0.
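One way to read "explicit" in practice is that every record you publish carries its own licence statement, so it survives being copied out of context. A rough sketch, assuming a plain dictionary record; the field name "license" and the helper are illustrative choices, not part of the principles or of any standard.

```python
import json

# The CC0 public domain dedication.
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def with_licence(record, licence_url=CC0):
    """Return a copy of the record carrying an explicit licence statement."""
    stamped = dict(record)
    stamped["license"] = licence_url  # hypothetical field name
    return stamped


record = {"title": "An example article", "year": 2010}
print(json.dumps(with_licence(record), indent=2))
```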

Page 31: Mashspa

5 – We strongly urge creators of bibliographic metadata

explicitly either dedicate this to the public domain or use an

open licence.

Page 32: Mashspa

Bibliographic Sliding Scale

Identify: Title, Date, Any identifiers, Publisher, Container (e.g. Journal), Author names, etc.

Discover: Keywords, Abstract, Author Identifiers, etc.

Serendipity: Citations, citing text, Usage data, supplemental data, etc.

Page 33: Mashspa

Identify

Discover

Serendipity

Increasing Investment

BUT

Increased Chance of usage

Bibliographic Sliding Scale

Page 34: Mashspa

“So, we just pick a standard and publish and we'll reap all the

benefits, right?”

Page 35: Mashspa

Erm, no.

For three main reasons.

Page 36: Mashspa

#1 “Where there is human input, there is interpretation”

Meanings of words and usage of fields have changed

over time.

Page 37: Mashspa

#1 (cont.) Interchange standards don't make the

information any more understandable.

Someone interprets them.

Page 38: Mashspa

#2 Data has been entered and curated without large-scale sharing as a focus.

Lots of implicit, contextual info left out.

Page 39: Mashspa

#3 Data quality is typically poor with formally closed

datasets.

Page 40: Mashspa
Page 41: Mashspa

For #1 - Collisions caused by interpretation can really only be

solved by sharing data and seeing how bad things are.

Page 42: Mashspa

Standards and interoperability:

“The first follower transforms a lone nut into a leader” -

Derek Sivers' TED Talk

http://www.ted.com/talks/lang/eng/derek_sivers_how_to_start_a_movement.html

Page 43: Mashspa

Video: http://www.youtube.com/watch?v=GA8z7f7a2Pk

The man dancing is joined by one or two, but he is still doing his own thing.

Eventually a group decides to join him, and the group grows.

The quality of the dance isn't important, but the community dancing along with it is.

And so it is with standards.

Page 44: Mashspa

For #2 (implicit info), provenance and the source of data gives us

crucial clues.

Due to #1, I remain unconvinced that this information can ever be

totally machine-readable.

Page 45: Mashspa

And for #3, misleading or incorrect data...

… um.

No easy answers – we just don't have the info.

Page 46: Mashspa

The data clean-up process is going to be probabilistic.

(We cannot be sure, by definition, that we are 'accurate' when we de-duplicate or disambiguate.)

Page 47: Mashspa

Typical methods then:

Natural Language Processing,

Machine learning techniques

and

String Metrics and old skool record deduplication

Page 48: Mashspa

I <3 String Metrics and old skool record deduplication

(out of the 3)

Page 49: Mashspa
Page 50: Mashspa

http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html

http://is.gd/gqOjQ
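As a taste of what a string metric does, here is a small sketch of one of the classics, Levenshtein edit distance, plus a normalised similarity score built on it. The normalisation (lower-casing, stripping, dividing by the longer length) is my own assumption about what is useful for titles and names, not something taken from the pages linked above.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or exact match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score from 0.0 (nothing shared) to 1.0 (identical after normalising)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("Open Bibliography", "open bibliografy"))
```

A score like this only says two strings look alike; whether they describe the same work is still a judgement call.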

Page 51: Mashspa

Old skool record linkage:

“Fellegi-Sunter” - probabilistic record linkage (PRL).

It's not a great model, but it's achievable.

Machine-learning requires a reasonably large golden set.

(http://en.wikipedia.org/wiki/Record_linkage)
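The core of the Fellegi-Sunter model fits in a few lines. Each field gets an m probability (the fields agree given the records really match) and a u probability (they agree by chance on non-matching records), and the per-field log ratios are summed into a match weight. The sketch below assumes made-up m and u values, made-up thresholds and exact-equality comparison; in real use these would be estimated from data and the comparison would likely be a string metric.

```python
import math

# Hypothetical (m, u) probabilities per field, for illustration only.
FIELD_PROBS = {
    "title": (0.95, 0.01),
    "year": (0.90, 0.05),
    "first_author": (0.85, 0.02),
}


def field_weight(agrees, m, u):
    """Log-likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum the field weights; higher means more likely the same entity."""
    total = 0.0
    for field, (m, u) in probs.items():
        agrees = rec_a.get(field) == rec_b.get(field)  # exact match only
        total += field_weight(agrees, m, u)
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Three-way decision; the thresholds here are arbitrary assumptions."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"


a = {"title": "open bibliography", "year": 2010, "first_author": "o'steen"}
b = {"title": "open bibliography", "year": 2011, "first_author": "o'steen"}
print(classify(match_weight(a, b)))
```

The middle "review" band is where the model admits it doesn't know, which is the honest part.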

Page 52: Mashspa

PRL is not great in itself, BUT

It does lend itself to Map-Reduce style operations

And

It's a great way to filter down to those records that really do need to be compared by eye.
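A sketch of why this suits Map-Reduce: the map step assigns each record a cheap "blocking" key, and the reduce step only compares records that share a key, so each block can be processed independently. The blocking key used here (year plus first letter of the title) and the record fields are assumptions for illustration; a poor key will hide true duplicates in different blocks.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Map step: a hypothetical key, publication year plus first title letter."""
    return (record["year"], record["title"].strip().lower()[:1])


def map_to_blocks(records):
    """Group records by blocking key (the 'shuffle' between map and reduce)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def candidate_pairs(blocks):
    """Reduce step: only records sharing a block are ever compared."""
    for members in blocks.values():
        for rec_a, rec_b in combinations(members, 2):
            yield rec_a["id"], rec_b["id"]


# Invented records for the example.
records = [
    {"id": "a", "title": "Open Bibliography", "year": 2010},
    {"id": "b", "title": "Open bibliografy", "year": 2010},
    {"id": "c", "title": "Record linkage", "year": 2010},
    {"id": "d", "title": "Open Bibliography", "year": 1996},
]

print(list(candidate_pairs(map_to_blocks(records))))
```

Four records would need six comparisons all-against-all; blocking leaves one pair to score, and only the uncertain scores need a human.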

Page 53: Mashspa

http://datamining.anu.edu.au/projects/linkage.html

“Record or data linkage techniques are used to link together records which relate to the same entity (e.g. patient, customer, household) in one or more data sets where a unique identifier for each entity is not available in all or any of the data sets to be linked.”

ANU's Febrl Python code

Page 54: Mashspa

So far, much effort has been directed at the Works;

We need to put much more effort into their

Networks.

Bibliographic directions

Page 55: Mashspa

Networks?

Page 56: Mashspa

Networks?

● A cites B

Page 57: Mashspa

Networks?

● A cites B

● Works by a given (identified) Author

● Works cited by a given Author

● Works citing articles that have since been disproved, redacted or withdrawn.

● Co-authors

● And many more connections we've not even considered yet ('betweenness', 'centrality', etc.)
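Several of these connections fall out of very little code once the links are open. A rough sketch over an invented toy dataset: who cites what, a simple in-degree centrality (the share of other works citing a given work), works citing something withdrawn, and co-author sets. All names and data here are made up, and in-degree is only the simplest of the centrality measures.

```python
from collections import defaultdict

# Invented toy data: work -> works it cites, work -> authors.
CITES = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C", "E"], "E": []}
AUTHORS = {
    "A": ["Smith", "Jones"],
    "B": ["Jones"],
    "C": ["Patel"],
    "D": ["Smith", "Patel"],
    "E": ["Lee"],
}
WITHDRAWN = {"E"}


def cited_by(cites):
    """Invert 'A cites B' into 'B is cited by A'."""
    incoming = defaultdict(set)
    for work, refs in cites.items():
        for ref in refs:
            incoming[ref].add(work)
    return incoming


def in_degree_centrality(cites):
    """Fraction of the other works in the set that cite each work."""
    incoming = cited_by(cites)
    others = len(cites) - 1
    return {work: len(incoming[work]) / others for work in cites}


def citing_withdrawn(cites, withdrawn):
    """Works whose reference list includes a withdrawn work."""
    return sorted(w for w, refs in cites.items() if withdrawn.intersection(refs))


def coauthors(authors):
    """Map each author to everyone they share a work with."""
    network = defaultdict(set)
    for names in authors.values():
        for name in names:
            network[name].update(n for n in names if n != name)
    return network


print(in_degree_centrality(CITES))
print(citing_withdrawn(CITES, WITHDRAWN))
print(dict(coauthors(AUTHORS)))
```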

Page 58: Mashspa

In Summary,

● Accessible Bibliography as Advertising.

● Bibliography authors choose how they wish to invest to gain usage and real impact.

● Closed data has a much slimmer chance of increasing in quality

● Open data makes it easier to find problems and to improve the data

● Benefits will come from developing networks of information

● Don't get hung up on standards! A lone nut with followers doing something copyable is enough!