lodlam presentation v1.0 final al20151104

Discoverability and the Web

Getting PROV ready for the semantic web

What do I hope you'll take away from this presentation?

•The web is moving from a web of documents to a web of data

•Making web content machine readable is important for discoverability

•We can also use APIs, web mark-up, and LOD to make our web resources more discoverable and reusable

•This isn't a fad or a fantasy: it's happening all over the globe right now, and we can be part of it, quickly and at low cost, if we want to

Before we get into the heavy stuff

Some Big Bang Theory c/o Google: https://www.youtube.com/watch?v=mmQl6VGvX-c

Who needs all this data and who works with it?

•Researchers who want data that is as fine-grained as possible on a given topic, i.e. anyone who isn’t satisfied with just a web page of interpretation (a document) about something (e.g. https://en.wikipedia.org/wiki/Vida_Goldstein ) but would rather supplement it with the granular details about that thing or person and browse to related data across the web (e.g. http://dbpedia.org/page/Vida_Goldstein )

•Any organisation that wants to make its web resources as discoverable and usable as possible, e.g. the BBC, the Smithsonian, the Getty, UK National Archives, Digital NZ, National Archives of Korea, SLNSW, TROVE, SRNSW, Auckland Museum. For more, check out http://bit.ly/1OGYZYJ

•Anyone who wants to help annotate content on the web for the social good. Think of TROVE or our own WIKI in which 92 tags have been used 457,750 times across more than 50,000 pages! Here’s just one example http://wiki.prov.vic.gov.au/index.php/Property:Has_keywords

•Software developers who want to build new applications out of this data to make it more accessible and engaging. (We’ll look at some real life examples in just a moment).

•Anyone who wants to ask, or allow to be asked, sophisticated questions like “Show me all 20th Century painters who were born near Timaru”, “Who were Colin McCahon's contemporaries and let me see a chronology of their major paintings”, “Show me all the Works in Harvard Library by Swedish Nobel Prize winners”, “How many people died from tuberculosis in Victoria from 1840-1940?”, “List all Parish Plans showing allotments purchased by person X from 1900-1915 for up to 300 pounds only”.

Imagine...

Imagine... a researcher in 10 years’ time who wants to use research data about the Eureka Stockade from the life sciences, the humanities and the decorative arts to examine the consequences of the event for Victoria’s economy, environment and art trends from 1854 to 1870. Imagine they have access not only to a range of documents but also to statistical and other data from a range of institutions that allows them to carry this out.

Is this just a fantasy driven by a select few for a niche audience?

In a word...NO. Next slide please...

2014 Linked Open Data Cloud

http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/#toc2

1014 organisations

About 183 gov’t

The Semantic Web

From a web of documents to a web of data http://dbpedia.org/page/Jerilderie_Letter

PART 1: EXPLAINING SOME KEY CONCEPTS OF THE SEMANTIC WEB

Linked Open Data refers to the way in which we have moved from linking web pages and documents over the web to linking the data within those pages and documents to related data and documents.

What? Here’s a page in Wikipedia about the Jerilderie Letter https://en.wikipedia.org/wiki/Jerilderie_Letter and here’s the Wikipedia data behind that page/topic with links to related data http://dbpedia.org/page/Jerilderie_Letter

LODLAM is the acronym for this Linked Open Data process within Libraries, Archives and Museums.

Let’s start with an example to see how this happens...


A real life example!

http://lodlive.it/LodLive-wiki.prov.vic.gov.au/app_en.html?http://wiki.prov.vic.gov.au/index.php/Special:URIResolver/Jerilderie_Letter

This graph shows all the metadata contained within the PROV wiki page on the famous Jerilderie Letter http://wiki.prov.vic.gov.au/index.php/Jerilderie_Letter

The LodLive application was written by a developer based in Italy whom I met at the LODLAM 2015 Conference in Sydney recently. He wrote it to help humans browse the linked open data universe in a visual way. The beauty of it is that you can literally follow your nose through all the connections between resources on the internet via their metadata. Or, as in this example, you can just explore the metadata within the wiki page itself. So this is the machine readable view of the wiki page, which raises the question...



What's the value in machine readable metadata?

It simply means that as developers come up with new presentation environments for our content we will be ready to make it accessible to them in a form they can actually use!

Here's a human readable page of the PROV wiki relating to the copy of the famous Ned Kelly Jerilderie Letter we have in our collection http://wiki.prov.vic.gov.au/index.php/Jerilderie_Letter

Okay so how do we create machine readable metadata?

Well, we don’t need to. The beauty of the platform that the PROV wiki is built on (Semantic MediaWiki) is that it automatically creates it for us for every page that carries metadata. The metadata is expressed in RDF, a standard data model for machines to read. It’s the lingua franca of the semantic web, and fortunately a lot of smart people have developed software to transform other data types and models into RDF.

Because our wiki makes all of its contents machine readable in RDF, we can also offer developers a machine readable version of every wiki page to consume in whatever applications they build for browsing the semantic web.
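To make the idea of RDF a little more concrete, here is a minimal sketch in Python of what "metadata as RDF" boils down to: a list of subject, predicate, object statements ("triples") that can be written out in the plain-text N-Triples format. The namespaces, the Related_to property and the keyword values are assumptions made up for illustration; they are not the actual output of the PROV wiki.

```python
# Minimal sketch: RDF statements as (subject, predicate, object) triples.
# The namespaces and values below are hypothetical, for illustration only.
WIKI = "http://wiki.example.org/resource/"   # assumed resource namespace
PROP = "http://wiki.example.org/property/"   # assumed property namespace

triples = [
    (WIKI + "Jerilderie_Letter", PROP + "Has_keywords", "Ned Kelly"),
    (WIKI + "Jerilderie_Letter", PROP + "Has_keywords", "bushrangers"),
    (WIKI + "Jerilderie_Letter", PROP + "Related_to", WIKI + "Ned_Kelly"),
]


def to_ntriples(statements):
    """Serialise triples as N-Triples: URIs in <>, plain literals in quotes."""
    lines = []
    for subject, predicate, obj in statements:
        if obj.startswith(("http://", "https://")):
            rendered = f"<{obj}>"          # the object is another resource
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'      # the object is a literal value
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```

The third statement is the interesting one: because its object is a URI rather than a string, software can follow it to another resource, which is what makes the data "linked".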


Does that mean we’re reliant on the Wiki?

No, we can actually turn all of PROV’s Function, Agency and Series metadata into Linked Open Data because we have something very magical called an API!

An AP What?

With the help of the developer behind http://metadata.prov.vic.gov.au/provisualizer we can use the PROV API developed by Kaz and David Fowler ( http://metadata.prov.vic.gov.au/oai/query?verb=ListSets ) to gather A1 metadata consisting of 139 Functions, 2579 Agencies and 15212 Series and turn that into Linked Open Data. It will be inexpensive and fast, and it will take our ACM into the Semantic Web, just as we have already done with the PROV wiki http://wiki.prov.vic.gov.au/rdf/Public_Record_Office_Victoria_Semantic_Wiki.rdf

When we make Item level data accessible through our API we’ll be able to create Linked Open Data for it as well, associating it with the archives ontology we deem most appropriate.
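As a rough illustration of the "API to Linked Open Data" step, the sketch below parses a response in the shape of a standard OAI-PMH ListSets reply (the URL above follows the OAI-PMH request pattern) and turns each set into a labelled resource. The sample XML, the "function:/agency:/series:" setSpec convention and the data.example.org base URI are all assumptions for illustration; the real PROV API output may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE_URI = "http://data.example.org/"  # hypothetical base for minted URIs

# Hand-written sample shaped like an OAI-PMH ListSets response.
# The setSpec values are assumed, not copied from the PROV API.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>function:VF1</setSpec><setName>Example function</setName></set>
    <set><setSpec>agency:VA1</setSpec><setName>Example agency</setName></set>
    <set><setSpec>series:VPRS1</setSpec><setName>Example series</setName></set>
    <set><setSpec>series:VPRS2</setSpec><setName>Another series</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for node in root.findall("./oai:ListSets/oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        pairs.append((spec, name))
    return pairs


def count_by_kind(pairs):
    """Count sets by the prefix before the colon (function, agency, series)."""
    return Counter(spec.partition(":")[0] for spec, _ in pairs)


def sets_to_ntriples(pairs):
    """Mint a URI per set and attach its name as an rdfs:label."""
    lines = []
    for spec, name in pairs:
        kind, _, identifier = spec.partition(":")
        label = name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{BASE_URI}{kind}/{identifier}> <{RDFS_LABEL}> "{label}" .')
    return lines


sets = parse_sets(SAMPLE_RESPONSE)
print(dict(count_by_kind(sets)))
print("\n".join(sets_to_ntriples(sets)))
```

In practice the XML would be fetched from the API rather than pasted in, and the choice of ontology for Functions, Agencies and Series would replace the single rdfs:label used here.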


We’re not the first archive to do or think about this!

http://www.archivesnext.com/?p=3450

•Archives Hub (UK): The Archives Hub provides a gateway to thousands of the UK’s richest archives, representing over 220 institutions across the country. (http://archiveshub.ac.uk/introduction/ )

•Linked Jazz (Pratt Institute): a research project investigating the application of Linked Open Data (LOD) technologies to digital cultural heritage materials. (https://linkedjazz.org/about-the-project/ )

•SNAC (University of Virginia): an aggregate of biographical information about people, both individuals and groups, who created or are documented in historical resources. Users can search for names of individual people, organizations, and families; browse featured descriptions; and discover and locate connected historical resources. Search results can be filtered by occupation and subject. (http://socialarchive.iath.virginia.edu/snac/search )

•Conal Tuohy (Brisbane-based independent software developer): “I’ve spent a bit of time just recently poking at the new Web API of Museum Victoria Collections, and making a Linked Open Data service based on their API. I’m writing this up as an example of one way - a relatively easy way - to publish Linked Data off the back of some existing API. I hope that some other libraries, archives, and museums with their own API will adopt this approach and start publishing their data in a standard Linked Data style, so it can be linked up with the wider web of data.” (http://conaltuohy.com/blog/lod-from-custom-web-api/ ). Here is an example of the Linked Open Data he created for 1 item from the MV API http://bit.ly/1Zjge5P and these are the item details from the MV website http://collections.museumvictoria.com.au/items/1411018 It is just one of 93,817 man made objects they have in their collection http://collections.museumvictoria.com.au/ all accessible through their API http://collections.museumvictoria.com.au/api

See more applications pertaining to documentary heritage here http://summit2015.lodlam.net/
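To give a feel for how lightweight "Linked Data off the back of an existing API" can be, here is a minimal Python sketch that wraps a JSON record in a JSON-LD context. It is not the method used in the blog post quoted above; it is one simple alternative, and the record, its field names and the lod.example.org base URI are all hypothetical.

```python
import json

# Hypothetical record, as a collection API might return it.
api_record = {
    "id": "items/0000",
    "title": "Example object",
    "description": "An example collection item.",
}

BASE_URI = "http://lod.example.org/"  # assumed base for published URIs

# Map the API's own field names onto Dublin Core terms.
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description",
}


def record_to_jsonld(record, base_uri=BASE_URI, context=CONTEXT):
    """Give the record a global identifier and a context for its fields."""
    doc = {"@context": dict(context), "@id": base_uri + record["id"]}
    for field, value in record.items():
        if field in context:   # only keep fields we know how to describe
            doc[field] = value
    return doc


jsonld = record_to_jsonld(api_record)
print(json.dumps(jsonld, indent=2))
```

The record itself barely changes; the "@id" gives it a web address and the "@context" tells software which shared vocabulary each field belongs to.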


Time to Re-Cap and Breathe

Way back in our first example of the Jerilderie Letter you'll notice that we serve up some really useful metadata, including image URLs, georeferencing data etc., that can all be consumed by software (i.e. ‘intelligent agents’, as first described by Sir Tim Berners-Lee). http://lodlive.it/LodLive-wiki.prov.vic.gov.au/app_en.html?http://wiki.prov.vic.gov.au/index.php/Special:URIResolver/Jerilderie_Letter

So, increased access to and awareness of our cultural collection by other cultural collections, along with links to significant datasets such as DBpedia (the semantic database version of Wikipedia), brings increased item counts and usage (metrics that feed directly into BP3 stats) and ultimately continued funding for the very important work we all do in storing, preserving and making accessible the State archives to the people of Victoria. However, that’s a very inward looking view.

The flip side is something Richard Lehane from SRNSW touched on in an October 2011 blog post on APIs: making their search tool open and accessible to developers via an API means they can draw on the work of others and inform their own choices around mobile application development etc. That’s probably worth a talk by itself for another time. http://data.records.nsw.gov.au/?p=248




More breathing space

LODLAM is all about promoting the free and open use of collection metadata between cultural institutions around the world in a way that software can parse and use in various applications that generate increased value for the user and the organisation in terms of access, usage and interoperability. It’s still in its early stages but has made significant progress in just a few short years. Tim Berners-Lee’s vision for a web of data as opposed to a web of documents is not impossible to imagine as really big organisations across the globe get on board: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/#toc2

So how might this work for PROV? Well, imagine a situation where a researcher has access to related content across all the archives in Australia or Australasia simply because that content has been annotated with metadata in a shared language (i.e. RDF), which means software can parse it and make the necessary connections for search engines to deliver rich results to complex questions. Not only can a researcher explore related material within a single archival collection, they can broaden this out to multiple collections. And what if it were then possible to bring in related content from Libraries, Galleries and Museums as well, all the time filtering out the irrelevant material you don’t want to see?

This is the vision of the LODLAM community, which brings me to Part 2 of this presentation. Don’t worry, it’s going to be brief compared to Part 1.

PART 2: LODLAM SYDNEY 29-30 JUNE 2015

http://summit2015.lodlam.net/about/

100 people from around the world meet up for 2 days every year to try and work out how best to make LODLAM work. It’s a heady mix of digital humanists, developers, data wranglers, curators, geeks and people with an interest, such as me! I first attended a LODLAM Conference in San Francisco in 2011, funded by the Internet Archive and the Sloan philanthropic foundation (all you had to do was apply). This year it was held in Sydney and PROV very kindly paid my registration fee of $US100.00.

It’s run according to the unconference format “where attendees propose sessions on the first day, start those sessions then on the 2nd day the same thing happens with a degree of socialising, tweeting etc around the discussions. There are no keynote speakers. The meeting is based on the two primary principles of passion and responsibility: passion to jump in and play an active role; and responsibility to lead, and follow through with action. No papers will be submitted or read, no plenaries given, and everyone will participate.” ( https://en.wikipedia.org/wiki/Open_Space_Technology )


So what did I do?

I tried to get to as many sessions as possible, but it was hard because there was so much on offer! https://docs.google.com/spreadsheets/d/19mfLBoztvaaaik20-P2syANn2fjURzokE-xILLMOVQ0/pubhtml

I particularly enjoyed:

•A pre-conference presentation by Rachel Frick, Digital Public Library of America g:\prov\access management\projects\lodlam\fricksydney.pdf

•So you've got a collection API, now what? merged with How to add LOD publication functions to existing collection management systems. Lightweight, plug-in approaches

•LODlive graph browser. Diego Valerio Camarda

•archive.schema.org. Richard Wallis

I’ll try and give you a brief overview of what I learned:

What is the DPLA?

“The Digital Public Library of America (DPLA) is an all-digital library that aggregates metadata (information describing an item) and thumbnails for millions of photographs, manuscripts, books, sounds, moving images, and more from libraries, archives, and museums around the United States. DPLA brings together the riches of America’s libraries, archives, and museums, and makes them freely available to the world.”

It is very much about creating a portal for developers to use the metadata to build tools: http://dp.la/info/developers/

The DPLA uses a number of hubs that reach out to content partners. These hubs facilitate content migration, providing guidance and support around any content, rights and technical issues that arise. The DPLA provides a beautiful segue into the role that APIs play in exposing collection metadata to the world and allowing others to use it to build tools useful to the collecting organisation and the researcher community.




What is an API?

Basically, a way into an organisation’s metadata via a programmatic interface. If you want a really great definition check http://data.records.nsw.gov.au/?p=248

PROV has 2 APIs that I know of: the ANDS API and the PROV wiki API. The first one feeds directly into Research Data Australia and uses the metadata schema RIF-CS. The second one is a little more accessible, e.g. http://www.culturevictoria.com/collection-search/ delivering item level content using the OpenSearch protocol first developed by Amazon. While this isn’t LOD, both are a step in the right direction of improving our discoverability.
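For a sense of what "programmatic interface" means in the OpenSearch case: a service publishes a URL template containing placeholders such as {searchTerms}, and client software fills them in to run a search. The sketch below shows that substitution step in Python. The template URL is hypothetical (it is not the Culture Victoria or PROV endpoint); only the placeholder syntax, including the trailing "?" for optional parameters, comes from the OpenSearch specification.

```python
import re
from urllib.parse import quote

# Hypothetical OpenSearch URL template; "?" marks an optional parameter.
TEMPLATE = "http://search.example.org/api?q={searchTerms}&page={startPage?}&n={count?}"


def fill_template(template, **params):
    """Substitute OpenSearch-style {name} and {name?} placeholders."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote(str(params[name]), safe="")  # URL-encode the value
        if optional:
            return ""            # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{(\w+)(\?)?\}", substitute, template)


print(fill_template(TEMPLATE, searchTerms="Jerilderie Letter", count=10))
```

The point for discoverability is that any developer's software can construct these URLs without a human ever visiting the search page.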





What is mark-up?

Schema.org is an initiative launched on 2 June 2011 by Bing, Google and Yahoo! (the operators of the then world's largest search engines) to create and support a common set of schemas for structured data mark-up on web pages. At LODLAM in Sydney, Richard Wallis proposed the creation of a working group to develop an extension to Schema.org to encompass mark-up of web pages relating directly to archives. An initial model of this has recently been created and, as I understand it, the NAA will be marking up their pages in the near future. Zoe D’Arcy from the NAA will keep me informed about their experience of doing this.

Why bother?

Search engines can deliver richer, more relevant results if they can ‘see’ the context behind web pages, e.g. that a mention of Public Record Office Victoria on our website refers to an archive as described by the Schema.org extension the working group is developing, as opposed to a string of characters that could be the name of a rock band or all manner of things!

https://en.wikipedia.org/wiki/Schema.org#cite_note-1

What might the extension look like?

This diagram shows the basic relationship between the proposed main archive-specific types plus relevant Schema types in the model.

And how might a web page be ‘marked up’?

@prefix schema: <http://schema.org/> .

# An Archive (Organization)
<http://archive.example.com>
    a schema:Archive ;
    schema:name "The Example Archive" ;
    schema:address "The Old Archive, City Square, Anytown" ;
    schema:email "[email protected]" ;
    schema:owns [
        a schema:OwnershipInfo ;
        schema:ownedFrom "1957" ;
        schema:typeOfGood <http://archive.example.com/boolarchive> ;
        schema:ownershipType schema:HasCustodyOwnership
    ] .

# An ArchiveCollection
<http://archive.example.com/boolarchive>
    a schema:ArchiveCollection ;
    schema:name "The Boolean Papers Collection" ;
    schema:creator "Sir Binary Boolean" ;
    schema:accessAndUse "Public view, in archive location, no image reproductions" ;
    schema:itemLocation <http://archive.example.com> .
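The example above is written in Turtle. On an actual web page the same kind of statement is commonly embedded as JSON-LD inside a script element, which search engines can read. The Python sketch below generates such a snippet for the archive in the example. The Archive type comes from the proposed extension quoted above and may differ from whatever Schema.org finally adopts; the helper name as_script_tag is simply an illustration.

```python
import json

# Mirrors the Turtle example; "Archive" is the proposed (draft) type.
archive = {
    "@context": "http://schema.org/",
    "@id": "http://archive.example.com",
    "@type": "Archive",
    "name": "The Example Archive",
    "address": "The Old Archive, City Square, Anytown",
}


def as_script_tag(data):
    """Wrap a JSON-LD document in the script element used for page mark-up."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so the JSON can never close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(as_script_tag(archive))
```

The resulting block would be pasted (or templated) into the page's HTML; nothing visible on the page changes for human readers.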

Conclusions: Yes it’s the end!

•We all want to make the archives as discoverable as possible

•As long as we’re on the net we might as well be on it well (clumsy I know but you get the gist)

•There are many pieces to the puzzle... APIs, Linked Open Data, non-proprietary software, marking up web pages for search engines e.g. Schema.org

•We have the ability to become highly discoverable right now at low cost and in a way that is scalable.

•What will the Access the Collection of the future look like?

•All it will take is the ability to join the dots. Many others around the world have already done this so we’re not alone. We are lucky to have some brilliant minds with exceptional skills in our own back yard so let’s use them.

2014 Linked Open Data Cloud

http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/#toc2

1014 organisations

What do I hope you’ve taken away from this presentation?

•The web is moving from a web of documents to a web of data

•Making web content machine readable is important for discoverability

•We can also use APIs, web mark-up, and LOD to make our web resources more discoverable and reusable

•This isn't a fad or a fantasy: it's happening all over the globe right now, and we can be part of it, quickly and at low cost, if we want to

