Page 1: Building a National Collection of the Historical UK Web for scholarly use

Building a National Collection of the Historical UK Web for scholarly use

Helen Hockx-Yu

Head of Web Archiving, British Library

IIPC General Assembly, Paris, May 2014

Page 2

Scholarly interaction with web archives (1)

Archive-driven

– Initiated by archival institutions

– Aimed at understanding scholarly requirements and improving archival practice

Scholar-driven

– Initiated by scholars with research interests related to web archiving or archived web material, including many “unknown” scholars

– A number of active research groups emerging: Netlab, WebArt and DMI, IHR, OII, ODU…

– Attention from the Web Science community

Project-based

– Varying scale, scope and funding sources

– Developing web archiving or discipline-specific solutions

– Researchers and archiving institutions work as partners

Page 3

Scholarly interaction with web archives (2)

• Phase 1: Building collections

– Scholars’ involvement in scoping collections, selecting and describing websites relevant to their research interests

– Creation of specific (narrow) topical collections, e.g. “Religion, politics and law since 2005” in the UK Web Archive

• Phase 2: Formulating research questions

– Brainstorming sessions, workshops etc.

– Shift of focus to web archives in their entirety

– Lack of awareness and baseline knowledge

– Time- and resource-consuming

– Challenging: you don’t know what you don’t know

Page 4

Scholarly interaction: the “go-to” state

Independent use of web archives

Meet common scholarly requirements, support scholarly workflow

Base-line knowledge is self-explanatory, e.g. scope of the archive, its coverage and lacunae, how it was collected, and how a particular website was crawled

Clear interfaces and jargon-free descriptions in alignment with scholarly requirements

Open access

– Including provision of downloadable derived or secondary datasets, e.g. http://data.webarchive.org.uk/opendata/

Publication of work citing web archives

Page 5

Selective archiving since 2003

• Permission-based

• Open UK Web Archive http://www.webarchive.org.uk/ukwa/

• ~14,000 websites, ~64,000 instances

• URL and full-text search

• Curated collections

• Many websites no longer available on the live web

Page 6

6th April 2013…

• Legal Deposit Libraries (Non-Print Works) Regulations 2013

• Extension of existing legal framework

• Systematic collection of UK’s published output for heritage & preservation

• By 6 UK Legal Deposit Libraries

Page 7

JISC UK Web Domain dataset (1996-2013)

• Collaboration between the Internet Archive (IA), the Joint Information Systems Committee (JISC) and the British Library

• Extracted copies of UK websites from the Internet Archive’s collection

– 1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs

– 2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated)

• Research agreement between JISC and IA, upholding IA’s Terms of Use

– Access via IA’s Wayback Machine

– Allows replication / extraction of derivative or secondary datasets

• BL hosts the dataset on behalf of JISC

Page 8

Completed work

• Analytical Access to the Domain Dark Archive Project

– Use cases & experimental UI

• Demonstrating the Value of the UK Web Domain Dataset for Social Science Research

– Analysis of link graph

– Paper accepted for WebSci’14: Mapping the UK Webspace: Fifteen Years of British Universities on the Web

• MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher: Creating and Analysing

• Secondary datasets under open licence

– Format profile, Geoindex, Host Link Graph
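As a rough illustration of how a secondary dataset such as the Host Link Graph might be explored, the sketch below sums inbound and outbound link counts per year for one host. The pipe-delimited layout (year|source host|target host|count), the function name links_by_year and the sample rows are all assumptions made for this example, not the documented format or contents of the published dataset.

```python
from collections import defaultdict


def links_by_year(lines, host):
    """Sum inbound and outbound link counts per year for one host.

    Assumes (hypothetically) that each line has the form
    'year|source_host|target_host|count'.
    """
    totals = defaultdict(lambda: {"in": 0, "out": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        year, source, target, count = line.split("|")
        n = int(count)
        if source == host:
            totals[int(year)]["out"] += n
        if target == host:
            totals[int(year)]["in"] += n
    # Return plain dicts, ordered by year
    return {year: dict(counts) for year, counts in sorted(totals.items())}


# Invented sample rows, for illustration only
sample = [
    "1996|bl.uk|ox.ac.uk|3",
    "1996|ox.ac.uk|bl.uk|5",
    "1997|bl.uk|cam.ac.uk|2",
    "1997|bbc.co.uk|ox.ac.uk|7",
]
result = links_by_year(sample, "bl.uk")
print(result)
```

Per-year totals of this kind could feed a simple chart of links to and from a single host over time; the real files would need to be streamed rather than held in a list.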

Page 9

Exploring Host Link Graph

Courtesy of Peter Webster, Rainer Simon and Jules Mataly

Page 10

Visualising links (to and from bl.uk)

Interactive version

How it is done

Page 11

Visualising links (to and from bl.uk)

Interactive version

How it is done

Page 12

Big UK Domain Data for Arts and Humanities

• Funded by the UK Arts and Humanities Research Council as one of the 21 “Big Data” projects

• Collaboration between the Institute of Historical Research, Oxford Internet Institute, British Library and Aarhus University

• Develop theoretical and methodological framework for the study of web archives

• Build on AADDA: researchers and the BL co-produce access tools

• A major study of the history of UK web space from 1996 to 2013 + sub-projects covering a range of disciplines

• Also an online training course and peer-reviewed journal articles.

Page 13

New projects and initiatives

• ALEXANDRIA: Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives

– 5-year project funded by the European Research Council

– Develop new models and algorithms for retrieval, exploration, and analytics of web archives

– Collaborate on common issues, e.g. publication dates versus crawl dates

• RESAW, a Research Infrastructure for the Study of Archived Web Materials

– Currently a coordinated, self-organising, and self-financing open network

– Preparing application for EU’s Horizon 2020 framework 

Page 14

Benefits

• Helps researchers understand the value of web archives and explore new ways of using these for scholarly research

• Allows BL to obtain hands-on experience with indexing and processing large-scale web archive datasets

• Analytics and visualisations can be applied to our own Legal Deposit collection

• Acts as test-bed for research and development projects

• Enables BL to participate in various UK, European and international projects

• Helps curators understand characteristics of large-scale digital corpora

• Improves the way we collect and store web archives

Page 15

Some Issues

Ownership

Data quality

- Different formats: ARC and WARC

- Partially de-duplicated

Context

- No crawl log or information on data caps applied at crawl time

- No detailed information on extraction mechanism

More general issues related to analytical access

- Scepticism or suspicion about hidden algorithms behind analysis

- Biases in data and how data collection decisions lead to variances in outputs

- Need to manage expectations: analyses and visualisations as finished products versus first steps

- Ethical and privacy issues
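To make the "partially de-duplicated" data quality point above more concrete, here is a minimal sketch of one common de-duplication approach: a capture is kept only when its content digest differs from the previous kept capture of the same URL. The record layout (url, timestamp, digest), the function name deduplicate and the sample values are assumptions for illustration; the slides do not describe how the dataset was actually de-duplicated.

```python
def deduplicate(captures):
    """Drop captures whose digest matches the previous kept capture of the same URL.

    Each capture is a (url, timestamp, digest) tuple; timestamps are
    assumed to be 14-digit strings (YYYYMMDDhhmmss) that sort chronologically.
    """
    last_digest = {}
    kept = []
    # Sort by URL, then by timestamp, so each URL's captures are in time order
    for url, timestamp, digest in sorted(captures, key=lambda c: (c[0], c[1])):
        if last_digest.get(url) != digest:
            kept.append((url, timestamp, digest))
            last_digest[url] = digest
    return kept


# Invented sample captures, for illustration only
captures = [
    ("http://example.org.uk/", "20050101000000", "aaa"),
    ("http://example.org.uk/", "20060101000000", "aaa"),
    ("http://example.org.uk/", "20070101000000", "bbb"),
    ("http://example.org.uk/about", "20060101000000", "aaa"),
]
kept = deduplicate(captures)
print(len(captures), "captures,", len(kept), "kept")
```

In a dataset that is only partly de-duplicated, counts of captures per URL or per year mix unchanged repeats with genuine changes, which is one reason the lack of crawl context matters for analysis.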

Page 16

Thank you!

Questions?

Getting in touch:

Twitter: @ukwebarchive

Email: [email protected]

Web Archive: http://www.webarchive.org.uk