Data Study Group Final Report: The National Archives, UK 9–13 December 2019 Discovering Topics and Trends in the UK Government Web Archive



https://doi.org/10.5281/zenodo.4981184

The copyright and database right in material produced by staff of The National Archives under this collaboration is Crown copyright or Crown database right. Free and flexible re-use of Crown Copyright material is granted under the terms of the Open Government Licence (OGL), and Crown Copyright and database right material is published under the OGL.


Contents

1 Executive summary
  1.1 Challenge overview
  1.2 Data overview
  1.3 Main objectives
  1.4 Approach
  1.5 Main conclusions
  1.6 Limitations
  1.7 Recommendations
2 Quantitative problem formulation
3 Qualitative problem formulation
  3.1 User types
  3.2 Algorithmic transparency and trust
4 User experience
  4.1 User journey
  4.2 User interface
  4.3 User-facing explanations
5 Data overview
  5.1 Dataset description
  5.2 Data quality issues
6 Experiment: entity recognition and disambiguation at scale
  6.1 Task description
  6.2 Experimental set-up
  6.3 Results
  6.4 Reproducing results
  6.5 Conclusions
7 Experiment: document embedding
  7.1 Task description
  7.2 Experimental set-up
  7.3 Results
  7.4 Reproducing results
  7.5 Conclusions
8 Experiment: clustering
  8.1 Task descriptions
  8.2 Experimental set-up
  8.3 Results
  8.4 Conclusions
9 Experiment: interface
  9.1 Task descriptions
  9.2 Experimental set-up
  9.3 Results
  9.4 Conclusions
10 Future work and research avenues
  10.1 Expand corpus beyond HTML formats
  10.2 Alternative datasets for HTML resources
  10.3 Data and code in re-usable, open formats
  10.4 Combine interface with other visualisations
  10.5 Improve approaches for handling duplication
  10.6 Improve integration with knowledge sources
  10.7 Leverage existing search query data
  10.8 Track semantic change
  10.9 Develop customised tools for the communities
  10.10 Transparency
11 Team members
12 Notes
13 Acknowledgements
References

This report has a companion GitHub repository containing example code: https://github.com/alan-turing-institute/DSG-TNA-UKGWA


1 Executive summary

1.1 Challenge overview

The National Archives is the official archive and publisher for the UK government and for England and Wales. It is the guardian of some of the country’s most iconic documents and collections, dating back over 1,000 years and including those published on the web by UK government departments and bodies.

The UK Government Web Archive (UKGWA)1 is a vast resource of government websites and social media content, and an important source of recent national history, spanning 23 years. It contains over five billion resources [or distinct Uniform Resource Locators (URLs)] and is one of the most heavily used web archives in the world, serving hundreds of thousands of page views each month. The National Archives has a remit to preserve government-owned web content in all its forms (including web pages, official publications, datasets, social media such as tweets, and multimedia) and seeks to preserve this part of the record in its original context, wherever possible, through this archival resource. The UKGWA2 is free to use and fully accessible via the web, and includes a full-text search service.3 The size, variety of formats, and complexity of this vast collection make the discoverability of its content challenging for the user and therefore put great pressure on the search service to meet the needs of those users.

The Alan Turing Institute4 is the UK’s national institute for data science and artificial intelligence, with headquarters at the British Library. Data Study Groups are intensive five-day collaborative hackathons hosted at The Alan Turing Institute, which bring together organisations from industry, government, and the third sector with multi-disciplinary researchers from academia. The National Archives was the Data Study Group Challenge Owner; their experts were present during the week and are co-authors of this report. They provided the real-world challenge to be tackled by this group of researchers, led by the Principal Investigator and Facilitator. This report is the culmination of that process.

1. https://www.nationalarchives.gov.uk/webarchive/
2. https://www.nationalarchives.gov.uk/documents/information-management/osp27.pdf
3. https://webarchive.nationalarchives.gov.uk/search/
4. https://www.turing.ac.uk/

The challenge we address in this report is to take steps towards improving search and discovery of resources within this vast archive for future archive users, and to explore how the UKGWA collection could begin to be unlocked for research and experimentation by approaching it as data (i.e. as a dataset at scale). The UKGWA has independently begun to examine the usefulness of modelling the hyperlinked structure of its collection for advanced corpus exploration; the aim of this collaboration is to test algorithms capable of searching for documents via the topics that they cover (e.g. ‘climate change’), envisioning a future convergence of these two research frameworks. This diachronic corpus is ideal for studying the emergence of topics and how they feature through government websites over time, indicating engagement priorities and how these change.

The National Archives last embarked on a project to use natural language processing (NLP) tools on the UKGWA in 2010.5 It produced promising results and highlighted the approach as being useful in addressing some resource discovery challenges in searching unstructured content gathered over a long period of time. Ultimately that project ran into difficulties at the querying stage due to scale and did not deliver a user interface suitable for researchers. However, much was learned by the team involved, and a great deal of progress has been made in the area in the intervening period in terms of software, computational methods, and increased availability of the necessary compute power to deliver the service to users. It was therefore considered timely to revisit the concepts with the ultimate aim of delivering improved access to the archive.

1.2 Data overview

The data consists of plain-text collections derived from HyperText Markup Language (HTML) pages. The HTML pages, drawn from the UKGWA, cover a very broad range of topics, from health to international policy, as the government influences all aspects of society. The datasets were created by selecting annual snapshots taken of these websites on the date closest to 1 January each year from 2006 onwards. In certain cases this covers the entire lifespan of sites. These snapshots were created through a combination of shallow and deep web crawling, and extracting the plain text from the resulting HTML. Each page of each site is stored in an individual text file, organised by site and date of snapshot.

5. https://webarchive.nationalarchives.gov.uk/20101011131758/http://www.nationalarchives.gov.uk/documents/research-enews-june-2010.pdf, page 4
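The site-and-snapshot organisation described above lends itself to a simple indexing pass. The sketch below is illustrative only: it assumes a hypothetical layout of `<site>/<YYYYMMDD>/<page>.txt`, and the actual dataset may be organised differently.

```python
from pathlib import Path

def index_snapshots(root):
    """Index plain-text snapshot files assumed to be laid out as
    <site>/<YYYYMMDD>/<page>.txt (an illustrative layout, not the
    documented one). Returns (site, snapshot_date, path) tuples.
    """
    records = []
    for site_dir in sorted(Path(root).iterdir()):
        if not site_dir.is_dir():
            continue
        for snap_dir in sorted(site_dir.iterdir()):
            if not snap_dir.is_dir():
                continue
            for page in sorted(snap_dir.glob("*.txt")):
                records.append((site_dir.name, snap_dir.name, page))
    return records
```

An index like this makes it cheap to slice the corpus by site or by snapshot year before any NLP processing.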

1.3 Main objectives

The overall objective is to use these curated datasets of reference documents to examine algorithms that are capable of identifying similar documents across the corpus and of inferring the topics they cover. This work will contribute to generating an overview of the contents of the UKGWA, which will be developed for inclusion in user-facing services, enhancing search and further enabling the use of the UKGWA.

The main aims are to give insight into 1) what approaches can be used to assist the understanding of the data within the UKGWA, and 2) which approaches are most viable for improving the resource discovery services offered to users.

1.4 Approach

The organisation of the UK government web estate has changed over time, which has a bearing on both the structure and the content of the archival data. Issues that are particularly challenging in web archive research (e.g. duplication, and changes in the content and technical make-up of the source websites over time) add complexity to the analysis and require contextualising information for researchers who are new to applying data science methods to web archives.

In order to improve the search functionality of the archive, we leveraged three central pillars of current research in information retrieval (see for instance Singhal [2012], Mitra and Craswell [2017]):

• Enriching the corpus with annotations to entities with unambiguous meanings


• Designing a preliminary interface for comparing the relevance of entities and concepts over time

• Providing a preliminary semantic search functionality to the user

To identify all entities mentioned in the corpus we employed spaCy, a named entity recogniser.6 We then grouped mentions of the same entity (e.g. ‘Obama’, ‘Barack Obama’, and ‘President Obama’) using a database of aggregated statistics derived from Wikipedia.7 We then built a prototype interface which would allow users to search the UKGWA both for entities and for single concepts, and to compare their frequency over time.8 Once the user decides to focus on a specific concept or entity in a specific year, the search interface offers three ways of navigating the retrieved documents:

• Semantic search: the most relevant documents (as measured by Doc2Vec [Le and Mikolov, 2014]) are sorted based on their semantic similarity to the user query

• Cluster-based search: documents are grouped in semantic clusters (by applying hierarchical clustering), which may aid the user in meaningful browsing of the collection

• Entity-based relational search: the entities most closely related to the user query (as provided by the spaCy named entity recogniser) are presented, filtering the returned search results
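The mention-grouping step can be illustrated with a small sketch. The alias statistics below are invented for illustration; in the actual pipeline they were derived from aggregated Wikipedia link statistics (see the preprocessing scripts in the companion repository).

```python
from collections import Counter

# Hypothetical aggregated link statistics: for each surface form, how often
# it was linked to each Wikipedia page. These counts are invented for
# illustration; real figures would come from a Wikipedia dump.
LINK_COUNTS = {
    "obama": Counter({"Barack Obama": 940, "Michelle Obama": 60}),
    "barack obama": Counter({"Barack Obama": 1000}),
    "president obama": Counter({"Barack Obama": 870}),
}

def resolve_mention(mention):
    """Map a mention to its most frequently linked Wikipedia page, if any."""
    counts = LINK_COUNTS.get(mention.lower())
    if not counts:
        return None  # surface form never seen as a link: leave unresolved
    return counts.most_common(1)[0][0]

def group_mentions(mentions):
    """Group raw mentions under their resolved canonical entity."""
    groups = {}
    for m in mentions:
        entity = resolve_mention(m)
        if entity is not None:
            groups.setdefault(entity, []).append(m)
    return groups
```

Grouping by the most common link target is a simple "most frequent sense" heuristic; context-aware disambiguation (as in TagMe-style systems) refines it.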

1.5 Main conclusions

This series of experiments demonstrated how discovery and search of web archives can be usefully enhanced beyond what is widely available currently. However, both the discoverability and comprehension of search results may be improved still further by using a combination of different techniques through a novel navigation interface. Our user-centred approach was informed by the experience of The National Archives’ experts, in turn informed through user surveys (while The National Archives do not publish the comments received, some summary data is available).9 They focused the team’s efforts on the evaluation of the potential approaches for improving search and discoverability, and the promotion of transparency, as demanded of trusted institutions. The achievements of the week, and the areas identified for future work, demonstrate a notable change in machine learning capability since The National Archives last conducted experiments in 2010.

6. https://spacy.io/
7. https://github.com/fedenanni/Reimplementing-TagMe; see also the scripts spacy_preprocess.py, tagme_preprocess.py and multithread_calc_entities_spacy.py in the companion repository https://github.com/alan-turing-institute/DSG-TNA-UKGWA
8. See the script demo_interface.py in the companion repository https://github.com/alan-turing-institute/DSG-TNA-UKGWA

The team’s work over the week showed that enriching the collection with named entities is possible and advances the browsing and discovery experience. There was additional promise in adopting topic-based search (semantic search): the Data Study Group showed that it may achieve more accurate results and thereby further improve the user experience. Overall, we showed how a variety of NLP approaches could support The National Archives in offering more targeted access to the collection. Beyond the week’s experimentation with sampled data, the aim would be to employ the suite of tools, techniques and approaches on the full scale of the UKGWA dataset. Equally, these methods are applicable to other web archives, or large textual corpora.

1.6 Limitations

The focus of this Data Study Group was on text-based content. While this is the most significant content in the UKGWA, non-text resources (images, videos and audio) are an increasingly significant part of the government web estate and therefore the UKGWA. Our work noted the negative impact of web pages reusing common content (e.g. navigation) on the results, but it was outside the scope of the week to remedy this.

Considering topic modelling, the work presented here is simplified, and does not account for the semantic change of words over time. Today, a ‘tweet’ is likely to be a piece of social media as much as it is a bird noise. A further limitation is that infrequent, niche or technical terminology may be suppressed by the larger corpus of more general text, hampering discoverability of these documents. The opposite problem also exists: if a user searches using a term that does not occur in the UKGWA data, it is ignored, even if the topic is discussed using different words.

9. https://www.nationalarchives.gov.uk/about/our-role/transparency/our-public-services/

1.7 Recommendations

Given further research and development, there is potential for The National Archives to enhance the existing search and discovery capabilities of the UKGWA. This may be achieved by further development of the existing services, developing bespoke services outside of the UKGWA but using existing APIs, or creating standalone services based on extracts of its data. The National Archives team have a wide variety of data that can be used to impart context and structure on the data, which has the potential to give the user more control and power in navigating the archive’s content and in understanding its structure. Understanding the needs of user communities is critical: not only existing users looking for specific content, but also researchers who wish to explore the UKGWA as data.

Further research is needed to improve and extend the data science methods for enhanced impact and performance. The application of current state-of-the-art pre-trained deep language models [Devlin et al., 2018] could enhance semantic information retrieval and entity linking. Additional information sources, for example a domain knowledge graph built from topical information in Wikipedia (based on how articles link to each other), would add more context and improve the accuracy of search results. This, combined with specific domain data from The National Archives, the application of existing data in structured formats, and making other knowledge sources machine-readable, would offer a transformative service to UKGWA’s users. This could be further enhanced by the integration and extension of systems for fast parallel search over large-scale collections of vectorised semantic documents (see for instance Johnson et al. [2017]).


2 Quantitative problem formulation

Ahead of the Data Study Group week itself, a set of seven research questions was developed to stimulate potential research directions for the research group. Given the time frame and the skill set of the group, some of the questions were addressed more thoroughly (1-3) over the week than others (4, 6, 7), and two questions will be used to shape ideas for future research (4, 5). They were:

1. How can we enrich each document with semantic information relying upon out-of-the-box Natural Language Processing (NLP) tools?

2. What unsupervised machine learning methods can be used to aggregate documents in thematically similar clusters?

3. How can the resulting metadata be used to inform description of the nature of the information they contain and guide the interpretation of categories?

4. What methods can track the emergence and evolution of topics across time?

5. What approaches can be used to differentiate between the functions and aims of government departments as expressed in individual websites?

6. How do we best explain the data science methods used on the UKGWA to its readers and users?

7. Can we develop workflows to aid the interpretation of any machine learning algorithms we develop, and encourage engagement with their strengths and limitations?

3 Qualitative problem formulation

3.1 User types

The purpose of this challenge is to explore algorithms that have the potential to enable further and purposeful use of the UKGWA by end users. To that end, we dedicated time to tentatively scope and describe potential personas and needs of end users, in order to anchor the variety of algorithmic approaches we explore in concrete use cases. The UKGWA is a very large collection that is actively curated by a team of experts, while being freely and fully accessible to its users, making it an extremely valuable resource. However, its size and complexity make it difficult for researchers to use it to its full potential. Although it is quite different and separate from other collections belonging to The National Archives, existing user research into users of The National Archives’ online catalogue Discovery10 brings useful insights on user behaviour.

Evidence collected through (unpublished) interviews and surveys during 2015 indicates that a ‘steep learning curve’ exists for new users; however, once users have familiarised themselves with the system, they can build a mental model of it and use the service successfully in their research efforts.

When we come to large-scale computational analyses of the web there are different levels of desired interaction, depending on the requirements and interests of the user. Some will just want to see results, maybe a visualisation of frequent words and phrases, or a network diagram. Others will want to understand the provenance of the results they are seeing, which might include algorithmic explanation. This is a particular challenge with the advent of data processing algorithms which are more complex and less generally understandable, and which introduce an element of uncertainty and chance into the results.

3.2 Algorithmic transparency and trust

Without knowledge of the capture process, an archived web page looks like any other web page. Transparency and explanation are needed not only to improve user experience, but also to enhance confidence and trust in the web archive.11

Therefore, an important obligation is to provide an explanation of the processing and outputs that is maximally understandable to the public, and to communicate the complexity and uncertainty in the data processing and results. However, herein lies a delicate balance. Alongside the need for transparency and openness regarding its limitations, care must also be taken not to overstate the challenging aspects inherent in the collection, which could deter use of the archive or undermine the sense of skill or mastery a user can develop given some time to explore it.

10. http://discovery.nationalarchives.gov.uk/
11. https://www.nationalarchives.gov.uk/documents/the-national-archives-digital-strategy-2017-19.pdf

Transparency, nonetheless, is a critical prerequisite for users to develop trust in the archive’s search, exploration and presentation of results. It is also a recognised issue amongst humanities scholars that researchers often do not understand the provenance of the results they receive from search engines [Kemman et al., 2013]. Currently, search results from the UKGWA are not ordered by relevance; if they were in future, the interface would need to explain that ordering and how ‘relevance’ had been determined. In some situations this could influence the choice of algorithm, but there must be a sensible balance between transparency and functionality. From the data science perspective, The Alan Turing Institute has published guidance on algorithmic transparency and trust, in the form of a guide for the responsible design and implementation of AI systems.12

4 User experience

4.1 User journey

For the purpose of this report, we assume that UKGWA users bear some resemblance to Discovery users, and on that basis suggest some potential user personas. We have assumed the UKGWA user needs fall into two broad categories, which we term advanced search and quick search. Indeed, this is supported by user research conducted specifically on the UKGWA, which led to the formation of personas such as ‘professional services’, which may be an example of an advanced search user, and ‘member of the public’ as a potential example of the latter. Any developments would need to support the needs of these personas.

12. https://www.turing.ac.uk/research/publications/understanding-artificial-intelligence-ethics-and-safety

Quick search will follow some or all of the following journey:


1. Enter a search term which consists of a single word, or multiple words

2. Receive a list of ordered results (for some definition of ordered)

3. If the desired page has been found, stop; or

4. Use the returned results to inform/refine a further search

Advanced search will follow some or all of the following journey:

1. Start with a search term which consists of a single word, or multiple words

2. Receive a list of ordered results (for some definition of ordered)

3. The returned list is annotated with key entities which were previously identified and extracted from the archive content

4. Select one or more of these entities to:

(a) See a visual representation of how the predominance of those entities changes over time

(b) Explore an interactive visual graph of related entities, where similarity is driven by co-occurrence in the archive. These discovered entities can then be used to start new searches

(c) Explore an interactive visualisation of the documents most similar to the document that best matches the current entities

4.2 User interface

Figure 1 illustrates the results of an initial query issued by both quick search and advanced search users.

A linear list of results is presented, ordered by relevance to the search terms. Each result consists of the document title, augmented with context information, including the originating department and date.

Each search result is also assigned semantic tags, which we here call entities. These are key terms which have been filtered by a match against Wikipedia (or another third-party knowledge source) as a proxy for validity as a concept, person, location or object.


Figure 1: Wire-frame mock-up showing results of an initial query

Figure 2: Wire-frame mock-up showing the relative frequency of entities in search results

These entities can be clicked to navigate to the following exploratory visualisations.

The wire-frame in figure 2 represents a chart presenting the prominence of terms over time. These terms are those which were selected from the search results prior to opening this visualisation. This meets a key user need: to understand how focus on topics has evolved over time. Advanced search users will require further explanation of what is meant by ‘Relative Frequency’, and this is covered later in this report.

Figure 3: Wire-frame mock-up showing the graph of related entities

The mock-up in figure 3 represents a network graph of related entities. The links between entities are strong, bringing nodes closer together, when the entities at each end are closely related in terms of similarity. How similarity is defined will depend on the algorithms used and the types of entities. For example, co-occurrence of entities could be a sufficient similarity metric if those entities are words or people, while cosine similarity would be more appropriate for vectorised representations of words and documents.
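Cosine similarity, mentioned above as the metric for vectorised representations, is straightforward to compute. A minimal sketch in plain Python:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors: 1.0 for
    vectors pointing the same way, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_u * norm_v)
```

Because it depends only on direction, not magnitude, cosine similarity is insensitive to document length, which is one reason it is the usual choice for comparing document embeddings.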

These graphs will often be very busy, so the ability to filter by number of nodes is essential. This filter could be based on entity frequency, or the nearest neighbours to an entity of interest. The chart is interactive, in that it allows users to move the nodes around so that clusters can be viewed more clearly, and users can also double-click a node in order for it to be centred on the chart. After each user-initiated movement, the graph is adjusted using a mechanical model of attraction and repulsion. The popular D3.js13 force-directed graph14 is a good basis for such a visualisation.

Instead of a network representation, we can also visualise documents in a 2-dimensional vector space, as shown in figure 4. This depicts a view of documents that are related to the one that best matches the user’s query, and specifically, its semantic tags. This view can be considered as a cloud of points. Clustering algorithms can be applied to label and to separate groups of documents in this vector space. This labelling is represented by the varying colours of the nodes in the picture.
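As a sketch of the labelling step, the following is a minimal k-means implementation over 2-dimensional points. Note that the week's experiments used hierarchical clustering; k-means stands in here only because it is compact, and any algorithm that assigns a label per point would serve.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means over 2-D points: returns one cluster label per point.

    A toy stand-in for the clustering step described above; the report's
    experiments used hierarchical clustering instead.
    """
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise centres from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # Move each centre to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return labels
```

The labels can then drive the node colours in the figure 4 view.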

4.3 User-facing explanations

13. https://d3js.org/
14. https://observablehq.com/@d3/force-directed-graph

Figure 4: Wire-frame mock-up showing the documents in vector space

Users who require explanation will need to understand how entities are represented in the space and how they are linked together, i.e. what is the similarity metric and how is it calculated? The following sample passages are suggested as explanations of the data processing and results to aid user understanding.

Quick search results, explanation:

The search query text, which you have typed into the search box, is used to find documents that match. Those matching documents are shown on the results page.

Those that match best are higher up the search results list. This works in a very similar way to the internet search engines which you might be familiar with.

A more detailed explanation for advanced search:

Documents are indexed by separating their text content into word tokens. This would result in a very large dictionary of tokens of varying usefulness. To remove less useful tokens, they are filtered by matching against Wikipedia. If a Wikipedia entry exists, that token is considered to be valid and retained, and referred to here as an entity.

When performing a search using query text, matching only occurs using the entities which appeared on Wikipedia. This means that if a search word does not have an entry on Wikipedia, it will not be used as part of the search.

Search results are ordered by how well the documents match the entities.
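The filtering rule described in that passage (a query word participates only if a Wikipedia entry exists for it) can be sketched as follows. The entity set below is a toy stand-in for a real lookup against Wikipedia page titles, and the bigram-first matching is an assumption made so multi-word entities survive as one unit.

```python
# Toy stand-in for "does a Wikipedia entry exist?"; a real system would
# consult a dump of Wikipedia page titles.
KNOWN_ENTITIES = {"climate change", "climate", "change", "brexit", "nhs"}

def extract_entities(query, known=KNOWN_ENTITIES):
    """Keep only query tokens (checking adjacent bigrams first) that have
    a known entry; everything else is silently dropped, as described."""
    tokens = query.lower().split()
    entities, i = [], 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and bigram in known:
            entities.append(bigram)  # prefer 'climate change' over 2 tokens
            i += 2
        elif tokens[i] in known:
            entities.append(tokens[i])
            i += 1
        else:
            i += 1  # no entry: the word is ignored by the search
    return entities
```

This also makes the limitation noted earlier concrete: a query term absent from the knowledge source contributes nothing, even if the topic exists in the archive under other words.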


An explanation of the trends over time chart for quick search:

The chart shows how frequently topics occur over time. More than one topic can be compared, and this results in a stacked bar chart.

Note that if a document appears over a period of years, even if unchanged, the matching entities will still be counted for each year.

A more detailed explanation for advanced search:

The chart shows how frequently topics occur over time. More than one topic can be compared, and this results in a stacked bar chart.

The height of the bars is the frequency, per document, with which the entity appears in the archive for that year. A matching document will only contribute a count of one, and so multiple entity occurrences do not boost the match. A height of 0.05 means that a term matches 5% of the documents in that year.
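For illustration, the per-year counting rule described in this passage could be sketched as follows (a toy example: the function, document names and entity lists are invented, not part of the prototype code):

```python
from collections import defaultdict

def entity_frequency_by_year(doc_entities, entity):
    """Fraction of documents per year mentioning `entity` at least once.

    `doc_entities` maps (doc_name, year) -> entities found in that document.
    A document contributes a count of one regardless of how often the
    entity occurs within it, matching the rule described in the text.
    """
    docs_per_year = defaultdict(int)
    matches_per_year = defaultdict(int)
    for (doc, year), entities in doc_entities.items():
        docs_per_year[year] += 1
        if entity in entities:  # membership test, not an occurrence count
            matches_per_year[year] += 1
    return {year: matches_per_year[year] / docs_per_year[year]
            for year in docs_per_year}

# A document archived unchanged across several years is counted in each year.
docs = {
    ("speech1", 2012): ["Olympics", "London"],
    ("news1",   2012): ["Olympics", "Olympics"],  # duplicates do not boost
    ("news2",   2012): ["Budget"],
    ("news3",   2013): ["Budget"],
}
freq = entity_frequency_by_year(docs, "Olympics")
# freq[2012] == 2/3 (two of three documents match), freq[2013] == 0.0
```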

The network graph of linked entities can be explained to quick search users as follows:

The chart shows topics that are related to each other. The closer the topics on the chart, the more related they are in the archive.

You can follow links from one topic to another to explore related topics and discover topics that you might not have known were related.

A more detailed explanation for advanced search:

The graph of links and nodes shows how entities are related to others. The nodes represent the entities, previously matched against Wikipedia entries.

The graph is automatically arranged by modelling how strong the links between nodes are. This strength is the similarity between entities, which here is determined by the distance between them in a vector space created using the widely used Doc2Vec algorithm run against the documents.

Because the Doc2Vec algorithm is non-deterministic, that is, the algorithm involves some randomness, the resulting output can vary even with the same data as input. You should take the resulting graphs as indicative.

5 Data overview

5.1 Dataset description

The dataset consisted of 3,893,092 files, totalling 23.9 gigabytes (GB). While it would be usual to perform exploratory data analysis prior to performing machine learning, it was decided that a better use of time would be to run an expert-led session describing the data. The dataset was a derivative extract which had been defined by The National Archives and was therefore already understood to a deeper level than summary statistics would provide.

In this session, The National Archives' experts described the processes behind populating the archive, how the datasets were created and where the challenges lie for both the archive and its users. This session then led to a wider discussion of the problem space and potential research avenues, and it was felt that this was particularly beneficial for the team as they started their experimentation with a stronger understanding of what we wanted to achieve.

The final resource presented at the event consisted of four datasets. Each is a plain-text collection derived from HTML pages in the UKGWA:

• Dataset1 Government Hub Websites 2006–2019: pages from direct.gov.uk and www.gov.uk, throughout this period, with JavaScript removed by machine learning

• Dataset1-raw Government Hub Websites 2006–2019: pages from direct.gov.uk and www.gov.uk, throughout this period, without any pre-processing

• Dataset2 A subpart of Dataset1. Thematically Sampled Websites 2006–2019: pages from approximately 450 government websites, which were pre-selected through high-level topic modelling [using the MAchine Learning for LanguagE Toolkit (MALLET)] [15]

• Supplementary Home Pages 1996–2019 and Blogs 2013–2019: every home page captured since the beginning of the UKGWA, which was the dataset used to seed the creation of Dataset2, plus shallow crawls of government blogs, although most blogs are not crawled every month

The websites from which these datasets were drawn address all areas of life and activity influenced by UK government work. The subset of websites in Dataset2 was selected to reflect a range of these activities, identified through prior topic modelling as relating to the Olympics, healthy eating, regional development, climate change, and statistics and transparency.

Dataset1 and Dataset2 were created by selecting snapshots taken of specific websites on the date closest to 1 January of each year, from 2006 onwards. Websites identified were then crawled to sufficient depth to capture most of their content, with sitemaps being used to identify pages in the domain when available. In many cases this range of snapshots covers the entire lifespan of sites. The home pages dataset was included partly because it was used to select Dataset2, and also because it could be used for a fine-grained analysis of topics over time. The blogs provided larger amounts of text material, but also a different type of content which was more likely to be driven by topical events in the outside world. For each dataset the text was extracted from the HTML using the Python [16] library Beautiful Soup [17], saved as individual text files (named according to the URL), and organised by website and snapshot.

5.2 Data quality issues

The pre-processed data files contain the body, header and footer of the corresponding pages, along with residual text resulting from the extraction process. The headings in the file contain information about the structure of the websites; this content can be used in the future but is not relevant for computing the entities at this moment. However, the files are stored in plain-text format, and separating the page body from header and footer content is challenging, as the HTML tags making this explicit were removed during the extraction. A more natural data format for web archive data is the Web ARChive (WARC) [18] file format, which combines multiple digital resources into an aggregate archival file together with related information. It provides better support for harvesting, access and exchange of data, and has been the predominant format for web archives to the present. However, it was decided for this Data Study Group that the volume of data in WARC format would either be too large to provide a reasonable subset of the archive, or would require significant pre-processing during the week.

15. http://mallet.cs.umass.edu/
16. https://www.python.org/
17. https://www.crummy.com/software/BeautifulSoup/

The work was focussed on documents containing larger amounts of text, such as blog posts, news articles, speeches or similar content. These would be more likely to be entity-rich and therefore more suited to document similarity techniques than home pages, which are light on entities and heavy on navigation.

6 Experiment: entity recognition and disambiguation at scale

The first objective of the Data Study Group was to semantically enrich the collection, by automatically attributing entities and concepts to words with unambiguous meanings. This was done following current entity-centric approaches in information retrieval systems (see for instance Dalton et al. [2014], Nanni et al. [2017]), which embody the 'Things not Strings' vision [19]. Such a step would allow for a better understanding of the collection. Moreover, as this semantic enrichment focuses on concepts and entities rather than raw tokens, and compares concepts and entities contained in the user query with those contained in the documents, it allows us to model the relation between user query and relevant documents in a more fine-grained way.

18. https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
19. https://www.blog.google/products/search/introducing-knowledge-graph-things-not/

6.1 Task description

The first part of our work involved extracting and disambiguating all named entities (people, locations, organisations, etc.) in the texts. To do so we employed the Python toolkit spaCy and the English pre-trained model 'en_core_web_lg' (described as: 'English multi-task Convolutional Neural Network trained on OntoNotes, with GloVe vectors trained on Common Crawl'). Three scripts drive this part of the process, focusing on performance gains via multi-threading and extracting entity occurrence frequencies.

We have also explored how to link detected entities to Wikipedia articles. Whilst having a high potential to reduce ambiguity in website and search query content, most available pipelines have drawbacks: the web service TagMe [20], for instance, relies on an API, which is a bottleneck when processing 350,000 documents. On the other hand, spaCy does not offer pre-trained models for entity linking, and DBpedia Spotlight models [21] are not readily accessible. Furthermore, there might be issues with data protection and intellectual property rights, since we would have to disclose data to a third-party server like TagMe via an API. For this reason, we provide a shallow disambiguation of entities relying on outlink statistics derived from Wikipedia [22].

6.2 Experimental set-up

Whilst many options for parallelising the tagging process are available, the overall speed with which the full dataset is tagged is limited by the rate at which the data can be read in and the results can be saved. Therefore, we used an optimisation process to reduce the time required for processing the whole dataset. We used the ThreadPoolExecutor from the concurrent.futures [23] Python library to parallelise the code, and then compared run times over a smaller dataset of 300 files between fully sequential and parallel versions. The benchmark test was carried out on a 64-core Microsoft Azure [24] virtual machine with 32GB of memory and SSD storage.

20. https://sobigdata.d4science.org/web/tagme/tagme-help
21. https://www.dbpedia-spotlight.org/
22. https://github.com/fedenanni/Reimplementing-TagMe
23. https://docs.python.org/3/library/concurrent.futures.html

Threads    Time (sec.)
1          99
5          49
7          50
10         78

Table 1: Benchmark times from processing files in parallel, by number of threads.

As can be observed in Table 1, running 5 threads in parallel appears to be the optimal setting; beyond this point, using more threads does not reduce processing time. Each text file is read as input and has a corresponding JavaScript Object Notation (JSON) [25] file output, which causes the execution to be input/output (I/O) bound. Benchmarking was not conducted beyond 10 threads, and we cannot adequately explain why execution time rises sharply with additional threads.
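A minimal sketch of this I/O-bound pattern, using the same stdlib ThreadPoolExecutor. The `process_file` stub below stands in for the real spaCy tagging step, and the file names and toy 'entity' heuristic are invented; only the one-input-file-to-one-JSON-output shape mirrors the pipeline:

```python
import concurrent.futures
import json
import pathlib
import tempfile

def process_file(path):
    """Read a text file, 'tag' it, and write a JSON sidecar file.

    The real pipeline runs spaCy NER here; this stub merely keeps
    title-cased tokens so the I/O pattern is the same.
    """
    text = path.read_text(encoding="utf-8")
    entities = [tok for tok in text.split() if tok.istitle()]  # stand-in for NER
    out = path.with_suffix(".json")
    out.write_text(json.dumps({"doc": path.name, "entities": entities}))
    return out

def process_all(paths, n_threads=5):
    # Threads (not processes) are appropriate because the work is I/O bound.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(process_file, paths))

# Demo on a throwaway directory of six small files.
tmp = pathlib.Path(tempfile.mkdtemp())
for i in range(6):
    (tmp / f"doc{i}.txt").write_text("David Cameron spoke in London today")
outputs = process_all(sorted(tmp.glob("*.txt")))
```

Note that `n_threads=5` here echoes the optimum found in Table 1, though the best value depends on the storage and machine used.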

6.3 Results

We processed 357,831 files from Dataset1, focusing on files containing the words 'speech' and 'news' in the filename. Please refer to section 5.2 for details on why we focused on these files. The processing took place overnight and we did not capture the exact run time. Using a virtual machine with fewer cores may seem appropriate, but Azure virtual machines scale core count, memory, network and I/O bandwidth together. As we were I/O bound, an Azure virtual machine of the same type with fewer cores would also have reduced I/O bandwidth, making for slower performance.

We placed the content of the obtained 357,831 JSON files into a SQLite [26] database table DocumentEntity, defined as follows:

CREATE TABLE IF NOT EXISTS DocumentEntity
(
    [ent]  TEXT, -- Entity
    [cat]  TEXT, -- Category of entity
    [doc]  TEXT, -- Document name
    [year] INT   -- Year of the document
);

24. https://azure.microsoft.com/
25. https://www.json.org/
26. https://www.sqlite.org/
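Loading the per-document JSON outputs into this table could look like the following sketch. The JSON record layout ("text"/"label" fields) and the function name are assumptions for illustration, not the project's actual scripts:

```python
import json
import sqlite3

# In-memory database for illustration; the study used an on-disk SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS DocumentEntity
       ([ent] TEXT, [cat] TEXT, [doc] TEXT, [year] INT)"""
)

def load_entity_json(conn, doc_name, year, json_text):
    """Insert one document's extracted entities into DocumentEntity.

    Assumes each JSON file holds a list of {"text": ..., "label": ...}
    records; the real per-document files may be structured differently.
    """
    records = json.loads(json_text)
    conn.executemany(
        "INSERT INTO DocumentEntity (ent, cat, doc, year) VALUES (?, ?, ?, ?)",
        [(r["text"], r["label"], doc_name, year) for r in records],
    )
    conn.commit()

sample = json.dumps([
    {"text": "Theresa May", "label": "PERSON"},
    {"text": "European Union", "label": "ORG"},
])
load_entity_json(conn, "news_2017_01_01.txt", 2017, sample)
rows = conn.execute("SELECT ent, cat, year FROM DocumentEntity").fetchall()
```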

6.4 Reproducing results

The reader interested in reproducing the results obtained in this work can access the scripts available in the companion GitHub repository [27]. The script named multithread_calc_entities_spacy.py takes a list of file names as input and saves the output entities as JSON files. An extra parameter sets the number of threads to be run in parallel. In addition, a version with serial processing was made available (please refer to calc_entities_spacy.py).

6.5 Conclusions

Entities are an established way of exploring web archive collections [Ruest and Milligan, 2019]. We show that annotating a very large collection with out-of-the-box tools is achievable in a short amount of time, and that the output, which comprises names of political leaders (e.g. David Cameron, Theresa May), organisations and events (e.g. the Olympic Games, the European Union), enables a fine-grained exploration of the collection. Nevertheless, it is important to consider that named entity recognition and entity linking tools are still far from perfect, especially when applied in noisy contexts.

7 Experiment: document embedding

The second experiment focused on preparing for semantic search over the collection, to improve over simple string matching. Being able to understand the information needs of a user and offer back the most relevant documents has always been the central goal of information retrieval approaches [Manning et al., 2008]. We have approached this task by leveraging current research in distributional semantics using word embeddings [Mikolov et al., 2013] and pre-trained language model approaches [Devlin et al., 2018], and their application in search scenarios over large-scale collections [Mitra and Craswell, 2017].

27. https://github.com/alan-turing-institute/DSG-TNA-UKGWA

7.1 Task description

Thinking of a user engaging with the UKGWA, this experiment focused on returning the most relevant document to a user query. A number of different approaches were used for this task.

One of the simplest approaches is keyword matching, currently used by the vast majority of web archives that provide a search service. Given a string entered by a user (which we will call a search string), keyword matching finds all the documents containing this search string. The advantages of this approach are ease of implementation, explainability, and speed of information retrieval. However, keyword matching does not consider the semantics of a user query. For example, given the search string 'games', a user may mean the 'Olympic Games' or computer games; the system does not know which.

A common solution to this problem involves document embeddings. In this approach, the user query, as well as all websites in the database, are identified with a vector of, for instance, 300 dimensions. The website in the database whose vector is most similar to the vector of the user query is then returned as the most relevant answer to the user query.
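A toy sketch of this nearest-vector lookup follows. The 3-dimensional embeddings and file names are invented purely for illustration; real embeddings (of e.g. 300 dimensions) come from a trained model such as Doc2Vec:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_relevant(query_vec, doc_vecs):
    """Return document names ranked by similarity to the query embedding."""
    return sorted(doc_vecs,
                  key=lambda name: cosine(query_vec, doc_vecs[name]),
                  reverse=True)

# Invented 3-dimensional embeddings for three documents.
docs = {
    "olympics_2012.txt": [0.9, 0.1, 0.0],
    "budget_speech.txt": [0.0, 0.8, 0.2],
    "video_games.txt":   [0.5, 0.0, 0.9],
}
query = [1.0, 0.0, 0.1]  # pretend embedding of a query like "Olympic Games"
ranking = most_relevant(query, docs)
# ranking[0] == "olympics_2012.txt"
```

In practice the ranking step is done with vectorised linear algebra (or an approximate nearest-neighbour index) rather than a Python loop, but the principle is the same.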

7.2 Experimental set-up

In our experiment, we created document embeddings by modelling the entities extracted from the previous entity recognition and disambiguation experiment (see previous section). We used entities instead of words within the websites to overcome the fact that the data contained noise from JavaScript code and HTML tags. It is important to mention that this was the only cleaning step made before creating the embeddings.

Figure 5: Illustration of document embeddings. Each point is a document embedding, representing a website from Wikipedia. This figure was not obtained from the datasets described in this report and features here for illustration purposes. Source: Christopher Olah, https://colah.github.io/

To create the document embeddings (see figure 5), i.e. the embeddings of each website page, we used the Paragraph Vector model, an unsupervised neural network approach. The Gensim Python library by Rehurek and Sojka [2010] provides a ready-to-use implementation of this model (named Doc2Vec), which we have employed using default parameters.

The state-of-the-art algorithms for contextual word embeddings (vector representations of words instead of documents) are ELMo [28], which employs recurrent neural networks (RNNs), and Bidirectional Encoder Representations from Transformers (BERT) [29], which is based on the Transformer architecture. These algorithms offer a higher quality of embedding, but are subject to substantially higher computation time. While we have experimented with them, we were not able to apply these techniques at scale during the Data Study Group week. Nevertheless, we expect to see an improvement in the results when employing them in comparison with traditional document embeddings. ELMo and BERT are not usually used for document embeddings, and would need careful changes for the given challenge. An option might be to employ the recently released BERT-based sentence embeddings [Reimers and Gurevych, 2019]. Representing the corpus as document or word embeddings would help solve some challenges of search and recommendation: when a user performs a search, return a list of documents which are most similar to the query terms.

28. https://allennlp.org/elmo
29. https://arxiv.org/pdf/1810.04805.pdf

7.3 Results

The initial computation of document embeddings for the 357,831 considered websites took approximately 3 hours. The size of all document embeddings is 2.1GB. The size of the resulting deep learning model is approximately 100MB. The subsequent computation of an embedding for an arbitrary user query takes less than a second.

The quality of the clustering results is assessed in the following section, which describes the clusters within the resulting dataset.

7.4 Reproducing results

To reproduce the experiments, one must first obtain the entities, as outlined in the previous experiment on entity recognition. The resulting entities are stored as JSON files, one for each website. In the current set-up, the JSON files would be stored in the directory /data/entities/results[number]/, where [number] is a number between 1 and 10.

If the outputs are stored in a different location, the doc2vec_helper.py file has to be edited accordingly.

Then, the doc2vec.py file can be run. This creates the document embeddings in the directory /data/doc2vec/. The Doc2Vec algorithm proceeds in iterations (also known as epochs), in each of which a better model is derived. The algorithm finishes after a specified number of iterations, and saves the resulting model to /data/doc2vec/model_[version], where [version] is a numerical timestamp. In addition, all intermediate models are stored in the same directory. The overall number of iterations, as well as other hyperparameters, can be changed in the doc2vec.py file. Such changes, notably higher numbers of iterations, may improve the quality of the document embeddings.

Lastly, the generated model is used to map each document to a small vector (of 300 entries in our experiments): the document embedding. These document embeddings are saved in the same directory as the model, as /data/doc2vec/results_[version].csv, where [version] is a numerical timestamp.

7.5 Conclusions

The document embeddings were generated from the list of document entities, rather than from the plain text. We made this decision because of the noise in the document texts, such as mark-up tags and comments. Another possible approach to generating document embeddings would be to take all the content of the documents, clean it, and then produce document embeddings from the list of tokens from all of the documents. A number of steps would be needed to clean the content, such as removing all English stopwords, all non-alphanumeric characters, etc.

Another limitation of this approach is the challenge of processing a user query that contains a word which does not appear in the word embedding list. Currently, this problem is unaddressed and the word is ignored; however, more advanced approaches would employ sub-word or character embeddings to address the issue [Athiwaratkun et al., 2018].
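One such sub-word fallback can be sketched as FastText-style character n-gram averaging. This is a toy illustration only: the lookup tables and dimensionality are invented, and a real system would use trained word and n-gram vectors:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word with boundary markers (FastText-style)."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def embed_word(word, word_vectors, ngram_vectors, dim=4):
    """Full-word lookup with a sub-word fallback for out-of-vocabulary words.

    Both lookup tables here are toy stand-ins for trained vectors.
    """
    if word in word_vectors:
        return word_vectors[word]
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not vecs:
        return [0.0] * dim  # nothing known about this word at all
    # Average the known n-gram vectors of the unseen word.
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

word_vectors = {"games": [1.0, 0.0, 0.0, 0.0]}   # in-vocabulary word
ngram_vectors = {"<ga": [0.0, 1.0, 0.0, 0.0],    # sub-word units
                 "gam": [0.0, 0.0, 1.0, 0.0]}
oov = embed_word("gaming", word_vectors, ngram_vectors)  # 'gaming' is unseen
# oov == [0.0, 0.5, 0.5, 0.0], the mean of the "<ga" and "gam" vectors
```

The out-of-vocabulary word 'gaming' thus receives a vector near related in-vocabulary words, instead of being silently dropped from the query.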

Document embeddings have the potential to improve search quality and to assist in clustering the dataset into themes, which can then be used for further analysis. The runtime of the Doc2Vec algorithm we used (both for training and inference) is good, and suitable for practical use. However, the quality of the resulting embeddings can be improved, as analysed in the next section.

We suggest the following areas of focus for future investigation:

• Use other input data, specifically cleaned website data, or operate on words instead of entities

• Change the parameters of Doc2Vec (such as number of iterations)

• Consider alternatives to Doc2Vec such as ELMo or BERT, as discussed above

In particular, BERT models for word representations are a promising avenue of exploration, likely to be highly adaptable to a corpus such as the UKGWA, which is both entity-rich and context-rich. The potential of this approach is further advanced by the relative ease with which the framework can apparently work in conjunction with other NLP pipelines. Sentence-based approaches aid the contextualisation that would bring more powerful insights into the UKGWA dataset, and would deliver a valuable complementary system to the current keyword-based full-text search, without necessitating a radically different user experience.

8 Experiment: clustering

The third experiment involved the use of semantic representations of the documents, in order to form a bottom-up, data-driven organisation of the collection through a structure of interconnected topics. To achieve this, we explored the adoption of clustering algorithms, employing the previously extracted entities and topics. The combination of these clusters with semantic search and disambiguated entities allows us to offer a large number of options to the final user for exploring topics and trends in the archive, as discussed in the following section.

8.1 Task descriptions

A key part of an archival search is the ability to group documents together into well-defined topics, which a user may use both to narrow down their search results and to investigate topic evolution over time. The document embeddings described above should provide a means to do so, through the application of unsupervised clustering algorithms. We investigated the adoption of two well-known and widely used algorithms, k-means and hierarchical agglomerative clustering; both are implemented in Python in the Scikit-Learn [30] and SciPy packages.

K-means works by randomly initialising a predetermined number of centroids (the centres of the clusters), then iteratively assigning points (here documents) to their nearest cluster and recalculating the centres of these new clusters. This is computationally inexpensive to perform, but the resulting clusters may be of poor quality due to the lack of structure or relations between them. Hierarchical agglomerative clustering addresses this issue by iteratively merging clusters, starting from singletons (that is, clusters containing one document only).
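The k-means loop just described can be sketched in a few lines. This is a toy re-implementation for illustration only (the study used the scikit-learn implementation), and the six example points are invented:

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means: initialise, assign, recompute; returns labels.

    points: list of equal-length coordinate tuples.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialisation from the data
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2
                                  for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return labels

# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = kmeans(points, k=2)
# points 0-2 share one label and points 3-5 the other
```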

8.2 Experimental set-up

To evaluate the quality of these clusters for a given number of topics, we apply silhouette analysis. This is a simple but effective metric: first, the mean distance of each point to all others within its cluster is calculated (call this a(i) for point i). Next, define b(i) to be the smallest mean distance of i to the points in another cluster, and define the silhouette score s(i) = (b(i) − a(i)) / max(a(i), b(i)). This value lies between −1 and 1, and by taking the mean value over the full dataset we may quickly evaluate cluster quality. A value close to one corresponds to well-clustered data, while values around zero demonstrate that clusters are overlapping. To reduce the computational expense of calculation, these scores may be approximated via a random sample of the full dataset.
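The silhouette computation can be written directly from the formula above. This is a plain-Python sketch on invented points; in practice scikit-learn's silhouette_score does the same job on real data:

```python
def mean_silhouette(points, labels):
    """Mean silhouette score, computed directly from the formula above.

    a(i): mean distance from point i to the other points in its cluster;
    b(i): smallest mean distance from i to the points of another cluster;
    s(i) = (b(i) - a(i)) / max(a(i), b(i)).
    """
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # convention: a singleton cluster scores 0
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to 1.
pts = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1)]
mean_s = mean_silhouette(pts, [0, 0, 1, 1])
```

Deliberately mislabelling the same points (e.g. labels [0, 1, 0, 1]) produces a negative mean score, illustrating how the metric flags poor clusterings.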

Finally, so that clusters are interpretable for users, we require a way of automatically producing labels for these groups, which may then be refined by domain experts if desired. To do so, we considered the tagged entities present in a number of documents closest to each calculated centroid. By exploring those most unique to each cluster, we can in theory list key entities for each group and thus interpret what topic it corresponds to (as done by Lauscher et al. [2016]).

30. https://scikit-learn.org/

Figure 6: Hierarchical clustering on the document embeddings. Only the top layers are shown.

8.3 Results

The result of our initial experiment is the dendrogram shown in Figure 6. This is a diagram that shows the hierarchical relationship between objects, in our case document embeddings. While in this first attempt the document embeddings show no clearly defined clusters (i.e. a tree with distinct macro-groups), further work on document pre-processing, feature extraction and hyperparameter tuning (number of clusters, distance threshold, kernel k-means (see Dhillon et al. [2004])) would guide us in generating an interpretable structured overview of the collection.

The auto-labelling approach, attempted by taking the 100 nearest documents in the embedding space to each centroid and then taking the most common entities of those documents, is also limited by the large amount of boilerplate content present in the datasets. Many documents (even when treated as embeddings) would have similar most-frequent entities.

All the steps described in this section were carried out using the scikit-learn Python library, in particular its clustering functions.

8.4 Conclusions

As discussed above, the presence of boilerplate text such as banners, links and disclaimers has a drastic negative impact when using the Doc2Vec model, and consequently on the clustering of web pages.

Future work should therefore focus on two core issues of contemporary research on enhancing access to web archives:

• Improving content extraction from web pages, while in parallel

• Developing approaches to deal with highly noisy collections

Concerning the second point, approaches that leverage pairwise mutual information could be applied to identify the most relevant entities per cluster, ignoring the underlying noise.

Additional clustering approaches could be investigated. For instance, one could create a dissimilarity matrix of documents using Doc2Vec and cosine similarity, then treat the matrix as a network of documents and apply community detection methods to it. This works by thresholding the dissimilarity matrix and forming a weighted graph through MST-kNN (Minimal Spanning Tree (MST) and k-Nearest Neighbour (kNN)), then permitting the application of Markov stability approaches to community detection across a range of resolutions. This could also allow us to better understand nested modular structures inherent in the data (i.e. subtopics within a broader field). Other methods include manifold detection techniques to form the graph, such as contagion maps.
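As a much-simplified stand-in for the MST-kNN and Markov stability pipeline, thresholding the dissimilarity matrix and taking connected components of the resulting graph already illustrates the idea; the matrix below is invented:

```python
def connected_components(n, edges):
    """Union-find connected components over n nodes."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]

def communities_from_dissimilarity(matrix, threshold):
    """Link documents whose dissimilarity is below `threshold`, then take
    connected components as (very coarse) communities.

    A simplified stand-in for the MST-kNN / Markov stability approach
    described in the text, which operates at multiple resolutions.
    """
    n = len(matrix)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if matrix[i][j] < threshold]
    roots = connected_components(n, edges)
    groups = {}
    for i, r in enumerate(roots):
        groups.setdefault(r, []).append(i)
    return sorted(groups.values())

# Toy dissimilarity matrix: documents 0-1 and 2-3 are near-duplicates.
D = [
    [0.0,  0.1,  0.9,  0.8],
    [0.1,  0.0,  0.85, 0.9],
    [0.9,  0.85, 0.0,  0.05],
    [0.8,  0.9,  0.05, 0.0],
]
comms = communities_from_dissimilarity(D, threshold=0.5)
# comms == [[0, 1], [2, 3]]
```

Varying the threshold here loosely mirrors varying the resolution parameter in Markov stability: lower thresholds give many small communities, higher thresholds merge them.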

9 Experiment: interface

This final experiment consisted of the exploration of different types of interface for the user of the UKGWA, based on the adoption of disambiguated entities, semantic search, and document clustering. The current interface offers a keyword-based search, with some filtering options included to refine result sets. At the time of the event, the index covered nearly 400 million resources and offered full-text search. However, a search can deliver too many hits to the user, often in the order of thousands or millions. It is not transparent how the results are ordered, and their relevance can appear quite random to the user at times. As explained in a previous section, this is indeed the case, but it is not currently explained to the user.

9.1 Task descriptions

For this experiment, we explore the following questions:

• Is there a different way to visualise the data of the Web Archive?

• What interface would make it possible for the user to explore the data?

• How do we make this explainable for the user?

In order to make this visualisation possible, a tool would have to be developed. This tool must be web-based, easy to implement on The National Archives web pages, and interactive. Additionally, end users should be given the opportunity to explore the data.

9.2 Experimental set-up

The interface should give the user an overview of the trends and topics within the web archive over time. To allow this, we would use the extracted entities and concepts to offer a comprehensive representation of the concepts in use at each point in time. Based on this temporal overview, the user should be able to drill down to the specific sub-corpora that they are interested in. For this second part, the semantic search and clustering developed in earlier experiments would be used, as they would give the user a more refined search experience than the current full-text search.

The output from the previous experiments became the input for the interface experiment. Different types of graphs were trialled amongst the group to find the best way to visualise the temporal aspect of the data, and their usefulness was assessed in collaboration with The National Archives experts present.

Figure 7: Initial prototype of the interface, presenting a comparison of the frequency of mentions of two entities over time.

9.3 Results

After trying a number of plots, we decided to use stacked bar charts for the final implementation, as they easily allow the comparison of entities and concepts as topics over time. A visualisation of the graph with the input entities 'sport' and 'Olympics' can be seen in figure 7 (note: the entity Olympics also includes references to 'London 2012').

Based on the interactive interface experiments, we have generated the mock-up seen in figure 8.

We envisioned extra information being accessible to the user by hovering over the question marks, to make the results clearly understandable to a wide range of users. We also envisioned other graphs offered to the side, to allow data exploration in different ways. Finally, by selecting a sub-part of the results (for instance 'Olympics' in 2012), we would provide relevant documents based on the results of semantic search and clustering on this sub-part of the collection.

9.4 Conclusions

Next steps concerning the implementation of the interface would involve:

• The full integration of the developed components


Figure 8: A mock-up interface with improvements for the user, based on recommendations from the data owners.

• A user study on the pros and cons of integrating such a novel interface in comparison with the previous string-matching search tool

• An analysis of the stability of such a tool when employed at scale, rather than only tested on a sub-sample of the collection

10 Future work and research avenues

Given further research and development, there is clear potential for new search methods to enable users to discover content in the UKGWA, exploiting the rich information within.

This Data Study Group showed that the deployment of entity recognition and disambiguation at scale, using readily available software, is possible with data of this kind. Those methods would improve with additional contextual information, including that present in the metadata or catalogue, and with data processed in a different way.

Document embeddings were shown to be useful to summarise the content in each document, by using those entities extracted. There would be even greater successes, we expect, if these techniques were applied to all the textual content of each document, not just the smaller set of entities. If steps to improve the data by isolating or removing repeated web page content (e.g. navigation) were employed, then we are hopeful that the clustering techniques would prove to hold even further benefit.

Following best practice, for instance that described by The Turing Way project [The Turing Way Community et al., 2019], the technical and social design of any future system should be evaluated thoroughly before its interface and technologies are trialled with users.

In addition to scaling up and extending the work of the experiments outlined above, other areas of focus have been identified, as discussed below.

10.1 Expand corpus beyond HTML formats

HTML is the most common format in the UKGWA, but thousands of different MIME types,31 both text and non-text, are also present. Extraction and processing of these document formats presents somewhat different challenges to those we face when processing HTML, but they may make some issues, such as duplicate detection, easier.

There is untapped potential in the multimedia content that sits alongside the text-based content in the UKGWA. For example, speech-to-text conversion and/or image classification could be deployed to these formats and the extracted text passed through the same enrichment pipeline.

Backlinks (i.e. preserving and displaying to the user the location of a link to a file on a referring web page) would provide useful context and could aid more effective exploration around a topic. For example, the same page that provides a link to a report may also provide links to related documents.

31 https://en.wikipedia.org/wiki/Media_type


10.2 Alternative datasets for HTML resources

A major issue, which was expected at the start of the week, was the heterogeneous nature of the data, in terms of the variety of sources at our disposal and the complexity of parsing their structure at scale. Using either the data from the source WARC files or the extracted text in the existing UKGWA Elastic index may reduce some of the data problems encountered.

If using HTML extracts in future, it should be determined whether certain HTML tags should be retained and preserved alongside the content as a way of encoding meaningful information. However, retaining these risks introducing even more noise into the dataset through repetition of navigational and template content.

To overcome the noise in the dataset, we used entity recognition methods instead of operating on the words within the websites directly. This removed the boilerplate code effectively, but also some of the information content. The navigational features of web pages provide valuable contextual information, so reaching a balanced approach will take more evaluation.
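An alternative to discarding everything but entities is to strip only the repeated template text itself. One simple heuristic, sketched below under the assumption that boilerplate lines recur across most pages of a site while substantive content does not, drops any line that appears on more than a chosen share of pages; the threshold and sample pages are illustrative, not tuned on UKGWA data.

```python
from collections import Counter

def strip_boilerplate(pages, max_share=0.5):
    """Remove lines appearing on more than `max_share` of the pages --
    a crude heuristic for repeated navigation and template text."""
    line_freq = Counter()
    for page in pages:
        line_freq.update(set(page.splitlines()))
    cutoff = max_share * len(pages)
    return ["\n".join(l for l in page.splitlines()
                      if line_freq[l] <= cutoff)
            for page in pages]

# Hypothetical site: the same navigation line on every page.
pages = ["Home | About | Contact\nReport on flood defences",
         "Home | About | Contact\nOlympics legacy funding",
         "Home | About | Contact\nDepartmental budget 2012"]
print(strip_boilerplate(pages)[0])  # "Report on flood defences"
```

A production version would need per-site frequency tables and care around pages that legitimately repeat content.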

10.3 Data and code in re-usable, open formats

The focus should not only be on systems, but on practice too, especially with regard to publishing datasets that third parties can access and process themselves. Several institutions and initiatives publish such data, which encourages research into their collections and raises their profiles in academic research circles. A good example is Common Crawl,32 and the permissive re-use terms attached to the vast majority of UKGWA content support a similar approach. Care would need to be taken to provide high quality documentation.

Similarly, the code used to generate datasets derived from the collection, along with all algorithms deployed on the collection, should be published under open re-use terms to aid transparency in developed systems and to support critical analysis, for example by forking.

32 https://commoncrawl.org/


10.4 Combine interface with other visualisations

The user experience and sense of exploration could also be enhanced by providing complementary interactive services alongside the system proposed in this report. For example, including interactive network graphs could improve the user experience by placing their search terms in the context of the wider collection and allowing them to view the websites as entities possessing intrinsic properties.

The most obvious example of this is with a campaign website focussed on a particular issue; however, such an approach should give users the ability to explore topics and themes across multiple domains, as we know from anecdotal evidence that most users are not primarily interested in where something appears but rather when, why and in relation to what it appears.

Providing a temporal dimension to such a visualisation would also highlight one of the key aspects of the archived web, namely changes over time in both structure and content.

10.5 Improve approaches for handling duplication

In order to avoid the skewing effect of repeatedly counting the same entity in an unchanging document (URL), data could be pre-processed to mark up the first occurrence and the fact that it persists for a given period of time. As described earlier in this report, this approach depends on the ability to reliably identify and extract pertinent content in the first place, in order to assert that it has not materially changed between point X and point Y.
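One way to sketch this pre-processing, assuming dated snapshots of a single URL, is to hash each snapshot's extracted content and collapse runs of identical hashes into persistence intervals, so downstream counts treat each interval as one document rather than recounting every crawl of an identical page. The snapshot data below is hypothetical.

```python
import hashlib

def persistence_intervals(snapshots):
    """Collapse (date, content) snapshots of one URL into
    (first_seen, last_seen) intervals of unchanged content."""
    intervals = []
    for date, content in sorted(snapshots):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if intervals and intervals[-1][0] == digest:
            # Same content as the previous snapshot: extend the interval.
            intervals[-1][2] = date
        else:
            intervals.append([digest, date, date])
    return [(first, last) for _, first, last in intervals]

snaps = [("2010-01", "Olympics bid update"),
         ("2010-06", "Olympics bid update"),
         ("2011-01", "Olympics venue plan")]
print(persistence_intervals(snaps))
# [('2010-01', '2010-06'), ('2011-01', '2011-01')]
```

Exact hashing only detects byte-identical content; combined with the boilerplate-stripping discussed above it would also catch pages whose only change is template churn.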

Generating scores between documents based on their degree of similarity may be a useful route into this problem. Such analysis is likely to be easier with non-HTML resources, such as PDFs, because, as static documents, they are less vulnerable to interference by changes to website templates, or other stylistic (rather than content-based) changes.
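As an illustrative first pass at such similarity scores (deliberately simpler than the embedding-based measures used elsewhere in this report), a Jaccard index over word sets already separates near-duplicates from unrelated documents:

```python
def jaccard_similarity(doc_a, doc_b):
    """Score two documents by the overlap of their word sets
    (Jaccard index): 1.0 for identical vocabularies, 0.0 for
    disjoint ones."""
    tokens_a = set(doc_a.lower().split())
    tokens_b = set(doc_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

a = "annual report on flood defences"
b = "annual report on coastal flood defences"
print(round(jaccard_similarity(a, b), 3))  # 0.833
```

A real deduplication pass would likely use shingling or MinHash to scale this comparison across millions of snapshots.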


10.6 Improve integration with knowledge sources

Wikipedia, and its open linked data source, DBpedia [Auer et al., 2007], have been mentioned as useful contextualising and enhancing services. Other third party sources may be equally promising for leveraging factual information. Additional datasets can be sourced from government to add structure to domain knowledge, by using, for example, ontologies of government structure, an early version of which is already available to The National Archives, derived from various sources of data, such as organograms33 and data published as registers.34

Furthermore, consideration would be given to whether other third party data sources could be integrated to provide useful context. Examples include media organisations, data repositories, or other web archives.

10.7 Leverage existing search query data

Existing search data relating to the use of the UKGWA have enormous potential, and future work should leverage these to identify potential focus areas. The existing UKGWA search service handles around 30,000 queries per month,35 which offer a great basis for building comparative datasets for testing and validation, and for gaining insights into user experience and user success.
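As a minimal sketch of how such logs could be mined, assuming a hypothetical tab-separated timestamp/query format (the real UKGWA log format may well differ), the most frequent queries fall out of a simple count:

```python
from collections import Counter

def top_queries(log_lines, n=5):
    """Count query strings in hypothetical 'timestamp<TAB>query'
    log lines and return the n most frequent -- candidate focus
    areas for comparative test datasets."""
    counts = Counter(line.split("\t", 1)[1].strip().lower()
                     for line in log_lines if "\t" in line)
    return counts.most_common(n)

log = ["2019-11-01T10:02\tolympics 2012",
       "2019-11-01T10:05\tbrexit",
       "2019-11-02T09:14\tOlympics 2012"]
print(top_queries(log, 1))  # [('olympics 2012', 2)]
```

Frequent queries that return poor results under the current string-matching search would be natural first candidates for evaluating the semantic search approaches described above.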

10.8 Track semantic change

In addition to the emergence of entities and themes over time, the evolution of meanings is an avenue for future exploration. Recent research in NLP has developed various computational models for automatic semantic change detection (see the overview in Tahmasebi et al. [2018]) and some studies have used web archive data (Basile and McGillivray [2018], Tsakalidis et al. [2019]). This could be particularly interesting if additional sources of knowledge could be employed in the development of computational systems, especially government-specific knowledge, but also third party sources such as other archives and media.

33 https://webarchive.nationalarchives.gov.uk/20120405144126/http://reference.data.gov.uk/doc/department/co
34 https://www.registers.service.gov.uk/
35 https://webarchive.nationalarchives.gov.uk/search/

10.9 Develop customised tools for the communities

Building on existing tools by developing customised versions could be considered as an avenue of further exploration, and a useful way of opening new collaborations with the web archiving, digital preservation and broader research community.

One good example of such a set of tools is The Archives Unleashed Project.36 The project received funding from the Andrew W. Mellon Foundation37 to develop a web archive search and data analysis tool that enables scholars, librarians and archivists to access, share, and investigate recent history since the early days of the World Wide Web. One of the main aims of The Archives Unleashed Project is to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. The system is a suite of tools that uses WARC files as an input to enable historians, who do not typically possess the requisite skills for such data analysis, to interrogate the data and extract meaning from it.

10.10 Transparency

During the week, the decision of The National Archives not to apply weightings to the search results was discussed. As a well-established method for improving discovery of resources, some positive and negative weightings may serve users well. Such a decision has hitherto not been taken due to the necessity of transparency: that is, the need to indicate clearly to users that a weighting has been applied, what it is and why. In implementation, we might choose to promote transparency and explainability of results by including an ‘opt out’ option for such weightings.

36 https://archivesunleashed.org/
37 https://mellon.org/


11 Team members

Fazl Barez is a PhD student at the Edinburgh Centre for Robotics. Fazl's main research interest lies in interpretability and safety. Fazl facilitated this study group and helped with the analysis and report writing.

David Beavan is Senior Research Software Engineer – Digital Humanities in the Research Engineering Group at The Alan Turing Institute and Research Affiliate at the University of Edinburgh Centre for Data, Culture & Society. He has been working in the Digital Humanities (DH) for over 15 years, working collaboratively, applying cutting edge computational methods to explore new humanities challenges. He is currently Co-Investigator for the flagship Arts and Humanities Research Council (AHRC) funded project Living with Machines. David is co-organiser of the Humanities and Data Science Turing Interest Group, and is Research Engineering's challenge lead for Arts & Humanities in the Data Science for Science programme. He was Principal Investigator (PI) for this challenge.

Mark Bell is a member of the Research and Academic Engagement Team at The National Archives. Mark started his career at The National Archives working on the Arts and Humanities Research Council (AHRC) funded Traces through Time project, researching methods for probabilistically linking biographical records. He was also a researcher on the Engineering and Physical Sciences Research Council (EPSRC) funded ARCHANGEL project, which aimed to increase the sustainability of digital archives through the development of distributed ledger technology (DLT). Mark's other research interests include the use of handwritten text recognition to increase accessibility and enable new ways to explore digitised collections at scale, and the use of the UKGWA as a data source. He was Challenge Owner for this challenge.

John Fitzgerald is a PhD student at the University of Oxford, focusing on investigating how publication data can be used to understand the process of knowledge accumulation. By applying temporal network analysis and natural language processing techniques, the path of researchers through their careers and their interactions can be explored at both a micro and macro level. This information can then be leveraged to make policy recommendations, especially to developing countries, on how to improve the retention and development of the domestic knowledge base. He was one of the participants.

Eirini Goudarouli is the Head of Digital Research Programmes, and a member of the Research and Academic Engagement team at The National Archives. She has extensive experience working on digital and interdisciplinary research projects across the archival and higher education sectors. She was the Co-Investigator of the International Research Collaboration Network in Computational Archival Science (IRCN-CAS; 2019–2020),38 funded by the Arts and Humanities Research Council (AHRC). She led discussions for this challenge, including the co-development of its research goals and the establishment of a formal research agreement between The National Archives and The Alan Turing Institute.

Konrad Kollnig is a PhD student at the University of Oxford, studying data protection law. Previously, he read mathematics and computer science in Aachen (Germany), Edinburgh, and Oxford. He was one of the participants.

Barbara McGillivray is a Turing Research Fellow at The Alan Turing Institute and at the University of Cambridge. She is editor-in-chief of the Journal of Open Humanities Data39 and Co-Investigator of the AHRC-funded project Living with Machines. She founded the Humanities and Data Science Turing Interest Group and her research interests are in computational models of language change. She has worked on semantic change detection from different sources, including the UK Web Archive JISC dataset 1996–2013. She conceived the idea of this challenge.

Federico Nanni is a Research Data Scientist at The Alan Turing Institute, working as part of the Research Engineering Group, and a visiting fellow at the School of Advanced Study, University of London. He completed a PhD in History of Technology and Digital Humanities at the University of Bologna focused on the use of web archives in historical research and has been a post-doc in Computational Social Science at the Data and Web Science Group of the University of Mannheim. He was one of the participants.

38 https://computationalarchives.net/
39 https://openhumanitiesdata.metajnl.com/

Tariq Rashid is the founder of Digital Dynamics, which specialises in helping organisations understand the risks around their use of machine learning and automated decision making, covering issues such as data bias, algorithmic transparency, benchmarking and testing processes, and accountable governance. He is active in building communities around machine learning and artificial intelligence and leads Data Science Cornwall. He is also the author of an accessible guide to neural networks that has been translated into six languages. He is passionate about ethical and responsible use of AI and is part of an EU working group testing their emerging framework. He was one of the participants.

Sandro Sousa is a PhD student at the School of Mathematical Sciences, Queen Mary University of London. He is interested in emergent phenomena in complex systems, networks, segregation dynamics and urban transportation. His PhD focusses on quantifying the heterogeneity of spatially-embedded systems through random walks on graphs, with particular interest in socio-economic segregation. Previously, he worked as a Research Associate on a project funded by the Economic and Social Research Council and the Fundação de Amparo à Pesquisa do Estado de São Paulo which looked into the relationship between spatial segregation and transport accessibility in London and São Paulo. He was one of the participants.

Tom Storrar is Web Archiving Service Owner at The National Archives, where he manages a team to deliver the UK Government Web Archive and the EU Exit Web Archive. In addition to delivering improvements to these online services, he is interested in developing innovative ways in which the archival data can be used for research. He has been involved in several projects involving the use of this data, exploring questions relating to its linguistic content, and network and longitudinal analysis. He was the Challenge Owner of this challenge.

Kirill Svetlov is a research fellow in quantitative finance at the Laboratory for Research of Social-Economic and Political Processes of Modern Society at Saint Petersburg State University. His research interests are closely connected to stochastic process modelling and quantitative finance. His main mathematical interests are partial differential equations, inverse and ill-posed problems, probability theory, stochastic processes, scientific computing, and numerical analysis. He was one of the participants.

Leontien Talboom is a collaborative PhD student at The National Archives, UK and University College London. Her research is on the constraints faced by digital preservation practitioners when making born-digital material accessible. She has a Master’s in Digital Archaeology and completed this with a dissertation focusing on NLP techniques to improve the discoverability of zooarchaeological material in unpublished archaeological reports. She was one of the participants.

Aude Vuilliomenet is a graduate of The Bartlett Centre for Advanced Spatial Analysis at University College London. She is interested in using data science techniques and implementing IoT technologies to explore human interaction and nature dynamics in urban green spaces. She has previously employed NLP techniques to research food trends. She was one of the participants.

Pip Willcox is Head of Research at The National Archives and leads the Research and Academic Engagement team. She has worked at the intersection of digital scholarship and heritage collections for over 20 years, designing and delivering interdisciplinary research in fields including Web Science, digital musicology, history, and heritage collections. She is investigator of two AHRC-funded projects:40 Engaging Crowds: Citizen Research and Heritage Data at Scale as Principal Investigator; and Deep Discoveries as Co-Investigator. She co-developed the goals of this challenge as part of The National Archives’ ongoing collaborations with The Alan Turing Institute, and its research strategy.

12 Notes

The copyright and database right in material produced by staff of The National Archives under this collaboration is Crown copyright or Crown database. Free and flexible re-use of Crown Copyright material is granted under the terms of the Open Government Licence41 (OGL), and Crown Copyright and database right material is published under the OGL.

40 https://www.gov.uk/government/news/government-investment-backs-museums-of-the-future

13 Acknowledgements

This work originated as part of the activities of the Humanities and Data Science Turing Interest Group.42

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and by the Living with Machines project43 under the AHRC grant AH/S01179X/1.

The authors, participants and contributors would like to express their gratitude for the work of the following staff at The National Archives: Lynn Swyny, Copyright Manager; Howard Davies, Policy Manager; Jon Ryder-Oliver, Senior Business Partner; and John Sheridan, Digital Director. We acknowledge and thank the hard work of the following from The Alan Turing Institute: Jules Manser, Project Manager – Data Study Groups; Daisy Parry, Data Study Groups Administrator; Warwick Wood, IT Support Engineer; James Robinson, Senior Research Software Engineer; all our reviewers, and those in the Partnerships and Events teams who made the Data Study Group happen.

41 https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
42 https://www.turing.ac.uk/research/interest-groups/humanities-and-data-science/
43 https://livingwithmachines.ac.uk/


References

Ben Athiwaratkun, Andrew Gordon Wilson, and Anima Anandkumar. Probabilistic FastText for multi-sense word embeddings. arXiv preprint arXiv:1806.02901, 2018.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.

Pierpaolo Basile and Barbara McGillivray. Exploiting the Web for Semantic Change Detection. In Discovery Science 2018, volume 5, pages 194–208. Springer International Publishing, 2018. ISBN 9783030017712. doi: 10.1007/978-3-030-01771-2_13.

Jeffrey Dalton, Laura Dietz, and James Allan. Entity query feature expansion using knowledge base links. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 365–374, 2014.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 551–556, New York, NY, USA, 2004. ACM. ISBN 1-58113-888-1. doi: 10.1145/1014052.1014118.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.

Max Kemman, Martijn Kleppe, and Stef Scagliola. Just google it - digital research practices of humanities scholars, 2013.

Anne Lauscher, Federico Nanni, Pablo Ruiz Fabo, and Simone Paolo Ponzetto. Entities as topic labels: combining entity linking and labeled LDA to improve topic interpretability and evaluability. IJCoL - Italian Journal of Computational Linguistics, 2(2):67–88, 2016.


Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II-1188–II-1196, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3045025.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

Bhaskar Mitra and Nick Craswell. Neural models for information retrieval. arXiv preprint arXiv:1705.01509, 2017.

Federico Nanni, Simone Paolo Ponzetto, and Laura Dietz. Building entity-centric event collections. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 1–10. IEEE, 2017.

Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.

Nick Ruest and Ian Milligan. Web archives analysis at scale with the Archives Unleashed Cloud, 2019. URL http://hdl.handle.net/10315/36119.

Amit Singhal. Introducing the knowledge graph: Things, not strings. Official Blog of Google, 2012.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. Survey of computational approaches to lexical semantic change. Preprint at arXiv, 2018. URL https://arxiv.org/abs/1811.06278.


The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, Anna Krystalli, Alexander Morley, Martin O’Reilly, and Kirstie Whitaker. The Turing Way: A handbook for reproducible data science, March 2019. URL https://zenodo.org/record/3233853.

Adam Tsakalidis, Marya Bazzi, Mihai Cucuringu, Pierpaolo Basile, and Barbara McGillivray. Mining the UK Web Archive for Semantic Change Detection. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, 2019. URL https://www.aclweb.org/anthology/R19-1139.pdf.


turing.ac.uk @turinginst