Binary Trees? Automatically identifying the links between born-digital records


Upload: ross-spencer

Post on 14-Apr-2017

Australian Society of Archivists Conference 2016, Parramatta

Session 5: Description and Innovation

Binary Trees? Automatically Identifying the links between born-digital records.

Ross Spencer, Digital Preservation Analyst, Systems Strategy and Standards team

How do we view the world?

Binary Trees?

But that looks like a network graph?!

It is!

Records (Items) connected across many recordkeeping and archival contexts

Across functions; People; Agency; Subject; Context; References; Subject... Date, File Format...

No boundaries! @ArchivesNZ:ItemA -> references -> @DOC:ItemB

We know this...

Continuum model (Multiple contexts over space and time)

ICA Draft Conceptual Model (RiC)

73 Record Relations RiC-R1 to RiC-R73

Three of which we might be able to (more easily) automate?

Has Copy; Is Copy Of; Has Part

Wherein (I suggest) lies the issue...

Archives NZ Context

In 2011, Archives New Zealand developed its new conceptual model and metadata schema for archival description.

Designed to accommodate description of born-digital records.

much discussion among archivists about the practicalities of describing relationships between items. It was acknowledged that, given the volumes of digital records likely to be in each transfer, neither agency nor Archives staff were likely to examine the content of items visually one-by-one to determine which other items they referred to...

~ Talei Masters

What then do we do?

Mathematical properties of digital files...

Signals -> Numbers -> Encoding Schemes (UTF-8, ASCII) -> Data Structures -> File Formats -> User Content.

Reduce again to a series of numbers that we can interpret to use numerical properties:

Greater than; less than; equal to; not equal to...

In the relationship between numbers we can find the relationships between records

Relations we might be able to create...

Relationship One: Is Identical

Relationship Two: Is Similar

Relationship Three: Contains Hyperlink

Relationship Four: Contains CMS Reference

Relationship Five: Contains Embedded Digital Objects

Relationship Six: Contains Intra-Item Relationships

Relationship Seven: Contains Object References

Relationship Eight: Item Mentions

Relationship One: Is Identical

We often have checksums available in the digital repository

First comparison in a digital transfer...

Does Checksum A still equal Checksum A?

If yes, accept, continue to transfer...

If no... reject! Inspect!

Expose this information in the catalogue and compare; what happens?
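The comparison can be sketched in shell. This is a minimal sketch, assuming GNU coreutils (`sha256sum`, `sort`, `uniq`) are installed; the function name `find_identical` and the choice of SHA-256 are illustrative assumptions, not the workflow used at Archives NZ. It lists every file under a directory whose checksum is shared with at least one other file.

```shell
#!/bin/sh
# Sketch only: report groups of byte-identical files under a directory.
# find_identical is a hypothetical helper name; SHA-256 is an assumed choice.
find_identical() {
    # Hash every regular file, sort by digest, then keep only the lines whose
    # first 64 characters (the SHA-256 hex digest) occur more than once.
    find "$1" -type f -exec sha256sum {} + | sort | uniq -w 64 -D
}

# Usage (the path is illustrative):
# find_identical /home/digital-preservation/accessions
```

Each group of lines sharing a digest is a candidate "Is Identical" relation, whichever archival context the files were transferred under.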

Relationship One: Is Identical

Archival Context A, Record Keeping System A

Archival Context B, Record Keeping System B

Relationship Two: Is Similar

MD5 (Rivest, 1992):

File A (Zero changes): 8c69dc0668c4c73092a7042df45e756adb170742

File B (1 Byte Removed): 6b75b8f235c148efd1b03d9c113664895b5aa7cd
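The effect is easy to reproduce. A minimal sketch, assuming GNU coreutils (`md5sum`, `head`, `cut`) are installed; the sample text and the helper name `md5_of` are illustrative only and are not the files behind the digests on the slide.

```shell
#!/bin/sh
# Sketch only: removing a single byte yields an unrelated MD5 digest.
md5_of() {
    # Print just the hex digest of whatever arrives on standard input.
    md5sum | cut -d ' ' -f 1
}

original='Binary trees? Automatically identifying links between records.'
hash_a=$(printf '%s' "$original" | md5_of)
# Drop the final byte of the same content before hashing it again.
hash_b=$(printf '%s' "$original" | head -c -1 | md5_of)

echo "File A (zero changes):   $hash_a"
echo "File B (1 byte removed): $hash_b"
```

The two digests share no useful structure, which is why a cryptographic hash can answer "Is Identical" but not "Is Similar".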

Relationship Two: Is Similar

SSDEEP (Kornblum, 2006):

File A (Zero changes): 1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9ZBB1U6hP

File C (First 250 Bytes Removed*): 1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9ZBB1U6hP

*Less than two tweets (140 bytes)

Relationship Two: Is Similar

First experiments, SSDEEP (Kornblum), TLSH* (Oliver et al.)

Oliver et al. (2014): Thresholds should be tuned for each application

First application is item level sentencing during transfer feasibility investigations

Manually sentence... 10 records per hour

Follow links to those not of archival value...

* Trend Micro Locality Sensitive Hash!
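To make the threshold idea concrete, here is a minimal sketch using a deliberately simpler stand-in measure: the percentage of unique lines two text files share (a Jaccard index). It is not SSDEEP or TLSH, and it assumes `mktemp`, `sort`, `comm` and `wc` are available. The names `line_similarity`, `is_similar` and the `THRESHOLD` value of 70 are hypothetical.

```shell
#!/bin/sh
# Sketch only: percentage of unique lines two text files share (Jaccard index).
# A stand-in for illustration; it is not SSDEEP or TLSH.
line_similarity() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    shared=$(comm -12 "$a" "$b" | wc -l)   # lines present in both files
    total=$(sort -u "$a" "$b" | wc -l)     # lines present in either file
    rm -f "$a" "$b"
    # Two empty files have nothing to compare; report 0 rather than divide by zero.
    if [ "$total" -eq 0 ]; then echo 0; else echo $((shared * 100 / total)); fi
}

# Hypothetical threshold; it would need tuning for each application.
THRESHOLD=70
is_similar() {
    [ "$(line_similarity "$1" "$2")" -ge "$THRESHOLD" ]
}

# Usage (file names are illustrative):
# is_similar draft.txt final.txt && echo "Is Similar"
```

Lowering the threshold follows more links to review during sentencing; raising it follows fewer but safer ones.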

Relationship Two: Is Similar

MD5 Hash

Fuzzy Hash

Relationship Two: Is Similar

You liked this record... you might also like...

Relationship Three: Contains HTTP://

Burnhill et al. (2015)

64,000 e-theses, 46,000 pointed out to external sources

Websites, external files, etc.

Relationship Three: Contains HTTP://

#!/bin/bash
set -e

#FILES LOCATION
FILES='/home/digital-preservation/accessions'

dp_analysis ()
{
   echo -e $(catdoc "$file" | grep "http://") | tr -d '[:cntrl:]'
   echo
}

# Find loop...
oIFS=$IFS
IFS=$'\n'
time(find "$FILES" -type f | while read -r file; do
   dp_analysis "$file"
done)
IFS=$oIFS

Relationship Three: Contains HTTP://

https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520

~5059 parseable born-digital records

~4800 lines contained hyperlinks
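Counting lines is a first pass; the links themselves can be pulled out with a small filter. A minimal sketch, assuming a grep that supports `-o` and `-E` (GNU and BSD grep both do). The helper name `extract_links` is hypothetical, the pattern is deliberately simple, and unlike the script above it also matches `https://`.

```shell
#!/bin/sh
# Sketch only: list the distinct http(s) links found in text on standard input.
# extract_links is a hypothetical helper; the pattern is deliberately simple.
extract_links() {
    # A link is taken to run until whitespace, a quote or an angle bracket.
    grep -oE 'https?://[^[:space:]"<>]+' | sort -u
}

# Usage, assuming catdoc is installed and "$file" is a Word document:
# catdoc "$file" | extract_links
```

Trailing punctuation (a full stop straight after a URL, say) is kept by this pattern, so results would still need cleaning before being stored as relations.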

Relationship Four: Contains CMS Reference

echo -e $(catdoc "$file" | grep -e "A[0-9]\{6\}")

Matches the Archway catalogue reference number, e.g. A204050; A123456; and not AZ12345

CMS reference could be sent alongside transfer metadata for such searches.

Flag existence (at least): an FYI to the end user, be that the transfer archivist, the agency, or the researcher
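A slightly stricter variant of the search can be sketched as follows, assuming GNU grep (`-o`, `-w`, `-E`). The helper name `extract_cms_refs` is hypothetical. Whole-word matching is an added assumption: it stops a seven-digit token such as A1234567 from being reported as a six-digit reference.

```shell
#!/bin/sh
# Sketch only: pull whole-word catalogue references (A + six digits) from stdin.
# extract_cms_refs is a hypothetical helper name.
extract_cms_refs() {
    # -o prints only the reference; -w rejects matches embedded in longer tokens.
    grep -owE 'A[0-9]{6}' | sort -u
}

# Usage, assuming catdoc is installed and "$file" is a Word document:
# catdoc "$file" | extract_cms_refs
```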

Relationship Five: Contains Embedded Object

$ java -jar tika-app1.13.jar -z --extract-dir=

Relationship Six: Contains Intra-Item Record

Relationship Seven: Contains Object Reference

A digital preservation risk...

Relationship Seven: Contains Object Reference

Extract files from PPT OLE2 -> Read PowerPoint Document Object -> Look for:

Relationship Eight: Item Mentions

Dictionary:

Helen Clark
Helen Elizabeth Clark
John Key
United Nations
Prime Minister
University of Auckland
Jenny Shipley
Labour Party
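Matching against a dictionary like this can be sketched with fixed-string search. A minimal sketch, assuming a grep that supports `-o`, `-F` and `-f`; the helper name `count_mentions` and the one-term-per-line dictionary file are assumptions. Matching here is case-sensitive and exact, so "Helen Clark" and "Helen Elizabeth Clark" are counted as separate terms.

```shell
#!/bin/sh
# Sketch only: count how often each dictionary term appears in text on stdin.
# count_mentions is a hypothetical helper; $1 is a file with one term per line.
count_mentions() {
    # -F treats each term as a fixed string, -f reads the terms from the file,
    # and -o prints every hit on its own line so the hits can be tallied.
    grep -oF -f "$1" | sort | uniq -c | sort -rn
}

# Usage, assuming catdoc is installed and dictionary.txt holds the terms:
# catdoc "$file" | count_mentions dictionary.txt
```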

Relationship Eight: Item Mentions

Discussion

Data structure support needed in the catalogue and the digital preservation system...

Extensible, flexible enough not to (need to) know what the future holds...

AS/NZS 5478:2015, Recordkeeping metadata property reference set (RMPRS) states:

The digital world is increasingly using networked relationships.

Discussion

Verhoeven (2016) Devils Bridges! Ontological, graph/network based infrastructures

Vernacular ontologies

Understand, Make, Improve Quality of our Connections

redistribution of power and the possibilities of world making (and remaking) in the archive

Providing the algorithms are transparent, what then provides a more objective view of the world than machine generated relations?

Discussion

ICA... RiC-R7: is Draft Of semantics (A Speech):

Still a draft if 80% content is different from published?

Draft because it's marked as such in metadata?

Draft when it has been delivered in the wild?

ICA... RiC-R4: has Subject semantics (This Presentation):

Graph technologies?

Digital preservation?

Processing of digital archives?

Binary trees?!

RiC-CM aspires to reflect both facets of the Principle of Provenance, as these have traditionally been understood and practiced, and at the same time recognize a more expansive and dynamic understanding of provenance. It is this more expansive understanding that is embodied in the word Contexts. RiC-CM is intended to enable a fuller, if forever incomplete, description of the contexts in which records emerge and exist, so as to enable multiple perspectives and multiple avenues of access.

Discussion

Impact for record keeping; transfer; digital preservation; discovery...

Digital preservation linked objects, hyperlinks, embedded objects...

Not all geekery!

Remember the content of these records...

Remember the connections...

Remember the use cases for digital preservation; it does not operate in and of itself!

Conclusion

Computer forensic examiners are often overwhelmed with data. Modern hard drives contain more information than can be manually examined in a reasonable time period, creating a need for data reduction techniques. - Kornblum (2006)

So how do we begin?

One relation at a time...

Links

Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101

SSCOMPARE: https://github.com/exponential-decay/sscompare

TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments

Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop

Apache Tika: https://tika.apache.org/

Full Paper: Hopefully in Archives and Manuscripts sometime soon!

Thank you

@beet_keeper

Department of Internal Affairs