binary trees? automatically identifying the links between born-digital records
Australian Society of Archivists Conference 2016, Parramatta
Session 5: Description and Innovation
Binary Trees? Automatically Identifying the Links Between Born-Digital Records
Ross Spencer, Digital Preservation Analyst, Systems Strategy and Standards team
How do we view the world?
Binary Trees?
But that looks like a network graph?!
It is!
Records (Items) connected across many recordkeeping and archival contexts
Across functions; People; Agency; Subject; Context; References; Subject... Date, File Format...
No boundaries! @ArchivesNZ:ItemA -> references -> @DOC:ItemB
We know this...
Continuum model (Multiple contexts over space and time)
ICA Draft Conceptual Model (RiC)
73 Record Relations RiC-R1 to RiC-R73
Three of which we might be able to (more easily) automate?
Has Copy; Is Copy Of; Has Part
Wherein (I suggest) lies the issue...
Archives NZ Context
In 2011, Archives New Zealand developed its new conceptual model and metadata schema for archival description.
Designed to accommodate description of born-digital records.
"...much discussion among archivists about the practicalities of describing relationships between items. It was acknowledged that, given the volumes of digital records likely to be in each transfer, neither agency nor Archives staff were likely to examine the content of items visually one-by-one to determine which other items they referred to..."
~ Talei Masters
What then do we do?
Mathematical properties of digital files...
Signals -> Numbers -> Encoding Schemes (UTF-8, ASCII) -> Data Structures -> File Formats -> User Content.
Reduce these again to a series of numbers, so that we can interpret them using numerical properties:
Greater than; less than; equal to; not equal to...
In the relationship between numbers we can find the relationships between records
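A minimal Python sketch of that reduction, assuming made-up sample strings rather than real records: text is encoded to bytes, the bytes are read as integers, and the ordinary comparison operators then apply.

```python
# Hypothetical sample content; any byte sequences would do.
record_a = "Archives".encode("utf-8")
record_b = "Archives".encode("utf-8")
record_c = "Archive".encode("utf-8")


def as_numbers(data: bytes) -> list:
    """View encoded content as the series of integers it is stored as."""
    return list(data)


print(as_numbers(record_a))
print(as_numbers(record_a) == as_numbers(record_b))  # equal to
print(as_numbers(record_a) != as_numbers(record_c))  # not equal to
print(len(record_c) < len(record_a))                 # less than
```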
Relations we might be able to create...
Relationship One: Is Identical
Relationship Two: Is Similar
Relationship Three: Contains Hyperlink
Relationship Four: Contains CMS Reference
Relationship Five: Contains Embedded Digital Objects
Relationship Six: Contains Intra-Item Relationships
Relationship Seven: Contains Object References
Relationship Eight: Item Mentions
Relationship One: Is Identical
We often have checksums available in a digital repository
First comparison in a digital transfer...
Does Checksum A still equal Checksum A?
If yes, accept, continue to transfer...
If no... reject! Inspect!
Expose this information in the catalogue and compare; what happens?
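A minimal sketch of both comparisons, in Python. The catalogue here is a hypothetical dictionary of item identifier to content, and MD5 is an assumption; a repository would use whatever checksum it already records.

```python
import hashlib
from collections import defaultdict


def checksum(data: bytes) -> str:
    """Checksum of the content (MD5 chosen for illustration only)."""
    return hashlib.md5(data).hexdigest()


def fixity_ok(data: bytes, recorded_checksum: str) -> bool:
    """Transfer check: does the checksum still equal the one recorded earlier?"""
    return checksum(data) == recorded_checksum


def identical_items(catalogue: dict) -> list:
    """Group item identifiers that share a checksum across the catalogue."""
    by_checksum = defaultdict(list)
    for item_id, data in catalogue.items():
        by_checksum[checksum(data)].append(item_id)
    return [sorted(ids) for ids in by_checksum.values() if len(ids) > 1]


# Hypothetical items held in two different archival contexts.
catalogue = {
    "ContextA:Item1": b"Minutes of the meeting",
    "ContextB:Item7": b"Minutes of the meeting",
    "ContextB:Item8": b"Agenda",
}
print(identical_items(catalogue))
```

Each group returned is a candidate "Is Identical" relation that could be surfaced in the catalogue.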
Relationship One: Is Identical
Archival Context A, Record Keeping System A
Archival Context B, Record Keeping System B
Relationship Two: Is Similar
MD5 (Rivest, 1992):
File A (Zero changes): 8c69dc0668c4c73092a7042df45e756adb170742
File B (1 Byte Removed): 6b75b8f235c148efd1b03d9c113664895b5aa7cd
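The same effect can be reproduced with a short Python sketch. The input string is an arbitrary example, not the file used on the slide; the point is only that removing one byte leaves two digests with no useful resemblance.

```python
import hashlib

# Arbitrary sample content standing in for a file.
original = b"The quick brown fox jumps over the lazy dog"
one_byte_removed = original[:-1]

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(one_byte_removed).hexdigest()

# Count hex characters that happen to agree at the same position.
matching = sum(1 for x, y in zip(digest_a, digest_b) if x == y)

print(digest_a)
print(digest_b)
print(f"{matching} of {len(digest_a)} hex characters match by position")
```

A cryptographic checksum can therefore say "identical" or "not identical", but nothing about "similar".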
Relationship Two: Is Similar
SSDEEP (Kornblum, 2006):
File A (Zero changes): 1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9ZBB1U6hP
File C (First 250 Bytes Removed*): 1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9ZBB1U6hP
*Less than two tweets (140 bytes each)
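Why only the start of the signature changes can be illustrated with a toy piecewise hash in Python. This is not the SSDEEP algorithm, only a simplified sketch of the idea behind it: chunk boundaries are chosen from the content itself, and each chunk contributes one character. The window size, modulus, and sample data are all assumptions.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def toy_piecewise_hash(data: bytes, window: int = 7, modulus: int = 64) -> str:
    """Toy piecewise hash: one signature character per content-defined chunk."""
    signature = []
    chunk_start = 0
    for end in range(window, len(data) + 1):
        # The trigger depends only on the last `window` bytes, so chunk
        # boundaries move with the content rather than with file offsets.
        if sum(data[end - window:end]) % modulus == modulus - 1:
            signature.append(_chunk_char(data[chunk_start:end]))
            chunk_start = end
    if chunk_start < len(data):
        signature.append(_chunk_char(data[chunk_start:]))
    return "".join(signature)


def _chunk_char(chunk: bytes) -> str:
    """Map a chunk to a single character via the first byte of its MD5."""
    return ALPHABET[hashlib.md5(chunk).digest()[0] % len(ALPHABET)]


# Deterministic pseudo-random bytes as a hypothetical stand-in for a file.
file_a = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(128))
file_c = file_a[250:]  # first 250 bytes removed

sig_a = toy_piecewise_hash(file_a)
sig_c = toy_piecewise_hash(file_c)
print(sig_a)
print(sig_c)
print(sig_a.endswith(sig_c[1:]))  # everything after the first chunk is shared
```

Only the chunk containing the cut differs, so the two signatures share a long common tail, which is what a similarity score can then measure.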
Relationship Two: Is Similar
First experiments: SSDEEP (Kornblum), TLSH* (Oliver et al.)
Oliver et al. (2014): thresholds should be tuned for each application
First application is item-level sentencing during transfer feasibility investigations
Manually sentence... 10 records per hour
Follow links to those not of archival value...
* Trend Micro Locality Sensitive Hash!
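A sketch of how a tunable threshold might drive this, in Python. The scorer below uses the standard library's difflib as a stand-in; it is not SSDEEP or TLSH, and the item identifiers, sample text, and threshold of 80 are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Stand-in score from 0 to 100; a fuzzy-hash comparison would replace this."""
    return round(SequenceMatcher(None, a, b, autojunk=False).ratio() * 100)


def similar_pairs(items: dict, threshold: int) -> list:
    """Return (id, id, score) for every pair scoring at or above the threshold."""
    pairs = []
    for (id_a, data_a), (id_b, data_b) in combinations(sorted(items.items()), 2):
        score = similarity(data_a, data_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Hypothetical extracted text for three items in a transfer.
items = {
    "item-1": "Draft speech on archives and recordkeeping, version 1",
    "item-2": "Draft speech on archives and recordkeeping, version 2",
    "item-3": "Invoice for catering",
}
for id_a, id_b, score in similar_pairs(items, threshold=80):
    print(f"{id_a} is similar to {id_b} (score {score})")
```

Once one record in a pair has been sentenced manually, the link gives the archivist a candidate decision for the other.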
Relationship Two: Is Similar
MD5 Hash
Fuzzy Hash
Relationship Two: Is Similar
You liked this record... you might also like...
Relationship Three: Contains HTTP://
Burnhill et al. (2015)
64,000 e-theses, 46,000 pointed out to external sources
Websites, external files, etc.
Relationship Three: Contains HTTP://
#!/bin/bash
set -e

# FILES LOCATION
FILES='/home/digital-preservation/accessions'

dp_analysis ()
{
   echo -e $(catdoc "$file" | grep "http://") | tr -d '[:cntrl:]'
   echo
}

# Find loop...
oIFS=$IFS
IFS=$'\n'
time(find "$FILES" -type f | while read -r file; do
   dp_analysis "$file"
done)
IFS=$oIFS
Relationship Three: Contains HTTP://
https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520
~5059 parseable born-digital records
~4800 lines contained hyperlinks
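The same idea as a minimal Python sketch, assuming the text has already been extracted from each record (for example with catdoc, as in the script). The pattern and the sample sentence are illustrative assumptions; unlike the grep above, this also picks up https:// links.

```python
import re

# Match http:// or https:// up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_hyperlinks(text: str) -> list:
    """Return the hyperlinks found in already-extracted text."""
    # Strip trailing punctuation that usually belongs to the sentence, not the URL.
    return [match.rstrip(".,;:)") for match in URL_PATTERN.findall(text)]


sample = "See http://example.org/report and (https://example.org/annex)."
print(find_hyperlinks(sample))
```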
Relationship Four: Contains CMS Reference
echo -e $(catdoc "$file" | grep -e "A[0-9]\{6\}")
Matches the Archway catalogue reference number, e.g. A204050; A123456; and not AZ12345
CMS reference could be sent alongside transfer metadata for such searches.
Flag existence (at least): an FYI to the end user, be that the transfer archivist, the agency, or the researcher
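A Python sketch of the same match, again assuming extracted text as input. The word boundaries are an addition that makes it slightly stricter than the grep pattern: a reference followed by a seventh digit is not reported.

```python
import re

# "A" followed by exactly six digits, not embedded in a longer token.
ARCHWAY_REFERENCE = re.compile(r"\bA[0-9]{6}\b")


def find_cms_references(text: str) -> list:
    """Return candidate catalogue reference numbers mentioned in the text."""
    return ARCHWAY_REFERENCE.findall(text)


# Hypothetical sentence using the example references from the slide.
sample = "See A204050 and A123456, but not AZ12345 or A1234567."
print(find_cms_references(sample))
```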
Relationship Five: Contains Embedded Object
$ java -jar tika-app1.13.jar -z --extract-dir=
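For zip-based containers such as DOCX or PPTX, the presence of embedded objects can also be flagged without extracting them. The Python sketch below is not how Tika works; it assumes the common Office Open XML layout in which embedded objects and media sit under embeddings/ and media/ folders, and it builds a hypothetical in-memory container to demonstrate.

```python
import io
import zipfile

# Folder names conventionally used for embedded parts (an assumption).
EMBEDDED_DIRS = ("/embeddings/", "/media/")


def list_embedded_parts(container) -> list:
    """List parts of a zip-based container stored under embeddings/ or media/."""
    with zipfile.ZipFile(container) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if any(folder in name for folder in EMBEDDED_DIRS)
        )


# Hypothetical container built in memory for the example.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/embeddings/oleObject1.bin", b"\x00")
    archive.writestr("word/media/image1.png", b"\x00")

print(list_embedded_parts(buffer))
```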
Relationship Six: Contains Intra-Item Record
Relationship Seven: Contains Object Reference
A digital preservation risk...
Relationship Seven: Contains Object Reference
Extract files from PPT OLE2 -> Read PowerPoint Document Object -> Look for:
Relationship Eight: Item Mentions
Dictionary:
Helen Clark
Helen Elizabeth Clark
John Key
United Nations
Prime Minister
University of Auckland
Jenny Shipley
Labour Party
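A minimal Python sketch of dictionary matching, using the terms above. The function name, case-insensitive matching, and the sample sentence are assumptions made for illustration.

```python
import re

DICTIONARY = [
    "Helen Clark",
    "Helen Elizabeth Clark",
    "John Key",
    "United Nations",
    "Prime Minister",
    "University of Auckland",
    "Jenny Shipley",
    "Labour Party",
]


def find_mentions(text: str, terms=DICTIONARY) -> dict:
    """Count whole-phrase, case-insensitive mentions of each dictionary term."""
    mentions = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            mentions[term] = count
    return mentions


# Hypothetical extracted text from a record.
sample = (
    "The Prime Minister, Helen Clark, spoke at the University of Auckland. "
    "Helen Clark also met the Labour Party."
)
print(find_mentions(sample))
```

Each term found becomes a candidate "Item Mentions" relation from the record to the named person or organisation.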
Relationship Eight: Item Mentions
Discussion
Data structure support needed in the catalogue and the digital preservation system...
Extensible, flexible enough not to (need to) know what the future holds...
AS/NZS 5478:2015, Recordkeeping metadata property reference set (RMPRS) states:
The digital world is increasingly using networked relationships.
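One way to picture such a structure is a store of typed links between items, where the relation type is an open string so that new relations can be added later without changing the schema. This Python sketch is an illustration only; the class name, method names, and relation labels are hypothetical and are not taken from any catalogue or standard.

```python
from collections import defaultdict


class RelationStore:
    """Items as nodes; relations as labelled, directed links between them."""

    def __init__(self):
        self._outgoing = defaultdict(list)

    def add(self, source: str, relation: str, target: str) -> None:
        """Record that `source` has `relation` to `target`."""
        self._outgoing[source].append((relation, target))

    def related(self, source: str, relation=None) -> list:
        """Return (relation, target) pairs, optionally filtered by relation type."""
        return [
            (rel, target)
            for rel, target in self._outgoing.get(source, [])
            if relation is None or rel == relation
        ]


store = RelationStore()
store.add("@ArchivesNZ:ItemA", "references", "@DOC:ItemB")
store.add("@ArchivesNZ:ItemA", "isSimilar", "@ArchivesNZ:ItemC")
print(store.related("@ArchivesNZ:ItemA"))
print(store.related("@ArchivesNZ:ItemA", relation="references"))
```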
Discussion
Verhoeven (2016) Devils Bridges! Ontological, graph/network based infrastructures
Vernacular ontologies
Understand, Make, Improve Quality of our Connections
redistribution of power and the possibilities of world making (and remaking) in the archive
Providing the algorithms are transparent, what then provides a more objective view of the world than machine generated relations?
Discussion
ICA... RiC-R7: is Draft Of semantics (A Speech):
Still a draft if 80% of the content is different from the published version?
Draft because it's marked as such in metadata?
Draft when it has been delivered in the wild?
ICA... RiC-R4: has Subject semantics (This Presentation):
Graph technologies?
Digital preservation?
Processing of digital archives?
Binary trees?!
RiC-CM aspires to reflect both facets of the Principle of Provenance, as these have traditionally been understood and practiced, and at the same time recognize a more expansive and dynamic understanding of provenance. It is this more expansive understanding that is embodied in the word Contexts. RiC-CM is intended to enable a fuller, if forever incomplete, description of the contexts in which records emerge and exist, so as to enable multiple perspectives and multiple avenues of access.
Discussion
Impact for record keeping; transfer; digital preservation; discovery...
Digital preservation linked objects, hyperlinks, embedded objects...
Not all geekery!
Remember the content of these records...
Remember the connections...
Remember the use cases for digital preservation; it does not operate in and of itself!
Conclusion
Computer forensic examiners are often overwhelmed with data. Modern hard drives contain more information that cannot be manually examined in a reasonable time period creating a need for data reduction techniques. - Kornblum (2006)
So how do we begin?
One relation at a time...
Links
Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101
SSCOMPARE: https://github.com/exponential-decay/sscompare
TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments
Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop
Apache Tika: https://tika.apache.org/
Full Paper: Hopefully in Archives and Manuscripts sometime soon!
Thank you
@beet_keeper
Department of Internal Affairs