The Library of Congress Experience

Wednesday 5/21/2014, SMPTE Bits by the Bay 2014

James Snyder – Library of Congress


LOC Storage

• SANs & NAS used for transactional storage:

– Shared production storage (files created directly on shared storage)

– Quality control & proxy generation production storage

– Staging area for data robot & transmission to backup off-site location

• Failed hard drives replaced as needed

• Complete systems replaced every 5 years or so


LOC Transactional Storage

• SAN connectivity: Minimum of FC-8 to meet required transfer and processing speeds

– Dual FC-8 and FC-16 also in use for very high speed applications like film scanning

• NAS connectivity:

– 1Gig-E OK for audio and SD content

– 10Gig-E fiber required for HD content

• If connections shared with other devices, storage devices must be given network priority


LOC Digital Repository

• Data set is effectively permanent

• Archive contents must stand on their own (no external databases required to know all about essence within a file)

• Must be file format agnostic

• Must scale to very large size (EiB+)

• Very Low Bit Error Rates (BER)

– 10^-19


LOC-MBRS Content Archive

• Dual copies, geographically dispersed

• Oracle Sun StorageTek T10000-C data tapes

– 5 TB per tape

– 9800 slots, 4900 currently populated

– Will skip a generation & go to E tapes when available

• SHA-1 cryptographic checksum used to verify integrity of files while in transit to archive

– Also used to verify integrity of the archive

• Metadata maintained in databases

– Copy needs to be inserted in each archive file as well

• Proxy files maintained on servers; also stored in the archive with the archive, QC, metadata files
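
The checksum step on this slide can be sketched roughly as follows. This is a minimal illustration, not the Library's actual tooling: the function names `sha1_of_file` and `verify_transfer` are hypothetical, and only Python's standard `hashlib` module is assumed.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so very large archive files never
        # have to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source, destination):
    """True if the copy at destination has the same SHA-1 as the source."""
    return sha1_of_file(source) == sha1_of_file(destination)
```

The same digest, stored alongside the file at ingest, could later be recomputed to check the archive copy itself, which is the second use the slide mentions.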


LOC-MBRS Content Archive

• Lesson learned: keeping associated files together with a persistent ID is important

• Current system:

– File system creates a sequentially numbered ID for each file and associates it with the MAVIS record in the database

– Problem: original file names are lost

– Tie to the original unique MAVIS ID is lost at the file level: file names have no relation to the original ID

– When files are pulled from the archive the sequential number is retained as the file name, making renaming a requirement for any workflow tracking


LOC-MBRS Content Archive

• Lesson learned: keeping associated files together with a persistent ID is important

• What is needed:

– Wrap associated assets together in one archive file/object

– Create single MAVIS-based unique ID that persists all the way to the file name in the archive

– Append the MAVIS ID to all file names so original file name is retained, but also tied to the master asset ID number should the files get separated
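
One way the "append the ID to the file name" idea could look is sketched below. The `__` separator, the function names, and the sample ID are all assumptions made for illustration; the slides do not specify a naming format.

```python
from pathlib import PurePosixPath


def tag_with_asset_id(file_name, mavis_id, separator="__"):
    """Append the asset ID to the stem, keeping the original name and extension."""
    path = PurePosixPath(file_name)
    return f"{path.stem}{separator}{mavis_id}{path.suffix}"


def asset_id_of(tagged_name, separator="__"):
    """Recover the asset ID from a name produced by tag_with_asset_id."""
    stem = PurePosixPath(tagged_name).stem
    # Split from the right so separators inside the original name are ignored.
    return stem.rsplit(separator, 1)[1]


# Hypothetical example: a proxy file for asset "1234567".
print(tag_with_asset_id("reel_01.mxf", "1234567"))
print(asset_id_of("reel_01__1234567.mxf"))
```

Because the ID sits at the end of the stem, the original file name survives intact and a file that gets separated from its siblings can still be traced back to the master asset.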


LOC-MBRS Content Archive

• Lesson learned: more than 100 million files and the file management system starts to slow down significantly

• File dependent formats need to be in wrappers

– DPX: each film frame is a single file

– 16mm collection: 40 million feet = 1.6 billion frames

– 35mm nitrate collection: 140 million feet = 2.24 billion frames

– MXF? AXF? Depends on how you process your data
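
The frame counts above follow from the standard frames-per-foot figures for each gauge (40 for 16mm, 16 for 4-perf 35mm). A small sketch of the arithmetic, with hypothetical function names, shows why one-file-per-frame DPX collides with the 100 million file threshold:

```python
# Standard frames per foot of film: 16mm and 4-perf 35mm.
FRAMES_PER_FOOT = {"16mm": 40, "35mm": 16}

# File count at which the slides report the file system slowing down.
FILE_COUNT_THRESHOLD = 100_000_000


def frame_count(feet, gauge):
    """Number of frames (one DPX file each) in a given footage of film."""
    return feet * FRAMES_PER_FOOT[gauge]


def times_over_threshold(files, threshold=FILE_COUNT_THRESHOLD):
    """How many times the file count exceeds the slowdown threshold."""
    return files / threshold


frames_16mm = frame_count(40_000_000, "16mm")    # 1.6 billion
frames_35mm = frame_count(140_000_000, "35mm")   # 2.24 billion
print(frames_16mm, times_over_threshold(frames_16mm))
print(frames_35mm, times_over_threshold(frames_35mm))
```

Stored as individual frames, either collection alone is more than an order of magnitude past the threshold, which is the case for wrapping frames into larger container objects.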


Film Storage Calculator


Other LOC Repositories

• Hold digital copies of virtually every type of content

• Dual copies, geographically dispersed

• Oracle Sun StorageTek T10000-C & D data tapes

– C: 5 TB/tape; D: 8.5 TB/tape

• IBM TS-1140 also used

• SHA-1 cryptographic checksum used to verify integrity of files while in transit to archive

– Also used to verify integrity of the archive

• Metadata handled in different ways depending on internal client needs


Long Term Archive Challenges

• Standardized data structures to ease long-term repository collection maintenance

– Currently vendor dependent

– AXF standard designed to enhance long-term sustainability of a content collection even through multiple migrations

• File specs, Archive Object specs and metadata standards must be well documented


What Is Archived

• All files are saved individually

• Includes all files produced in the archive process for each asset:

– Archive file

– Viewing/listening proxy

– Any production proxies

– Metadata files

• Not a sustainable model for the future

• Future: AXF standard (SMPTE 2034) Archive Objects may replace BagIt archive objects


Migration at Five+ Years: Lessons Learned


Lessons Learned

• Mass migration is not only possible, but relatively easy once you have the processes figured out

• Physical workflows (people, cataloging, movement of assets) MUST be created and implemented at the same time as the technical workflows

• Metadata is one of the most complex challenges, but MUST be solved

• Get the humans out of the process as much as possible!

• Cataloging processes must be updated to raise throughput

– Cataloging terminology and media terminology are very different!


Lessons Learned

• A lot of work is required to maintain the machinery to play back content

• Everything it took to run the machines 10-50 years ago is still needed today!

– Manuals, training for personnel, spare parts

– Compressed air; 3-phase power

• Some parts simply can’t be replaced and are failing due to age

– Integrated circuits don’t age well! They weren’t designed to last for decades


Lessons Learned

• Fixity checks on files are essential

• Data tapes are still the lowest cost-per-unit for medium and long-term storage

• Don’t move your data tapes if you don’t have to!

– The very act of moving them physically increases bit error rates

• Bit error rate matters!

• What does “30 years” really mean?


Lessons Learned

• You CAN make archiving a part of an overall production facility & workflows

– Archive files, production proxies, streaming proxies: one workflow, multiple outputs

– Metadata: Identify, use, document, IMPLEMENT!!!

• You must design, document and implement files and metadata just as the facility itself was designed, documented and implemented.

• Change control documentation is essential


Digital Archive: The Challenge

• Long term archival storage planning:

– Regular migrations (every 5-7 years)

– Verifying your archive

– Cryptographic checksums (SHA-1) to validate archive integrity

• Future data workflows

– Updating file wrappers as necessary (archival MXF spec AS-07 being worked on)

– Updating metadata within file wrappers at migration points

• Plan for the future of storage

– What happens when increased capacities level off?


Migrations

• Every 5-7 years

• Plan to skip a tape generation:

– T10000-C tape contents will be migrated when the T10000-E tapes become available

– Some or all of the retired T10000 tapes & drives kept for age testing

• Is the 30 year vendor rating real? Let’s find out….


Born Digital Content

©, collections and beyond


Born Digital

• Two meanings:

– Content category: content created digitally as files

– Born Digital System: how to handle content

• Born Digital File Reception & Processing

– Secure file processing system with strong security procedures

• Live Capture:

– Live recordings

– Retention for commercial verification

• Physical Media Intake

– Ingesting file content stored on file-based physical media


Born Digital: LOC

• Content types:

– Direct electronic submission for © & collections: television, film industry, internet; software, gaming & learning

– Files generated by Live Capture & Physical Media Intake

– Internal Library audiovisual productions

– Files generated by direct archive file creation

• Convert Congressional video and audio from physical media to direct archival encoding

– Direct encoding to HD JPEG2000 Lossless MXF OP1a of LIVE video with metadata

– Will add 2-5 PB per year to archive when in production

– Requires automated workflow to minimize human involvement
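
To put the 2-5 PB per year figure in terms of tape, the rough sketch below combines it with the T10000-C numbers quoted earlier in the deck (5 TB per tape, 9800 slots, 4900 populated). It assumes decimal units (1 PB = 1000 TB), a single copy, no compression, and no move to larger tapes; none of those assumptions come from the slides, and the function names are hypothetical.

```python
import math

TAPE_TB = 5             # T10000-C capacity quoted in the slides
TOTAL_SLOTS = 9800      # library slots quoted in the slides
POPULATED_SLOTS = 4900  # slots in use as of the talk


def tapes_per_year(petabytes_per_year, tape_tb=TAPE_TB):
    """Tapes needed per year for one copy, assuming 1 PB = 1000 TB."""
    return math.ceil(petabytes_per_year * 1000 / tape_tb)


def years_until_full(petabytes_per_year):
    """Years before the free slots run out at a constant ingest rate."""
    free_slots = TOTAL_SLOTS - POPULATED_SLOTS
    return free_slots / tapes_per_year(petabytes_per_year)


for rate in (2, 5):
    print(rate, "PB/yr:", tapes_per_year(rate), "tapes/yr,",
          round(years_until_full(rate), 2), "years of free slots")
```

Under these assumptions the low end of the range uses 400 tapes a year and the high end 1000, so the free slots last somewhere between about five and twelve years, which sits in the same range as the 5-7 year migration cycle discussed elsewhere in the deck.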


Born Digital

• Challenges:

– Interoperability for both current use and for the long term: users need to view today; retention period for archiving & use

– Long-term archive storage

• Use international standards whenever possible:

– Interoperability: MXF & metadata

– Long-term archive storage: SMPTE AXF (2034-1)


Future Challenges?


Future Challenges: Content

• 4k Digital Cinema is here

• 4k video will start fall 2014

• 4k home camcorders are in the wings

• 8k is in the wings

• Wider color ranges

• Higher frame rates

• Eventual retirement of NTSC (1.001) frame rates?

– Please oh please oh PLEASE!!!

– No more drop frame time codes

– Transition will need to be planned: consumer equipment can’t handle even integer frame rates yet

• Future archive file designs?


Future Challenges

• Finding enough equipment to keep the migrations going

– Hey buddy, can you spare a piece of obsolete equipment?

• Growing the Digital Repository into the exabyte realm…and beyond?

– Zettabyte… yottabyte… then what…?

• Developing the knowledge and training needed to make sure the employees working on your project are adequately trained with proper documentation

– We are the LAST GENERATION to have worked with analog in the production environment! The next will have to be taught.

– Manuals, basic training info, retired standards documentation

• Updating workflows and workflow software for new requirements


Future Challenges

• Dealing with how our collection continues to age

• Studying how our digital collection’s physical storage & equipment ages

• Watching how automated QC functionality is working & adjusting as necessary

• Encouraging vendors to think beyond the 2-5 year survival window: just because you WANT your equipment to obsolesce doesn’t mean it won’t be out there for another 50 years!

• Storage vendors: what do the survival time period statements (like ‘30 years’) REALLY mean? What’s under the hood?


IT Issues

• Most commercial IT equipment has bit error rates of 10^-14, including Ethernet backbone equipment: what good is a storage BER of 10^-17 when your system’s best BER is 10^-14?

• How often to check data integrity?

– Continuous process above a certain archive size

– Reading the data can also damage it!

• How often to migrate?

– Individual files: every 5-10 years

– Update the metadata when you migrate
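
The bit error rate point can be made concrete with a little arithmetic. The sketch below is an illustration only: it assumes decimal petabytes, independent errors, and the usual small-rate approximation that the error rates of the stages in a data path add. The function names are hypothetical.

```python
def expected_bit_errors(petabytes, bit_error_rate):
    """Expected number of flipped bits when moving this much data once."""
    bits = petabytes * 10**15 * 8  # decimal petabytes to bits
    return bits * bit_error_rate


def end_to_end_ber(*stage_bers):
    """Approximate BER of a chain of stages: small independent rates add."""
    return sum(stage_bers)


# Moving 1 PB once through equipment at each error rate from the slide.
print(expected_bit_errors(1, 1e-14))  # commodity IT equipment
print(expected_bit_errors(1, 1e-17))  # storage quoted at 10^-17

# A 10^-17 storage device behind a 10^-14 network path.
print(end_to_end_ber(1e-17, 1e-14))
```

At 10^-14 a single pass over one petabyte is expected to flip roughly 80 bits, against roughly 0.08 at 10^-17, and the combined rate of the chain stays essentially at 10^-14. The weakest stage dominates, which is the question the slide raises.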


Thank You!

James Snyder
Senior Systems Administrator
National Audio Visual Conservation Center
Culpeper, [email protected]
202-707-7097