Architecting for Data Integrity


Page 1: Architecting for Data Integrity

Packard Campus for Audio Visual Conservation, http://www.loc.gov/avconservation/packard/

Formerly NAVCC

Architecting for Data Integrity

Page 2: Architecting for Data Integrity


The Challenge

“This is an Archive. We can’t afford to lose anything!” Our customers are custodians of the history of the United States and do not want to consider the potential loss of content that is statistically likely to happen at some point.

Solutions
• Generate a SHA1 digest at the source and verify the content after each copy (a minimal sketch follows this list)
• Store 2 copies of everything digital
• Test and monitor for failures: Archive Integrity Checker and fix_damaged
• Refresh the damaged copy from the good copy
• Automate as much as possible
• Acknowledge that someday we’re going to lose something. What’s that likelihood? What costs are reasonable to reduce it?
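A minimal sketch of the generate-at-source, verify-after-copy step, using Perl's core Digest::SHA module. The command-line paths and the plain filesystem copy are illustrative assumptions, not the archive's actual ingest tooling:

```perl
#!/usr/bin/perl
# Minimal sketch of "generate SHA1 at source, verify after each copy".
# The paths and the File::Copy step are hypothetical stand-ins for the
# production ingest workflow.
use strict;
use warnings;
use Digest::SHA;
use File::Copy qw(copy);

my ($source, $dest) = @ARGV;
die "usage: $0 source dest\n" unless defined $source && defined $dest;

# Digest computed at the source, before the file ever moves.
my $expected = Digest::SHA->new(1)->addfile($source)->hexdigest;

copy($source, $dest) or die "copy failed: $!";

# Re-read the copy and compare digests; any mismatch triggers a retransmit.
my $actual = Digest::SHA->new(1)->addfile($dest)->hexdigest;
if ($actual eq $expected) {
    print "OK  $dest  $expected\n";
} else {
    die "MISMATCH on $dest: expected $expected, got $actual\n";
}
```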

Page 3: Architecting for Data Integrity

Packard Campus for Audio Visual Conservationhttp://www.loc.gov/avconservation/packard/

Formerly NAVCC

Real World Example

About 2 years ago another video archive with similar hardware and software reported errors in content staged from about 1 out of every 5 of their T10000B tapes.

We had seen no unrecoverable errors in staging, but that is not enough assurance.

Given that we store a SHA1 digest for each file, we developed a simple Perl program to stage content from our oldest tapes and compare the SHA1 values (a sketch of that check follows). No errors were found in the initial test of 600 tapes.
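A rough sketch of that verification pass, assuming the stored digests are available as a flat manifest of "sha1 path" lines; the real program reads them from the archive's own records, and simply reading a file is what forces the HSM to stage it back from tape:

```perl
#!/usr/bin/perl
# Sketch of the verification pass: stage each file and compare its SHA1
# against the digest recorded at ingest.  The manifest format (one
# "sha1  path" pair per line) is an assumed stand-in for the archive's
# actual digest records.
use strict;
use warnings;
use Digest::SHA;

my ($manifest) = @ARGV;
open my $fh, '<', $manifest or die "cannot open $manifest: $!";

while (my $line = <$fh>) {
    chomp $line;
    my ($stored, $path) = split /\s+/, $line, 2;
    next unless defined $path;

    # Reading the file forces the HSM to stage it from tape to disk.
    my $computed = eval { Digest::SHA->new(1)->addfile($path)->hexdigest };
    if (!defined $computed) {
        print "STAGE-FAIL  $path\n";
    } elsif ($computed ne $stored) {
        print "MISMATCH    $path\n";
    } else {
        print "OK          $path\n";
    }
}
close $fh;
```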

Something else in the environment or processes must be contributing to content loss.

We cannot afford to wait for problems to present themselves.

Page 4: Architecting for Data Integrity


Current Solutions

Archive Integrity Checker
• Contracted with a developer to write code to systematically check all the content and keep track of the status of each verification in a database
• Groups the files to be staged by tape and then sorts them by location on tape to improve efficiency (a sketch of this ordering follows below)
• Requires some management, as it must stage the content to our ingest storage, thereby affecting our ingest speeds
• Uses at least 1 tape drive
• Generates 2X the file size in throughput: a write to disk and a read for the SHA1

fix_damaged.pl
• Uses sfind -damaged to identify files and then stages them to test whether the tape is bad or there was a transient issue
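A sketch of the stage-ordering idea, assuming the file list with its tape VSN and on-tape position has already been extracted (for example from sls -D output) into a simple whitespace-separated table; the real Archive Integrity Checker also records each verification result in a database, which is omitted here:

```perl
#!/usr/bin/perl
# Sketch of the staging order used by the integrity checker: group files
# by tape VSN, then sort by position on the tape so each cartridge is
# mounted once and read mostly front to back.
# Input format (path, VSN, numeric position per line) is an assumed
# stand-in for whatever the real tool extracts from the SAM metadata.
use strict;
use warnings;

my %by_vsn;
while (my $line = <STDIN>) {
    chomp $line;
    my ($path, $vsn, $position) = split /\s+/, $line;
    next unless defined $position;
    push @{ $by_vsn{$vsn} }, { path => $path, pos => $position };
}

for my $vsn (sort keys %by_vsn) {
    my @ordered = sort { $a->{pos} <=> $b->{pos} } @{ $by_vsn{$vsn} };
    print "# tape $vsn\n";
    print "$_->{path}\n" for @ordered;
}
```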

Page 5: Architecting for Data Integrity


Alternative Solution

Verification flag in SAM (ssum); a similar capability exists in HPSS
• Enabled via software for a collection of files (filesystem, directory, or individual files)
• Generates a 32-bit running checksum for the file and stores it in SAM’s metadata (viewable via sls -D)
• The file is staged back to disk from tape and verified
• Takes extra time and generates 2X additional throughput requirements; make sure to architect for it. We turned it off.
• Can be turned on to generate only, with the checksum used when files are staged back from tape (a hedged sketch follows this list)
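A hedged sketch of turning the checksum attributes on for a directory tree and then inspecting the stored values. The -g (generate) and -u (use/verify on stage) options are as we understand the SAM-QFS ssum(1) man page; confirm the exact flags against your release before relying on this:

```perl
#!/usr/bin/perl
# Hedged sketch: enable SAM checksumming for every file under a directory
# via ssum, then inspect the stored metadata with sls -D.
# Flag meanings (-g = generate at archive time, -u = validate when staged)
# are assumptions from the ssum(1) man page; verify against your release.
use strict;
use warnings;

my $dir = shift @ARGV or die "usage: $0 /sam/filesystem/dir\n";

# Mark every regular file so SAM generates a checksum when archiving it
# and verifies that checksum whenever the file is staged back from tape.
system('find', $dir, '-type', 'f', '-exec', 'ssum', '-g', '-u', '{}', ';') == 0
    or die "ssum failed: $?";

# sls -D prints the extended SAM metadata, including the stored checksum.
system('sls', '-D', $dir) == 0 or warn "sls failed: $?";
```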

Page 6: Architecting for Data Integrity


Bigger Problems

File sizes are increasing; we will soon be processing 1 TB files.
• Audio: 1 GB (96 kHz / 24-bit)
• SD video: 30 GB (MXF-wrapped JPEG 2000)
• HD video: hundreds of GB
• 4K scan of film: 1 TB

This means that a failed SHA1 will require retransmission, costing hours of time and wasted resources. Typical individual PC disk speeds are 25 MB/s. Typical times to retransmit (the arithmetic is sketched after this list):
• Audio: < 1 min
• SD video: < 30 min
• HD video: < 3 hours
• 4K scan of film: < 4 hours (assuming faster drives in this environment)
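The back-of-the-envelope arithmetic behind those retransmission times: time = file size / sustained throughput. The 25 MB/s rate comes from the slide; the roughly 250 GB HD size and the 75 MB/s "faster drive" rate for the 4K case are assumptions chosen to match the stated bounds:

```perl
#!/usr/bin/perl
# Retransmission time = file size / sustained disk throughput.
# 25 MB/s is the slide's "typical individual PC disk" rate; the ~250 GB HD
# size and the 75 MB/s rate for the 4K case are assumed values.
use strict;
use warnings;

my @cases = (
    [ 'Audio (1 GB)',          1_000,     25 ],
    [ 'SD video (30 GB)',      30_000,    25 ],
    [ 'HD video (~250 GB)',    250_000,   25 ],
    [ '4K film scan (1 TB)',   1_000_000, 75 ],
);

for my $c (@cases) {
    my ($label, $mb, $mb_per_s) = @$c;
    my $minutes = $mb / $mb_per_s / 60;
    printf "%-22s ~%.0f minutes (%.1f hours)\n", $label, $minutes, $minutes / 60;
}
```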

Ensuring data integrity becomes more crucial as file sizes increase, and monitoring for marginal errors greatly improves the reliability and cost-effectiveness of technology investments.

Page 7: Architecting for Data Integrity


Datapath Integrity Field (T10-DIF)

Page 8: Architecting for Data Integrity


Oracle T10000C Solution

T10-DIV
• A variation of the T10-DIF (Data Integrity Field) that adds a CRC to each data block from the tape kernel driver to the tape drive (off/on/verify); an illustrative per-block CRC sketch follows this list
• Verifies the FC path to the tape drive and verifies each write to tape by reading it back afterward
• Requires Solaris 11.1 and SAM 5.3
• Can be used to verify the tape content without staging it back to disk
• We look forward to a tape-to-tape migration that will use this information to validate the content read from tape during the migration. This assumes very few files are deleted, so repacking will provide little value; true for our archive.
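For illustration only, here is a per-block 16-bit CRC guard tag of the kind the T10 DIF specification defines (polynomial 0x8BB7); how the T10000C drive and the Solaris tape driver actually compute and carry their CRC is Oracle's implementation detail and is not claimed here:

```perl
#!/usr/bin/perl
# Illustration only: a per-block 16-bit CRC guard tag as defined for T10 DIF
# (polynomial 0x8BB7, no reflection, zero initial value).  The 512-byte
# block size is purely for illustration; tape block sizes differ.
use strict;
use warnings;

sub crc16_t10dif {
    my ($block) = @_;
    my $crc = 0x0000;
    for my $byte (unpack 'C*', $block) {
        $crc ^= $byte << 8;
        for (1 .. 8) {
            $crc = ($crc & 0x8000) ? ((($crc << 1) ^ 0x8BB7) & 0xFFFF)
                                   : (($crc << 1) & 0xFFFF);
        }
    }
    return $crc;
}

# Tag each block of a data stream with its CRC so corruption introduced
# anywhere along the path can be detected block by block.
my $block_size = 512;
open my $in, '<:raw', $ARGV[0] or die "cannot open $ARGV[0]: $!";
my ($offset, $block) = (0, '');
while (read($in, $block, $block_size)) {
    printf "block at %10d  crc 0x%04X\n", $offset, crc16_t10dif($block);
    $offset += length $block;
}
close $in;
```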

Page 9: Architecting for Data Integrity


Oracle T10000C DIV Performance

T10-DIV
• How does DIV scale? There’s a computation involved; where does that happen?
• Oracle created a test bed and shared the results of their testing. In the process several interesting things were discovered and improvements were added.
• There are three states for DIV: off, on, verify. These results are for the on setting.
• The server used here is a T4-4 with a limited domain of 32 of its 64 cores
• Eight FC8 ports: 3 drives zoned per port, 4 ports for tape, 4 ports for disk

Page 10: Architecting for Data Integrity


Oracle T10000C DIV Performance

Individual drive throughput

[Chart: SAM Archive Average Performance Per Drive. Average MB/s per drive with DIV on the y-axis (0 to 300) versus number of T10KC drives on the x-axis (0 to 12).]

• Average performance per drive is not impacted by additional drives

• Optimal transfer rates are maintained as drives are added

• Recommend 4 tape drives per HBA port

Page 11: Architecting for Data Integrity


Key Architecture Points
• Know the amount of data you need to move over a given time period: 5 TB / 12 hours for us (a throughput-budget sketch follows this list)
• Know the speeds; test the configuration: disk, tape, ports, protocol (FC/Ethernet), server (backplane)
• We built our cache for 6X our throughput needs because:
  Daily: write once, read three times (once for the SHA1 check, two more times for each tape copy)
  Testing: write once, read once
• We built a separate set of LUNs for migration
  Migration: write once, read three times (once for the SHA1 check, two more times for each tape copy)
• Test and monitor
  Standard operating procedure for checking network, systems, storage and tape
  Syslog on all devices that support it; build pattern recognition for issues whenever they present themselves
  HSM filesystem checks for damaged files and errors
  Storage Tape Analytics
  Service Delivery Platform / Automatic Service Request
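The throughput-budget arithmetic behind the 6X cache sizing, using the slide's 5 TB per 12-hour window; the one-write-plus-three-reads multiplier mirrors the daily workflow described above:

```perl
#!/usr/bin/perl
# Throughput-budget arithmetic behind the "6X" cache sizing.
# The 5 TB / 12 hour figure comes from the slide; the one-write-plus-
# three-reads pattern is the daily workflow (SHA1 check plus two tape
# copies), with testing adding another write and read on top.
use strict;
use warnings;

my $tb_per_window = 5;    # data to move in the window
my $window_hours  = 12;

my $mb          = $tb_per_window * 1_000_000;       # TB -> MB (decimal)
my $ingest_mb_s = $mb / ($window_hours * 3600);     # raw ingest rate

printf "Raw ingest rate:     %.0f MB/s\n", $ingest_mb_s;
printf "Cache sized for 6X:  %.0f MB/s\n", 6 * $ingest_mb_s;
```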

Page 12: Architecting for Data Integrity


Future Testing
• HPSS implementation
• Oracle Tape Analytics to monitor marginal issues with tape drives and the library
  Drive code upgrade: the required ACSLS version is not available for Solaris 11
  Older libraries require an HBT upgrade for memory
• Migrating 3.6 PB of T10KB tapes/data to T10KC tapes
• Migrating the current 2 GB/s infrastructure to 6.5 GB/s
  DDN SFA10K with 150 TB of 600 GB disks and 16 FC8 connections
  T4-4 as archive server with 8 FC8 connections
  X4470 as data mover/web server with 8 FC8 connections
  NAS for the user-accessible filesystem (moving from FC/server based)
• Evaluating next-generation infrastructure at 15 GB/s
• Continue to monitor HSM solutions for technical excellence; evaluate underlying software and hardware technologies: data integrity, scaling metadata, scaling data throughput, migration strategies

Page 13: Architecting for Data Integrity


Architecting for Data Integrity: Questions?
• AXF format
• Oracle provides end-to-end data path integrity in some of its appliances
• Red Hat is in discussions to improve data path integrity in Linux
• SHA1 versus SHA2-256/512

Scott Rife
srif at loc dot gov