a dna-based archival storage system - snia · 2019-12-21 · a dna-based archival storage system...
TRANSCRIPT
![Page 1: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/1.jpg)
A DNA-BasedArchival Storage System
Luis Ceze
University of Washington Microsoft Research
joint work with Karin Strauss, Doug Carmean, Georg Seelig, James Bornholt, Randolph Lopez, Lee Organick, Yuan Chen, Chris Takahashi, Bichlien Nguyen, Sergey Yekhanin, Siena Dumas Ang.
![Page 2: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/2.jpg)
Life evolved a fantastic storage medium…
But why?
Gene
Protein
Function
DNA
100101010
![Page 3: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/3.jpg)
DNA molecules for digital data
Extremely dense Theory: 1 exabyte in 1 mm3
Extremely durable Half life > 500 years
And readers never become obsolete!
![Page 4: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/4.jpg)
Comparing Storage Technologies
Would require:
4,000,000tapecartridges
Stack80kmhigh
1.0E+00
1.0E+01
1.0E+02
1.0E+03
1.0E+04
1.0E+05
1.0E+06
1.0E+07
1.0E+08
1.0E+09
1.0E+10
ThinFilm(MEMS)
Tape Optical Flash(Chip) DNA
Gb/m
m^3
Today Projection Limit
redundancy
spacing
addressing
Potential of DNA:>107 Improvement
![Page 5: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/5.jpg)
The ultimate hierarchy
FlashHDDTape
DNA-based Archival
~5 yrs
centuries
Access Timems
10s msminutes
hrs
Durability
~5 yrs ~15-30 yrs
![Page 6: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/6.jpg)
A DNA-based archival storage system
Store
Write Read
Redundancy and density
Efficient retrieval
Wet lab experiments
[ASPLOS’16]
11010101… 11010101…
ACATCG…
![Page 7: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/7.jpg)
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA strand (oligonucleotide) is a linear sequence of these nucleotides
G A C A C C TG A C A C C T
![Page 8: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/8.jpg)
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA strand (oligonucleotide) is a linear sequence of these nucleotides
G A C A C C T
Two strands can bind to each otherif they are complementary:
C T G T G G A
G A C A C C T
C, G arecomplementary
A, T arecomplementary
T
Partial errors allowed
![Page 9: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/9.jpg)
9
DNA data storage at 30,000 feet
11011101 11011101
Encoding
Decoding
AGCTATCAG AGCTATCAGSynthesis
Sequencing
DNA Manipulation
write path read path
![Page 10: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/10.jpg)
10
DNA Manipulation: Synthesis
AGCTATCAG AGCTATCAG
Synthesis
Sequencing
DNA Synthesis: manufacturing DNA strands
GACACCT G A C A C C T
• Synthesis process appends one letter at a time• Maximum practical sequence length of ~100s of letters• Produces millions of copies of each sequence• Can make many different sequences in parallel• Normally used for genomics and genetic engineering
![Page 11: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/11.jpg)
11
DNA Manipulation: Sequencing
AGCTATCAG AGCTATCAG
Synthesis
Sequencing
DNA Sequencing: reading DNA strandsGACACCTG A C A C C T
• Produces many reads of a strand• Normally used for research and diagnostics• Currently much higher throughput than synthesis
![Page 12: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/12.jpg)
12
DNA data storage at 30,000 feet
11011101 11011101
Encoding
Decoding
AGCTATCAG AGCTATCAGSynthesis
Sequencing
write path read path
![Page 13: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/13.jpg)
Writing digital data to DNAThe easy way: convert base 2 to base 4
But this approach isn’t feasible for more than a few bytes
G G A T G C A
P[Attach] = 99%
99% 98% 97% 96.1% 95.1% 94.2%
100 nts
36.6%
200 nts
13.4%
… …
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
![Page 14: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/14.jpg)
G C A C
Chunking dataBreak binary data into chunks stored in separate strands10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
![Page 15: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/15.jpg)
C A T C C
G C A C
Chunking dataBreak binary data into chunks stored in separate strands10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T
T G C T T A C C
G C C A G T T C
A A A A
A A A C
A A A G
C A T C C
C A T C C
A T G T T
A T G T T
A T G T TAddresses within the file
File identifiers(“primers”)
![Page 16: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/16.jpg)
…
1 of N
2 of N
3 of N
~ 10 bytes per DNA strand.
C A T C C
G C A C G G A T
T G C T T A C C
G C C A G T T C
A A A A
A A A C
A A A G
C A T C C
C A T C C
A T G T T
A T G T T
A T G T TAddresses within the file
File identifiers(“primers”)
![Page 17: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/17.jpg)
Other encoding considerations
G C C TA AAHomopolymers are bad…
G A C A C C T G
C T G T G C A CA T
G
TCA
T
A
G
C
Secondary structures are problematic …
![Page 18: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/18.jpg)
18
DNA data storage at 30,000 feet
11011101 11011101
Encoding
Decoding
AGCTATCAG AGCTATCAGSynthesis
Sequencing
write path read path
![Page 19: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/19.jpg)
Random access?
?
![Page 20: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/20.jpg)
C A T C C
Random access!
PCR
Selectively amplify strands based on their primer
SampleStrands with 3 different primers
Almost all strands have desired primer
Reads are destructive, so replenish when necessary
T A T C T
G C A C G G A T T G C T T A C C
G C C A G T T C
A A A A A A A C
A A A G
G A T A C A T G T T C C A C T
A T G T T T A C A A A T A G A
![Page 21: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/21.jpg)
Going back to digital: Decoding
Sequencing
Clustering reads
…110110101101…
C A T C C
G C A C G G A T
T G C T T A C C
G C C A G T T C
A A A A
A A A C
A C A G
C A T C C
C A T G C
A T G T T
A T G T T
A T G T T
C A T C C
G C A C G G A T
T G C T T A C C
G C C A G T T C
A A G A
C A A C
A A A G
C A T C C
C A T C C
A T G T T
A T G T T
A T G T T
……..
C A T C C
G C A C G G A T
T G C T T A C C
G C C A G T T C
A A G A
C A A C
A A A G
C A T C C
C A T C C
A T G T T
A T G T T
A T G T T
Error correction
![Page 22: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/22.jpg)
Error correctionBoth synthesis and sequencing are error prone:
G G A T A G C
G G A T G C A
G G A T G A
G G A T C C A
Insertions
Deletions
Substitutions
A
Aggregate error rates ~1% per position
![Page 23: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/23.jpg)
Logical redundancy
C A T C C
G C A C G G A T
T G C T T A C C
G C C A G T T C
A A A A
A A A C
A A A G
C A T C C
C A T C C
A T G T T
A T G T T
A T G T T
Addresses within the value
Key identifiers(“primers”)
Redundant data in additional DNA strands. Many possibilities: parity, Reed Solomon, LDPC, … C A T C C
T C G C T A C G
T G C A A T T C
C A A C
CA A G
C A T C C A T G T T
A T G T T C A T C C
![Page 24: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/24.jpg)
SAS / SATA
Putting it all together
![Page 25: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/25.jpg)
Putting it all together
DNA storage (physical) library
~100TB
Data address/key specifies physical location and primer for random access.
keyfoo.mp4
encode andselect primers seqs
manufacture DNA
keyfoo.mp4
PCR
amplification
primers
sequencing
decoding
1101000100…
![Page 26: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/26.jpg)
Wet lab results
![Page 27: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/27.jpg)
Experiments
catcatgg
catcatgc
ThroughputMBs/week
![Page 28: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/28.jpg)
Photo: Tara Brown / UW
![Page 29: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/29.jpg)
DecodingEncoded and synthesized 3 files (151 kB):
Selected and PCRed one file for random access (42 kB):
Sequenced and decoded the resulting amplified pool:
Recovered every bit despite errors in synthesis and sequencing
![Page 30: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/30.jpg)
The importance of redundancy
C A T C C
G C A C G G A T
T G C T T A C C
G C C A G T T C
A A A A
A A A C
A A A G
C A T C C
C A T C C
A T G T T
A T G T T
A T G T T
Addresses within the value
Key identifiers(“primers”)
Redundant data in additional DNA strands. Many possibilities: parity, Reed Solomon, LDPC, … C A T C C
T C G C T A C G
T G C A A T T C
C A A C
CA A G
C A T C C A T G T T
A T G T T C A T C C
0
25
50
75
0 2500 5000 7500Number of copies
Freq
uenc
y Some strands are missing entirely
If we ignore redundancy data, we cannot recover the file.
![Page 31: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/31.jpg)
Error Analysis: Synthesis and SequencingBoth synthesis and sequencing are error prone, but how much?
G G A T G C A
Array synthesis Precise synthesis
$$$$
Sequencing
Synthesis error+
Sequencing error Sequencing error
Synthesis
Sequencing
0.0%
0.5%
1.0%
1.5%
0 25 50 75 100Location
Errors
SynthesisArraySingle
Most error is due to sequencing, not synthesis
G G A T G C A
![Page 32: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/32.jpg)
Durability
G G A T G C A
G G A T G C A
G G A T G C A
WaterRadiation…
![Page 33: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/33.jpg)
Durability
10
100
1000
1 10 100 1000Time (years)
Cop
ies
Req
uire
d
Reliability99.99%99.9%99%
Half-life in bone: ~521 years.
DNAsyntheticfossilssurvive:
Source:Grassetal.RobustChemicalPreservationofDigitalInformationonDNAinSilicawithError-CorrectingCodes
Time Temperature1week 70˚C2,000 years 10˚C2,000,000 years -18˚C
![Page 34: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/34.jpg)
MBs/week GBs/second
![Page 35: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/35.jpg)
DNA manipulation productivity is growing
102
104
106
108
1010
1970 1980 1990 2000 2010Year
Prod
uctiv
ity
Transistors on ChipReading DNAWriting DNA
Source: Robert Carlson
And cost is decreasing…
![Page 36: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000](https://reader031.vdocuments.net/reader031/viewer/2022013021/5f0c0f3f7e708231d4338bc3/html5/thumbnails/36.jpg)
Molecular Information Systems Lab
Thanks!