Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s...

55
Systems | Fueling future disruptions Research Faculty Summit 2018

Transcript of Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s...

Page 1: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Systems | Fueling future disruptions

ResearchFaculty Summit 2018

Page 2: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Molecular Information Systems LabSampa Lab for Hardware/Software Co-DesignPaul G. Allen School of Computer Science & Engineering University of Washington

Luis Ceze

Getting Polymers to Tell a Story: DNA Data Storage and Processing-in-Molecules

joint work with Karin Strauss, Georg Seelig, Doug Carmean, Sergey Yekhanin, Lee Organick, Yuan-Jyue Chen, Bichlien Nguyen, Chris Takahashi, Ashley Stephenson, Pranav Vaid, Sharon Newmann, Cyrus Rashtchian, Miklos Racz, Siena Ang, David Ward, Randolph Lopez, Max Willsey, Kendall Stewart, James Bornholt, Rob Carlson, Hsing-Yeh Parker.

Microsoft Faculty Summit 2018

Page 3: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Gene

Protein

Function/ Characteristic

DNA is Nature’s information storage medium

Page 4: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

1 0 1 0 0 0 1 1 1 0 01 0 0 0 1 1 1 1 0 0 1 11 1 1 0 0 0 1 0 1 1 00 1 0 1 0 0 1 0 1 1 11 0 1 …Manufactured DNA

Using Synthetic DNA for Data Storage

Original idea from 1960s

Revival in 2012

Builds on the progress of the biotech industry

Page 5: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

1 Exabyte in 1 in3

Extremely Dense

1,000s of Years

Extremely Durable

Never Gets Obsolete

DNA Molecules for Digital Data

Making Copies Is Nearly Free

Page 6: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

A few exabytes

Page 7: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 8: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

1.0E-02 1.0E-01 1.0E+00 1.0E+01 1.0E+02 1.0E+03 1.0E+04 1.0E+05 1.0E+06 1.0E+07 1.0E+08 1.0E+09 1.0E+10

Disk (Rot) Disk (SSD) Tape Optical Flash (Chip) DNA

Gb/m

m3

Today Projection Limit

redundancy

spacing

addressing

Potential of DNA:>107 Improvement

1.0E-02

1.0E-01

1.0E+00

1.0E+01

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

1.0E+08

1.0E+09

1.0E+10

Disk(Rot) Disk(SSD) Tape Optical Flash(Chip) DNA

Gb/m

m^3

Today Projection Limit

redundancy

spacingaddressing

Potential of DNA:>107 Improvement

3-5 5 10-30 5 1000+100Lifetime(years) /indefinite (glass)

SMR, HAMR, bit patterning

Going verticalLimited by

wavelength of light

Page 9: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

µs-ms TBs

10s ms 100s TBs

minutes EBs

hours YBsDNA-based Archival

Tape/Optical

HDD

Flash

Access Time Capacity

Online

Backup

[in ASPLOS’16]

Page 10: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Write Path

1 1 0 1 1 1 0 1

Encoding

1

A G C T A T C A G

Synthesis

2

Read Path

S equencing

A G C T A T C A G

1

1 1 0 1 1 1 0 1

Decoding

2

Page 11: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

1 0 1 0 0 0 1 1 1 0 0 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0

T

C

G

A

11

01

10

00

Encoding Digital Data in DNA

G G A T G

C A C T G

C T T A C

Page 12: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

A CG TA A C

Repeated Letters Are Bad: Avoid them with Randomization & Rotation.

P [ At t a c h ] = 9 9 %

9 9 % 9 8 % 97 % 9 6.1 % 9 5.1 % 9 4.2 %

1 0 0 n t s

3 6.6 %

2 0 0 n t s

1 3.4%

… …G A CG T G A

Synthetic DNA has limited length: Break it into chunks.

Page 13: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

A T T C T G C A

A T G T T

A T G T T

C A A C

C A A G

C A T C C

C A T C C

T A C G T C G CRedundant Data in Addit ional DNA Strands. Many Possibi l i t ies: Par i ty, Reed Solomon, LDPC, …

Break into chunks and add redundancy

G A C G T G C T

G C T C G C T A

A T G T T

A T G T T

A T G T TKey Ident i f iers

(“pr imers”)

A C T A

A C T C

A C T GAddresses

Within the Fi le

C A T C C

C A T C C

C A T C CKey Ident i f iers

(“pr imers”)

G C A C C G A T

Payload Chunks

~20 Bytes per 150nt DNA sequence. Many sequences per file. ~1% (in, del, sub) error per base.

Page 14: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Write Path

1 1 0 1 1 1 0 1

Encoding

1

A G C T A T C A G

Synthesis

2

Read Path

S equencing

A G C T A T C A G

1

1 1 0 1 1 1 0 1

Decoding

2

Page 15: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Large array DNA synthesis

Page 16: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

G

C

A

C

A

C

G

C

A

C

A

C

G

C

A

C

A

C

G

C

A

C

A

C

C

C

C

T

G

T

A

C

C

C

T

G

T

A

C

C

C

T

G

T

A

T

A

C

C

G

A

C

T

A

C

C

G

A

C

T

A

C

C

G

A

C

G

C

A

C

T

A

C

G

C

A

C

T

A

C

G

C

A

C

T

A

C

G

A

C

A

C

G

A

C

A

C

G

A

C

A

C

C

T

G

T

C

T

G

T

C

T

G

T

C

C

C

T

G

T

C

C

C

T

G

T

C

C

C

T

G

T

T

C

C

A

C

T

C

C

A

C

T

C

C

A

C

C

T

G

T

C

T

G

T

C

T

G

T

G

A

C

G

A

C

G

A

C

G

A

C

G

A

C

G

A

C

G

A

C

G

A

G

A

G

AC

CC

T

C

T

C

T

C

Each spot grows many copies of a given sequenceLarge array DNA synthesis

Page 17: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Write Path

1 1 0 1 1 1 0 1

Encoding

1

A G C T A T C A G

Synthesis

2

Read Path

S equencing

A G C T A T C A G

1

1 1 0 1 1 1 0 1

Decoding

2

Page 18: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

A CG TA CC G A C A C C T

DNA Sequencing

Image credit: Oxford Nanopore

Page 19: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

A T G T T T G C T T A C C A A A C C A T G C

A T G T T G C C A G T T C A C A G C A T C C

A T G T T G G A T G C A C A A G A C A T C C

A T G T T T G C T T A C C C A A C C A T C C

A T G T T G C C A G T T C A A A G C A T G C

A T G T T G G A T G C A C A A G A C A T C C

A T G T T T G C T T A C C C A A C C A T C C

A T G T T G C C A G T T C A A A G C A T C C

Selected DNA strands

Sequencing (Noisy reads)

Cluster ing (No reference)

[in NIPS’17]

Page 20: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

…1 1 0 1 10 1 0 1 10 1 …

Reassemble Data

Error Correct ion

Error-Free Data

© 2

018

Nat

ure

Am

eric

a, In

c., p

art o

f Spr

inge

r N

atur

e. A

ll ri

ghts

res

erve

d.

244 VOLUME 36 NUMBER 3 MARCH 2018 NATURE BIOTECHNOLOGY

ART I C L ES

work5,6, our approach employs concatenated codes with Reed–Solomon (RS) as the outer code (Fig. 2b). (However, unlike most earlier work, we used very long codes (length up to 65,536) to handle large variations in the number of errors between code words.) Input data are then rand-omized by XOR with a pseudo-random sequence. Randomization facil-itates coping with errors by breaking multi-bit repeats (e.g., 00000000) and ensures that the DNA sequences we produce are dissimilar, which makes decoding less computationally costly.

The encoder first partitions the randomized digital file into mul-tiple blocks, up to a megabyte in size. We represent each block by a matrix M with up to ten rows and up to 55,000 columns, where every matrix cell carries a 16-bit value. Next, we encode each row of M with a Reed–Solomon code to obtain a larger matrix M` that extends M by appending redundant columns. Every column of M` is later converted into a DNA sequence of length 110 (114 for File 33; Supplementary Note 3 and Supplementary Table 3). When Reed–Solomon redun-dancy is set to 15%, 87% of the DNA sequences carry raw input data

(systematic RS coordinates), while 13% carry redundant data used for error correction (redundant RS coordinates).

The conversion of columns of M` to DNA sequences involves rep-resenting a column in base 4, appending a prefix with address infor-mation (block index and column index), breaking the column into consecutive fragments of size three each, treating the content of each fragment as a number between 0 and 63 written in base four, repre-senting this number in base three to obtain a fragment of size four, putting the new fragments together, and applying a rotating code4 to turn a base-three representation into a base-four representation that eliminates homopolymers.

Finally, all DNA sequences are appended with 20-base PCR primer targets selected from the primer library on both ends to allow random access to the file (Supplementary Note 6 and Supplementary Figs. 3 and 4). Resulting DNA sequences are synthesized into DNA strands, which can then be preserved using a variety of methods, and later selected via random access.

�/'*$/�)'!/ /6�#$0'%+'���$0'%+�4,/(8,4

�$+$/ 1$�/ +#,*��*$/�! 0$#�,+���",+1$+1�

0$.�",*-)$*$+1 /'16�),+%�&,*,-,)6*$/0�� **'+%�#'01 +"$

������0$.2$+"$0

��� �0$.2$+"$0

�� �0$.2$+"$0

�')1$/0'*') /'16

�')1$/0$",+# /601/2"12/$� +#��*

���9)$0

��9)$0

�2)1'-)$5���

�� � � ��

�')$���

),% �

�� 3%�/$ #0�

''��� )'# 1',+

�+",#'+%

���

�����

�'+ /6�# 1

� +#,*'7$

�--)6,21$/�",#$

�--)6'++$/�",#$

�$)$"1�-/'*$/0

�$",#'+%

��������

���������$.2$+"$#

/$ #0�)201$/

�0'*') /�/$ #0�$",+01/2"101/ +#0

�$3$/0$'++$/�",#$

�$3$/0$,21$/�",#$

�##/

�##/

�##/

�##/

� 6), #

� 6), #

� 6), #

� 6), #

� 6), #

� 6), #

� 6), #

� 6), # � 6), #

� 6), #

� 6), #

� 6), #�##/

�##/

�##/

�##/

�##/

�##/

�##/

�##/

��

��

��

��

��

��

��

��

�:��0$.2$+"$0����������

� 1 �

�$#2+# +"6

a

b

c

�'+ /6����������

Figure 2 Design of random access primers and coding algorithm. (a, i) We designed a primer library for our PCR-based random access method using an in silico process. Starting with a set of random 20-mers, the sequences keep mutating until they satisfy all the design criteria, which include their GC-content, the absence of long sequence-complementarities, absence of long stretches of homopolymers, and a minimum Hamming distance of 6 bases from other primers. The preselected sequence set (19,480 sequences) is then filtered by melting temperature and a set that is as diverse as possible, that is, has low similarity between the sequences, is selected. (a, ii) The resulting set of candidate primers is then validated experimentally by synthesizing a pool of about 100,000 strands containing sets of size 1 to 200 DNA sequences each, surrounded by one of the 3,240 candidate primer pairs, and then randomly selecting 48 of those pairs for amplification. The product is sequenced, and sequences with each of the 48 primer pairs appear among sequencing reads, albeit at different relative proportions when normalized to the number of sequences in each set. (b) Our encoding process starts by randomizing data to reduce chances of secondary structures, primer–payload non-specific binding, and improved properties during decoding. It then breaks the data into fixed-size payloads, adds addressing information (Addr), and applies outer coding, which adds redundant sequences using a Reed–Solomon code to increase robustness to missing sequences and errors. The level of redundancy is determined by expected errors in sequencing and synthesis, as well as DNA degradation. Next, it applies inner coding, which ultimately converts the bits to DNA sequences. The resulting set of sequences is surrounded by a primer pair chosen from the library based on (low) level of overlap with payloads. (c) The decoding process starts by clustering reads based on similarity, and finding a consensus between the sequences in each cluster to reconstruct the original sequences, which are then decoded back to digital data.

Page 21: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Write Path

1 1 0 1 1 1 0 1

Encoding

1

A G C T A T C A G

Synthesis

2

Read Path

S equencing

A G C T A T C A G

1

1 1 0 1 1 1 0 1

Decoding

2

Page 22: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 23: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

~100TB per spot

275um

Page 24: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Droplet

Electrode ElectrodeElectrode

Activated Electrode

Electrode Electrode

Hydrophobic Layers

Digital microfluidics

Molecular domain

Electronic domain

Hardware, software, wetware :)

Page 25: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Putting it all together as (key, value) store

DNA storage (physical) library

~100TB

keyfoo.mp4

encode andselect primers seqs

manufacture DNA

Data address specifies physical location and primer for random access.

keyfoo.mp4

primers

PCR

amplification

sequencing

decoding

1101000100…

Page 26: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

[Nature Biotechnology’18]

Over 700MB. 50M+ sequences. 9B+ Nucleotides, 4B+ reads. Demonstrated random access w/ 40+ objects.

Illumina and Nanopore sequencing readout.

Page 27: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

10MBs/day 100GBs/second

Page 28: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Source: Robert Carlson

Y E A R

1 9 7 0 1 9 8 0 1 9 9 0 2 0 0 0 2 0 1 0

1 0 2

1 0 4

1 0 6

1 0 8

1 0 1 0

Transistors On Chip Reading DNA Writing DNA

2 0 1 8

Life sciences

perfect

Data storage

arbitrarilytolerant

Quality

Page 29: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

•Data centers offer perfect abstraction for “exotic technologies” (Carmean)

•Large-scale fluidics for synthesis, manipulation and sequencing

•Throughput of ~1TB/s at the data-center level

•Computational cost significant •Today:~2.8KB/s encode, ~1KB/s decode on 16 core Xeon.

Scalability

Page 30: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Beyond DNA Data Storage

Page 31: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

DNA “computing" in the age of big data

~100TB per spot

275um

If DNA data storage succeeds, what if we could process data directly in DNA?

Extremely parallel and energy efficient

Page 32: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Processing-in-Memory

RAM

RAM

RAM

PIM

CPU

PIM

PIM

Slide credit: Kendall Stewart

Page 33: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Processing-in-Molecules

Slide credit: Kendall Stewart

Page 34: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Storage and Processing-in-Molecules (DNA)

Slide credit: Kendall Stewart

Page 35: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

ACGGA T C T T C A G C G T C

AG

Page 36: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Image 209A

CGGA T C T T C A G C G T C

AG

Page 37: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Image 209A

CGGA T C T T C A G C G T C

AG

TCA A

AGC

C

TTC

Page 38: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Image 209A

CGGA T C T T C A G C G T C

AG

TCA A

AGC

C

TTC

Page 39: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

ACGGA T C T T C A G C G T C

AG

TCA A

AGC

C

TTC

Page 40: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 41: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 42: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 43: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 44: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 45: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 46: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 47: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Input Features

Convolutional Weights 1

Convolutional Weights 2

Fully Connected Weights

ATC G

A T G C C TSequence Output

ReLU + Softmax Activations

Sine Activations

ReLU Activations

30 x 1

30 x 4

10 x 128

10 x 128

10 x 1

1 x 128

1 x 128 x 128

10 x 128 x 30 x 4

Layer Size

`

Query Image: Query Image: All Queries

(a) Number of aigned reads vs. distance to query. Points indicate the mean across three

replicates, and error bars indicate standard error.

(b) Cumulative distribution of aligned reads as a function of increasing distance to the

query. Uniform distribution shown for reference.

Fig. 10: Selected results for two of the ten query images, and aggregated

results for all queries.

6 Discussion

In practice, the 10-dimensional image feature subspace used for our experimentsis insu�ciently selective. Referring back to Figure 1, the 100-dimensional spacewas more e↵ective at relating distance to qualitative similarity. But it is di�cultto train an encoder transform this already-compressed 100-dimensional subspaceonto a 30-nucleotide feature sequence.

We might be tempted to try longer feature regions, but this will likely ex-perience more noise of the type seen in our results. Figure 11 illustrates this bygeneralizing Figure 4 to feature regions of di↵erent sizes. These plots bin acrosssequence length and target-query Hamming distance, and the color indicates ei-ther the mean (on the left) or the standard deviation (on the right) of the yieldvalues in that bin, at our protocol temperature of 21�C. These plots tell us thatthe average Hamming distance su�cient to reliably retrieve the target increasesas length increases, and that the variance in yield for targets past that Hammingdistance threshold increases as well. The increased su�cient Hamming distancewill make it harder for the encoder to push unrelated items apart in the sequencespace, and the additional noise will make the yield approximation used to trainour encoder less reliable.

These problems pose a di�cult challenge to scaling this system. One approachis to use alternative probe designs that have less noise, such as the toehold-exchange probes of Zhang et al. [24, 26], which use protector strands that arecomplimentary to the query to increase the specificity of hybridization. In ourcase, we want hybridization to be more specific, but not so specific that targets

Yottabyte-scale Associate Memories?

Page 48: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 49: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 50: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

HD Video of OK Go’s This Too Shall Pass — watched 100M+ times

Page 51: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Photo: Tara Brown / UW

~10 mi l l ion copies of the HD movie

Page 52: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 53: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely
Page 54: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely

Thank you!

Page 55: Research Faculty Summit 2018 - microsoft.com · DNA for Data Storage Original idea from 1960s Revival in 2012 Builds on the progress of the biotech industry . 1 Exabyte in 1 in3 Extremely