00:00:00.000 --> 00:00:00.541
[TEASER]

00:00:00.541 --> 00:00:01.383
[MUSIC PLAYS UNDER DIALOGUE]

00:00:01.383 --> 00:00:06.120
JAKE SMITH: This really starts from the&nbsp;
fundamental data production–data storage gap,&nbsp;&nbsp;

00:00:06.120 --> 00:00:11.960
where we produce way more data nowadays than&nbsp;
we could ever have imagined years ago. And it's&nbsp;&nbsp;

00:00:11.960 --> 00:00:18.000
more than we can practically store in magnetic&nbsp;
media. And so we really need a denser medium&nbsp;&nbsp;

00:00:18.000 --> 00:00:23.840
on the other side to contain that. DNA is&nbsp;
extremely dense. It holds far, far more&nbsp;&nbsp;

00:00:23.840 --> 00:00:30.280
information per unit volume, per unit mass than&nbsp;
any storage media that we have available today.&nbsp;&nbsp;

00:00:30.280 --> 00:00:35.720
This, along with the fact that DNA is itself a&nbsp;
relatively rugged molecule—it lives in our body;&nbsp;&nbsp;

00:00:35.720 --> 00:00:40.240
it lives outside our body for thousands&nbsp;
and thousands of years if we, you know,&nbsp;&nbsp;

00:00:40.240 --> 00:00:45.023
leave it alone to do its thing—makes&nbsp;
it a very attractive medium.

00:00:45.023 --> 00:00:47.200
BICHLIEN NGUYEN: It's such&nbsp;
a futuristic technology,&nbsp;&nbsp;

00:00:47.200 --> 00:00:55.200
right? When you begin to work on the tech, you&nbsp;
realize how many disciplines and domains you&nbsp;&nbsp;

00:00:55.200 --> 00:01:01.440
actually have to reach in and leverage. It's&nbsp;
really interesting, this multidisciplinarity,&nbsp;&nbsp;

00:01:01.440 --> 00:01:08.720
because we're, in a way, bridging software&nbsp;
with wetware with hardware. And so you,&nbsp;&nbsp;

00:01:08.720 --> 00:01:14.103
kind of, need all the different disciplines&nbsp;
to actually get you to where you need to go.

00:01:14.103 --> 00:01:16.920
SERGEY YEKHANIN: We all work for Microsoft;&nbsp;
we are all Microsoft researchers. Microsoft&nbsp;&nbsp;

00:01:16.920 --> 00:01:21.240
isn’t a startup. But that team, the team&nbsp;
that drove the DNA Data Storage Project,&nbsp;&nbsp;

00:01:21.240 --> 00:01:26.306
it did feel like a startup, and it was&nbsp;
something unusual and exciting for me.

00:01:26.306 --> 00:01:30.920
SERIES INTRO: You’re listening to Ideas, a&nbsp;
Microsoft Research Podcast that dives deep into&nbsp;&nbsp;

00:01:30.920 --> 00:01:43.840
the world of technology research and the profound&nbsp;
questions behind the code. In this series, we’ll&nbsp;&nbsp;

00:01:43.840 --> 00:01:52.400
explore the technologies that are shaping our&nbsp;
future and the big ideas that propel them forward.

00:01:52.400 --> 00:01:52.414
[MUSIC FADES]

00:01:52.414 --> 00:01:56.080
GUEST HOST KARIN STRAUSS: I'm your guest host&nbsp;
Karin Strauss, a senior principal research&nbsp;&nbsp;

00:01:56.080 --> 00:02:01.960
manager at Microsoft. For nearly a decade, my&nbsp;
colleagues and I—along with a fantastic and&nbsp;&nbsp;

00:02:01.960 --> 00:02:07.520
talented group of collaborators from academia&nbsp;
and industry—have been working together to help&nbsp;&nbsp;

00:02:07.520 --> 00:02:13.360
close the data creation–data storage gap. We're&nbsp;
producing far more digital information than we can&nbsp;&nbsp;

00:02:13.360 --> 00:02:20.960
possibly store. One solution we've explored uses&nbsp;
synthetic DNA as a medium, and over the years,&nbsp;&nbsp;

00:02:20.960 --> 00:02:26.640
we've contributed to steady and promising progress&nbsp;
in the area. We've helped push the boundaries of&nbsp;&nbsp;

00:02:26.640 --> 00:02:32.320
how much a DNA writer can simultaneously store,&nbsp;
shown that full automation is possible,&nbsp;&nbsp;

00:02:32.320 --> 00:02:38.080
and helped create an ecosystem for the commercial&nbsp;
success of DNA data storage. And just this week,&nbsp;&nbsp;

00:02:38.080 --> 00:02:43.600
we've made one of our most advanced tools&nbsp;
for encoding and decoding data in DNA open&nbsp;&nbsp;

00:02:43.600 --> 00:02:48.760
source. Joining me today to discuss the&nbsp;
state of DNA data storage and some of our&nbsp;&nbsp;

00:02:48.760 --> 00:02:54.280
contributions are several members of the DNA&nbsp;
Data Storage Project at Microsoft Research:&nbsp;&nbsp;

00:02:54.280 --> 00:03:00.000
Principal Researcher Bichlien Nguyen,&nbsp;
Senior Researcher Jake Smith, and Partner&nbsp;&nbsp;

00:03:00.000 --> 00:03:06.183
Research Manager Sergey Yekhanin. Bichlien,&nbsp;
Jake, and Sergey, welcome to the podcast.

00:03:06.183 --> 00:03:07.223
BICHLIEN NGUYEN: Thanks for having us, Karin.

00:03:07.223 --> 00:03:07.897
SERGEY YEKHANIN: Thank you so much.

00:03:07.897 --> 00:03:08.560
JAKE SMITH: Yes, thank you.

00:03:08.560 --> 00:03:14.960
STRAUSS: So before getting into the details of DNA&nbsp;
data storage and our work, I'd like to talk about&nbsp;&nbsp;

00:03:14.960 --> 00:03:21.440
the big idea behind the work and how we all got&nbsp;
here. I've often described the DNA Data Storage&nbsp;&nbsp;

00:03:21.440 --> 00:03:28.680
Project as turning science fiction into reality.&nbsp;
When we started the project in 2015, though, the&nbsp;&nbsp;

00:03:28.680 --> 00:03:34.920
idea of using DNA for archival storage was already&nbsp;
out there and had been for over five decades.&nbsp;&nbsp;

00:03:34.920 --> 00:03:39.360
Still, when I talked about the work in the area,&nbsp;
people were pretty skeptical in the beginning,&nbsp;&nbsp;

00:03:39.360 --> 00:03:45.800
and I heard things like, “Wow, why are you&nbsp;
thinking about that? It's so far off.” So, first,&nbsp;&nbsp;

00:03:45.800 --> 00:03:51.080
please share a bit of your research backgrounds&nbsp;
and then how you came to work on this project.&nbsp;&nbsp;

00:03:51.080 --> 00:03:56.400
Where did you first encounter this idea, what do&nbsp;
you remember about your initial impressions—or the&nbsp;&nbsp;

00:03:56.400 --> 00:04:02.360
impressions of others—and what made you want&nbsp;
to get involved? Sergey, why don’t you start.

00:04:02.360 --> 00:04:06.440
YEKHANIN: Thanks so much. So I’m a coding&nbsp;
theorist by training, so, like, my core areas&nbsp;&nbsp;

00:04:06.440 --> 00:04:12.520
of research have been error-correcting codes&nbsp;
and also computational complexity theory. So&nbsp;&nbsp;

00:04:12.520 --> 00:04:17.120
I joined the project probably, like, within half&nbsp;
a year of the time that it was born, and thanks,&nbsp;&nbsp;

00:04:17.120 --> 00:04:21.680
Karin, for inviting me to join. So, like,&nbsp;
that was roughly the time when I moved from&nbsp;&nbsp;

00:04:21.680 --> 00:04:27.000
a different lab, from the Silicon Valley lab&nbsp;
in California to the Redmond lab, and actually,&nbsp;&nbsp;

00:04:27.000 --> 00:04:30.400
it just so happened that at that moment, I&nbsp;
was thinking about what to do next. Like,&nbsp;&nbsp;

00:04:30.400 --> 00:04:35.440
in California, I was mostly working on coding&nbsp;
for distributed storage, and when I joined here,&nbsp;&nbsp;

00:04:36.000 --> 00:04:40.360
that effort kept going. But I had some free&nbsp;
cycles, and that was the moment when Karin&nbsp;&nbsp;

00:04:40.360 --> 00:04:45.400
came just to my office and told me about the&nbsp;
project. So, indeed, initially, it did feel a&nbsp;&nbsp;

00:04:45.400 --> 00:04:50.240
lot like science fiction. Because, I mean, we&nbsp;
are used to coding for digital storage media,&nbsp;&nbsp;

00:04:50.240 --> 00:04:55.480
like for magnetic storage media, and here, like,&nbsp;
this is biology, and, like, why exactly these&nbsp;&nbsp;

00:04:55.480 --> 00:04:59.920
kind of molecules? There are so many different&nbsp;
molecules. Like, why that? But honestly, like,&nbsp;&nbsp;

00:04:59.920 --> 00:05:04.320
I didn't try to pretend to be a biologist and make&nbsp;
conclusions about whether this is the right medium&nbsp;&nbsp;

00:05:04.320 --> 00:05:09.880
or the wrong medium. So I tried to look into these&nbsp;
kinds of questions from a technical standpoint,&nbsp;&nbsp;

00:05:09.880 --> 00:05:15.040
and there was a lot of, kind of, deep, interesting&nbsp;
coding questions, and that was the main attraction&nbsp;&nbsp;

00:05:15.040 --> 00:05:19.320
for me. At the same time, I wasn’t convinced&nbsp;
that we will get as far as we actually got,&nbsp;&nbsp;

00:05:19.320 --> 00:05:24.080
and I wasn't immediately convinced about the&nbsp;
future of the field, but, kind of, just the depth&nbsp;&nbsp;

00:05:24.080 --> 00:05:29.840
and the richness of the, what I’ll call, technical&nbsp;
problems, that's what made it appealing for me,&nbsp;&nbsp;

00:05:29.840 --> 00:05:34.160
and I, kind of, enthusiastically joined. And&nbsp;
also, I guess, the culture of the team. So, like,&nbsp;&nbsp;

00:05:34.160 --> 00:05:37.720
it did feel like a startup. Like, we all work&nbsp;
for Microsoft; we’re all Microsoft researchers.&nbsp;&nbsp;

00:05:37.720 --> 00:05:42.680
Microsoft isn’t a startup. But that team, the&nbsp;
team that drove the DNA Data Storage Project,&nbsp;&nbsp;

00:05:42.680 --> 00:05:46.020
it did feel like a startup, and it was&nbsp;
something unusual and exciting for me.

00:05:46.020 --> 00:05:53.240
NGUYEN: Oh, I love that, Sergey. So my background&nbsp;
is in organic chemistry, and Karin had reached out&nbsp;&nbsp;

00:05:53.240 --> 00:05:58.880
to me, and I interviewed not knowing what Karin&nbsp;
wanted. Actually … so I took the job kind of&nbsp;&nbsp;

00:05:58.880 --> 00:06:05.520
blind because I was like, “Hmm, Microsoft&nbsp;
Research? … DNA biotech? ...” I was very,&nbsp;&nbsp;

00:06:05.520 --> 00:06:10.200
very curious, and then when she told me that this&nbsp;
project was about DNA data storage, I was like,&nbsp;&nbsp;

00:06:10.200 --> 00:06:17.680
this is a crazy, crazy idea. I definitely was&nbsp;
not sold on it, but I was like, well, look,&nbsp;&nbsp;

00:06:17.680 --> 00:06:24.240
I get to meet and work with so many interesting&nbsp;
people from different backgrounds that, one,&nbsp;&nbsp;

00:06:24.240 --> 00:06:27.400
even if it doesn't work out, I’m&nbsp;
going to learn something, and, two,&nbsp;&nbsp;

00:06:27.400 --> 00:06:33.520
I think it could work, like it could work. And so&nbsp;
I think that's really what motivated me to join.

00:06:33.520 --> 00:06:38.880
SMITH: The first thing that you think when&nbsp;
you hear about we're going to take what is&nbsp;&nbsp;

00:06:38.880 --> 00:06:44.200
our hard drive and we're going to turn that&nbsp;
into DNA is that this is nuts. But, you know,&nbsp;&nbsp;

00:06:44.200 --> 00:06:49.880
it didn't take very long after that. I come&nbsp;
from a chemistry, biotech-type background&nbsp;&nbsp;

00:06:49.880 --> 00:06:55.480
where I've been working on designing drugs, and&nbsp;
there, DNA is this thing off in the nethers,&nbsp;&nbsp;

00:06:55.480 --> 00:07:00.680
you know. You look at it every now and then&nbsp;
to see what information it can tell you about,&nbsp;&nbsp;

00:07:00.680 --> 00:07:05.840
you know, what maybe your drug might be hitting&nbsp;
on the target side, and it's, you know, that&nbsp;&nbsp;

00:07:05.840 --> 00:07:11.400
connection—that the DNA contains the information&nbsp;
in the living systems, the DNA contains the&nbsp;&nbsp;

00:07:11.400 --> 00:07:17.280
information in our assays, and why could the DNA&nbsp;
not contain the information that we, you know,&nbsp;&nbsp;

00:07:17.280 --> 00:07:22.800
think more about every day, that information that&nbsp;
lives in our computers—as an extremely cool idea.

00:07:22.800 --> 00:07:27.560
STRAUSS: Through our work, we've had years to&nbsp;
wrap our heads around DNA data storage. But,&nbsp;&nbsp;

00:07:27.560 --> 00:07:29.680
Jake, could you tell us a little bit about&nbsp;&nbsp;

00:07:29.680 --> 00:07:35.360
how DNA data storage works and why we're&nbsp;
interested in looking into the technology?

00:07:35.360 --> 00:07:40.240
SMITH: So you mentioned it earlier, Karin,&nbsp;
that this really starts from the fundamental&nbsp;&nbsp;

00:07:40.240 --> 00:07:46.040
data production–data storage gap, where we&nbsp;
produce way more data nowadays than we could&nbsp;&nbsp;

00:07:46.040 --> 00:07:51.800
ever have imagined years ago. And it's more than&nbsp;
we can practically store in magnetic media. This&nbsp;&nbsp;

00:07:51.800 --> 00:07:58.840
is a problem because, you know, we have data;&nbsp;
we have recognized the value of data with the&nbsp;&nbsp;

00:07:58.840 --> 00:08:04.320
rise of large language models and these other big&nbsp;
generative models. The data that we do produce,&nbsp;&nbsp;

00:08:04.320 --> 00:08:10.680
our video has gone from, you know, substantially&nbsp;
small, down at 480 resolution, all the way up to&nbsp;&nbsp;

00:08:10.680 --> 00:08:16.720
things at 8K resolution that now take orders of&nbsp;
magnitude more storage. And so we really need a&nbsp;&nbsp;

00:08:16.720 --> 00:08:23.400
denser medium on the other side to contain that.&nbsp;
DNA is extremely dense. It holds far, far more&nbsp;&nbsp;

00:08:23.400 --> 00:08:29.880
information per unit volume, per unit mass than&nbsp;
any storage media that we have available today.&nbsp;&nbsp;

00:08:29.880 --> 00:08:35.320
This, along with the fact that DNA is itself a&nbsp;
relatively rugged molecule—it lives in our body;&nbsp;&nbsp;

00:08:35.320 --> 00:08:40.160
it lives outside our body for thousands&nbsp;
and thousands of years if we, you know,&nbsp;&nbsp;

00:08:40.160 --> 00:08:44.240
leave it alone to do its thing—makes&nbsp;
it a very attractive medium,&nbsp;&nbsp;

00:08:44.240 --> 00:08:48.760
particularly compared to the traditional&nbsp;
magnetic media, which has lower density&nbsp;&nbsp;

00:08:48.760 --> 00:08:52.840
and a much shorter lifetime on the,&nbsp;
you know, scale of decades at most.

00:08:52.840 --> 00:08:58.640
So how does DNA data storage actually work?&nbsp;
Well, at a very high level, we start out in the&nbsp;&nbsp;

00:08:58.640 --> 00:09:04.080
digital domain, where we have our information&nbsp;
represented as ones and zeros, and we need to&nbsp;&nbsp;

00:09:04.080 --> 00:09:10.320
convert that into a series of A's, C's, T's,&nbsp;
and G's that we could then actually produce,&nbsp;&nbsp;

00:09:10.320 --> 00:09:15.080
and this is really the domain of Sergey. He'll&nbsp;
tell us much more about how this works later on.&nbsp;&nbsp;

00:09:15.080 --> 00:09:20.080
For now, let's just assume we've done this. And&nbsp;
now our information, you know, lives in the DNA&nbsp;&nbsp;

00:09:20.080 --> 00:09:25.280
base domain. It's still in the digital world. It's&nbsp;
just represented as A’s, C’s, T’s, and G’s, and&nbsp;&nbsp;

00:09:25.280 --> 00:09:30.360
we now need to make this physical so that we can&nbsp;
store it. This is accomplished through large-scale&nbsp;&nbsp;

00:09:30.360 --> 00:09:37.600
DNA synthesis. Once the DNA has been synthesized&nbsp;
with the sequences that we specified, we need to&nbsp;&nbsp;

00:09:37.600 --> 00:09:42.560
store it. There's a lot of ways we can think about&nbsp;
storing it. Bichlien’s done great work looking at&nbsp;&nbsp;

00:09:42.560 --> 00:09:50.480
DNA encapsulation, as well as, you know, other&nbsp;
more raw just DNA-on-glass-type techniques. And&nbsp;&nbsp;

00:09:50.480 --> 00:09:57.360
we've done some work looking at the susceptibility&nbsp;
of DNA stored in this unencapsulated form to&nbsp;&nbsp;

00:09:57.360 --> 00:10:03.720
things like atmospheric humidity, to temperature&nbsp;
changes and, most excitingly, to things like&nbsp;&nbsp;

00:10:03.720 --> 00:10:10.400
neutron radiation. So we've stored our data&nbsp;
in this physical form, we've archived it, and&nbsp;&nbsp;

00:10:10.400 --> 00:10:16.480
coming back to it, likely many years in the future&nbsp;
because the properties of DNA match up very well&nbsp;&nbsp;

00:10:16.480 --> 00:10:22.440
with archival storage, we need to convert it back&nbsp;
into the digital domain. And this is done through&nbsp;&nbsp;

00:10:22.440 --> 00:10:29.320
a technique called DNA sequencing. What this does&nbsp;
is it puts the molecules through some sort of&nbsp;&nbsp;

00:10:29.320 --> 00:10:35.640
machine, and on the other side of the machine, we&nbsp;
get out, you know, a noisy representation of what&nbsp;&nbsp;

00:10:35.640 --> 00:10:42.440
the actual sequence of bases in the molecules&nbsp;
were. We have one final step. We need to take&nbsp;&nbsp;

00:10:42.440 --> 00:10:50.360
this series of noisy sequences and convert it back&nbsp;
into ones and zeros. Once we do this, we return&nbsp;&nbsp;

00:10:50.360 --> 00:10:55.680
to our original data and we've completed,&nbsp;
let's call it, one DNA data storage cycle.
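
NOTE
Editor's sketch of the first and last steps of the cycle described above: turning ones and zeros into A's, C's, G's, and T's and back again.
The fixed 2-bits-per-base mapping and the names BASES, encode, and decode are assumptions made for illustration; this is not the project's codec, and it has no error correction.
```python
# Illustrative 2-bits-per-base mapping (00=A, 01=C, 10=G, 11=T).
# This mapping is an assumption for the sketch, not the project's codec.
BASES = "ACGT"
#
def encode(data):
    """Map each byte to four bases, most significant bit pair first."""
    return "".join(BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0))
#
def decode(strand):
    """Invert encode(); assumes an error-free strand whose length is a multiple of 4."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
#
strand = encode(b"hello")
assert decode(strand) == b"hello"
```
With this mapping the five bytes of "hello" become a 20-base strand. Reads coming off a sequencer are noisy, so a practical decoder cannot assume clean input the way decode() does here.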

00:10:55.680 --> 00:11:01.360
STRAUSS: We'll get into this in more detail&nbsp;
later, but maybe, Sergey, we dig a little bit&nbsp;&nbsp;

00:11:01.360 --> 00:11:09.400
on the encoding-decoding end of things and how DNA is&nbsp;
different as a medium from other types of media.

00:11:09.400 --> 00:11:14.520
YEKHANIN: Sure. So, like, I mean, coding is an&nbsp;
important aspect of this whole idea of DNA data&nbsp;&nbsp;

00:11:14.520 --> 00:11:19.120
storage because we have to deal with errors—it’s&nbsp;
a new medium—but talking about error-correcting&nbsp;&nbsp;

00:11:19.120 --> 00:11:23.560
codes in the context of DNA data storage, so, I&nbsp;
mean, usually, like … what are error-correcting&nbsp;&nbsp;

00:11:23.560 --> 00:11:28.800
codes about? Like, on the very high level, right,&nbsp;
I mean, you have some data—think of it as a binary&nbsp;&nbsp;

00:11:28.800 --> 00:11:33.640
string—you want to store it, but there are&nbsp;
errors. So usually, like, in most, kind of,&nbsp;&nbsp;

00:11:33.640 --> 00:11:36.600
forms of media, the errors are bit flips. Like,&nbsp;
you store a 0; you get a 1. Or you store a 1; you&nbsp;&nbsp;

00:11:36.600 --> 00:11:44.600
get a 0. So these are called substitution errors.&nbsp;
The field of error-correcting codes, it started,&nbsp;&nbsp;

00:11:44.600 --> 00:11:50.160
like, in the 1950s, so, like, it’s 70 years old&nbsp;
at least. So we, kind of, we understand how to&nbsp;&nbsp;

00:11:50.160 --> 00:11:55.400
deal with this kind of error reasonably well, so&nbsp;
with substitution errors. In DNA data storage,&nbsp;&nbsp;

00:11:55.400 --> 00:12:01.040
the way you store your data is that given,&nbsp;
like, some large amount of digital data,&nbsp;&nbsp;

00:12:01.040 --> 00:12:05.920
you have the freedom of choosing which short&nbsp;
DNA molecules to generate. So in a DNA molecule,&nbsp;&nbsp;

00:12:05.920 --> 00:12:10.320
it’s a sequence of the bases A, G, C, and&nbsp;
T, and you have the freedom to decide,&nbsp;&nbsp;

00:12:10.320 --> 00:12:14.720
like, which of the short molecules you need to&nbsp;
generate, and then those molecules get stored,&nbsp;&nbsp;

00:12:14.720 --> 00:12:19.160
and then during the storage, some of them&nbsp;
are lost; some of them can be damaged. There&nbsp;&nbsp;

00:12:19.160 --> 00:12:25.160
can be insertions and deletions of bases on every&nbsp;
molecule. Like, we call them strands. So you need&nbsp;&nbsp;

00:12:25.160 --> 00:12:30.840
redundancy, and there are two forms of redundancy.&nbsp;
There's redundancy that goes across strands,&nbsp;&nbsp;

00:12:30.840 --> 00:12:36.160
and there is redundancy on the strand. And so,&nbsp;
yeah, so, kind of, from the error-correcting&nbsp;&nbsp;

00:12:36.160 --> 00:12:40.520
side of things, like, we get to decide what kind&nbsp;
of redundancy we want to introduce—across strands,&nbsp;&nbsp;

00:12:40.520 --> 00:12:44.000
on the strand—and then, like, we want to&nbsp;
make sure that our encoding and decoding&nbsp;&nbsp;

00:12:44.000 --> 00:12:47.600
algorithms are efficient. So that's&nbsp;
the coding theory angle on the field.
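
NOTE
Editor's sketch of the simplest possible redundancy across strands: one extra parity payload that lets a single lost strand be rebuilt from the survivors.
The helper names (xor_bytes, add_parity, recover) are hypothetical, and a single XOR parity is an assumption chosen for brevity; it is not the scheme the team used, and it does nothing about insertions or deletions within a strand.
```python
# Toy cross-strand redundancy: one XOR parity payload tolerates the loss of one strand.
# All names are hypothetical; this is not the project's coding scheme.
from functools import reduce
#
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length payloads."""
    return bytes(x ^ y for x, y in zip(a, b))
#
def add_parity(payloads):
    """Append a parity payload equal to the XOR of all data payloads."""
    return list(payloads) + [reduce(xor_bytes, payloads)]
#
def recover(stored, lost_index):
    """Rebuild the payload at lost_index by XOR-ing every surviving payload."""
    survivors = [p for i, p in enumerate(stored) if i != lost_index]
    return reduce(xor_bytes, survivors)
#
stored = add_parity([b"DNA ", b"data", b"stor"])
assert recover(stored, 1) == b"data"
```
The sketch assumes the decoder already knows which strand is missing and that the surviving payloads were read without error. Redundancy on the strand, the second form mentioned above, is what would handle errors inside each payload.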

00:12:47.600 --> 00:12:52.840
NGUYEN: Yeah, and then, you know, from there,&nbsp;
once you have that data encoded into DNA,&nbsp;&nbsp;

00:12:52.840 --> 00:12:57.880
the question is how do you make that data&nbsp;
on a scale that's compatible with digital&nbsp;&nbsp;

00:12:57.880 --> 00:13:05.080
data storage? And so that's where a lot of the&nbsp;
work came in for really automating the synthesis&nbsp;&nbsp;

00:13:05.080 --> 00:13:11.280
process and also the reading process, as well. So&nbsp;
synthesis is what we consider the writing process&nbsp;&nbsp;

00:13:11.280 --> 00:13:17.880
of DNA data storage. And so, you know, we came&nbsp;
up with some unique ideas there. We made a chip&nbsp;&nbsp;

00:13:17.880 --> 00:13:23.800
that enabled us to get to the densities that&nbsp;
we needed. And then on the reading side, we&nbsp;&nbsp;

00:13:23.800 --> 00:13:29.200
used different sequencing technologies. And it was&nbsp;
great to see that we could actually just, kind of,&nbsp;&nbsp;

00:13:29.200 --> 00:13:35.360
pull sequencing technologies off the shelf because&nbsp;
people are so interested in reading biological&nbsp;&nbsp;

00:13:35.360 --> 00:13:42.760
DNA. So we explored the Illumina technologies and&nbsp;
also Oxford Nanopore, which is a new technology&nbsp;&nbsp;

00:13:42.760 --> 00:13:48.320
coming on the horizon. And then preservation, too,&nbsp;
because we have to make sure that the data that’s&nbsp;&nbsp;

00:13:48.320 --> 00:13:53.840
stored in the DNA doesn't get damaged and that we&nbsp;
can recover it using the error-correcting codes.

00:13:53.840 --> 00:14:00.600
STRAUSS: Yeah, absolutely. And it's clear&nbsp;
that—and it's also been our experience that—DNA&nbsp;&nbsp;

00:14:00.600 --> 00:14:06.240
data storage and projects like this require more&nbsp;
than just a team of computer scientists. Bichlien,&nbsp;&nbsp;

00:14:06.240 --> 00:14:11.920
you’ve had the opportunity to collaborate with&nbsp;
many people in all different disciplines. So&nbsp;&nbsp;

00:14:11.920 --> 00:14:16.360
do you want to talk a little bit about&nbsp;
that? What kind of expertise, you know,&nbsp;&nbsp;

00:14:16.360 --> 00:14:20.940
other disciplines that are relevant to&nbsp;
bringing DNA data storage to reality?

00:14:20.940 --> 00:14:26.920
NGUYEN: Yeah, well, it's such a futuristic&nbsp;
technology, right? When you begin to work&nbsp;&nbsp;

00:14:26.920 --> 00:14:35.120
on the tech, you realize how many disciplines&nbsp;
and domains you actually have to reach in and&nbsp;&nbsp;

00:14:35.120 --> 00:14:43.800
leverage. One concrete example is that in order&nbsp;
to fabricate an electronic chip to synthesize DNA,&nbsp;&nbsp;

00:14:43.800 --> 00:14:50.040
we really had to pull in a lot of material science&nbsp;
research because there's different capabilities&nbsp;&nbsp;

00:14:50.040 --> 00:14:58.600
that are needed when trying to use liquid on a&nbsp;
chip. We, you know, have to think about DNA data&nbsp;&nbsp;

00:14:58.600 --> 00:15:05.960
storage itself. And that's a very different beast&nbsp;
than, you know, the traditional storage mediums.&nbsp;&nbsp;

00:15:05.960 --> 00:15:13.680
And so we worked with teams who literally create,&nbsp;
you know, these little tiny micro- or nanocapsules&nbsp;&nbsp;

00:15:13.680 --> 00:15:20.480
in glass and being able to store that there. It's&nbsp;
really interesting, this multidisciplinarity,&nbsp;&nbsp;

00:15:20.480 --> 00:15:28.000
because we're, in a way, bridging software&nbsp;
with wetware with hardware. And so you,&nbsp;&nbsp;

00:15:28.000 --> 00:15:33.260
kind of, need all the different disciplines&nbsp;
to actually get you to where you need to go.

00:15:33.260 --> 00:15:38.280
STRAUSS: Yeah, absolutely. And, you know,&nbsp;
building on, you know, collaborators,&nbsp;&nbsp;

00:15:38.280 --> 00:15:43.880
I think one area that was super interesting,&nbsp;
as well, and was pretty early on in the project&nbsp;&nbsp;

00:15:43.880 --> 00:15:49.840
was building that first end-to-end system that&nbsp;
we collaborated with University of Washington,&nbsp;&nbsp;

00:15:49.840 --> 00:15:56.480
the Molecular Information Systems Lab there,&nbsp;
to build. And really, at that point, you know,&nbsp;&nbsp;

00:15:56.480 --> 00:16:02.000
there had been work suggesting that DNA data&nbsp;
storage was viable, but nobody had really shown&nbsp;&nbsp;

00:16:02.000 --> 00:16:08.640
an end-to-end system, from beginning to end, and&nbsp;
in fact, my manager at the time, Doug Carmean,&nbsp;&nbsp;

00:16:08.640 --> 00:16:16.560
used to call it the “bubble gum and shoestring”&nbsp;
system. But it was a crucial first step because&nbsp;&nbsp;

00:16:16.560 --> 00:16:21.480
it shows it was possible to really fully&nbsp;
automate the process. And there have been&nbsp;&nbsp;

00:16:21.480 --> 00:16:27.320
several interesting challenges there in the&nbsp;
system, but we noticed that one particularly&nbsp;&nbsp;

00:16:27.320 --> 00:16:32.040
challenging one was synthesis. That first system&nbsp;
that we built was capable of storing the word&nbsp;&nbsp;

00:16:32.040 --> 00:16:38.280
“hello,” and that was all we could store. So&nbsp;
it wasn't a very high-capacity system. But in&nbsp;&nbsp;

00:16:38.280 --> 00:16:47.800
order to be able to store a lot more volumes of&nbsp;
data instead of a simple word, we really needed&nbsp;&nbsp;

00:16:47.800 --> 00:16:54.040
much more advanced synthesis systems. And this is&nbsp;
what both Bichlien and Jake ended up working on,&nbsp;&nbsp;

00:16:54.040 --> 00:16:59.660
so do you want to talk a little bit about that&nbsp;
and the importance of that particular work?

00:16:59.660 --> 00:17:05.400
SMITH: Yeah, absolutely. As you said, Karin,&nbsp;
the amount of DNA that is required to store&nbsp;&nbsp;

00:17:05.400 --> 00:17:09.400
the massive amount of data we spoke&nbsp;
about earlier is far beyond the amount&nbsp;&nbsp;

00:17:09.400 --> 00:17:15.680
of DNA that's needed for any, air quotes,&nbsp;
traditional applications of synthetic DNA,&nbsp;&nbsp;

00:17:15.680 --> 00:17:22.720
whether it's your gene construction or it's your&nbsp;
primer synthesis or such. And so we really had&nbsp;&nbsp;

00:17:22.720 --> 00:17:28.840
to rethink how you make DNA at scale and&nbsp;
think about how could this actually scale&nbsp;&nbsp;

00:17:29.520 --> 00:17:36.200
to meet the demand. And so Bichlien started out&nbsp;
looking at a thing called a microelectrode array,&nbsp;&nbsp;

00:17:36.200 --> 00:17:42.240
where you have this big checkerboard of small&nbsp;
individual reaction sites, and in each reaction&nbsp;&nbsp;

00:17:42.240 --> 00:17:50.520
site, we used electrochemistry in order to&nbsp;
control base by base—A, C, T, or G by A, C,&nbsp;&nbsp;

00:17:50.520 --> 00:17:56.400
T, or G—the sequence that was growing at that&nbsp;
particular reaction site. We got this down to&nbsp;&nbsp;

00:17:56.400 --> 00:18:03.400
the nanoscale. And so what this means practically&nbsp;
is that on one of these chips, we could synthesize&nbsp;&nbsp;

00:18:03.400 --> 00:18:10.920
at any given time on the order of hundreds of&nbsp;
millions of individual strands. So once we had the&nbsp;&nbsp;

00:18:10.920 --> 00:18:15.960
synthesis working with the traditional chemistry&nbsp;
where you're doing chemical synthesis—each base&nbsp;&nbsp;

00:18:15.960 --> 00:18:23.280
is added in using a mixture of chemicals that are&nbsp;
added to the individual spots—they're activated.&nbsp;&nbsp;

00:18:23.280 --> 00:18:29.840
But each coupling happens due to some energy you&nbsp;
prestored in the synthesis of your reagents. And&nbsp;&nbsp;

00:18:29.840 --> 00:18:36.200
this makes the synthesis of those reagents costly&nbsp;
and themselves a bottleneck. And so taking, you&nbsp;&nbsp;

00:18:36.200 --> 00:18:41.040
know, a look forward at what else was happening&nbsp;
in the synthetic biology world, the, you know,&nbsp;&nbsp;

00:18:41.040 --> 00:18:47.720
next big word in DNA synthesis was and still is&nbsp;
enzymatic synthesis, where rather than having to,&nbsp;&nbsp;

00:18:47.720 --> 00:18:53.880
you know, spend a lot of energy to chemically&nbsp;
pre-activate reagents that will go in to make&nbsp;&nbsp;

00:18:53.880 --> 00:19:02.120
your actual DNA strands, we capitalize on&nbsp;
nature's synthetic robots—enzymes—to start&nbsp;&nbsp;

00:19:02.120 --> 00:19:08.320
with less-activated, less-expensive-to-get-to,&nbsp;
cheaply-produced-through-natural-processes&nbsp;&nbsp;

00:19:08.320 --> 00:19:14.680
substrates, and we use the enzymes themselves,&nbsp;
toggling their activity over each of the&nbsp;&nbsp;

00:19:14.680 --> 00:19:21.000
individual chips, or each of the individual&nbsp;
spots on our checkerboard, to construct DNA&nbsp;&nbsp;

00:19:21.000 --> 00:19:26.320
strands. And so we got a little bit into this&nbsp;
project. You know, we successfully showed that&nbsp;&nbsp;

00:19:26.320 --> 00:19:32.440
we could put down selectively one base at a&nbsp;
given time. We hope that others will, kind of,&nbsp;&nbsp;

00:19:32.440 --> 00:19:37.520
take up the work that we've put out there, you&nbsp;
know, particularly our wonderful collaborators&nbsp;&nbsp;

00:19:37.520 --> 00:19:42.480
at Ansa who helped us design the enzymatic&nbsp;
system. And one day we will see, you know,&nbsp;&nbsp;

00:19:42.480 --> 00:19:49.260
a truly parallelized, in this fashion, enzymatic&nbsp;
DNA system that can achieve the scales necessary.

00:19:49.260 --> 00:19:54.760
NGUYEN: It's interesting to note that even&nbsp;
though it's DNA and we're still storing data&nbsp;&nbsp;

00:19:54.760 --> 00:20:01.200
in these DNA strands, chemical synthesis and&nbsp;
enzymatic synthesis provide different errors&nbsp;&nbsp;

00:20:01.200 --> 00:20:07.560
that you see in the actual files, right, in&nbsp;
the DNA files. And so I know that we talked&nbsp;&nbsp;

00:20:07.560 --> 00:20:14.440
to Sergey about how do we deal with these new&nbsp;
types of errors and also the new capabilities&nbsp;&nbsp;

00:20:14.440 --> 00:20:21.420
that you can have, for example, if you don't&nbsp;
control base by base the DNA synthesis.

00:20:21.420 --> 00:20:25.840
YEKHANIN: This whole field of DNA data storage,&nbsp;
like, the technologies on the biology side are&nbsp;&nbsp;

00:20:25.840 --> 00:20:30.000
advancing rapidly, right. And there are different&nbsp;
approaches to synthesis. There are different&nbsp;&nbsp;

00:20:30.000 --> 00:20:34.600
approaches to sequencing. And, presumably,&nbsp;
the way the storage is actually done, like,&nbsp;&nbsp;

00:20:34.600 --> 00:20:39.400
is also progressing, right, and we had works on&nbsp;
that. So there is, kind of, this very general,&nbsp;&nbsp;

00:20:39.400 --> 00:20:42.800
kind of, high-level error profile that you can&nbsp;
say that these are the type of errors that you&nbsp;&nbsp;

00:20:42.800 --> 00:20:47.360
encounter in DNA data storage. Like, in DNA&nbsp;
molecules—just the sequence of these bases,&nbsp;&nbsp;

00:20:49.040 --> 00:20:51.920
A, G, C, T, in maybe a length of,&nbsp;
like, 200 or so and you store a very,&nbsp;&nbsp;

00:20:51.920 --> 00:20:56.000
very large number of them—the errors that you&nbsp;
see is that some of these strands, kind of,&nbsp;&nbsp;

00:20:56.000 --> 00:21:00.920
will disappear. Some of these strands can be&nbsp;
torn apart like, let’s say, in two pieces,&nbsp;&nbsp;

00:21:00.920 --> 00:21:05.560
maybe even more. And then on every strand, you&nbsp;
also encounter these errors—insertions, deletions,&nbsp;&nbsp;

00:21:05.560 --> 00:21:10.560
substitutions—with different rates. Like, the&nbsp;
likelihood of all kinds of these errors may differ&nbsp;&nbsp;

00:21:10.560 --> 00:21:14.760
very significantly across different technologies&nbsp;
that you use on the biology side. And also there&nbsp;&nbsp;

00:21:14.760 --> 00:21:19.400
can be error bursts somehow. Maybe you can get&nbsp;
an insertion of, I don’t know, 10 A’s, like, in a&nbsp;&nbsp;

00:21:19.400 --> 00:21:26.240
row, or you can lose, like, you know, 10 bases in&nbsp;
a row. So if you don't, kind of, quantify, like,&nbsp;&nbsp;

00:21:26.240 --> 00:21:31.080
what are the likelihoods of all these bad events&nbsp;
happening, then I think this still, kind of,&nbsp;&nbsp;

00:21:31.080 --> 00:21:36.240
fits at least the majority of approaches to DNA&nbsp;
data storage, maybe not exactly all of them,&nbsp;&nbsp;

00:21:36.240 --> 00:21:40.200
but it fits the majority. So when we design&nbsp;
coding schemes, we are trying also, kind of,&nbsp;&nbsp;

00:21:40.200 --> 00:21:44.560
to look ahead in the sense that, like,&nbsp;
we don't know, like, in five years, like,&nbsp;&nbsp;

00:21:44.560 --> 00:21:48.520
how will these error profiles, how will it look&nbsp;
like. So the technologies that we develop on the&nbsp;&nbsp;

00:21:48.520 --> 00:21:53.400
error-correction side, we try to keep them very&nbsp;
flexible, so whether it's enzymatic synthesis,&nbsp;&nbsp;

00:21:53.400 --> 00:21:58.680
whether it's Nanopore technology, whether it’s&nbsp;
Illumina technology that is being used, the&nbsp;&nbsp;

00:21:58.680 --> 00:22:03.360
error-correction algorithms would be able to adapt&nbsp;
and would still be useful. But, I mean, this makes&nbsp;&nbsp;

00:22:03.360 --> 00:22:07.460
also coding aspect harder because, [LAUGHTER] kind&nbsp;
of, you want to keep all this flexibility in mind.
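
NOTE A minimal sketch of the error profile described above, not code from the project. It models per-base insertions, deletions, and substitutions plus whole-strand loss; error bursts and strand breaks are left out for brevity. The names noisy_read and read_pool and every rate value are illustrative assumptions.
```python
import random
BASES = "ACGT"
def noisy_read(strand, p_ins, p_del, p_sub, rng):
    """Return one noisy read of `strand` (hypothetical per-base error rates)."""
    out = []
    for base in strand:
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))  # spurious base inserted before this one
        r = rng.random()
        if r < p_del:
            continue  # this base is lost
        if r < p_del + p_sub:
            out.append(rng.choice([b for b in BASES if b != base]))  # wrong base
        else:
            out.append(base)
    return "".join(out)
def read_pool(strands, copies, p_loss, rng, **rates):
    """Read every surviving strand `copies` times; a lost strand yields no reads."""
    reads = []
    for s in strands:
        if rng.random() < p_loss:
            continue  # the whole strand disappears
        reads += [noisy_read(s, rng=rng, **rates) for _ in range(copies)]
    return reads
if __name__ == "__main__":
    rng = random.Random(0)
    pool = ["".join(rng.choice(BASES) for _ in range(200)) for _ in range(3)]
    for read in read_pool(pool, copies=2, p_loss=0.1, rng=rng, p_ins=0.01, p_del=0.02, p_sub=0.01):
        print(len(read), read[:40])
```
Keeping each rate as a separate parameter reflects the point made above: the same kinds of errors occur across synthesis and sequencing technologies, but their likelihoods may differ a lot.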

00:22:07.460 --> 00:22:11.520
STRAUSS: So, Sergey, we are&nbsp;
at an interesting moment now&nbsp;&nbsp;

00:22:11.520 --> 00:22:15.680
because you’re open sourcing the&nbsp;
Trellis BMA piece of code, right,&nbsp;&nbsp;

00:22:15.680 --> 00:22:20.440
that you published a few years ago. Can&nbsp;
you talk a little bit about that specific&nbsp;&nbsp;

00:22:20.440 --> 00:22:24.860
problem of trace reconstruction and then&nbsp;
the paper specifically and how it solves it?

00:22:24.860 --> 00:22:29.320
YEKHANIN: Absolutely, yeah, so this Trellis BMA&nbsp;
paper for which we are releasing the source code&nbsp;&nbsp;

00:22:29.320 --> 00:22:33.520
right now, this is, kind of, this is the latest in&nbsp;
our sequence of publications on error-correction&nbsp;&nbsp;

00:22:33.520 --> 00:22:39.280
for DNA data storage. And I should say that, like,&nbsp;
we already discussed that the project is, kind of,&nbsp;&nbsp;

00:22:39.280 --> 00:22:44.960
very interdisciplinary. So, like, we have experts&nbsp;
from all kinds of fields. But really even within,&nbsp;&nbsp;

00:22:44.960 --> 00:22:48.880
like, within this coding theory, like,&nbsp;
within computer science/information theory,&nbsp;&nbsp;

00:22:48.880 --> 00:22:53.560
coding theory, in our algorithms, we use ideas&nbsp;
from very different branches. I mean, there are&nbsp;&nbsp;

00:22:53.560 --> 00:22:58.640
some core ideas from, like, core algorithm space,&nbsp;
and I won’t go into these, but let me just focus,&nbsp;&nbsp;

00:22:58.640 --> 00:23:04.800
kind of, on two aspects. So when we just faced&nbsp;
this problem of coding for DNA data storage and we&nbsp;&nbsp;

00:23:04.800 --> 00:23:08.960
were thinking about, OK, so how to exactly design&nbsp;
the coding scheme and what are the algorithms&nbsp;&nbsp;

00:23:08.960 --> 00:23:13.640
that we’ll be using for error correction, so,&nbsp;
I mean, we’re always studying the literature,&nbsp;&nbsp;

00:23:13.640 --> 00:23:18.680
and we came upon this problem called trace&nbsp;
reconstruction that was pretty popular—I mean,&nbsp;&nbsp;

00:23:18.680 --> 00:23:23.960
somewhat popular, I would say—in computer science&nbsp;
and in statistics. It didn’t have much motivation,&nbsp;&nbsp;

00:23:23.960 --> 00:23:28.920
but very strong mathematicians had been looking&nbsp;
at it. And the problem is as follows. So, like,&nbsp;&nbsp;

00:23:28.920 --> 00:23:33.760
there is a long binary string picked at random,&nbsp;
and then it’s transmitted over a deletion channel,&nbsp;&nbsp;

00:23:33.760 --> 00:23:39.200
so some bits—some zeros and some ones—at certain&nbsp;
coordinates get deleted and you get to see, kind&nbsp;&nbsp;

00:23:39.200 --> 00:23:43.680
of, the shortened version of the string. But you&nbsp;
get to see it multiple times. And the question is,&nbsp;&nbsp;

00:23:43.680 --> 00:23:48.240
like, how many times do you need to see it so that&nbsp;
you can get a reasonably accurate estimate of the&nbsp;&nbsp;

00:23:48.240 --> 00:23:54.080
original string that was transmitted? So that was&nbsp;
called trace reconstruction, and we took a lot of&nbsp;&nbsp;

00:23:54.080 --> 00:23:58.480
motivation—we took a lot of inspiration—from the&nbsp;
problem, I would say, because really, in DNA data&nbsp;&nbsp;

00:23:58.480 --> 00:24:03.440
storage, if we think about a single strand, like,&nbsp;
a single strand which is being stored, after we&nbsp;&nbsp;

00:24:03.440 --> 00:24:08.600
read it, we usually get multiple reads of this&nbsp;
string. And, well, the errors there are not just&nbsp;&nbsp;

00:24:08.600 --> 00:24:12.960
deletions. There are insertions, substitutions,&nbsp;
and, like, inversive errors, but still we could&nbsp;&nbsp;

00:24:12.960 --> 00:24:18.600
rely on this literature in computer science that&nbsp;
already had some ideas. So there was an algorithm&nbsp;&nbsp;

00:24:18.600 --> 00:24:23.360
called BMA, Bitwise Majority Alignment. We&nbsp;
extended it—we adapted it, kind of, for the needs&nbsp;&nbsp;

00:24:23.360 --> 00:24:28.440
of DNA data storage—and it became, kind of, one&nbsp;
of the tools in our toolbox for error correction.
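
NOTE A minimal sketch of the trace reconstruction setup and of Bitwise Majority Alignment as described above. This is a simplified textbook-style variant for binary strings and deletions only, not the DNA-adapted version the speakers built. The function names and the deletion rate are assumptions.
```python
import random
from collections import Counter
def deletion_channel(s, p, rng):
    """Delete each symbol of `s` independently with probability `p`."""
    return "".join(c for c in s if rng.random() >= p)
def bma(traces, n):
    """Estimate the length-n original from several traces by majority vote."""
    ptr = [0] * len(traces)  # one read pointer per trace
    out = []
    for _ in range(n):
        votes = Counter(t[i] for t, i in zip(traces, ptr) if i < len(t))
        if not votes:
            break  # every trace is exhausted
        winner = votes.most_common(1)[0][0]
        out.append(winner)
        for k, t in enumerate(traces):
            # Traces that agree with the majority advance; a disagreeing trace is
            # assumed to have lost this symbol, so its pointer stays in place.
            if ptr[k] < len(t) and t[ptr[k]] == winner:
                ptr[k] += 1
    return "".join(out)
if __name__ == "__main__":
    rng = random.Random(0)
    original = "".join(rng.choice("01") for _ in range(60))
    traces = [deletion_channel(original, 0.05, rng) for _ in range(9)]
    estimate = bma(traces, len(original))
    print(sum(a == b for a, b in zip(original, estimate)), "of", len(original), "positions match")
```
Varying the number of traces in the demo gives a feel for the question posed above: how many reads are needed before the estimate becomes reasonably accurate.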

00:24:28.440 --> 00:24:32.160
So we also started to use ideas from&nbsp;
literature on electrical engineering,&nbsp;&nbsp;

00:24:32.160 --> 00:24:35.960
what are called convolutional error-correcting&nbsp;
codes and a certain, kind of, class of algorithms&nbsp;&nbsp;

00:24:35.960 --> 00:24:41.000
for decoding errors in these convolutional&nbsp;
error-correcting codes called, like, I mean,&nbsp;&nbsp;

00:24:41.000 --> 00:24:44.400
Trellis is the main data structure, like,&nbsp;
Trellis-based algorithms for decoding&nbsp;&nbsp;

00:24:44.400 --> 00:24:49.160
convolutional codes, like, Viterbi algorithm or&nbsp;
BCJR algorithm. Convolutional codes allow you to&nbsp;&nbsp;

00:24:49.160 --> 00:24:56.600
introduce redundancy on the string. So, like, with&nbsp;
algorithms kind of similar to BMA, like, they were&nbsp;&nbsp;

00:24:56.600 --> 00:25:00.840
good for doing error correction when there was no&nbsp;
redundancy on the strand itself. Like, when there&nbsp;&nbsp;

00:25:00.840 --> 00:25:05.360
is redundancy on the strand, kind of, we could do&nbsp;
some things, but really it was very limited. With&nbsp;&nbsp;

00:25:05.360 --> 00:25:11.440
Trellis-based approaches, like, again inspired&nbsp;
by the literature in electrical engineering,&nbsp;&nbsp;

00:25:11.440 --> 00:25:16.360
we had an approach to introduce redundancy on the&nbsp;
strand, so that allowed us to have more powerful&nbsp;&nbsp;

00:25:16.360 --> 00:25:21.080
error-correction algorithms. And then in the end,&nbsp;
we have this algorithm, which we call Trellis BMA,&nbsp;&nbsp;

00:25:21.080 --> 00:25:26.080
which, kind of, combines ideas from both&nbsp;
fields. So it's based on Trellis, but it's&nbsp;&nbsp;

00:25:26.080 --> 00:25:30.280
also more efficient than standard Trellis-based&nbsp;
algorithms because it uses ideas from BMA from&nbsp;&nbsp;

00:25:30.280 --> 00:25:34.920
computer science literature. So this is, kind of,&nbsp;
this is a mix of these two approaches. And, yeah,&nbsp;&nbsp;

00:25:34.920 --> 00:25:39.800
that’s the paper that we wrote about three years&nbsp;
ago. And now we're open sourcing it. So it is the&nbsp;&nbsp;

00:25:39.800 --> 00:25:44.560
most powerful algorithm for DNA error correction&nbsp;
that we developed in the group. We’re really happy&nbsp;&nbsp;

00:25:44.560 --> 00:25:48.120
that now we are making it publicly available&nbsp;
so that anybody can experiment with the source&nbsp;&nbsp;

00:25:48.120 --> 00:25:51.800
code. Because, again, the field has expanded a&nbsp;
lot, and now there are multiple groups around&nbsp;&nbsp;

00:25:51.800 --> 00:25:56.600
the globe that work just specifically on error&nbsp;
correction apart from all other aspects, so, yeah,&nbsp;&nbsp;

00:25:56.600 --> 00:26:00.940
so we are really happy that it’s become publicly&nbsp;
available to hopefully further advance the field.
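
NOTE A minimal sketch of the two electrical engineering ingredients mentioned above: a convolutional code that adds redundancy, and trellis-based Viterbi decoding. It uses a standard rate-1/2 textbook code with generators (7, 5) and corrects substitutions only. It is not the Trellis BMA algorithm, which also has to handle insertions and deletions across multiple reads; all names here are assumptions.
```python
G = (0b111, 0b101)  # generator polynomials (7, 5) in octal, constraint length 3
def parity(x):
    return bin(x).count("1") & 1
def encode(bits):
    """Emit two coded bits per input bit; two zero tail bits flush the encoder."""
    state = 0  # the two most recent input bits
    out = []
    for b in list(bits) + [0, 0]:
        reg = (b << 2) | state
        out += [parity(reg & g) for g in G]
        state = reg >> 1
    return out
def viterbi(received):
    """Hard-decision Viterbi decoding over the 4-state trellis (even-length input)."""
    inf = float("inf")
    cost = [0, inf, inf, inf]  # the encoder starts in state 0
    paths = [[], [], [], []]
    for i in range(0, len(received), 2):
        pair = received[i:i + 2]
        new_cost = [inf] * 4
        new_paths = [None] * 4
        for state in range(4):
            if cost[state] == inf:
                continue  # state not reachable yet
            for b in (0, 1):
                reg = (b << 2) | state
                expected = [parity(reg & g) for g in G]
                c = cost[state] + sum(e != r for e, r in zip(expected, pair))
                nxt = reg >> 1
                if c < new_cost[nxt]:  # keep only the cheapest path into each state
                    new_cost[nxt] = c
                    new_paths[nxt] = paths[state] + [b]
        cost, paths = new_cost, new_paths
    return paths[0][:-2]  # the tail forces the final state to 0; drop the tail bits
if __name__ == "__main__":
    message = [1, 0, 1, 1, 0]
    sent = encode(message)
    sent[3] ^= 1  # flip one coded bit to imitate a substitution
    print(viterbi(sent) == message)
```
The trellis keeps one best path per encoder state at each step, which is what makes decoding tractable; the redundancy added by the encoder is what lets the decoder undo the flipped bit.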

00:26:00.940 --> 00:26:05.120
STRAUSS: Yeah, absolutely, and I'm&nbsp;
always amazed by, you know, how,&nbsp;&nbsp;

00:26:05.120 --> 00:26:10.680
it is really about building on other&nbsp;
people's work. Jake and Bichlien,&nbsp;&nbsp;

00:26:10.680 --> 00:26:15.560
you recently published a paper in Nature&nbsp;
Communications. Can you tell us a little&nbsp;&nbsp;

00:26:15.560 --> 00:26:23.600
bit about what it was, what you exposed the&nbsp;
DNA to, and what it was specifically about?

00:26:23.600 --> 00:26:28.240
NGUYEN: Yeah. So that paper was on the&nbsp;
effects of neutron radiation on DNA&nbsp;&nbsp;

00:26:29.040 --> 00:26:33.400
data storage. So, you know, when we&nbsp;
started the DNA Data Storage Project,&nbsp;&nbsp;

00:26:33.400 --> 00:26:39.080
it was really a comparison, right, between the&nbsp;
different storage media that exist today. And&nbsp;&nbsp;

00:26:39.080 --> 00:26:44.640
one of the issues that have come up through the&nbsp;
years of development of those technologies was,&nbsp;&nbsp;

00:26:44.640 --> 00:26:51.600
you know, hard errors and soft errors that were&nbsp;
induced by radiation. So we wanted to know,&nbsp;&nbsp;

00:26:51.600 --> 00:27:00.640
does that maybe happen in DNA? We know that DNA,&nbsp;
in humans at least, is affected by radiation from&nbsp;&nbsp;

00:27:00.640 --> 00:27:07.640
cosmic rays. And so that was really the motivation&nbsp;
for this type of experiment. So what we did was&nbsp;&nbsp;

00:27:07.640 --> 00:27:16.960
we essentially took our DNA files and dried&nbsp;
them and threw them in a neutron accelerator,&nbsp;&nbsp;

00:27:16.960 --> 00:27:24.024
which was fantastic. It was so exciting. That's,&nbsp;
kind of, the merge of, you know, sci fi with sci&nbsp;&nbsp;

00:27:24.024 --> 00:27:33.160
fi at the same time. [LAUGHS] It was fantastic.&nbsp;
And we irradiated for over 80 million years—

00:27:33.160 --> 00:27:36.171
STRAUSS: The equivalent of …
NGUYEN: The equivalent of 80 million years.

00:27:36.171 --> 00:27:37.265
STRAUSS: Yes, because it's a lot of&nbsp;
radiation all at the same time, …

00:27:37.265 --> 00:27:38.787
NGUYEN: It’s a lot of radiation …

00:27:38.787 --> 00:27:41.880
STRAUSS: … and it's&nbsp;
accelerated radiation exposure?

00:27:41.880 --> 00:27:46.400
NGUYEN: Yeah, I would say it's accelerated&nbsp;
aging with radiation. It's an insane amount&nbsp;&nbsp;

00:27:46.400 --> 00:27:53.280
of radiation. And it was surprising that&nbsp;
even though we irradiated our DNA files&nbsp;&nbsp;

00:27:53.280 --> 00:27:58.520
with that much radiation, there wasn't that much&nbsp;
damage. And that's surprising because, you know,&nbsp;&nbsp;

00:27:58.520 --> 00:28:04.560
we know that humans, if we were to be irradiated&nbsp;
like that, it would be disastrous. But in,&nbsp;&nbsp;

00:28:04.560 --> 00:28:09.860
you know, DNA, our files were able&nbsp;
to be recovered with zero bit errors.

00:28:09.860 --> 00:28:12.440
STRAUSS: And why that difference?

00:28:12.440 --> 00:28:18.480
NGUYEN: Well, we think there's a few reasons.&nbsp;
One is that when you look at the interaction&nbsp;&nbsp;

00:28:18.480 --> 00:28:25.440
between a neutron and the actual elemental&nbsp;
composition of DNA—which is basically carbons,&nbsp;&nbsp;

00:28:25.440 --> 00:28:32.160
oxygens, and hydrogens, maybe a phosphorus—the&nbsp;
neutrons don't interact with the DNA much.&nbsp;&nbsp;

00:28:32.160 --> 00:28:36.520
And if it did interact, we would&nbsp;
have, for example, a strand break,&nbsp;&nbsp;

00:28:36.520 --> 00:28:42.280
which based on the error-correcting codes,&nbsp;
we can recover from. So essentially,&nbsp;&nbsp;

00:28:42.280 --> 00:28:46.320
there's not much … one, there's not much&nbsp;
interaction between neutrons and DNA,&nbsp;&nbsp;

00:28:46.320 --> 00:28:51.380
and second, we have error-correcting&nbsp;
codes that would prevent any data loss.

00:28:51.380 --> 00:28:58.200
STRAUSS: Awesome, so yeah, this is another&nbsp;
milestone that contributes towards the&nbsp;&nbsp;

00:28:58.200 --> 00:29:04.080
technology becoming a reality. There are also&nbsp;
other conditions that are needed for technology&nbsp;&nbsp;

00:29:04.080 --> 00:29:10.720
to be brought to the market. And one thing I've&nbsp;
worked on is to, you know, create the DNA Data&nbsp;&nbsp;

00:29:10.720 --> 00:29:16.680
Storage Alliance; this is something Microsoft&nbsp;
co-founded with Illumina, Twist Bioscience,&nbsp;&nbsp;

00:29:16.680 --> 00:29:23.680
and Western Digital. And the goal there was to&nbsp;
essentially provide the right conditions for the&nbsp;&nbsp;

00:29:23.680 --> 00:29:32.960
technology to thrive commercially. We did bring&nbsp;
together multiple universities and companies that&nbsp;&nbsp;

00:29:32.960 --> 00:29:39.200
were interested in the technology. And one thing&nbsp;
that we've seen with storage technologies that's&nbsp;&nbsp;

00:29:39.200 --> 00:29:45.360
been pretty important is standardization and&nbsp;
making sure that the technology’s interoperable.&nbsp;&nbsp;

00:29:45.360 --> 00:29:53.440
And, you know, we've seen stalemate situations&nbsp;
like Blu-ray and high-definition DVD, where, you&nbsp;&nbsp;

00:29:53.440 --> 00:29:58.720
know, really we couldn't decide on a standard, and&nbsp;
the technology, it took a while for the technology&nbsp;&nbsp;

00:29:58.720 --> 00:30:04.000
to be picked up, and the intent of the DNA Data&nbsp;
Storage [Alliance] is to provide an ecosystem&nbsp;&nbsp;

00:30:04.000 --> 00:30:10.800
of companies, universities, groups interested in&nbsp;
making sure that this time, it's an interoperable&nbsp;&nbsp;

00:30:10.800 --> 00:30:17.840
technology from the get-go, and that increases&nbsp;
the chances of commercial adoption. As a group,&nbsp;&nbsp;

00:30:17.840 --> 00:30:22.920
we often talk about how amazing it is to work&nbsp;
for a company that empowers us to do this kind of&nbsp;&nbsp;

00:30:22.920 --> 00:30:28.280
research. And for me, one of Microsoft Research’s&nbsp;
unique strengths, particularly in this project,&nbsp;&nbsp;

00:30:28.280 --> 00:30:32.880
is the opportunity to work with such a&nbsp;
diverse set of collaborators on such a&nbsp;&nbsp;

00:30:32.880 --> 00:30:38.560
multidisciplinary project like we have. How&nbsp;
do you all think where you've done this work&nbsp;&nbsp;

00:30:38.560 --> 00:30:43.240
has impacted how you've gone about it and&nbsp;
the contributions you’ve been able to make?

00:30:43.240 --> 00:30:48.880
NGUYEN: I'm going to start with if we look&nbsp;
around this table and we see who's sitting at it,&nbsp;&nbsp;

00:30:48.880 --> 00:30:56.720
which is two chemists, a computer architect, and&nbsp;
a coding theorist, and we come together and we're&nbsp;&nbsp;

00:30:56.720 --> 00:31:03.720
like, what can we make that would be super, super&nbsp;
impactful? I think that's the answer right there,&nbsp;&nbsp;

00:31:03.720 --> 00:31:09.760
is that being at Microsoft and being in&nbsp;
a culture that really fosters this type&nbsp;&nbsp;

00:31:09.760 --> 00:31:16.680
of interdisciplinary collaboration is the key&nbsp;
to getting a project like this off the ground.

00:31:16.680 --> 00:31:20.720
SMITH: Yeah, absolutely. And we should&nbsp;
acknowledge the gigantic contributions&nbsp;&nbsp;

00:31:20.720 --> 00:31:25.880
made by our collaborators at the University of&nbsp;
Washington. Many of them would fall in not any&nbsp;&nbsp;

00:31:25.880 --> 00:31:30.120
of these three categories. They’re electrical&nbsp;
engineers, they're mechanical engineers,&nbsp;&nbsp;

00:31:30.120 --> 00:31:35.040
they're pure biologists that we worked with.&nbsp;
And each of them brought their own perspective,&nbsp;&nbsp;

00:31:35.040 --> 00:31:39.080
and particularly when you talk about&nbsp;
going to a true end-to-end system,&nbsp;&nbsp;

00:31:39.080 --> 00:31:43.420
those perspectives were invaluable as we were&nbsp;
trying to fit all the puzzle pieces together.

00:31:43.420 --> 00:31:49.040
STRAUSS: Yeah, absolutely. We've had great&nbsp;
collaborations over time—University of Washington,&nbsp;&nbsp;

00:31:49.040 --> 00:31:56.880
ETH Zürich, Los Alamos National Lab, ChipIr,&nbsp;
Twist Bioscience, Ansa Biotechnologies. Yeah,&nbsp;&nbsp;

00:31:56.880 --> 00:32:02.760
it’s been really great and a great set of&nbsp;
different disciplines, all the way from coding&nbsp;&nbsp;

00:32:02.760 --> 00:32:12.040
theorists to the molecular biology and chemistry,&nbsp;
electrical and mechanical engineering. One of the&nbsp;&nbsp;

00:32:12.040 --> 00:32:18.240
great things about research is there's never&nbsp;
a shortage of interesting questions to pursue,&nbsp;&nbsp;

00:32:18.240 --> 00:32:24.960
and for us, this particular work has opened the&nbsp;
door to research in adjacent domains, including&nbsp;&nbsp;

00:32:24.960 --> 00:32:31.920
sustainability fields. DNA data storage requires&nbsp;
small amounts of materials to accommodate the&nbsp;&nbsp;

00:32:31.920 --> 00:32:38.680
large amounts of data, and early on, we wanted to&nbsp;
understand if DNA data storage was, as it seemed,&nbsp;&nbsp;

00:32:38.680 --> 00:32:45.640
a more sustainable way to store information.&nbsp;
And we learned a lot. Bichlien and Jake,&nbsp;&nbsp;

00:32:45.640 --> 00:32:52.880
you had experience in green chemistry when you&nbsp;
came to Microsoft. What new findings did we make,&nbsp;&nbsp;

00:32:52.880 --> 00:32:58.120
and what sustainability benefits do&nbsp;
we get with DNA data storage? And,&nbsp;&nbsp;

00:32:58.120 --> 00:33:02.520
finally, what new sustainability&nbsp;
work has the project led to?

00:33:02.520 --> 00:33:09.160
NGUYEN: As a part of this project, if we're&nbsp;
going to bring new technologies to the forefront,&nbsp;&nbsp;

00:33:09.160 --> 00:33:14.560
you know, to the world, we should make sure that&nbsp;
they have a lower carbon footprint, for example,&nbsp;&nbsp;

00:33:14.560 --> 00:33:22.360
than previous technologies. And so we ran a life&nbsp;
cycle assessment—which is a way to systematically&nbsp;&nbsp;

00:33:22.360 --> 00:33:29.160
evaluate the environmental impacts of anything of&nbsp;
interest—and we did this on DNA data storage and&nbsp;&nbsp;

00:33:29.160 --> 00:33:38.040
compared it to electronic storage medium, and we&nbsp;
noticed that if we were able to store all of our&nbsp;&nbsp;

00:33:38.040 --> 00:33:43.880
digital information in DNA, that we would have&nbsp;
benefits associated with carbon emissions. We&nbsp;&nbsp;

00:33:43.880 --> 00:33:49.640
would be able to reduce that because we don't need&nbsp;
as much infrastructure compared to the traditional&nbsp;&nbsp;

00:33:49.640 --> 00:33:57.920
storage methods. And there would be an energy&nbsp;
reduction, as well, because this is a passive way&nbsp;&nbsp;

00:33:57.920 --> 00:34:04.960
of archival data storage. So that was, you know,&nbsp;
the main takeaways that we had. But that also,&nbsp;&nbsp;

00:34:04.960 --> 00:34:10.240
kind of, led us to think about other&nbsp;
technologies that would be beneficial&nbsp;&nbsp;

00:34:10.240 --> 00:34:16.380
beyond data storage and how we could use the&nbsp;
same kind of life cycle thinking towards that.

00:34:16.380 --> 00:34:23.040
SMITH: This design approach that you've, you know,&nbsp;
talked about us stumbling on, not inventing but&nbsp;&nbsp;

00:34:23.040 --> 00:34:27.240
seeing other people doing in the literature and&nbsp;
trying to implement ourselves on the DNA Data&nbsp;&nbsp;

00:34:27.240 --> 00:34:33.400
Storage Project, you know, is something that can&nbsp;
be much bigger than any single material. And where&nbsp;&nbsp;

00:34:33.400 --> 00:34:38.640
we think there's a, you know, chance for folks&nbsp;
like ourselves at Microsoft Research to make a&nbsp;&nbsp;

00:34:38.640 --> 00:34:45.960
real impact on this sustainability-focused design&nbsp;
is through the application of machine learning,&nbsp;&nbsp;

00:34:45.960 --> 00:34:53.160
artificial intelligence—the new tools that will&nbsp;
allow us to look at much bigger design spaces&nbsp;&nbsp;

00:34:53.160 --> 00:34:57.920
than we could previously to evaluate&nbsp;
sustainability metrics that were not&nbsp;&nbsp;

00:34:57.920 --> 00:35:04.360
possible when everything was done manually and&nbsp;
to ultimately, you know, at the end of the day,&nbsp;&nbsp;

00:35:04.360 --> 00:35:11.160
take a sustainability-first look at what a&nbsp;
material should be composed of. And so we've&nbsp;&nbsp;

00:35:11.160 --> 00:35:16.080
tried to prototype this with a few projects.&nbsp;
We had another wonderful collaboration with&nbsp;&nbsp;

00:35:16.080 --> 00:35:21.600
the University of Washington where we looked at&nbsp;
recyclable circuit boards and a novel material&nbsp;&nbsp;

00:35:21.600 --> 00:35:26.080
called a vitrimer that it could possibly be made&nbsp;
out of. We've had another great collaboration with&nbsp;&nbsp;

00:35:26.080 --> 00:35:31.280
the University of Michigan, where we've looked at&nbsp;
the design of charge-carrying molecules in these&nbsp;&nbsp;

00:35:31.280 --> 00:35:37.480
things called flow batteries that have good&nbsp;
potential for energy smoothing in, you know,&nbsp;&nbsp;

00:35:37.480 --> 00:35:43.440
renewables production, trying to get us out&nbsp;
of that day-night, boom-bust cycle. And we&nbsp;&nbsp;

00:35:43.440 --> 00:35:48.040
had one more project, you know, this time with&nbsp;
collaborators at the University of [California,] Berkeley,&nbsp;&nbsp;

00:35:48.040 --> 00:35:53.840
where we looked at, you know, design of a class&nbsp;
of materials called a metal organic framework,&nbsp;&nbsp;

00:35:53.840 --> 00:36:01.520
which have great promise in low-energy-cost&nbsp;
gas separation, such as pulling CO2 out of the,&nbsp;&nbsp;

00:36:01.520 --> 00:36:05.600
you know, plume of a smokestack or, you&nbsp;
know, ideally out of the air itself.

00:36:05.600 --> 00:36:12.440
STRAUSS: For me, the DNA work has made&nbsp;
me much more open to projects outside my&nbsp;&nbsp;

00:36:12.440 --> 00:36:18.320
own research area—as Bichlien mentioned, my&nbsp;
core research area is computer architecture,&nbsp;&nbsp;

00:36:18.320 --> 00:36:24.120
but we've ventured in quite a bit of&nbsp;
other areas here—and going way beyond&nbsp;&nbsp;

00:36:24.120 --> 00:36:30.640
my own comfort zone and really made me love&nbsp;
interdisciplinary projects like this and try,&nbsp;&nbsp;

00:36:30.640 --> 00:36:36.800
really try, to do the most important work I&nbsp;
can. And this is what attracted me to these&nbsp;&nbsp;

00:36:36.800 --> 00:36:42.960
other areas of environmental sustainability&nbsp;
that Bichlien and Jake covered, where there's&nbsp;&nbsp;

00:36:42.960 --> 00:36:49.880
absolutely no lack of problems. Like them, I'm&nbsp;
super interested in using AI to solve many of&nbsp;&nbsp;

00:36:49.880 --> 00:36:57.760
them. So how do each of you think working on&nbsp;
the DNA Data Storage Project has influenced&nbsp;&nbsp;

00:36:57.760 --> 00:37:05.200
your research approach more generally and how you&nbsp;
think about research questions to pursue next?

00:37:05.200 --> 00:37:09.240
YEKHANIN: It definitely expanded the horizons&nbsp;
a lot, like, just, kind of, just having this&nbsp;&nbsp;

00:37:09.240 --> 00:37:14.480
interactions with people, kind of, whose core&nbsp;
areas of research are so different from my own&nbsp;&nbsp;

00:37:14.480 --> 00:37:18.040
and also a lot of learning even within my&nbsp;
own field that we had to do to, kind of,&nbsp;&nbsp;

00:37:18.040 --> 00:37:21.400
carry this project out. So, I mean, it&nbsp;
was a great and rewarding experience.

00:37:21.400 --> 00:37:27.280
NGUYEN: Yeah, for me, it's kind of the opposite of&nbsp;
Karin, right. I started as an organic chemist and&nbsp;&nbsp;

00:37:27.280 --> 00:37:37.760
then now really, one, appreciate the breadth&nbsp;
and depth of going from a concept to a real&nbsp;&nbsp;

00:37:37.760 --> 00:37:45.400
end-to-end prototype and all the requirements&nbsp;
that you need to get there. And then also,&nbsp;&nbsp;

00:37:45.400 --> 00:37:52.960
really the importance of having, you know,&nbsp;
a background in computer science and really&nbsp;&nbsp;

00:37:52.960 --> 00:37:59.360
being able to understand the lingo that is used&nbsp;
in multidisciplinary projects because you might&nbsp;&nbsp;

00:37:59.360 --> 00:38:04.600
say something and someone else interprets it very&nbsp;
differently, and it's because you're not speaking&nbsp;&nbsp;

00:38:04.600 --> 00:38:11.680
the same language. And so that understanding&nbsp;
that you have to really be … you have to learn&nbsp;&nbsp;

00:38:11.680 --> 00:38:18.400
a little bit of vocabulary from each person&nbsp;
and understand how they contribute and then&nbsp;&nbsp;

00:38:18.400 --> 00:38:24.060
how your ideas can contribute to their ideas&nbsp;
has been really impactful in my career here.

00:38:24.060 --> 00:38:28.960
SMITH: Yeah, I think the key change&nbsp;
in approach that I took away—and I&nbsp;&nbsp;

00:38:28.960 --> 00:38:33.840
think many of us took away from the DNA Data&nbsp;
Storage Project—was rather than starting with&nbsp;&nbsp;

00:38:33.840 --> 00:38:37.640
an academic question, we started with&nbsp;
a vision of what we wanted to happen,&nbsp;&nbsp;

00:38:37.640 --> 00:38:43.800
and then we derived the research questions&nbsp;
from analyzing what would need to happen in&nbsp;&nbsp;

00:38:43.800 --> 00:38:50.360
the world—what are the bottlenecks that need to&nbsp;
be solved in order for us to achieve, you know,&nbsp;&nbsp;

00:38:50.360 --> 00:38:54.920
that goal? And this is something that we've&nbsp;
taken with us into the sustainability-focused&nbsp;&nbsp;

00:38:54.920 --> 00:39:00.920
research and, you know, something that I think&nbsp;
will affect all the research I do going forward.

00:39:00.920 --> 00:39:06.400
STRAUSS: Awesome. As we close, let's&nbsp;
reflect a bit on what a world in which&nbsp;&nbsp;

00:39:06.400 --> 00:39:12.560
DNA data storage is widely used might&nbsp;
look like. If everything goes as planned,&nbsp;&nbsp;

00:39:12.560 --> 00:39:18.040
what do you hope the lasting impact of this&nbsp;
work will be? Sergey, why don’t you lead us off.

00:39:18.040 --> 00:39:22.080
YEKHANIN: Sure, I remember that, like, when …&nbsp;
in the early days when I started working on this&nbsp;&nbsp;

00:39:22.080 --> 00:39:27.560
project actually, you, Karin, told me that you&nbsp;
were taking an Uber ride somewhere and you were&nbsp;&nbsp;

00:39:27.560 --> 00:39:31.160
talking to the taxi driver, and the taxi&nbsp;
driver—I don't know if you remember that—but&nbsp;&nbsp;

00:39:31.160 --> 00:39:35.200
the taxi driver mentioned that he has a camera&nbsp;
which is recording everything that's happening&nbsp;&nbsp;

00:39:35.200 --> 00:39:40.240
in the car. And then you had a discussion with&nbsp;
him about, like, how long does he keep the data,&nbsp;&nbsp;

00:39:40.240 --> 00:39:44.080
how long does he keep the videos. And he told&nbsp;
you that he keeps it for about a couple of days&nbsp;&nbsp;

00:39:44.080 --> 00:39:48.400
because it's too expensive. But otherwise, like,&nbsp;
if it weren't that expensive, he would keep it&nbsp;&nbsp;

00:39:48.400 --> 00:39:52.680
for much, much longer because, like, he wants to&nbsp;
have these recordings if later somebody is upset&nbsp;&nbsp;

00:39:52.680 --> 00:39:57.400
about the ride and, I don’t know, he is getting&nbsp;
sued or something. So this is, like, this is one&nbsp;&nbsp;

00:39:57.400 --> 00:40:01.440
small narrow application area where DNA data&nbsp;
storage would clearly, kind of, if it happens,&nbsp;&nbsp;

00:40:01.440 --> 00:40:06.720
then it will solve it. Because then, kind of, this&nbsp;
long-term archival storage will become very cheap,&nbsp;&nbsp;

00:40:06.720 --> 00:40:11.040
available to everybody; it would become a&nbsp;
commodity basically. There are many things&nbsp;&nbsp;

00:40:11.040 --> 00:40:17.500
that will be enabled, like this helping the Uber&nbsp;
drivers, for instance. But also one has to think&nbsp;&nbsp;

00:40:17.500 --> 00:40:21.880
of, of course, like, about, kind of, the broader&nbsp;
implications so that we don't get into something&nbsp;&nbsp;

00:40:21.880 --> 00:40:26.680
negative because again this power of recording&nbsp;
everything and storing everything, it can also&nbsp;&nbsp;

00:40:26.680 --> 00:40:32.400
lead to some use cases that might be, kind of,&nbsp;
morally wrong. So, again, hopefully by the time&nbsp;&nbsp;

00:40:32.400 --> 00:40:37.520
that we get to, like, really wide deployments&nbsp;
of this technology, the regulation will also be&nbsp;&nbsp;

00:40:37.520 --> 00:40:42.440
catching up and the, like, we will have great use&nbsp;
cases and we won’t have bad ones. I mean, that's&nbsp;&nbsp;

00:40:42.440 --> 00:40:48.060
how I think of it. But definitely there are lots&nbsp;
of, kind of, great scenarios that this can enable.

00:40:48.060 --> 00:40:54.040
SMITH: Yeah. I'll grab onto the word you use&nbsp;
there, which is making DNA a commodity. And&nbsp;&nbsp;

00:40:54.040 --> 00:40:57.960
one of the things that I hope comes out of this&nbsp;
project, you know, besides all the great benefits&nbsp;&nbsp;

00:40:57.960 --> 00:41:03.880
of DNA data storage itself is spillover benefits&nbsp;
into the field of health—where if we make DNA&nbsp;&nbsp;

00:41:03.880 --> 00:41:10.320
synthesis at large scale truly a commodity thing,&nbsp;
which I hope some of the work that we've done to&nbsp;&nbsp;

00:41:10.320 --> 00:41:16.280
really accelerate the throughput of synthesis&nbsp;
will do—then this will open new doors in what&nbsp;&nbsp;

00:41:16.280 --> 00:41:23.040
we can do in terms of gene synthesis, in terms of,&nbsp;
like, fundamental biotech research that will lead&nbsp;&nbsp;

00:41:23.040 --> 00:41:29.800
to that next set of drugs and, you know, give us&nbsp;
medications or treatments that we could not have&nbsp;&nbsp;

00:41:29.800 --> 00:41:35.220
thought possible if we were not able to synthesize&nbsp;
DNA and related molecules at that scale.

00:41:35.220 --> 00:41:41.840
NGUYEN: So much information gets lost&nbsp;
because of just time. And so I think&nbsp;&nbsp;

00:41:41.840 --> 00:41:49.440
being able to recover really ancient history&nbsp;
that humans wrote in the future, I think,&nbsp;&nbsp;

00:41:49.440 --> 00:41:55.560
is something that I really hope could be&nbsp;
achieved because we're so information rich,&nbsp;&nbsp;

00:41:55.560 --> 00:42:02.200
but in the course of time, we become information&nbsp;
poor, and so I would like for our future&nbsp;&nbsp;

00:42:02.200 --> 00:42:09.660
generations to be able to understand the life&nbsp;
of, you know, an everyday 21st-century person.

00:42:09.660 --> 00:42:14.960
STRAUSS: Well, Bichlien, Jake, Sergey,&nbsp;
it's been fun having this conversation&nbsp;&nbsp;

00:42:14.960 --> 00:42:18.560
with you today and collaborating&nbsp;
with you in all of this amazing&nbsp;&nbsp;

00:42:18.560 --> 00:42:21.763
project [MUSIC] and all the research&nbsp;
we've done together. Thank you so much.

00:42:21.763 --> 00:42:22.520
YEKHANIN: Thank you, Karin.
SMITH: Thank you.

00:42:22.520 --> 00:42:27.120
NGUYEN: Thanks.

00:42:27.120 --> 00:42:27.962
[MUSIC FADES]

