00:00:02.393 --> 00:00:07.280
Gretchen Huizinga: Welcome to Abstracts,
a Microsoft Research Podcast that puts the

00:00:07.280 --> 00:00:15.920
spotlight on world-class research in brief.
I’m Gretchen Huizinga. In this series,

00:00:15.920 --> 00:00:18.960
members of the research community
at Microsoft give us a quick

00:00:18.960 --> 00:00:24.240
snapshot – or a podcast abstract –
of their new and noteworthy papers.

00:00:25.200 --> 00:00:30.720
On today's episode, I'm talking to Alex Lu,
a senior researcher at Microsoft Research

00:00:30.720 --> 00:00:36.400
and co-author of a paper called Assessing
the Limits of Zero-Shot Foundation Models

00:00:36.400 --> 00:00:42.631
in Single-cell Biology. Alex Lu, wonderful to
have you on the podcast. Welcome to Abstracts!

00:00:42.631 --> 00:00:45.080
Alex Lu: Yeah, I'm really
excited to be joining you today.

00:00:45.080 --> 00:00:49.040
Huizinga: So let's start with a little
background of your work. In just a few

00:00:49.040 --> 00:00:53.480
sentences, tell us about your study
and more importantly, why it matters.

00:00:53.480 --> 00:01:00.720
Lu: Absolutely. And before I dive in, I want
to give a shout-out to the MSR research intern

00:01:00.720 --> 00:01:04.960
who actually did this work. This was led
by Kasia Kedzierska, who interned with us

00:01:05.600 --> 00:01:12.720
two summers ago in 2023, and she's the lead author
on the study. But basically, in this research,

00:01:12.720 --> 00:01:18.000
we study single-cell foundation models, which
have really recently rocked the world of biology,

00:01:18.000 --> 00:01:23.840
because they basically claim to be able to use AI
to unlock understanding about single-cell biology.

00:01:23.840 --> 00:01:28.160
For a myriad of applications,
everything from understanding how single

00:01:28.160 --> 00:01:34.560
cells differentiate into different kinds of
cells, to discovering new drugs for cancer,

00:01:34.560 --> 00:01:39.840
biologists conduct experiments where they measure how
much of every gene is expressed inside of just

00:01:39.840 --> 00:01:45.840
one single cell. So these experiments give
us a powerful view into the cell's internal

00:01:45.840 --> 00:01:51.200
state. But measurements from these experiments
are incredibly complex. There are about 20,000

00:01:51.200 --> 00:01:56.320
different human genes. So you get this really long
chain of numbers that measure how much there is

00:01:56.320 --> 00:02:02.160
of 20,000 different genes. So deriving meaning
from this really long chain of numbers is really

00:02:02.160 --> 00:02:07.600
difficult. And single-cell foundation models claim
to be capable of unraveling deeper insights than

00:02:07.600 --> 00:02:13.840
ever before. So that's the claim that these
works have made. And in our recent paper,

00:02:13.840 --> 00:02:17.680
we showed that these models may actually
not live up to these claims. Basically,

00:02:17.680 --> 00:02:21.760
we showed that single-cell foundation models
perform worse in settings that are fundamental

00:02:21.760 --> 00:02:26.480
to biological discovery than much simpler
machine learning and statistical methods

00:02:26.480 --> 00:02:30.800
that were used in the field before single-cell
foundation models emerged and are the go-to

00:02:30.800 --> 00:02:35.680
standard for unpacking meaning from these
complicated experiments. So in a nutshell,

00:02:35.680 --> 00:02:40.320
we should care about these results because they
have implications for the toolkits that biologists

00:02:40.320 --> 00:02:44.880
use to understand their experiments. Our work
suggests that single-cell foundation models may

00:02:44.880 --> 00:02:50.320
not be appropriate for practical use just yet, at
least in the discovery applications that we cover.

00:02:50.320 --> 00:02:55.200
Huizinga: Well, let's go a little deeper there.
Generative pre-trained transformer models,

00:02:55.200 --> 00:03:00.080
GPTs, are relatively new on the research
scene in terms of how they're being used

00:03:00.080 --> 00:03:03.200
in novel applications, which
is what you're interested in,

00:03:03.200 --> 00:03:08.240
like single-cell biology. So I'm curious, just
sort of as a foundation, what other research

00:03:08.240 --> 00:03:13.480
has already been done in this area, and how
does this study illuminate or build on it?

00:03:13.480 --> 00:03:20.560
Lu: Absolutely. Okay, so we were the first to
notice and document this issue in single-cell

00:03:20.560 --> 00:03:25.040
foundation models, specifically. And this is
because we have proposed evaluation methods

00:03:25.040 --> 00:03:31.840
that, while common in other areas of AI, have
yet to be commonly used to evaluate single-cell

00:03:31.840 --> 00:03:38.400
foundation models. We performed something called
zero-shot evaluation on these models. Prior to our

00:03:38.400 --> 00:03:44.160
work, most works evaluated single-cell foundation
models with fine-tuning. And the way to understand

00:03:44.160 --> 00:03:49.520
this is that single-cell foundation models are
trained in a way that tries to expose these models

00:03:49.520 --> 00:03:55.120
to millions of single cells. But because you’re
exposing them to a large amount of data, you can't

00:03:55.120 --> 00:04:00.480
really rely upon this data being annotated
or labeled in any particular fashion.

00:04:00.480 --> 00:04:06.880
So in order for them to actually do the
specialized tasks that are useful for biologists,

00:04:06.880 --> 00:04:11.760
you typically have to add on a second training
phase. We call this the fine-tuning phase,

00:04:11.760 --> 00:04:17.280
where you have a smaller number of single
cells, but now they are actually labeled

00:04:17.280 --> 00:04:23.120
for the specialized tasks that you want the
model to perform. So most people, they typically

00:04:23.120 --> 00:04:27.520
evaluate the performance of single-cell
models after they fine-tune these models.

00:04:28.080 --> 00:04:33.600
However, what we noticed is that evaluating
these fine-tuned models has several problems.

00:04:33.600 --> 00:04:39.200
First, it might not actually align with how these
models are actually going to be used by biologists.

00:04:39.200 --> 00:04:44.960
A critical distinction in biology is
that we're not just trying to interact with

00:04:44.960 --> 00:04:49.760
an agent that has access to knowledge through
its pre-training; we're trying to extend these

00:04:49.760 --> 00:04:56.160
models to discover new biology beyond the sphere
of influence. And so in many cases, the point

00:04:56.160 --> 00:05:01.040
of using these models, the point of analysis, is
to explore the data with the goal of potentially

00:05:01.040 --> 00:05:04.800
discovering something new about the single cell
that the biologists worked with that they weren't

00:05:04.800 --> 00:05:10.800
aware of before. So in these kinds of cases, it
is really tough to fine-tune a model. There's

00:05:10.800 --> 00:05:15.280
a bit of a chicken-and-egg problem going on. If
you don't know, for example, that there's a new kind

00:05:15.280 --> 00:05:20.080
of cell in the data, you can't really instruct
the model to help you identify these kinds of new

00:05:20.080 --> 00:05:26.160
cells. So in other words, fine-tuning these models
for those tasks essentially becomes impossible.

00:05:26.160 --> 00:05:32.560
So the second issue is that evaluations
on fine-tuned models can sometimes mislead us

00:05:32.560 --> 00:05:38.000
in our ability to understand how these models
are working. So for example, the claim behind

00:05:38.000 --> 00:05:43.040
single-cell foundation model papers is that these
models learn a foundation of biological knowledge

00:05:43.040 --> 00:05:49.520
by being exposed to millions of single cells in
their first training phase, right? But it's possible

00:05:49.520 --> 00:05:55.920
that, when you fine-tune a model,
any performance increases that you see using the

00:05:55.920 --> 00:06:01.920
model are simply because you're using a
massive model that is really sophisticated,

00:06:01.920 --> 00:06:07.440
really large. And given any exposure
to any cells at all, that model is going

00:06:07.440 --> 00:06:12.320
to do perfectly fine. So going back to
our paper, what's really different about this

00:06:12.320 --> 00:06:19.200
paper is that we propose zero-shot evaluation for
these models. What that means is that we do not

00:06:19.200 --> 00:06:25.200
fine-tune the model at all, and instead we keep
the model frozen during the analysis step. So how

00:06:25.200 --> 00:06:30.480
we specialize it to a downstream task instead
is that we extract the model's internal embedding

00:06:30.480 --> 00:06:36.240
of single-cell data, which is essentially a
numerical vector that contains information that

00:06:36.240 --> 00:06:42.480
the model is extracting and organizing from input
data. So it's essentially how the model perceives

00:06:42.480 --> 00:06:48.000
single-cell data and how it's organizing
it in its own internal state. So basically,

00:06:48.000 --> 00:06:52.000
this is the better way for us to test the claim
that single-cell foundation models are learning

00:06:52.000 --> 00:06:56.640
foundational biological insights. Because if
they actually are learning these insights,

00:06:56.640 --> 00:07:01.240
they should be present in the model's embedding
space even before we fine-tune the model.

00:07:01.240 --> 00:07:05.920
Huizinga: Well, let's talk about
methodology on this particular study.

00:07:05.920 --> 00:07:10.160
You focused on assessing existing
models in zero-shot learning for

00:07:10.160 --> 00:07:14.480
single-cell biology. How did you
go about evaluating these models?

00:07:14.480 --> 00:07:22.320
Lu: Yes, so let's dive deeper into how zero-shot
evaluations are conducted, okay? So the premise

00:07:22.320 --> 00:07:27.840
here is that we're relying upon the fact that
if these models are fully learning foundational

00:07:27.840 --> 00:07:33.440
biological insights, and we take the model's
internal representation of cells, then cells

00:07:33.440 --> 00:07:38.640
that are biologically similar should be close in
that internal representation, whereas cells that are

00:07:38.640 --> 00:07:44.960
biologically distinct should be further apart. And
that is exactly what we tested in our study. We

00:07:44.960 --> 00:07:51.200
compared two popular single-cell foundation models
and importantly, we compared these models against

00:07:51.200 --> 00:07:56.400
older and reliable tools that biologists have
used for exploratory analyses. So these include

00:07:57.200 --> 00:08:02.160
simpler machine learning methods like scVI,
statistical algorithms like Harmony, and

00:08:02.160 --> 00:08:07.520
even basic data pre-processing steps, just like
filtering your data down to a more robust subset

00:08:07.520 --> 00:08:14.080
of genes. So basically, we tested embeddings
from the two single-cell foundation models against

00:08:14.080 --> 00:08:19.280
these baselines in a variety of settings. And
we tested the hypothesis that biologically

00:08:19.280 --> 00:08:24.800
similar cells should be similar across these
distinct methods and across these datasets.

00:08:24.800 --> 00:08:29.920
Huizinga: Well, and as you did
the testing, you obviously were aiming

00:08:29.920 --> 00:08:34.960
towards research findings, which is
my favorite part of a research paper,

00:08:34.960 --> 00:08:41.760
so tell us what you did find and what you feel
the most important takeaways of this paper are.

00:08:41.760 --> 00:08:48.960
Lu: Absolutely. So in a nutshell, we found that
these two newly proposed single-cell foundation

00:08:48.960 --> 00:08:55.920
models substantially underperformed compared
to older methods. So to contextualize why

00:08:55.920 --> 00:09:01.680
that is such a surprising result, there
is a lot of hype around these methods.

00:09:01.680 --> 00:09:05.760
So basically, I think that, yeah,
it's a very surprising result,

00:09:05.760 --> 00:09:11.040
given how hyped these models are and how
people were already adopting them. But our

00:09:11.040 --> 00:09:16.400
results basically caution that these shouldn't
really be adopted for these purposes.

00:09:16.400 --> 00:09:22.800
Huizinga: Yeah, so this is serious real-world
impact here in terms of if models are being

00:09:22.800 --> 00:09:31.680
adopted and adapted in these applications, how
reliable are they, et cetera? So given that,

00:09:31.680 --> 00:09:36.960
who would you say benefits most from what
you've discovered in this paper and why?

00:09:36.960 --> 00:09:42.560
Lu: Okay, so two ways, right? So I think this
has at least immediate implications for the way

00:09:42.560 --> 00:09:48.640
that we do discovery in biology. And as I've
discussed, these experiments are used for

00:09:48.640 --> 00:09:54.800
cases that have practical impact: drug discovery
applications, investigations into basic biology.

00:09:54.800 --> 00:10:00.400
But let's also talk about the impact for
methodologists, people who are trying to improve

00:10:00.400 --> 00:10:05.200
these single-cell foundation models, right?
I think at the base, they're really exciting

00:10:05.200 --> 00:10:11.040
proposals. Because if you look at what some of the
prior and less sophisticated methods couldn’t do,

00:10:11.040 --> 00:10:15.920
they tended to be more bespoke. So the excitement
of single-cell foundation models is that you have

00:10:15.920 --> 00:10:19.520
this general-purpose model that can be
used for everything, and while they're not

00:10:20.080 --> 00:10:26.160
living up to that purpose just now, just
currently, I think that it's important that

00:10:26.160 --> 00:10:31.760
we continue to bank on that vision, right? So
if you look at our contributions in that area,

00:10:31.760 --> 00:10:36.480
single-cell foundation models are a
really new proposal, so it makes sense that

00:10:36.480 --> 00:10:41.600
we may not know how to fully evaluate them just
yet. So you can view our work as basically

00:10:41.600 --> 00:10:46.640
being a step towards more rigorous evaluation of
these models. Now that we did this experiment,

00:10:46.640 --> 00:10:51.520
I think the methodologists know to use this as a
signal on how to improve the models and if they're

00:10:51.520 --> 00:10:56.560
going in the right direction. And in fact, you
are seeing more and more papers adopt zero-shot

00:10:56.560 --> 00:11:01.360
evaluations since we put out our paper.
And so this essentially helps future computer

00:11:01.360 --> 00:11:06.080
scientists that are working on single-cell
foundation models know how to train better models.

00:11:06.080 --> 00:11:12.880
Huizinga: That said, Alex, finally, what
are the outstanding challenges that you

00:11:12.880 --> 00:11:18.240
identified for zero-shot learning research in
biology, and what foundation might this paper

00:11:18.240 --> 00:11:20.680
lay for future research agendas in the field?

00:11:20.680 --> 00:11:26.720
Lu: Yeah, absolutely. So now that we've shown
single-cell foundation models don't necessarily

00:11:26.720 --> 00:11:31.760
perform well, I think the natural question on
everyone's mind is how do we actually train

00:11:31.760 --> 00:11:36.640
single-cell foundation models that live up to that
vision, that can perform in helping us discover

00:11:36.640 --> 00:11:41.840
new biology? So I think in the short
term, yeah, we're actively investigating

00:11:41.840 --> 00:11:47.120
many hypotheses in this area. So for example,
my colleagues, Lorin Crawford and Ava Amini,

00:11:47.120 --> 00:11:52.080
who were co-authors on the paper, recently put
out a pre-print examining how training data

00:11:52.080 --> 00:11:58.240
composition impacts model performance. And so one
of the surprising findings that they had was that

00:11:58.240 --> 00:12:02.800
many of the training datasets that people used
to train single-cell foundation models are highly

00:12:02.800 --> 00:12:08.000
redundant, to the point that you can even sample
just a tiny fraction of the data and get basically

00:12:08.000 --> 00:12:13.360
the same performance. But you can also look
forward to many other explorations in this area

00:12:13.360 --> 00:12:18.960
as we continue to develop this research at the end
of the day. But also zooming out into the bigger

00:12:18.960 --> 00:12:25.120
picture, I think one major takeaway from this
paper is that developing AI methods for biology

00:12:25.120 --> 00:12:31.840
requires thought about the context of use, right?
I mean, this is obvious for any AI method,

00:12:31.840 --> 00:12:38.080
but I think people have gotten just too used to
taking methods that work out there for natural

00:12:38.080 --> 00:12:42.560
vision or natural language maybe in the consumer
domain and then extrapolating these methods

00:12:42.560 --> 00:12:46.880
to biology and expecting that they will work
in the same way, right? So for example,

00:12:46.880 --> 00:12:52.400
one reason why zero-shot evaluation was not
routine practice for single-cell foundation models

00:12:52.400 --> 00:12:57.760
prior to our work, I mean, we were the first to
fully establish that as a practice for the field,

00:12:57.760 --> 00:13:03.680
is that I think people who have been working
in AI for biology have been looking to these more

00:13:03.680 --> 00:13:09.760
mainstream AI domains to shape their work.
And so with single-cell foundation models, many

00:13:09.760 --> 00:13:14.240
of these models are adapted from large language
models in natural language processing, recycling

00:13:14.240 --> 00:13:19.200
the exact same architecture, the exact same code,
basically just recycling practices in that field.

00:13:19.200 --> 00:13:24.800
So when you look at practices in
more mainstream domains, zero-shot evaluation is

00:13:24.800 --> 00:13:30.640
definitely explored in those domains, but it's
more of a niche instead of being considered

00:13:30.640 --> 00:13:36.400
central to model understanding. So again,
because biology is different from mainstream

00:13:36.400 --> 00:13:42.000
language processing, it's a scientific discipline,
zero-shot evaluation becomes much more important,

00:13:42.000 --> 00:13:46.720
and you have no choice but to use these
models zero-shot. So in other words,

00:13:46.720 --> 00:13:51.440
I think that we need to be thinking carefully
about what it is that makes training a model

00:13:51.440 --> 00:13:57.760
for biology different from training a
model, for example, for consumer purposes.

00:13:57.760 --> 00:14:01.120
Huizinga: Alex Lu, thanks for joining
us today, and to our listeners,

00:14:01.120 --> 00:14:09.120
thanks for tuning in. If you want to read this
paper, you can find a link at aka.ms/Abstracts,

00:14:09.120 --> 00:14:21.040
or you can read it on the Genome Biology
website. See you next time on Abstracts!