WEBVTT
Kind: captions
Language: en-US

00:00:02.405 --> 00:00:03.107
[MUSIC]

00:00:03.107 --> 00:00:06.800
KATHLEEN SULLIVAN: Welcome&nbsp;
to AI Testing and Evaluation:&nbsp;&nbsp;

00:00:06.800 --> 00:00:11.520
Learnings from Science and Industry.&nbsp;
I'm your host, Kathleen Sullivan.

00:00:11.520 --> 00:00:16.640
As generative AI continues to advance, Microsoft&nbsp;
has gathered a range of experts—from genome&nbsp;&nbsp;

00:00:16.640 --> 00:00:21.040
editing to cybersecurity—to share how&nbsp;
their fields approach evaluation and&nbsp;&nbsp;

00:00:21.040 --> 00:00:26.160
risk assessment. Our goal is to learn from&nbsp;
their successes and their stumbles to move&nbsp;&nbsp;

00:00:26.160 --> 00:00:31.360
the science and practice of AI testing&nbsp;
forward. In this series, we'll explore&nbsp;&nbsp;

00:00:31.360 --> 00:00:40.137
how these insights might help guide the future of&nbsp;
AI development, deployment, and responsible use.

00:00:40.126 --> 00:00:40.659
[MUSIC ENDS]

00:00:40.659 --> 00:00:42.800
SULLIVAN: Today, I'm excited to&nbsp;
welcome Dan Carpenter and Timo&nbsp;&nbsp;

00:00:42.800 --> 00:00:46.400
Minssen to the podcast to explore&nbsp;
testing and risk assessment in the&nbsp;&nbsp;

00:00:46.400 --> 00:00:50.160
areas of pharmaceuticals and&nbsp;
medical devices, respectively.

00:00:50.160 --> 00:00:54.640
Dan Carpenter is chair of the Department&nbsp;
of Government at Harvard University. His&nbsp;&nbsp;

00:00:54.640 --> 00:01:00.160
research spans the sphere of social and political&nbsp;
science, from petitioning in democratic society&nbsp;&nbsp;

00:01:00.160 --> 00:01:05.600
to regulation and government organizations.&nbsp;
His recent work includes the FDA Project,&nbsp;&nbsp;

00:01:05.600 --> 00:01:09.440
which examines pharmaceutical&nbsp;
regulation in the United States.

00:01:09.440 --> 00:01:12.720
Timo is a professor of law at&nbsp;
the University of Copenhagen,&nbsp;&nbsp;

00:01:12.720 --> 00:01:18.400
where he is also director of the Center for&nbsp;
Advanced Studies in Bioscience Innovation Law.&nbsp;&nbsp;

00:01:18.400 --> 00:01:22.080
He specializes in legal aspects of&nbsp;
biomedical innovation, including&nbsp;&nbsp;

00:01:22.080 --> 00:01:27.680
intellectual property law and regulatory law.&nbsp;
He's exercised his expertise as an advisor&nbsp;&nbsp;

00:01:27.680 --> 00:01:33.040
to such organizations as the World Health&nbsp;
Organization and the European Commission.

00:01:33.040 --> 00:01:38.000
And after our conversations, we'll talk to&nbsp;
Microsoft's Chad Atalla, an applied scientist&nbsp;&nbsp;

00:01:38.000 --> 00:01:43.520
in responsible AI, about how we should think&nbsp;
about these insights in the context of AI.

00:01:43.520 --> 00:01:46.160
Daniel, it's a pleasure to&nbsp;
welcome you to the podcast.&nbsp;&nbsp;

00:01:46.160 --> 00:01:49.662
I'm just so appreciative of you being&nbsp;
here. Thanks for joining us today.

00:01:49.662 --> 00:01:50.720
DANIEL CARPENTER: Thanks for having me.

00:01:50.720 --> 00:01:56.480
SULLIVAN: Dan, before we dissect policy,&nbsp;
let's rewind the tape to your origin story.&nbsp;&nbsp;

00:01:56.480 --> 00:02:00.960
Can you take us to the moment that you first&nbsp;
became fascinated with regulators rather than,&nbsp;&nbsp;

00:02:00.960 --> 00:02:05.840
say, politicians? Was there a spark&nbsp;
that pulled you toward the FDA story?

00:02:05.840 --> 00:02:08.720
CARPENTER: At one point during graduate school,&nbsp;&nbsp;

00:02:08.720 --> 00:02:16.960
I was studying a combination of American&nbsp;
politics and political theory, and I did a&nbsp;&nbsp;

00:02:16.960 --> 00:02:24.720
summer interning at the Department of Housing&nbsp;
and Urban Development. And I began to think,&nbsp;&nbsp;

00:02:24.720 --> 00:02:30.960
why don't people study these administrators&nbsp;
more and the rules they make, the, you know,&nbsp;&nbsp;

00:02:30.960 --> 00:02:37.360
inefficiencies, the efficiencies? Really more&nbsp;
from, kind of, a descriptive standpoint, less from&nbsp;&nbsp;

00:02:37.360 --> 00:02:45.520
a normative standpoint. And I was reading a lot&nbsp;
that summer about the Food and Drug Administration&nbsp;&nbsp;

00:02:45.520 --> 00:02:51.590
and some of the decisions it was making on&nbsp;
AIDS drugs. That was a, sort of, a major, …

00:02:51.590 --> 00:02:52.800
SULLIVAN: Right.

00:02:52.800 --> 00:02:57.680
CARPENTER: … sort of, you know, moment in the&nbsp;
news, in the global news as well as the national&nbsp;&nbsp;

00:02:57.680 --> 00:03:06.880
news during, I would say, what? The late ’80s,&nbsp;
early ’90s? And so I began to look into that.

00:03:06.880 --> 00:03:09.520
SULLIVAN: So now that we know what pulled you in,&nbsp;&nbsp;

00:03:09.520 --> 00:03:13.120
let’s zoom out for our listeners.&nbsp;
Give us the whirlwind tour. I think&nbsp;&nbsp;

00:03:13.120 --> 00:03:17.480
most of us know pharma involves years of&nbsp;
trials, but what’s the part we don’t know?

00:03:17.480 --> 00:03:24.480
CARPENTER: So I think when most businesses develop&nbsp;
a product, they all go through some phases of&nbsp;&nbsp;

00:03:24.480 --> 00:03:30.320
research and development and testing. And&nbsp;
I think what's different about the FDA is,&nbsp;&nbsp;

00:03:30.960 --> 00:03:32.880
sort of, two- or three-fold.

00:03:32.880 --> 00:03:38.560
First, a lot of those tests are much more&nbsp;
stringently specified and regulated by the&nbsp;&nbsp;

00:03:38.560 --> 00:03:46.960
government, and second, one of the reasons&nbsp;
for that is that the FDA imposes not simply&nbsp;&nbsp;

00:03:46.960 --> 00:03:56.000
safety requirements upon drugs in particular but&nbsp;
also efficacy requirements. The FDA wants you to&nbsp;&nbsp;

00:03:56.000 --> 00:04:02.720
prove not simply that it's safe and non-toxic but&nbsp;
also that it's effective. And the final thing,&nbsp;&nbsp;

00:04:02.720 --> 00:04:07.280
I think, that makes the FDA different is&nbsp;
that it stands as what I would call the “veto&nbsp;&nbsp;

00:04:07.280 --> 00:04:13.200
player” over R&amp;D [research and development]&nbsp;
to the marketplace. The FDA basically has,&nbsp;&nbsp;

00:04:13.200 --> 00:04:16.880
sort of, this control over&nbsp;
entry to the marketplace.

00:04:16.880 --> 00:04:23.600
And so what that involves is usually first,&nbsp;
a set of human trials where people who have&nbsp;&nbsp;

00:04:23.600 --> 00:04:29.120
no disease take it. And you're only looking&nbsp;
for toxicity generally. Then there's a set&nbsp;&nbsp;

00:04:29.120 --> 00:04:34.160
of Phase 2 trials, where they look more&nbsp;
at safety and a little bit at efficacy,&nbsp;&nbsp;

00:04:34.160 --> 00:04:41.440
and you're now examining people who have the&nbsp;
disease that the drug claims to treat. And&nbsp;&nbsp;

00:04:41.440 --> 00:04:47.440
you're also basically comparing people who&nbsp;
get the drug, often with those who do not.

00:04:47.440 --> 00:04:53.520
And then finally, Phase 3 involves a much more&nbsp;
direct and large-scale attack, if you will,&nbsp;&nbsp;

00:04:53.520 --> 00:04:59.760
or assessment of efficacy, and that's where you&nbsp;
get the sort of large randomized clinical trials&nbsp;&nbsp;

00:04:59.760 --> 00:05:08.000
that are very expensive for pharmaceutical&nbsp;
companies, biomedical companies to launch,&nbsp;&nbsp;

00:05:08.000 --> 00:05:16.640
to execute, to analyze. And those are often the&nbsp;
sort of core evidence base for the decisions that&nbsp;&nbsp;

00:05:16.640 --> 00:05:22.240
the FDA makes about whether or not to approve&nbsp;
a new drug for marketing in the United States.

00:05:22.240 --> 00:05:27.600
SULLIVAN: Are there differences&nbsp;
in how that process has, you know,&nbsp;&nbsp;

00:05:27.600 --> 00:05:32.400
changed through other countries and maybe just&nbsp;
how that's evolved as you've seen it play out?

00:05:32.400 --> 00:05:38.320
CARPENTER: Yeah, for a long time, I would say that&nbsp;
the United States had probably the most stringent&nbsp;&nbsp;

00:05:38.320 --> 00:05:44.960
regime of regulation for biopharmaceutical&nbsp;
products until, I would say, about the 1990s&nbsp;&nbsp;

00:05:44.960 --> 00:05:52.080
and early 2000s. It used to be the case that a&nbsp;
number of other countries, especially in Europe&nbsp;&nbsp;

00:05:52.080 --> 00:05:58.720
but around the world, basically waited for the&nbsp;
FDA to mandate tests on a drug and only after&nbsp;&nbsp;

00:05:58.720 --> 00:06:05.760
the drug was approved in the United States would&nbsp;
they deem it approvable and marketable in their&nbsp;&nbsp;

00:06:05.760 --> 00:06:12.880
own countries. And then after the formation of the&nbsp;
European Union and the creation of the European&nbsp;&nbsp;

00:06:12.880 --> 00:06:20.080
Medicines Agency, gradually the European Medicines&nbsp;
Agency began to get a bit more stringent.

00:06:20.080 --> 00:06:24.720
But, you know, over the long run, there's&nbsp;
been a lot of, sort of, heterogeneity,&nbsp;&nbsp;

00:06:24.720 --> 00:06:29.760
a lot of variation over time and space,&nbsp;
in the way that the FDA has approached&nbsp;&nbsp;

00:06:29.760 --> 00:06:35.920
these problems. And I'd say in the last 20&nbsp;
years, it's begun to partially deregulate,&nbsp;&nbsp;

00:06:37.200 --> 00:06:45.440
namely, you know, trying to find all sorts of&nbsp;
mechanisms or pathways for really innovative&nbsp;&nbsp;

00:06:45.440 --> 00:06:52.800
drugs for deadly diseases without a lot&nbsp;
of treatments to basically get through the&nbsp;&nbsp;

00:06:52.800 --> 00:06:58.880
process at lower cost. For many people, that&nbsp;
has not been sufficient. They're concerned&nbsp;&nbsp;

00:06:58.880 --> 00:07:06.400
about the cost of the system. Of course, then&nbsp;
the agency also gets criticized by those who&nbsp;&nbsp;

00:07:06.400 --> 00:07:13.080
believe it's too lax. It is potentially letting&nbsp;
ineffective and unsafe therapies on the market.

00:07:13.080 --> 00:07:16.960
SULLIVAN: In your view, when does the&nbsp;
structured model genuinely safeguard&nbsp;&nbsp;

00:07:16.960 --> 00:07:21.200
patients and where do you think it&nbsp;
maybe slows or limits innovation?

00:07:21.200 --> 00:07:30.320
CARPENTER: So I think the worry is that if you&nbsp;
approach pharmaceutical approval as a world&nbsp;&nbsp;

00:07:30.320 --> 00:07:36.480
where only things can go wrong, then you're&nbsp;
really at a risk of limiting innovation. And&nbsp;&nbsp;

00:07:36.480 --> 00:07:44.000
even if you end up letting a lot of things&nbsp;
through, if by your regulations you end up&nbsp;&nbsp;

00:07:44.000 --> 00:07:48.480
basically slowing down the development&nbsp;
process or making it very, very costly,&nbsp;&nbsp;

00:07:48.480 --> 00:07:54.000
then there's just a whole bunch of drugs that&nbsp;
either come to market too slowly or they come&nbsp;&nbsp;

00:07:54.000 --> 00:08:00.000
to market not at all because they just aren't&nbsp;
worth the kind of cost-benefit or, sort of,&nbsp;&nbsp;

00:08:00.000 --> 00:08:05.760
profit analysis of the firm. You know, so that's&nbsp;
been a concern. And I think it's been one of the&nbsp;&nbsp;

00:08:05.760 --> 00:08:11.120
reasons that the Food and Drug Administration&nbsp;
as well as other world regulators have begun to&nbsp;&nbsp;

00:08:11.120 --> 00:08:18.640
basically try to smooth the process and&nbsp;
accelerate the process at the margins.

00:08:18.640 --> 00:08:24.000
The other thing is that they've started to&nbsp;
basically make approvals on the basis of what&nbsp;&nbsp;

00:08:24.000 --> 00:08:29.600
are called surrogate endpoints. So the idea&nbsp;
is that a cancer drug, we really want to know&nbsp;&nbsp;

00:08:29.600 --> 00:08:36.720
whether that drug saves lives, but if we wait&nbsp;
to see whose lives are saved or prolonged by&nbsp;&nbsp;

00:08:36.720 --> 00:08:41.360
that drug, we might miss the opportunity&nbsp;
to make judgments on the basis of, well,&nbsp;&nbsp;

00:08:41.360 --> 00:08:47.360
are we detecting tumors in the bloodstream? Or&nbsp;
can we measure the size of those tumors in, say,&nbsp;&nbsp;

00:08:47.360 --> 00:08:53.920
a solid cancer? And then the further question&nbsp;
is, is the size of the tumor basically a really&nbsp;&nbsp;

00:08:53.920 --> 00:08:59.040
good correlate or predictor of whether&nbsp;
people will die or not, right? Generally,&nbsp;&nbsp;

00:08:59.040 --> 00:09:07.760
the FDA tends to be less stringent when you've&nbsp;
got, you know, a remarkably innovative new therapy&nbsp;&nbsp;

00:09:07.760 --> 00:09:14.400
and the disease being treated is one that just&nbsp;
doesn't have a lot of available treatments, right.

00:09:14.960 --> 00:09:18.880
The one thing that people often think&nbsp;
about when they're thinking about&nbsp;&nbsp;

00:09:18.880 --> 00:09:23.510
pharmaceutical regulation is they often&nbsp;
contrast, kind of, speed versus safety, …

00:09:23.510 --> 00:09:24.065
SULLIVAN: Right.

00:09:24.065 --> 00:09:28.400
CARPENTER: … right. And that's useful as a&nbsp;
tradeoff, but I often try to remind people&nbsp;&nbsp;

00:09:28.400 --> 00:09:34.400
that it's not simply about whether the drug&nbsp;
gets out there and it's unsafe. You know,&nbsp;&nbsp;

00:09:34.400 --> 00:09:39.520
you and I as patients and even doctors have&nbsp;
a hard time knowing whether something works&nbsp;&nbsp;

00:09:39.520 --> 00:09:44.960
and whether it should be prescribed.&nbsp;
And the evidence for knowing whether&nbsp;&nbsp;

00:09:44.960 --> 00:09:49.440
something works isn't just, well, you&nbsp;
know, Sally took it, or Dan took it,&nbsp;&nbsp;

00:09:49.440 --> 00:09:54.000
or Kathleen took it, and they seem to get&nbsp;
better or they didn't seem to get better.

00:09:54.000 --> 00:09:58.240
The really rigorous evidence comes from&nbsp;
randomized clinical trials. And I think&nbsp;&nbsp;

00:09:58.240 --> 00:10:03.280
it's fair to say that if you didn't&nbsp;
have the FDA there as a veto player,&nbsp;&nbsp;

00:10:03.280 --> 00:10:07.840
you wouldn't get as many randomized clinical&nbsp;
trials and the evidence probably wouldn't be&nbsp;&nbsp;

00:10:07.840 --> 00:10:12.800
as rigorous for whether these things work. And&nbsp;
as I like to put it, basically there's a whole&nbsp;&nbsp;

00:10:12.800 --> 00:10:18.800
ecology of expectations and beliefs around&nbsp;
the biopharmaceutical industry in the United&nbsp;&nbsp;

00:10:18.800 --> 00:10:24.960
States and globally, and to some extent, it's&nbsp;
undergirded by all of these tests that happen.

00:10:24.960 --> 00:10:25.810
SULLIVAN: Right.

00:10:25.810 --> 00:10:31.040
CARPENTER: And in part, that means it's&nbsp;
undergirded by regulation. Would there still be&nbsp;&nbsp;

00:10:31.040 --> 00:10:36.640
a market without regulation? Yes. But it would be&nbsp;
a market in which people had far less information&nbsp;&nbsp;

00:10:36.640 --> 00:10:43.840
about, and confidence in, the drugs that are being&nbsp;
taken. And so I think it's important to recognize&nbsp;&nbsp;

00:10:43.840 --> 00:10:51.760
that kind of confidence-boosting potential&nbsp;
of, kind of, a scientific regulation base.

00:10:51.760 --> 00:10:55.040
SULLIVAN: Actually, if we could double-click&nbsp;
on that for a minute, I'd love to hear your&nbsp;&nbsp;

00:10:55.040 --> 00:10:59.600
perspective on, testing has been completed;&nbsp;
there's results. Can you walk us through how&nbsp;&nbsp;

00:10:59.600 --> 00:11:05.360
those results actually shape the next steps and&nbsp;
decisions of a particular drug and just, like, how&nbsp;&nbsp;

00:11:05.360 --> 00:11:11.920
regulators actually think about using that data&nbsp;
to influence really what happens next with it?

00:11:11.920 --> 00:11:18.320
CARPENTER: Right. So it's important to&nbsp;
understand that every drug is approved&nbsp;&nbsp;

00:11:18.320 --> 00:11:23.840
for what's called an indication. It&nbsp;
can have a first primary indication,&nbsp;&nbsp;

00:11:23.840 --> 00:11:28.160
which is the main disease that it treats, and&nbsp;
then others can be added as more evidence is&nbsp;&nbsp;

00:11:28.160 --> 00:11:34.640
shown. But a drug is not something that just&nbsp;
kind of exists out there in the ether. It&nbsp;&nbsp;

00:11:34.640 --> 00:11:39.120
has to have the right form of administration.&nbsp;
Maybe it should be injected. Maybe it should be&nbsp;&nbsp;

00:11:39.120 --> 00:11:44.720
ingested. Maybe it should be administered only&nbsp;
at a clinic because it needs to be kind of&nbsp;&nbsp;

00:11:44.720 --> 00:11:50.720
administered in just the right way. As doctors&nbsp;
will tell you, dosage is everything, right.

00:11:50.720 --> 00:11:56.000
And so one of the reasons that you want&nbsp;
those trials is not simply a, you know,&nbsp;&nbsp;

00:11:56.000 --> 00:12:02.560
yes or no answer about whether the drug&nbsp;
works, right. It's not simply if-then. It's&nbsp;&nbsp;

00:12:02.560 --> 00:12:06.320
literally what goes into what you might&nbsp;
call the dose response curve. You know,&nbsp;&nbsp;

00:12:06.320 --> 00:12:11.600
how much of this drug do we need to basically,&nbsp;
you know, get the benefit? At what point does&nbsp;&nbsp;

00:12:11.600 --> 00:12:17.680
that fall off significantly that we can basically&nbsp;
say, we can stop there? All that evidence comes&nbsp;&nbsp;

00:12:17.680 --> 00:12:25.600
from trials. And that's the kind of evidence&nbsp;
that is required on the basis of regulation.

00:12:25.600 --> 00:12:30.080
Because it's not simply a drug&nbsp;
that's approved. It's a drug and&nbsp;&nbsp;

00:12:30.080 --> 00:12:36.000
a frequency of administration. It's a method&nbsp;
of administration. And so the drug isn't just,&nbsp;&nbsp;

00:12:36.000 --> 00:12:39.120
there's something to be taken off the&nbsp;
shelf and popped into your mouth. I mean,&nbsp;&nbsp;

00:12:39.120 --> 00:12:43.360
sometimes that's what happens, but even then,&nbsp;
we want to know what the dosage is, right.&nbsp;&nbsp;

00:12:44.160 --> 00:12:47.760
We want to know what to look for in&nbsp;
terms of side effects, things like that.

00:12:47.760 --> 00:12:49.440
SULLIVAN: Going back to that point, I mean,&nbsp;&nbsp;

00:12:49.440 --> 00:12:55.920
it sounds like we're making a lot of progress&nbsp;
from a regulation perspective in, you know,&nbsp;&nbsp;

00:12:55.920 --> 00:13:01.200
sort of speed and getting things approved but&nbsp;
doing it in a really balanced way. I mean,&nbsp;&nbsp;

00:13:01.200 --> 00:13:05.840
any other kind of closing thoughts on the&nbsp;
tradeoffs there or where you're seeing that going?

00:13:05.840 --> 00:13:10.560
CARPENTER: I think you're going to see some move&nbsp;
in the coming years—there's already been some of&nbsp;&nbsp;

00:13:10.560 --> 00:13:16.400
it—to say, do we always need a really large&nbsp;
Phase 3 clinical trial? And to what degree do&nbsp;&nbsp;

00:13:16.400 --> 00:13:21.040
we need the, like, you know, all the i's dotted&nbsp;
and the t's crossed or a really, really large&nbsp;&nbsp;

00:13:21.040 --> 00:13:29.920
sample size? And I'm open to innovation there.&nbsp;
I'm also open to the idea that we consider, again,&nbsp;&nbsp;

00:13:29.920 --> 00:13:37.520
things like accelerated approvals or pathways&nbsp;
for looking at different kinds of surrogate&nbsp;&nbsp;

00:13:37.520 --> 00:13:42.360
endpoints. I do think, once we do that, then&nbsp;
we also have to have some degree of follow-up.

00:13:42.360 --> 00:13:44.480
SULLIVAN: So I know we're&nbsp;
getting close to out of time,&nbsp;&nbsp;

00:13:44.480 --> 00:13:50.800
but maybe just a quick rapid fire if you’re&nbsp;
open to it. Biggest myth about clinical trials?

00:13:50.800 --> 00:13:54.640
CARPENTER: Well, some people tend to think&nbsp;
that the FDA performs them. You know,&nbsp;&nbsp;

00:13:54.640 --> 00:14:00.160
it's companies that do it. And the only other&nbsp;
thing I would say is the company that does a&nbsp;&nbsp;

00:14:00.160 --> 00:14:05.600
lot of the testing and even the innovating is not&nbsp;
always the company that takes the drug to market,&nbsp;&nbsp;

00:14:05.600 --> 00:14:12.240
and it tells you something about how powerful&nbsp;
regulation is in our system, in our world,&nbsp;&nbsp;

00:14:12.240 --> 00:14:18.640
that you often need a company that has dealt with&nbsp;
the FDA quite a bit and knows all the regulations&nbsp;&nbsp;

00:14:18.640 --> 00:14:23.760
and knows how to dot the i's and cross the t's&nbsp;
in order to get a drug across the finish line.

00:14:23.760 --> 00:14:27.920
SULLIVAN: If you had a magic wand, what's the&nbsp;
one thing you'd change in regulation today?

00:14:27.920 --> 00:14:32.960
CARPENTER: I would like people to think a little&nbsp;
bit less about just speed versus safety and,&nbsp;&nbsp;

00:14:32.960 --> 00:14:37.280
again, more about this basic issue of&nbsp;
confidence. I think it's fundamental&nbsp;&nbsp;

00:14:37.280 --> 00:14:41.200
to everything that happens in markets&nbsp;
but especially in biopharmaceuticals.

00:14:41.200 --> 00:14:45.821
SULLIVAN: Such a great point. This has been really&nbsp;
fun. Just thanks so much for being here today.

00:14:45.821 --> 00:14:46.358
[TRANSITION MUSIC]

00:14:46.358 --> 00:14:49.920
We're really excited to share your&nbsp;
thoughts out to our listeners. Thanks.

00:14:49.920 --> 00:14:54.160
CARPENTER: Likewise.

00:14:58.120 --> 00:15:04.480
SULLIVAN: Now to the world of medical devices. I'm joined&nbsp;
by Professor Timo Minssen. Professor Minssen,&nbsp;&nbsp;

00:15:04.480 --> 00:15:07.746
it's great to have you here.&nbsp;
Thank you for joining us today.

00:15:07.746 --> 00:15:09.720
TIMO MINSSEN: Yeah, thank you&nbsp;
very much, it's a pleasure.

00:15:09.720 --> 00:15:13.440
SULLIVAN: Before getting into the&nbsp;
regulatory world of medical devices,&nbsp;&nbsp;

00:15:13.440 --> 00:15:17.440
tell our audience a bit about your personal&nbsp;
journey or your origin story, as we're asking&nbsp;&nbsp;

00:15:17.440 --> 00:15:22.480
our guests. How did you land in regulation,&nbsp;
and what's kept you hooked in this space?

00:15:22.480 --> 00:15:28.160
MINSSEN: So I started out as a patent expert in&nbsp;
the biomedical area, starting with my PhD thesis&nbsp;&nbsp;

00:15:28.160 --> 00:15:34.240
on patenting biologics in Europe and in the US.&nbsp;
So during that time, I was mostly interested in&nbsp;&nbsp;

00:15:34.240 --> 00:15:39.200
patent and trade secret questions. But at the&nbsp;
same time, I also developed and taught courses&nbsp;&nbsp;

00:15:39.200 --> 00:15:45.680
in regulatory law and held talks on regulating&nbsp;
advanced medical therapy medicinal products.&nbsp;&nbsp;

00:15:45.680 --> 00:15:52.720
I then started to lead large research projects&nbsp;
on legal challenges in a wide variety of health&nbsp;&nbsp;

00:15:52.720 --> 00:15:57.840
and life science innovation frontiers. I also&nbsp;
started to focus increasingly on AI-enabled&nbsp;&nbsp;

00:15:57.840 --> 00:16:03.760
medical devices and software as a medical device,&nbsp;
resulting in several academic articles in this&nbsp;&nbsp;

00:16:03.760 --> 00:16:09.800
area and also in the regulatory area and a book&nbsp;
on the future of medical device regulation.

00:16:09.800 --> 00:16:12.800
SULLIVAN: Yeah, what's kept&nbsp;
you hooked in the space?

00:16:12.800 --> 00:16:15.600
MINSSEN: It's just incredibly exciting,&nbsp;&nbsp;

00:16:15.600 --> 00:16:22.880
in particular right now with everything that&nbsp;
is going on, you know, in the software arena,&nbsp;&nbsp;

00:16:22.880 --> 00:16:28.000
in the marriage between AI and medical&nbsp;
devices. And this is really challenging&nbsp;&nbsp;

00:16:28.000 --> 00:16:32.360
not only societies but also regulators&nbsp;
and authorities in Europe and in the US.

00:16:32.360 --> 00:16:35.760
SULLIVAN: Yeah, it's a super exciting&nbsp;
time to be in this space. You know,&nbsp;&nbsp;

00:16:35.760 --> 00:16:39.360
we talked to Daniel a little earlier and,&nbsp;
you know, I think similar to pharmaceuticals,&nbsp;&nbsp;

00:16:39.360 --> 00:16:45.280
people have a general sense of what we mean when&nbsp;
we say medical devices, but most listeners may&nbsp;&nbsp;

00:16:45.280 --> 00:16:52.240
picture like a stethoscope or a hip implant.&nbsp;
The term “medical device” reaches much wider.&nbsp;&nbsp;

00:16:52.240 --> 00:16:57.120
Can you give us a quick, kind of, range from&nbsp;
perhaps very simple to even, I don't know,&nbsp;&nbsp;

00:16:57.120 --> 00:17:03.040
sci-fi and then your 90-second tour of how risk&nbsp;
assessment works and why a framework is essential?

00:17:03.040 --> 00:17:06.800
MINSSEN: Let me start out by saying that the WHO&nbsp;
[World Health Organization] estimates that today&nbsp;&nbsp;

00:17:06.800 --> 00:17:11.840
there are approximately 2 million different&nbsp;
kinds of medical devices on the world market,&nbsp;&nbsp;

00:17:11.840 --> 00:17:19.360
and as of the FDA's latest update that I'm&nbsp;
aware of, the FDA has authorized more than 1,000&nbsp;&nbsp;

00:17:19.360 --> 00:17:24.720
AI-, machine learning-enabled medical&nbsp;
devices, and that number is rising rapidly.

00:17:24.720 --> 00:17:29.040
So in that context, I think it is important&nbsp;
to understand that medical devices can be&nbsp;&nbsp;

00:17:29.040 --> 00:17:35.520
any instrument, apparatus, implement, machine,&nbsp;
appliance, implant, reagent for in vitro use,&nbsp;&nbsp;

00:17:35.520 --> 00:17:42.080
software, material, or other similar or related&nbsp;
articles that are intended by the manufacturer to&nbsp;&nbsp;

00:17:42.080 --> 00:17:47.920
be used alone or in combination for a medical&nbsp;
purpose. And the spectrum of what constitutes&nbsp;&nbsp;

00:17:47.920 --> 00:17:52.800
a medical device can thus range from very&nbsp;
simple devices such as tongue depressors,&nbsp;&nbsp;

00:17:52.800 --> 00:17:58.240
contact lenses, and thermometers to more&nbsp;
complex devices such as blood pressure monitors,&nbsp;&nbsp;

00:17:58.240 --> 00:18:04.960
insulin pumps, MRI machines, implantable&nbsp;
pacemakers, and even software as a medical&nbsp;&nbsp;

00:18:04.960 --> 00:18:09.680
device or AI-enabled monitors or&nbsp;
drug device combinations, as well.

00:18:09.680 --> 00:18:13.520
So talking about regulation, I think&nbsp;
it is also very important to stress&nbsp;&nbsp;

00:18:13.520 --> 00:18:17.280
that medical devices are used in&nbsp;
many diverse situations by very&nbsp;&nbsp;

00:18:17.280 --> 00:18:22.480
different stakeholders. And testing has&nbsp;
to take this variety into consideration,&nbsp;&nbsp;

00:18:22.480 --> 00:18:28.560
and it is intrinsically tied to regulatory&nbsp;
requirements across various jurisdictions.

00:18:28.560 --> 00:18:33.120
During the pre-market phase, medical&nbsp;
testing establishes baseline safety&nbsp;&nbsp;

00:18:33.120 --> 00:18:37.120
and effectiveness metrics through&nbsp;
bench testing, performance standards,&nbsp;&nbsp;

00:18:37.120 --> 00:18:43.360
and clinical studies. And post-market testing&nbsp;
ensures that real-world data informs ongoing&nbsp;&nbsp;

00:18:43.360 --> 00:18:49.120
compliance and safety improvements. So testing&nbsp;
is indispensable in translating technological&nbsp;&nbsp;

00:18:49.120 --> 00:18:53.600
innovation into safe and effective&nbsp;
medical devices. And while particular&nbsp;&nbsp;

00:18:53.600 --> 00:18:58.960
details of pre-market and post-market review&nbsp;
procedures may slightly differ among countries,&nbsp;&nbsp;

00:18:58.960 --> 00:19:04.640
most developed jurisdictions regulate medical&nbsp;
devices similarly to the US or European models.  

00:19:04.640 --> 00:19:09.840
So most jurisdictions with medical device&nbsp;
regulation classify devices based on their&nbsp;&nbsp;

00:19:09.840 --> 00:19:15.440
risk profile, intended use, indications&nbsp;
for use, technological characteristics,&nbsp;&nbsp;

00:19:15.440 --> 00:19:22.200
and the regulatory controls necessary to provide a&nbsp;
reasonable assurance of safety and effectiveness.

00:19:22.200 --> 00:19:27.280
SULLIVAN: So medical devices face&nbsp;
a pretty prescriptive multi-level&nbsp;&nbsp;

00:19:27.280 --> 00:19:30.320
testing path before they hit the&nbsp;
market. From your vantage point,&nbsp;&nbsp;

00:19:30.320 --> 00:19:34.120
what are some of the downsides of that&nbsp;
system and when does it make the most sense?

00:19:34.120 --> 00:19:39.440
MINSSEN: One primary drawback is, of course,&nbsp;
the lengthy and expensive approval process.&nbsp;&nbsp;

00:19:39.440 --> 00:19:44.240
High-risk devices, for example, often&nbsp;
undergo years of clinical trials,&nbsp;&nbsp;

00:19:44.240 --> 00:19:47.840
which can cost millions of dollars, and&nbsp;
this can create a significant barrier&nbsp;&nbsp;

00:19:47.840 --> 00:19:54.240
for startups and small companies with limited&nbsp;
resources. And even for moderate-risk devices,&nbsp;&nbsp;

00:19:54.240 --> 00:19:59.520
the regulatory burden can slow product&nbsp;
development and time to the market.

00:19:59.520 --> 00:20:05.760
And the approach can also limit flexibility.&nbsp;
Prescriptive requirements may not accommodate&nbsp;&nbsp;

00:20:05.760 --> 00:20:10.800
emerging innovations like digital therapeutics&nbsp;
or AI-based diagnostics in a feasible way. And&nbsp;&nbsp;

00:20:10.800 --> 00:20:16.240
in such cases, the framework can unintentionally&nbsp;
[stifle] innovation by discouraging creative&nbsp;&nbsp;

00:20:16.240 --> 00:20:20.960
solutions or iterative improvements,&nbsp;
which as a matter of fact can also put&nbsp;&nbsp;

00:20:20.960 --> 00:20:27.040
patients at risk when you don't use new&nbsp;
technologies and AI. And additionally,&nbsp;&nbsp;

00:20:27.040 --> 00:20:31.680
the same level of scrutiny may be&nbsp;
applied to low-risk devices, where&nbsp;&nbsp;

00:20:31.680 --> 00:20:37.680
the extensive testing and documentation may also&nbsp;
be disproportionate to the actual patient risk.

00:20:37.680 --> 00:20:42.320
However, the prescriptive model is&nbsp;
highly appropriate where we have high&nbsp;&nbsp;

00:20:42.320 --> 00:20:47.360
testing standards for high-risk medical&nbsp;
devices, in my view, particularly those&nbsp;&nbsp;

00:20:47.360 --> 00:20:52.960
that are life-sustaining, implanted,&nbsp;
or involve new materials or mechanisms.

00:20:52.960 --> 00:20:57.680
I also wanted to say that I think that&nbsp;
these higher compliance thresholds can&nbsp;&nbsp;

00:20:57.680 --> 00:21:03.520
be OK and necessary if you have a system&nbsp;
where authorities and stakeholders also&nbsp;&nbsp;

00:21:03.520 --> 00:21:10.320
have the capacity and funding to enforce,&nbsp;
monitor, and achieve compliance with such&nbsp;&nbsp;

00:21:10.320 --> 00:21:14.240
rules in a feasible, time-effective,&nbsp;
and straightforward manner. And this,&nbsp;&nbsp;

00:21:14.240 --> 00:21:18.480
of course, requires resources,&nbsp;
novel solutions, and investments.

00:21:18.480 --> 00:21:23.600
SULLIVAN: A range of tests are undertaken across&nbsp;
the life cycle of medical devices. How do these&nbsp;&nbsp;

00:21:23.600 --> 00:21:28.480
testing requirements vary across different stages&nbsp;
of development and across various applications?

00:21:28.480 --> 00:21:35.280
MINSSEN: Yes, that's a good question. So&nbsp;
I think first it is important to realize&nbsp;&nbsp;

00:21:35.280 --> 00:21:42.160
that testing is conducted by various entities,&nbsp;
including manufacturers, independent third-party&nbsp;&nbsp;

00:21:42.160 --> 00:21:49.120
laboratories, and regulatory agencies. And&nbsp;
it occurs throughout the device life cycle,&nbsp;&nbsp;

00:21:49.120 --> 00:21:54.000
beginning with iterative testing during the&nbsp;
research and development stage, advancing&nbsp;&nbsp;

00:21:54.000 --> 00:22:01.120
to pre-market evaluations, and continuing into&nbsp;
post-market monitoring. And the outcomes of these&nbsp;&nbsp;

00:22:01.120 --> 00:22:09.280
tests directly impact regulatory approvals, market&nbsp;
access, and device design refinements, as well.&nbsp;&nbsp;

00:22:09.840 --> 00:22:14.800
So the testing results are typically shared&nbsp;
with regulatory authorities and in some cases&nbsp;&nbsp;

00:22:14.800 --> 00:22:20.480
with healthcare providers and the broader&nbsp;
public to enhance transparency and trust.

00:22:20.480 --> 00:22:25.520
So if you talk about the different phases that&nbsp;
play a role here … so let's turn to the pre-market&nbsp;&nbsp;

00:22:25.520 --> 00:22:31.760
phase, where manufacturers must demonstrate&nbsp;
that the device conforms to safety and&nbsp;&nbsp;

00:22:31.760 --> 00:22:38.560
performance benchmarks defined by regulatory&nbsp;
authorities. Pre-market evaluations include&nbsp;&nbsp;

00:22:38.560 --> 00:22:45.360
functional bench testing, biocompatibility&nbsp;
assessments, and software validation, for example,&nbsp;&nbsp;

00:22:45.360 --> 00:22:50.960
all of which are integral components&nbsp;
of a manufacturer's submission.

00:22:50.960 --> 00:22:55.520
But, yes, but, testing also, and&nbsp;
we already touched upon that,&nbsp;&nbsp;

00:22:55.520 --> 00:23:03.200
extends into the post-market phase, where it&nbsp;
continues to ensure device safety and efficacy,&nbsp;&nbsp;

00:23:03.200 --> 00:23:09.280
and post-market surveillance relies on testing&nbsp;
to monitor real-world performance and identify&nbsp;&nbsp;

00:23:09.280 --> 00:23:15.200
emerging risks in the post-market phase. By&nbsp;
integrating real-world evidence into ongoing&nbsp;&nbsp;

00:23:15.200 --> 00:23:22.560
assessments, manufacturers can address unforeseen&nbsp;
issues, update devices as needed, and maintain&nbsp;&nbsp;

00:23:22.560 --> 00:23:27.840
compliance with evolving regulatory expectations.&nbsp;
And I think this is particularly important in&nbsp;&nbsp;

00:23:27.840 --> 00:23:33.440
this new generation of medical devices that&nbsp;
are AI-enabled or machine-learning enabled.

00:23:33.440 --> 00:23:38.640
I think we have to understand that in this&nbsp;
AI-enabled medical devices field, you know,&nbsp;&nbsp;

00:23:38.640 --> 00:23:45.040
the devices and the algorithms that are working&nbsp;
with them, they can improve in the lifetime of a&nbsp;&nbsp;

00:23:45.040 --> 00:23:51.200
product. So actually, not only could you assess&nbsp;
them and make sure that they remain safe,&nbsp;&nbsp;

00:23:51.200 --> 00:23:56.880
you could also sometimes lower the risk&nbsp;
category by finding evidence that these&nbsp;&nbsp;

00:23:56.880 --> 00:24:04.720
devices are actually becoming more precise and&nbsp;
safer. So it can both, you know, heighten the risk&nbsp;&nbsp;

00:24:04.720 --> 00:24:11.520
category or lower the risk category, and that's&nbsp;
why this continuous testing is so important.

00:24:11.520 --> 00:24:15.360
SULLIVAN: Given what you just said, how&nbsp;
should regulators handle a device whose&nbsp;&nbsp;

00:24:15.360 --> 00:24:18.480
algorithm keeps updating itself after approval? 

00:24:18.480 --> 00:24:27.040
MINSSEN: Well, it has to be an iterative process&nbsp;
that is feasible and straightforward and that is&nbsp;&nbsp;

00:24:27.040 --> 00:24:33.520
based on a very efficient, both time efficient&nbsp;
and performance efficient, communication between&nbsp;&nbsp;

00:24:33.520 --> 00:24:39.760
the regulatory authorities and the medical device&nbsp;
developers, right. We need to have the sensors in&nbsp;&nbsp;

00:24:39.760 --> 00:24:50.240
place that spot potential changes, and we need&nbsp;
to have the mechanisms in place that allow us to&nbsp;&nbsp;

00:24:50.240 --> 00:24:57.440
quickly react to these changes, both on the&nbsp;
regulatory side and on the technological side.

00:24:58.320 --> 00:25:04.880
So I think communication is important, and we need&nbsp;
to have the pathways and the feedback loops in the&nbsp;&nbsp;

00:25:04.880 --> 00:25:14.680
regulation that quickly allow us to monitor&nbsp;
these self-learning algorithms and devices.

00:25:14.680 --> 00:25:19.840
SULLIVAN: It sounds like it's just … there's such&nbsp;
a delicate balance between advancing technology&nbsp;&nbsp;

00:25:19.840 --> 00:25:26.000
and really ensuring public safety. You know, if&nbsp;
we clamp down too hard, we stifle that innovation.&nbsp;&nbsp;

00:25:26.000 --> 00:25:31.760
You already touched upon this a bit. But if we're&nbsp;
too lax, we risk unintended consequences. And I'd&nbsp;&nbsp;

00:25:31.760 --> 00:25:37.200
just love to hear how you think the field is&nbsp;
balancing that and any learnings you can share.

00:25:37.200 --> 00:25:41.920
MINSSEN: So this is very true, and you just&nbsp;
touched upon a very central question also&nbsp;&nbsp;

00:25:41.920 --> 00:25:47.440
in our research and our writing. And this is&nbsp;
also the reason why medical device regulation&nbsp;&nbsp;

00:25:47.440 --> 00:25:53.600
is so fascinating and continues to evolve in&nbsp;
response to rapid advancements in technologies,&nbsp;&nbsp;

00:25:53.600 --> 00:25:58.560
particularly dual technologies regarding&nbsp;
digital health, artificial intelligence,&nbsp;&nbsp;

00:25:58.560 --> 00:26:01.040
for example, and personalized medicine.

00:26:01.040 --> 00:26:05.360
And finding the balance is tricky because&nbsp;
also [a] related major future challenge&nbsp;&nbsp;

00:26:06.160 --> 00:26:10.880
relates to the increasing regulatory&nbsp;
jungle and the complex interplay between&nbsp;&nbsp;

00:26:10.880 --> 00:26:16.640
evolving regulatory landscapes&nbsp;
that regulate AI more generally.

00:26:16.640 --> 00:26:20.960
We really need to make sure that the regulatory&nbsp;
authorities that deal with this, that need to&nbsp;&nbsp;

00:26:20.960 --> 00:26:27.200
find the right balance to promote innovation&nbsp;
and mitigate and prevent risks, need to have&nbsp;&nbsp;

00:26:27.200 --> 00:26:32.560
the capacity to do this. So this requires&nbsp;
investments, and it also requires new ways&nbsp;&nbsp;

00:26:32.560 --> 00:26:39.960
to regulate this technology more flexibly, for&nbsp;
example through regulatory sandboxes and so on.

00:26:39.960 --> 00:26:43.600
SULLIVAN: Could you just expand upon&nbsp;
that a bit and double-click on what&nbsp;&nbsp;

00:26:43.600 --> 00:26:47.360
it is you're seeing there? What excites&nbsp;
you about what's happening in that space?

00:26:47.360 --> 00:26:53.600
MINSSEN: Yes, well, the research of my group at&nbsp;
the Center for Advanced Studies in Bioscience&nbsp;&nbsp;

00:26:53.600 --> 00:27:01.680
Innovation Law is very broad. I mean, we are&nbsp;
looking into gene editing technologies. We are&nbsp;&nbsp;

00:27:01.680 --> 00:27:09.520
looking into new biologics. We are looking&nbsp;
into medical devices, as well, obviously,&nbsp;&nbsp;

00:27:09.520 --> 00:27:12.800
but also other technologies&nbsp;
in advanced medical computing.

00:27:12.800 --> 00:27:20.400
And what we see across the line here is that&nbsp;
there is an increasing demand for having&nbsp;&nbsp;

00:27:20.400 --> 00:27:26.000
more adaptive and flexible regulatory&nbsp;
frameworks in these new technologies,&nbsp;&nbsp;

00:27:26.000 --> 00:27:31.520
in particular when they have new uses,&nbsp;
regulations that are focusing more on&nbsp;&nbsp;

00:27:31.520 --> 00:27:37.520
the product rather than the process. And I&nbsp;
have recently written a report, for example,&nbsp;&nbsp;

00:27:37.520 --> 00:27:44.720
for emerging biotechnologies and bio-solutions&nbsp;
for the EU Commission. And even in that area,&nbsp;&nbsp;

00:27:44.720 --> 00:27:50.160
regulatory sandboxes are increasingly&nbsp;
important, increasingly considered.

00:27:50.160 --> 00:27:56.480
So this idea of regulatory sandboxes was&nbsp;
originally developed in the financial sector,&nbsp;&nbsp;

00:27:56.480 --> 00:28:02.560
and it is now penetrating into other&nbsp;
sectors, including synthetic biology,&nbsp;&nbsp;

00:28:03.360 --> 00:28:08.720
emerging biotechnologies, gene&nbsp;
editing, AI, quantum technology,&nbsp;&nbsp;

00:28:08.720 --> 00:28:15.200
as well. This is basically creating an&nbsp;
environment where actors can test new&nbsp;&nbsp;

00:28:15.200 --> 00:28:21.520
ideas in close collaboration and under&nbsp;
the oversight of regulatory authorities.

00:28:21.520 --> 00:28:29.680
But to implement this in the AI sector now also&nbsp;
leads us to a lot of questions and challenges.&nbsp;&nbsp;

00:28:29.680 --> 00:28:37.280
For example, you need to have the capacities of&nbsp;
authorities that are governing and monitoring and&nbsp;&nbsp;

00:28:37.280 --> 00:28:43.360
deciding on these regulatory sandboxes. There are&nbsp;
issues relating to competition law, for example,&nbsp;&nbsp;

00:28:43.360 --> 00:28:49.680
which you call antitrust law in the US, because&nbsp;
the question is, who can enter the sandbox and how&nbsp;&nbsp;

00:28:49.680 --> 00:28:57.200
may they compete after they exit the sandbox?&nbsp;
And there are many questions relating to,&nbsp;&nbsp;

00:28:57.200 --> 00:29:03.411
how should we work with these sandboxes and&nbsp;
how should we implement these sandboxes?

00:29:03.411 --> 00:29:04.055
[TRANSITION MUSIC]

00:29:04.055 --> 00:29:06.520
SULLIVAN: Well, Timo, it has just been&nbsp;
such a pleasure to speak with you today.

00:29:06.520 --> 00:29:08.560
MINSSEN: Yes, thank you very much.

00:29:16.800 --> 00:29:19.600
SULLIVAN: And now I'm happy to introduce Chad Atalla.

00:29:19.600 --> 00:29:23.760
Chad is senior applied scientist in&nbsp;
Microsoft Research New York City's&nbsp;&nbsp;

00:29:23.760 --> 00:29:28.800
Sociotechnical Alignment Center, where they&nbsp;
contribute to foundational responsible AI&nbsp;&nbsp;

00:29:28.800 --> 00:29:33.280
research and practical responsible AI&nbsp;
solutions for teams across Microsoft.

00:29:33.280 --> 00:29:34.627
Chad, welcome!

00:29:34.627 --> 00:29:35.200
CHAD ATALLA: Thank you.

00:29:35.200 --> 00:29:40.480
SULLIVAN: So we'll kick off with a couple&nbsp;
questions just to dive right in. So tell me a&nbsp;&nbsp;

00:29:40.480 --> 00:29:46.240
little bit more about the Sociotechnical Alignment&nbsp;
Center, or STAC. I know it was founded in 2022.&nbsp;&nbsp;

00:29:46.240 --> 00:29:50.160
I'd love to just learn a little bit more about&nbsp;
what the group does, how you're thinking about&nbsp;&nbsp;

00:29:50.160 --> 00:29:54.320
evaluating AI, and maybe just give us a sense&nbsp;
of some of the projects you're working on.

00:29:54.320 --> 00:29:57.486
ATALLA: Yeah, absolutely.&nbsp;
The name is quite a mouthful.

00:29:57.486 --> 00:29:58.044
SULLIVAN: It is! [LAUGHS]

00:29:58.044 --> 00:30:01.000
ATALLA: So let's start by breaking&nbsp;
that down and seeing what that means.

00:30:01.000 --> 00:30:01.943
SULLIVAN: Great.

00:30:01.943 --> 00:30:05.360
ATALLA: So modern AI systems&nbsp;
are sociotechnical systems,&nbsp;&nbsp;

00:30:05.360 --> 00:30:10.160
meaning that the social and technical&nbsp;
aspects are deeply intertwined. And&nbsp;&nbsp;

00:30:10.160 --> 00:30:16.560
we're interested in aligning the behaviors of&nbsp;
these sociotechnical systems with some values.&nbsp;&nbsp;

00:30:16.560 --> 00:30:21.520
Those could be societal values; they could&nbsp;
be regulatory values, organizational values,&nbsp;&nbsp;

00:30:21.520 --> 00:30:28.560
etc. And to make this alignment happen, we&nbsp;
need the ability to evaluate the systems.

00:30:28.560 --> 00:30:34.640
So my team is broadly working on an evaluation&nbsp;
framework that acknowledges the sociotechnical&nbsp;&nbsp;

00:30:34.640 --> 00:30:40.640
nature of the technology and the often-abstract&nbsp;
nature of the concepts we're actually interested&nbsp;&nbsp;

00:30:40.640 --> 00:30:46.800
in evaluating. As you noted, it's an applied&nbsp;
science team, so we split our time between some&nbsp;&nbsp;

00:30:46.800 --> 00:30:54.320
fundamental research and time to bridge the&nbsp;
work into real products across the company.&nbsp;&nbsp;

00:30:54.320 --> 00:30:59.760
And I also want to note that to power this&nbsp;
sort of work, we have an interdisciplinary team&nbsp;&nbsp;

00:30:59.760 --> 00:31:04.320
drawing upon the social sciences, linguistics,&nbsp;
statistics, and, of course, computer science.

00:31:04.320 --> 00:31:08.560
SULLIVAN: Well, I'm eager to get into our&nbsp;
takeaways from the conversation with both&nbsp;&nbsp;

00:31:08.560 --> 00:31:12.240
Daniel and Timo. But maybe just to&nbsp;
double-click on this for a minute,&nbsp;&nbsp;

00:31:12.240 --> 00:31:16.960
can you talk a bit about some of the overarching&nbsp;
goals of the AI evaluations that you noted?

00:31:16.960 --> 00:31:25.040
ATALLA: So evaluation is really the act of making&nbsp;
valuative judgments based on some evidence,&nbsp;&nbsp;

00:31:25.040 --> 00:31:32.160
and in the case of AI evaluation, that evidence&nbsp;
might be from tests or measurements, right. And&nbsp;&nbsp;

00:31:32.160 --> 00:31:38.240
the goal of why we're doing this in the first&nbsp;
place is to make decisions and claims most often.

00:31:38.240 --> 00:31:44.080
So perhaps I am going to make a claim about&nbsp;
a model that I'm producing, and I want to say&nbsp;&nbsp;

00:31:44.080 --> 00:31:50.560
that it's better than this other model. Or we are&nbsp;
asking whether a certain product is safe to ship.&nbsp;&nbsp;

00:31:50.560 --> 00:31:57.360
All of these decisions need to be informed by&nbsp;
good evaluation and therefore good measurement&nbsp;&nbsp;

00:31:57.360 --> 00:32:05.040
or testing. And I'll also note that in the&nbsp;
regulatory conversation, risk is often what&nbsp;&nbsp;

00:32:05.040 --> 00:32:11.160
we want to evaluate. So that is a goal in and&nbsp;
of itself. And I'll touch more on that later.

00:32:11.160 --> 00:32:15.280
SULLIVAN: I read a recent paper that you&nbsp;
had put out with some of our colleagues&nbsp;&nbsp;

00:32:15.280 --> 00:32:19.040
from Microsoft Research, from the&nbsp;
University of Michigan, and Stanford,&nbsp;&nbsp;

00:32:19.040 --> 00:32:23.360
and you were arguing that evaluating&nbsp;
generative AI is a social-science&nbsp;&nbsp;

00:32:23.360 --> 00:32:27.120
measurement challenge. Maybe for those&nbsp;
who haven't read the paper, what does this&nbsp;&nbsp;

00:32:27.120 --> 00:32:31.200
mean? And can you tell us a little bit more&nbsp;
about what motivated you and your coauthors?

00:32:31.200 --> 00:32:37.280
ATALLA: So the measurement tasks involved in&nbsp;
evaluating generative AI systems are often&nbsp;&nbsp;

00:32:37.280 --> 00:32:42.400
abstract and contested. So that means&nbsp;
they cannot be directly measured and&nbsp;&nbsp;

00:32:42.400 --> 00:32:48.400
must instead [be] indirectly measured via other&nbsp;
observable phenomena. So this is very different&nbsp;&nbsp;

00:32:48.400 --> 00:32:54.080
than the older machine learning paradigm, where,&nbsp;
let's say, for example, I had a system that took a&nbsp;&nbsp;

00:32:54.080 --> 00:32:59.440
picture of a traffic light and told you whether&nbsp;
it was green, yellow, or red at a given time.

00:32:59.440 --> 00:33:05.760
If we wanted to evaluate that system, the task is&nbsp;
much simpler. But with the modern generative AI&nbsp;&nbsp;

00:33:05.760 --> 00:33:15.200
systems that are also general purpose, they have&nbsp;
open-ended output, and language in a whole chat or&nbsp;&nbsp;

00:33:15.200 --> 00:33:20.480
multiple paragraphs being outputted can have a lot&nbsp;
of different properties. And as I noted, these are&nbsp;&nbsp;

00:33:20.480 --> 00:33:25.920
general-purpose systems, so we don't know exactly&nbsp;
what task they're supposed to be carrying out.

00:33:25.920 --> 00:33:32.640
So then the question becomes, if I want to make&nbsp;
some decision or claim—maybe I want to make a&nbsp;&nbsp;

00:33:32.640 --> 00:33:41.200
claim that this system has human-level reasoning&nbsp;
capabilities—well, what does that mean? Do I have&nbsp;&nbsp;

00:33:41.200 --> 00:33:48.320
the same impression of what that means as you&nbsp;
do? And how do we know whether the downstream,&nbsp;&nbsp;

00:33:48.320 --> 00:33:53.520
you know, measurements and tests that I'm&nbsp;
conducting actually will support my notion of&nbsp;&nbsp;

00:33:53.520 --> 00:33:59.920
what it means to have human-level reasoning,&nbsp;
right? Difficult questions. But luckily,&nbsp;&nbsp;

00:33:59.920 --> 00:34:03.840
social scientists have been dealing with these&nbsp;
exact sorts of challenges for multiple decades&nbsp;&nbsp;

00:34:03.840 --> 00:34:09.920
in fields like education, political science, and&nbsp;
psychometrics. So we're really attempting to avoid&nbsp;&nbsp;

00:34:09.920 --> 00:34:16.000
reinventing the wheel here and trying&nbsp;
to learn from their past methodologies.

00:34:16.000 --> 00:34:21.200
And so the rest of the paper goes on to delve into&nbsp;
a four-level framework, a measurement framework,&nbsp;&nbsp;

00:34:21.200 --> 00:34:26.320
that's grounded in the measurement theory from&nbsp;
the quantitative social sciences that takes us&nbsp;&nbsp;

00:34:26.320 --> 00:34:32.960
all the way from these abstract and contested&nbsp;
concepts through processes to get much clearer&nbsp;&nbsp;

00:34:32.960 --> 00:34:37.320
and eventually reach reliable and valid&nbsp;
measurements that can power our evaluations.

00:34:37.320 --> 00:34:40.400
SULLIVAN: I love that. I mean, that's&nbsp;
the whole point of this podcast, too,&nbsp;&nbsp;

00:34:40.400 --> 00:34:44.320
right. Is to really build on those other&nbsp;
learnings and frameworks that we're taking&nbsp;&nbsp;

00:34:44.320 --> 00:34:49.760
from industries that have been thinking about this&nbsp;
for much longer. Maybe from your vantage point,&nbsp;&nbsp;

00:34:49.760 --> 00:34:54.880
what are some of the biggest day-to-day&nbsp;
hurdles in building solid AI evaluations and,&nbsp;&nbsp;

00:34:54.880 --> 00:34:59.040
I don't know, do we need more shared&nbsp;
standards? Are there bespoke methods?&nbsp;&nbsp;

00:34:59.040 --> 00:35:01.680
Are those the way to go? I would love&nbsp;
to just hear your thoughts on that.

00:35:01.680 --> 00:35:06.800
ATALLA: So let's talk about some of those&nbsp;
practical challenges. And I want to briefly&nbsp;&nbsp;

00:35:06.800 --> 00:35:12.400
go back to what I mentioned about risk before,&nbsp;
all right. Oftentimes, some of the regulatory&nbsp;&nbsp;

00:35:12.400 --> 00:35:18.640
environment is requiring practitioners to measure&nbsp;
the risk involved in deploying one of their&nbsp;&nbsp;

00:35:18.640 --> 00:35:27.200
models or AI systems. Now, risk is importantly&nbsp;
a concept that includes both event and impact,&nbsp;&nbsp;

00:35:27.200 --> 00:35:33.280
right. So there's the probability of some event&nbsp;
occurring. For the case of AI evaluation, perhaps&nbsp;&nbsp;

00:35:33.280 --> 00:35:41.520
this is us seeing a certain AI behavior exhibited.&nbsp;
Then there's also the severity of the impacts,&nbsp;&nbsp;

00:35:41.520 --> 00:35:47.840
and this is a complex chain of effects&nbsp;
in the real world that happen to people,&nbsp;&nbsp;

00:35:47.840 --> 00:35:56.640
organizations, systems, etc., and it's a lot&nbsp;
more challenging to observe the impacts, right.

00:35:56.640 --> 00:36:03.280
So if we're saying that we need to measure&nbsp;
risk, we have to measure both the event and&nbsp;&nbsp;

00:36:03.280 --> 00:36:09.280
the impacts. But realistically, right now,&nbsp;
the field is not doing a very good job of&nbsp;&nbsp;

00:36:09.280 --> 00:36:14.320
actually measuring the impacts. This requires&nbsp;
vastly different techniques and methodologies&nbsp;&nbsp;

00:36:14.320 --> 00:36:19.120
where if I just wanted to measure something&nbsp;
about the event itself, I can, you know,&nbsp;&nbsp;

00:36:19.120 --> 00:36:26.160
do that in a technical sandbox environment and&nbsp;
perhaps have some automated methods to detect&nbsp;&nbsp;

00:36:26.160 --> 00:36:30.720
whether a certain AI behavior is being exhibited.&nbsp;
But if I want to measure the impacts? Now,&nbsp;&nbsp;

00:36:30.720 --> 00:36:35.360
we're in the realm of needing to have real people&nbsp;
involved, and perhaps a longitudinal study where&nbsp;&nbsp;

00:36:35.360 --> 00:36:39.520
you have interviews, questionnaires,&nbsp;
and more qualitative evidence-gathering&nbsp;&nbsp;

00:36:39.520 --> 00:36:47.200
techniques to truly understand the long-term&nbsp;
impacts. So that's a significant challenge.

00:36:47.200 --> 00:36:52.560
Another is that, you know, let's say we forget&nbsp;
about the impacts for now and we focus on the&nbsp;&nbsp;

00:36:52.560 --> 00:37:00.560
event side of things. Still, we need datasets, we&nbsp;
need annotations, and we need metrics to make this&nbsp;&nbsp;

00:37:00.560 --> 00:37:08.000
whole thing work. When I say we need datasets,&nbsp;
if I want to test whether my system has good&nbsp;&nbsp;

00:37:08.000 --> 00:37:14.240
mathematical reasoning, what questions should I&nbsp;
ask? What are my set of inputs that are relevant?&nbsp;&nbsp;

00:37:14.240 --> 00:37:18.800
And then when I get the responses from the system,&nbsp;
how do I annotate them? How do I know if it was a&nbsp;&nbsp;

00:37:18.800 --> 00:37:23.840
good response that did demonstrate mathematical&nbsp;
reasoning or if it was a mediocre response?&nbsp;&nbsp;

00:37:24.640 --> 00:37:29.680
And then once I have an annotation of&nbsp;
all of these outputs from the AI system,&nbsp;&nbsp;

00:37:29.680 --> 00:37:33.800
how do I aggregate those all up&nbsp;
into a single informative number?

00:37:33.800 --> 00:37:38.320
SULLIVAN: Earlier in this episode, we&nbsp;
heard Daniel and Timo walk through the&nbsp;&nbsp;

00:37:38.320 --> 00:37:42.960
regulatory frameworks in pharma and medical&nbsp;
devices. I'd be curious what pieces of those&nbsp;&nbsp;

00:37:42.960 --> 00:37:47.240
mature systems are already showing up or at&nbsp;
least may be bubbling up in AI governance.

00:37:47.240 --> 00:37:53.200
ATALLA: Great question. You know, Timo was&nbsp;
talking about the pre-market and post-market&nbsp;&nbsp;

00:37:53.200 --> 00:38:00.720
testing difference. Of course, this is similarly&nbsp;
important in the AI evaluation space. But again,&nbsp;&nbsp;

00:38:00.720 --> 00:38:04.720
these have different methodologies&nbsp;
and serve different purposes.

00:38:04.720 --> 00:38:12.400
So within the pre-deployment phase, we&nbsp;
don't have evidence of how people are&nbsp;&nbsp;

00:38:12.400 --> 00:38:16.560
going to use the system. And when we&nbsp;
have these general-purpose AI systems,&nbsp;&nbsp;

00:38:16.560 --> 00:38:22.320
to understand what the risks are, we really&nbsp;
need to have a sense of what might happen&nbsp;&nbsp;

00:38:22.320 --> 00:38:26.640
and how they might be used. So there are&nbsp;
significant challenges there where I think&nbsp;&nbsp;

00:38:26.640 --> 00:38:33.680
we can learn from other fields and how they do&nbsp;
pre-market testing. And the difference in that&nbsp;&nbsp;

00:38:33.680 --> 00:38:38.800
pre- versus post-market testing also ties to&nbsp;
testing at different stages in the life cycle.

00:38:38.800 --> 00:38:45.920
For AI systems, we already see some regulations&nbsp;
saying you need to start with the base model&nbsp;&nbsp;

00:38:45.920 --> 00:38:52.160
and do some evaluation of the base model,&nbsp;
some basic attributes, some core attributes,&nbsp;&nbsp;

00:38:52.160 --> 00:38:57.760
of that base model before you start putting&nbsp;
it into any real products. But once we have&nbsp;&nbsp;

00:38:57.760 --> 00:39:03.520
a product in mind, we have a user base in mind,&nbsp;
we have a specific task—like maybe we're going&nbsp;&nbsp;

00:39:03.520 --> 00:39:07.440
to integrate this model into Outlook and&nbsp;
it's going to help you write emails—now&nbsp;&nbsp;

00:39:07.440 --> 00:39:12.320
we suddenly have a much crisper picture of&nbsp;
how the system will interact with the world&nbsp;&nbsp;

00:39:12.320 --> 00:39:18.400
around it. And again, at that stage, we need&nbsp;
to think about another round of evaluation.

00:39:18.400 --> 00:39:23.680
Another part that jumped out to me in what&nbsp;
they were saying about pharmaceuticals is&nbsp;&nbsp;

00:39:23.680 --> 00:39:28.080
that sometimes approvals can be&nbsp;
based on surrogate endpoints. So&nbsp;&nbsp;

00:39:28.080 --> 00:39:33.120
this is like we're choosing some heuristic.&nbsp;
Instead of measuring the long-term impact,&nbsp;&nbsp;

00:39:33.120 --> 00:39:36.880
which is what we actually care about,&nbsp;
perhaps we have a proxy that we feel&nbsp;&nbsp;

00:39:36.880 --> 00:39:42.960
like is a good enough indicator of what&nbsp;
that long-term impact might look like.

00:39:42.960 --> 00:39:50.480
This is occurring in the AI evaluation space&nbsp;
right now and is often perhaps even the default&nbsp;&nbsp;

00:39:50.480 --> 00:39:56.320
here since we're not seeing that many studies&nbsp;
of the long-term impact itself. We are seeing,&nbsp;&nbsp;

00:39:56.320 --> 00:40:03.600
instead, folks constructing these heuristics or&nbsp;
proxies and saying if I see this behavior happen,&nbsp;&nbsp;

00:40:03.600 --> 00:40:09.440
I'm going to assume that it indicates this&nbsp;
sort of impact will happen downstream. And&nbsp;&nbsp;

00:40:09.440 --> 00:40:16.480
that's great. It's one of the techniques that&nbsp;
was used to speed up and reduce the barrier&nbsp;&nbsp;

00:40:16.480 --> 00:40:22.320
to innovation in the other fields. And I think&nbsp;
it's great that we are applying that in the AI&nbsp;&nbsp;

00:40:22.320 --> 00:40:27.920
evaluation space. But special care is, of course,&nbsp;
needed to ensure that those heuristics and proxies&nbsp;&nbsp;

00:40:27.920 --> 00:40:33.280
you're using are reasonable indicators of&nbsp;
the greater outcome you're looking for.

00:40:33.280 --> 00:40:38.880
SULLIVAN: What are some of the promising ideas&nbsp;
from maybe pharma or med device regulation&nbsp;&nbsp;

00:40:38.880 --> 00:40:43.360
that maybe haven't made it to AI testing&nbsp;
yet and maybe should? And where would you&nbsp;&nbsp;

00:40:43.360 --> 00:40:47.440
urge technologists, policymakers, and&nbsp;
researchers to focus their energy next?

00:40:47.440 --> 00:40:51.200
ATALLA: Well, one of the key things&nbsp;
that jumped out to me in the discussion&nbsp;&nbsp;

00:40:51.200 --> 00:40:58.240
about pharmaceuticals was driving home the&nbsp;
emphasis that there is a holistic focus on&nbsp;&nbsp;

00:40:58.240 --> 00:41:04.800
safety and efficacy. These go hand in hand and&nbsp;
decisions must be made while considering both&nbsp;&nbsp;

00:41:04.800 --> 00:41:11.360
pieces of the picture. I would like to see that&nbsp;
further emphasized in the AI evaluation space.

00:41:11.360 --> 00:41:18.080
Often, we are seeing evaluations of&nbsp;
risk being separated from evaluations&nbsp;&nbsp;

00:41:18.080 --> 00:41:27.440
of performance or quality or efficacy, but&nbsp;
these two pieces of the puzzle really are&nbsp;&nbsp;

00:41:27.440 --> 00:41:33.760
not enough for us to make informed&nbsp;
decisions independently. And that&nbsp;&nbsp;

00:41:33.760 --> 00:41:41.040
ties back into my desire to really&nbsp;
also see us measuring the impacts.

00:41:41.040 --> 00:41:47.920
So we see Phase 3 trials as something that occurs&nbsp;
in the medical devices and pharmaceuticals field.&nbsp;&nbsp;

00:41:47.920 --> 00:41:52.640
That's not something that we are doing an&nbsp;
equivalent of in the AI evaluation space&nbsp;&nbsp;

00:41:52.640 --> 00:41:59.440
at this time. These are really cost intensive.&nbsp;
They can last years and really involve careful&nbsp;&nbsp;

00:41:59.440 --> 00:42:05.840
monitoring of that holistic picture of&nbsp;
safety and efficacy. And realistically,&nbsp;&nbsp;

00:42:05.840 --> 00:42:10.560
we are not going to be able to put that&nbsp;
on the critical path to getting specific&nbsp;&nbsp;

00:42:10.560 --> 00:42:16.400
individual AI models or AI systems vetted&nbsp;
before they go out into the world. However,&nbsp;&nbsp;

00:42:16.400 --> 00:42:24.800
I would love to see a world in which this sort&nbsp;
of work is prioritized and funded or required.&nbsp;&nbsp;

00:42:24.800 --> 00:42:31.360
Think of how, with social media, it took quite&nbsp;
a long time for us to understand that there are&nbsp;&nbsp;

00:42:31.360 --> 00:42:38.720
some long-term negative impacts on mental health,&nbsp;
and we have the opportunity now, while the AI wave&nbsp;&nbsp;

00:42:38.720 --> 00:42:44.880
is still building, to start prioritizing and&nbsp;
funding this sort of work. Let it run in the&nbsp;&nbsp;

00:42:44.880 --> 00:42:52.960
background and as soon as possible develop a good&nbsp;
understanding of the subtle, long-term effects.

00:42:52.960 --> 00:42:59.040
More broadly, I would love to see us focus on&nbsp;
reliability and validity of the evaluations we're&nbsp;&nbsp;

00:42:59.040 --> 00:43:06.240
conducting because trust in these decisions&nbsp;
and claims is important. If we don't focus&nbsp;&nbsp;

00:43:06.240 --> 00:43:11.760
on building reliable, valid, and trustworthy&nbsp;
evaluations, we're just going to continue to be&nbsp;&nbsp;

00:43:11.760 --> 00:43:18.320
flooded by a bunch of competing, conflicting,&nbsp;
and largely meaningless AI evaluations.

00:43:18.320 --> 00:43:22.480
SULLIVAN: In a number of the discussions&nbsp;
we've had on this podcast, we talked about&nbsp;&nbsp;

00:43:22.480 --> 00:43:28.000
how it's not just one entity that really needs&nbsp;
to ensure safety across the board, and I'd just&nbsp;&nbsp;

00:43:28.000 --> 00:43:33.040
love to hear from you how you think about some&nbsp;
of those ecosystem collaborations, and you know,&nbsp;&nbsp;

00:43:33.040 --> 00:43:38.800
from across … where we think about ourselves&nbsp;
as more of a platform company or places that&nbsp;&nbsp;

00:43:38.800 --> 00:43:43.760
these AI models are being deployed more at the&nbsp;
application level. Tell me a little bit about how&nbsp;&nbsp;

00:43:43.760 --> 00:43:49.400
you think about, sort of, stakeholders in that mix&nbsp;
and where responsibility lies across the board.

00:43:49.400 --> 00:43:54.560
ATALLA: It's interesting. In this age&nbsp;
of general-purpose AI technologies,&nbsp;&nbsp;

00:43:54.560 --> 00:44:00.080
we're often seeing one company or&nbsp;
organization being responsible for&nbsp;&nbsp;

00:44:00.080 --> 00:44:05.440
building the foundational model. And then&nbsp;
many, many other people will take that&nbsp;&nbsp;

00:44:05.440 --> 00:44:11.120
model and build it into specific products that&nbsp;
are designed for specific tasks and contexts.

00:44:11.120 --> 00:44:17.440
Of course, in that, we already see that&nbsp;
there is a responsibility of the owners&nbsp;&nbsp;

00:44:17.440 --> 00:44:22.560
of that foundational model to do some&nbsp;
testing of the central model before&nbsp;&nbsp;

00:44:22.560 --> 00:44:27.120
they distribute it broadly. And then&nbsp;
again, there is responsibility of all&nbsp;&nbsp;

00:44:27.120 --> 00:44:34.320
of the downstream individuals digesting that&nbsp;
and turning it into products to consider the&nbsp;&nbsp;

00:44:34.320 --> 00:44:39.520
specific contexts that they are deploying&nbsp;
into and how that may affect the risks we're&nbsp;&nbsp;

00:44:39.520 --> 00:44:47.040
concerned with or the types of quality and&nbsp;
safety and performance we need to evaluate.

00:44:47.040 --> 00:44:52.640
Again, because that field of risks&nbsp;
we may be concerned with is so broad,&nbsp;&nbsp;

00:44:52.640 --> 00:44:59.680
some of them also require an immense amount of&nbsp;
expertise. Let's think about whether AI systems&nbsp;&nbsp;

00:44:59.680 --> 00:45:06.640
can enable people to create dangerous chemicals&nbsp;
or dangerous weapons at home. It's not that every&nbsp;&nbsp;

00:45:06.640 --> 00:45:12.640
AI practitioner is going to have the knowledge&nbsp;
to evaluate this, so in some of those cases,&nbsp;&nbsp;

00:45:12.640 --> 00:45:19.600
we really need third-party experts, people&nbsp;
who are experts in chemistry, biology,&nbsp;&nbsp;

00:45:19.600 --> 00:45:26.800
etc., to come in and evaluate certain systems&nbsp;
and models for those specific risks, as well.

00:45:26.800 --> 00:45:31.680
So I think there are many reasons why&nbsp;
multiple stakeholders need to be involved,&nbsp;&nbsp;

00:45:31.680 --> 00:45:36.400
partly from who owns what and is responsible&nbsp;
for what and partly from the perspective of&nbsp;&nbsp;

00:45:36.400 --> 00:45:42.200
who has the expertise to meaningfully&nbsp;
construct the evaluations that we need.

00:45:42.200 --> 00:45:47.360
SULLIVAN: Well, Chad, this has just been great&nbsp;
to connect, and in a few of our discussions,&nbsp;&nbsp;

00:45:47.360 --> 00:45:53.040
we've done a bit of a lightning round, so I'd&nbsp;
love to just hear your 30-second responses&nbsp;&nbsp;

00:45:53.040 --> 00:45:58.400
to a few of these questions. Perhaps favorite&nbsp;
evaluation you've run so far this year?

00:45:58.400 --> 00:46:04.560
ATALLA: So I've been involved in trying to&nbsp;
evaluate some language models for whether&nbsp;&nbsp;

00:46:04.560 --> 00:46:10.720
they infer sensitive attributes about people.&nbsp;
So perhaps you're chatting with a chatbot,&nbsp;&nbsp;

00:46:10.720 --> 00:46:17.440
and it infers your religion or sexuality based&nbsp;
on things you're saying or how you sound, right.&nbsp;&nbsp;

00:46:17.440 --> 00:46:23.120
And in working to evaluate this, we encounter a&nbsp;
lot of interesting questions. Or, like, what is&nbsp;&nbsp;

00:46:23.120 --> 00:46:28.560
a sensitive attribute? What makes these attributes&nbsp;
sensitive, and what are the differences that make&nbsp;&nbsp;

00:46:28.560 --> 00:46:33.840
it inappropriate for an AI system to infer these&nbsp;
things about a person? Whereas realistically,&nbsp;&nbsp;

00:46:33.840 --> 00:46:39.920
whenever I meet a person on the street, my brain&nbsp;
is immediately forming first impressions and some&nbsp;&nbsp;

00:46:39.920 --> 00:46:45.600
assumptions about these people. So it's a very&nbsp;
interesting and thought-provoking evaluation to&nbsp;&nbsp;

00:46:46.240 --> 00:46:51.840
conduct and think about the norms that&nbsp;
we place upon people interacting with&nbsp;&nbsp;

00:46:51.840 --> 00:46:56.120
other people and the norms we place upon&nbsp;
AI systems interacting with other people.

00:46:56.120 --> 00:47:02.144
SULLIVAN: That's fascinating! I'd love to hear&nbsp;
the AI buzzword you'd retire tomorrow. [LAUGHTER]

00:47:02.144 --> 00:47:08.320
ATALLA: I would love to see the term&nbsp;
“bias” being used less when referring&nbsp;&nbsp;

00:47:08.320 --> 00:47:15.440
to fairness-related issues and systems.&nbsp;
Bias happens to be a highly overloaded&nbsp;&nbsp;

00:47:15.440 --> 00:47:20.960
term in statistics and machine learning&nbsp;
and has a lot of technical meanings and&nbsp;&nbsp;

00:47:20.960 --> 00:47:26.800
just fails to perfectly capture&nbsp;
what we mean in the AI risk sense.

00:47:26.800 --> 00:47:30.280
SULLIVAN: And last one. One&nbsp;
metric we're not tracking enough.

00:47:30.280 --> 00:47:39.440
ATALLA: I would say over-blocking, and this comes&nbsp;
into that connection between the holistic picture&nbsp;&nbsp;

00:47:39.440 --> 00:47:47.440
of safety and efficacy. It's too easy to produce&nbsp;
systems that throw safety to the wind and focus&nbsp;&nbsp;

00:47:47.440 --> 00:47:52.560
purely on utility or achieving some goal, but&nbsp;
simultaneously, the other side of the picture&nbsp;&nbsp;

00:47:52.560 --> 00:47:59.760
is possible, where we can clamp down too hard&nbsp;
and reduce the utility of our systems and block&nbsp;&nbsp;

00:47:59.760 --> 00:48:07.520
even benign and useful outputs just because they&nbsp;
border on something sensitive. So it's important&nbsp;&nbsp;

00:48:07.520 --> 00:48:13.680
for us to track that over-blocking and actively&nbsp;
track that tradeoff between safety and efficacy.

00:48:13.680 --> 00:48:18.480
SULLIVAN: Yeah, we talk a lot about this on the&nbsp;
podcast, too, of how do you both make things safe&nbsp;&nbsp;

00:48:18.480 --> 00:48:24.320
but also ensure innovation can thrive, and I think&nbsp;
you hit the nail on the head with that last piece.

00:48:24.320 --> 00:48:25.158
[MUSIC]

00:48:25.158 --> 00:48:30.290
Well, Chad, this was really terrific. Thanks&nbsp;
for joining us and thanks for your work and your&nbsp;&nbsp;

00:48:30.290 --> 00:48:33.656
perspectives. And another big thanks to Daniel and&nbsp;
Timo for setting the stage earlier in the podcast.

00:48:33.656 --> 00:48:38.755
And to our listeners, thanks for tuning&nbsp;
in. You can find resources related to&nbsp;&nbsp;

00:48:38.755 --> 00:48:41.440
this podcast in the show notes. And&nbsp;
if you want to learn more about how&nbsp;&nbsp;

00:48:41.440 --> 00:48:47.360
Microsoft approaches AI governance,&nbsp;
you can visit microsoft.com/RAI.

00:48:47.360 --> 00:49:03.440
See you next time!

00:49:03.440 --> 00:49:04.334
[MUSIC FADES]