00:00:00.034 --> 00:00:03.233
[MUSIC PLAYS]

00:00:03.233 --> 00:00:07.160
GRETCHEN HUIZINGA: Welcome to Abstracts,
a Microsoft Research Podcast that puts the

00:00:07.160 --> 00:00:11.931
spotlight on world-class research in
brief. I’m Dr. Gretchen Huizinga.

00:00:15.211 --> 00:00:20.080
In this series, members of the research community
at Microsoft give us a quick snapshot, or

00:00:20.080 --> 00:00:23.829
a podcast abstract, of their
new and noteworthy papers.

00:00:23.829 --> 00:00:25.200
[MUSIC FADES]

00:00:25.200 --> 00:00:29.680
I'm here today with Dr. Arindam Mitra, a
senior researcher at Microsoft Research

00:00:29.680 --> 00:00:35.440
and the lead researcher for Microsoft's Orca
project. Dr. Mitra is coauthor of a paper

00:00:35.440 --> 00:00:41.280
called “AgentInstruct: Toward Generative
Teaching with Agentic Flows.” Arindam,

00:00:41.280 --> 00:00:43.665
it's a pleasure to have you on Abstracts today.

00:00:43.665 --> 00:00:44.720
ARINDAM MITRA: Thank you, Gretchen.

00:00:44.720 --> 00:00:48.440
HUIZINGA: So let's start with
a brief overview of your paper.

00:00:48.440 --> 00:00:51.900
What problem does your research
address, and why does it matter?

00:00:51.900 --> 00:00:56.920
MITRA: So the post-training phase is very
important for language models. You can really

00:00:56.920 --> 00:01:03.640
improve the model a lot by creating high-quality
synthetic data. The problem, though, is that

00:01:03.640 --> 00:01:09.160
high-quality synthetic data creation requires
lots of human effort and expertise. The problem

00:01:09.160 --> 00:01:15.520
that we're trying to tackle is, how do you reduce
human effort? How can you create high-quality data

00:01:15.520 --> 00:01:21.200
with a really low amount of human effort? When you
have a language model and, let's say, you want

00:01:21.200 --> 00:01:26.520
to apply it somewhere, you might have to train a
generic model first, which could be small or big.

00:01:26.520 --> 00:01:33.280
It doesn’t matter. After that, you can specialize
it on the domain that you are looking for, and

00:01:33.280 --> 00:01:40.160
when you want to do that, to make this particular
process really fast, it's best if you go for

00:01:40.160 --> 00:01:45.520
synthetic data. If you have a way to actually
generate very high-quality synthetic data,

00:01:45.520 --> 00:01:52.520
you can fast-track this part of the specialization
process. And not only for a single model. This year,

00:01:52.520 --> 00:01:56.640
you're going to see a lot more multi-agent
models. And when you are trying to build these

00:01:56.640 --> 00:02:02.360
multi-agent models, you're fearing, like, OK, it
might increase the cost too much, the latency too

00:02:02.360 --> 00:02:06.880
much. So it's also very important that, when you
have a multi-agent system, you can, sort of,

00:02:06.880 --> 00:02:12.280
replace some of those agents with specialized
small models. And when you're trying to

00:02:12.280 --> 00:02:18.880
address these goals, you want this process to be
something which you know works fast. So that's why

00:02:18.880 --> 00:02:23.740
we are trying to make sure we have a very good way
to create synthetic data for your specific need.

00:02:23.740 --> 00:02:30.240
HUIZINGA: No research exists in a vacuum, and
most of it fills some kind of a gap. So tell us

00:02:30.240 --> 00:02:34.320
what's already been done in this field
and how this work is building on it.

00:02:34.320 --> 00:02:41.520
MITRA: So previously, actually, we have seen
that in post-training, the more data you have,

00:02:41.520 --> 00:02:47.440
the better the performance goes for the model
you're training. So what we wanted to test is how

00:02:47.440 --> 00:02:52.720
much we can scale and what happens if we scale a
lot more. But we didn't have the tools for it.

00:02:52.720 --> 00:02:58.480
So the approach people previously used
was, you have a small set of data, and how do you

00:02:58.480 --> 00:03:03.040
expand this dataset into a larger and larger
amount of data. That's where people were mostly

00:03:03.040 --> 00:03:07.840
focusing. But it's not that easy to create that
initial seed set. [LAUGHTER] You need a lot of

00:03:07.840 --> 00:03:13.520
expertise. The way that we're doing it is, rather,
you define what you want to create. Like,

00:03:13.520 --> 00:03:17.360
OK, you want to create tool-use data.
So you say, OK, I have a bunch of tools,

00:03:17.360 --> 00:03:22.360
and I am looking for data in the scenarios where
someone can just come give me a description and

00:03:22.360 --> 00:03:27.600
then maybe that person interacts with the AI to
figure out how to get the job done. It's not a

00:03:27.600 --> 00:03:32.200
one-step thing. And maybe you also have a setting
where it's more like an app developer. You have

00:03:32.200 --> 00:03:36.880
a bunch of APIs in your phone. You just want to
figure out which one is best for the user request,

00:03:36.880 --> 00:03:39.640
which came through voice command. So
different scenarios could be there. So

00:03:39.640 --> 00:03:43.560
what we're saying [is], OK, we are not going
through the method where you have to come up

00:03:43.560 --> 00:03:47.240
with your own initial seed data and then we
expand. It is more like you define what you

00:03:47.240 --> 00:03:52.280
want to do. It's much more abstract. And then,
we are, sort of, automating the effort of data

00:03:52.280 --> 00:03:56.840
creation. So this setting actually of synthetic
data creation, we are referring [to] it as

00:03:56.840 --> 00:03:59.440
generative teaching, and that's where we
are, sort of, differing. So previously,

00:03:59.440 --> 00:04:05.480
it was more like expansion, and now we are trying
to go from specification to the data that you need.

00:04:05.480 --> 00:04:08.280
HUIZINGA: Gotcha. Well, talk a little bit more

00:04:08.280 --> 00:04:12.600
about your methodology and how you
went about conducting this research.

00:04:12.600 --> 00:04:18.160
MITRA: So first of all, what we are proposing
actually is a multi-agent solution. So you start

00:04:18.160 --> 00:04:22.600
with first describing what you really
need. So you describe in detail, like,

00:04:22.600 --> 00:04:28.440
I need data for this specific skill or this
specific scenario. Then, what we do is like,

00:04:28.440 --> 00:04:34.480
OK, you have some unstructured data or raw data
like text documents or code files that you gather

00:04:34.480 --> 00:04:41.120
from the web with a permissible license or use something
that you own. We don't care much about what the

00:04:41.120 --> 00:04:46.960
content is really. So it's more like we got some
random stuff, some random content. And then we'll

00:04:46.960 --> 00:04:51.200
guide you on how to convert this random something
which is not meaningful for you into something

00:04:51.200 --> 00:04:54.360
which is meaningful for your data creation.
For example, like, if you are creating data

00:04:54.360 --> 00:04:59.200
to teach how to use APIs, you might think about,
you need lots of APIs and how do you get these

00:04:59.200 --> 00:05:04.640
APIs. So what we are saying is, like, we can take
something like code and we'll have agents which

00:05:04.640 --> 00:05:12.760
will convert these raw code files into a list of
APIs which is more like a library. So you create

00:05:12.760 --> 00:05:16.800
automatically this input that is very meaningful
for data creation. And then once we have that,

00:05:16.800 --> 00:05:20.560
we have basically the seed instruction creation
step based on your specification. Like, what

00:05:20.560 --> 00:05:24.920
do you want to create data for? So you have all
these different scenarios, and we have multiple

00:05:24.920 --> 00:05:28.360
agents creating data for different scenarios.
And then the last step is actually what we

00:05:28.360 --> 00:05:33.840
call the refinement step. So it's more like whatever
data you created, we’ll go through them and we’ll

00:05:33.840 --> 00:05:39.040
make them better and better: improve the quality,
improve the complexity, improve the trickiness,

00:05:39.040 --> 00:05:45.040
we’ll teach when not to answer, etc., etc.
So make sure we cover the whole space. So by

00:05:45.040 --> 00:05:48.973
changing the stochastic seed, we are trying
to cover the entire possible data space.

00:05:48.973 --> 00:05:49.593
HUIZINGA: Right.

00:05:49.593 --> 00:05:54.600
MITRA: So that's the key thing. The way we, sort
of, conducted this research is actually we defined

00:05:54.600 --> 00:06:00.480
17 skills. Skills meaning reading comprehension,
tool use, text modification, content creation,

00:06:00.480 --> 00:06:04.120
RAG (retrieval-augmented generation) … we have,
like, a list of 17 skills … conversation … and then

00:06:04.120 --> 00:06:10.600
we created one multi-agent flow for each of the
skills and we generate data. So one key thing I

00:06:10.600 --> 00:06:16.160
want to highlight is, like, this work, compared
to other work, it was not benchmark driven. We

00:06:16.160 --> 00:06:20.480
want to teach a skill. We don't care which
benchmarks we're trying to evaluate it on.

00:06:20.480 --> 00:06:25.800
So we define the skill, like tool use means this
to us, reading comprehension means this to us,

00:06:25.800 --> 00:06:30.480
text modification means this to us. And then we,
sort of, generate the data to teach everything for

00:06:30.480 --> 00:06:35.800
that skill. And then what we did, we created
actually 22 million instructions. And previously,

00:06:35.800 --> 00:06:40.320
in the Orca series, we had around 3 million
instructions. So the 25 million is what

00:06:40.320 --> 00:06:46.280
we, sort of, have at the end. And that's what
we actually trained a Mistral model on, as of now.

00:06:46.280 --> 00:06:50.400
And we're going to measure, like, how much we
improve the Mistral model by this post-training.

00:06:50.400 --> 00:06:53.440
HUIZINGA: Moving from methods to findings,

00:06:53.440 --> 00:06:57.040
I always look forward to the part of the
research paper that finishes the sentence

00:06:57.040 --> 00:07:02.500
“and what we found was … ,” so give us a quick
overview of your results. What did you find?

00:07:02.500 --> 00:07:09.280
MITRA: Yes, so the results were actually very
exciting for us. So Mistral 7B was our main,

00:07:09.280 --> 00:07:12.240
sort of, baseline because that's
where we’re trying to showcase, like,

00:07:12.240 --> 00:07:16.000
how much improvement we are getting. On the other
side, we have, like, frontier models: ChatGPT,

00:07:16.000 --> 00:07:20.320
GPT-4. We want to also measure how far we
are from those frontier models, so that's,

00:07:20.320 --> 00:07:25.720
sort of, our evaluation setup. So on average
actually, we got like 20 percent performance

00:07:25.720 --> 00:07:31.520
gain over Mistral, and we evaluated that
across 14 benchmarks that test reasoning,

00:07:31.520 --> 00:07:37.560
content creation, instruction following, format
following, etc. But what was more important to us

00:07:37.560 --> 00:07:41.720
was to do a skill-specific evaluation because we
are trying to teach certain skills, and we had,

00:07:41.720 --> 00:07:45.920
like, 17 skills as we mentioned earlier. So, for
example, like, if you are focusing on reading

00:07:45.920 --> 00:07:51.480
comprehension as a skill, we took LSAT, SAT,
and DROP, and many other benchmarks; we created

00:07:51.480 --> 00:07:55.840
a collection of reading comprehension-based
benchmarks. And there, we are observing, like,

00:07:55.840 --> 00:08:00.720
20 percent improvement over Mistral, and what it
means is, like, we're actually achieving GPT-4–level

00:08:00.720 --> 00:08:06.600
performance. Similarly, if I'm focusing on math
skill, there are many datasets which test, like,

00:08:06.600 --> 00:08:12.680
elementary math, high school math, college-level
math. And we improved actually across all these

00:08:12.680 --> 00:08:19.080
different levels of math. So we see from
40 percent to 150 percent of improvement

00:08:19.080 --> 00:08:24.440
on different benchmarks of math. So it was
more like what we wanted to see. We're not

00:08:24.440 --> 00:08:27.720
optimizing for a particular benchmark. We
wanted to optimize the skill, and that's

00:08:27.720 --> 00:08:31.600
what you're observing. So you're observing
improvement in math across all these levels,

00:08:31.600 --> 00:08:37.040
from elementary to high school to college to
middle school, etc., everything. The same goes

00:08:37.040 --> 00:08:43.800
for RAG, as well. On the RAG skill, we’re observing
around 92 percent improvement over Mistral. The

00:08:43.800 --> 00:08:47.960
format following numbers are pretty interesting
to us. So format following is very important for

00:08:47.960 --> 00:08:52.480
SLMs (small language models). You want to make
these models practical. You want to make sure

00:08:52.480 --> 00:08:57.080
that they follow the format so you can parse
the result. And we were able to take Mistral

00:08:57.080 --> 00:09:02.320
beyond Gemini Pro. So that was a very strong
performance from the post-training that we

00:09:02.320 --> 00:09:07.680
did. For summarization, actually we were able to
reduce the hallucination rate by 31 percent while

00:09:07.680 --> 00:09:12.080
achieving GPT-4–level quality. So overall,
all these results were, sort of, highlighting

00:09:12.080 --> 00:09:17.060
that the methodology that we have, which we're
calling AgentInstruct, is very promising.

00:09:17.060 --> 00:09:21.400
HUIZINGA: I think it's important to
get practical and talk about real-world

00:09:21.400 --> 00:09:27.480
impact. So tell us who you think this
research will benefit most and why.

00:09:27.480 --> 00:09:34.480
MITRA: Yeah, so again the model builders
will, sort of, find it most beneficial. So the

00:09:34.480 --> 00:09:41.120
significance of our work actually lies in the way
we are trying to revolutionize language model

00:09:41.120 --> 00:09:46.520
development through scalable, low-effort synthetic
[data] creation. And the scalable and low effort is,

00:09:46.520 --> 00:09:52.200
sort of, the key thing, right. We have shown
that we can create very high-quality data.

00:09:52.200 --> 00:09:56.080
That's what the numbers are telling us. We
want to mention that this is very scalable

00:09:56.080 --> 00:10:01.020
and low effort, and that's what we think
might help the most for model builders.

00:10:01.020 --> 00:10:06.120
HUIZINGA: So, Arindam, let's borrow a phrase
from the machine learning lexicon and go for

00:10:06.120 --> 00:10:11.320
a little one-shot learning here: if you had
to boil down why your work is important,

00:10:11.320 --> 00:10:15.960
what's the one thing you want our
listeners to take away from this research?

00:10:15.960 --> 00:10:20.960
MITRA: The key takeaway would be, like, the
AgentInstruct method enables the generation

00:10:20.960 --> 00:10:26.800
of vast, diverse, and high-quality
synthetic data with very minimal human

00:10:26.800 --> 00:10:30.580
input. So that's one thing I would
like [you] to remember from this paper.

00:10:30.580 --> 00:10:36.440
HUIZINGA: So as we close, talk briefly about
the limitations that you encountered in this

00:10:36.440 --> 00:10:41.560
project and directions for future research. What
are the outstanding challenges in this field,

00:10:41.560 --> 00:10:44.460
and what's on your research
agenda to overcome them?

00:10:44.460 --> 00:10:51.080
MITRA: Yes, so we're exploring further automation.
But apart from making this data creation more

00:10:51.080 --> 00:10:57.200
automated, with less human involvement needed,
we're trying to focus on two other aspects. One

00:10:57.200 --> 00:11:03.120
is automated model debugging, and the other is
automated model repairing. So now that we have

00:11:03.120 --> 00:11:07.840
the ability to generate data for a particular
skill, let's say math, for model debugging,

00:11:07.840 --> 00:11:14.080
what we need is basically an error handler.
Like something we can plug in which takes

00:11:14.080 --> 00:11:18.240
the question and the answer coming from a
different model and verifies if the answer

00:11:18.240 --> 00:11:23.880
is correct or not. So that's the part we're
working on right now, figuring out this error

00:11:23.880 --> 00:11:29.040
handler. And the second aspect is repairing.
So once we have the error, we figure out, OK,

00:11:29.040 --> 00:11:34.040
this is where the model is struggling. How can we
give feedback or how can we give more knowledge so

00:11:34.040 --> 00:11:38.229
it can basically correct those errors? So those
are some things we're working on right now.

00:11:38.229 --> 00:11:39.550
[MUSIC PLAYS] 

00:11:39.550 --> 00:11:43.160
HUIZINGA: Well, Arindam Mitra, thanks for
joining us today, and to our listeners,

00:11:43.160 --> 00:11:51.200
thanks for tuning in. If you want to read this
paper, you can find a link at aka.ms/abstracts,

00:11:51.200 --> 00:11:56.059
or you can find a preprint on arXiv.
See you next time on Abstracts!

00:12:03.957 --> 00:12:10.800
[MUSIC FADES]