00:00:01.385 --> 00:00:02.119
[MUSIC] 

00:00:02.119 --> 00:00:08.640
AMBER TINGLE: Welcome to Abstracts, a Microsoft
Research Podcast that puts the spotlight on

00:00:08.640 --> 00:00:16.240
world-class research in brief. I’m Amber Tingle.
In this series, members of the research community

00:00:16.240 --> 00:00:28.026
at Microsoft give us a quick snapshot, or a podcast
abstract, of their new and noteworthy papers.

00:00:28.026 --> 00:00:29.028
[MUSIC FADES]

00:00:29.028 --> 00:00:34.800
Our guest today is Weizhu Chen. He is vice
president of Microsoft GenAI and coauthor

00:00:34.800 --> 00:00:40.400
of a paper called “Not All Tokens Are What
You Need for Pretraining.” This paper is

00:00:40.400 --> 00:00:45.760
an oral presentation during the 38th annual
Conference on Neural Information Processing

00:00:45.760 --> 00:00:52.760
Systems, also known as NeurIPS, which is
happening this week in Vancouver. Weizhu,

00:00:52.760 --> 00:00:55.707
thank you for joining us today on Abstracts!

00:00:55.707 --> 00:00:57.160
WEIZHU CHEN: Thank you for having me, Amber.

00:00:57.160 --> 00:01:00.160
TINGLE: So let's start with a brief overview

00:01:00.160 --> 00:01:06.200
of your paper. In a couple sentences, tell us
about the problem your research addresses and,

00:01:06.200 --> 00:01:11.360
more importantly, why the research community
and beyond should know about this work.

00:01:11.360 --> 00:01:18.000
CHEN: So my team basically in Microsoft GenAI,
we are working on model training. So one of the

00:01:18.000 --> 00:01:23.520
things actually we do in the pretraining,
we realize the importance of the data. And

00:01:23.520 --> 00:01:28.480
we found that actually when we do this kind of
data for each of the tokens, some token is more

00:01:28.480 --> 00:01:33.120
important than the other. That's one. The other
one actually is some token actually is very,

00:01:33.120 --> 00:01:40.080
very hard to be predicted during the pretraining.
So, for example, just like if someone see the text

00:01:40.080 --> 00:01:45.000
of “Weizhu,” and what's the next token? It can
be “Chen”; it can be any of the last name. So

00:01:45.000 --> 00:01:51.040
it's very hard to be predicted. And if we try
to enforce a language model to focus on this,

00:01:51.040 --> 00:01:55.040
kind of, the hard-to-predict token, just like
actually it's going to confuse the language

00:01:55.040 --> 00:01:59.120
model. There are so many different kinds of
the example like this. Just like, for example,

00:01:59.120 --> 00:02:04.760
the serial number in your UPS. So the focus
of this paper is try to identify which token

00:02:04.760 --> 00:02:09.320
actually is more important for the language
model to learn. And actually the other token

00:02:09.320 --> 00:02:14.680
maybe is just the noise. And how can we try
to discriminate the token: which is good token,

00:02:14.680 --> 00:02:20.880
which is noise token. Basically, you try to
understand this kind of dynamic of the tokens.

00:02:20.880 --> 00:02:23.560
TINGLE: How did you conduct this research?

00:02:23.560 --> 00:02:28.760
CHEN: Actually we do a lot of work in the
model training, including the pretraining

00:02:28.760 --> 00:02:34.160
and the post-training. So for the pretraining
side, actually the most important thing to us

00:02:34.160 --> 00:02:40.040
is the data. We also try to understand, how can we
leverage the existing data, and how can we create

00:02:40.040 --> 00:02:46.960
much more data, as well? And data basically is
one of the most important thing to build a better

00:02:46.960 --> 00:02:56.160
foundation model. So we try to understand how much
more we can get from the data. And the important

00:02:56.160 --> 00:03:02.360
thing for the data is about data filtering. So you
think about actually in the previous literature,

00:03:02.360 --> 00:03:06.920
we do the data filtering, for example, just
like we build a classifier to classify, OK,

00:03:06.920 --> 00:03:11.600
this page is more important than the other. And
this page actually is a noise because there's so

00:03:11.600 --> 00:03:19.680
much noise data in the web. So we just keep the
best data to get into the pretraining corpus. And

00:03:19.680 --> 00:03:26.400
further away, we think about, OK, yeah, so this
is … maybe it's not fine grain enough, so can we

00:03:26.400 --> 00:03:32.480
try to understand even for the same page we want
to keep? So some token is more important than the

00:03:32.480 --> 00:03:37.920
other. Maybe some token just some noise token.
Actually you put this data into the pretraining,

00:03:37.920 --> 00:03:43.620
it's going to hurt the model quality. So there
is the motivation actually we try to think about.

00:03:43.620 --> 00:03:46.120
TINGLE: And what were your major findings?

00:03:46.120 --> 00:03:52.760
CHEN: Our major finding is about basically,
definitely this works so well. And it's so

00:03:52.760 --> 00:04:00.200
important that actually we are able to get
the best token from the corpus and then

00:04:00.200 --> 00:04:05.320
make it available and try to ask the model during
the pretraining to ignore the token we don't want

00:04:05.320 --> 00:04:12.200
to get into the model itself. So that is one.
The second thing definitely data is the other

00:04:12.200 --> 00:04:18.040
very important thing. If you're able to figure
out the better way to build a better data is

00:04:18.040 --> 00:04:22.800
most likely you’re able to build a much better
foundation model. The third thing actually is

00:04:22.800 --> 00:04:29.720
also connected to a lot of other existing work,
just like data synthesis, just like distillation,

00:04:30.280 --> 00:04:36.880
just like data filtering, and so a lot of things
are really connected together. And actually,

00:04:36.880 --> 00:04:43.280
this work, basically, you can associate with also
a lot of other work we are working on, just like

00:04:43.280 --> 00:04:48.360
distillation. You can think about, for example,
for this work, we also try to build a model,

00:04:48.360 --> 00:04:55.280
a reference model (we call as the reference
model) to try to identify actually this data,

00:04:55.280 --> 00:05:02.240
this token, is more important than the other
and try to understand the discrepancy between

00:05:02.240 --> 00:05:10.120
the reference model and the running model, their
prediction on each tokens. So you can think about

00:05:10.120 --> 00:05:17.520
also it's some kind of the try to distill from the
reference model to the existing model, as well.
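
EDITOR'S NOTE: The comparison Chen describes, between the reference model's and the running model's prediction on each token, can be sketched as a per-token loss gap. The following is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each model reports the probability it assigned to the true next token, scores each token by how much worse the training model does than the reference, and averages the training loss over only the highest-scoring fraction. The function names, the keep_ratio parameter, and the probabilities are all hypothetical.

```python
import math
from typing import List, Sequence


def token_nll(probs: Sequence[float]) -> List[float]:
    """Negative log-likelihood per token, given the probability a model
    assigned to the true next token at each position."""
    return [-math.log(p) for p in probs]


def select_tokens(train_loss: Sequence[float],
                  ref_loss: Sequence[float],
                  keep_ratio: float) -> List[int]:
    """Return a 0/1 mask that keeps the tokens where the training model
    lags the reference model the most (largest loss gap)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    gap = [t - r for t, r in zip(train_loss, ref_loss)]
    n_keep = max(1, math.ceil(keep_ratio * len(gap)))
    ranked = sorted(range(len(gap)), key=lambda i: gap[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(gap))]


def selective_loss(train_loss: Sequence[float], mask: Sequence[int]) -> float:
    """Average the training loss over the kept tokens only; masked-out
    tokens contribute nothing to the objective."""
    kept = [loss for loss, m in zip(train_loss, mask) if m]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical probabilities for four tokens. The second token stands in
    # for a surname after a given name: both models find it equally hard.
    train = token_nll([0.20, 0.01, 0.10, 0.30])
    ref = token_nll([0.25, 0.01, 0.60, 0.70])
    mask = select_tokens(train, ref, keep_ratio=0.5)
    print(mask)  # [0, 0, 1, 1]
    print(selective_loss(train, mask))
```

In this toy example the surname-like token is equally hard for both models, so its gap is zero and it is left out of the loss, which matches the hard-to-predict case Chen raises earlier in the conversation.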

00:05:17.520 --> 00:05:24.120
TINGLE: Let's talk a little bit about real-world
impact. Who benefits most from this work? And how

00:05:24.120 --> 00:05:29.640
significant is this within your discipline and
even downstream for people using applications?

00:05:29.640 --> 00:05:36.480
CHEN: This actually is very, very fundamental work
because just like I share a little bit before,

00:05:36.480 --> 00:05:42.880
actually we build the data and this data is (build
the data much better) is able to build a much

00:05:42.880 --> 00:05:48.080
better foundation model. If we're able to build
a better model actually is able to benefit so

00:05:48.080 --> 00:05:53.680
many different kinds of application. This also
is going to help us to build a much better small

00:05:53.680 --> 00:05:59.680
language model. And we can also serve this model
even in the edge side, in the client side, in the

00:05:59.680 --> 00:06:06.440
coding scenario. So we are going to see actually
huge impact from this kind of the foundation

00:06:06.440 --> 00:06:10.720
model if you are able to benefit from
building much better training data.

00:06:10.720 --> 00:06:15.360
TINGLE: Are there any unanswered
questions or unsolved problems

00:06:15.360 --> 00:06:18.460
in this area? What's next on your research agenda?

00:06:18.460 --> 00:06:26.960
CHEN: Yeah, I think that is a very good questions.
And definitely there's a lot of things about how

00:06:26.960 --> 00:06:35.840
to build a better data [that] is unsolved yet in
the literature. And especially because when you

00:06:35.840 --> 00:06:42.840
do the pretraining, the most important part is the
data, but the data is very limited. And how can we

00:06:42.840 --> 00:06:48.840
make better use from the existing limited data is
a big challenge. Because we can increase the model

00:06:48.840 --> 00:06:53.880
by 10x, but it’s super hard to increase the data
by 10x, especially when we want to deal with the

00:06:53.880 --> 00:07:01.280
high quality of data. The other way, even given
the data, how can you identify, especially for

00:07:01.280 --> 00:07:08.840
this work, the importance of each token to build
a much better model? I think all these things are

00:07:08.840 --> 00:07:15.200
very connected together. To me, actually, data is
the oxygen. So there are still so many things we

00:07:15.200 --> 00:07:20.240
are able to do in the data, including building for
even the small language model or the large model.

00:07:20.240 --> 00:07:26.560
TINGLE: Data is oxygen. I love that! So other
than that being a key takeaway, is there any

00:07:26.560 --> 00:07:32.200
other one thing that you'd like our listeners
to walk away from this conversation knowing?

00:07:32.200 --> 00:07:39.760
CHEN: I would love to say actually focus more
on this kind of data and focus more about how

00:07:39.760 --> 00:07:46.880
can I get more from the data actually;
it is the very important thing. And the

00:07:46.880 --> 00:07:49.720
other thing actually, we are working
on something that's very exciting.

00:07:49.720 --> 00:07:54.184
You can feel free to come to join us if
you are very interested in this area.

00:07:54.184 --> 00:07:54.839
[MUSIC]

00:07:54.839 --> 00:07:58.620
TINGLE: Well, Weizhu Chen, thank you for
joining us today. We really appreciate it.

00:07:58.620 --> 00:08:00.620
CHEN: Thank you. Thank you for having me.

00:08:00.620 --> 00:08:05.960
TINGLE: And thanks to our listeners for tuning
in. If you’d like to read the full paper, you may

00:08:05.960 --> 00:08:14.760
find a link at aka.ms/abstracts. You can also find
the paper on arXiv and on the NeurIPS conference

00:08:14.760 --> 00:08:27.920
website. I’m Amber Tingle from Microsoft Research,
and we hope you’ll join us next time on Abstracts!

00:08:27.920 --> 00:08:28.855
[MUSIC FADES]