00:00:00.400 --> 00:00:02.412
[MUSIC]

00:00:02.412 --> 00:00:07.160
GRETCHEN HUIZINGA: Welcome to Abstracts,
a Microsoft Research Podcast that puts the

00:00:07.160 --> 00:00:11.965
spotlight on world-class research in brief.
I’m Dr. Gretchen Huizinga.

00:00:15.126 --> 00:00:20.080
In this series, members of the research community
at Microsoft give us a quick snapshot,

00:00:20.080 --> 00:00:25.227
or a podcast abstract, of their
new and noteworthy papers.

00:00:25.227 --> 00:00:25.240
[MUSIC FADES]

00:00:25.240 --> 00:00:31.680
My guest today is Dr. Li Lyna Zhang, a senior
researcher at Microsoft Research. Dr. Zhang

00:00:31.680 --> 00:00:37.800
is coauthor of a paper called “LongRoPE:
Extending LLM Context Window Beyond 2 Million

00:00:37.800 --> 00:00:43.160
Tokens.” This paper was featured at this year's
International Conference on Machine Learning,

00:00:43.160 --> 00:00:47.825
or ICML. Li, thanks so much for
joining us today on Abstracts!

00:00:47.825 --> 00:00:49.200
LI LYNA ZHANG: Thank you for having me.

00:00:49.200 --> 00:00:51.880
HUIZINGA: So let's start with a brief overview of

00:00:51.880 --> 00:00:57.140
your paper. Tell us about the issue your
research addresses and why it matters.

00:00:57.140 --> 00:01:04.400
ZHANG: OK, so this paper is about how to
effectively extend the context window of

00:01:04.400 --> 00:01:11.760
large language models beyond 2 million tokens.
Why is this important? Because enabling longer

00:01:11.760 --> 00:01:20.040
input contexts can improve LLM capabilities.
Right now, some LLMs can only handle a limited

00:01:20.040 --> 00:01:28.760
context window of 4K tokens, which is about 10
pages in a book. With our method, we can push

00:01:28.760 --> 00:01:36.680
the LLM context window to over 2 million tokens. That
means you can put all seven Harry Potter books into

00:01:36.680 --> 00:01:44.600
the LLM and ask any question about the story!
Another important thing is that our method is

00:01:44.600 --> 00:01:52.440
super efficient. It requires minimal changes
to the LLM architectures, and most existing

00:01:52.440 --> 00:01:59.840
optimizations can be reused. Therefore, our
method can be easily applied in real production.

00:01:59.840 --> 00:02:04.560
HUIZINGA: So it sounds like what you're
working on is improving the memory span

00:02:04.560 --> 00:02:08.120
of artificial intelligence or large
language models. So what's already been

00:02:08.120 --> 00:02:12.960
done in this field, and what unique
contributions does your work bring?

00:02:12.960 --> 00:02:19.913
ZHANG: Well, there has been a lot of work
in building long-context LLMs. For example,

00:02:19.913 --> 00:02:24.480
pretraining with an efficient model architecture,
using RAG (retrieval-augmented generation), and

00:02:24.480 --> 00:02:30.280
extending the context window with RoPE
positional interpolation. Our approach

00:02:30.280 --> 00:02:39.040
uses the last technique. Let me briefly explain
it. RoPE stands for rotary positional embedding,

00:02:39.040 --> 00:02:46.240
which encodes token position information for
transformer models. When we pretrain an LLM, we

00:02:46.240 --> 00:02:54.640
set a context window size, and all token positions
have a predefined range of RoPE values. Extending

00:02:54.640 --> 00:03:02.360
to a longer context window introduces new token
positions that can be out of this predefined

00:03:02.360 --> 00:03:10.640
range, thus leading to out-of-distribution issues
and making fine-tuning difficult. RoPE positional

00:03:10.640 --> 00:03:17.480
interpolation solves this by downscaling
positional embeddings to fit within the

00:03:17.480 --> 00:03:25.600
pretrained range. However, positional embeddings
like RoPE exhibit non-uniform information entropy

00:03:25.600 --> 00:03:33.280
in transformer models. Existing approaches do not
effectively handle these non-uniformities during

00:03:33.280 --> 00:03:39.840
RoPE interpolation, leading to information
loss and limiting the context window size.

00:03:39.840 --> 00:03:46.120
Our method addresses this challenge; therefore,
it can achieve the longest context window size.
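
Editor's illustration (not part of the spoken conversation): the idea of downscaling positions so they fit within the pretrained range can be sketched in a few lines of Python. This is a minimal sketch of uniform (linear) positional interpolation, not the LongRoPE method itself. The function name `rope_angles`, the base of 10000, the embedding size of 8, and the 16K target length are all assumptions chosen for illustration; only the 4K pretrained window comes from the discussion.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Return one rotation angle per pair of embedding dimensions.

    scale > 1 divides the position before rotating, which is the
    uniform (linear) form of positional interpolation.
    """
    return [(position / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]


PRETRAINED_LEN = 4096   # 4K window, as mentioned in the conversation
EXTENDED_LEN = 16384    # illustrative target length (assumption)
scale = EXTENDED_LEN / PRETRAINED_LEN

unscaled = rope_angles(EXTENDED_LEN - 1, dim=8)
interpolated = rope_angles(EXTENDED_LEN - 1, dim=8, scale=scale)

# Without scaling, the last position rotates far past anything seen in
# pretraining; with scaling it lands back inside the pretrained range.
print(unscaled[0] > PRETRAINED_LEN, interpolated[0] < PRETRAINED_LEN)
```

Because every dimension is divided by the same factor here, this sketch treats all dimensions alike, which is exactly the uniformity that the speaker says existing approaches do not handle well.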

00:03:46.120 --> 00:03:51.560
HUIZINGA: OK, so, Li, how would you describe
the methodology you used for this work,

00:03:51.560 --> 00:03:54.640
and how did you go about conducting the research?

00:03:54.640 --> 00:04:01.840
ZHANG: OK. So our method is to interpolate the
RoPE positional embedding. It has three main

00:04:01.840 --> 00:04:09.320
steps. First, we introduce an efficient evolution
search algorithm to perform non-uniform RoPE

00:04:09.320 --> 00:04:16.560
positional interpolation. Second, we propose
a progressive context window extension strategy.

00:04:16.560 --> 00:04:24.320
It begins by searching for a 256K length on
the pretrained LLM and fine-tuning it at this

00:04:24.320 --> 00:04:33.160
length. Then, based on the fine-tuned 256K
LLM, we did a second search for new RoPE

00:04:33.160 --> 00:04:40.240
interpolations to achieve a 2048K context
window size. Finally, since long-context

00:04:40.240 --> 00:04:46.480
LLMs will drop performance at their original
context window, we readjusted the non-uniform

00:04:46.480 --> 00:04:52.440
positional interpolation at a 4K length to
recover the short-context-window performance.
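
Editor's illustration (not part of the spoken conversation): the first step, an evolution search over per-dimension interpolation factors, might look roughly like the toy loop below. This is a generic evolutionary search under stated assumptions, not the authors' algorithm. The names `evolution_search` and `toy_score`, the mutation range, the pool size, and especially the scoring function are hypothetical placeholders. A real search would score each candidate by evaluating the model on long inputs.

```python
import random


def evolution_search(score, num_dims, max_factor,
                     generations=30, population=16, seed=0):
    """Toy evolutionary search for one scaling factor per RoPE dimension pair.

    `score` maps a list of factors to a number where lower is better.
    """
    rng = random.Random(seed)
    # Start from uniform interpolation: every dimension uses the same factor.
    pool = [[max_factor] * num_dims]
    for _generation in range(generations):
        children = []
        for _child in range(population):
            parent = rng.choice(pool)
            # Mutate each factor slightly and clamp it to [1, max_factor].
            child = [min(max_factor, max(1.0, f * rng.uniform(0.8, 1.2)))
                     for f in parent]
            children.append(child)
        # Keep the best few candidates (lowest score) as the next parents.
        pool = sorted(pool + children, key=score)[:4]
    return pool[0]


def toy_score(factors):
    """Hypothetical stand-in score: distance to an arbitrary target ramp."""
    n = len(factors)
    target = [1.0 + 7.0 * i / (n - 1) for i in range(n)]
    return sum((f - t) ** 2 for f, t in zip(factors, target))


best = evolution_search(toy_score, num_dims=4, max_factor=8.0)
print(best)
```

Because the previous parents stay in the pool, the best score never gets worse from one generation to the next. The progressive strategy described in the conversation would then repeat such a search at a longer target length, starting from the fine-tuned model.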

00:04:52.440 --> 00:04:55.800
HUIZINGA: Let's talk about findings. Tell us how

00:04:55.800 --> 00:04:59.840
things worked out for you and what you
found as a result of your experiments.

00:04:59.840 --> 00:05:07.440
ZHANG: Yeah. Our study verified two important
non-uniformities in LLM context window extension.

00:05:07.440 --> 00:05:14.560
We identified that lower RoPE dimensions
and initial token positions require less

00:05:14.560 --> 00:05:20.640
interpolation because they contain crucial
and high-frequency information. Higher RoPE

00:05:20.640 --> 00:05:27.460
dimensions require more interpolation because
they carry sparse, low-frequency information.
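
Editor's illustration (not part of the spoken conversation): the two non-uniformities can be made concrete with a short Python sketch. It assumes the common RoPE frequency schedule with base 10000. The names `rope_wavelengths`, `nonuniform_angle`, and `keep_first`, along with the example factors, are hypothetical and chosen only to show the shape of the idea.

```python
import math


def rope_wavelengths(dim, base=10000.0):
    """Wavelength (in tokens) of each dimension pair: low pairs rotate fast."""
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]


def nonuniform_angle(position, pair_index, dim, factors,
                     keep_first=0, base=10000.0):
    """Rotation angle with a per-dimension factor and unscaled initial tokens."""
    freq = base ** (-2 * pair_index / dim)
    if position < keep_first:
        # Initial token positions are left without interpolation.
        return position * freq
    # Later positions are downscaled by this dimension's own factor.
    return position / factors[pair_index] * freq


factors = [1.0, 2.0, 4.0, 8.0]  # illustrative: less scaling for lower dimensions
print(rope_wavelengths(8))
print(nonuniform_angle(2, 0, 8, factors, keep_first=4))
print(nonuniform_angle(100, 3, 8, factors, keep_first=4))
```

The wavelengths grow with the dimension index, which is why the lower dimensions are described as high-frequency and the higher ones as low-frequency.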

00:05:27.460 --> 00:05:31.200
HUIZINGA: So work in the
lab is always interesting,

00:05:31.200 --> 00:05:36.800
but deployment in real-world settings is often
another story. If everything is successful,

00:05:36.800 --> 00:05:40.000
Li, who benefits most from your LongRoPE research?

00:05:40.000 --> 00:05:45.440
ZHANG: Well, our work significantly
improves LLMs' capabilities to handle

00:05:45.440 --> 00:05:52.000
long context in real-world applications, such
as long-context retrieval, code debugging,

00:05:52.000 --> 00:05:58.640
and even multi-modality LLM applications.
Moreover, our method achieves this with

00:05:58.640 --> 00:06:04.600
minimal modifications to the RoPE positional
embedding. Therefore, it can be widely applied

00:06:04.600 --> 00:06:13.640
to production. We have integrated LongRoPE
into the Microsoft Phi-3 128K family, which are

00:06:13.640 --> 00:06:22.380
the first long-context LLMs in their class. Before
LongRoPE, Phi models had only a 2K context window.

00:06:22.380 --> 00:06:25.360
HUIZINGA: So who is your primary user?

00:06:25.360 --> 00:06:32.360
ZHANG: I think any users who want to use
long-context LLMs can be our audience.

00:06:32.360 --> 00:06:34.060
HUIZINGA: So it's a wide audience.

00:06:34.060 --> 00:06:35.900
ZHANG: Yeah, it’s a wide audience.

00:06:35.900 --> 00:06:40.040
HUIZINGA: It's about now that I always
ask the “golden nugget” question. If

00:06:40.040 --> 00:06:45.960
you wanted to leave our listeners with one key
takeaway from this research, what would it be?

00:06:45.960 --> 00:06:52.680
ZHANG: Well, if there's one key takeaway from
our work, it must be our key finding that

00:06:52.680 --> 00:07:00.160
non-uniformities in rotary positional embedding
are crucial for LLM context window extension. And

00:07:00.160 --> 00:07:06.160
if you want to build a high-quality long-context
LLM, LongRoPE is all you need to know!

00:07:06.160 --> 00:07:11.280
HUIZINGA: Talk about what's left to do
in this field in terms of open questions

00:07:11.280 --> 00:07:16.000
and outstanding challenges. What's
next on your research agenda, Li?

00:07:16.000 --> 00:07:21.120
ZHANG: So far, there are still a couple
of big questions in this field. First,

00:07:21.120 --> 00:07:26.760
it's challenging to achieve both strong
long-context and short-context capabilities at the same

00:07:26.760 --> 00:07:33.240
time. Although we have managed to recover some
of the short-context performance for the long-context LLM,

00:07:33.240 --> 00:07:41.800
it has not recovered 100 percent. We are trying
different approaches to close these gaps. Second,

00:07:41.800 --> 00:07:48.280
we want to figure out how we can use these
long-context LLMs to solve more challenging tasks,

00:07:48.280 --> 00:07:53.663
and then we can push these models
to work harder and smarter for us.

00:07:53.663 --> 00:07:53.670
[MUSIC]

00:07:53.670 --> 00:07:57.560
HUIZINGA: Well, Li Lyna Zhang, thanks for
joining us today, and to our listeners,

00:07:57.560 --> 00:08:06.080
thanks for tuning in. If you want to read this
paper, you can find a link at aka.ms/abstracts,

00:08:06.080 --> 00:08:10.903
or you can find it on arXiv.
See you next time on Abstracts!

00:08:18.002 --> 00:08:25.857
[MUSIC FADES]