Summarizing Very Long Novels with LLMs

Background

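Using an LLM to summarize web pages, articles, and PDF files is no longer a novel application. LLMs are also moving toward very large context windows and multimodality: as of May 2024, mainstream models offer context windows of around 128K tokens, with the largest at 200K. Handling ordinary text is therefore no longer much of a problem.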

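Even so, an LLM cannot process the full text of a long novel in one pass, and less widely spoken languages make the seemingly simple task of summarization harder. Overall, there are two major problems to solve: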

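1. Shrink the text to fit within a single context window so that it can be summarized, while losing as little information as possible
2. Improve the model's poor performance on less widely spoken languages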

Divide and conquer

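For a novel, the smallest building block is the chapter. We can condense each chapter while keeping as much information as possible, and use the condensed chapters for the later summary.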
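As a minimal sketch of the splitting step, the snippet below cuts a book into chapters at heading lines. The heading pattern (`Chapter <number>`) and the name `split_chapters` are assumptions made for illustration; real books use many different heading styles, so the pattern would need adjusting.

```python
import re

# Hypothetical heading pattern: a line such as "Chapter 12" or "Chapter 12: Title".
# Real books vary, so adjust this to the source text.
CHAPTER_HEADING = re.compile(r"^Chapter\s+\d+.*$", re.MULTILINE)


def split_chapters(text: str) -> list[str]:
    """Split a novel into chapters, keeping each heading with its body."""
    starts = [m.start() for m in CHAPTER_HEADING.finditer(text)]
    if not starts:
        # No headings found: treat the whole text as a single chapter.
        return [text.strip()] if text.strip() else []
    chapters = []
    # Text before the first heading (a preface, for instance) is kept as its own piece.
    if text[: starts[0]].strip():
        chapters.append(text[: starts[0]].strip())
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chapters.append(text[begin:end].strip())
    return chapters
```

Each returned string can then be condensed independently of the others.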

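In our experiments, a chapter of about 2,500 tokens can be condensed to about 1,000 tokens, a saving of 60%. Assuming a 128K-token context window, the number of chapters that fit in one pass rises from about 50 to about 130, roughly 80 more.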
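The figures above can be checked with a few lines of arithmetic. The 128K window is an assumption taken from the background section, and the calculation ignores the tokens used by the prompt and the model's reply, so the real capacity is slightly lower.

```python
CONTEXT_WINDOW = 128_000   # assumed window size, in tokens
ORIGINAL_TOKENS = 2_500    # typical chapter length reported above
CONDENSED_TOKENS = 1_000   # typical length after condensing

savings = 1 - CONDENSED_TOKENS / ORIGINAL_TOKENS
chapters_before = CONTEXT_WINDOW // ORIGINAL_TOKENS
chapters_after = CONTEXT_WINDOW // CONDENSED_TOKENS

print(f"token savings: {savings:.0%}")                        # 60%
print(f"chapters per window before: {chapters_before}")       # 51
print(f"chapters per window after: {chapters_after}")         # 128
print(f"extra chapters: {chapters_after - chapters_before}")  # 77
```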

Rosetta Stone

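For less widely spoken languages such as Filipino and Indonesian, the model's weaker ability in those languages tends to produce garbled output. Translating them into English, either with the model itself or by other means, does work, however. So we can first translate the text into English and then summarize the English version.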
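A sketch of the translate-first step is shown below. The `llm` argument is a hypothetical callable that takes a system prompt and the user text and returns the model's reply; it stands in for whatever client is actually used. The prompt text is the translation prompt from the appendix, and `"id"` is the ISO 639-1 code for Indonesian.

```python
from typing import Callable

# Hypothetical interface: takes a system prompt and the user text, returns the model's reply.
LLM = Callable[[str, str], str]

TRANSLATE_PROMPT = (
    "You are a professional translator. The language would be in ISO-639-1 format.\n\n"
    "Your task is to translate {source_language} into {target_language}.\n\n"
    "Do not translate any name of a person. Consider the context for better translation."
)


def to_english(text: str, source_language: str, llm: LLM) -> str:
    """Translate text into English unless it is already English."""
    if source_language == "en":
        return text
    prompt = TRANSLATE_PROMPT.format(
        source_language=source_language, target_language="en"
    )
    return llm(prompt, text)
```

For example, `to_english(chapter, "id", llm)` would send an Indonesian chapter through the translation prompt, while English input is returned unchanged without a model call.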

Cherry on top

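As mentioned above, each chapter needs to be condensed. This step is independent for every chapter, so it can be parallelized with a large number of coroutines or a thread pool. In practice, using 5 threads cut the total generation time from about 40 minutes to under 10 minutes.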
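The thread-pool variant can be sketched as follows, using the standard library's `ThreadPoolExecutor`. As before, `llm` is a hypothetical callable (system prompt, user text) that returns the reply, and `condense_all` is an illustrative name; the prompt is the condensing prompt from the appendix. Threads suit this workload because each call spends most of its time waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

CONDENSE_PROMPT = (
    "I want you to act as a sophisticated editor. Your task is to shorten the given "
    "text while preserving most of its contents."
)


def condense_all(
    chapters: list[str],
    llm: Callable[[str, str], str],  # hypothetical: (system prompt, text) -> reply
    workers: int = 5,
) -> list[str]:
    """Condense every chapter concurrently; results come back in chapter order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order even when calls finish out of order.
        return list(pool.map(lambda chapter: llm(CONDENSE_PROMPT, chapter), chapters))
```

Keeping the results in chapter order matters because the condensed chapters are concatenated for the final summary. A real client would also need to respect the provider's rate limits, which this sketch does not handle.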

Takeaways

To summarize very long text with an LLM, the following approach and steps can be used:

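1. For some less widely spoken languages, translate the text into English first
2. Split the text into chapters and condense each one, in parallel where possible
3. Concatenate the condensed chapters and summarize the result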
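The final step, joining the condensed chapters and summarizing them in one call, might look like the sketch below. Both `llm` and `count_tokens` are hypothetical callables supplied by the caller (the latter would wrap whichever tokenizer matches the model), and the 128K default is an assumption. The prompt is the book synopsis prompt from the appendix.

```python
from typing import Callable

SUMMARY_PROMPT = (
    "I want you to act as a sophisticated editor. Your task is to summarize the entire "
    "story, including the ending. You would be given a series of shortened version of "
    "chapters of a book."
)


def summarize_book(
    condensed: list[str],
    llm: Callable[[str, str], str],     # hypothetical: (system prompt, text) -> reply
    count_tokens: Callable[[str], int], # hypothetical tokenizer wrapper
    context_window: int = 128_000,      # assumed window size, in tokens
) -> str:
    """Join condensed chapters and summarize them in a single call."""
    joined = "\n\n".join(condensed)
    # Fail early if the input would not fit; the reply needs room in the window too.
    used = count_tokens(SUMMARY_PROMPT) + count_tokens(joined)
    if used >= context_window:
        raise ValueError(f"input needs {used} tokens, window is {context_window}")
    return llm(SUMMARY_PROMPT, joined)
```

If the check fails, one option is to condense the chapters more aggressively before retrying.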

Appendix

Condensing prompt

I want you to act as a sophisticated editor. Your task is to shorten the given text while preserving most of its contents.

Translation prompt

You are a professional translator. The language would be in ISO-639-1 format.

Your task is to translate {source_language} into {target_language}.

Do not translate any name of a person. Consider the context for better translation.

Book synopsis prompt

I want you to act as a sophisticated editor. Your task is to summarize the entire story, including the ending. You would be given a series of shortened version of chapters of a book.

Chapter outline prompt

You are a sophisticated editor. Your task is to summarize the given chapter in a few sentences. The summarization should be brief but don't leave out any key moments.

阿猫的博客