LLM 结构化输出

引子

在几乎除了聊天以外所有的程序调用场景中，我们都希望 LLM 通过某种结构化的方式来输出，便于后续程序处理。在本文中，我们采用一个推书的例子，通过几种方式由简单到复杂地让 LLM 结构化地输出结果。我采用的方法尽量不依赖某个平台或模型的特有功能，而是一些通用的方式来实现。

这个例子很简单，向 LLM 提供一个主题，然后让它推荐几本相关的书，列出其名称、作者、推荐原因以及发表年份。对应的 prompt 可以这么写：

I want you to recommend some books about {topic}.

一般来说，LLM 会给出一大段话，然后用子弹列表的形式列举（当然这个 prompt 太过简单，不一定我想要的四个字段都有）。

对于「纯文本」，程序显然是无法「稳定」解析的。我们需要让它以某种结构化的方式进行输出，例如 JSON 或者 XML。本文中，我们选择 JSON 作为「结构化」输出，我们希望 LLM 输出以下格式的内容。

{
  "items": [
    {
      "name": "1984",
      "author": "George Orwell",
      "reason": "Another classic dystopian novel that explores themes of surveillance, totalitarianism, and individuality in a future society.",
      "year_of_publish": 1949
    },
    {
      "name": "Dune",
      "author": "Frank Herbert",
      "reason": "A sprawling epic set on the desert planet of Arrakis, dealing with politics, religion, and the struggle for control of the planet's valuable spice.",
      "year_of_publish": 1965
    }
  ]
}

注：尽管列表也是一个标准的 JSON，但 OpenAI 的 JSON mode 只支持 JSON object，因此套多一层 items。

起手式：输出示例

先说最简单、通常有效的方式：在 system 中以示例的方式要求 LLM 输出对应格式。

I want you to recommend some books about {topic}.
Do NOT include anything other than a json object in your output.

Your output should look like this:
{
  "items": [
    {
      "name": "1984",
      "author": "George Orwell",
      "reason": "Another classic dystopian novel that explores themes of surveillance, totalitarianism, and individuality in a future society.",
      "year_of_publish": 1949
    },
    {
      "name": "Dune",
      "author": "Frank Herbert",
      "reason": "A sprawling epic set on the desert planet of Arrakis, dealing with politics, religion, and the struggle for control of the planet's valuable spice.",
      "year_of_publish": 1965
    }
  ]
}

划重点，Your output should look like this: 让模型以指定的格式输出。

这种方式的好处是非常通用，对任意模型都可以用，而且消耗的 token 数相对比较少（你甚至可以把长文本直接替换成 xxx）。坏处是，当结构比较复杂（例如同时存在多种类型）或者逻辑比较复杂时，或者模型抽风，就容易生成出多余的东西，无法解析到有效的 JSON。

进阶：JSON mode

针对上面模型抽风输出了无效 JSON 的场景，OpenAI 和 Claude都有 JSON mode，其中 Claude 还支持 XML。在指定输出格式后，模型会「尽力保证」输出合法的 JSON object（是的，还是有可能抽风）。

需要注意，OpenAI 的模型需要在 prompt 中包含「JSON」字样才能启用 JSON mode，否则会生成失败。我们只需稍作修改：

I want you to recommend some books about {topic}.
Do NOT include anything other than a json object in your output.

Your output should be in JSON format. For example:
(...省略示例...)

使用 JSON mode 之后，稳定性会有所提升。

组合拳： few-shot

few-shot（又称少样本提示）是指给模型提供一点示例，从而引导模型实现更好的性能。其实我们的起手式就算是一种 few-shot，但是仅使用了 system 消息。通过增加 user 和 assistant 消息，可能会让效果更好。

--- system ---
I want you to recommend some books about the given topic.
Do NOT include anything other than a json object in your output.

--- user ---
{topic}

--- assistant ---
{
  "items": [
    {
      "name": "1984",
      "author": "George Orwell",
      "reason": "Another classic dystopian novel that explores themes of surveillance, totalitarianism, and individuality in a future society.",
      "year_of_publish": 1949
    },
    {
      "name": "Dune",
      "author": "Frank Herbert",
      "reason": "A sprawling epic set on the desert planet of Arrakis, dealing with politics, religion, and the struggle for control of the planet's valuable spice.",
      "year_of_publish": 1965
    }
  ]
}

这种方式一般会比起手式更稳定，但是也可能会消耗更多的 token。

终结技： JSON schema

如果我们需要给 JSON 引入更加复杂的结构，或者要使用枚举等等，用之前的方式不一定能获得稳定的结构化输出。而 JSON 是有 schema 的，通过指定 JSON schema，我们可以实现更加复杂的结构以及使用枚举等功能。

这里我们增加一个 genre 的枚举字段用来演示。

I want you to recommend some books about {topic}.

Your output should follow the JSON schema below:
{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "author": {
            "type": "string"
          },
          "reason": {
            "type": "string"
          },
          "year_of_publish": {
            "type": "number"
          },
          "genre": {
            "type": "string"
            "enum": ["SCI-FI", "NON-SCI-FI"]
          }
        },
        "required": [
          "name",
          "author",
          "reason",
          "year_of_publish",
          "genre"
        ]
      }
    }
  },
  "required": [
    "items"
  ]
}

注意，使用 JSON schema 最好同时打开 JSON mode。通过这种方式，我们不需要给出例子（如果例子不恰当，可能会带偏 LLM，出现抽风），也不需要在 prompt 中再指定某个字段的取值，另外也很方便强类型语言进行后续处理。这种方式消耗的 token 数会更多，但是稳定性更佳。

在实践中，也有人使用 TypeScript 的结构体等方式来实现类似的效果，大体的思路是一样的。

后手：修复 JSON

当生成的 JSON 真的不合法时，可以通过一些方式尝试恢复成合法的 JSON。目前有一些现成的工具，例如以下几个。基本的原理是通过BNF来解析 JSON，通过给数组或对象添加未闭合的括号、给字符串添加引号、调整空白或换行等启发式规则，尝试修复 JSON。

实战经验

可以先从最简单的方式入手，如果发现输出不稳定，再辅以其他手段
适当降低 Temperature 也有助于生成稳定的结构化输出
代码层面需要做好兼容，解析失败时可以采取重试等方法

References

如何控制 LLM 的输出格式和解析其输出结果？ | 宝玉的分享
 JSON Schema

阿猫的博客