TaskWeaver使用记录

1. 基本介绍
2. 总体结构与流程
3. 概念细节
- 3.1 Project
- 3.2 Session
- 3.3 Memory
- 3.4 Conversation
- 3.5 Round
- 3.6 Post
- 3.7 Attachment
- 3.8 Plugin
- 3.9 Executor
4. 代码特点
5. 使用过程
- 5.1 api调用
- 5.2 本地模型使用
- 5.3 添加插件
6. 存在的问题与使用体验
- 6.1 判别模型以提供给planner不同的prompt
- 6.2 缓存prompt
- 6.3 判断会话是否结束
- 6.3 executor的结果直接反馈给用户
- 6.4 陷入死循环的问题
- 6.4 额外的角色专门与用户进行交互
- 6.5 用户介入流程copilot
7. 总结

1. 基本介绍

本文记录一下taskweaver项目的使用过程，其中遇到的问题，以及带来的启发。
Taskweaver是最近比较火的一个AI Agent项目，由微软开发，目前在git上已经有4.6k Star。

项目地址：https://github.com/microsoft/TaskWeaver
论文地址：https://export.arxiv.org/pdf/2311.17541
说明文档：https://microsoft.github.io/TaskWeaver/docs/overview

Taskweaver的特点是，能够按照用户的指示，自动生成并执行代码，以完成一些更复杂的任务。在执行过程中，不仅保留了对话历史，还将代码执行的结果（包括报错信息）保留下来，以便解决代码执行过程中的问题。

2. 总体结构与流程

下面引用的是Taskweaver的结构图：

taskweaver

从图中可以看出，Taskweaver主要由两部分构成，Planer和Code Interpreter，其中Code Interpreter又由负责生成代码的Code Generator和负责执行代码的Code Executor构成。

从执行的流程上来看，可以大致划分为以下几个过程：

（1）用户发出指令，Planer接收到用户的指令。
（2）Planer根据用户的指令，结合prompt中的例子，做出初始计划（init plan）。
（3）Planer将init plan转化为更加精简的最终计划（final plan），计划中的每一步可以看作是一个子任务。
（4）逐步将计划中的每一步发送给Code Interpreter。
（5）Code Generator根据接收到的任务，从可用的插件里进行选择，并生成一段用于完成当前步骤的代码，发送给Code Executor。
（6）Code Executor执行代码，并将执行结果反馈给Planer。
（7）反复执行4-6，直到所有步骤都已经执行完成。Planer将结果返回给用户。
（8）用户可以继续发出新的指令。

在论文的附录A中，给出了具体的例子：
case study

3. 概念细节

在Taskweaver中，涉及到很多概念，在这里进行汇总介绍。

3.1 Project

项目可以看作是taskweaver中最高层级的概念，每个项目会指定一个特定的目录，其中包含了指导planer的prompt和example，项目下的每个会话（Session）的日志，项目中可以使用的各类插件，以及使用的LLM相关的配置。

基于每个项目可以多次创建会话。

3.2 Session

会话是实际代码执行中比较高层级，独立且完整的概念了，各种角色、组件都是挂在具体的会话下，每次启动taskweaver，就视作是唤起了一次会话。

3.3 Memory

Memory相当于是对话历史plus，除了像一般的LLM项目中，将对话历史保存下来，memory还保留了Code Executor的执行结果，并且在保留记忆的时候，利用compression组件对对话的内容进行了总结。

在代码结构中，Memory作为Session的一个属性。

3.4 Conversation

有点反常识的是，对话是Memory的一个属性，memory保存着所有的对话过程。而对话中记录的是taskweaver系统与用户交互的过程，每个对话由若干轮次（Round）构成。

3.5 Round

Round是conversation中的基础单元，每个round可以看作是post的集合。在round中记录着当前轮次的状态。

3.6 Post

Post是在不同角色之间传递的消息，这些角色包括用户，Planer，Code Generator，以及Code Executor，记录了从哪个角色发送到哪个角色，以及消息的内容和附件（Attachment）。

3.7 Attachment

Attachment是在Post中，除了一般的文本信息之外，需要特殊被标记出来的附件，包括code，markdown，execution_result等，不同的角色通过判断特定的attachment的类型和其中的内容，来采取进一步的行动。

3.8 Plugin

插件是在代码生成过程中，提供给Code Generator的，用于完成某些相对复杂的任务。在提供插件的时候，需要提供两个文件，其一是py脚本，一旦插件被选中，脚本会被提供给Code Executor用于执行；其二是yaml文件，用于添加到Code Generator的prompt部分中，其中提供了该插件的介绍，使用例子，以及输入输出的类型等信息。

在执行的时候，Code Generator有两种模式选择可用的Plugin。

默认全部可用，yaml文件中的enabled设置为true的全部plugin都会被加载。
使用相似度判断，使用LLM Service中的embedding服务对plugin的desc和当前的需要执行的步骤分别进行编码，然后根据相似度取topk。

3.9 Executor

Code Generator所生成的代码会交给Code Executor执行，这样就要求Executor具备python运行环境。Executor使用执行代码有两种方法，docker容器和jupyter kernel。在默认的配置中，使用的docker container，但是我实际操作过程中，会出现报错，可能是环境配置的缘故。所以在environment.py中，将mode写成EnvMode.Local，于是可以运行。

但是需要注意的是，Local模式的原理是利用ipykernel模块创建jupyter内核，这样的运行环境是与启用taskweaver的python环境是一致的，这样并不安全，也不灵活，这也是为什么taskweaver并不建议使用这种模式。

考虑到Code Generator所生成的代码中，包含try import; execept pip install 之类的语句，这样的作法可能会改变基础python环境，所以我在实验过程中，将各种prompt和example中与pip install语句相关的部分都删除。

如果需要在不同的环境中运行代码，可以在plugin中添加自定义组件，并且在组件中访问其他环境所创建的api。

4. 代码特点

作为一个相当完备的项目，taskweaver的代码注释比较清晰，但是由于加入了各种工程化的部分，导致代码读起来并不轻松，如果做实验的话会比较麻烦，尤其是引入了类似JAVA各种框架中的inject机制，在逐步debug的时候非常不方便。

从代码结构来看，项目代码中，taskweaver目录下是源码部分。
project目录中是当前的项目信息，项目可用的插件文件，Planer的example，执行的日志等。
命令行启动一个TaskWeaverApp时，需要指定具体的项目目录。
playground中提供了web界面。

另外，这个工程在运行的过程中，受到@tracing_decorator的影响，导致报错（装饰器内报错）在打印之前，并不能够像一般的程序一样，将所有print的信息打印出来，使得debug更费劲了。一个小技巧是将print改成raise，将需要打印的内容当作错误抛出，就可以看到打印的内容了。

5. 使用过程

这一部分记录一下使用taskweaver过程中遇到的一些问题。

5.1 api调用

首先是关于大模型的使用，taskweaver提供了若干种模型的调用方法，包括openai，qwen等接口的使用。通过配置API KEY可以轻松的使用这些大模型。

然而直接使用api也存在一些问题。由于访问频率以及输入prompt长度的原因（taskweaver的planer和code generator本身的prompt就很长，拼接上各种example、experience、plugin之后就会变得非常长），很容易导致大模型接口调用失败，而如果使用长文本的接口服务，则费用就会很高。

5.2 本地模型使用

如果要使用本地模型，则需要在taskweaver.llm的目录下增加一个自己的py文件，可以仿照qwen.py实现，需要保证chat_completion返回的一个Generator，其中的内容形式如下即可：

{"status_code": 200, "request_id": "226ff7aa-b675-98d0-adab-6e5b98c79250", "code": "", "message": "", "output": {"text": null, "finish_reason": null, "choices": [{"finish_reason": "null", "message": {"role": "assistant", "content": "你好"}}]}, "usage": {"input_tokens": 20, "output_tokens": 1, "total_tokens": 21}}
{"status_code": 200, "request_id": "226ff7aa-b675-98d0-adab-6e5b98c79250", "code": "", "message": "", "output": {"text": null, "finish_reason": null, "choices": [{"finish_reason": "null", "message": {"role": "assistant", "content": "！"}}]}, "usage": {"input_tokens": 20, "output_tokens": 2, "total_tokens": 22}}
{"status_code": 200, "request_id": "226ff7aa-b675-98d0-adab-6e5b98c79250", "code": "", "message": "", "output": {"text": null, "finish_reason": null, "choices": [{"finish_reason": "null", "message": {"role": "assistant", "content": "有什么"}}]}, "usage": {"input_tokens": 20, "output_tokens": 3, "total_tokens": 23}}
{"status_code": 200, "request_id": "226ff7aa-b675-98d0-adab-6e5b98c79250", "code": "", "message": "", "output": {"text": null, "finish_reason": null, "choices": [{"finish_reason": "null", "message": {"role": "assistant", "content": "我能帮助你的吗？"}}]}, "usage": {"input_tokens": 20, "output_tokens": 8, "total_tokens": 28}}
{"status_code": 200, "request_id": "226ff7aa-b675-98d0-adab-6e5b98c79250", "code": "", "message": "", "output": {"text": null, "finish_reason": null, "choices": [{"finish_reason": "null", "message": {"role": "assistant", "content": ""}}]}, "usage": {"input_tokens": 20, "output_tokens": 8, "total_tokens": 28}}
{"status_code": 200, "request_id": "226ff7aa-b675-98d0-adab-6e5b98c79250", "code": "", "message": "", "output": {"text": null, "finish_reason": null, "choices": [{"finish_reason": "stop", "message": {"role": "assistant", "content": ""}}]}, "usage": {"input_tokens": 20, "output_tokens": 8, "total_tokens": 28}}

然后编辑taskweaver.llm.init.py，在构造方法中加入自己本地的api_type。

5.3 添加插件

在前文的介绍中提到了，插件的作用是为了实现某些特定场景的功能，完成code generator无法实现的一些复杂任务。默认提供的插件包括：pdf文档阅读，图像转文本，语音识别文本，讲笑话等。

添加自定义插件的方法也不复杂，在project.plugins目中，照葫芦画瓢地加入一个py文件和一个yaml文件即可。

在py文件中，可以通过调用其他外部接口的方法，来使用其他环境。
在yaml中，需要将enabled设置为true，否则这个插件将不会被加载。

6. 存在的问题与使用体验

最后记录一下我在试用taskweaver过程中的感受，以及个人认为可以优化的方向。

6.1 判别模型以提供给planner不同的prompt

目前版本的taskweaver，所有的对话场景，使用的planner都是同样的，我们可以通过planner_prompt.yaml来查看它的prompt：

  ### Examples of planning process[Example 1]User: count rows for ./data.csvinit_plan:1. Read ./data.csv file 2. Count the rows of the loaded data <sequential depend on 1>plan:1. Read ./data.csv file and count the rows of the loaded data[Example 2]User: Read a manual file and follow the instructions in it.init_plan:1. Read the file content and show its content to the user2. Follow the instructions based on the file content.  <interactively depends on 1>plan:1. Read the file content and show its content to the user2. follow the instructions based on the file content.[Example 3]User: detect anomaly on ./data.csvinit_plan:1. Read the ./data.csv and show me the top 5 rows to understand the data schema 2. Confirm the columns to be detected anomalies  <sequentially depends on 1>3. Detect anomalies on the loaded data <interactively depends on 2>4. Report the detected anomalies to the user <interactively depends on 3>plan:1. Read the ./data.csv and show me the top 5 rows to understand the data schema and confirm the columns to be detected anomalies2. Detect anomalies on the loaded data3. Report the detected anomalies to the user[Example 4]User: read a.csv and b.csv and join them togetherinit_plan:1. Load a.csv as dataframe and show me the top 5 rows to understand the data schema2. Load b.csv as dataframe and show me the top 5 rows to understand the data schema 3. Ask which column to join <sequentially depends on 1, 2>4. Join the two dataframes <interactively depends on 3>5. report the result to the user <interactively depends on 4>plan:1. Load a.csv and b.csv as dataframes, show me the top 5 rows to understand the data schema, and ask which column to join2. Join the two dataframes3. report the result to the user

可以看到这其中的例子，都是读取数据相关的，所以planner能够针对taskweaver中给出的例子很好地制定计划就一点也不例外了。但是可以预料地是，当用户的指令与数据处理相关的任务没有什么关联的时候，这些prompt就没有什么价值了。

所以很自然地可以想到一个优化的点就是在对话开始之前，将用户query经过一个判别模型进行判断，分类到预设的若干任务种类之中，再根据任务的种类，加载不同的prompt，来创建相应的planner。

6.2 缓存prompt

在使用taskweaver的过程中，一个很直观的感受是，生成的效率很慢，并且随着生成的进行，效率越来越慢。

考虑调用LLM时，拼接出来的prompt除了初始的prompt本身就很长，还要拼接上各种example、experience、plugin、chat_history，这导致在启动对话的时候，需要很大的计算量，来计算prompt部分。

而prompt部分，很大程度上是很固定的（初始的prompt本身），而调用的LLM也是固定的，可以考虑采用kv cache的策略，提前将这部分算好保存下来，从而避免每次都计算。

至于plugin有没有必要提前缓存下来，然后根据被选中的情况，与初始prompt的cache进行拼接，可能需要具体实验一下才能下结论，而且如果这样改造的话，工程量更大一些。

6.3 判断会话是否结束

Taskweaver中，为了防止会话历史过长导致的计算资源开销过大，采用了Memory管理机制，利用叫RoundCompressor的组件来对对话历史进行总结压缩。但是好像并没有在用户给出新的query时，判断新的query与之前的对话历史是否相关。这里也可以添加一个判断机制，如果新的query与之前的内容完全无关时，就清除之前的所有对话历史。

6.3 executor的结果直接反馈给用户

从Taskweaver的示意图可以看出，每当Code Executor执行完代码，并不会将结果直接反馈给用户，而是将执行结果给到Planner，由Planner重新组织之后，再给到用户，在这个过程中可能存在信息的偏差，并且由Planner重新组织一遍语言，会更加耗时。

例如，我提供了一个plugin，其内容是实体识别。我输入的指令是：“识别这段话中存在的实体：xxxxxxx”。
Code Executor给出的结果是：

{'entity_id': 'Q23792', 'entity_mentions': [{'end_pos': 57, 'entity_mention_id': '2', 'entity_type': 'LOC', 'mention_type': 'NAM', 'nid': 'Q23792', 'start_pos': 50, 'words': '巴勒斯坦加沙地带'}], 'entity_type': 'LOC'}
{'entity_id': 'Q974850', 'entity_mentions': [{'end_pos': 117, 'entity_mention_id': '9', 'entity_type': 'WEA', 'mention_type': 'NOM', 'nid': 'Q974850', 'start_pos': 116, 'words': '导弹'}], 'entity_type': 'WEA'}
{'entity_id': 'Q484000', 'entity_mentions': [{'end_pos': 121, 'entity_mention_id': '10', 'entity_type': 'WEA', 'mention_type': 'NOM', 'nid': 'Q484000', 'start_pos': 119, 'words': '无人机'}], 'entity_type': 'WEA'}
{'entity_id': 'Q6840712', 'entity_mentions': [{'end_pos': 132, 'entity_mention_id': '12', 'entity_type': 'LOC', 'mention_type': 'NAM', 'nid': 'Q6840712', 'start_pos': 129, 'words': '中东地区'}], 'entity_type': 'LOC'}
{'entity_id': 'Q41053', 'entity_mentions': [{'end_pos': 141, 'entity_mention_id': '13', 'entity_type': 'ORG', 'mention_type': 'NAM', 'nid': 'Q41053', 'start_pos': 136, 'words': '黎巴嫩真主党'}], 'entity_type': 'ORG'}
{'entity_id': 'Q516655', 'entity_mentions': [{'end_pos': 148, 'entity_mention_id': '14', 'entity_type': 'ORG', 'mention_type': 'NAM', 'nid': 'Q516655', 'start_pos': 143, 'words': '也门胡塞武装'}], 'entity_type': 'ORG'}......

而经过Planner的一顿操作之后，最终输出给用户的结果变成了：

The entities recognized in the text are as follows:1. **Iran** (GPE): mentioned 3 times - at positions 63-64, 70-71, and 126-127
2. **Israel** (GPE): mentioned 3 times - at positions 12-14, 81-83, and 108-110
3. **Palestinian Gaza Strip** (LOC): mentioned once at position 50-57
4. **Missile** (WEA): mentioned once at position 116-117
5. **Unmanned Aerial Vehicle (UAV)** (WEA): mentioned once at position 119-121
6. **Middle East** (LOC): mentioned once at position 129-132
7. **Lebanese Hezbollah** (ORG): mentioned once at position 136-141
8. **Houthi Militia in Yemen** (ORG): mentioned once at position 143-148
9. **Islamic Revolutionary Guard Corps of Iran** (ORG): mentioned once at position 91-99These entities represent geographical locations (GPE), weapons (WEA), regions (LOC), and organizations (ORG) mentioned in the text.

可以看出模型做了一些不必要的总结，并且因为Planner的prompt整体是英文的，它还把结果给翻译了。

并且只有当Planner指定的计划中，执行到当前计划是“ Report to the user.”时，才会将内容输出给用户，否则会反复调用Code Generator和Executor。这就很容易造成循环，即便没有陷入循环，也会消耗大量时间。

因此在某些特定的情况下，最好保留一个通路，直接将Executor的结果输出给用户。

6.4 陷入死循环的问题

Taskweaver关注的更多的是代码能够安全的执行，每次代码执行失败时，就会将报错的结果与代码一并返回给Code Generator进行修改，然后再将修改过后的代码，再交给Code Executor尝试执行。而当LLM能力不足，或用户本身的需求不合理时，很容在这个过程中陷入死循环，因此需要设置一个出口，在适当的时候跳出。