51c Large Models ~ Collection 122

My own original post: https://blog.51cto.com/whaosoft/13877107

#PHYBench

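200 Collaborators at PKU's School of Physics, Over 50 of Them Gold Medalists! PHYBench: Can Large Models Really Understand Physics?

This project was supervised and coordinated by Zhu Huaxing (朱华星), a faculty member, and Vice Dean Cao Qinghong (曹庆宏) of the School of Physics at Peking University. Benchmark design, project management, and data integration were carried out mainly by a core team of students, including Qiu Shi (仇是), Guo Shaoyang (郭绍阳), Song Zhuoyang (宋卓洋), Sun Yunbo (孙韫博), Cai Zeyu (蔡则宇), Wei Jiashen (卫家燊), and Luo Tianyu (罗天宇). The project also received strong support from Academician Luo Minxing (罗民兴) of the Peking University Computer Center and Zhang Muhan (张牧涵) of the Institute for Artificial Intelligence.

PHYBench brought together more than 200 students from the School of Physics and sister departments, who shared the work of writing problems, reviewing them, and taking the human baseline test. The group includes at least 50 gold medalists from the national physics competition for secondary-school students, as well as gold medalists from the Asian Physics Olympiad and the International Physics Olympiad. This large-scale, high-quality collaboration reflects the students' academic depth and organizational skill, and it underpins the quality of PHYBench.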

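As large language models (LLMs) advance rapidly, reasoning ability has all but become shorthand for model capability. Frontier models such as OpenAI's o series and DeepSeek R1 have been released one after another. Powered by reinforcement learning, they keep setting new records on scientific benchmarks, and some are even claimed to "surpass human experts".

However, as the arms race between model capability and benchmarks heats up, more and more benchmarks have had to turn to obscure knowledge or abstract competition mathematics. Such problems can still "separate" models, but they drift away from practical scenarios and may not reflect how models actually perform.

Recently, the School of Physics at Peking University, together with the Institute for Artificial Intelligence and several other departments, released a new benchmark, PHYBench. It contains 500 carefully designed, high-quality physics problems (see Figure 1), ranging in difficulty from high-school physics through undergraduate physics to Physics Olympiad problems. The problems are grounded in realistic physical scenarios and are not abstract for humans, yet they leave a whole range of large models in disarray. The models' chains of thought on these problems also expose weaknesses in perception and reasoning.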

Paper: https://arxiv.org/abs/2504.16074
Project page: https://phybench-official.github.io/phybench-demo/
Dataset: https://huggingface.co/datasets/Eureka-Lab/PHYBench

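Perhaps physics is the subject best suited to testing AI reasoning? PHYBench offers a new tool and perspective for evaluating whether a large model's reasoning is genuinely effective.

Figure 1: Example problems and the two evaluation metrics: expression-tree edit distance and accuracy.

Table 1: Comparison with existing benchmarks. Among high-difficulty datasets, PHYBench is relatively large, and it introduces a new scoring metric: expression-tree edit distance.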

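Evaluation method

Expression-tree edit distance (EED Score)

Traditional benchmarks usually rely on accuracy as the single metric: there is one correct answer, and a model scores only on an exact match. To make grading easier, open-ended questions are often rewritten as multiple-choice questions or as questions that ask for a numerical value. This severely compresses the information in the answer. The extra conditions can also let a model "guess the process from the options", and such formats do not test whether a model can express a general relationship as an analytic expression. On high-difficulty samples, moreover, 0/1 scoring drives every model's score to zero, so differences in strength become invisible.

EED Score (Expression-tree Edit Distance) is closer to how a human grader works. It parses each mathematical expression into an expression tree and computes the edit distance between the model's answer and the reference answer: the closer the tree structures, the higher the score. The result is a continuous, fine-grained score that discriminates between models on more problems, which markedly improves statistical power.

According to the authors' experiments, 500 problems scored with EED Score discriminate as well as 1,500 problems scored with 0/1 accuracy. Figure 1 above compares three different answers to the same problem under accuracy and EED Score: the former gives only a coarse "all wrong / all right" verdict, whereas the latter quantifies the "distance" between the model's solution and the correct answer.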
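
To make the idea concrete, here is a minimal sketch of scoring by expression-tree edit distance. It is not the PHYBench implementation: it parses expressions with Python's standard `ast` module (no algebraic simplification), uses a plain recursive ordered-tree edit distance with unit costs, and maps distance to a score with a simple linear rule. The names `to_tree`, `forest_distance`, and `eed_like_score`, and the scoring formula itself, are assumptions made for illustration.

```python
import ast
from functools import lru_cache


def to_tree(expr: str):
    """Parse an arithmetic expression into a nested (label, children) tuple."""
    def build(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, (build(node.left), build(node.right)))
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, (build(node.operand),))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id, tuple(build(arg) for arg in node.args))
        if isinstance(node, ast.Name):
            return (node.id, ())
        if isinstance(node, ast.Constant):
            return (repr(node.value), ())
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return build(ast.parse(expr, mode="eval").body)


def size(forest) -> int:
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2) -> int:
    """Ordered edit distance between two forests with unit-cost
    insert, delete, and relabel operations."""
    if not f1:
        return size(f2)
    if not f2:
        return size(f1)
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1: its children take its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabelling if they differ.
    match = (forest_distance(kids1, kids2)
             + forest_distance(f1[:-1], f2[:-1])
             + (label1 != label2))
    return min(delete, insert, match)


def eed_like_score(answer: str, reference: str) -> float:
    """Hypothetical linear score in [0, 1]: 1 minus the edit distance
    relative to the size of the reference tree, floored at 0."""
    ans, ref = to_tree(answer), to_tree(reference)
    relative = forest_distance((ans,), (ref,)) / size((ref,))
    return max(0.0, 1.0 - relative)


reference = "2*m*g*sin(theta)"
print(eed_like_score("2*m*g*sin(theta)", reference))  # 1.0
print(eed_like_score("m*g*sin(theta)", reference))    # 0.75
```

An answer that drops the factor of 2 differs from the reference by two node deletions out of eight nodes, so it keeps partial credit instead of being scored 0 as it would be under exact-match accuracy.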

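Experimental results

The gap between frontier models and human experts

The PHYBench team recruited 81 Peking University students, each of whom worked on 8 problems within a 3-hour limit, for a "human vs. machine" contest against state-of-the-art AI models.

The results show that even the strongest model, Gemini 2.5 Pro, answered only 36.9% of the problems correctly, with an EED score of 49.5%. The "human experts" won comfortably, with an average accuracy of 61.9% and an EED score of 70.5%. The top 25% of participants reached 71.4% accuracy, nearly twice that of the strongest AI. The gap between the other models and humans is larger still, which exposes the bottleneck current LLMs face in physical reasoning.

PHYBench also compares model capabilities at a fine-grained level. Strong reasoning models such as Gemini 2.5 Pro and o3 still trail humans by a wide margin, but they improve clearly on the previous generation of reasoning models. Base models such as DeepSeek-V3 do not surpass mainstream reasoning models but still post impressive results. Small reasoning models such as QwQ-32B and the DeepSeek 32B distilled model perform disappointingly on PHYBench, which may be due to weak physical perception.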

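Chain-of-thought error analysis: PP × RR

The PHYBench team systematically analyzed the models' errors and divided the reasoning process and reasoning ability into two key modules, Physical Perception (PP) and Robust Reasoning (RR):

Physical Perception (PP): In this phase the model reasons intensively in words. It must identify the physical objects, variables, and dynamical relationships relevant to the problem and judge qualitatively which physical effects matter and which are negligible. If PP goes wrong, all subsequent reasoning goes off track. (Example 1 shows a typical PP error.)
Robust Reasoning (RR): In this phase the model writes out a large amount of "scratch work", simplifying expressions and solving equations step by step. Current reasoning models are still inefficient here: their scratch work is far longer than a human's, and they often make "careless mistakes". (Example 2 shows a typical RR error.)

PP and RR alternate to form a typical chain of thought for solving a physics problem.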

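Outlook

Advancing AI's physical understanding and reasoning

PHYBench aims to do more than "evaluate": it aims to "lead" AI in exploring the possibilities of the physical world.

The release of PHYBench provides a new and authoritative benchmark for evaluating the physical perception and reasoning of large language models, and it points to where future AI systems need to make headway. Our realistic, complex physical scenarios are designed to elicit and verify an AI's ability to understand the world and reason reliably about it, pushing AI systems toward genuinely understanding, integrating with, and transforming the world.

Looking ahead, the PHYBench team will keep expanding and renewing the dataset, and plans to include more frontier physics topics, interdisciplinary content, and even scientific puzzles that humans have not yet solved. We believe that deeper and broader physics challenges will help AI develop into an "intelligent partner" or "super assistant" that pushes past cognitive boundaries and explores unknown territory.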

#DIFF Transformer

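Differential Attention Leads the Way: DIFF Transformer Tackles Long-Sequence Modeling

In recent years the Transformer architecture has achieved great success in natural language processing. From machine translation to text generation, its modeling power has brought unprecedented breakthroughs in language understanding and generation.

However, as models grow and application scenarios become more complex, the conventional Transformer architecture has begun to show its limits. In tasks such as long-text processing, key information retrieval, and hallucination mitigation in particular, the Transformer often over-attends to irrelevant context, which limits model performance.

To address this, a research team from Microsoft and Tsinghua University proposed DIFF Transformer, a new foundation model architecture based on a differential attention mechanism.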

Paper title: Differential Transformer
Paper: https://openreview.net/pdf?id=OvoCm1gGhN
Code: https://aka.ms/Diff-Transformer

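Its core idea is to take the difference between two softmax attention maps, which amplifies attention to the key context while cancelling attention noise. DIFF Transformer has the following notable advantages:

In language modeling, DIFF Transformer scales well in both model size and number of training tokens: it needs only about 65% of the model size or training tokens to match a conventional Transformer, which substantially improves general language-model performance.

Across a range of tasks, including long-text modeling, key information retrieval, mathematical reasoning, hallucination mitigation, in-context learning, and activation quantization, DIFF Transformer shows distinct advantages and clear gains over the conventional Transformer.

These properties give DIFF Transformer broad application prospects in natural language processing, and it may become a new driver of language-model progress. Follow-up work has also offered preliminary evidence that the method is effective in vision and multimodal settings, suggesting cross-modal generality. The paper was accepted to ICLR 2025 and selected as an Oral (selection rate 1.8%).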

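Method

The paper proposes a foundation model architecture called Differential Transformer (DIFF Transformer), which targets the conventional Transformer's tendency to over-allocate attention to irrelevant context in long-text modeling. Through a differential attention mechanism (Differential Attention), it amplifies attention to the key context while cancelling attention noise, improving performance across a variety of tasks.

Differential attention

The conventional Transformer's attention mechanism uses the softmax function to weight the tokens of the input sequence, but because softmax assigns every token a strictly positive weight, the model struggles to eliminate the influence of irrelevant context completely. DIFF Transformer introduces differential attention to overcome this.

Specifically, the mechanism splits the query and key vectors into two groups along the attention-head dimension, computes a softmax attention map for each group, and uses the difference of the two maps as the final attention scores. The design is analogous to a differential amplifier in electrical engineering, or to noise-cancelling headphones: subtracting one signal from the other removes the noise common to both.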

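Differential attention is written as:

$$[Q_1; Q_2] = XW^Q, \quad [K_1; K_2] = XW^K, \quad V = XW^V$$

$$\mathrm{DiffAttn}(X) = \left(\mathrm{softmax}\!\left(\frac{Q_1K_1^\top}{\sqrt{d}}\right) - \lambda\,\mathrm{softmax}\!\left(\frac{Q_2K_2^\top}{\sqrt{d}}\right)\right)V$$

where $Q_1, Q_2$ and $K_1, K_2$ are the two groups of query and key vectors, $V$ is the value vector, and $\lambda$ is a learnable scalar that controls the relative weight of the two attention maps. Figure 1 illustrates the computation.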

Figure 1. Illustration and pseudocode of the differential attention mechanism
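
The mechanism described above can be sketched in a few lines of NumPy. This is a single-head illustration under simplifying assumptions (no causal mask, no batching, a fixed scalar `lam`), not the authors' released code; the function name `diff_attention` and all shapes are chosen here for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def diff_attention(X, W_q, W_k, W_v, lam):
    """Single-head differential attention.

    X: (n, d_model); W_q, W_k: (d_model, 2*d); W_v: (d_model, d_v);
    lam: scalar weight on the second attention map.
    """
    Q1, Q2 = np.split(X @ W_q, 2, axis=-1)  # two query groups, each (n, d)
    K1, K2 = np.split(X @ W_k, 2, axis=-1)  # two key groups, each (n, d)
    V = X @ W_v
    scale = 1.0 / np.sqrt(Q1.shape[-1])
    A1 = softmax(Q1 @ K1.T * scale)
    A2 = softmax(Q2 @ K2.T * scale)
    # Subtracting the second map cancels attention mass common to both maps.
    return (A1 - lam * A2) @ V


rng = np.random.default_rng(0)
n, d_model, d = 5, 8, 4
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, 2 * d))
W_k = rng.normal(size=(d_model, 2 * d))
W_v = rng.normal(size=(d_model, 2 * d))

out = diff_attention(X, W_q, W_k, W_v, lam=0.8)
print(out.shape)  # (5, 8)
```

Two limiting cases help build intuition: with `lam=0` the function reduces to ordinary softmax attention on the first query/key group, and when both groups are identical and `lam=1` the two maps cancel exactly, which is the "common-mode" rejection that the differential-amplifier analogy refers to.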

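To synchronize the learning dynamics, $\lambda$ is re-parameterized as:

$$\lambda = \exp(\lambda_{q_1}\cdot\lambda_{k_1}) - \exp(\lambda_{q_2}\cdot\lambda_{k_2}) + \lambda_{\mathrm{init}}$$

where $\lambda_{q_1}, \lambda_{k_1}, \lambda_{q_2}, \lambda_{k_2}$ are learnable vectors and $\lambda_{\mathrm{init}}$ is a constant used for initialization.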

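Multi-head differential attention

To further increase expressive power, DIFF Transformer uses a multi-head mechanism. Each attention head computes differential attention independently, and the head outputs are concatenated to form the final result:

$$\mathrm{head}_i = \mathrm{DiffAttn}(X; W_i^Q, W_i^K, W_i^V, \lambda)$$

$$\overline{\mathrm{head}_i} = (1 - \lambda_{\mathrm{init}})\cdot\mathrm{LN}(\mathrm{head}_i)$$

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\overline{\mathrm{head}_1}, \ldots, \overline{\mathrm{head}_h})\,W^O$$

where $h$ is the number of attention heads, $W^O$ is the output projection matrix, and $\mathrm{LN}$ is the per-head normalization. To keep gradients consistent with those of the Transformer, DIFF Transformer applies an independent normalization layer, implemented with RMSNorm, to the output of each head.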
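
The multi-head variant, including the re-parameterized scalar and the per-head normalization, might look like the following NumPy sketch. Several details are assumptions rather than facts taken from the text above: the RMSNorm here has no learnable gain, the `(1 - lam_init)` scaling of each normalized head follows the formulation in the paper as best understood, all heads share one `lam`, and `lam_init = 0.8` is only an illustrative value. The helper names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def rms_norm(x, eps=1e-6):
    """RMSNorm over the last axis, without a learnable gain (assumption)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def reparam_lambda(lq1, lk1, lq2, lk2, lam_init):
    """lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lam_init."""
    return np.exp(lq1 @ lk1) - np.exp(lq2 @ lk2) + lam_init


def multi_head_diff_attention(X, heads, W_o, lam, lam_init):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q1, Q2 = np.split(X @ W_q, 2, axis=-1)
        K1, K2 = np.split(X @ W_k, 2, axis=-1)
        scale = 1.0 / np.sqrt(Q1.shape[-1])
        attn = softmax(Q1 @ K1.T * scale) - lam * softmax(Q2 @ K2.T * scale)
        head = attn @ (X @ W_v)
        # Normalize each head independently, then rescale.
        outputs.append((1.0 - lam_init) * rms_norm(head))
    return np.concatenate(outputs, axis=-1) @ W_o


rng = np.random.default_rng(0)
n, d_model, h, d, d_v = 5, 8, 2, 2, 4
X = rng.normal(size=(n, d_model))
heads = [
    (rng.normal(size=(d_model, 2 * d)),
     rng.normal(size=(d_model, 2 * d)),
     rng.normal(size=(d_model, d_v)))
    for _ in range(h)
]
W_o = rng.normal(size=(h * d_v, d_model))

lam_init = 0.8           # illustrative constant
zeros = np.zeros(d)      # zero-initialized lambda vectors give lam == lam_init
lam = reparam_lambda(zeros, zeros, zeros, zeros, lam_init)
Y = multi_head_diff_attention(X, heads, W_o, lam, lam_init)
print(lam, Y.shape)
```

With the four vectors initialized to zero, the two exponentials cancel and the scalar starts exactly at `lam_init`, which is why that constant is described as being used for initialization.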

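Figure 2. Visualization of attention score distributions for Transformer and DIFF Transformer

Figure 2 shows the marked difference in how DIFF Transformer and the conventional Transformer allocate attention scores. The authors insert a piece of key information into the middle of a long stretch of irrelevant text and visualize how each model allocates attention when extracting it.

The conventional Transformer spreads its attention scores across the whole context and assigns only a small share to the key information, whereas DIFF Transformer concentrates higher scores on the target answer and assigns almost no attention to irrelevant context.

This sparse, precise allocation of attention also makes DIFF Transformer significantly better than Transformer at retrieving key information from long texts.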

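Experiments

The authors validate DIFF Transformer through a series of experiments covering several aspects of performance, showing its potential and advantages for large language models.

Language modeling

The authors study how DIFF Transformer performs as model size and training data scale up, as shown in Figure 3. DIFF Transformer needs only about 65% of the parameters or training data to match Transformer's language modeling performance. For example, a 6.8B-parameter DIFF Transformer reaches a language modeling loss comparable to that of an 11B-parameter Transformer.

Figure 3. Scalability in model parameters and training data for language modeling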

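Long-text modeling

The authors extend the models to a 64K context length and evaluate them on long book data. Measured by cumulative average negative log-likelihood (NLL), DIFF Transformer outperforms Transformer at every sequence position and makes more effective use of long context.

Figure 4. Evaluation on long book data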

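Key information retrieval

The authors use a multi-needle retrieval experiment to evaluate how well the models extract key information from a large context, as shown in Figure 5. DIFF Transformer achieves higher accuracy across context lengths and answer depths, and its advantage is most pronounced when the text is long and the answer sits nearer the beginning. For example, in a 64K context with the answer at 25% depth, DIFF Transformer's accuracy is 76% higher than Transformer's. Attention statistics also show that DIFF Transformer focuses its attention scores more sharply, locates the key information accurately, and attains a higher signal-to-noise ratio.

Figure 5. Multi-needle retrieval evaluation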
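
For readers unfamiliar with this kind of test, the sketch below shows one way a multi-needle prompt could be assembled: several "needle" facts are inserted into filler text, with the needle that the question asks about placed at a chosen relative depth. The sentence template, the function name `build_multi_needle_context`, and the example data are all hypothetical; the paper's exact prompt construction may differ.

```python
import random


def build_multi_needle_context(filler_sentences, needles, answer_key,
                               answer_depth, seed=0):
    """Return (context, question, expected_answer).

    needles: dict mapping a key (e.g. a city) to its value.
    answer_depth: relative position in [0.0, 1.0] of the queried needle.
    """
    rng = random.Random(seed)
    sentences = list(filler_sentences)

    # Distractor needles go to random positions.
    for key, value in needles.items():
        if key != answer_key:
            pos = rng.randint(0, len(sentences))
            sentences.insert(pos, f"The magic number for {key} is {value}.")

    # The queried needle goes to the requested relative depth.
    pos = round(answer_depth * len(sentences))
    sentences.insert(
        pos, f"The magic number for {answer_key} is {needles[answer_key]}.")

    question = f"What is the magic number for {answer_key}?"
    return " ".join(sentences), question, str(needles[answer_key])


filler = [f"Filler sentence number {i}." for i in range(100)]
needles = {"Paris": 4821, "Tokyo": 1175, "Cairo": 9302, "Lima": 6640}
context, question, expected = build_multi_needle_context(
    filler, needles, answer_key="Paris", answer_depth=0.25)

# A model's reply would count as correct if it contains `expected`.
print(question, expected)
```

Sweeping `answer_depth` and the amount of filler gives the depth-by-length grid on which retrieval accuracy is typically reported.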

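In-context learning

The authors evaluate DIFF Transformer's in-context learning from two angles: many-shot in-context learning and robustness to the order of examples. As shown in Figure 6, the many-shot experiments use four datasets (TREC, TREC-fine, Banking-77, and Clinic-150) and gradually increase the number of demonstration examples until the total length reaches 64K tokens. DIFF Transformer outperforms Transformer on every dataset, with a substantial gain in average accuracy.

Figure 6. Many-shot in-context learning

In the robustness test, the authors assess performance stability by shuffling the order of the examples. As shown in Figure 7, DIFF Transformer's performance variance across orderings is significantly lower than Transformer's, indicating that it is less sensitive to input order and more robust.

Figure 7. Robustness to example order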
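
The order-robustness protocol can be expressed as a short harness: shuffle the demonstrations with several seeds, evaluate each ordering, and report the mean and variance of accuracy. In this sketch `evaluate` is a hypothetical callable standing in for "run the model on the test set with these demonstrations and return accuracy"; the two toy evaluators below are stand-ins, not real models.

```python
import random
import statistics


def order_robustness(demos, evaluate, n_orders=10, seed=0):
    """Evaluate `n_orders` random orderings of `demos`.

    Returns (mean accuracy, population variance of accuracy).
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_orders):
        order = list(demos)
        rng.shuffle(order)
        accuracies.append(evaluate(order))
    return statistics.mean(accuracies), statistics.pvariance(accuracies)


demos = [f"example-{i}" for i in range(8)]


def order_insensitive(order):
    """Toy evaluator whose accuracy ignores the ordering."""
    return 0.5


def order_sensitive(order):
    """Toy evaluator that does better when example-0 appears late."""
    return 0.4 + 0.05 * order.index("example-0")


print(order_robustness(demos, order_insensitive))
print(order_robustness(demos, order_sensitive))
```

A lower variance across orderings is what the text above reports for DIFF Transformer relative to Transformer.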

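Hallucination evaluation

The authors use text summarization and question answering as two typical settings for evaluating how well DIFF Transformer reduces hallucination. As shown in Figure 8, DIFF Transformer is markedly more accurate when generating summaries and answering questions, and it hallucinates less. The authors attribute this to differential attention locating the important passages accurately, so that irrelevant context interferes less with the model's predictions.

Figure 8. Hallucination evaluation on text summarization and question answering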

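Activation outlier analysis

The authors also find that DIFF Transformer significantly reduces outliers in model activations, which opens up new possibilities for activation quantization. The maximum activation values in both attention logits and hidden states are significantly lower than in Transformer. For example, the top-1 activation value of the attention logits is nearly 8 times lower in DIFF Transformer than in Transformer. Thanks to this property, DIFF Transformer also outperforms Transformer under low-bit quantization of attention logits, as shown in Figure 9.

Figure 9. Low-bit quantization of attention logits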
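
Why outliers matter for low-bit quantization can be seen with a toy experiment. The sketch below applies symmetric absmax quantization (an assumption about the scheme, chosen for simplicity) to a vector of synthetic "logits", once as-is and once with a single outlier 8 times larger than the largest value, and compares the reconstruction error on the remaining entries. The data is random, not taken from any model.

```python
import numpy as np


def absmax_quantize(x, bits):
    """Symmetric absmax quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def bulk_mse(x, bits):
    """Mean squared quantization error on all entries except index 0."""
    return float(np.mean((absmax_quantize(x, bits)[1:] - x[1:]) ** 2))


rng = np.random.default_rng(0)
logits = rng.normal(size=1000)

with_outlier = logits.copy()
with_outlier[0] = 8 * np.abs(logits).max()  # a single large outlier

err_clean = bulk_mse(logits, bits=4)
err_outlier = bulk_mse(with_outlier, bits=4)
print(err_clean, err_outlier)
```

Because the scale is set by the largest magnitude, one outlier stretches the quantization grid and most ordinary values collapse onto a handful of levels. Smaller maximum activations therefore leave more of the low-bit range for the values that carry information.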

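Mathematical reasoning

The authors further validate DIFF Transformer on mathematical reasoning. They use two-stage training: starting from a 3B pretrained model, they apply supervised fine-tuning and evaluate on 8 math datasets, including MATH. In the first stage the model is fine-tuned on 20B tokens of synthetic math data to acquire basic mathematical ability; Figure 10 shows the results. From 15B tokens onward, DIFF Transformer is significantly stronger than Transformer at math, and by the end of training at 20B tokens the accuracy gap reaches about 11%.

Figure 10. Stage 1: fine-tuning on synthetic math data

In the second stage the authors distill the model on OpenThoughts-114K-Math, a dataset built from DeepSeek-R1 outputs, to give it stronger deep-reasoning ability. As shown in Figure 11, DIFF Transformer improves on Transformer to varying degrees on all 8 datasets, with average accuracy up by 7.5%, which suggests that the stronger context modeling of differential attention also matters for reasoning tasks.

Figure 11. Stage 2: evaluation of deep reasoning ability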

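Discussion and future work

DIFF Transformer has attracted considerable attention and discussion since its release. The authors have had in-depth discussions with the community on the Hugging Face paper discussion page and on alphaXiv. On X (formerly Twitter), Petar Veličković, Senior Staff Research Scientist at Google DeepMind, discussed the paper's theoretical analysis with the authors, and Lucas Beyer, a core author of ViT, wrote an in-depth summary after reading the paper; the related posts have drawn hundreds of thousands of views. DIFF Transformer has also been integrated into Hugging Face's transformers library.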

Hugging Face：https://huggingface.co/papers/2410.05258
alphaXiv：https://www.alphaxiv.org/abs/2410.05258v1
Petar Veličković：https://x.com/PetarV_93/status/1874820028975267866
Lucas Beyer：https://x.com/giffmana/status/1873869654252544079
transformers library: https://github.com/huggingface/transformers/tree/main/src/transformers/models/diffllama

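As for future work, the authors suggest using DIFF Transformer's properties to design low-bit attention operators, and exploiting the sparsity of differential attention to prune the key-value cache. Applying DIFF Transformer to modalities other than language is also worth exploring. The recent work DiffCLIP extends differential attention to vision and multimodal settings and reveals further structural properties and application potential of DIFF Transformer across modalities.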

DiffCLIP：https://arxiv.org/abs/2503.06626

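Summary

The paper's contributions fall into two main areas:

(1) Through its differential attention mechanism, DIFF Transformer effectively addresses the conventional Transformer's susceptibility to noise and imprecise attention allocation when processing text;

(2) By focusing on key information and resisting noise, DIFF Transformer performs well in language modeling, long-text modeling, key information retrieval, mathematical reasoning, hallucination mitigation, in-context learning, and activation quantization, and is a promising foundation architecture for natural language processing and multimodal applications.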

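#LLM Engineer Toolkit

The complete guide to 120+ LLM libraries!

A collection of more than 120 libraries for large language model (LLM) developers, organized into 14 categories such as training, inference, and application development. It covers tools for everything from data extraction to safety evaluation and helps developers find and use resources efficiently.

With large language models (LLMs) developing rapidly, developers face a vast choice of resources and tools, and finding and using them efficiently has become a key task for every LLM developer. The GitHub repository introduced here, LLM Engineer Toolkit, may be just the helper you need!

https://github.com/KalyanKS-NLP/llm-engineer-toolkit

Created by KalyanKS-NLP, the repository curates more than 120 LLM-related libraries and sorts them by category. Whether you need training, inference, application development, data extraction, or safety evaluation, you can find a matching tool here.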

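Categories of LLM tools

🚀 LLM Training: tools for training and fine-tuning LLMs, to help you optimize models faster and more efficiently.
🧱 LLM Application Development: end-to-end support for building applications, from frameworks and multi-API access to caching and low-code development.
🩸 LLM RAG: libraries for Retrieval-Augmented Generation (RAG), which improve a model's knowledge retrieval.
🟩 LLM Inference: tools for accelerating and optimizing inference so models run more smoothly.
🚧 LLM Serving: solutions for model deployment and inference serving.
📤 LLM Data Extraction: data extraction tools for obtaining high-quality data from a variety of sources.
🌠 LLM Data Generation: synthetic data generation to enrich your training set.
💎 LLM Agents: tools for building agents that automate tasks and collaborate in multi-agent setups.
⚖️ LLM Evaluation: evaluation tools for checking that model performance meets expectations.
🔍 LLM Monitoring: tools for monitoring models in operation so problems are found and fixed promptly.
📅 LLM Prompts: tools for optimizing and managing prompts to improve output quality.
📝 LLM Structured Outputs: tools for generating structured outputs so results are easier to use.
🛑 LLM Safety and Security: tools for keeping models safe and reliable.
💠 LLM Embedding Models: state-of-the-art text embedding models.
❇️ Others: other useful tools covering further development scenarios.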

LLM Training and Fine-Tuning

Library	Description
unsloth	Fine-tune LLMs faster with less memory.
PEFT	State-of-the-art Parameter-Efficient Fine-Tuning library.
TRL	Train transformer language models with reinforcement learning.
Transformers	Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
Axolotl	Tool designed to streamline post-training for various AI models.
LLMBox	A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation.
LitGPT	Train and fine-tune LLM lightning fast.
Mergoo	A library for easily merging multiple LLM experts, and efficiently train the merged LLM.
Llama-Factory	Easy and efficient LLM fine-tuning.
Ludwig	Low-code framework for building custom LLMs, neural networks, and other AI models.
Txtinstruct	A framework for training instruction-tuned models.
Lamini	An integrated LLM inference and tuning platform.
XTuring	xTuring provides fast, efficient and simple fine-tuning of open-source LLMs, such as Mistral, LLaMA, GPT-J, and more.
RL4LMs	A modular RL library to fine-tune language models to human preferences.
DeepSpeed	DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
torchtune	A PyTorch-native library specifically designed for fine-tuning LLMs.
PyTorch Lightning	A library that offers a high-level interface for pretraining and fine-tuning LLMs.

LLM Application Development

Frameworks

Library	Description
LangChain	LangChain is a framework for developing applications powered by large language models (LLMs).
Llama Index	LlamaIndex is a data framework for your LLM applications.
HayStack	Haystack is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more.
Prompt flow	A suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications.
Griptape	A modular Python framework for building AI-powered applications.
Weave	Weave is a toolkit for developing Generative AI applications.
Llama Stack	Build Llama Apps.

Data Preparation

Library	Description
Data Prep Kit	Data Prep Kit accelerates unstructured data preparation for LLM app developers. Developers can use Data Prep Kit to cleanse, transform, and enrich use case-specific unstructured data to pre-train LLMs, fine-tune LLMs, instruct-tune LLMs, or build RAG applications.

Multi API Access

Library	Description
LiteLLM	Library to call 100+ LLM APIs in OpenAI format.
AI Gateway	A Blazing Fast AI Gateway with integrated Guardrails. Route to 200+ LLMs, 50+ AI Guardrails with 1 fast & friendly API.

Routers

Library	Description
RouteLLM	Framework for serving and evaluating LLM routers - save LLM costs without compromising quality. Drop-in replacement for OpenAI's client to route simpler queries to cheaper models.

Memory

Library	Description
mem0	The Memory layer for your AI apps.
Memoripy	An AI memory layer with short- and long-term storage, semantic clustering, and optional memory decay for context-aware applications.
Letta (MemGPT)	An open-source framework for building stateful LLM applications with advanced reasoning capabilities and transparent long-term memory
Memobase	A user profile-based memory system designed to bring long-term user memory to your Generative AI applications.

Interface

Library	Description
Streamlit	A faster way to build and share data apps. Streamlit lets you transform Python scripts into interactive web apps in minutes
Gradio	Build and share delightful machine learning apps, all in Python.
AI SDK UI	Build chat and generative user interfaces.
AI-Gradio	Create AI apps powered by various AI providers.
Simpleaichat	Python package for easily interfacing with chat apps, with robust features and minimal code complexity.
Chainlit	Build production-ready Conversational AI applications in minutes.

Low Code

Library	Description
LangFlow	LangFlow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database.

Cache

Library	Description
GPTCache	A Library for Creating Semantic Cache for LLM Queries. Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x. Fully integrated with LangChain and LlamaIndex.

LLM RAG

Library	Description
FastGraph RAG	Streamlined and promptable Fast GraphRAG framework designed for interpretable, high-precision, agent-driven retrieval workflows.
Chonkie	RAG chunking library that is lightweight, lightning-fast, and easy to use.
RAGChecker	A Fine-grained Framework For Diagnosing RAG.
RAG to Riches	Build, scale, and deploy state-of-the-art Retrieval-Augmented Generation applications.
BeyondLLM	Beyond LLM offers an all-in-one toolkit for experimentation, evaluation, and deployment of Retrieval-Augmented Generation (RAG) systems.
SQLite-Vec	A vector search SQLite extension that runs anywhere!
fastRAG	fastRAG is a research framework for efficient and optimized retrieval-augmented generative pipelines, incorporating state-of-the-art LLMs and Information Retrieval.
FlashRAG	A Python Toolkit for Efficient RAG Research.
Llmware	Unified framework for building enterprise RAG pipelines with small, specialized models.
Rerankers	A lightweight unified API for various reranking models.
Vectara	Build Agentic RAG applications.

LLM Inference

Library	Description
LLM Compressor	Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment.
LightLLM	Python-based LLM inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
vLLM	High-throughput and memory-efficient inference and serving engine for LLMs.
torchchat	Run PyTorch LLMs locally on servers, desktop, and mobile.
TensorRT-LLM	TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference.
WebLLM	High-performance In-browser LLM Inference Engine.

LLM Serving

Library	Description
Langcorn	Serving LangChain LLM apps and agents automagically with FastAPI.
LitServe	Lightning-fast serving engine for any AI model of any size. It augments FastAPI with features like batching, streaming, and GPU autoscaling.

LLM Data Extraction

Library	Description
Crawl4AI	Open-source LLM Friendly Web Crawler & Scraper.
ScrapeGraphAI	A web scraping Python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
Docling	Docling parses documents and exports them to the desired format with ease and speed.
Llama Parse	GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents).
PyMuPDF4LLM	PyMuPDF4LLM library makes it easier to extract PDF content in the format you need for LLM & RAG environments.
Crawlee	A web scraping and browser automation library.
MegaParse	Parser for every type of document.
ExtractThinker	Document Intelligence library for LLMs.

LLM Data Generation

Library	Description
DataDreamer	DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows.
fabricator	A flexible open-source framework to generate datasets with large language models.
Promptwright	Synthetic Dataset Generation Library.
EasyInstruct	An Easy-to-use Instruction Processing Framework for Large Language Models.

LLM Agents

Library	Description
CrewAI	Framework for orchestrating role-playing, autonomous AI agents.
LangGraph	Build resilient language agents as graphs.
Agno	Build AI Agents with memory, knowledge, tools, and reasoning. Chat with them using a beautiful Agent UI.
Agents SDK	Build agentic apps using LLMs with context, tools, hand off to other specialized agents.
AutoGen	An open-source framework for building AI agent systems.
Smolagents	Library to build powerful agents in a few lines of code.
Pydantic AI	Python agent framework to build production grade applications with Generative AI.
BeeAI	Build production-ready multi-agent systems in Python.
gradio-tools	A Python library for converting Gradio apps into tools that can be leveraged by an LLM-based agent to complete its task.
Composio	Production Ready Toolset for AI Agents.
Atomic Agents	Building AI agents, atomically.
Memary	Open Source Memory Layer For Autonomous Agents.
Browser Use	Make websites accessible for AI agents.
OpenWebAgent	An Open Toolkit to Enable Web Agents on Large Language Models.
Lagent	A lightweight framework for building LLM-based agents.
LazyLLM	A Low-code Development Tool For Building Multi-agent LLMs Applications.
Swarms	The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework.
ChatArena	ChatArena is a library that provides multi-agent language game environments and facilitates research about autonomous LLM agents and their social interactions.
Swarm	Educational framework exploring ergonomic, lightweight multi-agent orchestration.
AgentStack	The fastest way to build robust AI agents.
Archgw	Intelligent gateway for Agents.
Flow	A lightweight task engine for building AI agents.
AgentOps	Python SDK for AI agent monitoring.
Langroid	Multi-Agent framework.
Agentarium	Framework for creating and managing simulations populated with AI-powered agents.
Upsonic	Reliable AI agent framework that supports MCP.

LLM Evaluation

Library	Description
Ragas	Ragas is your ultimate toolkit for evaluating and optimizing Large Language Model (LLM) applications.
Giskard	Open-Source Evaluation & Testing for ML & LLM systems.
DeepEval	LLM Evaluation Framework
Lighteval	All-in-one toolkit for evaluating LLMs.
Trulens	Evaluation and Tracking for LLM Experiments
PromptBench	A unified evaluation framework for large language models.
LangTest	Deliver Safe & Effective Language Models. 60+ Test Types for Comparing LLM & NLP Models on Accuracy, Bias, Fairness, Robustness & More.
EvalPlus	A rigorous evaluation framework for LLM4Code.
FastChat	An open platform for training, serving, and evaluating large language model-based chatbots.
judges	A small library of LLM judges.
Evals	Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
AgentEvals	Evaluators and utilities for evaluating the performance of your agents.
LLMBox	A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation.
Opik	An open-source end-to-end LLM Development Platform which also includes LLM evaluation.

LLM Monitoring

Library	Description
MLflow	An open-source end-to-end MLOps/LLMOps Platform for tracking, evaluating, and monitoring LLM applications.
Opik	An open-source end-to-end LLM Development Platform which also includes LLM monitoring.
LangSmith	Provides tools for logging, monitoring, and improving your LLM applications.
Weights & Biases (W&B)	W&B provides features for tracking LLM performance.
Helicone	Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc.
Evidently	An open-source ML and LLM observability framework.
Phoenix	An open-source AI observability platform designed for experimentation, evaluation, and troubleshooting.
Observers	A Lightweight Library for AI Observability.

LLM Prompts

Library	Description
PCToolkit	A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models.
Selective Context	Selective Context compresses your prompt and context to allow LLMs (such as ChatGPT) to process 2x more content.
LLMLingua	Library for compressing prompts to accelerate LLM inference.
betterprompt	Test suite for LLM prompts before pushing them to production.
Promptify	Solve NLP Problems with LLMs & easily generate different NLP Task prompts for popular generative models like GPT, PaLM, and more with Promptify.
PromptSource	PromptSource is a toolkit for creating, sharing, and using natural language prompts.
DSPy	DSPy is the open-source framework for programming—rather than prompting—language models.
Py-priompt	Prompt design library.
Promptimizer	Prompt optimization library.

LLM Structured Outputs

Library	Description
Instructor	Python library for working with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API.
XGrammar	An open-source library for efficient, flexible, and portable structured generation.
Outlines	Robust (structured) text generation
Guidance	Guidance is an efficient programming paradigm for steering language models.
LMQL	A language for constraint-guided and efficient LLM programming.
Jsonformer	A Bulletproof Way to Generate Structured JSON from Language Models.

LLM Safety and Security

Library	Description
JailbreakEval	A collection of automated evaluators for assessing jailbreak attempts.
EasyJailbreak	An easy-to-use Python framework to generate adversarial jailbreak prompts.
Guardrails	Adding guardrails to large language models.
LLM Guard	The Security Toolkit for LLM Interactions.
AuditNLG	AuditNLG is an open-source library that can help reduce the risks associated with using generative AI systems for language.
NeMo Guardrails	NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
Garak	LLM vulnerability scanner
DeepTeam	The LLM Red Teaming Framework

LLM Embedding Models

Library	Description
Sentence-Transformers	State-of-the-Art Text Embeddings
Model2Vec	Fast State-of-the-Art Static Embeddings
Text Embedding Inference	A blazing fast inference solution for text embeddings models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.

Others

Library	Description
Text Machina	A modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, and boundary detection.
LLM Reasoners	A library for advanced large language model reasoning.
EasyEdit	An Easy-to-use Knowledge Editing Framework for Large Language Models.
CodeTF	CodeTF: One-stop Transformer Library for State-of-the-art Code LLM.
spacy-llm	This package integrates Large Language Models (LLMs) into spaCy, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks.
pandas-ai	Chat with your database (SQL, CSV, pandas, polars, MongoDB, NoSQL, etc.).
LLM Transparency Tool	An open-source interactive toolkit for analyzing internal workings of Transformer-based language models.
Vanna	Chat with your SQL database. Accurate Text-to-SQL Generation via LLMs using RAG.
mergekit	Tools for merging pretrained large language models.
MarkLLM	An Open-Source Toolkit for LLM Watermarking.
LLMSanitize	An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).
Annotateai	Automatically annotate papers using LLMs.
LLM Reasoner	Make any LLM think like OpenAI o1 and DeepSeek R1.