LangChain - Document Transformers

Table of Contents

    • I. Document Transformers & Text Splitters
      • Text Splitters
    • II. Getting Started with Text Splitters
    • III. Split by Character
    • IV. Split Code
      • 1. Python
      • 2. JS
      • 3. Markdown
      • 4. LaTeX
      • 5. HTML
      • 6. Solidity
    • V. MarkdownHeaderTextSplitter
      • 1. Motivation
      • 2. Use Case
    • VI. Recursively Split by Character
    • VII. Split by Token
      • 1. tiktoken
      • 2. spaCy
      • 3. SentenceTransformers
      • 4. NLTK
      • 5. Hugging Face Tokenizer


This article is reposted and adapted from:
https://python.langchain.com.cn/docs/modules/data_connection/document_transformers/


I. Document Transformers & Text Splitters

Once documents are loaded, you will often want to transform them to better suit your application.
The simplest example is splitting a long document into smaller chunks that fit within your model's context window.
LangChain provides many built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.


Text Splitters

When you want to work with long pieces of text, it is necessary to split the text into chunks.
As simple as this sounds, there is a lot of potential complexity here.

Ideally, you want to keep semantically related pieces of text together.

What "semantically related" means may depend on the type of text. This notebook demonstrates several approaches.


At a high level, text splitters work as follows:

  1. Split the text into small, semantically meaningful chunks (often sentences).
  2. Combine these small chunks into a larger chunk until you reach a certain size (as measured by some function).
  3. Once you reach that size, make that chunk its own piece of text, then start building a new chunk with some overlap (to keep context between chunks).

This means there are two different axes along which you can customize your text splitter:

  1. How the text is split
  2. How the chunk size is measured
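To make the split-then-merge procedure above concrete, here is a minimal illustrative sketch. It is not LangChain's implementation; it only mirrors the two axes just described (a separator that controls how the text is split, and a length_fn that controls how size is measured), and it does not fall back to smaller separators the way RecursiveCharacterTextSplitter does.

# Illustrative sketch only (not LangChain code): split on one separator, then merge
# pieces into chunks up to chunk_size, carrying a small tail forward as overlap.
def naive_split(text, separator="\n\n", chunk_size=100, chunk_overlap=20, length_fn=len):
    pieces = [p for p in text.split(separator) if p]        # 1. split into small pieces
    chunks, current = [], []
    for piece in pieces:
        if current and length_fn(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))          # 2. the merged chunk is full: emit it
            # 3. start the next chunk with a small tail of the previous one (the overlap)
            tail = []
            while current and length_fn(separator.join([current[-1]] + tail)) <= chunk_overlap:
                tail.insert(0, current.pop())
            current = tail
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

print(naive_split("one\n\ntwo\n\nthree\n\nfour", chunk_size=12, chunk_overlap=4))
# ['one\n\ntwo', 'two\n\nthree', 'four']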

II. Getting Started with Text Splitters

The default recommended text splitter is the RecursiveCharacterTextSplitter.
This text splitter takes a list of characters.

It tries to create chunks by splitting on the first character, but if any chunk is too large it moves on to the next character, and so on.

By default, the characters it tries to split on are ["\n\n", "\n", " ", ""].


In addition to controlling which characters it splits on, you can also control a few other things:

  • length_function: how the length of a chunk is calculated. The default only counts characters, but it is common to pass a token counter here (see the sketch after this list).
  • chunk_size: the maximum size of a chunk (as measured by the length function).
  • chunk_overlap: the maximum overlap between chunks. A little overlap can help preserve continuity between chunks (e.g. as a sliding window).
  • add_start_index: whether to include each chunk's starting position within the original document in the metadata.
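As an illustration of that, here is a small sketch of passing a token counter as the length_function, together with an explicit separators list. It assumes the tiktoken package is installed; the cl100k_base encoding and the parameter values are example choices, not a recommendation.

# Sketch: measure chunk size in tokens instead of characters (assumes tiktoken is installed).
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text: str) -> int:
    # Number of tokens this text uses under the chosen encoding
    return len(enc.encode(text))

token_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],   # the default split characters, written out explicitly
    chunk_size=100,                       # now interpreted as roughly 100 tokens per chunk
    chunk_overlap=20,
    length_function=tiktoken_len,
)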

Load a long document:

with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
    add_start_index = True,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
    page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}
    page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}

III. Split by Character

This is the simplest method. It splits on a character (default "\n\n") and measures chunk length by the number of characters.

  1. How the text is split: by a single character.
  2. How the chunk size is measured: by the number of characters.

# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
    page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. ...He met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0

In the example below, we also pass metadata along with the documents. Note that the metadata is split together with the documents.

metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])
    page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. ...From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0

text_splitter.split_text(state_of_the_union)[0]
    'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. ...From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'

IV. Split Code

CodeTextSplitter allows you to split code written in multiple languages.

Import the Language enum and specify the language.

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

Full list of supported languages:

[e.value for e in Language]

    ['cpp', 'go', 'java', 'js', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol']

Given a programming language, you can also see the separators used for that language:

RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

    ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']

1. Python

Here is an example of splitting Python code:

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
    [Document(page_content='def hello_world():\n    print("Hello, World!")', metadata={}),
     Document(page_content='# Call the function\nhello_world()', metadata={})]

2. JS

Here is an example using the JS text splitter:

JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
    [Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}', metadata={}),
     Document(page_content='// Call the function\nhelloWorld();', metadata={})]

3. Markdown

Here is an example using the Markdown text splitter:

markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open to contributions.
"""
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs
    [Document(page_content='# 🦜️🔗 LangChain', metadata={}),
     Document(page_content='⚡ Building applications with LLMs through composability ⚡', metadata={}),
     ...
     Document(page_content='are extremely open to contributions.', metadata={})]

4. LaTeX

Here is an example using the LaTeX text splitter:

latex_text = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""
latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.LATEX, chunk_size=60, chunk_overlap=0
)
latex_docs = latex_splitter.create_documents([latex_text])
latex_docs
[Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle', metadata={}),
 Document(page_content='\\section{Introduction}', metadata={}),
 Document(page_content='Large language models (LLMs) are a type of machine learning', metadata={}),
 ...
 Document(page_content='psychology, and computational linguistics.', metadata={}),
 Document(page_content='\\end{document}', metadata={})]

5. HTML

Here is an example using the HTML text splitter:

html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs

    [Document(page_content='<!DOCTYPE html>\n<html>\n    <head>', metadata={}),
     Document(page_content='<title>🦜️🔗 LangChain</title>\n        <style>', metadata={}),
     Document(page_content='body {', metadata={}),
     Document(page_content='font-family: Arial, sans-serif;', metadata={}),
     Document(page_content='}\n            h1 {', metadata={}),
     Document(page_content='color: darkblue;\n            }', metadata={}),
     Document(page_content='</style>\n    </head>\n    <body>\n        <div>', metadata={}),
     Document(page_content='<h1>🦜️🔗 LangChain</h1>', metadata={}),
     Document(page_content='<p>⚡ Building applications with LLMs through', metadata={}),
     Document(page_content='composability ⚡</p>', metadata={}),
     Document(page_content='</div>\n        <div>', metadata={}),
     Document(page_content='As an open source project in a rapidly', metadata={}),
     Document(page_content='developing field, we are extremely open to contributions.', metadata={}),
     Document(page_content='</div>\n    </body>\n</html>', metadata={})]

6. Solidity

Here is an example using the Solidity text splitter:

SOL_CODE = """
pragma solidity ^0.8.20;
contract HelloWorld {
   function add(uint a, uint b) pure public returns(uint) {
       return a + b;
   }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)
sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs
[Document(page_content='pragma solidity ^0.8.20;', metadata={}),
 Document(page_content='contract HelloWorld {\n   function add(uint a, uint b) pure public returns(uint) {\n       return a + b;\n   }\n}', metadata={})]

V. MarkdownHeaderTextSplitter


1. Motivation

Many chat or Q&A applications split input documents into chunks before embedding them and storing them in a vector store.

These notes from Pinecone provide some useful tips:

When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.

As mentioned above, chunking typically aims to keep text with a common context together.

With that in mind, we may want to specifically honor the structure of the document itself.

For example, a Markdown file is organized by headers.

Creating chunks within specific header groups is an intuitive idea.

To address this challenge, we can use MarkdownHeaderTextSplitter.

It will split a Markdown file by a specified set of headers.

For example, if we want to split this Markdown:

md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly' 

We can specify the headers to split on:

[("#", "Header 1"),("##", "Header 2")]

Then the content is grouped or split by common headers:

{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}

Let's look at a few examples below.

from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

for split in md_header_splits:
    print(split)
{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}

Within each Markdown group, we can then apply whatever text splitter we want.

markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 10
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Split within each header group
all_splits = []
all_metadatas = []
for header_group in md_header_splits:
    _splits = text_splitter.split_text(header_group['content'])
    _metadatas = [header_group['metadata'] for _ in _splits]
    all_splits += _splits
    all_metadatas += _metadatas

all_splits[0]
# -> 'Markdown[9'
all_metadatas[0]
# -> {'Header 1': 'Intro', 'Header 2': 'History'}

2. Use Case

Let's test MarkdownHeaderTextSplitter on a Notion page. For details, see: https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b

The page is downloaded as Markdown and stored locally.

# Load the Notion database as a Markdown file
from langchain.document_loaders import NotionDirectoryLoader

loader = NotionDirectoryLoader("../Notion_DB_Metadata")
docs = loader.load()
md_file = docs[0].page_content

# Let's create groups based on the section headers
headers_to_split_on = [
    ("###", "Section"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(md_file)
md_header_splits[3]
{'content': 'We previously introduced [auto-evaluator](https://blog.langchain.dev/auto-evaluator-opportunities/), an open-source tool for grading LLM question-answer chains. Here, we extend auto-evaluator with a [lightweight Streamlit app](https://github.com/langchain-ai/auto-evaluator/tree/main/streamlit) that can connect to any existing Pinecone index. We add the ability to test metadata filtering using `SelfQueryRetriever` as well as some other approaches that we’ve found to be useful, as discussed below.  \n[ret_trim.mov](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/ret_trim.mov)',
 'metadata': {'Section': 'Evaluation'}}

Now we split the text within each group and keep the group header as metadata.

# Define our text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 500
chunk_overlap = 50
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Create splits within each header group
all_splits = []
all_metadatas = []
for header_group in md_header_splits:
    _splits = text_splitter.split_text(header_group['content'])
    _metadatas = [header_group['metadata'] for _ in _splits]
    all_splits += _splits
    all_metadatas += _metadatas
all_splits[6]
'In these cases, semantic search will look for the concept `episode 53` in the chunks, but instead we simply want to filter the chunks for `episode 53` and then perform semantic search to extract those that best summarize the episode. Metadata filtering does this, so long as we 1) we have a metadata filter for episode number and 2) we can extract the value from the query (e.g., `54` or `252`) that we want to extract. The LangChain `SelfQueryRetriever` does the latter (see'

all_metadatas[6]
{'Section': 'Motivation'}

This sets us up nicely to perform metadata filtering based on the document structure.

Let's bring this all together by first building a vectorstore.

! pip install chromadb

# Build vectorstore
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(texts=all_splits, metadatas=all_metadatas, embedding=embeddings)

We create a SelfQueryRetriever that can filter on the metadata we defined.

# Create retriever
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Define our metadata
metadata_field_info = [
    AttributeInfo(
        name="Section",
        description="Headers of the markdown document that organize the ideas",
        type="string or list[string]",
    ),
]
document_content_description = "Headers of the markdown document"

# Define the self-query retriever
llm = OpenAI(temperature=0)
sq_retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)

We can then retrieve chunks from any section of the document.

# Test
question="Summarize the Introduction section of the document"
sq_retriever.get_relevant_documents(question)
query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None

[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),
 Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. ... Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),
 Document(page_content='on a user-defined criteria in a VectorDB using metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]

We can now build chat or Q&A applications that are aware of the document structure.

Of course, semantic search without metadata filtering would probably work reasonably well for this simple document.

But for more complex or longer documents, the ability to preserve document structure for metadata filtering can be helpful.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=sq_retriever)
qa_chain.run(question)
query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None

'The document discusses different approaches to retrieve relevant text chunks and synthesize them into an answer in Q+A systems.
...
The Retriever-Less option, which uses the Anthropic 100k context window model, is also mentioned as an alternative approach.'
question="Summarize the Testing section of the document"
qa_chain.run(question)
query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None

'The Testing section of the document describes how the performance of the SelfQueryRetriever was evaluated using various test cases.
...
Additionally, the document mentions the use of the Kor library for structured data extraction to explicitly specify transformations that the auto-evaluator can use.'

VI. Recursively Split by Character

This text splitter is the recommended splitter for generic text. It is parameterized by a list of characters.
It tries to split on them in order until the chunks are small enough.
The default list is ["\n\n", "\n", " ", ""].

This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those are generally the most strongly semantically related pieces of text.

  1. How the text is split: by a list of characters.
  2. How the chunk size is measured: by the number of characters.
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
    page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0
    page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0

text_splitter.split_text(state_of_the_union)[:2]
    ['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
     'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']

VII. Split by Token

Language models have a token limit, and you should not exceed it.
So when you split your text into chunks, it is a good idea to count the number of tokens.
There are many tokenizers available. When counting tokens in your text, use the same tokenizer that the language model uses.
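As a quick sanity check in that spirit, here is a small sketch of counting tokens with tiktoken for an OpenAI-style model before deciding on a chunk size. It assumes the tiktoken package is installed; the model name is only an example.

# Sketch: count tokens with the tokenizer of the target model (assumes tiktoken is installed).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # example model name

text = "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman."
print(len(enc.encode(text)))  # how many tokens this text costs for that model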


1. tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI.

We can use it to estimate the number of tokens used. It will likely be more accurate for OpenAI models.

  1. How the text is split: by the character passed in.
  2. How the chunk size is measured: by the tiktoken tokenizer.

Install tiktoken:

!pip install tiktoken

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

texts[0]

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution.

You can also load a tiktoken splitter directly:

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

2. spaCy

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

Another alternative to NLTK is the spaCy tokenizer.

  1. How the text is split: by the spaCy tokenizer.
  2. How the chunk size is measured: by the number of characters.

!pip install spacy

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)

texts[0]

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.
...
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

3. SentenceTransformers

SentenceTransformersTokenTextSplitter is a text splitter designed specifically for sentence-transformer models.
Its default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)  # 2

token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier  # 514

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

tokens in text to split: 514

text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])  # lorem


4. NLTK

The Natural Language Toolkit, better known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in the Python programming language.

Instead of just splitting on "\n\n", we can use NLTK tokenizers to split the text.

  1. How the text is split: by the NLTK tokenizer.
  2. How the chunk size is measured: by the number of characters.
# pip install nltk
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
...
Groups of citizens blocking tanks with their bodies.

5. Hugging Face Tokenizer

Hugging Face has many tokenizers.

We use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.

  1. How the text is split: by the character passed in.
  2. How the chunk size is measured: by the number of tokens computed by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

texts[0]

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  ...With a duty to one another to the American people to the Constitution.

2024-04-08 (Monday)
