使用ClickHouse UDF与OpenAI模型集成

本文字数：14683；估计阅读时间：37 分钟

作者：Dale McDiarmid

审校：庄晓东（魏庄）

本文在公众号【ClickHouseInc】首发

Meetup活动

ClickHouse Shenzhen User Group第1届 Meetup 火热报名中，详见文末海报！

介绍

在ClickHouse中，用户可以直接在其SQL中调用AI模型。这可以在插入数据时丰富数据的形式，或者在查询时补充特定的结果。虽然许多用户习惯于训练自己的领域特定模型，但对于较小的团队或用例，这通常是不切实际的。在这些情况下，预构建的“即插即用”模型或服务通常就足够，并且可以在付出最小努力的情况下获得良好的结果。

在本文中，我们演示了：

如何使用ClickHouse用户定义的函数（UDF）轻松集成第三方API，这些API提供“AI as a Service”
如何在ClickHouse中，直接调用这些“即插即用”模型完成特定任务，如情感分析，并针对这些结果进行聚合，以计算有关给定主题的正面和负面帖子数量等指标

鉴于OpenAI最近的知名度和备受关注的ChatGPT服务，我们以OpenAI为例。然而，这种方法的简单性意味着可以轻松地适应其他AI服务。

用户定义的函数（UDF）

ClickHouse中的UDF（用户定义的函数）有几种形式。在最近的一篇文章中，我们分享了：如何使用ClickHouse SQL定义的函数查询Hugging Face中托管的外部数据集。虽然诸如此类的SQL定义的函数对于概括常见的SQL任务非常有用，但有时用户需要熟悉的编程语言的全部功能。为此，ClickHouse支持可执行的UDF。这为开发人员提供了调用任何外部可执行程序或脚本来处理数据的灵活性。在我们下面的简单示例中，我们将使用这个功能来调用简单的Bash和Python脚本，这些脚本将查询OpenAI API。我们将展示API响应如何自动丰富由ClickHouse插入或查询的数据。

使用OpenAI

大多数用户通过流行的ChatGPT服务熟悉OpenAI，该服务已经改变了工作行为和日常任务。OpenAI为企业提供了访问ChatGPT模型的REST API，以在现有服务和自动化流程中使用。这些服务提供从聊天和嵌入生成到图像生成和语音转文本的所有方式。我们的示例专注于聊天完成，这可以用于更通用的任务，如实体提取和情感标记。

注意：访问OpenAI服务的所有请求都需要一个令牌 - 在下面的示例中作为环境变量OPENAI_API_KEY传递。用户可以注册试用版，并获得足够的免费积分来执行这里的示例。

除了能够充当聊天机器人外，OpenAI的完成（completion）服务还支持其他任务，如情感分析和结构提取。对于这些任务，开发人员必须通过系统角色提供相关说明，以描述期望的行为。执行情感分析的示例REST API请求可能如下所示。在这里，我们要求服务对论坛帖子进行分类。请注意，我们需要提供明确的说明，以仅返回指定情感的单个令牌：

curl https://api.openai.com/v1/chat/completions \-H "Content-Type: application/json" \-H "Authorization: Bearer $OPENAI_API_KEY" \-d '{"model": "gpt-3.5-turbo","messages": [{"role": "system","content": "You are an AI language model trained to analyze and detect the sentiment of forum comments."},{"role": "user","content": "Analyze the following hackernews comment and determine if the sentiment is: positive, negative or neutral. Return only a single word, either POSITIVE, NEGATIVE or NEUTRAL: I can say for BigQuery and Databricks from personal experience.<p>BigQuery is much slower and is much more expensive for both storage and query.<p>Databricks (Spark) is even slower than that (both io and compute), although you can write custom code&#x2F;use libs.<p>You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage)."}],"temperature": 0,"max_tokens": 256
}'{"id": "chatcmpl-7vOWWkKWGN7McODMXJzQB6zzDcx0r","object": "chat.completion","created": 1693913320,"model": "gpt-3.5-turbo-0613","choices": [{"index": 0,"message": {"role": "assistant","content": "NEGATIVE"},"finish_reason": "stop"}],"usage": {"prompt_tokens": 147,"completion_tokens": 2,"total_tokens": 149}
}

请注意，这里我们使用更具成本效益的gpt-3.5-turbo模型，而不是最新的gpt-4模型。前者对于示例目的已足够。我们将其性能水平的评估留给读者。

相同的服务也可以用于提取结构。假设我们希望从上述文本中提取所提到的技术，作为字符串值的列表。我们需要稍微修改一下说明：

curl https://api.openai.com/v1/chat/completions \-H "Content-Type: application/json" \-H "Authorization: Bearer $OPENAI_API_KEY" \-d '{"model": "gpt-3.5-turbo","messages": [{"role": "system","content": "You are an AI language model trained to extract entities from forum comments"},{"role": "user","content": "From the following text extract the many technologies mentioned as a comma seperated list: I can say for BigQuery and Databricks from personal experience.<p>BigQuery is much slower and is much more expensive for both storage and query.<p>Databricks (Spark) is even slower than that (both io and compute), although you can write custom code&#x2F;use libs.<p>You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage)."}],"temperature": 0,"max_tokens": 20
}'{"id": "chatcmpl-7vOdLnrZWeax3RxjeUNelCTdGvr8q","object": "chat.completion","created": 1693913743,"model": "gpt-3.5-turbo-0613","choices": [{"index": 0,"message": {"role": "assistant","content": "BigQuery, Databricks, Spark, ClickHouse"},"finish_reason": "stop"}],"usage": {"prompt_tokens": 122,"completion_tokens": 11,"total_tokens": 133}
}

关于上述请求参数的一些建议：

我们将temperature设置为0，以从响应中删除任何随机性。对于这些用例，我们不需要创造性的文本，只需要确定性的文本分析。
在两种情况下，我们设置max_tokens以确定响应的长度。一个token大约是3/4个词。因此，我们调整了我们的请求。

数据集

对于我们的示例数据集，我们使用了Hacker News的帖子。这个数据集在我们的公共play环境中可用，包含从2006年到2023年8月的流行Hacker News论坛上的所有帖子和评论，约有3700万行。表模式如下所示。

我们对title和text列感兴趣。我们将探索这个数据集的任务留给读者，如果希望将这个数据集的最新版本加载到你自己的ClickHouse实例中，可以按照这里的说明操作。或者，我们提供了一个在S3上的Parquet文件，可以使用如下所示的s3函数加载：

CREATE TABLE hackernews
(`id` UInt32,`deleted` UInt8,`type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),`by` LowCardinality(String),`time` DateTime,`text` String,`dead` UInt8,`parent` UInt32,`poll` UInt32,`kids` Array(UInt32),`url` String,`score` Int32,`title` String,`parts` Array(UInt32),`descendants` Int32
)
ENGINE = MergeTree
ORDER BY idINSERT INTO hackernews SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/2023-08-18.parquet')

添加情感分析

对于我们的示例，假设我们希望为存储在ClickHouse中的Hacker News数据添加情感分析。为此，我们需要通过ClickHouse UDF调用之前的OpenAI REST API。这个请求的简单性意味着：即使是一个简单的bash脚本可能也足够，如下所示（下面需要jq）。稍后，我们将演示如何直接在Python中执行此操作。

#!/bin/bashwhile read read_data; dosentiment=$(curl -s https://api.openai.com/v1/chat/completions \-H "Content-Type: application/json" \-H "Authorization: Bearer <insert>" \-d "{\"model\": \"gpt-3.5-turbo\",\"messages\": [{\"role\": \"system\",\"content\": \"You are an AI language model trained to analyze and detect the sentiment of forum comments.\"},{\"role\": \"user\",\"content\": \"Analyze the following Hacker News comment and determine if the sentiment is: positive, negative or neutral. Return only a single word, either POSITIVE, NEGATIVE or NEUTRAL: ${read_data}\"}],\"temperature\": 0,\"max_tokens\": 2,\"temperature\": 0
}" | jq -r '.choices[0].message.content')printf "$sentiment";
done

这个脚本应该保存在ClickHouse的user_scripts目录中，命名为sentiment.sh，并设置为可执行。还应该在一个文件openai_functions.xml中添加以下条目，并保存到ClickHouse配置目录（通常是/etc/clickhouse-server/）。

<functions><function><name>sentiment</name><type>executable</type><format>TabSeparated</format><return_type>String</return_type><argument><type>String</type></argument><command>sentiment.sh</command><command_read_timeout>10000</command_read_timeout><command_write_timeout>10000</command_write_timeout><max_command_execution_time>10000</max_command_execution_time></function>
</functions>

这个配置使得UDF可用于ClickHouse。除了在这里修改超时以允许OpenAI请求的延迟之外，我们提供了一个函数名sentiment，并指定了输入和返回类型。

有了上述配置，用户可以通过简单的函数调用请求文本片段的情感，例如：

SELECT sentiment('Learn about the key differences between ClickHouse Cloud and Snowflake and how ClickHouse Cloud outperforms Snowflake across the critical dimensions for real-time analytics: query latency and and cost.') AS sentiment┌─sentiment─┐
│ POSITIVE  │
└───────────┘1 row in set. Elapsed: 0.433 sec.

虽然上面的例子可以让我们开始，但可能需要一个更健壮的解决方案，具有错误处理的功能。为此，我们可能希望将上述内容转换为Python。下面的Python脚本添加了基本的错误处理和带有回退的重试。后者是为了特别解决OpenAI速率限制的挑战 - 请参阅处理延迟和速率限制以获取更多详细信息。

请注意，需要openai和tenacity库来处理API请求和速率限制。

#!/usr/bin/python3
import sys
import openai
from tenacity import (retry,stop_after_attempt,wait_random_exponential,
)openai.api_key = "<INSERT>"
request_timeout = 3@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(20))
def completion_with_backoff(**kwargs):return openai.ChatCompletion.create(**kwargs)def extract_sentiment(text):if text == "":return "NEUTRAL"messages = [{"role": "system","content": "You are an AI language model trained to analyze and detect the sentiment of hackernews forum comments."},{"role": "user","content": f"Analyze the following hackernews comment and determine if the sentiment is: positive, negative or neutral. "f"Return only a single word, either POSITIVE, NEGATIVE or NEUTRAL: {text}"}]try:response = completion_with_backoff(model="gpt-3.5-turbo", messages=messages, max_tokens=30, temperature=0, request_timeout=request_timeout)return response.choices[0].message.contentexcept:return "ERROR"for size in sys.stdin:# collect a batch for performancefor row in range(0, int(size)):print(extract_sentiment(sys.stdin.readline().strip()))sys.stdout.flush()

服务的基于聊天的性质使得在单个请求中评估多个文本片段的情感变得具有挑战性。为了使这些示例保持简单，我们每行进行一次请求。一个更优化的解决方案可能会对请求进行批处理，并要求端点评估一组文本。

上述假设：从ClickHouse传递的任何输入，都包括一个行数的前缀。这用于确定后续输入要迭代的次数。这可以使Python脚本内的操作进行批处理以获得更高的性能。

除了定义唯一名称sentiment_p之外，上述函数的配置还有一些额外的设置。我们将type设置为可执行池，以提高吞吐性能。这将启动命令N次（下面是10次），允许多个并发的调用。设置send_chunk_header确保在任何输入之前有一个表示要处理的行数的数字标题。我们增加了超时设置，以防传递大块的行。

<functions><function><name>sentiment_p</name><type>executable_pool</type><pool_size>10</pool_size><send_chunk_header>true</send_chunk_header><format>TabSeparated</format><return_type>String</return_type><argument><type>String</type></argument><command>sentiment.py</command><command_read_timeout>10000000</command_read_timeout><command_write_timeout>10000000</command_write_timeout><max_command_execution_time>1000000</max_command_execution_time></function>
</functions>

我们可以将上述任何一个函数应用于列的一组行。在下面的示例中，我们请求包含单词ClickHouse的10行的标题和文本的情感。

SELECT text, sentiment_p(text) AS sentiment
FROM hackernews WHERE text LIKE '%ClickHouse%' OR title LIKE '%ClickHouse%'
ORDER BY time DESC
LIMIT 2
FORMAT Vertical
Row 1:
──────
text:    Yeah ClickHouse is definitely the way to go here. Its ability to serve queries with low latency and high concurrency is in an entirely different league from Snowflake, Redshift, BigQuery, etc.
sentiment: POSITIVERow 2:
──────
text:    There are other databases today that do real time analytics (ClickHouse, Apache Druid, StarRocks along with Apache Pinot).  I&#x27;d look at the ClickHouse Benchmark to see who are the competitors in that space and their relative performance.
sentiment: POSITIVE2 rows in set. Elapsed: 2.763 sec. Processed 37.17 million rows, 13.30 GB (13.46 million rows/s., 4.82 GB/s.)

只有在最终结果被整理后，UDF才会执行 - 这意味着只需要两个请求。这种方法是理想的，因为对OpenAI的请求的延迟通常比ClickHouse评估查询的时间要长得多。

进一步地，我们可以通过简单的聚合来计算ClickHouse的正面和负面帖子的数量。这会产生更多的开销，因为我们需要对OpenAI API进行1600多次调用。这反映在最终的计时中。

SELECTcount(),sentiment
FROM hackernews
WHERE (text LIKE '%ClickHouse%') OR (title LIKE '%ClickHouse%')
GROUP BY sentiment_p(text) AS sentiment
FORMAT PrettyCompactMonoBlock┌─count()─┬─sentiment─┐
│   192 │ NEGATIVE  │
│   628 │ NEUTRAL   │
│   857 │ POSITIVE  │
└─────────┴───────────┘3 rows in set. Elapsed: 203.695 sec. Processed 37.17 million rows, 13.28 GB (182.48 thousand rows/s., 65.21 MB/s.)

处理延迟和速率限制

OpenAI API的实用性受到两个因素的限制：延迟和它施加的速率限制。请注意，这些变量将取决于所选择的“即插即用”模型。在我们的示例中，我们使用OpenAI。还有许多其他选择，每个选择都有其自己的权衡。

延迟将影响查询的最小响应时间。虽然OpenAI允许多个并发查询以确保这不会影响吞吐量，但速率限制将更具限制性。因此，我们建议用户仅将这些API用于自发分析，其中函数仅在结果的小子集上使用（例如我们之前的2行示例），或者在插入时丰富数据。在展示后者的示例之前，让我们探讨一下延迟和速率限制的限制。

我们可以通过修改我们的情感curl请求来使用一个简单的格式文件来评估响应的延迟：

curl -w "@curl-format.txt" -o /dev/null -s  https://api.openai.com/v1/chat/completions \-H "Content-Type: application/json" \-H "Authorization: Bearer $OPENAI_API_KEY" \-d '{"model": "gpt-3.5-turbo","messages": [{"role": "system","content": "You are an AI language model trained to analyze and detect the sentiment of forum comments."},{"role": "user","content": "Analyze the following hackernews comment and determine if the sentiment is: positive, negative or neutral. Return only a single word, either POSITIVE, NEGATIVE or NEUTRAL: I can say for BigQuery and Databricks from personal experience.<p>BigQuery is much slower and is much more expensive for both storage and query.<p>Databricks (Spark) is even slower than that (both io and compute), although you can write custom code&#x2F;use libs.<p>You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage)."}],"temperature": 0,"max_tokens": 256,"temperature": 0
}'time_namelookup:  0.081196s
time_connect:  0.084907s
time_appconnect:  0.095853s
time_pretransfer:  0.095937s
time_redirect:  0.000000s
time_starttransfer:  0.095942s
----------
time_total:  0.650401s

这里的总延迟为0.65秒，限制了我们的查询响应时间，但由于我们有一个命令池（上面是10），ClickHouse的并行执行管道可以利用它。然而，这种并行化又受到OpenAPI速率限制的限制。

OpenAI受到每分钟请求数和每分钟令牌数的限制。对于我们的gpt-3.5-turbo模型，这是每分钟90,000个令牌（TPM）和每分钟3,500个请求（RPM）。速率限制因模型和帐户类型而异 - 进一步的细节请参阅此处。

为了解决这个问题，我们在我们的UDF中添加了基本的速率限制。API返回头中包含速率限制信息（即下一分钟剩余的令牌数和请求数）。虽然我们可以开发一个使用这些信息的速率限制函数，但OpenAI建议使用几个专为通过指数回退解决此问题而设计的库。这样做的好处是我们不需要跟踪跨多个线程的请求和令牌使用情况。

我们对聚合查询的上述计时（203.695秒）表明，我们可能没有充分利用我们的10个UDF命令池，或者正在受到速率限制的限制。假设平均延迟为0.65 *，完全并行化，我们预计我们的总执行时间在这里更接近100秒（1600/10 * 0.65 = 104秒）。

*我们假设Open AI API可以保持这种延迟，而不受变量内容长度等因素的影响（一些评论可能比其他评论更长）。

这种性能（100秒）之所以没有实现，是因为查询受到OpenAI API的速率限制的限制 - 具体来说是令牌限制。每个Hacker News评论平均约为330个单词，如下所示，或约80个令牌（每个令牌约4个字符）。但这并不包括我们的提示和系统文本，其中还增加了额外的60个令牌。我们的与ClickHouse相关的子集的平均令牌长度也较高，为136。

SELECTround(avg(length(text))) AS num_chars,round(num_chars * 0.25) AS num_tokens
FROM hackernews┌─num_chars─┬─num_tokens─┐
│     333 │       83 │
└───────────┴────────────┘1 row in set. Elapsed: 1.656 sec. Processed 37.17 million rows, 12.72 GB (22.44 million rows/s., 7.68 GB/s.)SELECTround(avg(length(text))) AS num_chars,round(num_chars * 0.25) AS num_tokens
FROM hackernews
WHERE (title LIKE '%ClickHouse%') OR (text LIKE '%ClickHouse%')┌─num_chars─┬─num_tokens─┐
│     546 │      136 │
└───────────┴────────────┘1 row in set. Elapsed: 1.933 sec. Processed 37.17 million rows, 13.28 GB (19.23 million rows/s., 6.87 GB/s.)
Peak memory usage: 73.49 MiB.

虽然每个评论都需要一个请求，导致总共1600个请求（低于每分钟3500的限制），但我们总共有900k个字符或229k个令牌。考虑到我们的提示文本，这增加到了329k个令牌（每个请求额外增加60个）。这远远超过了每分钟90k的限制。尽管如此，如果这项工作被完美地安排，我们预计这个请求将在我们经历的200秒内完成（329/90〜3.65分钟〜200秒）。

尽管更好的速率限制实现（例如基于通用单元速率算法的实现）可能会更有效地利用Open AI API资源，但请求的延迟最终将受到令牌限制的约束。我们只能使用前N个令牌，其中N的选择基于一个限制，该限制将确保完全利用每分钟3500次的完整请求限制，即90000/3500〜25个令牌。然而，在我们的示例中，这不太可能足以确定情感上述技术。

在插入时进行提取

鉴于速率限制和延迟，使用API进行查询的更可取的方法是在数据插入时分配情感列。借助其命令池，Python函数更适合进行此类型的批处理处理。下面，在通过INSERT INTO加载行时提取情感。在这个例子中，我们将所有与ClickHouse相关的行插入一个专用表中，为每个计算一个情感列。这种处理方式非常理想，因为新行插入时，Hacker News数据集每分钟接收约8-10行新数据。一旦分配了列，我们就可以在情感列上享受ClickHouse查询速度，而无需进行API请求。

INSERT INTO hackernews_v2 SELECT*,sentiment_p(text) AS sentiment
FROM hackernews
WHERE (text LIKE '%ClickHouse%') OR (title LIKE '%ClickHouse%')0 rows in set. Elapsed: 185.452 sec. Processed 37.17 million rows, 13.54 GB (200.44 thousand rows/s., 73.00 MB/s.)SELECT count(), sentiment
FROM hackernews_v2
GROUP BY sentiment┌─count()─┬─sentiment─┐
│   193 │ NEGATIVE  │
│   850 │ POSITIVE  │
│   634 │ NEUTRAL   │
└─────────┴───────────┘3 rows in set. Elapsed: 0.003 sec. Processed 1.68 thousand rows, 1.68 KB (531.10 thousand rows/s., 531.10 KB/s.)
Peak memory usage: 72.90 KiB.

提取结构

为了完整起见，让我们也将之前的OpenAI请求转换为提取我们的帖子中的技术。以下是bash Python脚本：

#!/usr/bin/python3
import sys
import openai
from tenacity import (retry,stop_after_attempt,wait_random_exponential,
)openai.api_key = "<INSERT>"
request_timeout = 3@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(20))
def completion_with_backoff(**kwargs):return openai.ChatCompletion.create(**kwargs)def extract_topics(text):if text == "":return ""messages = [{"role": "system","content": "You are an AI language model trained to extract entities from Hacker News forum comments."},{"role": "user","content": f"From the following text extract the technologies mentioned as a comma separated list with no spaces. Return an empty string if there are no technologies: {text}"}]try:response = completion_with_backoff(model="gpt-3.5-turbo", messages=messages, max_tokens=30, temperature=0,request_timeout=request_timeout)return response.choices[0].message.content.strip()except Exception as e:return f"ERROR - {e}"for size in sys.stdin:# collect a batch for performancefor row in range(0, int(size)):print(",".join([tech.strip() for tech in extract_topics(sys.stdin.readline().strip()).split(",")]))sys.stdout.flush()

配置此函数时，使用与情感UDF相同的参数，只是将名称更改为extract_techs，我们可以在Hacker News上使用ClickHouse识别提到的顶级技术。

WITH results AS (SELECT extract_techs(text) as techsFROM hackernewsWHERE (text LIKE '%ClickHouse%') OR (title LIKE '%ClickHouse%')
)
SELECTarrayJoin(splitByChar(',', techs)) AS tech,count() AS c
FROM results
GROUP BY tech
HAVING tech NOT ILIKE '%ClickHouse%' AND tech != ''
ORDER BY c DESC
LIMIT 5┌─tech────────┬───c─┐
│ Postgres  │  78 │
│ PostgreSQL  │  65 │
│ SQL       │  63 │
│ TimescaleDB │  54 │
│ MySQL     │  51 │
└─────────────┴─────┘5 rows in set. Elapsed: 211.358 sec. Processed 37.17 million rows, 13.28 GB (175.87 thousand rows/s., 62.85 MB/s.)
Peak memory usage: 931.95 MiB.