Problem: Trouble retrieving embedding data from an OpenAI API batch embeddings job
Background:
I have to embed over 300,000 product descriptions for a multi-classification project. I split the descriptions into chunks of 34,337 descriptions each to stay under the Batch embeddings size limit.
A sample of my jsonl file for batch processing:
{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Base L\u00edquida Maybelline Superstay 24 Horas Full Coverage Cor 220 Natural Beige 30ml", "encoding_format": "float"}}
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Sand\u00e1lia Havaianas Top Animals Cinza/Gelo 39/40", "encoding_format": "float"}}
My jsonl file has 34,337 lines.
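For context, a file like this can be produced along the following lines (a minimal sketch, not code from the original post: descriptions is assumed to be a Python list of description strings, and the chunk size and file name pattern are illustrative):

import json

CHUNK_SIZE = 34337  # keep each file under the batch request limit

for n in range(0, len(descriptions), CHUNK_SIZE):
    chunk = descriptions[n:n + CHUNK_SIZE]
    # one request object per line, matching the sample format above
    with open(f'batch_emb_file_{n // CHUNK_SIZE + 1}.jsonl', 'w') as f:
        for i, desc in enumerate(chunk):
            f.write(json.dumps({
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {
                    "model": "text-embedding-ada-002",
                    "input": desc,
                    "encoding_format": "float",
                },
            }) + "\n")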
I've successfully uploaded the file:
File 'batch_emb_file_1.jsonl' uploaded succesfully:FileObject(id='redacted for work compliance', bytes=6663946, created_at=1720128016, filename='batch_emb_file_1.jsonl', object='file', purpose='batch', status='processed', status_details=None)
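The upload step was presumably the SDK's standard files.create call, something like this (a sketch; the client setup and print format are assumptions):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Upload the jsonl file for use by the Batch API
batch_file = client.files.create(
    file=open('batch_emb_file_1.jsonl', 'rb'),
    purpose='batch',
)
print(f"File '{batch_file.filename}' uploaded successfully: {batch_file}")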
and ran the embedding job:
Batch job created successfully:Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))
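The job creation step is presumably the standard batches.create call (a sketch; the metadata shown mirrors a subset of what appears in the output above):

# Create the batch embedding job from the uploaded file
batch_job_1 = client.batches.create(
    input_file_id=batch_file.id,
    endpoint='/v1/embeddings',
    completion_window='24h',
    metadata={
        'description': 'Batch job for embedding large quantity of product descriptions',
        'project': 'Product Classification',
    },
)
print(f"Batch job created successfully: {batch_job_1}")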
The job completed:
client.batches.retrieve(batch_job_1.id).status
'completed'
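One way to wait for that terminal status is to poll the job (a sketch; the 60-second interval is arbitrary):

import time

# Poll until the batch job reaches a terminal state
while True:
    job = client.batches.retrieve(batch_job_1.id)
    if job.status in ('completed', 'failed', 'expired', 'cancelled'):
        break
    time.sleep(60)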
client.batches.retrieve('redacted for work compliance')
returns:
Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1720135956, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=1720133521, in_progress_at=1720129903, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id='redacted for work compliance', request_counts=BatchRequestCounts(completed=34337, failed=0, total=34337))
But when I try to get the content using the output_file_id string, client.files.content(value of output_file_id) returns:
<openai._legacy_response.HttpxBinaryResponseContent at 0x79ae81ec7d90>
I have tried: client.files.content(value of output_file_id).content
but this kills my kernel (a Jupyter Notebook or similar environment).
What am I doing wrong? Also, I believe I am under-utilizing Batch embeddings: the 90,000 limit conflicts with the batch queue limit of the 'text-embedding-ada-002' model, which is 3,000,000.
Could someone help?
Solution:
Retrieving the embedding data from the batch output file is a bit tricky; this tutorial breaks it down step by step: link
After getting the output_file_id, you need to:
import json
import pandas as pd

# Read the batch output file as text (one JSON object per line)
output_file = client.files.content(output_file_id).text

embedding_results = []
for line in output_file.split('\n')[:-1]:  # the file ends with a newline, so drop the empty last element
    data = json.loads(line)
    custom_id = data.get('custom_id')
    embedding = data['response']['body']['data'][0]['embedding']
    embedding_results.append([custom_id, embedding])

embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])
In my case, this retrieves the embedding data from the batch job's output file.
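If holding the whole response in the notebook is a concern (as with the kernel crash above), one alternative is to write the raw bytes to disk first with the response object's write_to_file method and then parse the file line by line (a sketch; the output file name is illustrative):

import json
import pandas as pd

# Save the binary response to disk instead of materializing it in the notebook
raw = client.files.content(output_file_id)
raw.write_to_file('batch_embeddings_output.jsonl')

rows = []
with open('batch_embeddings_output.jsonl') as f:
    for line in f:
        if not line.strip():
            continue  # skip any blank lines
        data = json.loads(line)
        rows.append([data.get('custom_id'),
                     data['response']['body']['data'][0]['embedding']])
embedding_results = pd.DataFrame(rows, columns=['custom_id', 'embedding'])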