Problem: Trouble retrieving embedding data from an OpenAI API batch embeddings job
Background:
I have to embed over 300,000 product descriptions for a multi-classification project. I split the descriptions into chunks of 34,337 descriptions each to stay under the Batch embeddings size limit.
A sample of my jsonl file for batch processing:
{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Base L\u00edquida Maybelline Superstay 24 Horas Full Coverage Cor 220 Natural Beige 30ml", "encoding_format": "float"}}
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Sand\u00e1lia Havaianas Top Animals Cinza/Gelo 39/40", "encoding_format": "float"}}
My jsonl file has 34,337 lines.
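For context, a file like this can be produced along the following lines (a minimal sketch, not code from the original post: descriptions is assumed to be a Python list of description strings, and the chunk size and file name pattern are illustrative):

import json

CHUNK_SIZE = 34337  # keep each file under the batch request limit

for n in range(0, len(descriptions), CHUNK_SIZE):
    chunk = descriptions[n:n + CHUNK_SIZE]
    # one request object per line, matching the sample format above
    with open(f'batch_emb_file_{n // CHUNK_SIZE + 1}.jsonl', 'w') as f:
        for i, desc in enumerate(chunk):
            f.write(json.dumps({
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {
                    "model": "text-embedding-ada-002",
                    "input": desc,
                    "encoding_format": "float",
                },
            }) + "\n")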
I've successfully uploaded the file:
File 'batch_emb_file_1.jsonl' uploaded succesfully:FileObject(id='redacted for work compliance', bytes=6663946, created_at=1720128016, filename='batch_emb_file_1.jsonl', object='file', purpose='batch', status='processed', status_details=None)
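The upload step was presumably the SDK's standard files.create call, something like this (a sketch; the client setup and print format are assumptions):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Upload the jsonl file for use by the Batch API
batch_file = client.files.create(
    file=open('batch_emb_file_1.jsonl', 'rb'),
    purpose='batch',
)
print(f"File '{batch_file.filename}' uploaded successfully: {batch_file}")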
and ran the embedding job:
Batch job created successfully:Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))
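The job creation step is presumably the standard batches.create call (a sketch; the metadata shown mirrors a subset of what appears in the output above):

# Create the batch embedding job from the uploaded file
batch_job_1 = client.batches.create(
    input_file_id=batch_file.id,
    endpoint='/v1/embeddings',
    completion_window='24h',
    metadata={
        'description': 'Batch job for embedding large quantity of product descriptions',
        'project': 'Product Classification',
    },
)
print(f"Batch job created successfully: {batch_job_1}")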
The job completed:
client.batches.retrieve(batch_job_1.id).status
'completed'
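One way to wait for that terminal status is to poll the job (a sketch; the 60-second interval is arbitrary):

import time

# Poll until the batch job reaches a terminal state
while True:
    job = client.batches.retrieve(batch_job_1.id)
    if job.status in ('completed', 'failed', 'expired', 'cancelled'):
        break
    time.sleep(60)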
client.batches.retrieve('redacted for work compliance')
returns:
Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1720135956, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=1720133521, in_progress_at=1720129903, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id='redacted for work compliance', request_counts=BatchRequestCounts(completed=34337, failed=0, total=34337))
But when I try to get the content using the output_file_id string, client.files.content(value of output_file_id) returns:
<openai._legacy_response.HttpxBinaryResponseContent at 0x79ae81ec7d90>
I have tried: client.files.content(value of output_file_id).content
but this kills my kernel (a Jupyter Notebook or similar environment).
What am I doing wrong? Also, I believe I am under-utilizing Batch embeddings: the 90,000 limit conflicts with the batch queue limit of the 'text-embedding-ada-002' model, which is 3,000,000.
Could someone help?
Solution:
Retrieving the embedding data from the batch output file is a bit tricky; this tutorial breaks it down step by step: link
After getting the output_file_id, you need to:
import json
import pandas as pd

# Read the batch output file as text (one JSON object per line)
output_file = client.files.content(output_file_id).text

embedding_results = []
for line in output_file.split('\n')[:-1]:  # the file ends with a newline, so drop the empty last element
    data = json.loads(line)
    custom_id = data.get('custom_id')
    embedding = data['response']['body']['data'][0]['embedding']
    embedding_results.append([custom_id, embedding])

embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])
In my case, this retrieves the embedding data from the batch job's output file.
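If holding the whole response in the notebook is a concern (as with the kernel crash above), one alternative is to write the raw bytes to disk first with the response object's write_to_file method and then parse the file line by line (a sketch; the output file name is illustrative):

import json
import pandas as pd

# Save the binary response to disk instead of materializing it in the notebook
raw = client.files.content(output_file_id)
raw.write_to_file('batch_embeddings_output.jsonl')

rows = []
with open('batch_embeddings_output.jsonl') as f:
    for line in f:
        if not line.strip():
            continue  # skip any blank lines
        data = json.loads(line)
        rows.append([data.get('custom_id'),
                     data['response']['body']['data'][0]['embedding']])
embedding_results = pd.DataFrame(rows, columns=['custom_id', 'embedding'])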