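Today, while splitting a JSON file into datasets, I hit a KeyError: 'instruction'. The dataset is fairly large, so I had only skimmed part of it and assumed the structure was consistent. Here is a sample of the data: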
{"instruction": "描述面向对象编程(OOP)的原则。","input": "OOP 原则包括封装、继承、多态和抽象,促进了有组织和可维护的代码。","output": "输出评价:你对面向对象编程的原则有很好的理解。在你的开发经验中,这些原则是如何指导你编写代码的?"},
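So I was puzzled for a while. After printing the keys of each entry in the JSON file, though, I found that some entries use a different format, and that is what caused the KeyError: 'instruction':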
Keys in example: dict_keys(['instruction', 'input', 'output'])
Keys in example: dict_keys(['instruction', 'input', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['question', 'answer', 'output'])
Keys in example: dict_keys(['instruction', 'input', 'output'])
Keys in example: dict_keys(['instruction', 'input', 'output'])
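With a large file, scrolling through one line of keys per entry gets tedious. A minimal sketch of a more compact check is to count how often each key structure appears. The function name count_key_structures and the sample records below are hypothetical, made up for illustration; in practice the records would come from json.load on the real file.

```python
from collections import Counter


def count_key_structures(records):
    """Count how many records share each (sorted) set of keys."""
    return Counter(tuple(sorted(record.keys())) for record in records)


# Hypothetical sample standing in for the loaded JSON list.
sample = [
    {"instruction": "a", "input": "b", "output": "c"},
    {"question": "q", "answer": "a", "output": "c"},
    {"question": "q2", "answer": "a2", "output": "c2"},
    {"instruction": "d", "input": "e", "output": "f"},
    {"question": "q3", "answer": "a3", "output": "c3"},
]

structure_counts = count_key_structures(sample)

# Print the most common structures first.
for keys, count in structure_counts.most_common():
    print(count, keys)
```

Sorting the keys before counting means two entries with the same keys in a different order are treated as the same structure.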
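So we need to check each example's key structure before processing the data and handle each structure accordingly. While reading the data, we can inspect the keys of every example and extract only the fields we need, so the code no longer fails when it meets entries with a different structure.
Here is the fixed code: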
import json
from sklearn.model_selection import train_test_split

# Read the JSON file
with open('qiyeruangong.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Extract the required keys
X = []
y = []
for example in data:
    # Check the key structure of each example
    if 'instruction' in example and 'input' in example and 'output' in example:
        X.append(example['instruction'] + ' ' + example['input'])
        y.append(example['output'])
    elif 'question' in example and 'answer' in example and 'output' in example:
        X.append(example['question'])
        y.append(example['output'])
    else:
        print("Unsupported example format:", example)

# Split into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2

# Optional: save the split datasets
# (encoding='utf-8' with ensure_ascii=False keeps the Chinese text readable in the output files)
with open('train_data.json', 'w', encoding='utf-8') as train_file:
    json.dump({'X_train': X_train, 'y_train': y_train}, train_file, ensure_ascii=False)
with open('val_data.json', 'w', encoding='utf-8') as val_file:
    json.dump({'X_val': X_val, 'y_val': y_val}, val_file, ensure_ascii=False)
with open('test_data.json', 'w', encoding='utf-8') as test_file:
    json.dump({'X_test': X_test, 'y_test': y_test}, test_file, ensure_ascii=False)
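The "0.25 x 0.8 = 0.2" comment is worth a quick sanity check, since the second split is applied to what remains after the first. A small sketch of that arithmetic, using exact fractions to avoid floating-point noise, is below. The helper name two_stage_fractions is hypothetical and only illustrates the calculation; it does not call scikit-learn.

```python
from fractions import Fraction


def two_stage_fractions(test_size, val_size_of_remainder):
    """Return the overall (train, val, test) shares of a two-stage split."""
    test = Fraction(test_size)
    remainder = 1 - test                      # what is left after the first split
    val = remainder * Fraction(val_size_of_remainder)
    train = remainder - val
    return train, val, test


# Same settings as the script: 0.2 held out first, then 0.25 of the remainder.
train_frac, val_frac, test_frac = two_stage_fractions("0.2", "0.25")
print(train_frac, val_frac, test_frac)  # prints: 3/5 1/5 1/5
```

So the two calls together give roughly a 60/20/20 split; the exact counts on a real dataset depend on how the library rounds.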