1. Introduction
Model training is an iterative tuning process, which means we inevitably run the same training pipeline many times. In the previous article, Fraud Text Classification Fine-tuning (6): Running LoRA on a Single GPU, the whole training procedure boiled down to the following steps:
- Data loading
- Data preprocessing
- Model loading
- Defining the LoRA parameters
- Injecting the fine-tuning (adapter) matrices
- Defining the training arguments
- Building the trainer and starting training
This pipeline is essentially fixed; what actually changes from run to run during tuning falls into two groups:
- Inputs and outputs: data path, model path, output path
- Hyperparameters: LoRA parameters, training arguments
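The variable items above can be gathered into a single config object that each run fills in, while everything else stays fixed. Here is a minimal sketch of that idea; the `RunConfig` name and the default values are illustrative assumptions, not something defined in trainer.py:

```python
from dataclasses import dataclass

@dataclass
class RunConfig:
    # Inputs and outputs: differ for every experiment
    train_path: str
    eval_path: str
    model_path: str
    output_dir: str
    # LoRA hyperparameters (illustrative defaults)
    lora_rank: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.1
    # Training hyperparameters (illustrative defaults)
    learning_rate: float = 1e-4
    num_train_epochs: int = 3

# A new experiment only needs to override what changed
cfg = RunConfig(
    train_path="train.jsonl",
    eval_path="eval.jsonl",
    model_path="Qwen2-1.5B-Instruct",
    output_dir="output",
    lora_rank=16,
)
```

With this split, each tuning run constructs one such config and hands it to the fixed code; nothing inside the training logic needs to be edited between runs.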
Therefore, we extract the parts of the training process that stay the same across runs into trainer.py, whose contents are shown below:
```python
import json

import pandas as pd
from datasets import Dataset


def load_jsonl(path):
    with open(path, 'r') as file:
        data = [json.loads(line) for line in file]
    return pd.DataFrame(data)


def preprocess(item, tokenizer, max_length=2048):
    system_message = "You are a helpful assistant."
    user_message = item['instruction'] + item['input']
    assistant_message = json.dumps({"is_fraud": item["label"]}, ensure_ascii=False)
    instruction = tokenizer(
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(assistant_message, add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # -100 is a special label marking the instruction tokens so they are
    # excluded from the loss computation
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    # Guard against over-long inputs: truncate anything beyond max_length
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


def load_dataset(train_path, eval_path, tokenizer):
    train_df = load_jsonl(train_path)
    train_ds = Dataset.from_pandas(train_df)
    train_dataset = train_ds.map(lambda x: preprocess(x, tokenizer),
                                 remove_columns=train_ds.column_names)
    eval_df = load_jsonl(eval_path)
    eval_ds = Dataset.from_pandas(eval_df)
    eval_dataset = eval_ds.map(lambda x: preprocess(x, tokenizer),
                               remove_columns=eval_ds.column_names)
    return train_dataset, eval_dataset
```
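The key trick in the preprocessing step is the -100 label mask: tokens belonging to the system/user prompt are excluded from the loss, so only the assistant's response is supervised. The convention can be demonstrated standalone; `ToyTokenizer` and `mask_labels` below are hypothetical stand-ins for illustration, not part of trainer.py:

```python
class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer; maps chars to fake ids."""
    pad_token_id = 0

    def __call__(self, text, add_special_tokens=False):
        ids = [ord(c) % 100 + 1 for c in text]  # fake ids, never the pad id 0
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def mask_labels(instruction_ids, response_ids, pad_id):
    # Instruction tokens get -100, which CrossEntropyLoss treats as
    # ignore_index; only the response (plus the trailing pad token that
    # marks the end of generation) contributes to the loss.
    return [-100] * len(instruction_ids) + response_ids + [pad_id]


tok = ToyTokenizer()
inst = tok("<|im_start|>user\nhi<|im_end|>\n")
resp = tok('{"is_fraud": true}')
labels = mask_labels(inst["input_ids"], resp["input_ids"], tok.pad_token_id)

# Every instruction position is masked out of the loss
assert labels[: len(inst["input_ids"])] == [-100] * len(inst["input_ids"])
# Every response position is supervised with its own token id
assert labels[len(inst["input_ids"]):-1] == resp["input_ids"]
```

This is why `labels` in `preprocess` starts with `[-100] * len(instruction["input_ids"])`: the model still attends to the prompt, but is only penalized for its prediction of the answer tokens.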