Core Techniques for Training Large Models like ChatGPT
The Evolution of Large-Model Training Techniques from GPT-3 to ChatGPT
The Three Stages of RLHF-Based Large-Model Training
- Domain-specific pre-training: continue training a pre-trained LLM on raw in-domain text with a causal language modeling objective.
- Supervised fine-tuning (SFT): fine-tune the domain-specific LLM on task- and domain-specific (prompt/instruction, response) pairs.
- RLHF
  - Reward model training: train a model on (prompt, good_response, bad_response) comparison data labeled by human experts, so it can score responses as good or bad (thumbs up / thumbs down).
  - RLHF fine-tuning: use the trained reward model as the reward signal in reinforcement learning (typically PPO) to align the LLM's responses with human preferences.
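The last two stages above can be sketched numerically. Below is a minimal, dependency-free illustration (not the full training loop): the Bradley–Terry pairwise loss commonly used for reward model training, and the per-token shaped reward commonly used in RLHF fine-tuning, where the reward model score is penalized by the KL divergence from a frozen reference policy. Function names and the `kl_coef` default are illustrative assumptions, not from the source.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, used inside the pairwise ranking loss."""
    return 1.0 / (1.0 + math.exp(-x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for one (good_response, bad_response) pair:
    loss = -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward model scores the chosen response higher."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def rlhf_shaped_reward(rm_score: float,
                       logprob_policy: float,
                       logprob_ref: float,
                       kl_coef: float = 0.1) -> float:
    """Reward used during PPO fine-tuning: the reward model's score minus a
    KL-style penalty that keeps the policy close to the frozen SFT reference.
    (kl_coef = 0.1 is an illustrative value, not from the source.)"""
    return rm_score - kl_coef * (logprob_policy - logprob_ref)
```

For example, when the reward model already prefers the chosen response (`reward_model_loss(2.0, 0.0)` ≈ 0.13) the loss is much smaller than when it is indifferent (`reward_model_loss(0.0, 0.0)` = ln 2 ≈ 0.69); and `rlhf_shaped_reward` reduces the raw reward-model score whenever the policy's log-probability drifts above the reference's.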