2402.10038 (arxiv.org)
For each prompt x_i, generate k responses, then take pairs of responses (y_il, y_ij) from those k; compute the reward gap between the two, and if the gap exceeds the threshold η (a hyperparameter), accept the preference triple (x_i, y_il, y_ij).
See Algorithm 1, Preference Data Generation via Rejection Sampling, on page 4 of the paper (a Python sketch follows the pseudocode below):
Algorithm 1: Preference Data Generation via Rejection Sampling

Result: D_P = {(x, y_l, y_w)} : preference dataset
Input:
    {x_1, ..., x_n} : sample prompts from D_RM
    L_SFT : SFT model
    R(x, y) : reward model
    τ : temperature
    η : threshold for preference data selection

for i = 1 : n do
    (y_i1, ..., y_ik) | y_ik ~ L_SFT(·|x_i)      ▷ generate k responses from the SFT model for prompt x_i
    (r_i1, ..., r_ik) | r_ij = R(x_i, y_ij)      ▷ compute the reward for each generated response
    for j = 1 : k do
        for l = 1 : k do
            if j == l then
                continue
            end if
            r_gap = σ((r_ij − r_il) / τ)         ▷ compute the reward gap between the pair y_il and y_ij
            if r_gap > η then
                D_P = {D_P; (x_i, y_il, y_ij)}   ▷ append the accepted sample
            end if
        end for
    end for
end for
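
A minimal Python sketch of this loop, assuming placeholder functions generate_fn (stands in for sampling from the SFT model L_SFT) and reward_fn (stands in for the reward model R); the default values of k, tau, and eta are illustrative, not the paper's settings:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def generate_preference_data(prompts, generate_fn, reward_fn, k=8, tau=1.0, eta=0.85):
    """Rejection-sampling preference data generation (sketch of Algorithm 1).

    generate_fn(prompt, k) -> list of k responses (placeholder for the SFT model)
    reward_fn(prompt, response) -> scalar reward (placeholder for the reward model)
    """
    dataset = []  # list of (prompt, rejected y_l, chosen y_w) triples
    for x in prompts:
        responses = generate_fn(x, k)                   # k responses from the SFT model
        rewards = [reward_fn(x, y) for y in responses]  # reward for each response
        for j in range(k):
            for l in range(k):
                if j == l:
                    continue
                # temperature-scaled, sigmoid-squashed reward gap between the pair
                r_gap = sigmoid((rewards[j] - rewards[l]) / tau)
                if r_gap > eta:
                    # response l is treated as rejected, response j as chosen
                    dataset.append((x, responses[l], responses[j]))
    return dataset

Called with a real SFT sampler and reward model, this returns the (prompt, rejected, chosen) triples that make up D_P.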