2402.10038 (arxiv.org)
For each prompt x_i, generate k responses, then take pairs of responses (y_il, y_ij) from those k; compute the reward gap between the two, and if the gap exceeds the threshold η (a hyperparameter), accept the preference triple (x_i, y_il, y_ij).
See Algorithm 1, Preference Data Generation via Rejection Sampling, on page 4 of the paper (a Python sketch follows the pseudocode below):
Algorithm 1: Preference Data Generation via Rejection Sampling

Result: D_P = {(x, y_l, y_w)} : preference dataset
Input:
    {x_1, ..., x_n} : sample prompts from D_RM
    L_SFT : SFT model
    R(x, y) : reward model
    τ : temperature
    η : threshold for preference data selection

for i = 1 : n do
    (y_i1, ..., y_ik) | y_ik ~ L_SFT(·|x_i)      ▷ generate k responses from the SFT model for prompt x_i
    (r_i1, ..., r_ik) | r_ij = R(x_i, y_ij)      ▷ compute the reward for each generated response
    for j = 1 : k do
        for l = 1 : k do
            if j == l then
                continue
            end if
            r_gap = σ((r_ij − r_il) / τ)         ▷ compute the reward gap between the pair y_il and y_ij
            if r_gap > η then
                D_P = {D_P; (x_i, y_il, y_ij)}   ▷ append the accepted sample
            end if
        end for
    end for
end for
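
A minimal Python sketch of this loop, assuming placeholder functions generate_fn (stands in for sampling from the SFT model L_SFT) and reward_fn (stands in for the reward model R); the default values of k, tau, and eta are illustrative, not the paper's settings:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def generate_preference_data(prompts, generate_fn, reward_fn, k=8, tau=1.0, eta=0.85):
    """Rejection-sampling preference data generation (sketch of Algorithm 1).

    generate_fn(prompt, k) -> list of k responses (placeholder for the SFT model)
    reward_fn(prompt, response) -> scalar reward (placeholder for the reward model)
    """
    dataset = []  # list of (prompt, rejected y_l, chosen y_w) triples
    for x in prompts:
        responses = generate_fn(x, k)                   # k responses from the SFT model
        rewards = [reward_fn(x, y) for y in responses]  # reward for each response
        for j in range(k):
            for l in range(k):
                if j == l:
                    continue
                # temperature-scaled, sigmoid-squashed reward gap between the pair
                r_gap = sigmoid((rewards[j] - rewards[l]) / tau)
                if r_gap > eta:
                    # response l is treated as rejected, response j as chosen
                    dataset.append((x, responses[l], responses[j]))
    return dataset

Called with a real SFT sampler and reward model, this returns the (prompt, rejected, chosen) triples that make up D_P.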