After SFT, an LLM ends up with a large number of redundant delta parameters (the differences between the fine-tuned and pre-trained weights). The Alibaba team proposed DARE to eliminate most of these delta parameters and merge what remains back into the pre-trained (PRE) model, so that the capabilities of multiple source models can be absorbed into one.
DARE requires no GPU retraining. The idea is very simple and works much like dropout:
$$
\begin{gathered}
\boldsymbol{m}^t \sim \operatorname{Bernoulli}(p) \\
\widetilde{\boldsymbol{\delta}}^t = \left(\mathbf{1}-\boldsymbol{m}^t\right) \odot \boldsymbol{\delta}^t \\
\hat{\boldsymbol{\delta}}^t = \widetilde{\boldsymbol{\delta}}^t / (1-p) \\
\boldsymbol{\theta}_{\mathrm{DARE}}^t = \hat{\boldsymbol{\delta}}^t + \boldsymbol{\theta}_{\mathrm{PRE}}
\end{gathered}
$$
Two steps (see the code sketch below):
- drop: randomly mask delta parameters to zero with probability $p$
- rescale: rescale the surviving delta parameters by $1/(1-p)$, which keeps their expected value unchanged: $\mathbb{E}_{\text{no mask}} = x$ and $\mathbb{E}_{\text{mask}} = \frac{(1-p)\,x}{1-p} = x$
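To make the drop-and-rescale step concrete, here is a minimal PyTorch sketch that applies DARE to the state dicts of a pre-trained model and an SFT model. The function name `dare` and the dict-of-float-tensors interface are illustrative assumptions, not the official MergeLM API.

```python
import torch

def dare(theta_sft: dict, theta_pre: dict, p: float) -> dict:
    """Drop-And-REscale sketch: drop each delta parameter with probability p,
    rescale the survivors by 1/(1-p), and add them back onto the PRE weights.
    Assumes both dicts hold float weight tensors with matching keys."""
    theta_dare = {}
    for name, w_pre in theta_pre.items():
        delta = theta_sft[name] - w_pre                     # delta parameters from SFT
        mask = torch.bernoulli(torch.full_like(delta, p))   # m^t ~ Bernoulli(p); 1 means "drop"
        delta = (1.0 - mask) * delta                        # drop: zero out masked entries
        delta = delta / (1.0 - p)                           # rescale: keeps E[delta] unchanged
        theta_dare[name] = w_pre + delta                    # theta_DARE = delta_hat + theta_PRE
    return theta_dare
```

Because the surviving deltas are divided by $1-p$, the merged weights match the SFT weights in expectation, which is the intuition behind why very high drop rates (the referenced article reports dropping up to 99% of the delta parameters) can still preserve performance.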
Conventional model merging simply takes a weighted sum of the parameters, which causes a sharp drop in model capability; by first dropping and rescaling the delta parameters, DARE avoids this problem.
Multi-source model merging
$$
\begin{gathered}
\boldsymbol{\theta}_{\mathrm{DARE}}^{t_k} = \operatorname{DARE}\left(\boldsymbol{\theta}_{\mathrm{SFT}}^{t_k}, \boldsymbol{\theta}_{\mathrm{PRE}}\right), \text{ for } 1 \leq k \leq K, \\
\boldsymbol{\theta}_{\mathrm{M}} = \boldsymbol{\theta}_{\mathrm{PRE}} + \lambda \cdot \sum_{k=1}^K \hat{\boldsymbol{\delta}}^{t_k} = \boldsymbol{\theta}_{\mathrm{PRE}} + \lambda \cdot \sum_{k=1}^K \left(\boldsymbol{\theta}_{\mathrm{DARE}}^{t_k} - \boldsymbol{\theta}_{\mathrm{PRE}}\right).
\end{gathered}
$$
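A sketch of the multi-source merge under the same assumptions, reusing the `dare` function above: each SFT model's deltas are dropped and rescaled independently, and the $\lambda$-scaled sum of the resulting deltas is added to the PRE weights.

```python
def merge_multi_source(theta_pre: dict, theta_sfts: list, p: float, lam: float) -> dict:
    """Merge K SFT models: apply DARE to each one, then add the
    lambda-scaled sum of the rescaled deltas onto the PRE weights."""
    merged = {name: w.clone() for name, w in theta_pre.items()}
    for theta_sft in theta_sfts:                              # k = 1..K source models
        theta_dare = dare(theta_sft, theta_pre, p)            # per-model drop + rescale
        for name, w_pre in theta_pre.items():
            merged[name] += lam * (theta_dare[name] - w_pre)  # lambda * delta_hat^{t_k}
    return merged
```

For example, merging two SFT checkpoints could look like `merge_multi_source(pre.state_dict(), [sft_a.state_dict(), sft_b.state_dict()], p=0.9, lam=1.0)` (hypothetical variable names).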
Flowchart:
Experimental results
References
- 丢弃99%的参数!阿里团队提出语言模型合体术,性能暴涨且无需重新训练和GPU ("Drop 99% of the parameters! Alibaba team proposes a language-model merging technique: big gains with no retraining and no GPU")
- MergeLM