
Title: A Deep Dive into LLM Pretraining vs. SFT Alignment: Loss Function Differences and Code Walkthrough
Author: 链载Ai    Posted: two days ago at 11:58


An LLM (Large Language Model) is guided by a loss function in both the pretraining and the alignment stage, but the two stages differ significantly in how the loss is designed and what it is meant to optimize.
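As background, both stages usually build on the same next-token (causal language modeling) cross-entropy; what changes between them is which tokens contribute to the loss and what data the model sees. Below is a minimal, illustrative sketch of that shared objective; the function and tensor names are ours, not taken from any particular library.

import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size) from a causal LM; labels: (batch, seq_len) token ids.
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Token-level cross-entropy; positions labeled ignore_index (-100) are excluded from the loss.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )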

1. Pretraining stage:

In pretraining, the model learns with a self-supervised objective (causal language modeling, i.e. next-token prediction) on massive unlabeled corpora. The code below is the LabelSmoother helper from Hugging Face transformers, which computes this token-level cross-entropy on a model's output, with optional label smoothing:

from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class LabelSmoother:
    """
    Adds label-smoothing on a pre-computed output from a Transformers model.

    Args:
        epsilon (`float`, *optional*, defaults to 0.1):
            The label smoothing factor.
        ignore_index (`int`, *optional*, defaults to -100):
            The index in the labels to ignore when computing the loss.
    """

    epsilon: float = 0.1
    ignore_index: int = -100

    def __call__(self, model_output, labels, shift_labels=False):
        logits = model_output["logits"] if isinstance(model_output, dict) else model_output[0]
        if shift_labels:
            logits = logits[..., :-1, :].contiguous()
            labels = labels[..., 1:].contiguous()

        log_probs = -nn.functional.log_softmax(logits, dim=-1)
        if labels.dim() == log_probs.dim() - 1:
            labels = labels.unsqueeze(-1)

        padding_mask = labels.eq(self.ignore_index)
        # In case the ignore_index is -100, the gather will fail, so we replace labels by 0.
        # The padding_mask will ignore them in any case.
        labels = torch.clamp(labels, min=0)
        nll_loss = log_probs.gather(dim=-1, index=labels)
        # works for fp16 input tensor too, by internally upcasting it to fp32
        smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32)

        nll_loss.masked_fill_(padding_mask, 0.0)
        smoothed_loss.masked_fill_(padding_mask, 0.0)

        # Take the mean over the label dimensions, then divide by the number of active elements (i.e. not-padded):
        num_active_elements = padding_mask.numel() - padding_mask.long().sum()
        nll_loss = nll_loss.sum() / num_active_elements
        smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
        return (1 - self.epsilon) * nll_loss + self.epsilon * smoothed_loss
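For intuition, here is a hedged usage sketch of the LabelSmoother above in a causal LM setting: the labels are simply the input ids, so every token of the raw text is supervised, which is the typical pretraining-style objective. The "gpt2" checkpoint is only a placeholder; inside the Hugging Face Trainer this class is used when label_smoothing_factor is set to a non-zero value.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder causal LM checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")
outputs = model(**batch)

label_smoother = LabelSmoother(epsilon=0.1)
# shift_labels=True aligns the logits at position t with the token at t+1 (next-token prediction);
# the labels here are just the input ids, i.e. the loss covers every token of the raw text.
loss = label_smoother(outputs, batch["input_ids"], shift_labels=True)
print(loss)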

2. Alignment stage (SFT: Supervised Fine-Tuning):

In SFT, the model is fine-tuned with a supervised objective on high-quality labeled prompt-response data. The excerpt below is from a Seq2SeqTrainer subclass used for SFT; its prediction_step removes the prompt part from the generated tokens (and handles left-padding) so that only the response is evaluated:

@override
def prediction_step(
    self,
    model: "torch.nn.Module",
    inputs: Dict[str, Union["torch.Tensor", Any]],
    prediction_loss_only: bool,
    ignore_keys: Optional[List[str]] = None,
) -> Tuple[Optional[float], Optional["torch.Tensor"], Optional["torch.Tensor"]]:
    r"""
    Removes the prompt part in the generated tokens.

    Subclass and override to inject custom behavior.
    """
    labels = inputs["labels"] if "labels" in inputs else None
    if self.args.predict_with_generate:
        assert self.tokenizer.padding_side == "left", "This method only accepts left-padded tensor."
        labels = labels.detach().clone() if labels is not None else None  # backup labels
        prompt_len, label_len = inputs["input_ids"].size(-1), inputs["labels"].size(-1)
        if prompt_len > label_len:
            inputs["labels"] = self._pad_tensors_to_target_len(inputs["labels"], inputs["input_ids"])
        if label_len > prompt_len:  # truncate the labels instead of padding the inputs (llama2 fp16 compatibility)
            inputs["labels"] = inputs["labels"][:, :prompt_len]

    loss, generated_tokens, _ = super().prediction_step(  # ignore the returned labels (may be truncated)
        model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
    )
    if generated_tokens is not None and self.args.predict_with_generate:
        generated_tokens[:, :prompt_len] = self.tokenizer.pad_token_id
        generated_tokens = generated_tokens.contiguous()

    return loss, generated_tokens, labels


def _pad_tensors_to_target_len(self, src_tensor: "torch.Tensor", tgt_tensor: "torch.Tensor") -> "torch.Tensor":
    r"""
    Pads the tensor to the same length as the target tensor.
    """
    assert self.tokenizer.pad_token_id is not None, "Pad token is required."
    padded_tensor = self.tokenizer.pad_token_id * torch.ones_like(tgt_tensor)
    padded_tensor[:, -src_tensor.shape[-1]:] = src_tensor  # adopt left-padding
    return padded_tensor.contiguous()  # in contiguous memory
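At the loss level, the key difference in SFT is usually not a new formula but which tokens are supervised: the prompt positions are masked with ignore_index (-100) so that the cross-entropy only covers the response. Here is a minimal sketch of that label construction; the helper name and the prompt/response handling are illustrative, not taken from the trainer above.

import torch

IGNORE_INDEX = -100  # same ignore_index convention as in the snippets above

def build_sft_example(tokenizer, prompt: str, response: str):
    # Tokenize prompt and response separately so we know where the prompt ends.
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

    input_ids = torch.tensor(prompt_ids + response_ids)
    # Prompt positions get IGNORE_INDEX, so the cross-entropy (and hence the SFT loss)
    # is computed only on the response tokens.
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + response_ids)
    return {"input_ids": input_ids, "labels": labels}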

Summary:

Aspect               | Pretraining                                       | Alignment (SFT)
Objective            | Learn general language representations            | Transfer to specific tasks
Data                 | Massive unlabeled data                            | High-quality labeled data
Loss function        | Self-supervised (MLM, CLM)                        | Supervised (Cross-Entropy, MSE)
Loss characteristics | Larger values; focused on language understanding  | Smaller values; focused on task performance

Note that these are only common distinctions; practice can be more complicated. For example, some pretraining setups also use small amounts of labeled data, and some alignment tasks also use self-supervised methods.

Overall, the loss design of both the pretraining and the alignment stage is critical: together they determine the final performance of the LLM.






