The DoRA authors decompose the pre-trained weight matrix W into a magnitude vector m of size 1 x d and a direction matrix V, so that magnitude and direction can be trained independently: the direction matrix V is updated with a LoRA-style low-rank product B·A, while m is trained as-is, giving an effective weight W' = m · (W + BA) / ||W + BA|| (column-wise norm). Whereas LoRA tends to change magnitude and direction together (as indicated by the high positive correlation between the two), DoRA can more easily adjust the two separately, or offset a change in one with an opposite change in the other, so the relationship between magnitude and direction updates in DoRA is closer to that of full fine-tuning. The code is as follows:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# This layer is dropped into your pre-trained PyTorch model where nn.Linear is used
class DoRALayer(nn.Module):
    def __init__(self, d_in, d_out, rank=4, weight=None, bias=None):
        super().__init__()

        # Frozen pre-trained weight and bias (not updated during training)
        if weight is not None:
            self.weight = nn.Parameter(weight, requires_grad=False)
        else:
            self.weight = nn.Parameter(torch.Tensor(d_out, d_in), requires_grad=False)

        if bias is not None:
            self.bias = nn.Parameter(bias, requires_grad=False)
        else:
            self.bias = nn.Parameter(torch.Tensor(d_out), requires_grad=False)

        # m = Magnitude column-wise across output dimension
        self.m = nn.Parameter(self.weight.norm(p=2, dim=0, keepdim=True))

        # Low-rank factors for the direction update
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.lora_A = nn.Parameter(torch.randn(d_out, rank) * std_dev)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_in))

    def forward(self, x):
        # Effective weight: magnitude m times the column-wise unit-norm direction of (W + ΔW)
        lora = torch.matmul(self.lora_A, self.lora_B)
        adapted = self.weight + lora
        column_norm = adapted.norm(p=2, dim=0, keepdim=True)
        norm_adapted = adapted / column_norm
        calc_weights = self.m * norm_adapted
        return F.linear(x, calc_weights, self.bias)
```
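As a minimal usage sketch (assuming the DoRALayer class above; the helper name replace_linear_with_dora and the toy model are illustrative, not from the original), each nn.Linear in a pre-trained model can be swapped for a DoRALayer that reuses its frozen weight and bias, so that only m, lora_A, and lora_B receive gradients:

```python
import torch
import torch.nn as nn

def replace_linear_with_dora(model: nn.Module):
    # Recursively replace every nn.Linear with a DoRALayer that reuses the
    # frozen pre-trained weight/bias; only m, lora_A, lora_B remain trainable.
    # (Hypothetical helper for illustration, not part of the original snippet.)
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            dora = DoRALayer(
                d_in=module.in_features,
                d_out=module.out_features,
                weight=module.weight.data.clone(),
                bias=module.bias.data.clone() if module.bias is not None
                     else torch.zeros(module.out_features),
            )
            setattr(model, name, dora)
        else:
            replace_linear_with_dora(module)

# Toy two-layer MLP standing in for a pre-trained model
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
replace_linear_with_dora(model)
out = model(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 4])
```

Because the pre-trained weight and bias are registered with requires_grad=False, passing model.parameters() to an optimizer only updates the magnitude vector and the low-rank factors.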
LongLoRA
LongLoRA is a paper published in 2023 by The Chinese University of Hong Kong and MIT. It mainly addresses the heavy computational cost of the attention mechanism over long contexts.