Suppose you have a model whose final layer is a dot product between a vector produced only from the context and a vector produced only from the response. I use models of this form as “level 1” models because they facilitate precomputation of a fast serving index; note that the following trick does not apply to architectures, such as bidirectional attention, that mix the two inputs before the final layer. For these models you can train more efficiently by drawing the negatives from the same mini-batch: every other response in the batch serves as a negative for a given context. This is a well-known trick, but I couldn’t find anybody explaining how to do it explicitly in PyTorch.
Structure your model to have a leftforward and a rightforward like this:
    class MyModel(nn.Module):
        ...
        def forward(self, leftinput, rightinput):
            leftvec = self.leftforward(leftinput)
            rightvec = self.rightforward(rightinput)
            # per-pair dot product, shape (batch_size, 1)
            return torch.mul(leftvec, rightvec).sum(dim=1, keepdim=True)
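To make the split concrete, here is a minimal sketch of such a model. The class name TwoTowerSketch, the single linear layer per tower, and all the dimensions are assumptions for illustration only, not the model I actually use. It checks that the paired scores from forward are exactly the diagonal of the all-pairs score matrix, which is why the off-diagonal entries can serve as negatives at no extra cost.

```python
import torch
import torch.nn as nn


class TwoTowerSketch(nn.Module):
    """Hypothetical stand-in for MyModel: each tower is one linear layer."""

    def __init__(self, left_dim, right_dim, embed_dim):
        super().__init__()
        self.left = nn.Linear(left_dim, embed_dim)
        self.right = nn.Linear(right_dim, embed_dim)

    def leftforward(self, leftinput):
        return self.left(leftinput)

    def rightforward(self, rightinput):
        return self.right(rightinput)

    def forward(self, leftinput, rightinput):
        leftvec = self.leftforward(leftinput)
        rightvec = self.rightforward(rightinput)
        return torch.mul(leftvec, rightvec).sum(dim=1, keepdim=True)


torch.manual_seed(0)
model = TwoTowerSketch(left_dim=8, right_dim=6, embed_dim=4)
leftinput = torch.randn(5, 8)   # 5 contexts
rightinput = torch.randn(5, 6)  # their 5 paired responses

# Score of each context against its own response: shape (5, 1).
paired_scores = model(leftinput, rightinput)

# Score of every context against every response in the batch: shape (5, 5).
all_scores = torch.mm(model.leftforward(leftinput),
                      model.rightforward(rightinput).t())

# The diagonal of the all-pairs matrix reproduces the paired scores.
diag_matches = torch.allclose(paired_scores.squeeze(1),
                              all_scores.diagonal(), atol=1e-5)
```

Each tower runs once per example, and the batch_size × batch_size score matrix comes from a single matrix multiply.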
At training time, compute the leftforward and rightforward for your mini-batch separately:
    ...
    criterion = BatchPULoss()
    model = MyModel()
    ...
    leftvec = model.leftforward(batch.leftinput)
    rightvec = model.rightforward(batch.rightinput)
    (loss, preds) = criterion.fortraining(leftvec, rightvec)
    loss.backward()
    # "preds" contains the index of the highest-scoring right for each left,
    # so for instance, calculate "mini-batch precision at 1"
    gold_labels = torch.arange(batch.batch_size, device=preds.device)
    n_correct += (preds == gold_labels).sum().item()
    ...
Finally use this loss:
    import torch

    class BatchPULoss():
        def __init__(self):
            self.loss = torch.nn.CrossEntropyLoss()

        def fortraining(self, left, right):
            # outer[i, j] scores left i against right j; the diagonal holds the true pairs
            outer = torch.mm(left, right.t())
            # so the correct "class" for row i is column i
            labels = torch.arange(outer.shape[0], device=outer.device)
            loss = self.loss(outer, labels)
            _, preds = torch.max(outer, dim=1)
            return (loss, preds)

        def __call__(self, *args, **kwargs):
            return self.loss(*args, **kwargs)
At training time you call the fortraining method, but if you have fixed distractors for evaluation you can also call the loss object directly, just like CrossEntropyLoss.
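For the fixed-distractor case, here is a rough sketch of what that direct call might look like. Because __call__ simply delegates to CrossEntropyLoss, the sketch uses torch.nn.CrossEntropyLoss itself to stay self-contained. The tensor names, the random vectors standing in for model outputs, and the convention that the true response sits at candidate index 0 are all assumptions for illustration.

```python
import torch

torch.manual_seed(0)
batch_size, n_candidates, embed_dim = 4, 10, 16

# Stand-ins for model outputs: one context vector per example, and a fixed
# set of candidate response vectors per example (true response + distractors).
leftvec = torch.randn(batch_size, embed_dim)
candvecs = torch.randn(batch_size, n_candidates, embed_dim)

# Assumption: the true response is stored at candidate index 0.
gold = torch.zeros(batch_size, dtype=torch.long)

# scores[i, j] = dot(leftvec[i], candvecs[i, j]); shape (batch_size, n_candidates).
scores = torch.bmm(candvecs, leftvec.unsqueeze(2)).squeeze(2)

# Same call signature as BatchPULoss()(scores, gold).
criterion = torch.nn.CrossEntropyLoss()
loss = criterion(scores, gold)

# Precision at 1 over the fixed candidate sets.
preds = scores.argmax(dim=1)
precision_at_1 = (preds == gold).float().mean().item()
```

The only difference from training is where the columns of the score matrix come from: fixed candidates per example instead of the other rows of the mini-batch.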