This happens with ResNets, DenseNets, and even a vanilla VGG without batch norm. I haven't experienced it while using data augmentation or decaying learning rates. The image above is the training log of a VGG on CIFAR10, with a learning rate of 0.005 (SGD + Nesterov) and a weight decay of 0.0005. The frequency of these "cycles" seems to be very dependent on the learning rate and weight decay, and they only happen at 100% training accuracy.
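For reference, the optimizer settings from that run look like this (a minimal sketch; the momentum value of 0.9 is my assumption, since only the learning rate and weight decay are given above):

```python
import torch

# Toy stand-in for the actual VGG on CIFAR10.
model = torch.nn.Linear(32, 10)

# Settings from the run above: SGD with Nesterov momentum,
# lr 0.005 and weight decay 0.0005. momentum=0.9 is assumed.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.005,
    momentum=0.9,
    nesterov=True,
    weight_decay=0.0005,
)
```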
Also, I'm using the ImageNet example from PyTorch's examples repo (except I'm training on CIFAR instead). I've tried removing most of the code, almost to the point of only having forward / zero_grad / backward / step, and this still happens. I've also tried other training scripts and repos, with no luck. I'm guessing this isn't a bug or a PyTorch-specific issue, but a general issue in network optimization?
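To be concrete, the stripped-down loop is essentially just this (a sketch with a toy model and random tensors standing in for the VGG and the CIFAR10 loader):

```python
import torch

# Toy stand-ins for the real model and data.
model = torch.nn.Linear(32, 10)
inputs = torch.randn(8, 32)
targets = torch.randint(0, 10, (8,))

optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, nesterov=True,
                            weight_decay=0.0005)
criterion = torch.nn.CrossEntropyLoss()

for step in range(3):
    output = model(inputs)             # forward
    loss = criterion(output, targets)
    optimizer.zero_grad()              # zero_grad
    loss.backward()                    # backward
    optimizer.step()                   # step
```

Even with everything else removed, the cycles still show up.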
Has anyone observed this before? It's a huge obstacle for a current project of mine, which involves collecting a lot of statistics on generalization and regularization in order to model network dynamics.