A while ago I read a paper on arXiv about the topic in this title, but I’ve lost track of it. The paper was probably released in June or July, and it’s a computer vision counterpart of OpenAI’s similar paper that uses a Transformer LM. These papers essentially say that pre-training on a large dataset plus fine-tuning beats existing algorithms without . Could anyone give me the link to the former paper?

Edit: I finally found it! It’s Deep Clustering for Unsupervised Learning of Visual Features. Thanks, everyone, for posting interesting papers!

