BEiT: BERT Pre-training of Image Transformers
Motivated by BERT, they turn to the denoising auto-encoding idea to pretrain vision Transformers, which has not been well studied by the vision community.
There is no pre-existing vocabulary for the vision Transformer's input units, i.e., image patches, so they cannot simply employ a softmax classifier to predict over all possible candidates for masked patches.
They propose a masked image modeling (MIM) task to pretrain vision Transformers in a self-supervised manner: a portion of image patches is masked, and the model is trained to predict the discrete visual tokens of the original patches, where the tokens come from a pre-trained image tokenizer (a discrete VAE). They also provide a theoretical explanation from the perspective of a variational autoencoder.
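Below is a minimal PyTorch sketch of the MIM recipe described above, not the paper's actual implementation. The class `ToyMIMHead`, the layer sizes, the ~40% masking ratio, and the random visual-token targets are all illustrative assumptions; in the paper the targets would come from a frozen discrete-VAE tokenizer and the masking is block-wise.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): 196 patches of a 224x224 image,
# an 8192-entry visual-token vocabulary, hidden size 768.
NUM_PATCHES = 196
VOCAB_SIZE = 8192
HIDDEN_DIM = 768


class ToyMIMHead(nn.Module):
    """Hypothetical stand-in for a vision Transformer backbone with an MIM prediction head."""

    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, HIDDEN_DIM)  # embed a flattened 16x16 RGB patch
        self.mask_token = nn.Parameter(torch.zeros(1, 1, HIDDEN_DIM))  # learnable [MASK] embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(HIDDEN_DIM, nhead=12, batch_first=True),
            num_layers=2,  # tiny for illustration; a real backbone is much deeper
        )
        self.to_vocab = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)  # softmax classifier over visual tokens

    def forward(self, patches, mask):
        # patches: (B, N, 768) flattened pixel patches; mask: (B, N) bool, True = masked position
        x = self.patch_embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x)
        return self.to_vocab(x)  # logits over the visual-token vocabulary


# One training step on random data (targets would normally come from a frozen tokenizer).
model = ToyMIMHead()
B = 2
patches = torch.randn(B, NUM_PATCHES, 16 * 16 * 3)
visual_tokens = torch.randint(0, VOCAB_SIZE, (B, NUM_PATCHES))
mask = torch.rand(B, NUM_PATCHES) < 0.4  # roughly 40% of patches masked (assumed ratio)

logits = model(patches, mask)
# The cross-entropy loss is computed only on the masked positions.
loss = nn.functional.cross_entropy(logits[mask], visual_tokens[mask])
loss.backward()
```

The key design choice mirrored here is that prediction happens in a discrete token space rather than raw pixel space, which is what lets a BERT-style softmax classifier be used at all.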