BEiT: BERT Pre-Training of Image Transformers

Motivated by BERT, they turn to the denoising auto-encoding idea to pre-train vision Transformers, an approach that had not been well studied by the vision community.

There is no pre-existing vocabulary for the vision Transformer's input units, i.e., image patches, so they cannot simply employ a softmax classifier to predict over all possible candidates for the masked patches.

They propose a masked image modeling (MIM) task to pre-train vision Transformers in a self-supervised manner: an image tokenizer (a discrete VAE) first maps each image to a sequence of discrete visual tokens, and the model learns to recover the visual tokens of the masked patches. They also provide a theoretical explanation from the perspective of the variational autoencoder.
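
Concretely, the pre-training objective can be summarized with a minimal PyTorch-style sketch (below). The module and variable names (MaskedImageModeling, patch_embed, visual_tokens, etc.) are illustrative assumptions rather than the authors' released code, and the image tokenizer that produces the visual-token ids is assumed to be trained separately and kept frozen.

```python
import torch
import torch.nn as nn

class MaskedImageModeling(nn.Module):
    """Sketch of a BEiT-style MIM objective: replace masked patch embeddings
    with a learnable mask token and predict the corresponding visual tokens."""
    def __init__(self, patch_embed: nn.Module, encoder: nn.Module,
                 hidden_dim: int = 768, vocab_size: int = 8192):
        super().__init__()
        self.patch_embed = patch_embed   # image -> patch embeddings [B, N, D]
        self.encoder = encoder           # vision Transformer backbone
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.head = nn.Linear(hidden_dim, vocab_size)  # softmax over visual tokens

    def forward(self, images, visual_tokens, mask):
        # images: [B, 3, H, W]; visual_tokens: [B, N] ids from a frozen tokenizer
        # mask: [B, N] boolean, True where a patch is masked
        x = self.patch_embed(images)                            # [B, N, D]
        m = self.mask_token.expand(x.size(0), x.size(1), -1)
        x = torch.where(mask.unsqueeze(-1), m, x)               # corrupt masked patches
        h = self.encoder(x)                                     # [B, N, D]
        logits = self.head(h)                                   # [B, N, vocab_size]
        # cross-entropy computed only over the masked positions
        return nn.functional.cross_entropy(logits[mask], visual_tokens[mask])
```

Only the backbone (patch embedding plus encoder) is reused downstream; the prediction head is needed only during pre-training.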

PowerPoint for this talk
