BEiT: BERT Pre-Training of Image Transformers

Motivated by BERT, they turn to the denoising auto-encoding idea to pre-train vision Transformers, an approach that had not been well studied by the vision community.

There is no pre-existing vocabulary for the vision Transformer's input units, i.e., image patches, so they cannot simply employ a softmax classifier to predict over all possible candidates for the masked patches.

They propose a masked image modeling (MIM) task to pre-train vision Transformers in a self-supervised manner: an image tokenizer (a discrete VAE) first maps each image to a sequence of discrete visual tokens, and the model learns to recover the visual tokens of the masked patches. They also provide a theoretical explanation from the perspective of the variational autoencoder.
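
Concretely, the pre-training objective can be summarized with a minimal PyTorch-style sketch (below). The module and variable names (MaskedImageModeling, patch_embed, visual_tokens, etc.) are illustrative assumptions rather than the authors' released code, and the image tokenizer that produces the visual-token ids is assumed to be trained separately and kept frozen.

```python
import torch
import torch.nn as nn

class MaskedImageModeling(nn.Module):
    """Sketch of a BEiT-style MIM objective: replace masked patch embeddings
    with a learnable mask token and predict the corresponding visual tokens."""
    def __init__(self, patch_embed: nn.Module, encoder: nn.Module,
                 hidden_dim: int = 768, vocab_size: int = 8192):
        super().__init__()
        self.patch_embed = patch_embed   # image -> patch embeddings [B, N, D]
        self.encoder = encoder           # vision Transformer backbone
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.head = nn.Linear(hidden_dim, vocab_size)  # softmax over visual tokens

    def forward(self, images, visual_tokens, mask):
        # images: [B, 3, H, W]; visual_tokens: [B, N] ids from a frozen tokenizer
        # mask: [B, N] boolean, True where a patch is masked
        x = self.patch_embed(images)                            # [B, N, D]
        m = self.mask_token.expand(x.size(0), x.size(1), -1)
        x = torch.where(mask.unsqueeze(-1), m, x)               # corrupt masked patches
        h = self.encoder(x)                                     # [B, N, D]
        logits = self.head(h)                                   # [B, N, vocab_size]
        # cross-entropy computed only over the masked positions
        return nn.functional.cross_entropy(logits[mask], visual_tokens[mask])
```

Only the backbone (patch embedding plus encoder) is reused downstream; the prediction head is needed only during pre-training.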

PowerPoint for this talk
