
Google | CoCa: Contrastive Captioners are Image-Text Foundation Models

2022-05-15 07:32:44 | Zhiyuan community

title: Google | CoCa: Contrastive Captioners are Image-Text Foundation Models

author: Jiahui Yu, Zirui Wang, Yonghui Wu, et al.

brief introduction: This paper presents CoCa (Contrastive Captioner), a minimalist design for pretraining an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming the capabilities of both contrastive approaches (such as CLIP) and generative approaches (such as SimVLM). The authors apply the contrastive loss between unimodal image and text embeddings, while the captioning loss is an autoregressive prediction objective on the output of a multimodal decoder. Because the two objectives share the same computation graph, both can be computed efficiently with minimal overhead. By simply treating all labels as text, CoCa is pretrained end-to-end and from scratch on web-scale alt-text data and annotated images, seamlessly unifying natural-language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with task-specific adaptation across a broad range of downstream tasks, including visual recognition, cross-modal retrieval, multimodal understanding, and image captioning. Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and a state-of-the-art 91.0% top-1 accuracy with a fine-tuned encoder.
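The dual-objective training described above can be sketched as follows: a symmetric InfoNCE-style contrastive loss over the unimodal image and text embeddings, plus an autoregressive cross-entropy captioning loss over the multimodal decoder's token logits, combined as a weighted sum. This is a minimal NumPy sketch of the idea only; the temperature, the loss weights `lam_con`/`lam_cap`, and all function names are illustrative assumptions, not the paper's exact formulation or values.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss between unimodal embeddings.
    Assumes row i of img_emb and txt_emb form a matched pair."""
    # L2-normalize so logits are scaled cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); matched pairs on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as the correct class
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        idx = np.arange(len(l))
        return -np.log(p[idx, idx]).mean()

    # average image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def captioning_loss(decoder_logits, target_ids):
    """Autoregressive cross-entropy over caption tokens.
    decoder_logits: (B, T, V) multimodal-decoder outputs; target_ids: (B, T)."""
    l = decoder_logits - decoder_logits.max(axis=-1, keepdims=True)
    p = np.exp(l) / np.exp(l).sum(axis=-1, keepdims=True)
    B, T = target_ids.shape
    tok_p = p[np.arange(B)[:, None], np.arange(T), target_ids]
    return -np.log(tok_p).mean()

def coca_loss(img_emb, txt_emb, decoder_logits, target_ids,
              lam_con=1.0, lam_cap=2.0):
    # weighted sum of the two objectives (weights here are assumptions)
    return (lam_con * contrastive_loss(img_emb, txt_emb)
            + lam_cap * captioning_loss(decoder_logits, target_ids))
```

In the actual model both losses are backpropagated through one shared forward pass (the unimodal text representation that feeds the contrastive loss also feeds the multimodal decoder), which is what makes the combined objective cheap to compute.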

Paper download :

copyright notice
author [Zhiyuan community]. Please include a link to the original when reprinting. Thank you.
