
DALL·E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents

2022-08-06 07:54:18 · Kun Li

Many of the explanations of DALL·E 2 available online are not very clear. The three major directions in generative modeling are VAEs, GANs, and diffusion models. The VAE lineage runs AE → DAE → VAE → VQ-VAE → diffusion, and the diffusion lineage runs DDPM → Improved DDPM → Diffusion Models Beat GANs → GLIDE → DALL·E 2.


CLIP is robust to shifts in image distribution and supports zero-shot transfer; diffusion models deliver both sample diversity and good fidelity. DALL·E 2 combines the strengths of the two.
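At the heart of CLIP is contrastive matching of text and image embeddings by cosine similarity. A minimal sketch of that scoring step, using hypothetical pre-computed embeddings in place of real encoder outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings standing in for CLIP's text and image encoder outputs.
text_embedding = np.array([0.2, 0.9, -0.1])
image_embedding = np.array([0.25, 0.85, -0.05])

# A matched text-image pair should score close to 1.0.
score = cosine_similarity(text_embedding, image_embedding)
```

During CLIP training, these similarities are computed for every text-image pair in a batch, and the matched pairs on the diagonal are pushed to score higher than all mismatched pairs.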


The figure above is well worth studying. Above the dotted line sits CLIP, which is pretrained in advance; its weights are frozen and never updated during DALL·E 2 training. Training again uses paired data: a text and its corresponding image. The text first passes through CLIP's text encoding module (a Transformer text encoder; CLIP uses a ViT for images). CLIP is fundamentally contrastive learning, so the encodings of the two modalities are what matter: once both modalities are encoded, similarity is computed directly as cosine similarity. The CLIP image embedding of the paired image serves as the ground truth.

The text embedding is fed into the first-stage prior model, which is a diffusion model (an autoregressive Transformer can also be used). The prior outputs an image embedding, which is supervised by the image embedding produced by CLIP, so it is in effect trained with supervision. The prior is followed by a decoder module. In the earlier DALL·E, the encoder and decoder were trained jointly in a dVAE, but here the decoder is trained separately and is itself a diffusion model.

In effect, the generative model below the dotted line turns a single generation step into explicit two-stage image generation, and the authors validate this explicit scheme experimentally. The paper calls its method unCLIP: CLIP maps input text and images to features, while DALL·E 2 runs the reverse direction, from text features to image features and then to an image; the step from image features to pixels is realized by a diffusion model.

The decoder uses both classifier-free guidance and CLIP guidance. Guidance here acts on the decoder's denoising process: the input is a noisy image at time step t, and the final output is an image. At each step, the feature map produced by the U-Net can be scored by an image classifier.
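The two-stage structure described above can be sketched as follows. All three functions are stand-ins, not OpenAI's implementation: the frozen CLIP text encoder, the prior, and the diffusion decoder are each reduced to a placeholder so the data flow is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_text_encode(text):
    # Stand-in for CLIP's frozen text encoder: text -> 512-d text embedding.
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.standard_normal(512)

def prior(text_embedding):
    # Stand-in for the prior (diffusion, or autoregressive Transformer):
    # maps a CLIP text embedding to a CLIP image embedding.
    return text_embedding + 0.1 * rng.standard_normal(512)

def decoder(image_embedding):
    # Stand-in for the diffusion decoder: maps a CLIP image embedding
    # to pixels (here, a 64x64 RGB array).
    return rng.random((64, 64, 3))

def unclip_generate(text):
    z_t = clip_text_encode(text)   # stage 0: frozen CLIP text encoder
    z_i = prior(z_t)               # stage 1: prior predicts image embedding
    return decoder(z_i)            # stage 2: decoder renders the image

img = unclip_generate("a corgi playing a trumpet")
```

The point of the sketch is the factorization: text → text embedding → image embedding → image, with the middle arrow (the prior) supervised against CLIP's own image embeddings during training.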
The classifier is typically trained with a cross-entropy loss; what matters is that the gradient of the classification loss with respect to the noisy image can be computed, and this gradient is used to steer each denoising step, guiding the diffusion toward a better decoded image.
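The two guidance mechanisms can be sketched side by side. The classifier gradient here is a hypothetical stand-in (a nudge toward a fake class centroid) rather than a real backpropagated cross-entropy gradient, but the update rules match the standard formulations:

```python
import numpy as np

def classifier_log_prob_grad(x_t, target_class):
    # Stand-in for grad_x log p(y | x_t) from a noise-aware classifier.
    # In the real method this comes from backpropagating a cross-entropy
    # loss through a classifier trained on noisy images; here we simply
    # nudge x_t toward a hypothetical class centroid.
    centroid = np.full_like(x_t, float(target_class))
    return centroid - x_t

def guided_mean(mean, variance, x_t, target_class, scale=1.0):
    # Classifier guidance (as in "Diffusion Models Beat GANs"): shift the
    # reverse-process mean by the scaled classifier gradient.
    return mean + scale * variance * classifier_log_prob_grad(x_t, target_class)

def classifier_free_guided_eps(eps_cond, eps_uncond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional noise
    # prediction toward the text-conditional one; no classifier needed.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

x_t = np.zeros(4)
new_mean = guided_mean(np.zeros(4), variance=0.5, x_t=x_t,
                       target_class=1, scale=2.0)
```

Classifier guidance needs a separate classifier that works on noisy inputs; classifier-free guidance avoids that by training the same network with and without the text condition and combining the two predictions at sampling time.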

Copyright notice
Author: Kun Li. Please include a link to the original when reprinting. Thank you.
