
DALL-E 2 from Scratch

Text-Conditioned Image Generation on FashionMNIST using CLIP Latents

by Matthew Nguyen

Denoising diffusion probabilistic models (DDPMs) are a popular type of generative AI model, introduced by Ho et al. (2020) and improved upon by Nichol and Dhariwal (2021). The basic idea behind these models is that noise is added to images in the forward diffusion process in order to train the model to predict the noise that should be removed at a given timestep in the reverse diffusion process. To sample images, you start with an image of pure noise and iteratively remove the model's predicted noise at each timestep until you reach the final image.
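
To make the forward and reverse processes concrete, here is a minimal PyTorch sketch. It assumes a linear beta schedule and the standard epsilon-prediction objective; `model` is a placeholder for whatever noise-prediction network gets trained (the UNet built later in this article is more involved).

```python
import torch

T = 1000                                       # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product of alphas up to each timestep

def q_sample(x0, t, noise):
    """Forward process in closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

def ddpm_loss(model, x0):
    """Training objective: the model is asked to predict the noise that was added at timestep t."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    pred = model(x_t, t)                       # placeholder noise-prediction network
    return torch.nn.functional.mse_loss(pred, noise)

@torch.no_grad()
def sample(model, shape):
    """Reverse process: start from pure noise and remove the predicted noise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        abar = alpha_bars[t]
        # posterior mean under the epsilon parameterization
        x = (x - betas[t] / (1.0 - abar).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```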

In order to have a DDPM generate multiple types of images while still letting the user choose which type they want, the model needs to be conditioned on some input. Ramesh et al. introduced one such conditioning method, called unCLIP, which is used in OpenAI's DALL-E 2 model. In the method described by Ramesh et al., the input caption is first passed to a prior network, which uses a trained CLIP model to obtain the CLIP text embeddings. These text embeddings are then fed to a decoder-only transformer that generates possible CLIP image embeddings. The CLIP image embeddings produced by the prior network are then used by a decoder network, which consists of a UNet model, to condition the images that are created. In this article, we are going to build a simple diffusion model using this process.
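
As a rough sketch of how these pieces fit together (the names `PriorNetwork` and `ConditioningMixer` below are hypothetical placeholders, not the implementation built later in the article): the prior is a small transformer that maps a frozen CLIP text embedding to a predicted CLIP image embedding, and one common way for the decoder UNet to consume that image embedding is to project it and add it to the timestep embedding before it reaches the residual blocks.

```python
import torch
import torch.nn as nn

class PriorNetwork(nn.Module):
    """Transformer that maps a CLIP text embedding to a predicted CLIP image embedding."""
    def __init__(self, dim=512, layers=6, heads=8):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=layers)
        self.to_image_emb = nn.Linear(dim, dim)

    def forward(self, text_emb):                    # text_emb: (batch, dim) from a frozen CLIP model
        h = self.transformer(text_emb.unsqueeze(1)) # treat the embedding as a length-1 sequence
        return self.to_image_emb(h.squeeze(1))      # predicted CLIP image embedding

class ConditioningMixer(nn.Module):
    """Injects the CLIP image embedding into the UNet by adding it to the timestep embedding."""
    def __init__(self, emb_dim=512, time_dim=256):
        super().__init__()
        self.proj = nn.Linear(emb_dim, time_dim)

    def forward(self, time_emb, image_emb):
        return time_emb + self.proj(image_emb)      # conditioned embedding fed to the UNet blocks

# Assumed end-to-end flow at generation time:
#   1. text_emb  = clip.encode_text(caption)        # frozen CLIP text encoder
#   2. image_emb = prior(text_emb)                   # prior network
#   3. image     = sample(conditioned_unet, shape)   # reverse diffusion conditioned on image_emb
```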

Figure: DALL-E 2 from scratch generating FashionMNIST-like clothing images.