CS 180 Project 5 - Geoffrey Xiang

Part A - The Power of Diffusion Models!

Part 0: Setup

I started by using the DeepFloyd IF model and precomputed text embeddings to generate images. When playing around with the number of inference steps, outputs tended to be higher quality (more detailed and more realistic) with a higher number of steps. The results are shown below using a random seed of 79:

num_inference_steps=20

An Oil Painting of a Snowy Mountain
A Man Wearing a Hat
A Rocket Ship

num_inference_steps=50

An Oil Painting of a Snowy Mountain
A Man Wearing a Hat
A Rocket Ship

Part 1: Sampling Loops

1.1 Implementing the Forward Process

Since diffusion models require clean images with varying levels of noise added to them, I first implemented the forward() function for adding noise to an image. Here are the results on the test image of the Campanile with increasing levels of noise:

Campanile (Original)
Campanile (t=250)
Campanile (t=500)
Campanile (t=750)
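
Concretely, the forward process computes x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps drawn from a standard Gaussian. Here is a minimal sketch of that computation, assuming a precomputed `alphas_cumprod` schedule (the names are illustrative, not my exact code):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image im at timestep t (DDPM forward process)."""
    abar = alphas_cumprod[t]            # cumulative product of alphas at step t
    eps = torch.randn_like(im)          # standard Gaussian noise
    # x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
    return torch.sqrt(abar) * im + torch.sqrt(1 - abar) * eps
```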

1.2 Classical Denoising

I first tried denoising with a Gaussian blur, but it didn't work well. It eliminated some of the noise at lower noise levels, but it also got rid of details in the original image. Here are the results using the noisy images from the previous part.

Campanile (t=250)
Campanile (t=500)
Campanile (t=750)
Campanile "Denoised" (t=250)
Campanile "Denoised" (t=500)
Campanile "Denoised" (t=750)

1.3 One-Step Denoising

Now that we know the classical denoising technique isn't great for our purposes, I moved on to using the pretrained diffusion model. The given UNet estimates the noise in an image at a given timestep, which lets us remove that noise from the image. This works great at getting rid of the sandy texture of the noise, but at higher noise levels the denoised images come out looking increasingly blurry and different from the original image.

Campanile (t=250)
Campanile (t=500)
Campanile (t=750)
Campanile "Denoised" (t=250)
Campanile "Denoised" (t=500)
Campanile "Denoised" (t=750)

1.4 Iterative Denoising

To solve the issue of the quality significantly degrading at higher noise levels, I moved on to iterative denoising. Instead of doing the denoising process in a single step, I now denoise iteratively with a stride of 30 steps to save some compute/time. Here are the results:

Noisy Campanile at t=90
Noisy Campanile at t=240
Noisy Campanile at t=390
Noisy Campanile at t=540
Noisy Campanile at t=690
Campanile (Original)
Campanile (Iterative Denoised) at t=500
Campanile (One-Step Denoised) at t=500
Campanile (Gaussian Blurred) at t=500
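
Each strided step blends the current noisy image toward the clean-image estimate using the standard DDPM update. A sketch of one step (variable names are mine; the added-noise term is left as a placeholder for the model's predicted variance):

```python
import torch

def iterative_denoise_step(x_t, t, t_next, x0_hat, alphas_cumprod, noise=None):
    """One strided denoising step: move x_t toward the clean estimate x0_hat."""
    abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
    alpha = abar_t / abar_next          # effective alpha for this strided step
    beta = 1 - alpha
    if noise is None:
        noise = torch.zeros_like(x_t)   # placeholder for the predicted-variance term
    # DDPM posterior mean: a weighted blend of the clean estimate and the noisy image
    x_next = (torch.sqrt(abar_next) * beta / (1 - abar_t)) * x0_hat \
           + (torch.sqrt(alpha) * (1 - abar_next) / (1 - abar_t)) * x_t \
           + noise
    return x_next
```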

1.5 Diffusion Model Sampling

With the `iterative_denoise` function implemented, we can actually use it to generate images from scratch. Using the general prompt "a high quality photo," here are the results of image generation:

1.6 Classifier-Free Guidance (CFG)

While some images look good (like the first), others are somewhat nonsensical (like the fifth). The images also look somewhat dull and faded, and we can do better with Classifier-Free Guidance. Using a CFG scale of 7 and the null prompt (""), here are the results of image generation with CFG:
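
CFG runs the UNet twice per step, once with the text prompt and once with the null prompt, and pushes the estimate past the conditional one. A minimal sketch of the combination:

```python
def cfg_noise_estimate(eps_cond, eps_uncond, scale=7.0):
    """Combine conditional and unconditional noise estimates with CFG."""
    # scale > 1 pushes the estimate further in the direction of the prompt
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With scale = 1 this reduces to the ordinary conditional estimate; larger scales trade diversity for fidelity to the prompt.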

1.7 Image-to-image Translation

Now, I use the SDEdit algorithm to implement image-to-image translation. Given an image, we can modify it by adding noise and then 'denoising' it with the functions I implemented previously. Here are the results:
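
The algorithm itself is tiny: noise the input image to the timestep selected by i_start, then run the iterative CFG denoiser from there; a smaller i_start means more noise and therefore a larger edit. A sketch, with the denoiser passed in as a callable since its signature depends on my Part 1.6 code:

```python
import torch

def sdedit(im, i_start, timesteps, alphas_cumprod, denoise_from):
    """Edit an image by noising it to timesteps[i_start], then denoising back.

    `denoise_from(x_t, i_start)` is the iterative CFG denoiser from Part 1.6.
    Smaller i_start => more noise => a larger departure from the input image.
    """
    t = timesteps[i_start]
    abar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    x_t = torch.sqrt(abar) * im + torch.sqrt(1 - abar) * eps  # forward process
    return denoise_from(x_t, i_start)
```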

Campanile

SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Campanile (Original)

Minion (Bob)

SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Bob (Original)

Crochet Ducky

SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Crochet Ducky (Original)

1.7.1 Editing Hand-Drawn and Web Images

With the ability to edit images, I can 'edit' hand-drawn images or turn nonrealistic images into more realistic-looking ones. Here are the results:

Minecraft Painting (Skull)

SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
MC Skull (Original)

Hand Drawn Horse

SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Horse Drawing (Original)

Hand Drawn Cat?

SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Cat Drawing (Original)

1.7.2 Inpainting

I then implemented image inpainting: given a binary mask, the masked region is regenerated by the iterative denoising process, while the pixels outside the mask are kept consistent with the original image throughout. Here are the results of inpainting:
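
Concretely, after every denoising step the pixels outside the mask are overwritten with the original image noised to the current timestep, so only the masked region is actually generated. A sketch of that projection step (names are illustrative):

```python
import torch

def inpaint_project(x_t, orig, mask, t, alphas_cumprod):
    """Keep generated content inside the mask; force everything else to match
    the original image at the current noise level."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(orig)
    orig_noised = torch.sqrt(abar) * orig + torch.sqrt(1 - abar) * eps
    # mask == 1 where new content is generated, 0 where the original is kept
    return mask * x_t + (1 - mask) * orig_noised
```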

Campanile (Original)
Mask
Hole
Inpainted
Minion with BANANA (Original)
Mask
Hole
Inpainted - banana gone :0
Dog
Mask
Hole
Inpainted - Dog Hiding Behind Hat

1.7.3 Text-Conditional Image-to-image Translation

Instead of letting the model edit the image however it wants, I now use a text prompt to guide the edits. Here are the results:

Campanile with Text Prompt: "a rocket ship"

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Campanile (Original)

Cat with Text Prompt: "a photo of a dog"

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Cat (Original)

Desert with Text Prompt: "a lithograph of waterfalls"

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Desert (Original)

1.8 Visual Anagrams

Now with the model, we can make visual anagrams. At each denoising step, we estimate the noise for the image with one prompt and the noise for the vertically flipped image with a second prompt (flipping that estimate back), then average the two. The result looks like the first prompt when viewed normally and like the second prompt when viewed upside down. Here are the results of visual anagrams:
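
In code, each step produces two noise estimates, one for the upright image with the first prompt and one for the flipped image with the second prompt (flipped back before combining), and averages them. A sketch, with the CFG UNet call abstracted as a callable (names are illustrative):

```python
import torch

def anagram_noise_estimate(x_t, t, noise_for, prompt_a, prompt_b):
    """Noise estimate for a visual anagram.

    `noise_for(x, t, prompt)` should return the (CFG) noise estimate for x.
    """
    eps_a = noise_for(x_t, t, prompt_a)                    # upright view
    flipped = torch.flip(x_t, dims=[-2])                   # upside-down view
    eps_b = torch.flip(noise_for(flipped, t, prompt_b), dims=[-2])
    return (eps_a + eps_b) / 2
```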

Campfire
Old Man
Waterfall
Man
Snowy Mountain Village
Man Wearing Hat

1.9 Hybrid Images

Just like with visual anagrams, we can use a very similar process to generate hybrid images: images that look like one thing up close but like something completely different from farther away. At each step, we run a low-pass filter on the noise estimate for one prompt and a high-pass filter on the noise estimate for the other prompt; the sum of the two filtered estimates is the noise estimate used for that step. Here are my results from this process:
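
A sketch of the noise combination, using a Gaussian blur as the low-pass filter and its residual as the high-pass (the kernel size and sigma shown are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(eps_far, eps_close, kernel_size=33, sigma=2.0):
    """Combine two noise estimates into one for a hybrid image.

    eps_far drives the low frequencies (visible from far away),
    eps_close drives the high frequencies (visible up close).
    """
    lowpass = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    highpass = eps_close - TF.gaussian_blur(eps_close, kernel_size=kernel_size, sigma=sigma)
    return lowpass + highpass
```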

Waterfall (close), Skull (far)
People Around Campfire (close), Snowy Mountain Village (far)
Dog (close), Skull (far)

Part B - Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising Net

1.1 Implementing the UNet

I started off by implementing the denoiser as a UNet, following the structure given in the spec. (No deliverables in this section)

1.2 Using the UNet to Train a Denoiser

To visualize the noising process that our denoiser will work with, I added varying levels of noise to MNIST digits. The results are shown below:
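
The noising here is plain additive Gaussian noise, z = x + sigma * eps; a tiny sketch:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```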

1.2.1 Training

Now that the architecture was set up, I trained the model for 5 epochs on MNIST training images noised at σ = 0.5, using the suggested hyperparameters. The results are shown in the following training loss curve:
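
The training step is ordinary supervised regression: noise a clean digit, run the UNet, and minimize the L2 distance between its output and the clean image. A sketch of one epoch (the model, data loader, and optimizer are assumed to be set up as in the spec; names are illustrative):

```python
import torch
import torch.nn.functional as F

def train_denoiser_epoch(unet, loader, optimizer, sigma=0.5, device="cuda"):
    """One epoch of single-step denoiser training: predict x from x + sigma*eps."""
    losses = []
    for x, _ in loader:                      # digit labels are unused here
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)  # noisy input
        x_hat = unet(z)                      # denoised prediction
        loss = F.mse_loss(x_hat, x)          # L2 loss against the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses
```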

I also ran the model on the test set after the 1st and 5th training epoch. Here are the results:

1.2.2 Out-of-Distribution Testing

Even though the model was trained on images noised at σ = 0.5, it can still be used on images with other noise levels, with varying degrees of success. Here are the results of denoising the test set at out-of-distribution noise levels:

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

Since we don't want to train a new model for every noise level, we instead train a single UNet with time conditioning. This UNet is very similar to the previous one; we simply embed the timestep t into the existing architecture using FCBlocks.
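
As a rough sketch of what such a conditioning block can look like (the exact injection points and dimensions follow the spec's diagram; this is an illustration, not my exact code):

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps a scalar conditioning value to a feature vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t: (B, 1) normalized timestep; output: (B, out_dim)
        return self.net(t)

# Inside the UNet, the embedded timestep is broadcast over an intermediate
# feature map at the locations the spec indicates, e.g.:
#   feat = feat + fc_t(t)[:, :, None, None]
```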

2.2 Training the UNet

With this new architecture, the training process is also slightly different. For each image drawn from the training set, we pick a random timestep t and noise the image according to that t. Then we train the denoiser to predict the noise that was added. Here is the training loss curve for this process:
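
A sketch of one such training step, assuming a precomputed `alphas_cumprod` schedule and a UNet that takes the normalized timestep as a second input (the names, the default T, and the normalization are illustrative):

```python
import torch
import torch.nn.functional as F

def train_step(unet, x, optimizer, alphas_cumprod, num_timesteps=300, device="cuda"):
    """One time-conditioned training step: noise x at a random t, predict the noise."""
    x = x.to(device)
    t = torch.randint(1, num_timesteps, (x.shape[0],), device=device)  # random t per image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = torch.sqrt(abar) * x + torch.sqrt(1 - abar) * eps            # forward process
    eps_hat = unet(x_t, t.float().view(-1, 1) / num_timesteps)         # predict the added noise
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```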

2.3 Sampling from the UNet

I also sampled from the model every few epochs during training; here are the results of the sampling after epochs 5 and 20:

2.4 Adding Class-Conditioning to UNet

Since the MNIST dataset consists of digit images, we can control which digit we generate by conditioning the UNet on 10 classes (one per digit). This modification is similar to the time conditioning added previously, except that the class is encoded as a one-hot vector and dropped (set to the zero vector) 10% of the time so the model still supports the unconditional case. Training then proceeds almost exactly as before, just with the conditioning vector c incorporated. Here is the training loss curve for this process:
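
A sketch of how that conditioning vector can be built (names are illustrative):

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors, randomly zeroed so the model keeps an unconditional mode."""
    c = F.one_hot(labels, num_classes).float()                       # (B, 10)
    drop = torch.rand(labels.shape[0], device=labels.device) < p_uncond
    c[drop] = 0.0                                                    # zero vector = "no class info"
    return c
```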

2.5 Sampling from the Class-Conditioned UNet

Here is the result of sampling the new class-conditioned UNet at epochs 5 and 20 (I used a classifier-free guidance scale of γ = 5.0):

Conclusion

It was great to see how we could use the diffusion model in part A to modify or generate images in all sorts of ways. I didn't have the best background in machine learning entering this project, so I definitely learned a lot about the struggles of dealing with tensor shapes and debugging garbage outputs when training diffusion models in part B.