I started by using the DeepFloyd IF model and precomputed text embeddings to generate images. When playing around with the number of inference steps, the outputs tended to be higher quality (more detailed and realistic) with a higher number of steps. The results are shown below, using a random seed of 79:
Since diffusion requires clean images with varying levels of noise added to them, I first implemented the `forward()` function for adding noise to an image. Here are the results on the test image of the Campanile with increasing levels of noise:
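A minimal sketch of what this forward process computes (not the exact implementation), assuming `alphas_cumprod` holds the scheduler's cumulative products ᾱ_t:

```python
import torch

def forward(im, t, alphas_cumprod):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps,
    with eps ~ N(0, I) and a_bar_t the cumulative product of the alphas."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                      # fresh Gaussian noise
    return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps
```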
I first tried to denoise using Gaussian blur, but it didn't work well at all. It removed some of the noise at lower noise levels, but it also blurred away details in the original image. Here are the results using the noisy images from the previous part.
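For reference, the classical baseline is just a Gaussian blur; a sketch (the kernel size and σ below are illustrative, not tuned values):

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=5, sigma=2.0):
    """Classical baseline: smooth away high-frequency noise with a Gaussian blur.
    This removes some speckle but also blurs real image detail."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```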
Now that we know the classical denoising technique isn't great for our purposes, I moved on to using the pretrained diffusion model. The given UNet estimates the noise in an image at a given timestep, which lets us remove that noise from the image. This works great at getting rid of the sandy texture of the noise, but at higher noise levels, the denoised images come out looking more and more blurry and different from the original image.
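A sketch of one-step denoising, assuming `unet(x_t, t, prompt_embeds)` returns the model's noise estimate and `alphas_cumprod` is as above:

```python
import torch

def one_step_denoise(unet, x_t, t, alphas_cumprod, prompt_embeds):
    """Estimate the noise in x_t, then solve the forward-process equation for
    the clean image: x_0 = (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t)."""
    a_bar = alphas_cumprod[t]
    with torch.no_grad():
        eps_hat = unet(x_t, t, prompt_embeds)      # assumed noise-prediction interface
    return (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```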
To solve the issue of quality significantly degrading at higher noise levels, I moved on to iterative denoising. Instead of doing the denoising in a single step, I now denoise iteratively, striding over the timesteps 30 at a time to save some compute/time. Here are the results:
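A sketch of the iterative loop, using the same assumed `unet` and `alphas_cumprod` as above (the extra learned variance/noise term in the full update is omitted for brevity):

```python
import torch

def iterative_denoise(unet, x, prompt_embeds, alphas_cumprod, start_t=990, stride=30):
    """Denoise over a strided schedule of timesteps: at each step, estimate the
    clean image x0_hat, then blend it with the current noisy image to form the
    image at the next (less noisy) timestep."""
    timesteps = list(range(start_t, 0, -stride)) + [0]
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = a_bar_t / a_bar_prev          # effective alpha over the strided gap
        beta_t = 1 - alpha_t

        with torch.no_grad():
            eps_hat = unet(x, t, prompt_embeds)  # predicted noise at step t
        x0_hat = (x - (1 - a_bar_t).sqrt() * eps_hat) / a_bar_t.sqrt()

        # Posterior mean: weighted combination of the clean estimate and x_t
        coef_x0 = a_bar_prev.sqrt() * beta_t / (1 - a_bar_t)
        coef_xt = alpha_t.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)
        x = coef_x0 * x0_hat + coef_xt * x
    return x
```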
With the `iterative_denoise` function implemented, we can use it to generate images from scratch by starting from pure noise. Using the generic prompt "a high quality photo," here are the results of image generation:
While some images, like the first one, look good, others are somewhat nonsensical, like the fifth. The images also look kind of dull and faded, and we can do better with Classifier-Free Guidance (CFG). Using a CFG scale of 7 and the null prompt (""), here are the results of image generation with CFG:
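The CFG noise estimate itself is simple; a sketch with the same assumed `unet` interface:

```python
def cfg_noise_estimate(unet, x, t, cond_embeds, uncond_embeds, scale=7.0):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond).
    cond_embeds come from the real prompt, uncond_embeds from the null prompt ""."""
    eps_cond = unet(x, t, cond_embeds)
    eps_uncond = unet(x, t, uncond_embeds)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```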
Now, I use the SDEdit algorithm to implement image-to-image translation. Given an image, we can modify it by adding noise and then 'denoising' it using the functions I implemented previously. Here are the results:
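A sketch of the SDEdit step, reusing the `forward` and `iterative_denoise` sketches above; the default `start_t` is illustrative:

```python
def sdedit(unet, x_orig, prompt_embeds, alphas_cumprod, start_t=600, stride=30):
    """SDEdit-style edit: noise the original image up to timestep start_t, then
    run the usual iterative denoising from that point.  A larger start_t adds
    more noise, so the output strays further from the original image."""
    x_t = forward(x_orig, start_t, alphas_cumprod)              # forward() sketch above
    return iterative_denoise(unet, x_t, prompt_embeds, alphas_cumprod,
                             start_t=start_t, stride=stride)    # sketch above
```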
With the ability to edit images, I can try to 'edit' hand-drawn images or turn non-realistic images into more realistic-looking ones. Here are the results:
I then implemented image inpainting by masking out a portion of the image and, at every step of the iterative denoising process, resetting everything outside the mask to the (appropriately noised) original image, so only the masked region gets newly generated content. Here are the results of inpainting:
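The key inpainting step, as a sketch (reusing the `forward` sketch; here `mask` is assumed to be 1 where new content should be generated):

```python
def inpaint_step(x, x_orig, mask, t_prev, alphas_cumprod):
    """After each denoising step, keep newly generated content only inside the
    mask; outside the mask, reset to the original image noised to the current
    timestep, so known pixels stay faithful to the original."""
    noised_orig = forward(x_orig, t_prev, alphas_cumprod)   # forward() from earlier
    return mask * x + (1 - mask) * noised_orig
```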
Now, instead of letting the model edit the image however it wants, I used a text prompt to guide the edits. Here are the results:
Now, with the model, we can make visual anagrams. At each step, we estimate the noise for one prompt on the image and the noise for a second prompt on the flipped image, flip that second estimate back, and average the two. This creates the effect of looking like the first prompt when viewed normally yet like the second when viewed upside down. Here are the results of visual anagrams:
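A sketch of the anagram noise estimate, with the same assumed `unet` interface:

```python
import torch

def anagram_noise_estimate(unet, x, t, embeds_a, embeds_b):
    """Estimate noise for prompt A on the image and for prompt B on the flipped
    image, flip that second estimate back, then average the two."""
    eps_a = unet(x, t, embeds_a)
    x_flipped = torch.flip(x, dims=[2])                 # flip along height (B, C, H, W)
    eps_b = torch.flip(unet(x_flipped, t, embeds_b), dims=[2])
    return (eps_a + eps_b) / 2
```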
Just like with visual anagrams, we can use a very similar process to generate hybrid images. These images look like one thing close up but like a completely different image when viewed from farther away. We achieve this by running a low-pass filter on the noise estimate for one prompt and a high-pass filter on the noise estimate for the other prompt; summing the two filtered estimates gives the combined noise estimate used for denoising, which produces the hybrid image. Here are my results from this process:
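A sketch of the hybrid noise estimate; the Gaussian kernel size and σ below are illustrative:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x, t, embeds_far, embeds_near, kernel_size=33, sigma=2.0):
    """Low-pass the noise estimate for the 'far away' prompt, high-pass the
    estimate for the 'close up' prompt (as the residual of a blur), and sum."""
    eps_far = unet(x, t, embeds_far)
    eps_near = unet(x, t, embeds_near)
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return low + high
```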
I started off by implementing the denoiser as a UNet, following the structure given in the spec. (No deliverables in this section)
To visualize the noising process that our denoiser will work with, I added varying levels of noise to MNIST digits. The results are shown below:
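The noising here is plain additive Gaussian noise; a sketch (the listed σ values are just examples):

```python
import torch

def add_noise(x, sigma):
    """Noise an MNIST image: z = x + sigma * eps, eps ~ N(0, I).
    Larger sigma makes the digit progressively harder to recognize."""
    return x + sigma * torch.randn_like(x)

# e.g. visualize add_noise(x, s) for s in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```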
Now that the architecture was set up, I trained the model for 5 epochs on the MNIST training dataset, with images noised at σ = 0.5, using the suggested hyperparameters. The results are shown in the following training loss curve:
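A minimal sketch of this training loop; `model` is the UNet denoiser, and the batch size and learning rate below are placeholders rather than necessarily the suggested values:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, epochs=5, batch_size=256, lr=1e-4, device="cpu"):
    """Train the UNet as a plain denoiser: noise each batch with z = x + sigma*eps
    and regress the output back to the clean image with an L2 loss."""
    train_set = datasets.MNIST(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    losses = []
    for _ in range(epochs):
        for x, _ in loader:                          # class labels unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)      # noisy input
            loss = torch.nn.functional.mse_loss(model(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses                                    # plotted as the loss curve
```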
I also ran the model on the test set after the 1st and 5th training epoch. Here are the results:
Even though the model was trained on images noised at σ = 0.5, it can still be used on images with other noise levels, with varying degrees of success. Here are the results of sampling the test set with out-of-distribution noise levels:
Since we don't want to train a new model for every new noise level, we can instead train a single UNet with time conditioning. This UNet is very similar to the previous one; we just embed the timestep into the existing model using FCBlocks.
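A sketch of such an embedding block; the exact layer sizes and where the result is injected into the UNet follow the spec, so the version below is only illustrative:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that embeds the (normalized) timestep into a per-channel vector
    which can be broadcast onto a UNet feature map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        # t: (B, 1) normalized timestep -> (B, out_dim, 1, 1) for broadcasting
        return self.net(t).unsqueeze(-1).unsqueeze(-1)
```

The resulting (B, out_dim, 1, 1) embedding can then be broadcast-added onto an intermediate feature map inside the UNet.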
With this new architecture, the training process is also slightly different. For every image we pick from the training dataset, we pick a random timestep t and noise the image according to that t. Then we train the denoiser to predict the noise that was added. Here is the training loss curve for this process:
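A sketch of a single training step, assuming ᾱ values precomputed from a standard DDPM β schedule and a `model(x_t, t_norm)` signature that takes the normalized timestep:

```python
import torch

def train_step_time_conditioned(model, opt, x, alphas_cumprod, T, device="cpu"):
    """One training step: pick a random timestep per image, noise the clean image
    with the forward process, and train the UNet to predict the added noise."""
    x = x.to(device)
    t = torch.randint(0, T, (x.shape[0],), device=device)     # random timesteps
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps          # forward process
    eps_hat = model(x_t, t.float().unsqueeze(1) / T)           # normalized timestep in
    loss = torch.nn.functional.mse_loss(eps_hat, eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```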
I also sampled from the model every few epochs during training, and here are the results of the sampling for epochs 5 and 20:
Since MNIST is a dataset of digit images, we can control which digit we generate by conditioning the UNet on the 10 classes (one per digit). This modification is similar to how we added time conditioning to the UNet previously, except this time we encode the class as a one-hot vector and drop the conditioning 10% of the time so the model still supports the unconditional use case. We can then train the model with almost the same process as before, just incorporating the conditioning vector c. Here is the training loss curve for this process:
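A sketch of building the conditioning vector with the 10% dropout:

```python
import torch
import torch.nn.functional as F

def class_condition_vector(labels, num_classes=10, p_uncond=0.1):
    """Build the one-hot conditioning vector c, zeroing it out for a random 10%
    of the batch so the model also learns the unconditional case."""
    c = F.one_hot(labels, num_classes).float()
    drop = torch.rand(labels.shape[0], device=labels.device) < p_uncond
    c[drop] = 0.0
    return c
```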
Here are the results of sampling the new class-conditioned UNet at epochs 5 and 20 (I used a classifier-free guidance scale of γ = 5.0):
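At sampling time, classifier-free guidance combines a class-conditioned and an unconditional estimate; a sketch, assuming a `model(x_t, t_norm, c)` signature:

```python
import torch

def cfg_class_noise_estimate(model, x_t, t_norm, c, gamma=5.0):
    """Classifier-free guidance at sampling time: combine the class-conditioned
    and unconditional (all-zero class vector) noise estimates as
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)."""
    eps_cond = model(x_t, t_norm, c)
    eps_uncond = model(x_t, t_norm, torch.zeros_like(c))
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```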
It was great to see how we could use the diffusion model in part A to modify or generate images in all sorts of ways. I didn't have the best background in machine learning entering this project, so I definitely learned a lot about the struggles of dealing with tensor shapes and debugging garbage outputs when training diffusion models in part B.