In this repository, you can find implementations of various generative models in PyTorch, completed as assignments for the graduate-level Deep Generative Models course offered by Dr. Mostafa Tavassolipour at the University of Tehran in Fall 2024.
- Diffusion Models
- Wasserstein-GAN
- Normalizing Flow
- VAE
- Score Based Models
- VQA on Paligemma
- References
As a part of the third assignment, I implemented DDPM [9]. If we represent the forward process' distribution with
Revisiting the Bayesian network, each
This is a lower bound on the likelihood, which we wish to maximize. Simplifying the above objective, we have:
The first, second and third terms of the above objective are called the reconstruction term, the prior matching term and the consistency term, respectively. The last term is the only one considered in the optimization, because it contains a summation and dominates the other two. The intuition behind this term is also noteworthy: minimizing the KL term inside the summation reduces the divergence between
Some further algebra reduces the above objective to:
The intuition behind this is also attractive. In the ideal world, we would like to see
In which,
The denoiser network is a U-Net [10], trained with the DDPM training algorithm. The model was trained on the sprites [11] dataset for 40 epochs; the training trend is shown in the following figure:
Following the sampling algorithm provided by the DDPM authors, here are 50 newly generated images:
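For reference, here is a minimal sketch of the forward noising step and the simplified noise-prediction loss from the DDPM paper [9]. It is written in plain Python for a single flattened sample so that it runs without dependencies. The linear schedule values, `T = 1000` and all function names are assumptions rather than values taken from the assignment code, and `predict_noise` is a hypothetical stand-in for the U-Net.

```python
import math
import random

def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(predict_noise, x0, alpha_bars, rng):
    """One Monte Carlo term of the simplified loss: mean squared error on the noise."""
    t = rng.randrange(len(alpha_bars))  # uniformly sampled timestep
    xt, eps = q_sample(x0, alpha_bars[t], rng)
    eps_hat = predict_noise(xt, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)

rng = random.Random(0)
alpha_bars = linear_alpha_bars()
# Hypothetical "network" that always predicts zero noise, only to exercise the loss.
loss = noise_prediction_loss(lambda xt, t: [0.0] * len(xt), [0.5, -0.2, 0.1], alpha_bars, rng)
```

In a real training loop, the same loss would be averaged over a batch and back-propagated through the denoiser.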
As mentioned in DDIM [12], the trained DDPM can be sampled using DDIM if:
which is significantly faster than DDPM. In the following figure, you can see 50 images generated by the DDIM sampling algorithm:
The models were evaluated in terms of quality and speed. As expected, DDIM generated 50 images in 126 ms, while DDPM took 6 s. The quality of the images was also evaluated using the FID score [6], which represented
As a part of the second assignment, I implemented Wasserstein GAN [2] in PyTorch. The high-level objective of all generative models is to minimize the distance or divergence between the distribution of the training data and the distribution learned by the model. This variant of GAN uses the Wasserstein distance, which is defined as:
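The speed-up comes from applying a deterministic update over a short subsequence of timesteps instead of visiting every step. Below is a minimal sketch of one such update, following the eta = 0 case in the DDIM paper [12]. It uses plain Python lists; `ddim_step` and its argument names are hypothetical, and `eps_hat` stands for the noise predicted by the trained denoiser.

```python
import math

def ddim_step(xt, eps_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from timestep t to an earlier timestep."""
    # Estimate of the clean sample implied by the predicted noise.
    x0_hat = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
              for x, e in zip(xt, eps_hat)]
    # Move the estimate to the earlier noise level, reusing the same predicted noise.
    return [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
            for x0, e in zip(x0_hat, eps_hat)]

# Sanity check: with the true noise, jumping straight to alpha_bar = 1 recovers x0.
x0, eps, a_t = [0.5, -1.0], [0.3, 0.7], 0.25
xt = [math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * e for x, e in zip(x0, eps)]
recovered = ddim_step(xt, eps, a_t, 1.0)
```

Because no fresh noise is injected, the step can skip many timesteps at once, which is consistent with the timing gap reported in this section.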
Where
and
To use this in a GAN, we rely on the Kantorovich-Rubinstein duality [3]:
As illustrated in the objective,
Lipschitzness of
```python
import torch

def calc_gradient_penalty(self, data, generated_data):
    """Gradient penalty: push the critic's gradient norm towards 1 on interpolated samples."""
    batch_size = data.shape[0]
    device = data.device
    # Random interpolation between real and generated samples.
    alpha = torch.rand(batch_size, 1, 1, 1, device=device).expand_as(data)
    interpolated = alpha * data.detach() + (1 - alpha) * generated_data.detach()
    interpolated.requires_grad_(True)
    prob_interpolated = self.discriminator(interpolated)
    # Gradient of the critic output with respect to the interpolated inputs.
    gradients = torch.autograd.grad(outputs=prob_interpolated, inputs=interpolated,
                                    grad_outputs=torch.ones_like(prob_interpolated),
                                    create_graph=True, retain_graph=True)[0]
    gradients = gradients.view(batch_size, -1)
    self.training_history['gradient_norm'].append(gradients.norm(2, dim=1).mean().item())
    # The small epsilon keeps the square root differentiable at zero.
    gradients_norm = torch.sqrt(torch.sum(gradients ** 2, dim=1) + 1e-12)
    return self.gradient_penalty_weight * ((gradients_norm - 1) ** 2).mean()
```

The model was trained on the Fashion-MNIST [5] dataset for 100 epochs. The following figure illustrates the loss during training:
In the following figure, you can see 25 samples, generated by the W-GAN:
We expect the model's FID score [6] to decrease as training progresses. I sampled three batches, at epochs 1, 50 and 100, which had FID scores of
As a part of the second assignment, I implemented Real NVP [7] as a variant of normalizing flows.
Normalizing flows can evaluate the likelihood exactly. When the series of transformations from latent space to image space is bijective, we can use the change-of-variables formula:
Now we can rewrite the maximum-likelihood:
An important note is that the computational complexity of calculating the determinant is
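```python
import numpy as np

def loss_function(x, sldj):
    # Log-density of each element of x under a standard normal prior.
    prior_ll = -0.5 * (x ** 2 + np.log(2 * np.pi))
    # Sum over dimensions, then subtract log(256) per dimension for the 256 pixel levels.
    prior_ll = prior_ll.view(x.size(0), -1).sum(-1) \
        - np.log(256) * np.prod(x.size()[1:])
    # Add the sum of log-determinants of the Jacobians (change of variables).
    ll = prior_ll + sldj
    nll = -ll.mean()
    return nll
```

Real NVP stacks a set of coupling layers, which operate as follows: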
where
The Jacobian matrix of the transformation can be calculated as:
As you can see, the matrix is upper-triangular, and its determinant can be calculated in
As you can see, the first dimension is left unchanged. To overcome this, a random permutation is applied to the dimensions before the next coupling layer, so that all of them are eventually transformed.
The model was trained on Fashion-MNIST [5]; the trend of the loss over 25 epochs is illustrated in the following figure:
The model was assessed on both generation and evaluation tasks. In the following image, you can see the performance of the model on reconstruction:
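The coupling mechanism and the permutation between layers can be illustrated with a small sketch. It assumes an affine coupling layer as described in the Real NVP paper [7], operates on a plain Python list of even length, and replaces the scale and shift networks with a hypothetical fixed function `scale_and_shift`. The permutation shown is a simple reversal rather than a random one.

```python
import math

def scale_and_shift(x1):
    """Hypothetical stand-in for the networks s(.) and t(.); any function of x1 works."""
    s = [math.tanh(v) for v in x1]
    t = [0.5 * v for v in x1]
    return s, t

def coupling_forward(x):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); returns y and log|det J|."""
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    s, t = scale_and_shift(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)  # the log-determinant is just the sum of the scales

def coupling_inverse(y):
    """Exact inverse: x2 = (y2 - t(y1)) * exp(-s(y1))."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    s, t = scale_and_shift(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2

x = [0.3, -1.2, 0.8, 2.0]
y, log_det = coupling_forward(x)
x_back = coupling_inverse(y)
# Reversing the dimensions lets the next layer transform the half left unchanged here.
y_permuted = y[::-1]
```

The inverse never needs to invert `scale_and_shift` itself, which is why the scale and shift functions can be arbitrarily complex networks.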
And here are 16 images, generated by the model:
As mentioned earlier, normalizing flows can evaluate the exact value of the likelihood, unlike VAEs, which use the ELBO. Thus, after training, we can state that images with likelihoods lower than a specific threshold do not belong to the distribution of the training images. The following figure compares the likelihood of the CIFAR-10 [8] dataset with that of Fashion-MNIST (the training data):
As a part of the first assignment, I implemented a variational auto-encoder in PyTorch, following the architecture proposed in [1]. The training dataset contains images of smiling and non-smiling faces, and the VAE was trained jointly on these two categories. Training took 1000 epochs; in the following figure, you can see the trends of both loss terms, reconstruction loss and KL divergence:
The performance of the VAE was assessed on three different tasks, whose results are shown in the following subsections.
The ability of the VAE to reconstruct images indicates how much information can be stored in its latent space, which has a lower number of dimensions than pixel space. Here is an example of the model's performance:
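The two loss terms tracked during training can be sketched as follows. This is a dependency-free illustration on plain Python lists, assuming a diagonal Gaussian posterior, a squared-error reconstruction term and a `beta` weight in the spirit of [1]. The function names and the choice of reconstruction loss are assumptions, not the assignment's actual code.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Squared-error reconstruction term plus a beta-weighted KL term (beta is assumed)."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + beta * kl_to_standard_normal(mu, log_var)

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)
loss = vae_loss(x=[1.0, 2.0], x_hat=[0.5, 2.0], mu=[1.0], log_var=[0.0])
```

The KL term is zero exactly when the posterior equals the standard normal, which is what pulls the latent distribution towards the prior during training.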
Odd columns show the original image and even columns show their corresponding reconstructed images.
Once the entire network is trained, we can simply take a sample from the latent space distribution, which was pushed towards a standard normal distribution during training, and pass it to the decoder to generate an image. In the following figure, you can see 32 images generated by the model:
The KL divergence term in the VAE objective pushes the learned distribution of the latent space towards a standard normal distribution. The remarkable property of this distribution is that its covariance matrix is
In many cases, learning the distribution of an image dataset is infeasible, but we can still draw samples from the distribution by estimating its score function. As part of the third assignment, we used denoising score matching to estimate the score function and the Annealed Langevin Dynamics algorithm to draw samples from the distribution [13].
Suppose
In which,
The Gaussian distribution has a well-defined closed-form score function:
which makes the final objective of
Training a network using this objective gives us the score function:
Now that we have the score function, we might think that we can easily generate samples using the Langevin dynamics [14] algorithm. But the algorithm will fail, because data points usually lie on a low-dimensional manifold and the score is poorly estimated in most of the space. The solution is to use the Annealed Langevin Dynamics algorithm: we start by adding high-variance noise, estimate the score function, move our initial sample points with a step size, and continue this process with decreasing noise levels until convergence:
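The denoising score matching objective can be sketched for a single sample and a single noise level as below. The sketch assumes Gaussian perturbation and the sigma-squared weighting used in [13]. `dsm_loss` and `score_fn` are hypothetical names, with `score_fn` standing in for the score network.

```python
import random

def dsm_loss(score_fn, x, sigma, rng):
    """Single-sample denoising score matching loss at noise level sigma."""
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    x_noisy = [v + sigma * n for v, n in zip(x, noise)]
    # Score of the Gaussian kernel q(x_noisy | x): -(x_noisy - x) / sigma^2 = -noise / sigma.
    target = [-n / sigma for n in noise]
    predicted = score_fn(x_noisy, sigma)
    # The sigma^2 weight keeps the loss on a comparable scale across noise levels.
    return 0.5 * sigma ** 2 * sum((p - t) ** 2 for p, t in zip(predicted, target))

x = [1.0, -0.5]

def exact_score(x_noisy, sigma):
    """Exact score of the perturbed distribution when the dataset contains only x."""
    return [-(v - c) / sigma ** 2 for v, c in zip(x_noisy, x)]

def zero_score(x_noisy, sigma):
    """A deliberately wrong score function, for comparison."""
    return [0.0] * len(x_noisy)

loss_exact = dsm_loss(exact_score, x, 0.5, random.Random(0))
loss_zero = dsm_loss(zero_score, x, 0.5, random.Random(0))
```

The exact score drives the loss to zero in this one-point toy case, while the wrong score function is penalized.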
And here is how samples converged through steps:
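The annealing procedure can be illustrated on a one-dimensional toy problem. The sketch below follows the update rule of Annealed Langevin Dynamics from [13], but replaces the trained score network with the exact score of a two-point data distribution perturbed by Gaussian noise. The modes, noise schedule, step size `eps` and all names are assumptions chosen for illustration.

```python
import math
import random

MODES = (-2.0, 2.0)  # toy data distribution: two equally likely points (assumption)

def score(x, sigma):
    """Exact score of the toy data convolved with N(0, sigma^2); stands in for the network."""
    logits = [-(x - m) ** 2 / (2.0 * sigma ** 2) for m in MODES]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # numerically stable responsibilities
    total = sum(weights)
    return sum(w / total * (-(x - m) / sigma ** 2) for w, m in zip(weights, MODES))

def annealed_langevin(n_samples=100, n_levels=10, steps_per_level=100,
                      sigma_max=3.0, sigma_min=0.1, eps=2e-3, seed=0):
    rng = random.Random(seed)
    # Geometric sequence of noise levels, from sigma_max down to sigma_min.
    sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / (n_levels - 1))
              for i in range(n_levels)]
    samples = [rng.uniform(-6.0, 6.0) for _ in range(n_samples)]
    for sigma in sigmas:
        # Step size shrinks with the noise level, as in the annealed schedule.
        step = eps * sigma ** 2 / sigmas[-1] ** 2
        for _ in range(steps_per_level):
            samples = [x + 0.5 * step * score(x, sigma)
                       + math.sqrt(step) * rng.gauss(0.0, 1.0) for x in samples]
    return samples

samples = annealed_langevin()
```

At high noise levels the score is informative everywhere, so points far from the data still move; at low noise levels the samples settle tightly around the two modes.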
As a part of the fourth assignment, I evaluated PaliGemma [15] on the visual question answering task, using the CLEVR [16] dataset to assess the model's ability to understand composition. The evaluation was done on a portion of the CLEVR images, and LoRA [17] was used to fine-tune the model for this task and dataset. To measure the effect of LoRA fine-tuning, I used the ROUGE score [18]; the results are summarized in the following table:
| model | ROUGE-1 | ROUGE-L |
|---|---|---|
| pretrained | 0.40 | 0.41 |
| fine-tuned | 0.54 | 0.54 |
Here is an example of the model's performance after fine-tuning:
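For readers unfamiliar with the metric, here is a simplified sketch of how ROUGE-1 and ROUGE-L F1 scores are computed for one reference answer and one predicted answer. It assumes lowercased whitespace tokenization with no stemming, so its numbers may differ from those of a full ROUGE package [18]. It is not the evaluation code used for the table above.

```python
from collections import Counter

def f1(matches, n_reference, n_prediction):
    """Harmonic mean of precision and recall; zero when nothing matches."""
    if matches == 0:
        return 0.0
    precision, recall = matches / n_prediction, matches / n_reference
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def rouge_1(reference, prediction):
    """Unigram-overlap F1 (order of words is ignored)."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    overlap = sum((Counter(ref) & Counter(pred)).values())
    return f1(overlap, len(ref), len(pred))

def rouge_l(reference, prediction):
    """Longest-common-subsequence F1 (order of words matters)."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    return f1(lcs_length(ref, pred), len(ref), len(pred))

score_1 = rouge_1("the red cube", "red cube")
score_l = rouge_l("red cube", "cube red")
```

ROUGE-1 rewards shared words regardless of order, while ROUGE-L also rewards preserving their order.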
- Irina Higgins, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. link
- Martin Arjovsky, et al. Wasserstein GAN. link
- Leonid Kantorovich, Leonid Rubinstein. link
- EmilienDupont. link
- Han Xiao, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. link
- Sadeep Jayasumana, et al. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. link
- Laurent Dinh, et al. Density estimation using Real NVP. link
- CIFAR-10
- Jonathan Ho, et al. Denoising Diffusion Probabilistic Models. link
- Olaf Ronneberger, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. link
- link
- Jiaming Song, et al. Denoising Diffusion Implicit Models. link
- Yang Song, et al. Generative Modeling by Estimating Gradients of the Data Distribution. link
- Ajay Chandra, et al. Langevin dynamic for the 2D Yang-Mills measure. link
- Lucas Beyer, et al. PaliGemma: A versatile 3B VLM for transfer. link
- Justin Johnson, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. link
- Edward J. Hu, et al. LoRA: Low-Rank Adaptation of Large Language Models. link
- Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. link




















