javadkavian/Deep-Generative-Models

Deep Generative Models

This repository contains PyTorch implementations of various generative models, completed as assignments for the graduate-level Deep Generative Models course taught by Dr. Mostafa Tavassolipour at the University of Tehran in Fall 2024.

Diffusion Models

Theory of DDPM

As part of the third assignment, I implemented DDPM [9]. Let $q$ denote the distribution of the forward process (which has no learnable parameters, because it only adds random noise to the image) and $p_{\theta}$ the denoising process. In the Bayesian network of DDPM, the original image $x_0$ is the observed variable and $x_1$ through $x_T$ are latent variables. The variational inference objective can therefore be written as:

$$ \max_{\theta}{-KL(q(x_1, x_2, ..., x_T|x_0) || p_{\theta}(x_0, x_1, ..., x_T))} $$

Revisiting the Bayesian network, each $x_t$ depends only on $x_{t-1}$. Since the negative KL divergence equals the expected log-ratio of $p_{\theta}$ to $q$, the above objective simplifies to:

$$\mathbb{E}_{q}\left[ \log{\frac{p(x_T)\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^{T}{q(x_t|x_{t-1})}}} \right]$$

This is a lower bound on the log-likelihood, which we wish to maximize. Simplifying the above objective, we have:

$$\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)} \right] - \mathbb{E}_{q(x_{T-1}|x_0)}\left[KL(q(x_T|x_{T-1})||p(x_T)) \right] - \sum_{t=1}^{T-1}{ \mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\left[KL(q(x_t|x_{t-1})||p_{\theta}(x_t|x_{t+1})) \right] }$$

The first, second and third terms of the above objective are called the reconstruction term, the prior matching term and the consistency term, respectively. The last term is the only one considered in the optimization, because it contains a summation over timesteps and dominates the other two. The intuition behind this term is also noteworthy: the KL term inside the summation minimizes the divergence between $q(x_t|x_{t-1})$ and $p_{\theta}(x_t|x_{t+1})$, which correspond to two ways of reaching $x_t$, one by adding noise to $x_{t-1}$ and the other by denoising $x_{t+1}$.

With some further algebra, the above objective reduces to:

$$-\sum_{t=2}^{T}{ \mathbb{E}_{q(x_t|x_0)}{\left[KL(q(x_{t-1}|x_t, x_0)||p_{\theta}(x_{t-1}|x_t)) \right]} }$$

The intuition behind this is also appealing. Ideally, we would like $p_{\theta}(x_{t-1}|x_t) = q(x_{t-1}|x_t)$, but unfortunately $q(x_{t-1}|x_t)$ has no closed form, as it is the reverse of the process that adds Gaussian noise. The paper shows, however, that $q(x_{t-1}|x_t, x_0)$ does have a closed form, under the condition that each step adds only a small amount of noise; consequently, reaching pure noise at $x_T$ requires many diffusion steps. Simplifying the above KL divergence using the Gaussian PDF, and writing $x_t$ in closed form as a function of $x_0$ (so that any diffusion step can be sampled directly), we get the following objective to train DDPM:

$$\mathbb{E}_{t, x_0, \epsilon} \left[\|\epsilon - \epsilon_{\theta}(\sqrt{\bar{a}_{t}}x_0 + \sqrt{1-\bar{a}_{t}}\epsilon, t) \|^{2} \right]$$

In which, $\epsilon_{\theta}$ is the network to predict the added noise.
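The objective above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the repository's training code: the linear schedule, the helper names `make_schedule` and `ddpm_loss`, and the `eps_model(x_t, t)` call signature are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice); returns the cumulative products a_bar_t."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)


def ddpm_loss(eps_model, x0, a_bar):
    """Monte Carlo estimate of E_{t, x0, eps} ||eps - eps_theta(x_t, t)||^2.

    eps_model is a hypothetical noise-prediction network called as eps_model(x_t, t).
    mse_loss averages over all elements, so the value equals the squared norm
    up to a constant factor.
    """
    batch_size = x0.shape[0]
    # One uniformly sampled timestep per image.
    t = torch.randint(0, a_bar.shape[0], (batch_size,), device=x0.device)
    eps = torch.randn_like(x0)
    # Reshape a_bar_t so that it broadcasts over the non-batch dimensions.
    a_bar_t = a_bar.to(x0.device)[t].view(batch_size, *([1] * (x0.dim() - 1)))
    # Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps.
    x_t = a_bar_t.sqrt() * x0 + (1.0 - a_bar_t).sqrt() * eps
    return F.mse_loss(eps_model(x_t, t), eps)
```

Because $x_t$ is sampled directly from $x_0$, each training step needs a single forward pass of the network regardless of the sampled $t$.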

Training DDPM

The denoiser network is a U-Net [10], trained with the DDPM training algorithm on the sprites [11] dataset. Training ran for 40 epochs; the following figure shows the loss trend:

Image Generation

Following the sampling algorithm, provided by DDPM authors, you can see 50 new generated images:

DDIM sampling

As mentioned in DDIM [12], a trained DDPM can be sampled with DDIM by using the following posterior:

$$ q_{\sigma}(x_{t-1}|x_t, x_0) = \mathcal{N}\left(\sqrt{a_{t-1}}x_0 + \sqrt{1 - a_{t-1} - \sigma_{t}^{2}} \cdot \frac{x_t - \sqrt{a_t}x_0}{\sqrt{1-a_{t}}}, \sigma_{t}^{2}I\right) $$

Sampling this way is significantly faster than DDPM. In the following figure, you can see 50 images generated by the DDIM sampling algorithm:
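For reference, a single DDIM update derived from $q_{\sigma}$ can be sketched as follows. The function name `ddim_step` and its arguments are hypothetical, and the sketch assumes that $a_t$ denotes the cumulative product of the noise schedule and that $x_0$ is replaced by its estimate from the predicted noise.

```python
import torch


def ddim_step(x_t, eps_pred, a_t, a_prev, sigma_t=0.0):
    """One reverse step x_t -> x_{t-1} (hypothetical helper, not the repository's code).

    a_t and a_prev are the cumulative schedule products at the current and the
    previous (possibly non-adjacent) timestep, passed as Python floats.
    sigma_t = 0 gives the deterministic DDIM sampler.
    """
    # Estimate x_0 from the predicted noise.
    x0_pred = (x_t - (1.0 - a_t) ** 0.5 * eps_pred) / a_t ** 0.5
    # (x_t - sqrt(a_t) x0_pred) / sqrt(1 - a_t) equals eps_pred, so the second
    # term of the mean reduces to a rescaled eps_pred.
    direction = (1.0 - a_prev - sigma_t ** 2) ** 0.5 * eps_pred
    noise = sigma_t * torch.randn_like(x_t)
    return a_prev ** 0.5 * x0_pred + direction + noise
```

Since `a_prev` does not have to belong to the adjacent timestep, the sampler can skip most of the steps used during training, which is where the speed-up comes from.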

Evaluation of samples

The models were evaluated in terms of quality and speed. As expected, DDIM generated 50 images in 126 ms, while DDPM took 6 s. Image quality was evaluated using the FID score [6], which was $62.41$ for DDPM and $103.93$ for DDIM, indicating that DDPM generated higher-quality images.

Wasserstein-GAN

Theory Of GAN

As part of the second assignment, I implemented Wasserstein-GAN [2] in PyTorch. The high-level objective of all generative models is to minimize a distance or divergence between the distribution of the training data and the distribution learned by the model. This variant of GAN uses the Wasserstein distance, which is defined as:

$$ D_{w}(p, q) = \inf_{\gamma \in \pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}[|x - y|_1] $$

Where $\pi(p, q)$ is the set of all joint distributions, defined on $(x, y)$ such that:

$$ p(x) = \int{\gamma(x, y)dy} $$

and

$$ q(y) = \int{\gamma(x, y)dx} $$

To use this in a GAN, we rely on the Kantorovich-Rubinstein duality [3]:

$$D_{w}(p, q) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)]$$

As the objective shows, $f$ must be a 1-Lipschitz function. Taking the discriminator as this function, the objective of Wasserstein-GAN can be defined as:

$$\min_{\theta} \max_{\phi} \mathbb{E}_{x \sim p_{data}}[D_{\phi}(x)] - \mathbb{E}_{z \sim p(z)}[D_{\phi}(G_{\theta}(z))]$$

Lipschitzness of $D$ is enforced through weight clipping and a gradient penalty on $\nabla_{x}D_{\phi}(x)$, which is implemented in the code as [4]:

import torch

def calc_gradient_penalty(self, data, generated_data):
    batch_size = data.shape[0]
    device = data.device
    # One random interpolation coefficient per sample, broadcast over C, H, W.
    alpha = torch.rand(batch_size, 1, 1, 1, device=device)
    interpolated = alpha * data.detach() + (1 - alpha) * generated_data.detach()
    interpolated.requires_grad_(True)

    prob_interpolated = self.discriminator(interpolated)

    # Gradient of the discriminator output with respect to the interpolated inputs.
    gradients = torch.autograd.grad(outputs=prob_interpolated, inputs=interpolated,
                                    grad_outputs=torch.ones_like(prob_interpolated),
                                    create_graph=True, retain_graph=True)[0]

    gradients = gradients.view(batch_size, -1)
    self.training_history['gradient_norm'].append(gradients.norm(2, dim=1).mean().item())

    # The small epsilon keeps the square root differentiable at zero.
    gradients_norm = torch.sqrt(torch.sum(gradients ** 2, dim=1) + 1e-12)
    # Penalize deviations of the per-sample gradient norm from 1.
    return self.gradient_penalty_weight * ((gradients_norm - 1) ** 2).mean()
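To show how the min-max objective turns into two losses, here is a minimal sketch. The function names `critic_loss` and `generator_loss` and the assumption that `critic` maps a batch to one score per sample are illustrative and not taken from the repository.

```python
import torch


def critic_loss(critic, real, fake, penalty=0.0):
    """Loss minimized by the critic: the negated Wasserstein estimate plus a penalty.

    `penalty` is assumed to be a gradient-penalty value computed separately.
    """
    # Detach the fake batch so that no gradient flows into the generator here.
    return critic(fake.detach()).mean() - critic(real).mean() + penalty


def generator_loss(critic, fake):
    """Loss minimized by the generator: raise the critic's score on generated data."""
    return -critic(fake).mean()
```

Minimizing `critic_loss` corresponds to the inner maximization over $\phi$, and minimizing `generator_loss` to the outer minimization over $\theta$.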

Training Process

The model was trained on the Fashion-MNIST [5] dataset for 100 epochs. The following figure illustrates the loss during training:

Generation

In the following figure, you can see 25 samples, generated by the W-GAN:

FID-Score analysis

We expect the model's FID score [6] to decrease as training progresses. I sampled three batches, at epochs 1, 50 and 100, which had FID scores of $2798.7711$, $2105.2888$ and $1857.8185$ respectively, showing the expected decreasing trend.

Normalizing Flow

As part of the second assignment, I implemented Real NVP [7] as a variant of normalizing flows.

Theory Of NF

Normalizing flows can evaluate the likelihood exactly. When the series of transformations from latent space to image space is bijective, we can use the change-of-variables formula:

$$ p_{X}(x) = p_{Z}(f_{\theta}^{-1}(x)) |{\det{\frac{\partial{f_{\theta}^{-1}(x)}}{\partial{x}}}}| $$

Now we can rewrite the maximum-likelihood:

$$\max_{\theta} \log{p_{X}(D; \theta)} = \max_{\theta} \sum_{x \in D}\left( \log{p_{Z}(f_{\theta}^{-1}(x))} + \log{\left|\det{\frac{\partial{f_{\theta}^{-1}(x)}}{\partial{x}}}\right|} \right)$$

An important note is that computing the determinant of a $d \times d$ matrix costs $O(d^3)$. To make training feasible, the transformations should be defined so that the determinant of their Jacobian is cheap to compute. The details are explained in the next subsection. When the latent distribution is standard normal, the loss function, which is the negative log-likelihood, is implemented in the code as:

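import numpy as np

def loss_function(x, sldj):
    # x: latent tensor from the flow; sldj: sum of log-det-Jacobians per sample.
    # Element-wise log-density of a standard normal.
    prior_ll = -0.5 * (x ** 2 + np.log(2 * np.pi))
    # Sum over dimensions, then subtract log(256) for every dimension.
    prior_ll = prior_ll.view(x.size(0), -1).sum(-1) \
        - np.log(256) * np.prod(x.size()[1:])
    ll = prior_ll + sldj
    # Negative log-likelihood, averaged over the batch.
    nll = -ll.mean()
    return nll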

Real-NVP

Real-NVP stacks a set of coupling layers, each of which operates as follows:

$$ y_1 = x_1 $$

$$ y_2 = s(x_1)x_2 + t(x_1) $$

where $x_1$ is the first part of the input and $y_1$ is the corresponding output, and likewise for $x_2$ and $y_2$. $s$ and $t$ are learned transformations, where the former returns a non-zero scalar and the latter returns a vector with the same dimensions as $x_2$.

The inverse of the transformation and its Jacobian can be calculated as:

$$ x_1 = y_1 $$

$$ x_2 = \frac{y_2 - t(y_1)}{s(y_1)} $$

$$ J = \begin{bmatrix} \frac{dx_1}{dy_1} = I & 0 \\ ... & \frac{dx_2}{dy_2} = \frac{1}{s(y_1)}I \end{bmatrix} $$

As you can see, the matrix is lower-triangular, so its determinant is the product of its diagonal entries and can be calculated in $O(1)$ as:

$$ |J| = (\frac{1}{s(y_1)})^d $$

where $d$ is the number of dimensions of $x_2$. Note that the first part, $x_1$, passes through unchanged; to overcome this, a random permutation is applied to the dimensions before the next coupling layer, so that all of them are eventually transformed.
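A coupling layer of this kind can be sketched as below. This is an illustrative version, not the repository's model: the class name `AffineCoupling`, the small MLP, and the `tanh` bound are assumptions, and the scale is applied element-wise as $\exp(\log s)$, as in the Real NVP paper, rather than as a single scalar.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """Hypothetical coupling layer for flat inputs of shape (batch, dim)."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d1 = dim // 2
        d2 = dim - self.d1
        # A single network outputs both log s and t from the untouched part x1.
        self.net = nn.Sequential(
            nn.Linear(self.d1, hidden), nn.ReLU(), nn.Linear(hidden, 2 * d2)
        )

    def _scale_shift(self, x1):
        log_s, t = self.net(x1).chunk(2, dim=1)
        # Bounded log-scale, so s = exp(log_s) is strictly positive (never zero).
        return torch.tanh(log_s), t

    def forward(self, x):
        x1, x2 = x[:, :self.d1], x[:, self.d1:]
        log_s, t = self._scale_shift(x1)
        y2 = torch.exp(log_s) * x2 + t
        # Triangular Jacobian: log|det| is the sum of the diagonal log-scales.
        log_det = log_s.sum(dim=1)
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d1], y[:, self.d1:]
        log_s, t = self._scale_shift(y1)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=1)
```

Here `log_det` is the log-determinant of the forward map; the log-determinant of the inverse map, which appears in the likelihood formula, is its negation.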

Training Of NF

The dataset deployed to train the model is Fashion-MNIST[5], and the trend of loss over 25 epochs is illustrated in the following figure:

Reconstruction and Generation

The model was assessed on both reconstruction and generation. The following image shows the model's performance on reconstruction:

And here are 16 images, generated by the model:

Out Of Distribution Detection

As mentioned earlier, normalizing flows can evaluate the exact likelihood, unlike VAEs, which only provide the ELBO. Thus, after training, we can declare that images with likelihoods lower than a specific threshold do not belong to the distribution of the training images. The following figure illustrates the likelihood of the CIFAR-10 [8] dataset compared to Fashion-MNIST (the training data):
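The thresholding rule described above can be sketched as follows. Choosing the threshold as a low quantile of the training log-likelihoods is an assumption made for this example, and both function names are hypothetical.

```python
import torch


def fit_threshold(train_log_likelihoods, quantile=0.05):
    """Log-likelihood below which only `quantile` of the training data falls.

    The 5% default is an assumed choice, not a value from the repository.
    """
    return torch.quantile(train_log_likelihoods, quantile).item()


def is_out_of_distribution(log_likelihoods, threshold):
    """Boolean mask marking the samples whose log-likelihood is below the threshold."""
    return log_likelihoods < threshold
```

With this rule, roughly `quantile` of genuine training-like images are flagged as well, so the quantile trades false alarms against missed detections.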

VAE

As part of the first assignment, I implemented a variational auto-encoder in PyTorch, following the architecture proposed in [1]. The dataset used for training contains images of smiling and non-smiling faces, and the VAE was trained jointly on these two categories. Training took 1000 epochs; the following figure shows the trends of both loss terms, the reconstruction loss and the KL divergence:

The performance of the VAE was assessed on three different tasks, whose results are shown in the following subsections.

Reconstruction

The ability of the VAE to reconstruct images reflects the amount of information that can be stored in its latent space, which has fewer dimensions than pixel space. Here is an example of the model's performance:

Odd columns show the original image and even columns show their corresponding reconstructed images.

Image Generation

Once the entire network is trained, we can simply take a sample from the latent space distribution, which was pushed towards standard normal distribution during training process, and pass it to the decoder to generate an image. In the following figure, you can see 32 images generated by the model:

Disentanglement

The KL-divergence term in the VAE objective pushes the learned latent distribution towards the standard normal distribution. A remarkable property of this distribution is that its covariance matrix is $I$, meaning that different dimensions are uncorrelated and can correspond to independent features. So the more you penalize this KL term, the more disentanglement is achieved, but at the expense of reconstruction quality. In this dataset, the primary difference between the two categories is whether the person is smiling. We therefore expect the latent dimension that differs most between the two categories to be the one that corresponds to the intensity of the smile. Thus, by manipulating this dimension for an image, we can manipulate the intensity of the smile. The following figure depicts this process:
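The trade-off between the KL penalty and reconstruction can be made concrete with a weighted loss in the style of beta-VAE [1]. This is a sketch under assumptions: the function name `beta_vae_loss`, the mean-squared-error reconstruction term, and the default `beta` are illustrative and may differ from the repository's code.

```python
import torch
import torch.nn.functional as F


def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Reconstruction loss plus beta-weighted KL(q(z|x) || N(0, I)), averaged per sample.

    mu and logvar are the encoder outputs; beta=4.0 is an assumed default.
    """
    batch_size = x.shape[0]
    # Squared-error reconstruction term (an assumed choice of likelihood).
    recon = F.mse_loss(x_recon, x, reduction="sum") / batch_size
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch_size
    return recon + beta * kl, recon, kl
```

Returning the two terms separately makes it easy to plot them side by side while raising `beta`.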

Score Based Models

In many cases, learning the distribution of an image dataset directly is infeasible, but we can still draw samples from the distribution by estimating its score function. As part of the third assignment, we used denoising score matching to estimate the score function and the Annealed Langevin Dynamics algorithm to draw samples from the distribution [13].

Suppose $x \sim p(x)$ is the distribution whose score function we wish to estimate. If we add slight noise to it to get $q_{\sigma}(\bar{x})$, we can consider $q_{\sigma}(\bar{x}) \approx p(x)$. Therefore, the objective for estimating the score function will be:

$$\frac{1}{2}\mathbb{E}_{\bar{x} \sim q_{\sigma}} \left[ \| \nabla_{\bar{x}}\log{q_{\sigma}(\bar{x})} - s_{\theta}(\bar{x}) \|^{2}_{2} \right]$$

in which $s_{\theta}$ is the network that estimates the score function. Importantly, the added noise gives us the following relation:

$$q_{\sigma}(\bar{x}|x) = \mathcal{N}(\bar{x}|x, \sigma^{2}I)$$

The Gaussian distribution has a well-defined, closed-form score function:

$$\nabla_{\bar{x}}{\log{q_{\sigma}(\bar{x}|x)}} = - \frac{\bar{x} - x}{\sigma^2}$$

which makes the final objective of $s_\theta$:

$$\frac{1}{2n}\sum_{i=1}^{n}\left[ \| s_{\theta}(\bar{x}_i) - \nabla_{\bar{x}}{\log{q_{\sigma}(\bar{x}_i|x_i)}} \|_{2}^{2} \right]$$
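For a single noise level, this objective can be sketched in PyTorch as follows. The function name `dsm_loss` and the single-argument `score_model(x_bar)` call are assumptions made for the illustration.

```python
import torch


def dsm_loss(score_model, x, sigma):
    """Denoising score matching loss at one noise level (hypothetical helper)."""
    # Perturb the clean data: x_bar ~ N(x, sigma^2 I).
    x_bar = x + sigma * torch.randn_like(x)
    # Score of the Gaussian perturbation kernel: -(x_bar - x) / sigma^2.
    target = -(x_bar - x) / sigma ** 2
    diff = score_model(x_bar) - target
    # 1/(2n) * sum_i ||.||_2^2, with the norm taken over all non-batch dimensions.
    return 0.5 * diff.flatten(start_dim=1).pow(2).sum(dim=1).mean()
```

The target only involves the clean sample and the noisy sample, so the unknown data score never has to be evaluated.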

Training a network with this objective gives us the score function:

Now that we have the score function, we might think that we can easily generate samples using the Langevin dynamics [14] algorithm. But the algorithm fails, because data points usually lie on a low-dimensional manifold, and most of the space contains no data and therefore no useful gradient. The solution is to use the Annealed Langevin Dynamics algorithm: we start with high-variance noise, estimate the score function at that noise level, move the sample points along it with a small step size, and repeat with progressively smaller noise levels until convergence:
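The annealing loop can be sketched as below, following the update rule from [13]. The function name, the `score_fn(x, sigma)` signature (a score model conditioned on the noise level), and the default step settings are assumptions for this example rather than the repository's implementation.

```python
import torch


@torch.no_grad()
def annealed_langevin(score_fn, x, sigmas, steps_per_level=100, eps=2e-5):
    """Langevin updates at each noise level, from the largest sigma to the smallest.

    sigmas is a decreasing list of floats; score_fn(x, sigma) is a hypothetical
    noise-conditional score estimate.
    """
    for sigma in sigmas:
        # The step size shrinks with the noise level, relative to the smallest sigma.
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            # Move along the estimated score and inject fresh Gaussian noise.
            x = x + 0.5 * alpha * score_fn(x, sigma) + alpha ** 0.5 * z
    return x
```

Large noise levels give a usable score everywhere in space, which moves the samples towards the data; the small levels then refine them.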

And here is how samples converged through steps:

VQA on Paligemma

As part of the fourth assignment, I evaluated PaliGemma [15] on the visual question answering task, using the CLEVR [16] dataset to assess the model's ability to understand composition. The evaluation was done on a portion of the CLEVR images, and LoRA [17] fine-tuning was applied to strengthen the model on this task and dataset. To measure the effect of LoRA fine-tuning, I used the ROUGE score [18]; the results are summarized in the following table:

| model | ROUGE-1 | ROUGE-L |
| --- | --- | --- |
| pretrained | 0.40 | 0.41 |
| fine tuned | 0.54 | 0.54 |

Here is an example of model's performance after fine-tuning:

References

  1. Irina Higgins, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. link
  2. Martin Arjovsky, et al. Wasserstein GAN. link
  3. Leonid Kantorovich, Leonid Rubinstein. link
  4. EmilienDupont. link
  5. Han Xiao, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. link
  6. Sadeep Jayasumana, et al. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. link
  7. Laurent Dinh, et al. Density estimation using Real NVP. link
  8. CIFAR-10
  9. Jonathan Ho, et al. Denoising Diffusion Probabilistic Models. link
  10. Olaf Ronneberger, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. link
  11. link
  12. Jiaming Song, et al. Denoising Diffusion Implicit Models. link
  13. Yang Song, et al. Generative Modeling by Estimating Gradients of the Data Distribution. link
  14. Ajay Chandra, et al. Langevin dynamic for the 2D Yang-Mills measure. link
  15. Lucas Beyer, et al. PaliGemma: A versatile 3B VLM for transfer. link
  16. Justin Johnson, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. link
  17. Edward J. Hu, et al. LoRA: Low-Rank Adaptation of Large Language Models. link
  18. Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. link
