lavinal712 · May 3, 2025 · Aug 28, 2025
diff --git a/README.md b/README.md
@@ -1,98 +1,106 @@
-# AutoencoderKL
+# Deep Compression Autoencoder
 
-## About The Project
+This repository is a branch of [lavinal712/AutoencoderKL](https://github.com/lavinal712/AutoencoderKL) modified for DC-AE. It aims to provide a simple, easy-to-use training script in response to [issue #173](https://github.com/mit-han-lab/efficientvit/issues/173).
 
-There are many great training scripts for VAE on Github. However, some repositories are not maintained and some are not updated to the latest version of PyTorch. Therefore, I decided to create this repository to provide a simple and easy-to-use training script for VAE by Lightning. Beside, the code is easy to transfer to other projects for time saving.
+We provide three configuration files, one for each of the three phases of DC-AE training.
 
-- Support training and finetuning both [Stable Diffusion](https://github.com/CompVis/stable-diffusion) VAE and [Flux](https://github.com/black-forest-labs/flux) VAE.
-- Support evaluating reconstruction quality (FID, PSNR, SSIM, LPIPS).
-- A practical guidance of training VAE.
-- Easy to modify the code for your own research.
+```
+configs/
+├── dc-ae-f32c32-in-1.0_phase1.yaml
+├── dc-ae-f32c32-in-1.0_phase2.yaml
+└── dc-ae-f32c32-in-1.0_phase3.yaml
+```
 
-<!-- GETTING STARTED -->
-## Getting Started
+![phases](assets/phases.png)
+
+## Visualization
+
+Model: [mit-han-lab/dc-ae-f32c32-in-1.0](https://huggingface.co/models?other=dc-ae-f32c32-in-1.0)
+
+| Input                                   | Reconstruction                                            |
+|---------------------------------------  |-----------------------------------------------------------|
+| ![assets/inputs.png](assets/inputs.png) | ![assets/reconstructions.png](assets/reconstructions.png) |
+
+Evaluation results (from the [original paper](https://arxiv.org/abs/2410.10733)):
 
-To get a local copy up and running follow these simple example steps.
+| Model          | rFID | PSNR  | SSIM | LPIPS |
+|----------------|------|-------|------|-------|
+| DC-AE (f32c32) | 0.69 | 23.85 | 0.66 | 0.082 |
+| SD-VAE         | 0.69 | 26.91 | 0.77 | 0.130 |
+
+## Getting Started
 
 ### Installation
 
-```bash
-git clone https://github.com/lavinal712/AutoencoderKL.git
+```
+git clone https://github.com/lavinal712/AutoencoderKL.git -b dc-ae
 cd AutoencoderKL
 conda create -n autoencoderkl python=3.10 -y
 conda activate autoencoderkl
 pip install -r requirements.txt
 ```
 
-### Training
-
-To start training, you need to prepare a config file. You can refer to the config files in the `configs` folder.
+### Data
 
-If you want to train on your own dataset, you should write your own data loader in `sgm/data` and modify the parameters in the config file.
+We use the ImageNet dataset for training and validation, organized as follows:
 
-Finetuning a VAE model is simple. You just need to specify the `ckpt_path` and `trainable_ae_params` in the config file. To keep the latent space of the original model, it is recommended to set decoder to be trainable.
+```
+ImageNet/
+├── train/
+│   ├── n01440764/
+│   │   ├── n01440764_18.JPEG
+│   │   ├── n01440764_36.JPEG
+│   │   └── ...
+│   ├── n01443537/
+│   │   ├── n01443537_2.JPEG
+│   │   ├── n01443537_16.JPEG
+│   │   └── ...
+│   ├── ...
+├── val/
+│   ├── n01440764/
+│   │   ├── ILSVRC2012_val_00000293.JPEG
+│   │   ├── ILSVRC2012_val_00002138.JPEG
+│   │   └── ...
+│   ├── n01443537/
+│   │   ├── ILSVRC2012_val_00000236.JPEG
+│   │   ├── ILSVRC2012_val_00000262.JPEG
+│   │   └── ...
+│   ├── ...
+└── ...
+```
 
-Then, you can start training by running the following command.
+### Training
 
 ```bash
-NUM_GPUS=4
-NUM_NODES=1
-
-torchrun --nproc_per_node=${NUM_GPUS} --nnodes=${NUM_NODES} main.py \
-    --base configs/autoencoder_kl_32x32x4.yaml \
+torchrun --nproc_per_node=4 --nnodes=1 main.py \
+    --base configs/dc-ae-f32c32-in-1.0_phase1.yaml \
     --train \
-    --logdir logs/autoencoder_kl_32x32x4 \
-    --scale_lr True \
-    --wandb False \
+    --scale_lr False \
+    --wandb True
 ```
 
-### Evaluation
-
-We provide a script to evaluate the reconstruction quality of the trained model. `--resume` provides a convenient way to load the checkpoint from the log directory.
-
-We introduce multi-GPU and multi-thread method for faster evaluation.
-
-The default dataset is ImageNet. You can change the dataset by modifying the `--datadir` in the command line and the evaluation script.
+For the later phases, specify the checkpoint from the previous phase, either on the command line or in the config file.
 
 ```bash
-NUM_GPUS=4
-NUM_NODES=1
-
-torchrun --nproc_per_node=${NUM_GPUS} --nnodes=${NUM_NODES} eval.py \
-    --resume logs/autoencoder_kl_32x32x4 \
-    --base configs/autoencoder_kl_32x32x4.yaml \
-    --logdir eval/autoencoder_kl_32x32x4 \
-    --datadir /path/to/ImageNet \
-    --image_size 256 \
-    --batch_size 16 \
-    --num_workers 16 \
+torchrun --nproc_per_node=4 --nnodes=1 main.py \
+    --base configs/dc-ae-f32c32-in-1.0_phase2.yaml \
+    --train \
+    --scale_lr False \
+    --wandb True \
+    --resume_from_checkpoint /path/to/last.ckpt
 ```
 
-### Converting to diffusers
-
-[huggingface/diffusers](https://github.com/huggingface/diffusers) is a library for diffusion models. It provides a script [convert_vae_pt_to_diffusers.py
-](https://github.com/huggingface/diffusers/blob/main/scripts/convert_vae_pt_to_diffusers.py) to convert a PyTorch Lightning model to a diffusers model.
-
-Currently, the script is not updated for all kinds of VAE models, just for SD VAE.
-
 ```bash
-python convert_vae_pt_to_diffusers.py \
-    --vae_path logs/autoencoder_kl_32x32x4/checkpoints/last.ckpt \
-    --dump_path autoencoder_kl_32x32x4 \
+torchrun --nproc_per_node=4 --nnodes=1 main.py \
+    --base configs/dc-ae-f32c32-in-1.0_phase3.yaml \
+    --train \
+    --scale_lr False \
+    --wandb True \
+    --resume_from_checkpoint /path/to/last.ckpt
 ```
 
-## Guidance
-
-Here are some guidance for training VAE. If there are any mistakes, please let me know.
-
-- Learning rate: In LDM repository [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion), the base learning rate is set to 4.5e-6 in the config file. However, the batch size is 12, accumulated gradient is 2 and `scale_lr` is set to `True`. Therefore, the effective learning rate is 4.5e-6 * 12 * 2 * 1 = 1.08e-4. It is better to set the learning rate from 1.0e-4 to 1.0e-5. In finetuning stage, it can be smaller than the first stage.
-  - `scale_lr`: It is better to set `scale_lr` to `False` when training on a large dataset.
-- Discriminator: You should open the discriminator in the end of the training, when the VAE has good reconstruction performance. In default, `disc_start` is set to 50001.
-- Perceptual loss: LPIPS is a good metric for evaluating the quality of the reconstructed images. Some models use other perceptual loss functions to gain better performance, such as [sypsyp97/convnext_perceptual_loss](https://github.com/sypsyp97/convnext_perceptual_loss).
-
-## Acknowledgments
+## Acknowledgements
 
-Thanks for the following repositories. Without their code, this project would not be possible.
+Thanks to the authors of [DC-AE](https://github.com/mit-han-lab/efficientvit) for the original method and implementation.
 
-- [Stability-AI/generative-models](https://github.com/Stability-AI/generative-models). We heavily borrow the code from this repository, just modifing a few parameters for our concept.
-- [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion). We follow the hyperparameter settings of this repository in config files.
+- [mit-han-lab/efficientvit](https://github.com/mit-han-lab/efficientvit)
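A note on the README's training section: the three commands differ only in the config file and in whether a checkpoint is passed. The following is a minimal Python sketch of that chain, assuming the flags behave as the README shows them (`--base`, `--train`, `--scale_lr`, `--wandb`, `--resume_from_checkpoint`). The helper name `build_phase_command` and the checkpoint paths are hypothetical and not part of the repository.

```python
from typing import List, Optional

# Config naming follows the files listed in the README.
CONFIG_TEMPLATE = "configs/dc-ae-f32c32-in-1.0_phase{phase}.yaml"


def build_phase_command(phase: int, checkpoint: Optional[str] = None,
                        num_gpus: int = 4, num_nodes: int = 1) -> List[str]:
    """Build the torchrun argument list for one training phase (1, 2, or 3)."""
    if phase not in (1, 2, 3):
        raise ValueError(f"phase must be 1, 2, or 3, got {phase}")
    if phase > 1 and checkpoint is None:
        # Phases 2 and 3 continue from the weights of the previous phase.
        raise ValueError(f"phase {phase} needs the checkpoint from phase {phase - 1}")
    cmd = [
        "torchrun", f"--nproc_per_node={num_gpus}", f"--nnodes={num_nodes}", "main.py",
        "--base", CONFIG_TEMPLATE.format(phase=phase),
        "--train",
        "--scale_lr", "False",
        "--wandb", "True",
    ]
    if checkpoint is not None:
        cmd += ["--resume_from_checkpoint", checkpoint]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute the checkpoints your own runs produce.
    print(" ".join(build_phase_command(1)))
    print(" ".join(build_phase_command(2, "/path/to/phase1/last.ckpt")))
    print(" ".join(build_phase_command(3, "/path/to/phase2/last.ckpt")))
```

Refusing to build a phase 2 or phase 3 command without a checkpoint makes the dependency between phases explicit, so a later phase cannot silently start from random weights.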
diff --git a/assets/inputs.png b/assets/inputs.png
diff --git a/assets/phases.png b/assets/phases.png
diff --git a/assets/reconstructions.png b/assets/reconstructions.png
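A note on the config naming used in this change: in `dc-ae-f32c32`, `f32` is the spatial downsampling factor and `c32` is the number of latent channels (matching `latent_channels: 32` in the configs). The sketch below works out the latent shape for the two input sizes that appear in the phase configs (256 and 512). `latent_shape` and `compression_ratio` are hypothetical helpers written for illustration, not functions from the repository.

```python
from typing import Tuple


def latent_shape(image_size: int, spatial_factor: int = 32,
                 latent_channels: int = 32) -> Tuple[int, int, int]:
    """Return (channels, height, width) of the latent for a square RGB input."""
    if image_size % spatial_factor != 0:
        raise ValueError(
            f"image_size {image_size} is not divisible by {spatial_factor}")
    side = image_size // spatial_factor
    return (latent_channels, side, side)


def compression_ratio(image_size: int, spatial_factor: int = 32,
                      latent_channels: int = 32) -> float:
    """Ratio of input elements (3 x H x W) to latent elements."""
    c, h, w = latent_shape(image_size, spatial_factor, latent_channels)
    return (3 * image_size * image_size) / (c * h * w)


if __name__ == "__main__":
    print(latent_shape(256))       # (32, 8, 8)
    print(latent_shape(512))       # (32, 16, 16)
    print(compression_ratio(256))  # 96.0
```

The element-count ratio is the same at both resolutions because the downsampling factor and channel count are fixed; only the spatial grid of the latent grows.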
diff --git a/configs/dc-ae-f32c32-in-1.0_phase1.yaml b/configs/dc-ae-f32c32-in-1.0_phase1.yaml
@@ -0,0 +1,90 @@
+model:
+  base_learning_rate: 6.4e-5
+  target: sgm.models.autoencoder.AutoencoderDC
+  params:
+    input_key: jpg
+    monitor: "val/loss/rec"
+    disc_start_iter: 1000000000
+
+    encoder_config:
+      target: sgm.modules.efficientvitmodules.model.EncoderConfig
+      params:
+        in_channels: 3
+        latent_channels: 32
+        block_type: [ResBlock, ResBlock, ResBlock, EViT_GLU, EViT_GLU, EViT_GLU]
+        width_list: [128, 256, 512, 512, 1024, 1024]
+        depth_list: [0, 4, 8, 2, 2, 2]
+
+    decoder_config:
+      target: sgm.modules.efficientvitmodules.model.DecoderConfig
+      params:
+        in_channels: 3
+        latent_channels: 32
+        block_type: [ResBlock, ResBlock, ResBlock, EViT_GLU, EViT_GLU, EViT_GLU]
+        width_list: [128, 256, 512, 512, 1024, 1024]
+        depth_list: [0, 5, 10, 2, 2, 2]
+        norm: [bn2d, bn2d, bn2d, trms2d, trms2d, trms2d]
+        act: [relu, relu, relu, silu, silu, silu]
+
+    loss_config:
+      target: sgm.modules.autoencoding.losses.GeneralLPIPSWithDiscriminator
+      params:
+        perceptual_weight: 0.25
+        disc_start: 50001
+        disc_weight: 0.0
+        learn_logvar: false
+        pixel_loss: "l1"
+
+    optimizer_config:
+      target: torch.optim.AdamW
+      params:
+        betas: [0.9, 0.999]
+        weight_decay: 0.1
+
+data:
+  target: sgm.data.imagenet.ImageNetLoader
+  params:
+    batch_size: 24
+    num_workers: 4
+    prefetch_factor: 2
+    shuffle: true
+
+    train:
+      root_dir: /path/to/ImageNet
+      size: 256
+      transform: true
+    validation:
+      root_dir: /path/to/ImageNet
+      size: 256
+      transform: true
+
+lightning:
+  strategy:
+    target: pytorch_lightning.strategies.DDPStrategy
+    params:
+      find_unused_parameters: True
+
+  modelcheckpoint:
+    params:
+      every_n_epochs: 1
+
+  callbacks:
+    metrics_over_trainsteps_checkpoint:
+      params:
+        every_n_train_steps: 50000
+
+    image_logger:
+      target: main.ImageLogger
+      params:
+        enable_autocast: False
+        batch_frequency: 1000
+        max_images: 8
+        increase_log_steps: True
+
+  trainer:
+    precision: bf16
+    devices: 0, 1, 2, 3
+    limit_val_batches: 50
+    benchmark: True
+    accumulate_grad_batches: 1
+    check_val_every_n_epoch: 1
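A note on the learning rate in this config: the README commands pass `--scale_lr False`, so `base_learning_rate` is presumably used as-is. The README text removed by this change describes the scaled case as the product of the base rate, the batch size, the gradient accumulation steps, and the GPU count. The sketch below reproduces that arithmetic, assuming this branch keeps the same rule and that `batch_size` is per GPU. `effective_lr` and `global_batch_size` are hypothetical helpers.

```python
def global_batch_size(batch_size: int, num_gpus: int,
                      accumulate_grad_batches: int = 1) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * num_gpus * accumulate_grad_batches


def effective_lr(base_lr: float, batch_size: int, num_gpus: int,
                 accumulate_grad_batches: int = 1,
                 scale_lr: bool = False) -> float:
    """Learning rate the optimizer would see under the assumed scaling rule."""
    if not scale_lr:
        # Unscaled: the configured value is used directly.
        return base_lr
    return base_lr * global_batch_size(batch_size, num_gpus, accumulate_grad_batches)


if __name__ == "__main__":
    # Phase 1 values from the config: base LR 6.4e-5, batch size 24, 4 GPUs.
    print(global_batch_size(24, 4))                       # 96
    print(effective_lr(6.4e-5, 24, 4, scale_lr=False))
    print(effective_lr(6.4e-5, 24, 4, scale_lr=True))
```

Under these assumptions, turning scaling on would multiply the phase 1 rate by 96, which is why the flag and the configured value need to be chosen together.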
diff --git a/configs/dc-ae-f32c32-in-1.0_phase2.yaml b/configs/dc-ae-f32c32-in-1.0_phase2.yaml
@@ -0,0 +1,92 @@
+model:
+  base_learning_rate: 1.6e-5
+  target: sgm.models.autoencoder.AutoencoderDC
+  params:
+    input_key: jpg
+    monitor: "val/loss/rec"
+    disc_start_iter: 1000000000
+    trainable_ae_params:
+      - ["encoder.project_out", "decoder.project_in"]
+
+    encoder_config:
+      target: sgm.modules.efficientvitmodules.model.EncoderConfig
+      params:
+        in_channels: 3
+        latent_channels: 32
+        block_type: [ResBlock, ResBlock, ResBlock, EViT_GLU, EViT_GLU, EViT_GLU]
+        width_list: [128, 256, 512, 512, 1024, 1024]
+        depth_list: [0, 4, 8, 2, 2, 2]
+
+    decoder_config:
+      target: sgm.modules.efficientvitmodules.model.DecoderConfig
+      params:
+        in_channels: 3
+        latent_channels: 32
+        block_type: [ResBlock, ResBlock, ResBlock, EViT_GLU, EViT_GLU, EViT_GLU]
+        width_list: [128, 256, 512, 512, 1024, 1024]
+        depth_list: [0, 5, 10, 2, 2, 2]
+        norm: [bn2d, bn2d, bn2d, trms2d, trms2d, trms2d]
+        act: [relu, relu, relu, silu, silu, silu]
+
+    loss_config:
+      target: sgm.modules.autoencoding.losses.GeneralLPIPSWithDiscriminator
+      params:
+        perceptual_weight: 0.25
+        disc_start: 50001
+        disc_weight: 0.0
+        learn_logvar: false
+        pixel_loss: "l1"
+
+    optimizer_config:
+      target: torch.optim.AdamW
+      params:
+        betas: [0.9, 0.999]
+        weight_decay: 0.001
+
+data:
+  target: sgm.data.imagenet.ImageNetLoader
+  params:
+    batch_size: 4
+    num_workers: 4
+    prefetch_factor: 2
+    shuffle: true
+
+    train:
+      root_dir: /path/to/ImageNet
+      size: 512
+      transform: true
+    validation:
+      root_dir: /path/to/ImageNet
+      size: 512
+      transform: true
+
+lightning:
+  strategy:
+    target: pytorch_lightning.strategies.DDPStrategy
+    params:
+      find_unused_parameters: True
+
+  modelcheckpoint:
+    params:
+      every_n_epochs: 1
+
+  callbacks:
+    metrics_over_trainsteps_checkpoint:
+      params:
+        every_n_train_steps: 50000
+
+    image_logger:
+      target: main.ImageLogger
+      params:
+        enable_autocast: False
+        batch_frequency: 1000
+        max_images: 8
+        increase_log_steps: True
+
+  trainer:
+    precision: bf16
+    devices: 0, 1, 2, 3
+    limit_val_batches: 50
+    benchmark: True
+    accumulate_grad_batches: 1
+    check_val_every_n_epoch: 1
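A note on `trainable_ae_params` in the phase 2 config: it lists name patterns for the parameters to optimize, here the encoder's output projection and the decoder's input projection. The sketch below shows one way such a selection could work on plain parameter names, assuming each pattern is treated as a regular expression matched against the start of the name with `re.match`. The function `select_trainable` and the example parameter names are hypothetical; the actual selection logic lives in the model code, which this diff does not show.

```python
import re
from typing import Iterable, List, Sequence


def select_trainable(param_names: Iterable[str],
                     pattern_groups: Sequence[Sequence[str]]) -> List[List[str]]:
    """Return, for each pattern group, the parameter names it selects."""
    names = list(param_names)
    groups = []
    for patterns in pattern_groups:
        compiled = [re.compile(p) for p in patterns]
        # Keep a name if any pattern in the group matches its prefix.
        groups.append([n for n in names if any(c.match(n) for c in compiled)])
    return groups


if __name__ == "__main__":
    # Made-up names for illustration; real names depend on the model definition.
    names = [
        "encoder.project_in.weight",
        "encoder.stages.0.weight",
        "encoder.project_out.weight",
        "decoder.project_in.weight",
        "decoder.project_out.weight",
    ]
    patterns = [["encoder.project_out", "decoder.project_in"]]
    for group in select_trainable(names, patterns):
        print(group)
```

Each inner list in the config would map to one optimizer parameter group under this reading; parameters that match no pattern are left out of the optimizer and so stay fixed.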