lavinal712 · May 12, 2025
diff --git a/README.md b/README.md
@@ -1,98 +1,37 @@
-# AutoencoderKL
+# ViTVAE
 
-## About The Project
+For a long time, VAEs have relied mostly on CNN architectures. A few projects, such as [ViT-VQGAN](https://github.com/thuanz123/enhancing-transformers) and [ViTok](https://github.com/philippe-eecs/vitok), have experimented with Transformers. Various acceleration techniques now make Transformer-based VAEs more practical to train and easier to scale to larger model sizes. This branch provides a 2D version of the [MAGI-1 VAE](https://github.com/SandAI-org/MAGI-1/tree/main/inference/model/vae).
 
-There are many great training scripts for VAE on Github. However, some repositories are not maintained and some are not updated to the latest version of PyTorch. Therefore, I decided to create this repository to provide a simple and easy-to-use training script for VAE by Lightning. Beside, the code is easy to transfer to other projects for time saving.
-
-- Support training and finetuning both [Stable Diffusion](https://github.com/CompVis/stable-diffusion) VAE and [Flux](https://github.com/black-forest-labs/flux) VAE.
-- Support evaluating reconstruction quality (FID, PSNR, SSIM, LPIPS).
-- A practical guidance of training VAE.
-- Easy to modify the code for your own research.
-
-<!-- GETTING STARTED -->
 ## Getting Started
 
-To get a local copy up and running follow these simple example steps.
-
 ### Installation
 
-```bash
-git clone https://github.com/lavinal712/AutoencoderKL.git
+```
+git clone https://github.com/lavinal712/AutoencoderKL.git -b dc-ae
 cd AutoencoderKL
 conda create -n autoencoderkl python=3.10 -y
 conda activate autoencoderkl
 pip install -r requirements.txt
+pip install --no-cache-dir --no-build-isolation flash-attn==2.7.0.post2
 ```
 
 ### Training
 
-To start training, you need to prepare a config file. You can refer to the config files in the `configs` folder.
-
-If you want to train on your own dataset, you should write your own data loader in `sgm/data` and modify the parameters in the config file.
-
-Finetuning a VAE model is simple. You just need to specify the `ckpt_path` and `trainable_ae_params` in the config file. To keep the latent space of the original model, it is recommended to set decoder to be trainable.
-
-Then, you can start training by running the following command.
-
 ```bash
-NUM_GPUS=4
-NUM_NODES=1
-
-torchrun --nproc_per_node=${NUM_GPUS} --nnodes=${NUM_NODES} main.py \
-    --base configs/autoencoder_kl_32x32x4.yaml \
+torchrun --nproc_per_node=4 --nnodes=1 main.py \
+    --base configs/magi-1_2d.yaml \
     --train \
-    --logdir logs/autoencoder_kl_32x32x4 \
-    --scale_lr True \
-    --wandb False \
-```
-
-### Evaluation
-
-We provide a script to evaluate the reconstruction quality of the trained model. `--resume` provides a convenient way to load the checkpoint from the log directory.
-
-We introduce multi-GPU and multi-thread method for faster evaluation.
-
-The default dataset is ImageNet. You can change the dataset by modifying the `--datadir` in the command line and the evaluation script.
-
-```bash
-NUM_GPUS=4
-NUM_NODES=1
-
-torchrun --nproc_per_node=${NUM_GPUS} --nnodes=${NUM_NODES} eval.py \
-    --resume logs/autoencoder_kl_32x32x4 \
-    --base configs/autoencoder_kl_32x32x4.yaml \
-    --logdir eval/autoencoder_kl_32x32x4 \
-    --datadir /path/to/ImageNet \
-    --image_size 256 \
-    --batch_size 16 \
-    --num_workers 16 \
+    --scale_lr False \
+    --wandb True
 ```
 
-### Converting to diffusers
-
-[huggingface/diffusers](https://github.com/huggingface/diffusers) is a library for diffusion models. It provides a script [convert_vae_pt_to_diffusers.py
-](https://github.com/huggingface/diffusers/blob/main/scripts/convert_vae_pt_to_diffusers.py) to convert a PyTorch Lightning model to a diffusers model.
+## TODO
 
-Currently, the script is not updated for all kinds of VAE models, just for SD VAE.
-
-```bash
-python convert_vae_pt_to_diffusers.py \
-    --vae_path logs/autoencoder_kl_32x32x4/checkpoints/last.ckpt \
-    --dump_path autoencoder_kl_32x32x4 \
-```
-
-## Guidance
-
-Here are some guidance for training VAE. If there are any mistakes, please let me know.
-
-- Learning rate: In LDM repository [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion), the base learning rate is set to 4.5e-6 in the config file. However, the batch size is 12, accumulated gradient is 2 and `scale_lr` is set to `True`. Therefore, the effective learning rate is 4.5e-6 * 12 * 2 * 1 = 1.08e-4. It is better to set the learning rate from 1.0e-4 to 1.0e-5. In finetuning stage, it can be smaller than the first stage.
-  - `scale_lr`: It is better to set `scale_lr` to `False` when training on a large dataset.
-- Discriminator: You should open the discriminator in the end of the training, when the VAE has good reconstruction performance. In default, `disc_start` is set to 50001.
-- Perceptual loss: LPIPS is a good metric for evaluating the quality of the reconstructed images. Some models use other perceptual loss functions to gain better performance, such as [sypsyp97/convnext_perceptual_loss](https://github.com/sypsyp97/convnext_perceptual_loss).
+- [ ] Support [ViT-VQGAN](https://github.com/thuanz123/enhancing-transformers).
+- [ ] Support [ViTok](https://github.com/philippe-eecs/vitok).
 
 ## Acknowledgments
 
-Thanks for the following repositories. Without their code, this project would not be possible.
-
-- [Stability-AI/generative-models](https://github.com/Stability-AI/generative-models). We heavily borrow the code from this repository, just modifing a few parameters for our concept.
-- [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion). We follow the hyperparameter settings of this repository in config files.
+- [MAGI-1](https://github.com/SandAI-org/MAGI-1)
+- [ViT-VQGAN](https://github.com/thuanz123/enhancing-transformers)
+- [ViTok](https://github.com/philippe-eecs/vitok)
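Note on `--scale_lr`: the new training command passes `--scale_lr False`, while the removed guidance section worked through the scaled case (4.5e-6 * 12 * 2 * 1 = 1.08e-4). The sketch below reproduces that arithmetic so the two settings can be compared. It assumes the scaling rule is the product described in the removed text (base rate × batch size × gradient accumulation × GPU count); the function name `effective_learning_rate` is hypothetical and is not part of `main.py`.

```python
def effective_learning_rate(base_lr, batch_size, num_gpus,
                            accumulate_grad_batches=1, scale_lr=True):
    """Learning rate under the assumed linear scaling rule.

    Hypothetical helper: it mirrors the arithmetic in the removed README
    guidance, not necessarily the exact code path in main.py.
    """
    if not scale_lr:
        # With scaling disabled, the configured base rate is used unchanged.
        return base_lr
    return base_lr * batch_size * num_gpus * accumulate_grad_batches


# Case from the removed guidance: LDM config, batch 12, accumulation 2, 1 GPU.
ldm_lr = effective_learning_rate(4.5e-6, batch_size=12, num_gpus=1,
                                 accumulate_grad_batches=2)

# Case for this branch: base rate 1.0e-4, batch 16, 4 GPUs, scale_lr False.
vit_lr = effective_learning_rate(1.0e-4, batch_size=16, num_gpus=4,
                                 scale_lr=False)

print(f"scaled LDM rate:  {ldm_lr:.3e}")
print(f"unscaled ViT rate: {vit_lr:.3e}")
```

Under this assumption, turning `scale_lr` back on with the values in `configs/magi-1_2d.yaml` would multiply the rate by 16 * 4 * 1 = 64, which is worth keeping in mind when changing the GPU count or batch size.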
diff --git a/configs/magi-1_2d.yaml b/configs/magi-1_2d.yaml
@@ -0,0 +1,90 @@
+model:
+  base_learning_rate: 1.0e-4
+  target: sgm.models.autoencoder.AutoencodingEngine
+  params:
+    input_key: jpg
+    monitor: "val/loss/rec"
+    disc_start_iter: 50001
+
+    encoder_config:
+      target: sgm.modules.vaemodules.model.ViTEncoder
+      params:
+        conv_last_layer: true
+        depth: 24
+        double_z: true
+        embed_dim: 1024
+        in_chans: 3
+        ln_in_attn: true
+        mlp_ratio: 4
+        norm_code: false
+        num_heads: 16
+        patch_size: 8
+        qkv_bias: true
+        img_size: 256
+        z_chans: 16
+
+    decoder_config:
+      target: sgm.modules.vaemodules.model.ViTDecoder
+      params: ${model.params.encoder_config.params}
+
+    regularizer_config:
+      target: sgm.modules.autoencoding.regularizers.DiagonalGaussianRegularizer
+
+    loss_config:
+      target: sgm.modules.autoencoding.losses.GeneralLPIPSWithDiscriminator
+      params:
+        perceptual_weight: 0.25
+        disc_start: 50001
+        disc_weight: 0.0
+        learn_logvar: false
+        pixel_loss: "l1"
+        regularization_weights:
+          kl_loss: 1.0e-6
+
+data:
+  target: sgm.data.imagenet.ImageNetLoader
+  params:
+    batch_size: 16
+    num_workers: 4
+    prefetch_factor: 2
+    shuffle: true
+
+    train:
+      root_dir: /home/azureuser/v-yuqianhong/ImageNet/ILSVRC2012
+      size: 256
+      transform: true
+    validation:
+      root_dir: /home/azureuser/v-yuqianhong/ImageNet/ILSVRC2012
+      size: 256
+      transform: true
+
+lightning:
+  strategy:
+    target: pytorch_lightning.strategies.DDPStrategy
+    params:
+      find_unused_parameters: True
+
+  modelcheckpoint:
+    params:
+      every_n_epochs: 1
+
+  callbacks:
+    metrics_over_trainsteps_checkpoint:
+      params:
+        every_n_train_steps: 50000
+
+    image_logger:
+      target: main.ImageLogger
+      params:
+        enable_autocast: False
+        batch_frequency: 1000
+        max_images: 8
+        increase_log_steps: True
+
+  trainer:
+    precision: bf16
+    devices: 0, 1, 2, 3
+    limit_val_batches: 50
+    benchmark: True
+    accumulate_grad_batches: 1
+    check_val_every_n_epoch: 1
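Note on the encoder settings: the values `img_size: 256`, `patch_size: 8`, `z_chans: 16`, `embed_dim: 1024`, and `num_heads: 16` together determine the token count and latent shape. The sketch below works those numbers out. It assumes non-overlapping patches with one latent vector per patch, and that `double_z: true` makes the encoder emit twice `z_chans` channels (mean and log-variance) ahead of the Gaussian regularizer; neither assumption is confirmed by this diff, and the class `ViTVAEShape` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ViTVAEShape:
    """Hypothetical helper holding the shape-related values from magi-1_2d.yaml."""
    img_size: int = 256
    patch_size: int = 8
    in_chans: int = 3
    z_chans: int = 16
    embed_dim: int = 1024
    num_heads: int = 16
    double_z: bool = True

    def grid(self) -> int:
        # Assumes non-overlapping patches, so the image must divide evenly.
        if self.img_size % self.patch_size != 0:
            raise ValueError("img_size must be divisible by patch_size")
        return self.img_size // self.patch_size

    def num_tokens(self) -> int:
        return self.grid() ** 2

    def head_dim(self) -> int:
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads

    def encoder_out_channels(self) -> int:
        # Assumption: double_z doubles the channels (mean + log-variance).
        return 2 * self.z_chans if self.double_z else self.z_chans

    def latent_shape(self) -> tuple:
        # Channels-first shape of the sampled latent for one image.
        return (self.z_chans, self.grid(), self.grid())

    def compression_ratio(self) -> float:
        # Ratio of input values to latent values per image.
        pixels = self.in_chans * self.img_size ** 2
        latents = self.z_chans * self.grid() ** 2
        return pixels / latents


shape = ViTVAEShape()
print("tokens per image:", shape.num_tokens())
print("attention head dim:", shape.head_dim())
print("encoder output channels:", shape.encoder_out_channels())
print("latent shape:", shape.latent_shape())
print("compression ratio:", shape.compression_ratio())
```

Under these assumptions a 256×256 image becomes 1024 tokens and a 16×32×32 latent, a 12× reduction in the number of values.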
diff --git a/main.py b/main.py
@@ -132,7 +132,7 @@ def str2bool(v):
     parser.add_argument(
         "--projectname",
         type=str,
-        default="autoencoderkl",
+        default="vitvae",
     )
     parser.add_argument(
         "-l",

diff --git a/sgm/modules/vaemodules/__init__.py b/sgm/modules/vaemodules/__init__.py