EvertBuzonBadillo/diffusion-attention-motion

A framework to add motion to diffusion model generated images

This repository contains an experimental project developed as part of a Practical Master course at LMU Munich (SS24).

The project was carried out by Evert Buzon and Yuan Cui.

The main goal of this work was to explore whether motion can be introduced into images generated by diffusion models by manipulating their cross-attention maps.

The ultimate aim was to generate short videos from diffusion-generated images by progressively moving objects inside the image.


Background

This project builds on the Prompt-to-Prompt editing framework introduced in:

Hertz et al.
Prompt-to-Prompt Image Editing with Cross-Attention Control
2022

Prompt-to-Prompt allows editing images generated by Stable Diffusion by manipulating the cross-attention maps that connect words in the prompt with spatial regions of the image.

The code used in this repository is based on the original Prompt-to-Prompt implementation and was modified for the purposes of this experiment.

Full credit goes to the original authors for the Prompt-to-Prompt framework.

Stable Diffusion itself is based on:

Rombach et al.
High-Resolution Image Synthesis with Latent Diffusion Models
2022


Project Idea

The main idea of this project was to investigate whether objects inside a diffusion-generated image could be moved by shifting their attention maps.

Instead of replacing a word in the prompt (as in the original Prompt-to-Prompt framework), we shift the cross-attention map associated with a specific object token.

For example, if the prompt contains the word "burger", the attention map of that word can be slightly shifted to move the burger in the image.

If this shift is repeated across multiple generated frames, it could potentially create the illusion of motion, which would allow the generation of simple videos from diffusion images.
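The core operation can be sketched as a spatial shift of a single token's attention map. The function below is a minimal NumPy illustration written for this README, not code taken from the repository; it shifts the map with np.roll and then zeroes the wrapped-in region so the object does not reappear on the opposite edge.

```python
import numpy as np

def shift_attention_map(attn_map, dx, dy):
    """Shift a 2D cross-attention map by (dx, dy) pixels.

    np.roll wraps values around the array edges, so the rows/columns
    that wrapped in are zeroed afterwards to keep the shift clean.
    """
    shifted = np.roll(attn_map, shift=(dy, dx), axis=(0, 1))
    # Zero out the wrapped-around region.
    if dy > 0:
        shifted[:dy, :] = 0
    elif dy < 0:
        shifted[dy:, :] = 0
    if dx > 0:
        shifted[:, :dx] = 0
    elif dx < 0:
        shifted[:, dx:] = 0
    return shifted
```

In the actual experiments, this kind of shift is applied to the cross-attention maps inside the diffusion model's U-Net rather than to a standalone array.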


Approaches Tested

Two different approaches were explored in the project.

1. Incremental Approach

In the incremental approach, the object is moved using very small shifts applied iteratively.

Each new image is generated using the previous image as a reference, and the shifts accumulate gradually.

Advantages

  • Lower distortion
  • More stable motion

Disadvantages

  • Computationally expensive
  • Requires many iterations to produce noticeable movement

Another limitation encountered during the project was that cross-attention maps of generated images could not be stored and reused, which prevented the implementation of a full video generation pipeline.
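The incremental loop can be sketched as follows. Note that `generate_with_attention_shift` is a hypothetical placeholder standing in for a Prompt-to-Prompt generation call; it is not part of this repository's actual API.

```python
# Hypothetical sketch of the incremental approach: each frame adds a
# small shift on top of the previous frame's accumulated offset.

STEP = 2           # pixels per frame; kept small to limit distortion
NUM_FRAMES = 10

def generate_with_attention_shift(prompt, token, dx, dy):
    # Placeholder: in the real experiment this would run Stable
    # Diffusion with the token's cross-attention map shifted by (dx, dy).
    return {"prompt": prompt, "token": token, "offset": (dx, dy)}

frames = []
total_dx = 0
for _ in range(NUM_FRAMES):
    total_dx += STEP                    # shifts accumulate gradually
    frames.append(generate_with_attention_shift(
        "a photo of a burger", "burger", total_dx, 0))
```

Because each frame depends on the previous one, producing noticeable motion requires many full diffusion runs, which is where the computational cost comes from.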


2. Variation-Shift Approach

The second method tested was the Variation-Shift Approach.

Instead of performing many small shifts, this approach applies different shift magnitudes directly to the original image.

Advantages

  • Fewer iterations required
  • Faster experimentation

Disadvantages

  • Larger shifts introduce stronger distortions in the generated images
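In contrast to the incremental loop, the variation-shift approach can be sketched as a single pass over a list of magnitudes, always starting from the same source image. As before, `generate_frame` is a hypothetical placeholder, not this repository's actual API.

```python
# Hypothetical sketch of the variation-shift approach: each candidate
# frame is generated directly from the original image, varying only
# the shift magnitude applied to the token's cross-attention map.

def generate_frame(prompt, token, dx):
    # Placeholder for a single Prompt-to-Prompt run with the token's
    # attention map shifted dx pixels horizontally.
    return {"token": token, "dx": dx}

magnitudes = [4, 8, 12, 16, 20]   # pixels; larger values distort more
candidate_frames = [generate_frame("a photo of a burger", "burger", dx)
                    for dx in magnitudes]
```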

Local blending was also tested in order to reduce distortions around the manipulated object.
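Local blending can be sketched as masking the edit to the object region: pixels with low attention are copied back from the original image. The NumPy function below is an illustrative simplification of this idea (the threshold value is arbitrary), not the repository's implementation.

```python
import numpy as np

def local_blend(original, edited, attn_map, threshold=0.3):
    """Keep the edit only where the attention map is strong.

    Pixels whose normalized attention falls below `threshold` are
    copied from the original image, confining distortions to the
    neighbourhood of the moved object.
    Shapes: images (H, W, 3), attention map (H, W).
    """
    mask = (attn_map / attn_map.max()) > threshold   # object region
    mask = mask[..., None].astype(original.dtype)    # broadcast to RGB
    return mask * edited + (1 - mask) * original
```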


Implementation

The experiments were implemented using the Stable Diffusion Prompt-to-Prompt framework.

Most of the code in this repository comes from the original Prompt-to-Prompt implementation.

The modifications introduced in this project mainly include:

  • shifting cross-attention maps in order to move objects in the generated image
  • testing incremental attention shifts
  • testing variation shifts of different magnitudes
  • experimenting with local blending to reduce distortions

The main experiment notebook can be found in:

prompt-to-prompt_stable.ipynb

Results

The experiments show that small attention shifts can indeed produce small spatial movements in generated images.

However, the quality of the generated images strongly depends on the magnitude of the shift.

Large shifts introduce distortions, while small shifts require many iterations.

Because of this trade-off, generating stable video sequences remains challenging.


Limitations

This project should be considered a proof of concept.

Several limitations were encountered:

  • Large shifts cause noticeable distortions
  • Incremental shifting is computationally expensive
  • Cross-attention maps could not be stored between iterations, which prevented the implementation of a full video generation pipeline

Report

The full report submitted at LMU Munich is included in this repository:

report/A framework to add motion to diffusion model generated images.pdf

The report provides a more detailed explanation of the method, experiments, and results.

Note on the Report Format

The report included in this repository follows the CVPR paper format template.

This format was required as part of the course assignment in the Practical Master at LMU Munich (SS24), under the supervision of Nick Stracke.

The use of the CVPR template does not imply that this work was submitted to CVPR.
It was only used as an academic writing format for the course project.

The intention of the format requirement was to train students to structure research reports in a format commonly used in computer vision conferences.


Credits

This work builds directly on the following research:

Hertz et al.
Prompt-to-Prompt Image Editing with Cross-Attention Control
2022

Rombach et al.
High-Resolution Image Synthesis with Latent Diffusion Models
2022

The original Prompt-to-Prompt code and framework belong to their respective authors.

This repository contains experimental modifications made for an academic course project.


Authors

This project was developed as part of the Practical Master course at LMU Munich (SS24).

Evert Buzon
Yuan Cui


Repository Status

This repository contains experimental code developed for a university course project.

It is not intended to be a production implementation.
