You employ a joint image-video training strategy using video clips from TartanAir and Virtual KITTI, alongside static images from Hypersim. These datasets are excellent and have been widely used for years to train depth estimation models, but they lack one crucial element: people! In almost every film, TV series or home video, people are the main focus of the recording, and stereo video conversion models in particular require very good video depth estimation on shots where people are the main subjects.
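(As a side note on how new human-centric data might slot into such a pipeline: below is a minimal sketch of one common joint image-video batching scheme, in which static images are repeated along the time axis to form motion-free pseudo-clips. The class, the pseudo-clip trick and all parameter names are my assumptions about a generic pipeline, not your actual implementation.)

```python
import random
from torch.utils.data import Dataset

class JointImageVideoDataset(Dataset):
    """Hypothetical sketch of joint image-video training data.

    `video_samples` holds (frames[T, 3, H, W], depths[T, 1, H, W]) tensors,
    `image_samples` holds (frame[3, H, W], depth[1, H, W]) tensors.
    Static images are repeated along time into motion-free pseudo-clips,
    so a single model sees both kinds of supervision.
    """

    def __init__(self, video_samples, image_samples, clip_len=16, image_ratio=0.25):
        self.video_samples = video_samples
        self.image_samples = image_samples
        self.clip_len = clip_len          # frames per training clip
        self.image_ratio = image_ratio    # fraction of samples drawn from static images

    def __len__(self):
        return len(self.video_samples) + len(self.image_samples)

    def __getitem__(self, idx):
        if random.random() < self.image_ratio:
            frame, depth = random.choice(self.image_samples)
            # Repeat the static image clip_len times -> a pseudo-clip with no motion.
            frames = frame.unsqueeze(0).repeat(self.clip_len, 1, 1, 1)
            depths = depth.unsqueeze(0).repeat(self.clip_len, 1, 1, 1)
        else:
            frames, depths = random.choice(self.video_samples)
            # Sample a random temporal window (assumes clips have >= clip_len frames).
            t0 = random.randint(0, frames.shape[0] - self.clip_len)
            frames = frames[t0:t0 + self.clip_len]
            depths = depths[t0:t0 + self.clip_len]
        return frames, depths
```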
I know that DVD is designed to provide very good zero-shot video depth estimation, but please look at the example below:
8530646-sd_960_540_25fps_color_depth_vis.mp4
Source of the video file: https://www.pexels.com/video/young-women-posing-in-the-park-on-a-windy-day-8530646/
Video depth estimation made using the DVD demo version: https://huggingface.co/spaces/haodongli/DVD
I can see two elements here that could be improved:
- Between the thirteenth and sixteenth seconds, the hand of the girl at the back is far too red compared to the face of the girl standing closer to the camera.
- Individual hairs and strands of hair on both girls lack continuity and appear to break up into individual dots.
Of course, compared to other video depth estimation models, the level of detail your model is able to capture is incredible. That’s precisely why I think it’s worth trying to refine your model even further, so that it really blows people away!
As you know your model’s architecture inside out, you might be able to make some changes that improve the depth estimation in this example. Additionally, I’d like to ask you to fine-tune the DVD model on datasets that could help with the two aspects mentioned above.
For human poses, the best dataset is the one below, which contains video clips along with depth maps:
| # | Dataset | Venue | Resolution | Unique features |
|---|---|---|---|---|
| 1 | BEDLAM2.0 📌 Human poses 😍 | | 1280×720 | BEDLAM2.0 is a large-scale synthetic video dataset of animated bodies in simulated clothing. With more than 8 million images, it is a significant expansion of the popular BEDLAM dataset that increases pose and body shape variation, and adds shoes and strand-based hair. Most importantly, it introduces a wide range of realistic cameras and camera motions. |
For human faces and hair, the best dataset is the one below, which contains static images along with depth maps. The GT depth maps likely do not cover the background, so you would need to replace the background with an image of a cloudless sky (see the preprocessing sketch after the table):
| # | Dataset | Venue | Resolution | Unique features |
|---|---|---|---|---|
| 2 | SynthHuman 📌 Human faces 😍 | | 384×512 | The dataset contains 98,040 samples featuring the face, 99,976 featuring the full body and 99,992 featuring the upper body. DAViD trained on this dataset alone achieved better depth estimation results than Depth Anything V2 Large, Depth Pro and even Sapiens-2B on the Goliath-Face test set. See the results in Table 1. |
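To make such depth maps dense, I imagine a preprocessing step roughly like the sketch below, which alpha-composites the person over a cloudless-sky photo and assigns the background a constant far depth. The file formats, the availability of a per-sample foreground mask, the `FAR_DEPTH` value and the function name are all my assumptions, not the dataset's actual layout:

```python
import numpy as np
from PIL import Image

FAR_DEPTH = 80.0  # assumed constant depth (in metres) assigned to the sky background

def composite_sky_background(rgb_path, depth_path, mask_path, sky_path):
    """Hypothetical preprocessing: paste a cloudless-sky image behind the
    person and give the background a constant far depth, so that the GT
    depth map covers the whole frame. Assumes a per-sample foreground mask."""
    rgb = np.asarray(Image.open(rgb_path).convert("RGB"), dtype=np.float32)
    depth = np.load(depth_path)  # assumed float32 HxW depth map
    mask = np.asarray(Image.open(mask_path).convert("L"), dtype=np.float32) / 255.0

    # Resize the sky photo to the sample's (width, height).
    sky = Image.open(sky_path).convert("RGB").resize((rgb.shape[1], rgb.shape[0]))
    sky = np.asarray(sky, dtype=np.float32)

    m = mask[..., None]                  # HxWx1 soft foreground mask
    rgb_out = m * rgb + (1.0 - m) * sky  # alpha-composite the person over the sky
    depth_out = np.where(mask > 0.5, depth, FAR_DEPTH).astype(np.float32)
    return rgb_out.round().astype(np.uint8), depth_out
```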
I think that training on these datasets will improve not only the example above, but also the results on the Bonn and Sintel test sets in your paper.