You employ a joint image-video training strategy using video clips from TartanAir and Virtual KITTI, alongside static images from Hypersim. These datasets are excellent and have been widely used for years to train depth estimation models, but they lack one crucial element: people! In almost every film, TV series or home video, people are the main focus of the recording, and stereo video conversion models in particular require very good video depth estimation on shots where people are the main subjects.
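(As a side note on how new human-centric data might slot into such a pipeline: below is a minimal sketch of one common joint image-video batching scheme, in which static images are repeated along the time axis to form motion-free pseudo-clips. The class, the pseudo-clip trick and all parameter names are my assumptions about a generic pipeline, not your actual implementation.)

```python
import random
from torch.utils.data import Dataset

class JointImageVideoDataset(Dataset):
    """Hypothetical sketch of joint image-video training data.

    `video_samples` holds (frames[T, 3, H, W], depths[T, 1, H, W]) tensors,
    `image_samples` holds (frame[3, H, W], depth[1, H, W]) tensors.
    Static images are repeated along time into motion-free pseudo-clips,
    so a single model sees both kinds of supervision.
    """

    def __init__(self, video_samples, image_samples, clip_len=16, image_ratio=0.25):
        self.video_samples = video_samples
        self.image_samples = image_samples
        self.clip_len = clip_len          # frames per training clip
        self.image_ratio = image_ratio    # fraction of samples drawn from static images

    def __len__(self):
        return len(self.video_samples) + len(self.image_samples)

    def __getitem__(self, idx):
        if random.random() < self.image_ratio:
            frame, depth = random.choice(self.image_samples)
            # Repeat the static image clip_len times -> a pseudo-clip with no motion.
            frames = frame.unsqueeze(0).repeat(self.clip_len, 1, 1, 1)
            depths = depth.unsqueeze(0).repeat(self.clip_len, 1, 1, 1)
        else:
            frames, depths = random.choice(self.video_samples)
            # Sample a random temporal window (assumes clips have >= clip_len frames).
            t0 = random.randint(0, frames.shape[0] - self.clip_len)
            frames = frames[t0:t0 + self.clip_len]
            depths = depths[t0:t0 + self.clip_len]
        return frames, depths
```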
I know that DVD is designed to provide very good zero-shot video depth estimation, but please look at the example below:
8530646-sd_960_540_25fps_color_depth_vis.mp4
Source of the video file: https://www.pexels.com/video/young-women-posing-in-the-park-on-a-windy-day-8530646/
Video depth estimation made using the DVD demo version: https://huggingface.co/spaces/haodongli/DVD
I can see two elements here that could be improved:
- Between the thirteenth and sixteenth seconds, the hand of the girl at the back is far too red compared to the face of the girl standing closer to the camera.
- Individual hairs and strands of hair on both girls lack continuity and appear to break up into individual dots.
Of course, compared to other video depth estimation models, the level of detail your model is able to capture is incredible. That’s precisely why I think it’s worth trying to refine your model even further, so that it really blows people away!
As you know your model’s architecture inside out, you might be able to make some changes that improve the depth estimation in this example. Additionally, I’d like to ask you to fine-tune the DVD model on datasets that could help with the two aspects mentioned above.
For human poses, the best dataset is the one below, which contains video clips along with depth maps:
| # | Dataset | Venue | Resolution | Unique features |
|---|---|---|---|---|
| 1 | BEDLAM2.0 📌 Human poses 😍 | | 1280×720 | BEDLAM2.0 is a large-scale synthetic video dataset of animated bodies in simulated clothing. With more than 8 million images, it is a significant expansion of the popular BEDLAM dataset that increases pose and body shape variation, and adds shoes and strand-based hair. Most importantly, it introduces a wide range of realistic cameras and camera motions. |
For human faces and hair, the best dataset is the one below, which contains static images along with depth maps. The GT depth maps likely do not cover the background, so you would need to replace the background with an image of a cloudless sky (see the preprocessing sketch after the table):
| # | Dataset | Venue | Resolution | Unique features |
|---|---|---|---|---|
| 2 | SynthHuman 📌 Human faces 😍 | | 384×512 | The dataset contains 98,040 samples featuring the face, 99,976 featuring the full body and 99,992 featuring the upper body. DAViD trained on this dataset alone achieved better depth estimation results than Depth Anything V2 Large, Depth Pro and even Sapiens-2B on the Goliath-Face test set. See the results in Table 1. |
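To make such depth maps dense, I imagine a preprocessing step roughly like the sketch below, which alpha-composites the person over a cloudless-sky photo and assigns the background a constant far depth. The file formats, the availability of a per-sample foreground mask, the `FAR_DEPTH` value and the function name are all my assumptions, not the dataset's actual layout:

```python
import numpy as np
from PIL import Image

FAR_DEPTH = 80.0  # assumed constant depth (in metres) assigned to the sky background

def composite_sky_background(rgb_path, depth_path, mask_path, sky_path):
    """Hypothetical preprocessing: paste a cloudless-sky image behind the
    person and give the background a constant far depth, so that the GT
    depth map covers the whole frame. Assumes a per-sample foreground mask."""
    rgb = np.asarray(Image.open(rgb_path).convert("RGB"), dtype=np.float32)
    depth = np.load(depth_path)  # assumed float32 HxW depth map
    mask = np.asarray(Image.open(mask_path).convert("L"), dtype=np.float32) / 255.0

    # Resize the sky photo to the sample's (width, height).
    sky = Image.open(sky_path).convert("RGB").resize((rgb.shape[1], rgb.shape[0]))
    sky = np.asarray(sky, dtype=np.float32)

    m = mask[..., None]                  # HxWx1 soft foreground mask
    rgb_out = m * rgb + (1.0 - m) * sky  # alpha-composite the person over the sky
    depth_out = np.where(mask > 0.5, depth, FAR_DEPTH).astype(np.float32)
    return rgb_out.round().astype(np.uint8), depth_out
```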
I think that training on these datasets will improve not only the example above, but also the results on the Bonn and Sintel test sets in your paper.