dvc import --no-download and dvc exp run --pull: inconsistent .dvc stage output between runs #11005

@jefferyyjhsu

Bug Report

Description

dvc exp run --pull modifies the .dvc file of an import on a later experiment if the imported data was deleted from the workspace in between. The modified .dvc file can't be reproduced on a different machine.

Reproduce

  1. Create a new repo: mkdir /tmp/example; cd /tmp/example; git init; dvc init.
  2. Create an empty dvc.yaml: touch dvc.yaml.
  3. Import DVC-tracked data without downloading it: dvc import https://github.com/treeverse/vscode-dvc-demo.git data/ -o remote_data/ --no-download.
  4. Stage and commit the first version of remote_data.dvc:
md5: 19072ec61004c43cb45ae849652c51e0
frozen: true
deps:
- path: data/
  repo:
    url: https://github.com/treeverse/vscode-dvc-demo.git
    rev_lock: 8518e8734f819b261cbf21e4bc6a1df93a90fe79
outs:
- hash: md5
  path: remote_data
  5. Run the experiment with dvc exp run --pull.
  6. remote_data.dvc is unchanged after the experiment. The remote_data folder now looks like:
├── remote_data
│   └── MNIST
│       └── raw
│           ├── t10k-images-idx3-ubyte
│           ├── t10k-images-idx3-ubyte.gz
│           ├── t10k-labels-idx1-ubyte
│           ├── t10k-labels-idx1-ubyte.gz
│           ├── train-images-idx3-ubyte
│           ├── train-images-idx3-ubyte.gz
│           ├── train-labels-idx1-ubyte
│           └── train-labels-idx1-ubyte.gz
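  7. Remove the remote_data dir and run dvc exp run --pull again. This time, remote_data.dvc is modified (the outs entry gains md5, size, and nfiles fields), and the remote_data dir contains the data twice: once under MNIST/ and once under a nested data/ dir, for 16 files instead of 8.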
md5: 1c23a5e0d2f34ad183f5d7010db87de9
frozen: true
deps:
- path: data/
  repo:
    url: https://github.com/treeverse/vscode-dvc-demo.git
    rev_lock: 8518e8734f819b261cbf21e4bc6a1df93a90fe79
outs:
- hash: md5
  path: remote_data
  md5: 077582768b6d3e879851f57144cd20db.dir
  size: 133089540
  nfiles: 16
├── remote_data
│   ├── data
│   │   └── MNIST
│   │       └── raw
│   │           ├── t10k-images-idx3-ubyte
│   │           ├── t10k-images-idx3-ubyte.gz
│   │           ├── t10k-labels-idx1-ubyte
│   │           ├── t10k-labels-idx1-ubyte.gz
│   │           ├── train-images-idx3-ubyte
│   │           ├── train-images-idx3-ubyte.gz
│   │           ├── train-labels-idx1-ubyte
│   │           └── train-labels-idx1-ubyte.gz
│   └── MNIST
│       └── raw
│           ├── t10k-images-idx3-ubyte
│           ├── t10k-images-idx3-ubyte.gz
│           ├── t10k-labels-idx1-ubyte
│           ├── t10k-labels-idx1-ubyte.gz
│           ├── train-images-idx3-ubyte
│           ├── train-images-idx3-ubyte.gz
│           ├── train-labels-idx1-ubyte
│           └── train-labels-idx1-ubyte.gz
  8. Stage and commit the second remote_data.dvc.

  9. On a different machine, clone the repo and repeat the experiment with dvc exp run --pull. With the second version of remote_data.dvc, the run fails as follows.

Logs

Reproducing experiment 'agley-ankh'                                                                                                               
Building workspace index                                                                                                |0.00 [00:00,    ?entry/s]
Comparing indexes                                                                                                       |12.0 [00:00, 28.8entry/s]
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte.gz'. It won't be created. 
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte'. It won't be created.   
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte'. It won't be created.   
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte'. It won't be created.    
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte.gz'. It won't be created. 
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte'. It won't be created.    
Applying changes                                                                                                        |0.00 [00:00,     ?file/s]
WARNING: Failed to pull run cache: config file error: no remote specified in /home/USER/code/dvc-debug. Create a default remote with
    dvc remote add -d <remote name> <remote url>
Pulling data for stage: 'remote_data.dvc'
Collecting                                                                                                              |11.0 [00:00,  857entry/s]
Fetching
Building workspace index                                                                                                |0.00 [00:00,    ?entry/s]
Comparing indexes                                                                                                       |12.0 [00:00,  311entry/s]
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte.gz'. It won't be created. 
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte'. It won't be created.   
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte'. It won't be created.   
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte'. It won't be created.    
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte.gz'. It won't be created. 
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte'. It won't be created.    
Applying changes                                                                                                        |0.00 [00:00,     ?file/s]
Unable to pull data for stage: 'remote_data.dvc'
Verifying outputs in frozen stage: 'remote_data.dvc'
ERROR: failed to reproduce 'remote_data.dvc': missing data 'source': remote_data

  10. Continuing on the second machine, check out the commit containing the first version of remote_data.dvc and repeat the experiment. This time, the experiment succeeds.

Logs

Reproducing experiment 'curly-pein'                                                                                                               
Building workspace index                                                                                                |0.00 [00:00,    ?entry/s]
Comparing indexes                                                                                                       |12.0 [00:00, 26.7entry/s]
Applying changes                                                                                                        |0.00 [00:00,     ?file/s]
WARNING: Failed to pull run cache: config file error: no remote specified in /home/USER/code/dvc-debug. Create a default remote with
    dvc remote add -d <remote name> <remote url>
Pulling data for stage: 'remote_data.dvc'
Collecting                                                                                                              |11.0 [00:00,  849entry/s]
Fetching
Building workspace index                                                                                                |0.00 [00:00,    ?entry/s]
Comparing indexes                                                                                                       |12.0 [00:00,  716entry/s]
Applying changes                                                                                                        |8.00 [00:00,   101file/s]
                                                                                                                                                  
Ran experiment(s): curly-pein                                                                                                                     
Experiment results have been applied to your workspace.

  11. The remote_data folder now looks like this on the second machine.
├── remote_data
│   └── MNIST
│       └── raw
│           ├── t10k-images-idx3-ubyte
│           ├── t10k-images-idx3-ubyte.gz
│           ├── t10k-labels-idx1-ubyte
│           ├── t10k-labels-idx1-ubyte.gz
│           ├── train-images-idx3-ubyte
│           ├── train-images-idx3-ubyte.gz
│           ├── train-labels-idx1-ubyte
│           └── train-labels-idx1-ubyte.gz
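
The steps on the first machine (up to committing the second remote_data.dvc) can be collected into one script. This is only a sketch built from the commands quoted above: the function name repro_first_machine, the default work directory, and the commit messages are illustrative and not part of the report. It assumes git and dvc are on PATH, a Git identity is configured, and the demo repository is reachable.

```shell
# Sketch of the first-machine reproduction; nothing runs until the function is called.
# The function name, default directory, and commit messages are illustrative.
repro_first_machine() (
    set -eu
    workdir=${1:-/tmp/example}

    mkdir -p "$workdir"
    cd "$workdir"
    git init
    dvc init
    touch dvc.yaml

    # Import without downloading, then commit the first remote_data.dvc.
    dvc import https://github.com/treeverse/vscode-dvc-demo.git data/ \
        -o remote_data/ --no-download
    git add .
    git commit -m "Import remote_data without download"

    # First experiment: remote_data.dvc is reported to stay unchanged.
    dvc exp run --pull
    git status --short remote_data.dvc

    # Delete the data and run again: remote_data.dvc is reported to change.
    rm -rf remote_data
    dvc exp run --pull
    git --no-pager diff -- remote_data.dvc

    # Commit the second remote_data.dvc for the second-machine test.
    git add remote_data.dvc
    git commit -m "Commit second remote_data.dvc"
)
```

Calling repro_first_machine /tmp/example runs the steps. If the behavior reproduces, the git diff output should show the md5, size, and nfiles fields added under outs.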

Expected

dvc exp run --pull shouldn't change remote_data.dvc during the second experiment. Alternatively, if it does change the file, the new version should still be reproducible on another machine.
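
A quick way to tell the expected state from the broken one is to check for the two symptoms shown above: a .dir hash recorded under outs in the .dvc file, and the source dir nested inside the output dir. The helper below is a hypothetical sketch, not part of DVC; check_import_state and its two arguments are made up for illustration, and the grep pattern assumes the .dvc layout shown in this report.

```shell
# Hypothetical helper: returns 0 if neither symptom from this report is present.
# $1 = output dir (e.g. remote_data); $2 = name of the imported source dir (e.g. data)
check_import_state() {
    out_dir=$1
    src_name=$2
    status=0

    # Symptom 1: the outs entry in the .dvc file gained an indented "md5: <hash>.dir" line.
    if grep -Eq '^[[:space:]]+md5: [0-9a-f]{32}\.dir$' "${out_dir}.dvc"; then
        echo "modified: ${out_dir}.dvc records a .dir hash"
        status=1
    fi

    # Symptom 2: the source dir is nested inside the output dir.
    if [ -d "${out_dir}/${src_name}" ]; then
        echo "nested: ${out_dir}/${src_name} exists"
        status=1
    fi

    return "$status"
}
```

Run from the repo root as check_import_state remote_data data after each dvc exp run --pull; a non-zero exit status means at least one symptom is present.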

Output of dvc doctor:

DVC version: 3.66.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.8.0-101-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 3.18.2
	dvc_objects = 5.2.0
	dvc_render = 1.0.2
	dvc_task = 0.40.2
	scmrepo = 3.6.1
Supports:
	http (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
	https (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
	ssh (sshfs = 2025.11.0)
Config:
	Global: /home/jeffery.hsu/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/crypt-home
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/crypt-home
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9033490812fdfc6cf24ba81dece2685b
