Bug Report
dvc import --no-download and dvc exp run --pull: inconsistent .dvc stage output between runs
Description
dvc exp run --pull alters a .dvc stage in subsequent experiments if the original data is deleted. The altered stage can't be reproduced on a different machine.
Reproduce
- Create a new repo:
  mkdir /tmp/example; cd /tmp/example; git init; dvc init
- Create an empty dvc.yaml:
  touch dvc.yaml
- Import dvc tracked data:
  dvc import https://github.com/treeverse/vscode-dvc-demo.git data/ -o remote_data/ --no-download
- Stage and commit the first remote_data.dvc:
md5: 19072ec61004c43cb45ae849652c51e0
frozen: true
deps:
- path: data/
  repo:
    url: https://github.com/treeverse/vscode-dvc-demo.git
    rev_lock: 8518e8734f819b261cbf21e4bc6a1df93a90fe79
outs:
- hash: md5
  path: remote_data
- Run the experiment with dvc exp run --pull. remote_data.dvc is unchanged after the experiment. The remote_data folder now looks like:
├── remote_data
│   └── MNIST
│       └── raw
│           ├── t10k-images-idx3-ubyte
│           ├── t10k-images-idx3-ubyte.gz
│           ├── t10k-labels-idx1-ubyte
│           ├── t10k-labels-idx1-ubyte.gz
│           ├── train-images-idx3-ubyte
│           ├── train-images-idx3-ubyte.gz
│           ├── train-labels-idx1-ubyte
│           └── train-labels-idx1-ubyte.gz
- Remove the remote_data dir and run dvc exp run --pull again. This time, remote_data.dvc is modified, and the remote_data dir's content is not as expected:
md5: 1c23a5e0d2f34ad183f5d7010db87de9
frozen: true
deps:
- path: data/
  repo:
    url: https://github.com/treeverse/vscode-dvc-demo.git
    rev_lock: 8518e8734f819b261cbf21e4bc6a1df93a90fe79
outs:
- hash: md5
  path: remote_data
  md5: 077582768b6d3e879851f57144cd20db.dir
  size: 133089540
  nfiles: 16
├── remote_data
│   ├── data
│   │   └── MNIST
│   │       └── raw
│   │           ├── t10k-images-idx3-ubyte
│   │           ├── t10k-images-idx3-ubyte.gz
│   │           ├── t10k-labels-idx1-ubyte
│   │           ├── t10k-labels-idx1-ubyte.gz
│   │           ├── train-images-idx3-ubyte
│   │           ├── train-images-idx3-ubyte.gz
│   │           ├── train-labels-idx1-ubyte
│   │           └── train-labels-idx1-ubyte.gz
│   └── MNIST
│       └── raw
│           ├── t10k-images-idx3-ubyte
│           ├── t10k-images-idx3-ubyte.gz
│           ├── t10k-labels-idx1-ubyte
│           ├── t10k-labels-idx1-ubyte.gz
│           ├── train-images-idx3-ubyte
│           ├── train-images-idx3-ubyte.gz
│           ├── train-labels-idx1-ubyte
│           └── train-labels-idx1-ubyte.gz
- Stage and commit the second remote_data.dvc.
- On a different machine, clone the repo and then repeat the experiment with dvc exp run --pull. The second version of the remote_data.dvc stage will result in a failure like this.
Logs
Reproducing experiment 'agley-ankh'
Building workspace index |0.00 [00:00, ?entry/s]
Comparing indexes |12.0 [00:00, 28.8entry/s]
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte'. It won't be created.
Applying changes |0.00 [00:00, ?file/s]
WARNING: Failed to pull run cache: config file error: no remote specified in /home/USER/code/dvc-debug. Create a default remote with
dvc remote add -d <remote name> <remote url>
Pulling data for stage: 'remote_data.dvc'
Collecting |11.0 [00:00, 857entry/s]
Fetching
Building workspace index |0.00 [00:00, ?entry/s]
Comparing indexes |12.0 [00:00, 311entry/s]
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-labels-idx1-ubyte'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-images-idx3-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/train-images-idx3-ubyte.gz'. It won't be created.
WARNING: No file hash info found for '/home/USER/code/dvc-debug/remote_data/MNIST/raw/t10k-labels-idx1-ubyte'. It won't be created.
Applying changes |0.00 [00:00, ?file/s]
Unable to pull data for stage: 'remote_data.dvc'
Verifying outputs in frozen stage: 'remote_data.dvc'
ERROR: failed to reproduce 'remote_data.dvc': missing data 'source': remote_data
- Continuing on the second machine, check out the commit containing the first version of remote_data.dvc and repeat the experiment again. This time, the experiment will succeed.
Logs
Reproducing experiment 'curly-pein'
Building workspace index |0.00 [00:00, ?entry/s]
Comparing indexes |12.0 [00:00, 26.7entry/s]
Applying changes |0.00 [00:00, ?file/s]
WARNING: Failed to pull run cache: config file error: no remote specified in /home/USER/code/dvc-debug. Create a default remote with
dvc remote add -d <remote name> <remote url>
Pulling data for stage: 'remote_data.dvc'
Collecting |11.0 [00:00, 849entry/s]
Fetching
Building workspace index |0.00 [00:00, ?entry/s]
Comparing indexes |12.0 [00:00, 716entry/s]
Applying changes |8.00 [00:00, 101file/s]
Ran experiment(s): curly-pein
Experiment results have been applied to your workspace.
- The remote_data folder now looks like this on the second machine.
├── remote_data
│   └── MNIST
│       └── raw
│           ├── t10k-images-idx3-ubyte
│           ├── t10k-images-idx3-ubyte.gz
│           ├── t10k-labels-idx1-ubyte
│           ├── t10k-labels-idx1-ubyte.gz
│           ├── train-images-idx3-ubyte
│           ├── train-images-idx3-ubyte.gz
│           ├── train-labels-idx1-ubyte
│           └── train-labels-idx1-ubyte.gz
Expected
dvc exp run --pull shouldn't change the .dvc stage during the second experiment. Alternatively, it could change it, but not break it.
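A scripted check for this expectation could look for a directory hash in the outs of the .dvc file, since that is the visible difference between the two versions of remote_data.dvc shown above. This is only a sketch: the function name dvc_file_has_dir_hash is made up for illustration, and it assumes the hash is written as a 32-character hex md5 followed by .dir, as in the second file.

```shell
# Hypothetical helper (not part of DVC): succeeds when the given .dvc file
# records a directory hash such as "md5: <32 hex chars>.dir", which is what
# the second remote_data.dvc in this report contains and the first does not.
dvc_file_has_dir_hash() {
    grep -Eq '^[[:space:]]*md5:[[:space:]]*[0-9a-f]{32}\.dir[[:space:]]*$' "$1"
}
```

For example, running dvc_file_has_dir_hash remote_data.dvc && echo "stage was rewritten" after the second experiment would print the message when the stage has been changed in the way described here.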
Output of dvc doctor:
DVC version: 3.66.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.8.0-101-generic-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 3.18.2
    dvc_objects = 5.2.0
    dvc_render = 1.0.2
    dvc_task = 0.40.2
    scmrepo = 3.6.1
Supports:
    http (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
    https (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
    ssh (sshfs = 2025.11.0)
Config:
Global: /home/jeffery.hsu/.config/dvc
System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/crypt-home
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/crypt-home
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9033490812fdfc6cf24ba81dece2685b
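For convenience, the first-machine part of the Reproduce steps can be condensed into one script. This is a sketch rather than a verified reproduction: the function name reproduce_issue and the commit message are made up for illustration, and it assumes git and dvc are on PATH, git user.name and user.email are configured, and the target directory does not yet hold a repo. The commands themselves are the ones listed in the steps.

```shell
# Hypothetical wrapper around the reproduction steps; nothing runs until
# reproduce_issue is called. The body runs in a subshell so the caller's
# working directory is left untouched.
reproduce_issue() (
    workdir=${1:-/tmp/example}   # same path as in the steps; override with $1
    mkdir -p "$workdir" || exit 1
    cd "$workdir" || exit 1
    git init || exit 1
    dvc init || exit 1
    touch dvc.yaml
    dvc import https://github.com/treeverse/vscode-dvc-demo.git data/ \
        -o remote_data/ --no-download || exit 1
    git add -A || exit 1
    git commit -m "first remote_data.dvc" || exit 1

    dvc exp run --pull || exit 1   # first run: remote_data.dvc reported unchanged
    rm -rf remote_data
    dvc exp run --pull || exit 1   # second run: remote_data.dvc reported modified

    # Exit status 1 means remote_data.dvc differs from the committed version.
    git diff --exit-code -- remote_data.dvc
)
```

Calling reproduce_issue /tmp/example and then checking its exit status gives a quick yes/no on whether the second experiment rewrote remote_data.dvc.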