This is a collection of scripts for copying files from a T2 cluster (e.g. uaf-4.t2.ucsd.edu) to a PVC on the NRP Nautilus cluster.
The script runs on UAF/T2. It uses krsync, a thin wrapper that tunnels rsync over kubectl exec, to stream files directly into a long-lived pod that has your PVC mounted in your namespace. Files are split into batches, and the batches run as parallel background processes. This was designed for the axol1tl namespace, UAF, and the traindatavol PVC but can be generalized. I hope you find it useful! :)
All of these need to be set up on UAF before things will work.
NRP, kubectl and kubelogin setup
This comes from the NRP getting started guide.
- ssh into your T2 cluster (of course you need access first)
#can do uaf-1,2,3 or 4
ssh username@uaf-2.t2.ucsd.edu
- log into nrp.ai in your browser. You need to be a member of the namespace the files are being copied to.
- install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
chmod +x kubectl
mkdir -p ~/.local/bin
mv ./kubectl ~/.local/bin/kubectl
get the path variables set properly and verify it works:
export PATH="$HOME/.local/bin:$PATH"
source ~/.bashrc
which kubectl
kubectl version --client
- install krew + kubelogin (this takes a while)
(
set -x; cd "$(mktemp -d)" &&
OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/arm.*$/arm/')" &&
KREW="krew-${OS}_${ARCH}" &&
curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
tar zxvf "${KREW}.tar.gz" &&
./"${KREW}" install krew
)
set path, install and verify
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
source ~/.bashrc
kubectl krew install oidc-login
kubectl oidc-login --version
- get nrp config file and copy it to the cluster
on the cluster:
mkdir -p ~/.kube
on your local machine: download the config file from https://nrp.ai/config, then copy it over (use your own downloaded file name and username):
scp ~/Downloads/config-2 mequinna@uaf-4.t2.ucsd.edu:~/.kube/config
- log into nrp from t2 cluster (can use namespace of choice, example here is axol1tl)
kubectl get nodes
kubectl get pods -n axol1tl
#if you want a default namespace
kubectl config set contexts.nautilus.namespace axol1tl
- add to bashrc
do this so you don't have to set the paths again in every new session
nano ~/.bashrc
#add these to the bottom:
export PATH="$HOME/.local/bin:$PATH"
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
source ~/.bashrc
Script setup
- after doing above, ssh into t2 cluster and clone the git repo
bash
git clone https://github.com/quinnanm/nrpcopy.git
cd nrpcopy
you should have a few scripts of note: kube_copy.py, krsync and ymls/copy-pod.yml
- set up the pod for copying
kubectl apply -f ymls/copy-pod.yml
#check it's running:
kubectl get pods -n axol1tl
copy-pod needs to be "Running" for any of this to work.
- give krsync permissions
this is needed for it to work. if you get permission issues this is a likely culprit
chmod +x krsync
ls -la krsync
# should show -rwxr-xr-x
- prepare for liftoff
you need an input directory on the T2 of files you want to copy, a pvc on NRP to copy to, an output directory on that pvc, a namespace, and your running copy-pod
python kube_copy.py \
--input-dirs /indir/name \
--output-path /outdir/name \
--namespace nameofnamespace \
--pvc nameofpvc
This will find all .root files under the input directory, split them into batches of 100, run up to 4 batches in parallel, block until everything is done, and print a summary.
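As a rough illustration of the discovery and batching described above, here is a minimal sketch. The function names discover_files and make_batches are hypothetical and are not taken from kube_copy.py; the real script may do this differently.

```python
from pathlib import Path


def discover_files(input_dir, pattern="*.root"):
    """Recursively list files under input_dir matching pattern, sorted by path."""
    return sorted(str(p) for p in Path(input_dir).rglob(pattern) if p.is_file())


def make_batches(files, files_per_job=100):
    """Split files into consecutive batches of at most files_per_job entries."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    return [files[i:i + files_per_job] for i in range(0, len(files), files_per_job)]


if __name__ == "__main__":
    # Stand-in file names (assumption): 250 files with the default batch size.
    fake_files = [f"file_{i}.root" for i in range(250)]
    batches = make_batches(fake_files, files_per_job=100)
    print([len(b) for b in batches])  # [100, 100, 50]
```

With 250 files and the default of 100 files per batch, this gives three batches, the last one partial. With a parallel limit of 4, all three could run at once.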
Always do a dry run first to verify the file list and destination paths before copying anything:
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/QCD \
--output-path /data/ntuples \
--namespace axol1tl \
--pvc mequinna-pvc \
--dry-run
By default the script copies all .root files. To copy a different file type:
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/MyData \
--output-path /data/ADsamples/MyData \
--namespace axol1tl \
--pvc traindatavol \
--filetype '*.h5'
Here is what I did in my working example. --flat means the output has no nested directories like the input directories do; every file lands directly in the target output directory.
#try it
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/VBFHto2B_25 \
--output-path /data/ADsamples/VBFHto2B_25 \
--namespace axol1tl \
--pvc traindatavol \
--copy-pod copy-pod \
--files-per-job 50 \
--max-parallel 4 \
--flat \
--dry-run
#submit for real
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/VBFHto2B_25 \
--output-path /data/ADsamples/VBFHto2B_25 \
--namespace axol1tl \
--pvc traindatavol \
--copy-pod copy-pod \
--files-per-job 50 \
--max-parallel 4 \
--flat
#resubmit failed jobs
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/VBFHto2B_25 \
--output-path /data/ADsamples/VBFHto2B_25 \
--namespace axol1tl \
--pvc traindatavol \
--copy-pod copy-pod \
--files-per-job 50 \
--max-parallel 4 \
--flat \
--skip-existing
#or to avoid printouts/holding the command line hostage
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/VBFHto2B_25 \
--output-path /data/ADsamples/VBFHto2B_25 \
--namespace axol1tl \
--pvc traindatavol \
--copy-pod copy-pod \
--files-per-job 50 \
--max-parallel 4 \
--flat \
--no-wait
log files are written to copy_logs/.
then check the status on nrp:
kubectl exec -it copy-pod -n axol1tl -- bash
ls /data/ADsamples/VBFHto2B_25/
ls /data/ADsamples/VBFHto2B_25/ | wc -l
exit
don't forget to delete the pod when done!
kubectl delete pod copy-pod -n axol1tl
Copy multiple sample directories with prefixes, flat output:
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/QCD \
/ceph/cms/store/user/mequinna/ntuples/TTbar \
/ceph/cms/store/user/mequinna/ntuples/WJets \
--prefix QCD TTbar WJets \
--flat \
--output-path /data/ntuples \
--namespace axol1tl \
--pvc mequinna-pvc
With --prefix, each file is renamed PREFIX_originalname.root. With --flat, all files land in one directory regardless of the subdirectory structure on UAF. Without --flat, the subdirectory structure is preserved under --output-path.
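To make the --prefix and --flat behaviour concrete, the sketch below computes where a single source file would end up. It is an illustration only: destination_path is a hypothetical helper, and it assumes the preserved subdirectory structure is taken relative to the input directory and that the prefix is applied whether or not --flat is set.

```python
import posixpath


def destination_path(src_file, input_dir, output_path, prefix=None, flat=False):
    """Return the path a source file would get inside the PVC (assumed rules)."""
    # Path of the file relative to its input directory, e.g. "run1/a.root".
    rel = posixpath.relpath(src_file, input_dir)
    subdir, name = posixpath.split(rel)
    if prefix:
        name = f"{prefix}_{name}"  # PREFIX_originalname.root
    if flat:
        return posixpath.join(output_path, name)  # drop the subdirectories
    return posixpath.join(output_path, subdir, name)  # keep the subdirectories


if __name__ == "__main__":
    src = "/ceph/in/QCD/run1/a.root"  # made-up example path
    print(destination_path(src, "/ceph/in/QCD", "/data/ntuples", prefix="QCD", flat=True))
    print(destination_path(src, "/ceph/in/QCD", "/data/ntuples"))
```

Prefixes matter mostly with --flat: two input directories that both contain a file named a.root would otherwise collide in the single output directory.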
First-time run — auto-create the pod:
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/QCD \
--output-path /data/ntuples \
--namespace axol1tl \
--pvc mequinna-pvc \
--create-pod
Fire and forget (return immediately, check later):
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/QCD \
--output-path /data/ntuples \
--namespace axol1tl \
--pvc mequinna-pvc \
--no-wait
The script launches batches in the background and exits. The background processes survive SSH disconnection. The script prints the exact --summarize command to run when you come back to check results.
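One common way to get background processes that outlive the terminal in Python is to start them in their own session with their output redirected to a log file. The sketch below shows that general technique; launch_detached is a hypothetical name, and whether kube_copy.py launches its batches this way is an assumption, not something stated here.

```python
import os
import subprocess
import sys
import tempfile


def launch_detached(cmd, log_path):
    """Start cmd in its own session, sending stdout and stderr to log_path."""
    log = open(log_path, "w")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no terminal input needed
            stdout=log,
            stderr=subprocess.STDOUT,   # merge stderr into the same log
            start_new_session=True,     # detach from the terminal's session (POSIX)
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor
    return proc


if __name__ == "__main__":
    # Stand-in for a batch script (assumption): a one-line Python command.
    log_file = os.path.join(tempfile.mkdtemp(), "batch_demo.log")
    proc = launch_detached([sys.executable, "-c", "print('BATCH_DONE')"], log_file)
    proc.wait()
    print(open(log_file).read().strip())
```

Because the child no longer shares the terminal's session, closing the SSH connection does not send it the hangup signal that would otherwise stop it.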
Resume after interruption — skip already-copied files:
python kube_copy.py \
--input-dirs /ceph/cms/store/user/mequinna/ntuples/QCD \
--output-path /data/ntuples \
--namespace axol1tl \
--pvc mequinna-pvc \
--skip-existing
While batches are running (tail all batch logs):
tail -f copy_logs/batch_0520-142301_*.log
Check which batches are still going:
grep -l "BATCH_DONE" copy_logs/batch_0520-142301_*.log # finished
ps aux | grep batch_ # still running
Get the full summary (works mid-run too, and shows which batches aren't done yet):
python kube_copy.py \
--summarize copy_logs/batch_0520-142301_*.log \
--output-path /data/ntuples \
--namespace axol1tl \
--pvc mequinna-pvc \
--copy-pod copy-pod
This prints counts of succeeded / failed / size-mismatched files, lists any problem files, and if there are failures prints a ready-to-run resubmit command.
Resubmitting failures: just copy-paste the resubmit command from the summary output. It pre-fills --skip-existing so already-copied files are not re-copied.
| Flag | Default | Description |
|---|---|---|
| --input-dirs | required | One or more source directories on UAF. Recursively finds all .root files. |
| --output-path | required | Destination path inside the PVC, e.g. /data/ntuples. |
| --namespace | axol1tl | Kubernetes namespace. |
| --pvc | required | PVC name, e.g. mequinna-pvc. |
| --copy-pod | copy-pod | Name of the long-lived pod with the PVC mounted. |
| --create-pod | off | Create the copy pod if it doesn't exist. |
| --prefix | none | One prefix string per input dir. --prefix QCD TTbar renames files to QCD_file.root, TTbar_file.root. Count must match --input-dirs. |
| --filetype | *.root | File pattern to match, e.g. --filetype '*.h5' or --filetype '*' for all files. |
| --flat | off | Put all output files in one flat directory. Without this, subdirectory structure from the source is preserved. |
| --files-per-job | 100 | Number of files per batch. |
| --max-parallel | 4 | Maximum number of batches running simultaneously. |
| --skip-existing | off | Check the pod before copying and skip files already present. |
| --no-wait | off | Launch batches and return immediately. Use --summarize to check results later. |
| --summarize | off | Parse log files and print summary. No copying. Pass log glob: --summarize copy_logs/batch_*.log. Also needs --output-path, --namespace, --pvc, --copy-pod for the resubmit command. |
| --krsync | ./krsync | Path to krsync wrapper. Created automatically if missing. |
| --log-dir | ./copy_logs | Directory for per-batch shell scripts and log files. |
| --log-file | copy_summary.json | JSON file summarising all file statuses at the end of a blocking run. |
| --dry-run | off | Print everything that would happen without copying anything. |
- The script discovers all .root files recursively under each --input-dirs path.
- Files are split into batches of --files-per-job. For each batch a shell script is written to --log-dir.
- Up to --max-parallel batch scripts run simultaneously as background processes. Each script rsyncs its files via krsync (which tunnels rsync over kubectl exec) into the copy pod, then checks file sizes to verify each transfer.
- Each file in the batch log gets a status line: OK:, FAILED:, or SIZEMISMATCH:. The batch ends with BATCH_DONE.
- In blocking mode the script watches all batches and prints a live progress line. In --no-wait mode it exits immediately after launch.
- At the end (or when you run --summarize) the logs are parsed and results reported.
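The log-parsing step can be pictured with a short sketch. It assumes each status line starts with its token (OK:, FAILED: or SIZEMISMATCH:) followed by the file name, and that BATCH_DONE sits on its own line; the exact log layout written by the batch scripts is not shown here, and parse_batch_log is a hypothetical name.

```python
from collections import Counter

STATUSES = ("OK:", "FAILED:", "SIZEMISMATCH:")


def parse_batch_log(text):
    """Count per-file status lines and report whether the batch finished."""
    counts = Counter()
    problems = []  # files that need another attempt
    done = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("BATCH_DONE"):
            done = True
            continue
        for status in STATUSES:
            if line.startswith(status):
                key = status.rstrip(":")
                counts[key] += 1
                if key != "OK":
                    problems.append(line[len(status):].strip())
                break
    return {"counts": dict(counts), "problems": problems, "done": done}


if __name__ == "__main__":
    # Made-up log content in the assumed layout.
    sample = "OK: a.root\nFAILED: b.root\nSIZEMISMATCH: c.root\nOK: d.root\nBATCH_DONE\n"
    print(parse_batch_log(sample))
```

A log without a BATCH_DONE line is reported with done set to False, which is how a summary could tell a still-running batch apart from a finished one.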
Pod not found / not Running
kubectl get pods -n axol1tl
kubectl describe pod copy-pod -n axol1tl
If the pod is stuck in Pending, check PVC status: kubectl get pvc -n axol1tl.
kubectl auth expired
NRP uses OIDC tokens that expire. Re-authenticate with:
kubectl get pods -n axol1tl # triggers browser login
rsync fails immediately
Make sure the krsync file is executable: chmod +x ./krsync. Also confirm the copy pod is Running before starting.
Size mismatch on a file
The file was partially transferred. Run --summarize to get the resubmit command — it will list the affected files and retry them with --skip-existing so everything else is left alone.
Check what's on the PVC
kubectl exec -n axol1tl copy-pod -- find /data -name "*.root" | wc -l
kubectl exec -n axol1tl copy-pod -- du -sh /data