diff --git a/docs/How-to/habrok_cluster_guide.md b/docs/How-to/habrok_cluster_guide.md
index cd6c1c21b..432d5e2e4 100644
--- a/docs/How-to/habrok_cluster_guide.md
+++ b/docs/How-to/habrok_cluster_guide.md
@@ -13,8 +13,8 @@
ssh-copy-id -i ~/.ssh/id_rsa.pub YOUR_USERNAME@login1.hb.hpc.rug.nl

Once you have added your SSH key to Habrok, modify the entry below and insert it into your `~/.ssh/config` file:

```
-Host habrok1
-    HostName interactive1.hb.hpc.rug.nl
+Host habrok
+    HostName login1.hb.hpc.rug.nl
     User YOUR_USERNAME
     IdentityFile ~/.ssh/id_rsa
     ServerAliveInterval 120
```

@@ -80,3 +80,100 @@
You can also submit single PROTEUS runs to the nodes. For example:

```console
sbatch --mem-per-cpu=3G --time=1440 --wrap "proteus start -oc input/all_options.toml"
```

## Transferring data from Habrok to Kapteyn

Habrok and Kapteyn are on different networks. Habrok cannot reach Kapteyn (the firewall blocks outgoing SSH), and although Kapteyn can reach Habrok, Habrok requires two-factor authentication (2FA) for every connection, which makes automated transfers from Kapteyn difficult.

This means you cannot simply run `rsync` or `scp` in either direction between the two clusters. The workaround is to relay the data through a machine that can reach both, such as your laptop:

```
Habrok --> your laptop --> Kapteyn (norma2)
       pull            push
```

### Prerequisites

You need SSH access to both clusters configured on your laptop. See the [Habrok SSH setup](#access-the-habrok-cluster) above and the [Kapteyn cluster guide](kapteyn_cluster_guide.md) for SSH config instructions, including the ProxyJump setup needed to reach `norma2`.

Test that both connections work before proceeding:

```console
ssh habrok   # will ask for your TOTP code
ssh norma2   # key-based, no 2FA
```

### Step 1: Pull data from Habrok to your laptop

On Habrok, PROTEUS output typically lives in `/scratch/<username>/proteus_output/`.
Check what is there:

```console
ssh habrok 'ls -lh /scratch/<username>/proteus_output/'
```

Pull it to a temporary folder on your laptop:

```console
mkdir -p /tmp/habrok_transfer
rsync -avz habrok:/scratch/<username>/proteus_output/my_run/ /tmp/habrok_transfer/my_run/
```

Replace `<username>` with your Habrok username (e.g., `p000000`) and `my_run` with the name of your simulation directory.

If you only need the CSV and plots (not the raw per-timestep data), add `--exclude=data/` to save time and disk space:

```console
rsync -avz --exclude=data/ habrok:/scratch/<username>/proteus_output/my_run/ /tmp/habrok_transfer/my_run/
```

### Step 2: Push data from your laptop to Kapteyn

Push the staged data to the Kapteyn dataserver:

```console
ssh norma2 'mkdir -p /dataserver/users/formingworlds/<username>/proteus_output/my_run'
rsync -avz /tmp/habrok_transfer/my_run/ norma2:/dataserver/users/formingworlds/<username>/proteus_output/my_run/
```

Replace `<username>` with your Kapteyn username.

### Step 3: Clean up

Remove the temporary staging data from your laptop:

```console
rm -rf /tmp/habrok_transfer/my_run
```

### Alternative: direct pipe (no staging on your laptop)

Instead of staging the data on your laptop in between, you can pipe it straight through in a single command using SSH and `tar`.

First, make sure the target directory exists on Kapteyn:

```console
ssh norma2 'mkdir -p /dataserver/users/formingworlds/<username>/proteus_output'
```

Then pipe the data through:

```console
ssh habrok 'tar -cf - -C /scratch/<username>/proteus_output my_run' \
    | ssh norma2 'tar -xf - -C /dataserver/users/formingworlds/<username>/proteus_output'
```

This streams the data from Habrok through your laptop to Kapteyn without writing anything to disk locally. The downside is that if the connection drops, you have to start over from scratch (unlike `rsync`, which can resume), so this approach is best suited to smaller transfers.
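If you want to see how the pack/unpack pattern behaves before running it across two clusters, you can exercise the same `tar -cf - | tar -xf -` pipe locally. This is a throwaway sketch using directories under `/tmp` (the `tar_pipe_demo` paths are invented for the demo; no SSH involved):

```console
# Start from a clean slate
rm -rf /tmp/tar_pipe_demo

# Create a fake "remote" run directory with one file in it
mkdir -p /tmp/tar_pipe_demo/src/my_run /tmp/tar_pipe_demo/dst
echo "hello" > /tmp/tar_pipe_demo/src/my_run/runtime_helpfile.csv

# Same pattern as the ssh pipe, minus the ssh wrappers:
# pack my_run relative to the source dir, unpack it into the destination dir
tar -cf - -C /tmp/tar_pipe_demo/src my_run | tar -xf - -C /tmp/tar_pipe_demo/dst

ls /tmp/tar_pipe_demo/dst/my_run/   # -> runtime_helpfile.csv
```

The `-C` flag makes `tar` change directory before packing or unpacking, which is why only `my_run` (and not the full `/scratch/...` path) ends up on the receiving side.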

To exclude the `data/` directory for a slimmer transfer:

```console
ssh habrok 'tar -cf - --exclude=data -C /scratch/<username>/proteus_output my_run' \
    | ssh norma2 'tar -xf - -C /dataserver/users/formingworlds/<username>/proteus_output'
```

### Tips

- **rsync is incremental.** If a transfer is interrupted (laptop goes to sleep, WiFi drops), re-run the same `rsync` command: it picks up where it left off and only transfers new or changed files.
- **Check sizes first.** Before pulling, check how large the data is with `ssh habrok 'du -sh /scratch/<username>/proteus_output/my_run/'`. Large runs can be tens of GB.
- **The `data/` directory is often not needed.** It contains the raw NetCDF/JSON output at every timestep; the `runtime_helpfile.csv` file and the `plots/` directory are usually sufficient for analysis.
- **Kapteyn storage quotas.** The formingworlds dataserver also has limited space. Check your usage with `ssh norma2 'du -sh /dataserver/users/formingworlds/<username>/'` before transferring large datasets.
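The three staged steps (pull, push, clean up) can also be wrapped in one small script. This is a sketch, not part of the official workflow: the `habrok` and `norma2` host aliases come from the SSH configs above, while `RUN`, `HB_USER`, and `KAPTEYN_USER` are placeholder variables you must set yourself (the defaults below are invented examples). By default it only prints the commands (`DRY_RUN=1`); set `DRY_RUN=0` once the output looks right.

```console
#!/bin/sh
# Sketch: pull a run from Habrok, push it to Kapteyn, clean up.
# With DRY_RUN=1 (the default) every command is printed but not executed.
set -eu

RUN="${RUN:-my_run}"                  # simulation directory name
HB_USER="${HB_USER:-p000000}"         # your Habrok username
KAPTEYN_USER="${KAPTEYN_USER:-jdoe}"  # your Kapteyn username (placeholder)
STAGE="/tmp/habrok_transfer"
DEST="/dataserver/users/formingworlds/${KAPTEYN_USER}/proteus_output"

run() {                               # print a command; execute only if DRY_RUN=0
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

run mkdir -p "${STAGE}"
run rsync -avz --exclude=data/ \
    "habrok:/scratch/${HB_USER}/proteus_output/${RUN}/" "${STAGE}/${RUN}/"  # pull
run ssh norma2 "mkdir -p ${DEST}/${RUN}"                                    # target dir
run rsync -avz "${STAGE}/${RUN}/" "norma2:${DEST}/${RUN}/"                  # push
run rm -rf "${STAGE}/${RUN}"                                                # clean up
```

Save it as, say, `transfer.sh` (a name chosen here for illustration) and run it with `RUN=my_run HB_USER=p000000 KAPTEYN_USER=yourname DRY_RUN=0 sh transfer.sh` once the dry-run output is correct. Because both hops use `rsync`, an interrupted transfer can simply be re-run.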