
Commit 0bb2902

Add background provisioning with JupyterHub integration
1 parent 671d249 commit 0bb2902

4 files changed

Lines changed: 209 additions & 20 deletions


repo2docker/contentproviders/rdm/README.md

Lines changed: 140 additions & 0 deletions
@@ -55,3 +55,143 @@ paths:
 source: $default_storage_path
 target: .
 ```
+
+## Running provision.sh with JupyterHub
+
+When deploying repo2docker-built images with JupyterHub, you can automatically execute the `provision.sh` script at container startup to provision RDM data.
+
+### Background Execution to Avoid Timeout
+
+Because copying large datasets can take long enough to exceed JupyterHub's spawn timeout, the `provision.sh` script supports a background execution mode. When called with command-line arguments, it will:
+
+1. Start provisioning (copy/link operations) in the background
+2. Immediately execute the passed command (e.g., `jupyterhub-singleuser`)
+
+This allows the JupyterHub server to start while data provisioning continues in the background.
+
+### JupyterHub Configuration
+
+Configure your JupyterHub spawner to execute `provision.sh` if it exists:
+
+#### KubeSpawner Example
+
+```python
+# In jupyterhub_config.py
+c.KubeSpawner.cmd = [
+    'bash', '-c',
+    '''
+    set -e
+
+    # Find and execute provision.sh if it exists
+    for path in \
+        "${REPO_DIR}/binder/provision.sh" \
+        "${REPO_DIR}/.binder/provision.sh" \
+        "$HOME/binder/provision.sh" \
+        "$HOME/.binder/provision.sh"; do
+
+        if [ -f "$path" ]; then
+            echo "[provision-wrapper] Executing: $path" >&2
+            exec bash "$path" "$@"
+        fi
+    done
+
+    # No provision.sh found, start normally
+    exec "$@"
+    ''',
+    '--', 'jupyterhub-singleuser'
+]
+```
+
+#### DockerSpawner Example
+
+```python
+# In jupyterhub_config.py
+c.DockerSpawner.cmd = [
+    'bash', '-c',
+    '''
+    set -e
+    for path in \
+        "${REPO_DIR}/binder/provision.sh" \
+        "${REPO_DIR}/.binder/provision.sh" \
+        "$HOME/binder/provision.sh" \
+        "$HOME/.binder/provision.sh"; do
+        [ -f "$path" ] && exec bash "$path" "$@"
+    done
+    exec "$@"
+    ''',
+    '--', 'jupyterhub-singleuser'
+]
+```
+
+### How provision.sh Works
+
+The generated `provision.sh` script accepts command-line arguments and has the following simplified structure (the `/mnt/rdm` setup and the log redirection are omitted):
+
+```bash
+#!/bin/bash
+set -e
+
+# Run provisioning in background
+{
+    # Copy and link operations
+    mkdir -p './target/path/'
+    cp -fr '/mnt/rdm/storage/data/'* './target/path/'
+    ln -s '/mnt/rdm/large-data/' './data'
+} &
+
+# Execute passed command if provided
+if [ $# -gt 0 ]; then
+    exec "$@"
+fi
+```
+
+### Volume Mounts
+
+Ensure that RDM storage is mounted at `/mnt/rdm/` in the container. Configure your spawner accordingly:
+
+```python
+# KubeSpawner example
+c.KubeSpawner.volumes = [
+    {
+        'name': 'rdm-storage',
+        'persistentVolumeClaim': {
+            'claimName': 'rdm-pvc'
+        }
+    }
+]
+
+c.KubeSpawner.volume_mounts = [
+    {
+        'name': 'rdm-storage',
+        'mountPath': '/mnt/rdm/'
+    }
+]
+```
+
+**Note**: If `/mnt/rdm` does not exist, `provision.sh` checks for `/mnt/rdms/{project_id}/` at startup (up to three times, sleeping 1, 2, and 4 seconds after each failed check) and, once that directory is available, creates a symlink from `/mnt/rdm` to `/mnt/rdms/{project_id}`.
+
+### Monitoring Provisioning Progress
+
+Users can check the provisioning progress from within the Jupyter environment:
+
+```bash
+# View the provisioning log in real-time
+tail -f /tmp/provision.log
+
+# Check if provisioning is complete
+grep "completed" /tmp/provision.log
+```
+
+The log file `/tmp/provision.log` contains:
+- Start and completion timestamps
+- Each copy/link operation with source and target paths
+- Detailed command output (from `set -x`)
+- Any errors that occur during provisioning
+
+### Notes
+
+- The `REPO_DIR` environment variable points to the repository directory (default: `/home/jovyan`)
+- Provisioning runs in the background, so large data copies won't block JupyterHub startup
+- The `/mnt/rdm` setup symlink is created before the passed command starts; symbolic links from link mappings are created in the background block, after the copy operations
+- Check `/tmp/provision.log` for provisioning progress and errors
+- Container logs will show when provisioning starts and how to monitor it
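
The monitoring steps in the README above can also be scripted. The following is a minimal sketch, assuming the log path `/tmp/provision.log` and the completion message text that `provisioner.py` writes in this commit; the helper name `wait_for_provisioning` and its defaults are hypothetical and not part of the repository.

```python
import time
from pathlib import Path

# Marker text taken from the completion message that provision.sh writes to its log.
COMPLETION_MARKER = "RDM data provisioning completed"


def wait_for_provisioning(log_path="/tmp/provision.log", timeout=600.0, poll_interval=2.0):
    """Return True once the completion marker appears in the log, False on timeout.

    Hypothetical helper: the name and the default values are illustrative only.
    """
    deadline = time.monotonic() + timeout
    log_file = Path(log_path)
    while True:
        # The log may not exist yet if provisioning has not started.
        if log_file.exists() and COMPLETION_MARKER in log_file.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Because the generated script runs under `set -e`, a failed copy ends the background block before the completion message is written, so a `False` result can mean either "still running" or "failed"; the log itself has to be read to tell the two apart.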

repo2docker/contentproviders/rdm/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -255,6 +255,7 @@ async def _fetch_binder(
         for binder_output_dir in binder_output_dirs:
             provisioner.save_provision_script(
                 os.path.join(binder_output_dir, "provision.sh"),
+                project.id,
                 mnt_rdm_dir,
             )
 
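
Since `provision.sh` is saved into each binder output directory, whatever launches the container has to locate it. The README in this commit does that with a shell loop; the same lookup order can be sketched in Python as below. The function names `find_provision_script` and `build_launch_argv` are hypothetical, and the sketch only mirrors the search order of the shell wrapper rather than replacing it.

```python
import os


def find_provision_script(repo_dir, home_dir):
    """Return the first existing provision.sh among the candidate paths, else None."""
    # Same order as the shell wrapper: REPO_DIR first, then HOME; binder/ before .binder/.
    candidates = [
        os.path.join(repo_dir, "binder", "provision.sh"),
        os.path.join(repo_dir, ".binder", "provision.sh"),
        os.path.join(home_dir, "binder", "provision.sh"),
        os.path.join(home_dir, ".binder", "provision.sh"),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def build_launch_argv(repo_dir, home_dir, command):
    """Prefix the command with `bash <provision.sh>` when a script is found."""
    script = find_provision_script(repo_dir, home_dir)
    if script is None:
        # No provision.sh found: start the command unchanged.
        return list(command)
    return ["bash", script, *command]
```

For example, `build_launch_argv(repo_dir, home_dir, ["jupyterhub-singleuser"])` yields an argument list that hands the single-user server command to `provision.sh`, which then starts it with `exec "$@"`.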

repo2docker/contentproviders/rdm/provisioner.py

Lines changed: 35 additions & 9 deletions
@@ -57,50 +57,76 @@ async def add_link_mapping(self, path_mapping: PathMapping):
         source = await self._resolve_source(path_mapping)
         self._link_mappings.append((path_mapping, source))
 
-    def save_provision_script(self, script_path: str, source_mount_dir='/mnt/rdm/'):
+    def save_provision_script(self, script_path: str, project_id: str, source_mount_dir='/mnt/rdm/'):
         """Save the provision script to the specified path."""
         with open(script_path, "w") as f:
             f.write("#!/bin/bash\n")
-            f.write("set -xe\n")
+            f.write("set -e\n\n")
+            f.write("# Ensure /mnt/rdm exists or create symlink to project-specific directory\n")
+            f.write("if [ ! -e /mnt/rdm ]; then\n")
+            f.write(f" PROJECT_DIR=/mnt/rdms/{shlex.quote(project_id)}\n")
+            f.write(" for i in 1 2 4; do\n")
+            f.write(" if [ -d \"$PROJECT_DIR\" ]; then\n")
+            f.write(" ln -s \"$PROJECT_DIR\" /mnt/rdm\n")
+            f.write(" break\n")
+            f.write(" fi\n")
+            f.write(" echo \"Waiting for $PROJECT_DIR to be available... (retry in ${i}s)\" >&2\n")
+            f.write(" sleep $i\n")
+            f.write(" done\n")
+            f.write("fi\n\n")
+            f.write("PROVISION_LOG=\"/tmp/provision.log\"\n\n")
+            f.write("# Run provisioning in background\n")
+            f.write("{\n")
+            f.write(" echo \"[provision] Starting RDM data provisioning at $(date)...\"\n")
+            f.write(" set -x\n")
             for path_mapping, source in self._copy_mappings:
                 source_path = os.path.join(
                     source_mount_dir,
                     path_mapping.get_source(self._default_storage_path)
                 )
                 target_path = path_mapping.get_target()
+                # Quote the whole message so paths containing ", $, or backticks cannot break the echo
+                f.write(f" echo {shlex.quote(f'[provision] Copying {source_path} to {target_path}...')}\n")
                 if target_path != "./" and target_path.endswith("/"):
                     # target is directory
                     if target_path.strip("/") != "." and target_path.strip("/") != "":
-                        f.write(f"mkdir -p {shlex.quote(target_path)}\n")
+                        f.write(f" mkdir -p {shlex.quote(target_path)}\n")
                 elif target_path != "./" and "/" in target_path:
                     # target is subdir
                     parent_dir = os.path.dirname(target_path)
                     if parent_dir.strip("/") != "." and parent_dir.strip("/") != "":
-                        f.write(f"mkdir -p {shlex.quote(parent_dir)}\n")
+                        f.write(f" mkdir -p {shlex.quote(parent_dir)}\n")
                 if source.path.endswith("/"):
                     # folder
                     if target_path != "." and not target_path.endswith("/"):
                         target_path += "/"
                     if not source_path.endswith("/"):
                         source_path += "/"
                     if target_path.strip("/") != "." and target_path.strip("/") != "":
-                        f.write(f"mkdir -p {shlex.quote(target_path)}\n")
-                    f.write(f"cp -fr {shlex.quote(source_path)}* {shlex.quote(target_path)}\n")
+                        f.write(f" mkdir -p {shlex.quote(target_path)}\n")
+                    f.write(f" cp -fr {shlex.quote(source_path)}* {shlex.quote(target_path)}\n")
                 else:
                     # file
-                    f.write(f"cp {shlex.quote(source_path)} {shlex.quote(target_path)}\n")
+                    f.write(f" cp {shlex.quote(source_path)} {shlex.quote(target_path)}\n")
             for path_mapping, source in self._link_mappings:
                 source_path = os.path.join(
                     source_mount_dir,
                     path_mapping.get_source(self._default_storage_path)
                 )
                 target_path = path_mapping.get_target()
+                # Quote the whole message so paths containing ", $, or backticks cannot break the echo
+                f.write(f" echo {shlex.quote(f'[provision] Linking {source_path} to {target_path}...')}\n")
                 if target_path != "./" and "/" in target_path.strip("/"):
                     # target is subdir
                     parent_dir = os.path.dirname(target_path)
                     if parent_dir.strip("/") != "." and parent_dir.strip("/") != "":
-                        f.write(f"mkdir -p {shlex.quote(parent_dir)}\n")
-                f.write(f"ln -s {shlex.quote(source_path)} {shlex.quote(target_path)}\n")
+                        f.write(f" mkdir -p {shlex.quote(parent_dir)}\n")
+                f.write(f" ln -s {shlex.quote(source_path)} {shlex.quote(target_path)}\n")
+            f.write(" echo \"[provision] RDM data provisioning completed at $(date).\"\n")
+            f.write("} > \"${PROVISION_LOG}\" 2>&1 &\n\n")
+            f.write("echo \"[provision] Provisioning started in background. Check progress: tail -f ${PROVISION_LOG}\" >&2\n\n")
+            f.write("# Execute passed command if provided\n")
+            f.write("if [ $# -gt 0 ]; then\n")
+            f.write(" exec \"$@\"\n")
+            f.write("fi\n")
 
     async def _resolve_source(self, path_mapping: PathMapping):
         """Validate the path mapping."""
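
The layout that `save_provision_script` now emits (a background block redirected to a log, followed by `exec "$@"`) can be illustrated in isolation. The sketch below is a simplified stand-in, not the project's implementation: `render_background_script` is a hypothetical name, it takes commands as argument lists, and it quotes every argument, so it cannot express the unquoted `*` glob that the real `cp -fr` line relies on.

```python
import shlex


def render_background_script(commands, log_path="/tmp/provision.log"):
    """Render a bash script that runs `commands` in the background, then execs "$@".

    Illustrative only; each command is a list of arguments.
    """
    lines = [
        "#!/bin/bash",
        "set -e",
        "",
        f"PROVISION_LOG={shlex.quote(log_path)}",
        "",
        "# Everything inside the braces runs in a background subshell.",
        "{",
        # The echo lines also keep the group non-empty, which bash requires.
        '  echo "[provision] started at $(date)"',
    ]
    for argv in commands:
        # Quote each argument so spaces and shell metacharacters stay literal.
        lines.append("  " + " ".join(shlex.quote(arg) for arg in argv))
    lines += [
        '  echo "[provision] completed at $(date)"',
        '} > "${PROVISION_LOG}" 2>&1 &',
        "",
        "# Replace this shell with the passed command, if any.",
        "if [ $# -gt 0 ]; then",
        '  exec "$@"',
        "fi",
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_background_script([["ln", "-s", "/mnt/rdm/large data", "./data"]])` returns the script text; writing it to a file and running `bash script.sh jupyterhub-singleuser` would start the link operation in the background and then replace the shell with the given command.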

tests/unit/contentproviders/test_rdm.py

Lines changed: 33 additions & 11 deletions
@@ -416,6 +416,7 @@ async def mock_storage(name):
 
             fake_project_obj = MagicMock(storages=AsyncIterator([fake_storage]))
             fake_project_obj.storage = mock_storage
+            fake_project_obj.id = "x1234"
             fake_project.return_value = fake_project_obj
 
             binder_dir = os.path.join(d, "binder")
@@ -469,9 +470,20 @@ async def mock_resolve_source(self, path_mapping):
 
             # Verify script content
             assert "#!/bin/bash" in script_content
-            assert "set -xe" in script_content
+            assert "set -e" in script_content
+            # Verify /mnt/rdm symlink creation with retry logic for project-specific directory
+            assert "if [ ! -e /mnt/rdm ]; then" in script_content
+            assert "PROJECT_DIR=/mnt/rdms/x1234" in script_content
+            assert "for i in 1 2 4; do" in script_content
+            assert 'ln -s "$PROJECT_DIR" /mnt/rdm' in script_content
+            assert "Waiting for $PROJECT_DIR to be available" in script_content
+            assert 'PROVISION_LOG="/tmp/provision.log"' in script_content
             assert "cp -fr /mnt/rdm/osfstorage/data/* ./dataset/" in script_content
             assert "ln -s /mnt/rdm/external_storage/large_files ./external" in script_content
+            # Verify background execution and command passing
+            assert "} > \"${PROVISION_LOG}\" 2>&1 &" in script_content
+            assert 'if [ $# -gt 0 ]; then' in script_content
+            assert 'exec "$@"' in script_content
 
 
 def test_fetch_with_binder_but_no_paths_yaml():
@@ -501,6 +513,7 @@ async def mock_storage(name):
 
             fake_project_obj = MagicMock(storages=AsyncIterator([fake_storage]))
             fake_project_obj.storage = mock_storage
+            fake_project_obj.id = "x1234"
             fake_project.return_value = fake_project_obj
 
             # Mock Provisioner._resolve_source to avoid storage validation
@@ -529,10 +542,14 @@ async def mock_resolve_source(self, path_mapping):
 
             # Verify default mapping is added (copy entire storage to current directory)
             assert "#!/bin/bash" in script_content
-            assert "set -xe" in script_content
+            assert "set -e" in script_content
+            # Verify /mnt/rdm symlink creation with retry logic
+            assert "PROJECT_DIR=/mnt/rdms/x1234" in script_content
+            assert 'ln -s "$PROJECT_DIR" /mnt/rdm' in script_content
             assert "cp -fr /mnt/rdm/osfstorage/* ." in script_content
-            # Should not have any link commands
-            assert "ln -s" not in script_content
+            # Should not have any user-defined link commands in the background block
+            # (only the /mnt/rdm setup link should exist)
+            assert script_content.count("ln -s") == 1
 
 
 def test_fetch_with_empty_paths_and_override():
@@ -561,6 +578,7 @@ async def mock_storage(name):
 
             fake_project_obj = MagicMock(storages=AsyncIterator([fake_storage]))
             fake_project_obj.storage = mock_storage
+            fake_project_obj.id = "x1234"
             fake_project.return_value = fake_project_obj
 
             binder_dir = os.path.join(d, "binder")
@@ -593,17 +611,21 @@ async def mock_resolve_source(self, path_mapping):
             with open(provision_script_path, 'r') as f:
                 script_content = f.read()
 
-            # Verify only shebang and set -xe are present, no copy or link commands
+            # Verify script structure without user-defined copy or link commands
             assert "#!/bin/bash" in script_content
-            assert "set -xe" in script_content
+            assert "set -e" in script_content
+            # Verify /mnt/rdm symlink creation with retry logic
+            assert "PROJECT_DIR=/mnt/rdms/x1234" in script_content
+            assert 'ln -s "$PROJECT_DIR" /mnt/rdm' in script_content
             # Should not have any copy commands
             assert "cp -fr" not in script_content
             assert "cp " not in script_content
-            # Should not have any link commands
-            assert "ln -s" not in script_content
-            # The script should only have 2 lines
-            lines = [line for line in script_content.strip().split('\n') if line]
-            assert len(lines) == 2, f"Expected 2 lines, got {len(lines)}: {lines}"
+            # Should not have any user-defined link commands
+            # (only the /mnt/rdm setup link should exist)
+            assert script_content.count("ln -s") == 1
+            # But should have background execution structure
+            assert 'PROVISION_LOG="/tmp/provision.log"' in script_content
+            assert "} > \"${PROVISION_LOG}\" 2>&1 &" in script_content
 
 
 def test_rdmurl_project_id():
