Commit 2fde214

feat(bootstrap): restore per-gateway Docker bridge networks (#303)
PR #281 removed the shared openshell-cluster Docker network in favor of the default bridge. This restores custom bridge networking but makes each gateway use its own isolated network named openshell-cluster-{name}, matching the existing container/volume naming convention.

Changes:

- Add network_name() to constants.rs for per-gateway network naming
- Add ensure_network() with retry/backoff and force_remove_network() parameterized by network name instead of a global constant
- Attach containers to their per-gateway network via network_mode
- Disconnect and remove the network during gateway destroy
- Wire ensure_network() into the deploy flow before ensure_volume()
- Update architecture docs to reflect per-gateway network isolation
1 parent 2bf9969 commit 2fde214

4 files changed

+128
-13
lines changed

architecture/gateway-single-node.md

Lines changed: 8 additions & 8 deletions
@@ -21,14 +21,14 @@ Out of scope:
 - `crates/openshell-cli/src/run.rs`: CLI command implementations (`gateway_start`, `gateway_stop`, `gateway_destroy`, `gateway_info`, `doctor_logs`).
 - `crates/openshell-cli/src/bootstrap.rs`: Auto-bootstrap helpers for `sandbox create` (offers to deploy a gateway when one is unreachable).
 - `crates/openshell-bootstrap/src/lib.rs`: Gateway lifecycle orchestration (`deploy_gateway`, `deploy_gateway_with_logs`, `gateway_handle`, `check_existing_deployment`).
-- `crates/openshell-bootstrap/src/docker.rs`: Docker API wrappers (network, volume, container, image operations).
+- `crates/openshell-bootstrap/src/docker.rs`: Docker API wrappers (per-gateway network, volume, container, image operations).
 - `crates/openshell-bootstrap/src/image.rs`: Remote image registry pull with XOR-obfuscated distribution credentials.
 - `crates/openshell-bootstrap/src/runtime.rs`: In-container operations via `docker exec` (health polling, stale node cleanup, deployment restart).
 - `crates/openshell-bootstrap/src/metadata.rs`: Gateway metadata creation, storage, and active gateway tracking.
 - `crates/openshell-bootstrap/src/mtls.rs`: Gateway TLS detection and CLI mTLS bundle extraction.
 - `crates/openshell-bootstrap/src/push.rs`: Local development image push into k3s containerd.
 - `crates/openshell-bootstrap/src/paths.rs`: XDG path resolution.
-- `crates/openshell-bootstrap/src/constants.rs`: Shared constants (image name, network name, container/volume naming).
+- `crates/openshell-bootstrap/src/constants.rs`: Shared constants (image name, container/volume/network naming).
 - `deploy/docker/Dockerfile.cluster`: Container image definition (k3s base + Helm charts + manifests + entrypoint).
 - `deploy/docker/cluster-entrypoint.sh`: Container entrypoint (DNS proxy, registry config, manifest injection).
 - `deploy/docker/cluster-healthcheck.sh`: Docker HEALTHCHECK script.
@@ -44,7 +44,7 @@ All gateway lifecycle commands live under `openshell gateway`:
 |---|---|
 | `openshell gateway start [--name NAME] [--remote user@host] [--ssh-key PATH]` | Provision or update a gateway |
 | `openshell gateway stop [--name NAME] [--remote user@host]` | Stop the container (preserves state) |
-| `openshell gateway destroy [--name NAME] [--remote user@host]` | Destroy container, attached volumes, metadata, and network |
+| `openshell gateway destroy [--name NAME] [--remote user@host]` | Destroy container, attached volumes, per-gateway network, and metadata |
 | `openshell gateway info [--name NAME]` | Show deployment details (endpoint, SSH host) |
 | `openshell status` | Show gateway health via gRPC/HTTP |
 | `openshell doctor logs [--name NAME] [--remote user@host] [--tail N]` | Fetch gateway container logs |
@@ -91,7 +91,7 @@ sequenceDiagram
 Note over B,R: Docker socket APIs only, no extra host dependencies
 
 B->>B: resolve SSH host for extra TLS SANs
-B->>R: ensure_network (bridge, attachable)
+B->>R: ensure_network (per-gateway bridge, attachable)
 B->>R: ensure_volume
 B->>R: ensure_container (privileged, k3s server)
 B->>R: start_container
@@ -159,7 +159,7 @@ Image ref resolution in `default_gateway_image_ref()`:
 
 For the target daemon (local or remote):
 
-1. **Ensure bridge network** `openshell-cluster` (attachable, bridge driver) via `ensure_network()`.
+1. **Ensure bridge network** `openshell-cluster-{name}` (attachable, bridge driver) via `ensure_network()`. Each gateway gets its own isolated Docker network.
 2. **Ensure volume** `openshell-cluster-{name}` via `ensure_volume()`.
 3. **Compute extra TLS SANs**:
 - For **local deploys**: Check `DOCKER_HOST` for a non-loopback `tcp://` endpoint (e.g., `tcp://docker:2375` in CI). If found, extract the host as an extra SAN. The function `local_gateway_host_from_docker_host()` skips `localhost`, `127.0.0.1`, and `::1`.
@@ -168,7 +168,7 @@ For the target daemon (local or remote):
 - k3s server command: `server --disable=traefik --tls-san=127.0.0.1 --tls-san=localhost --tls-san=host.docker.internal` plus computed extra SANs.
 - Privileged mode.
 - Volume bind mount: `openshell-cluster-{name}:/var/lib/rancher/k3s`.
-- Network: `openshell-cluster`.
+- Network: `openshell-cluster-{name}` (per-gateway bridge network).
 - Extra host: `host.docker.internal:host-gateway`.
 - Port mappings:
 
@@ -349,7 +349,7 @@ flowchart LR
 1. Stop the container.
 2. Remove the container (`force=true`). Tolerates 404.
 3. Remove the volume (`force=true`). Tolerates 404.
-4. Remove the network if no containers remain attached (`cleanup_network_if_unused()`).
+4. Force-remove the per-gateway network via `force_remove_network()`, disconnecting any stale endpoints first.
 
 **CLI layer** (`gateway_destroy()` in `run.rs` additionally):
 
@@ -359,7 +359,7 @@ flowchart LR
 ## Idempotency and Error Behavior
 
 - Re-running deploy is safe:
-- Existing network/volume are reused (inspect before create).
+- Network is recreated on each deploy to guarantee a clean state; volume is reused (inspect before create).
 - If a container exists with the same image ID, it is reused; if the image changed, the container is recreated.
 - `start_container` tolerates already-running state (409).
 - In interactive terminals, the CLI prompts the user to optionally destroy and recreate an existing gateway before redeploying.
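
The changed idempotency rule (network recreated on every deploy, volume reused) can be illustrated with a small in-memory model. This is only a sketch: `FakeDaemon` and its fields are hypothetical stand-ins for the Docker daemon and are not part of the commit, and the real functions talk to Docker through bollard rather than to a map.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory stand-in for a Docker daemon (not in the commit).
#[derive(Default)]
struct FakeDaemon {
    /// Network name -> IDs of containers attached to it.
    networks: BTreeMap<String, BTreeSet<String>>,
    /// Names of existing volumes.
    volumes: BTreeSet<String>,
}

impl FakeDaemon {
    /// Models the documented network behavior: drop whatever exists under
    /// this name (including stale endpoints) and create a fresh, empty network.
    fn ensure_network(&mut self, name: &str) {
        self.networks.remove(name);
        self.networks.insert(name.to_string(), BTreeSet::new());
    }

    /// Models the documented volume behavior: create only if missing.
    /// Returns true when a new volume was created, false when it was reused.
    fn ensure_volume(&mut self, name: &str) -> bool {
        self.volumes.insert(name.to_string())
    }
}
```

Under this model, a stale endpoint left on the network does not survive a second deploy, while the volume (and therefore the k3s state stored on it) does.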

crates/openshell-bootstrap/src/constants.rs

Lines changed: 4 additions & 0 deletions
@@ -19,3 +19,7 @@ pub fn container_name(name: &str) -> String {
 pub fn volume_name(name: &str) -> String {
     format!("openshell-cluster-{name}")
 }
+
+pub fn network_name(name: &str) -> String {
+    format!("openshell-cluster-{name}")
+}
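
As a quick check of the naming convention, the two helpers from this hunk can be exercised on their own. The function bodies below are copied from the diff; the gateway names `dev` and `staging` used in the usage note are made up for illustration.

```rust
/// Per-gateway Docker network name (body copied from the added lines).
pub fn network_name(name: &str) -> String {
    format!("openshell-cluster-{name}")
}

/// Per-gateway Docker volume name (body copied from the context lines).
pub fn volume_name(name: &str) -> String {
    format!("openshell-cluster-{name}")
}
```

For example, `network_name("dev")` yields `openshell-cluster-dev`, which differs from `network_name("staging")` but is identical to `volume_name("dev")`. Sharing one string between a network and a volume should be harmless, since Docker tracks networks and volumes as separate resource types.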

crates/openshell-bootstrap/src/docker.rs

Lines changed: 112 additions & 3 deletions
@@ -2,18 +2,19 @@
 // SPDX-License-Identifier: Apache-2.0
 
 use crate::RemoteOptions;
-use crate::constants::{container_name, volume_name};
+use crate::constants::{container_name, network_name, volume_name};
 use crate::image::{
     self, DEFAULT_IMAGE_REPO_BASE, DEFAULT_REGISTRY, DEFAULT_REGISTRY_USERNAME, parse_image_ref,
 };
 use bollard::API_DEFAULT_VERSION;
 use bollard::Docker;
 use bollard::errors::Error as BollardError;
 use bollard::models::{
-    ContainerCreateBody, DeviceRequest, HostConfig, PortBinding, VolumeCreateRequest,
+    ContainerCreateBody, DeviceRequest, HostConfig, NetworkCreateRequest, NetworkDisconnectRequest,
+    PortBinding, VolumeCreateRequest,
 };
 use bollard::query_parameters::{
-    CreateContainerOptions, CreateImageOptions, InspectContainerOptions,
+    CreateContainerOptions, CreateImageOptions, InspectContainerOptions, InspectNetworkOptions,
     ListContainersOptionsBuilder, RemoveContainerOptions, RemoveImageOptions, RemoveVolumeOptions,
     StartContainerOptions,
 };
@@ -185,6 +186,55 @@ pub async fn find_gateway_container(docker: &Docker, port: Option<u16>) -> Resul
     }
 }
 
+/// Create a fresh Docker bridge network for the gateway.
+///
+/// Always removes and recreates the network to guarantee a clean state.
+/// Stale Docker networks (e.g., from a previous interrupted destroy or
+/// Docker Desktop restart) can leave broken routing that causes the
+/// container to fail with "no default routes found".
+pub async fn ensure_network(docker: &Docker, net_name: &str) -> Result<()> {
+    force_remove_network(docker, net_name).await?;
+
+    // Docker may return a 409 conflict if the previous network teardown has
+    // not fully completed in the daemon. Retry a few times with back-off,
+    // re-attempting the removal before each create.
+    let mut last_err = None;
+    for attempt in 0u64..5 {
+        if attempt > 0 {
+            tokio::time::sleep(std::time::Duration::from_millis(500 * attempt)).await;
+            // Re-attempt removal in case the previous teardown has now settled.
+            force_remove_network(docker, net_name).await?;
+        }
+        match docker
+            .create_network(NetworkCreateRequest {
+                name: net_name.to_string(),
+                driver: Some("bridge".to_string()),
+                attachable: Some(true),
+                ..Default::default()
+            })
+            .await
+        {
+            Ok(_) => return Ok(()),
+            Err(err) if is_conflict(&err) => {
+                tracing::debug!(
+                    "Network create conflict (attempt {}/5), retrying: {}",
+                    attempt + 1,
+                    err,
+                );
+                last_err = Some(err);
+            }
+            Err(err) => {
+                return Err(err)
+                    .into_diagnostic()
+                    .wrap_err("failed to create Docker network");
+            }
+        }
+    }
+    Err(last_err.expect("at least one retry attempt"))
+        .into_diagnostic()
+        .wrap_err("failed to create Docker network after retries (network still in use)")
+}
+
 pub async fn ensure_volume(docker: &Docker, name: &str) -> Result<()> {
     match docker.inspect_volume(name).await {
         Ok(_) => return Ok(()),
@@ -328,6 +378,7 @@ pub async fn ensure_container(
         privileged: Some(true),
         port_bindings: Some(port_bindings),
         binds: Some(vec![format!("{}:/var/lib/rancher/k3s", volume_name(name))]),
+        network_mode: Some(network_name(name)),
         // Add host.docker.internal mapping for DNS resolution
         // This allows the entrypoint script to configure CoreDNS to use the host gateway
         extra_hosts: Some(vec!["host.docker.internal:host-gateway".to_string()]),
@@ -629,6 +680,21 @@ pub async fn destroy_gateway_resources(docker: &Docker, name: &str) -> Result<()
         .ok()
         .and_then(|info| info.image);
 
+    // Explicitly disconnect the container from the per-gateway network before
+    // removing it. This ensures Docker tears down the network endpoint
+    // synchronously so port bindings are released immediately and the
+    // subsequent network cleanup sees zero connected containers.
+    let net_name = network_name(name);
+    let _ = docker
+        .disconnect_network(
+            &net_name,
+            NetworkDisconnectRequest {
+                container: container_name.clone(),
+                force: Some(true),
+            },
+        )
+        .await;
+
     let _ = stop_container(docker, &container_name).await;
 
     let remove_container = docker
@@ -700,9 +766,52 @@ pub async fn destroy_gateway_resources(docker: &Docker, name: &str) -> Result<()
         return Err(err).into_diagnostic();
     }
 
+    // Force-remove the per-gateway network during a full destroy. First
+    // disconnect any stale endpoints that Docker may still report (race
+    // between container removal and network bookkeeping), then remove the
+    // network itself.
+    force_remove_network(docker, &net_name).await?;
+
     Ok(())
 }
 
+/// Forcefully remove a Docker network, disconnecting any remaining
+/// containers first. This ensures that stale Docker network endpoints
+/// cannot prevent port bindings from being released.
+async fn force_remove_network(docker: &Docker, net_name: &str) -> Result<()> {
+    let network = match docker
+        .inspect_network(net_name, None::<InspectNetworkOptions>)
+        .await
+    {
+        Ok(info) => info,
+        Err(err) if is_not_found(&err) => return Ok(()),
+        Err(err) => return Err(err).into_diagnostic(),
+    };
+
+    // Disconnect any containers still attached to the network.
+    if let Some(containers) = network.containers {
+        for (id, _) in containers {
+            let _ = docker
+                .disconnect_network(
+                    net_name,
+                    NetworkDisconnectRequest {
+                        container: id,
+                        force: Some(true),
+                    },
+                )
+                .await;
+        }
+    }
+
+    match docker.remove_network(net_name).await {
+        Ok(()) => Ok(()),
+        Err(err) if is_not_found(&err) => Ok(()),
+        Err(err) => Err(err)
+            .into_diagnostic()
+            .wrap_err("failed to remove Docker network"),
+    }
+}
+
 fn is_not_found(err: &BollardError) -> bool {
     matches!(
         err,
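
The retry loop in `ensure_network()` sleeps `500 * attempt` milliseconds before every attempt except the first, for up to five attempts. The sketch below works out that schedule using only the standard library. The names `MAX_ATTEMPTS`, `backoff_before_attempt`, and `total_backoff` are hypothetical and do not appear in the commit, which inlines the arithmetic.

```rust
use std::time::Duration;

/// Number of create attempts, matching `for attempt in 0u64..5` in the diff.
const MAX_ATTEMPTS: u64 = 5;

/// Delay applied before a given attempt (hypothetical helper).
/// Attempt 0 runs immediately; later attempts wait 500 ms * attempt.
fn backoff_before_attempt(attempt: u64) -> Option<Duration> {
    if attempt == 0 {
        None
    } else {
        Some(Duration::from_millis(500 * attempt))
    }
}

/// Total time spent sleeping if every attempt hits a conflict.
fn total_backoff() -> Duration {
    (0..MAX_ATTEMPTS).filter_map(backoff_before_attempt).sum()
}
```

With these values the delays are 500 ms, 1000 ms, 1500 ms, and 2000 ms, so a deploy that hits a conflict on every attempt sleeps for 5 seconds in total before returning the "after retries" error. That figure excludes the time taken by the Docker API calls themselves.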

crates/openshell-bootstrap/src/lib.rs

Lines changed: 4 additions & 2 deletions
@@ -26,11 +26,12 @@ use miette::{IntoDiagnostic, Result};
 use std::sync::{Arc, Mutex};
 
 use crate::constants::{
-    CLIENT_TLS_SECRET_NAME, SERVER_CLIENT_CA_SECRET_NAME, SERVER_TLS_SECRET_NAME, volume_name,
+    CLIENT_TLS_SECRET_NAME, SERVER_CLIENT_CA_SECRET_NAME, SERVER_TLS_SECRET_NAME, network_name,
+    volume_name,
 };
 use crate::docker::{
     check_existing_gateway, check_port_conflicts, destroy_gateway_resources, ensure_container,
-    ensure_image, ensure_volume, start_container, stop_container,
+    ensure_image, ensure_network, ensure_volume, start_container, stop_container,
 };
 use crate::metadata::{
     create_gateway_metadata, create_gateway_metadata_with_host, local_gateway_host,
@@ -309,6 +310,7 @@ where
 
     // All subsequent operations use the target Docker (remote or local)
     log("[status] Initializing environment".to_string());
+    ensure_network(&target_docker, &network_name(&name)).await?;
     ensure_volume(&target_docker, &volume_name(&name)).await?;
 
     // Compute extra TLS SANs for remote deployments so the gateway and k3s
