A minimalist Docker clone written in Python for learning container internals. Uses real Linux primitives (namespaces, cgroups, overlayfs) with zero external dependencies.
docklet/
├── pyproject.toml # Packaging, entry point: "docklet"
├── README.md
├── SPEC.md # This file
├── docklet/
│ ├── __init__.py
│ ├── cli.py # argparse CLI: run, ps, images, pull, rm, exec
│ ├── container.py # Container lifecycle orchestrator
│ ├── namespaces.py # ctypes wrappers for unshare/setns/sethostname
│ ├── cgroups.py # cgroups v2 resource limits (memory, cpu, pids)
│ ├── filesystem.py # overlayfs mount + pivot_root
│ ├── network.py # veth pairs + bridge via ip commands
│ ├── registry.py # Docker Hub registry v2 API (urllib only)
│ ├── image.py # Local image store: extract layers, list, remove
│ └── config.py # Paths, constants, syscall numbers
docklet pull IMAGE[:TAG] # Pull image from Docker Hub
docklet run [-m MEM] [-c CPU] IMAGE [CMD...] # Create and start a container
docklet exec CONTAINER CMD... # Execute command in running container
docklet ps # List containers
docklet images # List locally pulled images
docklet rm CONTAINER # Remove a container
Requires root (or equivalent capabilities). The CLI checks os.geteuid() == 0
at startup and exits with a clear message if not root.
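The command list and root check above could be wired up roughly as follows. This is a minimal sketch, not the actual `cli.py`: the names `build_parser` and `require_root`, the `dest` names, and the error message wording are assumptions made for illustration.

```python
import argparse
import os
import sys


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per docklet command (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="docklet")
    sub = parser.add_subparsers(dest="command", required=True)

    pull = sub.add_parser("pull")
    pull.add_argument("image")  # IMAGE[:TAG]

    run = sub.add_parser("run")
    run.add_argument("-m", dest="mem")  # memory limit, e.g. "512m"
    run.add_argument("-c", dest="cpu")  # CPU limit
    run.add_argument("image")
    # REMAINDER keeps everything after IMAGE, including flags meant for CMD.
    run.add_argument("cmd", nargs=argparse.REMAINDER)

    exec_ = sub.add_parser("exec")
    exec_.add_argument("container")
    exec_.add_argument("cmd", nargs=argparse.REMAINDER)

    sub.add_parser("ps")
    sub.add_parser("images")

    rm = sub.add_parser("rm")
    rm.add_argument("container")
    return parser


def require_root() -> None:
    """Exit with a readable message unless the effective UID is 0."""
    if os.geteuid() != 0:
        sys.exit("docklet: must be run as root (try: sudo docklet ...)")
```

A `main()` would call `require_root()` first, then `build_parser().parse_args()`, and dispatch on `args.command`.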
config.py defines all filesystem paths and syscall constants. It contains no logic, just data.
Paths:
| Constant | Value | Purpose |
|---|---|---|
| DOCKLET_ROOT | /var/lib/docklet | Top-level data directory |
| CONTAINERS_DIR | {DOCKLET_ROOT}/containers/ | Per-container state dirs |
| IMAGES_DIR | {DOCKLET_ROOT}/images/ | Extracted image layers |
| LAYERS_DIR | {DOCKLET_ROOT}/layers/ | Raw downloaded layer tarballs |
| CGROUP_ROOT | /sys/fs/cgroup/docklet | Cgroup subtree for docklet |
Networking:
| Constant | Value |
|---|---|
| NETWORK_BRIDGE | docklet0 |
| SUBNET | 10.0.100.0/24 |
| BRIDGE_IP | 10.0.100.1 |
Syscall numbers (x86-64):
| Constant | Value | Purpose |
|---|---|---|
| SYS_pivot_root | 155 | Switch root filesystem |
| SYS_mount | 165 | Mount filesystems |
| SYS_umount2 | 166 | Unmount filesystems |
| SYS_unshare | 272 | Move process into new namespaces |
| SYS_setns | 308 | Enter an existing namespace |
Namespace clone flags:
| Flag | Value |
|---|---|
| CLONE_NEWNS | 0x00020000 |
| CLONE_NEWUTS | 0x04000000 |
| CLONE_NEWIPC | 0x08000000 |
| CLONE_NEWUSER | 0x10000000 |
| CLONE_NEWPID | 0x20000000 |
| CLONE_NEWNET | 0x40000000 |
namespaces.py wraps raw syscalls using `ctypes.CDLL("libc.so.6", use_errno=True)`.
Functions:
- `unshare(flags: int) -> None`: Calls `libc.unshare()`. Moves the calling process into the new namespaces specified by the `flags` bitmask. For `CLONE_NEWPID`, only the caller's future children enter the new namespace.
- `setns(fd: int, nstype: int) -> None`: Calls `libc.setns()`. Enters an existing namespace by file descriptor. Used by `docklet exec` to join a running container's namespaces via the `/proc/<pid>/ns/*` files.
- `sethostname(name: str) -> None`: Sets the hostname inside the UTS namespace so the container gets its own hostname (typically the short container ID).
Design choice: Uses os.fork() + unshare() rather than raw clone().
Python's runtime (GC, thread state) does not survive a raw clone() well.
The fork+unshare pattern is standard in Python container code.
All operations are plain file reads/writes to /sys/fs/cgroup/. This module
demonstrates that cgroups are fundamentally just a filesystem interface.
Functions:
- `init() -> None`: Creates `/sys/fs/cgroup/docklet/` if needed. Writes `"+cpu +memory +pids"` to `cgroup.subtree_control` to enable controllers.
- `create(container_id: str) -> None`: Creates the cgroup directory at `{CGROUP_ROOT}/{container_id}/`.
- `set_memory_limit(container_id: str, limit_bytes: int) -> None`: Writes to `memory.max`.
- `set_cpu_limit(container_id: str, quota_us: int = 50000, period_us: int = 100000) -> None`: Writes `"{quota} {period}"` to `cpu.max`. Default = 50% of one core.
- `set_pids_limit(container_id: str, limit: int) -> None`: Writes to `pids.max`.
- `add_process(container_id: str, pid: int) -> None`: Writes the PID to `cgroup.procs`.
- `stats(container_id: str) -> dict`: Reads `memory.current` and `cpu.stat` for reporting in `docklet ps`.
- `cleanup(container_id: str) -> None`: Removes the cgroup directory after all processes have exited.
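To illustrate the "just a filesystem" point, here is a hedged sketch of three of the writers. The `root` keyword argument and the `_write` helper are assumptions added so the code can be exercised against a temporary directory instead of the real `/sys/fs/cgroup`.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup/docklet"


def _write(path: str, value: str) -> None:
    """Write a single value to a cgroup control file."""
    with open(path, "w") as f:
        f.write(value)


def set_cpu_limit(container_id: str, quota_us: int = 50000,
                  period_us: int = 100000, root: str = CGROUP_ROOT) -> None:
    # cpu.max takes "<quota> <period>" in microseconds; 50000/100000 = 50% of one core.
    _write(os.path.join(root, container_id, "cpu.max"), f"{quota_us} {period_us}")


def set_memory_limit(container_id: str, limit_bytes: int,
                     root: str = CGROUP_ROOT) -> None:
    _write(os.path.join(root, container_id, "memory.max"), str(limit_bytes))


def add_process(container_id: str, pid: int, root: str = CGROUP_ROOT) -> None:
    # On a real cgroup filesystem this write migrates the process into the cgroup.
    _write(os.path.join(root, container_id, "cgroup.procs"), str(pid))
```

For example, `set_memory_limit("a1b2c3d4", 512 * 1024 * 1024)` writes `536870912` to `memory.max` for that container.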
filesystem.py functions:
- `setup_overlay(container_id: str, image_layers: list[str]) -> str`: Mounts an overlayfs with:
  - `lowerdir`: colon-separated image layers. `image_layers` is ordered bottom to top, but overlayfs reads `lowerdir` left to right as top to bottom, so the list must be reversed when the option string is built.
  - `upperdir`: per-container writable layer
  - `workdir`: per-container scratch directory

  Returns the merged mountpoint path.
- `pivot_root(new_root: str) -> None`: The pivot_root dance:
  1. Bind-mount `new_root` onto itself (pivot_root requires the new root to be a mount point)
  2. `os.chdir(new_root)`
  3. Call `SYS_pivot_root(".", ".")`
  4. `umount2(".", MNT_DETACH)` to detach the old root

  Why pivot_root over chroot: chroot is escapable via `fchdir()` on an open fd to the old root. pivot_root changes the root mount of the mount namespace, so the old root can be detached entirely.
- `mount_special(rootfs: str) -> None`: Mounts `/proc` (new instance), `/dev/pts`, and `/dev/shm` inside the container so tools like `ps` work.
- `cleanup_overlay(container_id: str) -> None`: Unmounts the overlayfs and removes the writable layer.
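The layer ordering is easy to get backwards, so here is a small sketch of how the overlayfs option string might be built. The helper name `overlay_options` is hypothetical; it assumes `image_layers` arrives ordered bottom to top, as returned by the image store.

```python
def overlay_options(image_layers: list[str], upperdir: str, workdir: str) -> str:
    """Build the data string for an overlayfs mount.

    overlayfs treats the leftmost lowerdir entry as the topmost layer,
    so a bottom-to-top list is reversed before joining.
    """
    lower = ":".join(reversed(image_layers))
    return f"lowerdir={lower},upperdir={upperdir},workdir={workdir}"


# Equivalent shell form, for reference:
#   mount -t overlay overlay -o <options> <merged_dir>
```

With layers `["/l/base", "/l/app"]`, the result starts with `lowerdir=/l/app:/l/base`, which puts the application layer on top of the base layer.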
network.py uses `subprocess.run(["ip", ...])` for network configuration. Netlink would
take 200+ lines of binary message construction; `ip` commands are easier to read and are
available wherever iproute2 is installed.
Functions:
- `setup_bridge() -> None`: Creates bridge `docklet0` with IP `10.0.100.1/24` if it doesn't exist. Enables IP forwarding (`/proc/sys/net/ipv4/ip_forward`). Adds an iptables MASQUERADE rule for outbound NAT.
- `setup_container_net(container_id: str, pid: int) -> str`:
  - Creates a veth pair: `veth-{id[:7]}` (host) ↔ `eth0` (container)
  - Attaches the host end to bridge `docklet0`
  - Moves the container end into the container's network namespace
  - Assigns an IP from the subnet (sequential from .2)
  - Sets the default route inside the container → bridge IP
  - Returns the assigned IP address
- `cleanup_net(container_id: str) -> None`: Deletes the host-side veth (automatically destroys the pair).
- `_allocate_ip(container_id: str) -> str`: Reads existing container configs to pick the next available IP in the subnet.
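A sketch of the sequential allocation and veth naming, using the stdlib `ipaddress` module. The function here takes the set of addresses already in use as an argument (an assumption made for testability); the real `_allocate_ip` would gather that set from the `config.json` files.

```python
import ipaddress

SUBNET = "10.0.100.0/24"
BRIDGE_IP = "10.0.100.1"


def allocate_ip(used: set[str], subnet: str = SUBNET,
                bridge_ip: str = BRIDGE_IP) -> str:
    """Return the lowest host address that is neither the bridge nor in use."""
    for host in ipaddress.ip_network(subnet).hosts():
        addr = str(host)
        if addr != bridge_ip and addr not in used:
            return addr
    raise RuntimeError(f"no free addresses left in {subnet}")


def veth_name(container_id: str) -> str:
    """Host-side interface name; 12 characters, within Linux's 15-character limit."""
    return f"veth-{container_id[:7]}"
```

The first container gets `10.0.100.2` because `.1` is reserved for the bridge, and addresses freed by removed containers are reused on the next allocation.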
registry.py uses only `urllib.request` from the standard library. No external HTTP libraries.
Functions:
- `pull_image(image: str, tag: str = "latest") -> list[str]`: Orchestrator: authenticate → fetch manifest → download layers → extract. Returns an ordered list of layer directories.
- `_get_auth_token(image: str) -> str`: Requests a bearer token from `auth.docker.io` scoped to pull the image.
- `_get_manifest(image: str, tag: str, token: str) -> dict`: Fetches the image manifest from `registry-1.docker.io`. Handles manifest lists (multi-arch) by selecting `linux/amd64`. Sends `Accept: application/vnd.docker.distribution.manifest.v2+json`.
- `_pull_layer(image: str, digest: str, token: str, dest: str) -> None`: Downloads a blob with progress output (bytes downloaded / total). Saves to `LAYERS_DIR`.
Image naming: Bare names like alpine are expanded to library/alpine.
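The naming rule and the two pure pieces of the pull flow can be sketched without touching the network. The helper names (`normalize_image`, `token_url`, `pick_platform`) are hypothetical; the token endpoint and `service` value are the ones Docker Hub is known to use for anonymous pulls.

```python
from urllib.parse import urlencode

AUTH_URL = "https://auth.docker.io/token"


def normalize_image(image: str) -> str:
    """Expand bare names such as 'alpine' to 'library/alpine'."""
    return image if "/" in image else f"library/{image}"


def token_url(image: str) -> str:
    """Build the URL that returns a pull-scoped bearer token."""
    query = urlencode({
        "service": "registry.docker.io",
        "scope": f"repository:{normalize_image(image)}:pull",
    })
    return f"{AUTH_URL}?{query}"


def pick_platform(manifest_list: dict, arch: str = "amd64",
                  os_name: str = "linux") -> str:
    """Return the digest of the entry matching the wanted platform."""
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("architecture") == arch and platform.get("os") == os_name:
            return entry["digest"]
    raise LookupError(f"no {os_name}/{arch} entry in manifest list")
```

`_get_manifest` would then fetch the manifest again by the returned digest to obtain the per-platform layer list.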
image.py functions:
- `list_images() -> list[dict]`: Scans `IMAGES_DIR`, returns a list of `{name, tag, size, layers}` dicts.
- `get_layers(image: str, tag: str) -> list[str]`: Returns an ordered list of layer directory paths for a pulled image.
- `remove_image(image: str, tag: str) -> None`: Deletes the image directory tree.
- `extract_layer(tarball_path: str, dest_dir: str) -> None`: Extracts a `.tar.gz` layer, handling OCI whiteout files (`.wh.` prefix).
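Whiteout handling starts with recognizing the marker entries in a layer tarball. The sketch below only classifies member names; how `extract_layer` then applies a deletion is not specified here, and the function name `classify_member` is an assumption.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def classify_member(name: str) -> tuple[str, str]:
    """Classify a tar member name.

    Returns (kind, path) where kind is:
      'opaque'   - path is a directory whose lower-layer contents are hidden
      'whiteout' - path is the file or directory deleted by this layer
      'regular'  - path is an ordinary entry to extract
    """
    directory, base = posixpath.split(name.rstrip("/"))
    if base == OPAQUE_MARKER:
        return "opaque", directory
    if base.startswith(WHITEOUT_PREFIX):
        return "whiteout", posixpath.join(directory, base[len(WHITEOUT_PREFIX):])
    return "regular", name
```

For instance, a member named `etc/.wh.passwd` means "this layer deletes `etc/passwd`", so the marker itself should never be written to disk as a normal file.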
container.py orchestrates all other modules into a complete container lifecycle.
Per-container state stored in CONTAINERS_DIR/{id}/config.json:
{
"id": "a1b2c3d4",
"image": "alpine",
"tag": "latest",
"command": ["/bin/sh"],
"pid": 12345,
"status": "running",
"ip": "10.0.100.2",
"created": "2026-03-21T12:00:00",
"mem_limit": "512m",
"cpu_limit": 50000
}
Functions:
- `run(image, tag, command, mem_limit, cpu_limit) -> str`: Full create-and-start flow. Returns the container ID.
- `create(image, tag, command, mem_limit, cpu_limit) -> str`: Generates an 8-char hex ID, writes `config.json`, returns the ID.
- `start(container_id: str) -> None`: The main startup sequence; see "Container Startup Flow" below.
- `exec_in(container_id: str, command: list[str]) -> None`: Opens `/proc/<pid>/ns/{pid,mnt,uts,ipc,net}`, calls `setns()` for each, then `fork()` + `execvp()`.
- `stop(container_id: str) -> None`: Sends SIGTERM, waits briefly, then SIGKILL.
- `remove(container_id: str) -> None`: Stops the container if running, then cleans up filesystem, cgroup, and network.
- `list_containers() -> list[dict]`: Reads all `config.json` files, checks whether the PIDs are alive, returns status info.
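A sketch of `create()` and of the liveness check that `list_containers()` needs. The `containers_dir` parameter, the initial `"created"` status, the `None` placeholders for `pid` and `ip`, and the `is_alive` helper are assumptions for illustration; only the field names come from the `config.json` example above.

```python
import json
import os
import secrets
from datetime import datetime


def create(image, tag, command, mem_limit, cpu_limit,
           containers_dir="/var/lib/docklet/containers") -> str:
    """Generate an ID, write config.json, and return the ID."""
    container_id = secrets.token_hex(4)  # 4 random bytes -> 8 hex characters
    state_dir = os.path.join(containers_dir, container_id)
    os.makedirs(state_dir)
    config = {
        "id": container_id,
        "image": image,
        "tag": tag,
        "command": command,
        "pid": None,           # filled in by start()
        "status": "created",   # assumed initial state
        "ip": None,            # filled in after network setup
        "created": datetime.now().isoformat(timespec="seconds"),
        "mem_limit": mem_limit,
        "cpu_limit": cpu_limit,
    }
    with open(os.path.join(state_dir, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
    return container_id


def is_alive(pid) -> bool:
    """Check whether a PID exists by sending signal 0 (no signal is delivered)."""
    if not pid or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Signal 0 only performs the existence and permission checks, so it is safe to use on a container's init process.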
Container Startup Flow:
Parent: os.fork()
│
├─ Parent side:
│ 1. Wait for child to unshare (sync via pipe)
│ 2. Add child PID to cgroup
│ 3. Setup veth networking (move interface into child's netns)
│ 4. Signal child that network is ready (write to pipe)
│ 5. Write PID + status to config.json
│ 6. Wait for child (if interactive)
│
└─ Child side:
1. unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET)
2. Signal parent that unshare is done
3. os.fork() again ← required: PID namespace applies to children
│
└─ Inner child (PID 1 inside container):
1. Wait for parent's "network ready" signal
2. Mount overlayfs
3. Mount /proc, /dev inside new root
4. pivot_root into new rootfs
5. sethostname(container_id[:8])
6. os.execvp(command)
Why double fork: unshare(CLONE_NEWPID) does not move the caller into the
new PID namespace — only its future children enter it. The second fork ensures
the container's init process sees itself as PID 1.
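The two-pipe handshake and double fork described above can be sketched as a skeleton with the privileged steps injected as callables. Everything named here (`start_with_handshake`, `unshare_fn`, `parent_setup`, `inner_main`) is hypothetical; in docklet the callables would wrap `unshare()`, the cgroup and veth setup, and the mount/pivot_root/exec sequence.

```python
import os


def start_with_handshake(unshare_fn, parent_setup, inner_main):
    """Fork, synchronize over two pipes, and return (child_pid, exit_code)."""
    ready_r, ready_w = os.pipe()  # child -> parent: "unshare done"
    net_r, net_w = os.pipe()      # parent -> inner child: "network ready"

    pid = os.fork()
    if pid == 0:
        # Child side: never return into the caller's code, always _exit.
        try:
            os.close(ready_r)
            os.close(net_w)
            unshare_fn()
            os.write(ready_w, b"1")
            os.close(ready_w)
            inner = os.fork()  # this process would be PID 1 in the container
            if inner == 0:
                if os.read(net_r, 1) != b"1":
                    os._exit(1)  # parent closed the pipe without signaling
                os._exit(inner_main())
            os.close(net_r)
            _, status = os.waitpid(inner, 0)
            os._exit(os.waitstatus_to_exitcode(status))
        except BaseException:
            os._exit(127)

    # Parent side.
    os.close(ready_w)
    os.close(net_r)
    try:
        if os.read(ready_r, 1) != b"1":
            raise RuntimeError("child exited before unshare completed")
        parent_setup(pid)  # cgroup and veth setup would go here
        os.write(net_w, b"1")
    finally:
        os.close(ready_r)
        os.close(net_w)
        _, status = os.waitpid(pid, 0)
    return pid, os.waitstatus_to_exitcode(status)
```

One thing to watch in this ordering: cgroup membership is inherited at fork time, and writing a PID to `cgroup.procs` moves only that process. If the inner child is forked before the parent's write lands, it may need to be added to the cgroup explicitly.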
[project]
name = "docklet"
version = "0.1.0"
requires-python = ">=3.10"
[project.scripts]
docklet = "docklet.cli:main"
- Zero external dependencies: pure stdlib
- Install: `pip install -e .`
- Run: `sudo docklet run alpine /bin/sh`
| Phase | Modules | Root required? |
|---|---|---|
| 1 | config.py | No |
| 2 | registry.py, image.py | No |
| 3 | namespaces.py, cgroups.py, filesystem.py | To test |
| 4 | network.py | To test |
| 5 | container.py | To test |
| 6 | cli.py, pyproject.toml | No |