RFC: Generate minimal profile by observing a workload (`sandlock learn`) #72

## Goal

Add a `sandlock learn -- <cmd>` mode that runs a workload under instrumentation and emits a minimal sandlock profile (TOML) covering exactly the filesystem reads/writes, network egress, and syscalls the workload actually used. Subsequent `sandlock run -p <profile>` invocations confine the workload to just that observed surface.

## Motivation

Writing a tight policy from scratch is the single largest UX gap in the flag-based interface. To run a Python script under sandlock today, the user has to know where CPython, site-packages, SSL certificates, locale archives, and temp directories live. Most users give up and write `-r /usr -r /lib -r /lib64 -r /etc -w /tmp`, which is so wide that it is close to no policy at all. The result is a sandbox in name only.

The XOA model assumes per-call confinement is tight enough to matter. If the per-call profile is permissive by default, the threat model degrades to "container without a container," which is worse than what we promise.

The way out is observation. Run the workload once, record what it actually touches, emit a profile. The user starts from "definitely works, definitely minimal" rather than "guess and iterate."

## Proposed design

### Command surface

```
sandlock learn -o profile.toml -- python3 build.py
sandlock learn --merge profile.toml -- python3 build.py   # union into existing
sandlock run -p profile.toml -- python3 build.py
```

### What is recorded

| Domain | Recording mechanism |
|---|---|
| Filesystem reads | Permissive Landlock + seccomp-notify on `openat`/`open` |
| Filesystem writes | Same; classified by open flags (`O_WRONLY` / `O_RDWR` / `O_CREAT`) |
| Network egress (TCP / UDP / ICMP) | seccomp-notify on `connect` / `sendto` / `sendmsg` |
| HTTP method + host + path | Existing transparent proxy with `--http-ca`, in logging-only mode |
| Syscalls | seccomp filter reports each syscall invoked; the supervisor collects the unique set |
| Resource peaks | `/proc/<pid>` sampling: max RSS, max threads, max FDs |
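
As a rough illustration of the read/write classification in the table above, the following sketch maps the `flags` argument of an observed `open`/`openat` call to an access class. The `Access` enum and `classify_open` function are hypothetical names, not existing sandlock APIs, and the flag constants are the common Linux values written out by hand; real code would take them from the `libc` crate.

```rust
// Linux open(2) flag values (assumed here; real code would use the `libc` crate).
const O_ACCMODE: i32 = 0o3;
const O_WRONLY: i32 = 0o1;
const O_RDWR: i32 = 0o2;
const O_CREAT: i32 = 0o100;

/// Hypothetical access class recorded for one observed open.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Access {
    Read,
    Write,
    ReadWrite,
}

/// Classify one observed open/openat call by its flags argument.
pub fn classify_open(flags: i32) -> Access {
    let mode = flags & O_ACCMODE;
    let creates = (flags & O_CREAT) != 0;
    match mode {
        O_WRONLY => Access::Write,
        O_RDWR => Access::ReadWrite,
        // O_RDONLY combined with O_CREAT can still create the file,
        // so the path needs a writable rule as well as a readable one.
        _ if creates => Access::ReadWrite,
        _ => Access::Read,
    }
}
```

Treating `O_RDONLY | O_CREAT` as read-write is a conservative choice for this sketch; the prototype may want a separate "create" class instead.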

### Output format

Reuse the existing TOML profile serializer in `crates/sandlock-core/src/profile.rs`. Fields populated:

- `fs.readable` / `fs.writable`: minimal path prefixes covering observed accesses, collapsed to directory granularity (see below).
- `net.allow`: observed `host:port` pairs, with scheme prefix when non-TCP.
- `http.allow`: observed method+host+path rules. Optional, gated by `--learn-http`.
- `seccomp.allow`: minimal syscall set, gated by `--learn-syscalls`; otherwise omit and rely on the default profile.
- `limits.max_memory` / `limits.max_processes`: observed peak multiplied by a safety factor (default 1.5x, configurable).
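
The limit computation could look roughly like the sketch below. `suggested_limit` is a hypothetical helper name, and rounding up is an assumption of this example rather than something the RFC specifies.

```rust
/// Scale an observed peak by a safety factor, rounding up.
/// Uses f64 arithmetic, so peaks above 2^53 would lose precision.
pub fn suggested_limit(observed_peak: u64, safety_factor: f64) -> u64 {
    (observed_peak as f64 * safety_factor).ceil() as u64
}
```

With the default factor, an observed peak of 3 processes would yield a cap of 5, since 4.5 rounds up.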

The output carries a header comment with the input command, host kernel, and timestamp, so reproduction is unambiguous.

### Path collapsing

Recording one entry per file is unworkable: a Python import touches thousands. The collapser groups by directory using a tunable heuristic:

- If the workload touched at least N files (default 4) under a directory, allowlist the directory.
- If fewer, allowlist individual files.
- `--collapse N` overrides the threshold, and `--collapse-prefix /usr/lib/python3` forces aggregation under the given prefix.
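
A minimal sketch of the threshold heuristic, assuming grouping by immediate parent directory only. The function name `collapse_paths` is hypothetical, and this version does not collapse across nested directories or handle `--collapse-prefix`.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::path::{Path, PathBuf};

/// Collapse observed file paths into allowlist rules.
/// A directory with at least `threshold` distinct observed files is
/// allowlisted as a whole; otherwise its files are listed individually.
pub fn collapse_paths(observed: &[PathBuf], threshold: usize) -> BTreeSet<PathBuf> {
    // Group distinct observed files by their immediate parent directory.
    let mut by_dir: BTreeMap<PathBuf, BTreeSet<PathBuf>> = BTreeMap::new();
    for path in observed {
        let dir = path.parent().unwrap_or(Path::new("/")).to_path_buf();
        by_dir.entry(dir).or_default().insert(path.clone());
    }

    let mut rules = BTreeSet::new();
    for (dir, files) in by_dir {
        if files.len() >= threshold {
            rules.insert(dir);
        } else {
            rules.extend(files);
        }
    }
    rules
}
```

With the default threshold of 4, four files observed under `/usr/lib/python3` would collapse to that directory, while a lone `/etc/hosts` would stay a single-file rule.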

### Merging and iteration

`--merge profile.toml` performs a union: existing rules are retained, new rules are added, and resource caps take the maximum of the existing and observed values. Iterative refinement is the expected workflow: run with one input, run with another, merge.
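
The union semantics might be sketched as follows. `LearnedProfile` is a hypothetical, simplified stand-in for the real profile type in `crates/sandlock-core/src/profile.rs`, whose actual shape is not shown in this RFC. The field names only mirror the TOML keys listed under "Output format".

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified profile used only to illustrate merging.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct LearnedProfile {
    pub fs_readable: BTreeSet<String>,
    pub fs_writable: BTreeSet<String>,
    pub net_allow: BTreeSet<String>,
    pub max_memory: Option<u64>,
    pub max_processes: Option<u64>,
}

impl LearnedProfile {
    /// Union the observed rules into `self`; resource caps take the max.
    pub fn merge(&mut self, observed: &LearnedProfile) {
        self.fs_readable.extend(observed.fs_readable.iter().cloned());
        self.fs_writable.extend(observed.fs_writable.iter().cloned());
        self.net_allow.extend(observed.net_allow.iter().cloned());
        // Option orders None below Some(_), so a missing cap never
        // overrides a present one.
        self.max_memory = self.max_memory.max(observed.max_memory);
        self.max_processes = self.max_processes.max(observed.max_processes);
    }
}
```

Because the merge only ever adds rules and raises caps, repeated merges are order-independent, which fits the iterate-and-merge workflow described above.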

When a later `sandlock run -p` hits a denial, the seccomp/Landlock log line should suggest `sandlock learn --merge profile.toml -- ...` to extend the profile.

## Open questions

1. **Permissive-Landlock + notify vs notify-only.** Landlock has no native audit mode. Either accept denials and observe them, or run with fully permissive rules and observe via seccomp-notify on file syscalls. The latter is heavier per syscall but complete. Decide during prototype.
2. **eBPF as alternative recorder.** A bcc/bpftrace tracer would be much faster than seccomp-notify but adds a build dependency. Out of scope for v1; revisit if seccomp-notify overhead is prohibitive on realistic workloads.
3. **Per-invocation log vs aggregate.** Probably emit both: the aggregated profile is the artifact, a side-channel `--debug-log` records every observation for diagnosing the collapser.
4. **Multi-process workloads.** Forks inherit the seccomp filter; the supervisor already aggregates across children for runtime sandboxes. Same machinery applies; verify in prototype.

## Out of scope for v1

- Replay / fuzzing across input variants to broaden the trace.
- ML-guided rule generalization.
- Auto-tightening: take an existing profile, identify rules that no recorded run actually exercised, suggest removal. Useful follow-on once `learn` lands.
