`configs/kueue/docs/adrs/ADR-001-kueue-adoption.md`

# ADR-001: Kueue Adoption

- Status: Accepted
- Date: January 2025

## Context

CANFAR needs a Kubernetes-native way to control admission, queueing, quotas,
borrowing, reclaim, and visibility for a mix of interactive, persistent, and
batch science workloads. The platform must handle very large pending backlogs
without treating direct Kubernetes scheduling as the tenant policy layer.

## Decision

CANFAR adopts Kueue as the admission and quota orchestration layer for the
Science Platform. Kubernetes remains the runtime scheduler and execution plane.
`skaha` remains the main user submission entry point.

## Consequences

Kueue provides the needed queue, quota, priority, cohort, and visibility
primitives. It also creates a clean path to future topology-aware scheduling and
MultiKueue.

CANFAR must still solve identity, project mapping, and accounting outside
Kueue. Kueue is not the tenant system of record.

## Alternatives considered

- Continue with direct Kubernetes scheduling and custom ad hoc controls
- Build a custom scheduling layer or scheduler plugin stack
- Treat the backlog problem as only a `skaha` rate-limiting problem

These alternatives either move too much policy into custom code or fail to give
native cohort, quota, and admission control semantics.
`configs/kueue/docs/adrs/ADR-002-workload-apis.md`

# ADR-002: Supported Workload APIs

- Status: Accepted
- Date: Spring 2025

## Context

The target architecture must support a broad workload taxonomy, but the current
repository baseline and the need for a safe operational rollout make it unwise
to treat every Kueue integration as a production commitment.

## Decision

Production support centers on `batch/v1.Job`, including Indexed Job
usage patterns for large independent fan-out work. Protected interactive and
persistent workloads may be brought under Kueue using mature controller
patterns, but only where the team can verify the operational behavior.

`JobSet`, MPI, Ray, and other advanced or distributed controllers remain part of
the target taxonomy and future roadmap, not the initial production commitment.
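As a sketch of the first production lane, an Indexed Job enters Kueue through the standard `kueue.x-k8s.io/queue-name` label. The queue and job names below are hypothetical, and the example assumes a Kueue-managed `workloads` namespace:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fanout-example                     # hypothetical job name
  namespace: workloads
  labels:
    kueue.x-k8s.io/queue-name: lq-example  # hypothetical LocalQueue
spec:
  completions: 100
  parallelism: 10
  completionMode: Indexed    # large independent fan-out work
  suspend: true              # created suspended; Kueue unsuspends on admission
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```

Each index runs as an independent pod, which matches the large independent fan-out pattern named above.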

## Consequences

The platform gets a safe first production lane for large-scale batch admission
without blocking future support for more advanced workload types.

The package still documents the full workload taxonomy so later phases do not
need to invent a new fairness or queue model.

## Alternatives considered

- Promise production support for every Kueue integration
- Delay all interactive or persistent integration until after batch-only rollout

The first option creates avoidable operational risk. The second option breaks
the desired unified scheduling model too early.
`configs/kueue/docs/adrs/ADR-003-shared-workloads-namespace.md`

# ADR-003: Shared `workloads` namespace now, namespace evolution later

- Status: Accepted
- Date: March 12, 2026

## Context

The current Kueue repository baseline uses multiple managed namespaces, but the
target architecture wants one shared Kueue-managed namespace at first so queue
governance, RBAC, and visibility can be kept simple while the new tenant model
is introduced.

## Decision

Use one shared `workloads` namespace for Kueue-managed user workloads in the
target single-cluster design. Create project-scoped `LocalQueue` objects in that
shared namespace on demand.

Supported future layouts include one namespace per community, one per workload
class, or a hybrid of the two.
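A project-scoped queue in the shared namespace could look like the following sketch (the queue and `ClusterQueue` names are hypothetical):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-skysurvey          # hypothetical project queue, created on demand
  namespace: workloads        # the single shared Kueue-managed namespace
spec:
  clusterQueue: cq-astronomy  # hypothetical community-level ClusterQueue
```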

## Consequences

This keeps the initial rollout simpler and reduces the number of moving parts
while project-based fairness and community ownership are introduced.

Future namespace splits remain possible without changing the core
community-project-cohort model.

## Alternatives considered

- Start immediately with one namespace per community
- Start immediately with one namespace per workload class

Both alternatives add governance and visibility complexity too early in the
rollout.
`configs/kueue/docs/adrs/ADR-004-standalone-control-service.md`

# ADR-004: Standalone accounting and control service

- Status: Accepted
- Date: March 12, 2026

## Context

Kueue cannot serve as the system of record for communities, projects, POSIX
group mapping, delegated project administration, or accounting relationships.
Those concerns are fundamental to CANFAR's policy and visibility.

## Decision

Define a new standalone accounting and control service as a required future
dependency of the platform. The service remains out of scope for implementation
in this package, but it is in scope for architecture and requirements.

The service must support:

- community creation and management by cluster admins
- project creation inside a community by delegated project admins
- project-to-group mapping and later user or group resolution
- override request workflows for temporary fair-share changes
- exposure of tenant metadata to `skaha` and the future visibility UI
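One hypothetical shape for the tenant metadata the service would expose, purely illustrative since the service is in scope for requirements but not yet designed (all names and fields are assumptions):

```yaml
# hypothetical tenant record served by the future control service
community: astronomy
projects:
  - name: skysurvey
    admins: ["example-admin"]          # delegated project admins
    posixGroups: ["grp-skysurvey"]     # group-to-project mapping
    localQueue: lq-skysurvey           # queue created on demand in `workloads`
    overrides:
      - type: fair-share-weight        # temporary fair-share change
        value: 2.0
        expires: "2026-04-01T00:00:00Z"
```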

## Consequences

The scheduler design stays clean. Kueue owns admission and quota behavior while
the control service owns tenant and policy metadata.

The rollout now has an explicit dependency that must be addressed in later
phases rather than hidden behind manual configuration.

## Alternatives considered

- Extend an existing service implicitly without naming a new component
- Keep project metadata as static Kubernetes configuration only

Both alternatives hide ownership and make future admin workflows difficult to
design and operate.
# ADR-005: Fairness, workload priority, and preemption model

- Status: Accepted
- Date: March 12, 2026

## Context

CANFAR needs fair competition between projects, community ownership of
resources, borrowing of idle capacity, and a workload-ordering model that keeps
interactive work ahead of batch work inside each project.

## Decision

Use the following split model:

- Community = `ClusterQueue`
- Project = `LocalQueue`
- Multiple communities sharing capacity = `Cohort`
- Project competition inside one community = Admission Fair Sharing with
adjustable `LocalQueue` weights
- Workload ordering inside one project = `WorkloadPriorityClass`

Use cohort borrowing and reclaim for community-level resource ownership. Use
project-local workload priority to select interactive work before lower-priority
batch work inside the chosen project queue.
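A minimal sketch of the community and priority halves of this model, with hypothetical names and quota figures (the Admission Fair Sharing weights on `LocalQueue` objects are omitted because their exact shape depends on the Kueue version in use):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-astronomy             # hypothetical community queue
spec:
  cohort: shared-capacity        # communities in one cohort may borrow idle quota
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor   # hypothetical ResourceFlavor
          resources:
            - name: cpu
              nominalQuota: "512"
            - name: memory
              nominalQuota: 2Ti
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: interactive              # orders work inside a project queue
value: 10000
description: "Interactive sessions admitted ahead of batch work"
```

Borrowed cohort capacity can be reclaimed by the owning community, which is what preserves community ownership while idle resources are used.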

## Consequences

This preserves community ownership while still maximizing idle cluster use. It
also avoids pretending that project fair-share and workload priority are the
same thing.

Cross-community competition remains community-scoped rather than global
project-scoped. That is intentional.

## Alternatives considered

- One global project fair-share plane across all communities
- Priority-only scheduling without project fair-share weights
- Community-only fairness with no project-level balancing

These alternatives either ignore community ownership or fail to give projects a
meaningful fairness model inside a community.
`configs/kueue/docs/adrs/ADR-006-posix-group-project-mapping.md`

# ADR-006: POSIX group to project mapping options

- Status: Proposed
- Date: March 12, 2026

## Context

Projects may contain multiple POSIX groups and communities may contain multiple
projects. The open question is whether a POSIX group may belong to more than one
project.

This decision changes the submission experience because ambiguous group mapping
may force the API layer to require an explicit project field.

## Options

### Option A: One group maps to exactly one project

Under this option, a POSIX group may not belong to multiple projects.

#### Benefits

- `skaha` can often infer project and community from group context
- submission stays simpler for users
- visibility and accounting reasoning stay easier to explain

#### Costs

- the identity model is stricter
- some administrative use cases may need new group structures

### Option B: A group may map to multiple projects

Under this option, a POSIX group may belong to more than one project.

#### Benefits

- the identity model is more flexible
- administrators can reuse groups across projects

#### Costs

- the submission path must require explicit project selection in ambiguous cases
- user experience becomes more complex
- the control service and UI must explain ambiguity clearly

## Current direction

Leave the decision open. The architecture and UI must support both models until
the tenant administration workflow is finalized.
`configs/kueue/docs/adrs/ADR-007-resource-flavor-taxonomy.md`

# ADR-007: ResourceFlavor taxonomy and topology model

- Status: Accepted
- Date: March 12, 2026

## Context

CANFAR needs a flavor model that captures resource identity across cluster,
zone, accelerator type, storage class, and later topology-aware scheduling
domains. The model must stay readable to operators and extensible to future
MultiKueue deployments.

## Decision

Use `ResourceFlavor` as the canonical scheduler-facing identity for placement and
hardware classes. Standardize flavor naming around stable placement and hardware
dimensions rather than workload class.

Adopt the following naming pattern:

`rf-<cluster>-<zone>-<resource-class>[-<accelerator-class>]`

Examples:

- `rf-ca-west-01-cpu-standard`
- `rf-ca-west-01-cpu-highmem`
- `rf-ca-west-01-gpu-a100`

Treat topology-aware scheduling as a future phase. When topology becomes active,
use `Topology` objects and flavor association rather than encoding full topology
hierarchy into the flavor name itself.
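A flavor following this naming pattern could be bound to nodes as in the sketch below. The node label keys are hypothetical and depend on the cluster's labeling scheme; the taint toleration assumes GPU nodes carry the common `nvidia.com/gpu` taint:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: rf-ca-west-01-gpu-a100
spec:
  nodeLabels:
    # hypothetical label keys; actual keys depend on the cluster
    canfar.net/zone: ca-west-01
    canfar.net/accelerator: a100
  tolerations:
    - key: nvidia.com/gpu   # assumed taint on GPU nodes
      operator: Exists
      effect: NoSchedule
```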

## Consequences

Operators get a stable taxonomy that works in both single-cluster and future
manager-worker designs. Users and admins can also read flavor identity in a
predictable way.

## Alternatives considered

- Opaque flavor names with documentation-only meaning
- One flavor per workload class
- Encoding every topology dimension directly in the flavor name

These alternatives either hide meaning or create unnecessary flavor sprawl.
# ADR-008: Queue enforcement and managed namespace model

- Status: Accepted
- Date: March 12, 2026

## Context

Kueue policy only works predictably when managed workloads land in managed
namespaces and carry valid queue information. CANFAR requires users to
submit through `skaha`, not through raw Kubernetes APIs without platform policy.

## Decision

Use explicitly managed namespaces for Kueue-managed user work. In the target
state this is one shared `workloads` namespace. The submission path must resolve
and apply a `LocalQueue` explicitly.

Keep `manageJobsWithoutQueueName` disabled and reject malformed or unqueued
submissions in managed namespaces through admission policy and service-side
validation.
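One way to express the rejection of unqueued submissions, sketched here with a Kubernetes `ValidatingAdmissionPolicy` (the policy name and message are hypothetical; a companion `ValidatingAdmissionPolicyBinding` would scope it to the managed `workloads` namespace):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-queue-name       # hypothetical policy name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["batch"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["jobs"]
  validations:
    - expression: >-
        has(object.metadata.labels) &&
        'kueue.x-k8s.io/queue-name' in object.metadata.labels
      message: "Jobs in managed namespaces must name a LocalQueue."
```

With `manageJobsWithoutQueueName` disabled in the Kueue configuration, anything that slips past this check is simply ignored by Kueue rather than silently queued.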

## Consequences

The scheduler does not need to guess tenant identity. Platform policy remains
explicit, and visibility stays consistent with actual queue assignment.

Future namespace evolution remains possible as long as the same enforcement
principles are preserved.

## Alternatives considered

- Allow silent default queue assignment everywhere
- Allow users to create unmanaged work in the same namespaces as managed work

These alternatives make fairness and explanation harder to trust.
`configs/kueue/docs/adrs/ADR-009-visibility-and-ui-scope.md`

# ADR-009: Visibility and UI scope

- Status: Accepted
- Date: March 12, 2026

## Context

Fair scheduling without understandable visibility will be perceived as arbitrary.
CANFAR's users, project admins, and cluster admins all need different levels of
insight into ownership, pending reasons, and current queue position.

## Decision

Treat visibility as a first-class architectural concern. The initial production
rollout relies on `kubectl`, Grafana, Kueue metrics, and the pending-workloads
visibility API.
Later phases add a read-only queue UI and then guided admin workflows.

The UI must explain scheduling outcomes in terms of:

- fair-share position
- workload priority
- quota exhaustion
- insufficient resource availability
- policy rejection

## Consequences

The architecture gains a clear product surface instead of assuming that raw
conditions or controller logs are enough.

This also creates a requirement for the control service to expose tenant and
override metadata to the UI.

## Alternatives considered

- Delay visibility until after scheduling is complete
- Rely only on Kubernetes-native object inspection

These alternatives make correct policy look opaque to most users.