Skip to content

Latest commit

 

History

History
309 lines (245 loc) · 11.6 KB

File metadata and controls

309 lines (245 loc) · 11.6 KB

Phase 3: Observability & Storage - Implementation Tasks

Goal: Complete storage integration and deploy full observability stack

Milestone: User has production-ready monitoring, logging, and persistent storage

Prerequisites

Phase 2 must be complete:

  • ✓ K3s cluster running with VIP
  • ✓ Helm client working
  • ✓ Contour and cert-manager deployed
  • ✓ OpenBAO and Zot installed

High-Level Task Areas

1. Storage Integration

Architecture Decision - S3 Storage: SeaweedFS is deployed as the S3 endpoint for all components (Loki, Velero, etc.). Its backing storage comes from Longhorn PVCs, providing distributed and replicated storage.

This provides a consistent S3 API with high performance, simplifying component configuration.

Working States:

  • Storage component deployed (supports Longhorn, NFS, local-path backends)
  • StorageClass configured and set as default
  • SeaweedFS component for S3-compatible storage
  • S3-compatible storage architecture defined (SeaweedFS on Longhorn PVCs)
  • Integration tests for PVC provisioning

Key Tasks:

  • Deploy storage component with Longhorn backend ✓ Implemented (unified storage component)
  • Configure StorageClass ✓ Implemented
  • Test PVC provisioning and mounting ✓ Integration tests exist
  • Deploy SeaweedFS via Helm ✓ Implemented
  • Configure SeaweedFS with Longhorn-backed PVCs ✓ SeaweedFS uses Longhorn StorageClass
  • Create S3 buckets for various components ✓ Helm chart handles bucket creation
  • Document storage architecture decisions ✓ Documented in docs/storage.md

Files:

  • v1/internal/component/storage/types.go ✓ (includes LonghornConfig)
  • v1/internal/component/storage/install.go ✓ (implements Longhorn, NFS, local-path)
  • v1/internal/component/storage/*_test.go
  • v1/internal/component/seaweedfs/types.go
  • v1/internal/component/seaweedfs/install.go
  • v1/internal/component/seaweedfs/*_test.go
  • v1/cmd/foundry/commands/storage/provision.go

2. Prometheus

Working States:

  • Prometheus deployed via kube-prometheus-stack Helm chart
  • ServiceMonitors configured for auto-discovery
  • Scraping K3s, OpenBAO, Zot metrics
  • PVC for Prometheus TSDB
  • Retention configured per stack.yaml

Key Tasks:

  • Deploy kube-prometheus-stack Helm chart ✓ Implemented
  • Configure Prometheus retention ✓ Implemented (configurable retention_days and retention_size)
  • Set up service discovery for foundry-* namespaces ✓ Configured with nil selectors for all namespaces
  • Create ServiceMonitors for core components (OpenBAO, Zot, etc.) ✓ External targets auto-configured via additionalScrapeConfigs
  • Configure PVC for storage ✓ Implemented with configurable storage class and size
  • Set up alerting rules (basic)
  • Test metric collection

Files:

  • v1/internal/component/prometheus/types.go
  • v1/internal/component/prometheus/install.go
  • v1/internal/component/prometheus/types_test.go ✓ (94% coverage)
  • v1/internal/component/prometheus/install_test.go

3. Loki

Working States:

  • Loki deployed via loki-stack Helm chart
  • Collecting logs from all pods (via Promtail)
  • S3-compatible storage for log retention (via SeaweedFS)
  • Retention configured per stack.yaml

Key Tasks:

  • Deploy loki-stack Helm chart ✓ Implemented (using grafana/loki chart)
  • Configure S3 backend (SeaweedFS) ✓ Implemented with full S3 config
  • Set up retention policies ✓ Implemented (configurable retention_days)
  • Configure promtail for log collection ✓ Implemented (optional, enabled by default)
  • Test log ingestion and querying

Files:

  • v1/internal/component/loki/types.go
  • v1/internal/component/loki/install.go
  • v1/internal/component/loki/types_test.go ✓ (92.7% coverage)
  • v1/internal/component/loki/install_test.go

4. Grafana

Working States:

  • Grafana deployed via Helm
  • Prometheus and Loki configured as data sources
  • Admin password stored in OpenBAO
  • OIDC auth via OpenBAO (optional for now, required in Phase 4)
  • Default dashboards provisioned
  • Accessible via Ingress

Key Tasks:

  • Deploy Grafana Helm chart ✓ Implemented (grafana/grafana chart v8.8.2)
  • Configure data sources (Prometheus, Loki) ✓ Implemented with auto-configuration
  • Store/retrieve admin password from OpenBAOfoundry-core/grafana with admin_password and admin_user keys
  • Create Ingress with TLS ✓ Implemented via Contour
  • Provision default dashboards (K3s, Prometheus, Loki) ✓ Kubernetes cluster + node-exporter dashboards
  • Test dashboard access

Files:

  • v1/internal/component/grafana/types.go
  • v1/internal/component/grafana/install.go ✓ (includes datasources configuration)
  • v1/internal/component/grafana/openbao.go ✓ (OpenBAO credential storage/retrieval)
  • v1/internal/component/grafana/types_test.go ✓ (92.5% coverage)
  • v1/internal/component/grafana/install_test.go
  • v1/internal/component/grafana/openbao_test.go ✓ (100% coverage on most functions)

5. External-DNS

Working States:

  • External-DNS deployed via Helm
  • Automatically creates DNS records for Ingress resources
  • Configured for appropriate DNS provider (or left unconfigured for user setup)

Key Tasks:

  • Deploy External-DNS Helm chart ✓ Implemented (external-dns/external-dns chart v1.15.0)
  • Configure DNS provider ✓ Supports PowerDNS, Cloudflare, RFC2136, Route53, Google, Azure
  • Store DNS provider credentials in OpenBAO ✓ Per-provider paths under foundry-core/external-dns/<provider>
  • Test automatic DNS record creation
  • Document supported DNS providers ✓ Documented in docs/dns.md

Supported Providers:

  • PowerDNS (pdns) - with API URL and key → stored at foundry-core/external-dns/pdns
  • Cloudflare - with API token, optional proxied mode → stored at foundry-core/external-dns/cloudflare
  • RFC2136 - for dynamic DNS updates with TSIG → stored at foundry-core/external-dns/rfc2136
  • AWS Route53, Google Cloud DNS, Azure DNS - pass-through for custom config (use IAM roles/workload identity)

Files:

  • v1/internal/component/externaldns/types.go
  • v1/internal/component/externaldns/install.go
  • v1/internal/component/externaldns/openbao.go ✓ (OpenBAO credential storage/retrieval for all providers)
  • v1/internal/component/externaldns/types_test.go ✓ (93.3% coverage)
  • v1/internal/component/externaldns/install_test.go
  • v1/internal/component/externaldns/openbao_test.go ✓ (>80% coverage on all functions)

6. Velero

Working States:

  • Velero deployed via Helm
  • S3 backend configured (SeaweedFS)
  • Can create cluster backups (CLI: foundry backup create)
  • Can restore from backups (CLI: foundry backup restore)
  • Scheduled backup configured (CLI: foundry backup schedule)

Key Tasks:

  • Deploy Velero Helm chart ✓ Implemented (vmware-tanzu/velero chart v8.0.0)
  • Configure S3-compatible storage backend ✓ Supports SeaweedFS and AWS S3
  • Set up backup schedules ✓ Configurable cron schedules with TTL
  • Test backup and restore (requires integration tests)
  • Create backup CLI commandsfoundry backup create/list/restore/schedule/delete

Supported Features:

  • SeaweedFS and AWS S3 providers
  • Configurable backup retention (TTL)
  • Scheduled backups with cron expressions
  • Namespace inclusion/exclusion filters
  • CSI volume snapshots (optional)
  • ServiceMonitor for Prometheus metrics
  • Resource requests/limits configuration

Files:

  • v1/internal/component/velero/types.go
  • v1/internal/component/velero/install.go
  • v1/internal/component/velero/types_test.go ✓ (94.8% coverage)
  • v1/internal/component/velero/install_test.go

7. CLI Commands

Commands Implemented:

# Storage
foundry storage provision <name> --size 10Gi    ✓
foundry storage pvc list [--namespace NS]       ✓
foundry storage pvc delete <name># Backup
foundry backup create [--name NAME]             ✓
foundry backup list                             ✓
foundry backup restore <name>                   ✓
foundry backup schedule --cron "0 2 * * *"      ✓
foundry backup delete <name># Observability
foundry dashboard open                          ✓  # Opens Grafana
foundry dashboard url                           ✓  # Print Grafana URL
foundry logs <pod># Quick log access
foundry metrics <query># Query Prometheus
foundry metrics list                            ✓  # List available metrics
foundry metrics targets                         ✓  # List scrape targets

Files:

  • v1/cmd/foundry/commands/storage/provision.go
  • v1/cmd/foundry/commands/storage/pvc.go
  • v1/cmd/foundry/commands/storage/provision_test.go
  • v1/cmd/foundry/commands/backup/commands.go
  • v1/cmd/foundry/commands/backup/client.go
  • v1/cmd/foundry/commands/backup/create.go
  • v1/cmd/foundry/commands/backup/list.go
  • v1/cmd/foundry/commands/backup/restore.go
  • v1/cmd/foundry/commands/backup/schedule.go
  • v1/cmd/foundry/commands/backup/delete.go
  • v1/cmd/foundry/commands/backup/client_test.go
  • v1/cmd/foundry/commands/dashboard/commands.go
  • v1/cmd/foundry/commands/dashboard/commands_test.go
  • v1/cmd/foundry/commands/logs/commands.go
  • v1/cmd/foundry/commands/logs/commands_test.go
  • v1/cmd/foundry/commands/metrics/commands.go
  • v1/cmd/foundry/commands/metrics/commands_test.go

8. Integration Tests

Test Scenarios:

  • Deploy PVC and mount to pod
  • Backup cluster and restore
  • Metrics are scraped and visible in Grafana
  • Logs are collected and queryable in Loki
  • DNS records are created automatically

Files:

  • v1/test/integration/phase3_storage_test.go ✓ (PVC provisioning, SeaweedFS deployment)
  • v1/test/integration/phase3_observability_test.go ✓ (Prometheus, Loki, Grafana, full stack)
  • v1/test/integration/phase3_backup_test.go ✓ (Velero deployment, backup/restore)
  • v1/test/integration/phase3_helpers_test.go ✓ (Shared test utilities)

9. Documentation

Documents to Create/Update:

  • docs/storage.md - Full CSI driver setup, PVC usage ✓
  • docs/observability.md - Prometheus, Loki, Grafana usage ✓
  • docs/backups.md - Velero usage, restore procedures ✓
  • docs/dns.md - External-DNS configuration ✓

Phase 3 Completion Criteria

  • All components deploy successfully
  • Storage provisioning works
  • Metrics are collected and visible
  • Logs are collected and queryable
  • Grafana dashboards show system health
  • Backups can be created and restored
  • DNS records are managed automatically
  • All tests pass
  • Documentation complete

Manual Verification

# Install storage with Longhorn backend for distributed block storage
foundry component install storage --backend longhorn

# Install SeaweedFS for S3-compatible storage
foundry component install seaweedfs

# Install observability stack (depends on storage)
foundry component install prometheus
foundry component install loki
foundry component install grafana
foundry component install velero

# Test storage
foundry storage provision test-pvc --size 1Gi
kubectl get pvc

# Test backup
foundry backup create test-backup
foundry backup list
foundry backup restore test-backup

# Access Grafana
foundry dashboard open

Estimated Working States: ~20 testable states Estimated LOC: ~3000-4000 lines Timeline: Not time-bound