From 7e9dc41381b1405fe54f85753c837d5e24f9b024 Mon Sep 17 00:00:00 2001 From: Matt Knop Date: Mon, 3 Nov 2025 11:19:47 -0700 Subject: [PATCH 1/2] generated SOP docs for clowder --- docs/sop.md | 630 ++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 516 insertions(+), 114 deletions(-) diff --git a/docs/sop.md b/docs/sop.md index 448f942c0..31c365112 100644 --- a/docs/sop.md +++ b/docs/sop.md @@ -1,152 +1,554 @@ -# Operating Clowder +# Clowder Standard Operating Procedures (SOP) -Two primary aspects of operating Clowder: Operating the apps managed by Clowder and operating -Clowder itself. +This document provides standard operating procedures for managing, debugging, and releasing Clowder - the Red Hat Insights application configuration management operator for Kubernetes. -Clowder utilizes a common configuration format that is presented to each application, no matter -the environment it is running in, enabling a far easier development experience. It governs many -different aspects of an application's configuration from defining the port it should listen to for -its main web service, to metrics, kafka and others. When using Clowder, the burden of identifying -and defining dependency and core service credentials and connection information is removed. +## Table of Contents -## Operating Apps Managed by Clowder +1. [Architecture Overview](#architecture-overview) +2. [Debugging Procedures](#debugging-procedures) +3. [Release Procedures](#release-procedures) -### ClowdEnvironment +--- -**abbreviated to [env] in k8s** +## Architecture Overview -The ``ClowdEnvironment`` CRD is responsible for configuring key infrastruture services that the Clowder enabled -apps will interact with. It is a *cluster scoped* CRD and thus must have a unique name inside the -k8s cluster. 
For production environments it is usual to have only one ``ClowdEnvironment``, whereas -in other scenarios -- such as ephemeral testing -- Clowder enables the management of -multiple environments that operate completely independently from each other. +### High-Level Architecture -#### Providers +Clowder is a Kubernetes operator that manages application configuration and infrastructure dependencies for cloud-native applications. It consists of several key components: -An environment's specification is broken into **providers**, which govern the creation of services, e.g. Kafka topics, -object storage, etc, that applications may depend on. The ``ClowdEnvironment`` CRD configures these -providers principally by making use of a provider's **mode**. +#### Core Components -#### Modes +1. **Clowder Controller Manager** + - Main operator process that watches for CRD changes + - Reconciles ClowdApp and ClowdEnvironment resources + - Manages application lifecycle and configuration generation -Providers often operate in different modes. As an example the Kafka provider can operate in three -different modes. In *local* mode, the Kafka provider deploys a single node Kafka/Zookeeper instance -inside the cluster and configures it to auto-create the topics. In *operator* mode, the provider -assumes a Strimzi Kafka instance is present and will create ``KafkaTopic`` CRs to provide the -topics. In *app-interface* mode, no resources are deployed and it is assumed app-interface has -already created the requested topics. For more information on the configuration of each of these -providers and their modes, please see the relevant pages. +2. **Custom Resource Definitions (CRDs)** + - `ClowdEnvironment`: Cluster-scoped resource defining infrastructure providers + - `ClowdApp`: Namespace-scoped resource defining application specifications + - `ClowdJobInvocation`: Resource for managing job executions -#### Target Namespace +3. 
**Provider System** + - Modular architecture supporting different infrastructure modes + - Providers: Database, Kafka, Object Storage, Logging, Metrics, Web, etc. + - Each provider supports multiple modes (local, operator, app-interface) -Environmental resources, such as the Kafka/Zookeeper from the example in the *Modes* section, will -be placed in the ``ClowdEnvironment``'s target namespace. This is configured by setting the -``targetNamespace`` attribute of the ``ClowdEnvironment``. If it is omitted, a random target -namespace is generated instead. The name of this resource can be found by inspecting the -``status.targetNamespace`` of the ClowdEnvironment resource. +#### Data Flow -### ClowdApp +``` +ClowdApp → Controller → Provider Logic → K8s Resources → Application Config + ↓ ↓ ↓ ↓ ↓ + Spec Reconcile Infrastructure Deployments cdappconfig.json +``` -**abbreviated to [app] in k8s** +#### Key Concepts -The ``ClowdApp`` CRD is responsible for configuring an application and is namespace scoped. Any -resources Clowder creates on behalf of the application will reside in the same namespace that the -``ClowdApp`` resources is applied to. As such the ``ClowdApp`` name must be unique within a -particular namespace. An ``ClowdApp`` does not have to be placed in the ``ClowdEnvironment``'s -target namespace. +- **Environment Coupling**: ClowdApps are coupled to ClowdEnvironments via `envName` +- **Configuration Generation**: Apps receive standardized config via mounted secrets +- **Dependency Management**: Automatic service discovery and configuration injection +- **Provider Modes**: Flexible infrastructure provisioning strategies -A ``ClowdApp`` may define multiple services inside it. These services, though defined by a -specification that is very similar to the k8s pod specification, will be deployed as individual -deployment resources. 
Functionally, defining multiple applications in the same ``ClowdApp`` -specification allows the sharing of some infrastructure dependencies such as databases. -Applications in different ClowdApp's should not expect to be able to share databases. +### Deployment Architecture -A ``ClowdApp`` is coupled to a ``ClowdEnvironment`` by the use of the ``envName`` parameter of the -``ClowdApp``. When Clowder configures applications, it will point them to the resources that are -defined in the coupled ``ClowdEnvironment``. As an example, if a ``ClowdApp`` requires the use of a -Kafka topic, the application will be configured to use the kafka broker that has been configured in -the coupled ClowdEnvironment, which could be a local, strimzi or app-interface managed Kafka -instance. +Clowder is deployed via Operator Lifecycle Manager (OLM) with the following components: +- **OperatorGroup**: Defines operator scope and permissions +- **CatalogSource**: Points to operator bundle images +- **Subscription**: Manages operator installation and updates +- **ClusterServiceVersion**: Defines operator metadata and permissions -#### Dependencies +--- -An application will usually require several dependencies in the form of either infrastructure -services e.g. Kafka, or other application services such as RBAC. +## Debugging Procedures -Services such as RBAC will be other Clowder-managed applications and, as such, have an associated -``ClowdApp`` coupled to the ``ClowdEnvironment``. These are defined in the ``dependencies`` field of -the ``ClowdApp`` and take the form of the dependency's ``ClowdApp`` name. This will result in all of -the dependent services being listed in the application's configuration. If a dependent service -defines multiple pod specs with a web service exposed in its ``ClowdApp``, each of these will be -exposed to the requesting app. A ``ClowdApp`` will not be deployed if any of its service -dependencies do not exist within the coupled ``ClowdEnvironment``. 
+### Prerequisites -Infrastructure dependencies, such as Kafka topics and object bucket storage, are defined in the -``ClowdApp`` spec. More information on each of them is defined in the [API specification](https://redhatinsights.github.io/clowder/clowder/dev/api_reference.html#k8s-api-github-com-redhatinsights-clowder-apis-cloud-redhat-com-v1alpha1-clowdappspec). +Before debugging Clowder issues, ensure you have: +- `kubectl` access to the target cluster +- Appropriate RBAC permissions to view Clowder resources +- Access to cluster logs and metrics -#### Created Resources +### Common Issues and Troubleshooting -For each ``ClowdApp`` service, Clowder will create an ``apps.Deployment`` and a ``Service`` -resource. If the service has the ``web`` field set to true, the ``Service`` resource will -include a port definition for ``webPort`` as well as the standard ``metricsPort``. The actual values -of these are defined in the ``ClowdEnvironment`` by configuring the web and metric providers, -respectively. By default these are set to 8000 for the web service port and 9000 for the metrics -port. +#### 1. ClowdApp Not Deploying -Clowder will also set certain fields in the pod spec, inline with best practice, such as pull -policy, and anti-affinity. +**Symptoms:** +- ClowdApp resource exists but no deployments are created +- Application pods are not starting -Clowder creates a ``Secret`` resource that is named the same as the ``ClowdApp`` which will contain the generated configuration -for that app. This secret will be mounted at ``/cdappconfig.json`` and will be consumed by the app -to configure itself on startup. +**Debugging Steps:** -Secrets may also be created for application dependencies such as databases and in-memory db -services. +1. **Check ClowdApp Status:** + ```bash + kubectl get clowdapp -n -o yaml + kubectl describe clowdapp -n + ``` -## Operating Clowder Itself +2. 
**Verify ClowdEnvironment:** + ```bash + kubectl get clowdenvironment -o yaml + kubectl describe clowdenvironment + ``` -### OLM pipeline +3. **Check Controller Logs:** + ```bash + kubectl logs -n clowder-system deployment/clowder-controller-manager -f + ``` -Clowder is deployed via OLM, thus the build and deploy pipeline comprises of creating and deploying -OLM CRs: ``OperatorGroup``, ``CatalogSource``, and ``Subscription``. To truly understand OLM is -outside the scope of this document, but it will cover how each resource is managed. +4. **Common Causes:** + - Missing or invalid `envName` reference + - ClowdEnvironment not ready + - Missing dependencies in ClowdApp spec + - Resource quota exceeded in target namespace -Despite being deployed via OLM, Clowder follows a very similar build and deployment pipeline as -other apps in app-interface, specifically all pushes to master are automatically deployed to stage, -and an MR to app-interface is required to update the ref in production. +#### 2. Configuration Issues -``OperatorGroup`` and ``Subscription`` are quite static, but ``CatalogSource`` is what gets updated -every promotion. Before it's updated, there are three images that are pushed to Quay: the Clowder -application image, the Clowder OLM bundle, and the Clowder catalog image. All three images use the -same image tag, based off the commit hash at the tip of master. The app image is built using -``build_deploy.sh``, and the bundle and catalog images are built in a separate Jenkins job using -``build_catalog.sh``. +**Symptoms:** +- Applications starting but failing to connect to services +- Missing configuration values in cdappconfig.json -#### Troubleshooting +**Debugging Steps:** -On occasion, updating the ``CatalogSource`` does not trigger OLM to deploy the latest version of -Clowder. If this happens, the simplest approach is to delete the ``ClusterServiceVersion`` and -``Subscription`` resources with the name ``clowder`` from the ``clowder`` namespace. 
Once they are -removed, you should re-run the saas-deploy job for clowder, which will recreate the -``Subscription``, which should trigger OLM to recreate the ``ClusterServiceVersion``. - -##### Metrics and alerts +1. **Check Generated Configuration:** + ```bash + kubectl get secret -n -o jsonpath='{.data.cdappconfig\.json}' | base64 -d | jq + ``` -##### App-interface modes +2. **Verify Provider Configuration:** + ```bash + kubectl get clowdenvironment -o jsonpath='{.spec.providers}' | jq + ``` -##### Promoting clowder to prod +3. **Check Provider Status:** + ```bash + kubectl describe clowdenvironment + ``` -As stated above, promoting Clowder to production is done the same as any other app in app-interface, -but there are additional considerations given how Clowder code changes could cause widespread -rollouts across the target cluster. For example, if a field is added to every app's -``cdappconfig.json``, this will trigger every deployment to rollout a new version at virtually the -same time. While this *shouldn't* cause a problem, promoters should be aware that such churn is -going to happen before promoting. +#### 3. Operator Not Responding -Another more disruptive example would be if the format of the name of services was changed. Not -only would this trigger a rollout of all deployments, but old pods would no longer function properly -because the old hostname in their configuration is no longer valid. A change like this should -either be done in a backwards-compatible way or be done in a planned outage window. +**Symptoms:** +- Changes to ClowdApp/ClowdEnvironment not being processed +- Controller manager pod crashing or restarting -Despite those two examples, most changes to Clowder should not be very disruptive; just make sure -that extra care is taken to review all changes before promoting to production. +**Debugging Steps:** + +1. 
**Check Operator Health:** + ```bash + kubectl get pods -n clowder-system + kubectl describe pod -n clowder-system -l app.kubernetes.io/name=clowder + ``` + +2. **Review Controller Logs:** + ```bash + kubectl logs -n clowder-system deployment/clowder-controller-manager --previous + ``` + +3. **Check Resource Usage:** + ```bash + kubectl top pods -n clowder-system + kubectl describe node + ``` + +4. **Restart Controller:** + ```bash + kubectl rollout restart deployment/clowder-controller-manager -n clowder-system + ``` + +#### 4. OLM Installation Issues + +**Symptoms:** +- Clowder operator not installing via OLM +- CSV in failed state + +**Debugging Steps:** + +1. **Check OLM Resources:** + ```bash + kubectl get csv -n clowder-system + kubectl get subscription -n clowder-system + kubectl get catalogsource -n clowder-system + ``` + +2. **Review CSV Status:** + ```bash + kubectl describe csv clowder.v -n clowder-system + ``` + +3. **Check OLM Operator Logs:** + ```bash + kubectl logs -n olm deployment/olm-operator + kubectl logs -n olm deployment/catalog-operator + ``` + +4. **Force Reinstall:** + ```bash + kubectl delete csv clowder.v -n clowder-system + kubectl delete subscription clowder -n clowder-system + # Re-run saas-deploy job + ``` + +#### 5. Performance Issues + +**Symptoms:** +- Slow reconciliation times +- High memory/CPU usage +- Timeouts during resource creation + +**Debugging Steps:** + +1. **Monitor Resource Usage:** + ```bash + kubectl top pods -n clowder-system + kubectl describe pod -n clowder-system + ``` + +2. **Check Reconciliation Metrics:** + ```bash + # Access Prometheus metrics endpoint + kubectl port-forward -n clowder-system svc/clowder-controller-manager-metrics-service 8080:8080 + curl http://localhost:8080/metrics | grep controller_runtime + ``` + +3. 
**Review Controller Configuration:** + ```bash + kubectl get configmap clowder-config -n clowder-system -o yaml + ``` + +### Log Analysis + +#### Controller Manager Logs + +Key log patterns to look for: + +- **Reconciliation Errors:** + ``` + ERROR controller-runtime.manager.controller.clowdapp Reconciler error + ``` + +- **Provider Failures:** + ``` + ERROR providers. Failed to reconcile provider + ``` + +- **Resource Creation Issues:** + ``` + ERROR controllers.ClowdApp unable to create deployment + ``` + +#### Useful Log Commands + +```bash +# Follow controller logs with filtering +kubectl logs -n clowder-system deployment/clowder-controller-manager -f | grep ERROR + +# Get logs for specific ClowdApp reconciliation +kubectl logs -n clowder-system deployment/clowder-controller-manager | grep "clowdapp/" + +# Export logs for analysis +kubectl logs -n clowder-system deployment/clowder-controller-manager --since=1h > clowder-logs.txt +``` + +### Emergency Procedures + +#### Complete Operator Reset + +**⚠️ WARNING: This will cause downtime for all managed applications** + +1. **Scale down controller:** + ```bash + kubectl scale deployment clowder-controller-manager --replicas=0 -n clowder-system + ``` + +2. **Clean up stuck resources:** + ```bash + kubectl patch clowdapp -n --type merge -p '{"metadata":{"finalizers":[]}}' + ``` + +3. 
**Restart operator:** + ```bash + kubectl scale deployment clowder-controller-manager --replicas=1 -n clowder-system + ``` + +#### Cluster-wide Resource Cleanup + +```bash +# List all Clowder resources +kubectl get clowdapps --all-namespaces +kubectl get clowdenvironments + +# Force delete stuck resources (use with caution) +kubectl patch clowdenvironment --type merge -p '{"metadata":{"finalizers":[]}}' +``` + +--- + +## Release Procedures + +### Release Types + +Clowder follows semantic versioning (SemVer) with the following release types: + +- **Patch Release (x.y.Z)**: Bug fixes, security patches, minor improvements +- **Minor Release (x.Y.z)**: New features, API additions, backward-compatible changes +- **Major Release (X.y.z)**: Breaking changes, API modifications, major architectural updates + +### Pre-Release Checklist + +Before initiating a release, ensure: + +- [ ] All planned features/fixes are merged to `main` branch +- [ ] CI/CD pipeline is passing on `main` branch +- [ ] E2E tests are passing in staging environment +- [ ] Documentation is updated for new features +- [ ] Breaking changes are documented in migration guide +- [ ] Security scan results are reviewed and approved +- [ ] Performance regression tests are passing + +### Release Process + +#### 1. Prepare Release Branch + +```bash +# Create release branch from main +git checkout main +git pull origin main +git checkout -b release/v + +# Update version in relevant files +# - Update VERSION file +# - Update operator bundle manifests +# - Update documentation references +``` + +#### 2. Generate Release Notes + +```bash +# Generate changelog since last release +git log --oneline --no-merges v..HEAD + +# Create release notes including: +# - New features and enhancements +# - Bug fixes +# - Breaking changes +# - Known issues +# - Upgrade instructions +``` + +#### 3. 
Build and Test Release Candidate + +```bash +# Build release candidate images +make docker-build IMG=quay.io/cloudservices/clowder:v-rc1 +make bundle-build BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v-rc1 + +# Push release candidate images +make docker-push IMG=quay.io/cloudservices/clowder:v-rc1 +make bundle-push BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v-rc1 + +# Deploy to staging environment for testing +# Run comprehensive test suite +make test-e2e +``` + +#### 4. Create Release Tag + +```bash +# Tag the release +git tag -a v -m "Release v" +git push origin v + +# Create GitHub release +# - Upload release artifacts +# - Include release notes +# - Mark as pre-release if RC +``` + +#### 5. Build Production Images + +```bash +# Build final release images +make docker-build IMG=quay.io/cloudservices/clowder:v +make bundle-build BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v +make catalog-build CATALOG_IMG=quay.io/cloudservices/clowder-catalog:v + +# Push production images +make docker-push IMG=quay.io/cloudservices/clowder:v +make bundle-push BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v +make catalog-push CATALOG_IMG=quay.io/cloudservices/clowder-catalog:v +``` + +#### 6. Deploy to Staging + +```bash +# Update staging environment +# - Update CatalogSource with new catalog image +# - Monitor deployment health +# - Run smoke tests +# - Validate application functionality +``` + +#### 7. Production Deployment + +**⚠️ Production deployments require additional approvals and coordination** + +1. **Create App-Interface MR:** + ```yaml + # Update saas file with new image references + resourceTemplates: + - name: clowder-catalog + targets: + - namespace: clowder-system + ref: # Update this + ``` + +2. **Coordinate Deployment:** + - Schedule deployment window + - Notify stakeholders + - Prepare rollback plan + - Monitor cluster capacity + +3. 
**Execute Deployment:** + ```bash + # Merge app-interface MR + # Monitor OLM deployment + kubectl get csv -n clowder-system -w + + # Verify operator health + kubectl get pods -n clowder-system + kubectl logs -n clowder-system deployment/clowder-controller-manager + ``` + +4. **Post-Deployment Validation:** + - Verify all ClowdApps are reconciling + - Check application configurations + - Monitor error rates and performance + - Validate new features (if applicable) + +### Rollback Procedures + +#### Emergency Rollback + +If critical issues are discovered post-deployment: + +1. **Immediate Rollback:** + ```bash + # Revert to previous catalog image + kubectl patch catalogsource clowder-catalog -n clowder-system \ + --type merge -p '{"spec":{"image":"quay.io/cloudservices/clowder-catalog:v"}}' + + # Force CSV recreation + kubectl delete csv clowder.v -n clowder-system + ``` + +2. **Monitor Rollback:** + ```bash + # Watch operator rollback + kubectl get csv -n clowder-system -w + kubectl get pods -n clowder-system -w + ``` + +3. **Validate Rollback:** + - Verify operator is running previous version + - Check ClowdApp reconciliation + - Validate application functionality + +#### Planned Rollback + +For planned rollbacks (e.g., during maintenance): + +1. Create app-interface MR reverting image references +2. Follow standard deployment process +3. Communicate changes to stakeholders + +### Post-Release Activities + +#### 1. Update Documentation + +- [ ] Update API reference documentation +- [ ] Refresh user guides and tutorials +- [ ] Update migration guides +- [ ] Publish release blog post (if major release) + +#### 2. Monitor Release Health + +```bash +# Monitor key metrics for 24-48 hours +# - Reconciliation success rate +# - Error rates in logs +# - Resource utilization +# - Application deployment success + +# Set up alerts for: +# - Controller restart loops +# - High error rates +# - Performance degradation +``` + +#### 3. 
Gather Feedback + +- Monitor support channels for issues +- Review user feedback and bug reports +- Track adoption metrics +- Plan hotfix releases if needed + +### Hotfix Release Process + +For critical bug fixes that cannot wait for the next regular release: + +1. **Create Hotfix Branch:** + ```bash + git checkout v + git checkout -b hotfix/v-hotfix1 + ``` + +2. **Apply Minimal Fix:** + - Cherry-pick specific commits + - Avoid unnecessary changes + - Update version to patch level + +3. **Fast-Track Testing:** + - Focus on regression testing + - Validate fix effectiveness + - Skip non-critical test suites + +4. **Expedited Deployment:** + - Follow abbreviated release process + - Coordinate with stakeholders + - Monitor closely post-deployment + +### Release Metrics and KPIs + +Track the following metrics for release quality: + +- **Lead Time**: Time from feature complete to production +- **Deployment Frequency**: How often releases are deployed +- **Mean Time to Recovery (MTTR)**: Time to recover from failures +- **Change Failure Rate**: Percentage of releases causing issues +- **Rollback Rate**: Percentage of releases requiring rollback + +### Release Calendar + +Maintain a regular release schedule: + +- **Major Releases**: Quarterly (every 3 months) +- **Minor Releases**: Monthly +- **Patch Releases**: As needed (typically bi-weekly) +- **Hotfix Releases**: Emergency only + +### Communication Plan + +For each release: + +1. **Pre-Release (1 week before):** + - Announce upcoming release + - Share release notes draft + - Coordinate with dependent teams + +2. **Release Day:** + - Announce release completion + - Share final release notes + - Provide support contact information + +3. 
**Post-Release (1 week after):** + - Share adoption metrics + - Address any issues or feedback + - Plan next release cycle From b7fc7cb8f7fdb0951c958dce4030b5bbcc760aa7 Mon Sep 17 00:00:00 2001 From: Matt Knop Date: Wed, 5 Nov 2025 10:59:32 -0700 Subject: [PATCH 2/2] adding toc --- docs/sop.md | 614 ++++++++++++---------------------------------------- 1 file changed, 135 insertions(+), 479 deletions(-) diff --git a/docs/sop.md b/docs/sop.md index 31c365112..2b736bcc8 100644 --- a/docs/sop.md +++ b/docs/sop.md @@ -1,20 +1,42 @@ -# Clowder Standard Operating Procedures (SOP) - -This document provides standard operating procedures for managing, debugging, and releasing Clowder - the Red Hat Insights application configuration management operator for Kubernetes. +# Operating Clowder ## Table of Contents -1. [Architecture Overview](#architecture-overview) -2. [Debugging Procedures](#debugging-procedures) -3. [Release Procedures](#release-procedures) +1. [What is Clowder?](#what-is-clowder) +2. [Architecture Overview](#architecture-overview) +3. [Operating Apps Managed by Clowder](#operating-apps-managed-by-clowder) --- +## What is Clowder? + +Clowder is a Kubernetes operator designed to simplify application configuration management and infrastructure dependency provisioning for cloud-native applications. Originally developed for Red Hat Insights, Clowder abstracts away the complexity of managing application configurations across different environments and deployment scenarios. + +### Key Benefits + +**Simplified Configuration Management**: Clowder eliminates the need for applications to manage environment-specific configurations by providing a standardized configuration format (`cdappconfig.json`) that contains all necessary connection details, credentials, and service endpoints. + +**Infrastructure Abstraction**: Applications no longer need to know whether they're connecting to a local development database, a managed cloud service, or an operator-managed instance. 
Clowder handles the complexity of different infrastructure providers through its modular provider system. + +**Environment Consistency**: The same application code can run unchanged across development, staging, and production environments. Clowder ensures that applications receive the appropriate configuration for their target environment without code changes. + +**Dependency Management**: Clowder automatically manages service dependencies, ensuring that applications have access to required infrastructure services (databases, message queues, object storage) and other application services they depend on. + +### How Clowder Works + +Clowder utilizes a common configuration format that is presented to each application, no matter the environment it is running in, enabling a far easier development experience. It governs many aspects of an application's configuration, from the port its main web service should listen on to metrics, Kafka, and others. When using Clowder, the burden of identifying and defining dependency and core service credentials and connection information is removed. + +The operator watches for changes to `ClowdApp` and `ClowdEnvironment` custom resources and automatically provisions the necessary infrastructure and generates appropriate configurations for each application. + +### Operating Clowder + +There are two primary aspects of operating Clowder: operating the apps managed by Clowder and operating Clowder itself. + ## Architecture Overview ### High-Level Architecture -Clowder is a Kubernetes operator that manages application configuration and infrastructure dependencies for cloud-native applications. It consists of several key components: +Clowder is a Kubernetes operator that manages application configuration and infrastructure dependencies for cloud-native applications. The operator follows the standard Kubernetes controller pattern, continuously reconciling desired state with actual state. 
#### Core Components @@ -22,6 +44,7 @@ Clowder is a Kubernetes operator that manages application configuration and infr - Main operator process that watches for CRD changes - Reconciles ClowdApp and ClowdEnvironment resources - Manages application lifecycle and configuration generation + - Runs as a deployment in the `clowder-system` namespace 2. **Custom Resource Definitions (CRDs)** - `ClowdEnvironment`: Cluster-scoped resource defining infrastructure providers @@ -33,6 +56,10 @@ Clowder is a Kubernetes operator that manages application configuration and infr - Providers: Database, Kafka, Object Storage, Logging, Metrics, Web, etc. - Each provider supports multiple modes (local, operator, app-interface) +4. **Webhooks** + - Validation webhooks for ClowdApp resource validation + - Mutation webhooks for pod injection and configuration + #### Data Flow ``` @@ -41,514 +68,143 @@ ClowdApp → Controller → Provider Logic → K8s Resources → Application Con Spec Reconcile Infrastructure Deployments cdappconfig.json ``` -#### Key Concepts - -- **Environment Coupling**: ClowdApps are coupled to ClowdEnvironments via `envName` -- **Configuration Generation**: Apps receive standardized config via mounted secrets -- **Dependency Management**: Automatic service discovery and configuration injection -- **Provider Modes**: Flexible infrastructure provisioning strategies - -### Deployment Architecture - -Clowder is deployed via Operator Lifecycle Manager (OLM) with the following components: -- **OperatorGroup**: Defines operator scope and permissions -- **CatalogSource**: Points to operator bundle images -- **Subscription**: Manages operator installation and updates -- **ClusterServiceVersion**: Defines operator metadata and permissions - ---- - -## Debugging Procedures - -### Prerequisites - -Before debugging Clowder issues, ensure you have: -- `kubectl` access to the target cluster -- Appropriate RBAC permissions to view Clowder resources -- Access to cluster logs and metrics - 
-### Common Issues and Troubleshooting - -#### 1. ClowdApp Not Deploying - -**Symptoms:** -- ClowdApp resource exists but no deployments are created -- Application pods are not starting - -**Debugging Steps:** - -1. **Check ClowdApp Status:** - ```bash - kubectl get clowdapp -n -o yaml - kubectl describe clowdapp -n - ``` +#### Deployment Architecture -2. **Verify ClowdEnvironment:** - ```bash - kubectl get clowdenvironment -o yaml - kubectl describe clowdenvironment - ``` +Clowder is deployed as a standard Kubernetes operator with the following components: -3. **Check Controller Logs:** - ```bash - kubectl logs -n clowder-system deployment/clowder-controller-manager -f - ``` +- **Controller Manager Deployment**: Main operator workload +- **Custom Resource Definitions**: API definitions for Clowder resources +- **RBAC**: Service accounts, cluster roles, and role bindings for operator permissions +- **Webhooks**: Validation and mutation webhooks with TLS certificates +- **ConfigMaps**: Operator configuration and feature flags +- **Services**: Webhook services and metrics endpoints -4. **Common Causes:** - - Missing or invalid `envName` reference - - ClowdEnvironment not ready - - Missing dependencies in ClowdApp spec - - Resource quota exceeded in target namespace +#### Configuration Management -#### 2. Configuration Issues +Clowder generates a standardized configuration format (`cdappconfig.json`) that contains: +- Database connection information +- Kafka broker details and topic configurations +- Object storage bucket credentials +- Service endpoints for dependencies +- Logging and metrics configuration +- Feature flags and environment-specific settings -**Symptoms:** -- Applications starting but failing to connect to services -- Missing configuration values in cdappconfig.json - -**Debugging Steps:** +This configuration is mounted as a secret in each application pod at `/cdappconfig.json`. -1. 
**Check Generated Configuration:** - ```bash - kubectl get secret -n -o jsonpath='{.data.cdappconfig\.json}' | base64 -d | jq - ``` - -2. **Verify Provider Configuration:** - ```bash - kubectl get clowdenvironment -o jsonpath='{.spec.providers}' | jq - ``` +## Operating Apps Managed by Clowder -3. **Check Provider Status:** - ```bash - kubectl describe clowdenvironment - ``` +### ClowdEnvironment -#### 3. Operator Not Responding +**abbreviated to [env] in k8s** -**Symptoms:** -- Changes to ClowdApp/ClowdEnvironment not being processed -- Controller manager pod crashing or restarting +The ``ClowdEnvironment`` CRD is responsible for configuring key infrastructure services that the Clowder-enabled +apps will interact with. It is a *cluster scoped* CRD and thus must have a unique name inside the +k8s cluster. For production environments it is usual to have only one ``ClowdEnvironment``, whereas +in other scenarios -- such as ephemeral testing -- Clowder enables the management of +multiple environments that operate completely independently from each other. -**Debugging Steps:** +#### Providers -1. **Check Operator Health:** - ```bash - kubectl get pods -n clowder-system - kubectl describe pod -n clowder-system -l app.kubernetes.io/name=clowder - ``` +An environment's specification is broken into **providers**, which govern the creation of services, e.g. Kafka topics, +object storage, etc., that applications may depend on. The ``ClowdEnvironment`` CRD configures these +providers principally by making use of a provider's **mode**. -2. **Review Controller Logs:** - ```bash - kubectl logs -n clowder-system deployment/clowder-controller-manager --previous - ``` +#### Modes -3. **Check Resource Usage:** - ```bash - kubectl top pods -n clowder-system - kubectl describe node - ``` +Providers often operate in different modes. As an example, the Kafka provider can operate in three +different modes. 
In *local* mode, the Kafka provider deploys a single node Kafka/Zookeeper instance +inside the cluster and configures it to auto-create the topics. In *operator* mode, the provider +assumes a Strimzi Kafka instance is present and will create ``KafkaTopic`` CRs to provide the +topics. In *app-interface* mode, no resources are deployed and it is assumed app-interface has +already created the requested topics. For more information on the configuration of each of these +providers and their modes, please see the relevant pages. -4. **Restart Controller:** - ```bash - kubectl rollout restart deployment/clowder-controller-manager -n clowder-system - ``` +#### Target Namespace -#### 4. OLM Installation Issues +Environmental resources, such as the Kafka/Zookeeper from the example in the *Modes* section, will +be placed in the ``ClowdEnvironment``'s target namespace. This is configured by setting the +``targetNamespace`` attribute of the ``ClowdEnvironment``. If it is omitted, a random target +namespace is generated instead. The name of this resource can be found by inspecting the +``status.targetNamespace`` of the ClowdEnvironment resource. -**Symptoms:** -- Clowder operator not installing via OLM -- CSV in failed state +### ClowdApp -**Debugging Steps:** - -1. **Check OLM Resources:** - ```bash - kubectl get csv -n clowder-system - kubectl get subscription -n clowder-system - kubectl get catalogsource -n clowder-system - ``` - -2. **Review CSV Status:** - ```bash - kubectl describe csv clowder.v -n clowder-system - ``` - -3. **Check OLM Operator Logs:** - ```bash - kubectl logs -n olm deployment/olm-operator - kubectl logs -n olm deployment/catalog-operator - ``` - -4. **Force Reinstall:** - ```bash - kubectl delete csv clowder.v -n clowder-system - kubectl delete subscription clowder -n clowder-system - # Re-run saas-deploy job - ``` - -#### 5. 
Performance Issues - -**Symptoms:** -- Slow reconciliation times -- High memory/CPU usage -- Timeouts during resource creation - -**Debugging Steps:** - -1. **Monitor Resource Usage:** - ```bash - kubectl top pods -n clowder-system - kubectl describe pod -n clowder-system - ``` - -2. **Check Reconciliation Metrics:** - ```bash - # Access Prometheus metrics endpoint - kubectl port-forward -n clowder-system svc/clowder-controller-manager-metrics-service 8080:8080 - curl http://localhost:8080/metrics | grep controller_runtime - ``` - -3. **Review Controller Configuration:** - ```bash - kubectl get configmap clowder-config -n clowder-system -o yaml - ``` - -### Log Analysis - -#### Controller Manager Logs - -Key log patterns to look for: - -- **Reconciliation Errors:** - ``` - ERROR controller-runtime.manager.controller.clowdapp Reconciler error - ``` - -- **Provider Failures:** - ``` - ERROR providers. Failed to reconcile provider - ``` - -- **Resource Creation Issues:** - ``` - ERROR controllers.ClowdApp unable to create deployment - ``` - -#### Useful Log Commands - -```bash -# Follow controller logs with filtering -kubectl logs -n clowder-system deployment/clowder-controller-manager -f | grep ERROR - -# Get logs for specific ClowdApp reconciliation -kubectl logs -n clowder-system deployment/clowder-controller-manager | grep "clowdapp/" - -# Export logs for analysis -kubectl logs -n clowder-system deployment/clowder-controller-manager --since=1h > clowder-logs.txt -``` - -### Emergency Procedures - -#### Complete Operator Reset - -**⚠️ WARNING: This will cause downtime for all managed applications** - -1. **Scale down controller:** - ```bash - kubectl scale deployment clowder-controller-manager --replicas=0 -n clowder-system - ``` +**abbreviated to [app] in k8s** -2. 
**Clean up stuck resources:** - ```bash - kubectl patch clowdapp -n --type merge -p '{"metadata":{"finalizers":[]}}' - ``` +The ``ClowdApp`` CRD is responsible for configuring an application and is namespace scoped. Any +resources Clowder creates on behalf of the application will reside in the same namespace that the +``ClowdApp`` resource is applied to. As such the ``ClowdApp`` name must be unique within a +particular namespace. A ``ClowdApp`` does not have to be placed in the ``ClowdEnvironment``'s +target namespace. -3. **Restart operator:** - ```bash - kubectl scale deployment clowder-controller-manager --replicas=1 -n clowder-system - ``` - -#### Cluster-wide Resource Cleanup - -```bash -# List all Clowder resources -kubectl get clowdapps --all-namespaces -kubectl get clowdenvironments - -# Force delete stuck resources (use with caution) -kubectl patch clowdenvironment --type merge -p '{"metadata":{"finalizers":[]}}' -``` +A ``ClowdApp`` may define multiple services inside it. These services, though defined by a +specification that is very similar to the k8s pod specification, will be deployed as individual +deployment resources. Functionally, defining multiple applications in the same ``ClowdApp`` +specification allows the sharing of some infrastructure dependencies such as databases. +Applications in different ``ClowdApp`` resources should not expect to be able to share databases.
---- - -## Release Procedures - -### Release Types - -Clowder follows semantic versioning (SemVer) with the following release types: - -- **Patch Release (x.y.Z)**: Bug fixes, security patches, minor improvements -- **Minor Release (x.Y.z)**: New features, API additions, backward-compatible changes -- **Major Release (X.y.z)**: Breaking changes, API modifications, major architectural updates - -### Pre-Release Checklist - -Before initiating a release, ensure: - -- [ ] All planned features/fixes are merged to `main` branch -- [ ] CI/CD pipeline is passing on `main` branch -- [ ] E2E tests are passing in staging environment -- [ ] Documentation is updated for new features -- [ ] Breaking changes are documented in migration guide -- [ ] Security scan results are reviewed and approved -- [ ] Performance regression tests are passing - -### Release Process - -#### 1. Prepare Release Branch - -```bash -# Create release branch from main -git checkout main -git pull origin main -git checkout -b release/v - -# Update version in relevant files -# - Update VERSION file -# - Update operator bundle manifests -# - Update documentation references -``` - -#### 2. Generate Release Notes - -```bash -# Generate changelog since last release -git log --oneline --no-merges v..HEAD - -# Create release notes including: -# - New features and enhancements -# - Bug fixes -# - Breaking changes -# - Known issues -# - Upgrade instructions -``` - -#### 3. Build and Test Release Candidate - -```bash -# Build release candidate images -make docker-build IMG=quay.io/cloudservices/clowder:v-rc1 -make bundle-build BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v-rc1 - -# Push release candidate images -make docker-push IMG=quay.io/cloudservices/clowder:v-rc1 -make bundle-push BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v-rc1 - -# Deploy to staging environment for testing -# Run comprehensive test suite -make test-e2e -``` - -#### 4. 
Create Release Tag - -```bash -# Tag the release -git tag -a v -m "Release v" -git push origin v - -# Create GitHub release -# - Upload release artifacts -# - Include release notes -# - Mark as pre-release if RC -``` - -#### 5. Build Production Images - -```bash -# Build final release images -make docker-build IMG=quay.io/cloudservices/clowder:v -make bundle-build BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v -make catalog-build CATALOG_IMG=quay.io/cloudservices/clowder-catalog:v - -# Push production images -make docker-push IMG=quay.io/cloudservices/clowder:v -make bundle-push BUNDLE_IMG=quay.io/cloudservices/clowder-bundle:v -make catalog-push CATALOG_IMG=quay.io/cloudservices/clowder-catalog:v -``` - -#### 6. Deploy to Staging - -```bash -# Update staging environment -# - Update CatalogSource with new catalog image -# - Monitor deployment health -# - Run smoke tests -# - Validate application functionality -``` - -#### 7. Production Deployment - -**⚠️ Production deployments require additional approvals and coordination** - -1. **Create App-Interface MR:** - ```yaml - # Update saas file with new image references - resourceTemplates: - - name: clowder-catalog - targets: - - namespace: clowder-system - ref: # Update this - ``` - -2. **Coordinate Deployment:** - - Schedule deployment window - - Notify stakeholders - - Prepare rollback plan - - Monitor cluster capacity - -3. **Execute Deployment:** - ```bash - # Merge app-interface MR - # Monitor OLM deployment - kubectl get csv -n clowder-system -w - - # Verify operator health - kubectl get pods -n clowder-system - kubectl logs -n clowder-system deployment/clowder-controller-manager - ``` - -4. **Post-Deployment Validation:** - - Verify all ClowdApps are reconciling - - Check application configurations - - Monitor error rates and performance - - Validate new features (if applicable) - -### Rollback Procedures - -#### Emergency Rollback - -If critical issues are discovered post-deployment: - -1. 
**Immediate Rollback:** - ```bash - # Revert to previous catalog image - kubectl patch catalogsource clowder-catalog -n clowder-system \ - --type merge -p '{"spec":{"image":"quay.io/cloudservices/clowder-catalog:v"}}' - - # Force CSV recreation - kubectl delete csv clowder.v -n clowder-system - ``` - -2. **Monitor Rollback:** - ```bash - # Watch operator rollback - kubectl get csv -n clowder-system -w - kubectl get pods -n clowder-system -w - ``` - -3. **Validate Rollback:** - - Verify operator is running previous version - - Check ClowdApp reconciliation - - Validate application functionality - -#### Planned Rollback - -For planned rollbacks (e.g., during maintenance): - -1. Create app-interface MR reverting image references -2. Follow standard deployment process -3. Communicate changes to stakeholders - -### Post-Release Activities - -#### 1. Update Documentation - -- [ ] Update API reference documentation -- [ ] Refresh user guides and tutorials -- [ ] Update migration guides -- [ ] Publish release blog post (if major release) - -#### 2. Monitor Release Health - -```bash -# Monitor key metrics for 24-48 hours -# - Reconciliation success rate -# - Error rates in logs -# - Resource utilization -# - Application deployment success - -# Set up alerts for: -# - Controller restart loops -# - High error rates -# - Performance degradation -``` +A ``ClowdApp`` is coupled to a ``ClowdEnvironment`` by the use of the ``envName`` parameter of the +``ClowdApp``. When Clowder configures applications, it will point them to the resources that are +defined in the coupled ``ClowdEnvironment``. As an example, if a ``ClowdApp`` requires the use of a +Kafka topic, the application will be configured to use the Kafka broker that has been configured in +the coupled ``ClowdEnvironment``, which could be a local, Strimzi or app-interface managed Kafka +instance. -#### 3.
Gather Feedback +#### Dependencies -- Monitor support channels for issues -- Review user feedback and bug reports -- Track adoption metrics -- Plan hotfix releases if needed +An application will usually require several dependencies in the form of either infrastructure +services e.g. Kafka, or other application services such as RBAC. -### Hotfix Release Process +Services such as RBAC will be other Clowder-managed applications and, as such, have an associated +``ClowdApp`` coupled to the ``ClowdEnvironment``. These are defined in the ``dependencies`` field of +the ``ClowdApp`` and take the form of the dependency's ``ClowdApp`` name. This will result in all of +the dependent services being listed in the application's configuration. If a dependent service +defines multiple pod specs with a web service exposed in its ``ClowdApp``, each of these will be +exposed to the requesting app. A ``ClowdApp`` will not be deployed if any of its service +dependencies do not exist within the coupled ``ClowdEnvironment``. -For critical bug fixes that cannot wait for the next regular release: +Infrastructure dependencies, such as Kafka topics and object bucket storage, are defined in the +``ClowdApp`` spec. More information on each of them is defined in the [API specification](https://redhatinsights.github.io/clowder/clowder/dev/api_reference.html#k8s-api-github-com-redhatinsights-clowder-apis-cloud-redhat-com-v1alpha1-clowdappspec). -1. **Create Hotfix Branch:** - ```bash - git checkout v - git checkout -b hotfix/v-hotfix1 - ``` +#### Created Resources -2. **Apply Minimal Fix:** - - Cherry-pick specific commits - - Avoid unnecessary changes - - Update version to patch level +For each ``ClowdApp`` service, Clowder will create an ``apps.Deployment`` and a ``Service`` +resource. If the service has the ``web`` field set to true, the ``Service`` resource will +include a port definition for ``webPort`` as well as the standard ``metricsPort``. 
The actual values +of these are defined in the ``ClowdEnvironment`` by configuring the web and metric providers, +respectively. By default these are set to 8000 for the web service port and 9000 for the metrics +port. -3. **Fast-Track Testing:** - - Focus on regression testing - - Validate fix effectiveness - - Skip non-critical test suites +Clowder will also set certain fields in the pod spec, in line with best practice, such as pull +policy and anti-affinity. -4. **Expedited Deployment:** - - Follow abbreviated release process - - Coordinate with stakeholders - - Monitor closely post-deployment +Clowder creates a ``Secret`` resource that is named the same as the ``ClowdApp`` which will contain the generated configuration +for that app. This secret will be mounted at ``/cdappconfig.json`` and will be consumed by the app +to configure itself on startup. -### Release Metrics and KPIs +Secrets may also be created for application dependencies such as databases and in-memory db +services. -Track the following metrics for release quality: -- **Lead Time**: Time from feature complete to production -- **Deployment Frequency**: How often releases are deployed -- **Mean Time to Recovery (MTTR)**: Time to recover from failures -- **Change Failure Rate**: Percentage of releases causing issues -- **Rollback Rate**: Percentage of releases requiring rollback -### Release Calendar -Maintain a regular release schedule: -- **Major Releases**: Quarterly (every 3 months) -- **Minor Releases**: Monthly -- **Patch Releases**: As needed (typically bi-weekly) -- **Hotfix Releases**: Emergency only -### Communication Plan +The deployment process involves building and pushing the Clowder application image to Quay. The image uses a tag based on the commit hash at the tip of master. The app image is built using ``build_deploy.sh``. -For each release: +#### Promoting Clowder to prod -1.
**Pre-Release (1 week before):** - - Announce upcoming release - - Share release notes draft - - Coordinate with dependent teams +As stated above, promoting Clowder to production is done the same way as any other app in app-interface, +but there are additional considerations given how Clowder code changes could cause widespread +rollouts across the target cluster. For example, if a field is added to every app's +``cdappconfig.json``, this will trigger every deployment to roll out a new version at virtually the +same time. While this *shouldn't* cause a problem, promoters should be aware that such churn is +going to happen before promoting. -2. **Release Day:** - - Announce release completion - - Share final release notes - - Provide support contact information +Another more disruptive example would be if the format of the name of services was changed. Not +only would this trigger a rollout of all deployments, but old pods would no longer function properly +because the old hostname in their configuration is no longer valid. A change like this should +either be done in a backwards-compatible way or be done in a planned outage window. -3. **Post-Release (1 week after):** - - Share adoption metrics - - Address any issues or feedback - - Plan next release cycle +Despite those two examples, most changes to Clowder should not be very disruptive; just make sure +that extra care is taken to review all changes before promoting to production.