From c09452686eab3baba864667131902ec514de3511 Mon Sep 17 00:00:00 2001 From: Ilona Shishov Date: Tue, 5 May 2026 14:42:44 +0300 Subject: [PATCH 1/5] Add manual secret rotation policy and DCR key migration script Documents rotation procedures for all four secrets (RED_HAT_SSO_CLIENT_SECRET, GMA_CLIENT_SECRET, DATABASE_URL, DCR_ENCRYPTION_KEY) across three tiers with step-by-step runbook. Includes scripts/rotate_dcr_encryption_key.py for transactional re-encryption of DCR client secrets with dry-run support. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/MANUAL_SECRET_ROTATION.md | 1267 ++++++++++++++++++++++++++ scripts/rotate_dcr_encryption_key.py | 309 +++++++ 2 files changed, 1576 insertions(+) create mode 100644 docs/MANUAL_SECRET_ROTATION.md create mode 100644 scripts/rotate_dcr_encryption_key.py diff --git a/docs/MANUAL_SECRET_ROTATION.md b/docs/MANUAL_SECRET_ROTATION.md new file mode 100644 index 00000000..a9258960 --- /dev/null +++ b/docs/MANUAL_SECRET_ROTATION.md @@ -0,0 +1,1267 @@ +# Manual Secret Rotation + +## INTERNAL ATTESTATION + +**Finding:** Credential provider APIs for automated secret rotation are not supported. + +**Affected Secrets:** + +**Tier 1 (OAuth Secrets) - Blocked by missing credential provider APIs:** +- `RED_HAT_SSO_CLIENT_SECRET` - Red Hat SSO OAuth client secret +- `GMA_CLIENT_SECRET` - Google Marketplace Agent (GMA) client secret + +**Tiers 2 & 3 - Require human intervention (service restart/migration):** +- `DATABASE_URL` - Cloud SQL database credentials, requires coordinated service restart +- `DCR_ENCRYPTION_KEY` - Fernet encryption key requiring re-encryption migration + +**Current Mitigation:** Manual rotation procedures documented in this file, executed annually by operations team. + +**Recommended Action (CIAM team):** Investigate and build API support for: +1. Red Hat SSO Client API - OAuth client secret regeneration endpoint +2. 
Google Marketplace Agent API - Client secret regeneration endpoint + +**Compensating Control:** Annual manual rotation procedures (detailed below) provide security equivalent to automated rotation at longer intervals. + +**Review Date:** 2026-05-04 + +--- + +## Overview + +This document provides manual rotation procedures for all secrets in the Red Hat Lightspeed Agent for Google Cloud that cannot be rotated automatically. + +**Why Manual Rotation:** +- Required credential provider APIs do not exist (Red Hat SSO, GMA) +- Some secrets require coordinated service restarts or database migrations + +**Rotation Policy:** +- **Frequency:** Annual rotation for all tiers +- **Rationale:** Manual processes have operational overhead; annual cadence balances security with operational capacity + +**Three Tiers of Secrets:** + +| Tier | Complexity | Secrets | Downtime | +|------|-----------|---------|----------| +| 1 | Low | OAuth client secrets (SSO, GMA) | None (auto-pickup) | +| 2 | Medium | Database credentials | 2-5 minutes (restart) | +| 3 | High | DCR encryption key | Marketplace only (migration required) | + +**Emergency Rotation:** See Appendix B for expedited procedures. 
+ +--- + +## Prerequisites + +### Access Requirements + +**All Tiers (for Secret Manager updates and service restarts):** +- Google Cloud Run project access +- Gemini Enterprise Plus license + - Provides necessary GCP admin roles: Secret Manager, Cloud SQL, Cloud Run + +**Tier 1 Specific:** +- **`RED_HAT_SSO_CLIENT_SECRET`:** `ai5-marketplace` GitLab group membership + - Required to create/update SSO clients and receive credentials +- **`GMA_CLIENT_SECRET`:** No access required (credentials provided by CIAM team) + +**Tier 3 Specific:** +- **`DCR_ENCRYPTION_KEY`:** Database access via `DATABASE_URL` with password (for migration script) + +### Tooling Requirements + +**All Tiers:** +- `gcloud` CLI authenticated and configured + +**Tier 2:** +- `jq` (for JSON parsing in log verification) + +**Tier 3:** +- Python 3.12+ with project dependencies installed +- `scripts/rotate_dcr_encryption_key.py` from repository + +### Coordination Requirements + +**Tier 1:** +- No coordination needed (zero downtime) + +**Tier 2:** +- Maintenance window approval (2-5 minute downtime) +- Stakeholder notification (service restart) + +**Tier 3:** +- Maintenance window (1 hour including contingency) +- Database backup recommended + +--- + +## Tier 1: OAuth Client Secrets + +**Characteristics:** +- Zero service downtime (Cloud Run auto-picks up new secret versions) +- CIAM team coordination required (1-2 business days) +- Low technical complexity (Secret Manager version update) + +--- + +### 1.1 RED_HAT_SSO_CLIENT_SECRET + +**Duration:** 1-2 business days (CIAM processing time) + +**Overview:** Rotate the Red Hat SSO OAuth client secret used for JWT validation. This secret authenticates the Lightspeed Agent with Red Hat SSO to validate user tokens. 
+ +#### Step 1: Request New Secret from CIAM + +Follow the CIAM self-service client configuration management process: + +**Documentation:** https://source.redhat.com/departments/strategy_and_operations/it/ciam/docs/draft_self_service_client_configuration_management~1 + +**Process:** + +1. Connect to Red Hat VPN +2. Navigate to the `ai5-marketplace` GitLab namespace: https://gitlab.cee.redhat.com/ai5-marketplace +3. Navigate to the `client-enablements` repository fork +4. Create a merge request with client Service Account configuration changes +5. Submit for CIAM team review +6. Wait for approval and merge +7. Receive two emails from CIAM: + - **Email 1:** Client ID and password-protected link + - **Email 2:** Password to decrypt the link +8. Access the link and extract new client secret value + +#### Step 2: Update Secret Manager + +Add a new version of the secret in Google Cloud Secret Manager. Cloud Run services will automatically use the latest ENABLED version. + +```bash +# Set project +export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" + +# Add new secret version (paste secret value when prompted) +echo -n "" | gcloud secrets versions add redhat-sso-client-secret \ + --data-file=- \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Verify new version created +gcloud secrets versions list redhat-sso-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=3 +``` + +**Expected Output:** +``` +NAME STATE CREATED DESTROYED +4 ENABLED 2026-05-04T12:00:00 - +3 ENABLED 2025-05-04T12:00:00 - +2 ENABLED 2024-05-04T12:00:00 - +``` + +Version 4 is the new secret (latest). + +#### Step 3: Verification + +Cloud Run services automatically pick up new secret versions within minutes. Verify authentication works with the new secret. 
+ +**Obtain a test JWT token:** + +```bash +# Install OCM CLI (if not already installed) +# https://console.redhat.com/openshift/token + +# Authenticate +ocm login --use-auth-code + +# Get JWT token +TOKEN=$(ocm token) +``` + +**Test authentication:** + +```bash +# Get agent service URL +AGENT_URL=$(gcloud run services describe lightspeed-agent \ + --region=us-central1 \ + --format='value(status.url)' \ + --project="${GOOGLE_CLOUD_PROJECT}") + +# Test authentication (should return agent card JSON) +curl -H "Authorization: Bearer $TOKEN" \ + "${AGENT_URL}/.well-known/agent.json" +``` + +**Expected Output:** +```json +{ + "name": "Red Hat Lightspeed for Google Cloud", + "description": "Access Red Hat Insights...", + ... +} +``` + +**If authentication fails**, the new secret is likely invalid or not yet propagated. Proceed to rollback (Step 4). + +#### Step 4: Rollback (if needed) + +If verification fails, disable the new secret version to restore the previous version. + +```bash +# List versions to identify the new version number +gcloud secrets versions list redhat-sso-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Disable new version (replace VERSION_NUMBER with the version added in Step 2) +gcloud secrets versions disable VERSION_NUMBER \ + --secret=redhat-sso-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Verify rollback (latest ENABLED version should be previous) +gcloud secrets versions list redhat-sso-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=3 +``` + +**Expected Output:** +``` +NAME STATE CREATED DESTROYED +4 DISABLED 2026-05-04T12:00:00 - +3 ENABLED 2025-05-04T12:00:00 - +``` + +Cloud Run will automatically use version 3 (previous secret). Re-run verification (Step 3) to confirm rollback succeeded. + +**Investigate why the new secret failed** before attempting rotation again.
Common causes: +- Wrong secret value copied (typo) +- CIAM provided staging credentials instead of production +- Client configuration not yet propagated to production SSO + +--- + +### 1.2 GMA_CLIENT_SECRET + +**Duration:** 1-5 business days (CIAM contact response time) + +**Overview:** Rotate the Google Marketplace Agent (GMA) OAuth client secret used for Dynamic Client Registration (DCR). This secret authenticates the Marketplace Handler with the GMA API to create tenant-specific OAuth clients. + +#### Step 1: Request New Secret from CIAM Contacts + +Unlike Red Hat SSO (which has a self-service process), GMA client secrets must be requested directly from CIAM team contacts. + +**Contact Information:** +- **Primary Contact:** [TBD - add contact name/email] +- **Secondary Contact:** [TBD - add contact name/email] +- **Escalation Path:** [TBD - add escalation contact for urgent requests] + +**Request Template:** + +``` +Subject: GMA Client Annual Secret Rotation Request - Red Hat Lightspeed Agent + +Hi [CIAM Team], + +We need to rotate the GMA client secret for the Red Hat Lightspeed Agent for Google Cloud. + +**Details:** +- Service: Red Hat Lightspeed Agent for Google Cloud +- Current client_id: +- Environment: Production + +Please generate a new client secret and provide it via a secure channel. 
+ +Thank you, +[Your Name] +[Your Team] +``` + +#### Step 2: Update Secret Manager + +```bash +# Set project +export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" + +# Add new secret version +echo -n "" | gcloud secrets versions add gma-client-secret \ + --data-file=- \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Verify new version created +gcloud secrets versions list gma-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=3 +``` + +**Expected Output:** +``` +NAME STATE CREATED DESTROYED +3 ENABLED 2026-05-04T14:00:00 - +2 ENABLED 2025-05-04T14:00:00 - +1 ENABLED 2024-05-04T14:00:00 - +``` + +#### Step 3: Verification + + + +**If verification fails**, proceed to rollback (Step 4). + +#### Step 4: Rollback (if needed) + +```bash +# List versions to identify new version number +gcloud secrets versions list gma-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Disable new version +gcloud secrets versions disable \ + --secret=gma-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Verify rollback +gcloud secrets versions list gma-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=3 +``` + +Cloud Run will automatically use the previous ENABLED version. + +**Investigate failure** before retrying: +- Verify secret value is correct (no copy/paste errors) +- Confirm CIAM provided production credentials (not staging) +- Check GMA API endpoint is accessible from Cloud Run +- Review Marketplace Handler logs for authentication errors + +--- + +## Tier 2: Database Credentials + +**Characteristics:** +- Service restart required (2-5 minute downtime) +- Maintenance window coordination needed +- Medium technical complexity (Cloud SQL + Secret Manager + service restarts) + +**Duration:** 30-60 minutes (includes maintenance window) + +**Overview:** Rotate PostgreSQL database password used by both Lightspeed Agent and Marketplace Handler services. Requires coordinated updates to Cloud SQL, Secret Manager, and service restarts. 
+ +--- + +### Step 1: Schedule Maintenance Window + +**Coordinate service downtime:** + +- **Affected Services:** Both `lightspeed-agent` and `marketplace-handler` Cloud Run services +- **Expected Downtime:** 2-5 minutes during service restart +- **Impact:** + - Agent API requests will fail during restart (users see 503 errors) + - Marketplace provisioning events buffered by Pub/Sub (processed after restart) + +--- + +### Step 2: Generate New Password + +Use a cryptographically secure random password generator: + +```bash +# Generate 32-character random password +NEW_DB_PASSWORD=$(openssl rand -base64 32 | tr -d "=+/" | cut -c1-32) + +# Display password (save temporarily in password manager) +echo "Generated password: ${NEW_DB_PASSWORD}" +echo "Length: ${#NEW_DB_PASSWORD} characters" +``` + +**Expected Output:** +``` +Generated password: Kj7mN9pQ2rT4vW6xZ8aB1cD3eF5gH0iJ +Length: 32 characters +``` + +--- + +### Step 3: Update Cloud SQL Password + +Update the database user password in Cloud SQL: + +```bash +# Set variables (replace with your actual values) +export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" +export DB_INSTANCE_NAME="your-db-instance-name" +export DB_USERNAME="your-db-username" + +# Update Cloud SQL password +gcloud sql users set-password "${DB_USERNAME}" \ + --instance="${DB_INSTANCE_NAME}" \ + --password="${NEW_DB_PASSWORD}" \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +--- + +### Step 4: Update Secret Manager + +Both services use the same database. 
Update the `database-url` secret with the new password: + +```bash +# Construct new connection string +# Format: postgresql+asyncpg://USERNAME:PASSWORD@/DATABASE?host=/cloudsql/PROJECT:REGION:INSTANCE +NEW_DATABASE_URL="postgresql+asyncpg://${DB_USERNAME}:${NEW_DB_PASSWORD}@/lightspeed_agent?host=/cloudsql/${GOOGLE_CLOUD_PROJECT}:us-central1:${DB_INSTANCE_NAME}" + +# Add new secret version +echo -n "${NEW_DATABASE_URL}" | gcloud secrets versions add database-url \ + --data-file=- \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Verify new version created +gcloud secrets versions list database-url \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=3 +``` + +--- + +### Step 5: Restart Services + +Force Cloud Run services to pick up the new secret version. **This triggers downtime.** + +```bash +# Restart agent service +echo "Restarting lightspeed-agent..." +gcloud run services update lightspeed-agent \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Restart marketplace handler +echo "Restarting marketplace-handler..." +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Monitor service health +echo "Waiting for services to become ready..." +sleep 10 + +# Check agent service status +gcloud run services describe lightspeed-agent \ + --region=us-central1 \ + --format='value(status.conditions[0].status)' \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Check marketplace handler status +gcloud run services describe marketplace-handler \ + --region=us-central1 \ + --format='value(status.conditions[0].status)' \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +**Expected Output (for both services):** +``` +True +``` + +This indicates services are ready and healthy. 
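
A fixed `sleep 10` may be shorter than a slow revision rollout. The following is a minimal sketch, not part of the patch, of polling the same readiness condition until it passes or a timeout expires. The helper names `wait_until` and `service_ready` are hypothetical; the `gcloud` arguments are the ones used in this step.

```python
import subprocess
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 300.0,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def service_ready(service: str, region: str, project: str) -> bool:
    """Run the `gcloud run services describe` readiness check from this step."""
    result = subprocess.run(
        ["gcloud", "run", "services", "describe", service,
         f"--region={region}",
         "--format=value(status.conditions[0].status)",
         f"--project={project}"],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0 and result.stdout.strip() == "True"
```

For example, `wait_until(lambda: service_ready("lightspeed-agent", "us-central1", project))` returns `False` if the service is still not ready after five minutes, which is a reasonable signal to move to rollback.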
+ +--- + +### Step 6: Verification + +Test that both services can connect to the database with the new password: + +**Check service logs for database errors:** + +```bash +# Check agent logs (last 5 minutes) +gcloud logging read "resource.type=cloud_run_revision AND \ + resource.labels.service_name=lightspeed-agent AND \ + severity>=ERROR AND \ + timestamp>=\"$(date -u -d '5 minutes ago' --iso-8601=seconds)\"" \ + --limit=20 \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --format=json | jq '.[].jsonPayload.message' + +# Check marketplace handler logs +gcloud logging read "resource.type=cloud_run_revision AND \ + resource.labels.service_name=marketplace-handler AND \ + severity>=ERROR AND \ + timestamp>=\"$(date -u -d '5 minutes ago' --iso-8601=seconds)\"" \ + --limit=20 \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --format=json | jq '.[].jsonPayload.message' +``` + +**Expected:** No database connection errors in recent logs. + +**If database connection errors appear**, proceed to rollback (Step 7). 
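
The `date -u -d '5 minutes ago' --iso-8601=seconds` call in the log queries is GNU `date` syntax and fails on macOS/BSD. A portable sketch for building the same filter string follows; the function name `error_log_filter` is hypothetical, and the filter fields are copied from the queries in this step.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def error_log_filter(service_name: str, minutes: int = 5,
                     now: Optional[datetime] = None) -> str:
    """Build the Cloud Logging filter used above, with a UTC cutoff timestamp."""
    current = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    since = (current - timedelta(minutes=minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "resource.type=cloud_run_revision AND "
        f"resource.labels.service_name={service_name} AND "
        "severity>=ERROR AND "
        f'timestamp>="{since}"'
    )


if __name__ == "__main__":
    # Print the filter so it can be passed as the first argument to `gcloud logging read`.
    print(error_log_filter("lightspeed-agent"))
```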
+ +--- + +### Step 7: Rollback (if needed) + +If services cannot connect to the database, roll back to the previous password: + +```bash +# Step 1: Disable new Secret Manager version +gcloud secrets versions list database-url \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Disable new version (replace NEW_VERSION_NUMBER with the number from Step 4) +gcloud secrets versions disable NEW_VERSION_NUMBER \ + --secret=database-url \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Step 2: Restore old password in Cloud SQL +# (You must retrieve old password from Secret Manager first; +# replace PREVIOUS_VERSION_NUMBER with the last version that worked) +OLD_DATABASE_URL=$(gcloud secrets versions access PREVIOUS_VERSION_NUMBER \ + --secret=database-url \ + --project="${GOOGLE_CLOUD_PROJECT}") + +# Extract password from connection string (between : and @) +OLD_DB_PASSWORD=$(echo "${OLD_DATABASE_URL}" | sed -n 's/.*:\([^@]*\)@.*/\1/p') + +# Restore old password in Cloud SQL +gcloud sql users set-password "${DB_USERNAME}" \ + --instance="${DB_INSTANCE_NAME}" \ + --password="${OLD_DB_PASSWORD}" \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Step 3: Restart services to pick up rollback +gcloud run services update lightspeed-agent \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +``` + +**Re-run verification (Step 6)** to confirm rollback succeeded.
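
The `sed` expression in the rollback block returns only the text after the last `:` before `@`, so it would truncate a password that itself contains a colon. A sketch of a stricter extraction with the standard library follows; `password_from_database_url` is a hypothetical helper, and it assumes any special characters in the stored URL are percent-encoded.

```python
from urllib.parse import unquote, urlsplit


def password_from_database_url(database_url: str) -> str:
    """Return the password component of a SQLAlchemy-style database URL."""
    password = urlsplit(database_url).password
    if password is None:
        raise ValueError("database URL has no password component")
    # Assumption: the URL stores the password percent-encoded, so decode it.
    return unquote(password)


if __name__ == "__main__":
    example = ("postgresql+asyncpg://dbuser:example-pass@/lightspeed_agent"
               "?host=/cloudsql/my-project:us-central1:my-instance")
    print(password_from_database_url(example))
```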
+ +**Investigate failure** before retrying: +- Verify password in Secret Manager matches Cloud SQL +- Check database connection string format (typos in hostname, database name) +- Verify Cloud SQL instance is running and accessible +- Review service logs for specific connection errors + +--- + +## Tier 3: DCR Encryption Key + +**Characteristics:** +- Marketplace provisioning downtime during migration (30 minutes) +- Agent API remains fully available (does not use `DCR_ENCRYPTION_KEY`) +- Database migration required (re-encrypt all DCR client secrets) +- Database backup recommended as last-resort safety net + +**Duration:** 30 minutes (includes backup verification, migration, and verification) + +**Overview:** Rotate the Fernet encryption key used to encrypt DCR OAuth client secrets at rest. Requires re-encrypting all secrets in the database using a migration script. + +**Risk:** The migration script uses transactional updates, so a failure mid-migration leaves the database unchanged. If verification fails after migration, the migration can be reversed by re-running the script with keys swapped. If the new key is lost after migration commits, a database backup is the only recovery path. + +--- + +### Step 1: Schedule Maintenance Window + +**Requirements:** + +- **Downtime:** Marketplace provisioning unavailable during migration (30 minutes) + - Agent API remains fully available (does not use `DCR_ENCRYPTION_KEY`) +- **Backup:** Database backup verified within last 24 hours (recommended) +- **Window Size:** Minimum 1 hour (migration + contingency + rollback if needed) + +--- + +### Step 2: Verify Database Backup + +Verify a recent backup exists as a last-resort safety net in case of key loss. 
+ +```bash +# Set variables +export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" +export DB_INSTANCE_NAME="your-db-instance-name" + +# List recent backups +gcloud sql backups list \ + --instance="${DB_INSTANCE_NAME}" \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=5 + +# Check most recent backup timestamp +LATEST_BACKUP=$(gcloud sql backups list \ + --instance="${DB_INSTANCE_NAME}" \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=1 \ + --format='value(id)') + +echo "Latest backup ID: ${LATEST_BACKUP}" +``` + +**Expected Output:** +``` +ID WINDOW_START_TIME TYPE STATUS +1620000000000 2026-05-04T06:00:00.000+00:00 AUTOMATED SUCCESSFUL +``` + +**If no backup within 24 hours**, trigger manual backup: + +```bash +# Create manual backup +gcloud sql backups create \ + --instance="${DB_INSTANCE_NAME}" \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Wait for backup to complete (2-10 minutes depending on database size) +echo "Waiting for backup to complete..." +sleep 60 + +# Verify backup created +gcloud sql backups list \ + --instance="${DB_INSTANCE_NAME}" \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=1 +``` + +If backup fails, consider whether to proceed without this safety net. 
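
The 24-hour rule above is easy to misjudge by eye across time zones. A minimal sketch of the comparison follows, assuming the `WINDOW_START_TIME` value is copied from the `gcloud sql backups list` output in the form shown in the expected output (with an explicit UTC offset); `backup_is_fresh` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_fresh(window_start_time: str, max_age_hours: int = 24,
                    now: Optional[datetime] = None) -> bool:
    """Return True if the backup started within the last max_age_hours."""
    # Assumption: the timestamp carries an offset, e.g. 2026-05-04T06:00:00.000+00:00.
    started = datetime.fromisoformat(window_start_time)
    current = now or datetime.now(timezone.utc)
    return current - started <= timedelta(hours=max_age_hours)


if __name__ == "__main__":
    print(backup_is_fresh("2026-05-04T06:00:00.000+00:00"))
```

If this returns `False`, trigger the manual backup described in this step before continuing.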
+ +--- + +### Step 3: Generate New Fernet Key + +```bash +# Generate new encryption key (44-character base64-encoded Fernet key) +NEW_ENCRYPTION_KEY=$(python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())") + +echo "New encryption key (save in password manager): ${NEW_ENCRYPTION_KEY}" +echo "Length: ${#NEW_ENCRYPTION_KEY} characters (should be 44)" +``` + +**Expected Output (illustrative key value):** +``` +New encryption key (save in password manager): a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0u1w= +Length: 44 characters (should be 44) +``` + +--- + +### Step 4: Retrieve Old Key from Secret Manager + +```bash +# Get current encryption key +OLD_ENCRYPTION_KEY=$(gcloud secrets versions access latest \ + --secret=dcr-encryption-key \ + --project="${GOOGLE_CLOUD_PROJECT}") + +echo "Old key retrieved" +echo "Length: ${#OLD_ENCRYPTION_KEY} characters (should be 44)" +``` + +**Verify both keys are 44 characters** and different: + +```bash +# Verify keys are different +if [ "${OLD_ENCRYPTION_KEY}" == "${NEW_ENCRYPTION_KEY}" ]; then + echo "ERROR: Keys are identical! Generate a new key." + exit 1 +else + echo "✓ Keys are different" +fi + +# Verify lengths (a length check only; the dry-run in Step 5 validates the keys themselves) +if [ ${#OLD_ENCRYPTION_KEY} -ne 44 ] || [ ${#NEW_ENCRYPTION_KEY} -ne 44 ]; then + echo "ERROR: Invalid key length!" + exit 1 +else + echo "✓ Key lengths are valid (44 characters)" +fi +``` + +--- + +### Step 5: Run Migration Script (Dry-Run) + +**CRITICAL:** Always run dry-run first to validate keys work before touching the database.
+ +```bash +# Navigate to repository +cd /path/to/google-lightspeed-agent + +# Activate virtual environment +source .venv/bin/activate + +# Get database URL +DATABASE_URL=$(gcloud secrets versions access latest \ + --secret=database-url \ + --project="${GOOGLE_CLOUD_PROJECT}") + +# Run dry-run migration +python scripts/rotate_dcr_encryption_key.py \ + --old-key="${OLD_ENCRYPTION_KEY}" \ + --new-key="${NEW_ENCRYPTION_KEY}" \ + --database-url="${DATABASE_URL}" \ + --dry-run \ + --verbose +``` + +**Expected Output (successful dry-run):** + +``` +INFO: Starting DCR encryption key rotation (DRY-RUN MODE) +INFO: Pre-flight checks: validating keys +INFO: Pre-flight checks: validating database connection +INFO: Pre-flight checks: found 12 DCR clients +INFO: Dry-run mode: testing decrypt/re-encrypt on 12 clients +DEBUG: Testing client gemini-order-abc123 (1/12) +DEBUG: Testing client gemini-order-def456 (2/12) +... +INFO: Dry-run complete: all 12 records can be rotated +``` + +**If dry-run fails**, STOP: + +``` +ERROR: Failed to decrypt client_id=gemini-order-abc123 with old key +ERROR: Pre-flight check failed: Failed to decrypt secret with old key: ... +``` + +**Common causes:** +- Old key is incorrect (not the current production key) +- Database contains secrets encrypted with a different key (previous rotation incomplete) +- Secret corruption in database + +**Resolution:** Verify `OLD_ENCRYPTION_KEY` matches the current production key in Secret Manager. If mismatch, do NOT proceed. Investigate database state. + +--- + +### Step 6: Run Migration Script (Production) + +**WARNING:** This step modifies the database. Ensure dry-run passed in Step 5. 
+ +```bash +# Production migration (no --dry-run flag) +python scripts/rotate_dcr_encryption_key.py \ + --old-key="${OLD_ENCRYPTION_KEY}" \ + --new-key="${NEW_ENCRYPTION_KEY}" \ + --database-url="${DATABASE_URL}" \ + --verbose +``` + +**Expected Output (successful rotation):** + +``` +INFO: Starting DCR encryption key rotation (PRODUCTION MODE) +INFO: Pre-flight checks: validating keys +INFO: Pre-flight checks: validating database connection +INFO: Pre-flight checks: found 12 DCR clients +INFO: Production mode: rotating 12 clients +INFO: Rotation complete: 12 clients rotated successfully +``` + +**If migration fails mid-rotation:** + +``` +ERROR: Failed to decrypt client_id=gemini-order-xyz789 with old key +ERROR: Transaction rolled back +``` + +The database is **unchanged** (transaction rollback). Safe to retry or investigate. + +**Do NOT proceed to Step 7** if migration fails. Database still uses old key. + +--- + +### Step 7: Update Secret Manager + +Only update Secret Manager **after migration succeeds** (Step 6). + +```bash +# Add new encryption key version +echo -n "${NEW_ENCRYPTION_KEY}" | gcloud secrets versions add dcr-encryption-key \ + --data-file=- \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Verify new version created +gcloud secrets versions list dcr-encryption-key \ + --project="${GOOGLE_CLOUD_PROJECT}" \ + --limit=3 +``` + +**Expected Output:** +``` +NAME STATE CREATED DESTROYED +3 ENABLED 2026-05-04T15:00:00 - +2 ENABLED 2025-05-04T15:00:00 - +1 ENABLED 2024-05-04T15:00:00 - +``` + +--- + +### Step 8: Restart Marketplace Handler + +Marketplace Handler service must pick up the new encryption key from Secret Manager. 
+ +```bash +# Restart marketplace handler +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Wait for service to become ready +sleep 15 + +# Check service status +gcloud run services describe marketplace-handler \ + --region=us-central1 \ + --format='value(status.conditions[0].status)' \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +**Expected Output:** +``` +True +``` + +--- + +### Step 9: Verification + + + +**If verification fails**, proceed to rollback (Step 10). + +--- + +### Step 10: Rollback (if verification fails) + +**Option A: Reverse migration (preferred)** — re-run the script with keys swapped: + +```bash +# Re-run migration with keys swapped to restore old encryption +python scripts/rotate_dcr_encryption_key.py \ + --old-key="${NEW_ENCRYPTION_KEY}" \ + --new-key="${OLD_ENCRYPTION_KEY}" \ + --database-url="${DATABASE_URL}" \ + --verbose +``` + +Then disable the new key in Secret Manager and restart the marketplace handler: + +```bash +# Disable new encryption key version +gcloud secrets versions disable \ + --secret=dcr-encryption-key \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Restart marketplace handler with old key +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +**Option B: Database backup restore (last resort)** — only if the new key has been lost and reverse migration is not possible: + +```bash +BACKUP_ID="" + +gcloud sql backups restore "${BACKUP_ID}" \ + --backup-instance="${DB_INSTANCE_NAME}" \ + --backup-project="${GOOGLE_CLOUD_PROJECT}" \ + --instance="${DB_INSTANCE_NAME}" \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# This operation takes 5-15 minutes + +# Disable new encryption key in Secret Manager +gcloud secrets versions disable \ + --secret=dcr-encryption-key \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Restart marketplace handler with old key +gcloud run services update marketplace-handler \ + 
--region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +**WARNING:** Backup restore loses any DCR clients created after the backup. + +**Re-run verification** to confirm DCR functionality restored with old key. + +--- + +### Step 11: Cleanup + +Zero encryption keys from environment and shell history after confirming rotation succeeded (or rollback completed): + +```bash +unset OLD_ENCRYPTION_KEY +unset NEW_ENCRYPTION_KEY +unset DATABASE_URL + +history -c +``` + +--- + +## Appendix A: Migration Script Reference + +**Script:** `scripts/rotate_dcr_encryption_key.py` + +**Purpose:** Re-encrypt DCR OAuth client secrets from old Fernet encryption key to new key. + +--- + +### Usage + +**Dry-Run Mode (recommended first):** + +```bash +python scripts/rotate_dcr_encryption_key.py \ + --old-key="" \ + --new-key="" \ + --database-url="postgresql+asyncpg://user:pass@host/db" \ + --dry-run +``` + +**Production Mode:** + +```bash +python scripts/rotate_dcr_encryption_key.py \ + --old-key="" \ + --new-key="" \ + --database-url="postgresql+asyncpg://user:pass@host/db" +``` + +**Verbose Output:** + +```bash +python scripts/rotate_dcr_encryption_key.py ... --verbose +``` + +**Environment Variables (alternative to CLI args):** + +```bash +export DCR_OLD_KEY="" +export DCR_NEW_KEY="" +export DATABASE_URL="postgresql+asyncpg://..." 
+python scripts/rotate_dcr_encryption_key.py --dry-run +``` + +--- + +### Arguments + +| Argument | Required | Description | Environment Variable | +|----------|----------|-------------|---------------------| +| `--old-key` | Yes | Current Fernet encryption key (44-char base64) | `DCR_OLD_KEY` | +| `--new-key` | Yes | New Fernet encryption key (44-char base64) | `DCR_NEW_KEY` | +| `--database-url` | Yes | PostgreSQL connection string | `DATABASE_URL` | +| `--dry-run` | No | Test mode (no database writes) | N/A | +| `--verbose` | No | Detailed progress logging | N/A | + +--- + +### Exit Codes + +| Code | Meaning | Resolution | +|------|---------|------------| +| 0 | Success (all records rotated) | Rotation complete | +| 1 | Pre-flight check failed | Verify keys and database connection | +| 2 | Migration failed | Check error message, restore from backup if needed | + +--- + +### Output Interpretation + +**Dry-Run Success:** + +``` +INFO: Dry-run complete: all 12 records can be rotated +INFO: Rotation summary: 12 clients tested, 0 errors +``` + +All secrets can be decrypted with old key and re-encrypted with new key. Safe to proceed with production mode. + +**Production Success:** + +``` +INFO: Rotation complete: 12 clients rotated successfully +``` + +All secrets re-encrypted and database updated. Proceed to Secret Manager update (Tier 3 Step 7). + +**Error: Invalid Old Key:** + +``` +ERROR: Failed to decrypt client_id=gemini-order-abc123 with old key +ERROR: Decryption failed. Verify OLD_KEY is correct. Transaction rolled back. +``` + +Old key is incorrect or database contains secrets encrypted with a different key. Verify old key matches production Secret Manager value. + +**Error: Keys Identical:** + +``` +ERROR: Pre-flight check failed: Keys must be different (no-op rotation not allowed) +``` + +Old and new keys are the same. Generate a new key with `Fernet.generate_key()`. 
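
When a key is pasted by hand rather than generated, a 44-character length check alone does not show that it has the shape of a Fernet key (URL-safe base64 that decodes to exactly 32 bytes). A standard-library sketch of that shape check follows; `is_fernet_key_shaped` is a hypothetical helper and is not taken from the migration script.

```python
import base64
import binascii


def is_fernet_key_shaped(key: str) -> bool:
    """Return True if key is URL-safe base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(key.encode("ascii"))
    except (binascii.Error, ValueError):
        # Bad padding, or non-ASCII characters in the key string.
        return False
    return len(raw) == 32


if __name__ == "__main__":
    sample = base64.urlsafe_b64encode(bytes(32)).decode()
    print(len(sample), is_fernet_key_shaped(sample))
```

A string such as 44 repetitions of `a` passes the length check but decodes to 33 bytes, so this function rejects it.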
+ +**Error: Database Connection:** + +``` +ERROR: Pre-flight check failed: Database connection failed: ... +``` + +Cannot connect to database. Verify `DATABASE_URL` is correct and database is accessible. + +--- + +### Safety Features + +1. **Dry-Run Mode:** Test decrypt/re-encrypt without database writes +2. **Transactional Updates:** All-or-nothing database updates (rollback on any error) +3. **Memory Safety:** Zeros plaintext secrets after re-encryption +4. **Pre-Flight Checks:** Validates keys and database before migration +5. **Progress Logging:** Shows each `client_id` rotated (never logs secret values) + +--- + +### Architecture + +**Migration Flow (per DCR client):** + +``` +1. Read encrypted_secret from database + └─> "gAAAAABh..." (ciphertext, encrypted with OLD key) + +2. Decrypt with OLD Fernet key + └─> "my-oauth-secret-12345" ← PLAINTEXT (sensitive!) + +3. Re-encrypt with NEW Fernet key + └─> "gAAAAABi..." ← NEW ciphertext (encrypted with NEW key) + +4. Update database record + └─> UPDATE dcr_clients SET encrypted_secret = "gAAAAABi..." WHERE client_id = ... + +5. Zero plaintext from memory + └─> Overwrite plaintext bytes with zeros +``` + +**Transaction Boundary:** + +All database updates happen in a single SQLAlchemy transaction: +- If any client fails to decrypt: rollback, exit with code 2 +- If database error: rollback, exit with code 2 +- If all succeed: commit, exit with code 0 + +**No partial updates possible** due to transactional safety. + +--- + +## Appendix B: Emergency Rotation + +**When to Use:** + +- Credential leaked in logs, source control, or error messages +- Security incident or suspected breach +- Compliance requirement (audit finding, regulatory mandate) +- Suspected credential compromise (anomalous API usage, unauthorized access) + +--- + +### Expedited Process + +#### Phase 1: Assess Impact + +**Questions to answer:** + +1. 
**Which secret is compromised?** + - Tier 1: `RED_HAT_SSO_CLIENT_SECRET` or `GMA_CLIENT_SECRET` + - Tier 2: Database credentials + - Tier 3: `DCR_ENCRYPTION_KEY` + +2. **What is the exposure scope?** + - Public (e.g., committed to GitHub, posted in public Slack channel) + - Internal (e.g., internal logs, internal documentation) + - Suspected (e.g., unusual API activity, failed auth attempts) + +3. **Are there signs of unauthorized access?** + - Check Cloud Logging for unusual API calls using compromised credentials + - Review Service Control metrics for anomalous usage patterns + - Check database audit logs (if enabled) for unauthorized queries + +--- + +#### Phase 2: Containment (Immediate) + +**Tier 1 (OAuth Secrets):** + +1. **Disable compromised secret immediately:** + +```bash +# Identify current secret version +gcloud secrets versions list redhat-sso-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Disable current version (blocks all authentication with this secret) +# Replace VERSION_ID with the version number identified above +# WARNING: This breaks authentication until new secret is added +gcloud secrets versions disable VERSION_ID \ + --secret=redhat-sso-client-secret \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +2. **Escalate to CIAM team** (mark as urgent security incident): + +``` +Subject: URGENT - Security Incident - Compromised SSO Client Secret + +SECURITY INCIDENT - URGENT + +Compromised Secret: RED_HAT_SSO_CLIENT_SECRET +Exposure: [Public/Internal/Suspected] +Impact: [Describe exposure - where secret was leaked] + +Current Status: Old secret disabled (authentication blocked) + +Request: Expedited new client secret generation (emergency rotation) +Timeline: ASAP (production service impacted) + +Security team notified: [Yes/No] +Incident ticket: [Link if available] +``` + +**Tier 2 (Database Credentials):** + +1. **Follow Tier 2 procedure** (Steps 2-6) immediately; skip the maintenance window scheduling + +**Tier 3 (DCR Encryption Key):** + +1.
**Assess blast radius:** + - If encryption key leaked, all DCR OAuth client secrets are compromised + - Tenant OAuth clients may be at risk + +2. **Immediate actions:** + - Disable DCR functionality (stop Marketplace Handler service) + - Notify security team and stakeholders + - Prepare for emergency Tier 3 rotation (requires database backup + 1-2 hour downtime) + +```bash +# Stop marketplace handler (blocks new provisioning) +gcloud run services update marketplace-handler \ + --no-traffic \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +--- + +#### Phase 3: Communication (Parallel with Rotation) + +**Internal Stakeholders:** + +``` +Subject: SECURITY INCIDENT - Credential Rotation in Progress + +Security incident requiring emergency credential rotation: + +**Status:** Containment complete, rotation in progress +**Affected Systems:** [List] +**Current Impact:** [Describe downtime or service degradation] +**Estimated Resolution:** [Timeline] + +**Actions Taken:** +- Compromised credentials disabled +- Rotation procedures initiated +- Unauthorized access logs under review + +**Next Steps:** +- Complete rotation (ETA: [time]) +- Post-incident review +- Update incident response procedures + +Updates will be provided every 30 minutes. +``` + +**External Stakeholders (if customer impact):** + +``` +Subject: Service Advisory - Maintenance in Progress + +We are performing emergency maintenance on the Red Hat Lightspeed Agent for Google Cloud due to a security event. + +**Impact:** [Describe customer-facing impact] +**Expected Resolution:** [Timeline] +**Actions Required:** None (service will auto-recover) + +We will provide updates as the situation develops. +``` + +--- + +#### Phase 4: Post-Incident (Within 48 hours) + +1. **Document Incident Timeline:** + - When was credential compromised? + - When was compromise detected? 
+ - Time to containment (disable old credential) + - Time to rotation (new credential active) + - Total incident duration + +2. **Update Incident Response Procedures:** + - What worked well? + - What could be faster? + - Were escalation paths clear? + - Did communication templates help? + +3. **Consider Policy Changes:** + - Should rotation frequency increase for this secret tier? + - Are additional monitoring/alerting needed? + - Should secret access be restricted further? + +--- + +### Compressed Timeline Example + +**Hour 0: Detection and Containment** +- 00:00 - Credential compromise detected (leaked in logs) +- 00:05 - Old secret disabled (authentication blocked) +- 00:10 - CIAM escalation for new secret (Tier 1) +- 00:15 - Security team notified, incident opened + +**Hour 0-1: Tier 1 Rotation** +- 00:30 - New SSO secret received from CIAM +- 00:35 - Secret Manager updated +- 00:40 - Verification complete (authentication restored) + +**Hour 1-2: Tier 3 Rotation (if needed)** +- 01:00 - Database backup verified +- 01:10 - New encryption key generated +- 01:15 - Migration script dry-run passed +- 01:20 - Production migration complete +- 01:25 - Secret Manager updated, services restarted +- 01:30 - Verification complete + +**Hour 2-3: Monitoring and Communication** +- 02:00 - Log analysis for unauthorized access +- 02:30 - Stakeholder update (rotation complete) +- 03:00 - Service fully operational, incident closed + +**Day 1-7: Post-Incident** +- Day 1 - Post-incident review meeting +- Day 3 - Incident report published +- Day 7 - Policy updates implemented + +--- + +### Emergency Contacts + +**CIAM Team Escalation:** +- Primary: [TBD - add name/email/Slack] +- Secondary: [TBD - add name/email/Slack] +- Emergency (24/7): [TBD - add on-call rotation or pager] + +**Security Team:** +- Primary: [TBD - add security contact] +- Incident Response: [TBD - add IR team contact] + +**Engineering On-Call:** +- PagerDuty: [TBD - add PD integration or on-call schedule] + +--- 
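The all-or-nothing transaction boundary described in Appendix A (Architecture) can be illustrated with the standard library. This is a sketch under stated assumptions rather than the migration script itself: it uses `sqlite3` instead of SQLAlchemy, the `clients` table and `secret` column are hypothetical, and `reencrypt` is a stand-in string transform, not real cryptography.

```python
import sqlite3


def reencrypt(ciphertext: str) -> str:
    """Stand-in for decrypt-with-old / encrypt-with-new; NOT real cryptography."""
    if not ciphertext.startswith("old:"):
        raise ValueError("cannot decrypt with old key")
    return "new:" + ciphertext[len("old:"):]


def rotate_all(conn: sqlite3.Connection) -> int:
    """Re-encrypt every row in one transaction; any failure rolls back all rows."""
    rows = conn.execute("SELECT client_id, secret FROM clients").fetchall()
    with conn:  # commits on success, rolls back if an exception escapes the block
        for client_id, secret in rows:
            conn.execute(
                "UPDATE clients SET secret = ? WHERE client_id = ?",
                (reencrypt(secret), client_id),
            )
    return len(rows)
```

If one row cannot be decrypted, rows already updated in the same run revert to their old ciphertext, so the old key remains valid for the whole table.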
diff --git a/scripts/rotate_dcr_encryption_key.py b/scripts/rotate_dcr_encryption_key.py new file mode 100644 index 00000000..3434884b --- /dev/null +++ b/scripts/rotate_dcr_encryption_key.py @@ -0,0 +1,309 @@ +"""DCR encryption key rotation script. + +Re-encrypts all DCR OAuth client secrets from old Fernet key to new key. +""" +import argparse +import asyncio +import logging +import os +import sys +from dataclasses import dataclass + +from cryptography.fernet import Fernet, InvalidToken +from sqlalchemy import select, text +from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine + +from lightspeed_agent.db.models import DCRClientModel + +logger = logging.getLogger(__name__) + + +class RotationError(Exception): + """Raised when rotation fails pre-flight checks or during execution.""" + pass + + +@dataclass +class RotationResult: + """Result of encryption key rotation.""" + success: bool + clients_rotated: int + dry_run: bool + errors: list[str] | None = None + + def __post_init__(self) -> None: + if self.errors is None: + self.errors = [] + + +class EncryptionKeyRotator: + """Re-encrypts DCR client secrets from old Fernet key to new key.""" + + def __init__(self, old_key: str, new_key: str, database_url: str): + """Initialize rotator with old key, new key, and database URL. 
+ + Args: + old_key: Current Fernet encryption key (44-char base64) + new_key: New Fernet encryption key (44-char base64) + database_url: PostgreSQL or SQLite connection string + """ + self.old_key = old_key + self.new_key = new_key + self.database_url = database_url + + try: + self.old_fernet = Fernet(old_key.encode()) + self.new_fernet = Fernet(new_key.encode()) + except Exception as e: + raise RotationError(f"Invalid Fernet key: {e}") from e + + self.engine = create_async_engine(database_url, echo=False) + self.async_session = async_sessionmaker( + self.engine, class_=AsyncSession, expire_on_commit=False + ) + + async def validate_keys(self) -> None: + """Validate that keys are different and valid. + + Raises: + RotationError: If keys are identical or invalid + """ + if self.old_key == self.new_key: + raise RotationError("Keys must be different (no-op rotation not allowed)") + + async def validate_database(self) -> None: + """Validate database connection. + + Raises: + RotationError: If database connection fails + """ + try: + async with self.engine.connect() as conn: + await conn.execute(text("SELECT 1")) + except Exception as e: + raise RotationError(f"Database connection failed: {e}") from e + + async def cleanup(self) -> None: + """Dispose database engine.""" + await self.engine.dispose() + + async def fetch_dcr_clients(self) -> list[DCRClientModel]: + """Fetch all DCR clients from database. + + Returns: + List of DCRClientModel instances + """ + async with self.async_session() as session: + result = await session.execute(select(DCRClientModel)) + clients = result.scalars().all() + return list(clients) + + def reencrypt_secret(self, encrypted_secret: str) -> str: + """Re-encrypt a secret from old key to new key. 
+ + Args: + encrypted_secret: Base64-encoded ciphertext encrypted with old key + + Returns: + Base64-encoded ciphertext encrypted with new key + + Raises: + RotationError: If decryption with old key fails + """ + try: + plaintext = bytearray(self.old_fernet.decrypt(encrypted_secret.encode())) + reencrypted = self.new_fernet.encrypt(bytes(plaintext)) + + for i in range(len(plaintext)): + plaintext[i] = 0 + + return reencrypted.decode() + + except InvalidToken as e: + raise RotationError(f"Failed to decrypt secret with old key: {e}") from e + + async def rotate( + self, clients: list[DCRClientModel], dry_run: bool = False + ) -> RotationResult: + """Execute rotation with optional dry-run mode. + + Args: + clients: DCR clients to rotate (from fetch_dcr_clients) + dry_run: If True, test decrypt/re-encrypt without database writes + + Returns: + RotationResult with success status and counts + """ + if dry_run: + errors = [] + for i, client in enumerate(clients, 1): + try: + self.reencrypt_secret(client.client_secret_encrypted) + logger.debug("Testing client %s (%d/%d)", client.client_id, i, len(clients)) + except RotationError as e: + errors.append(f"{client.client_id}: {e}") + + return RotationResult( + success=len(errors) == 0, + clients_rotated=len(clients), + dry_run=True, + errors=errors + ) + else: + async with self.async_session() as session, session.begin(): + for client in clients: + reencrypted = self.reencrypt_secret(client.client_secret_encrypted) + client.client_secret_encrypted = reencrypted + session.add(client) + + return RotationResult( + success=True, + clients_rotated=len(clients), + dry_run=False + ) + + +def parse_args() -> argparse.Namespace: + """Parse command-line arguments.""" + parser = argparse.ArgumentParser( + description="Re-encrypt DCR OAuth client secrets from old Fernet key to new key", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Dry-run mode (test only, no database changes) + python 
rotate-dcr-encryption-key.py \\ + --old-key="" \\ + --new-key="" \\ + --database-url="postgresql+asyncpg://..." \\ + --dry-run + + # Production mode (modifies database) + python rotate-dcr-encryption-key.py \\ + --old-key="" \\ + --new-key="" \\ + --database-url="postgresql+asyncpg://..." + + # Using environment variables + export DCR_OLD_KEY="..." + export DCR_NEW_KEY="..." + export DATABASE_URL="..." + python rotate-dcr-encryption-key.py --dry-run + """ + ) + + parser.add_argument( + "--old-key", + type=str, + default=os.getenv("DCR_OLD_KEY"), + help="Current Fernet encryption key (44-char base64, or set DCR_OLD_KEY env)" + ) + parser.add_argument( + "--new-key", + type=str, + default=os.getenv("DCR_NEW_KEY"), + help="New Fernet encryption key (44-char base64, or set DCR_NEW_KEY env)" + ) + parser.add_argument( + "--database-url", + type=str, + default=os.getenv("DATABASE_URL"), + help="PostgreSQL connection string (or set DATABASE_URL env)" + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="Test mode: no database changes (default: False)" + ) + parser.add_argument( + "--verbose", + action="store_true", + help="Detailed progress logging (default: False)" + ) + + args = parser.parse_args() + + if not args.old_key: + parser.error("--old-key is required (or set DCR_OLD_KEY environment variable)") + if not args.new_key: + parser.error("--new-key is required (or set DCR_NEW_KEY environment variable)") + if not args.database_url: + parser.error("--database-url is required (or set DATABASE_URL environment variable)") + + return args + + +async def main() -> int: + """Main entry point. 
+ + Returns: + Exit code (0=success, 1=pre-flight failure, 2=rotation failure) + """ + args = parse_args() + + log_level = logging.DEBUG if args.verbose else logging.INFO + logging.basicConfig( + level=log_level, + format="%(levelname)s: %(message)s" + ) + + mode = "PRODUCTION MODE" if not args.dry_run else "DRY-RUN MODE" + logger.info(f"Starting DCR encryption key rotation ({mode})") + + try: + rotator = EncryptionKeyRotator( + old_key=args.old_key, + new_key=args.new_key, + database_url=args.database_url + ) + + try: + logger.info("Pre-flight checks: validating keys") + await rotator.validate_keys() + + logger.info("Pre-flight checks: validating database connection") + await rotator.validate_database() + + clients = await rotator.fetch_dcr_clients() + logger.info(f"Pre-flight checks: found {len(clients)} DCR clients") + + if len(clients) == 0: + logger.warning("No DCR clients found - nothing to rotate") + return 0 + + if args.dry_run: + logger.info(f"Dry-run mode: testing decrypt/re-encrypt on {len(clients)} clients") + else: + logger.info(f"Production mode: rotating {len(clients)} clients") + + result = await rotator.rotate(clients, dry_run=args.dry_run) + + if result.success: + if args.dry_run: + logger.info( + f"Dry-run complete: all {result.clients_rotated} records can be rotated" + ) + else: + logger.info( + f"Rotation complete: {result.clients_rotated} clients rotated successfully" + ) + return 0 + else: + errors = result.errors or [] + logger.error(f"Rotation failed with {len(errors)} errors:") + for error in errors: + logger.error(f" {error}") + return 2 + + finally: + await rotator.cleanup() + + except RotationError as e: + logger.error(f"Pre-flight check failed: {e}") + return 1 + except Exception as e: + logger.error(f"Unexpected error: {e}") + return 2 + + +if __name__ == "__main__": + sys.exit(asyncio.run(main())) From a61df7345756133102363416cfc2b426292ae2e2 Mon Sep 17 00:00:00 2001 From: Ilona Shishov Date: Mon, 11 May 2026 17:21:19 +0300 Subject: 
[PATCH 2/5] fix: correct Cloud Run secret pickup claims in rotation doc Cloud Run env-var secrets are resolved at instance startup, not auto-updated when Secret Manager versions change. Add explicit service restart steps to Tier 1 rotation procedures and remove the internal attestation section per PR review feedback. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/MANUAL_SECRET_ROTATION.md | 138 +++++++++++++++++++++++---------- 1 file changed, 96 insertions(+), 42 deletions(-) diff --git a/docs/MANUAL_SECRET_ROTATION.md b/docs/MANUAL_SECRET_ROTATION.md index a9258960..1a731e73 100644 --- a/docs/MANUAL_SECRET_ROTATION.md +++ b/docs/MANUAL_SECRET_ROTATION.md @@ -1,31 +1,5 @@ # Manual Secret Rotation -## INTERNAL ATTESTATION - -**Finding:** Credential provider APIs for automated secret rotation are not supported. - -**Affected Secrets:** - -**Tier 1 (OAuth Secrets) - Blocked by missing credential provider APIs:** -- `RED_HAT_SSO_CLIENT_SECRET` - Red Hat SSO OAuth client secret -- `GMA_CLIENT_SECRET` - Google Marketplace Agent (GMA) client secret - -**Tiers 2 & 3 - Require human intervention (service restart/migration):** -- `DATABASE_URL` - Cloud SQL database credentials, requires coordinated service restart -- `DCR_ENCRYPTION_KEY` - Fernet encryption key requiring re-encryption migration - -**Current Mitigation:** Manual rotation procedures documented in this file, executed annually by operations team. - -**Recommended Action (CIAM team):** Investigate and build API support for: -1. Red Hat SSO Client API - OAuth client secret regeneration endpoint -2. Google Marketplace Agent API - Client secret regeneration endpoint - -**Compensating Control:** Annual manual rotation procedures (detailed below) provide security equivalent to automated rotation at longer intervals. 
- -**Review Date:** 2026-05-04 - ---- - ## Overview This document provides manual rotation procedures for all secrets in the Red Hat Lightspeed Agent for Google Cloud that cannot be rotated automatically. @@ -42,7 +16,7 @@ This document provides manual rotation procedures for all secrets in the Red Hat | Tier | Complexity | Secrets | Downtime | |------|-----------|---------|----------| -| 1 | Low | OAuth client secrets (SSO, GMA) | None (auto-pickup) | +| 1 | Low | OAuth client secrets (SSO, GMA) | Brief (service restart) | | 2 | Medium | Database credentials | 2-5 minutes (restart) | | 3 | High | DCR encryption key | Marketplace only (migration required) | @@ -82,10 +56,10 @@ This document provides manual rotation procedures for all secrets in the Red Hat ### Coordination Requirements **Tier 1:** -- No coordination needed (zero downtime) +- Brief maintenance window (service restart required) **Tier 2:** -- Maintenance window approval (2-5 minute downtime) +- Maintenance window (2-5 minute downtime) - Stakeholder notification (service restart) **Tier 3:** @@ -97,7 +71,7 @@ This document provides manual rotation procedures for all secrets in the Red Hat ## Tier 1: OAuth Client Secrets **Characteristics:** -- Zero service downtime (Cloud Run auto-picks up new secret versions) +- Brief service downtime (service restart required to pick up new secret versions) - CIAM team coordination required (1-2 business days) - Low technical complexity (Secret Manager version update) @@ -130,7 +104,7 @@ Follow the CIAM self-service client configuration management process: #### Step 2: Update Secret Manager -Add a new version of the secret in Google Cloud Secret Manager. Cloud Run services will automatically use the latest ENABLED version. +Add a new version of the secret in Google Cloud Secret Manager. A service restart is required for Cloud Run to pick up the new version (Step 3). 
```bash # Set project @@ -157,9 +131,45 @@ NAME STATE CREATED DESTROYED Version 4 is the new secret (latest). -#### Step 3: Verification +#### Step 3: Restart Services + +Restart Cloud Run services to pick up the new secret version. + +```bash +# Restart agent service +gcloud run services update lightspeed-agent \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Restart marketplace handler +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Monitor service health +echo "Waiting for services to become ready..." +sleep 10 + +# Verify both services are healthy +gcloud run services describe lightspeed-agent \ + --region=us-central1 \ + --format='value(status.conditions[0].status)' \ + --project="${GOOGLE_CLOUD_PROJECT}" -Cloud Run services automatically pick up new secret versions within minutes. Verify authentication works with the new secret. +gcloud run services describe marketplace-handler \ + --region=us-central1 \ + --format='value(status.conditions[0].status)' \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +**Expected Output (for both services):** +``` +True +``` + +#### Step 4: Verification + +Verify authentication works with the new secret. **Obtain a test JWT token:** @@ -197,11 +207,11 @@ curl -H "Authorization: Bearer $TOKEN" \ } ``` -**If authentication fails**, the new secret is invalid. Proceed to rollback (Step 4). +**If authentication fails**, the new secret is invalid. Proceed to rollback (Step 5). -#### Step 4: Rollback (if needed) +#### Step 5: Rollback (if needed) -If verification fails, disable the new secret version to restore the previous version. +If verification fails, disable the new secret version and restart services to restore the previous version. ```bash # List versions to identify the new version number @@ -226,7 +236,19 @@ NAME STATE CREATED DESTROYED 3 ENABLED 2025-05-04T12:00:00 - ``` -Cloud Run will automatically use version 3 (previous secret). 
Re-run verification (Step 3) to confirm rollback succeeded. +Restart services to pick up the previous secret version: + +```bash +gcloud run services update lightspeed-agent \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +Re-run verification (Step 4) to confirm rollback succeeded. **Investigate why the new secret failed** before attempting rotation again. Common causes: - Wrong secret value copied (typo) @@ -296,13 +318,38 @@ NAME STATE CREATED DESTROYED 1 ENABLED 2024-05-04T14:00:00 - ``` -#### Step 3: Verification +#### Step 3: Restart Services + +Restart the Marketplace Handler service to pick up the new secret version. + +```bash +# Restart marketplace handler (only service using GMA secret) +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" + +# Wait for service to become ready +sleep 10 + +# Verify service is healthy +gcloud run services describe marketplace-handler \ + --region=us-central1 \ + --format='value(status.conditions[0].status)' \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` + +**Expected Output:** +``` +True +``` + +#### Step 4: Verification -**If verification fails**, proceed to rollback (Step 4). +**If verification fails**, proceed to rollback (Step 5). -#### Step 4: Rollback (if needed) +#### Step 5: Rollback (if needed) ```bash # List versions to identify new version number @@ -320,7 +367,13 @@ gcloud secrets versions list gma-client-secret \ --limit=3 ``` -Cloud Run will automatically use the previous ENABLED version. 
+Restart the service to pick up the previous secret version: + +```bash +gcloud run services update marketplace-handler \ + --region=us-central1 \ + --project="${GOOGLE_CLOUD_PROJECT}" +``` **Investigate failure** before retrying: - Verify secret value is correct (no copy/paste errors) @@ -804,7 +857,7 @@ gcloud run services update marketplace-handler \ --project="${GOOGLE_CLOUD_PROJECT}" # Wait for service to become ready -sleep 15 +sleep 10 # Check service status gcloud run services describe marketplace-handler \ @@ -1228,6 +1281,7 @@ We will provide updates as the situation develops. **Hour 0-1: Tier 1 Rotation** - 00:30 - New SSO secret received from CIAM - 00:35 - Secret Manager updated +- 00:37 - Services restarted to pick up new secret - 00:40 - Verification complete (authentication restored) **Hour 1-2: Tier 3 Rotation (if needed)** From 5d4ac97f38f87c5f365d000293d164258b2999be Mon Sep 17 00:00:00 2001 From: Ilona Shishov Date: Tue, 12 May 2026 13:36:12 +0300 Subject: [PATCH 3/5] fix: rotate registration_access_token_encrypted during key rotation Handle the second Fernet-encrypted field in DCRClientModel so key rotation does not leave it encrypted with the old key when non-NULL. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- scripts/rotate_dcr_encryption_key.py | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/scripts/rotate_dcr_encryption_key.py b/scripts/rotate_dcr_encryption_key.py index 3434884b..7cc7deb3 100644 --- a/scripts/rotate_dcr_encryption_key.py +++ b/scripts/rotate_dcr_encryption_key.py @@ -139,6 +139,8 @@ async def rotate( for i, client in enumerate(clients, 1): try: self.reencrypt_secret(client.client_secret_encrypted) + if client.registration_access_token_encrypted: + self.reencrypt_secret(client.registration_access_token_encrypted) logger.debug("Testing client %s (%d/%d)", client.client_id, i, len(clients)) except RotationError as e: errors.append(f"{client.client_id}: {e}") @@ -152,8 +154,13 @@ async def rotate( else: async with self.async_session() as session, session.begin(): for client in clients: - reencrypted = self.reencrypt_secret(client.client_secret_encrypted) - client.client_secret_encrypted = reencrypted + client.client_secret_encrypted = self.reencrypt_secret( + client.client_secret_encrypted + ) + if client.registration_access_token_encrypted: + client.registration_access_token_encrypted = self.reencrypt_secret( + client.registration_access_token_encrypted + ) session.add(client) return RotationResult( From fd468af37435a28f595331dd14a1dae0a34f7ffd Mon Sep 17 00:00:00 2001 From: Ilona Shishov Date: Tue, 12 May 2026 15:40:41 +0300 Subject: [PATCH 4/5] fix: use single session for rotation to avoid detached SQLAlchemy objects Remove fetch_dcr_clients() and move all DB operations into rotate(). Clients are fetched and mutated within the same session, eliminating the cross-session detached-object pattern. Dry-run rolls back on exit, production commits explicitly. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- scripts/rotate_dcr_encryption_key.py | 91 +++++++++++----------------- 1 file changed, 37 insertions(+), 54 deletions(-) diff --git a/scripts/rotate_dcr_encryption_key.py b/scripts/rotate_dcr_encryption_key.py index 7cc7deb3..511c92b4 100644 --- a/scripts/rotate_dcr_encryption_key.py +++ b/scripts/rotate_dcr_encryption_key.py @@ -87,17 +87,6 @@ async def cleanup(self) -> None: """Dispose database engine.""" await self.engine.dispose() - async def fetch_dcr_clients(self) -> list[DCRClientModel]: - """Fetch all DCR clients from database. - - Returns: - List of DCRClientModel instances - """ - async with self.async_session() as session: - result = await session.execute(select(DCRClientModel)) - clients = result.scalars().all() - return list(clients) - def reencrypt_secret(self, encrypted_secret: str) -> str: """Re-encrypt a secret from old key to new key. @@ -122,37 +111,43 @@ def reencrypt_secret(self, encrypted_secret: str) -> str: except InvalidToken as e: raise RotationError(f"Failed to decrypt secret with old key: {e}") from e - async def rotate( - self, clients: list[DCRClientModel], dry_run: bool = False - ) -> RotationResult: - """Execute rotation with optional dry-run mode. + async def rotate(self, dry_run: bool = False) -> RotationResult: + """Fetch all DCR clients and re-encrypt their secrets. 
Args: - clients: DCR clients to rotate (from fetch_dcr_clients) dry_run: If True, test decrypt/re-encrypt without database writes Returns: RotationResult with success status and counts """ - if dry_run: - errors = [] - for i, client in enumerate(clients, 1): - try: - self.reencrypt_secret(client.client_secret_encrypted) - if client.registration_access_token_encrypted: - self.reencrypt_secret(client.registration_access_token_encrypted) - logger.debug("Testing client %s (%d/%d)", client.client_id, i, len(clients)) - except RotationError as e: - errors.append(f"{client.client_id}: {e}") - - return RotationResult( - success=len(errors) == 0, - clients_rotated=len(clients), - dry_run=True, - errors=errors - ) - else: - async with self.async_session() as session, session.begin(): + async with self.async_session() as session: + result = await session.execute(select(DCRClientModel)) + clients = result.scalars().all() + + if len(clients) == 0: + logger.warning("No DCR clients found - nothing to rotate") + return RotationResult(success=True, clients_rotated=0, dry_run=dry_run) + + if dry_run: + logger.info("Dry-run: testing decrypt/re-encrypt on %d clients", len(clients)) + errors = [] + for i, client in enumerate(clients, 1): + try: + self.reencrypt_secret(client.client_secret_encrypted) + if client.registration_access_token_encrypted: + self.reencrypt_secret(client.registration_access_token_encrypted) + logger.debug("Testing client %s (%d/%d)", client.client_id, i, len(clients)) + except RotationError as e: + errors.append(f"{client.client_id}: {e}") + + return RotationResult( + success=len(errors) == 0, + clients_rotated=len(clients), + dry_run=True, + errors=errors + ) + else: + logger.info("Production mode: rotating %d clients", len(clients)) for client in clients: client.client_secret_encrypted = self.reencrypt_secret( client.client_secret_encrypted @@ -161,13 +156,13 @@ async def rotate( client.registration_access_token_encrypted = self.reencrypt_secret( 
client.registration_access_token_encrypted ) - session.add(client) + await session.commit() - return RotationResult( - success=True, - clients_rotated=len(clients), - dry_run=False - ) + return RotationResult( + success=True, + clients_rotated=len(clients), + dry_run=False + ) def parse_args() -> argparse.Namespace: @@ -270,19 +265,7 @@ async def main() -> int: logger.info("Pre-flight checks: validating database connection") await rotator.validate_database() - clients = await rotator.fetch_dcr_clients() - logger.info(f"Pre-flight checks: found {len(clients)} DCR clients") - - if len(clients) == 0: - logger.warning("No DCR clients found - nothing to rotate") - return 0 - - if args.dry_run: - logger.info(f"Dry-run mode: testing decrypt/re-encrypt on {len(clients)} clients") - else: - logger.info(f"Production mode: rotating {len(clients)} clients") - - result = await rotator.rotate(clients, dry_run=args.dry_run) + result = await rotator.rotate(dry_run=args.dry_run) if result.success: if args.dry_run: From c3811029aa29ef8abdb4b3f9def425a2453450e7 Mon Sep 17 00:00:00 2001 From: Ilona Shishov Date: Tue, 12 May 2026 15:49:12 +0300 Subject: [PATCH 5/5] fix: warn against CLI arg usage for secrets in rotation script Update rotation doc to use DCR_OLD_KEY/DCR_NEW_KEY/DATABASE_URL environment variables consistently instead of --old-key/--new-key/--database-url CLI arguments, which expose secrets in process listings. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/MANUAL_SECRET_ROTATION.md | 77 +++++++++++++--------------- scripts/rotate_dcr_encryption_key.py | 26 ++++------ 2 files changed, 48 insertions(+), 55 deletions(-) diff --git a/docs/MANUAL_SECRET_ROTATION.md b/docs/MANUAL_SECRET_ROTATION.md index 1a731e73..1df2ece9 100644 --- a/docs/MANUAL_SECRET_ROTATION.md +++ b/docs/MANUAL_SECRET_ROTATION.md @@ -680,10 +680,10 @@ If backup fails, consider whether to proceed without this safety net. 
 ```bash
 # Generate new encryption key (44-character base64-encoded Fernet key)
-NEW_ENCRYPTION_KEY=$(python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
+DCR_NEW_KEY=$(python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
 
-echo "New encryption key (save in password manager): ${NEW_ENCRYPTION_KEY}"
-echo "Length: ${#NEW_ENCRYPTION_KEY} characters (should be 44)"
+echo "New encryption key (save in password manager): ${DCR_NEW_KEY}"
+echo "Length: ${#DCR_NEW_KEY} characters (should be 44)"
 ```
 
 **Expected Output:**
@@ -698,19 +698,19 @@ Length: 44 characters
 
 ```bash
 # Get current encryption key
-OLD_ENCRYPTION_KEY=$(gcloud secrets versions access latest \
+DCR_OLD_KEY=$(gcloud secrets versions access latest \
   --secret=dcr-encryption-key \
   --project="${GOOGLE_CLOUD_PROJECT}")
 
 echo "Old key retrieved"
-echo "Length: ${#OLD_ENCRYPTION_KEY} characters (should be 44)"
+echo "Length: ${#DCR_OLD_KEY} characters (should be 44)"
 ```
 
 **Verify both keys are 44 characters** and different:
 
 ```bash
 # Verify keys are different
-if [ "${OLD_ENCRYPTION_KEY}" == "${NEW_ENCRYPTION_KEY}" ]; then
+if [ "${DCR_OLD_KEY}" == "${DCR_NEW_KEY}" ]; then
   echo "ERROR: Keys are identical! Generate a new key."
   exit 1
 else
@@ -718,7 +718,7 @@ else
 fi
 
 # Verify lengths
-if [ ${#OLD_ENCRYPTION_KEY} -ne 44 ] || [ ${#NEW_ENCRYPTION_KEY} -ne 44 ]; then
+if [ ${#DCR_OLD_KEY} -ne 44 ] || [ ${#DCR_NEW_KEY} -ne 44 ]; then
   echo "ERROR: Invalid key length!"
   exit 1
 else
@@ -739,18 +739,15 @@ cd /path/to/google-lightspeed-agent
 # Activate virtual environment
 source .venv/bin/activate
 
-# Get database URL
-DATABASE_URL=$(gcloud secrets versions access latest \
+# Set environment variables (the script reads DCR_OLD_KEY, DCR_NEW_KEY, DATABASE_URL)
+export DCR_OLD_KEY
+export DCR_NEW_KEY
+export DATABASE_URL=$(gcloud secrets versions access latest \
   --secret=database-url \
   --project="${GOOGLE_CLOUD_PROJECT}")
 
 # Run dry-run migration
-python scripts/rotate_dcr_encryption_key.py \
-  --old-key="${OLD_ENCRYPTION_KEY}" \
-  --new-key="${NEW_ENCRYPTION_KEY}" \
-  --database-url="${DATABASE_URL}" \
-  --dry-run \
-  --verbose
+python scripts/rotate_dcr_encryption_key.py --dry-run --verbose
 ```
 
 **Expected Output (successful dry-run):**
@@ -779,7 +776,7 @@ ERROR: Pre-flight check failed: Failed to decrypt secret with old key: ...
 - Database contains secrets encrypted with a different key (previous rotation incomplete)
 - Secret corruption in database
 
-**Resolution:** Verify `OLD_ENCRYPTION_KEY` matches the current production key in Secret Manager. If mismatch, do NOT proceed. Investigate database state.
+**Resolution:** Verify `DCR_OLD_KEY` matches the current production key in Secret Manager. If mismatch, do NOT proceed. Investigate database state.
 
 ---
 
@@ -789,11 +786,8 @@ ERROR: Pre-flight check failed: Failed to decrypt secret with old key: ...
 
 ```bash
 # Production migration (no --dry-run flag)
-python scripts/rotate_dcr_encryption_key.py \
-  --old-key="${OLD_ENCRYPTION_KEY}" \
-  --new-key="${NEW_ENCRYPTION_KEY}" \
-  --database-url="${DATABASE_URL}" \
-  --verbose
+# DCR_OLD_KEY, DCR_NEW_KEY, DATABASE_URL already set in Step 5
+python scripts/rotate_dcr_encryption_key.py --verbose
 ```
 
 **Expected Output (successful rotation):**
@@ -826,7 +820,7 @@ Only update Secret Manager **after migration succeeds** (Step 6).
 
 ```bash
 # Add new encryption key version
-echo -n "${NEW_ENCRYPTION_KEY}" | gcloud secrets versions add dcr-encryption-key \
+echo -n "${DCR_NEW_KEY}" | gcloud secrets versions add dcr-encryption-key \
   --data-file=- \
   --project="${GOOGLE_CLOUD_PROJECT}"
 
@@ -886,12 +880,14 @@ True
 **Option A: Reverse migration (preferred)** — re-run the script with keys swapped:
 
 ```bash
-# Re-run migration with keys swapped to restore old encryption
-python scripts/rotate_dcr_encryption_key.py \
-  --old-key="${NEW_ENCRYPTION_KEY}" \
-  --new-key="${OLD_ENCRYPTION_KEY}" \
-  --database-url="${DATABASE_URL}" \
-  --verbose
+# Swap keys to reverse the migration
+_TMP="${DCR_OLD_KEY}"
+export DCR_OLD_KEY="${DCR_NEW_KEY}"
+export DCR_NEW_KEY="${_TMP}"
+unset _TMP
+
+# Re-run migration to restore old encryption
+python scripts/rotate_dcr_encryption_key.py --verbose
 ```
 
 Then disable the new key in Secret Manager and restart the marketplace handler:
@@ -943,8 +939,8 @@ gcloud run services update marketplace-handler \
 Zero encryption keys from environment and shell history after confirming rotation succeeded (or rollback completed):
 
 ```bash
-unset OLD_ENCRYPTION_KEY
-unset NEW_ENCRYPTION_KEY
+unset DCR_OLD_KEY
+unset DCR_NEW_KEY
 unset DATABASE_URL
 
 history -c
@@ -962,29 +958,30 @@ history -c
 
 ### Usage
 
+**Set secrets via environment variables** (CLI arguments expose keys in process listings):
+
+```bash
+export DCR_OLD_KEY=""
+export DCR_NEW_KEY=""
+export DATABASE_URL="postgresql+asyncpg://user:pass@host/db"
+```
+
 **Dry-Run Mode (recommended first):**
 
 ```bash
-python scripts/rotate_dcr_encryption_key.py \
-  --old-key="" \
-  --new-key="" \
-  --database-url="postgresql+asyncpg://user:pass@host/db" \
-  --dry-run
+python scripts/rotate_dcr_encryption_key.py --dry-run
 ```
 
 **Production Mode:**
 
 ```bash
-python scripts/rotate_dcr_encryption_key.py \
-  --old-key="" \
-  --new-key="" \
-  --database-url="postgresql+asyncpg://user:pass@host/db"
+python scripts/rotate_dcr_encryption_key.py
 ```
 
 **Verbose Output:**
 
 ```bash
-python scripts/rotate_dcr_encryption_key.py ... --verbose
+python scripts/rotate_dcr_encryption_key.py --verbose
 ```
 
 **Environment Variables (alternative to CLI args):**
diff --git a/scripts/rotate_dcr_encryption_key.py b/scripts/rotate_dcr_encryption_key.py
index 511c92b4..af6fb992 100644
--- a/scripts/rotate_dcr_encryption_key.py
+++ b/scripts/rotate_dcr_encryption_key.py
@@ -171,25 +171,21 @@ def parse_args() -> argparse.Namespace:
         description="Re-encrypt DCR OAuth client secrets from old Fernet key to new key",
         formatter_class=argparse.RawDescriptionHelpFormatter,
         epilog="""
+SECURITY: Use environment variables instead of CLI arguments for keys and
+database URLs. CLI arguments are visible in process listings (ps aux),
+shell history, and system audit logs.
+
 Examples:
+  # Set secrets via environment variables (recommended)
+  export DCR_OLD_KEY="..."
+  export DCR_NEW_KEY="..."
+  export DATABASE_URL="postgresql+asyncpg://..."
+
   # Dry-run mode (test only, no database changes)
-  python rotate-dcr-encryption-key.py \\
-    --old-key="" \\
-    --new-key="" \\
-    --database-url="postgresql+asyncpg://..." \\
-    --dry-run
+  python rotate_dcr_encryption_key.py --dry-run
 
   # Production mode (modifies database)
-  python rotate-dcr-encryption-key.py \\
-    --old-key="" \\
-    --new-key="" \\
-    --database-url="postgresql+asyncpg://..."
-
-  # Using environment variables
-  export DCR_OLD_KEY="..."
-  export DCR_NEW_KEY="..."
-  export DATABASE_URL="..."
-  python rotate-dcr-encryption-key.py --dry-run
+  python rotate_dcr_encryption_key.py
 """
     )