CSPL-4577 Codex Agentic DX & Harness Foundation #1738

Draft
vivekr-splunk wants to merge 8 commits into develop from cspl-4577

@vivekr-splunk
Collaborator

Summary

Establish a production-grade agentic DX foundation for SOK: AGENTS docs, skills, dev harness scripts, test generator improvements, upgrade helpers, and OSS governance + review hygiene.

Key Changes

  • AGENTS docs for core repo areas
  • 8 Codex skills with scope + DoD + output contract
  • Shared scripts/dev harness entrypoints
  • Testcase generator + agent docs
  • Upgrade verification helpers in testenv
  • PR template, CODEOWNERS, and governance docs

Testing and Verification

Not run (workflow changes only).

Suggested local gates:

  • scripts/dev/pr_check.sh
  • scripts/dev/unit.sh

Related Issues

  • Epic: CSPL-4577
  • Stories: CSPL-4578, CSPL-4579, CSPL-4580, CSPL-4581, CSPL-4582, CSPL-4583

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

CLA Assistant Lite bot: All contributors have signed the COC ✍️ ✅

Copilot AI left a comment

Pull request overview

This pull request establishes a production-grade agentic development experience (DX) foundation for the Splunk Operator for Kubernetes (SOK). It adds documentation, helper scripts, test generation tools, and governance files to support both human and agent-assisted development workflows.

Changes:

  • Added AGENTS.md documentation files across key repository areas (test/, pkg/splunk/enterprise/, internal/controller/, api/)
  • Created 8 Codex skills with detailed workflow documentation in .agents/skills/
  • Added shared development scripts under scripts/dev/ for common workflows (unit tests, linting, PR checks, etc.)
  • Enhanced test infrastructure with operator upgrade verification helpers in test/testenv/
  • Created Python test case generator (scripts/generate_testcase.py) with YAML/JSON spec support
  • Added verification scripts for CRD, RBAC, and bundle outputs
  • Established governance framework (CODEOWNERS, GOVERNANCE.md, CODE_OF_CONDUCT.md, SECURITY.md, SUPPORT.md, MAINTAINERS.md)
  • Added comprehensive agent-focused documentation under docs/agent/ (CRD_MAP.md, RECONCILE_FLOW.md, TEST_MATRIX.md, etc.)
  • Created PR and issue templates
  • Added GitHub Actions workflow for PR checks
  • Updated Makefile with verification targets
  • Removed unused code (telemetry functions, IngestorReady function) and simplified app verification logic

Reviewed changes

Copilot reviewed 65 out of 65 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| test/testenv/verificationutils.go | Added upgrade verification helpers (VerifyOperatorImage, VerifyPodImageContains, etc.); removed telemetry and IngestorReady functions; cleaned up unused imports |
| test/testenv/testcaseenv.go | Added UpdateOperatorImage method and GetOperatorName getter for upgrade testing |
| test/AGENTS.md | Documentation for test directory structure and conventions |
| templates/pull_request.md | PR template with standardized sections |
| scripts/verify_repo.sh | Main verification orchestration script with multiple verification modes |
| scripts/verify_crd.sh | CRD/RBAC generation verification script |
| scripts/verify_bundle.sh | Bundle and Helm CRD verification script |
| scripts/generate_testcase.py | Python tool to generate integration and KUTTL test scaffolds from YAML/JSON specs |
| scripts/dev/*.sh | Developer harness scripts for common workflows (unit tests, linting, PR checks, kind smoke tests, etc.) |
| scripts/debug_reconcile.sh | Debug data collection script for reconciliation issues |
| .agents/skills/*/SKILL.md | 8 skill documentation files defining workflows for various development tasks |
| docs/agent/*.md | Agent-focused documentation (CRD mapping, reconcile flow, test patterns, release flow, operations) |
| pkg/splunk/enterprise/AGENTS.md | Documentation for core operator logic directory |
| internal/controller/AGENTS.md | Documentation for reconciler directory |
| api/AGENTS.md | Documentation for CRD types directory |
| CODEOWNERS | Code ownership assignments (with TODO for team name) |
| GOVERNANCE.md | Project governance model |
| CODE_OF_CONDUCT.md | Contributor Covenant code of conduct |
| SECURITY.md | Security vulnerability reporting policy |
| SUPPORT.md | Support channels documentation |
| MAINTAINERS.md | List of current maintainers |
| README.md | Enhanced with agent harness section |
| Makefile | Added verify, verify-crd, verify-bundle, and verify-repo targets |
| .github/workflows/pr-check.yml | GitHub Actions workflow for PR verification |
| .github/pull_request_template.md | Enhanced PR template with suggested local gates |
| docs/CONTRIBUTING.md | Updated with agentic development workflow section and cleaned maintainers list |

Comments suppressed due to low confidence (1)

test/testenv/verificationutils.go:216

The function IngestorReady has been removed, but it is still called in test code:

  • test/index_and_ingestion_separation/index_and_ingestion_separation_test.go (lines 114, 226, 302)

This removal will cause compilation failures. Either restore the function or update all call sites to use an alternative verification function.


Comment on lines +1219 to +1220
func VerifyOperatorImage(ctx context.Context, testenvInstance *TestCaseEnv, expectedImage string) {
_ = ctx // reserved for future use

Copilot AI Feb 27, 2026

The context parameter ctx is declared but immediately discarded with _ = ctx. If the context is reserved for future use, consider removing it from the current implementation and adding it back when needed. Alternatively, if it's kept for API consistency, add a comment explaining this decision.

Copilot uses AI. Check for mistakes.
Comment on lines +178 to +195
install_step = f\"\"\"---\napiVersion: kuttl.dev/v1beta1\nkind: TestStep\ncommands:\n - command: {install_cmd}\n namespaced: true\n\"\"\"\n+ ready_assert = \"\"\"---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: splunk-operator-controller-manager\nstatus:\n readyReplicas: 1\n availableReplicas: 1\n\"\"\"\n+ upgrade_step = f\"\"\"---\napiVersion: kuttl.dev/v1beta1\nkind: TestStep\ncommands:\n - command: {upgrade_cmd}\n namespaced: true\n\"\"\"\n+ image_check_cmd = (\n+ f\"kubectl -n ${{{namespace_env}}} get deploy splunk-operator-controller-manager \"\n+ f\"-o jsonpath='{{{{.spec.template.spec.containers[?(@.name==\\\\\\\"manager\\\\\\\")].image}}}}' \"\n+ f\"| grep -q \\\"${{{operator_image_new_env}}}\\\"\"\n+ )\n+ image_assert_step = f\"\"\"---\napiVersion: kuttl.dev/v1beta1\nkind: TestStep\ncommands:\n - command: {image_check_cmd}\n namespaced: true\n\"\"\"\n+\n+ write_text(test_dir / \"00-install.yaml\", install_step, force)\n+ write_text(test_dir / \"01-assert-operator-ready.yaml\", ready_assert, force)\n+ write_text(test_dir / \"02-upgrade.yaml\", upgrade_step, force)\n+ write_text(test_dir / \"03-assert-operator-image.yaml\", image_assert_step, force)\n+
for index, cr in enumerate(crs):
api_version = cr.get("apiVersion", "")
kind = cr.get("kind", "")
cr_name = cr.get("name", "")
cr_path = Path(cr.get("path", "")).expanduser()
if not api_version or not kind or not cr_name:
print("[ERROR] crs entries must include apiVersion, kind, and name")
sys.exit(1)
if not cr_path.is_absolute():
cr_path = (repo_root / cr_path).resolve()
if not cr_path.exists():
print(f"[ERROR] CR manifest not found: {cr_path}")
sys.exit(1)
deploy_index = index if not upgrade_enabled else index + 4
deploy_name = f"{deploy_index:02d}-deploy-{slugify(kind)}.yaml"
deploy_target = test_dir / deploy_name
if deploy_target.exists() and not force:

Copilot AI Feb 27, 2026

Lines 178-195 contain malformed Python string syntax. The strings appear to use escaped quotes and newlines (f\"\"\"...\n\"\"\") instead of proper triple-quoted strings (f"""..."""). This will cause a Python syntax error. Each string variable assignment (install_step, ready_assert, upgrade_step, image_check_cmd, image_assert_step) needs to use proper Python triple-quoted string syntax.

Comment on lines 1218 to 1316
// VerifyOperatorImage verifies the operator pod is running the expected image
func VerifyOperatorImage(ctx context.Context, testenvInstance *TestCaseEnv, expectedImage string) {
_ = ctx // reserved for future use
var ns string
if testenvInstance.clusterWideOperator != "true" {
ns = testenvInstance.GetName()
} else {
ns = "splunk-operator"
}
logf.Log.Info("GetTelemetryLastSubmissionTime: retrieved status", "status", statusVal)
timeout := time.Duration(SpecifiedTestTimeout) * time.Second
gomega.Eventually(func() bool {
operatorPod := GetOperatorPodName(testenvInstance)
if operatorPod == "" {
logf.Log.Info("Operator pod not found yet", "namespace", ns)
return false
}
return podImageContains(ns, operatorPod, expectedImage)
}, timeout, PollInterval).Should(gomega.Equal(true))
}

var status telemetryStatus
if err := json.Unmarshal([]byte(statusVal), &status); err != nil {
logf.Log.Error(err, "GetTelemetryLastSubmissionTime: failed to unmarshal status", "statusVal", statusVal)
return ""
}
return status.LastTransmission
// VerifyPodImageContains verifies the pod is running a container image that contains expectedImage
func VerifyPodImageContains(ns string, podName string, expectedImage string) {
timeout := time.Duration(SpecifiedTestTimeout) * time.Second
gomega.Eventually(func() bool {
return podImageContains(ns, podName, expectedImage)
}, timeout, PollInterval).Should(gomega.Equal(true))
}

// VerifyTelemetry checks that the telemetry ConfigMap has a non-empty lastTransmission field in its status key.
func VerifyTelemetry(ctx context.Context, deployment *Deployment, prevVal string) {
logf.Log.Info("VerifyTelemetry: start")
// VerifySplunkPodImagesContain verifies all Splunk pods (excluding operator) are running expected image
func VerifySplunkPodImagesContain(ns string, expectedImage string) {
timeout := time.Duration(SpecifiedTestTimeout) * time.Second
gomega.Eventually(func() bool {
currentVal := GetTelemetryLastSubmissionTime(ctx, deployment)
if currentVal != "" && currentVal != prevVal {
logf.Log.Info("VerifyTelemetry: success", "previous", prevVal, "current", currentVal)
return true
pods := DumpGetPods(ns)
checked := 0
for _, pod := range pods {
if !isSplunkWorkloadPod(pod) {
continue
}
checked++
if !podImageContains(ns, pod, expectedImage) {
return false
}
}
return false
}, deployment.GetTimeout(), PollInterval).Should(gomega.Equal(true))
if checked == 0 {
logf.Log.Info("No Splunk pods found yet", "namespace", ns)
return false
}
return true
}, timeout, PollInterval).Should(gomega.Equal(true))
}

// TriggerTelemetrySubmission updates or adds the 'test_submission' key in the telemetry ConfigMap with a JSON value containing a random number.
func TriggerTelemetrySubmission(ctx context.Context, deployment *Deployment) {
const (
configMapName = "splunk-operator-manager-telemetry"
testKey = "test_submission"
)

// Generate a random number
rand.Seed(time.Now().UnixNano())
randomNumber := rand.Intn(1000)

// Create the JSON value
jsonValue, err := json.Marshal(map[string]int{"value": randomNumber})
if err != nil {
logf.Log.Error(err, "Failed to marshal JSON value")
return
// podImageContains checks if any container image or imageID on the pod contains expectedImage
func podImageContains(ns string, podName string, expectedImage string) bool {
if podName == "" {
logf.Log.Info("Pod name is empty; cannot verify image", "namespace", ns)
return false
}

// Update the ConfigMap
cm := &corev1.ConfigMap{}
err = deployment.testenv.GetKubeClient().Get(ctx, client.ObjectKey{Name: configMapName, Namespace: "splunk-operator"}, cm)
output, err := exec.Command("kubectl", "get", "pods", "-n", ns, podName, "-o", "json").Output()
if err != nil {
logf.Log.Error(err, "Failed to get ConfigMap")
return
cmd := fmt.Sprintf("kubectl get pods -n %s %s -o json", ns, podName)
logf.Log.Error(err, "Failed to execute command", "command", cmd)
return false
}

// Update the test_submission key
cm.Data[testKey] = string(jsonValue)
err = deployment.testenv.GetKubeClient().Update(ctx, cm)
restResponse := PodDetailsStruct{}
err = json.Unmarshal([]byte(output), &restResponse)
if err != nil {
logf.Log.Error(err, "Failed to update ConfigMap")
return
logf.Log.Error(err, "Failed to parse pod JSON")
return false
}
found := false
images := []string{}
for _, status := range restResponse.Status.ContainerStatuses {
if status.Image != "" {
images = append(images, status.Image)
}
if status.ImageID != "" {
images = append(images, status.ImageID)
}
if strings.Contains(status.Image, expectedImage) || strings.Contains(status.ImageID, expectedImage) {
found = true
}
}
logf.Log.Info("Pod image check", "pod", podName, "expected", expectedImage, "found", found, "images", images)
return found
}

func isOperatorPod(podName string) bool {
return strings.HasPrefix(podName, "splunk-op") || strings.HasPrefix(podName, "splunk-operator")
}

logf.Log.Info("Successfully updated telemetry ConfigMap", "key", testKey, "value", jsonValue)
func isSplunkWorkloadPod(podName string) bool {
if podName == "" {
return false
}
if isOperatorPod(podName) {
return false
}
return strings.HasPrefix(podName, "splunk-")
}

Copilot AI Feb 27, 2026

The functions GetTelemetryLastSubmissionTime, VerifyTelemetry, and TriggerTelemetrySubmission have been removed but they are still being called in multiple test files:

  • test/custom_resource_crud/custom_resource_crud_c3_test.go (lines 72, 86, 87)
  • test/custom_resource_crud/custom_resource_crud_m4_test.go (lines 68, 86, 87)
  • test/custom_resource_crud/custom_resource_crud_s1_test.go (lines 68, 76, 77)

This removal will cause compilation failures. Either restore these functions or update all call sites to remove the telemetry verification steps.

Comment on lines +140 to +147
containerIndex := 0
for i, container := range operator.Spec.Template.Spec.Containers {
if container.Name == "manager" {
containerIndex = i
break
}
}
operator.Spec.Template.Spec.Containers[containerIndex].Image = image

Copilot AI Feb 27, 2026

containerIndex is initialized to 0 and only changes when the loop finds a container named "manager". If no container has that name, the image of the first container is overwritten instead, and if the container list is empty, indexing position 0 panics. Consider validating that the "manager" container was found before applying the update.

Comment on lines +1 to +2
# TODO: Replace this placeholder team with real owners.
* @splunk/splunk-operator-for-kubernetes

Copilot AI Feb 27, 2026

The CODEOWNERS file contains a TODO comment indicating that the placeholder team needs to be replaced with actual owners. This should be addressed before merging to ensure proper code ownership is established.

2 participants