From 46e446de2043e96678c7ee398c5bcda99dbacb1d Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Sun, 14 Dec 2025 13:18:05 -0500 Subject: [PATCH 1/7] Condense the kdevops project CLAUDE.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When I start claude, it warns: ⚠Large CLAUDE.md will impact performance (53.1k chars > 40.0k) When asked for specifics, claude responded: > 1. Context usage: The entire file is loaded into every > conversation, consuming context window. At ~52KB, it's > substantial but manageable with modern context windows. > 2. Signal-to-noise ratio: The file contains valuable project- > specific guidance (Kconfig patterns, A/B testing, variable > scope issues, commit conventions). However, some sections > are quite detailed - like the multi-page examples for variable > scope debugging. > 3. Redundancy: Some patterns are explained multiple times (e.g., > the CLI override pattern appears in several places with similar > examples). It should be straightforward to restore portions of the file if misbehavior occurs, or just revert this patch. Generated-by: Claude AI Signed-off-by: Chuck Lever --- CLAUDE.md | 1499 +++++++---------------------------------------------- 1 file changed, 196 insertions(+), 1303 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 4e2a5953..c8ed0005 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -29,252 +29,92 @@ Device Under Test (DUT) systems. - **ALWAYS** collect real data from actual fio, fstests, blktests execution - **ALWAYS** verify that test systems are running and accessible before analysis -### Valid approaches: -- ✅ Execute real tests: `ansible all -m shell -a "fio --output-format=json ..."` -- ✅ Collect actual JSON output from running VMs -- ✅ Parse and analyze real performance data from live systems -- ✅ Generate graphs and reports from actual test execution results - -### Forbidden approaches: -- ❌ Creating synthetic performance numbers for demonstrations -- ❌ Generating mock test results for visualization examples -- ❌ Fabricating benchmark data to show workflows -- ❌ Using placeholder values like "let's assume 50K IOPS" - **Violation of this rule undermines the entire purpose of the kdevops testing framework and produces misleading results that could affect important development decisions.** ## Core Architecture ### Build System -- **Configuration**: Uses Linux kernel Kconfig system (`make menuconfig`, - `make nconfig`, `make dynconfig`) +- **Configuration**: Linux kernel Kconfig system (`make menuconfig`, `make nconfig`, `make dynconfig`) - **Build**: Make-based with modular Makefiles (`Makefile.*`) - **Automation**: Ansible playbooks in `playbooks/` directory -- **Infrastructure**: Supports virtualization (libguestfs/libvirt), - cloud providers (AWS, Azure, GCE, OCI), and bare metal +- **Infrastructure**: Virtualization (libguestfs/libvirt), cloud (AWS, Azure, GCE, OCI), bare metal ### Key Components -- `workflows/` - Testing workflows (fstests, blktests, selftests, CXL, - mmtests, NFS, etc.) +- `workflows/` - Testing workflows (fstests, blktests, selftests, CXL, mmtests, NFS, etc.) - `playbooks/` - Ansible automation with 40+ specialized roles -- `kconfigs/` - Modular configuration system. Provides modeling variabiilty - support. This is essentially support for kconfig taken from - the Linux kernel, with support for special "output yaml" - support to allow the project to leverage kconfig symbols on - ansible through the `extra_vars.yaml` file which is always - present. 
-- `defconfigs/` - Pre-built configurations for common setups. These are - extremely useful for supporting not only easy quick setups - but are also instrumental for our continuous integration - support. +- `kconfigs/` - Modular configuration system with "output yaml" support for `extra_vars.yaml` +- `defconfigs/` - Pre-built configurations for quick setups and CI support - `scripts/` - Workflow automation and helper scripts ## Common Development Commands -### Configuration and Setup -```bash -make menuconfig # Interactive configuration -make dynconfig # Supports dynamically generated kconfig files -make defconfig- # Use predefined configuration (see defconfigs/) -make # Build dependencies and setup -make bringup # Provision and configure systems -make destroy # Destroy provisioned systems -``` - -### Kernel Development -```bash -make linux # Build and install kernel -make linux HOSTS="host1 host2" # Target specific hosts -make linux-uninstall KVER="6.6.0-rc2" # Uninstall specific kernel version -``` - -### Testing Workflows -```bash -# Filesystem testing -make fstests # Run filesystem tests -make fstests-baseline # Establish baseline -make fstests-results # View results - -# Block layer testing -make blktests # Run block layer tests -make blktests-baseline # Establish baseline -make blktests-results # View results - -# Linux kernel selftests -make selftests # Run all selftests -make selftests-firmware # Run specific test suite -make selftests-kmod # Run kernel module tests - -# Other workflows -make pynfs # NFS testing -make gitr # Git regression testing -make ltp # Linux Test Project -make sysbench # Database performance testing - -# Memory management testing -make mmtests # Run memory management tests -make mmtests-compare # Compare baseline vs dev results (A/B testing) -``` - -### Development Utilities ```bash -make help # Show available targets -make V=1 [target] # Verbose build output -ANSIBLE_VERBOSITY=1-6 make [target] # Ansible verbose output (levels 0-6) -make dynconfig # Generate dynamic configuration -make style # Check for whitespace issues - ALWAYS run before completing work -make fix-whitespace-last-commit # Fixes commit white space damage -make mrproper # Clean everything and restart from scratch -``` - -### Ansible Callbacks - -kdevops supports multiple Ansible stdout callback plugins (dense, debug, -diy, lucid, or custom). The default is dense. 
- -See [docs/ansible-callbacks.md](docs/ansible-callbacks.md) for: -- Supported plugins and configuration -- Command line override via `ANSIBLE_CFG_CALLBACK_PLUGIN` -- Lucid plugin features and parameters - -## Key Workflows - -### fstests (Filesystem Testing) -- **Purpose**: Comprehensive filesystem testing for XFS, Btrfs, EXT4, CIFS, NFS, tmpfs -- **Features**: Expunge list management, baseline tracking, regression detection -- **Location**: `workflows/fstests/` -- **Config**: Enable fstests workflow in menuconfig - -### blktests (Block Layer Testing) -- **Purpose**: Block layer subsystem testing -- **Supports**: NVMe, SCSI, loop devices, NBD, ZBD -- **Location**: `workflows/blktests/` -- **Features**: Similar baseline/regression tracking as fstests - -### Linux Kernel Building -- **Source Management**: Multiple git trees (Linus, stable, next, subsystem trees) -- **Features**: 9P filesystem for host-guest development, mirror support -- **Location**: `workflows/linux/` - -### selftests (Kernel Selftests) -- **Purpose**: Parallel execution of Linux kernel selftests -- **Supports**: firmware, kmod, sysctl, and other kernel subsystem tests -- **Location**: `workflows/selftests/` - -### mmtests (Memory Management Testing) -- **Purpose**: Comprehensive memory management and performance testing -- **Features**: A/B testing support, automated performance comparison, visual analysis -- **Location**: `workflows/mmtests/` -- **Key Capabilities**: - - Run various memory and performance benchmarks (thpcompact, thpchallenge, etc.) - - A/B testing between baseline and development kernels - - Automated performance comparison with statistical analysis - - Graph generation for performance visualization - - HTML reports with embedded performance graphs - -#### mmtests A/B Testing and Comparison -The mmtests workflow supports advanced A/B testing for kernel performance regression detection: - -```bash -# Setup A/B testing configuration -make defconfig-mmtests-ab-testing # Basic A/B testing -make defconfig-mmtests-ab-testing-thpcompact # With monitoring - -# Run the workflow -make bringup # Provision baseline and dev nodes -make mmtests # Run tests on both nodes -make mmtests-compare # Generate comparison reports - -# Results location -# workflows/mmtests/results/compare/ -# - comparison.html # Main HTML report -# - comparison.txt # Text-based comparison -# - graph-*.png # Performance graphs -# - comparison_report.html # Enhanced report with embedded graphs -``` - -**Comparison Features**: -- Automated collection of results from baseline and dev nodes -- Statistical analysis of performance differences -- Multiple visualization formats: - - Performance trend graphs - - Sorted performance comparisons - - Smoothed data analysis - - System monitoring graphs (vmstat, mpstat, proc stats) -- Professional HTML reports with: - - Summary statistics - - Detailed per-metric comparisons - - Embedded performance graphs - - Color-coded performance indicators - -**Technical Implementation**: -- Local mmtests repository management with patch support -- Support for fixing known mmtests issues via patches in `workflows/mmtests/fixes/` -- Python and shell scripts for advanced graph generation -- Robust error handling and dependency management - -### fio-tests (Storage Performance Testing) -- **Purpose**: Comprehensive storage performance analysis using fio -- **Supports**: Block devices and filesystem testing with various configurations -- **Features**: - - Configurable test matrices (block sizes, IO depths, job counts) - - 
Multiple workload patterns (random/sequential, read/write, mixed) - - Filesystem-specific testing (XFS, ext4, btrfs) with different configurations - - Block size ranges for realistic I/O patterns - - Performance visualization and graphing - - A/B testing for baseline vs development comparisons -- **Location**: `workflows/fio-tests/` -- **Config**: Enable fio-tests workflow in menuconfig - -#### fio-tests Filesystem Testing -The fio-tests workflow supports both direct block device testing and filesystem-based testing: - -**Block Device Testing**: Direct I/O to storage devices for raw performance analysis -**Filesystem Testing**: Tests against mounted filesystems to analyze filesystem-specific performance characteristics - -**Supported Filesystems**: -- **XFS**: Various block sizes (4K-64K) with different sector sizes and features (reflink, rmapbt) -- **ext4**: Standard and bigalloc configurations with different cluster sizes -- **btrfs**: Modern features including no-holes, free-space-tree, and compression options - -**Key Configuration Options**: -- Block size testing: Fixed sizes (4K-128K) or ranges (e.g., 4K-16K) for realistic workloads -- Filesystem features: Enable specific filesystem optimizations and features -- Test patterns: Random/sequential read/write, mixed workloads with configurable ratios -- Performance tuning: IO engines (io_uring, libaio), direct I/O, fsync behavior - -**Example Defconfigs**: -- `defconfig-fio-tests-fs-xfs`: XFS filesystem with 16K block size testing -- `defconfig-fio-tests-fs-ext4-bigalloc`: ext4 with bigalloc and 32K clusters -- `defconfig-fio-tests-fs-btrfs-zstd`: btrfs with zstd compression -- `defconfig-fio-tests-fs-ranges`: Block size range testing with XFS +# Configuration +make menuconfig # Interactive configuration +make dynconfig # Generate dynamic configuration +make defconfig- # Use predefined configuration +make # Build dependencies and setup +make bringup # Provision and configure systems +make destroy # Destroy provisioned systems + +# Kernel Development +make linux # Build and install kernel +make linux HOSTS="host1" # Target specific hosts + +# Testing Workflows +make fstests # Run filesystem tests +make blktests # Run block layer tests +make mmtests # Run memory management tests +make mmtests-compare # A/B testing comparison +make fio-tests # Run storage performance tests +make selftests # Run kernel selftests + +# Quality Control +make style # Check whitespace/formatting - ALWAYS run before completing +make fix-whitespace-last-commit # Fix whitespace in last commit +make help # Show available targets +ANSIBLE_VERBOSITY=1-6 make [target] # Ansible verbose output +``` + +See [docs/ansible-callbacks.md](docs/ansible-callbacks.md) for Ansible callback plugin configuration. 
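+
+As a quick illustration (a sketch only; the supported plugin names and the
+exact override semantics are documented in the linked file), the stdout
+callback can be selected per invocation:
+
+```bash
+# Override the configured stdout callback plugin for a single run;
+# lucid is one of the supported plugins
+ANSIBLE_CFG_CALLBACK_PLUGIN=lucid make bringup
+```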
+ +## Key Workflows Summary + +| Workflow | Purpose | Location | Key Features | +|----------|---------|----------|--------------| +| **fstests** | Filesystem testing | `workflows/fstests/` | XFS, Btrfs, EXT4, NFS; expunge lists; baseline tracking | +| **blktests** | Block layer testing | `workflows/blktests/` | NVMe, SCSI, loop, NBD, ZBD; regression detection | +| **mmtests** | Memory/performance | `workflows/mmtests/` | A/B testing, statistical analysis, HTML reports with graphs | +| **fio-tests** | Storage performance | `workflows/fio-tests/` | Block/filesystem testing, multi-FS comparison, graphing | +| **selftests** | Kernel selftests | `workflows/selftests/` | Parallel execution of firmware, kmod, sysctl tests | +| **linux** | Kernel building | `workflows/linux/` | Multiple git trees, 9P filesystem, mirror support | + +### fio-tests Filesystem Testing +- **Block Device Testing**: Direct I/O for raw performance +- **Filesystem Testing**: XFS (4K-64K blocks), ext4 (bigalloc), btrfs (compression) +- **Multi-Filesystem**: Section-based comparison (xfs-4k vs xfs-16k, xfs vs ext4 vs btrfs) +- **Key Features**: Block size ranges, A/B testing, performance visualization ## Architecture Highlights ### Configuration System -- Uses Linux kernel Kconfig for consistent configuration management -- Modular configuration files in `kconfigs/` for different subsystems -- Dynamic configuration generation with `make dynconfig` -- Pre-built configurations in `defconfigs/` directory +- Linux kernel Kconfig for configuration management +- Dynamic configuration with `make dynconfig` +- CLI overrides via environment variables (see Kconfig CLI Override Patterns below) ### Workflow System -- Each workflow has dedicated Kconfig and Makefile -- Workflow-specific scripts in `scripts/workflows/` -- Ansible roles for automation in `playbooks/roles/` +- Dedicated Kconfig and Makefile per workflow +- Workflow scripts in `scripts/workflows/` +- Ansible roles in `playbooks/roles/` - Result collection and baseline management ### Infrastructure Support -- **Virtualization**: libguestfs with libvirt (recommended), legacy Vagrant -- **Cloud**: AWS, Azure, GCE, OCI support via Terraform -- **PCIe Passthrough**: Real hardware testing in VMs with dynamic device assignment -- **Mirror Support**: For air-gapped environments - -### Kernel CI Features -- Built-in continuous integration support -- Baseline management for known test failures -- Regression detection with git-style diff output -- Watchdog systems for automated recovery from hung tests +- Virtualization: libguestfs/libvirt (recommended), legacy Vagrant +- Cloud: AWS, Azure, GCE, OCI via Terraform +- PCIe Passthrough: Real hardware in VMs +- Mirror Support: Air-gapped environments +- Kernel CI: Baseline management, regression detection, watchdog systems ## Important File Locations @@ -285,983 +125,193 @@ The fio-tests workflow supports both direct block device testing and filesystem- - `scripts/workflows/` - Workflow-specific helper scripts - `docs/` - Comprehensive documentation -## Development Patterns - -1. **Configuration-Driven**: Everything configurable through Kconfig -2. **Modular Design**: Workflow-specific components included conditionally -3. **Ansible Automation**: Role-based infrastructure and testing automation -4. **Baseline Management**: Comprehensive tracking of known failures and regressions -5. 
**Template Generation**: Dynamic file generation based on configuration - ## Adding New Workflows -When adding a new workflow to kdevops, you must add node generation rules to both -`playbooks/roles/gen_nodes/tasks/main.yml` and `playbooks/roles/gen_hosts/tasks/main.yml` -to avoid the failure "dedicated workflow has no rules for node configuration". - -### Required Additions - -For each new workflow, add the following sections to both playbooks: - -#### gen_nodes playbook -Add node generation rules with appropriate conditional logic based on whether the -workflow uses individual test configuration flags (like mmtests) or choice-based -configuration (like fio-tests): - -```yaml -# For workflows with individual test flags (like mmtests) -- name: Infer enabled WORKFLOW test section types - set_fact: - workflow_enabled_test_types: >- - {{ - [kdevops_host_prefix + '-'] - | product( - lookup('file', topdir_path + '/.config') - | regex_findall('^CONFIG_WORKFLOW_ENABLE_(.*)=y$', multiline=True) - | map('lower') - | list - ) - | map('join') - | list - }} - when: - - kdevops_workflows_dedicated_workflow - - kdevops_workflow_enable_WORKFLOW - - ansible_nodes_template.stat.exists - - not kdevops_baseline_and_dev - -# For workflows with choice-based configuration (like fio-tests) -- name: Generate the WORKFLOW kdevops nodes file using {{ kdevops_nodes_template }} as jinja2 source template - tags: [ 'hosts' ] - vars: - node_template: "{{ kdevops_nodes_template | basename }}" - nodes: "{{ [kdevops_host_prefix + '-WORKFLOW'] }}" - all_generic_nodes: "{{ [kdevops_host_prefix + '-WORKFLOW'] }}" - template: - src: "{{ node_template }}" - dest: "{{ topdir_path }}/{{ kdevops_nodes }}" - force: yes - when: - - kdevops_workflows_dedicated_workflow - - kdevops_workflow_enable_WORKFLOW - - ansible_nodes_template.stat.exists -``` - -#### gen_hosts playbook -Add host file generation task for the workflow: - -```yaml -- name: Generate the Ansible hosts file for a dedicated WORKFLOW setup - tags: [ 'hosts' ] - template: - src: "{{ kdevops_hosts_template }}" - dest: "{{ topdir_path }}/{{ kdevops_hosts }}" - force: yes - trim_blocks: True - lstrip_blocks: True - when: - - kdevops_workflows_dedicated_workflow - - kdevops_workflow_enable_WORKFLOW - - ansible_hosts_template.stat.exists -``` - -#### Update the generic hosts template -Add support for your workflow in the generic hosts template at -`playbooks/roles/gen_hosts/templates/hosts.j2`. 
Find the dedicated workflow -section and add your workflow's conditional logic: - -```jinja2 -{% if kdevops_workflows_dedicated_workflow %} -{% if kdevops_workflow_enable_WORKFLOW %} -[all] -{{ kdevops_host_prefix }}-WORKFLOW -{% if kdevops_baseline_and_dev %} -{{ kdevops_host_prefix }}-WORKFLOW-dev -{% endif %} - -[all:vars] -ansible_python_interpreter = "{{ kdevops_python_interpreter }}" - -[baseline] -{{ kdevops_host_prefix }}-WORKFLOW - -[baseline:vars] -ansible_python_interpreter = "{{ kdevops_python_interpreter }}" - -{% if kdevops_baseline_and_dev %} -[dev] -{{ kdevops_host_prefix }}-WORKFLOW-dev - -[dev:vars] -ansible_python_interpreter = "{{ kdevops_python_interpreter }}" - -{% endif %} -[WORKFLOW] -{{ kdevops_host_prefix }}-WORKFLOW -{% if kdevops_baseline_and_dev %} -{{ kdevops_host_prefix }}-WORKFLOW-dev -{% endif %} - -[WORKFLOW:vars] -ansible_python_interpreter = "{{ kdevops_python_interpreter }}" -{% else %} -``` - -### Examples - -Refer to the existing mmtests implementation for workflows with multiple individual -test configuration flags, or the fio-tests implementation for workflows with -choice-based configuration patterns. - -**Important**: All workflows use the same generic hosts template in -`playbooks/roles/gen_hosts/templates/hosts.j2`. Do NOT create workflow-specific -template files. Instead, extend the generic template with conditional logic -for your workflow. - -## Quick Setup Examples - -### XFS Filesystem Testing -```bash -make defconfig-xfs # Configure for XFS testing -make bringup # Setup test environment -make fstests # Run filesystem tests -``` - -### Kernel Development Environment -```bash -make menuconfig # Configure kernel workflow -make bringup # Setup development VMs -make linux # Build and install kernel -``` - -### Block Layer Testing with NVMe -```bash -make defconfig-blktests_nvme -make bringup -make blktests -``` - -### Storage Performance Testing with fio-tests - -#### XFS Filesystem Performance Testing -```bash -make defconfig-fio-tests-fs-xfs # Configure for XFS 16K block size testing -make bringup # Setup test environment with filesystem -make fio-tests # Run comprehensive performance tests -make fio-tests-results # Collect and analyze results -``` - -#### ext4 with Bigalloc Testing -```bash -make defconfig-fio-tests-fs-ext4-bigalloc # Configure ext4 with 32K clusters -make bringup -make fio-tests -``` - -#### btrfs with Compression Testing -```bash -make defconfig-fio-tests-fs-btrfs-zstd # Configure btrfs with zstd compression -make bringup -make fio-tests -``` - -#### Block Size Range Testing -```bash -make defconfig-fio-tests-fs-ranges # Configure XFS with block size ranges -make bringup # Test realistic I/O patterns (4K-16K, etc.) -make fio-tests -``` - -## Testing and Quality Assurance +When adding a workflow, add node generation rules to: +1. `playbooks/roles/gen_nodes/tasks/main.yml` (individual test flags or choice-based config) +2. `playbooks/roles/gen_hosts/tasks/main.yml` (host file generation) +3. 
`playbooks/roles/gen_hosts/templates/hosts.j2` (extend generic template with workflow logic) -- Expunge lists track known test failures in `workflows/*/expunges/` -- Baseline commands establish expected test results -- Results commands show test outcomes and regressions -- Watchdog scripts provide automated monitoring for long-running tests -- Integration with kernel development workflows and patchwork systems - -This framework is designed by kernel developers for kernel developers, -providing production-ready automation for kernel testing and development -workflows. +**Examples**: See mmtests (individual test flags) or fio-tests (choice-based) implementations. +**Important**: Do NOT create workflow-specific template files; extend the generic hosts template. ## Git Commit Guidelines All commits must follow these 5 rules: ### Rule 1/5: One Commit Per Change - -As with the Linux kernel, this project prefers commits to be atomic and to -the point. We don't want spell fixes to be blended in with code changes. -Spell fixes should go into separate commits. When in doubt, just don't do -any spell fixes unless asked explicitly to do that. +Atomic commits only. Don't mix spell fixes with code changes. When in doubt, don't do spell fixes unless explicitly asked. ### Rule 2/5: Use the Signed-off-by Tag - -We want to use the Signed-off-by tag which embodies the application of the -Developer Certificate or Origin. Use the git configured user name and email -for the Signed-off-by tag (check with `git config user.name` and -`git config user.email`). +Use git configured user name and email (check with `git config user.name` and `git config user.email`). ### Rule 3/5: Use Generated-by: Claude AI +Add this tag before Signed-off-by for AI-generated code. -Use this tag for code generated by Claude code AI. Put this before the -Signed-off-by tag. +**CRITICAL FORMATTING**: Generated-by MUST be immediately followed by Signed-off-by with NO empty lines between them. -**CRITICAL FORMATTING RULE**: When using "Generated-by: Claude AI", it MUST be -immediately followed by the "Signed-off-by:" tag with NO empty lines between them. -These two lines must be consecutive. - -Correct format: ``` Subject line -Detailed description of changes... - -Generated-by: Claude AI -Signed-off-by: Your Name -``` +Detailed description... -**WRONG** - Do NOT add empty lines between Generated-by and Signed-off-by: -``` Generated-by: Claude AI - -Signed-off-by: Your Name -``` - -**WRONG** - Do NOT add extra empty lines: -``` -Generated-by: Claude AI - - -Signed-off-by: Your Name +Signed-off-by: Name ``` ### Rule 4/5: Avoid Shopping Cart Lists **CRITICAL RULE: NEVER USE BULLET POINTS OR ITEMIZED LISTS IN COMMIT MESSAGES** -Generative AI seems to like to make commit logs long itemized lists of things -it did. This is stupid. This should be avoided. It is creating very silly -commit logs. Use plain English and get to the point. Be as clear as possible -and get to the point of not what you want to communicate, but rather what -will make a reviewer easily understand what the heck you are implementing. - -You should *think* hard about your commit log, always. +Use plain English paragraphs. Be clear and focused on helping reviewers understand the implementation. 
-**WRONG - Shopping cart list with bullet points:** +**WRONG**: ``` Refactored to separate concerns: -- Distribution files handle package installation and set nfs_server_service - variable (nfs-kernel-server for Debian/Ubuntu, nfs-server for RedHat/Fedora) -- Single systemd task in main.yml handles service enablement using the variable -``` - -**WRONG - Change list:** -``` -Fix by changing: - - mirror_service_status.item → mirror_service_status.results - - mirror_timer_status.item → mirror_timer_status.results +- Distribution files handle package installation +- Single systemd task handles service enablement ``` -**Correct - Plain English:** +**Correct**: ``` Each distribution file now handles package installation and sets the -nfs_server_service variable to the appropriate service name for that -distribution. A single systemd task in main.yml then handles service -enablement using the variable. -``` - -**Correct - Plain English:** -``` -Change both debug tasks to iterate over the .results list instead of -the non-existent .item attribute. +nfs_server_service variable. A single systemd task in main.yml then +handles service enablement using the variable. ``` ### Rule 5/5: Run make style Before Committing -**IMPORTANT**: Before completing any work, you MUST run `make style` to check for -both whitespace issues and commit message formatting. This ensures code consistency -and prevents formatting issues from being introduced into the codebase. - -The style checker will identify: -- Trailing whitespace -- Mixed tabs and spaces -- Files without newlines at EOF -- Other whitespace-related issues -- Incorrect commit message formatting (Generated-by/Signed-off-by spacing) - -Fix all reported issues before submitting your work. The `make style` command -checks both file whitespace and the most recent commit message format. - -### Commit Message Template for AI Assistants +**IMPORTANT**: Always run `make style` before completing work. This checks: +- Trailing whitespace, mixed tabs/spaces, missing newlines at EOF +- Commit message formatting (Generated-by/Signed-off-by spacing) -**IMPORTANT: All AI-generated commits MUST follow this exact format:** +### Commit Message Template ``` subsystem: brief description in imperative mood (max 50 chars) -Detailed explanation of the problem being solved and why the change -is needed. Use plain English paragraphs - NEVER use bullet points or -itemized lists. +Detailed explanation of the problem and why the change is needed. +Use plain English paragraphs - NEVER bullet points or lists. -Explain what the change does and how it solves the problem. Focus on -clarity for reviewers who need to understand the implementation. - -Multiple paragraphs are fine when needed to explain complex changes. +Explain what the change does and how it solves the problem. Focus +on clarity for reviewers. Generated-by: Claude AI Signed-off-by: Name ``` -**Key requirements:** The subject line must use subsystem prefix in imperative -mood with a maximum of 50 characters. The body must use plain English paragraphs -only with NO bullet points or lists. Generated-by must be immediately followed -by Signed-off-by with no blank lines between them. Use the values from -`git config user.name` and `git config user.email` for the Signed-off-by tag. 
- ## Code Quality Requirements ### Rust Code Quality - -For Rust code in kdevops (workflows/rcloud, etc.), ALWAYS run both: - +Always run for Rust code (workflows/rcloud, etc.): ```bash -# Format code using Linux kernel rustfmt standards -cargo fmt - -# Check for common mistakes and idioms +cargo fmt # Format using Linux kernel rustfmt standards cargo clippy --all-targets --all-features -- -D warnings ``` -**Rust Quality Checklist**: -- ✅ Run `cargo fmt` to auto-format code according to .rustfmt.toml -- ✅ Run `cargo clippy` with `-D warnings` (treat warnings as errors) -- ✅ Fix ALL clippy warnings before committing -- ✅ Common clippy fixes: remove unnecessary casts, use `.flatten()` instead of manual `if let Ok`, remove unused imports - -The install-rust-deps role provides both cargo/rustc and the .rustfmt.toml -configuration from the Linux kernel, ensuring consistent Rust code quality -across all kdevops workflows. - ### Automatic Whitespace Fixing - -For convenience, you can automatically fix whitespace issues using: ```bash python3 scripts/fix_whitespace_issues.py # Fix all modified files python3 scripts/fix_whitespace_issues.py file1 file2 # Fix specific files +make style # Verify all issues resolved ``` -The fixer script will: -- Remove trailing whitespace from lines -- Add missing newlines at end of files -- Reduce excessive blank lines to maximum 2 consecutive - -Always run `make style` after using the fixer to verify all issues are resolved. - -### Verifying commit has no white space damage - -Run the following after you commit something: -```bash -make fix-whitespace-last-commit -``` - -This will fix all white space only for new files you add. - ### Testing Generated Kconfig Files - -When working with scripts that generate Kconfig files (like `terraform/*/scripts/gen_kconfig_*`), -the indentation checker cannot properly validate Jinja2 template files (.j2) because they -can generate any kind of output, not just Kconfig. - -**Correct approach**: Generate the output to a file named with "Kconfig" prefix and test that: - +Jinja2 templates (.j2) cannot be validated directly. Generate output and test that: ```bash # Example: Testing AWS AMI Kconfig generation -cd terraform/aws/scripts python3 gen_kconfig_ami --quiet > /tmp/Kconfig.ami.test 2>&1 python3 ../../../scripts/detect_indentation_issues.py /tmp/Kconfig.ami.test ``` -The indentation checker recognizes files starting with "Kconfig" and applies the correct -rules (tabs for indentation, tab+2spaces for help text is valid). - -**Why this matters**: Jinja2 templates (.j2) are generic and can generate Python, YAML, -Kconfig, or any other format. The style checker cannot determine the output format from -the template filename alone. Always test the generated output, not the template. - ## Complex System Interactions -kdevops integrates multiple subsystems (Ansible, Kconfig, Git, Make) that often -interact in non-obvious ways. Understanding these interactions is crucial for -effective debugging and development. 
- ### Ansible Architecture Patterns -#### Host vs Control Node Execution -kdevops uses several Ansible execution patterns that affect variable scope: - -- **Control Host Execution**: `run_once: true, delegate_to: localhost` - - Executes once on the control host, not on target nodes - - Per-node variables may not be available in localhost context - - Common in 9P filesystem builds where single build is shared to all guests - - Use `hostvars[groups['group_name'][0]]['variable_name']` to access node-specific vars - -- **Variable Resolution Issues**: - - Variables set per-node (like A/B testing configs) aren't automatically available on localhost - - Need explicit variable resolution for cross-context access - - Git repository state must be managed carefully when switching between refs +**Execution Contexts**: +- **Normal node**: Runs on each target node, variables are per-node +- **Control host**: `run_once: true, delegate_to: localhost` - runs once on control host +- **Per-node vars not available in localhost context** - use `hostvars[groups['group'][0]]['var']` -#### A/B Testing Variable Management -```yaml -# Detect dev nodes by hostname pattern -- name: Determine if this is a dev node for A/B testing - set_fact: - bootlinux_is_dev_node: "{{ ansible_hostname | regex_search('^.*-dev$') is not none }}" - -# Resolve active parameters for 9P builds -- name: Determine active kernel parameters for A/B testing with 9P - set_fact: - active_linux_ref: "{{ hostvars[groups['dev'][0]]['target_linux_ref'] if 'dev' in group_names else target_linux_ref }}" - run_once: true - delegate_to: localhost -``` +**A/B Testing Variable Management**: +- Detect dev nodes: `bootlinux_is_dev_node: "{{ ansible_hostname | regex_search('^.*-dev$') is not none }}"` +- Resolve for 9P builds: Use global config vars instead of fragile hostvars access ### Kconfig Dynamic Configuration Patterns -#### Shell Command Integration +**Shell Command Integration**: ```kconfig config BOOTLINUX_DEV_TREE_REF string "Development kernel reference" default $(shell, scripts/infer_last_stable_kernel.sh) - help - The default is automatically inferred as the most recent stable - kernel version from the git repository. ``` +Best practices: Fallback values, test in different environments -**Best Practices**: -- Always provide fallback values in scripts -- Place scripts in `scripts/` directory -- Use conditional defaults: `default VALUE if CONDITION` -- Test scripts work in different environments - -#### Dependencies and Conflicts +**Dependencies and Conflicts**: ```kconfig config BOOTLINUX_SHALLOW_CLONE bool "Shallow git clone" - default y if !KDEVOPS_BASELINE_AND_DEV - depends on !BOOTLINUX_AB_DIFFERENT_REF - help - This option is automatically disabled when using A/B testing with - different kernel references, as shallow clones may not contain all - the required refs for checkout. -``` - -**Key Patterns**: -- `depends on !CONFIG_OPTION` - Prevent incompatible combinations -- `default y if !OTHER_CONFIG` - Conditional defaults -- Document why restrictions exist in help text - -#### CLI Override Patterns - -Environment variable override support enables runtime configuration changes without -recompiling. This is essential for CI/demo scenarios where quick test execution -is needed. 
- -**Basic CLI Override Detection**: -```kconfig -config FIO_TESTS_QUICK_TEST_SET_BY_CLI - bool - output yaml - default $(shell, scripts/check-cli-set-var.sh FIO_TESTS_QUICK_TEST) - -config FIO_TESTS_QUICK_TEST - bool "Enable quick test mode for CI/demo" - default y if FIO_TESTS_QUICK_TEST_SET_BY_CLI - help - Quick test mode reduces test matrix and runtime for rapid validation. - Can be enabled via environment variable: FIO_TESTS_QUICK_TEST=y -``` - -**Runtime Parameter Overrides**: -```kconfig -config FIO_TESTS_RUNTIME - string "Test runtime in seconds" - default "15" if FIO_TESTS_QUICK_TEST - default "300" - help - Runtime can be overridden via environment variable: FIO_TESTS_RUNTIME=60 + depends on !BOOTLINUX_AB_DIFFERENT_REF # Prevent incompatible combinations ``` -**Best Practices for CLI Overrides**: -- Create `*_SET_BY_CLI` detection variables using `scripts/check-cli-set-var.sh` -- Use conditional defaults to automatically adjust configuration when CLI vars detected -- Implement intelligent test matrix reduction for quick modes -- Provide meaningful defaults that work in CI environments (e.g., `/dev/null` for I/O tests) -- Document environment variable names in help text -- Test both manual configuration and CLI override modes - -**Quick Test Implementation Pattern**: +**CLI Override Pattern**: ```kconfig -# Enable quick mode detection config WORKFLOW_QUICK_TEST_SET_BY_CLI bool output yaml default $(shell, scripts/check-cli-set-var.sh WORKFLOW_QUICK_TEST) -# Quick mode configuration with automatic matrix reduction config WORKFLOW_QUICK_TEST bool "Enable quick test mode" default y if WORKFLOW_QUICK_TEST_SET_BY_CLI help - Reduces test matrix and runtime for CI validation. - Environment variable: WORKFLOW_QUICK_TEST=y - -# Conditional parameter adjustment -config WORKFLOW_DEVICE - string "Target device" - default "/dev/null" if WORKFLOW_QUICK_TEST - default "/dev/sdb" - -config WORKFLOW_PATTERN_COMPREHENSIVE - bool "Comprehensive test patterns" - default n if WORKFLOW_QUICK_TEST - default y - help - Full test pattern matrix. Disabled in quick mode for faster execution. + Can be enabled via environment variable: WORKFLOW_QUICK_TEST=y ``` - -**CI Integration**: -CLI overrides enable GitHub Actions workflows to run quick validation: -```yaml -- name: Run quick workflow validation - run: | - WORKFLOW_QUICK_TEST=y make defconfig-workflow-quick - make workflow -``` - -**Key Benefits**: -- **Rapid iteration**: ~1 minute CI validation vs hours for full test suites -- **Resource efficiency**: Use `/dev/null` or minimal targets in quick mode -- **Configuration preservation**: Normal configurations remain unchanged -- **A/B compatibility**: Works with baseline/dev testing infrastructure -- **Pattern reusability**: Same patterns work across all workflows +Use `*_SET_BY_CLI` detection, conditional defaults, intelligent test matrix reduction for CI. 
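+
+A minimal CI sketch of this pattern (WORKFLOW and the defconfig-workflow-quick
+target are placeholders, not real names):
+
+```bash
+# The environment variable trips the *_SET_BY_CLI detection at
+# configuration time, so quick-mode defaults apply without editing .config
+WORKFLOW_QUICK_TEST=y make defconfig-workflow-quick
+make workflow
+```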
### Git Repository Management +- **Shallow clones**: Conflict with A/B testing different refs (missing git history) +- **Version detection**: Use `sort -V` for semantic versioning, filter `-rc` releases +- **Mirror access**: `git --git-dir=/path/to/mirror.git` for repository queries -#### Shallow Clone Limitations -- **Problem**: A/B testing with different refs requires full git history -- **Solution**: Make shallow clones depend on `!BOOTLINUX_AB_DIFFERENT_REF` -- **Detection**: Use `git --git-dir=/path/to/mirror.git` for mirror access - -#### Version Detection Scripts -```bash -# Get latest stable kernel version, excluding release candidates -LAST_STABLE=$(git --git-dir="$GIT_TREE" tag --list 'v6.*' | \ - grep -v -- '-rc' | \ - sort -V | \ - tail -1) -``` - -**Patterns**: -- Use `sort -V` for proper semantic version ordering -- Filter out pre-release versions with `grep -v -- '-rc'` -- Always provide fallback values -- Handle missing git repositories gracefully +### Debugging Methodology -### Systematic Debugging Methodology +**Configuration Tracing**: +1. Check `extra_vars.yaml` for resolved variables +2. Identify execution context (localhost vs target nodes) +3. Verify prerequisites (git refs exist before checkout) +4. Follow Ansible variable precedence -#### Configuration Tracing -1. **Check actual values**: Look at `extra_vars.yaml` for resolved variables -2. **Trace execution context**: Identify if code runs on localhost vs target nodes -3. **Verify prerequisites**: Ensure git refs exist before checkout attempts -4. **Follow variable inheritance**: Understand Ansible variable precedence - -#### A/B Testing Debug Steps +**A/B Testing Debug**: ```bash -# Check current configuration grep "BOOTLINUX.*REF" .config grep "bootlinux.*tree.*ref" extra_vars.yaml - -# Verify git repository state -git branch -v git describe --tags --always - -# Test kernel version detection scripts/infer_last_stable_kernel.sh - -# Check available refs in mirror -git --git-dir=/mirror/linux.git tag --list 'v6.*' | sort -V | tail -10 ``` -#### Common Root Causes -- **Variable scope mismatches**: Per-node vars not available on localhost -- **Git ref unavailability**: Shallow clones missing required refs -- **Execution context confusion**: Code expecting node context running on localhost -- **Configuration interdependencies**: Features conflicting in unexpected ways - -### Feature Integration Patterns - -#### When Features Conflict -1. **Identify early**: Use Kconfig dependencies to prevent invalid combinations -2. **Provide alternatives**: Suggest compatible configurations -3. **Clear messaging**: Explain why restrictions exist -4. **Graceful degradation**: Disable conflicting features automatically - -Example: Shallow clones + A/B different refs -- **Problem**: Shallow clone may not have required git refs -- **Solution**: `depends on !BOOTLINUX_AB_DIFFERENT_REF` -- **User experience**: Feature automatically disabled with explanation - -#### Smart Defaults Philosophy -- **Infer from system state**: Use dynamic detection where possible -- **Show off capabilities**: Make examples compelling and useful -- **Balance automation with control**: Provide overrides for advanced users -- **Fail gracefully**: Always have sensible fallbacks - -### AI Assistant Development Guidelines - -#### Investigation Sequence -1. **Understand the problem**: What's not working as expected? -2. **Trace execution path**: Follow code from config → ansible → execution -3. **Identify context and scope**: Where does code run? 
What variables are available? -4. **Find intersection points**: Issues often emerge where subsystems meet -5. **Design holistic solutions**: Fix root cause, enhance the feature -6. **Validate across use cases**: Test both specific case and general functionality - -#### Common Anti-Patterns to Avoid -- Band-aid fixes that ignore root cause -- Breaking existing functionality while fixing edge cases -- Ignoring variable scope and execution context -- Missing cross-feature impact analysis -- Not considering user experience implications - -#### Quality Gates -- Always run `make style` before completion -- Test both the specific case and general functionality -- Consider impact on existing users and configurations -- Document new patterns for future reference -- Verify changes work across different execution contexts - -### Examples from Practice - -#### A/B Kernel Testing Issue -**Symptoms**: `bootlinux_dev_tree_kernelrelease` not being used in dev builds - -**Root Cause Analysis**: -- 9P builds execute `run_once: true, delegate_to: localhost` -- Per-node A/B variables not available in localhost context -- Git checkout failed due to shallow clone missing refs +### Common Root Causes +- Variable scope mismatches: Per-node vars unavailable on localhost +- Git ref unavailability: Shallow clones missing refs +- Execution context confusion: Code expecting node context on localhost +- Configuration interdependencies: Features conflicting unexpectedly -**Solution Components**: -1. Variable resolution: `hostvars[groups['dev'][0]]['target_linux_ref']` -2. Git ref management: Force checkout correct ref before build -3. Configuration fix: Disable shallow clones for A/B different refs -4. Smart defaults: Auto-detect latest stable kernel version +## Per-Node Variable Management -**Key Insight**: Complex issues often involve multiple subsystem interactions -rather than bugs in individual components. +**Common Problems**: +1. Per-node variables not available on localhost +2. Host scope limiting with `HOSTS=` parameter +3. Race conditions in variable resolution -## Per-Node Variable Management and Scope Issues - -One of the most common and subtle sources of bugs in kdevops is per-node variable -scope issues, particularly when combining Ansible's execution contexts with -complex features like A/B testing and 9P builds. - -### Understanding Ansible Execution Contexts - -kdevops uses multiple Ansible execution patterns that affect variable visibility: - -#### 1. Normal Node Execution -```yaml -- name: Set per-node variable - set_fact: - my_node_var: "{{ some_value }}" - # Runs on each target node, variable is per-node -``` - -#### 2. Control Host Execution (run_once + delegate_to: localhost) -```yaml -- name: Process shared data - set_fact: - shared_var: "{{ processed_value }}" - run_once: true - delegate_to: localhost - # Runs once on control host, not on target nodes -``` - -#### 3. Mixed Context Operations -```yaml -- name: Access per-node data from control host - set_fact: - aggregated_data: "{{ hostvars[groups['target'][0]]['node_var'] }}" - run_once: true - delegate_to: localhost - # Attempts to access per-node variable from control context -``` - -### Common Variable Scope Problems - -#### Problem 1: Per-Node Variables Not Available on localhost - -**Symptom**: Variables set on target nodes are undefined when accessed from -localhost tasks. 
- -**Example**: -```yaml -# This sets target_linux_ref ONLY on nodes matching the condition -- name: Set development kernel parameters for dev nodes - set_fact: - target_linux_ref: "{{ bootlinux_dev_tree_ref }}" - when: - - kdevops_baseline_and_dev|bool - - bootlinux_ab_different_ref|bool - - bootlinux_is_dev_node|default(false)|bool - -# This tries to access per-node variable from localhost - MAY FAIL -- name: Use dev node's kernel ref for 9P build - set_fact: - active_ref: "{{ hostvars[groups['dev'][0]]['target_linux_ref'] }}" - run_once: true - delegate_to: localhost -``` - -**Root Cause**: The first task only runs on dev nodes and sets a per-node -variable. The second task runs on localhost but may not have access to the -per-node variable if there are timing or context issues. - -#### Problem 2: Host Scope Limiting with HOSTS Parameter - -**Symptom**: When using `make target HOSTS=specific-host`, variables set on -other hosts become inaccessible. - -**Example**: -```bash -# This limits the playbook scope to only the dev node -make linux HOSTS=demo-fio-tests-dev -``` - -If your playbook tries to access variables from baseline nodes or uses -`hostvars[groups['baseline'][0]]`, these may fail because the baseline -nodes are not in the current run scope. - -#### Problem 3: Race Conditions in Variable Resolution - -**Symptom**: Variables appear to be set inconsistently or use wrong values. - -**Root Cause**: Tasks with `run_once: true` may execute before per-node -tasks complete, leading to variable access before they're properly set. - -### Best Practices for Variable Management - -#### 1. Prefer Global Variables for Cross-Context Access - -**Bad**: -```yaml -# Set per-node, access from localhost - fragile -- name: Set node-specific value - set_fact: - node_value: "{{ some_computation }}" - -- name: Use in shared context - command: "process {{ hostvars[groups['target'][0]]['node_value'] }}" - run_once: true - delegate_to: localhost -``` - -**Good**: -```yaml -# Use global variable that's accessible everywhere -- name: Use global configuration - command: "process {{ global_config_value }}" - run_once: true - delegate_to: localhost -``` - -#### 2. Explicit Variable Resolution with Fallbacks +**Best Practices**: +- Prefer global variables for cross-context access +- Use explicit resolution with fallbacks +- Validate variable availability +- Use consistent naming: `global_*`, `node_*`, `active_*` **Recommended Pattern**: ```yaml -- name: Resolve variable with robust fallback - set_fact: - active_value: >- - {{ - hostvars[groups['dev'][0]]['target_value'] - if (groups['dev'] | length > 0 and - hostvars[groups['dev'][0]]['target_value'] is defined) - else fallback_global_value - }} - run_once: true - delegate_to: localhost -``` - -#### 3. Validate Variable Availability - -**Add Validation Tasks**: -```yaml -- name: Validate required variables are available - fail: - msg: "Required variable {{ item }} not found in dev node context" - when: hostvars[groups['dev'][0]][item] is not defined - loop: - - target_linux_ref - - target_linux_config - run_once: true - delegate_to: localhost -``` - -#### 4. Use Consistent Variable Naming - -**Pattern**: Use prefixes to indicate variable scope: -- `global_*` - Available everywhere -- `node_*` - Per-node variables -- `active_*` - Resolved variables for shared operations - -### Debugging Variable Scope Issues - -#### 1. 
Add Debug Tasks - -```yaml -- name: Debug variable availability - debug: - msg: | - Groups: {{ groups }} - Dev group: {{ groups['dev'] | default([]) }} - Hostvars keys: {{ hostvars.keys() | list }} - Target var: {{ hostvars[groups['dev'][0]]['target_var'] | default('UNDEFINED') }} - run_once: true - delegate_to: localhost -``` - -#### 2. Check Execution Context - -```yaml -- name: Show execution context - debug: - msg: | - Running on: {{ inventory_hostname }} - Delegate to: {{ ansible_delegated_vars | default('none') }} - Group names: {{ group_names }} - tags: debug -``` - -#### 3. Use Ansible Verbose Mode - -```bash -# Run with verbose output to see variable resolution -ANSIBLE_VERBOSITY=3 make target # Ansible verbose level 3 -``` - -### A/B Testing Variable Resolution Example - -The A/B testing feature demonstrates proper variable resolution: - -```yaml -# Step 1: Set per-node variables (runs on dev nodes only) -- name: Set development kernel parameters for dev nodes - set_fact: - target_linux_ref: "{{ bootlinux_dev_tree_ref }}" - when: - - kdevops_baseline_and_dev|bool - - bootlinux_ab_different_ref|bool - - bootlinux_is_dev_node|default(false)|bool - -# Step 2: Resolve for shared operations (robust fallback) -- name: Determine active kernel parameters for A/B testing with 9P - set_fact: - active_linux_ref: "{{ bootlinux_dev_tree_ref }}" - # Direct use of global var instead of fragile hostvars access - when: - - kdevops_baseline_and_dev|bool - - bootlinux_ab_different_ref|bool - - bootlinux_9p|bool - run_once: true - delegate_to: localhost - -# Step 3: Use resolved variable -- name: Checkout correct git ref for A/B testing with 9P - git: - version: "{{ active_linux_ref | default(target_linux_ref) }}" - run_once: true - delegate_to: localhost -``` - -### Testing Variable Resolution - -When developing features that involve per-node variables: - -1. **Test with different HOSTS parameters**: - ```bash - make target HOSTS=demo-fio-tests # baseline only - make target HOSTS=demo-fio-tests-dev # dev only - make target # both nodes - ``` - -2. **Test with different configurations**: - - A/B testing enabled/disabled - - Different node configurations - - Varying group memberships - -3. 
**Validate variable values**: - ```bash - # Check resolved variables - grep "active_.*:" extra_vars.yaml - - # Verify node-specific settings - ANSIBLE_VERBOSITY=2 make target | grep -A5 -B5 "Set.*fact" - ``` - -### Common Patterns to Avoid - -#### Anti-Pattern 1: Assuming Variable Availability -```yaml -# DON'T: Assume hostvars access will work -- name: Use dev node variable - command: "build {{ hostvars[groups['dev'][0]]['target_ref'] }}" -``` - -#### Anti-Pattern 2: Complex Conditional Logic in Variable Access -```yaml -# DON'T: Bury complex logic in variable expressions -- name: Set complex variable - set_fact: - value: "{{ 'dev' in group_names | ternary(dev_value, baseline_value) if condition else other_value }}" -``` - -#### Anti-Pattern 3: Side Effects in Variable Resolution -```yaml -# DON'T: Modify state during variable resolution -- name: Set variable with side effects - set_fact: - computed_value: "{{ some_value }}" - changed_when: true # Misleading change indication -``` - -### Recommended Patterns - -#### Pattern 1: Explicit Variable Resolution Phase -```yaml -# Phase 1: Collect per-node data -- name: Collect node-specific configurations - set_fact: - node_config: "{{ local_config }}" - -# Phase 2: Resolve for shared operations -- name: Resolve shared configuration - set_fact: - shared_config: "{{ resolved_value }}" - run_once: true - delegate_to: localhost - -# Phase 3: Execute with resolved config -- name: Execute shared operation - command: "process {{ shared_config }}" - run_once: true - delegate_to: localhost -``` - -#### Pattern 2: Configuration-Driven Variable Resolution -```yaml -# Use configuration variables that are globally accessible +# Use global configuration variables instead of fragile hostvars - name: Resolve kernel reference set_fact: active_kernel_ref: >- @@ -1274,260 +324,103 @@ When developing features that involve per-node variables: delegate_to: localhost ``` -This approach avoids the fragile `hostvars` access pattern and relies on -configuration variables that are available in all execution contexts. - -## Filesystem Testing Implementation Validation - -When implementing filesystem testing features like the fio-tests filesystem support, -follow this systematic validation approach: - -### 1. Configuration Validation -```bash -# Apply the defconfig and verify configuration generation -make defconfig-fio-tests-fs-xfs -grep "fio_tests.*fs" .config -grep "fio_tests.*xfs" .config - -# Check variable resolution in YAML -grep -A5 -B5 "fio_tests_mkfs" extra_vars.yaml -grep "fio_tests_fs_device" extra_vars.yaml -``` - -### 2. Third Drive Integration Testing -Validate that filesystem testing uses the correct storage device hierarchy: -- `kdevops0`: Data partition (`/data`) -- `kdevops1`: Block device testing (original fio-tests target) -- `kdevops2`: Filesystem testing (new third drive usage) - -Check in `extra_vars.yaml`: -```yaml -# Expected device mapping -fio_tests_device: "/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_kdevops1" # Block testing -fio_tests_fs_device: "/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_kdevops2" # Filesystem testing -``` - -### 3. Template Engine Validation -The fio job template should intelligently select between filesystem and block modes: -```bash -# Verify template handles both modes -ansible-playbook --check playbooks/fio-tests.yml --tags debug -``` - -### 4. 
A/B Testing Infrastructure -When using `CONFIG_KDEVOPS_BASELINE_AND_DEV=y`, verify: +**Debugging**: ```bash -# Both VMs should be created -ls -la /xfs1/libvirt/kdevops/guestfs/debian13-fio-tests* -# Should show: debian13-fio-tests and debian13-fio-tests-dev - -# Check hosts file generation -cat hosts -# Should include both [baseline] and [dev] groups +ANSIBLE_VERBOSITY=3 make target # Verbose output +grep "active_.*:" extra_vars.yaml # Check resolved variables ``` -### 5. Kconfig Dependency Validation -Filesystem testing properly sets dependencies: -```bash -# Should automatically enable these when filesystem testing is selected -grep "CONFIG_FIO_TESTS_REQUIRES_MKFS_DEVICE=y" .config -grep "CONFIG_FIO_TESTS_REQUIRES_FILESYSTEM=y" .config -``` - -### 6. Block Size Range Support -Test both fixed and range configurations: -```bash -# Fixed block sizes (traditional) -grep "fio_tests_bs_.*=.*True" extra_vars.yaml - -# Range configurations (when enabled) -make defconfig-fio-tests-fs-ranges -grep "fio_tests_enable_bs_ranges.*True" extra_vars.yaml -``` - -### 7. Filesystem-Specific Features -Each filesystem type should generate appropriate mkfs commands: - -**XFS with reflink + rmapbt:** -```yaml -fio_tests_mkfs_cmd: "-f -m reflink=1,rmapbt=1 -i sparse=1 -b size=16k" -``` - -**ext4 with bigalloc:** -```yaml -fio_tests_mkfs_cmd: "-F -O bigalloc -C 32k" -``` - -**btrfs with compression:** -```yaml -fio_tests_mount_opts: "defaults,compress=zstd:3" -``` - -### 8. Known Issues and Solutions - -**VM Provisioning Timeouts:** -- Initial `make bringup` can take 30+ minutes for package upgrades -- VM disk creation succeeds even if provisioning times out -- Check VM directories in `/xfs1/libvirt/kdevops/guestfs/` for progress - -**Configuration Dependencies:** -- Use `CONFIG_KDEVOPS_WORKFLOW_ENABLE_FIO_TESTS=y` not old `CONFIG_WORKFLOW_FIO_TESTS` -- Always run `make style` before completion to catch formatting issues -- Missing newlines in Kconfig files will cause syntax errors - -**Third Drive Device Selection:** -- Infrastructure-specific defaults automatically select correct devices -- libvirt uses NVMe: `nvme-QEMU_NVMe_Ctrl_kdevops2` -- AWS/cloud providers use different device naming schemes +## Filesystem Testing Implementation Validation -### 9. Testing Best Practices +**Checklist for fio-tests filesystem features**: -**Start with Simple Configurations:** -```bash -make defconfig-fio-tests-fs-xfs # Single filesystem, fixed block sizes -make defconfig-fio-tests-fs-ranges # Block size ranges testing -``` +1. **Configuration**: Verify `.config` and `extra_vars.yaml` values +2. **Third Drive Integration**: Check device mapping (kdevops0=data, kdevops1=block, kdevops2=filesystem) +3. **Template Engine**: Verify fio job template handles block/filesystem modes +4. **A/B Testing**: Check both baseline and dev VMs created +5. **Kconfig Dependencies**: Verify auto-enabled dependencies +6. **Block Size Ranges**: Test fixed and range configurations +7. **Filesystem Features**: Verify mkfs commands (XFS reflink, ext4 bigalloc, btrfs compression) -**Incremental Validation:** -1. Configuration generation (`make`) -2. Variable resolution (`extra_vars.yaml`) -3. VM creation (`make bringup`) -4. Filesystem setup verification -5. 
fio job execution +**Known Issues**: +- VM provisioning can take 30+ minutes for package upgrades +- Use `CONFIG_KDEVOPS_WORKFLOW_ENABLE_FIO_TESTS=y` (not old `CONFIG_WORKFLOW_FIO_TESTS`) +- Missing newlines in Kconfig files cause syntax errors -**Debugging Techniques:** +**Debugging**: ```bash -# Check Ansible variable resolution ansible-playbook playbooks/fio-tests.yml --tags debug -v - -# Verify filesystem creation ansible all -m shell -a "lsblk" ansible all -m shell -a "mount | grep fio-tests" - -# Test fio job template generation -ansible-playbook playbooks/fio-tests.yml --tags setup --check ``` -This systematic approach ensures filesystem testing implementations are robust, -properly integrated with existing kdevops infrastructure, and ready for -production use. - ## Multi-Filesystem Testing Architecture -The fio-tests workflow supports multi-filesystem performance comparison through -a section-based approach similar to fstests. This enables comprehensive -performance analysis across different filesystem configurations. - -### Multi-Filesystem Section Configuration - -Multi-filesystem testing creates separate VMs for each filesystem configuration, -enabling isolated performance comparison: +**Section-Based Approach**: Separate VMs for each filesystem configuration ```bash -# XFS block size comparison -make defconfig-fio-tests-fs-xfs-4k-vs-16k -make bringup # Creates VMs: demo-fio-tests-xfs-4k, demo-fio-tests-xfs-16k - -# Comprehensive XFS block size analysis -make defconfig-fio-tests-fs-xfs-all-fsbs -make bringup # Creates VMs for 4K, 16K, 32K, 64K block sizes - -# Cross-filesystem comparison -make defconfig-fio-tests-fs-xfs-vs-ext4-vs-btrfs -make bringup # Creates VMs: xfs-16k, ext4-bigalloc, btrfs-zstd +make defconfig-fio-tests-fs-xfs-4k-vs-16k # Compare XFS block sizes +make defconfig-fio-tests-fs-xfs-all-fsbs # All XFS block sizes +make defconfig-fio-tests-fs-xfs-vs-ext4-vs-btrfs # Cross-filesystem comparison +make bringup +make fio-tests +make fio-tests-multi-fs-compare # Generate comparison graphs ``` -### Node Generation Architecture - -Multi-filesystem testing uses dynamic node generation based on enabled sections: - -1. **Section Detection**: Scans `.config` for `CONFIG_FIO_TESTS_SECTION_*=y` patterns -2. **Node Creation**: Generates separate VM nodes for each enabled section -3. **Host Groups**: Creates Ansible groups for each filesystem configuration -4. **A/B Testing**: Supports baseline/dev comparisons across all configurations - -### Filesystem Configuration Mapping +**Node Generation**: Dynamic based on `CONFIG_FIO_TESTS_SECTION_*=y` patterns +**Results**: Performance heatmaps, IO depth scaling, statistical summaries +**Integration**: Baseline management, A/B testing, existing result infrastructure -Each section maps to specific filesystem configurations defined in `workflows/fio-tests/sections.conf`: +## AI Assistant Development Guidelines -**XFS Configurations**: -- `xfs-4k`: 4K block size, 4K sector, reflink + rmapbt -- `xfs-16k`: 16K block size, 4K sector, reflink + rmapbt -- `xfs-32k`: 32K block size, 4K sector, reflink + rmapbt -- `xfs-64k`: 64K block size, 4K sector, reflink + rmapbt +**Investigation Sequence**: +1. Understand the problem +2. Trace execution path: config → ansible → execution +3. Identify context and scope +4. Find subsystem intersection points +5. Design holistic solutions (fix root cause) +6. 
Validate across use cases -**Cross-Filesystem Configurations**: -- `xfs-16k`: XFS with 16K blocks and modern features -- `ext4-bigalloc`: ext4 with bigalloc and 32K clusters -- `btrfs-zstd`: btrfs with zstd compression and modern features +**Common Anti-Patterns to Avoid**: +- Band-aid fixes ignoring root cause +- Breaking existing functionality +- Ignoring variable scope and execution context +- Missing cross-feature impact analysis -### Results Collection and Analysis +**Quality Gates**: +- Always run `make style` before completion +- Test both specific case and general functionality +- Consider impact on existing users +- Verify across different execution contexts -Multi-filesystem results are collected and analyzed through specialized tooling: +## Quick Setup Examples ```bash -make fio-tests # Run tests across all filesystem configurations -make fio-tests-results # Collect results from all VMs -make fio-tests-multi-fs-compare # Generate comparison graphs and analysis -``` - -**Generated Analysis**: -- Performance overview across filesystems -- Block size performance heatmaps -- IO depth scaling analysis -- Statistical summaries and CSV exports - -### Performance Tuning for Long Runs - -For comprehensive performance analysis (1+ hour runs): - -**Configuration Adjustments**: -```kconfig -CONFIG_FIO_TESTS_RUNTIME="3600" # 1 hour per test -CONFIG_FIO_TESTS_RAMP_TIME="30" # Extended ramp time -CONFIG_FIO_TESTS_LOG_AVG_MSEC=1000 # 1-second averaging for detailed logs -``` +# XFS Filesystem Testing +make defconfig-xfs && make bringup && make fstests -**Parallel Execution Benefits**: -- Multiple VMs run simultaneously across different configurations -- Results collection aggregated from all VMs at completion -- A/B testing infrastructure ensures fair comparison baselines +# Kernel Development +make menuconfig && make bringup && make linux -### CLI Override Patterns for Multi-Filesystem Testing +# Storage Performance with fio-tests +make defconfig-fio-tests-fs-xfs && make bringup && make fio-tests -Multi-filesystem testing supports all CLI override patterns: - -```bash -# Quick validation across all filesystem configurations -FIO_TESTS_QUICK_TEST=y make defconfig-fio-tests-fs-xfs-all-fsbs -make bringup -make fio-tests # ~1 minute per filesystem configuration - -# Extended analysis with custom runtime -FIO_TESTS_RUNTIME=1800 make defconfig-fio-tests-fs-xfs-vs-ext4-vs-btrfs -make bringup -make fio-tests # 30 minutes per filesystem configuration +# Multi-Filesystem Comparison +make defconfig-fio-tests-fs-xfs-vs-ext4-vs-btrfs && make bringup && make fio-tests ``` -**Key Features**: -- Intelligent test matrix reduction in quick mode -- Consistent CLI override behavior across single and multi-filesystem modes -- Automatic parameter adjustment based on configuration complexity - -### Integration with Existing Infrastructure - -Multi-filesystem testing integrates seamlessly with existing kdevops patterns: +## Testing and Quality Assurance -1. **Baseline Management**: Supports per-filesystem baseline tracking -2. **A/B Testing**: Enables kernel version comparison across all filesystems -3. **Results Infrastructure**: Uses existing result collection and graphing -4. 
**Configuration System**: Follows kdevops Kconfig patterns and conventions +- Expunge lists: Track known failures in `workflows/*/expunges/` +- Baseline commands: Establish expected test results +- Results commands: Show outcomes and regressions +- Watchdog scripts: Automated monitoring for long-running tests -This architecture enables comprehensive filesystem performance analysis while -maintaining compatibility with existing kdevops workflows and infrastructure. +This framework is designed by kernel developers for kernel developers, +providing production-ready automation for kernel testing and development +workflows. ## Prompt Examples -Refer to PROMPTS.md for example set of prompts used to generate code on -kdevops using different AI agents and their respective commits and gradings. -This can be useful for humans but also for generative AI so it can improve. +Refer to PROMPTS.md for examples of prompts used to generate code on kdevops +using different AI agents with their respective commits and gradings. From fe58e7d08283826bcca8d053b292027089cd75e3 Mon Sep 17 00:00:00 2001 From: Luis Chamberlain Date: Sat, 6 Dec 2025 08:56:15 -0800 Subject: [PATCH 2/7] terraform: Use directory checksum in SSH key filenames SSH keys now use directory-based checksums to support multiple kdevops installations coexisting without conflicts. The SSH key filename format is now ~/.ssh/kdevops_terraform_ where checksum is the first 8 characters of the SHA256 hash of the kdevops directory path. This was previously only applied to Lambda Labs provider but is now the default for all providers, as it solves the general problem of multiple installations sharing the same home directory. Generated-by: Claude AI Signed-off-by: Luis Chamberlain --- terraform/Kconfig.ssh | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/terraform/Kconfig.ssh b/terraform/Kconfig.ssh index 4e239a35..b03f962f 100644 --- a/terraform/Kconfig.ssh +++ b/terraform/Kconfig.ssh @@ -21,14 +21,17 @@ config TERRAFORM_SSH_CONFIG_USER config TERRAFORM_SSH_CONFIG_PUBKEY_FILE string "File containing Ansible's ssh public key" - default "~/.ssh/kdevops_terraform_$(shell, echo $(TOPDIR_PATH) | sha256sum | cut -c1-8).pub" if TERRAFORM_LAMBDALABS - default "~/.ssh/kdevops_terraform.pub" + default "~/.ssh/kdevops_terraform_$(shell, echo $(TOPDIR_PATH) | sha256sum | cut -c1-8).pub" help The filename of the file containing an ssh public key Ansible is to use to manage its target nodes. The matching private key should be located in a file using the same basename (without the ".pub"). + The filename includes an 8-character hash of the current + directory path, allowing multiple kdevops installations to + use separate SSH keys without conflicts. + config TERRAFORM_SSH_CONFIG_GENKEY bool "Should we create a new random key for you?" default y From 82937e3f1c543fdcd57deaa126e0b8b06ab6d176 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Sat, 6 Dec 2025 08:56:15 -0800 Subject: [PATCH 3/7] terraform: Add ssh key migration mechanism ssh key file name with a SHA-256 hash was previously only applied to the Lambda Labs provider but is now the default for all providers. This presents a migration hazard for users of non-Lambda providers. A migration check now runs during the "make" step to notify users with existing keys at the old path. Users with running VMs can migrate their keys using the provided commands; otherwise new keys are generated automatically. 
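For users who want to preserve existing keys, the migration is in
essence a rename of the key pair to the hashed path. A minimal sketch,
assuming make invokes the check with the absolute kdevops top-level
directory path, so that hashing $PWD from the checkout yields the same
checksum the check script computes:

    # Compute the 8-character checksum for this kdevops checkout
    hash=$(echo "$PWD" | sha256sum | cut -c1-8)

    # Rename the old key pair to the new per-directory path
    mv "$HOME/.ssh/kdevops_terraform" "$HOME/.ssh/kdevops_terraform_${hash}"
    mv "$HOME/.ssh/kdevops_terraform.pub" "$HOME/.ssh/kdevops_terraform_${hash}.pub"

The commands printed by the migration notice itself are authoritative;
the above only illustrates the intent.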
Generated-by: Claude AI Signed-off-by: Chuck Lever --- scripts/check-ssh-key-migration.sh | 35 ++++++++++++++++++++++++++++++ scripts/terraform.Makefile | 5 +++++ 2 files changed, 40 insertions(+) create mode 100755 scripts/check-ssh-key-migration.sh diff --git a/scripts/check-ssh-key-migration.sh b/scripts/check-ssh-key-migration.sh new file mode 100755 index 00000000..7785f2cd --- /dev/null +++ b/scripts/check-ssh-key-migration.sh @@ -0,0 +1,35 @@ +#!/bin/bash +# SPDX-License-Identifier: copyleft-next-0.3.1 +# +# Check if SSH keys need migration from old (unhashed) to new (hashed) paths. +# This helps users upgrading from older kdevops versions that used a fixed +# SSH key path to the new per-directory hashed paths. + +set -e + +TOPDIR_PATH="${1:-.}" +HASH=$(echo "$TOPDIR_PATH" | sha256sum | cut -c1-8) + +OLD_KEY="$HOME/.ssh/kdevops_terraform" +OLD_PUBKEY="$HOME/.ssh/kdevops_terraform.pub" +NEW_KEY="$HOME/.ssh/kdevops_terraform_${HASH}" +NEW_PUBKEY="$HOME/.ssh/kdevops_terraform_${HASH}.pub" + +# Only show notice if old key exists but new key doesn't +if [ -f "$OLD_PUBKEY" ] && [ ! -f "$NEW_PUBKEY" ]; then + cat < Date: Sun, 14 Dec 2025 11:08:08 -0500 Subject: [PATCH 4/7] terraform/oci: skip Kconfig gen when OCI is not configured Add require_oci_config() to oci_common.py which validates OCI configuration via the SDK early in script execution. When OCI is not configured, the scripts exit gracefully with success rather than failing with errors. This allows users without OCI accounts to run make dynconfig without issues. This approach centralizes the handling and avoids TOCTOU race conditions from manual file existence checks. The SDK also validates the config file format, not just its existence. Generated-by: Claude AI Signed-off-by: Chuck Lever --- terraform/oci/scripts/gen_kconfig_image | 19 +++++++------ terraform/oci/scripts/gen_kconfig_location | 17 ++++++----- terraform/oci/scripts/gen_kconfig_shape | 19 +++++++------ terraform/oci/scripts/oci_common.py | 33 ++++++++++++++++++++++ 4 files changed, 65 insertions(+), 23 deletions(-) diff --git a/terraform/oci/scripts/gen_kconfig_image b/terraform/oci/scripts/gen_kconfig_image index 977446fd..58c650e0 100755 --- a/terraform/oci/scripts/gen_kconfig_image +++ b/terraform/oci/scripts/gen_kconfig_image @@ -50,7 +50,6 @@ Usage: """ import sys -import os import argparse import re from collections import defaultdict @@ -65,6 +64,8 @@ from oci_common import ( create_compute_client, get_all_region_keys, get_region_kconfig_name, + require_oci_config, + OciNotConfiguredError, ) # Cache for loaded publisher definitions (loaded once per script execution) @@ -239,7 +240,7 @@ def get_all_images(compartment_ocid, regions=None, quiet=False): - all_publishers: Combined dictionary of known and discovered publishers """ # Lazy import of OCI exceptions - only needed when actually using OCI API - from oci.exceptions import ServiceError, ConfigFileNotFound + from oci.exceptions import ServiceError if regions is None: regions = [get_default_region()] @@ -273,12 +274,6 @@ def get_all_images(compartment_ocid, regions=None, quiet=False): image, "listing_type", None ) - except ConfigFileNotFound: - print( - "Error: OCI config file not found. 
Please configure OCI CLI.", - file=sys.stderr, - ) - return {}, {} except ServiceError as e: print(f" Warning: Could not query {region}: {e.message}", file=sys.stderr) continue @@ -750,6 +745,14 @@ def main(): output_publishers_raw(args.quiet) return + # Allow make dynconfig to succeed without OCI credentials + try: + require_oci_config() + except OciNotConfiguredError: + if not args.quiet: + print("OCI not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + compartment_ocid = get_default_compartment() if not compartment_ocid: print("Error: Could not determine compartment OCID", file=sys.stderr) diff --git a/terraform/oci/scripts/gen_kconfig_location b/terraform/oci/scripts/gen_kconfig_location index 17ec8266..8234b0fe 100755 --- a/terraform/oci/scripts/gen_kconfig_location +++ b/terraform/oci/scripts/gen_kconfig_location @@ -31,7 +31,6 @@ Usage: ./gen_kconfig_location us-chicago-1 """ -import os import sys import argparse @@ -44,6 +43,8 @@ from oci_common import ( load_yaml_config, create_identity_client, get_oci_config, + require_oci_config, + OciNotConfiguredError, ) @@ -179,12 +180,6 @@ def get_all_regions(): return sorted(regions, key=lambda x: x["region_name"]) - except oci.exceptions.ConfigFileNotFound: - print( - "Error: OCI config file not found. Please configure OCI CLI.", - file=sys.stderr, - ) - return [] except Exception as e: print(f"Error retrieving OCI regions: {e}", file=sys.stderr) return [] @@ -432,6 +427,14 @@ def main(): """Main function to run the program.""" args = parse_arguments() + # Allow make dynconfig to succeed without OCI credentials + try: + require_oci_config() + except OciNotConfiguredError: + if not args.quiet: + print("OCI not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + if not args.quiet: print("Fetching list of OCI region subscriptions...", file=sys.stderr) regions = get_all_regions() diff --git a/terraform/oci/scripts/gen_kconfig_shape b/terraform/oci/scripts/gen_kconfig_shape index 09dd6d6f..65a0e2be 100755 --- a/terraform/oci/scripts/gen_kconfig_shape +++ b/terraform/oci/scripts/gen_kconfig_shape @@ -31,11 +31,10 @@ Usage: """ import sys -import os import argparse import oci -from oci.exceptions import ServiceError, ConfigFileNotFound +from oci.exceptions import ServiceError from oci_common import ( get_default_region, @@ -44,6 +43,8 @@ from oci_common import ( get_jinja2_environment, load_yaml_config, create_compute_client, + require_oci_config, + OciNotConfiguredError, ) @@ -82,12 +83,6 @@ def get_all_shapes(compartment_ocid, regions=None, quiet=False): if not quiet: print(f" Found {len(shapes)} shapes in {region}", file=sys.stderr) - except ConfigFileNotFound: - print( - "Error: OCI config file not found. 
Please configure OCI CLI.", - file=sys.stderr, - ) - return [] except ServiceError as e: print(f" Warning: Could not query {region}: {e.message}", file=sys.stderr) continue @@ -887,6 +882,14 @@ def main(): """Main function to run the program.""" args = parse_arguments() + # Allow make dynconfig to succeed without OCI credentials + try: + require_oci_config() + except OciNotConfiguredError: + if not args.quiet: + print("OCI not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + compartment_ocid = get_default_compartment() if not compartment_ocid: print("Error: Could not determine compartment OCID", file=sys.stderr) diff --git a/terraform/oci/scripts/oci_common.py b/terraform/oci/scripts/oci_common.py index 047b576f..cbe49a04 100644 --- a/terraform/oci/scripts/oci_common.py +++ b/terraform/oci/scripts/oci_common.py @@ -20,6 +20,12 @@ from jinja2 import Environment, FileSystemLoader +class OciNotConfiguredError(Exception): + """Raised when OCI configuration is not available.""" + + pass + + def get_default_region() -> str: """ Get the default OCI region from the ~/.oci/config file. @@ -174,6 +180,33 @@ def get_oci_config(region: Optional[str] = None) -> dict[str, Any]: return config +def require_oci_config() -> dict[str, Any]: + """ + Require OCI configuration, raising an exception if not configured. + + This function should be called early in main() to validate OCI + configuration. If OCI is not configured, it raises OciNotConfiguredError + to let the caller decide how to handle it. + + This centralizes the handling of missing OCI configuration and avoids + TOCTOU race conditions from manual file existence checks. + + Returns: + dict: OCI configuration dictionary if available + + Raises: + OciNotConfiguredError: If OCI configuration file is not found + """ + # Lazy import - only load OCI SDK when actually needed + import oci + from oci.exceptions import ConfigFileNotFound + + try: + return oci.config.from_file() + except ConfigFileNotFound as e: + raise OciNotConfiguredError("OCI configuration not found") from e + + def create_identity_client(region: Optional[str] = None) -> oci.identity.IdentityClient: """ Create an OCI Identity client with optional region override. From fd41bff199c8455d1b4cc22922d76385287106c0 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Sun, 14 Dec 2025 13:44:25 -0500 Subject: [PATCH 5/7] terraform/aws: skip Kconfig generation when AWS is not configured Add require_aws_credentials() to aws_common.py which validates AWS credentials via the STS API early in script execution. When AWS is not configured, the scripts exit gracefully with success rather than failing with errors. This allows users without AWS accounts to run make dynconfig without issues. This approach centralizes the handling and avoids TOCTOU race conditions from manual file existence checks. The STS API also validates the credentials work, not just that config files exist. 
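As a sketch of the expected behavior on a host with no AWS credentials
configured (the message text and the success exit status below are
taken from the new code paths, which print to stderr and sys.exit(0)):

    $ ./terraform/aws/scripts/gen_kconfig_location
    AWS not configured - skipping (optional)
    $ echo $?
    0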
Generated-by: Claude AI Signed-off-by: Chuck Lever --- terraform/aws/scripts/aws_common.py | 30 ++++++++++++++++++++++ terraform/aws/scripts/gen_kconfig_ami | 17 +++++++----- terraform/aws/scripts/gen_kconfig_instance | 14 +++++++--- terraform/aws/scripts/gen_kconfig_location | 14 +++++++--- 4 files changed, 61 insertions(+), 14 deletions(-) diff --git a/terraform/aws/scripts/aws_common.py b/terraform/aws/scripts/aws_common.py index d953fb6c..495a1127 100644 --- a/terraform/aws/scripts/aws_common.py +++ b/terraform/aws/scripts/aws_common.py @@ -18,6 +18,12 @@ from botocore.exceptions import ClientError, NoCredentialsError +class AwsNotConfiguredError(Exception): + """Raised when AWS credentials are not available.""" + + pass + + def get_default_region(): """ Get the default AWS region from the ~/.aws/config file. @@ -128,6 +134,30 @@ def handle_aws_credentials_error(quiet=False): return False +def require_aws_credentials(): + """ + Require AWS credentials, raising an exception if not configured. + + This function should be called early in main() to validate AWS + credentials. If AWS is not configured, it raises AwsNotConfiguredError + to let the caller decide how to handle it. + + This centralizes the handling of missing AWS credentials and avoids + TOCTOU race conditions from manual file existence checks. + + Returns: + dict: Caller identity information if credentials are valid + + Raises: + AwsNotConfiguredError: If AWS credentials are not found + """ + try: + sts = boto3.client("sts") + return sts.get_caller_identity() + except NoCredentialsError as e: + raise AwsNotConfiguredError("AWS credentials not found") from e + + def get_all_regions(quiet=False): """ Retrieve the list of all AWS regions. diff --git a/terraform/aws/scripts/gen_kconfig_ami b/terraform/aws/scripts/gen_kconfig_ami index 67a66943..305657b6 100755 --- a/terraform/aws/scripts/gen_kconfig_ami +++ b/terraform/aws/scripts/gen_kconfig_ami @@ -14,15 +14,15 @@ from collections import defaultdict from datetime import datetime, timedelta from concurrent.futures import ThreadPoolExecutor, as_completed -import boto3 -from botocore.exceptions import ClientError, NoCredentialsError +from botocore.exceptions import ClientError from aws_common import ( + AwsNotConfiguredError, get_default_region, get_jinja2_environment, create_ec2_client, handle_aws_client_error, - handle_aws_credentials_error, + require_aws_credentials, ) @@ -257,9 +257,6 @@ def discover_ami_patterns( return discovered_patterns - except NoCredentialsError: - handle_aws_credentials_error(quiet) - return {} except ClientError as e: handle_aws_client_error(e, f"discovering AMI patterns for {owner_name}", quiet) return {} @@ -822,6 +819,14 @@ def main(): output_owners_raw(owners, args.quiet) return + # Allow make dynconfig to succeed without AWS credentials + try: + require_aws_credentials() + except AwsNotConfiguredError: + if not args.quiet: + print("AWS not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + if args.region: region = args.region else: diff --git a/terraform/aws/scripts/gen_kconfig_instance b/terraform/aws/scripts/gen_kconfig_instance index 9b574de6..00fe4d53 100755 --- a/terraform/aws/scripts/gen_kconfig_instance +++ b/terraform/aws/scripts/gen_kconfig_instance @@ -9,14 +9,12 @@ construct the "instance" Kconfig menu. 
import sys import argparse -from botocore.exceptions import ClientError, NoCredentialsError - from aws_common import ( + AwsNotConfiguredError, get_default_region, get_all_instance_types, get_jinja2_environment, - handle_aws_client_error, - handle_aws_credentials_error, + require_aws_credentials, ) @@ -316,6 +314,14 @@ def main(): """Main function to run the program.""" args = parse_arguments() + # Allow make dynconfig to succeed without AWS credentials + try: + require_aws_credentials() + except AwsNotConfiguredError: + if not args.quiet: + print("AWS not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + if args.region: region = args.region else: diff --git a/terraform/aws/scripts/gen_kconfig_location b/terraform/aws/scripts/gen_kconfig_location index 7c974d16..fb8f8b99 100755 --- a/terraform/aws/scripts/gen_kconfig_location +++ b/terraform/aws/scripts/gen_kconfig_location @@ -10,17 +10,15 @@ import sys import argparse from concurrent.futures import ThreadPoolExecutor, as_completed -from botocore.exceptions import ClientError, NoCredentialsError from aws_common import ( + AwsNotConfiguredError, get_default_region, get_all_regions, get_region_availability_zones, get_jinja2_environment, get_region_kconfig_name, - create_ec2_client, - handle_aws_client_error, - handle_aws_credentials_error, + require_aws_credentials, ) @@ -214,6 +212,14 @@ def main(): """Main function to run the program.""" args = parse_arguments() + # Allow make dynconfig to succeed without AWS credentials + try: + require_aws_credentials() + except AwsNotConfiguredError: + if not args.quiet: + print("AWS not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + if not args.quiet: print("Fetching list of all AWS regions...", file=sys.stderr) regions = get_all_regions() From 36dc2bf270b1bccc97bf8ca9af355baa96ecee59 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Sun, 14 Dec 2025 13:46:48 -0500 Subject: [PATCH 6/7] terraform/azure: skip Kconfig generation when Azure is not configured Add require_azure_credentials() to azure_common.py which validates Azure credentials via the Azure CLI profile early in script execution. When Azure is not configured, the scripts exit gracefully with success rather than failing with errors. This allows users without Azure accounts to run make dynconfig without issues. This approach mirrors the AWS implementation and centralizes the handling of missing Azure credentials while avoiding TOCTOU race conditions from manual file existence checks. The Azure CLI profile API validates that credentials are functional, not just that config files exist. Generated-by: Claude AI Signed-off-by: Chuck Lever --- terraform/azure/scripts/azure_common.py | 48 ++++++++++++++++++++ terraform/azure/scripts/gen_kconfig_image | 11 ++++- terraform/azure/scripts/gen_kconfig_location | 12 ++++- terraform/azure/scripts/gen_kconfig_size | 13 ++++-- 4 files changed, 78 insertions(+), 6 deletions(-) diff --git a/terraform/azure/scripts/azure_common.py b/terraform/azure/scripts/azure_common.py index de9abb91..f05db4a8 100644 --- a/terraform/azure/scripts/azure_common.py +++ b/terraform/azure/scripts/azure_common.py @@ -17,6 +17,12 @@ from jinja2 import Environment, FileSystemLoader +class AzureNotConfiguredError(Exception): + """Raised when Azure credentials are not available.""" + + pass + + def get_default_region(): """ Get the default Azure region from Azure configuration. 
@@ -371,3 +377,45 @@ def exit_on_empty_result(result, context, quiet=False): ) print("Run 'az login' to authenticate with Azure.", file=sys.stderr) sys.exit(1) + + +def require_azure_credentials(): + """ + Require Azure credentials, raising an exception if not configured. + + This function should be called early in main() to validate Azure + credentials. If Azure is not configured, it raises AzureNotConfiguredError + to let the caller decide how to handle it. + + This centralizes the handling of missing Azure credentials and avoids + TOCTOU race conditions from manual file existence checks. + + Returns: + str: Subscription ID if credentials are valid + + Raises: + AzureNotConfiguredError: If Azure credentials are not found + """ + try: + from azure.common.credentials import get_cli_profile + + profile = get_cli_profile() + credentials, subscription_id, _ = profile.get_login_credentials( + resource="https://management.azure.com" + ) + return subscription_id + except ImportError as e: + raise AzureNotConfiguredError("Azure SDK not installed") from e + except Exception as e: + # Only treat as "not configured" if it looks like an auth/login issue + error_msg = str(e).lower() + auth_indicators = [ + "login", + "logged in", + "authenticate", + "credential", + "az login", + ] + if any(phrase in error_msg for phrase in auth_indicators): + raise AzureNotConfiguredError("Azure credentials not found") from e + raise diff --git a/terraform/azure/scripts/gen_kconfig_image b/terraform/azure/scripts/gen_kconfig_image index 673474ce..aa0a0980 100755 --- a/terraform/azure/scripts/gen_kconfig_image +++ b/terraform/azure/scripts/gen_kconfig_image @@ -55,11 +55,12 @@ from collections import defaultdict from functools import lru_cache from azure_common import ( + AzureNotConfiguredError, get_default_region, get_jinja2_environment, - get_all_regions, get_region_kconfig_name, get_all_offers_and_skus, + require_azure_credentials, ) @@ -730,6 +731,14 @@ def main(): output_publishers_raw(args.quiet) return + # Allow make dynconfig to succeed without Azure credentials + try: + require_azure_credentials() + except AzureNotConfiguredError: + if not args.quiet: + print("Azure not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + publishers = get_known_publishers() # Filter to specific publisher if requested diff --git a/terraform/azure/scripts/gen_kconfig_location b/terraform/azure/scripts/gen_kconfig_location index 287fb894..2b04ef14 100755 --- a/terraform/azure/scripts/gen_kconfig_location +++ b/terraform/azure/scripts/gen_kconfig_location @@ -26,11 +26,12 @@ import sys import argparse from azure_common import ( + AzureNotConfiguredError, get_default_region, get_all_regions, get_jinja2_environment, get_region_kconfig_name, - exit_on_empty_result, + require_azure_credentials, ) @@ -191,11 +192,18 @@ def main(): """Main function to run the program.""" args = parse_arguments() + # Allow make dynconfig to succeed without Azure credentials + try: + require_azure_credentials() + except AzureNotConfiguredError: + if not args.quiet: + print("Azure not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + if not args.quiet: print("Fetching list of all Azure regions...", file=sys.stderr) regions = get_all_regions(args.quiet) - exit_on_empty_result(regions, "Azure region query", args.quiet) if args.regions: if args.format == "kconfig": diff --git a/terraform/azure/scripts/gen_kconfig_size b/terraform/azure/scripts/gen_kconfig_size index bdaaa55f..fd50883b 100755 --- 
a/terraform/azure/scripts/gen_kconfig_size +++ b/terraform/azure/scripts/gen_kconfig_size @@ -39,11 +39,12 @@ import os import yaml from azure_common import ( + AzureNotConfiguredError, get_default_region, get_all_regions, get_jinja2_environment, get_vm_sizes_and_skus, - exit_on_empty_result, + require_azure_credentials, ) @@ -602,6 +603,14 @@ def main(): """Main function to run the program.""" args = parse_arguments() + # Allow make dynconfig to succeed without Azure credentials + try: + require_azure_credentials() + except AzureNotConfiguredError: + if not args.quiet: + print("Azure not configured - skipping (optional)", file=sys.stderr) + sys.exit(0) + # Determine which regions to query if args.region: # Query specific region only @@ -609,14 +618,12 @@ def main(): elif args.all_regions: # Query all regions regions = get_all_regions(args.quiet) - exit_on_empty_result(regions, "Azure region query", args.quiet) else: # Query default region only regions = [get_default_region()] # Get VM sizes and capabilities in a single API call sizes, sku_capabilities = get_all_vm_sizes_and_capabilities(regions, args.quiet) - exit_on_empty_result(sizes, "Azure VM size query", args.quiet) if args.families: output_families_raw(sizes, args.quiet) From b96ceff43044e39831c72d2e2e628a9045850a40 Mon Sep 17 00:00:00 2001 From: Luis Chamberlain Date: Sat, 6 Dec 2025 08:56:17 -0800 Subject: [PATCH 7/7] terraform: Enable fact gathering for localhost Terraform providers are platform-specific binaries installed to paths that include the operating system and architecture: ~/.terraform.d/plugins/registry.terraform.io////_/ The DataCrunch provider integration uses ansible_system and ansible_architecture to construct this path dynamically, allowing the same playbook to work on Linux x86_64, Linux ARM64, macOS, etc. These variables are Ansible facts populated by the setup module during fact gathering. With gather_facts disabled, these variables are undefined and the playbook fails when attempting to install or locate the DataCrunch terraform provider. Remove the gather_facts: false directive so Ansible collects system facts before the terraform role executes. Generated-by: Claude AI Signed-off-by: Luis Chamberlain --- playbooks/terraform.yml | 1 - 1 file changed, 1 deletion(-) diff --git a/playbooks/terraform.yml b/playbooks/terraform.yml index 61a16a2c..9c99f20d 100644 --- a/playbooks/terraform.yml +++ b/playbooks/terraform.yml @@ -1,6 +1,5 @@ --- - name: Manage infrastructure lifecycle and SSH access with Terraform hosts: localhost - gather_facts: false roles: - role: terraform
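For reference, the two facts the terraform role depends on are
populated by the setup module that fact gathering runs. A quick way to
inspect them on the control host (a sketch for illustration, not part
of this patch):

    # On a typical x86_64 Linux host these report "Linux" and
    # "x86_64", which the role can then map into the os_arch
    # directory component of the provider plugin path
    # (e.g. linux_amd64).
    ansible localhost -m setup -a 'filter=ansible_system'
    ansible localhost -m setup -a 'filter=ansible_architecture'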