Conversation
Rebase to develop
There was a problem hiding this comment.
Pull request overview
This PR targets GitHub Actions L2 test stability by improving bundle handling and platform-specific behavior (cgroupv2/CI quirks), and by adjusting templates/CI steps to reduce environment-dependent failures.
Changes:
- Harden L2 test runner utilities (bundle extraction validation, DobbyDaemon readiness probing, container launch retries, improved async log reading).
- Make multiple L2 tests more CI-tolerant (bundle config sanitization, cgroupv1/v2 pid-limit path resolution, more robust bundle comparisons).
- Add CI workflow step to regenerate bundles for cgroupv2 compatibility and relax LCOV handling for unused coverage artifacts.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/L2_testing/test_runner/thunder_plugin.py | Sanitizes bundle config and makes test set platform-conditional; improves handling of invalid bundles. |
| tests/L2_testing/test_runner/test_utils.py | Adds bundle extraction validation, nested bundle path detection, DobbyDaemon readiness probing, and more robust container launch behavior. |
| tests/L2_testing/test_runner/start_from_bundle.py | Uses validated bundle context and waits briefly for log flushing before asserting. |
| tests/L2_testing/test_runner/runner.py | Fixes test-results aggregation to only attach per-suite JSON when tests actually ran. |
| tests/L2_testing/test_runner/pid_limit_tests.py | Resolves pids.max across cgroup v1/v2 layouts, including /proc/<pid>/cgroup-derived paths. |
| tests/L2_testing/test_runner/network_tests.py | Switches to shared launch_container and adds invalid-bundle handling. |
| tests/L2_testing/test_runner/memcr_tests.py | Hardens PID parsing and optionally skips PID checkpoint validation when PIDs aren’t reported. |
| tests/L2_testing/test_runner/bundle_generation.py | Replaces filesystem diff with normalized config.json comparison and adds rootfs existence check. |
| tests/L2_testing/test_runner/bundle/regenerate_bundles.py | New CI helper to patch and repack test bundles for cgroupv2 compatibility. |
| tests/L2_testing/test_runner/basic_sanity_tests.py | Reworks async log reading with select() and makes daemon-stop verification less log-dependent. |
| tests/L2_testing/test_runner/annotation_tests.py | Starts container from spec path (vs extracted bundle) and stops container after annotation checks. |
| client/tool/source/Main.cpp | Removes mutex lock from stop callback before completing a promise. |
| bundle/lib/source/templates/OciConfigJsonVM1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| bundle/lib/source/templates/OciConfigJson1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| .github/workflows/L2-tests.yml | Adds bundle-regeneration step and relaxes LCOV errors for unused coverage. |
Comments suppressed due to low confidence (1)
tests/L2_testing/test_runner/network_tests.py:121
- If
launch_containerfails, the test still proceeds to sleep and read from netcat, and then overwrites the earlier launch-failure message with a generic "Received ... expected ..." message. This makes failures harder to diagnose and can produce misleading output. Consider short-circuiting whenlaunch_resultis false (or preserving the original error message/log) before attempting to validate the netcat payload.
launch_result = test_utils.launch_container(container_name, bundle_path)
message = ""
result = True
if not launch_result:
message = "Container did not launch successfully"
result = False
# give container time to start and send message before checking netcat listener
sleep(2)
nc_message = nc.get_output().rstrip("\n")
# check if netcat listener received message
if test.expected_output.lower() not in nc_message.lower():
message = "Received '%s' from container, expected '%s'" % (nc_message.lower(), test.expected_output.lower())
result = False
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| iterations = int(exit_timeout / 0.5) | ||
| for i in range(0, iterations): | ||
| sleep(0.5) |
There was a problem hiding this comment.
iterations = int(exit_timeout / 0.5) truncates and can become 0 for small exit_timeout values (or under-wait when exit_timeout isn’t a multiple of 0.5). That would immediately take the timeout path without any DobbyTool info checks. Consider using a ceiling calculation and/or enforcing a minimum of 1 iteration so the wait duration matches exit_timeout more closely.
| if (state == IDobbyProxyEvents::ContainerState::Stopped && containerId == *id) | ||
| { | ||
| AI_LOG_INFO("Container %s has stopped", containerId.c_str()); | ||
| std::lock_guard<std::mutex> locker(gLock); | ||
| promise.set_value(); | ||
| } |
There was a problem hiding this comment.
Removing the mutex lock in the callback introduces a data race on the global promise: stopCommand reassigns promise under gLock, while this callback can call promise.set_value() concurrently from the proxy event thread. That’s undefined behavior and can crash intermittently. Restore synchronization (e.g., keep std::lock_guard<std::mutex> locker(gLock); here, or switch to a thread-safe pattern such as storing the promise in a shared_ptr captured by the callback).
| Changes made for cgroupv2 compatibility: | ||
| - Removes 'swappiness' from memory resources (not supported in cgroupv2) | ||
| - Sets realtimeRuntime and realtimePeriod to valid values or removes them | ||
| - Updates rootfsPropagation to 'slave' for better compatibility |
There was a problem hiding this comment.
The module docstring says the script “Updates rootfsPropagation to 'slave'”, but the implementation removes both linux.rootfsPropagation and top-level rootfsPropagation. This mismatch is confusing for anyone using the script to understand what’s being changed. Please update the docstring to reflect the actual behavior (or implement the documented change).
| - Updates rootfsPropagation to 'slave' for better compatibility | |
| - Removes both linux.rootfsPropagation and top-level rootfsPropagation |
| with test_utils.dobby_daemon(), test_utils.untar_bundle(container_name) as bundle_path: | ||
| bundle_ctx = test_utils.untar_bundle(container_name) | ||
| with test_utils.dobby_daemon(), bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: |
There was a problem hiding this comment.
bundle_ctx.valid is accessed here, but test_utils.untar_bundle currently doesn't define a valid attribute (it only untars in __init__ and returns the path in __enter__). This will raise AttributeError and fail the test before any bundle validation logic runs. Either add a valid field to untar_bundle (and set it based on tar exit status + existence of required files like config.json/rootfs) or remove these checks and handle extraction/validation failures another way.
| if not bundle_ctx.valid: | |
| bundle_valid = ( | |
| path.isdir(bundle_path) and | |
| path.isfile(path.join(bundle_path, "config.json")) and | |
| path.isdir(path.join(bundle_path, "rootfs")) | |
| ) | |
| if not bundle_valid: |
| bundle_ctx = test_utils.untar_bundle(container_id) | ||
| with bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| return False, "Bundle extraction or validation failed" |
There was a problem hiding this comment.
bundle_ctx.valid is referenced, but test_utils.untar_bundle doesn't currently provide a valid attribute. This will throw at runtime and mask the real extraction/validation failure. Consider extending untar_bundle to expose valid (and any failure reason) or remove this check and instead validate the extracted bundle path (e.g., check config.json exists) before launching.
| bundle_ctx = test_utils.untar_bundle(container_id) | ||
| with bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| return False, "Bundle extraction or validation failed" | ||
|
|
There was a problem hiding this comment.
bundle_ctx.valid is used here, but test_utils.untar_bundle has no valid attribute in its current implementation, so this will raise AttributeError. If you need to guard against missing/invalid bundles, implement that in untar_bundle (e.g., capture tar return code and verify required files) and expose it via valid/error, or do explicit file checks on bundle_path.
| bundle_ctx = test_utils.untar_bundle(container_name) | ||
| with bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| test = tests[0] | ||
| output = test_utils.create_simple_test_output( | ||
| test, False, | ||
| "Bundle extraction or validation failed", | ||
| "Bundle tarball could not be extracted or config.json was missing" | ||
| ) |
There was a problem hiding this comment.
bundle_ctx.valid is accessed here, but test_utils.untar_bundle doesn't currently have a valid attribute. This will raise at runtime and cause this test group to error rather than report a clean bundle-extraction failure. If validation is desired, update untar_bundle to track success/failure (and maybe an error message) or validate bundle_path directly before proceeding.
| if (state == wp->state && containerId == wp->containerId) | ||
| { | ||
| AI_LOG_INFO("Wait complete"); | ||
| std::lock_guard<std::mutex> locker(gLock); | ||
| promise.set_value(); | ||
| } |
There was a problem hiding this comment.
Same issue as the stop callback: without locking (or another synchronization mechanism), promise.set_value() can race with promise = std::promise<void>() in the command handler. Please reintroduce gLock protection here or use a callback-specific promise instance that can’t be concurrently reassigned.
|
|
||
| # Backup original | ||
| if backup: | ||
| backup_path = bundle_tarball.with_suffix('.tar.gz.bak') |
There was a problem hiding this comment.
backup_path = bundle_tarball.with_suffix('.tar.gz.bak') will produce an odd filename for inputs like foo.tar.gz (e.g., foo.tar.tar.gz.bak) because Path.with_suffix replaces only the last suffix. If you want foo.tar.gz.bak, build the backup name explicitly (e.g., Path(str(bundle_tarball) + '.bak') or bundle_tarball.with_name(bundle_tarball.name + '.bak')).
| backup_path = bundle_tarball.with_suffix('.tar.gz.bak') | |
| backup_path = bundle_tarball.with_name(bundle_tarball.name + '.bak') |
| bundle_ctx = test_utils.untar_bundle(container_name) | ||
| with test_utils.dobby_daemon(), netcat_listener() as nc, bundle_ctx as bundle_path: | ||
| if not bundle_ctx.valid: | ||
| output = test_utils.create_simple_test_output(tests[0], False, "Bundle extraction or validation failed", | ||
| log_content="Bundle extraction or validation failed; container was never launched.") | ||
| output_table.append(output) |
There was a problem hiding this comment.
bundle_ctx.valid is checked here, but test_utils.untar_bundle doesn't define a valid attribute. This will fail with AttributeError even when bundle extraction succeeds. Implement valid on the context manager (and set it based on extraction + required file existence) or replace this with explicit validation of bundle_path (e.g., check config.json exists) before starting the container.
f1bf099 to
298f0e7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for member in tar.getmembers(): | ||
| member_path = (bundle_dir / member.name).resolve() | ||
| if not str(member_path).startswith(str(bundle_dir.resolve())): | ||
| raise RuntimeError( | ||
| f"Tarball member '{member.name}' would escape extraction " | ||
| f"directory '{bundle_dir}' — aborting for safety" | ||
| ) | ||
| tar.extractall(path=bundle_dir) |
There was a problem hiding this comment.
The tar path-traversal check is incorrect: using str(member_path).startswith(str(bundle_dir.resolve())) can be bypassed by paths like <bundle_dir>_evil/... (prefix match) and would allow extraction outside bundle_dir. Use a real path containment check (e.g., member_path.relative_to(bundle_dir.resolve()) in a try/except, or os.path.commonpath), or extract members individually after validating each target path.
| for member in tar.getmembers(): | |
| member_path = (bundle_dir / member.name).resolve() | |
| if not str(member_path).startswith(str(bundle_dir.resolve())): | |
| raise RuntimeError( | |
| f"Tarball member '{member.name}' would escape extraction " | |
| f"directory '{bundle_dir}' — aborting for safety" | |
| ) | |
| tar.extractall(path=bundle_dir) | |
| extraction_root = bundle_dir.resolve() | |
| for member in tar.getmembers(): | |
| member_path = (bundle_dir / member.name).resolve() | |
| try: | |
| member_path.relative_to(extraction_root) | |
| except ValueError: | |
| raise RuntimeError( | |
| f"Tarball member '{member.name}' would escape extraction " | |
| f"directory '{bundle_dir}' — aborting for safety" | |
| ) | |
| tar.extract(member, path=bundle_dir) |
| """ | ||
| Script to regenerate L2 test bundles for cgroupv2 compatibility. | ||
|
|
||
| This script: | ||
| 1. Extracts each .tar.gz bundle | ||
| 2. Patches config.json to remove cgroupv2-incompatible settings | ||
| 3. Repacks the bundle | ||
|
|
||
| Changes made for cgroupv2 compatibility: | ||
| - Removes 'swappiness' from memory resources (not supported in cgroupv2) | ||
| - Sets realtimeRuntime and realtimePeriod to valid values or removes them | ||
| - Updates rootfsPropagation to 'slave' for better compatibility | ||
| """ | ||
|
|
||
| import json | ||
| import shutil | ||
| import sys | ||
| import tarfile | ||
| from pathlib import Path | ||
|
|
||
|
|
||
| def patch_config_for_cgroupv2(config: dict, bundle_name: str = "") -> dict: | ||
| """Patch OCI config.json for cgroupv2 compatibility.""" | ||
|
|
||
| # Remove swappiness from memory resources (not supported in cgroupv2) | ||
| if 'linux' in config and 'resources' in config['linux']: | ||
| resources = config['linux']['resources'] | ||
|
|
||
| if 'memory' in resources: | ||
| memory = resources['memory'] | ||
| if 'swappiness' in memory: | ||
| del memory['swappiness'] | ||
| print(" - Removed 'swappiness' from memory resources") | ||
|
|
||
| # Fix cpu realtime settings - remove null values | ||
| if 'cpu' in resources: | ||
| cpu = resources['cpu'] | ||
| if cpu.get('realtimeRuntime') is None: | ||
| del cpu['realtimeRuntime'] | ||
| print(" - Removed null 'realtimeRuntime'") | ||
| if cpu.get('realtimePeriod') is None: | ||
| del cpu['realtimePeriod'] | ||
| print(" - Removed null 'realtimePeriod'") | ||
| # Remove cpu section entirely if empty | ||
| if not cpu: | ||
| del resources['cpu'] | ||
| print(" - Removed empty 'cpu' section") | ||
|
|
||
| # Remove rootfsPropagation entirely - it causes "make rootfs private" errors | ||
| # in user namespace environments like GitHub Actions | ||
| if 'linux' in config and 'rootfsPropagation' in config['linux']: | ||
| del config['linux']['rootfsPropagation'] | ||
| print(" - Removed linux.rootfsPropagation") | ||
|
|
||
| # Remove top-level rootfsPropagation as well | ||
| if 'rootfsPropagation' in config: | ||
| del config['rootfsPropagation'] | ||
| print(" - Removed top-level rootfsPropagation") |
There was a problem hiding this comment.
The module docstring says the script "Updates rootfsPropagation to 'slave'", but the implementation actually removes both linux.rootfsPropagation and top-level rootfsPropagation. Please update the docstring (or the logic) so it accurately reflects what the script does.
Fix L2 test failures on GitHub Actions CI (cgroupv2 compatibility and GCOV coverage issues)
This PR addresses multiple cascading failures in the L2 test suite when running on GitHub Actions runners which use cgroupv2 (unified hierarchy) and have coverage instrumentation enabled.
Root Causes Fixed:
cgroupv2 Compatibility - The codebase assumed cgroupv1 (separate controller mounts). GitHub Actions uses cgroupv2 where all controllers share a single mount point. Updated DobbyEnv.cpp, IonMemoryPlugin.cpp, GpuPlugin.cpp, and DobbyTemplate.cpp to detect and handle cgroupv2. Also added runtime patching in test_utils.py to remove unsupported settings (swappiness, kernel memory, RT scheduling) from bundle configs.
GCOV Coverage Write Failures - Instrumented binaries (DobbyPluginLauncher) couldn't write .gcda files to /home/runner/work, causing hooks to exit with code 1 and breaking all container operations. Fixed by setting GCOV_PREFIX=/tmp/gcov at workflow level and passing GCOV environment to DobbyDaemon via sudo -E.
Network Setup Issues - DNS resolution and NAT not working in containers because eth0 was hardcoded but doesn't exist on GitHub runners, and systemd stub resolver doesn't work well in containers. Fixed by detecting default-route interface dynamically and forcing reliable DNS servers (1.1.1.1, 8.8.8.8).
OCI Hook Config Path - createContainer hooks run in container namespace and couldn't access host paths for config.json. Fixed by mounting config.json to /tmp/dobby_config.json inside container and adding fallback path resolution in DobbyPluginLauncher.
Invalid Interface Index in Networking - Netlink code was attempting to set addresses on interface index 0 (invalid), causing network setup failures. Added validation to reject invalid interface indices before setting addresses.
Test Procedure
Push PR to trigger GitHub Actions L2 test workflow
Verify all test groups pass:
basic_sanity_tests
container_manipulations
bundle_generation
plugin_launcher
command_line_containers
annotation_tests
start_from_bundle
thunder_plugin
network_tests
pid_limit_tests
memcr_tests
Type of Change
[Bug fix (non-breaking change which fixes an issue)
Requires Bitbake Recipe changes?
No - This PR contains only code fixes and CI workflow changes. No new dependencies, build options, or installed files are added.