Skip to content

RDKEMW-13253- Dobby l2 fix #433

Open
Sonajeya31 wants to merge 4 commits intordkcentral:developfrom
Sonajeya31:topic/RDKEMW-13253-5
Open

RDKEMW-13253- Dobby l2 fix #433
Sonajeya31 wants to merge 4 commits intordkcentral:developfrom
Sonajeya31:topic/RDKEMW-13253-5

Conversation

@Sonajeya31
Copy link
Copy Markdown

Fix L2 test failures on GitHub Actions CI (cgroupv2 compatibility and GCOV coverage issues)

This PR addresses multiple cascading failures in the L2 test suite when running on GitHub Actions runners which use cgroupv2 (unified hierarchy) and have coverage instrumentation enabled.
Root Causes Fixed:

cgroupv2 Compatibility - The codebase assumed cgroupv1 (separate controller mounts). GitHub Actions uses cgroupv2 where all controllers share a single mount point. Updated DobbyEnv.cpp, IonMemoryPlugin.cpp, GpuPlugin.cpp, and DobbyTemplate.cpp to detect and handle cgroupv2. Also added runtime patching in test_utils.py to remove unsupported settings (swappiness, kernel memory, RT scheduling) from bundle configs.

GCOV Coverage Write Failures - Instrumented binaries (DobbyPluginLauncher) couldn't write .gcda files to /home/runner/work, causing hooks to exit with code 1 and breaking all container operations. Fixed by setting GCOV_PREFIX=/tmp/gcov at workflow level and passing GCOV environment to DobbyDaemon via sudo -E.

Network Setup Issues - DNS resolution and NAT not working in containers because eth0 was hardcoded but doesn't exist on GitHub runners, and systemd stub resolver doesn't work well in containers. Fixed by detecting default-route interface dynamically and forcing reliable DNS servers (1.1.1.1, 8.8.8.8).

OCI Hook Config Path - createContainer hooks run in container namespace and couldn't access host paths for config.json. Fixed by mounting config.json to /tmp/dobby_config.json inside container and adding fallback path resolution in DobbyPluginLauncher.

Invalid Interface Index in Networking - Netlink code was attempting to set addresses on interface index 0 (invalid), causing network setup failures. Added validation to reject invalid interface indices before setting addresses.

Test Procedure
Push PR to trigger GitHub Actions L2 test workflow
Verify all test groups pass:
basic_sanity_tests
container_manipulations
bundle_generation
plugin_launcher
command_line_containers
annotation_tests
start_from_bundle
thunder_plugin
network_tests
pid_limit_tests
memcr_tests

Type of Change
[Bug fix (non-breaking change which fixes an issue)
Requires Bitbake Recipe changes?
No - This PR contains only code fixes and CI workflow changes. No new dependencies, build options, or installed files are added.

Copilot AI review requested due to automatic review settings April 15, 2026 07:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets GitHub Actions L2 test stability by improving bundle handling and platform-specific behavior (cgroupv2/CI quirks), and by adjusting templates/CI steps to reduce environment-dependent failures.

Changes:

  • Harden L2 test runner utilities (bundle extraction validation, DobbyDaemon readiness probing, container launch retries, improved async log reading).
  • Make multiple L2 tests more CI-tolerant (bundle config sanitization, cgroupv1/v2 pid-limit path resolution, more robust bundle comparisons).
  • Add CI workflow step to regenerate bundles for cgroupv2 compatibility and relax LCOV handling for unused coverage artifacts.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/L2_testing/test_runner/thunder_plugin.py Sanitizes bundle config and makes test set platform-conditional; improves handling of invalid bundles.
tests/L2_testing/test_runner/test_utils.py Adds bundle extraction validation, nested bundle path detection, DobbyDaemon readiness probing, and more robust container launch behavior.
tests/L2_testing/test_runner/start_from_bundle.py Uses validated bundle context and waits briefly for log flushing before asserting.
tests/L2_testing/test_runner/runner.py Fixes test-results aggregation to only attach per-suite JSON when tests actually ran.
tests/L2_testing/test_runner/pid_limit_tests.py Resolves pids.max across cgroup v1/v2 layouts, including /proc/<pid>/cgroup-derived paths.
tests/L2_testing/test_runner/network_tests.py Switches to shared launch_container and adds invalid-bundle handling.
tests/L2_testing/test_runner/memcr_tests.py Hardens PID parsing and optionally skips PID checkpoint validation when PIDs aren’t reported.
tests/L2_testing/test_runner/bundle_generation.py Replaces filesystem diff with normalized config.json comparison and adds rootfs existence check.
tests/L2_testing/test_runner/bundle/regenerate_bundles.py New CI helper to patch and repack test bundles for cgroupv2 compatibility.
tests/L2_testing/test_runner/basic_sanity_tests.py Reworks async log reading with select() and makes daemon-stop verification less log-dependent.
tests/L2_testing/test_runner/annotation_tests.py Starts container from spec path (vs extracted bundle) and stops container after annotation checks.
client/tool/source/Main.cpp Removes mutex lock from stop callback before completing a promise.
bundle/lib/source/templates/OciConfigJsonVM1.0.2-dobby.template Removes memory swappiness field from generated OCI config templates.
bundle/lib/source/templates/OciConfigJson1.0.2-dobby.template Removes memory swappiness field from generated OCI config templates.
.github/workflows/L2-tests.yml Adds bundle-regeneration step and relaxes LCOV errors for unused coverage.
Comments suppressed due to low confidence (1)

tests/L2_testing/test_runner/network_tests.py:121

  • If launch_container fails, the test still proceeds to sleep and read from netcat, and then overwrites the earlier launch-failure message with a generic "Received ... expected ..." message. This makes failures harder to diagnose and can produce misleading output. Consider short-circuiting when launch_result is false (or preserving the original error message/log) before attempting to validate the netcat payload.
        launch_result = test_utils.launch_container(container_name, bundle_path)

        message = ""
        result = True

        if not launch_result:
            message = "Container did not launch successfully"
            result = False

        # give container time to start and send message before checking netcat listener
        sleep(2)

        nc_message = nc.get_output().rstrip("\n")

        # check if netcat listener received message
        if test.expected_output.lower() not in nc_message.lower():
            message = "Received '%s' from container, expected '%s'" % (nc_message.lower(), test.expected_output.lower())
            result = False

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/L2_testing/test_runner/thunder_plugin.py
Comment thread tests/L2_testing/test_runner/test_utils.py
Comment thread tests/L2_testing/test_runner/bundle/regenerate_bundles.py
Copilot AI review requested due to automatic review settings April 16, 2026 06:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 366 to 368
iterations = int(exit_timeout / 0.5)
for i in range(0, iterations):
sleep(0.5)
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterations = int(exit_timeout / 0.5) truncates and can become 0 for small exit_timeout values (or under-wait when exit_timeout isn’t a multiple of 0.5). That would immediately take the timeout path without any DobbyTool info checks. Consider using a ceiling calculation and/or enforcing a minimum of 1 iteration so the wait duration matches exit_timeout more closely.

Copilot uses AI. Check for mistakes.
Comment on lines 100 to 104
if (state == IDobbyProxyEvents::ContainerState::Stopped && containerId == *id)
{
AI_LOG_INFO("Container %s has stopped", containerId.c_str());
std::lock_guard<std::mutex> locker(gLock);
promise.set_value();
}
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the mutex lock in the callback introduces a data race on the global promise: stopCommand reassigns promise under gLock, while this callback can call promise.set_value() concurrently from the proxy event thread. That’s undefined behavior and can crash intermittently. Restore synchronization (e.g., keep std::lock_guard<std::mutex> locker(gLock); here, or switch to a thread-safe pattern such as storing the promise in a shared_ptr captured by the callback).

Copilot uses AI. Check for mistakes.
Changes made for cgroupv2 compatibility:
- Removes 'swappiness' from memory resources (not supported in cgroupv2)
- Sets realtimeRuntime and realtimePeriod to valid values or removes them
- Updates rootfsPropagation to 'slave' for better compatibility
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring says the script “Updates rootfsPropagation to 'slave'”, but the implementation removes both linux.rootfsPropagation and top-level rootfsPropagation. This mismatch is confusing for anyone using the script to understand what’s being changed. Please update the docstring to reflect the actual behavior (or implement the documented change).

Suggested change
- Updates rootfsPropagation to 'slave' for better compatibility
- Removes both linux.rootfsPropagation and top-level rootfsPropagation

Copilot uses AI. Check for mistakes.
with test_utils.dobby_daemon(), test_utils.untar_bundle(container_name) as bundle_path:
bundle_ctx = test_utils.untar_bundle(container_name)
with test_utils.dobby_daemon(), bundle_ctx as bundle_path:
if not bundle_ctx.valid:
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bundle_ctx.valid is accessed here, but test_utils.untar_bundle currently doesn't define a valid attribute (it only untars in __init__ and returns the path in __enter__). This will raise AttributeError and fail the test before any bundle validation logic runs. Either add a valid field to untar_bundle (and set it based on tar exit status + existence of required files like config.json/rootfs) or remove these checks and handle extraction/validation failures another way.

Suggested change
if not bundle_ctx.valid:
bundle_valid = (
path.isdir(bundle_path) and
path.isfile(path.join(bundle_path, "config.json")) and
path.isdir(path.join(bundle_path, "rootfs"))
)
if not bundle_valid:

Copilot uses AI. Check for mistakes.
Comment on lines +66 to +69
bundle_ctx = test_utils.untar_bundle(container_id)
with bundle_ctx as bundle_path:
if not bundle_ctx.valid:
return False, "Bundle extraction or validation failed"
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bundle_ctx.valid is referenced, but test_utils.untar_bundle doesn't currently provide a valid attribute. This will throw at runtime and mask the real extraction/validation failure. Consider extending untar_bundle to expose valid (and any failure reason) or remove this check and instead validate the extracted bundle path (e.g., check config.json exists) before launching.

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +67
bundle_ctx = test_utils.untar_bundle(container_id)
with bundle_ctx as bundle_path:
if not bundle_ctx.valid:
return False, "Bundle extraction or validation failed"

Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bundle_ctx.valid is used here, but test_utils.untar_bundle has no valid attribute in its current implementation, so this will raise AttributeError. If you need to guard against missing/invalid bundles, implement that in untar_bundle (e.g., capture tar return code and verify required files) and expose it via valid/error, or do explicit file checks on bundle_path.

Copilot uses AI. Check for mistakes.
Comment on lines +101 to +109
bundle_ctx = test_utils.untar_bundle(container_name)
with bundle_ctx as bundle_path:
if not bundle_ctx.valid:
test = tests[0]
output = test_utils.create_simple_test_output(
test, False,
"Bundle extraction or validation failed",
"Bundle tarball could not be extracted or config.json was missing"
)
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bundle_ctx.valid is accessed here, but test_utils.untar_bundle doesn't currently have a valid attribute. This will raise at runtime and cause this test group to error rather than report a clean bundle-extraction failure. If validation is desired, update untar_bundle to track success/failure (and maybe an error message) or validate bundle_path directly before proceeding.

Copilot uses AI. Check for mistakes.
Comment on lines 119 to 123
if (state == wp->state && containerId == wp->containerId)
{
AI_LOG_INFO("Wait complete");
std::lock_guard<std::mutex> locker(gLock);
promise.set_value();
}
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as the stop callback: without locking (or another synchronization mechanism), promise.set_value() can race with promise = std::promise<void>() in the command handler. Please reintroduce gLock protection here or use a callback-specific promise instance that can’t be concurrently reassigned.

Copilot uses AI. Check for mistakes.

# Backup original
if backup:
backup_path = bundle_tarball.with_suffix('.tar.gz.bak')
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backup_path = bundle_tarball.with_suffix('.tar.gz.bak') will produce an odd filename for inputs like foo.tar.gz (e.g., foo.tar.tar.gz.bak) because Path.with_suffix replaces only the last suffix. If you want foo.tar.gz.bak, build the backup name explicitly (e.g., Path(str(bundle_tarball) + '.bak') or bundle_tarball.with_name(bundle_tarball.name + '.bak')).

Suggested change
backup_path = bundle_tarball.with_suffix('.tar.gz.bak')
backup_path = bundle_tarball.with_name(bundle_tarball.name + '.bak')

Copilot uses AI. Check for mistakes.
Comment on lines +93 to +98
bundle_ctx = test_utils.untar_bundle(container_name)
with test_utils.dobby_daemon(), netcat_listener() as nc, bundle_ctx as bundle_path:
if not bundle_ctx.valid:
output = test_utils.create_simple_test_output(tests[0], False, "Bundle extraction or validation failed",
log_content="Bundle extraction or validation failed; container was never launched.")
output_table.append(output)
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bundle_ctx.valid is checked here, but test_utils.untar_bundle doesn't define a valid attribute. This will fail with AttributeError even when bundle extraction succeeds. Implement valid on the context manager (and set it based on extraction + required file existence) or replace this with explicit validation of bundle_path (e.g., check config.json exists) before starting the container.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings April 16, 2026 08:32
@Sonajeya31 Sonajeya31 force-pushed the topic/RDKEMW-13253-5 branch from f1bf099 to 298f0e7 Compare April 16, 2026 08:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +116 to +123
for member in tar.getmembers():
member_path = (bundle_dir / member.name).resolve()
if not str(member_path).startswith(str(bundle_dir.resolve())):
raise RuntimeError(
f"Tarball member '{member.name}' would escape extraction "
f"directory '{bundle_dir}' — aborting for safety"
)
tar.extractall(path=bundle_dir)
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tar path-traversal check is incorrect: using str(member_path).startswith(str(bundle_dir.resolve())) can be bypassed by paths like <bundle_dir>_evil/... (prefix match) and would allow extraction outside bundle_dir. Use a real path containment check (e.g., member_path.relative_to(bundle_dir.resolve()) in a try/except, or os.path.commonpath), or extract members individually after validating each target path.

Suggested change
for member in tar.getmembers():
member_path = (bundle_dir / member.name).resolve()
if not str(member_path).startswith(str(bundle_dir.resolve())):
raise RuntimeError(
f"Tarball member '{member.name}' would escape extraction "
f"directory '{bundle_dir}' — aborting for safety"
)
tar.extractall(path=bundle_dir)
extraction_root = bundle_dir.resolve()
for member in tar.getmembers():
member_path = (bundle_dir / member.name).resolve()
try:
member_path.relative_to(extraction_root)
except ValueError:
raise RuntimeError(
f"Tarball member '{member.name}' would escape extraction "
f"directory '{bundle_dir}' — aborting for safety"
)
tar.extract(member, path=bundle_dir)

Copilot uses AI. Check for mistakes.
Comment on lines +2 to +59
"""
Script to regenerate L2 test bundles for cgroupv2 compatibility.

This script:
1. Extracts each .tar.gz bundle
2. Patches config.json to remove cgroupv2-incompatible settings
3. Repacks the bundle

Changes made for cgroupv2 compatibility:
- Removes 'swappiness' from memory resources (not supported in cgroupv2)
- Sets realtimeRuntime and realtimePeriod to valid values or removes them
- Updates rootfsPropagation to 'slave' for better compatibility
"""

import json
import shutil
import sys
import tarfile
from pathlib import Path


def patch_config_for_cgroupv2(config: dict, bundle_name: str = "") -> dict:
"""Patch OCI config.json for cgroupv2 compatibility."""

# Remove swappiness from memory resources (not supported in cgroupv2)
if 'linux' in config and 'resources' in config['linux']:
resources = config['linux']['resources']

if 'memory' in resources:
memory = resources['memory']
if 'swappiness' in memory:
del memory['swappiness']
print(" - Removed 'swappiness' from memory resources")

# Fix cpu realtime settings - remove null values
if 'cpu' in resources:
cpu = resources['cpu']
if cpu.get('realtimeRuntime') is None:
del cpu['realtimeRuntime']
print(" - Removed null 'realtimeRuntime'")
if cpu.get('realtimePeriod') is None:
del cpu['realtimePeriod']
print(" - Removed null 'realtimePeriod'")
# Remove cpu section entirely if empty
if not cpu:
del resources['cpu']
print(" - Removed empty 'cpu' section")

# Remove rootfsPropagation entirely - it causes "make rootfs private" errors
# in user namespace environments like GitHub Actions
if 'linux' in config and 'rootfsPropagation' in config['linux']:
del config['linux']['rootfsPropagation']
print(" - Removed linux.rootfsPropagation")

# Remove top-level rootfsPropagation as well
if 'rootfsPropagation' in config:
del config['rootfsPropagation']
print(" - Removed top-level rootfsPropagation")
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring says the script "Updates rootfsPropagation to 'slave'", but the implementation actually removes both linux.rootfsPropagation and top-level rootfsPropagation. Please update the docstring (or the logic) so it accurately reflects what the script does.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants