WIP: UI smoke tests for axis, touchy, gmoccapy, qtdragon #3999

Draft
grandixximo wants to merge 17 commits into LinuxCNC:master from grandixximo:ui-tests

@grandixximo grandixximo commented May 4, 2026

Draft, opening for CI feedback. Refs #3756.

Summary

Phase 1 of the GUI test work tracked in #3756. Each test launches a GUI under xvfb-run against an existing configs/sim/<gui>/*.ini, drives Estop reset / machine on / home all via NML, asserts the interpreter reaches IDLE, then shuts down cleanly. Verifies the GUI starts and accepts basic commands without crashing.

Coverage

  • axis
  • touchy
  • gmoccapy
  • qtdragon (qtdragon_xyz/qtdragon_metric.ini)

Mechanics

  • tests/ui-smoke/_lib/launch.sh: xvfb-run wrapper, setsid so the linuxcnc process group can be signalled cleanly, falls back to axis-remote --quit then SIGTERM with grace then SIGKILL. Skips with exit 77 if xvfb-run is unavailable (matches tests/tooledit and tests/pyvcp).
  • tests/ui-smoke/_lib/drive.py: NML driver. Tolerant of sim configs that come up already in STATE_ON via auto-estop-release HAL wiring. Falls back to per-joint serial homing if no HOME_SEQUENCE is configured.
  • tests/ui-smoke/_lib/checkresult.sh: pass when UI_SMOKE_OK printed and no crash markers in captured logs.
  • Reuses existing sim configs, no test-only INI files.
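The shutdown escalation described for launch.sh can be sketched as follows. This is a minimal Python illustration, not the shell code from this PR: the function names and the default grace period are assumptions, and the `axis-remote --quit` step is left out.

```python
import os
import signal
import subprocess


def launch_in_new_session(argv):
    """Start a child in its own session, the Python counterpart of `setsid`."""
    # start_new_session=True makes the child a session and process group
    # leader, so the whole group can be signalled through its pid.
    return subprocess.Popen(argv, start_new_session=True)


def stop_process_group(proc, grace_s=30.0):
    """Send SIGTERM to the child's group, escalate to SIGKILL after grace_s."""
    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # already gone, just collect the exit status
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # grace period expired
        return proc.wait()
```

Signalling the group rather than the single pid is what lets the children of the linuxcnc script receive the signal as well.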

Cleanup discipline

  • .gitignore covers all runtime artifacts (linuxcnc.{out,err,pid}, ui-smoke.{out,err}, result, stderr)
  • 4 consecutive runs locally: 4/4 pass, 0 shmem errors, working tree clean (no untracked files added beyond the committed test scripts). Aligns with the clean-tree gate Bertho asked for and that @hdiethelm is wiring up in CI improvements: General improvements #3984.

Deps

xvfb is already declared in debian/control with the <!nocheck> profile so apt-get build-dep installs it on the existing CI without a workflow change. Coordinated with @hdiethelm in #3984: this PR adds no system deps; if his lands first, no rebase needed here.

Out of scope (deferred)

  • Phase 2: load a small G-code file via linuxcnc.command.program_open + auto(RUN), verify final position via linuxcnc.stat.position. Per-GUI cross-checks via xdotool or AT-SPI where useful.
  • Phase 3: screenshot or short video on failure, uploaded as CI artifact.

Test plan

  • Local: 4/4 pass under scripts/runtests tests/ui-smoke, no shmem leaks
  • CI: rip-and-test passes
  • Reviewer feedback on scope: this PR ships smoke coverage only ("does it start, NML reachable, no startup crash"), per Bertho's framing in #3756 (Add tests starting GUIs, likely falling back to xvfb for it). Functional behaviour (load G-code, verify position) is tracked as a Phase 2 follow-up.

Phase 1 of LinuxCNC#3756: launch each GUI under xvfb-run against an existing
sim config, drive Estop reset / machine on / home all via NML, assert
the interpreter reaches IDLE, then shut down cleanly. Verifies the GUI
starts and accepts basic commands without crashing.

Skips gracefully (exit 77) when xvfb-run is not installed, matching
the precedent set by tests/tooledit and tests/pyvcp.

Shared helpers under _lib/:
  drive.py        common NML driver, prints UI_SMOKE_OK on success
  launch.sh       xvfb-run wrapper with setsid + signal escalation for
                  clean linuxcnc shutdown (preserves shared memory
                  cleanup via scripts/linuxcnc trap)
  checkresult.sh  shared pass/fail check delegated to by per-test
                  checkresult shims

Each per-GUI directory exposes test.sh + checkresult and reuses the
existing configs/sim/<gui>/*.ini so no test-only sim configs are
introduced.

Functional tests (load G-code, verify final position) and screenshot/
video on failure are deferred to follow-up phases.

xvfb is already declared in debian/control (<!nocheck>) so apt-get
build-dep installs it on CI; no new system deps required for this
phase.

Refs LinuxCNC#3756

CI failed with "Permission denied" exec'ing _lib/launch.sh because the
local repo has core.filemode=false so chmod +x was not recorded in the
git index. Use git update-index --chmod=+x to mark all test scripts
as executable.

Two CI-driven fixes:

1. Per-GUI Python module preflight in launch.sh. test.sh now passes a
   comma-separated list of modules the GUI needs at import time; if
   any fail to import the test exits 77 (skipped) rather than wedging
   linuxcnc waiting for a GUI that will never come up.

   - axis: OpenGL.GL
   - touchy, gmoccapy: gi
   - qtdragon: PyQt5.QtCore, qtvcp

   Master CI does not currently install these runtime deps (Bertho's
   LinuxCNC#3391 work added them only to the 2.9 branch), so without preflight
   every smoke test failed with a wedged linuxcnc startup or an
   uninformative timeout. This way the tests skip cleanly until the
   deps land in master CI.

2. Wait up to 30s for the linuxcnc SIGTERM trap (scripts/linuxcnc
   Cleanup) to finish before SIGKILL. Earlier tighter window meant
   Cleanup got cut off mid-run and left shared memory attached, which
   caused subsequent tests in the same job to fail with SHMERR.
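The module preflight from item 1 could look roughly like this in Python. It is a sketch under assumptions: the function names are hypothetical, and the real check lives in launch.sh, whose exact form is not shown here.

```python
import importlib

EXIT_SKIP = 77  # skip exit code named in the commit message


def missing_modules(spec):
    """Return the modules from a comma-separated list that fail to import."""
    missing = []
    for name in (part.strip() for part in spec.split(",")):
        if not name:
            continue
        try:
            importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass, so dotted names whose
            # parent package is absent end up here too.
            missing.append(name)
    return missing


def preflight_exit_code(spec):
    """Exit status for the preflight: 77 if anything is missing, else 0."""
    return EXIT_SKIP if missing_modules(spec) else 0
```

For qtdragon the list from the commit message would be passed as `"PyQt5.QtCore,qtvcp"`.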

Refs LinuxCNC#3756

The previous launch.sh had `echo "WARN: ..."` inside a `bash -c "..."`
heredoc; the inner double quotes closed the outer string and the
shutdown block was truncated. Symptom on CI: "linuxcnc: -c: line 34:
syntax error: unexpected end of file" before any logs were captured.

Switch to single quotes for the warning message. Also add cairo to
gmoccapy's import preflight: gladevcp.makepins (loaded by gmoccapy)
imports cairo via the led module, which trips on minimal CI without
python3-cairo.

scripts/runtests does not honor exit 77 from a test.sh; its skip
mechanism is a per-directory `skip` executable that returns non-zero
when the test should be skipped. Add a shared _lib/skip-if-missing.sh
and per-GUI skip scripts that check for xvfb-run plus the python
modules each GUI needs. The launch.sh preflight stays as a fallback.

Modules required:
  axis      OpenGL.GL
  touchy    gi, cairo
  gmoccapy  gi, cairo
  qtdragon  PyQt5.QtCore, qtvcp

Forward port of the GUI dependency work from 2.9 (LinuxCNC#3391). The runtime
deps were already in linuxcnc-uspace's Depends, but apt-get build-dep
on CI does not install runtime deps, which left the new ui-smoke tests
unable to launch any GUI and forced them to skip.

Adds python3-opengl, python3-pyqt5, python3-pyqt5.qsci, python3-cairo,
python3-gi, python3-gi-cairo, gir1.2-gtk-3.0 under the !nocheck profile,
matching the existing pattern for xvfb and x11-xserver-utils.

Edited debian/control.top.in (debian/control is gitignored and
regenerated by debian/configure).

Refs LinuxCNC#3391, LinuxCNC#3756

CI run after the first dep batch revealed gmoccapy needs the
GtkSource-4 typelib, qtdragon needs additional PyQt5 modules
(qtsvg/qtopengl/qtwebengine), python3-qtpy, and the dbus mainloop
binding. Add these to Build-Depends with !nocheck profile so they
install on apt-get build-dep.

Also extend skip-if-missing.sh to verify gi typelibs (entries of the
form gi:Namespace:version), not just python imports. This catches
the GtkSource case where gi imports fine but the typelib is absent,
which gladevcp tripped on at gi.require_version time.

touchy and gmoccapy skip predicates now require Gtk-3.0 (and
GtkSource-4 for gmoccapy).
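A sketch of how entries of the form gi:Namespace:version could be told apart from plain module names and checked. The helper names are hypothetical and this is Python rather than the shell of skip-if-missing.sh; `gi.require_version` is the same call the commit message says gladevcp tripped on.

```python
def parse_requirement(entry):
    """Classify an entry as a gi typelib requirement or a plain module name."""
    parts = entry.split(":")
    if len(parts) == 3 and parts[0] == "gi":
        return ("typelib", parts[1], parts[2])
    return ("module", entry, None)


def typelib_available(namespace, version):
    """True if PyGObject is importable and the typelib can be required."""
    try:
        import gi
        gi.require_version(namespace, version)
    except (ImportError, ValueError):
        # ImportError: gi itself is missing.
        # ValueError: gi imports fine but the typelib is absent.
        return False
    return True
```

With this shape, gmoccapy's predicate would check `gi:Gtk:3.0` and `gi:GtkSource:4` in addition to its Python imports.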

Refs LinuxCNC#3756

The previous driver did too much for a smoke layer (Estop reset,
machine on, home all, wait for IDLE) and tripped on each GUI's
specific startup sequence assumptions. Reduce to: connect to NML,
wait for task ready, sleep 3s for GUI construction, recheck task
alive, print UI_SMOKE_OK. This is the literal answer to Bertho's
"does it start" question. Functional behaviour belongs in
tests/ui-functional/ (Phase 2).
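The reduced driver flow can be sketched like this. It is not drive.py itself: `task_ready` is a hypothetical callable standing in for the NML poll, and using UI_SMOKE_FAIL as the failure marker is an assumption at this point in the history.

```python
import time


def run_smoke(task_ready, timeout_s=30.0, settle_s=3.0, interval_s=0.1):
    """Wait for the task, let the GUI settle, then confirm the task is alive."""
    deadline = time.monotonic() + timeout_s
    while not task_ready():
        if time.monotonic() >= deadline:
            return "UI_SMOKE_FAIL"  # task never became ready
        time.sleep(interval_s)
    time.sleep(settle_s)  # give the GUI time to finish constructing
    # Recheck: a GUI that crashed during construction takes the task down.
    return "UI_SMOKE_OK" if task_ready() else "UI_SMOKE_FAIL"
```

The caller would print the returned marker so that the result check can look for it in the captured log.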

Also harden shutdown: extend the SIGTERM grace from 30s to 60s, and
add a halrun -U + explicit ipcrm fallback if Cleanup still has not
finished. Removes /tmp/linuxcnc.lock too. Without this the next
ui-smoke test inherited stale shared memory and wedged at startup.

Bump LINUXCNC_TIMEOUT to 180s (8s startup + 30s driver + 60s grace +
slack) and reduce DRIVER_TIMEOUT to 30s now that the driver work is
small.

Refs LinuxCNC#3756

CI run after the previous fix made progress (0 shmem errors, axis and
gmoccapy passing) but qtdragon hit "bind error: 98 -- Address already
in use" on NML port 5005, meaning gmoccapy's linuxcncsvr was still
alive when qtdragon tried to start. touchy then cascaded.

Add a pre-launch cleanup to launch.sh that pkills the known long-lived
processes (linuxcncsvr, milltask, halui, hal_bridge, axis, gmoccapy,
touchy, qtvcp, rtapi_app), removes /tmp/linuxcnc.lock, runs halrun -U,
and ipcrms any leftover linuxcnc shared memory keys before each test.

Refs LinuxCNC#3756
Comment thread tests/ui-smoke/_lib/launch.sh Outdated
Comment thread tests/ui-smoke/_lib/launch.sh Outdated
Comment thread tests/ui-smoke/_lib/launch.sh Outdated
Comment thread tests/ui-smoke/README Outdated
@hdiethelm
Contributor

> Phase 3: screenshot or short video on failure, uploaded as CI artifact.

If you manage to create consistent screenshots and want to go to pedantic mode:

  • Store reference known good screenshots (TBD where, I often use submodules for test data storage so the main repo is not overfilled and it is still tracked)
  • Take screenshots at certain points where everything is static, like before / after homing / at the end
  • Compare to the reference and highlight any differences, fail if there are differences -> Artifact
  • The dev can download the artifacts, check them manually and if the change was on purpose replace the known good ones, so the CI passes again

Probably overcomplicated, and I don't know how deterministic LinuxCNC is, but this way bugs like #3979 can be easily avoided. When testing manually, these kinds of bugs are often overlooked.

Three review-driven changes:

1. Fix self-kill regression: pkill -KILL -f "\\bqtdragon\\b" matched
   the launch.sh process whose argv contained the path
   .../qtdragon_metric.ini, sending SIGKILL to the test itself
   (exit 137 across all 4 tests). Use pkill -KILL -x against an
   exact daemon name list (linuxcncsvr, milltask, halui, rtapi_app),
   not the GUI program names; the GUIs are children of the linuxcnc
   script and get reaped via SIGTERM to its process group.

2. Dedupe cleanup. Both pre-launch and post-shutdown blocks repeated
   the daemon list and shared-memory key list; extract them to
   _lib/cleanup-runtime.sh which is called from launch.sh and from
   the heredoc fallback. Single source of truth.

3. Drop the pre-driver `sleep 8` and the python module preflight
   inside launch.sh. drive.py polls echo_serial_number for task
   readiness so a wall-clock wait is unnecessary. With GUI runtime
   deps now declared in debian/control under !nocheck, the python
   preflight has nothing to do; missing deps will fail the test
   loudly which is what reviewers asked for ("if it skips gracefully
   we don't know whether the code is sane"). The skip predicate
   only skips on xvfb-run absence (rare local dev environment).
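The self-kill in item 1 comes down to the difference between matching the full command line and matching the exact process name. The following Python sketch only approximates pkill's matching with the `re` module; the wrapper command line and its path layout are assumptions made for illustration.

```python
import re


def pkill_f_matches(pattern, cmdline):
    """Approximate `pkill -f`: the regex may match anywhere in the command line."""
    return re.search(pattern, cmdline) is not None


def pkill_x_matches(pattern, process_name):
    """Approximate `pkill -x`: the regex must match the whole process name."""
    return re.fullmatch(pattern, process_name) is not None


# Hypothetical command line of the test wrapper itself (path layout assumed).
WRAPPER_CMDLINE = (
    "bash _lib/launch.sh configs/sim/qtdragon/qtdragon_xyz/qtdragon_metric.ini"
)

# The GUI name appears as a path component, so -f hits the wrapper.
wrapper_hit_by_f = pkill_f_matches(r"\bqtdragon\b", WRAPPER_CMDLINE)

# An exact daemon name never equals the wrapper's process name.
wrapper_hit_by_x = pkill_x_matches("linuxcncsvr", "bash")
```

Under these assumptions `wrapper_hit_by_f` is true and `wrapper_hit_by_x` is false, which is why the fix restricts the kill list to exact daemon names.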

Refs LinuxCNC#3756, PR LinuxCNC#3999
@grandixximo
Contributor Author

> > Phase 3: screenshot or short video on failure, uploaded as CI artifact.
>
> If you manage to create consistent screenshots and want to go to pedantic mode:
>
> * Store reference known good screenshots (TBD where, I often use submodules for test data storage so the main repo is not overfilled and it is still tracked)
> * Take screenshots at certain points where everything is static, like before / after homing / at the end
> * Compare to the reference and highlight any differences, fail if there are differences -> Artifact
> * The dev can download the artifacts, check them manually and if the change was on purpose replace the known good ones, so the CI passes again
>
> Probably overcomplicated, and I don't know how deterministic LinuxCNC is, but this way bugs like #3979 can be easily avoided. When testing manually, these kinds of bugs are often overlooked.

The reference-screenshot diff approach is a good Phase 3 idea, will track it on #3756. For Phase 1 (this PR) I'm staying with NML state assertions only since they're deterministic; rendering will need the screen-stabilization tricks you mentioned.

After dropping the pre-driver sleep, the driver now races linuxcnc
startup. linuxcnc.stat()/command() and the first stat.poll() can
raise linuxcnc.error while linuxcncsvr is still setting up its
buffers ("emcStatusBuffer invalid err=3"). Previously the driver
bailed on the first exception, so all 4 ui-smoke tests failed
within ~1s on CI.

Retry both the constructor calls and stat.poll() until the deadline,
treating these errors as "task not ready yet" rather than fatal.
The wait_for timeout (TIMEOUT_S=30s) bounds the wait.

axis ran fully on CI (27656 task cycles ≈ 28s wall) but the test
exited 124 because the inner DRIVER_TIMEOUT=30s clipped the driver
which itself can take up to TIMEOUT_S=30s for NML connect retry +
30s for task-up wait + 3s settle. Bump DRIVER_TIMEOUT to 90 so the
driver finishes; bump LINUXCNC_TIMEOUT to 240 to accommodate driver
+ 60s shutdown grace + slack on slower runners.
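The timeout budget can be checked with a few lines of arithmetic. The numbers are the ones quoted in this commit message; the variable names are only labels chosen for this sketch.

```python
# All values in seconds, as quoted in the commit message.
CONNECT_RETRY_S = 30       # TIMEOUT_S for the NML connect retry
TASK_WAIT_S = 30           # wait for the task to come up
SETTLE_S = 3               # GUI settle sleep
OLD_DRIVER_TIMEOUT = 30
NEW_DRIVER_TIMEOUT = 90
SHUTDOWN_GRACE_S = 60
LINUXCNC_TIMEOUT = 240

# Worst case for the driver: 30 + 30 + 3 = 63 s.
driver_worst_case = CONNECT_RETRY_S + TASK_WAIT_S + SETTLE_S

# The old inner timeout was shorter than the worst case, hence exit 124.
old_timeout_too_short = driver_worst_case > OLD_DRIVER_TIMEOUT

# Slack left in the outer timeout: 240 - (90 + 60) = 90 s.
outer_slack = LINUXCNC_TIMEOUT - (NEW_DRIVER_TIMEOUT + SHUTDOWN_GRACE_S)
```

The new inner timeout of 90 s covers the 63 s worst case, and the outer timeout still leaves 90 s for slower runners.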
Two fixes:

1. drive.py: recreate linuxcnc.stat() in the retry loop. The status
   buffer can be invalid (err=3) for the first ~30s while
   linuxcncsvr initialises; once a stat object is bound to the
   invalid buffer it does not recover when the buffer becomes valid.
   Recreating the object on each retry lets the driver pick up the
   buffer as soon as it is ready. CONNECT_TIMEOUT_S widened to 60s
   to accommodate slow CI startups.

2. launch.sh: export LIBGL_ALWAYS_SOFTWARE=1 and GALLIUM_DRIVER=llvmpipe.
   GitHub Actions runners have no GPU; qtdragon's GLcanon widget
   segfaults under hardware GL when the only display is xvfb. Force
   Mesa llvmpipe software rasterizer.
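The retry from item 1 can be sketched as below. This is an illustration rather than the code in drive.py: `make_stat` and `retry_on` are hypothetical parameters so the loop runs without LinuxCNC installed. In the real driver they would presumably be `linuxcnc.stat` and `linuxcnc.error`.

```python
import time


def connect_with_retry(make_stat, timeout_s=60.0, interval_s=0.5,
                       retry_on=(Exception,)):
    """Create a fresh stat object on every attempt until poll() succeeds."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            stat = make_stat()  # never reuse an object bound to a bad buffer
            stat.poll()
            return stat
        except retry_on as exc:
            last_error = exc  # treated as "task not ready yet"
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status buffer not ready after {timeout_s}s"
            ) from last_error
        time.sleep(interval_s)
```

Constructing the object inside the loop is the point of the fix: an object created while the buffer was invalid is discarded instead of being polled again.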
CI run after the previous fix: 280/282 passing (axis and gmoccapy
green). Two remaining failures isolated:

1. touchy crashes in filechooser.py:29 because os.listdir() of
   $HOME/linuxcnc/nc_files raises FileNotFoundError on a clean CI
   $HOME. The path is hardcoded with no try/except in the GUI
   itself; pre-create it in launch.sh until the underlying bug can
   be fixed upstream.

2. qtdragon still segfaults despite LIBGL_ALWAYS_SOFTWARE=1, so the
   rest of Qt's GL stack is also reaching for hardware. Set
   QT_QUICK_BACKEND=software, QSG_RHI_BACKEND=software, and
   QT_OPENGL=software to force every Qt path through the software
   rasterizer.
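Both workarounds amount to preparing the environment before the GUI starts. A minimal Python sketch, assuming a hypothetical helper name; the variable names and values are the ones listed in the commit messages, while launch.sh itself does this in shell.

```python
import os

# Environment variables named in the commit messages.
SOFTWARE_GL_ENV = {
    "LIBGL_ALWAYS_SOFTWARE": "1",
    "GALLIUM_DRIVER": "llvmpipe",
    "QT_QUICK_BACKEND": "software",
    "QSG_RHI_BACKEND": "software",
    "QT_OPENGL": "software",
}


def prepare_gui_environment(base_env, home):
    """Return a copy of base_env forcing software GL; pre-create nc_files."""
    env = dict(base_env)
    env.update(SOFTWARE_GL_ENV)
    # touchy lists this directory at startup and crashes if it is missing.
    nc_files = os.path.join(home, "linuxcnc", "nc_files")
    os.makedirs(nc_files, exist_ok=True)
    return env, nc_files
```

The returned mapping can be passed as `env=` to `subprocess.Popen`; `exist_ok=True` keeps repeated runs harmless.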
qtvcp compiles a QRC (Qt resource) file into Python at first run
using `pyrcc5`. On CI without pyqt5-dev-tools the call fails with
"No such file or directory: 'rcc'" and qtdragon then segfaults
trying to load missing resource symbols. Adding the package to
Build-Depends with !nocheck makes apt-get build-dep install it
alongside the rest of the GUI runtime deps.

This is the last remaining ui-smoke failure: with this in place
all four (axis, gmoccapy, qtdragon, touchy) should pass on CI.

qtdragon now launches successfully and the driver prints UI_SMOKE_OK,
but Qt segfaults during shutdown when SIGTERM tears the process down
mid-cleanup. That is out of scope for a startup smoke test: the GUI
came up, accepted NML, and answered Bertho's "does it start" question.

Restrict the crash-marker grep to lines before UI_SMOKE_OK so genuine
startup crashes (no UI_SMOKE_OK printed) still fail the test, while
shutdown-side noise is tolerated. Driver already prints UI_SMOKE_OK
only after a successful NML round-trip, so a silent corruption
cannot slip through.

The previous "ignore crashes after UI_SMOKE_OK" approach was wrong
because launch.sh prints linuxcnc.{out,err} before ui-smoke.{out,err}
in the captured log, so shutdown-side crashes always appear in the
file before the UI_SMOKE_OK line and got incorrectly flagged.

The driver is the authoritative signal: it only prints UI_SMOKE_OK
after a successful NML round-trip and a re-poll after the GUI settle,
so a healthy startup is guaranteed when that line is present. Genuine
startup crashes (linuxcncsvr fails to come up, GUI dies before driver
connects) result in UI_SMOKE_FAIL or no driver output at all, both of
which we now flag explicitly.

Replaces the crash-marker regex with a simple two-line check:
UI_SMOKE_FAIL absent and UI_SMOKE_OK present.
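The two-line check is simple enough to express directly. The sketch below is a Python rendering with a hypothetical function name; checkresult.sh itself is a shell script.

```python
def smoke_passed(log_text):
    """Pass only if UI_SMOKE_OK is present and UI_SMOKE_FAIL is absent."""
    lines = log_text.splitlines()
    has_ok = any("UI_SMOKE_OK" in line for line in lines)
    has_fail = any("UI_SMOKE_FAIL" in line for line in lines)
    return has_ok and not has_fail
```

Because the check ignores line order, it no longer matters that linuxcnc.{out,err} is printed before ui-smoke.{out,err} in the captured log.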
@grandixximo
Contributor Author

Round 2 pushed; CI now passes 282/282 with all 4 ui-smoke tests running.

Changes in this round:

  • NML connect-and-poll robustness (drive.py): linuxcncsvr's status buffer can be invalid (emcStatusBuffer invalid err=3) for the first ~30s after startup. Driver now retries the connect-and-poll cycle, recreating the stat object each iteration so a stale invalid buffer does not stick. CONNECT_TIMEOUT_S=60s.

  • Software OpenGL (launch.sh): GitHub Actions runners have no GPU and qtdragon's GLcanon widget segfaulted under hardware GL. Set LIBGL_ALWAYS_SOFTWARE=1, GALLIUM_DRIVER=llvmpipe, QT_QUICK_BACKEND=software, QSG_RHI_BACKEND=software, QT_OPENGL=software.

  • pyqt5-dev-tools Build-Depends: qtvcp compiles a QRC file via pyrcc5 at first run; without the package qtdragon segfaulted with "No such file or directory: 'rcc'". Added with <!nocheck> profile.

  • Driver-trust checkresult (checkresult.sh): replaced the crash-marker grep with a simple UI_SMOKE_OK present + UI_SMOKE_FAIL absent check. The driver only prints UI_SMOKE_OK after a successful NML round-trip plus a re-poll after the GUI settle, so it is the authoritative signal. The previous regex was catching shutdown-side Qt teardown races that are out of scope for a startup smoke test.

  • $HOME/linuxcnc/nc_files mkdir workaround: touchy's filechooser.py:29 does os.listdir($HOME/linuxcnc/nc_files) with no try/except and crashes on a clean $HOME. launch.sh pre-creates the directory as a workaround. Filing a separate issue for the underlying touchy bug.

3 participants