Skip to content

fix(docker): fix docker user permission#98

Open
SteveHawk wants to merge 643 commits into
chrissnell:mainfrom
SteveHawk:fix/dockerfile-user-permission
Open

fix(docker): fix docker user permission#98
SteveHawk wants to merge 643 commits into
chrissnell:mainfrom
SteveHawk:fix/dockerfile-user-permission

Conversation

@SteveHawk

Copy link
Copy Markdown
Contributor

In the previous dockerfile fix (#82), I made an oopsie and didn't properly set user's permission. When connected to external devices such as an AIOC, it would complain about no permission to access unless I run the service in root.

graywolf  | graywolf-modem: ConfigurePtt: build driver for channel 1 (cm108): open /dev/hidraw8: EACCES: Permission denied
graywolf  | graywolf-modem: failed to resolve output device_id=2 (plughw:CARD=AllInOneCable,DEV=0): output device not found: plughw:CARD=AllInOneCable,DEV=0
graywolf  | graywolf-modem: start_audio failed (input device not found: plughw:CARD=AllInOneCable,DEV=0), retrying in 500ms
graywolf  | graywolf-modem: failed to resolve output device_id=2 (plughw:CARD=AllInOneCable,DEV=0): output device not found: plughw:CARD=AllInOneCable,DEV=0
graywolf  | graywolf-modem: start_audio failed: input device not found: plughw:CARD=AllInOneCable,DEV=0

So this PR fixes the problem by including the user in audio (access audio devices like /dev/snd), dialout (access serial ports like /dev/ttyUSB*) and plugdev (access removable devices like /dev/hidraw*) group.

(More info on system groups: https://wiki.debian.org/SystemGroups)

On Linux systems, the best practice is usually using udev rules to dynamically enable user's permission to access certain devices, because adding user to system groups would mean permission to access all devices in that category. However udev isn't available in docker, and since only target devices are passed in anyways, I think adding user to these groups is reasonable from a security standpoint.

At this moment, including these three groups (audio,dialout,plugdev) is enough for my use case, but I'm not sure if others would need more. I guess we can add more if someone complains.

Also there's a potential problem I stumbled upon in this stackoverflow answer, where devices can be owned by different groups in different distro, so this fix can only guarantee to work on debian family distros, others like arch might break. (uucp group has different gid between arch and debian, so we can't just simply include that.) I would say this is probably good enough for now, let's fix this when someone complains in the future.

chrissnell added 30 commits May 2, 2026 18:03
- expandToken: single-pass strings.NewReplacer. Map+ReplaceAll loop
  was nondeterministic (map iteration order) and would re-substitute
  output if a per-key arg regex permitted `{` / `}`, letting one arg
  reference another's token.
- Webhook success classification: tighten 200-299 (was <400). Spec
  says non-2xx is failure; matches new redirect policy.
- Webhook client: refuse redirects (CheckRedirect returns
  ErrUseLastResponse). Webhook URL is operator-set but the redirect
  Location is not — chasing it could pivot to 169.254.169.254 / RFC1918
  / loopback in a feature triggered by remote APRS callers.
- Webhook client: replace lying "2x safety net" comment with the truth
  (timeout enforced via runCtx).
- cappedWriter: return full input length on truncation. Reporting the
  short count was surfacing as io.ErrShortWrite to io.Copy, which closed
  the child's stdout pipe and SIGPIPE'd the process — so any cmd that
  exceeded the cap was reported as failure instead of success.
- Tests: command output cap, webhook body template, header expansion,
  request timeout, response body cap, malformed headers JSON, redirect
  blocked, expandToken determinism with brace-bearing arg.
- Submit/Stop race: hold per-queue mutex through the channel send so
  Stop can mark queues closed and close their channels without racing
  a concurrent push (would otherwise panic on closed channel and leak
  late-created queue workers).
- Rate-limit rollback on busy: if the channel push fails into the busy
  default arm, restore the previous lastFire so a rejected attempt no
  longer silently extends the rate-limit window.
- FormatReply now returns (text, truncated) so the audit Truncated
  flag reflects actual snippet/reply truncation instead of a fragile
  post-hoc string match.
- Audit pruner stop func uses sync.Once to make double-close safe.
- Reply/audit failures emit slog errors instead of being silently
  swallowed.
- classifier discriminates gorm.ErrRecordNotFound from real store
  failures so a DB outage no longer surfaces as a misleading
  "unknown action" reply on the air.
- thread OTPCredentialID through Invocation and stamp it on the
  audit row so the existing column is no longer dead.
- migration 15 declares ON DELETE SET NULL FKs on
  action_invocations.action_id and action_invocations.otp_credential_id
  consistent with actions.otp_credential_id; audit rows survive
  Action and credential deletion via NULL rather than orphaned IDs.
- schema decode error path includes the action name in the reply
  detail so operators can identify which Action is misconfigured.
- runner.executeWithRecover guards executor panics so a buggy
  third-party Executor cannot freeze the per-Action queue.
- Validate webhook URLs (parse, http(s) scheme, no userinfo).
- Reject Action save when otp_required and no credential ref.
- Surface Truncated on TestFireResponse so the UI mirrors what an
  on-air sender would have observed.
- TrimSpace on Action.Name (replaces dead sanitizeStr no-op).
- Comment marker in tacticals.go for the deferred inverse-collision
  check against listener addressees.
- Cover the PUT path for the OTPRequired+nil rule with a dedicated
  test; the create-side test left the update path as a regression
  hole.
- Cap webhook URL at 2048 chars and reject embedded whitespace so an
  obviously-pasted-wrong URL fails on save instead of at fire time.
- Correct the tacticals.go comment: handler is acceptTacticalInvite
  (invite, not rename), and the deferred risk is silent shadowing of
  a colliding listener via classifier dispatch precedence.
- Document the gorm `default:true` semantic on Action.OTPRequired in
  both the model and docs/wiki/actions.md so the next session
  doesn't rediscover the round-trip trap from a failing test.
chrissnell and others added 24 commits May 7, 2026 22:35
Mirrors the Linux + ALSA 'amixer set Capture -35dB' that graywolf
operators use on the Digirig + UV5R chain. AAudio on Android exposes
no equivalent capture-side gain stage, so audio enters the demod
already-clipped at the ADC saturation rail when the radio is at
normal listening volume.

Issues a USB Audio Class 1.0 SET_CUR request against the Volume
Control selector of the Feature Unit found by walking the config
descriptor. Falls back to per-channel (left/right) if the codec
rejects channel-0 master.

Permission gating: UsbManager.requestPermission() shows a one-time
system dialog the operator approves; subsequent runs reuse the
grant. Without permission this is a no-op (warning logged) and
audio capture proceeds at the AAudio default level.
AAudio's HAL routing for USB-Audio class on this tablet rail-pins
the input regardless of input_preset, and the audio HAL claims the
device exclusively, blocking user-space FU_VOLUME control transfers
(every SET_CUR returned rc=-1 once an AAudio stream was open).

aprsdroid runs identical Baofeng + CMedia hardware successfully
through android.media.AudioRecord with AudioSource.MIC at 22050 Hz.
That path uses different HAL gain shaping that fits the modulated
audio inside the i16 range. Same Rust DSP pipeline downstream.
GREEN verdict — graywolf-modem decodes live off-air APRS frames on
Android arm64 via a tiny NativeActivity APK that captures audio
through Java AudioRecord (JNI) and feeds the same MultiAfskDemodulator
desktop graywolf-modem uses. -6 dB software gain on a Baofeng UV5R
+ Digirig chain produces 0% clip and a frame-rate consistent with
operator's known reference; 9 frames in a 50-second on-air capture.

Documents the four spec deviations the bring-up forced (cpal/oboe
feature, NativeActivity APK shell, AAudio -> AudioRecord, workspace
layout) and the production-app design constraints they imply.

.context/ is gitignored; force-added so the artifact lands on the
feature branch alongside the code commits that produced it.
Cast through *const () first per the new lint's hint. Same machine
code, just satisfies -D warnings.

Reported by CI on PR chrissnell#87.
POC-A (PR chrissnell#87) invalidated the exec'd-binary topology this doc
originally specified. Modem now ships as a Rust cdylib loaded into
the Kotlin Service via System.loadLibrary; audio path switched from
CPAL Oboe to Java AudioRecord + JNI sample hand-off; USB hardware-
mixer gain dropped in favor of software-only gain (slider in
SharedPreferences). Process tree drops from 3 to 2.

Concrete edits:
- §3.1: process tree rewrite, two reasons modem can't be exec'd
- §3.2: in-process JNI audio bridge added as channel 4; Go<->Rust UDS retained
- §3.4: readiness signal via JNI return code, not stdout byte
- §3.5: version string via JNI export
- §4.1: Kotlin file list reshaped (AudioPump, ModemBridge, GoLauncher)
- §4.4: Rust modem rewrite — cdylib + JNI module + audio consumer
- §4.5: build system updated (cargo-ndk --lib for cdylib)
- §5.1: cold start sequence rewritten around loadLibrary + JNI
- §5.2: TX path marked TBD pending POC-B-revised
- §5.3: RX path goes through JNI
- §6: error table rows for JNI panic, audio route via AudioRecord
- §7: invariants N1, N2 amended; new N8 (JNI audio path) and N9 (software gain)
- §11: phase 1 marked complete; new §11.2 spec for POC-B-revised
- §12: risk chrissnell#1 (CPAL Oboe) retired, replaced with HAL-variance risk
- Appendix A: directory layout reflects cdylib + jniLibs + module moves

The revision log entry at top of the doc summarizes the same.

.context/ is gitignored; force-added per repo convention.
End-to-end production-topology proof on a single APK before
phase 2-7 commit. Foreground Service + Rust modem cdylib + JNI
audio pump + Go child + UDS + WebView + bearer token. Same
hardware as POC-A; success measured by frames decoded end-to-end
into a hand-written WebView page.

References .context/2026-05-01-android-app-design.md §11.2 and
expands it into a writing-plans-ready spec covering toolchain,
file layout, build invocations, run procedure, deliverables,
plan-structure hints, and stop conditions.

.context/ is gitignored; force-added per repo convention.
Two new read-outs at the top of the hand-written WebView page,
backed by a new GET /api/_internal/status endpoint on the Go stub.
Doubles as the 'is anything happening?' liveness signal during a
10-min on-air capture: incrementing uptime + flatline frames =
silence on-air or audio-path break; static both = Go-side hang.
Toggling the iGate Enable switch on the iGate page silently no-op'd
until the daemon was restarted. The reload signal channel and
reload-drainer goroutine were only allocated when the iGate was already
enabled at boot, and reloadIgate only ever ran Reconfigure, never Stop
or build.

Reload plumbing (signal channel, drainer goroutine, RF->IS fanout
adapter, IGateLineSender adapter) is now allocated unconditionally.
reloadIgate handles all three transitions: disabled->enabled (build,
Start, install), enabled->disabled (Stop, clear), enabled->enabled
(existing Reconfigure path). a.ig is held in atomic.Pointer so consumers
that captured method values across the toggle no longer freeze on a
stale instance; the status / simulation REST routes re-load it on every
call.

Documents the new behavior as invariant chrissnell#28 and adds three regression
tests covering the disabled<->enabled transitions plus a repeated cycle.
Status / simulation REST contract: while disabled, GET /api/igate now
returns 503 (matching the pre-fix wire shape the UI's "Disabled" badge
keys off) instead of 200 with an empty Connected=false snapshot, and
POST /api/igate/simulation returns 503 (via igate.ErrNotEnabled
sentinel) instead of a generic 500. The /api/status aggregate omits
the igate field when disabled.

reloadIgate now stores the new *igate.Igate and propagates it to the
RF->IS fanout adapter and beacon ISSink BEFORE calling Start, so a
status fetch racing the supervise goroutine's first connect cannot
observe a.ig as nil while the iGate is wiring up. Start failure rolls
all of those back to nil so a subsequent toggle gets a fresh build.

Prometheus metrics survive enable cycles: initMetrics now rebinds
ig.m* to the originally-registered ExistingCollector on
AlreadyRegisteredError, so /metrics keeps reflecting live counter
increments instead of freezing at the first instance's values when
the operator toggles the iGate off and back on.

Adds wire-level webapi tests covering the 503 / 200 contract and
updates invariant chrissnell#28 with the store-before-Start ordering, the
disabled-state HTTP contract, and the prometheus rebind requirement.
Status bar now reads "N of M stations heard direct" while the Direct RX
toggle is active, so operators can see how many stations are currently
reaching the station on RF without a digi or iGate hop. Closes chrissnell#86.
Previous wording 'N of M stations heard direct' implied all M were heard
direct. New form makes the two populations explicit.
Beacon TX emitted an Info "beacon sent" line; messages did not log
to the console at all on success or failure. The pkglog still recorded
the frame so the dashboard packet feed showed it, but operators
watching journalctl/stdout had no trace of message TX.

Add Info on RF send (TxHook fire) and IS send, and Warn on every
terminal failure path (length cap, packet-mode channel, encode error,
queue-full, governor stopped, RF unavailable, igate not configured /
read-only / SendLine error). Auto-ACK TX is intentionally not logged
to avoid noise on every received DM.

Closes chrissnell#85
Code review feedback: finalizeRFFailure was called from two paths
(RF unavailable + ErrStopped) but logged the same reason="rf unavailable"
for both. Operators grepping for that reason couldn't tell radio-link
problems from governor shutdown — different remediation paths.

Thread the reason string into finalizeRFFailure so each caller supplies
its own. Adds a TestSender_RFUnavailable_LogsWarn case to cover the
unavailable-bridge path that the existing tests didn't exercise.
@chrissnell

Copy link
Copy Markdown
Owner

There's not a standard for device ownership groups across all Linux distros. I think a better solution is to run a shell script that sources /etc/os-release and then uses a case statement on ID to choose which group(s) to chown with.

@SteveHawk

Copy link
Copy Markdown
Contributor Author

There's a problem with that approach: we don't really know user's distro during build time. Script will always read Debian because that's the base image's distro.

If we set it at runtime, then another problem: we're running non privileged user, whom do not have the permission to set it's own group. If we use root, then we don't really need to set permission because we can already access everything.

@SteveHawk

Copy link
Copy Markdown
Contributor Author

Did a bit more research on this topic, and permission management is just a naughty little bastard.

Checked a few of the most popular distros out there including debian, arch, RHEL/fedora, gentoo and such, and I only find debian and arch have proper documentations on these system groups, no mention on other distros' wikis, only some sparse info here and there. With these limited knowledge, it seems like most distros use audio group for /dev/snd devices. Most distros use dialout group for serial devices, except arch which uses uucp. As for /dev/hidraw* devices, debian and maybe a few others provide plugdev, and arch simply does not support using group permission to access it, only udev.

Oh and these groups might have different group ids under the hood. So it's simply impossible to support all.

So I can think of three options to solve this:

  1. Use non-priviledged user with no permission. Document that it will require manual permission settings to use any external devices, such as fiddle with cgroup rules or udev rules, or run container as root.
  2. Use the group method this PR proposes, and document that it will only work out-of-the-box on debian and debian-based distros, and may require manual setup on other distros. Since a lot of people use debian or debian-based distros (like ubuntu, popos, raspberrypi os, armbian and so on), this should cover a lot of people.
  3. Use root user in container. Will run anywhere without permission issues, but less safe.

Personally I still prefer option 2. Because for option 1, given how obscure it is to set cgroups rules or udev rules, most people probably will just use root user anyways because it's way simpler, and now it's as insecure as option 3. Option 2 is a middle ground providing safer option for the most popular distro family.

@chrissnell chrissnell force-pushed the main branch 3 times, most recently from bdc4d52 to c6b4bbb Compare June 13, 2026 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants