Sorry in advance - this is a long collection of thoughts and questions about symptoms I am seeing...
Summary
I have been seeing unreliable confirmation behaviour in a real MeshCore deployment with repeaters. Packets often appear to send successfully, but it is common not to see an expected heard repeat or ACK.
One of the more prominent repeaters in the area also regularly reports Debug 2. From reading the code, I understand this to map to ERR_EVENT_CAD_TIMEOUT, meaning the node had a packet ready to transmit, considered the radio/channel busy for around 4 seconds, then continued with the transmit.
I have spent some time reading through the firmware and existing issues. I am not opening this as a “this is definitely a bug” report. I am trying to understand the current design choices, what has already been tried, what problems have been seen in practice, and where maintainers think useful investigation should happen next.
Field observations
The symptoms I am seeing are:
- Sent packets often do not result in a visible heard repeat or ACK.
- Confirmation feels noticeably less reliable around some repeater paths.
- A prominent repeater regularly reports Debug 2.
- The behaviour feels consistent with contention, overlapping repeats, or lost ACK/path-return traffic, but I do not yet have enough instrumentation to prove that.
Code areas I looked at
The areas that stood out were:
- src/Dispatcher.cpp, especially the transmit gate in Dispatcher::checkSend().
- src/Dispatcher.h, where ERR_EVENT_CAD_TIMEOUT is defined.
- src/helpers/radiolib/RadioLibWrappers.cpp, especially isChannelActive() and noise-floor handling.
- examples/companion_radio/MyMesh.cpp, where getInterferenceThreshold() currently returns 0 with a comment about currentRSSI().
- examples/simple_repeater/MyMesh.cpp, especially the default values for txdelay, rxdelay, direct.txdelay, and interference_threshold.
- src/Mesh.cpp, where flood forwarding schedules retransmission by eligible repeaters.
- src/helpers/BaseChatMesh.cpp, where ACK and response delays are fixed.
- src/helpers/StaticPoolPacketManager.cpp, where queued packets are prioritised and dispatched.
My understanding is that normal TX is locally gated by _radio->isReceiving(), but that this is not a full contention mechanism across the mesh. Multiple devices can still independently schedule transmissions into overlapping windows, especially after a flood packet is heard by several repeaters.
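To give a feel for how quickly independent scheduling can overlap, here is a minimal sketch under a deliberately simple assumption: each repeater that heard the flood packet independently picks one of a small number of equally likely delay slots. The function name and the slot model are hypothetical and are not taken from the firmware. Real overlap also depends on airtime and on who can hear whom, so treat the numbers as a rough intuition only.

```cpp
#include <cassert>

// Probability that at least two of `repeaters` choose the same delay slot
// when each independently picks one of `slots` equally likely slots.
// Hypothetical model for illustration; not MeshCore's scheduling code.
double overlapProbability(int repeaters, int slots) {
  assert(repeaters >= 0 && slots > 0);
  if (repeaters > slots) return 1.0;  // pigeonhole: an overlap is certain
  double all_distinct = 1.0;
  for (int i = 0; i < repeaters; ++i) {
    // The i-th repeater must avoid the i slots already taken.
    all_distinct *= static_cast<double>(slots - i) / slots;
  }
  return 1.0 - all_distinct;
}
```

With an assumed 6 slots, two repeaters share a slot about 17% of the time and three repeaters about 44% of the time, which is why a local busy check alone may not be enough once several repeaters hear the same packet.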
Related issues I found
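I found several issues that seem relevant to the same broad area:
- #2123 raised concerns that the defaults for rx_delay, tx_delay, and direct.tx_delay are too low.
- #2431 asked for better documentation of what txdelay and rxdelay actually do.
- #2433 raised a mismatch between documented and accepted maximum values for txdelay and rxdelay.
- #2064 pointed out that rxdelay values between 0 and 1 can invert the intended SNR relationship.
- #2051 raised int.thresh being hardcoded to 0 in companion firmware, and the unresolved currentRSSI() question.
I am trying not to duplicate those issues. This issue is intended to tie my field observations to those existing discussions and ask what the current maintainer thinking is.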
Questions for maintainers
1. How should repeater timing defaults be understood today?
The defaults that stood out to me were:
- txdelay = 0.5
- direct.txdelay = 0.3
- rxdelay = 0
- fixed ACK delay of 200 ms
Issue #2123 already raised concerns that the delay defaults may be too low, and #2431 suggests there has been wider confusion about what the timing knobs actually do.
Are the current defaults best understood as conservative latency-friendly defaults, sparse-network defaults, values based on field testing, or something else?
It would be useful to understand the design intent, the practical constraints, and how maintainers currently think these values should be tuned or evolved.
2. What happened with RX delay?
There are comments in the code suggesting rxdelay used to be non-zero, or may be re-enabled once the algorithm is fixed.
Issue #2064 also raises an important point about values between 0 and 1 inverting the intended SNR relationship, which makes me wonder whether the disabled default is partly about avoiding surprising path-selection behaviour.
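As a minimal sketch of why a value between 0 and 1 could flip the relationship, assume the delay scales like the setting raised to an exponent that is larger for weaker receptions. I have not confirmed that this is the firmware's formula. The function names and the exponent values below are hypothetical and only illustrate the general property of exponentials.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical delay model: `base` raised to a "weakness" exponent that is
// larger for weaker (lower-SNR) receptions. Not the firmware formula.
double delayMultiplier(double base, double weakness) {
  assert(base > 0.0);
  return std::pow(base, weakness);
}

// True when a weaker reception ends up waiting longer than a stronger one.
bool weakerWaitsLonger(double base) {
  const double strong = 0.2;  // illustrative exponent for a strong packet
  const double weak   = 0.8;  // illustrative exponent for a weak packet
  return delayMultiplier(base, weak) > delayMultiplier(base, strong);
}
```

Under this assumption a base above 1 delays weak receptions more, a base of exactly 1 makes no distinction, and a base between 0 and 1 delays strong receptions more, which is the inversion described in #2064.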
What problems were seen with the previous RX delay approach, and what would need to be true for RX delay to become a safer default or a more actively recommended setting?
I would be interested in any background on the trade-offs, failed experiments, or design direction for this area.
3. What is the current status of RSSI-based channel detection?
The interference threshold appears to be disabled by default, and companion firmware hard-disables it with a comment about a currentRSSI() problem.
Issue #2051 already raises this for companion firmware and asks whether the currentRSSI() concern is still current.
Is the known issue with currentRSSI() still unresolved, and is it considered radio-specific, board-specific, RadioLib-related, or a more general noise-floor/calibration problem?
More broadly, what direction do maintainers think makes sense for channel activity detection: RSSI thresholding, proper LoRa CAD, a hybrid approach, or something else?
4. How should Debug 2 be interpreted in the field?
My understanding is that Debug 2 means ERR_EVENT_CAD_TIMEOUT: the node considered the radio/channel busy for around 4 seconds, then transmitted anyway.
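To make that reading concrete, here is a minimal sketch of the behaviour as I understand it. It is not the code in Dispatcher::checkSend(). TxGate, GateResult, and the member names are hypothetical, and the timeout is passed in rather than taken from the firmware.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcome of one pass through the transmit gate.
enum class GateResult { Wait, Send, SendAfterTimeout };

// Illustrative model only; not the MeshCore Dispatcher.
struct TxGate {
  uint32_t timeout_ms;         // assumed busy limit (about 4 s in my reading)
  bool     waiting = false;    // true once a packet has been deferred
  uint32_t busy_since_ms = 0;  // when the current deferral started

  explicit TxGate(uint32_t timeout) : timeout_ms(timeout) { assert(timeout > 0); }

  // Call whenever a packet is ready to go. `channel_busy` stands in for the
  // radio's receiving / channel-active indication.
  GateResult check(uint32_t now_ms, bool channel_busy) {
    if (!channel_busy) {
      waiting = false;
      return GateResult::Send;
    }
    if (!waiting) {
      waiting = true;
      busy_since_ms = now_ms;
      return GateResult::Wait;
    }
    // Unsigned subtraction stays correct across millisecond counter wrap.
    if (now_ms - busy_since_ms >= timeout_ms) {
      waiting = false;
      return GateResult::SendAfterTimeout;  // the case I believe shows up as Debug 2
    }
    return GateResult::Wait;
  }
};
```

In this model, a gate built with a 4000 ms timeout that sees a continuously busy channel returns Wait until 4 seconds have passed and then returns SendAfterTimeout, whether the busy indication was real traffic or a stale radio state.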
I can see why that behaviour exists as a recovery path if the radio state is wrong or the busy indication is stale. In a genuinely busy mesh, though, I am not sure whether this should be treated as harmless, a congestion signal, or a sign that the repeater may be contributing to further contention.
How do maintainers currently interpret regular Debug 2 reports from prominent repeaters, and what extra context would help distinguish normal operation from a real MAC-layer or RF problem?
5. Should CAD timeout behaviour depend on packet type?
The current behaviour appears to be that a pending packet eventually gets transmitted after a CAD timeout.
Would it make sense for this to vary by packet type or purpose?
For example, adverts, routine repeats, ACKs, path returns, and user payloads may not all deserve the same behaviour after a busy-channel timeout.
I am not proposing a specific policy here. I am asking whether maintainers think this distinction is useful, risky, unnecessary, or already considered elsewhere.
6. Should ACK timing account for airtime and repeat windows?
ACKs and some responses appear to use fixed delays. My concern is that an ACK sent after a fixed 200 ms delay can go out while repeaters are still echoing the original flood packet.
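To put a rough number on that concern, the sketch below uses the commonly published Semtech time-on-air formula for SX127x-class LoRa radios. The LoRaParams struct, the function names, and the example settings are my own assumptions for illustration. They are not MeshCore presets, and the firmware may estimate airtime differently.

```cpp
#include <cassert>
#include <cmath>

// Illustrative LoRa settings; every field value used with this is an assumption.
struct LoRaParams {
  int    sf;               // spreading factor, 7..12
  double bw_hz;            // bandwidth in Hz
  int    cr;               // coding rate index: 1 = 4/5 ... 4 = 4/8
  int    preamble_syms;    // programmed preamble length in symbols
  bool   crc_on;
  bool   explicit_header;
  bool   low_dr_opt;       // low data rate optimisation
};

// Symbol duration in milliseconds: 2^SF / BW.
double symbolTimeMs(const LoRaParams& p) {
  assert(p.bw_hz > 0.0);
  return std::ldexp(1.0, p.sf) / p.bw_hz * 1000.0;
}

// Number of payload symbols (header included) for a given payload size.
int payloadSymbols(const LoRaParams& p, int payload_bytes) {
  const int de  = p.low_dr_opt ? 1 : 0;
  const int ih  = p.explicit_header ? 0 : 1;
  const int crc = p.crc_on ? 1 : 0;
  const double num = 8.0 * payload_bytes - 4.0 * p.sf + 28.0 + 16.0 * crc - 20.0 * ih;
  const double den = 4.0 * (p.sf - 2 * de);
  const double blocks = std::ceil(num / den);
  const double extra = blocks > 0.0 ? blocks * (p.cr + 4) : 0.0;
  return 8 + static_cast<int>(extra);
}

// Total time on air in milliseconds: preamble plus payload symbols.
double airtimeMs(const LoRaParams& p, int payload_bytes) {
  return (p.preamble_syms + 4.25 + payloadSymbols(p, payload_bytes)) * symbolTimeMs(p);
}

// True if a reply sent `delay_ms` after reception would start before a
// single further repeat of the same packet could have finished.
bool replyOverlapsOneRepeat(const LoRaParams& p, int payload_bytes, double delay_ms) {
  return delay_ms < airtimeMs(p, payload_bytes);
}
```

With assumed settings of SF10, 250 kHz, coding rate 4/5, and an 8-symbol preamble, a 50-byte packet takes roughly 308 ms on air, so a reply 200 ms after reception would start before even one repeat had finished. At SF7 and 125 kHz, a 10-byte packet takes about 41 ms and the same 200 ms delay leaves clear room. Whether a fixed delay is safe therefore depends heavily on the radio settings and packet size.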
This overlaps with the wider timing concerns in #2123, but ACK timing feels slightly different because it affects user-visible delivery confidence even when the original payload may have arrived.
Has airtime-aware or route-aware ACK timing been considered, especially for flood/path-return ACKs?
I would be interested in any known history here, including whether fixed ACK timing has worked well enough in practice or whether it is an area worth instrumenting further.
Upcoming diagnostic PR
I am working on a small instrumentation PR intended to help diagnose this area without changing functional behaviour.
The PR adds two commands:
- stats-mac-cad
- stats-mac-tx
stats-mac-cad exposes MAC/CAD counters such as CAD deferrals, timeouts, and drops.
stats-mac-tx exposes TX-side counters such as normal TX, retransmit, and queue counters.
The intent is to make field reports more useful. At the moment, a value such as Debug 2 says that something happened, but it does not give much context about how often, under what kind of pending traffic, or alongside what queue/retransmit behaviour.
My plan
Now that MeshCore is well established, I suspect some of the earlier trade-offs combine into poor ACK/heard-repeat reliability. I am going to work through some of the ideas above (and any related ones I find along the way) and then test them locally on my own repeaters. It would be great to get anyone's thoughts on these items, especially any indication of which would be most welcome and valued. If anyone wants to help me test the firmware changes, that would be helpful too, as it is difficult to simulate a busy mesh on your own. Don't worry, I will be cautious about making changes and will be ready to revert them if there are unintended consequences.