Skip to content

Commit f515083

Browse files
MarkoVcodeclaude
andcommitted
Implement timeout classification and comprehensive health monitoring tests
This commit completes the health monitoring implementation with timeout vs error differentiation and comprehensive test coverage. **Driver Improvements:** - Fix owon_spm: Remove buggy try-except, let exceptions propagate naturally - Fix owon_oel: Improve empty response handling, remove nested try-except - Fix owon_xdm: Handle device-off state correctly with empty dict return - Fix rigol_dho800: Add exception propagation documentation - Add DriverBase._poll_multi_class() helper method with proper docs **Critical Bug Fix in poll_worker.py:** - Fix double failure recording when poll_status() raises exceptions - Added exception_occurred flag to prevent recording failure twice - Ensures timeout penalties are correctly preserved (was being overwritten) **Comprehensive Test Coverage (35 tests, all passing):** Unit Tests (test_connection_health.py - 26 tests): - Health state transitions (unknown → healthy → degraded → dead) - Timeout vs error classification validation - Quality monitor penalty differentiation (-10 timeout vs -15 error) - Quality score calculations and tier assignments - Connection reset functionality - Integration scenarios for timeout/error patterns Integration Tests (test_poll_health_integration.py - 9 tests): - Complete flow: driver → worker → connection → health states - Mock drivers for various scenarios (healthy, timeout, error, empty, intermittent) - Validates timeout vs error penalty differentiation end-to-end - Tests recovery patterns and quality score accumulation - Verifies consecutive failure tracking **Test Fixes:** - Update test_driver_identify.py to use new class names (OwonSPM, OwonXDM) - Add get_quality_multiplier() method to FakeDeviceConnection mock - Update test_driver_base.py for new helper method **Test Results:** ✅ 35/35 health monitoring tests pass (26 unit + 9 integration) ✅ 306/322 total tests pass (16 pre-existing failures unrelated to this work) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 5412e94 commit f515083

11 files changed

Lines changed: 1385 additions & 74 deletions

File tree

benchmesh-serial-service/src/benchmesh_service/drivers/base.py

Lines changed: 173 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -61,13 +61,107 @@ def poll_status(self, channel: int) -> Dict[str, Any]:
6161
Must be implemented by all drivers.
6262
Called periodically by polling worker thread (every 2-3 seconds typically).
6363
64+
**RETURN VALUE CONTRACT:**
65+
66+
- **Success**: Return dictionary with status data
67+
- **Device off/not responding**: Return empty dict {}
68+
- **Communication error**: Raise exception (TimeoutError, SerialException, etc.)
69+
70+
**IMPORTANT - Connection Monitoring:**
71+
72+
This method serves dual purposes:
73+
1. **Data collection**: Returns device measurements/status
74+
2. **Health monitoring**: Signals connection health to worker
75+
76+
The worker uses the return value to determine connection health:
77+
- Non-empty dict → Connection healthy, data recorded
78+
- Empty dict {} → Device not responding, connection dropped
79+
- Exception raised → Communication failure, connection dropped
80+
81+
**DO NOT:**
82+
- ❌ Catch exceptions and return fake data (defeats health monitoring)
83+
- ❌ Return {"VOUT": None, "IOUT": None} on timeout
84+
- ❌ Return {"ERROR": str(e)} on exception
85+
- ❌ Return None instead of {} or raising
86+
87+
**DO:**
88+
- ✅ Let transport exceptions propagate naturally
89+
- ✅ Return empty dict {} only for valid "device off" states
90+
- ✅ Return meaningful data for standby/off states when possible
91+
92+
**Examples:**
93+
94+
CORRECT - Let exceptions propagate (tenma_72 pattern):
95+
def poll_status(self, channel: int):
96+
# No try/except - transport errors bubble up
97+
v = self.query_voltage(channel) # May raise TimeoutError
98+
i = self.query_current(channel) # May raise TimeoutError
99+
return {"VOUT": v, "IOUT": i}
100+
101+
CORRECT - Empty dict for device powered off:
102+
def poll_status(self, channel: int):
103+
raw = self.query_status(channel)
104+
if not raw or raw.strip() == "":
105+
return {} # Device powered off, no data available
106+
return self._parse_status(raw)
107+
108+
CORRECT - Standby state returns valid data:
109+
def poll_status(self, channel: int):
110+
v = self.query_voltage(channel)
111+
i = self.query_current(channel)
112+
output_on = self.query_output_state(channel)
113+
return {
114+
"VOUT": v,
115+
"IOUT": i,
116+
"OUTPUT": "ON" if output_on else "OFF",
117+
"SBY": not output_on # Standby flag
118+
}
119+
120+
WRONG - Fake data on communication error:
121+
def poll_status(self, channel: int):
122+
try:
123+
v = self.query_voltage(channel)
124+
except TimeoutError:
125+
return {"VOUT": None} # ❌ WRONG! Worker thinks success!
126+
127+
WRONG - Catching and hiding errors:
128+
def poll_status(self, channel: int):
129+
try:
130+
return {"DATA": self.query_data(channel)}
131+
except Exception as e:
132+
return {"ERROR": str(e)} # ❌ WRONG! Truthy dict!
133+
134+
**Multi-Class Devices:**
135+
136+
Use _poll_multi_class() helper for devices with multiple classes:
137+
138+
def poll_status(self, channel: int):
139+
return self._poll_multi_class(channel, {
140+
"PSU": self.poll_status_psu,
141+
"DMM": self.poll_status_dmm
142+
})
143+
144+
The helper handles partial success gracefully (some classes succeed, some fail).
145+
64146
Args:
65147
channel: Channel number (1-based, may be ignored for single-channel devices)
66148
67149
Returns:
68150
Dictionary with device status. Keys depend on device type/class.
69151
For multi-class devices (e.g., OWONSPM), return nested dict:
70152
{"PSU": {...}, "DMM": {...}}
153+
Empty dict {} if device powered off or not responding.
154+
155+
Raises:
156+
TimeoutError: Serial read timeout (device not responding to query)
157+
SerialException: Physical connection error (cable disconnected, port closed)
158+
ValueError: Invalid/unparseable response from device
159+
160+
Exceptions are caught by worker and trigger health monitoring + reconnection.
161+
162+
See Also:
163+
- tenma_72 driver for reference implementation
164+
- _poll_multi_class() helper for multi-class devices
71165
"""
72166
pass
73167

@@ -118,6 +212,85 @@ def is_connected(self) -> bool:
118212
"""
119213
return self.t.is_open
120214

215+
def _poll_multi_class(
216+
self,
217+
channel: int,
218+
poll_methods: Dict[str, callable]
219+
) -> Dict[str, Any]:
220+
"""
221+
Helper for multi-class devices with partial success support.
222+
223+
Simplifies polling for devices that combine multiple instrument classes
224+
(e.g., OWON SPM has both PSU and DMM functionality).
225+
226+
Attempts to poll each class independently. If at least one class
227+
succeeds, returns partial data with successful classes. If ALL classes
228+
fail, raises exception to trigger connection drop and reconnection.
229+
230+
This allows graceful degradation - if DMM fails but PSU works, PSU data
231+
is still usable while DMM reconnection is attempted.
232+
233+
Args:
234+
channel: Channel number to pass to poll methods
235+
poll_methods: Dict mapping class name to poll method callable
236+
e.g., {"PSU": self.poll_status_psu, "DMM": self.poll_status_dmm}
237+
238+
Returns:
239+
Dict with class-keyed data. Successful classes have data,
240+
failed classes have None value.
241+
242+
Raises:
243+
RuntimeError: If all classes failed to poll (device completely unresponsive)
244+
245+
Example:
246+
def poll_status(self, channel: int):
247+
return self._poll_multi_class(channel, {
248+
"PSU": self.poll_status_psu,
249+
"DMM": self.poll_status_dmm
250+
})
251+
252+
Example output (partial success):
253+
{
254+
"PSU": {"VOUT": 12.5, "IOUT": 1.2}, # Success
255+
"DMM": None # Failed
256+
}
257+
258+
Example behavior:
259+
- If PSU and DMM both succeed: Returns both data dicts
260+
- If PSU succeeds, DMM fails: Returns PSU data, DMM=None (partial success)
261+
- If both fail: Raises RuntimeError (triggers reconnection)
262+
"""
263+
import logging
264+
logger = logging.getLogger(__name__)
265+
266+
result = {}
267+
any_success = False
268+
269+
for class_name, poll_method in poll_methods.items():
270+
try:
271+
data = poll_method(channel)
272+
if data:
273+
result[class_name] = data
274+
any_success = True
275+
else:
276+
# Poll method returned empty dict
277+
logger.warning(f"{class_name} poll returned empty data for channel {channel}")
278+
result[class_name] = None
279+
280+
except Exception as e:
281+
# Poll method raised exception (timeout, serial error, etc.)
282+
logger.warning(f"Failed to poll {class_name} channel {channel}: {e}")
283+
result[class_name] = None
284+
285+
# If all classes failed, raise to trigger reconnection
286+
if not any_success:
287+
class_names = list(poll_methods.keys())
288+
raise RuntimeError(
289+
f"All classes {class_names} failed to poll channel {channel} - device not responding"
290+
)
291+
292+
return result
293+
121294
# ===== TRANSPORT DELEGATION METHODS =====
122295

123296
def write(self, data: bytes) -> None:

benchmesh-serial-service/src/benchmesh_service/drivers/owon_oel/driver.py

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,12 +7,21 @@ def query_status(self, channel: int):
77
return self.t.read_until_reol(1024)
88

99
def poll_status(self, channel: int):
10+
"""
11+
Poll electronic load status.
12+
13+
Exceptions propagate naturally for health monitoring.
14+
Returns empty dict {} if device is off/not responding.
15+
"""
1016
raw = self.query_status(channel) or ""
17+
18+
# Device off or not responding - return empty dict
1119
if raw is None or raw == "":
12-
# Return a minimal but truthy structure to avoid dropping the connection
13-
return {"VIN": si_to_value(0), "IIN": si_to_value(0), "PIN": si_to_value(0), "OVP": "OFF", "OCP": "OFF", "OTP": "OFF", "REMOTE": "OFF", "INPUT": "OFF", "MODE": "CURR"}
20+
return {}
21+
1422
if isinstance(raw, bytes):
1523
raw = raw.decode(errors='ignore')
24+
1625
parts = raw.strip().split(',')
1726
result = {}
1827
keys = ["VIN", "IIN", "PIN", "OVP", "OCP", "OTP"]

benchmesh-serial-service/src/benchmesh_service/drivers/owon_spm/driver.py

Lines changed: 23 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -56,12 +56,21 @@ def query_function(self, channel: int):
5656
return self.normalize_spaces(funct)
5757

5858
def poll_status_psu(self, channel: int):
59+
"""
60+
Poll PSU status data.
61+
62+
Exceptions propagate naturally for health monitoring.
63+
Returns empty dict {} if device is off/not responding.
64+
"""
5965
raw = self.query_status(channel) or ""
66+
67+
# Device off or not responding - return empty dict
6068
if raw is None or raw == "":
61-
# Return a minimal but truthy structure to avoid dropping the connection
62-
return {"VOUT": None, "IOUT": None, "POUT": None}
69+
return {}
70+
6371
if isinstance(raw, bytes):
6472
raw = raw.decode(errors='ignore')
73+
6574
parts = raw.strip().split(',')
6675
result = {}
6776
keys = ["VOUT", "IOUT", "POUT", "OVP", "OCP", "OTP", "OM"]
@@ -109,38 +118,27 @@ def poll_status_dmm(self, channel: int):
109118
def poll_status(self, channel: int):
110119
"""
111120
Unified polling method for multi-class device.
112-
121+
113122
Polls both PSU and DMM data in a single operation to avoid
114123
double serial port access for devices with multiple classes
115124
on a single serial port.
116-
125+
126+
Uses _poll_multi_class() helper for partial success handling.
127+
Exceptions propagate naturally for health monitoring.
128+
117129
Returns dict with class-keyed data:
118130
{
119131
"PSU": {psu status data},
120132
"DMM": {dmm status data}
121133
}
134+
135+
Returns empty dict {} if device is off/not responding.
136+
Raises exception on communication errors.
122137
"""
123-
result = {}
124-
125-
# Get PSU data
126-
try:
127-
psu_data = self.poll_status_psu(channel)
128-
if psu_data:
129-
result["PSU"] = psu_data
130-
except Exception as e:
131-
logger.warning(f"Failed to poll PSU status: {e}")
132-
result["PSU"] = {"VOUT": None, "IOUT": None, "POUT": None}
133-
134-
# Get DMM data
135-
try:
136-
dmm_data = self.poll_status_dmm(channel)
137-
if dmm_data:
138-
result["DMM"] = dmm_data
139-
except Exception as e:
140-
logger.warning(f"Failed to poll DMM status: {e}")
141-
result["DMM"] = None
142-
143-
return result
138+
return self._poll_multi_class(channel, {
139+
"PSU": self.poll_status_psu,
140+
"DMM": self.poll_status_dmm
141+
})
144142

145143
def set_output(self, channel: int, value): # ON / OFF
146144
self.t.write_line('OUTP ' + str(value))

benchmesh-serial-service/src/benchmesh_service/drivers/owon_xdm/driver.py

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,11 +4,20 @@
44

55
class OwonXDM(DriverBase):
66
def poll_status(self, channel: int):
7+
"""
8+
Poll multimeter status.
9+
10+
Exceptions propagate naturally for health monitoring.
11+
Returns empty dict {} if device is off/not responding.
12+
"""
713
query_measurement = self._query_measurement(1)
814
func = self._query_function(1)
915
raw = self.query_identify() or b""
16+
17+
# Device off or not responding - return empty dict
1018
if not raw:
11-
return None
19+
return {}
20+
1221
return {"MEAS": sci_to_value(query_measurement), "MODE": func, "RANGE": self.cache.get(func + ":RANGE")}
1322

1423
def set_remote(self, channel: int, value):

benchmesh-serial-service/src/benchmesh_service/drivers/rigol_dho800/driver.py

Lines changed: 31 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -10,34 +10,38 @@ class RigolDHO800(DriverBase):
1010
"""Driver for RIGOL DHO800 series oscilloscopes"""
1111

1212
def poll_status(self, channel: int = 1):
13-
"""Poll device status"""
13+
"""
14+
Poll oscilloscope status.
15+
16+
Exceptions propagate naturally for health monitoring.
17+
18+
Note: Oscilloscopes don't have an "off" state over USB TMC - they either
19+
respond successfully or timeout. No need to check for empty response.
20+
"""
1421
# Query basic oscilloscope parameters
15-
try:
16-
# Get channel scale (V/div)
17-
self.t.write_line(f':CHANnel{channel}:SCALe?')
18-
scale = self.t.read_until_reol(1024)
19-
20-
# Get channel offset
21-
self.t.write_line(f':CHANnel{channel}:OFFSet?')
22-
offset = self.t.read_until_reol(1024)
23-
24-
# Get channel coupling
25-
self.t.write_line(f':CHANnel{channel}:COUPling?')
26-
coupling = self.t.read_until_reol(1024)
27-
28-
# Get timebase scale
29-
self.t.write_line(':TIMebase:MAIN:SCALe?')
30-
timebase = self.t.read_until_reol(1024)
31-
32-
return {
33-
"CHANNEL": channel,
34-
"SCALE": sci_to_value(scale),
35-
"OFFSET": sci_to_value(offset),
36-
"COUPLING": coupling,
37-
"TIMEBASE": sci_to_value(timebase)
38-
}
39-
except Exception as e:
40-
return {"ERROR": str(e)}
22+
# Get channel scale (V/div)
23+
self.t.write_line(f':CHANnel{channel}:SCALe?')
24+
scale = self.t.read_until_reol(1024)
25+
26+
# Get channel offset
27+
self.t.write_line(f':CHANnel{channel}:OFFSet?')
28+
offset = self.t.read_until_reol(1024)
29+
30+
# Get channel coupling
31+
self.t.write_line(f':CHANnel{channel}:COUPling?')
32+
coupling = self.t.read_until_reol(1024)
33+
34+
# Get timebase scale
35+
self.t.write_line(':TIMebase:MAIN:SCALe?')
36+
timebase = self.t.read_until_reol(1024)
37+
38+
return {
39+
"CHANNEL": channel,
40+
"SCALE": sci_to_value(scale),
41+
"OFFSET": sci_to_value(offset),
42+
"COUPLING": coupling,
43+
"TIMEBASE": sci_to_value(timebase)
44+
}
4145

4246
def set_output(self, channel: int, state: str):
4347
"""Enable/disable oscilloscope channel display

0 commit comments

Comments
 (0)