FSD Summary Report
How does the FSD Summary report determine a single Cause for a trip? Read on!
When the Fast Shutdown System (FSD) trips, there may be multiple FSD Nodes flagged as currently in the faulted state. The FSD Logger daemon records the begin and end timestamps of the trip, along with all nodes flagged as faulted during that period, in an Oracle relational database.
DTM assigns a single cause to each trip, even though there often are multiple FSD Nodes involved. A set of rules provided by Jay B. is used to find the root cause.
The CEBAF Element Database (CED) has information on each FSD Node, and that information is mapped / embedded into the raw FSD data.
The FSD Summary report uses this data. The Trips tab in DTM can be used to view or download the raw FSD data, along with the computed root cause.
The rules to determine a root cause from a set of FSD Nodes in a trip are:
- If there are zero nodes, or if the mapping to CED fails, use "Unknown/Missing". This indicates either a phantom trip or a misconfiguration in CED.
- If there is just a single node in the trip, then the cause is set to the CED FSD Node "HCO Category" field.
- If there are multiple nodes, but at least one is an RF node, then the cause is RF.
  - If RF, then we further examine whether the nodes are just C25/C50, just C75/C100, or just Cryomodule nodes.
- Otherwise, we have multiple nodes, so the first matching rule wins (it is the last one in the code):
  - If the CED types include both `VDiagKicker` and `HDiagKicker`, then cause = `Magnets`.
  - If a CED name includes `IARAD00`, then cause = `Hall`.
  - If the CED types include either `Target` or `Hall`, then cause = `Hall`.
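The rule cascade above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual logic in FsdRootCauseLogic.java. The `Node` class, its field names, and the `is_rf` flag are hypothetical. It assumes that a failed CED mapping on any node makes the whole trip "Unknown/Missing", it omits the RF sub-classification (C25/C50, C75/C100, Cryomodule), and it returns `None` when no multi-node rule matches, because the rules as listed do not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical view of one faulted FSD Node after the CED lookup."""
    name: str                    # CED element name
    ced_type: str                # CED type
    hco_category: Optional[str]  # None means the CED mapping failed
    is_rf: bool = False          # hypothetical flag marking RF nodes


def root_cause(nodes: Sequence[Node]) -> Optional[str]:
    # Zero nodes or a failed CED mapping: phantom trip or CED misconfiguration.
    # Assumption: one unmapped node is enough to give up on the trip.
    if not nodes or any(n.hco_category is None for n in nodes):
        return "Unknown/Missing"

    # A single node: use its HCO Category directly.
    if len(nodes) == 1:
        return nodes[0].hco_category

    # Multiple nodes with at least one RF node: RF wins.
    if any(n.is_rf for n in nodes):
        return "RF"

    # Multiple non-RF nodes: the first matching rule wins.
    types = {n.ced_type for n in nodes}
    if {"VDiagKicker", "HDiagKicker"} <= types:
        return "Magnets"
    if any("IARAD00" in n.name for n in nodes):
        return "Hall"
    if types & {"Target", "Hall"}:
        return "Hall"

    # Not covered by the rules listed above.
    return None
```

Writing the multi-node rules as early returns makes "first one wins" explicit. Code that instead assigns the cause repeatedly would need to apply the same rules in reverse order to get the same result.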
There are some additional rules for transforming the CED-provided HCO System name into something more specific. For example, if the HCO System reads "Safety Systems" and that is assigned as the root cause, the rules further refine it to one of "MPS (BLM)", "MPS (IC)", "MPS (BCM/BLA)", or "MPS (Multi/Other)" based on CED Type.
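As a rough illustration of that refinement step, the sketch below maps CED types to the four MPS labels. The CED type names used as dictionary keys (`BLM`, `IonChamber`, `BCM`, `BLA`) are guesses made for this example, and so is the choice to fall back to "MPS (Multi/Other)" whenever the types do not all agree on one label. The authoritative mapping is in FsdRootCauseLogic.java.

```python
from typing import Iterable

# Hypothetical CED type names; only the labels on the right come from the rules.
MPS_LABEL_BY_TYPE = {
    "BLM": "MPS (BLM)",
    "IonChamber": "MPS (IC)",
    "BCM": "MPS (BCM/BLA)",
    "BLA": "MPS (BCM/BLA)",
}


def refine_safety_systems(cause: str, ced_types: Iterable[str]) -> str:
    """Refine a "Safety Systems" cause into a more specific MPS label."""
    if cause != "Safety Systems":
        return cause  # other causes pass through unchanged in this sketch

    # Unknown types map to None, which forces the Multi/Other fallback.
    labels = {MPS_LABEL_BY_TYPE.get(t) for t in ced_types}
    if len(labels) == 1 and None not in labels:
        return labels.pop()
    return "MPS (Multi/Other)"
```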
The rules can be examined in code here: FsdRootCauseLogic.java
The raw trip data recorded by the FSD Logger is organized into one record per trip, with each trip having a relation (SQL Join) to zero or more Fault records (node and channel), and each Fault has zero or more Device records on the node/channel.
The Database schema is defined here: Oracle FSD DDL
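The trip, fault, and device hierarchy described above can be pictured with a small in-memory model. This is only a sketch: the class and field names are made up for illustration and are not copied from the Oracle DDL. Only the table and column names mentioned in the comments (FSD_TRIP, FSD_FAULT, FSD_DEVICE_EXCEPTION, DISJOINT_YN, FAULT_CONFIRMATION_YN) come from this page.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Device:
    """One device on a faulted node/channel (FSD_DEVICE_EXCEPTION)."""
    name: str
    fault_confirmation: bool  # FAULT_CONFIRMATION_YN


@dataclass
class Fault:
    """One faulted node/channel within a trip (FSD_FAULT)."""
    node: str
    channel: int
    disjoint: bool            # DISJOINT_YN
    devices: List[Device] = field(default_factory=list)  # zero or more


@dataclass
class Trip:
    """One FSD trip (FSD_TRIP)."""
    start: datetime
    end: datetime
    faults: List[Fault] = field(default_factory=list)    # zero or more


def device_names(trip: Trip) -> List[str]:
    """Flatten a trip into the names of all devices on its faulted channels."""
    return [d.name for f in trip.faults for d in f.devices]


# Example with made-up node and device names.
trip = Trip(
    start=datetime(2024, 1, 1, 12, 0, 0),
    end=datetime(2024, 1, 1, 12, 0, 5),
    faults=[
        Fault("NODE_A", 1, disjoint=False, devices=[Device("DEV_1", False)]),
        Fault("NODE_B", 3, disjoint=True),  # a fault with no device records
    ],
)
```

A trip with no faults and a fault with no devices are both valid here, matching the "zero or more" wording above.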
A few notes on the data:
- We record the state of machine time accounting as a snapshot taken at one point during the trip. This is imperfect, but it gives an idea of what else might be going on with the machine, for example whether we're in a SAM period and the trip is bogus. This is stored in the FSD_TRIP table.
- We record the DISJOINT_YN column (the primary path field in DTM) in the FSD_FAULT table (node/channel). This was something requested by SSG. It turns out to be misleading and unhelpful in most cases. It captures whether a node was in the path that latched its way to the Master node; if it was not, the node is disjoint. Due to limitations of our timing system and IOC scan rates, events are not recorded against a shared global clock. In other words, our FSD system has race conditions. Within certain narrow cases this information may provide a hint (possibly in cases where nodes are ONLY SSG nodes, Ion Chamber, BLA, BLM); otherwise it's meaningless.
- We record the FAULT_CONFIRMATION_YN column in the FSD_DEVICE_EXCEPTION table. This also turns out to be misleading and unhelpful in most cases. It records whether the FSD Logger was able to query the device and have it confirm that it faulted. Because many (most) devices don't have an API to do this, the value is mostly No. That usually means the device doesn't support being asked whether it tripped, not that it didn't trip.
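One way to avoid over-reading FAULT_CONFIRMATION_YN is to treat it as a three-state value when analyzing the data. The sketch below assumes the analyst knows separately whether a device can be queried at all. That `supports_query` input is hypothetical and is not something the FSD Logger records.

```python
def confirmation_status(fault_confirmation_yn: str, supports_query: bool) -> str:
    """Interpret FAULT_CONFIRMATION_YN without reading "N" as "did not trip"."""
    if fault_confirmation_yn == "Y":
        return "confirmed"
    if supports_query:
        # The device could have been asked and did not confirm the fault.
        return "not confirmed"
    # The device has no way to answer, so "N" carries no information.
    return "unknown"
```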