Status: Accepted, 2026-04-17
Phases: 3 (FDIR advisor) + 7.3 (mode manager)
Commits: 454422c, 619d980

The UniSat firmware needs both fault tracking and fault response.
The naive design puts them in the same module: a single
fdir_tick() that detects, counts, decides, and enacts the
recovery (safe-mode entry, subsystem disable, NVIC reset). This
couples three independent concerns.
Mixing all three in one file forces unit tests to pull in NVIC headers just to verify escalation arithmetic, and hides actual transitions behind mock layers that drift out of sync with production code.
Split into two modules with a clean hand-off:
fdir.c: advisor only. Tracks fault counters, maintains
the fault table + escalation thresholds, exposes
FDIR_GetRecommendedAction(id), but never invokes a
transition itself.

mode_manager.c: commander. Runs from WatchdogTask at
1 Hz, polls every fault id via FDIR_GetRecommendedAction,
selects the worst-severity recommendation, and enacts the
corresponding EnterSafe / EnterDegraded /
RequestReboot. Platform hook ModeManager_PlatformReboot()
is weak so tests provide their own.

Schema:
driver detects fault
        │
        ▼
FDIR_Report(FAULT_X)             <- advisory, just counts
        │
        ▼
FDIR_GetRecommendedAction(X)     <- looks up the table
        │
        ▼
ModeManager_Tick(): worst-case   <- commanding
        │
        ▼
EnterSafe / Degraded / Reboot    <- state change + telemetry
        │
        ▼
NVIC_SystemReset (on target)     <- platform hook
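
The commander half of the flow above can be sketched as follows. This is an illustration only, not the code in mode_manager.c: the severity ordering of fdir_action_t, the fault ids, the sat_mode_t states, and the helper signatures are all assumptions. The advisor is replaced by a plain lookup array so that only the "poll, pick worst, enact" logic is visible, and ModeManager_PlatformReboot() is shown as the kind of recording stub a test build might supply in place of the weak on-target default.

```c
#include <assert.h>

/* Hypothetical severity ladder, ordered so that a larger value is worse. */
typedef enum {
    ACTION_LOG_ONLY = 0,
    ACTION_ENTER_DEGRADED,
    ACTION_ENTER_SAFE,
    ACTION_REQUEST_REBOOT
} fdir_action_t;

/* Hypothetical fault ids; the real list lives with the fault table. */
typedef enum {
    FAULT_I2C_TIMEOUT = 0,
    FAULT_BATTERY_LOW,
    FAULT_STACK_OVERFLOW,
    FAULT_COUNT
} fault_id_t;

/* Hypothetical operating modes tracked by the commander. */
typedef enum { MODE_NOMINAL = 0, MODE_DEGRADED, MODE_SAFE } sat_mode_t;

/* Advisor stand-in: a lookup that a test can preload with recommendations. */
static fdir_action_t advice[FAULT_COUNT];

fdir_action_t FDIR_GetRecommendedAction(fault_id_t id)
{
    assert((int)id >= 0 && (int)id < FAULT_COUNT);
    return advice[id];
}

/* Commander state, inspectable from the same translation unit. */
static sat_mode_t current_mode = MODE_NOMINAL;
static int reboot_requests = 0;

/* On target this would be the weak default that ends in NVIC_SystemReset();
 * here it only records the request so the sketch runs on a host. */
void ModeManager_PlatformReboot(void)
{
    reboot_requests++;
}

/* Transition helpers named after the ADR; their signatures are assumed. */
static void EnterDegraded(void)
{
    /* Never relax from safe mode back to degraded. */
    if (current_mode == MODE_NOMINAL) {
        current_mode = MODE_DEGRADED;
    }
}

static void EnterSafe(void)     { current_mode = MODE_SAFE; }
static void RequestReboot(void) { ModeManager_PlatformReboot(); }

/* One supervisor pass: poll every fault id, keep the worst
 * recommendation, and enact exactly that one. */
fdir_action_t ModeManager_Tick(void)
{
    fdir_action_t worst = ACTION_LOG_ONLY;

    for (int id = 0; id < FAULT_COUNT; id++) {
        fdir_action_t a = FDIR_GetRecommendedAction((fault_id_t)id);
        if (a > worst) {
            worst = a;
        }
    }

    switch (worst) {
    case ACTION_ENTER_DEGRADED: EnterDegraded(); break;
    case ACTION_ENTER_SAFE:     EnterSafe();     break;
    case ACTION_REQUEST_REBOOT: RequestReboot(); break;
    case ACTION_LOG_ONLY:
    default:
        break;
    }
    return worst;
}
```

In this sketch, preloading advice[] with one degraded and one safe recommendation and calling ModeManager_Tick() once should leave current_mode at MODE_SAFE without touching reboot_requests, which is the worst-severity selection described in the decision.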
test_fdir.c needs zero HAL includes;
test_mode_manager.c runs the full supervisor by overriding
a single weak symbol.

fdir.c never wonders “will calling Report reset the MCU?”
because the module explicitly cannot.

Positive:
Negative:
fdir.hGetRecommendedAction), negligible at 1 Hz supervisor cadence

See:
firmware/stm32/Core/Src/fdir.c + fdir.h
firmware/stm32/Core/Src/mode_manager.c + mode_manager.h
docs/reliability/fdir.md for the table + severity ladder

Tests:
firmware/tests/test_fdir.c: 9/9
firmware/tests/test_mode_manager.c: 9/9

While implementing mode_manager the test suite exposed a latent
bug in FDIR_GetRecommendedAction: it returned the primary
action even for faults that had never been reported
(recent_count == 0). That’s semantically wrong — a fault with no
active report should not drive a recovery. The fix — adding a
recent_count == 0 -> LOG_ONLY fast-path at the top — is part
of the 619d980 commit; two pre-existing test_fdir tests were
updated to match the corrected semantics.
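
A minimal sketch of the corrected lookup, assuming a table row that carries a primary action, an escalated action, and an escalation threshold. The row layout, the fault ids, the threshold values, and the `>=` comparison are illustrative assumptions; only recent_count, LOG_ONLY, FDIR_Report, and FDIR_GetRecommendedAction come from the text above. Note that it needs nothing beyond the standard headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical severity ladder, least to most severe. */
typedef enum {
    ACTION_LOG_ONLY = 0,
    ACTION_ENTER_DEGRADED,
    ACTION_ENTER_SAFE,
    ACTION_REQUEST_REBOOT
} fdir_action_t;

/* Hypothetical fault ids. */
typedef enum { FAULT_I2C_TIMEOUT = 0, FAULT_BATTERY_LOW, FAULT_COUNT } fault_id_t;

/* Assumed table row: what to recommend first, what to recommend once
 * the fault keeps recurring, and how many reports count as recurring. */
typedef struct {
    fdir_action_t primary;
    fdir_action_t escalated;
    uint8_t       escalation_threshold;
} fdir_entry_t;

/* Illustrative values only. */
static const fdir_entry_t fault_table[FAULT_COUNT] = {
    [FAULT_I2C_TIMEOUT] = { ACTION_LOG_ONLY,   ACTION_ENTER_DEGRADED, 3 },
    [FAULT_BATTERY_LOW] = { ACTION_ENTER_SAFE, ACTION_REQUEST_REBOOT, 5 },
};

static uint8_t recent_count[FAULT_COUNT];

/* Advisory only: bump the counter (saturating) and return. */
void FDIR_Report(fault_id_t id)
{
    assert((int)id >= 0 && (int)id < FAULT_COUNT);
    if (recent_count[id] < UINT8_MAX) {
        recent_count[id]++;
    }
}

fdir_action_t FDIR_GetRecommendedAction(fault_id_t id)
{
    assert((int)id >= 0 && (int)id < FAULT_COUNT);

    /* Fast path: a fault that was never reported must not drive a
     * recovery, whatever its primary action in the table is. */
    if (recent_count[id] == 0) {
        return ACTION_LOG_ONLY;
    }

    const fdir_entry_t *e = &fault_table[id];
    return (recent_count[id] >= e->escalation_threshold) ? e->escalated
                                                         : e->primary;
}
```

Without the fast path, a never-reported FAULT_BATTERY_LOW in this sketch would recommend ACTION_ENTER_SAFE on the very first supervisor poll, which is the behavior the mode_manager tests flagged.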