unisat

ADR-006: Warm-reboot-survivable fault log via .noinit SRAM

Status: Accepted (2026-04-17) | Phase: 7.4 (Persistent fault log) | Commit: 03049a1

Context

The Phase-3 FDIR advisor counts faults in plain .bss, so every warm reboot, including reboots caused by a fault, erases the counter. A satellite that cycles on repeated FAULT_STACK_OVERFLOW leaves no trace for post-mortem downlink; ground sees only the current-boot state, not why the previous boot ended.

Flight-software / mission-operations wants:

Decision

Reserve a dedicated .noinit section in the STM32F446 linker script and place the persistent ring + header there. The section is marked NOLOAD and sits outside the .bss range, so Reset_Handler’s BSS zero-init loop never touches it; the SRAM contents survive a soft reset (NVIC_SystemReset, WWDG/IWDG, HardFault recovery path).

Validate-on-read pattern:

   first 4 bytes: magic  ('F''D''I''R' LE)
   next  4 bytes: CRC32 over (header[with crc=0] | ring payload)
   next  1 byte : head
   next  1 byte : count
   next  1 byte : reboot_reason
   padding
   next  4 bytes: reboot_count
   ring[16 × 16 bytes]
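The byte layout above can be written down as a C struct and checked at compile time. This is a minimal sketch, not the project's actual definition: the type names match those used later in this ADR, but the field names, the single explicit padding byte, and the numeric value of FDIR_MAGIC (assuming "LE" means the bytes 'F','D','I','R' appear in that order in memory) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 'F','D','I','R' stored in ascending addresses on a
 * little-endian core, i.e. 0x46 0x44 0x49 0x52 in memory. */
#define FDIR_MAGIC 0x52494446u

/* Field names are hypothetical; only sizes and order come from the layout. */
typedef struct {
    uint32_t magic;          /* first 4 bytes */
    uint32_t crc32;          /* CRC32 over (header with crc=0 | ring payload) */
    uint8_t  head;           /* ring write index */
    uint8_t  count;          /* number of valid entries */
    uint8_t  reboot_reason;
    uint8_t  pad;            /* assumed 1 byte, aligns reboot_count to 4 */
    uint32_t reboot_count;
} FDIR_PersistentHeader_t;

/* The entry contents are not specified here, so keep them opaque. */
typedef struct {
    uint8_t bytes[16];
} FDIR_PersistentEntry_t;

/* Fail the build if the compiler lays the struct out differently. */
static_assert(sizeof(FDIR_PersistentHeader_t) == 16, "header must be 16 bytes");
static_assert(offsetof(FDIR_PersistentHeader_t, crc32) == 4, "crc at offset 4");
static_assert(offsetof(FDIR_PersistentHeader_t, head) == 8, "head at offset 8");
static_assert(offsetof(FDIR_PersistentHeader_t, reboot_count) == 12,
              "reboot_count at offset 12");
static_assert(sizeof(FDIR_PersistentEntry_t) == 16, "entry must be 16 bytes");
```

Making the padding byte an explicit field keeps every byte of the header defined, which matters when the CRC is computed over the raw struct.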

At boot:

Rationale

Consequences

Positive:

Negative:

Alternatives considered

Implementation

  MEMORY layout:
    .text / .rodata    FLASH @ 0x08000000
    .data              RAM (loaded)    <- startup copy loop fills from FLASH
    .bss               RAM             <- zero-init loop fills
    .noinit (NOLOAD)   RAM             <- startup code does NOT touch
    ._user_heap_stack  RAM

#define NOINIT_ATTR  __attribute__((section(".noinit")))

static FDIR_PersistentHeader_t g_hdr NOINIT_ATTR;
static FDIR_PersistentEntry_t  g_ring[16] NOINIT_ATTR;

On host (SIMULATION_MODE), NOINIT_ATTR expands to nothing, so the backing storage lives in .bss; the test harness uses FDIR_Persistent_Wipe() plus a pristine magic to simulate a cold boot. Exercised by 6 / 6 tests in test_fdir_persistent.c.
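One way the host/target switch could look is sketched below. How SIMULATION_MODE is actually defined is not stated in this ADR; the sketch assumes the host build passes it as a compiler define and only defines it locally so the fragment compiles on its own. The variable demo_counter is hypothetical.

```c
#include <stdint.h>

/* Assumption: the host build normally supplies -DSIMULATION_MODE;
 * it is defined here only so this fragment compiles standalone. */
#ifndef SIMULATION_MODE
#define SIMULATION_MODE 1
#endif

#ifdef SIMULATION_MODE
/* Host: no special section, the object lands in ordinary .bss. */
#define NOINIT_ATTR
#else
/* Target: place the object in the NOLOAD .noinit section. */
#define NOINIT_ATTR __attribute__((section(".noinit")))
#endif

/* Hypothetical variable showing the macro in use. */
static uint32_t demo_counter NOINIT_ATTR;
```

On host the variable starts at zero like any other .bss object, which is why the tests have to construct the "previous boot" state explicitly rather than rely on leftover memory.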

Follow-up bug caught during test

The initial implementation had Wipe() leave magic = MAGIC, which made the subsequent Init() take the warm-reboot branch and erroneously bump reboot_count. Fixed by splitting wipe_storage() (zero everything including magic) and arm_storage() (set magic + CRC). Init() on a post-wipe state now correctly takes the cold-boot branch.
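The wipe/arm split and the two Init() branches can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: wipe_storage() and arm_storage() are named in this ADR, while fdir_persistent_init(), the struct field names, the FDIR_MAGIC value, and the choice of the common reflected CRC-32 polynomial 0xEDB88320 are all hypothetical. The storage here is ordinary static memory, as in the host build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FDIR_MAGIC 0x52494446u   /* assumed encoding of 'F''D''I''R' LE */
#define RING_LEN   16u

typedef struct {
    uint32_t magic, crc32;
    uint8_t  head, count, reboot_reason, pad;
    uint32_t reboot_count;
} hdr_t;                          /* hypothetical field names */
typedef struct { uint8_t bytes[16]; } entry_t;
static_assert(sizeof(hdr_t) == 16, "header must be 16 bytes");

static hdr_t   g_hdr;
static entry_t g_ring[RING_LEN];

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); an assumed variant. */
static uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* CRC over (header with crc field zeroed | ring payload). */
static uint32_t storage_crc(void)
{
    hdr_t tmp = g_hdr;
    tmp.crc32 = 0;
    uint32_t crc = crc32_update(0xFFFFFFFFu, &tmp, sizeof tmp);
    crc = crc32_update(crc, g_ring, sizeof g_ring);
    return ~crc;
}

/* Zero everything, including magic, so the next init sees a cold boot. */
static void wipe_storage(void)
{
    memset(&g_hdr, 0, sizeof g_hdr);
    memset(g_ring, 0, sizeof g_ring);
}

/* Mark the storage valid: set magic, then seal with the CRC. */
static void arm_storage(void)
{
    g_hdr.magic = FDIR_MAGIC;
    g_hdr.crc32 = storage_crc();
}

/* Returns true if the warm-reboot branch was taken. */
static bool fdir_persistent_init(void)
{
    if (g_hdr.magic == FDIR_MAGIC && g_hdr.crc32 == storage_crc()) {
        g_hdr.reboot_count++;            /* previous boot's log survived */
        g_hdr.crc32 = storage_crc();     /* re-seal after the update */
        return true;
    }
    wipe_storage();                      /* garbage or first power-up */
    arm_storage();
    return false;
}
```

With this split, calling wipe_storage() and then fdir_persistent_init() takes the cold-boot branch and leaves reboot_count at 0, while a second fdir_persistent_init() call on the armed state takes the warm branch and increments it. Any write to the header or ring has to be followed by a CRC re-seal, otherwise the next boot discards the log.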