Reliability and Recovery
EntDB durability and restart safety are driven by WAL-first write semantics and deterministic recovery.
Durability policy is configurable at runtime:
- Full: strict sync on commit path
- Normal: reduced sync pressure with the same recovery model
- Off: best-effort durability for ephemeral workloads
Callers that normally run in Normal or Off can still force a durable boundary on a specific operation.
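One way to picture the policy and the per-operation override is the sketch below. It is illustrative only: `Durability`, `CommitOptions`, `force_sync`, and `must_sync` are hypothetical names, not EntDB's API, and the batching threshold used for Normal is an assumption about how "reduced sync pressure" might be achieved.

```rust
/// Hypothetical policy enum mirroring the three modes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Durability {
    Full,
    Normal,
    Off,
}

/// Hypothetical per-commit options. `force_sync` models a caller asking
/// for a durable boundary on one specific operation.
#[derive(Debug, Clone, Copy, Default)]
pub struct CommitOptions {
    pub force_sync: bool,
}

/// Decide whether a commit must sync the WAL before it is acknowledged.
///
/// `pending_commits` counts commits appended since the last sync.
/// `group_size` is an assumed batching threshold for Normal mode.
pub fn must_sync(
    policy: Durability,
    opts: CommitOptions,
    pending_commits: usize,
    group_size: usize,
) -> bool {
    // A per-operation override wins over the configured policy.
    if opts.force_sync {
        return true;
    }
    match policy {
        Durability::Full => true,
        Durability::Normal => pending_commits >= group_size,
        Durability::Off => false,
    }
}
```

In this sketch the policy only decides when a sync happens. The log contents are the same in every mode, which is consistent with the statement that all modes share one recovery model.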
Reliability stack
- WAL record checksums and replay safety,
- analysis/redo/undo recovery paths,
- failpoint and crash-point matrices,
- idempotent recovery expectations.
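As a rough illustration of the first item, the sketch below frames each record with a length and a checksum and stops scanning at the first record that fails verification, so a torn tail is never replayed. The frame layout, the choice of CRC-32, and the names `encode_record` and `scan_valid_records` are assumptions made for this example; the actual EntDB record format is not described here.

```rust
/// Bitwise CRC-32 (IEEE polynomial, reflected), written out so the
/// example needs no external crate.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Assumed frame layout: [len: u32 LE][crc32(payload): u32 LE][payload].
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

fn read_u32_le(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Return the payloads of the valid prefix of the log. Scanning stops at
/// the first truncated or corrupt record.
pub fn scan_valid_records(log: &[u8]) -> Vec<&[u8]> {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= 8 {
        let len = read_u32_le(log, pos) as usize;
        let stored_crc = read_u32_le(log, pos + 4);
        let start = pos + 8;
        let end = match start.checked_add(len) {
            Some(end) if end <= log.len() => end,
            _ => break, // record body is truncated
        };
        let payload = &log[start..end];
        if crc32(payload) != stored_crc {
            break; // checksum mismatch: treat as end of usable log
        }
        records.push(payload);
        pos = end;
    }
    records
}
```

Stopping at the first bad record rather than skipping it keeps replay deterministic: the same damaged log always yields the same valid prefix.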
WAL Recovery Flow
+------------------------------+
| startup                      |
+------------------------------+
               |
               v
+------------------------------+
| scan WAL records             |
+------------------------------+
               |
               v
+------------------------------+
| analysis                     |
| - collect txn states         |
| - collect touched pages      |
+------------------------------+
               |
               v
+------------------------------+
| redo                         |
| - replay committed updates   |
| - respect page LSN checks    |
+------------------------------+
               |
               v
+------------------------------+
| undo                         |
| - roll back incomplete txns  |
+------------------------------+
               |
               v
+------------------------------+
| consistent recovered state   |
+------------------------------+
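The flow above can be sketched over an in-memory log as follows. This is a simplified model, not the code in recovery.rs: the `LogRecord` shape, the single-value `Page`, and all function names are hypothetical. It assumes LSNs start at 1, that every transaction without a Commit record counts as incomplete, and that no committed write follows an incomplete transaction's write on the same page (as under strict page-level locking). Runtime aborts, which would need compensation records, are left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type Lsn = u64;
pub type TxnId = u64;
pub type PageId = u32;

/// Hypothetical record shape with physical before/after images.
#[derive(Debug, Clone, PartialEq)]
pub enum LogRecord {
    Begin { txn: TxnId },
    Update { txn: TxnId, page: PageId, before: i64, after: i64 },
    Commit { txn: TxnId },
}

/// A page reduced to one value plus the LSN of the last applied update.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Page {
    pub value: i64,
    pub page_lsn: Lsn,
}

#[derive(Debug)]
pub struct Analysis {
    pub committed: BTreeSet<TxnId>,
    pub incomplete: BTreeSet<TxnId>,
    pub touched_pages: BTreeSet<PageId>,
}

/// Analysis: collect transaction states and touched pages.
pub fn analyze(log: &[(Lsn, LogRecord)]) -> Analysis {
    let mut committed = BTreeSet::new();
    let mut incomplete = BTreeSet::new();
    let mut touched_pages = BTreeSet::new();
    for (_, rec) in log {
        match rec {
            LogRecord::Begin { txn } => {
                incomplete.insert(*txn);
            }
            LogRecord::Update { txn, page, .. } => {
                if !committed.contains(txn) {
                    incomplete.insert(*txn);
                }
                touched_pages.insert(*page);
            }
            LogRecord::Commit { txn } => {
                incomplete.remove(txn);
                committed.insert(*txn);
            }
        }
    }
    Analysis { committed, incomplete, touched_pages }
}

/// Redo: replay committed updates in log order, skipping any update the
/// page already reflects (page LSN check).
pub fn redo(log: &[(Lsn, LogRecord)], a: &Analysis, pages: &mut BTreeMap<PageId, Page>) {
    for (lsn, rec) in log {
        if let LogRecord::Update { txn, page, after, .. } = rec {
            if !a.committed.contains(txn) {
                continue;
            }
            let p = pages.entry(*page).or_default();
            if p.page_lsn < *lsn {
                p.value = *after;
                p.page_lsn = *lsn;
            }
        }
    }
}

/// Undo: walk the log backwards and restore before-images for updates of
/// incomplete transactions that had reached the page before the crash.
pub fn undo(log: &[(Lsn, LogRecord)], a: &Analysis, pages: &mut BTreeMap<PageId, Page>) {
    for (lsn, rec) in log.iter().rev() {
        if let LogRecord::Update { txn, page, before, .. } = rec {
            if !a.incomplete.contains(txn) {
                continue;
            }
            if let Some(p) = pages.get_mut(page) {
                if p.page_lsn >= *lsn {
                    p.value = *before;
                }
            }
        }
    }
}

/// Full pass: analysis, then redo, then undo.
pub fn recover(log: &[(Lsn, LogRecord)], pages: &mut BTreeMap<PageId, Page>) {
    let analysis = analyze(log);
    redo(log, &analysis, pages);
    undo(log, &analysis, pages);
}
```

Because redo is guarded by the page LSN and undo writes fixed before-images, running `recover` a second time on its own output changes nothing under the stated assumptions.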
Validation coverage
Reliability behavior is validated with crash matrices, failpoint-driven recovery tests, and MVCC restart visibility tests.
What this protects
- committed transactions remain visible across restart,
- incomplete/aborted transactions do not leak visibility,
- repeated recovery runs converge to the same state,
- durability policy changes trade commit latency against sync strictness, not MVCC visibility rules.
Reference files
- crates/entdb/src/wal/log_record.rs
- crates/entdb/src/wal/log_manager.rs
- crates/entdb/src/wal/recovery.rs
- crates/entdb/src/wal/tests/recovery_tests.rs
- crates/entdb/tests/crash_matrix.rs