ADR-0006: WAL Schema and Replay Policy
Status
Accepted
Context
The Write-Ahead Log (WAL) in Tiffany provides durable, append-only storage for agent activity, execution steps, and Git-integrated state transitions. It is a core component in ensuring fault tolerance, crash recovery, and task auditability.
As the Tiffany agent runtime handles multiple concurrent tasks, LLM responses, tool invocations, and virtual canvas operations, the WAL must encode enough structured information to allow:
- Resuming interrupted tasks
- Replaying task progress deterministically
- Cross-checking Git commit state and canvas deltas
This ADR defines the structure, persistence policy, replay semantics, and integrity guarantees of the WAL.
Decision
We adopt a structured, JSON-encoded, append-only WAL system with the following characteristics:
WAL File Format
- File-based, append-only per task context (e.g. wal/<uuid>.wal)
- Each line is a valid JSON object (serde_json::to_writer)
- Log entries are versioned via a top-level "v" field
WAL Entry Types
{
"v": 1,
"ts": "2025-06-27T15:48:00Z",
"type": "InstructionStart",
"task_id": "uuid",
"instruction": "Add validation to model.rs"
}
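As a rough illustration of the one-object-per-line format, the sketch below renders the example entry above as a single newline-terminated JSON line. It is deliberately dependency-free so it can be read in isolation; the ADR names serde_json::to_writer for the real serialization. The helper names json_escape and instruction_start_line are hypothetical and not part of Tiffany.

```rust
// Dependency-free sketch only: the ADR itself names serde_json::to_writer.
// `json_escape` and `instruction_start_line` are hypothetical helper names.

/// Escape a string so it can sit inside a JSON string literal on one line.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one InstructionStart entry (schema version 1) as a single line.
fn instruction_start_line(ts: &str, task_id: &str, instruction: &str) -> String {
    format!(
        "{{\"v\":1,\"ts\":\"{}\",\"type\":\"InstructionStart\",\"task_id\":\"{}\",\"instruction\":\"{}\"}}\n",
        json_escape(ts),
        json_escape(task_id),
        json_escape(instruction)
    )
}
```

Escaping embedded newlines is what keeps the "each line is one JSON object" invariant intact when an instruction itself spans several lines.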
Supported entry types:
- InstructionStart: original user input
- LLMPlan: response from LLM with plan steps
- StepStart: beginning of a plan step
- ToolResult: tool output or error
- PatchPreview: virtual diff preview
- Commit: Git commit metadata (SHA, message)
- TaskComplete: signal of end of task (with success/failure)
- ShutdownMarker: graceful termination marker
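One way to model the supported entry types in Rust is a plain enum that maps to and from the "type" string. This is a minimal sketch: payload fields are left out, and the method names as_str, parse, and closes_log are assumptions made for illustration rather than the project's API.

```rust
/// The eight entry kinds listed above. Payload fields are omitted in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntryType {
    InstructionStart,
    LlmPlan,
    StepStart,
    ToolResult,
    PatchPreview,
    Commit,
    TaskComplete,
    ShutdownMarker,
}

impl EntryType {
    /// The string stored in the entry's "type" field.
    fn as_str(self) -> &'static str {
        match self {
            EntryType::InstructionStart => "InstructionStart",
            EntryType::LlmPlan => "LLMPlan",
            EntryType::StepStart => "StepStart",
            EntryType::ToolResult => "ToolResult",
            EntryType::PatchPreview => "PatchPreview",
            EntryType::Commit => "Commit",
            EntryType::TaskComplete => "TaskComplete",
            EntryType::ShutdownMarker => "ShutdownMarker",
        }
    }

    /// Parse a "type" string; unknown names yield None instead of panicking.
    fn parse(s: &str) -> Option<EntryType> {
        Some(match s {
            "InstructionStart" => EntryType::InstructionStart,
            "LLMPlan" => EntryType::LlmPlan,
            "StepStart" => EntryType::StepStart,
            "ToolResult" => EntryType::ToolResult,
            "PatchPreview" => EntryType::PatchPreview,
            "Commit" => EntryType::Commit,
            "TaskComplete" => EntryType::TaskComplete,
            "ShutdownMarker" => EntryType::ShutdownMarker,
            _ => return None,
        })
    }

    /// Assumption: these are the two entries the replay engine looks for
    /// when deciding whether a log was closed rather than cut off.
    fn closes_log(self) -> bool {
        matches!(self, EntryType::TaskComplete | EntryType::ShutdownMarker)
    }
}
```

Returning None for unknown names leaves room for a reader to skip or reject entries written by a newer schema version, as signalled by the "v" field.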
Storage Model
- Each agent task has a dedicated WAL file
- Optionally streamed to external durable store (e.g. S3, versioned blob)
- Index is built in-memory on load for efficient lookup
Flush Policy
- Every WAL entry is fsynced to disk before its write is considered complete
- Entries may pass through a short-lived write buffer for I/O efficiency, but no entry is ever held in memory only
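The flush policy could look roughly like the following, using only the standard library. The function name append_durable is hypothetical; File::sync_all is the standard-library call that asks the operating system to flush file data and metadata to the storage device.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append one already-serialized entry and return only after the fsync.
/// `append_durable` is a hypothetical name used for this sketch.
fn append_durable(path: &Path, json_line: &str) -> io::Result<()> {
    // Append mode means existing entries are never overwritten.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(json_line.as_bytes())?;
    // Keep the one-entry-per-line invariant if the caller omitted the newline.
    if !json_line.ends_with('\n') {
        file.write_all(b"\n")?;
    }
    // Request that data and metadata reach stable storage before returning.
    file.sync_all()
}
```

Reopening the file on every call keeps the sketch short; a long-lived writer would normally hold the handle open. On some filesystems a newly created file also needs its parent directory synced before the directory entry itself is durable.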
Replay Semantics
On startup or recovery, the WAL replay engine:
- Reads all entries into a memory index
- Validates log ordering and checks for TaskComplete or ShutdownMarker
- If incomplete, reconstructs TaskState:
  - Current plan
  - Last confirmed step index
  - Last Git SHA
  - Virtual canvas snapshot (if recoverable)
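The reconstruction described above can be sketched as a single pass over already-parsed entries. Several things here are assumptions for illustration: the Entry and TaskState shapes, the rule that a step counts as confirmed once its Commit entry is logged, and the next_step helper. Ordering validation and the canvas snapshot are left out.

```rust
/// Simplified, already-parsed entries (hypothetical shapes for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    LlmPlan { steps: Vec<String> },
    StepStart { index: usize },
    Commit { sha: String, step_index: usize },
    TaskComplete,
    ShutdownMarker,
    /// InstructionStart, ToolResult, PatchPreview: not needed here.
    Other,
}

#[derive(Debug, Default, PartialEq)]
struct TaskState {
    plan: Vec<String>,
    last_confirmed_step: Option<usize>,
    last_git_sha: Option<String>,
    /// True once TaskComplete or ShutdownMarker has been seen.
    closed: bool,
}

/// Fold the log into a TaskState, entry by entry, in file order.
fn replay(entries: &[Entry]) -> TaskState {
    let mut state = TaskState::default();
    for entry in entries {
        match entry {
            Entry::LlmPlan { steps } => state.plan = steps.clone(),
            // Assumption: a step is confirmed only when its Commit is logged.
            Entry::Commit { sha, step_index } => {
                state.last_confirmed_step = Some(*step_index);
                state.last_git_sha = Some(sha.clone());
            }
            Entry::TaskComplete | Entry::ShutdownMarker => state.closed = true,
            // A started but uncommitted step does not advance the state.
            Entry::StepStart { .. } | Entry::Other => {}
        }
    }
    state
}

/// Index of the step to resume at, or None if nothing is left to do.
fn next_step(state: &TaskState) -> Option<usize> {
    if state.closed {
        return None;
    }
    let next = state.last_confirmed_step.map_or(0, |i| i + 1);
    if next < state.plan.len() { Some(next) } else { None }
}
```

Under these assumptions, a log that ends with a StepStart for step 1 but no matching Commit resumes at step 1, so the interrupted step is retried rather than skipped.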
Recovery Modes
- resume: continue from last known state
- inspect: dry-run and summarize partial progress
- replay: re-run steps for reproducibility / audit
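The three recovery modes map naturally onto a small enum. The sketch below parses the mode name from a string and records which modes execute steps; the type name RecoveryMode and the method executes_steps are hypothetical, and the assumption that only inspect avoids execution follows from its description as a dry run.

```rust
use std::str::FromStr;

/// The three recovery modes named above (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryMode {
    Resume,
    Inspect,
    Replay,
}

impl FromStr for RecoveryMode {
    type Err = String;

    /// Parse the lowercase mode name, e.g. from a CLI flag or config value.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "resume" => Ok(RecoveryMode::Resume),
            "inspect" => Ok(RecoveryMode::Inspect),
            "replay" => Ok(RecoveryMode::Replay),
            other => Err(format!("unknown recovery mode: {other}")),
        }
    }
}

impl RecoveryMode {
    /// Assumption: inspect is a dry run, while resume and replay run steps.
    fn executes_steps(self) -> bool {
        !matches!(self, RecoveryMode::Inspect)
    }
}
```

A caller can then write "inspect".parse::<RecoveryMode>() and branch on executes_steps before touching tools or Git.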
WAL Lifecycle Sequence Diagram
sequenceDiagram
participant User
participant Agent
participant WAL
participant LLM
participant Tool
participant Canvas
participant Git
participant ReplayEngine
User->>Agent: Submit task ("Add error handling")
Agent->>WAL: Write InstructionStart
Agent->>LLM: Plan steps
LLM-->>Agent: Return plan
Agent->>WAL: Write LLMPlan
loop For each step
Agent->>WAL: Write StepStart
Agent->>Tool: Execute action
Tool-->>Agent: Output result
Agent->>WAL: Write ToolResult
Agent->>Canvas: Apply virtual patch
Canvas-->>Agent: Preview diff
Agent->>WAL: Write PatchPreview
Agent->>Git: Commit via git2-rs
Git-->>Agent: Commit SHA
Agent->>WAL: Write Commit (with SHA, step index)
end
Agent->>WAL: Write TaskComplete
Agent-->>User: Respond with summary
Note over ReplayEngine: On Recovery or Restart
ReplayEngine->>WAL: Read task WAL file
WAL-->>ReplayEngine: JSON entries
ReplayEngine->>ReplayEngine: Rebuild plan, canvas, Git context
ReplayEngine->>Agent: Resume at next step or verify TaskComplete
Rationale
🔐 Durability & Crash Safety
- Because every entry is fsynced on write, task progress recorded in the WAL survives a crash or power loss
- Every instruction, LLM plan, tool call, and Git commit is accounted for
🧪 Reproducibility
- WAL enables deterministic re-execution of tasks
- Allows audit of LLM decisions, tool usage, and user confirmations
🧰 Debug & Forensics
- WAL becomes a source of truth in debugging stuck or incomplete tasks
- WAL diffs explain when/why things diverged from expected flow
⚙️ Integration with Canvas & Git
- Each canvas commit references its corresponding WAL entry
- Reconstructing canvas state on restart uses last committed patch snapshot
Consequences
- WAL writer must be thread-safe and coroutine-aware
- Every AgentTask must emit structured WAL entries
- On crash or power loss, agent must recover cleanly to last Commit or PatchPreview
- cargo test must include WAL replay + recovery unit tests
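The thread-safety requirement could be met, in its simplest form, by serializing appends behind a mutex. The sketch below is generic over any std::io::Write sink so it can be exercised in memory; the WalWriter type and its methods are hypothetical. It does not cover the coroutine-aware part: in async code, a blocking lock plus fsync would typically need to run off the executor thread.

```rust
use std::io::{self, Write};
use std::sync::Mutex;

/// Hypothetical writer that serializes appends from many threads.
struct WalWriter<W: Write> {
    inner: Mutex<W>,
}

impl<W: Write> WalWriter<W> {
    fn new(sink: W) -> Self {
        Self { inner: Mutex::new(sink) }
    }

    /// Append one serialized entry. The lock is held for the whole line,
    /// so entries from different threads never interleave mid-line.
    fn append(&self, json_line: &str) -> io::Result<()> {
        let mut sink = self
            .inner
            .lock()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "WAL lock poisoned"))?;
        sink.write_all(json_line.as_bytes())?;
        sink.write_all(b"\n")?;
        sink.flush()
    }

    /// Recover the underlying sink, e.g. to inspect it in a test.
    fn into_inner(self) -> W {
        self.inner.into_inner().unwrap_or_else(|poisoned| poisoned.into_inner())
    }
}
```

Sharing the writer across threads is then a matter of wrapping it in std::sync::Arc. With a real file as the sink, the flush call would be paired with an fsync to satisfy the flush policy.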
Alternatives Considered
- Single journal file for all tasks: Difficult to parallelize, isolate, or replay
- Database-backed log: Overkill, less portable, breaks GitOps principle
- Binary format: More compact, but less debuggable and interoperable than JSON
Related Documents
Adopted
This ADR is accepted as of June 2025. All agent tasks will be tracked through structured, append-only WAL files to ensure traceability, auditability, and deterministic task recovery.
Maintainers: @casibbald, @microscaler-team