Introduction
Tiffany Runtime Model
Tiffany is a coroutine-first agentic runtime built around the idea of deterministic, traceable, yield-driven execution for autonomous system tasks.
Execution Model
- Coroutine Scheduler: Inspired by David Beazley's pyos8, all tasks are cooperative generators. There are no threads, only scheduled yields.
- System Calls: Tasks yield events like `ReadWait`, `Sleep`, or `SpawnTask` to the scheduler.
- Trampolining: Nested coroutines are managed via a LIFO stack to support function-like composition.
- Replayability: Tasks emit `WAL` logs with every yield, enabling complete deterministic replay.
- Telemetry: Progress is simultaneously emitted to the `PAL` stream for real-time observability.
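A minimal sketch of this yield-driven, WAL-logged loop (all names here are illustrative: real Tiffany tasks would yield structured events, and the WAL would be durable storage rather than an in-memory list):

```python
from collections import deque

def run_logged(tasks):
    """Round-robin over generator tasks, appending a WAL entry per yield."""
    wal = []
    ready = deque(enumerate(tasks, start=1))   # (tid, generator)
    while ready:
        tid, gen = ready.popleft()
        try:
            event = gen.send(None)             # resume to next yield
        except StopIteration:
            wal.append((tid, "exit"))
            continue
        wal.append((tid, event))               # log the yield before acting on it
        ready.append((tid, gen))               # cooperative round-robin

    return wal

def demo():
    yield "ReadWait"
    yield "Sleep"

# run_logged([demo()]) -> [(1, 'ReadWait'), (1, 'Sleep'), (1, 'exit')]
```

Because every state transition passes through a yield, the WAL captures the complete execution order, which is what makes deterministic replay possible.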
Agents and Isolation
Each agent:
- Operates in a self-contained environment (microVM or Apple Container)
- Loads semantic memory (GraphDB + VectorDB)
- Executes plans via the ReAct loop
- Coordinates with orchestrator over mTLS/gRPC
Diagrams
You can find system-level sequence diagrams in the Diagrams section.
Task Scheduler
📌 1. Overview of PyOS (pyos8.py)
PyOS is an educational, minimalistic micro-operating system implemented in Python using generators/coroutines to simulate concurrent tasks. The core features include:
- Tasks as Generators: Each "task" is a coroutine (`yield`-based) that encapsulates some ongoing work.
- Scheduler: A lightweight round-robin task scheduler (`Scheduler`) runs each task until it completes or voluntarily yields control.
- System Calls (Syscall): Tasks request OS services (sleep, wait, IO) via structured system calls.
- Event-driven Loop: Scheduler handles tasks based on events (I/O completion, sleep duration, etc.).
📂 2. Detailed Architectural Analysis (with Mermaid)
Here’s the architectural diagram of PyOS:
graph TD
Scheduler["Scheduler (Event Loop)"]
subgraph Task Management
TaskQueue["Ready Task Queue"] --> Scheduler
SleepingQueue["Sleeping Task Queue"]
WaitingTasks["Waiting Task Queue"]
end
Task["Task (Coroutine)"] -->|Syscall| Scheduler
Scheduler -->|Resume Task| Task
Scheduler -->|Sleep Requests| SleepingQueue
Scheduler -->|IO or Wait Requests| WaitingTasks
Scheduler --> IO["I/O Management (Select-based polling)"]
IO --> Scheduler
SleepingQueue --> Scheduler
WaitingTasks --> Scheduler
Core Components:
- Task: a coroutine (`yield`-based) wrapped in a `Task` class. Each task maintains its own coroutine state.
- Scheduler: event loop managing tasks.
- Syscall: task requests (sleep, IO, etc.) handled by the scheduler.
🧩 3. Task Management Explained
Tasks in PyOS are managed explicitly through a scheduler:
class Task:
taskid = 0
def __init__(self, target):
Task.taskid += 1
self.tid = Task.taskid # Unique task id
self.target = target # Coroutine
self.sendval = None # Value to send into coroutine
self.stack = [] # Call stack for nested coroutines
The scheduler maintains multiple queues:
- ready: tasks ready to run immediately.
- sleeping: tasks scheduled to run after a delay.
- waiting: tasks blocked on I/O or other events.
Scheduler main loop:
class Scheduler:
    def mainloop(self):
        while self.taskmap:
            if not self.ready:
                self.iopoll(None)                # nothing ready: block on I/O events
            else:
                task = self.ready.popleft()
                try:
                    result = task.run()
                except StopIteration:
                    del self.taskmap[task.tid]   # task finished: drop it
                    continue
                if isinstance(result, SysCall):
                    result.handle(task, self)    # syscall decides when to reschedule
                    continue
                self.schedule(task)
This approach provides:
- Explicit task control
- Predictable execution order
- Clear state management
🚦 4. How PyOS Handles System Calls
Tasks use Syscall to yield structured commands:
class SysCall:
def handle(self, task, sched):
pass
class GetTid(SysCall):
def handle(self, task, sched):
task.sendval = task.tid
sched.schedule(task)
- Tasks yield `SysCall` objects.
- The scheduler intercepts the call, executes it, then resumes the task.
Example Task using Syscall:
def foo():
tid = yield GetTid()
print(f"My task id is {tid}")
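Putting the pieces together, here is a minimal runnable version of the above (a simplified pyos-style scheduler with no I/O waiting or trampolining; it combines the `Task`, `Scheduler`, and `SysCall` snippets from this section):

```python
from collections import deque

class SysCall:
    def handle(self, task, sched):
        pass

class GetTid(SysCall):
    def handle(self, task, sched):
        task.sendval = task.tid        # deliver the task id on resume
        sched.schedule(task)

class Task:
    taskid = 0
    def __init__(self, target):
        Task.taskid += 1
        self.tid = Task.taskid
        self.target = target           # coroutine
        self.sendval = None
    def run(self):
        return self.target.send(self.sendval)

class Scheduler:
    def __init__(self):
        self.ready = deque()
        self.taskmap = {}
    def new(self, target):
        task = Task(target)
        self.taskmap[task.tid] = task
        self.schedule(task)
        return task.tid
    def schedule(self, task):
        self.ready.append(task)
    def mainloop(self):
        while self.taskmap:
            task = self.ready.popleft()
            try:
                result = task.run()
            except StopIteration:
                del self.taskmap[task.tid]     # task finished
                continue
            if isinstance(result, SysCall):
                result.handle(task, self)      # syscall decides when to reschedule
                continue
            self.schedule(task)

def foo():
    tid = yield GetTid()
    print(f"My task id is {tid}")

sched = Scheduler()
sched.new(foo())
sched.new(foo())
sched.mainloop()
```

Running this prints the task ids of both coroutines, each resumed once to receive the syscall result and once to finish.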
🌟 5. Relevance & Value to Tiffany Daemon
Applying this architecture to Tiffany offers significant advantages:
✅ Advantages for your use-case:
- Fine-grained control: Tasks explicitly yield for system operations (like awaiting API or LLM responses).
- Memory-efficient concurrency: Avoid overhead of threads/processes, critical for long-lived agents.
- Traceable state transitions: Easier debugging, transparency in state and lifecycle management.
- Predictable & careful execution flow: Aligns with your “one-foot-on-the-ground” cautious approach.
💡 Use Case Scenarios in Tiffany:
- Agent executing a complex, multi-step workflow (e.g., code edit → git commit → API call → LLM query) as separate cooperative tasks.
- Tasks handling different interactions (CLI vs API) concurrently yet clearly separated in the scheduler.
- Easier rollback and checkpointing: Each step clearly delineated by coroutine yield points.
⚙️ 6. Coroutine Considerations in Rust
While the example uses Python generators, Rust provides advanced coroutine (async/await) models via runtimes like tokio, async-std, or coroutine libraries like may.
- Tokio/Async: Standard Rust solution for async tasks and efficient I/O multiplexing.
- May (Coroutine-focused): Offers coroutine-style programming and lightweight task handling, suitable for fine-grained control.
Rust async example (with Tokio):
use std::time::Duration;

async fn handle_task(task_id: usize) {
    println!("Handling task {}", task_id);
    tokio::time::sleep(Duration::from_secs(1)).await;
}

#[tokio::main]
async fn main() {
    handle_task(1).await;
}
Rust Coroutine-style example (with May):
use may::coroutine;
use std::time::Duration;

fn main() {
    coroutine::scope(|s| {
        s.spawn(|| {
            println!("Task started");
            coroutine::sleep(Duration::from_secs(1));
            println!("Task done");
        });
    });
}
🎯 7. Recommended Architecture for Tiffany
Based on PyOS inspiration, I suggest adopting a hybrid Rust async + coroutine model:
✅ Tiffany Task Scheduler (Conceptual):
- Define explicit task states and transitions clearly.
- Tasks (async or coroutine) yielding explicit commands similar to SysCall, allowing predictable scheduling and state management.
- A scheduler loop managing tasks explicitly, providing clear debug and auditability.
Example of explicit Rust-based scheduler task structure:
// Conceptual sketch: `Query`, the `llm` client, and the Scheduler struct
// itself are illustrative placeholders, not a complete compiling program.
use std::path::PathBuf;
use std::time::Duration;

enum TaskCommand {
    AwaitLLMResponse(Query),
    AwaitFileIO(PathBuf),
    AwaitUserApproval(String),
}

struct Task {
    id: usize,
    state: TaskState,
}

enum TaskState {
    Running,
    Waiting(TaskCommand),
    Completed,
}

impl Scheduler {
    async fn run(&mut self) {
        loop {
            // Split the borrow so the task list and the LLM client
            // can be used at the same time.
            let Scheduler { tasks, llm, .. } = self;
            for task in tasks.iter_mut() {
                match &task.state {
                    TaskState::Waiting(TaskCommand::AwaitLLMResponse(q)) => {
                        if llm.try_get_response(q).await.is_some() {
                            task.state = TaskState::Running;
                        }
                    }
                    // Handle the other Waiting commands similarly
                    TaskState::Waiting(_) => {}
                    TaskState::Running => {
                        // Run the task's next step
                    }
                    TaskState::Completed => {}
                }
            }
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    }
}
- Tasks explicitly yield commands the scheduler handles.
- Matches your “one-foot-on-the-ground” operational style.
🛠️ 8. Recommended Roadmap
- Prototype a simple Rust Scheduler inspired by PyOS (using async Rust).
- Define explicit commands/tasks for Tiffany (LLM queries, API calls, filesystem ops).
- Implement simple coroutine/task model (using tokio + explicit state machine).
🚀 Conclusion and Recommendation
Dave Beazley’s PyOS provides a clear and powerful task management blueprint. Its explicit task control, predictable scheduling, and structured system call mechanism align extremely well with your Tiffany vision.
For your Rust-based daemon, adapt these ideas into a hybrid coroutine/async system, managing tasks explicitly for predictability, clarity, and debug ease.
Your next steps:
- Prototype the Rust scheduler inspired by PyOS.
- Define Tiffany-specific tasks and state transitions.
- Validate with simple coroutine-style and async tasks.
Design Document: Coroutine-Based Task Scheduler for Tiffany
Introduction and Background
Tiffany requires a lightweight cooperative multitasking runtime to manage concurrent tasks without the complexity of threads or processes. We will build a coroutine-based task scheduler inspired by prior art in Python and other systems. Coroutines are program components that generalize subroutines for non-preemptive multitasking: they allow a function to suspend execution (yield) and resume later, which is ideal for implementing our own scheduler. By using coroutines, we can explicitly control when tasks give up control (cooperative scheduling) and integrate asynchronous I/O handling directly into the runtime. This approach was popularized in David Beazley’s 2009 PyCon tutorial “A Curious Course on Coroutines and Concurrency,” where he constructed a simple “Python Operating System” (PYOS) using generator-based coroutines. Modern frameworks like Python’s asyncio build on similar concepts (an event loop, tasks, and awaiting I/O), but for Tiffany we will design a minimal scheduler focused on our specific needs rather than a full general-purpose OS or framework.
Prior Art and Influences
David Beazley’s PYOS: Beazley’s example (known through code like pyos.py series) is a key reference. In his design, a central Scheduler manages a queue of tasks, each task being a coroutine wrapped in a lightweight Task object. Tasks voluntarily yield control, either simply to let others run or to request a “system call” (an operation for the scheduler to perform on behalf of the task). The scheduler loop repeatedly takes the next ready task, runs it until its next yield, and then decides what to do based on the yielded value (resume the task, create new tasks, wait for I/O, etc.). This cooperative model avoids preemptive context switches and threads, making it lightweight and deterministic. Beazley’s system introduced system call objects like GetTid, NewTask, WaitTask, ReadWait, etc., which tasks could yield to request scheduler actions. For example, a task could do yield NewTask(some_coro) to spawn a new coroutine task, or yield ReadWait(file_obj) to signal the scheduler to pause the task until a file descriptor is readable. His final iteration also solved the issue of calling coroutines from within coroutines (nested yields) via a technique called trampolining, which we’ll discuss shortly.
Async IO frameworks: Python’s asyncio (PEP 3156) and similar event-loop systems were motivated by the same goal of handling many tasks with one thread by async/await syntax. They formalize coroutines as first-class objects (native async def coroutines in Python 3) and use an event loop to schedule their execution. Our design will conceptually resemble a simplified asyncio: we’ll have an event loop (the scheduler), tasks (coroutine functions), and awaitable events (our system call yields). However, we will implement just what we need for Tiffany rather than a full library. This means focusing on cooperative scheduling and I/O readiness, and possibly timed delays, without the overhead of unrelated features.
Coroutines in Rust: An alternative to a Python implementation is using Rust, which by its 2021 edition (and likely 2024 edition improvements) supports async/await for concurrency. Rust’s model is different under the hood – it uses futures and an executor. Each async fn in Rust compiles to a state-machine that implements the Future trait. An executor (runtime) polls these futures and uses a waker mechanism to resume tasks when events (like I/O) are ready. While Rust’s approach is efficient and memory-safe, implementing a custom coroutine scheduler in Rust is more involved than in Python. There is no stable native “yield” for synchronous coroutines in Rust yet (generators are an unstable feature), so one would typically build on the async/await system (futures). In our context, a Rust implementation might involve spawning tasks as futures and using an event loop (like tokio or a custom one) to drive them. This is certainly feasible and would give performance and safety benefits, but it means embracing Rust’s asynchronous programming model (polling futures) rather than the simpler generator interface that Python offers. We will discuss the trade-offs in a later section. For now, we draw inspiration from the Python model (which is easier to prototype) while keeping in mind that the design concepts (task, scheduler, event waiting) are transferable to Rust or other environments.
Requirements and Scope for Tiffany
Tiffany’s scheduler should support concurrent tasks with minimal overhead, focusing on our use cases. We are not building a general OS with process isolation or preemptive multitasking. Instead, the requirements are:
- Cooperative Multitasking: Tasks yield control explicitly at well-defined points (e.g. when they need to wait, or simply to let others run). This avoids complex context switching – the scheduler only switches tasks when a task yields, and each task runs to its next yield without interruption. This is sufficient for many I/O-bound or event-driven workloads and simpler to implement than preemptive scheduling. It does mean a badly-behaved task that never yields could block others, so we must ensure our tasks are written to yield periodically (especially during long computations or loops).
- Task Management: Ability to create (spawn) new tasks dynamically. For example, when a network connection comes in, the scheduler can spawn a new task to handle that connection while the main task continues listening. We might also want the ability to terminate tasks or have tasks wait for others to finish, though these are secondary if our use-case can manage without explicit task joining or killing. (If needed, we can implement a `WaitTask` system call to allow one task to yield until another task completes, and possibly a `KillTask` to remove a task, similar to Beazley's design.) At minimum, spawning tasks is required.
- I/O and Event Waiting: A mechanism to integrate non-blocking I/O. Tasks should be able to wait for external events (file descriptors becoming readable/writable, timers, etc.) without blocking the whole program. In a typical scenario, a task waiting for network data will yield an event to the scheduler (e.g. "read wait on this socket"). The scheduler can then pause that task and continue running others. Under the hood, the scheduler will use an OS primitive (like `select` or `poll`) to check when the socket is ready, and then resume the task. This means our scheduler needs to maintain a registry of file descriptors (or other event sources) mapped to the tasks waiting on them. Beazley's example uses `select.select` on sets of descriptors, which is a straightforward approach we can use as well (since Tiffany might be running in an environment where `select`/`poll` is available). If Tiffany's needs include timers (sleeping for X seconds), we could implement a similar mechanism with timeouts: tasks yield a "sleep until time T" event, and the scheduler keeps track of that, resuming the task after the timeout.
- Efficiency and Simplicity: Keep the core loop simple and avoid unnecessary features. We don't need memory protection, true parallelism, or complex prioritization – all tasks run in the same thread and share memory (like async functions do). Context switching is just a function call (resuming a generator), which is very cheap. We will not implement advanced scheduling policies; simple round-robin or FIFO scheduling of ready tasks is sufficient. Error handling can also be basic: if a task raises an uncaught exception (aside from the `StopIteration` that signals a generator's normal completion), it could terminate that task – the scheduler can catch exceptions from `task.run()` and decide to drop the task or log it. In a robust system we would have to propagate errors or allow tasks to handle others' failures, but for now a simple strategy (like printing the error and removing the task) is acceptable.
- Trampolining (Nested Coroutines): As an optional but highly useful feature, we want tasks to be able to call coroutine-based subroutines without losing the ability to yield. In a naive generator scheduler, if one generator calls another that uses `yield`, things get tricky – only the top-level generator's yields are noticed by the scheduler. David Beazley's trampolining solution was to have the `Task` object manage an explicit stack of generators. Whenever a running coroutine yields another generator, the scheduler does not treat it as a final yield; instead, the current task pushes its current generator onto a stack and continues executing the yielded sub-generator. This way, the task can dive into nested coroutines, and when the sub-coroutine completes or yields a value back, the task pops the stack and resumes the caller. Trampolining is crucial for code modularity – it allows writing coroutine helpers (e.g. an `Accept(sock)` coroutine that itself yields events) and calling them naturally with `result = yield Accept(sock)` inside a task. We will include trampolining in our design so that tasks can be structured hierarchically without the scheduler needing special logic for it (the `Task` object handles it, making it transparent to the scheduler).
- Focus on Tiffany's domain: If Tiffany is, for example, a provisioning system or similar, typical tasks might include networking (listening for requests, handling connections), file I/O, or invoking system commands. We anticipate heavy use of network/socket I/O, which fits well with this async approach. We likely do not need CPU-bound parallelism (which would require threads or multiple processes) – cooperative coroutines handle concurrency when tasks are I/O-bound or need to wait. As long as one task is waiting on I/O or an event, another can run. If all tasks are waiting (idle), the scheduler can block on `select` until something happens, using minimal CPU.
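As a sketch of the timer mechanism described above: a min-heap of `(wake_time, task)` entries lets the scheduler both wake expired sleepers and compute how long it may safely block in `select` without missing a timer (module-level names here are illustrative):

```python
import heapq
import time

# Scheduler-side bookkeeping for tasks that yielded a Sleep(duration) event.
sleeping = []   # min-heap of (wake_time, seq, task); seq breaks ties
_seq = 0

def add_sleeper(task, duration):
    """Record that `task` should wake `duration` seconds from now."""
    global _seq
    heapq.heappush(sleeping, (time.monotonic() + duration, _seq, task))
    _seq += 1

def pop_expired():
    """Remove and return every task whose timer has expired."""
    now = time.monotonic()
    woken = []
    while sleeping and sleeping[0][0] <= now:
        woken.append(heapq.heappop(sleeping)[2])
    return woken

def next_timeout():
    """How long the scheduler may block in select() without missing a timer."""
    if not sleeping:
        return None   # no timers pending: block indefinitely
    return max(0.0, sleeping[0][0] - time.monotonic())
```

On each loop iteration the scheduler would call `pop_expired()` to move woken tasks to the ready queue, and pass `next_timeout()` as the `select` timeout when no tasks are ready.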
Architecture Overview
Our coroutine runtime will consist of a few core components: Task, Scheduler, and a set of System Call events that tasks can yield to trigger special behavior. Additionally, we have the notion of the call stack within a Task for trampolining nested coroutines. Below we detail each component and how they interact.
🧬 Enhanced Sequence Diagram: Tiffany Coroutine Scheduler (Expanded)
sequenceDiagram
actor MainTask as Task_Main (Planning)
actor WorkerTask as Task_Worker (LLM Plan)
actor ToolTask as Task_Tool (IO Tool)
participant Scheduler
participant ReadyQueue as ReadyQueue_FIFO
participant CallStack_Main as CallStack_Main (LIFO)
participant CallStack_Worker as CallStack_Worker (LIFO)
participant WaitMap as EventMap (IO/Sleep/Join)
%% === Startup ===
MainTask->>Scheduler: yield NewTask(WorkerTask)
Scheduler->>ReadyQueue: enqueue(WorkerTask)
Scheduler->>MainTask: resume with WorkerTask.tid
%% === MainTask continues planning ===
MainTask-->>Scheduler: yield Sleep(5s)
Scheduler->>WaitMap: record MainTask sleep
Note right of WaitMap: MainTask sleeping until t + 5s
%% === WorkerTask starts execution ===
Scheduler->>ReadyQueue: dequeue next
ReadyQueue-->>Scheduler: WorkerTask
Scheduler->>WorkerTask: resume()
WorkerTask-->>Scheduler: yield NewTask(ToolTask)
Scheduler->>ReadyQueue: enqueue(ToolTask)
Scheduler->>WorkerTask: resume with ToolTask.tid
%% === ToolTask nested sub-coroutine call
WorkerTask-->>CallStack_Worker: push generator
CallStack_Worker->>WorkerTask: switch to ToolTask coroutine
Scheduler->>WorkerTask: continue (trampolining)
%% === ToolTask yields ReadWait
ToolTask-->>Scheduler: yield ReadWait(fd=4)
Scheduler->>WaitMap: register fd=4 for ToolTask
Note right of WaitMap: ToolTask blocked waiting on I/O
%% === I/O becomes ready
Note right of WaitMap: fd=4 becomes readable
WaitMap->>Scheduler: notify ToolTask
Scheduler->>ReadyQueue: enqueue(ToolTask)
%% === ToolTask completes
Scheduler->>ReadyQueue: dequeue ToolTask
ReadyQueue-->>Scheduler: ToolTask
Scheduler->>ToolTask: resume()
ToolTask-->>CallStack_Worker: return result
CallStack_Worker->>WorkerTask: resume with result
%% === WorkerTask finishes work
WorkerTask-->>Scheduler: StopIteration
Scheduler->>WaitMap: resolve joiners on WorkerTask
%% === MainTask wakes up
Note right of WaitMap: Timer expires for MainTask
WaitMap->>Scheduler: wake MainTask
Scheduler->>ReadyQueue: enqueue(MainTask)
%% === Final Resumption
Scheduler->>ReadyQueue: dequeue MainTask
ReadyQueue-->>Scheduler: MainTask
Scheduler->>MainTask: resume()
MainTask-->>Scheduler: continue next step
🧠 Key Concepts Represented:
- Scheduler is the event orchestrator
- Tasks are coroutine-based units of execution
- ReadyQueue (FIFO) holds ready-to-run tasks
- CallStack (LIFO) enables trampolining between nested coroutine calls
- EventMap (join waits, I/O readiness) manages blocked tasks waiting for conditions to be fulfilled
- SystemCall yields are managed through handle() logic per syscall (not shown inline here for brevity)
- Task completion includes reactivating any tasks waiting on the completed task (e.g. via WaitTask)
- Trampolining logic handles yield sub_coro and propagates results back via stack unwind
🧩 Component-Focused Diagrams Plan
| Diagram # | Focus Area | Purpose |
|---|---|---|
| D1 | 📦 Basic Scheduler & Task Yield | Round-robin task switching using FIFO ReadyQueue |
| D2 | 🪜 Nested Coroutines (Trampolining) | How yield SubCoro() pushes/pops the CallStack per task |
| D3 | 🔌 I/O Blocking & Wakeup | Task yields ReadWait, blocks on fd, gets resumed on readiness |
| D4 | ⏲️ Sleep and Timer Handling | Task yields Sleep(duration) and wakes later |
| D5 | 🔁 Join / WaitTask Resolution | Task B waits for Task A to finish and is resumed when Task A completes |
Diagram D1 – Basic Task Switching (ReadyQueue only)
sequenceDiagram
participant Scheduler
participant ReadyQueue as FIFO
actor TaskA
actor TaskB
Scheduler->>ReadyQueue: dequeue TaskA
ReadyQueue-->>Scheduler: TaskA
Scheduler->>TaskA: resume()
TaskA-->>Scheduler: yield (None)
Scheduler->>ReadyQueue: enqueue(TaskA)
Scheduler->>ReadyQueue: dequeue TaskB
ReadyQueue-->>Scheduler: TaskB
Scheduler->>TaskB: resume()
TaskB-->>Scheduler: yield (None)
Scheduler->>ReadyQueue: enqueue(TaskB)
Diagram D2 – Nested Coroutines (Trampolining)
sequenceDiagram
participant CallStack as CallStack_LIFO
actor TaskMain
actor SubCoro
TaskMain-->>CallStack: push(TaskMain.gen)
TaskMain->>SubCoro: start sub-coroutine
SubCoro-->>CallStack: yield value
CallStack->>TaskMain: pop parent
TaskMain->>TaskMain: resume with value
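The push/pop mechanics in D2 can be sketched as a standalone trampoline driver (a simplified, single-task version of the `Task.run` logic described later; `add` and `main_task` are illustrative):

```python
import types

def add(x, y):
    # Sub-coroutine: yields its result back to the caller via the trampoline
    yield x + y

def main_task(out):
    result = yield add(2, 3)   # yielding a generator = a trampolined sub-call
    out.append(result)

def run_trampolined(gen):
    """Drive one task, pushing/popping a LIFO stack of generators."""
    stack, current, sendval = [], gen, None
    while True:
        try:
            result = current.send(sendval)
        except StopIteration:
            if not stack:
                return                     # top-level generator finished
            current, sendval = stack.pop(), None
            continue
        sendval = None
        if isinstance(result, types.GeneratorType):
            stack.append(current)          # push caller, descend into sub-coro
            current = result
        elif stack:
            current, sendval = stack.pop(), result  # return value to the caller
```

The key move is that yielding a generator never reaches the scheduler: the driver pushes the caller and keeps running, so only "real" yields (plain values at the top level) would escape to the scheduler in the full design.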
Diagram D3 – I/O Blocking and Wakeup
sequenceDiagram
participant Scheduler
participant WaitMap as EventMap
participant ReadyQueue
actor TaskIO
TaskIO-->>Scheduler: yield ReadWait(fd=7)
Scheduler->>WaitMap: record TaskIO blocked on fd=7
Note right of WaitMap: fd=7 becomes readable
WaitMap->>Scheduler: notify fd=7 ready
Scheduler->>ReadyQueue: enqueue(TaskIO)
ReadyQueue-->>Scheduler: TaskIO ready to run
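The fd-to-task bookkeeping behind D3 might look like this (a Beazley-style sketch using two dicts and `select.select`; the class and method names are illustrative):

```python
import select

class IOWaiting:
    """Tracks which task is blocked on which file descriptor."""

    def __init__(self):
        self.read_waiting = {}    # fd -> task blocked on readability
        self.write_waiting = {}   # fd -> task blocked on writability

    def wait_for_read(self, task, fd):
        self.read_waiting[fd] = task

    def wait_for_write(self, task, fd):
        self.write_waiting[fd] = task

    def poll(self, timeout):
        """Return the tasks whose descriptors became ready."""
        if not (self.read_waiting or self.write_waiting):
            return []
        r, w, _ = select.select(self.read_waiting, self.write_waiting,
                                [], timeout)
        return ([self.read_waiting.pop(fd) for fd in r] +
                [self.write_waiting.pop(fd) for fd in w])
```

The scheduler would call `poll(0)` opportunistically when tasks are ready, or `poll(None)` to block when nothing else can run — exactly the two modes described in the main-loop discussion below.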
Diagram D4 – Sleep and Timer Handling
sequenceDiagram
participant Scheduler
participant WaitMap as SleepMap
participant ReadyQueue
actor TaskSleep
TaskSleep-->>Scheduler: yield Sleep(3s)
Scheduler->>WaitMap: record TaskSleep with wake_time = now + 3s
Note right of WaitMap: 3 seconds pass
WaitMap->>Scheduler: wake TaskSleep
Scheduler->>ReadyQueue: enqueue(TaskSleep)
Diagram D5 – Join / WaitTask Resolution
sequenceDiagram
    participant Scheduler
    participant WaitMap as JoinMap
    participant ReadyQueue
    actor TaskA
    actor TaskB
    TaskB-->>Scheduler: yield WaitTask(TaskA.tid)
    Scheduler->>WaitMap: record TaskB waiting on TaskA
    TaskA-->>Scheduler: StopIteration (TaskA completes)
    Scheduler->>WaitMap: resolve joiners on TaskA
    WaitMap->>Scheduler: TaskB may resume
    Scheduler->>ReadyQueue: enqueue(TaskB)
    ReadyQueue-->>Scheduler: TaskB ready to run
State Machine Diagram: Task Lifecycle
Purpose: Shows all valid transitions for a task (created → ready → running → blocked → complete).
Helps with: Understanding task states, debugging transitions, implementing lifecycle hooks.
stateDiagram-v2
[*] --> Created
Created --> Ready : schedule()
Ready --> Running : run()
Running --> Ready : yield (cooperative)
Running --> Blocked_IO : yield ReadWait/WriteWait
Running --> Blocked_Sleep : yield Sleep
Running --> Blocked_Join : yield WaitTask
Blocked_IO --> Ready : fd becomes ready
Blocked_Sleep --> Ready : timer expires
Blocked_Join --> Ready : waited task completes
Running --> Complete : StopIteration
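The state machine above can be encoded directly, which makes illegal transitions fail loudly during development (a sketch; the names mirror the diagram):

```python
from enum import Enum, auto

class TaskState(Enum):
    CREATED = auto()
    READY = auto()
    RUNNING = auto()
    BLOCKED_IO = auto()
    BLOCKED_SLEEP = auto()
    BLOCKED_JOIN = auto()
    COMPLETE = auto()

# Legal transitions, taken directly from the state diagram above
VALID_TRANSITIONS = {
    (TaskState.CREATED, TaskState.READY),          # schedule()
    (TaskState.READY, TaskState.RUNNING),          # run()
    (TaskState.RUNNING, TaskState.READY),          # cooperative yield
    (TaskState.RUNNING, TaskState.BLOCKED_IO),     # yield ReadWait/WriteWait
    (TaskState.RUNNING, TaskState.BLOCKED_SLEEP),  # yield Sleep
    (TaskState.RUNNING, TaskState.BLOCKED_JOIN),   # yield WaitTask
    (TaskState.BLOCKED_IO, TaskState.READY),       # fd becomes ready
    (TaskState.BLOCKED_SLEEP, TaskState.READY),    # timer expires
    (TaskState.BLOCKED_JOIN, TaskState.READY),     # waited task completes
    (TaskState.RUNNING, TaskState.COMPLETE),       # StopIteration
}

def transition(current, new):
    """Lifecycle hook: reject any transition the diagram does not allow."""
    if (current, new) not in VALID_TRANSITIONS:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```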
Class Diagram: Scheduler Components
Purpose: Depicts the structure of Scheduler, Task, SystemCall, ReadyQueue, etc.
Helps with: Implementation planning, unit test coverage, codebase modularity.
classDiagram
class Scheduler {
+mainloop()
+schedule(Task)
+add_new_task(coro)
-ready_queue : Queue
-waiting_fds : Map
-waiting_tasks : Map
}
class Task {
+run()
-tid : int
-target : Generator
-stack : LIFO
-sendval : any
}
class SystemCall {
+handle()
}
class ReadyQueue {
+enqueue(Task)
+dequeue() Task
}
Scheduler --> Task
Task --> SystemCall
Scheduler --> ReadyQueue
Fault Handling & Panic Recovery
This diagram shows what happens when a task panics (e.g. raises an exception or hits StopIteration unexpectedly), and how the scheduler logs, cleans up, and optionally notifies dependent systems via the PAL and WAL.
sequenceDiagram
actor FaultyTask as Task_Foo
participant Scheduler
participant Logger
participant PAL as ProcessActivityLog
participant WAL as WriteAheadLog
participant WaitMap as JoinMap
Scheduler->>FaultyTask: run()
FaultyTask-->>Scheduler: panic (Exception or StopIteration)
alt Exception (unexpected)
Scheduler->>Logger: log(Task_Foo crashed: {error})
Scheduler->>PAL: record TaskFailure(Task_Foo, stacktrace)
Scheduler->>WAL: write TaskAbort(Task_Foo)
Scheduler->>WaitMap: notify any joiners
Scheduler-->>FaultyTask: deallocate & remove from registry
else Normal Completion (StopIteration)
Scheduler->>PAL: record TaskComplete(Task_Foo)
Scheduler->>WAL: write TaskExit(Task_Foo)
Scheduler->>WaitMap: notify any joiners
Scheduler-->>FaultyTask: remove
end
🧠 Interpretation
- Exceptions and normal StopIteration are handled differently, but both trigger:
- Log/record in PAL (for real-time dashboards)
- Append to WAL (for deterministic replay)
- Notify any other tasks waiting on the failed/completed one
- This can be extended later to retry policies, fallbacks, or alert hooks (e.g., webhook to notify UI agent).
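A sketch of this fault-handling path in the scheduler's step function (here `pal` and `wal` are plain in-memory lists standing in for the real PAL stream and the durable WAL):

```python
import traceback

class FaultTolerantScheduler:
    """Minimal sketch of the fault-handling sequence above."""

    def __init__(self):
        self.pal, self.wal = [], []
        self.joiners = {}      # tid -> [tasks blocked in WaitTask(tid)]
        self.taskmap = {}      # tid -> task

    def _notify_joiners(self, tid):
        for _waiter in self.joiners.pop(tid, []):
            pass               # would re-enqueue each waiter on the ready queue

    def step(self, task):
        try:
            return task.run()  # task still live: nothing to clean up
        except StopIteration:
            # Normal completion
            self.pal.append(("TaskComplete", task.tid))
            self.wal.append(("TaskExit", task.tid))
        except Exception:
            # Unexpected failure: record for dashboards and for replay
            self.pal.append(("TaskFailure", task.tid, traceback.format_exc()))
            self.wal.append(("TaskAbort", task.tid))
        self._notify_joiners(task.tid)
        self.taskmap.pop(task.tid, None)   # deallocate & remove from registry
        return None
```

Both exit paths converge on the same cleanup (notify joiners, drop from the registry); only the PAL/WAL record types differ, which is what the `alt`/`else` branches in the diagram show.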
Data Flow: Task Execution + WAL/PAL Logging
This shows how data produced during task execution (including yielded values, system calls, and final results) is recorded into:
- WAL (Write-Ahead Log): durable event stream for recovery/replay
- PAL (Process Activity Log): ephemeral, streaming trace for dashboards
- MemoryStore: optional semantic graph / state mutation
Purpose: Shows how yield/return data from tasks is recorded and logged into the system.
- WAL = Write-Ahead Log
- PAL = Process Activity Log
This is helpful for:
- Task recovery
- Auditing
- Live dashboards
flowchart TD
A[Task: Coroutine Execution] --> B1[Yield SystemCall or Result]
B1 --> C1[Scheduler Dispatch]
%% WAL Logging
C1 --> D1[WAL: write - Event]
D1 --> E1[Disk / WAL File]
%% PAL Logging
C1 --> D2[PAL: stream - TaskProgress]
D2 --> E2[PubSub Bus / Dashboard Feed]
%% If Memory Update
C1 -->|Memory-related yield| F1[MemoryStore Update]
F1 --> G1[GraphDB or VecDB]
%% Loop continuation
C1 --> H1[Requeue Task or Terminate]
style E1 fill:#dff,stroke:#0af
style E2 fill:#ffd,stroke:#fa0
style G1 fill:#f9f,stroke:#c0f
🧠 Interpretation
Every execution result passes through the Scheduler, which:
- Logs to WAL for recovery (e.g. TaskYielded, TaskWaitIO(fd=4))
- Streams to PAL for dashboards (e.g. TaskStep with step label)
- Optionally updates MemoryStore (e.g. a new semantic link or skill)
- WAL is append-only, immutable
- PAL is transient (but real-time)
- MemoryStore is stateful, must be synchronized with execution semantics
🧠 Summary: Which to prioritize?
| Diagram | Priority | Purpose |
|---|---|---|
| ✅ Task Lifecycle (State) | ⭐⭐⭐⭐ | Debugging, correctness, onboarding |
| ✅ Scheduler Class Overview | ⭐⭐⭐⭐ | Code structuring and test scaffolds |
| ✅ Fault Handling Sequence | ⭐⭐⭐ | Observability, stability |
| ✅ PAL/WAL Data Flow | ⭐⭐⭐ | Logging, traceability, dashboards |
| ✅ SystemCall Dependency Graph | ⭐⭐ | Extending runtime with new features |
Task
A Task represents a running coroutine (generator). It encapsulates the coroutine and its state, including anything needed to resume it. In our design, a Task will have:
- `tid` – a task identifier (unique per task). This can simply be an integer assigned in sequence, for debugging or referencing tasks.
- `target` – the current generator that the task is running. Initially this is the coroutine function's generator object. Due to trampolining, this `target` may change to a sub-generator if the coroutine calls another coroutine (we maintain a stack of generators, see below).
- `stack` – a list (stack) of generator objects, initially empty. This is used for trampolining: whenever the task yields a generator (a sub-coroutine), we push the current `target` onto `stack` and start running the sub-generator as the new `target`. When a generator finishes, we pop from the stack to resume the caller.
- `sendval` – a value to send into the coroutine on the next resume. This is how we deliver data or signals to the coroutine. For example, when a task yields an event and the scheduler processes it, the scheduler might set `sendval` for when the task resumes (e.g., the result of a system call). Initially, when a task is first started, `sendval` is `None` (since we have to send `None` or call `next()` to start a generator).
The Task object will have a method like run() which essentially advances the coroutine to its next yield point. Pseudocode for Task.run might be:
import types

class Task:
    def run(self):
        try:
            result = self.target.send(self.sendval)  # Resume coroutine, send value
        except StopIteration:
            if self.stack:
                # Sub-coroutine finished: pop the calling generator and resume it
                self.sendval = None
                self.target = self.stack.pop()
                return self.run()
            # No caller: the task is truly done
            return None
        self.sendval = None  # reset sendval after sending
        if isinstance(result, SystemCall):
            # Coroutine yielded a system call: hand it to the scheduler
            return result
        if isinstance(result, types.GeneratorType):
            # Coroutine yielded a sub-generator: trampolining
            self.stack.append(self.target)
            self.target = result
            return self.run()  # (Beazley manages this via a while loop instead)
        # A normal yield of a plain value
        if not self.stack:
            # Top-level coroutine yielded a value (not a sub-call, not a syscall)
            return result  # this result can be ignored or handled by the scheduler
        # A caller is awaiting this yield: pop it and send the value back in
        self.target = self.stack.pop()
        self.sendval = result
        return self.run()
This logic is derived from Beazley’s trampolining implementation. In fact, Beazley’s code used a while True loop inside run() to handle nested yields in one go. The key idea is: if a coroutine yields another generator, we don’t return to the scheduler; instead, we switch context to that sub-generator (pushing the current one onto a stack). We only return from run() to the scheduler when we hit a yield that isn’t a sub-coroutine call or when a system call needs handling. If a coroutine fully finishes (StopIteration) and we have a caller on the stack, we pop back to it (effectively as if the yield of the sub-coro returned None). The scheduler will see a return of None from task.run() as either “task not ready to yield to scheduler (continue internally)” or “task completed”. We might handle that by the scheduler: if task.run() returns None, it means either the task finished or is still running a sub-task (depending on how we code it). In Beazley’s code, he returned None to indicate the task isn’t yielding a real value (so the scheduler will just move to next task) until a system call or completion happens. We can follow a similar convention.
To summarize Task.run outcomes: It can return a SystemCall object (meaning “scheduler, please handle this and don’t immediately reschedule me”), or return a regular value/None. If it returns a non-SystemCall (including None), the scheduler can assume the task either completed or just yielded a value without any special meaning – usually we would simply put the task back on the ready queue if it returned a value or None (unless None signifies completion). One way to detect completion is to track if the task’s coroutine is exhausted. In our code, if a StopIteration occurs with no caller, that task is done and should not be rescheduled. We might mark a flag or simply not requeue it.
Scheduler
The Scheduler is the event loop that coordinates all tasks. It maintains data structures for ready tasks and for tasks waiting on events (like I/O or other tasks). Key parts of the scheduler design:
- A queue (FIFO) of ready tasks that can run. We can use a simple list or collections.deque for this. Tasks get pulled from here to execute, and if they yield cooperatively (a normal yield), they go back to the end of the queue. This ensures round-robin fairness for CPU time among tasks that are ready to run.
- A task table or registry to keep track of tasks by ID. For example, a dictionary tasks mapping tid -> Task. This is useful if we implement operations like KillTask(tid) or WaitTask(tid) where the scheduler needs to find a specific task by its ID.
- Data structures for waiting tasks:
  - For I/O: perhaps two dicts, read_waiting and write_waiting, mapping file descriptors to the Task waiting on them (if any). Alternatively, a combined structure mapping an event key to tasks. Initially, we can keep it simple with two dicts as in Beazley’s example (one for read, one for write).
  - For task waiting: if implementing WaitTask, we need a map of task IDs to a list of tasks waiting for them to finish. E.g., waiting[tid] = [list of Tasks paused until tid finishes].
  - For timers: if we want Sleep(duration) functionality, we might maintain a min-heap or list of (wake_time, Task) entries for tasks that are sleeping, and on each loop check the current time to see if any should be woken and moved to ready. (This is similar to how some event loops manage call-at-future-time events.)
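Concretely, these structures might be initialized as follows. This is a minimal sketch; the (wake_time, tid, Task) triple for sleepers is our own choice here, using the tid as a tiebreaker so a heap never has to compare Task objects directly.

```python
from collections import deque

class Scheduler:
    def __init__(self):
        self.ready = deque()       # FIFO of runnable Task objects
        self.tasks = {}            # task table: tid -> Task
        self.read_waiting = {}     # fd -> Task blocked until fd is readable
        self.write_waiting = {}    # fd -> Task blocked until fd is writable
        self.waiting = {}          # tid -> [Tasks paused in WaitTask(tid)]
        self.sleeping = []         # min-heap of (wake_time, tid, Task)
```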
- The main loop of the scheduler will roughly do: while there are tasks (or events to wait for):
  - If there are no ready tasks, block on I/O (and possibly timer) events until something is ready (to avoid busy-wait). If there are ready tasks, we might do a non-blocking check for I/O instead. This logic is found in Beazley’s iotask(), which the scheduler can also manage as a special periodic task. We can incorporate the polling directly in the loop.
  - Take the next task from the ready queue.
  - Call task.run().
  - Examine the return value:
    - If result is a SystemCall instance: this means the task is requesting a special operation (and has yielded control intentionally for this). The scheduler will handle the system call immediately. Typically, the SystemCall object has a handle() method that the scheduler can invoke. In our design, we can implement each SystemCall’s handle to perform the needed action. Before calling handle(), we usually set context on the SystemCall: e.g. result.task = current_task and result.sched = self, so it knows who’s calling and can communicate back. After handling, the task might be ready to continue (or not). For example, the GetTid system call will set the task’s sendval to its tid and immediately requeue the task. A ReadWait system call will not requeue the task; instead it will put the task into read_waiting and not schedule it until the I/O is ready. The SystemCall’s logic will determine that. After handle(), the scheduler’s loop continues to the next iteration without requeuing the current task (unless the system call did so).
    - If result is None or a regular value (not a SystemCall): this means the task either completed or just yielded a value. If result is None and the task’s generator is done, we drop the task (it finished execution). If result is not None (or is None but the task is not actually done yet due to trampolining), it likely represents a normal cooperative yield – the task yielded control on purpose (maybe via a yield with no special object). In such a case, we simply put the task back onto the ready queue to resume later. We might not have a strong use for the actual value yielded (some systems use it for passing messages, but in our design we don’t anticipate tasks communicating via yielded values except for the defined SystemCall objects). We can safely ignore or log unexpected yield values. In summary, for a normal yield, requeue the task.
    - If an exception was raised from task.run() that isn’t StopIteration (meaning an error in the task): we might catch it around task.run(). In such a case, we should consider the task crashed. The scheduler can log the exception (and the task ID for debugging) and not requeue the task. This prevents one task’s error from crashing the whole scheduler loop.
- I/O polling: To handle I/O, the scheduler will incorporate a call to select (or an equivalent like selectors in Python) each cycle. One approach (from Beazley’s code) is to spawn a special internal task whose job is to continuously call iopoll() in a loop and yield, so that it runs concurrently with others. Another approach is to integrate polling into the main loop when no ready tasks are left (or periodically). We can do something like:
# Pseudocode for I/O wait integration in scheduler loop:
if not ready_queue:
# No task ready to run, so we can block on I/O
timeout = None # block indefinitely until an event
else:
timeout = 0 # poll without blocking (just check and return)
readable, writable, _ = select(read_waiting_fds, write_waiting_fds, [], timeout)
for fd in readable:
task = read_waiting.pop(fd)
ready_queue.append(task)
# perhaps set task.sendval = None or some actual data if needed
for fd in writable:
task = write_waiting.pop(fd)
ready_queue.append(task)
This way, if tasks are waiting on I/O, we wake them when ready. If tasks are simply CPU-bound but cooperative, the timeout=0 poll ensures we check I/O quickly without blocking, and then continue round-robin. If all tasks are waiting (no one ready), we do a blocking wait (timeout=None) so that the process sleeps until an I/O event occurs. This is important to avoid spinning CPU when idle.
- SystemCall Handling: We will define a small set of system call classes that encapsulate operations. Likely candidates for Tiffany:
  - GetTid – returns the task’s own ID. Useful for tasks to identify themselves (maybe for logging). This would simply do: self.task.sendval = self.task.tid; self.sched.schedule(self.task) (i.e., put the task back in the ready queue).
  - NewTask(coro) – create a new Task for the given coroutine and schedule it. The current task could receive the new task’s id as a result of the yield. Implementation: create the Task, store it in the task map, enqueue it; set the current task’s sendval to the new task’s id (or some acknowledgment) and then also enqueue the current task again (so it continues).
  - WaitTask(tid) – the current task wants to wait until task tid completes. We would check if that task is still alive. If yes, remove the current task from the ready queue and put it in a wait list for that tid. If task tid is already done, we can immediately continue the current task (perhaps setting sendval to some result or None). If waiting, when task tid eventually finishes, the scheduler will need to take any tasks in its wait list and enqueue them (with perhaps a result indicating that the wait is over).
  - KillTask(tid) – to terminate a task. The scheduler would look up the task by id, and if found, remove it from the ready queue or waiting structures, and possibly prevent it from running. If a task is killed, tasks waiting on it might need to be notified (maybe with an exception or a special value). We might not need this in Tiffany initially unless we have a use-case to stop tasks.
  - ReadWait(file) / WriteWait(file) – as described, these will take the file’s file descriptor and put the task in read_waiting or write_waiting. The task is not requeued; it’ll be resumed by the scheduler when the fd is ready (detected via select). We should mark in the task map that this task is now blocked on I/O. The handle() will do, for example, self.sched.waitforread(self.task, fd) and nothing more (it does not requeue the task). The scheduler’s polling will take care of waking it.
  - (Optional) Sleep(dt) – not mentioned in Beazley’s original, but easy to add. It would record a wake-up time (now + dt) and put the task aside (perhaps in a min-heap sorted by wake time). The scheduler each loop can check the soonest wake-up time to decide a timeout for select. When the time is reached or passed, the scheduler will requeue the task. This can be implemented similarly to I/O wait.
We can implement these system calls as classes with a handle() method as mentioned. They might look like:
class SystemCall:
def handle(self):
pass
class GetTid(SystemCall):
def handle(self):
self.task.sendval = self.task.tid
self.sched.schedule(self.task) # put task back ready
class NewTask(SystemCall):
def __init__(self, coro):
self.coro = coro
def handle(self):
new_task = Task(self.coro)
self.sched.add_task(new_task)
# Immediately schedule new task
self.sched.schedule(new_task)
# Also resume current task
self.task.sendval = new_task.tid
self.sched.schedule(self.task)
And so on for others (ReadWait, etc.), following the patterns from Beazley’s code (which we have paraphrased). The scheduler will detect isinstance(result, SystemCall) and invoke result.handle() after attaching result.task = current_task and result.sched = self. This injects the needed context so the SystemCall knows which task is making the request and can access the scheduler.
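Following that pattern, the blocking calls can be sketched directly against the scheduler structures named earlier (read_waiting, waiting, sleeping). This is a sketch, not Beazley’s verbatim code: it writes into the scheduler’s dicts directly instead of going through a waitforread helper, and the (wake_time, tid, Task) heap entry is our own tiebreaking choice.

```python
import heapq
import time

class SystemCall:
    """Trimmed base; the scheduler attaches .task and .sched before handle()."""
    def handle(self):
        pass

class ReadWait(SystemCall):
    def __init__(self, f):
        self.f = f
    def handle(self):
        # park the task in the read-wait map; deliberately do NOT requeue it
        fd = self.f.fileno()
        self.sched.read_waiting[fd] = self.task

class WaitTask(SystemCall):
    def __init__(self, tid):
        self.tid = tid
    def handle(self):
        if self.tid in self.sched.tasks:
            # target still alive: block until the scheduler wakes us on exit
            self.sched.waiting.setdefault(self.tid, []).append(self.task)
        else:
            # target already finished: resume the caller immediately
            self.task.sendval = None
            self.sched.schedule(self.task)

class Sleep(SystemCall):
    def __init__(self, delay):
        self.delay = delay
    def handle(self):
        # tid breaks ties so heapq never compares Task objects directly
        wake = time.monotonic() + self.delay
        heapq.heappush(self.sched.sleeping, (wake, self.task.tid, self.task))
```

Note the asymmetry: WaitTask on a dead tid requeues the caller at once, while ReadWait and Sleep never requeue – the select loop and the timer check are responsible for waking those tasks.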
Scheduler APIs: We should have methods like Scheduler.new(coro) to add a new coroutine (wrap it in Task and enqueue), and Scheduler.schedule(task) to enqueue an existing Task (this might just put it in ready queue). Also, perhaps Scheduler.exit(task) or internal handling when a task finishes (to do cleanup, e.g., remove from task map, and wake any waiters).
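A sketch of those APIs, assuming the attribute names used earlier. The Task stub here is trimmed to just a tid; the full version carries the run()/trampoline logic described above.

```python
from collections import deque

class Task:
    """Trimmed stub: only the tid matters for the bookkeeping shown here."""
    _next_tid = 1
    def __init__(self, coro):
        self.tid = Task._next_tid
        Task._next_tid += 1
        self.coro = coro

class Scheduler:
    def __init__(self):
        self.ready = deque()   # FIFO of runnable tasks
        self.tasks = {}        # tid -> Task
        self.waiting = {}      # tid -> [tasks blocked in WaitTask(tid)]

    def new(self, coro):
        # wrap a coroutine in a Task, register it, and enqueue it
        task = Task(coro)
        self.tasks[task.tid] = task
        self.schedule(task)
        return task.tid

    def schedule(self, task):
        # put an existing Task on the ready queue
        self.ready.append(task)

    def exit(self, task):
        # cleanup when a task finishes: drop it and wake any waiters
        del self.tasks[task.tid]
        for waiter in self.waiting.pop(task.tid, []):
            self.schedule(waiter)
```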
The main loop Scheduler.mainloop() drives everything until no tasks remain. It will incorporate the logic discussed: get next ready task, run it, handle results, and poll events. In Python pseudocode:
def mainloop(self):
while self.tasks or not self.ready.empty():
# handle IO and timers
if self.ready.empty():
timeout = None
else:
timeout = 0
# use select or similar to get ready fds
rlist = list(self.read_waiting.keys())
wlist = list(self.write_waiting.keys())
if rlist or wlist:
readable, writable, _ = select.select(rlist, wlist, [], timeout)
for fd in readable:
task = self.read_waiting.pop(fd)
self.schedule(task)
for fd in writable:
task = self.write_waiting.pop(fd)
self.schedule(task)
elif timeout is None:
# No fds to wait on, but we have to wait (meaning no ready tasks and no I/O)
# This situation likely means no tasks at all, or only tasks waiting on something that isn't fd (maybe a timer).
# We could sleep or break.
break
if not self.ready.empty():
task = self.ready.get() # dequeue a ready task
else:
continue # go back to loop top if nothing ready (maybe all tasks done while waiting)
result = None
try:
result = task.run()
except Exception as e:
# handle unexpected task exception
print(f"Task {task.tid} raised {e}, removing it")
# (optionally call some task cleanup)
continue
if result is None:
# Task finished or yielded normally with no special request
if task.is_done(): # we would need a way to check if fully done
# remove from task map, and if someone waiting on it:
if task.tid in self.waiting:
for wtask in self.waiting[task.tid]:
self.schedule(wtask)
del self.waiting[task.tid]
continue # don't requeue
else:
# If not done, it yielded cooperatively (like yield None or something)
self.schedule(task) # just requeue
elif isinstance(result, SystemCall):
result.task = task
result.sched = self
result.handle()
# Note: the SystemCall handler is responsible for requeueing the task or not
else:
# The task yielded a value (not None, not SystemCall)
# We currently have no use for direct yielded values except maybe debug
# Treat it as a cooperative yield: task still alive, requeue it
self.schedule(task)
The above is a rough outline; details will differ (like how we mark task done – perhaps Task.run() sets an internal flag or we catch StopIteration).
One subtlety: In Beazley’s code, when a task returns a SystemCall, Task.run() actually returns that SystemCall up to the scheduler, and the scheduler does not immediately requeue the task but processes the system call. The task in such case is still in a running state, just paused until the system call is done. Many system calls will immediately put the task back to ready (like GetTid, NewTask), but some (ReadWait, WaitTask) deliberately do not requeue the task, causing it to be effectively blocked. Our design follows the same approach.
To illustrate the scheduling and task switching, consider two tasks yielding control back and forth:
Figure 1: Cooperative task switching. In the above sequence diagram, Task A runs and then yields control to the scheduler (e.g., by executing a simple yield with no special event) to allow other tasks to run. The scheduler receives control and marks Task A as ready (or takes it out temporarily since it yielded). The scheduler then picks Task B to run next. Task B runs for a while and then yields control as well. The scheduler resumes Task A, and so on in a round-robin fashion. This cooperative approach ensures tasks take turns. Each arrow labeled “yield … control” represents a task suspending itself by yielding, and “resume Task X” represents the scheduler activating the next task. The scheduler is at the center, orchestrating the switches.
Trampolining and Nested Coroutines
Trampolining is handled within the Task’s run() logic, as described, rather than in the scheduler. This is by design: from the scheduler’s perspective, a Task with trampolining is still a single task – the scheduler doesn’t need to know if the task internally entered a sub-coroutine. The Task will only return to the scheduler when the entire nested sequence yields a system call or finishes. This keeps the scheduler simple and pushes the complexity to the Task object.
To understand trampolining, imagine a server task that calls a helper coroutine. For example:
def Accept(sock):
yield ReadWait(sock) # wait for a client to connect
client, addr = sock.accept() # accept connection (once socket is ready)
yield client, addr # yield the result back to caller
def server(port):
sock = setup_listening_socket(port)
while True:
client, addr = yield Accept(sock) # call sub-coroutine Accept
yield NewTask(handle_client(client, addr))
In this scenario, server() calls Accept(sock) via yield. Without trampolining, Accept’s yields would not be handled correctly. With trampolining, what happens internally is:
- The server coroutine (let’s call it Outer) yields a generator object when it does yield Accept(sock). That generator corresponds to the Accept coroutine. The Task running server detects that it yielded a generator. It pushes the Outer generator onto its stack and sets its current target to the Accept generator. The Task will now start/resume the Accept coroutine. The scheduler is not involved in this switch; it all happens inside Task.run.
- The Accept coroutine runs until its first yield. It yields ReadWait(sock) – a system call to wait for a socket. At this point, in Task.run, result is a ReadWait SystemCall, which triggers the return result path. Thus, the Task returns the ReadWait event to the scheduler (and pauses). The scheduler sees a SystemCall, sets it up with context, and handles it. In this case, ReadWait.handle() will register that task in the read_waiting map for that socket’s fd and not requeue it. The task is now blocked waiting for I/O. The scheduler will move on to other tasks (if any) or wait in select for file descriptors.
- When a client connection comes in, the listening socket becomes readable. The scheduler’s select call returns that fd as ready. The scheduler finds which task was waiting on that fd (the Accept’s task) and puts it back on the ready queue. The scheduler will eventually run that task again; in Task.run, we re-enter where we left off in Accept (after the yield ReadWait). We likely set sendval = None when resuming after I/O (there is no particular value to send; we could send the sock.accept() result, but in this design Accept itself calls sock.accept()). Accept now executes client, addr = sock.accept() (which will not block, since the socket is ready) and then yield client, addr – a regular value (not a SystemCall or generator).
- The Task’s run() sees that the Accept generator yielded a normal value (the client socket and address). Since the Task’s stack is not empty (Outer is waiting), the code pops the Outer generator off the stack, makes it the current target again, and sets the value as sendval. It then continues running the Outer generator, effectively sending the (client, addr) tuple into the yield Accept(sock) expression. Thus, in the server() coroutine, the call client, addr = yield Accept(sock) receives the values and continues. From the scheduler’s view, this entire sequence looks like the task produced a SystemCall, went to sleep on I/O, and later resumed – the hand-off of the tuple happened entirely inside Task.run. The Outer coroutine then immediately yields NewTask(handle_client(client, addr)) as per the code, which is a SystemCall. The scheduler handles the NewTask system call by creating a new task for handle_client and scheduling it. Meanwhile, the server task (Outer) is requeued (since after yielding NewTask it continues its loop).
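The same hand-off can be exercised without sockets. The toy below (hypothetical fetch/worker names) mirrors the Task.run trampoline logic for a single task: a sub-coroutine “returns” a value to its caller by yielding it, and the driver pushes and pops the caller stack exactly as described; a sketch, with no scheduler or I/O involved.

```python
import types

def fetch(n):
    # plays the role of Accept: "returns" a value to its caller by yielding it
    yield n * 2

def worker(log):
    # plays the role of server: calls sub-coroutines as if they were functions
    a = yield fetch(10)   # receives 20
    b = yield fetch(a)    # receives 40
    log.append(a + b)

def trampoline(coro):
    """Drive one task's nested coroutines to completion (no scheduler)."""
    stack, target, sendval = [], coro, None
    while True:
        try:
            result = target.send(sendval)
        except StopIteration:
            if not stack:
                return
            target, sendval = stack.pop(), None
            continue
        if isinstance(result, types.GeneratorType):
            stack.append(target)                   # sub-coroutine call: push caller
            target, sendval = result, None
        elif stack:
            target, sendval = stack.pop(), result  # sub "returned" a value: pop caller
        else:
            sendval = None                         # top-level cooperative yield: resume
```

Running trampoline(worker(log)) drives both nested calls to completion and leaves 60 in the log, without worker ever knowing a stack switch happened.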
This interplay is complex, but the good news is our design cleanly separates concerns:
- The Task/trampoline mechanism handles sub-coroutine calls and returns without scheduler intervention, making nested coroutine calls possible just like nested function calls.
- The scheduler handles system calls and dispatch.
- The net effect is that our code can be written in a linear style with yield acting as an async wait, which is very convenient.
Figure 2: Nested coroutine (trampolining) and I/O event sequence. The above diagram illustrates the sequence described: The Outer task (e.g., the server loop) yields a sub-coroutine Accept(sock). The Task internals switch to running the Accept coroutine. The Accept coroutine yields a ReadWait system call to the Scheduler, indicating it needs to wait for the socket. The Scheduler pauses the task until the socket is ready (some time passes externally). Once a client connects (socket ready), the Scheduler resumes the Accept coroutine. Accept then produces a result (client, addr) and returns that back to the Outer task. The Outer task resumes, now with the client info, and can continue (in this case, spawning a new task to handle the client). All of this happens seamlessly: from the programmer’s perspective, yield Accept(sock) returned a value when the Accept coroutine completed, just like a normal function call that waited for I/O. This transparency is the benefit of trampolining. The scheduler did not have to know about the call to Accept; it only dealt with the ReadWait and later the NewTask system calls.
Rust Implementation Consideration
We should briefly consider how this design would map to Rust 2024, to ensure our approach is not fundamentally incompatible if we choose or need to implement in Rust. Rust’s async/await works differently but the high-level structure of an event loop with tasks is analogous:
- In Rust, an async function is analogous to our coroutine. It can await on futures (similar to yielding events). We could create an executor that polls these futures. For example, our scheduler’s main loop in Rust would poll each future (task) until it returns Poll::Pending or Poll::Ready. If pending, it means the task yielded (awaited) on something – we’d then need to know what it’s waiting on (Rust’s waker system handles this via registering wakers with reactors like epoll or timers). If ready, the future completed and can be dropped.
- Many of the constructs we have as SystemCall in Python would correspond to either standard library futures or custom future combinators in Rust. For instance, waiting on a socket is typically done via an async I/O reactor (like using tokio, which integrates with OS polling under the hood). Spawning a new task is typically a method on the executor (spawn in tokio). Getting a task ID is not built-in for Rust futures (since tasks are generally anonymous closures), but one could incorporate an ID for logging.
- One could attempt to implement a generator-based coroutine in Rust with the unstable Generator trait, which would look more like the Python approach (allowing yield within a generator). However, using that directly is nightly-only. Instead, using async/await (which are stackless coroutines) is the idiomatic way. Each .await in Rust is akin to a yield point where the task returns control to the executor. The executor must know how to wake the task when the awaited operation is ready. This is done via Waker objects in Rust, typically provided by the runtime. Our Python scheduler used a simple select loop; in Rust, an executor often uses OS selectors (epoll, etc.) or leverages a library for readiness (like mio in tokio).
- Minimum needed vs full features: In Rust, if using existing libraries, we might not need to implement our own reactor – we could use something like futures::executor::block_on, or smol, or tokio for a lot of functionality. But if we want a standalone minimal runtime (for learning or control), we could implement a basic selector/poll loop using lower-level crates. Our SystemCall analogs in Rust could be simple functions or futures: e.g., a read_wait(socket) could be a future that registers interest in a socket-readability event and yields until the socket is readable. The complexity is higher than in Python due to Rust’s safety and the need to interface with the OS properly (non-blocking sockets, etc.).
Evaluating Python vs Rust for Tiffany: For prototyping and quick development, Python’s coroutine scheduler is extremely flexible and easy to change. It runs in a single thread (which might be fine if Tiffany’s workload is I/O-bound and not CPU-bound) and can handle many concurrent tasks (Python generators are lightweight). Python’s drawback is performance – if we need to handle thousands of tasks or high throughput I/O, the Python interpreter might become a bottleneck. Additionally, Python’s GIL prevents multi-core usage, but since we’re doing single-threaded co-op multitasking, that’s not a concern unless one wanted to scale across threads (not in our current scope).
Rust, on the other hand, offers high performance and true non-blocking behavior with zero-cost abstractions once compiled. If Tiffany’s environment is constrained (e.g., running on bare-metal with low resources or needing to maximize network throughput), a Rust implementation might be worthwhile. Rust also offers memory safety – our Python scheduler must be carefully written, but Python itself is memory safe; the main risks in Python are logical (like forgetting to requeue a task or handle an exception). Rust could help avoid certain classes of errors (like data races, if we later use multiple threads for parallelism). However, writing our own Rust async runtime is a significantly bigger engineering effort than the Python prototype. The minimum necessary for our needs might be achieved with the Python approach initially, given we don’t need true parallelism or extreme performance in the prototype stage. We can ensure the design is portable to Rust by mirroring concepts:
- Use Futures in Rust as tasks; use an executor (which we can implement or reuse) akin to our Scheduler.
- Replace yields and SystemCalls with await points and either library futures (for I/O) or custom ones (for things like GetTid – though in Rust, GetTid is trivial or not usually needed, since tasks aren’t typically identified by an integer by default).
- We won’t have an exact equivalent of the trampolining issue in Rust, because async/await handles that automatically via the compiler (an async function can call another async function and await it, which is analogous to our yield-from or trampolining solution, but the Rust compiler flattens these awaits in the generated state machine). Essentially, Rust’s async is more like having trampolining built-in – you can await nested async calls with no special effort.
In conclusion, for Tiffany’s minimum requirements, the Python coroutine scheduler suffices and is easier to build and iterate on. We will proceed with that design, implementing just the features we need (task creation, basic synchronization, and non-blocking I/O). If performance tests or production considerations demand it, we can plan a migration to Rust using an equivalent structure (possibly leveraging an existing runtime to avoid reinventing the wheel at low level). The conceptual model of our scheduler – an event loop handling tasks and events – is common to virtually all async runtimes (from JavaScript’s event loop to libuv, Python asyncio, and Rust executors), so the knowledge and design should translate well.
Summary of Design Decisions
- Single-threaded cooperative scheduler: simple design, deterministic behavior, no preemption. Suitable for I/O-heavy tasks. Not suitable for CPU-bound parallel work (not needed for now).
- Coroutine tasks via Python generators: easy to implement the logic of yield/resume. We use generator send() and yield to alternate between tasks and the scheduler. We wrap them in a Task class to manage state and allow nested yields (trampolining).
- System call objects for events: We define specific events tasks can yield to trigger actions like spawning tasks or waiting on I/O. This makes the yield interface extensible – we can add new capabilities by introducing new SystemCall classes without changing the core scheduler loop much.
- Trampolining with explicit stack: This is included to allow tasks to call sub-coroutines cleanly. It adds a bit of complexity to Task.run but keeps scheduler logic clean (the scheduler always sees top-level yields only). This decision is justified by the need for clean code structure in tasks (avoiding monolithic giant coroutines because of the lack of yield-from). Since we target at least Python 3.7+ for Tiffany (likely, given the modern context), we could use yield from instead of manual trampolining. Indeed, PEP 380 (yield from) provides a language-level way to delegate to subgenerators, which would make our Task.run simpler (we wouldn’t need to manually push/pop the stack). However, using yield from means our coroutines would use that syntax directly in their code. If we have control over writing all coroutine functions, this is fine. If we prefer to support even older style or want to see how it works under the hood, trampolining is educational. Given that yield-from is well supported, an alternative design is: whenever a generator yields another generator, we could use yield from to drive it. But since our scheduler is explicitly managing yields, we likely stick to our manual approach.
- Minimal set of system calls: We will implement only those needed. For example, if Tiffany’s tasks mostly revolve around networking and spawning, we implement NewTask, ReadWait, WriteWait, and possibly a Sleep. If tasks need to coordinate, WaitTask can be added. We won’t implement features like prioritization or time-slicing (each task runs until it yields). If a task misbehaves (never yields), it will block the system – this is acceptable given our controlled use cases, but we’ll document that tasks must be written to yield appropriately.
- Integration with actual I/O: We assume the environment allows non-blocking I/O operations. For example, for network sockets we will set them to non-blocking mode (Beazley’s example uses socket.accept() after waiting, which is fine because the accept will not block when select indicates readiness). For file reading or other ops, we may need similar handling. Python’s select works on sockets and some other file types on Unix. If we need cross-platform support or more complex monitoring, we could use the selectors module, which abstracts select/poll/epoll differences. But to keep it minimal, select is okay if our fd sets are small (there is an FD limit on select, but likely not an issue at Tiffany’s scale).
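For comparison, the PEP 380 alternative mentioned above can be seen in a toy sketch (hypothetical accept_like/server_like names, no scheduler involved): with yield from, a sub-coroutine returns its result via return, while its yielded events still bubble up to whoever drives the outer generator.

```python
# With PEP 380 delegation, a sub-coroutine "returns" via `return`,
# and the caller delegates with `yield from` -- no manual stack needed.
def accept_like(n):
    yield ('ReadWait', n)   # an event still surfaces to the driver/scheduler
    return n * 2            # delivered to the `yield from` expression

def server_like():
    result = yield from accept_like(21)  # delegation: events pass through
    yield ('Result', result)

gen = server_like()
events = [next(gen)]            # first event comes from accept_like
events.append(gen.send(None))   # resume; accept_like returns, server yields result
```

The driver never sees the nesting: the ReadWait event from accept_like and the final result from server_like arrive through the same generator, which is exactly what the manual trampoline emulates.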
The design above gives us a clear blueprint to start coding. We have focused on the essential mechanisms: task switching, waiting for events, and maintaining state. We have also discussed how to scale it down or up depending on needs (e.g., what happens in Rust, what features to omit for simplicity). With this design, the next step is to implement the Scheduler and Task classes in code, define the necessary SystemCall subclasses, and test the system with some example tasks (like a simple server or a ping-pong between tasks) to ensure it behaves as expected. The sequence diagrams provided illustrate the flow of control between tasks and the scheduler, which we will use as a reference for implementation correctness.
Tiffany System Diagrams
This section contains visual documentation and sequence diagrams for:
- Coroutine scheduler
- WAL/PAL replay and dataflow
- ReAct loop and skill execution
- Task lifecycle and fault handling
All diagrams are written in Mermaid and versioned alongside the system ADRs and concepts.
Scheduler Component Diagrams
Diagram D1 – Basic Task Switching (ReadyQueue only)
sequenceDiagram
participant Scheduler
participant ReadyQueue as FIFO
actor TaskA
actor TaskB
Scheduler->>ReadyQueue: dequeue TaskA
ReadyQueue-->>Scheduler: TaskA
Scheduler->>TaskA: resume()
TaskA-->>Scheduler: yield (None)
Scheduler->>ReadyQueue: enqueue(TaskA)
Scheduler->>ReadyQueue: dequeue TaskB
ReadyQueue-->>Scheduler: TaskB
Scheduler->>TaskB: resume()
TaskB-->>Scheduler: yield (None)
Scheduler->>ReadyQueue: enqueue(TaskB)
Diagram D2 – Nested Coroutines (Trampolining)
sequenceDiagram
participant CallStack as CallStack_LIFO
actor TaskMain
actor SubCoro
TaskMain-->>CallStack: push(TaskMain.gen)
TaskMain->>SubCoro: start sub-coroutine
SubCoro-->>CallStack: yield value
CallStack->>TaskMain: pop parent
TaskMain->>TaskMain: resume with value
Diagram D3 – I/O Blocking and Wakeup
sequenceDiagram
participant Scheduler
participant WaitMap as EventMap
participant ReadyQueue
actor TaskIO
TaskIO-->>Scheduler: yield ReadWait(fd=7)
Scheduler->>WaitMap: record TaskIO blocked on fd=7
Note right of WaitMap: fd=7 becomes readable
WaitMap->>Scheduler: notify fd=7 ready
Scheduler->>ReadyQueue: enqueue(TaskIO)
ReadyQueue-->>Scheduler: TaskIO ready to run
Diagram D4 – Sleep Timer
sequenceDiagram
participant Scheduler
participant WaitMap as SleepMap
participant ReadyQueue
actor TaskSleep
TaskSleep-->>Scheduler: yield Sleep(3s)
Scheduler->>WaitMap: record TaskSleep with wake_time = now + 3s
Note right of WaitMap: 3 seconds pass
WaitMap->>Scheduler: wake TaskSleep
Scheduler->>ReadyQueue: enqueue(TaskSleep)
Diagram D5 – Join / WaitTask
sequenceDiagram
participant Scheduler
participant WaitMap as JoinMap
actor TaskA as Task_Parent
actor TaskB as Task_Child
TaskA-->>Scheduler: yield NewTask(TaskB)
Scheduler->>TaskA: resume with TaskB.tid
TaskA-->>Scheduler: yield WaitTask(TaskB)
Scheduler->>WaitMap: record TaskA waiting on TaskB
Note right of TaskB: TaskB runs and completes
TaskB-->>Scheduler: StopIteration
Scheduler->>WaitMap: notify TaskA to resume
✅ Agent Task Graph (Per Agent / DAG View)
Type: graph TD (or flowchart TD). Purpose: Visualize how one FAR agent’s current task decomposes into subtasks, dependencies, joins, etc.
graph TD
RootTask[Plan Story Code]
Sub1[Setup Repo]
Sub2[Create main.rs]
Sub3[Implement main_fn]
Sub4[Add Tests]
Join[WaitTask: Finalize]
RootTask --> Sub1
RootTask --> Sub2
Sub2 --> Sub3
Sub3 --> Sub4
Sub1 --> Join
Sub4 --> Join
✅ FAR Cluster View: Distributed Agents & Scheduler
Type: graph LR (or flowchart LR). Purpose: Show how agents across blades interact with a central orchestrator and local schedulers.
flowchart LR
Orchestrator[[Central Scheduler]]
subgraph Blade A
A1[Agent 001\n - Builder]
SchedA[Local Scheduler]
end
subgraph Blade B
A2[Agent 002\n - Tester]
SchedB[Local Scheduler]
end
Orchestrator --> SchedA --> A1
Orchestrator --> SchedB --> A2
✅ Skill Invocation Trace (ReAct-aware timeline)
Purpose: Show how a ReAct loop invokes skills with context, yields, retries, and memory access
sequenceDiagram
actor ReActLoop
participant SkillManager
participant SkillEdit
participant MemoryStore
participant WAL
ReActLoop->>SkillManager: invoke("Edit main.rs")
SkillManager->>SkillEdit: run()
SkillEdit->>MemoryStore: lookup(path="main.rs")
MemoryStore-->>SkillEdit: file AST
SkillEdit-->>WAL: write(PatchGenerated)
SkillEdit-->>SkillManager: return Patch
SkillManager-->>ReActLoop: Patch ready
✅ Agent Memory Structure (Graph + Vector)
Purpose: Show semantic and episodic memory layers, link to disk/durable store
graph TD
MemRoot[Memory: Agent]
Sem[Semantic Graph]
Episodic[Task Log / WAL]
Vectors[VectorStore - code, plans]
GraphDB[(Neo4j)]
VecDB[(Qdrant/Faiss)]
MemRoot --> Sem --> GraphDB
MemRoot --> Episodic --> WAL
MemRoot --> Vectors --> VecDB
✅ Coroutines Timeline (Concurrent Task Steps)
Purpose: Show wall-clock execution of several tasks with yield/resume gaps
gantt
title Task Concurrency Timeline
dateFormat HH:mm:ss
section Task_Main
run :active, 00:00:00, 5s
sleep : 00:00:05, 3s
resume : 00:00:08, 4s
section Task_Sub
start :active, 00:00:02, 3s
block_io : 00:00:05, 4s
finish : 00:00:09, 2s
✅ System Call Map
Purpose: Visualize all supported system calls and their dispatch targets (WAL, IO, Scheduler)
graph LR
YieldNewTask --> Scheduler
YieldReadWait --> IOSelector
YieldSleep --> TimerWheel
YieldGetTid --> Scheduler
YieldMemoryQuery --> MemoryStore
YieldLog --> PAL
✅ Task WAL Replayer Flow
Purpose: Shows how the replay engine reconstructs a task from the WAL stream, rehydrating state and optionally stepping through each yield point.
flowchart TD
WAL[WAL Stream: task_187.log] --> Parser
Parser --> EventQueue
EventQueue --> Scheduler
Scheduler --> TaskReplayer
TaskReplayer --> CallStack
TaskReplayer --> PAL[Replay Mode: emit PAL ghost events]
TaskReplayer --> Memory[Optional Memory Patching]
style WAL fill:#dff,stroke:#0af
style PAL fill:#eee,stroke:#f08
style Memory fill:#f9f,stroke:#808
✅ Panic Propagation via Wait/Join Graph
Purpose: If a child task panics, show how the error might propagate to parent or dependents if not isolated.
graph TD
A[Task_Main\n - tid 101]
B[Task_Git\n - tid 202]
C[Task_Edit\n - tid 203]
D[Task_Tests\n - tid 204]
A -->|JoinTask| B
A -->|JoinTask| C
C -->|JoinTask| D
D -.->|panic| C
C -.->|propagate panic| A
style D fill:#faa,stroke:#800
style C fill:#ffe,stroke:#aa0
✅ Skill Invocation Heatmap
Purpose: Visualize which system tools/skills are being used most frequently, based on PAL telemetry stream.
graph LR
Skill_Edit["📝 Edit File"]:::hot
Skill_Grep["🔍 Grep"]:::warm
Skill_Test["✅ Run Tests"]:::hot
Skill_Git["🌿 Git Ops"]:::warm
Skill_Plan["🧠 Plan/Refactor"]:::cold
Skill_Fmt["🧹 Format Code"]:::cold
style Skill_Edit fill:#f88,stroke:#800
style Skill_Test fill:#f88,stroke:#800
style Skill_Grep fill:#fb6,stroke:#aa0
style Skill_Git fill:#fb6,stroke:#aa0
style Skill_Plan fill:#ccf,stroke:#44f
style Skill_Fmt fill:#ccf,stroke:#44f
Tiffany ADR Index
This document tracks relevant Architectural Decision Records (ADRs) for the Tiffany agentic runtime project. ADRs are numbered sequentially and grouped by functional domain.
🗂️ Summary of Additional ADR Recommendations:
| ADR | Description | Summary | Importance |
|---|---|---|---|
| 1 | Choice of Rust Version | Use the latest stable Rust version for performance, safety, and ecosystem benefits. | High |
| 2 | Doc book Policy | Documentation standards, structure, and tooling for high-quality docs. | High |
| 3 | Task Scheduler Model | Coroutine-first, cooperative scheduling with stateful yield and resume semantics. | High |
| 4 | Agent Loop and ReAct Design | Agent lifecycle: reasoning, acting, yielding, and tool interaction. | High |
| 5 | Virtual Canvas Git Strategy | Code change tracking, patch application, and micro-commits. | Medium |
| 6 | WAL Schema and Replay Policy | Write-ahead log format, recovery semantics, and log compaction. | High |
| 7 | Process Activity Log (PAL) | Real-time activity tracking for agent tasks, complementing the WAL. | Medium |
| 8 | Persistent Agent Memory Strategy | Neo4j semantic memory, graph structure, and retrieval policies for agent continuity. | High |
| 9 | Code Structure Graph and Symbol Analysis | Graph-based code indexing and semantic AST analysis into Neo4j. | Medium |
| 10 | Task Plan Timeline and Execution Metadata | Structure for task execution metadata, plan history, and decision lineage. | Medium |
| 11 | Plugin and MCP Tooling Architecture | Integration of internal and external tools securely and generically. | High |
| 12 | Agent Skill System (Future Plan) | Representation of agent capabilities as composable skills. | Medium |
| 13 | Firecracker MicroVM Integration | VM lifecycle, execution isolation, and resource constraints. | High |
| 14 | Command Execution and Safety Policy | Shell execution flow, trust boundaries, and --yolo mode considerations. | High |
| 15 | REST/gRPC API Design | Interface contract for external task submission, result retrieval, and metadata queries. | High |
| 16 | Filesystem Socket Protocol for CLI | Local interaction protocol for invoking agent actions via socket. | Medium |
| 17 | Kubernetes Operator and CRD Design | Custom resource definitions for managing agent lifecycle and orchestration. | Medium |
| 18 | Metrics Policy and Instrumentation Plan | Prometheus metrics structure, naming conventions, and dashboard philosophy. | Medium |
| 19 | Logging Strategy and Span Hierarchy | Use of tracing, span lifecycles, and log level defaults. | Medium |
| 20 | Versioning and Release Policy | Semantic versioning, LTS channels, and changelog protocol. | Medium |
| 21 | Contributor Roles and Governance | Roles, responsibilities, PR review flow, and escalation path. | Medium |
| 22 | Backup, Disaster Recovery, Failover | Policies for data recovery, backups, failover, and explicit RTO/RPO. | Critical |
| 23 | Secrets and Credentials Management | Handling of secrets, keys, tokens, integration with vaults, and rotation policies. | Critical |
| 24 | Authentication and Authorization | Secure authentication, RBAC, OAuth/OIDC, mTLS, and trust boundaries. | Critical |
| 25 | Dependency Management and Update Policy | Management and update of dependencies, vulnerability scanning, and update policies. | High |
| 26 | Performance and Scalability Strategy | Performance benchmarks, scalability tests, and optimization strategies. | High |
| 27 | Localization and Internationalization | Handling of i18n, Unicode support, locale-aware formatting, and error handling. | Medium |
| 28 | Compliance, Auditing, Regulatory Considerations | Compliance with privacy regulations, auditing, log retention, and reporting mechanisms. | High |
| 29 | Data Retention and Privacy | Data retention policies, lifecycle management, and privacy guarantees. | High |
| 30 | Cost Management and Budgeting | Cost monitoring, cloud usage, budgeting, and cost optimization strategies. | Medium |
| 31 | Accessibility and Usability Guidelines | Ensuring interfaces are accessible, usable, and WCAG compliant. | Medium |
Choice of Rust Version
- ADR-0001: Choice of Rust Version
Decision to use the latest stable Rust version for performance, safety, and ecosystem benefits.
Doc book Policy
- ADR-0002: Doc book Policy
Documentation standards, structure, and tooling for maintaining high-quality project documentation.
🧠 Agent Architecture
- ADR-0003: Task Scheduler Model
  Coroutine-first, cooperative scheduling with stateful yield and resume semantics.
- ADR-0004: Agent Loop and ReAct Design
  Lifecycle of an agent: reasoning, acting, yielding, and interacting with tools.
- ADR-0005: Virtual Canvas Git Strategy
  Strategy for code change tracking, patch application, and micro-commits.
📦 Storage & Durability
- ADR-0006: WAL Schema and Replay Policy
  Format of the write-ahead log, recovery semantics, and log compaction strategy.
- ADR-0007: Process Activity Log (PAL)
  Real-time activity tracking for agent tasks, complementing the WAL.
- ADR-0008: Persistent Agent Memory Strategy
  Design of Neo4j semantic memory, graph structure, and retrieval policies, emphasizing goals, prompts, plan recall, and continuity between sessions.
- ADR-0009: Code Structure Graph and Symbol Analysis
  Graph-based code indexing and semantic AST analysis into Neo4j, enabling precise symbol tracking and refactoring intelligence.
- ADR-0010: Task Plan Timeline and Execution Metadata
  Structure for task execution metadata, including plan history and decision lineage.
🧩 Modularity & Extensibility
- ADR-0011: Plugin and MCP Tooling Architecture
  How Tiffany integrates internal and external tools securely and generically.
- ADR-0012: Agent Skill System (Future Plan)
  Representation of agent capabilities as composable skills.
🔐 Runtime & Execution
- ADR-0013: Firecracker MicroVM Integration
  VM lifecycle, execution isolation, and resource constraints.
- ADR-0014: Command Execution and Safety Policy
  Shell execution flow, trust boundaries, and `--yolo` mode considerations.
📡 Interfacing & I/O
- ADR-0015: REST/gRPC API Design
  Interface contract for external task submission, result retrieval, and metadata queries.
- ADR-0016: Filesystem Socket Protocol for CLI
  Local interaction protocol for invoking agent actions via socket.
- ADR-0017: Kubernetes Operator and CRD Design
  Custom resource definitions for managing agent lifecycle and task orchestration.
🧪 Observability
- ADR-0018: Metrics Policy and Instrumentation Plan
  Prometheus metrics structure, naming conventions, and dashboard philosophy.
- ADR-0019: Logging Strategy and Span Hierarchy
  Use of `tracing`, span lifecycles, and log level defaults.
🧭 Governance & Open Source
- ADR-0020: Versioning and Release Policy
  Semantic versioning, LTS channels, and changelog protocol.
- ADR-0021: Contributor Roles and Governance
  Roles, responsibilities, PR review flow, and escalation path.
22. Backup, Disaster Recovery, and Failover Strategy
- Policies around data recovery, backups, and failover mechanisms.
- Clearly defines how agent memory, logs, and metadata are backed up.
- Explicit recovery process, RTO (Recovery Time Objective), and RPO (Recovery Point Objective).
23. Secrets and Credentials Management
- Explicit handling of secrets, keys, tokens, and credentials.
- Integration with secure vaults (e.g., HashiCorp Vault, AWS Secrets Manager, Kubernetes Secrets).
- Auditing access and rotation policies.
24. Authentication and Authorization
- How FAR agents, developers, and other components authenticate securely.
- Role-Based Access Control (RBAC), OAuth/OIDC integration, or mTLS approaches.
- Explicitly defined trust boundaries and security contexts.
25. Dependency Management and Update Policy
- Management and update processes for third-party libraries, Rust crates, Kubernetes components, and container images.
- Vulnerability scanning and patching procedures.
- Policies around dependency updates, deprecations, and removals.
26. Performance and Scalability Strategy
- Detailed performance benchmarks, scalability tests, and capacity planning.
- Horizontal vs. vertical scaling policies.
- Strategies for proactive performance optimization and reactive capacity adjustments.
27. Localization and Internationalization (i18n) Policy
- How the system will handle internationalization, localization, and multilingual capabilities.
- Consideration of Unicode support, locale-aware formatting, and error handling.
28. Compliance, Auditing, and Regulatory Considerations
- Compliance with privacy regulations (GDPR, CCPA), standards (SOC 2, ISO 27001).
- Auditing policies, log retention strategies, and compliance reporting mechanisms.
29. Data Retention and Privacy
- Explicit data retention policies, data lifecycle management, and privacy guarantees.
- Handling and protection of personally identifiable information (PII), if relevant.
30. Cost Management and Budgeting
- Explicit policies around cost monitoring, cloud usage, and resource budgeting.
- Alerting on cost overruns, proactive cost management strategies, and cost optimization.
31. Accessibility and Usability Guidelines
- Ensuring the interfaces (CLI, UI, documentation) are accessible and usable.
- Compliance with WCAG accessibility standards.
ADR-0003: Task Scheduler Model
Status
Accepted
Context
Tiffany operates as a coroutine-oriented, agentic runtime designed for reasoning, acting, and tool orchestration. Core to this architecture is a cooperative task scheduler responsible for managing:
- Long-running agent workflows
- Step-wise ReAct loops
- Concurrent tool invocations
- Cancelable task execution
As of Rust 2024, first-class coroutines provide an ideal primitive for implementing these requirements with clarity and zero-cost abstraction.
This ADR defines our task scheduling model, the task lifecycle states, the yield/resume semantics, and how the system will structure and manage cooperative multitasking.
Decision
We will implement a coroutine-first, cooperative task scheduler with the following characteristics:
🧵 Task Type
- All tasks will be represented by a unified trait:
trait AgentTask {
    fn poll(&mut self, ctx: &mut TaskContext) -> TaskState;
}
📐 Task States
Tasks can exist in the following explicit states:
- `Ready` – enqueued for execution
- `Running` – currently executing
- `Waiting` – yielded for tool response or LLM
- `Completed` – finished with result or error
- `Canceled` – forcibly stopped
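A minimal sketch of this state machine and a task that exercises it, assuming a stub `TaskContext` (the real context type is defined by the scheduler):

```rust
/// Explicit task states from the ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskState {
    Ready,
    Running,
    Waiting,
    Completed,
    Canceled,
}

/// Stand-in for the scheduler-provided context.
struct TaskContext;

trait AgentTask {
    fn poll(&mut self, ctx: &mut TaskContext) -> TaskState;
}

/// A toy task that yields (Waiting) once, then completes on the next poll.
struct TwoStepTask {
    polled: bool,
}

impl AgentTask for TwoStepTask {
    fn poll(&mut self, _ctx: &mut TaskContext) -> TaskState {
        if !self.polled {
            self.polled = true;
            TaskState::Waiting // e.g. blocked on an LLM call
        } else {
            TaskState::Completed
        }
    }
}

fn main() {
    let mut ctx = TaskContext;
    let mut task = TwoStepTask { polled: false };
    assert_eq!(task.poll(&mut ctx), TaskState::Waiting);
    assert_eq!(task.poll(&mut ctx), TaskState::Completed);
}
```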
🔄 Yield/Resume Semantics
Tasks may yield cooperatively during:
- LLM calls
- Tool executions
- User confirmations
- Awaiting subprocess completion
🧰 Runtime Loop
The scheduler will:
- Poll each `Ready` task
- Route yielded work (e.g. to the LLM executor or tool manager)
- Queue the task back when its dependency resolves
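The poll loop above can be sketched with a simple FIFO queue. `CountdownTask` and the re-queue-on-`Waiting` policy are illustrative stand-ins for routing work to the LLM executor or tool manager:

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum TaskState {
    Waiting,
    Completed,
}

trait AgentTask {
    fn poll(&mut self) -> TaskState;
}

/// Toy task that yields a fixed number of times before completing.
struct CountdownTask {
    yields_left: u32,
}

impl AgentTask for CountdownTask {
    fn poll(&mut self) -> TaskState {
        if self.yields_left > 0 {
            self.yields_left -= 1;
            TaskState::Waiting // would be routed to a tool or LLM here
        } else {
            TaskState::Completed
        }
    }
}

/// FIFO poll loop: re-queue tasks while Waiting (standing in for
/// "queue back when dependency resolves"), count completions.
fn run_until_idle(mut ready: VecDeque<Box<dyn AgentTask>>) -> usize {
    let mut completed = 0;
    while let Some(mut task) = ready.pop_front() {
        match task.poll() {
            TaskState::Waiting => ready.push_back(task),
            TaskState::Completed => completed += 1,
        }
    }
    completed
}

fn main() {
    let tasks: VecDeque<Box<dyn AgentTask>> = VecDeque::from(vec![
        Box::new(CountdownTask { yields_left: 2 }) as Box<dyn AgentTask>,
        Box::new(CountdownTask { yields_left: 1 }),
    ]);
    println!("completed: {}", run_until_idle(tasks));
}
```

A pluggable policy (priority, dependency-aware) would swap the `VecDeque` for a different queue type behind the same loop.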
🧭 Goals
- Deterministic, testable behavior
- Serializable/resumable task state
- Decoupled from `tokio::spawn` and native threads
- Pluggable scheduling policy (FIFO, priority, dependency-aware)
Rationale
🔁 Cooperative vs Preemptive
Tiffany requires transparent control over task transitions. Preemptive systems (e.g. thread pools) make it difficult to audit agent state or model planning steps. Cooperative coroutines, by contrast, allow us to:
- Yield at semantic boundaries
- Inject logs and metrics at every step
- Serialize/resume entire task graphs
🧠 Agent Design Requires Suspended Thought
An agent might:
let plan = yield plan_with_llm("Build test harness");
for step in plan.steps {
    yield apply_code_diff(step);
    yield confirm_with_user(step);
}
This structure is naturally represented with a coroutine and state machine — not an async future.
🧪 Testability
We can model agent execution using deterministic stepping:
let mut scheduler = TestScheduler::new();
scheduler.inject_mock_tool("ls", "result");
scheduler.step_until_idle();
assert_eq!(scheduler.task_state(task_id), TaskState::Completed);
This level of control is difficult in actor or spawn-based models.
📦 Integration Simplicity
The scheduler serves as glue between:
- LLM router
- Tool executor
- WAL
- Canvas
Having a single poll-loop mediator makes integration simpler and easier to visualize.
Consequences
- Adds internal coroutine scheduler as a first-class subsystem
- Task implementations will need to support resumable `poll()`-style execution
- `Executor`, `LLM`, and `Tool` interfaces will interact with the scheduler through message passing / callbacks
- CI tests will include end-to-end scheduling tests using mocked yield points
Alternatives Considered
- Tokio TaskPool: Too opaque for agentic step control; no built-in yield
- Actor Model (e.g., `actix`): Good for I/O, but overkill for structured flows
- `futures`-based step machines: Verbose, brittle, and not coroutine-native
Related Documents
Adopted
This ADR is accepted as of June 2025. All internal workflows that require suspendable agent behavior will be modeled as AgentTasks and scheduled cooperatively.
Maintainers: @casibbald, @microscaler-team
ADR-0006: WAL Schema and Replay Policy
Status
Accepted
Context
The Write-Ahead Log (WAL) in Tiffany provides durable, append-only storage for agent activity, execution steps, and Git-integrated state transitions. It is a core component in ensuring fault tolerance, crash recovery, and task auditability.
As the Tiffany agent runtime handles multiple concurrent tasks, LLM responses, tool invocations, and virtual canvas operations, the WAL must encode enough structured information to allow:
- Resuming interrupted tasks
- Replaying task progress deterministically
- Cross-checking Git commit state and canvas deltas
This ADR defines the structure, persistence policy, replay semantics, and integrity guarantees of the WAL.
Decision
We adopt a structured, JSON-encoded, append-only WAL system with the following characteristics:
WAL File Format
- File-based, append-only per task context (e.g. `wal/<uuid>.wal`)
- Each line is a valid JSON object (`serde_json::to_writer`)
- Log entries are versioned via a top-level `"v"` field
WAL Entry Types
{
"v": 1,
"ts": "2025-06-27T15:48:00Z",
"type": "InstructionStart",
"task_id": "uuid",
"instruction": "Add validation to model.rs"
}
Supported entry types:
- `InstructionStart` – original user input
- `LLMPlan` – response from LLM with plan steps
- `StepStart` – beginning of a plan step
- `ToolResult` – tool output or error
- `PatchPreview` – virtual diff preview
- `Commit` – Git commit metadata (SHA, message)
- `TaskComplete` – signal of task end (with success/failure)
- `ShutdownMarker` – graceful termination marker
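A dependency-free sketch of emitting one versioned WAL line; the real writer uses `serde_json::to_writer` as noted above, and the field layout follows the example entry. `wal_line` is a hypothetical helper:

```rust
/// Build a single versioned, line-oriented WAL entry by hand with format!
/// (illustrative only; production code would serialize a typed struct).
fn wal_line(entry_type: &str, task_id: &str, ts: &str) -> String {
    format!(
        r#"{{"v":1,"ts":"{}","type":"{}","task_id":"{}"}}"#,
        ts, entry_type, task_id
    )
}

fn main() {
    let line = wal_line("InstructionStart", "uuid-123", "2025-06-27T15:48:00Z");
    // Each line is a self-contained JSON object with a top-level "v" field.
    assert!(line.starts_with(r#"{"v":1"#));
    println!("{}", line);
}
```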
Storage Model
- Each agent task has a dedicated WAL file
- Optionally streamed to external durable store (e.g. S3, versioned blob)
- Index is built in-memory on load for efficient lookup
Flush Policy
- WAL entries are fsynced after each write
- Entries are buffered briefly for I/O optimization, but never cached in-memory only
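The fsync-after-write contract can be sketched with only the standard library; the path, entry contents, and `append_durable` name are illustrative:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append one WAL line and fsync before returning, so the entry is
/// durable (never cached in memory only) once this call succeeds.
fn append_durable(path: &Path, line: &str) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(line.as_bytes())?;
    f.write_all(b"\n")?;
    f.sync_all() // fsync: data and metadata reach stable storage
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("tiffany_wal_demo.wal");
    append_durable(&path, r#"{"v":1,"type":"StepStart"}"#)?;
    println!("appended to {}", path.display());
    Ok(())
}
```

Brief buffering for I/O optimization would batch `write_all` calls but still end each batch with `sync_all` before acknowledging.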
Replay Semantics
On startup or recovery, the WAL replay engine:
- Reads all entries into a memory index
- Validates log ordering and checks for `TaskComplete` or `ShutdownMarker`
- If incomplete, reconstructs the TaskState:
- Current plan
- Last confirmed step index
- Last Git SHA
- Virtual canvas snapshot (if recoverable)
Recovery Modes
- `resume`: continue from last known state
- `inspect`: dry-run and summarize partial progress
- `replay`: re-run steps for reproducibility / audit
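The completeness check behind these modes can be sketched as a scan for a closing marker; `classify` and `Recovery` are hypothetical names, not the real replay engine API:

```rust
/// Outcome of the replay engine's first pass over a task's WAL.
#[derive(Debug, PartialEq)]
enum Recovery {
    /// No closing marker found: rebuild TaskState and resume.
    Resume,
    /// Task closed cleanly: nothing to recover.
    Verified,
}

/// Scan entry types in log order and decide whether the task is closed.
fn classify(entry_types: &[&str]) -> Recovery {
    let closed = entry_types
        .iter()
        .any(|t| *t == "TaskComplete" || *t == "ShutdownMarker");
    if closed { Recovery::Verified } else { Recovery::Resume }
}

fn main() {
    assert_eq!(
        classify(&["InstructionStart", "LLMPlan", "StepStart"]),
        Recovery::Resume
    );
    assert_eq!(
        classify(&["InstructionStart", "TaskComplete"]),
        Recovery::Verified
    );
}
```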
WAL Lifecycle Sequence Diagram
sequenceDiagram
participant User
participant Agent
participant WAL
participant LLM
participant Tool
participant Canvas
participant Git
participant ReplayEngine
User->>Agent: Submit task ("Add error handling")
Agent->>WAL: Write InstructionStart
Agent->>LLM: Plan steps
LLM-->>Agent: Return plan
Agent->>WAL: Write LLMPlan
loop For each step
Agent->>WAL: Write StepStart
Agent->>Tool: Execute action
Tool-->>Agent: Output result
Agent->>WAL: Write ToolResult
Agent->>Canvas: Apply virtual patch
Canvas-->>Agent: Preview diff
Agent->>WAL: Write PatchPreview
Agent->>Git: Commit via git2-rs
Git-->>Agent: Commit SHA
Agent->>WAL: Write Commit (with SHA, step index)
end
Agent->>WAL: Write TaskComplete
Agent-->>User: Respond with summary
Note over ReplayEngine: On Recovery or Restart
ReplayEngine->>WAL: Read task WAL file
WAL-->>ReplayEngine: JSON entries
ReplayEngine->>ReplayEngine: Rebuild plan, canvas, Git context
ReplayEngine->>Agent: Resume at next step or verify TaskComplete
Rationale
🔐 Durability & Crash Safety
- WAL ensures no task progress is lost
- Every instruction, LLM plan, tool call, and Git commit is accounted for
🧪 Reproducibility
- WAL enables deterministic re-execution of tasks
- Allows audit of LLM decisions, tool usage, and user confirmations
🧰 Debug & Forensics
- WAL becomes a source of truth in debugging stuck or incomplete tasks
- WAL diffs explain when/why things diverged from expected flow
⚙️ Integration with Canvas & Git
- Each canvas commit references its corresponding WAL entry
- Reconstructing canvas state on restart uses last committed patch snapshot
Consequences
- WAL writer must be thread-safe and coroutine-aware
- Every `AgentTask` must emit structured WAL entries
- On crash or power loss, the agent must recover cleanly to the last `Commit` or `PatchPreview`
- `cargo test` must include WAL replay + recovery unit tests
Alternatives Considered
- Single journal file for all tasks: Difficult to parallelize, isolate, or replay
- Database-backed log: Overkill, less portable, breaks GitOps principle
- Binary format: More compact, but less debuggable and interoperable than JSON
Related Documents
Adopted
This ADR is accepted as of June 2025. All agent tasks will be tracked through structured, append-only WAL files to ensure traceability, auditability, and deterministic task recovery.
Maintainers: @casibbald, @microscaler-team
ADR-0007: Process Activity Log (PAL) and Real-Time Status Stream
Status
Accepted
Context
While the Write-Ahead Log (WAL) guarantees task durability and auditability in Tiffany, it is not suited for real-time observability. Users and dashboards need visibility into what the system is currently doing — not just what it has already persisted.
To address this, we introduce the Process Activity Log (PAL), a real-time, ephemeral streaming mechanism for broadcasting the active state of each agent task as it progresses through its coroutine lifecycle.
The PAL augments the WAL by offering live introspection into step-level execution, LLM calls, tool activity, patching, and user confirmation — enabling powerful local CLI feedback, remote dashboards, and multi-user observability.
PALs are not commonly implemented in most systems. However, in the context of an agentic runtime with coroutine yield points, explicit steps, and Git-integrated patching, the value of live process state becomes critically important for:
- Progress visibility during long-running operations
- Safe human-in-the-loop oversight
- Remote system administration and dashboarding
- Collaborative multi-user environments
Decision
Tiffany will maintain a separate PAL channel per running task, broadcasting structured ActivityEvent objects to all connected consumers (CLI, dashboards, logs).
PAL Characteristics
- Non-durable, ephemeral (not fsynced or written to disk)
- Optimized for streaming, not audit
- Broadcast to:
- Local CLI (via Unix domain socket)
- Remote UI (via WebSocket/gRPC stream)
- Logging adapter (optional)
- Pub/Sub (e.g., Redis Streams, NATS, Fluvio, etc.)
Event Schema
{
"ts": "2025-06-27T16:12:00Z",
"task_id": "uuid",
"stage": "ExecutingTool",
"message": "Running shell: cargo test"
}
PAL Event Types
- `ReceivedInstruction`
- `WaitingForLLM`
- `ToolExecutionStart`
- `ToolExecutionComplete`
- `PreviewingPatch`
- `AwaitingUserApproval`
- `CommittingPatch`
- `Idle`
- `Completed`
PAL Channel Infrastructure
- Each Tiffany runtime instance will maintain an async broadcast stream (`tokio::broadcast`) per task
- The CLI and UI connect as subscribers, filtered by task ID
- Internal PAL server optionally bridges these into REST, WebSocket, or external pub/sub systems
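The per-task fan-out contract can be sketched with std channels; the real runtime uses `tokio::broadcast`, and `PalBus` is a hypothetical name:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Toy fan-out bus: one sender per subscriber, best-effort delivery.
struct PalBus {
    subscribers: Vec<Sender<String>>,
}

impl PalBus {
    fn new() -> Self {
        PalBus { subscribers: Vec::new() }
    }

    /// Register a consumer (CLI, UI, logging adapter) and hand back its end.
    fn subscribe(&mut self) -> Receiver<String> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Broadcast an event to all live subscribers; disconnected ones are
    /// dropped silently, matching PAL's ephemeral, non-durable contract.
    fn emit(&mut self, event: &str) {
        self.subscribers
            .retain(|tx| tx.send(event.to_string()).is_ok());
    }
}

fn main() {
    let mut pal = PalBus::new();
    let cli = pal.subscribe();
    let ui = pal.subscribe();
    pal.emit("ToolExecutionStart");
    assert_eq!(cli.recv().unwrap(), "ToolExecutionStart");
    assert_eq!(ui.recv().unwrap(), "ToolExecutionStart");
}
```

`tokio::broadcast` additionally bounds the channel, which is where the backpressure requirement from the Consequences section would be enforced.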
Sequence Diagram: PAL Streaming in Action
sequenceDiagram
participant User
participant CLI
participant UI
participant Agent
participant PAL
participant WAL
participant PubSub
User->>CLI: tinkerbell submit "Add validation to model.rs"
CLI->>Agent: Submit instruction
Agent->>WAL: Log InstructionStart
Agent->>PAL: Emit ReceivedInstruction
PAL-->>CLI: Stream activity
PAL-->>UI: WebSocket push
PAL-->>PubSub: Fan-out event (optional)
Agent->>PAL: Emit WaitingForLLM
Agent->>LLM: Generate plan
LLM-->>Agent: Return plan
Agent->>PAL: Emit ToolExecutionStart ("Running shell tool")
Agent->>Tool: Execute shell script
Tool-->>Agent: Result
Agent->>PAL: Emit ToolExecutionComplete
Agent->>Canvas: Apply patch
Agent->>PAL: Emit PreviewingPatch
Agent->>User: Confirm patch
User-->>Agent: Approve
Agent->>PAL: Emit CommittingPatch
Agent->>Git: Commit
Agent->>PAL: Emit Completed
Agent->>WAL: Log TaskComplete
Rationale
✅ Real-Time Feedback for Humans
Unlike WALs — which are audit-focused and write-after-the-fact — the PAL is designed to describe what the agent is doing right now.
This enables:
- CLI progress bars, spinners, and log overlays
- Remote dashboards that track multiple agents in real time
- Streaming logs with contextual status annotations
- System admins to monitor all active workloads in the FAR runtime
📟 Pub/Sub Observability
PAL emits can be optionally bridged to:
- Redis Streams
- NATS subjects
- Kafka partitions
This enables:
- Multiple UIs to subscribe to live activity
- Alerting and event-driven monitoring pipelines
- Prometheus exporters to map PAL events to gauge metrics
🧩 Complements WAL without Overlapping It
| Feature | WAL (Write-Ahead Log) | PAL (Process Activity Log) |
|---|---|---|
| Format | Append-only, structured JSON | Broadcast JSON, ephemeral |
| Scope | Durable system of record | Live visibility per active task |
| Persistence | Filesystem | Memory / socket / stream |
| Usage | Replay, recovery, audit | CLI/UI visibility, status updates |
| Timing | After the fact | Concurrent with step execution |
Consequences
- Introduces a lightweight PAL layer into each `AgentTask`
- Developers must emit PAL signals at major yield boundaries
- CLI and UI must support PAL subscription and display logic
- Integration tests will include PAL mock subscribers for validation
- PAL memory usage must be bounded (e.g. backpressure if no consumer)
Alternatives Considered
- Extending WAL with ephemeral events – Pollutes durability layer; delays update timing
- Relying on console logs – Not structured, not streamable, not indexable by UI
- Polling task status – Inefficient and fails in multi-user concurrent settings
Related Documents
- ADR-0006: WAL Schema and Replay Policy
- ADR-0004: Agent Loop and ReAct Design
- Tiffany System Architecture
Adopted
This ADR is accepted as of June 2025. All running agent tasks will emit real-time PAL events through an ephemeral pub-sub system to enable responsive CLI feedback, multi-user UI dashboards, and remote introspection.
Maintainers: @casibbald, @microscaler-team
ADR-0019: Logging Strategy and Span Hierarchy
Status
Proposed
Context
Effective logging is vital for the observability, debugging, and operational transparency of autonomous agentic runtimes. Without structured logging and detailed execution tracing, debugging complex distributed operations becomes difficult, leading to increased operational overhead and reduced system reliability.
Common issues in logging for complex systems include:
- Unstructured logs: Difficult to parse or correlate.
- Lack of contextual tracing: Limited ability to trace execution flow or diagnose issues.
- Inconsistent log levels and verbosity: Excessive noise or insufficient detail for meaningful debugging.
To address these, Tiffany requires a rigorous, structured logging strategy that leverages Rust’s advanced tracing framework, clear span hierarchies, structured logs, and consistent log level defaults.
Decision
We adopt the Rust tracing crate for structured logging and detailed tracing within Tiffany FAR infrastructure, explicitly defining span hierarchies, structured log events, and log level policies.
Core Components:
- Structured Logging: Use of `tracing` with structured, JSON-formatted log outputs.
- Detailed Span Hierarchy: Explicit tracing spans for every significant operation (task execution, VM lifecycle events, API calls).
- Consistent Log Levels: Clearly defined log level defaults (ERROR, WARN, INFO, DEBUG, TRACE).
Technical Implementation
1. Structured Logging with Rust tracing
Tiffany will adopt Rust’s tracing crate, providing structured logs for better parseability and integration with modern log management systems.
Log event format (JSON example):
{
"timestamp": "2025-06-28T10:00:00Z",
"level": "INFO",
"target": "tinkerbell::orchestrator",
"message": "Task execution started",
"fields": {
"task_id": "task-12345",
"agent_id": "agent-67890",
"node": "blade-1"
}
}
2. Explicit Span Hierarchy
Spans explicitly represent the hierarchy and nesting of tasks, operations, and system interactions.
Example span hierarchy structure:
Agent Task Execution [trace_id=abc123]
├── Orchestrator Scheduling
│ └── Resource Allocation
├── FAR Agent VM Lifecycle
│ ├── VM Initialization
│ ├── Capability Execution
│ │ └── LLM Interaction
│ └── VM Termination
└── Result Reporting
3. Span Lifecycle Management
Span lifecycles are strictly managed using Rust’s `tracing` primitives (`span!`, `.enter()`, and the returned guard, which exits the span when dropped), clearly delineating entry and exit points:
Rust tracing code example:
use tracing::{span, Level};

fn execute_task(task_id: &str) {
    let task_span = span!(Level::INFO, "task_execution", task_id = task_id);
    let _enter = task_span.enter();
    schedule_task();
    execute_vm_lifecycle();
    report_results();
}

fn execute_vm_lifecycle() {
    let vm_span = span!(Level::DEBUG, "vm_lifecycle");
    let _enter = vm_span.enter();
    vm_initialization();
    capability_execution();
    vm_termination();
}
4. Log Level Defaults
Explicit guidelines on log levels:
| Level | Use-case Example | Retention Policy |
|---|---|---|
| ERROR | Critical failures requiring immediate attention | High priority |
| WARN | Issues potentially impacting normal operation | High-medium priority |
| INFO | General operational events and high-level status | Medium priority |
| DEBUG | Detailed events useful for debugging | Short-term, detailed |
| TRACE | Highly detailed events, rarely enabled in production | Temporary debugging |
Default log levels: INFO in production, DEBUG in staging, TRACE optionally during local development.
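The environment-to-level mapping above can be sketched as a simple lookup; the environment names here are assumptions, not a fixed Tiffany contract:

```rust
/// Default log level per deployment environment, per the policy table:
/// INFO in production, DEBUG in staging, TRACE for local development.
fn default_level(environment: &str) -> &'static str {
    match environment {
        "production" => "INFO",
        "staging" => "DEBUG",
        "local" => "TRACE",
        // Unknown environments fall back to the conservative default.
        _ => "INFO",
    }
}

fn main() {
    println!("production -> {}", default_level("production"));
    println!("staging    -> {}", default_level("staging"));
    println!("local      -> {}", default_level("local"));
}
```

In practice this string would seed a `tracing` filter at startup, with runtime overrides (e.g. an environment variable) taking precedence.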
📊 Diagram: Span Hierarchy Visualization
graph TD
Task[Task Execution - INFO]
Task --> Scheduling[Orchestrator Scheduling - INFO]
Scheduling --> Allocation[Resource Allocation - DEBUG]
Task --> VM[FAR Agent VM Lifecycle - INFO]
VM --> VM_Init[VM Initialization - DEBUG]
VM --> Capability[Capability Execution - DEBUG]
Capability --> LLM_Interaction[LLM Interaction - TRACE]
VM --> VM_Termination[VM Termination - DEBUG]
Task --> Reporting[Result Reporting - INFO]
🔄 Sequence Diagram: Logging and Span Lifecycle
sequenceDiagram
participant Controller as FAR Controller
participant Orchestrator
participant MicroVM as MicroVM (FAR Agent)
participant PAL as PAL/WAL
participant Collector as Log Collector (Elastic/Loki)
Controller->>Collector: Log (INFO): "Task execution started"
Orchestrator->>Collector: Log (DEBUG): "Scheduling task, resource allocation started"
Orchestrator->>MicroVM: Start MicroVM
MicroVM->>Collector: Log (INFO): "VM initialization successful"
MicroVM->>Collector: Log (TRACE): "LLM interaction started"
MicroVM->>Collector: Log (DEBUG): "Capability execution completed"
MicroVM->>Orchestrator: Task completion
Orchestrator->>Collector: Log (INFO): "Result reporting complete"
Controller->>PAL: Record final task state
🎯 Rationale for Chosen Approach
- Structured & Parseable Logs: Enhanced operational visibility, easier integration with log aggregation solutions.
- Detailed Tracing and Debugging: Explicit span hierarchies enable precise debugging and operational insights.
- Clear Log Level Policy: Reduces noise and improves the clarity and signal-to-noise ratio of logs.
🚨 Consequences and Trade-offs
- Complexity of Span Management: Developers must adhere strictly to span creation and lifecycle rules.
- Log Volume and Storage: Potential for large log volumes, especially with DEBUG or TRACE enabled.
- Operational Overhead: Requires monitoring of log volumes and retention policies to manage costs.
✅ Alternatives Considered and Dismissed
- Unstructured Logging: Lack of parseability and reduced debugging capability.
- Using
logcrate: Lacks structured tracing capabilities and span support oftracing. - Third-party proprietary logging tools: Reduced control and higher cost; less customization potential.
📌 Implementation Recommendations:
- Create a dedicated Logging Style Guide document specifying
tracingusage patterns. - Use automated linting to enforce span hierarchy and log level conventions during CI/CD pipelines.
- Periodically review log metrics (volume, retention, levels) to manage operational overhead.
🚀 Industry-Leading Capabilities
This structured logging and tracing strategy significantly enhances Tiffany’s ability to manage operational complexity, rapidly diagnose issues, and maintain transparency, positioning it as a leader in observability and autonomous agentic infrastructure management.
📊 Next Steps:
This ADR awaits final review and acceptance before detailed implementation begins.