Introduction

Tiffany Runtime Model

Tiffany is a coroutine-first agentic runtime built around the idea of deterministic, traceable, yield-driven execution for autonomous system tasks.


Execution Model

  • Coroutine Scheduler: Inspired by David Beazley’s pyos8, all tasks are cooperative generators. There are no threads, only scheduled yields.
  • System Calls: Tasks yield events like ReadWait, Sleep, or SpawnTask to the scheduler.
  • Trampolining: Nested coroutines are managed via a LIFO stack to support function-like composition.
  • Replayability: Tasks emit WAL logs with every yield, enabling complete deterministic replay.
  • Telemetry: Progress is simultaneously emitted to the PAL stream for real-time observability.
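
To make the execution model above concrete, here is a minimal, hypothetical sketch in Python-style pseudocode. The event names mirror the bullets above, but nothing here is Tiffany's actual API; WAL/PAL emission is assumed to happen inside the scheduler, not in the task itself.

from dataclasses import dataclass

# Hypothetical event types mirroring the execution model above (not a real API).
@dataclass
class Sleep:
    seconds: float      # pause this task and let others run

@dataclass
class ReadWait:
    fd: int             # resume when this file descriptor is readable

@dataclass
class SpawnTask:
    coro: object        # generator to schedule as a new cooperative task

def agent_step(control_fd):
    """A task is just a generator; every yield hands a request to the scheduler,
    which appends it to the WAL and streams progress to the PAL before resuming."""
    yield SpawnTask(heartbeat())
    yield Sleep(5.0)
    yield ReadWait(control_fd)

def heartbeat():
    yield Sleep(1.0)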

Agents and Isolation

Each agent:

  • Operates in a self-contained environment (microVM or Apple Container)
  • Loads semantic memory (GraphDB + VectorDB)
  • Executes plans via the ReAct loop
  • Coordinates with orchestrator over mTLS/gRPC

Diagrams

You can find system-level sequence diagrams in the Diagrams section.

Task Scheduler

📌 1. Overview of PyOS (pyos8.py)

PyOS is an educational, minimalistic micro-operating system implemented in Python using generators/coroutines to simulate concurrent tasks. The core features include:

  • Tasks as Generators: Each “task” is a coroutine (yield-based) that encapsulates some ongoing work.
  • Scheduler: A lightweight round-robin scheduler (Scheduler) runs each task until it completes or voluntarily yields control.
  • System Calls (Syscall): Tasks request OS services (sleep, wait, IO) via structured system calls.
  • Event-driven Loop: Scheduler handles tasks based on events (I/O completion, sleep duration, etc.).

📂 2. Detailed Architectural Analysis (with Mermaid)

Here’s the architectural diagram of PyOS:

graph TD
    Scheduler["Scheduler (Event Loop)"]
    
    subgraph Task Management
        TaskQueue["Ready Task Queue"] --> Scheduler
        SleepingQueue["Sleeping Task Queue"]
        WaitingTasks["Waiting Task Queue"]
    end
    
    Task["Task (Coroutine)"] -->|Syscall| Scheduler
    Scheduler -->|Resume Task| Task
    
    Scheduler -->|Sleep Requests| SleepingQueue
    Scheduler -->|IO or Wait Requests| WaitingTasks
    
    Scheduler --> IO["I/O Management (Select-based polling)"]
    
    IO --> Scheduler
    SleepingQueue --> Scheduler
    WaitingTasks --> Scheduler

Core Components:

  • Task: coroutine (yield) wrapped into a Task class. Each task maintains its own coroutine state.
  • Scheduler: event loop managing tasks.
  • Syscall: task requests (sleep, IO, etc.) handled by the scheduler.

🧩 3. Task Management Explained

Tasks in PyOS are managed explicitly through a scheduler:

class Task:
    taskid = 0
    def __init__(self, target):
        Task.taskid += 1
        self.tid = Task.taskid  # Unique task id
        self.target = target    # Coroutine
        self.sendval = None     # Value to send into coroutine
        self.stack = []         # Call stack for nested coroutines

Scheduler maintains multiple queues:

  • ready: Tasks ready to run immediately.
  • sleeping: Tasks scheduled to run after a delay.
  • waiting: Tasks blocked on I/O or other events.

Scheduler main loop:

class Scheduler:
    def mainloop(self):
        while self.taskmap:
            if not self.ready:
                self.iopoll(None)          # nothing runnable: block until an I/O event
            else:
                task = self.ready.popleft()
                try:
                    result = task.run()
                    if isinstance(result, SysCall):
                        result.handle(task, self)   # the syscall decides whether to requeue
                        continue
                except StopIteration:
                    self.exit(task)        # task finished: remove it, wake any waiters
                    continue
                self.schedule(task)        # plain cooperative yield: back on the ready queue

This approach provides:

  • Explicit task control
  • Predictable execution order
  • Clear state management

🚦 4. How PyOS Handles System Calls

Tasks use Syscall to yield structured commands:

class SysCall:
    def handle(self, task, sched):
        pass

class GetTid(SysCall):
    def handle(self, task, sched):
        task.sendval = task.tid
        sched.schedule(task)

  • Tasks yield SysCall objects.
  • The scheduler intercepts the call, executes it, then resumes the task.

Example Task using Syscall:

def foo():
    tid = yield GetTid()
    print(f"My task id is {tid}")

🌟 5. Relevance & Value to Tiffany Daemon

Applying this architecture to Tiffany offers significant advantages:

Advantages for your use-case:

  • Fine-grained control: Tasks explicitly yield for system operations (like awaiting API or LLM responses).
  • Memory-efficient concurrency: Avoid overhead of threads/processes, critical for long-lived agents.
  • Traceable state transitions: Easier debugging, transparency in state and lifecycle management.
  • Predictable & careful execution flow: Aligns with your “one-foot-on-the-ground” cautious approach.

💡 Use Case Scenarios in Tiffany:

  • Agent executing a complex, multi-step workflow (e.g., code edit → git commit → API call → LLM query) as separate cooperative tasks.
  • Tasks handling different interactions (CLI vs API) concurrently yet clearly separated in the scheduler.
  • Easier rollback and checkpointing: Each step clearly delineated by coroutine yield points.

⚙️ 6. Coroutine Considerations in Rust

While the example uses Python generators, Rust offers comparable models: async/await runtimes such as tokio and async-std, and coroutine libraries such as may.

  • Tokio/Async: Standard Rust solution for async tasks and efficient I/O multiplexing.
  • May (Coroutine-focused): Offers coroutine-style programming and lightweight task handling, suitable for fine-grained control.

Rust async example (with Tokio):

use std::time::Duration;

async fn handle_task(task_id: usize) {
    println!("Handling task {}", task_id);
    tokio::time::sleep(Duration::from_secs(1)).await;
}

#[tokio::main]
async fn main() {
    handle_task(1).await;
}

Rust Coroutine-style example (with May):

use may::coroutine;
use std::time::Duration;

fn main() {
    coroutine::scope(|s| {
        s.spawn(|| {
            println!("Task started");
            coroutine::sleep(Duration::from_secs(1));
            println!("Task done");
        });
    });
}

Based on PyOS inspiration, I suggest adopting a hybrid Rust async + coroutine model:

Tiffany Task Scheduler (Conceptual):

  • Define explicit task states and transitions clearly.
  • Tasks (async or coroutine) yielding explicit commands similar to SysCall, allowing predictable scheduling and state management.
  • A scheduler loop managing tasks explicitly, providing clear debug and auditability.

Example of explicit Rust-based scheduler task structure:

use std::path::PathBuf;
use std::time::Duration;

// Conceptual sketch: `Query`, the `Scheduler` struct and its `tasks`/`llm`
// fields, and `try_get_response` are Tiffany-specific placeholders, so this
// shows intended structure rather than compiling code.
enum TaskCommand {
    AwaitLLMResponse(Query),
    AwaitFileIO(PathBuf),
    AwaitUserApproval(String),
}

struct Task {
    id: usize,
    state: TaskState,
}

enum TaskState {
    Running,
    Waiting(TaskCommand),
    Completed,
}

impl Scheduler {
    async fn run(&mut self) {
        loop {
            for task in &mut self.tasks {
                match task.state {
                    TaskState::Waiting(ref cmd) => {
                        match cmd {
                            TaskCommand::AwaitLLMResponse(q) => {
                                // Poll the pending LLM call; resume the task once a response exists.
                                if let Some(_resp) = self.llm.try_get_response(q).await {
                                    task.state = TaskState::Running;
                                }
                            },
                            // Handle AwaitFileIO / AwaitUserApproval similarly
                            _ => {}
                        }
                    },
                    TaskState::Running => {
                        // Advance the task to its next command (its next "yield point")
                    },
                    TaskState::Completed => {}
                }
            }
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    }
}

  • Tasks explicitly yield commands the scheduler handles.
  • Matches your “one-foot-on-the-ground” operational style.

An incremental path toward this design:

  • Prototype a simple Rust scheduler inspired by PyOS (using async Rust).
  • Define explicit commands/tasks for Tiffany (LLM queries, API calls, filesystem operations).
  • Implement a simple coroutine/task model (using tokio plus an explicit state machine).

🚀 Conclusion and Recommendation

Dave Beazley’s PyOS provides a clear and powerful task management blueprint. Its explicit task control, predictable scheduling, and structured system call mechanism align extremely well with your Tiffany vision.

For your Rust-based daemon, adapt these ideas into a hybrid coroutine/async system, managing tasks explicitly for predictability, clarity, and ease of debugging.

Your next steps:

  • Prototype the Rust scheduler inspired by PyOS.
  • Define Tiffany-specific tasks and state transitions.
  • Validate with simple coroutine-style and async tasks.

Design Document: Coroutine-Based Task Scheduler for Tiffany

Introduction and Background

Tiffany requires a lightweight cooperative multitasking runtime to manage concurrent tasks without the complexity of threads or processes. We will build a coroutine-based task scheduler inspired by prior art in Python and other systems. Coroutines are program components that generalize subroutines for non-preemptive multitasking: they allow a function to suspend execution (yield) and resume later, which is ideal for implementing our own scheduler. By using coroutines, we can explicitly control when tasks give up control (cooperative scheduling) and integrate asynchronous I/O handling directly into the runtime. This approach was popularized in David Beazley’s 2009 PyCon tutorial “A Curious Course on Coroutines and Concurrency,” where he constructed a simple “Python Operating System” (PYOS) using generator-based coroutines. Modern frameworks like Python’s asyncio build on similar concepts (an event loop, tasks, and awaiting I/O), but for Tiffany we will design a minimal scheduler focused on our specific needs rather than a full general-purpose OS or framework.
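
To illustrate the underlying mechanism in isolation (plain Python, independent of any scheduler): a generator suspends at each yield, and its caller decides when to resume it and what value to send back in.

def counter(label):
    total = 0
    while True:
        increment = yield total        # suspend here; resuming delivers a value
        total += increment
        print(f"{label}: {total}")

gen = counter("demo")
next(gen)          # run up to the first yield (prime the generator)
gen.send(2)        # resume with 2: prints "demo: 2"
gen.send(3)        # resume with 3: prints "demo: 5"

A scheduler is essentially a loop that performs these next()/send() calls for many generators, deciding which one to resume next.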

Prior Art and Influences

David Beazley’s PYOS: Beazley’s example (known through code like pyos.py series) is a key reference. In his design, a central Scheduler manages a queue of tasks, each task being a coroutine wrapped in a lightweight Task object. Tasks voluntarily yield control, either simply to let others run or to request a “system call” (an operation for the scheduler to perform on behalf of the task). The scheduler loop repeatedly takes the next ready task, runs it until its next yield, and then decides what to do based on the yielded value (resume the task, create new tasks, wait for I/O, etc.). This cooperative model avoids preemptive context switches and threads, making it lightweight and deterministic. Beazley’s system introduced system call objects like GetTid, NewTask, WaitTask, ReadWait, etc., which tasks could yield to request scheduler actions. For example, a task could do yield NewTask(some_coro) to spawn a new coroutine task, or yield ReadWait(file_obj) to signal the scheduler to pause the task until a file descriptor is readable. His final iteration also solved the issue of calling coroutines from within coroutines (nested yields) via a technique called trampolining, which we’ll discuss shortly.

Async IO frameworks: Python’s asyncio (PEP 3156) and similar event-loop systems were motivated by the same goal of handling many tasks with one thread by async/await syntax. They formalize coroutines as first-class objects (native async def coroutines in Python 3) and use an event loop to schedule their execution. Our design will conceptually resemble a simplified asyncio: we’ll have an event loop (the scheduler), tasks (coroutine functions), and awaitable events (our system call yields). However, we will implement just what we need for Tiffany rather than a full library. This means focusing on cooperative scheduling and I/O readiness, and possibly timed delays, without the overhead of unrelated features.
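
For comparison, the asyncio formulation of "a task yields an event and the loop resumes it later" looks like this; our design expresses the same control flow with plain generators and explicit system-call objects instead of async/await.

import asyncio

async def worker(name, delay):
    await asyncio.sleep(delay)          # analogous to `yield Sleep(delay)` in our design
    print(f"{name} resumed after {delay}s")

async def main():
    # Two tasks share one thread; the event loop switches between them
    # at each await point, just as our scheduler switches at each yield.
    await asyncio.gather(worker("a", 0.1), worker("b", 0.2))

asyncio.run(main())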

Coroutines in Rust: An alternative to a Python implementation is using Rust, which by its 2021 edition (and likely 2024 edition improvements) supports async/await for concurrency. Rust’s model is different under the hood – it uses futures and an executor. Each async fn in Rust compiles to a state-machine that implements the Future trait. An executor (runtime) polls these futures and uses a waker mechanism to resume tasks when events (like I/O) are ready. While Rust’s approach is efficient and memory-safe, implementing a custom coroutine scheduler in Rust is more involved than in Python. There is no stable native “yield” for synchronous coroutines in Rust yet (generators are an unstable feature), so one would typically build on the async/await system (futures). In our context, a Rust implementation might involve spawning tasks as futures and using an event loop (like tokio or a custom one) to drive them. This is certainly feasible and would give performance and safety benefits, but it means embracing Rust’s asynchronous programming model (polling futures) rather than the simpler generator interface that Python offers. We will discuss the trade-offs in a later section. For now, we draw inspiration from the Python model (which is easier to prototype) while keeping in mind that the design concepts (task, scheduler, event waiting) are transferable to Rust or other environments.

Requirements and Scope for Tiffany

Tiffany’s scheduler should support concurrent tasks with minimal overhead, focusing on our use cases. We are not building a general OS with process isolation or preemptive multitasking. Instead, the requirements are:

  • Cooperative Multitasking: Tasks yield control explicitly at well-defined points (e.g. when they need to wait or simply to let others run). This avoids complex context switching – the scheduler will only switch tasks when a task yields. Each task runs to its next yield without interruption. This is sufficient for many I/O-bound or event-driven workloads and simpler to implement than preemptive scheduling. It does mean a badly-behaved task that never yields could block others, so we must ensure our tasks are written to yield periodically (especially during long computations or loops).

  • Task Management: Ability to create (spawn) new tasks dynamically. For example, when a network connection comes in, the scheduler can spawn a new task to handle that connection while the main task continues listening. We might also want the ability to terminate tasks or have tasks wait for others to finish, though these are secondary if our use-case can manage without explicit task joining or killing. (If needed, we can implement a WaitTask system call to allow one task to yield until another task completes, and possibly a KillTask to remove a task, similar to Beazley’s design.) At minimum, spawning tasks is required.

  • I/O and Event Waiting: A mechanism to integrate non-blocking I/O. Tasks should be able to wait for external events (file descriptors being readable/writable, timers, etc.) without blocking the whole program. In a typical scenario, a task waiting for network data will yield an event to the scheduler (e.g. “read wait on this socket”). The scheduler can then pause that task and continue running others. Under the hood, the scheduler will use an OS primitive (like select or poll) to check when the socket is ready, and then resume the task. This means our scheduler likely needs to maintain a registry of file descriptors (or other event sources) to tasks waiting on them. Beazley’s example uses select.select on sets of descriptors, which is a straightforward approach we can use as well (since Tiffany might be running in an environment where select/poll is available). If Tiffany’s needs include timers (sleeping for X seconds), we could implement a similar mechanism with timeouts: tasks yield a “sleep until time T” event, and the scheduler keeps track of that, resuming the task after the timeout.

  • Efficiency and Simplicity: Keep the core loop simple and avoid unnecessary features. We don’t need memory protection, true parallelism, or complex prioritization – all tasks run in the same thread and share memory (like async functions do). Context switching is just a function call (resuming a generator), which is very cheap. We will not implement advanced scheduling policies; a simple round-robin or FIFO scheduling of ready tasks is sufficient. Also, error handling can be basic: if a task raises an uncaught exception (aside from the StopIteration that signals a generator’s normal completion), it could terminate that task – the scheduler can catch exceptions from task.run() and decide to drop the task or log it. In a robust system we’d have to propagate errors or allow tasks to handle others’ failures, but for now a simple strategy (like printing error and removing the task) is acceptable.

  • Trampolining (Nested Coroutines): As an optional but highly useful feature, we want tasks to be able to call coroutine-based subroutines without losing the ability to yield. In a naive generator scheduler, if one generator calls another that uses yield, things get tricky – only the top-level generator’s yields are noticed by the scheduler. David Beazley’s trampolining solution was to have the Task object manage an explicit stack of generators. Whenever a running coroutine yields another generator, the scheduler does not treat it as a final yield; instead, the current task pushes its current generator onto a stack and continues executing the yielded sub-generator. This way, the task can dive into nested coroutines and when the sub-coroutine completes or yields a value back, the task pops the stack and resumes the caller. Trampolining is crucial for code modularity – it allows writing coroutine helpers (e.g. an Accept(sock) coroutine that itself yields events) and calling them naturally with result = yield Accept(sock) inside a task. We will include trampolining in our design so that tasks can be structured hierarchically without the scheduler needing special logic for it (the Task object will handle it, making it transparent to the scheduler).

  • Focus on Tiffany’s domain: If Tiffany is, for example, a provisioning system or similar, typical tasks might include networking (listening for requests, handling connections), file I/O, or invoking system commands. We anticipate heavy use of network/socket I/O which fits well with this async approach. We likely do not need features like CPU-bound parallelism (which would need threads or multiple processes) – cooperative coroutines handle concurrency when tasks are I/O-bound or need to wait. As long as one task is waiting on I/O or an event, another can run. If all tasks are waiting (idle), the scheduler can block on select until something happens, using minimal CPU.

Architecture Overview

Our coroutine runtime will consist of a few core components: Task, Scheduler, and a set of System Call events that tasks can yield to trigger special behavior. Additionally, we have the notion of the call stack within a Task for trampolining nested coroutines. Below we detail each component and how they interact.

🧬 Enhanced Sequence Diagram: Tiffany Coroutine Scheduler (Expanded)

sequenceDiagram
  actor MainTask as Task_Main (Planning)
  actor WorkerTask as Task_Worker (LLM Plan)
  actor ToolTask as Task_Tool (IO Tool)
  participant Scheduler
  participant ReadyQueue as ReadyQueue_FIFO
  participant CallStack_Main as CallStack_Main (LIFO)
  participant CallStack_Worker as CallStack_Worker (LIFO)
  participant WaitMap as EventMap (IO/Sleep/Join)

%% === Startup ===
  MainTask->>Scheduler: yield NewTask(WorkerTask)
  Scheduler->>ReadyQueue: enqueue(WorkerTask)
  Scheduler->>MainTask: resume with WorkerTask.tid

%% === MainTask continues planning ===
  MainTask-->>Scheduler: yield Sleep(5s)
  Scheduler->>WaitMap: record MainTask sleep
  Note right of WaitMap: MainTask sleeping until t + 5s

%% === WorkerTask starts execution ===
  Scheduler->>ReadyQueue: dequeue next
  ReadyQueue-->>Scheduler: WorkerTask
  Scheduler->>WorkerTask: resume()

  WorkerTask-->>Scheduler: yield NewTask(ToolTask)
  Scheduler->>ReadyQueue: enqueue(ToolTask)
  Scheduler->>WorkerTask: resume with ToolTask.tid

%% === ToolTask nested sub-coroutine call
  WorkerTask-->>CallStack_Worker: push generator
  CallStack_Worker->>WorkerTask: switch to ToolTask coroutine
  Scheduler->>WorkerTask: continue (trampolining)

%% === ToolTask yields ReadWait
  ToolTask-->>Scheduler: yield ReadWait(fd=4)
  Scheduler->>WaitMap: register fd=4 for ToolTask
  Note right of WaitMap: ToolTask blocked waiting on I/O

%% === I/O becomes ready
  Note right of WaitMap: fd=4 becomes readable
  WaitMap->>Scheduler: notify ToolTask
  Scheduler->>ReadyQueue: enqueue(ToolTask)

%% === ToolTask completes
  Scheduler->>ReadyQueue: dequeue ToolTask
  ReadyQueue-->>Scheduler: ToolTask
  Scheduler->>ToolTask: resume()

  ToolTask-->>CallStack_Worker: return result
  CallStack_Worker->>WorkerTask: resume with result

%% === WorkerTask finishes work
  WorkerTask-->>Scheduler: StopIteration
  Scheduler->>WaitMap: resolve joiners on WorkerTask

%% === MainTask wakes up
  Note right of WaitMap: Timer expires for MainTask
  WaitMap->>Scheduler: wake MainTask
  Scheduler->>ReadyQueue: enqueue(MainTask)

%% === Final Resumption
  Scheduler->>ReadyQueue: dequeue MainTask
  ReadyQueue-->>Scheduler: MainTask
  Scheduler->>MainTask: resume()
  MainTask-->>Scheduler: continue next step

🧠 Key Concepts Represented:

  • Scheduler is the event orchestrator
  • Tasks are coroutine-based units of execution
  • ReadyQueue (FIFO) holds ready-to-run tasks
  • CallStack (LIFO) enables trampolining between nested coroutine calls
  • EventMap (join waits, I/O readiness) manages blocked tasks waiting for conditions to be fulfilled
  • SystemCall yields are managed through handle() logic per syscall (not shown inline here for brevity)
  • Task completion includes reactivating any tasks waiting on the completed task (e.g. via WaitTask)
  • Trampolining logic handles yield sub_coro and propagates results back via stack unwind

🧩 Component-Focused Diagrams Plan

  • D1 📦 Basic Scheduler & Task Yield – Round-robin task switching using the FIFO ReadyQueue
  • D2 🪜 Nested Coroutines (Trampolining) – How yield SubCoro() pushes/pops the per-task CallStack
  • D3 🔌 I/O Blocking & Wakeup – Task yields ReadWait, blocks on an fd, and is resumed on readiness
  • D4 ⏲️ Sleep and Timer Handling – Task yields Sleep(duration) and wakes later
  • D5 🔁 Join / WaitTask Resolution – Task B waits for Task A to finish and is resumed when Task A completes

Diagram D1 – Basic Task Switching (ReadyQueue only)

sequenceDiagram
    participant Scheduler
    participant ReadyQueue as FIFO
    actor TaskA
    actor TaskB

    Scheduler->>ReadyQueue: dequeue TaskA
    ReadyQueue-->>Scheduler: TaskA
    Scheduler->>TaskA: resume()
    TaskA-->>Scheduler: yield (None)
    Scheduler->>ReadyQueue: enqueue(TaskA)

    Scheduler->>ReadyQueue: dequeue TaskB
    ReadyQueue-->>Scheduler: TaskB
    Scheduler->>TaskB: resume()
    TaskB-->>Scheduler: yield (None)
    Scheduler->>ReadyQueue: enqueue(TaskB)

Diagram D2 – Nested Coroutines (Trampolining)

sequenceDiagram
    participant CallStack as CallStack_LIFO
    actor TaskMain
    actor SubCoro

    TaskMain-->>CallStack: push(TaskMain.gen)
    TaskMain->>SubCoro: start sub-coroutine
    SubCoro-->>CallStack: yield value
    CallStack->>TaskMain: pop parent
    TaskMain->>TaskMain: resume with value

Diagram D3 – I/O Blocking and Wakeup

sequenceDiagram
    participant Scheduler
    participant WaitMap as EventMap
    participant ReadyQueue
    actor TaskIO

    TaskIO-->>Scheduler: yield ReadWait(fd=7)
    Scheduler->>WaitMap: record TaskIO blocked on fd=7

    Note right of WaitMap: fd=7 becomes readable
    WaitMap->>Scheduler: notify fd=7 ready
    Scheduler->>ReadyQueue: enqueue(TaskIO)
    ReadyQueue-->>Scheduler: TaskIO ready to run

Diagram D4 – Sleep and Timer Handling

sequenceDiagram
  participant Scheduler
  participant WaitMap as SleepMap
  participant ReadyQueue
  actor TaskSleep

  TaskSleep-->>Scheduler: yield Sleep(3s)
  Scheduler->>WaitMap: record TaskSleep with wake_time = now + 3s

  Note right of WaitMap: 3 seconds pass
  WaitMap->>Scheduler: wake TaskSleep
  Scheduler->>ReadyQueue: enqueue(TaskSleep)

Diagram D5 – Join / WaitTask Resolution

sequenceDiagram
  participant Scheduler
  participant ReadyQueue as FIFO
  participant WaitMap as JoinMap
  actor TaskA
  actor TaskB

  Scheduler->>TaskB: resume()
  TaskB-->>Scheduler: yield WaitTask(TaskA.tid)
  Scheduler->>WaitMap: record TaskB waiting on TaskA

  Scheduler->>ReadyQueue: dequeue TaskA
  ReadyQueue-->>Scheduler: TaskA
  Scheduler->>TaskA: resume()
  TaskA-->>Scheduler: StopIteration (TaskA complete)
  Scheduler->>WaitMap: resolve joiners of TaskA
  WaitMap->>Scheduler: TaskB was waiting
  Scheduler->>ReadyQueue: enqueue(TaskB)

State Machine Diagram: Task Lifecycle

Purpose: Shows all valid transitions for a task (created → ready → running → blocked → complete).

Helps with: Understanding task states, debugging transitions, implementing lifecycle hooks.

stateDiagram-v2
    [*] --> Created
    Created --> Ready : schedule()
    Ready --> Running : run()
    Running --> Ready : yield (cooperative)
    Running --> Blocked_IO : yield ReadWait/WriteWait
    Running --> Blocked_Sleep : yield Sleep
    Running --> Blocked_Join : yield WaitTask
    Blocked_IO --> Ready : fd becomes ready
    Blocked_Sleep --> Ready : timer expires
    Blocked_Join --> Ready : waited task completes
    Running --> Complete : StopIteration
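
One possible encoding of these states for the Python prototype, with a transition table mirroring the diagram (a sketch for tests and lifecycle hooks, not a fixed API):

from enum import Enum, auto

class TaskState(Enum):
    CREATED = auto()
    READY = auto()
    RUNNING = auto()
    BLOCKED_IO = auto()
    BLOCKED_SLEEP = auto()
    BLOCKED_JOIN = auto()
    COMPLETE = auto()

# Legal transitions, taken directly from the state diagram above.
TRANSITIONS = {
    TaskState.CREATED:       {TaskState.READY},
    TaskState.READY:         {TaskState.RUNNING},
    TaskState.RUNNING:       {TaskState.READY, TaskState.BLOCKED_IO,
                              TaskState.BLOCKED_SLEEP, TaskState.BLOCKED_JOIN,
                              TaskState.COMPLETE},
    TaskState.BLOCKED_IO:    {TaskState.READY},
    TaskState.BLOCKED_SLEEP: {TaskState.READY},
    TaskState.BLOCKED_JOIN:  {TaskState.READY},
    TaskState.COMPLETE:      set(),
}

def transition(task, new_state):
    assert new_state in TRANSITIONS[task.state], f"illegal {task.state} -> {new_state}"
    task.state = new_state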

Class Diagram: Scheduler Components

Purpose: Depicts the structure of Scheduler, Task, SystemCall, ReadyQueue, etc.

Helps with: Implementation planning, unit test coverage, codebase modularity.

classDiagram
    class Scheduler {
        +mainloop()
        +schedule(Task)
        +add_new_task(coro)
        -ready_queue : Queue
        -waiting_fds : Map
        -waiting_tasks : Map
    }

    class Task {
        +run()
        -tid : int
        -target : Generator
        -stack : LIFO
        -sendval : any
    }

    class SystemCall {
        +handle()
    }

    class ReadyQueue {
        +enqueue(Task)
        +dequeue() Task
    }

    Scheduler --> Task
    Task --> SystemCall
    Scheduler --> ReadyQueue

Fault Handling & Panic Recovery

This diagram shows what happens when a task terminates, either by crashing (an uncaught exception) or by completing normally (StopIteration), and how the scheduler logs the outcome, cleans up, and optionally notifies dependent systems via the PAL and WAL.

sequenceDiagram
    actor FaultyTask as Task_Foo
    participant Scheduler
    participant Logger
    participant PAL as ProcessActivityLog
    participant WAL as WriteAheadLog
    participant WaitMap as JoinMap

    Scheduler->>FaultyTask: run()
    FaultyTask-->>Scheduler: panic (Exception or StopIteration)

    alt Exception (unexpected)
        Scheduler->>Logger: log(Task_Foo crashed: {error})
        Scheduler->>PAL: record TaskFailure(Task_Foo, stacktrace)
        Scheduler->>WAL: write TaskAbort(Task_Foo)
        Scheduler->>WaitMap: notify any joiners
        Scheduler-->>FaultyTask: deallocate & remove from registry
    else Normal Completion (StopIteration)
        Scheduler->>PAL: record TaskComplete(Task_Foo)
        Scheduler->>WAL: write TaskExit(Task_Foo)
        Scheduler->>WaitMap: notify any joiners
        Scheduler-->>FaultyTask: remove
    end

🧠 Interpretation

  • Exceptions and normal StopIteration are handled differently, but both trigger:
    • Log/record in PAL (for real-time dashboards)
    • Append to WAL (for deterministic replay)
    • Notify any other tasks waiting on the failed/completed one
  • This can be extended later to retry policies, fallbacks, or alert hooks (e.g., webhook to notify UI agent).
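
A minimal sketch of how the scheduler loop could implement this branch, assuming Task.run() lets StopIteration propagate (as in Beazley's loop) and that pal, wal, logger, and a waiting map of joiners exist on the scheduler; all of these names are placeholders:

# Hypothetical Scheduler methods; pal, wal, logger, and self.waiting
# (tid -> list of tasks blocked in WaitTask) are assumed attributes.
def run_one(self, task):
    try:
        result = task.run()
    except StopIteration:
        self.pal.record("TaskComplete", tid=task.tid)
        self.wal.append("TaskExit", tid=task.tid)
        self.wake_joiners(task.tid)
        return None                       # normal completion: do not requeue
    except Exception as err:
        self.logger.error("Task %s crashed: %r", task.tid, err)
        self.pal.record("TaskFailure", tid=task.tid, error=repr(err))
        self.wal.append("TaskAbort", tid=task.tid)
        self.wake_joiners(task.tid)
        return None                       # crashed task is dropped from the registry
    return result                         # a SystemCall or a cooperative yield

def wake_joiners(self, tid):
    for waiter in self.waiting.pop(tid, []):
        self.schedule(waiter)             # resume tasks blocked on WaitTask(tid)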

Data Flow: Task Execution + WAL/PAL Logging

This shows how data produced during task execution (including yielded values, system calls, and final results) is recorded into:

  • WAL (Write-Ahead Log): durable event stream for recovery/replay
  • PAL (Process Activity Log): ephemeral, streaming trace for dashboards
  • MemoryStore: optional semantic graph / state mutation

This is helpful for:

  • Task recovery
  • Auditing
  • Live dashboards

flowchart TD
    A[Task: Coroutine Execution] --> B1[Yield SystemCall or Result]
    B1 --> C1[Scheduler Dispatch]

    %% WAL Logging
    C1 --> D1[WAL: write - Event]
    D1 --> E1[Disk / WAL File]

    %% PAL Logging
    C1 --> D2[PAL: stream -TaskProgress]
    D2 --> E2[PubSub Bus / Dashboard Feed]

    %% If Memory Update
    C1 -->|Memory-related yield| F1[MemoryStore Update]
    F1 --> G1[GraphDB or VecDB]

    %% Loop continuation
    C1 --> H1[Requeue Task or Terminate]

    style E1 fill:#dff,stroke:#0af
    style E2 fill:#ffd,stroke:#fa0
    style G1 fill:#f9f,stroke:#c0f

🧠 Interpretation

Every execution result passes through the Scheduler, which:

  • Logs to WAL for recovery (e.g. TaskYielded, TaskWaitIO(fd=4))
  • Streams to PAL for dashboards (e.g. TaskStep with step label)
  • Optionally updates MemoryStore (e.g. a new semantic link or skill)
  • WAL is append-only, immutable
  • PAL is transient (but real-time)
  • MemoryStore is stateful, must be synchronized with execution semantics
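
As a sketch of where these hooks could live, assuming hypothetical wal, pal, and memory objects on the scheduler (and a hypothetical MemoryUpdate yield type defined elsewhere):

import json
import time

def dispatch(self, task, yielded):
    """Hypothetical Scheduler method: record one yield, then route it."""
    event = {"tid": task.tid, "ts": time.time(), "yield": repr(yielded)}

    self.wal.append(json.dumps(event))            # WAL: durable, append-only
    self.pal.publish("task.progress", event)      # PAL: transient, real-time stream

    if isinstance(yielded, MemoryUpdate):         # hypothetical memory-related yield
        self.memory.apply(yielded)                # keep GraphDB/VecDB in sync
        self.schedule(task)
    elif isinstance(yielded, SystemCall):
        yielded.task, yielded.sched = task, self
        yielded.handle()                          # the syscall decides requeue vs block
    else:
        self.schedule(task)                       # plain cooperative yield: requeue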

🧠 Summary: Which to prioritize?

  • ✅ Task Lifecycle (State) – Priority ⭐⭐⭐⭐ – Debugging, correctness, onboarding
  • ✅ Scheduler Class Overview – Priority ⭐⭐⭐⭐ – Code structuring and test scaffolds
  • ✅ Fault Handling Sequence – Priority ⭐⭐⭐ – Observability, stability
  • ✅ PAL/WAL Data Flow – Priority ⭐⭐⭐ – Logging, traceability, dashboards
  • ✅ SystemCall Dependency Graph – Priority ⭐⭐ – Extending the runtime with new features

Task

A Task represents a running coroutine (generator). It encapsulates the coroutine and its state, including anything needed to resume it. In our design, a Task will have:

  • tid – a task identifier (unique per task). This can simply be an integer assigned in sequence for debugging or referencing tasks.
  • target – the current generator that the task is running. Initially this is the coroutine function’s generator object. Due to trampolining, this target may change to a sub-generator if the coroutine calls another coroutine (we maintain a stack of generators, see below).
  • stack – a list (stack) of generator objects, initially empty. This is used for trampolining: whenever the task yields a generator (a sub-coroutine), we push the current target onto stack and start running the sub-generator as the new target. When a generator finishes, we pop from the stack to resume the caller.
  • sendval – a value to send into the coroutine on next resume. This is how we deliver data or signals to the coroutine. For example, when a task yields an event and the scheduler processes it, the scheduler might set sendval for when the task resumes (e.g., the result of a system call). Initially, when a task is first started, sendval is None (since we have to send None or call next() to start a generator).

The Task object will have a method like run() which essentially advances the coroutine to its next yield point. Pseudocode for Task.run might be:

import types

class Task:
    def run(self):
        try:
            result = self.target.send(self.sendval)   # Resume coroutine, send value
        except StopIteration:
            # Coroutine finished
            if self.stack:
                # If there is a calling generator waiting, resume it
                self.sendval = None
                self.target = self.stack.pop()
                return None  # indicate we should continue running (scheduler will reschedule immediately)
            else:
                # No caller, task is truly done
                return None
        self.sendval = None  # reset sendval after sending

        if isinstance(result, SystemCall):
            # If coroutine yielded a system call (special request)
            return result  # hand it to scheduler to handle

        if isinstance(result, types.GeneratorType):
            # Coroutine yielded a sub-generator: trampolining
            self.stack.append(self.target)
            self.target = result
            # loop back (we can just return something that causes scheduler to immediately continue this task)
            return self.run()  # (or manage via while loop as in Beazley’s code)
        else:
            # A normal yield (some value yielded to scheduler or caller)
            if not self.stack:
                # Top-level coroutine yielded a value (not a sub-call and not system call)
                return result  # this result can be ignored or handled by scheduler if needed
            # If there is a caller awaiting this yield:
            caller = self.stack.pop()
            self.target = caller
            self.sendval = result  # send the yielded value as result back into caller
            return self.run()

This logic is derived from Beazley’s trampolining implementation. In fact, Beazley’s code used a while True loop inside run() to handle nested yields in one go. The key idea is: if a coroutine yields another generator, we don’t return to the scheduler; instead, we switch context to that sub-generator (pushing the current one onto a stack). We only return from run() to the scheduler when we hit a yield that isn’t a sub-coroutine call or when a system call needs handling. If a coroutine fully finishes (StopIteration) and we have a caller on the stack, we pop back to it (effectively as if the yield of the sub-coro returned None). The scheduler will see a return of None from task.run() as either “task not ready to yield to scheduler (continue internally)” or “task completed”. We might handle that by the scheduler: if task.run() returns None, it means either the task finished or is still running a sub-task (depending on how we code it). In Beazley’s code, he returned None to indicate the task isn’t yielding a real value (so the scheduler will just move to next task) until a system call or completion happens. We can follow a similar convention.

To summarize Task.run outcomes: It can return a SystemCall object (meaning “scheduler, please handle this and don’t immediately reschedule me”), or return a regular value/None. If it returns a non-SystemCall (including None), the scheduler can assume the task either completed or just yielded a value without any special meaning – usually we would simply put the task back on the ready queue if it returned a value or None (unless None signifies completion). One way to detect completion is to track if the task’s coroutine is exhausted. In our code, if a StopIteration occurs with no caller, that task is done and should not be rescheduled. We might mark a flag or simply not requeue it.

Scheduler

The Scheduler is the event loop that coordinates all tasks. It maintains data structures for ready tasks and for tasks waiting on events (like I/O or other tasks). Key parts of the scheduler design:

  • A queue (FIFO) of ready tasks that can run. We can use a simple list or collections.deque for this. Tasks get pulled from here to execute, and if they yield cooperatively (normal yield), they go back to the end of the queue. This ensures round-robin fairness for CPU time among tasks that are ready to run.

  • A task table or registry to keep track of tasks by ID. For example, a dictionary tasks mapping tid -> Task. This is useful if we implement operations like KillTask(tid) or WaitTask(tid) where the scheduler needs to find a specific task by its ID.

  • Data structures for waiting tasks:

    • For I/O: perhaps two dicts, read_waiting and write_waiting, mapping file descriptors to Task waiting on them (if any). Alternatively, a combined structure mapping an event key to tasks. Initially, we can keep it simple with two dicts as in Beazley’s example (one for read, one for write).
    • For task waiting: if implementing WaitTask, we need a map from task IDs to the list of tasks waiting for them to finish, e.g. waiting[tid] = [tasks paused until tid finishes].
    • For timers: if we want Sleep(duration) functionality, we might maintain a min-heap or list of (wake_time, Task) for tasks that are sleeping, and on each loop check the current time to see if any should be woken and moved to ready. (This is similar to how some event loops manage call-at-future-time events.)
  • The main loop of the scheduler will roughly do: while there are tasks (or events to wait for):

    1. If there are no ready tasks, block on I/O (and possibly timer) events until something is ready (to avoid busy-wait). If there are ready tasks, we might do a non-blocking check for I/O instead. This logic is found in Beazley’s iotask() which the scheduler can also manage as a special periodic task. We can incorporate the polling directly in the loop.

    2. Take the next task from the ready queue.

    3. Call task.run().

    4. Examine the return value:

      • If result is a SystemCall instance: This means the task is requesting a special operation (and has yielded control intentionally for this). The scheduler will handle the system call immediately. Typically, the SystemCall object has a handle() method that the scheduler can invoke. In our design, we can implement each SystemCall’s handle to perform the needed action. Before calling handle(), we usually set context on the SystemCall: e.g. result.task = current_task and result.sched = self so it knows who’s calling and can communicate back. After handling, the task might be ready to continue (or not). For example, GetTid system call will set the task’s sendval to its tid and immediately requeue the task. A ReadWait system call will not requeue the task; instead it will put the task into read_waiting and not schedule it until the I/O is ready. The SystemCall’s logic will determine that. After handle(), the scheduler’s loop continues to the next iteration without requeuing the current task (unless the system call did so).
      • If result is None or a regular value (not a SystemCall): This means the task either completed or just yielded a value. If result is None and the task’s generator is done, we drop the task (it finished execution). If result is not None (or is None but the task is not actually done yet due to trampolining), it likely represents a normal cooperative yield – the task yielded control on purpose (maybe via a yield with no special object). In such case, we simply put the task back onto the ready queue to resume later. We might not have a strong use for the actual value yielded (some systems use it for passing messages, but in our design we don’t anticipate tasks communicating via yielded values except for the defined SystemCall objects). We can safely ignore or log unexpected yield values. In summary, for a normal yield, requeue the task.
      • If an exception was raised from task.run() that isn’t StopIteration (meaning an error in task): we might catch it around task.run(). In such a case, we should consider the task crashed. The scheduler can log the exception (and the task ID for debugging) and not requeue the task. This prevents one task’s error from crashing the whole scheduler loop.
  • I/O polling: To handle I/O, the scheduler will incorporate a call to select (or an equivalent like selectors in Python) each cycle. One approach (from Beazley’s code) is to spawn a special internal task whose job is to continuously call iopoll() in a loop and yield, so that it runs concurrently with others. Another approach is to integrate polling into the main loop when no ready tasks are left (or periodically). We can do something like:

# Pseudocode for I/O wait integration in scheduler loop:
if not ready_queue:
    # No task ready to run, so we can block on I/O
    timeout = None  # block indefinitely until an event
else:
    timeout = 0  # poll without blocking (just check and return)
readable, writable, _ = select.select(read_waiting_fds, write_waiting_fds, [], timeout)
for fd in readable:
    task = read_waiting.pop(fd)
    ready_queue.append(task)
    # perhaps set task.sendval = None or some actual data if needed
for fd in writable:
    task = write_waiting.pop(fd)
    ready_queue.append(task)

This way, if tasks are waiting on I/O, we wake them when ready. If tasks are simply CPU-bound but cooperative, the timeout=0 poll ensures we check I/O quickly without blocking, and then continue round-robin. If all tasks are waiting (no one ready), we do a blocking wait (timeout=None) so that the process sleeps until an I/O event occurs. This is important to avoid spinning CPU when idle.

  • SystemCall Handling: We will define a small set of system call classes that encapsulate operations. Likely candidates for Tiffany:

    • GetTid – returns the task’s own ID. Useful for tasks to identify themselves (maybe for logging). This would simply do: self.task.sendval = self.task.tid; self.sched.schedule(self.task) (i.e., put the task back in ready queue).
    • NewTask(coro) – create a new Task for the given coroutine and schedule it. The current task could receive the new task’s id as a result of the yield. Implementation: create Task, store in task map, enqueue it; set current task’s sendval to new task’s id (or some acknowledgment) and then also enqueue current task again (so it continues).
    • WaitTask(tid) – current task wants to wait until task tid completes. We would check if that task is still alive. If yes, remove current from ready queue and put it in a wait list for that tid. If the task tid is already done, we can immediately continue the current task (perhaps setting sendval to some result or None). If waiting, when task tid eventually finishes, the scheduler will need to take any tasks in its wait list and enqueue them (with perhaps a result indicating that the wait is over).
    • KillTask(tid) – to terminate a task. The scheduler would look up the task by id, and if found, remove it from ready queue or waiting structures, and possibly prevent it from running. If a task is killed, tasks waiting on it might need to be notified (maybe with an exception or a special value). We might not need this in Tiffany initially unless we have a use-case to stop tasks.
    • ReadWait(file) / WriteWait(file) – as described, these will take the file’s file descriptor and put the task in read_waiting or write_waiting. The task is not requeued; it’ll be resumed by the scheduler when the fd is ready (detected via select). We should mark in the task map that this task is now blocked on I/O. The handle() will do: self.sched.waitforread(self.task, fd) for example, and nothing more (do not requeue the task). The scheduler’s polling will take care of waking it.
    • (Optional) Sleep(dt) – not mentioned in Beazley’s original, but easy to add. It would record a wake-up time (now + dt) and put the task aside (perhaps in a min-heap sorted by wake time). The scheduler each loop can check the soonest wake-up time to decide a timeout for select. When time is reached or passed, the scheduler will requeue the task. This can be implemented similarly to I/O wait.

We can implement these system calls as classes with a handle() method as mentioned. They might look like:

class SystemCall:
    def handle(self):
        pass

class GetTid(SystemCall):
    def handle(self):
        self.task.sendval = self.task.tid
        self.sched.schedule(self.task)  # put task back ready

class NewTask(SystemCall):
    def __init__(self, coro):
        self.coro = coro
    def handle(self):
        new_task = Task(self.coro)
        self.sched.add_task(new_task)
        # Immediately schedule new task
        self.sched.schedule(new_task)
        # Also resume current task
        self.task.sendval = new_task.tid
        self.sched.schedule(self.task)

And so on for others (ReadWait, etc.), following the patterns from Beazley’s code (which we have paraphrased). The scheduler will detect isinstance(result, SystemCall) and invoke result.handle() after attaching result.task = current_task and result.sched = self. This injects the needed context so the SystemCall knows which task is making the request and can access the scheduler.
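
For completeness, hedged sketches of ReadWait (following the waitforread() pattern described above, as in Beazley's code) and the optional Sleep call; the waituntil() helper and the sleep heap it implies are our own assumptions, not part of the original:

import time

class ReadWait(SystemCall):
    def __init__(self, f):
        self.f = f
    def handle(self):
        fd = self.f.fileno()
        # Park the task until select() reports fd readable; deliberately do not requeue it here.
        self.sched.waitforread(self.task, fd)

class Sleep(SystemCall):
    def __init__(self, seconds):
        self.seconds = seconds
    def handle(self):
        wake_time = time.time() + self.seconds
        # Assumed helper: push (wake_time, task) onto the scheduler's sleep heap;
        # the main loop uses the earliest wake_time as its select() timeout.
        self.sched.waituntil(self.task, wake_time)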

Scheduler APIs: We should have methods like Scheduler.new(coro) to add a new coroutine (wrap it in Task and enqueue), and Scheduler.schedule(task) to enqueue an existing Task (this might just put it in ready queue). Also, perhaps Scheduler.exit(task) or internal handling when a task finishes (to do cleanup, e.g., remove from task map, and wake any waiters).

The main loop Scheduler.mainloop() drives everything until no tasks remain. It will incorporate the logic discussed: get next ready task, run it, handle results, and poll events. In Python pseudocode:

def mainloop(self):
    while self.tasks or not self.ready.empty():
        # handle IO and timers
        if self.ready.empty():
            timeout = None
        else:
            timeout = 0
        # use select or similar to get ready fds
        rlist = list(self.read_waiting.keys())
        wlist = list(self.write_waiting.keys())
        if rlist or wlist:
            readable, writable, _ = select.select(rlist, wlist, [], timeout)
            for fd in readable:
                task = self.read_waiting.pop(fd)
                self.schedule(task)
            for fd in writable:
                task = self.write_waiting.pop(fd)
                self.schedule(task)
        elif timeout is None:
            # No fds to wait on, but we have to wait (meaning no ready tasks and no I/O)
            # This situation likely means no tasks at all, or only tasks waiting on something that isn't fd (maybe a timer).
            # We could sleep or break.
            break

        if not self.ready.empty():
            task = self.ready.get()  # dequeue a ready task
        else:
            continue  # go back to loop top if nothing ready (maybe all tasks done while waiting)

        result = None
        try:
            result = task.run()
        except Exception as e:
            # handle unexpected task exception
            print(f"Task {task.tid} raised {e}, removing it")
            # (optionally call some task cleanup)
            continue

        if result is None:
            # Task finished or yielded normally with no special request
            if task.is_done():  # we would need a way to check if fully done
                # remove from task map, and if someone waiting on it:
                if task.tid in self.waiting:
                    for wtask in self.waiting[task.tid]:
                        self.schedule(wtask)
                    del self.waiting[task.tid]
                continue  # don't requeue
            else:
                # If not done, it yielded cooperatively (like yield None or something)
                self.schedule(task)  # just requeue
        elif isinstance(result, SystemCall):
            result.task = task
            result.sched = self
            result.handle()
            # Note: the SystemCall handler is responsible for requeueing the task or not
        else:
            # The task yielded a value (not None, not SystemCall)
            # We currently have no use for direct yielded values except maybe debug
            # Treat it as a cooperative yield: task still alive, requeue it
            self.schedule(task)

The above is a rough outline; details will differ (like how we mark task done – perhaps Task.run() sets an internal flag or we catch StopIteration).

One subtlety: In Beazley’s code, when a task returns a SystemCall, Task.run() actually returns that SystemCall up to the scheduler, and the scheduler does not immediately requeue the task but processes the system call. The task in such case is still in a running state, just paused until the system call is done. Many system calls will immediately put the task back to ready (like GetTid, NewTask), but some (ReadWait, WaitTask) deliberately do not requeue the task, causing it to be effectively blocked. Our design follows the same approach.

To illustrate the scheduling and task switching, consider two tasks yielding control back and forth:

Figure 1: Cooperative task switching. In the above sequence diagram, Task A runs and then yields control to the scheduler (e.g., by executing a simple yield with no special event) to allow other tasks to run. The scheduler receives control and marks Task A as ready (or takes it out temporarily since it yielded). The scheduler then picks Task B to run next. Task B runs for a while and then yields control as well. The scheduler resumes Task A, and so on in a round-robin fashion. This cooperative approach ensures tasks take turns. Each arrow labeled “yield … control” represents a task suspending itself by yielding, and “resume Task X” represents the scheduler activating the next task. The scheduler is at the center, orchestrating the switches.

Trampolining and Nested Coroutines

Trampolining is handled within the Task’s run() logic, as described, rather than in the scheduler. This is by design: from the scheduler’s perspective, a Task with trampolining is still a single task – the scheduler doesn’t need to know if the task internally entered a sub-coroutine. The Task will only return to the scheduler when the entire nested sequence yields a system call or finishes. This keeps the scheduler simple and pushes the complexity to the Task object.

To understand trampolining, imagine a server task that calls a helper coroutine. For example:

def Accept(sock):
    yield ReadWait(sock)            # wait for a client to connect
    client, addr = sock.accept()    # accept connection (once socket is ready)
    yield client, addr              # yield the result back to caller

def server(port):
    sock = setup_listening_socket(port)
    while True:
        client, addr = yield Accept(sock)   # call sub-coroutine Accept
        yield NewTask(handle_client(client, addr))

In this scenario, server() calls Accept(sock) via yield. Without trampolining, Accept’s yields would not be handled correctly. With trampolining, what happens internally is:

  • The server coroutine (let’s call it Outer) yields a generator object when it does yield Accept(sock). That generator corresponds to the Accept coroutine. The Task running server detects it yielded a generator. It pushes the Outer generator onto its stack and sets its current target to the Accept generator. The Task now will start/resume the Accept coroutine. The scheduler is not involved in this switch; it all happens inside Task.run.

  • The Accept coroutine runs until its first yield. It yields ReadWait(sock) – a system call to wait for a socket. At this point, in Task.run, result is a ReadWait SystemCall, which triggers the return result path. Thus, the Task returns the ReadWait event to the scheduler (and pauses). The scheduler sees a SystemCall, sets it up with context, and handles it. In this case, ReadWait.handle() will register that task in the read_waiting map for that socket’s fd and not requeue it. The task is now blocked waiting for I/O. The scheduler will move on to other tasks (if any) or wait in select for file descriptors.

  • When a client connection comes in, the listening socket becomes readable. The scheduler’s select call returns that fd as ready. The scheduler finds which task was waiting on that fd (the Accept’s task) and puts it back on the ready queue. Now the Accept coroutine can resume. The scheduler will eventually run that task again; in Task.run, we re-enter where we left off in Accept (after the yield ReadWait). We likely set sendval = None when resuming after I/O (no particular value to send, or we could send the sock.accept() result, but in this design Accept itself will call sock.accept()). Accept now executes client, addr = sock.accept() (since the socket is ready) and then yield client, addr. That yields a regular value (not a SystemCall or generator) back to the scheduler.

  • The Task’s run() sees the Accept generator yielded a normal value (the client socket and address). Since the Task’s stack is not empty (Outer is waiting), the code will pop the Outer generator off the stack, make it the current target again, and set the value as sendval. It then continues running the Outer generator, effectively sending the (client, addr) tuple into the yield Accept(sock) expression. Thus, in the server() coroutine, the call client, addr = yield Accept(sock) receives the values and continues. From the scheduler’s view, this entire sequence looks like the task went to sleep on I/O and then produced a SystemCall and later simply yielded a normal value (the tuple) which we ignored/treated as cooperative yield before rescheduling. The Outer coroutine then immediately yields NewTask(handle_client(client, addr)) as per the code, which will be a SystemCall. The scheduler handles the NewTask system call by creating a new task for handle_client and scheduling it. Meanwhile, the server task (Outer) is requeued (since after yielding NewTask it would continue its loop).

This interplay is complex, but the good news is our design cleanly separates concerns:

  • The Task/trampoline mechanism handles sub-coroutine calls and returns without scheduler intervention, making nested coroutine calls possible just like nested function calls.
  • The scheduler handles system calls and dispatch.
  • The net effect is our code can be written in a linear style with yield acting as an async wait, which is very convenient.

Figure 2: Nested coroutine (trampolining) and I/O event sequence. The above diagram illustrates the sequence described: The Outer task (e.g., the server loop) yields a sub-coroutine Accept(sock). The Task internals switch to running the Accept coroutine. The Accept coroutine yields a ReadWait system call to the Scheduler, indicating it needs to wait for the socket. The Scheduler pauses the task until the socket is ready (some time passes externally). Once a client connects (socket ready), the Scheduler resumes the Accept coroutine. Accept then produces a result (client, addr) and returns that back to the Outer task. The Outer task resumes, now with the client info, and can continue (in this case, spawning a new task to handle the client). All of this happens seamlessly: from the programmer’s perspective, yield Accept(sock) returned a value when the Accept coroutine completed, just like a normal function call that waited for I/O. This transparency is the benefit of trampolining. The scheduler did not have to know about the call to Accept; it only dealt with the ReadWait and later the NewTask system calls.

Rust Implementation Consideration

We should briefly consider how this design would map to Rust 2024, to ensure our approach is not fundamentally incompatible if we choose or need to implement in Rust. Rust’s async/await works differently but the high-level structure of an event loop with tasks is analogous:

  • In Rust, an async function is analogous to our coroutine. It can await on futures (similar to yielding events). We could create an executor that polls these futures. For example, our scheduler’s main loop in Rust would poll each future (task) until it returns Poll::Pending or Poll::Ready. If pending, it means the task yielded (awaited) on something – we’d then need to know on what it’s waiting (Rust’s waker system handles this via registering wakers with reactors like epoll or timers). If ready, the future completed and can be dropped.

  • Many of the constructs we have as SystemCall in Python would correspond to either standard library futures or custom future combinators in Rust. For instance, waiting on a socket is typically done by an async I/O reactor (like using tokio which integrates with OS polling under the hood). Spawning a new task is typically a method on the executor (spawn in tokio). Getting a task ID is not built-in in Rust futures (since tasks are generally anonymous closures), but one could incorporate an ID for logging.

  • One could attempt to implement a stackful coroutine in Rust with the unstable Generator trait, which would look more like the Python approach (allowing yield within a generator). However, using that directly is nightly-only. Instead, using async/await (which are stackless coroutines) is the idiomatic way. Each .await in Rust is akin to a yield point where the task returns control to the executor. The executor must know how to wake the task when the awaited operation is ready. This is done via Waker objects in Rust, typically provided by the runtime. Our Python scheduler used a simple select loop; in Rust, an executor often uses OS selectors (epoll, etc.) or leverages a library for readiness (like mio in tokio).

  • Minimum needed vs full features: In Rust, if using existing libraries, we might not need to implement our own reactor – we could use something like futures::executor::block_on or smol or tokio for a lot of functionality. But if we want a standalone minimal runtime (for learning or control), we could implement a basic select or poll loop using lower-level crates. Our SystemCall analogs in Rust could be simple functions or futures: e.g., a read_wait(socket) could be a future that registers interest in a socket readability event and yields until the socket is readable. The complexity is higher than in Python due to Rust’s safety and the need to interface with OS properly (non-blocking sockets, etc.).

Evaluating Python vs Rust for Tiffany: For prototyping and quick development, Python’s coroutine scheduler is extremely flexible and easy to change. It runs in a single thread (which is fine if Tiffany’s workload is I/O-bound rather than CPU-bound) and can handle many concurrent tasks (Python generators are lightweight). Python’s drawback is performance – if we need to handle thousands of tasks or high-throughput I/O, the Python interpreter might become a bottleneck. Additionally, Python’s GIL prevents multi-core usage, but since we are doing single-threaded cooperative multitasking, that is not a concern unless we wanted to scale across threads (not in our current scope).

Rust, on the other hand, offers high performance and true non-blocking behavior with zero-cost abstractions once compiled. If Tiffany’s environment is constrained (e.g. running on bare metal with low resources or needing to maximize network throughput), a Rust implementation might be worthwhile. Rust also offers compile-time safety guarantees; Python is memory safe as well, so the main risks in the Python scheduler are logical (forgetting to requeue a task, mishandling an exception), while Rust could additionally prevent certain classes of errors (like data races, if we later use multiple threads for parallelism). However, writing our own Rust async runtime is a significantly bigger engineering effort than the Python prototype. The minimum necessary for our needs can be achieved with the Python approach initially, given we don’t need true parallelism or extreme performance at the prototype stage. We can ensure the design is portable to Rust by mirroring concepts:

  • Use Futures in Rust as tasks; use an executor (which we can implement or reuse) akin to our Scheduler.
  • Replace yields and SystemCalls with await points and either library futures (for I/O) or custom ones (for things like GetTid – though in Rust, GetTid is trivial or not usually needed, since tasks aren’t typically identified by an integer by default).
  • We won’t need an exact equivalent of trampolining in Rust, because async/await handles nested calls automatically via the compiler: an async function can call another async function and await it, which is analogous to our yield-from or trampolining solution, but the Rust compiler flattens these awaits into the generated state machine. Essentially, Rust’s async has trampolining built in – you can await nested async calls with no special effort (see the short sketch after this list).
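
As a minimal illustration (an assumption for this document, using block_on from the futures crate just to drive the outer future, and hypothetical function names), the Rust analogue of our trampolined Accept call is simply one async fn awaiting another; the compiler merges the suspension points into a single state machine, so no explicit call stack is needed:

use futures::executor::block_on;

// Stand-in for the awaited Accept sub-coroutine (hypothetical; a real runtime
// would do something like listener.accept().await here).
async fn accept_one() -> &'static str {
    "client"
}

async fn handle_clients() {
    // Equivalent to `client = yield Accept(sock)` in the Python trampoline.
    let client = accept_one().await;
    println!("accepted {client}");
}

fn main() {
    block_on(handle_clients());
}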

In conclusion, for Tiffany’s minimum requirements, the Python coroutine scheduler suffices and is easier to build and iterate on. We will proceed with that design, implementing just the features we need (task creation, basic synchronization, and non-blocking I/O). If performance tests or production considerations demand it, we can plan a migration to Rust using an equivalent structure (possibly leveraging an existing runtime to avoid reinventing the wheel at low level). The conceptual model of our scheduler – an event loop handling tasks and events – is common to virtually all async runtimes (from JavaScript’s event loop to libuv, Python asyncio, and Rust executors), so the knowledge and design should translate well.

Summary of Design Decisions

  • Single-threaded cooperative scheduler: simple design, deterministic behavior, no preemption. Suitable for I/O-heavy tasks. Not suitable for CPU-bound parallel work (not needed for now).
  • Coroutine tasks via Python generators: easy to implement the logic of yield/resume. We use generator send() and yield to alternate between tasks and scheduler. We wrap them in a Task class to manage state and allow nested yields (trampolining).
  • System call objects for events: We define specific events tasks can yield to trigger actions like spawning tasks or waiting on I/O. This makes the yield interface extensible – we can add new capabilities by introducing new SystemCall classes without changing the core scheduler loop much.
  • Trampolining with explicit stack: This is included to allow tasks to call sub-coroutines cleanly. It adds a little complexity to Task.run but keeps the scheduler logic clean (the scheduler only ever sees top-level yields). The decision is justified by the need for clean code structure in tasks – avoiding monolithic coroutines, which was the original motivation in pyos8, written before yield from existed. Since we target at least Python 3.7+ for Tiffany, we could use yield from instead of manual trampolining: PEP 380 provides a language-level way to delegate to subgenerators, which would simplify Task.run (no manual stack push/pop). Using yield from means our coroutines use that syntax directly in their code, which is fine if we control all coroutine functions; the manual trampoline remains useful for older-style code and for seeing how delegation works under the hood. Given that yield from is well supported, an alternative design is to drive any generator yielded by another generator via yield from, but since our scheduler already manages yields explicitly, we will likely stick with the manual approach.
  • Minimal set of system calls: We will implement only those needed. For example, if Tiffany’s tasks mostly revolve around network and spawning, we implement NewTask, ReadWait, WriteWait, and possibly a Sleep. If tasks need to coordinate, WaitTask can be added. We won’t implement features like prioritization or time-slicing (each task runs until it yields). If a task misbehaves (never yields), it will block the system – this is acceptable given our controlled use cases, but we’ll document that tasks must be written to yield appropriately.
  • Integration with actual I/O: We assume the environment allows non-blocking I/O operations. For example, for network sockets we will set them to non-blocking mode (Beazley’s example uses socket.accept() after waiting, which is fine because the accept will not block when select indicates readiness). For file reading or other ops, we may need similar handling. Python’s select works on sockets and some other file types on Unix. If we need cross-platform or more complex monitoring, we could use the selectors module which abstracts select/poll/epoll differences. But to keep it minimal, select is okay if our fd sets are small (there is a FD limit on select, but likely not an issue for Tiffany’s scale).

The design above gives us a clear blueprint to start coding. We have focused on the essential mechanisms: task switching, waiting for events, and maintaining state. We have also discussed how to scale it down or up depending on needs (e.g., what happens in Rust, what features to omit for simplicity). With this design, the next step is to implement the Scheduler and Task classes in code, define the necessary SystemCall subclasses, and test the system with some example tasks (like a simple server or a ping-pong between tasks) to ensure it behaves as expected. The sequence diagrams provided illustrate the flow of control between tasks and the scheduler, which we will use as a reference for implementation correctness.

Tiffany System Diagrams

This section contains visual documentation and sequence diagrams for:

  • Coroutine scheduler
  • WAL/PAL replay and dataflow
  • ReAct loop and skill execution
  • Task lifecycle and fault handling

All diagrams are written in Mermaid and versioned alongside the system ADRs and concepts.

See also: ADR-0006, ADR-0009

Scheduler Component Diagrams

Diagram D1 – Basic Task Switching (ReadyQueue only)

sequenceDiagram
    participant Scheduler
    participant ReadyQueue as FIFO
    actor TaskA
    actor TaskB

    Scheduler->>ReadyQueue: dequeue TaskA
    ReadyQueue-->>Scheduler: TaskA
    Scheduler->>TaskA: resume()
    TaskA-->>Scheduler: yield (None)
    Scheduler->>ReadyQueue: enqueue(TaskA)

    Scheduler->>ReadyQueue: dequeue TaskB
    ReadyQueue-->>Scheduler: TaskB
    Scheduler->>TaskB: resume()
    TaskB-->>Scheduler: yield (None)
    Scheduler->>ReadyQueue: enqueue(TaskB)

Diagram D2 – Nested Coroutines (Trampolining)

sequenceDiagram
    participant CallStack as CallStack_LIFO
    actor TaskMain
    actor SubCoro
    
    TaskMain-->>CallStack: push(TaskMain.gen)
    TaskMain->>SubCoro: start sub-coroutine
    SubCoro-->>CallStack: yield value
    CallStack->>TaskMain: pop parent
    TaskMain->>TaskMain: resume with value

Diagram D3 – I/O Blocking and Wakeup

sequenceDiagram
    participant Scheduler
    participant WaitMap as EventMap
    participant ReadyQueue
    actor TaskIO
    
    TaskIO-->>Scheduler: yield ReadWait(fd=7)
    Scheduler->>WaitMap: record TaskIO blocked on fd=7
    
    Note right of WaitMap: fd=7 becomes readable
    WaitMap->>Scheduler: notify fd=7 ready
    Scheduler->>ReadyQueue: enqueue(TaskIO)
    ReadyQueue-->>Scheduler: TaskIO ready to run

Diagram D4 – Sleep Timer

sequenceDiagram
    participant Scheduler
    participant WaitMap as SleepMap
    participant ReadyQueue
    actor TaskSleep
    
    TaskSleep-->>Scheduler: yield Sleep(3s)
    Scheduler->>WaitMap: record TaskSleep with wake_time = now + 3s
    
    Note right of WaitMap: 3 seconds pass
    WaitMap->>Scheduler: wake TaskSleep
    Scheduler->>ReadyQueue: enqueue(TaskSleep)

Diagram D5 – Join / WaitTask

sequenceDiagram
    participant Scheduler
    participant WaitMap as JoinMap
    actor TaskA as Task_Parent
    actor TaskB as Task_Child
    
    TaskA-->>Scheduler: yield NewTask(TaskB)
    Scheduler->>TaskA: resume with TaskB.tid
    
    TaskA-->>Scheduler: yield WaitTask(TaskB)
    Scheduler->>WaitMap: record TaskA waiting on TaskB
    
    Note right of TaskB: TaskB runs and completes
    TaskB-->>Scheduler: StopIteration
    Scheduler->>WaitMap: notify TaskA to resume

✅ Agent Task Graph (Per Agent / DAG View)

Type: graph TD (or flowchart TD)
Purpose: Visualize how one FAR agent’s current task decomposes into subtasks, dependencies, joins, etc.

graph TD
    RootTask[Plan Story Code]
    Sub1[Setup Repo]
    Sub2[Create main.rs]
    Sub3[Implement main_fn]
    Sub4[Add Tests]
    Join[WaitTask: Finalize]

    RootTask --> Sub1
    RootTask --> Sub2
    Sub2 --> Sub3
    Sub3 --> Sub4
    Sub1 --> Join
    Sub4 --> Join

✅ FAR Cluster View: Distributed Agents & Scheduler

Type: graph LR or flowchart LR
Purpose: Show how agents across blades interact with a central orchestrator + local schedulers

flowchart LR
    Orchestrator[[Central Scheduler]]
    subgraph Blade A
        A1[Agent 001\n - Builder]
        SchedA[Local Scheduler]
    end
    subgraph Blade B
        A2[Agent 002\n - Tester]
        SchedB[Local Scheduler]
    end

    Orchestrator --> SchedA --> A1
    Orchestrator --> SchedB --> A2

✅ Skill Invocation Trace (ReAct-aware timeline)

Purpose: Show how a ReAct loop invokes skills with context, yields, retries, and memory access

sequenceDiagram
    actor ReActLoop
    participant SkillManager
    participant SkillEdit
    participant MemoryStore
    participant WAL

    ReActLoop->>SkillManager: invoke("Edit main.rs")
    SkillManager->>SkillEdit: run()
    SkillEdit->>MemoryStore: lookup(path="main.rs")
    MemoryStore-->>SkillEdit: file AST
    SkillEdit-->>WAL: write(PatchGenerated)
    SkillEdit-->>SkillManager: return Patch
    SkillManager-->>ReActLoop: Patch ready

✅ Agent Memory Structure (Graph + Vector)

Purpose: Show semantic and episodic memory layers, link to disk/durable store

graph TD
  MemRoot[Memory: Agent]
  Sem[Semantic Graph]
  Episodic[Task Log / WAL]
  Vectors[VectorStore - code, plans]
  GraphDB[(Neo4j)]
  VecDB[(Qdrant/Faiss)]

  MemRoot --> Sem --> GraphDB
  MemRoot --> Episodic --> WAL
  MemRoot --> Vectors --> VecDB

✅ Coroutines Timeline (Concurrent Task Steps)

Purpose: Show wall-clock execution of several tasks with yield/resume gaps

gantt
    title Task Concurrency Timeline
    dateFormat  HH:mm:ss
    section Task_Main
    run      :active, 00:00:00, 5s
    sleep    : 00:00:05, 3s
    resume   : 00:00:08, 4s
    section Task_Sub
    start    :active, 00:00:02, 3s
    block_io : 00:00:05, 4s
    finish   : 00:00:09, 2s

✅ System Call Map

Purpose: Visualize all supported system calls and their dispatch targets (WAL, IO, Scheduler)

graph LR
    YieldNewTask --> Scheduler
    YieldReadWait --> IOSelector
    YieldSleep --> TimerWheel
    YieldGetTid --> Scheduler
    YieldMemoryQuery --> MemoryStore
    YieldLog --> PAL

✅ Task WAL Replayer Flow

Purpose: Shows how the replay engine reconstructs a task from the WAL stream, rehydrating state and optionally stepping through each yield point.

flowchart TD
    WAL[WAL Stream: task_187.log] --> Parser
    Parser --> EventQueue
    EventQueue --> Scheduler
    Scheduler --> TaskReplayer
    TaskReplayer --> CallStack
    TaskReplayer --> PAL[Replay Mode: emit PAL ghost events]
    TaskReplayer --> Memory[Optional Memory Patching]

    style WAL fill:#dff,stroke:#0af
    style PAL fill:#eee,stroke:#f08
    style Memory fill:#f9f,stroke:#808

✅ Panic Propagation via Wait/Join Graph

Purpose: If a child task panics, show how the error might propagate to parent or dependents if not isolated.

graph TD
    A[Task_Main\n - tid 101]
    B[Task_Git\n - tid 202]
    C[Task_Edit\n - tid 203]
    D[Task_Tests\n - tid 204]

    A -->|JoinTask| B
    A -->|JoinTask| C
    C -->|JoinTask| D

    D -.->|panic| C
    C -.->|propagate panic| A

    style D fill:#faa,stroke:#800
    style C fill:#ffe,stroke:#aa0

✅ Skill Invocation Heatmap

Purpose: Visualize which system tools/skills are being used most frequently, based on PAL telemetry stream.

graph LR
    Skill_Edit["📝 Edit File"]:::hot
    Skill_Grep["🔍 Grep"]:::warm
    Skill_Test["✅ Run Tests"]:::hot
    Skill_Git["🌿 Git Ops"]:::warm
    Skill_Plan["🧠 Plan/Refactor"]:::cold
    Skill_Fmt["🧹 Format Code"]:::cold

    style Skill_Edit fill:#f88,stroke:#800
    style Skill_Test fill:#f88,stroke:#800
    style Skill_Grep fill:#fb6,stroke:#aa0
    style Skill_Git fill:#fb6,stroke:#aa0
    style Skill_Plan fill:#ccf,stroke:#44f
    style Skill_Fmt fill:#ccf,stroke:#44f

Tiffany ADR Index

This document tracks relevant Architectural Decision Records (ADRs) for the Tiffany agentic runtime project. ADRs are numbered sequentially and grouped by functional domain.

🗂️ Summary of Additional ADR Recommendations:

| ADR | Description | Summary | Importance |
|---|---|---|---|
| 1 | Choice of Rust Version | Use the latest stable Rust version for performance, safety, and ecosystem benefits. | High |
| 2 | Doc book Policy | Documentation standards, structure, and tooling for high-quality docs. | High |
| 3 | Task Scheduler Model | Coroutine-first, cooperative scheduling with stateful yield and resume semantics. | High |
| 4 | Agent Loop and ReAct Design | Agent lifecycle: reasoning, acting, yielding, and tool interaction. | High |
| 5 | Virtual Canvas Git Strategy | Code change tracking, patch application, and micro-commits. | Medium |
| 6 | WAL Schema and Replay Policy | Write-ahead log format, recovery semantics, and log compaction. | High |
| 7 | Process Activity Log (PAL) | Real-time activity tracking for agent tasks, complementing the WAL. | Medium |
| 8 | Persistent Agent Memory Strategy | Neo4j semantic memory, graph structure, and retrieval policies for agent continuity. | High |
| 9 | Code Structure Graph and Symbol Analysis | Graph-based code indexing and semantic AST analysis into Neo4j. | Medium |
| 10 | Task Plan Timeline and Execution Metadata | Structure for task execution metadata, plan history, and decision lineage. | Medium |
| 11 | Plugin and MCP Tooling Architecture | Integration of internal and external tools securely and generically. | High |
| 12 | Agent Skill System (Future Plan) | Representation of agent capabilities as composable skills. | Medium |
| 13 | Firecracker MicroVM Integration | VM lifecycle, execution isolation, and resource constraints. | High |
| 14 | Command Execution and Safety Policy | Shell execution flow, trust boundaries, and --yolo mode considerations. | High |
| 15 | REST/gRPC API Design | Interface contract for external task submission, result retrieval, and metadata queries. | High |
| 16 | Filesystem Socket Protocol for CLI | Local interaction protocol for invoking agent actions via socket. | Medium |
| 17 | Kubernetes Operator and CRD Design | Custom resource definitions for managing agent lifecycle and orchestration. | Medium |
| 18 | Metrics Policy and Instrumentation Plan | Prometheus metrics structure, naming conventions, and dashboard philosophy. | Medium |
| 19 | Logging Strategy and Span Hierarchy | Use of tracing, span lifecycles, and log level defaults. | Medium |
| 20 | Versioning and Release Policy | Semantic versioning, LTS channels, and changelog protocol. | Medium |
| 21 | Contributor Roles and Governance | Roles, responsibilities, PR review flow, and escalation path. | Medium |
| 22 | Backup, Disaster Recovery, Failover | Policies for data recovery, backups, failover, and explicit RTO/RPO. | Critical |
| 23 | Secrets and Credentials Management | Handling of secrets, keys, tokens, integration with vaults, and rotation policies. | Critical |
| 24 | Authentication and Authorization | Secure authentication, RBAC, OAuth/OIDC, mTLS, and trust boundaries. | Critical |
| 25 | Dependency Management and Update Policy | Management and update of dependencies, vulnerability scanning, and update policies. | High |
| 26 | Performance and Scalability Strategy | Performance benchmarks, scalability tests, and optimization strategies. | High |
| 27 | Localization and Internationalization | Handling of i18n, Unicode support, locale-aware formatting, and error handling. | Medium |
| 28 | Compliance, Auditing, Regulatory Considerations | Compliance with privacy regulations, auditing, log retention, and reporting mechanisms. | High |
| 29 | Data Retention and Privacy | Data retention policies, lifecycle management, and privacy guarantees. | High |
| 30 | Cost Management and Budgeting | Cost monitoring, cloud usage, budgeting, and cost optimization strategies. | Medium |
| 31 | Accessibility and Usability Guidelines | Ensuring interfaces are accessible, usable, and WCAG compliant. | Medium |

Choice of Rust Version

  • ADR-0001: Choice of Rust Version
    Decision to use the latest stable Rust version for performance, safety, and ecosystem benefits.

Doc book Policy

  • ADR-0002: Doc book Policy
    Documentation standards, structure, and tooling for maintaining high-quality project documentation.

🧠 Agent Architecture

  • ADR-0003: Task Scheduler Model
    Coroutine-first, cooperative scheduling with stateful yield and resume semantics.

  • ADR-0004: Agent Loop and ReAct Design
    Lifecycle of an agent: reasoning, acting, yielding, and interacting with tools.

  • ADR-0005: Virtual Canvas Git Strategy
    Strategy for code change tracking, patch application, and micro-commits.


📦 Storage & Durability

  • ADR-0006: WAL Schema and Replay Policy
    Format of write-ahead log, recovery semantics, and log compaction strategy.

  • ADR-0007: Process Activity Log (PAL)
    Real-time activity tracking for agent tasks, complementing the WAL.

  • ADR-0008: Persistent Agent Memory Strategy
    Design of Neo4j semantic memory, graph structure, and retrieval policies. Emphasize goals, prompts, plan recall, continuity between sessions.

  • ADR-0009: Code Structure Graph and Symbol Analysis
    Graph-based code indexing and semantic AST analysis into Neo4j, enabling precise symbol tracking and refactoring intelligence.

  • ADR-0010: Task Plan Timeline and Execution Metadata
    Structure for task execution metadata, including plan history and decision lineage.


🧩 Modularity & Extensibility

  • ADR-0011: Plugin and MCP Tooling Architecture
    How Tiffany integrates internal and external tools securely and generically.

  • ADR-0012: Agent Skill System (Future Plan)
    Representation of agent capabilities as composable skills.


🔐 Runtime & Execution

  • ADR-0013: Firecracker MicroVM Integration
    VM lifecycle, execution isolation, resource constraints.

  • ADR-0014: Command Execution and Safety Policy
    Shell execution flow, trust boundaries, and --yolo mode considerations.


📡 Interfacing & I/O

  • ADR-0015: REST/gRPC API Design
    Interface contract for external task submission, result retrieval, and metadata queries.

  • ADR-0016: Filesystem Socket Protocol for CLI
    Local interaction protocol for invoking agent actions via socket.

  • ADR-0017: Kubernetes Operator and CRD Design
    Custom resource definitions for managing agent lifecycle and task orchestration.


🧪 Observability

  • ADR-0018: Metrics Policy and Instrumentation Plan
    Prometheus metrics structure, naming conventions, and dashboard philosophy.

  • ADR-0019: Logging Strategy and Span Hierarchy
    Use of tracing, span lifecycles, log level defaults.


🧭 Governance & Open Source

  • ADR-0020: Versioning and Release Policy
    Semantic versioning, LTS channels, changelog protocol.

  • ADR-0021: Contributor Roles and Governance
    Roles, responsibilities, PR review flow, escalation path.


22. Backup, Disaster Recovery, and Failover Strategy

  • Policies around data recovery, backups, and failover mechanisms.
  • Clearly defines how agent memory, logs, and metadata are backed up.
  • Explicit recovery process, RTO (Recovery Time Objective), and RPO (Recovery Point Objective).

23. Secrets and Credentials Management

  • Explicit handling of secrets, keys, tokens, and credentials.
  • Integration with secure vaults (e.g., HashiCorp Vault, AWS Secrets Manager, Kubernetes Secrets).
  • Auditing access and rotation policies.

24. Authentication and Authorization

  • How FAR agents, developers, and other components authenticate securely.
  • Role-Based Access Control (RBAC), OAuth/OIDC integration, or mTLS approaches.
  • Explicitly defined trust boundaries and security contexts.

25. Dependency Management and Update Policy

  • Management and update processes for third-party libraries, Rust crates, Kubernetes components, and container images.
  • Vulnerability scanning and patching procedures.
  • Policies around dependency updates, deprecations, and removals.

26. Performance and Scalability Strategy

  • Detailed performance benchmarks, scalability tests, and capacity planning.
  • Horizontal vs. vertical scaling policies.
  • Strategies for proactive performance optimization and reactive capacity adjustments.

27. Localization and Internationalization (i18n) Policy

  • How the system will handle internationalization, localization, and multilingual capabilities.
  • Consideration of Unicode support, locale-aware formatting, and error handling.

28. Compliance, Auditing, and Regulatory Considerations

  • Compliance with privacy regulations (GDPR, CCPA), standards (SOC 2, ISO 27001).
  • Auditing policies, log retention strategies, and compliance reporting mechanisms.

29. Data Retention and Privacy

  • Explicit data retention policies, data lifecycle management, and privacy guarantees.
  • Handling and protection of personally identifiable information (PII), if relevant.

30. Cost Management and Budgeting

  • Explicit policies around cost monitoring, cloud usage, and resource budgeting.
  • Alerting on cost overruns, proactive cost management strategies, and cost optimization.

31. Accessibility and Usability Guidelines

  • Ensuring the interfaces (CLI, UI, documentation) are accessible and usable.
  • Compliance with WCAG accessibility standards.

ADR-0003: Task Scheduler Model

Status

Accepted

Context

Tiffany operates as a coroutine-oriented, agentic runtime designed for reasoning, acting, and tool orchestration. Core to this architecture is a cooperative task scheduler responsible for managing:

  • Long-running agent workflows
  • Step-wise ReAct loops
  • Concurrent tool invocations
  • Cancelable task execution

Rust’s coroutine-style state machines (resumable, poll-driven tasks; native generator syntax remains unstable as of Rust 2024) provide an ideal primitive for implementing these requirements with clarity and zero-cost abstraction.

This ADR defines our task scheduling model, the task lifecycle states, the yield/resume semantics, and how the system will structure and manage cooperative multitasking.


Decision

We will implement a coroutine-first, cooperative task scheduler with the following characteristics:

🧵 Task Type

  • All tasks will be represented by a unified trait:
trait AgentTask {
    fn poll(&mut self, ctx: &mut TaskContext) -> TaskState;
}

📐 Task States

Tasks can exist in the following explicit states:

  • Ready – enqueued for execution
  • Running – currently executing
  • Waiting – yielded for tool response or LLM
  • Completed – finished with result or error
  • Canceled – forcibly stopped

🔄 Yield/Resume Semantics

Tasks may yield cooperatively during:

  • LLM calls
  • Tool executions
  • User confirmations
  • Awaiting subprocess completion

🧰 Runtime Loop

The scheduler will (a minimal loop sketch follows this list):

  • Poll each Ready task
  • Route yielded work (e.g. to LLM executor or tool manager)
  • Queue task back when dependency resolves
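
A minimal sketch of that loop, under simplifying assumptions: FIFO only, no routing layer, and the trait plus a bare TaskState enum repeated here for self-containment. The TaskContext placeholder and the requeue behavior are illustrative, not the final design.

use std::collections::VecDeque;

struct TaskContext; // placeholder for handles to the LLM router, tool manager, etc.

enum TaskState { Ready, Running, Waiting, Completed, Canceled }

trait AgentTask {
    fn poll(&mut self, ctx: &mut TaskContext) -> TaskState;
}

fn run(mut ready: VecDeque<Box<dyn AgentTask>>, ctx: &mut TaskContext) {
    while let Some(mut task) = ready.pop_front() {
        match task.poll(ctx) {
            // Real scheduler: route the yielded work (LLM, tool, user confirmation)
            // and re-queue the task when its dependency resolves; here we just requeue.
            TaskState::Waiting => ready.push_back(task),
            // Finished or canceled tasks are dropped.
            TaskState::Completed | TaskState::Canceled => {}
            // Ready/Running: give other tasks a turn, then poll again.
            _ => ready.push_back(task),
        }
    }
}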

🧭 Goals

  • Deterministic, testable behavior
  • Serializable/resumable task state
  • Decoupled from tokio::spawn or native threads
  • Pluggable scheduling policy (FIFO, priority, dependency-aware)

Rationale

🔁 Cooperative vs Preemptive

Tiffany requires transparent control over task transitions. Preemptive systems (e.g. thread pools) make it difficult to audit agent state or model planning steps. Cooperative coroutines, by contrast, allow us to:

  • Yield at semantic boundaries
  • Inject logs and metrics at every step
  • Serialize/resume entire task graphs

🧠 Agent Design Requires Suspended Thought

An agent might:

// Illustrative pseudocode: each yield is a semantic suspension point handled by the scheduler.
let plan = yield plan_with_llm("Build test harness");
for step in plan.steps {
    yield apply_code_diff(step);
    yield confirm_with_user(step);
}

This structure is naturally represented with a coroutine and state machine — not an async future.

🧪 Testability

We can model agent execution using deterministic stepping:

let mut scheduler = TestScheduler::new();
scheduler.inject_mock_tool("ls", "result");
scheduler.step_until_idle();
assert_eq!(scheduler.task_state(task_id), TaskState::Completed);

This level of control is difficult in actor or spawn-based models.

📦 Integration Simplicity

The scheduler serves as glue between:

  • LLM router
  • Tool executor
  • WAL
  • Canvas

Having a single poll-loop mediator makes integration simpler and easier to visualize.


Consequences

  • Adds internal coroutine scheduler as a first-class subsystem
  • Task implementations will need to support resumable poll() style execution
  • Executor, LLM, and Tool interfaces will interact through message passing / callbacks with the scheduler
  • CI tests will include end-to-end scheduling tests using mocked yield points

Alternatives Considered

  • Tokio TaskPool: Too opaque for agentic step control; no built-in yield
  • Actor Model (e.g., actix): Good for I/O, but overkill for structured flows
  • futures-based step machines: Verbose, brittle, not coroutine-native


Adopted

This ADR is accepted as of June 2025. All internal workflows that require suspendable agent behavior will be modeled as AgentTasks and scheduled cooperatively.

Maintainers: @casibbald, @microscaler-team

ADR-0006: WAL Schema and Replay Policy

Status

Accepted

Context

The Write-Ahead Log (WAL) in Tiffany provides durable, append-only storage for agent activity, execution steps, and Git-integrated state transitions. It is a core component in ensuring fault tolerance, crash recovery, and task auditability.

As the Tiffany agent runtime handles multiple concurrent tasks, LLM responses, tool invocations, and virtual canvas operations, the WAL must encode enough structured information to allow:

  • Resuming interrupted tasks
  • Replaying task progress deterministically
  • Cross-checking Git commit state and canvas deltas

This ADR defines the structure, persistence policy, replay semantics, and integrity guarantees of the WAL.


Decision

We adopt a structured, JSON-encoded, append-only WAL system with the following characteristics:

WAL File Format

  • File-based, append-only per task context (e.g. wal/<uuid>.wal)
  • Each line is a valid JSON object (serde_json::to_writer)
  • Log entries are versioned via a top-level "v" field

WAL Entry Types

{
  "v": 1,
  "ts": "2025-06-27T15:48:00Z",
  "type": "InstructionStart",
  "task_id": "uuid",
  "instruction": "Add validation to model.rs"
}

Supported entry types (sketched as a tagged enum after this list):

  • InstructionStart — original user input
  • LLMPlan — response from LLM with plan steps
  • StepStart — beginning of a plan step
  • ToolResult — tool output or error
  • PatchPreview — virtual diff preview
  • Commit — Git commit metadata (SHA, message)
  • TaskComplete — signal of end of task (with success/failure)
  • ShutdownMarker — graceful termination marker
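
A minimal sketch of these entry types as a serde-tagged enum, so each WAL line carries the "type" discriminator shown in the JSON example above. This is an assumption for illustration: payload field names beyond that example (such as success) are hypothetical, and most variants are elided.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
#[serde(tag = "type")]
enum WalEntry {
    InstructionStart { v: u32, ts: String, task_id: String, instruction: String },
    TaskComplete { v: u32, ts: String, task_id: String, success: bool },
    // The remaining variants (LLMPlan, StepStart, ToolResult, PatchPreview,
    // Commit, ShutdownMarker) would follow the same shape with their own payloads.
}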

Storage Model

  • Each agent task has a dedicated WAL file
  • Optionally streamed to external durable store (e.g. S3, versioned blob)
  • Index is built in-memory on load for efficient lookup

Flush Policy

  • WAL entries are fsynced after each write
  • Entries are buffered briefly for I/O optimization, but are never held only in memory (a minimal write-path sketch follows)
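
A minimal write-path sketch (an assumption, not the final writer API): serialize one entry per line with serde_json and fsync before acknowledging, matching the policy above.

use std::fs::OpenOptions;
use std::io::Write;

fn append_entry(path: &str, entry: &impl serde::Serialize) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    serde_json::to_writer(&mut file, entry)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?;
    file.write_all(b"\n")?; // one JSON object per line
    file.sync_all()?;       // fsync after each write
    Ok(())
}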

Replay Semantics

On startup or recovery, the WAL replay engine (a minimal scan sketch follows this list):

  • Reads all entries into a memory index
  • Validates log ordering and checks for TaskComplete or ShutdownMarker
  • If incomplete, reconstructs TaskState:
    • Current plan
    • Last confirmed step index
    • Last Git SHA
    • Virtual canvas snapshot (if recoverable)
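
A minimal scan sketch (an assumption for illustration): read a task’s WAL line by line and decide whether it terminated cleanly or must be resumed. Full state reconstruction would additionally fold each entry into the TaskState fields listed above.

use std::fs::File;
use std::io::{BufRead, BufReader};

fn needs_resume(path: &str) -> std::io::Result<bool> {
    let reader = BufReader::new(File::open(path)?);
    let mut complete = false;
    for line in reader.lines() {
        let entry: serde_json::Value = serde_json::from_str(&line?)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?;
        // A TaskComplete or ShutdownMarker entry means no recovery is required.
        if matches!(entry["type"].as_str(), Some("TaskComplete" | "ShutdownMarker")) {
            complete = true;
        }
    }
    Ok(!complete)
}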

Recovery Modes

  • resume: continue from last known state
  • inspect: dry-run and summarize partial progress
  • replay: re-run steps for reproducibility / audit

WAL Lifecycle Sequence Diagram

sequenceDiagram
    participant User
    participant Agent
    participant WAL
    participant LLM
    participant Tool
    participant Canvas
    participant Git
    participant ReplayEngine

    User->>Agent: Submit task ("Add error handling")
    Agent->>WAL: Write InstructionStart
    Agent->>LLM: Plan steps
    LLM-->>Agent: Return plan
    Agent->>WAL: Write LLMPlan

    loop For each step
        Agent->>WAL: Write StepStart
        Agent->>Tool: Execute action
        Tool-->>Agent: Output result
        Agent->>WAL: Write ToolResult
        Agent->>Canvas: Apply virtual patch
        Canvas-->>Agent: Preview diff
        Agent->>WAL: Write PatchPreview
        Agent->>Git: Commit via git2-rs
        Git-->>Agent: Commit SHA
        Agent->>WAL: Write Commit (with SHA, step index)
    end

    Agent->>WAL: Write TaskComplete
    Agent-->>User: Respond with summary

    Note over ReplayEngine: On Recovery or Restart
    ReplayEngine->>WAL: Read task WAL file
    WAL-->>ReplayEngine: JSON entries
    ReplayEngine->>ReplayEngine: Rebuild plan, canvas, Git context
    ReplayEngine->>Agent: Resume at next step or verify TaskComplete

Rationale

🔐 Durability & Crash Safety

  • WAL ensures no task progress is lost
  • Every instruction, LLM plan, tool call, and Git commit is accounted for

🧪 Reproducibility

  • WAL enables deterministic re-execution of tasks
  • Allows audit of LLM decisions, tool usage, and user confirmations

🧰 Debug & Forensics

  • WAL becomes a source of truth in debugging stuck or incomplete tasks
  • WAL diffs explain when/why things diverged from expected flow

⚙️ Integration with Canvas & Git

  • Each canvas commit references its corresponding WAL entry
  • Reconstructing canvas state on restart uses last committed patch snapshot

Consequences

  • WAL writer must be thread-safe and coroutine-aware
  • Every AgentTask must emit structured WAL entries
  • On crash or power loss, agent must recover cleanly to last Commit or PatchPreview
  • cargo test must include WAL replay + recovery unit tests

Alternatives Considered

  • Single journal file for all tasks: Difficult to parallelize, isolate, or replay
  • Database-backed log: Overkill, less portable, breaks GitOps principle
  • Binary format: More compact, but less debuggable and interoperable than JSON


Adopted

This ADR is accepted as of June 2025. All agent tasks will be tracked through structured, append-only WAL files to ensure traceability, auditability, and deterministic task recovery.

Maintainers: @casibbald, @microscaler-team

ADR-0007: Process Activity Log (PAL) and Real-Time Status Stream

Status

Accepted

Context

While the Write-Ahead Log (WAL) guarantees task durability and auditability in Tiffany, it is not suited for real-time observability. Users and dashboards need visibility into what the system is currently doing — not just what it has already persisted.

To address this, we introduce the Process Activity Log (PAL), a real-time, ephemeral streaming mechanism for broadcasting the active state of each agent task as it progresses through its coroutine lifecycle.

The PAL augments the WAL by offering live introspection into step-level execution, LLM calls, tool activity, patching, and user confirmation — enabling powerful local CLI feedback, remote dashboards, and multi-user observability.

PALs are uncommon in most systems. However, in the context of an agentic runtime with coroutine yield points, explicit steps, and Git-integrated patching, live process state becomes critically important for:

  • Progress visibility during long-running operations
  • Safe human-in-the-loop oversight
  • Remote system administration and dashboarding
  • Collaborative multi-user environments

Decision

Tiffany will maintain a separate PAL channel per running task, broadcasting structured ActivityEvent objects to all connected consumers (CLI, dashboards, logs).

PAL Characteristics

  • Non-durable, ephemeral (not fsynced or written to disk)
  • Optimized for streaming, not audit
  • Broadcast to:
    • Local CLI (via Unix domain socket)
    • Remote UI (via WebSocket/gRPC stream)
    • Logging adapter (optional)
    • Pub/Sub (e.g., Redis Streams, NATS, Fluvio, etc.)

Event Schema

{
  "ts": "2025-06-27T16:12:00Z",
  "task_id": "uuid",
  "stage": "ExecutingTool",
  "message": "Running shell: cargo test"
}

PAL Event Types

  • ReceivedInstruction
  • WaitingForLLM
  • ToolExecutionStart
  • ToolExecutionComplete
  • PreviewingPatch
  • AwaitingUserApproval
  • CommittingPatch
  • Idle
  • Completed

PAL Channel Infrastructure

  • Each Tiffany runtime instance will maintain an async broadcast stream (tokio::broadcast) for each task (a minimal sketch follows this list)
  • The CLI and UI connect as subscribers, filtered by task ID
  • Internal PAL server optionally bridges these into REST, WebSocket, or external pub/sub systems
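
A minimal sketch of the per-task broadcast channel with two independent subscribers. This is an assumption for illustration (it requires tokio with its full feature set; the ActivityEvent struct mirrors the event schema above but is not the final type):

use tokio::sync::broadcast;

#[derive(Clone, Debug)]
struct ActivityEvent {
    ts: String,
    task_id: String,
    stage: String,
    message: String,
}

#[tokio::main]
async fn main() {
    // One channel per running task; bounded capacity means slow consumers
    // observe lag rather than blocking the sender.
    let (tx, mut cli_rx) = broadcast::channel::<ActivityEvent>(256);
    let mut ui_rx = tx.subscribe(); // a second, independent consumer

    tx.send(ActivityEvent {
        ts: "2025-06-27T16:12:00Z".into(),
        task_id: "uuid".into(),
        stage: "ExecutingTool".into(),
        message: "Running shell: cargo test".into(),
    })
    .ok(); // send only fails if every receiver has been dropped

    println!("cli saw: {:?}", cli_rx.recv().await);
    println!("ui saw:  {:?}", ui_rx.recv().await);
}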

Sequence Diagram: PAL Streaming in Action

sequenceDiagram
    participant User
    participant CLI
    participant UI
    participant Agent
    participant PAL
    participant WAL
    participant PubSub

    User->>CLI: tinkerbell submit "Add validation to model.rs"
    CLI->>Agent: Submit instruction
    Agent->>WAL: Log InstructionStart
    Agent->>PAL: Emit ReceivedInstruction
    PAL-->>CLI: Stream activity
    PAL-->>UI: WebSocket push
    PAL-->>PubSub: Fan-out event (optional)

    Agent->>PAL: Emit WaitingForLLM
    Agent->>LLM: Generate plan
    LLM-->>Agent: Return plan
    Agent->>PAL: Emit ToolExecutionStart ("Running shell tool")
    Agent->>Tool: Execute shell script
    Tool-->>Agent: Result
    Agent->>PAL: Emit ToolExecutionComplete
    Agent->>Canvas: Apply patch
    Agent->>PAL: Emit PreviewingPatch
    Agent->>User: Confirm patch
    User-->>Agent: Approve
    Agent->>PAL: Emit CommittingPatch
    Agent->>Git: Commit
    Agent->>PAL: Emit Completed
    Agent->>WAL: Log TaskComplete

Rationale

✅ Real-Time Feedback for Humans

Unlike WALs — which are audit-focused and write-after-the-fact — the PAL is designed to describe what the agent is doing right now.

This enables:

  • CLI progress bars, spinners, and log overlays
  • Remote dashboards that track multiple agents in real time
  • Streaming logs with contextual status annotations
  • System admins to monitor all active workloads in the FAR runtime

📟 Pub/Sub Observability

PAL emits can be optionally bridged to:

  • Redis Streams
  • NATS subjects
  • Kafka partitions

This enables:

  • Multiple UIs to subscribe to live activity
  • Alerting and event-driven monitoring pipelines
  • Prometheus exporters to map PAL events to gauge metrics

🧩 Complements WAL without Overlapping It

| Feature | WAL (Write-Ahead Log) | PAL (Process Activity Log) |
|---|---|---|
| Format | Append-only, structured JSON | Broadcast JSON, ephemeral |
| Scope | Durable system of record | Live visibility per active task |
| Persistence | Filesystem | Memory / socket / stream |
| Usage | Replay, recovery, audit | CLI/UI visibility, status updates |
| Timing | After the fact | Concurrent with step execution |

Consequences

  • Introduces a lightweight PAL layer into each AgentTask
  • Developers must emit PAL signals at major yield boundaries
  • CLI and UI must support PAL subscription and display logic
  • Integration tests will include PAL mock subscribers for validation
  • PAL memory usage must be bounded (e.g. backpressure if no consumer)

Alternatives Considered

  • Extending WAL with ephemeral events – Pollutes durability layer; delays update timing
  • Relying on console logs – Not structured, not streamable, not indexable by UI
  • Polling task status – Inefficient and fails in multi-user concurrent settings


Adopted

This ADR is accepted as of June 2025. All running agent tasks will emit real-time PAL events through an ephemeral pub-sub system to enable responsive CLI feedback, multi-user UI dashboards, and remote introspection.

Maintainers: @casibbald, @microscaler-team

ADR-0019: Logging Strategy and Span Hierarchy

Status

Proposed


Context

Effective logging is vital for the observability, debugging, and operational transparency of autonomous agentic runtimes. Without structured logging and detailed execution tracing, debugging complex distributed operations becomes difficult, leading to increased operational overhead and reduced system reliability.

Common issues in logging for complex systems include:

  • Unstructured logs: Difficult to parse or correlate.
  • Lack of contextual tracing: Limited ability to trace execution flow or diagnose issues.
  • Inconsistent log levels and verbosity: Excessive noise or insufficient detail for meaningful debugging.

To address these, Tiffany requires a rigorous, structured logging strategy that leverages Rust’s advanced tracing framework, clear span hierarchies, structured logs, and consistent log level defaults.


Decision

We adopt the Rust tracing crate for structured logging and detailed tracing within Tiffany FAR infrastructure, explicitly defining span hierarchies, structured log events, and log level policies.

Core Components:

  • Structured Logging: Use of tracing with structured, JSON-formatted log outputs.
  • Detailed Span Hierarchy: Explicit tracing spans for every significant operation (task execution, VM lifecycle events, API calls).
  • Consistent Log Levels: Clearly defined log level defaults (ERROR, WARN, INFO, DEBUG, TRACE).

Technical Implementation

1. Structured Logging with Rust tracing

Tiffany will adopt Rust’s tracing crate, providing structured logs for better parseability and integration with modern log management systems.

Log event format (JSON example):

{
  "timestamp": "2025-06-28T10:00:00Z",
  "level": "INFO",
  "target": "tinkerbell::orchestrator",
  "message": "Task execution started",
  "fields": {
    "task_id": "task-12345",
    "agent_id": "agent-67890",
    "node": "blade-1"
  }
}
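
A minimal initialization sketch (an assumption; it relies on the tracing_subscriber crate with its json feature) showing JSON-formatted output and a structured event like the one above:

use tracing::{info, Level};

fn main() {
    // Emit one JSON object per log line, matching the format sketched above.
    tracing_subscriber::fmt()
        .json()
        .with_max_level(Level::INFO)
        .init();

    info!(
        task_id = "task-12345",
        agent_id = "agent-67890",
        node = "blade-1",
        "Task execution started"
    );
}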

2. Explicit Span Hierarchy

Spans explicitly represent the hierarchy and nesting of tasks, operations, and system interactions.

Example span hierarchy structure:

Agent Task Execution [trace_id=abc123]
├── Orchestrator Scheduling
│   └── Resource Allocation
├── FAR Agent VM Lifecycle
│   ├── VM Initialization
│   ├── Capability Execution
│   │   └── LLM Interaction
│   └── VM Termination
└── Result Reporting

3. Span Lifecycle Management

Span lifecycles are strictly managed using Rust’s tracing macros and guards (span!, .enter(), and the guard’s drop at scope exit), clearly delineating entry and exit points:

Rust tracing code example:

use tracing::{span, Level};

fn execute_task(task_id: &str) {
    let task_span = span!(Level::INFO, "task_execution", task_id = task_id);
    let _enter = task_span.enter();

    schedule_task();
    execute_vm_lifecycle();
    report_results();
}

fn execute_vm_lifecycle() {
    let vm_span = span!(Level::DEBUG, "vm_lifecycle");
    let _enter = vm_span.enter();

    vm_initialization();
    capability_execution();
    vm_termination();
}

4. Log Level Defaults

Explicit guidelines on log levels:

| Level | Use-case Example | Retention Policy |
|---|---|---|
| ERROR | Critical failures requiring immediate attention | High priority |
| WARN | Issues potentially impacting normal operation | High-medium priority |
| INFO | General operational events and high-level status | Medium priority |
| DEBUG | Detailed events useful for debugging | Short-term, detailed |
| TRACE | Highly detailed events, rarely enabled in production | Temporary debugging |

Default log levels: INFO in production, DEBUG in staging, TRACE optionally during local development.


📊 Diagram: Span Hierarchy Visualization

graph TD
  Task[Task Execution - INFO]
  Task --> Scheduling[Orchestrator Scheduling - INFO]
  Scheduling --> Allocation[Resource Allocation - DEBUG]

  Task --> VM[FAR Agent VM Lifecycle - INFO]
  VM --> VM_Init[VM Initialization - DEBUG]
  VM --> Capability[Capability Execution - DEBUG]
  Capability --> LLM_Interaction[LLM Interaction - TRACE]
  VM --> VM_Termination[VM Termination - DEBUG]

  Task --> Reporting[Result Reporting - INFO]

🔄 Sequence Diagram: Logging and Span Lifecycle

sequenceDiagram
  participant FARC as FAR Controller
  participant Orchestrator
  participant MicroVM as MicroVM (FAR Agent)
  participant PALWAL as PAL/WAL
  participant LogCollector as Log Collector (Elastic/Loki)

  FARC->>LogCollector: Log (INFO): "Task execution started"
  Orchestrator->>LogCollector: Log (DEBUG): "Scheduling task, resource allocation started"

  Orchestrator->>MicroVM: Start MicroVM
  MicroVM->>LogCollector: Log (INFO): "VM initialization successful"
  MicroVM->>LogCollector: Log (TRACE): "LLM interaction started"
  MicroVM->>LogCollector: Log (DEBUG): "Capability execution completed"

  MicroVM->>Orchestrator: Task completion
  Orchestrator->>LogCollector: Log (INFO): "Result reporting complete"

  FARC->>PALWAL: Record final task state

🎯 Rationale for Chosen Approach

  • Structured & Parseable Logs: Enhanced operational visibility, easier integration with log aggregation solutions.
  • Detailed Tracing and Debugging: Explicit span hierarchies enable precise debugging and operational insights.
  • Clear Log Level Policy: Reduces noise and improves the clarity and signal-to-noise ratio of logs.

🚨 Consequences and Trade-offs

  • Complexity of Span Management: Developers must adhere strictly to span creation and lifecycle rules.
  • Log Volume and Storage: Potential for large log volumes, especially with DEBUG or TRACE enabled.
  • Operational Overhead: Requires monitoring of log volumes and retention policies to manage costs.

✅ Alternatives Considered and Dismissed

  • Unstructured Logging: Lack of parseability and reduced debugging capability.
  • Using log crate: Lacks structured tracing capabilities and span support of tracing.
  • Third-party proprietary logging tools: Reduced control and higher cost; less customization potential.

📌 Implementation Recommendations:

  • Create a dedicated Logging Style Guide document specifying tracing usage patterns.
  • Use automated linting to enforce span hierarchy and log level conventions during CI/CD pipelines.
  • Periodically review log metrics (volume, retention, levels) to manage operational overhead.

🚀 Industry-Leading Capabilities

This structured logging and tracing strategy significantly enhances Tiffany’s ability to manage operational complexity, rapidly diagnose issues, and maintain transparency, positioning it as a leader in observability and autonomous agentic infrastructure management.


📊 Next Steps:

Upon confirmation, this ADR is ready for final review, acceptance, and detailed implementation.