
Session Reconnect

WebSocket connections drop. Agents keep running. Here's how reconnection works.

Key insight: The agent thread and its IO queues survive the WebSocket. When a client reconnects, the same queues are reattached to the new connection. The agent never knows the difference.

Architecture

Two layers handle session survival:

Two-layer session storage

┌─────────────────────────────────────┐
│ In-Memory (ActiveSessionRegistry)   │
│ Running agents, IO queues, threads  │
│ Cleaned after 10min idle            │
└──────────────┬──────────────────────┘
               │ save on completion
┌──────────────▼──────────────────────┐
│ Disk (.co/session_results.jsonl)    │
│ Final results for polling recovery  │
│ Expires after 24h                   │
└─────────────────────────────────────┘

In-Memory

Keeps the agent thread and IO queues alive so a reconnecting client resumes mid-execution.

Disk (JSONL)

Stores final results so a client that never reconnects can poll later.

Session Lifecycle

State transitions

register()
    │
    ▼
 RUNNING ──────────────────────► COMPLETED
    │            agent finishes       │
    │                                 │
    ▼ client disconnects              │ 10min idle
 SUSPENDED                           ▼
    │                              REMOVED
    │ client reconnects
    ▼
 RUNNING (same IO queues)
| Transition | Trigger | What happens |
|---|---|---|
| → RUNNING | register() | Agent thread spawned, IO queues created |
| → SUSPENDED | Client WebSocket drops | Agent keeps running, queues buffer events |
| → RUNNING | Client reconnects (same session_id) | Same IO queues reattached to new WebSocket |
| → COMPLETED | Agent finishes | Result saved to JSONL, session stays in memory |
| → REMOVED | 10min idle (no client ping) | Freed from memory |

Reconnection Flow

Timeline: connect → disconnect → reconnect → finish

Time   Client              WebSocket Handler    Agent Thread
────   ──────              ─────────────────    ────────────
T+0    INPUT ─────────────► accept
                            register()
                            spawn thread ───────► agent.input() starts

T+5                        ◄─────────────────── io.send(thinking)
       ◄── thinking ────────

T+15                       ◄─────────────────── io.send(approval_needed)
       ◄── approval_needed─                     io.receive() BLOCKS
                                                 waiting for response...

T+20   ✕ DISCONNECT         mark_suspended()
                            (queues stay alive)   (still blocked)

T+25   RECONNECT ──────────► registry.get() → FOUND
                             drain queued events
       ◄── queued events ───
                             update_ping()
                             pump same IO queues
       approve ────────────► io._incoming.put() ► io.receive() unblocks
                                                   agent continues...

T+35                        ◄─────────────────── agent finishes
                             mark_completed()
                             save to JSONL
       ◄── OUTPUT ──────────

What happened: Agent asked for approval at T+15, blocked waiting. Client disconnected at T+20 — agent stayed blocked, events buffered. Client reconnected at T+25 — got buffered events, sent approval. Agent unblocked and finished normally.

IO Queue Bridge

The agent runs in a sync thread. The WebSocket handler is async. Two thread-safe queues bridge them:

WebSocketIO — async/sync bridge

┌───────────────────┐          ┌───────────────────┐
│  Agent Thread      │          │  WebSocket Handler │
│  (sync Python)     │          │  (async ASGI)      │
│                    │          │                    │
│  io.send(event) ──►│─outgoing─│►── ws.send(event)  │
│                    │  queue   │                    │
│  io.receive()  ◄──│─incoming─│◄── ws.receive()    │
│  (blocks)          │  queue   │                    │
└───────────────────┘          └───────────────────┘

On disconnect

io.close() sets _closed = True and puts a sentinel in the incoming queue, unblocking any waiting receive(). After close, io.send() silently drops events.

On reconnect

The same io object is reused. A new WebSocket handler pumps the same queues. Caveat: IO must be reopened (_closed = False) for the agent to send again.
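The bridge and its close/reopen behavior can be sketched like this. A minimal sketch, assuming the queue-and-sentinel design described above; the real `WebSocketIO` lives in `network/io/websocket.py` and its exact API may differ.

```python
import queue

_SENTINEL = {"type": "io_closed"}

class WebSocketIO:
    """Thread-safe bridge between the sync agent thread and the
    async WebSocket handler (sketch)."""

    def __init__(self):
        self._outgoing: queue.Queue = queue.Queue()  # agent -> client
        self._incoming: queue.Queue = queue.Queue()  # client -> agent
        self._closed = False

    def send(self, event: dict) -> None:
        # After close(), events are silently dropped.
        if not self._closed:
            self._outgoing.put(event)

    def receive(self) -> dict:
        # Blocks the agent thread until the client responds
        # (or until close() pushes the sentinel).
        return self._incoming.get()

    def close(self) -> None:
        self._closed = True
        self._incoming.put(_SENTINEL)  # unblock any waiting receive()

    def reopen(self) -> None:
        # Called on reattach so the agent can send again.
        self._closed = False
```

Forgetting the `reopen()` step is exactly the reattach bug described in the Known Issue section below: the agent keeps running but every `send()` is dropped.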

Keep-Alive

Server sends PING every 30s. Client responds with PONG. Each message updates last_ping in the registry.

PING/PONG heartbeat

Client                    Server
  │                         │
  │◄──── PING ──────────────│  every 30s
  │───── PONG ─────────────►│  update last_ping
  │                         │
  │◄──── PING ──────────────│
  │───── PONG ─────────────►│  update last_ping
  │                         │
  │  ✕ disconnect            │
  │                         │  last_ping freezes
  │                         │  idle timer starts
  │                         │  ...
  │                         │  10min idle → cleanup
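The heartbeat loop above amounts to a periodic send plus a timestamp refresh. A sketch under stated assumptions: `ws_send` is a hypothetical async send callable, and the session is modeled as a plain dict; the real handler's names differ.

```python
import asyncio
import time

PING_INTERVAL = 30  # seconds, per the docs

async def heartbeat(ws_send, interval: float = PING_INTERVAL) -> None:
    """Send PING every `interval` seconds until the socket dies."""
    while True:
        await asyncio.sleep(interval)
        try:
            await ws_send({"type": "PING"})
        except ConnectionError:
            return  # socket gone: last_ping freezes, idle timer runs

def on_pong(session: dict) -> None:
    # Any PONG (or other client message) refreshes the idle timer.
    session["last_ping"] = time.monotonic()
```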

Session Cleanup

One rule for all non-running sessions:

Cleanup rule

             status != 'running'
             AND idle > 10min
                   │
                   ▼
          ┌────────────────┐
          │ REMOVE from    │
          │ registry       │
          │ (memory freed) │
          └────────────────┘

No special cases. Completed, suspended — same rule.

Results already on disk. JSONL storage has the final result.

Client can still poll. GET /sessions/{id} works for 24h.

Background job. Runs every 60s to sweep expired sessions.

Recovery Without Reconnect

If the client never comes back:

Polling recovery after disconnect

Client gone                 Server
                              │
                              │  agent finishes
                              │  save result to .co/session_results.jsonl
                              │  mark_completed()
                              │
                              │  ... 10min idle ...
                              │
                              │  cleanup_expired() → removed from memory
                              │
                              │  (result still on disk for 24h)
                              │
Client returns (hours later)  │
  │                           │
  │── GET /sessions/{id} ────►│  read from JSONL
  │◄── result ────────────────│

No data loss. The JSONL file is the durable record.

Session Merge

When a client reconnects and both sides have session state, merge_sessions() resolves the conflict using iteration count (incremented on each LLM call):

Iteration-based conflict resolution

Client (stale)              Server (continued)
iteration: 5                iteration: 10
    │                           │
    └───────────┬───────────────┘
                │ merge_sessions()
                ▼
          server wins (higher iteration)
          → use server session state
| Scenario | Resolution |
|---|---|
| Server continued (iteration 10 vs 5) | Server wins |
| Client newer (iteration 8 vs 3) | Client wins |
| Tie (same iteration) | Higher timestamp wins |
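The resolution logic above is a two-key comparison. A sketch, assuming session state carries `iteration` and `timestamp` fields; the real `merge_sessions()` in `network/host/session/merge.py` operates on full session objects.

```python
def merge_sessions(client: dict, server: dict) -> dict:
    """Pick the session state to keep after a reconnect conflict:
    higher iteration wins; on a tie, the newer timestamp wins."""
    if client["iteration"] != server["iteration"]:
        return client if client["iteration"] > server["iteration"] else server
    return client if client["timestamp"] > server["timestamp"] else server
```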

Server Console Output

The WebSocket handler prints structured status lines to the server console. Designed for quick scanning: routine messages are compact, data flow events are indented sub-lines.

Connection lifecycle

⚡ ws+ 127.0.0.1 (0 active)        # new WebSocket, show session count
✓ CONNECT identity=0x2f3d... session=aad5... status=new
✓ INPUT identity=0x2f3d... session=aad5... prompt=hello world...
⚡ ws- (1 active)                    # disconnect, remaining sessions

Data flow visibility — when client data is used

✓ CONNECT identity=0x2f3d... session=aad5... status=connected
  ↑ client session: 4 messages       # client sent history
  ↕ merged sessions (server newer)   # server had newer data

✓ CONNECT identity=0x2f3d... session=aad5... status=executing
  ↻ reattaching to running agent     # reconnecting mid-execution

✓ INPUT identity=0x2f3d... session=aad5... prompt=analyze this...
  ↑ 2 images, 1 files                # client sent attachments

Suppressed

CONNECT, INPUT, SESSION_STATUS, PONG — these have their own status lines.

Still logged

ADMIN_*, ONBOARD_SUBMIT, and unexpected types print ← WS recv:.

Known Issue: Reconnect During Approval

When a client refreshes while the agent is blocked waiting for approval (e.g., bash tool), reconnection fails. Three bugs compound:

Bug chain: refresh during approval

T+0    Agent sends approval_needed, blocks on io.receive()
T+5    Client refreshes → WebSocket disconnects
       → io.close() puts sentinel in io._incoming
       → io.receive() unblocks with {"type": "io_closed"}
       → Agent treats as "connection closed" error
       → run_agent() has NO try/finally
         → agent_finished.set() NEVER fires
         → _pipe_ws_io hangs forever

T+10   New WebSocket connects → CONNECT { session_id }
       → registry.get() finds session, still 'executing'
       → Reattach: uses SAME io object
       → BUT io._closed = True → io.send() drops all events
       → Agent can't send to new client

run_agent() has no error handling. If the agent crashes, agent_finished.set() never fires, and _pipe_ws_io hangs forever waiting.

Reattach uses closed IO. On reconnect, the server reattaches to the old io object with _closed = True. io.send() silently drops all events.

Two _pipe_ws_io loops compete. The old loop (stuck) and the new loop (from reattach) both reference the same agent_finished event.

Fix plan

1. run_agent(): wrap in try/finally — always set agent_finished, capture error in error_holder.

2. Reattach: reopen IO — reset io._closed = False so agent can send events through new WebSocket.

3. Old _pipe_ws_io: detect superseded — when new connection reattaches, old pipe should exit cleanly.
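Fix 1 is the critical one. A sketch of the try/finally shape, using the names from this doc (`run_agent`, `agent_finished`, `error_holder`); the actual signature of `run_agent()` is an assumption.

```python
import threading

def run_agent(agent_call, io, agent_finished: threading.Event,
              error_holder: list) -> None:
    """Run the agent to completion, always signaling the pipe loop.
    `agent_call` stands in for the agent's blocking entry point."""
    try:
        result = agent_call()
        io.send({"type": "output", "result": result})
    except Exception as exc:
        error_holder.append(exc)  # surface the failure to the handler
    finally:
        agent_finished.set()      # fires on success AND on crash
```

With this in place, `_pipe_ws_io` can always observe `agent_finished` and exit, whether the agent succeeded, crashed, or hit a closed IO.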

Key Files

| File | Role |
|---|---|
| network/host/session/active.py | ActiveSessionRegistry — in-memory session tracking |
| network/io/websocket.py | WebSocketIO — queue bridge between async/sync |
| network/host/session/storage.py | SessionStorage — JSONL persistence |
| network/host/session/merge.py | Session merge conflict resolution |
| network/asgi/websocket.py | WebSocket handler — orchestrates reconnection |
