Hermes Subagent Delegation and Failures

Generated by AI

2026.05

Here’s how subagents work in Hermes and what happens to them when the network fails:

How Subagent Delegation Works

Architecture (tools/delegate_tool.py):

Parent Agent (your main conversation)
  └─ delegate_task tool call
       ├─ [subagent-0] ── ThreadPoolExecutor worker ── child.run_conversation()
       ├─ [subagent-1] ── ThreadPoolExecutor worker ── child.run_conversation()
       └─ [subagent-2] ── ThreadPoolExecutor worker ── child.run_conversation()
            (up to max_concurrent_children, default 3)

Each child gets:

Fresh AIAgent instance (no parent conversation history)
Own task_id (own terminal session, file ops cache)
Restricted toolsets (delegate_task, clarify, memory, send_message, execute_code are blocked)
Own iteration budget (default 50 iterations, configurable via delegation.max_iterations)
Focused system prompt built from the goal + context
Shared credential pool (children can rotate keys on rate limits)

Children run in a ThreadPoolExecutor. The parent blocks until all complete (up to child_timeout_seconds, default 600s). The parent only sees the final summary — never intermediate tool calls.
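The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/delegate_tool.py: run_child is a hypothetical stand-in for child.run_conversation(), delegate is a made-up name, and the timeout is applied as one shared deadline for the whole batch, which is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, wait

MAX_CONCURRENT_CHILDREN = 3   # default quoted in the text
CHILD_TIMEOUT_SECONDS = 600   # default quoted in the text


def run_child(goal):
    """Hypothetical stand-in for child.run_conversation(); returns a summary."""
    return {"goal": goal, "summary": f"done: {goal}"}


def delegate(goals, runner=run_child, timeout=CHILD_TIMEOUT_SECONDS,
             max_workers=MAX_CONCURRENT_CHILDREN):
    """Run one child per goal in parallel; block until all finish or time out."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = [pool.submit(runner, goal) for goal in goals]
        done, _ = wait(futures, timeout=timeout)  # the parent blocks here
        results = []
        for goal, future in zip(goals, futures):
            if future not in done:
                results.append({"goal": goal, "failed": True, "error": "timeout"})
            elif future.exception() is not None:
                results.append({"goal": goal, "failed": True,
                                "error": str(future.exception())})
            else:
                # Only the final summary reaches the parent.
                results.append(future.result())
        return results
    finally:
        # Threads cannot be killed; a timed-out worker keeps running in the
        # background, but queued work that never started is cancelled.
        pool.shutdown(wait=False, cancel_futures=True)
```

Calling delegate(["a", "b", "c"]) returns one entry per goal, in order, with failed children represented as error entries rather than raised exceptions.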

Network Failure Handling — Two Layers

Your log shows stream drops from xiaomi. Here’s exactly what’s happening:

Layer 1: Stream-level retry (inside _interruptible_streaming_api_call)

In agent/chat_completion_helpers.py, the streaming call runs in a background thread with a polling loop on the main thread. Three failure detectors:

Stale stream detector — if no chunks arrive within HERMES_STREAM_STALE_TIMEOUT (default 180s), kills the connection and lets the retry loop reconnect. Scales up for large contexts (>100k tokens → 300s, >50k → 240s).
Mid-tool-call stream drop — if the stream dies while a tool call JSON is being streamed AND the error is transient (ReadTimeout, connection reset, SSE “Network connection lost”), it silently retries up to _max_stream_retries. It resets all accumulators, rebuilds the OpenAI client connection pool, and starts fresh. The user sees “⚠ xiaomi stream drop (ReadTimeout) after 122.6s — reconnecting, retry 2/3”.
Pre-delivery stream drop — if the stream dies before any tokens were delivered, same retry logic. After retries are exhausted, it propagates the error to the outer retry loop.
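A rough sketch of the stream-level behaviour, under stated assumptions: the function names are hypothetical, the built-in TimeoutError and ConnectionResetError stand in for the client library's ReadTimeout and reset errors, "3" is read as three attempts in total, and the way the env override combines with context scaling (taking the larger value) is a guess rather than something the text confirms.

```python
import os

MAX_STREAM_ATTEMPTS = 3  # assumed from the "retry 2/3" log line
TRANSIENT = (TimeoutError, ConnectionResetError)  # stand-ins for ReadTimeout etc.


def stale_timeout_seconds(context_tokens, env=os.environ):
    """Pick the stale-stream timeout, raised for large contexts."""
    base = float(env.get("HERMES_STREAM_STALE_TIMEOUT", 180))
    if context_tokens > 100_000:
        return max(base, 300.0)
    if context_tokens > 50_000:
        return max(base, 240.0)
    return base


def stream_with_retries(open_stream, rebuild_client,
                        max_attempts=MAX_STREAM_ATTEMPTS):
    """Collect a streamed reply; on a transient drop, start over from scratch."""
    for attempt in range(1, max_attempts + 1):
        chunks = []  # accumulators are reset on every attempt
        try:
            for chunk in open_stream():
                chunks.append(chunk)
            return "".join(chunks)
        except TRANSIENT:
            if attempt == max_attempts:
                raise  # retries exhausted: hand off to the outer loop
            rebuild_client()  # drop dead connections before reconnecting
```

Partial output from a dropped attempt is discarded, so the caller only ever sees a complete reply or the final exception.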

Layer 2: Conversation-level retry (conversation_loop.py)

When the streaming layer exhausts its retries and raises, the outer loop catches the error:

Invalid/empty responses → retry with jittered backoff (5s base, 120s cap)
API errors (429, 500, 502, 503, 504, 524) → classified by error_classifier.py, retry with backoff
Credential exhaustion → tries credential pool rotation, then fallback provider
Provider fallback chain → if xiaomi keeps failing and fallback_model is configured, the loop switches to the fallback model
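The backoff rule can be illustrated with a small sketch. The text only gives the 5s base, the 120s cap, and the list of status codes; exponential growth with full jitter is an assumption here, and backoff_delay and is_retryable are hypothetical names, not functions from conversation_loop.py or error_classifier.py.

```python
import random

RETRYABLE_STATUS = {429, 500, 502, 503, 504, 524}  # codes listed in the text


def backoff_delay(attempt, base=5.0, cap=120.0, rng=random.random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def is_retryable(status_code):
    """True for the HTTP status codes the text says are retried."""
    return status_code in RETRYABLE_STATUS
```

With these numbers the ceiling grows 5s, 10s, 20s, 40s, 80s and then stays at 120s; the jitter keeps parallel subagents from retrying in lockstep.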

The “Model returned empty after tool calls — nudging to continue” message is a different path: the model finished tool calls but returned no text content, so the loop injects a synthetic user message to nudge the model to summarize/continue.

What Happens When the Network Changes

For subagents specifically:

Each child has its own API call loop. A network change mid-task means every in-flight HTTP request hits ReadTimeout or connection reset.
The stream-level retry will fire for each subagent independently (that’s why you see [subagent-0], [subagent-1], [subagent-2] all dropping simultaneously around 122-123s).
If the network recovers within the retry window (typically 3 attempts), the subagents silently reconnect and continue. Their conversation history is preserved — the retry replays the same API call with the same messages.
If the network stays down through all retries:
- Stream layer raises the error
- Conversation layer retries with backoff
- If still failing, tries fallback provider (if configured)
- If all exhausted, the child returns {"failed": True, "error": "…"}
- Parent sees a “failed” entry in the delegation results and reports it
delegation.child_timeout_seconds (default 600s) is the hard cap: a subagent stuck on network issues will eventually time out. A heartbeat thread (30s interval) monitors child progress. If neither the iteration count nor current_tool advances for 15 cycles (450s idle), it stops touching the parent’s activity, letting the gateway timeout fire.
Each subagent rebuilds its OpenAI client on retry (agent._replace_primary_openai_client), purging dead connections from the pool.
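The heartbeat's idle detection can be sketched like this. ChildHeartbeat and tick are hypothetical names; only the 30s interval, the 15-cycle threshold, and the two progress signals (iteration count and current_tool) come from the text.

```python
HEARTBEAT_INTERVAL_SECONDS = 30  # interval quoted in the text
STALE_CYCLES = 15                # 15 * 30s = 450s idle


class ChildHeartbeat:
    """Tracks whether a child made progress between heartbeat ticks."""

    def __init__(self, stale_cycles=STALE_CYCLES):
        self.stale_cycles = stale_cycles
        self.idle_cycles = 0
        self._last = None

    def tick(self, iteration, current_tool):
        """Return True while the parent's activity should still be touched."""
        snapshot = (iteration, current_tool)
        if snapshot != self._last:
            # Either signal advanced: the child is alive, reset the counter.
            self._last = snapshot
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
        return self.idle_cycles < self.stale_cycles
```

A real heartbeat thread would call tick once per interval and stop refreshing the parent as soon as it returns False.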

Key config knobs

delegation.child_timeout_seconds — hard per-child timeout (default 600)
delegation.max_iterations — per-child iteration cap (default 50)
delegation.max_concurrent_children — parallel children (default 3)
HERMES_STREAM_STALE_TIMEOUT — override stale stream detection (env var)
Fallback model chain — in config.yaml, so children inherit the parent’s fallback chain
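As a loose illustration of how the three delegation knobs and their defaults fit together, here is a sketch that reads them from an already-parsed config mapping. DelegationConfig and load_delegation_config are hypothetical, and nesting the keys under a delegation section is an assumption based on the dotted names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DelegationConfig:
    """Defaults as quoted in the text."""
    child_timeout_seconds: int = 600
    max_iterations: int = 50
    max_concurrent_children: int = 3


def load_delegation_config(config):
    """Read the delegation section, keeping defaults for missing keys."""
    section = config.get("delegation") or {}
    names = [f.name for f in fields(DelegationConfig)]
    known = {name: section[name] for name in names if name in section}
    return DelegationConfig(**known)
```

For example, load_delegation_config({"delegation": {"child_timeout_seconds": 900}}) raises the per-child cap to 900s and leaves the other two values at their defaults.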

The short answer: network changes during subagent execution cause parallel stream drops across all children. Each child retries independently at both the stream level (3 attempts, connection rebuild) and the conversation level (backoff + credential rotation + fallback). If the network recovers within ~5-10 minutes, most subagents will transparently resume. If not, they fail gracefully and the parent gets error summaries instead of results.
