The Query Engine: The Complete Lifecycle of a Conversation

Deep dive into QueryEngine.ts and query.ts, tracing the full journey of a conversation from user input to final response, and understanding the async-generator-driven streaming query loop

Overview

In the previous big-picture article, we saw that Claude Code's architecture can be divided into 5 layers, with the query engine sitting at the very core. Now, we're going to dive into that layer.

When you type something into the terminal — "help me fix this bug" — what actually happens between that moment and when Claude starts streaming its response, executing tools, and delivering the final result? The answer lies in two files:

  • QueryEngine.ts (~1,295 lines) — the session manager, maintaining state across turns
  • query.ts (~1,729 lines) — the streaming query loop, an async-generator-driven state machine

This article will trace the complete lifecycle of a conversation, from the entry point at submitMessage(), through the query() generator's loop execution, to the triggering of recovery strategies. This is the key to understanding how Claude Code interacts with the LLM.


QueryEngine: The Session's State Container

QueryEngine is not created on every call — it is a long-lived object that persists throughout the entire conversation session. Its responsibility is managing state across turns:

src/QueryEngine.ts:184-207
TypeScript
export class QueryEngine {
  private config: QueryEngineConfig
  private mutableMessages: Message[] // Complete message history across turns
  private abortController: AbortController // User interrupt signal
  private permissionDenials: SDKPermissionDenial[] // Permission denial records
  private totalUsage: NonNullableUsage // Cumulative token usage
  private hasHandledOrphanedPermission = false
  private readFileState: FileStateCache // Read file cache (avoids redundant reads)
  private discoveredSkillNames = new Set<string>()
  private loadedNestedMemoryPaths = new Set<string>()

  constructor(config: QueryEngineConfig) {
    this.config = config
    this.mutableMessages = config.initialMessages ?? []
    this.abortController = config.abortController ?? createAbortController()
    this.permissionDenials = []
    this.readFileState = config.readFileCache
    this.totalUsage = EMPTY_USAGE
  }
}

A few key fields are worth noting:

  • mutableMessages — This is the message history for the entire conversation. New messages from each turn are appended here, and tool execution results are also injected here. This array is "mutable" — in Claude Code's overall design that favors immutable data, this is a deliberate exception, since message history requires frequent updates and append operations don't cause concurrency issues.
  • readFileState — A file read cache. When the AI reads a file via FileReadTool, its contents are cached here. If the AI references the file again later, the engine can avoid re-sending the full contents to the API, saving tokens.
  • permissionDenials — When a user denies execution permission for a tool, the denial is recorded here for subsequent SDK reporting.
  • discoveredSkillNames — Tracks skill names discovered during the current turn. Cleared at the start of each turn to prevent unbounded growth in long sessions.
  • loadedNestedMemoryPaths — Records already-loaded nested memory paths to avoid duplicate loading.
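The caching idea behind readFileState can be sketched in a few lines. This is a hypothetical simplification, not the actual FileStateCache implementation: content is keyed by file path and invalidated when the file's timestamp changes.

```typescript
// Hypothetical sketch only — not the actual FileStateCache implementation.
type CachedFile = { content: string; timestamp: number }

class FileReadCacheSketch {
  private entries = new Map<string, CachedFile>()

  // Returns cached content only if the file is unchanged since the last read.
  get(path: string, currentTimestamp: number): string | undefined {
    const entry = this.entries.get(path)
    return entry !== undefined && entry.timestamp === currentTimestamp
      ? entry.content
      : undefined
  }

  set(path: string, content: string, timestamp: number): void {
    this.entries.set(path, { content, timestamp })
  }
}
```

A stale timestamp produces a cache miss, forcing a fresh read rather than serving outdated contents.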

QueryEngineConfig: The Engine's Configuration Contract

Before diving into submitMessage(), we need to understand QueryEngineConfig — it defines everything required when creating a QueryEngine:

src/QueryEngine.ts:130-173
TypeScript
export type QueryEngineConfig = {
  cwd: string // Working directory
  tools: Tools // Available tool list
  commands: Command[] // Slash command list
  mcpClients: MCPServerConnection[] // MCP client connections
  agents: AgentDefinition[] // Agent definitions
  canUseTool: CanUseToolFn // Tool permission check function
  getAppState: () => AppState // Get application state
  setAppState: (f: (prev: AppState) => AppState) => void
  initialMessages?: Message[] // Initial message history (used when resuming sessions)
  readFileCache: FileStateCache // File state cache
  customSystemPrompt?: string // Custom system prompt
  appendSystemPrompt?: string // Appended system prompt
  userSpecifiedModel?: string // User-specified model
  fallbackModel?: string // Fallback model
  thinkingConfig?: ThinkingConfig // Thinking mode configuration
  maxTurns?: number // Maximum turn limit
  maxBudgetUsd?: number // Budget limit (USD)
  taskBudget?: { total: number } // Task budget
  jsonSchema?: Record<string, unknown> // Structured output schema
  verbose?: boolean
  replayUserMessages?: boolean
  handleElicitation?: ToolUseContext['handleElicitation']
  snipReplay?: ( // Snip boundary handler
    yieldedSystemMsg: Message,
    store: Message[],
  ) => { messages: Message[]; executed: boolean } | undefined
}

This configuration object reflects an important design decision: inject all external dependencies at creation time. QueryEngine doesn't fetch the tool list or MCP clients on its own — everything is provided by the caller. This makes the engine flexibly configurable across different usage scenarios (SDK mode, REPL mode, test mode).


submitMessage(): The Entry Point for Each Turn

Whenever the user enters a new message, submitMessage() is called. It is an async generator (async *), meaning it doesn't return a single result but instead progressively yields streaming events:

src/QueryEngine.ts:209-212
TypeScript
async *submitMessage(
  prompt: string | ContentBlockParam[],
  options?: { uuid?: string; isMeta?: boolean }
): AsyncGenerator<SDKMessage, void, unknown>
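Before walking through the phases, it helps to see how a caller consumes a generator of this shape. The stub below is illustrative only — it mimics the async-generator signature so the consumption pattern can be shown end to end; the real SDKMessage union is far richer.

```typescript
// Illustrative stub, not the real engine: same async-generator shape.
type SDKMessageSketch = { type: 'assistant' | 'result'; text: string }

class StubEngine {
  async *submitMessage(
    prompt: string,
  ): AsyncGenerator<SDKMessageSketch, void, unknown> {
    // A real turn streams many events; two are enough to show the shape.
    yield { type: 'assistant', text: `working on: ${prompt}` }
    yield { type: 'result', text: 'done' }
  }
}

// The caller renders each event as it arrives, without waiting for the
// whole turn to finish.
async function consume(prompt: string): Promise<SDKMessageSketch[]> {
  const engine = new StubEngine()
  const seen: SDKMessageSketch[] = []
  for await (const msg of engine.submitMessage(prompt)) {
    seen.push(msg)
  }
  return seen
}
```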

The execution of submitMessage() is divided into several clear phases. Let's trace through them one by one.

Phase 1: Turn-Level Initialization

src/QueryEngine.ts:238-241
TypeScript
this.discoveredSkillNames.clear() // Clear skills discovered in the previous turn
setCwd(cwd) // Set the working directory
const persistSession = !isSessionPersistenceDisabled()
const startTime = Date.now()

At the start of each turn, the engine performs some cleanup. discoveredSkillNames.clear() ensures skill discovery is turn-scoped — long-running sessions won't waste memory due to unbounded growth of the skill name set.

Phase 2: Permission Tracking Wrapper

src/QueryEngine.ts:244-271
TypeScript
const wrappedCanUseTool: CanUseToolFn = async (tool, input, ...) => {
  const result = await canUseTool(tool, input, ...)

  // Track denials for SDK reporting
  if (result.behavior !== 'allow') {
    this.permissionDenials.push({
      tool_name: sdkCompatToolName(tool.name),
      tool_use_id: toolUseID,
      tool_input: input,
    })
  }

  return result
}

The engine doesn't use the canUseTool function from the config directly — it wraps it with an additional layer. This wrapper records all denied tool calls without changing the permission logic itself. These records ultimately appear in the SDK's returned result message, letting the caller know which operations were denied by the user.

Phase 3: Building the ProcessUserInputContext

This is the biggest step in submitMessage() — constructing a large configuration object containing everything needed for the current turn:

src/QueryEngine.ts:335-395
TypeScript
let processUserInputContext: ProcessUserInputContext = {
  messages: this.mutableMessages,
  setMessages: fn => {
    this.mutableMessages = fn(this.mutableMessages)
  },
  onChangeAPIKey: () => {},
  handleElicitation: this.config.handleElicitation,
  options: {
    commands,
    tools,
    verbose,
    mainLoopModel: initialMainLoopModel,
    thinkingConfig: initialThinkingConfig,
    mcpClients,
    isNonInteractiveSession: true,
    customSystemPrompt,
    appendSystemPrompt,
    agentDefinitions: { activeAgents: agents, allAgents: [] },
    maxBudgetUsd,
  },
  getAppState,
  setAppState,
  abortController: this.abortController,
  readFileState: this.readFileState,
  nestedMemoryAttachmentTriggers: new Set<string>(),
  loadedNestedMemoryPaths: this.loadedNestedMemoryPaths,
  dynamicSkillDirTriggers: new Set<string>(),
  discoveredSkillNames: this.discoveredSkillNames,
  // ...more fields
}

ProcessUserInputContext is the "contract" between the query engine and the query() generator — it defines all the capabilities and state that query() can use. Note the implementation of setMessages: it captures a reference to this.mutableMessages through a closure, allowing slash commands (like /force-snip) to directly modify the message array.

It's worth noting that processUserInputContext is created twice within submitMessage(). The first time is for processing user input (slash commands, attachments, etc.), and the second time is after slash command processing completes, using the updated messages and model. This ensures that state modifications made by slash commands are visible to subsequent query() calls.


Multi-Layer System Prompt Composition

Before calling the API, the engine needs to assemble the System Prompt. This isn't just a simple string — it's a composition of multiple layers of content:

...

Let's see how the code actually implements this composition:

src/QueryEngine.ts:289-325
TypeScript
const {
  defaultSystemPrompt,
  userContext: baseUserContext,
  systemContext,
} = await fetchSystemPromptParts({
  tools,
  mainLoopModel: initialMainLoopModel,
  mcpClients,
  customSystemPrompt: customPrompt,
})

// Merge coordinator mode user context
const userContext = {
  ...baseUserContext,
  ...getCoordinatorUserContext(mcpClients, scratchpadDir),
}

// Conditionally load memory mechanics prompt
const memoryMechanicsPrompt =
  customPrompt !== undefined && hasAutoMemPathOverride()
    ? await loadMemoryPrompt()
    : null

// Final composition
const systemPrompt = asSystemPrompt([
  ...(customPrompt !== undefined ? [customPrompt] : defaultSystemPrompt),
  ...(memoryMechanicsPrompt ? [memoryMechanicsPrompt] : []),
  ...(appendSystemPrompt ? [appendSystemPrompt] : []),
])

The role of each layer:

  1. Default System Prompt — Baseline behavior guidelines, including available tool definitions, safety instructions, and output format requirements. Generated by fetchSystemPromptParts(), with user context and system context collection logic located in src/context.ts and src/utils/queryContext.ts.
  2. Custom System Prompt — User-provided custom instructions via configuration. When present, it replaces the default prompt (rather than appending to it).
  3. Memory Mechanics — Operational instructions for the memory system (how to read/write MEMORY.md). Only enabled when two conditions are met simultaneously: a custom prompt exists AND the memory path override environment variable is set.
  4. Append System Prompt — Project-level rules from CLAUDE.md files. These are always appended at the end, regardless of whether a custom prompt exists.

Each layer can contain thousands of tokens. When the System Prompt itself consumes a large portion of the context window, less space remains for the actual conversation — which is why context management is so important.
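The replace-versus-append semantics can be distilled into a small helper. This is a hypothetical sketch (composeSystemPrompt is not the real asSystemPrompt), capturing only the layering rules listed above.

```typescript
// Hypothetical helper: a custom prompt replaces the default layers, while
// memory mechanics and append layers come after whatever base was chosen.
function composeSystemPrompt(opts: {
  defaultPrompt: string[]
  customPrompt?: string
  memoryMechanicsPrompt?: string
  appendPrompt?: string
}): string[] {
  return [
    ...(opts.customPrompt !== undefined ? [opts.customPrompt] : opts.defaultPrompt),
    ...(opts.memoryMechanicsPrompt ? [opts.memoryMechanicsPrompt] : []),
    ...(opts.appendPrompt ? [opts.appendPrompt] : []),
  ]
}
```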

User Context and System Context

In addition to the System Prompt itself, fetchSystemPromptParts() also returns two context objects:

  • userContext — Injected in [key: value] format before the user message in each API request. Contains dynamic context such as working directory, platform information, and time.
  • systemContext — Injected in a similar format at the end of the system message. Contains relatively static context such as the list of installed MCP servers.

This separation is intentional: userContext may change with each request (e.g., the current working directory), while systemContext remains relatively stable throughout the session.
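A hedged sketch of what that injection might look like; the exact rendering Claude Code uses may differ, but the [key: value] shape is the point:

```typescript
// Hypothetical renderer for the [key: value] context injection format.
function renderContext(ctx: { [k: string]: string }): string {
  return Object.entries(ctx)
    .map(([key, value]) => `[${key}: ${value}]`)
    .join('\n')
}
```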


query(): An Async-Generator-Driven Streaming State Machine

The query() function is the core loop of the entire query engine. Its signature reveals its nature — an async generator:

src/query.ts:219-228
TypeScript
export async function* query(params: QueryParams): AsyncGenerator<
  | StreamEvent
  | RequestStartEvent
  | Message
  | TombstoneMessage
  | ToolUseSummaryMessage,
  Terminal
>

Why an async generator? Because streaming AI conversations are inherently a multi-phase, interruptible, stateful process:

  1. Send a request to the API
  2. Receive the streaming response (token by token)
  3. Detect a tool call -> pause streaming output -> execute tool -> inject result -> continue request
  4. Detect a need for recovery -> execute recovery strategy -> retry
  5. Final completion

The generator pattern lets the caller (UI layer) consume these events incrementally, rendering the latest state at each yield point, without waiting for the entire process to complete.

query()'s Two-Layer Structure

query() itself is just a thin wrapper. It delegates the actual work to queryLoop(), then notifies command lifecycles upon normal completion:

src/query.ts:219-239
TypeScript
export async function* query(params: QueryParams): AsyncGenerator<...> {
  const consumedCommandUuids: string[] = []
  const terminal = yield* queryLoop(params, consumedCommandUuids)
  // Only reaches here on normal completion
  // On throw, the error propagates through yield*
  // On .return(), both generators are closed
  for (const uuid of consumedCommandUuids) {
    notifyCommandLifecycle(uuid, 'completed')
  }
  return terminal
}

yield* is the key — it "passes through" all of queryLoop()'s yielded values directly to query()'s caller, while also propagating errors and cancellation signals. This pattern makes error handling and resource cleanup natural: if queryLoop() throws an exception, it bubbles directly to the caller; if the caller calls .return(), both generators are properly closed.
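The delegation semantics are standard JavaScript and worth seeing in isolation. A minimal, self-contained illustration (sync generators for brevity; the mechanics are identical for async ones):

```typescript
function* innerLoop(): Generator<number, string> {
  yield 1
  yield 2
  return 'terminal'
}

function* outerQuery(log: string[]): Generator<number, string> {
  // Forwards every yielded value to the caller and receives the return value.
  const terminal = yield* innerLoop()
  // Only reached on normal completion — skipped on throw or early .return().
  log.push('completed')
  return terminal
}
```

Calling .return() on the outer generator closes the inner one too, and the completion logic after yield* never runs — exactly the behavior query() relies on for notifying command lifecycles only on normal exit.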

State: The Query Loop's Internal State

src/query.ts:204-217
TypeScript
type State = {
  messages: Message[]
  toolUseContext: ToolUseContext
  autoCompactTracking: AutoCompactTrackingState | undefined
  maxOutputTokensRecoveryCount: number // Retry count (max 3)
  hasAttemptedReactiveCompact: boolean // Whether reactive compaction has been attempted
  maxOutputTokensOverride: number | undefined
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null> | undefined
  stopHookActive: boolean | undefined
  turnCount: number // Current turn count
  transition: Continue | undefined // Why the previous iteration continued
}

This State type is key to understanding the query loop's behavior. A few important fields:

  • maxOutputTokensRecoveryCount — When the API returns a max_output_tokens error (the AI's output was truncated), the engine automatically retries. This counter tracks the number of retries, with an upper limit defined by MAX_OUTPUT_TOKENS_RECOVERY_LIMIT = 3.
  • hasAttemptedReactiveCompact — When context is approaching its limit, the engine attempts "reactive compaction" — automatically compressing historical messages to free up space. This flag ensures compaction is only attempted once, preventing infinite loops.
  • transition — Records why the previous iteration continued. Its value is a Continue type (e.g., { reason: 'max_output_tokens_recovery', attempt: 2 }), letting tests assert whether a recovery path was triggered without needing to inspect message contents.
  • maxOutputTokensOverride — When output truncation is first encountered, the engine first tries to escalate the output token limit from the default 8K to 64K (ESCALATED_MAX_TOKENS), before considering multi-turn recovery.
  • autoCompactTracking — Tracks auto-compaction state, including warning threshold calculations for token usage.

Main Loop Structure

...

The core of the query loop is conceptually a while(true) structure, where each iteration:

  1. Stream — Send the current message history to the API and receive a streaming response
  2. Collect — Gather text content and tool_use calls from the response
  3. Execute — If there are tool_use calls, execute tools via StreamingToolExecutor
  4. Inject — Inject tool execution results as tool_result messages into the history
  5. Decide — Determine whether to continue the loop (new tool results or recovery needed) or end (no further actions)

QueryParams: The Input Contract for query()

Before entering the tool call loop, let's look at the parameters query() receives:

src/query.ts:181-199
TypeScript
export type QueryParams = {
  messages: Message[]
  systemPrompt: SystemPrompt
  userContext: { [k: string]: string }
  systemContext: { [k: string]: string }
  canUseTool: CanUseToolFn
  toolUseContext: ToolUseContext
  fallbackModel?: string
  querySource: QuerySource
  maxOutputTokensOverride?: number
  maxTurns?: number
  skipCacheWrite?: boolean
  taskBudget?: { total: number }
  deps?: QueryDeps
}

Note the deps?: QueryDeps parameter. This is a dependency injection point — in production it uses productionDeps, and in tests it can be replaced with mock implementations. QueryDeps contains the concrete implementations of core capabilities like API calls and tool execution, making query() itself testable without depending on any external modules.

src/query/deps.ts
TypeScript
export type QueryDeps = {
  // Core dependencies: API calls, tool execution, etc.
}
export const productionDeps: QueryDeps = { ... }
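The pattern is easy to see in miniature. A hypothetical sketch — not the real QueryDeps shape, whose fields are elided above — showing how a stubbed dependency replaces the network in tests:

```typescript
// Hypothetical dependency shape for illustration only.
type DepsSketch = {
  callApi: (prompt: string) => Promise<string>
}

// The function under test depends only on the injected capabilities.
async function answer(prompt: string, deps: DepsSketch): Promise<string> {
  return deps.callApi(prompt)
}

// In tests, the network is replaced with a canned response.
const mockDeps: DepsSketch = {
  callApi: async () => 'stubbed response',
}
```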

The Tool Call Loop: How the AI Uses Tools

When the API response contains a tool_use content block, the query loop enters the tool execution phase. This is the core capability of Claude Code as an AI agent — the AI doesn't just generate text; it can execute actions.

A typical tool call loop:

sequenceDiagram
    participant Q as query() loop
    participant API as Anthropic API
    participant STE as StreamingToolExecutor
    participant T1 as GrepTool
    participant T2 as FileReadTool

    Q->>API: messages + systemPrompt + tool definitions
    API-->>Q: [text: "Let me search the code..."]
    API-->>Q: [tool_use: grep "function login"]
    API-->>Q: [tool_use: read "src/auth.ts"]
    Note over Q: Response complete, contains 2 tool_use blocks

    Q->>STE: Submit 2 tool calls
    Note over STE: Both tools are concurrencySafe
    par Parallel execution
        STE->>T1: grep "function login"
        STE->>T2: read "src/auth.ts"
    end
    T1-->>STE: Search results
    T2-->>STE: File contents
    STE-->>Q: [tool_result, tool_result]

    Q->>Q: Inject tool_results into message history
    Q->>API: Updated message history (with tool results)
    API-->>Q: [text: "Based on the search results, the bug is on line 42..."]
    Note over Q: No more tool_use, loop ends

Note the key details:

  • Parallel execution: Both GrepTool and FileReadTool declare isConcurrencySafe() = true (they are read-only operations), so StreamingToolExecutor executes them in parallel
  • Result injection: Tool results are appended to the message history as tool_result messages, then the entire history is re-sent to the API
  • Loop continuation: The API generates a new response based on the tool results; if the new response contains more tool_use blocks, the loop continues
  • Permission checks: Before tool execution, wrappedCanUseTool checks whether the user has authorized the operation. Denied operations return an error message to the AI and are recorded in permissionDenials
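The scheduling rule can be sketched as follows. The names here (ToolCall, an isConcurrencySafe flag) are illustrative, not the actual StreamingToolExecutor API: concurrency-safe calls run together, the rest run one at a time, and result order mirrors call order.

```typescript
type ToolCall = {
  name: string
  isConcurrencySafe: boolean
  run: () => Promise<string>
}

async function executeToolCalls(calls: ToolCall[]): Promise<string[]> {
  const results = new Map<ToolCall, string>()

  // Read-only tools (grep, file reads) can safely run in parallel.
  const safe = calls.filter(c => c.isConcurrencySafe)
  await Promise.all(safe.map(async c => results.set(c, await c.run())))

  // Mutating tools run sequentially to avoid interleaved writes.
  for (const c of calls.filter(c => !c.isConcurrencySafe)) {
    results.set(c, await c.run())
  }

  // Return results in the original call order.
  return calls.map(c => results.get(c)!)
}
```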

Tool Result Storage Optimization

Each tool's execution result can be large (e.g., the entire contents of a large file). Storing them directly in the message history would quickly consume the context window. Claude Code uses applyToolResultBudget (from src/utils/toolResultStorage.ts) to apply budget controls to tool results — results exceeding the budget are truncated or summarized, ensuring the message history doesn't explode from a single tool call.
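A minimal sketch of the budgeting idea; the real applyToolResultBudget performs more nuanced truncation and summarization than this hypothetical cap:

```typescript
// Hypothetical budget cap: keep a prefix and report how much was dropped.
function applyBudget(result: string, maxChars: number): string {
  if (result.length <= maxChars) return result
  const truncated = result.slice(0, maxChars)
  const omitted = result.length - maxChars
  return `${truncated}\n[... output truncated: ${omitted} characters omitted]`
}
```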

Handling Missing Tool Results

When the API response contains tool_use but execution is interrupted (e.g., the user presses Ctrl+C), an error-type tool_result must be generated for each incomplete tool call. This is required by the Anthropic API — every tool_use must have a corresponding tool_result:

src/query.ts:123-149
TypeScript
function* yieldMissingToolResultBlocks(
  assistantMessages: AssistantMessage[],
  errorMessage: string,
) {
  for (const assistantMessage of assistantMessages) {
    const toolUseBlocks = assistantMessage.message.content.filter(
      content => content.type === 'tool_use',
    ) as ToolUseBlock[]

    for (const toolUse of toolUseBlocks) {
      yield createUserMessage({
        content: [{
          type: 'tool_result',
          content: errorMessage,
          is_error: true,
          tool_use_id: toolUse.id,
        }],
        toolUseResult: errorMessage,
        sourceToolAssistantUUID: assistantMessage.uuid,
      })
    }
  }
}

This function iterates through all tool_use blocks in assistant messages and generates a tool_result user message containing the error information for each one.


Recovery Strategies: When Things Go Wrong

In the real world, API calls don't always succeed. The network might drop, the context might overflow, or the model might be unable to complete its output. query.ts implements automatic recovery strategies for multiple error types.

The Error "Withholding" Mechanism

Before diving into specific strategies, there's a key design to understand: error message withholding.

When the streaming loop detects a max_output_tokens or prompt_too_long error, it does not immediately yield this error message to the caller. Why? Because SDK callers (such as Claude Desktop) might terminate the session immediately upon receiving an error-type message. If the engine yields an error and then successfully recovers through a recovery strategy, the caller has already stopped listening — the recovery is pointless.

So the engine "withholds" the error message and attempts recovery. The error message is only yielded after recovery fails.

src/query.ts:175-179
TypeScript
// isWithheldMaxOutputTokens checks if a message is a withheld max_output_tokens error
function isWithheldMaxOutputTokens(
  msg: Message | StreamEvent | undefined,
): msg is AssistantMessage {
  return msg?.type === 'assistant' && msg.apiError === 'max_output_tokens'
}

Strategy 1: Output Token Escalation (First max_output_tokens Recovery)

When the AI's output is truncated, the engine first tries a "zero-cost" recovery — escalating the output token limit:

src/query.ts:1188-1221
TypeScript
if (isWithheldMaxOutputTokens(lastMessage)) {
  // Step 1: Escalate to 64K
  if (capEnabled && maxOutputTokensOverride === undefined) {
    logEvent('tengu_max_tokens_escalate', {
      escalatedTo: ESCALATED_MAX_TOKENS,
    })
    const next: State = {
      ...state,
      maxOutputTokensOverride: ESCALATED_MAX_TOKENS,
      transition: { reason: 'max_output_tokens_escalate' },
    }
    state = next
    continue // Retry the same request with a higher output limit
  }

The cleverness of this strategy is that it retries the same request with only the output token limit raised. No recovery message needs to be injected, and the message history length doesn't increase. If 8K wasn't enough but 64K is, the problem is resolved silently.

Strategy 2: Multi-Turn Recovery (Subsequent max_output_tokens Recovery)

If the output is still truncated after escalating to 64K, the engine enters multi-turn recovery mode — injecting a recovery message to let the AI continue from where it left off:

src/query.ts:1223-1252
TypeScript
if (maxOutputTokensRecoveryCount < MAX_OUTPUT_TOKENS_RECOVERY_LIMIT) {
  const recoveryMessage = createUserMessage({
    content:
      `Output token limit hit. Resume directly — no apology, ` +
      `no recap of what you were doing. Pick up mid-thought ` +
      `if that is where the cut happened. Break remaining ` +
      `work into smaller pieces.`,
    isMeta: true,
  })

  const next: State = {
    messages: [
      ...messagesForQuery,
      ...assistantMessages,
      recoveryMessage,
    ],
    maxOutputTokensRecoveryCount: maxOutputTokensRecoveryCount + 1,
    transition: {
      reason: 'max_output_tokens_recovery',
      attempt: maxOutputTokensRecoveryCount + 1,
    },
    // ...other fields
  }
  state = next
  continue
}

Notice the wording of the recovery message: "Resume directly — no apology, no recap". This is a carefully crafted prompt that tells the AI not to waste tokens apologizing or restating context, but to continue directly from where it was cut off. This maximizes the utilization of the limited output tokens.

MAX_OUTPUT_TOKENS_RECOVERY_LIMIT = 3, meaning a maximum of 3 retries. If the output is still truncated after 3 attempts, the error message is yielded to the caller.

Strategy 3: Context Collapse and Reactive Compaction (prompt_too_long)

When the message history is too long and the API returns a prompt_too_long error, the engine has two levels of recovery.

Level 1: Context Collapse

src/query.ts:1089-1117
TypeScript
if (
  feature('CONTEXT_COLLAPSE') &&
  contextCollapse &&
  state.transition?.reason !== 'collapse_drain_retry'
) {
  const drained = contextCollapse.recoverFromOverflow(
    messagesForQuery, querySource,
  )
  if (drained.committed > 0) {
    const next: State = {
      messages: drained.messages,
      transition: { reason: 'collapse_drain_retry', committed: drained.committed },
      // ...
    }
    state = next
    continue
  }
}

Context collapse is a lightweight form of compression — it collapses "staged" context blocks while preserving fine-grained information. Note the condition state.transition?.reason !== 'collapse_drain_retry' — if the previous iteration already attempted collapse but still overflowed, it won't try again, instead falling through to reactive compaction.

Level 2: Reactive Compaction

src/query.ts:1119-1166
TypeScript
if ((isWithheld413 || isWithheldMedia) && reactiveCompact) {
  const compacted = await reactiveCompact.tryReactiveCompact({
    hasAttempted: hasAttemptedReactiveCompact,
    querySource,
    aborted: toolUseContext.abortController.signal.aborted,
    messages: messagesForQuery,
    cacheSafeParams: {
      systemPrompt,
      userContext,
      systemContext,
      toolUseContext,
      forkContextMessages: messagesForQuery,
    },
  })

  if (compacted) {
    const postCompactMessages = buildPostCompactMessages(compacted)
    for (const msg of postCompactMessages) {
      yield msg
    }
    const next: State = {
      messages: postCompactMessages,
      hasAttemptedReactiveCompact: true,
      transition: { reason: 'reactive_compact_retry' },
      // ...
    }
    state = next
    continue
  }

  // Compaction failed — release the withheld error message and exit
  yield lastMessage
  return { reason: isWithheldMedia ? 'image_error' : 'prompt_too_long' }
}

Reactive compaction calls the compaction logic in services/compact/ to summarize historical messages and free up context space. The hasAttemptedReactiveCompact flag ensures this operation only executes once — if compaction still results in overflow, there's a more fundamental problem.

Note that this also handles media size errors (images/PDFs that are too large) — reactive compaction can recover by stripping large media content.

Strategy 4: Model Downgrade (Fallback Model)

When the primary model (e.g., Opus) is temporarily unavailable or encounters a specific error, the engine can switch to a fallback model (e.g., Sonnet) to continue working. This is triggered via FallbackTriggeredError (from src/services/api/withRetry.ts).

Recovery Strategy Decision Tree

...

Avoiding Death Spirals

There's a recurring design concern throughout the code: avoiding death spirals. When an API error occurs, if the engine runs Stop Hooks (used to validate the quality of AI output), the hooks might inject more tokens, causing the context to overflow further, triggering more errors, and forming an infinite loop.

src/query.ts:1258-1264
TypeScript
// Skip stop hooks when the last message is an API error
// The model never produced a valid response — evaluating it with hooks
// would create a death spiral: error -> hook blocking -> retry -> error -> ...
if (lastMessage?.isApiErrorMessage) {
  void executeStopFailureHooks(lastMessage, toolUseContext)
  return { reason: 'completed' }
}

Turn Management: Skill Discovery and Cleanup

Each conversation turn involves more than just sending and receiving messages. At the start of each turn, the engine also performs some housekeeping:

src/QueryEngine.ts:238
TypeScript
// Clear discoveredSkillNames at the start of each turn
// Prevents unbounded growth of the skill name set in long sessions
this.discoveredSkillNames.clear()

discoveredSkillNames tracks skill names discovered during the current turn (used for the was_discovered field in tengu_skill_tool_invocation analytics events). Why clear it each turn? Because in long-running sessions (which can last hours in SDK mode), this set would grow continuously without clearing. The source code comment explicitly states this design intent:

Must persist across the two processUserInputContext rebuilds inside submitMessage, but is cleared at the start of each submitMessage to avoid unbounded growth across many turns in SDK mode.

This is a small but important design decision — turn-scoped cleanup.

Skill and Plugin Preloading

Before calling query(), submitMessage() also preloads skills and plugins in parallel:

src/QueryEngine.ts:534-538
TypeScript
const [skills, { enabled: enabledPlugins }] = await Promise.all([
  getSlashCommandToolSkills(getCwd()),
  loadAllPluginsCacheOnly(),
])

Note the use of loadAllPluginsCacheOnly() — in headless/SDK mode, it doesn't block waiting for network requests to fetch plugins. It only uses plugin data already in the cache. If the latest data is needed, the caller can manually refresh via the /reload-plugins command.


Session Persistence and Recovery

submitMessage() carefully manages session persistence throughout its execution. This makes the --resume feature possible — even if the process is killed midway, the next startup can resume from the interruption point.

Early Persistence of User Messages

src/QueryEngine.ts:450-463
TypeScript
if (persistSession && messagesFromUserInput.length > 0) {
  const transcriptPromise = recordTranscript(messages)
  if (isBareMode()) {
    void transcriptPromise // Don't await in --bare mode
  } else {
    await transcriptPromise
  }
}

User messages are persisted before entering the query() loop. The source code comment explains why:

If the process is killed before [the API responds], the transcript is left with only queue-operation entries; getLastSessionLog filters those out, returns null, and --resume fails with "No conversation found".

Async Persistence of Assistant Messages

In contrast, assistant message persistence is fire-and-forget:

src/QueryEngine.ts:727-729
TypeScript
if (message.type === 'assistant') {
  void recordTranscript(messages) // Don't await
}

Why? Because the streaming handler in claude.ts frequently yields assistant messages (one per content block), then modifies the last message's usage and stop_reason in message_delta events. Waiting for persistence to complete each time would block the streaming pipeline. Since enqueueWrite is order-preserving, fire-and-forget is safe here.
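The safety argument rests on ordering. A hypothetical sketch in the spirit of enqueueWrite (not its actual implementation): each write chains onto the previous one, so writes complete in submission order even when callers never await.

```typescript
// Order-preserving write queue: later writes can never overtake earlier ones.
class WriteQueue {
  private tail: Promise<void> = Promise.resolve()

  enqueueWrite(write: () => Promise<void>): Promise<void> {
    this.tail = this.tail.then(write)
    return this.tail
  }
}
```

Because ordering is guaranteed by the chain rather than by awaiting, fire-and-forget callers get correct transcripts without blocking the streaming pipeline.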


Complete Data Flow Recap

Let's summarize the full collaboration between QueryEngine and query() in a single diagram:

...

Transferable Engineering Patterns

From Claude Code's query engine design, we can extract several general-purpose engineering patterns:

1. Async Generators as a Streaming Abstraction

When your system needs to process streaming data (such as SSE event streams or WebSocket messages), async function* is a powerful abstraction. It lets producers yield events at their own pace, while consumers can consume them at their own pace using for await...of.

Claude Code takes this further by using yield* to compose multiple generators — query() delegates to queryLoop(), and submitMessage() consumes query()'s output while adding extra logic. This pattern keeps complex streaming pipelines well-modularized.

2. State Machine + Recovery Counter Pattern

For long-running tasks that need automatic recovery, maintaining recovery counters (maxOutputTokensRecoveryCount) and attempt flags (hasAttemptedReactiveCompact) in the state is a clean and effective pattern. It prevents infinite retries while allowing multiple orderly recoveries.

The design of the transition field is particularly elegant — it not only controls flow but also provides observability for debugging and testing. Tests can assert transition.reason === 'max_output_tokens_recovery' without needing to deeply inspect message contents.
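The counter-plus-flag pattern can be captured in a few lines. A hypothetical sketch with illustrative names — not the real State transitions:

```typescript
type RecoveryState = {
  retryCount: number
  hasAttemptedCompact: boolean
  transition?: { reason: string; attempt?: number }
}

const RETRY_LIMIT = 3

// Each recovery bumps the counter and records why the loop continued;
// past the limit, the caller gives up and surfaces the error.
function onOutputTruncated(state: RecoveryState): RecoveryState | 'give_up' {
  if (state.retryCount >= RETRY_LIMIT) return 'give_up'
  return {
    ...state,
    retryCount: state.retryCount + 1,
    transition: {
      reason: 'max_output_tokens_recovery',
      attempt: state.retryCount + 1,
    },
  }
}
```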

3. Context Objects as Inter-Layer Contracts

The ProcessUserInputContext pattern — packaging all dependencies needed for the current turn into a context object — is a lightweight form of dependency injection. It allows the query() function to operate without directly depending on QueryEngine's internal state, facilitating testing and reuse.

4. Error Withholding and Graduated Recovery

Withholding error messages first, attempting recovery, then releasing them on failure — this pattern is applicable to any streaming system that needs graceful degradation. It prevents callers from prematurely terminating due to transient errors.

5. Choosing Persistence Timing

User messages are persisted synchronously (ensuring recoverability), while assistant messages are persisted asynchronously (ensuring the streaming pipeline isn't blocked). Different types of data use different persistence strategies. This is a careful trade-off between reliability and performance.


Next Up

The query engine drives the entire conversation loop, but its capabilities are limited by the available tools. Article 03: The Tool System will dive into Claude Code's 40+ tools — how are they defined? How are they discovered and executed? Most critically: when the AI performs actions, how does the permission system ensure safety?