API Client: Deep Integration with the Anthropic API

Deep dive into Claude Code's API client — streaming response handling, model selection and fallback, Beta APIs, token counting and cost tracking, retry logic

The Problem

The moment you type a question in Claude Code and press Enter, a precisely orchestrated chain of operations begins: the system builds the message array, selects the appropriate model, adds Beta headers, receives the response via SSE streaming, parses token usage in real time, calculates costs, handles potential 429/529 errors, and falls back to an alternative model if necessary. All of this completes within 1-2 seconds — the user simply sees text start flowing.

Claude Code's API client is not a simple HTTP wrapper — it's a complex system encompassing retry logic, fallback, caching, cost tracking, and multi-provider adaptation. This article provides an in-depth analysis of every layer of this system.


API Client Layer Architecture

  • Application Layer — query.ts (main loop), claude.ts (API orchestration)
  • Retry Layer — withRetry.ts, FallbackTriggeredError, CannotRetryError
  • SDK Layer — client.ts (SDK initialization), dispatching to the Anthropic SDK, AWS Bedrock, GCP Vertex AI, or Azure Foundry
  • Support Layer — cost-tracker.ts, promptCacheBreakDetection.ts, bootstrap.ts

Multi-Provider Client

Claude Code supports four API providers, each with different authentication and configuration methods:

src/services/api/client.ts
TypeScript
// Direct API:
//   ANTHROPIC_API_KEY: Required for direct API access
//
// AWS Bedrock:
//   AWS credentials configured via aws-sdk defaults
//   AWS_REGION or AWS_DEFAULT_REGION
//   ANTHROPIC_SMALL_FAST_MODEL_AWS_REGION: Optional override for Haiku
//
// Foundry (Azure):
//   ANTHROPIC_FOUNDRY_RESOURCE: Azure resource name
//   ANTHROPIC_FOUNDRY_BASE_URL: Alternative full base URL
//
// Vertex AI:
//   Model-specific region variables (VERTEX_REGION_CLAUDE_*)
//   CLOUD_ML_REGION: Default GCP region
//   ANTHROPIC_VERTEX_PROJECT_ID: Required GCP project ID
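Provider selection itself is environment-driven. A minimal sketch, assuming the documented CLAUDE_CODE_USE_BEDROCK and CLAUDE_CODE_USE_VERTEX switches (detectProvider and the Foundry heuristic are illustrative, not the actual implementation):

```typescript
type Provider = 'firstParty' | 'bedrock' | 'vertex' | 'foundry'

// Hypothetical sketch: pick a provider from environment variables.
// CLAUDE_CODE_USE_BEDROCK / CLAUDE_CODE_USE_VERTEX are real Claude Code
// switches; the Foundry check is an assumption based on the variables above.
function detectProvider(env: Record<string, string | undefined>): Provider {
  if (env.CLAUDE_CODE_USE_BEDROCK) return 'bedrock'
  if (env.CLAUDE_CODE_USE_VERTEX) return 'vertex'
  if (env.ANTHROPIC_FOUNDRY_RESOURCE || env.ANTHROPIC_FOUNDRY_BASE_URL) {
    return 'foundry'
  }
  return 'firstParty' // direct Anthropic API with ANTHROPIC_API_KEY
}
```

With no switches set, the client defaults to the first-party API.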

Client initialization also accounts for debugging: when debug mode is enabled, SDK logs are redirected to stderr:

src/services/api/client.ts
TypeScript
function createStderrLogger(): ClientOptions['logger'] {
  return {
    error: (msg, ...args) =>
      console.error('[Anthropic SDK ERROR]', msg, ...args),
    warn: (msg, ...args) =>
      console.error('[Anthropic SDK WARN]', msg, ...args),
    info: (msg, ...args) =>
      console.error('[Anthropic SDK INFO]', msg, ...args),
    debug: (msg, ...args) =>
      console.error('[Anthropic SDK DEBUG]', msg, ...args),
  }
}

Beta Headers Management

Claude Code uses numerous Beta API features, declared via the anthropic-beta header:

src/services/api/claude.ts
TypeScript
import {
  AFK_MODE_BETA_HEADER,
  CONTEXT_1M_BETA_HEADER,
  CONTEXT_MANAGEMENT_BETA_HEADER,
  EFFORT_BETA_HEADER,
  FAST_MODE_BETA_HEADER,
  PROMPT_CACHING_SCOPE_BETA_HEADER,
  REDACT_THINKING_BETA_HEADER,
  STRUCTURED_OUTPUTS_BETA_HEADER,
  TASK_BUDGETS_BETA_HEADER,
} from 'src/constants/betas.js'

These Beta features include:

  • CONTEXT_1M_BETA_HEADER — 1M token context window
  • CONTEXT_MANAGEMENT_BETA_HEADER — server-side context management
  • FAST_MODE_BETA_HEADER — fast mode (reduced latency)
  • EFFORT_BETA_HEADER — effort control (adjusts reasoning depth)
  • PROMPT_CACHING_SCOPE_BETA_HEADER — prompt caching scope
  • REDACT_THINKING_BETA_HEADER — thinking content redaction
  • STRUCTURED_OUTPUTS_BETA_HEADER — structured outputs
  • TASK_BUDGETS_BETA_HEADER — task budget control
  • AFK_MODE_BETA_HEADER — away mode (background execution optimization)
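The individual beta constants are ultimately joined into a single comma-separated anthropic-beta request header. A sketch of that assembly (the constant values here are placeholders, not the real beta identifiers):

```typescript
// Placeholder values; the actual beta identifier strings differ.
const CONTEXT_1M_BETA_HEADER = 'context-1m-example'
const FAST_MODE_BETA_HEADER = 'fast-mode-example'

// The API expects one comma-separated header, so deduplicate and join.
function buildBetaHeader(betas: string[]): Record<string, string> {
  return { 'anthropic-beta': [...new Set(betas)].join(',') }
}
```

Because a Set preserves insertion order, repeated constants collapse without reordering the remaining entries.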

Extra Body Parameters

Users can inject additional API parameters via the CLAUDE_CODE_EXTRA_BODY environment variable:

src/services/api/claude.ts
TypeScript
export function getExtraBodyParams(betaHeaders?: string[]): JsonObject {
  const extraBodyStr = process.env.CLAUDE_CODE_EXTRA_BODY
  let result: JsonObject = {}

  if (extraBodyStr) {
    try {
      const parsed = safeParseJSON(extraBodyStr)
      if (parsed && typeof parsed === 'object' && !Array.isArray(parsed)) {
        // Shallow clone — safeParseJSON is LRU-cached and returns the
        // same object reference. Mutating result would poison the cache.
        result = { ...(parsed as JsonObject) }
      }
    } catch (error) {
      logForDebugging(`Error parsing CLAUDE_CODE_EXTRA_BODY: ${errorMessage(error)}`)
    }
  }

  // Anti-distillation: send fake_tools opt-in for 1P CLI only
  if (feature('ANTI_DISTILLATION_CC') && getAPIProvider() === 'firstParty') {
    result.anti_distillation = ['fake_tools']
  }

  return result
}

Note the shallow clone — safeParseJSON uses an LRU cache, so directly mutating the return value would poison the cache, causing subsequent calls to see the modified value.
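The hazard is easy to reproduce in isolation. A minimal demonstration, where a memoized parser stands in for safeParseJSON's LRU cache (cachedParse is hypothetical):

```typescript
// Stand-in for an LRU-cached parser: identical input returns the same object.
const cache = new Map<string, unknown>()
function cachedParse(s: string): unknown {
  if (!cache.has(s)) cache.set(s, JSON.parse(s))
  return cache.get(s)
}

// Wrong: mutate the cached object directly.
const poisoned = cachedParse('{"a":1}') as Record<string, unknown>
poisoned.b = 2
// Every later caller for the same input now sees { a: 1, b: 2 }.
const victim = cachedParse('{"a":1}') as Record<string, unknown>

// Right: shallow-clone before adding fields, leaving the cache untouched.
const cloned = { ...(cachedParse('{"x":1}') as Record<string, unknown>) }
cloned.y = 2
const unaffected = cachedParse('{"x":1}') as Record<string, unknown>
```

The victim object carries the stray field, while the clone-then-mutate path leaves later reads of the cache clean.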


Prompt Cache Control

Prompt caching can be controlled at per-model granularity:

src/services/api/claude.ts
TypeScript
export function getPromptCachingEnabled(model: string): boolean {
  if (isEnvTruthy(process.env.DISABLE_PROMPT_CACHING)) return false
  if (isEnvTruthy(process.env.DISABLE_PROMPT_CACHING_HAIKU)) {
    if (model === getSmallFastModel()) return false
  }
  if (isEnvTruthy(process.env.DISABLE_PROMPT_CACHING_SONNET)) {
    if (model === getDefaultSonnetModel()) return false
  }
  // ...
}

This per-model disabling design stems from practical needs — the cache creation cost for some models may not be worthwhile (for instance, Haiku is already inexpensive, and the cache creation fee can actually exceed the savings).
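The arithmetic behind that trade-off is simple. Assuming Anthropic-style multipliers where a cache write costs 1.25x the base input rate and a cache read costs 0.1x (illustrative figures; real rates vary by model and cache TTL), savings depend entirely on how often the cached prefix is reused:

```typescript
// Break-even sketch: how many cache reads amortize one cache write?
// The multipliers are assumptions modeled on Anthropic-style pricing.
function cacheSavingsUSD(
  prefixTokens: number,
  basePerMTokUSD: number,
  reads: number,
  writeMultiplier = 1.25,
  readMultiplier = 0.1,
): number {
  const base = (prefixTokens / 1_000_000) * basePerMTokUSD
  const withCache = base * writeMultiplier + reads * base * readMultiplier
  const withoutCache = (reads + 1) * base // pay full input price every call
  return withoutCache - withCache
}
```

With zero reuse the result is negative: the write premium is pure loss, which is why excluding an already-cheap model like Haiku can make sense.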


Retry System

The retry logic is the most complex part of the API client, defined in withRetry.ts.

Retry Configuration

src/services/api/withRetry.ts
TypeScript
const DEFAULT_MAX_RETRIES = 10
const FLOOR_OUTPUT_TOKENS = 3000
const MAX_529_RETRIES = 3
export const BASE_DELAY_MS = 500
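These constants imply a standard exponential backoff from a 500 ms base. A sketch of the delay schedule (full jitter and the cap value are assumptions; the real loop also honors the server's retry-after header when present):

```typescript
const BASE_DELAY_MS = 500

// Exponential backoff with full jitter, capped; attempt is 0-based.
function backoffDelayMs(attempt: number, capMs = 32_000): number {
  const exp = Math.min(BASE_DELAY_MS * 2 ** attempt, capMs)
  return Math.floor(Math.random() * exp) // full jitter in [0, exp)
}
```

Full jitter spreads simultaneous retries across the window, which matters when a 529 means the service is already saturated.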

Foreground vs. Background Query Sources

Not all queries should be retried. Background queries (summaries, title generation, classifiers) immediately give up on 529 errors — they are not what the user is waiting for, and retrying would only amplify capacity cascades:

src/services/api/withRetry.ts
TypeScript
const FOREGROUND_529_RETRY_SOURCES = new Set<QuerySource>([
  'repl_main_thread',
  'repl_main_thread:outputStyle:custom',
  'repl_main_thread:outputStyle:Explanatory',
  'repl_main_thread:outputStyle:Learning',
  'sdk',
  'agent:custom',
  'agent:default',
  'agent:builtin',
  'compact',
  'hook_agent',
  'hook_prompt',
  'verification_agent',
  'side_question',
  'auto_mode',
])
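Given that set, the 529 gate reduces to a membership check plus an attempt budget. A sketch (shouldRetry529 is a hypothetical helper, and the set is abbreviated here):

```typescript
type QuerySource = string // simplified for the sketch

// Abbreviated version of the foreground allow-list shown above.
const FOREGROUND_529_RETRY_SOURCES = new Set<QuerySource>([
  'repl_main_thread',
  'sdk',
  'agent:default',
  'compact',
])

// Hypothetical: background sources give up on 529 immediately;
// foreground sources retry up to MAX_529_RETRIES times.
function shouldRetry529(source: QuerySource, attempt: number, max = 3): boolean {
  return FOREGROUND_529_RETRY_SOURCES.has(source) && attempt < max
}
```

A background query such as title generation fails fast instead of piling load onto an already-overloaded service.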

Retry State Machine

...

Fast Mode Fallback

Fast Mode is a low-latency mode. When rate-limited, the system must decide whether to wait (preserving cache hits) or fall back (switching to standard speed):

src/services/api/withRetry.ts
TypeScript
if (wasFastModeActive && !isPersistentRetryEnabled() &&
    error instanceof APIError &&
    (error.status === 429 || is529Error(error))) {
  // Overage limit — permanently disable fast mode
  const overageReason = error.headers?.get(
    'anthropic-ratelimit-unified-overage-disabled-reason',
  )
  if (overageReason !== null && overageReason !== undefined) {
    handleFastModeOverageRejection(overageReason)
    retryContext.fastMode = false
    continue
  }

  const retryAfterMs = getRetryAfterMs(error)
  if (retryAfterMs !== null && retryAfterMs < SHORT_RETRY_THRESHOLD_MS) {
    // Short wait — keep fast mode to protect prompt cache
    await sleep(retryAfterMs, options.signal, { abortError })
    continue
  }

  // Long wait or unknown — enter cooldown period (switch to standard speed)
  const cooldownMs = Math.max(
    retryAfterMs ?? DEFAULT_FAST_MODE_FALLBACK_HOLD_MS,
    MIN_COOLDOWN_MS,
  )
  triggerFastModeCooldown(Date.now() + cooldownMs, cooldownReason)
  retryContext.fastMode = false
  continue
}

The decision logic:

  • retry-after < threshold — short wait, keep fast mode (protects the prompt cache from invalidation)
  • retry-after >= threshold or unknown — enter a cooldown period, switch to standard speed
  • Overage limit — permanently disable fast mode

Authentication Error Recovery

src/services/api/withRetry.ts
TypeScript
const isStaleConnection = isStaleConnectionError(lastError)
if (isStaleConnection && getFeatureValue_CACHED_MAY_BE_STALE(...)) {
  disableKeepAlive() // Disable connection pool, rebuild connection
}

if (
  client === null ||
  (lastError instanceof APIError && lastError.status === 401) ||
  isOAuthTokenRevokedError(lastError) ||
  isBedrockAuthError(lastError) ||
  isVertexAuthError(lastError) ||
  isStaleConnection
) {
  if ((lastError instanceof APIError && lastError.status === 401) ||
      isOAuthTokenRevokedError(lastError)) {
    const failedAccessToken = getClaudeAIOAuthTokens()?.accessToken
    if (failedAccessToken) {
      await handleOAuth401Error(failedAccessToken)
    }
  }
  client = await getClient() // Rebuild client
}

Authentication recovery covers special cases for all providers:

  • Anthropic 1P — refresh OAuth token on 401
  • AWS Bedrock — 403 or CredentialsProviderError
  • GCP Vertex — credential refresh failure
  • Connection reset — disable keep-alive and reconnect on ECONNRESET/EPIPE

Consecutive 529 Errors and Model Fallback

src/services/api/withRetry.ts
TypeScript
if (is529Error(error) &&
    (process.env.FALLBACK_FOR_ALL_PRIMARY_MODELS ||
     (!isClaudeAISubscriber() && isNonCustomOpusModel(options.model)))) {
  consecutive529Errors++
  if (consecutive529Errors >= MAX_529_RETRIES) {
    if (options.fallbackModel) {
      throw new FallbackTriggeredError(
        options.model,
        options.fallbackModel,
      )
    }
  }
}

After 3 consecutive 529 errors, a model fallback is triggered (e.g., Opus to Sonnet). FallbackTriggeredError is caught and handled by query.ts — existing assistant messages are cleared, the model is switched, and the entire request is retried.
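On the catching side, the shape of that handling might look like this (the class fields and the handleFallback helper are assumptions for illustration, not the actual query.ts code):

```typescript
// Hypothetical shape of the fallback error and its handling in the main loop.
class FallbackTriggeredError extends Error {
  constructor(
    public readonly originalModel: string,
    public readonly fallbackModel: string,
  ) {
    super(`Falling back from ${originalModel} to ${fallbackModel}`)
  }
}

function handleFallback(
  err: unknown,
  state: { model: string; assistantMessages: unknown[] },
): boolean {
  if (!(err instanceof FallbackTriggeredError)) return false
  state.assistantMessages.length = 0 // discard partial output
  state.model = err.fallbackModel // switch models; caller re-issues the request
  return true
}
```

Clearing the partial messages before retrying is essential: content produced under the original model (thinking blocks in particular) cannot simply be replayed against the fallback model.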

Persistent Retry (Unattended Mode)

For automation scenarios (CI/CD, cron jobs), the system supports unlimited retries:

src/services/api/withRetry.ts
TypeScript
const PERSISTENT_MAX_BACKOFF_MS = 5 * 60 * 1000 // 5 minute max backoff
const PERSISTENT_RESET_CAP_MS = 6 * 60 * 60 * 1000 // 6 hour timeout
const HEARTBEAT_INTERVAL_MS = 30_000 // 30 second heartbeat

function isPersistentRetryEnabled(): boolean {
  return feature('UNATTENDED_RETRY')
    ? isEnvTruthy(process.env.CLAUDE_CODE_UNATTENDED_RETRY)
    : false
}

Persistent retry sends heartbeats via SystemAPIErrorMessage, preventing the host environment (such as a container orchestration system) from marking the session as idle.


Cost Tracking

Every API response updates the cost state:

src/cost-tracker.ts
TypeScript
type StoredCostState = {
  totalCostUSD: number
  totalAPIDuration: number
  totalAPIDurationWithoutRetries: number
  totalToolDuration: number
  totalLinesAdded: number
  totalLinesRemoved: number
  lastDuration: number | undefined
  modelUsage: { [modelName: string]: ModelUsage } | undefined
}

Cost calculation uses the calculateUSDCost function based on per-model pricing tables:

src/services/api/claude.ts
TypeScript
import { addToTotalSessionCost } from 'src/cost-tracker.js'
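As a sketch of what such pricing arithmetic involves, here is a hypothetical calculateUSDCost over a made-up pricing table (the usage field names follow the Anthropic API; the rates and table layout are assumptions, not Claude Code's actual tables):

```typescript
type Usage = {
  input_tokens: number
  output_tokens: number
  cache_creation_input_tokens?: number
  cache_read_input_tokens?: number
}

// Hypothetical per-MTok rates; real tables are per-model and versioned.
const PRICING: Record<
  string,
  { in: number; out: number; cacheWrite: number; cacheRead: number }
> = {
  'example-sonnet': { in: 3, out: 15, cacheWrite: 3.75, cacheRead: 0.3 },
}

function calculateUSDCost(model: string, u: Usage): number {
  const p = PRICING[model]
  if (!p) return 0 // unknown model: report zero rather than guess
  return (
    (u.input_tokens * p.in +
      u.output_tokens * p.out +
      (u.cache_creation_input_tokens ?? 0) * p.cacheWrite +
      (u.cache_read_input_tokens ?? 0) * p.cacheRead) /
    1_000_000
  )
}
```

Note that cache writes and cache reads are priced separately from plain input tokens, which is what makes the per-model caching knobs above worthwhile.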

The cost state is not just for display — it is saved to the project configuration during session switches and read back on resume:

src/cost-tracker.ts
TypeScript
export function saveCurrentSessionCosts(fpsMetrics?: FpsMetrics): void {
  saveCurrentProjectConfig(current => ({
    ...current,
    lastCost: getTotalCostUSD(),
    lastAPIDuration: getTotalAPIDuration(),
    lastAPIDurationWithoutRetries: getTotalAPIDurationWithoutRetries(),
    lastToolDuration: getTotalToolDuration(),
    lastDuration: getTotalDuration(),
    // ...
  }))
}

Bootstrap API

At startup, the system fetches server-side configuration via the Bootstrap API:

src/services/api/bootstrap.ts
TypeScript
async function fetchBootstrapAPI(): Promise<BootstrapResponse | null> {
  if (isEssentialTrafficOnly()) return null // Skip in privacy mode
  if (getAPIProvider() !== 'firstParty') return null // Skip for third-party providers

  // OAuth preferred, API Key fallback
  const hasUsableOAuth =
    getClaudeAIOAuthTokens()?.accessToken && hasProfileScope()
  if (!hasUsableOAuth && !apiKey) return null

  const endpoint = `${getOauthConfig().BASE_API_URL}/api/claude_cli/bootstrap`

  return await withOAuth401Retry(async () => {
    // Re-read OAuth token each time (retry may have refreshed it)
    const token = getClaudeAIOAuthTokens()?.accessToken
    let authHeaders: Record<string, string>
    if (token && hasProfileScope()) {
      authHeaders = { Authorization: `Bearer ${token}`, ... }
    } else if (apiKey) {
      authHeaders = { 'x-api-key': apiKey }
    } else {
      return null
    }

    const response = await axios.get(endpoint, {
      headers: { ...authHeaders },
      timeout: 5000,
    })
    return bootstrapResponseSchema().safeParse(response.data)
  })
}

The data returned by Bootstrap includes:

  • client_data — client configuration
  • additional_model_options — list of additional available models

The 5-second timeout ensures startup doesn't hang due to network issues.


Streaming Response Handling

The main loop in query.ts consumes streaming responses via for await...of. Key processing logic includes:
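The overall shape of that loop can be sketched as a reducer over stream events. The event shapes below follow the public Anthropic streaming API; consumeStream is illustrative, not Claude Code's actual code:

```typescript
// Minimal subset of Anthropic streaming event shapes.
type StreamEvent =
  | { type: 'content_block_delta'; delta: { type: 'text_delta'; text: string } }
  | { type: 'message_delta'; usage: { output_tokens: number } }
  | { type: 'message_stop' }

// Stands in for the `for await...of` loop over the SSE stream.
function consumeStream(events: Iterable<StreamEvent>) {
  let text = ''
  let outputTokens = 0
  for (const event of events) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      text += event.delta.text // tokens flow to the UI as they arrive
    } else if (event.type === 'message_delta') {
      outputTokens = event.usage.output_tokens // usage updates mid-stream
    }
  }
  return { text, outputTokens }
}
```

Token usage arriving in message_delta events is what lets cost tracking update in real time rather than waiting for the response to complete.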

Fallback Handling

When model fallback is triggered during streaming, partially received messages need to be discarded:

src/query.ts
TypeScript
if (streamingFallbackOccured) {
  // Generate tombstones for already-emitted messages
  for (const msg of assistantMessages) {
    yield { type: 'tombstone' as const, message: msg }
  }

  assistantMessages.length = 0
  toolResults.length = 0
  toolUseBlocks.length = 0
  needsFollowUp = false

  // Discard pending results from the streaming tool executor
  if (streamingToolExecutor) {
    streamingToolExecutor.discard()
    streamingToolExecutor = new StreamingToolExecutor(
      toolUseContext.options.tools,
      canUseTool,
      toolUseContext,
    )
  }
}

Tombstone messages tell the UI and transcript to remove these partial messages — it is particularly important to remove incomplete thinking blocks, as they carry model-specific signatures that would cause API errors after falling back to a different model.

Error Suppression and Recovery

Certain API errors are recoverable — the system suppresses them within the streaming loop and attempts recovery after the stream ends:

src/query.ts
TypeScript
let withheld = false
if (feature('CONTEXT_COLLAPSE')) {
  if (contextCollapse?.isWithheldPromptTooLong(message, ...)) {
    withheld = true
  }
}
if (reactiveCompact?.isWithheldPromptTooLong(message)) {
  withheld = true
}
if (mediaRecoveryEnabled && reactiveCompact?.isWithheldMediaSizeError(message)) {
  withheld = true
}
if (isWithheldMaxOutputTokens(message)) {
  withheld = true
}
if (!withheld) {
  yield yieldMessage
}

Suppressed messages are still appended to the assistantMessages array, since the recovery logic needs to inspect them. They are not forwarded to SDK consumers, however, because those consumers (such as desktop applications) might terminate the session upon seeing an error.


Request Construction Details

Tool Schema Conversion

Each tool definition needs to be converted to an API-compatible format, including handling of deferred tools:

TypeScript
// Reference: src/services/api/claude.ts
import {
  formatDeferredToolLine,
  isDeferredTool,
  TOOL_SEARCH_TOOL_NAME,
} from '../../tools/ToolSearchTool/prompt.js'

Advisor Mode

When Advisor is enabled, an additional model (such as Opus advising Sonnet) participates in decision-making:

src/services/api/claude.ts
TypeScript
import {
  ADVISOR_TOOL_INSTRUCTIONS,
  getExperimentAdvisorModels,
  isAdvisorEnabled,
  isValidAdvisorModel,
  modelSupportsAdvisor,
} from 'src/utils/advisor.js'

Session Activity Tracking

During API requests, the session is marked as active, used for resource management in remote environments:

src/services/api/claude.ts
TypeScript
import {
  startSessionActivity,
  stopSessionActivity,
} from '../../utils/sessionActivity.js'

Summary

Claude Code's API client is a multi-layered defense system:

  • Multi-provider abstraction — unified interface for Anthropic/Bedrock/Vertex/Foundry, configured via environment variables
  • Layered retry — different strategies for different error types (authentication/rate-limiting/overload/connection reset)
  • Intelligent fallback — Fast Mode to standard speed to alternative model, with sound decision logic at each step
  • Streaming error suppression — recoverable errors are not immediately exposed to consumers, giving the system a chance to recover
  • Full-chain cost tracking — from API response to project configuration persistence, with support for session resumption
  • Operational knobs — prompt caching, Fast Mode, retry strategies, and more are all controllable via environment variables and feature flags

The complexity of this system is not accidental — it reflects the reality that production AI applications face: networks are unreliable, services get overloaded, credentials expire, and users need an uninterrupted experience. Every layer of protection corresponds to a real-world failure mode.