Skip to content

Experiment Report: previous_response_id — Quality Impact, Token Duplication, and Tool Call Behavior

Experiment Report: previous_response_id — Quality Impact, Token Duplication, and Tool Call Behavior

Section titled “Experiment Report: previous_response_id — Quality Impact, Token Duplication, and Tool Call Behavior”

Date: 2026-04-22 Models under test: gpt-5.4-mini via raw OpenAI SDK and via LangChain ChatOpenAI Related: Context Duplication Experiment · Compaction Experiment · Original Context Window Experiment

Executive summary: We tested three questions about previous_response_id: (1) Does it improve conversation quality? No — conversation flow scores are identical with and without chaining. (2) Does sending full history + previous_response_id duplicate tokens? Yes — in all clients (raw SDK and LangChain both show ~2x inflation). However, tool calls break the chain — when the agent makes tool calls, LangChain’s internal multi-step loop creates intermediate response chains that prevent previous_response_id from linking turns. Since our production agent always has tools, the duplication does not manifest in practice. (3) Does previous_response_id improve quality or caching? No — both approaches achieve identical scores and ~96% cache hit rates.

Three prior experiments raised questions about previous_response_id:

  1. September 2025: Proved previous_response_id doesn’t prevent context window overflow (GPT-5-mini failed at ~272K tokens).

  2. April 2026 — Compaction: Proved GPT-5.4’s context_management compaction prevents overflow.

  3. April 2026 — Duplication: Found that sending full history + previous_response_id via the raw OpenAI SDK causes 2.3x token inflation.

The duplication finding raised alarm: our production consumer-agent sends full conversation history AND previous_response_id on every turn. If tokens are being double-counted, we’re paying 2x more and hitting context limits faster.

Investigation revealed a nuanced answer: the duplication is real, but tool calls neutralize it in practice.

Does previous_response_id chaining improve multi-turn conversation quality?

Ran 30 synthetic multi-turn conversations through the eval harness twice:

  • Run A: Full message history, previous_response_id=None
  • Run B: Full message history + previous_response_id chained from prior turn

Both used gpt-5.4-mini-low, same system prompt, same MCP tools (prod rover-mcp).

MetricWithout ChainingWith ChainingDelta
Conversation Flow0.9500.953+0.003 (noise)
Total time10m03s7m55s-2m08s

No quality benefit. Scores are statistically identical. The message history array provides all the context the model needs.

Confirmed: Duplication Is Real in Both SDK and LangChain

Section titled “Confirmed: Duplication Is Real in Both SDK and LangChain”

Both the raw OpenAI SDK and LangChain’s ChatOpenAI show the same ~2x token inflation when combining full history + previous_response_id:

Raw SDK (5 turns):

Without chaining: 1,755 total input tokens
With chaining: 4,155 total input tokens
Ratio: 2.37x

LangChain (5 turns, no tools):

Without chaining: 1,755 total input tokens
With chaining: 4,155 total input tokens
Ratio: 2.37x

LangChain with real system prompt (3 turns, no tools, ~9,800 token system prompt):

Without chaining: 29,642 total input tokens
With chaining: 59,858 total input tokens
Ratio: 2.02x

The duplication is consistent across clients and scales.

When the agent has tools, previous_response_id chaining stops working. The duplication disappears:

LangChain with tools (2 turns, echo tool):

Without chaining: 232 total input tokens
With chaining: 232 total input tokens
Ratio: 1.00x

LangChain with real agent + MCP tools (2 turns, real system prompt):

Without chaining: 30,392 total input tokens
With chaining: 30,449 total input tokens
Ratio: 1.00x

When the LangChain agent makes tool calls, its internal execution loop makes multiple OpenAI API calls per turn:

Turn N:
1. Initial call → model decides to call tool (response_id: resp_A)
2. Tool executes locally
3. Follow-up call with tool result → final answer (response_id: resp_B)

Each internal call creates a new response. The previous_response_id we bind is from the previous turn’s final response (resp_prev), but it gets used on step 1 of the current turn. Step 3 creates a completely new response chain (resp_B) that doesn’t link back to resp_prev.

To confirm the chain is truly broken (not just that tokens aren’t double-counted), we tested whether the model can recall facts from earlier turns using only previous_response_id (no message history).

Setup: Turn 1 establishes “My favorite color is purple and my dog’s name is Biscuit.” Turn 2 is a filler (triggers tool call with tools). Turn 3 asks “What is my favorite color and what is my dog’s name?”

No tools — prev_id_only remembers:

Turn 3: "Your favorite color is purple, and your dog's name is Biscuit." ✓

With tools — prev_id_only forgets:

Turn 3: "I don't have that information yet." ✗

All 4 combinations with tools:

ComboRemembers?Why
history_onlyFacts are in the message array
prev_id_onlyChain broken — server-side state lost after tool loop
bothHistory provides context regardless
neitherNo context available

This proves the chain isn’t just “not adding tokens” — it’s completely severed. The model has zero memory of earlier turns via previous_response_id when tools are involved. The message history array is the only source of cross-turn context in our production agent.

By turn N+1, the previous_response_id we captured (resp_B) references a response that was itself an internal follow-up — not a continuation of the original chain. The server-side state from resp_prev is no longer in scope.

Evidence: Response ID prefixes differ between turns when tools are used, but share prefixes when no tools are used:

No tools: Turn 1: resp_0aac20f3... Turn 2: resp_0aac20f3... (same prefix)
With tools: Turn 1: resp_0e4456bd... Turn 2: resp_045eb4ba... (different prefix)

Three integration tests confirm this behavior (tests/integration/test_previd_tool_chaining.py):

TestResult
test_no_tools_chaining_causes_duplication1.46x inflation (PASS)
test_with_tools_chaining_no_duplication1.01x (PASS)
test_tool_calls_break_response_id_chainDifferent response_id prefixes per turn (PASS)

Results (from eval replay, 30 conversations with tools)

Section titled “Results (from eval replay, 30 conversations with tools)”
MetricWithout ChainingWith Chaining
Total cached tokens2,082,0482,080,000
Cache hit rate97.2%97.1%

No caching benefit. OpenAI’s prompt caching is based on identical input prefixes, not previous_response_id. Since both approaches send the same growing message array, cache behavior is identical.

Production consumer-agent sends full history + previous_response_id on every turn. The agent always has MCP tools. Therefore:

  1. Token duplication does not occur because tool calls break the response chain
  2. Quality is identical with or without chaining
  3. Caching is identical with or without chaining
  4. previous_response_id is effectively a no-op in our production setup
OptionImpactRecommendation
Remove previous_response_idEliminates DynamoDB lookups for prior turn IDs, simplifies episode managementRecommended — it does nothing in our tool-based agent
Keep for compactionEnables context_management compaction as a safety netOnly if adopting compaction for long conversations
Keep as-isNo harm (no duplication with tools), just unnecessary complexityAcceptable if cleanup is low priority

The original duplication experiment was correct about the mechanism but its production recommendations need updating:

Original FindingUpdated Understanding
”Approach A (full array + prev_id) causes 2.3x duplication”True for no-tool scenarios. False for production (tools break the chain).
”Never use Approach A”Not applicable to production — tool calls neutralize the duplication.
”Switch to Approach B (new msg only + prev_id)“Unnecessary for our tool-based agent. Would be needed if we had a no-tool path.
TestFile
Token duplication + call counting (4 combos × 2 agent types)tests/integration/test_previd_tool_chaining.py
Context retention (fact recall across turns)tests/integration/test_previd_context_retention.py
Raw SDK duplication testtests/integration/test_context_duplication.py
Multi-turn replay A/Bsrc/consumer_agent/evaluation/tasks.py
Synthetic datasetconsumer-agent-eval-multi-turn-synthetic in Opik
ExperimentModeConversation Flow
ab-v3-no-previd-*History-only0.950
ab-v3-with-previd-*With chaining0.953