Experiment Report: previous_response_id — Quality Impact, Token Duplication, and Tool Call Behavior

Date: 2026-04-22 Models under test: gpt-5.4-mini via raw OpenAI SDK and via LangChain ChatOpenAI Related: Context Duplication Experiment · Compaction Experiment · Original Context Window Experiment

Executive summary: We tested three questions about previous_response_id: (1) Does it improve conversation quality? No — conversation flow scores are identical with and without chaining. (2) Does sending full history + previous_response_id duplicate tokens? Yes — in all clients (raw SDK and LangChain both show ~2x inflation). However, tool calls break the chain — when the agent makes tool calls, LangChain’s internal multi-step loop creates intermediate response chains that prevent previous_response_id from linking turns. Since our production agent always has tools, the duplication does not manifest in practice. (3) Does previous_response_id improve quality or caching? No — both approaches achieve identical scores and ~96% cache hit rates.

Background

Three prior experiments raised questions about previous_response_id:

September 2025: Proved previous_response_id doesn’t prevent context window overflow (GPT-5-mini failed at ~272K tokens).
April 2026 — Compaction: Proved GPT-5.4’s context_management compaction prevents overflow.
April 2026 — Duplication: Found that sending full history + previous_response_id via the raw OpenAI SDK causes 2.3x token inflation.

The duplication finding raised alarm: our production consumer-agent sends full conversation history AND previous_response_id on every turn. If tokens are being double-counted, we’re paying 2x more and hitting context limits faster.

Investigation revealed a nuanced answer: the duplication is real, but tool calls neutralize it in practice.

Experiment 1: Quality Impact

Hypothesis

Does previous_response_id chaining improve multi-turn conversation quality?

Method

Ran 30 synthetic multi-turn conversations through the eval harness twice:

Run A: Full message history, previous_response_id=None
Run B: Full message history + previous_response_id chained from prior turn

Both used gpt-5.4-mini-low, same system prompt, same MCP tools (prod rover-mcp).

Results

Metric	Without Chaining	With Chaining	Delta
Conversation Flow	0.950	0.953	+0.003 (noise)
Total time	10m03s	7m55s	-2m08s

Conclusion

No quality benefit. Scores are statistically identical. The message history array provides all the context the model needs.

Experiment 2: Token Duplication

Confirmed: Duplication Is Real in Both SDK and LangChain

Both the raw OpenAI SDK and LangChain’s ChatOpenAI show the same ~2x token inflation when combining full history + previous_response_id:

Raw SDK (5 turns):

Without chaining: 1,755 total input tokens
With chaining:    4,155 total input tokens
Ratio: 2.37x

LangChain (5 turns, no tools):

Without chaining: 1,755 total input tokens
With chaining:    4,155 total input tokens
Ratio: 2.37x

LangChain with real system prompt (3 turns, no tools, ~9,800 token system prompt):

Without chaining: 29,642 total input tokens
With chaining:    59,858 total input tokens
Ratio: 2.02x

The duplication is consistent across clients and scales.

But: Tool Calls Break the Chain

When the agent has tools, previous_response_id chaining stops working. The duplication disappears:

LangChain with tools (2 turns, echo tool):

Without chaining: 232 total input tokens
With chaining:    232 total input tokens
Ratio: 1.00x

LangChain with real agent + MCP tools (2 turns, real system prompt):

Without chaining: 30,392 total input tokens
With chaining:    30,449 total input tokens
Ratio: 1.00x

Why Tool Calls Break Chaining

When the LangChain agent makes tool calls, its internal execution loop makes multiple OpenAI API calls per turn:

Turn N:
  1. Initial call → model decides to call tool      (response_id: resp_A)
  2. Tool executes locally
  3. Follow-up call with tool result → final answer  (response_id: resp_B)

Each internal call creates a new response. The previous_response_id we bind is from the previous turn’s final response (resp_prev), but it gets used on step 1 of the current turn. Step 3 creates a completely new response chain (resp_B) that doesn’t link back to resp_prev.

Context Retention Test: Definitive Proof

To confirm the chain is truly broken (not just that tokens aren’t double-counted), we tested whether the model can recall facts from earlier turns using only previous_response_id (no message history).

Setup: Turn 1 establishes “My favorite color is purple and my dog’s name is Biscuit.” Turn 2 is a filler (triggers tool call with tools). Turn 3 asks “What is my favorite color and what is my dog’s name?”

No tools — prev_id_only remembers:

Turn 3: "Your favorite color is purple, and your dog's name is Biscuit."  ✓

With tools — prev_id_only forgets:

Turn 3: "I don't have that information yet."  ✗

All 4 combinations with tools:

Combo	Remembers?	Why
history_only	✓	Facts are in the message array
prev_id_only	✗	Chain broken — server-side state lost after tool loop
both	✓	History provides context regardless
neither	✗	No context available

This proves the chain isn’t just “not adding tokens” — it’s completely severed. The model has zero memory of earlier turns via previous_response_id when tools are involved. The message history array is the only source of cross-turn context in our production agent.

By turn N+1, the previous_response_id we captured (resp_B) references a response that was itself an internal follow-up — not a continuation of the original chain. The server-side state from resp_prev is no longer in scope.

Evidence: Response ID prefixes differ between turns when tools are used, but share prefixes when no tools are used:

No tools:  Turn 1: resp_0aac20f3...  Turn 2: resp_0aac20f3...  (same prefix)
With tools: Turn 1: resp_0e4456bd...  Turn 2: resp_045eb4ba...  (different prefix)

Isolated Test

Three integration tests confirm this behavior (tests/integration/test_previd_tool_chaining.py):

Test	Result
`test_no_tools_chaining_causes_duplication`	1.46x inflation (PASS)
`test_with_tools_chaining_no_duplication`	1.01x (PASS)
`test_tool_calls_break_response_id_chain`	Different response_id prefixes per turn (PASS)

Experiment 3: Caching Behavior

Results (from eval replay, 30 conversations with tools)

Metric	Without Chaining	With Chaining
Total cached tokens	2,082,048	2,080,000
Cache hit rate	97.2%	97.1%

No caching benefit. OpenAI’s prompt caching is based on identical input prefixes, not previous_response_id. Since both approaches send the same growing message array, cache behavior is identical.

Production Implications

Current State

Production consumer-agent sends full history + previous_response_id on every turn. The agent always has MCP tools. Therefore:

Token duplication does not occur because tool calls break the response chain
Quality is identical with or without chaining
Caching is identical with or without chaining
previous_response_id is effectively a no-op in our production setup

What Should Change

Option	Impact	Recommendation
Remove `previous_response_id`	Eliminates DynamoDB lookups for prior turn IDs, simplifies episode management	Recommended — it does nothing in our tool-based agent
Keep for compaction	Enables `context_management` compaction as a safety net	Only if adopting compaction for long conversations
Keep as-is	No harm (no duplication with tools), just unnecessary complexity	Acceptable if cleanup is low priority

Corrected Understanding

The original duplication experiment was correct about the mechanism but its production recommendations need updating:

Original Finding	Updated Understanding
”Approach A (full array + prev_id) causes 2.3x duplication”	True for no-tool scenarios. False for production (tools break the chain).
”Never use Approach A”	Not applicable to production — tool calls neutralize the duplication.
”Switch to Approach B (new msg only + prev_id)“	Unnecessary for our tool-based agent. Would be needed if we had a no-tool path.

Appendix

Test Artifacts

Test	File
Token duplication + call counting (4 combos × 2 agent types)	`tests/integration/test_previd_tool_chaining.py`
Context retention (fact recall across turns)	`tests/integration/test_previd_context_retention.py`
Raw SDK duplication test	`tests/integration/test_context_duplication.py`
Multi-turn replay A/B	`src/consumer_agent/evaluation/tasks.py`
Synthetic dataset	`consumer-agent-eval-multi-turn-synthetic` in Opik

Opik Experiments

Experiment	Mode	Conversation Flow
`ab-v3-no-previd-*`	History-only	0.950
`ab-v3-with-previd-*`	With chaining	0.953