Experiment Report: previous_response_id — Quality Impact, Token Duplication, and Tool Call Behavior
Experiment Report: previous_response_id — Quality Impact, Token Duplication, and Tool Call Behavior
Section titled “Experiment Report: previous_response_id — Quality Impact, Token Duplication, and Tool Call Behavior”Date: 2026-04-22
Models under test: gpt-5.4-mini via raw OpenAI SDK and via LangChain ChatOpenAI
Related: Context Duplication Experiment · Compaction Experiment · Original Context Window Experiment
Executive summary: We tested three questions about
previous_response_id: (1) Does it improve conversation quality? No — conversation flow scores are identical with and without chaining. (2) Does sending full history +previous_response_idduplicate tokens? Yes — in all clients (raw SDK and LangChain both show ~2x inflation). However, tool calls break the chain — when the agent makes tool calls, LangChain’s internal multi-step loop creates intermediate response chains that preventprevious_response_idfrom linking turns. Since our production agent always has tools, the duplication does not manifest in practice. (3) Doesprevious_response_idimprove quality or caching? No — both approaches achieve identical scores and ~96% cache hit rates.
Background
Section titled “Background”Three prior experiments raised questions about previous_response_id:
-
September 2025: Proved
previous_response_iddoesn’t prevent context window overflow (GPT-5-mini failed at ~272K tokens). -
April 2026 — Compaction: Proved GPT-5.4’s
context_managementcompaction prevents overflow. -
April 2026 — Duplication: Found that sending full history +
previous_response_idvia the raw OpenAI SDK causes 2.3x token inflation.
The duplication finding raised alarm: our production consumer-agent sends full conversation history AND previous_response_id on every turn. If tokens are being double-counted, we’re paying 2x more and hitting context limits faster.
Investigation revealed a nuanced answer: the duplication is real, but tool calls neutralize it in practice.
Experiment 1: Quality Impact
Section titled “Experiment 1: Quality Impact”Hypothesis
Section titled “Hypothesis”Does
previous_response_idchaining improve multi-turn conversation quality?
Method
Section titled “Method”Ran 30 synthetic multi-turn conversations through the eval harness twice:
- Run A: Full message history,
previous_response_id=None - Run B: Full message history +
previous_response_idchained from prior turn
Both used gpt-5.4-mini-low, same system prompt, same MCP tools (prod rover-mcp).
Results
Section titled “Results”| Metric | Without Chaining | With Chaining | Delta |
|---|---|---|---|
| Conversation Flow | 0.950 | 0.953 | +0.003 (noise) |
| Total time | 10m03s | 7m55s | -2m08s |
Conclusion
Section titled “Conclusion”No quality benefit. Scores are statistically identical. The message history array provides all the context the model needs.
Experiment 2: Token Duplication
Section titled “Experiment 2: Token Duplication”Confirmed: Duplication Is Real in Both SDK and LangChain
Section titled “Confirmed: Duplication Is Real in Both SDK and LangChain”Both the raw OpenAI SDK and LangChain’s ChatOpenAI show the same ~2x token inflation when combining full history + previous_response_id:
Raw SDK (5 turns):
Without chaining: 1,755 total input tokensWith chaining: 4,155 total input tokensRatio: 2.37xLangChain (5 turns, no tools):
Without chaining: 1,755 total input tokensWith chaining: 4,155 total input tokensRatio: 2.37xLangChain with real system prompt (3 turns, no tools, ~9,800 token system prompt):
Without chaining: 29,642 total input tokensWith chaining: 59,858 total input tokensRatio: 2.02xThe duplication is consistent across clients and scales.
But: Tool Calls Break the Chain
Section titled “But: Tool Calls Break the Chain”When the agent has tools, previous_response_id chaining stops working. The duplication disappears:
LangChain with tools (2 turns, echo tool):
Without chaining: 232 total input tokensWith chaining: 232 total input tokensRatio: 1.00xLangChain with real agent + MCP tools (2 turns, real system prompt):
Without chaining: 30,392 total input tokensWith chaining: 30,449 total input tokensRatio: 1.00xWhy Tool Calls Break Chaining
Section titled “Why Tool Calls Break Chaining”When the LangChain agent makes tool calls, its internal execution loop makes multiple OpenAI API calls per turn:
Turn N: 1. Initial call → model decides to call tool (response_id: resp_A) 2. Tool executes locally 3. Follow-up call with tool result → final answer (response_id: resp_B)Each internal call creates a new response. The previous_response_id we bind is from the previous turn’s final response (resp_prev), but it gets used on step 1 of the current turn. Step 3 creates a completely new response chain (resp_B) that doesn’t link back to resp_prev.
Context Retention Test: Definitive Proof
Section titled “Context Retention Test: Definitive Proof”To confirm the chain is truly broken (not just that tokens aren’t double-counted), we tested whether the model can recall facts from earlier turns using only previous_response_id (no message history).
Setup: Turn 1 establishes “My favorite color is purple and my dog’s name is Biscuit.” Turn 2 is a filler (triggers tool call with tools). Turn 3 asks “What is my favorite color and what is my dog’s name?”
No tools — prev_id_only remembers:
Turn 3: "Your favorite color is purple, and your dog's name is Biscuit." ✓With tools — prev_id_only forgets:
Turn 3: "I don't have that information yet." ✗All 4 combinations with tools:
| Combo | Remembers? | Why |
|---|---|---|
| history_only | ✓ | Facts are in the message array |
| prev_id_only | ✗ | Chain broken — server-side state lost after tool loop |
| both | ✓ | History provides context regardless |
| neither | ✗ | No context available |
This proves the chain isn’t just “not adding tokens” — it’s completely severed. The model has zero memory of earlier turns via previous_response_id when tools are involved. The message history array is the only source of cross-turn context in our production agent.
By turn N+1, the previous_response_id we captured (resp_B) references a response that was itself an internal follow-up — not a continuation of the original chain. The server-side state from resp_prev is no longer in scope.
Evidence: Response ID prefixes differ between turns when tools are used, but share prefixes when no tools are used:
No tools: Turn 1: resp_0aac20f3... Turn 2: resp_0aac20f3... (same prefix)With tools: Turn 1: resp_0e4456bd... Turn 2: resp_045eb4ba... (different prefix)Isolated Test
Section titled “Isolated Test”Three integration tests confirm this behavior (tests/integration/test_previd_tool_chaining.py):
| Test | Result |
|---|---|
test_no_tools_chaining_causes_duplication | 1.46x inflation (PASS) |
test_with_tools_chaining_no_duplication | 1.01x (PASS) |
test_tool_calls_break_response_id_chain | Different response_id prefixes per turn (PASS) |
Experiment 3: Caching Behavior
Section titled “Experiment 3: Caching Behavior”Results (from eval replay, 30 conversations with tools)
Section titled “Results (from eval replay, 30 conversations with tools)”| Metric | Without Chaining | With Chaining |
|---|---|---|
| Total cached tokens | 2,082,048 | 2,080,000 |
| Cache hit rate | 97.2% | 97.1% |
No caching benefit. OpenAI’s prompt caching is based on identical input prefixes, not previous_response_id. Since both approaches send the same growing message array, cache behavior is identical.
Production Implications
Section titled “Production Implications”Current State
Section titled “Current State”Production consumer-agent sends full history + previous_response_id on every turn. The agent always has MCP tools. Therefore:
- Token duplication does not occur because tool calls break the response chain
- Quality is identical with or without chaining
- Caching is identical with or without chaining
previous_response_idis effectively a no-op in our production setup
What Should Change
Section titled “What Should Change”| Option | Impact | Recommendation |
|---|---|---|
Remove previous_response_id | Eliminates DynamoDB lookups for prior turn IDs, simplifies episode management | Recommended — it does nothing in our tool-based agent |
| Keep for compaction | Enables context_management compaction as a safety net | Only if adopting compaction for long conversations |
| Keep as-is | No harm (no duplication with tools), just unnecessary complexity | Acceptable if cleanup is low priority |
Corrected Understanding
Section titled “Corrected Understanding”The original duplication experiment was correct about the mechanism but its production recommendations need updating:
| Original Finding | Updated Understanding |
|---|---|
| ”Approach A (full array + prev_id) causes 2.3x duplication” | True for no-tool scenarios. False for production (tools break the chain). |
| ”Never use Approach A” | Not applicable to production — tool calls neutralize the duplication. |
| ”Switch to Approach B (new msg only + prev_id)“ | Unnecessary for our tool-based agent. Would be needed if we had a no-tool path. |
Appendix
Section titled “Appendix”Test Artifacts
Section titled “Test Artifacts”| Test | File |
|---|---|
| Token duplication + call counting (4 combos × 2 agent types) | tests/integration/test_previd_tool_chaining.py |
| Context retention (fact recall across turns) | tests/integration/test_previd_context_retention.py |
| Raw SDK duplication test | tests/integration/test_context_duplication.py |
| Multi-turn replay A/B | src/consumer_agent/evaluation/tasks.py |
| Synthetic dataset | consumer-agent-eval-multi-turn-synthetic in Opik |
Opik Experiments
Section titled “Opik Experiments”| Experiment | Mode | Conversation Flow |
|---|---|---|
ab-v3-no-previd-* | History-only | 0.950 |
ab-v3-with-previd-* | With chaining | 0.953 |