Experiment Report: Context Duplication When Combining Input Array with previous_response_id
Date: 2026-04-15
Model under test: gpt-5.4-mini
Related: Compaction Experiment Report · Original Context Window Experiment
Executive summary: Sending the full conversation history in the input array and passing previous_response_id causes the API to double-count prior turns. In a 5-turn test, Approach A (full array + prev_id) consumed 2.3x the input tokens of Approach B (new message only + prev_id). The server does not deduplicate: it assembles prior context from previous_response_id and adds the explicit input array on top. The correct pattern for long conversations is to send only the new user message with previous_response_id and let the server carry forward all prior state.
Background & Motivation
During the GPT-5.4 compaction experiment, we observed compaction triggering at Turn 11 (~225K input tokens). The test was sending the full conversation array alongside previous_response_id on every turn. This raised the question: is the context being counted twice?
The OpenAI conversation state documentation describes two chaining patterns:
- Stateless input-array chaining: Append all output items to the conversation array and pass the full array each turn. No previous_response_id.
- previous_response_id chaining: Pass only the new user message each turn. The server carries forward all prior context.
These are described as alternatives, not complements. We tested what happens when you combine both.
Hypothesis
H1: Sending the full conversation array in input while also passing previous_response_id will result in higher input token counts than either approach alone, because the server assembles prior turns from previous_response_id and additionally counts the explicit input array, without deduplication.
Method
Three approaches were tested over 5 identical turns with gpt-5.4-mini. Each turn sends ~575 tokens of user content (excerpts from Moby-Dick) and asks for a one-sentence summary.
| Approach | input array | previous_response_id | Description |
|---|---|---|---|
| A | Full conversation history (grows each turn) | Yes (chained from prior response) | Both mechanisms combined |
| B | System prompt + new user message only | Yes (chained from prior response) | Server carries all prior state |
| C | Full conversation history (grows each turn) | None | Client manages all state |
All three used identical message content, max_output_tokens=100, store=true, and gpt-5.4-mini.
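The three request shapes can be sketched as a small pure function. This is a minimal illustration, not the test harness itself: `build_request_kwargs` and its parameter names are hypothetical, and the returned dictionary only mirrors the keyword arguments named in this report (model, input, previous_response_id, max_output_tokens, store).

```python
from typing import Optional


def build_request_kwargs(
    approach: str,
    system_prompt: str,
    history: list[dict],
    new_user_message: str,
    prev_response_id: Optional[str],
) -> dict:
    """Build keyword arguments for one turn under Approach A, B, or C.

    `history` holds the messages from earlier turns (hypothetical shape:
    a list of {"role": ..., "content": ...} dicts kept by the client).
    """
    system = {"role": "system", "content": system_prompt}
    new_msg = {"role": "user", "content": new_user_message}

    if approach == "A":    # full array AND previous_response_id
        messages, prev_id = [system, *history, new_msg], prev_response_id
    elif approach == "B":  # new message only, server carries prior state
        messages, prev_id = [system, new_msg], prev_response_id
    elif approach == "C":  # full array, client manages all state
        messages, prev_id = [system, *history, new_msg], None
    else:
        raise ValueError(f"unknown approach: {approach!r}")

    kwargs = {
        "model": "gpt-5.4-mini",
        "input": messages,
        "max_output_tokens": 100,
        "store": True,
    }
    if prev_id is not None:
        kwargs["previous_response_id"] = prev_id
    return kwargs
```

The returned dictionary would then be passed to the SDK as `client.responses.create(**kwargs)`; keeping the construction separate from the network call makes the difference between the approaches easy to inspect.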
Results
Per-Turn Input Token Comparison

| Turn | A: Full Array + prev_id | B: New Msg + prev_id | C: Full Array Only | A vs B Overhead | A vs C Overhead |
|---|---|---|---|---|---|
| 1 | 574 | 574 | 574 | +0 | +0 |
| 2 | 1,769 | 1,189 | 1,148 | +580 | +621 |
| 3 | 3,527 | 1,798 | 1,722 | +1,729 | +1,805 |
| 4 | 5,859 | 2,405 | 2,296 | +3,454 | +3,563 |
| 5 | 8,767 | 3,014 | 2,870 | +5,753 | +5,897 |
Totals
| Approach | Total Input Tokens | vs Approach B |
|---|---|---|
| A (Full array + prev_id) | 20,496 | 2.3x (+128%) |
| B (New msg + prev_id) | 8,980 | 1.0x (baseline) |
| C (Full array, no prev_id) | 8,610 | 0.96x |
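The totals and ratios above follow directly from the per-turn table. As a quick arithmetic check, the snippet below recomputes them from the measured values; the variable names are illustrative only.

```python
# Per-turn input tokens copied from the per-turn comparison table.
per_turn = {
    "A": [574, 1769, 3527, 5859, 8767],
    "B": [574, 1189, 1798, 2405, 3014],
    "C": [574, 1148, 1722, 2296, 2870],
}

# Sum each approach and express it relative to Approach B (the baseline).
totals = {name: sum(values) for name, values in per_turn.items()}
ratios_vs_b = {name: total / totals["B"] for name, total in totals.items()}

for name in per_turn:
    print(f"{name}: total={totals[name]:,}  vs B={ratios_vs_b[name]:.2f}x")
```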
Growth Pattern
[Figure: line chart of input tokens per turn (y-axis, 0 to 9,000) against turn number (x-axis, 1 to 5) for Approach A (full + prev_id), Approach B (new msg + prev_id), and Approach C (full array only).]

Approach A grows quadratically (prior turns are counted twice: once from the array, once from server state). Approaches B and C grow linearly at nearly the same rate.
Analysis & Interpretation
1. The API Does Not Deduplicate

When you pass previous_response_id, the server reconstructs the full conversation from its stored state. If you also send the same conversation in the input array, the server counts both:

    Effective input = server-side state (from previous_response_id) + explicit input array (from your request)

The per-turn figures fit this formula if the server-side state is taken to be everything the previous request in the chain contained, which in Approach A already includes the duplicated history. At Turn 5, the state carried over from Turn 4 is 5,859 tokens of input plus the stored Turn 4 response, and the explicit array contains Turns 1-5 (2,870 tokens, matching Approach C). Combined: 5,859 + 2,870 = 8,729, and the remaining 38 tokens are consistent with the stored one-sentence response, giving 8,767 tokens (matching Approach A).
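One way to check this additive model against the measurements is to assume that the server-side state at a given turn is the whole input of the previous turn in the same chain. Under that assumption (inferred from the numbers, not from documented behavior), each Approach A turn should equal the previous Approach A turn plus the explicit array for the current turn (the Approach C figure), leaving only a small residual for stored outputs:

```python
a = [574, 1769, 3527, 5859, 8767]  # Approach A input tokens, Turns 1-5
c = [574, 1148, 1722, 2296, 2870]  # Approach C input tokens, Turns 1-5

residuals = []
for turn in range(1, len(a)):
    server_state = a[turn - 1]   # assumed: previous turn's full input
    explicit_array = c[turn]     # full array sent this turn
    residual = a[turn] - (server_state + explicit_array)
    residuals.append(residual)
    print(f"Turn {turn + 1}: {server_state} + {explicit_array} + {residual} = {a[turn]}")
```

The residuals stay within a few dozen tokens per turn, which would be consistent with the stored one-sentence responses being carried forward alongside the duplicated input.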
2. Approach B Is the Correct Pattern
Approach B (new message only + previous_response_id) produces token counts nearly identical to Approach C (full array, no prev_id). This confirms:

- The server correctly assembles prior context from previous_response_id
- Sending only the new message avoids duplication
- The slight overhead in B vs C (144 tokens at Turn 5, 370 tokens in total across the 5 turns) is the server adding its stored response outputs to the context
3. Approach A Hits Context Limits Prematurely
In our compaction experiment, compaction first triggered at Turn 11 (~225K input tokens) using Approach A. With Approach B, the same real context would only be ~110K tokens at Turn 11, well under the 200K threshold. This means:
- Compaction triggered earlier than necessary in our test
- The actual compaction-free conversation length with Approach B is roughly 2x longer
- Our compaction test results are still valid (compaction works), but the trigger points would shift later with the correct pattern
4. Cost Implications
Approach A consumed 2.3x the input tokens of Approach B over just 5 turns. Extrapolating roughly to longer conversations (the 10- and 20-turn rows are estimates, not measurements):
| Turns | A: Total Input | B: Total Input | Wasted Tokens (A - B) |
|---|---|---|---|
| 5 | 20,496 | 8,980 | 11,516 |
| 10 | ~80,000 | ~18,000 | ~62,000 |
| 20 | ~320,000 | ~36,000 | ~284,000 |
The waste grows quadratically. For a 20-turn shopping assistant conversation, Approach A would consume roughly 9x the input tokens of Approach B.
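As an alternative to scaling the 5-turn totals, the projection can be sketched from per-turn growth. The model below is an assumption fitted to the measured data, not an API guarantee: each turn adds about 574 user tokens, each stored response adds about 37 tokens, and in Approach A every turn re-sends the full array on top of the previous turn's input. Because it compounds per-turn growth for both approaches, it produces larger absolute totals than the rough table above, so both sets of figures should be read as estimates.

```python
def project_totals(turns: int, user_tokens: int = 574, output_tokens: int = 37):
    """Return cumulative (A, B) input tokens after `turns` turns.

    Assumed model: the server-side state equals the previous turn's input
    plus its stored output; user_tokens and output_tokens are rough fits
    to the 5-turn measurements.
    """
    total_a = total_b = 0
    prev_a = prev_b = 0
    for n in range(1, turns + 1):
        carried = output_tokens if n > 1 else 0  # no stored output before Turn 1
        turn_a = prev_a + carried + user_tokens * n  # state + full array
        turn_b = prev_b + carried + user_tokens      # state + new message only
        total_a += turn_a
        total_b += turn_b
        prev_a, prev_b = turn_a, turn_b
    return total_a, total_b


for turns in (5, 10, 20):
    total_a, total_b = project_totals(turns)
    print(f"{turns} turns: A={total_a:,}  B={total_b:,}  ratio={total_a / total_b:.1f}x")
```

On the measured 5 turns this model lands within 1% of the observed totals for both approaches, which is the only validation available here; behavior beyond 5 turns was not tested.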
Cached Token Behavior
| Turn | A: Cached | B: Cached | C: Cached |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 1,280 | 0 | 0 |
| 4 | 3,328 | 1,280 | 1,280 |
| 5 | 5,376 | 2,304 | 1,792 |
Approach A shows higher cached tokens because the duplicated content is cacheable — the server recognizes the repeated prefix. However, caching mitigates cost, not context window usage. The full (uncached) token count still applies against the context limit.
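To make the distinction concrete, the uncached portion of each Turn 5 request can be computed from the tables above. This is plain arithmetic on the reported figures; how cached tokens are actually billed depends on pricing that is not covered here.

```python
# Turn 5 figures from the per-turn and cached-token tables: (input, cached).
turn5 = {"A": (8767, 5376), "B": (3014, 2304), "C": (2870, 1792)}

uncached = {}
for name, (input_tokens, cached_tokens) in turn5.items():
    uncached[name] = input_tokens - cached_tokens
    share = cached_tokens / input_tokens
    print(
        f"{name}: context={input_tokens:,}  "
        f"uncached={uncached[name]:,}  cached share={share:.0%}"
    )
```

Even with the largest cached count, Approach A still has the most uncached tokens at Turn 5 and occupies nearly three times the context of Approach B.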
Implications for the AI Shopping Assistant
Correct Integration Pattern

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# system_prompt, new_user_message, full_conversation_history, and
# prev_response_id are assumed to be defined by the surrounding application.

# ✅ CORRECT: Send only new message + previous_response_id
response = client.responses.create(
    model="gpt-5.4-mini",
    input=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": new_user_message},  # Only the new message
    ],
    previous_response_id=prev_response_id,  # Server carries prior context
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    store=True,
)

# ❌ WRONG: Sends full history + previous_response_id (duplicates context)
response = client.responses.create(
    model="gpt-5.4-mini",
    input=[
        {"role": "system", "content": system_prompt},
        *full_conversation_history,  # Duplicated with server state!
    ],
    previous_response_id=prev_response_id,
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    store=True,
)
```

When to Use Each Pattern

| Pattern | When to Use |
|---|---|
| B: New msg + prev_id | Default for all multi-turn conversations. Most token-efficient. Server manages state. |
| C: Full array, no prev_id | When you need explicit control over conversation history (e.g., editing prior messages, injecting system context mid-conversation). |
| A: Full array + prev_id | Never. Duplicates context, wastes tokens, hits limits prematurely. |
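Since Approach A is easy to reintroduce by accident, a lightweight check before each request may help. The sketch below is a heuristic only: `looks_like_duplicated_history` is a hypothetical helper, and it assumes the input is a list of role/content dictionaries like the ones used in this report. It flags a request that chains with previous_response_id while also carrying assistant messages or more than one user message.

```python
def looks_like_duplicated_history(kwargs: dict) -> bool:
    """Heuristic: True if a request replays history while also chaining.

    Assumes kwargs["input"] is a list of {"role": ..., "content": ...}
    dicts; any other input shape is not inspected.
    """
    if not kwargs.get("previous_response_id"):
        return False  # Approach C: sending the full array is expected
    items = kwargs.get("input")
    if not isinstance(items, list):
        return False  # e.g. a plain string input
    roles = [item.get("role") for item in items if isinstance(item, dict)]
    replays_assistant = "assistant" in roles
    multiple_user_turns = roles.count("user") > 1
    return replays_assistant or multiple_user_turns
```

A request that legitimately sends several new user messages in one turn would trip this check, so it is better suited to a warning or a test assertion than to a hard failure.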
Impact on Episode-Based History
Our current episode-based history system stores messages in DynamoDB and reconstructs the full conversation for each turn. If we adopt previous_response_id chaining (Approach B):
- DynamoDB still stores all messages (for audit, replay, analytics)
- But we don’t send them all to OpenAI — just the new message + prev_id
- The server carries forward context via its stored state
- Compaction handles overflow automatically
- History becomes a read-side concern (analytics, debugging), not a write-side concern (context management)
Conclusion
H1 confirmed: The Responses API does not deduplicate when both the input array and previous_response_id contain overlapping conversation history. The context is double-counted, resulting in 2.3x the input token consumption of Approach B over 5 turns, with per-turn input growing quadratically.
The correct pattern for long-running conversations is Approach B: send only the system prompt + new user message in input, pass previous_response_id for continuity, and let the server manage all prior state. This halves token consumption, delays compaction triggers, and extends conversation length before any context management is needed.
Appendix
Test Code

Test file: consumer-agent/tests/integration/test_context_duplication.py
Branch: experiment/gpt-5.4-compaction-test
OpenAI SDK Version
openai==2.31.0
Raw Data
Section titled “Raw Data”APPROACH A (Full array + previous_response_id): Turn 1: 574 input | 0 cached Turn 2: 1,769 input | 0 cached Turn 3: 3,527 input | 1,280 cached Turn 4: 5,859 input | 3,328 cached Turn 5: 8,767 input | 5,376 cached Total: 20,496 input
APPROACH B (New message + previous_response_id): Turn 1: 574 input | 0 cached Turn 2: 1,189 input | 0 cached Turn 3: 1,798 input | 0 cached Turn 4: 2,405 input | 1,280 cached Turn 5: 3,014 input | 2,304 cached Total: 8,980 input
APPROACH C (Full array, no previous_response_id): Turn 1: 574 input | 0 cached Turn 2: 1,148 input | 0 cached Turn 3: 1,722 input | 0 cached Turn 4: 2,296 input | 1,280 cached Turn 5: 2,870 input | 1,792 cached Total: 8,610 input