Experiment Report: Context Duplication When Combining Input Array with previous_response_id

Date: 2026-04-15
Model under test: gpt-5.4-mini
Related: Compaction Experiment Report · Original Context Window Experiment

Executive summary: Sending the full conversation history in the input array and passing previous_response_id causes the API to double-count prior turns. In a 5-turn test, Approach A (full array + prev_id) consumed 2.3x the input tokens of Approach B (new message only + prev_id). The server does not deduplicate — it assembles prior context from previous_response_id and adds the explicit input array on top. The correct pattern for long conversations is to send only the new user message with previous_response_id and let the server carry forward all prior state.

During the GPT-5.4 compaction experiment, we observed compaction triggering at Turn 11 (~225K input tokens). The test was sending the full conversation array alongside previous_response_id on every turn. This raised the question: is the context being counted twice?

The OpenAI conversation state documentation describes two chaining patterns:

  1. Stateless input-array chaining: Append all output items to the conversation array and pass the full array each turn. No previous_response_id.
  2. previous_response_id chaining: Pass only the new user message each turn. The server carries forward all prior context.

These are described as alternatives, not complements. We tested what happens when you combine both.

H1: Sending the full conversation array in input while also passing previous_response_id will result in higher input token counts than either approach alone, because the server assembles prior turns from previous_response_id and additionally counts the explicit input array — without deduplication.

Three approaches were tested over 5 identical turns with gpt-5.4-mini. Each turn sends ~575 tokens of user content (excerpts from Moby-Dick) and asks for a one-sentence summary.

| Approach | input array | previous_response_id | Description |
|---|---|---|---|
| A | Full conversation history (grows each turn) | Yes (chained from prior response) | Both mechanisms combined |
| B | System prompt + new user message only | Yes (chained from prior response) | Server carries all prior state |
| C | Full conversation history (grows each turn) | None | Client manages all state |

All three used identical message content, max_output_tokens=100, store=true, and gpt-5.4-mini.

Input tokens per turn:

| Turn | A: Full Array + prev_id | B: New Msg + prev_id | C: Full Array Only | A vs B Overhead | A vs C Overhead |
|---|---|---|---|---|---|
| 1 | 574 | 574 | 574 | +0 | +0 |
| 2 | 1,769 | 1,189 | 1,148 | +580 | +621 |
| 3 | 3,527 | 1,798 | 1,722 | +1,729 | +1,805 |
| 4 | 5,859 | 2,405 | 2,296 | +3,454 | +3,563 |
| 5 | 8,767 | 3,014 | 2,870 | +5,753 | +5,897 |

Totals over 5 turns:

| Approach | Total Input Tokens | vs Approach B |
|---|---|---|
| A (Full array + prev_id) | 20,496 | 2.3x (+128%) |
| B (New msg + prev_id) | 8,980 | 1.0x (baseline) |
| C (Full array, no prev_id) | 8,610 | 0.96x |

Figure: input tokens (y-axis, 0 to 9,000) by turn (x-axis, 1 to 5) for A (full + prev_id), B (new msg + prev_id), and C (full array only).

Approach A's per-turn input grows quadratically: every turn resends the full array on top of server state that already holds the arrays sent on earlier turns. Approaches B and C grow linearly at nearly the same rate.
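The totals and ratios reported in the results can be rechecked from the per-turn counts. The following is a minimal sketch that only re-adds the numbers from this report; the names `INPUT_TOKENS`, `totals`, and `ratio_to_baseline` are illustrative and not part of the test suite.

```python
# Per-turn input token counts copied from the results in this report.
INPUT_TOKENS = {
    "A": [574, 1769, 3527, 5859, 8767],  # full array + previous_response_id
    "B": [574, 1189, 1798, 2405, 3014],  # new message + previous_response_id
    "C": [574, 1148, 1722, 2296, 2870],  # full array, no previous_response_id
}


def totals(per_turn):
    """Sum the per-turn input tokens for each approach."""
    return {name: sum(values) for name, values in per_turn.items()}


def ratio_to_baseline(total_by_approach, baseline="B"):
    """Express each total as a multiple of the baseline approach's total."""
    base = total_by_approach[baseline]
    return {name: round(total / base, 2) for name, total in total_by_approach.items()}


TOTALS = totals(INPUT_TOKENS)
RATIOS = ratio_to_baseline(TOTALS)

print(TOTALS)  # {'A': 20496, 'B': 8980, 'C': 8610}
print(RATIOS)  # {'A': 2.28, 'B': 1.0, 'C': 0.96}
```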

When you pass previous_response_id, the server reconstructs the full conversation from its stored state. If you also send the same conversation in the input array, the server counts both:

    Effective input = server-side state (from previous_response_id)
                    + explicit input array (from your request)

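At Turn 5, the server-side state is everything Approach A sent at Turn 4 (5,859 tokens, which already contains duplicated turns) plus the Turn 4 response. The explicit array then adds Turns 1-5 again (~2,870 tokens, matching Approach C). Combined: 5,859 + 2,870 = 8,729, and the remaining 38 tokens are consistent with the stored Turn 4 output, giving 8,767 tokens (matching Approach A).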
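One way to test the effective-input formula against the measurements is to check whether each Approach A turn equals the previous Approach A turn plus the explicit array for the current turn, with Approach C's count standing in for the array size. The sketch below does that. It rests on an assumption about how stored state accumulates, inferred from the numbers rather than taken from API documentation, and `compounding_residuals` is a hypothetical helper name.

```python
# Per-turn input tokens copied from the results in this report.
A = [574, 1769, 3527, 5859, 8767]  # full array + previous_response_id
C = [574, 1148, 1722, 2296, 2870]  # full array only (size of the explicit array)


def compounding_residuals(a, c):
    """For turns 2..N, return A_n - (A_{n-1} + C_n).

    A small, roughly constant residual suggests that the stored state for
    turn n is the whole input of turn n-1 plus that turn's short output.
    """
    return [a[n] - (a[n - 1] + c[n]) for n in range(1, len(a))]


RESIDUALS = compounding_residuals(A, C)
print(RESIDUALS)  # [47, 36, 36, 38]
```

The residuals stay well under the max_output_tokens=100 cap, which is what one would expect if they are the stored one-sentence summaries.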

Approach B (new message only + previous_response_id) produces token counts nearly identical to Approach C (full array, no prev_id). This confirms:

  • The server correctly assembles prior context from previous_response_id
  • Sending only the new message avoids duplication
  • The slight overhead in B vs C (144 tokens at Turn 5, 370 tokens cumulative over the 5 turns) is the server adding its stored response outputs to the context
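The B-versus-C overhead can be read straight off the per-turn counts. This short sketch computes it per turn and cumulatively from the numbers in this report; `per_turn_overhead` is an illustrative name.

```python
# Per-turn input tokens copied from the results in this report.
B = [574, 1189, 1798, 2405, 3014]  # new message + previous_response_id
C = [574, 1148, 1722, 2296, 2870]  # full array, no previous_response_id


def per_turn_overhead(b, c):
    """Tokens Approach B used beyond Approach C on each turn."""
    return [b_n - c_n for b_n, c_n in zip(b, c)]


OVERHEAD = per_turn_overhead(B, C)
CUMULATIVE_OVERHEAD = sum(OVERHEAD)

print(OVERHEAD)             # [0, 41, 76, 109, 144]
print(CUMULATIVE_OVERHEAD)  # 370
```

The overhead rises by 33 to 41 tokens per turn, which fits the explanation that each stored one-sentence response is added to the context.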

3. Approach A Hits Context Limits Prematurely

In our compaction experiment, compaction first triggered at Turn 11 (~225K input tokens) using Approach A. With Approach B, the same real context would only be ~110K tokens at Turn 11 — well under the 200K threshold. This means:

  • Compaction triggered earlier than necessary in our test
  • The actual compaction-free conversation length with Approach B is roughly 2x longer
  • Our compaction test results are still valid (compaction works), but the trigger points would shift later with the correct pattern

Approach A consumed 2.3x the input tokens over just 5 turns. At scale:

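| Turns | A: Total Input | B: Total Input | Wasted Tokens (A - B) |
|---|---|---|---|
| 5 (measured) | 20,496 | 8,980 | 11,516 |
| 10 (extrapolated) | ~128,000 | ~33,000 | ~95,000 |
| 20 (extrapolated) | ~891,000 | ~127,000 | ~764,000 |

The 10- and 20-turn rows extrapolate the measured per-turn increments: B's input rises by ~610 tokens each turn, while A's input at turn n rises by the full array (~574 tokens per turn of history) plus ~38 tokens of stored output on top of its previous input. The per-turn waste grows quadratically, so the cumulative waste grows faster still. For a 20-turn shopping assistant conversation with similar message sizes, Approach A would consume roughly 7x the input tokens of Approach B.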

| Turn | A: Cached | B: Cached | C: Cached |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 1,280 | 0 | 0 |
| 4 | 3,328 | 1,280 | 1,280 |
| 5 | 5,376 | 2,304 | 1,792 |

Approach A shows higher cached tokens because the duplicated content is cacheable: the server recognizes the repeated prefix. However, caching mitigates cost, not context window usage. The full input token count, cached or not, still applies against the context limit.
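To separate the cost effect from the context effect, the sketch below splits each turn's input into the tokens that occupy the context window and the tokens that were not served from cache. It assumes cached tokens are a subset of the reported input tokens; the data is copied from the raw output in this report, and `turn_summary` is an illustrative name.

```python
# (input_tokens, cached_tokens) per turn, copied from the raw output in this report.
USAGE = {
    "A": [(574, 0), (1769, 0), (3527, 1280), (5859, 3328), (8767, 5376)],
    "B": [(574, 0), (1189, 0), (1798, 0), (2405, 1280), (3014, 2304)],
    "C": [(574, 0), (1148, 0), (1722, 0), (2296, 1280), (2870, 1792)],
}


def turn_summary(usage, turn):
    """Return {approach: (context_tokens, uncached_tokens)} for a 1-based turn.

    context_tokens is the full input count, which is what fills the context
    window; uncached_tokens is the part not served from the prompt cache.
    """
    summary = {}
    for name, per_turn in usage.items():
        input_tokens, cached_tokens = per_turn[turn - 1]
        summary[name] = (input_tokens, input_tokens - cached_tokens)
    return summary


TURN_5 = turn_summary(USAGE, 5)
print(TURN_5)  # {'A': (8767, 3391), 'B': (3014, 710), 'C': (2870, 1078)}
```

Even after caching, Approach A still leaves more uncached tokens at Turn 5 than B or C, and its context usage is unchanged by the cache.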

Implications for the AI Shopping Assistant

    # ✅ CORRECT: Send only new message + previous_response_id
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": new_user_message},  # Only the new message
        ],
        previous_response_id=prev_response_id,  # Server carries prior context
        context_management=[{"type": "compaction", "compact_threshold": 200000}],
        store=True,
    )

    # ❌ WRONG: Sends full history + previous_response_id (duplicates context)
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {"role": "system", "content": system_prompt},
            *full_conversation_history,  # Duplicated with server state!
        ],
        previous_response_id=prev_response_id,
        context_management=[{"type": "compaction", "compact_threshold": 200000}],
        store=True,
    )

| Pattern | When to Use |
|---|---|
| B: New msg + prev_id | Default for all multi-turn conversations. Most token-efficient. Server manages state. |
| C: Full array, no prev_id | When you need explicit control over conversation history (e.g., editing prior messages, injecting system context mid-conversation). |
| A: Full array + prev_id | Never. Duplicates context, wastes tokens, hits limits prematurely. |
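The pattern rules above could be enforced at the point where requests are assembled. The following is a minimal sketch of such a guard: `build_request_kwargs` is a hypothetical helper, `history` is assumed to hold only prior turns (not the new message), and the compaction setting is left out for brevity. It only builds the keyword arguments and makes no API call.

```python
from typing import Optional


def build_request_kwargs(
    system_prompt: str,
    new_user_message: str,
    prev_response_id: Optional[str] = None,
    history: Optional[list] = None,
) -> dict:
    """Build kwargs for pattern B (prev_id given) or pattern C (history given).

    Raises ValueError for pattern A, where both prior history and
    previous_response_id would be sent together.
    """
    if prev_response_id is not None and history:
        raise ValueError(
            "Pattern A rejected: send either prior history or "
            "previous_response_id, not both."
        )

    messages = [{"role": "system", "content": system_prompt}]
    if prev_response_id is None:
        # Pattern C: the client supplies all prior turns itself.
        messages.extend(history or [])
    messages.append({"role": "user", "content": new_user_message})

    kwargs = {"model": "gpt-5.4-mini", "input": messages, "store": True}
    if prev_response_id is not None:
        # Pattern B: the server carries prior context forward.
        kwargs["previous_response_id"] = prev_response_id
    return kwargs
```

Failing loudly on the combined case is a design choice for this sketch; a team might instead prefer to drop the history silently and log a warning.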

Our current episode-based history system stores messages in DynamoDB and reconstructs the full conversation for each turn. If we adopt previous_response_id chaining (Approach B):

  • DynamoDB still stores all messages (for audit, replay, analytics)
  • But we don’t send them all to OpenAI — just the new message + prev_id
  • The server carries forward context via its stored state
  • Compaction handles overflow automatically
  • History becomes a read-side concern (analytics, debugging), not a write-side concern (context management)
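The split between storing history and sending it could look roughly like the sketch below. Everything here is hypothetical: `ChatSession` is an illustrative class, the in-memory `history` list stands in for the DynamoDB store, and `send` is an injected callable that takes request kwargs and returns a dict with `id` and `output_text` keys, so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ChatSession:
    system_prompt: str
    send: Callable[[dict], dict]  # hypothetical transport, injected by the caller
    history: list = field(default_factory=list)  # stand-in for the message store
    prev_response_id: Optional[str] = None

    def turn(self, user_message: str) -> str:
        """Send only the new message, then record both sides of the turn."""
        request = {
            "model": "gpt-5.4-mini",
            "input": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_message},
            ],
            "store": True,
        }
        if self.prev_response_id is not None:
            request["previous_response_id"] = self.prev_response_id

        response = self.send(request)

        # History is written for audit and replay; it is never sent back.
        self.history.append({"role": "user", "content": user_message})
        self.history.append({"role": "assistant", "content": response["output_text"]})
        self.prev_response_id = response["id"]
        return response["output_text"]
```

In a real integration, `send` would presumably wrap `client.responses.create(**request)` and map the returned response's id and output text into that dict shape; that wiring is an assumption and is not shown here.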

H1 confirmed: The Responses API does not deduplicate when both input array and previous_response_id contain overlapping conversation history. The context is double-counted, resulting in 2.3x input token consumption over 5 turns (growing quadratically).

The correct pattern for long-running conversations is Approach B: send only the system prompt + new user message in input, pass previous_response_id for continuity, and let the server manage all prior state. In the 5-turn test this cut input tokens by 56% relative to Approach A (8,980 vs 20,496). It also delays compaction triggers and extends conversation length before any context management is needed.

Test file: consumer-agent/tests/integration/test_context_duplication.py
Branch: experiment/gpt-5.4-compaction-test

openai==2.31.0

APPROACH A (Full array + previous_response_id):
Turn 1: 574 input | 0 cached
Turn 2: 1,769 input | 0 cached
Turn 3: 3,527 input | 1,280 cached
Turn 4: 5,859 input | 3,328 cached
Turn 5: 8,767 input | 5,376 cached
Total: 20,496 input
APPROACH B (New message + previous_response_id):
Turn 1: 574 input | 0 cached
Turn 2: 1,189 input | 0 cached
Turn 3: 1,798 input | 0 cached
Turn 4: 2,405 input | 1,280 cached
Turn 5: 3,014 input | 2,304 cached
Total: 8,980 input
APPROACH C (Full array, no previous_response_id):
Turn 1: 574 input | 0 cached
Turn 2: 1,148 input | 0 cached
Turn 3: 1,722 input | 0 cached
Turn 4: 2,296 input | 1,280 cached
Turn 5: 2,870 input | 1,792 cached
Total: 8,610 input
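For anyone re-running the comparison, the per-turn lines in the raw output above can be turned back into numbers with a small parser. This is a sketch that assumes the exact line shape shown above; `parse_turn_line` is a hypothetical helper and not part of the test file.

```python
import re

# Matches lines such as "Turn 3: 3,527 input | 1,280 cached".
TURN_LINE = re.compile(r"Turn\s+(\d+):\s+([\d,]+)\s+input\s+\|\s+([\d,]+)\s+cached")


def parse_turn_line(line):
    """Parse one raw-output line into a dict, or return None if it is not a turn line."""
    match = TURN_LINE.search(line)
    if match is None:
        return None
    turn, input_tokens, cached_tokens = match.groups()
    return {
        "turn": int(turn),
        "input": int(input_tokens.replace(",", "")),
        "cached": int(cached_tokens.replace(",", "")),
    }


print(parse_turn_line("Turn 3: 3,527 input | 1,280 cached"))
# {'turn': 3, 'input': 3527, 'cached': 1280}
```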