Experiment Report: Context Duplication When Combining Input Array with previous_response_id

Date: 2026-04-15
Model under test: gpt-5.4-mini
Related: Compaction Experiment Report · Original Context Window Experiment

Executive summary: Sending the full conversation history in the input array and passing previous_response_id causes the API to double-count prior turns. In a 5-turn test, Approach A (full array + prev_id) consumed 2.3x the input tokens of Approach B (new message only + prev_id). The server does not deduplicate — it assembles prior context from previous_response_id and adds the explicit input array on top. The correct pattern for long conversations is to send only the new user message with previous_response_id and let the server carry forward all prior state.

Background & Motivation

During the GPT-5.4 compaction experiment, we observed compaction triggering at Turn 11 (~225K input tokens). The test was sending the full conversation array alongside previous_response_id on every turn. This raised the question: is the context being counted twice?

The OpenAI conversation state documentation describes two chaining patterns:

Stateless input-array chaining: Append all output items to the conversation array and pass the full array each turn. No previous_response_id.
previous_response_id chaining: Pass only the new user message each turn. The server carries forward all prior context.

These are described as alternatives, not complements. We tested what happens when you combine both.

Hypothesis

H1: Sending the full conversation array in input while also passing previous_response_id will result in higher input token counts than either approach alone, because the server assembles prior turns from previous_response_id and additionally counts the explicit input array — without deduplication.

Method

Three approaches were tested over 5 identical turns with gpt-5.4-mini. Each turn sends ~575 tokens of user content (excerpts from Moby-Dick) and asks for a one-sentence summary.

Approach	`input` array	`previous_response_id`	Description
A	Full conversation history (grows each turn)	Yes (chained from prior response)	Both mechanisms combined
B	System prompt + new user message only	Yes (chained from prior response)	Server carries all prior state
C	Full conversation history (grows each turn)	None	Client manages all state

All three used identical message content, max_output_tokens=100, store=true, and gpt-5.4-mini.
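The three request shapes in the table can be sketched as a pure function that builds the keyword arguments for `client.responses.create`. This is a minimal sketch, not the test harness itself: `build_request`, `MODEL`, and the exact contents of `history` are assumptions made for illustration.

```python
from typing import Any, Optional

MODEL = "gpt-5.4-mini"  # model named in this report


def build_request(
    approach: str,
    system_prompt: str,
    history: list[dict[str, str]],
    new_user_message: str,
    prev_response_id: Optional[str],
) -> dict[str, Any]:
    """Return kwargs for client.responses.create under approach A, B, or C.

    `history` holds the prior turns' messages, without the system prompt
    and without the new user message (an assumption of this sketch).
    """
    system = {"role": "system", "content": system_prompt}
    new_msg = {"role": "user", "content": new_user_message}

    if approach == "B":
        # Server carries prior state; send only the new message.
        input_items = [system, new_msg]
    elif approach in ("A", "C"):
        # Client sends the full, growing conversation array.
        input_items = [system, *history, new_msg]
    else:
        raise ValueError(f"unknown approach: {approach!r}")

    kwargs: dict[str, Any] = {
        "model": MODEL,
        "input": input_items,
        "max_output_tokens": 100,
        "store": True,
    }
    # A and B chain from the prior response; C never does.
    if approach in ("A", "B") and prev_response_id is not None:
        kwargs["previous_response_id"] = prev_response_id
    return kwargs
```

Keeping the builder free of network calls makes the difference between the approaches easy to inspect: A and C produce the same `input`, and A and B produce the same `previous_response_id`.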

Results

Per-Turn Input Token Comparison

Turn	A: Full Array + prev_id	B: New Msg + prev_id	C: Full Array Only	A vs B Overhead	A vs C Overhead
1	574	574	574	+0	+0
2	1,769	1,189	1,148	+580	+621
3	3,527	1,798	1,722	+1,729	+1,805
4	5,859	2,405	2,296	+3,454	+3,563
5	8,767	3,014	2,870	+5,753	+5,897

Totals

Approach	Total Input Tokens	vs Approach B
A (Full array + prev_id)	20,496	2.3x (+128%)
B (New msg + prev_id)	8,980	1.0x (baseline)
C (Full array, no prev_id)	8,610	0.96x
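The totals and ratios above can be reproduced from the per-turn table with a few lines of Python. The dictionary below copies the measured values; the variable names are illustrative.

```python
# Per-turn input tokens, copied from the Per-Turn Input Token Comparison table.
INPUT_TOKENS = {
    "A": [574, 1769, 3527, 5859, 8767],
    "B": [574, 1189, 1798, 2405, 3014],
    "C": [574, 1148, 1722, 2296, 2870],
}

totals = {name: sum(turns) for name, turns in INPUT_TOKENS.items()}
ratio_vs_b = {name: total / totals["B"] for name, total in totals.items()}

for name in INPUT_TOKENS:
    print(f"{name}: total={totals[name]:,} ratio_vs_B={ratio_vs_b[name]:.2f}x")
# A: total=20,496 ratio_vs_B=2.28x
# B: total=8,980 ratio_vs_B=1.00x
# C: total=8,610 ratio_vs_B=0.96x
```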

Growth Pattern

Input Tokens
  9,000 ┤                                              ╱ A (full + prev_id)
        │                                           ╱╱
  8,000 ┤                                        ╱╱
        │                                     ╱╱
  7,000 ┤                                  ╱╱
        │                               ╱╱
  6,000 ┤                            ╱╱
        │                         ╱╱
  5,000 ┤                      ╱╱
        │                   ╱╱
  4,000 ┤                ╱╱
        │             ╱╱
  3,000 ┤          ╱╱                               ╱ B (new msg + prev_id)
        │       ╱╱                              ╱╱╱  C (full array only)
  2,000 ┤    ╱╱                            ╱╱╱╱
        │  ╱╱                         ╱╱╱╱
  1,000 ┤╱╱                      ╱╱╱╱
        │                  ╱╱╱╱
      0 ┤─────────────╱╱╱╱─────────────────────────────
         1          2          3          4          5
                              Turn

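Approach A's per-turn input grows quadratically. Simple double counting would only double the slope; the curvature arises because each stored response already contains the earlier duplicated history, and every new turn adds the full array on top of it. Approaches B and C grow linearly at nearly the same rate (about 610 and 574 tokens per turn, respectively).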
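One way to check the growth shapes is to take first and second differences of the per-turn input counts: a roughly constant, non-zero second difference indicates quadratic growth, and a second difference near zero indicates linear growth. The helper name `diffs` is illustrative.

```python
def diffs(values: list[int]) -> list[int]:
    """Return the differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


# Per-turn input tokens from the results table.
per_turn = {
    "A": [574, 1769, 3527, 5859, 8767],
    "B": [574, 1189, 1798, 2405, 3014],
    "C": [574, 1148, 1722, 2296, 2870],
}

for name, series in per_turn.items():
    first = diffs(series)
    second = diffs(first)
    print(name, first, second)
# A [1195, 1758, 2332, 2908] [563, 574, 576]
# B [615, 609, 607, 609] [-6, -2, 2]
# C [574, 574, 574, 574] [0, 0, 0]
```

Approach A's second difference sits close to the ~574-token turn size, while B and C stay at or near zero.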

Analysis & Interpretation

1. The API Does Not Deduplicate

When you pass previous_response_id, the server reconstructs the full conversation from its stored state. If you also send the same conversation in the input array, the server counts both:

Effective input = server-side state (from previous_response_id)
                + explicit input array (from your request)

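At Turn 5, the server-side state is Turn 4's full input under Approach A (5,859 tokens, which already includes the earlier duplicated copies) plus Turn 4's stored output. The explicit array contains Turns 1-5 (2,870 tokens, matching Approach C). Combined: 5,859 + 2,870 = 8,729 tokens, and the remaining ~38 tokens are consistent with the stored Turn 4 output, giving the observed 8,767 tokens.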
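The accounting can be checked against every turn, not just Turn 5. The sketch below assumes that the server-side state at turn n equals the previous turn's full Approach A input plus its stored output, and that the explicit array matches Approach C's input for the same turn. Attributing the residual to stored assistant output (capped by max_output_tokens=100) is an assumption, not something the API reports directly.

```python
# Per-turn input tokens from the results table.
A = [574, 1769, 3527, 5859, 8767]  # full array + previous_response_id
C = [574, 1148, 1722, 2296, 2870]  # full array only

# Assumed model: A[n] = A[n-1] (stored input) + C[n] (explicit array) + residual
residuals = [A[n] - A[n - 1] - C[n] for n in range(1, len(A))]
print(residuals)  # [47, 36, 36, 38]
```

The residuals stay small and roughly constant, in the range of a one-sentence summary, which is what the additive model predicts.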

2. Approach B Is the Correct Pattern

Approach B (new message only + previous_response_id) produces token counts nearly identical to Approach C (full array, no prev_id). This confirms:

The server correctly assembles prior context from previous_response_id
Sending only the new message avoids duplication
The slight overhead in B vs C (144 tokens at Turn 5, 370 tokens summed over all 5 turns) is the server adding its stored response outputs to the context

3. Approach A Hits Context Limits Prematurely

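In our compaction experiment, compaction first triggered at Turn 11 (~225K input tokens) using Approach A. With Approach B, the same real context would only be ~110K tokens at Turn 11, well under the 200K threshold. (The ~110K figure assumes a flat 2x duplication; given the compounding visible in the per-turn data here, the real context may be smaller still.) This means: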

Compaction triggered earlier than necessary in our test
The actual compaction-free conversation length with Approach B is roughly 2x longer
Our compaction test results are still valid (compaction works), but the trigger points would shift later with the correct pattern

4. Cost Implications

Approach A consumed 2.3x the input tokens over just 5 turns. At scale:

Turns	A: Total Input	B: Total Input	Wasted Tokens (A - B)
5	20,496	8,980	11,516
10	~80,000	~18,000	~62,000
20	~320,000	~36,000	~284,000

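The waste grows quadratically. For a 20-turn shopping assistant conversation, Approach A would consume roughly 9x the input tokens of Approach B. The 10- and 20-turn rows are rough extrapolations from the 5-turn totals, not measurements, so treat the ratio as an order-of-magnitude estimate.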

Cached Token Behavior

Turn	A: Cached	B: Cached	C: Cached
1	0	0	0
2	0	0	0
3	1,280	0	0
4	3,328	1,280	1,280
5	5,376	2,304	1,792

Approach A shows higher cached tokens because the duplicated content is cacheable — the server recognizes the repeated prefix. However, caching mitigates cost, not context window usage. The full (uncached) token count still applies against the context limit.
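To make the distinction concrete, the sketch below separates each turn's uncached input from its full context footprint, using the values from the tables in this report. It makes no claim about pricing; it only shows that the footprint is unchanged by caching.

```python
# Per-turn input and cached tokens from the tables in this report.
INPUT = {
    "A": [574, 1769, 3527, 5859, 8767],
    "B": [574, 1189, 1798, 2405, 3014],
    "C": [574, 1148, 1722, 2296, 2870],
}
CACHED = {
    "A": [0, 0, 1280, 3328, 5376],
    "B": [0, 0, 0, 1280, 2304],
    "C": [0, 0, 0, 1280, 1792],
}

# Uncached portion per turn; the context footprint is still INPUT[name][turn].
uncached = {
    name: [i - c for i, c in zip(INPUT[name], CACHED[name])]
    for name in INPUT
}

for name in INPUT:
    print(f"{name}: turn-5 footprint={INPUT[name][-1]:,} uncached={uncached[name][-1]:,}")
# A: turn-5 footprint=8,767 uncached=3,391
# B: turn-5 footprint=3,014 uncached=710
# C: turn-5 footprint=2,870 uncached=1,078
```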

Implications for the AI Shopping Assistant

Correct Integration Pattern

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# ✅ CORRECT: Send only new message + previous_response_id
response = client.responses.create(
    model="gpt-5.4-mini",
    input=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": new_user_message},  # Only the new message
    ],
    previous_response_id=prev_response_id,  # Server carries prior context
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    store=True,
)

# ❌ WRONG: Sends full history + previous_response_id (duplicates context)
response = client.responses.create(
    model="gpt-5.4-mini",
    input=[
        {"role": "system", "content": system_prompt},
        *full_conversation_history,  # Duplicated with server state!
    ],
    previous_response_id=prev_response_id,
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    store=True,
)

When to Use Each Pattern

Pattern	When to Use
B: New msg + prev_id	Default for all multi-turn conversations. Most token-efficient. Server manages state.
C: Full array, no prev_id	When you need explicit control over conversation history (e.g., editing prior messages, injecting system context mid-conversation).
A: Full array + prev_id	Never. Duplicates context, wastes tokens, hits limits prematurely.

Impact on Episode-Based History

Our current episode-based history system stores messages in DynamoDB and reconstructs the full conversation for each turn. If we adopt previous_response_id chaining (Approach B):

DynamoDB still stores all messages (for audit, replay, analytics)
But we don’t send them all to OpenAI — just the new message + prev_id
The server carries forward context via its stored state
Compaction handles overflow automatically
History becomes a read-side concern (analytics, debugging), not a write-side concern (context management)
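The split between local history and upstream requests could look roughly like the sketch below. Everything here is hypothetical: `Episode` is an in-memory stand-in for the DynamoDB record, and `send` is an injected callable that would wrap `client.responses.create` and return the response ID and assistant text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Episode:
    """In-memory stand-in for the DynamoDB-backed episode record (hypothetical)."""
    system_prompt: str
    messages: list[dict[str, str]] = field(default_factory=list)  # full audit log
    last_response_id: Optional[str] = None


# Hypothetical transport: takes request kwargs, returns (response_id, assistant_text).
Send = Callable[[dict[str, Any]], tuple[str, str]]


def run_turn(episode: Episode, user_message: str, send: Send) -> str:
    """Send only the new message upstream, then log both sides locally."""
    request: dict[str, Any] = {
        "model": "gpt-5.4-mini",
        "input": [
            {"role": "system", "content": episode.system_prompt},
            {"role": "user", "content": user_message},
        ],
        "store": True,
    }
    # Chain from the prior response instead of resending the history.
    if episode.last_response_id is not None:
        request["previous_response_id"] = episode.last_response_id

    response_id, assistant_text = send(request)

    # The local log keeps everything for audit, replay, and analytics.
    episode.messages.append({"role": "user", "content": user_message})
    episode.messages.append({"role": "assistant", "content": assistant_text})
    episode.last_response_id = response_id
    return assistant_text
```

Injecting `send` keeps the turn logic testable without network access; a production version would persist `Episode` to DynamoDB rather than holding it in memory.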

Conclusion

H1 confirmed: The Responses API does not deduplicate when both input array and previous_response_id contain overlapping conversation history. The context is double-counted, resulting in 2.3x input token consumption over 5 turns (growing quadratically).

The correct pattern for long-running conversations is Approach B: send only the system prompt + new user message in input, pass previous_response_id for continuity, and let the server manage all prior state. In this 5-turn test that cut total input tokens by 56% relative to Approach A (8,980 vs 20,496); it also delays compaction triggers and extends conversation length before any context management is needed.

Appendix

Test Code

Test file: consumer-agent/tests/integration/test_context_duplication.py
Branch: experiment/gpt-5.4-compaction-test

OpenAI SDK Version

openai==2.31.0

Raw Data

APPROACH A (Full array + previous_response_id):
  Turn 1:     574 input |       0 cached
  Turn 2:   1,769 input |       0 cached
  Turn 3:   3,527 input |   1,280 cached
  Turn 4:   5,859 input |   3,328 cached
  Turn 5:   8,767 input |   5,376 cached
  Total:   20,496 input

APPROACH B (New message + previous_response_id):
  Turn 1:     574 input |       0 cached
  Turn 2:   1,189 input |       0 cached
  Turn 3:   1,798 input |       0 cached
  Turn 4:   2,405 input |   1,280 cached
  Turn 5:   3,014 input |   2,304 cached
  Total:    8,980 input

APPROACH C (Full array, no previous_response_id):
  Turn 1:     574 input |       0 cached
  Turn 2:   1,148 input |       0 cached
  Turn 3:   1,722 input |       0 cached
  Turn 4:   2,296 input |   1,280 cached
  Turn 5:   2,870 input |   1,792 cached
  Total:    8,610 input