Experiment Report: Responses API Compaction Behavior at High Context Lengths (GPT-5.4)

Date: 2026-04-15
Models under test: gpt-5.4 and gpt-5.4-mini with context_management compaction enabled (compact_threshold: 200,000)
Baseline model: gpt-5-mini (no compaction, reproduced from original experiment)
Assumed model limits: 1M total context window (GPT-5.4/5.4-mini); 400K total context window (GPT-5-mini)

Executive summary: We tested whether GPT-5.4’s new context_management compaction feature prevents the context_length_exceeded failure observed in the original GPT-5-mini experiment. It does, on both gpt-5.4 and gpt-5.4-mini. Both models completed all 20 turns with 5 automatic compaction events, whereas the baseline failed at Turn 13 (last successful turn: ~263K input tokens). Compaction creates a sawtooth pattern: input grows past the 200K threshold (peaks of ~200K–235K), compacts back to ~159K–183K, and the conversation continues. The mini variant compacted on the same turns with nearly identical token counts, and its full run finished about 3.7x faster (77s vs 283s).

This is a follow-up to the September 2025 experiment which proved that previous_response_id does not prevent context window overflow. That experiment failed at Turn 7 (~272K input tokens) with gpt-5-mini.

OpenAI has since released GPT-5.4 — the first mainline model trained to support compaction. Compaction is a new context_management parameter in the Responses API that automatically summarizes conversation state into an encrypted, opaque item when the token count crosses a configurable threshold. This experiment tests whether compaction solves the long-conversation problem for our AI shopping assistant.

What changed since the original experiment:

  • GPT-5.4 supports a 1M token context window (vs 400K for GPT-5-mini)
  • New context_management API parameter with compaction type
  • Compaction is automatic — no separate API call required
  • Compaction items are encrypted and opaque (not human-readable)

Reference: OpenAI Compaction Guide

H1: GPT-5.4 with context_management compaction enabled will not fail with context_length_exceeded, even when the conversation would have exceeded the per-request input budget without compaction.

H2: Compaction will trigger automatically when input tokens cross the compact_threshold, reducing context size while preserving conversational coherence.

Baseline Test (gpt-5-mini, no compaction)

  • Model: gpt-5-mini
  • Compaction: None (same as original experiment)
  • Turn pattern: Each turn sends a large user message (~5K tokens of Moby-Dick public-domain text) plus asks for a cumulative summary
  • Chaining: previous_response_id passed from each response to the next (store: true)
  • Max output: 500 tokens per turn (keep responses short, maximize input growth)
  • Safety cap: 50 turns

Compaction Tests (gpt-5.4 and gpt-5.4-mini, compaction enabled)

  • Models: gpt-5.4 and gpt-5.4-mini (tested separately, identical parameters)
  • Compaction: context_management: [{"type": "compaction", "compact_threshold": 200000}]
  • Turn pattern: Identical to baseline (same Moby-Dick inflation text, same message format)
  • Chaining: previous_response_id passed from each response to the next (store: true)
  • Max output: 500 tokens per turn
  • Safety cap: 20 turns (sufficient to prove survival past baseline failure point)
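
The turn pattern described for both configurations can be sketched as a small driver loop. This is a minimal sketch rather than the actual test code: `run_conversation`, `TurnResult`, and the `send_turn` callback are hypothetical names, and the callback stands in for the `client.responses.create` call so the loop can be exercised without network access. It also assumes the overflow surfaces as an exception whose message contains `context_length_exceeded`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TurnResult:
    turn: int
    input_tokens: Optional[int]
    status: str  # "ok" or "context_length_exceeded"


# Hypothetical callback: takes (turn, previous_response_id) and returns
# (response_id, input_tokens). In a real test it would wrap
# client.responses.create(...) and read the usage block of the response.
SendTurn = Callable[[int, Optional[str]], Tuple[str, int]]


def run_conversation(send_turn: SendTurn, max_turns: int) -> List[TurnResult]:
    """Chain turns via previous_response_id until failure or the safety cap."""
    results: List[TurnResult] = []
    prev_id: Optional[str] = None
    for turn in range(1, max_turns + 1):
        try:
            prev_id, input_tokens = send_turn(turn, prev_id)
        except Exception as exc:  # assumption: the error text names the failure
            if "context_length_exceeded" in str(exc):
                results.append(TurnResult(turn, None, "context_length_exceeded"))
                break
            raise
        results.append(TurnResult(turn, input_tokens, "ok"))
    return results


def fake_send_turn(turn: int, prev_id: Optional[str]) -> Tuple[str, int]:
    """Stand-in backend that overflows on turn 4, for demonstration only."""
    if turn >= 4:
        raise RuntimeError("context_length_exceeded")
    return f"resp_{turn}", 5_000 * turn


demo = run_conversation(fake_send_turn, max_turns=50)
print([(r.turn, r.status) for r in demo])
```

Passing the backend in as a callback keeps the safety cap and failure handling identical across the baseline and compaction runs; only the request parameters differ.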

Each user message includes ~5,000 tokens of Moby-Dick (Gutenberg #2701) text — the opening chapters from “Call me Ishmael” through Ishmael’s motivations for going to sea. This mirrors a shopping assistant session where context accumulates through tool outputs, user messages, and reasoning tokens, but forces growth quickly to reach the threshold region.

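Baseline (gpt-5-mini, no compaction):

| Turn | Input Tokens | Cached Tokens | Output Tokens | Latency | Status |
|---|---|---|---|---|---|
| 1 | 5,784 | 0 | 500 | 9.7s | ok |
| 2 | 14,555 | 4,864 | 500 | 7.9s | ok |
| 3 | 26,230 | 14,080 | 500 | 8.6s | ok |
| 4 | 40,861 | 25,472 | 487 | 8.6s | ok |
| 5 | 58,470 | 0 | 491 | 10.3s | ok |
| 6 | 79,000 | 40,704 | 500 | 9.0s | ok |
| 7 | 102,828 | 79,488 | 443 | 10.1s | ok |
| 8 | 128,757 | 78,720 | 500 | 10.3s | ok |
| 9 | 158,427 | 129,152 | 277 | 8.0s | ok |
| 10 | 190,267 | 128,640 | 500 | 13.1s | ok |
| 11 | 225,782 | 0 | 278 | 22.5s | ok |
| 12 | 263,447 | 190,208 | 500 | 14.8s | ok |
| 13 | - | - | - | - | context_length_exceeded |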

Baseline result: Failed at Turn 13. Last successful turn had 263,447 input tokens. Consistent with the original experiment’s ~272K effective limit.

Total test duration: 134s (2 min 14s)

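Compaction (gpt-5.4):

| Turn | Input Tokens | Cached Tokens | Output Tokens | Latency | Status |
|---|---|---|---|---|---|
| 1 | 5,784 | 0 | 118 | 6.0s | ok |
| 2 | 14,605 | 5,760 | 89 | 15.6s | ok |
| 3 | 26,315 | 14,592 | 100 | 9.0s | ok |
| 4 | 40,954 | 26,240 | 107 | 6.9s | ok |
| 5 | 58,518 | 40,960 | 106 | 13.4s | ok |
| 6 | 78,999 | 58,496 | 110 | 14.2s | ok |
| 7 | 102,403 | 78,976 | 109 | 16.7s | ok |
| 8 | 128,724 | 0 | 106 | 4.9s | ok |
| 9 | 157,960 | 128,640 | 106 | 3.7s | ok |
| 10 | 190,114 | 157,952 | 108 | 4.2s | ok |
| 11 | 225,189 | 190,080 | 890 | 41.8s | compacted |
| 12 | 159,253 | 0 | 108 | 6.8s | ok |
| 13 | 200,164 | 159,232 | 866 | 29.7s | compacted |
| 14 | 165,094 | 0 | 105 | 7.1s | ok |
| 15 | 211,838 | 164,992 | 889 | 40.6s | compacted |
| 16 | 170,955 | 0 | 105 | 6.3s | ok |
| 17 | 223,535 | 170,880 | 889 | 19.1s | compacted |
| 18 | 176,788 | 0 | 105 | 6.5s | ok |
| 19 | 235,203 | 176,768 | 887 | 18.4s | compacted |
| 20 | 182,620 | 0 | 105 | 11.5s | ok |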

Compaction result: All 20 turns completed. 5 compaction events at turns 11, 13, 15, 17, 19.

Total test duration: 283s (4 min 43s)

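Compaction (gpt-5.4-mini):

| Turn | Input Tokens | Cached Tokens | Output Tokens | Latency | Status |
|---|---|---|---|---|---|
| 1 | 5,784 | 0 | 115 | 2.9s | ok |
| 2 | 14,605 | 5,376 | 109 | 1.7s | ok |
| 3 | 26,343 | 14,592 | 109 | 1.5s | ok |
| 4 | 41,002 | 25,856 | 107 | 1.6s | ok |
| 5 | 58,578 | 40,704 | 109 | 1.5s | ok |
| 6 | 79,077 | 58,112 | 110 | 1.6s | ok |
| 7 | 102,496 | 78,592 | 107 | 2.0s | ok |
| 8 | 128,831 | 102,144 | 107 | 1.7s | ok |
| 9 | 158,085 | 128,768 | 27 | 1.5s | ok |
| 10 | 190,177 | 157,952 | 29 | 1.6s | ok |
| 11 | 225,190 | 189,696 | 902 | 12.0s | compacted |
| 12 | 159,443 | 0 | 28 | 2.7s | ok |
| 13 | 200,293 | 158,976 | 663 | 7.7s | compacted |
| 14 | 165,174 | 0 | 99 | 3.8s | ok |
| 15 | 211,932 | 165,120 | 923 | 7.7s | compacted |
| 16 | 171,119 | 0 | 28 | 2.6s | ok |
| 17 | 223,644 | 170,752 | 786 | 7.6s | compacted |
| 18 | 176,908 | 0 | 82 | 3.3s | ok |
| 19 | 235,325 | 176,384 | 733 | 9.0s | compacted |
| 20 | 182,715 | 0 | 28 | 2.5s | ok |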

Compaction result: All 20 turns completed. 5 compaction events at turns 11, 13, 15, 17, 19 — identical pattern to gpt-5.4.

Total test duration: 77s (1 min 17s), 3.7x faster than gpt-5.4

gpt-5.4:

  • Completed turns: 20 (all)
  • Compaction events: 5 (at turns 11, 13, 15, 17, 19)
  • Peak input tokens: 235,203 (Turn 19, just before compaction)
  • Post-compaction input tokens: ~159K–183K (reduced by 35K–66K each time)
  • Compact threshold: 200,000

gpt-5.4-mini:

  • Completed turns: 20 (all)
  • Compaction events: 5 (at turns 11, 13, 15, 17, 19)
  • Peak input tokens: 235,325 (Turn 19, just before compaction)
  • Post-compaction input tokens: ~159K–183K (nearly identical to gpt-5.4)
  • Compact threshold: 200,000

[Figure: input tokens (y-axis, 120K–260K) per turn (x-axis, Turns 1–20), showing the sawtooth pattern around the 200K threshold line, with the baseline failure at Turn 13 marked.]

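Input grows at an accelerating rate (per-turn increments rise from ~9K to ~35K tokens over the first 11 turns) until it crosses the 200K threshold; compaction then drops the next turn’s input to ~159K–183K. Once compaction starts, the cycle repeats every 2 turns, producing a sawtooth pattern. The post-compaction floor rose by ~5.8K tokens per cycle (159K at Turn 12 to 183K at Turn 20), so this 20-turn run shows the pattern holding over five cycles but does not by itself show that it holds indefinitely.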
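
The size of each compaction step can be recomputed from the gpt-5.4 results table. The snippet below is a small worked calculation, not part of the test suite; the variable names are illustrative, and the token counts are copied from the compaction turns and the turns immediately after them.

```python
# (input tokens on the compaction turn, input tokens on the following turn),
# copied from the gpt-5.4 results table.
cycles = [
    (225_189, 159_253),  # turn 11 -> 12
    (200_164, 165_094),  # turn 13 -> 14
    (211_838, 170_955),  # turn 15 -> 16
    (223_535, 176_788),  # turn 17 -> 18
    (235_203, 182_620),  # turn 19 -> 20
]

# Absolute and relative drop in input tokens across each compaction.
reductions = [before - after for before, after in cycles]
percentages = [round(100 * (before - after) / before, 1) for before, after in cycles]

# How much the post-compaction floor rises from one cycle to the next.
floors = [after for _, after in cycles]
floor_steps = [later - earlier for earlier, later in zip(floors, floors[1:])]

print(reductions)
print(percentages)
print(floor_steps)
```

On this data the drops range from about 35K to 66K tokens (roughly 18–29% of the pre-compaction input), and the post-compaction floor rises by about 5.8K tokens per cycle.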

  1. Compaction triggers reliably. Every time input tokens crossed ~200K, compaction fired on that same turn. The response includes a compaction item alongside the normal output.

  2. Context reduction is significant. Each compaction reduced the next turn’s input by 35K–66K tokens (roughly 18–29%). The exact reduction varies based on how much compressible content exists.

  3. Cached tokens reset after compaction. Post-compaction turns show 0 cached tokens, indicating the server treats the compacted state as a new cache baseline. Caching resumes on the following turn.

  4. Compaction turns are slower. On gpt-5.4, compaction turns averaged ~30s vs ~8s for normal turns.

  5. Output tokens spike during compaction. On gpt-5.4, normal turns produce ~105 output tokens and compaction turns ~890, even though max_output_tokens was 500. The delta (~785 tokens) is most likely the encrypted compaction item being counted as output, which would also account for the extra latency.

  6. No quality degradation observed. The model continued to reference earlier conversation details after compaction events, suggesting the compaction item preserves key context.

  7. gpt-5.4-mini exhibits identical compaction behavior. The mini variant compacts at the same turns (11, 13, 15, 17, 19), produces nearly identical peak/post-compaction token counts, and generates the same sawtooth pattern. The only difference is speed: 3.7x faster overall (77s vs 283s), with normal turns averaging ~2s (vs ~8s) and compaction turns averaging ~9s (vs ~30s). This makes the mini variant strongly preferable for cost-sensitive or latency-sensitive use cases where compaction is needed.
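
The latency and cache observations above can be checked directly against the gpt-5.4-mini results table. This is an illustrative calculation with made-up variable names, using only the cached-token, latency, and status columns as reported.

```python
from statistics import mean

# (turn, cached_tokens, latency_seconds, status) from the gpt-5.4-mini table.
rows = [
    (1, 0, 2.9, "ok"), (2, 5_376, 1.7, "ok"), (3, 14_592, 1.5, "ok"),
    (4, 25_856, 1.6, "ok"), (5, 40_704, 1.5, "ok"), (6, 58_112, 1.6, "ok"),
    (7, 78_592, 2.0, "ok"), (8, 102_144, 1.7, "ok"), (9, 128_768, 1.5, "ok"),
    (10, 157_952, 1.6, "ok"), (11, 189_696, 12.0, "compacted"),
    (12, 0, 2.7, "ok"), (13, 158_976, 7.7, "compacted"), (14, 0, 3.8, "ok"),
    (15, 165_120, 7.7, "compacted"), (16, 0, 2.6, "ok"),
    (17, 170_752, 7.6, "compacted"), (18, 0, 3.3, "ok"),
    (19, 176_384, 9.0, "compacted"), (20, 0, 2.5, "ok"),
]

# Mean latency for normal turns vs compaction turns.
normal = mean(lat for _, _, lat, status in rows if status == "ok")
compaction = mean(lat for _, _, lat, status in rows if status == "compacted")

compacted_turns = {turn for turn, _, _, status in rows if status == "compacted"}
# Turns after the first that report zero cached tokens.
cache_resets = {turn for turn, cached, _, _ in rows if cached == 0 and turn > 1}

print(f"normal: {normal:.1f}s, compaction: {compaction:.1f}s")
print(cache_resets == {turn + 1 for turn in compacted_turns})
```

For the mini run, every zero-cache turn after Turn 1 immediately follows a compaction. The gpt-5.4 run is less clean: Turn 8 also reported 0 cached tokens with no preceding compaction, so a zero-cache turn is not by itself proof that compaction occurred.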

| Metric | Baseline (gpt-5-mini) | Compaction (gpt-5.4) | Compaction (gpt-5.4-mini) |
|---|---|---|---|
| Model | gpt-5-mini | gpt-5.4 | gpt-5.4-mini |
| Compaction | None | compact_threshold=200K | compact_threshold=200K |
| Turns completed | 12 (failed at 13) | 20 (all) | 20 (all) |
| Peak input tokens | 263,447 | 235,203 | 235,325 |
| Failure mode | context_length_exceeded | None | None |
| Compaction events | N/A | 5 | 5 |
| Compaction turns | N/A | 11, 13, 15, 17, 19 | 11, 13, 15, 17, 19 |
| Total duration | 134s | 283s | 77s |
| Avg latency (normal turn) | ~10s | ~8s | ~2s |
| Avg latency (compaction turn) | N/A | ~30s | ~9s |

Implications for the AI Shopping Assistant

The original experiment concluded that we must manage history explicitly (sliding window + summaries) because previous_response_id doesn’t prevent overflow. With GPT-5.4 compaction, this constraint is relaxed:

  1. Compaction handles overflow automatically. We no longer need to implement client-side sliding window or manual summarization to prevent context_length_exceeded errors. The API handles it.

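  2. Conversations are no longer capped by the context window. The sawtooth pattern held for all five compaction cycles in this test with no overflow. Whether it holds over much longer sessions needs a longer run, since the post-compaction input rose slightly with each cycle.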

  3. Episode-based history is still valuable. Compaction doesn’t replace our DynamoDB-based history. We still need history for: cross-session continuity, audit trails, analytics, and replay. But we no longer need history as a context management mechanism.

Risks and open questions:

  1. Compaction loses information. The compaction item is a lossy summary. For our use case, this means older tool outputs, offer details, and user preferences may be compressed away. We should evaluate whether critical shopping context survives compaction.

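  2. Compaction turns are slower (~30s on gpt-5.4, ~9s on gpt-5.4-mini). This is a UX concern. In this test, every second turn was a slow turn once the threshold was reached; sessions with smaller per-turn inputs should compact less often. For a shopping assistant where users expect snappy responses, this may be noticeable.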

  3. Cost implications. Compaction turns produce more output tokens (~890 vs ~105). At GPT-5.4 pricing, this adds cost on compaction turns. Need to model total cost vs the alternative (managing state ourselves).

  4. Opaque compaction items. The compaction item is encrypted — we can’t inspect, modify, or audit what was preserved. This is a tradeoff for convenience.
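
As a starting point for the cost modelling mentioned above, the extra output tokens attributable to compaction in the gpt-5.4 run can be totalled from the results table. This is a rough sketch: `extra_output_cost` is a hypothetical helper, and the price is an explicit placeholder, not an actual GPT-5.4 rate.

```python
def extra_output_cost(compaction_outputs, normal_output, price_per_million):
    """Return (extra output tokens, their cost) summed over compaction turns."""
    extra_tokens = sum(tokens - normal_output for tokens in compaction_outputs)
    return extra_tokens, extra_tokens / 1_000_000 * price_per_million


# Output tokens on the five gpt-5.4 compaction turns (from the results table).
GPT54_COMPACTION_OUTPUTS = [890, 866, 889, 889, 887]
NORMAL_OUTPUT = 105  # approximate normal-turn output reported for gpt-5.4

# PLACEHOLDER price in USD per 1M output tokens; substitute the real rate.
PLACEHOLDER_PRICE = 10.0

extra_tokens, extra_cost = extra_output_cost(
    GPT54_COMPACTION_OUTPUTS, NORMAL_OUTPUT, PLACEHOLDER_PRICE
)
print(extra_tokens, round(extra_cost, 4))
```

This covers only the output-token overhead. With inputs of 160K–235K tokens per turn, input and cached-input pricing will likely dominate the total, so a fuller model should include those columns as well.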

Next steps:

  1. Quality evaluation: Run the shopping assistant through multi-turn conversations with compaction enabled. Evaluate whether critical context (user preferences, offer details, tool results) survives compaction events.

  2. Threshold tuning: Test different compact_threshold values. 200K may be too aggressive (frequent compaction) or too conservative (large context windows). Consider 300K or 400K for GPT-5.4’s 1M window.

  3. Latency mitigation: Investigate whether compaction latency can be hidden (e.g., streaming the normal response first, compaction happening asynchronously).

  4. Hybrid approach: Consider using compaction as a safety net (high threshold, e.g., 500K) while still managing context explicitly for quality. This gives us the best of both: controlled context for quality, compaction as a backstop against crashes.

  5. previous_response_id-only pattern: Run the fourth test case (test_compaction_with_previous_response_id_only), which has not been run yet. In it we send only the new message each turn and let the server manage all state. This is the cleanest integration pattern.
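
For the threshold-tuning and hybrid steps above, the request parameters can be generated from a single helper so that only the threshold varies between runs. This is a sketch under assumptions: `build_request_kwargs` is a hypothetical helper, the `context_management` shape is the one used in this experiment, and the caller still supplies `input` before passing the result to `client.responses.create`.

```python
from typing import Any, Dict, Optional


def build_request_kwargs(
    model: str,
    compact_threshold: Optional[int],
    prev_id: Optional[str] = None,
    max_output_tokens: int = 500,
) -> Dict[str, Any]:
    """Assemble keyword arguments for client.responses.create(...)."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "max_output_tokens": max_output_tokens,
        "store": True,
    }
    if prev_id is not None:
        kwargs["previous_response_id"] = prev_id
    if compact_threshold is not None:
        # Same parameter shape as the compaction tests in this report.
        kwargs["context_management"] = [
            {"type": "compaction", "compact_threshold": compact_threshold}
        ]
    return kwargs


# Candidate thresholds named in the next steps above.
candidates = {
    threshold: build_request_kwargs("gpt-5.4", threshold)
    for threshold in (300_000, 400_000, 500_000)
}
baseline = build_request_kwargs("gpt-5-mini", None)
print(candidates[500_000]["context_management"])
```

Passing `None` as the threshold omits `context_management` entirely, which matches the baseline configuration.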

This experiment supports both hypotheses:

  • H1 confirmed: Neither gpt-5.4 nor gpt-5.4-mini failed with context_length_exceeded when compaction was enabled. Both completed all 20 turns, 7 turns past the point where the GPT-5-mini baseline failed (Turn 13), and per-request input never exceeded ~235K tokens.

  • H2 confirmed: Compaction triggered automatically on every turn whose input exceeded the threshold, reducing the next turn’s input by roughly 18–29% each time. The sawtooth pattern held across all five cycles, and the full and mini variants compacted on the same turns.

For the AI shopping assistant, compaction is a viable solution for long conversations. It eliminates the hard failure mode, though it introduces latency spikes and potential information loss that need further evaluation. The gpt-5.4-mini variant is particularly promising — identical compaction behavior at 3.7x the speed. The recommended approach is a hybrid model: use compaction as a safety net with a high threshold, while continuing to manage context quality through our existing episode-based history system.

Test file: consumer-agent/tests/integration/test_context_window_limits.py
Branch: experiment/gpt-5.4-compaction-test

Four test cases:

  1. test_context_overflow_gpt5_mini_baseline — Reproduces original failure (Turn 13, 263K tokens)
  2. test_compaction_gpt54_extends_conversation — Proves compaction prevents failure with gpt-5.4
  3. test_compaction_gpt54_mini_extends_conversation — Proves compaction works on gpt-5.4-mini (3.7x faster)
  4. test_compaction_with_previous_response_id_only — Tests server-side-only state management (not yet run)

Dependency:

  • openai==2.31.0 (upgraded from 2.6.1 to support the context_management parameter)

Moby-Dick opening chapters (Gutenberg #2701), from “Call me Ishmael” through Ishmael’s motivations for going to sea. ~5,000 tokens per message. Used solely to accelerate input growth to the threshold region.

    # Request shapes used in the tests. `client` is an OpenAI SDK client;
    # the `...` in the system message is elided in this report.

    # Baseline (gpt-5-mini)
    client.responses.create(
        model="gpt-5-mini",
        input=[{"role": "system", ...}, *conversation],
        previous_response_id=prev_id,
        max_output_tokens=500,
        store=True,
    )

    # Compaction (gpt-5.4)
    client.responses.create(
        model="gpt-5.4",
        input=[{"role": "system", ...}, *conversation],
        previous_response_id=prev_id,
        context_management=[{"type": "compaction", "compact_threshold": 200000}],
        max_output_tokens=500,
        store=True,
    )