Experiment Report: Responses API Compaction Behavior at High Context Lengths (GPT-5.4)
Date: 2026-04-15
Models under test: gpt-5.4 and gpt-5.4-mini with context_management compaction enabled (compact_threshold: 200,000)
Baseline model: gpt-5-mini (no compaction, reproduced from original experiment)
Assumed model limits: 1M total context window (GPT-5.4/5.4-mini); 400K total context window (GPT-5-mini)
Executive summary: We tested whether GPT-5.4’s new context_management compaction feature prevents the context_length_exceeded failure observed in the original GPT-5-mini experiment. It does, on both gpt-5.4 and gpt-5.4-mini. Both models completed all 20 turns with 5 automatic compaction events each, whereas the baseline failed at Turn 13 (its last successful turn carried ~263K input tokens). Compaction creates a sawtooth pattern: context grows past the 200K threshold (peaking between ~200K and ~235K), drops back to ~159K–183K on the following turn, and the conversation continues. The mini variant exhibited the same compaction behavior while finishing the run 3.7x faster.
Background & Use Case
This is a follow-up to the September 2025 experiment, which showed that previous_response_id does not prevent context window overflow. That experiment failed at Turn 7 (~272K input tokens) with gpt-5-mini.
OpenAI has since released GPT-5.4 — the first mainline model trained to support compaction. Compaction is a new context_management parameter in the Responses API that automatically summarizes conversation state into an encrypted, opaque item when the token count crosses a configurable threshold. This experiment tests whether compaction solves the long-conversation problem for our AI shopping assistant.
What changed since the original experiment:
- GPT-5.4 supports a 1M token context window (vs 400K for GPT-5-mini)
- New context_management API parameter with compaction type
- Compaction is automatic: no separate API call required
- Compaction items are encrypted and opaque (not human-readable)
Reference: OpenAI Compaction Guide
Hypothesis
H1: GPT-5.4 with context_management compaction enabled will not fail with context_length_exceeded, even when the conversation would have exceeded the per-request input budget without compaction.
H2: Compaction will trigger automatically when input tokens cross the compact_threshold, reducing context size while preserving conversational coherence.
Method
Baseline Test (gpt-5-mini, no compaction)
- Model: gpt-5-mini
- Compaction: None (same as original experiment)
- Turn pattern: Each turn sends a large user message (~5K tokens of Moby-Dick public-domain text) and asks for a cumulative summary
- Chaining: previous_response_id passed from each response to the next (store: true)
- Max output: 500 tokens per turn (keeps responses short and maximizes input growth)
- Safety cap: 50 turns
Compaction Tests (gpt-5.4 and gpt-5.4-mini, compaction enabled)
- Models: gpt-5.4 and gpt-5.4-mini (tested separately, identical parameters)
- Compaction: context_management: [{"type": "compaction", "compact_threshold": 200000}]
- Turn pattern: Identical to baseline (same Moby-Dick inflation text, same message format)
- Chaining: previous_response_id passed from each response to the next (store: true)
- Max output: 500 tokens per turn
- Safety cap: 20 turns (sufficient to prove survival past the baseline failure point)
Inflation Strategy
Each user message includes ~5,000 tokens of Moby-Dick (Gutenberg #2701) text: the opening chapters, from “Call me Ishmael” through Ishmael’s motivations for going to sea. This mirrors a shopping assistant session where context accumulates through tool outputs, user messages, and reasoning tokens, but forces growth quickly to reach the threshold region.
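The turn pattern described above can be sketched as a small driver loop. This is a minimal illustration, not the actual test code: `run_turns` and `TurnRecord` are hypothetical names, the `create` argument stands in for `client.responses.create`, the system message is omitted, and only user messages are accumulated in `conversation` (whether the real test also resends assistant replies is not stated in this report).

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class TurnRecord:
    turn: int
    input_tokens: Optional[int]  # None when the request failed
    status: str                  # "ok" or the error text


def run_turns(
    create: Callable[..., Any],
    model: str,
    inflation_text: str,
    max_turns: int,
    context_management: Optional[list] = None,
) -> list[TurnRecord]:
    """Send one inflated message per turn, chaining previous_response_id."""
    records: list[TurnRecord] = []
    conversation: list[dict[str, str]] = []
    prev_id: Optional[str] = None

    for turn in range(1, max_turns + 1):
        conversation.append({
            "role": "user",
            "content": f"{inflation_text}\n\nSummarize everything so far.",
        })
        kwargs: dict[str, Any] = {
            "model": model,
            "input": list(conversation),  # copy so each request keeps its own snapshot
            "max_output_tokens": 500,
            "store": True,
        }
        if prev_id is not None:
            kwargs["previous_response_id"] = prev_id
        if context_management is not None:
            kwargs["context_management"] = context_management

        try:
            response = create(**kwargs)
        except Exception as exc:  # a real test would catch the SDK's specific error type
            records.append(TurnRecord(turn, None, str(exc)))
            break

        prev_id = response.id
        records.append(TurnRecord(turn, response.usage.input_tokens, "ok"))

    return records
```

Passing `context_management=None` gives the baseline shape, and passing the compaction list from the method description gives the compaction shape, so the two runs differ by a single argument.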
Results
Baseline: gpt-5-mini (no compaction)
| Turn | Input Tokens | Cached Tokens | Output Tokens | Latency | Status |
|---|---|---|---|---|---|
| 1 | 5,784 | 0 | 500 | 9.7s | ok |
| 2 | 14,555 | 4,864 | 500 | 7.9s | ok |
| 3 | 26,230 | 14,080 | 500 | 8.6s | ok |
| 4 | 40,861 | 25,472 | 487 | 8.6s | ok |
| 5 | 58,470 | 0 | 491 | 10.3s | ok |
| 6 | 79,000 | 40,704 | 500 | 9.0s | ok |
| 7 | 102,828 | 79,488 | 443 | 10.1s | ok |
| 8 | 128,757 | 78,720 | 500 | 10.3s | ok |
| 9 | 158,427 | 129,152 | 277 | 8.0s | ok |
| 10 | 190,267 | 128,640 | 500 | 13.1s | ok |
| 11 | 225,782 | 0 | 278 | 22.5s | ok |
| 12 | 263,447 | 190,208 | 500 | 14.8s | ok |
| 13 | — | — | — | — | context_length_exceeded |
Baseline result: Failed at Turn 13. Last successful turn had 263,447 input tokens. Consistent with the original experiment’s ~272K effective limit.
Total test duration: 134s (2 min 14s)
Compaction Test: gpt-5.4
| Turn | Input Tokens | Cached Tokens | Output Tokens | Latency | Status |
|---|---|---|---|---|---|
| 1 | 5,784 | 0 | 118 | 6.0s | ok |
| 2 | 14,605 | 5,760 | 89 | 15.6s | ok |
| 3 | 26,315 | 14,592 | 100 | 9.0s | ok |
| 4 | 40,954 | 26,240 | 107 | 6.9s | ok |
| 5 | 58,518 | 40,960 | 106 | 13.4s | ok |
| 6 | 78,999 | 58,496 | 110 | 14.2s | ok |
| 7 | 102,403 | 78,976 | 109 | 16.7s | ok |
| 8 | 128,724 | 0 | 106 | 4.9s | ok |
| 9 | 157,960 | 128,640 | 106 | 3.7s | ok |
| 10 | 190,114 | 157,952 | 108 | 4.2s | ok |
| 11 | 225,189 | 190,080 | 890 | 41.8s | compacted |
| 12 | 159,253 | 0 | 108 | 6.8s | ok |
| 13 | 200,164 | 159,232 | 866 | 29.7s | compacted |
| 14 | 165,094 | 0 | 105 | 7.1s | ok |
| 15 | 211,838 | 164,992 | 889 | 40.6s | compacted |
| 16 | 170,955 | 0 | 105 | 6.3s | ok |
| 17 | 223,535 | 170,880 | 889 | 19.1s | compacted |
| 18 | 176,788 | 0 | 105 | 6.5s | ok |
| 19 | 235,203 | 176,768 | 887 | 18.4s | compacted |
| 20 | 182,620 | 0 | 105 | 11.5s | ok |
Compaction result: All 20 turns completed. 5 compaction events at turns 11, 13, 15, 17, 19.
Total test duration: 283s (4 min 43s)
Compaction Test: gpt-5.4-mini
| Turn | Input Tokens | Cached Tokens | Output Tokens | Latency | Status |
|---|---|---|---|---|---|
| 1 | 5,784 | 0 | 115 | 2.9s | ok |
| 2 | 14,605 | 5,376 | 109 | 1.7s | ok |
| 3 | 26,343 | 14,592 | 109 | 1.5s | ok |
| 4 | 41,002 | 25,856 | 107 | 1.6s | ok |
| 5 | 58,578 | 40,704 | 109 | 1.5s | ok |
| 6 | 79,077 | 58,112 | 110 | 1.6s | ok |
| 7 | 102,496 | 78,592 | 107 | 2.0s | ok |
| 8 | 128,831 | 102,144 | 107 | 1.7s | ok |
| 9 | 158,085 | 128,768 | 27 | 1.5s | ok |
| 10 | 190,177 | 157,952 | 29 | 1.6s | ok |
| 11 | 225,190 | 189,696 | 902 | 12.0s | compacted |
| 12 | 159,443 | 0 | 28 | 2.7s | ok |
| 13 | 200,293 | 158,976 | 663 | 7.7s | compacted |
| 14 | 165,174 | 0 | 99 | 3.8s | ok |
| 15 | 211,932 | 165,120 | 923 | 7.7s | compacted |
| 16 | 171,119 | 0 | 28 | 2.6s | ok |
| 17 | 223,644 | 170,752 | 786 | 7.6s | compacted |
| 18 | 176,908 | 0 | 82 | 3.3s | ok |
| 19 | 235,325 | 176,384 | 733 | 9.0s | compacted |
| 20 | 182,715 | 0 | 28 | 2.5s | ok |
Compaction result: All 20 turns completed. 5 compaction events at turns 11, 13, 15, 17, 19 — identical pattern to gpt-5.4.
Total test duration: 77s (1 min 17s), 3.7x faster than gpt-5.4
Compaction Run Totals
gpt-5.4:
- Completed turns: 20 (all)
- Compaction events: 5 (at turns 11, 13, 15, 17, 19)
- Peak input tokens: 235,203 (Turn 19, just before compaction)
- Post-compaction input tokens: ~159K–183K (35K–66K below the preceding compaction turn)
- Compact threshold: 200,000
gpt-5.4-mini:
- Completed turns: 20 (all)
- Compaction events: 5 (at turns 11, 13, 15, 17, 19)
- Peak input tokens: 235,325 (Turn 19, just before compaction)
- Post-compaction input tokens: ~159K–183K (nearly identical to gpt-5.4)
- Compact threshold: 200,000
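The totals above can be recomputed from the per-turn input token column. The sketch below is an illustrative helper (`compaction_drops` is a hypothetical name, not part of the test suite) that flags every turn-over-turn decrease and reports its size, using the gpt-5.4 figures copied from the results table.

```python
def compaction_drops(input_tokens: list[int]) -> list[tuple[int, int, float]]:
    """Return (compaction_turn, tokens_dropped, fraction_dropped) per decrease.

    Turns are 1-indexed. The compaction turn is the turn *before* the lower
    input count appears, because the drop only shows up on the next request.
    """
    drops = []
    for i in range(1, len(input_tokens)):
        before, after = input_tokens[i - 1], input_tokens[i]
        if after < before:
            # Index i - 1 corresponds to turn number i.
            drops.append((i, before - after, (before - after) / before))
    return drops


# Input tokens per turn for the gpt-5.4 run, copied from the results table.
GPT54_INPUT = [
    5_784, 14_605, 26_315, 40_954, 58_518,
    78_999, 102_403, 128_724, 157_960, 190_114,
    225_189, 159_253, 200_164, 165_094, 211_838,
    170_955, 223_535, 176_788, 235_203, 182_620,
]

for turn, dropped, fraction in compaction_drops(GPT54_INPUT):
    print(f"turn {turn}: -{dropped:,} tokens ({fraction:.1%})")
```

Note that each figure is a net change: the post-compaction request also carries a fresh inflation message, so the compaction item itself removes somewhat more than the difference shown.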
Analysis & Interpretation
Compaction Creates a Sawtooth Pattern
[Figure: input tokens (y-axis, 120K–260K) by turn (x-axis, 1–20) for the compaction run, showing a sawtooth around the dashed 200K threshold line, with the baseline failure point at Turn 13 marked.]
Context grows every turn until it crosses the 200K threshold, then compaction brings the next turn back down to roughly 160K–180K. This cycle repeats every 2 turns, creating a sawtooth pattern that held for all 20 turns. Nothing in this run suggests a turn limit, although the post-compaction level rose by about 6K per cycle (from ~159K to ~183K), so longer runs are needed to confirm that it plateaus.
Key Observations
- Compaction triggers reliably. Every time input tokens crossed ~200K, compaction fired on that same turn. The response includes a compaction item alongside the normal output.
- Context reduction is significant. Input on the turn after each compaction was 35K–66K tokens lower than on the compaction turn itself (roughly 18–29%). The exact reduction varies based on how much compressible content exists.
- Cached tokens reset after compaction. Post-compaction turns show 0 cached tokens, indicating the server treats the compacted state as a new cache baseline. Caching resumes on the following turn.
- Compaction turns are slower. On gpt-5.4, compaction turns averaged ~30s vs ~8s for normal turns.
- Output tokens spike during compaction. On gpt-5.4, normal turns produce ~105 output tokens and compaction turns produce ~890. The delta (~785 tokens) is the encrypted compaction item.
- No quality degradation observed. The model continued to reference earlier conversation details after compaction events, suggesting the compaction item preserves key context.
- gpt-5.4-mini exhibits identical compaction behavior. The mini variant compacts at the same turns (11, 13, 15, 17, 19), produces nearly identical peak and post-compaction token counts, and generates the same sawtooth pattern. The only difference is speed: 3.7x faster overall (77s vs 283s), with normal turns averaging ~2s (vs ~8s) and compaction turns averaging ~9s (vs ~30s). This makes the mini variant strongly preferable for cost-sensitive or latency-sensitive use cases where compaction is needed.
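The latency averages quoted for the mini variant can be reproduced by grouping the per-turn latencies by status. This is a small sketch over the gpt-5.4-mini results table; `mean_latency_by_status` is an illustrative helper name rather than something taken from the test file.

```python
from statistics import mean

# (latency_seconds, status) per turn for the gpt-5.4-mini run, from the results table.
MINI_TURNS = [
    (2.9, "ok"), (1.7, "ok"), (1.5, "ok"), (1.6, "ok"), (1.5, "ok"),
    (1.6, "ok"), (2.0, "ok"), (1.7, "ok"), (1.5, "ok"), (1.6, "ok"),
    (12.0, "compacted"), (2.7, "ok"), (7.7, "compacted"), (3.8, "ok"),
    (7.7, "compacted"), (2.6, "ok"), (7.6, "compacted"), (3.3, "ok"),
    (9.0, "compacted"), (2.5, "ok"),
]


def mean_latency_by_status(turns):
    """Group latencies by the Status column and average each group."""
    buckets = {}
    for latency, status in turns:
        buckets.setdefault(status, []).append(latency)
    return {status: mean(values) for status, values in buckets.items()}


averages = mean_latency_by_status(MINI_TURNS)
print({status: round(value, 1) for status, value in averages.items()})
```

Swapping in the latency column from another run gives the corresponding averages for that model.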
Comparison Table
| Metric | Baseline (gpt-5-mini) | Compaction (gpt-5.4) | Compaction (gpt-5.4-mini) |
|---|---|---|---|
| Model | gpt-5-mini | gpt-5.4 | gpt-5.4-mini |
| Compaction | None | compact_threshold=200K | compact_threshold=200K |
| Turns completed | 12 (failed at 13) | 20 (all) | 20 (all) |
| Peak input tokens | 263,447 | 235,203 | 235,325 |
| Failure mode | context_length_exceeded | None | None |
| Compaction events | N/A | 5 | 5 |
| Compaction turns | N/A | 11, 13, 15, 17, 19 | 11, 13, 15, 17, 19 |
| Total duration | 134s | 283s | 77s |
| Avg latency (normal turn) | ~10s | ~8s | ~2s |
| Avg latency (compaction turn) | N/A | ~30s | ~9s |
Implications for the AI Shopping Assistant
What This Changes
The original experiment concluded that we must manage history explicitly (sliding window + summaries) because previous_response_id doesn’t prevent overflow. With GPT-5.4 compaction, this constraint is relaxed:
- Compaction handles overflow automatically. We no longer need to implement a client-side sliding window or manual summarization to prevent context_length_exceeded errors. The API handles it.
- Conversations are no longer capped by the context window. The sawtooth pattern held across all 20 turns, with no sign of a turn limit while compaction is enabled.
- Episode-based history is still valuable. Compaction doesn’t replace our DynamoDB-based history. We still need history for cross-session continuity, audit trails, analytics, and replay. But we no longer need history as a context management mechanism.
What This Doesn’t Change
- Compaction loses information. The compaction item is a lossy summary. For our use case, this means older tool outputs, offer details, and user preferences may be compressed away. We should evaluate whether critical shopping context survives compaction.
- Compaction turns are slower (~30s on gpt-5.4). This is a UX concern. In this test, every second turn past the threshold was a slow one. For a shopping assistant where users expect snappy responses, this may be noticeable.
- Cost implications. Compaction turns produce more output tokens (~890 vs ~105). At GPT-5.4 pricing, this adds cost on compaction turns. We need to model total cost against the alternative (managing state ourselves).
- Opaque compaction items. The compaction item is encrypted, so we can’t inspect, modify, or audit what was preserved. This is a tradeoff for convenience.
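The cost modeling mentioned above could start from something like the sketch below. It assumes that cached tokens are a subset of input tokens and are billed at their own rate; `run_cost` is a hypothetical helper, and all three prices are caller-supplied because this report contains no pricing data.

```python
def run_cost(turns, price_in, price_cached, price_out):
    """Estimate the dollar cost of a run.

    turns: iterable of (input_tokens, cached_tokens, output_tokens)
    price_*: dollars per 1M tokens, supplied by the caller
    """
    total = 0.0
    for input_tokens, cached_tokens, output_tokens in turns:
        uncached = input_tokens - cached_tokens  # assumption: cached is part of input
        total += (
            uncached * price_in
            + cached_tokens * price_cached
            + output_tokens * price_out
        ) / 1_000_000
    return total


# Turns 11 (compacted) and 12 (post-compaction) of the gpt-5.4 run, from the results table.
EXAMPLE_TURNS = [(225_189, 190_080, 890), (159_253, 0, 108)]

# Placeholder prices for illustration only, NOT real GPT-5.4 pricing.
PLACEHOLDER_IN, PLACEHOLDER_CACHED, PLACEHOLDER_OUT = 1.0, 0.1, 10.0
estimate = run_cost(EXAMPLE_TURNS, PLACEHOLDER_IN, PLACEHOLDER_CACHED, PLACEHOLDER_OUT)
```

Under this model the turn after a compaction is worth watching as well as the compaction turn: with 0 cached tokens, its entire input is billed at the uncached rate.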
Recommended Next Steps
- Quality evaluation: Run the shopping assistant through multi-turn conversations with compaction enabled. Evaluate whether critical context (user preferences, offer details, tool results) survives compaction events.
- Threshold tuning: Test different compact_threshold values. 200K may be too aggressive (frequent compaction) or too conservative (large context windows). Consider 300K or 400K for GPT-5.4’s 1M window.
- Latency mitigation: Investigate whether compaction latency can be hidden (e.g., streaming the normal response first, with compaction happening asynchronously).
- Hybrid approach: Consider using compaction as a safety net (high threshold, e.g., 500K) while still managing context explicitly for quality. This gives us the best of both: controlled context for quality, and compaction as a backstop against crashes.
- previous_response_id-only pattern: Run the remaining test case (test_compaction_with_previous_response_id_only), where we send only the new message each turn and let the server manage all state. This is the cleanest integration pattern.
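For the quality evaluation step, one crude starting point is to plant known facts early in a conversation and check whether replies after a compaction event still mention them. The sketch below is only a substring probe under that assumption; `recall_score` and the example facts are hypothetical, and a real evaluation would likely need semantic matching or a grader model.

```python
def recall_score(reply: str, planted_facts: dict[str, str]) -> dict[str, bool]:
    """Map each fact label to whether its expected value appears in the reply.

    Matching is a case-insensitive substring check, so paraphrases are missed.
    """
    lowered = reply.lower()
    return {label: value.lower() in lowered for label, value in planted_facts.items()}


# Hypothetical shopping details planted before the first compaction event.
facts = {"budget": "$150", "size": "EU 43", "brand": "Acme"}

# Hypothetical assistant reply captured after a compaction event.
reply = "You asked for Acme trail shoes in EU 43."

result = recall_score(reply, facts)
retained = sum(result.values()) / len(result)
```

Tracking `retained` per turn alongside the Status column would show whether recall drops specifically on the turns that follow a compaction.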
Conclusion
This experiment supports both hypotheses:
- H1 confirmed: Both gpt-5.4 and gpt-5.4-mini with compaction completed the run without context_length_exceeded, on the same workload that crashes GPT-5-mini. Both models completed all 20 turns, with no sign of an upper limit, whereas the baseline failed at Turn 13.
- H2 confirmed: Compaction triggers automatically and reliably when input tokens cross the threshold, reducing context by roughly 18–29% each time. The sawtooth pattern held throughout the run. Both the full and mini variants exhibit identical compaction behavior.
For the AI shopping assistant, compaction is a viable solution for long conversations. It eliminates the hard failure mode, though it introduces latency spikes and potential information loss that need further evaluation. The gpt-5.4-mini variant is particularly promising, with identical compaction behavior at 3.7x the speed. The recommended approach is a hybrid model: use compaction as a safety net with a high threshold, while continuing to manage context quality through our existing episode-based history system.
Appendix
Test Code
Test file: consumer-agent/tests/integration/test_context_window_limits.py
Branch: experiment/gpt-5.4-compaction-test
Four test cases:
- test_context_overflow_gpt5_mini_baseline: Reproduces the original failure mode (fails at Turn 13; last successful turn at 263K tokens)
- test_compaction_gpt54_extends_conversation: Proves compaction prevents failure with gpt-5.4
- test_compaction_gpt54_mini_extends_conversation: Proves compaction works on gpt-5.4-mini (3.7x faster)
- test_compaction_with_previous_response_id_only: Tests server-side-only state management (not yet run)
OpenAI SDK Version
openai==2.31.0 (upgraded from 2.6.1 to support the context_management parameter)
Inflation Text
Moby-Dick opening chapters (Gutenberg #2701), from “Call me Ishmael” through Ishmael’s motivations for going to sea. ~5,000 tokens per message. Used solely to accelerate input growth to the threshold region.
API Parameters
```python
# Baseline (gpt-5-mini)
client.responses.create(
    model="gpt-5-mini",
    input=[{"role": "system", ...}, *conversation],
    previous_response_id=prev_id,
    max_output_tokens=500,
    store=True,
)

# Compaction (gpt-5.4)
client.responses.create(
    model="gpt-5.4",
    input=[{"role": "system", ...}, *conversation],
    previous_response_id=prev_id,
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    max_output_tokens=500,
    store=True,
)
```
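The Status column in the results tables distinguishes "ok" from "compacted" turns. One way to derive that label is to look for a compaction item in the response output, as sketched below. The assumption here, which this report does not confirm, is that the compaction item appears in `response.output` with `type == "compaction"`; the type string is a parameter so it can be adjusted, and `turn_status` is a hypothetical helper name.

```python
from types import SimpleNamespace


def turn_status(response, compaction_type="compaction"):
    """Return "compacted" if any output item has the compaction type, else "ok".

    Assumes output items expose a `type` attribute; the exact type string
    for compaction items is an assumption, not confirmed by this report.
    """
    items = getattr(response, "output", None) or []
    if any(getattr(item, "type", None) == compaction_type for item in items):
        return "compacted"
    return "ok"


# Stand-in responses for illustration; real ones come from client.responses.create.
normal = SimpleNamespace(output=[SimpleNamespace(type="message")])
compacted = SimpleNamespace(
    output=[SimpleNamespace(type="compaction"), SimpleNamespace(type="message")]
)
```

Logging this label next to the usage counters on every turn is enough to rebuild the results tables from raw responses.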