Shopping-list prompt bisection audit
Tracking which rules we remove, why, and the eval-score impact per round. The goal is to find the minimal set of rules that keeps purchase-history enumeration deterministic under gpt-5.4-mini, then add back only what’s safe.
Starting state (v3 / v4 / v5 iterations)
Evals ran against `configs/eval_data/shopping_list_scenarios.yaml` (23 items, of which 7
are flow-1 purchase-history queries, abbreviated PH in the table). The stub from
`agent_config.evaluation.tool_stubs` feeds deterministic tool output. Evals run via
`consumer-agent opik eval agent --dataset shopping-list-eval`.
| Version | Prompt change | PH avg | Overall avg | Items ≥0.8 |
|---|---|---|---|---|
| v1 (opik) | production baseline | ~0.28 | 0.46 | — |
| v2 (file, first fix) | tiny <shopping_list> block | 0.49 | 0.535 | 8/23 |
| v3 | triggers + FORBIDDEN + personalization override + backend-details caveat | 0.55 | 0.609 | 12/23 |
| v4 | + worked_example | 0.47 | 0.613 | 12/23 |
| v5 | top-level rule refactored to case-based | 0.45 | 0.523 | 9/23 |
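
For reference, the three summary columns can be derived from per-item scores roughly as follows. This is a minimal sketch that assumes both averages are plain arithmetic means and that "Items ≥0.8" counts across all items. The item IDs and scores are made up for illustration and are not taken from any eval run.

```python
from statistics import mean

# Hypothetical per-item results: (item_id, is_purchase_history, score).
# Illustrative values only, not real eval output.
results = [
    ("ph-01", True, 1.0),
    ("ph-02", True, 0.2),
    ("sl-01", False, 0.9),
    ("sl-02", False, 0.5),
]

def summarize(results, threshold=0.8):
    """Return (PH avg, overall avg, 'n/total' of items at or above threshold)."""
    ph_scores = [score for _, is_ph, score in results if is_ph]
    all_scores = [score for _, _, score in results]
    passing = sum(1 for score in all_scores if score >= threshold)
    return mean(ph_scores), mean(all_scores), f"{passing}/{len(all_scores)}"

ph_avg, overall_avg, passing = summarize(results)
print(f"PH avg={ph_avg:.2f} overall={overall_avg:.2f} items>=0.8: {passing}")
```

Reporting PH avg separately matters here because the overall average can rise while purchase-history items regress, as in the v3 to v4 row.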
The v3 fix works on some phrasings (“List my recent purchases with specific
products” → 1.0) but fails consistently on others (“Build me a weekly restock list”
→ 0.2 category collapse, “What have I bought recently?” → refusal).
Backup saved at /tmp/conversational-xml.v3-baseline.txt.
Why exceptions failed
gpt-5.4-mini weighs the first-seen rule heavily. `<output_contract>` at char 274
says “Never restate tool output” with an exception clause. `<shopping_list>` at
char 7228 tries to override it. With roughly 7k chars between the rule and its
override, the override loses.
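
The distance between a rule and its override can be checked mechanically. The sketch below uses a toy prompt, not the real one. The helper name `tag_offset` is hypothetical, and it assumes each block opens with a literal `<tag>` string. On the real prompt the same subtraction gives 7228 − 274 = 6954 chars.

```python
def tag_offset(prompt: str, tag: str) -> int:
    """Character offset of the first occurrence of an opening tag, or -1 if absent."""
    return prompt.find(f"<{tag}>")

# Toy prompt standing in for the real system prompt (hypothetical content).
prompt = (
    "<output_contract>Never restate tool output.</output_contract>\n"
    + "x" * 500 + "\n"
    + "<shopping_list>Enumerate purchases.</shopping_list>"
)

rule_at = tag_offset(prompt, "output_contract")
override_at = tag_offset(prompt, "shopping_list")
gap = override_at - rule_at
print(f"rule at {rule_at}, override at {override_at}, gap {gap} chars")
```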
Adding more explicit override language (v3→v4→v5) didn’t help consistently. Strategy: remove the competing rules, not fight them.
Round A — Aggressive strip
Removed rules (from `<output_contract>`, `<core_rules>`, `<personalization>`, and `<shopping_list>`):
From <output_contract>:
- R1 `≤100 words. One short paragraph or tight bullets.`: hard cap forcing summarization.
- R2 `Never restate tool output (names, IDs, point values, retailer mappings). Prose refers to high-level concepts only ("you have some coffee offers available"). Specific details go in render_* tool call arguments.`: the direct culprit.
- R3 `Lead with the conclusion → one-liner rationale → links (if research) → next step.`: structural pressure toward terse summary.
- R4 `Exception: the <shopping_list> block defines when restating tool output IS required (purchase-history-list intent).`: no longer needed since R2 is removed.
From <core_rules>:
- R5 The `backend details` caveat in the `Never reveal` list. Kept the list itself; removed only the `backend details` item and my note explaining that purchase history isn’t backend details.
From <personalization>:
- R6 `Use for ranking/context only; never alter offer data or fabricate personalization.`: directly contradicted enumeration.
- R7 The two-intents framing added in v3 (the thing that said “personalization vs explicit enumeration”). Simplified to tool-behavior bullets only.
From <shopping_list>:
- R8 The “OVERRIDES every other rule” preamble + “Purchase history belongs to the user…” paragraph. No longer needed if the competing rules are gone.
Kept in <shopping_list>: trigger list, required_behavior, forbidden_outputs,
required_outputs, product_title_shortening.
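
One way to apply a strip like this reproducibly is to remove each rule by its verbatim text and fail loudly if the text is missing or duplicated. The following is a minimal sketch under that assumption. `strip_rules` is a hypothetical helper, and the toy prompt abbreviates the real rule texts and layout.

```python
def strip_rules(prompt: str, rules: dict[str, str]) -> str:
    """Remove each rule's verbatim text; raise unless it occurs exactly once."""
    for rule_id, text in rules.items():
        count = prompt.count(text)
        if count != 1:
            raise ValueError(f"{rule_id}: expected 1 occurrence, found {count}")
        prompt = prompt.replace(text, "")
    return prompt

# Toy prompt with abbreviated rule texts (hypothetical layout).
prompt = (
    "<output_contract>\n"
    "≤100 words. One short paragraph or tight bullets.\n"
    "Never restate tool output.\n"
    "</output_contract>\n"
)
removed = {
    "R1": "≤100 words. One short paragraph or tight bullets.\n",
    "R2": "Never restate tool output.\n",
}

stripped = strip_rules(prompt, removed)
print(stripped)
```

Keeping the removed texts in a mapping keyed by rule ID also gives the add-back rounds an exact string to restore.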
Add-back plan (if Round A works)
Bisection order: add one group back, re-run the eval, and record PH avg:
- R1 alone (≤100 words cap). Does the cap alone cause category collapse?
- R2 alone (“never restate tool output”). Is this the true culprit?
- R1 + R2 together. Is the combination worse than either alone?
- R6 (personalization “context only”). Tool-level pressure.
- R3 (lead with conclusion). Is tone alone enough to cause summarization?
Each step measures per-phrasing SL scores on the 7 flow-1 items. If a rule’s re-addition drops ≥2 items below 0.8, that rule is the problem and we leave it out. If all 7 stay ≥0.8 after a re-addition, that rule is fine and we proceed.
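
The plan and decision rule above can be sketched as a small loop. This is illustrative only. It assumes each group is added back on top of the Round A baseline independently, and that all 7 items sit at ≥0.8 under that baseline, so "drops below 0.8" reduces to counting items under the threshold. The case of exactly one drop is not covered by the stated rule, so the sketch labels it `inconclusive`. `run_eval` is a caller-supplied stand-in for the real eval, and `fake_eval` returns invented scores.

```python
from typing import Callable, Sequence

THRESHOLD = 0.8  # per-item pass mark used in this audit

def judge_readdition(scores: Sequence[float]) -> str:
    """Classify one add-back step from the flow-1 per-item scores."""
    failing = sum(1 for s in scores if s < THRESHOLD)
    if failing >= 2:
        return "exclude"      # rule is the problem; leave it out
    if failing == 0:
        return "keep"         # rule is fine; proceed
    return "inconclusive"     # exactly one drop: outside the stated rule (assumption)

# Add-back order from the plan above.
PLAN = [("R1",), ("R2",), ("R1", "R2"), ("R6",), ("R3",)]

def run_plan(run_eval: Callable[[tuple], Sequence[float]]) -> dict:
    """Run each add-back group through the supplied eval and record the verdict."""
    return {group: judge_readdition(run_eval(group)) for group in PLAN}

# Fake eval for illustration only: pretend R2 collapses three phrasings.
def fake_eval(group):
    if "R2" in group:
        return [0.2, 0.2, 0.2, 1.0, 1.0, 1.0, 1.0]
    return [1.0] * 7

verdicts = run_plan(fake_eval)
for group, verdict in verdicts.items():
    print("+".join(group), "->", verdict)
```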
Non-bisection candidates
Rules kept unconditionally:
- All safety / data-integrity / fabrication / URL-integrity rules.
- `<scope_boundaries>` and refusal patterns (unrelated to enumeration).
- `<products>` and `<offers>` tool rules.
- `<rendering>`, `<product_formatting>`.
- All dynamic context (date, location, user_id).