
Shopping-list prompt bisection audit

Tracking which rules we remove, why, and the eval-score impact per round. The goal is to find the minimal set of rules that keeps purchase-history enumeration deterministic under gpt-5.4-mini, then add back only what’s safe.

Evals ran against configs/eval_data/shopping_list_scenarios.yaml (23 items, of which 7 are flow-1 purchase-history queries). The stub from agent_config.evaluation.tool_stubs feeds deterministic tool output, and the eval runs via consumer-agent opik eval agent --dataset shopping-list-eval.

| Version | Prompt change | PH avg | Overall avg | Items ≥0.8 |
|---|---|---|---|---|
| v1 (opik) | production baseline | ~0.28 | 0.46 | |
| v2 (file, first fix) | tiny <shopping_list> block | 0.49 | 0.535 | 8/23 |
| v3 | triggers + FORBIDDEN + personalization override + backend-details caveat | 0.55 | 0.609 | 12/23 |
| v4 | + worked_example | 0.47 | 0.613 | 12/23 |
| v5 | top-level rule refactored to case-based | 0.45 | 0.523 | 9/23 |
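
For reference, the three score columns can be derived from per-item scores roughly as follows. This is a minimal sketch, assuming each eval item gets a single score in [0, 1], that "PH avg" is the plain mean over the purchase-history items, and that "Items ≥0.8" counts across the whole dataset. The item IDs and scores below are made up for illustration and are not real eval output.

```python
from statistics import mean

def summarize(scores, ph_ids, threshold=0.8):
    """Return (PH avg, overall avg, 'k/n' items at or above threshold)."""
    ph_avg = mean(scores[item_id] for item_id in ph_ids)
    overall_avg = mean(scores.values())
    passing = sum(1 for s in scores.values() if s >= threshold)
    return round(ph_avg, 3), round(overall_avg, 3), f"{passing}/{len(scores)}"

# Hypothetical scores for four items; two are purchase-history (PH) queries.
scores = {"ph-1": 1.0, "ph-2": 0.2, "offer-1": 0.9, "offer-2": 0.5}
ph_avg, overall_avg, passing = summarize(scores, ph_ids=["ph-1", "ph-2"])
print(ph_avg, overall_avg, passing)
```

Keeping PH avg separate from the overall average matters here: v4 has the best overall average in the table while its PH avg is lower than v3's.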

The v3 fix works on some phrasings (“List my recent purchases with specific products” → 1.0) but fails consistently on others (“Build me a weekly restock list” → 0.2 category collapse, “What have I bought recently?” → refusal). Backup saved at /tmp/conversational-xml.v3-baseline.txt.

gpt-5.4-mini weighs the first-seen rule heavily. <output_contract> at char 274 says “Never restate tool output” with an exception clause. <shopping_list> at char 7228 tries to override it. With roughly 7k chars between the rule and its override, the override loses.
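
The offsets above can be re-measured after each prompt edit with a small helper. This is a sketch only: it assumes the offsets refer to the opening tag of each block, and the toy prompt below is whitespace padding built to reproduce the reported positions, not the real prompt.

```python
def tag_offset(prompt: str, tag: str) -> int:
    """Character offset of the first occurrence of <tag>, or -1 if absent."""
    return prompt.find(f"<{tag}>")

def override_distance(prompt: str, rule_tag: str, override_tag: str) -> int:
    """Number of characters between a rule block and the block overriding it."""
    rule_at = tag_offset(prompt, rule_tag)
    override_at = tag_offset(prompt, override_tag)
    if rule_at < 0 or override_at < 0:
        raise ValueError("tag not found in prompt")
    return override_at - rule_at

# Toy prompt: padding places the two tags at the offsets reported above.
prompt = " " * 274 + "<output_contract>"
prompt += " " * (7228 - len(prompt)) + "<shopping_list>"
distance = override_distance(prompt, "output_contract", "shopping_list")
print(distance)  # 6954
```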

Adding more explicit override language (v3→v4→v5) didn’t help consistently. Strategy: remove the competing rules rather than fight them.

Removed rules (all from <output_contract>, <core_rules>, <personalization>):

  • R1 ≤100 words. One short paragraph or tight bullets. — hard cap forcing summarization.
  • R2 Never restate tool output (names, IDs, point values, retailer mappings). Prose refers to high-level concepts only ("you have some coffee offers available"). Specific details go in render_* tool call arguments. — the direct culprit.
  • R3 Lead with the conclusion → one-liner rationale → links (if research) → next step. — structural pressure toward terse summary.
  • R4 Exception: the <shopping_list> block defines when restating tool output IS required (purchase-history-list intent). — no longer needed since R2 removed.
  • R5 The “backend details” caveat in the “Never reveal” list. Kept the list itself; removed only the backend-details item and my note explaining that purchase history isn’t backend details.
  • R6 Use for ranking/context only; never alter offer data or fabricate personalization. — directly contradicted enumeration.
  • R7 The two-intents framing added in v3 (the thing that said “personalization vs explicit enumeration”). Simplified to tool-behavior bullets only.
  • R8 The “OVERRIDES every other rule” preamble + “Purchase history belongs to the user…” paragraph. No longer needed if the competing rules are gone.

Kept in <shopping_list>: trigger list, required_behavior, forbidden_outputs, required_outputs, product_title_shortening.

Bisection order — add one group back, re-run eval, record PH avg:

  1. R1 alone (≤100 words cap). Does the cap alone cause category collapse?
  2. R2 alone (“never restate tool output”). Is this the true culprit?
  3. R1 + R2 together. Is the combination worse than either alone?
  4. R6 (personalization “context only”). Tool-level pressure.
  5. R3 (lead with conclusion). Is tone alone enough to cause summarization?
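
The bisection order above can be written down as data so each round is reproducible. This is a hedged sketch: the rule texts are abbreviated from the list of removed rules, each step is assumed to start from the stripped prompt rather than accumulate earlier re-additions, and build_variant is a hypothetical helper that simply prepends the rules. A real run would restore each rule into its original block.

```python
# Abbreviated rule texts, keyed by the R-labels used in this audit.
RULES = {
    "R1": "≤100 words. One short paragraph or tight bullets.",
    "R2": "Never restate tool output.",
    "R3": "Lead with the conclusion.",
    "R6": "Use for ranking/context only.",
}

# One tuple per bisection step, in the order listed above.
STEPS = [("R1",), ("R2",), ("R1", "R2"), ("R6",), ("R3",)]

def build_variant(stripped_prompt: str, rule_ids: tuple) -> str:
    """Return the stripped prompt with the given rules added back (prepended)."""
    restored = "\n".join(RULES[rule_id] for rule_id in rule_ids)
    return f"{restored}\n{stripped_prompt}"

# "STRIPPED PROMPT" stands in for the real prompt with R1..R8 removed.
variants = {"+".join(step): build_variant("STRIPPED PROMPT", step) for step in STEPS}
for name in variants:
    print(name)
```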

Each step measures per-phrasing SL scores on the 7 flow-1 items. If a rule’s re-addition drops ≥2 items below 0.8, that rule is the problem and we leave it out. If all 7 stay ≥0.8 after a re-addition, that rule is fine and we proceed.
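
The decision rule can be made mechanical. The sketch below assumes "drops" means an item that scored ≥0.8 on the stripped prompt and falls below 0.8 after the re-addition. The rule as written does not say what to do when exactly one item drops, so the sketch reports that case as "inconclusive"; that label is an assumption, not part of the plan. Phrasing IDs and scores are made up.

```python
THRESHOLD = 0.8

def verdict(before: dict, after: dict, threshold: float = THRESHOLD) -> str:
    """Classify a re-added rule from per-phrasing scores on the flow-1 items."""
    dropped = [
        item_id for item_id in before
        if before[item_id] >= threshold and after[item_id] < threshold
    ]
    if len(dropped) >= 2:
        return "leave out"      # the rule is the problem
    if all(score >= threshold for score in after.values()):
        return "keep"           # all items stayed at or above the threshold
    return "inconclusive"       # assumption: a single drop is not decided by the rule

# Hypothetical scores for the 7 flow-1 items before and after a re-addition.
before = {f"ph-{i}": 1.0 for i in range(1, 8)}
after_bad = {**before, "ph-2": 0.2, "ph-5": 0.6}
after_ok = {**before, "ph-3": 0.8}
after_one = {**before, "ph-4": 0.5}

print(verdict(before, after_bad))
print(verdict(before, after_ok))
print(verdict(before, after_one))
```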

Rules kept unconditionally:

  • All safety / data-integrity / fabrication / URL-integrity rules.
  • <scope_boundaries> and refusal patterns (unrelated to enumeration).
  • <products> and <offers> tool rules.
  • <rendering>, <product_formatting>.
  • All dynamic context (date, location, user_id).