Shopping-list prompt bisection audit

Tracking which rules we remove, why, and the eval-score impact per round. The goal is to find the minimal set of rules that keeps purchase-history enumeration deterministic under gpt-5.4-mini, then add back only what’s safe.

Starting state (v3 / v4 / v5 iterations)

Evals were run against configs/eval_data/shopping_list_scenarios.yaml (23 items, 7 of which are flow-1 purchase-history queries). The stub from agent_config.evaluation.tool_stubs feeds deterministic tool output; the eval itself runs via consumer-agent opik eval agent --dataset shopping-list-eval.

Version	Prompt change	PH avg	Overall avg	Items ≥0.8
v1 (opik)	production baseline	~0.28	0.46	—
v2 (file, first fix)	tiny `<shopping_list>` block	0.49	0.535	8/23
v3	triggers + FORBIDDEN + personalization override + backend-details caveat	0.55	0.609	12/23
v4	+ worked_example	0.47	0.613	12/23
v5	top-level rule refactored to case-based	0.45	0.523	9/23
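
For reference, the three summary columns can be recomputed from per-item scores. This is only a sketch: it assumes PH avg is the mean over the 7 purchase-history items, and it assumes the scores arrive as a plain dict of item id to score. The `summarize` helper, the item ids, and the example scores are all hypothetical and are not part of the eval tooling.

```python
from statistics import mean

def summarize(scores, ph_ids, threshold=0.8):
    """Summarize one eval run.

    scores: dict mapping item id -> score in [0, 1] (assumed shape).
    ph_ids: ids of the purchase-history items within `scores`.
    """
    ph_scores = [scores[i] for i in ph_ids]
    all_scores = list(scores.values())
    passing = sum(1 for s in all_scores if s >= threshold)
    return {
        "ph_avg": round(mean(ph_scores), 3),
        "overall_avg": round(mean(all_scores), 3),
        "items_passing": f"{passing}/{len(all_scores)}",
    }

# Made-up scores for illustration only; not taken from any real run.
example_scores = {"ph-1": 1.0, "ph-2": 0.2, "other-1": 0.9, "other-2": 0.5}
summary = summarize(example_scores, ph_ids=["ph-1", "ph-2"])
print(summary)
```

Keeping PH avg separate from the overall average matters here because the two can move in opposite directions, as they do between v3 and v4.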

The v3 fix works on some phrasings (“List my recent purchases with specific products” → 1.0) but fails consistently on others (“Build me a weekly restock list” → 0.2 category collapse, “What have I bought recently?” → refusal). Backup saved at /tmp/conversational-xml.v3-baseline.txt.

Why exceptions failed

gpt-5.4-mini weighs the first-seen rule heavily. <output_contract> at char 274 says “Never restate tool output” with an exception clause. <shopping_list> at char 7228 tries to override it. With roughly 7k characters between the rule and its override, the override loses.
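
The offsets above can be re-measured whenever the prompt changes. A minimal sketch, assuming the prompt is available as a single string and each block opens with a literal `<tag>`; `tag_offset` and `override_distance` are hypothetical helpers, and the prompt below is synthetic filler built to match the offsets quoted above.

```python
def tag_offset(prompt: str, tag: str) -> int:
    """Return the character offset of the opening <tag>, or -1 if absent."""
    return prompt.find(f"<{tag}>")

def override_distance(prompt: str, rule_tag: str, override_tag: str) -> int:
    """Characters between the block holding a rule and the block overriding it."""
    rule_at = tag_offset(prompt, rule_tag)
    override_at = tag_offset(prompt, override_tag)
    if rule_at < 0 or override_at < 0:
        raise ValueError("both tags must be present in the prompt")
    return override_at - rule_at

# Synthetic prompt: filler text with the two tags placed at the quoted offsets.
prompt = "x" * 274 + "<output_contract>Never restate tool output</output_contract>"
prompt += "x" * (7228 - len(prompt)) + "<shopping_list>...</shopping_list>"

distance = override_distance(prompt, "output_contract", "shopping_list")
print(tag_offset(prompt, "output_contract"), tag_offset(prompt, "shopping_list"), distance)
```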

Adding more explicit override language (v3→v4→v5) didn’t help consistently: PH avg slipped from 0.55 (v3) to 0.47 (v4) and 0.45 (v5). Strategy: remove the competing rules instead of fighting them.

Round A — Aggressive strip

Removed rules (from <output_contract>, <core_rules>, <personalization>, and <shopping_list>):

From `<output_contract>`:

R1 ≤100 words. One short paragraph or tight bullets. — hard cap forcing summarization.
R2 Never restate tool output (names, IDs, point values, retailer mappings). Prose refers to high-level concepts only ("you have some coffee offers available"). Specific details go in render_* tool call arguments. — the direct culprit.
R3 Lead with the conclusion → one-liner rationale → links (if research) → next step. — structural pressure toward terse summary.
R4 Exception: the <shopping_list> block defines when restating tool output IS required (purchase-history-list intent). — no longer needed since R2 removed.

From `<core_rules>`:

R5 The backend-details caveat in the Never reveal list. The list itself stays; only the backend-details item is removed, along with my note explaining that purchase history isn’t backend details.

From `<personalization>`:

R6 Use for ranking/context only; never alter offer data or fabricate personalization. — directly contradicted enumeration.
R7 The two-intents framing added in v3 (the thing that said “personalization vs explicit enumeration”). Simplified to tool-behavior bullets only.

From `<shopping_list>`:

R8 The “OVERRIDES every other rule” preamble + “Purchase history belongs to the user…” paragraph. No longer needed if the competing rules are gone.

Kept in <shopping_list>: trigger list, required_behavior, forbidden_outputs, required_outputs, product_title_shortening.

Add-back plan (if Round A works)

Bisection order — add one group back, re-run eval, record PH avg:

R1 alone (≤100 words cap). Does the cap alone cause category collapse?
R2 alone (“never restate tool output”). Is this the true culprit?
R1 + R2 together. Is the combination worse than either alone?
R6 (personalization “context only”). Tool-level pressure.
R3 (lead with conclusion). Is tone alone enough to cause summarization?

Each step measures per-phrasing SL scores on the 7 flow-1 items. If a rule’s re-addition drops ≥2 items below 0.8, that rule is the problem and we leave it out. If all 7 stay ≥0.8 after a re-addition, that rule is fine and we proceed.
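
The decision rule above can be written down as a small check. This is a sketch under assumptions: it takes the per-item scores for the 7 flow-1 items after one re-addition, and it assumes all 7 were at or above 0.8 before the rule was added back. The notes do not say what happens when exactly one item drops, so the sketch labels that case "inconclusive"; that label, the `judge_readdition` name, and the item ids are hypothetical.

```python
def judge_readdition(ph_scores, threshold=0.8, max_drops=2):
    """Classify one add-back step from per-item purchase-history scores.

    ph_scores: dict mapping item id -> score after re-adding a rule.
    Returns (verdict, ids of the items that fell below the threshold).
    """
    below = sorted(item for item, s in ph_scores.items() if s < threshold)
    if len(below) >= max_drops:
        return "leave_out", below      # the rule is the problem
    if not below:
        return "keep", below           # all items stayed at or above threshold
    return "inconclusive", below       # assumption: one drop is not covered by the notes

# Made-up scores for illustration only.
after_r1 = {"ph-1": 1.0, "ph-2": 0.9, "ph-3": 0.8, "ph-4": 0.85,
            "ph-5": 0.2, "ph-6": 0.4, "ph-7": 0.95}
verdict, dropped = judge_readdition(after_r1)
print(verdict, dropped)
```

Returning the dropped item ids alongside the verdict keeps the per-phrasing detail visible, which is the level at which the v3 failures showed up.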

Non-bisection candidates

Rules kept unconditionally:

All safety / data-integrity / fabrication / URL-integrity rules.
<scope_boundaries> and refusal patterns (unrelated to enumeration).
<products> and <offers> tool rules.
<rendering>, <product_formatting>.
All dynamic context (date, location, user_id).
