JSON Component Fencing Fix: Evaluation Study

Date: 2025-12-08
Status: Complete
Author: Prakash Chaudhary
Jira: PLT-301

Problem Statement
Solution Overview
Benchmark Results
Recommendations
Testing Methodology
Conclusion

Executive Summary

This study evaluates the effectiveness of prompt engineering changes to fix JSON component fencing compliance in the consumer-agent. The mobile app requires JSON components (like prompt-suggestion) to be wrapped in markdown code fences for correct parsing.

Key Findings

Metric	Original Prompt	Fixed Prompt	Improvement
Success Rate	85.94%	100%	+14.06pp
Failures	9/64	0/64	-9 failures
Failure Rate	14.06%	0%	Eliminated

Recommendation

Deploy the improved fencing prompt to Opik for the stage and production environments. In the benchmark, the fix achieved 100% compliance, with no failures across 64 test requests spanning 8 locations.

Problem Statement

The mobile app displays malformed content when the LLM outputs JSON components without proper markdown code fences. This manifests as raw JSON appearing in the chat interface instead of rendered UI components.

Symptoms

Raw JSON like {"component":"prompt-suggestion","props":{...}} appearing in responses
Broken UI components on mobile
Inconsistent behavior across responses (non-deterministic failures)

Root Cause Analysis

The original Opik prompt had a single, weak instruction for JSON fencing:

* User-visible text must be **GitHub-Flavored Markdown**. No HTML. No code fences around prose. JSON components go in `json` blocks.

This instruction was:

Too subtle (buried in a list of other rules)
Not reinforced elsewhere in the prompt
Missing explicit examples of correct vs incorrect format

Solution Overview

The fix applies the “primacy-recency” principle from cognitive psychology: information at the beginning and end of a sequence is better remembered than information in the middle. The improved fencing logic reinforces JSON formatting requirements in three strategic positions.

Three Strategic Insertions

1. Primacy Position (Lines 13-14 - Quick Rules)

* User-visible text must be **GitHub-Flavored Markdown**. No HTML. No code fences around prose.
* **JSON components MUST be wrapped in ```json fences** - raw JSON breaks the mobile app.

2. Middle Position (Line 238 - Follow-Ups section)

**CRITICAL - JSON Component Format (non-negotiable):**

* **ALWAYS** wrap each JSON component in triple-backtick markdown code fences with the `json` language identifier.
* **NEVER** output raw JSON like `{"component":"..."}` without the triple-backtick markdown fences.
* The mobile app parser REQUIRES markdown fences to correctly identify and render components.
* Each component must be in its **own** fenced block, not combined.
* Outputting unfenced JSON will cause the mobile app to display broken/malformed content to users.

**Correct format (REQUIRED):**
```json
{"component":"prompt-suggestion","props":{"text":"Check nearby stores","type":"recommended"}}
```

**Wrong format (breaks mobile app - NEVER do this):**

{"component":"prompt-suggestion","props":{"text":"Check nearby stores","type":"recommended"}}

Here the same object is emitted as raw text, with no surrounding `json` code fence.

3. Recency Position (End of prompt)

## FINAL REMINDER - JSON Formatting

Before completing any response that includes prompt-suggestion or other JSON components:
1. Verify each JSON component is wrapped in ```json code fences
2. Never output bare JSON objects in your response
3. The mobile app will malfunction if JSON is not properly fenced

Benchmark Results

Test Configuration

Model: gpt-5-mini with low reasoning effort (conversational agent configuration)
Tools: MCP tools enabled (rover_mcp, web_search)
Components: All enabled (general-instructions, offer-list, prompt-suggestion, offer-shelf)
Test queries: 8 shopping-related queries designed to trigger prompt-suggestion components
Locations: 8 US cities (6 English, 2 Spanish)
Total requests per test: 64 (8 iterations x 8 locations)
Concurrency: 8 parallel requests

Results Summary

Prompt Version	Total Tests	Passed	Failed	Success Rate
Original (weak fencing)	64	55	9	85.94%
Fixed (improved fencing logic)	64	64	0	100%
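
The percentages in the table follow directly from the pass counts. A minimal sketch of that arithmetic, using only the counts reported above (the function name `success_rate` is illustrative and not taken from the test runner):

```python
def success_rate(passed: int, total: int) -> float:
    """Return the success rate as a percentage of total requests."""
    return 100.0 * passed / total


# Counts taken from the Results Summary table.
original = success_rate(55, 64)
fixed = success_rate(64, 64)

# The improvement is a difference of percentages, so it is in percentage points.
improvement_pp = fixed - original

print(f"original:    {original:.2f}%")         # 85.94%
print(f"fixed:       {fixed:.2f}%")            # 100.00%
print(f"improvement: {improvement_pp:+.2f}pp")  # +14.06pp
```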

Recommendations

Deploy the improved fencing prompt to stage and production via Opik.
Monitor JSON component rendering in mobile for at least one release cycle to confirm sustained 100% fencing compliance.
Keep the primacy/middle/recency reminders in future prompt revisions to prevent regression.
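
One way to guard against that regression is a small check that runs whenever the prompt text changes. The sketch below is an assumption, not part of the existing tooling: the `REMINDER` pattern, the 10%/90% position thresholds, and both function names are hypothetical and would need tuning against the real prompt.

```python
import re

# Assumed marker: any line that mentions both JSON and fences.
REMINDER = re.compile(r"json.*fence|fence.*json", re.IGNORECASE)


def reminder_positions(prompt: str) -> list:
    """Return relative positions (0.0 = first line, 1.0 = last line) of reminder lines."""
    lines = prompt.splitlines()
    last = max(len(lines) - 1, 1)  # avoid dividing by zero for one-line prompts
    return [i / last for i, line in enumerate(lines) if REMINDER.search(line)]


def has_all_three(prompt: str, head: float = 0.1, tail: float = 0.9) -> bool:
    """True if reminders appear near the start, in the middle, and near the end."""
    positions = reminder_positions(prompt)
    return (
        any(p <= head for p in positions)
        and any(head < p < tail for p in positions)
        and any(p >= tail for p in positions)
    )
```

A check like `assert has_all_three(prompt_text)` in a prompt test suite would fail if a later revision dropped any one of the three reminders.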

Testing Methodology

Test Approach

Testing was performed using a parallel async test runner that:

Sends streaming requests to the API server
Collects full response text from text events
Uses regex to detect:
- Properly fenced JSON: ```json\n{...}\n```
- Unfenced JSON components: {"component":"...","props":{...}}
Reports success/failure rates by location, locale, and query type
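
The runner's source is not included in this study. The sketch below shows one plausible shape for it, assuming `asyncio` with a semaphore to cap concurrency at 8. Every name here (`Result`, `run_case`, `run_matrix`, `fake_send`, `pass_rate_by_location`) is hypothetical, and `fake_send` is a stand-in for the real streaming API call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Result:
    location: str
    query: str
    passed: bool


async def run_case(location, query, send, check, limit):
    """Send one request under the concurrency limit and grade the reply."""
    async with limit:
        text = await send(location, query)
    return Result(location, query, check(text))


async def run_matrix(locations, queries, send, check, concurrency=8):
    """Run every location/query pair with bounded parallelism."""
    limit = asyncio.Semaphore(concurrency)
    tasks = [
        run_case(loc, q, send, check, limit)
        for loc in locations
        for q in queries
    ]
    return await asyncio.gather(*tasks)


def pass_rate_by_location(results):
    """Return the fraction of passing responses for each location."""
    totals, passes = {}, {}
    for r in results:
        totals[r.location] = totals.get(r.location, 0) + 1
        passes[r.location] = passes.get(r.location, 0) + int(r.passed)
    return {loc: passes[loc] / totals[loc] for loc in totals}


async def fake_send(location, query):
    """Stand-in for the streaming request; returns the collected text."""
    await asyncio.sleep(0)
    return f"{location}: {query}"


if __name__ == "__main__":
    locations = [f"city-{i}" for i in range(8)]
    queries = [f"query-{i}" for i in range(8)]
    # Placeholder check: the real runner would apply the fencing detection.
    results = asyncio.run(
        run_matrix(locations, queries, fake_send, lambda text: "{" not in text)
    )
    print(len(results))  # 64
```

Passing `send` and `check` as parameters keeps the scheduling logic separate from the API client and the detection rules, so each can be tested on its own.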

Detection Logic

The test runner identifies properly fenced JSON components by looking for markdown code fences with the json language identifier, and detects unfenced components by finding raw JSON component patterns in the response text after removing fenced blocks.
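
The detection described above can be sketched as follows. The exact regular expressions used by the runner are not given in this study, so `FENCED_BLOCK`, `COMPONENT_START`, and `classify_response` are assumptions. Treating a response as passing when it contains no unfenced component is also an assumed pass criterion.

```python
import re

# Build the fence marker from single backticks to keep this example readable.
FENCE = "`" * 3

# Assumed pattern: an opening fence tagged json, a body, then a closing fence.
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"json\s*\n(.*?)" + re.escape(FENCE), re.DOTALL
)

# Assumed pattern: the start of a JSON component object.
COMPONENT_START = re.compile(r'\{\s*"component"\s*:\s*"[^"]+"')


def classify_response(text: str) -> dict:
    """Count fenced and unfenced JSON components in one response."""
    fenced = [
        body for body in FENCED_BLOCK.findall(text)
        if COMPONENT_START.search(body)
    ]
    # Remove fenced blocks first so only raw components remain in the text.
    remainder = FENCED_BLOCK.sub("", text)
    unfenced = COMPONENT_START.findall(remainder)
    return {
        "fenced": len(fenced),
        "unfenced": len(unfenced),
        "passed": len(unfenced) == 0,
    }
```

Stripping the fenced blocks before searching for raw components is what prevents a correctly fenced component from being counted as a failure.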

Conclusion

In this benchmark, the improved fencing logic eliminated JSON fencing failures:

Metric	Before	After	Change
Success Rate	85.94%	100%	+14.06pp
Failure Rate	14.06%	0%	Eliminated
Failures	9	0	-9

The improved fencing logic applies the primacy-recency principle from cognitive psychology: information at the beginning and end of a sequence is better remembered. The fix reinforces the JSON formatting requirement in three strategic positions and states the consequence of non-compliance: (1) Primacy effect: the first instruction establishes the rule early in the prompt. (2) Detailed examples: the middle section provides concrete correct and incorrect examples. (3) Recency effect: the final reminder keeps the rule top-of-mind when generating output. (4) Explicit consequences: the statement that unfenced JSON “breaks the mobile app” creates urgency.
