Opik Integration Guide

The Opik integration provides prompt management, versioning, and evaluation for the Consumer Agent.

Quick Start

1. Setup Credentials

Local Development (.env):

OPIK_API_KEY=your-api-key

Production Deployment:

OPIK_API_KEY stored in AWS Secrets Manager as consumer-agent/opik-api-key
Automatically injected via consumer-agent.yml (FSD deployment)

Get credentials from Comet.com > Opik section.

2. Enable Opik

Set opik.enabled: true in settings.yaml. Environment-specific projects (consumer-agent-dev/stage/prod) and fixed workspace (“consumer-agent”) are configured there.

Architecture

Two Evaluation Modes:

Aspect	CLI (Offline)	Online (Production)
Execution	Developer machine	Opik cloud
Trigger	Manual command	Automatic (every trace)
LLM-as-Judge	Local GEval → OpenAI	Opik → OpenAI
Cost	Developer pays	Opik pricing
Dataset	Curated test cases	Real user traffic
Use For	Testing prompts, A/B tests	Production monitoring, alerts

Components:

PromptEvaluator (src/consumer_agent/evaluation/evaluator.py) - CLI evaluation orchestration
OpikClient (src/consumer_agent/evaluation/client.py) - Opik SDK wrapper for dataset/prompt CRUD
Metrics (src/consumer_agent/metrics/) - 7 metrics (4 LLM-as-judge + 3 rule-based)
Automation Rules (scripts/register_llm_judge_rules.py) - Online evaluation setup

Chat CLI Tracking

The chat CLI (consumer-agent chat) includes Opik tracking for local testing and development interactions.

Configuration

Chat CLI uses the same environment-specific project configuration as the rest of the application. See Quick Start → Enable Opik for the full settings.yaml configuration. Local development (default environment) uses the consumer-agent-dev project.

What is Tracked

Each conversation turn is logged as a trace with:

Input: User message
Output: Assistant response
Metadata:

locale: User locale (en, es-419)
user_id: User ID if provided
latitude/longitude: Location if provided
enabled_components: Active component types
reasoning_length: Length of reasoning text
response_length: Length of response text
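
The metadata fields listed above could be assembled along the following lines. This is a minimal sketch, not the code in chat.py: the function name build_trace_metadata and the choice to omit optional fields when they are not provided are assumptions made for illustration.

```python
from typing import Any, Optional, Sequence


def build_trace_metadata(
    locale: str,
    enabled_components: Sequence[str],
    reasoning: str,
    response: str,
    user_id: Optional[str] = None,
    latitude: Optional[float] = None,
    longitude: Optional[float] = None,
) -> dict[str, Any]:
    """Assemble per-turn metadata; optional fields are left out when not provided."""
    metadata: dict[str, Any] = {
        "locale": locale,
        "enabled_components": list(enabled_components),
        "reasoning_length": len(reasoning),
        "response_length": len(response),
    }
    if user_id is not None:
        metadata["user_id"] = user_id
    # Assumption: location is only recorded when both coordinates are present.
    if latitude is not None and longitude is not None:
        metadata["latitude"] = latitude
        metadata["longitude"] = longitude
    return metadata
```

The resulting dictionary is the kind of value that would be passed as the metadata argument of a trace.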

Graceful Degradation

If Opik Cloud is unavailable or credentials are missing, the chat CLI continues working without tracking. No errors are shown to the user.
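
One way to get this behavior is to treat client creation as optional and return None on any failure. The sketch below is an assumption about how such a guard could look, not the actual CLI code: create_tracking_client and the client_factory parameter are hypothetical names, and in the real CLI the factory would presumably be opik.Opik.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def create_tracking_client(
    enabled: bool,
    api_key: Optional[str],
    client_factory: Callable[[], Any],
) -> Optional[Any]:
    """Return a tracking client, or None when tracking cannot be set up."""
    if not enabled or not api_key:
        return None  # tracking disabled or credentials missing
    try:
        return client_factory()
    except Exception:
        # Log at debug level so nothing is shown to the CLI user.
        logger.debug("Opik unavailable; continuing without tracking", exc_info=True)
        return None
```

Callers then check for None before creating a trace, so the conversation loop itself never depends on Opik being reachable.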

View Chat Traces

1. Go to the Opik Dashboard
2. Select workspace: consumer-agent
3. Select project: consumer-agent-dev (for local development)
4. View all local chat interactions with metadata
5. Filter by locale, user_id, components, or tags (“chat-cli”, “local-testing”)

Implementation

The tracking implementation is in src/consumer_agent/cli/chat.py:273-402:

# Initialize Opik client if enabled
opik_client = None
trace = None
if settings.opik.enabled and settings.opik.api_key:
    opik_client = opik.Opik()

# Wrap agent streaming in trace
if opik_client:
    trace = opik_client.trace(
        name="chat_interaction",
        input={"message": user_message},
        project_name=settings.opik.chat_project_name,
        metadata={...}
    )

Use Cases

Local Testing: Track development conversations for debugging
Integration Testing: Verify prompt changes work as expected
Component Testing: Test specific component behaviors
Locale Testing: Verify translations and locale handling
Tool Testing: Monitor MCP tool calls and responses

Comparison with Production Tracking

Aspect	Chat CLI	FastAPI Service
Project	consumer-agent-dev (local default)	consumer-agent-stage / consumer-agent-prod
Purpose	Local testing	Production monitoring
Tracking	Enabled (OpikTracer)	Enabled (OpikTracer)
Evaluation	Manual review	Automated rules (LLM-as-judge)
Users	Developers	Real users
thread_id	UUID per session	episode_id (multi-turn grouping)

Production Deployment

FSD Configuration: OPIK_API_KEY secret configured in consumer-agent.yml (AWS Secrets Manager path: consumer-agent/opik-api-key).

Environment-Specific Projects:

Environment	Project Name
default (local)	consumer-agent-dev
stage	consumer-agent-stage
prod	consumer-agent-prod

Workspace: Fixed as “consumer-agent” in settings.yaml across all environments.
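
The environment-to-project mapping can be pictured with a small lookup. This is an illustrative sketch only: resolve_opik_project and the two constants are hypothetical names, not the application's settings loader. Only the project and workspace names are taken from this guide.

```python
# Hypothetical helper: maps an environment name to the Opik project named in this guide.
OPIK_WORKSPACE = "consumer-agent"  # fixed across all environments

OPIK_PROJECTS = {
    "default": "consumer-agent-dev",  # local development
    "stage": "consumer-agent-stage",
    "prod": "consumer-agent-prod",
}


def resolve_opik_project(environment: str = "default") -> str:
    """Return the project name for an environment, raising on unknown names."""
    try:
        return OPIK_PROJECTS[environment]
    except KeyError:
        raise ValueError(f"Unknown environment: {environment!r}") from None
```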

Custom Metrics

Consumer-agent implements 7 custom metrics in src/consumer_agent/metrics/:

Metric	Type	Scale	Purpose	When to Use
ResponseQualityMetric	LLM-as-judge	1-10	Overall response quality	All evaluations
GreetingQualityMetric	LLM-as-judge	0-1	Landing page greeting validation	Landing page only
CapabilityAlignmentMetric	LLM-as-judge	0-1	Suggestions align with bot capabilities	Suggestion validation
ConversationFlowMetric	LLM-as-judge	0-1	Multi-turn conversation coherence	Multi-turn conversations
SuggestionValidationMetric	Rule-based	0-1	Exactly 5 suggestions, 50-75 chars each	Landing page
DiversityMetric	Rule-based	0-1	No duplicates or similar suggestions	Landing page
ToolUsageMetric	Rule-based	0 or 1	Binary check if tools were called	Items requiring tools

LLM-as-judge metrics use GEval with gpt-5-mini (4 API calls per evaluation). Rule-based metrics use deterministic logic (no API cost).

Usage: consumer-agent opik eval run --dataset test-v1 --metrics response_quality,greeting_quality

For detailed implementations, see src/consumer_agent/metrics/ directory.
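
To make the rule-based metrics concrete, here is a minimal sketch of the two simplest checks from the table. It is not the code in src/consumer_agent/metrics/: the function names are hypothetical, and the all-or-nothing scoring for suggestions is an assumption (the real metric may award partial credit on its 0-1 scale).

```python
from typing import Sequence

REQUIRED_COUNT = 5
MIN_CHARS, MAX_CHARS = 50, 75


def score_suggestions(suggestions: Sequence[str]) -> float:
    """Return 1.0 for exactly 5 suggestions of 50-75 characters each, else 0.0."""
    if len(suggestions) != REQUIRED_COUNT:
        return 0.0
    within_limits = all(MIN_CHARS <= len(s) <= MAX_CHARS for s in suggestions)
    return 1.0 if within_limits else 0.0


def score_tool_usage(tool_calls: Sequence[object]) -> int:
    """Binary check: 1 if at least one tool call was recorded, 0 otherwise."""
    return 1 if len(tool_calls) > 0 else 0
```

Because both checks are deterministic, they can run on every dataset item without incurring LLM API cost.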

Prompt Management

Environment-Driven Source Selection

Prompts are loaded based on environment configuration in settings.yaml:

Environment	Source	Behavior
default (local)	file	Load from `prompts/` directory
stage	opik	Load from Opik (strict mode - fail if unavailable)
prod	opik	Load from Opik (strict mode - fail if unavailable)
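
The practical meaning of strict mode is that an Opik failure is raised rather than masked by a fallback to the local file. The sketch below illustrates that behavior under assumptions: load_prompt_text, PromptUnavailableError, and the fetch_from_opik callable are hypothetical names, and the real PromptManager may be structured differently.

```python
from pathlib import Path
from typing import Callable


class PromptUnavailableError(RuntimeError):
    """Raised when a prompt cannot be loaded in strict mode."""


def load_prompt_text(
    name: str,
    source: str,
    prompts_dir: Path,
    fetch_from_opik: Callable[[str], str],
) -> str:
    """Load a prompt from the configured source without falling back."""
    if source == "file":
        return (prompts_dir / f"{name}.txt").read_text(encoding="utf-8")
    if source == "opik":
        try:
            return fetch_from_opik(name)
        except Exception as exc:
            # Strict mode: surface the failure instead of silently using the local file.
            raise PromptUnavailableError(f"Opik prompt {name!r} unavailable") from exc
    raise ValueError(f"Unknown prompt source: {source!r}")
```

Failing fast in stage and prod avoids serving a stale local prompt when the Opik version is the intended source of truth.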

Prompt Definitions

All prompts are defined in agent_config.yaml under prompts.definitions:

prompts:
  file:
    directory: "prompts"
    components_directory: "prompts/components"

  definitions:
    conversational:
      file: conversational.txt
      opik: conversational@latest
    prompt-suggestions:
      file: prompt-suggestions.txt
      opik: prompt-suggestions@latest

Using PromptManager

from consumer_agent.prompts import create_prompt_manager

pm = create_prompt_manager()

# Load prompt (source determined by environment)
prompt = pm.load_prompt("conversational")

# Build complete system prompt for an agent
system_prompt = pm.build_system_prompt(
    agent_id="conversational",
    enabled_components=["offer-list"],
    locale="en",
    latitude=41.8781,
    longitude=-87.6298,
)

Versioning with Opik

Each prompt definition includes an opik reference with a version suffix:

conversational@latest - always use latest version
conversational@abc123 - pin to specific commit
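
A reference in this name@version form can be split with a few lines of Python. This is a sketch rather than the project's parser: PromptRef and parse_prompt_ref are hypothetical names, and treating a bare name without a suffix as "latest" is an assumption.

```python
from typing import NamedTuple, Optional


class PromptRef(NamedTuple):
    name: str
    commit: Optional[str]  # None means "use the latest version"


def parse_prompt_ref(ref: str) -> PromptRef:
    """Split 'name@version' into a name and an optional pinned commit."""
    name, sep, version = ref.partition("@")
    if not name or (sep and not version):
        raise ValueError(f"Invalid prompt reference: {ref!r}")
    if not sep or version == "latest":
        return PromptRef(name, None)
    return PromptRef(name, version)
```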

Update Prompts

1. Edit the prompt file in the prompts/ directory
2. Test locally (the default environment uses the file source)
3. Upload to Opik via the CLI or API
4. If pinning, update agent_config.yaml with the new version

Evaluation

Evaluation Context

Configure the evaluation section in agent_config.yaml with real test user IDs so that MCP tool calls do not return 404 errors. See that file for the full config structure.

Create Evaluation Dataset

from consumer_agent.evaluation import PromptEvaluator

evaluator = PromptEvaluator()

# Define test cases
items = [
    {
        "input": "What coffee offers are available?",
        "expected_output": "List of coffee offers with points and details"
    },
    {
        "input": "Show me nearby snack offers",
        "expected_output": "Snack offers with location information"
    },
    {
        "input": "How do I redeem points?",
        "expected_output": "Point redemption instructions"
    }
]

# Create dataset in Opik
dataset = evaluator.create_dataset("prompt-eval-v1", items)
print(f"Dataset created: {dataset['id']}")

Run Evaluation

from consumer_agent.agent import Agent
from consumer_agent.evaluation import PromptEvaluator
from consumer_agent.factory import create_chat_model
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# Setup agent and evaluator
model = create_chat_model()
agent = Agent(model=model)
evaluator = PromptEvaluator()

# Define metrics to track
metrics = [
    Hallucination(),      # Detects made-up information
    AnswerRelevance(),   # Measures answer quality
]

# Run evaluation on dataset
results = evaluator.evaluate_prompt(
    agent=agent,
    dataset_name="prompt-eval-v1",
    scoring_metrics=metrics
)

# Metrics automatically saved to Opik dashboard

View Results

All evaluation metrics are automatically saved to Opik:

1. Go to the Opik Dashboard
2. Select workspace consumer-agent, then the project for your environment (for example, consumer-agent-dev)
3. View evaluation results with all metrics
4. Compare prompt versions side-by-side

Available Metrics

Built-in metrics from Opik:

Hallucination: Detects fabricated information
AnswerRelevance: Measures answer quality and relevance
Moderation: Checks for toxic/inappropriate content
ContextPrecision: Measures retrieval accuracy (RAG systems)
ContextRecall: Measures retrieval completeness (RAG systems)

CLI Commands

Manage datasets, prompts, and evaluations from the command line.

Dataset Commands

Create Dataset

Create dataset from JSONL file (one JSON object per line):

# Create dataset from local file
consumer-agent opik dataset create \
  --name sample-test-v1 \
  --file data/raw/sample.jsonl \
  --description "Test dataset description"

# File format (JSONL):
{"input": "...", "expected_output": "...", "metadata": {...}}
{"input": "...", "expected_output": "...", "metadata": {...}}
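
Before uploading, it can help to check a JSONL file locally. The following is a minimal sketch and not part of the consumer-agent CLI: parse_jsonl is a hypothetical helper, and treating input and expected_output as required keys is an assumption based on the format shown above.

```python
import json
from typing import Any, Iterable

REQUIRED_KEYS = ("input", "expected_output")


def parse_jsonl(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse JSONL lines into dataset items, validating required keys."""
    items = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        item = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"Line {line_number}: missing keys {missing}")
        items.append(item)
    return items
```

An open file object can be passed directly, since iterating over it yields one line at a time.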

List Datasets

# List all datasets in Opik
consumer-agent opik dataset list

Export Dataset

# Export dataset to JSONL file
consumer-agent opik dataset export \
  --name sample-test-v1 \
  --output data/external/exported.jsonl

Prompt Commands

Create/Update Prompt

# Create or update prompt from file
consumer-agent opik prompt create \
  --name conversational-v3 \
  --file prompts/conversational-v3.txt

Get Prompt

# Get latest version
consumer-agent opik prompt get --name conversational-v3

# Get specific version
consumer-agent opik prompt get \
  --name conversational-v3 \
  --commit abc123def456

View Prompt History

# View version history
consumer-agent opik prompt history --name conversational-v3

Evaluation Commands

Run Evaluation

Run evaluation on a dataset with auto-selected or custom metrics:

# Run with auto-selected metrics (based on dataset metadata)
consumer-agent opik eval run --dataset test-eval-v1

# Run with specific prompt version
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --prompt conversational-v4

# Run with custom metrics
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --metrics suggestion_validation,diversity,tool_usage

# Run with custom model
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --model gpt-4

# Run with custom experiment name
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --experiment "prompt-v4-test"

Available Metrics: See Custom Metrics section above for detailed metric descriptions.

Auto-Selection: When no metrics specified, automatically selects based on dataset metadata:

type: "landing" triggers suggestion_validation + diversity
requires_tools: true triggers tool_usage
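
The auto-selection rules above amount to a small mapping from metadata to metric names. The sketch below is an assumption-laden illustration: auto_select_metrics is a hypothetical name, metadata is assumed to live on each dataset item, and what the CLI selects when no rule matches is not stated in this guide, so the sketch simply returns an empty list in that case.

```python
from typing import Any, Iterable


def auto_select_metrics(items: Iterable[dict[str, Any]]) -> list[str]:
    """Collect metric names triggered by item metadata, in first-seen order."""
    selected: list[str] = []

    def add(name: str) -> None:
        if name not in selected:
            selected.append(name)

    for item in items:
        metadata = item.get("metadata") or {}
        if metadata.get("type") == "landing":
            add("suggestion_validation")
            add("diversity")
        if metadata.get("requires_tools") is True:
            add("tool_usage")
    return selected
```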

Online Evaluation

Automated evaluations run on production traffic in real time using LLM-as-judge rules. To register the rules, run python scripts/register_llm_judge_rules.py. View results in Opik Dashboard > Experiments.
