
Opik Integration Guide

Opik integration provides prompt management, versioning, and evaluation capabilities for the Consumer Agent.

Local Development (.env):

OPIK_API_KEY=your-api-key

Production Deployment:

  • OPIK_API_KEY stored in AWS Secrets Manager as consumer-agent/opik-api-key
  • Automatically injected via consumer-agent.yml (FSD deployment)

Get credentials from Comet.com > Opik section.

Set opik.enabled: true in settings.yaml. The environment-specific projects (consumer-agent-dev/stage/prod) and the fixed workspace (“consumer-agent”) are configured in the same file.

Two Evaluation Modes:

| Aspect | CLI (Offline) | Online (Production) |
|---|---|---|
| Execution | Developer machine | Opik cloud |
| Trigger | Manual command | Automatic (every trace) |
| LLM-as-Judge | Local GEval → OpenAI | Opik → OpenAI |
| Cost | Developer pays | Opik pricing |
| Dataset | Curated test cases | Real user traffic |
| Use For | Testing prompts, A/B tests | Production monitoring, alerts |

Components:

  • PromptEvaluator (src/consumer_agent/evaluation/evaluator.py) - CLI evaluation orchestration
  • OpikClient (src/consumer_agent/evaluation/client.py) - Opik SDK wrapper for dataset/prompt CRUD
  • Metrics (src/consumer_agent/metrics/) - 7 metrics (4 LLM-as-judge + 3 rule-based)
  • Automation Rules (scripts/register_llm_judge_rules.py) - Online evaluation setup

The chat CLI interface (consumer-agent chat) includes Opik tracking for local testing and development interactions.

Chat CLI uses the same environment-specific project configuration as the rest of the application. See Quick Start → Enable Opik for the full settings.yaml configuration. Local development (default environment) uses the consumer-agent-dev project.

Each conversation turn is logged as a trace with:

Input: User message
Output: Assistant response
Metadata:

  • locale: User locale (en, es-419)
  • user_id: User ID if provided
  • latitude/longitude: Location if provided
  • enabled_components: Active component types
  • reasoning_length: Length of reasoning text
  • response_length: Length of response text
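As an illustration, the metadata fields listed above could be assembled as follows. This is a minimal sketch, not the code in chat.py: the function name `build_trace_metadata` and its parameters are hypothetical, and it assumes that optional fields are omitted when they are not provided.

```python
from typing import Any, Optional


def build_trace_metadata(
    locale: str,
    enabled_components: list[str],
    reasoning: str,
    response: str,
    user_id: Optional[str] = None,
    latitude: Optional[float] = None,
    longitude: Optional[float] = None,
) -> dict[str, Any]:
    """Hypothetical helper: collect the per-turn metadata fields into one dict."""
    metadata: dict[str, Any] = {
        "locale": locale,
        "enabled_components": list(enabled_components),
        "reasoning_length": len(reasoning),
        "response_length": len(response),
    }
    # Assumption: optional fields are left out rather than stored as None.
    if user_id is not None:
        metadata["user_id"] = user_id
    if latitude is not None and longitude is not None:
        metadata["latitude"] = latitude
        metadata["longitude"] = longitude
    return metadata
```

The resulting dict has the shape expected by the `metadata` argument of a trace call.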

If Opik Cloud is unavailable or credentials are missing, the chat CLI continues working without tracking. No errors are shown to the user.
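The fail-silent behavior described above can be sketched as two small wrappers. The names `safe_create_client` and `safe_track` are hypothetical and are not taken from chat.py; the sketch assumes failures are logged at debug level so that nothing reaches the chat user.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("consumer_agent.chat")


def safe_create_client(
    factory: Callable[[], Any], enabled: bool, api_key: Optional[str]
) -> Optional[Any]:
    """Return a tracking client, or None when tracking cannot be set up."""
    if not enabled or not api_key:
        return None
    try:
        return factory()
    except Exception as exc:
        # Debug-level only, so the failure never surfaces in the chat session.
        logger.debug("Opik tracking disabled: %s", exc)
        return None


def safe_track(client: Optional[Any], **trace_kwargs: Any) -> Optional[Any]:
    """Start a trace if a client exists; swallow any tracking failure."""
    if client is None:
        return None
    try:
        return client.trace(**trace_kwargs)
    except Exception as exc:
        logger.debug("Opik trace skipped: %s", exc)
        return None
```

With the Opik SDK, the factory would typically be `opik.Opik`, for example `safe_create_client(opik.Opik, settings.opik.enabled, settings.opik.api_key)`.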

  1. Go to Opik Dashboard
  2. Select workspace: consumer-agent
  3. Select project: consumer-agent-dev (for local development)
  4. View all local chat interactions with metadata
  5. Filter by locale, user_id, components, or tags: ["chat-cli", "local-testing"]

The tracking implementation is in src/consumer_agent/cli/chat.py:273-402:

# Excerpt: `opik`, `settings`, and `user_message` come from the surrounding module.

# Initialize Opik client if enabled
opik_client = None
trace = None
if settings.opik.enabled and settings.opik.api_key:
    opik_client = opik.Opik()

# Wrap agent streaming in trace
if opik_client:
    trace = opik_client.trace(
        name="chat_interaction",
        input={"message": user_message},
        project_name=settings.opik.chat_project_name,
        metadata={...},
    )

  • Local Testing: Track development conversations for debugging
  • Integration Testing: Verify prompt changes work as expected
  • Component Testing: Test specific component behaviors
  • Locale Testing: Verify translations and locale handling
  • Tool Testing: Monitor MCP tool calls and responses

| Aspect | Chat CLI | FastAPI Service |
|---|---|---|
| Project | consumer-agent-chat-cli | consumer-agent |
| Purpose | Local testing | Production monitoring |
| Tracking | Enabled (OpikTracer) | Enabled (OpikTracer) |
| Evaluation | Manual review | Automated rules (LLM-as-judge) |
| Users | Developers | Real users |
| thread_id | UUID per session | episode_id (multi-turn grouping) |

FSD Configuration: OPIK_API_KEY secret configured in consumer-agent.yml (AWS Secrets Manager path: consumer-agent/opik-api-key).

Environment-Specific Projects:

| Environment | Project Name |
|---|---|
| default (local) | consumer-agent-dev |
| stage | consumer-agent-stage |
| prod | consumer-agent-prod |

Workspace: Fixed as “consumer-agent” in settings.yaml across all environments.
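The environment-to-project mapping can be expressed as a small lookup. This is only a sketch of the mapping documented here: the function name `resolve_opik_target` is hypothetical, and raising on an unknown environment (instead of falling back to the default) is an assumption.

```python
# Values copied from the environment table; the workspace is the same everywhere.
PROJECTS = {
    "default": "consumer-agent-dev",
    "stage": "consumer-agent-stage",
    "prod": "consumer-agent-prod",
}
WORKSPACE = "consumer-agent"


def resolve_opik_target(environment: str = "default") -> tuple[str, str]:
    """Hypothetical helper: return (workspace, project) for an environment."""
    try:
        return WORKSPACE, PROJECTS[environment]
    except KeyError:
        # Assumption: an unknown environment is a configuration error.
        raise ValueError(f"Unknown environment: {environment!r}") from None
```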

Consumer-agent implements 7 custom metrics in src/consumer_agent/metrics/:

| Metric | Type | Scale | Purpose | When to Use |
|---|---|---|---|---|
| ResponseQualityMetric | LLM-as-judge | 1-10 | Overall response quality | All evaluations |
| GreetingQualityMetric | LLM-as-judge | 0-1 | Landing page greeting validation | Landing page only |
| CapabilityAlignmentMetric | LLM-as-judge | 0-1 | Suggestions align with bot capabilities | Suggestion validation |
| ConversationFlowMetric | LLM-as-judge | 0-1 | Multi-turn conversation coherence | Multi-turn conversations |
| SuggestionValidationMetric | Rule-based | 0-1 | Exactly 5 suggestions, 50-75 chars each | Landing page |
| DiversityMetric | Rule-based | 0-1 | No duplicates or similar suggestions | Landing page |
| ToolUsageMetric | Rule-based | 0 or 1 | Binary check if tools were called | Items requiring tools |

LLM-as-judge metrics use GEval with gpt-5-mini (4 API calls per evaluation). Rule-based metrics use deterministic logic (no API cost).
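To make the rule-based checks concrete, here is a rough sketch of the three deterministic metrics as plain functions. These are not the implementations in src/consumer_agent/metrics/: the function names are hypothetical, the all-or-nothing scoring for suggestion validation and the 0.8 similarity threshold are assumptions, and the real metrics may award partial credit or measure similarity differently.

```python
from difflib import SequenceMatcher
from itertools import combinations


def suggestion_validation_score(
    suggestions: list[str], expected_count: int = 5, min_len: int = 50, max_len: int = 75
) -> float:
    """1.0 if there are exactly `expected_count` suggestions, all within the length bounds."""
    if len(suggestions) != expected_count:
        return 0.0
    return 1.0 if all(min_len <= len(s) <= max_len for s in suggestions) else 0.0


def diversity_score(suggestions: list[str], similarity_threshold: float = 0.8) -> float:
    """Fraction of suggestion pairs that are not near-duplicates (1.0 = fully diverse)."""
    normalized = [s.strip().lower() for s in suggestions]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0
    similar = sum(
        1 for a, b in pairs if SequenceMatcher(None, a, b).ratio() >= similarity_threshold
    )
    return 1.0 - similar / len(pairs)


def tool_usage_score(tool_calls: list) -> int:
    """Binary check: 1 if at least one tool call was recorded, else 0."""
    return 1 if tool_calls else 0
```

Because none of these functions calls a model, they can run on every dataset item at no API cost.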

Usage: consumer-agent opik eval run --dataset test-v1 --metrics response_quality,greeting_quality

For detailed implementations, see src/consumer_agent/metrics/ directory.

Prompts are loaded based on environment configuration in settings.yaml:

| Environment | Source | Behavior |
|---|---|---|
| default (local) | file | Load from prompts/ directory |
| stage | opik | Load from Opik (strict mode - fail if unavailable) |
| prod | opik | Load from Opik (strict mode - fail if unavailable) |
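The source selection and strict-mode behavior in this table can be sketched as below. This is an illustrative outline only: `load_prompt_text`, `PromptUnavailableError`, and the `fetch_from_opik` callable are hypothetical names, and the `<name>.txt` file layout is assumed from the prompt definitions rather than confirmed.

```python
from pathlib import Path
from typing import Callable


class PromptUnavailableError(RuntimeError):
    """Raised in strict mode when a prompt cannot be loaded from Opik."""


def load_prompt_text(
    name: str,
    source: str,
    prompts_dir: str,
    fetch_from_opik: Callable[[str], str],
) -> str:
    """Load a prompt from the configured source ("file" or "opik")."""
    if source == "file":
        # Assumption: local prompts live at <prompts_dir>/<name>.txt.
        return (Path(prompts_dir) / f"{name}.txt").read_text(encoding="utf-8")
    if source == "opik":
        try:
            return fetch_from_opik(name)
        except Exception as exc:
            # Strict mode: fail loudly instead of silently falling back to files.
            raise PromptUnavailableError(f"Prompt {name!r} unavailable from Opik") from exc
    raise ValueError(f"Unknown prompt source: {source!r}")
```

Failing fast in stage and prod avoids serving a stale local prompt when Opik cannot be reached.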

All prompts are defined in agent_config.yaml under prompts.definitions:

prompts:
  file:
    directory: "prompts"
    components_directory: "prompts/components"
  definitions:
    conversational:
      file: conversational.txt
      opik: conversational@latest
    prompt-suggestions:
      file: prompt-suggestions.txt
      opik: prompt-suggestions@latest

from consumer_agent.prompts import create_prompt_manager

pm = create_prompt_manager()

# Load prompt (source determined by environment)
prompt = pm.load_prompt("conversational")

# Build complete system prompt for an agent
system_prompt = pm.build_system_prompt(
    agent_id="conversational",
    enabled_components=["offer-list"],
    locale="en",
    latitude=41.8781,
    longitude=-87.6298,
)

Each prompt definition includes an opik reference with a version specifier:

  • conversational@latest - always use the latest version
  • conversational@abc123 - pin to a specific commit

To update a prompt:

  1. Edit the prompt file in the prompts/ directory
  2. Test locally (uses the file source)
  3. Upload to Opik via CLI or API
  4. Update agent_config.yaml with the new version if pinning
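A reference of the form `name@version` can be split with a few lines of Python. This sketch uses hypothetical names (`PromptRef`, `parse_prompt_ref`) and assumes that a reference without `@` means the latest version; the project's own parsing may differ.

```python
from typing import NamedTuple, Optional


class PromptRef(NamedTuple):
    name: str
    commit: Optional[str]  # None means "use the latest version"


def parse_prompt_ref(ref: str) -> PromptRef:
    """Split 'name@version' into a prompt name and an optional pinned commit."""
    name, sep, version = ref.partition("@")
    if not name or (sep and not version):
        raise ValueError(f"Invalid prompt reference: {ref!r}")
    # Assumption: a missing version is treated the same as '@latest'.
    if not sep or version == "latest":
        return PromptRef(name, None)
    return PromptRef(name, version)
```

For example, `parse_prompt_ref("conversational@abc123")` yields the name `conversational` with commit `abc123`, while `conversational@latest` yields no pinned commit.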

Configure the evaluation section in agent_config.yaml with real test user IDs to avoid 404 errors from MCP tools. See the file for the full config structure.

from consumer_agent.evaluation import PromptEvaluator

evaluator = PromptEvaluator()

# Define test cases
items = [
    {
        "input": "What coffee offers are available?",
        "expected_output": "List of coffee offers with points and details",
    },
    {
        "input": "Show me nearby snack offers",
        "expected_output": "Snack offers with location information",
    },
    {
        "input": "How do I redeem points?",
        "expected_output": "Point redemption instructions",
    },
]

# Create dataset in Opik
dataset = evaluator.create_dataset("prompt-eval-v1", items)
print(f"Dataset created: {dataset['id']}")

from consumer_agent.agent import Agent
from consumer_agent.factory import create_chat_model
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# Setup agent
model = create_chat_model()
agent = Agent(model=model)

# Define metrics to track
metrics = [
    Hallucination(),  # Detects made-up information
    AnswerRelevance(),  # Measures answer quality
]

# Run evaluation on dataset (reuses `evaluator` from the dataset step)
results = evaluator.evaluate_prompt(
    agent=agent,
    dataset_name="prompt-eval-v1",
    scoring_metrics=metrics,
)
# Metrics automatically saved to Opik dashboard

All evaluation metrics are automatically saved to Opik:

  1. Go to Opik Dashboard
  2. Navigate to your project: consumer-agent
  3. View evaluation results with all metrics
  4. Compare prompt versions side-by-side

Built-in metrics from Opik:

  • Hallucination: Detects fabricated information
  • AnswerRelevance: Measures answer quality and relevance
  • Moderation: Checks for toxic/inappropriate content
  • ContextPrecision: Measures retrieval accuracy (RAG systems)
  • ContextRecall: Measures retrieval completeness (RAG systems)

Manage datasets and prompts using the command-line interface.

Create dataset from JSONL file (one JSON object per line):

# Create dataset from local file
consumer-agent opik dataset create \
  --name sample-test-v1 \
  --file data/raw/sample.jsonl \
  --description "Test dataset description"

# File format (JSONL):
{"input": "...", "expected_output": "...", "metadata": {...}}
{"input": "...", "expected_output": "...", "metadata": {...}}

# List all datasets in Opik
consumer-agent opik dataset list

# Export dataset to JSONL file
consumer-agent opik dataset export \
  --name sample-test-v1 \
  --output data/external/exported.jsonl

# Create or update prompt from file
consumer-agent opik prompt create \
  --name conversational-v3 \
  --file prompts/conversational-v3.txt

# Get latest version
consumer-agent opik prompt get --name conversational-v3

# Get specific version
consumer-agent opik prompt get \
  --name conversational-v3 \
  --commit abc123def456

# View version history
consumer-agent opik prompt history --name conversational-v3
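Before passing a file to `dataset create`, it can help to generate and sanity-check the JSONL locally. The sketch below uses hypothetical helper names (`to_jsonl`, `parse_jsonl`) and assumes that `input` and `expected_output` are required while `metadata` is optional; the CLI's actual validation rules are not documented here.

```python
import json

REQUIRED_KEYS = {"input", "expected_output"}  # assumption: metadata is optional


def to_jsonl(items: list[dict]) -> str:
    """Serialize items as JSONL: one JSON object per line."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text and check that each record has the required keys."""
    items = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        items.append(record)
    return items
```

The text returned by `to_jsonl` can be written to a path such as data/raw/sample.jsonl with `pathlib.Path.write_text(..., encoding="utf-8")`.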

Run evaluation on a dataset with auto-selected or custom metrics:

# Run with auto-selected metrics (based on dataset metadata)
consumer-agent opik eval run --dataset test-eval-v1

# Run with specific prompt version
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --prompt conversational-v4

# Run with custom metrics
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --metrics suggestion_validation,diversity,tool_usage

# Run with custom model
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --model gpt-4

# Run with custom experiment name
consumer-agent opik eval run \
  --dataset test-eval-v1 \
  --experiment "prompt-v4-test"

Available Metrics: See Custom Metrics section above for detailed metric descriptions.

Auto-Selection: When no metrics are specified, they are selected automatically based on dataset metadata:

  • type: "landing" triggers suggestion_validation + diversity
  • requires_tools: true triggers tool_usage
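The two auto-selection rules above can be sketched as a simple pass over the dataset items. The function name `auto_select_metrics` is hypothetical, and the sketch assumes that metadata is inspected per item and that the selected metrics are the union across all items; which metrics the CLI uses when no rule matches is not stated here, so the sketch returns an empty list in that case.

```python
def auto_select_metrics(items: list[dict]) -> list[str]:
    """Pick rule-based metrics from item metadata, preserving first-seen order."""
    selected: list[str] = []

    def add(name: str) -> None:
        if name not in selected:
            selected.append(name)

    for item in items:
        metadata = item.get("metadata") or {}
        if metadata.get("type") == "landing":
            add("suggestion_validation")
            add("diversity")
        if metadata.get("requires_tools") is True:
            add("tool_usage")
    return selected
```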

Automated evaluations run on production traffic in real time using LLM-as-judge rules. Setup: python scripts/register_llm_judge_rules.py. View results in Opik Dashboard > Experiments.