Opik Integration Guide
Opik integration provides prompt management, versioning, and evaluation capabilities for the Consumer Agent.
Quick Start
1. Setup Credentials
Local Development (.env):

    OPIK_API_KEY=your-api-key

Production Deployment:
- OPIK_API_KEY stored in AWS Secrets Manager as consumer-agent/opik-api-key
- Automatically injected via consumer-agent.yml (FSD deployment)
Get credentials from Comet.com > Opik section.
2. Enable Opik
Set opik.enabled: true in settings.yaml. The environment-specific projects (consumer-agent-dev, consumer-agent-stage, consumer-agent-prod) and the fixed workspace ("consumer-agent") are configured there.
Architecture
Two Evaluation Modes:
| Aspect | CLI (Offline) | Online (Production) |
|---|---|---|
| Execution | Developer machine | Opik cloud |
| Trigger | Manual command | Automatic (every trace) |
| LLM-as-Judge | Local GEval → OpenAI | Opik → OpenAI |
| Cost | Developer pays | Opik pricing |
| Dataset | Curated test cases | Real user traffic |
| Use For | Testing prompts, A/B tests | Production monitoring, alerts |
Components:
- PromptEvaluator (src/consumer_agent/evaluation/evaluator.py) - CLI evaluation orchestration
- OpikClient (src/consumer_agent/evaluation/client.py) - Opik SDK wrapper for dataset/prompt CRUD
- Metrics (src/consumer_agent/metrics/) - 7 metrics (4 LLM-as-judge + 3 rule-based)
- Automation Rules (scripts/register_llm_judge_rules.py) - Online evaluation setup
Chat CLI Tracking
The chat CLI (consumer-agent chat) includes Opik tracking for local testing and development interactions.
Configuration
Chat CLI uses the same environment-specific project configuration as the rest of the application. See Quick Start → Enable Opik for the full settings.yaml configuration. Local development (default environment) uses the consumer-agent-dev project.
What is Tracked
Each conversation turn is logged as a trace with:
- Input: User message
- Output: Assistant response
- Metadata:
  - locale: User locale (en, es-419)
  - user_id: User ID if provided
  - latitude/longitude: Location if provided
  - enabled_components: Active component types
  - reasoning_length: Length of reasoning text
  - response_length: Length of response text
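As an illustration, the metadata fields listed above could be assembled per turn roughly as follows. This is a minimal sketch, not the code in chat.py: `build_trace_metadata` and its parameters are hypothetical names, and omitting optional fields when they are not provided is an assumption. Only the key names come from the list above.

```python
from typing import Any, Optional


def build_trace_metadata(
    locale: str,
    enabled_components: list[str],
    reasoning: str,
    response: str,
    user_id: Optional[str] = None,
    latitude: Optional[float] = None,
    longitude: Optional[float] = None,
) -> dict[str, Any]:
    """Assemble per-turn trace metadata (hypothetical helper)."""
    metadata: dict[str, Any] = {
        "locale": locale,
        "enabled_components": list(enabled_components),
        "reasoning_length": len(reasoning),
        "response_length": len(response),
    }
    # Optional fields are included only when provided (assumption).
    if user_id is not None:
        metadata["user_id"] = user_id
    if latitude is not None and longitude is not None:
        metadata["latitude"] = latitude
        metadata["longitude"] = longitude
    return metadata


example = build_trace_metadata(
    locale="es-419",
    enabled_components=["offer-list"],
    reasoning="User asks about coffee offers.",
    response="Here are three coffee offers near you.",
    latitude=41.8781,
    longitude=-87.6298,
)
```

Keeping the lengths rather than the full reasoning and response text in metadata keeps the filterable fields small; the full text already lives in the trace input and output.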
Graceful Degradation
If Opik Cloud is unavailable or credentials are missing, the chat CLI continues working without tracking. No errors are shown to the user.
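One way to get this behavior is to route every tracking call through a guard that swallows failures. The sketch below is an illustration under assumptions, not the actual chat CLI code: `safe_track`, `failing_tracker`, and the logger name are hypothetical, and logging at debug level is one possible way to keep errors away from the user.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("chat_cli.tracking")  # hypothetical logger name


def safe_track(track: Optional[Callable[..., Any]], **payload: Any) -> bool:
    """Call a tracking function without letting failures reach the user.

    Returns True if the call succeeded, False if tracking was skipped or failed.
    """
    if track is None:  # tracking disabled or credentials missing
        return False
    try:
        track(**payload)
        return True
    except Exception as exc:  # e.g. network error while Opik Cloud is down
        # Debug-level logging keeps the failure out of the chat output.
        logger.debug("Opik tracking skipped: %s", exc)
        return False


def failing_tracker(**_: Any) -> None:
    """Stand-in for a tracker whose backend is unreachable."""
    raise ConnectionError("Opik Cloud unreachable")


recorded: list[dict[str, Any]] = []

results = [
    safe_track(None, input="hi"),  # no client configured
    safe_track(failing_tracker, input="hi"),  # backend unavailable
    safe_track(lambda **kw: recorded.append(kw), input="hi"),  # healthy path
]
```

In all three cases the caller continues normally; only the healthy path records anything.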
View Chat Traces
- Go to Opik Dashboard
- Select workspace: consumer-agent
- Select project: consumer-agent-dev (for local development)
- View all local chat interactions with metadata
- Filter by locale, user_id, components, or tags: ["chat-cli", "local-testing"]
Implementation
The tracking implementation is in src/consumer_agent/cli/chat.py:273-402:

    # Initialize Opik client if enabled
    opik_client = None
    trace = None
    if settings.opik.enabled and settings.opik.api_key:
        opik_client = opik.Opik()

    # Wrap agent streaming in trace
    if opik_client:
        trace = opik_client.trace(
            name="chat_interaction",
            input={"message": user_message},
            project_name=settings.opik.chat_project_name,
            metadata={...}
        )

Use Cases
- Local Testing: Track development conversations for debugging
- Integration Testing: Verify prompt changes work as expected
- Component Testing: Test specific component behaviors
- Locale Testing: Verify translations and locale handling
- Tool Testing: Monitor MCP tool calls and responses
Comparison with Production Tracking
| Aspect | Chat CLI | FastAPI Service |
|---|---|---|
| Project | consumer-agent-chat-cli | consumer-agent |
| Purpose | Local testing | Production monitoring |
| Tracking | Enabled (OpikTracer) | Enabled (OpikTracer) |
| Evaluation | Manual review | Automated rules (LLM-as-judge) |
| Users | Developers | Real users |
| thread_id | UUID per session | episode_id (multi-turn grouping) |
Production Deployment
FSD Configuration: The OPIK_API_KEY secret is configured in consumer-agent.yml (AWS Secrets Manager path: consumer-agent/opik-api-key).
Environment-Specific Projects:
| Environment | Project Name |
|---|---|
| default (local) | consumer-agent-dev |
| stage | consumer-agent-stage |
| prod | consumer-agent-prod |
Workspace: Fixed as “consumer-agent” in settings.yaml across all environments.
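The environment-to-project mapping above can be expressed as a small lookup. This is a sketch only: the real values live in settings.yaml, `resolve_opik_target` is a hypothetical helper, and rejecting unknown environment names is an assumption.

```python
OPIK_WORKSPACE = "consumer-agent"  # fixed across all environments

# Environment name -> Opik project, as listed in the table above.
OPIK_PROJECTS = {
    "default": "consumer-agent-dev",
    "stage": "consumer-agent-stage",
    "prod": "consumer-agent-prod",
}


def resolve_opik_target(environment: str) -> tuple[str, str]:
    """Return (workspace, project) for an environment name (hypothetical helper)."""
    try:
        project = OPIK_PROJECTS[environment]
    except KeyError:
        # Failing loudly on unknown names is an assumption of this sketch.
        raise ValueError(f"Unknown environment: {environment!r}") from None
    return OPIK_WORKSPACE, project
```

For example, `resolve_opik_target("stage")` yields the "consumer-agent" workspace paired with the consumer-agent-stage project.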
Custom Metrics
Consumer-agent implements 7 custom metrics in src/consumer_agent/metrics/:
| Metric | Type | Scale | Purpose | When to Use |
|---|---|---|---|---|
| ResponseQualityMetric | LLM-as-judge | 1-10 | Overall response quality | All evaluations |
| GreetingQualityMetric | LLM-as-judge | 0-1 | Landing page greeting validation | Landing page only |
| CapabilityAlignmentMetric | LLM-as-judge | 0-1 | Suggestions align with bot capabilities | Suggestion validation |
| ConversationFlowMetric | LLM-as-judge | 0-1 | Multi-turn conversation coherence | Multi-turn conversations |
| SuggestionValidationMetric | Rule-based | 0-1 | Exactly 5 suggestions, 50-75 chars each | Landing page |
| DiversityMetric | Rule-based | 0-1 | No duplicates or similar suggestions | Landing page |
| ToolUsageMetric | Rule-based | 0 or 1 | Binary check if tools were called | Items requiring tools |
LLM-as-judge metrics use GEval with gpt-5-mini (4 API calls per evaluation). Rule-based metrics use deterministic logic (no API cost).
Usage: consumer-agent opik eval run --dataset test-v1 --metrics response_quality,greeting_quality
For detailed implementations, see src/consumer_agent/metrics/ directory.
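To make the rule-based checks concrete, here is a rough sketch of how the three deterministic metrics could be computed. It is not the code in src/consumer_agent/metrics/: the function names, the partial-credit scoring for length violations, and the similarity measure with its 0.8 threshold are all assumptions. Only the rules themselves (exactly 5 suggestions, 50-75 characters each, no duplicates or near-duplicates, binary tool check) come from the table above.

```python
from difflib import SequenceMatcher
from itertools import combinations

EXPECTED_COUNT = 5
MIN_CHARS, MAX_CHARS = 50, 75


def suggestion_validation_score(suggestions: list[str]) -> float:
    """1.0 when there are exactly 5 suggestions of 50-75 characters each.

    Partial credit for length violations is an assumption of this sketch.
    """
    if len(suggestions) != EXPECTED_COUNT:
        return 0.0
    in_range = sum(MIN_CHARS <= len(s) <= MAX_CHARS for s in suggestions)
    return in_range / EXPECTED_COUNT


def diversity_score(suggestions: list[str], threshold: float = 0.8) -> float:
    """1.0 when no two suggestions are duplicates or near-duplicates.

    The similarity measure and the 0.8 threshold are assumptions.
    """
    for a, b in combinations(suggestions, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            return 0.0
    return 1.0


def tool_usage_score(tool_calls: list[str]) -> int:
    """Binary check: 1 if at least one tool was called, else 0."""
    return 1 if tool_calls else 0
```

Because these checks are pure functions of the response, they can run on every evaluation item without any API cost.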
Prompt Management
Environment-Driven Source Selection
Prompts are loaded based on environment configuration in settings.yaml:
| Environment | Source | Behavior |
|---|---|---|
| default (local) | file | Load from prompts/ directory |
| stage | opik | Load from Opik (strict mode - fail if unavailable) |
| prod | opik | Load from Opik (strict mode - fail if unavailable) |
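The selection logic in the table can be sketched as follows. This is an illustration under assumptions rather than the PromptManager implementation: `load_prompt_text`, `PromptUnavailableError`, and the injected `fetch_from_opik` callable are hypothetical, and the `<name>.txt` file layout is a simplification of the file entries in agent_config.yaml.

```python
from pathlib import Path
from typing import Callable

# Environment -> prompt source, mirroring the table above.
PROMPT_SOURCES = {"default": "file", "stage": "opik", "prod": "opik"}


class PromptUnavailableError(RuntimeError):
    """Raised in strict mode when Opik cannot supply the prompt."""


def load_prompt_text(
    name: str,
    environment: str,
    fetch_from_opik: Callable[[str], str],
    prompts_dir: Path = Path("prompts"),
) -> str:
    """Load a prompt from the source configured for the environment."""
    source = PROMPT_SOURCES[environment]
    if source == "file":
        # Local development reads straight from the prompts/ directory.
        return (prompts_dir / f"{name}.txt").read_text(encoding="utf-8")
    try:
        return fetch_from_opik(name)
    except Exception as exc:
        # Strict mode: no fallback to local files in stage/prod.
        raise PromptUnavailableError(f"Opik prompt {name!r} unavailable") from exc
```

Failing fast in stage and prod means a deployment never silently serves a stale local prompt when Opik cannot be reached.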
Prompt Definitions
All prompts are defined in agent_config.yaml under prompts.definitions:

    prompts:
      file:
        directory: "prompts"
        components_directory: "prompts/components"

      definitions:
        conversational:
          file: conversational.txt
          opik: conversational@latest
        prompt-suggestions:
          file: prompt-suggestions.txt
          opik: prompt-suggestions@latest

Using PromptManager

    from consumer_agent.prompts import create_prompt_manager

    pm = create_prompt_manager()

    # Load prompt (source determined by environment)
    prompt = pm.load_prompt("conversational")

    # Build complete system prompt for an agent
    system_prompt = pm.build_system_prompt(
        agent_id="conversational",
        enabled_components=["offer-list"],
        locale="en",
        latitude=41.8781,
        longitude=-87.6298,
    )

Versioning with Opik
Each prompt definition includes an opik reference with explicit version:
- conversational@latest - always use latest version
- conversational@abc123 - pin to specific commit
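A reference in this `name@version` form is easy to split into its parts. The following is a minimal sketch with hypothetical names (`PromptRef`, `parse_prompt_ref`); treating a missing `@version` suffix as an error is an assumption.

```python
from typing import NamedTuple, Optional


class PromptRef(NamedTuple):
    name: str
    commit: Optional[str]  # None means "use the latest version"


def parse_prompt_ref(ref: str) -> PromptRef:
    """Split 'name@version' into its name and optional pinned commit."""
    name, sep, version = ref.partition("@")
    if not name or not sep or not version:
        raise ValueError(f"Expected 'name@version', got {ref!r}")
    return PromptRef(name, None if version == "latest" else version)
```

With this shape, `conversational@latest` resolves to a reference with no pinned commit, while `conversational@abc123` carries `abc123` as the commit to fetch.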
Update Prompts
- Edit prompt file in prompts/ directory
- Test locally (uses file source)
- Upload to Opik via CLI or API
- Update agent_config.yaml with new version if pinning
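For the API route in the upload step, a thin helper along these lines might be enough. It is a sketch under assumptions: `upload_prompt_text` is a hypothetical name, and the client is assumed to expose a `create_prompt(name=..., prompt=...)` method, as the Opik Python SDK client is understood to do. Check the SDK reference before relying on it.

```python
from typing import Any


def upload_prompt_text(client: Any, name: str, text: str) -> Any:
    """Upload prompt text as a new version of a named Opik prompt.

    `client` is assumed to provide `create_prompt(name=..., prompt=...)`.
    """
    if not text.strip():
        # Refuse to publish an empty prompt (assumption of this sketch).
        raise ValueError(f"Prompt {name!r} is empty")
    return client.create_prompt(name=name, prompt=text)


# Usage with the real SDK (requires OPIK_API_KEY; not executed here):
#   import opik
#   from pathlib import Path
#   text = Path("prompts/conversational.txt").read_text(encoding="utf-8")
#   upload_prompt_text(opik.Opik(), "conversational", text)
```

Passing the client in as a parameter keeps the helper testable with a stub and leaves credential handling to the caller.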
Evaluation
Evaluation Context
Configure the evaluation section in agent_config.yaml with real test user IDs to avoid 404 errors from MCP tools. See that file for the full config structure.
Create Evaluation Dataset

    from consumer_agent.evaluation import PromptEvaluator

    evaluator = PromptEvaluator()

    # Define test cases
    items = [
        {
            "input": "What coffee offers are available?",
            "expected_output": "List of coffee offers with points and details",
        },
        {
            "input": "Show me nearby snack offers",
            "expected_output": "Snack offers with location information",
        },
        {
            "input": "How do I redeem points?",
            "expected_output": "Point redemption instructions",
        },
    ]

    # Create dataset in Opik
    dataset = evaluator.create_dataset("prompt-eval-v1", items)
    print(f"Dataset created: {dataset['id']}")

Run Evaluation

    from consumer_agent.agent import Agent
    from consumer_agent.evaluation import PromptEvaluator
    from consumer_agent.factory import create_chat_model
    from opik.evaluation.metrics import Hallucination, AnswerRelevance

    # Setup agent and evaluator
    model = create_chat_model()
    agent = Agent(model=model)
    evaluator = PromptEvaluator()

    # Define metrics to track
    metrics = [
        Hallucination(),    # Detects made-up information
        AnswerRelevance(),  # Measures answer quality
    ]

    # Run evaluation on dataset
    results = evaluator.evaluate_prompt(
        agent=agent,
        dataset_name="prompt-eval-v1",
        scoring_metrics=metrics,
    )

    # Metrics automatically saved to Opik dashboard

View Results
All evaluation metrics are automatically saved to Opik:
- Go to Opik Dashboard
- Navigate to your project: consumer-agent
- View evaluation results with all metrics
- Compare prompt versions side-by-side
Available Metrics
Built-in metrics from Opik:
- Hallucination: Detects fabricated information
- AnswerRelevance: Measures answer quality and relevance
- Moderation: Checks for toxic/inappropriate content
- ContextPrecision: Measures retrieval accuracy (RAG systems)
- ContextRecall: Measures retrieval completeness (RAG systems)
CLI Commands
Manage datasets, prompts, and evaluations from the command line.
Dataset Commands
Create Dataset
Create dataset from JSONL file (one JSON object per line):

    # Create dataset from local file
    consumer-agent opik dataset create \
      --name sample-test-v1 \
      --file data/raw/sample.jsonl \
      --description "Test dataset description"

    # File format (JSONL):
    {"input": "...", "expected_output": "...", "metadata": {...}}
    {"input": "...", "expected_output": "...", "metadata": {...}}

List Datasets

    # List all datasets in Opik
    consumer-agent opik dataset list

Export Dataset

    # Export dataset to JSONL file
    consumer-agent opik dataset export \
      --name sample-test-v1 \
      --output data/external/exported.jsonl

Prompt Commands
Create/Update Prompt

    # Create or update prompt from file
    consumer-agent opik prompt create \
      --name conversational-v3 \
      --file prompts/conversational-v3.txt

Get Prompt

    # Get latest version
    consumer-agent opik prompt get --name conversational-v3

    # Get specific version
    consumer-agent opik prompt get \
      --name conversational-v3 \
      --commit abc123def456

View Prompt History

    # View version history
    consumer-agent opik prompt history --name conversational-v3

Evaluation Commands
Run Evaluation
Run evaluation on a dataset with auto-selected or custom metrics:

    # Run with auto-selected metrics (based on dataset metadata)
    consumer-agent opik eval run --dataset test-eval-v1

    # Run with specific prompt version
    consumer-agent opik eval run \
      --dataset test-eval-v1 \
      --prompt conversational-v4

    # Run with custom metrics
    consumer-agent opik eval run \
      --dataset test-eval-v1 \
      --metrics suggestion_validation,diversity,tool_usage

    # Run with custom model
    consumer-agent opik eval run \
      --dataset test-eval-v1 \
      --model gpt-4

    # Run with custom experiment name
    consumer-agent opik eval run \
      --dataset test-eval-v1 \
      --experiment "prompt-v4-test"

Available Metrics: See the Custom Metrics section above for detailed metric descriptions.
Auto-Selection: When no metrics are specified, they are selected automatically based on dataset metadata:
- type: "landing" triggers suggestion_validation + diversity
- requires_tools: true triggers tool_usage
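The two triggers above can be sketched as a small selection function. This is an illustration only: `auto_select_metrics` is a hypothetical name, reading the flags from each item's `metadata` field (as in the JSONL format) is an assumption, and whatever the CLI selects when neither trigger matches is not modeled here.

```python
from typing import Any, Iterable


def auto_select_metrics(items: Iterable[dict[str, Any]]) -> list[str]:
    """Pick rule-based metrics from dataset item metadata.

    Only the two documented triggers are modeled in this sketch.
    """
    selected: list[str] = []

    def add(*names: str) -> None:
        # Preserve first-seen order and avoid duplicates.
        for name in names:
            if name not in selected:
                selected.append(name)

    for item in items:
        metadata = item.get("metadata") or {}
        if metadata.get("type") == "landing":
            add("suggestion_validation", "diversity")
        if metadata.get("requires_tools") is True:
            add("tool_usage")
    return selected
```

A dataset mixing landing-page items with tool-dependent items would therefore get all three rule-based metrics in a single run.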
Online Evaluation
Automated evaluations run on production traffic in real-time using LLM-as-judge rules. Setup: python scripts/register_llm_judge_rules.py. View results in Opik Dashboard > Experiments.