CLI Reference -- PLT-246 New Commands
New commands added for the CI/CD evaluation pipeline. These extend the existing consumer-agent opik CLI.
opik eval health
Compute composite health metrics from an evaluation experiment. Returns four composites (Safety, Compliance, Quality, Reliability) with pass/warning/fail verdicts. Exit code 0 means pass and 1 means fail, so the command can be used for CI/CD gating.
Experiments belong to a dataset in Opik. The --dataset flag specifies which dataset to search for the experiment (defaults to the golden dataset).

    # Compute health for the V1 baseline
    consumer-agent opik eval health -e main_muffin_3622

    # With JSON output for CI consumption
    consumer-agent opik eval health -e main_muffin_3622 --json-output

    # Compare against a different baseline
    consumer-agent opik eval health -e new_experiment -b main_muffin_3622

    # Different dataset
    consumer-agent opik eval health -e my_experiment -d my-dataset

| Option | Short | Required | Description |
|---|---|---|---|
| --experiment | -e | Yes | Experiment name to assess |
| --dataset | -d | No | Opik dataset name (default: consumer-agent-eval) |
| --baseline | -b | No | Baseline experiment for comparison (default: from agent_config.yaml) |
| --json-output | | No | Output JSON only (for CI pipelines) |

Composites and thresholds are configured in agent_config.yaml under evaluation.health_metrics.
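
Because the verdict is reported through the exit code, a CI step can gate on it directly. The following is a minimal sketch for a POSIX shell, not part of the CLI itself: the `CLI` variable, the `health_gate` function name, and the default `health.json` path are illustrative assumptions.

```shell
# CLI is overridable so the function can be exercised without the real binary (assumption).
CLI="${CLI:-consumer-agent}"

# health_gate EXPERIMENT [OUTPUT_FILE]
# Runs the health check, saves its JSON output, and propagates the exit code.
health_gate() {
  experiment="$1"
  out="${2:-health.json}"   # hypothetical default path
  if $CLI opik eval health -e "$experiment" --json-output > "$out"; then
    echo "health gate passed for $experiment"
  else
    status=$?
    echo "health gate failed for $experiment (exit $status)" >&2
    return "$status"
  fi
}
```

Calling `health_gate main_muffin_3622` as the last command of a CI step makes the step fail whenever the health check exits non-zero, and the saved JSON can be reused by later steps.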
opik eval summary
Format health metrics as a markdown summary table. The command can compute health directly from an experiment name, or accept pre-computed JSON to speed up CI runs.

    # Simple -- just experiment name (computes health internally)
    consumer-agent opik eval summary -e main_muffin_3622

    # With smoke label
    consumer-agent opik eval summary -e main_muffin_3622 --smoke

    # Custom dataset
    consumer-agent opik eval summary -e main_muffin_3622 -d my-dataset

    # CI mode -- pre-computed JSON (faster, no Opik API call)
    consumer-agent opik eval summary --health-json /tmp/health.json --experiment main_muffin_3622

| Option | Short | Required | Description |
|---|---|---|---|
| --experiment | -e | Yes | Experiment name |
| --health-json | | No | Pre-computed health JSON string or file path. If omitted, computes from experiment. |
| --dataset | -d | No | Dataset name, used when computing health without --health-json (default: consumer-agent-eval) |
| --smoke | | No | Label as smoke evaluation in the heading |

Outputs markdown to stdout. In CI, append the output to the file named by $GITHUB_STEP_SUMMARY (for example with >>).
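
The CI mode can be sketched as a single step that reuses a saved health JSON and appends the table to the job summary. `$GITHUB_STEP_SUMMARY` is a file path set by GitHub Actions, so the sketch redirects with `>>`. The `CLI` variable and the `publish_summary` name are illustrative assumptions, not part of the CLI.

```shell
# CLI is overridable so the function can be exercised without the real binary (assumption).
CLI="${CLI:-consumer-agent}"

# publish_summary EXPERIMENT HEALTH_JSON_PATH
# Formats pre-computed health JSON and appends the markdown to the job summary.
publish_summary() {
  experiment="$1"
  health_json="$2"
  # Fall back to stdout when run outside GitHub Actions.
  target="${GITHUB_STEP_SUMMARY:-/dev/stdout}"
  $CLI opik eval summary --health-json "$health_json" --experiment "$experiment" >> "$target"
}
```

For example, `publish_summary main_muffin_3622 /tmp/health.json` after a health step that wrote `/tmp/health.json` avoids a second Opik API call.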
opik eval report
Generate a full markdown evaluation report with radar charts, score distributions, and an LLM executive summary. See PLT-521 for full documentation.

    # Generate report for an experiment
    consumer-agent opik eval report -e main_muffin_3622 -o report.md

    # Compare against baseline
    consumer-agent opik eval report -e new_experiment --compare main_muffin_3622

    # Skip LLM summary (faster, for CI)
    consumer-agent opik eval report -e main_muffin_3622 --no-summary

| Option | Short | Required | Description |
|---|---|---|---|
| --experiment | -e | Yes | Experiment name |
| --compare | -c | No | Baseline experiment for comparison (default: from config) |
| --output | -o | No | Output file path (default: reports/{experiment}.md) |
| --dataset | -d | No | Dataset name (default: consumer-agent-eval) |
| --no-summary | | No | Skip LLM-generated executive summary |

opik dataset create-smoke
Create a stratified smoke subset from a full dataset. Includes all safety-category items (100%) and proportionally samples from other categories. Used for the merge-evaluation CI workflow.

    # Create 75-item smoke subset
    consumer-agent opik dataset create-smoke \
      --source consumer-agent-eval \
      --target consumer-agent-eval-smoke \
      --size 75 \
      --seed 42

| Option | Short | Required | Description |
|---|---|---|---|
| --source | -s | Yes | Source dataset name |
| --target | -t | Yes | Target smoke dataset name (created if missing) |
| --size | -n | No | Target subset size (default: 75) |
| --safety-category | | No | Category to include fully (default: safety) |
| --seed | | No | Random seed for reproducibility (default: 42) |

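
To make the stratification concrete, the sketch below computes how many items each category would contribute: the fully included category keeps all of its items, and the remaining budget is split in proportion to category size. This only illustrates the arithmetic. The command's actual rounding and sampling logic is not documented here, and the `allocate_smoke` name, the category names other than safety, and the counts in the usage example are made up.

```shell
# allocate_smoke SIZE KEEP_CATEGORY
# Reads "category count" lines on stdin and prints "category sample_size" lines.
allocate_smoke() {
  awk -v size="$1" -v keep="$2" '
    {
      count[$1] = $2
      order[NR] = $1
      if ($1 != keep) rest += $2      # total items outside the kept category
    }
    END {
      budget = size - count[keep]     # slots left after including the kept category
      if (budget < 0) budget = 0
      for (i = 1; i <= NR; i++) {
        c = order[i]
        if (c == keep) n = count[c]
        else if (rest > 0) n = int(budget * count[c] / rest + 0.5)   # round to nearest (assumption)
        else n = 0
        if (n > count[c]) n = count[c]  # never sample more than exist
        print c, n
      }
    }'
}
```

With hypothetical counts of 30 safety, 105 billing, and 210 support items (345 in total), `printf 'safety 30\nbilling 105\nsupport 210\n' | allocate_smoke 75 safety` keeps all 30 safety items and splits the remaining 45 slots as 15 and 30. With less convenient counts, rounded shares may not sum exactly to the target size.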
CI Workflows
stage-evaluation.yaml (Deploy Gate)
Runs the full 345-item evaluation before stage/prod deployment. Called by stage-deploy-fsd.yaml as a blocking prerequisite.

    publish-to-ecr -> eval-gate (stage-evaluation) -> deploy-stage-east

merge-evaluation.yaml (Merge Monitor)
Runs the 75-item smoke evaluation on every push to main that touches agent code, prompts, or configs. Blocking.
Triggered by path changes: src/consumer_agent/**, prompts/**, configs/rules/**, agent_config.yaml, settings.yaml.
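
To check locally whether a set of changed files would fall under these trigger paths, the patterns can be approximated with a shell `case` statement. This is a rough sketch and not how the workflow itself evaluates the filter: the `touches_agent_paths` name is hypothetical, and shell `*` (which also matches `/`) stands in for the `**` globs.

```shell
# touches_agent_paths FILE...
# Returns 0 if any argument matches one of the documented trigger paths.
touches_agent_paths() {
  for f in "$@"; do
    case "$f" in
      src/consumer_agent/*|prompts/*|configs/rules/*|agent_config.yaml|settings.yaml)
        return 0 ;;
    esac
  done
  return 1
}
```

For example, `touches_agent_paths $(git diff --name-only origin/main)` succeeds when at least one changed file is under a trigger path (this simple form assumes file names without spaces).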