CLI Reference -- PLT-246 New Commands

New commands added for the CI/CD evaluation pipeline. They extend the existing consumer-agent opik CLI.

opik eval health

Compute composite health metrics from an evaluation experiment. Returns four composites (Safety, Compliance, Quality, Reliability), each with a pass, warning, or fail verdict. The exit code is 0 on pass and 1 on fail, so the command can gate a CI/CD pipeline directly.

Experiments belong to a dataset in Opik. The --dataset flag specifies which dataset to search for the experiment; it defaults to the golden dataset, consumer-agent-eval.

# Compute health for the V1 baseline
consumer-agent opik eval health -e main_muffin_3622

# With JSON output for CI consumption
consumer-agent opik eval health -e main_muffin_3622 --json-output

# Compare against a different baseline
consumer-agent opik eval health -e new_experiment -b main_muffin_3622

# Different dataset
consumer-agent opik eval health -e my_experiment -d my-dataset

Option	Short	Required	Description
`--experiment`	`-e`	Yes	Experiment name to assess
`--dataset`	`-d`	No	Opik dataset name (default: `consumer-agent-eval`)
`--baseline`	`-b`	No	Baseline experiment for comparison (default: from `agent_config.yaml`)
`--json-output`		No	Output JSON only (for CI pipelines)

Composites and thresholds are configured in agent_config.yaml under evaluation.health_metrics.
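
Because the exit code carries the verdict, a CI step can gate on it without parsing any output. The following is a minimal sketch, assuming a bash-compatible shell. The `run_health_gate` function, the `CONSUMER_AGENT` override variable, and the output path are illustrative names, not part of the CLI.

```shell
#!/usr/bin/env bash
# Sketch of a CI gate built on the documented exit codes (0 = pass, 1 = fail).
# CONSUMER_AGENT and run_health_gate are hypothetical names for this example.
CONSUMER_AGENT="${CONSUMER_AGENT:-consumer-agent}"

run_health_gate() {
  local experiment="$1"
  local out_file="$2"

  # --json-output keeps stdout machine-readable, so it is saved for later steps.
  if "$CONSUMER_AGENT" opik eval health -e "$experiment" --json-output > "$out_file"; then
    echo "health gate passed for $experiment"
  else
    echo "health gate failed for $experiment" >&2
    return 1
  fi
}

# Usage in a CI step:
#   run_health_gate main_muffin_3622 /tmp/health.json || exit 1
```

Saving the JSON to a file lets a later step reuse it instead of recomputing health.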

opik eval summary

Format health metrics as a markdown summary table. Can compute health directly from an experiment name, or accept pre-computed JSON so that CI does not have to recompute it.

# Simple -- just experiment name (computes health internally)
consumer-agent opik eval summary -e main_muffin_3622

# With smoke label
consumer-agent opik eval summary -e main_muffin_3622 --smoke

# Custom dataset
consumer-agent opik eval summary -e main_muffin_3622 -d my-dataset

# CI mode -- pre-computed JSON (faster, no Opik API call)
consumer-agent opik eval summary --health-json /tmp/health.json --experiment main_muffin_3622

Option	Short	Required	Description
`--experiment`	`-e`	Yes	Experiment name
`--health-json`		No	Pre-computed health JSON string or file path. If omitted, computes from experiment.
`--dataset`	`-d`	No	Dataset name, used when computing health without `--health-json` (default: `consumer-agent-eval`)
`--smoke`		No	Label as smoke evaluation in the heading

Outputs markdown to stdout. In CI, append it to the file named by $GITHUB_STEP_SUMMARY.
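
In GitHub Actions, `$GITHUB_STEP_SUMMARY` holds a file path rather than a stream, so the markdown is appended with `>>`. The sketch below assumes a bash-compatible shell and a health JSON file written by an earlier step. `publish_summary` and `CONSUMER_AGENT` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: publish the summary table to the GitHub Actions job summary.
# CONSUMER_AGENT and publish_summary are hypothetical names for this example.
CONSUMER_AGENT="${CONSUMER_AGENT:-consumer-agent}"

publish_summary() {
  local experiment="$1"
  local health_json="$2"
  # GitHub Actions sets GITHUB_STEP_SUMMARY to a file path; outside CI,
  # fall back to plain stdout so the function still works locally.
  local target="${GITHUB_STEP_SUMMARY:-/dev/stdout}"

  # Append rather than overwrite, since other steps may write to the same summary.
  "$CONSUMER_AGENT" opik eval summary \
    --health-json "$health_json" \
    --experiment "$experiment" >> "$target"
}

# Usage in a CI step:
#   publish_summary main_muffin_3622 /tmp/health.json
```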

opik eval report

Generate a full markdown evaluation report with radar charts, score distributions, and an LLM-generated executive summary. See PLT-521 for full documentation.

# Generate report for an experiment
consumer-agent opik eval report -e main_muffin_3622 -o report.md

# Compare against baseline
consumer-agent opik eval report -e new_experiment --compare main_muffin_3622

# Skip LLM summary (faster, for CI)
consumer-agent opik eval report -e main_muffin_3622 --no-summary

Option	Short	Required	Description
`--experiment`	`-e`	Yes	Experiment name
`--compare`	`-c`	No	Baseline experiment for comparison (default: from config)
`--output`	`-o`	No	Output file path (default: `reports/{experiment}.md`)
`--dataset`	`-d`	No	Dataset name (default: `consumer-agent-eval`)
`--no-summary`		No	Skip LLM-generated executive summary

opik dataset create-smoke

Create a stratified smoke subset from a full dataset. Includes all safety-category items (100%) and samples proportionally from the other categories. Used by the merge-evaluation CI workflow.

# Create 75-item smoke subset
consumer-agent opik dataset create-smoke \
  --source consumer-agent-eval \
  --target consumer-agent-eval-smoke \
  --size 75 \
  --seed 42

Option	Short	Required	Description
`--source`	`-s`	Yes	Source dataset name
`--target`	`-t`	Yes	Target smoke dataset name (created if missing)
`--size`	`-n`	No	Target subset size (default: 75)
`--safety-category`		No	Category to include fully (default: `safety`)
`--seed`		No	Random seed for reproducibility (default: 42)
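
One plausible reading of "samples proportionally" is that the slots left after the safety items are split across the other categories by their share of the non-safety items. The sketch below works through that arithmetic. It is an assumption about the allocation, not the tool's documented algorithm, and the category counts in the usage comment are invented for illustration. `smoke_quota` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch of proportional allocation for a stratified subset (assumed logic).
# smoke_quota SIZE SAFETY_COUNT CATEGORY_COUNT OTHERS_TOTAL
#   SIZE            target subset size (for example 75)
#   SAFETY_COUNT    number of safety items, all of which are kept
#   CATEGORY_COUNT  items in the non-safety category being allocated
#   OTHERS_TOTAL    total items across all non-safety categories
smoke_quota() {
  local size="$1"
  local safety="$2"
  local count="$3"
  local others="$4"

  # Slots that remain once every safety item has been included.
  local budget=$(( size - safety ))

  # Share of the remaining slots, rounded to the nearest integer.
  echo $(( (budget * count + others / 2) / others ))
}

# Hypothetical split of a 345-item dataset: 30 safety items, 210 and 105 others.
#   smoke_quota 75 30 210 315
#   smoke_quota 75 30 105 315
```

With these made-up counts the two quotas plus the 30 safety items add up to the 75-item target. Real category sizes will rarely divide this cleanly, so the actual tool may handle rounding differently.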

CI Workflows

stage-evaluation.yaml (Deploy Gate)

Runs the full 345-item evaluation before stage/prod deployment. Called by stage-deploy-fsd.yaml as a blocking prerequisite.

publish-to-ecr -> eval-gate (stage-evaluation) -> deploy-stage-east

merge-evaluation.yaml (Merge Monitor)

Runs the 75-item smoke evaluation on every push to main that touches agent code, prompts, or configs. Blocking.

Triggered by path changes: src/consumer_agent/**, prompts/**, configs/rules/**, agent_config.yaml, settings.yaml.