
CLI Reference -- PLT-246 New Commands

These commands were added for the CI/CD evaluation pipeline. They extend the existing consumer-agent opik CLI.


The consumer-agent opik eval health command computes composite health metrics from an evaluation experiment. It returns four composites (Safety, Compliance, Quality, Reliability), each with a pass, warning, or fail verdict. The exit code is 0 on pass and 1 on fail, so the command can gate a CI/CD step.

Experiments belong to a dataset in Opik. The --dataset flag specifies which dataset to search for the experiment (defaults to the golden dataset).

```shell
# Compute health for the V1 baseline
consumer-agent opik eval health -e main_muffin_3622
# With JSON output for CI consumption
consumer-agent opik eval health -e main_muffin_3622 --json-output
# Compare against a different baseline
consumer-agent opik eval health -e new_experiment -b main_muffin_3622
# Different dataset
consumer-agent opik eval health -e my_experiment -d my-dataset
```

| Option | Short | Required | Description |
| --- | --- | --- | --- |
| --experiment | -e | Yes | Experiment name to assess |
| --dataset | -d | No | Opik dataset name (default: consumer-agent-eval) |
| --baseline | -b | No | Baseline experiment for comparison (default: from agent_config.yaml) |
| --json-output | | No | Output JSON only (for CI pipelines) |

Composites and thresholds are configured in agent_config.yaml under evaluation.health_metrics.
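The exit-code contract described for the health command can be wrapped in a small CI helper. The following is a minimal sketch, not part of the CLI: the function name health_gate and the default output file health.json are assumptions made for illustration, while the flags are the ones documented for this command.

```shell
# Sketch: run the health check, keep the JSON for later steps, and
# propagate a failing verdict to the CI step.
# health_gate is a hypothetical helper name, not a CLI command.
health_gate() {
  experiment="$1"
  out_file="${2:-health.json}"   # assumed default location for the JSON
  if consumer-agent opik eval health -e "$experiment" --json-output > "$out_file"; then
    echo "health gate passed for $experiment"
  else
    status=$?                    # exit status of the health command
    echo "health gate failed for $experiment (exit $status)" >&2
    return "$status"
  fi
}
```

Calling health_gate main_muffin_3622 in a pipeline step would then fail the step whenever the health command exits non-zero, and leave the JSON on disk for a later summary step.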


The consumer-agent opik eval summary command formats health metrics as a markdown summary table. It can compute health directly from an experiment name, or accept pre-computed JSON, which is faster in CI because it skips the Opik API call.

```shell
# Simple -- just experiment name (computes health internally)
consumer-agent opik eval summary -e main_muffin_3622
# With smoke label
consumer-agent opik eval summary -e main_muffin_3622 --smoke
# Custom dataset
consumer-agent opik eval summary -e main_muffin_3622 -d my-dataset
# CI mode -- pre-computed JSON (faster, no Opik API call)
consumer-agent opik eval summary --health-json /tmp/health.json --experiment main_muffin_3622
```

| Option | Short | Required | Description |
| --- | --- | --- | --- |
| --experiment | -e | Yes | Experiment name |
| --health-json | | No | Pre-computed health JSON string or file path. If omitted, computes from experiment. |
| --dataset | -d | No | Dataset name, used when computing health without --health-json (default: consumer-agent-eval) |
| --smoke | | No | Label as smoke evaluation in the heading |

Outputs markdown to stdout. In CI, append the output to the file named by $GITHUB_STEP_SUMMARY.
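Because $GITHUB_STEP_SUMMARY holds a file path rather than a stream, the summary is appended with a redirect. A minimal sketch of that step follows; the function name publish_summary and the stdout fallback are assumptions made for illustration, and the flags are the ones listed in the table for this command.

```shell
# Sketch: reuse pre-computed health JSON and append the markdown
# summary to the job summary file.
# publish_summary is a hypothetical helper name, not a CLI command.
publish_summary() {
  experiment="$1"
  health_json="${2:-/tmp/health.json}"
  # Outside GitHub Actions the variable is unset, so fall back to stdout.
  consumer-agent opik eval summary \
    --health-json "$health_json" \
    --experiment "$experiment" >> "${GITHUB_STEP_SUMMARY:-/dev/stdout}"
}
```

Appending with >> rather than overwriting keeps any summary content written by earlier steps in the same job.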


The consumer-agent opik eval report command generates a full markdown evaluation report with radar charts, score distributions, and an LLM-generated executive summary. See PLT-521 for full documentation.

```shell
# Generate report for an experiment
consumer-agent opik eval report -e main_muffin_3622 -o report.md
# Compare against baseline
consumer-agent opik eval report -e new_experiment --compare main_muffin_3622
# Skip LLM summary (faster, for CI)
consumer-agent opik eval report -e main_muffin_3622 --no-summary
```

| Option | Short | Required | Description |
| --- | --- | --- | --- |
| --experiment | -e | Yes | Experiment name |
| --compare | -c | No | Baseline experiment for comparison (default: from config) |
| --output | -o | No | Output file path (default: reports/{experiment}.md) |
| --dataset | -d | No | Dataset name (default: consumer-agent-eval) |
| --no-summary | | No | Skip LLM-generated executive summary |

The consumer-agent opik dataset create-smoke command creates a stratified smoke subset from a full dataset. It includes every item in the safety category (100%) and samples proportionally from the other categories. The merge-evaluation CI workflow uses this subset.

```shell
# Create 75-item smoke subset
consumer-agent opik dataset create-smoke \
  --source consumer-agent-eval \
  --target consumer-agent-eval-smoke \
  --size 75 \
  --seed 42
```

| Option | Short | Required | Description |
| --- | --- | --- | --- |
| --source | -s | Yes | Source dataset name |
| --target | -t | Yes | Target smoke dataset name (created if missing) |
| --size | -n | No | Target subset size (default: 75) |
| --safety-category | | No | Category to include fully (default: safety) |
| --seed | | No | Random seed for reproducibility (default: 42) |
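To make the stratification arithmetic concrete, the sketch below computes a per-category quota: the safety category keeps all of its items, and the remaining slots are split in proportion to each other category's size. This illustrates the idea only and is not the command's actual implementation. The helper name smoke_quotas, the input format, and the floor rounding are all assumptions.

```shell
# Sketch: per-category quotas for a smoke subset.
# Reads "<category> <count>" lines on stdin; smoke_quotas is a
# hypothetical helper, and the rounding rule is an assumption.
smoke_quotas() {
  size="$1"
  safety="${2:-safety}"
  awk -v size="$size" -v safety="$safety" '
    {
      name[NR] = $1; count[NR] = $2
      if ($1 == safety) kept = $2    # safety items are all kept
      else rest += $2                # total items outside safety
    }
    END {
      budget = size - kept           # slots left for the other categories
      if (budget < 0) budget = 0
      for (i = 1; i <= NR; i++) {
        if (name[i] == safety) quota = count[i]
        else if (rest > 0) quota = int(budget * count[i] / rest)   # floor
        else quota = 0
        print name[i], quota
      }
    }'
}
```

With made-up category counts of 15 safety, 200 billing, and 130 support items and a target size of 75, the sketch prints quotas of 15, 36, and 23. Flooring leaves one slot unassigned in this case; how the real command distributes such remainders is not specified here.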

The stage-evaluation workflow runs the full 345-item evaluation before stage/prod deployment. stage-deploy-fsd.yaml calls it as a blocking prerequisite.

publish-to-ecr -> eval-gate (stage-evaluation) -> deploy-stage-east

The merge-evaluation workflow runs the 75-item smoke evaluation on every push to main that touches agent code, prompts, or configs. It is blocking.

Triggered by path changes: src/consumer_agent/**, prompts/**, configs/rules/**, agent_config.yaml, settings.yaml.
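The path filters listed here are evaluated by the CI system itself. To check locally whether a changed file would fall under them, a rough approximation can be written with a shell case statement. The function name touches_agent is hypothetical. In case patterns, * also matches /, so src/consumer_agent/* covers nested paths in the same way the ** filters are meant to.

```shell
# Sketch: local approximation of the workflow path filters.
# touches_agent is a hypothetical helper; returns 0 when the path
# matches one of the documented trigger patterns.
touches_agent() {
  case "$1" in
    src/consumer_agent/*|prompts/*|configs/rules/*|agent_config.yaml|settings.yaml)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

For example, touches_agent src/consumer_agent/tools/search.py succeeds, while touches_agent README.md does not.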