OpenAI Responses API

The agent uses OpenAI’s Responses API to access model reasoning and structured streaming events.

Enable in agent_config.yaml:

openai:
  model: gpt-5-mini
  base_url: https://api.openai.com/v1
  use_responses_api: true  # Required
  reasoning_effort: low    # For GPT-5 models
  text_verbosity: low      # For GPT-5 models

The config system validates this on startup and raises an error if use_responses_api is missing.
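
A startup check of this kind might look like the sketch below. The names `validate_openai_config` and `ConfigError` are hypothetical, the config is assumed to be already parsed into a dictionary, and rejecting `use_responses_api: false` is an assumption based on the "Required" comment; the real validator may differ.

```python
class ConfigError(ValueError):
    """Hypothetical error type for an invalid agent configuration."""


def validate_openai_config(config: dict) -> dict:
    """Return the `openai` section, or raise if use_responses_api is not enabled."""
    openai_cfg = config.get("openai") or {}
    if "use_responses_api" not in openai_cfg:
        raise ConfigError("use_responses_api missing in agent_config.yaml")
    if openai_cfg["use_responses_api"] is not True:
        # Assumption: the flag must be exactly `true`, not merely present.
        raise ConfigError("use_responses_api must be true")
    return openai_cfg
```

Failing at startup rather than at the first request keeps a misconfigured agent from running without reasoning or streaming events.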

Requires langchain-openai>=0.3.9 (specified in pyproject.toml).

The agent emits structured events during response generation:

import asyncio

from consumer_agent.agent import Agent
from consumer_agent.agent.streaming import (
    ReasoningEvent,
    TextEvent,
    ToolCallEndEvent,
    ToolCallStartEvent,
    UsageEvent,
)
from consumer_agent.factory import create_chat_model

# `async for` must run inside a coroutine
async def main():
    model = create_chat_model()
    agent = Agent(model)
    messages = [{"role": "user", "content": "What is 2+2?"}]
    system_prompt = "You are a helpful assistant."
    async for event in agent.stream(messages, system_prompt):
        if isinstance(event, ReasoningEvent):
            print(f"Thinking: {event.content}")
        elif isinstance(event, TextEvent):
            print(f"Response: {event.content}")
        elif isinstance(event, ToolCallStartEvent):
            print(f"Calling tool: {event.tool_name}")
        elif isinstance(event, ToolCallEndEvent):
            print(f"Tool completed in {event.duration_ms}ms")
        elif isinstance(event, UsageEvent):
            print(f"Tokens: {event.total_tokens} ({event.reasoning_tokens} reasoning)")

asyncio.run(main())

Event types (src/consumer_agent/agent/streaming.py):

  • ReasoningEvent: Model reasoning before tool calls
  • TextEvent: Final response text after tools
  • ToolCallStartEvent: Tool execution begins
  • ToolCallEndEvent: Tool execution completes with duration
  • UsageEvent: Token usage metrics (input, output, reasoning, total)
  • CompletedEvent: Stream finished
  • ErrorEvent: Error occurred

The agent separates reasoning from final responses:

  • Before first tool call: Tokens are ReasoningEvent (internal thinking)
  • After tool calls: Tokens are TextEvent (final answer)

Example:

reasoning = []
response = []
async for event in agent.stream(messages, system_prompt):
    if isinstance(event, ReasoningEvent):
        reasoning.append(event.content)
    elif isinstance(event, TextEvent):
        response.append(event.content)
print("Reasoning:", "".join(reasoning))
print("Response:", "".join(response))
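
The classification rule stated above (tokens before the first tool call are reasoning, tokens after it are response text) can be sketched in isolation. The dataclasses here are stand-ins for the real event classes, and the `("token", text)` / `("tool", name)` input pairs are a hypothetical simplification of the underlying stream, not the agent's actual internals.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class ReasoningEvent:  # stand-in for the real event class
    content: str


@dataclass
class TextEvent:  # stand-in for the real event class
    content: str


def classify(
    items: Iterable[tuple[str, str]],
) -> Iterator[Union[ReasoningEvent, TextEvent]]:
    """Label each token by whether a tool call has been seen yet."""
    seen_tool_call = False
    for kind, value in items:
        if kind == "tool":
            seen_tool_call = True
        elif kind == "token":
            yield TextEvent(value) if seen_tool_call else ReasoningEvent(value)
```

Under this rule as written, a turn with no tool calls would be labeled entirely as reasoning; how the real agent treats that case is not stated here.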

GPT-5 Series (gpt-5-mini, gpt-5, gpt-5-nano)

Reasoning tokens are generated and accessible.

Configuration:

openai:
  model: gpt-5-mini
  reasoning_effort: low  # minimal, low, medium, high
  text_verbosity: low    # low, medium, high

Reasoning effort levels:

  • minimal: Fast, few/no reasoning tokens
  • low: Basic reasoning (default)
  • medium: Balanced reasoning and speed
  • high: Deep reasoning, more tokens

Reasoning appears in ReasoningEvent during streaming.
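
The accepted values listed above lend themselves to a small guard before the options reach the model. This is only a sketch: `check_gpt5_options` is a hypothetical helper, and the actual config system may validate these fields differently or not at all.

```python
REASONING_EFFORTS = ("minimal", "low", "medium", "high")
TEXT_VERBOSITIES = ("low", "medium", "high")


def check_gpt5_options(*, text_verbosity: str, reasoning_effort: str = "low") -> None:
    """Raise ValueError if either option is outside the documented values."""
    if reasoning_effort not in REASONING_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {REASONING_EFFORTS}, got {reasoning_effort!r}"
        )
    if text_verbosity not in TEXT_VERBOSITIES:
        raise ValueError(
            f"text_verbosity must be one of {TEXT_VERBOSITIES}, got {text_verbosity!r}"
        )
```

Note that `minimal` is valid only for `reasoning_effort`; `text_verbosity` starts at `low`.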

On O1/O3 series models such as o1-mini, reasoning tokens are generated but hidden (not accessible via the API).

Configuration:

openai:
  model: o1-mini
  # No reasoning_effort or text_verbosity parameters

You pay for reasoning tokens but cannot see the reasoning text.

Reasoning tokens are billed as output tokens.

Example with GPT-5-mini:

  • Input: 100 tokens
  • Reasoning: 50 tokens
  • Visible output: 150 tokens
  • Billed output: 200 tokens (reasoning + visible output)
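
The arithmetic in this example can be written as two small helpers, one for each direction. Both function names are hypothetical and exist only to illustrate the relationship between the counts.

```python
def billed_output_tokens(visible_output_tokens: int, reasoning_tokens: int) -> int:
    """Output tokens that are billed: visible text plus hidden reasoning."""
    return visible_output_tokens + reasoning_tokens


def visible_output_tokens(output_tokens: int, reasoning_tokens: int) -> int:
    """Recover the visible portion when the reported output count includes reasoning."""
    return output_tokens - reasoning_tokens


# The example above: 150 visible + 50 reasoning tokens.
assert billed_output_tokens(150, 50) == 200
assert visible_output_tokens(200, 50) == 150
```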

Token counts appear in response metadata:

# `message` is the AI message returned by the model; counts live in message.response_metadata
usage = message.response_metadata.get("usage", {})
details = usage.get("completion_tokens_details", {})
reasoning_tokens = details.get("reasoning_tokens", 0)

The agent emits UsageEvent at the end of each stream with token consumption metrics.

from consumer_agent.agent.streaming import UsageEvent

# UsageEvent fields (OpenAI/LangChain standard names):
#   input_tokens      Prompt/input tokens
#   output_tokens     Completion tokens (includes reasoning)
#   total_tokens      Sum of input + output
#   reasoning_tokens  Reasoning tokens (subset of output)
#   cached_tokens     Cached input tokens

async for event in agent.stream(messages, system_prompt):
    if isinstance(event, UsageEvent):
        print(f"Input: {event.input_tokens}")
        print(f"Output: {event.output_tokens}")
        print(f"Reasoning: {event.reasoning_tokens}")
        print(f"Total: {event.total_tokens}")

Track token usage across multiple turns:

session_tokens = {
    "input": 0,
    "output": 0,
    "reasoning": 0,
    "total": 0,
}
async for event in agent.stream(messages, system_prompt):
    if isinstance(event, UsageEvent):
        session_tokens["input"] += event.input_tokens
        session_tokens["output"] += event.output_tokens
        session_tokens["reasoning"] += event.reasoning_tokens
        session_tokens["total"] += event.total_tokens
print(f"Session total: {session_tokens['total']} tokens")
print(f"Reasoning: {session_tokens['reasoning']} tokens")
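
To make the multi-turn aspect explicit, the same accumulation can be wrapped in a small object that outlives any single stream. This sketch runs without the agent: `UsageEvent` here is a stand-in dataclass carrying only the fields used, and `fake_stream` is a hypothetical substitute for `agent.stream(...)`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class UsageEvent:  # stand-in with the fields used below
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int
    total_tokens: int


@dataclass
class SessionUsage:
    input_tokens: int = 0
    output_tokens: int = 0
    reasoning_tokens: int = 0
    total_tokens: int = 0
    turns: int = 0

    def add(self, event: UsageEvent) -> None:
        """Fold one turn's usage into the running session totals."""
        self.input_tokens += event.input_tokens
        self.output_tokens += event.output_tokens
        self.reasoning_tokens += event.reasoning_tokens
        self.total_tokens += event.total_tokens
        self.turns += 1


async def fake_stream(usage: UsageEvent):
    """Hypothetical stand-in for agent.stream(): emits one UsageEvent per turn."""
    yield usage


async def run_session(turn_usages: list[UsageEvent]) -> SessionUsage:
    session = SessionUsage()
    for usage in turn_usages:  # one iteration per conversation turn
        async for event in fake_stream(usage):
            if isinstance(event, UsageEvent):
                session.add(event)
    return session
```

Calling `asyncio.run(run_session([...]))` with one `UsageEvent` per turn returns the summed totals along with the number of turns seen.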

Calculate estimated costs:

# GPT-5-mini pricing (example)
INPUT_COST_PER_1M = 0.15   # $0.15 per 1M input tokens
OUTPUT_COST_PER_1M = 0.60  # $0.60 per 1M output tokens
async for event in agent.stream(messages, system_prompt):
    if isinstance(event, UsageEvent):
        input_cost = (event.input_tokens / 1_000_000) * INPUT_COST_PER_1M
        output_cost = (event.output_tokens / 1_000_000) * OUTPUT_COST_PER_1M
        total_cost = input_cost + output_cost
        print(f"Estimated cost: ${total_cost:.4f}")
        print(f"  Input: ${input_cost:.4f}")
        print(f"  Output: ${output_cost:.4f} (includes {event.reasoning_tokens} reasoning tokens)")
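
The cost formula can also be pulled into a pure function so it is testable without a live stream. `estimate_cost` is a hypothetical helper, and the default rates are the example figures from above, not confirmed pricing.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_1m: float = 0.15,   # example rate, USD per 1M input tokens
    output_cost_per_1m: float = 0.60,  # example rate, USD per 1M output tokens
) -> dict:
    """Return input, output, and total cost in USD.

    `output_tokens` is expected to already include reasoning tokens,
    so reasoning is not added a second time.
    """
    input_cost = input_tokens / 1_000_000 * input_cost_per_1m
    output_cost = output_tokens / 1_000_000 * output_cost_per_1m
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}
```

A single short turn usually costs a small fraction of a cent at these rates, so printing with more than four decimal places, or summing over a session first, may give more useful numbers.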

The agent uses dual streaming modes for low latency:

  • “messages” mode: Direct LLM token streaming (lowest latency)
  • “updates” mode: Agent state updates (tool tracking)

Benefits:

  • 3-5x faster time-to-first-token
  • Separate reasoning/response: Automatic classification
  • Tool tracking: Start/end events with durations
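
One way to picture the two modes running side by side is a single stream of `(mode, payload)` pairs that the consumer routes by mode. The sketch below is an assumption about the shape of such a stream, not the agent's actual internals: `fake_dual_stream`, the payload formats, and `consume` are all hypothetical.

```python
import asyncio
from typing import Any, AsyncIterator


async def fake_dual_stream() -> AsyncIterator[tuple[str, Any]]:
    """Hypothetical interleaved stream of token chunks and state updates."""
    yield ("messages", "Hel")
    yield ("updates", {"tool": "search", "status": "start"})
    yield ("messages", "lo")
    yield ("updates", {"tool": "search", "status": "end"})


async def consume(stream: AsyncIterator[tuple[str, Any]]) -> tuple[list[str], list[dict]]:
    """Route each item by mode: tokens are handled at once, updates feed tool tracking."""
    tokens: list[str] = []
    updates: list[dict] = []
    async for mode, payload in stream:
        if mode == "messages":
            tokens.append(payload)   # forwarded as soon as it arrives
        elif mode == "updates":
            updates.append(payload)  # used to derive tool start/end events
    return tokens, updates
```

Because tokens do not wait for a full state update, the first token can be shown as soon as the model produces it.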

Track tool execution timing:

tool_timings = {}
async for event in agent.stream(messages, system_prompt):
    if isinstance(event, ToolCallStartEvent):
        print(f"Starting {event.tool_name}...")
    elif isinstance(event, ToolCallEndEvent):
        tool_timings[event.tool_name] = event.duration_ms
        print(f"{event.tool_name}: {event.duration_ms}ms")
print("Performance:", tool_timings)
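
The dictionary above keeps only the most recent duration per tool, so a tool called twice in one turn overwrites its earlier timing. If per-tool aggregates are wanted, the durations can be collected as a list and summarized afterwards. `summarize_timings` is a hypothetical helper that takes `(tool_name, duration_ms)` pairs gathered from `ToolCallEndEvent`.

```python
from collections import defaultdict
from statistics import mean


def summarize_timings(durations: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group durations by tool and report call count, total, and mean in ms."""
    per_tool: dict[str, list[float]] = defaultdict(list)
    for tool_name, duration_ms in durations:
        per_tool[tool_name].append(duration_ms)
    return {
        name: {"calls": len(values), "total_ms": sum(values), "mean_ms": mean(values)}
        for name, values in per_tool.items()
    }
```

In the streaming loop, this would mean appending `(event.tool_name, event.duration_ms)` to a list on each `ToolCallEndEvent` and calling the helper once the stream ends.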

Handle stream errors:

try:
    async for event in agent.stream(messages, system_prompt):
        if isinstance(event, ErrorEvent):
            print(f"Error: {event.error} (code: {event.code})")
            break
        elif isinstance(event, TextEvent):
            print(event.content)
except Exception as e:
    print(f"Stream failed: {e}")

Error: use_responses_api missing in agent_config.yaml

Add the required parameter:

openai:
  use_responses_api: true

If no ReasoningEvent appears in the stream, check the configuration:

  1. Verify reasoning_effort is set (GPT-5 only)
  2. Ensure use_responses_api: true in config
  3. Confirm model is GPT-5 series (O1/O3 hide reasoning)
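
The three checks can be bundled into a quick diagnostic. This is a sketch under assumptions: `diagnose_missing_reasoning` is a hypothetical helper, it takes the already-parsed `openai` section as a dictionary, and detecting the GPT-5 series by the `gpt-5` name prefix is a simplification.

```python
def diagnose_missing_reasoning(openai_cfg: dict) -> list[str]:
    """Return a list of likely reasons why no reasoning is being streamed."""
    problems: list[str] = []
    if "reasoning_effort" not in openai_cfg:
        problems.append("reasoning_effort is not set (GPT-5 only)")
    if openai_cfg.get("use_responses_api") is not True:
        problems.append("use_responses_api is not true")
    model = str(openai_cfg.get("model", ""))
    if not model.startswith("gpt-5"):  # assumption: prefix identifies the series
        problems.append(f"model {model!r} is not GPT-5 series (O1/O3 hide reasoning)")
    return problems
```

An empty list means all three conditions from the checklist hold for the given config.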

If responses are slow or reasoning consumes too many tokens, reduce reasoning effort:

openai:
  reasoning_effort: minimal

Import error: langchain-openai version too old

Update dependencies:

uv sync