EvidenceRecorder: Capture Tool Calls for Richer Scoring

EvidenceRecorder is the SDK-level instrumentation primitive for agents that do not run on OpenClaw or a supported framework adapter. Wrapping an execution block inside a monitor() context upgrades the evidence quality from response_only to fully_instrumented, which raises evaluation_confidence and unlocks more accurate failure cause detection in the resulting CritiqorResult. Use EvidenceRecorder when you are calling a custom agent, a bare LLM client, or any tool-using pipeline where automatic framework detection does not apply.

Import

from critiqor import monitor, EvidenceRecorder

`monitor()`

monitor(prompt="") → EvidenceRecorder

Module-level factory function that creates an EvidenceRecorder context manager scoped to a single agent execution. The prompt parameter is optional at construction time and can be supplied later when calling finish().

| Parameter | Type | Description |
| --- | --- | --- |
| `prompt` | `str` | The user prompt for this execution. Optional here; it can be passed to `finish()` instead. |

Returns: EvidenceRecorder — a context manager that begins capturing on __enter__ and closes on __exit__.

Usage

from critiqor import Critiqor, monitor

agent = Critiqor(your_agent)

with monitor("What is 2 + 2?") as recorder:
    recorder.record_tool_call("calculator", {"expression": "2 + 2"})
    result = your_agent.run("What is 2 + 2?")
    recorder.record_tool_output("calculator", "4")
    evidence = recorder.finish(result, "What is 2 + 2?")

critiqor_result = agent.evaluate(
    prompt="What is 2 + 2?",
    response=result,
    tool_calls=evidence.tool_calls,
    tool_outputs=evidence.tool_outputs,
    evidence_level=evidence.evidence_level,
)

The with block automatically records agent_start and agent_finish trace events, captures any unhandled exceptions as error events, and resets the recorder context variable when the block exits. You can also call the recorder's methods directly from any synchronous code without the with block; in that case, call finish() manually when done.

Methods

`record_tool_call()`

recorder.record_tool_call(tool, args=None, call_id=None)

Records a tool invocation. Appends a ToolCall to the recorder’s internal list and emits a tool_start trace event.

| Parameter | Type | Description |
| --- | --- | --- |
| `tool` | `str` | Name of the tool being called. |
| `args` | `dict \| None` | Arguments passed to the tool. Defaults to an empty dict if not provided. |
| `call_id` | `str \| None` | Optional identifier used to correlate this call with its output. |

`record_tool_output()`

recorder.record_tool_output(tool, output, call_id=None, error=None)

Records a tool result. Appends a ToolOutput and emits a tool_end trace event. If error is provided, it is also appended to the recorder’s error list.

| Parameter | Type | Description |
| --- | --- | --- |
| `tool` | `str` | Name of the tool that produced the output. |
| `output` | `Any` | The tool’s return value. |
| `call_id` | `str \| None` | Correlates with a prior `record_tool_call()` call. |
| `error` | `str \| None` | Error message string if the tool call failed. |
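The call_id parameter matters once the same tool is invoked more than once, or when outputs arrive out of order, because the tool name alone can no longer pair a call with its result. The sketch below shows one way such pairing could work. It uses plain dicts as an assumed stand-in for ToolCall and ToolOutput records, and `pair_calls_with_outputs` is a hypothetical helper, not a critiqor function.

```python
def pair_calls_with_outputs(tool_calls, tool_outputs):
    """Pair each call with the output sharing its call_id (None if unmatched)."""
    # Index outputs by call_id, skipping outputs recorded without one.
    outputs_by_id = {
        out["call_id"]: out for out in tool_outputs if out.get("call_id")
    }
    return [
        (call, outputs_by_id.get(call.get("call_id"))) for call in tool_calls
    ]


# Two calls to the same tool; the outputs were recorded in reverse order.
calls = [
    {"tool": "search", "args": {"q": "a"}, "call_id": "c1"},
    {"tool": "search", "args": {"q": "b"}, "call_id": "c2"},
]
outputs = [
    {"tool": "search", "output": "result b", "call_id": "c2", "error": None},
    {"tool": "search", "output": "result a", "call_id": "c1", "error": None},
]

pairs = pair_calls_with_outputs(calls, outputs)
matched = {call["call_id"]: out["output"] for call, out in pairs if out}
```

Here `matched` maps `"c1"` to `"result a"` and `"c2"` to `"result b"`, even though both records name the same tool.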

`record_llm_call()`

recorder.record_llm_call(model=None, token_usage=None)

Records an LLM invocation for token counting and cost analysis. Token usage data is merged into the recorder’s token_usage dict and propagated to RuntimeMetrics when finish() is called.

| Parameter | Type | Description |
| --- | --- | --- |
| `model` | `str \| None` | Model identifier (e.g. `"gpt-4o"`, `"llama3.2"`). |
| `token_usage` | `dict \| None` | Token usage dict, e.g. `{"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200}`. |
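The description above says token usage is merged but does not say whether repeated keys are summed or overwritten. As a rough illustration only, the sketch below assumes per-key summing across calls, which is the usual choice for cost accounting. `merge_token_usage` is a hypothetical helper and may not match how critiqor merges internally.

```python
from collections import Counter


def merge_token_usage(running, new):
    """Return a new dict with per-key token counts added together."""
    merged = Counter(running or {})
    merged.update(new or {})  # Counter.update adds counts instead of replacing
    return dict(merged)


first_call = {"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200}
second_call = {"prompt_tokens": 30, "completion_tokens": 10, "total_tokens": 40}

usage = merge_token_usage({}, first_call)
usage = merge_token_usage(usage, second_call)
```

Under the summing assumption, `usage` ends with 150 prompt tokens, 90 completion tokens, and 240 total tokens.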

`record_event()`

recorder.record_event(name, **payload)

Records a generic named event with an arbitrary keyword-argument payload. The event is timestamped automatically and appended to the trace. Use this for framework-specific events that don’t fit the tool call or LLM call shapes.

| Parameter | Type | Description |
| --- | --- | --- |
| `name` | `str` | Event name (e.g. `"state_transition"`, `"decision_made"`). |
| `**payload` | `Any` | Arbitrary keyword arguments included in the trace event dict. |

`wrap_tool()`

recorder.wrap_tool(name, func) → callable

Returns an instrumented wrapper around a callable tool. The wrapper calls record_tool_call() and record_tool_output() automatically on every invocation. If the underlying function raises an exception, the error is recorded and the exception is re-raised.

| Parameter | Type | Description |
| --- | --- | --- |
| `name` | `str` | The tool name used in recorded evidence. |
| `func` | `callable` | The tool function to wrap. |

Returns: A new callable with the same signature as func.

calculator = recorder.wrap_tool("calculator", raw_calculator_fn)
result = calculator("2 + 2")  # automatically recorded
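For readers who want to see what such a wrapper does, the following is a minimal sketch of the pattern: record the call, run the tool, record the output, and on failure record the error before re-raising. `SketchRecorder` is a hypothetical stand-in that stores plain dicts. The way positional and keyword arguments are packed into `args` is an assumption, and the real wrap_tool may record them differently.

```python
import functools


class SketchRecorder:
    """Illustrative stand-in exposing the documented method names (not critiqor code)."""

    def __init__(self):
        self.tool_calls = []
        self.tool_outputs = []
        self.errors = []

    def record_tool_call(self, tool, args=None, call_id=None):
        self.tool_calls.append({"tool": tool, "args": args or {}, "call_id": call_id})

    def record_tool_output(self, tool, output, call_id=None, error=None):
        self.tool_outputs.append(
            {"tool": tool, "output": output, "call_id": call_id, "error": error}
        )
        if error is not None:
            self.errors.append(error)

    def wrap_tool(self, name, func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            self.record_tool_call(name, {"args": args, "kwargs": kwargs})
            try:
                output = func(*args, **kwargs)
            except Exception as exc:
                # Record the failure, then let the caller see the exception.
                self.record_tool_output(name, None, error=str(exc))
                raise
            self.record_tool_output(name, output)
            return output

        return wrapper


def divide(a, b):
    if b == 0:
        raise ValueError("b must be non-zero")
    return a / b


sketch = SketchRecorder()
safe_divide = sketch.wrap_tool("divide", divide)
quotient = safe_divide(8, 2)
try:
    safe_divide(1, 0)
except ValueError:
    pass
```

Both invocations leave a call record and an output record, and only the failing one adds an entry to `sketch.errors`.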

`finish()`

recorder.finish(response="", prompt=None) → EvaluationEvidence

Closes the recorder and assembles the collected tool calls, outputs, trace events, and runtime metrics into an EvaluationEvidence object. Wall-clock latency is measured from the time the context manager was entered. The returned evidence always has evidence_level="fully_instrumented".

| Parameter | Type | Description |
| --- | --- | --- |
| `response` | `str` | The agent’s final response string. |
| `prompt` | `str \| None` | Overrides the prompt set at construction time. |

Returns: EvaluationEvidence

`EvaluationEvidence` Fields

EvaluationEvidence is a frozen dataclass returned by finish() and also accessible as CritiqorResult.evidence. It holds the complete normalized evidence snapshot used during evaluation.

| Field | Type | Description |
| --- | --- | --- |
| `prompt` | `str` | The input prompt for the evaluated run. |
| `response` | `str` | The agent’s response. |
| `tool_calls` | `list[ToolCall]` | All captured tool calls in order. |
| `tool_outputs` | `list[ToolOutput]` | All captured tool outputs in order. |
| `trace` | `list[dict]` | Full event trace, including `agent_start`, `tool_start`, `tool_end`, LLM calls, and `agent_finish` events. |
| `metrics` | `RuntimeMetrics` | Wall-clock latency, token usage, retry count, and error strings. |
| `evidence_level` | `EvidenceLevel` | `"response_only"`, `"trace_available"`, or `"fully_instrumented"`. Inferred automatically if not overridden. |
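Because the evidence object is described as a frozen dataclass, its fields cannot be reassigned after creation. The sketch below shows what that means in practice using `SketchEvidence`, a hypothetical and much smaller stand-in for EvaluationEvidence. The standard-library `dataclasses.replace` call shown here works on any dataclass, but whether it is the recommended way to derive a modified EvaluationEvidence is an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SketchEvidence:
    """Reduced stand-in with a few of the documented fields (not critiqor code)."""

    prompt: str
    response: str
    tool_calls: tuple = ()
    evidence_level: str = "response_only"


snapshot = SketchEvidence(prompt="What is 2 + 2?", response="4")

try:
    snapshot.response = "5"  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a new instance and leaves the original untouched.
upgraded = replace(snapshot, evidence_level="fully_instrumented")
```

The original `snapshot` keeps its values, while `upgraded` is a separate object with only `evidence_level` changed.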
