Evidence Types: What Critiqor Collects During Agent Runs

Every Critiqor diagnosis is grounded in evidence: raw, structured observations captured passively from a running OpenClaw agent. Rather than relying on self-reported agent summaries or LLM judgment, Critiqor records what actually happened at runtime and reasons from that record. All evidence for a given run is written incrementally to runs/<run_id>/session.json. The seven evidence categories below describe exactly what is collected, where each item comes from in the plugin source, and how each type flows into the final diagnosis.

Tool Calls

What it is: A record of every tool invocation observed during the agent run. The plugin’s TOOL_EVENTS listener captures tool_call and tool_execution_start events from OpenClaw’s tool_hooks source layer. Each entry records the tool name, the arguments passed, a call ID that pairs it with its result, and the timestamp at which the call was issued. Example:

{
  "timestamp": "2025-06-01T12:00:01.234Z",
  "event_type": "tool_call",
  "source_layer": "tool_hooks",
  "tool_name": "read_file",
  "tool_call_id": "tc_abc123",
  "status": "ok",
  "duration_ms": null,
  "payload": {
    "toolName": "read_file",
    "toolCallId": "tc_abc123"
  }
}

duration_ms is null on the call record because the timer has not yet stopped; it is populated on the matching tool output event once the result arrives.

Why it matters: The infinite_tool_loop detector in openclaw.py groups tool calls by a fingerprint of (tool_name, arguments). When the same fingerprint appears three or more times in a run, a loop failure is raised. The total count of tool calls also appears in the Executive Summary and in the tool_calls field of evidence_summary.

In the dashboard: Every tool call is listed in the Evidence panel. The tool call count from evidence_summary.tool_calls appears in the Executive Summary header alongside the trust score.
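The fingerprint grouping used for loop detection can be sketched roughly as follows. This is an illustration only, not the code in openclaw.py: the function names are hypothetical, and it assumes the call arguments are stored under payload["arguments"], which the example record above does not show.

```python
import json
from collections import Counter

LOOP_THRESHOLD = 3  # "three or more times", as stated in the text


def fingerprint(event):
    """Build a hashable (tool_name, arguments) key for one tool_call event."""
    # Assumption: arguments live under payload["arguments"]; the real key may differ.
    args = event.get("payload", {}).get("arguments", {})
    # Serialising with sorted keys makes equal argument dicts compare equal.
    return (event.get("tool_name"), json.dumps(args, sort_keys=True))


def find_loops(events):
    """Return each fingerprint seen LOOP_THRESHOLD or more times, with its count."""
    counts = Counter(
        fingerprint(e) for e in events if e.get("event_type") == "tool_call"
    )
    return {fp: n for fp, n in counts.items() if n >= LOOP_THRESHOLD}


def make_call(path):
    """Helper that builds a minimal tool_call event for the demo below."""
    return {
        "event_type": "tool_call",
        "tool_name": "read_file",
        "payload": {"arguments": {"path": path}},
    }


sample_events = [make_call("a.txt"), make_call("a.txt"), make_call("b.txt"), make_call("a.txt")]
loops = find_loops(sample_events)
```

In this sample, only the read_file call on a.txt reaches the threshold; the single call on b.txt is not reported.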

Tool Outputs

What it is: The result returned by a tool after it executes. The plugin correlates a tool_result or tool_execution_end event with its originating tool_call using the shared tool_call_id. The isError flag on the payload becomes the status field ("ok" or "error"), and duration_ms is computed as the elapsed time between the call’s recorded start timestamp and the moment the result arrives. Example:

{
  "timestamp": "2025-06-01T12:00:01.890Z",
  "event_type": "tool_result",
  "source_layer": "tool_hooks",
  "tool_name": "read_file",
  "tool_call_id": "tc_abc123",
  "status": "ok",
  "duration_ms": 656,
  "payload": {
    "isError": false,
    "result": "File contents here..."
  }
}
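The pairing of a result with its call, and the resulting duration_ms value, can be approximated with the sketch below. It is a hypothetical reconstruction in Python (the function names are not from the plugin) that works only from the recorded timestamps and the shared tool_call_id.

```python
from datetime import datetime, timedelta


def parse_ts(ts):
    """Parse an ISO 8601 timestamp; older Python versions reject a trailing 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def pair_durations(events):
    """Map tool_call_id -> elapsed milliseconds between call and result."""
    starts = {}
    durations = {}
    for e in events:
        call_id = e.get("tool_call_id")
        if e.get("event_type") == "tool_call":
            starts[call_id] = parse_ts(e["timestamp"])
        elif e.get("event_type") == "tool_result" and call_id in starts:
            elapsed = parse_ts(e["timestamp"]) - starts.pop(call_id)
            # Integer division by one millisecond avoids float rounding.
            durations[call_id] = elapsed // timedelta(milliseconds=1)
    return durations


sample_events = [
    {"event_type": "tool_call", "tool_call_id": "tc_abc123",
     "timestamp": "2025-06-01T12:00:01.234Z"},
    {"event_type": "tool_result", "tool_call_id": "tc_abc123",
     "timestamp": "2025-06-01T12:00:01.890Z"},
]
durations = pair_durations(sample_events)
```

With the two timestamps from the examples in this document, the computed duration is 656 ms, matching the duration_ms shown on the result record.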

Why it matters: The ignoring_tool_outputs detector looks for tool output events where used is false, referenced is false, or status is "ignored". When the agent receives a result but does not incorporate it into subsequent decisions, those outputs are flagged and penalise the tool_output_utilization dimension. The duration_ms value is also used in cost and efficiency analysis: slow tool calls with low utilization are a strong signal of redundant execution.

In the dashboard: Tool outputs are displayed alongside their paired calls in the Evidence panel. A tool_result event with status: "error" increments the error_events metric counter in session.json.
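The three conditions checked by the ignoring_tool_outputs detector can be expressed as a simple predicate. The sketch below is an assumption-laden illustration: it supposes that used and referenced are boolean fields on the event payload, and it accepts both tool_result and tool_output as event types. Neither detail is confirmed by this document.

```python
# Event type names that may denote a tool output (assumed; see lead-in).
OUTPUT_EVENT_TYPES = {"tool_result", "tool_output"}


def is_ignored(event):
    """True if any of the three 'ignored output' conditions holds."""
    payload = event.get("payload", {})
    return (
        payload.get("used") is False          # explicitly marked unused
        or payload.get("referenced") is False  # never referenced later
        or event.get("status") == "ignored"
    )


def ignored_outputs(events):
    """Return the tool output events that the agent appears to have ignored."""
    return [
        e for e in events
        if e.get("event_type") in OUTPUT_EVENT_TYPES and is_ignored(e)
    ]


sample_events = [
    {"event_type": "tool_result", "status": "ok", "payload": {"used": True}},
    {"event_type": "tool_result", "status": "ok", "payload": {"used": False}},
    {"event_type": "tool_output", "status": "ignored", "payload": {}},
    {"event_type": "tool_call", "status": "ok", "payload": {"used": False}},
]
flagged = ignored_outputs(sample_events)
```

Note that a missing used or referenced field is not treated as ignored here; only an explicit false value counts.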

Runtime Events

What it is: Agent lifecycle events that describe state transitions, retries, errors, memory operations, decisions, skill invocations, and context changes. These are emitted by the OpenClaw extension API rather than the tool hooks layer. Critiqor’s plugin subscribes to all events defined in TIMELINE_EVENTS — including agent_start, agent_end, turn_start, turn_end, session_start, session_end, message_received, and message_sent — and records them with source_layer: "extension_api". The full set of OpenClaw event types recognized by the Python diagnosis engine (OPENCLAW_EVENT_TYPES) includes:

Event type	Description
`tool_call`	Tool invocation
`tool_output`	Tool result
`memory_event`	Memory storage or recall
`retry_event`	Retry attempt
`error_event`	Runtime error
`state_transition`	Agent state change
`decision`	Agent decision point
`skill_event`	OpenClaw skill invocation
`token_usage`	Provider token consumption
`context_event`	Context window change
`process_output`	Raw stdout/stderr line
`process_start`	Process launch
`process_end`	Process exit

Example:

{
  "timestamp": "2025-06-01T12:00:00.100Z",
  "event_type": "retry_event",
  "source_layer": "extension_api",
  "payload": { "reason": "tool_timeout", "attempt": 2 }
}

Why it matters: Runtime events are the backbone of causal analysis. A retry_event following a repeated tool_call triggers the infinite_tool_loop detector. Memory events with action values of recall_failed, ignored, lost, or miss trigger memory_degradation detection. Context events with saturation ≥ 85 or action: compaction trigger context_pollution detection. Skill events with status: ignored, mismatch, or failed trigger skill_failure detection.

In the dashboard: Runtime events appear in the Runtime Timeline section as a chronological event log. Retry and error events are highlighted and linked to the failure causes they contributed to.
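The memory, context, and skill trigger rules listed above amount to a small lookup. The following sketch shows one way to write them; it assumes the action, saturation, and status fields sit on the event payload, and the function name is hypothetical rather than taken from the diagnosis engine.

```python
MEMORY_FAILURE_ACTIONS = {"recall_failed", "ignored", "lost", "miss"}
SKILL_FAILURE_STATUSES = {"ignored", "mismatch", "failed"}
SATURATION_THRESHOLD = 85  # "saturation >= 85", as stated in the text


def triggered_failure(event):
    """Return the failure type a single runtime event would trigger, or None."""
    kind = event.get("event_type")
    payload = event.get("payload", {})  # assumed location of the fields

    if kind == "memory_event" and payload.get("action") in MEMORY_FAILURE_ACTIONS:
        return "memory_degradation"

    if kind == "context_event":
        saturated = payload.get("saturation", 0) >= SATURATION_THRESHOLD
        if saturated or payload.get("action") == "compaction":
            return "context_pollution"

    if kind == "skill_event" and payload.get("status") in SKILL_FAILURE_STATUSES:
        return "skill_failure"

    return None


results = [
    triggered_failure({"event_type": "memory_event", "payload": {"action": "miss"}}),
    triggered_failure({"event_type": "context_event", "payload": {"saturation": 85}}),
    triggered_failure({"event_type": "context_event", "payload": {"saturation": 40}}),
    triggered_failure({"event_type": "skill_event", "payload": {"status": "mismatch"}}),
]
```

The loop-related rule is omitted here because it depends on a sequence of events rather than on any single one.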

Provider Requests

What it is: Events that bracket each LLM call — before_provider_request when the agent sends a prompt to the model, and after_provider_response when the model reply arrives. The after_provider_response payload contains a usage block with input_tokens, output_tokens, and total. The session normalizer in session.py maps after_provider_response to the token_usage event type so the Python diagnosis engine can accumulate totals across all turns. Example:

{
  "timestamp": "2025-06-01T12:00:02.000Z",
  "event_type": "after_provider_response",
  "source_layer": "extension_api",
  "payload": {
    "usage": { "input_tokens": 1204, "output_tokens": 387, "total": 1591 }
  }
}

Why it matters: The cost_explosion detector sums usage.total across all token_usage events in the session. If the cumulative total reaches 12,000 tokens and duplicate tool actions are also present, a cost_explosion failure is raised. At 30,000 tokens or more the severity escalates to critical, which directly triggers an unsafe_for_production readiness level regardless of the trust score. Token totals also appear in the cost_analysis block of the final diagnosis.

In the dashboard: Token usage is shown in the Cost Analysis section. The total token count and the estimated token waste from redundant calls are both displayed.
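The two thresholds can be illustrated with a short sketch. It is not the detector's actual code: the function name and return shape are hypothetical, the duplicate-action check is passed in as a flag rather than computed, and it assumes the critical escalation applies only once the failure has been raised.

```python
COST_THRESHOLD = 12_000      # failure raised at this total, given duplicates
CRITICAL_THRESHOLD = 30_000  # severity escalates to critical at this total


def total_tokens(events):
    """Sum usage.total over all token_usage events."""
    return sum(
        e.get("payload", {}).get("usage", {}).get("total", 0)
        for e in events
        if e.get("event_type") == "token_usage"
    )


def check_cost_explosion(events, has_duplicate_tool_actions):
    """Return a small report dict if the failure is raised, else None."""
    total = total_tokens(events)
    if total < COST_THRESHOLD or not has_duplicate_tool_actions:
        return None
    return {"total_tokens": total, "critical": total >= CRITICAL_THRESHOLD}


def usage_event(total):
    """Helper that builds a minimal token_usage event for the demo below."""
    return {"event_type": "token_usage", "payload": {"usage": {"total": total}}}


moderate = check_cost_explosion([usage_event(7000), usage_event(6000)], True)
no_duplicates = check_cost_explosion([usage_event(7000), usage_event(6000)], False)
severe = check_cost_explosion([usage_event(31000)], True)
```

Here a 13,000-token run with duplicate tool actions raises a non-critical failure, the same run without duplicates raises nothing, and a 31,000-token run is marked critical.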

Execution Metadata

What it is: Process-level information captured when an agent process is launched and when it exits. Critiqor records a process_start event at launch (including the command used) and a process_end event on exit with the exit code and total wall-clock latency. When the agent is monitored via monitor_openclaw_process(), stdout and stderr are also captured line-by-line; lines that parse as JSON are recorded as typed events, and plain-text lines become process_output events. Example:

{
  "event": "process_end",
  "timestamp": "2025-06-01T12:05:22.000Z",
  "pid": 12345,
  "exit_code": 0,
  "framework": "openclaw"
}

Why it matters: A non-zero exit code causes monitor_openclaw_process() to emit an error_event before the process_end record. That error_event is then available to every failure detector. Total latency from process_end is included in the run payload under runtime_metrics.latency and surfaced in the dashboard Agent Health view.

In the dashboard: Exit code and latency appear in the run summary row shown by critiqor runs. Non-zero exit codes are flagged in the Evidence panel header.
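The line-by-line classification of process output, and the ordering of error_event before process_end, might look something like the sketch below. The function names and record shapes are hypothetical, and the rule that a JSON line needs an event_type key to count as a typed event is an assumption; monitor_openclaw_process() may apply different criteria.

```python
import json


def classify_line(line):
    """Turn one stdout/stderr line into an event record."""
    text = line.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    # Assumption: only JSON objects carrying an event_type are typed events.
    if isinstance(parsed, dict) and "event_type" in parsed:
        return parsed
    return {"event_type": "process_output", "payload": {"line": text}}


def exit_events(exit_code):
    """Events emitted on exit; an error_event precedes process_end on failure."""
    events = []
    if exit_code != 0:
        events.append({"event_type": "error_event", "payload": {"exit_code": exit_code}})
    events.append({"event_type": "process_end", "payload": {"exit_code": exit_code}})
    return events


typed = classify_line('{"event_type": "retry_event", "payload": {"attempt": 2}}')
plain = classify_line("Loading configuration...\n")
failed_exit = [e["event_type"] for e in exit_events(1)]
```

Lines that are valid JSON but not objects, such as a bare number, fall through to process_output in this sketch.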

Session Metadata

What it is: The top-level envelope that wraps all evidence in session.json. It is written by the plugin’s ensureSessionFile() function when the first event arrives and updated incrementally on every subsequent event. The metrics block is computed incrementally — total_events, by_event_type, by_source_layer, and error_events are all live counters maintained inside updateSessionSummary() in the plugin. Example:

{
  "session_id": "run_001",
  "run_id": "run_001",
  "schema_version": "critiqor.session.v1",
  "events_file": "session.json",
  "metrics": {
    "total_events": 47,
    "by_event_type": { "tool_call": 14, "tool_result": 14, "retry_event": 3 },
    "by_source_layer": { "tool_hooks": 28, "extension_api": 19 },
    "error_events": 1
  }
}

The schema_version field (critiqor.session.v1) is used by session.py to identify valid evidence files during finalization. The run_id ties the evidence file to the matching runs/run_001.json session record.

In the dashboard: The metrics block is read during finalize_session() to populate the Evidence summary panel. by_event_type and by_source_layer breakdowns are shown as compact tables in the run detail view.
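The live counters in the metrics block can be pictured with the following Python sketch. The plugin's updateSessionSummary() is not reproduced here; the function names below are hypothetical, and error_events is incremented only for status "error", which is the one case this document states explicitly.

```python
def new_metrics():
    """Empty metrics block with the four counters named in the text."""
    return {
        "total_events": 0,
        "by_event_type": {},
        "by_source_layer": {},
        "error_events": 0,
    }


def update_metrics(metrics, event):
    """Fold a single event into the running counters, in place."""
    metrics["total_events"] += 1

    event_type = event.get("event_type", "unknown")
    by_type = metrics["by_event_type"]
    by_type[event_type] = by_type.get(event_type, 0) + 1

    layer = event.get("source_layer", "unknown")
    by_layer = metrics["by_source_layer"]
    by_layer[layer] = by_layer.get(layer, 0) + 1

    # Only the documented case is counted; other error sources are not assumed.
    if event.get("status") == "error":
        metrics["error_events"] += 1
    return metrics


metrics = new_metrics()
for ev in [
    {"event_type": "tool_call", "source_layer": "tool_hooks", "status": "ok"},
    {"event_type": "tool_result", "source_layer": "tool_hooks", "status": "error"},
    {"event_type": "retry_event", "source_layer": "extension_api"},
]:
    update_metrics(metrics, ev)
```

Because each event touches only a handful of counters, the summary can be rewritten after every event without rescanning the events[] array.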

Runtime Timeline

What it is: The complete, ordered audit trail of everything Critiqor observed during the run. Every event described in the categories above appears as an entry in the events[] array in session.json, sorted by the timestamp at which it was recorded. The timeline is the primary input to diagnose_openclaw_events(): all six failure detectors, all dimension scores, and the causal graph are derived from sequential analysis of this array.

Why it matters: The causal graph built by build_openclaw_causal_graph() creates a precedes edge between every consecutive pair of events, so the temporal ordering of the timeline directly determines which events are identified as causes of later failures. A retry_event that immediately follows a repeated tool_call is connected to a loop_flagged failure node via a causes edge. Repeated evidence items within a single failure cause are connected with reinforces edges.

In the dashboard: The Runtime Timeline section renders the events[] array as a scrollable, chronological event log. Each entry shows the event type, source layer, timestamp, and a summary of its payload. Clicking an event that has an outgoing causes edge in the causal graph highlights the associated failure cause.
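The precedes edges between consecutive timeline events can be sketched as follows. This is a simplified illustration, not build_openclaw_causal_graph() itself: the edge shape is hypothetical, nodes are identified by their position in the sorted timeline, and causes and reinforces edges are left out.

```python
def precedes_edges(events):
    """Return the sorted timeline and one 'precedes' edge per consecutive pair."""
    # ISO 8601 timestamps in a uniform format sort correctly as plain strings.
    timeline = sorted(events, key=lambda e: e["timestamp"])
    edges = [
        {"from": i, "to": i + 1, "type": "precedes"}
        for i in range(len(timeline) - 1)
    ]
    return timeline, edges


sample_events = [
    {"timestamp": "2025-06-01T12:00:01.890Z", "event_type": "tool_result"},
    {"timestamp": "2025-06-01T12:00:00.100Z", "event_type": "retry_event"},
    {"timestamp": "2025-06-01T12:00:01.234Z", "event_type": "tool_call"},
]
timeline, edges = precedes_edges(sample_events)
```

A timeline of n events yields n - 1 precedes edges, so an empty or single-event run produces none.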
