Evaluation: How Critiqor Scores and Diagnoses Agent Runs

Critiqor’s evaluation pipeline is entirely deterministic. No LLM judgment is used at any stage. Every score, every failure cause, every causal chain, and every deployment recommendation is computed by inspecting the raw events collected in session.json and applying rule-based detectors defined in openclaw.py. The same evidence always produces the same diagnosis — making results auditable, reproducible, and safe to use in CI/CD gates. The pipeline runs inside diagnose_openclaw_events(), which accepts the session’s event list and returns an OpenClawDiagnosis dataclass containing all of the fields documented below.

Trust Score

The trust score is a single integer from 0 to 100. It is the weighted sum of six per-dimension scores, each of which starts at 100 and is reduced by a penalty proportional to the severity of any detected failure in that dimension.

Dimension weights

Dimension	Weight	Failure type penalised
`loop_control`	20%	`infinite_tool_loop`
`tool_output_utilization`	20%	`ignoring_tool_outputs`
`memory_integrity`	15%	`memory_degradation`
`context_health`	15%	`context_pollution`
`cost_efficiency`	15%	`cost_explosion`
`skill_adherence`	15%	`skill_failure`

Score formula

Each dimension score is computed in _openclaw_scores():

dimension_score = max(0, 100 − |impact_score for that failure type|)

The weighted trust score is then computed in _weighted_score():

trust_score = round(Σ dimension_score × weight)

The result is clamped to the range [0, 100].
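The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the code in openclaw.py: the function names dimension_scores and weighted_trust_score, the dictionary layout, and the use of integer percentage weights are assumptions made for the example. The weights and the dimension-to-failure mapping are taken from the table above.

```python
# Weights (in percent) and failure types copied from the "Dimension weights" table.
DIMENSIONS = {
    "loop_control": (20, "infinite_tool_loop"),
    "tool_output_utilization": (20, "ignoring_tool_outputs"),
    "memory_integrity": (15, "memory_degradation"),
    "context_health": (15, "context_pollution"),
    "cost_efficiency": (15, "cost_explosion"),
    "skill_adherence": (15, "skill_failure"),
}


def dimension_scores(impact_by_failure):
    """Hypothetical helper: 100 minus the absolute impact, floored at 0."""
    scores = {}
    for dimension, (_weight, failure_type) in DIMENSIONS.items():
        impact = impact_by_failure.get(failure_type, 0)
        scores[dimension] = max(0, 100 - abs(impact))
    return scores


def weighted_trust_score(scores):
    """Hypothetical helper: weighted sum of dimension scores, rounded and clamped."""
    total = sum(scores[name] * weight for name, (weight, _f) in DIMENSIONS.items())
    return max(0, min(100, round(total / 100)))


# Worked example: a single loop failure at the 30-point cap.
scores = dimension_scores({"infinite_tool_loop": -30})
print(scores["loop_control"])        # 70
print(weighted_trust_score(scores))  # 94
```

In this example only loop_control drops to 70, so the trust score is 70 × 0.20 + 100 × 0.80 = 94. Note that Python's built-in round() rounds exact halves to the nearest even integer, which may or may not match the rounding used by _weighted_score().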

Penalty caps

Each failure detector imposes an impact_score that grows with evidence severity but is capped to prevent any single failure from dominating the score:

Failure type	Maximum penalty
`infinite_tool_loop`	30 points
`ignoring_tool_outputs`	30 points
`cost_explosion`	30 points
`memory_degradation`	25 points
`skill_failure`	24 points
`context_pollution`	22 points

A run with no detected failures in any dimension receives a trust score of 100.
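As an illustration of how a cap bounds a detector's penalty, the sketch below clamps a raw penalty to the maximum listed in the table. The helper name capped_impact, the raw_penalty argument, and the convention that impact scores are stored as negative numbers are assumptions; the source only states that the absolute value of impact_score is subtracted from 100.

```python
# Maximum penalties copied from the "Penalty caps" table.
PENALTY_CAPS = {
    "infinite_tool_loop": 30,
    "ignoring_tool_outputs": 30,
    "cost_explosion": 30,
    "memory_degradation": 25,
    "skill_failure": 24,
    "context_pollution": 22,
}


def capped_impact(failure_type, raw_penalty):
    """Hypothetical helper: return a negative impact whose magnitude never exceeds the cap."""
    cap = PENALTY_CAPS[failure_type]
    return -min(abs(raw_penalty), cap)


print(capped_impact("infinite_tool_loop", 48))  # -30 (capped)
print(capped_impact("context_pollution", 10))   # -10 (below the cap)
```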

Executive Summary

The executive summary combines the trust score, readiness level, and primary failure type into a concise run-level verdict.

Readiness level

_readiness_level() maps the trust score and failure severities to one of three levels:

Level	Condition
`ready_for_runtime`	Trust ≥ 80 and no high-severity or critical failures
`review_recommended`	Trust 60–79 or any high-severity failure is present
`unsafe_for_production`	Trust < 60 or any critical failure is present
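The table can be read as a short decision function. The sketch below is one plausible reading, not the implementation of _readiness_level(): the function signature and the order in which the conditions are checked (most severe level first) are assumptions.

```python
def readiness_level(trust_score, severities):
    """Hypothetical sketch: map a trust score and failure severities to a readiness level.

    `severities` is assumed to be a list such as ["medium", "high"].
    The most severe level is checked first so that it takes precedence.
    """
    if "critical" in severities or trust_score < 60:
        return "unsafe_for_production"
    if "high" in severities or trust_score < 80:
        return "review_recommended"
    return "ready_for_runtime"


print(readiness_level(94, ["critical"]))  # unsafe_for_production
print(readiness_level(85, ["high"]))      # review_recommended
print(readiness_level(100, []))           # ready_for_runtime
```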

Critical severity thresholds

A failure escalates to critical severity, which triggers unsafe_for_production regardless of the trust score, in either of the following cases:

infinite_tool_loop: the same tool call is repeated 5 or more times, or 3 or more retry_event entries are present in the same session.
cost_explosion: cumulative token usage across all token_usage events reaches 30,000 tokens or more.
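The thresholds above can be checked with simple counting. The following sketch assumes each event is a dictionary with a "type" key, that tool_call events carry "tool" and "args" keys, and that token_usage events carry a "tokens" key. These field names are illustrative, not the documented session.json schema. It also assumes "repeated" means the total number of identical calls in the session rather than consecutive calls.

```python
import json
from collections import Counter

LOOP_REPEAT_THRESHOLD = 5   # identical tool calls
RETRY_THRESHOLD = 3         # retry_event entries
TOKEN_THRESHOLD = 30_000    # cumulative tokens


def critical_failures(events):
    """Hypothetical sketch: return the failure types that reach critical severity."""
    critical = set()

    # Count identical tool calls by tool name plus canonicalised arguments.
    call_counts = Counter(
        (e.get("tool"), json.dumps(e.get("args"), sort_keys=True))
        for e in events
        if e.get("type") == "tool_call"
    )
    retries = sum(1 for e in events if e.get("type") == "retry_event")
    most_repeated = max(call_counts.values(), default=0)
    if most_repeated >= LOOP_REPEAT_THRESHOLD or retries >= RETRY_THRESHOLD:
        critical.add("infinite_tool_loop")

    # Sum token usage across every token_usage event.
    tokens = sum(e.get("tokens", 0) for e in events if e.get("type") == "token_usage")
    if tokens >= TOKEN_THRESHOLD:
        critical.add("cost_explosion")

    return critical


looping = [{"type": "tool_call", "tool": "search", "args": {"q": "x"}}] * 5
print(critical_failures(looping))  # {'infinite_tool_loop'}
```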

The executive summary is the first thing shown in the run detail view and is also included in the output of the critiqor runs CLI command.

Primary Diagnosis

When multiple failure causes are detected in a single run, _primary_diagnosis() selects the one with the highest absolute impact_score and reports it as the root cause.

Primary diagnosis fields

{
  "root_cause_failure_type": "infinite_tool_loop",
  "causal_chain_explanation": "tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged",
  "severity": "critical",
  "description": "Tool call repeated 6 times with matching arguments."
}

Field	Description
`root_cause_failure_type`	The failure type with the largest penalty
`causal_chain_explanation`	The `" -> "`-separated causal chain string from the detector
`severity`	`"medium"`, `"high"`, or `"critical"`
`description`	Human-readable description of what was observed

If no failures are detected, root_cause_failure_type is null and causal_chain_explanation reads "No OpenClaw failure mode was detected from runtime evidence."
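The selection rule and the no-failure fallback can be sketched as follows. This is an illustration under stated assumptions rather than the body of _primary_diagnosis(): each failure cause is assumed to be a dictionary with failure_type, impact_score, causal_chain, severity, and description keys, and ties are resolved in favour of the first cause encountered.

```python
NO_FAILURE_MESSAGE = "No OpenClaw failure mode was detected from runtime evidence."


def primary_diagnosis(causes):
    """Hypothetical sketch: pick the cause with the largest absolute impact_score."""
    if not causes:
        # Only the two fields described in the text are populated here.
        return {
            "root_cause_failure_type": None,
            "causal_chain_explanation": NO_FAILURE_MESSAGE,
        }
    top = max(causes, key=lambda cause: abs(cause["impact_score"]))
    return {
        "root_cause_failure_type": top["failure_type"],
        "causal_chain_explanation": " -> ".join(top["causal_chain"]),
        "severity": top["severity"],
        "description": top["description"],
    }


causes = [
    {
        "failure_type": "context_pollution",
        "impact_score": -12,
        "causal_chain": ["context_growth", "saturation_or_compaction", "key_state_risk"],
        "severity": "medium",
        "description": "Context grew close to saturation.",
    },
    {
        "failure_type": "infinite_tool_loop",
        "impact_score": -30,
        "causal_chain": ["tool_call", "tool_failure_or_no_progress",
                         "retry_same_action", "loop_flagged"],
        "severity": "critical",
        "description": "Tool call repeated 6 times with matching arguments.",
    },
]
print(primary_diagnosis(causes)["root_cause_failure_type"])  # infinite_tool_loop
```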

Causal chain explanations by failure type

Each detector in openclaw.py hard-codes a causal chain that reflects the sequence of observations that led to the failure:

Failure type	Causal chain
`infinite_tool_loop`	`tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged`
`memory_degradation`	`memory_stored -> recall_failed_or_ignored -> state_reconstruction_failed`
`ignoring_tool_outputs`	`tool_call -> tool_output -> decision_skipped_output -> unsupported_agent_step`
`context_pollution`	`context_growth -> saturation_or_compaction -> key_state_risk`
`cost_explosion`	`repeated_reasoning_or_calls -> token_waste -> cost_spike`
`skill_failure`	`skill_available -> skill_not_selected_or_failed -> generic_execution`

Evidence Section

The evidence_summary block in the diagnosis is a compact snapshot of every event type observed in the session. It is computed by _evidence_summary() from the normalized event list.

Evidence summary fields

{
  "event_count": 47,
  "event_counts": {
    "tool_call": 14,
    "tool_output": 14,
    "retry_event": 3,
    "error_event": 1,
    "token_usage": 6,
    "state_transition": 5,
    "memory_event": 4
  },
  "tool_calls": 14,
  "tool_outputs": 14,
  "memory_events": 4,
  "retries": 3,
  "errors": 1,
  "state_transitions": 5
}

Field	Source
`event_count`	Total events in the session
`event_counts`	Per-type breakdown of every observed event type
`tool_calls`	Count of `tool_call` events specifically
`tool_outputs`	Count of `tool_output` events specifically
`memory_events`	Count of `memory_event` events
`retries`	Count of `retry_event` events
`errors`	Count of `error_event` events
`state_transitions`	Count of `state_transition` events

The dashboard Evidence panel links the raw events behind these counts directly to the session.json file stored at runs/<run_id>/session.json.
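A summary with these fields can be produced by counting event types. The sketch below is not the code of _evidence_summary(); it assumes each normalized event is a dictionary with a "type" key, which is an illustrative field name.

```python
from collections import Counter


def evidence_summary(events):
    """Hypothetical sketch: count events by type and expose the named totals."""
    counts = Counter(event["type"] for event in events)
    return {
        "event_count": len(events),
        "event_counts": dict(counts),
        # Counter returns 0 for event types that never occurred.
        "tool_calls": counts["tool_call"],
        "tool_outputs": counts["tool_output"],
        "memory_events": counts["memory_event"],
        "retries": counts["retry_event"],
        "errors": counts["error_event"],
        "state_transitions": counts["state_transition"],
    }


# Rebuild the sample session from the per-type counts shown above.
sample_counts = {
    "tool_call": 14, "tool_output": 14, "retry_event": 3, "error_event": 1,
    "token_usage": 6, "state_transition": 5, "memory_event": 4,
}
events = [{"type": name} for name, n in sample_counts.items() for _ in range(n)]
summary = evidence_summary(events)
print(summary["event_count"], summary["retries"])  # 47 3
```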

Recommendations

Each failure cause has an associated description that explains what the detector observed. The dashboard surfaces the primary failure’s causal_chain_explanation as a step-by-step explanation of why the run scored the way it did, and pairs it with a remediation direction derived from the failure type.

Failure type	Remediation direction
`infinite_tool_loop`	Review loop termination conditions; add retry caps to prevent the agent from re-issuing identical tool calls without progress
`memory_degradation`	Check memory storage and recall logic; ensure that stored context is retrievable and actively used in downstream decisions
`ignoring_tool_outputs`	Ensure tool results are incorporated into agent decisions; outputs that are received but never referenced indicate a disconnect between the tool layer and the reasoning layer
`context_pollution`	Reduce context window usage; avoid unnecessary compaction events that discard key state silently
`cost_explosion`	Review tool call efficiency; add token budget guards to prevent runaway token accumulation across multi-turn sessions
`skill_failure`	Verify skill selection logic and skill invocation flow; a mismatch or ignored skill means the agent is falling back to generic execution when a specialized path is available

Agent Health / Run History

Every completed run produces a trust score that is persisted alongside the diagnosis in the runs/ directory. The dashboard Agent Health view plots these scores over time, making it possible to see whether an agent is improving, stable, or degrading across successive runs. Run files follow the naming convention runs/run_001.json, runs/run_002.json, and so on, with IDs assigned sequentially by next_run_id() in session.py. The CLI command critiqor runs reads this directory and prints a summary table showing run ID, trust score, readiness level, primary failure type, and tool call count for each completed run.
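Sequential IDs of this form can be derived from the file names already present in the runs/ directory. The sketch below is only an illustration of the naming convention: the helper name next_run_id_from_names and its signature are hypothetical, and the actual next_run_id() in session.py may work differently. Three-digit zero padding is inferred from the run_001 and run_002 examples.

```python
import re

RUN_FILE_PATTERN = re.compile(r"^run_(\d+)\.json$")


def next_run_id_from_names(file_names):
    """Hypothetical helper: return the next sequential run ID given existing file names."""
    highest = 0
    for name in file_names:
        match = RUN_FILE_PATTERN.match(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return f"run_{highest + 1:03d}"


print(next_run_id_from_names(["run_001.json", "run_002.json"]))  # run_003
print(next_run_id_from_names([]))                                # run_001
```

In practice the names could come from something like os.listdir("runs"); files that do not match the pattern are ignored.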

Root Cause Analysis

The causal graph is built by build_openclaw_causal_graph() and stored in the causal_graph field of the diagnosis. It is a node-edge graph with three edge types:

Edge type	Meaning
`precedes`	Temporal ordering — every consecutive pair of events in the timeline is connected with a `precedes` edge
`causes`	Evidence-to-failure — each piece of evidence that contributed to a detected failure has a `causes` edge pointing to the failure node
`reinforces`	Repeated evidence — when multiple evidence items within a single failure cause share the same failure node, consecutive items are connected with `reinforces` edges to show that the signal accumulated over time

Node types

Every event in the timeline becomes an event node. Each detected failure becomes a failure node with an ID of the form failure_<type> (e.g., failure_infinite_tool_loop). Failure nodes are the terminal points in the causal graph — all causes edges terminate at them.
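The three edge types and two node types can be illustrated with a small graph builder. This is a sketch under assumptions, not the code of build_openclaw_causal_graph(): events are assumed to carry an "id" key, failures are passed as a mapping from failure type to an ordered list of evidence event IDs, and the node and edge dictionaries use illustrative field names.

```python
def build_causal_graph(timeline, failures):
    """Hypothetical sketch: build event/failure nodes and precedes/causes/reinforces edges."""
    nodes = [{"id": event["id"], "kind": "event"} for event in timeline]
    edges = []

    # Temporal ordering: connect every consecutive pair of events.
    for prev, curr in zip(timeline, timeline[1:]):
        edges.append({"source": prev["id"], "target": curr["id"], "type": "precedes"})

    for failure_type, evidence_ids in failures.items():
        failure_id = f"failure_{failure_type}"
        nodes.append({"id": failure_id, "kind": "failure"})

        # Evidence-to-failure: every evidence item points at the failure node.
        for evidence_id in evidence_ids:
            edges.append({"source": evidence_id, "target": failure_id, "type": "causes"})

        # Repeated evidence: consecutive evidence items reinforce each other.
        for prev_id, curr_id in zip(evidence_ids, evidence_ids[1:]):
            edges.append({"source": prev_id, "target": curr_id, "type": "reinforces"})

    return {"nodes": nodes, "edges": edges}


timeline = [{"id": "e1"}, {"id": "e2"}, {"id": "e3"}, {"id": "e4"}]
graph = build_causal_graph(timeline, {"infinite_tool_loop": ["e2", "e3", "e4"]})
print(len(graph["nodes"]), len(graph["edges"]))  # 5 8
```

For this input the graph has four event nodes plus one failure node, and eight edges: three precedes, three causes, and two reinforces.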

Reading the causal chain

The primary_diagnosis.causal_chain_explanation field is the most human-readable entry point into the causal graph. It is the " -> "-joined sequence of the detector’s causal_chain list — for example:

tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged

Each step in this chain corresponds to a class of evidence observed in the runtime timeline. The full graph (accessible from the Evidence panel in the dashboard) shows the individual event nodes that instantiate each step, connected by the causes and reinforces edges that link them to the failure node.
