Critiqor’s evaluation pipeline is entirely deterministic. No LLM judgment is used at any stage. Every score, every failure cause, every causal chain, and every deployment recommendation is computed by inspecting the raw events collected in session.json and applying rule-based detectors defined in openclaw.py. The same evidence always produces the same diagnosis — making results auditable, reproducible, and safe to use in CI/CD gates. The pipeline runs inside diagnose_openclaw_events(), which accepts the session’s event list and returns an OpenClawDiagnosis dataclass containing all of the fields documented below.

Trust Score

The trust score is a single integer from 0 to 100. It is the weighted sum of six per-dimension scores, each of which starts at 100 and is reduced by a penalty proportional to the severity of any detected failure in that dimension.

Dimension weights

| Dimension | Weight | Failure type penalised |
| --- | --- | --- |
| loop_control | 20% | infinite_tool_loop |
| tool_output_utilization | 20% | ignoring_tool_outputs |
| memory_integrity | 15% | memory_degradation |
| context_health | 15% | context_pollution |
| cost_efficiency | 15% | cost_explosion |
| skill_adherence | 15% | skill_failure |

Score formula

Each dimension score is computed in _openclaw_scores():
dimension_score = max(0, 100 − |impact_score for that failure type|)
The weighted trust score is then computed in _weighted_score():
trust_score = round(Σ dimension_score × weight)
The result is clamped to the range [0, 100].
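
The two formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in openclaw.py: the function names, the `impacts` dictionary keyed by dimension name, and the use of integer percentage weights are assumptions made for the example.

```python
# Weights as integer percentages, copied from the dimension weights table.
WEIGHTS_PCT = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}


def dimension_score(impact_score: float) -> float:
    """Start at 100 and subtract the absolute impact, never going below 0."""
    return max(0, 100 - abs(impact_score))


def trust_score(impacts: dict) -> int:
    """impacts maps a dimension name to its impact_score (hypothetical shape).

    Dimensions without an entry are treated as having no detected failure.
    """
    weighted = sum(
        dimension_score(impacts.get(dim, 0)) * pct
        for dim, pct in WEIGHTS_PCT.items()
    )
    # Python's round() sends exact halves to the nearest even integer.
    return max(0, min(100, round(weighted / 100)))


# A maxed-out loop penalty plus a 20-point cost penalty:
# (70*20 + 100*20 + 100*15 + 100*15 + 80*15 + 100*15) / 100 = 91
print(trust_score({"loop_control": -30, "cost_efficiency": -20}))
```

Keeping the weights as integers and dividing once at the end avoids accumulating floating-point error across the six terms.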

Penalty caps

Each failure detector imposes an impact_score that grows with evidence severity but is capped to prevent any single failure from dominating the score:

| Failure type | Maximum penalty |
| --- | --- |
| infinite_tool_loop | 30 points |
| ignoring_tool_outputs | 30 points |
| cost_explosion | 30 points |
| memory_degradation | 25 points |
| skill_failure | 24 points |
| context_pollution | 22 points |

A run with no detected failures in any dimension receives a trust score of 100.
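
As a worked example of how the caps interact with the weights, the sketch below clamps a raw penalty to its cap and computes the score a run would receive if every detector hit its maximum. It assumes the caps and weights are applied exactly as tabulated in this document; `PENALTY_CAPS`, `capped_penalty`, and `worst_case_trust_score` are hypothetical names, and the caps are keyed by the dimension each failure type penalises.

```python
# Integer percentage weights and per-dimension caps, copied from the tables.
WEIGHTS_PCT = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}
PENALTY_CAPS = {
    "loop_control": 30,             # infinite_tool_loop
    "tool_output_utilization": 30,  # ignoring_tool_outputs
    "memory_integrity": 25,         # memory_degradation
    "context_health": 22,           # context_pollution
    "cost_efficiency": 30,          # cost_explosion
    "skill_adherence": 24,          # skill_failure
}


def capped_penalty(dimension: str, raw_penalty: float) -> float:
    """Clamp a raw penalty magnitude to the cap for its dimension."""
    return min(abs(raw_penalty), PENALTY_CAPS[dimension])


def worst_case_trust_score() -> int:
    """Trust score when every dimension takes its maximum penalty."""
    weighted = sum(
        (100 - PENALTY_CAPS[dim]) * pct for dim, pct in WEIGHTS_PCT.items()
    )
    return round(weighted / 100)


# (70*20 + 70*20 + 75*15 + 78*15 + 70*15 + 76*15) / 100 = 72.85
print(worst_case_trust_score())  # 73
```

Under these assumptions the weighted score alone does not fall below 73, so any stricter verdict would have to come from failure severity rather than from the score by itself.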

Executive Summary

The executive summary combines the trust score, readiness level, and primary failure type into a concise run-level verdict.

Readiness level

_readiness_level() maps the trust score and failure severities to one of three levels:

| Level | Condition |
| --- | --- |
| ready_for_runtime | Trust ≥ 80 and no high-severity or critical failures |
| review_recommended | Trust 60–79 or any high-severity failure is present |
| unsafe_for_production | Trust less than 60 or any critical failure is present |
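
The table can be read as a most-severe-first cascade. The following is a minimal sketch of that reading, not the actual `_readiness_level()` implementation: the signature and the evaluation order (critical conditions first, then high-severity conditions) are assumptions, chosen so that a run matching two rows lands on the stricter level.

```python
def readiness_level(trust: int, severities: list) -> str:
    """Map a trust score and the severities of detected failures to a level.

    severities is a list such as ["medium", "high"] (hypothetical shape).
    """
    if trust < 60 or "critical" in severities:
        return "unsafe_for_production"
    if trust < 80 or "high" in severities:
        return "review_recommended"
    return "ready_for_runtime"


print(readiness_level(92, []))            # ready_for_runtime
print(readiness_level(92, ["high"]))      # review_recommended
print(readiness_level(92, ["critical"]))  # unsafe_for_production
```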

Critical severity thresholds

A failure escalates to critical severity, which triggers unsafe_for_production regardless of the trust score, under either of two conditions:
  • infinite_tool_loop: the same tool call is repeated 5 or more times, or 3 or more retry_event entries are present in the same session.
  • cost_explosion: cumulative token usage across all token_usage events reaches 30,000 tokens or more.
The executive summary is the first thing shown in the run detail view and is also included in the output of critiqor runs.
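
The two critical-severity thresholds described above could be checked roughly as follows. The event field names (`type`, `tool`, `arguments`, `tokens`) are assumptions made for this sketch and may not match the real session.json schema; "the same tool call" is interpreted here as the same tool name with identical arguments.

```python
import json
from collections import Counter

LOOP_REPEAT_THRESHOLD = 5      # same tool call repeated 5 or more times
RETRY_THRESHOLD = 3            # 3 or more retry_event entries
TOKEN_THRESHOLD = 30_000       # cumulative tokens across token_usage events


def is_critical_loop(events: list) -> bool:
    """True if any identical tool call repeats enough, or retries pile up."""
    call_counts = Counter(
        (e.get("tool"), json.dumps(e.get("arguments"), sort_keys=True))
        for e in events
        if e.get("type") == "tool_call"
    )
    most_repeated = max(call_counts.values(), default=0)
    retries = sum(1 for e in events if e.get("type") == "retry_event")
    return most_repeated >= LOOP_REPEAT_THRESHOLD or retries >= RETRY_THRESHOLD


def is_critical_cost(events: list) -> bool:
    """True if cumulative token usage reaches the threshold."""
    total = sum(
        e.get("tokens", 0) for e in events if e.get("type") == "token_usage"
    )
    return total >= TOKEN_THRESHOLD
```

Serialising the arguments with `sort_keys=True` makes two calls compare equal even when their argument dictionaries were built in a different key order.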

Primary Diagnosis

When multiple failure causes are detected in a single run, _primary_diagnosis() selects the one with the highest absolute impact_score and elevates it as the root cause explanation.

Primary diagnosis fields

{
  "root_cause_failure_type": "infinite_tool_loop",
  "causal_chain_explanation": "tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged",
  "severity": "critical",
  "description": "Tool call repeated 6 times with matching arguments."
}

| Field | Description |
| --- | --- |
| root_cause_failure_type | The failure type with the largest penalty |
| causal_chain_explanation | The " -> "-separated causal chain string from the detector |
| severity | "medium", "high", or "critical" |
| description | Human-readable description of what was observed |

If no failures are detected, root_cause_failure_type is null and causal_chain_explanation reads "No OpenClaw failure mode was detected from runtime evidence.".
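
The selection rule can be sketched as a `max` over the detected causes, keyed on the absolute impact score. The shape of each cause (a dictionary with `failure_type`, `impact_score`, `causal_chain`, `severity`, and `description` keys) is an assumption for illustration; the real detectors may return a different structure.

```python
NO_FAILURE_MESSAGE = "No OpenClaw failure mode was detected from runtime evidence."


def primary_diagnosis(causes: list) -> dict:
    """Pick the cause with the largest absolute impact_score."""
    if not causes:
        # Only the two fields the documentation specifies for this case
        # are filled in; the others are left out of the sketch.
        return {
            "root_cause_failure_type": None,
            "causal_chain_explanation": NO_FAILURE_MESSAGE,
        }
    top = max(causes, key=lambda cause: abs(cause["impact_score"]))
    return {
        "root_cause_failure_type": top["failure_type"],
        "causal_chain_explanation": " -> ".join(top["causal_chain"]),
        "severity": top["severity"],
        "description": top["description"],
    }
```

Using `abs()` as the key means the rule works whether detectors report impacts as negative deltas or as positive magnitudes.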

Causal chain explanations by failure type

Each detector in openclaw.py hard-codes a causal chain that reflects the sequence of observations that led to the failure:

| Failure type | Causal chain |
| --- | --- |
| infinite_tool_loop | tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged |
| memory_degradation | memory_stored -> recall_failed_or_ignored -> state_reconstruction_failed |
| ignoring_tool_outputs | tool_call -> tool_output -> decision_skipped_output -> unsupported_agent_step |
| context_pollution | context_growth -> saturation_or_compaction -> key_state_risk |
| cost_explosion | repeated_reasoning_or_calls -> token_waste -> cost_spike |
| skill_failure | skill_available -> skill_not_selected_or_failed -> generic_execution |

Evidence Section

The evidence_summary block in the diagnosis is a compact snapshot of every event type observed in the session. It is computed by _evidence_summary() from the normalized event list.

Evidence summary fields

{
  "event_count": 47,
  "event_counts": {
    "tool_call": 14,
    "tool_output": 14,
    "retry_event": 3,
    "error_event": 1,
    "token_usage": 6,
    "state_transition": 5,
    "memory_event": 4
  },
  "tool_calls": 14,
  "tool_outputs": 14,
  "memory_events": 4,
  "retries": 3,
  "errors": 1,
  "state_transitions": 5
}

| Field | Source |
| --- | --- |
| event_count | Total events in the session |
| event_counts | Per-type breakdown of every observed event type |
| tool_calls | Count of tool_call events specifically |
| tool_outputs | Count of tool_output events specifically |
| memory_events | Count of memory_event events |
| retries | Count of retry_event events |
| errors | Count of error_event events |
| state_transitions | Count of state_transition events |

The raw events behind these counts are linked from the dashboard Evidence panel directly to the session.json file stored at runs/<run_id>/session.json.
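
A summary with this shape can be produced with a single pass over the event list. The sketch below is not the real `_evidence_summary()`; it assumes each normalized event is a dictionary with a `type` key, which is a guess at the schema.

```python
from collections import Counter


def evidence_summary(events: list) -> dict:
    """Count events per type and expose the commonly used counts by name."""
    counts = Counter(event["type"] for event in events)
    return {
        "event_count": len(events),
        "event_counts": dict(counts),
        # Counter returns 0 for types that never occurred.
        "tool_calls": counts["tool_call"],
        "tool_outputs": counts["tool_output"],
        "memory_events": counts["memory_event"],
        "retries": counts["retry_event"],
        "errors": counts["error_event"],
        "state_transitions": counts["state_transition"],
    }
```

Because the named fields are derived from the same counter as `event_counts`, the two views cannot drift apart.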

Recommendations

Each failure cause has an associated description that explains what the detector observed. The dashboard surfaces the primary failure’s causal_chain_explanation as a step-by-step explanation of why the run scored the way it did, and pairs it with a remediation direction derived from the failure type.

| Failure type | Remediation direction |
| --- | --- |
| infinite_tool_loop | Review loop termination conditions; add retry caps to prevent the agent from re-issuing identical tool calls without progress |
| memory_degradation | Check memory storage and recall logic; ensure that stored context is retrievable and actively used in downstream decisions |
| ignoring_tool_outputs | Ensure tool results are incorporated into agent decisions; outputs that are received but never referenced indicate a disconnect between the tool layer and the reasoning layer |
| context_pollution | Reduce context window usage; avoid unnecessary compaction events that discard key state silently |
| cost_explosion | Review tool call efficiency; add token budget guards to prevent runaway token accumulation across multi-turn sessions |
| skill_failure | Verify skill selection logic and skill invocation flow; a mismatch or ignored skill means the agent is falling back to generic execution when a specialized path is available |

Agent Health / Run History

Every completed run produces a trust score that is persisted alongside the diagnosis in the runs/ directory. The dashboard Agent Health view plots these scores over time, making it possible to see whether an agent is improving, stable, or degrading across successive runs. Run files follow the naming convention runs/run_001.json, runs/run_002.json, and so on, with IDs assigned sequentially by next_run_id() in session.py. The CLI command critiqor runs reads this directory and prints a summary table showing run ID, trust score, readiness level, primary failure type, and tool call count for each completed run.
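
Sequential IDs in the run_001, run_002 style could be assigned along the following lines. This is only a sketch of the naming convention described above, not the `next_run_id()` in session.py: the function name `sketch_next_run_id`, the directory scan, and the choice to continue from the highest existing number are all assumptions.

```python
import re
from pathlib import Path

RUN_FILE_PATTERN = re.compile(r"run_(\d+)\.json$")


def sketch_next_run_id(runs_dir: Path) -> str:
    """Return the next zero-padded run ID after the highest one on disk."""
    used = []
    for path in runs_dir.glob("run_*.json"):
        match = RUN_FILE_PATTERN.match(path.name)
        if match:
            used.append(int(match.group(1)))
    next_number = max(used, default=0) + 1
    return f"run_{next_number:03d}"
```

Continuing from the highest number, rather than counting files, keeps IDs unique even if an earlier run file has been deleted.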

Root Cause Analysis

The causal graph is built by build_openclaw_causal_graph() and stored in the causal_graph field of the diagnosis. It is a node-edge graph with three edge types:

| Edge type | Meaning |
| --- | --- |
| precedes | Temporal ordering: every consecutive pair of events in the timeline is connected with a precedes edge |
| causes | Evidence-to-failure: each piece of evidence that contributed to a detected failure has a causes edge pointing to the failure node |
| reinforces | Repeated evidence: when multiple evidence items within a single failure cause share the same failure node, consecutive items are connected with reinforces edges to show that the signal accumulated over time |

Node types

Every event in the timeline becomes an event node. Each detected failure becomes a failure node with an ID of the form failure_<type> (e.g., failure_infinite_tool_loop). Failure nodes are the terminal points in the causal graph — all causes edges terminate at them.
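
To make the node and edge rules concrete, here is a minimal sketch of how such a graph could be assembled. It is not the real `build_openclaw_causal_graph()`: the input shapes (events with an `id` key, failures with `type` and `evidence_ids` keys) and the output dictionary layout are hypothetical.

```python
def build_causal_graph(timeline: list, failures: list) -> dict:
    """Build event and failure nodes plus precedes/causes/reinforces edges."""
    nodes = [{"id": event["id"], "kind": "event"} for event in timeline]
    edges = []

    # precedes: link every consecutive pair of events in the timeline.
    for earlier, later in zip(timeline, timeline[1:]):
        edges.append({"from": earlier["id"], "to": later["id"], "type": "precedes"})

    for failure in failures:
        failure_id = f"failure_{failure['type']}"
        nodes.append({"id": failure_id, "kind": "failure"})
        evidence_ids = failure["evidence_ids"]

        # causes: every evidence item points at the failure node.
        for evidence_id in evidence_ids:
            edges.append({"from": evidence_id, "to": failure_id, "type": "causes"})

        # reinforces: consecutive evidence items for the same failure.
        for first, second in zip(evidence_ids, evidence_ids[1:]):
            edges.append({"from": first, "to": second, "type": "reinforces"})

    return {"nodes": nodes, "edges": edges}
```

In this sketch failure nodes only ever appear as the target of causes edges, which matches the description of them as terminal points.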

Reading the causal chain

The primary_diagnosis.causal_chain_explanation field is the most human-readable entry point into the causal graph. It is the " -> "-joined sequence of the detector’s causal_chain list — for example:
tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged
Each step in this chain corresponds to a class of evidence observed in the runtime timeline. The full graph (accessible from the Evidence panel in the dashboard) shows the individual event nodes that instantiate each step, connected by the causes and reinforces edges that link them to the failure node.