session.json and applying rule-based detectors defined in openclaw.py. The same evidence always produces the same diagnosis — making results auditable, reproducible, and safe to use in CI/CD gates.
The pipeline runs inside diagnose_openclaw_events(), which accepts the session’s event list and returns an OpenClawDiagnosis dataclass containing all of the fields documented below.
Trust Score
The trust score is a single integer from 0 to 100. It is the weighted sum of six per-dimension scores, each of which starts at 100 and is reduced by a penalty proportional to the severity of any detected failure in that dimension.
Dimension weights
| Dimension | Weight | Failure type penalised |
|---|---|---|
| loop_control | 20% | infinite_tool_loop |
| tool_output_utilization | 20% | ignoring_tool_outputs |
| memory_integrity | 15% | memory_degradation |
| context_health | 15% | context_pollution |
| cost_efficiency | 15% | cost_explosion |
| skill_adherence | 15% | skill_failure |
Score formula
Each dimension score is computed in _openclaw_scores(). The six dimension scores are then combined in _weighted_score(), and the resulting trust score lies in the range [0, 100].
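The weighting described above can be sketched as follows. This is a minimal illustration, not the code in openclaw.py: the function names, the shape of the penalties argument, and the rounding step are assumptions, while the weights come from the dimension table.

```python
# Weights in percent, taken from the dimension table (they sum to 100).
WEIGHTS = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}

def dimension_scores(penalties):
    """Each dimension starts at 100 and loses its penalty, floored at 0.

    `penalties` maps dimension name -> penalty points (hypothetical shape).
    """
    return {dim: max(0, 100 - penalties.get(dim, 0)) for dim in WEIGHTS}

def weighted_score(scores):
    """Weighted sum of dimension scores, rounded and clamped to [0, 100].

    Rounding to the nearest integer is an assumption of this sketch.
    """
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS) / 100
    return max(0, min(100, round(total)))

# A run with a 30-point loop penalty and a 20-point memory penalty.
scores = dimension_scores({"loop_control": 30, "memory_integrity": 20})
trust = weighted_score(scores)
clean_trust = weighted_score(dimension_scores({}))
```

Because the weights sum to 100%, a run with no detected failures scores exactly 100, and a penalty in a 20% dimension moves the trust score more than the same penalty in a 15% dimension.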
Penalty caps
Each failure detector imposes an impact_score that grows with evidence severity but is capped to prevent any single failure from dominating the score:
| Failure type | Maximum penalty |
|---|---|
| infinite_tool_loop | 30 points |
| ignoring_tool_outputs | 30 points |
| cost_explosion | 30 points |
| memory_degradation | 25 points |
| skill_failure | 24 points |
| context_pollution | 22 points |
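A short sketch of how such a cap could be applied. The caps are the values from the table; the function name capped_impact and the convention that a penalty is a positive number of points are assumptions made for illustration.

```python
# Maximum penalty per failure type, from the table above.
MAX_PENALTY = {
    "infinite_tool_loop": 30,
    "ignoring_tool_outputs": 30,
    "cost_explosion": 30,
    "memory_degradation": 25,
    "skill_failure": 24,
    "context_pollution": 22,
}

def capped_impact(failure_type, raw_impact):
    """Clamp a raw, severity-driven penalty to the cap for its failure type."""
    return min(raw_impact, MAX_PENALTY[failure_type])

over_cap = capped_impact("context_pollution", 40)   # evidence exceeds the cap
under_cap = capped_impact("skill_failure", 10)      # evidence below the cap
```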
Executive Summary
The executive summary combines the trust score, readiness level, and primary failure type into a concise run-level verdict.
Readiness level
_readiness_level() maps the trust score and failure severities to one of three levels:
| Level | Condition |
|---|---|
| ready_for_runtime | Trust ≥ 80 and no high-severity or critical failures |
| review_recommended | Trust 60–79 or any high-severity failure is present |
| unsafe_for_production | Trust less than 60 or any critical failure is present |
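The table can be read as a decision procedure that checks the strictest level first. The sketch below is an assumption about evaluation order, not a copy of _readiness_level(); the function name and the severities argument (a collection of severity strings for the detected failures) are hypothetical.

```python
def readiness_level(trust, severities):
    """Map a trust score and failure severities to a readiness level.

    The strictest condition is checked first so that a critical failure
    wins even when the trust score is high (assumed precedence).
    """
    if trust < 60 or "critical" in severities:
        return "unsafe_for_production"
    if trust < 80 or "high" in severities:
        return "review_recommended"
    return "ready_for_runtime"

clean_run = readiness_level(92, [])
high_severity_run = readiness_level(85, ["high"])
critical_run = readiness_level(85, ["medium", "critical"])
```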
Critical severity thresholds
A failure escalates to critical severity, which triggers unsafe_for_production regardless of the trust score, under two conditions:
- infinite_tool_loop: the same tool call is repeated 5 or more times, or 3 or more retry_event entries are present in the same session.
- cost_explosion: cumulative token usage across all token_usage events reaches 30,000 tokens or more.
critiqor runs.
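The two critical thresholds can be expressed as simple checks over the event list. The event shape used here (dictionaries with type, tool, args, and tokens keys) is an assumption for illustration, as is the choice to treat two tool calls as "the same" when their tool name and arguments match; only the numeric thresholds come from the text.

```python
import json
from collections import Counter

LOOP_REPEAT_THRESHOLD = 5   # identical tool calls
RETRY_THRESHOLD = 3         # retry_event entries in one session
TOKEN_THRESHOLD = 30_000    # cumulative tokens across token_usage events

def loop_is_critical(events):
    """True if one tool call repeats 5+ times or 3+ retries are present."""
    signatures = Counter(
        (e.get("tool"), json.dumps(e.get("args"), sort_keys=True))
        for e in events
        if e.get("type") == "tool_call"
    )
    retries = sum(1 for e in events if e.get("type") == "retry_event")
    most_repeated = max(signatures.values(), default=0)
    return most_repeated >= LOOP_REPEAT_THRESHOLD or retries >= RETRY_THRESHOLD

def cost_is_critical(events):
    """True if cumulative token usage reaches the 30,000-token threshold."""
    total = sum(e.get("tokens", 0) for e in events if e.get("type") == "token_usage")
    return total >= TOKEN_THRESHOLD

looping = [{"type": "tool_call", "tool": "search", "args": {"q": "x"}}] * 5
costly = [{"type": "token_usage", "tokens": 15_000}] * 2
```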
Primary Diagnosis
When multiple failure causes are detected in a single run, _primary_diagnosis() selects the one with the highest absolute impact_score and elevates it as the root cause explanation.
Primary diagnosis fields
| Field | Description |
|---|---|
| root_cause_failure_type | The failure type with the largest penalty |
| causal_chain_explanation | The " -> "-separated causal chain string from the detector |
| severity | "medium", "high", or "critical" |
| description | Human-readable description of what was observed |
When no failure is detected, root_cause_failure_type is null and causal_chain_explanation reads "No OpenClaw failure mode was detected from runtime evidence."
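The selection rule can be sketched as a maximum over absolute impact. The shape of each cause (a dictionary with failure_type, impact_score, severity, description, and causal_chain keys) is assumed for illustration, and the sketch only fills the two fields the text specifies for the no-failure case.

```python
NO_FAILURE_MESSAGE = "No OpenClaw failure mode was detected from runtime evidence."

def primary_diagnosis(causes):
    """Pick the cause with the largest absolute impact_score as the root cause."""
    if not causes:
        return {
            "root_cause_failure_type": None,
            "causal_chain_explanation": NO_FAILURE_MESSAGE,
        }
    top = max(causes, key=lambda cause: abs(cause["impact_score"]))
    return {
        "root_cause_failure_type": top["failure_type"],
        "causal_chain_explanation": " -> ".join(top["causal_chain"]),
        "severity": top["severity"],
        "description": top["description"],
    }

causes = [
    {"failure_type": "context_pollution", "impact_score": 12, "severity": "medium",
     "description": "Context grew past saturation.",
     "causal_chain": ["context_growth", "saturation_or_compaction", "key_state_risk"]},
    {"failure_type": "cost_explosion", "impact_score": 30, "severity": "critical",
     "description": "Token usage exceeded the budget.",
     "causal_chain": ["repeated_reasoning_or_calls", "token_waste", "cost_spike"]},
]
diagnosis = primary_diagnosis(causes)
empty_diagnosis = primary_diagnosis([])
```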
Causal chain explanations by failure type
Each detector in openclaw.py hard-codes a causal chain that reflects the sequence of observations that led to the failure:
| Failure type | Causal chain |
|---|---|
| infinite_tool_loop | tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged |
| memory_degradation | memory_stored -> recall_failed_or_ignored -> state_reconstruction_failed |
| ignoring_tool_outputs | tool_call -> tool_output -> decision_skipped_output -> unsupported_agent_step |
| context_pollution | context_growth -> saturation_or_compaction -> key_state_risk |
| cost_explosion | repeated_reasoning_or_calls -> token_waste -> cost_spike |
| skill_failure | skill_available -> skill_not_selected_or_failed -> generic_execution |
Evidence Section
The evidence_summary block in the diagnosis is a compact snapshot of every event type observed in the session. It is computed by _evidence_summary() from the normalized event list.
Evidence summary fields
| Field | Source |
|---|---|
| event_count | Total events in the session |
| event_counts | Per-type breakdown of every observed event type |
| tool_calls | Count of tool_call events specifically |
| tool_outputs | Count of tool_output events specifically |
| memory_events | Count of memory_event events |
| retries | Count of retry_event events |
| errors | Count of error_event events |
| state_transitions | Count of state_transition events |
These counts are derived from the session.json file stored at runs/<run_id>/session.json.
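The evidence summary amounts to counting events by type. A minimal sketch, assuming each normalized event is a dictionary with a type key (the real event structure is not shown in this section); the output field names follow the table above.

```python
from collections import Counter

def evidence_summary(events):
    """Count events overall and per type, using the field names from the table."""
    counts = Counter(event["type"] for event in events)
    return {
        "event_count": len(events),
        "event_counts": dict(counts),
        # Counter returns 0 for event types that never occurred.
        "tool_calls": counts["tool_call"],
        "tool_outputs": counts["tool_output"],
        "memory_events": counts["memory_event"],
        "retries": counts["retry_event"],
        "errors": counts["error_event"],
        "state_transitions": counts["state_transition"],
    }

events = [
    {"type": "tool_call"},
    {"type": "tool_output"},
    {"type": "tool_call"},
    {"type": "retry_event"},
]
summary = evidence_summary(events)
```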
Recommendations
Each failure cause has an associated description that explains what the detector observed. The dashboard surfaces the primary failure’s causal_chain_explanation as a step-by-step explanation of why the run scored the way it did, and pairs it with a remediation direction derived from the failure type.
| Failure type | Remediation direction |
|---|---|
| infinite_tool_loop | Review loop termination conditions; add retry caps to prevent the agent from re-issuing identical tool calls without progress |
| memory_degradation | Check memory storage and recall logic; ensure that stored context is retrievable and actively used in downstream decisions |
| ignoring_tool_outputs | Ensure tool results are incorporated into agent decisions; outputs that are received but never referenced indicate a disconnect between the tool layer and the reasoning layer |
| context_pollution | Reduce context window usage; avoid unnecessary compaction events that discard key state silently |
| cost_explosion | Review tool call efficiency; add token budget guards to prevent runaway token accumulation across multi-turn sessions |
| skill_failure | Verify skill selection logic and skill invocation flow; a mismatch or ignored skill means the agent is falling back to generic execution when a specialized path is available |
Agent Health / Run History
Every completed run produces a trust score that is persisted alongside the diagnosis in the runs/ directory. The dashboard Agent Health view plots these scores over time, making it possible to see whether an agent is improving, stable, or degrading across successive runs.
Run files follow the naming convention runs/run_001.json, runs/run_002.json, and so on, with IDs assigned sequentially by next_run_id() in session.py. The CLI command critiqor runs reads this directory and prints a summary table showing run ID, trust score, readiness level, primary failure type, and tool call count for each completed run.
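Sequential ID assignment of this kind can be sketched as "find the highest existing number and add one". The function below is a hypothetical stand-in, not the next_run_id() in session.py; the three-digit zero padding is inferred from the run_001.json naming convention, and the demonstration writes only to a temporary directory.

```python
import re
import tempfile
from pathlib import Path

RUN_FILE = re.compile(r"run_(\d+)\.json$")

def sketch_next_run_id(runs_dir):
    """Return the next sequential ID, e.g. run_003 when run_002.json is the latest."""
    numbers = []
    for path in Path(runs_dir).glob("run_*.json"):
        match = RUN_FILE.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    return f"run_{next_number:03d}"

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first_id = sketch_next_run_id(tmp)
    (Path(tmp) / "run_001.json").write_text("{}")
    (Path(tmp) / "run_002.json").write_text("{}")
    third_id = sketch_next_run_id(tmp)
```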
Root Cause Analysis
The causal graph is built by build_openclaw_causal_graph() and stored in the causal_graph field of the diagnosis. It is a node-edge graph with three edge types:
| Edge type | Meaning |
|---|---|
| precedes | Temporal ordering: every consecutive pair of events in the timeline is connected with a precedes edge |
| causes | Evidence-to-failure: each piece of evidence that contributed to a detected failure has a causes edge pointing to the failure node |
| reinforces | Repeated evidence: when multiple evidence items within a single failure cause share the same failure node, consecutive items are connected with reinforces edges to show that the signal accumulated over time |
Node types
Every event in the timeline becomes an event node. Each detected failure becomes a failure node with an ID of the form failure_<type> (e.g., failure_infinite_tool_loop). Failure nodes are the terminal points in the causal graph: all causes edges terminate at them.
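The three edge types and two node types can be illustrated with a small graph builder. This is a sketch under assumed data shapes (events with an id key, failures with failure_type and evidence_ids keys, edges as source/type/target tuples); it is not the implementation of build_openclaw_causal_graph().

```python
def build_graph(timeline, failures):
    """Build event and failure nodes plus precedes, causes, and reinforces edges."""
    nodes = [{"id": event["id"], "kind": "event"} for event in timeline]
    edges = []

    # precedes: every consecutive pair of events in the timeline.
    for previous, current in zip(timeline, timeline[1:]):
        edges.append((previous["id"], "precedes", current["id"]))

    for failure in failures:
        failure_id = f"failure_{failure['failure_type']}"
        nodes.append({"id": failure_id, "kind": "failure"})
        evidence = failure["evidence_ids"]
        # causes: each evidence item points at the failure node.
        for evidence_id in evidence:
            edges.append((evidence_id, "causes", failure_id))
        # reinforces: consecutive evidence items within the same failure cause.
        for earlier, later in zip(evidence, evidence[1:]):
            edges.append((earlier, "reinforces", later))

    return {"nodes": nodes, "edges": edges}

timeline = [{"id": "e1"}, {"id": "e2"}, {"id": "e3"}]
failures = [{"failure_type": "infinite_tool_loop", "evidence_ids": ["e1", "e3"]}]
graph = build_graph(timeline, failures)
edge_types = [edge_type for _, edge_type, _ in graph["edges"]]
```

In this example the graph has two precedes edges, two causes edges ending at failure_infinite_tool_loop, and one reinforces edge from e1 to e3.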
Reading the causal chain
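The primary_diagnosis.causal_chain_explanation field is the most human-readable entry point into the causal graph. It is the " -> "-joined sequence of the detector’s causal_chain list, for example tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged for infinite_tool_loop.
To trace the chain back to concrete events, follow the causes and reinforces edges that link them to the failure node.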