Dashboard Overview: Understanding Your Diagnosis Results

The Critiqor dashboard is a local-only web interface that renders the contents of `diagnosis.json` for a given run. It starts automatically when you run `critiqor finalize` and can be reopened at any time with `critiqor dashboard <run_id>`. Nothing is uploaded to a remote server: all data stays on your machine, and the dashboard reads only the pre-computed artifacts Critiqor has already written to `runs/<run_id>/`.

Dashboard Sections

The dashboard is organized for progressive disclosure. New users see the simplest answer first, while engineers can drill into evidence and causal detail when needed.

Section	Purpose	What You’ll See
Overview	Executive-friendly summary	Trust score (0–100), readiness level, primary failure type, recommended action, and latest run status
Diagnosis	Root cause explanation	Primary failure mode, causal chain, severity, impact score, and recommended fix
Cost	Operational waste	Total tokens, token waste, duplicate tool calls, redundancy score, and cost efficiency
Evidence	Technical audit trail	Tool calls (name, arguments, call ID), tool outputs (result, error, duration), memory events, retries, errors, state transitions, and full raw trace
Why It Happened	Causal explanation	Pre-computed causal graph, step-by-step causal chain, and root cause walkthrough
Benchmarks	Reliability trend	Benchmark score, difficulty tier, percentile, and reliability trend across runs
Trust & Privacy	Transparency	Evidence collection model, data access boundaries, visibility controls, and FAQ

Readiness Levels

Every run receives one of three readiness levels. The level is shown prominently in the Overview section and drives the recommended action.

`ready_for_runtime`

Trust score of 80 or higher and no high-severity or critical failures. The agent completed the session without triggering any high-impact failure detectors. Critiqor did not observe tool loops, memory degradation, ignored outputs, context pollution, cost explosion, or skill failures at a severity that warrants review. The agent is cleared for use in production or next-stage evaluation.

`review_recommended`

Trust score 60–79, or any high-severity failure detected. The agent functioned but showed one or more patterns that reduce reliability under pressure. Common triggers include a small number of ignored tool outputs, memory recall failures, or context events approaching saturation. The agent should be reviewed and the detected failures addressed before deploying to production. The Diagnosis section shows specific guidance for each detected failure.

`unsafe_for_production`

Trust score below 60, or any critical-severity failure detected. The agent showed serious reliability problems during execution. Examples include a tool call loop with 5 or more repetitions, 30,000 or more tokens consumed in a single session, or a combination of failures that together reduced the trust score below 60. Deployment is not recommended. Review the Diagnosis and Why It Happened sections for a detailed breakdown of what failed and why.
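
The three level definitions above can be read as a single decision rule. The following is a minimal sketch of that rule, not Critiqor's actual implementation: the function name `readiness_level` and the severity labels `"high"` and `"critical"` (and any other label strings) are assumptions made for illustration.

```python
from typing import Iterable


def readiness_level(trust_score: float, severities: Iterable[str]) -> str:
    """Map a trust score and detected failure severities to a readiness level.

    Hypothetical helper: the severity strings are assumed labels,
    not a documented Critiqor schema.
    """
    found = {s.lower() for s in severities}

    # A score below 60 or any critical failure is unsafe, whatever else is true.
    if trust_score < 60 or "critical" in found:
        return "unsafe_for_production"

    # A score in the 60-79 band or any high-severity failure calls for review.
    if trust_score < 80 or "high" in found:
        return "review_recommended"

    # Score of 80 or higher with no high-severity or critical failures.
    return "ready_for_runtime"
```

Note that under this reading the severity checks take precedence over the score: a run scoring 85 with one high-severity failure still lands in `review_recommended`.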

Data Source

The dashboard reads exclusively from pre-computed artifacts:

runs/<run_id>/diagnosis.json — trust score, readiness level, failure causes, causal graph, cost analysis, and failure analysis
runs/<run_id>/session.json — raw event stream used in the Evidence section

The dashboard does not compute trust scores, re-run detectors, or use an LLM. Everything you see was determined by the diagnosis engine at finalization time and written to disk. This means the dashboard is fast, fully offline, and reproducible — the same run will show the same results every time.
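
Because the artifacts are plain JSON files on disk, they can also be inspected outside the dashboard. The sketch below assumes only the two file paths listed above; the helper name `load_run_artifacts` and the `runs_dir` parameter are hypothetical, and the code makes no assumption about the keys inside either file.

```python
import json
from pathlib import Path


def load_run_artifacts(run_id: str, runs_dir: str = "runs") -> dict:
    """Load diagnosis.json and session.json for one run, if present.

    Hypothetical helper for illustration; it is not part of the Critiqor CLI.
    Returns a dict keyed by file name, with None for a missing file.
    """
    run_path = Path(runs_dir) / run_id
    artifacts = {}
    for name in ("diagnosis.json", "session.json"):
        file_path = run_path / name
        if file_path.exists():
            with file_path.open(encoding="utf-8") as fh:
                artifacts[name] = json.load(fh)
        else:
            artifacts[name] = None
    return artifacts
```

Loading the same run twice returns the same data, which mirrors the reproducibility property described above: nothing is recomputed at read time.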

Navigating Multiple Runs

The Benchmarks section lists completed runs in order and shows how each run’s score compares to prior results. Each entry shows the `run_id`, trust score, benchmark percentile, and difficulty tier. Use `critiqor runs` to list all completed evaluations from the CLI, then open any run with `critiqor dashboard <run_id>` to view its full diagnosis. For a step-by-step guide to using the dashboard after a session, see the Dashboard Guide.
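
For a quick comparison across runs without opening the dashboard, the per-run artifacts can be scanned directly. This is a rough sketch under stated assumptions: it treats every directory under `runs/` that contains a `diagnosis.json` as a completed run, and the key name `trust_score` is a guess at the schema, exposed as a parameter so it can be changed. The function name `list_trust_scores` is hypothetical.

```python
import json
from pathlib import Path


def list_trust_scores(runs_dir: str = "runs", score_key: str = "trust_score"):
    """Return (run_id, score) pairs for every run that has a diagnosis.json.

    Assumption: diagnosis.json is a JSON object and the trust score lives
    under `score_key`. Runs without that key get a score of None.
    """
    rows = []
    # Sort by path so run IDs come back in a stable, lexicographic order.
    for diagnosis_file in sorted(Path(runs_dir).glob("*/diagnosis.json")):
        with diagnosis_file.open(encoding="utf-8") as fh:
            data = json.load(fh)
        rows.append((diagnosis_file.parent.name, data.get(score_key)))
    return rows
```

The ordering here is by run ID string, which may differ from the order the Benchmarks section uses.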
