The Critiqor class is the primary entry point for adding runtime reliability evaluation to any AI agent. It wraps your existing agent without modifying it, intercepts each execution, and returns a CritiqorResult with a 0–100 trust score, dimension-level critique, detected failure causes, and a deployment gate recommendation. When execution traces or tool call evidence are available, Critiqor uses them first; it falls back to response-only scoring when no instrumentation is present.
Import
Constructor
| Parameter | Type | Description |
|---|---|---|
| agent | Any | Any object with `run()`, `invoke()`, `generate()`, or `__call__()`. Critiqor auto-detects which method to call. |
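
To make the auto-detection concrete, here is a minimal sketch of how a wrapper could pick an entry point and forward arguments to it. It is not Critiqor's implementation: the helper name `resolve_agent_method` and the lookup order (run, invoke, generate, then `__call__`) are assumptions, since the documentation lists the supported names but not their priority.

```python
from typing import Any, Callable

# Assumed lookup order; the documentation names these methods but not their priority.
CANDIDATE_METHODS = ("run", "invoke", "generate")


def resolve_agent_method(agent: Any) -> Callable[..., Any]:
    """Return the first supported entry point found on the agent (hypothetical helper)."""
    for name in CANDIDATE_METHODS:
        method = getattr(agent, name, None)
        if callable(method):
            return method
    if callable(agent):  # covers __call__, including plain functions
        return agent
    raise TypeError("agent must expose run(), invoke(), generate(), or __call__()")


class EchoAgent:
    """Toy agent that exposes only invoke()."""

    def invoke(self, prompt: str, *, suffix: str = "") -> str:
        return f"echo: {prompt}{suffix}"


method = resolve_agent_method(EchoAgent())
# Extra keyword arguments pass straight through to the agent method.
print(method("hello", suffix="!"))
```

Resolving the method once and then calling it with the prompt plus any extra arguments keeps the wrapped agent untouched, which matches the wrapping behavior described for the constructor.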
Critiqor.run()
Runs the wrapped agent and returns a CritiqorResult. Extra positional and keyword arguments are forwarded directly to the underlying agent method.
When no instrumentation context is active, the evidence level is recorded as response_only and evaluation confidence is lower. Wrap execution in a monitor() context to upgrade to trace_available or fully_instrumented.
| Parameter | Type | Description |
|---|---|---|
| prompt | str | The user prompt to pass to the wrapped agent. |
| *args | Any | Forwarded to the agent method. |
| **kwargs | Any | Forwarded to the agent method. |
Returns: CritiqorResult
Critiqor.evaluate()
Only prompt and response are required; all other parameters are optional. Critiqor infers the appropriate evidence_level automatically if one is not provided:
- response_only: only prompt and response supplied
- trace_available: tool calls or outputs are present
- fully_instrumented: trace events or latency metrics are present
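
The inference rules above can be sketched as a small function. This is an illustration under stated assumptions, not the library's code: `infer_evidence_level` is a hypothetical helper, the richest level is assumed to win when several kinds of evidence are supplied, and any non-empty metrics object is treated as "latency metrics present".

```python
from typing import Any, Optional, Sequence


def infer_evidence_level(
    tool_calls: Optional[Sequence[Any]] = None,
    tool_outputs: Optional[Sequence[Any]] = None,
    trace: Optional[Sequence[dict]] = None,
    metrics: Optional[dict] = None,
) -> str:
    """Pick the richest evidence level the supplied inputs support (hypothetical helper)."""
    # Assumption: trace events or metrics outrank tool calls and outputs.
    if trace or metrics:
        return "fully_instrumented"
    if tool_calls or tool_outputs:
        return "trace_available"
    # Nothing beyond prompt and response was supplied.
    return "response_only"


print(infer_evidence_level())
print(infer_evidence_level(tool_calls=[{"name": "search"}]))
print(infer_evidence_level(trace=[{"event": "tool_start"}]))
```

Passing evidence_level explicitly would bypass this kind of inference entirely.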
| Parameter | Type | Description |
|---|---|---|
| prompt | str | The original user prompt. |
| response | str | The agent’s response string. |
| tool_calls | list[ToolCall \| dict] \| None | Tool calls observed during the run. |
| tool_outputs | list[ToolOutput \| dict] \| None | Tool outputs observed during the run. |
| trace | list[dict] \| None | Full event trace from a framework adapter or EvidenceRecorder. |
| metrics | RuntimeMetrics \| dict \| None | Runtime metrics (latency, token usage, retries, errors). |
| evidence_level | EvidenceLevel \| None | Override the inferred evidence level. |
Returns: CritiqorResult
CritiqorResult Fields
CritiqorResult is a frozen dataclass returned by both run() and evaluate().
| Field | Type | Description |
|---|---|---|
| answer | str | The agent’s response. |
| confidence | int | Overall trust score from 0–100. Also accessible as trust_score in to_dict(). |
| trust_level | "High" \| "Moderate" \| "Low" | Label derived from confidence: High (>=75), Moderate (50–74), Low (less than 50). |
| critique | ReliabilityCritique | Per-dimension reliability scores and findings. |
| evidence | EvaluationEvidence | The evidence object used during evaluation. |
| failure_causes | list[FailureCause] | Detected failure causes with severity, impact, and recommendations. |
| evaluation_confidence | int | Critiqor’s self-confidence in its own assessment (0–100). Higher when more evidence is available. |
| deployment_recommendation | DeploymentRecommendation | "safe_to_deploy", "review_recommended", or "unsafe_for_production". |
| benchmark_percentile | int \| None | Percentile rank among historical runs, if benchmark data is available. |
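
The trust_level thresholds in the table can be expressed as a short mapping. The function below is a hypothetical sketch of that rule, not the library's code, and the range check is an added assumption.

```python
def trust_level(confidence: int) -> str:
    """Map a 0-100 confidence score to the documented label (hypothetical helper)."""
    if not 0 <= confidence <= 100:
        # Assumption: out-of-range scores are rejected rather than clamped.
        raise ValueError("confidence must be between 0 and 100")
    if confidence >= 75:
        return "High"
    if confidence >= 50:
        return "Moderate"
    return "Low"


for score in (49, 50, 74, 75):
    print(score, trust_level(score))
```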
CritiqorResult.to_dict()
Returns a JSON-serializable dict representation of the result. Includes all fields plus a trust_score alias for confidence and an evidence_level shortcut from evidence.evidence_level.
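
As a rough picture of the shape this produces, the sketch below builds a dict from a frozen dataclass and adds the two convenience keys. `ResultSketch` and `EvidenceSketch` are simplified, hypothetical stand-ins that carry only three of the documented fields; the real CritiqorResult is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceSketch:
    """Hypothetical stand-in for the evidence object."""
    evidence_level: str


@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical stand-in for CritiqorResult with a subset of fields."""
    answer: str
    confidence: int
    evidence: EvidenceSketch

    def to_dict(self) -> dict:
        data = asdict(self)  # recursively converts nested dataclasses to dicts
        data["trust_score"] = self.confidence  # alias for confidence
        data["evidence_level"] = self.evidence.evidence_level  # top-level shortcut
        return data


result = ResultSketch("Paris", 82, EvidenceSketch("response_only"))
print(json.dumps(result.to_dict(), indent=2))
```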
CritiqorResult.to_record()
Returns an EvaluationRecord suitable for persistence with save_evaluation(). A UUID is generated for run_id if none is provided.
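
The run_id defaulting can be sketched as follows. `RecordSketch` and the free function `to_record` are hypothetical stand-ins for illustration; the use of a version 4 UUID rendered as a string is also an assumption, since the documentation only says that a UUID is generated.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecordSketch:
    """Hypothetical stand-in for EvaluationRecord."""
    run_id: str
    answer: str
    confidence: int


def to_record(answer: str, confidence: int, run_id: Optional[str] = None) -> RecordSketch:
    """Build a persistable record, generating a run_id only when none is given."""
    # Assumption: a random (version 4) UUID string is used as the default identifier.
    return RecordSketch(run_id=run_id or str(uuid.uuid4()), answer=answer, confidence=confidence)


print(to_record("Paris", 82, run_id="nightly-001").run_id)
print(to_record("Paris", 82).run_id)  # a freshly generated UUID string
```

Supplying your own run_id is useful when the evaluation should be correlated with an identifier from an existing logging or tracing system.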
Example
ReliabilityCritique Fields
ReliabilityCritique is a frozen dataclass nested inside every CritiqorResult under .critique. Each dimension is scored 0–100, where higher is better.
| Field | Type | Description |
|---|---|---|
| hallucination | int | Reliability score for unsupported or fabricated claims. Higher means fewer hallucinations. |
| reasoning | int | Score for coherent, well-structured task reasoning. |
| tool_reliability | int | Score for observed tool selection and usage quality. Also accessible as tool_use for backward compatibility. |
| consistency | int | Score for internal consistency. Fewer contradictions yield a higher score. |
| task_completion | int | Score for how fully the agent satisfied the user’s request. |
| confidence_calibration | int | Score for whether the agent’s expressed confidence matches the evidence. |
| execution_efficiency | int | Score for avoiding redundant calls, loops, and wasted steps. |
| evidence_level | EvidenceLevel | The quality of evidence used: "response_only", "trace_available", or "fully_instrumented". |
| summary | str | One-sentence reliability summary of the evaluated run. |
| findings | list[str] | Bullet-point findings from the evaluation. |
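
Two properties from the table, immutability and the tool_use alias, can be illustrated with a frozen dataclass and a read-only property. `CritiqueSketch` is a hypothetical stand-in with a single field; how the real class implements the alias is not stated in the documentation.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class CritiqueSketch:
    """Hypothetical stand-in for ReliabilityCritique with one dimension."""
    tool_reliability: int

    @property
    def tool_use(self) -> int:
        # Backward-compatible alias; always mirrors tool_reliability.
        return self.tool_reliability


critique = CritiqueSketch(tool_reliability=91)
print(critique.tool_use)

try:
    critique.tool_reliability = 10  # frozen dataclasses reject reassignment
except FrozenInstanceError:
    print("frozen: fields cannot be reassigned")
```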