Critiqor: Evidence-First Reliability Wrapper for AI Agents

The `Critiqor` class is the primary entry point for adding runtime reliability evaluation to any AI agent. It wraps your existing agent without modifying it, intercepts each execution, and returns a `CritiqorResult` with a 0–100 trust score, dimension-level critique, detected failure causes, and a deployment gate recommendation. When execution traces or tool call evidence are available, Critiqor uses them first; it falls back to response-only scoring when no instrumentation is present.

Import

```python
from critiqor import Critiqor
```

Constructor

```python
Critiqor(agent)
```

| Parameter | Type | Description |
| --- | --- | --- |
| `agent` | `Any` | Any object with `run()`, `invoke()`, `generate()`, or `__call__()`. Critiqor auto-detects which method to call. |
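
The auto-detection described above can be pictured as a lookup over candidate method names. The following is a minimal sketch, not Critiqor's actual implementation: the helper `resolve_entry_point`, the probing order, and the `EchoAgent` class are all assumptions made for illustration.

```python
from typing import Any, Callable

# Assumed probing order; the table lists these names but does not state a priority.
CANDIDATE_METHODS = ("run", "invoke", "generate")


def resolve_entry_point(agent: Any) -> Callable[..., Any]:
    """Return the first callable entry point found on the agent (hypothetical helper)."""
    for name in CANDIDATE_METHODS:
        method = getattr(agent, name, None)
        if callable(method):
            return method
    # Fall back to the object itself when it defines __call__.
    if callable(agent):
        return agent
    raise TypeError(
        f"{type(agent).__name__} exposes no run/invoke/generate/__call__"
    )


class EchoAgent:
    """Stand-in agent that only implements invoke()."""

    def invoke(self, prompt: str, **kwargs: Any) -> str:
        return f"echo: {prompt}"


entry = resolve_entry_point(EchoAgent())
print(entry("hello"))  # echo: hello
```

An object that offers none of the four entry points raises `TypeError` in this sketch; how Critiqor itself reports that case is not specified here.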

`Critiqor.run()`

```python
Critiqor.run(prompt, *args, **kwargs) → CritiqorResult
```

Runs the wrapped agent with the given prompt, collects any available evidence, and returns a scored `CritiqorResult`. Extra positional and keyword arguments are forwarded directly to the underlying agent method. When no instrumentation context is active, the evidence level is recorded as `response_only` and evaluation confidence is lower. Wrap execution in a `monitor()` context to upgrade to `trace_available` or `fully_instrumented`.

| Parameter | Type | Description |
| --- | --- | --- |
| `prompt` | `str` | The user prompt to pass to the wrapped agent. |
| `*args` | `Any` | Forwarded to the agent method. |
| `**kwargs` | `Any` | Forwarded to the agent method. |

Returns: CritiqorResult

`Critiqor.evaluate()`

```python
Critiqor.evaluate(
    prompt,
    response,
    tool_calls=None,
    tool_outputs=None,
    trace=None,
    metrics=None,
    evidence_level=None,
) → CritiqorResult
```

Evaluates a pre-collected agent run directly. Use this method when you already have a response and want to supply additional evidence (tool call logs, trace events, or runtime metrics) without re-executing the agent. All parameters except `prompt` and `response` are optional. If `evidence_level` is not provided, Critiqor infers it automatically:

- `response_only`: only `prompt` and `response` are supplied
- `trace_available`: tool calls or outputs are present
- `fully_instrumented`: trace events or latency metrics are present
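
The inference rules above can be sketched as a small function. This is an illustration only: `infer_evidence_level` is a hypothetical helper, the precedence (richest evidence wins) is an assumption, and so is the convention that latency metrics are stored under keys containing `latency`.

```python
from typing import Any, Mapping, Optional, Sequence


def infer_evidence_level(
    tool_calls: Optional[Sequence[Any]] = None,
    tool_outputs: Optional[Sequence[Any]] = None,
    trace: Optional[Sequence[Mapping[str, Any]]] = None,
    metrics: Optional[Mapping[str, Any]] = None,
) -> str:
    """Pick an evidence level from whatever evidence was supplied (hypothetical)."""
    # Assumption: latency metrics live under keys that contain "latency".
    has_latency = bool(metrics) and any("latency" in key for key in metrics)
    # Assumption: the richest evidence present takes precedence.
    if trace or has_latency:
        return "fully_instrumented"
    if tool_calls or tool_outputs:
        return "trace_available"
    return "response_only"


print(infer_evidence_level())
print(infer_evidence_level(tool_calls=[{"name": "search"}]))
print(infer_evidence_level(metrics={"latency_ms": 120}))
```

In this sketch, metrics without a latency entry (for example only a retry count) do not raise the level; whether Critiqor treats them the same way is not stated in this reference.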

| Parameter | Type | Description |
| --- | --- | --- |
| `prompt` | `str` | The original user prompt. |
| `response` | `str` | The agent’s response string. |
| `tool_calls` | `list[ToolCall \| dict] \| None` | Tool calls observed during the run. |
| `tool_outputs` | `list[ToolOutput \| dict] \| None` | Tool outputs observed during the run. |
| `trace` | `list[dict] \| None` | Full event trace from a framework adapter or `EvidenceRecorder`. |
| `metrics` | `RuntimeMetrics \| dict \| None` | Runtime metrics (latency, token usage, retries, errors). |
| `evidence_level` | `EvidenceLevel \| None` | Override the inferred evidence level. |

Returns: CritiqorResult
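
As a rough illustration of scoring a stored run, the sketch below forwards a logged record to `evaluate()` using the parameter names documented above. The `log` layout, the dictionary keys inside `tool_calls` and `metrics`, and the `RecordingEvaluator` stand-in are assumptions; the stand-in only mirrors the signature so the example runs without Critiqor installed. In real use you would pass a `Critiqor` instance as `evaluator`.

```python
from typing import Any, Dict


def evaluate_logged_run(evaluator: Any, log: Dict[str, Any]) -> Any:
    """Score a stored run without re-executing the agent.

    `evaluator` is meant to be a Critiqor instance. The `log` layout is a
    hypothetical storage format chosen for this sketch.
    """
    return evaluator.evaluate(
        log["prompt"],
        log["response"],
        tool_calls=log.get("tool_calls"),
        tool_outputs=log.get("tool_outputs"),
        trace=log.get("trace"),
        metrics=log.get("metrics"),
    )


class RecordingEvaluator:
    """Stand-in mirroring the documented evaluate() signature (not Critiqor itself)."""

    def evaluate(self, prompt, response, tool_calls=None, tool_outputs=None,
                 trace=None, metrics=None, evidence_level=None):
        supplied = {
            "tool_calls": tool_calls,
            "tool_outputs": tool_outputs,
            "trace": trace,
            "metrics": metrics,
        }
        # Report which optional evidence arguments were actually passed.
        return sorted(name for name, value in supplied.items() if value is not None)


stored = {
    "prompt": "Summarize the ticket.",
    "response": "The ticket reports a login failure.",
    "tool_calls": [{"name": "fetch_ticket"}],  # hypothetical dict keys
    "metrics": {"latency_ms": 840},            # hypothetical dict keys
}
print(evaluate_logged_run(RecordingEvaluator(), stored))  # ['metrics', 'tool_calls']
```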

`CritiqorResult` Fields

`CritiqorResult` is a frozen dataclass returned by both `run()` and `evaluate()`.

| Field | Type | Description |
| --- | --- | --- |
| `answer` | `str` | The agent’s response. |
| `confidence` | `int` | Overall trust score from 0–100. Also accessible as `trust_score` in `to_dict()`. |
| `trust_level` | `"High" \| "Moderate" \| "Low"` | Label derived from `confidence`: High (>=75), Moderate (50–74), Low (less than 50). |
| `critique` | `ReliabilityCritique` | Per-dimension reliability scores and findings. |
| `evidence` | `EvaluationEvidence` | The evidence object used during evaluation. |
| `failure_causes` | `list[FailureCause]` | Detected failure causes with severity, impact, and recommendations. |
| `evaluation_confidence` | `int` | Critiqor’s self-confidence in its own assessment (0–100). Higher when more evidence is available. |
| `deployment_recommendation` | `DeploymentRecommendation` | `"safe_to_deploy"`, `"review_recommended"`, or `"unsafe_for_production"`. |
| `benchmark_percentile` | `int \| None` | Percentile rank among historical runs, if benchmark data is available. |
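
The `trust_level` thresholds in the table can be written out as a simple mapping. The function name `trust_level_for` and the range check are assumptions added for illustration; only the thresholds come from the table.

```python
def trust_level_for(confidence: int) -> str:
    """Map a 0-100 trust score to its label using the documented thresholds."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence >= 75:
        return "High"
    if confidence >= 50:
        return "Moderate"
    return "Low"


for score in (92, 74, 49):
    print(score, trust_level_for(score))
```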

`CritiqorResult.to_dict()`

Returns a JSON-serializable dict representation of the result. Includes all fields plus a `trust_score` alias for `confidence` and an `evidence_level` shortcut from `evidence.evidence_level`.
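
To make the alias and shortcut concrete, here is a minimal sketch built on stand-in dataclasses. `ResultSketch` and `EvidenceSketch` are hypothetical and carry only a few fields; the real `CritiqorResult` has more, and its `to_dict()` may be implemented differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class EvidenceSketch:
    """Stand-in for the evidence object (hypothetical)."""
    evidence_level: str


@dataclass(frozen=True)
class ResultSketch:
    """Stand-in for CritiqorResult with a reduced field set (hypothetical)."""
    answer: str
    confidence: int
    evidence: EvidenceSketch

    def to_dict(self) -> Dict[str, Any]:
        data = asdict(self)  # recursively converts nested dataclasses to dicts
        data["trust_score"] = self.confidence  # alias for confidence
        data["evidence_level"] = self.evidence.evidence_level  # top-level shortcut
        return data


payload = ResultSketch("42", 81, EvidenceSketch("response_only")).to_dict()
print(json.dumps(payload, sort_keys=True))
```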

`CritiqorResult.to_record()`

```python
result.to_record(agent_id="default", run_id=None) → EvaluationRecord
```

Converts the result into an `EvaluationRecord` suitable for persistence with `save_evaluation()`. A UUID is generated for `run_id` if none is provided.
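
The `run_id` default can be pictured as follows. This is a sketch under assumptions: `resolve_run_id` is a hypothetical helper, and the use of a random (version 4) UUID is a guess, since the reference only says that a UUID is generated.

```python
import uuid
from typing import Optional


def resolve_run_id(run_id: Optional[str] = None) -> str:
    """Keep a caller-supplied run_id, otherwise generate one (hypothetical helper)."""
    # Assumption: a random version-4 UUID rendered as a string.
    return run_id if run_id is not None else str(uuid.uuid4())


print(resolve_run_id("nightly-eval-001"))  # nightly-eval-001
print(len(resolve_run_id()))               # 36
```

Passing your own `run_id` is useful when the record needs to be correlated with an identifier that already exists in your logs.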

Example

```python
from critiqor import Critiqor

# Wrap any existing agent. TheirExistingAgent is a placeholder for your own
# agent class and must be defined or imported before this runs.
base_agent = TheirExistingAgent(model="llama3.2")
verified_agent = Critiqor(base_agent)

result = verified_agent.run("What does Critiqor do?")

print(result.answer)                    # agent's response
print(result.confidence)                # 0-100 trust score
print(result.trust_level)               # "High", "Moderate", or "Low"
print(result.deployment_recommendation) # "safe_to_deploy" etc.
print(result.failure_causes)            # list of detected failures
```
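
One way to act on `deployment_recommendation` is a small release gate. The sketch below is an assumption-laden illustration: `release_decision` and its `ship`/`hold`/`block` outcomes are hypothetical policy names, and the `getattr(..., "value", ...)` step is there because this reference does not say whether `DeploymentRecommendation` is a plain string or an enum. A `SimpleNamespace` stands in for a real result so the block runs on its own.

```python
from types import SimpleNamespace
from typing import Any


def release_decision(result: Any) -> str:
    """Map a result to 'ship', 'hold', or 'block' (hypothetical policy names)."""
    rec = result.deployment_recommendation
    rec = getattr(rec, "value", rec)  # tolerate either a plain string or an Enum member
    if rec == "safe_to_deploy":
        return "ship"
    if rec == "review_recommended":
        return "hold"
    # Covers "unsafe_for_production" and any unrecognized value.
    return "block"


fake_result = SimpleNamespace(deployment_recommendation="review_recommended")
print(release_decision(fake_result))  # hold
```

Treating unrecognized values as blocking is a deliberately conservative choice in this sketch.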

`ReliabilityCritique` Fields

`ReliabilityCritique` is a frozen dataclass nested inside every `CritiqorResult` under `.critique`. Each dimension is scored 0–100, where higher is better.

| Field | Type | Description |
| --- | --- | --- |
| `hallucination` | `int` | Reliability score for unsupported or fabricated claims. Higher means fewer hallucinations. |
| `reasoning` | `int` | Score for coherent, well-structured task reasoning. |
| `tool_reliability` | `int` | Score for observed tool selection and usage quality. Also accessible as `tool_use` for backward compatibility. |
| `consistency` | `int` | Score for internal consistency; fewer contradictions score higher. |
| `task_completion` | `int` | Score for how fully the agent satisfied the user’s request. |
| `confidence_calibration` | `int` | Score for whether the agent’s expressed confidence matches the evidence. |
| `execution_efficiency` | `int` | Score for avoiding redundant calls, loops, and wasted steps. |
| `evidence_level` | `EvidenceLevel` | The quality of evidence used: `"response_only"`, `"trace_available"`, or `"fully_instrumented"`. |
| `summary` | `str` | One-sentence reliability summary of the evaluated run. |
| `findings` | `list[str]` | Bullet-point findings from the evaluation. |
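
Because every dimension uses the same 0–100 scale, the lowest scores are a natural place to start when triaging a run. The helper below is a hypothetical sketch: `weakest_dimensions` is not part of Critiqor, and a `SimpleNamespace` with made-up scores stands in for a real `ReliabilityCritique`.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple

# The seven scored dimensions listed in the table above.
DIMENSIONS = (
    "hallucination",
    "reasoning",
    "tool_reliability",
    "consistency",
    "task_completion",
    "confidence_calibration",
    "execution_efficiency",
)


def weakest_dimensions(critique: Any, limit: int = 3) -> List[Tuple[str, int]]:
    """Return the lowest-scoring dimensions, weakest first (hypothetical helper)."""
    scores = [(name, getattr(critique, name)) for name in DIMENSIONS]
    return sorted(scores, key=lambda item: item[1])[:limit]


# Illustrative scores only; not output from a real evaluation.
sample = SimpleNamespace(
    hallucination=90,
    reasoning=72,
    tool_reliability=40,
    consistency=85,
    task_completion=66,
    confidence_calibration=58,
    execution_efficiency=95,
)
print(weakest_dimensions(sample, limit=2))
# [('tool_reliability', 40), ('confidence_calibration', 58)]
```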
