The Critiqor class is the primary entry point for adding runtime reliability evaluation to any AI agent. It wraps your existing agent without modifying it, intercepts each execution, and returns a CritiqorResult with a 0–100 trust score, dimension-level critique, detected failure causes, and a deployment gate recommendation. When execution traces or tool call evidence are available, Critiqor uses them first; it falls back to response-only scoring when no instrumentation is present.

Import

from critiqor import Critiqor

Constructor

Critiqor(agent)
Parameter  Type  Description
agent      Any   Any object with run(), invoke(), generate(), or __call__(). Critiqor auto-detects which method to call.
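
The auto-detection described above could work roughly as follows. This is a minimal sketch, not Critiqor's actual implementation: the helper name resolve_entry_point and the lookup order (run, invoke, generate, then __call__) are assumptions based on the order in which the methods are listed.

```python
from typing import Any, Callable

# Assumed lookup order, taken from the order the methods are listed above.
CANDIDATE_METHODS = ("run", "invoke", "generate")


def resolve_entry_point(agent: Any) -> Callable[..., Any]:
    """Return the first callable entry point found on the agent (hypothetical helper)."""
    for name in CANDIDATE_METHODS:
        method = getattr(agent, name, None)
        if callable(method):
            return method
    if callable(agent):  # covers objects defining __call__ and plain functions
        return agent
    raise TypeError("agent exposes none of run(), invoke(), generate(), or __call__()")


class InvokeOnlyAgent:
    """Toy agent that only implements invoke()."""

    def invoke(self, prompt: str) -> str:
        return f"invoked: {prompt}"


entry = resolve_entry_point(InvokeOnlyAgent())
reply = entry("hello")
```

Checking named methods before __call__ means an agent that defines both run() and __call__() is driven through run(); whether Critiqor makes the same choice is not stated here.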

Critiqor.run()

Critiqor.run(prompt, *args, **kwargs) → CritiqorResult
Runs the wrapped agent with the given prompt, collects any available evidence, and returns a scored CritiqorResult. Extra positional and keyword arguments are forwarded unchanged to the underlying agent method. When no instrumentation context is active, the evidence level is recorded as response_only and evaluation_confidence is lower. To upgrade the evidence level to trace_available or fully_instrumented, wrap execution in a monitor() context.
Parameter  Type  Description
prompt     str   The user prompt to pass to the wrapped agent.
*args      Any   Forwarded to the agent method.
**kwargs   Any   Forwarded to the agent method.
Returns: CritiqorResult
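
To illustrate the argument forwarding, here is a small sketch with a toy agent. ForwardingWrapper and EchoAgent are hypothetical stand-ins that perform no scoring; they only show that extra positional and keyword arguments reach the agent unchanged.

```python
from typing import Any


class EchoAgent:
    """Toy agent that reports every argument it receives."""

    def run(self, prompt: str, *args: Any, **kwargs: Any) -> str:
        return f"prompt={prompt!r} args={args!r} kwargs={kwargs!r}"


class ForwardingWrapper:
    """Hypothetical stand-in for a wrapper; it does no scoring."""

    def __init__(self, agent: Any) -> None:
        self._agent = agent

    def run(self, prompt: str, *args: Any, **kwargs: Any) -> str:
        # Pass the extras through untouched so the agent sees its usual call.
        return self._agent.run(prompt, *args, **kwargs)


wrapped = ForwardingWrapper(EchoAgent())
answer = wrapped.run("Summarize the report", "extra", temperature=0.2)
```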

Critiqor.evaluate()

Critiqor.evaluate(
    prompt,
    response,
    tool_calls=None,
    tool_outputs=None,
    trace=None,
    metrics=None,
    evidence_level=None,
) → CritiqorResult
Evaluates a pre-collected agent run directly. Use this method when you already have a response and want to supply additional evidence — tool call logs, trace events, or runtime metrics — without re-executing the agent. All parameters except prompt and response are optional. Critiqor infers the appropriate evidence_level automatically if one is not provided:
  • response_only — only prompt and response supplied
  • trace_available — tool calls or outputs are present
  • fully_instrumented — trace events or latency metrics are present
Parameter       Type                            Description
prompt          str                             The original user prompt.
response        str                             The agent’s response string.
tool_calls      list[ToolCall | dict] | None    Tool calls observed during the run.
tool_outputs    list[ToolOutput | dict] | None  Tool outputs observed during the run.
trace           list[dict] | None               Full event trace from a framework adapter or EvidenceRecorder.
metrics         RuntimeMetrics | dict | None    Runtime metrics (latency, token usage, retries, errors).
evidence_level  EvidenceLevel | None            Override the inferred evidence level.
Returns: CritiqorResult
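
The inference rules listed above can be sketched as a small function. This is an illustration under assumptions, not the library's code: the function name infer_evidence_level is hypothetical, the strongest matching level is assumed to win when several kinds of evidence are present, and any non-empty metrics value is treated as qualifying even though the text mentions latency metrics specifically.

```python
from typing import Any, Optional, Sequence


def infer_evidence_level(
    tool_calls: Optional[Sequence[Any]] = None,
    tool_outputs: Optional[Sequence[Any]] = None,
    trace: Optional[Sequence[Any]] = None,
    metrics: Optional[Any] = None,
    evidence_level: Optional[str] = None,
) -> str:
    """Pick an evidence level from the supplied evidence (hypothetical helper)."""
    if evidence_level is not None:
        return evidence_level  # an explicit override always wins
    if trace or metrics:
        return "fully_instrumented"  # assumption: strongest evidence takes precedence
    if tool_calls or tool_outputs:
        return "trace_available"
    return "response_only"


only_response = infer_evidence_level()
with_tools = infer_evidence_level(tool_calls=[{"name": "search"}])
with_trace = infer_evidence_level(tool_calls=[{"name": "search"}], trace=[{"event": "start"}])
overridden = infer_evidence_level(trace=[{"event": "start"}], evidence_level="response_only")
```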

CritiqorResult Fields

CritiqorResult is a frozen dataclass returned by both run() and evaluate().
Field                      Type                         Description
answer                     str                          The agent’s response.
confidence                 int                          Overall trust score from 0–100. Also accessible as trust_score in to_dict().
trust_level                "High" | "Moderate" | "Low"  Label derived from confidence: High (75–100), Moderate (50–74), Low (0–49).
critique                   ReliabilityCritique          Per-dimension reliability scores and findings.
evidence                   EvaluationEvidence           The evidence object used during evaluation.
failure_causes             list[FailureCause]           Detected failure causes with severity, impact, and recommendations.
evaluation_confidence      int                          Critiqor’s confidence in its own assessment (0–100). Higher when more evidence is available.
deployment_recommendation  DeploymentRecommendation     One of "safe_to_deploy", "review_recommended", or "unsafe_for_production".
benchmark_percentile       int | None                   Percentile rank among historical runs, if benchmark data is available.
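
The mapping from confidence to trust_level follows directly from the thresholds in the table. A minimal sketch, assuming a hypothetical helper named trust_level_for and assuming that out-of-range scores should be rejected:

```python
def trust_level_for(confidence: int) -> str:
    """Map a 0-100 trust score to its label using the documented thresholds."""
    if not 0 <= confidence <= 100:
        # Assumption: scores outside 0-100 are treated as invalid input.
        raise ValueError(f"confidence must be within 0-100, got {confidence}")
    if confidence >= 75:
        return "High"
    if confidence >= 50:
        return "Moderate"
    return "Low"


labels = [trust_level_for(score) for score in (0, 49, 50, 74, 75, 100)]
```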

CritiqorResult.to_dict()

Returns a JSON-serializable dict representation of the result. Includes all fields plus a trust_score alias for confidence and an evidence_level shortcut from evidence.evidence_level.
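
The alias and shortcut can be pictured with a reduced stand-in. MiniResult and MiniEvidence below are hypothetical dataclasses with only a few fields; they are not the library's classes, and the real to_dict() may be built differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MiniEvidence:
    evidence_level: str


@dataclass(frozen=True)
class MiniResult:
    answer: str
    confidence: int
    evidence: MiniEvidence

    def to_dict(self) -> dict:
        data = asdict(self)  # recursively converts nested dataclasses to dicts
        data["trust_score"] = self.confidence  # alias for confidence
        data["evidence_level"] = self.evidence.evidence_level  # top-level shortcut
        return data


payload = MiniResult("42", 81, MiniEvidence("response_only")).to_dict()
encoded = json.dumps(payload)  # succeeds because every value is a plain JSON type
```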

CritiqorResult.to_record()

result.to_record(agent_id="default", run_id=None) → EvaluationRecord
Converts the result into an EvaluationRecord suitable for persistence with save_evaluation(). A UUID is generated for run_id if none is provided.
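
The run_id default can be sketched as follows. MiniRecord and this free-standing to_record function are hypothetical simplifications, and the use of a random (version 4) UUID is an assumption; the text above only says that a UUID is generated.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniRecord:
    agent_id: str
    run_id: str
    confidence: int


def to_record(confidence: int, agent_id: str = "default", run_id: Optional[str] = None) -> MiniRecord:
    """Build a record, generating a run_id when the caller omits one."""
    if run_id is None:
        run_id = str(uuid.uuid4())  # assumption: random (version 4) UUID
    return MiniRecord(agent_id=agent_id, run_id=run_id, confidence=confidence)


auto = to_record(81)
explicit = to_record(81, agent_id="support-bot", run_id="run-001")
```

Passing an explicit run_id is useful when the same run is already tracked under an identifier elsewhere, such as a request ID in your logs.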

Example

from critiqor import Critiqor

# Wrap any existing agent (TheirExistingAgent is a placeholder for your own agent class)
base_agent = TheirExistingAgent(model="llama3.2")
verified_agent = Critiqor(base_agent)

result = verified_agent.run("What does Critiqor do?")

print(result.answer)                    # agent's response
print(result.confidence)                # 0-100 trust score
print(result.trust_level)               # "High", "Moderate", or "Low"
print(result.deployment_recommendation) # "safe_to_deploy" etc.
print(result.failure_causes)            # list of detected failures
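
One way to act on the deployment recommendation printed above is a small release gate. This is a sketch of one possible policy, not part of Critiqor: should_block_release and its block_on_review flag are hypothetical, and the recommendation is assumed to be available as one of the three documented strings.

```python
KNOWN_RECOMMENDATIONS = ("safe_to_deploy", "review_recommended", "unsafe_for_production")


def should_block_release(recommendation: str, block_on_review: bool = False) -> bool:
    """Decide whether a release should stop for this recommendation (hypothetical policy)."""
    if recommendation not in KNOWN_RECOMMENDATIONS:
        raise ValueError(f"unknown recommendation: {recommendation!r}")
    if recommendation == "unsafe_for_production":
        return True
    if recommendation == "review_recommended":
        return block_on_review  # stricter pipelines can opt in to blocking here
    return False


lenient = should_block_release("review_recommended")
strict = should_block_release("review_recommended", block_on_review=True)
```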

ReliabilityCritique Fields

ReliabilityCritique is a frozen dataclass nested inside every CritiqorResult under .critique. Each dimension is scored 0–100, where higher is better.
Field                   Type           Description
hallucination           int            Reliability score for unsupported or fabricated claims. Higher means fewer hallucinations.
reasoning               int            Score for coherent, well-structured task reasoning.
tool_reliability        int            Score for observed tool selection and usage quality. Also accessible as tool_use for backward compatibility.
consistency             int            Score for internal consistency. Fewer contradictions yield a higher score.
task_completion         int            Score for how fully the agent satisfied the user’s request.
confidence_calibration  int            Score for whether the agent’s expressed confidence matches the evidence.
execution_efficiency    int            Score for avoiding redundant calls, loops, and wasted steps.
evidence_level          EvidenceLevel  The quality of evidence used: "response_only", "trace_available", or "fully_instrumented".
summary                 str            One-sentence reliability summary of the evaluated run.
findings                list[str]      Bullet-point findings from the evaluation.
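
Because every dimension uses the same 0–100 scale, the scores can be compared directly to find where a run is weakest. The sketch below works on a plain dict keyed by the field names above; the helper weakest_dimensions, the threshold of 50, and the sample scores are illustrative assumptions, not library behavior or real results.

```python
from typing import Dict, List, Tuple

DIMENSIONS = (
    "hallucination",
    "reasoning",
    "tool_reliability",
    "consistency",
    "task_completion",
    "confidence_calibration",
    "execution_efficiency",
)


def weakest_dimensions(scores: Dict[str, int], threshold: int = 50) -> List[Tuple[str, int]]:
    """Return (dimension, score) pairs below the threshold, lowest score first."""
    low = [(name, scores[name]) for name in DIMENSIONS if scores[name] < threshold]
    return sorted(low, key=lambda pair: pair[1])


# Made-up scores for illustration only.
sample_scores = {
    "hallucination": 82,
    "reasoning": 70,
    "tool_reliability": 35,
    "consistency": 90,
    "task_completion": 48,
    "confidence_calibration": 66,
    "execution_efficiency": 77,
}
flagged = weakest_dimensions(sample_scores)
```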