# Critiqor: Evidence-First Reliability Wrapper for AI Agents

> Wraps any AI agent with evidence-first reliability evaluation, returning a 0-100 trust score, dimension critique, and deployment gate on each run.

The `Critiqor` class is the primary entry point for adding runtime reliability evaluation to any AI agent. It wraps your existing agent without modifying it, intercepts each execution, and returns a `CritiqorResult` with a 0–100 trust score, dimension-level critique, detected failure causes, and a deployment gate recommendation. When execution traces or tool call evidence are available, Critiqor uses them first; it falls back to response-only scoring when no instrumentation is present.

## Import

```python
from critiqor import Critiqor
```

## Constructor

```python
Critiqor(agent)
```

| Parameter | Type  | Description                                                                                                     |
| --------- | ----- | --------------------------------------------------------------------------------------------------------------- |
| `agent`   | `Any` | Any object with `run()`, `invoke()`, `generate()`, or `__call__()`. Critiqor auto-detects which method to call. |
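
The auto-detection described in the table can be pictured as a lookup over the supported method names. The sketch below is an illustration only, not Critiqor's implementation: `resolve_entry_point` and `EchoAgent` are hypothetical names, and the lookup order (the order in which the table lists the methods) is an assumption.

```python
from typing import Any, Callable


def resolve_entry_point(agent: Any) -> Callable[..., Any]:
    """Return the first supported entry point found on the agent.

    Assumption: methods are tried in the order run, invoke, generate,
    and the object itself is used last if it is callable.
    """
    for name in ("run", "invoke", "generate"):
        method = getattr(agent, name, None)
        if callable(method):
            return method
    if callable(agent):
        return agent
    raise TypeError(
        "agent must define run(), invoke(), generate(), or __call__()"
    )


class EchoAgent:
    """Hypothetical agent that only exposes invoke()."""

    def invoke(self, prompt: str) -> str:
        return f"echo: {prompt}"


entry_point = resolve_entry_point(EchoAgent())
reply = entry_point("hello")
```

Under these assumptions, an agent that exposes only `invoke()` and a plain callable both resolve to something that accepts a prompt.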

***

## `Critiqor.run()`

```python
Critiqor.run(prompt, *args, **kwargs) -> CritiqorResult
```

Runs the wrapped agent with the given prompt, collects any available evidence, and returns a scored `CritiqorResult`. Extra positional and keyword arguments are forwarded directly to the underlying agent method.

When no instrumentation context is active, the evidence level is recorded as `response_only` and evaluation confidence is lower. Wrap execution in a [`monitor()`](/reference/api/evidence-recorder) context to upgrade to `trace_available` or `fully_instrumented`.

| Parameter  | Type  | Description                                   |
| ---------- | ----- | --------------------------------------------- |
| `prompt`   | `str` | The user prompt to pass to the wrapped agent. |
| `*args`    | `Any` | Forwarded to the agent method.                |
| `**kwargs` | `Any` | Forwarded to the agent method.                |

**Returns:** [`CritiqorResult`](#critiqorresult-fields)
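
To make the forwarding behaviour concrete, here is a minimal stand-in that passes extra arguments through to the wrapped agent unchanged. `ForwardingWrapper` and `TemperatureAgent` are hypothetical names used only for illustration; the real `Critiqor.run()` also collects evidence and returns a `CritiqorResult`, which this sketch does not attempt to reproduce.

```python
from typing import Any


class TemperatureAgent:
    """Hypothetical agent whose run() accepts an extra keyword argument."""

    def run(self, prompt: str, temperature: float = 0.0) -> str:
        return f"{prompt} (temperature={temperature})"


class ForwardingWrapper:
    """Stand-in wrapper that only demonstrates argument forwarding."""

    def __init__(self, agent: Any) -> None:
        self._agent = agent

    def run(self, prompt: str, *args: Any, **kwargs: Any) -> str:
        # Extra positional and keyword arguments reach the agent untouched.
        return self._agent.run(prompt, *args, **kwargs)


wrapped = ForwardingWrapper(TemperatureAgent())
answer = wrapped.run("Summarize the report", temperature=0.2)
```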

***

## `Critiqor.evaluate()`

```python
Critiqor.evaluate(
    prompt,
    response,
    tool_calls=None,
    tool_outputs=None,
    trace=None,
    metrics=None,
    evidence_level=None,
) -> CritiqorResult
```

Evaluates a pre-collected agent run directly. Use this method when you already have a response and want to supply additional evidence — tool call logs, trace events, or runtime metrics — without re-executing the agent.

All parameters except `prompt` and `response` are optional. Critiqor infers the appropriate `evidence_level` automatically if one is not provided:

* `response_only` — only `prompt` and `response` supplied
* `trace_available` — tool calls or outputs are present
* `fully_instrumented` — trace events or latency metrics are present
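
The inference rules in the list above can be sketched as a small function. This is an illustration under stated assumptions, not Critiqor's code: `infer_evidence_level` is a hypothetical name, the richest available evidence is assumed to win, and any non-empty `metrics` value is treated as sufficient (the list mentions latency metrics specifically).

```python
from typing import Any, Mapping, Optional, Sequence


def infer_evidence_level(
    tool_calls: Optional[Sequence[Any]] = None,
    tool_outputs: Optional[Sequence[Any]] = None,
    trace: Optional[Sequence[Mapping[str, Any]]] = None,
    metrics: Optional[Mapping[str, Any]] = None,
) -> str:
    """Map the supplied evidence to one of the three documented levels."""
    # Assumption: trace events or metrics take precedence over tool data.
    if trace or metrics:
        return "fully_instrumented"
    if tool_calls or tool_outputs:
        return "trace_available"
    return "response_only"


level = infer_evidence_level(tool_calls=[{"name": "search"}])
```

Passing `evidence_level` explicitly to `evaluate()` bypasses this kind of inference.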

| Parameter        | Type                               | Description                                                      |
| ---------------- | ---------------------------------- | ---------------------------------------------------------------- |
| `prompt`         | `str`                              | The original user prompt.                                        |
| `response`       | `str`                              | The agent's response string.                                     |
| `tool_calls`     | `list[ToolCall \| dict] \| None`   | Tool calls observed during the run.                              |
| `tool_outputs`   | `list[ToolOutput \| dict] \| None` | Tool outputs observed during the run.                            |
| `trace`          | `list[dict] \| None`               | Full event trace from a framework adapter or `EvidenceRecorder`. |
| `metrics`        | `RuntimeMetrics \| dict \| None`   | Runtime metrics (latency, token usage, retries, errors).         |
| `evidence_level` | `EvidenceLevel \| None`            | Override the inferred evidence level.                            |

**Returns:** [`CritiqorResult`](#critiqorresult-fields)

***

## `CritiqorResult` Fields

`CritiqorResult` is a frozen dataclass returned by both `run()` and `evaluate()`.

| Field                       | Type                            | Description                                                                                       |
| --------------------------- | ------------------------------- | ------------------------------------------------------------------------------------------------- |
| `answer`                    | `str`                           | The agent's response.                                                                             |
| `confidence`                | `int`                           | Overall trust score from 0–100. Also accessible as `trust_score` in `to_dict()`.                  |
| `trust_level`               | `"High" \| "Moderate" \| "Low"` | Label derived from `confidence`: High (75–100), Moderate (50–74), Low (0–49).                     |
| `critique`                  | `ReliabilityCritique`           | Per-dimension reliability scores and findings.                                                    |
| `evidence`                  | `EvaluationEvidence`            | The evidence object used during evaluation.                                                       |
| `failure_causes`            | `list[FailureCause]`            | Detected failure causes with severity, impact, and recommendations.                               |
| `evaluation_confidence`     | `int`                           | Critiqor's confidence in its own assessment (0–100). Higher when more evidence is available.      |
| `deployment_recommendation` | `DeploymentRecommendation`      | `"safe_to_deploy"`, `"review_recommended"`, or `"unsafe_for_production"`.                         |
| `benchmark_percentile`      | `int \| None`                   | Percentile rank among historical runs, if benchmark data is available.                            |
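
The `trust_level` thresholds in the table translate directly into a threshold check. The function below is a sketch for readers who want to reproduce the label from a raw score; `trust_level_for` is a hypothetical helper, not part of the documented API.

```python
def trust_level_for(confidence: int) -> str:
    """Return the label for a 0-100 trust score using the documented bands."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence >= 75:
        return "High"
    if confidence >= 50:
        return "Moderate"
    return "Low"


labels = [trust_level_for(score) for score in (49, 50, 74, 75)]
```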

### `CritiqorResult.to_dict()`

Returns a JSON-serializable `dict` representation of the result. Includes all fields plus a `trust_score` alias for `confidence` and an `evidence_level` shortcut from `evidence.evidence_level`.
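
As a rough picture of the alias and shortcut keys, the sketch below builds a dictionary from a small frozen dataclass. `MiniResult` and `MiniEvidence` are hypothetical stand-ins with only three fields; the real `CritiqorResult` has more fields and its serialization details may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class MiniEvidence:
    evidence_level: str


@dataclass(frozen=True)
class MiniResult:
    answer: str
    confidence: int
    evidence: MiniEvidence

    def to_dict(self) -> Dict[str, Any]:
        data = asdict(self)
        # Alias and shortcut keys, mirroring the description above.
        data["trust_score"] = self.confidence
        data["evidence_level"] = self.evidence.evidence_level
        return data


payload = MiniResult("ok", 82, MiniEvidence("response_only")).to_dict()
serialized = json.dumps(payload)  # succeeds because every value is JSON-safe
```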

### `CritiqorResult.to_record()`

```python
result.to_record(agent_id="default", run_id=None) -> EvaluationRecord
```

Converts the result into an [`EvaluationRecord`](/reference/api/data-types#evaluationrecord) suitable for persistence with [`save_evaluation()`](/reference/api/data-types). A UUID is generated for `run_id` if none is provided.

***

## Example

```python
from critiqor import Critiqor

# Wrap any existing agent. TheirExistingAgent is a placeholder for your own
# agent class.
base_agent = TheirExistingAgent(model="llama3.2")
verified_agent = Critiqor(base_agent)

result = verified_agent.run("What does Critiqor do?")

print(result.answer)                    # agent's response
print(result.confidence)                # 0-100 trust score
print(result.trust_level)               # "High", "Moderate", or "Low"
print(result.deployment_recommendation) # "safe_to_deploy" etc.
print(result.failure_causes)            # list of detected failures
```
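
One way to act on `deployment_recommendation` is to turn it into a process exit code for a CI step. The following is a sketch only: `gate_exit_code` and its `allow_review` flag are hypothetical, the pass/fail policy is an assumption rather than Critiqor behaviour, and `SimpleNamespace` stands in for a real `CritiqorResult`. It also assumes the recommendation compares equal to the plain strings listed in the fields table.

```python
from types import SimpleNamespace
from typing import Any


def gate_exit_code(result: Any, allow_review: bool = False) -> int:
    """Return 0 when the run may ship and 1 when it should be blocked."""
    recommendation = result.deployment_recommendation
    if recommendation == "safe_to_deploy":
        return 0
    if recommendation == "review_recommended" and allow_review:
        return 0
    # "unsafe_for_production" and any unrecognized value block the deploy.
    return 1


# Stand-in for a CritiqorResult; only the field used by the gate is set.
fake_result = SimpleNamespace(deployment_recommendation="review_recommended")
strict_code = gate_exit_code(fake_result)
lenient_code = gate_exit_code(fake_result, allow_review=True)
```

Treating unknown values as blocking keeps the gate conservative if new recommendation values appear later.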

***

## `ReliabilityCritique` Fields

`ReliabilityCritique` is a frozen dataclass nested inside every `CritiqorResult` under `.critique`. Each dimension is scored 0–100, where higher is better.

| Field                    | Type            | Description                                                                                                    |
| ------------------------ | --------------- | -------------------------------------------------------------------------------------------------------------- |
| `hallucination`          | `int`           | Reliability score for unsupported or fabricated claims. Higher means fewer hallucinations.                     |
| `reasoning`              | `int`           | Score for coherent, well-structured task reasoning.                                                            |
| `tool_reliability`       | `int`           | Score for observed tool selection and usage quality. Also accessible as `tool_use` for backward compatibility. |
| `consistency`            | `int`           | Score for internal consistency. Fewer contradictions yield a higher score.                                     |
| `task_completion`        | `int`           | Score for how fully the agent satisfied the user's request.                                                    |
| `confidence_calibration` | `int`           | Score for whether the agent's expressed confidence matches evidence.                                           |
| `execution_efficiency`   | `int`           | Score for avoiding redundant calls, loops, and wasted steps.                                                   |
| `evidence_level`         | `EvidenceLevel` | The quality of evidence used: `"response_only"`, `"trace_available"`, or `"fully_instrumented"`.               |
| `summary`                | `str`           | One-sentence reliability summary of the evaluated run.                                                         |
| `findings`               | `list[str]`     | Bullet-point findings from the evaluation.                                                                     |
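
Because every dimension uses the same 0–100 scale, the scores can be compared directly to find where a run is weakest. The sketch below assumes the field names from the table; `weakest_dimensions`, the threshold of 60, and the `CritiqueStandIn` class are hypothetical and exist only so the example runs without the library.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

# Numeric dimensions listed in the table above.
DIMENSIONS = (
    "hallucination",
    "reasoning",
    "tool_reliability",
    "consistency",
    "task_completion",
    "confidence_calibration",
    "execution_efficiency",
)


@dataclass(frozen=True)
class CritiqueStandIn:
    """Stand-in carrying only the numeric dimensions."""

    hallucination: int
    reasoning: int
    tool_reliability: int
    consistency: int
    task_completion: int
    confidence_calibration: int
    execution_efficiency: int


def weakest_dimensions(critique: Any, threshold: int = 60) -> List[Tuple[str, int]]:
    """Return (name, score) pairs below the threshold, lowest score first."""
    scored = [(name, getattr(critique, name)) for name in DIMENSIONS]
    return sorted((pair for pair in scored if pair[1] < threshold), key=lambda pair: pair[1])


critique = CritiqueStandIn(90, 55, 40, 80, 70, 65, 58)
weak = weakest_dimensions(critique)
```

With a real result, the same helper could in principle be called on `result.critique`, provided the attribute names match the table.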