All Critiqor data types are frozen dataclasses — immutable once constructed. Every type exposes a to_dict() method that returns a JSON-serializable dict. Import any type directly from the top-level critiqor package:
from critiqor import ToolCall, ToolOutput, FailureCause, EvaluationRecord, PolicyCheckResult, TrendAnalysis
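
To illustrate what "frozen" and "JSON-serializable" mean in practice, the following is a minimal sketch built on a local stand-in class rather than the real critiqor type. The class name `ToolCallSketch`, its defaults, and the `asdict`-based `to_dict()` body are assumptions made for illustration only.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass, field
from typing import Optional


# Hypothetical stand-in mirroring the documented ToolCall fields.
# It is NOT the critiqor implementation.
@dataclass(frozen=True)
class ToolCallSketch:
    tool: str
    args: dict = field(default_factory=dict)
    id: Optional[str] = None
    timestamp: Optional[float] = None

    def to_dict(self) -> dict:
        # asdict() converts the dataclass into plain dicts and lists.
        return asdict(self)


call = ToolCallSketch(tool="search", args={"query": "weather"}, id="call-1")

# A frozen dataclass rejects attribute assignment after construction.
try:
    call.tool = "other"
    mutated = True
except FrozenInstanceError:
    mutated = False

# The dict form can be passed straight to json.dumps().
payload = json.dumps(call.to_dict())
```

To change a field on a frozen instance, build a new one, for example with `dataclasses.replace(call, tool="other")`.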

Type Aliases

TrustLevel               = Literal["High", "Moderate", "Low"]
EvidenceLevel            = Literal["response_only", "trace_available", "fully_instrumented"]
FailureSeverity          = Literal["low", "medium", "high"]
DeploymentRecommendation = Literal["safe_to_deploy", "review_recommended", "unsafe_for_production"]
AgentType                = Literal["coding", "research", "customer_support", "general"]
CertificationLevel       = Literal["none", "bronze", "silver", "gold", "platinum"]
TrendDirection           = Literal["improving", "stable", "declining", "insufficient_data"]
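
`Literal` aliases are checked by static type checkers, not at runtime. If a value arrives from user input or a config file, it may be worth validating it by hand. The sketch below redefines two of the aliases locally so that it runs without critiqor installed; the helper `is_valid` is a hypothetical name, not part of the package.

```python
from typing import Literal, get_args

# Local copies of two aliases from the list above.
TrustLevel = Literal["High", "Moderate", "Low"]
TrendDirection = Literal["improving", "stable", "declining", "insufficient_data"]


def is_valid(value: str, alias) -> bool:
    """Return True if value is one of the strings permitted by the alias."""
    return value in get_args(alias)


ok = is_valid("Moderate", TrustLevel)
bad = is_valid("moderate", TrustLevel)  # the comparison is case-sensitive
```

Note that `TrustLevel` uses capitalized values while the other aliases use lowercase ones.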

ToolCall

Represents a single observed tool invocation. Produced by EvidenceRecorder.record_tool_call() and collected in EvaluationEvidence.tool_calls.
| Field | Type | Description |
| --- | --- | --- |
| `tool` | `str` | Tool name. |
| `args` | `dict` | Arguments passed to the tool. |
| `id` | `str \| None` | Optional call ID used to correlate with a ToolOutput. |
| `timestamp` | `float \| None` | Unix timestamp of the call. |

ToolOutput

Represents the result of a single tool invocation. Produced by EvidenceRecorder.record_tool_output() and collected in EvaluationEvidence.tool_outputs.
| Field | Type | Description |
| --- | --- | --- |
| `tool` | `str` | Tool name. |
| `output` | `Any` | The tool’s result. |
| `call_id` | `str \| None` | Correlates with the id of a prior ToolCall. |
| `error` | `str \| None` | Error message if the tool call failed; None on success. |
| `timestamp` | `float \| None` | Unix timestamp of the output. |
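
The `id` and `call_id` fields exist so that calls and outputs can be matched up. The following sketch shows one way to do that matching with local stand-in classes; `ToolCallSketch`, `ToolOutputSketch`, and `pair_calls_with_outputs` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Hypothetical stand-ins with a subset of the documented fields.
@dataclass(frozen=True)
class ToolCallSketch:
    tool: str
    args: dict
    id: Optional[str] = None


@dataclass(frozen=True)
class ToolOutputSketch:
    tool: str
    output: Any
    call_id: Optional[str] = None
    error: Optional[str] = None


def pair_calls_with_outputs(calls, outputs):
    """Pair each call with the output whose call_id equals the call's id."""
    by_call_id = {o.call_id: o for o in outputs if o.call_id is not None}
    return [
        (c, by_call_id.get(c.id) if c.id is not None else None)
        for c in calls
    ]


calls = [
    ToolCallSketch("search", {"q": "docs"}, id="c1"),
    ToolCallSketch("fetch", {"url": "https://example.com"}, id="c2"),
]
outputs = [ToolOutputSketch("search", ["hit"], call_id="c1")]

pairs = pair_calls_with_outputs(calls, outputs)
# Calls that never received an output.
unanswered = [call.tool for call, out in pairs if out is None]
```

Calls recorded without an `id` cannot be correlated this way and are paired with `None`.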

RuntimeMetrics

Aggregated runtime statistics for a single agent execution. Populated automatically by EvidenceRecorder.finish() or supplied directly to Critiqor.evaluate().
| Field | Type | Description |
| --- | --- | --- |
| `latency` | `float \| None` | Wall-clock duration of the execution in seconds. |
| `token_usage` | `dict` | Token usage breakdown, e.g. {"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200}. |
| `retries` | `int` | Number of retry events observed during the run. |
| `errors` | `list[str]` | List of error message strings captured during the run. |
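
When metrics are assembled by hand, the values might be gathered roughly as in the sketch below. It uses a local stand-in class; the name `RuntimeMetricsSketch` and the default values are assumptions, and the real constructor may differ.

```python
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional


# Hypothetical stand-in mirroring the documented fields; defaults are assumed.
@dataclass(frozen=True)
class RuntimeMetricsSketch:
    latency: Optional[float] = None
    token_usage: dict = field(default_factory=dict)
    retries: int = 0
    errors: List[str] = field(default_factory=list)


start = time.perf_counter()
# ... the agent run would happen here ...
elapsed = time.perf_counter() - start  # wall-clock seconds

metrics = RuntimeMetricsSketch(
    latency=elapsed,
    token_usage={"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200},
    retries=1,
    errors=["TimeoutError: tool 'search' timed out"],
)

# Sanity check that the token breakdown adds up before handing it on.
usage = metrics.token_usage
tokens_consistent = (
    usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
)
```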

FailureCause

A structured explanation for a trust-score penalty. Failure causes are detected deterministically by detect_failure_causes() and returned in CritiqorResult.failure_causes.
| Field | Type | Description |
| --- | --- | --- |
| `type` | `str` | Failure type identifier, e.g. "infinite_tool_loop", "ignored_tool_output", "unsupported_claims", "redundant_tool_calls", "runtime_failures", "confidence_mismatch". |
| `severity` | `FailureSeverity` | "low", "medium", or "high". |
| `impact` | `int` | Trust-score penalty applied by this cause (negative integer). |
| `description` | `str` | Human-readable description of what was observed. |
| `root_cause` | `RootCause \| None` | Optional deeper root cause enrichment. |
| `recommendation` | `str` | Suggested remediation. Empty string if none is available. |
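
Because `impact` is a negative integer, a list of failure causes can be summed to get the total penalty and sorted to surface the worst problems first. The sketch below uses a local stand-in class and invented impact values; how Critiqor itself combines penalties into the trust score is not specified here.

```python
from dataclasses import dataclass


# Hypothetical stand-in with a subset of the documented FailureCause fields.
@dataclass(frozen=True)
class FailureCauseSketch:
    type: str
    severity: str
    impact: int
    description: str = ""


# Lower rank sorts first, so "high" severity comes out on top.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}

# Illustrative values only; real penalties come from the library.
causes = [
    FailureCauseSketch("redundant_tool_calls", "low", -3),
    FailureCauseSketch("infinite_tool_loop", "high", -20),
    FailureCauseSketch("unsupported_claims", "medium", -10),
]

total_penalty = sum(c.impact for c in causes)
most_severe_first = sorted(
    causes, key=lambda c: (SEVERITY_ORDER[c.severity], c.impact)
)
```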

RootCause

Optional enrichment nested inside a FailureCause. Provides a deeper causal explanation and a concrete fix recommendation.
| Field | Type | Description |
| --- | --- | --- |
| `description` | `str` | Explanation of the underlying cause. |
| `impact` | `str` | Human-readable description of the downstream impact. |
| `trust_penalty` | `int` | Trust-score deduction contributed by this root cause. |
| `recommended_fix` | `str` | Concrete remediation suggestion. |

EvaluationRecord

A persisted representation of one Critiqor evaluation. Returned by save_evaluation() and loaded back by load_evaluations(). Also produced by CritiqorResult.to_record().
| Field | Type | Description |
| --- | --- | --- |
| `run_id` | `str` | Unique run identifier (UUID). |
| `agent_id` | `str` | Agent identifier. |
| `timestamp` | `str` | ISO 8601 UTC timestamp of the evaluation. |
| `scores` | `dict` | Per-dimension reliability scores keyed by dimension name. |
| `failure_causes` | `list[FailureCause]` | All failure causes detected for this run. |
| `trust_score` | `int` | Overall trust score (0–100). |
| `evidence_level` | `EvidenceLevel` | Evidence quality used for this evaluation. |
| `evaluation_confidence` | `int` | Critiqor’s self-confidence in the evaluation (0–100). |
| `deployment_recommendation` | `DeploymentRecommendation` | The deployment gate result for this run. |

EvaluationRecord.to_dict()

Returns a JSON-serializable dict. Failure causes are serialized via their own to_dict() methods.

EvaluationRecord.from_dict()

EvaluationRecord.from_dict(payload: dict) → EvaluationRecord
Reconstructs an EvaluationRecord from a previously serialized dict. Unknown or invalid field values are coerced to safe defaults.
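
The round trip and the coercion behavior can be pictured with the sketch below. It is a local stand-in with only three fields, and the specific defaults it falls back to (a score of 0, clamping to 0–100, "response_only") are assumptions; the documentation does not say which defaults the real `from_dict()` uses.

```python
from dataclasses import asdict, dataclass
from typing import Literal, get_args

EvidenceLevel = Literal["response_only", "trace_available", "fully_instrumented"]


# Hypothetical stand-in, not the critiqor EvaluationRecord.
@dataclass(frozen=True)
class RecordSketch:
    run_id: str
    trust_score: int
    evidence_level: str

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, payload: dict) -> "RecordSketch":
        # Assumed default: an unparsable score becomes 0.
        try:
            score = int(payload.get("trust_score", 0))
        except (TypeError, ValueError):
            score = 0
        # Assumed default: an unknown level becomes "response_only".
        level = payload.get("evidence_level")
        if level not in get_args(EvidenceLevel):
            level = "response_only"
        return cls(
            run_id=str(payload.get("run_id", "")),
            trust_score=max(0, min(100, score)),  # clamp to 0-100
            evidence_level=level,
        )


original = RecordSketch("run-1", 87, "trace_available")
restored = RecordSketch.from_dict(original.to_dict())

# Invalid values are coerced and unknown keys are ignored.
repaired = RecordSketch.from_dict(
    {"run_id": "run-2", "trust_score": "n/a", "evidence_level": "??", "extra": 1}
)
```

Coercing rather than raising means a corrupted record still loads, at the cost of silently replacing bad values, so it may be worth validating payloads before saving them.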

PolicyCheckResult

Returned by check_policy(). Represents a CI/CD deployment gate decision for a given agent run.
| Field | Type | Description |
| --- | --- | --- |
| `passed` | `bool` | True if the run met all configured policy thresholds. |
| `deployment_recommendation` | `DeploymentRecommendation` | The deployment decision: "safe_to_deploy", "review_recommended", or "unsafe_for_production". |
| `messages` | `list[str]` | Human-readable messages explaining the gate result, i.e. which thresholds passed or failed. |
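
In a CI/CD pipeline, a gate result is typically turned into a process exit code. The sketch below shows that pattern with a local stand-in class; `PolicyCheckResultSketch`, `gate_exit_code`, and the sample message are hypothetical, and the call to `check_policy()` itself is omitted because its signature is not covered here.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-in mirroring the documented PolicyCheckResult fields.
@dataclass(frozen=True)
class PolicyCheckResultSketch:
    passed: bool
    deployment_recommendation: str
    messages: List[str] = field(default_factory=list)


def gate_exit_code(result) -> int:
    """Print the gate messages and return 0 on pass, 1 on failure."""
    for message in result.messages:
        print(f"[critiqor] {message}")
    return 0 if result.passed else 1


result = PolicyCheckResultSketch(
    passed=False,
    deployment_recommendation="review_recommended",
    messages=["trust_score 68 is below the required minimum of 75"],
)
exit_code = gate_exit_code(result)
# In a real CI script this would be followed by sys.exit(exit_code).
```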

TrendAnalysis

Returned by analyze_trends(). Summarizes the direction and magnitude of reliability change across multiple historical runs for a single agent.
| Field | Type | Description |
| --- | --- | --- |
| `trust_trend` | `TrendDirection` | Overall trend direction: "improving", "stable", "declining", or "insufficient_data". |
| `trust_change` | `int` | Average change in trust score per run (positive = improving). |
| `hallucination_change` | `int` | Average change in the hallucination score per run. |
| `tool_reliability_change` | `int` | Average change in the tool reliability score per run. |
| `reasoning_change` | `int` | Average change in the reasoning score per run. |
| `summary` | `str` | Human-readable narrative of the trend. |
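
One plausible reading of "average change per run" is the mean of the differences between consecutive runs. The sketch below implements that reading; the rounding, the `tolerance` threshold, and the rule that fewer than two runs yields "insufficient_data" are all assumptions, since the exact formula used by `analyze_trends()` is not documented here.

```python
from typing import List, Optional


def average_change_per_run(scores: List[int]) -> Optional[int]:
    """Mean difference between consecutive runs, rounded to an int."""
    if len(scores) < 2:
        return None  # not enough history to compute a change
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return round(sum(deltas) / len(deltas))


def classify(change: Optional[int], tolerance: int = 1) -> str:
    """Map an average change to a TrendDirection value (assumed thresholds)."""
    if change is None:
        return "insufficient_data"
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "declining"
    return "stable"


history = [70, 74, 73, 82]  # trust scores, oldest first
trust_change = average_change_per_run(history)  # deltas are 4, -1, 9
trust_trend = classify(trust_change)
```

The mean of consecutive differences equals (last - first) / (n - 1), so under this reading only the first and last scores affect the result.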

ReliabilityCertification

Returned by certify_run(). Encodes a standardized certification level for a run or benchmark suite result.
| Field | Type | Description |
| --- | --- | --- |
| `certification_level` | `CertificationLevel` | "none", "bronze", "silver", "gold", or "platinum". |
| `trust_score` | `int` | Trust score used to determine the certification level. |
| `percentile` | `int` | Percentile rank among historical runs. |
| `markdown_badge` | `str` | Ready-to-embed Markdown badge string for README files. |
| `criteria` | `dict` | The threshold criteria that were evaluated to arrive at the certification level. |

AgentProfile

Registered identity for an agent, used for cross-agent ranking and leaderboard participation.
| Field | Type | Description |
| --- | --- | --- |
| `agent_id` | `str` | Unique agent identifier. |
| `name` | `str` | Display name. Defaults to agent_id if not set. |
| `category` | `str` | Agent category: "coding", "research", "customer_support", or "general". |
| `metadata` | `dict` | Arbitrary additional metadata. |
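
A default that depends on another field, such as `name` falling back to `agent_id`, takes a little care on a frozen dataclass because normal assignment is blocked. The sketch below shows one common way to do it; the class name, the empty-string sentinel, and the "general" default for `category` are assumptions rather than the library's actual implementation.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in mirroring the documented AgentProfile fields.
@dataclass(frozen=True)
class AgentProfileSketch:
    agent_id: str
    name: str = ""
    category: str = "general"  # assumed default
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.name:
            # Frozen dataclasses block self.name = ..., so use
            # object.__setattr__ during initialization instead.
            object.__setattr__(self, "name", self.agent_id)


unnamed = AgentProfileSketch("agent-7")
named = AgentProfileSketch("agent-8", name="Support Bot", category="customer_support")
```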

BenchmarkResult

Returned by benchmark_run(). Aggregates scores across all prompts in a benchmark suite.
| Field | Type | Description |
| --- | --- | --- |
| `name` | `str` | Benchmark suite name. |
| `agent_type` | `AgentType` | Agent category used for percentile ranking. |
| `trust_score` | `int` | Average trust score across all benchmark runs. |
| `percentile` | `int` | Percentile rank among agents in the same category. |
| `run_count` | `int` | Number of prompts evaluated. |
| `scores` | `dict[str, int]` | Average per-dimension scores across all runs. |
| `results` | `list[CritiqorResult]` | Individual results for each benchmark prompt. |
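
The aggregation behind `scores` and `run_count` can be sketched as a per-dimension mean over the individual runs. In the example below the dimension names, the sample values, and the rounding to `int` are assumptions; it works on plain dicts rather than real `CritiqorResult` objects.

```python
from statistics import mean

# Illustrative per-run scores; dimension names are assumed.
run_scores = [
    {"hallucination": 90, "tool_reliability": 80, "reasoning": 70},
    {"hallucination": 80, "tool_reliability": 90, "reasoning": 75},
    {"hallucination": 85, "tool_reliability": 70, "reasoning": 80},
]


def average_scores(runs):
    """Average each dimension across all runs, rounded to an int."""
    dimensions = runs[0].keys()
    return {dim: round(mean(run[dim] for run in runs)) for dim in dimensions}


averaged = average_scores(run_scores)
run_count = len(run_scores)
```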

CausalGraph

A structured causal graph for a single failure event. Returned by build_causal_graph().
| Field | Type | Description |
| --- | --- | --- |
| `failure_event` | `str` | The root failure type (e.g. "infinite_tool_loop"). |
| `causal_graph` | `list[CausalGraphEdge]` | Ordered list of directed causal edges. |
| `run_id` | `str \| None` | Run identifier this graph was built from, if available. |

CausalGraph.explain()

Returns the causal chain as a human-readable string, e.g.: "Prompt was ambiguous -> Agent selected incorrect tool -> Evidence was missing -> Final answer hallucinated"
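
A string like this can be produced by walking an ordered edge list. The sketch below assumes each edge has `source` and `target` fields, which are hypothetical names: the fields of `CausalGraphEdge` are not documented in this section, and the real `explain()` may work differently.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical edge type; the field names are assumptions.
@dataclass(frozen=True)
class EdgeSketch:
    source: str
    target: str


def explain(edges: List[EdgeSketch]) -> str:
    """Join an ordered chain of edges into 'a -> b -> c' form."""
    if not edges:
        return ""
    # Start with the first cause, then append each edge's effect.
    steps = [edges[0].source] + [edge.target for edge in edges]
    return " -> ".join(steps)


edges = [
    EdgeSketch("Prompt was ambiguous", "Agent selected incorrect tool"),
    EdgeSketch("Agent selected incorrect tool", "Evidence was missing"),
    EdgeSketch("Evidence was missing", "Final answer hallucinated"),
]
chain = explain(edges)
```

This only works for a linear chain in which each edge's target is the next edge's source; a branching graph would need a different traversal.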

ReliabilityInsight

An executive summary generated by generate_insights() from historical reliability data.
| Field | Type | Description |
| --- | --- | --- |
| `summary` | `str` | High-level narrative of agent reliability trends. |
| `primary_drivers` | `list[str]` | The top contributing factors to recent reliability changes. |