The Problem with Self-Reported Evaluation
When you evaluate an agent by reading its final response, you are reading a document the agent produced after the work was done. That document often looks confident, coherent, and complete, even when the underlying execution was broken. Self-reported evaluation cannot reveal:

- Tool loops: the agent called the same tool six times with identical arguments and still returned a clean-sounding answer
- Memory failures: the agent attempted to recall earlier context, failed silently, and reconstructed a plausible-but-wrong response from scratch
- Context saturation: the agent’s context window filled past 85%, compaction occurred, and key state was lost before the final turn
- Ignored outputs: a tool returned the correct data, but the agent never incorporated it into its reasoning
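As an illustration of why the tool-loop case needs trace data, here is a minimal sketch that flags repeated identical calls. The event shape (`tool` and `args` keys), the function name `find_tool_loops`, and the threshold of three are assumptions made for this example, not Critiqor's actual schema or detector logic.

```python
import json
from collections import Counter


def find_tool_loops(tool_calls, threshold=3):
    """Return {(tool, serialized_args): count} for calls repeated >= threshold times.

    Assumes each call is a dict with hypothetical keys "tool" and "args".
    Arguments are serialized with sorted keys so that key order does not
    make two identical calls look different.
    """
    counts = Counter(
        (call["tool"], json.dumps(call["args"], sort_keys=True))
        for call in tool_calls
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Six identical searches followed by one unrelated call.
trace = [{"tool": "search", "args": {"q": "refund policy"}}] * 6 + [
    {"tool": "fetch", "args": {"url": "https://example.com"}}
]
loops = find_tool_loops(trace)
```

None of this information appears in the final response text, which is why a check like this only works when tool calls are recorded.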
What Runtime Traces Reveal
Runtime traces expose a layered hierarchy of evidence. The deeper you go, the harder it is for failures to hide:

- Final Response: the agent’s own summary of what it did; reflects intent, not necessarily execution
- Tool Calls + Outputs: what the agent actually invoked and what it received back; reveals ignored results, repeated calls, and incorrect arguments
- Memory Events: whether memory recall succeeded or silently failed, and whether retrieved context was used
- Provider Requests: how many LLM calls were made, how much context was sent, and whether token usage grew unexpectedly
- State Transitions: the turn-level lifecycle: where execution started, what changed, and where it ended
- Error Events: explicit runtime failures, timeouts, non-zero exit codes, and tool exceptions; the strongest possible evidence that something went wrong
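The layers above can be pictured as a tagged event stream. The sketch below is one hypothetical way to model it: the `TraceEvent` class, the event kind names, and the `hit` flag on memory events are all assumptions for illustration and are not taken from Critiqor.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical mapping from an event kind to the evidence layer it belongs to.
LAYER_OF = {
    "final_response": "Final Response",
    "tool_call": "Tool Calls + Outputs",
    "tool_output": "Tool Calls + Outputs",
    "memory_recall": "Memory Events",
    "provider_request": "Provider Requests",
    "state_transition": "State Transitions",
    "error": "Error Events",
}


@dataclass
class TraceEvent:
    kind: str
    payload: dict = field(default_factory=dict)


def count_by_layer(events):
    """Count events per evidence layer, skipping unknown kinds."""
    return dict(Counter(LAYER_OF[e.kind] for e in events if e.kind in LAYER_OF))


def silent_memory_failures(events):
    """Return recall events that found nothing, using the assumed `hit` flag."""
    return [
        e for e in events
        if e.kind == "memory_recall" and not e.payload.get("hit", False)
    ]


events = [
    TraceEvent("tool_call", {"tool": "lookup_order", "args": {"id": 17}}),
    TraceEvent("tool_output", {"tool": "lookup_order", "result": "shipped"}),
    TraceEvent("memory_recall", {"query": "customer name", "hit": False}),
    TraceEvent("final_response", {"text": "Your order is on its way."}),
]
layer_counts = count_by_layer(events)
missed = silent_memory_failures(events)
```

In this example the final response reads cleanly, yet the memory layer shows a recall that returned nothing, which is the kind of failure only the deeper layers expose.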
How Critiqor Uses Runtime Evidence
Critiqor classifies every run by its evidence level, a measure of how much observed execution data is available:

| Evidence Level | What’s Available | Diagnostic Confidence |
|---|---|---|
| response_only | Only the final response text | Low: cannot validate tool behavior, loops, or memory |
| trace_available | Tool calls and outputs included | Moderate: can detect ignored outputs and redundant calls |
| fully_instrumented | Full runtime trace with all event types | High: all six failure detectors are active |
fully_instrumented is the highest-confidence level. When a run includes the complete event stream — tool calls, tool outputs, memory events, context events, token usage, state transitions, and error events — Critiqor’s detectors have full visibility into every layer of the hierarchy above. Failures that are invisible to response-level evaluation become detectable, and the causal chain from evidence to failure cause to trust score becomes fully traceable.
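To make the three levels concrete, the following sketch assigns a level from the set of event kinds a run is instrumented to emit. The kind names and the subset rule are assumptions chosen to mirror the description above; Critiqor's real classification logic is not documented here and may differ.

```python
# Hypothetical event kind names, modeled on the event stream described above.
TOOL_KINDS = {"tool_call", "tool_output"}
FULL_KINDS = TOOL_KINDS | {
    "memory_event",
    "context_event",
    "token_usage",
    "state_transition",
    "error_event",
}


def evidence_level(instrumented_kinds):
    """Classify a run by the event kinds its runtime is set up to emit.

    The input describes what is instrumented, not what happened to occur,
    so a run with no errors can still count as fully instrumented.
    """
    kinds = set(instrumented_kinds)
    if FULL_KINDS <= kinds:
        return "fully_instrumented"
    if TOOL_KINDS <= kinds:
        return "trace_available"
    return "response_only"


print(evidence_level([]))                            # response_only
print(evidence_level(["tool_call", "tool_output"]))  # trace_available
print(evidence_level(FULL_KINDS))                    # fully_instrumented
```

Classifying by instrumented kinds rather than observed kinds is a deliberate simplification: a healthy run emits no error events, and that absence should not lower its evidence level.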
When evidence is limited to the final response, Critiqor still produces a diagnosis, but the evaluation confidence is lower and certain failure modes — particularly tool loops, memory degradation, and context pollution — cannot be confirmed.
The practical implication: the more runtime evidence you give Critiqor, the more accurate and specific its diagnosis will be. fully_instrumented runs produce the most reliable signal for deployment decisions.