Runtime intelligence is the practice of evaluating an AI agent by observing what it actually does during execution — not by asking it to explain itself afterward. Instead of grading a final answer, Critiqor captures every tool call, memory access, provider request, and state transition that occurred during the run and uses that evidence as the basis for diagnosis. The result is a reliability signal rooted in what happened, not in what the agent reports.

The Problem with Self-Reported Evaluation

When you evaluate an agent by reading its final response, you are reading a document the agent produced after the work was done. That document often looks confident, coherent, and complete — even when the underlying execution was broken. Self-reported evaluation cannot reveal:
  • Tool loops — the agent called the same tool six times with identical arguments and still returned a clean-sounding answer
  • Memory failures — the agent attempted to recall earlier context, failed silently, and reconstructed a plausible-but-wrong response from scratch
  • Context saturation — the agent’s context window filled past 85%, compaction occurred, and key state was lost before the final turn
  • Ignored outputs — a tool returned the correct data, but the agent never incorporated it into its reasoning
An agent that confidently answers “Task complete” may have spent the entire run stuck in a retry loop. A self-report cannot distinguish that run from a healthy one. Observed execution can.
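As a rough illustration of the tool-loop case, the sketch below counts repeated tool calls with identical arguments in a recorded trace. The event schema (plain dicts with `type`, `tool`, and `args` keys), the function name `find_tool_loops`, and the threshold of three are all assumptions made for this example; they are not Critiqor's actual event format or detector logic.

```python
import json
from collections import Counter


def find_tool_loops(events, threshold=3):
    """Return {(tool, canonical_args): count} for calls repeated at least `threshold` times."""
    counts = Counter()
    for event in events:
        if event.get("type") != "tool_call":
            continue
        # Serialize arguments with sorted keys so key order does not affect equality.
        canonical_args = json.dumps(event.get("args", {}), sort_keys=True)
        counts[(event["tool"], canonical_args)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical run: six identical searches, one lookup, then a confident final answer.
trace = (
    [{"type": "tool_call", "tool": "search", "args": {"q": "order 1182"}}] * 6
    + [{"type": "tool_call", "tool": "lookup", "args": {"id": 1182}}]
    + [{"type": "final_response", "text": "Task complete"}]
)

loops = find_tool_loops(trace)
for (tool, args), n in loops.items():
    print(f"{tool} called {n} times with identical arguments {args}")
```

The final response in this trace reads as a success, yet the call counts show the repetition. That gap between the two is what a response-only evaluator has no way to see.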

What Runtime Traces Reveal

Runtime traces expose a layered hierarchy of evidence. The deeper you go, the harder it is for failures to hide:
Self-Report (weakest)
  └── Final Response
        └── Trace Events
              ├── Tool Calls + Outputs
              ├── Memory Events
              ├── Provider Requests
              ├── State Transitions
              └── Error Events (strongest)
Each layer closer to raw execution is harder to fake and harder to misread:
  • Final Response — the agent’s own summary of what it did; reflects intent, not necessarily execution
  • Tool Calls + Outputs — what the agent actually invoked and what it received back; reveals ignored results, repeated calls, and incorrect arguments
  • Memory Events — whether memory recall succeeded or silently failed, and whether retrieved context was used
  • Provider Requests — how many LLM calls were made, how much context was sent, and whether token usage grew unexpectedly
  • State Transitions — the turn-level lifecycle: where execution started, what changed, and where it ended
  • Error Events — explicit runtime failures, timeouts, non-zero exit codes, and tool exceptions — the strongest possible evidence that something went wrong
When all of these layers are present, Critiqor can diagnose failures that would be completely invisible to any evaluator reading only the final output.
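To make the layers above more concrete, here is a minimal sketch that pulls one signal from each layer of a recorded run. The `TraceEvent` class, its `kind` labels, and the `data` keys (`hit`, `tokens`, `to`, `message`) are hypothetical names chosen for illustration, not Critiqor's schema.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    # `kind` values used here are hypothetical labels for the layers in the text.
    kind: str
    data: dict = field(default_factory=dict)


def summarize(events):
    """Collect one signal per evidence layer from a list of TraceEvent objects."""
    summary = {
        "tool_calls": 0,
        "failed_recalls": 0,
        "provider_requests": 0,
        "context_tokens": [],
        "final_state": None,
        "errors": [],
    }
    for event in events:
        if event.kind == "tool_call":
            summary["tool_calls"] += 1
        elif event.kind == "memory":
            # A recall with no hit is the kind of silent failure a final answer can hide.
            if not event.data.get("hit", False):
                summary["failed_recalls"] += 1
        elif event.kind == "provider_request":
            summary["provider_requests"] += 1
            summary["context_tokens"].append(event.data.get("tokens", 0))
        elif event.kind == "state_transition":
            # The last transition seen is where execution ended.
            summary["final_state"] = event.data.get("to")
        elif event.kind == "error":
            summary["errors"].append(event.data.get("message", ""))
    return summary


events = [
    TraceEvent("state_transition", {"to": "running"}),
    TraceEvent("memory", {"hit": False}),
    TraceEvent("provider_request", {"tokens": 1200}),
    TraceEvent("tool_call", {"tool": "search"}),
    TraceEvent("provider_request", {"tokens": 5400}),
    TraceEvent("error", {"message": "tool timeout"}),
    TraceEvent("state_transition", {"to": "done"}),
]

report = summarize(events)
print(report)
```

In this made-up run the final state is "done", but the summary also records a failed recall, a jump in context size between provider requests, and a tool timeout. None of these would necessarily appear in the final response.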

How Critiqor Uses Runtime Evidence

Critiqor classifies every run by its evidence level — a measure of how much observed execution data is available:
| Evidence Level | What’s Available | Diagnostic Confidence |
| --- | --- | --- |
| response_only | Only the final response text | Low: cannot validate tool behavior, loops, or memory |
| trace_available | Tool calls and outputs included | Moderate: can detect ignored outputs and redundant calls |
| fully_instrumented | Full runtime trace with all event types | High: all six failure detectors are active |
fully_instrumented is the highest-confidence level. When a run includes the complete event stream (tool calls, tool outputs, memory events, context events, token usage, state transitions, and error events), Critiqor’s detectors have full visibility into every layer of the hierarchy above. Failures that are invisible to response-level evaluation become detectable, and the causal chain from evidence to failure cause to trust score becomes fully traceable.
When evidence is limited to the final response, Critiqor still produces a diagnosis, but the evaluation confidence is lower and certain failure modes (particularly tool loops, memory degradation, and context pollution) cannot be confirmed.
The practical implication: the more runtime evidence you give Critiqor, the more accurate and specific its diagnosis will be. fully_instrumented runs produce the most reliable signal for deployment decisions.
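The three evidence levels can be read as a simple classification over which event streams a run captured. The sketch below is one possible reading, not Critiqor's actual rule: the level names come from the table above, while the stream names and the `evidence_level` function are assumptions made for illustration. It keys off which streams were instrumented rather than which events occurred, since a healthy run may legitimately contain no error events.

```python
# Level names come from the table above; stream names are hypothetical.
TRACE_STREAMS = frozenset({"tool_calls", "tool_outputs"})
FULL_STREAMS = TRACE_STREAMS | {
    "memory_events",
    "context_events",
    "token_usage",
    "state_transitions",
    "error_events",
}


def evidence_level(captured_streams):
    """Map the set of instrumented streams to an evidence level name."""
    captured = set(captured_streams)
    if FULL_STREAMS <= captured:
        # Every stream listed for a complete event stream is present.
        return "fully_instrumented"
    if TRACE_STREAMS <= captured:
        # Tool calls and their outputs are present, but not the full stream.
        return "trace_available"
    return "response_only"


print(evidence_level([]))
print(evidence_level(["tool_calls", "tool_outputs"]))
print(evidence_level(FULL_STREAMS))
```

Under these assumptions, a run that captures tool calls but not tool outputs still falls back to response_only, because the middle level in the table requires both.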