Runtime Intelligence: Observe What Agents Actually Do

Runtime intelligence is the practice of evaluating an AI agent by observing what it actually does during execution — not by asking it to explain itself afterward. Instead of grading a final answer, Critiqor captures every tool call, memory access, provider request, and state transition that occurred during the run and uses that evidence as the basis for diagnosis. The result is a reliability signal rooted in what happened, not in what the agent reports.

The Problem with Self-Reported Evaluation

When you evaluate an agent by reading its final response, you are reading a document the agent produced after the work was done. That document often looks confident, coherent, and complete — even when the underlying execution was broken. Self-reported evaluation cannot reveal:

Tool loops — the agent called the same tool six times with identical arguments and still returned a clean-sounding answer
Memory failures — the agent attempted to recall earlier context, failed silently, and reconstructed a plausible-but-wrong response from scratch
Context saturation — the agent’s context window filled past 85%, compaction occurred, and key state was lost before the final turn
Ignored outputs — a tool returned the correct data, but the agent never incorporated it into its reasoning

An agent that confidently answers “Task complete” may have spent the entire run stuck in a retry loop. Self-report cannot tell you which is true. Observed execution can.
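
As a rough illustration of the first item above, a repeated-call check over recorded tool calls might look like the following. This is a minimal sketch, not Critiqor's detector: the `tool_calls` record shape (dicts with `tool` and `args` keys), the `find_tool_loops` name, and the default threshold of 3 are all assumptions made for the example. It counts total repeats across the run rather than strictly consecutive ones.

```python
import json
from collections import Counter


def find_tool_loops(tool_calls, threshold=3):
    """Return (tool, args_json, count) for calls repeated at least `threshold` times.

    `tool_calls` is a hypothetical list of dicts with "tool" and "args" keys;
    this shape is an assumption for illustration, not a real trace schema.
    """
    # Serialize arguments with sorted keys so identical calls compare equal
    # regardless of the order in which keys were written.
    counts = Counter(
        (call["tool"], json.dumps(call["args"], sort_keys=True))
        for call in tool_calls
    )
    return [(tool, args, n) for (tool, args), n in counts.items() if n >= threshold]


# Six identical calls plus one unrelated call, mirroring the tool-loop case.
calls = [{"tool": "search", "args": {"q": "order 1142"}}] * 6
calls.append({"tool": "fetch", "args": {"id": 7}})
loops = find_tool_loops(calls)
print(loops)
```

The point of the sketch is that the loop is visible only in the recorded calls; nothing in a final "Task complete" message would expose it.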

What Runtime Traces Reveal

Runtime traces expose a layered hierarchy of evidence. The deeper you go, the harder it is for failures to hide:

Self-Report (weakest)
  └── Final Response
        └── Trace Events
              ├── Tool Calls + Outputs
              ├── Memory Events
              ├── Provider Requests
              ├── State Transitions
              └── Error Events (strongest)

Each layer closer to raw execution is harder to fake and harder to misread:

Final Response — the agent’s own summary of what it did; reflects intent, not necessarily execution
Tool Calls + Outputs — what the agent actually invoked and what it received back; reveals ignored results, repeated calls, and incorrect arguments
Memory Events — whether memory recall succeeded or silently failed, and whether retrieved context was used
Provider Requests — how many LLM calls were made, how much context was sent, and whether token usage grew unexpectedly
State Transitions — the turn-level lifecycle: where execution started, what changed, and where it ended
Error Events — explicit runtime failures, timeouts, non-zero exit codes, and tool exceptions — the strongest possible evidence that something went wrong

When all of these layers are present, Critiqor can diagnose failures that would be completely invisible to any evaluator reading only the final output.
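
To make the layers concrete, here is one way the event stream could be modeled. This is a sketch under stated assumptions: `EventKind`, `TraceEvent`, the `turn` and `payload` fields, and `summarize` are hypothetical names chosen for this example and do not describe Critiqor's actual trace format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class EventKind(Enum):
    # One member per trace layer listed above (names are illustrative).
    TOOL_CALL = "tool_call"
    TOOL_OUTPUT = "tool_output"
    MEMORY = "memory"
    PROVIDER_REQUEST = "provider_request"
    STATE_TRANSITION = "state_transition"
    ERROR = "error"


@dataclass
class TraceEvent:
    kind: EventKind
    turn: int  # turn index within the run (assumed field)
    payload: dict[str, Any] = field(default_factory=dict)


def summarize(events):
    """Count events per kind so that missing layers show up as zeros."""
    summary = {kind.value: 0 for kind in EventKind}
    for event in events:
        summary[event.kind.value] += 1
    return summary


run = [
    TraceEvent(EventKind.STATE_TRANSITION, turn=0, payload={"to": "running"}),
    TraceEvent(EventKind.TOOL_CALL, turn=1, payload={"tool": "search"}),
    TraceEvent(EventKind.ERROR, turn=1, payload={"message": "timeout"}),
]
summary = summarize(run)
print(summary)
```

A summary with zeros for memory or provider events signals that those layers were not observed in this run, which limits what can be diagnosed from it.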

How Critiqor Uses Runtime Evidence

Critiqor classifies every run by its evidence level — a measure of how much observed execution data is available:

Evidence Level	What’s Available	Diagnostic Confidence
`response_only`	Only the final response text	Low — cannot validate tool behavior, loops, or memory
`trace_available`	Tool calls and outputs included	Moderate — can detect ignored outputs and redundant calls
`fully_instrumented`	Full runtime trace with all event types	High — all six failure detectors are active

`fully_instrumented` is the highest-confidence level. When a run includes the complete event stream (tool calls, tool outputs, memory events, context events, token usage, state transitions, and error events), Critiqor’s detectors have full visibility into every layer of the hierarchy above. Failures that are invisible to response-level evaluation become detectable, and the causal chain from evidence to failure cause to trust score becomes fully traceable.

When evidence is limited to the final response, Critiqor still produces a diagnosis, but the evaluation confidence is lower and certain failure modes, particularly tool loops, memory degradation, and context pollution, cannot be confirmed.

The practical implication: the more runtime evidence you give Critiqor, the more accurate and specific its diagnosis will be. `fully_instrumented` runs produce the most reliable signal for deployment decisions.
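
The mapping from available evidence to level can be sketched as a simple set check. The three level names come from the table above; everything else is an assumption for illustration, including the event-type strings, the `evidence_level` function, and the rule that both tool calls and tool outputs must be captured before a run counts as `trace_available`. The input is the set of event types the instrumentation captures, not the events that happened to occur in one run.

```python
# Event types that make up the complete stream described above
# (string names are illustrative, not a real schema).
FULL_STREAM = frozenset({
    "tool_call", "tool_output", "memory", "context",
    "token_usage", "state_transition", "error",
})


def evidence_level(captured):
    """Map the set of captured event types to an evidence level name."""
    captured = set(captured)
    if FULL_STREAM <= captured:
        return "fully_instrumented"
    # Assumption: tool calls without their outputs are not enough to
    # detect ignored outputs, so both are required here.
    if {"tool_call", "tool_output"} <= captured:
        return "trace_available"
    return "response_only"


print(evidence_level([]))
print(evidence_level(["tool_call", "tool_output", "memory"]))
print(evidence_level(FULL_STREAM))
```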
