Documentation

agentlens is local-first observability + evals for AI agents. Trace runs in development, CI, or production, catch stuck loops, score behavior, and keep reports portable — with no vendor backend and zero runtime dependencies.

trace every turn · inspect one HTML report · assert behavior in CI

Installation

Requires Python 3.10+. The core has no dependencies.

$ pip install agentlens-eval

The LLMJudge metric also needs the Anthropic SDK. Install it only if you use that metric:

$ pip install anthropic

Quickstart

Wrap each turn of your agent loop in tracer.step(), then render a report:

from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

tracer = Tracer("my-agent")

with tracer.step("look up the weather") as step:
    step.tool_call("get_weather", {"city": "NYC"}, result="72F", tokens=120)

generate_html(tracer.trace, "report.html",
              warnings=detect_loops(tracer.trace),
              open_browser=True)

That's the whole idea. agentlens only records — it never calls your model or runs your tools. Remove the agentlens lines and your agent behaves identically.

AI-readable docs

For coding agents, tools, and LLM crawlers, use the plain-text docs map:

https://agentlens-eval.com/llms.txt

It includes the install command, import path, public API summary, integration snippet, eval workflow, and generated report outputs in a compact format.

Add it to your agent

This is the shape of a real integration. Keep your loop, model call, and tool execution where they already are; add the tracer around each turn and record what happened.

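import anthropic
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# TOOLS, final_text, tool_calls, and run_tool are your own existing helpers.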

def run_agent(question):
    tracer = Tracer("my-agent")
    messages = [{"role": "user", "content": question}]

    while True:
        with tracer.step() as step:
            resp = client.messages.create(
                model="claude-opus-4-8",
                tools=TOOLS,
                messages=messages,
            )
            step.tokens = resp.usage.output_tokens

            if resp.stop_reason != "tool_use":
                step.result = final_text(resp)
                break

            for call in tool_calls(resp):
                out = run_tool(call)  # you run the tool
                step.tool_call(call.name, call.input, result=out)
                # feed `out` back to the model, as you already do

    generate_html(tracer.trace, warnings=detect_loops(tracer.trace))

The important bits are Tracer(...), with tracer.step(), step.tool_call(...), step.result, and generate_html(...). Everything else stays your agent code.

Core concepts

Tracer — the recorder. Create one per run.
Step — one iteration of your loop. Holds a thought, an optional tool call, result, tokens, latency, error.
Trace — the full run: a list of steps plus totals (total_steps, total_tokens, final_output).
Lenses — anything that reads a Trace: detect_loops, metrics, Eval, the HTML report.

One trace, many lenses. New features read the same object — your integration never changes.
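
To make the "one trace, many lenses" idea concrete, here is a rough sketch of the shape using plain dataclasses. SketchStep, SketchTrace, and both lens functions are hypothetical stand-ins written for illustration, not agentlens's actual classes. Only the names total_steps, total_tokens, and final_output come from the list above, and treating the last step's result as the final output is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchStep:
    """Hypothetical stand-in for one loop iteration."""
    thought: str = ""
    tool: Optional[str] = None
    args: Optional[dict] = None
    result: Any = None
    tokens: int = 0
    error: Optional[str] = None


@dataclass
class SketchTrace:
    """Hypothetical stand-in for a full run: steps plus derived totals."""
    steps: list = field(default_factory=list)

    @property
    def total_steps(self) -> int:
        return len(self.steps)

    @property
    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.steps)

    @property
    def final_output(self) -> Any:
        # Assumption: the final output is the result of the last step.
        return self.steps[-1].result if self.steps else None


# Two independent "lenses": each one only reads the trace.
def tools_used(trace: SketchTrace) -> list:
    return [s.tool for s in trace.steps if s.tool is not None]


def within_budget(trace: SketchTrace, max_tokens: int) -> bool:
    return trace.total_tokens <= max_tokens


trace = SketchTrace(steps=[
    SketchStep("look up the weather", tool="get_weather",
               args={"city": "NYC"}, result="72F", tokens=120),
    SketchStep("answer the user", result="It is 72F in NYC.", tokens=40),
])

print(tools_used(trace), within_budget(trace, 200))
```

Because neither lens mutates the trace, adding a third one later would not require touching the code that recorded the steps.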

Tracing your loop

Use the step() context manager so latency is timed automatically. Record a tool call with step.tool_call(...), or set step.result for a final answer:

tracer = Tracer("agent")

while not done:
    with tracer.step("deciding what to do") as step:
        decision = llm(messages, tools)            # your model call
        if decision.wants_tool:
            out = run_tool(decision.name, decision.args)   # you run the tool
            step.tool_call(decision.name, decision.args,
                           result=out, tokens=decision.tokens)
        else:
            step.result = decision.text
            done = True

Any exception inside the with block is recorded on the step (as step.error) and re-raised, so failures show up in the report.
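
The record-and-re-raise behavior described above is a common context-manager pattern. The sketch below shows one way it could be built with the standard library. timed_step and its dict-based record are hypothetical names for illustration and are not how agentlens implements step().

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(log: list, thought: str = ""):
    """Hypothetical recorder: times the block, notes any error, re-raises."""
    record = {"thought": thought, "result": None, "error": None, "latency_s": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Keep the failure on the record, then let it propagate unchanged.
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        # Runs on success and on failure, so every step gets a latency.
        record["latency_s"] = time.perf_counter() - start
        log.append(record)


log = []

with timed_step(log, "first turn") as step:
    step["result"] = "ok"

try:
    with timed_step(log, "second turn"):
        raise ValueError("tool exploded")
except ValueError:
    pass  # the caller still sees the original exception

for entry in log:
    print(entry["thought"], entry["error"])
```

The finally clause is what guarantees a failed step still lands in the log with a latency, which is why failures can appear in a report at all.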

Loop detection

detect_loops(trace) returns warnings when the agent repeats a tool call. It flags two shapes:

consecutive — the same (tool, args) back-to-back (severity error)
frequent — the same (tool, args) 3+ times anywhere (severity warning)

from agentlens_eval import detect_loops

for w in detect_loops(tracer.trace):
    print(w.severity, w.message)
# error  `get_weather` called 3x in a row with identical args ... likely stuck in a loop

Tune the thresholds: detect_loops(trace, consecutive_threshold=2, frequent_threshold=3).
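
For intuition about the two shapes, here is a minimal standalone sketch of how repeated (tool, args) pairs could be detected. find_repeats is a hypothetical function, not the agentlens implementation. The default thresholds mirror the values shown above, and skipping the frequent warning for a call already flagged as consecutive is an assumption of this sketch.

```python
import json
from collections import Counter


def find_repeats(calls, consecutive_threshold=2, frequent_threshold=3):
    """Hypothetical detector over a list of (tool, args) pairs."""
    # Serialise args so that dicts become hashable, order-independent keys.
    keys = [(tool, json.dumps(args, sort_keys=True)) for tool, args in calls]
    warnings = []

    # Shape 1: the same key repeated back-to-back.
    flagged_consecutive = set()
    i = 0
    while i < len(keys):
        j = i
        while j + 1 < len(keys) and keys[j + 1] == keys[i]:
            j += 1
        run = j - i + 1
        if run >= consecutive_threshold:
            warnings.append(("error", keys[i][0], run))
            flagged_consecutive.add(keys[i])
        i = j + 1

    # Shape 2: the same key many times anywhere in the run.
    for key, count in Counter(keys).items():
        if count >= frequent_threshold and key not in flagged_consecutive:
            warnings.append(("warning", key[0], count))
    return warnings


calls = [
    ("get_weather", {"city": "NYC"}),
    ("get_weather", {"city": "NYC"}),
    ("get_weather", {"city": "NYC"}),
    ("search", {"q": "a"}),
    ("lookup", {"id": 1}),
    ("search", {"q": "a"}),
    ("lookup", {"id": 2}),
    ("search", {"q": "a"}),
]
result = find_repeats(calls)
print(result)
```

In this run, get_weather is flagged as a consecutive repeat and search as a frequent one, while lookup is not flagged because its arguments differ between calls.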

HTML reports

Render a single trace, or a whole eval run, to a self-contained HTML file (inline CSS, no assets):

from agentlens_eval.report import generate_html, generate_eval_html

generate_html(trace, "report.html", open_browser=True)        # one run
generate_eval_html(report, "eval_report.html", open_browser=True)  # a dataset

The trace report shows a step timeline, per-step tool calls, totals (steps / tokens / latency / errors), and highlighted loop warnings. The eval report adds a pass-rate header and per-case pass/fail with an expandable trace.
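
"Self-contained" here means the file carries its own styles and references no external assets. The sketch below illustrates that idea with the standard library only. render_steps and its dict-based steps are hypothetical and much simpler than what generate_html produces.

```python
from html import escape


def render_steps(title, steps):
    """Hypothetical renderer: one HTML string, CSS inlined, no external assets."""
    rows = "\n".join(
        f"<tr><td>{i}</td><td>{escape(str(s['tool']))}</td>"
        f"<td>{escape(str(s['result']))}</td></tr>"
        for i, s in enumerate(steps, start=1)
    )
    return (
        "<!doctype html>\n"
        f"<html><head><meta charset='utf-8'><title>{escape(title)}</title>\n"
        "<style>td{padding:4px 8px;border-bottom:1px solid #ddd}</style>\n"
        "</head><body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"<table>\n{rows}\n</table>\n"
        "</body></html>\n"
    )


page = render_steps("my-agent", [
    {"tool": "get_weather", "result": "72F"},
    {"tool": "search", "result": "<b>raw</b> & unescaped"},
])
print(page)
```

Escaping tool results matters because they are arbitrary strings from your tools. Without it, a result containing markup would be interpreted by the browser instead of displayed.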

Metrics

Metrics read a finished Trace and return pass/fail. All but LLMJudge are deterministic and free.

Metric	Checks
`Contains(text)`	final output contains `text`
`Regex(pattern)`	final output matches a regex
`Equals(value)`	final output equals `value`
`ToolWasCalled(name)`	the agent called that tool
`MaxSteps(n)`	run used ≤ n steps
`MaxTokens(n)`	run used ≤ n tokens
`NoErrors()`	no step raised an error
`NoLoops()`	no repeated-tool loop
`LLMJudge(criteria)`	Claude grades the output (opt-in)

from agentlens_eval import metrics

checks = [
    metrics.Contains("72"),
    metrics.ToolWasCalled("get_weather"),
    metrics.MaxSteps(5),
    metrics.NoLoops(),
]

Evals & datasets

For evals, an agent is any callable that takes a case input and returns a Trace: run a Tracer inside it and return tracer.trace. A Case pairs an input with the metrics it must pass.

from agentlens_eval import Tracer, Eval, Case, metrics

def my_agent(question: str):
    tracer = Tracer("support-agent")
    with tracer.step(question) as step:
        if "weather" in question:
            step.tool_call("get_weather", {"city": "NYC"}, result="72F")
        else:
            step.result = "Sorry, I can't help with that."
    return tracer.trace

report = Eval("support-agent", my_agent).run([
    Case("weather in NYC",
         [metrics.Contains("72"), metrics.ToolWasCalled("get_weather")],
         name="weather"),
    Case("tell me a joke", [metrics.MaxSteps(3)], name="out-of-scope"),
])

report.summary()                       # prints pass/fail per case
print(report.pass_rate)                # 1.0

If your agent raises, the case is marked failed (with the error) — the run never crashes.
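
The "failed case, not a crashed run" guarantee can be pictured as a try/except around each case. This is a simplified sketch under stated assumptions: run_cases, pass_rate, and flaky_agent are hypothetical, the agent returns a plain string rather than a Trace, and checks are bare callables rather than agentlens metrics.

```python
def run_cases(agent, cases):
    """Hypothetical runner: cases is a list of (name, input, checks)."""
    results = []
    for name, case_input, checks in cases:
        try:
            output = agent(case_input)
        except Exception as exc:
            # A crashing agent fails this case only; the loop continues.
            results.append({"name": name, "passed": False,
                            "error": f"{type(exc).__name__}: {exc}"})
            continue
        passed = all(check(output) for check in checks)
        results.append({"name": name, "passed": passed, "error": None})
    return results


def pass_rate(results):
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


def flaky_agent(question):
    if "joke" in question:
        raise RuntimeError("no joke tool")
    return "It is 72F in NYC."


results = run_cases(flaky_agent, [
    ("weather", "weather in NYC", [lambda out: "72" in out]),
    ("joke", "tell me a joke", [lambda out: len(out) > 0]),
])
print(results, pass_rate(results))
```

The second case fails with the recorded error while the first still passes, so one bad input cannot hide the results of the rest of the dataset.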

pytest & CI

Turn a dataset into one pytest row per case with parametrize, and assert with check:

from agentlens_eval import Case, metrics
from agentlens_eval.testing import parametrize, check

DATASET = [
    Case("weather in NYC", [metrics.Contains("72")], name="weather"),
    Case("cancel my plan", [metrics.ToolWasCalled("cancel_plan")], name="cancel"),
]

@parametrize(DATASET)
def test_agent(case):
    check(my_agent, case).assert_passed()

$ pytest -v

Prefer a single gate? Assert on the whole run:

from agentlens_eval import Eval

Eval("support-agent", my_agent).run(DATASET).assert_passed(min_pass_rate=0.9)
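
As a worked example of the arithmetic behind a pass-rate gate: with 10 cases and min_pass_rate=0.9, nine passing cases clear the gate and eight do not. The gate function below is a hypothetical sketch, and it assumes the threshold is inclusive. The source does not say whether agentlens compares with >= or >.

```python
def gate(passed: int, total: int, min_pass_rate: float = 1.0) -> None:
    """Hypothetical gate: raise AssertionError when the pass rate is too low."""
    rate = passed / total if total else 0.0
    # Assumption: the threshold is inclusive, so rate == min_pass_rate passes.
    if rate < min_pass_rate:
        raise AssertionError(
            f"pass rate {rate:.2f} is below the required {min_pass_rate:.2f}"
        )


gate(9, 10, min_pass_rate=0.9)  # 0.90 meets the threshold, no exception

try:
    gate(8, 10, min_pass_rate=0.9)  # 0.80 is below the threshold
except AssertionError as exc:
    print(exc)
```

A single gate like this tolerates a small number of flaky cases, at the cost of not telling you in the test summary which case failed.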

LLM-as-judge

For open-ended output where assertions fall short, LLMJudge grades against criteria with Claude. It's the only metric that costs tokens, so keep it opt-in alongside cheap checks.

from agentlens_eval import metrics

metrics.LLMJudge("politely and correctly states the 30-day refund policy")

# options:
metrics.LLMJudge(
    "the answer is correct and cites a source",
    model="claude-opus-4-8",     # default
    include_transcript=True,     # let the judge see the steps, not just the answer
    client=my_anthropic_client,  # inject a client (e.g. for tests)
)

The judge uses structured outputs to return a strict pass/fail plus a short reason. If the API call fails, the metric fails with a "judge error: ..." detail instead of crashing your run.
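
The fail-instead-of-crash behavior and the injectable client can both be illustrated without calling any API. In the sketch below, judge, fake_ok, and fake_down are hypothetical, and call_model stands in for whatever sends the prompt to a model. It assumes a JSON verdict with passed and reason keys, which is not necessarily the schema LLMJudge uses.

```python
import json


def judge(output, criteria, call_model):
    """Hypothetical judge wrapper.

    call_model is any callable that takes a prompt string and returns a JSON
    string shaped like {"passed": true, "reason": "..."}.
    """
    prompt = f"Criteria: {criteria}\nOutput: {output}\nReply with JSON."
    try:
        verdict = json.loads(call_model(prompt))
        passed = verdict["passed"]
        if not isinstance(passed, bool):
            raise TypeError("'passed' must be a boolean")
        return {"passed": passed, "detail": str(verdict.get("reason", ""))}
    except Exception as exc:
        # Any API or parsing failure becomes a failed verdict, not a crash.
        return {"passed": False, "detail": f"judge error: {exc}"}


def fake_ok(prompt):
    """Test double that always approves."""
    return '{"passed": true, "reason": "states the 30-day policy"}'


def fake_down(prompt):
    """Test double that simulates an unreachable API."""
    raise ConnectionError("API unreachable")


good = judge("Refunds are accepted within 30 days.", "states the policy", fake_ok)
bad = judge("Refunds are accepted within 30 days.", "states the policy", fake_down)
print(good)
print(bad)
```

Injecting the model call this way is what lets a test suite exercise judge-based checks deterministically and without spending tokens.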

Examples

The repo ships four runnable examples — three need no API key:

example.py — a trace with a loop → report.html
example_my_agent.py — integration into a tool-exposing loop (fake model)
example_eval.py — a dataset eval → eval_report.html
example_claude.py — a real Anthropic SDK tool-use loop

$ python example_my_agent.py
