Documentation
agentlens is local-first observability + evals for AI agents. Trace runs in development, CI, or production, catch stuck loops, score behavior, and keep reports portable — with no vendor backend and zero runtime dependencies.
Installation
Requires Python 3.10+. The core has no dependencies.
pip install agentlens-eval
The LLMJudge metric also needs the Anthropic SDK. Install it only if you use that metric:
pip install anthropic
Quickstart
Wrap each turn of your agent loop in tracer.step(), then render a report:
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

tracer = Tracer("my-agent")
with tracer.step("look up the weather") as step:
    step.tool_call("get_weather", {"city": "NYC"}, result="72F", tokens=120)

generate_html(tracer.trace, "report.html",
              warnings=detect_loops(tracer.trace),
              open_browser=True)
AI-readable docs
For coding agents, tools, and LLM crawlers, use the plain-text docs map:
https://agentlens-eval.com/llms.txt
It includes the install command, import path, public API summary, integration snippet, eval workflow, and generated report outputs in a compact format.
Add it to your agent
This is the shape of a real integration. Keep your loop, model call, and tool execution where they already are; add the tracer around each turn and record what happened.
import anthropic
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# TOOLS, final_text, tool_calls, and run_tool are your own agent code.
def run_agent(question):
    tracer = Tracer("my-agent")
    messages = [{"role": "user", "content": question}]
    while True:
        with tracer.step() as step:
            resp = client.messages.create(
                model="claude-opus-4-8",
                max_tokens=1024,  # required by the Messages API
                tools=TOOLS,
                messages=messages,
            )
            step.tokens = resp.usage.output_tokens
            if resp.stop_reason != "tool_use":
                step.result = final_text(resp)
                break
            for call in tool_calls(resp):
                out = run_tool(call)  # you run the tool
                step.tool_call(call.name, call.input, result=out)
                # feed `out` back to the model, as you already do
    generate_html(tracer.trace, warnings=detect_loops(tracer.trace))
What you add: Tracer(...), with tracer.step(), step.tool_call(...), step.result, and generate_html(...). Everything else stays your agent code.
Core concepts
- Tracer: the recorder. Create one per run.
- Step: one iteration of your loop. Holds a thought, an optional tool call, result, tokens, latency, error.
- Trace: the full run, a list of steps plus totals (total_steps, total_tokens, final_output).
- Lenses: anything that reads a Trace, such as detect_loops, metrics, Eval, and the HTML report.
One trace, many lenses. New features read the same object — your integration never changes.
Tracing your loop
Use the step() context manager so latency is timed automatically. Record a tool call with step.tool_call(...), or set step.result for a final answer:
tracer = Tracer("agent")
done = False
while not done:
    with tracer.step("deciding what to do") as step:
        decision = llm(messages, tools)  # your model call
        if decision.wants_tool:
            out = run_tool(decision.name, decision.args)  # you run the tool
            step.tool_call(decision.name, decision.args,
                           result=out, tokens=decision.tokens)
        else:
            step.result = decision.text
            done = True
Any exception inside the with block is recorded on the step (as step.error) and re-raised, so failures show up in the report.
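To make the record-and-re-raise behavior concrete, here is a minimal stand-alone sketch of a timing context manager built on the standard library. It is not the agentlens implementation; StepRecord and timed_step are hypothetical names used only for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepRecord:
    """Hypothetical stand-in for a traced step (not the agentlens Step class)."""
    thought: str
    latency_s: float = 0.0
    error: Optional[str] = None


@contextmanager
def timed_step(steps, thought):
    """Time the body of the with block; record any exception, then re-raise it."""
    record = StepRecord(thought)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record.error = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the failure
    finally:
        record.latency_s = time.perf_counter() - start
        steps.append(record)  # the step is kept even when it failed


steps = []
try:
    with timed_step(steps, "call a flaky tool"):
        raise TimeoutError("tool timed out")
except TimeoutError:
    pass  # handled by the caller; the record already holds the error

print(steps[0].error)  # TimeoutError: tool timed out
```

Because the record is appended in the finally clause, a failed step still appears in the list, which is the property that lets a report show where a run broke.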
Loop detection
detect_loops(trace) returns warnings when the agent repeats a tool call. It flags two shapes:
- consecutive: the same (tool, args) back-to-back (severity error)
- frequent: the same (tool, args) 3+ times anywhere (severity warning)
from agentlens_eval import detect_loops

for w in detect_loops(tracer.trace):
    print(w.severity, w.message)
    # error `get_weather` called 3x in a row with identical args ... likely stuck in a loop
Tune the thresholds: detect_loops(trace, consecutive_threshold=2, frequent_threshold=3).
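The two shapes and their thresholds can be illustrated with a small stand-alone sketch. This is an assumption about how such a check could work, not the detect_loops source; find_repeats and its return format are hypothetical, and only the two threshold parameter names are taken from the call above.

```python
import json
from collections import Counter


def find_repeats(calls, consecutive_threshold=2, frequent_threshold=3):
    """Return (kind, tool, count) tuples for repeated (tool, args) calls.

    `calls` is a list of (tool_name, args_dict) pairs. Hypothetical helper,
    not the agentlens detect_loops implementation.
    """
    # Serialize args with sorted keys so equal dicts compare equal as strings.
    keys = [(tool, json.dumps(args, sort_keys=True)) for tool, args in calls]
    findings = []

    # Consecutive: measure each run of identical neighbours.
    run_start = 0
    for i in range(1, len(keys) + 1):
        if i == len(keys) or keys[i] != keys[run_start]:
            run_len = i - run_start
            if run_len >= consecutive_threshold:
                findings.append(("consecutive", keys[run_start][0], run_len))
            run_start = i

    # Frequent: count every occurrence, regardless of position.
    for (tool, _), count in Counter(keys).items():
        if count >= frequent_threshold:
            findings.append(("frequent", tool, count))
    return findings


calls = [
    ("get_weather", {"city": "NYC"}),
    ("get_weather", {"city": "NYC"}),
    ("search", {"q": "umbrella"}),
    ("get_weather", {"city": "NYC"}),
]
print(find_repeats(calls))
# [('consecutive', 'get_weather', 2), ('frequent', 'get_weather', 3)]
```

Raising consecutive_threshold makes the back-to-back check more tolerant of legitimate retries, while frequent_threshold governs repeats that are spread across the run.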
HTML reports
Render a single trace, or a whole eval run, to a self-contained HTML file (inline CSS, no assets):
from agentlens_eval.report import generate_html, generate_eval_html
generate_html(trace, "report.html", open_browser=True) # one run
generate_eval_html(report, "eval_report.html", open_browser=True) # a dataset
The trace report shows a step timeline, per-step tool calls, totals (steps / tokens / latency / errors), and highlighted loop warnings. The eval report adds a pass-rate header and per-case pass/fail with an expandable trace.
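For readers wondering what "self-contained" means in practice, the sketch below builds a single HTML string with its CSS inlined and all content escaped. It is only an illustration under that assumption; render_report is a hypothetical helper and does not reflect the markup that generate_html produces.

```python
import html


def render_report(title, rows):
    """Render (step, detail) rows into one HTML string with inline CSS."""
    body = "\n".join(
        f"<tr><td>{html.escape(str(step))}</td><td>{html.escape(str(detail))}</td></tr>"
        for step, detail in rows
    )
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<title>{html.escape(title)}</title>"
        # The stylesheet lives in the page itself, so no asset files are needed.
        "<style>td{padding:4px 8px;border-bottom:1px solid #ddd}</style>"
        f"</head><body><h1>{html.escape(title)}</h1><table>{body}</table></body></html>"
    )


page = render_report("run <1>", [("step 1", "get_weather"), ("step 2", "final answer")])
print(page.startswith("<!doctype html>"))  # True
```

A file produced this way can be attached to a CI artifact or emailed as-is, since it references no external stylesheet or script.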
Metrics
Metrics read a finished Trace and return pass/fail. All but LLMJudge are deterministic and free.
| Metric | Checks |
|---|---|
| Contains(text) | final output contains text |
| Regex(pattern) | final output matches a regex |
| Equals(value) | final output equals value |
| ToolWasCalled(name) | the agent called that tool |
| MaxSteps(n) | run used ≤ n steps |
| MaxTokens(n) | run used ≤ n tokens |
| NoErrors() | no step raised an error |
| NoLoops() | no repeated-tool loop |
| LLMJudge(criteria) | Claude grades the output (opt-in) |
from agentlens_eval import metrics

checks = [
    metrics.Contains("72"),
    metrics.ToolWasCalled("get_weather"),
    metrics.MaxSteps(5),
    metrics.NoLoops(),
]
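To show why the deterministic metrics are cheap, here is a stand-alone sketch of the idea: each check is a plain function over a finished trace. MiniTrace and the three check builders are hypothetical stand-ins written for this example; they are not the agentlens classes and may differ from how the library structures its metrics.

```python
from dataclasses import dataclass, field


@dataclass
class MiniTrace:
    """Hypothetical stand-in for a finished trace (not the agentlens Trace class)."""
    final_output: str
    tools_called: list = field(default_factory=list)
    total_steps: int = 0


def contains(text):
    """Build a check that passes when the final output contains `text`."""
    return lambda trace: text in trace.final_output


def tool_was_called(name):
    """Build a check that passes when `name` appears among the tools called."""
    return lambda trace: name in trace.tools_called


def max_steps(n):
    """Build a check that passes when the run used at most `n` steps."""
    return lambda trace: trace.total_steps <= n


trace = MiniTrace(final_output="It is 72F in NYC.", tools_called=["get_weather"], total_steps=2)
checks = {
    "contains 72": contains("72"),
    "called get_weather": tool_was_called("get_weather"),
    "at most 1 step": max_steps(1),
}
results = {label: check(trace) for label, check in checks.items()}
print(results)
# {'contains 72': True, 'called get_weather': True, 'at most 1 step': False}
```

None of these checks calls a model or the network, so they can run on every commit at no cost.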
Evals & datasets
For evals, an agent is any callable that takes a case input and returns a Trace (run a Tracer inside and return tracer.trace). A Case pairs an input with the metrics it must pass.
from agentlens_eval import Tracer, Eval, Case, metrics

def my_agent(question: str):
    tracer = Tracer("support-agent")
    with tracer.step(question) as step:
        if "weather" in question:
            step.tool_call("get_weather", {"city": "NYC"}, result="72F")
        else:
            step.result = "Sorry, I can't help with that."
    return tracer.trace

report = Eval("support-agent", my_agent).run([
    Case("weather in NYC",
         [metrics.Contains("72"), metrics.ToolWasCalled("get_weather")],
         name="weather"),
    Case("tell me a joke", [metrics.MaxSteps(3)], name="out-of-scope"),
])

report.summary()          # prints pass/fail per case
print(report.pass_rate)   # 1.0
pytest & CI
Turn a dataset into one pytest row per case with parametrize, and assert with check:
from agentlens_eval import Case, metrics
from agentlens_eval.testing import parametrize, check

# my_agent is the agent callable defined in the previous section.
DATASET = [
    Case("weather in NYC", [metrics.Contains("72")], name="weather"),
    Case("cancel my plan", [metrics.ToolWasCalled("cancel_plan")], name="cancel"),
]

@parametrize(DATASET)
def test_agent(case):
    check(my_agent, case).assert_passed()

pytest -v

Prefer a single gate? Assert on the whole run:
Eval("support-agent", my_agent).run(DATASET).assert_passed(min_pass_rate=0.9)
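The arithmetic behind a pass-rate gate is simple enough to sketch on its own. The version below assumes a case passes only when all of its metrics pass; pass_rate and gate are hypothetical helpers for illustration and are not the agentlens assert_passed implementation.

```python
def pass_rate(results):
    """Fraction of cases whose checks all passed.

    `results` maps a case name to a list of booleans, one per metric.
    """
    if not results:
        return 0.0
    passed = sum(1 for checks in results.values() if all(checks))
    return passed / len(results)


def gate(results, min_pass_rate=1.0):
    """Raise AssertionError when the pass rate falls below the threshold."""
    rate = pass_rate(results)
    if rate < min_pass_rate:
        failed = [name for name, checks in results.items() if not all(checks)]
        raise AssertionError(
            f"pass rate {rate:.2f} < {min_pass_rate:.2f}; failed: {failed}"
        )
    return rate


# Ten cases: nine pass every metric, one fails a single metric.
ten_cases = {f"case-{i}": [True] for i in range(9)}
ten_cases["case-9"] = [True, False]

print(gate(ten_cases, min_pass_rate=0.9))  # 0.9
```

With a 0.9 threshold this dataset passes with one failing case. A dataset of two cases would need both to pass, since one failure drops the rate to 0.5.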
LLM-as-judge
For open-ended output where assertions fall short, LLMJudge grades against criteria with Claude. It's the only metric that costs tokens, so keep it opt-in alongside cheap checks.
from agentlens_eval import metrics

metrics.LLMJudge("politely and correctly states the 30-day refund policy")

# options:
metrics.LLMJudge(
    "the answer is correct and cites a source",
    model="claude-opus-4-8",      # default
    include_transcript=True,      # let the judge see the steps, not just the answer
    client=my_anthropic_client,   # inject a client (e.g. for tests)
)
The judge uses structured outputs to return a strict pass/fail plus a short reason. If the API call fails, the metric fails with a judge error: ... detail instead of crashing your run.
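The fail-instead-of-crash behavior can be pictured as a thin wrapper around the model call. The sketch below is an assumption about the general pattern, not the LLMJudge source; judge_safely and the two fake judges are hypothetical, and no real API is contacted.

```python
def judge_safely(call_judge, output):
    """Run a judge callable and convert any exception into a failed result.

    `call_judge` is a hypothetical callable returning (passed, reason);
    it stands in for the real model call.
    """
    try:
        passed, reason = call_judge(output)
    except Exception as exc:
        return False, f"judge error: {exc}"
    return bool(passed), reason


def broken_judge(output):
    """Fake judge that simulates an API outage."""
    raise ConnectionError("API unreachable")


def lenient_judge(output):
    """Fake judge that always approves."""
    return True, "states the policy correctly"


print(judge_safely(broken_judge, "Refunds within 30 days."))
# (False, 'judge error: API unreachable')
print(judge_safely(lenient_judge, "Refunds within 30 days."))
# (True, 'states the policy correctly')
```

Treating an outage as a failed metric keeps the rest of the dataset running, at the price of a red case that has to be told apart from a genuine rejection by reading the reason.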
Examples
The repo ships four runnable examples — three need no API key:
- example.py: a trace with a loop → report.html
- example_my_agent.py: integration into a tool-exposing loop (fake model)
- example_eval.py: a dataset eval → eval_report.html
- example_claude.py: a real Anthropic SDK tool-use loop
python example_my_agent.py