$ inspect agent runs locally

agentlens

Trace every step your AI agent takes in development, CI, or production. Catch stuck loops, score behavior, and keep a local record of what happened when a run goes sideways.

stdlib-only · self-contained reports · pytest-ready evals · no vendor backend
$ pip install agentlens-eval
agentlens.py README.md
"""
agentlens: see what your agent actually did.
local-first observability + evals for AI agents.
trace every step · catch stuck loops · score behavior.
no backend · no account · stdlib-only.
"""
from agentlens_eval import Tracer, Eval, metrics

tracer = Tracer("my-agent")                    # one Trace, many lenses
report = Eval("my-agent", agent).run(dataset)  # → pass_rate
from your_stack import Anthropic, OpenAI, LangChain, LlamaIndex # or "any python callable"
# add it to your agent

# agentlens only observes — wrap the loop you already have. the green + lines are all you add

agent.py
import anthropic
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()

# TOOLS, tool_calls(), run_tool() and final_text() are your own existing helpers
def run_agent(question):
    tracer = Tracer("my-agent")
    messages = [{"role": "user", "content": question}]
    while True:
        with tracer.step() as step:
            resp = client.messages.create(
                model="claude-opus-4-8", max_tokens=1024,
                tools=TOOLS, messages=messages)
            step.tokens = resp.usage.output_tokens
            if resp.stop_reason != "tool_use":
                step.result = final_text(resp)
                break
            for call in tool_calls(resp):
                out = run_tool(call)            # you run the tool
                step.tool_call(call.name, call.input, result=out)
                # ...feed `out` back to the model, as you already do
    generate_html(tracer.trace, warnings=detect_loops(tracer.trace))
  1. Import the tracer + report helper.
  2. Make one Tracer at the start of a run.
  3. Wrap each turn in with tracer.step() — your existing code goes inside, unchanged.
  4. Record as it happens: step.tokens, step.tool_call(...), step.result.
  5. Call generate_html() at the end → your report + loop warnings.
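
For intuition about what the five steps above record, here is a minimal stdlib-only sketch of a step recorder. It is not agentlens's implementation: `SketchTracer` and `SketchStep` are hypothetical stand-ins, and the assumption is only that a step is a context manager that times the turn and collects whatever you assign to it.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchStep:
    """One turn of the agent loop (hypothetical stand-in, not agentlens's own class)."""
    index: int
    tokens: int = 0
    result: Optional[str] = None
    error: Optional[str] = None
    latency_ms: float = 0.0
    tool_calls: list = field(default_factory=list)

    def tool_call(self, name, args, result=None):
        # Only records the call; the caller has already run the tool itself.
        self.tool_calls.append({"name": name, "args": args, "result": result})


class SketchTracer:
    """Collects steps in memory; it observes the loop and never drives it."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    @contextmanager
    def step(self):
        current = SketchStep(index=len(self.steps) + 1)
        started = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            current.error = repr(exc)  # keep the failure in the trace
            raise                      # and still let the caller see it
        finally:
            current.latency_ms = (time.perf_counter() - started) * 1000
            self.steps.append(current)


tracer = SketchTracer("my-agent")
with tracer.step() as step:
    step.tokens = 10
    step.tool_call("search", {"q": "weather"}, result="sunny")
with tracer.step() as step:
    step.tokens = 12
    step.result = "answer for weather"

total_tokens = sum(s.tokens for s in tracer.steps)
```

Because the recorder only wraps the turn, the code inside the `with` block stays exactly as it was; a `break` inside the block still closes the step cleanly.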
# output

# the thing agentlens writes for you: one local HTML file, no server required

powershell - local run
$ python example_my_agent.py
trace: smoke-agent
steps: 2 tokens: 22 errors: 0
warning: search repeated at steps 1, 2
wrote: report.html
$ start report.html
dev: open the report while debugging a run
ci: fail a build when evals miss the bar
prod: save the report when a live run misbehaves

The core output is still just a local Trace plus HTML. No hosted dashboard is required.
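
To make "one local HTML file, no server" concrete, here is a rough sketch of how a self-contained report can be produced with the standard library alone. `render_report` and the step dictionaries are hypothetical and are not the layout or API of agentlens's `generate_html`; the point is only that inline CSS plus escaped data needs no external assets.

```python
import html
import json
import tempfile
from pathlib import Path


def render_report(name, steps, warnings=()):
    """Return one HTML string with inline CSS and no external assets."""
    rows = []
    for i, s in enumerate(steps, start=1):
        # Escape everything that came from the run before embedding it.
        detail = html.escape(json.dumps(s, sort_keys=True))
        rows.append(f"<li>Step {i}: <code>{detail}</code></li>")
    warn = "".join(f"<p class='warn'>{html.escape(w)}</p>" for w in warnings)
    total = sum(s.get("tokens", 0) for s in steps)
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<title>{html.escape(name)}</title>"
        "<style>body{font-family:monospace}.warn{color:#b00}</style></head>"
        f"<body><h1>{html.escape(name)}</h1>"
        f"<p>{len(steps)} steps, {total} tokens</p>"
        f"{warn}<ol>{''.join(rows)}</ol></body></html>"
    )


steps = [
    {"tokens": 10, "tool": "search", "args": {"q": "weather"}},
    {"tokens": 12, "result": "answer for weather"},
]
page = render_report("smoke-agent", steps, warnings=["search repeated at steps 1, 2"])

# Any local path works; the temp directory is used here to keep the sketch tidy.
out_path = Path(tempfile.gettempdir()) / "sketch_report.html"
out_path.write_text(page, encoding="utf-8")
```

The resulting file can be attached to a CI artifact or a bug report as-is, since everything it needs is inside it.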

smoke-agent

agentlens trace report

report.html
2 steps
22 tokens
18 ms total
0 errors
Loop detection
ERROR

search called 2x in a row with identical args {"q": "weather"} (steps 1, 2) — likely stuck in a loop

Timeline
Step 1 · 10 tok · 8 ms
first lookup
tool: search
{"q": "weather"}
Step 2 · 12 tok · 10 ms
retry same lookup
result: answer for weather
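
The loop warning in the report above describes a simple rule: the same tool called several times in a row with identical arguments. A minimal sketch of that rule follows. `find_repeated_calls` and its tuple input format are assumptions made for illustration; agentlens's own `detect_loops` may use different inputs and thresholds.

```python
import json
from itertools import groupby


def call_key(call):
    """Identity of a call: tool name plus canonical JSON of its arguments."""
    _, name, args = call
    # Sorted keys make {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
    return name, json.dumps(args, sort_keys=True)


def find_repeated_calls(calls, min_repeats=2):
    """Flag runs of consecutive identical tool calls.

    `calls` is a list of (step_number, tool_name, args_dict) tuples.
    """
    warnings = []
    # groupby only merges neighbours, so this finds streaks, not total counts.
    for (name, args_json), group in groupby(calls, key=call_key):
        run = list(group)
        if len(run) >= min_repeats:
            steps = ", ".join(str(step) for step, _, _ in run)
            warnings.append(
                f"{name} called {len(run)}x in a row with identical args "
                f"{args_json} (steps {steps})"
            )
    return warnings


calls = [
    (1, "search", {"q": "weather"}),
    (2, "search", {"q": "weather"}),
    (3, "get_weather", {"city": "NYC"}),
]
warnings = find_repeated_calls(calls)
```

Comparing only neighbouring calls keeps the check cheap and deterministic, at the cost of missing loops that alternate between two tools.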
# features

# one Trace, many lenses — every feature reads the same run object

01 / trace

see the run

Record each turn with tool calls, tokens, latency, final output, and errors in one local trace.

  • tracer.step()
  • generate_html(trace)
02 / detect

catch loops

Flag repeated tool calls and token sinks before they become invisible cost or flaky behavior.

  • detect_loops(trace)
  • metrics.NoLoops()
03 / evaluate

gate behavior

Run deterministic metrics, optional LLM judging, and pytest rows against the same trace object.

  • Eval(...).run(dataset)
  • report.assert_passed()
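
As a rough picture of how a pass-rate gate behaves, the sketch below runs deterministic checks over a small dataset and raises when the rate falls under a threshold. Every name in it (`SketchCase`, `contains`, `run_cases`, `gate`, `fake_agent`) is hypothetical and is not the `Eval` / `Case` / `metrics` API; the assumption is only that a case passes when all of its checks pass.

```python
from dataclasses import dataclass


@dataclass
class SketchCase:
    prompt: str
    checks: list  # each check is a callable: output string -> bool


def contains(needle):
    """Deterministic check: the output must include `needle`."""
    return lambda output: needle in output


def run_cases(agent, cases):
    """Run every case; a case passes only if all of its checks pass."""
    passed = 0
    for case in cases:
        output = agent(case.prompt)
        if all(check(output) for check in case.checks):
            passed += 1
    return passed / len(cases) if cases else 0.0


def gate(pass_rate, min_pass_rate):
    """Raise AssertionError so a CI job fails when evals miss the bar."""
    if pass_rate < min_pass_rate:
        raise AssertionError(
            f"pass rate {pass_rate:.0%} is below {min_pass_rate:.0%}"
        )


def fake_agent(prompt):
    # Stand-in for a real agent: canned answers keep the run deterministic.
    return {"weather in NYC": "It is 72F and sunny."}.get(prompt, "I am not sure.")


cases = [
    SketchCase("weather in NYC", [contains("72")]),
    SketchCase("cancel my plan", [contains("cancelled")]),
]
pass_rate = run_cases(fake_agent, cases)  # one of the two cases passes
```

Calling `gate(pass_rate, 0.9)` here raises, which is what turns a missed bar into a failed build; `gate(pass_rate, 0.5)` returns quietly.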
# showcase

# a few lines woven into the loop you already have — agentlens only observes, never drives

from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

tracer = Tracer("my-agent")
done = False                             # loop flag, set once the model answers

while not done:                          # your loop, unchanged
    with tracer.step() as step:          # one "page" per turn
        decision = llm(messages, tools)
        if decision.wants_tool:
            result = run_tool(decision.name, decision.args)
            step.tool_call(decision.name, decision.args, result=result)
        else:
            step.result = decision.text
            done = True

generate_html(tracer.trace, warnings=detect_loops(tracer.trace),
              open_browser=True)

from agentlens_eval import Eval, Case, metrics

report = Eval("support-agent", my_agent).run([
    Case("weather in NYC",
         [metrics.Contains("72"), metrics.ToolWasCalled("get_weather")]),
    Case("cancel my plan",
         [metrics.ToolWasCalled("cancel_plan"), metrics.NoLoops()]),
])

report.summary()                 # -> 2/2 cases passed (100%)
report.assert_passed(min_pass_rate=0.9)   # gate CI

from agentlens_eval import Case, metrics
from agentlens_eval.testing import parametrize, check

DATASET = [
    Case("weather in NYC", [metrics.Contains("72")], name="weather"),
    Case("cancel my plan", [metrics.ToolWasCalled("cancel_plan")], name="cancel"),
]

@parametrize(DATASET)            # one pytest row per case
def test_agent(case):
    check(my_agent, case).assert_passed()

# $ pytest -v
# test_agent[weather] PASSED
# test_agent[cancel]  PASSED

from agentlens_eval import Eval, Case, metrics

# deterministic checks stay cheap; the judge is opt-in
report = Eval("support-agent", my_agent).run([
    Case("explain the refund policy", [
        metrics.ToolWasCalled("lookup_policy"),
        metrics.LLMJudge(
            "politely and correctly states the 30-day refund policy"
        ),
    ]),
])
report.summary()

import anthropic
from agentlens_eval import Tracer
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()
tracer = Tracer("weather-agent")
messages = [{"role": "user", "content": "Weather in NYC?"}]

while True:
    with tracer.step() as step:
        resp = client.messages.create(
            model="claude-opus-4-8", max_tokens=1024,
            thinking={"type": "adaptive"}, tools=tools, messages=messages)
        step.tokens = resp.usage.output_tokens      # real token usage
        if resp.stop_reason != "tool_use":
            step.result = next(b.text for b in resp.content if b.type == "text")
            break
        # ... record step.tool_call(...) and feed results back

generate_html(tracer.trace, open_browser=True)
# why

# built for the dev loop, not a dashboard

# comparison                      agentlens               typical platform
runs locally, no backend          True  # stdlib-only     needs Postgres / cloud
understands the whole run         True  # step + tool     often final answer only
cheap & deterministic by default  True  # judge opt-in    LLM-judge everywhere
catches stuck loops               True  # built in        manual
adoption cost                     ~5 lines                account + SDK + config
# changelog

# built in the open · git log --oneline

0.3  LLM-as-judge. Opt-in LLMJudge("criteria") grades open-ended output via Claude structured outputs.
0.2  Eval layer + CI. Deterministic metrics, an Eval/Case runner with pass_rate, pytest rows, HTML eval report.
0.1  Observability core. Step tracing, loop detection, self-contained HTML trace report. Stdlib-only.
# install
bash — your machine
$ pip install agentlens-eval
Successfully installed agentlens-eval-0.3.0 # 0 dependencies
$ python -c "import agentlens_eval"
MIT licensed · 0 dependencies · no backend · py 3.10+