# agentlens-eval

> Local-first observability and evals for Python AI agents. Trace runs, detect repeated-tool loops, score behavior, and write portable HTML reports. No hosted backend. Core runtime has zero dependencies.

- Canonical site: https://agentlens-eval.com/
- Human docs: https://agentlens-eval.com/docs
- PyPI: https://pypi.org/project/agentlens-eval/
- Repository: https://github.com/Harshk10-star/agentlens

## Install

```bash
pip install agentlens-eval
```

Import package:

```python
import agentlens_eval
from agentlens_eval import Tracer, detect_loops, Eval, Case, metrics
from agentlens_eval.report import generate_html, generate_eval_html
```

## Core concepts

- `Tracer`: create one per agent run.
- `Step`: one loop iteration. Can hold thought, tool call, args, result, tokens, latency, and error.
- `Trace`: full run record. Includes steps and aggregate properties.
- `detect_loops(trace)`: returns warnings for repeated tool calls.
- `generate_html(trace, ...)`: writes a self-contained trace report.
- `Eval`: runs an agent over a dataset of `Case` objects.
- `metrics`: deterministic checks plus optional `LLMJudge`.
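To make "warnings for repeated tool calls" concrete, here is a minimal standalone sketch of one way such a check could work. It is not the library's implementation: `sketch_detect_loops` and the `(tool, args)` tuple format are hypothetical, and the meaning given to the two thresholds (identical calls back to back, and identical calls anywhere in the run) is an assumption based only on the parameter names in the API summary.

```python
from collections import Counter
from typing import Any

# Hypothetical stand-in for a recorded tool call: (tool name, args dict).
ToolCall = tuple[str, dict[str, Any]]


def sketch_detect_loops(
    calls: list[ToolCall],
    consecutive_threshold: int = 2,
    frequent_threshold: int = 3,
) -> list[str]:
    """Return human-readable warnings for repeated identical tool calls."""
    warnings: list[str] = []
    # Make each call hashable so identical tool+args pairs compare equal.
    keys = [(tool, repr(sorted(args.items()))) for tool, args in calls]

    # Assumed rule 1: the same call repeated back to back.
    run = 1
    for prev, cur in zip(keys, keys[1:]):
        run = run + 1 if cur == prev else 1
        if run == consecutive_threshold:  # warn once per run of repeats
            warnings.append(f"{cur[0]} called {run}x in a row with identical args")

    # Assumed rule 2: the same call seen many times across the whole run.
    for key, count in Counter(keys).items():
        if count >= frequent_threshold:
            warnings.append(f"{key[0]} called {count}x in total with identical args")
    return warnings


calls = [
    ("get_weather", {"city": "NYC"}),
    ("get_weather", {"city": "NYC"}),
    ("search", {"q": "forecast"}),
    ("get_weather", {"city": "NYC"}),
]
for warning in sketch_detect_loops(calls):
    print(warning)
```

Comparing tool name plus arguments, rather than tool name alone, keeps an agent that legitimately calls the same tool with different inputs from being flagged.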
## Minimal trace

```python
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

tracer = Tracer("my-agent")

with tracer.step("look up the weather") as step:
    step.tool_call("get_weather", {"city": "NYC"}, result="72F", tokens=120)

generate_html(
    tracer.trace,
    "report.html",
    warnings=detect_loops(tracer.trace),
)
```

## Add to an existing agent loop

```python
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

# `client`, `TOOLS`, `final_text`, `tool_calls`, and `run_tool` are placeholders
# for pieces of your existing agent code.
def run_agent(question):
    tracer = Tracer("my-agent")
    messages = [{"role": "user", "content": question}]

    while True:
        # One step per loop iteration.
        with tracer.step() as step:
            resp = client.messages.create(
                model="claude-opus-4-8",
                max_tokens=1024,  # required by the Anthropic Messages API
                tools=TOOLS,
                messages=messages,
            )
            step.tokens = resp.usage.output_tokens

            if resp.stop_reason != "tool_use":
                step.result = final_text(resp)
                break

            for call in tool_calls(resp):
                out = run_tool(call)
                step.tool_call(call.name, call.input, result=out)
                # Feed `out` back to the model, as your agent already does.

    generate_html(tracer.trace, warnings=detect_loops(tracer.trace))
    return tracer.trace
```

## Eval workflow

```python
from agentlens_eval import Eval, Case, metrics
from agentlens_eval.report import generate_eval_html

# `my_agent` is the agent under test, defined elsewhere.
report = Eval("support-agent", my_agent).run([
    Case("weather in NYC", [
        metrics.Contains("72"),
        metrics.ToolWasCalled("get_weather"),
        metrics.NoLoops(),
    ], name="weather"),
])

report.assert_passed(min_pass_rate=1.0)
generate_eval_html(report, "eval_report.html")
```

## Public API summary

- `Tracer(name="agent-run")`
- `tracer.step(thought="")`
- `step.tool_call(tool, args=None, result=None, tokens=0, latency_ms=0.0, error=None)`
- `trace.total_steps`
- `trace.total_tokens`
- `trace.total_latency_ms`
- `trace.errors`
- `trace.final_output`
- `detect_loops(trace, consecutive_threshold=2, frequent_threshold=3)`
- `generate_html(trace, output_path="report.html", warnings=None, open_browser=False)`
- `generate_eval_html(report, output_path="eval_report.html", open_browser=False)`
- `Case(input, expect, name="")`
- `Eval(name, agent).run(dataset)`
- `report.pass_rate`
- `report.assert_passed(min_pass_rate=1.0)`

## Metrics

- `metrics.Contains(text)`
- `metrics.Regex(pattern)`
- `metrics.Equals(value)`
- `metrics.ToolWasCalled(name)`
- `metrics.MaxSteps(n)`
- `metrics.MaxTokens(n)`
- `metrics.NoErrors()`
- `metrics.NoLoops()`
- `metrics.LLMJudge(criteria, model=..., include_transcript=False, client=None)`

`LLMJudge` is opt-in and uses an Anthropic-compatible client. All other metrics are deterministic and free.

## Generated files

- `report.html`: trace timeline, stats, tool calls, loop warnings.
- `eval_report.html`: pass rate, per-case results, expandable traces.

## Testing

```bash
python -m pytest -q
```

The current suite includes public API smoke tests for tracing, loop detection, HTML reports, eval reports, deterministic metrics, and fake-client `LLMJudge` behavior.
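For readers wiring evals into a test run, the sketch below illustrates how a pass rate and a `min_pass_rate` gate like the `report.pass_rate` and `report.assert_passed(min_pass_rate=1.0)` entries in the API summary might behave. It is a standalone illustration, not the library's code: `SketchCaseResult`, `sketch_pass_rate`, and `sketch_assert_passed` are hypothetical names, and the rule that a case passes only when every one of its metrics passes is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SketchCaseResult:
    """Hypothetical per-case record: one boolean per metric checked."""
    name: str
    metric_passed: list[bool]

    @property
    def passed(self) -> bool:
        # Assumption: a case passes only if every metric passes.
        return all(self.metric_passed)


def sketch_pass_rate(results: list[SketchCaseResult]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def sketch_assert_passed(
    results: list[SketchCaseResult], min_pass_rate: float = 1.0
) -> None:
    """Raise AssertionError when the pass rate falls below the gate."""
    rate = sketch_pass_rate(results)
    if rate < min_pass_rate:
        failed = [r.name for r in results if not r.passed]
        raise AssertionError(
            f"pass rate {rate:.2f} < {min_pass_rate:.2f}; failed: {failed}"
        )


results = [
    SketchCaseResult("weather", [True, True, True]),
    SketchCaseResult("refund", [True, False]),
]
print(sketch_pass_rate(results))
sketch_assert_passed(results, min_pass_rate=0.5)  # gate met, no exception
```

Raising `AssertionError` is what lets a gate of this kind fail a pytest run or CI job without extra glue code.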