# agentlens-eval

> Local-first observability and evals for Python AI agents. Trace runs, detect repeated-tool loops, score behavior, and write portable HTML reports. No hosted backend. Core runtime has zero dependencies.

Canonical site: https://agentlens-eval.com/
Human docs: https://agentlens-eval.com/docs
PyPI: https://pypi.org/project/agentlens-eval/
Repository: https://github.com/Harshk10-star/agentlens

## Install

```bash
pip install agentlens-eval
```

The import name uses an underscore (`agentlens_eval`), unlike the hyphenated PyPI name:

```python
import agentlens_eval
from agentlens_eval import Tracer, detect_loops, Eval, Case, metrics
from agentlens_eval.report import generate_html, generate_eval_html
```

## Core concepts

- `Tracer`: create one per agent run.
- `Step`: one loop iteration. Can hold thought, tool call, args, result, tokens, latency, and error.
- `Trace`: full run record. Includes steps and aggregate properties.
- `detect_loops(trace)`: returns warnings for repeated tool calls.
- `generate_html(trace, ...)`: writes a self-contained trace report.
- `Eval`: runs an agent over a dataset of `Case` objects.
- `metrics`: deterministic checks plus optional `LLMJudge`.

## Minimal trace

```python
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

tracer = Tracer("my-agent")

with tracer.step("look up the weather") as step:
    step.tool_call("get_weather", {"city": "NYC"}, result="72F", tokens=120)

generate_html(
    tracer.trace,
    "report.html",
    warnings=detect_loops(tracer.trace),
)
```

## Add to an existing agent loop

```python
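import anthropic
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# TOOLS, run_tool, tool_calls, and final_text are placeholders for your agent's own code.

def run_agent(question):
    tracer = Tracer("my-agent")
    messages = [{"role": "user", "content": question}]

    while True:
        # One tracer step per model turn.
        with tracer.step() as step:
            resp = client.messages.create(
                model="claude-opus-4-8",
                max_tokens=1024,  # required by the Messages API
                tools=TOOLS,
                messages=messages,
            )
            step.tokens = resp.usage.output_tokens

            # No tool requested: record the final answer and stop.
            if resp.stop_reason != "tool_use":
                step.result = final_text(resp)
                break

            for call in tool_calls(resp):
                out = run_tool(call)
                step.tool_call(call.name, call.input, result=out)
                # Append the assistant turn and `out` to `messages` here, as your
                # agent already does; otherwise every turn repeats the same request.

    # Writes to the default output path, report.html.
    generate_html(tracer.trace, warnings=detect_loops(tracer.trace))
    return tracer.trace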
```

## Eval workflow

```python
from agentlens_eval import Eval, Case, metrics
from agentlens_eval.report import generate_eval_html

# `my_agent` is your own agent callable.

report = Eval("support-agent", my_agent).run([
    Case("weather in NYC", [
        metrics.Contains("72"),
        metrics.ToolWasCalled("get_weather"),
        metrics.NoLoops(),
    ], name="weather"),
])

report.assert_passed(min_pass_rate=1.0)
generate_eval_html(report, "eval_report.html")
```
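
To make the gating step concrete, here is a minimal standalone sketch of what a pass-rate threshold check could look like. It does not use agentlens-eval and is not its implementation: the names `sketch_pass_rate` and `sketch_assert_passed` are hypothetical, and it assumes that a case passes only when all of its metrics pass and that a failed gate raises `AssertionError`.

```python
def sketch_pass_rate(case_results):
    """Fraction of cases whose checks all passed.

    case_results: one list of booleans per case, one boolean per metric.
    Assumption: a case counts as passed only if every metric passed.
    """
    if not case_results:
        return 0.0
    passed = sum(1 for checks in case_results if all(checks))
    return passed / len(case_results)


def sketch_assert_passed(case_results, min_pass_rate=1.0):
    """Raise AssertionError when the pass rate is below the threshold."""
    rate = sketch_pass_rate(case_results)
    if rate < min_pass_rate:
        raise AssertionError(
            f"pass rate {rate:.0%} is below required {min_pass_rate:.0%}"
        )


# Two cases: the first passes both checks, the second fails one.
results = [[True, True], [True, False]]
print(sketch_pass_rate(results))
sketch_assert_passed(results, min_pass_rate=0.5)  # passes the gate
```

A threshold of `1.0` makes any failing case fail the run, which suits CI; a lower threshold may fit suites with known-flaky cases.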

## Public API summary

- `Tracer(name="agent-run")`
- `tracer.step(thought="")`
- `step.tool_call(tool, args=None, result=None, tokens=0, latency_ms=0.0, error=None)`
- `trace.total_steps`
- `trace.total_tokens`
- `trace.total_latency_ms`
- `trace.errors`
- `trace.final_output`
- `detect_loops(trace, consecutive_threshold=2, frequent_threshold=3)`
- `generate_html(trace, output_path="report.html", warnings=None, open_browser=False)`
- `generate_eval_html(report, output_path="eval_report.html", open_browser=False)`
- `Case(input, expect, name="")`
- `Eval(name, agent).run(dataset)`
- `report.pass_rate`
- `report.assert_passed(min_pass_rate=1.0)`
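
The two `detect_loops` thresholds are easier to reason about with an example. The sketch below is a standalone illustration, not the library's algorithm. It assumes that `consecutive_threshold` counts identical back-to-back calls (same tool, same args) and that `frequent_threshold` counts total calls to one tool; `sketch_detect_loops` and its input format are hypothetical.

```python
from collections import Counter
from itertools import groupby


def sketch_detect_loops(calls, consecutive_threshold=2, frequent_threshold=3):
    """Return warning strings for repeated tool calls.

    calls: list of (tool_name, args_dict) tuples in call order.
    Assumed semantics, not taken from the library:
      - consecutive: the same tool with the same args N times in a row
      - frequent: the same tool N times in total, with any args
    """
    warnings = []

    # Build a comparable key per call: tool name plus a stable args repr.
    keyed = [(tool, repr(sorted(args.items()))) for tool, args in calls]

    # groupby collapses runs of equal adjacent keys.
    for (tool, _), run in groupby(keyed):
        n = len(list(run))
        if n >= consecutive_threshold:
            warnings.append(f"{tool} called {n} times in a row with identical args")

    # Count total calls per tool regardless of arguments.
    for tool, n in Counter(tool for tool, _ in calls).items():
        if n >= frequent_threshold:
            warnings.append(f"{tool} called {n} times in total")

    return warnings


calls = [
    ("search", {"q": "a"}),
    ("search", {"q": "a"}),
    ("search", {"q": "b"}),
    ("fetch", {"url": "x"}),
]
for warning in sketch_detect_loops(calls):
    print(warning)
```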

## Metrics

- `metrics.Contains(text)`
- `metrics.Regex(pattern)`
- `metrics.Equals(value)`
- `metrics.ToolWasCalled(name)`
- `metrics.MaxSteps(n)`
- `metrics.MaxTokens(n)`
- `metrics.NoErrors()`
- `metrics.NoLoops()`
- `metrics.LLMJudge(criteria, model=..., include_transcript=False, client=None)`

`LLMJudge` is opt-in and calls a model through an Anthropic-compatible client, so it can incur API cost. All other metrics are deterministic and make no API calls.
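
As a rough illustration of what "deterministic" means here, each check can be thought of as a pure function of the finished run. The sketch below is standalone and hypothetical: the `run` dictionary and the lowercase helper names are assumptions made for this example, not the shape of the library's `metrics` classes.

```python
import re


def contains(text):
    """Pass when `text` appears in the final output."""
    return lambda run: text in run["output"]


def regex(pattern):
    """Pass when `pattern` matches anywhere in the final output."""
    compiled = re.compile(pattern)
    return lambda run: compiled.search(run["output"]) is not None


def tool_was_called(name):
    """Pass when the named tool appears among the recorded tool calls."""
    return lambda run: name in run["tools"]


def max_steps(n):
    """Pass when the run used at most `n` steps."""
    return lambda run: run["steps"] <= n


# Hypothetical summary of one finished run.
run = {"output": "It is 72F in NYC.", "tools": ["get_weather"], "steps": 1}

checks = [contains("72"), regex(r"\d+F"), tool_was_called("get_weather"), max_steps(3)]
results = [check(run) for check in checks]
print(results)
```

Because none of these checks call a model, the same run always yields the same verdicts, which keeps them safe to run on every commit.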

## Generated files

- `report.html`: trace timeline, stats, tool calls, loop warnings.
- `eval_report.html`: pass rate, per-case results, expandable traces.

## Testing

```bash
python -m pytest -q
```

The current suite includes public API smoke tests for tracing, loop detection, HTML reports, eval reports, deterministic metrics, and `LLMJudge` behavior against a fake client.