$ inspect agent runs locally

agentlens

Trace every step your AI agent takes in development, CI, or production. Catch stuck loops, score behavior, and keep a local record of what happened when a run goes sideways.

stdlib-only self-contained reports pytest-ready evals no vendor backend

$pip install agentlens-eval

get_started() read_docs()

agentlens.py README.md

""" agentlens — see what your agent actually did. local-first observability + evals for AI agents. trace every step · catch stuck loops · score behavior. no backend · no account · stdlib-only. """ from agentlens_eval import Tracer, Eval, metrics tracer = Tracer("my-agent") # one Trace, many lenses report = Eval("my-agent", agent).run(dataset) # → pass_rate

from your_stack import Anthropic, OpenAI, LangChain, LlamaIndex # or "any python callable"

# add it to your agent

# agentlens only observes — wrap the loop you already have. the green + lines are all you add

agent.py

import anthropicfrom agentlens_eval import Tracer, detect_loopsfrom agentlens_eval.report import generate_html def run_agent(question):    tracer = Tracer("my-agent")    messages = [{"role": "user", "content": question}]    while True:        with tracer.step() as step:            resp = client.messages.create(                model="claude-opus-4-8", tools=TOOLS, messages=messages)            step.tokens = resp.usage.output_tokens            if resp.stop_reason != "tool_use":                step.result = final_text(resp)                break            for call in tool_calls(resp):                out = run_tool(call)            # you run the tool                step.tool_call(call.name, call.input, result=out)                # ...feed `out` back to the model, as you already do     generate_html(tracer.trace, warnings=detect_loops(tracer.trace))

Import the tracer + report helper.
Make one Tracer at the start of a run.
Wrap each turn in with tracer.step() — your existing code goes inside, unchanged.
Record as it happens: step.tokens, step.tool_call(...), step.result.
Call generate_html() at the end → your report + loop warnings.

# output

# the thing agentlens writes for you: one local HTML file, no server required

powershell - local run

$ python example_my_agent.py

trace: smoke-agent

steps: 2 tokens: 22 errors: 0

warning: search repeated at steps 1, 2

wrote: report.html

$ start report.html

devopen the report while debugging a run

cifail a build when evals miss the bar

prodsave the report when a live run misbehaves

The core output is still just a local Trace plus HTML. No hosted dashboard is required.

smoke-agent

agentlens trace report

report.html

2steps

22tokens

18ms total

0errors

Loop detection

ERROR

search called 2x in a row with identical args {"q": "weather"} (steps 1, 2) — likely stuck in a loop

Timeline

Step 110 tok · 8 ms

first lookup

tool: search

{"q": "weather"}

Step 212 tok · 10 ms

retry same lookup

result: answer for weather

# features

# one Trace, many lenses — every feature reads the same run object

01 / trace

see the run

Record each turn with tool calls, tokens, latency, final output, and errors in one local trace.

tracer.step()
generate_html(trace)

02 / detect

catch loops

Flag repeated tool calls and token sinks before they become invisible cost or flaky behavior.

detect_loops(trace)
metrics.NoLoops()

03 / evaluate

gate behavior

Run deterministic metrics, optional LLM judging, and pytest rows against the same trace object.

Eval(...).run(dataset)
report.assert_passed()

# showcase

# a few lines woven into the loop you already have — agentlens only observes, never drives

from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

tracer = Tracer("my-agent")

while not done:                          # your loop, unchanged
    with tracer.step() as step:          # one "page" per turn
        decision = llm(messages, tools)
        if decision.wants_tool:
            result = run_tool(decision.name, decision.args)
            step.tool_call(decision.name, decision.args, result=result)
        else:
            step.result = decision.text
            done = True

generate_html(tracer.trace, warnings=detect_loops(tracer.trace),
              open_browser=True)

from agentlens_eval import Eval, Case, metrics

report = Eval("support-agent", my_agent).run([
    Case("weather in NYC",
         [metrics.Contains("72"), metrics.ToolWasCalled("get_weather")]),
    Case("cancel my plan",
         [metrics.ToolWasCalled("cancel_plan"), metrics.NoLoops()]),
])

report.summary()                 # -> 2/2 cases passed (100%)
report.assert_passed(min_pass_rate=0.9)   # gate CI

from agentlens_eval import Case, metrics
from agentlens_eval.testing import parametrize, check

DATASET = [
    Case("weather in NYC", [metrics.Contains("72")], name="weather"),
    Case("cancel my plan", [metrics.ToolWasCalled("cancel_plan")], name="cancel"),
]

@parametrize(DATASET)            # one pytest row per case
def test_agent(case):
    check(my_agent, case).assert_passed()

# $ pytest -v
# test_agent[weather] PASSED
# test_agent[cancel]  PASSED

from agentlens_eval import Eval, Case, metrics

# deterministic checks stay cheap; the judge is opt-in
report = Eval("support-agent", my_agent).run([
    Case("explain the refund policy", [
        metrics.ToolWasCalled("lookup_policy"),
        metrics.LLMJudge(
            "politely and correctly states the 30-day refund policy"
        ),
    ]),
])
report.summary()

import anthropic
from agentlens_eval import Tracer
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()
tracer = Tracer("weather-agent")
messages = [{"role": "user", "content": "Weather in NYC?"}]

while True:
    with tracer.step() as step:
        resp = client.messages.create(
            model="claude-opus-4-8", max_tokens=1024,
            thinking={"type": "adaptive"}, tools=tools, messages=messages)
        step.tokens = resp.usage.output_tokens      # real token usage
        if resp.stop_reason != "tool_use":
            step.result = next(b.text for b in resp.content if b.type == "text")
            break
        # ... record step.tool_call(...) and feed results back

generate_html(tracer.trace, open_browser=True)

# why

# built for the dev loop, not a dashboard

# comparison	agentlens	typical platform
runs locally, no backend	True # stdlib-only	needs Postgres / cloud
understands the whole run	True # step + tool	often final answer only
cheap & deterministic by default	True # judge opt-in	LLM-judge everywhere
catches stuck loops	True # built in	manual
adoption cost	~5 lines	account + SDK + config

# changelog

# built in the open · git log --oneline

0.3LLM-as-judge. Opt-in LLMJudge("criteria") grades open-ended output via Claude structured outputs.

0.2Eval layer + CI. Deterministic metrics, an Eval/Case runner with pass_rate, pytest rows, HTML eval report.

0.1Observability core. Step tracing, loop detection, self-contained HTML trace report. Stdlib-only.

# install

bash — your machine

$ pip install agentlens-eval

Successfully installed agentlens-eval-0.3.0 # 0 dependencies

$ python -c "import agentlens_eval"

MIT licensed 0 dependencies no backend py 3.10+

get_started() star_on_github()