see the run
Record each turn with tool calls, tokens, latency, final output, and errors in one local trace.
tracer.step() · generate_html(trace)
$ inspect agent runs locally
Trace every step your AI agent takes in development, CI, or production. Catch stuck loops, score behavior, and keep a local record of what happened when a run goes sideways.
# agentlens only observes: wrap the loop you already have. The tracer and step lines are all you add
import anthropic
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()

def run_agent(question):
    tracer = Tracer("my-agent")
    messages = [{"role": "user", "content": question}]
    while True:
        with tracer.step() as step:
            resp = client.messages.create(
                model="claude-opus-4-8", max_tokens=1024,
                tools=TOOLS, messages=messages)
            step.tokens = resp.usage.output_tokens
            if resp.stop_reason != "tool_use":
                step.result = final_text(resp)
                break
            for call in tool_calls(resp):
                out = run_tool(call)  # you run the tool
                step.tool_call(call.name, call.input, result=out)
            # ...feed `out` back to the model, as you already do
    generate_html(tracer.trace, warnings=detect_loops(tracer.trace))
- Tracer at the start of a run.
- with tracer.step(): your existing code goes inside, unchanged.
- step.tokens, step.tool_call(...), step.result: record what each turn did.
- generate_html() at the end → your report + loop warnings.

# the thing agentlens writes for you: one local HTML file, no server required
The core output is a local Trace plus an HTML report. No hosted dashboard is required.
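To make "a local Trace plus HTML" concrete, here is a minimal sketch of how a per-step recorder and a static report could fit together using only the standard library. It is not agentlens's implementation: `MiniTracer`, `Step`, `render_html`, and the `trace.html` filename are all hypothetical names chosen for illustration.

```python
import html
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Step:
    """One agent turn (hypothetical shape, not agentlens's real class)."""
    index: int
    tool_calls: list = field(default_factory=list)
    tokens: int = 0
    latency_s: float = 0.0
    result: Optional[str] = None
    error: Optional[str] = None

    def tool_call(self, name, args, result=None):
        self.tool_calls.append({"name": name, "args": args, "result": result})


class MiniTracer:
    """Collects steps in memory; nothing leaves the process."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    @contextmanager
    def step(self):
        step = Step(index=len(self.steps) + 1)
        start = time.perf_counter()
        try:
            yield step
        except Exception as exc:
            step.error = repr(exc)  # keep the failure in the trace, then re-raise
            raise
        finally:
            step.latency_s = time.perf_counter() - start
            self.steps.append(step)


def render_html(tracer):
    """Render every step as one table row in a self-contained HTML page."""
    header = "".join(f"<th>{h}</th>" for h in
                     ["step", "tool calls", "tokens", "latency", "result", "error"])
    rows = []
    for s in tracer.steps:
        calls = ", ".join(f"{c['name']}({c['args']})" for c in s.tool_calls)
        cells = [s.index, calls, s.tokens, f"{s.latency_s:.3f}s",
                 s.result or "", s.error or ""]
        tds = "".join(f"<td>{html.escape(str(c))}</td>" for c in cells)
        rows.append(f"<tr>{tds}</tr>")
    title = html.escape(tracer.name)
    body = "".join(rows)
    return (f"<html><body><h1>{title}</h1>"
            f"<table><tr>{header}</tr>{body}</table></body></html>")


tracer = MiniTracer("demo-agent")
with tracer.step() as step:
    step.tool_call("search", {"q": "weather"}, result="72F")
    step.tokens = 41
with tracer.step() as step:
    step.result = "It is 72F in NYC."
    step.tokens = 18

# One local file, no server: open it in any browser.
Path("trace.html").write_text(render_html(tracer), encoding="utf-8")
```

The context manager is what lets the recorder stay passive: latency and errors are captured around whatever code runs inside the `with` block, so the loop itself does not change.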
agentlens trace report
search called 2x in a row with identical args {"q": "weather"} (steps 1, 2) — likely stuck in a loop
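A warning like the one above can come from a very simple check: group consecutive tool calls by name and arguments, and flag any run of two or more. The sketch below illustrates that idea under stated assumptions; `find_repeated_calls` and the `(step, name, args)` tuple shape are hypothetical and are not how `detect_loops` is necessarily implemented.

```python
import json
from itertools import groupby


def find_repeated_calls(tool_calls, min_repeats=2):
    """Flag runs of consecutive calls with the same tool name and args.

    `tool_calls` is a list of (step_number, name, args) tuples in call order.
    """
    def key(call):
        _, name, args = call
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal
        return name, json.dumps(args, sort_keys=True)

    warnings = []
    for (name, args_json), group in groupby(tool_calls, key=key):
        run = list(group)
        if len(run) >= min_repeats:
            steps = ", ".join(str(step_no) for step_no, _, _ in run)
            warnings.append(
                f"{name} called {len(run)}x in a row with identical args "
                f"{args_json} (steps {steps})"
            )
    return warnings


calls = [
    (1, "search", {"q": "weather"}),
    (2, "search", {"q": "weather"}),
    (3, "get_weather", {"city": "NYC"}),
]
for warning in find_repeated_calls(calls):
    print(warning)
```

Only back-to-back repeats are flagged here, so an agent that legitimately calls the same tool again later in the run does not trigger a warning.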
# one Trace, many lenses — every feature reads the same run object
Flag repeated tool calls and token sinks before they become invisible cost or flaky behavior.
detect_loops(trace) · metrics.NoLoops()
Run deterministic metrics, optional LLM judging, and pytest rows against the same trace object.
Eval(...).run(dataset) · report.assert_passed()

# a few lines woven into the loop you already have: agentlens only observes, never drives
from agentlens_eval import Tracer, detect_loops
from agentlens_eval.report import generate_html

tracer = Tracer("my-agent")
done = False
while not done:  # your loop, unchanged
    with tracer.step() as step:  # one "page" per turn
        decision = llm(messages, tools)
        if decision.wants_tool:
            result = run_tool(decision.name, decision.args)
            step.tool_call(decision.name, decision.args, result=result)
        else:
            step.result = decision.text
            done = True

generate_html(tracer.trace, warnings=detect_loops(tracer.trace),
              open_browser=True)

from agentlens_eval import Eval, Case, metrics
report = Eval("support-agent", my_agent).run([
    Case("weather in NYC",
         [metrics.Contains("72"), metrics.ToolWasCalled("get_weather")]),
    Case("cancel my plan",
         [metrics.ToolWasCalled("cancel_plan"), metrics.NoLoops()]),
])
report.summary()  # -> 2/2 cases passed (100%)
report.assert_passed(min_pass_rate=0.9)  # gate CI

from agentlens_eval import Case, metrics
from agentlens_eval.testing import parametrize, check

DATASET = [
    Case("weather in NYC", [metrics.Contains("72")], name="weather"),
    Case("cancel my plan", [metrics.ToolWasCalled("cancel_plan")], name="cancel"),
]

@parametrize(DATASET)  # one pytest row per case
def test_agent(case):
    check(my_agent, case).assert_passed()

# $ pytest -v
# test_agent[weather] PASSED
# test_agent[cancel] PASSED

from agentlens_eval import Eval, Case, metrics
# deterministic checks stay cheap; the judge is opt-in
report = Eval("support-agent", my_agent).run([
    Case("explain the refund policy", [
        metrics.ToolWasCalled("lookup_policy"),
        metrics.LLMJudge(
            "politely and correctly states the 30-day refund policy"
        ),
    ]),
])
report.summary()

import anthropic
from agentlens_eval import Tracer
from agentlens_eval.report import generate_html

client = anthropic.Anthropic()
tracer = Tracer("weather-agent")
messages = [{"role": "user", "content": "Weather in NYC?"}]
while True:
    with tracer.step() as step:
        resp = client.messages.create(
            model="claude-opus-4-8", max_tokens=1024,
            thinking={"type": "adaptive"}, tools=tools, messages=messages)
        step.tokens = resp.usage.output_tokens  # real token usage
        if resp.stop_reason != "tool_use":
            step.result = next(b.text for b in resp.content if b.type == "text")
            break
        # ... record step.tool_call(...) and feed results back

generate_html(tracer.trace, open_browser=True)

# built for the dev loop, not a dashboard
| # comparison | agentlens | typical platform |
|---|---|---|
| runs locally, no backend | True # stdlib-only | needs Postgres / cloud |
| understands the whole run | True # step + tool | often final answer only |
| cheap & deterministic by default | True # judge opt-in | LLM-judge everywhere |
| catches stuck loops | True # built in | manual |
| adoption cost | ~5 lines | account + SDK + config |
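The "cheap & deterministic by default" row is easiest to see in code: a deterministic metric is a plain predicate over what a run produced, and a CI gate is a threshold on the fraction of cases whose predicates all hold. The following is a sketch under stated assumptions, not agentlens's runner; `RunRecord`, `contains`, `tool_was_called`, `pass_rate`, and `fake_agent` are hypothetical names that loosely mirror the `Contains` and `ToolWasCalled` metrics shown earlier.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """What one agent run produced (hypothetical stand-in for a trace)."""
    output: str
    tools_called: list


def contains(text):
    """Check that the final output includes `text`."""
    return lambda run: text in run.output


def tool_was_called(name):
    """Check that the run invoked the tool `name` at least once."""
    return lambda run: name in run.tools_called


def pass_rate(cases, agent):
    """Run each (prompt, checks) case; a case passes only if every check does."""
    passed = 0
    for prompt, checks in cases:
        run = agent(prompt)
        if all(check(run) for check in checks):
            passed += 1
    return passed / len(cases)


def fake_agent(prompt):
    # Canned responses so the example runs without a model or network.
    if "weather" in prompt:
        return RunRecord("It is 72F in NYC.", ["get_weather"])
    return RunRecord("Your plan is cancelled.", ["cancel_plan"])


cases = [
    ("weather in NYC", [contains("72"), tool_was_called("get_weather")]),
    ("cancel my plan", [tool_was_called("cancel_plan")]),
]
rate = pass_rate(cases, fake_agent)
assert rate >= 0.9, f"pass rate {rate:.0%} is below the 90% gate"
print(f"{rate:.0%} of cases passed")
```

Because these checks involve no model call, they cost nothing to run on every commit; an LLM judge can then be reserved for the few cases where a string or tool check cannot express the requirement.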
# built in the open · git log --oneline
LLMJudge("criteria") grades open-ended output via Claude structured outputs.
Eval/Case runner with pass_rate, pytest rows, HTML eval report.