JetBrains Tracy: Pragmatic AI Observability for Kotlin
JetBrains Tracy is a Kotlin library that wires LLM-aware tracing into your app on top of OpenTelemetry. This post walks through how I integrated it into a Spring Boot service, the design decisions that matter, and the failure modes teams hit once LLM calls become the hottest path in their system.
At some point in the last eighteen months, most of the experiments I’ve run stopped treating LLM calls as a curiosity and started treating them as load-bearing infrastructure. That shift doesn’t announce itself — one experiment gets a “summarize” step, another layers in retrieval, then something evolves into an agent — and before long, a meaningful share of your p99 latency, your monthly cloud bill, and your on-call pager is tied up in opaque HTTP calls to providers like OpenAI or Anthropic.
Traditional observability tooling doesn't cover that surface well. Datadog will happily show you that an HTTP call took 4.2 seconds, but it can't tell you which model you hit, whether the prompt was truncated, how many tokens came back, or how much that single request cost. That gap is what Tracy — JetBrains' new open-source Kotlin library — is built to fill.
This post is the write-up of integrating Tracy into a small Spring Boot agent service and a view on where it fits in the Kotlin observability stack. If you run Kotlin in production and you're starting to care about LLM cost, latency, and debuggability, there's a concrete takeaway here: Tracy is not a new observability backend, it is a well-designed instrumentation layer on top of OpenTelemetry — and that design choice is what makes it worth adopting.
The Problem
LLM applications break the assumptions most of our observability tooling grew up with.
A traditional microservice produces structured, deterministic output: a status code, a body, maybe a handful of custom attributes. An LLM-powered service produces a branching execution graph that looks more like this:
- Your endpoint receives a request.
- You enrich it from a vector store or database.
- You send a prompt to a model. It responds with a tool call.
- You execute the tool (which may itself hit another service).
- You send the tool result back to the model.
- Repeat until the model produces a final answer, or you hit a guardrail.
Every one of those steps has its own cost, latency profile, and failure mode. Some failures are silent — the model "succeeds" with a hallucinated tool call that your code happily executes. Some failures are cost failures — a retry loop quietly burns 40k tokens on a single user request. Some are drift failures — a prompt change three weeks ago degraded accuracy by 12%, but no one notices because there is no regression signal.
Stand-alone APMs don't know about any of this. SDK-specific tracing (the OpenAI SDK's built-in telemetry, for example) only covers the call to the model, not your surrounding application logic. Frameworks like LangChain or Koog give you tracing, but only if every LLM call flows through their abstractions — a constraint that doesn't survive contact with real codebases.
What you actually want is something that captures the whole graph with a single, composable API. That is the niche Tracy fills.
The Approach
Tracy's design can be summarised in one sentence: every interesting LLM interaction becomes an OpenTelemetry span, and you get three ergonomic ways to create those spans.
The three APIs are:
1. Scoped spans via withSpan. A block-scoped construct that opens a span on entry and closes it on exit. Nesting is automatic, so you get a hierarchical trace without juggling span references.
2. Client instrumentation. You hand Tracy your HTTP or SDK client, it wraps it, and every call through that client produces an LLM-aware span with model, provider, token counts, and latency attached. Crucially, by default it captures metadata only — prompts and completions are off unless you explicitly opt in.
3. @Trace annotations. Slap @Trace on an interface method and every implementation is automatically wrapped in a span. This matters more than it sounds — instrumenting tool calls by hand is the kind of tedious work that gets skipped under deadline pressure, which means the exact moments you need visibility are the ones where it's missing.
Because everything is emitted as OTel spans, you point Tracy at whatever backend you already run: Jaeger, Zipkin, Grafana Tempo, or an LLM-specific product like Langfuse or W&B Weave. You don't buy into a new UI; you enrich your existing one.
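As a concrete example: if your service uses the standard OTel SDK autoconfiguration, pointing those spans at an OTLP-capable backend is pure environment configuration. These are the standard OpenTelemetry autoconfiguration variables; the endpoint value is illustrative for a local Jaeger or Tempo collector.

```shell
# Standard OTel autoconfiguration env vars; endpoint is an example
# for a local OTLP collector (Jaeger, Tempo, or the OTel Collector).
export OTEL_SERVICE_NAME=support-agent
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
```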

Technical Deep Dive
Here's what integration looks like in a Spring Boot service that uses the OpenAI Kotlin SDK, fronted by a simple agent loop.
Wrapping the client
```kotlin
@Configuration
class OpenAiConfig {

    @Bean
    fun openAiClient(): OpenAIClient {
        val raw = OpenAIOkHttpClient.fromEnv()
        // Every call through `client` now emits an OTel span
        // with provider, model, prompt_tokens, completion_tokens,
        // total_tokens and latency attached automatically.
        return instrument(raw)
    }
}
```

The important detail here: instrument is a boundary. Anything that goes through the wrapped client is traced. Anything that bypasses it — a direct HttpClient call someone added for a one-off integration — is invisible. A good rule is to expose the instrumented client as the only injectable bean and fail the build if someone constructs a raw one.
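"Fail the build" can be made literal with an architecture test. Here is a sketch using ArchUnit; the package and class names are hypothetical placeholders for your own layout.

```kotlin
import com.tngtech.archunit.junit.AnalyzeClasses
import com.tngtech.archunit.junit.ArchTest
import com.tngtech.archunit.lang.ArchRule
import com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses

// Only the configuration package may touch the raw OpenAI client;
// everything else must inject the instrumented bean.
// Package and class names below are illustrative, not canonical.
@AnalyzeClasses(packages = ["com.example.supportagent"])
class ClientBoundaryTest {

    @ArchTest
    val onlyConfigBuildsRawClients: ArchRule = noClasses()
        .that().resideOutsideOfPackage("..config..")
        .should().dependOnClassesThat()
        .haveFullyQualifiedName("com.openai.client.okhttp.OpenAIOkHttpClient")
}
```

Run it as part of your normal test suite; a stray direct construction then fails CI rather than silently shipping untraced calls.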
Scoping agent work
```kotlin
@Service
class SupportAgent(
    private val client: OpenAIClient,
    private val tools: List<Tool<*>>,
) {
    fun handle(ticket: Ticket): AgentReply = withSpan("support-agent") {
        withSpan("context.load") {
            loadContext(ticket)
        }.let { context ->
            runAgentLoop(ticket, context)
        }
    }
}
```

withSpan is more than a logger. It anchors every nested LLM call, tool invocation, and downstream service hop to one root span, which means you can ask your backend questions like: "show me every ticket where the agent looped more than three times and cost more than $0.50."
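For reference, here is a hypothetical shape of that agent loop: one child span per iteration, so "looped more than three times" becomes a simple span count in your backend. Everything here is a sketch — client.chat, buildInitialMessages, executeTool, MAX_ITERATIONS, and the reply shape are placeholder names; Span.current() is the standard OTel API, and the sketch assumes Tracy's withSpan installs its span as the current one.

```kotlin
const val MAX_ITERATIONS = 5 // guardrail: cap the loop

fun runAgentLoop(ticket: Ticket, context: AgentContext): AgentReply =
    withSpan("agent.loop") {
        var messages = buildInitialMessages(ticket, context)
        for (iteration in 1..MAX_ITERATIONS) {
            val reply = withSpan("agent.iteration") {
                // Record which turn of the loop this span represents.
                Span.current().setAttribute("agent.iteration", iteration.toLong())
                client.chat(messages) // traced by the instrumented client
            }
            // No tool call requested: the model produced its final answer.
            val toolCall = reply.toolCall ?: return@withSpan reply.toAgentReply()
            // Tool execution gets its own span via the @Trace'd Tool interface.
            messages = messages + executeTool(toolCall)
        }
        error("agent exceeded $MAX_ITERATIONS iterations") // guardrail tripped
    }
```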
Tracing tools without boilerplate
```kotlin
interface Tool<T> {
    @Trace(name = "tool.call")
    fun execute(args: Map<String, Any>): T
}

class LookupOrderTool(
    private val orders: OrderRepository,
) : Tool<Order?> {
    override fun execute(args: Map<String, Any>): Order? =
        orders.findById(args["orderId"] as String)
}
```

Every implementation inherits the annotation behaviour. Add a new tool, get a span for free. This is the piece I've come to trust most in production — most incidents I've seen with agent services trace back to a specific tool behaving oddly, and the span data is what gets you to root cause without a repro.
Capturing prompts on demand
```kotlin
if (env.isDev || featureFlag("llm.trace.content", userId)) {
    TracingManager.traceSensitiveContent()
}
```

The fact that this is opt-in is a real design choice, not a limitation. Prompts and completions carry two kinds of risk: they often contain PII, and they balloon span size (a single RAG-augmented prompt can be 50KB of text). Making the opt-in explicit — runtime-toggleable, targeted — is the right default.
The trade-off that matters
Tracy sits on OpenTelemetry, so the ceiling of what you can do is whatever OTel and your backend support. That means you inherit OTel's strengths (vendor-neutral, standardised semantic conventions, rich ecosystem) and its weaknesses (high-cardinality attributes punish some backends, span payload limits can truncate large prompts, sampling strategies need thought).
The alternative would have been a bespoke wire format optimised for LLM payloads. That would give you tighter control over prompt storage and richer evals but would force you onto a proprietary viewer. JetBrains picked the pragmatic path, and I think it's the right one for a library that needs to meet teams where they already live.
Pitfalls & Edge Cases
A few things will bite you that aren't obvious from the announcement post:
Token attribution is per span, not per request. If your agent loops, you'll see five LLM spans under one root — each with its own token counts. Rolling those up to "cost per user request" is work you do in your backend's query layer, not something Tracy hands you. Build that dashboard early; it's the number your PM will ask for first.
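The roll-up itself is straightforward once you can export span-level token counts from your backend. A minimal sketch, where the span shape and the per-token prices are illustrative assumptions — substitute your provider's actual price sheet:

```kotlin
// Roll nested LLM spans up to a cost per root request.
data class LlmSpan(
    val rootTraceId: String,
    val promptTokens: Long,
    val completionTokens: Long,
)

// Illustrative prices per 1K tokens; not any provider's real pricing.
const val PROMPT_PRICE_PER_1K = 0.0025
const val COMPLETION_PRICE_PER_1K = 0.01

fun costPerRequest(spans: List<LlmSpan>): Map<String, Double> =
    spans.groupBy { it.rootTraceId } // all iterations of one user request
        .mapValues { (_, group) ->
            group.sumOf {
                it.promptTokens * PROMPT_PRICE_PER_1K / 1000 +
                    it.completionTokens * COMPLETION_PRICE_PER_1K / 1000
            }
        }
```

Whether you run this in a scheduled export job or express the same aggregation in your backend's query language is a deployment detail; the grouping key (the root trace ID) is the part that matters.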
Sensitive content capture plays badly with span size limits. Most OTel backends cap span attributes somewhere between 4KB and 64KB. Enable prompt capture on a RAG app and you'll silently truncate. If you need full-fidelity prompt logging, route prompts to a separate store (S3, a Postgres llm_requests table) and put only the reference in the span.
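One way to structure that out-of-band store: key the full prompt by a content hash and attach only the key plus a bounded preview as span attributes. PromptStore and the attribute names here are hypothetical; back the store with S3, Postgres, or whatever you already run.

```kotlin
import java.security.MessageDigest

// Hypothetical blob-store interface; implement over S3, Postgres, etc.
interface PromptStore {
    fun put(key: String, content: String)
}

fun recordPrompt(store: PromptStore, prompt: String): Map<String, String> {
    // Content hash doubles as a stable, deduplicating storage key.
    val key = MessageDigest.getInstance("SHA-256")
        .digest(prompt.toByteArray())
        .joinToString("") { "%02x".format(it) }
    store.put(key, prompt)
    return mapOf(
        "llm.prompt.ref" to key,                  // look up the full text by key
        "llm.prompt.preview" to prompt.take(256), // size-bounded, never truncated silently
    )
}
```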
Instrumented clients don't automatically propagate context across coroutines. If you launch work on a different dispatcher inside a withSpan block without propagating the OTel context, the child spans will be orphaned. Kotlin coroutines' MDCContext and OTel's Context.current().asContextElement() are the right primitives. Test this with a deliberate withContext(Dispatchers.IO) inside an agent loop before you ship.
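A minimal sketch of the propagation, assuming the opentelemetry-extension-kotlin artifact is on the classpath (it provides the asContextElement() extension); this is one pattern, not Tracy's own API:

```kotlin
import io.opentelemetry.context.Context
import io.opentelemetry.extension.kotlin.asContextElement
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Hop to the IO dispatcher while carrying the current OTel context,
// so spans created inside `block` stay parented to the caller's span.
suspend fun <T> tracedIo(block: suspend () -> T): T =
    withContext(Dispatchers.IO + Context.current().asContextElement()) {
        block()
    }
```

Dropping the context element from that withContext call reproduces the orphaned-span failure mode, which makes it an easy regression test.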
Retries are not free observability. If your OpenAI client retries on a 429, Tracy will show you the successful span but the retry storm will inflate your latency percentiles and token counts. Either instrument at a layer above retries, or add a retry.count attribute so you can filter your dashboards.
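If you take the second option, the wrapper can be trivial. A sketch, where recordAttribute stands in for however you set an attribute on the current span:

```kotlin
// Generic retry wrapper that surfaces the attempt count, so dashboards
// can separate "slow because of the model" from "slow because of retries".
fun <T> withRetries(
    maxAttempts: Int,
    recordAttribute: (name: String, value: Long) -> Unit,
    call: () -> T,
): T {
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            val result = call()
            // Number of *retries*, not attempts: 0 means first try succeeded.
            recordAttribute("retry.count", (attempt - 1).toLong())
            return result
        } catch (e: Exception) {
            lastError = e
        }
    }
    recordAttribute("retry.count", (maxAttempts - 1).toLong())
    throw lastError!!
}
```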
@Trace only works through the interface. Call the concrete class directly and the annotation is bypassed. If you use Spring, make sure your tools are resolved through their interface beans and that AOP is configured to weave them; this caught me once on a service that used Kotlin's object singletons for a couple of tools.
Practical Takeaways
- Adopt the instrumented client as a single boundary. Make it the only way into the LLM provider from your codebase, and fail the build on direct construction.
- Put withSpan at the top of every agent or feature entry point. A root span per user interaction is the unit of analysis you will use forever.
- Turn on @Trace for every tool interface from day one. The marginal cost is zero and the marginal value is enormous once something misbehaves.
- Keep sensitive content off by default; gate it behind a feature flag for dev and targeted production debugging.
- Store prompts and completions out of band when you do need them, and reference them from the span rather than inlining them.
- Build a cost-per-request dashboard early. Aggregating token counts across nested spans is the observability question that returns the most ROI.
- Don't abandon your existing APM. Tracy gives you the LLM-aware view; your existing tools still own host metrics, DB performance, and everything else. The win is the correlation.
Conclusion
Tracy is the least flashy of the 2026 wave of AI-tooling releases — and that's its strength. It doesn't try to be a new observability backend, a new agent framework, or a new evaluation platform. It quietly fills the instrumentation gap between your Kotlin application code and the OpenTelemetry-compatible backends you already run, and it does so with an API surface that respects how Kotlin developers actually write code.
Reach for it when you have Kotlin services making LLM calls in production, when you want to see your agent loops as first-class traces alongside your existing spans, and when you'd rather augment your current observability stack than replace it. Reach past it if you need a vertically integrated eval + tracing + prompt-management product — that's a different category, and tools like Langfuse (which, pleasantly, Tracy can export to) are a better fit there.
The bigger message is the one JetBrains makes in passing in their announcement: no matter how good the models get, the applications wrapping them still need to be debugged, measured, and evaluated. Observability is the foundation that makes everything downstream — evals, cost optimisation, prompt engineering — possible. Tracy is a quiet, well-built piece of that foundation for the Kotlin ecosystem, and if you live there, it's worth thirty minutes of your afternoon.
References:
JetBrains Tracy - https://github.com/JetBrains/tracy
Written by Tiarê Balbi