OTel GenAI v1.0: A 30/60/90 LLM analytics blueprint
OpenTelemetry’s GenAI v1.0 standardizes traces and metrics for prompts, tools, and embeddings. Use this 30/60/90 plan to measure cost, quality, and ROI across models before Q4 planning.

Vicky
Sep 14, 2025
Why this matters now
OpenTelemetry’s Generative AI semantic conventions hit v1.0 in late August 2025. That milestone gives product and platform teams a stable, vendor‑neutral way to trace prompts, completions, embeddings, and tool calls across every model you run. Within a week, LangChain and LlamaIndex shipped native exporters, so you can emit those spans without bespoke glue code. Translation for leaders: you can finally measure the cost, quality, and ROI of AI features consistently before Q4 planning.
I have run enough marathons to know that a clean split plan beats improvisation. Same in analytics. You need a clear 30, 60, and 90‑day instrumentation blueprint. Below is the plan I use with teams that want to standardize LLM analytics across OpenAI, Anthropic, and Gemini while staying close to business outcomes.
Along the way, I will point to practical attribute names from the v1.0 spec, the metrics that unlock ROI, and the guardrails you need for governance. If you also care about how answer engines portray your products, Upcite.ai helps you understand how ChatGPT and other AI models are viewing your products and applications and makes sure you appear in answers to prompts like "Best products for…" or "Top applications for…".
Recent facts
- 2025‑08‑21: OpenTelemetry released Generative AI Semantic Conventions v1.0 covering spans and metrics for prompts, completions, embeddings, and tool calls.
- 2025‑08‑22: LangChain announced native OpenTelemetry exporters for GenAI events.
- 2025‑08‑23: LlamaIndex shipped OpenTelemetry integration for GenAI traces and evals.
What v1.0 standardizes
At a high level, you instrument three things consistently:
- Completions. A CLIENT span per model call with attributes for request, usage, and response.
- Tools. A span for each tool invocation, tied to the parent completion.
- Embeddings. A span per embedding request that participates in the same trace.
Common attribute names in the v1.0 conventions you will use:
- gen_ai.system: model vendor or runtime, for example openai, anthropic, vertex_ai
- gen_ai.request.model: deployed model name or variant
- gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens
- gen_ai.prompt.template and gen_ai.prompt.variables for prompt construction context
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens
- gen_ai.response.finish_reasons and gen_ai.response.truncated
- gen_ai.tool.name, gen_ai.tool.arguments, gen_ai.tool.output
- gen_ai.embedding.model and gen_ai.embedding.dimensions
You also carry standard Resource attributes to segment by business and environment:
- service.name, service.version
- deployment.environment
- cloud.region
- app.domain or team.owner
- tenant.id or account.id if you run multi‑tenant
Metrics flow out of spans. At a minimum, derive these counters and histograms:
- tokens.input, tokens.output, tokens.total
- cost.usd, using a pricing map per model
- latency.ms at operation and trace levels, with P50, P95, P99
- error.count with reasons
- tool.invocations and tool.failures
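Here is a minimal sketch of how I derive those instruments with the OTel metrics API. The instrument names mirror the list above rather than anything mandated by the spec, and it assumes a MeterProvider with an OTLP reader is configured alongside the tracer setup shown later in this post.
from opentelemetry import metrics

# Assumes a MeterProvider with an OTLP exporter is already configured,
# analogous to the TracerProvider setup shown later in this post.
meter = metrics.get_meter("support-ai")

tokens_input = meter.create_counter("tokens.input", unit="{token}")
tokens_output = meter.create_counter("tokens.output", unit="{token}")
cost_usd = meter.create_counter("cost.usd", unit="usd")
latency_ms = meter.create_histogram("latency.ms", unit="ms")  # P50/P95/P99 come from the backend
errors = meter.create_counter("error.count")

def record_completion_metrics(in_tok, out_tok, cost, elapsed_ms, attrs, error_reason=None):
    # attrs must stay low cardinality: model, task_type, environment, tenant tier
    tokens_input.add(in_tok, attrs)
    tokens_output.add(out_tok, attrs)
    cost_usd.add(cost, attrs)
    latency_ms.record(elapsed_ms, attrs)
    if error_reason:
        errors.add(1, {**attrs, "reason": error_reason})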
The KPI model you will connect to traces
Analytics without a clear scoring model is like tennis footwork without a target. You will move a lot and win little. Define these KPIs up front:
- Task solved rate: fraction of AI tasks resolved without human handoff. Label each trace with task_type and outcome.solved = true|false.
- Cost per solved task: sum(cost.usd) over traces with outcome.solved = true divided by count of those traces.
- Time to first resolution: end‑to‑end trace latency from user intent to final answer or action.
- Quality score: a 1‑5 rubric for each task, captured via evaluator or user feedback, stored as outcome.quality.
- Safety outcome: outcome.safety = clean|blocked|red_team_flag plus category labels.
- Prompt ROI: (business_value_usd minus cost.usd) divided by cost.usd, per prompt version.
Map business_value_usd realistically. Examples:
- Support deflection: $7 if the issue would have created a tier‑1 ticket.
- Sales assist: 2 percent of average order value when an AI nudge correlates with purchase.
- Internal productivity: hourly rate times minutes saved.
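To make the formulas concrete, here is a small sketch that computes solved rate, cost per solved task, and prompt ROI from trace rows exported out of your backend. The field names follow the attributes used throughout this post; value_map is whatever business value model you define per task_type.
from collections import defaultdict

def kpis_by_prompt_version(traces, value_map):
    # traces: one dict per gen_ai.completion trace, exported from your trace backend
    # value_map: business_value_usd per task_type, e.g. {"password_reset": 7.0} (illustrative)
    agg = defaultdict(lambda: {"cost": 0.0, "solved_cost": 0.0, "value": 0.0, "solved": 0, "total": 0})
    for t in traces:
        a = agg[t["prompt.version"]]
        a["cost"] += t["gen_ai.cost.usd"]
        a["total"] += 1
        if t["outcome.solved"]:
            a["solved"] += 1
            a["solved_cost"] += t["gen_ai.cost.usd"]
            a["value"] += value_map.get(t["task_type"], 0.0)
    return {
        version: {
            "solved_rate": a["solved"] / a["total"],
            "cost_per_solved_task": a["solved_cost"] / a["solved"] if a["solved"] else None,
            "prompt_roi": (a["value"] - a["cost"]) / a["cost"] if a["cost"] else None,
        }
        for version, a in agg.items()
    }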
The 30/60/90‑day instrumentation blueprint
Day 0 to 30: Establish the foundation
Objectives: consistent spans, cost math, and a minimal dashboard. If you only complete this phase, you will already know what you pay per solved task.
- Choose your OTel pipeline
- OTel SDKs in each service that touches a model. Language choice follows your stack.
- OTel Collector in each environment to receive traces and metrics. Use a single pipeline for GenAI data so you can apply transform and sampling consistently.
- Storage and query. Any trace backend works. Ensure it supports exemplars and linking traces to business identifiers.
- Define the cardinality budget and attribute standards
- Adopt a stable set of attributes at the Resource and Span levels. Treat them like schema. Lock a short list: environment, region, team, tenant.id, task_type, user_segment, model tier.
- Limit free‑form values. Use enumerations for task_type and prompt.version.
- Instrument completion spans everywhere a model is called
- Span kind: CLIENT
- Span name: gen_ai.completion
- Attributes: gen_ai.system, gen_ai.request.model, gen_ai.request.temperature, gen_ai.usage.*
- Business attributes: task_type, user_segment, account_tier
- Outcome attributes: outcome.solved, outcome.quality, outcome.safety
Minimal Python example using OTel and an LLM client:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Configure the tracer with Resource attributes used for business segmentation
resource = Resource.create({
    "service.name": "support-ai",
    "service.version": "1.3.2",
    "deployment.environment": "prod",
    "team.owner": "ai-platform",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel-collector/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("support-ai")

# USD per 1k tokens, keyed by (system, model)
pricing = {("openai", "gpt-4o-mini"): {"in": 0.00015, "out": 0.0006}}

def cost_usd(system, model, in_tokens, out_tokens):
    p = pricing[(system, model)]
    return (in_tokens / 1000.0) * p["in"] + (out_tokens / 1000.0) * p["out"]

def answer_with_llm(client, prompt, meta):
    with tracer.start_as_current_span("gen_ai.completion") as span:
        span.set_attribute("gen_ai.system", meta["system"])          # openai|anthropic|vertex_ai
        span.set_attribute("gen_ai.request.model", meta["model"])    # model id
        span.set_attribute("gen_ai.request.temperature", meta.get("temperature", 0.2))
        span.set_attribute("task_type", meta["task_type"])           # ex: "password_reset"
        span.set_attribute("user_segment", meta.get("user_segment", "unknown"))

        # Call the model
        resp = client.chat.completions.create(
            model=meta["model"],
            messages=[{"role": "user", "content": prompt}],
            temperature=meta.get("temperature", 0.2),
        )

        # Record usage, cost, and response metadata on the same span
        in_tok = resp.usage.prompt_tokens
        out_tok = resp.usage.completion_tokens
        span.set_attribute("gen_ai.usage.input_tokens", in_tok)
        span.set_attribute("gen_ai.usage.output_tokens", out_tok)
        span.set_attribute("gen_ai.cost.usd", round(cost_usd(meta["system"], meta["model"], in_tok, out_tok), 6))
        span.set_attribute("gen_ai.response.finish_reasons", resp.choices[0].finish_reason or "")

        return resp.choices[0].message.content
- Turn tokens into dollars the same way everywhere
- Maintain a pricing map per model and version. Keep it in configuration with a last_updated timestamp.
- Emit gen_ai.cost.usd on every completion span. Your collector can compute it with a transform processor if you prefer to keep price math out of apps.
- Build the first dashboard
- Cost per solved task by task_type
- P95 latency per model
- Token use per request and per tenant
- Top prompt versions by volume and by ROI
- Sampling and retention
- Keep 100 percent of GenAI spans for the first 30 days. You will need the ground truth to set budgets.
- After 30 days, tail sample based on outcome.solved = false, errors, and high cost outliers. Keep exemplars for the rest.
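Real tail sampling across whole traces belongs in the Collector. If you want a stopgap in the application, here is a simplified per-span filter that keeps unsolved outcomes, errors, and cost outliers and samples the rest as exemplars; the exemplar rate and cost threshold are assumptions to tune.
import random
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import StatusCode

class KeepInterestingSpans(BatchSpanProcessor):
    # Forwards a finished span to the exporter only if it is interesting or sampled as an exemplar
    def __init__(self, exporter, exemplar_rate=0.05, cost_threshold_usd=0.05):
        super().__init__(exporter)
        self.exemplar_rate = exemplar_rate
        self.cost_threshold_usd = cost_threshold_usd

    def on_end(self, span):
        attrs = span.attributes or {}
        keep = (
            attrs.get("outcome.solved") is False
            or span.status.status_code == StatusCode.ERROR
            or attrs.get("gen_ai.cost.usd", 0.0) > self.cost_threshold_usd
            or random.random() < self.exemplar_rate
        )
        if keep:
            super().on_end(span)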
- Connect to outcome labels
- Add a simple evaluator to label solved vs unsolved. Use your existing resolution events where possible, for example a support ticket not created within 24 hours.
- Store outcome.quality as a 1 to 5 rating where available. Start with heuristics. Replace with audited evals later.
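A heuristic labeler can be as simple as joining traces with ticket events. A sketch, assuming you can export trace end times and user ids alongside your ticketing data:
from collections import defaultdict
from datetime import timedelta

def label_solved(traces, tickets, window=timedelta(hours=24)):
    # traces: [{"trace_id", "user_id", "end_time"}]; tickets: [{"user_id", "created_at"}]
    # Heuristic: the conversation counts as solved if the user opened no ticket within the window
    tickets_by_user = defaultdict(list)
    for t in tickets:
        tickets_by_user[t["user_id"]].append(t["created_at"])
    labels = {}
    for tr in traces:
        follow_ups = [ts for ts in tickets_by_user.get(tr["user_id"], [])
                      if tr["end_time"] <= ts <= tr["end_time"] + window]
        labels[tr["trace_id"]] = len(follow_ups) == 0  # write back as outcome.solved
    return labels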
Deliverables by day 30
- A consistent v1.0 schema in production
- Cost per solved task by feature and model
- A live P95 latency and error dashboard
- Sampling rules and a pricing map checked into config
Day 31 to 60: Expand to tools, RAG, and quality
Objectives: instrument tool calls and retrieval, add safety and quality signals, and unlock cost‑to‑outcome analytics.
- Trace tools and RAG as first‑class spans
- Create a child span for each tool invocation with span name gen_ai.tool.call and attributes gen_ai.tool.name, gen_ai.tool.arguments, gen_ai.tool.output_size.
- For retrieval, emit a span retrieval.search with attributes retrieval.index, retrieval.query_hash, retrieval.latency_ms, retrieval.top_k, retrieval.re_ranked, retrieval.score_stats.
- Link embedding spans to the same trace when you create or query vectors. Use gen_ai.embedding.model and gen_ai.embedding.dimensions.
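A sketch of both span types, reusing the tracer configured in the earlier example. Here tool_fn and search_fn stand in for whatever tool and vector store clients you actually call, and the truncation limits are assumptions to keep attribute sizes sane.
import hashlib
import statistics

def call_tool(tool_name, args, tool_fn):
    # Child span per tool invocation, parented to the active gen_ai.completion span
    with tracer.start_as_current_span("gen_ai.tool.call") as span:
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("gen_ai.tool.arguments", str(args)[:512])
        result = tool_fn(**args)
        span.set_attribute("gen_ai.tool.output_size", len(str(result)))
        return result

def retrieve(index_name, query, top_k, search_fn):
    with tracer.start_as_current_span("retrieval.search") as span:
        span.set_attribute("retrieval.index", index_name)
        span.set_attribute("retrieval.query_hash", hashlib.sha256(query.encode()).hexdigest()[:16])
        span.set_attribute("retrieval.top_k", top_k)
        hits = search_fn(query, top_k=top_k)  # expected shape: [{"score": float, "doc_id": str}, ...]
        scores = [h["score"] for h in hits] or [0.0]
        span.set_attribute("retrieval.score_stats", f"max={max(scores):.3f} mean={statistics.mean(scores):.3f}")
        return hits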
- Use the new exporters
- If you use LangChain, enable the OTel exporter. You will automatically get structured events for prompt construction, tool use, and streaming tokens.
- If you use LlamaIndex, enable its OTel integration to unify traces across ingestion, retrieval, and generation.
- Add safety and drift signals
- Add outcome.safety to every completion with values clean, blocked, or flagged. Include categories where possible.
- Track prompt.version and prompt.hash. Visualize quality and solved rate by prompt.version to catch drift after prompt edits.
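Tagging is two attributes per completion. A minimal sketch; the version string is whatever naming scheme your prompt repository uses.
import hashlib

PROMPT_VERSION = "support-v14"  # bump on every prompt edit (illustrative naming)

def tag_prompt(span, rendered_prompt):
    # Stable identifiers for drift analysis without storing the full prompt body
    span.set_attribute("prompt.version", PROMPT_VERSION)
    span.set_attribute("prompt.hash", hashlib.sha256(rendered_prompt.encode()).hexdigest()[:16])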
- Create cost‑to‑outcome metrics
- Cost per solved task: sum(gen_ai.cost.usd) where outcome.solved = true divided by count of solved traces.
- Marginal cost of tool use: difference in cost.usd when tool spans exist vs not for the same task_type.
- Retrieval lift: delta in solved rate when retrieval.top_k > 0 vs zero.
- Budget guardrails
- Emit a metric budget.remaining.usd per tenant or feature. Set an alert when a trace would push the budget below zero. Short‑circuit with a cheaper fallback model.
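A sketch of that guardrail; the in-memory budgets map and the fallback model name are placeholders for whatever budget store and cheap tier you actually run.
from opentelemetry import metrics

meter = metrics.get_meter("support-ai")
budget_remaining = meter.create_up_down_counter("budget.remaining.usd", unit="usd")

FALLBACK_MODEL = "gpt-4o-mini"  # cheaper tier (illustrative)

def pick_model(tenant_id, preferred_model, estimated_cost_usd, budgets):
    # budgets: remaining USD per tenant, refreshed from billing config (assumption)
    if budgets.get(tenant_id, 0.0) - estimated_cost_usd < 0:
        return FALLBACK_MODEL  # short-circuit to the cheaper model instead of overspending
    return preferred_model

def charge(tenant_id, actual_cost_usd, budgets):
    budgets[tenant_id] = budgets.get(tenant_id, 0.0) - actual_cost_usd
    budget_remaining.add(-actual_cost_usd, {"tenant.id": tenant_id})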
- Standardize quality evals
- For each task_type define a rubric and automated evaluator. Store as outcome.quality with evaluator.name and evaluator.version attributes.
- Emit eval.latency_ms so you know evaluation cost in both tokens and time.
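A sketch of how the evaluator writes its attributes back onto the completion span; rubric_fn is a placeholder for your automated judge or heuristic.
import time

EVALUATOR_NAME = "rubric-judge"  # illustrative
EVALUATOR_VERSION = "0.3"

def evaluate_quality(span, task_type, answer, rubric_fn):
    # rubric_fn(task_type, answer) -> 1-5 score per the task's rubric
    start = time.monotonic()
    score = rubric_fn(task_type, answer)
    span.set_attribute("outcome.quality", score)
    span.set_attribute("evaluator.name", EVALUATOR_NAME)
    span.set_attribute("evaluator.version", EVALUATOR_VERSION)
    span.set_attribute("eval.latency_ms", round((time.monotonic() - start) * 1000, 1))
    return score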
Deliverables by day 60
- Traces that tell a complete RAG story from query to tool to answer
- A cost‑to‑outcome report per task and per model
- Budget guardrails that prevent surprise bills
- Safety tracking and prompt drift views
Day 61 to 90: Tie to business KPIs and govern
Objectives: vendor comparisons, A/B testing, governance, and ROI narratives leadership can trust.
- Vendor‑neutral A/B tests with the same schema
- Route a consistent 10 to 20 percent of traffic by task_type to two models. Keep gen_ai.request.model, gen_ai.system, and prompt.version in the span.
- Compare solved rate, cost per solved task, and P95 latency. Use the same evaluator. Declare a winner per segment.
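A sketch of deterministic routing so a given user stays in the same arm for the test window; the candidate models and the 20 percent share are illustrative.
import hashlib

EXPERIMENT_SHARE = 0.20  # fraction of traffic routed into the test
ARMS = {  # illustrative candidates per task_type
    "password_reset": [("openai", "gpt-4o-mini"), ("anthropic", "claude-3-5-haiku")],
}

def assign_arm(task_type, user_id):
    # Hash to a stable bucket so the same user always lands in the same arm
    bucket = int(hashlib.sha256(f"{task_type}:{user_id}".encode()).hexdigest(), 16) % 100
    if task_type not in ARMS or bucket >= int(EXPERIMENT_SHARE * 100):
        return None  # control traffic keeps the default routing
    system, model = ARMS[task_type][bucket % len(ARMS[task_type])]
    return {"gen_ai.system": system, "gen_ai.request.model": model, "ab.arm": f"{system}:{model}"}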
- Connect trace‑level outcomes to product metrics
- Join gen_ai traces with conversion or retention events using user_id, session_id, or order_id on the trace resource.
- Produce a weekly business view: gross margin uplift from AI assists, support deflection dollars, and infrastructure cost.
- Governance
- Red team workflows: mark traces with red_team.scenario and store prompts and outputs in a restricted table for audit.
- Safety SLOs: percent of blocked unsafe requests stays above a target while false positives stay below a threshold.
- Access controls: restrict who can view prompt bodies. Store prompt.hash by default and reveal full text only with approval.
- Reliability SLOs
- Set SLOs by task_type for latency and success. Example: 95 percent of password reset answers under 2.5 seconds end‑to‑end.
- Alert on tail latency and on spikes in gen_ai.response.truncated.
- Cost optimization playbook
- Token diet: track median input length by prompt.version. Cut verbose system prompts and switch to few‑shot only where it moves solved rate.
- Dynamic routing: emit a route.decision attribute and route heavy tasks to a larger model only when evaluator confidence is low.
- Caching: record cache.hit and cache.miss on retrieval and completions. Count avoided tokens.
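A sketch of the caching piece; the in-process dict stands in for whatever cache you actually use, and a boolean cache.hit plus cache.tokens_avoided are the attributes I would chart.
completion_cache = {}  # prompt.hash -> (answer, tokens_used); swap for Redis or similar in production

def cached_completion(span, prompt_hash, generate_fn):
    # generate_fn() -> (answer_text, total_tokens_used) on a cache miss
    if prompt_hash in completion_cache:
        answer, tokens_saved = completion_cache[prompt_hash]
        span.set_attribute("cache.hit", True)
        span.set_attribute("cache.tokens_avoided", tokens_saved)  # count avoided tokens
        return answer
    span.set_attribute("cache.hit", False)
    answer, tokens_used = generate_fn()
    completion_cache[prompt_hash] = (answer, tokens_used)
    return answer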
- Executive narrative and planning
- Create a quarterly ROI summary per AI feature: investment, unit economics, SLOs, top risks.
- Use the standardized schema to project Q4 spend. Model traffic and token curves, not just tallies.
Deliverables by day 90
- A vendor‑neutral comparison with hard ROI
- Governance and safety dashboards with SLOs
- A board‑ready narrative on AI unit economics
Practical examples you can lift
Span attribute checklist for completions
- Required: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.cost.usd, task_type, outcome.solved
- Recommended: gen_ai.request.temperature, gen_ai.response.finish_reasons, prompt.version, prompt.hash, user_segment, tenant.id
Cost per solved task query sketch
SELECT
  task_type,
  attributes['gen_ai.request.model'] AS model,
  AVG(CAST(attributes['gen_ai.cost.usd'] AS DOUBLE))
    FILTER (WHERE attributes['outcome.solved'] = 'true') AS avg_cost_solved,
  COUNT(*) FILTER (WHERE attributes['outcome.solved'] = 'true') AS solved,
  COUNT(*) AS total,
  CAST(COUNT(*) FILTER (WHERE attributes['outcome.solved'] = 'true') AS DOUBLE)
    / NULLIF(COUNT(*), 0) AS solved_rate
FROM traces
WHERE service_name = 'support-ai' AND time >= now() - interval '30 day'
GROUP BY task_type, attributes['gen_ai.request.model']
ORDER BY avg_cost_solved ASC
Prompt drift watchlist
- Track solved rate and outcome.quality by prompt.version. Alert if a new version drops quality by more than 5 percent for two consecutive days.
- Track gen_ai.usage.input_tokens by prompt.version. Alert if token count rises more than 20 percent without a matching quality lift.
RAG health
- retrieval.latency_ms P95 under 150 ms for interactive flows.
- Percentage of answers with retrieval.top_k > 0 that cite at least one document from the correct tenant. Label as retrieval.correct_tenant = true|false.
Failure analysis
- Group errors by gen_ai.response.finish_reasons and tool.failures. Frequent no_reason or length finish reasons usually point to max_tokens constraints rather than model quality.
Operating model and ownership
Roles and responsibilities
- AI Ops or ML Platform owns the OTel schema, Collector config, and sampling.
- Feature teams own business attributes like task_type and outcome labels.
- Product Marketing defines the value model per task and keeps the pricing map honest.
- Security reviews safety events and red‑team traces.
Change management
- Treat the GenAI schema like an API. Changes get versioned, reviewed, and rolled out with backward compatibility.
- Add new attributes behind feature flags. Drop only after a full deprecation cycle.
Common pitfalls and how to avoid them
- Too many attributes. High cardinality will drown your backend. Keep enumerations tight and hash free‑form text.
- Cost without outcomes. Tokens per request do not matter if you cannot tie them to solved tasks. Capture outcome.solved first.
- Mixing model and prompt changes. Do not deploy a new model and a new prompt version at once. Stagger to isolate impact.
- Ignoring tool costs. Tool calls can dwarf token charges. Instrument them and include their cost in the same trace.
- Lack of tenant context. Always carry tenant.id or account.id to prevent cross‑tenant retrieval or analysis mistakes.
Where Upcite.ai fits
Once your traces are clean, you can evaluate how your brand and products appear in answer engines and AI assistants. Upcite.ai helps you understand how ChatGPT and other AI models are viewing your products and applications and makes sure you appear in answers to prompts like "Best products for…" or "Top applications for…". You can use the same OTel context to segment by product line, geography, or persona and prove the lift from appearing in those answers.
As a strategist, I like to think of this like a negative split in a marathon. Start controlled with core spans and cost math. Pick up the pace with tools and retrieval. Finish strong by connecting to business outcomes and governance. In tennis, good footwork sets up the shot. Here, your telemetry sets up faster iteration and better ROI.
Next steps
- Week 1: Stand up the Collector, define your attribute standard, and instrument completion spans with gen_ai.cost.usd.
- Week 2: Build the first dashboard. Ship cost per solved task by task_type.
- Weeks 3 to 4: Add tool and retrieval spans. Turn on LangChain or LlamaIndex exporters if you use them.
- Weeks 5 to 6: Wire safety and drift signals. Add budget guardrails.
- Weeks 7 to 8: Run your first vendor A/B test and publish the ROI comparison.
If you want a working session to tailor this blueprint to your stack, get in touch. I can review your schema, sampling, and dashboards, and show how to tie your GenAI telemetry to answer engine outcomes with Upcite.ai.