OTel GenAI v1.0: A 30/60/90 LLM analytics blueprint
OpenTelemetry’s GenAI v1.0 standardizes traces and metrics for prompts, tools, and embeddings. Use this 30/60/90 plan to measure cost, quality, and ROI across models before Q4 planning.

Vicky
Sep 14, 2025
Why this matters now
OpenTelemetry’s Generative AI semantic conventions hit v1.0 in late August 2025. That milestone gives product and platform teams a stable, vendor‑neutral way to trace prompts, completions, embeddings, and tool calls across every model you run. Within a week, LangChain and LlamaIndex shipped native exporters, so you can emit those spans without bespoke glue code. Translation for leaders: you can finally measure the cost, quality, and ROI of AI features consistently before Q4 planning.
I have run enough marathons to know that a clean split plan beats improvisation. Same in analytics. You need a clear 30, 60, and 90‑day instrumentation blueprint. Below is the plan I use with teams that want to standardize LLM analytics across OpenAI, Anthropic, and Gemini while staying close to business outcomes.
Along the way, I will point to practical attribute names from the v1.0 spec, the metrics that unlock ROI, and the guardrails you need for governance. If you also care about how answer engines portray your products, Upcite.ai helps you understand how ChatGPT and other AI models are viewing your products and applications and makes sure you appear in answers to prompts like "Best products for…" or "Top applications for…".
Recent facts
- 2025‑08‑21: OpenTelemetry released Generative AI Semantic Conventions v1.0 covering spans and metrics for prompts, completions, embeddings, and tool calls.
- 2025‑08‑22: LangChain announced native OpenTelemetry exporters for GenAI events.
- 2025‑08‑23: LlamaIndex shipped OpenTelemetry integration for GenAI traces and evals.
What v1.0 standardizes
At a high level, you instrument three things consistently:
- Completions. A CLIENT span per model call with attributes for request, usage, and response.
- Tools. A span for each tool invocation, tied to the parent completion.
- Embeddings. A span per embedding request that participates in the same trace.
Common attribute names in the v1.0 conventions you will use:
- gen_ai.system: model vendor or runtime, for example openai, anthropic, vertex_ai
- gen_ai.request.model: deployed model name or variant
- gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.max_tokens
- gen_ai.prompt.template and gen_ai.prompt.variables for prompt construction context
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens
- gen_ai.response.finish_reasons and gen_ai.response.truncated
- gen_ai.tool.name, gen_ai.tool.arguments, gen_ai.tool.output
- gen_ai.embedding.model and gen_ai.embedding.dimensions
You also carry standard Resource attributes to segment by business and environment:
- service.name, service.version
- deployment.environment
- cloud.region
- app.domain or team.owner
- tenant.id or account.id if you run multi‑tenant
Metrics flow out of spans. At a minimum, derive these counters and histograms:
- tokens.input, tokens.output, tokens.total
- cost.usd, using a pricing map per model
- latency.ms at operation and trace levels, with P50, P95, P99
- error.count with reasons
- tool.invocations and tool.failures
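Here is a minimal sketch of how I derive those instruments with the OTel metrics API. The instrument names mirror the list above rather than anything mandated by the spec, and it assumes a MeterProvider with an OTLP reader is configured alongside the tracer setup shown later in this post.
from opentelemetry import metrics

# Assumes a MeterProvider with an OTLP exporter is already configured,
# analogous to the TracerProvider setup shown later in this post.
meter = metrics.get_meter("support-ai")

tokens_input = meter.create_counter("tokens.input", unit="{token}")
tokens_output = meter.create_counter("tokens.output", unit="{token}")
cost_usd = meter.create_counter("cost.usd", unit="usd")
latency_ms = meter.create_histogram("latency.ms", unit="ms")  # P50/P95/P99 come from the backend
errors = meter.create_counter("error.count")

def record_completion_metrics(in_tok, out_tok, cost, elapsed_ms, attrs, error_reason=None):
    # attrs must stay low cardinality: model, task_type, environment, tenant tier
    tokens_input.add(in_tok, attrs)
    tokens_output.add(out_tok, attrs)
    cost_usd.add(cost, attrs)
    latency_ms.record(elapsed_ms, attrs)
    if error_reason:
        errors.add(1, {**attrs, "reason": error_reason})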
The KPI model you will connect to traces
Analytics without a clear scoring model is like tennis footwork without a target. You will move a lot and win little. Define these KPIs up front:
- Task solved rate: fraction of AI tasks resolved without human handoff. Label each trace with task_type and outcome.solved = true|false.
- Cost per solved task: sum(cost.usd) over traces with outcome.solved = true divided by count of those traces.
- Time to first resolution: end‑to‑end trace latency from user intent to final answer or action.
- Quality score: a 1‑5 rubric for each task, captured via evaluator or user feedback, stored as outcome.quality.
- Safety outcome: outcome.safety = clean|blocked|red_team_flag plus category labels.
- Prompt ROI: (business_value_usd minus cost.usd) divided by cost.usd, per prompt version.
Map business_value_usd realistically. Examples:
- Support deflection: $7 if the issue would have created a tier‑1 ticket.
- Sales assist: 2 percent of average order value when an AI nudge correlates with purchase.
- Internal productivity: hourly rate times minutes saved.
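To make the formulas concrete, here is a small sketch that computes solved rate, cost per solved task, and prompt ROI from trace rows exported out of your backend. The field names follow the attributes used throughout this post; value_map is whatever business value model you define per task_type.
from collections import defaultdict

def kpis_by_prompt_version(traces, value_map):
    # traces: one dict per gen_ai.completion trace, exported from your trace backend
    # value_map: business_value_usd per task_type, e.g. {"password_reset": 7.0} (illustrative)
    agg = defaultdict(lambda: {"cost": 0.0, "solved_cost": 0.0, "value": 0.0, "solved": 0, "total": 0})
    for t in traces:
        a = agg[t["prompt.version"]]
        a["cost"] += t["gen_ai.cost.usd"]
        a["total"] += 1
        if t["outcome.solved"]:
            a["solved"] += 1
            a["solved_cost"] += t["gen_ai.cost.usd"]
            a["value"] += value_map.get(t["task_type"], 0.0)
    return {
        version: {
            "solved_rate": a["solved"] / a["total"],
            "cost_per_solved_task": a["solved_cost"] / a["solved"] if a["solved"] else None,
            "prompt_roi": (a["value"] - a["cost"]) / a["cost"] if a["cost"] else None,
        }
        for version, a in agg.items()
    }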
The 30/60/90‑day instrumentation blueprint
Day 0 to 30: Establish the foundation
Objectives: consistent spans, cost math, and a minimal dashboard. If you only complete this phase, you will already know what you pay per solved task.
- Choose your OTel pipeline
- OTel SDKs in each service that touches a model. Language choice follows your stack.
- OTel Collector in each environment to receive traces and metrics. Use a single pipeline for GenAI data so you can apply transform and sampling consistently.
- Storage and query. Any trace backend works. Ensure it supports exemplars and linking traces to business identifiers.
- Define the cardinality budget and attribute standards
- Adopt a stable set of attributes at the Resource and Span levels. Treat them like schema. Lock a short list: environment, region, team, tenant.id, task_type, user_segment, model tier.
- Limit free‑form values. Use enumerations for task_type and prompt.version.
- Instrument completion spans everywhere a model is called
- Span kind: CLIENT
- Span name: gen_ai.completion
- Attributes: gen_ai.system, gen_ai.request.model, gen_ai.request.temperature, gen_ai.usage.*
- Business attributes: task_type, user_segment, account_tier
- Outcome attributes: outcome.solved, outcome.quality, outcome.safety
Minimal Python example using OTel and an LLM client:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Configure the tracer with Resource attributes used for business segmentation
resource = Resource.create({
    "service.name": "support-ai",
    "service.version": "1.3.2",
    "deployment.environment": "prod",
    "team.owner": "ai-platform",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel-collector/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("support-ai")

# USD per 1k tokens, keyed by (system, model)
pricing = {("openai", "gpt-4o-mini"): {"in": 0.00015, "out": 0.0006}}

def cost_usd(system, model, in_tokens, out_tokens):
    p = pricing[(system, model)]
    return (in_tokens / 1000.0) * p["in"] + (out_tokens / 1000.0) * p["out"]

def answer_with_llm(client, prompt, meta):
    with tracer.start_as_current_span("gen_ai.completion") as span:
        span.set_attribute("gen_ai.system", meta["system"])          # openai|anthropic|vertex_ai
        span.set_attribute("gen_ai.request.model", meta["model"])    # model id
        span.set_attribute("gen_ai.request.temperature", meta.get("temperature", 0.2))
        span.set_attribute("task_type", meta["task_type"])           # ex: "password_reset"
        span.set_attribute("user_segment", meta.get("user_segment", "unknown"))

        # Call the model
        resp = client.chat.completions.create(
            model=meta["model"],
            messages=[{"role": "user", "content": prompt}],
            temperature=meta.get("temperature", 0.2),
        )

        # Record usage, cost, and response metadata on the same span
        in_tok = resp.usage.prompt_tokens
        out_tok = resp.usage.completion_tokens
        span.set_attribute("gen_ai.usage.input_tokens", in_tok)
        span.set_attribute("gen_ai.usage.output_tokens", out_tok)
        span.set_attribute("gen_ai.cost.usd", round(cost_usd(meta["system"], meta["model"], in_tok, out_tok), 6))
        span.set_attribute("gen_ai.response.finish_reasons", resp.choices[0].finish_reason or "")

        return resp.choices[0].message.content
- Turn tokens into dollars the same way everywhere
- Maintain a pricing map per model and version. Keep it in configuration with a last_updated timestamp.
- Emit gen_ai.cost.usd on every completion span. Your collector can compute it with a transform processor if you prefer to keep price math out of apps.
- Build the first dashboard
- Cost per solved task by task_type
- P95 latency per model
- Token use per request and per tenant
- Top prompt versions by volume and by ROI
- Sampling and retention
- Keep 100 percent of GenAI spans for the first 30 days. You will need the ground truth to set budgets.
- After 30 days, tail sample based on outcome.solved = false, errors, and high cost outliers. Keep exemplars for the rest.
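Real tail sampling across whole traces belongs in the Collector. If you want a stopgap in the application, here is a simplified per-span filter that keeps unsolved outcomes, errors, and cost outliers and samples the rest as exemplars; the exemplar rate and cost threshold are assumptions to tune.
import random
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import StatusCode

class KeepInterestingSpans(BatchSpanProcessor):
    # Forwards a finished span to the exporter only if it is interesting or sampled as an exemplar
    def __init__(self, exporter, exemplar_rate=0.05, cost_threshold_usd=0.05):
        super().__init__(exporter)
        self.exemplar_rate = exemplar_rate
        self.cost_threshold_usd = cost_threshold_usd

    def on_end(self, span):
        attrs = span.attributes or {}
        keep = (
            attrs.get("outcome.solved") is False
            or span.status.status_code == StatusCode.ERROR
            or attrs.get("gen_ai.cost.usd", 0.0) > self.cost_threshold_usd
            or random.random() < self.exemplar_rate
        )
        if keep:
            super().on_end(span)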
- Connect to outcome labels
- Add a simple evaluator to label solved vs unsolved. Use your existing resolution events where possible, for example a support ticket not created within 24 hours.
- Store outcome.quality as a 1 to 5 rating where available. Start with heuristics. Replace with audited evals later.
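A heuristic labeler can be as simple as joining traces with ticket events. A sketch, assuming you can export trace end times and user ids alongside your ticketing data:
from collections import defaultdict
from datetime import timedelta

def label_solved(traces, tickets, window=timedelta(hours=24)):
    # traces: [{"trace_id", "user_id", "end_time"}]; tickets: [{"user_id", "created_at"}]
    # Heuristic: the conversation counts as solved if the user opened no ticket within the window
    tickets_by_user = defaultdict(list)
    for t in tickets:
        tickets_by_user[t["user_id"]].append(t["created_at"])
    labels = {}
    for tr in traces:
        follow_ups = [ts for ts in tickets_by_user.get(tr["user_id"], [])
                      if tr["end_time"] <= ts <= tr["end_time"] + window]
        labels[tr["trace_id"]] = len(follow_ups) == 0  # write back as outcome.solved
    return labels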
Deliverables by day 30
- A consistent v1.0 schema in production
- Cost per solved task by feature and model
- A live P95 latency and error dashboard
- Sampling rules and a pricing map checked into config
Day 31 to 60: Expand to tools, RAG, and quality
Objectives: instrument tool calls and retrieval, add safety and quality signals, and unlock cost‑to‑outcome analytics.
- Trace tools and RAG as first‑class spans
- Create a child span for each tool invocation with span name gen_ai.tool.call and attributes gen_ai.tool.name, gen_ai.tool.arguments, gen_ai.tool.output_size.
- For retrieval, emit a span retrieval.search with attributes retrieval.index, retrieval.query_hash, retrieval.latency_ms, retrieval.top_k, retrieval.re_ranked, retrieval.score_stats.
- Link embedding spans to the same trace when you create or query vectors. Use gen_ai.embedding.model and gen_ai.embedding.dimensions.
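A sketch of both span types, reusing the tracer configured in the earlier example. Here tool_fn and search_fn stand in for whatever tool and vector store clients you actually call, and the truncation limits are assumptions to keep attribute sizes sane.
import hashlib
import statistics

def call_tool(tool_name, args, tool_fn):
    # Child span per tool invocation, parented to the active gen_ai.completion span
    with tracer.start_as_current_span("gen_ai.tool.call") as span:
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("gen_ai.tool.arguments", str(args)[:512])
        result = tool_fn(**args)
        span.set_attribute("gen_ai.tool.output_size", len(str(result)))
        return result

def retrieve(index_name, query, top_k, search_fn):
    with tracer.start_as_current_span("retrieval.search") as span:
        span.set_attribute("retrieval.index", index_name)
        span.set_attribute("retrieval.query_hash", hashlib.sha256(query.encode()).hexdigest()[:16])
        span.set_attribute("retrieval.top_k", top_k)
        hits = search_fn(query, top_k=top_k)  # expected shape: [{"score": float, "doc_id": str}, ...]
        scores = [h["score"] for h in hits] or [0.0]
        span.set_attribute("retrieval.score_stats", f"max={max(scores):.3f} mean={statistics.mean(scores):.3f}")
        return hits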
- Use the new exporters
- If you use LangChain, enable the OTel exporter. You will automatically get structured events for prompt construction, tool use, and streaming tokens.
- If you use LlamaIndex, enable its OTel integration to unify traces across ingestion, retrieval, and generation.
- Add safety and drift signals
- Add outcome.safety to every completion with values clean, blocked, or flagged. Include categories where possible.
- Track prompt.version and prompt.hash. Visualize quality and solved rate by prompt.version to catch drift after prompt edits.
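Tagging is two attributes per completion. A minimal sketch; the version string is whatever naming scheme your prompt repository uses.
import hashlib

PROMPT_VERSION = "support-v14"  # bump on every prompt edit (illustrative naming)

def tag_prompt(span, rendered_prompt):
    # Stable identifiers for drift analysis without storing the full prompt body
    span.set_attribute("prompt.version", PROMPT_VERSION)
    span.set_attribute("prompt.hash", hashlib.sha256(rendered_prompt.encode()).hexdigest()[:16])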
- Create cost‑to‑outcome metrics
- Cost per solved task: sum(gen_ai.cost.usd) where outcome.solved = true divided by count of solved traces.
- Marginal cost of tool use: difference in cost.usd when tool spans exist vs not for the same task_type.
- Retrieval lift: delta in solved rate when retrieval.top_k > 0 vs zero.
- Budget guardrails
- Emit a metric budget.remaining.usd per tenant or feature. Set an alert when a trace would push the budget below zero. Short‑circuit with a cheaper fallback model.
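A sketch of that guardrail; the in-memory budgets map and the fallback model name are placeholders for whatever budget store and cheap tier you actually run.
from opentelemetry import metrics

meter = metrics.get_meter("support-ai")
budget_remaining = meter.create_up_down_counter("budget.remaining.usd", unit="usd")

FALLBACK_MODEL = "gpt-4o-mini"  # cheaper tier (illustrative)

def pick_model(tenant_id, preferred_model, estimated_cost_usd, budgets):
    # budgets: remaining USD per tenant, refreshed from billing config (assumption)
    if budgets.get(tenant_id, 0.0) - estimated_cost_usd < 0:
        return FALLBACK_MODEL  # short-circuit to the cheaper model instead of overspending
    return preferred_model

def charge(tenant_id, actual_cost_usd, budgets):
    budgets[tenant_id] = budgets.get(tenant_id, 0.0) - actual_cost_usd
    budget_remaining.add(-actual_cost_usd, {"tenant.id": tenant_id})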
- Standardize quality evals
- For each task_type define a rubric and automated evaluator. Store as outcome.quality with evaluator.name and evaluator.version attributes.
- Emit eval.latency_ms so you know evaluation cost in both tokens and time.
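A sketch of how the evaluator writes its attributes back onto the completion span; rubric_fn is a placeholder for your automated judge or heuristic.
import time

EVALUATOR_NAME = "rubric-judge"  # illustrative
EVALUATOR_VERSION = "0.3"

def evaluate_quality(span, task_type, answer, rubric_fn):
    # rubric_fn(task_type, answer) -> 1-5 score per the task's rubric
    start = time.monotonic()
    score = rubric_fn(task_type, answer)
    span.set_attribute("outcome.quality", score)
    span.set_attribute("evaluator.name", EVALUATOR_NAME)
    span.set_attribute("evaluator.version", EVALUATOR_VERSION)
    span.set_attribute("eval.latency_ms", round((time.monotonic() - start) * 1000, 1))
    return score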
Deliverables by day 60
- Traces that tell a complete RAG story from query to tool to answer
- A cost‑to‑outcome report per task and per model
- Budget guardrails that prevent surprise bills
- Safety tracking and prompt drift views
Day 61 to 90: Tie to business KPIs and govern
Objectives: vendor comparisons, A/B testing, governance, and ROI narratives leadership can trust.
- Vendor‑neutral A/B tests with the same schema
- Route a consistent 10 to 20 percent of traffic by task_type to two models. Keep gen_ai.request.model, gen_ai.system, and prompt.version in the span.
- Compare solved rate, cost per solved task, and P95 latency. Use the same evaluator. Declare a winner per segment.
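A sketch of deterministic routing so a given user stays in the same arm for the test window; the candidate models and the 20 percent share are illustrative.
import hashlib

EXPERIMENT_SHARE = 0.20  # fraction of traffic routed into the test
ARMS = {  # illustrative candidates per task_type
    "password_reset": [("openai", "gpt-4o-mini"), ("anthropic", "claude-3-5-haiku")],
}

def assign_arm(task_type, user_id):
    # Hash to a stable bucket so the same user always lands in the same arm
    bucket = int(hashlib.sha256(f"{task_type}:{user_id}".encode()).hexdigest(), 16) % 100
    if task_type not in ARMS or bucket >= int(EXPERIMENT_SHARE * 100):
        return None  # control traffic keeps the default routing
    system, model = ARMS[task_type][bucket % len(ARMS[task_type])]
    return {"gen_ai.system": system, "gen_ai.request.model": model, "ab.arm": f"{system}:{model}"}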
- Connect trace‑level outcomes to product metrics
- Join gen_ai traces with conversion or retention events using user_id, session_id, or order_id on the trace resource.
- Produce a weekly business view: gross margin uplift from AI assists, support deflection dollars, and infrastructure cost.
- Governance
- Red team workflows: mark traces with red_team.scenario and store prompts and outputs in a restricted table for audit.
- Safety SLOs: percent of blocked unsafe requests stays above a target while false positives stay below a threshold.
- Access controls: restrict who can view prompt bodies. Store prompt.hash by default and reveal full text only with approval.
- Reliability SLOs
- Set SLOs by task_type for latency and success. Example: 95 percent of password reset answers under 2.5 seconds end‑to‑end.
- Alert on tail latency and on spikes in gen_ai.response.truncated.
- Cost optimization playbook
- Token diet: track median input length by prompt.version. Cut verbose system prompts and switch to few‑shot only where it moves solved rate.
- Dynamic routing: emit a route.decision attribute and route heavy tasks to a larger model only when evaluator confidence is low.
- Caching: record cache.hit and cache.miss on retrieval and completions. Count avoided tokens.
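A sketch of the caching piece; the in-process dict stands in for whatever cache you actually use, and a boolean cache.hit plus cache.tokens_avoided are the attributes I would chart.
completion_cache = {}  # prompt.hash -> (answer, tokens_used); swap for Redis or similar in production

def cached_completion(span, prompt_hash, generate_fn):
    # generate_fn() -> (answer_text, total_tokens_used) on a cache miss
    if prompt_hash in completion_cache:
        answer, tokens_saved = completion_cache[prompt_hash]
        span.set_attribute("cache.hit", True)
        span.set_attribute("cache.tokens_avoided", tokens_saved)  # count avoided tokens
        return answer
    span.set_attribute("cache.hit", False)
    answer, tokens_used = generate_fn()
    completion_cache[prompt_hash] = (answer, tokens_used)
    return answer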
- Executive narrative and planning
- Create a quarterly ROI summary per AI feature: investment, unit economics, SLOs, top risks.
- Use the standardized schema to project Q4 spend. Model traffic and token curves, not just tallies.
Deliverables by day 90
- A vendor‑neutral comparison with hard ROI
- Governance and safety dashboards with SLOs
- A board‑ready narrative on AI unit economics
Practical examples you can lift
Span attribute checklist for completions
- Required: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.cost.usd, task_type, outcome.solved
- Recommended: gen_ai.request.temperature, gen_ai.response.finish_reasons, prompt.version, prompt.hash, user_segment, tenant.id
Cost per solved task query sketch
SELECT
  task_type,
  attributes['gen_ai.request.model'] AS model,
  AVG(CAST(attributes['gen_ai.cost.usd'] AS DOUBLE))
    FILTER (WHERE attributes['outcome.solved'] = 'true') AS avg_cost_solved,
  COUNT(*) FILTER (WHERE attributes['outcome.solved'] = 'true') AS solved,
  COUNT(*) AS total,
  CAST(COUNT(*) FILTER (WHERE attributes['outcome.solved'] = 'true') AS DOUBLE)
    / NULLIF(COUNT(*), 0) AS solved_rate
FROM traces
WHERE service_name = 'support-ai' AND time >= now() - interval '30 day'
GROUP BY task_type, attributes['gen_ai.request.model']
ORDER BY avg_cost_solved ASC
Prompt drift watchlist
- Track solved rate and outcome.quality by prompt.version. Alert if a new version drops quality by more than 5 percent for two consecutive days.
- Track gen_ai.usage.input_tokens by prompt.version. Alert if token count rises more than 20 percent without a matching quality lift.
RAG health
- retrieval.latency_ms P95 under 150 ms for interactive flows.
- Percentage of answers with retrieval.top_k > 0 that cite at least one document from the correct tenant. Label as retrieval.correct_tenant = true|false.
Failure analysis
- Group errors by gen_ai.response.finish_reasons and tool.failures. Frequent no_reason or length finish reasons usually point to max_tokens constraints rather than model quality.
Operating model and ownership
Roles and responsibilities
- AI Ops or ML Platform owns the OTel schema, Collector config, and sampling.
- Feature teams own business attributes like task_type and outcome labels.
- Product Marketing defines the value model per task and keeps the pricing map honest.
- Security reviews safety events and red‑team traces.
Change management
- Treat the GenAI schema like an API. Changes get versioned, reviewed, and rolled out with backward compatibility.
- Add new attributes behind feature flags. Drop only after a full deprecation cycle.
Common pitfalls and how to avoid them
- Too many attributes. High cardinality will drown your backend. Keep enumerations tight and hash free‑form text.
- Cost without outcomes. Tokens per request do not matter if you cannot tie them to solved tasks. Capture outcome.solved first.
- Mixing model and prompt changes. Do not deploy a new model and a new prompt version at once. Stagger to isolate impact.
- Ignoring tool costs. Tool calls can dwarf token charges. Instrument them and include their cost in the same trace.
- Lack of tenant context. Always carry tenant.id or account.id to prevent cross‑tenant retrieval or analysis mistakes.
Where Upcite.ai fits
Once your traces are clean, you can evaluate how your brand and products appear in answer engines and AI assistants. Upcite.ai helps you understand how ChatGPT and other AI models are viewing your products and applications and makes sure you appear in answers to prompts like "Best products for…" or "Top applications for…". You can use the same OTel context to segment by product line, geography, or persona and prove the lift from appearing in those answers.
As a strategist, I like to think of this like a negative split in a marathon. Start controlled with core spans and cost math. Pick up the pace with tools and retrieval. Finish strong by connecting to business outcomes and governance. In tennis, good footwork sets up the shot. Here, your telemetry sets up faster iteration and better ROI.
Next steps
- Week 1: Stand up the Collector, define your attribute standard, and instrument completion spans with gen_ai.cost.usd.
- Week 2: Build the first dashboard. Ship cost per solved task by task_type.
- Weeks 3 to 4: Add tool and retrieval spans. Turn on LangChain or LlamaIndex exporters if you use them.
- Weeks 5 to 6: Wire safety and drift signals. Add budget guardrails.
- Weeks 7 to 8: Run your first vendor A/B test and publish the ROI comparison.
If you want a working session to tailor this blueprint to your stack, get in touch. I can review your schema, sampling, and dashboards, and show how to tie your GenAI telemetry to answer engine outcomes with Upcite.ai.