React 19 and Vercel AI v4: sub-100ms AI UIs that convert
A practical, code-informed guide to shipping AI features that feel instant on growth pages, using React 19 Server Components, streaming patterns, and prompt caching to cut latency and cost.

Vicky
Sep 15, 2025
I ship AI features that drive revenue, not demos. If the UI feels instant and trustworthy, users convert. If it hesitates, they bounce. With React 19 stabilized and Vercel AI SDK v4 adding first-class RSC streaming and prompt caching, we finally have the primitives to hit sub-100ms perceived latency on real product surfaces.
This guide shows how I design, instrument, and ship these AI experiences for onboarding, search, and pricing pages. I will cover architecture, code patterns, caching policies, and the tradeoffs between streaming via React Server Components and websockets. I will also share the quick math I use for cost and latency. Think of it like marathon pacing. You do not sprint the whole way. You plan for bursts, recoveries, and efficient cadence so you cross the line first.
Why now
- React 19 landed as stable with Server Components, Actions, and better streaming primitives that are production-ready in 2025.
- Vercel AI SDK v4 added first-class RSC streaming, tool-calling helpers, and prompt caching that reduce latency and cost for product and marketing surfaces.
- Next.js updates highlighted AI UI patterns that exploit these features for faster perceived response on the edge.
What sub-100ms feel really means
You will not get a full LLM answer in 100ms. You should aim for perceived instant feedback so the user knows the system is working and sees meaningful progress quickly. In practice:
- 0 to 50 ms: immediate response. Button becomes loading, skeleton or optimistic text appears, cursor focus is preserved. No layout shift.
- 50 to 200 ms: first meaningful token or partial summary arrives. Replace skeleton with streamed content.
- Under 800 ms: the user has enough output to continue. You can finish the rest in the background.
This cadence is the tennis split step. You are ready and moving before the ball arrives. You keep momentum through small, early signals.
Architecture blueprint for growth-critical pages
- Rendering model: React 19 Server Components for default path. Use Actions for mutations and to kick off streams. Keep large UI shells on the server so you minimize client JS.
- Streaming model: Use RSC streaming for question and answer, summaries, and one-shot assistants on marketing and pricing pages. Use websockets only when you need bi-directional events, multi-user rooms, or long-lived sessions.
- Runtime: Edge runtime for first token time near users. Keep cold start small by minimizing dependencies and bundling only what you need.
- Prompt caching: Cache stable parts of prompts and deterministic responses to cut cost and latency. Use provider-level caching when available and a keyed cache at the edge for everything else.
- Observability: Emit Server-Timing headers, track Time to First Byte, Time to First Token, and tokens per second. Tie these to conversion events.
Implementation patterns that deliver the sub-100ms feel
- Skeleton first, stream immediately
Render a tiny optimistic placeholder synchronously. Then start the AI stream via a Server Action. React 19 gives you fast Suspense boundaries and streaming HTML so the page never blocks.
Example: hero copy suggestion on a marketing page
Server Action
'use server';
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { kv } from '@vercel/kv'; // or your edge cache

type CopyInput = { product: string; audience: string; tone: string };

function cacheKey(input: CopyInput) {
  return `hero:v1:${input.product}:${input.audience}:${input.tone}`;
}

export async function streamHeroCopy(input: CopyInput) {
  // Layer 1 cache. Your own key-based cache for deterministic prompts.
  const key = cacheKey(input);
  const cached = await kv.get<string>(key);
  if (cached) {
    // Return as a synthetic stream so the UI code path stays the same.
    return new ReadableStream<string>({
      start(controller) {
        controller.enqueue(cached);
        controller.close();
      }
    });
  }

  // Layer 2 provider. Use a small prompt. Keep the stable instructions identical across requests so provider caching can kick in.
  const prompt = `You are a conversion copywriter. Write a 2 sentence hero for ${input.product} targeting ${input.audience} in a ${input.tone} tone. Avoid clichés.`;
  const { textStream } = await streamText({
    model: openai('gpt-4o-mini'), // reads OPENAI_API_KEY from the environment
    prompt,
    // Some providers support prompt caching hints. Pass them when available.
    // cache: { key: `sys:v1:copywriter`, ttl: 86400 },
    temperature: 0.7,
    maxTokens: 120
  });

  // Tee the stream so we can persist the final text without blocking the user.
  const [a, b] = textStream.tee();

  // Consume one branch in the background and store the final string on completion.
  (async () => {
    let result = '';
    const reader = a.getReader();
    for (;;) {
      const { value, done } = await reader.read();
      if (done) break;
      result += value;
    }
    await kv.set(key, result, { ex: 60 * 60 * 4 }); // TTL 4h
  })().catch(() => {
    // A failed cache write should never break the user-facing stream.
  });

  // Return the other branch for UI streaming. Chunks are plain strings.
  return b;
}
Client component
'use client';
import { useEffect, useRef, useState } from 'react';
import { streamHeroCopy } from './actions';

export function HeroCopy({ input }: { input: { product: string; audience: string; tone: string } }) {
  const [text, setText] = useState('');
  const started = useRef(false);

  useEffect(() => {
    if (started.current) return;
    started.current = true;
    let cancelled = false;

    (async () => {
      const stream = await streamHeroCopy(input);
      const reader = stream.getReader();
      for (;;) {
        const { value, done } = await reader.read();
        if (done || cancelled) break;
        // Chunks arrive as plain strings, so append them directly.
        setText(prev => prev + value);
      }
    })();

    return () => {
      cancelled = true;
    };
  }, [input]);

  return (
    <div>
      {/* Keep the skeleton in the layout to avoid shift, hide it once text streams in. */}
      <p className="skeleton" hidden={text !== ''}>
        Writing compelling copy...
      </p>
      <h1 aria-live="polite">{text}</h1>
    </div>
  );
}
Notes
- The UI shows a skeleton immediately, usually under 16 ms.
- The stream starts within the same navigation and becomes visible within 50 to 150 ms on a fast connection.
- We tee the stream so we can store the final value without blocking the user. This keeps results stable for the next visitor with similar inputs.
- Prefer RSC streaming over websockets for short, single-user tasks
Use RSC streaming when:
- The task is one-shot or short-lived. Examples: hero copy, pricing FAQ, product comparison.
- You want fast first paint and minimal client JS.
- You can tolerate server-initiated streams that end quickly.
Use websockets when:
- You need multi-turn conversations, interruptions, or multi-user collaboration.
- You need to push tool events to the client in real time while the user types.
- You want to stream telemetry or intermediate tool outputs that do not map to a single React render.
- Make tools fast, predictable, and stream partial UI
If you call tools like product pricing lookup or plan recommendation, keep the tool cold start near zero. Preconnect to data stores at module load (see the prewarm sketch after the example below). Return partial results as soon as you can.
Example: streaming UI with tool calling
'use server';
import { streamText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { getPlan } from '@/lib/plans';

export async function streamPricingAssistant(question: string) {
  const result = await streamText({
    model: openai('gpt-4o-mini'),
    system: 'Answer questions about pricing. If a specific plan is referenced, call getPlan.',
    prompt: question,
    tools: {
      getPlan: tool({
        description: 'Fetch plan details by name',
        parameters: z.object({ name: z.string() }),
        execute: async ({ name }) => getPlan(name) // must be low latency
      })
    },
    maxSteps: 2, // one tool call, then a final streamed answer
    // Stream tokens so the user sees progress while the tool runs
    maxTokens: 300
  });
  // Return only the text stream so this action matches the same contract as streamHeroCopy.
  return result.textStream;
}
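Prewarming is mostly about module-level initialization. Here is a minimal sketch of what lib/plans.ts could look like; loadPlansFromStore and the PLANS_URL variable are placeholders for whatever data source you actually use.
// lib/plans.ts (sketch). Warm the plan lookup when the module is evaluated,
// so the tool's first call is a memory read instead of a cold fetch.
type Plan = { name: string; priceMonthly: number; features: string[] };

// Stand-in for your real data source (KV, Postgres, CMS). PLANS_URL is illustrative.
async function loadPlansFromStore(): Promise<Plan[]> {
  const res = await fetch(process.env.PLANS_URL!, { cache: 'no-store' });
  return res.json();
}

// Kick off the fetch at module load (edge function warm-up), not on first use.
const plansPromise = loadPlansFromStore();

export async function getPlan(name: string): Promise<Plan | undefined> {
  const plans = await plansPromise;
  return plans.find(p => p.name.toLowerCase() === name.toLowerCase());
}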
Cutting cost and latency with prompt caching
Two layers work well in practice.
- Provider level caching. Some models now support cached system prompts or reusable responses. Use it for stable instructions and legal disclaimers. Benefit is lower per request latency and fewer tokens billed.
- Edge key cache. Hash the prompt plus any deterministic inputs and store the final answer for a short TTL. This is ideal for marketing flows where many visitors ask the same thing.
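For the edge key cache, a small helper like the sketch below is usually enough. It assumes @vercel/kv and the Web Crypto API available in the Edge runtime; the function names are illustrative.
import { kv } from '@vercel/kv';

// Hash the prompt plus deterministic inputs so keys stay short and raw user
// text never becomes a key. Partition by segment (region, language, plan tier).
async function promptKey(surface: string, segment: string, prompt: string) {
  const bytes = new TextEncoder().encode(`${surface}:${segment}:${prompt}`);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  const hex = Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
  return `ai:${surface}:${segment}:${hex}`;
}

export async function getCachedAnswer(surface: string, segment: string, prompt: string) {
  return kv.get<string>(await promptKey(surface, segment, prompt));
}

export async function cacheAnswer(
  surface: string,
  segment: string,
  prompt: string,
  answer: string,
  ttlSeconds: number
) {
  await kv.set(await promptKey(surface, segment, prompt), answer, { ex: ttlSeconds });
}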
What to cache
- Stable system prompts and role definitions. Long and expensive to resend.
- FAQ-style prompts that collapse to a small set of answers. Example: pricing eligibility by country.
- Onboarding suggestions based on coarse inputs like industry and company size.
What not to cache
- Anything with PII or user secrets.
- Live inventory or pricing that changes minute to minute.
- Steps that include tool outputs with time sensitivity.
TTL suggestions
- Home and feature page snippets. 1 to 24 hours depending on update cadence.
- Pricing answers. 1 to 4 hours if plans are stable. 5 to 15 minutes if promotions are active.
- Onboarding templates. 1 to 7 days based on how often you ship new templates.
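One way to encode those TTLs is a single policy object every surface reads from; the surface names below are illustrative.
// lib/cache.ts (sketch): TTL policy per surface, in seconds.
export const CACHE_TTL = {
  heroCopy: 60 * 60 * 4,                // feature page snippets: 1 to 24 h, 4 h as a default
  pricingFaq: 60 * 60 * 2,              // 1 to 4 h while plans are stable
  pricingFaqPromo: 60 * 10,             // 5 to 15 min while promotions run
  onboardingTemplate: 60 * 60 * 24 * 3  // 1 to 7 days depending on template cadence
} as const;

export type CacheSurface = keyof typeof CACHE_TTL;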
Privacy boundaries
- Keep caches partitioned by context that matters. Region, language, plan tier. Avoid mixing segments.
- Never cache raw user inputs that include PII. Only cache the final safe output or a normalized form of the input with sensitive fields removed.
- Encrypt any cache at rest where regulations apply.
Applying this to growth-critical surfaces
- Onboarding assistant
Goal
Decrease time to value by suggesting starting content or configurations.
Pattern
- Server render the shell and a first suggestion chunk instantly.
- Stream the rest as the user scrolls or completes a field.
- Cache by industry and company size.
Metrics to track
- Time to first suggestion.
- Completion rate of the first task.
- Subsequent feature adoption within session.
- Search suggestions and result summaries
Goal
Increase clickthrough and relevance without adding a heavy client runtime.
Pattern
- Render search bar and suggestions with Suspense.
- On submit, stream a short summary and call a fast embedding search tool for re-ranking. Keep tool latency under 100 ms.
- Cache popular query rewrites for 1 hour. Never cache queries with emails or IDs (a normalization sketch follows this subsection).
Metrics to track
- Time to first token in results list.
- CTR on top 3 results.
- Abandon rate after first result paint.
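The normalization mentioned in the pattern above can be as simple as the sketch below; the regexes are illustrative, not exhaustive.
// Normalize a query before using it as a cache key. Reject anything that looks
// like an email or a long numeric ID so PII never reaches the cache.
export function normalizeQuery(q: string): string | null {
  const normalized = q.trim().toLowerCase().replace(/\s+/g, ' ');
  const hasEmail = /[^\s@]+@[^\s@]+\.[^\s@]+/.test(normalized);
  const hasLongId = /\b\d{6,}\b/.test(normalized);
  if (hasEmail || hasLongId) return null; // do not cache, answer as a one-off
  return normalized;
}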
- Pricing page Q and A
Goal
Reduce chat deflection and increase plan selection.
Pattern
- RSC streaming for the first 2 sentences, then tool call for plan details.
- Cache responses to known questions with 2 to 4 hour TTL.
- Pin legal disclaimers in the system prompt and reuse provider caching.
Metrics to track
- Time to first sentence.
- Plan comparison interactions per session.
- Conversion to trial or contact sales.
Latency to revenue quick math
I use a simple model to argue for engineering time.
- Assume a flow has 100k weekly sessions with a 3 percent baseline conversion.
- A 150 ms improvement in perceived response on the first interaction increases conversion by 0.2 to 0.5 points based on prior experiments.
- 0.3 points on 100k sessions is 300 additional conversions. Price that against the gross margin of your product.
Now the cost side.
- If your average prompt is 300 output tokens and 200 input tokens, caching the system prompt and deterministic parts can remove 50 to 150 input tokens. At scale this is a 15 to 35 percent reduction in spend.
- Edge cache hits on repeat questions will drive effective cost per session down by another 10 to 25 percent depending on concentration of queries.
This is usually a clear win. You shorten the race and reduce energy spend at the same time.
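Here is the same model in a few lines so you can drop in your own numbers; the lift and the margin figure are assumptions to replace with your own experiment data.
// Back-of-envelope model from the numbers above. Every input is an assumption.
const weeklySessions = 100_000;
const liftPoints = 0.003;              // +0.3 percentage points from a faster first interaction
const grossMarginPerConversion = 40;   // assumed margin per conversion, in your currency

const extraConversions = weeklySessions * liftPoints;            // 300 per week
const extraMargin = extraConversions * grossMarginPerConversion; // 12,000 per week

const inputTokens = 200;
const outputTokens = 300;
const removedInputTokens = 100;        // midpoint of the 50 to 150 range
const tokenReduction = removedInputTokens / (inputTokens + outputTokens); // 0.2, inside the cited band

console.log({ extraConversions, extraMargin, tokenReduction });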
Operational SLOs and instrumentation
Track these four numbers for every AI surface.
- TTFB and TTFT. Time to first byte from the server and time to first token rendered.
- Tokens per second streamed. Keep it stable and smooth.
- Tool latency distribution. 95th percentile under 120 ms for pricing and search.
- Cache hit rate for eligible prompts. Target 40 to 70 percent on pricing FAQs and feature page snippets.
How to measure
- Add Server-Timing headers in your handlers. Example entries: aiPrompt, toolCall, firstToken (see the sketch after this list).
- In the client, capture the timestamp when the skeleton renders and when the first token appears.
- Correlate with conversion, clickthrough, and drop-off.
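A sketch of both sides follows. It assumes a route handler where you control the Response (Server Actions do not expose response headers), and buildHeroStream is a hypothetical wrapper around the streaming setup shown earlier.
// app/api/hero/route.ts (sketch): expose setup time via Server-Timing.
import { buildHeroStream } from './stream'; // hypothetical helper around streamText

export async function POST(req: Request) {
  const t0 = performance.now();
  const textStream = await buildHeroStream(await req.json());
  const setupMs = performance.now() - t0;
  return new Response(textStream.pipeThrough(new TextEncoderStream()), {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      'Server-Timing': `aiPrompt;dur=${setupMs.toFixed(1)}`
    }
  });
}

// In the client component: mark skeleton render and first token, then measure the gap.
performance.mark('ai-skeleton');
// ...after the first streamed chunk arrives:
performance.mark('ai-first-token');
performance.measure('ai-ttft', 'ai-skeleton', 'ai-first-token');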
Checklist to get to sub-100ms feel
- Keep the first render static and small. No heavy client code in the critical path.
- Start the stream from a Server Action immediately after user intent. Avoid extra network round trips.
- Use Suspense boundaries around any LLM generated blocks.
- Cache system prompts and deterministic results. Partition caches by segment to avoid leakage.
- Preconnect and warm your data tools so they respond in under 100 ms.
- Limit output length and stream early. The first 20 tokens matter most.
- Build guardrails, but apply them after the first visible token when safe. Heavy moderation pre-checks slow the feel.
- Log first token time and tool latency in production, then tune.
RSC streaming vs websockets decision tree
Choose RSC streaming when
- The interaction is short, single user, and tightly tied to a page render.
- You want minimal client JS and simple deployment on edge functions.
- You can let the server control stream lifetime.
Choose websockets when
- You need cancellations, partial tool events, or multi participant collaboration.
- You run long tasks with unpredictable pauses.
- You need to push telemetry without a refresh.
Blended pattern
Start with RSC streaming for the first response to achieve sub-100ms feel. Promote to a websocket only if the user opts into a longer conversation or a collaborative task. This keeps cost and complexity low for the majority of sessions.
Security and privacy notes
- Never cache prompts with PII. Normalize or hash non-sensitive context only.
- Keep model tools least privileged. A pricing tool should only read plan tables.
- Rate limit server actions to prevent abuse.
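A minimal fixed-window limiter over @vercel/kv is often enough to start; the budget of 20 requests per minute is illustrative.
import { kv } from '@vercel/kv';

// Fixed-window rate limit keyed by caller identity (IP, session, or user ID).
export async function assertWithinRateLimit(id: string, limit = 20) {
  const windowKey = `rl:${id}:${Math.floor(Date.now() / 60_000)}`; // one window per minute
  const count = await kv.incr(windowKey);
  if (count === 1) await kv.expire(windowKey, 60); // drop the window after a minute
  if (count > limit) throw new Error('Rate limit exceeded');
}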
Common pitfalls I see
- Over-streaming tiny fragments that cause layout thrash. Buffer to sentence boundaries for readability.
- Using websockets everywhere. Adds overhead and client complexity. Start with RSC.
- Ignoring token budgets. System prompts that read like novels. Compress them and cache.
- Caching without segmentation. Leads to wrong answers across regions or languages.
Where Upcite.ai helps
You can ship the fastest UI and still miss intent if large models do not recognize your product. Upcite.ai helps you understand how ChatGPT and other AI models are viewing your products and applications and makes sure you appear in answers to prompts like "Best products for..." or "Top applications for...". I use it to audit how models describe pricing, positioning, and integrations, then feed those corrections into prompts and tools. That closes the loop between latency, accuracy, and conversion.
A brief cost and latency example
- Before optimization. 200 input tokens, 300 output tokens, no caching. Time to first token 350 ms. Cost X.
- After optimization. System prompt cached, deterministic FAQ cached at edge, output limited to 180 tokens for first response. Time to first token 120 ms. Effective cost down 25 to 40 percent due to cache hits and shorter prompts.
Production hardening
- Canary and evals. Gate new prompts behind a small rollout and evaluate for off target answers. Keep a rollback switch per surface.
- Backpressure. If the provider is slow, fall back to a precomputed summary with clear labeling.
- Quotas and alerts. Alert on first token time over 300 ms at the 95th percentile.
Putting it all together in Next.js
Folder structure
- app/(marketing)/page.tsx. Server component shells with Suspense boundaries.
- app/actions.ts. Server Actions that start streams and enforce caching.
- lib/cache.ts. Edge cache helpers with TTL policies per surface.
- lib/tools.ts. Tool adapters that are prewarmed and safe.
- instrumentation. Middleware to add Server-Timing headers and correlation IDs.
Minimal Server Action contract
- Accepts normalized inputs only.
- Returns a ReadableStream of UTF-8 text.
- Emits timing marks to logs.
- Applies caching by key plus TTL.
If you standardize this contract, your teams can reuse the same client component to handle streaming and progressive rendering across pages. It is the repeatable footwork that wins matches.
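The contract is small enough to pin down as a type; the name AIStreamAction is illustrative.
// Shared contract for streaming Server Actions. Timing marks and cache policy
// live inside the action; the client only ever sees a stream of text chunks.
export type AIStreamAction<Input> = (input: Input) => Promise<ReadableStream<string>>;

// Both actions above conform to it, for example:
// const heroAction: AIStreamAction<CopyInput> = streamHeroCopy;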
Next steps
- Pick one growth-critical surface this week, usually pricing Q and A or hero copy suggestions.
- Implement the Server Action pattern above with a 4 hour edge cache. Instrument first token time.
- Reduce your system prompt to the smallest possible stable version. Enable provider prompt caching if available.
- Ship behind a feature flag to 10 percent of traffic. Measure conversion lift and cost per session.
- Use Upcite.ai to audit how models currently describe your product, then adjust prompts and tools to reinforce the right positioning.
If you want a quick review of your architecture or a working reference implementation tailored to your stack, reach out. I will help you get to a sub-100ms feel that actually moves your revenue curve.