Model-Aware AEO for Microsoft 365 Copilot: Test OpenAI vs Anthropic and Win Citations
On September 24, 2025 Microsoft made Claude models selectable inside Microsoft 365 Copilot’s Researcher and Copilot Studio, creating a true multi‑model answer engine. Use this playbook to A/B test prompts and sources across models, tune SharePoint and OneDrive for retrievers, and measure citation lift with audit-backed data.

Vicky
Sep 27, 2025
What changed on September 24, 2025
Microsoft added Anthropic’s Claude Sonnet 4 and Claude Opus 4.1 as selectable models in Microsoft 365 Copilot. Users can choose models in the Researcher agent and in Copilot Studio when building agents, while OpenAI remains the default. See the announcement in which Microsoft expanded model choice in Copilot.
Two implementation details
- Researcher exposes a Try Claude control and reverts to the default model after the session ends.
- If you opt in to Anthropic models, processing occurs under Anthropic terms and outside Microsoft-managed environments. Review compliance using Microsoft’s guidance to connect to third-party AI models in Copilot.
Define model-aware AEO
Answer engine optimization adapts prompts, sources, and governance so workplace answer engines return higher quality, cited, and actionable responses. With model choice in Copilot, AEO becomes model-aware by testing how OpenAI and Anthropic differ on reasoning depth, style, grounding, and citation behavior for your corpus and workflows.
Playbook overview
- Objective: maximize business-ready answers with reliable citations while controlling risk.
- Levers: prompt design, grounding sources, model selection, and user interaction patterns.
- KPIs: citation lift, acceptance rate, edit distance to final, time to first useful answer, source diversity, and governance compliance.
A. A/B testing prompts and sources across OpenAI vs Anthropic
1) Design the experiment
- Choose 3 to 5 priority tasks that matter for your org such as sales briefs, policy rollups, QBR prep, or product FAQs.
- Create a fixed prompt set per task. Each prompt should state a crisp role, task, constraints, and required outputs, and make citations mandatory.
- Hold the same grounding for both arms to isolate model effects. If you also want to test source curation, run a second experiment that varies sources with a fixed model.
- Randomize at the session or user level to reduce contamination across back-to-back turns.
- Tag each run with an experiment ID at the start of the prompt, for example [exp RSRCH 25W40 A], and follow our prompt ID tagging playbook. An assignment and tagging sketch follows this list.
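To keep assignments reproducible, here is a minimal Python sketch of deterministic arm assignment and prompt tagging. The hash-based split and the experiment ID format are illustrative assumptions, not Copilot features; adapt them to however you run and log sessions.

```python
import hashlib

def assign_arm(user_id: str, experiment_id: str) -> str:
    """Deterministically assign a user to arm A (default OpenAI) or arm B (Claude).
    Hashing keeps a user in the same arm across sessions, which limits contamination."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def tag_prompt(prompt: str, experiment_id: str, arm: str) -> str:
    """Prepend the experiment tag so every run is traceable in logs and exports."""
    return f"[exp {experiment_id} {arm}] {prompt}"

# Example: build a tagged Researcher prompt for one user (address is a placeholder).
arm = assign_arm("analyst@contoso.com", "RSRCH 25W40")
print(tag_prompt("Produce a one-page QBR brief with citations.", "RSRCH 25W40", arm))
```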
2) Execute in Researcher
- Arm A uses the default OpenAI model. Arm B uses Try Claude. Run matched prompts in identical sessions and time windows. Record satisfaction and any manual edits to outputs.
3) Execute in Copilot Studio
- Clone a baseline agent twice with identical knowledge, tools, and instructions. Set primary model to OpenAI in one and Anthropic in the other. Route traffic 50–50 for two weeks, or use sequential A then B if traffic is small.
4) Evaluate
- Acceptance rate: percent of answers pasted or shared without rework.
- Edit distance: character-level edit distance (for example Levenshtein) between the model output and the published artifact.
- Time to answer: first useful response latency.
- Citation lift: see section C for definition and measurement.
- Tie-break rule: when models are close, prefer the model that yields more internal citations from authoritative libraries. A scoring sketch for these metrics follows this list.
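A minimal scoring sketch for these KPIs, assuming you log each run with the model, the raw output, the published artifact, whether it was accepted without rework, and latency. The record shape and the difflib-based edit ratio are assumptions; swap in your own logging fields and distance metric.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Run:
    model: str               # "openai" or "anthropic"
    output: str              # raw model answer
    published: str           # artifact actually shared
    accepted: bool           # pasted or shared without rework
    seconds_to_answer: float # time to first useful answer

def edit_ratio(draft: str, final: str) -> float:
    """Normalized edit-distance proxy: 0.0 means identical, 1.0 means fully rewritten."""
    return 1.0 - SequenceMatcher(None, draft, final).ratio()

def summarize(runs: list[Run]) -> dict:
    n = len(runs)
    return {
        "acceptance_rate": sum(r.accepted for r in runs) / n,
        "mean_edit_ratio": sum(edit_ratio(r.output, r.published) for r in runs) / n,
        "mean_seconds_to_answer": sum(r.seconds_to_answer for r in runs) / n,
    }

# Compare arms by filtering on model before summarizing, for example:
# summarize([r for r in runs if r.model == "anthropic"])
```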
B. Structure SharePoint and OneDrive so retrievers find and cite the right sources
1) Create a source-of-truth library
- Build a SharePoint library per domain such as Product, Policy, Sales. Apply content types and metadata for version, owner, status, and effective date.
- Establish a canonical folder for briefs and playbooks. Treat this as the only library whose content should be cited in executive outputs.
2) Chunk and format for retrievers
- Keep files concise, ideally under roughly 36,000 characters, and split long docs into logical sections per topic so ingestion is reliable. A splitting sketch follows this list.
- When declaring knowledge in agents, reference specific files rather than broad folders. Keep the total page count of referenced files near or below 300 and cap the list around 20 high-signal files.
- Avoid heavy tables or exotic formatting. Prefer plain text with clear headings, short paragraphs, and bullet lists.
- For embedded files, keep each under about 750 to 1,000 pages so important sections are inside the indexed window.
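A minimal splitting sketch, assuming Markdown-style headings mark section boundaries and a character budget of roughly 36,000. Both the heading convention and the budget are assumptions to tune against what your tenant actually ingests reliably.

```python
MAX_CHARS = 36_000  # rough per-file budget; tune to your own ingestion results

def split_by_headings(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split a long document at heading lines, then pack whole sections into
    files that stay under the character budget. A single section larger than
    the budget still needs manual splitting."""
    sections, current = [], []
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("#") and current:  # heading opens a new section
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    files, buffer = [], ""
    for section in sections:
        if buffer and len(buffer) + len(section) > max_chars:
            files.append(buffer)
            buffer = ""
        buffer += section
    if buffer:
        files.append(buffer)
    return files
```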
3) Make documents citation-friendly
- Begin each file with a one-paragraph abstract, owner, last review date, and a short glossary of internal terms.
- Use consistent H1/H2 headings that mirror how users ask. Example headings include “Pricing exceptions policy” or “Q3 competitive traps.”
- Add a Sources section pointing to sibling canon docs to help retrievers surface corroborating citations.
4) Wire external systems
- Use Microsoft Graph connectors to ground prompts on approved third-party systems like CRM and ITSM, then include these sources explicitly during grounding.
5) Governance guardrails in the corpus
- Apply sensitivity and retention labels to authoritative libraries. Keep confidential drafts in separate work-in-progress libraries that are excluded from grounding.
- Add a disclaimer footer to draft files instructing Copilot to avoid citing drafts in executive outputs.
C. Track citation lift when answers switch models
Definition
Citation lift is the change in citation quality and quantity when the responding model changes, normalized by task. A computation sketch follows the core measures below.
Core measures
- Percent of answers with at least one citation
- Average citations per answer
- Share of citations pointing to internal authoritative libraries
- Unique sources per 10 answers as a diversity proxy
- External-to-internal citation ratio
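To make these measures concrete, here is a minimal sketch that computes them per arm and reports lift as the delta between arms. The answer record shape, the internal host list, and the field names are assumptions about your own logging, not an audit schema.

```python
INTERNAL_HOSTS = {"contoso.sharepoint.com"}  # assumption: hosts of your authoritative libraries

def is_internal(url: str) -> bool:
    return any(host in url for host in INTERNAL_HOSTS)

def citation_measures(answers: list[dict]) -> dict:
    """answers: [{"model": "openai", "citations": ["https://...", ...]}, ...]"""
    n = len(answers)
    all_cites = [c for a in answers for c in a["citations"]]
    internal = [c for c in all_cites if is_internal(c)]
    return {
        "pct_with_citation": sum(bool(a["citations"]) for a in answers) / n,
        "avg_citations_per_answer": len(all_cites) / n,
        "internal_share": len(internal) / len(all_cites) if all_cites else 0.0,
        "unique_sources_per_10_answers": len(set(all_cites)) / n * 10,
        "external_to_internal_ratio": (len(all_cites) - len(internal)) / max(len(internal), 1),
    }

def citation_lift(arm_a: list[dict], arm_b: list[dict]) -> dict:
    """Positive values mean arm B (for example Claude) improved on a measure."""
    a, b = citation_measures(arm_a), citation_measures(arm_b)
    return {k: b[k] - a[k] for k in a}
```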
Instrumentation
- Researcher and Copilot agents produce audit records that include which app handled the interaction and references to files, sites, or resources accessed to generate a response. Export AIApp or AIAppInteraction audit events from Microsoft Purview and parse resource references to attribute citations to internal libraries. A parsing sketch follows this list.
- For Copilot Studio agents, enable auditing of user interactions and join those logs to your experiment ID.
- Use the Microsoft 365 Copilot usage report or dashboard for adoption and engagement; audit data is designed for compliance, not analytics.
- If you require prompt and response text for qualitative review, collect it via eDiscovery or sanctioned analytics features.
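A hedged parsing sketch for an audit export, assuming a CSV download with an AuditData JSON column and a nested list of accessed resources. Column and field names vary; verify them against a real record from your tenant before relying on this.

```python
import csv
import json

def load_audit_citations(csv_path: str) -> list[dict]:
    """Pull resource references per interaction from a Purview audit export.
    "AuditData" and "AccessedResources" are assumed names; check your export schema."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            data = json.loads(record.get("AuditData") or "{}")
            resources = data.get("AccessedResources", [])  # assumed field name
            rows.append({
                "user": record.get("UserIds", ""),
                "when": record.get("CreationDate", ""),
                "resources": [r.get("SiteUrl") or r.get("Name", "") for r in resources],
            })
    return rows
```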
D. AEO prompt patterns that travel well across models
- Task-first pattern: You are a research analyst. Task is to produce a one-page brief with citations. Use only the provided sources unless instructed. Ask one clarifying question if needed.
- Source discipline: Ground on these files only. If insufficient, return “Insufficient evidence” and list missing items.
- Output contract: Return title, 3 insights with one-sentence proof per insight, and a sources block listing the exact file names.
- Red team nudge: Before finalizing, list two ways this answer could be wrong and fix them using only the sources. An assembled template combining these patterns follows this list.
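As a worked example, the four patterns compose into one reusable template. A minimal sketch; the placeholders, experiment ID, and file names are stand-ins for your own task and gold sources.

```python
PROMPT_TEMPLATE = """[exp {experiment_id} {arm}]
You are a research analyst. Produce a one-page brief with citations.
Ground on these files only: {file_list}.
If the sources are insufficient, return "Insufficient evidence" and list the missing items.
Before the final answer, list two ways it could be wrong and fix them using only the sources.
Return: a title, 3 insights with a one-sentence proof each, and a sources block with exact file names."""

print(PROMPT_TEMPLATE.format(
    experiment_id="RSRCH 25W40",
    arm="B",
    file_list="Pricing exceptions policy.docx; Q3 competitive traps.docx",
))
```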
E. AEO corpus checklist for SharePoint and OneDrive
- Canon libraries defined, owners assigned, and review cadence documented.
- Files normalized to under roughly 36,000 characters and split by topic. A quick corpus audit sketch follows this checklist.
- Top 20 gold files per task curated in each agent manifest or grounding set. Keep combined page count near 300.
- Tables converted to plain lists where possible.
- Glossaries, abstracts, and effective dates present in every file.
- Draft and archive libraries excluded from grounding with labels.
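A quick corpus audit sketch, assuming the library is synced or exported to a local folder of Markdown or text files. The path, extensions, and thresholds are placeholders that mirror the checklist.

```python
from pathlib import Path

MAX_CHARS = 36_000   # per-file budget from the checklist
MAX_GOLD_FILES = 20  # cap on high-signal files per grounding set

def audit_corpus(folder: str) -> None:
    files = sorted(Path(folder).glob("*.md")) + sorted(Path(folder).glob("*.txt"))
    oversized = [f for f in files if len(f.read_text(encoding="utf-8", errors="ignore")) > MAX_CHARS]
    print(f"{len(files)} files; {len(oversized)} over {MAX_CHARS:,} characters")
    for f in oversized:
        print(f"  split by topic: {f.name}")
    if len(files) > MAX_GOLD_FILES:
        print(f"  trim the grounding list: {len(files)} files, cap is {MAX_GOLD_FILES}")

# audit_corpus("./gold/Product")  # local path is a placeholder
```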
F. Governance before you flip the switch
- Admins must enable Anthropic models in the Microsoft 365 admin center. Confirm users understand that Anthropic processing occurs outside Microsoft-managed environments and under Anthropic terms. Update data maps, vendor risk, and privacy notices accordingly.
- Roll out through a targeted pilot or frontier program first. Validate that legal hold, audit, and retention processes capture Researcher and agent interactions as expected. For broader policy strategy, see our guidance on license-led AEO with Cloudflare.
G. 30–60–90 action plan
- Days 0 to 30: Create gold source libraries and split oversized files. Define 5 prompts per task and stand up two cloned agents with different primary models. Turn on Purview auditing and confirm export.
- Days 31 to 60: Run A/B tests in Researcher and Copilot Studio. Track citation lift and acceptance rate. Review a 50-answer sample for factuality and style, borrowing techniques from our AI Overview citation strategies.
- Days 61 to 90: Set default model by task. Roll out prompt galleries and agent shortcuts. Expand to third-party data via Graph connectors and refresh the gold lists monthly.
Bottom line
Model choice in Microsoft 365 Copilot is here. Treat Copilot like a multi-model answer engine and make AEO a first-class practice. If you curate sources, enforce prompt discipline, and measure citation lift, you can pick the right model for the right task and prove it with data.