Fundamentals Updated June 2026

How AI engines choose sources.

The signals that determine whether ChatGPT, Perplexity, Gemini or Copilot retrieves your content — and how to optimise for each layer of the decision.

Foundations · F—03 13 min read Every claim sourced

Schema markup lift

+67%

LLM discoverability · Yext, 2026

Citations under 2 years old

85%

Seer Interactive, 2026

Lift from expert quotes

Single biggest signal · Princeton, KDD 2024

Recency citation multiplier

4.3×

Updated vs stale content · Seer Interactive

01 — The mechanism

Two-stage retrieval

Most AI answer engines use a retrieve-then-generate architecture. First, a retrieval layer fetches candidate documents. Then, the language model synthesises an answer from those documents.

Your content must pass two filters: it must be retrievable (technically accessible and indexed by the engine's crawler), and it must be preferred over competing sources by the ranking layer.

1. Retrieval layer

Crawler indexes your content. Clean HTML, open robots.txt, fast load, HTTPS, structured data — all improve indexing quality and completeness.

2. Ranking / preference layer

From the candidate documents, the model scores which sources to cite based on authority, freshness, specificity and structural cues. This is where generative engine optimisation (GEO) happens.

RAG PIPELINE — RETRIEVAL-AUGMENTED GENERATION
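The two stages can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Document` fields, the term-overlap retrieval, and the 0.6 / 0.4 weights are hypothetical and do not describe how any real engine scores sources.

```python
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    text: str
    authority: float  # hypothetical 0..1 authority score
    age_days: int     # days since the page was last updated


def retrieve(query: str, index: list[Document], k: int = 3) -> list[Document]:
    """Stage 1: keep up to k documents that share terms with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.text.lower().split())), d) for d in index]
    matches = [(overlap, d) for overlap, d in scored if overlap > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in matches[:k]]


def preference_score(doc: Document) -> float:
    """Stage 2 score: illustrative weights only, not a real engine's formula."""
    freshness = 1.0 if doc.age_days <= 730 else 0.3
    return 0.6 * doc.authority + 0.4 * freshness


def rank(candidates: list[Document]) -> list[Document]:
    """Stage 2: order the retrieved candidates by preference score."""
    return sorted(candidates, key=preference_score, reverse=True)


INDEX = [
    Document("https://example.com/fresh", "schema markup guide for ai search", 0.5, 100),
    Document("https://example.com/stale", "schema markup history", 0.9, 2000),
    Document("https://example.com/other", "cooking pasta at home", 0.8, 10),
]

cited = rank(retrieve("schema markup", INDEX))
```

In this toy data the unrelated page never reaches the ranking stage, and the fresher page outranks the more authoritative but stale one. A page has to clear both filters to be cited.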

02 — The signals

Five factors that drive citation

Cited sources in your content

Adding cited, linked sources to your content lifted AI visibility by 41% in controlled tests. Models prefer content that itself demonstrates epistemic rigour.

Source: Princeton / KDD 2024

Statistics & specific numbers

Content with specific, sourced statistics earned 32% more AI citations than equivalent prose. A concrete figure such as "20–30% higher ROI" beats a vague claim such as "better results".

Source: Princeton / KDD 2024

Expert quotations

Adding named expert quotes produced the single largest lift in the foundational GEO study — larger than statistics, citations or structural changes alone.

Source: Princeton / KDD 2024

Structured markup & schema

Schema markup improves LLM discoverability by 67%. FAQ, HowTo, Article and Organization schema are the highest priority. A clean heading hierarchy and short paragraphs improve chunk extraction.

Source: Yext, 2026

Recency & freshness

85% of AI citations are from content less than 2 years old. Updated content appears 4.3× more often in AI answers than stale equivalents. Date-stamping and regular refresh signals matter.

Source: Seer Interactive, 2026
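A rough way to check a page against these five factors is a small audit script. The sketch below is a heuristic built on assumptions: the function name `audit_signals`, the regular expressions, and the use of `<blockquote>` as a proxy for expert quotes are all hypothetical choices, and the 730-day threshold simply mirrors the two-year figure above.

```python
import re
from datetime import date


def audit_signals(html: str, last_updated: date, today: date) -> dict:
    """Count rough proxies for the citation signals on one page."""
    # Outbound links with an absolute http(s) URL stand in for cited sources.
    outbound_links = re.findall(r'<a\s[^>]*href="https?://', html, flags=re.I)
    # <blockquote> elements stand in for expert quotations (an assumption).
    quotes = re.findall(r"<blockquote\b", html, flags=re.I)
    # Strip tags so numbers inside URLs and attributes are not counted.
    text = re.sub(r"<[^>]+>", " ", html)
    # Percentages and multipliers such as "41%" or "4.3x".
    stats = re.findall(r"\d+(?:\.\d+)?\s?(?:%|×|x\b)", text)
    age_days = (today - last_updated).days
    return {
        "outbound_links": len(outbound_links),
        "statistics": len(stats),
        "quotes": len(quotes),
        "age_days": age_days,
        "under_two_years": age_days <= 730,
    }


SAMPLE = (
    '<p>Visibility rose 41% (<a href="https://example.com/study">study</a>).</p>'
    "<blockquote>Structure matters. Jane Doe, analyst</blockquote>"
)
report = audit_signals(SAMPLE, last_updated=date(2025, 6, 1), today=date(2026, 6, 1))
```

The counts only show that a signal is present. They say nothing about whether the sources, numbers or quotes are any good.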

03 — Engine differences

How each engine differs

| Engine | Retrieval model | Key citation signals | Freshness weight |
| --- | --- | --- | --- |
| ChatGPT (GPT-4o) | RAG + web search (Bing) | Authority, entity clarity, structured data | High |
| Perplexity | Real-time web search | Freshness, specificity, source diversity | Very high |
| Gemini | Google index + RAG | E-E-A-T, schema, Knowledge Graph | Medium-high |
| Copilot | Bing index + web search | Bing ranking signals + citation quality | High |
| AI Overviews | Google core index | E-E-A-T, structured content, featured snippet eligibility | Medium |

Engine architectures evolve rapidly. Methodology v2.3 · updated June 2026.

04 — Technical access

Crawlability checklist

robots.txt — open to AI bots

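GPTBot, ClaudeBot, PerplexityBot and Google-Extended must not be blocked. Crawlers are allowed by default when no rule matches them, but a blanket deny-all rule (User-agent: * with Disallow: /) shuts them out unless each bot has its own explicit Allow group. A blocked bot cannot cite your content.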
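One way to verify this is to run your robots.txt through Python's standard `urllib.robotparser`. This is a minimal sketch: the helper name `blocked_agents` and the `example.com` URL are assumptions, and the standard-library parser may not match every crawler's own matching logic exactly.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the checklist above.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI agents that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]


DENY_ALL = """\
User-agent: *
Disallow: /
"""

# Same blanket rule, with an explicit group that lets GPTBot through.
DENY_ALL_EXCEPT_GPTBOT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

print(blocked_agents(DENY_ALL))
print(blocked_agents(DENY_ALL_EXCEPT_GPTBOT))
```

With the blanket rule alone, all four agents are reported as blocked. Adding a dedicated group for one bot unblocks only that bot, so each agent you want to admit needs its own entry.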

llms.txt file

An emerging standard (modelled on robots.txt) that gives AI systems a structured overview of your site's most important pages and content categories.
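A file like this can be generated from a list of key pages. The sketch below assumes the Markdown layout described in the public llms.txt proposal (an H1 title, a blockquote summary, then H2 sections containing link lists). The function name, the site details and the URLs are hypothetical, and the proposal itself should be checked before publishing.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt body from (name, url, note) entries per section."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in pages:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content, used only to show the output shape.
llms_txt = build_llms_txt(
    "Example Co",
    "Guides on AI search visibility.",
    {
        "Guides": [
            (
                "How AI engines choose sources",
                "https://example.com/guides/sources",
                "overview of citation signals",
            ),
        ],
    },
)
print(llms_txt)
```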

Clean HTML with semantic structure

Use a proper heading hierarchy (H1→H2→H3), short paragraphs, and minimal JavaScript dependency for main content. Many retrieval layers extract text from the served HTML rather than rendering JavaScript.
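Skipped heading levels can be detected with the standard-library HTML parser. This is a minimal sketch: the class name `HeadingAudit` and the function `skipped_levels` are hypothetical, and it only inspects the HTML as served, which is also what a crawler that does not run JavaScript would see.

```python
from html.parser import HTMLParser


class HeadingAudit(HTMLParser):
    """Collect heading levels (1 to 6) in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.levels: list[int] = []

    def handle_starttag(self, tag, attrs):
        # Matches h1..h6 only; tags such as <hr> or <html> are ignored.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def skipped_levels(html: str) -> list[tuple[int, int]]:
    """Return (previous, current) pairs where the hierarchy jumps a level."""
    audit = HeadingAudit()
    audit.feed(html)
    problems = []
    previous = 0  # 0 means "no heading seen yet", so a page should open with h1
    for level in audit.levels:
        if level > previous + 1:
            problems.append((previous, level))
        previous = level
    return problems


print(skipped_levels("<h1>Guide</h1><h2>Signals</h2><h3>Schema</h3>"))
print(skipped_levels("<h1>Guide</h1><h3>Schema</h3>"))
```

Moving back up the hierarchy (for example from H3 to H2) is not flagged. Only downward jumps that skip a level are reported.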

Schema markup

Organization, Article, FAQ, HowTo, BreadcrumbList. JSON-LD preferred. Makes entity recognition and content classification trivial for retrieval systems.
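As an illustration, an Article JSON-LD snippet can be generated with the standard `json` module. The property names come from the schema.org vocabulary. The function name `article_jsonld` and all sample values are hypothetical, and a real page would usually carry more properties than this sketch shows.

```python
import json


def article_jsonld(
    headline: str,
    author: str,
    publisher: str,
    date_published: str,
    date_modified: str,
) -> str:
    """Return a <script> element holding schema.org Article markup."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "publisher": {"@type": "Organization", "name": publisher},
        "datePublished": date_published,  # ISO 8601 dates
        "dateModified": date_modified,
    }
    # Escape "<" so a stray "</script>" in a value cannot close the element.
    payload = json.dumps(data, indent=2).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = article_jsonld(
    headline="How AI engines choose sources",
    author="Jane Doe",          # hypothetical author
    publisher="Example Co",     # hypothetical publisher
    date_published="2025-01-15",
    date_modified="2026-06-01",
)
print(snippet)
```

Keeping `dateModified` accurate also supports the freshness signal, since it gives crawlers a machine-readable refresh date.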

Fast load & HTTPS

AI crawlers have tighter timeout thresholds than Googlebot. Core Web Vitals compliance and HTTPS are table stakes for being fully indexed.

Avoid: paywall without free tier

Hard paywalls that block crawlers prevent indexing entirely. A free abstract or preview section is the minimum needed for AI citation eligibility.
