
TL;DR
- AI search does not evaluate your whole page at once. It evaluates individual passages pulled from your page. Every section you publish is a standalone competitor for citation.
- Passages get matched to queries by meaning, not keywords. A section that says one thing clearly will outperform a section stuffed with the right terms but vague in what it actually says.
- Your page gets automatically split into chunks before any query is even asked. If a section break lands in the middle of a complete thought, neither half gets retrieved.
- The passages that consistently earn citations lead with the answer, carry high entity density, and make sense when read in complete isolation from the rest of the page.
- Google ranking does not predict AI citation. Only 12% of AI-cited URLs also rank in Google's top 10. Each platform retrieves from different sources using different logic.
- Content teams that want AI visibility need to shift from thinking about pages to thinking about passages. Every brief, every section scope, every edit should be evaluated at the passage level.
When someone asks a question in ChatGPT, Claude, or Perplexity, the answer is assembled from individual passages pulled from across the web in real time.
The system that does this is called Retrieval-Augmented Generation (RAG): it pulls passages from the web in real time and uses them to build answers.
Three concepts drive this process: RAG itself, embeddings, and chunking. In this article, I break down each one and explain what the research shows about how they affect your content.
What Happens When Someone Asks AI a Question
The mechanics behind AI answers are simpler than most content leads expect. Understanding the pipeline helps you understand the ‘why’ behind every section you write.
The Retrieval Pipeline Behind Every AI Answer
Every time a user asks a question in Claude, ChatGPT, or Perplexity, RAG runs a two-step process:
- Retrieval: the system searches indexed content and pulls the passages most relevant to the query.
- Generation: it feeds those passages to a language model, which reads them and builds an answer, often citing where each piece came from.
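The two-step loop above can be sketched in a few lines. This is a toy illustration, not any platform's actual implementation: real systems score passages with embedding similarity, while here a simple word-overlap score stands in so the example runs with the standard library alone.

```python
# Toy sketch of the retrieve-then-generate loop. Real systems score
# passages by embedding similarity; word overlap stands in here so the
# example needs only the standard library.

def retrieve(query, passages, k=2):
    """Retrieval step: return the k passages sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(passages,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, passages):
    """Generation step: hand the retrieved passages to a language model."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

passages = [
    "RAG retrieves passages and feeds them to a language model.",
    "Embeddings convert text into vectors that capture meaning.",
    "Our pricing starts at $49 per month.",
]
top = retrieve("how does RAG retrieve passages", passages)
print(build_prompt("how does RAG retrieve passages", top))
```

The point of the sketch is the shape of the pipeline: the model never sees your page, only the passages the retrieval step hands it.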
AI models also draw from training data: everything the model learned before deployment. That knowledge is static and does not update between releases. So when a model cites a source in its answer, that citation comes from the retrieval layer, not training data. For content teams, the retrieval layer is where you can influence outcomes.

Kevin Indig, Growth Advisor and former Head of SEO at Shopify, puts the distinction plainly:
“Content used for training gets barely cited, so the question we have to ask ourselves is how much we want to be part of the training data as opposed to beachfront property for live web retrieval.”
The Unit of Retrieval Is the Passage, Not the Page
I explain this distinction to every content lead I work with because it changes how they scope their briefs. RAG does not retrieve full pages. It retrieves individual passages.
A 3,000-word article does not compete as one unit in AI retrieval. It gets split into passages first, and each passage is evaluated independently. An audit of 15 domains covering approximately 2 million organic sessions found that ‘answer capsule’ presence was the single strongest commonality among ChatGPT-cited posts (Search Engine Land, November 2025).
How Embeddings Decide Which Passages Match a Query
This is the concept that explains why keyword-stuffed content fails in AI search, and why a well-scoped section on a smaller site can outperform a vague section on a high-authority domain.
What an Embedding Is in Simple Terms
RAG retrieves through meaning-based matching. The technology behind this is called embeddings. An embedding is a way of converting text into a format that captures what the text means, not just what words it uses.
When your content gets indexed, every passage gets converted into an embedding. When a user asks a question, that question gets converted into an embedding, too. The system then compares the question's embedding against every passage's embedding and returns the ones that are closest in meaning. This is called semantic similarity.
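The comparison step can be made concrete with the cosine-similarity math retrieval systems use. Real embeddings come from neural models with hundreds of dimensions; the bag-of-words vectors below are only a stand-in so the comparison itself is visible.

```python
import math
from collections import Counter

# Bag-of-words vectors stand in for real neural embeddings here; the
# comparison step is the same: each text becomes a vector, and the
# system returns the passages whose vectors are closest to the query's.

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction, 0.0 means unrelated."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query   = vectorize("how do embeddings match passages to a query")
focused = vectorize("embeddings match each passage to a query by meaning")
vague   = vectorize("our platform leverages synergy across channels")

print(cosine(query, focused))  # clearly higher
print(cosine(query, vague))    # zero overlap in this toy model
```

The focused passage scores well against the query; the vague one scores near zero. With real embeddings the gap is softer but the ranking logic is identical.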
Keywords Still Contribute, but Meaning Decides
Because embeddings capture meaning, a passage can get retrieved without containing the exact words from the query. If someone asks "how do I get my content cited by AI" and your passage discusses "structuring sections for retrieval in language model systems," the meanings are close enough to match.
Keywords still matter. They contribute to the meaning the embedding captures. But a passage loaded with the right keywords and vague in what it actually says will produce a weak embedding. It will not match strongly with any specific query.

Kevin Indig's analysis of 1.2 million ChatGPT responses and 18,012 verified citations found five traits of highly cited content (Growth Memo, February 2026):
- Definitive language: cited passages were nearly 2x more likely to use clear definitions.
- Conversational Q&A structure: 78.4% of question-linked citations came from headings.
- Entity density: cited text had 3-4x higher entity density than average web content.
- Balanced sentiment: subjectivity score around 0.47.
- Business-grade clarity: Flesch-Kincaid grade level of 16 versus 19.1 for lower-cited content.
I keep coming back to this list because it matches what I see in practice. The pages that get cited are not the longest or the most keyword-rich. They are the clearest.
What Makes an Embedding Strong vs Weak
A strong embedding comes from a passage that says one thing clearly. One topic. One direct statement. Enough context to stand on its own. A weak embedding comes from a passage that covers multiple topics, drifts between ideas, or needs the paragraph before it to make sense.
When a passage could mean several things, its vector sits between multiple clusters on the meaning map. It does not match strongly with anything. The practical rule: if a passage is about one thing and says it plainly, it will produce a strong embedding.
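The "sits between multiple clusters" problem has a simple geometric form. The hand-set three-dimensional vectors below are illustrative only (real embeddings have hundreds of dimensions), but the effect is the same: a passage that mixes two topics matches neither as strongly as a single-topic passage does.

```python
import math

# Hand-set 3-d vectors, purely illustrative: a passage that mixes two
# topics ends up between both clusters on the meaning map and matches
# neither as strongly as a focused passage.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

pricing_query = [1.0, 0.0, 0.0]   # direction of "pricing" queries
pricing_only  = [0.9, 0.1, 0.0]   # focused pricing passage
mixed_passage = [0.5, 0.5, 0.0]   # pricing + implementation in one chunk

print(cosine(pricing_query, pricing_only))   # ~0.99
print(cosine(pricing_query, mixed_passage))  # ~0.71
```

The mixed passage is not wrong about pricing; it is simply less aligned with any one query direction, so a focused competitor beats it.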
How Chunking Splits Your Page Into Pieces That Compete Alone
Chunking is the part of the RAG pipeline that most content teams have never considered. Your page gets broken apart before it even enters the retrieval competition.
What Chunking Is and When It Happens
Before any matching happens, your content gets split into chunks. This happens during indexing, before any user asks a question. Common chunk sizes range between 128 and 512 tokens, though some systems use up to 1,024 tokens for tasks that need broader context (arXiv, May 2025; Weaviate). Each chunk gets its own embedding. Each chunk enters the retrieval competition independently.
You do not control where the splits happen. The system decides based on its own chunking strategy. Some systems use fixed-size chunks. Others use semantic boundaries. The approach varies by platform.
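A fixed-size splitter makes the risk visible. The sketch below counts words rather than tokens for simplicity (real systems count tokens, commonly 128 to 512 per chunk); note how the split lands mid-statistic, separating the claim from its evidence.

```python
# Sketch of fixed-size chunking using a word count (real systems count
# tokens). The split falls wherever the count hits, regardless of
# whether a thought is finished.

def fixed_chunks(text, size=12):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

page = ("Semantic chunking preserves topic boundaries. A 2025 study found "
        "boundary-aware splitting retrieved the right passage 87 percent "
        "of the time, against 50 percent for fixed-size splits.")

for i, chunk in enumerate(fixed_chunks(page), 1):
    print(f"chunk {i}: {chunk}")
```

The first chunk ends at "retrieved" and the second starts at "the right passage": the finding and its number land in different chunks, and neither one carries the complete claim.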
Research presented at Tech SEO Connect 2025 found two important retrieval behaviors:
"RAG crawler activity is governed by appetite and session throttling, meaning bots are lazy and will only visit the best content once; they don't want to go more than about three pages deep. If the bot retrieves unstructured content first, it will use that information and ignore subsequently structured content in the same session."
What Goes Wrong When Chunking Breaks Your Content Badly
Three problems show up consistently:
- Split answer: a complete thought gets divided across two chunks. Neither contains the full answer. Neither gets retrieved.
- Merged topics: two unrelated ideas land in one chunk. A paragraph about pricing and a paragraph about implementation end up together. The embedding becomes unfocused and matches neither topic well.
- Missing context: a chunk contains a useful statement but lacks the setup to make it understandable. The system retrieves it, but the passage feels incomplete.
All three share the same root cause: the content was written to be read in sequence, not extracted in isolation. I have reviewed hundreds of pages where the writing was strong but the structure broke it for retrieval.

Why Section Boundaries Matter More Than Section Length
There are two ways a system can split your page:
- At natural topic breaks, where one section ends and another begins. A peer-reviewed study found this approach retrieved the right passage 87% of the time (MDPI Bioengineering, November 2025).
- At a fixed word count, cutting wherever the count hits, regardless of whether a thought is finished. The same study found this approach retrieved the right passage only 50% of the time.
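The boundary-aware approach can be sketched too. The version below assumes markdown-style input and splits at `## ` headings, so each chunk covers exactly one topic instead of being cut at an arbitrary count; real semantic chunkers use subtler signals, but the principle is the same.

```python
# Sketch of boundary-aware splitting, assuming markdown-style input:
# a new chunk starts at each "## " heading, so chunk boundaries follow
# topic boundaries instead of a fixed word count.

def split_at_headings(markdown):
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

page = """## What RAG Is
RAG retrieves passages and feeds them to a language model.

## What Chunking Is
Chunking splits a page into passages before retrieval."""

for chunk in split_at_headings(page):
    print(chunk)
    print("---")
```

This is why clean heading structure matters: your headings are the break points a boundary-aware system is most likely to honor.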
The length of each passage matters too, but not in the way most teams assume:
- A factual question like "what is RAG" needs a short, direct answer.
- A conceptual question like "how should content teams think about AI search" needs a longer passage with more surrounding context.
There is no single perfect chunk size. But content that places clean topic breaks between 200 and 400 words gives chunking systems the best chance to split in the right places.
What This Means for Content Teams
The research points in one direction. Google ranking does not predict AI citation.
Only 12% of URLs cited by AI assistants also rank in Google's top 10. Perplexity shows the highest overlap with Google at 28.6% (Ahrefs, August 2025). ChatGPT on short-tail queries overlaps only about 10% (Ahrefs, September 2025). Each platform indexes different sources and retrieves from different databases.

Rand Fishkin, CEO of SparkToro, tested 2,961 prompts across ChatGPT, Claude, and Google AI:
"There is less than a 1 in 100 chance that any of the AI tools will give the same list of brands/products in two responses. Less than 1 in 1,000 will give the same list in the same order."
Kevin Indig's analysis of 1.2 million ChatGPT responses found a consistent citation pattern he calls the "ski ramp." 44.2% of citations come from the first 30% of content (Growth Memo, February 2026).
The URLs differ. The citation timing varies. But the structural traits of cited passages stay the same.
The Passage Is the Unit of Competition
Every section you write is a standalone competitor for citation. The brief should specify what question each section answers. The writer should lead each section with the answer. The editor should test whether each section makes sense when read in isolation.
Passage Independence Checklist
I use this checklist on every brief before it goes to a writer:
- Each 200-400 word section is standalone
- Zero backward references ("as mentioned above")
- Zero forward references ("as we'll see below")
- Every "it", "this", "they" has a clear referent
- Main point in first 1-2 sentences
- No section exceeds 400 words
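Parts of this checklist can be automated. The linter below is a minimal sketch: the flagged phrases are illustrative rather than exhaustive, and it assumes a section is passed in as plain text. It catches backward references, forward references, and sections over the 400-word cap; referent clarity and answer-first ordering still need a human editor.

```python
import re

# Minimal linter for the passage-independence checklist. The phrase
# lists are illustrative, not exhaustive; referent clarity and
# answer-first ordering still require human review.

BACKWARD = ["as mentioned above", "as discussed earlier", "see above"]
FORWARD  = ["as we'll see", "see below", "later in this article"]

def lint_section(text):
    issues = []
    lower = text.lower()
    for phrase in BACKWARD:
        if phrase in lower:
            issues.append(f"backward reference: '{phrase}'")
    for phrase in FORWARD:
        if phrase in lower:
            issues.append(f"forward reference: '{phrase}'")
    words = len(re.findall(r"\S+", text))
    if words > 400:
        issues.append(f"too long: {words} words (max 400)")
    return issues

section = ("As mentioned above, chunking matters. "
           "We'll fix it later in this article.")
print(lint_section(section))
```

Running this over every section of a draft before it goes to edit catches the mechanical failures cheaply, leaving the editor to judge the things a regex cannot.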
Lily Ray summed it up at Affiliate Summit 2026:
"SEO still powers AI. Ranking influences RAG citations. Third-party reputation is critical. AI trusts corroboration."
Start by Testing Your Top Pages
Step 1: Pick your top 5-10 pages by traffic or strategic value.
Step 2: Identify the target query each page should answer.
Step 3: Run each query across ChatGPT, Claude, and Perplexity.
Step 4: Document which passages get surfaced and which get skipped on each platform.
Step 5: Review the skipped pages. Check whether each section leads with the answer, covers one idea, and makes sense when read alone.
Step 6: Restructure where needed. One idea per section. No multi-topic paragraphs. Sections between 200 and 400 words.
Step 7: Run the isolation test. Copy any section out, read it without context. If meaning is lost, rewrite until it stands alone.
RAG, embeddings, and chunking are not three separate problems to solve. They are three stages of one pipeline that your content passes through every time someone asks an AI a question. Understanding how they work together is what turns passage-level thinking from a concept into an editing discipline.
Quick Reference
| Concept | What It Does | What You Control |
|---|---|---|
| RAG | Pulls passages in real time to build answers | Whether your content contains clear, citable passages |
| Embeddings | Matches query meaning to passage meaning | How clearly each passage communicates one idea |
| Chunking | Splits your page into competing fragments | How you size and scope each section |
Reviewed By
Ameet Mehta
Frequently Asked Questions
Can I control how AI systems chunk my content?
Not directly. Each platform uses its own chunking strategy and you cannot dictate where the splits happen. But you can influence it. Clean HTML heading structure, one topic per section, and consistent section lengths between 200 and 400 words give the system natural break points to work with. The closer your section breaks align with topic boundaries, the more likely the system is to split where you intended.
Does HTML formatting like headers, bold, and bullet points affect retrieval?
Headers act as strong chunking signals across most systems. They tell the system where one topic ends and another begins. Bold and bullet points contribute to the meaning an embedding captures, but they are not retrieval signals on their own. The content inside the formatting is what matters, not the formatting itself.
How do I know if AI platforms are citing my content?
The simplest way is to run your target queries across ChatGPT, Claude, and Perplexity and check whether your URLs show up in the citations. The challenge is that AI citation results are inconsistent between runs, so manual testing gives you a snapshot, not a trend. VisibilityStack's Demand Capture Score automates this by tracking your passage-level citation performance across AI platforms and traditional search together, so you can monitor visibility over time instead of relying on spot checks.
If AI retrieves passages, does page length still matter?
Page length matters for Google ranking but not for AI retrieval. A 500-word page with three well-structured sections can outperform a 5,000-word page where the answer is buried in paragraph twelve. What matters is whether each individual section is clear, self-contained, and leads with the answer.
Does schema markup help with AI retrieval?
Schema helps search engines understand what your page is about, but RAG retrieval is driven by embeddings, not structured data markup. A well-written passage without schema will outperform a poorly written passage with perfect schema. That said, schema contributes to how your page gets indexed and categorized, which can indirectly affect whether it enters the retrieval pool in the first place.
How often do AI systems re-crawl and re-index content?
It varies by platform and there is no published crawl schedule. Research from Tech SEO Connect 2025 found that RAG crawlers are governed by appetite and session throttling, meaning they visit the best content once and don't go more than about three pages deep. Updating existing content to be passage-ready is likely more effective than publishing new pages and waiting for a crawl.


