Grounding Forces: What is RAG and why does it matter for SEO?
A question we keep being asked is some variation of “alongside SEO, can we optimise for AI search too?” The answer is yes: we are confident that we can. Why? Because the fundamentals are the same.
We’re not in the business of overpromising things, and the unignorable caveat here is that this is an area moving faster than you can say Retrieval Augmented Generation, but understanding how AI systems surface information inevitably gives us a better chance of understanding how we can optimise for these processes. Knowledge is power, as they say.
So you want to be found in AI Search?
If the world were organised such that, to increase an LLM’s knowledge repository, all anyone had to do was email OpenAI or Google or whoever else and say “hey, I’m a brand and I do this thing, please make sure I show up when people search for this prompt”, then I wouldn’t need to write this blog post, because you, and every other business, would have done that already.
The primary way that an LLM increases its knowledge base is through training. At its simplest, this training takes the form of the system finding, ingesting and storing huge amounts of information from a vast number of different sources to equip the model with more information, and in turn train it to provide answers using this information, with the dataset refreshed roughly every six to twelve months. Glaring fact number one: this training data has a limit; it is static. Glaring fact number two: whoever is training the model will impart their own biases as to what information makes it into the dataset. Errare humanum est, after all.
This forms a solid foundation from which the LLM can return an answer to a given prompt, but this training data can get quite stale quite quickly, particularly for queries that are time-critical in nature. If you asked your chatbot friend about this morning’s breaking news and it could only give you information from six months ago, in your disappointment you might be inclined to go outside, converse with a real person, maybe touch some grass while you’re there. So if all they had to use to provide an answer was training data, LLMs would be pretty limited and certainly not worth billions.
Large language models are also trained to provide any answer over not returning an answer at all, even if this means providing the wrong answer. This is the mega pitfall through which funny, perplexing, and sometimes downright defamatory hallucinations weave their way into responses. So if you have a system that will preferentially return an answer, and a selection of sources that are out of date by the time you’re asking the question, this is all beginning to look like a recipe for inaccuracy. Surely the tech gurus didn’t just leave it there? Of course not. Enter: grounding.
What on Earth is grounding?
Grounding is the process by which a model reaches out to specific external sources to ground its final output in trusted, factual data. It is applied when the LLM doesn’t have the minimum level of confidence in the information within its training data required to answer the query accurately. In other words, the model knows that if it returns an answer using only its training data, that answer is unlikely to be accurate, so, yay, we’re less likely to receive a hallucinatory response when grounding is applied. Retrieval Augmented Generation (RAG) is one form of grounding: put ever so eloquently by Harry Clarkson-Bennett, RAG “retrieves data, augments the prompt, and generates an improved response”. RAG achieves grounding by retrieving relevant information from external sources, which is almost always going to be via a live web search. Think of it like Gemini running a Google search after you’ve input your prompt, then using the results of that search within your conversation to provide you with a factually accurate answer. Job done.
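To make the retrieve-augment-generate loop concrete, here is a minimal sketch in Python. The `retrieve` and `generate` functions are stubbed out and entirely hypothetical: a real system would call a live web search API and an actual LLM respectively, and no specific vendor API is assumed here.

```python
def retrieve(query: str) -> list[str]:
    # Stubbed: a real RAG system would run a live web search here
    # and return the most relevant snippets for the query.
    return [
        "Doc A: a fresh, up-to-date fact relevant to the query.",
        "Doc B: supporting context from a trusted source.",
    ]

def augment(query: str, documents: list[str]) -> str:
    # Prepend the retrieved snippets to the user's prompt so the
    # model can ground its answer in current, factual sources.
    context = "\n".join(documents)
    return (
        "Using only the sources below, answer the question.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )

def generate(prompt: str) -> str:
    # Stubbed: a real system would send the augmented prompt to an LLM.
    return f"[answer grounded in {prompt.count('Doc')} retrieved snippets]"

def rag(query: str) -> str:
    docs = retrieve(query)         # 1. retrieve data
    prompt = augment(query, docs)  # 2. augment the prompt
    return generate(prompt)        # 3. generate an improved response

print(rag("What's in this morning's breaking news?"))
```

The three steps map directly onto Clarkson-Bennett’s summary: retrieve data, augment the prompt, generate an improved response. Note that the retrieval happens per query, which is why grounding is transient rather than a permanent update to the model.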
Not so glaring fact here: grounding generally happens on an individual basis. The grounding done within your conversation is not likely to influence the results returned within a totally separate conversation.
It would be massively expensive and resource-intensive to expand a model’s training data on a regular basis, so real-time grounding through mechanisms like RAG means the LLM can provide accurate, up-to-date information to the user for a fraction of the cost of re-training and, crucially, in a much more timely manner. It is for this reason that most SEOs aren’t putting too much emphasis on trying to increase your brand’s presence in training data: if you’ve been putting out content for a few years, there’s a high likelihood that you’re already in this training data, and there isn’t all that much to be done to build upon this in the short term (we can look into training data in a separate blog post, I have a word limit to stick to here).
So what does this all mean for SEO?
I just likened RAG to an LLM running a Google search and, as an SEO consultancy, we do actually know a bit about optimising for Google search. As with many aspects of emerging AI tech, the finer details are shrouded in mystery and murkiness, but we have good reason to believe that ChatGPT uses Google (as well as Bing) to form its responses when running a web search. We do know that Gemini, being a Google product, uses Google’s index, and that Copilot, from Microsoft, uses Bing.
Given what we now know about the sources used in RAG, how it works, and when it is applied, it stands to reason that optimising your website to rank well on Google will have a material impact on your content being surfaced by an LLM. AI systems need to retrieve information from somewhere: if your website makes it easy to understand what you do, exudes credibility, authority, trustworthiness, basically all the usual good stuff, then you’re giving search engines and AI systems all the right signals to surface your content in their results.
So basically you need to keep doing good SEO. Sorry, there’s no secret trick that I was keeping under wraps until the very end of this blog to reveal. How well your content ranks for a given topic, and how successfully it meets search intent (there is ample evidence that many citations in AI results come not from top-ranking results but from pages that do a great job of meeting query intent), are essential factors at play in both traditional search and in RAG.
So that’s why whenever we get asked we say yes, alongside SEO, we optimise for AI search too.
Unlike journalists, a good blog writer does reveal their sources, so here are mine:
38% of AI Overview Citations Pull From The Top 10 - Louise Linehan
From RAG to Riches - Mark Williams-Cook
Information Retrieval Part 2: How to Get Into Model Training Data - Harry Clarkson-Bennett
Information Retrieval Part 4 (sigh): Grounding & RAG - Harry Clarkson-Bennett
Search Grounding is Transient - Dan Petrovic
SEO 2.0: How Content Marketing Drives Visibility in AI Search - Ryan Law
The science of how AI picks its sources - Kevin Indig