Seek And Thou Shalt Find

So far we have explored how AI, and language models in particular, mimic the activities of the human brain: they can handle meaning by way of semantic space, and they can generate phrases and sentences through a statistical process. Let's see how it all fits together for practical use in the real world.

A large language model (LLM), as hinted, is like the human brain. In particular, its ability to generate text comes from having been trained on vast amounts of textual data; in recent cases, practically everything that's good on the Internet (who decides what's good is another matter!).

So when such a model is released, it reflects the training data fed into it up to that point. But let's take a step back. Being a neural net, an LLM is, just like the human brain, a combination of knowledge (a kind of memory) and reasoning capabilities (the ability to use that memory).

LLMs, like the human brain, are a combination of knowledge and reasoning

When you ask a chatbot such as ChatGPT a question, the user interface (remember, what you see is just the interface) passes the text on to the underlying model, say GPT-4.

The language model takes this text and brings to bear its arsenal: its underlying repertoire of knowledge, its ability to 'think,' and potentially a few other instructions passed along with the question, such as telling it to be kind, polite, and so on. At this point, the statistical process I've mentioned kicks in, producing the response we touched upon in the first episode of this series.

This process is called inference, an important concept we will come back to and explore further later. The key thing to appreciate here is this:

LLM inference is a stateless process, 

i.e., each time the user interface feeds the LLM a piece of text, it is a one-off: the model has no recollection of any previous run of inference. It's always one request in, one response out. That's simply how the architecture works.

If you think this through, you'll come to the question: wait a minute, how the heck does ChatGPT or the Claude interface remember my previous questions during a conversation, so I don't have to repeat the earlier bits every time? The answer is quite simple: the interface deliberately feeds the previous conversation back into the underlying LLM on every turn, so the model has as much context as possible.
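Here's a minimal sketch of that trick, using the OpenAI Python client purely as one example (any chat-completion API works the same way); the model name and the questions are just placeholders:

```python
# The interface keeps the conversation in a plain list and re-sends it
# in full on every call, because the model itself remembers nothing
# between calls.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

conversation = [
    {"role": "system", "content": "You are a helpful, polite assistant."},
]

def ask(user_text: str) -> str:
    conversation.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model="gpt-4",            # placeholder model name
        messages=conversation,    # the entire history, every single time
    )
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply

ask("Who wrote War and Peace?")
ask("When was he born?")  # "he" only resolves because turn one was re-sent
```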

But how much of the previous conversation? Well, there is a limit to how much text can be passed into a language model in one go, and it varies from model to model. This limit is called the context window.
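To make that concrete, here is a rough sketch of how an interface might count tokens and trim old turns to stay within the limit, using the tiktoken tokenizer; the 8,000-token budget is just an illustrative figure, not any particular model's real limit:

```python
# Count tokens with tiktoken and drop the oldest turns once the
# conversation no longer fits the budget.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages) -> int:
    return sum(len(encoding.encode(m["content"])) for m in messages)

def trim_to_fit(messages, budget=8000):
    # Keep the system message at index 0; discard the oldest
    # user/assistant turns until everything fits the context window.
    trimmed = list(messages)
    while count_tokens(trimmed) > budget and len(trimmed) > 2:
        del trimmed[1]
    return trimmed
```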

This still leaves the can of worms open. Thank you very much, Mr UI, for including the previous bits, but that's still not enough. One particular related problem is giving the LLM context or knowledge it doesn't have: stuff from after its training cut-off date, or, say, my corporate emails. I want to be able to get answers from all of them, so I don't have to pore over them for meeting summaries and decisions taken three months or three years ago.

This is technically possible, but getting it right is still quite a challenge. Let's explore how the approaches have evolved. The older approach is called fine-tuning: subject the LLM to an additional round of training, this time using your domain-specific data, so it 'learns' that bit as well.
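As a sketch of what that involves in practice, the first step is usually to convert your domain data into training examples in a provider-specific format; the JSONL "messages" layout below is the one OpenAI's fine-tuning service expects, shown purely as an illustration, and the example pair is made up:

```python
# Write domain-specific question/answer pairs out as JSONL,
# one training example per line.
import json

domain_examples = [
    ("What was decided at the Q3 pricing review?",
     "The board approved a 5% price increase effective October."),
    # ... more pairs mined from your own emails, minutes, wikis
]

with open("train.jsonl", "w") as f:
    for question, answer in domain_examples:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record) + "\n")
```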

But as it turns out, this can be problematic, not to mention costly to do on an ongoing basis. So the other question arose: why not just hand the language model the relevant bits of information from our specific data pool during that stateless inference, as part of its context window?

And that's been one of the hottest topics in the generative AI space in recent months. It's called RAG: Retrieval-Augmented Generation. Let's see how it works.

But before we proceed, let's pause to recognize that the chatbot approach we mentioned is just one way of eliciting the services of an LLM. In an enterprise setting, your existing software codebase may be wired directly to a language model, typically through an Application Programming Interface (API) exposed by the LLM provider.
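As a sketch, that wiring can be as plain as an HTTP call; OpenAI's chat-completions endpoint is shown here as one example, and other providers expose very similar interfaces:

```python
# Call an LLM provider's REST API directly from existing software.
import os
import requests

def ask_model(prompt: str) -> str:
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```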

So, going back to the corporate correspondence example, we need a way of giving this LLM-enhanced system access to emails, minutes and so on (with due security safeguards, of course), so that users (with the right permissions) can query it and get answers from the hoard of information accumulated in your company over the years, information that lies buried beyond the practical reach of human search effort.

So here's where the strengths of the LLM's semantic space come in very handy. Taking any piece of text, structured or unstructured, it is possible to create, as it were, a 'mental image' of that text in the LLM's brain, which can then be stored in special types of AI-native databases. Think of this as slicing out the part of Einstein's brain that held the full knowledge and context of the Theory of Relativity and storing it in a cryogenic facility for later retrieval! (Fun fact: the pathologist who performed Einstein's autopsy did secretly remove and preserve the great scientist's brain for decades!)

Back in the less creepy digital world, this 'mental image', representing the knowledge and understanding of a piece of text in the language model's semantic space, is called an embedding or vector (because of the underlying mathematics used to produce it). And the specialized database is called a vector database or vectorstore.
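Here's a small illustration using the open-source sentence-transformers library (any embedding model or API would do): each piece of text becomes a fixed-length vector of numbers, i.e., its coordinates in semantic space. The documents are invented for the example:

```python
# Turn text into embedding vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Minutes, 12 March: the launch was postponed to Q4.",
    "Email from finance: travel budgets are frozen until further notice.",
]

vectors = model.encode(docs)   # one vector per document
print(vectors.shape)           # (2, 384): each text becomes 384 numbers
```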

So, in summary, retrieval-augmented generation consists of first vectorizing domain-specific data into a vectorstore, then, during inference, pulling out the relevant bits and adding them to the context window, so the language model has exactly the material it needs to answer.
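Put together, a bare-bones RAG loop looks something like the sketch below. A real system would use a proper vector database (Chroma, pgvector, Pinecone and so on) rather than a numpy array, and the documents and question here are invented for illustration:

```python
# Embed the question, find the closest stored documents by cosine
# similarity, and paste the hits into the prompt before calling the model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Minutes, 12 March: the launch was postponed to Q4.",
    "Email from finance: travel budgets are frozen until further notice.",
    "HR memo: the office is closed on public holidays.",
]
doc_vectors = model.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q              # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]    # indices of the k best matches
    return [docs[i] for i in top]

question = "When is the product launch now happening?"
context = "\n".join(retrieve(question))
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# `prompt` is what finally gets sent to the LLM,
# e.g. with the ask_model() sketch earlier.
```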

Now, is that the holy grail? Not quite. Getting RAG right is still an ongoing challenge: how do you ensure the retrieved material is actually relevant, and so on. We will come back to this; stay tuned.

About the author

Ash Stuart

Engineer | Technologist | Hacker | Linguist | Polyglot | Wordsmith | Futuristic Historian | Nostalgic Futurist | Time-traveler
