What Did You Say?

In the 7 articles that make up this series so far, we have explored the notion of how machines do human language, digging deeper into some of the conceptual foundations underpinning this phenomenon. Let’s take a step back. 

First, let’s acknowledge that this is a truly revolutionary development. 

The creation of machines that can talk convincingly like humans is one of the greatest technological leaps in all of history.

This of course opens up a whole new set of doors - across essentially the entire gamut of human activity that involves language - where we can harness this capability to achieve a whole range of improved outcomes.

Someone new to Generative AI (AI that generates output such as language) has perhaps encountered it through ChatGPT, released in November 2022 and the first major and most popular such product on the market. Those more savvy may have moved on to Perplexity, a more powerful application that blends chat with search by building on the capabilities discussed in the episode on retrieval-augmented generation (05 Seek and Thou Shalt Find).

These are examples of B2C - ordinary folk like you and me accessing these capabilities via a business-to-consumer web (or mobile) user interface. It’s not hard to imagine, however, that there is a whole host of B2B applications as well - where businesses, in whatever industry, can harness the power of these tools in an enterprise setting.

In such a context, there is an additional set of considerations that become important, or more important than they are in the consumer usage scenario.

Using Generative AI in the enterprise brings in additional considerations compared to consumer use

A key concept before we proceed: when LLM inference happens - which we discussed in the article mentioned above - there is a parameter that has an important bearing on the nature of the output. It is called temperature. Put simply, temperature defines a spectrum between how consistent/conservative and how creative/variable the inference output will be. Without getting into the statistical details at this point, suffice it to say that giving the LLM a lower temperature means it will try to produce a more consistent, conservative response, whereas a higher temperature leads to a more creative, variable output.
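To make this a little more concrete, here is a minimal, purely illustrative sketch of temperature-scaled sampling: the model's raw scores are divided by the temperature before being turned into probabilities. The toy scores and counts below are made up - real models choose among tens of thousands of tokens - but the principle is the same.

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Sample one token index from raw scores, scaled by temperature."""
    scaled = [score / temperature for score in logits]
    max_score = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy scores for three candidate next tokens (index 0 is the model's favourite).
logits = [4.0, 2.0, 1.0]
low = [sample_with_temperature(logits, 0.2) for _ in range(1_000)]
high = [sample_with_temperature(logits, 2.0) for _ in range(1_000)]

# At low temperature the favourite wins almost every time;
# at high temperature the other candidates appear far more often.
print(sum(1 for i in low if i == 0) / len(low))    # close to 1.0
print(sum(1 for i in high if i == 0) / len(high))  # noticeably lower
```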

When you’re using a web-based chatbot, say Claude, for generating a fusion, Thai-style pasta recipe, the ‘correctness’ or ‘accuracy’ of the output is not that vital. In fact for your pasta recipe, you might want the temperature to be high, so that the model really has the freedom to try out some wacky combinations in producing the fusion recipe. 

However, in enterprise situations, we typically want the temperature to be as low as possible, so that the output is as consistent as possible from one run to the next. This is because every business has a particular need, and given the inputs in that business, it is imperative that the outputs conform to the requirements and quality standards defined by the business.
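In practice, when a model is called programmatically, this usually comes down to a single parameter on the request. A minimal sketch, using the OpenAI Python SDK as one illustrative example - the model name and prompt are placeholders, not a recommendation:

```python
# Minimal sketch: requesting a low-temperature (more consistent) completion.
# Assumes the openai package (v1+) is installed and an API key is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarise this invoice dispute in two sentences."}],
    temperature=0,  # as low as possible: favour consistent, conservative output
)
print(response.choices[0].message.content)
```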

Let’s also remember that in an enterprise setup, a simple chatbot interface is typically not adequate. At the very least, there is a body of domain-specific data that needs to be incorporated so that the LLM has the grounding to perform in a way that serves the specific business needs. There is thus typically a bespoke pipeline and workflow (ie a software extension) built and deployed exclusively for the business. Think of this as recruiting and training a new employee with basic, or preferably advanced, expertise in the domain.
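As a rough illustration of what such grounding might look like, here is a deliberately simplified sketch of a retrieve-then-generate step. All names are hypothetical placeholders, not any particular product's API, and real pipelines typically use embedding-based retrieval rather than keyword overlap.

```python
# Illustrative sketch: retrieve the most relevant internal documents, then build
# a prompt that asks the model to answer using only those documents.

def retrieve_relevant_chunks(query: str, document_store: list[str], top_k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval, standing in for a proper vector search."""
    query_words = set(query.lower().split())
    scored = sorted(
        document_store,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_grounded_prompt(query: str, chunks: list[str]) -> str:
    """Assemble a prompt that constrains the model to the retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )

# The resulting prompt would then be sent to the LLM, typically at a low temperature.
```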

Zooming in on the technical detail of testing the output of such an LLM-based application, when compared to a conventional (ie, non-AI) software application, we run into a problem. When testing conventional software (remember, we noted that it is deterministic), it is possible, and fairly routine, to pre-identify the types of input and the correspondingly correct output values that form part of the workflow.

And this is simple enough to test. The simplest example: if you were building a calculator, you could write a test in code to check that, given the inputs 2 and 2 and the operation ‘add’, the output is 4. You can run the same test with a bunch of combinations of the two input values, and check that the outputs conform to how we know arithmetic addition behaves.
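A minimal sketch of what such a test might look like, in pytest style - the add function here is a stand-in for the hypothetical calculator's addition feature:

```python
# Conventional, deterministic testing: known inputs map to known outputs.

def add(a: int, b: int) -> int:
    return a + b

def test_add_two_and_two():
    assert add(2, 2) == 4

def test_add_various_combinations():
    # Input/output pairs fixed in advance - the same inputs always produce
    # the same outputs, so these tests can be re-run at any time.
    cases = [(0, 0, 0), (1, 2, 3), (-1, 1, 0), (10, 5, 15)]
    for a, b, expected in cases:
        assert add(a, b) == expected
```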

The key thing here is that once you’ve written such tests with a bunch of sample input and output values, in the course of the evolution of the (deterministic) conventional application, you can run the tests again and again at any time, to ensure, for example, that once you’ve implemented code for subtraction, it has not somehow introduced a bug and broken the addition functionality. This is standard software engineering practice, and has been in place for decades. 

But when it comes to output from LLM-based applications, which typically consists of human language, and which we’ve noted previously is non-deterministic, how the heck do we ensure we have 2+2=4-style guaranteed testing in place? The simple answer is that we can’t, not fully. It’s not that straightforward.

To give another very silly (but again useful) example: if the question to an LLM is ‘What is the capital of Japan?’, the answer could be ‘The capital of Japan is Tokyo.’, ‘Tokyo.’, or ‘Tokyo’. Conventional testing would treat these as three distinct answers.
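A tiny, purely illustrative sketch of the mismatch, along with one simple (and still crude) kind of looser check an eval might use instead:

```python
# Why exact string matching breaks down for LLM output.

expected = "Tokyo"
llm_answers = ["The capital of Japan is Tokyo.", "Tokyo.", "Tokyo"]

# Conventional exact-match check: only one of the three answers "passes".
print([answer == expected for answer in llm_answers])        # [False, False, True]

# A slightly more forgiving check: normalise and look for the expected answer.
def contains_answer(answer: str, expected: str) -> bool:
    return expected.lower() in answer.lower().rstrip(".")

print([contains_answer(a, expected) for a in llm_answers])   # [True, True, True]
```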

A whole new sub-industry has thus developed to tackle this new problem. So much so that the technical term used for testing LLM outputs is not tests but evaluations, or evals for short. Because enterprise application of LLMs is by and large such a new phenomenon, evals are a very nascent area of activity, still somewhat cutting edge. In the coming articles, we’ll explore this further.

About the author

Ash Stuart

Engineer | Technologist | Hacker | Linguist | Polyglot | Wordsmith | Futuristic Historian | Nostalgic Futurist | Time-traveler
