How Good Is It?

In the previous article, we introduced how evaluating LLM outputs differs from testing standard software output: unlike the latter, the former is not deterministic. Let’s now walk through an end-to-end example that illustrates this, integrating other concepts we’ve covered in this series.

Let’s take the enterprise scenario of a clothing store implementing a product-enquiry chatbot for use by its (potential) customers. The chatbot pops up on the store website and can answer all sorts of questions about the product catalog.

Let’s see how the generative AI concepts we’ve touched upon come into play in such a scenario.

Firstly, there is the question of product classification (or taxonomy) within the catalog. This is something we were used to even before AI came along: you might have waded through a taxonomy tree such as men’s > summer > casual, or women’s > winter > travel.

This can be seen as part of the deterministic setup, and is typically stored in a conventional database (in other words, structured data) as part of the conventional software stack. Nothing new there per se. Where it gets interesting, of course, is when this is opened up for users to query in natural language, which is where our LLM ideas come in with a bang.
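To see the contrast clearly, here is a toy sketch of the deterministic side. The category paths, item names, and the browse helper are all invented for illustration; a real store would use a database table rather than an in-memory list.

```python
# Toy sketch of the deterministic setup: a taxonomy held as plain
# structured data. All paths and items here are invented examples.
CATALOG = [
    {"path": ("mens", "summer", "casual"), "item": "linen shirt"},
    {"path": ("womens", "winter", "travel"), "item": "wool parka"},
]

def browse(*path):
    """Classic rule-based lookup: exact path match only."""
    return [c["item"] for c in CATALOG if c["path"] == path]

print(browse("womens", "winter", "travel"))  # ['wool parka']
# browse("womens", "December", "vacation") returns [] - exact matching
# has no notion of semantic nearness.
```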

With human-language queries there can be nuance, and there can be ambiguity. For example, while searching for a particular type of clothing, a user might say “I’m going on vacation in December to…” Note that the taxonomy above captures neither the word ‘December’ nor ‘vacation’, yet it’s not hard for you and me to instinctively connect this query with winter and travel in the above example.

This interpretation of nuance is precisely where the LLM adds value, in a way that would be too tedious and unreliable to achieve with rules and mappings in conventional software. Here we see at play the concepts of semantic space and, in particular, semantic nearness. Assuming we’re in the Northern hemisphere, ‘December’ is semantically closer to ‘winter’ than to ‘summer’, ‘vacation’ sits near ‘travel’, and so on.

This way we see how the LLM helps us easily connect unstructured data (the human-language query) to structured data (potentially an exact set of clothing items in the database matching the desired criteria).
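Here is a minimal sketch of semantic nearness, assuming the open-source sentence-transformers library; the model choice and taxonomy labels are illustrative, not a prescription.

```python
# Rank taxonomy labels by semantic closeness to a customer query.
# Model name and labels are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "I'm going on vacation in December"
labels = ["summer", "winter", "casual", "travel"]

# Encode the query and each label into the same vector space.
vectors = model.encode([query] + labels)
q, label_vecs = vectors[0], vectors[1:]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Expect 'winter' and 'travel' to outscore 'summer' for this query.
for label, vec in sorted(zip(labels, label_vecs),
                         key=lambda p: -cosine(q, p[1])):
    print(f"{label}: {cosine(q, vec):.3f}")
```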

We can also see RAG at play here. Remember that inference, the process by which the LLM takes the input query and produces the output, is stateless? To enable the correct interpretation above, a RAG pipeline would be set up whereby, as part of the prompt engineering, the language model is provided the right pieces of information: for example, an embedding model carries out the semantic nearness search to identify items under ‘winter’ to be returned against the ‘December’ criterion.
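Here is a hedged sketch of that retrieval step and the resulting prompt. The catalog entries, embedding model, and prompt wording are all assumptions; a production system would query a vector database rather than brute-force scanning a list.

```python
# Sketch of a RAG retrieval step: embed the query, fetch the nearest
# catalog items, and inject them into the prompt. Everything named
# here (items, model, template) is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

catalog = [
    {"id": "W-101", "desc": "wool parka, winter travel"},
    {"id": "S-202", "desc": "linen shirt, summer casual"},
    {"id": "W-303", "desc": "thermal leggings, winter travel"},
]
item_vecs = model.encode([c["desc"] for c in catalog])

def retrieve(query, k=2):
    """Return the k catalog items semantically closest to the query."""
    q = model.encode([query])[0]
    scores = item_vecs @ q / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(q))
    return [catalog[i] for i in np.argsort(-scores)[:k]]

question = "I'm going on vacation in December, what should I pack?"
# The retrieved items ground the (stateless) LLM in real catalog data,
# which is what keeps it from inventing products that don't exist.
prompt = ("Answer using only these catalog items:\n"
          + "\n".join(f"- {c['id']}: {c['desc']}" for c in retrieve(question))
          + f"\n\nCustomer: {question}")
print(prompt)
```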

Imagine such prompt engineering didn’t happen, or happened badly, for a human-language query like the one above. You can expect the language model to confidently come up with an answer anyway, telling you about its favorite set of parkas and ugg boots, all fictitious! This is the hallucination we touched upon in Article 07.

Also, since this is a chatbot and inference is stateless, the system has to ensure that the conversation flows smoothly by feeding the language model, as part of the prompt engineering, the previous turns of the conversation. Thus:

  • “How much is that blue cotton sweater?”

  • “Do you have size M in stock?”

The prompt the language model receives for the second question should give enough context to establish that the item under consideration is that blue cotton sweater (potentially including its item ID). A sketch of how this is typically assembled follows.
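This is a minimal sketch assuming the common chat-messages convention; the system prompt, item ID (BC-417), and prior answer are invented for illustration.

```python
# Because inference is stateless, each request replays the prior turns
# so the model can resolve "size M" to the sweater discussed earlier.
# The item ID and wording below are hypothetical.
history = [
    {"role": "user", "content": "How much is that blue cotton sweater?"},
    {"role": "assistant",
     "content": "The blue cotton sweater (item BC-417) is $59.99."},
]

def build_messages(history, new_question, system_prompt):
    """Assemble the full message list sent on every single call."""
    return ([{"role": "system", "content": system_prompt}]
            + history
            + [{"role": "user", "content": new_question}])

messages = build_messages(
    history,
    "Do you have size M in stock?",
    "You are a product-enquiry assistant for a clothing store.",
)
for m in messages:
    print(f"{m['role']}: {m['content']}")
```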

Finally, let’s look at evals: how the outputs to such questions are evaluated. Here are some potential categories of response.

Simple yes/no or price information, which could come in terse or verbose form:

  • $59.99

  • The price of the item is $59.99

Typically, the evals will have to be built to handle both forms, and to report on whether the key piece of information (in this case, the price) for the exact given item was output correctly by the LLM-based system.
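For this category a deterministic check often suffices. A minimal sketch follows; the regex and pass criteria are assumptions, and a real eval would also confirm the response refers to the right item.

```python
# Deterministic eval for price answers: accept terse and verbose forms
# by extracting the dollar amount and comparing to ground truth.
import re

def price_correct(response: str, expected: float) -> bool:
    """Pass only if the response contains exactly the expected price."""
    amounts = [float(m) for m in re.findall(r"\$(\d+(?:\.\d{2})?)", response)]
    return len(amounts) == 1 and amounts[0] == expected

assert price_correct("$59.99", 59.99)
assert price_correct("The price of the item is $59.99", 59.99)
assert not price_correct("It costs $49.99", 59.99)        # wrong price
assert not price_correct("From $49.99 to $59.99", 59.99)  # ambiguous
```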

Let’s take a closer look at where evaluations can get challenging. Consider the following user queries and potential responses.

Q: "I love the minimalist Scandinavian aesthetic but need to dress professionally. Any suggestions?"

A: "Consider our clean-line blazers in neutral tones like oatmeal or sage. They pair well with our high-waisted straight-leg trousers….”

Q: "I'm attending an outdoor wedding in October in New England. What should I wear?"

A: "You'll want something elegant but practical for variable weather. Consider our midi-length dresses in jewel tones, particularly the…”

Q: "How can I make my work wardrobe more sustainable but still professional?"

A: "Focus on versatile, high-quality pieces that can be mixed and matched. Our wool-blend blazer in charcoal pairs with multiple items and…”

In such cases, apart from ensuring that the items alluded to actually exist in the catalog (remember hallucination?), the evals might also have to check for one or more of the criteria below; a judge-model sketch follows the list.

  • Style knowledge accuracy

  • Appropriateness of suggestions

  • Practical usefulness

  • Sensitivity to customer needs

  • Brand alignment

  • Cultural awareness

  • Seasonal appropriateness
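Criteria like these resist string matching, so teams often use a second model as a judge. This is a hedged sketch: call_judge_llm is a placeholder for whatever model client is in use, and the rubric wording and 1-to-5 scale are assumptions.

```python
# "LLM-as-judge" sketch for the softer criteria above. call_judge_llm
# is a stand-in for a real model client; rubric and scale are assumed.
import json

CRITERIA = [
    "style knowledge accuracy",
    "appropriateness of suggestions",
    "practical usefulness",
    "seasonal appropriateness",
]

def judge(question, answer, call_judge_llm):
    """Ask a judge model to score the answer 1 to 5 on each criterion."""
    rubric = (
        "Score the assistant's answer from 1 (poor) to 5 (excellent) on "
        f"each of: {', '.join(CRITERIA)}. "
        "Respond with a JSON object mapping criterion to score.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return json.loads(call_judge_llm(rubric))

# The catalog-existence (anti-hallucination) check runs separately,
# since it can be done deterministically against the database.
```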

This should convey the point made at the end of the previous article about the complexity and challenge of LLM evals. Next, we will look at how everything we’ve discussed in this series so far comes together.

About the author

Ash Stuart

Engineer | Technologist | Hacker | Linguist | Polyglot | Wordsmith | Futuristic Historian | Nostalgic Futurist | Time-traveler
