Testing is crucial for producing reliable software. As an industry, we know a lot about testing, or at least about testing conventional software. The recent democratization of AI and LLMs has enabled developers and software companies to easily integrate text or image generation features into their products. But LLMs possess unique properties that make testing LLM-powered software quite challenging. Why is this the case, and how can we test such software properly?
Testing LLM-powered software is challenging for at least three reasons:
- LLMs usually produce unstructured, non-deterministic output. A small change in the input prompt can have a large impact on what you get back. And when testing, we cannot easily formalize requirements for the output.
- LLMs are slow. They need seconds or even tens of seconds to respond. This slowness also affects tests: running a test suite for an LLM-powered application can take an order of magnitude longer than testing a conventional app.
- LLMs are expensive. Most companies rely on APIs provided by OpenAI, Anthropic, or other LLM providers. These providers use a pay-per-use model: customers pay for each input and output token. Depending on the use case, a single test suite run can cost tens of dollars.
Strategies for testing LLM-powered apps
We can choose from many different strategies for testing LLM-powered apps or components. No strategy is a silver bullet, though. Each has its strengths, weaknesses, and suitable uses.
Here are several strategies we use at profiq. I ordered them from the easiest, cheapest, and least powerful to the most expensive and most powerful. Here is a quick overview:
| Method | Cost and time | When to use |
|---|---|---|
| Conventional tests | Very low | Deterministic output, like a number in a JSON object. |
| Embedding similarity | Low | You can compare the actual output with an ideal output. |
| Established metrics | Low | You are performing a common task like summarization or translation. |
| Yes/No questions | Medium | You can formulate validity criteria as yes/no questions. Also useful for multi-step agentic workflows. |
| LLM-as-a-judge (scoring) | Medium | You want to evaluate the LLM output against vaguely defined criteria like “feasibility” or “style”. |
| Panel of judges | Medium to high | Can be combined with other strategies. More robust in general. Especially useful when comparing multiple LLMs. |
The list is by no means exhaustive, and I invite you to share your own strategies in the comments.
Testing deterministic behavior
Sometimes we want LLMs to produce deterministic, structured output like a JSON object. Say we are developing a component that extracts information from Craigslist laptop listings. We know the exact list of attributes to extract and, given a sample listing as input, we can easily determine the expected value of each attribute.
Testing this component is straightforward, and after OpenAI introduced structured outputs, it became even easier. We simply define a Pydantic model, pass it to the OpenAI API, and then write a few asserts to check that all attributes have the desired values. Here is a gist with an example.
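To make this concrete, here is a minimal sketch of what such a test could look like. The attribute list, sample listing, and model choice are made up for illustration, and the exact SDK call may differ depending on your openai package version:

```python
# Sketch of a "deterministic output" test. Assumes the openai and pydantic
# packages, an OPENAI_API_KEY in the environment, and a recent SDK version
# that supports structured outputs via beta.chat.completions.parse().
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class LaptopListing(BaseModel):
    # Hypothetical attributes we want to extract from a Craigslist listing.
    brand: str
    ram_gb: int
    price_usd: int

SAMPLE_LISTING = "Selling my Lenovo ThinkPad T14, 16 GB RAM, barely used. $450 OBO."

def extract_listing(listing_text: str) -> LaptopListing:
    # The SDK parses and validates the response against the Pydantic model.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract laptop attributes from the listing."},
            {"role": "user", "content": listing_text},
        ],
        response_format=LaptopListing,
    )
    return completion.choices[0].message.parsed

def test_extracts_expected_attributes():
    listing = extract_listing(SAMPLE_LISTING)
    assert "lenovo" in listing.brand.lower()
    assert listing.ram_gb == 16
    assert listing.price_usd == 450
```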
You can use this approach with other LLM providers too, but it requires some additional work (a rough sketch follows the list below):
- Ensure that the JSON schema is included in the prompt and that you ask the LLM to generate JSON.
- Properly extract the JSON string from the response or use prompt engineering to prevent the LLM from returning additional comments with the JSON object.
- Manually validate the JSON string using the Pydantic model.
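For illustration, a rough sketch of those three steps might look like this. The prompt wording and the regex-based extraction are simplifications, not a recommended production approach:

```python
# Sketch of manual JSON extraction and validation for providers without
# structured outputs. The prompt and the regex heuristic are illustrative.
import json
import re
from pydantic import BaseModel, ValidationError

class LaptopListing(BaseModel):
    brand: str
    ram_gb: int
    price_usd: int

def build_prompt(listing_text: str) -> str:
    # Pass the JSON schema in the prompt and ask for JSON only.
    schema = json.dumps(LaptopListing.model_json_schema())
    return (
        "Extract the laptop attributes from the listing below. "
        f"Reply with a single JSON object matching this schema and nothing else:\n{schema}\n\n"
        f"Listing:\n{listing_text}"
    )

def parse_llm_response(raw_response: str) -> LaptopListing:
    # Pull out the first {...} block in case the model added extra commentary.
    match = re.search(r"\{.*\}", raw_response, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the LLM response")
    try:
        # Let Pydantic validate types and required fields.
        return LaptopListing.model_validate_json(match.group(0))
    except ValidationError as err:
        raise ValueError(f"LLM response did not match the schema: {err}") from err
```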
Embedding similarity
Things become complicated quickly if we want an LLM to produce unstructured, hard-to-predict text. But if we can provide some sort of ideal output for a given test input, we can implement a test checking that the actual output is similar enough to this ideal.
You are probably familiar with text embeddings. In short, a text embedding represents a piece of text as a list of numbers. OpenAI and many other LLM providers offer an API endpoint for transforming text into embeddings.
One thing we can do with embeddings is calculate their mutual similarity. Most people use cosine similarity, which can be easily implemented with libraries like NumPy. Its value ranges from -1 to 1. Values closer to 1 signal higher similarity.
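For example, a minimal cosine similarity helper with NumPy could look like this:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors; ranges from -1 to 1."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```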
We can use text embeddings and cosine similarity to implement the following test procedure:
- Define ideal output for some sample input.
- Let an LLM produce actual output for the sample input.
- Get the embeddings of both the ideal output and the actual output.
- Calculate their cosine similarity.
- Check that this cosine similarity is above a certain threshold.
In this example, I use an LLM to produce a title for a website that describes what the user actually sees; we don’t want the LLM to simply extract the contents of the <title> tag. I am using the Hacker News login page as my test subject. I follow the procedure above and verify that the similarity between the embedding of the actual output and the embedding of the desired output is at least 0.85.
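A hedged sketch of that test might look like the following. The embedding model, the ideal title, and the 0.85 threshold are placeholders, and generate_title() stands in for the hypothetical LLM-powered component under test:

```python
# Sketch of an embedding-similarity test. Assumes the openai package and an
# OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

from myapp.titles import generate_title  # hypothetical component under test

client = OpenAI()

IDEAL_TITLE = "Hacker News login and account registration page"

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def test_generated_title_is_close_to_ideal():
    actual_title = generate_title("https://news.ycombinator.com/login")
    similarity = cosine_similarity(embed(actual_title), embed(IDEAL_TITLE))
    assert similarity >= 0.85  # threshold taken from the example above
```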
Be careful about a few things when using this method:
- You need to provide some sort of ideal output for the input used in the test. This is not always possible.
- Quality of the embedding model matters a lot. Sometimes you have to look for or train a domain-specific embedding model to get good results.
- An LLM could potentially produce incomprehensible gibberish that is still similar to the ideal output when turned into an embedding.
Established metrics
Some natural language tasks are common enough to have their own standardized metrics. We have ROUGE for summarization and BLEU for translation. Use them if your use case allows it.
You can work with these metrics much like with embedding similarity: compute the metric by comparing the LLM's actual output for a given sample input against a reference output, and then check the result against a threshold value.
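As an illustration, a small threshold test using the rouge-score package might look like this. The texts and the 0.4 threshold are made up, and in a real test the candidate summary would come from the LLM under test:

```python
# Sketch of a ROUGE-L threshold test, assuming the rouge-score package
# (pip install rouge-score). Texts and threshold are illustrative.
from rouge_score import rouge_scorer

REFERENCE_SUMMARY = "Q3 revenue dropped because of supply chain issues."

def test_summary_rouge_l_above_threshold():
    # In a real test the candidate would be produced by the LLM under test.
    candidate_summary = "Revenue fell in the third quarter due to supply chain problems."
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = scorer.score(REFERENCE_SUMMARY, candidate_summary)
    assert scores["rougeL"].fmeasure >= 0.4  # threshold chosen for illustration
```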
Asking yes/no questions
Sometimes you can define the properties of the desired output as a set of yes/no questions. If we use the previous example of generating a title for a webpage, we can ask questions like:
- Does the title mention a login form?
- Does the title mention a registration form?
- Does the title mention that we are on Hacker News?
- Does the title mention a search bar?
If we are on a Hacker News authentication page, we expect a positive answer to the first three questions and a negative answer to the last question.
We can implement a test doing the following:
- Perform a task for a sample input. For example, suggest a title for a webpage.
- Give the output of the task to an LLM and ask it to answer questions related to the output.
- Compare the answers provided by the LLM with your expectations.
My example implementation uses Pydantic to simplify answer parsing. I recommend asking the LLM to think about the answer first; the Pydantic model has a reasoning attribute for that. In practice, I also observe that small models like GPT-4o-mini are good enough for question-asking, so use them to make your tests cheaper.
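Here is a rough sketch of such a check for the page-title example above, assuming the openai and pydantic packages. The prompt, model choice, and hard-coded title are placeholders; in a real test the title would come from the component under test:

```python
# Sketch of a yes/no-question evaluation with structured outputs.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TitleChecks(BaseModel):
    # The boolean field names double as the yes/no questions the judge answers.
    reasoning: str  # let the judge think before committing to answers
    mentions_login_form: bool
    mentions_registration_form: bool
    mentions_hacker_news: bool
    mentions_search_bar: bool

def ask_about_title(title: str) -> TitleChecks:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # a small model is usually enough for question-asking
        messages=[
            {"role": "system", "content": "Answer the yes/no questions about the given page title."},
            {"role": "user", "content": f"Page title: {title}"},
        ],
        response_format=TitleChecks,
    )
    return completion.choices[0].message.parsed

def test_title_answers_match_expectations():
    answers = ask_about_title("Log in or create an account on Hacker News")
    assert answers.mentions_login_form
    assert answers.mentions_registration_form
    assert answers.mentions_hacker_news
    assert not answers.mentions_search_bar
```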
One of the advantages of this method is that you can use it to test multi-step, agentic workflows. We do this a lot at profiq. You just have to save the message history of your agent and then provide it to the question-asking LLM. Then you can ask questions like:
- Did you ask about the weather in Ostrava, Czechia?
- Did you see a system settings screen in some of the screenshots sent to you?
LLM-as-a-judge
Depending on your definition, LLM-as-a-judge already covers two of the previous testing strategies: embedding similarity and asking questions. We can broadly define LLM-as-a-judge as a set of techniques that use LLMs to evaluate text produced by LLMs or humans.
One way to use LLMs to evaluate text is to ask the LLM to score the provided text according to certain criteria. For example, I am currently developing an LLM-powered system for brainstorming, and I need to compare the quality of ideas produced in different experiments. So I ask an LLM to evaluate all ideas on a scale from 1 to 10 according to three criteria:
- Relevance to the problem
- Originality
- Feasibility
I am mostly interested in comparing different configurations, but you can easily adapt this scoring approach to standard software testing. Just define a threshold for each criterion or for their aggregate. Here is an example.
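A hedged sketch of such a scoring test could look like the following. The criteria mirror the list above, while the prompt, judge model, sample idea, and thresholds are assumptions:

```python
# Sketch of an LLM-as-a-judge scoring test, assuming openai and pydantic.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class IdeaScores(BaseModel):
    reasoning: str    # ask the judge to justify its scores before giving them
    relevance: int    # 1-10
    originality: int  # 1-10
    feasibility: int  # 1-10

def score_idea(problem: str, idea: str) -> IdeaScores:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Score the idea for the given problem on a scale from 1 to 10 "
                           "for relevance to the problem, originality, and feasibility.",
            },
            {"role": "user", "content": f"Problem: {problem}\n\nIdea: {idea}"},
        ],
        response_format=IdeaScores,
    )
    return completion.choices[0].message.parsed

def test_brainstormed_idea_meets_quality_bar():
    scores = score_idea(
        problem="Reduce food waste in office cafeterias",
        idea="Track daily leftovers and adjust portion planning with a simple forecast.",
    )
    assert scores.relevance >= 7    # thresholds are assumptions
    assert scores.feasibility >= 6
```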
If you are interested in LLM-as-a-judge, I recommend checking out this arXiv paper.
Panel of LLM judges
A panel of LLM judges builds on top of the LLM-as-a-judge paradigm. Its core idea is to use multiple LLM judges and then aggregate their conclusions. In the example above, we could calculate the average score for a given idea across judges. Individual judges can be different LLMs, or we can use a single LLM and have it role-play different personalities.
A panel of judges brings two important improvements:
- Researchers discovered that LLMs often favor outputs produced by themselves. For example, if we use GPT-4o to evaluate texts produced by GPT-4o and Claude 3.5 Sonnet, it is likely that it will prefer the first text. Using multiple LLMs for evaluation and aggregating the result mitigates this issue. This makes a panel of LLM judges an excellent option if you need to compare outputs from different LLMs.
- Panel of judges allows you to replace a single powerful judge like GPT-4o with a larger number of weaker judges, like GPT-4o-mini, Claude-3.5-Haiku or open-weight 8B models. A group of weaker judges often outperforms one strong judge, while making the evaluation cheaper and faster.
You can learn more about panels of LLM judges in this arXiv paper.
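As a rough sketch, a panel built from the scoring helper above could simply vary the judge model and average the results. The judge line-up, sample idea, and threshold below are assumptions:

```python
# Sketch of a panel-of-judges evaluation: the same scoring prompt is sent to
# several judge models and the scores are averaged.
import statistics
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class IdeaScores(BaseModel):
    reasoning: str
    relevance: int    # 1-10
    originality: int  # 1-10
    feasibility: int  # 1-10

JUDGE_MODELS = ["gpt-4o", "gpt-4o-mini"]  # could also mix in other providers

def score_idea_with_model(model: str, problem: str, idea: str) -> IdeaScores:
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Score the idea for the given problem on a scale from 1 to 10 "
                           "for relevance to the problem, originality, and feasibility.",
            },
            {"role": "user", "content": f"Problem: {problem}\n\nIdea: {idea}"},
        ],
        response_format=IdeaScores,
    )
    return completion.choices[0].message.parsed

def test_panel_average_feasibility_above_threshold():
    problem = "Reduce food waste in office cafeterias"
    idea = "Track daily leftovers and adjust portion planning with a simple forecast."
    scores = [score_idea_with_model(m, problem, idea) for m in JUDGE_MODELS]
    average_feasibility = statistics.mean(s.feasibility for s in scores)
    assert average_feasibility >= 6  # threshold is an assumption
```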
General recommendations
No matter how you approach testing LLM-powered components, I recommend considering a few general principles:
- Think like a statistician. Consider testing LLM-powered components with multiple sample inputs and calculate statistics like means, minimums, or standard deviations. If you use only a single input, you don’t know how well the LLM generalizes across different situations.
- Ask the LLM to explain its thinking process. Researchers and practitioners have repeatedly shown that asking the LLM to think about the answer first significantly improves output quality. Apply this when you are using LLMs to test other LLMs with methods like question asking or scoring.
- Don’t use training data during testing. If you have ever studied machine learning, you know that you should split your dataset into at least two parts: one for training and one for evaluation. You should do the same when testing LLM-powered components, especially if the component uses a fine-tuned model or few-shot learning.
- Separate LLM tests from other tests. As I wrote at the start of the article, tests for LLM-powered components have different properties than conventional tests. We need to work with them differently.
- Run only a subset of all LLM tests. Since using LLMs is expensive and generating a response takes time, consider running only a subset of tests in your CI/CD pipeline. You can take a random sample, maintain a suite covering the most important functionality, or use some combination of both (see the pytest sketch after this list).
- Evaluate time and cost alongside output quality. I like to calculate simple ratios like score per dollar or score per second. A slight improvement in quality is not always worth the extra time and money.
- Consider using MLOps tools like MLFlow. Many MLOps tools work well with LLMs. They help you track the development of various metrics over time, compare different configurations, and generally translate all data into nice visualizations. I recommend checking out our videos on MLFlow here and here.
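For the two points about separating LLM tests and running only a subset of them, a small pytest-based sketch could look like the following. The marker name and sampling rate are assumptions:

```python
# conftest.py sketch: mark LLM tests so they can be selected separately
# (`pytest -m llm`) and randomly skip a fraction of them in routine CI runs.
import random
import pytest

LLM_SAMPLE_RATE = 0.25  # run roughly a quarter of the LLM tests per CI build

def pytest_configure(config):
    config.addinivalue_line("markers", "llm: tests that call a live LLM")

def pytest_collection_modifyitems(config, items):
    skip_sampled_out = pytest.mark.skip(reason="randomly sampled out of this LLM test run")
    for item in items:
        if "llm" in item.keywords and random.random() > LLM_SAMPLE_RATE:
            item.add_marker(skip_sampled_out)

# In a test module, the tests from the earlier examples would then be marked:
#
#   @pytest.mark.llm
#   def test_generated_title_is_close_to_ideal():
#       ...
```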
Conclusion
Developing LLM-powered apps is hard. Unstructured and non-deterministic output makes them difficult to test. I hope this article makes testing LLM-powered components easier and enables you to ship more robust products.
What approaches are you using to test LLMs at your company that I didn’t mention? Do you use any testing frameworks worth knowing about? Tell us in the comments.