AI in software testing – Analysing a user story

I think it’s safe to say that AI is here to stay. We can choose to avoid it like the plague or embrace it.

If you are working in the software industry, there's a chance that your LinkedIn feed is flooded with posts on AI and the use of AI in software testing. The tone has shifted from critical to sometimes rather harsh, and it seems that people either embrace AI in their testing work or do the complete opposite: avoid it at all costs.

It reminds me of the dialogue between Obi-Wan Kenobi and Anakin Skywalker, just before their duel on Mustafar.

Anakin: If you’re not with me, then you’re my enemy.

Obi-Wan: Only a Sith deals in absolutes.

Admittedly, people debating on LinkedIn may not be enemies, but feelings seem to weigh heavier than facts these days, and to some people the hallucinations of LLMs render the use of AI worthless, particularly in software testing. In this post I will try to clarify whether the naysayers are right, or whether I can bring a little optimism to the table.

I will demonstrate the instant road to failure, but also provide you with some advice on how to reduce hallucinations and bias and perhaps even get quite a good result.

Hallucinations and bias

If you have been using an LLM, you will most likely have experienced hallucinations and bias, which, if handled without a critical and professional approach, can do more harm than good to whatever work you need done.

Hallucinations come in various forms. I have experienced some pretty weird results, with data not being read or suddenly disappearing from a calculation that worked fine just before. I have also encountered miscalculations when asking the LLM to analyse data from two or three different sources.

Bias comes from the data on which the LLM has been trained, and the output can and will be affected by it. We have seen it lately with new AI models that "refuse" to answer questions critical of their country of origin. What most users might not be aware of is that bias can, however unwillingly, be introduced by the user. Once you go beyond the absolute beginner stage and start experimenting with custom instructions, these instructions will influence how the LLM treats the input and produces an output. There is a way to avoid user-introduced bias, which I will cover later in this post.


Measure twice, cut once.

Given what we know about AI, especially that it is biased and often hallucinates, I fully understand why so many give it the thumbs down when it comes to using AI in software testing. As a reader you might think that I am on a hopeless quest, as many highly skilled people within our craft have "proven" that we lose logical thinking and risk producing low-quality testware.

They are absolutely right. A blind trust in AI is the perfect recipe for disaster, but this is the case with any tool we use, AI or not.

I am an amateur carpenter—not a very good one—but I get great satisfaction in seeing the raw material turn into something useful in my home or garden. It has not been cheap though, and throughout the years I have discarded several pieces of wood because of a blind trust in my calculations and measurements.

The advice I am following now and will pass on to you is: “Measure twice, cut once”.


AI is a tool, not a silver bullet.

The quality of the test basis and how we enhance it has a great impact on the quality of our test cases. If requirements are sparse and we write test cases based on them and on our assumptions, there is a risk that our test coverage is insufficient and that our assessments deliver false positives.

Using AI is no different. Hallucinations are often caused by low-quality instructions. If the information is insufficient, the LLM fills the gaps with assumptions about what we are asking for. If we try to fit too much information into one prompt, the LLM may, due to a token threshold, discard parts of the instructions and give us a wrong result. I will not go into detail about tokens, but when using an LLM such as GPT-4o (which I am using in this case), the safe zone is below 1,000 tokens, or about 750 words. Staying within this will yield the best result, and as such an iterative approach to prompting is a good choice.
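If you want to know where a prompt stands before you send it, you can count tokens locally. Below is a minimal sketch in Python, assuming the tiktoken package and GPT-4o's o200k_base encoding; the 1,000-token ceiling is my rule of thumb from above, not a hard API limit.

import tiktoken

SAFE_TOKEN_LIMIT = 1000  # rule of thumb from this post, not an API limit

def count_tokens(text: str) -> int:
    # o200k_base is the encoding used by GPT-4o
    return len(tiktoken.get_encoding("o200k_base").encode(text))

prompt = "Analyse this user story and create test cases"
tokens = count_tokens(prompt)
if tokens > SAFE_TOKEN_LIMIT:
    print(f"{tokens} tokens - consider splitting the prompt into iterations.")
else:
    print(f"{tokens} tokens - within the safe zone.")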

In the following examples, I will steer you through the pitfalls of using AI (there are many) and arm you with a basic approach for better results.

The Challenge

As a tester you have been given the task of designing test cases for a feature and its user stories. Neither the feature nor the user stories are well written, and if you were to design test cases manually, your chances of success would rely on your ability to seek out those with the knowledge to enrich your test basis. Using AI requires the same footwork; in fact, I believe it requires even more.

In this case I asked ChatGPT to provide me with three user stories covering a subscription payment for a SaaS application. I deliberately asked for:

1. A very well written user story with highly testable acceptance criteria.
2. A user story where details were missing.
3. A user story of poor quality, incomplete and highly ambiguous.

My point was to get as close to the real world as possible, where the level of available information will vary.

The first user story shown here is the poor one; it is deliberately as close to worthless as possible, to visualise the risk of misinterpretation by both developers and testers.

Title: Payment with Stripe.
As a user,
I want to pay through Stripe,
So that I can pay.
Acceptance Criteria:
Stripe integration should work.
Payments should be successful.


The user story lacks information about payment types, error handling, logging, etc., and the acceptance criteria are vague and not really testable.

Throwing caution to the wind, you decide to rush in with your newfound magic wand, import the user story and start prompting:

Analyse this user story and create test cases

I am not saying you won't get a result, and you might even get one that you can use to some extent. The danger here is that you have no control, and the LLM will fall back on assumptions to deliver a result.
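For the sake of illustration, here is what that shortcut looks like if you script it instead of using the chat window. This is a minimal sketch with the openai Python package (the model choice and key handling are my assumptions), and it shows the anti-pattern, not a recommendation.

from openai import OpenAI

client = OpenAI()  # expects an API key in the OPENAI_API_KEY environment variable

vague_story = """Title: Payment with Stripe.
As a user, I want to pay through Stripe, so that I can pay.
Acceptance Criteria:
- Stripe integration should work.
- Payments should be successful."""

# One vague instruction, no context, no constraints - the LLM will
# fill every gap with assumptions.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Analyse this user story and create test cases\n\n{vague_story}"}],
)
print(response.choices[0].message.content)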

We got a result but…

As I expected, the LLM (ChatGPT) delivered as asked and replaced the missing information with assumptions.


It is obvious to anyone that this cannot be used, and if this is the result people are presented with, I fully understand why so many believe that AI is completely worthless when it comes to software testing.

I can almost see and hear the angry mob with torches and pitchforks on their way to slay the monster.


The result I got from my simple prompt, combined with the quality of the included user story, shows clearly that:

    • We need to be just as descriptive and accurate when prompting as when describing our user stories

    • The more detailed our input is, the better output we get

    • Using your professional skills and experience is vital to narrow down the best result


Improving the test basis makes a difference

I managed to improve the above test cases by prompting back and forth, but just as we have known for years, requirements modelling is an iterative process and the first take is never the last. I wondered if a much better user story would produce a much better result from the start, so I wrote one:


User Story

Title: User can subscribe to a monthly plan using Stripe payment.

As a customer,
I want to subscribe to a monthly plan using Stripe,
So that I can access premium features through a secure and convenient payment method.

Acceptance Criteria:
1. The subscription plan displays the price, billing frequency (e.g., $14.99 per month), and any add-ons (e.g., $2.99/month Member option).
2. Users can select additional services (e.g., “Add to your order”) before proceeding to payment.
3. The checkout page presents a “Pay and subscribe” button integrated with Stripe.
4. Users are prompted to enter their email address and payment method (e.g., credit card, debit card, Apple Pay) via the Stripe interface.
5. If the user has a saved payment method with Stripe's Link feature, it is pre-filled for faster checkout.
6. Users can choose to Pay without Link if they don’t wish to use saved payment information.
7. On clicking “Pay and subscribe”:
– If payment is successful, the user is redirected to a subscription confirmation page.
– If payment fails, an error message is shown, allowing the user to retry or select another payment method.
8. Recurring billing details are clearly stated (e.g., “Billed monthly”) with the ability to cancel the subscription anytime.
9. The total amount due is correctly calculated and displayed, reflecting selected add-ons and the subscription plan.
10. Payment information is processed securely via Stripe; no sensitive data is stored in the webshop’s systems.
11. Transaction details, including subscription start date and amount, are logged and accessible to both the user and admin.
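As a small illustration of the test data this story calls for, the expected totals in criterion 9 can be worked out up front from the prices in the story. The helper below and its names are mine, not part of the user story.

from decimal import Decimal

PLAN_PRICE = Decimal("14.99")                 # monthly plan, per the story
ADD_ONS = {"Member option": Decimal("2.99")}  # add-on price, per the story

def expected_total(selected_add_ons: list[str]) -> Decimal:
    # Total due = plan price plus every selected add-on (criterion 9)
    return PLAN_PRICE + sum((ADD_ONS[name] for name in selected_add_ons), Decimal("0"))

assert expected_total([]) == Decimal("14.99")                 # plan only
assert expected_total(["Member option"]) == Decimal("17.98")  # plan + add-on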

With a much more detailed user story it was time for another prompt:

Please examine the enclosed user story thoroughly and do the following:
1. Confirm the number of acceptance criteria.
2. Ensure that each acceptance criterion is covered by at least one test case.
3. The test cases should be in the same format as in industry-standard tools such as ALM, XRAY, Azure DevOps and similar.
4. Preconditions and test data must be considered.
5. I want to be able to trace the test cases back to the acceptance criteria.
6. Present the test cases in a table for better visualisation.

This time the result was much better, and even though the test cases are not perfect (will they ever be?), I believe that they form a good foundation to build on.

A few of the test cases I got from the analysis of the user story.
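Because an LLM can still silently drop a criterion, I recommend verifying points 1, 2 and 5 of the prompt mechanically rather than by eye. Here is a minimal sketch, assuming you export the generated test cases with a column tracing each one back to its acceptance criteria; the IDs and data shapes are hypothetical.

acceptance_criteria = [f"AC-{n}" for n in range(1, 12)]  # the 11 criteria above

# Hypothetical export: test case ID -> the criteria it traces back to.
test_cases = {
    "TC-01": ["AC-1"],
    "TC-02": ["AC-2", "AC-9"],
    # ... remaining test cases from the LLM's table
}

covered = {ac for traces in test_cases.values() for ac in traces}
missing = [ac for ac in acceptance_criteria if ac not in covered]

if missing:
    print("Not covered by any test case:", ", ".join(missing))
else:
    print("All acceptance criteria are covered at least once.")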


Conclusion so far

LLMs are not magic wands or fire-and-forget missiles. They require a lot of attention, just like any other test basis you will come across in your work in software testing. Hallucinations, bias and the occasional loss of data are all part of working with Large Language Models. If you do not pay attention and use a critical and professional approach, you will end up spending a lot of time on rework.

Having said that, I think they have improved quite a bit over the last year, and if you put a quality effort into the preparation, the input and the prompt, you will get a good result that can become great with a solid review and some polishing.

Stay tuned for more posts about AI and Software Testing.

Happy Testing
