AI in Software Testing – Modern Methods for Test Case Generation¶
Why do we use artificial intelligence in testing?
Modern systems are complex, with many states, many inputs, and rapid changes. Manual testing and manually written test cases are often:
- time-consuming,
- limited to typical cases while missing rare ones,
- difficult to keep aligned with fast development cycles.
Advantages of AI:
- discovering new edge cases (mapping the input space),
- automatic generation of UI and API tests,
- code- or specification-based testing,
- fast regression detection,
- UI navigation performed “like a human”.
Test Case Generation Techniques with AI¶
LLM-based Test Case Generation¶
LLMs (e.g., ChatGPT-like models) are capable of:
- writing unit test skeletons or complete tests,
- generating API tests from specifications,
- suggesting boundary cases,
- proposing ideas for invalid inputs.
Exercise: Using an LLM to generate test cases
Let’s explore what different LLMs can do for test case generation using the ATM machine example:
Consider an ATM system function where, if the user enters an invalid PIN three times, the account will be locked. If the PIN is correct on any attempt, the system grants access at that attempt. On the first and second attempts, if the PIN is incorrect, the user receives a warning.
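For reference, the pytest suite an LLM typically produces for this specification looks something like the sketch below. The AtmSession interface and its return values are assumptions for illustration; the exercise does not fix an interface, and real LLM output varies by model and prompt.

```python
# Hypothetical implementation under test; an LLM would usually be given
# only the specification and asked to produce tests like those below.
class AtmSession:
    def __init__(self, correct_pin: str):
        self.correct_pin = correct_pin
        self.failures = 0
        self.locked = False

    def enter_pin(self, pin: str) -> str:
        if self.locked:
            return "locked"
        if pin == self.correct_pin:
            return "access_granted"
        self.failures += 1
        if self.failures >= 3:
            self.locked = True
            return "locked"
        return "warning"


# Tests covering the specification's main outcomes.
def test_correct_pin_on_first_attempt():
    assert AtmSession("1234").enter_pin("1234") == "access_granted"

def test_warning_on_first_and_second_failure():
    atm = AtmSession("1234")
    assert atm.enter_pin("0000") == "warning"
    assert atm.enter_pin("1111") == "warning"

def test_lock_after_three_failures():
    atm = AtmSession("1234")
    for _ in range(3):
        result = atm.enter_pin("0000")
    assert result == "locked"

def test_correct_pin_on_third_attempt_grants_access():
    atm = AtmSession("1234")
    atm.enter_pin("0000")
    atm.enter_pin("0000")
    assert atm.enter_pin("1234") == "access_granted"
```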
Search-Based Software Testing (SBST)¶
SBST works with evolutionary algorithms:
- it mutates and selects test cases,
- goal: maximize code coverage.
Tools: EvoSuite (Java), Pynguin (Python).
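A minimal sketch of the core loop follows; the toy function and the coverage proxy are illustrative. The idea: mutate a small test suite and keep the variant that covers more branches.

```python
# SBST in miniature: evolve test inputs so that, together, they cover
# as many branches of the function under test as possible.
import random

def under_test(a: int, b: int) -> str:
    # Toy function with several branches to cover.
    if a > 100:
        return "big"
    if a == b:
        return "equal"
    if b < 0:
        return "negative"
    return "other"

def covered_branches(suite) -> set:
    # Fitness: the set of distinct outcomes (a crude branch-coverage proxy).
    return {under_test(a, b) for a, b in suite}

def mutate(suite):
    # Randomly perturb one test case in the suite.
    suite = list(suite)
    i = random.randrange(len(suite))
    a, b = suite[i]
    suite[i] = (a + random.randint(-50, 50), b + random.randint(-50, 50))
    return suite

suite = [(random.randint(-10, 10), random.randint(-10, 10)) for _ in range(4)]
for _ in range(200):
    candidate = mutate(suite)
    # Elitist selection: keep the suite that covers at least as many branches.
    if len(covered_branches(candidate)) >= len(covered_branches(suite)):
        suite = candidate

print(suite, covered_branches(suite))
```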
RL-based Exploratory Testing¶
What is reinforcement learning?
Reinforcement learning (RL) is a fundamental machine learning paradigm in which an agent learns to make decisions in an environment based on trial-and-error interactions and rewards. The goal is to maximize long-term cumulative reward.
In simple terms: The agent tries something → receives a reward or penalty → next time it acts more intelligently.

A reinforcement learning agent searches for faulty states by:
- clicking through UI elements in web applications,
- exploring mobile interfaces,
- navigating games and 3D environments.
Exercise: Exploring a Web Application UI
Let’s carry out the following exercise!
Create a Python virtual environment:
```bash
python -m venv venv
```
Activate this environment:
```bash
source venv/bin/activate   # on Windows: venv\Scripts\activate
```
Install Playwright:
```bash
pip install playwright
```
Install the browsers:
```bash
playwright install
```
This downloads the Chromium / WebKit / Firefox browser binaries that Playwright drives.
Install pytest:
```bash
pip install pytest
```
Install Flask:
```bash
pip install flask
```
Flask works well as a lightweight application server in development environments.
Create a demo_app.py file along the following lines:

```python
# demo_app.py -- a tiny Flask app: a menu, a form, and a deliberately
# broken page that returns HTTP 500 (the "bug" the agent should find).
from flask import Flask, request

app = Flask(__name__)

MENU = """
<h1>Demo App</h1>
<ul>
  <li><a href="/">Home</a></li>
  <li><a href="/form">Form</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="/broken">Broken page</a></li>
</ul>
"""

@app.route("/")
def home():
    return MENU + "<p>Welcome! Explore the links above.</p>"

@app.route("/about")
def about():
    return MENU + "<p>A small demo application for RL-style UI exploration.</p>"

@app.route("/form", methods=["GET", "POST"])
def form():
    if request.method == "POST":
        name = request.form.get("name", "").strip()
        if not name:
            return MENU + "<p id='error'>Name is required!</p>", 400
        return MENU + f"<p id='greeting'>Hello, {name}!</p>"
    return MENU + """
    <form method="post">
      <input type="text" name="name" id="name">
      <button type="submit" id="submit">Send</button>
    </form>
    """

@app.route("/broken")
def broken():
    # Intentional server error: the faulty state the explorer should discover.
    raise RuntimeError("Intentional failure for the exercise")

if __name__ == "__main__":
    app.run(port=5000)
```
This application provides a small menu, a form, and an intentionally 500-error page — ideal for RL-style exploration.
Create a test_rl_explorer.py script along the following lines:

```python
# test_rl_explorer.py -- a minimal RL-style explorer: an epsilon-greedy
# agent follows links, rewarding new pages and, above all, server errors.
import random
from collections import defaultdict

from playwright.sync_api import sync_playwright

BASE_URL = "http://127.0.0.1:5000"
EPISODES = 5
STEPS_PER_EPISODE = 15
EPSILON = 0.4          # probability of a random, exploratory action
LEARNING_RATE = 0.5


def test_rl_explorer():
    q_values = defaultdict(float)   # (page URL, link href) -> learned value
    visited = set()
    errors_found = set()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Record every HTTP 500 response as a "faulty state".
        page.on("response",
                lambda r: errors_found.add(r.url) if r.status >= 500 else None)

        for _ in range(EPISODES):
            page.goto(BASE_URL)
            for _ in range(STEPS_PER_EPISODE):
                state = page.url
                links = [a.get_attribute("href")
                         for a in page.query_selector_all("a")]
                links = [h for h in links if h and h.startswith("/")]
                if not links:
                    break   # dead end (e.g., the error page); new episode
                # Epsilon-greedy action selection over the visible links.
                if random.random() < EPSILON:
                    action = random.choice(links)
                else:
                    action = max(links, key=lambda h: q_values[(state, h)])
                page.goto(BASE_URL + action)  # following a link = clicking it
                # Reward: +10 for reaching an error, +1 for a brand-new page.
                if page.url in errors_found:
                    reward = 10.0
                elif page.url not in visited:
                    reward = 1.0
                else:
                    reward = 0.0
                visited.add(page.url)
                # Simple value update: move the estimate toward the reward.
                q_values[(state, action)] += LEARNING_RATE * (
                    reward - q_values[(state, action)])

        browser.close()

    print("Visited pages:", visited)
    print("Faulty states found:", errors_found)
    assert errors_found, "the agent should discover at least one 500-error page"
```
Run the application:
```bash
python demo_app.py
```
Run the test:
```bash
pytest test_rl_explorer.py -s   # in a second terminal, while the app is running
```
Browser Use (browser-use.com)¶
What is this?
Browser-use.com is a tool capable of controlling a browser using an AI agent.
The agent navigates like a human:
- clicks,
- fills fields,
- follows page logic,
- attempts to accomplish the given task.
Why is it relevant in testing?
- automatic generation of UI tests
- behavior close to real human interaction
- discovery of rarely visited paths
- quick regression testing
Usage of the tool
```python
# Sketch of the cloud SDK usage; method names may differ between SDK
# versions, so check the current browser-use documentation.
from browser_use_sdk import BrowserUse

client = BrowserUse(api_key="YOUR_API_KEY")
task = client.tasks.create_task(
    task="Open the demo shop, add the cheapest item to the cart, and check out."
)
result = task.complete()
print(result.done_output)
```
Runtime environment
To run the above code, you need an API key (available by registering at browser-use.com), and you must install the browser_use_sdk package: `pip install browser_use_sdk`
Exercise: FakeBrowser
If you prefer not to use an API key, try the following example:
```python
# FakeBrowser: an offline stand-in that mimics an AI browser agent,
# so the exercise works without any API key (illustrative sketch).
class FakeBrowser:
    def __init__(self):
        self.log = []

    def run(self, task: str):
        # Simulate the steps a real agent would take for the task.
        self.log.append(f"Task received: {task}")
        self.log.append("Opening page...")
        self.log.append("Clicking 'Login'...")
        self.log.append("Filling in the form...")
        self.log.append("Task completed.")
        return self.log


for step in FakeBrowser().run("Log in and submit the contact form"):
    print(step)
```
What Problems Can Arise During AI-based Testing?¶
AI is powerful, but not “magic”. There are many risks that developers and testers must be aware of — especially when AI generates test cases, UI actions, or input combinations automatically.
Explainability¶
What is the issue?
AI-generated test cases are often unclear:
- why the model chose that specific edge case,
- why it navigated in that order,
- why it marked a state as critical,
- why it considers a case a success or failure.
This is especially true for:
- large language models (LLMs),
- RL agents (UI testing),
- evolutionary algorithms (SBST).
Why is explainability a problem in testing?
- Test cases become non-reproducible or unintelligible.
- It is difficult to judge whether a discovered issue is relevant.
- Hard to argue for/against in code reviews.
- Hard to document in test management systems.
Engineering advice:
- Always accompany tests with:
    - a short description (can be generated by an LLM),
    - an explanation of the test goal,
    - a reconstruction of the test logic (e.g., coverage data).
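For instance, an AI-generated test is much easier to review and audit when it carries its own rationale. An illustrative sketch (the transfer function and its rule are hypothetical):

```python
import pytest

def transfer(source: str, target: str, amount: float) -> None:
    # Hypothetical function under test.
    if amount <= 0:
        raise ValueError("amount must be positive")

def test_transfer_rejects_negative_amount():
    """
    Goal: verify input validation on the transfer operation.
    Why this case: boundary just below the smallest valid amount.
    Origin: LLM-suggested edge case, human-reviewed.
    """
    with pytest.raises(ValueError):
        transfer(source="A", target="B", amount=-1.0)
```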
Concept Drift¶
What is concept drift?
The system changes, but the AI model behaves according to outdated logic.
Typical causes:
- UI changes (buttons, fields, menus),
- API updates, parameter rearrangement,
- new business rules appear,
- input data shifts over time.
What happens then?
- AI-generated tests produce incorrect results.
- Automated UI agent (e.g., Browser Use) clicks the wrong element.
- Evolutionary test generators produce irrelevant edge cases.
- LLM suggests tests based on outdated patterns.
Testing impact:
- false negatives: tests miss real bugs
- false positives: test reports a bug, but the system simply changed
- drastically decreasing code coverage
- AI-generated test suites become unsustainable
Engineering advice:
- periodic retraining/regeneration
- drift detectors in the pipeline (UI change monitoring)
- periodic “test case audits” with LLMs
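One lightweight drift detector is to fingerprint the UI structure the tests rely on and flag the suite for regeneration when the fingerprint changes. A sketch (the URL, the stored fingerprint, and the regeneration hook are placeholders):

```python
# Sketch: detect UI drift by hashing the page's interactive structure.
import hashlib
from playwright.sync_api import sync_playwright

def ui_fingerprint(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Collect the identifiers of interactive elements: the surface
        # that generated tests depend on.
        elements = page.eval_on_selector_all(
            "a, button, input, select",
            "els => els.map(e => e.tagName + ':' + (e.id || e.name || ''))",
        )
        browser.close()
    return hashlib.sha256("|".join(sorted(elements)).encode()).hexdigest()

# In the pipeline, compare against the fingerprint stored when the
# tests were generated (hypothetical names):
# if ui_fingerprint(APP_URL) != stored_fingerprint:
#     trigger_test_regeneration()
```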
Hallucination and False Assumptions (LLM Behavior)¶
What does this mean?
LLMs may write tests that:
- assume non-existent functions,
- assume nonexistent input formats,
- apply requirements that are not true locally,
- introduce version mismatches.
Why is hallucination dangerous?
- The test becomes green (passes), but checks nothing.
- Creates a false sense of safety.
- Overrides real specifications with “common patterns”.
Example: “Login API throws an error if the password is too short.” Many systems behave this way — but your system may not.
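Such a hallucinated test might look like this (hypothetical endpoint; the asserted rule comes from common patterns, not from your specification):

```python
import requests

def test_short_password_rejected():
    # The LLM assumed a "minimum password length" rule that the local
    # spec never states. If the system accepts short passwords, this test
    # fails for a non-existent requirement, or gets "fixed" until it
    # passes without checking anything real.
    r = requests.post("http://localhost:5000/api/login",
                      json={"user": "alice", "password": "abc"})
    assert r.status_code == 400  # assumed rule, not in the actual spec
```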
Stability Issues in UI Testing¶
Particularly with Browser Use or Playwright agents:
- dynamic DOM
- animations
- random element IDs
- slowly loading elements
- hidden elements / overlays
Typical effect: AI tries to click an element that no longer exists → test freezes or crashes.
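Explicit waiting guards against this. A sketch using Playwright's auto-waiting assertions, reusing the demo app's form button (the selector is illustrative):

```python
from playwright.sync_api import sync_playwright, expect

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("http://127.0.0.1:5000/form")
    send_button = page.get_by_role("button", name="Send")
    # expect() retries until the element is visible and actionable,
    # instead of failing immediately on a dynamic DOM.
    expect(send_button).to_be_visible(timeout=5000)
    send_button.click()
```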
Test Suite Degradation¶
Long-term, AI-generated tests may:
- fragment,
- duplicate,
- become redundant,
- become hard to maintain,
- lack documentation.
This is the modern form of classic “script aging”.
Data Issues and Over-generated Edge Cases¶
AI may be “too creative”:
- generates unrealistic test data
- extremely long strings, invalid characters, Unicode oddities
- excessively many combinations
Good for bug-hunting, but may:
- slow test execution significantly,
- test irrelevant scenarios,
- overload the CI pipeline.
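One mitigation is to keep the creativity but bound it. A sketch with the hypothesis library (the function under test and the limits are illustrative):

```python
from hypothesis import given, settings, strategies as st

def handle_search(query: str) -> str:
    # Hypothetical function under test.
    return query.strip().lower()

# Bound the generated inputs (length cap) and the number of examples,
# so exploratory data stays useful without overloading CI.
@settings(max_examples=50)
@given(st.text(max_size=200))
def test_search_never_crashes(query):
    assert handle_search(query) is not None
```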
Security Risks¶
If AI agents run with broad permissions:
- may access sensitive data,
- may modify databases unexpectedly,
- may delete or create entities.
In short: AI agent = powerful automation → must restrict scope.
Responsibility Issues – Who Owns the Decision?¶
If an AI-generated test approves a system but a bug appears later:
- who made the mistake?
- the model?
- the developer who trusted it?
- the test manager?
Critical in automotive, safety-critical, and banking domains.
Supplementary Material: A/B Testing and AI¶
What is A/B testing?
A/B testing is a comparative experimental method used to determine which of two (or more) versions performs better for a given goal.
In short: You show two versions to users → you observe which one performs better.

At least two variants are required:
- A = control (original version)
- B = modified version (new design, new text, new algorithm, new feature, etc.)
The system randomly splits users into groups:
- one group sees version A
- the other sees version B
Then we compare which version results in higher conversion, fewer errors, better user behavior, or better performance.
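The comparison itself is ordinary statistics: for conversion rates, a two-proportion z-test. A minimal sketch with made-up counts:

```python
from math import sqrt, erfc

# Conversions / visitors for each variant (illustrative numbers).
conv_a, n_a = 120, 2400   # A: 5.0% conversion
conv_b, n_b = 156, 2400   # B: 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")  # here p < 0.05: B's lift is significant
```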
AI as Decision Support in A/B Testing¶
A) Automatic generation of test variants
AI-LLMs can:
- propose UI variants,
- generate alternative wording, CTA, layout,
- simulate API performance variants,
- create microcopy / UX modifications.
Testing significance: You can quickly generate many variants — expanding the test space, but requiring careful control (see: test case explosion).
B) Automatic experiment design (AI-driven experimental design)
AI can analyze past data to determine:
- what problems are worth testing now,
- where performance improvement is likely,
- which changes were effective in the past.
This is essentially meta-level test support.
AI in Traffic Allocation: Multi-Armed Bandit (MAB)¶
Very important — this is where AI can directly intervene in A/B test execution.
What is the multi-armed bandit?
An algorithm that dynamically distributes traffic between A and B (or more versions), not statically.
Whereas classical A/B says:
- 50% A, 50% B, and observe,
- …the bandit says: allocate more traffic to the version that seems better — with continuous learning.
Why AI?
Bandit algorithms (Thompson sampling, UCB1, Bayesian approaches) often include ML/AI components, such as:
- predictive models,
- conversion probability estimation,
- context-aware decision making (“contextual bandit”).
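A minimal Thompson-sampling sketch for two variants (the hidden conversion rates are made up):

```python
# Thompson sampling over two variants with Bernoulli conversions.
import random

TRUE_RATES = [0.05, 0.08]   # hidden "real" conversion rates (B is better)
successes = [0, 0]
failures = [0, 0]
shown = [0, 0]

for _ in range(10_000):
    # Sample a plausible rate for each variant from its Beta posterior,
    # then show the variant whose sample is highest.
    samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
               for i in (0, 1)]
    arm = samples.index(max(samples))
    shown[arm] += 1
    # Simulate the user's reaction and update that variant's statistics.
    if random.random() < TRUE_RATES[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

print(f"Impressions: A={shown[0]}, B={shown[1]}")  # traffic shifts toward B
```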
Advantages:
- faster optimization
- fewer users “lost” on a weaker version
- continuous adaptation (no fixed time window)
Problems:
- explainability — difficult to justify why it preferred that version
- concept drift — if user behavior changes, the algorithm may mis-learn
- not classical A/B statistics — p-values and confidence intervals do not apply as usual
- bias — seasonality can distort decisions
AI in Evaluation¶
AI can:
- automatically detect outliers,
- signal non-significant results,
- identify temporal trends,
- find user clusters,
- reduce noise.
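The simplest of these is plain statistics: flag periods whose metric deviates strongly from the rest. A sketch with made-up data:

```python
from statistics import mean, stdev

# Daily conversion rates; one day is suspicious.
daily_conversion = [0.051, 0.049, 0.052, 0.050, 0.048, 0.095, 0.050]

mu, sigma = mean(daily_conversion), stdev(daily_conversion)
for day, rate in enumerate(daily_conversion):
    z = (rate - mu) / sigma
    if abs(z) > 2:  # small sample, so a loose threshold
        print(f"Day {day}: rate {rate:.3f} is an outlier (z = {z:.1f})")
```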
Points of caution:
- avoid model bias (“p-hacking by AI”),
- avoid learning from noisy randomness,
- avoid recommending changes based on spurious patterns.
AI in Variant Fine-Tuning – Autonomous Optimization¶
Modern systems (e.g., growth-hacking platforms, Netflix experimentation tools) can:
- generate new variants (A’, B’, C…),
- run them in mini-bandit environments,
- decide which ones show promise,
- and only promote good ones into the main test.
Risk:
- this becomes black-box optimization
- difficult to track why a variant won
- AI bias may override business logic