
AI in Software Testing – Modern Methods for Test Case Generation

Why do we use artificial intelligence in testing?

Modern systems are complex, with many states, many inputs, and rapid changes. Manual testing and manually written test cases are often:

  • time-consuming,
  • limited to typical cases while missing rare ones,
  • difficult to keep aligned with fast development cycles.

Advantages of AI:

  • discovering new edge cases (mapping the input space),
  • automatic generation of UI and API tests,
  • code- or specification-based testing,
  • fast regression detection,
  • UI navigation performed “like a human”.

Test Case Generation Techniques with AI

LLM-based Test Case Generation

LLMs (e.g., ChatGPT-like models) are capable of:

  • writing unit test skeletons or complete tests,
  • generating API tests from specifications,
  • suggesting boundary cases,
  • proposing ideas for invalid inputs.

Exercise: Using an LLM to generate test cases

Let’s explore what different LLMs can do for test case generation using the following ATM example:

Consider an ATM system function where, if the user enters an invalid PIN three times, the account will be locked. If the PIN is correct on any attempt, the system grants access at that attempt. On the first and second attempts, if the PIN is incorrect, the user receives a warning.
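As an illustration, here is the kind of pytest parametrized test an LLM typically produces for this specification. The atm_login function below is a hypothetical stand-in for the real system under test; its name, signature, and return values are assumptions made only so the example runs on its own.

import pytest

# Hypothetical system under test, assumed for illustration only.
# pins: the sequence of PIN attempts the user enters.
def atm_login(pins, correct_pin="1234"):
    """Toy model of the specification: warning on the first two failures, lock on the third."""
    status = None
    for attempt, pin in enumerate(pins, start=1):
        if pin == correct_pin:
            return "ACCESS_GRANTED"
        if attempt >= 3:
            return "LOCKED"
        status = "WARNING"
    return status

@pytest.mark.parametrize("pins, expected", [
    (["1234"], "ACCESS_GRANTED"),                  # correct on the first attempt
    (["0000", "1234"], "ACCESS_GRANTED"),          # one warning, then correct
    (["0000", "1111", "1234"], "ACCESS_GRANTED"),  # correct on the last allowed attempt
    (["0000"], "WARNING"),                         # one wrong attempt -> warning
    (["0000", "1111"], "WARNING"),                 # two wrong attempts -> warning
    (["0000", "1111", "2222"], "LOCKED"),          # three wrong attempts -> account locked
])
def test_atm_pin_logic(pins, expected):
    assert atm_login(pins) == expected

Compare the cases different LLMs propose against this list: good suggestions usually cover success on each attempt, the two warning cases, and the lock-out case, and often add extras such as non-numeric PINs.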

Search-Based Software Testing (SBST)

SBST works with evolutionary algorithms:

  • it mutates and selects test cases,
  • goal: maximize code coverage (a minimal sketch of the idea follows below).

Tools:

  • EvoSuite (search-based unit test generation for Java),
  • Pynguin (search-based unit test generation for Python).
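Independently of any specific tool, the core idea can be shown in a few lines: a genetic-style loop mutates candidate inputs and selects them by a branch-distance fitness until a hard-to-reach branch of a toy function is covered. Everything here (the toy sut, the distance formula, the parameter values) is invented for the sketch.

import random

# Toy system under test: we want an input that reaches the "hidden" branch.
def sut(x):
    if x > 100 and x % 7 == 0:
        return "hidden branch"
    return "normal branch"

# SBST-style fitness: "branch distance" -- how far the input is from satisfying
# the conditions of the target branch (0 means the branch is reached).
def branch_distance(x):
    d1 = max(0, 101 - x)   # distance to satisfy x > 100
    d2 = x % 7             # distance to satisfy x % 7 == 0
    return d1 + d2

def evolve(pop_size=20, generations=200):
    population = [random.randint(-1000, 1000) for _ in range(pop_size)]
    for gen in range(generations):
        # Mutation: perturb each candidate a little.
        mutants = [x + random.randint(-10, 10) for x in population]
        # Selection: keep the candidates closest to covering the target branch.
        population = sorted(population + mutants, key=branch_distance)[:pop_size]
        if branch_distance(population[0]) == 0:
            return population[0], gen
    return population[0], generations

if __name__ == "__main__":
    x, gen = evolve()
    print(f"Input {x} reaches the hidden branch (found in generation {gen})")
    assert sut(x) == "hidden branch"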

RL-based Exploratory Testing

What is reinforcement learning?

Reinforcement learning (RL) is a fundamental machine learning paradigm in which an agent learns to make decisions in an environment based on trial-and-error interactions and rewards. The goal is to maximize long-term cumulative reward.

In simple terms: The agent tries something → receives a reward or penalty → next time it acts more intelligently.
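In code, the "acts more intelligently next time" part is usually a value update. A minimal tabular Q-learning sketch, with made-up states, actions, and hyperparameters, just to show the mechanics:

import random
from collections import defaultdict

# Q[state][action] estimates the long-term value of taking `action` in `state`.
Q = defaultdict(lambda: defaultdict(float))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

def choose_action(state, actions):
    # Epsilon-greedy: mostly exploit what was learned, sometimes explore.
    if random.random() < epsilon or not Q[state]:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def update(state, action, reward, next_state):
    # Classic Q-learning update rule.
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# One interaction: state "home", chosen action, reward 1.0, next state "form".
a = choose_action("home", ["click_form", "scroll"])
update("home", a, reward=1.0, next_state="form")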

Reinforcement learning

A reinforcement learning agent can search for faulty states by:

  • clicking through UI elements,
  • exploring mobile interfaces,
  • navigating games and 3D environments.

Exercise: Exploring a Web Application UI

Let’s carry out the following exercise!

Create a Python virtual environment:

python3 -m venv venv

Activate this environment:

source venv/bin/activate

Install Playwright:

pip install playwright

Install the browsers:

playwright install

This downloads the Chromium / WebKit / Firefox browser binaries that Playwright drives.

Install pytest:

pip install pytest

Install Flask:

pip install flask

Flask works well as a lightweight application server in development environments.

Create the following demo_app.py file:

from flask import Flask, render_template_string, request

app = Flask(__name__)

PAGE = """
<!doctype html>
<title>RL Demo App</title>
<h1>RL Demo – Simple UI</h1>

<nav>
    <a href="/">Home</a> |
    <a href="/form">Form</a> |
    <a href="/error">Error</a>
</nav>

{% if page == "home" %}
<p>Welcome to the home page.</p>
{% elif page == "form" %}
<form method="post">
    <label>Username: <input name="username"></label><br>
    <label>Age: <input name="age"></label><br>
    <button type="submit">Submit</button>
</form>
{% if submitted %}
    <p>Submitted: {{ username }} ({{ age }})</p>
{% endif %}
{% elif page == "error" %}
    {% if trigger_error %}
        {% set x = 1 / 0 %}
    {% else %}
        <p>Click the button to trigger server error.</p>
        <form method="post">
            <button type="submit">Trigger error</button>
        </form>
    {% endif %}
{% endif %}
"""

@app.route("/", methods=["GET"])
def index():
    return render_template_string(PAGE, page="home")

@app.route("/form", methods=["GET", "POST"])
def form():
    submitted = False
    username = ""
    age = ""
    if request.method == "POST":
        submitted = True
        username = request.form.get("username", "")
        age = request.form.get("age", "")
    return render_template_string(
        PAGE,
        page="form",
        submitted=submitted,
        username=username,
        age=age,
    )

@app.route("/error", methods=["GET", "POST"])
def error_page():
    trigger_error = (request.method == "POST")
    # If a POST arrives, we intentionally raise an exception -> 500 error
    return render_template_string(PAGE, page="error", trigger_error=trigger_error)


if __name__ == "__main__":
    app.run(port=5000, debug=True)

This application provides a small menu, a form, and a page that intentionally triggers a 500 error, which makes it ideal for RL-style exploration.

Create the following test_rl_explorer.py script:

import random
from playwright.sync_api import sync_playwright, Error as PlaywrightError

# Only “meaningful” actions
ACTIONS = ["click_random", "type_random", "scroll"]


def step(page):
    """An RL-like step. Returns (reward, done)."""
    # If the page closed meanwhile, signal termination
    if page.is_closed():
        return 0.0, True

    action = random.choice(ACTIONS)
    reward = 0.0

    try:
        if action == "click_random":
            # only clickable elements
            elements = page.query_selector_all("a, button, input[type=submit], [role='button']")
            if elements:
                el = random.choice(elements)
                el.click(timeout=1000)
                reward += 1.0

        elif action == "type_random":
            inputs = page.query_selector_all("input[type=text], input:not([type])")
            if inputs:
                el = random.choice(inputs)
                el.fill("TEST" + str(random.randint(0, 9999)))
                reward += 2.0

        elif action == "scroll":
            page.mouse.wheel(0, 300)
            reward += 0.5

        # Error detection – if 500 or error text appears
        try:
            content = page.content()
            if "Internal Server Error" in content or "500" in content or "Exception" in content:
                reward += 10.0
        except PlaywrightError:
            # If this also fails, treat as terminal state
            return reward, True

        return reward, False

    except PlaywrightError as e:
        # This will catch TargetClosedError as well
        print("Playwright-hiba a lépés közben:", repr(e))
        # In RL terms: this is a terminal state
        return -5.0, True


def test_rl_like_explorer():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # Your own webapp — important that demo_app.py is running!
        page.goto("http://localhost:5000")

        total_reward = 0.0
        steps = 0

        for _ in range(50):  # max. 50 steps
            reward, done = step(page)
            total_reward += reward
            steps += 1
            print(f"Lépés {steps}, reward={reward}, total={total_reward}")
            if done:
                print("Terminal state — stopping.")
                break

        print("Final cumulative reward:", total_reward)
        browser.close()

        # The test is green even if an error was found / page closed;
        # only requirement: at least 1 step executed.
        assert steps >= 1

Run the application:

python demo_app.py

Run the test:

pytest -s test_rl_explorer.py

Browser Use (browser-use.com)

What is this?

Browser-use.com is a tool capable of controlling a browser using an AI agent.
The agent navigates like a human:

  • clicks,
  • fills fields,
  • follows page logic,
  • attempts to accomplish the given task.

Why is it relevant in testing?

  • automatic generation of UI tests
  • behavior close to real human interaction
  • discovery of rarely visited paths
  • quick regression testing

Usage of the tool

from browser_use_sdk import BrowserUse

client = BrowserUse(api_key="YOUR_KEY")

task = client.tasks.create_task(
    task="Open the site, log in with the test credentials, navigate to dashboard, verify page loads"
)

Runtime environment

To run the above code, you need an API key (available after registering on browser-use.com), and you must install the browser_use_sdk package: pip install browser_use_sdk

Exercise: FakeBrowser

If you prefer not to use an API key, try the following example:

class FakeBrowserUse:
    class FakeTasks:
        def create_task(self, task: str):
            print(f"[FAKE] create_task called with: {task}")
            return {"id": "fake-task-id", "status": "created"}
    def __init__(self, api_key: str):
        self.tasks = self.FakeTasks()

# Usage:
client = FakeBrowserUse(api_key="dummy")
task = client.tasks.create_task(
    task="Open the site, log in with the test credentials, navigate to dashboard, verify page loads"
)
print(task)

What Problems Can Arise During AI-based Testing?

AI is powerful, but not “magic”. There are many risks that developers and testers must be aware of — especially when AI generates test cases, UI actions, or input combinations automatically.

Explainability

What is the issue?

AI-generated test cases are often unclear:

  • why the model chose that specific edge case,
  • why it navigated in that order,
  • why it marked a state as critical,
  • why it considers a case a success or failure.

This is especially true for:

  • large language models (LLMs),
  • RL agents (UI testing),
  • evolutionary algorithms (SBST).

Why is explainability a problem in testing?

  • Test cases become non-reproducible or unintelligible.
  • It is difficult to judge whether a discovered issue is relevant.
  • Hard to argue for/against in code reviews.
  • Hard to document in test management systems.

Engineering advice (a minimal sketch follows this list):

  • Always accompany tests with:
    • a short description (can be generated by an LLM),
    • an explanation of the test goal,
    • a reconstruction of test logic (e.g., coverage data).
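A lightweight way to do this is to attach the explanation directly to the generated test, for example in its docstring and a custom marker. A sketch only: the marker name, its fields, and the validate_age function are illustrative conventions, not a standard.

import pytest

# Hypothetical function under test, assumed for illustration.
def validate_age(age: int) -> bool:
    return 0 <= age <= 120

# Custom marker carrying the "why" of the generated test.
# Register `ai_generated` in pytest.ini to silence the unknown-marker warning.
@pytest.mark.ai_generated(
    goal="Boundary check for the age field",
    rationale="LLM suggested 0 and 120 as realistic lower/upper limits",
)
def test_age_boundaries():
    """Ages 0 and 120 are accepted, -1 and 121 are rejected."""
    assert validate_age(0)
    assert validate_age(120)
    assert not validate_age(-1)
    assert not validate_age(121)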

Concept Drift

What is the concept?

The system changes, but the AI model behaves according to outdated logic.

Typical causes:

  • UI changes (buttons, fields, menus),
  • API updates, parameter rearrangement,
  • new business rules appear,
  • input data shifts over time.

What happens then?

  • AI-generated tests produce incorrect results.
  • Automated UI agent (e.g., Browser Use) clicks the wrong element.
  • Evolutionary test generators produce irrelevant edge cases.
  • LLM suggests tests based on outdated patterns.

Testing impact:

  • false negatives: tests miss real bugs
  • false positives: test reports a bug, but the system simply changed
  • drastically decreasing code coverage
  • AI-generated test suites become unsustainable

Engineering advice:

  • periodic retraining/regeneration
  • drift detectors in the pipeline (UI change monitoring; see the sketch after this list)
  • periodic “test case audits” with LLMs
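The "drift detector" idea can be as simple as fingerprinting the interactive elements on each run and diffing against the previous run. A sketch built on the Playwright setup above; it assumes the demo app is running on localhost:5000, and the selectors and snapshot file name are arbitrary choices.

import json
import pathlib
from playwright.sync_api import sync_playwright

SNAPSHOT_FILE = pathlib.Path("ui_snapshot.json")  # arbitrary file name for the sketch

def current_ui_fingerprint(url="http://localhost:5000"):
    """Collect a crude fingerprint of the interactive elements on the page."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        elements = page.query_selector_all("a, button, input")
        fingerprint = sorted(
            f"{el.evaluate('e => e.tagName')}|{el.get_attribute('name') or ''}|{(el.inner_text() or '').strip()}"
            for el in elements
        )
        browser.close()
        return fingerprint

def detect_drift():
    current = current_ui_fingerprint()
    if SNAPSHOT_FILE.exists():
        previous = json.loads(SNAPSHOT_FILE.read_text())
        added = set(current) - set(previous)
        removed = set(previous) - set(current)
        if added or removed:
            print("UI drift detected! added:", added, "removed:", removed)
    SNAPSHOT_FILE.write_text(json.dumps(current))

if __name__ == "__main__":
    detect_drift()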

Hallucination and False Assumptions (LLM Behavior)

What does this mean?

LLMs may write tests that:

  • assume non-existent functions,
  • assume nonexistent input formats,
  • apply requirements that are not true locally,
  • introduce version mismatches.

Why is hallucination dangerous?

  • The test becomes green (passes), but checks nothing.
  • Creates a false sense of safety.
  • Overrides real specifications with “common patterns”.

Example: “Login API throws an error if the password is too short.” Many systems behave this way — but your system may not.

Stability Issues in UI Testing

Particularly with Browser Use or Playwright agents:

  • dynamic DOM
  • animations
  • random element IDs
  • slowly loading elements
  • hidden elements / overlays

Typical effect: AI tries to click an element that no longer exists → test freezes or crashes.
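A common mitigation is to lean on Playwright's locator-based auto-waiting and to treat a failed click as a recoverable event instead of letting the whole run crash. A minimal sketch, again assuming the demo app from the earlier exercise is running:

from playwright.sync_api import sync_playwright, Error as PlaywrightError

def safe_click(page, selector, timeout_ms=3000):
    """Click only once the element is attached and visible; report instead of crashing."""
    try:
        locator = page.locator(selector).first
        locator.wait_for(state="visible", timeout=timeout_ms)
        locator.click(timeout=timeout_ms)
        return True
    except PlaywrightError as e:
        print(f"Skipping unstable element {selector!r}: {e}")
        return False

if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:5000")  # the demo app from the earlier exercise
        safe_click(page, "a[href='/form']")
        browser.close()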

Test Suite Degradation

Long-term, AI-generated tests may:

  • fragment,
  • duplicate,
  • become redundant,
  • become hard to maintain,
  • lack documentation.

This is the modern form of classic “script aging”.

Data Issues and Over-generated Edge Cases

AI may be “too creative”:

  • generates unrealistic test data
  • extremely long strings, invalid characters, Unicode oddities
  • excessively many combinations

Good for bug-hunting, but may (see the bounded-generation sketch after this list):

  • slow test execution significantly,
  • test irrelevant scenarios,
  • overload the CI pipeline.
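One way to keep generated data useful is to bound it explicitly. A sketch using the Hypothesis library (pip install hypothesis); the validate_username function and the chosen limits are assumptions for illustration.

from hypothesis import given, settings, strategies as st

# Hypothetical function under test, assumed for illustration.
def validate_username(name: str) -> bool:
    return 1 <= len(name) <= 30 and name.isprintable()

# Bounded strategy: realistic lengths and characters, capped number of examples,
# instead of letting the generator explode into huge Unicode oddities.
@settings(max_examples=50)
@given(st.text(alphabet=st.characters(min_codepoint=32, max_codepoint=126),
               min_size=1, max_size=30))
def test_validate_username_accepts_realistic_names(name):
    assert validate_username(name)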

Security Risks

If AI agents run with broad permissions:

  • may access sensitive data,
  • may modify databases unexpectedly,
  • may delete or create entities.

In short: AI agent = powerful automation → must restrict scope.

Responsibility Issues – Who Owns the Decision?

If an AI-generated test approves a system but a bug appears later:

  • who made the mistake?
    • the model?
    • the developer who trusted it?
    • the test manager?

Critical in automotive, safety-critical, and banking domains.

Supplementary Material: A/B Testing and AI

What is A/B testing?

A/B testing is a comparative experimental method used to determine which of two (or more) versions performs better for a given goal.

In short: You show two versions to users → you observe which one performs better.

AB testing

At least two variants are required:

  • A = control (original version)
  • B = modified version (new design, new text, new algorithm, new feature, etc.)

The system randomly splits users into groups:

  • one group sees version A
  • the other sees version B

Then we compare which version results in higher conversion, fewer errors, better user behavior, or better performance.
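In practice the "random" split is usually deterministic per user, so a returning user always sees the same variant. A minimal hash-based bucketing sketch; the experiment name and split ratio are arbitrary here.

import hashlib

def assign_variant(user_id: str, experiment: str = "checkout-button", b_share: float = 0.5) -> str:
    """Deterministically map a user to variant A or B."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform value in [0, 1]
    return "B" if bucket < b_share else "A"

# Usage: the same user always lands in the same group.
print(assign_variant("user-42"))   # e.g. "A"
print(assign_variant("user-42"))   # same result on every call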

AI as Decision Support in A/B Testing

A) Automatic generation of test variants

AI, especially LLMs, can:

  • propose UI variants,
  • generate alternative wording, CTA, layout,
  • simulate API performance variants,
  • create microcopy / UX modifications.

Testing significance: You can quickly generate many variants — expanding the test space, but requiring careful control (see: test case explosion).

B) Automatic experiment design (AI-driven experimental design)

AI can analyze past data to determine:

  • what problems are worth testing now,
  • where performance improvement is likely,
  • which changes were effective in the past.

This is essentially meta-level test support.

AI in Traffic Allocation: Multi-Armed Bandit (MAB)

Very important — this is where AI can directly intervene in A/B test execution.

What is the multi-armed bandit?

An algorithm that dynamically distributes traffic between A and B (or more versions), not statically.

Whereas classical A/B says:

  • 50% A, 50% B, and observe,
  • …the bandit says: allocate more traffic to the version that seems better — with continuous learning.

Why AI?

Bandit algorithms (Thompson sampling, UCB1, Bayesian approaches) often include ML/AI components (a minimal Thompson sampling sketch follows this list), such as:

  • predictive models,
  • conversion probability estimation,
  • context-aware decision making (“contextual bandit”).
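A minimal Thompson sampling sketch for two variants with binary conversions (Beta posteriors). The conversion rates are simulated; real platforms add context features and guardrails on top of this.

import random

# Beta(successes + 1, failures + 1) posterior per variant.
stats = {"A": {"succ": 0, "fail": 0}, "B": {"succ": 0, "fail": 0}}

def pick_variant():
    # Sample a plausible conversion rate for each variant and pick the best sample.
    samples = {
        v: random.betavariate(s["succ"] + 1, s["fail"] + 1)
        for v, s in stats.items()
    }
    return max(samples, key=samples.get)

def record_result(variant, converted):
    key = "succ" if converted else "fail"
    stats[variant][key] += 1

# Simulated traffic: B truly converts better, so it gradually receives more users.
TRUE_RATE = {"A": 0.05, "B": 0.08}
shown = {"A": 0, "B": 0}
for _ in range(5000):
    v = pick_variant()
    shown[v] += 1
    record_result(v, random.random() < TRUE_RATE[v])
print("Traffic allocation:", shown)

Because better-performing variants are sampled with higher probability, traffic drifts toward B without the split ever being hard-coded.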

Advantages:

  • faster optimization
  • fewer users “lost” on a weaker version
  • continuous adaptation (no fixed time window)

Problems:

  • explainability — difficult to justify why it preferred that version
  • concept drift — if user behavior changes, the algorithm may mis-learn
  • not classical A/B statistics — p-values and confidence intervals do not apply as usual
  • bias — seasonality can distort decisions

AI in Evaluation

AI can:

  • automatically detect outliers (see the sketch after this list),
  • signal non-significant results,
  • identify temporal trends,
  • find user clusters,
  • reduce noise.
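As an example of the first point, even a simple z-score check can flag an anomalous day in a metric series before it silently skews the evaluation. The daily values below are fabricated for the sketch.

from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [(i, v) for i, v in enumerate(values) if sigma and abs(v - mu) / sigma > threshold]

# Daily conversion rates for variant B; day 7 contains a tracking glitch.
daily_conversion = [0.051, 0.049, 0.052, 0.050, 0.048, 0.053, 0.050, 0.150, 0.049, 0.051]
# A lower threshold is used because the outlier itself inflates sigma on a short series.
print(zscore_outliers(daily_conversion, threshold=2.5))   # -> [(7, 0.15)]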

Points of caution:

  • avoid model bias (“p-hacking by AI”),
  • avoid learning from noisy randomness,
  • avoid recommending changes based on spurious patterns.

AI in Variant Fine-Tuning – Autonomous Optimization

Modern systems (e.g., growth-hacking platforms, Netflix experimentation tools) can:

  • generate new variants (A’, B’, C…),
  • run them in mini-bandit environments,
  • decide which ones show promise,
  • and only promote good ones into the main test.

Risk:

  • this becomes black-box optimization
  • difficult to track why a variant won
  • AI bias may override business logic

Last updated: 2025-11-23 11:26:13