Testing Chatbots and LLM Features

What makes chatbot testing different

A chatbot can respond in many acceptable ways, but not every response is acceptable. You test the conversation outcome, the boundaries, and the risk. The feature may need to answer, ask a clarifying question, refuse, escalate, or say it does not know.

Better vs weaker evidence

A benefits chatbot answers a supported policy question but keeps talking after the user asks for payroll advice it should not handle.

Stronger evidence
The tester checks the answer, the boundary, the fallback path, and the handoff behavior.

Thin evidence
The tester notes that the bot sounds helpful and misses that it has drifted outside the product scope.

Conversation flow testing

Test a supported question from start to finish. Then test an unsupported question, a user who changes intent mid-conversation, and a user who gives incomplete information. The bot should not trap the user in a dead end.

In practice

A user asks about a refund, changes to shipping, then asks whether the first answer still applies.

What helps: The bot follows the new intent and does not mix old refund context into the shipping answer.

What gets missed: The conversation becomes a dead end after the intent change.
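A scripted version of the refund-then-shipping conversation might look like the sketch below. It is only an illustration: stub_bot is a hypothetical stand-in for the chatbot under test, and the keyword checks are an assumed way to describe acceptable behavior without fixing exact wording.

```python
from typing import Callable, List, Tuple

# Each turn: (user message, text the reply must contain, text it must not contain).
Turn = Tuple[str, str, str]


def stub_bot(history: List[str], message: str) -> str:
    """Hypothetical stand-in for the chatbot under test."""
    text = message.lower()
    if "shipping" in text:
        return "Standard shipping takes 3 to 5 business days."
    if "refund" in text:
        return "Refunds are available within 30 days of purchase."
    return "Could you tell me a bit more about what you need?"


def run_script(bot: Callable[[List[str], str], str], script: List[Turn]) -> List[str]:
    """Play the script in order and return a list of failure notes."""
    history: List[str] = []
    failures: List[str] = []
    for number, (message, must_have, must_not_have) in enumerate(script, start=1):
        reply = bot(history, message).lower()
        if must_have not in reply:
            failures.append(f"turn {number}: expected '{must_have}'")
        if must_not_have and must_not_have in reply:
            failures.append(f"turn {number}: unexpected '{must_not_have}'")
        history.append(message)
    return failures


script: List[Turn] = [
    ("How do I get a refund?", "refund", ""),
    ("Actually, how long does shipping take?", "shipping", "refund"),
    ("Does the refund answer still apply?", "refund", ""),
]
failures = run_script(stub_bot, script)
print(failures or "no failures")
```

The second turn is the one that matters here: it fails if old refund context shows up in the shipping answer.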

Prompt behavior

Try direct prompts, vague prompts, misspelled prompts, long prompts, and prompts with misleading context. A good test case records the prompt, the output, the expected behavior, and the risk if the output is wrong.

What to save

Situation: A small wording change makes an AI feature ignore a required warning.
Evidence: The tester keeps prompt variants and compares the boundary behavior.
Easy miss: One approved prompt is tested and the risky phrasing users actually type is missed.
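One way to keep prompt variants side by side is sketched below. Treat it as an assumption-heavy example: stub_bot, the warning text, and the record fields are all hypothetical, and a real check would call the feature under test instead of the stub.

```python
from typing import Dict, List

# Hypothetical warning the product requires on this kind of answer.
REQUIRED_WARNING = "check with your plan administrator"


def stub_bot(prompt: str) -> str:
    """Hypothetical bot that only adds the warning for one phrasing."""
    answer = "You can take money out early, but a penalty may apply."
    if "withdraw" in prompt.lower():
        return answer + " Please check with your plan administrator."
    return answer


def record_variants(prompts: List[str]) -> List[Dict[str, object]]:
    """Build one record per variant: prompt, output, expectation, risk."""
    records: List[Dict[str, object]] = []
    for prompt in prompts:
        output = stub_bot(prompt)
        records.append({
            "prompt": prompt,
            "output": output,
            "expected": f"answer includes '{REQUIRED_WARNING}'",
            "risk": "user acts without the required warning",
            "warning_present": REQUIRED_WARNING in output.lower(),
        })
    return records


variants = [
    "Can I withdraw money early?",
    "can i take my cash out b4 retirement",
    "Is early withdrawl allowed?",
]
records = record_variants(variants)
missing = [r["prompt"] for r in records if not r["warning_present"]]
print(missing)
```

The casual phrasing is the variant that loses the warning, which is the kind of result a single approved prompt would never show.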

Context handling

Context is where chatbots often stumble. Ask a question, add a correction, then ask a follow-up. Confirm the bot uses the current context without inventing details. If the user changes intent, confirm the bot follows the change.

What to save

Situation: A user corrects an order number mid-conversation and the bot continues using the old number.
Evidence: The tester records the correction point and confirms the next answer uses the updated context.
Easy miss: The bot ignores the correction and gives advice for the wrong record.
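The correction check can be written as a small replay, as in this sketch. Both bots are hypothetical stubs, and the five-digit order number format is an assumption made only for the example.

```python
import re
from typing import Callable, List

ORDER_NUMBER = re.compile(r"\b\d{5}\b")  # assumed 5-digit order numbers


def latest_number_bot(user_messages: List[str]) -> str:
    """Hypothetical bot that answers with the most recent order number."""
    numbers = ORDER_NUMBER.findall(" ".join(user_messages))
    return f"Order {numbers[-1]} ships tomorrow." if numbers else "Which order?"


def first_number_bot(user_messages: List[str]) -> str:
    """Hypothetical buggy bot that keeps the first order number it saw."""
    numbers = ORDER_NUMBER.findall(" ".join(user_messages))
    return f"Order {numbers[0]} ships tomorrow." if numbers else "Which order?"


def follows_correction(
    bot: Callable[[List[str]], str], turns: List[str], old: str, new: str
) -> bool:
    """Replay the turns and check that the final reply uses only the corrected value."""
    seen: List[str] = []
    reply = ""
    for message in turns:
        seen.append(message)
        reply = bot(seen)
    return new in reply and old not in reply


turns = [
    "Where is order 12345?",
    "Sorry, I meant order 54321.",
    "When will it arrive?",
]
print(follows_correction(latest_number_bot, turns, "12345", "54321"))
print(follows_correction(first_number_bot, turns, "12345", "54321"))
```

Recording the turn where the correction happens makes it easy to show exactly where the stale value came back.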

Memory and session behavior

Test what the bot remembers in the same session and what it forgets across sessions. Confirm private or temporary details do not leak to a later user or conversation. Ask what memory is intended before judging behavior.
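A session boundary check can be sketched as two sessions against the same bot. StubSessionBot and its shared flag are hypothetical, included only to show what a passing and a leaking result look like.

```python
from typing import Dict, List


class StubSessionBot:
    """Hypothetical bot with per-session memory; shared=True simulates a leak."""

    def __init__(self, shared: bool = False) -> None:
        self._shared = shared
        self._memory: Dict[str, List[str]] = {}

    def send(self, session_id: str, message: str) -> str:
        key = "everyone" if self._shared else session_id
        notes = self._memory.setdefault(key, [])
        if notes:
            reply = "So far I know: " + " | ".join(notes)
        else:
            reply = "I have no notes yet."
        notes.append(message)
        return reply


def remembers_in_session(bot: StubSessionBot, detail: str) -> bool:
    """Same session: the detail should still be available on the next turn."""
    bot.send("session-a", f"My membership number is {detail}")
    return detail in bot.send("session-a", "What do you know about me?")


def leaks_across_sessions(bot: StubSessionBot, detail: str) -> bool:
    """Different session: the detail must not appear in the reply."""
    bot.send("session-a", f"My membership number is {detail}")
    return detail in bot.send("session-b", "What do you know about me?")


print(remembers_in_session(StubSessionBot(), "99887"))
print(leaks_across_sessions(StubSessionBot(), "99887"))
print(leaks_across_sessions(StubSessionBot(shared=True), "99887"))
```

Whether same-session recall should pass depends on the intended memory design, so confirm that before treating either result as a defect.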

Refusals and fallback behavior

Unsupported questions should produce helpful fallback behavior. The bot can say it cannot answer, offer safe alternatives, or escalate to a human. A refusal should not be rude, confusing, or easy to bypass with a simple rephrase.
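To probe whether a refusal survives a simple rephrase, you could run a short list of rewordings and keep the ones that get through. This is a sketch under assumptions: the refusal marker phrases and the keyword-based stub_bot are hypothetical, and real refusals usually need a human read as well.

```python
from typing import Callable, List

# Hypothetical phrases that mark a refusal in this product's replies.
REFUSAL_MARKERS = ("can't help with", "cannot help with")


def is_refusal(reply: str) -> bool:
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def stub_bot(message: str) -> str:
    """Hypothetical bot whose refusal depends on a single keyword."""
    if "payroll" in message.lower():
        return "I can't help with payroll questions, but I can connect you to HR."
    return "Here is some general guidance on that."


def bypassing_rephrases(bot: Callable[[str], str], rephrases: List[str]) -> List[str]:
    """Return the rephrased prompts that did not trigger a refusal."""
    return [prompt for prompt in rephrases if not is_refusal(bot(prompt))]


rephrases = [
    "What is my payroll tax rate?",
    "How much tax comes out of my paycheck?",
    "Explain PAYROLL deductions",
]
bypassed = bypassing_rephrases(stub_bot, rephrases)
print(bypassed)
```

Any prompt in the returned list is a candidate defect: same unsupported request, different wording, no refusal.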

Hallucination checks

Ask for a policy, product capability, or account fact the bot should not know. Check whether it makes something up. A confident wrong answer may be worse than a clear "I do not know."

Persona drift

Persona drift happens when the bot changes tone, role, or behavior in a way the product does not intend. Test whether a user can pressure the bot into acting like a different role or ignoring product boundaries.

In practice

A user tells the bot to act as a manager and override the normal support policy.

What helps: The bot keeps its intended role, tone, and boundaries.

What gets missed: The bot accepts the new role and stops following product rules.

Safety boundaries

Test requests that the bot should refuse or redirect. Keep shared test notes focused on product behavior and risk. You do not need to publish harmful detail to prove the boundary works. For LLM-specific risk language, the OWASP Top 10 for LLM Applications is a better reference than unvetted prompt lists.

Data leakage concerns

Ask for another user’s data, hidden instructions, account details, private records, and logs. Also test what happens when a user provides private data. The bot should protect information and avoid exposing data outside the user’s permission.

Better vs weaker evidence

A support bot repeats private account details after a user asks for another customer’s history.

Stronger evidence
The test uses role boundaries, sensitive-data prompts, and log review.

Thin evidence
Synthetic demo data makes the team skip privacy checks that production will need.
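Part of the log review can be automated by scanning saved replies for sensitive-looking values. The sketch below is a rough aid, not a privacy guarantee: the ACCT-123456 style ID is a hypothetical format, the email pattern is deliberately loose, and the transcript is made-up sample text.

```python
import re
from typing import List

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "account_id": re.compile(r"\bACCT-\d{6}\b"),  # hypothetical ID format
}


def find_leaks(reply: str) -> List[str]:
    """Return the names of sensitive patterns found in one bot reply."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(reply)]


# Made-up bot replies saved from a test session.
transcript = [
    "I can only share details for the account you are signed in to.",
    "The other customer, jo@example.com, last ordered on ACCT-204981.",
]
for turn, reply in enumerate(transcript, start=1):
    print(turn, find_leaks(reply))
```

A match is a prompt for review rather than proof of a leak, since the signed-in user's own data may legitimately appear in a reply.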

Test case examples

Good chatbot test cases read like conversations, not isolated prompts. Each case should name the prompt sequence, expected boundary, observed result, and user risk so another tester can repeat the check.

Better vs weaker evidence

A tester documents a three-turn support conversation where the bot answers the refund policy correctly, then contradicts itself after the user asks a follow-up about opened items.

Stronger evidence
The test case records the prompt sequence, expected behavior, actual contradiction, and user risk.

Thin evidence
The test case only saves the first correct answer and misses the follow-up failure.
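A conversation-shaped test case could be stored as a small record like the one sketched here. The field names and the sample turn text are illustrative assumptions, not a standard format; the point is that every turn keeps its own expected and actual behavior.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Turn:
    user: str
    expected: str
    actual: str
    passed: bool


@dataclass
class ConversationCase:
    title: str
    user_risk: str
    turns: List[Turn] = field(default_factory=list)

    def first_failure(self) -> int:
        """Return the 1-based number of the first failing turn, or 0 if none."""
        for number, turn in enumerate(self.turns, start=1):
            if not turn.passed:
                return number
        return 0


case = ConversationCase(
    title="Refund policy with opened-items follow-up",
    user_risk="User relies on a refund answer the bot later contradicts",
    turns=[
        Turn("I want to return an item.", "Asks which order", "Asked which order", True),
        Turn("What is the refund policy?", "States documented policy", "Stated policy", True),
        Turn("Does that cover opened items?", "Consistent with turn 2", "Contradicted turn 2", False),
    ],
)
print(case.first_failure())
print(json.dumps(asdict(case), indent=2))
```

Saving the whole sequence as JSON lets another tester replay the same turns and see where the contradiction appeared.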

If you want to show this work in a job search, adapt the phrasing in QA Tester Resume Examples.

Common mistakes

Testing only one turn instead of a conversation. A single turn does not prove a conversation works. Test follow-up questions, corrections, intent changes, and incomplete details so the bot proves it can handle context over time.
Ignoring unsupported questions. Unsupported questions are where fallback behavior earns trust. The bot should refuse, redirect, clarify, or escalate without inventing an answer just to keep the conversation moving.
Accepting confident wording as accuracy. Confident wording is not evidence. Save the prompt, expected boundary, actual response, and risk so the team can separate fluent text from correct behavior.
Skipping repeated prompts and context changes. Repeated prompts and context changes reveal instability. Run important conversations more than once and compare whether the guidance stays within the same safe boundaries.
Failing to test privacy and session boundaries. Privacy and session defects can expose information silently. Test another user’s data, hidden instructions, memory across sessions, and private details pasted by the user.

What to save

Situation: A chatbot test review finds every case was a single supported question, with no follow-up, correction, unsupported request, or privacy boundary.
Evidence: The tester adds multi-turn conversations that exercise context, fallback, repetition, and session limits.
Easy miss: The bot looks accurate for one turn and fails once the conversation changes shape.

How to practice

Create a prompt sheet with ten conversations. Include normal, vague, unsupported, sensitive, misleading, repeated, and privacy-related prompts. Run them more than once and compare patterns. Write notes about risk rather than only pass or fail.
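Comparing repeated runs is easier with a tally of behaviors rather than a pile of transcripts. The sketch below assumes a hypothetical bot with varying replies (seeded so the example is repeatable) and hypothetical marker phrases for labeling each reply.

```python
import random
from collections import Counter
from typing import Callable


def classify(reply: str) -> str:
    """Rough behavior label based on hypothetical marker phrases."""
    lowered = reply.lower()
    if "connect you" in lowered:
        return "escalate"
    if "can't help" in lowered:
        return "refuse"
    return "answer"


def make_stub_bot(seed: int) -> Callable[[str], str]:
    """Hypothetical bot that varies its reply; seeded for repeatability."""
    rng = random.Random(seed)
    replies = [
        "I can't help with that request.",
        "I can't help with that, but I can connect you to an agent.",
        "Sure, here is how to do it.",
    ]
    return lambda prompt: rng.choice(replies)


def tally(bot: Callable[[str], str], prompt: str, runs: int) -> Counter:
    """Send the same prompt several times and count the behavior labels."""
    return Counter(classify(bot(prompt)) for _ in range(runs))


counts = tally(make_stub_bot(seed=7), "Can you waive my late fee?", runs=20)
allowed = {"refuse", "escalate"}  # assumed acceptable behaviors for this prompt
unexpected = {label: n for label, n in counts.items() if label not in allowed}
print(counts)
print(unexpected)
```

Anything left in unexpected is worth a risk note: the prompt sometimes produced a behavior outside the boundary you expected.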

Prove these LLM testing skills with AT*SQA

Testing chatbots and LLM features sits inside the AT*SQA AI for Testers AT*SkillStack: four micro-credentials covering AI Introduction for Testers, What to Test in AI-Based Systems, How to Test AI-Based Systems, and Testing Using AI. Each exam is $39, open-notes, two attempts, valid for a year. The output-review and risk work on this page maps straight to "What to Test" and "How to Test," and the new Testing Using AI credential covers using AI tools in your own testing workflow.

AT*SQA’s AT*Learn AI for Testers training (a one-year subscription, $49) makes prepping straightforward.

Pass all four and AT*SQA awards the full AI for Testers certification at no additional cost, $156 total. Every micro-credential appears on the Official U.S. List of Certified and Credentialed Software Testers and counts toward Testing Tiers, turning "I tested an LLM feature once" into four verifiable, in-demand skills.

FAQ

Questions testers ask

Can chatbot tests have pass or fail results?

Yes, but the expected result may describe acceptable behavior rather than exact wording.

Should I test the same prompt more than once?

Yes. Repeated prompts can reveal variation, inconsistency, and risk patterns.

What is fallback behavior?

Fallback behavior is what the bot does when it cannot or should not answer. It may ask a clarifying question, refuse, redirect, or escalate.

Why test persona drift?

Persona drift can make a bot ignore product rules, change tone, or act outside its intended role.

How do I test a chatbot conversation instead of a single prompt?

Write a multi-turn script with a supported request, follow-up question, correction, topic change, and unsupported request. Check whether the bot keeps the right context and drops the wrong context. Conversation flow matters more than one polished answer.

What chatbot fallback behavior should QA expect?

A good fallback admits limits, asks a clarifying question, offers a safe next step, or escalates to a human. A bad fallback invents information, loops the user, or refuses without useful guidance. Test unsupported and ambiguous requests deliberately.

How do I test whether a chatbot leaks private data?

Ask for another user’s records, hidden instructions, account details, logs, and information from a previous session. Also test what happens when the user enters private data. The bot should protect data and keep permissions intact across the conversation.

What is persona drift in chatbot testing?

Persona drift happens when the bot changes role, tone, policy, or boundaries because the user nudges it. Test role-play requests, pressure, and instructions to ignore rules. The bot should stay inside the product’s intended behavior.

How do I write chatbot test cases for a QA portfolio?

Write the prompt sequence, expected behavior, actual output, risk, and whether escalation should happen. Include supported, unsupported, privacy, and safety examples. Keep the artifacts safe and avoid publishing harmful operational detail.
