GTWY Testcases: Measure, Evaluate, and Improve Your AI Prompts with Confidence
Generating responses from an AI model is easy.
Ensuring those responses are accurate, reliable, and consistent is where the real challenge begins.
When you update prompts, switch models, or modify agent configurations, even small changes can affect how your AI behaves. Without a structured testing system, it is hard to know whether a change actually improves results or introduces unexpected issues.
GTWY addresses this challenge with Testcases.
Testcases provide a built-in framework for testing, scoring, and improving AI prompts with precision. Instead of relying on trial and error, you can evaluate your AI’s performance using measurable results.
Testcases turn prompt engineering from guesswork into a data-driven process.

What Are Testcases in GTWY?
Testcases allow you to evaluate how well your system prompt, agent configuration, and model settings perform against a defined expected result.
For every test, you provide a user input and define what the ideal response should look like. GTWY then runs the agent with those settings and compares the generated output to the expected result.
This helps you determine whether your AI agent is behaving as intended.
Testcases are particularly useful when you are building:
AI assistants
customer support chatbots
automation agents
decision-making workflows
research or summarization systems
You can also create testcases directly from Bridge history, allowing you to reuse real conversations as evaluation examples. This makes it easier to test whether your agent remains consistent when handling real-world scenarios.
How Testcases Work
Every testcase in GTWY includes three key components.
User Message
This is the input prompt or question that you want your AI agent to respond to.
It simulates a real user request and triggers the model to generate a response during the test.
Expected Response or Tool Call
This represents the ideal output you expect from your AI agent.
It can be:
a specific answer
a structured response
a tool call or action
Note: Tools are not actually executed during testcase runs. The system only verifies whether the agent intended to call the correct tool.
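To make the note above concrete, here is a minimal sketch of checking an intended tool call without running it. The ToolCall shape and the tool_call_matches helper are hypothetical names for illustration, not GTWY's API. The article does not say whether arguments are compared along with the tool name; this sketch assumes both are.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def tool_call_matches(expected: ToolCall, actual: Optional[ToolCall]) -> bool:
    """Return True if the agent intended the expected call.

    Nothing is executed: only the tool name and arguments are compared.
    """
    if actual is None:
        # The agent answered in text instead of requesting a tool.
        return False
    return actual.name == expected.name and actual.arguments == expected.arguments


expected_call = ToolCall("get_weather", {"city": "Paris"})
print(tool_call_matches(expected_call, ToolCall("get_weather", {"city": "Paris"})))
print(tool_call_matches(expected_call, ToolCall("get_weather", {"city": "Rome"})))
```

Because the tool is never invoked, a check like this has no side effects and can be rerun freely.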
Bridge Version
Each testcase is executed against a selected Bridge Version.
This allows you to test different versions of your agent configuration and compare how changes in prompts, parameters, or tools affect the output.
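If it helps to picture the three components together, the following sketch models a testcase as a small record. The class and field names are assumptions made for illustration and do not reflect how GTWY stores testcases internally.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class PromptTestcase:
    """One testcase: input, ideal output, and the version to run against."""
    user_message: str              # the input sent to the agent
    expected: Union[str, dict]     # ideal response text, or a description of a tool call
    bridge_version: str            # which agent configuration version is tested

    def __post_init__(self) -> None:
        # An empty input cannot simulate a real user request.
        if not self.user_message.strip():
            raise ValueError("user_message must not be empty")


case = PromptTestcase(
    user_message="What is your refund policy?",
    expected="Refunds are available within 30 days of purchase.",
    bridge_version="v2",
)
print(case.bridge_version)
```

Keeping the record immutable (frozen=True) means the same testcase can be run against several versions without being altered between runs.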
Matching Methods
GTWY offers multiple evaluation methods so you can choose how strictly responses should be judged.
1. Exact Matching
Exact Matching checks whether the generated response perfectly matches the expected response.
Both the content and structure must be identical.
This method works best for:
rule-based outputs
code generation
data extraction
factual responses
structured API outputs
When precision matters, Exact Matching flags any deviation from the expected output, however small.
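As a rough illustration of how strict this is, the sketch below treats exact matching as plain string equality. That is an assumption about the comparison, and exact_match is a hypothetical helper rather than a GTWY function.

```python
def exact_match(expected: str, actual: str) -> bool:
    """True only if the two strings are character-for-character identical."""
    return expected == actual


# Identical output passes.
print(exact_match('{"status": "ok"}', '{"status": "ok"}'))

# The same data with different spacing fails, because structure counts too.
print(exact_match('{"status": "ok"}', '{"status":"ok"}'))
```

Under this assumption, even a trailing space or a different key order causes a mismatch, so Exact Matching suits outputs whose format is fully specified.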
2. AI Matching
AI Matching uses another language model to evaluate the generated response.
Instead of comparing responses word-for-word, this method evaluates whether the meaning and intent of the answer are correct.
The evaluating model then assigns a score based on how closely the meaning of the generated response matches the expected one.
This method works well for tasks such as:
summarization
paraphrasing
open-ended explanations
creative responses
AI Matching provides a more flexible and intelligent evaluation for complex prompts.
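The idea can be sketched with a pluggable judge. Everything here is an assumption for illustration: the prompt wording, the 0 to 1 scale, and the ai_match helper are not taken from GTWY. The judge is any function that sends a prompt to a language model and returns its text reply.

```python
from typing import Callable

# Hypothetical grading prompt; GTWY's actual prompt is not documented here.
JUDGE_PROMPT = (
    "You are grading an AI response.\n"
    "Expected answer:\n{expected}\n\n"
    "Actual answer:\n{actual}\n\n"
    "Reply with only a number from 0 to 1 indicating how well the actual "
    "answer matches the meaning and intent of the expected answer."
)


def ai_match(expected: str, actual: str, judge: Callable[[str], str]) -> float:
    """Ask a judge model to score agreement in meaning; result is in [0, 1]."""
    reply = judge(JUDGE_PROMPT.format(expected=expected, actual=actual))
    try:
        score = float(reply.strip())
    except ValueError:
        raise ValueError(f"judge returned a non-numeric reply: {reply!r}")
    # Clamp in case the judge drifts outside the requested range.
    return max(0.0, min(1.0, score))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return " 0.8\n"


print(ai_match("Paris is the capital of France.", "France's capital is Paris.", stub_judge))
```

Passing the judge in as a function keeps the scoring logic independent of any particular model provider and makes it easy to test with a stub.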
3. Similarity Matching
Similarity Matching evaluates how close the generated response is to the expected output using cosine similarity.
This technique compares the semantic similarity between two pieces of text and returns a numerical score between 0 and 1.
0 indicates no similarity
1 indicates nearly identical meaning
This approach is helpful when you want a quantitative measure of similarity without requiring an exact match.
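For intuition, here is a small sketch of cosine similarity between two texts. It uses simple word-count vectors as a stand-in. A semantic comparison like the one described above would normally use embedding vectors from a model instead, and the article does not specify which representation GTWY uses. With non-negative word counts the score always falls between 0 and 1.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of word-count vectors; 0.0 if either text has no words."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    # Dot product over words in a; a Counter returns 0 for words missing from b.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat", "the cat ran"))
print(cosine_similarity("hello world", "goodbye moon"))
```

The first pair shares two of three words and scores about 0.67. The second pair shares none and scores 0. Word counts cannot recognize synonyms, which is why embedding-based vectors are the usual choice for comparing meaning.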
Scoring and Evaluation
Instead of providing a simple pass or fail result, GTWY generates detailed scores for every testcase.
This allows you to understand how well the generated output aligns with the expected result.
With these insights, you can:
compare results across different agent versions
identify areas where your AI is improving
detect cases where prompt changes reduce accuracy
continuously refine your prompts and configurations
Over time, this creates a feedback loop for improving AI performance.
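One way to picture this feedback loop is a comparison of per-testcase scores between two agent versions. The sketch below is illustrative only: find_regressions, the score dictionaries, and the 0.05 tolerance are assumptions, not part of GTWY.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return {testcase_id: score_drop} for cases that fell by more than tolerance."""
    drops = {}
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            continue  # this case was not run on the candidate version
        drop = old_score - new_score
        if drop > tolerance:
            drops[case_id] = round(drop, 4)
    return drops


# Hypothetical scores for the same testcases on two versions.
v1_scores = {"refund_policy": 0.90, "order_status": 0.80, "greeting": 0.70}
v2_scores = {"refund_policy": 0.92, "order_status": 0.60, "greeting": 0.68}

print(find_regressions(v1_scores, v2_scores))
```

Here only order_status is reported, since its score fell by 0.2. The small tolerance keeps minor score fluctuations from being flagged as regressions.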
Why Testcases Matter
AI systems are powerful, but they can also behave unpredictably when prompts or configurations change.
Testcases help bring stability and reliability to AI development.
With Testcases in GTWY, you can:
validate your AI responses before deploying them
benchmark new prompt versions
detect regressions when updating agents
measure improvements with clear scoring metrics
Instead of relying on manual testing, you gain a structured way to verify AI performance at scale.
Final Thoughts
Testcases transform prompt engineering into a measurable discipline.
Rather than guessing whether a prompt works, GTWY allows you to test it, score it, and improve it systematically.
Whenever you modify a system prompt, adjust model parameters, or release a new bridge version, running testcases helps you confirm that your AI continues to perform as expected.
If you change your AI agent, don’t assume it works better.
Test it, and prove it with GTWY Testcases.