The Deploy.AI evaluation tool lets you compare AI models in the exact context of your own agents and real-world business use cases, rather than relying on a generic benchmarking approach.
Whether you're optimizing for cost, accuracy, latency, or other dimensions, the tool helps you determine which AI model performs best for your specific workflow.
The Model Evaluation Tool is a perfect fit for cases like the following:

- You're already using a specific model (for example, GPT_4_TURBO) and want to test whether another model performs better for your use case.

Here's how you can apply Model Evaluation to your existing agents.
Before starting the evaluation, make sure all AI models you want to test are added to the agent's configuration.
Go to your agent's Edit screen → select the models you want from the Models list → click Save → create a new version. This ensures the models will be available in the evaluation dropdown.
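For orientation, the sketch below shows what a version with several models enabled might look like conceptually. It is only an illustration: the field names and the MODEL_B/MODEL_C identifiers are placeholders rather than the actual Deploy.AI schema, and in practice you manage this through the Edit screen, not by editing config by hand.

```python
# Hypothetical sketch of an agent version with several models enabled.
# Field names and identifiers (other than GPT_4_TURBO) are placeholders,
# not the real Deploy.AI configuration schema.
agent_version = {
    "agent": "my-support-agent",
    "version": 2,              # the new version created after saving
    "models": [                # models that appear in the evaluation dropdown
        "GPT_4_TURBO",
        "MODEL_B",             # placeholder for a second model to compare
        "MODEL_C",
    ],
}
```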

In your Admin Panel, find the agent you want to evaluate. In the row view, you’ll see an “Eval” button next to each agent.

Clicking this will open the Model Evaluation screen.

From the model selector at the top, choose one or more models to compare.

<aside> 💡
Important: Make sure the models you want to test are already added to your agent’s config.
Go to your agent’s Edit screen → Models → Add the models → Save → Create a new version
</aside>
Deploy.AI supports versioned agent configs, allowing you to customize prompts or logic for each evaluation.
For example, you might give one model a more detailed system prompt while another uses a shorter, more direct one.
This is helpful when models require slightly different prompting styles or tool access.
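Purely as an illustration (the structure and field names below are assumptions, not Deploy.AI's actual config format), two versions of the same agent might differ only in their system prompt and tool access:

```python
# Hypothetical sketch: two versions of the same agent tuned for different models.
# Structure and field names are illustrative, not Deploy.AI's actual schema.
version_a = {
    "model": "GPT_4_TURBO",
    "system_prompt": "You are a support assistant. Answer concisely.",
    "tools": ["knowledge_base_search"],
}
version_b = {
    "model": "MODEL_B",  # placeholder for the model you are comparing against
    "system_prompt": (
        "You are a support assistant. Think step by step, "
        "then give a short final answer."
    ),
    "tools": ["knowledge_base_search", "ticket_lookup"],  # different tool access
}
```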

You’ll be prompted to choose the Evaluation Type.

If you choose LLM-as-a-Judge, you can pick from a list of models that are optimized for evaluation. These judge models review the outputs and score them across a set of quality dimensions.
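Conceptually, LLM-as-a-Judge means a separate model grades each output against a rubric. The sketch below shows that general pattern only; the prompt wording, the 1–5 scale, and the function names are assumptions for illustration and do not reflect how Deploy.AI's judge models work internally.

```python
# General LLM-as-a-Judge pattern (illustrative only; not Deploy.AI internals).
def build_judge_prompt(user_input: str, model_output: str) -> str:
    """Create a grading prompt for a judge model using a simple 1-5 rubric."""
    return (
        "You are evaluating an AI assistant's answer.\n"
        f"User input:\n{user_input}\n\n"
        f"Assistant answer:\n{model_output}\n\n"
        "Rate the answer from 1 (poor) to 5 (excellent) for accuracy and "
        "clarity. Reply with only the number."
    )

def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge model's reply."""
    digits = [c for c in judge_reply if c.isdigit()]
    return int(digits[0]) if digits else 0
```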
You’ll now enter the input the user would provide to your agent. This input must follow the structure of the agent’s form or chat-based UI.

Deploy.AI will automatically pull the relevant user and system prompts from the agent’s current configuration.
If you want to test multiple scenarios, you can add several input examples in this step to broaden the comparison.
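As a purely illustrative example (the field names below are assumptions and must match your own agent's form), a small set of test inputs for a form-based agent might look like this:

```python
# Illustrative test inputs for a hypothetical form-based agent.
# The keys must mirror your agent's actual form fields.
input_examples = [
    {"customer_message": "My invoice shows a duplicate charge.", "priority": "high"},
    {"customer_message": "How do I reset my password?", "priority": "low"},
    {"customer_message": "The export to CSV feature times out.", "priority": "medium"},
]
```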
