<aside> 🧭

Navigation:

- Set Up an Organization
- LLM Models
- Model Evaluation Tool
- Logs
- Contact us

</aside>

The Deploy.AI evaluation tool lets you compare AI models in the exact context of your own agents and real-world business use cases, rather than relying on a generic benchmarking approach.

Whether you're optimizing for cost, accuracy, latency, or other dimensions, the tool helps you determine which AI model performs best for your specific workflow.

The Model Evaluation Tool is a perfect fit for the following cases:

Here’s how you can apply Model Evaluation to your existing agents.

Step 0: Make Sure All Tested Models Are Added to the Agent

Before starting the evaluation, make sure all AI models you want to test are added to the agent's configuration.

Go to your agent’s Edit screen → select the models you want from the Models list → click Save → create a new version. This ensures the models will be available in the evaluation dropdown.

Frame 5.png
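
To make the relationship concrete, here is a minimal sketch (in Python, purely for illustration) of what an agent configuration with several models attached might look like. The field names and model identifiers are assumptions for this example, not Deploy.AI’s actual config format; in practice everything here is managed through the Edit screen.

```python
# Conceptual sketch only: the real configuration is managed in the Deploy.AI UI,
# and these field names ("models", "version") are illustrative assumptions.
agent_config = {
    "name": "support-triage-agent",   # hypothetical agent name
    "version": 7,                     # saving and creating a new version bumps this
    "models": [                       # only models listed here appear in the Eval dropdown
        "gpt-4o",
        "claude-3-5-sonnet",
        "gemini-1.5-pro",
    ],
}

# A comparison needs at least two candidate models attached to the agent.
assert len(agent_config["models"]) >= 2
```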

Step 1: Open Evaluation Tool

In your Admin Panel, find the agent you want to evaluate. In the row view, you’ll see an “Eval” button next to each agent.

Frame 1-2.png

Clicking this will open the Model Evaluation screen.

Frame 3-2.png

Step 2: Select Models to Compare

From the model selector at the top, choose one or more models to compare. You can:

Frame 4.png

<aside> 💡

Important: Make sure the models you want to test are already added to your agent’s config.

Go to your agent’s Edit screen → Models → Add the models → Save → Create a new version

</aside>

Deploy.AI supports versioned agent configs, allowing you to customize prompts or logic for each evaluation.

For example:

This is helpful when models require slightly different prompting styles or tool access.

Frame 4-2.png
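
As a rough sketch of what per-version customization can mean in practice, the snippet below shows two hypothetical versions that adjust the system prompt and tool access for different models under test. The syntax and field names are assumptions for illustration, not Deploy.AI’s actual config format.

```python
# Illustrative only: not Deploy.AI's actual config syntax, just a sketch of how two
# hypothetical agent versions could adapt the prompt and tools per model under test.
version_a = {
    "model": "gpt-4o",
    "system_prompt": "Answer concisely. Call the search tool before responding.",
    "tools": ["search", "calculator"],
}

version_b = {
    "model": "claude-3-5-sonnet",
    "system_prompt": "Think step by step, then give a concise final answer.",
    "tools": ["search"],  # narrower tool access for this model
}

for cfg in (version_a, version_b):
    print(cfg["model"], "->", cfg["system_prompt"])
```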

Step 3: Choose Evaluation Method

You’ll be prompted to choose the Evaluation Type:

Frame 6.png

Step 4: Choose Judging Model

If using LLM-as-a-Judge, you can pick from a list of models that are optimized for evaluation. These judge models will review the outputs and score them based on dimensions like:
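
To illustrate the general idea of LLM-as-a-Judge scoring, the sketch below shows how per-dimension scores from a judge model can be aggregated into a comparable result per candidate. The dimensions and the scoring scale shown are assumptions for this example only, not the tool’s actual rubric.

```python
# Conceptual illustration of LLM-as-a-Judge scoring. The dimensions and the 1-5
# scale below are assumptions for this example, not Deploy.AI's actual rubric.
judge_scores = {
    "gpt-4o":            {"accuracy": 4, "relevance": 5, "tone": 4},
    "claude-3-5-sonnet": {"accuracy": 5, "relevance": 4, "tone": 5},
}

# The judge reviews each model's output; averaging makes the candidates comparable.
for model, scores in judge_scores.items():
    avg = sum(scores.values()) / len(scores)
    print(f"{model}: average judge score {avg:.1f}")
```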

Step 5: Provide Test Input

You’ll now enter the input the user would provide to your agent. This input must follow the structure of the agent’s form or chat-based UI.

Frame 6-2.png

Deploy.AI will automatically pull the relevant user and system prompts from the agent’s current configuration.

If you want to test multiple scenarios, you can add several input examples in this step to broaden the comparison.

Frame 5-3.png
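
For a sense of how test inputs can mirror an agent’s UI, here is a hypothetical example covering both a form-style and a chat-style agent. The field names are placeholders; the actual structure must match your own agent’s form or chat interface.

```python
# Hypothetical test inputs. The real structure depends entirely on your agent's
# own form fields or chat UI; the keys below are placeholders for illustration.
test_inputs = [
    # Form-style agent: one dict per form submission
    {
        "customer_message": "My order arrived damaged. What are my options?",
        "order_id": "A-10293",
    },
    # Chat-style agent: a list of messages
    [{"role": "user", "content": "Summarize our refund policy in two sentences."}],
]

# Several inputs broaden the comparison across scenarios.
print(f"{len(test_inputs)} scenarios prepared for evaluation")
```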

Step 6: Run the Evaluation