The Deploy.AI evaluation tool lets you compare AI models in the exact context of your own agents and real-world business use cases, rather than relying on a generic benchmarking approach.
Whether you're optimizing for cost, accuracy, latency, or other dimensions, the tool helps you determine which AI model performs best for your specific workflow.
The Model Evaluation Tool is a perfect fit for cases like the following:

- You're already using a specific model (for example, GPT_4_TURBO) and want to test whether another model performs better for your use case.

Here's how you can apply Model Evaluation to your existing agents.
Before starting the evaluation, make sure all AI models you want to test are added to the agent's configuration.
Go to your agent's Edit screen → select the models you want from the Models list → click Save → create a new version. This ensures the models will be available in the evaluation dropdown.
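For orientation, the sketch below shows what a version with several models enabled might look like conceptually. It is only an illustration: the field names and the MODEL_B/MODEL_C identifiers are placeholders rather than the actual Deploy.AI schema, and in practice you manage this through the Edit screen, not by editing config by hand.

```python
# Hypothetical sketch of an agent version with several models enabled.
# Field names and identifiers (other than GPT_4_TURBO) are placeholders,
# not the real Deploy.AI configuration schema.
agent_version = {
    "agent": "my-support-agent",
    "version": 2,              # the new version created after saving
    "models": [                # models that appear in the evaluation dropdown
        "GPT_4_TURBO",
        "MODEL_B",             # placeholder for a second model to compare
        "MODEL_C",
    ],
}
```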

In your Admin Panel, find the agent you want to evaluate. In the row view, you’ll see an “Eval” button next to each agent.

Clicking this will open the Model Evaluation screen.

From the model selector at the top, choose one or more models to compare.

<aside> 💡
Important: Make sure the models you want to test are already added to your agent’s config.
Go to your agent’s Edit screen → Models → Add the models → Save → Create a new version
</aside>
Deploy.AI supports versioned agent configs, allowing you to customize prompts or logic for each evaluation.
For example, you might give one model a more detailed system prompt while another uses a shorter, more direct one.
This is helpful when models require slightly different prompting styles or tool access.
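Purely as an illustration (the structure and field names below are assumptions, not Deploy.AI's actual config format), two versions of the same agent might differ only in their system prompt and tool access:

```python
# Hypothetical sketch: two versions of the same agent tuned for different models.
# Structure and field names are illustrative, not Deploy.AI's actual schema.
version_a = {
    "model": "GPT_4_TURBO",
    "system_prompt": "You are a support assistant. Answer concisely.",
    "tools": ["knowledge_base_search"],
}
version_b = {
    "model": "MODEL_B",  # placeholder for the model you are comparing against
    "system_prompt": (
        "You are a support assistant. Think step by step, "
        "then give a short final answer."
    ),
    "tools": ["knowledge_base_search", "ticket_lookup"],  # different tool access
}
```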

You’ll be prompted to choose the Evaluation Type.

If you choose LLM-as-a-Judge, you can pick from a list of models that are optimized for evaluation. These judge models review the outputs and score them across a set of quality dimensions.
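Conceptually, LLM-as-a-Judge means a separate model grades each output against a rubric. The sketch below shows that general pattern only; the prompt wording, the 1–5 scale, and the function names are assumptions for illustration and do not reflect how Deploy.AI's judge models work internally.

```python
# General LLM-as-a-Judge pattern (illustrative only; not Deploy.AI internals).
def build_judge_prompt(user_input: str, model_output: str) -> str:
    """Create a grading prompt for a judge model using a simple 1-5 rubric."""
    return (
        "You are evaluating an AI assistant's answer.\n"
        f"User input:\n{user_input}\n\n"
        f"Assistant answer:\n{model_output}\n\n"
        "Rate the answer from 1 (poor) to 5 (excellent) for accuracy and "
        "clarity. Reply with only the number."
    )

def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge model's reply."""
    digits = [c for c in judge_reply if c.isdigit()]
    return int(digits[0]) if digits else 0
```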
You’ll now enter the input the user would provide to your agent. This input must follow the structure of the agent’s form or chat-based UI.

Deploy.AI will automatically pull the relevant user and system prompts from the agent’s current configuration.
If you want to test multiple scenarios, you can add several input examples in this step to broaden the comparison.
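As a purely illustrative example (the field names below are assumptions and must match your own agent's form), a small set of test inputs for a form-based agent might look like this:

```python
# Illustrative test inputs for a hypothetical form-based agent.
# The keys must mirror your agent's actual form fields.
input_examples = [
    {"customer_message": "My invoice shows a duplicate charge.", "priority": "high"},
    {"customer_message": "How do I reset my password?", "priority": "low"},
    {"customer_message": "The export to CSV feature times out.", "priority": "medium"},
]
```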
