Evaluate models offline
In this guide, we will walk through creating a testset and using it to run an offline evaluation.
Prerequisites
- You need to have access to the beta preview of evaluations.
- You also need to have a project created - if not, please first follow our project creation guides.
- Finally, you need at least a few datapoints in your project. Use the Editor to generate some datapoints if you don't have any yet.
You need datapoints in your project because we will use them as the source of testcases for the testset we will create. If you want to create arbitrary testcases from scratch, see our guide to doing this from the API. We will soon be updating the app to enable arbitrary testcase creation from your browser.
For this example, we're going to evaluate a model whose responsibility is to extract key information from a customer service request and return this information as JSON. In the image below, you can see the model config we've drafted on the left, and an example of it running against a customer query on the right.
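To make the task concrete, here is a minimal sketch of the kind of input and output involved. The customer query and every field other than feature are hypothetical; the feature key is the one our evaluator will rely on later.

import json

# Hypothetical customer query passed to the model config as an input variable.
customer_query = "Hi, the CSV export button crashes the app whenever I click it."

# The kind of JSON we expect the model to generate. Only the "feature" key is
# checked by the evaluator we write below; the other field is illustrative.
expected_output = json.dumps({
    "feature": "CSV export",
    "issue": "app crashes when the export button is clicked",
})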
Set up a testset
We will create a testset based on existing datapoints that have already been logged in the project.
- Navigate to the data tab
- Select the datapoints you would like to convert into testcases
- From the dropdown menu in the top right (see below), choose Add to testset

Creating testcases from a selection of existing project datapoints.
- In the dialog box, give the new testset a name and provide an optional description. Click Create testset.
You can add more testcases to the same testset later by clicking the 'add to existing testset' button at the top.
- Go to the evaluations tab and click Manage testsets
- Click into the newly created testset. One testcase will be present for each datapoint you selected earlier.

The newly created testset, containing testcases that were generated from existing datapoints in the project.
- Click on a testcase to inspect its parameters.
A testcase contains inputs (the variables passed into your model config template), an optional sequence of messages (if used for a chat model) and a target representing the desired output.
As we converted existing datapoints into testcases, the target defaults to the output of the source datapoint: the original datapoint's output is placed in an output field of the target JSON. In order to access the feature field more easily in our evaluator, we'll modify the testcase targets to be a raw JSON object with a feature key.

The original datapoint was an LLM generation which outputted a JSON value. The conversion process has placed this into the output field of the testcase target.
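For reference, a freshly converted testcase looks roughly like the sketch below. The inputs, messages and target fields match the description above; the values are hypothetical, and the target holds the source datapoint's raw generation under an output key.

# A sketch of a testcase straight after conversion (values are hypothetical).
testcase = {
    "inputs": {"query": "Hi, the CSV export button crashes the app whenever I click it."},
    "messages": [],  # only populated when the model config is a chat model
    "target": {
        # The source datapoint's generation, stored as a raw string.
        "output": '{"feature": "CSV export", "issue": "app crashes when the export button is clicked"}',
    },
}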
- Modify the testcase if you need to make refinements. You can provide an arbitrary JSON object as the target.

After editing, we have a clean JSON object recording the salient characteristics of the testcase's expected output.
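As an illustration, the edited target might look like the following. The value is hypothetical; what matters is the top-level feature key, which our evaluator will read directly.

# A sketch of a cleaned-up target (the value is hypothetical).
target = {
    "feature": "CSV export",
}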
Create an offline evaluator
Having set up a testset, we'll now create the evaluator. As with online evaluators, it's a Python function, but for offline mode it also takes a testcase parameter alongside the generated datapoint.
- Navigate to the evaluations tab
- Select + New Evaluator and choose Offline
- Choose Start from scratch
For this example, we'll use the code below to compare the LLM generated output with what we expected for that testcase.
import json
from json import JSONDecodeError

def it_extracts_correct_feature(datapoint, testcase):
    expected_feature = testcase["target"]["feature"]
    try:
        # The model is expected to produce valid JSON output,
        # but it could fail to do so.
        output = json.loads(datapoint["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature
    except JSONDecodeError:
        # If the model didn't even produce valid JSON, then
        # we evaluate the output as bad.
        return False
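If you want to sanity-check the evaluator logic outside the app, you can call it directly with plain dictionaries. The datapoint and testcase below are hypothetical stand-ins for the objects the platform passes in.

# Hypothetical stand-ins for the objects passed to the evaluator.
sample_testcase = {"target": {"feature": "CSV export"}}
good_datapoint = {"output": '{"feature": "CSV export"}'}
bad_datapoint = {"output": "Sorry, I could not parse that request."}

print(it_extracts_correct_feature(good_datapoint, sample_testcase))  # True
print(it_extracts_correct_feature(bad_datapoint, sample_testcase))   # False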
- In the debug console at the bottom of the dialog, click Load from testset and select the testset you created in the previous section. The console will be populated with its testcases.

The debug console. Use this to load up testcases from a testset and perform debug runs with any model config in your project.
- Choose a model config from the dropdown menu.
- Click the run button at the far right of one of the testcases.
A new debug run will be triggered, which causes an LLM generation using that testcase's inputs and messages parameters. The generated datapoint and the testcase will be passed into the evaluator, and the resulting evaluation is displayed in the Result column.
- Click Create when you are happy with the evaluator.
Trigger an offline evaluation
Now that you have an offline evaluator and a testset, you can use them to evaluate the performance of any model config in your project.
- Go to the Evaluations tab.
- To the right of the Evaluation Runs section, click Run Evaluation
- In the dialog box, choose a model config to evaluate, and select your newly created testset and evaluator.

Triggering an evaluation of the model config 'customer-service-extractor' with a testset and evaluator function.
- Click Launch
- A new evaluation is launched. Click on the card to inspect the results.
It may take some time for the evaluation to complete, depending on how many testcases are in your testset and what model config you are using.
- Inspect the results of the evaluation.

The model config tested performed well - all testcases produced valid JSON with a feature key containing the correct extracted data.
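Because this evaluator returns a boolean per testcase, the overall result is easy to reason about: it is essentially the fraction of testcases that pass. A rough sketch of that aggregation, using hypothetical per-testcase results, is shown below.

# Hypothetical per-testcase results returned by the evaluator.
results = [True, True, True, True]

pass_rate = sum(results) / len(results)
print(f"Pass rate: {pass_rate:.0%}")  # Pass rate: 100%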