Evaluate models offline

In this guide, we will walk through creating a dataset and using it to run an offline evaluation.

Prerequisites

  1. You need to have access to evaluations.
  2. You also need to have a project created - if not, please first follow our project creation guides.
  3. Finally, you need at least a few logs in your project. Use the Editor to generate some logs if you don't have any yet.

📘

You need logs in your project because we will use these as the source of test datapoints for the dataset we will create. If you want to create arbitrary test datapoints from scratch, see our guide to doing this from the API. We will soon be updating the app to enable arbitrary test datapoint creation from your browser.

For this example, we're going to evaluate a model whose responsibility is to extract key information from a customer service request and return it as JSON. In the image below, you can see the model config we've drafted on the left, and an example of it running against a customer query on the right.
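
To make the task concrete, here is a rough sketch of the kind of input and output we have in mind. The query and the field names are illustrative rather than taken from a real project, though we will key our evaluator off the feature field later in this guide.

# A hypothetical customer query passed into the prompt template as an input variable.
query = "Hi, every time I try to export my project to CSV the app crashes."

# The kind of JSON string we want the model to return for that query.
expected_output = '{"feature": "export-to-csv", "issue": "app crashes on export", "severity": "high"}'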

Set up a dataset

We will create a dataset based on existing logs that are already in the project.

  1. Navigate to the Logs tab.
  2. Select the logs you would like to convert into test datapoints.
  3. From the dropdown menu in the top right (see below), choose Add to Dataset.

Creating test datapoints from a selection of existing project datapoints.

  4. In the dialog box, give the new dataset a name and provide an optional description. Click Create dataset.

📘

You can add more datapoints to the same dataset later by clicking the 'add to existing dataset' button at the top.

  5. Go to the Datasets tab.
  6. Click into the newly created dataset. One datapoint will be present for each log you selected in step 2.

The newly created dataset, containing datapoints that were converted from existing logs in the project.

  7. Click on a datapoint to inspect its parameters.

📘

A test datapoint contains inputs (the variables passed into your model config template), an optional sequence of messages (if used for a chat model) and a target representing the desired output.

As we converted existing logs into datapoints, the target defaults to the output of the source log.

In our example, the default behaviour means the original log's output ends up nested under an output field in the target JSON.

To access the feature field more easily in our evaluator, we'll modify the datapoint targets to be a raw JSON object with a feature key (see the sketch after the screenshots below).

The original log was an LLM generation which outputted a JSON value. The conversion process has placed this into the `output` field of the testcase target.

  8. Modify the datapoint if you need to make refinements. You can provide an arbitrary JSON object as the target.

After editing, we have a clean JSON object recording the salient characteristics of the datapoint's expected output.
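
To make the target edit concrete, here is a minimal sketch of the before and after, reusing the hypothetical query from earlier. Only the feature key matters to the evaluator we write in the next section; the other values are illustrative.

# Target as created by the conversion: the source log's output string, nested under "output".
default_target = {
    "output": '{"feature": "export-to-csv", "issue": "app crashes on export", "severity": "high"}'
}

# Target after editing: a raw JSON object exposing the feature key our evaluator will read.
edited_target = {"feature": "export-to-csv"}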

Create an offline evaluator

Having set up a dataset, we'll now create the evaluator. As with online evaluators, it's a Python function, but in offline mode it also takes a testcase parameter alongside the generated log.

  1. Navigate to the Evaluations section, and then the Evaluators tab.
  2. Select + New Evaluator and choose Offline Evaluation.
  3. Choose Start from scratch.

For this example, we'll use the code below to compare the LLM-generated output with what we expect for that testcase.

import json
from json import JSONDecodeError

def it_extracts_correct_feature(log, testcase):
    expected_feature = testcase["target"]["feature"]

    try:
        # The model is expected to produce valid JSON output
        # but it could fail to do so.
        output = json.loads(log["output"])
        actual_feature = output.get("feature", None)
        return expected_feature == actual_feature

    except JSONDecodeError:
        # If the model didn't even produce valid JSON, then
        # we evaluate the output as bad.
        return False

  4. In the debug console at the bottom of the dialog, click Load data and then Datapoints from dataset. Select the dataset you created in the previous section. The console will be populated with its datapoints.

The debug console. Use this to load up test datapoints from a dataset and perform debug runs with any model config in your project.

  5. Choose a model config from the dropdown menu.
  6. Click the run button at the far right of one of the test datapoints.

A new debug run will be triggered, which causes an LLM generation using that datapoint's inputs and messages parameters. The generated log and the test datapoint are passed into the evaluator, and the resulting evaluation is displayed in the Result column (a standalone sketch of this call appears after these steps).

  7. Click Create when you are happy with the evaluator.
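
To illustrate what a debug run hands to the evaluator, here is a small standalone sketch. The log and testcase dictionaries are hand-written stand-ins for the objects the platform supplies, trimmed to just the fields it_extracts_correct_feature reads; it assumes the evaluator and imports defined above.

# Hand-written stand-ins for the objects a debug run would pass to the evaluator.
log = {"output": '{"feature": "export-to-csv", "issue": "app crashes on export"}'}
testcase = {"target": {"feature": "export-to-csv"}}

print(it_extracts_correct_feature(log, testcase))  # True: the extracted feature matches the target

# A generation that is not valid JSON is scored as a failure rather than raising an error.
bad_log = {"output": "Sorry, I couldn't parse that request."}
print(it_extracts_correct_feature(bad_log, testcase))  # False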

Trigger an offline evaluation

Now that you have an offline evaluator and a dataset, you can use them to evaluate the performance of any model config in your project.

  1. Go to the Evaluations section.
  2. In the Runs tab, click Run Evaluation.
  3. In the dialog box, choose a model config to evaluate, and select your newly created dataset and evaluator.

  4. Click Batch Generate.
  5. A new evaluation is launched. Click on the card to inspect the results.

A batch generation has now been triggered. This means that the model config you selected will be used to generate a log for each datapoint in the dataset. It may take some time for the evaluation to complete, depending on how many test datapoints are in your dataset and what model config you are using. Once all the logs have been generated, the evaluator will execute for each in turn, as sketched after the final step below.

  6. Inspect the results of the evaluation.
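
Conceptually, the batch run behaves something like the loop below. This is only a sketch of the order of operations described above, not how the platform is actually implemented; the generate callable is a stand-in for the LLM generation made with your chosen model config.

def run_offline_evaluation(generate, dataset, evaluator):
    # Conceptual sketch only: `generate` stands in for the generation step
    # the platform performs with the model config you selected.
    results = []
    for datapoint in dataset:
        # 1. Generate a log from the datapoint's inputs (and messages, for chat models).
        log = generate(datapoint["inputs"], datapoint.get("messages", []))
        # 2. Pass the generated log and the datapoint into the evaluator.
        results.append(evaluator(log, datapoint))
    return results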


What's Next

Try altering the prompt from this example to measurably improve its performance.