Evaluate your model

Humanloop's evaluation framework allows you to test and track the performance of models in a rigorous way.

A key part of successful prompt engineering and deployment of LLMs is a robust evaluation framework. In this section, we provide guides on setting up Humanloop's evaluation framework in your projects.

The core entity in the Humanloop evaluation framework is an evaluator: a function you define that takes an LLM-generated datapoint as an argument and returns an evaluation. The evaluation is typically a boolean or a number, indicating how well the model performed according to criteria you determine based on your use case.
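As a minimal sketch, an evaluator of this shape might look like the following. The datapoint schema here (a dict with an `output` field) is an assumption for illustration, not Humanloop's actual API:

```python
# Hypothetical evaluator sketch: the datapoint is assumed to be a dict
# with an "output" field holding the LLM generation.
def concise_answer(datapoint: dict) -> bool:
    """Return True if the generation is non-empty and under 280 characters."""
    output = datapoint.get("output", "")
    return bool(output) and len(output) <= 280
```

An evaluator returning a number instead of a boolean would follow the same pattern, for example scoring the output between 0 and 1.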

There are two flavours of evaluator - online and offline.


Online evaluators

Online evaluators are for use on datapoints generated or logged in your project, including live in production. Typically, they are used to monitor deployed model performance over time.

Online evaluators can be set to run automatically whenever new datapoints are logged to the project. The evaluator takes the generated datapoint as an argument.
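For example, an online evaluator might check that each logged generation is well-formed. This sketch assumes the datapoint is a dict with an `output` field; the actual schema Humanloop passes to your evaluator may differ:

```python
import json

# Sketch of an online evaluator that scores whether a logged generation
# is valid JSON (1.0) or not (0.0). Datapoint shape is assumed.
def json_validity(datapoint: dict) -> float:
    try:
        json.loads(datapoint["output"])
        return 1.0
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
```

Run automatically over production traffic, a score like this can be tracked over time to catch regressions in output formatting.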


Offline evaluators

Offline evaluators are for use with predefined testsets, in order to evaluate models as you iterate during prompt engineering, or to test for regressions in a CI workflow.

A testset is a collection of testcases, which are roughly analogous to unit tests in traditional programming. Each testcase specifies inputs to your model and (optionally) some target data.
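Conceptually, a testset might look like the following. The field names (`inputs`, `target`) are assumptions for this sketch rather than Humanloop's exact schema:

```python
# Illustrative testset: each testcase supplies model inputs and an
# optional target. Field names are assumed for the sketch.
testset = [
    {"inputs": {"question": "What is 2 + 2?"}, "target": "4"},
    {"inputs": {"question": "What is the capital of France?"}, "target": "Paris"},
]
```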

When you run an offline evaluation, Humanloop iterates through each testcase in the testset and triggers a fresh LLM generation using the testcase's inputs and the model config being evaluated. For each testcase, your evaluator function is called with two arguments: the freshly generated datapoint and the testcase that gave rise to it. Typically, your evaluator performs some domain-specific logic to determine whether the generated datapoint meets your desired criteria (as specified in the testcase's target).
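The loop described above can be sketched end to end. Everything here is illustrative: the field names (`output`, `inputs`, `target`), the helper names, and the stand-in `fake_generate` model are assumptions, not Humanloop's implementation:

```python
# Hypothetical offline evaluator: compares the fresh generation
# against the testcase target. Field names are assumed.
def target_match(datapoint: dict, testcase: dict) -> bool:
    """True if the generated output exactly matches the testcase target."""
    return datapoint.get("output") == testcase.get("target")

def run_offline_evaluation(testset, generate, evaluator):
    """For each testcase: trigger a fresh generation, then evaluate it."""
    results = []
    for testcase in testset:
        datapoint = {"output": generate(testcase["inputs"])}
        results.append(evaluator(datapoint, testcase))
    return results

# A stand-in "model" so the sketch is runnable without an LLM.
def fake_generate(inputs):
    return "4" if inputs["question"] == "What is 2 + 2?" else "?"

testset = [{"inputs": {"question": "What is 2 + 2?"}, "target": "4"}]
results = run_offline_evaluation(testset, fake_generate, target_match)
```

In a CI workflow you would fail the build if any result is falsy, turning the testset into a regression suite for your prompts.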