Self-hosted evaluations

We've added support for managing evaluations outside of Humanloop in your own code.

There are certain use cases where you may wish to run your evaluation process outside of Humanloop, with the evaluator itself defined in your own code rather than in the Humanloop runtime.

For example, you may have implemented an evaluator that uses your own custom model, or one that has to interact with multiple systems. In these cases, it can be difficult to express the evaluator as a simple code or LLM evaluator within your Humanloop project.

With this kind of setup, our users have found it very beneficial to keep leveraging the datasets they have curated on Humanloop, and to consolidate all of the results alongside the prompts stored on Humanloop.

To better support this workflow, we're releasing additional API endpoints and SDK utilities that allow you to:

  • Retrieve your curated datasets
  • Trigger evaluation runs
  • Send the evaluation results generated by your custom evaluators back to Humanloop

Below is a code snippet showing how you can use the latest version of the Python SDK to log an evaluation run to a Humanloop project. For a full explanation, see our guide on self-hosted evaluations.

from humanloop import Humanloop

API_KEY = ...
humanloop = Humanloop(api_key=API_KEY)

# 1. Retrieve a dataset
DATASET_ID = ...
datapoints = humanloop.datasets.list_datapoints(DATASET_ID).records

# 2. Create an external evaluator
evaluator = humanloop.evaluators.create(
    name="My External Evaluator",
    description="An evaluator that runs outside of Humanloop runtime.",
    type="external",
    arguments_type="target_required",
    return_type="boolean",
)
# Or, retrieve an existing one:
# evaluator = humanloop.evaluators.get(EVALUATOR_ID)

# 3. Retrieve a model config
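#    This is the prompt/model configuration whose outputs you want to evaluate.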
CONFIG_ID = ...
model_config = humanloop.model_configs.get(CONFIG_ID)

# 4. Create the evaluation run
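#    The run ties together the project, model config, dataset and external evaluator.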
PROJECT_ID = ...
evaluation_run = humanloop.evaluations.create(
    project_id=PROJECT_ID,
    config_id=CONFIG_ID,
    evaluator_ids=[evaluator.id],
    dataset_id=DATASET_ID,
)

# 5. Iterate the datapoints and trigger generations
logs = []
for datapoint in datapoints:
    log = humanloop.chat_model_config(
        project_id=PROJECT_ID,
        model_config_id=model_config.id,
        inputs=datapoint.inputs,
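        # Convert each datapoint message to a plain dict, dropping unset (None) fields.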
        messages=[
            {key: value for key, value in dict(message).items() if value is not None}
            for message in datapoint.messages
        ],
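        # Link the generated log back to the datapoint it was produced from.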
        source_datapoint_id=datapoint.id,
    ).data[0]
    logs.append((log, datapoint))

# 6. Evaluate the results.
#    In this example, we use an extremely simple evaluation, checking for an exact
#    match between the target and the model's actual output.
for (log, datapoint) in logs:
    # The datapoint target tells us the correct answer.
    target = str(datapoint.target["answer"])

    # The log output is what the model said.
    model_output = log.output

    # The evaluation is a boolean, indicating whether the model was correct.
    result = target == model_output

    # Post the result back to Humanloop.
    evaluation_result_log = humanloop.evaluations.log_result(
        log_id=log.id,
        evaluator_id=evaluator.id,
        evaluation_run_external_id=evaluation_run.id,
        result=result,
    )

# 7. Complete the evaluation run.
humanloop.evaluations.update_status(id=evaluation_run.id, status="completed")
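
The exact-match comparison in step 6 is deliberately simple. Because the evaluator runs in your own code, that step can be anything: a call to your own custom model, a lookup against another system, or a heuristic check. As a rough sketch, step 6 could be replaced with something like the following, where my_custom_scorer is a hypothetical stand-in for your own scoring logic; it returns a boolean to match the evaluator's return_type above, and the results are posted back with the same log_result call.

def my_custom_scorer(output: str, target: str) -> bool:
    # Hypothetical example: call your own model or service here instead.
    # It must return a boolean, matching return_type="boolean" above.
    return output.strip().lower() == target.strip().lower()

for (log, datapoint) in logs:
    result = my_custom_scorer(log.output, str(datapoint.target["answer"]))
    humanloop.evaluations.log_result(
        log_id=log.id,
        evaluator_id=evaluator.id,
        evaluation_run_external_id=evaluation_run.id,
        result=result,
    )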