Create a Dataset

Datasets can be created from existing logs, uploaded from a CSV file, or created via the API.

You can currently create datasets in Humanloop in three ways: from existing logs, by uploading a CSV or via the API.

Convert from existing logs

To create a dataset from existing logs:

  1. Go to the Logs tab in your project
  2. Select a subset of the logs in that project and choose Add to dataset from the menu in the top right of the page.
Select some logs and then click **Add to Dataset**

  3. In the dialog box, provide a Name and Description for the new dataset. Alternatively, you can click add to existing dataset to append the selected logs to a dataset you already have.

Upload from CSV

To create a dataset from a CSV file, we'll first create a CSV in Google Sheets and then upload it to a dataset in Humanloop.

  1. Create a CSV file.
    • In our Google Sheets example below, we have a column called user_query, which supplies the value for a prompt variable of the same name. Our model config therefore needs to include {{ user_query }} somewhere; at generation time, that placeholder is populated with the user_query input from the datapoint.
    • You can include as many columns of prompt variables as you need for your model configs.
    • There is also a column called target, which populates the target of the datapoint. In this case, we use simple strings to define the target.
    • Note: messages are harder to incorporate into a CSV file as they tend to be verbose and hard-to-read JSON. If you want a dataset with messages, consider using the API to upload, or convert from existing logs.
A CSV file in Google Sheets defining a collection of 9 datapoints.

  2. Export the Google Sheet to CSV by choosing File > Download > Comma-separated values (.csv)
  3. In your Humanloop project, go to the Datasets tab and choose New dataset
  4. In the dialog window, provide a name and optional description for the dataset. Then upload the CSV file from step 2 by drag-and-drop or using the file explorer.
Uploading a CSV file to create a dataset.

  5. Click Upload Dataset from CSV and you should see a new dataset appear in the Datasets tab. You can explore it by clicking on it.
  6. Follow the link in the pop-up to inspect the dataset that was created in the upload. You'll see a column with the input key-value pairs for each datapoint, a messages column (in our case we didn't use messages, so they're all empty) and a target column with the expected model output.
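
If you would rather generate the CSV programmatically than export it from Google Sheets, the sketch below shows one way to do it with Python's standard csv module. The rows, the write_dataset_csv helper and the render helper are hypothetical and only for illustration; in particular, render is a plain string substitution that mimics how a {{ user_query }} placeholder is filled from a datapoint's inputs, not Humanloop's actual templating.

```python
import csv
import io

# Hypothetical example rows: `user_query` is a prompt-variable column,
# `target` holds the expected model output for that datapoint.
rows = [
    {"user_query": "How do I reset my password?", "target": "account"},
    {"user_query": "Can I export my logs to CSV?", "target": "logs"},
]


def write_dataset_csv(rows, fileobj):
    """Write the rows as CSV with a header row taken from the first row's keys."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def render(template, inputs):
    """Illustrative only: replace each {{ name }} placeholder with its input value."""
    for key, value in inputs.items():
        template = template.replace("{{ " + key + " }}", value)
    return template


# Write to an in-memory buffer; pass an open file object instead to save to disk.
buffer = io.StringIO()
write_dataset_csv(rows, buffer)
csv_text = buffer.getvalue()

# Every column except `target` is treated as a prompt input.
inputs = {k: v for k, v in rows[0].items() if k != "target"}
prompt = render("Classify this request: {{ user_query }}", inputs)
```

The resulting file has one header row (user_query,target) followed by one line per datapoint, which matches the column layout described in step 1.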

Upload via API

  1. First define some sample data as a basis for your test datapoints, consisting of user messages and target extraction pairs. This is where you could load up any existing data you wish to use for your evaluation:
# Example test case data
data = [
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi Humanloop support team, I'm having trouble understanding how to use the evaluations feature in your software. Can you provide a step-by-step guide or any resources to help me get started?",
            }
        ],
        "target": {"feature": "evaluations", "issue": "needs step-by-step guide"},
    },
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi there, I'm interested in fine-tuning a language model using your software. Can you explain the process and provide any best practices or guidelines?",
            }
        ],
        "target": {
            "feature": "fine-tuning",
            "issue": "process explanation and best practices",
        },
    },
]
  2. Then define a dataset and upload the datapoints:
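# Create a dataset
# (assumes `humanloop` is an initialized Humanloop SDK client and
# `project_id` is the ID of an existing project)
dataset = humanloop.datasets.create(
    project_id=project_id,
    name="Sample dataset",
    description="Examples of feature requests extracted from user messages",
)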
dataset_id = dataset.body["id"]

# Create datapoints for the dataset
datapoints = humanloop.datasets.create_datapoint(
    dataset_id=dataset_id,
    body=data,
)

On the datasets tab in your Humanloop project you will now see the dataset you just uploaded via the API.
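
Before uploading, it can help to check that each datapoint has the shape used in the sample data above. The following is a minimal sketch of such a check; validate_datapoint is a hypothetical helper, not part of the Humanloop SDK, and the rules it enforces (each message has a string role and content, and a target is present) are assumptions drawn from the example rather than the API's actual validation.

```python
def validate_datapoint(datapoint):
    """Return a list of problems found in one datapoint dict (empty if none)."""
    problems = []
    messages = datapoint.get("messages", [])
    if not isinstance(messages, list):
        problems.append("'messages' must be a list")
        messages = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be a dict")
            continue
        # Assumed rule: every message carries a string role and string content.
        if not isinstance(message.get("role"), str):
            problems.append(f"messages[{i}] is missing a string 'role'")
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{i}] is missing a string 'content'")
    # Assumed rule: every datapoint in this guide carries a target.
    if "target" not in datapoint:
        problems.append("missing 'target'")
    return problems


# A datapoint in the same shape as the sample data in step 1.
sample = {
    "messages": [{"role": "user", "content": "How do I use evaluations?"}],
    "target": {"feature": "evaluations", "issue": "needs step-by-step guide"},
}

ok_problems = validate_datapoint(sample)
bad_problems = validate_datapoint({"messages": [{"role": "user"}]})
```

Running the check over every item in data before calling the upload step surfaces malformed entries locally instead of at request time.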


What’s Next

Trigger a batch generation across a dataset. See the next guide in this section.