Create a Dataset
Datasets can be created from existing logs, by uploading a CSV file, or via the API.
You can currently create datasets in Humanloop in three ways: from existing logs, by uploading a CSV or via the API.
Convert from existing logs
Prerequisites:
- A project in Humanloop
- Some logged generations available in that project
To create a dataset from existing logs:
- Go to the Logs tab in your project
- Select a subset of the logs in that project and choose Add to dataset from the menu in the top right of the page.
- In the dialog box, provide a Name and Description for the new dataset. Alternatively, you can click add to existing dataset to append the selected logs to a dataset you already have.
Upload from CSV
Prerequisites:
- A project in Humanloop
To create a dataset from a CSV file, we'll first create a CSV in Google Sheets and then upload it to a dataset in Humanloop.
- Create a CSV file.
  - In our Google Sheets example below, we have a column called user_query, which is an input to a prompt variable of that name. So in our model config, we'll need to include {{ user_query }} somewhere, and that placeholder will be populated with the value from the user_query input in the datapoint at generation time.
  - You can include as many columns of prompt variables as you need for your model configs.
  - There is additionally a column called target, which will populate the target of the datapoint. In this case, we use simple strings to define the target.
  - Note: messages are harder to incorporate into a CSV file, as they tend to be verbose, hard-to-read JSON. If you want a dataset with messages, consider using the API to upload, or convert from existing logs.
- Export the Google Sheet to CSV by choosing File → Download → Comma-separated values (.csv)
- In your Humanloop project, go to the Datasets tab and choose New dataset
- In the dialog window, provide a name and optional description for the dataset. Then upload the CSV file from step 2 by drag-and-drop or using the file explorer.
- Click Upload Dataset from CSV and you should see a new dataset appear in the datasets tab. You can explore it by clicking in.
- Follow the link in the pop-up to inspect the dataset that was created in the upload. You'll see a column with the input key-value pairs for each datapoint, a messages column (in our case we didn't use messages, so they're all empty) and a target column with the expected model output.
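If you would rather generate the CSV programmatically than export it from Google Sheets, the sketch below uses Python's standard csv module. The column names user_query and target follow the example above; the row contents, the prompt text, and the helper names build_csv and fill_template are illustrative assumptions. In particular, fill_template only illustrates the idea of placeholder substitution and is not Humanloop's templating implementation.

```python
import csv
import io

# Illustrative rows: one prompt-variable column plus a target column.
ROWS = [
    {"user_query": "How do I create a dataset?", "target": "datasets"},
    {"user_query": "Can I fine-tune a model?", "target": "fine-tuning"},
]


def build_csv(rows):
    """Serialize rows to CSV text with a header row (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["user_query", "target"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fill_template(template, inputs):
    """Replace each {{ name }} placeholder with its input value (illustration only)."""
    for name, value in inputs.items():
        template = template.replace("{{ " + name + " }}", value)
    return template


csv_text = build_csv(ROWS)

# To produce a file for upload, write the text to disk, for example:
# with open("dataset.csv", "w", newline="") as f:
#     f.write(csv_text)

# How a datapoint's user_query input would fill the placeholder in a prompt.
prompt = fill_template(
    "Classify this request: {{ user_query }}",
    {"user_query": ROWS[0]["user_query"]},
)
```

Each row becomes one datapoint: the user_query column supplies the input, and the target column supplies the expected output.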
Upload via API
- First define some sample data as a basis for your test datapoints, consisting of user messages and target extraction pairs. This is where you could load up any existing data you wish to use for your evaluation:
# Example test case data
data = [
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi Humanloop support team, I'm having trouble understanding how to use the evaluations feature in your software. Can you provide a step-by-step guide or any resources to help me get started?",
            }
        ],
        "target": {"feature": "evaluations", "issue": "needs step-by-step guide"},
    },
    {
        "messages": [
            {
                "role": "user",
                "content": "Hi there, I'm interested in fine-tuning a language model using your software. Can you explain the process and provide any best practices or guidelines?",
            }
        ],
        "target": {
            "feature": "fine-tuning",
            "issue": "process explanation and best practices",
        },
    },
]
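If your existing examples live in a file rather than in code, they can be loaded into the same structure. The sketch below assumes a JSON Lines layout in which each line has a message string and a target object. That layout and the load_datapoints helper are hypothetical, chosen only for illustration.

```python
import json


def load_datapoints(lines):
    """Convert JSON Lines records into datapoints shaped like the example data.

    Assumes each line is a JSON object with a "message" string and a
    "target" object (a hypothetical file layout).
    """
    datapoints = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        datapoints.append(
            {
                "messages": [{"role": "user", "content": record["message"]}],
                "target": record["target"],
            }
        )
    return datapoints


# In practice `lines` could be an open file handle, e.g. open("examples.jsonl").
sample_lines = [
    '{"message": "How do I run an evaluation?", "target": {"feature": "evaluations", "issue": "needs guide"}}',
    "",
]
loaded = load_datapoints(sample_lines)
```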
- Then create a dataset and upload the datapoints:
# Assumes `humanloop` is an initialized Humanloop SDK client and
# `project_id` holds the ID of your Humanloop project.

# Create a dataset
dataset = humanloop.datasets.create(
    project_id=project_id,
    name="Sample dataset",
    description="Examples of feature requests extracted from user messages",
)
dataset_id = dataset.body["id"]

# Create datapoints for the dataset
datapoints = humanloop.datasets.create_datapoint(
    dataset_id=dataset_id,
    body=data,
)
On the datasets tab in your Humanloop project you will now see the dataset you just uploaded via the API.
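Before uploading, it can help to check that every datapoint has the shape used in this guide. The sketch below is one possible check. The validate_datapoints helper is hypothetical, and its rules reflect only this example's structure (user messages with a role and content, plus a target), not requirements enforced by the Humanloop API.

```python
def validate_datapoints(datapoints):
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    for index, datapoint in enumerate(datapoints):
        messages = datapoint.get("messages", [])
        if not messages:
            problems.append(f"datapoint {index}: no messages")
        for message in messages:
            # Every message in this guide's example carries both keys.
            if not {"role", "content"} <= set(message):
                problems.append(f"datapoint {index}: message missing role or content")
        if "target" not in datapoint:
            problems.append(f"datapoint {index}: no target")
    return problems


good = {
    "messages": [{"role": "user", "content": "Hello"}],
    "target": {"feature": "evaluations"},
}
bad = {"messages": [{"role": "user"}]}
problems = validate_datapoints([good, bad])
```

Running a check like this on your data before the upload call surfaces malformed entries early, rather than after the dataset has been created.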
What’s Next
Trigger a batch generation across a dataset. See the next guide in this section.