Experiments can be used to compare different prompt templates, different hyperparameter combinations (such as temperature and presence penalties), and even different base models.
- You already have a project created - if not, please pause and first follow our project creation guide.
- You have integrated the `humanloop.chat_deployed()` and `humanloop.feedback()` calls via the API or Python SDK.
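The integration above boils down to a generate-then-feedback loop: call the deployed model, keep the ID of the logged datapoint, and attach feedback to that ID later. The sketch below mimics that call pattern with a stub client; the method and parameter names (`project`, `messages`, `data_id`, `type`, `value`) are assumptions, so check the SDK reference for your version.

```python
# Illustrative shape of the generate -> feedback loop. RecordingClient is a
# stand-in for the real Humanloop SDK client; names here are assumptions.

class RecordingClient:
    """Stub that mimics the call pattern of chat_deployed() and feedback()."""

    def __init__(self):
        self.calls = []

    def chat_deployed(self, project, messages):
        self.calls.append(("chat_deployed", project))
        # The real call returns the model's reply plus an ID for the logged
        # datapoint, which you reference later when sending feedback.
        return {"data": [{"id": "data_123", "output": "Hi there!"}]}

    def feedback(self, data_id, type, value):
        self.calls.append(("feedback", data_id, value))


client = RecordingClient()
response = client.chat_deployed(
    project="my-project",                          # hypothetical project name
    messages=[{"role": "user", "content": "Hi"}],
)
data_id = response["data"][0]["id"]
client.feedback(data_id=data_id, type="rating", value="good")
```

The key point is that feedback is tied back to a specific generation via its datapoint ID, which is what lets the experiment attribute each label to the model config that produced it.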
Using other model providers
This guide assumes you're using an OpenAI model. If you want to use other providers or your own model, please also see the guide for running an experiment with your own model provider.
- Navigate to the Experiments tab of your project.
- Click the Create new experiment button:
- Give your experiment a descriptive name.
- Select a list of feedback labels to be considered as positive actions - this will be used to calculate the performance of each of your model configs during the experiment.
- Select which of your project’s model configs you wish to compare.
- Then click the Create button.
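The positive feedback labels you select in step 4 are what drive the scoring: each model config's performance reflects how often its outputs receive a label from the positive set. A minimal sketch of that idea, assuming performance is the share of positive feedback (the exact formula Humanloop uses may differ):

```python
# Illustrative scoring: performance = fraction of a config's feedback events
# whose label is in the positive set chosen when creating the experiment.

def config_performance(feedback_labels, positive_labels):
    """Fraction of feedback events carrying a label marked positive."""
    if not feedback_labels:
        return 0.0
    positive = sum(1 for label in feedback_labels if label in positive_labels)
    return positive / len(feedback_labels)


labels = ["good", "bad", "good", "good"]   # hypothetical feedback stream
print(config_performance(labels, {"good"}))  # 0.75
```

This is why choosing the right positive labels matters: a label left out of the positive set contributes nothing to a config's score, even if users send it often.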
Now that you have an experiment, you need to set it as the project’s active experiment:
- Navigate to the Models tab.
- For the default environment, click the dropdown menu and select Change deployment from the menu options.
- Under Experiments, select your newly created experiment and confirm that you want to change the active deployment when prompted:
Now that your experiment is active, any SDK or API calls to generate will sample a model config from the list you provided when creating the experiment, and any subsequent feedback captured via feedback calls will contribute to the experiment's performance results.
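Conceptually, each generate call draws one of the experiment's model configs and serves it for that request. The sketch below uses uniform random sampling for illustration; the config names are hypothetical, and Humanloop's actual allocation policy may be smarter than uniform (e.g. weighting configs by accumulated feedback).

```python
import random

# Illustrative: each generate call draws one model config from the
# experiment's list, so traffic is spread across all candidates.

model_configs = ["config_a", "config_b", "config_c"]  # hypothetical names


def sample_config(configs, rng):
    """Pick the config to serve for one generate call (uniform here)."""
    return rng.choice(configs)


rng = random.Random(0)  # seeded for reproducibility in this sketch
draws = [sample_config(model_configs, rng) for _ in range(300)]

# Over many calls, every config receives a share of the traffic.
print({c: draws.count(c) for c in model_configs})
```

Because every config keeps receiving live traffic, feedback accumulates for all of them in parallel rather than one at a time.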
Now that an experiment is live, the data flowing through your generate and feedback calls will update the experiment progress in real time:
- Navigate back to the Experiments tab.
- Select the Experiment card.
Here you will see the performance of each model config with a measure of confidence based on how much feedback data has been collected so far:
🎉 Your experiment can now give you insight into which of the model configs your users prefer.
How quickly you can draw conclusions depends on how much traffic you have flowing through your project.
Generally, you should be able to draw some initial conclusions after on the order of hundreds of examples have been collected.
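The "hundreds of examples" guidance follows from how confidence in a proportion tightens with sample size. As a rough illustration (not Humanloop's actual confidence calculation), the half-width of a 95% normal-approximation interval for a config's positive-feedback rate shrinks with the square root of the feedback count:

```python
import math

# Illustrative only: 95% normal-approximation confidence interval for a
# config's positive-feedback rate, showing why conclusions firm up as
# feedback volume grows.

def interval_halfwidth(positive_rate, n):
    """Half-width of a 95% confidence interval for a proportion."""
    return 1.96 * math.sqrt(positive_rate * (1 - positive_rate) / n)


for n in (10, 100, 1000):
    print(n, round(interval_halfwidth(0.6, n), 3))
```

With only tens of feedback events the interval around a 60% positive rate is roughly ±0.3, far too wide to separate similar configs; by a few hundred events it narrows to under ±0.1, which is when meaningful differences start to show.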