- A dataset with test inputs and optionally expected outputs.
- A target function that defines what you’re evaluating. For example, this may be a single LLM call that includes the new prompt you are testing, a part of your application, or your end-to-end application.
- Evaluators that score your target function’s outputs.
This guide uses prebuilt LLM-as-judge evaluators from the open-source openevals package. OpenEvals includes a set of commonly used evaluators and is a great starting point if you’re new to evaluations. If you want greater flexibility in how you evaluate your apps, you can also define completely custom evaluators using your own code.

LangSmith SDK
1. Install dependencies
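The exact packages depend on your language and setup; as a minimal sketch, assuming the Python SDK with OpenAI as the model provider (or yarn for TypeScript), the install step looks something like:

```bash
# Python
pip install openevals langsmith openai

# TypeScript with yarn: add @langchain/core explicitly as a peer dependency of openevals
yarn add openevals langsmith openai @langchain/core
```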
If you are using yarn as your package manager, you will also need to manually install @langchain/core as a peer dependency of openevals. This is not required for LangSmith evals in general; you may define evaluators using arbitrary custom code.

2. Create a LangSmith API key
To create an API key, head to the Settings page. Then click + API Key.

3. Set up environment variables
This guide uses OpenAI, but you can adapt it to use any LLM provider. If you’re using Anthropic, use the Anthropic wrapper to trace your calls. For other providers, use the traceable wrapper.
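As a minimal sketch in Python, assuming the LANGSMITH_TRACING, LANGSMITH_API_KEY, and OPENAI_API_KEY variable names (adjust if your setup differs):

```python
import getpass
import os

# Enable LangSmith tracing and provide API keys.
# Variable names follow common LangSmith conventions; adjust for your environment.
os.environ["LANGSMITH_TRACING"] = "true"
if not os.environ.get("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass("LangSmith API key: ")
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
```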
4. Create a dataset

Next, define example input and reference output pairs that you’ll use to evaluate your app:
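One way to do this with the LangSmith Python SDK is sketched below; the dataset name is illustrative and exact SDK signatures may vary between versions:

```python
from langsmith import Client

client = Client()

# Create a dataset to hold the question/answer pairs (name is an example)
dataset = client.create_dataset(
    dataset_name="Sample dataset",
    description="A sample dataset of questions and reference answers.",
)

# Example inputs and their reference (expected) outputs
client.create_examples(
    inputs=[
        {"question": "Which country is Mount Kilimanjaro located in?"},
        {"question": "What is Earth's lowest point?"},
    ],
    outputs=[
        {"answer": "Mount Kilimanjaro is located in Tanzania."},
        {"answer": "Earth's lowest point is The Dead Sea."},
    ],
    dataset_id=dataset.id,
)
```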
5. Define what you’re evaluating

Now, define a target function that contains what you’re evaluating. In this guide, we’ll define a target function that contains a single LLM call to answer a question.
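For example, a sketch of such a target function, assuming OpenAI as the provider and a wrapped client so the call is traced in LangSmith (the model and prompt wording are illustrative):

```python
from langsmith import wrappers
from openai import OpenAI

# Wrapping the OpenAI client traces each call in LangSmith
openai_client = wrappers.wrap_openai(OpenAI())

def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[
            {"role": "system", "content": "Answer the following question accurately."},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content}
```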
6. Define evaluator

Import a prebuilt prompt from openevals and create an evaluator. outputs are the result of your target function. reference_outputs / referenceOutputs come from the example pairs you defined in step 4 above.

CORRECTNESS_PROMPT is just an f-string with variables for "inputs", "outputs", and "reference_outputs". See here for more information on customizing OpenEvals prompts.
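A sketch of the evaluator, assuming the prebuilt CORRECTNESS_PROMPT and the create_llm_as_judge helper from openevals (the judge model shown is only an example):

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    # Build an LLM-as-judge evaluator from the prebuilt correctness prompt
    evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:gpt-4o-mini",  # example judge model
        feedback_key="correctness",
    )
    # Score the target's output against the reference output
    return evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs,
    )
```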
7. Run and view results

Finally, run the experiment!
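Reusing the client, target function, and evaluator from the sketches above, the run could look like this (the dataset name and experiment prefix are illustrative; depending on your SDK version, evaluate may also be imported directly from langsmith):

```python
# Run the target function over every example in the dataset and score the results
experiment_results = client.evaluate(
    target,
    data="Sample dataset",               # dataset created in step 4
    evaluators=[correctness_evaluator],  # evaluator defined in step 6
    experiment_prefix="first-eval-in-langsmith",
    max_concurrency=2,
)
```

The run prints a link to the resulting experiment, which you can open in the LangSmith UI to inspect individual runs and scores.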
Next steps
To learn more about running experiments in LangSmith, read the evaluation conceptual guide.
- For more details on evaluations, refer to the Evaluation documentation.
- Check out the OpenEvals README to see all available prebuilt evaluators and how to customize them.
- Learn how to define custom evaluators that contain arbitrary code.
- For comprehensive descriptions of every class and function, see the Python or TypeScript SDK references.
LangSmith UI
1. Navigate to the playground
LangSmith’s prompt playground makes it possible to run evaluations over different prompts, new models, or different model configurations. Go to LangSmith’s Playground in the UI.

2. Create a prompt
Modify the system prompt to:
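For example, a simple system prompt for this question-answering task (illustrative only) could be:

```
Answer the following question accurately.
```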
3. Create a dataset

Click Set up Evaluation, then use the + New button in the dropdown to create a new dataset. Add the following examples to the dataset:

| Inputs | Reference Outputs |
| --- | --- |
| question: Which country is Mount Kilimanjaro located in? | output: Mount Kilimanjaro is located in Tanzania. |
| question: What is Earth’s lowest point? | output: Earth’s lowest point is The Dead Sea. |
4. Add an evaluator
Click + Evaluator. Select Correctness from the pre-built evaluator options. Press Save.

5. Run your evaluation
Press Start on the top right to run your evaluation. Running this evaluation will create an experiment that you can view in full by clicking the experiment name.
Next steps
To learn more about running experiments in LangSmith, read the evaluation conceptual guide.
- For more details on evaluations, refer to the Evaluation documentation.
- Learn how to create and manage datasets in the UI.
- Learn how to run an evaluation from the prompt playground.