Evaluating language models for accuracy and bias
Some tools for evaluating large language models include:
- OpenAI Evals: OpenAI's open-source framework for evaluating LLMs, which includes a registry of benchmark evals
- Evidently: an open-source observability tool for evaluating, testing, and monitoring ML models, including LLMs
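The tools above differ in scope, but the core loop they automate is the same: run each prompt through a model, compare the output against a reference answer, and aggregate a score. A minimal sketch of that loop, using a stand-in stub instead of a real model API so it runs without credentials (the function names and dataset here are illustrative, not any tool's actual API):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact-match metric."""
    return prediction.strip().lower() == reference.strip().lower()


def run_eval(model, dataset):
    """Score a model over (prompt, reference) pairs; return accuracy in [0, 1]."""
    correct = sum(exact_match(model(prompt), ref) for prompt, ref in dataset)
    return correct / len(dataset)


if __name__ == "__main__":
    # Stub "model": a lookup table standing in for a real LLM call.
    stub_model = lambda prompt: {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "")
    data = [("2+2?", "4"), ("Capital of France?", "paris")]
    print(run_eval(stub_model, data))  # 1.0
```

Real harnesses swap the stub for an actual API call and offer richer metrics than exact match (model-graded scoring, fuzzy match, bias probes), but the evaluate-compare-aggregate structure is the same.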