MT-bench
MT-bench: a set of multi-turn, open-ended questions for evaluating LLMs.
To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models’ responses.
B) Paper's claim
- strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
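To make the "over 80% agreement" figure concrete, here is a minimal sketch of one way an agreement rate between an LLM judge and human annotators could be computed. It assumes each verdict is stored per comparison key; the function name, the key layout, and the sample data are hypothetical, and the paper's exact agreement definition may differ.

```python
def agreement_rate(judge_votes, human_votes):
    """Fraction of shared comparisons on which the two verdicts match.

    Both arguments map a comparison key, e.g. (question_id, model_a, model_b),
    to a verdict string such as "A", "B", or "tie". (Hypothetical layout.)
    """
    shared = judge_votes.keys() & human_votes.keys()
    if not shared:
        raise ValueError("no comparisons in common")
    matches = sum(judge_votes[key] == human_votes[key] for key in shared)
    return matches / len(shared)


# Made-up illustration data, not results from the paper.
judge = {("q1", "m1", "m2"): "A", ("q2", "m1", "m2"): "B",
         ("q3", "m1", "m2"): "A", ("q4", "m1", "m2"): "tie"}
human = {("q1", "m1", "m2"): "A", ("q2", "m1", "m2"): "B",
         ("q3", "m1", "m2"): "B", ("q4", "m1", "m2"): "tie"}

rate = agreement_rate(judge, human)  # 3 of 4 verdicts match
```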
C) Methods
- Generate the target model's responses to the MT-bench questions.
- Generate GPT-4's judgments.
- Several options exist: (1) pairwise win rate, (2) single-answer grading (default)
- Compute the MT-bench score.
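The last step above can be sketched for both judging options. This is an illustration under stated assumptions, not the official scoring script: the record layouts and function names are hypothetical, single-answer grades are assumed to be numeric scores averaged over all questions and turns, and counting a tie as half a win is an assumption.

```python
from collections import defaultdict
from statistics import mean


def single_answer_scores(judgments):
    """Average the judge's numeric grades per model.

    judgments: iterable of (model, question_id, turn, score) tuples
    (hypothetical layout).
    """
    per_model = defaultdict(list)
    for model, _question_id, _turn, score in judgments:
        per_model[model].append(score)
    return {model: mean(scores) for model, scores in per_model.items()}


def pairwise_win_rates(verdicts):
    """Compute each model's win rate from pairwise verdicts.

    verdicts: iterable of (model_a, model_b, winner), where winner is
    model_a, model_b, or "tie". Assumption: a tie counts as half a win.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in verdicts:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "tie":
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    return {model: wins[model] / games[model] for model in games}


# Made-up illustration data.
scores = single_answer_scores([
    ("m1", 1, 1, 8), ("m1", 1, 2, 6), ("m2", 1, 1, 9),
])
rates = pairwise_win_rates([
    ("m1", "m2", "m1"), ("m1", "m2", "tie"),
])
```

Single-answer grading yields one absolute score per model, whereas the pairwise win rate is only meaningful relative to the set of models that were compared.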