Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?

1Tohoku University,
2Research and Development Center for Large Language Models, National Institute of Informatics,
3National Institute of Informatics

TL;DR

We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations.

🔍Our Research Questions

  • 🔍RQ1: Can we determine whether a checklist is necessary for LLM evaluators?

    ✅ We observe that the effectiveness of selective checklist application varies by task, suggesting that checklist use is not universally beneficial.

  • 🔍RQ2: How can we create useful checklists?

    ✅ We find that the most useful checklist creation method varies across evaluation models and tasks, suggesting that no single approach works best in all settings.

  • 🔍RQ3: Which checklist items contribute to alignment with human evaluation?

    ✅ Our analysis shows that many checklist items, although they do not improve correlation with human evaluation, still overlap substantially with human-written items.


💡Our Method

💡 Checklist Generation

We design three checklist generation methods (checklist length control, more specified generation, and self-refinement) in addition to a baseline and an existing method.
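As a rough illustration only, the sketch below shows how these generation strategies could be wired around an LLM call. The `call_llm` helper, the prompt wording, and the `max_items` budget are assumptions made for this sketch, not the prompts used in the paper.

```python
# Minimal sketch of the three checklist-generation strategies described above.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real API call."""
    raise NotImplementedError

def parse_items(text: str) -> list[str]:
    """Split an LLM response into individual checklist items."""
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

def generate_checklist(question: str, max_items: int | None = None) -> list[str]:
    """Baseline generation, optionally with length control (a fixed item budget)."""
    length_rule = f" Produce at most {max_items} items." if max_items else ""
    prompt = (
        "Write a checklist of yes/no criteria for judging a response to the "
        f"following question.{length_rule}\n\nQuestion: {question}"
    )
    return parse_items(call_llm(prompt))

def generate_checklist_specified(question: str) -> list[str]:
    """'More specified' generation: ask for concrete, question-specific criteria."""
    prompt = (
        "Write a checklist of concrete, question-specific yes/no criteria; "
        f"avoid generic items such as 'the response is clear'.\n\nQuestion: {question}"
    )
    return parse_items(call_llm(prompt))

def generate_checklist_self_refine(question: str, rounds: int = 1) -> list[str]:
    """Self-refine: generate a checklist, then ask the model to critique and revise it."""
    checklist = generate_checklist(question)
    for _ in range(rounds):
        prompt = (
            "Critique the checklist below for redundant or missing criteria, then "
            f"output a revised checklist.\n\nQuestion: {question}\n\nChecklist:\n"
            + "\n".join(f"- {item}" for item in checklist)
        )
        checklist = parse_items(call_llm(prompt))
    return checklist
```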

💡 Evaluating Responses

We evaluate a total of eight models, including gpt-4o-2024-08-06 and Qwen2.5-7B-Instruct, to investigate which checklist items contribute to alignment with human evaluation.


We choose two tasks (pairwise comparison and direct scoring) and use LLMBar and InFoBench as the corresponding datasets.
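For context, here is a minimal sketch of how a generated checklist might be inserted into the two evaluation formats. The prompt templates, the 1-5 scale for direct scoring, and the `call_llm` stub are illustrative assumptions rather than the exact setup used in the paper.

```python
# Sketch of checklist-conditioned evaluation for the two task formats.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real API call."""
    raise NotImplementedError

def pairwise_compare(question: str, response_a: str, response_b: str, checklist: list[str]) -> str:
    """Pairwise comparison: return 'A' or 'B' for the better response."""
    prompt = (
        f"Question: {question}\n\nChecklist:\n"
        + "\n".join(f"- {c}" for c in checklist)
        + f"\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Using the checklist, answer with the single letter of the better response (A or B)."
    )
    return call_llm(prompt).strip()[:1].upper()

def direct_score(question: str, response: str, checklist: list[str]) -> int:
    """Direct scoring: return an integer rating on an assumed 1-5 scale."""
    prompt = (
        f"Question: {question}\n\nChecklist:\n"
        + "\n".join(f"- {c}" for c in checklist)
        + f"\n\nResponse: {response}\n\n"
        "Using the checklist, rate the response from 1 to 5. Output only the number."
    )
    return int(call_llm(prompt).strip())
```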


📊 Analysis

To evaluate the effectiveness of checklists generated by LLMs, we conduct a comprehensive analysis from multiple perspectives: an ablation study, an analysis of questions in the dataset, and a comparison with human-written checklists.

📊 Ablation Study on Correlation with Human Evaluation

To identify which checklist items contribute to alignment with human evaluation, we perform an ablation analysis that classifies items as impactful (positive or negative) or non-impactful (neutral).
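A leave-one-out ablation of this kind could look roughly like the sketch below, assuming an `agreement_fn` that measures how well evaluations produced with a given checklist agree with human judgments; the zero-delta thresholding is a simplification.

```python
# Leave-one-out ablation sketch: drop each checklist item in turn and check whether
# agreement with human judgments rises, falls, or stays the same.

from typing import Callable

def classify_items(
    checklist: list[str],
    agreement_fn: Callable[[list[str]], float],  # agreement with humans when evaluating with a checklist
) -> dict[str, str]:
    baseline = agreement_fn(checklist)
    labels: dict[str, str] = {}
    for item in checklist:
        reduced = [c for c in checklist if c != item]
        delta = baseline - agreement_fn(reduced)  # > 0: removing the item hurts agreement
        if delta > 0:
            labels[item] = "positive"   # contributes to human alignment
        elif delta < 0:
            labels[item] = "negative"   # harms human alignment
        else:
            labels[item] = "neutral"    # no measurable effect
    return labels
```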


✅ In LLMBar, 53.5% of the items in positive checklists are classified as positive checklist items (1,079 out of 2,018), while 42.0% of the items in negative checklists are classified as negative checklist items (3,847 out of 9,157).
✅ In InFoBench, all items in positive checklists are classified as positive checklist items (56 out of 56), while 40.7% of the items in negative checklists are classified as negative (599 out of 1,472).

Ablation Study Figure

📊 Analysis of Questions in the Dataset

We investigate how checklist effectiveness depends on question type (closed vs. open).

✅ For this analysis, we sample 10% of the questions from each of the eight LLMBar subsets (85 in total) and classify all 50 InFoBench questions by question type.
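As one possible way to obtain such labels, questions could be classified as closed or open with a simple LLM prompt like the sketch below; the prompt wording and the `call_llm` stub are hypothetical.

```python
# Sketch: label a question as "closed" (one objectively checkable answer) or
# "open" (many acceptable answers). The definitions in the prompt are illustrative.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real API call."""
    raise NotImplementedError

def classify_question_type(question: str) -> str:
    prompt = (
        "Classify the question as 'closed' if it has one objectively checkable answer, "
        "or 'open' if many different answers could be acceptable. "
        f"Output only 'closed' or 'open'.\n\nQuestion: {question}"
    )
    label = call_llm(prompt).strip().lower()
    return label if label in {"closed", "open"} else "open"  # default to 'open' on unexpected output
```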

Analysis of Questions Figure

📊 Comparison with Human-Written Checklists

We compare LLM-generated checklists with those created by human annotators, analyzing overlaps and divergences between the two (one way to measure such overlap is sketched after the findings below).

We qualitatively analyze 293 items, including 89 positive and 102 negative items.
✅ For positive items, 60% explicitly reflect key question elements, aligning with essential response components.
✅ For negative items, 85% are consistent and deemed usable upon manual review.
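One way to quantify this kind of overlap is to match LLM-generated items to human-written items by embedding similarity, as in the sketch below; the sentence-transformers model and the 0.8 similarity threshold are illustrative choices, not the paper's protocol.

```python
# Sketch: estimate overlap between LLM-generated and human-written checklist items
# via nearest-neighbour cosine similarity over sentence embeddings.

from sentence_transformers import SentenceTransformer, util

def overlap_rate(llm_items: list[str], human_items: list[str], threshold: float = 0.8) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
    llm_emb = model.encode(llm_items, convert_to_tensor=True)
    human_emb = model.encode(human_items, convert_to_tensor=True)
    sims = util.cos_sim(llm_emb, human_emb)           # shape: (len(llm_items), len(human_items))
    matched = (sims.max(dim=1).values >= threshold).sum().item()
    return matched / len(llm_items)                   # fraction of LLM items with a close human match
```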

BibTeX

              
@inproceedings{furuhashi2025are,
  author    = {Furuhashi, Momoka and Nakayama, Kouta and Kodama, Takashi and Sugawara, Saku},
  title     = {Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?},
  booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
}