Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond
Jan 23, 2024
Abstract
The authors evaluate LLMs on fifteen logical reasoning datasets with fine-level metrics (answer correctness, explain correctness, explain completeness, and explain redundancy). Meanwhile, they propose a dataset with neutral content.
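
To make the four fine-level dimensions concrete, here is a minimal Python sketch of how per-response judgments could roll up into dataset-level rates. The schema and function names are my own illustration, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class FineLevelJudgment:
    """Per-response judgment along the four fine-level dimensions.

    Field names are hypothetical; the paper defines the dimensions,
    not this exact schema.
    """
    answer_correct: bool    # final answer matches the gold label
    explain_correct: bool   # every reasoning step is valid
    explain_complete: bool  # no necessary step is missing
    explain_redundant: bool # contains steps not needed for the answer

def aggregate(judgments: list[FineLevelJudgment]) -> dict[str, float]:
    """Roll per-response judgments up to dataset-level rates."""
    n = len(judgments)
    if n == 0:
        return {}
    return {
        "answer_correctness": sum(j.answer_correct for j in judgments) / n,
        "explain_correctness": sum(j.explain_correct for j in judgments) / n,
        "explain_completeness": sum(j.explain_complete for j in judgments) / n,
        "explain_redundancy": sum(j.explain_redundant for j in judgments) / n,
    }
```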

Some interesting conclusions:

Overall

  1. BARD shows consistent superiority across the deductive, inductive, and abductive settings, while text-davinci-003 also does relatively well. ChatGPT seems to struggle in those three settings but is better at mixed-form reasoning.
  2. LLMs do best in the deductive setting, while they mostly struggle in the inductive setting.
  3. Few-shot in-context learning (ICL) does not necessarily bring improvements on logical reasoning tasks: the tasks are difficult to learn from only a few samples, and the ICL examples may introduce noise (see the prompt-assembly sketch after this list).
 
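To illustrate the zero-shot vs. few-shot comparison behind conclusion 3, here is a minimal sketch of how the two prompt types might be assembled. The template wording and function name are my own assumptions, not taken from the paper.

```python
def build_prompt(question: str, demos: list[tuple[str, str]] | None = None) -> str:
    """Assemble a zero-shot or few-shot prompt for a reasoning question.

    `demos` is a list of (question, answer) in-context examples;
    passing None yields the zero-shot prompt.
    """
    parts = []
    for demo_q, demo_a in demos or []:
        parts.append(f"Question: {demo_q}\nAnswer: {demo_a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Zero-shot: the model sees only the target question.
zero_shot = build_prompt("If all A are B and all B are C, are all A C?")

# Few-shot ICL: prepend demonstrations; per the paper's finding, this
# does not necessarily improve logical reasoning accuracy.
few_shot = build_prompt(
    "If all A are B and all B are C, are all A C?",
    demos=[("If no X are Y, can some X be Y?", "No.")],
)
```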

Fine-level

  1. LLMs are best at keeping rigorous reasoning in the abductive setting, while they are weak in the deductive and inductive settings. Abductive reasoning requires the LLM to reason in reverse, which pushes it to spell out a sufficient reasoning process. In the deductive setting, by contrast, the reasoning chain is sequential, which may let LLMs fall into a lazy mode that harms rigorous reasoning.
  2. The authors consider LLMs that produce less redundant content to be more self-aware: results indicate that text-davinci-003 exhibits notable advantages, particularly in the inductive, abductive, and mixed-form reasoning settings, and it ranks second in the deductive setting. Conversely, BARD performs poorly in the deductive, abductive, and mixed-form reasoning settings.
  3. The most obvious obstacle for LLMs on logical reasoning tasks is whether they can find the correct evidence and perspective.