Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond
Jan 23, 2024
Abstract
The authors evaluate LLMs on fifteen logical reasoning datasets with fine-level metrics (answer correctness, explain correctness, explain completeness, and explain redundancy). Meanwhile, they propose a dataset with neutral content.
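
To make the four fine-level dimensions concrete, here is a minimal Python sketch of how per-response judgments could roll up into dataset-level rates. The schema and function names are my own illustration, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class FineLevelJudgment:
    """Per-response judgment along the four fine-level dimensions.

    Field names are hypothetical; the paper defines the dimensions,
    not this exact schema.
    """
    answer_correct: bool    # final answer matches the gold label
    explain_correct: bool   # every reasoning step is valid
    explain_complete: bool  # no necessary step is missing
    explain_redundant: bool # contains steps not needed for the answer

def aggregate(judgments: list[FineLevelJudgment]) -> dict[str, float]:
    """Roll per-response judgments up to dataset-level rates."""
    n = len(judgments)
    if n == 0:
        return {}
    return {
        "answer_correctness": sum(j.answer_correct for j in judgments) / n,
        "explain_correctness": sum(j.explain_correct for j in judgments) / n,
        "explain_completeness": sum(j.explain_complete for j in judgments) / n,
        "explain_redundancy": sum(j.explain_redundant for j in judgments) / n,
    }
```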

Some interesting conclusions:

Overall

  1. BARD shows consistent superiority across the deductive, inductive, and abductive settings, while text-davinci-003 also does relatively well. ChatGPT seems to struggle in those three settings but is better at mixed-form reasoning.
  2. LLMs do best in the deductive setting, while they mostly struggle in the inductive setting.
  3. Few-shot in-context learning (ICL) does not necessarily bring improvements on logical reasoning tasks: the tasks are difficult to learn from only a few samples, and the ICL examples may introduce noise (see the prompt-assembly sketch after this list).
 
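To illustrate the zero-shot vs. few-shot comparison behind conclusion 3, here is a minimal sketch of how the two prompt types might be assembled. The template wording and function name are my own assumptions, not taken from the paper.

```python
def build_prompt(question: str, demos: list[tuple[str, str]] | None = None) -> str:
    """Assemble a zero-shot or few-shot prompt for a reasoning question.

    `demos` is a list of (question, answer) in-context examples;
    passing None yields the zero-shot prompt.
    """
    parts = []
    for demo_q, demo_a in demos or []:
        parts.append(f"Question: {demo_q}\nAnswer: {demo_a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Zero-shot: the model sees only the target question.
zero_shot = build_prompt("If all A are B and all B are C, are all A C?")

# Few-shot ICL: prepend demonstrations; per the paper's finding, this
# does not necessarily improve logical reasoning accuracy.
few_shot = build_prompt(
    "If all A are B and all B are C, are all A C?",
    demos=[("If no X are Y, can some X be Y?", "No.")],
)
```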

Fine-level

  1. LLMs are best at keeping rigorous reasoning in the abductive setting, while they are weak in the deductive and inductive settings. Abductive reasoning requires the LLM to reason in reverse, which pushes it to spell out a sufficient reasoning process. In the deductive setting, by contrast, the reasoning chain is sequential, which may let LLMs fall into a lazy mode that harms rigorous reasoning.
  2. The authors consider LLMs that produce less redundant content to be more self-aware: results indicate that text-davinci-003 exhibits notable advantages, particularly in the inductive, abductive, and mixed-form reasoning settings, and it ranks second in the deductive setting. Conversely, BARD performs poorly in the deductive, abductive, and mixed-form reasoning settings.
  3. The most obvious obstacle for LLMs on logical reasoning tasks is whether they can find the correct evidence and perspective.