link
Publish Date
Number
reflection
abstract
This report analyses multiple logical reasoning datasets, including popular benchmarks such as LogiQA and ReClor and newly released datasets such as AR-LSAT. GPT-4 and ChatGPT perform reasonably well on the traditional benchmarks but poorly on out-of-distribution (OOD) data.
Status
Not started
Type
evaluation
Author