TruthfulQA
Adversarial question-answering dataset that tests whether models generate truthful answers to questions
- an interesting finding of the paper is that larger models tended to generate more false answers
- possibly because larger models are better at imitating human falsehoods, or because they make more human-like (but false) generalizations
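
To make the "larger models generate more false answers" comparison concrete, here is a minimal sketch of how a per-model truthfulness rate could be computed from judged answers. The function name `truthful_rate`, the model names, and the labels are all hypothetical and made up for illustration; they are not the paper's code or its results.

```python
from collections import defaultdict


def truthful_rate(judgments):
    """Return the fraction of answers judged truthful for each model.

    `judgments` is an iterable of (model_name, is_truthful) pairs, where
    is_truthful is a bool assigned by some judge (human or automated).
    """
    totals = defaultdict(int)
    truthful = defaultdict(int)
    for model, is_truthful in judgments:
        totals[model] += 1
        truthful[model] += int(is_truthful)
    return {model: truthful[model] / totals[model] for model in totals}


# Made-up labels for illustration only; these are NOT results from the paper.
toy_judgments = [
    ("small-model", True), ("small-model", True),
    ("small-model", False), ("small-model", True),
    ("large-model", True), ("large-model", False),
    ("large-model", False), ("large-model", False),
]

rates = truthful_rate(toy_judgments)
for model, rate in rates.items():
    print(f"{model}: {rate:.0%} truthful")
```

With real judgments in place of the toy labels, a lower rate for the larger model would correspond to the trend noted above.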