TruthfulQA

Adversarial question-answering dataset that tests whether models generate truthful answers. The questions are written so that some humans would answer them falsely because of a false belief or common misconception.

An interesting finding of the paper is that larger models tended to generate more false answers than smaller ones.
- possibly because larger models are better at imitating the falsehoods found in human-written text, or because they make more human-like (but false) generalizations