This website presents the results of the following paper:
+
+ Tuan Dung Nguyen, Duncan J. Watts, and Mark E. Whiting. "A Large-Scale Evaluation of Commonsense Knowledge in Humans and Large Language Models." PNAS Nexus 5(3): pgag029 (2026). DOI: 10.1093/pnasnexus/pgag029.
+
+
+
@article{nguyenLargescaleEvaluationCommonsense2026,
+ title = {A Large-Scale Evaluation of Commonsense Knowledge in Humans and Large Language Models},
+ author = {Nguyen, Tuan Dung and Watts, Duncan J and Whiting, Mark E},
+ year = 2026,
+ journal = {PNAS Nexus},
+ volume = {5},
+ number = {3},
+ pages = {pgag029},
+ publisher = {Oxford University Press},
+ issn = {2752-6542},
+ doi = {10.1093/pnasnexus/pgag029},
+ url = {https://academic.oup.com/pnasnexus/article/doi/10.1093/pnasnexus/pgag029/8487345},
+ copyright = {https://creativecommons.org/licenses/by/4.0/},
+}
+
Model ranking
+
The following models have been evaluated. For more details, see the next section.
+
+
Evaluation settings
+
+
+
+
For every statement, humans and LLMs are asked to indicate (a) whether they agree with it and (b) whether they think most other people would agree with it. In panel A, a total of N = 2,046 human participants were recruited to perform this task. The "Avg." column denotes the percentage of people who answered "yes" to the corresponding question. In panel B, we treat each LLM (out of N = 35 models in total) as an independent survey respondent, just like every human in panel A. This gives rise to the individual-level view of common sense, in which each model is scored by its agreement with the majority of other people on every statement. In panel C, we treat the probability an LLM assigns to its output answer as the average response of a hypothetical population of "silicon samples" (depicted as robots). For instance, if the LLM agrees with the statement "Eighty percent of success is showing up" with 90% probability, we interpret this to mean that 90% of the silicon samples would agree with the statement. This gives rise to the statement-level metric of common sense, which is used to measure the correlation between the human (panel A) and silicon sample (panel C) populations.
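The three views described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the human answers are made up, the second statement and its probability are hypothetical, and deriving a model's single answer from its probability with a 0.5 threshold is a simplification rather than the procedure used in the paper. Only the 90% figure for "Eighty percent of success is showing up" comes from the example in the text.

```python
# Hypothetical human answers to question (a); True means "yes".
human_answers = {
    "Eighty percent of success is showing up": [True, True, True, False],
    "A hypothetical second statement": [False, False, True, False],
}

# Probability that a hypothetical LLM answers "yes" to each statement.
llm_p_yes = {
    "Eighty percent of success is showing up": 0.9,
    "A hypothetical second statement": 0.6,
}


def share_yes(answers):
    """Panel A: fraction of respondents who answered "yes" (the "Avg." column)."""
    return sum(answers) / len(answers)


# Panel A: average human response per statement.
human_avg = {s: share_yes(a) for s, a in human_answers.items()}

# Panel B: the LLM as one respondent. Assumption: its answer is "yes"
# whenever its probability of "yes" is at least 0.5.
llm_answer = {s: p >= 0.5 for s, p in llm_p_yes.items()}
agrees_with_majority = {
    s: llm_answer[s] == (human_avg[s] >= 0.5) for s in human_avg
}

# Panel C: the same probability read as the share of "silicon samples"
# that would answer "yes".
silicon_avg = dict(llm_p_yes)

for s in human_avg:
    print(s, human_avg[s], agrees_with_majority[s], silicon_avg[s])
```

In this toy data the model sides with the human majority on the first statement but not on the second, while panel C keeps the full probabilities instead of collapsing them into a single yes or no.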
+
+
Human common sense
+
From panel A in the figure above, we use the statement ratings of each individual to calculate their commonsensicality score. To see each individual's ratings and their commonsensicality score, check out the Human Individual panel at the top.
+
+
LLMs as Individuals
+
From panel B in the figure above, we use the statement ratings of each LLM to calculate its commonsensicality score. To see each LLM's ratings and commonsensicality score, check out the LLM as Individual panel at the top.
+
+
LLM as a Generator of a Population
+
From panel C in the figure above, we use each LLM as a generator of a hypothetical population. Each individual in this population rates statements just as a human participant does. Then, for this population, we calculate a commonsensicality score for each statement. Thus, every statement has one commonsensicality score per population: one for the human population and one for the population generated by each LLM.
+
In the Statement Commonsensicality panel at the top, you can browse all statements and their commonsensicality in every population.
+
In the Statement Score Correlation panel, you can compare every LLM-generated population with the human population. This comparison is based on the Pearson correlation between the statement commonsensicality scores of the two populations.
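As a rough illustration of this comparison, the sketch below computes the Pearson correlation between two sets of statement scores. The statement IDs and score values are invented for the example, and the `pearson` helper is a plain textbook implementation, not the code behind this website.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical statement commonsensicality scores in two populations.
human_scores = {"s1": 0.80, "s2": 0.35, "s3": 0.60}
llm_scores = {"s1": 0.90, "s2": 0.40, "s3": 0.55}

# Compare only statements scored in both populations, in a fixed order.
shared = sorted(human_scores.keys() & llm_scores.keys())
r = pearson(
    [human_scores[s] for s in shared],
    [llm_scores[s] for s in shared],
)
print(f"Pearson r over {len(shared)} statements: {r:.3f}")
```

A value of r close to 1 would mean that the LLM-generated population ranks statements by commonsensicality in much the same way as the human population does.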