As in #72
human baseline. Monitor metrics other than the loss that are human-interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it. Alternatively, annotate the test data twice and for each example treat one annotation as prediction and the second as ground truth.
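The double-annotation idea above can be sketched in a few lines of Python. This is only an illustration: `annotation_agreement` and the toy labels are hypothetical names and data, not part of this project.

```python
def annotation_agreement(first_pass, second_pass):
    """Score one annotation pass against the other.

    The first pass plays the role of the prediction and the second pass
    the role of ground truth; the result is plain accuracy.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both annotation passes must cover the same examples")
    if not first_pass:
        raise ValueError("need at least one annotated example")
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)


# Toy labels (made up for illustration): the two passes disagree on one example.
human_baseline = annotation_agreement(
    ["cat", "dog", "cat", "bird"],
    ["cat", "dog", "bird", "bird"],
)
print(human_baseline)  # 0.75
```

The resulting number gives a rough human reference point to compare the model's accuracy against.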
For this, we can only warn the user to monitor metrics other than just the loss.
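A minimal sketch of what such a warning could look like, assuming the monitored metrics are available as a list of names. The function `warn_if_only_loss` and the convention that loss-like metrics have names ending in `loss` are assumptions made for illustration, not an existing API.

```python
import warnings


def warn_if_only_loss(metric_names):
    """Warn when nothing but the loss is monitored.

    Assumption: any metric whose name ends in "loss" (e.g. "val_loss")
    counts as a loss. Returns True if a warning was issued.
    """
    others = [name for name in metric_names if not name.lower().endswith("loss")]
    if not others:
        warnings.warn(
            "Only the loss is monitored; consider adding a human-interpretable "
            "metric such as accuracy.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


warn_if_only_loss(["loss", "val_loss"])   # issues a UserWarning
warn_if_only_loss(["loss", "accuracy"])   # silent
```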