The evaluation scripts for the three splits are currently forks of each other, so any fix has to be applied to each copy separately. They should be refactored into a single, shared evaluation library.
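One possible shape for such a library is sketched below, purely as an illustration. It assumes the forks differ mainly in per-split settings, which may not hold for the real scripts. The split names (`split_a`, `split_b`, `split_c`), the `SplitConfig` class, the `evaluate` entry point, and the `exact_match` metric are all hypothetical placeholders, not taken from the existing code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical metric signature: (predictions, references) -> score.
Metric = Callable[[Sequence[str], Sequence[str]], float]


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


@dataclass(frozen=True)
class SplitConfig:
    """Holds whatever differs between splits, so the logic is not forked."""
    name: str
    metrics: Dict[str, Metric]


# Placeholder split names; the real three splits are not named here.
SPLITS: Dict[str, SplitConfig] = {
    name: SplitConfig(name=name, metrics={"exact_match": exact_match})
    for name in ("split_a", "split_b", "split_c")
}


def evaluate(
    split: str, predictions: Sequence[str], references: Sequence[str]
) -> Dict[str, float]:
    """Single entry point shared by all splits."""
    config = SPLITS[split]  # raises KeyError for an unknown split
    return {
        metric_name: metric(predictions, references)
        for metric_name, metric in config.metrics.items()
    }


if __name__ == "__main__":
    print(evaluate("split_a", ["a", "b"], ["a", "c"]))  # {'exact_match': 0.5}
```

With a layout like this, each former script would shrink to a thin wrapper that calls `evaluate` with its split name, and split-specific behavior would live in configuration instead of in copied code.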