[BUG] LogProbTokenNorm crashes with IndexError in CFFormulation for belebele #1170

@Rijgersberg

Describe the bug

I'm trying to benchmark models on the CFFormulation for belebele.

When doing so, lighteval crashes with an IndexError during logprob token-based normalization (see traceback below).

Adding some prints to the offending function, I see that choices_tokens has length 3 rather than the expected 4, which is both the number of answer choices and the length of choices_text and choices_logprob:

choices_text=[' 4', ' 27', ' 6', ' 34']                   
choices_logprob=[-2.90625, -5.65625, -3.03125, -5.4375]     
choices_tokens=[[236743, 236812, -1], [236743, 236778, 236832], [236743, 236825, -1]] 
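To illustrate, the comprehension at normalizations.py:527 reduces to the following sketch (values copied from the debug prints above); it indexes choices_tokens by the positions of choices_logprob, so the missing fourth entry triggers the crash:

```python
# Values from the debug prints above: four logprobs, but only three token lists.
choices_logprob = [-2.90625, -5.65625, -3.03125, -5.4375]
choices_tokens = [[236743, 236812, -1], [236743, 236778, 236832], [236743, 236825, -1]]

try:
    # Same shape as the comprehension in normalizations.py:527
    normalized = [
        choices_logprob[ix] / len(choices_tokens[ix])
        for ix in range(len(choices_logprob))
    ]
except IndexError as e:
    # ix == 3 has no corresponding entry in choices_tokens
    print(f"IndexError: {e}")
```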

I tried to trace the mismatch back to its origin; it appears to arise during generation itself rather than in the normalization code.

Traceback:

───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/main_ac │
│ celerate.py:147 in accelerate                                                │
│                                                                              │
│   144 │   │   model_config=model_config,                                     │
│   145 │   )                                                                  │
│   146 │                                                                      │
│ ❱ 147 │   pipeline.evaluate()                                                │
│   148 │                                                                      │
│   149 │   pipeline.show_results()                                            │
│   150                                                                        │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/pipelin │
│ e.py:291 in evaluate                                                         │
│                                                                              │
│   288 │   │                                                                  │
│   289 │   │   if self.is_main_process():                                     │
│   290 │   │   │   self._post_process_outputs(outputs)                        │
│ ❱ 291 │   │   │   self._compute_metrics(outputs)                             │
│   292 │   │   │                                                              │
│   293 │   │   │   self.evaluation_tracker.general_config_logger.log_end_time │
│   294 │   │   │   self.evaluation_tracker.metrics_logger.aggregate(          │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/pipelin │
│ e.py:391 in _compute_metrics                                                 │
│                                                                              │
│   388 │   │   │   │   docs = [doc for doc, _ in samples]                     │
│   389 │   │   │   │   responses = [response for _, response in samples]      │
│   390 │   │   │   │                                                          │
│ ❱ 391 │   │   │   │   outputs = apply_metric(                                │
│   392 │   │   │   │   │   docs=docs,                                         │
│   393 │   │   │   │   │   responses=responses,                               │
│   394 │   │   │   │   │   metrics=metric_category_metrics,                   │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /__init__.py:54 in apply_metric                                              │
│                                                                              │
│   51 │   │   # Add non-batched metric results for this sample                │
│   52 │   │   for metric in non_batched_metrics:                              │
│   53 │   │   │   output.update(                                              │
│ ❱ 54 │   │   │   │   metric.compute_sample(                                  │
│   55 │   │   │   │   │   model_response=responses[i],                        │
│   56 │   │   │   │   │   doc=docs[i],                                        │
│   57 │   │   │   │   )                                                       │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /utils/metric_utils.py:59 in compute_sample                                  │
│                                                                              │
│    56 │   │                                                                  │
│    57 │   │   if isinstance(self, MetricGrouping):                           │
│    58 │   │   │   return sample_level_fn(**kwargs)                           │
│ ❱  59 │   │   return {self.metric_name: sample_level_fn(**kwargs)}           │
│    60 │                                                                      │
│    61 │   def get_corpus_aggregations(self) -> dict:                         │
│    62 │   │   if isinstance(self, MetricGrouping):                           │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /metrics_sample.py:282 in compute                                            │
│                                                                              │
│    279 │   │   choices_tokens = model_response.output_tokens[:n_choices]     │
│    280 │   │                                                                 │
│    281 │   │   normalized_log_probs = (                                      │
│ ❱  282 │   │   │   normalize_log_probs(                                      │
│    283 │   │   │   │   self.logprob_normalization,                           │
│    284 │   │   │   │   choices_logprobs,                                     │
│    285 │   │   │   │   unconditioned_logprobs,                               │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /normalizations.py:527 in normalize_log_probs                                │
│                                                                              │
│   524 │   │   case LogProbTokenNorm():                                       │
│   525 │   │   │   assert choices_tokens is not None, "choices_tokens must be │
│   526 │   │   │   normalized_log_probs = [                                   │
│ ❱ 527 │   │   │   │   choices_logprob[ix] / len(choices_tokens[ix]) for ix i │
│   528 │   │   │   ]                                                          │
│   529 │   │   case LogProbPMINorm():                                         │
│   530 │   │   │   assert unconditioned_logprob is not None, "unconditioned_l │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range

To Reproduce

$ lighteval accelerate "model_name=HPLT/hplt2c_nld_checkpoints" "belebele_nld_Latn_cf|5" --load-tasks-multilingual

Expected behavior

Lighteval completes the benchmark successfully, as it does for "belebele_nld_Latn_mcq" and for LogProbCharNorm in "belebele_nld_Latn_cf".
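For contrast, a sketch of why LogProbCharNorm is presumably unaffected: it normalizes by the length of each choice's text, zipping over the choices themselves, so it never indexes into the too-short choices_tokens list:

```python
# Char-based normalization divides each logprob by its choice's text length,
# so the length mismatch in choices_tokens goes unnoticed.
choices_text = [' 4', ' 27', ' 6', ' 34']
choices_logprob = [-2.90625, -5.65625, -3.03125, -5.4375]

char_normalized = [lp / len(t) for lp, t in zip(choices_logprob, choices_text)]
print(char_normalized)
```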

Version info

Linux; lighteval 0.13.0 (also reproduced on main); Python 3.13.
