[WhoScored] expose eventId/event_id for related_event_id linking #941

@alexrguezzz

Describe the bug
In soccerdata==1.8.8, WhoScored.read_events(output_fmt="events") returns a dataframe that includes related_event_id, but it does not expose the event's own identifier in the formatted output.

This makes related_event_id difficult to use, because users cannot easily link a related event back to the original event row within the same dataframe.

The expected behavior is that the formatted events output should preserve the event identifier, either as an index level or as a regular column.

Python version: 3.12.13

Affected scrapers
This affects the WhoScored scraper.

Code example
A minimal code example that reproduces the problem in soccerdata==1.8.8:

import soccerdata as sd

ws = sd.WhoScored(leagues="ENG-Premier League", seasons="2021", no_cache=True)
events = ws.read_events(match_id=1485184)

print(events.index.names)
print("id" in events.columns)
print("game_id" in events.columns)
print("related_event_id" in events.columns)

Output

['league', 'season', 'game']
False
True
True

No exception is raised. The formatted dataframe exposes game_id and related_event_id, but the event's own identifier appears neither in the index nor as an id column, so related_event_id cannot be reliably resolved to an event row within the same formatted output.

Additional context
I want to be explicit about the version behavior:

  • In soccerdata==1.8.8, WhoScored.read_events() works, but the formatted events output is missing the event identifier.
  • In soccerdata==1.9.0, I can no longer validate this in the same workflow, because WhoScored fails earlier: read_schedule() raises JSONDecodeError, a separate regression reported in [WhoScored] read_schedule() fails with JSONDecodeError in 1.9.0 #940.

So this issue is specifically about the formatted read_events(output_fmt="events") output schema.

I am attaching the Jupyter notebook that I used to reproduce the 1.8.8 behavior. In that notebook, read_events(match_id=1485184) returns a dataframe indexed by league, season, and game, and the rendered dataframe shows game_id and related_event_id, but not the event's own identifier.

Guía SoccerData (1.8.8).ipynb

This looks like either:

  • a bug in the formatted output schema; or
  • a mismatch between the implementation and the documentation.

Local workaround

I found a local workaround and am sharing it here for reference.

The raw WhoScored event data appears to contain both eventId and id. After standardize_colnames, these become event_id and id.
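To illustrate the renaming described above, here is a minimal sketch of a camelCase to snake_case conversion. The helper camel_to_snake is hypothetical and is not soccerdata's standardize_colnames; it only mimics the effect reported here. The raw name relatedEventId is also an assumption, inferred from the formatted column related_event_id.

```python
import re


def camel_to_snake(name: str) -> str:
    """Hypothetical stand-in for the renaming step, not soccerdata's own code."""
    # Insert "_" at every lowercase/digit -> uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# "id" and "eventId" are named in this issue; "relatedEventId" is assumed.
raw_columns = ["id", "eventId", "relatedEventId"]
print([camel_to_snake(c) for c in raw_columns])
# -> ['id', 'event_id', 'related_event_id']
```

Under this renaming, id and event_id stay distinct columns, which is why the formatted output could in principle keep both.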

There is a design question here: should the formatted dataframe expose event_id, id, or both? There is also a second question: should the chosen identifier be part of the index, or should it remain a regular column?

In the local patch I tested, I used event_id as an additional index level and kept id as a regular column.

The reasoning was:

  • event_id comes from WhoScored's eventId;
  • related_event_id appears to refer to this event-level identifier;
  • using event_id in the index makes it possible to link related_event_id back to an event row within the same match;
  • keeping id as a column avoids dropping the other identifier from the formatted output.

I understand that the maintainers may prefer a different schema, for example keeping the current index and exposing event_id as a regular column instead. The main point is that the formatted read_events(output_fmt="events") output should preserve an event identifier so that related_event_id can be used reliably.
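The two schema options can be sketched with plain pandas on made-up rows. This is not real WhoScored output and not the attached patch: the values, the type column, and the game label are invented for illustration, and the sketch assumes that event_id is unique within a match and that related_event_id refers to event_id.

```python
import pandas as pd

# Made-up rows for illustration only; not real WhoScored data.
events = pd.DataFrame(
    {
        "league": ["ENG-Premier League"] * 3,
        "season": ["2021"] * 3,
        "game": ["made-up game"] * 3,
        "game_id": [1485184] * 3,
        "event_id": [10, 11, 12],
        "id": [9001, 9002, 9003],
        "type": ["Pass", "Foul", "Card"],
        # Nullable integers, since most events have no related event.
        "related_event_id": pd.array([None, None, 11], dtype="Int64"),
    }
)

# Option A: event_id stays a regular column; resolve links with a self-merge.
targets = (
    events[["game_id", "event_id", "type"]]
    .rename(columns={"event_id": "related_event_id", "type": "related_type"})
    .astype({"related_event_id": "Int64"})  # match the key dtype on both sides
)
linked = events.merge(targets, on=["game_id", "related_event_id"], how="left")
print(linked[["event_id", "type", "related_event_id", "related_type"]])

# Option B: event_id becomes an extra index level; resolve links by lookup.
indexed = events.set_index(["league", "season", "game", "event_id"])
related_row = indexed.xs(11, level="event_id")
print(related_row["type"].iloc[0])
```

Either way, the link only resolves because event_id is present in the frame; without it, neither the merge key nor the index lookup exists.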

I am attaching the modified whoscored.py file for reference.

whoscored_issue_941_local_patch.py

Contributor Action Plan

  • I can fix this issue and will submit a pull request.
  • I’m unsure how to fix this, but I'm willing to work on it with guidance.
  • I’m not able to fix this issue.

Labels: bug
