@PeterNerlich

Linkcheck limits URL length to MAX_URL_LENGTH. Longer URLs are skipped, and a warning that includes the full URL is logged. This causes two problems:

  • Someone decided it was a good idea to put a multi-megabyte image directly into the content as a base64-encoded data URL, and the project using django-linkcheck did not think to prevent that.
    Now multi-megabyte data URLs end up in the log every time linkcheck scans this content. Inspecting the logs becomes nearly impossible: one has to scroll past huge blocks of garbage data, and if another data URL is logged right after, the important log line between the two is easily missed.

  • Whether it is a data URL or just a conventional URL that happens to exceed MAX_URL_LENGTH, one may want to investigate where in the content it was used and whether it should be changed.
    Unfortunately, the usual approach of looking at Link objects to find the content object containing the URL does not work, since the URL was rejected for being too long and no Link exists for it.

I propose a solution to each of these:

  • If the URL exceeds MAX_URL_LENGTH and also starts with data:, truncate it to 64 characters in the log message.
    (Expectation: the payload is not useful for identifying the URL)

  • In the log message emitted when the URL exceeds MAX_URL_LENGTH, also log the instance it came from, so the offending content can be found and fixed.
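
To make the two proposals concrete, here is a minimal standalone sketch of the intended behaviour. It is not the actual diff: `shorten_for_log`, `warn_url_too_long`, and `DATA_URL_LOG_LENGTH` are hypothetical names, and the value of `MAX_URL_LENGTH` is only assumed for illustration.

```python
import logging

logger = logging.getLogger(__name__)

MAX_URL_LENGTH = 255       # assumed value, for illustration only
DATA_URL_LOG_LENGTH = 64   # truncation length proposed above


def shorten_for_log(url: str) -> str:
    """Return a log-safe form of a URL (hypothetical helper).

    Only overlong data: URLs are truncated; every other URL is
    returned unchanged so it stays identifiable in the log.
    """
    if len(url) > MAX_URL_LENGTH and url.startswith("data:"):
        return url[:DATA_URL_LOG_LENGTH] + "..."
    return url


def warn_url_too_long(url: str, instance: object) -> None:
    """Log the skipped URL together with the object it was found in."""
    logger.warning(
        "URL exceeds MAX_URL_LENGTH (%d) and was skipped: %s (found in %r)",
        MAX_URL_LENGTH,
        shorten_for_log(url),
        instance,
    )
```

Logging `instance` with `%r` is just one option; a model label plus primary key would probably be easier to act on.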

@claudep
Contributor

claudep commented Nov 24, 2025

In my opinion, data: URLs should not be collected at all by linkcheck at the Linklist level, as they don't point to any checkable content. Thoughts?

@PeterNerlich
Author

I agree, as long as the intention of linkcheck is to track clickable links. In integreat-cms we kind of abuse it to track other stuff as well, though not through data URLs, so your suggestion would not impact us. Now that I think about it, that might have been a cleaner way to do things in our case.

In any case, since linkcheck has to bring the infrastructure for tracking links in content to serve its main purpose, it is convenient to use that as an index of where arbitrary URLs are used and thus might impact people.
Maybe whether to handle data: URLs could be a setting that defaults to False.
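
Roughly what I have in mind, as a standalone sketch only: `HANDLE_DATA_URLS` and `should_collect` are made-up names, not existing linkcheck settings or functions, and where such a check would actually live is an open question.

```python
# Hypothetical opt-in setting; data: URLs are ignored unless it is enabled.
HANDLE_DATA_URLS = False


def should_collect(url: str, handle_data_urls: bool = HANDLE_DATA_URLS) -> bool:
    """Decide whether a URL found in content should be tracked at all."""
    # URI schemes are case-insensitive, so compare the prefix in lowercase.
    if url[:5].lower() == "data:":
        return handle_data_urls
    return True
```

With the default, data: URLs would never reach the length check, so they would not show up in the log either.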
