Truncate data urls #219

PeterNerlich · 2025-11-24T09:00:31Z

Linkcheck limits URL length to MAX_URL_LENGTH; longer URLs are skipped and a warning that mentions the URL is logged. This can lead to two problematic situations:

1. Someone decided it was a good idea to put a multi-megabyte image directly into the content as a base64-encoded data URL, and the project using django-linkcheck did not think to prevent that.
   Now multi-megabyte data URLs end up in the log every time linkcheck scans this content. Inspecting the logs is nearly impossible, since one has to scroll past huge blocks of garbage data, and if another data URL is logged right after it, the important log line between the two is easily missed.
2. Whether the warning comes from a data URL or just a conventional URL that happens to exceed MAX_URL_LENGTH, one decides to investigate where in the content it was used and whether it should be changed somehow.
   Unfortunately, the usual approach of looking at Link objects to find the content object containing the URL does not work, since the URL was rejected for being too long.

I propose a solution to each of these:

1. If the URL exceeds MAX_URL_LENGTH and also starts with data:, truncate it to only 64 characters in the log message.
   (Expectation: the data is not useful for identifying the URL.)
2. In the log message emitted when a URL exceeds MAX_URL_LENGTH, also log the instance it came from, to help with doing something about it.
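
A minimal sketch of the two proposals, not the code in the commits of this pull request. The names `url_for_log`, `warn_url_too_long`, the logger name, and the value 255 for `MAX_URL_LENGTH` are assumptions made for illustration only; the real setting lives in django-linkcheck's own settings module.

```python
import logging

logger = logging.getLogger("linkcheck_sketch")  # hypothetical logger name

MAX_URL_LENGTH = 255        # assumed value, for illustration only
DATA_URL_LOG_LENGTH = 64    # truncation length proposed above


def url_for_log(url, max_length=MAX_URL_LENGTH, keep=DATA_URL_LOG_LENGTH):
    """Return the URL as it should appear in a log message.

    Only over-long data: URLs are shortened; the payload is assumed not to
    help with identifying the URL. Other URLs are returned unchanged.
    """
    if len(url) > max_length and url[:5].lower() == "data:":
        return url[:keep]
    return url


def warn_url_too_long(url, instance):
    """Log a warning naming the (possibly truncated) URL and its source."""
    logger.warning(
        "URL exceeding max length will be skipped: %s (found in %r)",
        url_for_log(url),
        instance,
    )
```

Conventional URLs are deliberately left intact here, since their full text may still be needed to find them in the content.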

claudep · 2025-11-24T09:12:41Z

In my opinion, data: URLs should not be collected at all by linkcheck at the Linklist level, as they don't point to any checkable content. Thoughts?

PeterNerlich · 2025-11-24T09:25:33Z

I agree, as long as the intention of linkcheck is to track clickable links. In integreat-cms we kind of abuse it to track other stuff as well – though not through data urls, so your suggestion would not impact us, but now that I think about it that might have been a cleaner way to do things in our case.

In any case, since linkcheck needs to bring the infrastructure to track links in content in order to serve its main purpose, it is convenient to use that as an index of where arbitrary things are used and thus might impact people.
Maybe whether to handle data: URLs could be a setting that defaults to False, though.
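
A rough sketch of what such an opt-in could look like. Both the setting name `HANDLE_DATA_URLS` and the helper `should_collect` are hypothetical; nothing here is an existing django-linkcheck setting or API.

```python
# Hypothetical setting: data: URLs are ignored unless a project opts in.
HANDLE_DATA_URLS = False


def should_collect(url, handle_data_urls=HANDLE_DATA_URLS):
    """Decide whether a URL found in content should be collected at all."""
    is_data_url = url[:5].lower() == "data:"
    return handle_data_urls or not is_data_url
```

With the default, `should_collect("data:image/png;base64,AAAA")` is false while ordinary links are still collected; a project that uses the link index for other purposes could pass `handle_data_urls=True`.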

PeterNerlich added 2 commits November 24, 2025 09:42

truncate long data urls in log

5cf74e7

log where a too long URL came from

dc8f4d5

PeterNerlich mentioned this pull request Nov 24, 2025

Limit length of log entries digitalfabrik/integreat-cms#4028

Merged
