Provide text with UTF-8 MIME Type by default by ArneBab · Pull Request #970 · hyphanet/fred

ArneBab · 2024-07-25T10:16:31Z

This avoids very common text encoding problems.

Bombe · 2024-07-25T10:24:02Z

You have not avoided the very common test writing problem! 😄

ArneBab · 2024-09-22T15:15:28Z

> You have not avoided the very common test writing problem! 😄

I had also not avoided the very common "my change does not have any effect and a test would have shown that" problem 😓

Now it’s fixed: our plain text filter actually detects the charset from the BOM and uses UTF-8 by default.
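
For readers following along, here is a minimal sketch of what "detect the charset from the BOM, otherwise use UTF-8" can look like. It is not the code in this PR: BomSniffer and its methods are hypothetical names made up for illustration, and UTF-32 marks are deliberately left out, so a UTF-32LE stream (FF FE 00 00) would be reported as UTF-16LE here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of fred: maps a leading byte order mark to a charset. */
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the charset indicated by a BOM in the first {@code length} bytes of
     * {@code data}, or UTF-8 when no known BOM is found (the default this PR aims for).
     * Only the UTF-8 and UTF-16 marks are handled in this sketch.
     */
    static Charset detect(byte[] data, int length) {
        int n = Math.min(length, data.length);
        if (startsWith(data, n, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, n, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, n, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return StandardCharsets.UTF_8; // no BOM: fall back to the default
    }

    /** True if the first n bytes of data begin with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int n, int... prefix) {
        if (n < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Note that Charset.name() returns the canonical upper-case name (UTF-8), so any test comparing against a lower-case string would need a case-insensitive comparison.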

Bombe · 2024-09-22T20:44:37Z

src/freenet/client/filter/ContentFilter.java

+				if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) {
+					byte[] charsetBuffer = new byte[CHARSET_DETECTION_FALLBACK_BUFFERSIZE];
+					int offset = readIntoBuffer(input, CHARSET_DETECTION_FALLBACK_BUFFERSIZE, charsetBuffer);
+					BOMDetection bom = CSSReadFilter.detectCharsetFromBOM(charsetBuffer, CHARSET_DETECTION_FALLBACK_BUFFERSIZE);


I’m pretty sure this is 100% wrong. That method detects an encoding from the representation of the string @charset. It is also gloriously misnamed as it has nothing to do with a BOM. 😀

bertm · 2025-05-11T01:12:20Z

Both new tests fail on my machine, not sure why they work on CI.

bertm · 2025-05-11T01:33:18Z

src/freenet/client/filter/ContentFilter.java

 				if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) {
 					int bufferSize = handler.charsetExtractor.getCharsetBufferSize();
-					input.mark(bufferSize);
 					byte[] charsetBuffer = new byte[bufferSize];
-					int bytesRead = 0, offset = 0, toread=0;
-					while(true) {
-						toread = bufferSize - offset;
-						bytesRead = input.read(charsetBuffer, offset, toread);
-						if(bytesRead == -1 || toread == 0) break;
-						offset += bytesRead;
-					}
-					input.reset();
+					int offset = readIntoBuffer(input, bufferSize, charsetBuffer);
 					charset = detectCharset(charsetBuffer, offset, handler, maybeCharset);


I believe the correct solution to this problem is to move this block of code to right before the if(handler.readFilter != null) check: text/plain does not have a readFilter but does set takesACharset, so this would run detectCharset appropriately.

A few things to consider:

- handler.charsetExtractor.getCharsetBufferSize() will throw an NPE, so we need to set bufferSize to the max BOM length (5?) when handler.charsetExtractor is absent.
- Alternatively, a dummy CharsetExtractor can be used that always fails to detect a charset, so the BOM one is used automagically.
- This will return UTF-8 rather than utf-8, so the related test would need some adjustment.
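
To make the first point concrete, here is a rough sketch of a null-safe buffer size plus a peek helper. The body of readIntoBuffer is only a guess reconstructed from the removed mark/read/reset lines in the hunk above; CharsetPeek, SizeHint, bufferSizeFor and FALLBACK_BUFFER_SIZE are hypothetical names, and the value 5 is just the tentative figure from this comment.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch, not the code in the PR. */
final class CharsetPeek {
    /** Assumed fallback size; 5 is the tentative "max BOM length" mentioned above. */
    static final int FALLBACK_BUFFER_SIZE = 5;

    /** Stand-in for the one extractor method used here (assumption for illustration). */
    interface SizeHint {
        int getCharsetBufferSize();
    }

    private CharsetPeek() {}

    /** Avoids the NPE by falling back to a fixed size when no extractor is configured. */
    static int bufferSizeFor(SizeHint extractor) {
        return extractor != null ? extractor.getCharsetBufferSize() : FALLBACK_BUFFER_SIZE;
    }

    /**
     * Reads up to bufferSize bytes into buffer, then rewinds the stream so the
     * filter still sees the data from the start. Returns the number of bytes read.
     */
    static int readIntoBuffer(InputStream input, int bufferSize, byte[] buffer) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("input must support mark/reset");
        }
        input.mark(bufferSize);
        int offset = 0;
        while (offset < bufferSize) {
            int bytesRead = input.read(buffer, offset, bufferSize - offset);
            if (bytesRead == -1) break; // end of stream before the buffer filled up
            offset += bytesRead;
        }
        input.reset();
        return offset;
    }
}
```

The dummy-extractor alternative would make bufferSizeFor unnecessary, at the cost of one more class; either way the peeked bytes have to be replayed to the filter, which is why the stream is reset.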

bertm · 2025-05-11T01:33:52Z

src/freenet/client/filter/ContentFilter.java

+				if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) {
+					byte[] charsetBuffer = new byte[CHARSET_DETECTION_FALLBACK_BUFFERSIZE];
+					int offset = readIntoBuffer(input, CHARSET_DETECTION_FALLBACK_BUFFERSIZE, charsetBuffer);
+					BOMDetection bom = CSSReadFilter.detectCharsetFromBOM(charsetBuffer, CHARSET_DETECTION_FALLBACK_BUFFERSIZE);


Yep, I agree with @Bombe here. See my other comment for something that does appear to work.

bertm · 2025-05-11T01:35:06Z

test/freenet/client/filter/ContentFilterTest.java

+        byte[] buf = { (byte) 0xef, (byte) 0xbb, (byte) 0xbf, 0x40 };
+        ArrayBucket out = new ArrayBucket();
+        FilterStatus fo = ContentFilter.filter(new ArrayBucket(buf).getInputStream(), out.getOutputStream(), "text/plain", null, null, null);
+        assertTrue("utf-8".equals(fo.charset));


Please use assertThat(actual, equalTo(expected)) or assertEquals(expected, actual) for checking equality - this just yields a non-descriptive AssertionError when it fails instead of showing what the actual value was.

ArneBab · 2025-11-08T22:51:23Z

#1109 does part of the work of this PR with just a single-line change.

I now think the correct way to deal with this would be to set the charset when detecting text/plain. That would then be needed in pyFreenet and other utilities as well as in fred. Basically: always set utf-8 if it decodes the text correctly.
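
As a rough illustration of "set utf-8 if that can correctly decode the text", a strict decoder from the Java standard library can serve as the check. Utf8Check and plainTextMimeType are hypothetical names for this sketch, not existing fred or pyFreenet API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: announce utf-8 only when the bytes really are valid UTF-8. */
final class Utf8Check {
    private Utf8Check() {}

    /** True if data decodes as UTF-8 without malformed or unmappable input. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** MIME type to use for plain text; the charset parameter is added only when it is safe. */
    static String plainTextMimeType(byte[] data) {
        return isValidUtf8(data) ? "text/plain; charset=utf-8" : "text/plain";
    }
}
```

One caveat: if only a prefix of the file is checked, the prefix may end in the middle of a multi-byte sequence and be rejected even though the whole file is valid, so the check should cover the complete data or be made tolerant of a truncated tail.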

Bombe reviewed Sep 22, 2024

ArneBab added 3 commits November 8, 2024 10:10

Provide text with UTF-8 MIME Type by default

e7436a3

This avoids very common text encoding problems.

Actually detect charset with plain text, and add test

85013fb

Add failing test to detect BOM for Utf16Le

748fdae

ArneBab force-pushed the content-filter--text-utf8 branch from 2827f9e to 748fdae · November 11, 2024 07:24

bertm suggested changes May 11, 2025

ArneBab mentioned this pull request Nov 8, 2025

Fix: open as text link on downloads now uses utf-8 encoding #1109

Open

ArneBab closed this Nov 8, 2025

ArneBab wants to merge 3 commits into hyphanet:next from ArneBab:content-filter--text-utf8

3 participants
