Provide text with UTF-8 MIME Type by default#970
Provide text with UTF-8 MIME Type by default#970ArneBab wants to merge 3 commits intohyphanet:nextfrom
Conversation
|
You have not avoided the very common test writing problem! 😄 |
I had also not avoided the very common "my change does not have any effect and a test would have shown that" problem 😓 Now it’s fixed: our plain text filter actually detects the charset from the BOM and uses UTF-8 by default. |
| if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) { | ||
| byte[] charsetBuffer = new byte[CHARSET_DETECTION_FALLBACK_BUFFERSIZE]; | ||
| int offset = readIntoBuffer(input, CHARSET_DETECTION_FALLBACK_BUFFERSIZE, charsetBuffer); | ||
| BOMDetection bom = CSSReadFilter.detectCharsetFromBOM(charsetBuffer, CHARSET_DETECTION_FALLBACK_BUFFERSIZE); |
There was a problem hiding this comment.
I’m pretty sure this is 100% wrong. That method detects an encoding from the representation of the string @charset. It is also gloriously misnamed as it has nothing to do with a BOM. 😀
There was a problem hiding this comment.
Yep, I agree with @Bombe here. See my other comment for something that does appear to work.
This avoids very common text encoding problems.
2827f9e to
748fdae
Compare
|
Both new tests fail on my machine, not sure why they work on CI. |
| if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) { | ||
| int bufferSize = handler.charsetExtractor.getCharsetBufferSize(); | ||
| input.mark(bufferSize); | ||
| byte[] charsetBuffer = new byte[bufferSize]; | ||
| int bytesRead = 0, offset = 0, toread=0; | ||
| while(true) { | ||
| toread = bufferSize - offset; | ||
| bytesRead = input.read(charsetBuffer, offset, toread); | ||
| if(bytesRead == -1 || toread == 0) break; | ||
| offset += bytesRead; | ||
| } | ||
| input.reset(); | ||
| int offset = readIntoBuffer(input, bufferSize, charsetBuffer); | ||
| charset = detectCharset(charsetBuffer, offset, handler, maybeCharset); |
There was a problem hiding this comment.
I believe the correct solution to this problem is moving this block of code right before the if(handler.readFilter != null) check: text/plain does not have a readFilter, but does takesACharset so this would run the detectCharset appropriately.
Few things to consider:
handler.charsetExtractor.getCharsetBufferSize()will NPE, so we need to choose the bufferSize to the max BOM length (5?) whenhandler.charsetExtractoris absent.- Alternatively a dummy CharsetExtractor can be used that always fails to detect a charset, so the BOM one is used automagically.
- this will return
UTF-8rather thanutf-8so the related test would need some adjustment.
| if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) { | ||
| byte[] charsetBuffer = new byte[CHARSET_DETECTION_FALLBACK_BUFFERSIZE]; | ||
| int offset = readIntoBuffer(input, CHARSET_DETECTION_FALLBACK_BUFFERSIZE, charsetBuffer); | ||
| BOMDetection bom = CSSReadFilter.detectCharsetFromBOM(charsetBuffer, CHARSET_DETECTION_FALLBACK_BUFFERSIZE); |
There was a problem hiding this comment.
Yep, I agree with @Bombe here. See my other comment for something that does appear to work.
| byte[] buf = { (byte) 0xef, (byte) 0xbb, (byte) 0xbf, 0x40 }; | ||
| ArrayBucket out = new ArrayBucket(); | ||
| FilterStatus fo = ContentFilter.filter(new ArrayBucket(buf).getInputStream(), out.getOutputStream(), "text/plain", null, null, null); | ||
| assertTrue("utf-8".equals(fo.charset)); |
There was a problem hiding this comment.
Please use assertThat(actual, equalTo(expected)) or assertEquals(expected, actual) for checking equality - this just yields a non-descriptive AssertionError when it fails instead of showing what the actual value was.
|
#1109 does part of the work of this PR with just a single line change. I now think the correct way to deal with this here would be to set the charset when detecting text/plain. That’s then needed both in pyFreenet and other utils and in fred. Basically always set utf-8, if that can correctly decode the text. |
This avoids very common text encoding problems.