Skip to content

Fix revisit content-type#42

Merged
lfoppiano merged 8 commits intoccfrom
bugfix/revisit-content-type
Mar 28, 2026
Merged

Fix revisit content-type#42
lfoppiano merged 8 commits intoccfrom
bugfix/revisit-content-type

Conversation

@lfoppiano
Copy link
Copy Markdown

This PR fixed issue #40. I replaced the content type message/http with application/http; msgtype=response.

Copy link
Copy Markdown

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lfoppiano, thanks! The fix looks good to me.

However, the unit test failed when I tried to run it. Creating a "HTTP 304" response object, resp. the included Content object, might be challenging. See the attached data. It's on you whether to read the response from the segment, or to study it to figure out how it should look like. Use nutch readseg for inspection. There are other Nutch unit tests which use segments.

Otherwise, it's very appreciated that there will be now the first "deeper" unit test for the WARC writer.

Comment thread src/test/org/commoncrawl/util/TestWarcWriter.java Outdated
@lfoppiano
Copy link
Copy Markdown
Author

lfoppiano commented Feb 24, 2026

@sebastian-nagel the test was left failing on purpose, because I wanted to first fix the test running.

I managed (or, better, Claude Opus managed - other models weren't able) to find the cause.
In short, all test were ran with <fork> and there were two tests in particular TestCommonCrawlDataDumper whcih called at some point System.exit() dumping the whole forked process. However also TestMimeUtil had a simliar problem.

For now I've separated the org.commoncrawl.* from the org.apache.* tests. But the issue may still occur preventing other tests from the same group to be executed.

@lfoppiano lfoppiano marked this pull request as draft February 25, 2026 07:37
@sebastian-nagel
Copy link
Copy Markdown

called at some point System.exit()

Thanks! That's a left-over of NUTCH-2852. Unclear why it hits our fork but not upstream Nutch. It should be fixed upstream anyway. I'll also comment on PR #44.

@lfoppiano lfoppiano marked this pull request as ready for review February 26, 2026 13:40
Copy link
Copy Markdown

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lfoppiano, thanks! The fix to change the Content-Type in the WARC header looks good.

See the inline comments about the unit test code.

Comment thread src/test/org/commoncrawl/util/TestWarcWriter.java Outdated
Comment thread build.xml Outdated
@lfoppiano lfoppiano force-pushed the bugfix/revisit-content-type branch from 0f0c4b3 to b0110c9 Compare March 23, 2026 14:24
@lfoppiano
Copy link
Copy Markdown
Author

@sebastian-nagel I believe now the test works fine with the data from a real case. 🌮 🎉
I think there are only two item that I did not find:

String payloadDigest = "sha1:abc123";
String blockDigest = "sha1:def456";

@sebastian-nagel
Copy link
Copy Markdown

The digests are calculated one level higher in the WarcRecordWriter and then passed to the WarcWriter. No need to test it as part of this PR.

Copy link
Copy Markdown

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lfoppiano! Looks good to me.

Please merge!

@lfoppiano lfoppiano merged commit d0c4d97 into cc Mar 28, 2026
1 check passed
@lfoppiano lfoppiano deleted the bugfix/revisit-content-type branch March 28, 2026 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants