Skip to content

5.4.26#113

Merged
mscasso-scanoss merged 6 commits into
mainfrom
5.4.26
Jun 25, 2026
Merged

5.4.26#113
mscasso-scanoss merged 6 commits into
mainfrom
5.4.26

Conversation

@mscasso-scanoss

@mscasso-scanoss mscasso-scanoss commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • Chores
    • Bumped version to 5.4.26.
  • New Features
    • Added scan modes: -P/--purl, -C/--url-hash, and -S/--snippet-scan (with JSON output).
    • Added match-tolerance getter for reporting.
    • Added reporting support for printing purl arrays.
  • Bug Fixes
    • Fixed extension detection to only use dots from the filename basename.
    • Improved component ranking to only prioritize exact extension matches.
  • Documentation
    • Updated help text for the new options.
  • Chores
    • Added configurable maximum file content size (default 50MB) and a cap for URL-hash inputs.

@mscasso-scanoss mscasso-scanoss self-assigned this May 27, 2026
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@mscasso-scanoss, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 49 minutes and 49 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 189a7045-1935-450e-b09f-2c6e2012ca5d

📥 Commits

Reviewing files that changed from the base of the PR and between ffdad17 and bc8fb8f.

📒 Files selected for processing (2)
  • inc/purl_scan.h
  • src/purl_scan.c
📝 Walkthrough

Walkthrough

This PR updates the release version, changes extension handling, adds new purl/component/snippet scan modes, and enforces a maximum decompressed file-content size.

Changes

Extension, scan entry points, and limits

Layer / File(s) Summary
Version bump and extension logic
inc/scanoss.h, src/ignorelist.c, src/match.c
The version constant is updated, extension() returns a basename extension only when a dot exists after the last slash, and compare_file_extension() only distinguishes candidates when exactly one matches the scanned file extension.
Public scan contracts and CLI wiring
inc/limits.h, inc/match_list.h, inc/purl_scan.h, inc/report.h, src/help.c, src/main.c, src/copyright.c
New scan declarations, limit constants, help text, reporting helpers, and main option parsing are added for purl, component, snippet, and max-file-content-size handling.
PURL and component scan implementation
src/purl_scan.c, src/url.c
src/purl_scan.c adds purl aggregation by file MD5, component lookup by URL hash list, and snippet scan entry points; src/url.c adds debug logging around oldest-URL selection.
Snippet scanning and file-size limit enforcement
src/purl_scan.c, src/mz.c
snippet_scan_stream() parses WFP input and emits snippet JSON, the stdin and string wrappers expose that flow, and mz_get_key() rejects decompressed content above the configured file-content limit.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • scanoss-qg

Possibly related PRs

  • scanoss/engine#103: Related to the extension-based component ordering logic changed in src/match.c.

Poem

A rabbit hops with version new,
Through hashes, purls, and snippets too.
One dot in basename, now that’s the key,
CLI paths run clean and free.
🐰✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.87% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is just a version number and doesn't describe the main code changes. Use a short descriptive title that names the primary change, such as adding new scan modes and the max file content size limit.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 5.4.26

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

* add purl scan mode

* add componentes and snippets scanning mode

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
src/purl_scan.c (2)

118-123: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Guard against realloc failure / size overflow.

If realloc returns NULL the original url_hashes pointer is leaked and the next line dereferences NULL. Capturing into a temp (and optionally using a checked growth) makes this robust. Low priority since cohorts are small, but trivial to harden.

♻️ Suggested hardening
 	if (e->n_url_hashes >= e->url_hashes_cap)
 	{
 		e->url_hashes_cap = e->url_hashes_cap ? e->url_hashes_cap * 2 : 8;
-		e->url_hashes = realloc(e->url_hashes, e->url_hashes_cap * sizeof(char *));
+		char **tmp = realloc(e->url_hashes, e->url_hashes_cap * sizeof(char *));
+		if (!tmp)
+			return;
+		e->url_hashes = tmp;
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/purl_scan.c` around lines 118 - 123, The URL hashes growth logic in
purl_scan.c is not safe against allocation failure because the result of realloc
is written directly back to e->url_hashes and then used immediately. Update the
append path around the url_hashes resizing in the relevant function to store
realloc in a temporary pointer, verify it is non-NULL before assigning back, and
only increment n_url_hashes after a successful growth. Also add a checked
capacity calculation for the e->url_hashes_cap * sizeof(char *) allocation to
avoid size overflow when growing the buffer.

537-545: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Check strdup result before use.

rec = strdup(...) is passed to field_n without a NULL check; on allocation failure field_n would operate on NULL. Minor given the context, but a quick guard avoids an undefined path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/purl_scan.c` around lines 537 - 545, The file= parsing path in
purl_scan.c should guard against strdup allocation failure before calling
field_n on rec. Update the snippet around the strdup(line + tagln + MD5_LEN * 2
+ 1) assignment so it checks whether rec is NULL, and if so emit the same
malformed/allocation-style error handling used nearby, free any needed
resources, and return EXIT_FAILURE before target_path is derived. Keep the fix
localized to the field_n/target_path parsing flow so NULL never reaches field_n.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@inc/purl_scan.h`:
- Around line 39-49: The `component_scan` contract in the header is outdated: it
describes a single `url_hash_hex`, but the implementation in `component_scan`
actually accepts a comma-separated list and returns a `{"results": [...]}`
payload. Update the Doxygen `@brief` and `@param` text in `purl_scan.h` to
describe list-based input and multi-result output, keeping the wording aligned
with the behavior in `src/purl_scan.c` so callers see the correct API contract.

In `@src/purl_scan.c`:
- Around line 567-569: The copy into scan->file_size in purl_scan is unbounded
and can overflow because size_field may exceed the 32-byte allocation from scan
initialization. Update the assignment in the scan parsing flow to use a bounded
copy or validated length for scan->file_size, and keep the existing
scan->source_md5 handling unchanged since that buffer is already sized
correctly. Use the relevant purl_scan logic around size_field and the scan
struct fields to locate and fix this safely.

In `@src/url.c`:
- Around line 360-362: The logging in get_oldest_url can dereference comp or
comp->purls when allocation or initialization failed, so add null guards before
the scanlog call. Update the get_oldest_url path to verify comp is non-null
after calloc and to check comp->purls before accessing comp->purls[0], falling
back to safe placeholder text in the log message when either is missing.

---

Nitpick comments:
In `@src/purl_scan.c`:
- Around line 118-123: The URL hashes growth logic in purl_scan.c is not safe
against allocation failure because the result of realloc is written directly
back to e->url_hashes and then used immediately. Update the append path around
the url_hashes resizing in the relevant function to store realloc in a temporary
pointer, verify it is non-NULL before assigning back, and only increment
n_url_hashes after a successful growth. Also add a checked capacity calculation
for the e->url_hashes_cap * sizeof(char *) allocation to avoid size overflow
when growing the buffer.
- Around line 537-545: The file= parsing path in purl_scan.c should guard
against strdup allocation failure before calling field_n on rec. Update the
snippet around the strdup(line + tagln + MD5_LEN * 2 + 1) assignment so it
checks whether rec is NULL, and if so emit the same malformed/allocation-style
error handling used nearby, free any needed resources, and return EXIT_FAILURE
before target_path is derived. Keep the fix localized to the field_n/target_path
parsing flow so NULL never reaches field_n.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bbb8d979-36d6-4ad1-8ec4-dfdca4578232

📥 Commits

Reviewing files that changed from the base of the PR and between d6cbf0a and 2d11156.

📒 Files selected for processing (9)
  • inc/limits.h
  • inc/match_list.h
  • inc/purl_scan.h
  • inc/report.h
  • src/help.c
  • src/main.c
  • src/match_list.c
  • src/purl_scan.c
  • src/url.c
✅ Files skipped from review due to trivial changes (2)
  • inc/limits.h
  • src/help.c

Comment thread inc/purl_scan.h Outdated
Comment thread src/purl_scan.c Outdated
Comment thread src/url.c
Comment on lines +360 to +362
scanlog("get_oldest_url iter=%d purl[0]=%s identified=%d rank=%d release=%s\n",
iteration, comp->purls[0] ? comp->purls[0] : "(null)", comp->identified, comp->rank,
comp->release_date ? comp->release_date : "(null)");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard comp/comp->purls before logging to avoid null dereference.

This log dereferences comp->purls[0] without checking comp->purls, and comp is also not checked after calloc. A low-memory path can crash here.

Suggested fix
 		component_data_t * comp = calloc(1, sizeof(*comp));
+		if (!comp)
+		{
+			scanlog("get_oldest_url: out of memory allocating component_data_t\n");
+			free(url);
+			return false;
+		}
 		bool result = fill_component(comp, key, NULL, (uint8_t *)url);
@@
-		scanlog("get_oldest_url iter=%d purl[0]=%s identified=%d rank=%d release=%s\n",
-			iteration, comp->purls[0] ? comp->purls[0] : "(null)", comp->identified, comp->rank,
+		const char *purl0 = (comp->purls && comp->purls[0]) ? comp->purls[0] : "(null)";
+		scanlog("get_oldest_url iter=%d purl[0]=%s identified=%d rank=%d release=%s\n",
+			iteration, purl0, comp->identified, comp->rank,
 			comp->release_date ? comp->release_date : "(null)");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
scanlog("get_oldest_url iter=%d purl[0]=%s identified=%d rank=%d release=%s\n",
iteration, comp->purls[0] ? comp->purls[0] : "(null)", comp->identified, comp->rank,
comp->release_date ? comp->release_date : "(null)");
component_data_t * comp = calloc(1, sizeof(*comp));
if (!comp)
{
scanlog("get_oldest_url: out of memory allocating component_data_t\n");
free(url);
return false;
}
bool result = fill_component(comp, key, NULL, (uint8_t *)url);
const char *purl0 = (comp->purls && comp->purls[0]) ? comp->purls[0] : "(null)";
scanlog("get_oldest_url iter=%d purl[0]=%s identified=%d rank=%d release=%s\n",
iteration, purl0, comp->identified, comp->rank,
comp->release_date ? comp->release_date : "(null)");
🧰 Tools
🪛 Cppcheck (2.21.0)

[warning] 361-361: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 361-361: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 361-361: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 361-361: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 362-362: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 362-362: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/url.c` around lines 360 - 362, The logging in get_oldest_url can
dereference comp or comp->purls when allocation or initialization failed, so add
null guards before the scanlog call. Update the get_oldest_url path to verify
comp is non-null after calloc and to check comp->purls before accessing
comp->purls[0], falling back to safe placeholder text in the log message when
either is missing.

Source: Linters/SAST tools

@scanoss-qg scanoss-qg left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/copyright.c (1)

39-58: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Helper returns string literals as char *.

copyright_id_to_source_name returns string literals through a non-const char *. Returning a const char * would better signal that callers must not mutate the result. Minor and optional.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/copyright.c` around lines 39 - 58, The helper copyright_id_to_source_name
currently returns string literals through a non-const char *; update its return
type to const char * and keep the switch cases returning the same literals. Also
check any callers in copyright.c that consume this helper so they accept a
const-qualified result without casting away const.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/main.c`:
- Around line 494-497: Validate the `--max-file-content-size` option in the
`case 262` handler before assigning `max_file_content_size`, because `strtoull`
currently accepts non-numeric input and zero, which can unintentionally disable
all file scanning in `mz_get_key`. Add input checking in the argument parsing
path in `main` so invalid or zero values are either rejected with an error or
explicitly treated as unlimited, and keep the existing `scanlog` message
consistent with the chosen behavior.

In `@src/mz.c`:
- Around line 104-113: The size guard in mz.c is fine, but a zero value for
max_file_content_size will cause every file to be rejected here. Update the
--max-file-content-size handling in main.c (case 262 / the option parsing that
sets max_file_content_size) to validate the parsed value and reject or default
invalid/empty input so this variable is never left at 0. Make the validation
happen before mz.c uses the limit, and keep the existing rejection path
unchanged.

---

Nitpick comments:
In `@src/copyright.c`:
- Around line 39-58: The helper copyright_id_to_source_name currently returns
string literals through a non-const char *; update its return type to const char
* and keep the switch cases returning the same literals. Also check any callers
in copyright.c that consume this helper so they accept a const-qualified result
without casting away const.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 033abae6-ed9d-4a84-9b51-933a529b0365

📥 Commits

Reviewing files that changed from the base of the PR and between 2d11156 and ffdad17.

📒 Files selected for processing (6)
  • inc/limits.h
  • src/copyright.c
  • src/help.c
  • src/limits.c
  • src/main.c
  • src/mz.c
✅ Files skipped from review due to trivial changes (1)
  • src/help.c

Comment thread src/main.c
Comment on lines +494 to +497
case 262: /* --max-file-content-size (value in MB) */
max_file_content_size = strtoull(optarg, NULL, 10) * 1024 * 1024;
scanlog("Max file content size set to %lu MB\n", (unsigned long) (max_file_content_size / (1024 * 1024)));
break;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Validate --max-file-content-size input.

strtoull returns 0 for non-numeric input, and an explicit 0 is accepted as-is. Either case sets max_file_content_size = 0, which makes mz_get_key() (src/mz.c Line 105) reject every file with the "exceeds the maximum allowed" error. Consider rejecting 0/unparseable values, or treating 0 as "unlimited".

🛡️ Suggested guard
 			case 262: /* --max-file-content-size (value in MB) */
-				max_file_content_size = strtoull(optarg, NULL, 10) * 1024 * 1024;
+			{
+				char *endptr = NULL;
+				unsigned long long mb = strtoull(optarg, &endptr, 10);
+				if (endptr == optarg || *endptr != '\0' || mb == 0)
+				{
+					fprintf(stderr, "Invalid --max-file-content-size value: %s\n", optarg);
+					exit(EXIT_FAILURE);
+				}
+				max_file_content_size = mb * 1024 * 1024;
+			}
 				scanlog("Max file content size set to %lu MB\n", (unsigned long) (max_file_content_size / (1024 * 1024)));
 				break;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
case 262: /* --max-file-content-size (value in MB) */
max_file_content_size = strtoull(optarg, NULL, 10) * 1024 * 1024;
scanlog("Max file content size set to %lu MB\n", (unsigned long) (max_file_content_size / (1024 * 1024)));
break;
case 262: /* --max-file-content-size (value in MB) */
{
char *endptr = NULL;
unsigned long long mb = strtoull(optarg, &endptr, 10);
if (endptr == optarg || *endptr != '\0' || mb == 0)
{
fprintf(stderr, "Invalid --max-file-content-size value: %s\n", optarg);
exit(EXIT_FAILURE);
}
max_file_content_size = mb * 1024 * 1024;
}
scanlog("Max file content size set to %lu MB\n", (unsigned long) (max_file_content_size / (1024 * 1024)));
break;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main.c` around lines 494 - 497, Validate the `--max-file-content-size`
option in the `case 262` handler before assigning `max_file_content_size`,
because `strtoull` currently accepts non-numeric input and zero, which can
unintentionally disable all file scanning in `mz_get_key`. Add input checking in
the argument parsing path in `main` so invalid or zero values are either
rejected with an error or explicitly treated as unlimited, and keep the existing
`scanlog` message consistent with the chosen behavior.

Comment thread src/mz.c
Comment on lines +104 to +113
/* Reject files whose content exceeds the configured maximum size */
if (job.data_ln > max_file_content_size)
{
fprintf(stderr, "File content size (%.2f MB) exceeds the maximum allowed (%lu MB). Use --max-file-content-size to change the limit.\n", (double) job.data_ln / (1024 * 1024), (unsigned long) (max_file_content_size / (1024 * 1024)));
free(job.data);
free(job.key);
free(job.mz);
exit(EXIT_FAILURE);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Limit check looks correct; note max_file_content_size == 0 rejects all files.

The decompressed-size guard and cleanup ordering are correct. Be aware the effective behavior depends on input validation for --max-file-content-size; a 0 value (from invalid/empty input) would reject every file here. See the related comment in src/main.c (case 262).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/mz.c` around lines 104 - 113, The size guard in mz.c is fine, but a zero
value for max_file_content_size will cause every file to be rejected here.
Update the --max-file-content-size handling in main.c (case 262 / the option
parsing that sets max_file_content_size) to validate the parsed value and reject
or default invalid/empty input so this variable is never left at 0. Make the
validation happen before mz.c uses the limit, and keep the existing rejection
path unchanged.

@mscasso-scanoss mscasso-scanoss merged commit af05e96 into main Jun 25, 2026
3 checks passed
@mscasso-scanoss mscasso-scanoss deleted the 5.4.26 branch June 25, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants