Fix: Weigh programming language tallies by file size (bytes) instead … Fixes #3868 #4733

Open
RajX-dev wants to merge 3 commits into aboutcode-org:develop from RajX-dev:fix-language-tallies-byte-count

Conversation

RajX-dev commented Feb 6, 2026

Fixed an issue where programming languages were tallied by file count instead of file size. As a result, projects with many small files (such as C headers) could be misidentified as C instead of C++. The tally now sums file sizes in bytes.
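To illustrate the difference, here is a minimal sketch of a byte-weighted tally. It is not the code from this PR: the function name `tally_languages_by_bytes` and the `(language, size)` input shape are hypothetical, and the fallback weight of 1 for zero-byte files follows what the commit message describes.

```python
from collections import Counter


def tally_languages_by_bytes(files):
    """Tally languages weighted by file size in bytes.

    `files` is an iterable of (language, size_in_bytes) pairs. This input
    shape is an assumption made for the example, not the project's model.
    """
    tallies = Counter()
    for language, size in files:
        if not language:
            # Skip files with no detected language.
            continue
        # Zero-byte files still count with a weight of 1.
        tallies[language] += size if size > 0 else 1
    # Largest tally first.
    return tallies.most_common()


# Five small C headers and two large C++ sources.
files = [("C", 200)] * 5 + [("C++", 40_000), ("C++", 35_000)]

by_count = Counter(language for language, _ in files).most_common()
by_bytes = tally_languages_by_bytes(files)

print(by_count)  # C leads when counting files
print(by_bytes)  # C++ leads when summing bytes
```

With file counts, the five headers outvote the two sources. With byte weights, the two large C++ files dominate.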

RajX-dev changed the title from "Fix: Weigh programming language tallies by file size (bytes) instead …" to "Fix: Weigh programming language tallies by file size (bytes) instead … Fixes #3868" on Feb 6, 2026
…of file count

Previously, language summaries were calculated from the number of files, so a language with many small files (e.g., C header files) could incorrectly appear as the primary language over a language with fewer but larger files (e.g., C++ source). This change uses file size (bytes) as the weight for the tally, so the primary language reflects the volume of code rather than the file count. Falls back to a weight of 1 if the size is 0.

Signed-off-by: Raj Shekhar <sraj4090ti@gmail.com>
@RajX-dev RajX-dev force-pushed the fix-language-tallies-byte-count branch from 66ef6c9 to da2166b Compare February 6, 2026 19:18
Signed-off-by: Raj Shekhar <sraj4090ti@gmail.com>
@RajX-dev RajX-dev force-pushed the fix-language-tallies-byte-count branch 20 times, most recently from ac76353 to c1866c4 Compare February 7, 2026 09:01
Signed-off-by: Raj Shekhar <sraj4090ti@gmail.com>
@RajX-dev RajX-dev force-pushed the fix-language-tallies-byte-count branch from c1866c4 to a941f39 Compare February 7, 2026 09:44
RajX-dev (Author) commented Feb 7, 2026

Ready for review! @DennisClark @AyanSinhaMahapatra All checks are passing. 🟢

I have finalized the changes and verified the build. Here is a summary of the fixes included in this PR:

1. Feature Implementation: Switched programming language tallies to use byte counts instead of file counts (Fixes #3868).
2. Bug Fix (Summarizer): Fixed a KeyError crash in get_declared_holders that occurred when a holder existed in key files but had been filtered out of the main tallies (0 byte count).
3. Bug Fix (CI/Threading): Patched src/scancode/interrupt.py to handle the ValueError raised when setting signal handlers. This resolves the crash on Python 3.14 (CI environment), where setting the handlers failed because the code was not running in the main thread.
4. Test Data: Regenerated the expected JSON and YAML reference files for the summarycode, formattedcode, and todo tests to match the new byte-count sorting logic.
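
Items 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the patched code: `install_handler_safely` and `count_for_holder` are hypothetical helpers, and the real logic in `src/scancode/interrupt.py` and `get_declared_holders` may differ. The sketch relies only on two standard behaviors: `signal.signal` raises `ValueError` when called outside the main thread, and `dict.get` returns a default instead of raising `KeyError`.

```python
import signal
import threading


def install_handler_safely(signum, handler):
    """Try to install a signal handler; return False if it cannot be set.

    Hypothetical helper: signal.signal() raises ValueError when it is
    called from a thread other than the main thread.
    """
    try:
        signal.signal(signum, handler)
        return True
    except ValueError:
        # Not in the main thread: skip signal setup instead of crashing.
        return False


def count_for_holder(holder, tallies_by_holder):
    """Look up a holder's tally, treating a filtered-out holder as 0.

    Hypothetical helper: .get() avoids the KeyError that a plain
    tallies_by_holder[holder] lookup raises for a missing holder.
    """
    return tallies_by_holder.get(holder, 0)


# Attempt to set a handler from a worker thread.
results = []
worker = threading.Thread(
    target=lambda: results.append(
        install_handler_safely(signal.SIGINT, signal.default_int_handler)
    )
)
worker.start()
worker.join()
print(results)  # the worker thread could not install the handler

# A holder seen in key files but absent from the main tallies.
print(count_for_holder("Example Holder", {"Other Holder": 10}))
```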
