Skip to content

fix(log-surgeon): Fix schema regex patterns in the heuristic schema file and LogConverter; Add a unit test to build a parser for the heuristic schema file.#1948

Merged
davidlion merged 6 commits into
y-scope:mainfrom
davidlion:schema-fix
Feb 3, 2026

Conversation

@davidlion

@davidlion davidlion commented Feb 2, 2026

Copy link
Copy Markdown
Member

Description

#1273 updated log-surgeon which included a few syntax changes. Unfortunately, some locations that need updating were missed.

This PR:

  • Updates the delimiter escape characters in the heuristic schema file (components/core/config/schemas.txt) and LogConverter.
  • Updates the LogConverter time stamp pattern to use the header + timestamp capture group syntax.
  • Groups part of the LogConverter time stamp pattern in a sub expression due to the change in alteration behaviour.
  • Adds a unit test to build the heuristic schema file to change any syntax breaks (the accuracy of the heuristic schema will be tested in the integration tests).

Technically, this isn't a breaking change as the old syntax behaviour was already broken by #1273, this PR just updates the syntax.

Updating the syntax exposed what appears to be an issue with log-surgeon, as the schema variable equals:.*=(?<var>.*[a-zA-Z0-9].*) causes the parser to fail to finish building. The work around in this PR (equals:[a-zA-Z0-9]+=(?<var>.*[a-zA-Z0-9].*)) should cover the vast majority of cases as [a-zA-Z0-9]+ should be very close to "any non-delimiter characters" (i.e. .*).

The issue in log surgeon is tracked by y-scope/log-surgeon#207.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

CI passing including new test. LogConverter tested manually.

Summary by CodeRabbit

  • Bug Fixes

    • Improved timestamp recognition by adjusting parsing to more reliably locate timestamps.
    • Refined delimiter handling to accept additional literal characters.
    • Tightened key–value parsing to require an identifier before '=' for more accurate captures.
  • Tests

    • Added test coverage for creating a log parser from configuration-based schemas.

…e and `LogConverter`; Add unit test to build a parser for the heuristic schema file.
@davidlion davidlion requested review from a team and gibber9809 as code owners February 2, 2026 02:05
@coderabbitai

coderabbitai Bot commented Feb 2, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Updated schema delimiter character class and tightened the equals rule; corresponding regex adjustments made in the C++ log converter. Added a test helper to locate the config directory and a new test that builds a log parser from config/schemas.txt.

Changes

Cohort / File(s) Summary
Schema
components/core/config/schemas.txt
Adjusted delimiter character class (removed escaping for brackets/parentheses and removed some symbols); tightened equals rule to require a leading alphanumeric sequence before =.
Log converter
components/core/src/clp_s/log_converter/LogConverter.cpp
Reworked timestamp regex to use a header:(?<timestamp>...) named capture and updated delimiter character class to match schema changes.
Tests
components/core/tests/test-ParserWithUserSchema.cpp
Added get_config_schema_files_dir() helper (computes config path relative to the test file) and a new test that constructs a log parser from config/schemas.txt and asserts no exception is thrown.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: fixing schema regex patterns in the heuristic schema file and LogConverter, and adding a unit test for parser creation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp`:
- Line 35: The raw string literal for cDelimiters in LogConverter.cpp ends
prematurely because the default raw-string terminator R"(...)" clashes with the
intended delimiter characters ')' and '"' — change cDelimiters to use a custom
raw-string delimiter (e.g., R"DELIM(... )DELIM") so the characters ')' and '"'
inside the content are preserved; update the declaration of constexpr
std::string_view cDelimiters to use that custom delimiter delimiter name to
ensure the full delimiter list from components/core/config/schemas.txt is
included unchanged.

R"(([,\.:]\d{1,9}){0,1}([ ]{0,1}(UTC){0,1}[\+\-]\d{2}(:{0,1}\d{2}){0,1}Z{0,1}){0,1})"
};
constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n\[\(:)"};
constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n[(:)"};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the current state of the file
cat -n components/core/src/clp_s/log_converter/LogConverter.cpp | sed -n '30,40p'

Repository: y-scope/clp

Length of output: 695


🏁 Script executed:

# Check if schemas.txt exists and what it contains regarding delimiters
find . -name "schemas.txt" -type f

Repository: y-scope/clp

Length of output: 91


🏁 Script executed:

# Look for any reference to cDelimiters in the codebase
rg "cDelimiters" --type cpp --type hpp -A 2 -B 2

Repository: y-scope/clp

Length of output: 82


🌐 Web query:

C++ raw string literal terminator )" how to include ) and " in raw string content

💡 Result:

In a C++ raw string literal, the only thing you must avoid inside the content is the exact closing sequence:

)delimiter"

where delimiter is whatever you chose after R" (it can be empty, up to 16 chars, and can’t contain spaces, (, ), or \). (en.cppreference.com)

So how do you include )" (and " and )) in the content?

Use a non-empty delimiter so the terminator is no longer )".

Example: include )" in the string by choosing delimiter END:

auto s = R"END(Here is )" inside the content.)END";

This works because the literal only ends at )END" (not at )"). (en.cppreference.com)

If your content might contain )END" too

Pick a delimiter that does not appear after a ) in your content, e.g.:

auto s = R"___MYDELIM___( ... any text including )" ... )___MYDELIM___";

(Or, if unavoidable, split/concatenate into multiple string literals.) (en.cppreference.com)

Citations:


🏁 Script executed:

# Let's check the schemas.txt file to see what delimiters are expected
cat components/core/config/schemas.txt | grep -i "delimiters" -A 2 -B 2

Repository: y-scope/clp

Length of output: 160


🏁 Script executed:

# Also let's search for any other references or usage of cDelimiters
rg "cDelimiters" -A 3 -B 3

Repository: y-scope/clp

Length of output: 1452


Use a custom delimiter for the raw string to preserve ) and " characters

The raw string with default terminator R"(...)" has )" as its closing sequence. Since the current content ends with (:) followed by ", the ) and " are interpreted as the closing sequence and string literal terminator, not as part of the delimiter list. Both characters are required in the expected delimiter set (from components/core/config/schemas.txt).

🛠️ Suggested fix
-constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n[(:)"};
+constexpr std::string_view cDelimiters{R"delim(delimiters: \t\r\n[(:)")delim"};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n[(:)"};
constexpr std::string_view cDelimiters{R"delim(delimiters: \t\r\n[(:)")delim"};
🤖 Prompt for AI Agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp` at line 35, The raw
string literal for cDelimiters in LogConverter.cpp ends prematurely because the
default raw-string terminator R"(...)" clashes with the intended delimiter
characters ')' and '"' — change cDelimiters to use a custom raw-string delimiter
(e.g., R"DELIM(... )DELIM") so the characters ')' and '"' inside the content are
preserved; update the declaration of constexpr std::string_view cDelimiters to
use that custom delimiter delimiter name to ensure the full delimiter list from
components/core/config/schemas.txt is included unchanged.

@davidlion davidlion changed the title fix(log-surgeon): Fix delimiter escaping in the heuristic schema file and LogConverter; Add unit test to build a parser for the heuristic schema file. fix(log-surgeon): Fix delimiter escaping in the heuristic schema file and LogConverter; Add a unit test to build a parser for the heuristic schema file. Feb 2, 2026
hex:[a-fA-F]+
hasNumber:.*\d.*
equals:.*=(?<var>.*[a-zA-Z0-9].*)
equals:[a-zA-Z0-9]+=(?<var>.*[a-zA-Z0-9].*)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly, doesn't this stop this rule from working with most unicode?

Would be fine with this as a temporary workaround as long as we create an issue to track fixing this case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, although afaik log-surgeon doesn't work properly with unicode in general.

I made y-scope/log-surgeon#207 (I forgot to add it to the PR description sorry). I'm not sure if we also want to add an issue to clp to track the log surgeon issue (lmk if you'd like that).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it's tracked somewhere that's fine with me.

Comment thread components/core/tests/test-ParserWithUserSchema.cpp Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/tests/test-ParserWithUserSchema.cpp (1)

43-57: 🧹 Nitpick | 🔵 Trivial

Reduce duplicated path resolution logic.

get_config_schema_files_dir() repeats the __FILE__canonical flow already in get_tests_dir(). Reusing get_tests_dir() keeps path logic centralised.

♻️ Proposed refactor
 auto get_config_schema_files_dir() -> std::filesystem::path {
-    std::filesystem::path const current_file_path{__FILE__};
-    return std::filesystem::canonical(current_file_path.parent_path().parent_path()) / "config";
+    return get_tests_dir().parent_path() / "config";
 }

@gibber9809 gibber9809 added the bug Something isn't working label Feb 2, 2026
@davidlion davidlion changed the title fix(log-surgeon): Fix delimiter escaping in the heuristic schema file and LogConverter; Add a unit test to build a parser for the heuristic schema file. fix(log-surgeon): Fix schema regex patterns in the heuristic schema file and LogConverter; Add a unit test to build a parser for the heuristic schema file. Feb 2, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp`:
- Around line 28-33: The regex change to cTimestampSchema (used by LogConverter)
likely altered capture group boundaries so the named capture "timestamp" may no
longer include the entire date+time and optional fractional/timezone; verify and
fix by creating a focused unit/integration test that feeds representative sample
log lines (with variants: numeric months, month names, fractional seconds,
UTC/offset/Z) into LogConverter and asserts the extracted "timestamp" equals the
full expected substring, then adjust grouping/parentheses in cTimestampSchema so
the named capture timestamp encompasses the complete date+time plus optional
fractional and timezone parts (ensure no stray outer groups cut off trailing
parts). Ensure tests reference LogConverter and cTimestampSchema so future edits
detect regressions.

Comment on lines 28 to +33
constexpr std::string_view cTimestampSchema{
R"(timestamp:(\d{2,4}[ /\-]{0,1}[ 0-9]{2}[ /\-][ 0-9]{2})|([ 0-9]{2}[ /\-])"
R"(header:(?<timestamp>((\d{2,4}[ /\-]{0,1}[ 0-9]{2}[ /\-][ 0-9]{2})|([ 0-9]{2}[ /\-])"
R"(((Jan(uary){0,1})|(Feb(ruary){0,1})|(Mar(ch){0,1})|(Apr(il){0,1})|(May)|(Jun(e){0,1})|)"
R"((Jul(y){0,1})|(Aug(ust){0,1})|(Sep(tember){0,1})|(Oct(ober){0,1})|(Nov(ember){0,1})|)"
R"((Dec(ember){0,1}))[ /\-]\d{2,4})[ T:][ 0-9]{2}:[ 0-9]{2}:[ 0-9]{2})"
R"(([,\.:]\d{1,9}){0,1}([ ]{0,1}(UTC){0,1}[\+\-]\d{2}(:{0,1}\d{2}){0,1}Z{0,1}){0,1})"
R"((Dec(ember){0,1})))[ /\-]\d{2,4})[ T:][ 0-9]{2}:[ 0-9]{2}:[ 0-9]{2})"
R"(([,\.:]\d{1,9}){0,1}([ ]{0,1}(UTC){0,1}[\+\-]\d{2}(:{0,1}\d{2}){0,1}Z{0,1}){0,1}))"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Verify timestamp capture boundaries after the regex regrouping.

The added header: prefix and closing-group changes can shift what the timestamp named capture returns; please confirm it still includes full date+time (plus optional fractional/timezone) in LogConverter, ideally with a focused sample-log test.

🤖 Prompt for AI Agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp` around lines 28 -
33, The regex change to cTimestampSchema (used by LogConverter) likely altered
capture group boundaries so the named capture "timestamp" may no longer include
the entire date+time and optional fractional/timezone; verify and fix by
creating a focused unit/integration test that feeds representative sample log
lines (with variants: numeric months, month names, fractional seconds,
UTC/offset/Z) into LogConverter and asserts the extracted "timestamp" equals the
full expected substring, then adjust grouping/parentheses in cTimestampSchema so
the named capture timestamp encompasses the complete date+time plus optional
fractional and timezone parts (ensure no stray outer groups cut off trailing
parts). Ensure tests reference LogConverter and cTimestampSchema so future edits
detect regressions.

@gibber9809 gibber9809 self-requested a review February 2, 2026 21:55

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New changes LGTM and PR title is fine as well.

Also manually validated that the log-converter is still working for a few important cases.

I found another issue with the timestamp regex where it doesn't handle some timezones correctly, but that's just an existing issue with the way the regex is written, not anything to do with log-surgeon changes, so that can be addressed in a separate PR.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title lgtm. @gibber9809 Can u create an issue to keep track of the timezone problem u mentioned?

@junhaoliao junhaoliao added this to the January 2026 milestone Feb 3, 2026
@davidlion davidlion merged commit 3c9dd82 into y-scope:main Feb 3, 2026
23 of 38 checks passed
@davidlion davidlion deleted the schema-fix branch February 3, 2026 00:53
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…ile and `LogConverter`; Add a unit test to build a parser for the heuristic schema file. (y-scope#1948)

Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants