fix(log-surgeon): Fix schema regex patterns in the heuristic schema file and LogConverter; Add a unit test to build a parser for the heuristic schema file.#1948
Conversation
…e and `LogConverter`; Add unit test to build a parser for the heuristic schema file.
WalkthroughUpdated schema delimiter character class and tightened the Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp`:
- Line 35: The raw string literal for cDelimiters in LogConverter.cpp ends
prematurely because the default raw-string terminator R"(...)" clashes with the
intended delimiter characters ')' and '"' — change cDelimiters to use a custom
raw-string delimiter (e.g., R"DELIM(... )DELIM") so the characters ')' and '"'
inside the content are preserved; update the declaration of constexpr
std::string_view cDelimiters to use that custom delimiter delimiter name to
ensure the full delimiter list from components/core/config/schemas.txt is
included unchanged.
| R"(([,\.:]\d{1,9}){0,1}([ ]{0,1}(UTC){0,1}[\+\-]\d{2}(:{0,1}\d{2}){0,1}Z{0,1}){0,1})" | ||
| }; | ||
| constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n\[\(:)"}; | ||
| constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n[(:)"}; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the current state of the file
cat -n components/core/src/clp_s/log_converter/LogConverter.cpp | sed -n '30,40p'Repository: y-scope/clp
Length of output: 695
🏁 Script executed:
# Check if schemas.txt exists and what it contains regarding delimiters
find . -name "schemas.txt" -type fRepository: y-scope/clp
Length of output: 91
🏁 Script executed:
# Look for any reference to cDelimiters in the codebase
rg "cDelimiters" --type cpp --type hpp -A 2 -B 2Repository: y-scope/clp
Length of output: 82
🌐 Web query:
C++ raw string literal terminator )" how to include ) and " in raw string content
💡 Result:
In a C++ raw string literal, the only thing you must avoid inside the content is the exact closing sequence:
)delimiter"
where delimiter is whatever you chose after R" (it can be empty, up to 16 chars, and can’t contain spaces, (, ), or \). (en.cppreference.com)
So how do you include )" (and " and )) in the content?
Use a non-empty delimiter so the terminator is no longer )".
Example: include )" in the string by choosing delimiter END:
auto s = R"END(Here is )" inside the content.)END";This works because the literal only ends at )END" (not at )"). (en.cppreference.com)
If your content might contain )END" too
Pick a delimiter that does not appear after a ) in your content, e.g.:
auto s = R"___MYDELIM___( ... any text including )" ... )___MYDELIM___";(Or, if unavoidable, split/concatenate into multiple string literals.) (en.cppreference.com)
Citations:
- 1: https://en.cppreference.com/w/cpp/language/string_literal.html?utm_source=openai
- 2: https://en.cppreference.com/w/cpp/language/string_literal.html?utm_source=openai
- 3: https://en.cppreference.com/w/cpp/language/string_literal.html?utm_source=openai
🏁 Script executed:
# Let's check the schemas.txt file to see what delimiters are expected
cat components/core/config/schemas.txt | grep -i "delimiters" -A 2 -B 2Repository: y-scope/clp
Length of output: 160
🏁 Script executed:
# Also let's search for any other references or usage of cDelimiters
rg "cDelimiters" -A 3 -B 3Repository: y-scope/clp
Length of output: 1452
Use a custom delimiter for the raw string to preserve ) and " characters
The raw string with default terminator R"(...)" has )" as its closing sequence. Since the current content ends with (:) followed by ", the ) and " are interpreted as the closing sequence and string literal terminator, not as part of the delimiter list. Both characters are required in the expected delimiter set (from components/core/config/schemas.txt).
🛠️ Suggested fix
-constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n[(:)"};
+constexpr std::string_view cDelimiters{R"delim(delimiters: \t\r\n[(:)")delim"};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| constexpr std::string_view cDelimiters{R"(delimiters: \t\r\n[(:)"}; | |
| constexpr std::string_view cDelimiters{R"delim(delimiters: \t\r\n[(:)")delim"}; |
🤖 Prompt for AI Agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp` at line 35, The raw
string literal for cDelimiters in LogConverter.cpp ends prematurely because the
default raw-string terminator R"(...)" clashes with the intended delimiter
characters ')' and '"' — change cDelimiters to use a custom raw-string delimiter
(e.g., R"DELIM(... )DELIM") so the characters ')' and '"' inside the content are
preserved; update the declaration of constexpr std::string_view cDelimiters to
use that custom delimiter delimiter name to ensure the full delimiter list from
components/core/config/schemas.txt is included unchanged.
LogConverter; Add unit test to build a parser for the heuristic schema file.LogConverter; Add a unit test to build a parser for the heuristic schema file.
| hex:[a-fA-F]+ | ||
| hasNumber:.*\d.* | ||
| equals:.*=(?<var>.*[a-zA-Z0-9].*) | ||
| equals:[a-zA-Z0-9]+=(?<var>.*[a-zA-Z0-9].*) |
There was a problem hiding this comment.
If I'm understanding correctly, doesn't this stop this rule from working with most unicode?
Would be fine with this as a temporary workaround as long as we create an issue to track fixing this case.
There was a problem hiding this comment.
You're right, although afaik log-surgeon doesn't work properly with unicode in general.
I made y-scope/log-surgeon#207 (I forgot to add it to the PR description sorry). I'm not sure if we also want to add an issue to clp to track the log surgeon issue (lmk if you'd like that).
There was a problem hiding this comment.
As long as it's tracked somewhere that's fine with me.
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/core/tests/test-ParserWithUserSchema.cpp (1)
43-57: 🧹 Nitpick | 🔵 TrivialReduce duplicated path resolution logic.
get_config_schema_files_dir()repeats the__FILE__→canonicalflow already inget_tests_dir(). Reusingget_tests_dir()keeps path logic centralised.♻️ Proposed refactor
auto get_config_schema_files_dir() -> std::filesystem::path { - std::filesystem::path const current_file_path{__FILE__}; - return std::filesystem::canonical(current_file_path.parent_path().parent_path()) / "config"; + return get_tests_dir().parent_path() / "config"; }
LogConverter; Add a unit test to build a parser for the heuristic schema file.LogConverter; Add a unit test to build a parser for the heuristic schema file.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp`:
- Around line 28-33: The regex change to cTimestampSchema (used by LogConverter)
likely altered capture group boundaries so the named capture "timestamp" may no
longer include the entire date+time and optional fractional/timezone; verify and
fix by creating a focused unit/integration test that feeds representative sample
log lines (with variants: numeric months, month names, fractional seconds,
UTC/offset/Z) into LogConverter and asserts the extracted "timestamp" equals the
full expected substring, then adjust grouping/parentheses in cTimestampSchema so
the named capture timestamp encompasses the complete date+time plus optional
fractional and timezone parts (ensure no stray outer groups cut off trailing
parts). Ensure tests reference LogConverter and cTimestampSchema so future edits
detect regressions.
| constexpr std::string_view cTimestampSchema{ | ||
| R"(timestamp:(\d{2,4}[ /\-]{0,1}[ 0-9]{2}[ /\-][ 0-9]{2})|([ 0-9]{2}[ /\-])" | ||
| R"(header:(?<timestamp>((\d{2,4}[ /\-]{0,1}[ 0-9]{2}[ /\-][ 0-9]{2})|([ 0-9]{2}[ /\-])" | ||
| R"(((Jan(uary){0,1})|(Feb(ruary){0,1})|(Mar(ch){0,1})|(Apr(il){0,1})|(May)|(Jun(e){0,1})|)" | ||
| R"((Jul(y){0,1})|(Aug(ust){0,1})|(Sep(tember){0,1})|(Oct(ober){0,1})|(Nov(ember){0,1})|)" | ||
| R"((Dec(ember){0,1}))[ /\-]\d{2,4})[ T:][ 0-9]{2}:[ 0-9]{2}:[ 0-9]{2})" | ||
| R"(([,\.:]\d{1,9}){0,1}([ ]{0,1}(UTC){0,1}[\+\-]\d{2}(:{0,1}\d{2}){0,1}Z{0,1}){0,1})" | ||
| R"((Dec(ember){0,1})))[ /\-]\d{2,4})[ T:][ 0-9]{2}:[ 0-9]{2}:[ 0-9]{2})" | ||
| R"(([,\.:]\d{1,9}){0,1}([ ]{0,1}(UTC){0,1}[\+\-]\d{2}(:{0,1}\d{2}){0,1}Z{0,1}){0,1}))" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Verify timestamp capture boundaries after the regex regrouping.
The added header: prefix and closing-group changes can shift what the timestamp named capture returns; please confirm it still includes full date+time (plus optional fractional/timezone) in LogConverter, ideally with a focused sample-log test.
🤖 Prompt for AI Agents
In `@components/core/src/clp_s/log_converter/LogConverter.cpp` around lines 28 -
33, The regex change to cTimestampSchema (used by LogConverter) likely altered
capture group boundaries so the named capture "timestamp" may no longer include
the entire date+time and optional fractional/timezone; verify and fix by
creating a focused unit/integration test that feeds representative sample log
lines (with variants: numeric months, month names, fractional seconds,
UTC/offset/Z) into LogConverter and asserts the extracted "timestamp" equals the
full expected substring, then adjust grouping/parentheses in cTimestampSchema so
the named capture timestamp encompasses the complete date+time plus optional
fractional and timezone parts (ensure no stray outer groups cut off trailing
parts). Ensure tests reference LogConverter and cTimestampSchema so future edits
detect regressions.
gibber9809
left a comment
There was a problem hiding this comment.
New changes LGTM and PR title is fine as well.
Also manually validated that the log-converter is still working for a few important cases.
I found another issue with the timestamp regex where it doesn't handle some timezones correctly, but that's just an existing issue with the way the regex is written, not anything to do with log-surgeon changes, so that can be addressed in a separate PR.
LinZhihao-723
left a comment
There was a problem hiding this comment.
The PR title lgtm. @gibber9809 Can u create an issue to keep track of the timezone problem u mentioned?
…ile and `LogConverter`; Add a unit test to build a parser for the heuristic schema file. (y-scope#1948) Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
Description
#1273 updated log-surgeon which included a few syntax changes. Unfortunately, some locations that need updating were missed.
This PR:
components/core/config/schemas.txt) andLogConverter.LogConvertertime stamp pattern to use theheader+timestampcapture group syntax.LogConvertertime stamp pattern in a sub expression due to the change in alteration behaviour.Technically, this isn't a breaking change as the old syntax behaviour was already broken by #1273, this PR just updates the syntax.
Updating the syntax exposed what appears to be an issue with log-surgeon, as the schema variable
equals:.*=(?<var>.*[a-zA-Z0-9].*)causes the parser to fail to finish building. The work around in this PR (equals:[a-zA-Z0-9]+=(?<var>.*[a-zA-Z0-9].*)) should cover the vast majority of cases as[a-zA-Z0-9]+should be very close to "any non-delimiter characters" (i.e..*).The issue in log surgeon is tracked by y-scope/log-surgeon#207.
Checklist
breaking change.
Validation performed
CI passing including new test. LogConverter tested manually.
Summary by CodeRabbit
Bug Fixes
Tests