feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser.#1273
Conversation
…a rule to have parity with the heuristic parser.
WalkthroughAdds runtime schema validation enforcing a single capture group per rule; refactors Archive to consume log_surgeon token views and centralizes per-token processing via a new add_token_to_dicts helper; introduces add_static_text for logtype/dictionary entries; updates tests, fixtures, configs, and various accessors/optional usage. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Caller as Caller
participant Archive as Archive::write_msg_using_schema
participant TS as TimestampPattern
participant TokenProc as Archive::add_token_to_dicts
participant Dict as Dictionaries
Caller->>Archive: write_msg_using_schema(log_surgeon::LogEventView)
Archive->>TS: search_known_ts_patterns(buffer, start, end)
TS-->>Archive: pattern result or none
loop each token
Archive->>TokenProc: add_token_to_dicts(log_view, token)
alt token is delimiter/newline
TokenProc-->>Dict: append static text / delimiter
else token is uncaught-string/int/float
TokenProc-->>Dict: add scalar or dict-var entry
else token has capture group
TokenProc->>Dict: add pre-capture constant
TokenProc->>Dict: add encoded capture substring (dict lookup)
TokenProc->>Dict: add post-capture constant
else no capture
TokenProc-->>Dict: add whole token as variable
end
end
Archive-->>Caller: complete
sequenceDiagram
autonumber
participant Loader as Schema Loader
participant Utils as Utils::load_lexer_from_file
participant Rule as Schema Rule (regex)
Loader->>Utils: load_lexer_from_file(schema_path)
loop for each schema_var rule
Utils->>Rule: rule->m_regex_ptr->get_subtree_positive_captures()
Rule-->>Utils: capture list
alt captures > 1
Utils-->>Loader: throw std::runtime_error(file,line,rule_name,capture_count)
else
Utils-->>Loader: continue
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
components/core/src/clp/Utils.cpp(2 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp(7 hunks)components/core/tests/test-ParserWithUserSchema.cpp(1 hunks)components/core/tests/test_schema_files/multiple_capture_groups.txt(1 hunks)components/core/tests/test_schema_files/single_capture_group.txt(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/tests/test-ParserWithUserSchema.cppcomponents/core/src/clp/Utils.cppcomponents/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (2)
components/core/tests/test-ParserWithUserSchema.cpp (2)
components/core/src/clp/Utils.cpp (2)
load_lexer_from_file(125-254)load_lexer_from_file(126-126)components/core/src/clp/Utils.hpp (1)
load_lexer_from_file(52-55)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
timestamp(146-146)logtype_id(229-232)components/core/src/clp/streaming_archive/writer/File.hpp (1)
timestamp(89-95)components/core/src/clp/EncodedVariableInterpreter.hpp (3)
encoded_var(52-52)encoded_var(111-111)id(51-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: antlr-code-committed (macos-15)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp/Utils.cpp (1)
11-11: Header addition looks good.Adding resolves downstream std::string usage here.
components/core/src/clp/streaming_archive/writer/Archive.cpp (4)
362-368: Potential off-by-one in uncompressed byte count (verify semantics).If m_end_pos is exclusive, end_pos - start_pos is correct; if inclusive, add +1. Please verify Token.m_end_pos semantics, especially across buffer wrap.
Consider a focused unit test with a two-token message exercising boundary conditions (start_pos == end_pos and wrap).
371-382: Good: token_type made const and delimiter handling left intact.No issues spotted; aligns with existing flow.
485-496: OK: zero-initialised logtype_id and downstream writes.This aligns with safer initialisation and existing dictionary API.
317-341: No signature mismatch found
Header and implementation both declare write_msg_using_schema(log_surgeon::LogEventView const&); no action required.components/core/tests/test_schema_files/single_capture_group.txt (1)
1-1: Fixture is minimal and appropriate.Covers the intended single-capture scenario with surrounding literals.
components/core/tests/test_schema_files/multiple_capture_groups.txt (1)
1-1: Good negative fixture.Triggers the >1 capture validation path as desired.
components/core/tests/test-ParserWithUserSchema.cpp (1)
195-204: Exact error assertion is OK; keep in sync if message changes.If you accept the optional richer error in Utils.cpp, adjust this expectation accordingly (or match a substring).
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
362-366: Guard against empty buffer when computing end_pos.pos() − 1 will underflow if the buffer is empty. Unlikely, but add a defensive check/assert.
Would you add a precondition (e.g., assert(log_output_buffer->pos() > 0)) before using pos() − 1?
391-405: Fix: track var_ids for int/float dictionary fallbacks and follow negation style.When integer/float cannot be encoded, you add a dictionary entry but don’t push the var ID into m_var_ids, which breaks segment indexing. Also, prefer “false == …” per repo style.
- encoded_variable_t encoded_var{}; - if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var( + encoded_variable_t encoded_var{}; + if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var( token.to_string(), encoded_var )) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token.to_string(), id); + m_var_ids.push_back(id); encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); m_logtype_dict_entry.add_dictionary_var(); } else { m_logtype_dict_entry.add_int_var(); } m_encoded_vars.push_back(encoded_var); @@ - encoded_variable_t encoded_var{}; - if (!EncodedVariableInterpreter::convert_string_to_representable_float_var( + encoded_variable_t encoded_var{}; + if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var( token.to_string(), encoded_var )) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token.to_string(), id); + m_var_ids.push_back(id); encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); m_logtype_dict_entry.add_dictionary_var(); } else { m_logtype_dict_entry.add_float_var(); } m_encoded_vars.push_back(encoded_var);Also applies to: 407-421
♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
431-440: Also treat empty capture-id vectors as “no capture.”has_value() can hold an empty vector; later at(0) would throw. Handle empty as the no‑capture path.
- auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)}; - if (false == capture_ids.has_value()) { + auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)}; + if (false == capture_ids.has_value() || capture_ids->empty()) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token.to_string(), id); m_var_ids.push_back(id); m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)); m_logtype_dict_entry.add_dictionary_var(); break; }
442-454: Validate register positions before using front()/back().get_reversed_reg_positions(...) may return empty; front()/back() would be UB. Throw a clear error if empty.
- auto const [start_reg_id, end_reg_id]{register_ids.value()}; - auto const capture_start{token.get_reversed_reg_positions(start_reg_id).back()}; - auto const capture_end{token.get_reversed_reg_positions(end_reg_id).front()}; + auto const [start_reg_id, end_reg_id]{register_ids.value()}; + auto const& start_positions = token.get_reversed_reg_positions(start_reg_id); + auto const& end_positions = token.get_reversed_reg_positions(end_reg_id); + if (start_positions.empty() || end_positions.empty()) { + throw(std::runtime_error( + "Empty register positions for variable's capture group. Full token: " + + token.to_string() + )); + } + auto const capture_start{start_positions.back()}; + auto const capture_end{end_positions.front()};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/Utils.cpp(2 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp(7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/Utils.cppcomponents/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
timestamp(146-146)logtype_id(229-232)components/core/src/clp/streaming_archive/writer/File.hpp (1)
timestamp(89-95)components/core/src/clp/EncodedVariableInterpreter.cpp (6)
convert_string_to_representable_integer_var(24-61)convert_string_to_representable_integer_var(24-27)convert_string_to_representable_float_var(63-142)convert_string_to_representable_float_var(63-66)encode_var_dict_id(199-201)encode_var_dict_id(199-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (3)
components/core/src/clp/Utils.cpp (1)
11-11: LGTM: include is appropriate.Needed for the new error message construction.
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
5-10: LGTM: header additions are appropriate.These headers match the new usage and remove transitive‑include reliance.
Also applies to: 15-15
479-491: LGTM: correct style and write path.Style matches guideline (false == …) and the write/update path is consistent.
SharafMohamed
left a comment
There was a problem hiding this comment.
Everything look good, but it would be nice to have some small example log and a unit-test to test correctness.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 335-369: The int/float fallback branches add the token string to
m_var_dict but never record the produced variable id into the segment index
list, so add m_var_ids.push_back(id) after m_var_dict.add_entry(...) in both
branches (inside the TokenInt and TokenFloat fallback paths) so the id used to
create encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id) is also
appended to m_var_ids to keep segment indices in sync with m_var_dict entries.
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
379-391: Guard against empty capture‑ID vectors beforeat(0).
captures.has_value()can still hold an empty vector, soat(0)will throw. Treat empty as “no capture”.🩹 Proposed fix
- if (false == captures.has_value()) { + if (false == captures.has_value() || captures->empty()) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token_view.to_string(), id); m_var_ids.push_back(id); m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)); m_logtype_dict_entry.add_dictionary_var(); break; } - auto capture_pos_result{ - log_view.get_capture_position(token_view, captures.value().at(0)) - }; + auto capture_pos_result{ + log_view.get_capture_position(token_view, captures->at(0)) + };
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 381-392: The code assumes captures.value() has at least one
element but captures from lexer.get_captures_from_rule_id(token_type) may be
present yet empty; update the guard to treat an empty vector as "no capture"
(e.g., check !captures.has_value() || captures->empty()) before calling
captures.value().at(0) so the existing fallback logic that uses
variable_dictionary_id_t id{}, m_var_dict.add_entry(...),
m_var_ids.push_back(id),
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)),
and m_logtype_dict_entry.add_dictionary_var() is executed for empty captures as
well; ensure log_view.get_capture_position is only called when a non-empty
captures vector is available.
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
452-455: Consider makingm_old_ts_patternaconstpointer to avoidconst_cast.The
const_castremoves the const qualifier fromtimestamp_pattern. Ifm_old_ts_patternis only used for comparison, declare it asTimestampPattern const*in the header to eliminate the cast and avoid potential undefined behaviour if the underlying data is in read-only memory.
…ist (hopfully not possible).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 328-331: Replace the unnecessary allocation by calling
token.to_string_view() instead of token.to_string() when passing to
m_logtype_dict_entry.add_static_text in the TokenNewline/TokenUncaughtString
case; update the branch handling
static_cast<int>(log_surgeon::SymbolId::TokenNewline) and TokenUncaughtString to
use token.to_string_view() so it matches the other cases that call
add_static_text(std::string_view).
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
455-458:const_castdiscards const qualifier; consider makingm_old_ts_patternaconstpointer.The
const_castremoves the const qualifier fromtimestamp_pattern. Ifm_old_ts_patternis only used for comparison purposes (as it appears here), declare it asTimestampPattern const*in the header to avoid the cast. Usingconst_castis a code smell and could lead to undefined behaviour if the underlying data is in read-only memory.In
Archive.hpp, change the member declaration:- TimestampPattern* m_old_ts_pattern; + TimestampPattern const* m_old_ts_pattern;Then remove the
const_casthere:if (m_old_ts_pattern != timestamp_pattern) { change_ts_pattern(timestamp_pattern); - m_old_ts_pattern = const_cast<TimestampPattern*>(timestamp_pattern); + m_old_ts_pattern = timestamp_pattern; }
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 459-461: The else branch that assigns timestamp_pattern = nullptr
is redundant because timestamp_pattern is already initialized to nullptr
earlier; remove the entire else clause so you don't reassign the pointer
unnecessarily; locate the timestamp_pattern variable usage in Archive.cpp and
delete the else { timestamp_pattern = nullptr; } block, leaving the prior
initialization (line ~437) as the single default assignment.
gibber9809
left a comment
There was a problem hiding this comment.
A few extremely minor nits for docstrings, but besides that LGTM.
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
gibber9809
left a comment
There was a problem hiding this comment.
LGTM. For PR title I can't think of anything much better off the top of my head, but maybe achieve parity or reach parity instead of have parity would be better?
Makes sense to me. Double checked for any other improvements with the big boss too. |
…chema rules to achieve parity with the heuristic parser. (y-scope#1273)" This reverts commit 4246ada.
…les to achieve parity with the heuristic parser. (y-scope#1273)
Description
Previously, when using log surgeon for parsing the full match of a schema rule's regex pattern would be stored as a variable in CLP. This created differences from the heuristic parser's behaviour for certain cases.
For example, the heuristic's "equals" rule can be represented with the regex pattern
.*=(?<var>.*[a-zA-Z0-9].*). The heuristic parser will only store thevarcapture group as a variable (storing the prefix.*=as static text). When using log surgeon without capture groups this behaviour was not possible as we would store the full match (including the prefix.*=) as a variable.This PR allows schema rules to contain up to 1 capture group. If a capture group is present only the capture's match will be stored as a variable and anything surrounding it will be stored as static text. In the case where the capture is repeated (e.g.
text(?<var>variable)+text)) all repetitions will be stored together as a single variable.To properly support escaping placeholder characters inside static text the method
LogTypeDictionaryEntry::add_static_textwas added along with some minor refactoring and linting.This is a breaking change as updating log surgeon includes some syntax changes and bug fixes that may impact existing schemas. See https://github.com/y-scope/log-surgeon/blob/193e1f91eb137bb935a7f44b13cc8dd945a8d742/docs/schema.md for more information.
Checklist
breaking change.
Validation performed
Added unit tests for schema creation with single or multiple captures.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.