
feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser.#1273

Merged
davidlion merged 107 commits into y-scope:main from davidlion:capture-support on Jan 29, 2026

Conversation

@davidlion

@davidlion davidlion commented Aug 28, 2025

Member

Description

Previously, when parsing with log surgeon, the full match of a schema rule's regex pattern was stored as a variable in CLP. In certain cases, this differed from the heuristic parser's behaviour.

For example, the heuristic parser's "equals" rule can be represented with the regex pattern .*=(?<var>.*[a-zA-Z0-9].*). The heuristic parser stores only the var capture group as a variable and stores the prefix .*= as static text. Without capture groups, log surgeon could not reproduce this behaviour because it stored the full match (including the prefix .*=) as a variable.

This PR allows schema rules to contain at most one capture group. If a capture group is present, only the capture's match is stored as a variable, and the text surrounding it is stored as static text. If the capture is repeated (e.g., text(?<var>variable)+text), all repetitions are stored together as a single variable.
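A minimal sketch of the split described above, using std::regex as a stand-in for log surgeon. The default ECMAScript grammar of std::regex does not accept the named-group syntax (?<var>...), so the sketch uses an unnamed group instead. SplitMatch and split_on_capture are hypothetical names for illustration, not part of CLP or log surgeon.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type: the static text before the capture, the captured
// variable, and the static text after it.
struct SplitMatch {
    std::string prefix;
    std::string variable;
    std::string suffix;
};

// Splits `token` around the first capture group of `rule`. Returns std::nullopt
// if the rule doesn't match the whole token or the group didn't participate.
auto split_on_capture(std::regex const& rule, std::string const& token)
        -> std::optional<SplitMatch> {
    std::smatch match;
    if (false == std::regex_match(token, match, rule) || match.size() < 2
        || false == match[1].matched)
    {
        return std::nullopt;
    }
    auto const capture_begin{static_cast<std::size_t>(match.position(1))};
    auto const capture_end{capture_begin + static_cast<std::size_t>(match.length(1))};
    return SplitMatch{
            token.substr(0, capture_begin),  // stored as static text
            match[1].str(),                  // stored as the variable
            token.substr(capture_end)        // stored as static text
    };
}
```

With the rule .*=(.*[a-zA-Z0-9].*) and the token user=alice42, the sketch yields the prefix user= and the variable alice42, with an empty suffix.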

To properly support escaping placeholder characters inside static text, the method LogTypeDictionaryEntry::add_static_text was added, along with some minor refactoring and linting.
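The idea behind escaping static text can be sketched as follows. This is not the actual LogTypeDictionaryEntry::add_static_text implementation: the function name append_escaped_static_text and the placeholder and escape byte values below are assumptions chosen for illustration.

```cpp
#include <string>
#include <string_view>

// Assumed placeholder and escape characters; illustrative values only.
constexpr char cIntegerPlaceholder{'\x11'};
constexpr char cDictionaryPlaceholder{'\x12'};
constexpr char cFloatPlaceholder{'\x13'};
constexpr char cEscape{'\\'};

// Appends `text` to `logtype`, prefixing every placeholder or escape character
// with the escape character so that it can't be mistaken for a real variable
// placeholder when the logtype is decoded.
auto append_escaped_static_text(std::string& logtype, std::string_view text) -> void {
    for (char const c : text) {
        if (cIntegerPlaceholder == c || cDictionaryPlaceholder == c
            || cFloatPlaceholder == c || cEscape == c)
        {
            logtype += cEscape;
        }
        logtype += c;
    }
}
```

Static text that surrounds a capture can contain arbitrary bytes from the log, which is why it would need this treatment before being written into a logtype.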

This is a breaking change, as updating log surgeon includes some syntax changes and bug fixes that may impact existing schemas. See https://github.com/y-scope/log-surgeon/blob/193e1f91eb137bb935a7f44b13cc8dd945a8d742/docs/schema.md for more information.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Added unit tests for schema creation with single or multiple captures.

Summary by CodeRabbit

  • New Features

    • Enhanced per-token processing and static-text insertion for logtypes; message/timestamp handling modernized to use token/view semantics.
  • Bug Fixes

    • Schema validation now rejects rules with multiple capture groups and reports file/line/rule; safer handling and fallbacks for malformed or missing token type information; timestamp rules still skipped.
  • Tests

    • Added tests and test data for single-capture acceptance, multi-capture rejection, lexer/token behavior, and archive/dictionary verification.
  • Chores

    • Updated build/config settings and headers to support the new token/view flow.



@davidlion davidlion requested a review from a team as a code owner August 28, 2025 15:37
@coderabbitai

coderabbitai Bot commented Aug 28, 2025

Contributor

Walkthrough

Adds runtime schema validation enforcing a single capture group per rule; refactors Archive to consume log_surgeon token views and centralizes per-token processing via a new add_token_to_dicts helper; introduces add_static_text for logtype/dictionary entries; updates tests, fixtures, configs, and various accessors/optional usage.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Schema validation | components/core/src/clp/Utils.cpp | Adds #include <string> and runtime checks in both loops over schema_vars in load_lexer_from_file to compute num_captures from rule->m_regex_ptr->get_subtree_positive_captures().size(); throws std::runtime_error (file, line, rule) when num_captures > 1; preserves skip for rules named "timestamp" and comments single-capture limitation. |
| Archive writer — token-centric refactor | components/core/src/clp/streaming_archive/writer/Archive.cpp, components/core/src/clp/streaming_archive/writer/Archive.hpp | Adds add_token_to_dicts(log_surgeon::LogEventView const&, log_surgeon::Token) and changes write_msg_using_schema to accept log_surgeon::LogEventView const&; centralizes per-token handling (newline/uncaught/int/float/capture/default), migrates to log_surgeon token accessors (get_start_pos/get_end_pos/get_length/get_delimiter/get_type_ids), updates timestamp/pattern handling, and adds error/logging for missing type IDs or capture resolution. |
| Tests, fixtures and scaffolding | components/core/tests/test-ParserWithUserSchema.cpp, components/core/tests/test_schema_files/*, components/core/tests/test_log_files/log_with_capture.txt | Adds CLP test helpers (run_clp_compress, path helpers), swaps legacy test plumbing for CLP wrappers and get_type_ids() accessors, adds tests for single-capture success and multi-capture error, verifies archive/dictionary contents, and adds fixtures (single_capture_group.txt, multiple_capture_groups.txt, log_with_capture.txt). |
| LogType / Dictionary entry API additions | components/core/src/clp/LogTypeDictionaryEntry.cpp, components/core/src/clp/LogTypeDictionaryEntry.hpp, components/core/src/clp/LogTypeDictionaryEntryReq.hpp, components/core/src/clp_s/DictionaryEntry.cpp, components/core/src/clp_s/DictionaryEntry.hpp | Introduces add_static_text(std::string_view) and add_escape() requirement; refactors constant/static-text handling to use add_static_text, replaces inline escape-manipulation with add_escape() calls, and converts parse_next_var to trailing-return style. |
| Accessor & optional refactors | components/core/src/clp/GrepCore.cpp, components/core/src/clp_s/log_converter/LogConverter.cpp | Replaces direct token member access (m_type_ids_ptr->at(0)) with get_type_ids()->at(0) and replaces nullptr timestamp checks with std::optional usage (if (auto ts{event.get_timestamp()}; ts.has_value())) with optional-based slicing. |
| Config & deps updates | components/core/config/schemas.txt, taskfiles/deps/main.yaml | Changes equals schema entry to equals:.*=(?<var>.*[a-zA-Z0-9].*) to expose a var capture; adds G_YSTDLIB_LIB_NAME and ystdlib dep, updates log-surgeon tarball URL/SHA and CMake args to include ystdlib. |
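The single-capture check summarized in the schema validation row can be approximated as below. The PR itself counts captures through the regex AST (get_subtree_positive_captures()); this sketch instead scans the pattern text, which is an assumption made to keep the example self-contained. count_named_captures and validate_rule are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Counts named capture groups of the form `(?<name>...)` by scanning the pattern
// text. Characters preceded by a backslash are skipped. This is a rough text
// scan, not a regex parser: it would miscount `(?<` inside a character class.
auto count_named_captures(std::string_view pattern) -> std::size_t {
    std::size_t count{0};
    for (std::size_t i{0}; i < pattern.size(); ++i) {
        if ('\\' == pattern[i]) {
            ++i;  // Skip the escaped character
            continue;
        }
        if ("(?<" == pattern.substr(i, 3)) {
            ++count;
        }
    }
    return count;
}

// Throws if the rule contains more than one capture group.
auto validate_rule(std::string_view rule_name, std::string_view pattern) -> void {
    auto const num_captures{count_named_captures(pattern)};
    if (num_captures > 1) {
        throw std::runtime_error(
                "Rule '" + std::string{rule_name} + "' has " + std::to_string(num_captures)
                + " capture groups; at most 1 is supported."
        );
    }
}
```

Rejecting the schema at load time, rather than during compression, means an invalid rule is reported before any archive is written.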

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Caller
  participant Archive as Archive::write_msg_using_schema
  participant TS as TimestampPattern
  participant TokenProc as Archive::add_token_to_dicts
  participant Dict as Dictionaries
  Caller->>Archive: write_msg_using_schema(log_surgeon::LogEventView)
  Archive->>TS: search_known_ts_patterns(buffer, start, end)
  TS-->>Archive: pattern result or none
  loop each token
    Archive->>TokenProc: add_token_to_dicts(log_view, token)
    alt token is delimiter/newline
      TokenProc-->>Dict: append static text / delimiter
    else token is uncaught-string/int/float
      TokenProc-->>Dict: add scalar or dict-var entry
    else token has capture group
      TokenProc->>Dict: add pre-capture constant
      TokenProc->>Dict: add encoded capture substring (dict lookup)
      TokenProc->>Dict: add post-capture constant
    else no capture
      TokenProc-->>Dict: add whole token as variable
    end
  end
  Archive-->>Caller: complete
sequenceDiagram
  autonumber
  participant Loader as Schema Loader
  participant Utils as Utils::load_lexer_from_file
  participant Rule as Schema Rule (regex)
  Loader->>Utils: load_lexer_from_file(schema_path)
  loop for each schema_var rule
    Utils->>Rule: rule->m_regex_ptr->get_subtree_positive_captures()
    Rule-->>Utils: capture list
    alt captures > 1
      Utils-->>Loader: throw std::runtime_error(file,line,rule_name,capture_count)
    else
      Utils-->>Loader: continue
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.04% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding support for single capture groups in schema rules to achieve parity with the heuristic parser. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fbd5a12 and a7a0be1.

📒 Files selected for processing (5)
  • components/core/src/clp/Utils.cpp (2 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
  • components/core/tests/test-ParserWithUserSchema.cpp (1 hunks)
  • components/core/tests/test_schema_files/multiple_capture_groups.txt (1 hunks)
  • components/core/tests/test_schema_files/single_capture_group.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-ParserWithUserSchema.cpp
  • components/core/src/clp/Utils.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (2)
components/core/tests/test-ParserWithUserSchema.cpp (2)
components/core/src/clp/Utils.cpp (2)
  • load_lexer_from_file (125-254)
  • load_lexer_from_file (126-126)
components/core/src/clp/Utils.hpp (1)
  • load_lexer_from_file (52-55)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
  • timestamp (146-146)
  • logtype_id (229-232)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
  • timestamp (89-95)
components/core/src/clp/EncodedVariableInterpreter.hpp (3)
  • encoded_var (52-52)
  • encoded_var (111-111)
  • id (51-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: antlr-code-committed (macos-15)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp/Utils.cpp (1)

11-11: Header addition looks good.

Adding <string> resolves downstream std::string usage here.

components/core/src/clp/streaming_archive/writer/Archive.cpp (4)

362-368: Potential off-by-one in uncompressed byte count (verify semantics).

If m_end_pos is exclusive, end_pos - start_pos is correct; if inclusive, add +1. Please verify Token.m_end_pos semantics, especially across buffer wrap.

Consider a focused unit test with a two-token message exercising boundary conditions (start_pos == end_pos and wrap).
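The boundary cases mentioned here can be sketched with a small helper. It assumes the end position is exclusive and that the token lives in a circular buffer of known size; whether log surgeon's Token actually behaves this way is what the comment asks to verify. token_length is a hypothetical name.

```cpp
#include <cstddef>

// Length of a token stored in a circular buffer of `buffer_size` bytes, assuming
// `end_pos` is exclusive. When the token wraps past the end of the buffer,
// `end_pos` is smaller than `start_pos`. Equal positions are treated as an
// empty token.
auto token_length(std::size_t start_pos, std::size_t end_pos, std::size_t buffer_size)
        -> std::size_t {
    if (end_pos >= start_pos) {
        return end_pos - start_pos;
    }
    return buffer_size - start_pos + end_pos;
}
```

If the end position were inclusive instead, each result would need one more byte, which is the off-by-one the comment is concerned about.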


371-382: Good: token_type made const and delimiter handling left intact.

No issues spotted; aligns with existing flow.


485-496: OK: zero-initialised logtype_id and downstream writes.

This aligns with safer initialisation and existing dictionary API.


317-341: No signature mismatch found
Header and implementation both declare write_msg_using_schema(log_surgeon::LogEventView const&); no action required.

components/core/tests/test_schema_files/single_capture_group.txt (1)

1-1: Fixture is minimal and appropriate.

Covers the intended single-capture scenario with surrounding literals.

components/core/tests/test_schema_files/multiple_capture_groups.txt (1)

1-1: Good negative fixture.

Triggers the >1 capture validation path as desired.

components/core/tests/test-ParserWithUserSchema.cpp (1)

195-204: Exact error assertion is OK; keep in sync if message changes.

If you accept the optional richer error in Utils.cpp, adjust this expectation accordingly (or match a substring).

Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
Comment thread components/core/src/clp/Utils.cpp
Comment thread components/core/tests/test-ParserWithUserSchema.cpp
Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
Comment thread components/core/tests/test_schema_files/multiple_capture_groups.txt
Comment thread components/core/tests/test_schema_files/single_capture_group.txt Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

362-366: Guard against empty buffer when computing end_pos.

pos() − 1 will underflow if the buffer is empty. Unlikely, but add a defensive check/assert.

Would you add a precondition (e.g., assert(log_output_buffer->pos() > 0)) before using pos() − 1?
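The underflow risk can be illustrated as follows, assuming pos() returns an unsigned type such as std::size_t (the source does not state its type). last_valid_pos is a hypothetical helper that returns an empty optional instead of wrapping around.

```cpp
#include <cstddef>
#include <limits>
#include <optional>

// With unsigned arithmetic, 0 - 1 wraps around to the largest representable
// value rather than becoming negative.
static_assert(std::size_t{0} - 1 == std::numeric_limits<std::size_t>::max());

// Returns the index of the last written byte, or std::nullopt if nothing has
// been written yet.
auto last_valid_pos(std::size_t pos) -> std::optional<std::size_t> {
    if (0 == pos) {
        return std::nullopt;
    }
    return pos - 1;
}
```

An assert, as suggested in the comment, documents the same precondition but is compiled out in release builds, so an explicit check like this one is the stricter option.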


391-405: Fix: track var_ids for int/float dictionary fallbacks and follow negation style.

When integer/float cannot be encoded, you add a dictionary entry but don’t push the var ID into m_var_ids, which breaks segment indexing. Also, prefer “false == …” per repo style.

-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_int_var();
                 }
                 m_encoded_vars.push_back(encoded_var);
@@
-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_float_var();
                 }
                 m_encoded_vars.push_back(encoded_var);

Also applies to: 407-421
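The invariant behind this suggestion is that every dictionary entry added for a message must also have its ID recorded. A simplified sketch, with MessageState and add_int_token as hypothetical stand-ins for the archive writer's members and with the dictionary ID stored directly instead of going through EncodedVariableInterpreter::encode_var_dict_id:

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical stand-in for the archive writer's per-message state.
struct MessageState {
    std::vector<std::string> var_dict;       // stand-in for m_var_dict
    std::vector<std::size_t> var_ids;        // stand-in for m_var_ids
    std::vector<std::int64_t> encoded_vars;  // stand-in for m_encoded_vars
};

// Encodes an integer token directly if it fits in 64 bits; otherwise falls back
// to a dictionary entry and records the new entry's ID.
auto add_int_token(MessageState& state, std::string_view token) -> void {
    std::int64_t value{};
    auto const* begin{token.data()};
    auto const* end{token.data() + token.size()};
    auto const [ptr, ec]{std::from_chars(begin, end, value)};
    if (std::errc{} == ec && ptr == end) {
        state.encoded_vars.push_back(value);
        return;
    }
    // Fallback path: the token is not representable, so store it in the
    // dictionary. The ID must be recorded too, or var_ids falls out of sync.
    auto const id{state.var_dict.size()};
    state.var_dict.emplace_back(token);
    state.var_ids.push_back(id);
    state.encoded_vars.push_back(static_cast<std::int64_t>(id));
}
```

A token such as 99999999999999999999999 overflows 64 bits and takes the fallback path, which is the case the review comment says was missing the push_back.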

♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

431-440: Also treat empty capture-id vectors as “no capture.”

has_value() can hold an empty vector; later at(0) would throw. Handle empty as the no‑capture path.

-                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
-                if (false == capture_ids.has_value()) {
+                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
+                if (false == capture_ids.has_value() || capture_ids->empty()) {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
                     m_var_ids.push_back(id);
                     m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                     m_logtype_dict_entry.add_dictionary_var();
 
                     break;
                 }
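The distinction between an absent optional and an optional holding an empty vector can be shown in isolation. CaptureIds and first_capture_id are hypothetical names; the element type is an assumption.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using CaptureIds = std::optional<std::vector<std::uint32_t>>;

// Returns the first capture ID, treating both "no value" and "empty vector" as
// "no capture". Calling at(0) on an empty vector would throw std::out_of_range.
auto first_capture_id(CaptureIds const& capture_ids) -> std::optional<std::uint32_t> {
    if (false == capture_ids.has_value() || capture_ids->empty()) {
        return std::nullopt;
    }
    return capture_ids->front();
}
```

The order of the two checks matters: has_value() must be tested first, since dereferencing an empty optional with -> is undefined behaviour.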

442-454: Validate register positions before using front()/back().

get_reversed_reg_positions(...) may return empty; front()/back() would be UB. Throw a clear error if empty.

-                auto const [start_reg_id, end_reg_id]{register_ids.value()};
-                auto const capture_start{token.get_reversed_reg_positions(start_reg_id).back()};
-                auto const capture_end{token.get_reversed_reg_positions(end_reg_id).front()};
+                auto const [start_reg_id, end_reg_id]{register_ids.value()};
+                auto const& start_positions = token.get_reversed_reg_positions(start_reg_id);
+                auto const& end_positions = token.get_reversed_reg_positions(end_reg_id);
+                if (start_positions.empty() || end_positions.empty()) {
+                    throw(std::runtime_error(
+                            "Empty register positions for variable's capture group. Full token: "
+                            + token.to_string()
+                    ));
+                }
+                auto const capture_start{start_positions.back()};
+                auto const capture_end{end_positions.front()};
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc591cd and 03a1f02.

📒 Files selected for processing (2)
  • components/core/src/clp/Utils.cpp (2 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/Utils.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
  • timestamp (146-146)
  • logtype_id (229-232)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
  • timestamp (89-95)
components/core/src/clp/EncodedVariableInterpreter.cpp (6)
  • convert_string_to_representable_integer_var (24-61)
  • convert_string_to_representable_integer_var (24-27)
  • convert_string_to_representable_float_var (63-142)
  • convert_string_to_representable_float_var (63-66)
  • encode_var_dict_id (199-201)
  • encode_var_dict_id (199-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (3)
components/core/src/clp/Utils.cpp (1)

11-11: LGTM: include is appropriate.

Needed for the new error message construction.

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

5-10: LGTM: header additions are appropriate.

These headers match the new usage and remove transitive‑include reliance.

Also applies to: 15-15


479-491: LGTM: correct style and write path.

Style matches guideline (false == …) and the write/update path is consistent.

Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
Comment thread components/core/src/clp/Utils.cpp

@SharafMohamed SharafMohamed left a comment

Contributor

Everything looks good, but it would be nice to have a small example log and a unit test to check correctness.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 335-369: The int/float fallback branches add the token string to
m_var_dict but never record the produced variable id into the segment index
list, so add m_var_ids.push_back(id) after m_var_dict.add_entry(...) in both
branches (inside the TokenInt and TokenFloat fallback paths) so the id used to
create encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id) is also
appended to m_var_ids to keep segment indices in sync with m_var_dict entries.
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

379-391: Guard against empty capture‑ID vectors before at(0).

captures.has_value() can still hold an empty vector, so at(0) will throw. Treat empty as “no capture”.

🩹 Proposed fix
-            if (false == captures.has_value()) {
+            if (false == captures.has_value() || captures->empty()) {
                 variable_dictionary_id_t id{};
                 m_var_dict.add_entry(token_view.to_string(), id);
                 m_var_ids.push_back(id);
                 m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                 m_logtype_dict_entry.add_dictionary_var();
                 break;
             }

-            auto capture_pos_result{
-                    log_view.get_capture_position(token_view, captures.value().at(0))
-            };
+            auto capture_pos_result{
+                    log_view.get_capture_position(token_view, captures->at(0))
+            };

Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 381-392: The code assumes captures.value() has at least one
element but captures from lexer.get_captures_from_rule_id(token_type) may be
present yet empty; update the guard to treat an empty vector as "no capture"
(e.g., check !captures.has_value() || captures->empty()) before calling
captures.value().at(0) so the existing fallback logic that uses
variable_dictionary_id_t id{}, m_var_dict.add_entry(...),
m_var_ids.push_back(id),
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)),
and m_logtype_dict_entry.add_dictionary_var() is executed for empty captures as
well; ensure log_view.get_capture_position is only called when a non-empty
captures vector is available.
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

452-455: Consider making m_old_ts_pattern a const pointer to avoid const_cast.

The const_cast removes the const qualifier from timestamp_pattern. If m_old_ts_pattern is only used for comparison, declare it as TimestampPattern const* in the header to eliminate the cast and avoid potential undefined behaviour if the underlying data is in read-only memory.

Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated
@davidlion davidlion requested a review from gibber9809 January 23, 2026 04:04

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 328-331: Replace the unnecessary allocation by calling
token.to_string_view() instead of token.to_string() when passing to
m_logtype_dict_entry.add_static_text in the TokenNewline/TokenUncaughtString
case; update the branch handling
static_cast<int>(log_surgeon::SymbolId::TokenNewline) and TokenUncaughtString to
use token.to_string_view() so it matches the other cases that call
add_static_text(std::string_view).
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

455-458: const_cast discards const qualifier; consider making m_old_ts_pattern a const pointer.

The const_cast removes the const qualifier from timestamp_pattern. If m_old_ts_pattern is only used for comparison purposes (as it appears here), declare it as TimestampPattern const* in the header to avoid the cast. Using const_cast is a code smell and could lead to undefined behaviour if the underlying data is in read-only memory.

In Archive.hpp, change the member declaration:

-    TimestampPattern* m_old_ts_pattern;
+    TimestampPattern const* m_old_ts_pattern;

Then remove the const_cast here:

         if (m_old_ts_pattern != timestamp_pattern) {
             change_ts_pattern(timestamp_pattern);
-            m_old_ts_pattern = const_cast<TimestampPattern*>(timestamp_pattern);
+            m_old_ts_pattern = timestamp_pattern;
         }
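The suggested change can be illustrated with a stripped-down tracker. TimestampPattern and PatternTracker below are hypothetical stand-ins, not the real CLP classes; the point is only that a pointer-to-const member can be compared and reassigned without a cast.

```cpp
// Hypothetical stand-in for clp::TimestampPattern.
struct TimestampPattern {
    int id;
};

class PatternTracker {
public:
    // Returns true if `pattern` differs from the previously seen pattern, and
    // remembers it for the next call.
    auto update(TimestampPattern const* pattern) -> bool {
        if (m_old_pattern == pattern) {
            return false;
        }
        m_old_pattern = pattern;  // No const_cast needed
        return true;
    }

private:
    // Pointer to const: the pointer itself can be reassigned, but the pointee
    // can't be modified through it.
    TimestampPattern const* m_old_pattern{nullptr};
};
```

This only works if nothing modifies the pattern through the stored pointer, which is the condition the review comment attaches to the suggestion.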

Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 459-461: The else branch that assigns timestamp_pattern = nullptr
is redundant because timestamp_pattern is already initialized to nullptr
earlier; remove the entire else clause so you don't reassign the pointer
unnecessarily; locate the timestamp_pattern variable usage in Archive.cpp and
delete the else { timestamp_pattern = nullptr; } block, leaving the prior
initialization (line ~437) as the single default assignment.

Comment thread components/core/src/clp/streaming_archive/writer/Archive.cpp Outdated

@gibber9809 gibber9809 left a comment

Contributor

A few extremely minor nits for docstrings, but besides that LGTM.

Comment thread components/core/src/clp/LogTypeDictionaryEntry.hpp Outdated
Comment thread components/core/src/clp/LogTypeDictionaryEntryReq.hpp Outdated
Comment thread components/core/src/clp/LogTypeDictionaryEntryReq.hpp Outdated
Comment thread components/core/src/clp_s/DictionaryEntry.hpp Outdated
davidlion and others added 2 commits January 28, 2026 16:49
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
@davidlion davidlion requested a review from gibber9809 January 28, 2026 23:43

@gibber9809 gibber9809 left a comment

Contributor

LGTM. For PR title I can't think of anything much better off the top of my head, but maybe achieve parity or reach parity instead of have parity would be better?

@davidlion davidlion changed the title feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser. Jan 29, 2026
@davidlion

davidlion commented Jan 29, 2026

Member Author

LGTM. For PR title I can't think of anything much better off the top of my head, but maybe achieve parity or reach parity instead of have parity would be better?

Makes sense to me. Double checked for any other improvements with the big boss too.
