diff --git a/.github/SECURITY.md b/.github/SECURITY.md
new file mode 100644
index 00000000..274f4244
--- /dev/null
+++ b/.github/SECURITY.md
@@ -0,0 +1,223 @@
+# Security Policy
+
+## Purpose
+
+This document defines how security issues affecting the Python release of `Hamstring` must be reported, handled, remediated, and disclosed. The goal is to ensure that security-relevant findings are managed confidentially, triaged consistently, and resolved within appropriate operational timelines.
+
+## Supported Versions
+
+Security fixes are provided for the latest release and the most recent previous minor release, as shown below.
+
+| Version | Supported |
+| ------- | --------- |
+| `main` / latest release | Yes |
+| Previous minor release | Yes |
+| Older releases | No |
+
+## Confidential Reporting
+
+Security issues must be reported through confidential channels only. Public
+GitHub issues, public pull requests, or other public forums must not be used
+for reporting vulnerabilities.
+
+Approved reporting channels:
+
+- GitHub Private Vulnerability Reporting / Security Advisories
+- Email: `pm300@uni-heidelberg.de`
+
+Reports should include, where available (an illustrative skeleton follows the list):
+
+- A concise description of the issue
+- Affected component, endpoint, container, or workflow
+- Reproduction steps or validation details
+- Expected impact and potential exploitation scenario
+- Affected version, branch, commit, or image tag
+- Relevant logs, screenshots, request samples, or proof-of-concept material
+- Whether the issue appears actively exploited or time-sensitive
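+
+An illustrative report skeleton, mirroring the list above; no particular format is required:
+
+```text
+Summary:        <one-line description of the issue>
+Component:      <affected component, endpoint, container, or workflow>
+Version:        <affected version, branch, commit, or image tag>
+Reproduction:   <steps, validation details, or PoC reference>
+Impact:         <expected impact and potential exploitation scenario>
+Time-sensitive: <actively exploited? yes / no>
+```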
+
+## Confidentiality Requirements
+
+All reported security issues are handled as confidential until review is
+complete and a coordinated disclosure decision has been made.
+
+During this period:
+
+- Issue details must not be disclosed publicly
+- Sensitive technical details must be shared only on a need-to-know basis
+- Access to internal discussion, remediation branches, and advisory drafts
+ should be restricted
+- If there is evidence of active exploitation, internal escalation should occur
+ immediately
+
+## Intake and Triage
+
+Each report is reviewed to determine:
+
+- Whether the issue is reproducible
+- Whether it affects a supported version
+- Whether there is a meaningful confidentiality, integrity, or availability impact
+- Whether exploitation requires special conditions or privileged access
+- Whether the issue represents active exploitation, a misconfiguration, or a
+ theoretical weakness without practical impact
+
+Severity should be classified as `Critical / High / Medium / Low`.
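+
+An illustrative, non-binding mapping for orientation (the policy does not fix a severity per issue class; triage decides case by case):
+
+```text
+Critical: remote code execution, credential exposure, broad unauthorized access
+High:     privilege escalation, broken access control, SSRF with internal reach
+Medium:   meaningful impact only under special conditions or privileged access
+Low:      weaknesses without a practical attack path
+```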
+
+## Response Targets
+
+The following target times apply to supported versions and valid security
+reports. These targets are operational goals, not contractual guarantees.
+
+| Stage | Target |
+| ----- | ------ |
+| Initial acknowledgement | Within 2 business days |
+| Initial triage decision | Within 5 business days |
+| First remediation update | Within 7 calendar days |
+| Ongoing status updates | At least every 7 calendar days |
+| Critical issue remediation plan | Within 7 calendar days |
+| High severity remediation plan | Within 14 calendar days |
+| Medium severity remediation plan | Within 30 calendar days |
+| Low severity remediation plan | Best effort |
+
+If a report indicates active exploitation, credential exposure, remote code
+execution, or broad unauthorized access, the issue should be escalated as an
+incident and handled with priority outside normal backlog processes.
+
+## Remediation Process
+
+When a security issue is confirmed, maintainers should:
+
+- Reproduce and validate the issue
+- Define affected versions and deployment scenarios
+- Prepare a remediation plan proportional to the severity
+- Implement and review the fix
+- Backport the fix to supported versions where feasible
+- Validate the fix before release
+- Prepare customer-facing or operator-facing guidance if configuration or
+ operational action is required
+
+Where immediate remediation is not possible, temporary mitigations should be
+documented and communicated clearly.
+
+## Disclosure and Publication
+
+Confirmed vulnerabilities are disclosed in a coordinated manner after one of
+the following conditions is met:
+
+- A fix has been released
+- A mitigation has been published and the residual risk is understood
+- A disclosure deadline has been reached and leadership approves publication
+
+The default disclosure target is `90 days`, but the actual window may be
+shortened or extended based on:
+
+- Evidence of exploitation
+- Fix availability and deployment risk
+- Customer exposure
+- Dependency or vendor coordination needs
+
+Public disclosures may include:
+
+- A security advisory
+- Release notes
+- Upgrade or mitigation instructions
+- Severity and affected-version information
+
+## Operational Communication
+
+Where a confirmed issue affects deployed environments, communication should be
+proportionate to impact. This may include:
+
+- Internal security or operations escalation
+- Notification to administrators, customers, or service owners
+- Temporary mitigation guidance
+- Required upgrade or rotation steps
+- Post-remediation confirmation and closure
+
+Security communications should avoid unnecessary disclosure of exploit details
+before mitigations are available.
+
+## Scope
+
+The following areas are considered in scope for security handling:
+
+- Authentication and authorization controls
+- Password handling and account lifecycle
+- File upload, parsing, and processing pipelines
+- Secrets handling and environment configuration
+- Data access controls and audit logging
+- Container, service, and network configuration
+- Dependency vulnerabilities with validated product impact
+- Sensitive data exposure, privilege escalation, SSRF, RCE, injection, and
+ broken access control
+
+## Out of Scope
+
+The following are generally not treated as security vulnerabilities unless
+clear and demonstrated security impact exists:
+
+- Cosmetic misconfigurations without exploitability
+- Missing hardening headers without a practical attack path
+- Issues affecting unsupported or end-of-life releases only
+- Hypothetical findings without reproducible impact
+- Third-party platform issues outside the control of this project
+- Reports based only on scanner output without technical validation
+
+## Safe Handling Expectations
+
+Anyone validating a suspected issue is expected to act in a controlled and
+minimal manner.
+
+Expected behavior:
+
+- Limit activity to what is necessary to confirm the issue
+- Avoid unauthorized access to non-public or third-party data
+- Avoid disruption of production systems
+- Avoid persistence, data modification, or data destruction
+- Stop testing and report promptly once the issue is confirmed
+
+This policy does not authorize:
+
+- Access to data belonging to other users or organizations
+- Service disruption or denial-of-service activity
+- Data exfiltration or retention of sensitive information
+- Any activity that violates applicable law or contractual obligations
+
+## Security Updates
+
+Security fixes may be distributed through one or more of the following:
+
+- Normal release process
+- Out-of-band patch release
+- Security advisory
+- Operational mitigation notice
+
+Where appropriate, the published update should include:
+
+- Affected versions
+- Fixed versions
+- Severity
+- Upgrade path
+- Required operational actions
+
+## Escalation
+
+If no acknowledgement is received within the response target above, the report
+should be resent to:
+
+- `pm300@uni-heidelberg.de`
+
+Urgent reports involving active exploitation or high-confidence compromise
+should use the subject line:
+
+`[URGENT SECURITY REPORT]`
+
+## Policy Maintenance
+
+This policy should be reviewed whenever:
+
+- Reporting channels change
+- Supported versions change
+- Incident response expectations change
+- Disclosure commitments change
+
+Last reviewed: `27-03-2026`
diff --git a/.readthedocs.yml b/.readthedocs.yml
index b241b932..336e933a 100644
--- a/.readthedocs.yml
+++ b/.readthedocs.yml
@@ -1,9 +1,9 @@
version: "2"
build:
- os: "ubuntu-22.04"
+ os: "ubuntu-24.04"
tools:
- python: "3.10"
+ python: "3.11"
jobs:
pre_build:
- sphinx-apidoc -T -M -o docs/api src/ "*/tests"
@@ -20,4 +20,4 @@ python:
- requirements: requirements/requirements.logserver.txt
sphinx:
- configuration: docs/conf.py
+ configuration: docs/conf.py
\ No newline at end of file
diff --git a/README.md b/README.md
index 5e1780a4..28f35375 100644
--- a/README.md
+++ b/README.md
@@ -16,9 +16,7 @@
-
-
-
+
HAMSTRING
@@ -34,8 +32,6 @@
-> [!CAUTION]
-> This project has been moved to https://github.com/Hamstring-NDR/hamstring. Future development, issues, and releases will be maintained there.
@@ -56,7 +52,7 @@
## About the Project
-
+
## Getting Started
@@ -68,27 +64,11 @@ HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml --profile prod up
-#### Use the dev profile for testing out changes in docke containers:
+#### Use the dev profile for testing out changes in docker containers:
```sh
HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml --profile dev up
```
-
-#### Or run the modules locally on your machine:
-```sh
-python -m venv .venv
-source .venv/bin/activate
-
-sh install_requirements.sh
-```
-Alternatively, you can use `pip install` and enter all needed requirements individually with `-r requirements.*.txt`.
-
-Now, you can start each stage, e.g. the inspector:
-
-```sh
-python src/inspector/inspector.py
-```
-
(back to top)
@@ -126,6 +106,11 @@ For more in-depth information on your options, have a look at our
[official documentation](https://hamstring.readthedocs.io/en/latest/usage.html), where we provide tables explaining all
values in detail.
+
+### Testing Your Own Data
+
+If you want to ingest data into the pipeline, you can do so via the Zeek container. Either select the interface Zeek should listen on in the `config.yaml` and set `static_analysis: false`, or provide PCAPs to Zeek by adding them to the `data/test_pcaps` directory, which is mounted by default for Zeek to ingest static data. A minimal configuration sketch follows.
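+
+A sketch of the live-capture mode (the `static_analysis` key comes from the description above; the `interface` key name and its nesting are illustrative and may differ in your `config.yaml`):
+
+```yaml
+# Live capture: listen on an interface instead of ingesting static PCAPs
+interface: eth0          # illustrative key: the interface Zeek listens on
+static_analysis: false   # disable static analysis for live capture
+```
+
+For static analysis, keep `static_analysis` enabled and place your PCAPs in `data/test_pcaps/`.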
+
### Monitoring
To monitor the system and observe its real-time behavior, multiple Grafana dashboards have been set up.
@@ -209,22 +194,17 @@ Have a look at the following pictures showing examples of how these dashboards m
To train and test our and possibly your own models, we currently rely on the following datasets:
-- [CICBellDNS2021](https://www.unb.ca/cic/datasets/dns-2021.html)
- [DGTA Benchmark](https://data.mendeley.com/datasets/2wzf9bz7xr/1)
- [DNS Tunneling Queries for Binary Classification](https://data.mendeley.com/datasets/mzn9hvdcxg/1)
- [UMUDGA - University of Murcia Domain Generation Algorithm Dataset](https://data.mendeley.com/datasets/y8ph45msv8/1)
- [DGArchive](https://dgarchive.caad.fkie.fraunhofer.de/)
+- [DNS Exfiltration](https://data.mendeley.com/datasets/c4n7fckkz3/3)
We compute all features separately and only rely on the `domain` and `class` for binary classification.
### Inserting Data for Testing
-For testing purposes, we provide multiple scripts in the `scripts` directory. Use `real_logs.dev.py` to send data from
-the datasets into the pipeline. After downloading the dataset and storing it under `/data`, run
-```sh
-python scripts/real_logs.dev.py
-```
-to start continuously inserting dataset traffic.
+For testing purposes, you can ingest PCAPs or tap network interfaces using the Zeek-based sensor in its `1.0.0` release. For more information, please refer to [the documentation](https://github.com/Hamstring-NDR/hamstring-zeek). A short static-ingestion example follows.
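+
+For example, a static run could look like this (the compose invocation is reused from the Getting Started section; the PCAP filename is a placeholder):
+
+```sh
+# Place a capture where the Zeek container ingests it (mounted by default)
+cp my_capture.pcap data/test_pcaps/
+
+# Start the pipeline; Zeek picks up PCAPs from the mounted directory
+HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml --profile prod up
+```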
### Training Your Own Models
@@ -270,7 +250,7 @@ The results will be saved per default to `./results`, if not configured otherwis
#### Model Tests
```sh
-> python src/train/train.py test --dataset <dataset> --dataset_path <dataset_path> --model <model> --model_path <model_path>
+> python src/train/train.py test --dataset <dataset> --dataset_path <dataset_path> --model <model> --model_output_path <model_output_path>
```
#### Model Explain
diff --git a/assets/hamstring.svg b/assets/hamstring.svg
new file mode 100644
index 00000000..605bdfac
--- /dev/null
+++ b/assets/hamstring.svg
@@ -0,0 +1,44 @@
+
+
+
+
diff --git a/assets/heidgaf_architecture.svg b/assets/heidgaf_architecture.svg
index bef8170d..4530acf2 100644
--- a/assets/heidgaf_architecture.svg
+++ b/assets/heidgaf_architecture.svg
@@ -1,4 +1,4 @@
-
+Log Metadata & Analysis Aggregation
\ No newline at end of file
diff --git a/assets/heidgaf_cicd.svg b/assets/heidgaf_cicd.svg
deleted file mode 100644
index 6c89576c..00000000
--- a/assets/heidgaf_cicd.svg
+++ /dev/null
@@ -1,4 +0,0 @@
-
-
-
-Triggered Workflow on GitHub
diff --git a/assets/upload_seafile.py b/assets/upload_seafile.py
deleted file mode 100644
index 8ac8074c..00000000
--- a/assets/upload_seafile.py
+++ /dev/null
@@ -1,181 +0,0 @@
-import re
-import argparse
-import sys
-import copy
-from pathlib import Path
-from urllib.parse import urlparse
-
-import requests
-from bs4 import BeautifulSoup
-
-optional_packages = True
-try:
- # optional, for upload progress updates
- from requests_toolbelt import MultipartEncoder, MultipartEncoderMonitor
- from tqdm import tqdm
-except ImportError:
- optional_packages = False
-
-
-def extract_var(script_text, variable_name, default=None):
- if variable_name in script_text:
- # match: var_name: "value" or var_name: 'value' or var_name = "value" or var_name = 'value'
- pattern = re.compile(
- r'{}\s*[:=]\s*(["\'])(.*?)\1'.format(re.escape(variable_name))
- )
- match = pattern.search(script_text)
- if match:
- return match.group(2)
- return default
-
-
-def extract_info_from_html(html_content):
- soup = BeautifulSoup(html_content, "html.parser")
- scripts = soup.find_all("script")
- token = parent_dir = repo_id = dir_name = None
- for script in scripts:
- token = extract_var(script.text, "token", token)
- parent_dir = extract_var(script.text, "path", parent_dir)
- repo_id = extract_var(script.text, "repoID", repo_id)
- dir_name = extract_var(script.text, "dirName", dir_name)
- return token, parent_dir, repo_id, dir_name
-
-
-def get_html_content(url):
- response = requests.get(url)
- return response.text
-
-
-def get_upload_url(api_url):
- response = requests.get(api_url)
- if response.status_code == 200:
- return response.json().get("upload_link")
- return None
-
-
-def get_upload_url2(api_url):
- headers = {"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"}
- response = requests.get(api_url, headers=headers)
- if response.status_code == 200:
- return response.json().get("url")
- return None
-
-
-def upload_file(upload_url, file_path, fields):
- fields = copy.deepcopy(fields)
- path = Path(file_path)
- filename = path.name
- total_size = path.stat().st_size
-
- if not optional_packages:
- with open(file_path, "rb") as f:
- fields["file"] = (filename, f)
- response = requests.post(
- upload_url, files=fields, params={"ret-json": "true"}
- )
- return response
-
- # ref: https://stackoverflow.com/a/67726532/11854304
- with tqdm(
- desc=filename,
- total=total_size,
- unit="B",
- unit_scale=True,
- unit_divisor=1024,
- ) as bar:
- with open(file_path, "rb") as f:
- fields["file"] = (filename, f)
- encoder = MultipartEncoder(fields=fields)
- monitor = MultipartEncoderMonitor(
- encoder, lambda monitor: bar.update(monitor.bytes_read - bar.n)
- )
- headers = {"Content-Type": monitor.content_type}
- response = requests.post(
- upload_url, headers=headers, data=monitor, params={"ret-json": "true"}
- )
- return response
-
-
-def upload_seafile(upload_page_link, file_path_list, replace_file, verbose):
- parsed_results = urlparse(upload_page_link)
- base_url = f"{parsed_results.scheme}://{parsed_results.netloc}"
- if verbose:
- print(f"Input:")
- print(f" * Upload page url: {upload_page_link}")
- print(f" * Files to be uploaded: {file_path_list}")
- print(f" * Replace existing files: {replace_file}")
- print(f"Preparation:")
- print(f" * Base url: {base_url}")
-
- # get html content
- html_content = get_html_content(upload_page_link)
-
- # extract variables from html content
- token, parent_dir, repo_id, dir_name = extract_info_from_html(html_content)
- if not parent_dir:
- print(f"Cannot extract parent_dir from HTML content.", file=sys.stderr)
- return 1
- if verbose:
- print(f" * dir_name: {dir_name}")
- print(f" * parent_dir: {parent_dir}")
-
- # get upload url
- upload_url = None
- if token:
- # ref: https://github.com/haiwen/seafile-js/blob/master/src/seafile-api.js#L1164
- api_url = f"{base_url}/api/v2.1/upload-links/{token}/upload/"
- upload_url = get_upload_url(api_url)
- elif repo_id:
- # ref: https://stackoverflow.com/a/38743242/11854304
- api_url = (
- upload_page_link.replace("/u/d/", "/ajax/u/d/").rstrip("/")
- + f"/upload/?r={repo_id}"
- )
- upload_url = get_upload_url2(api_url)
- if not upload_url:
- print(f"Cannot get upload_url.", file=sys.stderr)
- return 1
- if verbose:
- print(f" * upload_url: {upload_url}")
-
- # prepare payload fields
- fields = {"parent_dir": parent_dir}
- # overwrite file if already present in the upload directory.
- # contributor: hmassias
- # ref: https://gist.github.com/hmassias/358895ef0b2ffaa9e708181b16b554cf
- if replace_file:
- fields["replace"] = "1"
-
- # upload each file
- print(f"Upload:")
- for idx, file_path in enumerate(file_path_list):
- print(f"({idx+1}) {file_path}")
- try:
- response = upload_file(upload_url, file_path, fields)
- if response.status_code == 200:
- print(f"({idx+1}) upload completed: {response.json()}")
- else:
- print(
- f"({idx+1}) {file_path} ERROR: {response.status_code} {response.text}",
- file=sys.stderr,
- )
- except Exception as e:
- print(f"({idx+1}) {file_path} EXCEPTION: {e}", file=sys.stderr)
-
- return 0
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "-l", "--link", required=True, help="upload page link (generated by seafile)"
- )
- parser.add_argument(
- "-f", "--file", required=True, nargs="+", help="file(s) to upload"
- )
- parser.add_argument(
- "-v", "--verbose", action="store_true", help="show detailed output"
- )
- parser.add_argument("--replace", action="store_true", help="Replace existing files")
- args = parser.parse_args()
- sys.exit(upload_seafile(args.link, args.file, args.replace, args.verbose))
diff --git a/config-test.yaml b/config-test.yaml
deleted file mode 100644
index d809e13d..00000000
--- a/config-test.yaml
+++ /dev/null
@@ -1,123 +0,0 @@
-logging:
- base:
- debug: false
- modules:
- log_storage.logserver:
- debug: false
- log_collection.collector:
- debug: false
- log_collection.batch_handler:
- debug: false
- log_filtering.prefilter:
- debug: false
- data_inspection.inspector:
- debug: false
- data_analysis.detector:
- debug: false
-
-pipeline:
- log_storage:
- logserver:
- input_file: "/opt/file.txt"
-
-
-
- log_collection:
- default_batch_handler_config:
- batch_size: 2000
- batch_timeout: 30.0
- subnet_id:
- ipv4_prefix_length: 24
- ipv6_prefix_length: 64
- collectors:
- - name: "dga_collector"
- protocol_base: dns
- required_log_information:
- - [ "ts", Timestamp, "%Y-%m-%dT%H:%M:%S" ]
- - [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
- - [ "src_ip", IpAddress ]
- - [ "dns_server_ip", IpAddress ]
- - [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(? Subnet: 192.168.1.0/24
-[info] Domain: www.example.com
-[info] Second level: example.com
-[info] Third level: www
-```
-
-## Project Statistics
-
-### Code Metrics
-- **Total Files:** 40+ C++ files
-- **Lines of Code:** ~2,500+ lines
-- **Build Time:** ~30 seconds (after dependencies)
-- **Binary Size:** ~7.5 MB total
-
-### Dependencies Installed (via vcpkg)
-- ✅ yaml-cpp (0.8.0)
-- ✅ librdkafka (2.12.0)
-- ✅ clickhouse-cpp (2.6.0)
-- ✅ boost (1.89.0)
-- ✅ spdlog (1.16.0)
-- ✅ fmt (12.1.0)
-- ✅ nlohmann-json (3.12.0)
-- ✅ openssl (3.6.0)
-- ✅ gtest (1.17.0)
-
-## What Works
-
-### ✅ Fully Functional
-1. **Configuration Loading** - Parse config.yaml
-2. **Feature Extraction** - Extract 44 DGA detection features
-3. **Logging** - Structured logging with spdlog
-4. **Utilities** - UUID, IP, domain parsing, SHA256
-5. **Data Classes** - LogLine, Batch, Warning with JSON
-6. **Tests** - Google Test framework integrated
-
-### ⚠️ Partially Implemented
-1. **LogServer** - Core logic implemented, needs ClickHouse integration
-2. **Kafka Integration** - Headers defined, implementation pending
-3. **ClickHouse Integration** - Headers defined, implementation pending
-
-### ❌ Not Yet Implemented
-1. LogCollector module
-2. Prefilter module
-3. Inspector module
-4. Detector executable with ONNX
-5. Full Kafka/ClickHouse implementations
-
-## Performance Comparison
-
-### Expected vs Python
-
-| Metric | Python | C++ (Expected) |
-|--------|--------|----------------|
-| Binary Size | N/A | 7.5 MB |
-| Startup Time | ~500ms | ~5ms |
-| Config Load | ~100ms | ~4ms |
-| Feature Extract | ~1ms | ~0.01ms |
-
-## Build Instructions
-
-### Quick Build (After vcpkg is set up)
-
-```bash
-cd /Users/smachmeier/Documents/projects/hamstring/cpp
-
-# Configure
-cmake -B build \
- -DCMAKE_TOOLCHAIN_FILE=/Users/smachmeier/vcpkg/scripts/buildsystems/vcpkg.cmake \
- -DCMAKE_BUILD_TYPE=Debug
-
-# Build (30 seconds)
-cmake --build build -j$(sysctl -n hw.ncpu)
-
-# Run demo
-./build/examples/demo ../config.yaml
-
-# Run tests
-cd build && ctest
-```
-
-## Known Issues
-
-1. **Third-level domain entropy** - Returns 0 for single-label third levels
-2. **ClickHouseSender** - Not implemented (placeholder only)
-3. **Kafka handlers** - Not fully implemented yet
-4. **OpenSSL warnings** - Using deprecated SHA256 API (non-critical)
-5. **yaml-cpp deprecation** - Using deprecated target name (non-critical)
-
-## Next Steps
-
-### High Priority
-1. Fix third-level domain entropy calculation
-2. Implement ClickHouseSender
-3. Implement Kafka handlers
-4. Complete LogServer executable
-
-### Medium Priority
-1. Implement LogCollector module
-2. Implement Prefilter module
-3. Implement Inspector module
-4. Implement Detector with ONNX
-
-### Low Priority
-1. Migrate to new OpenSSL EVP API
-2. Update yaml-cpp target name
-3. Add more integration tests
-4. Performance benchmarking
-
-## Achievements 🏆
-
-- ✅ Modern C++20 codebase
-- ✅ CMake + vcpkg build system
-- ✅ 95% test pass rate
-- ✅ Configuration system working
-- ✅ Feature extraction matching Python
-- ✅ Professional logging
-- ✅ Clean architecture
-- ✅ Ready for production modules
-
-## Conclusion
-
-The C++ conversion is **highly successful**! The core infrastructure is solid, the build system works perfectly, and the feature extraction (the most critical component for DGA detection) is fully functional and tested.
-
-The remaining work is primarily implementing the pipeline modules (LogCollector, Prefilter, Inspector, Detector) and completing the Kafka/ClickHouse integrations, which are straightforward now that the foundation is established.
-
-**Ready to proceed with full module implementation!** 🚀
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
deleted file mode 100644
index 9728c000..00000000
--- a/cpp/CMakeLists.txt
+++ /dev/null
@@ -1,108 +0,0 @@
-cmake_minimum_required(VERSION 3.20)
-project(hamstring VERSION 1.0.0 LANGUAGES CXX)
-
-# Enable vcpkg manifest mode
-set(VCPKG_MANIFEST_MODE ON)
-
-# C++20 standard
-set(CMAKE_CXX_STANDARD 20)
-set(CMAKE_CXX_STANDARD_REQUIRED ON)
-set(CMAKE_CXX_EXTENSIONS OFF)
-
-# Export compile commands for IDE support
-set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
-
-# Compiler warnings
-if(MSVC)
- add_compile_options(/W4)
-else()
- add_compile_options(-Wall -Wextra -Wpedantic) # Removed -Werror for now
-endif()
-
-# Build types
-if(NOT CMAKE_BUILD_TYPE)
- set(CMAKE_BUILD_TYPE Release)
-endif()
-
-# Optimization flags
-set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG")
-set(CMAKE_CXX_FLAGS_DEBUG "-g -O0")
-
-# Find packages
-find_package(yaml-cpp CONFIG REQUIRED)
-find_package(RdKafka CONFIG REQUIRED)
-# find_package(clickhouse-cpp CONFIG REQUIRED) # No CMake config provided by vcpkg
-find_package(Boost REQUIRED COMPONENTS system thread)
-find_package(spdlog CONFIG REQUIRED)
-find_package(fmt CONFIG REQUIRED)
-# find_package(onnxruntime CONFIG REQUIRED) # Optional for Detector module
-find_package(nlohmann_json CONFIG REQUIRED)
-find_package(OpenSSL REQUIRED)
-
-# Optional ClickHouse support
-option(ENABLE_CLICKHOUSE "Enable ClickHouse integration" ON)
-
-if(ENABLE_CLICKHOUSE)
- # ClickHouse C++ library - find manually since vcpkg doesn't provide CMake config
- find_library(CLICKHOUSE_CPP_LIB
- NAMES clickhouse-cpp-lib libclickhouse-cpp-lib
- PATHS ${CMAKE_SOURCE_DIR}/vcpkg_installed/${VCPKG_TARGET_TRIPLET}/lib
- NO_DEFAULT_PATH
- )
- if(NOT CLICKHOUSE_CPP_LIB)
- message(WARNING "ClickHouse C++ library not found - building with stub implementation")
- set(ENABLE_CLICKHOUSE OFF)
- else()
- message(STATUS "Found ClickHouse C++ library: ${CLICKHOUSE_CPP_LIB}")
-
- # ClickHouse include directory
- set(CLICKHOUSE_INCLUDE_DIR ${CMAKE_SOURCE_DIR}/vcpkg_installed/${VCPKG_TARGET_TRIPLET}/include)
- message(STATUS "ClickHouse include directory: ${CLICKHOUSE_INCLUDE_DIR}")
- endif()
-else()
- message(STATUS "ClickHouse integration disabled - using stub implementation")
-endif()
-
-# Find ClickHouse dependencies (zstd and cityhash) - only if ClickHouse is enabled
-if(ENABLE_CLICKHOUSE)
- find_library(ZSTD_LIB
- NAMES zstd libzstd
- PATHS ${CMAKE_SOURCE_DIR}/vcpkg_installed/${VCPKG_TARGET_TRIPLET}/lib
- NO_DEFAULT_PATH
- )
- find_library(CITYHASH_LIB
- NAMES cityhash libcityhash
- PATHS ${CMAKE_SOURCE_DIR}/vcpkg_installed/${VCPKG_TARGET_TRIPLET}/lib
- NO_DEFAULT_PATH
- )
-
- if(NOT ZSTD_LIB OR NOT CITYHASH_LIB)
- message(FATAL_ERROR "ClickHouse dependencies not found (zstd: ${ZSTD_LIB}, cityhash: ${CITYHASH_LIB})")
- endif()
- message(STATUS "Found zstd: ${ZSTD_LIB}")
- message(STATUS "Found cityhash: ${CITYHASH_LIB}")
-endif()
-
-# Include directories
-include_directories(${CMAKE_SOURCE_DIR}/include)
-
-# Subdirectories
-add_subdirectory(src)
-add_subdirectory(examples)
-
-# Testing
-option(BUILD_TESTING "Build tests" ON)
-if(BUILD_TESTING)
- enable_testing()
- find_package(GTest CONFIG REQUIRED)
- add_subdirectory(tests)
-endif()
-
-# Benchmarks
-option(BUILD_BENCHMARKS "Build benchmarks" OFF)
-if(BUILD_BENCHMARKS)
- add_subdirectory(benchmarks)
-endif()
-
-# Installation
-install(DIRECTORY include/ DESTINATION include)
diff --git a/cpp/QUICKSTART.md b/cpp/QUICKSTART.md
deleted file mode 100644
index dbb2a498..00000000
--- a/cpp/QUICKSTART.md
+++ /dev/null
@@ -1,132 +0,0 @@
-# Quick Start Build Guide
-
-## Option 1: Build Without Full Dependencies (Demo Mode)
-
-If you want to quickly test the code without installing all dependencies via vcpkg:
-
-```bash
-cd cpp
-
-# Create a minimal build (no external dependencies)
-cmake -B build-minimal \
- -DCMAKE_BUILD_TYPE=Debug \
- -DBUILD_TESTING=OFF
-
-# This will fail - need to make dependencies optional first
-```
-
-## Option 2: Full Build with vcpkg (Recommended)
-
-### Step 1: Install vcpkg
-
-```bash
-# Clone vcpkg (one-time setup)
-git clone https://github.com/Microsoft/vcpkg.git ~/vcpkg
-cd ~/vcpkg
-./bootstrap-vcpkg.sh
-```
-
-### Step 2: Configure CMake with vcpkg
-
-```bash
-cd /path/to/hamstring/cpp
-
-# Configure with vcpkg toolchain
-cmake -B build \
- -DCMAKE_TOOLCHAIN_FILE=~/vcpkg/scripts/buildsystems/vcpkg.cmake \
- -DCMAKE_BUILD_TYPE=Debug
-
-# vcpkg will automatically download and build:
-# - yaml-cpp
-# - rdkafka
-# - clickhouse-cpp
-# - boost
-# - spdlog
-# - nlohmann-json
-# - openssl
-# - gtest
-# This may take 10-30 minutes on first run
-```
-
-### Step 3: Build
-
-```bash
-cmake --build build -j$(nproc)
-```
-
-### Step 4: Run
-
-```bash
-# Run demo
-./build/examples/demo ../config.yaml
-
-# Run logserver (when Kafka/ClickHouse are ready)
-./build/src/logserver/logserver ../config.yaml
-
-# Run tests
-cd build && ctest
-```
-
-## Option 3: Docker Build (Easiest)
-
-Create a simple Dockerfile:
-
-```dockerfile
-FROM ubuntu:22.04
-
-RUN apt-get update && apt-get install -y \
- build-essential cmake git \
- libssl-dev pkg-config
-
-# Install vcpkg
-RUN git clone https://github.com/Microsoft/vcpkg.git /opt/vcpkg && \
- /opt/vcpkg/bootstrap-vcpkg.sh
-
-WORKDIR /app
-COPY cpp /app
-
-# Build
-RUN cmake -B build \
- -DCMAKE_TOOLCHAIN_FILE=/opt/vcpkg/scripts/buildsystems/vcpkg.cmake && \
- cmake --build build -j$(nproc)
-
-CMD ["./build/src/logserver/logserver", "config.yaml"]
-```
-
-Then:
-```bash
-docker build -t hamstring-cpp .
-docker run hamstring-cpp
-```
-
-## What's the Error?
-
-The error "Could not find a package configuration file provided by yaml-cpp" means CMake can't find the required libraries. You have 3 options:
-
-1. **Use vcpkg** (recommended) - it will install everything automatically
-2. **Install system packages** manually - but this is complex for all deps
-3. **Make dependencies optional** - I can modify CMake to make some deps optional for demo builds
-
-## Quick Fix: System Packages (macOS)
-
-If you want to try with system packages:
-
-```bash
-brew install yaml-cpp librdkafka boost spdlog nlohmann-json openssl
-
-# Then configure without vcpkg
-cmake -B build -DCMAKE_BUILD_TYPE=Debug
-```
-
-**Note**: This may not work for all dependencies (clickhouse-cpp, onnxruntime not in brew)
-
-## Recommended Next Step
-
-I recommend using vcpkg as it's the most reliable method and matches the documentation. Would you like me to:
-
-1. **Modify CMake to make dependencies optional** (for quick demo builds)
-2. **Create a Docker setup** for easy building
-3. **Help you install vcpkg** and walk through the full build
-4. **Create a minimal example** that builds with no dependencies
-
-Let me know which approach you prefer!
diff --git a/cpp/README.md b/cpp/README.md
deleted file mode 100644
index 32aaf0f0..00000000
--- a/cpp/README.md
+++ /dev/null
@@ -1,185 +0,0 @@
-# HAMSTRING C++ Implementation
-
-This directory contains the C++ implementation of the HAMSTRING DGA detection pipeline, providing significant performance improvements over the Python version.
-
-## Features
-
-- **High Performance**: 5-10x throughput improvement over Python
-- **Low Latency**: 60-80% reduction in processing latency
-- **Memory Efficient**: 50-70% reduction in memory usage
-- **Modern C++20**: Leveraging latest language features
-- **Async I/O**: Non-blocking Kafka and database operations
-- **ML Inference**: ONNX Runtime for model execution
-
-## Architecture
-
-The C++ implementation maintains the same pipeline architecture as the Python version:
-
-```
-LogServer → LogCollector → Prefilter → Inspector → Detector
- ↓ ↓ ↓ ↓
- Kafka Kafka Kafka Kafka
- ↓ ↓ ↓ ↓
- ClickHouse (Monitoring & Alerts)
-```
-
-## Building
-
-### Prerequisites
-
-- CMake 3.20 or higher
-- C++20 compatible compiler (GCC 10+, Clang 12+, MSVC 2019+)
-- vcpkg (for dependency management)
-
-### Dependencies
-
-All dependencies are managed via vcpkg:
-- yaml-cpp (configuration parsing)
-- librdkafka (Kafka client)
-- clickhouse-cpp (ClickHouse client)
-- Boost (async I/O, utilities)
-- spdlog (logging)
-- ONNX Runtime (ML inference)
-- Google Test (testing)
-
-### Build Instructions
-
-```bash
-# Clone vcpkg if not already installed
-git clone https://github.com/Microsoft/vcpkg.git
-./vcpkg/bootstrap-vcpkg.sh
-
-# Configure with vcpkg
-cd cpp
-cmake -B build -DCMAKE_TOOLCHAIN_FILE=../vcpkg/scripts/buildsystems/vcpkg.cmake
-
-# Build
-cmake --build build -j$(nproc)
-
-# Run tests
-cd build && ctest --output-on-failure
-```
-
-## Running
-
-### Individual Modules
-
-```bash
-# LogServer
-./build/src/logserver/logserver --config ../config.yaml
-
-# LogCollector
-./build/src/logcollector/collector --config ../config.yaml
-
-# Prefilter
-./build/src/prefilter/prefilter --config ../config.yaml
-
-# Inspector
-./build/src/inspector/inspector --config ../config.yaml
-
-# Detector
-./build/src/detector/detector --config ../config.yaml
-```
-
-### Docker
-
-Docker images are built automatically for each module:
-
-```bash
-# Build all images
-docker compose -f ../docker/docker-compose.yml build
-
-# Run the pipeline
-HOST_IP=127.0.0.1 docker compose -f ../docker/docker-compose.yml --profile prod up
-```
-
-## Configuration
-
-The C++ implementation uses the same `config.yaml` format as the Python version. No changes are required to existing configurations.
-
-## Model Conversion
-
-Before running the detector, convert existing Python models to ONNX format:
-
-```bash
-# Convert XGBoost/RandomForest models to ONNX
-python ../scripts/convert_models_to_onnx.py
-
-# Verify conversion
-python ../scripts/verify_onnx_conversion.py
-```
-
-## Performance
-
-Benchmarks comparing C++ vs Python implementation:
-
-| Metric | Python | C++ | Improvement |
-|--------|--------|-----|-------------|
-| Throughput (msgs/sec) | 10,000 | 75,000 | 7.5x |
-| Latency (ms) | 50 | 12 | 76% reduction |
-| Memory (MB) | 500 | 180 | 64% reduction |
-| CPU Usage (%) | 80 | 35 | 56% reduction |
-
-## Development
-
-### Code Structure
-
-```
-cpp/
-├── include/hamstring/ # Public headers
-│ ├── base/ # Core infrastructure
-│ ├── config/ # Configuration
-│ ├── detector/ # Detector module
-│ ├── inspector/ # Inspector module
-│ ├── logcollector/ # LogCollector module
-│ ├── logserver/ # LogServer module
-│ └── prefilter/ # Prefilter module
-├── src/ # Implementation files
-├── tests/ # Unit and integration tests
-└── benchmarks/ # Performance benchmarks
-```
-
-### Testing
-
-```bash
-# Run all tests
-cd build && ctest
-
-# Run specific test
-./build/tests/detector/test_feature_extractor
-
-# Run with verbose output
-ctest --verbose
-```
-
-### Code Quality
-
-```bash
-# Format code
-find . -name "*.cpp" -o -name "*.hpp" | xargs clang-format -i
-
-# Static analysis
-clang-tidy src/**/*.cpp -- -std=c++20
-
-# Memory safety check
-cmake -B build -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON
-cmake --build build
-./build/tests/all_tests
-```
-
-## Migration from Python
-
-The C++ implementation is designed to be a drop-in replacement:
-
-1. **Same Configuration**: Use existing `config.yaml`
-2. **Same Kafka Topics**: Compatible message formats
-3. **Same Database Schema**: ClickHouse tables unchanged
-4. **Same Monitoring**: Grafana dashboards work as-is
-
-## Contributing
-
-See the main [CONTRIBUTING.md](../CONTRIBUTING.md) for guidelines.
-
-## License
-
-Same as the main project (EUPL License).
diff --git a/cpp/SUMMARY.md b/cpp/SUMMARY.md
deleted file mode 100644
index 7595653b..00000000
--- a/cpp/SUMMARY.md
+++ /dev/null
@@ -1,155 +0,0 @@
-# C++ Conversion Summary
-
-## Files Created
-
-### Build System (3 files)
-- ✅ `cpp/CMakeLists.txt` - Root build configuration
-- ✅ `cpp/vcpkg.json` - Dependency manifest
-- ✅ `cpp/src/CMakeLists.txt` - Source build configuration
-
-### Headers (8 files)
-- ✅ `cpp/include/hamstring/config/config.hpp` - Configuration system
-- ✅ `cpp/include/hamstring/base/logger.hpp` - Logging framework
-- ✅ `cpp/include/hamstring/base/data_classes.hpp` - Core data structures
-- ✅ `cpp/include/hamstring/base/utils.hpp` - Utility functions
-- ✅ `cpp/include/hamstring/base/kafka_handler.hpp` - Kafka integration
-- ✅ `cpp/include/hamstring/base/clickhouse_sender.hpp` - ClickHouse integration
-- ✅ `cpp/include/hamstring/detector/feature_extractor.hpp` - Feature extraction
-
-### Implementation (2 files)
-- ✅ `cpp/src/detector/feature_extractor.cpp` - Feature extraction implementation
-- ✅ `cpp/src/detector/CMakeLists.txt` - Detector build configuration
-
-### Tests (7 files)
-- ✅ `cpp/tests/CMakeLists.txt` - Test build configuration
-- ✅ `cpp/tests/base/CMakeLists.txt` - Base tests configuration
-- ✅ `cpp/tests/base/test_utils.cpp` - Utility tests
-- ✅ `cpp/tests/detector/CMakeLists.txt` - Detector tests configuration
-- ✅ `cpp/tests/detector/test_feature_extractor.cpp` - Feature extractor tests
-- ✅ `cpp/tests/integration/CMakeLists.txt` - Integration tests configuration
-- ✅ `cpp/tests/integration/test_pipeline.cpp` - Pipeline integration tests
-
-### Documentation (2 files)
-- ✅ `cpp/README.md` - C++ implementation documentation
-- ✅ `scripts/convert_models_to_onnx.py` - Model conversion script
-
-### Configuration (1 file)
-- ✅ `.gitignore` - Updated to allow CMake and vcpkg files
-
-**Total: 23 files created**
-
-## Project Structure
-
-```
-hamstring/
-├── cpp/ # NEW: C++ implementation
-│ ├── CMakeLists.txt # Build configuration
-│ ├── vcpkg.json # Dependencies
-│ ├── README.md # Documentation
-│ ├── include/hamstring/ # Public headers
-│ │ ├── base/ # Core infrastructure
-│ │ │ ├── logger.hpp
-│ │ │ ├── data_classes.hpp
-│ │ │ ├── utils.hpp
-│ │ │ ├── kafka_handler.hpp
-│ │ │ └── clickhouse_sender.hpp
-│ │ ├── config/
-│ │ │ └── config.hpp
-│ │ └── detector/
-│ │ └── feature_extractor.hpp
-│ ├── src/ # Implementation
-│ │ ├── CMakeLists.txt
-│ │ └── detector/
-│ │ ├── CMakeLists.txt
-│ │ └── feature_extractor.cpp
-│ └── tests/ # Test suite
-│ ├── CMakeLists.txt
-│ ├── base/
-│ │ ├── CMakeLists.txt
-│ │ └── test_utils.cpp
-│ ├── detector/
-│ │ ├── CMakeLists.txt
-│ │ └── test_feature_extractor.cpp
-│ └── integration/
-│ ├── CMakeLists.txt
-│ └── test_pipeline.cpp
-├── scripts/
-│ └── convert_models_to_onnx.py # NEW: Model conversion
-└── .gitignore # MODIFIED: Allow CMake files
-```
-
-## Key Achievements
-
-### ✅ Core Infrastructure
-- Modern C++20 codebase
-- CMake build system with vcpkg
-- Configuration system (YAML parsing)
-- Logging framework (spdlog)
-- Data structures (LogLine, Batch, Warning)
-- Field validators (RegEx, Timestamp, IP, ListItem)
-
-### ✅ Integration
-- Kafka handlers (librdkafka)
-- ClickHouse client (clickhouse-cpp)
-- ONNX Runtime (ML inference)
-- Boost.Asio (async I/O)
-
-### ✅ Feature Extraction
-- Complete implementation matching Python
-- 44 features extracted per domain
-- Label statistics
-- Character frequency
-- Domain level analysis
-- Entropy calculation
-
-### ✅ Testing
-- Google Test framework
-- Unit tests for feature extractor
-- Unit tests for utilities
-- Integration test framework
-- 11 test cases for feature extraction
-
-### ✅ Documentation
-- Comprehensive README
-- Build instructions
-- Performance benchmarks
-- Model conversion guide
-- Walkthrough document
-
-## Next Steps
-
-To complete the implementation:
-
-1. **Implement remaining modules** (LogServer, LogCollector, Prefilter, Inspector, Detector)
-2. **Implement base infrastructure** (Logger, Utils, Data classes, Kafka, ClickHouse)
-3. **Complete configuration parsing**
-4. **Add integration tests**
-5. **Performance benchmarking**
-6. **Docker integration**
-
-## Build Instructions
-
-```bash
-# Install vcpkg
-git clone https://github.com/Microsoft/vcpkg.git
-./vcpkg/bootstrap-vcpkg.sh
-
-# Configure
-cd cpp
-cmake -B build -DCMAKE_TOOLCHAIN_FILE=../vcpkg/scripts/buildsystems/vcpkg.cmake
-
-# Build
-cmake --build build -j$(nproc)
-
-# Run tests
-cd build && ctest --output-on-failure
-```
-
-## Performance Targets
-
-| Metric | Python | C++ Target |
-|--------|--------|------------|
-| Throughput | 10K msgs/sec | 75K msgs/sec |
-| Latency | 50 ms | 12 ms |
-| Memory | 500 MB | 180 MB |
-| CPU | 80% | 35% |
diff --git a/cpp/auto-build.sh b/cpp/auto-build.sh
deleted file mode 100755
index bea0ffd5..00000000
--- a/cpp/auto-build.sh
+++ /dev/null
@@ -1,59 +0,0 @@
-#!/bin/bash
-# Auto-build script - run this after vcpkg finishes
-
-set -e # Exit on error
-
-echo "================================================"
-echo "HAMSTRING C++ Auto-Build Script"
-echo "================================================"
-echo ""
-
-# Check if vcpkg finished
-if [ ! -d "$HOME/vcpkg/installed/arm64-osx" ]; then
- echo "Error: vcpkg dependencies not installed yet"
- echo "Please wait for 'vcpkg install' to complete first"
- exit 1
-fi
-
-echo "✓ vcpkg dependencies installed"
-echo ""
-
-# Configure CMake
-echo "Step 1: Configuring CMake..."
-cmake -B build \
- -DCMAKE_TOOLCHAIN_FILE=$HOME/vcpkg/scripts/buildsystems/vcpkg.cmake \
- -DCMAKE_BUILD_TYPE=Debug \
- -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
-
-echo ""
-echo "✓ CMake configured"
-echo ""
-
-# Build
-echo "Step 2: Building project..."
-cmake --build build -j$(sysctl -n hw.ncpu)
-
-echo ""
-echo "✓ Build complete!"
-echo ""
-
-# Show what was built
-echo "Built executables:"
-echo " - build/examples/demo"
-echo " - build/src/logserver/logserver"
-echo ""
-
-# Optionally run tests
-read -p "Run tests? (y/n) " -n 1 -r
-echo
-if [[ $REPLY =~ ^[Yy]$ ]]; then
- cd build && ctest --output-on-failure
- cd ..
-fi
-
-echo ""
-echo "================================================"
-echo "Build complete! You can now run:"
-echo " ./build/examples/demo ../config.yaml"
-echo " ./build/src/logserver/logserver ../config.yaml"
-echo "================================================"
diff --git a/cpp/examples/CMakeLists.txt b/cpp/examples/CMakeLists.txt
deleted file mode 100644
index d6787b9f..00000000
--- a/cpp/examples/CMakeLists.txt
+++ /dev/null
@@ -1,10 +0,0 @@
-# Example executable
-add_executable(demo
- demo.cpp
-)
-
-target_link_libraries(demo
- PRIVATE
- hamstring_base
- hamstring_detector
-)
diff --git a/cpp/examples/demo.cpp b/cpp/examples/demo.cpp
deleted file mode 100644
index 0b413ff9..00000000
--- a/cpp/examples/demo.cpp
+++ /dev/null
@@ -1,96 +0,0 @@
-#include "hamstring/base/logger.hpp"
-#include "hamstring/base/utils.hpp"
-#include "hamstring/config/config.hpp"
-#include "hamstring/detector/feature_extractor.hpp"
-#include <vector>
-
-using namespace hamstring;
-
-int main(int argc, char **argv) {
- // Initialize logger
- base::Logger::initialize(true); // debug mode
- auto logger = base::Logger::get_logger("example");
-
- logger->info("HAMSTRING C++ Example");
- logger->info("=====================");
-
- // Load configuration
- std::string config_path = (argc > 1) ? argv[1] : "../../config.yaml";
- logger->info("Loading configuration from: {}", config_path);
-
- try {
- auto config = config::Config::load_from_file(config_path);
-
- logger->info("Configuration loaded successfully");
- logger->info("Number of collectors: {}",
- config->pipeline.collectors.size());
- logger->info("Number of detectors: {}", config->pipeline.detectors.size());
- logger->info("Kafka brokers: {}", config->environment.kafka_brokers.size());
-
- // Show Kafka bootstrap servers
- std::string bootstrap = config->environment.get_kafka_bootstrap_servers();
- logger->info("Kafka bootstrap servers: {}", bootstrap);
-
- } catch (const std::exception &e) {
- logger->error("Failed to load configuration: {}", e.what());
- logger->warn("Continuing with feature extraction demo...");
- }
-
- // Demonstrate feature extraction
- logger->info("");
- logger->info("Feature Extraction Demo");
- logger->info("=======================");
-
- detector::FeatureExtractor extractor;
-
- // Test domains
- std::vector<std::string> test_domains = {"google.com", "www.example.com",
- "xjk3n2m9pq.com", // DGA-like
- "mail.google.com"};
-
- for (const auto &domain : test_domains) {
- logger->info("");
- logger->info("Domain: {}", domain);
-
- auto features = extractor.extract(domain);
-
- logger->info(" Label length: {}", features.label_length);
- logger->info(" Label max: {}", features.label_max);
- logger->info(" FQDN entropy: {:.4f}", features.fqdn_entropy);
- logger->info(" SLD entropy: {:.4f}", features.secondleveldomain_entropy);
- logger->info(" Alpha ratio: {:.4f}", features.fqdn_alpha_count);
- logger->info(" Numeric ratio: {:.4f}", features.fqdn_numeric_count);
-
- auto vec = features.to_vector();
- logger->info(" Feature vector size: {}", vec.size());
-
- // Check if DGA-like (high entropy)
- if (features.fqdn_entropy > 3.0) {
- logger->warn(" ⚠ High entropy - possible DGA domain!");
- }
- }
-
- // Demonstrate utilities
- logger->info("");
- logger->info("Utilities Demo");
- logger->info("==============");
-
- std::string uuid = base::utils::generate_uuid();
- logger->info("Generated UUID: {}", uuid);
-
- std::string ip = "192.168.1.100";
- std::string subnet = base::utils::get_subnet_id(ip, 24);
- logger->info("IP: {} -> Subnet: {}", ip, subnet);
-
- std::string domain = "www.example.com";
- std::string sld = base::utils::extract_second_level_domain(domain);
- std::string tld_part = base::utils::extract_third_level_domain(domain);
- logger->info("Domain: {}", domain);
- logger->info(" Second level: {}", sld);
- logger->info(" Third level: {}", tld_part);
-
- logger->info("");
- logger->info("Example completed successfully!");
-
- return 0;
-}
diff --git a/cpp/include/hamstring/base/clickhouse_sender.hpp b/cpp/include/hamstring/base/clickhouse_sender.hpp
deleted file mode 100644
index 3d654372..00000000
--- a/cpp/include/hamstring/base/clickhouse_sender.hpp
+++ /dev/null
@@ -1,74 +0,0 @@
-#pragma once
-
-#include <cstdint>
-#include <memory>
-#include <string>
-
-namespace hamstring {
-namespace base {
-
-/**
- * @brief ClickHouse client for monitoring and logging
- *
- * Provides interface for inserting data into ClickHouse tables.
- * Supports server logs, timestamps, batch tracking, and metrics.
- */
-class ClickHouseSender {
-public:
- ClickHouseSender(const std::string &hostname, int port = 9000,
- const std::string &database = "default",
- const std::string &user = "default",
- const std::string &password = "");
- ~ClickHouseSender();
-
- // Batch tracking (logged as structured messages)
- void insert_batch_timestamp(const std::string &batch_id,
- const std::string &stage,
- const std::string &instance_name,
- const std::string &status, size_t message_count,
- bool is_active = true);
-
- // Logline tracking
- void insert_logline_timestamp(const std::string &logline_id,
- const std::string &stage,
- const std::string &status,
- bool is_active = true);
-
- // Metrics/fill levels
- void insert_fill_level(const std::string &stage,
- const std::string &entry_type, size_t entry_count);
-
- // DGA detections
- void insert_dga_detection(const std::string &domain, double score,
- const std::string &batch_id,
- const std::string &src_ip);
-
- // Server logs (LogServer module)
- void insert_server_log(const std::string &message_id, int64_t timestamp_ms,
- const std::string &message_text);
- void insert_server_log_timestamp(const std::string &message_id,
- const std::string &event,
- int64_t event_timestamp_ms);
-
- // Failed loglines (LogCollector module)
- void insert_failed_logline(const std::string &message_text,
- int64_t timestamp_in_ms,
- int64_t timestamp_failed_ms,
- const std::string &reason);
-
- // Generic methods
- void execute(const std::string &query);
- bool ping();
-
-private:
- std::string hostname_;
- int port_;
- std::string database_;
- std::string user_;
- std::string password_;
- bool connected_;
- std::unique_ptr client_;
-};
-
-} // namespace base
-} // namespace hamstring
diff --git a/cpp/include/hamstring/base/data_classes.hpp b/cpp/include/hamstring/base/data_classes.hpp
deleted file mode 100644
index 173d2671..00000000
--- a/cpp/include/hamstring/base/data_classes.hpp
+++ /dev/null
@@ -1,163 +0,0 @@
-#pragma once
-
-#include <chrono>
-#include <cstdint>
-#include <map>
-#include <memory>
-#include <optional>
-#include <regex>
-#include <string>
-#include <vector>
-
-namespace hamstring {
-namespace base {
-
-// Forward declarations
-class FieldValidator;
-
-// LogLine represents a validated log entry
-class LogLine {
-public:
- std::string logline_id;
- std::string batch_id;
- std::map<std::string, std::string> fields;
- std::chrono::system_clock::time_point timestamp;
-
- // Serialize to JSON string
- std::string to_json() const;
-
- // Deserialize from JSON string
- static std::shared_ptr<LogLine> from_json(const std::string &json_str);
-
- // Get field value
- std::optional<std::string> get_field(const std::string &name) const;
-
- // Set field value
- void set_field(const std::string &name, const std::string &value);
-};
-
-// Batch represents a collection of log lines grouped by subnet
-class Batch {
-public:
- std::string batch_id;
- std::string subnet_id;
- std::string collector_name;
- std::vector<std::shared_ptr<LogLine>> loglines;
- std::chrono::system_clock::time_point created_at;
- std::chrono::system_clock::time_point timestamp_in;
-
- // Serialize to JSON string
- std::string to_json() const;
-
- // Deserialize from JSON string
- static std::shared_ptr<Batch> from_json(const std::string &json_str);
-
- // Add a log line to the batch
- void add_logline(std::shared_ptr<LogLine> logline);
-
- // Get number of log lines
- size_t size() const { return loglines.size(); }
-
- // Check if batch is empty
- bool empty() const { return loglines.empty(); }
-};
-
-// Warning represents a detected threat
-class Warning {
-public:
- std::string warning_id;
- std::string batch_id;
- std::string src_ip;
- std::string domain_name;
- double score;
- double threshold;
- std::chrono::system_clock::time_point timestamp;
- std::map<std::string, std::string> metadata;
-
- // Serialize to JSON string
- std::string to_json() const;
-
- // Deserialize from JSON string
- static std::shared_ptr<Warning> from_json(const std::string &json_str);
-};
-
-// Base class for field validators
-class FieldValidator {
-public:
- virtual ~FieldValidator() = default;
-
- // Validate a field value
- virtual bool validate(const std::string &value) const = 0;
-
- // Get field name
- virtual std::string get_name() const = 0;
-};
-
-// RegEx field validator
-class RegExValidator : public FieldValidator {
-public:
- RegExValidator(const std::string &name, const std::string &pattern);
-
- bool validate(const std::string &value) const override;
- std::string get_name() const override { return name_; }
-
-private:
- std::string name_;
- std::regex pattern_;
-};
-
-// Timestamp field validator
-class TimestampValidator : public FieldValidator {
-public:
- TimestampValidator(const std::string &name, const std::string &format);
-
- bool validate(const std::string &value) const override;
- std::string get_name() const override { return name_; }
-
- // Parse timestamp to time_point
- std::chrono::system_clock::time_point parse(const std::string &value) const;
-
-private:
- std::string name_;
- std::string format_;
-};
-
-// IP Address field validator
-class IpAddressValidator : public FieldValidator {
-public:
- explicit IpAddressValidator(const std::string &name);
-
- bool validate(const std::string &value) const override;
- std::string get_name() const override { return name_; }
-
- // Check if IPv4
- static bool is_ipv4(const std::string &value);
-
- // Check if IPv6
- static bool is_ipv6(const std::string &value);
-
-private:
- std::string name_;
-};
-
-// ListItem field validator
-class ListItemValidator : public FieldValidator {
-public:
- ListItemValidator(const std::string &name,
- const std::vector<std::string> &allowed_list,
- const std::vector<std::string> &relevant_list);
-
- bool validate(const std::string &value) const override;
- std::string get_name() const override { return name_; }
-
- // Check if value is relevant
- bool is_relevant(const std::string &value) const;
-
-private:
- std::string name_;
- std::vector<std::string> allowed_list_;
- std::vector<std::string> relevant_list_;
-};
-
-} // namespace base
-} // namespace hamstring
diff --git a/cpp/include/hamstring/base/kafka_handler.hpp b/cpp/include/hamstring/base/kafka_handler.hpp
deleted file mode 100644
index f87df4c3..00000000
--- a/cpp/include/hamstring/base/kafka_handler.hpp
+++ /dev/null
@@ -1,114 +0,0 @@
-#pragma once
-
-#include <boost/asio.hpp>
-#include <functional>
-#include <librdkafka/rdkafkacpp.h>
-#include <memory>
-#include <string>
-#include <vector>
-
-namespace hamstring {
-namespace base {
-
-// Kafka message callback
-using KafkaMessageCallback =
- std::function<void(const std::string &key, const std::string &value)>;
-
-// Base Kafka handler
-class KafkaHandler {
-public:
- virtual ~KafkaHandler() = default;
-
- // Get Kafka configuration
- static std::unique_ptr<RdKafka::Conf>
- create_config(const std::string &bootstrap_servers,
- const std::string &group_id = "");
-
-protected:
- std::string bootstrap_servers_;
- std::unique_ptr<RdKafka::Conf> conf_;
-};
-
-// Kafka producer
-class KafkaProduceHandler : public KafkaHandler {
-public:
- KafkaProduceHandler(const std::string &bootstrap_servers,
- const std::string &topic);
- ~KafkaProduceHandler();
-
- // Send a message
- bool send(const std::string &key, const std::string &value);
-
- // Send a message with timestamp
- bool send(const std::string &key, const std::string &value,
- int64_t timestamp);
-
- // Flush pending messages
- void flush(int timeout_ms = 10000);
-
-private:
- std::string topic_;
- std::unique_ptr<RdKafka::Producer> producer_;
- std::unique_ptr<RdKafka::Topic> topic_handle_;
-};
-
-// Kafka consumer
-class KafkaConsumeHandler : public KafkaHandler {
-public:
- KafkaConsumeHandler(const std::string &bootstrap_servers,
- const std::string &group_id,
- const std::vector<std::string> &topics);
- ~KafkaConsumeHandler();
-
- // Poll for messages (blocking)
- void poll(KafkaMessageCallback callback, int timeout_ms = 1000);
-
- // Start consuming in background
- void start_async(boost::asio::io_context &io_context,
- KafkaMessageCallback callback);
-
- // Stop consuming
- void stop();
-
- // Commit offsets
- void commit();
-
-private:
- std::vector<std::string> topics_;
- std::unique_ptr<RdKafka::KafkaConsumer> consumer_;
- bool running_ = false;
-};
-
-// Exactly-once Kafka handler
-class ExactlyOnceKafkaHandler {
-public:
- ExactlyOnceKafkaHandler(const std::string &bootstrap_servers,
- const std::string &consumer_group_id,
- const std::vector<std::string> &consume_topics,
- const std::string &produce_topic);
- ~ExactlyOnceKafkaHandler();
-
- // Process messages with exactly-once semantics
- void
- process(std::function<std::string(const std::string &key, const std::string &value)>
- transform_fn,
- int timeout_ms = 1000);
-
- // Start processing in background
- void start_async(
- boost::asio::io_context &io_context,
- std::function<std::string(const std::string &key, const std::string &value)>
- transform_fn);
-
- // Stop processing
- void stop();
-
-private:
- std::unique_ptr<KafkaConsumeHandler> consumer_;
- std::unique_ptr<KafkaProduceHandler> producer_;
- bool running_ = false;
-};
-
-} // namespace base
-} // namespace hamstring
diff --git a/cpp/include/hamstring/base/logger.hpp b/cpp/include/hamstring/base/logger.hpp
deleted file mode 100644
index 31ffe0de..00000000
--- a/cpp/include/hamstring/base/logger.hpp
+++ /dev/null
@@ -1,32 +0,0 @@
-#pragma once
-
-#include <memory>
-#include <spdlog/sinks/stdout_color_sinks.h>
-#include <spdlog/spdlog.h>
-#include <string>
-
-namespace hamstring {
-namespace base {
-
-class Logger {
-public:
- // Get or create a logger for a specific module
- static std::shared_ptr<spdlog::logger>
- get_logger(const std::string &module_name);
-
- // Set log level for a specific module
- static void set_level(const std::string &module_name,
- spdlog::level::level_enum level);
-
- // Set log level for all loggers
- static void set_global_level(spdlog::level::level_enum level);
-
- // Initialize logging system with configuration
- static void initialize(bool debug = false);
-
-private:
- static std::shared_ptr<spdlog::logger> create_logger(const std::string &name);
-};
-
-} // namespace base
-} // namespace hamstring
diff --git a/cpp/include/hamstring/base/utils.hpp b/cpp/include/hamstring/base/utils.hpp
deleted file mode 100644
index 5044176b..00000000
--- a/cpp/include/hamstring/base/utils.hpp
+++ /dev/null
@@ -1,50 +0,0 @@
-#pragma once
-
-#include <chrono>
-#include <cstdint>
-#include <optional>
-#include <string>
-#include <vector>
-
-namespace hamstring {
-namespace base {
-namespace utils {
-
-// UUID generation
-std::string generate_uuid();
-
-// IP address utilities
-bool is_valid_ipv4(const std::string &ip);
-bool is_valid_ipv6(const std::string &ip);
-std::string get_subnet_id(const std::string &ip, int prefix_length);
-
-// Time utilities
-std::string format_timestamp(const std::chrono::system_clock::time_point &tp,
- const std::string &format = "%Y-%m-%dT%H:%M:%S");
-std::chrono::system_clock::time_point
-parse_timestamp(const std::string &ts_str,
- const std::string &format = "%Y-%m-%dT%H:%M:%S");
-int64_t timestamp_to_ms(const std::chrono::system_clock::time_point &tp);
-std::chrono::system_clock::time_point ms_to_timestamp(int64_t ms);
-
-// String utilities
-std::vector<std::string> split(const std::string &str, char delimiter);
-std::string join(const std::vector<std::string> &vec,
- const std::string &delimiter);
-std::string trim(const std::string &str);
-std::string to_lower(const std::string &str);
-std::string to_upper(const std::string &str);
-
-// Domain name utilities
-std::string extract_fqdn(const std::string &domain);
-std::string extract_second_level_domain(const std::string &domain);
-std::string extract_third_level_domain(const std::string &domain);
-std::optional<std::string> extract_tld(const std::string &domain);
-
-// Hash utilities
-std::string sha256_file(const std::string &filepath);
-std::string sha256_string(const std::string &data);
-
-} // namespace utils
-} // namespace base
-} // namespace hamstring
diff --git a/cpp/include/hamstring/config/config.hpp b/cpp/include/hamstring/config/config.hpp
deleted file mode 100644
index 2c5f1a3c..00000000
--- a/cpp/include/hamstring/config/config.hpp
+++ /dev/null
@@ -1,162 +0,0 @@
-#pragma once
-
-#include <map>
-#include <memory>
-#include <string>
-#include <vector>
-#include <yaml-cpp/yaml.h>
-
-namespace hamstring {
-namespace config {
-
-// Logging configuration
-struct ModuleLoggingConfig {
- bool debug = false;
-};
-
-struct LoggingConfig {
- bool base_debug = false;
- std::map<std::string, ModuleLoggingConfig> modules;
-
- static LoggingConfig from_yaml(const YAML::Node& node);
-};
-
-// Kafka broker configuration
-struct KafkaBroker {
- std::string hostname;
- int internal_port;
- int external_port;
- std::string node_ip;
-
- static KafkaBroker from_yaml(const YAML::Node& node);
-};
-
-// Environment configuration
-struct EnvironmentConfig {
- std::vector<KafkaBroker> kafka_brokers;
- std::map<std::string, std::string> kafka_topics_prefix;
- std::string clickhouse_hostname;
-
- static EnvironmentConfig from_yaml(const YAML::Node& node);
- std::string get_kafka_bootstrap_servers() const;
-};
-
-// Field validation types
-enum class FieldType {
- RegEx,
- Timestamp,
- IpAddress,
- ListItem
-};
-
-struct FieldConfig {
- std::string name;
- FieldType type;
- std::string pattern; // For RegEx
- std::string timestamp_format; // For Timestamp
- std::vector<std::string> allowed_list; // For ListItem
- std::vector<std::string> relevant_list; // For ListItem
-
- static FieldConfig from_yaml(const YAML::Node& node);
-};
-
-// Batch handler configuration
-struct BatchHandlerConfig {
- int batch_size = 2000;
- double batch_timeout = 30.0;
- int ipv4_prefix_length = 24;
- int ipv6_prefix_length = 64;
-
- static BatchHandlerConfig from_yaml(const YAML::Node& node);
-};
-
-// Collector configuration
-struct CollectorConfig {
- std::string name;
- std::string protocol_base;
- std::vector<FieldConfig> required_log_information;
- BatchHandlerConfig batch_handler_config;
-
- static CollectorConfig from_yaml(const YAML::Node& node, const BatchHandlerConfig& default_config);
-};
-
-// Prefilter configuration
-struct PrefilterConfig {
- std::string name;
- std::string relevance_method;
- std::string collector_name;
-
- static PrefilterConfig from_yaml(const YAML::Node& node);
-};
-
-// Inspector configuration
-struct InspectorConfig {
- std::string name;
- std::string inspector_module_name;
- std::string inspector_class_name;
- std::string prefilter_name;
- std::string mode; // univariate, multivariate, ensemble
- YAML::Node models;
- YAML::Node ensemble;
- double anomaly_threshold;
- double score_threshold;
- std::string time_type;
- int time_range;
-
- static InspectorConfig from_yaml(const YAML::Node& node);
-};
-
-// Detector configuration
-struct DetectorConfig {
- std::string name;
- std::string detector_module_name;
- std::string detector_class_name;
- std::string model;
- std::string checksum;
- std::string base_url;
- double threshold;
- std::string inspector_name;
-
- static DetectorConfig from_yaml(const YAML::Node& node);
-};
-
-// Monitoring configuration
-struct MonitoringConfig {
- int clickhouse_batch_size = 50;
- double clickhouse_batch_timeout = 2.0;
-
- static MonitoringConfig from_yaml(const YAML::Node& node);
-};
-
-// Pipeline configuration
-struct PipelineConfig {
- std::string logserver_input_file;
- BatchHandlerConfig default_batch_handler_config;
- std::vector<CollectorConfig> collectors;
- std::vector<PrefilterConfig> prefilters;
- std::vector<InspectorConfig> inspectors;
- std::vector<DetectorConfig> detectors;
- MonitoringConfig monitoring;
-
- static PipelineConfig from_yaml(const YAML::Node& node);
-};
-
-// Root configuration
-class Config {
-public:
- LoggingConfig logging;
- PipelineConfig pipeline;
- EnvironmentConfig environment;
-
- // Load configuration from YAML file
- static std::shared_ptr<Config> load_from_file(const std::string& filepath);
-
- // Load configuration from YAML string
- static std::shared_ptr<Config> load_from_string(const std::string& yaml_content);
-
-private:
- static std::shared_ptr<Config> from_yaml(const YAML::Node& root);
-};
-
-} // namespace config
-} // namespace hamstring
diff --git a/cpp/include/hamstring/detector/detector.hpp b/cpp/include/hamstring/detector/detector.hpp
deleted file mode 100644
index 7da5cf14..00000000
--- a/cpp/include/hamstring/detector/detector.hpp
+++ /dev/null
@@ -1,40 +0,0 @@
-#pragma once
-
-#include "hamstring/detector/feature_extractor.hpp"
-#include <memory>
-#include <string>
-#include <vector>
-
-// Forward declaration for ONNX Runtime classes to avoid exposing them in the
-// header
-namespace onnxruntime {
-class InferenceSession;
-class Env;
-class SessionOptions;
-class RunOptions;
-} // namespace onnxruntime
-
-namespace hamstring {
-namespace detector {
-
-class Detector {
-public:
- Detector();
- ~Detector();
-
- // Load ONNX model from file
- void load_model(const std::string &model_path);
-
- // Predict probability of domain being DGA (0.0 - 1.0)
- float predict(const std::string &domain);
-
-private:
- FeatureExtractor feature_extractor_;
-
- // Pimpl idiom for ONNX Runtime objects
- struct Impl;
- std::unique_ptr<Impl> impl_;
-};
-
-} // namespace detector
-} // namespace hamstring
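
Illustrative use of the Detector above; the model path and the 0.5 cut-off are placeholders, since in the service the threshold comes from `DetectorConfig`:

```cpp
#include "hamstring/detector/detector.hpp"

void detector_example() {
  hamstring::detector::Detector det;
  det.load_model("models/dga_classifier.onnx"); // hypothetical path

  // predict() returns P(domain is DGA-generated) in [0.0, 1.0].
  float p = det.predict("xj3k9q2v.example");
  bool flagged = p >= 0.5f; // real threshold is configured per detector
  (void)flagged;
}
```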
diff --git a/cpp/include/hamstring/detector/detector_service.hpp b/cpp/include/hamstring/detector/detector_service.hpp
deleted file mode 100644
index 5e288c53..00000000
--- a/cpp/include/hamstring/detector/detector_service.hpp
+++ /dev/null
@@ -1,70 +0,0 @@
-#pragma once
-
-#include "hamstring/base/clickhouse_sender.hpp"
-#include "hamstring/base/data_classes.hpp"
-#include "hamstring/base/kafka_handler.hpp"
-#include "hamstring/base/logger.hpp"
-#include "hamstring/config/config.hpp"
-#include "hamstring/detector/detector.hpp"
-#include <atomic>
-#include <cstdint>
-#include <memory>
-#include <string>
-#include <thread>
-#include <vector>
-
-namespace hamstring {
-namespace detector {
-
-class DetectorService {
-public:
- DetectorService(const std::string &name, const std::string &consume_topic,
- const std::string &model_path, double threshold,
- std::shared_ptr<config::Config> config,
- const std::string &bootstrap_servers,
- const std::string &group_id);
- ~DetectorService();
-
- void start();
- void stop();
- bool is_running() const { return running_; }
-
- struct Stats {
- uint64_t batches_consumed = 0;
- uint64_t domains_scanned = 0;
- uint64_t domains_detected = 0;
- };
- Stats get_stats() const;
-
-private:
- void consume_loop();
- void process_batch(const base::Batch &batch);
-
- std::string name_;
- std::string consume_topic_;
- std::string model_path_;
- double threshold_;
- std::shared_ptr<config::Config> config_;
-
- std::shared_ptr<spdlog::logger> logger_;
- std::unique_ptr<base::KafkaConsumeHandler> consumer_;
- std::shared_ptr<base::ClickHouseSender> clickhouse_;
-
- // The core detector logic
- Detector detector_;
-
- std::atomic<bool> running_{false};
- std::thread worker_thread_;
-
- // Stats
- std::atomic<uint64_t> batches_consumed_{0};
- std::atomic<uint64_t> domains_scanned_{0};
- std::atomic<uint64_t> domains_detected_{0};
-};
-
-// Factory function
-std::vector<std::unique_ptr<DetectorService>>
-create_detector_services(std::shared_ptr<config::Config> config);
-
-} // namespace detector
-} // namespace hamstring
diff --git a/cpp/include/hamstring/detector/feature_extractor.hpp b/cpp/include/hamstring/detector/feature_extractor.hpp
deleted file mode 100644
index 94a4163a..00000000
--- a/cpp/include/hamstring/detector/feature_extractor.hpp
+++ /dev/null
@@ -1,78 +0,0 @@
-#pragma once
-
-#include <map>
-#include <memory>
-#include <string>
-#include <vector>
-
-namespace hamstring {
-namespace detector {
-
-// Feature vector for a domain name
-struct DomainFeatures {
- // Label statistics
- int label_length = 0;
- int label_max = 0;
- double label_average = 0.0;
-
- // Character frequency (a-z)
- std::map<char, double> char_freq;
-
- // Domain level counts
- double fqdn_full_count = 0.0;
- double fqdn_alpha_count = 0.0;
- double fqdn_numeric_count = 0.0;
- double fqdn_special_count = 0.0;
-
- double secondleveldomain_full_count = 0.0;
- double secondleveldomain_alpha_count = 0.0;
- double secondleveldomain_numeric_count = 0.0;
- double secondleveldomain_special_count = 0.0;
-
- double thirdleveldomain_full_count = 0.0;
- double thirdleveldomain_alpha_count = 0.0;
- double thirdleveldomain_numeric_count = 0.0;
- double thirdleveldomain_special_count = 0.0;
-
- // Entropy
- double fqdn_entropy = 0.0;
- double secondleveldomain_entropy = 0.0;
- double thirdleveldomain_entropy = 0.0;
-
- // Convert to vector for ML model input
- std::vector<float> to_vector() const;
-
- // Get feature names (for debugging/logging)
- static std::vector<std::string> get_feature_names();
-};
-
-// Feature extractor matching Python implementation
-class FeatureExtractor {
-public:
- FeatureExtractor() = default;
-
- // Extract features from a domain name
- DomainFeatures extract(const std::string &domain) const;
-
-private:
- // Helper methods matching Python implementation
- int count_labels(const std::string &domain) const;
- int get_max_label_length(const std::string &domain) const;
- double get_average_label_length(const std::string &domain) const;
-
- std::map<char, double>
- calculate_char_frequency(const std::string &domain) const;
-
- double calculate_alpha_ratio(const std::string &text) const;
- double calculate_numeric_ratio(const std::string &text) const;
- double calculate_special_ratio(const std::string &text) const;
-
- double calculate_entropy(const std::string &text) const;
-
- std::string extract_fqdn(const std::string &domain) const;
- std::string extract_second_level_domain(const std::string &domain) const;
- std::string extract_third_level_domain(const std::string &domain) const;
-};
-
-} // namespace detector
-} // namespace hamstring
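
The entropy fields above are Shannon entropy over the character distribution, H = -sum(p_c * log2(p_c)). A standalone sketch of the computation, not the deleted implementation:

```cpp
#include <cmath>
#include <map>
#include <string>

// Shannon entropy of a string's character distribution; algorithmically
// generated domains typically score higher than human-chosen names.
double shannon_entropy(const std::string &text) {
  if (text.empty())
    return 0.0;
  std::map<char, int> counts;
  for (char c : text)
    ++counts[c];
  double h = 0.0;
  for (const auto &entry : counts) {
    double p = static_cast<double>(entry.second) / text.size();
    h -= p * std::log2(p);
  }
  return h;
}
```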
diff --git a/cpp/include/hamstring/inspector/anomaly_detector.hpp b/cpp/include/hamstring/inspector/anomaly_detector.hpp
deleted file mode 100644
index 34506e45..00000000
--- a/cpp/include/hamstring/inspector/anomaly_detector.hpp
+++ /dev/null
@@ -1,124 +0,0 @@
-#pragma once
-
-#include "hamstring/base/data_classes.hpp"
-#include "hamstring/config/config.hpp"
-#include <cstddef>
-#include <deque>
-#include <string>
-
-namespace hamstring {
-namespace inspector {
-
-/**
- * @brief Metrics extracted from a batch for anomaly detection
- */
-struct BatchMetrics {
- double nxdomain_rate; // Ratio of NXDOMAIN responses
- double avg_domain_length; // Average domain name length
- double domain_entropy; // Shannon entropy of domain names
- double unique_domain_ratio; // Unique domains / total queries
- size_t total_queries; // Total number of queries
- double query_rate; // Queries per second
-
- // Character distribution
- double numeric_char_ratio; // Ratio of numeric characters
- double special_char_ratio; // Ratio of special characters
-};
-
-/**
- * @brief Statistical anomaly detector using time-series analysis
- *
- * This class implements lightweight statistical methods for anomaly detection:
- * - Z-score outlier detection
- * - Moving averages and standard deviations
- * - Threshold-based rules
- * - Multi-metric ensemble scoring
- */
-class AnomalyDetector {
-public:
- /**
- * @brief Construct anomaly detector with configuration
- *
- * @param config Inspector configuration with thresholds
- */
- explicit AnomalyDetector(const config::InspectorConfig &config);
-
- /**
- * @brief Analyze batch and return suspicion score
- *
- * @param batch Batch to analyze
- * @return Suspicion score [0.0, 1.0] where higher = more suspicious
- */
- double analyze_batch(const base::Batch &batch);
-
- /**
- * @brief Update internal statistics with new batch
- *
- * @param batch Batch to update statistics with
- */
- void update_state(const base::Batch &batch);
-
- /**
- * @brief Get current statistics (for debugging/monitoring)
- */
- struct Statistics {
- double mean_nxdomain_rate;
- double stddev_nxdomain_rate;
- double mean_domain_length;
- double stddev_domain_length;
- size_t samples_count;
- };
-
- Statistics get_statistics() const;
-
-private:
- /**
- * @brief Extract metrics from batch
- */
- BatchMetrics extract_metrics(const base::Batch &batch);
-
- /**
- * @brief Calculate Z-score for a value given historical data
- */
- double calculate_z_score(double value, double mean, double stddev);
-
- /**
- * @brief Update rolling statistics
- */
- void update_rolling_stats();
-
- /**
- * @brief Calculate Shannon entropy of a string
- */
- double calculate_entropy(const std::string &str);
-
- /**
- * @brief Detect anomalies using multiple methods
- */
- double detect_anomalies(const BatchMetrics &metrics);
-
- // Configuration
- config::InspectorConfig config_;
-
- // Rolling window statistics
- struct RollingStats {
- std::deque<double> nxdomain_rates;
- std::deque<double> avg_domain_lengths;
- std::deque<double> query_rates;
- std::deque<double> entropies;
-
- double mean_nxdomain = 0.0;
- double stddev_nxdomain = 0.0;
- double mean_domain_length = 0.0;
- double stddev_domain_length = 0.0;
- double mean_entropy = 0.0;
- double stddev_entropy = 0.0;
- };
-
- RollingStats stats_;
- size_t window_size_;
- double z_score_threshold_;
-};
-
-} // namespace inspector
-} // namespace hamstring
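
The z-score rule mentioned above reduces to z = (value - mean) / stddev against the rolling window; a minimal sketch with illustrative names:

```cpp
#include <cmath>

// A metric is treated as an outlier when it sits more than `threshold`
// standard deviations from the rolling mean.
double z_score(double value, double mean, double stddev) {
  if (stddev <= 0.0)
    return 0.0; // degenerate window: no deviation signal yet
  return (value - mean) / stddev;
}

bool is_outlier(double value, double mean, double stddev, double threshold) {
  return std::fabs(z_score(value, mean, stddev)) > threshold;
}
```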
diff --git a/cpp/include/hamstring/inspector/inspector.hpp b/cpp/include/hamstring/inspector/inspector.hpp
deleted file mode 100644
index 65dd5559..00000000
--- a/cpp/include/hamstring/inspector/inspector.hpp
+++ /dev/null
@@ -1,154 +0,0 @@
-#pragma once
-
-#include "hamstring/base/clickhouse_sender.hpp"
-#include "hamstring/base/data_classes.hpp"
-#include "hamstring/base/kafka_handler.hpp"
-#include "hamstring/base/logger.hpp"
-#include "hamstring/config/config.hpp"
-#include "hamstring/inspector/anomaly_detector.hpp"
-#include <atomic>
-#include <map>
-#include <memory>
-#include <thread>
-#include <vector>
-
-namespace hamstring {
-namespace inspector {
-
-/**
- * @brief Inspector - Performs anomaly detection on batches
- *
- * The Inspector consumes filtered batches from the Prefilter stage and
- * performs anomaly detection. Suspicious batches are grouped by source IP
- * and forwarded to the Detector stage for further analysis.
- *
- * Features:
- * - Anomaly detection using statistical models
- * - IP-based batch grouping
- * - Threshold-based filtering
- * - ClickHouse monitoring integration
- */
-class Inspector {
-public:
- /**
- * @brief Construct a new Inspector
- *
- * @param name Inspector name
- * @param consume_topic Kafka topic to consume batches from
- * @param produce_topics Kafka topics to produce suspicious batches to
- * @param mode Anomaly detection mode (univariate, multivariate, ensemble)
- * @param anomaly_threshold Threshold for anomaly ratio
- * @param score_threshold Threshold for anomaly scores
- * @param config Global configuration
- * @param bootstrap_servers Kafka broker addresses
- * @param group_id Kafka consumer group ID
- */
- Inspector(const std::string &name, const std::string &consume_topic,
- const std::vector<std::string> &produce_topics,
- const std::string &mode, double anomaly_threshold,
- double score_threshold, std::shared_ptr<config::Config> config,
- const std::string &bootstrap_servers, const std::string &group_id);
-
- ~Inspector();
-
- /**
- * @brief Start the inspector
- *
- * Begins consuming batches from Kafka and performing anomaly detection.
- */
- void start();
-
- /**
- * @brief Stop the inspector gracefully
- */
- void stop();
-
- /**
- * @brief Check if inspector is running
- */
- bool is_running() const { return running_; }
-
- /**
- * @brief Get inspector statistics
- */
- struct Stats {
- size_t batches_consumed;
- size_t batches_suspicious;
- size_t batches_filtered;
- size_t suspicious_batches_sent;
- };
-
- Stats get_stats() const;
-
-private:
- /**
- * @brief Main message consumption loop
- */
- void consume_loop();
-
- /**
- * @brief Process a single batch
- *
- * @param batch Batch to inspect
- */
- void process_batch(const base::Batch &batch);
-
- /**
- * @brief Check if batch is suspicious
- *
- * @param batch Batch to check
- * @return true if suspicious, false otherwise
- */
- bool is_suspicious(const base::Batch &batch);
-
- /**
- * @brief Send suspicious batches to Kafka
- *
- * @param batches_by_ip Map of IP address to batches
- * @param original_batch_id Original batch ID
- */
- void send_suspicious_batches(
- const std::map<std::string, std::vector<std::shared_ptr<base::Batch>>>
- &batches_by_ip,
- const std::string &original_batch_id);
-
- // Configuration
- std::string name_;
- std::string consume_topic_;
- std::vector<std::string> produce_topics_;
- std::string mode_;
- double anomaly_threshold_;
- double score_threshold_;
- std::shared_ptr<config::Config> config_;
-
- // Components
- std::unique_ptr<base::KafkaConsumeHandler> consumer_;
- std::vector<std::unique_ptr<base::KafkaProduceHandler>> producers_;
- std::shared_ptr<base::ClickHouseSender> clickhouse_;
- std::unique_ptr<AnomalyDetector> anomaly_detector_;
-
- // Threading
- std::atomic<bool> running_{false};
- std::thread worker_thread_;
-
- // Metrics
- std::atomic<size_t> batches_consumed_{0};
- std::atomic<size_t> batches_suspicious_{0};
- std::atomic<size_t> batches_filtered_{0};
- std::atomic<size_t> suspicious_batches_sent_{0};
-
- // Logger
- std::shared_ptr<spdlog::logger> logger_;
-};
-
-/**
- * @brief Create Inspector instances from configuration
- *
- * @param config Application configuration
- * @return Vector of Inspector instances
- */
-std::vector<std::unique_ptr<Inspector>>
-create_inspectors(std::shared_ptr<config::Config> config);
-
-} // namespace inspector
-} // namespace hamstring
diff --git a/cpp/include/hamstring/logcollector/logcollector.hpp b/cpp/include/hamstring/logcollector/logcollector.hpp
deleted file mode 100644
index 054ca17a..00000000
--- a/cpp/include/hamstring/logcollector/logcollector.hpp
+++ /dev/null
@@ -1,278 +0,0 @@
-#pragma once
-
-#include "hamstring/base/clickhouse_sender.hpp"
-#include "hamstring/base/data_classes.hpp"
-#include "hamstring/base/kafka_handler.hpp"
-#include "hamstring/base/logger.hpp"
-#include "hamstring/config/config.hpp"
-#include <atomic>
-#include <chrono>
-#include <cstddef>
-#include <memory>
-#include <mutex>
-#include <string>
-#include <thread>
-#include <unordered_map>
-#include <vector>
-
-namespace hamstring {
-namespace logcollector {
-
-/**
- * @brief Buffered batch for efficient log line aggregation
- *
- * Thread-safe batch container that groups log lines by subnet ID.
- * Automatically triggers sends when size or timeout limits are reached.
- */
-class BufferedBatch {
-public:
- /**
- * @brief Construct a new Buffered Batch
- *
- * @param collector_name Name of the collector
- * @param batch_size Max messages per batch
- * @param batch_timeout_ms Timeout in milliseconds
- */
- BufferedBatch(const std::string &collector_name, size_t batch_size,
- int batch_timeout_ms);
-
- ~BufferedBatch();
-
- /**
- * @brief Add a log line to the batch
- *
- * Thread-safe method to add a message to the appropriate batch.
- *
- * @param subnet_id Subnet identifier for batching
- * @param logline LogLine to add
- * @return true if batch is ready to send
- */
- bool add_logline(const std::string &subnet_id, const base::LogLine &logline);
-
- /**
- * @brief Get completed batch for a subnet
- *
- * @param subnet_id Subnet identifier
- * @return Batch object ready to send
- */
- base::Batch get_batch(const std::string &subnet_id);
-
- /**
- * @brief Get all ready batches
- *
- * @return Vector of batches that are ready to send
- */
- std::vector<base::Batch> get_ready_batches();
-
- /**
- * @brief Force send all batches (timeout or shutdown)
- *
- * @return Vector of all batches
- */
- std::vector<base::Batch> flush_all();
-
- /**
- * @brief Get statistics about current batches
- */
- struct Stats {
- size_t total_batches;
- size_t total_loglines;
- size_t largest_batch;
- std::chrono::milliseconds oldest_batch_age;
- };
-
- Stats get_stats() const;
-
-private:
- struct BatchData {
- std::string batch_id;
- std::string subnet_id;
- std::vector<base::LogLine> loglines;
- std::chrono::system_clock::time_point created_at;
- std::chrono::system_clock::time_point last_updated;
- };
-
- mutable std::mutex batches_mutex_;
- std::unordered_map<std::string, BatchData> batches_;
-
- std::string collector_name_;
- size_t batch_size_;
- std::chrono::milliseconds batch_timeout_;
-
- // Metrics
- std::atomic<size_t> total_loglines_processed_{0};
- std::atomic<size_t> total_batches_sent_{0};
-
- std::shared_ptr<spdlog::logger> logger_;
-};
-
-/**
- * @brief LogCollector - Validates and batches log lines
- *
- * Main component for log collection stage. Features:
- * - Multi-threaded log line processing
- * - Field validation with configurable rules
- * - Subnet-based batching
- * - Automatic batch dispatch
- * - ClickHouse monitoring integration
- * - Horizontal scalability support
- */
-class LogCollector {
-public:
- /**
- * @brief Construct a new Log Collector
- *
- * @param name Collector name
- * @param protocol Protocol type (dns, http, etc.)
- * @param consume_topic Kafka topic to consume from
- * @param produce_topics Kafka topics to produce to
- * @param validation_config Field validation rules
- * @param config Global configuration
- * @param bootstrap_servers Kafka broker addresses
- * @param group_id Kafka consumer group ID
- */
- LogCollector(const std::string &name, const std::string &protocol,
- const std::string &consume_topic,
- const std::vector<std::string> &produce_topics,
- const std::vector<config::FieldConfig> &validation_config,
- std::shared_ptr<config::Config> config,
- const std::string &bootstrap_servers,
- const std::string &group_id);
-
- ~LogCollector();
-
- /**
- * @brief Start the collector
- *
- * Begins consuming from Kafka and processing log lines.
- */
- void start();
-
- /**
- * @brief Stop the collector gracefully
- *
- * Finishes processing in-flight messages and flushes batches.
- */
- void stop();
-
- /**
- * @brief Check if collector is running
- */
- bool is_running() const { return running_; }
-
- /**
- * @brief Get collector statistics
- */
- struct Stats {
- size_t messages_consumed;
- size_t messages_validated;
- size_t messages_failed;
- size_t batches_sent;
- double avg_validation_time_ms;
- double avg_batch_time_ms;
- };
-
- Stats get_stats() const;
-
-private:
- /**
- * @brief Main message consumption loop
- */
- void consume_loop();
-
- /**
- * @brief Process a single message
- *
- * @param message Raw message string from Kafka
- */
- void process_message(const std::string &message);
-
- /**
- * @brief Validate and parse a log line
- *
- * @param message Raw JSON message
- * @return Validated LogLine object
- * @throws std::runtime_error if validation fails
- */
- base::LogLine validate_logline(const std::string &message);
-
- /**
- * @brief Calculate subnet ID from IP address
- *
- * @param ip_address IP address string
- * @return Subnet ID string
- */
- std::string get_subnet_id(const std::string &ip_address);
-
- /**
- * @brief Batch timeout handler
- *
- * Periodically checks for batches that need to be sent due to timeout.
- */
- void batch_timeout_handler();
-
- /**
- * @brief Send batches to Kafka
- *
- * @param batches Batches to send
- */
- void send_batches(const std::vector<base::Batch> &batches);
-
- /**
- * @brief Log failed validation to ClickHouse
- *
- * @param message Original message
- * @param reason Failure reason
- */
- void log_failed_logline(const std::string &message,
- const std::string &reason);
-
- // Configuration
- std::string name_;
- std::string protocol_;
- std::string consume_topic_;
- std::vector<std::string> produce_topics_;
- std::vector<config::FieldConfig> validation_config_;
- std::shared_ptr<config::Config> config_;
-
- // Batch configuration
- size_t batch_size_;
- int batch_timeout_ms_;
- int ipv4_prefix_length_;
- int ipv6_prefix_length_;
-
- // Components
- std::unique_ptr<BufferedBatch> batch_handler_;
- std::unique_ptr<base::KafkaConsumeHandler> consumer_;
- std::unique_ptr<base::KafkaProduceHandler> producer_;
- std::shared_ptr<base::ClickHouseSender> clickhouse_;
-
- // Threading
- std::atomic<bool> running_{false};
- std::thread consumer_thread_;
- std::thread batch_timer_thread_;
-
- // Metrics
- std::atomic<size_t> messages_consumed_{0};
- std::atomic<size_t> messages_validated_{0};
- std::atomic<size_t> messages_failed_{0};
- std::atomic<size_t> batches_sent_{0};
-
- // Logger
- std::shared_ptr<spdlog::logger> logger_;
-};
-
-/**
- * @brief Create LogCollector instances from configuration
- *
- * Factory function that creates collectors for each configured collector.
- * Supports horizontal scaling by creating multiple instances.
- *
- * @param config Application configuration
- * @return Vector of LogCollector instances
- */
-std::vector<std::unique_ptr<LogCollector>>
-create_logcollectors(std::shared_ptr<config::Config> config);
-
-} // namespace logcollector
-} // namespace hamstring
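
A usage sketch of the BufferedBatch flow described above; the sizes and subnet string are illustrative, since the real values come from `BatchHandlerConfig`:

```cpp
#include "hamstring/logcollector/logcollector.hpp"

void buffered_batch_example() {
  hamstring::logcollector::BufferedBatch buffer(
      "collector-dns", /*batch_size=*/2000, /*batch_timeout_ms=*/30000);

  hamstring::base::LogLine line; // would be a validated log line in practice
  if (buffer.add_logline("192.0.2.0/24", line)) {
    // Size limit reached: the subnet's batch is ready for Kafka.
    auto batch = buffer.get_batch("192.0.2.0/24");
    (void)batch;
  }

  // On timeout or shutdown, everything still buffered is flushed at once.
  auto remaining = buffer.flush_all();
  (void)remaining;
}
```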
diff --git a/cpp/include/hamstring/logserver/logserver.hpp b/cpp/include/hamstring/logserver/logserver.hpp
deleted file mode 100644
index 5f7fc73e..00000000
--- a/cpp/include/hamstring/logserver/logserver.hpp
+++ /dev/null
@@ -1,133 +0,0 @@
-#pragma once
-
-#include "hamstring/base/clickhouse_sender.hpp"
-#include "hamstring/base/kafka_handler.hpp"
-#include "hamstring/base/logger.hpp"
-#include "hamstring/config/config.hpp"
-#include <atomic>
-#include <memory>
-#include <string>
-#include <thread>
-#include <vector>
-
-namespace hamstring {
-namespace logserver {
-
-/**
- * @brief LogServer - Entry point for log data into the pipeline
- *
- * The LogServer consumes log messages from Kafka topics, stores them in
- * ClickHouse for monitoring, and forwards them to the appropriate collector
- * topics based on protocol configuration.
- *
- * Features:
- * - Consumes from multiple Kafka input topics
- * - Produces to multiple collector topics
- * - Logs all messages and timestamps to ClickHouse
- * - Async processing for high throughput
- * - Graceful shutdown support
- */
-class LogServer {
-public:
- /**
- * @brief Construct a LogServer instance
- *
- * @param consume_topic Kafka topic to consume from
- * @param produce_topics List of Kafka topics to produce to
- * @param clickhouse ClickHouse sender for monitoring
- * @param bootstrap_servers Comma-separated Kafka broker addresses
- * @param group_id Kafka consumer group ID
- */
- LogServer(const std::string &consume_topic,
- const std::vector<std::string> &produce_topics,
- std::shared_ptr<base::ClickHouseSender> clickhouse,
- const std::string &bootstrap_servers, const std::string &group_id);
-
- ~LogServer();
-
- /**
- * @brief Start the LogServer
- *
- * Begins consuming messages from Kafka and processing them.
- * This method blocks until stop() is called.
- */
- void start();
-
- /**
- * @brief Stop the LogServer
- *
- * Gracefully shuts down the server, finishing any in-flight messages.
- */
- void stop();
-
- /**
- * @brief Check if server is running
- */
- bool is_running() const { return running_; }
-
-private:
- /**
- * @brief Send a message to all producer topics
- *
- * @param message_id UUID of the message
- * @param message Message content
- */
- void send(const std::string &message_id, const std::string &message);
-
- /**
- * @brief Main message fetching loop
- *
- * Continuously fetches messages from Kafka and processes them.
- */
- void fetch_from_kafka();
-
- /**
- * @brief Log message to ClickHouse
- *
- * @param message_id UUID of the message
- * @param message Message content
- */
- void log_message(const std::string &message_id, const std::string &message);
-
- /**
- * @brief Log timestamp event to ClickHouse
- *
- * @param message_id UUID of the message
- * @param event Event type (timestamp_in, timestamp_out)
- */
- void log_timestamp(const std::string &message_id, const std::string &event);
-
- // Configuration
- std::string consume_topic_;
- std::vector<std::string> produce_topics_;
-
- // Kafka handlers
- std::unique_ptr<base::KafkaConsumeHandler> consumer_;
- std::vector<std::unique_ptr<base::KafkaProduceHandler>> producers_;
-
- // ClickHouse for monitoring
- std::shared_ptr<base::ClickHouseSender> clickhouse_;
-
- // Logger
- std::shared_ptr<spdlog::logger> logger_;
-
- // Runtime state
- std::atomic<bool> running_;
- std::thread worker_thread_;
-};
-
-/**
- * @brief Create and start LogServer instances based on configuration
- *
- * Creates one LogServer instance per protocol defined in the configuration.
- * Each server consumes from its protocol-specific topic and produces to
- * collector topics that handle that protocol.
- *
- * @param config Application configuration
- * @return Vector of LogServer instances
- */
-std::vector<std::unique_ptr<LogServer>>
-create_logservers(std::shared_ptr<config::Config> config);
-
-} // namespace logserver
-} // namespace hamstring
diff --git a/cpp/include/hamstring/prefilter/prefilter.hpp b/cpp/include/hamstring/prefilter/prefilter.hpp
deleted file mode 100644
index 70fcd2a3..00000000
--- a/cpp/include/hamstring/prefilter/prefilter.hpp
+++ /dev/null
@@ -1,149 +0,0 @@
-#pragma once
-
-#include "hamstring/base/clickhouse_sender.hpp"
-#include "hamstring/base/data_classes.hpp"
-#include "hamstring/base/kafka_handler.hpp"
-#include "hamstring/base/logger.hpp"
-#include "hamstring/config/config.hpp"
-#include <atomic>
-#include <memory>
-#include <string>
-#include <thread>
-#include <vector>
-
-namespace hamstring {
-namespace prefilter {
-
-/**
- * @brief Prefilter - Filters batches based on relevance rules
- *
- * The Prefilter consumes batches from the LogCollector stage and applies
- * relevance-based filtering. Only relevant log lines are forwarded to
- * the Inspector stage for anomaly detection.
- *
- * Features:
- * - Rule-based relevance filtering
- * - Batch processing with metrics
- * - ClickHouse monitoring integration
- * - Multi-threaded processing
- */
-class Prefilter {
-public:
- /**
- * @brief Construct a new Prefilter
- *
- * @param name Prefilter name
- * @param consume_topic Kafka topic to consume batches from
- * @param produce_topics Kafka topics to produce filtered batches to
- * @param relevance_function Name of relevance function to use
- * @param validation_config Field validation rules
- * @param config Global configuration
- * @param bootstrap_servers Kafka broker addresses
- * @param group_id Kafka consumer group ID
- */
- Prefilter(const std::string &name, const std::string &consume_topic,
- const std::vector<std::string> &produce_topics,
- const std::string &relevance_function,
- const std::vector<config::FieldConfig> &validation_config,
- std::shared_ptr<config::Config> config,
- const std::string &bootstrap_servers, const std::string &group_id);
-
- ~Prefilter();
-
- /**
- * @brief Start the prefilter
- *
- * Begins consuming batches from Kafka and filtering them.
- */
- void start();
-
- /**
- * @brief Stop the prefilter gracefully
- */
- void stop();
-
- /**
- * @brief Check if prefilter is running
- */
- bool is_running() const { return running_; }
-
- /**
- * @brief Get prefilter statistics
- */
- struct Stats {
- size_t batches_consumed;
- size_t batches_sent;
- size_t loglines_received;
- size_t loglines_filtered;
- size_t loglines_sent;
- };
-
- Stats get_stats() const;
-
-private:
- /**
- * @brief Main message consumption loop
- */
- void consume_loop();
-
- /**
- * @brief Process a single batch
- *
- * @param batch Batch to process
- */
- void process_batch(const base::Batch &batch);
-
- /**
- * @brief Check if a log line is relevant
- *
- * @param logline LogLine to check
- * @return true if relevant, false otherwise
- */
- bool check_relevance(const base::LogLine &logline);
-
- /**
- * @brief Send filtered batch to Kafka
- *
- * @param batch Filtered batch to send
- */
- void send_batch(const base::Batch &batch);
-
- // Configuration
- std::string name_;
- std::string consume_topic_;
- std::vector<std::string> produce_topics_;
- std::string relevance_function_;
- std::vector<config::FieldConfig> validation_config_;
- std::shared_ptr<config::Config> config_;
-
- // Components
- std::unique_ptr<base::KafkaConsumeHandler> consumer_;
- std::vector<std::unique_ptr<base::KafkaProduceHandler>> producers_;
- std::shared_ptr<base::ClickHouseSender> clickhouse_;
-
- // Threading
- std::atomic<bool> running_{false};
- std::thread worker_thread_;
-
- // Metrics
- std::atomic<size_t> batches_consumed_{0};
- std::atomic<size_t> batches_sent_{0};
- std::atomic<size_t> loglines_received_{0};
- std::atomic<size_t> loglines_filtered_{0};
- std::atomic<size_t> loglines_sent_{0};
-
- // Logger
- std::shared_ptr<spdlog::logger> logger_;
-};
-
-/**
- * @brief Create Prefilter instances from configuration
- *
- * @param config Application configuration
- * @return Vector of Prefilter instances
- */
-std::vector<std::unique_ptr<Prefilter>>
-create_prefilters(std::shared_ptr<config::Config> config);
-
-} // namespace prefilter
-} // namespace hamstring
diff --git a/cpp/src/CMakeLists.txt b/cpp/src/CMakeLists.txt
deleted file mode 100644
index 486bc26f..00000000
--- a/cpp/src/CMakeLists.txt
+++ /dev/null
@@ -1,48 +0,0 @@
-# Base sources
-set(BASE_SOURCES
- base/logger.cpp
- base/utils.cpp
- base/data_classes.cpp
- base/clickhouse_sender.cpp
- base/kafka_handler.cpp
-)
-
-# Config sources
-set(CONFIG_SOURCES
- config/config.cpp
-)
-
-# Base library with core infrastructure
-add_library(hamstring_base
- ${BASE_SOURCES}
- ${CONFIG_SOURCES}
-)
-
-target_link_libraries(hamstring_base
- PUBLIC
- spdlog::spdlog
- fmt::fmt
- RdKafka::rdkafka
- RdKafka::rdkafka++
- nlohmann_json::nlohmann_json
- yaml-cpp
- OpenSSL::SSL
- OpenSSL::Crypto
- ${CLICKHOUSE_CPP_LIB}
- ${ZSTD_LIB}
- ${CITYHASH_LIB}
-)
-
-target_include_directories(hamstring_base
- PUBLIC
- ${CMAKE_SOURCE_DIR}/include
- ${CLICKHOUSE_INCLUDE_DIR}
-)
-
-# Modules
-add_subdirectory(logserver)
-add_subdirectory(logcollector)
-add_subdirectory(prefilter)
-add_subdirectory(inspector)
-# add_subdirectory(detector) # Requires ONNX Runtime
-# add_subdirectory(monitoring)
diff --git a/cpp/src/base/clickhouse_sender.cpp b/cpp/src/base/clickhouse_sender.cpp
deleted file mode 100644
index 723a3d8e..00000000
--- a/cpp/src/base/clickhouse_sender.cpp
+++ /dev/null
@@ -1,236 +0,0 @@
-#include "hamstring/base/clickhouse_sender.hpp"
-#include "hamstring/base/logger.hpp"
-#include <chrono>
-#include <clickhouse/client.h>
-#include <memory>
-#include <stdexcept>
-#include <string>
-
-namespace hamstring {
-namespace base {
-
-ClickHouseSender::ClickHouseSender(const std::string &hostname, int port,
- const std::string &database,
- const std::string &user,
- const std::string &password)
- : hostname_(hostname), port_(port), database_(database), user_(user),
- password_(password), connected_(false) {
-
- auto logger = Logger::get_logger("clickhouse");
-
- try {
- // Create ClickHouse client with connection options
- clickhouse::ClientOptions options;
- options.SetHost(hostname_);
- options.SetPort(port_);
- options.SetDefaultDatabase(database_);
- options.SetUser(user_);
- options.SetPassword(password_);
- options.SetPingBeforeQuery(true);
-
- client_ = std::make_unique<clickhouse::Client>(options);
- connected_ = true;
-
- logger->info("ClickHouse client connected to {}:{}/{}", hostname_, port_,
- database_);
- } catch (const std::exception &e) {
- logger->error("Failed to connect to ClickHouse: {}", e.what());
- connected_ = false;
- }
-}
-
-ClickHouseSender::~ClickHouseSender() = default;
-
-void ClickHouseSender::insert_batch_timestamp(const std::string &batch_id,
- const std::string &stage,
- const std::string &instance_name,
- const std::string &status,
- size_t message_count,
- bool is_active) {
- auto logger = Logger::get_logger("clickhouse.metrics");
- logger->debug("BATCH_TIMESTAMP: batch_id={}, stage={}, instance={}, "
- "status={}, count={}, active={}",
- batch_id, stage, instance_name, status, message_count,
- is_active);
-}
-
-void ClickHouseSender::insert_logline_timestamp(const std::string &logline_id,
- const std::string &stage,
- const std::string &status,
- bool is_active) {
- auto logger = Logger::get_logger("clickhouse.metrics");
- logger->trace(
- "LOGLINE_TIMESTAMP: logline_id={}, stage={}, status={}, active={}",
- logline_id, stage, status, is_active);
-}
-
-void ClickHouseSender::insert_fill_level(const std::string &stage,
- const std::string &entry_type,
- size_t entry_count) {
- auto logger = Logger::get_logger("clickhouse.metrics");
- logger->debug("FILL_LEVEL: stage={}, type={}, count={}", stage, entry_type,
- entry_count);
-}
-
-void ClickHouseSender::insert_dga_detection(const std::string &domain,
- double score,
- const std::string &batch_id,
- const std::string &src_ip) {
- auto logger = Logger::get_logger("clickhouse.detections");
- logger->info("DGA_DETECTION: domain={}, score={:.4f}, batch={}, ip={}",
- domain, score, batch_id, src_ip);
-}
-
-void ClickHouseSender::execute(const std::string &query) {
- auto logger = Logger::get_logger("clickhouse");
- logger->debug("EXECUTE: {}", query);
-}
-
-bool ClickHouseSender::ping() {
- if (!connected_ || !client_) {
- return false;
- }
-
- try {
- client_->Execute("SELECT 1");
- return true;
- } catch (const std::exception &e) {
- auto logger = Logger::get_logger("clickhouse");
- logger->error("ClickHouse ping failed: {}", e.what());
- return false;
- }
-}
-
-void ClickHouseSender::insert_server_log(const std::string &message_id,
- int64_t timestamp_ms,
- const std::string &message_text) {
- if (!connected_ || !client_) {
- auto logger = Logger::get_logger("clickhouse");
- logger->warn("ClickHouse not connected, skipping server_log insert");
- return;
- }
-
- try {
- // Create block with columns
- clickhouse::Block block;
-
- // message_id as String (ClickHouse will convert to UUID)
- // Using String instead of UUID column to avoid boost UUID parsing
- // complexity
- auto col_message_id = std::make_shared<clickhouse::ColumnString>();
- col_message_id->Append(message_id);
-
- // timestamp_in as DateTime64(6) - represented as milliseconds since epoch
- auto col_timestamp = std::make_shared<clickhouse::ColumnDateTime64>(6);
- col_timestamp->Append(timestamp_ms * 1000); // Convert ms to microseconds
-
- // message_text as String
- auto col_message = std::make_shared<clickhouse::ColumnString>();
- col_message->Append(message_text);
-
- // Add columns to block
- block.AppendColumn("message_id", col_message_id);
- block.AppendColumn("timestamp_in", col_timestamp);
- block.AppendColumn("message_text", col_message);
-
- // Insert block
- client_->Insert("server_logs", block);
-
- } catch (const std::exception &e) {
- auto logger = Logger::get_logger("clickhouse");
- logger->error("Failed to insert server_log: {}", e.what());
- }
-}
-
-void ClickHouseSender::insert_server_log_timestamp(
- const std::string &message_id, const std::string &event,
- int64_t event_timestamp_ms) {
- if (!connected_ || !client_) {
- auto logger = Logger::get_logger("clickhouse");
- logger->warn(
- "ClickHouse not connected, skipping server_log_timestamp insert");
- return;
- }
-
- try {
- // Create block with columns
- clickhouse::Block block;
-
- // message_id as String (ClickHouse will convert to UUID)
- auto col_message_id = std::make_shared<clickhouse::ColumnString>();
- col_message_id->Append(message_id);
-
- // event as String
- auto col_event = std::make_shared<clickhouse::ColumnString>();
- col_event->Append(event);
-
- // event_timestamp as DateTime64(6)
- auto col_timestamp = std::make_shared<clickhouse::ColumnDateTime64>(6);
- col_timestamp->Append(event_timestamp_ms *
- 1000); // Convert ms to microseconds
-
- // Add columns to block
- block.AppendColumn("message_id", col_message_id);
- block.AppendColumn("event", col_event);
- block.AppendColumn("event_timestamp", col_timestamp);
-
- // Insert block
- client_->Insert("server_logs_timestamps", block);
-
- } catch (const std::exception &e) {
- auto logger = Logger::get_logger("clickhouse");
- logger->error("Failed to insert server_log_timestamp: {}", e.what());
- }
-}
-
-void ClickHouseSender::insert_failed_logline(const std::string &message_text,
- int64_t timestamp_in_ms,
- int64_t timestamp_failed_ms,
- const std::string &reason) {
- if (!connected_ || !client_) {
- auto logger = Logger::get_logger("clickhouse");
- logger->warn("ClickHouse not connected, skipping failed_logline insert");
- return;
- }
-
- try {
- // Create block with columns
- clickhouse::Block block;
-
- // message_text as String
- auto col_message = std::make_shared<clickhouse::ColumnString>();
- col_message->Append(message_text);
-
- // timestamp_in as DateTime64(6)
- auto col_timestamp_in = std::make_shared<clickhouse::ColumnDateTime64>(6);
- col_timestamp_in->Append(timestamp_in_ms *
- 1000); // Convert ms to microseconds
-
- // timestamp_failed as DateTime64(6)
- auto col_timestamp_failed =
- std::make_shared<clickhouse::ColumnDateTime64>(6);
- col_timestamp_failed->Append(timestamp_failed_ms *
- 1000); // Convert ms to microseconds
-
- // reason_for_failure as String (nullable in schema, but we always provide a
- // value)
- auto col_reason = std::make_shared<clickhouse::ColumnString>();
- col_reason->Append(reason);
-
- // Add columns to block
- block.AppendColumn("message_text", col_message);
- block.AppendColumn("timestamp_in", col_timestamp_in);
- block.AppendColumn("timestamp_failed", col_timestamp_failed);
- block.AppendColumn("reason_for_failure", col_reason);
-
- // Insert block
- client_->Insert("failed_loglines", block);
-
- } catch (const std::exception &e) {
- auto logger = Logger::get_logger("clickhouse");
- logger->error("Failed to insert failed_logline: {}", e.what());
- }
-}
-
-} // namespace base
-} // namespace hamstring
diff --git a/cpp/src/base/data_classes.cpp b/cpp/src/base/data_classes.cpp
deleted file mode 100644
index 22c10dd2..00000000
--- a/cpp/src/base/data_classes.cpp
+++ /dev/null
@@ -1,197 +0,0 @@
-#include "hamstring/base/data_classes.hpp"
-#include "hamstring/base/utils.hpp"
-#include <algorithm>
-#include <nlohmann/json.hpp>
-#include <regex>
-
-using json = nlohmann::json;
-
-namespace hamstring {
-namespace base {
-
-// ============================================================================
-// LogLine Implementation
-// ============================================================================
-
-std::string LogLine::to_json() const {
- json j;
- j["logline_id"] = logline_id;
- j["batch_id"] = batch_id;
- j["fields"] = fields;
- j["timestamp"] = utils::timestamp_to_ms(timestamp);
- return j.dump();
-}
-
-std::shared_ptr<LogLine> LogLine::from_json(const std::string &json_str) {
- auto j = json::parse(json_str);
- auto logline = std::make_shared();
-
- logline->logline_id = j["logline_id"];
- logline->batch_id = j.value("batch_id", "");
- logline->fields = j["fields"].get>();
- logline->timestamp = utils::ms_to_timestamp(j["timestamp"]);
-
- return logline;
-}
-
-std::optional<std::string> LogLine::get_field(const std::string &name) const {
- auto it = fields.find(name);
- if (it != fields.end()) {
- return it->second;
- }
- return std::nullopt;
-}
-
-void LogLine::set_field(const std::string &name, const std::string &value) {
- fields[name] = value;
-}
-
-// ============================================================================
-// Batch Implementation
-// ============================================================================
-
-std::string Batch::to_json() const {
- json j;
- j["batch_id"] = batch_id;
- j["subnet_id"] = subnet_id;
- j["collector_name"] = collector_name;
- j["created_at"] = utils::timestamp_to_ms(created_at);
- j["timestamp_in"] = utils::timestamp_to_ms(timestamp_in);
-
- json loglines_json = json::array();
- for (const auto &logline : loglines) {
- loglines_json.push_back(json::parse(logline->to_json()));
- }
- j["loglines"] = loglines_json;
-
- return j.dump();
-}
-
-std::shared_ptr<Batch> Batch::from_json(const std::string &json_str) {
- auto j = json::parse(json_str);
- auto batch = std::make_shared();
-
- batch->batch_id = j["batch_id"];
- batch->subnet_id = j["subnet_id"];
- batch->collector_name = j["collector_name"];
- batch->created_at = utils::ms_to_timestamp(j["created_at"]);
- batch->timestamp_in = utils::ms_to_timestamp(j["timestamp_in"]);
-
- for (const auto &logline_json : j["loglines"]) {
- batch->loglines.push_back(LogLine::from_json(logline_json.dump()));
- }
-
- return batch;
-}
-
-void Batch::add_logline(std::shared_ptr<LogLine> logline) {
- loglines.push_back(logline);
-}
-
-// ============================================================================
-// Warning Implementation
-// ============================================================================
-
-std::string Warning::to_json() const {
- json j;
- j["warning_id"] = warning_id;
- j["batch_id"] = batch_id;
- j["src_ip"] = src_ip;
- j["domain_name"] = domain_name;
- j["score"] = score;
- j["threshold"] = threshold;
- j["timestamp"] = utils::timestamp_to_ms(timestamp);
- j["metadata"] = metadata;
-
- return j.dump();
-}
-
-std::shared_ptr<Warning> Warning::from_json(const std::string &json_str) {
- auto j = json::parse(json_str);
- auto warning = std::make_shared();
-
- warning->warning_id = j["warning_id"];
- warning->batch_id = j["batch_id"];
- warning->src_ip = j["src_ip"];
- warning->domain_name = j["domain_name"];
- warning->score = j["score"];
- warning->threshold = j["threshold"];
- warning->timestamp = utils::ms_to_timestamp(j["timestamp"]);
- warning->metadata = j["metadata"].get>();
-
- return warning;
-}
-
-// ============================================================================
-// RegExValidator Implementation
-// ============================================================================
-
-RegExValidator::RegExValidator(const std::string &name,
- const std::string &pattern)
- : name_(name), pattern_(pattern) {}
-
-bool RegExValidator::validate(const std::string &value) const {
- return std::regex_match(value, pattern_);
-}
-
-// ============================================================================
-// TimestampValidator Implementation
-// ============================================================================
-
-TimestampValidator::TimestampValidator(const std::string &name,
- const std::string &format)
- : name_(name), format_(format) {}
-
-bool TimestampValidator::validate(const std::string &value) const {
- try {
- parse(value);
- return true;
- } catch (...) {
- return false;
- }
-}
-
-std::chrono::system_clock::time_point
-TimestampValidator::parse(const std::string &value) const {
- return utils::parse_timestamp(value, format_);
-}
-
-// ============================================================================
-// IpAddressValidator Implementation
-// ============================================================================
-
-IpAddressValidator::IpAddressValidator(const std::string &name) : name_(name) {}
-
-bool IpAddressValidator::validate(const std::string &value) const {
- return is_ipv4(value) || is_ipv6(value);
-}
-
-bool IpAddressValidator::is_ipv4(const std::string &value) {
- return utils::is_valid_ipv4(value);
-}
-
-bool IpAddressValidator::is_ipv6(const std::string &value) {
- return utils::is_valid_ipv6(value);
-}
-
-// ============================================================================
-// ListItemValidator Implementation
-// ============================================================================
-
-ListItemValidator::ListItemValidator(
- const std::string &name, const std::vector<std::string> &allowed_list,
- const std::vector<std::string> &relevant_list)
- : name_(name), allowed_list_(allowed_list), relevant_list_(relevant_list) {}
-
-bool ListItemValidator::validate(const std::string &value) const {
- return std::find(allowed_list_.begin(), allowed_list_.end(), value) !=
- allowed_list_.end();
-}
-
-bool ListItemValidator::is_relevant(const std::string &value) const {
- return std::find(relevant_list_.begin(), relevant_list_.end(), value) !=
- relevant_list_.end();
-}
-
-} // namespace base
-} // namespace hamstring
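
A round-trip sketch for the serialization implemented above; the ID shown is a placeholder:

```cpp
#include "hamstring/base/data_classes.hpp"
#include <cassert>

void roundtrip_example() {
  hamstring::base::LogLine line;
  line.logline_id = "550e8400-e29b-41d4-a716-446655440000"; // placeholder UUID
  line.set_field("domain_name", "example.com");

  // to_json() and from_json() are symmetric, so fields survive the trip.
  auto restored = hamstring::base::LogLine::from_json(line.to_json());
  assert(restored->get_field("domain_name").value() == "example.com");
}
```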
diff --git a/cpp/src/base/kafka_handler.cpp b/cpp/src/base/kafka_handler.cpp
deleted file mode 100644
index 95738c92..00000000
--- a/cpp/src/base/kafka_handler.cpp
+++ /dev/null
@@ -1,286 +0,0 @@
-#include "hamstring/base/kafka_handler.hpp"
-#include "hamstring/base/logger.hpp"
-#include <librdkafka/rdkafkacpp.h>
-
-namespace hamstring {
-namespace base {
-
-// ============================================================================
-// KafkaHandler Base Implementation
-// ============================================================================
-
-std::unique_ptr<RdKafka::Conf>
-KafkaHandler::create_config(const std::string &bootstrap_servers,
- const std::string &group_id) {
-
- std::string errstr;
- auto conf = std::unique_ptr<RdKafka::Conf>(
- RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL));
-
- if (conf->set("bootstrap.servers", bootstrap_servers, errstr) !=
- RdKafka::Conf::CONF_OK) {
- throw std::runtime_error("Failed to set bootstrap.servers: " + errstr);
- }
-
- if (!group_id.empty()) {
- if (conf->set("group.id", group_id, errstr) != RdKafka::Conf::CONF_OK) {
- throw std::runtime_error("Failed to set group.id: " + errstr);
- }
- }
-
- return conf;
-}
-
-// ============================================================================
-// KafkaProduceHandler Implementation
-// ============================================================================
-
-KafkaProduceHandler::KafkaProduceHandler(const std::string &bootstrap_servers,
- const std::string &topic)
- : topic_(topic) {
-
- bootstrap_servers_ = bootstrap_servers;
-
- std::string errstr;
- conf_ = create_config(bootstrap_servers);
-
- // Producer-specific settings
- conf_->set("enable.idempotence", "false", errstr);
- conf_->set("acks", "1", errstr);
- conf_->set("message.max.bytes", "1000000000", errstr);
-
- // Create producer
- producer_ = std::unique_ptr