51 changes: 51 additions & 0 deletions Tools/Solutions Analyzer/README.md
@@ -199,6 +199,57 @@ See the script documentation for details:

## Version History

### v9.10 - Schema reference documentation links for table pages

**Schema references section added to table documentation:**
- Each generated table page now includes a "Schema References" section with official Microsoft Learn documentation links for field/column information.
- **Specific schema documentation** is provided for well-documented tables (e.g., SecurityAlert for security alerts, DnsEvents/DnsInventory for DNS via AMA) with dedicated reference pages.
- **General data source schema reference** is provided for all other tables as a fallback.
- The mapping is configurable via the `TABLE_SCHEMA_REFERENCES` dictionary in `generate_connector_docs.py`, allowing easy addition of new table-specific references.
- Current mappings include:
  - `SecurityAlert` → [Security Alert Schema](https://learn.microsoft.com/en-us/azure/sentinel/security-alert-schema)
  - `DnsEvents`, `DnsInventory`, `AMA_DNS` → [DNS AMA Fields Reference](https://learn.microsoft.com/en-us/azure/sentinel/dns-ama-fields)
  - All other tables → [Data Source Schema Reference](https://learn.microsoft.com/en-us/azure/sentinel/data-source-schema-reference) (general reference)
- Schema References section appears in the Table of Contents for easy navigation.
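
The lookup-with-fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in `generate_connector_docs.py`: the tuple shape of the dictionary values, the `GENERAL_SCHEMA_REFERENCE` constant, and both helper functions are assumptions made for the example.

```python
# Hypothetical shape: table name -> (link title, Microsoft Learn URL).
TABLE_SCHEMA_REFERENCES = {
    "SecurityAlert": (
        "Security Alert Schema",
        "https://learn.microsoft.com/en-us/azure/sentinel/security-alert-schema",
    ),
    "DnsEvents": (
        "DNS AMA Fields Reference",
        "https://learn.microsoft.com/en-us/azure/sentinel/dns-ama-fields",
    ),
}

# Assumed name for the general fallback used by every unmapped table.
GENERAL_SCHEMA_REFERENCE = (
    "Data Source Schema Reference",
    "https://learn.microsoft.com/en-us/azure/sentinel/data-source-schema-reference",
)


def schema_reference_for(table_name):
    """Return the table-specific reference, or the general one as a fallback."""
    return TABLE_SCHEMA_REFERENCES.get(table_name, GENERAL_SCHEMA_REFERENCE)


def render_schema_references(table_name):
    """Render a Markdown "Schema References" section for one table page."""
    title, url = schema_reference_for(table_name)
    return f"## Schema References\n\n- [{title}]({url})\n"
```

Under this shape, adding a new table-specific reference is a one-line dictionary entry, and unmapped tables pick up the general reference automatically.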

### v9.9 - In-solution override flag for misclassified published connectors

**Connectors with all tables filtered out no longer drop their solution from the index (`solutions_connectors_tables_mapping.csv`, `solutions-index.md`, `index.html`):**
- The mapper now **always** emits a placeholder mapping row when a connector produces zero table rows, regardless of *why* (no table tokens, parser-only tokens, failed table-name validation, or a `reported_table_exclusions` override). Previously only the `no_table_definitions` case kept a row; the other three drop reasons (`table_detection_failed`, `parser_tables_only`, `partial_parser_tables`) discarded the connector entirely. When such a connector was a solution's **only** connector, the whole solution vanished from the mapping CSV and therefore from `solutions-index.md` — even though its detail page was still generated. This regressed SlashNext, whose sole Function App connector (`SlashNextFunctionApp`) only references `AzureDiagnostics`/`AzureMetrics` health tables that the v9.9 `reported_table_exclusions` override drops.
- Both doc generators now **seed the index from the union of the mapping CSV and `solutions.csv`** as a safety net: any solution present in `solutions.csv` but absent from the mapping CSV is added with an empty-connector placeholder row so it can never be silently dropped from `solutions-index.md` (`generate_connector_docs.py`) or the interactive `index.html` (`generate_interactive_docs.py`). The placeholder carries an empty `connector_id`, so it adds no phantom connector to the connectors index.
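
As a rough sketch of the union seeding described above: the row shape is simplified and the `solution_name` key is an assumption for illustration (only `connector_id` is named in this changelog), so the real generators may differ.

```python
def seed_index_rows(mapping_rows, solution_names):
    """Return mapping rows plus one placeholder per solution missing from them.

    mapping_rows: list of dicts read from the mapping CSV (assumed shape).
    solution_names: iterable of solution names read from solutions.csv.
    """
    rows = list(mapping_rows)
    mapped = {row["solution_name"] for row in rows}
    for name in solution_names:
        if name not in mapped:
            # An empty connector_id keeps the solution in the index without
            # adding a phantom connector to the connectors index.
            rows.append({"solution_name": name, "connector_id": ""})
            mapped.add(name)
    return rows


def connector_ids(rows):
    """Collect real connector IDs, skipping empty placeholders."""
    return {row["connector_id"] for row in rows if row["connector_id"]}
```

With this shape, a solution whose only connector produced no rows still appears once in the seeded list, while `connector_ids` ignores its placeholder.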

**Marketplace double-prefix publish-status fix (`solutions.csv`):**
- `check_marketplace_availability()` now builds the marketplace legacy ID via `_build_legacy_id()`, which uses `offerId` as-is when it is already prefixed with `<publisherId>.` instead of blindly forming `<publisherId>.<offerId>`. Some `SolutionMetadata.json` files store the full legacy ID in `offerId` (e.g. `azuresentinel` + `azuresentinel.trendmicrocas`, `squadratechnologies` + `squadratechnologies.secrmmsentinel`). The previous logic produced a double-prefixed ID (e.g. `azuresentinel.azuresentinel.trendmicrocas`) that 404s, so those published solutions (Trend Micro Cloud App Security, Squadra Technologies SecRmm) were wrongly reported as `mp_is_published=false`. The marketplace cache key uses the same helper so cache hits match the API ID.
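
The prefix check can be illustrated with a minimal sketch. The function below is a stand-in written for this example, not the actual `_build_legacy_id()`; in particular, the case-sensitive comparison is an assumption.

```python
def build_legacy_id(publisher_id: str, offer_id: str) -> str:
    """Form '<publisherId>.<offerId>' without doubling an existing prefix."""
    prefix = f"{publisher_id}."
    if offer_id.startswith(prefix):
        # offerId already holds the full legacy ID; use it as-is.
        return offer_id
    return prefix + offer_id


# Already prefixed: returned unchanged instead of being double-prefixed.
print(build_legacy_id("azuresentinel", "azuresentinel.trendmicrocas"))
# Not prefixed: 'zscaler_zia' starts with 'zscaler' but not with 'zscaler.'.
print(build_legacy_id("zscaler", "zscaler_zia"))
```

Including the trailing dot in the prefix test matters: an `offerId` that merely begins with the publisher name is still treated as unprefixed.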

**Marketplace filter-query fallback for republished offers (`solutions.csv`):**
- When the direct legacy-ID lookup 404s, `check_marketplace_availability()` now retries via a catalog `$filter` query keyed by `offerId` (new helper `_query_marketplace_by_offer_id()`), mirroring the official packaging flow in `.script/package-automation/catalogAPI.ps1`. The filter is scoped to Sentinel offers (`categoryIds` eq `AzureSentinelSolution` or `keywords` contains the Sentinel keyword GUID) and matches `offerId` exactly. This recovers solutions that were **republished under a different `publisherId`** than the one stored in `SolutionMetadata.json` (e.g. Zscaler Internet Access: `zscaler.zscaler_zia` → live `zscaler1579058425289.zscaler_zia`), so they are no longer mis-reported as `mp_is_published=false` and no longer need a per-solution `is_published=true` override. The fallback only *adds* recovery on a 404 — it never flips a published solution to unpublished. Solutions whose **`offerId` itself changed** in the marketplace (not just the publisher) are still reported unpublished and require a `SolutionMetadata.json` `offerId` correction.
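
The control flow of the fallback can be sketched without any network calls. Everything here is illustrative: the two callables stand in for the direct legacy-ID lookup and the `offerId` filter query, and the toy `CATALOG` set replaces the live marketplace.

```python
from typing import Callable, Optional


def is_published(
    legacy_id: str,
    offer_id: str,
    lookup_by_legacy_id: Callable[[str], bool],
    query_by_offer_id: Callable[[str], Optional[str]],
) -> bool:
    """Return True when either lookup finds the offer in the catalog."""
    if lookup_by_legacy_id(legacy_id):
        return True  # direct hit: the fallback is never consulted
    # Direct lookup missed (a 404 in the real flow): retry by exact offerId.
    return query_by_offer_id(offer_id) is not None


# Toy catalog standing in for the live public marketplace.
CATALOG = {"zscaler1579058425289.zscaler_zia"}


def fake_lookup(legacy_id: str) -> bool:
    return legacy_id in CATALOG


def fake_query(offer_id: str) -> Optional[str]:
    for entry in sorted(CATALOG):
        if entry.split(".", 1)[1] == offer_id:
            return entry
    return None
```

Because the fallback only runs after a miss, it can turn an unpublished verdict into a published one but never the reverse.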

**Marketplace lookup-key overrides for renamed offers / metadata-less folders (`solutions.csv`):**
- `Solution`-scoped `solution_publisher_id` and `solution_offer_id` overrides are now applied to each solution **before** the marketplace availability check, redirecting *what* the public catalog API looks up rather than hard-coding the published verdict. This is the preferred fix when a solution ships under a different marketplace offer than its repo `SolutionMetadata.json` records — a renamed/re-published offer, a publisher hand-off, or a repo folder that carries no `SolutionMetadata.json` at all (e.g. Farsight DNSDB → `domaintoolsllc….farsight-dnsdb`, Synack → `synackinc….synack-sentinel-integration`). Because the published flag is then derived from the live public catalog, it self-corrects on future marketplace changes instead of being frozen by a blanket `is_published=true` override. The mapper still consults **only** the public marketplace catalog and never the authenticated Content Hub APIs. The standard solution-override pass continues to run later in the pipeline; this earlier pass narrowly targets the two lookup-key fields so marketplace status is resolved against the corrected offer id.

**Removed all blanket `is_published=true` solution overrides (data only):**
- Eliminated the ~430 `Solution,…,is_published,true` override rows from `solution_analyzer_overrides.csv`. The combination of `_build_legacy_id()` (double-prefix fix), the `offerId` filter-query fallback, and the pre-check lookup-key redirects now resolves published status directly from the live public catalog for the vast majority of these solutions, so the blanket overrides were redundant. The remaining mismatches were verified against the public marketplace catalog and replaced with 11 `solution_publisher_id` / `solution_offer_id` lookup-redirect override pairs (Barracuda WAF, BitSight, Farsight DNSDB, Intel471, Lumen Defender Threat Feed, SailPointIdentityNow, SecurityScorecard Cybersecurity Ratings, Semperis Directory Services Protector, Synack, Egress Iris, OneIdentity). Solutions confirmed genuinely unpublished/superseded in the catalog now report `is_published=false` from live data rather than being masked. Net effect: connector/solution publish status is fully marketplace-authoritative and self-correcting, with no frozen verdicts.


- Added a computed `category_primary` column that maps each table to a closed reporting taxonomy — `Cloud`, `Endpoint`, `Syslog/CEF`, `3rd Party (SaaS)`, `Defender`, `ASIM`, `Internal`, `Unknown` — alongside the raw `category` string (kept unchanged for traceability). Two diagnostic columns mirror the `collection_method` family: `category_source` (provenance) and `category_candidates` (all distinct taxonomy values produced, ordered by precedence).
- Resolution combines **strong** signals (ASIM name prefix, `source_defender_xdr`, mapped doc-category tokens such as `AWS`/`GCP`/`Crowdstrike`/`Entra`/`MDE`/`Normalized`/`Syslog/CEF`) using a deterministic combo precedence (`Internal` > `Defender` > `ASIM` > `Endpoint` > `Syslog/CEF` > `3rd Party (SaaS)` > `Cloud` > `Unknown`), so combos like SigninLogs resolve to `Cloud` rather than `Defender`. **Weak** fallbacks (cross-derive from `collection_method`, `resource_types` → `Cloud`, and `_CL` chains) fire only when no strong signal exists.
- `_CL` custom-log tables are categorized via their feeding connectors' vendor/product (→ `3rd Party (SaaS)`) or, absent that, the feeding **solution's publisher tier** (partner/community/developer → `3rd Party (SaaS)`; Microsoft → `Cloud`). Solution-private storage tables are forced to `Internal`. `category_primary` is overridable via `Entity=Table, Field=category_primary` rows in `solution_analyzer_overrides.csv`.

**`solution_categories` fix (`solutions.csv`):**
- The `solution_categories` column now lists the actual domain/vertical **values** from `SolutionMetadata.json` (e.g. `Security - Threat Protection`) instead of the JSON key names (`domains`, `verticals`).

**ARM-expression table-name filter (`tables.csv`):**
- `is_true_table_name()` now rejects ARM-template expressions captured as literal table names (strings starting with `[` or containing `parameters(`/`variables(`), so placeholders like `[parameters('PlaybookName')]_CL` and `[variables('Sentinel_LogName')]_CL` no longer leak into `tables.csv` as bogus `_CL` rows. Previously these passed the `_CL`-suffix check and were emitted as real tables.
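
A minimal sketch of the rejection rule follows. The helper name `looks_like_arm_expression` is hypothetical, and the real `is_true_table_name()` applies further checks that are not reproduced here.

```python
def looks_like_arm_expression(name: str) -> bool:
    """Detect ARM-template expressions captured as literal table names."""
    candidate = name.strip()
    return (
        candidate.startswith("[")
        or "parameters(" in candidate
        or "variables(" in candidate
    )


for name in (
    "[parameters('PlaybookName')]_CL",
    "[variables('Sentinel_LogName')]_CL",
    "MyCustomTable_CL",
):
    verdict = "rejected" if looks_like_arm_expression(name) else "kept"
    print(f"{name}: {verdict}")
```

Running the expression check before the `_CL`-suffix check is what stops these placeholders from being accepted as custom-log tables.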

**Connector table-source precedence and DCR normalization fixes (`connectors.csv`, `solutions_connectors_tables_mapping.csv`):**
- Companion files are now authoritative for table mapping: `*_Table.json` / `*_DCR.json` are applied first, query analysis runs only when companion files are absent, and `dataTypes` is now a fallback source (instead of Priority 0). This avoids over-trusting UI declarations when explicit DCR/table companion files are present.
- DCR extraction now treats `outputStream` as the authoritative destination-table signal and uses `streams` only as a fallback when `outputStream` is missing. This prevents input stream declarations from being misreported as extra ingested tables (for example, Zscaler `nss_*` helper streams alongside `CommonSecurityLog`).
- `dataTypes` fallback extraction now expands placeholders (for example `{{graphQueriesTableName}}`) before resolving table tokens, improving coverage for connectors that parameterize table names in the UI config.
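
The `outputStream`-first rule for DCR companion files might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the input is a simplified data collection rule with `properties.dataFlows`, and table names are derived by stripping the `Custom-` / `Microsoft-` stream prefixes.

```python
from typing import List

STREAM_PREFIXES = ("Custom-", "Microsoft-")


def stream_to_table(stream: str) -> str:
    """Strip the stream prefix to get the destination table name."""
    for prefix in STREAM_PREFIXES:
        if stream.startswith(prefix):
            return stream[len(prefix):]
    return stream


def tables_from_dcr(dcr: dict) -> List[str]:
    """Collect destination tables, preferring outputStream over streams."""
    tables: List[str] = []
    for flow in dcr.get("properties", {}).get("dataFlows", []):
        output = flow.get("outputStream")
        # Input streams are consulted only when no outputStream is declared.
        candidates = [output] if output else flow.get("streams", [])
        for stream in candidates:
            table = stream_to_table(stream)
            if table and table not in tables:
                tables.append(table)
    return tables


example = {
    "properties": {
        "dataFlows": [
            {"streams": ["Custom-nss_web"], "outputStream": "Microsoft-CommonSecurityLog"},
            {"streams": ["Custom-Example_CL"]},
        ]
    }
}
print(tables_from_dcr(example))  # ['CommonSecurityLog', 'Example_CL']
```

In the first data flow the `nss_web` input stream is ignored because an `outputStream` is present; the second flow has none, so its `streams` entry is used.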

**Override-driven "discovered" corrections (data only):**
- Added `not_in_solution_json=false` overrides for three published connectors that the mapper flags as "discovered" because of source-side gaps in their solutions: `MailGuard365` (solution has no `Solution_*.json` data file), `CiscoMerakiNativePoller` (absent from the `Data Connectors` list in `Solution_CiscoMeraki.json`), and `Pathlock_TDnR` (legacy root `Pathlock_TDnR.json` collides with the CCP definition in `Pathlock_TDnR_PUSH_CCP/` that the solution actually references). The overrides are an interim accuracy fix; the underlying solutions still need upstream correction (tracked in the reports folder).
- Documented `not_in_solution_json` as an overridable connector field in the override-system reference.

### v9.8 - Artifact deep-links, connector/table accuracy, Learn deep-links, and faster HTML generation

**New artifact deep-link CSV + Kusto upload:**
122 changes: 122 additions & 0 deletions Tools/Solutions Analyzer/_build_connector_history_xlsx.py
@@ -0,0 +1,122 @@
#!/usr/bin/env python3
"""Build connector_history.xlsx from connector_history.csv.

Produces a workbook with:
* "Data" — the full CSV, styled header, frozen panes, auto-filter.
* "Stock" — a line chart of active / deprecated / total connectors over time.
* "Flow" — a column chart of connectors created vs updated per month.

Charts reference the Data sheet live, so editing the data updates the charts.
"""
from __future__ import annotations

import csv
from pathlib import Path

from openpyxl import Workbook
from openpyxl.chart import BarChart, LineChart, Reference
from openpyxl.styles import Alignment, Font, PatternFill
from openpyxl.utils import get_column_letter

HERE = Path(__file__).resolve().parent
CSV_PATH = HERE / "connector_history.csv"
XLSX_PATH = HERE / "connector_history.xlsx"

HEADER_FILL = PatternFill("solid", fgColor="1F4E78")
HEADER_FONT = Font(bold=True, color="FFFFFF")


def main() -> int:
    with open(CSV_PATH, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = list(reader)
    header, data = rows[0], rows[1:]

    wb = Workbook()
    ws = wb.active
    ws.title = "Data"

    # Header.
    ws.append(header)
    for col_idx, _ in enumerate(header, start=1):
        cell = ws.cell(row=1, column=col_idx)
        cell.fill = HEADER_FILL
        cell.font = HEADER_FONT
        cell.alignment = Alignment(horizontal="center")

    # Data, coercing numeric columns to int.
    numeric_cols = {
        header.index(c)
        for c in (
            "active_connectors", "deprecated_connectors", "total_connectors",
            "connectors_created", "connectors_updated",
        )
        if c in header
    }
    for record in data:
        out = []
        for i, value in enumerate(record):
            if i in numeric_cols and value != "":
                out.append(int(value))
            else:
                out.append(value)
        ws.append(out)

    ws.freeze_panes = "A2"
    last_col = get_column_letter(len(header))
    ws.auto_filter.ref = f"A1:{last_col}{len(data) + 1}"
    for col_idx, name in enumerate(header, start=1):
        ws.column_dimensions[get_column_letter(col_idx)].width = max(14, len(name) + 2)

    n_rows = len(data)
    cats = Reference(ws, min_col=header.index("month") + 1,
                     min_row=2, max_row=n_rows + 1)

    # Stock chart (line).
    stock_ws = wb.create_sheet("Stock")
    stock = LineChart()
    stock.title = "Connectors over time (as of 1st of month)"
    stock.y_axis.title = "Connectors"
    stock.x_axis.title = "Month"
    stock.height = 12
    stock.width = 28
    # openpyxl defaults axes to delete=True, which hides tick labels/titles.
    stock.x_axis.delete = False
    stock.y_axis.delete = False
    stock.legend.position = "b"
    for name in ("active_connectors", "deprecated_connectors", "total_connectors"):
        col = header.index(name) + 1
        ref = Reference(ws, min_col=col, min_row=1, max_row=n_rows + 1)
        stock.add_data(ref, titles_from_data=True)
    stock.set_categories(cats)
    stock_ws.add_chart(stock, "B2")

    # Flow chart (column) — only if the flow columns exist.
    if "connectors_created" in header and "connectors_updated" in header:
        flow_ws = wb.create_sheet("Flow")
        flow = BarChart()
        flow.type = "col"
        flow.grouping = "clustered"
        flow.title = "Connectors created vs updated per month (merges to master)"
        flow.y_axis.title = "Distinct connectors"
        flow.x_axis.title = "Month"
        flow.height = 12
        flow.width = 28
        flow.x_axis.delete = False
        flow.y_axis.delete = False
        flow.legend.position = "b"
        for name in ("connectors_created", "connectors_updated"):
            col = header.index(name) + 1
            ref = Reference(ws, min_col=col, min_row=1, max_row=n_rows + 1)
            flow.add_data(ref, titles_from_data=True)
        flow.set_categories(cats)
        flow_ws.add_chart(flow, "B2")

    wb.save(XLSX_PATH)
    print(f"Wrote {XLSX_PATH}")
    print(f"Sheets: {wb.sheetnames}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())