From 928d9278ebd19411c8b5ad42f636dfc11b76cf57 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 27 Mar 2026 12:39:17 -0600 Subject: [PATCH 1/3] docs: add bug triage guide for prioritizing open issues Add a contributor guide page documenting the bug triage process, including priority labels (critical/high/medium/low), area labels, a decision tree for choosing priorities, escalation triggers, and common bug categories. --- docs/source/contributor-guide/bug_triage.md | 158 ++++++++++++++++++++ docs/source/contributor-guide/index.md | 1 + 2 files changed, 159 insertions(+) create mode 100644 docs/source/contributor-guide/bug_triage.md diff --git a/docs/source/contributor-guide/bug_triage.md b/docs/source/contributor-guide/bug_triage.md new file mode 100644 index 0000000000..4f6addd442 --- /dev/null +++ b/docs/source/contributor-guide/bug_triage.md @@ -0,0 +1,158 @@ + + +# Bug Triage Guide + +This guide describes how we prioritize and triage bugs in the Comet project. The goal is to ensure +that the most impactful bugs — especially correctness issues that produce wrong results — are +identified and addressed before less critical issues. + +## Priority Labels + +Every bug should have exactly one priority label. When filing or triaging a bug, apply the +appropriate label from the table below. 
+ +| Label | Color | Description | Examples | +|-------|-------|-------------|----------| +| `priority:critical` | Red | Data corruption, silent wrong results, security vulnerabilities | Wrong aggregation results, FFI data corruption, incorrect cast output | +| `priority:high` | Orange | Crashes, panics, segfaults, major functional breakage affecting production workloads | Native engine panic, JVM segfault, NPE on supported code path | +| `priority:medium` | Yellow | Functional bugs, performance regressions, broken features that have workarounds | Missing expression support, writer feature gaps, excessive spilling | +| `priority:low` | Green | Minor issues, test-only failures, tooling, CI flakes, cosmetic issues | Flaky CI test, build script edge case, documentation generator bug | + +### How to Choose a Priority + +Use this decision tree: + +1. **Can this bug cause silent wrong results?** If yes → `priority:critical`. These are the most + dangerous bugs because users may not notice the incorrect output. +2. **Does this bug crash the JVM or native engine?** If yes → `priority:high`. Crashes are + disruptive but at least visible to the user. +3. **Does this bug break a feature or cause significant performance degradation?** If yes → + `priority:medium`. The user can work around it (e.g., falling back to Spark) but it impacts + the value of Comet. +4. **Everything else** → `priority:low`. Test failures, CI issues, tooling, and cosmetic problems. + +### Escalation Triggers + +A bug should be escalated to a higher priority if: + +- A `priority:high` crash is discovered to also produce wrong results silently in some cases → + escalate to `priority:critical` +- A `priority:medium` bug is reported by multiple users or affects a common workload → consider + escalating to `priority:high` +- A `priority:low` CI flake is blocking PR merges consistently → escalate to `priority:medium` + +## Area Labels + +Area labels indicate which subsystem is affected. 
A bug may have multiple area labels. These +help contributors find bugs in their area of expertise. + +| Label | Description | +|-------|-------------| +| `area:writer` | Native Parquet writer | +| `area:shuffle` | Shuffle (JVM and native) | +| `area:aggregation` | Hash aggregates, aggregate expressions | +| `area:scan` | Parquet scan / data reading | +| `area:expressions` | Expression evaluation | +| `area:ffi` | Arrow FFI / JNI boundary | +| `area:ci` | CI/CD, GitHub Actions, build tooling | + +The following pre-existing labels also serve as area indicators: `native_datafusion`, +`native_iceberg_compat`, `spark 4`, `spark sql tests`. + +## Triage Process + +### For New Issues + +When a new bug is filed: + +1. **Reproduce or verify** the issue if possible. If the report lacks reproduction steps, ask + the reporter for more details. +2. **Assess correctness impact first.** Ask: "Could this produce wrong results silently?" This + is more important than whether it crashes. +3. **Apply a priority label** using the decision tree above. +4. **Apply area labels** to indicate the affected subsystem(s). +5. **Apply `good first issue`** if the fix is likely straightforward and well-scoped. + +### For Existing Bugs + +Periodically review open bugs to ensure priorities are still accurate: + +- Has a `priority:medium` bug been open for a long time with user reports? Consider escalating. +- Has a `priority:high` bug been fixed by a related change? Close it. +- Are there clusters of related bugs that should be tracked under an EPIC? + +### Prioritization Principles + +1. **Correctness over crashes.** A bug that silently returns wrong results is worse than one that + crashes, because crashes are at least visible. +2. **User-reported over test-only.** A bug hit by a real user on a real workload takes priority + over one found only in test suites. +3. 
**Core path over experimental.** Bugs in the default scan mode (`native_comet`) or widely used + expressions take priority over bugs in experimental features like `native_datafusion` or + `native_iceberg_compat`. +4. **Production safety over feature completeness.** Fixing a data corruption bug is more important + than adding support for a new expression. + +## Common Bug Categories + +### Correctness Bugs (`priority:critical`) + +These are bugs where Comet produces different results than Spark without any error or warning. +Examples include: + +- Incorrect cast behavior (e.g., negative zero to string) +- Aggregate functions ignoring configuration (e.g., `ignoreNulls`) +- Data corruption at the FFI boundary (e.g., boolean arrays with non-zero offset) +- Type mismatches between partial and final aggregation stages + +When fixing correctness bugs, always add a regression test that verifies the output matches Spark. + +### Crash Bugs (`priority:high`) + +These are bugs where the native engine panics, segfaults, or throws an unhandled exception. +Common patterns include: + +- **All-scalar inputs:** Some expressions assume at least one columnar input and panic when all + inputs are literals (e.g., when `ConstantFolding` is disabled) +- **Type mismatches:** Downcasting to the wrong Arrow array type +- **Memory safety:** FFI boundary issues, unaligned arrays, GlobalRef lifecycle + +### Aggregate Planning Bugs + +Several bugs relate to how Comet plans hash aggregates across stage boundaries. The key issue is +that Spark's AQE may materialize a Comet partial aggregate but then run the final aggregate in +Spark (or vice versa), and the intermediate formats may not be compatible. See +[EPIC #2892](https://github.com/apache/datafusion-comet/issues/2892) for the full picture. + +### Native Writer Bugs + +The native Parquet writer has a cluster of known test failures tracked as individual issues +(#3417–#3430). 
These are lower priority since the native writer is still maturing, but they +should be addressed before the writer is promoted to production-ready status. + +## How to Help with Triage + +Triage is a valuable contribution that doesn't require writing code. You can help by: + +- Reviewing new issues and suggesting a priority label +- Reproducing reported bugs and adding details +- Identifying duplicate issues +- Linking related issues together +- Testing whether old bugs have been fixed by recent changes diff --git a/docs/source/contributor-guide/index.md b/docs/source/contributor-guide/index.md index c0751c68a3..f6c1f4a100 100644 --- a/docs/source/contributor-guide/index.md +++ b/docs/source/contributor-guide/index.md @@ -39,6 +39,7 @@ Profiling Native Code Spark SQL Tests Iceberg Spark Tests SQL File Tests +Bug Triage Roadmap Release Process Github and Issue Tracker From c935d369f16510515d671709587559ad117bd7e3 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 27 Mar 2026 12:49:53 -0600 Subject: [PATCH 2/3] prettier --- docs/source/contributor-guide/bug_triage.md | 28 ++++++++++----------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/source/contributor-guide/bug_triage.md b/docs/source/contributor-guide/bug_triage.md index 4f6addd442..33e61644a7 100644 --- a/docs/source/contributor-guide/bug_triage.md +++ b/docs/source/contributor-guide/bug_triage.md @@ -28,12 +28,12 @@ identified and addressed before less critical issues. Every bug should have exactly one priority label. When filing or triaging a bug, apply the appropriate label from the table below. 
-| Label | Color | Description | Examples | -|-------|-------|-------------|----------| -| `priority:critical` | Red | Data corruption, silent wrong results, security vulnerabilities | Wrong aggregation results, FFI data corruption, incorrect cast output | -| `priority:high` | Orange | Crashes, panics, segfaults, major functional breakage affecting production workloads | Native engine panic, JVM segfault, NPE on supported code path | -| `priority:medium` | Yellow | Functional bugs, performance regressions, broken features that have workarounds | Missing expression support, writer feature gaps, excessive spilling | -| `priority:low` | Green | Minor issues, test-only failures, tooling, CI flakes, cosmetic issues | Flaky CI test, build script edge case, documentation generator bug | +| Label | Color | Description | Examples | +| ------------------- | ------ | ------------------------------------------------------------------------------------ | --------------------------------------------------------------------- | +| `priority:critical` | Red | Data corruption, silent wrong results, security vulnerabilities | Wrong aggregation results, FFI data corruption, incorrect cast output | +| `priority:high` | Orange | Crashes, panics, segfaults, major functional breakage affecting production workloads | Native engine panic, JVM segfault, NPE on supported code path | +| `priority:medium` | Yellow | Functional bugs, performance regressions, broken features that have workarounds | Missing expression support, writer feature gaps, excessive spilling | +| `priority:low` | Green | Minor issues, test-only failures, tooling, CI flakes, cosmetic issues | Flaky CI test, build script edge case, documentation generator bug | ### How to Choose a Priority @@ -63,15 +63,15 @@ A bug should be escalated to a higher priority if: Area labels indicate which subsystem is affected. A bug may have multiple area labels. These help contributors find bugs in their area of expertise. 
-| Label | Description | -|-------|-------------| -| `area:writer` | Native Parquet writer | -| `area:shuffle` | Shuffle (JVM and native) | +| Label | Description | +| ------------------ | -------------------------------------- | +| `area:writer` | Native Parquet writer | +| `area:shuffle` | Shuffle (JVM and native) | | `area:aggregation` | Hash aggregates, aggregate expressions | -| `area:scan` | Parquet scan / data reading | -| `area:expressions` | Expression evaluation | -| `area:ffi` | Arrow FFI / JNI boundary | -| `area:ci` | CI/CD, GitHub Actions, build tooling | +| `area:scan` | Parquet scan / data reading | +| `area:expressions` | Expression evaluation | +| `area:ffi` | Arrow FFI / JNI boundary | +| `area:ci` | CI/CD, GitHub Actions, build tooling | The following pre-existing labels also serve as area indicators: `native_datafusion`, `native_iceberg_compat`, `spark 4`, `spark sql tests`. From 5682e9d37bc22eed5f1056cda9cfeae421888e62 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Sat, 28 Mar 2026 10:27:10 -0600 Subject: [PATCH 3/3] feat: auto-label new issues with requires-triage Add a GitHub Actions workflow that automatically applies a `requires-triage` label to newly opened issues, making it clear which issues still need triage attention. Update the bug triage guide to document this workflow. --- .github/workflows/label_new_issues.yml | 39 +++++++++++++++++++++ docs/source/contributor-guide/bug_triage.md | 5 +++ 2 files changed, 44 insertions(+) create mode 100644 .github/workflows/label_new_issues.yml diff --git a/.github/workflows/label_new_issues.yml b/.github/workflows/label_new_issues.yml new file mode 100644 index 0000000000..60ff35c3da --- /dev/null +++ b/.github/workflows/label_new_issues.yml @@ -0,0 +1,39 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +name: Label new issues with requires-triage + +on: + issues: + types: [opened] + +permissions: + issues: write + +jobs: + add-triage-label: + runs-on: ubuntu-latest + steps: + - uses: actions/github-script@v7 + with: + script: | + await github.rest.issues.addLabels({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + labels: ['requires-triage'] + }) diff --git a/docs/source/contributor-guide/bug_triage.md b/docs/source/contributor-guide/bug_triage.md index 33e61644a7..8524b9fe44 100644 --- a/docs/source/contributor-guide/bug_triage.md +++ b/docs/source/contributor-guide/bug_triage.md @@ -78,6 +78,10 @@ The following pre-existing labels also serve as area indicators: `native_datafus ## Triage Process +Every new issue is automatically labeled with `requires-triage` when it is opened. This makes it +easy to find issues that have not yet been triaged by filtering on that label. Once an issue has +been triaged, remove the `requires-triage` label and apply the appropriate priority and area labels. + ### For New Issues When a new bug is filed: @@ -89,6 +93,7 @@ When a new bug is filed: 3. **Apply a priority label** using the decision tree above. 4. **Apply area labels** to indicate the affected subsystem(s). 5. **Apply `good first issue`** if the fix is likely straightforward and well-scoped. +6. 
**Remove the `requires-triage` label** to indicate triage is complete. ### For Existing Bugs