
optimisation: replace per-module row copying with shared map enrichment#149

Merged
jnioche merged 4 commits into main from perf/field-index-caching on Feb 21, 2026

Conversation


@jnioche jnioche commented Feb 14, 2026

Summary

  • Replace per-module row copying with a shared Map<Column, Object>: modules now read CUR columns from the immutable original Row and read/write Spruce enrichment columns via a shared map. A single GenericRowWithSchema is materialised at the end of the pipeline instead of one per module.
  • ~4.4x speedup measured on a 6-module pipeline with a 108-column schema (1M rows: 1,200 ms → 273 ms). Benchmark included as EnrichmentStrategyBenchmark.java.
  • Interface change: Row process(Row) replaced by void enrich(Row inputRow, Map<Column, Object> enrichedValues). All 11 modules and 14 test classes updated; all 112 tests pass.

Changes

| Area | Files | What changed |
| --- | --- | --- |
| Interface | EnrichmentModule.java | Removed process() and static helpers (withUpdatedValue, withUpdatedValues); added enrich(Row, Map) |
| Pipeline | EnrichmentPipeline.java | Creates one HashMap per row, calls enrich() on each module, materialises a single Row at the end |
| Modules (11) | All under modules/ | Converted from returning new Rows to putting values into the enriched map |
| Tests (14) | All test classes | Updated to the enrich(row, map) pattern; assertions check map contents |
| Benchmark | EnrichmentStrategyBenchmark.java | New; compares the old row-copy approach against the new map approach |
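The read/write contract between converted modules can be sketched as follows. This is an illustrative, self-contained example, not the PR's actual code: Row and Column are simplified to String-keyed maps, and the module and column names (energy_used, carbon_emitted) are hypothetical. The point it shows is that a downstream module reads an upstream module's result from the shared map first, falling back to the original row.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedMapReadSketch {

    // Simplified stand-in for the PR's enrich(Row, Map<Column, Object>) signature.
    interface EnrichmentModule {
        void enrich(Map<String, Object> inputRow, Map<String, Object> enriched);
    }

    public static void main(String[] args) {
        Map<String, Object> inputRow = Map.of("usage_amount", 10.0);

        // Upstream module writes an enrichment value into the shared map.
        EnrichmentModule energyModule = (row, enriched) ->
            enriched.put("energy_used", (double) row.get("usage_amount") * 0.5);

        // Downstream module reads the upstream result from the map first,
        // falling back to the immutable input row if no module wrote it.
        EnrichmentModule carbonModule = (row, enriched) -> {
            Object energy = enriched.getOrDefault("energy_used", row.get("energy_used"));
            enriched.put("carbon_emitted", (double) energy * 0.2);
        };

        Map<String, Object> enriched = new HashMap<>();
        for (EnrichmentModule m : new EnrichmentModule[] {energyModule, carbonModule})
            m.enrich(inputRow, enriched);

        System.out.println(enriched.get("carbon_emitted")); // 1.0
    }
}
```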

How it works

Before (N row copies per input row)

// Each module creates a full copy of the Row
Row process(Row row) {
    Object[] values = new Object[row.size()];   // allocate
    for (int i = 0; i < row.size(); i++)        // copy ALL columns
        values[i] = row.get(i);
    values[targetIndex] = newValue;              // set 1 value
    return new GenericRowWithSchema(values, row.schema());
}

With 6 modules × 108 columns = 648 column copies per row.

After (1 row copy per input row)

// Modules write to a shared map — no Row allocation
void enrich(Row inputRow, Map<Column, Object> enrichedValues) {
    enrichedValues.put(ENERGY_USED, computedValue);
}

// Pipeline materialises one Row at the end
Map<Column, Object> enriched = new HashMap<>();
for (EnrichmentModule module : modules) {
    module.enrich(row, enriched);
}
// Single copy + apply map
Object[] values = new Object[row.size()];
for (int i = 0; i < row.size(); i++) values[i] = row.get(i);
for (var entry : enriched.entrySet())
    values[entry.getKey().resolveIndex(row)] = entry.getValue();
return new GenericRowWithSchema(values, row.schema());

108 column copies + small map overhead = ~4.4x faster.
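The before/after snippets above can be condensed into one runnable sketch. This is a hypothetical stand-alone example: Object[] stands in for Spark's Row, integer indexes for the project's Column type, and the method names (withUpdatedValue, applyEnrichment) are illustrative. It verifies that the two strategies produce the same final row while the map approach performs only one array allocation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class EnrichmentSketch {

    // Old approach: each "module" copies the whole row to set one value.
    static Object[] withUpdatedValue(Object[] row, int index, Object value) {
        Object[] copy = new Object[row.length];   // full allocation per module
        System.arraycopy(row, 0, copy, 0, row.length);
        copy[index] = value;
        return copy;
    }

    // New approach: modules write into a shared map keyed by column index;
    // a single copy is materialised at the end of the pipeline.
    static Object[] applyEnrichment(Object[] row, Map<Integer, Object> enriched) {
        Object[] copy = new Object[row.length];   // one allocation per input row
        System.arraycopy(row, 0, copy, 0, row.length);
        for (Map.Entry<Integer, Object> e : enriched.entrySet())
            copy[e.getKey()] = e.getValue();
        return copy;
    }

    public static void main(String[] args) {
        Object[] input = {"cur-a", "cur-b", null, null};

        // Old style: two modules -> two full row copies.
        Object[] oldStyle = withUpdatedValue(input, 2, 42.0);
        oldStyle = withUpdatedValue(oldStyle, 3, "eu-west-1");

        // New style: two modules write to the map, then one materialisation.
        Map<Integer, Object> enriched = new LinkedHashMap<>();
        enriched.put(2, 42.0);          // module 1: enrich(row, map)
        enriched.put(3, "eu-west-1");   // module 2: enrich(row, map)
        Object[] newStyle = applyEnrichment(input, enriched);

        System.out.println(Arrays.equals(oldStyle, newStyle)); // true
    }
}
```

With N modules the old path allocates N arrays per input row, the new path exactly one, which is where the measured speedup comes from.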

Test plan

- All 112 existing tests pass (mvn test)
- Review benchmark results on CI
- Verify no regressions on a real CUR dataset

Each enrichment module previously created a new Row (copying all column
values) for every update. With 6 modules and a 108-column schema, this
meant 6 full array copies per input row. The new approach has modules
read/write enrichment values via a shared Map<Column, Object>, with a
single Row materialised at the end of the pipeline (~4.4x speedup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jnioche jnioche added the enhancement New feature or request label Feb 14, 2026
@jnioche jnioche added this to the 0.9 milestone Feb 14, 2026
jnioche commented Feb 14, 2026

In practice this does not make a huge difference; I tried it both locally and on EMR:

0.8:

"totalResourceUtilization": {
    "vCPUHour": 0.362,
    "memoryGBHour": 0.82,
    "storageGBHour": 2.289
},
"totalExecutionDurationSeconds": 114,
"executionTimeoutMinutes": 720,
"billedResourceUtilization": {
    "vCPUHour": 0.469,
    "memoryGBHour": 1.033,
    "storageGBHour": 0.0
}

this PR:

"totalResourceUtilization": {
    "vCPUHour": 0.341,
    "memoryGBHour": 0.769,
    "storageGBHour": 2.144
},
"totalExecutionDurationSeconds": 105,
"executionTimeoutMinutes": 720,
"billedResourceUtilization": {
    "vCPUHour": 0.449,
    "memoryGBHour": 0.987,
    "storageGBHour": 0.0
}

Whether it should be adopted is then primarily a question of whether it makes the code conceptually cleaner and nicer.

…added utility method to SPRUCEcolumns

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
@jnioche jnioche merged commit 87d9a05 into main Feb 21, 2026
5 checks passed
@jnioche jnioche deleted the perf/field-index-caching branch February 21, 2026 08:05
jnioche added a commit that referenced this pull request Feb 21, 2026
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
