
optimisation: replace per-module row copying with shared map enrichment#149

Merged
jnioche merged 4 commits into main from perf/field-index-caching on Feb 21, 2026

Conversation


@jnioche jnioche commented Feb 14, 2026

Summary

  • Replace per-module row copying with a shared Map<Column, Object>: modules now read CUR columns from the immutable original Row and read/write Spruce enrichment columns via a shared map. A single GenericRowWithSchema is materialised at the end of the pipeline instead of one per module.
  • ~4.4x speedup measured on a 6-module pipeline with a 108-column schema (1M rows: 1,200 ms → 273 ms). Benchmark included as EnrichmentStrategyBenchmark.java.
  • Interface change: Row process(Row) replaced by void enrich(Row inputRow, Map<Column, Object> enrichedValues). All 11 modules and 14 test classes updated; all 112 tests pass.

Changes

| Area | Files | What changed |
| --- | --- | --- |
| Interface | EnrichmentModule.java | Removed process() and static helpers (withUpdatedValue, withUpdatedValues); added enrich(Row, Map) |
| Pipeline | EnrichmentPipeline.java | Creates one HashMap per row, calls enrich() on each module, materialises a single Row at the end |
| Modules (11) | All under modules/ | Converted from returning new Rows to putting values into the enriched map |
| Tests (14) | All test classes | Updated to the enrich(row, map) pattern; assertions check map contents |
| Benchmark | EnrichmentStrategyBenchmark.java | New; compares the old row-copy approach against the new map approach |
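The read/write contract between converted modules can be sketched as follows. This is an illustrative, self-contained example, not the PR's actual code: Row and Column are simplified to String-keyed maps, and the module and column names (energy_used, carbon_emitted) are hypothetical. The point it shows is that a downstream module reads an upstream module's result from the shared map first, falling back to the original row.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedMapReadSketch {

    // Simplified stand-in for the PR's enrich(Row, Map<Column, Object>) signature.
    interface EnrichmentModule {
        void enrich(Map<String, Object> inputRow, Map<String, Object> enriched);
    }

    public static void main(String[] args) {
        Map<String, Object> inputRow = Map.of("usage_amount", 10.0);

        // Upstream module writes an enrichment value into the shared map.
        EnrichmentModule energyModule = (row, enriched) ->
            enriched.put("energy_used", (double) row.get("usage_amount") * 0.5);

        // Downstream module reads the upstream result from the map first,
        // falling back to the immutable input row if no module wrote it.
        EnrichmentModule carbonModule = (row, enriched) -> {
            Object energy = enriched.getOrDefault("energy_used", row.get("energy_used"));
            enriched.put("carbon_emitted", (double) energy * 0.2);
        };

        Map<String, Object> enriched = new HashMap<>();
        for (EnrichmentModule m : new EnrichmentModule[] {energyModule, carbonModule})
            m.enrich(inputRow, enriched);

        System.out.println(enriched.get("carbon_emitted")); // 1.0
    }
}
```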

How it works

Before (N row copies per input row)

// Each module creates a full copy of the Row
Row process(Row row) {
    Object[] values = new Object[row.size()];   // allocate
    for (int i = 0; i < row.size(); i++)        // copy ALL columns
        values[i] = row.get(i);
    values[targetIndex] = newValue;              // set 1 value
    return new GenericRowWithSchema(values, row.schema());
}

With 6 modules × 108 columns = 648 column copies per row.

After (1 row copy per input row)

// Modules write to a shared map — no Row allocation
void enrich(Row inputRow, Map<Column, Object> enrichedValues) {
    enrichedValues.put(ENERGY_USED, computedValue);
}

// Pipeline materialises one Row at the end
Map<Column, Object> enriched = new HashMap<>();
for (EnrichmentModule module : modules) {
    module.enrich(row, enriched);
}
// Single copy + apply map
Object[] values = new Object[row.size()];
for (int i = 0; i < row.size(); i++) values[i] = row.get(i);
for (var entry : enriched.entrySet())
    values[entry.getKey().resolveIndex(row)] = entry.getValue();
return new GenericRowWithSchema(values, row.schema());

108 column copies + small map overhead = ~4.4x faster.
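The before/after snippets above can be condensed into one runnable sketch. This is a hypothetical stand-alone example: Object[] stands in for Spark's Row, integer indexes for the project's Column type, and the method names (withUpdatedValue, applyEnrichment) are illustrative. It verifies that the two strategies produce the same final row while the map approach performs only one array allocation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class EnrichmentSketch {

    // Old approach: each "module" copies the whole row to set one value.
    static Object[] withUpdatedValue(Object[] row, int index, Object value) {
        Object[] copy = new Object[row.length];   // full allocation per module
        System.arraycopy(row, 0, copy, 0, row.length);
        copy[index] = value;
        return copy;
    }

    // New approach: modules write into a shared map keyed by column index;
    // a single copy is materialised at the end of the pipeline.
    static Object[] applyEnrichment(Object[] row, Map<Integer, Object> enriched) {
        Object[] copy = new Object[row.length];   // one allocation per input row
        System.arraycopy(row, 0, copy, 0, row.length);
        for (Map.Entry<Integer, Object> e : enriched.entrySet())
            copy[e.getKey()] = e.getValue();
        return copy;
    }

    public static void main(String[] args) {
        Object[] input = {"cur-a", "cur-b", null, null};

        // Old style: two modules -> two full row copies.
        Object[] oldStyle = withUpdatedValue(input, 2, 42.0);
        oldStyle = withUpdatedValue(oldStyle, 3, "eu-west-1");

        // New style: two modules write to the map, then one materialisation.
        Map<Integer, Object> enriched = new LinkedHashMap<>();
        enriched.put(2, 42.0);          // module 1: enrich(row, map)
        enriched.put(3, "eu-west-1");   // module 2: enrich(row, map)
        Object[] newStyle = applyEnrichment(input, enriched);

        System.out.println(Arrays.equals(oldStyle, newStyle)); // true
    }
}
```

With N modules the old path allocates N arrays per input row, the new path exactly one, which is where the measured speedup comes from.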

Test plan

- All 112 existing tests pass (mvn test)
- Review benchmark results on CI
- Verify no regressions on a real CUR dataset

Each enrichment module previously created a new Row (copying all column
values) for every update. With 6 modules and a 108-column schema, this
meant 6 full array copies per input row. The new approach has modules
read/write enrichment values via a shared Map<Column, Object>, with a
single Row materialised at the end of the pipeline (~4.4x speedup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jnioche jnioche added the enhancement New feature or request label Feb 14, 2026
@jnioche jnioche added this to the 0.9 milestone Feb 14, 2026
jnioche commented Feb 14, 2026

In practice this does not make a huge difference; I tried it both locally and on EMR:

0.8:

"totalResourceUtilization": {
    "vCPUHour": 0.362,
    "memoryGBHour": 0.82,
    "storageGBHour": 2.289
},
"totalExecutionDurationSeconds": 114,
"executionTimeoutMinutes": 720,
"billedResourceUtilization": {
    "vCPUHour": 0.469,
    "memoryGBHour": 1.033,
    "storageGBHour": 0.0
}

this PR:

"totalResourceUtilization": {
    "vCPUHour": 0.341,
    "memoryGBHour": 0.769,
    "storageGBHour": 2.144
},
"totalExecutionDurationSeconds": 105,
"executionTimeoutMinutes": 720,
"billedResourceUtilization": {
    "vCPUHour": 0.449,
    "memoryGBHour": 0.987,
    "storageGBHour": 0.0
}

Whether it should be adopted is then primarily a question of whether it makes the code conceptually cleaner and nicer.

…added utility method to SPRUCEcolumns

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
@jnioche jnioche merged commit 87d9a05 into main Feb 21, 2026
5 checks passed
@jnioche jnioche deleted the perf/field-index-caching branch February 21, 2026 08:05
jnioche added a commit that referenced this pull request Feb 21, 2026
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
