# Loading Performance Optimization: Session 2 (Raw Pixel Cache + Single-Pass Texture Population)
**Date**: 2026-02-02
**Session**: 2
**Status**: ✅ Complete
**Priority**: High

---

## Session Goal

**Primary Objective:**
- Eliminate PNG decompression bottleneck (~7.8s) via raw pixel cache
- Optimize CPU pixel loops in MapTexturePopulator (~5.2s)

**Success Criteria:**
- Province data loading < 1s on cache hit
- Texture population < 1s

---

## Context & Background

**Previous Work:**
- See: [1-loading-performance-optimization.md](1-loading-performance-optimization.md)

**Current State (start of session):**
- Baseline ~33s reduced to ~21s after session 1 optimizations
- Two remaining bottlenecks: PNG decompress (~7.8s) and CPU pixel loops (~5.2s)
- GPU compute shader for province ID texture verified working

---

## What We Did

### 1. Raw Pixel Cache for Province Map Loading
**Files Changed:**
- `Scripts/Map/Loading/ProvinceMapProcessor.cs`: added `TryLoadPixelCache()`, `SavePixelCache()`, `BuildResultFromPixelData()`
- `Scripts/Map/Loading/Images/ProvinceMapParser.cs`: extracted `ParseProvinceMapWithPixelData()` from `ParseProvinceMapUnified()`

**Architecture:**
- Cache file: `{image_path}.pixels` (e.g., `provinces.png.pixels`)
- Binary format: 16-byte header (magic "RPXL", width, height, bpp, colorType, bitDepth) + raw decoded pixel bytes
- First run: PNG decompress as normal, then save cache (~292MB for 15000x6500 RGB)
- Subsequent runs: `File.ReadAllBytes` + single `UnsafeUtility.MemCpy` into `NativeArray<byte>`
- Cache invalidation: `File.GetLastWriteTimeUtc` comparison; the cache is stale if the source PNG is newer
- CSV parsing still runs on every load (fast, ~78ms in the measured run); only image decompression is cached
- `ParseProvinceMapWithPixelData()` was extracted so the cache-hit and cache-miss paths share the CSV logic instead of duplicating it
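
The cache layout and timestamp check described above can be sketched in plain .NET as follows. This is a minimal illustration, not the project's code: the class and method names are hypothetical, the field widths (two 4-byte integers, three single bytes, one padding byte) are an assumption chosen to reach 16 bytes, and `bpp` is assumed to mean bytes per pixel. The real implementation copies into a `NativeArray<byte>` rather than a managed array.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

// Hypothetical sketch. Only the "RPXL" magic, the 16-byte header size and the
// field list come from the notes; field widths and padding are assumptions.
public static class PixelCacheSketch
{
    const int HeaderSize = 16;
    static readonly byte[] Magic = Encoding.ASCII.GetBytes("RPXL");

    public static void Save(string cachePath, int width, int height,
                            byte bytesPerPixel, byte colorType, byte bitDepth, byte[] pixels)
    {
        using (var stream = File.Create(cachePath))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(Magic);          // 4 bytes
            writer.Write(width);          // 4 bytes, little-endian
            writer.Write(height);         // 4 bytes, little-endian
            writer.Write(bytesPerPixel);  // 1 byte (assumed: bytes, not bits)
            writer.Write(colorType);      // 1 byte
            writer.Write(bitDepth);       // 1 byte
            writer.Write((byte)0);        // 1 byte padding, 16 bytes in total
            writer.Write(pixels);         // raw decoded pixels, no length prefix
        }
    }

    public static bool TryLoad(string imagePath, string cachePath,
                               out int width, out int height, out byte[] pixels)
    {
        width = 0; height = 0; pixels = Array.Empty<byte>();
        if (!File.Exists(cachePath)) return false;

        // Stale if the source image was modified after the cache was written.
        if (File.GetLastWriteTimeUtc(imagePath) > File.GetLastWriteTimeUtc(cachePath))
            return false;

        byte[] file = File.ReadAllBytes(cachePath);   // one sequential read
        if (file.Length < HeaderSize) return false;
        for (int i = 0; i < Magic.Length; i++)
            if (file[i] != Magic[i]) return false;

        int w = BinaryPrimitives.ReadInt32LittleEndian(file.AsSpan(4, 4));
        int h = BinaryPrimitives.ReadInt32LittleEndian(file.AsSpan(8, 4));
        long payload = (long)w * h * file[12];
        if (w <= 0 || h <= 0 || payload != file.Length - HeaderSize) return false;

        width = w; height = h;
        pixels = new byte[payload];
        // Single bulk copy; the Unity code would MemCpy into a NativeArray here.
        Buffer.BlockCopy(file, HeaderSize, pixels, 0, (int)payload);
        return true;
    }
}
```

Validating the payload length against the header dimensions is an extra guard added in this sketch; the notes only mention the timestamp comparison.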

**Measured Impact:** PNG load **~7.8s → 197ms** (119ms cache read + 78ms CSV)

### 2. Single-Pass Texture Population
**Files Changed:** `Scripts/Map/Rendering/MapTexturePopulator.cs`

**Before:** Two separate 97.5M pixel CPU loops:
1. `PackRGBPixels()`: raw bytes → `uint[]` for GPU compute shader
2. `PopulateColorTextureFromRawBytes()`: raw bytes → `Color32[]` → `SetPixels32` → `Apply`

**After:** Single loop that simultaneously:
1. Packs `uint[]` for GPU upload
2. Writes RGBA32 directly into texture buffer via `GetRawTextureData<byte>()`, with zero managed allocation for the color texture

**Measured Impact:** Texture population **~5.2s → 644ms** (pack+color: 188ms, hash: 7ms, GPU dispatch+sync: 449ms)
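
The single-pass idea can be sketched as below. This is only an illustration under assumptions: the names are hypothetical, the packed `uint` bit layout (R in the low byte, then G, then B) and the opaque alpha value are guesses, and plain arrays stand in for the Unity buffers. In the real code the RGBA destination would be the `NativeArray<byte>` returned by `GetRawTextureData<byte>()`, followed by `Apply`.

```csharp
// Hypothetical sketch of a single pass that feeds both outputs at once.
public static class SinglePassSketch
{
    // rgb:    tightly packed R,G,B source bytes (3 per pixel)
    // packed: one uint per pixel for the GPU upload
    // rgba:   4 bytes per pixel, standing in for the raw texture buffer
    public static void PackAndFill(byte[] rgb, uint[] packed, byte[] rgba)
    {
        int pixelCount = rgb.Length / 3;
        for (int i = 0, src = 0, dst = 0; i < pixelCount; i++, src += 3, dst += 4)
        {
            byte r = rgb[src], g = rgb[src + 1], b = rgb[src + 2];

            // Assumed layout: 0x00BBGGRR. The shader must unpack the same way.
            packed[i] = (uint)(r | (g << 8) | (b << 16));

            // RGBA32 order: R, G, B, A. Alpha 255 is an assumption.
            rgba[dst] = r;
            rgba[dst + 1] = g;
            rgba[dst + 2] = b;
            rgba[dst + 3] = 255;
        }
    }
}
```

Each source pixel is read once and written to both destinations, so the source data is traversed a single time and no intermediate `Color32[]` is needed.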

### 3. Timing Instrumentation
**Files Changed:**
- `Scripts/Map/Loading/ProvinceMapProcessor.cs`: added cache hit/miss timing logs
- `Scripts/Map/Rendering/MapTexturePopulator.cs`: unconditional timing log (not gated by `logProgress`)

---

## Decisions Made

### Decision 1: File.ReadAllBytes + Single MemCpy vs Streamed Read
**Context:** Initial implementation used FileStream with 1MB chunked reads to avoid 292MB managed allocation
**Result:** Chunked reads were slower due to 292 `fixed`+`MemCpy` calls. `File.ReadAllBytes` + single `MemCpy` with `NativeArrayOptions.UninitializedMemory` was significantly faster.
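**Lesson:** For a large sequential read, one `File.ReadAllBytes` call followed by a single copy outperformed manual 1MB chunking in this measurement; the temporary 292MB managed allocation cost less than the per-chunk overhead.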

### Decision 2: GetRawTextureData vs SetPixels32
**Context:** `SetPixels32` requires allocating a `Color32[]` managed array (390MB for RGBA32 at 97.5M pixels)
**Decision:** Use `GetRawTextureData<byte>()` to get a NativeArray view of the texture's internal buffer, write RGBA bytes directly via unsafe pointer.
**Benefit:** Eliminates 390MB managed allocation, halves memory pressure, single pass over source data.
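
The buffer sizes quoted in both decisions follow directly from the map dimensions. The helper below is a hypothetical sketch that reproduces the arithmetic, assuming 3 bytes per pixel for the RGB cache and 4 bytes per pixel for RGBA32.

```csharp
// Hypothetical helper: byte size of an uncompressed pixel buffer.
public static class MapMemoryMath
{
    public static long PixelCount(int width, int height) => (long)width * height;

    public static long Bytes(int width, int height, int bytesPerPixel) =>
        PixelCount(width, height) * bytesPerPixel;
}

// For the 15000x6500 province map:
//   PixelCount(15000, 6500) = 97,500,000  (the "97.5M pixels")
//   Bytes(15000, 6500, 3)   = 292,500,000 (the ~292MB RGB cache payload)
//   Bytes(15000, 6500, 4)   = 390,000,000 (the ~390MB Color32[] or uint[] buffer)
```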

---

## Problems Encountered & Solutions

### Problem 1: Cache Read Slower Than Expected (~5.3s)
**Symptom:** First cache implementation saved ~2.4s instead of expected ~7s
**Root Cause:** FileStream with 1MB chunked reads needed one `fixed`/`MemCpy` pair per chunk (~292 in total), which was slow for 292MB
**Solution:** Replaced with `File.ReadAllBytes` (one sequential read) + single `UnsafeUtility.MemCpy` + `NativeArrayOptions.UninitializedMemory`
**Result:** Cache read dropped to 119ms

### Problem 2: Missing MapTexturePopulator Logs
**Symptom:** No MapTexturePopulator timing logs in any log file
**Root Cause:** `logProgress` parameter was `false` because `gameSettings.ShouldLog(LogLevel.Info)` returns false when log level is Warnings
**Solution:** Made GPU path timing log unconditional (not gated by `logProgress`)
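
The fix for Problem 2 amounts to separating verbose progress messages from the timing line. A minimal sketch, assuming a hypothetical helper and a plain `Action<string>` in place of the project's logger:

```csharp
using System;
using System.Diagnostics;

// Hypothetical sketch: progress messages stay gated, the timing line does not.
public static class TimingLogSketch
{
    public static long RunTimed(string label, Action work, bool logProgress, Action<string> log)
    {
        if (logProgress) log($"{label}: starting");   // verbose, still gated

        var stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();

        // Always emitted, so timings appear even when the log level hides Info output.
        log($"{label}: {stopwatch.ElapsedMilliseconds}ms");
        return stopwatch.ElapsedMilliseconds;
    }
}
```

With `logProgress` set to `false`, exactly one line (the timing) reaches the log.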

---

## Performance Results

### Final Measured Breakdown (cached run, 97.5M pixels, 50k provinces)

| Phase | Before (baseline) | After | Notes / breakdown |
|-------|-------------------|-------|-------------------|
| Province registration | 2.7s | 9ms | Session 1 |
| Normal map gen | 0.9s | 0s | Session 1 |
| Province data loading | 7.8s | 197ms | Cache: 119ms, CSV: 78ms |
| Texture population | ~9.4s | 644ms | Pack+color: 188ms, hash: 7ms, GPU: 449ms |
| Pre-sized collections | - | - | ~0.1s |

### Remaining Time (not optimized this session)
- Texture creation/allocation: ~3s (VRAM allocation for 15000x6500 textures)
- Terrain texture generation: ~0.7s (compute shader, already fast)
- Heightmap loading: ~0.4s
- Texture binding: ~0.8s per rebind cycle
- MapMode/border init: ~5.7s

---

## Quick Reference for Future Claude

**Key implementation:**
- Pixel cache: `Scripts/Map/Loading/ProvinceMapProcessor.cs`, `TryLoadPixelCache()` and `SavePixelCache()`
- Cache format: 16-byte header ("RPXL" + dims + bpp) + raw pixel bytes
- Separated parser: `Scripts/Map/Loading/Images/ProvinceMapParser.cs:ParseProvinceMapWithPixelData()`
- Single-pass populator: `Scripts/Map/Rendering/MapTexturePopulator.cs:TryPopulateGPU()`
- Direct texture write: `GetRawTextureData<byte>()` for zero-alloc RGBA32 population

**Gotchas:**
- `File.ReadAllBytes` + single `MemCpy` beats streamed chunked reads for large sequential files
- `NativeArrayOptions.UninitializedMemory` skips zeroing, which is critical for 292MB allocations
- `GetRawTextureData<byte>()` returns RGBA32 layout (R,G,B,A per pixel, 4 bytes) for RGBA32 textures
- Cache invalidation uses file timestamps, so modifying the PNG auto-invalidates the cache
- GPU dispatch+sync takes ~449ms: this is `ComputeBuffer.SetData` (390MB upload) + dispatch + `AsyncGPUReadback.WaitForCompletion`

**Files changed this session:**
- `Scripts/Map/Loading/ProvinceMapProcessor.cs`: cache read/write, timing logs
- `Scripts/Map/Loading/Images/ProvinceMapParser.cs`: extracted `ParseProvinceMapWithPixelData`
- `Scripts/Map/Rendering/MapTexturePopulator.cs`: single-pass pack+color, unconditional timing log

---

## Related Sessions
- [Session 1](1-loading-performance-optimization.md): GPU compute shader, logging removal, pre-sized collections