Commit ac2db13

Optimize map loading: GPU compute shader, raw pixel cache, single-pass texture population
- GPU compute shader for province ID texture (CPU pixel loop ~9.4s → 644ms)
- Raw pixel cache skips PNG decompression on subsequent loads (~7.8s → 197ms)
- Single-pass pack+color writes both GPU buffer and color texture in one loop
- Remove per-entity logging in ProvinceRegistry, CountryRegistry, Registry
- Pre-size collections in ProvinceRegistry and DefinitionLoader
- Disable broken normal map generation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent a502357 commit ac2db13

19 files changed

Lines changed: 1074 additions & 145 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -104,3 +104,5 @@ tmpclaude-*
104104
_site/
105105
# api/ - NOT ignored! Pre-generated API metadata must be committed for GitHub Pages
106106
/Template-Data/map/personal
107+
Template-Data/map/provinces.png.pixels
108+
Template-Data/map/provinces.png.pixels.meta

Docs/Log/2026/02.meta

Lines changed: 8 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Docs/Log/2026/02/02.meta

Lines changed: 8 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Docs/Log/2026/02/02/1-loading-performance-optimization.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
1+
# Loading Performance Optimization (97.5M pixel stress test)
2+
**Date**: 2026-02-02
3+
**Session**: 1
4+
**Status**: ✅ Complete
5+
**Priority**: High
6+
7+
---
8+
9+
## Session Goal
10+
11+
**Primary Objective:**
12+
- Optimize map loading time for 15000x6500 (97.5M pixel) stress test map with 50k provinces
13+
14+
**Success Criteria:**
15+
- Reduce total load time from ~33s baseline
16+
17+
---
18+
19+
## Context & Background
20+
21+
**Current State:**
22+
- Stress test map: 15000x6500 pixels, 50,000 provinces, ~97.5M pixels
23+
- Baseline total load: ~33 seconds
24+
- Three bottlenecks identified: province registration logging (~2.7s), PNG load+parse (~8s), CPU pixel loop (~10s)
25+
26+
---
27+
28+
## What We Did
29+
30+
### 1. Disabled Normal Map Generation
31+
**Files Changed:** `Scripts/Map/Loading/MapDataLoader.cs:122,181`
32+
33+
Commented out two `GenerateNormalMapFromHeightmap()` calls — feature was broken anyway, saves ~0.9s.
34+
35+
### 2. Removed Per-Province Logging in Registries
36+
**Files Changed:**
37+
- `Scripts/Core/Registries/ProvinceRegistry.cs:52` — removed per-registration log (50,000 calls)
38+
- `Scripts/Core/Registries/CountryRegistry.cs:60` — removed per-registration log
39+
- `Scripts/Core/Registries/Registry.cs:53` — removed per-registration log (generic registry)
40+
41+
**Impact:** Province registration: **2.7s → 9ms**. Game init total: **2.91s → 0.21s**.
42+
43+
### 3. Pre-sized Collections in ProvinceRegistry and DefinitionLoader
44+
**Files Changed:**
45+
- `Scripts/Core/Registries/ProvinceRegistry.cs` — constructor now takes capacity, pre-sizes Dictionary and List (default 65536)
46+
- `Scripts/Core/Loaders/DefinitionLoader.cs` — pre-sizes entries List with `lines.Length` capacity
47+
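As a rough illustration of the pre-sizing change, the sketch below passes a capacity hint to both collections so their backing storage is allocated once rather than regrown while 50,000 entries are added. `PreSizedRegistry` and its members are hypothetical names for this example, not the project's `ProvinceRegistry` API; only the default of 65536 comes from the notes above.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a registry; not the project's ProvinceRegistry.
public sealed class PreSizedRegistry
{
    private readonly Dictionary<int, int> idToIndex;
    private readonly List<string> names;

    // The capacity hint sizes both collections up front, so adding entries
    // does not trigger repeated resize-and-copy cycles.
    public PreSizedRegistry(int capacity = 65536)
    {
        idToIndex = new Dictionary<int, int>(capacity);
        names = new List<string>(capacity);
    }

    public int Count => names.Count;
    public int ListCapacity => names.Capacity;

    public void Register(int id, string name)
    {
        idToIndex.Add(id, names.Count);
        names.Add(name);
    }

    public bool TryGetName(int id, out string name)
    {
        if (idToIndex.TryGetValue(id, out int index))
        {
            name = names[index];
            return true;
        }
        name = "";
        return false;
    }
}
```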
48+
### 4. GPU Compute Shader for Province Texture Population (VERIFIED)
49+
**Files Changed:**
50+
- `Resources/Shaders/PopulateProvinceTextures.compute`: **NEW** compute shader
51+
- `Scripts/Map/Rendering/MapTexturePopulator.cs` — rewritten with GPU path + CPU fallback
52+
- `Scripts/Map/Loading/ProvinceMapProcessor.cs` — added `BMPData.TryGetRawPixelBytes()` accessor
53+
54+
**Architecture:**
55+
- Open-addressing hash table built on CPU from `NativeHashMap<int,int>` color→provinceID mapping
56+
- Uploaded to GPU as `StructuredBuffer<uint2>` (~128k slots, ~1MB)
57+
- Raw PNG pixel bytes packed into `uint[]` on CPU, uploaded as `StructuredBuffer<uint>`
58+
- Compute shader (8x8 threads): each pixel reads RGB, hashes to find province ID, writes packed ID to ProvinceIDTexture
59+
- ProvinceColorTexture populated separately via CPU (it's a Texture2D, not RenderTexture)
60+
- CPU fallback path preserved for BMP format (not currently used)
61+
- Hash function: `key ^= key >> 16; key *= 0x45d9f3b; key ^= key >> 16;` — identical in C# and HLSL
62+
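The CPU side of the architecture above can be sketched roughly as follows. Only the hash function, the 50% load factor, linear probing, and the 64-probe cap come from these notes. The flat `uint[]` layout (standing in for `StructuredBuffer<uint2>`), the `EmptyKey` sentinel, the power-of-two slot count, and the R/G/B bit order in `PackRGB` are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ProvinceHashTable
{
    // Assumed sentinel: a packed 24-bit RGB key can never equal 0xFFFFFFFF.
    public const uint EmptyKey = 0xFFFFFFFFu;
    public const int MaxProbes = 64;

    // Hash from the session notes; the HLSL copy must match it bit for bit.
    public static uint HashRGB(uint key)
    {
        unchecked
        {
            key ^= key >> 16;
            key *= 0x45d9f3bu;
            key ^= key >> 16;
            return key;
        }
    }

    // Assumed packing order: R in the low byte, then G, then B.
    public static uint PackRGB(byte r, byte g, byte b) => (uint)(r | (g << 8) | (b << 16));

    // Builds a flat table of (key, value) pairs: slot i lives at [2*i] and [2*i + 1].
    public static uint[] Build(IReadOnlyDictionary<uint, ushort> colorToId, out int slotCount)
    {
        // Smallest power of two holding at least 2x the entries: load factor <= 50%.
        slotCount = 1;
        while (slotCount < colorToId.Count * 2) slotCount <<= 1;
        uint mask = (uint)(slotCount - 1);

        var table = new uint[slotCount * 2];
        for (int i = 0; i < slotCount; i++) table[2 * i] = EmptyKey;

        foreach (var pair in colorToId)
        {
            uint slot = HashRGB(pair.Key) & mask;
            int probes = 0;
            while (table[2 * slot] != EmptyKey)
            {
                if (++probes >= MaxProbes)
                    throw new InvalidOperationException("Probe limit exceeded; grow the table.");
                slot = (slot + 1) & mask; // linear probing
            }
            table[2 * slot] = pair.Key;
            table[2 * slot + 1] = pair.Value;
        }
        return table;
    }

    // Mirrors what each shader thread would do for one pixel.
    public static bool TryLookup(uint[] table, int slotCount, uint key, out ushort id)
    {
        uint mask = (uint)(slotCount - 1);
        uint slot = HashRGB(key) & mask;
        for (int probes = 0; probes < MaxProbes; probes++)
        {
            uint stored = table[2 * slot];
            if (stored == key) { id = (ushort)table[2 * slot + 1]; return true; }
            if (stored == EmptyKey) break;
            slot = (slot + 1) & mask;
        }
        id = 0;
        return false;
    }
}
```

Because the insert loop refuses to place a key more than 63 steps from its home slot, a lookup capped at 64 probes is guaranteed to find every stored key.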
63+
**Measured Impact:** CPU pixel loop **~9.4s → ~1.3s** (pack + hash build + GPU dispatch + color texture copy)
64+
Note: ~1.3s includes CPU-side pixel packing of 97.5M pixels into uint[] + Color32[] arrays. Detailed breakdown unavailable because log level was set to Warnings (Info logs suppressed).
65+
66+
---
67+
68+
## Decisions Made
69+
70+
### Decision 1: GPU Hash Table vs 3D Lookup Texture
71+
**Options:**
72+
1. Open-addressing hash table in ComputeBuffer — ~1MB, works for any province count
73+
2. 3D lookup texture (256^3) — 64MB VRAM, fast but wasteful
74+
75+
**Decision:** Hash table. 50% load factor, linear probing, max 64 probes.
76+
**Rationale:** 1MB vs 64MB, scales to 65k provinces trivially.
77+
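The 1MB and 64MB figures can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions that the notes do not state outright: the 3D lookup texture stores 4 bytes per entry (which is what yields 64MB), and the hash table rounds its slot count up to a power of two.

```csharp
public static class LookupMemory
{
    // Dense table with one entry for every possible 24-bit RGB color.
    public static long Lut3DBytes(int bytesPerEntry) => 256L * 256L * 256L * bytesPerEntry;

    // Smallest power-of-two slot count that keeps the load factor at or below 50%.
    public static int SlotCount(int entries)
    {
        int slots = 1;
        while (slots < entries * 2) slots <<= 1;
        return slots;
    }

    // Each slot is a uint2 (key, value), so 8 bytes.
    public static long HashTableBytes(int slots) => (long)slots * 8;
}
```

With these definitions, `Lut3DBytes(4)` is 67,108,864 bytes (64 MiB), while `SlotCount(65536)` is 131,072 slots and `HashTableBytes(131072)` is 1,048,576 bytes (1 MiB).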
78+
### Decision 2: Keep 65k Province Limit (ushort)
79+
**Context:** Considered 80k to match EU5
80+
**Decision:** Stay at 65k. EU4 ~5k, Vic3 ~15k, HOI4 ~13k, CK3 ~10k. 65k exceeds all shipped Paradox titles except EU5. Changing ushort encoding would be invasive across entire texture pipeline.
81+
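To show why the limit is tied to `ushort`, here is a minimal sketch of a 16-bit ID split across two 8-bit texture channels. The channel assignment (low byte first) and the names `ProvinceIdCodec`, `Pack`, `Unpack`, and `Fits` are assumptions for illustration; the notes only say that the ID is packed and that the encoding is `ushort`-based.

```csharp
public static class ProvinceIdCodec
{
    // A 16-bit ID has exactly 65536 distinct values.
    public const int MaxProvinces = ushort.MaxValue + 1;

    public static bool Fits(int provinceCount) => provinceCount <= MaxProvinces;

    // Assumed layout: low byte in the first channel, high byte in the second.
    public static (byte Low, byte High) Pack(ushort id) => ((byte)(id & 0xFF), (byte)(id >> 8));

    public static ushort Unpack(byte low, byte high) => (ushort)(low | (high << 8));
}
```

Under this layout an 80k province count would need a third channel or a wider format, which is the invasive change the decision avoids.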
82+
### Decision 3: ProvinceColorTexture Separate from GPU Path
83+
**Context:** ProvinceColorTexture is a Texture2D (not RenderTexture), can't be written by compute shader
84+
**Decision:** GPU shader only writes ProvinceIDTexture. Color texture populated via direct CPU byte copy from raw PNG data (no hash lookups needed — just memcpy-equivalent).
85+
86+
### Decision 4: BMP Support Deferred
87+
**Context:** BMP has BGR order, row padding, bottom-up flip
88+
**Decision:** GPU path only supports PNG. BMP falls back to CPU. Comment left for future if needed.
89+
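For reference, the three BMP differences listed above could be normalized on the CPU roughly as below, producing the same top-down, tightly packed RGB layout the PNG path already supplies. This is a sketch for uncompressed 24-bit BMP pixel data only; `BmpRows` and its methods are hypothetical names, not existing project code.

```csharp
public static class BmpRows
{
    // BMP rows are padded to a multiple of 4 bytes.
    public static int RowStride(int width, int bytesPerPixel = 3) => (width * bytesPerPixel + 3) & ~3;

    // Converts bottom-up, padded BGR rows into top-down, tightly packed RGB.
    public static byte[] ToTopDownRgb(byte[] bmpPixels, int width, int height)
    {
        int stride = RowStride(width);
        var rgb = new byte[width * height * 3];
        for (int y = 0; y < height; y++)
        {
            // Bottom-up storage: the first stored row is the bottom of the image.
            int srcRow = (height - 1 - y) * stride;
            int dstRow = y * width * 3;
            for (int x = 0; x < width; x++)
            {
                int s = srcRow + x * 3;
                int d = dstRow + x * 3;
                rgb[d] = bmpPixels[s + 2];     // R
                rgb[d + 1] = bmpPixels[s + 1]; // G
                rgb[d + 2] = bmpPixels[s];     // B
            }
        }
        return rgb;
    }
}
```

For the 15000-pixel-wide stress map a row is 45,000 bytes, already a multiple of 4, so padding would only matter for other widths.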
90+
---
91+
92+
## Performance Results (All Optimizations)
93+
94+
| Metric | Before | After | Saved |
95+
|--------|--------|-------|-------|
96+
| Province registration | 2.7s | 9ms | ~2.7s |
97+
| Normal map gen | 0.9s | 0s (disabled) | ~0.9s |
98+
| CPU pixel loop → GPU | ~9.4s | ~1.3s | ~8.1s |
99+
| Pre-sized collections | | | ~0.1s |
100+
| **Total saved** | | | **~11.8s** |
101+
102+
Remaining bottleneck:
103+
- PNG load + decompress: ~7.8s (unchanged) — raw pixel cache (Task #4) would address this
104+
105+
---
106+
107+
## Resolved Questions
108+
1. **Hash function GPU correctness:** ✅ Confirmed — map loads correctly, province selection and ownership work
109+
2. **AsyncGPUReadback.WaitForCompletion sync:** ✅ Confirmed — OwnerTextureDispatcher reads ProvinceIDTexture correctly after GPU populate
110+
111+
## Next Session
112+
113+
### Immediate Next Steps
114+
1. Raw pixel cache for PNG decompression (Task #4) — cache decoded pixels as .raw binary, skip PNG decompress on reload (~7.8s → <1s)
115+
2. Consider further optimizing the ~1.3s GPU path — most time is CPU-side pixel packing (97.5M pixels × 2 passes: uint[] for GPU + Color32[] for color texture)
116+
117+
---
118+
119+
## Quick Reference for Future Claude
120+
121+
**Key implementation:**
122+
- Compute shader: `Resources/Shaders/PopulateProvinceTextures.compute`
123+
- C# dispatcher: `Scripts/Map/Rendering/MapTexturePopulator.cs:106-198` (TryPopulateGPU)
124+
- Raw byte accessor: `Scripts/Map/Loading/ProvinceMapProcessor.cs:84-97` (TryGetRawPixelBytes)
125+
- Hash function must match EXACTLY between C# (HashRGB) and HLSL (HashRGB)
126+
127+
**Gotchas:**
128+
- `Core.Registries` resolves to `Map.Core.Registries` in Map namespace — use `global::Core.Registries`
129+
- `ComputeShader.SetInt` takes `int` not `uint` — no cast needed since table sizes are int
130+
- ProvinceColorTexture is Texture2D, NOT RenderTexture — can't use RWTexture2D in compute shader

Docs/Log/2026/02/02/1-loading-performance-optimization.md.meta

Lines changed: 7 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Docs/Log/2026/02/02/2-loading-performance-optimization.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
1+
# Loading Performance Optimization — Session 2 (Raw Pixel Cache + Single-Pass Texture Population)
2+
**Date**: 2026-02-02
3+
**Session**: 2
4+
**Status**: ✅ Complete
5+
**Priority**: High
6+
7+
---
8+
9+
## Session Goal
10+
11+
**Primary Objective:**
12+
- Eliminate PNG decompression bottleneck (~7.8s) via raw pixel cache
13+
- Optimize CPU pixel loops in MapTexturePopulator (~5.2s)
14+
15+
**Success Criteria:**
16+
- Province data loading < 1s on cache hit
17+
- Texture population < 1s
18+
19+
---
20+
21+
## Context & Background
22+
23+
**Previous Work:**
24+
- See: [1-loading-performance-optimization.md](1-loading-performance-optimization.md)
25+
26+
**Current State (start of session):**
27+
- Baseline ~33s reduced to ~21s after session 1 optimizations
28+
- Two remaining bottlenecks: PNG decompress (~7.8s) and CPU pixel loops (~5.2s)
29+
- GPU compute shader for province ID texture verified working
30+
31+
---
32+
33+
## What We Did
34+
35+
### 1. Raw Pixel Cache for Province Map Loading
36+
**Files Changed:**
37+
- `Scripts/Map/Loading/ProvinceMapProcessor.cs` — added `TryLoadPixelCache()`, `SavePixelCache()`, `BuildResultFromPixelData()`
38+
- `Scripts/Map/Loading/Images/ProvinceMapParser.cs` — extracted `ParseProvinceMapWithPixelData()` from `ParseProvinceMapUnified()`
39+
40+
**Architecture:**
41+
- Cache file: `{image_path}.pixels` (e.g., `provinces.png.pixels`)
42+
- Binary format: 16-byte header (magic "RPXL", width, height, bpp, colorType, bitDepth) + raw decoded pixel bytes
43+
- First run: PNG decompress as normal, then save cache (~292MB for 15000x6500 RGB)
44+
- Subsequent runs: `File.ReadAllBytes` + single `UnsafeUtility.MemCpy` into `NativeArray<byte>`
45+
- Cache invalidation: `File.GetLastWriteTimeUtc` comparison — cache stale if source PNG is newer
46+
- CSV parsing still runs every load (fast, ~87ms) — only image decompression is cached
47+
- `ParseProvinceMapWithPixelData()` extracted to avoid duplicating CSV logic between cache-hit and cache-miss paths
48+
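A minimal sketch of the cache format and staleness check described above, using plain .NET I/O. The notes give the header size, the "RPXL" magic, and the field names, but not the field widths; the layout here (two 32-bit integers, three single bytes, one reserved byte) and the reading of "bpp" as bytes per pixel are assumptions, as is the class name `PixelCacheSketch`. A managed `byte[]` stands in for the `NativeArray<byte>` used in the project.

```csharp
using System;
using System.IO;
using System.Text;

public static class PixelCacheSketch
{
    public const int HeaderSize = 16;
    private static readonly byte[] Magic = Encoding.ASCII.GetBytes("RPXL");

    // The cache is usable only if it exists and is not older than the source image.
    public static bool IsFresh(string imagePath, string cachePath) =>
        File.Exists(cachePath) &&
        File.GetLastWriteTimeUtc(cachePath) >= File.GetLastWriteTimeUtc(imagePath);

    public static void Save(string cachePath, int width, int height,
                            byte bytesPerPixel, byte colorType, byte bitDepth, byte[] pixels)
    {
        using (var writer = new BinaryWriter(File.Create(cachePath)))
        {
            writer.Write(Magic);         // 4 bytes
            writer.Write(width);         // 4 bytes, little-endian
            writer.Write(height);        // 4 bytes, little-endian
            writer.Write(bytesPerPixel); // 1 byte
            writer.Write(colorType);     // 1 byte
            writer.Write(bitDepth);      // 1 byte
            writer.Write((byte)0);       // 1 reserved byte pads the header to 16
            writer.Write(pixels);        // raw decoded pixel bytes, no length prefix
        }
    }

    public static bool TryLoad(string cachePath, out int width, out int height, out byte[] pixels)
    {
        width = height = 0;
        pixels = Array.Empty<byte>();
        using (var reader = new BinaryReader(File.OpenRead(cachePath)))
        {
            if (reader.BaseStream.Length < HeaderSize) return false;
            byte[] magic = reader.ReadBytes(4);
            for (int i = 0; i < 4; i++)
                if (magic[i] != Magic[i]) return false;

            int w = reader.ReadInt32();
            int h = reader.ReadInt32();
            byte bytesPerPixel = reader.ReadByte();
            reader.ReadBytes(3); // colorType, bitDepth, reserved: unused in this sketch

            // Reject truncated or mismatched files instead of trusting the header.
            long expected = (long)w * h * bytesPerPixel;
            if (w <= 0 || h <= 0 || expected > int.MaxValue) return false;
            if (reader.BaseStream.Length - HeaderSize != expected) return false;

            pixels = reader.ReadBytes((int)expected);
            width = w;
            height = h;
            return true;
        }
    }
}
```

A loader would call `IsFresh` first, fall back to the PNG decode plus `Save` on a miss, and run the CSV parse in both cases.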
49+
**Measured Impact:** PNG load **~7.8s → 197ms** (119ms cache read + 78ms CSV)
50+
51+
### 2. Single-Pass Texture Population
52+
**Files Changed:** `Scripts/Map/Rendering/MapTexturePopulator.cs`
53+
54+
**Before:** Two separate 97.5M pixel CPU loops:
55+
1. `PackRGBPixels()`: raw bytes → `uint[]` for GPU compute shader
56+
2. `PopulateColorTextureFromRawBytes()`: raw bytes → `Color32[]` → `SetPixels32` → `Apply`
57+
58+
**After:** Single loop that simultaneously:
59+
1. Packs `uint[]` for GPU upload
60+
2. Writes RGBA32 directly into texture buffer via `GetRawTextureData<byte>()` — zero managed allocation for color texture
61+
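The merged loop can be sketched as follows, assuming tightly packed 3-byte RGB input. In the project the `rgba` destination would be the buffer returned by `GetRawTextureData<byte>()`; here a plain `byte[]` stands in so the example runs outside Unity. The bit order of the packed `uint` and the name `SinglePassPacker` are assumptions.

```csharp
using System;

public static class SinglePassPacker
{
    // Reads each source pixel once and writes both outputs:
    // packed[i] for the GPU buffer and 4 RGBA bytes for the color texture.
    public static void PackAndColor(byte[] rgb, uint[] packed, byte[] rgba)
    {
        int pixelCount = rgb.Length / 3;
        if (packed.Length < pixelCount || rgba.Length < pixelCount * 4)
            throw new ArgumentException("Destination buffers are too small.");

        for (int i = 0, s = 0, d = 0; i < pixelCount; i++, s += 3, d += 4)
        {
            byte r = rgb[s];
            byte g = rgb[s + 1];
            byte b = rgb[s + 2];

            packed[i] = (uint)(r | (g << 8) | (b << 16)); // assumed packing order

            rgba[d] = r;
            rgba[d + 1] = g;
            rgba[d + 2] = b;
            rgba[d + 3] = 255; // opaque alpha
        }
    }
}
```

Compared with the two-loop version, the source bytes are read once and no intermediate `Color32[]` is allocated.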
62+
**Measured Impact:** Texture population **~5.2s → 644ms** (pack+color: 188ms, hash: 7ms, GPU dispatch+sync: 449ms)
63+
64+
### 3. Timing Instrumentation
65+
**Files Changed:**
66+
- `Scripts/Map/Loading/ProvinceMapProcessor.cs` — added cache hit/miss timing logs
67+
- `Scripts/Map/Rendering/MapTexturePopulator.cs` — unconditional timing log (not gated by `logProgress`)
68+
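One possible shape for phase timing that does not depend on a progress flag is sketched below with `System.Diagnostics.Stopwatch`. `PhaseTimer` is a hypothetical helper, not the instrumentation actually added to these files; the caller would pass `Summary()` to whatever logger is always enabled.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public sealed class PhaseTimer
{
    private readonly Stopwatch watch = Stopwatch.StartNew();
    private readonly List<(string Name, long Ms)> phases = new List<(string Name, long Ms)>();

    // Records the time since the previous mark under the given phase name.
    public void Mark(string phase)
    {
        phases.Add((phase, watch.ElapsedMilliseconds));
        watch.Restart();
    }

    // Produces one line such as "total 644ms (pack+color: 188ms, hash: 7ms, ...)".
    public string Summary()
    {
        long total = 0;
        var parts = new List<string>();
        foreach (var (name, ms) in phases)
        {
            total += ms;
            parts.Add($"{name}: {ms}ms");
        }
        return $"total {total}ms ({string.Join(", ", parts)})";
    }
}
```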
69+
---
70+
71+
## Decisions Made
72+
73+
### Decision 1: File.ReadAllBytes + Single MemCpy vs Streamed Read
74+
**Context:** Initial implementation used FileStream with 1MB chunked reads to avoid 292MB managed allocation
75+
**Result:** Chunked reads were slower due to 292 `fixed`+`MemCpy` calls. `File.ReadAllBytes` + single `MemCpy` with `NativeArrayOptions.UninitializedMemory` was significantly faster.
76+
**Lesson:** For this 292MB sequential read, one `File.ReadAllBytes` call plus a single `MemCpy` beat manual 1MB chunking (119ms cache read vs ~5.3s).
77+
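The two read strategies compared in this decision look roughly like this in plain .NET. `Buffer.BlockCopy` into a managed array stands in for `UnsafeUtility.MemCpy` into a `NativeArray<byte>`, and the class and method names are hypothetical. The sketch shows the structural difference (one copy versus one copy per chunk), not the measured timings.

```csharp
using System;
using System.IO;

public static class CacheReadStrategies
{
    // One sequential read of the whole file, then a single copy past the header.
    public static byte[] ReadWhole(string path, int headerSize)
    {
        byte[] file = File.ReadAllBytes(path);
        var pixels = new byte[file.Length - headerSize];
        Buffer.BlockCopy(file, headerSize, pixels, 0, pixels.Length);
        return pixels;
    }

    // Fixed-size chunks: one read call and one copy call per chunk.
    public static byte[] ReadChunked(string path, int headerSize, int chunkSize, out int copyCalls)
    {
        copyCalls = 0;
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var pixels = new byte[stream.Length - headerSize];
            stream.Seek(headerSize, SeekOrigin.Begin);
            var chunk = new byte[chunkSize];
            int offset = 0;
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
            {
                Buffer.BlockCopy(chunk, 0, pixels, offset, read);
                offset += read;
                copyCalls++;
            }
            return pixels;
        }
    }
}
```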
78+
### Decision 2: GetRawTextureData vs SetPixels32
79+
**Context:** `SetPixels32` requires allocating a `Color32[]` managed array (390MB for RGBA32 at 97.5M pixels)
80+
**Decision:** Use `GetRawTextureData<byte>()` to get a NativeArray view of the texture's internal buffer, write RGBA bytes directly via unsafe pointer.
81+
**Benefit:** Eliminates 390MB managed allocation, halves memory pressure, single pass over source data.
82+
83+
---
84+
85+
## Problems Encountered & Solutions
86+
87+
### Problem 1: Cache Read Slower Than Expected (~5.3s)
88+
**Symptom:** First cache implementation saved ~2.4s instead of expected ~7s
89+
**Root Cause:** FileStream with 1MB chunked reads needed 292 `fixed`/`MemCpy` calls (one per chunk), which was slow for 292MB
90+
**Solution:** Replaced with `File.ReadAllBytes` (one sequential read) + single `UnsafeUtility.MemCpy` + `NativeArrayOptions.UninitializedMemory`
91+
**Result:** Cache read dropped to 119ms
92+
93+
### Problem 2: Missing MapTexturePopulator Logs
94+
**Symptom:** No MapTexturePopulator timing logs in any log file
95+
**Root Cause:** `logProgress` parameter was `false` because `gameSettings.ShouldLog(LogLevel.Info)` returns false when log level is Warnings
96+
**Solution:** Made GPU path timing log unconditional (not gated by `logProgress`)
97+
98+
---
99+
100+
## Performance Results
101+
102+
### Final Measured Breakdown (cached run, 97.5M pixels, 50k provinces)
103+
104+
| Phase | Before (baseline) | After | Measured |
105+
|-------|-------------------|-------|----------|
106+
| Province registration | 2.7s | 9ms | Session 1 |
107+
| Normal map gen | 0.9s | 0s | Session 1 |
108+
| Province data loading | 7.8s | 197ms | Cache: 119ms, CSV: 78ms |
109+
| Texture population | ~9.4s | 644ms | Pack+color: 188ms, hash: 7ms, GPU: 449ms |
110+
| Pre-sized collections | | | ~0.1s saved |
111+
112+
### Remaining Time (not optimized this session)
113+
- Texture creation/allocation: ~3s (VRAM allocation for 15000x6500 textures)
114+
- Terrain texture generation: ~0.7s (compute shader, already fast)
115+
- Heightmap loading: ~0.4s
116+
- Texture binding: ~0.8s per rebind cycle
117+
- MapMode/border init: ~5.7s
118+
119+
---
120+
121+
## Quick Reference for Future Claude
122+
123+
**Key implementation:**
124+
- Pixel cache: `Scripts/Map/Loading/ProvinceMapProcessor.cs` (`TryLoadPixelCache()`, `SavePixelCache()`)
125+
- Cache format: 16-byte header ("RPXL" + dims + bpp) + raw pixel bytes
126+
- Separated parser: `Scripts/Map/Loading/Images/ProvinceMapParser.cs:ParseProvinceMapWithPixelData()`
127+
- Single-pass populator: `Scripts/Map/Rendering/MapTexturePopulator.cs:TryPopulateGPU()`
128+
- Direct texture write: `GetRawTextureData<byte>()` for zero-alloc RGBA32 population
129+
130+
**Gotchas:**
131+
- `File.ReadAllBytes` + single `MemCpy` beats streamed chunked reads for large sequential files
132+
- `NativeArrayOptions.UninitializedMemory` skips zeroing — critical for 292MB allocations
133+
- `GetRawTextureData<byte>()` returns RGBA32 layout (R,G,B,A per pixel, 4 bytes) for RGBA32 textures
134+
- Cache invalidation uses file timestamps — modifying the PNG auto-invalidates
135+
- GPU dispatch+sync takes ~449ms — this is `ComputeBuffer.SetData` (390MB upload) + dispatch + `AsyncGPUReadback.WaitForCompletion`
136+
137+
**Files changed this session:**
138+
- `Scripts/Map/Loading/ProvinceMapProcessor.cs` — cache read/write, timing logs
139+
- `Scripts/Map/Loading/Images/ProvinceMapParser.cs` — extracted `ParseProvinceMapWithPixelData`
140+
- `Scripts/Map/Rendering/MapTexturePopulator.cs` — single-pass pack+color, unconditional timing log
141+
142+
---
143+
144+
## Related Sessions
145+
- [Session 1](1-loading-performance-optimization.md) — GPU compute shader, logging removal, pre-sized collections

Docs/Log/2026/02/02/2-loading-performance-optimization.md.meta

Lines changed: 7 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
