Commit 938a8b1

Add system and research roadmap documents
1 parent 5f56a8b commit 938a8b1

42 files changed

Lines changed: 1215 additions & 33 deletions

Some content is hidden: large commits have some files collapsed by default, so not all 42 changed files are shown below.

INSTALL.md

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,10 @@

This document explains how to install and prepare StableSteering on Windows for local development or evaluation.

+Published HTML documentation:
+
+- [GitHub Pages Docs](https://apartsinprojects.github.io/StableSteering/)
+
## Requirements

- Windows with PowerShell

README.md

Lines changed: 8 additions & 0 deletions
@@ -4,6 +4,10 @@ StableSteering is a research documentation repository for an interactive system

The runtime app is now GPU-only by default and expects CUDA-backed Diffusers inference.

+Published HTML documentation:
+
+- [GitHub Pages Docs](https://apartsinprojects.github.io/StableSteering/)
+
The current repository contains the specification set used to define the project before implementation:

- [Motivation](./docs/motivation.md)

@@ -19,6 +23,8 @@ The current repository contains the specification set used to define the project
- [Install Guide](./INSTALL.md)
- [Release Guide](./RELEASE.md)
- [Release Notes v0.1.0](./RELEASE_NOTES_v0.1.0.md)
+- [System Improvement Roadmap](./docs/system_improvement_roadmap.md)
+- [Research Improvement Roadmap](./docs/research_improvement_roadmap.md)

## Folder Guides

@@ -61,6 +67,8 @@ This repository now contains:
4. [System Test Specification](./docs/system_test_specification.md)
5. [Pre-Implementation Blueprint](./docs/pre_implementation_blueprint.md)
6. [Quick Start](./docs/quick_start.md)
+7. [System Improvement Roadmap](./docs/system_improvement_roadmap.md)
+8. [Research Improvement Roadmap](./docs/research_improvement_roadmap.md)

## Run Locally

RELEASE_NOTES_v0.1.0.md

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,10 @@

Initial runnable MVP release of StableSteering.

+Published HTML documentation:
+
+- [GitHub Pages Docs](https://apartsinprojects.github.io/StableSteering/)
+
## Highlights

- FastAPI backend for experiments, sessions, rounds, feedback, replay, diagnostics, and async job status

docs/README.md

Lines changed: 14 additions & 0 deletions
@@ -2,12 +2,18 @@

This folder contains the specification set for the StableSteering research platform.

+Published HTML version:
+
+- [GitHub Pages Docs](https://apartsinprojects.github.io/StableSteering/)
+
The current repository also includes a runnable prototype. The most implementation-specific documents are:

- [quick_start.md](/E:/Projects/StableSteering/docs/quick_start.md)
- [user_guide.md](/E:/Projects/StableSteering/docs/user_guide.md)
- [developer_guide.md](/E:/Projects/StableSteering/docs/developer_guide.md)
- [faq.md](/E:/Projects/StableSteering/docs/faq.md)
+- [system_improvement_roadmap.md](/E:/Projects/StableSteering/docs/system_improvement_roadmap.md)
+- [research_improvement_roadmap.md](/E:/Projects/StableSteering/docs/research_improvement_roadmap.md)

Current implementation highlights reflected in these guides:

@@ -46,6 +52,12 @@ Current implementation highlights reflected in these guides:
- [faq.md](/E:/Projects/StableSteering/docs/faq.md)
  Answers common questions about the prototype and its current limitations.

+- [system_improvement_roadmap.md](/E:/Projects/StableSteering/docs/system_improvement_roadmap.md)
+  Tracks engineering, UX, observability, performance, and release priorities for the system itself.
+
+- [research_improvement_roadmap.md](/E:/Projects/StableSteering/docs/research_improvement_roadmap.md)
+  Tracks study design, baselines, measurement, and analysis priorities for the research program.
+
## Supporting Documents

- [document_audit.md](/E:/Projects/StableSteering/docs/document_audit.md)

@@ -62,3 +74,5 @@ Current implementation highlights reflected in these guides:
4. [system_test_specification.md](/E:/Projects/StableSteering/docs/system_test_specification.md)
5. [pre_implementation_blueprint.md](/E:/Projects/StableSteering/docs/pre_implementation_blueprint.md)
6. [quick_start.md](/E:/Projects/StableSteering/docs/quick_start.md)
+7. [system_improvement_roadmap.md](/E:/Projects/StableSteering/docs/system_improvement_roadmap.md)
+8. [research_improvement_roadmap.md](/E:/Projects/StableSteering/docs/research_improvement_roadmap.md)

docs/quick_start.md

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,10 @@

## 1. Install

+Published HTML documentation:
+
+- [GitHub Pages Docs](https://apartsinprojects.github.io/StableSteering/)
+
From the repository root:

```bash

docs/research_improvement_roadmap.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@

# Research Improvement Roadmap

## 1. Purpose

This document tracks the highest-value research improvements for StableSteering as a study platform.

It focuses on:

- research design
- experimental validity
- evaluation quality
- interpretability
- study operations
- comparative baselines

It does not focus on core engineering execution. That belongs in:

- [system_improvement_roadmap.md](/E:/Projects/StableSteering/docs/system_improvement_roadmap.md)

## 2. Current Research Baseline

The current system already supports:

- iterative steering sessions
- multiple samplers and updaters
- multiple feedback modes at the schema level
- deterministic test paths
- replay and trace capture
- real GPU-backed image generation

This is enough for exploratory pilot work, but not yet enough for a strong research program.

## 3. Main Research Gaps

The largest current gaps are:

- limited comparative baselines
- limited human-study instrumentation
- no formal study protocols in the repo
- limited analysis automation
- weak coverage of confounds like seed sensitivity and user inconsistency

## 4. Priority Levels

- `R0`
  Needed before making strong research claims.

- `R1`
  Strongly improves study quality and interpretability.

- `R2`
  Valuable expansions once the core research loop is stable.

## 5. R0: Research Validity Priorities

### 5.1 Establish a baseline comparison matrix

Goals:

- compare steering against simpler alternatives
- avoid overclaiming based on one workflow

Minimum baselines:

- prompt-only rewriting
- prompt-only manual iteration without steering state
- no-update random sampling baseline
- winner-copy vs winner-average vs linear-preference updater comparison
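
A minimal sketch of how such a comparison matrix could be enumerated for planning purposes. The strategy, prompt-set, and seed values below are illustrative placeholders, not identifiers from the StableSteering codebase:

```python
from itertools import product

# Hypothetical condition axes; the names are placeholders, not real identifiers.
STRATEGIES = [
    "prompt_only_rewrite",        # prompt-only rewriting, no steering state
    "prompt_only_manual_iter",    # manual iteration without steering state
    "random_sampling_no_update",  # no-update random sampling baseline
    "winner_copy",
    "winner_average",
    "linear_preference",
]
PROMPT_SETS = ["portraits", "products", "landscapes"]  # placeholder task sets
SEEDS = [0, 1, 2]

def build_comparison_matrix():
    """Enumerate one run configuration per (strategy, prompt set, seed) cell."""
    return [
        {"strategy": s, "prompt_set": p, "seed": seed}
        for s, p, seed in product(STRATEGIES, PROMPT_SETS, SEEDS)
    ]

matrix = build_comparison_matrix()
print(f"{len(matrix)} runs planned")  # 6 strategies x 3 prompt sets x 3 seeds = 54
print(matrix[0])
```

Crossing strategies with prompt sets and seeds keeps the no-update and prompt-only baselines in the same grid as the steering updaters, so each claim gets a like-for-like comparison cell.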

### 5.2 Add explicit study protocols

Goals:

- make experiments repeatable across operators
- reduce ad hoc evaluation drift

Suggested work:

- define pilot study templates
- define prompt set selection rules
- define stopping criteria
- define annotation instructions for operators

### 5.3 Improve confound logging

Goals:

- understand when session outcomes are caused by seed, fatigue, or interface effects rather than the steering method itself

Suggested work:

- log repeated hidden comparisons
- log user confidence
- log time-to-decision
- log interruptions, retries, and session abandonment
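
A minimal sketch of what a per-judgment log record could capture, assuming an append-only JSONL log; the field names are illustrative and not the current trace schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json
import time

@dataclass
class FeedbackEvent:
    """Hypothetical per-judgment record; field names are illustrative only."""
    session_id: str
    round_index: int
    candidate_id: str
    decision: str                   # e.g. "win", "lose", "skip"
    decision_time_ms: int           # time-to-decision
    user_confidence: Optional[int]  # e.g. 1-5 self-report, None if not collected
    is_hidden_repeat: bool          # repeated hidden comparison for consistency checks
    was_interrupted: bool           # interruption / retry / abandonment tracking
    timestamp: float

def log_event(event: FeedbackEvent, path: str = "feedback_events.jsonl") -> None:
    # Append-only JSONL keeps the confound log analysis-friendly and crash-tolerant.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

log_event(FeedbackEvent("s1", 3, "cand_07", "win", 2400, 4, False, False, time.time()))
```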

### 5.4 Define research success criteria

Goals:

- make it clear when a strategy is actually better
- prevent endless qualitative-only iteration

Suggested work:

- define minimum effect expectations
- define acceptable operator burden
- define replay-based success checks
- define robustness thresholds across seeds

## 6. R1: Better Measurement and Analysis

### 6.1 Add stronger outcome metrics

Suggested metrics:

- incumbent win rate against previous incumbents
- average rounds to satisfaction
- preference consistency over repeated judgments
- robustness under alternate seeds
- user-reported controllability
- user-reported fatigue
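
As a sketch of how two of these metrics could be computed from round-level records; the record shape is hypothetical and only stands in for whatever the trace export provides:

```python
from statistics import mean

# Illustrative per-round records; field names are not the actual trace schema.
rounds = [
    {"session": "s1", "round": 1, "challenger_beat_incumbent": True,  "satisfied": False},
    {"session": "s1", "round": 2, "challenger_beat_incumbent": False, "satisfied": False},
    {"session": "s1", "round": 3, "challenger_beat_incumbent": True,  "satisfied": True},
    {"session": "s2", "round": 1, "challenger_beat_incumbent": True,  "satisfied": False},
    {"session": "s2", "round": 2, "challenger_beat_incumbent": True,  "satisfied": True},
]

# How often a new candidate displaces the previous incumbent.
displacement_rate = mean(r["challenger_beat_incumbent"] for r in rounds)

# Average rounds to satisfaction: first round flagged "satisfied" per session.
first_satisfied = {}
for r in rounds:
    if r["satisfied"] and r["session"] not in first_satisfied:
        first_satisfied[r["session"]] = r["round"]
rounds_to_satisfaction = mean(first_satisfied.values())

print(f"incumbent displacement rate: {displacement_rate:.2f}")       # 0.80
print(f"avg rounds to satisfaction: {rounds_to_satisfaction:.1f}")   # 2.5
```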

### 6.2 Build analysis-ready exports

Goals:

- reduce manual cleanup before analysis
- make traces easier to use in notebooks and reports

Suggested work:

- export tidy CSV or parquet summaries
- create one row per candidate
- create one row per feedback event
- create one row per round
- include experiment/session metadata joins
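
A minimal sketch of the intended tidy shape, assuming pandas is available for the export step; the tables and column names are placeholders rather than the actual trace schema:

```python
import pandas as pd

# Illustrative in-memory tables standing in for exported trace data.
sessions = pd.DataFrame([
    {"session_id": "s1", "experiment_id": "exp_a", "sampler": "gaussian", "updater": "winner_average"},
    {"session_id": "s2", "experiment_id": "exp_a", "sampler": "gaussian", "updater": "winner_copy"},
])
feedback = pd.DataFrame([
    {"session_id": "s1", "round": 1, "candidate_id": "c1", "rating": 4},
    {"session_id": "s1", "round": 2, "candidate_id": "c2", "rating": 5},
    {"session_id": "s2", "round": 1, "candidate_id": "c3", "rating": 2},
])

# One tidy row per feedback event, with experiment/session metadata joined in.
tidy = feedback.merge(sessions, on="session_id", how="left")

tidy.to_csv("feedback_events.csv", index=False)
# tidy.to_parquet("feedback_events.parquet")  # requires pyarrow or fastparquet
print(tidy.head())
```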

### 6.3 Add notebook-based analysis templates

Goals:

- make it easy to analyze sessions without rebuilding analysis logic each time

Suggested work:

- session trajectory notebook
- seed robustness notebook
- sampler comparison notebook
- updater comparison notebook

### 6.4 Strengthen replay as a research asset

Goals:

- use replay not just for debugging, but for comparative analysis and auditing

Suggested work:

- derive session summaries automatically
- compute change-over-round plots
- highlight candidate lineage and incumbent transitions
- compare replay trajectories across strategies
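
As an example of a replay-derived summary, a change-over-round series could be computed from the incumbent steering vectors recorded in a trace; the vectors below are made-up stand-ins:

```python
import numpy as np

# Illustrative incumbent steering vectors per round, as they might be read
# back from a replay trace; values and dimensionality are made up.
incumbents = [
    np.array([0.0, 0.0, 0.0]),
    np.array([0.3, -0.1, 0.2]),
    np.array([0.35, -0.15, 0.25]),
    np.array([0.36, -0.14, 0.24]),
]

# Change-over-round: L2 distance between successive incumbents.
deltas = [float(np.linalg.norm(b - a)) for a, b in zip(incumbents, incumbents[1:])]

for round_index, delta in enumerate(deltas, start=1):
    print(f"round {round_index}: incumbent moved by {delta:.3f}")
# A curve that flattens toward zero suggests the session is converging.
```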

## 7. R1: Better Human Interaction Research

### 7.1 Move beyond rating-only interaction

Current gap:

- although the system supports multiple feedback schemas, the current UI still centers ratings

Research opportunity:

- compare rating-based interaction with true pairwise and ranking interactions
- measure cognitive load and speed differences
- measure whether richer critique improves update quality

### 7.2 Evaluate user consistency and fatigue

Goals:

- understand how stable user judgment is across rounds
- understand when the session length starts harming data quality

Suggested work:

- hidden repeat judgments
- forced calibration rounds
- round-count versus confidence tracking
- fatigue self-report prompts
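
A minimal sketch of how hidden repeat judgments could be turned into a consistency score; the record fields are illustrative:

```python
# Illustrative hidden-repeat records: the same comparison shown twice,
# with the user's choice each time. Field names are placeholders.
repeats = [
    {"pair_id": "p1", "first_choice": "A", "second_choice": "A"},
    {"pair_id": "p2", "first_choice": "B", "second_choice": "A"},
    {"pair_id": "p3", "first_choice": "A", "second_choice": "A"},
    {"pair_id": "p4", "first_choice": "B", "second_choice": "B"},
]

# Consistency: fraction of hidden repeats where the user made the same choice.
agreements = sum(r["first_choice"] == r["second_choice"] for r in repeats)
consistency = agreements / len(repeats)
print(f"repeat agreement: {consistency:.2f}")  # 0.75

# Tracking this per block of rounds would show whether consistency degrades
# late in a session, which is one signal of fatigue.
```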

### 7.3 Study interface bias

Goals:

- ensure the UI is not shaping results more than the underlying algorithm

Suggested work:

- randomize candidate order in controlled experiments
- compare metadata-hidden vs metadata-visible views
- compare grid sizes and density
- compare replay-rich vs replay-light workflows

## 8. R2: Strategy Research Expansions

### 8.1 Add richer steering representations

Suggested expansions:

- token-level steering
- pooled-embedding steering
- hybrid low-dimensional plus token mask approaches

### 8.2 Add stronger samplers

Suggested expansions:

- Thompson-style sampling
- quality-diversity or archive-based exploration
- critique-conditioned candidate proposals
- adaptive trust-region sampling
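
As a sketch of the Thompson-style direction, a Beta-Bernoulli Thompson sampler over a small discrete set of candidate steering directions; the directions and simulated feedback are placeholders for real user judgments:

```python
import random

directions = ["dir_a", "dir_b", "dir_c"]
alpha = {d: 1.0 for d in directions}  # Beta posterior: prior + observed wins
beta = {d: 1.0 for d in directions}   # Beta posterior: prior + observed losses
true_win_prob = {"dir_a": 0.3, "dir_b": 0.6, "dir_c": 0.5}  # hidden, simulation only

random.seed(0)
for _ in range(200):
    # Sample a plausible win rate for each direction from its posterior,
    # then propose the direction with the highest sampled value.
    sampled = {d: random.betavariate(alpha[d], beta[d]) for d in directions}
    chosen = max(sampled, key=sampled.get)

    # Simulated binary feedback (user preferred the proposal or not).
    if random.random() < true_win_prob[chosen]:
        alpha[chosen] += 1
    else:
        beta[chosen] += 1

print({d: round(alpha[d] / (alpha[d] + beta[d]), 2) for d in directions})
```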

### 8.3 Add stronger updaters

Suggested expansions:

- Bradley-Terry style preference updating
- Bayesian preference models
- contextual bandit approaches
- critique-aware updates
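
A minimal sketch of Bradley-Terry style preference updating over pairwise outcomes; candidate IDs and comparisons are illustrative, and a production fit would typically add regularization or priors:

```python
import math

# Each tuple is (winner, loser); in practice these would come from session feedback.
comparisons = [("c1", "c2"), ("c1", "c3"), ("c2", "c3"), ("c1", "c2"), ("c3", "c2")]

scores = {"c1": 0.0, "c2": 0.0, "c3": 0.0}
lr = 0.1

for _ in range(100):
    for winner, loser in comparisons:
        # P(winner beats loser) under Bradley-Terry with logit-scale scores.
        p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
        # Gradient ascent on the log-likelihood of the observed outcome.
        scores[winner] += lr * (1.0 - p_win)
        scores[loser] -= lr * (1.0 - p_win)

print({k: round(v, 2) for k, v in scores.items()})
# Higher score = more preferred; these scores could then drive the steering update.
```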

## 9. Study Program Milestones

### Milestone R-A: Pilot Validity

- establish baseline comparison tasks
- define prompt set
- define study protocol
- log confounds more explicitly

### Milestone R-B: Reliable Measurement

- add stronger metrics
- add analysis exports
- add notebooks and replay summaries

### Milestone R-C: Comparative Research

- compare samplers
- compare updaters
- compare feedback modalities
- compare representation strategies

## 10. Suggested Execution Order

1. define baseline comparison matrix
2. define pilot protocol and prompt/task sets
3. add stronger confound logging
4. add analysis-ready exports
5. add replay-based comparative summaries
6. compare feedback modalities
7. compare samplers and updaters
8. expand representation strategies

## 11. Summary

The next research phase should shift from “can the system run?” to “can the system support credible conclusions?”

That means focusing on:

- better baselines
- better measurement
- better confound control
- better analysis workflows
- better human-study structure