
Motivation

1. Document Role

This document explains why the project exists, what research gap it addresses, and what outcomes make the effort worthwhile.

It is the entry point for:

  • researchers evaluating whether the problem is meaningful
  • engineers needing context before implementation
  • reviewers checking whether later design decisions stay aligned with research intent

Related documents:

2. Problem Statement

Text-to-image diffusion systems are powerful but difficult to steer reliably. A user may know what they want to improve after seeing a result, yet still be unable to express that change as a clean prompt rewrite.

Small prompt edits can cause disproportionately large output changes because the final image depends jointly on:

  • prompt wording
  • negative prompt wording
  • seed
  • guidance scale
  • scheduler choice
  • inference step count
  • image resolution

This creates five core usability failures:

  • prompt editing is discrete rather than continuous
  • useful changes are often hard to verbalize
  • user intent evolves after inspecting outputs
  • seed variation obscures whether a prompt change helped
  • repeated trial and error wastes time and attention

3. Central Research Claim

The project is based on one central claim:

User preference may be learned more effectively by steering prompt-conditioning embeddings than by repeatedly rewriting visible prompt text.

Instead of treating the prompt string as the only editable control, the system treats the prompt-conditioned embedding as a controllable search object.

The core loop is:

  1. encode the initial prompt
  2. generate candidate embedding variations
  3. render images from those candidates
  4. collect user feedback
  5. estimate a preferred direction in steering space
  6. update the steering state
  7. repeat
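As a toy illustration, the loop above can be run end to end with plain float lists standing in for embeddings, the identity function standing in for rendering, and a scoring function standing in for user feedback. Every name below is illustrative; a real implementation would wrap a diffusion pipeline and a real feedback UI.

```python
import random

def steering_session(embedding, score, rounds=10, n_candidates=4, step=0.5, seed=0):
    """Toy version of the loop: encode -> perturb -> render -> feedback -> update."""
    rng = random.Random(seed)                       # fixed seed: deterministic replay
    state = list(embedding)                         # 1. encoded prompt embedding
    for _ in range(rounds):
        candidates = [[x + rng.gauss(0.0, step) for x in state]
                      for _ in range(n_candidates)]            # 2. local variations
        rendered = candidates                                  # 3. identity "renderer"
        feedback = [score(img) for img in rendered]            # 4. simulated feedback
        best = candidates[feedback.index(max(feedback))]       # 5. preferred candidate
        if score(best) > score(state):                         # 6. update steering state
            state = best
    return state                                               # 7. loop repeats above

# Stand-in for a user who prefers images near a hidden target embedding.
target = [3.0, -2.0]
def closeness(v):
    return -sum((a - b) ** 2 for a, b in zip(v, target))

final = steering_session([0.0, 0.0], closeness)
```

Because the seed is fixed, rerunning the session reproduces the same trajectory, which is exactly the replay property the platform needs.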

The goal is not merely to produce nicer images. The goal is to study whether this interaction pattern is measurably more controllable, learnable, and reproducible than direct prompt rewriting alone.

4. Why This Matters

If the central claim holds, the project could improve both research and practical image-generation workflows.

Potential research value:

  • clearer measurement of local controllability in text conditioning
  • more rigorous comparison of preference-learning strategies
  • cleaner isolation of seed effects versus semantic steering effects
  • reusable infrastructure for interactive generative-model experiments

Potential practical value:

  • faster personalization for repeated users
  • less dependence on perfect prompt-writing skill
  • more stable iteration on composition, realism, and style
  • better support for evolving intent during exploration

5. Research Questions

The platform should make it possible to answer questions such as:

  • Does local exploration in embedding space produce meaningful and interpretable visual changes?
  • Which feedback type is most informative under limited user attention?
  • Which sampling policy best balances exploration and exploitation?
  • How strongly do random seeds confound preference learning?
  • Do users prefer one shared steering space or semantically separated subspaces?
  • Can a lightweight preference model adapt faster than manual prompt rewriting?
  • How much user fatigue appears across repeated steering rounds?
  • Which interaction design choices reduce inconsistency and bias?

6. Why Existing Interfaces Are Not Enough

Most image-generation products optimize for convenience, speed, and visual polish. They are not designed for controlled research.

A research platform must instead prioritize:

  • exact reproducibility
  • pluggable exploration policies
  • pluggable feedback modes
  • pluggable update mechanisms
  • strong logging
  • deterministic replay
  • experiment export
  • traceable configuration

Without those properties, promising results are difficult to reproduce and negative results are difficult to interpret.
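To make "strong logging" and "deterministic replay" concrete, one possible shape for a per-round trace record is sketched below. The field names are assumptions for illustration, not a fixed schema; the point is that seed, policy choices, candidate references, raw feedback, and configuration hash are all captured together.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RoundTrace:
    session_id: str
    round_index: int
    seed: int            # exact seed used for this round's renders
    sampler: str         # which exploration policy produced the candidates
    feedback_mode: str   # scalar / pairwise / top-3
    candidate_ids: list  # references to stored embedding vectors
    feedback: dict       # raw user responses, untransformed
    config_hash: str     # hash of the full experiment configuration

trace = RoundTrace("s-001", 0, 42, "random_local", "pairwise_comparison",
                   ["c0", "c1"], {"winner": "c1"}, "deadbeef")
line = json.dumps(asdict(trace))  # append-only JSONL supports replay and export
```

Storing raw feedback rather than derived scores keeps later re-analysis possible when an update rule changes.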

7. Intended Outcomes

This project should produce:

  • a controlled environment for studying interactive embedding steering
  • a repeatable way to compare candidate-generation policies
  • a repeatable way to compare feedback mechanisms
  • a repeatable way to compare preference-update rules
  • enough trace data to analyze confounds after the fact

The project is successful if it enables trustworthy experiments, even if some steering strategies ultimately fail.

8. Research Goals

The system must support controlled experiments on:

  • embedding-space candidate generation
  • user feedback collection
  • user-preference inference
  • iterative update policies
  • robustness to randomness
  • reproducibility and traceability
  • session replay and comparative analysis

9. Non-Goals

The first version is not intended to:

  • deliver state-of-the-art image quality
  • support production traffic
  • handle large multi-user concurrency
  • perform full model fine-tuning
  • support every diffusion family
  • optimize hardware throughput aggressively
  • solve identity, billing, or enterprise security requirements

10. First Experimental Matrix

A useful first comparison grid is:

Axis 1: Sampling

  • random local
  • exploit-plus-orthogonal
  • uncertainty-guided

Axis 2: Feedback

  • scalar rating
  • pairwise comparison
  • top-3 ranking

Axis 3: Update

  • winner-average
  • linear preference update
  • pairwise logistic update
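Of the three update rules, the pairwise logistic update is the least self-explanatory. A minimal Bradley-Terry-style sketch, with illustrative names and plain float lists in place of real embeddings:

```python
import math

def pairwise_logistic_update(w, winner, loser, lr=0.1):
    """Nudge preference vector w so that <w, winner> exceeds <w, loser>."""
    # Model P(winner beats loser) = sigmoid(<w, winner - loser>)
    diff = [a - b for a, b in zip(winner, loser)]
    score = sum(wi * di for wi, di in zip(w, diff))
    p = 1.0 / (1.0 + math.exp(-score))
    # Gradient ascent on the log-likelihood of the observed comparison
    return [wi + lr * (1.0 - p) * di for wi, di in zip(w, diff)]

w = [0.0, 0.0]
for _ in range(50):
    w = pairwise_logistic_update(w, winner=[1.0, 0.0], loser=[0.0, 1.0])
# After repeated identical comparisons, w should score the winner higher.
```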

Axis 4: Seed policy

  • fixed-per-round
  • fixed-per-round with periodic validation seeds

This matrix is large enough to reveal meaningful differences while still being manageable for a first study.
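As a sanity check on study size, the full grid can be enumerated directly. The value names below are placeholders for the axis options listed above:

```python
from itertools import product

SAMPLERS      = ["random_local", "exploit_plus_orthogonal", "uncertainty_guided"]
FEEDBACK      = ["scalar_rating", "pairwise_comparison", "top3_ranking"]
UPDATES       = ["winner_average", "linear_preference", "pairwise_logistic"]
SEED_POLICIES = ["fixed_per_round", "fixed_per_round_with_validation"]

# Every experiment cell is one combination across the four axes.
cells = list(product(SAMPLERS, FEEDBACK, UPDATES, SEED_POLICIES))
print(len(cells))  # 3 * 3 * 3 * 2 = 54 cells
```

Fifty-four cells is small enough to run with a handful of participants per cell, yet large enough to separate sampler effects from updater effects.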

11. Main Risks and Confounds

The specification must explicitly acknowledge the major sources of ambiguity.

11.1 Seed confounding

A candidate may appear better due to random seed luck rather than steering quality.

11.2 Human inconsistency

Users may change their preference criteria over time or answer inconsistently under fatigue.

11.3 Entangled directions

A single movement in steering space may alter style, composition, and realism simultaneously.

11.4 Interface bias

Layout, labeling order, or visual emphasis may bias selection independently of image quality.

11.5 Fatigue effects

Long sessions can reduce decision quality and increase noisy feedback.

11.6 Overfitting to one user workflow

A strategy that works well for one interaction style may fail under a different feedback mode or prompt type.

The system must log enough metadata to study these confounds rather than hiding them.

12. Success Criteria

The project is worth continuing if it can demonstrate:

  • reproducible experiment runs
  • replayable session traces
  • meaningful comparisons across multiple sampler and updater combinations
  • measurable user preference improvement across rounds in at least some settings
  • clear analysis of when steering succeeds and when it fails

13. Summary

This project matters because it creates a disciplined environment for studying whether prompt-embedding steering can make interactive image generation more controllable than prompt rewriting alone.

Its value comes from the quality of the experiment environment, not from assuming in advance that one steering method will win.