Improve context awareness and image handling for multimodal understanding#69

Merged
quiet-node merged 1 commit into main from worktree-ancient-fluttering-river on Apr 8, 2026
Conversation

@quiet-node (Owner)

Summary

This PR significantly enhances Thuki's ability to understand and reason across multiple context signals (highlighted text, images, user messages) simultaneously, making it smarter at inferring intent and reducing the need for clarification.

Changes

1. Preserve images in conversation history

  • Images are no longer stripped from persisted conversation history
  • Follow-up questions can now reference earlier screenshots without them being lost
  • Full multimodal context available for every response
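A minimal sketch of what "no longer stripping images" could look like. The types and function names (`ChatMessage`, `ContentPart`, `persistHistory`) are illustrative assumptions, not Thuki's actual API:

```typescript
// Hypothetical multimodal message shape, assumed for illustration.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; dataUrl: string };

interface ChatMessage {
  role: "user" | "assistant";
  content: ContentPart[];
}

// Old behavior (sketch): image parts were dropped before persisting,
// so follow-up questions lost access to earlier screenshots.
function stripImages(history: ChatMessage[]): ChatMessage[] {
  return history.map((m) => ({
    ...m,
    content: m.content.filter((p) => p.type === "text"),
  }));
}

// New behavior (sketch): history is persisted verbatim, keeping both
// text and image parts so every response sees full multimodal context.
function persistHistory(history: ChatMessage[]): ChatMessage[] {
  return history;
}
```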

2. Structured message format

  • Replaced ambiguous Context: "..." format with explicit [Highlighted Text] and [Request] labels
  • Helps smaller models like Gemma clearly parse boundaries between subject, context, and intent
  • Cleaner separation for multimodal reasoning
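The labeled format could be produced by a wrapper like the following. `buildUserMessage` is a hypothetical helper shown only to illustrate the label structure described above:

```typescript
// Sketch: wrap the user's request with explicit section labels when
// highlighted text is present, replacing the old `Context: "..."` prefix.
function buildUserMessage(request: string, highlighted?: string): string {
  if (!highlighted) return request;
  return `[Highlighted Text]\n${highlighted}\n\n[Request]\n${request}`;
}
```

The explicit labels give small models an unambiguous boundary between the subject (the selection) and the intent (the request), rather than forcing them to infer where quoted context ends.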

3. Enhanced system prompt

Three major additions:

  • Multi-signal reasoning: Teaches the model that highlighted text = primary subject, images = supporting context, message = user intent. When all three are present, synthesize them holistically.

  • Conversational pattern recognition: When users establish a structural pattern (e.g., "What is the population of the US?" then "What about Vietnam?"), recognize and apply the pattern directly instead of asking for clarification.

  • Intelligent multi-image handling: When users reference "the previous image" in a conversation with multiple images, don't assume the immediately preceding one. If it's blank/irrelevant, scan history to find the most recent visually relevant image for comparison.
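The three additions above might be expressed in the system prompt along these lines. This excerpt is an illustrative paraphrase of the rules described in this PR, not the verbatim prompt text it ships:

```typescript
// Illustrative system-prompt excerpt (assumed wording, not the shipped prompt).
const MULTI_SIGNAL_RULES = `
When highlighted text, images, and a user message are all present:
- Treat the highlighted text as the primary subject.
- Treat images as supporting context.
- Treat the user message as the intent; synthesize all three holistically.
If the user establishes a question pattern (e.g. "What is the population
of the US?" then "What about Vietnam?"), apply the pattern to the new
topic instead of asking for clarification.
When the user says "the previous image" and the immediately preceding
image is blank or irrelevant, scan history for the most recent visually
relevant image before comparing.
`.trim();
```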

Impact

  • Users can now send follow-up questions about images without re-attaching them
  • Fewer clarification requests; more direct answers
  • Better handling of quoted text + images together (Thuki's core feature)
  • More natural conversational flow with pattern inference
  • Smarter image comparisons in multi-image conversations

Testing

  • All 472 frontend tests pass
  • All 149 backend tests pass
  • Lint, format, typecheck, and build validation: zero warnings, zero errors

Generated with Claude Code

feat: improve context awareness and image handling for better multimodal understanding

Preserve images in conversation history so follow-up messages can reference earlier screenshots. When users ask follow-up questions about images, Thuki now has access to all previously shared images, enabling richer context for responses.

Restructure message format for clarity. When quoted text is present (from highlighted selections), the message is now explicitly labeled as [Highlighted Text] and [Request], replacing the ambiguous 'Context: ' prefix. This helps smaller models like Gemma parse intent boundaries clearly.

Enhance system prompt with three major improvements:

1. Multi-signal reasoning: Teach the model that highlighted text is the primary subject, images are supporting context, and the user message is the intent. When all three are present, synthesize them holistically.

2. Conversational pattern recognition: Recognize when users establish a pattern (e.g., 'What is the population of X?' then 'What about Y?') and apply the pattern to the new topic instead of asking for clarification.

3. Intelligent multi-image handling: When users reference 'the previous image' in a conversation with multiple images, don't assume the immediately preceding one. If that image is blank or irrelevant, look back through history to find the most recent visually relevant image for comparison.

These changes leverage the full conversation-history context (images are now preserved) to deliver smarter, more context-aware responses that require fewer clarification prompts.

Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com>
quiet-node merged commit 7f64352 into main Apr 8, 2026
3 checks passed
quiet-node deleted the worktree-ancient-fluttering-river branch April 8, 2026 20:53
quiet-node added a commit that referenced this pull request Apr 10, 2026
feat: improve context awareness and image handling for better multimodal understanding (#69)
quiet-node added a commit that referenced this pull request Apr 10, 2026
feat: improve context awareness and image handling for better multimodal understanding (#69)
quiet-node added a commit that referenced this pull request Apr 11, 2026
feat: improve context awareness and image handling for better multimodal understanding (#69)