Add unified text message import (WhatsApp, iMessage, Google Voice) #238

Open
wesm wants to merge 61 commits into main from feat/text-message-import

Conversation

wesm (Owner) commented Apr 1, 2026

Summary

Unifies three independent text message import implementations into a coherent system with a shared database schema, phone-based participant model, and dedicated TUI Texts mode.

What changed

  • Shared foundation: NormalizePhone E.164 utility, generalized EnsureParticipantByPhone with per-platform identifier tracking, RecomputeConversationStats shared store method
  • iMessage refactored: dropped gmail.API adapter and synthetic MIME; writes directly to store with proper message_type/sender_id/conversation_type, phone-based participants, message_recipients, and raw JSON storage
  • Google Voice refactored: same pattern; three message_type values (google_voice_text/call/voicemail), labels, raw HTML storage, correct outbound recipients
  • WhatsApp cleaned up: uses shared utilities, skips broken attachment rows when no --media-dir
  • CLI renamed: import-whatsapp, import-imessage, import-gvoice (deprecated import --type whatsapp alias kept)
  • Parquet cache extended: conversation_type column, schema v5, email-only filter on existing queries
  • TextEngine query interface: separate from Engine to avoid rippling through remote/API/MCP layers; DuckDB + SQLite implementations
  • TUI Texts mode: m key toggles modes; Conversations view, aggregate views (Contacts, Sources, Labels, Time), message timeline, plain-text search, read-only
  • FTS backfill: handles phone-based text senders via sender_id
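
A minimal sketch of what an E.164 normalizer along the lines of NormalizePhone might look like. The function name `normalizePhoneSketch`, the accepted separator set, the `00` international-prefix rule, and the 7-digit lower bound are assumptions for illustration; only the 15-digit upper bound comes from E.164 itself. The PR's actual signature and rules may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// normalizePhoneSketch is a hypothetical stand-in for the PR's NormalizePhone.
// It returns "+<digits>" or an error when the input does not look like a
// phone number with a country code.
func normalizePhoneSketch(raw string) (string, error) {
	var b strings.Builder
	for i, r := range strings.TrimSpace(raw) {
		switch {
		case r >= '0' && r <= '9':
			b.WriteRune(r)
		case r == '+' && i == 0:
			b.WriteRune(r)
		case r == ' ' || r == '-' || r == '.' || r == '(' || r == ')':
			// Common visual separators are dropped (assumed set).
		default:
			return "", errors.New("not a phone number")
		}
	}
	s := b.String()
	if strings.HasPrefix(s, "00") {
		// Assumption: treat a leading 00 as the international prefix.
		s = "+" + s[2:]
	}
	if !strings.HasPrefix(s, "+") {
		return "", errors.New("missing country code")
	}
	digits := s[1:]
	if len(digits) < 7 || len(digits) > 15 || digits[0] == '0' {
		return "", errors.New("digit count or leading digit outside the accepted range")
	}
	return s, nil
}

func main() {
	for _, in := range []string{"+1 (555) 010-9999", "0044 20 7946 0958", "alice@example.com"} {
		out, err := normalizePhoneSketch(in)
		fmt.Printf("%q -> %q, err=%v\n", in, out, err)
	}
}
```

Returning an error rather than a best-effort string lets a caller fall back to another identifier type when the handle is not a phone number.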

🤖 Generated with Claude Code

eddowding and others added 30 commits March 31, 2026 18:15
Import WhatsApp messages from decrypted msgstore.db backups into
msgvault. Reads contacts, messages, and group metadata from the
WhatsApp SQLite database and maps them into the existing msgvault
schema with message_type='whatsapp'.

Includes:
- WhatsApp SQLite queries for messages, contacts, group metadata
- Contact resolution (phone → name) with WhatsApp contact DB support
- Conversation/thread mapping for 1:1 and group chats
- TUI and query engine updates for multi-source message types
- Schema migrations for phone_number, message_type, conversation title

Co-Authored-By: Ed Dowding <me@eddowding.com>
Sync iMessage history from the local macOS chat.db into msgvault.
Reads conversations, messages, and participants from the iMessage
SQLite database and stores them using the existing schema.

Includes:
- iMessage SQLite client with timestamp format auto-detection
- Message and conversation parsing with participant resolution
- CLI command (sync-imessage) with incremental sync support
- Parser tests for message extraction and formatting

Co-Authored-By: Ryan Stern <206953196+vanboompow@users.noreply.github.com>
Import Google Voice history from Google Takeout exports into msgvault.
Parses HTML conversation files, VCF contacts, and call logs from the
Takeout directory structure.

Includes:
- Takeout directory parser for texts, voicemails, and calls
- HTML conversation parser with timestamp and participant extraction
- VCF contact parser for Google Voice number detection
- CLI command (sync-gvoice) with conversation deduplication
- Parser tests for HTML and VCF extraction

Co-Authored-By: Ryan Stern <206953196+vanboompow@users.noreply.github.com>
Design for merging WhatsApp (#160), iMessage (#224), and Google Voice
(#225) import implementations into a coherent system with shared
phone-based participant model, proper schema usage, and dedicated TUI
Texts mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Support non-phone participants (iMessage email handles, short codes)
  with a resolution order: phone → email → raw handle
- Add conversation title fallback chain for 1:1 chats
- Generalize EnsureParticipantByPhone to accept identifierType param
- Split Google Voice into distinct message_types (text/call/voicemail)
  so Texts mode can cleanly filter out call records
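
The phone → email → raw handle order described above can be sketched as follows. This is an illustration only: `resolveHandleSketch`, the `handle` type, and the simplified phone and email checks are hypothetical and are not taken from the PR's parser.go.

```go
package main

import (
	"fmt"
	"strings"
)

// handle records which resolution step matched and the normalized value.
type handle struct {
	Kind  string // "phone", "email", or "raw"
	Value string
}

// asPhone is a simplified check (assumption): a leading '+' followed by
// 7 to 15 digits once spaces, dashes, and parentheses are removed.
func asPhone(s string) (string, bool) {
	if !strings.HasPrefix(s, "+") {
		return "", false
	}
	digits := strings.Map(func(r rune) rune {
		if r == ' ' || r == '-' || r == '(' || r == ')' {
			return -1 // drop separators
		}
		return r
	}, s[1:])
	if len(digits) < 7 || len(digits) > 15 {
		return "", false
	}
	for _, r := range digits {
		if r < '0' || r > '9' {
			return "", false
		}
	}
	return "+" + digits, true
}

// resolveHandleSketch tries phone first, then email, then keeps the raw
// handle (for example an SMS short code).
func resolveHandleSketch(raw string) handle {
	s := strings.TrimSpace(raw)
	if p, ok := asPhone(s); ok {
		return handle{"phone", p}
	}
	at := strings.Index(s, "@")
	if at > 0 && at < len(s)-1 && strings.Count(s, "@") == 1 {
		return handle{"email", strings.ToLower(s)}
	}
	return handle{"raw", s}
}

func main() {
	for _, in := range []string{"+1 555-010-9999", "Alice@Example.com", "262966"} {
		fmt.Printf("%q -> %+v\n", in, resolveHandleSketch(in))
	}
}
```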

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Texts mode is explicitly read-only (no deletion staging); imported
  archives have no live delete API
- Scope the cross-channel unification principle honestly: phone-based
  dedup works, email-only iMessage handles remain separate until
  address book resolution
- Conversation stats maintained by the store layer on insert, not
  left to each importer
- Unified Parquet cache with mode filtering instead of separate
  texts/ directory
- Label persistence is part of the shared importer contract
- FTS backfill updated to populate sender from phone_number via
  sender_id for text messages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace incremental stats with idempotent RecomputeConversationStats
  post-import step to avoid counter drift on re-imports
- Define Texts mode as a parallel navigation tree with new types
  (TextViewType, ConversationRow) and separate query methods, not a
  parameterization of the existing email aggregate model
- Explicitly disable message detail in Texts mode (Enter is no-op on
  timeline messages) since the detail model is email-shaped
- Clarify source filter as per-account (same plumbing as email), not
  a source-type bucket; defer source-type grouping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove all ambiguity from conversation stats: explicitly state stats
  are NOT maintained during insert, only recomputed post-import
- Replace vague keybinding claims with explicit mapping table showing
  every key's behavior in both Email and Texts modes
- Define Texts mode search as plain full-text only; Gmail-style
  operators (from:, subject:, etc.) are email-mode only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix A keybinding: principle 4 incorrectly listed A (account
  selector) as a disabled selection key; actual selection keys are
  Space and S
- Introduce TextEngine as a separate interface from Engine to avoid
  rippling text query methods through remote/API/MCP/mock layers
- DuckDBEngine implements both Engine and TextEngine; remote layers
  only implement TextEngine when remote Texts mode is added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
23 tasks across 5 phases: foundation store methods, importer
refactoring (iMessage + GVoice + WhatsApp cleanup), Parquet cache +
TextEngine query interface, TUI Texts mode, and integration testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add identifierType as a third parameter so callers can pass "imessage",
"google_voice", etc. The participant_identifiers INSERT now uses the
parameter instead of the hardcoded literal 'whatsapp'.

Move the INSERT OR IGNORE outside the new-only branch so that calling
the method for an existing participant with a new identifierType (e.g.
the same phone number seen on a second platform) still records the
additional identifier row.

All existing WhatsApp call sites updated to pass "whatsapp" explicitly.
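
To illustrate why the identifier insert moved outside the new-only branch, here is a small in-memory model of the behavior. It is a sketch under assumptions: the real method writes to SQLite, and the `memStore` type, its fields, and the parameter order are hypothetical.

```go
package main

import "fmt"

// identKey models one participant_identifiers row; using it as a map key
// mimics INSERT OR IGNORE (a duplicate insert is a no-op).
type identKey struct {
	ParticipantID int64
	Type          string
	Value         string
}

// memStore is a hypothetical in-memory stand-in for the real store.
type memStore struct {
	nextID      int64
	byPhone     map[string]int64
	names       map[int64]string
	identifiers map[identKey]struct{}
}

func newMemStore() *memStore {
	return &memStore{
		byPhone:     map[string]int64{},
		names:       map[int64]string{},
		identifiers: map[identKey]struct{}{},
	}
}

// ensureParticipantByPhone returns the participant for phone, creating it
// if needed, and always records the (participant, identifierType) pair.
func (s *memStore) ensureParticipantByPhone(phone, displayName, identifierType string) int64 {
	id, exists := s.byPhone[phone]
	if !exists {
		s.nextID++
		id = s.nextID
		s.byPhone[phone] = id
		s.names[id] = displayName
	}
	// Outside the new-only branch: an existing participant seen on a
	// second platform still gets the additional identifier row.
	s.identifiers[identKey{id, identifierType, phone}] = struct{}{}
	return id
}

func main() {
	s := newMemStore()
	a := s.ensureParticipantByPhone("+15550109999", "Alice", "whatsapp")
	b := s.ensureParticipantByPhone("+15550109999", "Alice", "imessage")
	fmt.Println(a == b, len(s.identifiers))
}
```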

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract inline SQL from WhatsApp importer into a reusable
Store.RecomputeConversationStats(sourceID) method that updates
message_count, participant_count, last_message_at, and
last_message_preview for all conversations belonging to a source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop the gmail.API adapter pattern. Instead of building synthetic MIME
messages and flowing through the email sync pipeline, the client now
calls store methods directly with proper message_type, sender_id,
conversation_type, and phone-based participants.

- parser.go: Replace normalizeIdentifier with resolveHandle using
  shared textimport.NormalizePhone; remove buildMIME, formatMIMEAddress,
  normalizePhone
- client.go: Replace gmail.API interface with Import method that
  writes to store directly; keep all chat.db reading logic
- models.go: Add ImportSummary type
- CLI: New import-imessage command replaces sync-imessage
- parser_test.go: Replace normalizeIdentifier/MIME tests with
  resolveHandle tests covering phone/email/raw-handle cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop gmail.API adapter and synthetic MIME. Google Voice now writes
to the store with proper message_type (google_voice_text/call/
voicemail), phone-based participants via EnsureParticipantByPhone,
and labels linked via EnsureLabel + LinkMessageLabel.
Consistent naming with import-imessage and import-gvoice. The --type
flag is removed since each source has its own subcommand. --phone is
now a required flag instead of validated at runtime.
- Add conversation_type to conversations Parquet export
- Add conversation_type to DuckDB optional-column probe and parquetCTEs
  (falls back to 'email' for old caches)
- Bump cacheSchemaVersion to 5 to force full rebuild on upgrade
- Add email-only filter to DuckDB buildFilterConditions and buildWhereClause
  so existing email TUI views exclude WhatsApp/iMessage/GVoice messages
- Apply same email-only filter to SQLite optsToFilterConditions and
  buildFilterJoinsAndConditions
- Update test schemas to include conversation_type column

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement the Texts mode toggle (m key) for browsing text message data
(WhatsApp, iMessage, SMS, Google Voice) alongside the existing Email mode.

New files:
- text_state.go: tuiMode, textViewLevel, textState, textNavSnapshot types
- text_keys.go: key handling for all text view levels
- text_view.go: rendering for conversations, aggregates, and timeline
- text_commands.go: async data loading via TextEngine

Modified files:
- model.go: mode/textEngine/textState fields, Options.TextEngine,
  Update routing for text message types, handleKeyPress mode dispatch
- keys.go: m key in handleGlobalKeys to toggle modes
- view.go: mode indicator in title bar, m key in help modal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests the full pipeline: store methods, participant deduplication
across sources, conversation stats recomputation, and all four
TextEngine methods (ListConversations, TextAggregate,
ListConversationMessages, GetTextStats).

Also fixes a bug in ListConversations where MAX(sent_at), which
SQLite's aggregate returns as a plain string, was scanned into
sql.NullTime (which fails). Now scans into sql.NullString and parses
with an explicit list of timestamp formats.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `id DESC` as a tie-breaker in the last_message_preview subquery so
messages with identical timestamps always resolve to the same row.

Strengthen TestRecomputeConversationStats to assert participant_count
(via EnsureParticipantByPhone + EnsureConversationParticipant) and
last_message_preview (latest message snippet by sent_at).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use "local" as default source identifier instead of meaningless
  "imessage"; add --me flag for phone/email override
- Write message_recipients rows (from/to) based on is_from_me:
  outgoing messages set owner as from and chat participants as to;
  incoming messages set sender handle as from and owner as to
- Set sender_id on is_from_me messages when owner is resolved
- Store message data as JSON in message_raw via
  UpsertMessageRawWithFormat with "imessage_json" format
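
The from/to rule for message_recipients can be sketched as below. The `recipient` type and `recipientsFor` function are hypothetical, and skipping the owner in the outgoing "to" list is an added assumption rather than something stated in the commit.

```go
package main

import "fmt"

// recipient is a hypothetical in-memory stand-in for a message_recipients row.
type recipient struct {
	Role        string // "from" or "to"
	Participant string
}

// recipientsFor applies the is_from_me rule: outgoing messages are from
// the owner to the chat participants; incoming messages are from the
// sender handle to the owner.
func recipientsFor(isFromMe bool, owner, sender string, chatParticipants []string) []recipient {
	if !isFromMe {
		return []recipient{{"from", sender}, {"to", owner}}
	}
	rows := []recipient{{"from", owner}}
	for _, p := range chatParticipants {
		if p == owner {
			continue // assumption: the owner is not listed as their own recipient
		}
		rows = append(rows, recipient{"to", p})
	}
	return rows
}

func main() {
	group := []string{"+15550100001", "+15550100002"}
	fmt.Println(recipientsFor(true, "+15550109999", "", group))
	fmt.Println(recipientsFor(false, "+15550109999", "+15550100001", group))
}
```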

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
macOS Ventura+/Sequoia uses NSArchiver "streamtyped" serialization
for attributedBody, not NSKeyedArchiver binary plist. The parser now
handles both formats:

- streamtyped: scan for \x84\x01+ NSString marker, read length-
  prefixed UTF-8 text
- bplist: existing NSKeyedArchiver path (unchanged)

This was causing all message snippets and bodies to be empty for
iMessage imports on modern macOS.

roborev-ci bot commented Apr 1, 2026

roborev: Combined Review (43b53fe)

Verdict: Review blocked — no validated code findings were produced because the changes could not be inspected.

High

  • Review could not be completed (Location: N/A)
    One review reported that the required local git diff/log commands failed in the sandbox (bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted), so the commit range could not be inspected and no code-grounded review could be performed.
    Fix: Re-run the review with working read-only local git access or provide the combined diff inline.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm wesm marked this pull request as ready for review April 1, 2026 17:20
MaxOpenConns(1) caused the TUI to deadlock when the FTS backfill
goroutine held the single connection while async text queries tried
to use it. SQLite WAL supports concurrent readers, so allow 4
connections to unblock parallel read operations.
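
A minimal sketch of the pool sizing described here, using the standard library's `(*sql.DB).SetMaxOpenConns`. The helper names and the DSN checks are assumptions; the single-connection case for in-memory databases reflects a later commit in this PR, since each connection to a plain `:memory:` database would otherwise get its own separate database.

```go
package main

import (
	"database/sql"
	"fmt"
	"strings"
)

// poolSizeFor picks a connection limit for a SQLite DSN (hypothetical
// helper). File-backed WAL databases allow concurrent readers, so they
// get 4 connections; in-memory databases stay at 1.
func poolSizeFor(dsn string) int {
	if strings.Contains(dsn, ":memory:") || strings.Contains(dsn, "mode=memory") {
		return 1
	}
	return 4
}

// configurePool applies the limit to an already opened handle.
func configurePool(db *sql.DB, dsn string) {
	db.SetMaxOpenConns(poolSizeFor(dsn))
}

func main() {
	// Opening a real database needs a SQLite driver, which is left out
	// to keep the sketch standard-library only.
	fmt.Println(poolSizeFor("/home/user/.msgvault/msgvault.db"))
	fmt.Println(poolSizeFor(":memory:"))
}
```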

Also fix iMessage body extraction for macOS streamtyped format.

roborev-ci bot commented Apr 1, 2026

roborev: Combined Review (38884c9)

Verdict: Review could not be completed because the PR diff was inaccessible in the current sandbox.

High

  • Diff unavailable (Location: N/A): Multiple reviews were blocked from inspecting commit range 3c9ae7ccdc08697166ac148ab488f15298fbea51..38884c9bf4a01899d3c84d55e9fb2924d7f7f2ea because read-only git log / git diff commands failed with bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted. As a result, no reliable correctness or security review could be performed.
    • Suggested fix: Re-run review in an environment where local git diff commands work, or provide the combined diff directly in the PR.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The streamtyped length encoding had extra framing bytes between the
length prefix and text content, causing the parser to include garbage
bytes at the start of extracted text. These non-UTF-8 bytes then
caused DuckDB to reject the Parquet file.

Fix: skip non-printable bytes after the length prefix to find the
actual text start, and trim any trailing incomplete UTF-8 sequences
before returning. Also add a test case matching the real macOS
Sequoia attributedBody format.

roborev-ci bot commented Apr 1, 2026

roborev: Combined Review (2180ad9)

Verdict: No Medium, High, or Critical findings were identified from the completed reviews; the security review was inconclusive due to an environment error.

No Medium, High, or Critical findings to report.

Security review status:

  • Commit range 3c9ae7ccdc08697166ac148ab488f15298fbea51..2180ad9c98d91b1744f91b7805ca06a23d5fba0c was not fully reviewed because local read-only git commands failed with bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted.
  • No security findings are included here because the agent could not reliably inspect the diff.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Press Enter on a message in the timeline to expand it and show the
full body text with word wrapping. Press Enter again or Esc to
collapse. The full body is fetched from message_bodies via
Engine.GetMessage.

roborev-ci bot commented Apr 1, 2026

roborev: Combined Review (dc158c9)

Review incomplete: the PR diff could not be inspected in this environment, so no code-level findings could be verified.

High

  • Diff unavailable for review
    Location: N/A
    The requested commit range could not be inspected because local read-only git commands failed in this environment with bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted, so there was no diff available to review.
    Fix: Re-run the review in an environment where local git read commands work, or provide the combined diff inline.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Replace the table-style one-line-per-message timeline with a chat
layout: each message shows a sender+time header followed by the
full body text with word wrapping.

- ListConversationMessages now joins message_bodies for full text
- DuckDB delegates to SQLite for timeline (Parquet has no bodies)
- Added BodyText field to MessageSummary (populated only for timelines)
- Removed the broken Enter-to-expand mechanism
- Messages display inline with alternating background per message

roborev-ci bot commented Apr 1, 2026

roborev: Combined Review (3ce0795)

Verdict: Review blocked; no code-level findings could be validated because the diff was not accessible in the review environment.

High

  • Review blocked
    • Location: N/A
    • Problem: The requested commit range could not be inspected because local read-only git commands failed in the sandbox with bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted. Any code-specific findings would be speculative without access to the actual diff.
    • Fix: Provide the diff inline or restore local read-only git access for the review environment, then rerun the review.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- r key reverses chronological order (newest/oldest first) with
  breadcrumb indicator
- / key opens search in timeline view
- Timestamps right-justified on the header line (sender left, time right)
- ListConversationMessages now respects TextFilter.SortDirection
- Updated footer keybindings to show new keys

roborev-ci bot commented Apr 1, 2026

roborev: Combined Review (b29bc6a)

Verdict: Review blocked; no validated code findings could be confirmed because the diff was not accessible in one review and the others reported no issues.

High

  • Review coverage is incomplete for commit range 3c9ae7ccdc08697166ac148ab488f15298fbea51..b29bc6a0f7e7dc86bc2115a928cb6cbc17ca2f55: the reviewer could not inspect the actual diff because read-only local git commands failed with a sandbox startup error (bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted). As a result, this PR has not been reliably reviewed for regressions or missing coverage.
    • Fix: re-run review in an environment where git log / git diff work, or attach the diff directly for review.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

When --me flag isn't provided, create a fallback "Me" participant
so outbound messages show "Me" instead of "Unknown" in the timeline.

roborev-ci bot commented Apr 2, 2026

roborev: Combined Review (6c37857)

No medium-or-higher findings were identified from the completed reviews, but the security review could not be completed because the sandbox blocked local diff inspection.

Medium+

None reported.

Review Coverage Gap

  • Security review could not inspect the PR diff due to environment restrictions (bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted), so there are no security-specific file/line findings to report from that pass.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)


roborev-ci bot commented Apr 2, 2026

roborev: Combined Review (b2cef0a)

Verdict: Review could not be completed because the PR diff was not accessible in the review environment.

High

  • Diff inaccessible for review
    Location: N/A
    The required read-only git commands failed in the sandbox (bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted), so the commit range could not be inspected and no static analysis of the actual code changes was possible.
    Fix: Re-run the review in an environment where read-only git diff/git log works, or provide the combined diff inline.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- Fix 2 blank lines at bottom: use pageSize-5 for views with
  header+separator (accounts for all chrome lines correctly)
- Add sort arrows to Conversation/Messages/Last Message headers
  showing which column is actively sorted and in which direction

roborev-ci bot commented Apr 2, 2026

roborev: Combined Review (2d67cbc)

Verdict: Review could not be completed because the PR diff was not accessible in the review environment.

High

  • Review environment
    • Problem: Multiple reviewers could not inspect the code changes because the diff was unavailable in the payload ("Diff too large to include") and local git/shell access was blocked in the sandbox, including failures running git diff/other read-only git commands.
    • Fix: Re-run the review with the full diff available or in an environment where read-only local git commands can execute so the actual commit range can be inspected.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm added 3 commits April 1, 2026 20:15
When / search is used in the timeline view, filter the already-loaded
messages client-side instead of calling the global FTS engine. This
searches body text and sender name/phone within the current
conversation. Press Esc to return to the full message list.
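
A sketch of the client-side filter, assuming case-insensitive substring matching (the commit does not say how matching is done). The `timelineMessage` type and `filterTimeline` function are hypothetical. Filtering from the full, unfiltered list each time keeps repeated searches from narrowing one another.

```go
package main

import (
	"fmt"
	"strings"
)

// timelineMessage holds only the fields the search looks at.
type timelineMessage struct {
	SenderName  string
	SenderPhone string
	Body        string
}

// filterTimeline returns the messages whose body, sender name, or sender
// phone contains the query. An empty query returns the input unchanged.
func filterTimeline(all []timelineMessage, query string) []timelineMessage {
	q := strings.ToLower(strings.TrimSpace(query))
	if q == "" {
		return all
	}
	var out []timelineMessage
	for _, m := range all {
		if strings.Contains(strings.ToLower(m.Body), q) ||
			strings.Contains(strings.ToLower(m.SenderName), q) ||
			strings.Contains(m.SenderPhone, q) {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	all := []timelineMessage{
		{"Alice", "+15550100001", "Lunch tomorrow?"},
		{"Me", "+15550109999", "Sure, noon works"},
		{"Alice", "+15550100001", "See you then"},
	}
	fmt.Println(len(filterTimeline(all, "lunch")), len(filterTimeline(all, "alice")))
}
```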
- Timeline search now filters locally within the conversation
- Fix page size calculation to match email mode (pageSize - 1)
- Add conversation title header + separator to timeline view so
  footer doesn't shift when drilling in from conversation list
- Show search input bar (/) in timeline info line
- All three text views now use pageSize-3 consistently
  (header + separator + info line)

roborev-ci bot commented Apr 2, 2026

roborev: Combined Review (00e085f)

Verdict: No Medium+ findings identified from the provided reviews, but the security review was inconclusive due to missing diff access.

No Medium, High, or Critical findings to report.

Security review note:

  • Review 2 (security) could not complete a reliable assessment because the required local git inspection failed in the sandbox (bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted) and the combined diff was not provided inline.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)


roborev-ci bot commented Apr 2, 2026

roborev: Combined Review (92f2c35)

Verdict: Review blocked; no defensible PR findings can be reported because the diff could not be inspected.

High

  • Review blocked
    • Location: N/A
    • Problem: Multiple review passes were unable to inspect the commit range because required read-only local commands such as git log and git diff failed with bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted. No inline diff was provided, so neither general code review nor security review could be completed.
    • Fix: Provide the PR diff inline or make read-only git inspection commands available, then rerun review.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Empty views (no data, no conversations, no messages) now render
header+separator+fill to match the data view height exactly,
preventing blank lines at the bottom. Also adds \x1b[J
(clear-to-end-of-screen) to text view output to prevent ghost text
when switching between views of different heights.

Fixes both email and text mode empty states.

roborev-ci bot commented Apr 2, 2026

roborev: Combined Review (a8cd880)

Verdict: Review could not be completed because the diff was inaccessible in this environment.

High

  • Location: N/A
    Finding: The requested commit range could not be inspected because read-only local git commands failed in this environment with bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted, so no reliable diff-based review could be performed, including security review.
    Suggested fix: Re-run the review in an environment where git diff/log commands work, or provide the diff inline for static review.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)


roborev-ci bot commented Apr 2, 2026

roborev: Combined Review (41ab8fe)

Verdict: Review blocked; the PR could not be substantively reviewed because the diff was unavailable in the current environment.

High

  • Review could not be completed
    • Location: Commit range 3c9ae7ccdc08697166ac148ab488f15298fbea51..41ab8fe11c4dc64a84dd4fef7b9662540b73e792
    • Problem: Multiple reviewers reported that the actual code diff was not available for inspection. The environment either did not include the diff inline or blocked read-only git diff/git log access, so the changed code could not be analyzed.
    • Fix: Re-run review with the combined diff included inline, or provide an environment where read-only git commands can access that commit range.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

… empty states

- Clamp cursor/scrollOffset when new text data arrives to prevent
  negative-capacity panic in textConversationsView after account change
- Rewrite streamtyped parser to handle 0x92 framing bytes and use
  decoded length for single-byte prefix; add >127 byte test cases
- Keep MaxOpenConns(1) for :memory: databases to avoid separate
  per-connection databases
- Thread SortDirection into DuckDB fallback ORDER BY for
  ListConversationMessages
- Fix empty-state fill to render pageSize-1 data rows (was 1 short)
- Store unfilteredMessages to prevent repeated timeline searches from
  stacking breadcrumbs and narrowing results
- Prefer highest-ID non-local source in resolveImessageSource
- Ensure timeline scroll shows message header + body context lines
- Add known-limitation comment for TextSearch snippet-only results
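
The cursor/scrollOffset clamp from the first item above might look roughly like this. It is a sketch under assumptions: `clampCursor` is a hypothetical helper, and the PR's TUI state is more involved than four integers.

```go
package main

import "fmt"

// clampCursor keeps cursor and scrollOffset inside the bounds of a list
// with rowCount rows shown pageSize at a time. It is meant to run when
// new data arrives, since the new list may be shorter than the old one.
func clampCursor(cursor, scrollOffset, rowCount, pageSize int) (int, int) {
	if rowCount <= 0 {
		return 0, 0
	}
	if pageSize < 1 {
		pageSize = 1
	}
	if cursor < 0 {
		cursor = 0
	}
	if cursor > rowCount-1 {
		cursor = rowCount - 1
	}
	maxOffset := rowCount - pageSize
	if maxOffset < 0 {
		maxOffset = 0
	}
	if scrollOffset < 0 {
		scrollOffset = 0
	}
	if scrollOffset > maxOffset {
		scrollOffset = maxOffset
	}
	// Keep the cursor inside the visible window.
	if cursor < scrollOffset {
		scrollOffset = cursor
	}
	if cursor >= scrollOffset+pageSize {
		scrollOffset = cursor - pageSize + 1
	}
	return cursor, scrollOffset
}

func main() {
	// A stale cursor from a 100-row list applied to a 3-row list.
	fmt.Println(clampCursor(50, 45, 3, 10))
}
```

With both values clamped, a slice expression such as rows[scrollOffset:end] can no longer end up with a start index past the end of the new data.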

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 participants