Add Obsidian vault import recipe #28
snapsynapse wants to merge 2 commits into NateBJones-Projects:main
Parses any Obsidian vault, chunks notes into atomic thoughts, generates embeddings via OpenRouter, and inserts into Supabase. Tested on 500+ note LifeHQ-pattern vault. Closes NateBJones-Projects#13
@claude
@claude let's do this one more time... I believe in you! give this a proper review!
justfinethanku left a comment
Admin Review
CI didn't run (Actions blocker), so I did a manual pass.
Security: Clean — no credentials, no dangerous operations, all external calls go to OpenRouter API only.
Code quality: Solid. Proper retry logic with exponential backoff, rate limiting, sync log for idempotent re-runs, graceful error handling for encoding issues. The hybrid chunking approach (headings + optional LLM fallback) is well thought out.
Documentation: Strong README with vault compatibility table, credential tracker, filtering docs, and 5 troubleshooting entries.
Verdict: Approved. Nice first contribution!
Welcome to Open Brain — come say hi in Discord: https://discord.gg/Cgh9WJEkeG
justfinethanku left a comment
PR Review: Obsidian Vault Import
Nice work @snapsynapse — the core code is solid, the parsing logic is thorough, and the vault compatibility table is a great addition. A few things need addressing before merge.
Must Fix (blocking)
1. PR/branch/commit naming conventions
These will fail the automated review (ob1-review.yml):
- PR title should be [recipes] Obsidian vault import (needs the [recipes] prefix)
- Branch should be contrib/snapsynapse/obsidian-vault-import (not recipe/obsidian-vault-import)
- Commit messages should use the [recipes] prefix
See CONTRIBUTING.md — PR Format for the expected pattern.
2. Dedup gap — not using content-fingerprint-dedup primitive
The sync log (obsidian-sync-log.json) prevents duplicates on same-machine re-runs, but it's local-only. If the log is deleted, or a second user imports an overlapping vault, you get full duplicates.
The repo has a content-fingerprint-dedup primitive that solves this at the database level. The other import recipes either use it or document it as a companion.
Options (in order of preference):
- Compute a content_fingerprint (SHA-256 of normalized content) and include it in the insert payload; the DB unique index handles the rest
- Call the upsert_thought RPC instead of a raw INSERT
- At minimum, document the primitive as a recommended companion in the README's Prerequisites section
The sync log is still useful as a performance optimization (skip embedding API calls for unchanged notes), but shouldn't be the only dedup mechanism.
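The first option could be sketched roughly as follows. This is only an illustration: the normalization rule (Unicode NFKC, collapsed whitespace, lowercase) and the helper name are assumptions, and the fingerprint must match whatever normalization the content-fingerprint-dedup primitive actually applies, otherwise the unique index will not catch duplicates.

```python
import hashlib
import re
import unicodedata


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of normalized content.

    The normalization here is an assumption for illustration; align it
    with the dedup primitive's own rule before relying on it.
    """
    normalized = unicodedata.normalize("NFKC", text)
    # Collapse all whitespace runs so reflowed notes hash identically.
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical insert payload; only content_fingerprint is named in the review.
thought = "Atomic   thought text\n"
payload = {
    "content": thought,
    "content_fingerprint": content_fingerprint(thought),
}
```

With the fingerprint in the payload, the sync log can stay as a local fast path while the database remains the source of truth for duplicates.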
3. --no-embed flag undocumented in README
import-obsidian.py line 619 defines --no-embed but it's missing from the Options table in the README. Add it — users who browse the README won't know it exists.
Should Fix (strong recommendation)
4. .env parser doesn't handle quoted values
The hand-rolled parser (lines 637–643) will include literal quote characters if a user writes:
SUPABASE_URL="https://foo.supabase.co"
Many users copy-paste from Supabase with quotes. Quick fix — strip surrounding quotes:
value = value.strip().strip('"').strip("'")
Or add python-dotenv to requirements.txt and use load_dotenv().
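For reference, a slightly stricter variant of the quick fix might look like the sketch below. The function name and the comment/blank-line handling are assumptions, not the script's actual parser. It removes only one matching pair of surrounding quotes, whereas the chained strip calls would also eat quote characters that legitimately belong to the value.

```python
from typing import Optional, Tuple


def parse_env_line(line: str) -> Optional[Tuple[str, str]]:
    """Parse one KEY=VALUE line; return None for blanks and comments.

    Hypothetical helper for illustration only.
    """
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    key, _, value = line.partition("=")
    key = key.strip()
    value = value.strip()
    # Drop one matching pair of surrounding quotes, if present.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        value = value[1:-1]
    return key, value
```

It does not handle escapes, inline comments, or export prefixes; python-dotenv covers those cases if they matter.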
5. Dead code: HEADING_RE (line 297)
HEADING_RE is defined but never referenced — the actual heading splitting at line 389 uses re.split with an inline pattern. Remove it.
6. Add a cost estimate section to the README
The ChatGPT import recipe includes a cost table by export size. Since this recipe hits OpenRouter for both embeddings and optional LLM chunking, users would benefit from knowing approximate costs. Even a rough table like:
| Vault size | Embeddings | With LLM chunking |
|---|---|---|
| 100 notes | ~$0.02 | ~$0.15 |
| 500 notes | ~$0.10 | ~$0.75 |
| 1000+ notes | ~$0.20 | ~$1.50 |
Nice to Have
7. Rate limiting between embedding calls
Currently the script only pauses every 50 inserts (time.sleep(1) at line 871), but there's no delay between individual embedding API calls. For a 1000-thought import, that's 1000 rapid-fire calls to OpenRouter. A small delay (0.1–0.2s) between calls or a note in the README about rate limits would help.
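One way to add that delay is a small throttle that enforces a minimum interval between calls. This is a sketch under assumptions: the class name and the 0.15s interval are illustrative, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call made yet

    def wait(self) -> None:
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            # Too soon since the previous call: sleep off the remainder.
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.min_interval


# Usage sketch: call throttle.wait() immediately before each embedding
# request, e.g. inside the loop that embeds each thought.
throttle = Throttle(0.15)
```

Because the throttle measures elapsed time, it only sleeps when calls arrive faster than the interval, so slow API responses do not add extra delay.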
8. Shields.io step badges
CONTRIBUTING.md recommends these for recipes. Not required, but would match the style of the extension guides and make the README more scannable.
What's Good
- Obsidian parsing is excellent — frontmatter, wikilinks (including aliased), inline tags with false-positive stripping for code blocks/HTML. Well-engineered.
- Hybrid chunking (whole-note → heading-split → LLM distill) is a smart approach.
- Retry logic with exponential backoff on all API calls.
- Vault detection (.obsidian/ check), template filtering, and the filtering transparency (--dry-run --verbose) are all thoughtful touches.
- No security issues: REST API with JSON payloads (no SQL injection), resolved paths (no traversal), keys in env vars only.
- README has all required sections, exceeds the troubleshooting minimum, and the vault compatibility table adds genuine value.
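For readers unfamiliar with the retry pattern praised above, a generic sketch follows. It is not the script's actual implementation: the helper name, attempt count, and base delay are assumptions, and real code would normally catch only transient errors (timeouts, HTTP 429/5xx) rather than every exception.

```python
import time


def with_backoff(call, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Run call(), retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```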
TL;DR: Fix the naming conventions (item 1), add content fingerprinting or document the dedup primitive (item 2), and document --no-embed (item 3). One revision should do it; the underlying work is strong.
Thanks for the thorough review @justfinethanku! All feedback addressed. Closing this PR in favor of a new one from the correctly-named branch (contrib/snapsynapse/obsidian-vault-import) as recommended (sorry). The new PR also includes improvements discovered during live testing on an 800+ note vault:
New PR incoming that references this one for review context.
…n vault import
Addresses all feedback from PR NateBJones-Projects#28: branch/commit naming conventions, content fingerprint dedup, --no-embed docs, .env quote handling, dead code removal, cost estimates, and rate limiting. Adds preflight connection check, secret detection scanner, early abort on consecutive failures, per-note sync log timestamps, and line-buffered output. Tested on 800+ note vault (3,743 thoughts, zero data loss).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parses any Obsidian vault, chunks notes into atomic thoughts, generates embeddings via OpenRouter, and inserts into Supabase. Tested on 500+ note vault.
Closes #13
Summary
Adds a complete recipe for importing Obsidian vaults into Open Brain as searchable, embedded thoughts. Closes #13.
Vault compatibility
Tested/documented for BASB/PARA, LYT/Ideaverse, LifeHQ, FLAP, Zettelkasten, and MOC-centric patterns. No special configuration needed.
Files
- import-obsidian.py: standalone script (~480 lines)
- README.md: step-by-step guide with credential tracker, options, filtering docs, troubleshooting
- requirements.txt: python-frontmatter, requests
- .env.example, .gitignore, metadata.json