feat(docs): add LLMs.txt generation for AI-friendly documentation #1870
Conversation
… generation script
…ies and enhancing document processing
…nd improving description clarity
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           main    #1870   +/- ##
=======================================
  Coverage   93.53%   93.53%
=======================================
  Files          45       45
  Lines         727      727
  Branches      185      185
=======================================
  Hits          680      680
  Misses         41       41
  Partials        6        6
```
Size Change: 0 B. Total Size: 88.6 kB
…es and extract first sentence and paragraph
ecad064 to 3ea90b2
manudeli left a comment
We have Node v24, so we don't need tsx.
| "type": "module", | ||
| "scripts": { | ||
| "build": "next build && pagefind --site .next/server/app --output-path public/_pagefind", | ||
| "build": "next build && tsx scripts/llms-txt/generate-llms-txt.ts && pagefind --site .next/server/app --output-path public/_pagefind", |
There was a problem hiding this comment.
| "build": "next build && tsx scripts/llms-txt/generate-llms-txt.ts && pagefind --site .next/server/app --output-path public/_pagefind", | |
| "build": "next build && node scripts/llms-txt/generate-llms-txt.ts && pagefind --site .next/server/app --output-path public/_pagefind", |
| "tailwindcss": "catalog:", | ||
| "tsx": "^4.21.0" |
There was a problem hiding this comment.
| "tailwindcss": "catalog:", | |
| "tsx": "^4.21.0" | |
| "tailwindcss": "catalog:" |
Pull request overview
This PR adds LLMs.txt support to generate AI-friendly documentation from Nextra-based docs. The implementation parses MDX files, transforms Nextra components to markdown, and generates three types of output files: an overview file with links, a complete documentation file, and individual markdown files.
Key changes:
- Script suite to parse Nextra _meta.tsx files, transform MDX/Nextra components, and generate LLMs.txt files
- Middleware bypass for .txt and .md files to skip i18n processing
- Integration into the build process
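For context, files following the llms.txt convention are plain markdown: an H1 title, a blockquote summary, and H2 sections containing link lists. A sketch of what the generated /llms.txt might look like (section names, page titles, and URLs here are illustrative, not taken from this PR):

```markdown
# Suspensive

> Illustrative one-line summary of the project, extracted from the docs.

## Docs

- [Installation](https://suspensive.org/docs/installation.md): first sentence of the page as the description
- [Suspense](https://suspensive.org/docs/suspense.md): first sentence of the page as the description
```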
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| docs/suspensive.org/src/middleware.ts | Adds bypass logic for .md and .txt files to skip i18n middleware |
| docs/suspensive.org/scripts/llms-txt/types.ts | Defines TypeScript interfaces for metadata entries and document information |
| docs/suspensive.org/scripts/llms-txt/nextra-transform.ts | Transforms Nextra/MDX components (Tabs, Callout, Sandpack) to standard markdown |
| docs/suspensive.org/scripts/llms-txt/meta-parser.ts | Parses Nextra _meta.tsx files to extract sidebar ordering and titles |
| docs/suspensive.org/scripts/llms-txt/document-processor.ts | Processes MDX files to extract titles, descriptions, and clean content |
| docs/suspensive.org/scripts/llms-txt/config.ts | Configuration constants for directory paths |
| docs/suspensive.org/scripts/llms-txt/builders.ts | Builds the LLMs.txt output formats from processed documents |
| docs/suspensive.org/scripts/llms-txt/generate-llms-txt.ts | Main script that orchestrates the generation process |
| docs/suspensive.org/package.json | Integrates LLMs.txt generation into build script and adds tsx dependency |
| docs/suspensive.org/.gitignore | Ignores generated LLMs.txt files |
| pnpm-lock.yaml | Lock file updates for tsx dependency and libc fields for Linux packages |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
| "type": "module", | ||
| "scripts": { | ||
| "build": "next build && pagefind --site .next/server/app --output-path public/_pagefind", | ||
| "build": "next build && tsx scripts/llms-txt/generate-llms-txt.ts && pagefind --site .next/server/app --output-path public/_pagefind", |
There was a problem hiding this comment.
The build script chains commands with && which means if the LLMs.txt generation fails, the pagefind step won't run. However, the build output (next build) will already exist. Consider whether LLMs.txt generation should be a critical step that blocks the entire build, or if it should run independently with proper error handling.
| "build": "next build && tsx scripts/llms-txt/generate-llms-txt.ts && pagefind --site .next/server/app --output-path public/_pagefind", | |
| "build": "next build && (tsx scripts/llms-txt/generate-llms-txt.ts || echo \"Warning: failed to generate LLMs.txt\" >&2) && pagefind --site .next/server/app --output-path public/_pagefind", |
```ts
return {
  category: categoryMatch?.[1] ?? ROOT_CATEGORY,
```
The regex pattern /docs\/([^/]+)\// is fragile as it assumes a specific directory structure. If the path doesn't match this pattern (e.g., for root-level docs or if the path format changes), the category will always default to ROOT_CATEGORY without warning. Consider adding validation or logging when the pattern doesn't match to help debug potential issues.
Suggested change:

```diff
- return {
-   category: categoryMatch?.[1] ?? ROOT_CATEGORY,
+ const category = categoryMatch?.[1] ?? ROOT_CATEGORY
+ if (!categoryMatch) {
+   console.warn(
+     `[document-processor] Unable to extract category from path "${filePath}" using pattern /docs\\/([^/]+)\\//; defaulting to "${category}".`
+   )
+ }
+ return {
+   category,
```
```ts
function findSentenceEnd(text: string): number {
  let bracketDepth = 0
  let inCode = false

  for (let i = 0; i < text.length; i++) {
    const char = text[i]

    if (char === '`') {
      inCode = !inCode
      continue
    }

    if (inCode) continue

    if (char === '[') bracketDepth++
    else if (char === ']') bracketDepth = Math.max(0, bracketDepth - 1)
    else if (bracketDepth === 0 && SENTENCE_ENDINGS.has(char)) return i
  }

  return -1
}
```
The findSentenceEnd function doesn't handle consecutive backticks correctly. If the text contains an even number of backticks (like "test `` test."), the inCode flag will be false at the period, potentially causing incorrect sentence detection. Consider tracking backtick pairs more carefully or using a more robust parser.
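One way to track backtick pairs more carefully, sketched below: treat a run of N consecutive backticks as a code-span delimiter that is only closed by a run of the same length, as in markdown code spans. The `SENTENCE_ENDINGS` set and the function name are assumptions mirroring the PR's code, not copied from it:

```typescript
// Sketch: like findSentenceEnd, but a run of N backticks opens a code span
// that only a run of exactly N backticks closes (markdown code-span rules).
const SENTENCE_ENDINGS = new Set(['.', '!', '?'])

function findSentenceEndRobust(text: string): number {
  let bracketDepth = 0
  let openRun = 0 // length of the backtick run that opened the current code span

  for (let i = 0; i < text.length; i++) {
    if (text[i] === '`') {
      // Consume the whole backtick run at once instead of toggling per char.
      let runLen = 1
      while (text[i + runLen] === '`') runLen++
      if (openRun === 0) openRun = runLen // opening delimiter
      else if (runLen === openRun) openRun = 0 // matching closing delimiter
      i += runLen - 1
      continue
    }
    if (openRun > 0) continue // inside a code span

    const char = text[i]
    if (char === '[') bracketDepth++
    else if (char === ']') bracketDepth = Math.max(0, bracketDepth - 1)
    else if (bracketDepth === 0 && SENTENCE_ENDINGS.has(char)) return i
  }
  return -1
}
```

A per-character toggle would report the period inside `` test `` x. `` end. `` as a sentence end; consuming runs avoids that.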
```ts
const rawContent = fs.readFileSync(filePath, 'utf-8')
const { category, slug } = parseFilePath(filePath)
const relativePath = filePath.replace(DOCS_DIR, '/docs').replace(/\.mdx$/, '')

return {
  title: extractTitle(rawContent, slug),
  description: extractFirstSentence(rawContent),
  content: cleanContent(rawContent),
  path: relativePath,
  category,
  slug,
```
Missing error handling when reading files or parsing content. If a file is corrupted, unreadable, or contains malformed content, the script will crash without a helpful error message. Consider wrapping file operations and parsing logic in try-catch blocks with descriptive error messages that include the file path being processed.
Suggested change:

```diff
- const rawContent = fs.readFileSync(filePath, 'utf-8')
- const { category, slug } = parseFilePath(filePath)
- const relativePath = filePath.replace(DOCS_DIR, '/docs').replace(/\.mdx$/, '')
- return {
-   title: extractTitle(rawContent, slug),
-   description: extractFirstSentence(rawContent),
-   content: cleanContent(rawContent),
-   path: relativePath,
-   category,
-   slug,
+ try {
+   const rawContent = fs.readFileSync(filePath, 'utf-8')
+   const { category, slug } = parseFilePath(filePath)
+   const relativePath = filePath.replace(DOCS_DIR, '/docs').replace(/\.mdx$/, '')
+   return {
+     title: extractTitle(rawContent, slug),
+     description: extractFirstSentence(rawContent),
+     content: cleanContent(rawContent),
+     path: relativePath,
+     category,
+     slug,
+   }
+ } catch (error) {
+   const message = error instanceof Error ? error.message : String(error)
+   throw new Error(`Failed to process document "${filePath}": ${message}`)
```
```ts
    preserved.push(match)
    return `__PRESERVED_${preserved.length - 1}__`
  })
  .replace(/`[^`]+`/g, (match) => {
```
The regex pattern for inline code may incorrectly match escaped backticks. The current pattern `/`[^`]+`/g` will fail to preserve inline code that contains escaped backticks, and can match unintended sequences when backticks appear in other contexts. Consider a more robust pattern that handles escaped backticks, such as `/(?<!\\)`(?:[^`\\]|\\.)+`(?!\\)/g`.
Suggested change:

```diff
- .replace(/`[^`]+`/g, (match) => {
+ .replace(/(?<!\\)`(?:[^`\\]|\\.)+`(?!\\)/g, (match) => {
```
```diff
@@ -0,0 +1,3 @@
+ export const DOCS_DIR = 'src/content/en/docs'
```
The hardcoded path 'src/content/en/docs' may break if the directory structure changes or if the script is run from a different working directory. Consider making this configurable through environment variables or command-line arguments, or use path.join with __dirname to make it relative to the script location.
Suggested change:

```diff
- export const DOCS_DIR = 'src/content/en/docs'
+ import * as path from 'path'
+
+ export const DOCS_DIR =
+   process.env.DOCS_DIR ?? path.resolve(__dirname, '../../../src/content/en/docs')
```
```ts
const entryPattern =
  /['"]?([^'":,\s]+)['"]?\s*:\s*\{[^}]*?title:\s*['"]([^'"]+)['"][^}]*?\}/g
```
The regex pattern /['"]?([^'":,\s]+)['"]?\s*:\s*\{[^}]*?title:\s*['"]([^'"]+)['"][^}]*?\}/g is complex and may fail to match entries if the _meta.tsx format varies slightly (e.g., with nested objects, multi-line formatting, or titles containing quotes). Consider using a proper TypeScript/JavaScript parser (like @babel/parser) to reliably extract the metadata structure instead of relying on regex.
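A sketch of the parser approach using the TypeScript compiler API (assumed available since the repo already uses TypeScript; `@babel/parser` would work similarly). The `{ key: { title: '...' } }` entry shape is an assumption about `_meta.tsx`:

```typescript
// Sketch: extract `key: { title: '...' }` entries from _meta.tsx source
// with a real parser instead of a regex.
import ts from 'typescript'

function parseMetaEntries(source: string): Array<{ key: string; title: string }> {
  const sf = ts.createSourceFile('_meta.tsx', source, ts.ScriptTarget.Latest, true, ts.ScriptKind.TSX)
  const entries: Array<{ key: string; title: string }> = []

  const visit = (node: ts.Node): void => {
    // Look for `key: { ... title: '...' ... }` property assignments anywhere.
    if (ts.isPropertyAssignment(node) && ts.isObjectLiteralExpression(node.initializer)) {
      const name = node.name
      const key = ts.isIdentifier(name) || ts.isStringLiteral(name) ? name.text : undefined
      const titleProp = node.initializer.properties.find(
        (p): p is ts.PropertyAssignment =>
          ts.isPropertyAssignment(p) && ts.isIdentifier(p.name) && p.name.text === 'title'
      )
      if (key !== undefined && titleProp && ts.isStringLiteral(titleProp.initializer)) {
        entries.push({ key, title: titleProp.initializer.text })
      }
    }
    ts.forEachChild(node, visit)
  }
  visit(sf)
  return entries
}
```

Unlike the regex, this handles multi-line formatting, nested objects, and titles containing quote characters.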
```ts
trimmed.startsWith('>') ||
trimmed.startsWith('- ') ||
trimmed.startsWith('* ') ||
/^[\w]+=/.test(trimmed) || // JSX attributes like title="..."
```
The regex pattern /^[\w]+=/.test(trimmed) for detecting JSX attributes is too broad and may incorrectly skip valid content lines. For example, a line like "key=value" in regular text would be skipped. Consider making this more specific to JSX attribute patterns, such as checking for common JSX attribute patterns like className=, style=, or using a more precise pattern that accounts for JSX context.
Suggested change:

```diff
- /^[\w]+=/.test(trimmed) || // JSX attributes like title="..."
+ /^[A-Za-z_][\w:-]*\s*=\s*["'{]/.test(trimmed) || // JSX-like attributes, e.g. title="...", className={...}
```
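The trade-off between the two patterns, as a quick check (the sample lines are illustrative):

```typescript
// Broad pattern: any `word=` prefix is treated as a JSX attribute line.
const broad = /^[\w]+=/
// Stricter pattern: require a quote or brace after `=`, as JSX attributes have.
const precise = /^[A-Za-z_][\w:-]*\s*=\s*["'{]/

const prose = 'key=value in regular text'
const jsxAttr = 'title="Getting Started"'

const broadSkipsProse = broad.test(prose) // valid content would be dropped
const preciseKeepsProse = precise.test(prose) // the prose line survives
const preciseSkipsJsx = precise.test(jsxAttr) // the attribute is still filtered
```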
```ts
result = result.replace(
  /<Tabs\s+items=\{(\[[^\]]+\])\}>([\s\S]*?)<\/Tabs>/g,
  (_match, itemsStr: string, tabsContent: string) => {
    const items = JSON.parse(itemsStr.replace(/'/g, '"')) as string[]
```
The JSON.parse call on line 22 lacks error handling and assumes the items array only contains single-quoted strings. The replacement logic itemsStr.replace(/'/g, '"') could produce invalid JSON if the actual items use double quotes, contain apostrophes in the strings, or have mixed quote styles. Wrap this in a try-catch block with a fallback value, and consider a more robust parsing approach that handles various quote styles.
Suggested change:

```diff
- const items = JSON.parse(itemsStr.replace(/'/g, '"')) as string[]
+ let items: string[] = []
+ try {
+   // First, try parsing as-is in case itemsStr is already valid JSON
+   items = JSON.parse(itemsStr) as string[]
+ } catch {
+   try {
+     // Fallback: attempt a simple single-quote to double-quote conversion
+     items = JSON.parse(itemsStr.replace(/'/g, '"')) as string[]
+   } catch {
+     // On failure, keep items as an empty array and use default tab labels
+     items = []
+   }
+ }
```
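An alternative that avoids quote-style rewriting entirely, sketched here: pull each string literal out of the array source with a quote-aware regex and unescape it, instead of converting the whole array to JSON (the function name and behavior are illustrative, not from this PR):

```typescript
// Sketch: extract string literals from an array-literal source like
// ['npm', "pnpm", 'it\'s'] without assuming a single quote style.
function parseItems(itemsStr: string): string[] {
  // (['"]) captures the opening quote; the body is any escaped char or any
  // char that is not the opening quote; \1 requires the matching close quote.
  const literal = /(['"])((?:\\.|(?!\1).)*)\1/g
  const items: string[] = []
  for (const m of itemsStr.matchAll(literal)) {
    items.push(m[2].replace(/\\(.)/g, '$1')) // unescape \' \" \\ etc.
  }
  return items
}
```

This tolerates mixed quote styles and apostrophes inside labels, the two cases where the single-quote-to-double-quote rewrite produces invalid JSON.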
| "esModuleInterop": true, | ||
| "isolatedModules": true, | ||
| "moduleResolution": "bundler", | ||
| "allowImportingTsExtensions": true, |
There was a problem hiding this comment.
Added .ts extensions to all relative imports to enable native Node.js TypeScript execution without tsx. This requires allowImportingTsExtensions: true in tsconfig.json (which works because noEmit is already set).
Thanks! ❤️

cool
Overview
Closes #1858
Summary
Add LLMs.txt support for AI-friendly documentation. This generates machine-readable files that help LLMs understand Suspensive's documentation structure.
Generated Files
- `/llms.txt` - Overview with links to all docs
- `/llms-full.txt` - Complete documentation in a single file
- `/docs/**/*.md` - Individual markdown files

Implementation

- Parses `_meta.tsx` to preserve sidebar ordering and section titles
- Middleware bypasses i18n processing for `.txt` and `.md` files
- Runs as part of the build (`next build && tsx scripts/llms-txt/generate-llms-txt.ts`)

PR Checklist