Discussion: llms.txt Discovery and Placement Strategy
Hey @jph00, I'm new here, but over the last few weeks we've built a tool for llms.txt creation, validation, and use through MCP: Blog here. We have more plans for dev tools for agent accessibility. One of the problems I'm seeing is discoverability.
Current State and Ambiguity
The current llms.txt specification states that files should be "located in the root path /llms.txt of a website (or, optionally, in a subpath)" but provides no concrete guidance on:
- How to discover llms.txt files in subpaths
- Whether multiple llms.txt files can coexist at different levels
- What conventions should govern subpath naming
- How tooling should handle discovery in a standardized way
This ambiguity has real consequences. As noted in the LLMTEXT toolkit launch, many implementations fail validation because they serve llms.txt at non-root locations like https://www.mintlify.com/docs/llms.txt or https://nextjs.org/docs/llms.txt, making them hard to find programmatically.
Proposal: A Three-Tiered Discovery Mechanism
1. Primary Location: /llms.txt at Root (Required)
Every website implementing the standard must serve a primary llms.txt at the root domain:
https://example.com/llms.txt
This serves as the canonical discovery point, following the established pattern of /robots.txt and /sitemap.xml. Tools, agents, and MCP servers should always check this location first.
Rationale: Predictable, universal discovery without requiring HTTP negotiations or HTML parsing.
2. Secondary Locations: Subpath llms.txt Files (Optional)
Websites may serve additional llms.txt files at any subpath for scoped documentation:
https://example.com/docs/api/llms.txt (API-specific documentation)
https://github.com/username/repo/llms.txt (Repository-level documentation)
https://example.com/products/widget/llms.txt (Product-specific information)
Use cases:
- Multi-product companies: Separate llms.txt for each product line
- Code hosting platforms: Per-repository documentation contexts
- Large documentation sites: Section-specific contexts to avoid token bloat
- Multi-tenant platforms: User or organization-specific contexts
3. Discovery Mechanism: Link Headers → Link Tags → Path Walking
To enable programmatic discovery of subpath llms.txt files without requiring standardized paths, implement a cascading discovery mechanism:
Option A: HTTP Link Header (Preferred)
Link: </docs/api/llms.txt>; rel="llms-txt"
When an agent requests any URL, the server can indicate the relevant llms.txt via the Link response header. This works for any content type and requires no HTML parsing.
Example workflow:
- Agent requests https://example.com/docs/api/endpoints.html
- Server responds with Link: </docs/api/llms.txt>; rel="llms-txt"
- Agent discovers the context-appropriate llms.txt
Option B: HTML Link Tag (Fallback)
For HTML responses, use a <link> tag in the <head>:
<link rel="llms-txt" href="/docs/api/llms.txt" />
This requires HTML parsing but remains straightforward for web-based content.
Option C: Path Segment Walking (Last Resort)
If neither Link header nor HTML tag is present, agents may attempt discovery by walking up path segments:
For URL https://example.com/docs/api/v2/endpoints.html, try:
/docs/api/v2/llms.txt
/docs/api/llms.txt
/docs/llms.txt
/llms.txt (root, always exists per requirement)
Caveats: This creates multiple HTTP requests and may produce false positives, so it should be used sparingly.
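Generating the candidate list from the steps above can be sketched as follows (a simple heuristic, assuming a trailing segment containing a dot is a document rather than a directory):

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_llms_txt_urls(page_url: str) -> list[str]:
    """Candidate llms.txt locations, most specific first, ending at the root."""
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    # Heuristic: drop the final segment if it looks like a document, not a directory.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    candidates = []
    while True:
        path = "/" + "/".join(segments + ["llms.txt"])
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
        if not segments:
            break
        segments = segments[:-1]
    return candidates
```

For the example URL above, this yields the four candidates listed, in that order, with the root `/llms.txt` guaranteed as the final fallback.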
Benefits of This Approach
1. Backward Compatibility
Existing implementations with root-level llms.txt continue working without changes. The requirement for root-level placement matches current best practices.
2. Scalability
Large sites can scope contexts appropriately. Instead of a 36k token llms.txt like Cloudflare's current implementation, they could have:
/llms.txt (high-level overview)
/workers/llms.txt (Workers-specific docs)
/r2/llms.txt (R2-specific docs)
/pages/llms.txt (Pages-specific docs)
3. Discoverability
The Link header mechanism enables automatic, efficient discovery without guessing paths or parsing HTML for every page.
4. Flexibility for Platform Providers
GitHub, GitLab, or documentation hosting platforms can serve repository or project-specific llms.txt files while maintaining their own root-level overview.
5. Token Efficiency
Agents can load precisely the context they need rather than ingesting massive omnibus llms.txt files. The llms.txt MCP tool could use Link headers to automatically discover the most relevant context.
Implementation Considerations
For Website Operators
Minimum viable implementation:
GET /llms.txt HTTP/1.1
Host: example.com
HTTP/1.1 200 OK
Content-Type: text/markdown
Enhanced implementation with subpaths:
GET /docs/api/endpoint-guide.html HTTP/1.1
Host: example.com
HTTP/1.1 200 OK
Content-Type: text/html
Link: </docs/api/llms.txt>; rel="llms-txt"
For Tool Builders
The LLMTEXT Check tool and llms.txt MCP should:
- Always validate that /llms.txt exists at root
- Parse Link headers from responses when discovering context
- Support HTML link tag parsing as fallback
- Optionally implement path segment walking with rate limiting
- Cache discovery results to minimize redundant requests
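Putting the cascade together, a tool-builder's discovery routine might look like the following sketch. The `fetch` callable is injected so the logic can be exercised without network access; in practice it would wrap an HTTP client, and the regex-based parsing is a deliberate simplification:

```python
import re
from typing import Callable, Optional
from urllib.parse import urlsplit

# fetch(url) -> (status_code, headers, body); injected for testability.
Fetch = Callable[[str], tuple[int, dict[str, str], str]]


def discover_llms_txt(page_url: str, fetch: Fetch) -> Optional[str]:
    """Cascade: Link header -> HTML <link> tag -> path segment walking."""
    status, headers, body = fetch(page_url)
    # 1. HTTP Link header (preferred).
    m = re.search(r'<([^>]+)>\s*;\s*rel="llms-txt"', headers.get("Link", ""))
    if m:
        return m.group(1)
    # 2. HTML <link> tag (fallback; a real client should use an HTML parser).
    m = re.search(r'<link[^>]*rel="llms-txt"[^>]*href="([^"]+)"', body)
    if m:
        return m.group(1)
    # 3. Path walking (last resort): drop the document segment, then walk up.
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s][:-1]
    while True:
        path = "/" + "/".join(segments + ["llms.txt"])
        candidate = f"{parts.scheme}://{parts.netloc}{path}"
        status, _, _ = fetch(candidate)
        if status == 200:
            return candidate
        if not segments:
            return None
        segments = segments[:-1]
```

A caching layer in front of `fetch`, plus rate limiting on step 3, would cover the remaining bullet points.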
For the Specification
The llms.txt spec should be updated to include:
- Explicit requirement for root-level /llms.txt
- Permission for subpath llms.txt files
- Specification of the rel="llms-txt" link relation
- Recommendation for Link headers over HTML tags
- Example implementations for common web servers (nginx, Apache, CDNs)
Potential Challenges
1. Link Header Adoption
Many CMS platforms and static site generators don't easily support custom Link headers. The spec should provide clear implementation examples.
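For servers that do allow custom headers, the change is small. A minimal nginx sketch (the location and target path are illustrative):

```nginx
# Advertise the scoped llms.txt for everything under /docs/api/
location /docs/api/ {
    add_header Link '</docs/api/llms.txt>; rel="llms-txt"';
}
```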
2. Ambiguity Resolution
If multiple llms.txt files are discovered (via Link header + path walking), which takes precedence? Proposal: Link header > path walking, most specific path wins.
Conclusion
By requiring /llms.txt at the root while enabling optional subpath placement with standardized discovery via Link headers, the llms.txt standard can maintain simplicity for basic implementations while supporting the scalability needs of complex, multi-product websites. This approach:
- Preserves the simplicity that made /robots.txt successful
- Enables the flexibility needed for modern web architectures
- Provides clear, programmatic discovery mechanisms
- Supports the token-efficient context management that makes llms.txt valuable
The LLMTEXT toolkit launch demonstrates both the promise of llms.txt adoption and the practical challenges of ambiguous specifications. By formalizing discovery mechanisms now, we can ensure the standard scales effectively as AI agents become the primary consumers of web content.