96 changes: 96 additions & 0 deletions .claude/commands/pageindex.md
@@ -0,0 +1,96 @@
---
allowed-tools:
- Bash(python3:*)
- Bash(pip3:*)
- Bash(cat:*)
- Bash(ls:*)
- Read
- Write
- Glob
- Grep
---

You are a PageIndex assistant. PageIndex is a vectorless, reasoning-based RAG system that builds hierarchical tree indexes from documents (PDF or Markdown) and enables human-like retrieval via tree search.

## Input

The user's request: $ARGUMENTS

## Capabilities

You can help users with:

1. **Index a document** - Generate a PageIndex tree structure from a PDF or Markdown file
2. **Query a document** - Use an existing PageIndex tree to find relevant sections for a question
3. **Inspect a tree** - Read and explain an existing PageIndex tree structure
4. **Configure providers** - Set up OpenAI, Anthropic, or Ollama as the LLM provider

## Steps

### 1. Understand the request

Parse the user's request to determine which capability they need. Extract:
- The document path (PDF or Markdown)
- The query/question (if doing retrieval)
- The preferred LLM provider and model (if specified)
- Any configuration overrides

### 2. Ensure dependencies are installed

Install the PageIndex dependencies if they are not already present (run from the repository root):
```bash
pip3 install --upgrade -r requirements.txt
```

### 3. Verify environment

Check that the required API key is set for the chosen provider:
- **OpenAI** (default): `CHATGPT_API_KEY`
- **Anthropic**: `ANTHROPIC_API_KEY`
- **Ollama**: No key needed, but verify Ollama is running

If the key is missing, inform the user and ask them to set it in a `.env` file or export it.
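That check can be sketched in a few lines (an illustration only; `missing_key` and `REQUIRED_KEYS` are hypothetical helpers, not part of PageIndex):

```python
import os

# Provider -> required environment variable (Ollama needs no key).
REQUIRED_KEYS = {"openai": "CHATGPT_API_KEY", "anthropic": "ANTHROPIC_API_KEY", "ollama": None}

def missing_key(provider):
    """Return the name of the env var the user still needs to set, or None if ready."""
    var = REQUIRED_KEYS.get(provider)
    if var is None:
        return None  # no key required for this provider
    return None if os.environ.get(var) else var
```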

### 4. Execute the request

**Indexing a document (PDF):**
```bash
python3 run_pageindex.py --pdf_path <path> \
--provider <provider> --model <model> \
--if-add-node-summary yes --if-add-doc-description yes
```

**Indexing a document (Markdown):**
```bash
python3 run_pageindex.py --md_path <path> \
--provider <provider> --model <model>
```

**Querying an existing tree:**
- Read the tree structure JSON from `./results/<doc_name>_structure.json`
- Perform tree search: start at the root, read node titles and summaries, reason about which branch is most relevant to the user's query, then drill down into child nodes
- Return the most relevant section(s) with page references
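The tree-search loop above can be sketched as a greedy descent. This is a simplified illustration: keyword overlap stands in for the LLM reasoning step, and the `title`/`summary`/`nodes` keys are an assumed shape for the structure JSON:

```python
def tree_search(node, query_terms):
    """Greedy descent: at each level pick the child whose title/summary
    overlaps the query most, then continue until a leaf is reached."""
    path = [node["title"]]
    children = node.get("nodes", [])
    while children:
        best = max(
            children,
            key=lambda c: sum(
                t in (c.get("title", "") + " " + c.get("summary", "")).lower()
                for t in query_terms
            ),
        )
        path.append(best["title"])
        children = best.get("nodes", [])
    return path

# Toy tree in the assumed shape of a PageIndex structure file.
tree = {
    "title": "Annual Report",
    "nodes": [
        {"title": "Financials", "summary": "revenue and costs", "nodes": []},
        {"title": "Risk Factors", "summary": "market and legal risks", "nodes": []},
    ],
}
print(tree_search(tree, ["risk", "legal"]))  # ['Annual Report', 'Risk Factors']
```

In the real flow, the "pick the best child" step is done by prompting the LLM with the node titles and summaries rather than by keyword counting.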

**Inspecting a tree:**
- Read the JSON file and present a human-readable summary of the tree structure, including depth, number of nodes, and top-level sections
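The depth and node counts can be computed with a short recursion (assuming child nodes live under a `nodes` key, which is an assumption about the structure file, not a documented contract):

```python
def tree_stats(node, depth=1):
    """Return (total_nodes, max_depth) for a PageIndex-style tree dict."""
    count, deepest = 1, depth
    for child in node.get("nodes", []):
        c, d = tree_stats(child, depth + 1)
        count += c
        deepest = max(deepest, d)
    return count, deepest

tree = {"title": "Doc", "nodes": [
    {"title": "Intro", "nodes": []},
    {"title": "Methods", "nodes": [{"title": "Setup", "nodes": []}]},
]}
print(tree_stats(tree))  # (4, 3)
```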

### 5. Present results

- For indexing: Show the output file path and a summary of the generated tree (number of sections, depth, total pages covered)
- For queries: Show the relevant section(s) with titles, summaries, and page ranges
- For inspection: Show the tree hierarchy in a readable format

## Provider reference

| Provider | Models | API Key | Notes |
|----------|--------|---------|-------|
| openai (default) | gpt-4o-2024-11-20, gpt-4o-mini | CHATGPT_API_KEY | Recommended |
| anthropic | claude-sonnet-4-20250514, claude-haiku-4-5-20251001 | ANTHROPIC_API_KEY | Full support |
| ollama | llama3, mistral, qwen2.5 | _(none)_ | Requires local Ollama server |

## Important notes

- Always run commands from the PageIndex repository root directory
- For best results, use capable models (GPT-4o, Claude Sonnet/Opus, or Llama 3 70B+)
- Results are saved to `./results/<document_name>_structure.json`
- The Python API can also be used directly: `from pageindex import page_index`
47 changes: 44 additions & 3 deletions README.md
@@ -147,34 +147,75 @@ You can follow these steps to generate a PageIndex tree from a PDF document.
pip3 install --upgrade -r requirements.txt
```

### 2. Set your API key

Create a `.env` file in the root directory and add your API key for your chosen provider:

```bash
# OpenAI (default)
CHATGPT_API_KEY=your_openai_key_here

# Anthropic (optional)
ANTHROPIC_API_KEY=your_anthropic_key_here

# Ollama — no API key needed, just have Ollama running locally
```

### 3. Run PageIndex on your PDF

**OpenAI (default):**
```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```

**Anthropic:**
```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf \
--provider anthropic --model claude-sonnet-4-20250514
```

**Ollama (local models):**
```bash
# Make sure Ollama is running (ollama serve)
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf \
--provider ollama --model llama3
```

<details>
<summary><strong>Optional parameters</strong></summary>
<br>
You can customize the processing with additional optional arguments:

```
--model                   Model to use (default: gpt-4o-2024-11-20)
--provider                LLM provider: openai, anthropic, or ollama (default: openai)
--api-base-url            Custom API base URL (e.g. http://localhost:11434/v1 for Ollama)
--toc-check-pages         Pages to check for table of contents (default: 20)
--max-pages-per-node      Max pages per node (default: 10)
--max-tokens-per-node     Max tokens per node (default: 20000)
--if-add-node-id          Add node ID (yes/no, default: yes)
--if-add-node-summary     Add node summary (yes/no, default: yes)
--if-add-doc-description  Add doc description (yes/no, default: yes)

You can also set the provider via environment variables instead of CLI flags:
```bash
export LLM_PROVIDER=ollama # or "anthropic"
export API_BASE_URL=http://localhost:11434/v1 # optional, for custom endpoints
```
</details>

<details>
<summary><strong>Supported LLM Providers</strong></summary>
<br>

| Provider | Example Models | API Key Env Var | Notes |
|----------|---------------|-----------------|-------|
| **OpenAI** (default) | `gpt-4o-2024-11-20`, `gpt-4o-mini` | `CHATGPT_API_KEY` | Full support, recommended |
| **Anthropic** | `claude-sonnet-4-20250514`, `claude-haiku-4-5-20251001` | `ANTHROPIC_API_KEY` | Full support |
| **Ollama** | `llama3`, `mistral`, `qwen2.5` | _(none needed)_ | Requires Ollama running locally. Uses OpenAI-compatible API at `http://localhost:11434/v1` |

**Note:** PageIndex relies on structured JSON output from the LLM. For best results, use capable models (GPT-4o, Claude Sonnet/Opus, or large Ollama models like Llama 3 70B+). Smaller local models may produce lower-quality tree structures.
</details>

<details>
2 changes: 2 additions & 0 deletions pageindex/config.yaml
@@ -1,4 +1,6 @@
model: "gpt-4o-2024-11-20"
provider: "openai" # "openai", "anthropic", or "ollama"
api_base_url: null # Custom API base URL (e.g. http://localhost:11434/v1 for Ollama)
toc_check_page_num: 20
max_page_num_each_node: 10
max_token_num_each_node: 20000
20 changes: 16 additions & 4 deletions pageindex/page_index.py
@@ -1057,16 +1057,27 @@ async def tree_parser(page_list, opt, doc=None, logger=None):

def page_index_main(doc, opt=None):
    logger = JsonLogger(doc)

    # Set provider config from opt so all downstream API calls pick it up
    if hasattr(opt, 'provider') and opt.provider:
        os.environ['LLM_PROVIDER'] = opt.provider
        # Re-import module-level variable
        from pageindex import utils
        utils.LLM_PROVIDER = opt.provider
    if hasattr(opt, 'api_base_url') and opt.api_base_url:
        os.environ['API_BASE_URL'] = opt.api_base_url
        from pageindex import utils
        utils.API_BASE_URL = opt.api_base_url

    is_valid_pdf = (
        (isinstance(doc, str) and os.path.isfile(doc) and doc.lower().endswith(".pdf")) or
        isinstance(doc, BytesIO)
    )
    if not is_valid_pdf:
        raise ValueError("Unsupported input type. Expected a PDF file path or BytesIO object.")

    print('Parsing PDF...')
    page_list = get_page_tokens(doc, model=opt.model)

    logger.info({'total_page_number': len(page_list)})
    logger.info({'total_token': sum([page[1] for page in page_list])})
@@ -1100,7 +1111,8 @@ async def page_index_builder():
return asyncio.run(page_index_builder())


def page_index(doc, model=None, provider=None, api_base_url=None,
               toc_check_page_num=None, max_page_num_each_node=None, max_token_num_each_node=None,
               if_add_node_id=None, if_add_node_summary=None, if_add_doc_description=None, if_add_node_text=None):

    user_opt = {