410 changes: 410 additions & 0 deletions changes/nova_2_model_support.md

Large diffs are not rendered by default.

248 changes: 248 additions & 0 deletions changes/providers-enhancements.md
@@ -0,0 +1,248 @@
# Reader Provider Enhancements

## Overview
This document describes the enhancements made to the lexical-graph reader providers, focusing on improved error handling, comprehensive logging, and the addition of a new universal directory reader provider.

## Key Enhancements

### 1. Enhanced Error Handling

All reader providers now implement robust error handling patterns:

#### Import Error Handling
- **Clear dependency messages**: When required dependencies are missing, providers now show specific installation commands
- **Graceful degradation**: Import failures are caught and re-raised with helpful context

```python
try:
    from llama_index.readers.file.pymu_pdf import PyMuPDFReader
except ImportError as e:
    logger.error("Failed to import PyMuPDFReader: missing pymupdf")
    raise ImportError(
        "PyMuPDFReader requires 'pymupdf'. Install with: pip install pymupdf"
    ) from e
```

#### Runtime Error Handling
- **Input validation**: All providers validate input sources before processing
- **Exception chaining**: Original exceptions are preserved using `from e` syntax
- **Contextual error messages**: Errors include specific information about what failed

```python
if not input_source:
    logger.error("No input source provided to PDFReaderProvider")
    raise ValueError("input_source cannot be None or empty")

try:
    documents = self._reader.load_data(file_path=processed_paths[0])
except Exception as e:
    logger.error(f"Failed to read PDF from {input_source}: {e}", exc_info=True)
    raise RuntimeError(f"Failed to read PDF: {e}") from e
```

### 2. Comprehensive Logging

#### Structured Logging Pattern
All providers follow a consistent logging pattern:

- **Debug level**: Initialization and configuration details
- **Info level**: Operation progress and success metrics
- **Error level**: Failures with full exception context

```python
logger = logging.getLogger(__name__)

# Initialization
logger.debug(f"Initialized PDFReaderProvider with return_full_document={config.return_full_document}")

# Progress tracking
logger.info(f"Reading PDF from: {input_source}")
logger.info(f"Successfully read {len(documents)} document(s) from PDF")

# Error reporting
logger.error(f"Failed to read PDF from {input_source}: {e}", exc_info=True)
```

#### Log Context
- **File paths**: Original and processed paths are logged
- **Document counts**: Success metrics include document counts
- **Configuration**: Key configuration parameters are logged at debug level
- **Exception traces**: Full stack traces are captured with `exc_info=True`
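Taken together, these conventions look like the following in a provider's read path. This is a minimal sketch, not the actual implementation: the class body is condensed, and `_reader` stands in for the underlying LlamaIndex reader.

```python
import logging

logger = logging.getLogger(__name__)


class PDFReaderProvider:
    """Sketch of the validation, logging, and error-chaining pattern."""

    def __init__(self, reader, return_full_document=False):
        self._reader = reader
        # Configuration details at debug level
        logger.debug(f"Initialized PDFReaderProvider with return_full_document={return_full_document}")

    def read(self, input_source):
        # Validate input before doing any work
        if not input_source:
            logger.error("No input source provided to PDFReaderProvider")
            raise ValueError("input_source cannot be None or empty")

        logger.info(f"Reading PDF from: {input_source}")
        try:
            documents = self._reader.load_data(file_path=input_source)
        except Exception as e:
            # Full stack trace for troubleshooting, original exception preserved
            logger.error(f"Failed to read PDF from {input_source}: {e}", exc_info=True)
            raise RuntimeError(f"Failed to read PDF: {e}") from e

        logger.info(f"Successfully read {len(documents)} document(s) from PDF")
        return documents
```

Any underlying reader failure surfaces as a `RuntimeError` whose `__cause__` is the original exception, so callers can catch one type while retaining the full chain.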

### 3. Universal Directory Reader Provider

#### Purpose
The new `UniversalDirectoryReaderProvider` provides a unified interface for reading from both local directories and S3-based document collections.

#### Key Features

**Dual Mode Operation**:
- **Local mode**: Uses LlamaIndex's `SimpleDirectoryReader` for local file system access
- **S3 mode**: Uses GraphRAG's `S3BasedDocs` for S3-based document collections

**Intelligent Source Detection**:
```python
def read(self, input_source: Optional[Union[str, Dict[str, str]]] = None) -> List[Document]:
    """Read from local or S3 based on config/input."""
    if isinstance(input_source, dict) or self.config.bucket_name:
        return self._read_from_s3(input_source)
    else:
        return self._read_from_local(input_source)
```

**Comprehensive Configuration**:
```python
class UniversalDirectoryReaderConfig(ReaderProviderConfig):
    # Local directory options
    input_dir: Optional[str] = None
    input_files: Optional[List[str]] = None
    exclude_hidden: bool = True
    recursive: bool = False
    required_exts: Optional[List[str]] = None
    file_extractor: Optional[Dict[str, Any]] = None
    metadata_fn: Optional[Callable] = None

    # S3BasedDocs options
    region: Optional[str] = None
    bucket_name: Optional[str] = None
    key_prefix: Optional[str] = None
    collection_id: Optional[str] = None
```

#### Usage Examples

**Local Directory Reading**:
```python
config = UniversalDirectoryReaderConfig(
    input_dir="/path/to/documents",
    recursive=True,
    required_exts=[".pdf", ".txt"],
    metadata_fn=lambda path: {"source": "local", "path": path},
)
reader = UniversalDirectoryReaderProvider(config)
documents = reader.read()
```

**S3 Collection Reading**:
```python
config = UniversalDirectoryReaderConfig(
    region="us-east-1",
    bucket_name="my-documents",
    key_prefix="collections",
    collection_id="project-docs",
    metadata_fn=lambda path: {"source": "s3", "path": path},
)
reader = UniversalDirectoryReaderProvider(config)
documents = reader.read()
```

**Dynamic Source Selection**:
```python
# Local reading
documents = reader.read("/local/path")

# S3 reading via dict
s3_config = {
    "region": "us-west-2",
    "bucket_name": "docs-bucket",
    "key_prefix": "data",
    "collection_id": "analysis",
}
documents = reader.read(s3_config)
```

## Error Handling Patterns

### 1. Validation Errors
```python
if not all([region, bucket_name, key_prefix, collection_id]):
    logger.error("Missing S3 configuration")
    raise ValueError("S3 requires: region, bucket_name, key_prefix, collection_id")
```

### 2. Import Errors
```python
try:
    from graphrag_toolkit.lexical_graph.indexing.load import S3BasedDocs
except ImportError as e:
    logger.error("Failed to import S3BasedDocs")
    raise ImportError("S3BasedDocs not available") from e
```

### 3. Runtime Errors
```python
try:
    documents = reader.load_data()
    logger.info(f"Successfully read {len(documents)} document(s) from local")
except Exception as e:
    logger.error(f"Failed to read from local: {e}", exc_info=True)
    raise RuntimeError(f"Failed to read documents: {e}") from e
```

## Logging Best Practices

### 1. Consistent Logger Naming
```python
logger = logging.getLogger(__name__)
```

### 2. Appropriate Log Levels
- **DEBUG**: Configuration details, internal state
- **INFO**: Operation progress, success metrics
- **ERROR**: Failures with context

### 3. Exception Context
```python
logger.error(f"Failed to read from S3: {e}", exc_info=True)
```

### 4. Structured Messages
```python
logger.info(f"Reading from S3: s3://{bucket_name}/{key_prefix}/{collection_id}")
```

## Benefits

### 1. Improved Debugging
- Clear error messages with specific failure context
- Full exception traces for troubleshooting
- Progress tracking through structured logging

### 2. Better User Experience
- Helpful dependency installation messages
- Clear validation error descriptions
- Consistent error handling across all providers

### 3. Enhanced Flexibility
- Universal directory reader supports both local and S3 sources
- Dynamic source detection based on input type
- Unified configuration interface

### 4. Production Readiness
- Robust error handling prevents crashes
- Comprehensive logging aids monitoring
- Graceful degradation when dependencies are missing

## Migration Notes

### Existing Code Compatibility
All existing reader providers maintain backward compatibility while adding enhanced error handling and logging.

### New Universal Provider
The `UniversalDirectoryReaderProvider` can replace separate local and S3 directory readers in many use cases, providing a single interface for both scenarios.

### Logging Configuration
Applications should configure logging appropriately to capture the enhanced log output:

```python
import logging
logging.basicConfig(level=logging.INFO)
```
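For more control, per-logger levels can be combined with a format string, so debug output from the toolkit's providers is captured without raising the verbosity of third-party libraries. A sketch, assuming the providers log under the `graphrag_toolkit` module hierarchy:

```python
import logging

# Root configuration: timestamps and logger names aid correlation
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)

# Opt the toolkit's loggers into debug-level detail only
logging.getLogger("graphrag_toolkit").setLevel(logging.DEBUG)
```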

## Future Enhancements

1. **Metrics Integration**: Add performance metrics to logging output
2. **Retry Logic**: Implement automatic retry for transient failures
3. **Async Support**: Add asynchronous reading capabilities
4. **Caching**: Implement document caching for frequently accessed sources
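Retry logic, for instance, could be layered onto the existing providers as a decorator. The following is a hypothetical sketch, not part of the current codebase; it retries the `RuntimeError`s the providers already raise, with simple exponential backoff.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def with_retries(max_attempts=3, backoff_seconds=0.5):
    """Retry transient RuntimeErrors with exponential backoff (hypothetical sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RuntimeError as e:
                    if attempt == max_attempts:
                        logger.error(f"Giving up after {attempt} attempt(s): {e}")
                        raise
                    logger.info(f"Attempt {attempt} failed, retrying: {e}")
                    time.sleep(backoff_seconds * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Applied to a provider's `read` method, this would keep the existing error-chaining intact while absorbing transient failures such as throttled S3 requests.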
4 changes: 2 additions & 2 deletions lexical-graph/README.md
@@ -5,7 +5,7 @@ The lexical-graph package provides a framework for automating the construction o
### Features

- Built-in graph store support for [Amazon Neptune Analytics](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html), [Amazon Neptune Database](https://docs.aws.amazon.com/neptune/latest/userguide/intro.html), and [Neo4j](https://neo4j.com/docs/).
- - Built-in vector store support for Neptune Analytics, [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html), [Amazon S3 Vectors](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors.html) and Postgres with the pgvector extension.
+ - Built-in vector store support for Neptune Analytics, [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html) and Postgres with the pgvector extension.
- Built-in support for foundation models (LLMs and embedding models) on [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/).
- Easily extended to support additional graph and vector stores and model backends.
- [Multi-tenancy](../docs/lexical-graph/multi-tenancy.md) – multiple separate lexical graphs in the same underlying graph and vector stores.
@@ -19,7 +19,7 @@ point in time.
The lexical-graph requires Python and [pip](http://www.pip-installer.org/en/latest/) to install. You can install the lexical-graph using pip:

```
- $ pip install https://github.com/awslabs/graphrag-toolkit/archive/refs/tags/v3.15.5.zip#subdirectory=lexical-graph
+ $ pip install https://github.com/awslabs/graphrag-toolkit/archive/refs/tags/v3.15.4.zip#subdirectory=lexical-graph
```

If you're running on AWS, you must run your application in an AWS region containing the Amazon Bedrock foundation models used by the lexical graph (see the [configuration](../docs/lexical-graph/configuration.md#graphragconfig) section in the documentation for details on the default models used), and must [enable access](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) to these models before running any part of the solution.
2 changes: 1 addition & 1 deletion lexical-graph/pyproject.toml
@@ -7,7 +7,7 @@ packages = ["src/graphrag_toolkit"]

[project]
name = "graphrag-toolkit-lexical-graph"
- version = "3.16.2-SNAPSHOT"
+ version = "3.15.5.dev0"
description = "AWS GraphRAG Toolkit, lexical graph"
readme = "README.md"
requires-python = ">=3.10"
@@ -40,6 +40,7 @@ def _asyncio_run(coro):

from .tenant_id import TenantId, DEFAULT_TENANT_ID, DEFAULT_TENANT_NAME, TenantIdType, to_tenant_id
from .config import GraphRAGConfig as GraphRAGConfig, LLMType, EmbeddingType
+ from .bedrock_llm import DirectBedrockLLM
from .errors import ModelError, BatchJobError, IndexError, GraphQueryError
from .logging import set_logging_config, set_advanced_logging_config
from .lexical_graph_query_engine import LexicalGraphQueryEngine