faidrapts/sharepoint-connector

SharePoint Document Scraper

A Python package that authenticates with SharePoint using user-level permissions and downloads documents via the Microsoft Graph API, with support for MFA and optional ingestion into an AWS Bedrock knowledge base (custom data source).

Features

  • 🔐 Easy Authentication: Interactive OAuth 2.0 flow with MFA support
  • 📄 Document Discovery: Scan and catalog all documents in SharePoint sites
  • ⬇️ Bulk Download: Download multiple documents with progress tracking
  • 🧠 Bedrock Integration: Optional ingestion into AWS Bedrock knowledge bases
  • 🛠️ CLI & Library: Use as command-line tool or Python library
  • 📊 Rich Metadata: Detailed document information and statistics

Installation

pip install sharepoint-scraper

For AWS Bedrock integration:

pip install sharepoint-scraper[bedrock]

Quick Start

Command Line Usage

  1. Set up environment variables (if you do not have the required Azure values, first follow the "Azure AD App Registration Setup" instructions in the Configuration section below):
export SHAREPOINT_SITE_URL="https://yourcompany.sharepoint.com/sites/yoursite"
export AZURE_CLIENT_ID="your-azure-app-client-id"
export AZURE_CLIENT_SECRET="your-azure-client-secret"
export AZURE_TENANT_ID="your-tenant-id"
export AZURE_REDIRECT_URI="your-redirect-uri" # for local testing: http://localhost:8080/callback
  2. Test connection:
sharepoint-scraper test
  3. Scan documents:
sharepoint-scraper scan
  4. Download documents:
sharepoint-scraper download
  5. Download and ingest to Bedrock:
# Set Bedrock environment variables
export BEDROCK_KNOWLEDGE_BASE_ID="your-kb-id"
export BEDROCK_DATA_SOURCE_ID="your-data-source-id"
export AWS_REGION="us-east-1"

sharepoint-scraper download --bedrock

Python Library Usage

from sharepoint_scraper import SharePointScraper, SharePointAuth, BedrockIntegration

# Basic usage
site_url = "https://yourcompany.sharepoint.com/sites/yoursite"
scraper = SharePointScraper(site_url)

# Authenticate (opens browser for OAuth)
scraper.authenticate()

# Get all documents
documents = scraper.get_documents()

# Download documents
scraper.bulk_download(documents, "downloads/")

# With Bedrock integration
bedrock = BedrockIntegration(
    knowledge_base_id="your-kb-id",
    data_source_id="your-data-source-id"
)

scraper = SharePointScraper(site_url, bedrock=bedrock)
scraper.authenticate()

# Download and ingest to Bedrock
documents = scraper.get_documents()
scraper.bulk_download_and_ingest(documents, "downloads/")

Configuration

Environment Variables

Required for SharePoint

  • SHAREPOINT_SITE_URL: Your SharePoint site URL
  • AZURE_CLIENT_ID: Azure AD App Registration Client ID
  • AZURE_TENANT_ID: Azure AD Tenant ID
  • AZURE_CLIENT_SECRET: Azure AD Client Secret
  • AZURE_REDIRECT_URI: Azure Redirect URI

Required for Bedrock Integration

  • BEDROCK_KNOWLEDGE_BASE_ID: AWS Bedrock Knowledge Base ID
  • BEDROCK_DATA_SOURCE_ID: AWS Bedrock Data Source ID
  • AWS_REGION: AWS Region (default: us-east-1)
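The Bedrock variables above can be collected in one place before constructing BedrockIntegration. The helper below is hypothetical (not part of this package); it only illustrates which variables are required and the documented us-east-1 default:

```python
import os

# Hypothetical helper (not part of sharepoint-scraper): gather the
# Bedrock-related settings, applying the documented us-east-1 default.
# Raises KeyError if a required variable is missing.
def load_bedrock_config():
    return {
        "knowledge_base_id": os.environ["BEDROCK_KNOWLEDGE_BASE_ID"],
        "data_source_id": os.environ["BEDROCK_DATA_SOURCE_ID"],
        "region_name": os.environ.get("AWS_REGION", "us-east-1"),
    }
```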

Azure AD App Registration Setup

  1. Go to Azure Portal → Azure Active Directory → App registrations
  2. Click "New registration"
  3. Set these values:
    • Name: SharePoint Scraper
    • Supported account types: Accounts in this organizational directory only
    • Redirect URI: Web → http://localhost:8080/callback for local testing, or one suited to your application. Make sure the AZURE_REDIRECT_URI environment variable matches, as indicated above.
  4. After creation, note the Application (client) ID and Directory (tenant) ID, then go to Certificates & secrets → New client secret. Copy the secret's Value (not the Secret ID); this is your AZURE_CLIENT_SECRET.
  5. Go to "API permissions" → Add permission → Microsoft Graph → Delegated permissions
  6. Add these permissions:
    • Sites.Read.All
    • Files.Read.All
  7. Click "Grant admin consent"

CLI Reference

# Test connection
sharepoint-scraper test

# Scan and save document metadata
sharepoint-scraper scan --output documents.json

# Download documents
sharepoint-scraper download --output-dir downloads/

# Download and ingest to Bedrock
sharepoint-scraper download --bedrock

# Use existing metadata file
sharepoint-scraper download --metadata-file documents.json

# Show configuration status
sharepoint-scraper config

# Get help
sharepoint-scraper --help
sharepoint-scraper download --help

Python API Reference

SharePointScraper

Main class for SharePoint operations.

from sharepoint_scraper import SharePointScraper

scraper = SharePointScraper(site_url, auth=None, bedrock=None)

# Authentication
scraper.authenticate() -> bool

# Document operations
scraper.get_documents() -> List[Dict]
scraper.download_document(document, download_path) -> Optional[str]
scraper.bulk_download(documents, download_path, progress_callback=None) -> Dict[str, str]

# Bedrock integration
scraper.download_and_ingest_document(document, download_path) -> bool
scraper.bulk_download_and_ingest(documents, download_path) -> Dict[str, bool]

# Connection testing
scraper.test_connection() -> bool
scraper.get_site_info() -> Dict

SharePointAuth

Handles authentication with Microsoft Graph API.

from sharepoint_scraper import SharePointAuth

auth = SharePointAuth(client_id, tenant_id, redirect_uri)

# Authenticate user
auth.authenticate() -> str  # Returns access token

# Check status
auth.is_authenticated() -> bool
auth.get_access_token() -> Optional[str]
auth.get_auth_headers() -> Dict[str, str]

BedrockIntegration

Optional AWS Bedrock knowledge base integration.

from sharepoint_scraper import BedrockIntegration

bedrock = BedrockIntegration(knowledge_base_id, data_source_id, region_name)

# Ingest single document
bedrock.ingest_document(document_path, document_id, title) -> Dict

# Batch ingest
bedrock.batch_ingest_documents(documents, progress_callback) -> Dict[str, Dict]
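A minimal progress callback for batch_ingest_documents might look like the sketch below. The (current, total) signature is an assumption, based on the callback used with bulk_download in Example 1:

```python
# Hypothetical progress callback; assumes batch_ingest_documents invokes
# it as callback(current, total), matching the bulk_download example below.
def format_progress(current, total):
    return f"Ingested {current}/{total} ({100 * current // total}%)"

def report_progress(current, total):
    print(format_progress(current, total))
```

Pass report_progress as the progress_callback argument.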

Examples

Example 1: Basic Document Download

import os
from sharepoint_scraper import SharePointScraper

# Set up
os.environ['SHAREPOINT_SITE_URL'] = 'https://company.sharepoint.com/sites/mysite'
os.environ['AZURE_CLIENT_ID'] = 'your-client-id'

# Create scraper and authenticate
scraper = SharePointScraper(os.environ['SHAREPOINT_SITE_URL'])
scraper.authenticate()  # Opens browser for login

# Get and download all documents
documents = scraper.get_documents()
print(f"Found {len(documents)} documents")

# Download with progress
def progress(current, total):
    print(f"Progress: {current}/{total}")

results = scraper.bulk_download(documents, "downloads/", progress)
print(f"Downloaded {len(results)} documents")

Example 2: Bedrock Integration

import os
from sharepoint_scraper import SharePointScraper, BedrockIntegration

# Configure environment
os.environ.update({
    'SHAREPOINT_SITE_URL': 'https://company.sharepoint.com/sites/mysite',
    'AZURE_CLIENT_ID': 'your-client-id',
    'BEDROCK_KNOWLEDGE_BASE_ID': 'your-kb-id',
    'BEDROCK_DATA_SOURCE_ID': 'your-ds-id',
    'AWS_REGION': 'us-east-1'
})

# Set up Bedrock integration
bedrock = BedrockIntegration(
    knowledge_base_id=os.environ['BEDROCK_KNOWLEDGE_BASE_ID'],
    data_source_id=os.environ['BEDROCK_DATA_SOURCE_ID']
)

# Create scraper with Bedrock
scraper = SharePointScraper(
    site_url=os.environ['SHAREPOINT_SITE_URL'],
    bedrock=bedrock
)

# Authenticate and process
scraper.authenticate()
documents = scraper.get_documents()

# Download and ingest to Bedrock
results = scraper.bulk_download_and_ingest(documents, "downloads/")

successful = sum(1 for success in results.values() if success)
print(f"Successfully processed {successful}/{len(documents)} documents")

Example 3: Custom Authentication

from sharepoint_scraper import SharePointScraper, SharePointAuth

# Custom auth setup
auth = SharePointAuth(
    client_id="your-client-id",
    tenant_id="your-tenant-id",  # Optional
    redirect_uri="http://localhost:8080/callback"  # change as desired
)

# Authenticate
token = auth.authenticate()
print(f"Access token: {token[:20]}...")

# Use with scraper
scraper = SharePointScraper(
    site_url="https://company.sharepoint.com/sites/mysite",
    auth=auth
)

# Now scraper is already authenticated
documents = scraper.get_documents()

Error Handling

The package includes comprehensive error handling:

from sharepoint_scraper import (
    SharePointScraper, 
    SharePointError, 
    AuthenticationError, 
    DownloadError,
    ConfigurationError
)

try:
    scraper = SharePointScraper(site_url)
    scraper.authenticate()
    documents = scraper.get_documents()
    
except ConfigurationError as e:
    print(f"Configuration issue: {e}")
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except SharePointError as e:
    print(f"SharePoint error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Troubleshooting

Common Issues

  1. Authentication fails

    • Ensure Azure AD app has correct permissions
    • Check that the redirect URI registered in Azure matches your AZURE_REDIRECT_URI (for local testing: http://localhost:8080/callback)
    • Verify client ID and tenant ID
  2. "Access denied" errors

    • User needs read permissions on SharePoint site
    • Azure AD app needs admin consent for permissions
  3. Documents not found

    • Check SharePoint site URL format
    • Ensure user has access to document libraries
    • Site might have restricted access
  4. Bedrock ingestion fails

    • Verify AWS credentials are configured
    • Check Bedrock knowledge base and data source IDs
    • Ensure proper IAM permissions for Bedrock

Debug Mode

Enable debug logging:

sharepoint-scraper --log-level DEBUG scan

Or in Python:

import logging
logging.basicConfig(level=logging.DEBUG)

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and add tests
  4. Run tests: pytest
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Changelog

v1.0.0

  • Initial release
  • Microsoft Graph API integration
  • Interactive OAuth authentication with MFA support
  • Document scanning and downloading
  • AWS Bedrock knowledge base integration
  • Command-line interface
  • Comprehensive error handling and logging
