# SharePoint Scraper

A Python package for authenticating with SharePoint (user-level permissions) and downloading documents via the Microsoft Graph API, with support for MFA and optional ingestion into an AWS Bedrock knowledge base (custom data source).

## Features
- 🔐 Easy Authentication: Interactive OAuth 2.0 flow with MFA support
- 📄 Document Discovery: Scan and catalog all documents in SharePoint sites
- ⬇️ Bulk Download: Download multiple documents with progress tracking
- 🧠 Bedrock Integration: Optional ingestion into AWS Bedrock knowledge bases
- 🛠️ CLI & Library: Use as command-line tool or Python library
- 📊 Rich Metadata: Detailed document information and statistics
## Installation

```bash
pip install sharepoint-scraper
```

For AWS Bedrock integration:
```bash
pip install sharepoint-scraper[bedrock]
```

## Quick Start

- Set up environment variables (if you do not have the required Azure values, first follow the "Azure AD App Registration Setup" instructions in the Configuration section below):
```bash
export SHAREPOINT_SITE_URL="https://yourcompany.sharepoint.com/sites/yoursite"
export AZURE_CLIENT_ID="your-azure-app-client-id"
export AZURE_CLIENT_SECRET="your-azure-client-secret"
export AZURE_TENANT_ID="your-tenant-id"
export AZURE_REDIRECT_URI="your-redirect-uri"  # for local testing: http://localhost:8080/callback
```

- Test connection:
```bash
sharepoint-scraper test
```

- Scan documents:
```bash
sharepoint-scraper scan
```

- Download documents:
```bash
sharepoint-scraper download
```

- Download and ingest to Bedrock:
```bash
# Set Bedrock environment variables
export BEDROCK_KNOWLEDGE_BASE_ID="your-kb-id"
export BEDROCK_DATA_SOURCE_ID="your-data-source-id"
export AWS_REGION="us-east-1"

sharepoint-scraper download --bedrock
```

## Library Usage

```python
from sharepoint_scraper import SharePointScraper, SharePointAuth, BedrockIntegration

# Basic usage
scraper = SharePointScraper("https://yourcompany.sharepoint.com/sites/yoursite")

# Authenticate (opens browser for OAuth)
scraper.authenticate()

# Get all documents
documents = scraper.get_documents()

# Download documents
scraper.bulk_download(documents, "downloads/")

# With Bedrock integration
bedrock = BedrockIntegration(
    knowledge_base_id="your-kb-id",
    data_source_id="your-data-source-id"
)
scraper = SharePointScraper(site_url, bedrock=bedrock)
scraper.authenticate()

# Download and ingest to Bedrock
documents = scraper.get_documents()
scraper.bulk_download_and_ingest(documents, "downloads/")
```

## Configuration

### Environment Variables

- `SHAREPOINT_SITE_URL`: Your SharePoint site URL
- `AZURE_CLIENT_ID`: Azure AD App Registration Client ID
- `AZURE_TENANT_ID`: Azure AD Tenant ID
- `AZURE_CLIENT_SECRET`: Azure AD Client Secret
- `AZURE_REDIRECT_URI`: Azure Redirect URI

For Bedrock integration:

- `BEDROCK_KNOWLEDGE_BASE_ID`: AWS Bedrock Knowledge Base ID
- `BEDROCK_DATA_SOURCE_ID`: AWS Bedrock Data Source ID
- `AWS_REGION`: AWS Region (default: `us-east-1`)
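Since a missing variable only surfaces later as an authentication or configuration error, it can help to check the environment up front. A minimal sketch (this helper and its names are illustrative, not part of the package):

```python
import os

# Illustrative helper (not part of the package): report which of the
# required environment variables are unset, so misconfiguration fails fast.
REQUIRED_VARS = [
    "SHAREPOINT_SITE_URL",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_REDIRECT_URI",
]

def missing_config(env=None):
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

if missing_config():
    print("Missing:", ", ".join(missing_config()))
```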
### Azure AD App Registration Setup

- Go to Azure Portal → Azure Active Directory → App registrations
- Click "New registration"
- Set these values:
  - Name: SharePoint Scraper
  - Supported account types: Accounts in this organizational directory only
  - Redirect URI: Web → `http://localhost:8080/callback`, or configure it for your application. Make sure to also set the corresponding environment variable `AZURE_REDIRECT_URI` as indicated above.
- After creation, note the Application (client) ID and Directory (tenant) ID, then navigate to Certificates & secrets → New client secret and note the secret's value (it is shown only once).
- Go to "API permissions" → Add a permission → Microsoft Graph → Delegated permissions
- Add these permissions:
  - `Sites.Read.All`
  - `Files.Read.All`
- Click "Grant admin consent"
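The package drives the interactive login itself, but to illustrate what the delegated flow enabled by this registration looks like, here is a sketch that builds the standard Microsoft identity platform v2.0 authorization URL a browser login would open (the values are placeholders; the scope strings mirror the delegated permissions above):

```python
from urllib.parse import urlencode

# Illustrative only: the package performs this flow for you. This builds the
# v2.0 authorize URL that an interactive OAuth login would open in the browser.
def build_authorize_url(tenant_id: str, client_id: str, redirect_uri: str) -> str:
    params = {
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "response_mode": "query",
        # Fully qualified delegated Graph scopes matching the permissions above
        "scope": "https://graph.microsoft.com/Sites.Read.All "
                 "https://graph.microsoft.com/Files.Read.All offline_access",
    }
    return (
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?"
        + urlencode(params)
    )

url = build_authorize_url("your-tenant-id", "your-client-id",
                          "http://localhost:8080/callback")
print(url)
```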
## CLI Usage

```bash
# Test connection
sharepoint-scraper test

# Scan and save document metadata
sharepoint-scraper scan --output documents.json

# Download documents
sharepoint-scraper download --output-dir downloads/

# Download and ingest to Bedrock
sharepoint-scraper download --bedrock

# Use an existing metadata file
sharepoint-scraper download --metadata-file documents.json

# Show configuration status
sharepoint-scraper config

# Get help
sharepoint-scraper --help
sharepoint-scraper download --help
```

## API Reference

### SharePointScraper

Main class for SharePoint operations.

```python
from sharepoint_scraper import SharePointScraper

scraper = SharePointScraper(site_url, auth=None, bedrock=None)

# Authentication
scraper.authenticate() -> bool

# Document operations
scraper.get_documents() -> List[Dict]
scraper.download_document(document, download_path) -> Optional[str]
scraper.bulk_download(documents, download_path) -> Dict[str, str]

# Bedrock integration
scraper.download_and_ingest_document(document, download_path) -> bool
scraper.bulk_download_and_ingest(documents, download_path) -> Dict[str, bool]

# Connection testing
scraper.test_connection() -> bool
scraper.get_site_info() -> Dict
```

### SharePointAuth

Handles authentication with the Microsoft Graph API.
```python
from sharepoint_scraper import SharePointAuth

auth = SharePointAuth(client_id, tenant_id, redirect_uri)

# Authenticate user
auth.authenticate() -> str  # Returns access token

# Check status
auth.is_authenticated() -> bool
auth.get_access_token() -> Optional[str]
auth.get_auth_headers() -> Dict[str, str]
```

### BedrockIntegration

Optional AWS Bedrock knowledge base integration.
```python
from sharepoint_scraper import BedrockIntegration

bedrock = BedrockIntegration(knowledge_base_id, data_source_id, region_name)

# Ingest single document
bedrock.ingest_document(document_path, document_id, title) -> Dict

# Batch ingest
bedrock.batch_ingest_documents(documents, progress_callback) -> Dict[str, Dict]
```

## Examples

### Basic Document Download

```python
import os

from sharepoint_scraper import SharePointScraper

# Set up
os.environ['SHAREPOINT_SITE_URL'] = 'https://company.sharepoint.com/sites/mysite'
os.environ['AZURE_CLIENT_ID'] = 'your-client-id'

# Create scraper and authenticate
scraper = SharePointScraper(os.environ['SHAREPOINT_SITE_URL'])
scraper.authenticate()  # Opens browser for login

# Get and download all documents
documents = scraper.get_documents()
print(f"Found {len(documents)} documents")

# Download with progress
def progress(current, total):
    print(f"Progress: {current}/{total}")

results = scraper.bulk_download(documents, "downloads/", progress)
print(f"Downloaded {len(results)} documents")
```

### Bedrock Integration

```python
import os

from sharepoint_scraper import SharePointScraper, BedrockIntegration

# Configure environment
os.environ.update({
    'SHAREPOINT_SITE_URL': 'https://company.sharepoint.com/sites/mysite',
    'AZURE_CLIENT_ID': 'your-client-id',
    'BEDROCK_KNOWLEDGE_BASE_ID': 'your-kb-id',
    'BEDROCK_DATA_SOURCE_ID': 'your-ds-id',
    'AWS_REGION': 'us-east-1'
})

# Set up Bedrock integration
bedrock = BedrockIntegration(
    knowledge_base_id=os.environ['BEDROCK_KNOWLEDGE_BASE_ID'],
    data_source_id=os.environ['BEDROCK_DATA_SOURCE_ID']
)

# Create scraper with Bedrock
scraper = SharePointScraper(
    site_url=os.environ['SHAREPOINT_SITE_URL'],
    bedrock=bedrock
)

# Authenticate and process
scraper.authenticate()
documents = scraper.get_documents()

# Download and ingest to Bedrock
results = scraper.bulk_download_and_ingest(documents, "downloads/")
successful = sum(1 for success in results.values() if success)
print(f"Successfully processed {successful}/{len(documents)} documents")
```

### Custom Authentication

```python
from sharepoint_scraper import SharePointScraper, SharePointAuth

# Custom auth setup
auth = SharePointAuth(
    client_id="your-client-id",
    tenant_id="your-tenant-id",  # Optional
    redirect_uri="http://localhost:8080/callback"  # Change as desired
)

# Authenticate
auth_token = auth.authenticate()
print(f"Access token: {auth_token[:20]}...")

# Use with scraper
scraper = SharePointScraper(
    site_url="https://company.sharepoint.com/sites/mysite",
    auth=auth
)

# The scraper is now already authenticated
documents = scraper.get_documents()
```

## Error Handling

The package includes comprehensive error handling:
```python
from sharepoint_scraper import (
    SharePointScraper,
    SharePointError,
    AuthenticationError,
    DownloadError,
    ConfigurationError
)

try:
    scraper = SharePointScraper(site_url)
    scraper.authenticate()
    documents = scraper.get_documents()
except ConfigurationError as e:
    print(f"Configuration issue: {e}")
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except SharePointError as e:
    print(f"SharePoint error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Troubleshooting

- **Authentication fails**
  - Ensure the Azure AD app has the correct permissions
  - Check that the redirect URI is set to `http://localhost:8080/callback`
  - Verify the client ID and tenant ID
- **"Access denied" errors**
  - The user needs read permissions on the SharePoint site
  - The Azure AD app needs admin consent for its permissions
- **Documents not found**
  - Check the SharePoint site URL format
  - Ensure the user has access to the document libraries
  - The site might have restricted access
- **Bedrock ingestion fails**
  - Verify AWS credentials are configured
  - Check the Bedrock knowledge base and data source IDs
  - Ensure proper IAM permissions for Bedrock
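Transient failures (network blips, Graph or Bedrock throttling) often succeed on a second attempt. A minimal retry sketch, assuming nothing beyond the standard library (this helper is illustrative and not part of the package API):

```python
import time

# Illustrative helper (not part of the package API): retry a flaky call
# with exponential backoff before giving up.
def with_retries(func, *args, attempts=3, base_delay=1.0, **kwargs):
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error
            # Back off 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical usage:
#   with_retries(scraper.download_document, document, "downloads/")
```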
### Debug Mode

Enable debug logging:

```bash
sharepoint-scraper --log-level DEBUG scan
```

Or in Python:
```python
import logging

logging.basicConfig(level=logging.DEBUG)
```

## Contributing

- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make changes and add tests
- Run tests: `pytest`
- Submit a pull request
## License

MIT License - see LICENSE file for details.
## Changelog

- Initial release
- Microsoft Graph API integration
- Interactive OAuth authentication with MFA support
- Document scanning and downloading
- AWS Bedrock knowledge base integration
- Command-line interface
- Comprehensive error handling and logging