# Cloud Import

Import documents directly from cloud storage providers (AWS S3, Google Drive) into your vector database with full processing pipeline integration.
- Overview
- Supported Providers
- Getting Started
- Pause & Resume Analysis
- AWS S3 Import
- Google Drive Import
- Import Options
- File Selection
- Progress Tracking
- Processing Pipeline
- API Reference
- Troubleshooting
## Overview

The Cloud Import feature allows you to:
- 📦 Import documents from public S3 buckets/folders
- 📁 Import from public Google Drive folders (optional, requires API key)
- 📊 Analyze folders before importing (file count, size, types)
- 🎯 Select specific files with advanced filters
- ⚙️ Full processing pipeline integration (embedding, PII detection, categorization)
- 📈 Real-time progress tracking
## Supported Providers

### AWS S3

- ✅ Status: Fully supported
- 🔓 Authentication: Anonymous access (public buckets only)
- 📝 URL Formats:

  ```
  https://bucket-name.s3.amazonaws.com/folder/
  https://bucket-name.s3.region.amazonaws.com/folder/
  s3://bucket-name/folder/
  ```
### Google Drive

- ✅ Status: Fully supported (optional, requires API key)
- 🔑 Authentication: Requires Google Drive API key
- 📝 URL Format:

  ```
  https://drive.google.com/drive/folders/FOLDER_ID
  ```

⚠️ Note: Skips Google Workspace files (Docs, Sheets, Slides) - only downloads binary files
## Getting Started

In the web UI:

1. Click the "Add Document" button
2. Select the "☁️ Cloud Import" tab
3. Choose your cloud provider (S3 or Google Drive)
4. Enter your cloud storage URL
5. Click "🔍 Analyze Folder"
   - (Optional) Click "⏸️ Pause & Use Current Results" to stop early and use partial results
   - (Optional) Click "🛑 Cancel Analysis" to stop and discard progress
6. Review folder statistics:
   - Total files
   - Total size
   - File type breakdown
7. (Optional) Click "🧹 Clear Analysis" to reset analysis results and start over
Note: Analysis runs as a background job and streams progress into the modal. It does not block the UI.
## Pause & Resume Analysis

Long-running folder analysis supports pausing and resuming from the last cursor.

What pausing does

- Stops the analysis worker via `AbortController`.
- Keeps partial counts and discovered files in memory (server-side) so the UI can proceed.
- Leaves a resumable job that can be continued shortly after.
What resuming does

- Reuses the same analysis job ID.
- Continues listing from the last cursor token:
  - S3: `NextContinuationToken`
  - Google Drive: `nextPageToken`
- Preserves previously accumulated stats (e.g. `fileTypes`) so the UI does not “forget” what it already found.
TTL / persistence
- Resume works for the same provider+URL within ~5–10 minutes.
- Resume state is in-memory only (server restart clears it).
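For reference, the resume mechanics on the S3 side boil down to passing the saved cursor back to the ListObjectsV2 REST API. Here is an illustrative sketch of that underlying call (not the project's internal code), assuming an anonymous request against a public bucket:

```typescript
// Illustrative: resume a public-bucket listing from a saved
// NextContinuationToken using S3's ListObjectsV2 REST API (list-type=2).
// Bucket and prefix are placeholders.
async function listPage(bucket: string, prefix: string, token?: string) {
  const url = new URL(`https://${bucket}.s3.amazonaws.com/`);
  url.searchParams.set("list-type", "2");
  url.searchParams.set("prefix", prefix);
  if (token) url.searchParams.set("continuation-token", token);

  const res = await fetch(url); // anonymous request; bucket must be public
  const xml = await res.text();

  // NextContinuationToken is only present while the listing is truncated
  const next = xml.match(/<NextContinuationToken>([^<]+)</)?.[1];
  const keys = [...xml.matchAll(/<Key>([^<]+)<\/Key>/g)].map((m) => m[1]);
  return { keys, next }; // persist `next` to resume later
}
```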
### Clearing Analysis Results
- Click "🧹 Clear Analysis" button in the analysis header to reset all results
- Confirmation dialog prevents accidental clearing
- Clears: folder analysis, file selection, import options, and resumable state
- Useful for starting fresh or switching to a different folder
Once analysis completes, select how you want to import:

- Import all files - Bulk import entire folder
- Import first X files - Test with a limited sample
- Select specific files - Advanced filtering and selection

Then:

1. Optionally enable "🤖 Auto-categorize using AI"
2. Click "📄 Add Document"
3. Monitor progress in the upload progress modal
## AWS S3 Import

Requirements:

- Public S3 bucket with list permissions
- Files must be readable without authentication
URL formats:

```
# Standard format
https://my-bucket.s3.amazonaws.com/documents/

# Regional format
https://my-bucket.s3.us-east-1.amazonaws.com/data/

# S3 protocol (converted automatically)
s3://my-bucket/folder/
```
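The automatic `s3://` conversion amounts to rewriting the URL into the virtual-hosted HTTPS form. A minimal sketch of that conversion (hypothetical helper, shown for clarity):

```typescript
// Hypothetical helper illustrating the automatic s3:// conversion:
// s3://my-bucket/folder/ -> https://my-bucket.s3.amazonaws.com/folder/
function s3UrlToHttps(url: string): string {
  const m = url.match(/^s3:\/\/([^/]+)\/?(.*)$/);
  if (!m) return url; // already an HTTPS URL
  const [, bucket, prefix] = m;
  return `https://${bucket}.s3.amazonaws.com/${prefix}`;
}
```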
Example workflow:

1. Enter S3 URL:

   ```
   https://my-public-bucket.s3.amazonaws.com/research-papers/
   ```

2. Analyze results:

   ```
   Total Files: 127
   Total Size: 45.2 MB
   File Types:
   - .pdf: 98 files
   - .txt: 24 files
   - .docx: 5 files
   ```

3. Select import option:
   - Option A: Import all 127 files
   - Option B: Import first 10 files (testing)
   - Option C: Select specific PDFs only

4. Import progress:

   ```
   Processing: research-paper-001.pdf
   ✅ research-paper-001.pdf (12.3 MB)
   ✅ research-paper-002.pdf (8.7 MB)
   ⏳ research-paper-003.pdf (15.1 MB)
   ⏱️ research-paper-004.pdf (pending)
   ```
## Google Drive Import

Setup:

1. Get an API key:
   - Go to Google Cloud Console
   - Create a new project (or use an existing one)
   - Enable the Google Drive API:
     - Navigate to "APIs & Services" > "Library"
     - Search for "Google Drive API"
     - Click "Enable"
   - Create credentials:
     - Go to "APIs & Services" > "Credentials"
     - Click "Create Credentials" > "API Key"
     - Copy the API key
   - Optional: Restrict the key to the Google Drive API for security
2. Share your folder:
   - Open Google Drive
   - Right-click the folder → "Share"
   - Change to "Anyone with the link" can view
   - Copy the folder link
3. Configure the environment:

   ```
   GOOGLE_DRIVE_API_KEY=AIzaSyC_your_api_key_here
   ```
4. Restart the server:

   ```bash
   npm run server
   ```
Usage:

1. Enter your shared Google Drive folder URL
2. Click "🔍 Analyze Folder"
3. Review files (binary files only - Docs/Sheets/Slides are skipped)
4. Choose an import option and start

Example URL:

```
https://drive.google.com/drive/folders/1a2B3c4D5e6F7g8H9i0J
```
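Folder analysis for Drive corresponds to the Drive v3 `files.list` endpoint. A minimal sketch of that call (not the project's internal code), assuming the folder is shared as "Anyone with the link":

```typescript
// Minimal sketch of listing a link-shared folder via the Drive v3
// files.list endpoint (API key auth, as configured above).
async function listDriveFolder(folderId: string, apiKey: string) {
  const params = new URLSearchParams({
    q: `'${folderId}' in parents`,
    fields: "nextPageToken, files(id, name, mimeType, size)",
    key: apiKey,
  });
  const res = await fetch(`https://www.googleapis.com/drive/v3/files?${params}`);
  return res.json(); // { files: [...], nextPageToken? }; save the token to resume
}
```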
✅ Will Import:
- PDFs, Word docs (.docx), text files
- Images (if vision enabled)
- Any binary file format
❌ Will Skip:
- Google Docs (must export manually)
- Google Sheets (must export manually)
- Google Slides (must export manually)
- Google Forms, Drawings, etc.
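In Drive's metadata, Workspace files are distinguishable by MIME type, which is how a skip rule like the one above can be expressed. A sketch, assuming the `files.list` response shape:

```typescript
// Workspace files (Docs/Sheets/Slides/...) have MIME types under
// "application/vnd.google-apps." and no binary content to download.
interface DriveFile { id: string; name: string; mimeType: string }

function isDownloadable(file: DriveFile): boolean {
  return !file.mimeType.startsWith("application/vnd.google-apps.");
}
```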
## Import Options

### Import All Files

When to use:
- Complete folder migration
- Trusted content with consistent quality
Example:
Import all 127 files from bucket
### Import First X Files

When to use:
- Testing cloud import with sample data
- Validating file formats before full import
- Rate-limited processing
Example:
Import first 10 files
### Select Specific Files

When to use:
- Cherry-picking relevant documents
- Filtering by file type, size, or date
- Complex selection criteria
Features:
- Search by filename
- Filter by file type (.pdf, .txt, .docx, etc.)
- Filter by size range (< 100KB, 100KB-1MB, 1MB-10MB, etc.)
- Bulk select/deselect
- Folder navigation (breadcrumbs) for subfolders
- Scales to very large folders (50K+) via paged fetching + virtual rendering
## File Selection

To open the file selector:

1. Choose the "Select specific files" radio option
2. Click the "Choose Files..." button
3. The file selector modal opens with the full folder listing
### Folder Navigation

For providers that return full paths (like S3 keys), the file selector supports folder navigation:
- Breadcrumbs show the current folder path
- Folder rows let you drill down without losing selections
- Selections persist when you move between folders
### Filters + Folder Navigation
- When no filters are active, search is scoped to the current folder (breadcrumb navigation mode).
- When a filter is active (search/type/size), the file list switches to a global filter mode across all discovered files.
- Folder rows are hidden (to avoid mixing "global search" with folder browsing).
- Each file row shows its folder path so you can still see where it lives.
This avoids the confusion of paging through a single flat, recursive list when your dataset has many nested folders.
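A hypothetical sketch of the two modes described above (names and shapes are illustrative, not the app's actual code):

```typescript
interface CloudFile { key: string; name: string }

// With no active filters: show direct children of the current folder.
// With any filter active: match against every discovered file globally.
function visibleFiles(
  all: CloudFile[],
  currentFolder: string, // e.g. "papers/2024/"
  filters: { search?: string; ext?: string },
): CloudFile[] {
  const filterActive = Boolean(filters.search || filters.ext);
  return all.filter((f) => {
    if (!filterActive) {
      // Folder-browsing mode: direct children of the current folder only
      if (!f.key.startsWith(currentFolder)) return false;
      const rest = f.key.slice(currentFolder.length);
      return rest.length > 0 && !rest.includes("/");
    }
    // Global filter mode: match every discovered file
    if (filters.search && !f.name.toLowerCase().includes(filters.search.toLowerCase())) return false;
    if (filters.ext && !f.name.endsWith(filters.ext)) return false;
    return true;
  });
}
```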
Search Filter:
- Type: Text input
- Matches: Filename contains query (case-insensitive)
- Example: Search "2024" to find all files from 2024
File Type Filter:
- Type: Dropdown
- Options: All detected file extensions
- Example: Select ".pdf" to show only PDF files
Size Range Filter:
- Type: Dropdown
- Options:
- < 100 KB
- 100 KB - 1 MB
- 1 MB - 10 MB
- 10 MB - 100 MB
- > 100 MB
Bulk Actions:
- Select All - Select all filtered files
- Clear Selection - Deselect everything
Selection Summary:
- Available files count (after filters)
- Selected files count
- Total size of selection
Visual Indicators:
- ☐ Unchecked - Not selected
- ☑ Checked - Selected
- Blue highlight - Selected row
Each file shows:
- 📄 Filename (truncated if long)
- 📁 Folder path (shown when using global filter mode)
- 🏷️ Extension (uppercase badge)
- 📊 Size (human-readable format)
- 📅 Last Modified (if available)
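The human-readable size column can be produced with a formatter along these lines (illustrative helper, not the app's exact implementation):

```typescript
// Convert raw byte counts into the human-readable sizes shown in the UI.
function formatSize(bytes: number): string {
  const units = ["B", "KB", "MB", "GB"];
  let value = bytes;
  let unit = 0;
  while (value >= 1024 && unit < units.length - 1) {
    value /= 1024;
    unit++;
  }
  return `${value.toFixed(1)} ${units[unit]}`;
}

formatSize(47398912); // "45.2 MB"
```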
Example selection workflow:

1. Filter PDFs over 1 MB:
   - Set Type: `.pdf`
   - Set Size: `1 MB - 10 MB`
   - Result: 43 files match

2. Search for a specific topic:
   - Search: `machine-learning`
   - Result: 12 files match

3. Review and select:
   - Browse paginated results
   - Click checkboxes or rows to select
   - Selected: 8 files, 67.3 MB

4. Confirm selection:
   - Click the "Import 8 Files" button
   - The modal closes and the import starts
## Progress Tracking

Cloud imports use the same progress tracking system as file uploads.
Real-time Updates:
- Current file being processed
- Current stage/action (download, text extraction, PII scan, categorization, embedding, saving)
- Files completed vs. total
- Success/error counts
- Animated progress bar
Cloud Import Details:
- For S3 imports, the UI displays the bucket name under the current file (helps distinguish similar filenames across buckets)
File Status Icons:
- ⏱️ Pending (not started yet)
- ⏳ Processing (currently downloading/processing)
- ✅ Success (completed successfully)
- ❌ Error (failed with error message)
Actions:
- Stop - Stops after current file completes
- Back - Available after stopping; returns to the upload modal without losing the previously analyzed cloud folder state
- Resume Upload - Available after stopping (cloud imports only); continues importing the remaining queued files
- Close - Closes modal (import continues in background)
Persistence:

- Progress tracked in-memory on server
- Job ID saved in localStorage on client
- Survives page refresh
- Note: Server restart clears active jobs
If you refresh the page during an import:

- Header shows "Upload in progress..." button (only while the job is `processing`)
- Click to reopen the progress modal
- Status resumes from the last update
Note: If you stop an upload and close the progress modal, the header button returns to "Add Document". Stopped jobs are not automatically restored on refresh.
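If you drive imports programmatically, the same status can be polled via the job endpoint documented in the API Reference below. A minimal sketch (field names other than `status` are assumptions):

```typescript
// Poll an upload job until it leaves the "processing" state.
async function watchImport(jobId: string): Promise<void> {
  for (;;) {
    // filesLimit=0 keeps the response light (no per-file list)
    const res = await fetch(`/api/upload-jobs/${jobId}?filesLimit=0`);
    const job = await res.json();
    console.log(`status=${job.status}`);
    if (job.status !== "processing") break; // e.g. completed or stopped
    await new Promise((resolve) => setTimeout(resolve, 2000)); // every 2 s
  }
}
```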
## Processing Pipeline

Every cloud-imported file goes through the exact same pipeline as regular uploads:
### 1. Download

- Stream file from cloud provider
- Convert to buffer in memory
- Create file-like object
### 2. Text Extraction

- PDFs: Extract text with table detection
- DOCX: Convert to markdown
- Images (if vision enabled): Extract content via vision model
- Plain text: Direct use
### 3. AI Categorization (Optional)

If `CATEGORIZATION_MODEL` is set and "🤖 Auto-categorize" is enabled:
- Extract category, location, tags
- Detect prices, ratings, coordinates
- Generate structured metadata
### 4. PII Detection (Optional)

If `PII_DETECTION_ENABLED=true`:
- Scan for sensitive information
- Detect credit cards, SSNs, emails, etc.
- Risk level assessment
- Store findings in payload
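As a toy illustration of pattern-based scanning (these regexes are simplified examples, not the project's actual rule set or risk model):

```typescript
// Simplified PII patterns; a real detector would be far stricter.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.]+/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
  creditCard: /\b(?:\d[ -]?){13,16}\b/g,
};

// Count matches per category; findings like these end up in the payload.
function scanPII(text: string): Record<string, number> {
  const findings: Record<string, number> = {};
  for (const [kind, pattern] of Object.entries(PII_PATTERNS)) {
    findings[kind] = (text.match(pattern) ?? []).length;
  }
  return findings;
}
```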
### 5. Description Generation (Optional)

If `DESCRIPTION_MODEL` is set:
- Generate document summary
- Detect language
- Create searchable overview
### 6. Embedding

- Generate 768D semantic vector via Ollama
- Generate sparse vector for keyword matching
- Hybrid search support
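The dense vector corresponds to a call like the following against Ollama's embeddings endpoint (the model name is an assumption; any 768-dimension embedding model served by Ollama fits the description above):

```typescript
// Generate a dense embedding via Ollama's /api/embeddings endpoint.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await res.json(); // { embedding: number[] }
  return embedding;
}
```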
### 7. Storage

- Insert into Qdrant collection
- Update job status
- Increment success/error counts
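The storage step maps onto Qdrant's points upsert API. A simplified sketch (collection name and payload fields are assumptions; the real collection also carries the sparse vector for hybrid search):

```typescript
// Upsert one document into a Qdrant collection via the REST API.
async function storeDocument(vector: number[], payload: Record<string, unknown>) {
  await fetch("http://localhost:6333/collections/documents/points?wait=true", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      points: [{ id: crypto.randomUUID(), vector, payload }],
    }),
  });
}
```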
## API Reference

### List Providers

**Endpoint**: `GET /api/cloud-import/providers`

**Response**:

```json
{
  "s3": {
    "enabled": true,
    "requiresAuth": false
  },
  "gdrive": {
    "enabled": false,
    "requiresAuth": true,
    "reason": "GOOGLE_DRIVE_API_KEY not configured"
  }
}
```

### Analyze Folder

**Endpoint**: `POST /api/cloud-import/analyze`
**Request**:

```json
{
  "provider": "s3",
  "url": "https://bucket.s3.amazonaws.com/folder/"
}
```

**Response**:

```json
{
  "jobId": "analysis_1704461234567_0",
  "status": "analyzing",
  "resumed": false
}
```

### Get Analysis Job Status

**Endpoint**: `GET /api/cloud-import/analysis-jobs/:jobId`
**Response (analyzing)**:

```json
{
  "jobId": "analysis_1704461234567_0",
  "status": "analyzing",
  "provider": "s3",
  "url": "https://bucket.s3.amazonaws.com/folder/",
  "filesDiscovered": 40900,
  "totalSize": 2000000000,
  "fileTypes": { ".jpg": 43000 },
  "pagesProcessed": 43,
  "startTime": 1704461234567,
  "endTime": null
}
```

**Response (completed)** returns `files`:

```json
{
  "jobId": "analysis_1704461234567_0",
  "status": "completed",
  "files": [
    { "key": "folder/document1.pdf", "name": "document1.pdf", "size": 12345, "extension": ".pdf" }
  ]
}
```

### Pause Analysis Job

**Endpoint**: `POST /api/cloud-import/analysis-jobs/:jobId/pause`
**Response**:

```json
{ "jobId": "analysis_1704461234567_0", "status": "paused" }
```

When paused, the job status endpoint omits `files` by default (to keep polling light). Request the partial file list explicitly:

**Endpoint**: `GET /api/cloud-import/analysis-jobs/:jobId?includeFiles=1`

Used by the UI to enable "⏸️ Pause & Use Current Results".
### Look Up Analysis Job by URL

**Endpoint**: `GET /api/cloud-import/analysis-jobs/by-url?provider=s3|gdrive&url=...`

**Response**:

```json
{
  "found": true,
  "jobId": "analysis_1704461234567_0",
  "status": "paused",
  "filesDiscovered": 40900,
  "fileTypes": { ".jpg": 43000 }
}
```

### Start Import

**Endpoint**: `POST /api/cloud-import/import`
**Request (All Files)**:

```json
{
  "provider": "s3",
  "url": "https://bucket.s3.amazonaws.com/folder/",
  "files": "all",
  "autoCategorize": true
}
```

**Request (First X Files)**:

```json
{
  "provider": "s3",
  "url": "https://bucket.s3.amazonaws.com/folder/",
  "files": [
    { "key": "folder/doc1.pdf", "name": "doc1.pdf", "size": 12345 },
    { "key": "folder/doc2.pdf", "name": "doc2.pdf", "size": 23456 }
  ],
  "autoCategorize": false
}
```

**Response**:

```json
{
  "message": "Cloud import started",
  "jobId": "job_1704461234567_42",
  "fileCount": 127
}
```

### Track Import Progress

Uses the existing upload job endpoints:
**Endpoint**: `GET /api/upload-jobs/:jobId`

For large jobs, prefer paging the file list:

**Endpoint**: `GET /api/upload-jobs/:jobId?filesLimit=0`

- Lightweight polling (avoids returning the full `files` list)

**Endpoint**: `GET /api/upload-jobs/:jobId/files?offset=0&limit=200`

- Returns a slice of file statuses: `{ filesTotal, offset, limit, files }`
- Use this for scroll-driven fetching / virtualized rendering

To resume a stopped cloud-import upload job:

**Endpoint**: `POST /api/upload-jobs/:jobId/resume`

- Only supported for cloud import jobs that were stopped (status `stopped`)
- Local file uploads cannot be resumed (the server does not retain the original file payload)
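Putting the endpoints above together, a scripted import looks roughly like this (base URL and error handling are simplified assumptions; the routes and fields are the documented ones):

```typescript
const base = "http://localhost:3000"; // assumption: local dev server

async function post(path: string, body: unknown) {
  const res = await fetch(base + path, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return res.json();
}

// Analyze a public S3 folder, wait for completion, then import everything.
async function importFolder(url: string): Promise<string> {
  const { jobId } = await post("/api/cloud-import/analyze", { provider: "s3", url });

  let job: { status: string };
  do {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    job = await (await fetch(`${base}/api/cloud-import/analysis-jobs/${jobId}`)).json();
  } while (job.status === "analyzing");

  const res = await post("/api/cloud-import/import", {
    provider: "s3",
    url,
    files: "all",
    autoCategorize: true,
  });
  return res.jobId; // track with GET /api/upload-jobs/:jobId
}
```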
## Troubleshooting

### "Failed to analyze folder"

**Common Causes (S3)**:
- Bucket is not public
- Incorrect URL format
- Bucket doesn't exist
- No files in folder

**Solutions**:

1. Verify the bucket is publicly accessible:

   ```bash
   aws s3 ls s3://your-bucket/folder/ --no-sign-request
   ```

2. Check the URL format - it should end with `/`:

   ```
   ✅ https://bucket.s3.amazonaws.com/folder/
   ❌ https://bucket.s3.amazonaws.com/folder
   ```

3. Test direct access with curl:

   ```bash
   curl https://bucket.s3.amazonaws.com/folder/
   ```
**Common Causes (Google Drive)**:
- Folder not shared publicly
- Invalid API key
- Incorrect folder URL

**Solutions**:

1. Verify folder sharing:
   - Right-click the folder in Drive → "Share"
   - Must be set to "Anyone with the link" can view

2. Test the API key:

   ```bash
   curl "https://www.googleapis.com/drive/v3/files?key=YOUR_API_KEY"
   ```

3. Check the folder ID in the URL:

   ```
   ✅ https://drive.google.com/drive/folders/1a2B3c4D5e6F7g8H9i0J
   ❌ https://drive.google.com/drive/u/0/folders/... (remove u/0)
   ```
### S3 Bucket Requires Authentication

**Issue**: S3 bucket requires authentication

**Solution**: Use public buckets only, or implement AWS credentials support (not currently supported)
### Import Stuck or Slow

**Possible Causes**:
- Large file taking time to process
- Ollama service not responding
- Network issues downloading the file

**Debugging**:
- Check server logs for errors
- Verify Ollama is running: `curl http://localhost:11434/api/tags`
- Check network connectivity to S3
- Consider stopping the import and retrying with a smaller batch

### Invalid Google Drive API Key

**Issue**: API key is incorrect or not properly configured

**Solutions**:
- Verify the API key in the `.env` file
- Check that the Google Drive API is enabled in Google Cloud Console
- Try creating a new API key
- Restart the server after updating `.env`

### No Files Imported from Google Drive

**Issue**: All files are Google Workspace files (Docs/Sheets/Slides)

**Solution**: Google Workspace files must be exported manually as binary formats:
- Export Docs as .docx or .pdf
- Export Sheets as .xlsx or .csv
- Upload the exported files to a regular folder
### Imported Documents Not Searchable

**Issue**: Files may have failed text extraction

**Solutions**:
- Check upload job errors for failed files
- Verify file formats are supported
- Check server logs for extraction errors
- Try re-uploading specific failed files
### "GOOGLE_DRIVE_API_KEY not configured"

**Issue**: `GOOGLE_DRIVE_API_KEY` is not configured

**Solution**: Follow the Google Drive Setup steps above
## Best Practices

- Test with "Import first 10 files" before importing entire folders
- For large folders, use the file selector to:
  - Exclude unwanted file types
  - Skip oversized files
  - Import priority documents first
- If your dataset has many subfolders, navigate by folder (breadcrumbs) instead of relying on a flat list
- Get better metadata extraction by setting a categorization model:

  ```
  CATEGORIZATION_MODEL=gemma3:4b
  ```

- Don't close the browser during large imports - keep the progress modal visible
After import completes:
- Search for imported documents
- Check document counts in collections
- Verify metadata extraction worked
Review error list in progress modal:
- Check file formats
- Verify file sizes within limits
- Re-upload failed files individually if needed
## Examples

### Example 1: Import an Entire Folder

```
URL: https://research-bucket.s3.amazonaws.com/papers-2024/
Analysis: 85 PDFs, 234 MB total
Option: Import all files
Auto-categorize: ✅ Enabled
Result: 85 documents added, fully searchable
```
### Example 2: Select Specific Files

```
URL: https://docs-bucket.s3.amazonaws.com/legal/
Analysis: 450 mixed files, 1.2 GB total
Option: Select specific files
Filters:
- Type: .pdf
- Size: < 10 MB
- Search: "contract"
Selected: 23 files
Result: 23 contracts imported
```
### Example 3: Quick Test Import

```
URL: s3://training-data/samples/
Analysis: 1000+ files
Option: Import first 50 files
Auto-categorize: ❌ Disabled (faster)
Result: Quick import for testing
```
## Related Documentation

- Web UI Guide - Complete UI documentation
- File Upload - Regular file upload system
- PII Detection - Sensitive data scanning
- Testing - Testing strategies
Ready to import? Open the web UI and try it out! 🚀