- Overview
- Features
- How to Use
- Technical Implementation
- Configuration
- Troubleshooting
- Best Practices
- Future Enhancements
- Related Documentation
- Technical References
The Document Cluster Visualization feature provides an interactive 2D scatter plot that visualizes your document collection using dimensionality reduction. Documents with similar semantic content appear closer together, making it easy to discover patterns, clusters, and outliers in your data.
- 2D Scatter Plot: View your entire document collection in a 2D space using UMAP dimensionality reduction
- Interactive Navigation: Click, zoom, pan, and select documents
- Color Coding: Visualize by category, PII risk level, or upload date
- Multi-Selection: Use box or lasso selection to select multiple documents at once
- Quick Navigation: Click any document to view similar documents
- Smart Caching: Fast repeat loads with automatic cache invalidation
- Real-Time Stats: See cache age, document count, and selection info
- Click the 📊 Clusters button in the header
- The URL will change to `/clusters`
- Wait for the visualization to generate (5-10 seconds for first load)
- Hover: View document title, category, location, and PII risk
- Click: Select a document and view similar documents
- Zoom: Scroll wheel or pinch gesture
- Pan: Click and drag (when not in selection mode)
- Box Select (default): Click and drag to select documents in a rectangular area
- Lasso Select: Use the toolbar to switch to freeform selection
- Clear Selection: Click "Clear" button or click outside the selection
Use the "Color by" dropdown to change visualization:
- Category: Documents colored by their category (Restaurant, Hotel, Technology, etc.)
- PII Risk: Documents colored by PII risk level (none, low, medium, high, critical)
- Upload Date: Documents colored by upload date (blue = recent, red = older)
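The upload-date scheme can be pictured as a linear blend between the two colors. The following is a minimal sketch, not the component's actual code: the function names and the exact RGB interpolation are assumptions, and only the direction (blue = recent, red = older) comes from the description above.

```javascript
// Map one timestamp onto a red (oldest) to blue (newest) gradient.
// minMs and maxMs are the oldest and newest timestamps in the collection.
function dateToColor(dateMs, minMs, maxMs) {
  const span = maxMs - minMs;
  // If every document shares one date, treat them all as "recent".
  const recency = span === 0 ? 1 : (dateMs - minMs) / span; // 0 = oldest, 1 = newest
  const clamped = Math.min(1, Math.max(0, recency));
  const red = Math.round(255 * (1 - clamped));
  const blue = Math.round(255 * clamped);
  return `rgb(${red}, 0, ${blue})`;
}

// Color every point by its ISO `date` field (hypothetical helper).
function colorsByUploadDate(points) {
  const times = points.map((p) => Date.parse(p.date));
  let minMs = Infinity;
  let maxMs = -Infinity;
  for (const t of times) {
    if (t < minMs) minMs = t;
    if (t > maxMs) maxMs = t;
  }
  return times.map((t) => dateToColor(t, minMs, maxMs));
}
```

The resulting array could be passed as per-point marker colors; a numeric recency value combined with a plotting-library colorscale would work equally well.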
When you select multiple documents:
- View up to 10 selected documents in the panel below
- Click any document to navigate to its details
- See document metadata (category, location, PII risk)
- Clear selection with the "Clear" button
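As a rough illustration of the selection panel, the sketch below turns the event data from a Plotly `plotly_selected` handler into the list of documents to display. It assumes all points are drawn in a single trace, in the same order as the `points` array from the API; if the plot uses one trace per category, the index lookup would need to account for that. The function and constant names are hypothetical.

```javascript
const MAX_PANEL_DOCUMENTS = 10; // the panel shows up to 10 selected documents

// `eventData` is the argument Plotly passes to a `plotly_selected` handler.
// It can be undefined when the selection is empty.
// `points` is the array of documents returned by the scatter endpoint.
function selectedDocuments(eventData, points, max = MAX_PANEL_DOCUMENTS) {
  if (!eventData || !Array.isArray(eventData.points)) return [];
  return eventData.points
    .map((pt) => points[pt.pointNumber]) // index of the point within its trace
    .filter((doc) => doc !== undefined)
    .slice(0, max);
}
```

A handler would typically be registered with `plotElement.on('plotly_selected', (e) => ...)` and would pass the result to the panel's state.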
- Automatic: Cache automatically refreshes after 1 hour (default)
- Manual: Click the 🔄 Refresh button to force regeneration
- On Upload: Cache clears automatically when new documents are added
┌─────────────────────────────────────────────────────────┐
│ Frontend (Vue 3) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ DocumentClusterView.vue │ │
│ │ - View management │ │
│ │ - Data loading │ │
│ │ - Selection handling │ │
│ └────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌────────────────▼────────────────────────────────┐ │
│ │ ScatterPlot.vue │ │
│ │ - Plotly.js integration │ │
│ │ - Event handling │ │
│ │ - Color schemes │ │
│ └─────────────────────────────────────────────────┘ │
└───────────────────┬─────────────────────────────────────┘
│ HTTP
│
┌───────────────────▼─────────────────────────────────────┐
│ Backend (Express.js) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ API Endpoints │ │
│ │ - GET /api/visualize/scatter │ │
│ │ - POST /api/visualize/refresh │ │
│ │ - GET /api/visualize/stats │ │
│ └────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌────────────────▼────────────────────────────────┐ │
│ │ VisualizationService │ │
│ │ - UMAP dimensionality reduction │ │
│ │ - Cache management (Memory or Redis) │ │
│ │ - Data transformation │ │
│ └────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌────────────────▼────────────────────────────────┐ │
│ │ Cache Strategy (Interface) │ │
│ │ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ InMemoryCache │ │ RedisCache │ │ │
│ │ │ - Fast access │ │ - Distributed │ │ │
│ │ │ - TTL based │ │ - Persistent │ │ │
│ │ └────────────────┘ └────────────────┘ │ │
│ └────────────────┬────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────┘
│
┌───────────────────▼─────────────────────────────────────┐
│ Qdrant Vector DB │
│ - 768D dense vectors (embeddings) │
│ - Document metadata │
│ - Scroll API for bulk retrieval │
└─────────────────────────────────────────────────────────┘
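The bulk retrieval step at the bottom of the diagram can be sketched as a pagination loop over Qdrant's scroll endpoint. This is an illustration under assumptions rather than the service's real code: `scrollAll` and `makeQdrantFetchPage` are hypothetical names, the page size is arbitrary, and the base URL and collection name are placeholders supplied by the caller.

```javascript
// Collect every point from a paginated source.
// `fetchPage(offset)` performs one scroll request and resolves to
// { points, next_page_offset }, the shape of `result` in a scroll response.
async function scrollAll(fetchPage, maxPoints = Infinity) {
  const all = [];
  let offset = null; // null requests the first page
  do {
    const page = await fetchPage(offset);
    all.push(...page.points);
    offset = page.next_page_offset ?? null; // null means no more pages
  } while (offset !== null && all.length < maxPoints);
  return all.slice(0, maxPoints);
}

// One possible fetchPage built on the global fetch (Node 18+ or a browser).
function makeQdrantFetchPage(baseUrl, collection, pageSize = 256) {
  return async (offset) => {
    const body = { limit: pageSize, with_payload: true, with_vector: true };
    if (offset !== null) body.offset = offset; // omit the offset on the first call
    const res = await fetch(`${baseUrl}/collections/${collection}/points/scroll`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`Qdrant scroll failed: ${res.status}`);
    return (await res.json()).result;
  };
}
```

Keeping the loop separate from the HTTP call makes it easy to exercise the pagination logic with a fake `fetchPage`.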
The visualization uses UMAP (Uniform Manifold Approximation and Projection) to reduce 768-dimensional embedding vectors to 2D coordinates:
Why UMAP?
- Preserves both local and global structure
- Faster than t-SNE for large datasets
- Better at maintaining cluster relationships
- Handles non-linear patterns well
Parameters:
{
nComponents: 2, // Output dimensions
nNeighbors: 15, // Local neighborhood size
minDist: 0.1, // Minimum distance between points
spread: 1.0 // Effective scale of embedded points
}

The service supports two caching strategies via the `VIZ_CACHE_STRATEGY` environment variable:
VIZ_CACHE_STRATEGY=memory
VIZ_CACHE_TTL=3600000 # 1 hour

Pros:
- Ultra-fast access (~1ms)
- No external dependencies
- Simple setup
Cons:
- Lost on server restart
- Not shared across instances
- Limited by RAM
Memory Usage:
- 100 documents ≈ 50 KB
- 1,000 documents ≈ 520 KB
- 10,000 documents ≈ 5.2 MB
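The figures above work out to roughly 520 bytes per cached document. A back-of-the-envelope estimator, assuming the footprint scales linearly with document count (an inference from these numbers, not a measured guarantee), might look like this:

```javascript
// Approximate per-document footprint implied by "1,000 documents ≈ 520 KB".
const APPROX_BYTES_PER_DOCUMENT = 520;

// Estimate the in-memory cache size for a given number of documents.
function estimateCacheBytes(documentCount) {
  return documentCount * APPROX_BYTES_PER_DOCUMENT;
}

// Render a byte count using decimal units (1 KB = 1,000 bytes).
function formatBytes(bytes) {
  if (bytes >= 1_000_000) return `${(bytes / 1_000_000).toFixed(1)} MB`;
  if (bytes >= 1_000) return `${Math.round(bytes / 1_000)} KB`;
  return `${bytes} B`;
}
```

Actual usage depends on title, snippet, and tag lengths, so treat the result as an order-of-magnitude guide.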
VIZ_CACHE_STRATEGY=redis
REDIS_URL=redis://localhost:6379
VIZ_CACHE_TTL=3600000 # 1 hour

Pros:
- Persistent across restarts
- Shared across instances
- Distributed architecture
Cons:
- Requires Redis server
- Slightly slower (~5-10ms)
- Additional infrastructure
When to Use Redis:
- Multiple server instances
- High-availability requirements
- Large document collections (>100K)
- Frequent server restarts
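To make the relationship between the three environment variables concrete, here is a minimal sketch of how a service could resolve them at startup. The function name and the validation rules (rejecting unknown strategies, requiring `REDIS_URL` only for Redis) are assumptions for illustration; the real service may behave differently.

```javascript
const DEFAULT_TTL_MS = 3600000; // 1 hour, the documented default

// `env` is a plain object such as process.env.
function resolveCacheConfig(env) {
  const strategy = (env.VIZ_CACHE_STRATEGY || 'memory').toLowerCase();
  if (strategy !== 'memory' && strategy !== 'redis') {
    throw new Error(`Unknown VIZ_CACHE_STRATEGY: ${strategy}`);
  }

  const ttlMs = Number(env.VIZ_CACHE_TTL ?? DEFAULT_TTL_MS);
  if (!Number.isFinite(ttlMs) || ttlMs <= 0) {
    throw new Error('VIZ_CACHE_TTL must be a positive number of milliseconds');
  }

  if (strategy === 'redis' && !env.REDIS_URL) {
    throw new Error('REDIS_URL is required when VIZ_CACHE_STRATEGY=redis');
  }

  return strategy === 'redis'
    ? { strategy, ttlMs, redisUrl: env.REDIS_URL }
    : { strategy, ttlMs };
}
```

Failing fast on a missing `REDIS_URL` surfaces misconfiguration at startup instead of at the first cache access.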
GET /api/visualize/scatter returns the cached 2D scatter plot data, or generates it if nothing is cached.
Query Parameters:
- `refresh` (boolean): Force cache refresh
- `limit` (integer): Max documents to visualize (default: 5000, range: 100-50000)
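A client could assemble the request URL from these parameters as sketched below. The helper name is hypothetical, and two details are assumptions: that out-of-range limits are clamped on the client (the server may reject them instead) and that the boolean is sent as the string `true`.

```javascript
const LIMIT_MIN = 100;
const LIMIT_MAX = 50000;
const LIMIT_DEFAULT = 5000;

// Build the scatter endpoint URL, keeping `limit` inside the documented range.
function buildScatterUrl(baseUrl, { refresh = false, limit = LIMIT_DEFAULT } = {}) {
  const clamped = Math.min(LIMIT_MAX, Math.max(LIMIT_MIN, Math.trunc(limit)));
  const url = new URL('/api/visualize/scatter', baseUrl);
  url.searchParams.set('limit', String(clamped));
  if (refresh) url.searchParams.set('refresh', 'true');
  return url.toString();
}
```

For example, `buildScatterUrl('http://localhost:3001', { limit: 1000 })` yields a URL ending in `/api/visualize/scatter?limit=1000`.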
Response:
{
"success": true,
"data": {
"points": [
{
"id": "doc_123",
"x": 1.234,
"y": -2.456,
"title": "Document Title",
"category": "Technology",
"location": "San Francisco",
"tags": ["AI", "ML"],
"piiRisk": "low",
"date": "2025-12-30T10:00:00.000Z",
"snippet": "First 150 characters..."
}
],
"metadata": {
"totalDocuments": 100,
"visualizedDocuments": 100,
"generatedAt": 1735560000000,
"method": "umap",
"parameters": {
"nNeighbors": 15,
"minDist": 0.1
},
"processingTime": {
"umap": 4523
}
},
"fromCache": true,
"cacheAge": 120000
}
}

POST /api/visualize/refresh forces a cache clear and regenerates the visualization.
Request Body:
{
"limit": 1000
}

Response: Same as /api/visualize/scatter
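Because both endpoints return the same shape, one function can convert `data.points` into something Plotly.js can draw. The sketch below is one plausible mapping, not the actual `ScatterPlot.vue` logic; the function name and hover text layout are assumptions, while the trace attributes used are standard Plotly scatter attributes.

```javascript
// Convert the `points` array from the API response into a single Plotly trace.
function toPlotlyTrace(points) {
  return {
    type: 'scatter',
    mode: 'markers',
    x: points.map((p) => p.x),
    y: points.map((p) => p.y),
    // Hover text with the fields shown on hover: title, category, location, PII risk.
    text: points.map(
      (p) => `${p.title}<br>${p.category} · ${p.location}<br>PII risk: ${p.piiRisk}`
    ),
    hoverinfo: 'text',
    // customdata lets click handlers recover the document id for navigation.
    customdata: points.map((p) => p.id),
  };
}
```

The trace could then be rendered with `Plotly.newPlot(element, [toPlotlyTrace(points)])`.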
GET /api/visualize/stats returns cache statistics.
Response:
{
"success": true,
"stats": {
"strategy": "memory",
"entries": 1,
"memoryUsage": 52480
}
}

| Operation | In-Memory Cache | Redis Cache | No Cache |
|---|---|---|---|
| First Load | ~5-10s | ~5-10s | ~5-10s |
| Cache Hit | ~1ms | ~5-10ms | N/A |
| Cache Miss | ~5-10s | ~5-10s | ~5-10s |
| Memory (100 docs) | 50 KB | 0 KB | 0 KB |
| Memory (10K docs) | 5.2 MB | 0 KB | 0 KB |
UMAP Processing Time:
- 100 documents: ~2-3 seconds
- 1,000 documents: ~5-8 seconds
- 10,000 documents: ~30-45 seconds
Cache is automatically cleared when:
- TTL Expires: Default 1 hour (`VIZ_CACHE_TTL`)
- Manual Refresh: User clicks refresh button
- Document Count Changes: New uploads detected
- Server Restart: In-memory cache only
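The first three rules can be expressed in a few lines. The class below is a simplified, single-entry sketch under assumptions: `SketchVizCache` is a hypothetical name, and detecting uploads by comparing document counts is inferred from the "Document Count Changes" rule, not taken from the service source.

```javascript
class SketchVizCache {
  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs = 3600000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entry = null;
  }

  // Store freshly generated data together with the document count it was built from.
  set(data, documentCount) {
    this.entry = { data, documentCount, generatedAt: this.now() };
  }

  // Return cached data, or null when the visualization must be regenerated.
  get(currentDocumentCount) {
    if (!this.entry) return null;
    const expired = this.now() - this.entry.generatedAt >= this.ttlMs; // TTL expires
    const countChanged = this.entry.documentCount !== currentDocumentCount; // new uploads
    if (expired || countChanged) {
      this.clear();
      return null;
    }
    return this.entry.data;
  }

  // Manual refresh: drop the entry so the next get() misses.
  clear() {
    this.entry = null;
  }
}
```

The fourth rule needs no code: an in-process object like this disappears when the server restarts.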
Add to your .env file:
# Visualization Cache Strategy
VIZ_CACHE_STRATEGY=memory # 'memory' (default) or 'redis'
VIZ_CACHE_TTL=3600000 # Cache TTL in milliseconds (1 hour)
REDIS_URL=redis://localhost:6379 # Required if using Redis

No additional frontend configuration is needed. The visualization automatically:
- Detects the API URL from `VITE_API_URL` or defaults to `http://localhost:3001`
- Adapts to screen size responsively
- Handles errors gracefully with retry options
Symptom: The spinner runs indefinitely or an error message appears
Solutions:
- Check that the backend server is running (`http://localhost:3001`)
- Check that the collection has documents (at least 1 is needed)
- Check browser console for errors
- Try manual refresh with the 🔄 button
Symptom: Generation takes more than 30 seconds
Solutions:
- Reduce the document count with the `limit` parameter
- Enable the Redis cache so generated results survive restarts and are shared across instances
- Check server CPU/memory usage
- Consider sampling large collections
Symptom: New documents not appearing
Solutions:
- Click manual refresh button
- Wait for TTL to expire (default 1 hour)
- Restart server (clears in-memory cache)
- Check the `VIZ_CACHE_TTL` setting
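The manual refresh can also be triggered from a script by calling the refresh endpoint directly. The following sketch assumes the `limit` field in the body is optional and that the response carries the same `success` and `data` fields as the scatter endpoint; both function names are hypothetical.

```javascript
// Build the URL and fetch options for a forced regeneration.
function buildRefreshRequest(baseUrl, limit) {
  return {
    url: new URL('/api/visualize/refresh', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(limit === undefined ? {} : { limit }),
    },
  };
}

// Send the request; `fetchImpl` defaults to the global fetch (Node 18+ or a browser).
async function forceRefresh(baseUrl, limit, fetchImpl = fetch) {
  const { url, options } = buildRefreshRequest(baseUrl, limit);
  const res = await fetchImpl(url, options);
  if (!res.ok) throw new Error(`Refresh failed with HTTP ${res.status}`);
  const body = await res.json();
  if (!body.success) throw new Error('Refresh reported failure');
  return body.data;
}
```

After a successful call, `data.fromCache` in the returned object would be expected to be false, since the plot was just regenerated.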
Symptom: "Redis connection failed" errors
Solutions:
- Verify Redis is running: `redis-cli ping`
- Check the `REDIS_URL` environment variable
- Test the connection: `redis-cli -u redis://localhost:6379`
- Fall back to the memory cache: `VIZ_CACHE_STRATEGY=memory`
Symptom: Many points stacked on top of each other
Solutions:
- Adjust UMAP parameters in `services/visualization-service.js`:
  - Increase `minDist` for more spread (0.1 → 0.3)
  - Decrease `nNeighbors` for looser clustering (15 → 10)
- Try different color schemes to distinguish overlapping categories
- Zoom in on dense clusters
- Use in-memory cache
- Keep default UMAP parameters
- Visualize all documents
- Use in-memory or Redis cache
- Default settings work well
- Consider filtering by category first
- Use Redis cache for production
- Set visualization limit to 1,000 or sample strategically
- Increase cache TTL to reduce regeneration frequency
- Consider LRU eviction for very large collections
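One simple way to sample a large collection down to the visualization limit is uniform random sampling without replacement. The sketch below uses a partial Fisher-Yates shuffle; it is only one possible strategy (a "strategic" sample might instead stratify by category), and the function name is hypothetical.

```javascript
// Pick up to `limit` documents uniformly at random, without duplicates.
// `random` is injectable so the result can be made deterministic in tests.
function sampleDocuments(docs, limit, random = Math.random) {
  const pool = docs.slice(); // copy so the caller's array is not reordered
  const n = Math.min(limit, pool.length);
  for (let i = 0; i < n; i++) {
    // Choose a random element from the not-yet-selected tail and swap it into place.
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}
```

Only the first `limit` positions are shuffled, so the cost scales with the sample size plus one array copy, not with a full shuffle of the collection.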
- Use Redis cache for multi-instance setups
- Monitor cache memory usage
- Set reasonable TTL (1-24 hours)
- Enable error tracking
- Consider CDN for static assets
Potential improvements for future versions:
- 3D Visualization: Option for 3D scatter plots with WebGL
- Hierarchical Clustering: Automatic cluster detection and labeling
- Incremental Updates: Add new documents without full regeneration
- Export Options: Download plot as PNG/SVG
- Custom Filters: Filter documents before visualization
- Similarity Lines: Draw connections between similar documents
- Animation: Smooth transitions when changing color schemes
- Cluster Statistics: Show cluster sizes and characteristics
- Search in Plot: Highlight documents matching search query
- Time-Series View: Animate document additions over time
- Advanced Queries - Learn about recommendation and similarity search
- Mixed Dataset - Understanding the test dataset
- Quick Reference - API quick reference
- Summary - Project overview
- UMAP Algorithm - Dimensionality reduction theory
- Plotly.js - Interactive plotting library
- Redis - Caching documentation
- Qdrant - Vector database docs
