All notable changes to LocalLab will be documented in this file.
- Perfect ASCII art formatting - Fixed broken CHAT banner characters with clean, well-formed block display
- Enhanced loading indicator - Added "generating..." text after triple dots (◦ ◦ ◦) with dynamic removal when AI responds
- AI response padding fix - Implemented guaranteed 6-space left indentation under model name headers, eliminating flush-left margin issues
- Aesthetic ASCII art colors - Beautiful purplish color scheme: blue for LOCALLAB (bluish-purplish) and magenta for CHAT (purplish)
- Visual hierarchy perfection - CHAT banner as clearly dominant element with LOCALLAB as subtle secondary presence
- Loading behavior enhancement - Smooth indicator appearance/removal with proper alignment matching AI response indentation
- AI response alignment - Resolved persistent issue where responses appeared flush against left margin instead of properly indented
- ASCII art malformation - Repaired broken characters in CHAT banner ensuring perfect formatting across all terminals
- Loading indicator positioning - Fixed alignment to match 6-space AI response padding for consistent visual structure
- Markdown rendering padding - Enhanced dual approach: Rich Padding for complex content, direct padding for plain text
- Enhanced `_show_generating_indicator()` - Added "generating..." text with aesthetic triple dots and proper 6-space alignment
- Improved AI response rendering - Dual approach handling both markdown and plain text with guaranteed indentation
- Updated `_add_text_padding()` - Consistent 6-space padding for all content types with multi-line support
- ASCII art color optimization - Implemented subtle, pleasant colors perfect for dark terminal backgrounds
- Dynamic indicator clearing - ANSI escape codes for seamless loading indicator removal
- Professional alignment - All AI responses properly indented with consistent 6-space left padding
- Clear generation feedback - Enhanced loading indicator with explicit "generating..." text for better user understanding
- Aesthetic appeal - Beautiful purplish color scheme creating warm, inviting, and professional appearance
- Smooth interactions - Dynamic loading behavior with seamless transitions from indicator to response
- Perfect formatting - Flawless ASCII art display with proper visual hierarchy and clean character rendering
- CLI chat interface redesign - Modern minimal aesthetic with ASCII banner, professional color scheme, and enhanced visual hierarchy
- Response generation reliability - 3-retry logic with auto-reconnection, comprehensive format support, and special token cleaning
- Chat interface improvements - Removed verbose startup info, optimized spacing, minimal shutdown messages, and horizontal padding
- ASCII art banner - Complete "LOCALLAB" display with clean bottom-line design and balanced color brightness
- Visual distinction - Enhanced user/AI message contrast with subdued AI response colors for better readability
- Response parsing failures - Robust handling of 120+ second generation times, Chinese text, and conversation end markers
- Connection recovery - Automatic reconnection between retry attempts with exponential backoff delays
- Import error - Added missing `Group` import in `locallab/cli/ui.py` for proper Rich component rendering
- Updated `locallab/cli/ui.py`, `locallab/cli/chat.py`, `locallab/cli/connection.py` - Enhanced error handling with comprehensive debug logging and response validation
- HuggingFace Hub progress bar error - Fixed `'enable_progress_bars'` attribute error with version-agnostic fallback methods
- BetterTransformer compatibility - Updated for transformers>=4.49.0 with intelligent version detection and native PyTorch fallbacks
- CUDA warnings - Changed alarming "CUDA not available" to informative "GPU not detected - running in CPU mode"
- Flash Attention messages - Improved warnings with installation guidance
- Optimization fallbacks - Added comprehensive error handling with result tracking and summary logging
- Download process reliability - Continues smoothly even when optimizations fail with clear user feedback
- Updated `locallab/utils/progress.py`, `locallab/utils/early_config.py`, `locallab/model_manager.py` - Improved logging levels from warnings to informative messages
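A version-agnostic toggle like the one described can be written by probing for the helper instead of assuming it exists; in practice the module passed in would be `huggingface_hub.utils` (an assumption about the call site, since the module layout varies across library versions):

```python
def set_progress_bars(hub_utils, enabled: bool) -> bool:
    """Toggle progress bars via whichever helper the installed
    huggingface_hub version exposes; return False instead of raising
    an AttributeError when the helper is missing."""
    name = "enable_progress_bars" if enabled else "disable_progress_bars"
    toggle = getattr(hub_utils, name, None)
    if toggle is None:  # older/newer layouts may not expose it
        return False
    toggle()
    return True

# Intended use (hedged - depends on the installed version):
#   from huggingface_hub import utils as hub_utils
#   set_progress_bars(hub_utils, True)
```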
- New `locallab models` command group - Complete model management system with HuggingFace Hub integration
- Model discovery - Search LocalLab registry and HuggingFace Hub with filtering, sorting, and popularity metrics
- Local model management - Download, list, remove, and get info on cached models with progress indicators
- Cache management - Intelligent cache with metadata tracking and cleanup functionality
- Rich UI components - Beautiful tables and progress bars with multiple output formats (table/JSON)
- Offline capability - Registry models available without internet, downloaded models work offline
- System compatibility checks - Hardware requirements validation before downloads
```bash
locallab models list          # List locally cached models
locallab models download <id> # Download a model locally
locallab models remove <id>   # Remove a cached model
locallab models discover      # Discover available models
locallab models info <id>     # Show detailed model information
locallab models clean         # Clean up orphaned cache files
```

- Model loading integration - Better integration with existing model loading system
- Documentation - Comprehensive model management guide with CLI reference updates
- Error handling - Robust network resilience and graceful fallbacks
- Performance - Faster server startup with pre-downloaded models and better resource utilization
- Added `locallab/utils/model_cache.py` for centralized cache management
- Added `locallab/utils/huggingface_search.py` for Hub integration
- Enhanced architecture with modular design and proper API integration
- Dynamic generation mode switching - Inline syntax (`--stream`, `--chat`, `--simple`, `--batch`) to override modes per message
- Enhanced chat interface - Visual feedback for mode changes, better error messages, seamless mode switching
- Documentation overhaul - New CLI documentation hub, restructured README with chat interface prominence
- Improved getting started - 3-step quick start process with better visual hierarchy and formatting
- Chat interface - Better command documentation, clear benefits for each mode, comprehensive examples
- Documentation quality - Consistent formatting, logical organization, proper cross-references, fixed markdown issues
- User experience - Chat interface as primary interaction method, simplified onboarding, better feature discovery
- Regex-based parsing - Robust pattern matching for inline mode detection with comprehensive error handling
- Documentation infrastructure - Markdown validation, link verification, unified formatting guidelines
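The regex-based inline mode parsing might look roughly like this (the flag names come from the entry above; the parsing details are an illustrative assumption):

```python
import re

# Recognized inline mode flags, per the inline syntax entry above
_MODE_FLAGS = ("stream", "chat", "simple", "batch")
_MODE_RE = re.compile(r"\s--(%s)\b" % "|".join(_MODE_FLAGS))

def parse_inline_mode(message: str):
    """Return (mode, cleaned_message); mode is None if no flag is given.
    A leading space is prepended so a flag at the start still matches."""
    padded = " " + message
    match = _MODE_RE.search(padded)
    if not match:
        return None, message
    cleaned = _MODE_RE.sub("", padded).strip()
    return match.group(1), cleaned
```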
- Fresh installation configuration bug - Fixed confusing "enable_quantization: True" display with intelligent detection
- Configuration interface issues - Fixed quantization descriptions, model selection limitations, spacing, and boolean handling
- User experience problems - Proper welcome screen instead of raw config dump for new users
- Fresh installation welcome - Attractive welcome screen with step-by-step guidance and visual indicators
- Enhanced configuration flow - Improved model selection (registry + custom HuggingFace IDs), better optimization settings
- Visual design improvements - Modern emoji-rich interface with consistent hierarchy and better formatting
- Organized configuration summary - Logical grouping (Model, Optimization, Access) with enhanced completion screens
- Configuration UX - Redesigned welcome flow, enhanced reconfigure experience, better error handling
- Interface organization - Reorganized settings into logical groups with improved information hierarchy
- Updated `locallab/server.py` and `locallab/cli/interactive.py` with fresh install detection and visual improvements
- Critical disk offloading error - Fixed "You are trying to offload the whole model to the disk" preventing LLM loading
- Qwen2.5-VL model loading - Added comprehensive fallback logic and proper model class detection
- Server stability - Fixed repeated startup callbacks spamming logs every 30 seconds
- Device mapping strategy - Intelligent GPU memory detection with safe device assignments (`cuda:0`/`cpu`) instead of `device_map="auto"`
- Error recovery - CPU retry logic when GPU loading fails, comprehensive error detection with appropriate fallbacks
- Smart device management - New `_get_safe_device_map()` method with GPU memory inspection and adaptive configuration
- Enhanced model support - Universal text generation and vision-language model support with cross-platform compatibility
- Multi-level fallbacks - Multiple fallback strategies ensure successful model loading
- Model loading process - Updated quantization configuration, enhanced model class detection, optimized memory usage
- Dependencies - Updated transformers requirement to `>=4.49.0` for Qwen2.5-VL support
- Updated `locallab/model_manager.py`, `locallab/server.py`, `requirements.txt`, `setup.py`
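The idea behind `_get_safe_device_map()` can be sketched as a pure decision function; the real method additionally inspects GPU memory via torch, which is abstracted into parameters here (a sketch under those assumptions, not the actual implementation):

```python
def get_safe_device_map(gpu_free_gb: float, model_size_gb: float,
                        headroom: float = 1.2) -> dict:
    """Pick an explicit device assignment instead of device_map='auto',
    which can silently try to offload the whole model to disk. The
    model goes to cuda:0 only when free GPU memory covers the weights
    plus some headroom; otherwise everything stays on CPU."""
    if gpu_free_gb >= model_size_gb * headroom:
        return {"": "cuda:0"}  # whole model on the first GPU
    return {"": "cpu"}         # safe CPU fallback
```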
- Fixed Hugging Face model download issues with ModelManager and loading processes
- Improved model downloading and loading reliability and performance
- Critical `max_time` parameter error - Fixed "unexpected keyword argument 'max_time'" in ModelManager.generate()
- Added `max_time` parameter to all generation endpoints and request models
- Set default `max_time` to 180 seconds with improved timeout error handling
- `max_time` parameter - Added to both async and sync clients for server-side generation time limits
- Enhanced error handling for timeout-related issues with optional parameter (default: 180s)
- Fixed `max_time` parameter handling between client and server with proper timeout support
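Server-side enforcement of a `max_time` budget can be sketched as a time-bounded generation loop (a simplified illustration, not the actual endpoint code; `next_token` stands in for the model's token step):

```python
import time

def generate_tokens(next_token, max_new_tokens=4096, max_time=180.0):
    """Yield tokens until the model emits end-of-sequence (None), the
    token budget is exhausted, or max_time seconds have elapsed."""
    deadline = time.monotonic() + max_time
    for _ in range(max_new_tokens):
        if time.monotonic() >= deadline:
            break  # server-side time limit reached
        token = next_token()
        if token is None:  # end-of-sequence
            break
        yield token
```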
- Stream generation quality - Enhanced token generation parameters, stop sequence detection, repetition prevention, optimized buffering
- Non-streaming generation - Improved all endpoints with repetition detection, special token handling, conversation markers
- Memory management - Reduced memory check frequency, smarter cache clearing thresholds, better OOM recovery
- Increased default max_length (2048→4096), token batch size (4→8), max_time (180s)
- Adjusted parameters: top_k (80), top_p (0.92), repetition_penalty (1.15) for better quality
- HuggingFace progress bars - Fixed critical display error with version-agnostic approach and multiple fallback methods
- Enhanced error handling during downloads with improved early configuration system
- Fixed critical HuggingFace progress bars display error
- Improved early configuration system for proper logging setup
- Model downloading redesign - Native HuggingFace progress bars with early configuration system
- Visual improvements - StdoutRedirector, clear separation between LocalLab logs and HF progress bars
- Download experience - Informative messages, consistent display across model types
- Native progress bars - Used HuggingFace's original progress bars instead of custom logger
- Fixed download log interception with proper visual formatting
- Native HuggingFace progress bars with fixed interleaved display and clear success messages
- CLI configuration - Fixed optimization settings not being saved properly, updated defaults to enabled
- Fixed CLI config environment variable issue
- `do_sample` parameter - Added to all generation endpoints with API documentation and examples
- Clear messages before/after model downloads for better UX
- Model download logs - Fixed interleaving with custom progress bar system and suppressed regular logs
- Client `do_sample` error - Added parameter to all client methods (client v1.0.9)
- Client parameter errors - Fixed repetition_penalty, missing top_k, sync/async mismatch (client v1.0.8)
- Consistent parameter handling with accurate docstrings
- UI banner redesign - Removed side borders, cleaner layout, better spacing and readability
- Enhanced visual consistency across INITIALIZING, RUNNING, and ngrok banners
- Response Quality Settings - Optional CLI section with detailed parameter descriptions
- Enhanced parameters - max_length (4096→8192), top_k (50→80), max_time (120s), repetition_penalty (1.1→1.15)
- Improved streaming - Larger token batches (4 tokens), better stop sequence and repetition detection
- Better error recovery - Enhanced OOM handling and memory management
- Client improvements - Timeouts (180→300s), max_length (1024→8192), added top_k and repetition_penalty
- Client Package (v1.0.7) - Increased timeouts, better streaming buffering, improved error handling
- Redesigned all UI banners with modern, aesthetic styling
- Enhanced INITIALIZING and RUNNING banners with box-style borders and improved spacing
- Redesigned ngrok tunnel banner with a modern box layout and better visual hierarchy
- Added informative notes to the ngrok banner for better user guidance
- Improved overall visual consistency and readability across all UI elements
- Enhanced color scheme for better visual appeal and readability
- Fixed model download progress bars to display sequentially instead of interleaved
- Implemented custom progress bar handler for HuggingFace Hub downloads
- Added proper synchronization for multiple concurrent download progress bars
- Enhanced logging during model downloads for better readability
- Improved visual clarity of download progress information
- Fixed extra spacing at the boundaries of status banners
- Improved alignment of INITIALIZING and RUNNING status boxes
- Enhanced visual consistency across all UI elements
- Enhanced log coloring with lighter shades for better readability
- Redesigned ngrok tunnel banner with dynamic width to accommodate long URLs
- Improved visual aesthetics of the ngrok tunnel banner with modern styling
- Added automatic width adjustment for banners based on content length
- Fine-tuned color scheme to ensure all logs remain visible while not competing with important banners
- Implemented intelligent log coloring that uses subdued colors for routine logs
- Added smart detection of important log messages to highlight critical information
- Enhanced visual focus on banners and important messages by de-emphasizing routine logs
- Added special handling for ngrok and uvicorn logs to make them even more subdued
- Created a comprehensive pattern matching system to identify and highlight important logs
- Improved overall readability by reducing visual noise from routine log messages
- Completely redesigned the LocalLab ASCII art logo with a modern, aesthetically pleasing look
- Created beautiful boxed status indicators for both INITIALIZING and RUNNING states
- Enhanced visual hierarchy with prominent logo and clear status indicators
- Added detailed bullet points in status boxes for better user guidance
- Standardized the formatting of server details and ngrok tunnel information boxes
- Improved overall visual consistency across all UI elements
- Made server status much easier to distinguish at a glance
- Fixed duplicate logging issue where the same log message appeared multiple times
- Improved color detection for terminal output - now only uses colors when supported
- Prevented multiple handlers from being added to the same logger
- Disabled uvicorn's default logging configuration to prevent duplication
- Enhanced logger initialization to ensure consistent formatting
- Added proper cleanup of existing handlers before adding new ones
- Improved compatibility with different terminal environments
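The duplicate-handler fix follows a common pattern: strip existing handlers before adding a fresh one, and stop propagation to the root logger. A minimal sketch (not the exact LocalLab logger setup):

```python
import logging
import sys

def get_clean_logger(name: str) -> logging.Logger:
    """Return a logger with exactly one stream handler; removing any
    existing handlers first prevents the same message appearing
    multiple times when configuration runs more than once."""
    logger = logging.getLogger(name)
    for handler in list(logger.handlers):
        logger.removeHandler(handler)  # proper cleanup before re-adding
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplication via the root logger
    return logger
```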
- Enhanced error handling and reliability in both clients
- Added timeout handling to sync client streaming methods
- Improved event loop cleanup and resource management
- Added connection state validation
- Added retry mechanism for streaming operations
- Added comprehensive logging throughout both clients
- Added proper cleanup of resources on client closure
- Fixed potential memory leaks in event loop handling
- Fixed thread cleanup in synchronous client
- Improved error propagation between async and sync clients
- Added proper timeout handling in streaming operations
- Enhanced connection state management
- Fixed package structure to avoid duplicate exports
- Updated version numbers to be consistent across all files
- Fixed imports in sync_client.py to use correct package name
- Improved package import reliability
- Ensured both LocalLabClient and SyncLocalLabClient are properly exported
- Fixed SyncLocalLabClient not being exported from locallab_client package
- Added proper exports for both LocalLabClient and SyncLocalLabClient in the package `__init__.py`
- Ensured both sync and async clients are available through the main package import
- Renamed Python client package from `locallab-client` to `locallab_client` for better import compatibility
- Updated client package version to 0.3.0
- Changed client package structure to use direct imports instead of nested packages
- Improved client package documentation with correct import examples
- Fixed server shutdown issues when pressing Ctrl+C
- Improved error handling during server shutdown process
- Enhanced handling of asyncio.CancelledError during shutdown
- Added proper handling for asyncio.Server objects during shutdown
- Reduced duplicate log messages during shutdown
- Added clean shutdown banner for better user experience
- Improved task cancellation with proper timeout handling
- Enhanced force exit mechanism to ensure clean termination
- Added a dedicated synchronous client (`SyncLocalLabClient`) that doesn't require async/await
- Added automatic session closing to prevent resource leaks
- Added proper resource management with context managers
- Simplified client API with separate async and sync clients
- Updated documentation to clearly explain both client options
- Fixed issue with unclosed client sessions causing warnings
- Improved error handling in streaming responses
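A synchronous facade like `SyncLocalLabClient` typically drives the async client on a private event loop; this sketch uses simplified stand-in classes (`AsyncClient` and `SyncClient` are hypothetical names, not the real client API):

```python
import asyncio

class AsyncClient:
    """Stand-in for the async client (hypothetical minimal API)."""
    async def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"
    async def close(self) -> None:
        pass

class SyncClient:
    """Synchronous facade: runs each async call to completion on a
    private event loop, so callers never need async/await."""
    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._client = AsyncClient()
    def generate(self, prompt: str) -> str:
        return self._loop.run_until_complete(self._client.generate(prompt))
    def close(self) -> None:
        """Release the underlying session and the event loop."""
        self._loop.run_until_complete(self._client.close())
        self._loop.close()
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        self.close()  # context manager guarantees cleanup
```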
- Added unified client API that works both with and without async/await
- Implemented automatic session closing to the Python client
- Added proper resource management with atexit handlers and finalizers
- Improved error handling in the Python client
- Added synchronous context manager support (`with` statement)
- Simplified client API - same methods work in both sync and async contexts
- Updated Python client to track activity and close inactive sessions
- Enhanced client session management to prevent resource leaks
- Improved client package version to 0.2.0
- Fixed issue with unclosed client sessions causing warnings
- Improved error propagation in streaming responses
- Removed all response formatting from streaming generation
- Simplified token streaming to provide raw, unformatted tokens
- Removed text cleaning and formatting from all generation endpoints
- Improved error handling in streaming responses
- Optimized streaming generation for low-resource computers
- Implemented token-level streaming with proper error handling
- Added memory monitoring and adaptive token generation
- Enhanced error recovery mechanisms for streaming generation
- Improved client-side error handling for streaming responses
- Fixed issue with streaming generation stopping unexpectedly
- Improved error reporting in streaming responses
- Added timeout handling to prevent hanging during streaming
- Enhanced memory management to prevent OOM errors
- Optimized token generation for better performance on low-resource computers
- Reduced default max_length for streaming to conserve memory
- Improved token buffering for smoother streaming experience
- Enhanced Python client with better error handling for streaming
- Added proper error message propagation from server to client
- Added context awareness to streaming generation
- Enhanced streaming response quality with context tracking
- Improved streaming response coherence by maintaining conversation history
- Updated documentation with streaming context examples
- Fixed streaming response formatting issues
- Improved error handling in streaming generation
- Enhanced token cleanup for better readability
- Fixed Python client initialization error "'str' object has no attribute 'headers'"
- Updated client package to handle string URLs in constructor
- Bumped client package version to 1.0.2
- Updated documentation with correct client initialization examples
- Fixed HuggingFace token handling and validation in model loading
- Fixed ngrok token environment variable usage to use the official `NGROK_AUTHTOKEN` name
- Fixed token storage and retrieval in config and environment variables
- Improved CLI UX for token input and management
- Removed token masking for better visibility
- Show current token values when available
- Added proper token validation
- Enhanced token handling across the package
- Standardized environment variable names
- Better string handling for token values
- Consistent token validation
- Better error messages for token-related issues
- Improved networking setup with proper token handling
- Updated environment variable names to use official standards: `NGROK_AUTHTOKEN` for the ngrok token, `HUGGINGFACE_TOKEN` for the HuggingFace token
- Standardized token management functions in config.py
- Fixed critical error with ngrok URL handling in Google Colab
- Fixed NgrokTunnel type error during server initialization
- Improved error messages for ngrok connection issues
- Updated footer design for better visibility
- Clarified URL usage in documentation (localhost vs ngrok)
- Simplified footer design in server output
- Enhanced ngrok tunnel setup process with better error handling
- Updated documentation to clearly distinguish between local and ngrok URLs
- Added support for HuggingFace token through CLI and environment variables
- Interactive prompt for HuggingFace token when required
- Secure token handling in configuration
- Improved error messages for model loading issues
- Made HuggingFace token optional but with interactive prompt when needed
- Enhanced model loading process with better token handling
- Updated documentation with HuggingFace token configuration details
- Fixed critical issue with BERT model loading by removing device_map for BERT models
- Added proper BERT model configuration for text generation
- Improved model loading process with better architecture detection
- Enhanced error handling for different model architectures
- Fixed memory management for CPU-only environments
- Added automatic model type detection and configuration
- Improved compatibility with various model architectures
- Enhanced error messages for better debugging
- Added support for BERT models in text generation mode
- Implemented automatic model architecture detection
- Added proper model-specific configurations
- Enhanced memory optimization for different model types
- Fixed critical issue with server not terminating properly when Ctrl+C is pressed
- Improved process termination by using `os._exit()` instead of `sys.exit()` for clean shutdown
- Added CPU compatibility by disabling quantization when CUDA is not available
- Fixed bitsandbytes error for CPU-only systems with clear warning messages
- Enhanced user experience with better error handling for non-GPU environments
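The forced-termination fix can be sketched as a SIGINT handler built around `os._exit()` (a simplified illustration; `shutdown_server` in the comment is a hypothetical cleanup callback):

```python
import os
import signal

def make_ctrl_c_handler(cleanup):
    """Build a SIGINT handler that runs cleanup and then calls
    os._exit(0). Unlike sys.exit(), os._exit() terminates the process
    immediately even with lingering non-daemon threads, so Ctrl+C
    always stops the server."""
    def handler(signum, frame):
        try:
            cleanup()
        finally:
            os._exit(0)  # hard exit: bypasses atexit and thread joins
    return handler

# Installation (done once at startup):
#   signal.signal(signal.SIGINT, make_ctrl_c_handler(shutdown_server))
```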
- Added beautiful footer section with author information and social media links
- Included GitHub, Twitter, and Instagram links in the footer
- Added project repository link with star request
- Enhanced server startup display with comprehensive information
- Fixed critical issue with server not shutting down properly when Ctrl+C is pressed
- Improved signal handling in ServerWithCallback class to ensure clean shutdown
- Enhanced main_loop method to respond faster to shutdown signals
- Implemented more robust server shutdown process with proper resource cleanup
- Added additional logging during shutdown to help diagnose issues
- Increased shutdown timeout to allow proper cleanup of all resources
- Fixed multiple shutdown attempts when Ctrl+C is pressed repeatedly
- Ensured all server components are properly closed during shutdown
- Enhanced server compatibility with different versions of uvicorn
- Improved lifespan initialization with comprehensive fallback mechanisms
- Fixed server startup issues with newer versions of uvicorn (0.34.0+)
- Added robust error handling for lifespan initialization
- Implemented multiple initialization strategies for different uvicorn versions
- Improved logging during server startup to better diagnose initialization issues
- Enhanced server stability with proper error recovery during startup
- Fixed "Using NoopLifespan" warning by properly initializing lifespan components
- Ensured compatibility with both older and newer versions of uvicorn
- Improved server reliability in various Python environments
- Fixed critical issue with SimpleTCPServer not properly handling API requests
- Implemented proper ASGI server in SimpleTCPServer for handling API requests
- Added support for uvicorn's H11Protocol for better request handling
- Improved fallback server implementation with proper HTTP request parsing
- Fixed API documentation to show correct URLs based on environment
- Fixed API examples to show local URL or ngrok URL based on configuration
- Ensured server works correctly in both local and Google Colab environments
- Fixed import error: "cannot import name 'get_system_info' from 'locallab.utils.system'"
- Added backward compatibility function for system information retrieval
- Ensured proper display of system resources during server startup
- Enhanced compatibility between UI components and system utilities
- Improved error handling during server startup display
- Added graceful error recovery for UI component failures
- Ensured server continues to run even if display components fail
- Enhanced robustness of startup process with comprehensive error handling
- Added fallback mechanisms for all UI components to handle import errors
- Improved system resource display with multiple fallback options
- Enhanced model information display with graceful degradation
- Ensured server can start even with missing or incompatible dependencies
- Added minimal mode fallback server for critical initialization failures
- Implemented comprehensive error handling for configuration loading
- Created fallback endpoints for basic server functionality
- Added detailed error reporting in minimal mode
- Enhanced server resilience with multi-level fallback mechanisms
- Fixed critical error: "'Server' object has no attribute 'start'"
- Implemented robust SimpleTCPServer as a fallback when TCPServer import fails
- Added direct socket handling for maximum compatibility across environments
- Enhanced server startup process to handle different server implementations
- Improved error handling in server shutdown process
- Added graceful fallback for servers without start/shutdown methods
- Enhanced compatibility with different versions of uvicorn
- Improved server stability with better error recovery mechanisms
- Added comprehensive error handling for socket operations
- Implemented non-blocking socket I/O for better performance
- Added direct fallback to SimpleTCPServer when server.start() fails
- Improved Google Colab integration with better error handling
- Enhanced event loop handling for different Python environments
- Fixed critical error: "'Config' object has no attribute 'server_class'"
- Implemented custom startup method that doesn't rely on config.server_class
- Fixed import issues in Google Colab by properly exposing start_server in `__init__.py`
- Enhanced compatibility with different versions of uvicorn
- Improved server initialization for more reliable startup
- Added direct TCPServer initialization for better compatibility
- Implemented fallback mechanisms for TCPServer import to handle different uvicorn versions
- Added multiple import paths for TCPServer to ensure compatibility across all environments
- Enhanced error handling during server initialization
- Improved Google Colab integration with better import structure
- Added custom main_loop implementation with robust error handling
- Implemented graceful shutdown mechanism for all server components
- Enhanced server stability with improved error recovery
- Fixed critical error: "'NoneType' object has no attribute 'startup'"
- Implemented NoopLifespan class as a fallback when all lifespan initialization attempts fail
- Ensured server can start even when lifespan initialization fails
- Added proper error handling for startup and shutdown events
- Enhanced server stability across different environments and uvicorn versions
- Added robust error recovery during server startup process
- Overrode uvicorn's startup and shutdown methods to add additional error handling
- Improved logging for lifespan-related errors to aid in troubleshooting
- Added graceful fallback mechanisms for all critical server operations
- Ensured clean server shutdown even when lifespan shutdown fails
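A `NoopLifespan` fallback of the kind described is essentially a pair of no-op async hooks (a minimal sketch, assuming the interface the server expects is `startup()`/`shutdown()`; the real class may carry more state):

```python
class NoopLifespan:
    """Fallback used when no uvicorn lifespan class can be initialized:
    startup/shutdown become harmless no-ops so the server can still
    run, at the cost of skipping application lifespan events."""
    def __init__(self, app):
        self.app = app
        self.should_exit = False
    async def startup(self):
        pass  # nothing to do; real lifespans emit startup events here
    async def shutdown(self):
        pass  # likewise for shutdown events
```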
- Fixed critical error: "LifespanOn.__init__() takes 2 positional arguments but 3 were given"
- Enhanced lifespan initialization to handle different uvicorn versions with varying parameter requirements
- Implemented comprehensive parameter testing for all lifespan classes to ensure compatibility
- Added detailed logging for lifespan initialization to aid in troubleshooting
- Improved error handling for all lifespan-related operations
- Fixed critical error with LifespanOn initialization: "LifespanOn.__init__() got an unexpected keyword argument 'logger'"
- Improved compatibility with different versions of uvicorn by properly handling lifespan initialization
- Enhanced error handling for different lifespan implementations
- Added graceful fallbacks when lifespan initialization fails
- Fixed critical server startup error related to uvicorn lifespan initialization
- Fixed 'Config' object has no attribute 'logger' error during server startup
- Fixed 'Config' object has no attribute 'loaded_app' error
- Improved compatibility with different versions of uvicorn
- Enhanced error handling during server startup
- Fixed banner display functions to work with the latest server implementation
- Fixed critical issue with `locallab start` failing due to uvicorn lifespan module errors
- Fixed `locallab config` command not properly prompting for new settings when reconfiguring
- Significantly improved CLI startup speed with optimized imports and conditional loading
- Enhanced configuration system to include all available options (cache, logging, etc.)
- Improved compatibility with different Python versions and environments
- Added better error handling for ngrok authentication token
- Fixed event loop handling for both local and Google Colab environments
- Removed "What's New" sections from documentation in favor of directing users to the changelog
- Restored option to skip advanced configuration settings for better user experience
- Fixed critical issue with `locallab start` failing due to uvicorn lifespan module errors
- Fixed `locallab config` command not properly prompting for new settings when reconfiguring
- Significantly improved CLI startup speed with optimized imports and conditional loading
- Enhanced configuration system to include all available options (cache, logging, etc.)
- Improved compatibility with different Python versions and environments
- Added better error handling for ngrok authentication token
- Fixed event loop handling for both local and Google Colab environments
- Removed "What's New" sections from documentation in favor of directing users to the changelog
- Fixed critical issue with `locallab config` command not being respected when running `locallab start`
- Enhanced configuration system to properly load and apply saved settings
- Improved user experience by showing current configuration before prompting for changes
- Added clear feedback when configuration is saved and how to use it
- Fixed critical server startup error related to missing 'lifespan' attribute in ServerWithCallback class
- Fixed KeyError in 'locallab info' command by properly handling RAM information
- Significantly improved CLI startup speed through lazy loading of imports
- Enhanced error handling in system information display
- Fixed environment variable conflicts between CLI configuration and OS environment variables
- Improved configuration system to properly handle both CLI and environment variable settings
- Optimized server startup process for faster response time
- Reduced unnecessary operations during CLI startup for better performance
- Improved memory usage reporting with proper unit conversion (GB instead of MB)
- Enhanced ServerWithCallback class with proper lifespan initialization
- Updated configuration system to use a unified approach for all settings
- Enhanced CLI with interactive configuration wizard
- Added persistent configuration storage
- Implemented environment detection for smart defaults
- Added command groups: start, config, info
- Added support for configuring optimizations through CLI
- Improved Google Colab integration with context-aware prompts
- Added system information command
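The environment-variable conflict fix above can be pictured with a short sketch: saved CLI settings act as defaults, and an explicitly set OS environment variable keeps priority. Everything here is illustrative — the config path, key names, and `LOCALLAB_` prefix are assumptions, not LocalLab's actual implementation:

```python
import json
import os
from pathlib import Path

# Hypothetical location of the persistent CLI configuration.
CONFIG_PATH = Path.home() / ".locallab" / "config.json"

def load_effective_config(path: Path = CONFIG_PATH) -> dict:
    """Merge saved CLI settings with OS environment variables.

    Saved settings are defaults; an OS-level LOCALLAB_* variable
    set by the user overrides the saved value.
    """
    saved = {}
    if path.exists():
        saved = json.loads(path.read_text())
    effective = dict(saved)
    for key in saved:
        env_key = f"LOCALLAB_{key.upper()}"
        if env_key in os.environ:
            effective[key] = os.environ[env_key]
    return effective
```

The point of the precedence order is that `locallab config` can persist defaults without silently clobbering variables the user exported in their shell or Colab notebook.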
- Improved streaming generation quality to match non-streaming responses
- Added proper stopping conditions for streaming to prevent endless generation
- Implemented repetition detection to stop low-quality streaming responses
- Reduced token chunk size for better quality control in streaming mode
- Ensured consistent generation parameters between streaming and non-streaming modes
- Added memory monitoring to prevent CUDA out of memory errors
- Implemented adaptive token generation for streaming responses
- Added CUDA memory configuration with expandable segments
- Fixed torch.compile() errors by adding proper error handling and fallback to eager mode
- Fixed early stopping warning by correctly setting the `num_beams` parameter
- Improved streaming generation with smaller token chunks for more responsive output
- Added memory-aware generation that adapts to available GPU resources
- Implemented error recovery for out-of-memory situations during generation
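The repetition-detection entry above can be illustrated with a small heuristic. This is a sketch of the general technique, not LocalLab's actual detector — the window size and repeat threshold are made-up values:

```python
def looks_repetitive(text: str, window: int = 40, repeats: int = 3) -> bool:
    """Heuristic: flag generated text as repetitive when the last
    `window` characters already occur `repeats` or more times overall."""
    if len(text) < window * repeats:
        return False  # too short to judge
    tail = text[-window:]
    return text.count(tail) >= repeats
```

A streaming loop would call this on the accumulated output after each chunk and stop generation once it returns `True`, which is how low-quality looping responses get cut off early.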
- Fixed issue with banners (running banner, system instructions, model configuration, API documentation) repeating in the console at regular intervals
- Added flag to ensure startup information is only displayed once during server initialization
- Improved server callback handling to prevent duplicate banner displays
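The display-once flag mentioned above is essentially a module-level guard. A minimal sketch (names are illustrative, not LocalLab's actual identifiers):

```python
# Module-level flag: set after the first display so repeated server
# callbacks cannot re-print the startup banners.
_startup_info_displayed = False

def display_startup_banners() -> None:
    global _startup_info_displayed
    if _startup_info_displayed:
        return  # banners already shown; skip on later callbacks
    _startup_info_displayed = True
    print("LOCALLAB RUNNING")
```

Because uvicorn may invoke its notify callback at regular intervals, the guard is what keeps the banner, system instructions, and API documentation from repeating in the console.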
- Fixed duplicated environment configuration output by removing the repeated section
- Added comprehensive API documentation display on server startup with curl examples
- Added model configuration section that displays current model and optimization settings
- Added system instructions section showing the current prompt template
- Improved environment variable handling for model configuration
- Enhanced server startup logging with detailed optimization settings
- Added support for reading HUGGINGFACE_MODEL environment variable to specify model
- Redesigned modern ASCII art banners for a more aesthetic interface
- Improved UI with cleaner banner separations and better readability
- Fixed parameter mismatch in text generation endpoints by properly handling the `max_new_tokens` parameter
- Resolved coroutine awaiting issues in streaming generation endpoints
- Fixed async generator handling in the `stream_chat` and `generate_stream` functions
- Enhanced error handling in streaming responses to provide better error messages
- Improved compatibility between route parameters and model manager methods
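The server-sent-events and newline-escaping fixes above boil down to wrapping the model's async token generator in SSE frames. A self-contained sketch (the generator below is a stand-in, not LocalLab's `generate_stream`):

```python
import asyncio
from typing import AsyncIterator

async def fake_token_stream() -> AsyncIterator[str]:
    # Stand-in for the model's async token generator.
    for token in ["Hello", " world", "\n"]:
        yield token

async def sse_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    # One "data:" frame per chunk; literal newlines inside a chunk are
    # escaped so a token can never be split across SSE lines.
    async for token in tokens:
        escaped = token.replace("\n", "\\n")
        yield f"data: {escaped}\n\n"
    yield "data: [DONE]\n\n"
```

Escaping matters because a raw newline inside `data:` would terminate the SSE frame early and corrupt the client-side stream.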
- Added missing dependencies in `setup.py`: huggingface_hub, pynvml, and typing_extensions
- Improved dependency management with dev extras for testing packages
- Enhanced error handling for GPU memory detection
- Fixed circular import issues between modules
- Improved error handling in system utilities
- Enhanced compatibility with Google Colab environments
- New model loading endpoint that accepts `model_id` in the request body at `/models/load`
- `format_chat_messages` function to properly format chat messages for the model
- CLI function to support command-line usage with a click interface
- Properly awaiting async `generate_text` in the chat completion endpoint
- Fixed async generator handling in the `generate_stream` function
- Fixed streaming in the `stream_chat` function to correctly send server-sent events
- Properly escaped newline characters in the streaming response
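The `format_chat_messages` function named above takes OpenAI-style role/content messages and flattens them into a single prompt. The actual template LocalLab uses is not shown in this changelog, so this implementation is purely hypothetical:

```python
def format_chat_messages(messages: list) -> str:
    """Hypothetical sketch: flatten role/content chat messages into one
    prompt string, ending with a cue for the model to respond."""
    lines = []
    for message in messages:
        lines.append(f"{message['role'].capitalize()}: {message['content']}")
    lines.append("Assistant:")
    return "\n".join(lines)
```

Any such formatter just has to be consistent between streaming and non-streaming endpoints so both produce the same prompt for the same conversation.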
- Added missing dependencies in `setup.py`: colorama, python-multipart, websockets, psutil, and nest-asyncio
- `get_network_interfaces` function to retrieve information about available network interfaces
- `get_public_ip` async function to retrieve the public IP address of the machine
- Adapter methods in `ModelManager` (`generate_text` and `generate_stream`) to maintain API compatibility with route handlers
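As context for the network-information helpers above: LocalLab's own versions use netifaces and httpx (per the dependency entries nearby), but the core idea of discovering the machine's outbound address can be sketched with the standard library alone. The function name here is illustrative:

```python
import socket

def get_local_ip() -> str:
    """Stdlib-only sketch of local address discovery. Connecting a UDP
    socket sends no packets; it just asks the OS which interface would
    be used to reach the given address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect(("10.255.255.255", 1))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no usable interface; fall back to loopback
    finally:
        sock.close()
```

The public-IP counterpart necessarily requires an outbound HTTP request (hence the httpx dependency), which is why it is implemented as an async function in the changelog entry.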
- Import error for the `get_public_ip` and `get_network_interfaces` functions
- Naming mismatch between route handlers and `ModelManager` methods
- New dependencies in `setup.py`: netifaces and httpx
- Fixed API endpoint errors for `/models/available` and other model endpoints
- Resolved parameter error in the `get_model_generation_params()` function
- Improved error handling for model optimization settings through environment variables
- Fixed circular import issues between routes and core modules
- Enhanced Flash Attention warning message to be more informative
- Added new `get_gpu_info()` function for detailed GPU monitoring
- Added improved system resource endpoint with detailed GPU metrics
- Added robust environment variable handling for optimization settings
- Made optimization flags more robust by checking for empty string values
- Improved fallback handling for missing torch packages
- Enhanced server startup logs with better optimization information
- Fixed critical server startup error in Google Colab environment with uvicorn callback configuration
- Resolved "'list' object is not callable" error by properly implementing the callback_notify as an async function
- Enhanced server startup sequence for better compatibility with both local and Colab environments
- Improved custom server implementation to handle callbacks more robustly
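The "'list' object is not callable" fix above comes down to the notify callback needing to be a single awaitable rather than a list of callbacks. A toy stand-in for the patched class (this is not LocalLab's actual `ServerWithCallback` code, and the callback wiring into uvicorn is elided):

```python
import asyncio

class ServerWithCallback:
    """Illustrative stand-in for a server with a ready-state callback."""

    def __init__(self, on_running) -> None:
        self._on_running = on_running
        self._notified = False

    async def callback_notify(self) -> None:
        # A single coroutine function. Assigning a list here is what
        # produced "'list' object is not callable" when the server
        # tried to await its notify hook.
        if not self._notified:
            self._notified = True
            self._on_running()

events = []
server = ServerWithCallback(lambda: events.append("running"))
asyncio.run(server.callback_notify())
asyncio.run(server.callback_notify())  # guarded: fires only once
```

Guarding with `_notified` also dovetails with the banner-deduplication fix, since the notify hook can be awaited repeatedly while the server runs.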
- Fixed circular import issue between core/app.py and routes/system.py by updating system.py to use get_request_count from logger module directly
- Made Flash Attention warning less alarming by changing it from a warning to an info message with better explanation
- Enhanced get_system_info endpoint with cleaner code and better organization
- Fixed potential issues with GPU info retrieval through better error handling
- Comprehensive environment check system that validates:
- Python version compatibility
- CUDA/GPU availability and configuration
- Ngrok token presence when running in Google Colab
- Improved error handling with detailed error messages and suggestions
- Clear instructions for setting up ngrok authentication token
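The environment check described above can be sketched as a pre-flight function that collects human-readable issues instead of failing on the first problem. The Colab-detection heuristic and variable names below are assumptions for illustration:

```python
import os
import sys

def check_environment() -> list:
    """Sketch of a pre-flight check; returns a list of issue strings."""
    issues = []
    if sys.version_info < (3, 8):
        issues.append("Python 3.8+ is required")
    # Crude Colab detection (assumption): Colab sets COLAB_GPU.
    in_colab = "COLAB_GPU" in os.environ
    if in_colab and not os.environ.get("NGROK_AUTH_TOKEN"):
        issues.append(
            "Set NGROK_AUTH_TOKEN to expose the server from Colab"
        )
    return issues
```

Returning all issues at once lets the CLI print every missing prerequisite with a suggestion, rather than forcing the user through a fix-one-rerun cycle.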
- Complete removal of the deprecated monolithic `main.py` file
- Enhanced ngrok setup process with better authentication handling:
- Automatic detection of auth token from environment variables
- Clear error messages when auth token is missing
- Improved token validation and connection process
- Parameter renamed from `ngrok` to `use_ngrok` for clarity
- More readable ASCII art for the initializing banner
- Improved documentation about ngrok requirements for Google Colab
- Fixed circular import issues between core/app.py and routes modules
- Fixed ngrok authentication flow to properly use auth token from environment variables
- Fixed error with missing torch import in the server.py file
- Added graceful handling of missing torch module to prevent startup failures
- Improved error messages when server fails to start
- Better exception handling throughout the codebase
- Clear ASCII art status indicators ("INITIALIZING" and "RUNNING") showing server state
- Warning messages that prevent users from making API requests before the server is ready
- Callback mechanism to display the "RUNNING" banner only when the server is fully operational
- New dedicated logger module with comprehensive features:
- Colorized console output for different log levels
- Server status tracking (initializing, running, error, shutting_down)
- Request tracking with detailed metrics
- Model loading/unloading metrics
- Performance monitoring for slow requests
- API documentation for logger module with usage examples
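The colorized console output described above is typically done with a custom `logging.Formatter` that wraps the level name in ANSI escape codes. A minimal sketch — the color mapping and logger name are illustrative, not LocalLab's actual logger module:

```python
import logging

# ANSI color codes per level (green / yellow / red).
COLORS = {"INFO": "\033[32m", "WARNING": "\033[33m", "ERROR": "\033[31m"}
RESET = "\033[0m"

class ColorFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        color = COLORS.get(record.levelname, "")
        record.levelname = f"{color}{record.levelname}{RESET}"
        return super().format(record)

logger = logging.getLogger("locallab.demo")
handler = logging.StreamHandler()
handler.setFormatter(ColorFormatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

The same module-level logger can then carry the status tracking and request metrics listed above by attaching extra fields to each record.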
- Completely refactored the codebase into a more modular structure:
- Split main.py into smaller, focused modules
- Created separate directories for routes, UI components, utilities, and core functionality
- Improved import structure to prevent circular dependencies
- Better organization of server startup and API functionality
- Enhanced model loading process with proper timing and status updates
- Improved error handling throughout the application
- Better request metrics in response headers
- Removed old logger.py in favor of the new dedicated logger module
- Complete removal of health checks and validation when setting up ngrok tunnels
- Fixed issue where logs did not appear correctly due to server starting in a separate process
- Simplified ngrok setup process to run without validation to prevent connection errors during startup
- Improved server startup flow to be more direct without background health checks or API validation
- Reorganized startup sequence to work properly with ngrok, enhancing compatibility with Colab
- Removed the background process workflow for server startup. The server now runs directly in the main process, ensuring that all logs (banner, model details, system resources, etc.) are displayed properly.
- Simplified the startup process by directly calling uvicorn.run(), with optional ngrok setup if the server is run in Google Colab.
- Added utility function `is_port_in_use(port: int) -> bool` to check if a port is already in use.
- Added async utility function `load_model_in_background(model_id: str)` to load the model asynchronously in the background while managing the global loading flag.
- Updated server startup functions to incorporate these utilities, ensuring proper port management and asynchronous model loading.
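The port check named above has the documented signature `is_port_in_use(port: int) -> bool`; the body below is a plausible stdlib sketch rather than LocalLab's verified implementation:

```python
import socket

def is_port_in_use(port: int) -> bool:
    """Return True when a TCP listener already occupies the port
    on localhost. connect_ex returns 0 on a successful connect."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex(("127.0.0.1", port)) == 0
```

Checking before binding lets the startup code fail fast with a clear message (or pick another port) instead of surfacing uvicorn's lower-level bind error.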
- Extended the initial wait time in start_server from 5 to 15 seconds to allow the server ample time to initialize, especially in Google Colab environments.
- Increased health check timeout to 120 seconds for ngrok mode and 60 seconds for local mode to accommodate slower startups.
- Added detailed logging during health checks to aid in debugging startup issues.
- Improved logging across startup: the banner, model details, configuration, system resources, API documentation, quick start guide, and footer are now fully logged and printed.
- Updated the start_server function to extend the health check timeout to 60 seconds in Google Colab (when using ngrok) and to set an environment variable to trigger the Colab branch in run_server_proc.
- Modified startup_event to load the model in the background, ensuring that the server's /health endpoint becomes available in time and that logging output is complete.
- Updated GitHub Actions workflow to install the Locallab package along with its runtime dependencies in CI, ensuring that all required packages are available for proper testing.
- Refactored `run_server_proc` in the spawned process to initialize a dedicated logger ("locallab.spawn") to avoid inheriting SemLock objects from a fork context.
- Ensured that the log queue is created using the multiprocessing spawn context, preventing runtime errors in Google Colab.
- Updated Mermaid diagrams in `README.md` and `docs/colab/README.md` to enclose node labels in double quotes, resolving parse errors in GitHub rendering.
- Removed duplicate architecture diagrams from the root `README.md` file.