Commit f0ba8f0

feat: add model selection, replace tinyllama with llama3.2 and qwen2.5
- Add multi-model support: users choose model via /newsession <number>
- Available models: Llama 3.2 (1B), Qwen 2.5 (1.5B instruct)
- Warn users on deprecated models and prompt to create new session
- Fix API Gateway 29s timeout: reduce Ollama timeout to 22s, smart retry
- Fix Lambda deployment: include requests dependency in zip
- Update README with model management docs and available models table
- Update variables.tf default ollama_model to llama3.2:1b
1 parent e9e6fe6 commit f0ba8f0

5 files changed

Lines changed: 183 additions & 61 deletions

File tree

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -29,6 +29,8 @@ package/
 
 # lambda
 lambda_function.zip
+lambda.zip
+*.zip
 
 # SSH keys and certificates
 *.pem
```

CHANGELOG.md

Lines changed: 9 additions & 2 deletions
```diff
@@ -15,12 +15,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
   - `modules/monitoring/` - CloudWatch metric filter and error alarm
   - `modules/ec2/` - EC2 instance running Ollama for AI inference
 - **Ollama AI Integration**: Self-hosted LLM on EC2 (external API)
-  - EC2 instance (t3.large) running Ollama with tinyllama model
+  - EC2 instance (t3.large) running Ollama with multiple models
+  - Available models: Llama 3.2 (1B), Qwen 2.5 (1.5B instruct)
+  - Model selection via `/newsession <number>` command
+  - Unavailable model detection: warns users on deprecated models
   - Lambda calls `POST /api/chat` for AI-powered chat responses
   - Elastic IP for stable endpoint across stop/start cycles
   - Model persistence: S3 sync on shutdown, restore on boot
   - Lifecycle management script: `scripts/manage-ollama.sh`
-  - Error handling with timeouts, structured logging, graceful fallback
+  - Error handling with timeouts (22s), smart retry on fast failures, graceful fallback
+  - "Thinking..." indicator with message editing for response delivery
+  - Rate limiting: prevents duplicate requests while AI is processing
+  - Update deduplication via update_id tracking in DynamoDB
 - **Observability**: Added structured logging and monitoring
   - Structured JSON logging in `handler.py` (level, timestamp, action, outcome, request_id)
   - CloudWatch metric filter for ERROR-level logs (`{ $.level = "ERROR" }`)
@@ -36,6 +42,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - **Documentation**: Added observability and remote state sections to README
 
 ### Changed
+- **Models**: Replaced tinyllama with Llama 3.2 (1B) and added Qwen 2.5 (1.5B instruct)
 - **handler.py**: Replaced placeholder with actual Ollama AI integration, updated `/status` and `/help`
 - **main.tf**: Migrated from inline resources to module calls, added monitoring and EC2 modules
 - **outputs.tf**: Updated outputs to reference module outputs, added monitoring and EC2 outputs
```
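The 22-second timeout entry above is sized to fit inside API Gateway's hard 29-second integration limit. A quick arithmetic sketch, using the constants from this commit (5s fast-failure threshold and 1s backoff from the retry logic in `handler.py`), shows the worst-case retry path still fits:

```python
# Worst-case latency budget for one Lambda invocation behind API Gateway,
# using the values introduced in this commit.
API_GATEWAY_LIMIT_S = 29      # hard API Gateway integration timeout
OLLAMA_TIMEOUT_S = 22         # per-request timeout to Ollama
FAST_FAILURE_THRESHOLD_S = 5  # only failures faster than this are retried
RETRY_BACKOFF_S = 1           # sleep before the single retry

# Worst case: a failure just under the 5s threshold, a 1s sleep,
# then a full 22s second attempt.
worst_case_s = FAST_FAILURE_THRESHOLD_S + RETRY_BACKOFF_S + OLLAMA_TIMEOUT_S

assert worst_case_s <= API_GATEWAY_LIMIT_S  # 5 + 1 + 22 = 28 < 29
```

This is why retries are restricted to fast connection errors: retrying after a full 22s timeout would blow past the 29s cap.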

README.md

Lines changed: 45 additions & 4 deletions
````diff
@@ -79,7 +79,8 @@ This project creates a serverless Telegram bot running on AWS. When users send m
 |---------|---------|--------|
 | `/start` or `/hello` | Initialize and greet user | ✅ Working |
 | `/help` | Show available commands | ✅ Working |
-| `/newsession` | Create a new chat session | ✅ Working |
+| `/newsession` | Show available models | ✅ Working |
+| `/newsession <number>` | Create session with chosen model | ✅ Working |
 | `/listsessions` | List all user sessions | ✅ Working |
 | `/switch <number>` | Switch to a different session | ✅ Working |
 | `/history` | Show recent messages in session | ✅ Working |
@@ -428,7 +429,7 @@ Creates an EC2 instance running Ollama for AI inference.
 |----------|-------------|---------|
 | `instance_name` | Name tag for the instance | Required |
 | `instance_type` | EC2 instance type | `t3.large` |
-| `ollama_model` | Model to pull on first boot | `tinyllama` |
+| `ollama_model` | Model to pull on first boot | `llama3.2:1b` |
 | `models_s3_bucket` | S3 bucket for model persistence | Required |
 | `ssh_allowed_cidr` | CIDR for SSH access | `0.0.0.0/0` |
@@ -448,11 +449,21 @@ The bot integrates with [Ollama](https://ollama.com), a self-hosted large langua
 | Endpoint | `POST http://<EC2_EIP>:11434/api/chat` |
 | Protocol | HTTP (REST) |
 | Authentication | API key via `X-API-Key` header (nginx reverse proxy) |
-| Request format | JSON: `{"model": "tinyllama", "messages": [...], "stream": false}` |
+| Request format | JSON: `{"model": "llama3.2:1b", "messages": [...], "stream": false}` |
 | Response format | JSON: `{"message": {"content": "..."}}` |
 
+**Available Models:**
+
+| # | Model | Description | Size |
+|---|-------|-------------|------|
+| 1 | `llama3.2:1b` | Meta Llama 3.2, fast general-purpose | 1.3 GB |
+| 2 | `qwen2.5:1.5b-instruct-q4_K_M` | Alibaba Qwen 2.5, instruction-tuned | 986 MB |
+
+Users select a model when creating a session via `/newsession <number>`. Sessions using removed models show a warning and prompt the user to create a new session.
+
 **Error Handling:**
-- Connection timeouts (45s) with structured JSON error logging
+- Connection timeouts (22s) with structured JSON error logging
+- Smart retry: retries once only on fast connection errors (< 5s), not on timeouts
 - HTTP status code validation (non-200 responses return user-friendly error)
 - Exception handling with stack traces logged to CloudWatch
 - Graceful fallback: bot remains functional even if Ollama is unreachable
@@ -476,8 +487,38 @@ The bot integrates with [Ollama](https://ollama.com), a self-hosted large langua
 ./scripts/manage-ollama.sh start   # Start instance, wait for Ollama API
 ./scripts/manage-ollama.sh stop    # Stop instance (syncs models to S3)
 ./scripts/manage-ollama.sh status  # Check instance and API health
+./scripts/manage-ollama.sh ssh     # SSH into the instance
+```
+
+**Managing Models:**
+
+To add a new model, SSH into the EC2 instance and pull it:
+
+```bash
+# SSH into the instance
+./scripts/manage-ollama.sh ssh
+
+# Pull a model (must set OLLAMA_HOST since Ollama binds to port 11435)
+OLLAMA_HOST=http://127.0.0.1:11435 ollama pull <model_name>
+
+# List installed models
+OLLAMA_HOST=http://127.0.0.1:11435 ollama list
+
+# Remove a model
+OLLAMA_HOST=http://127.0.0.1:11435 ollama rm <model_name>
+```
+
+After pulling a new model, add it to the `AVAILABLE_MODELS` list in `handler.py` and redeploy the Lambda:
+
+```bash
+# Rebuild and deploy
+cp handler.py /tmp/lambda-build/handler.py
+cd /tmp/lambda-build && zip -r /path/to/lambda.zip . -x '__pycache__/*' '*.pyc'
+aws lambda update-function-code --function-name telegram-bot --zip-file fileb://lambda.zip
 ```
 
+> **Note:** Ollama binds to `127.0.0.1:11435` (not the default 11434) because the nginx reverse proxy occupies port 11434 for API key authentication. Always set `OLLAMA_HOST=http://127.0.0.1:11435` when using the `ollama` CLI on the instance.
+
 ---
 
 ## Data Storage
````
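The request/response contract documented in the External API table above can be exercised from any HTTP client. A minimal sketch follows; the Lambda handler itself uses the `requests` library, while this example sticks to the standard library, and the base URL, API key, and function names here are illustrative placeholders:

```python
import json
import urllib.request

def build_chat_payload(model: str, messages: list) -> dict:
    """Build the request body from the External API table (stream disabled)."""
    return {"model": model, "messages": messages, "stream": False}

def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of an Ollama /api/chat response."""
    return json.loads(response_body)["message"]["content"]

def chat(base_url: str, api_key: str, model: str, messages: list, timeout: int = 22) -> str:
    """POST to the nginx-fronted Ollama endpoint with the X-API-Key header."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(build_chat_payload(model, messages)).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(resp.read().decode())
```

For example, `chat("http://<EC2_EIP>:11434", "<key>", "llama3.2:1b", [{"role": "user", "content": "hi"}])` would return the assistant's reply text, assuming the instance is running and the key is valid.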

handler.py

Lines changed: 126 additions & 54 deletions
```diff
@@ -213,7 +213,7 @@ def get_active_session(user_id: int) -> Optional[Dict[str, Any]]:
     return None
 
 
-def create_session(user_id: int, model_name: str = "tinyllama") -> Dict[str, Any]:
+def create_session(user_id: int, model_name: str = "llama3.2:1b") -> Dict[str, Any]:
     """Create a new session, deactivate old active ones."""
     session_id = str(uuid.uuid4())
     sk = f"MODEL#{model_name}#SESSION#{session_id}"
@@ -263,7 +263,7 @@ def append_to_conversation(session: Dict[str, Any], message_dict: Dict[str, Any]
 
 
 def call_ollama(model: str, messages: List[Dict[str, Any]]) -> str:
-    """Call Ollama API for chat completion."""
+    """Call Ollama API for chat completion. Retries once only on fast connection errors."""
     if not OLLAMA_URL:
         logger.warning("call_ollama", message="OLLAMA_URL not configured")
         return "AI service is not configured. Please contact the administrator."
@@ -274,19 +274,32 @@ def call_ollama(model: str, messages: List[Dict[str, Any]]) -> str:
         "stream": False
     }
     headers = {"X-API-Key": OLLAMA_API_KEY} if OLLAMA_API_KEY else {}
-    try:
-        resp = requests.post(f"{OLLAMA_URL}/api/chat", json=payload, headers=headers, timeout=45)
-        if resp.status_code == 200:
-            data = resp.json()
-            response_content = data['message']['content']
-            logger.info("call_ollama", message=f"Response length {len(response_content)} chars")
-            return response_content
-        else:
-            logger.error("call_ollama", message=f"Ollama API error: HTTP {resp.status_code}")
+    for attempt in range(2):
+        start = time.time()
+        try:
+            resp = requests.post(f"{OLLAMA_URL}/api/chat", json=payload, headers=headers, timeout=22)
+            if resp.status_code == 200:
+                data = resp.json()
+                response_content = data['message']['content']
+                logger.info("call_ollama", message=f"Response length {len(response_content)} chars")
+                return response_content
+            else:
+                elapsed = time.time() - start
+                logger.error("call_ollama", message=f"Ollama API error: HTTP {resp.status_code}, attempt {attempt+1}")
+                # Only retry if failure was fast (connection issue, not slow inference)
+                if attempt == 0 and elapsed < 5:
+                    time.sleep(1)
+                    continue
+                return "Sorry, AI response unavailable. Use /status to check connection."
+        except Exception as e:
+            elapsed = time.time() - start
+            logger.error("call_ollama", message=f"Ollama connection error, attempt {attempt+1}", error=e)
+            # Only retry if failure was fast (connection refused, not timeout)
+            if attempt == 0 and elapsed < 5:
+                time.sleep(1)
+                continue
             return "Sorry, AI response unavailable. Use /status to check connection."
-    except Exception as e:
-        logger.error("call_ollama", message="Ollama connection error", error=e)
-        return "Sorry, AI response unavailable. Use /status to check connection."
+    return "Sorry, AI response unavailable. Use /status to check connection."
 
 
 # ==================== ARCHIVE FUNCTIONS ====================
@@ -426,20 +439,33 @@ def import_archive_to_s3(user_id: int, archive_data: Dict[str, Any]) -> Optional
 
 # ==================== COMMAND HANDLERS ====================
 
+AVAILABLE_MODELS = [
+    {"id": "llama3.2:1b", "name": "Llama 3.2", "desc": "Meta's latest, fast (1B)"},
+    {"id": "qwen2.5:1.5b-instruct-q4_K_M", "name": "Qwen 2.5", "desc": "Instruction-tuned (1.5B)"},
+]
+
+def format_model_list() -> str:
+    """Format available models as a numbered list."""
+    lines = []
+    for i, m in enumerate(AVAILABLE_MODELS):
+        lines.append(f"{i+1}. {m['name']} - {m['desc']}")
+    return "\n".join(lines)
+
 def handle_command(cmd: str, payload: str, chat_id: int, user_id: int, update_id: int) -> str:
     """Handle bot commands."""
     logger.info("handle_command", message=f"Command: {cmd}", command=cmd)
 
     if cmd == "/start" or cmd == "/hello":
         session = get_current_session(user_id)
-        resp = f"Hello! Your current model is {session['model_name']}. Chat away or use /help."
+        resp = f"Hello! Your current model is {session['model_name']}.\n\nAvailable models:\n{format_model_list()}\n\nUse /newsession <number> to start a session with a specific model.\nChat away or use /help."
         send_message(chat_id, resp)
         return "start_or_hello"
 
     if cmd == "/help":
-        resp = """Commands:
-/start or /hello - Greeting and session init
-/newsession - Start a new chat session
+        resp = f"""Commands:
+/start or /hello - Greeting and session info
+/newsession - Show available models
+/newsession <number> - Start session with chosen model
 /listsessions - List your sessions
 /switch <number> - Switch to a session (e.g., /switch 1)
 /history - Show recent messages in current session
@@ -453,6 +479,9 @@ def handle_command(cmd: str, payload: str, chat_id: int, user_id: int, update_id
 /export <number> - Export an archive as a file
 (Send a JSON file to import an archive)
 
+Available models:
+{format_model_list()}
+
 Send any text message to chat with the AI model."""
         send_message(chat_id, resp)
         return "help"
@@ -475,10 +504,27 @@ def handle_command(cmd: str, payload: str, chat_id: int, user_id: int, update_id
         return "status"
 
     if cmd == "/newsession":
-        new_session = create_session(user_id)
-        resp = f"New session created with model '{new_session['model_name']}' (ID: {new_session['session_id'][:8]})."
-        send_message(chat_id, resp)
-        return "newsession"
+        if not payload.strip():
+            resp = f"Choose a model for your new session:\n{format_model_list()}\n\nUsage: /newsession <number> (e.g., /newsession 1)"
+            send_message(chat_id, resp)
+            return "newsession_list"
+        try:
+            idx = int(payload.strip()) - 1
+            if 0 <= idx < len(AVAILABLE_MODELS):
+                model_id = AVAILABLE_MODELS[idx]["id"]
+                model_name_display = AVAILABLE_MODELS[idx]["name"]
+                new_session = create_session(user_id, model_name=model_id)
+                resp = f"New session created with {model_name_display} ({model_id}).\nSession ID: {new_session['session_id'][:8]}"
+                send_message(chat_id, resp)
+                return "newsession"
+            else:
+                resp = f"Invalid choice. Available models:\n{format_model_list()}\n\nUsage: /newsession <number>"
+                send_message(chat_id, resp)
+                return "invalid_newsession"
+        except ValueError:
+            resp = f"Invalid choice. Available models:\n{format_model_list()}\n\nUsage: /newsession <number>"
+            send_message(chat_id, resp)
+            return "invalid_newsession"
 
     if cmd == "/listsessions":
         items = get_user_items(user_id)
@@ -711,7 +757,7 @@ def handle_document(document: Dict[str, Any], chat_id: int, user_id: int) -> str
         return "imported"
 
 
-def handle_message(text: str, chat_id: int, user_id: int, update_id: int, document: Optional[Dict[str, Any]] = None) -> str:
+def handle_message(text: str, chat_id: int, user_id: int, update_id: int, document: Optional[Dict[str, Any]] = None, thinking_msg_id: Optional[int] = None) -> str:
     """Handle incoming messages: commands, chat, or documents."""
 
     if document:
@@ -742,10 +788,25 @@ def handle_message(text: str, chat_id: int, user_id: int, update_id: int, docume
     session = get_current_session(user_id)
     now = int(time.time())
 
+    # Check if session's model is still available
+    available_model_ids = {m["id"] for m in AVAILABLE_MODELS}
+    session_model = session.get("model_name", "")
+    if session_model not in available_model_ids:
+        available_list = format_model_list()
+        warn_msg = f"The model '{session_model}' is no longer available.\n\nPlease create a new session with an available model:\n{available_list}\n\nUsage: /newsession <number>"
+        if thinking_msg_id:
+            edit_message(chat_id, thinking_msg_id, warn_msg)
+        else:
+            send_message(chat_id, warn_msg)
+        return "model_unavailable"
+
     # Check if a request is already being processed (within last 55 seconds)
     pending_since = session.get("pending_request_ts", 0)
     if pending_since and (now - pending_since) < 55:
-        send_message(chat_id, "Please wait, still generating a response to your previous message...")
+        if thinking_msg_id:
+            edit_message(chat_id, thinking_msg_id, "Please wait, still generating a response to your previous message...")
+        else:
+            send_message(chat_id, "Please wait, still generating a response to your previous message...")
         return "rate_limited"
 
     user_msg = {"role": "user", "content": text, "ts": now}
@@ -755,35 +816,32 @@ def handle_message(text: str, chat_id: int, user_id: int, update_id: int, docume
     session['pending_request_ts'] = now
     table.put_item(Item=session)
 
-    # Send "Thinking..." message so user knows the bot is working
-    thinking_resp = send_message(chat_id, "Thinking...")
-    thinking_msg_id = None
-    if thinking_resp and thinking_resp.get("ok"):
-        thinking_msg_id = thinking_resp["result"]["message_id"]
-
-    # Build conversation context for Ollama (strip ts field, limit to last 10 messages)
-    all_msgs = [
-        {"role": m["role"], "content": m["content"]}
-        for m in session.get("conversation", [])
-        if "role" in m and "content" in m
-    ]
-    messages = all_msgs[-10:]
-
-    # Call Ollama for AI response
-    model_name = session.get("model_name", "tinyllama")
-    ai_response = call_ollama(model_name, messages)
-
-    # Clear pending flag
-    session['pending_request_ts'] = 0
-    ass_msg = {"role": "assistant", "content": ai_response, "ts": int(time.time())}
-    append_to_conversation(session, ass_msg)
-
-    # Replace "Thinking..." with the actual response
-    if thinking_msg_id:
-        edit_message(chat_id, thinking_msg_id, ai_response)
-    else:
-        send_message(chat_id, ai_response)
-    return "ai_response"
+    try:
+        # Build conversation context for Ollama (strip ts field, limit to last 10 messages)
+        all_msgs = [
+            {"role": m["role"], "content": m["content"]}
+            for m in session.get("conversation", [])
+            if "role" in m and "content" in m
+        ]
+        messages = all_msgs[-10:]
+
+        # Call Ollama for AI response
+        model_name = session.get("model_name", "tinyllama")
+        ai_response = call_ollama(model_name, messages)
+
+        ass_msg = {"role": "assistant", "content": ai_response, "ts": int(time.time())}
+        append_to_conversation(session, ass_msg)
+
+        # Replace "Thinking..." with the actual response
+        if thinking_msg_id:
+            edit_message(chat_id, thinking_msg_id, ai_response)
+        else:
+            send_message(chat_id, ai_response)
+        return "ai_response"
+    finally:
+        # Always clear pending flag, even on unexpected errors
+        session['pending_request_ts'] = 0
+        table.put_item(Item=session)
 
 
 def process_telegram_update(update: Dict[str, Any]) -> Dict[str, Any]:
@@ -808,18 +866,32 @@ def process_telegram_update(update: Dict[str, Any]) -> Dict[str, Any]:
         logger.warning("process_update", outcome="skipped", message="No chat_id in update")
         return {"processed": False, "reason": "no_chat_id"}
 
+    # For chat messages (not commands, not documents), send "Thinking..." immediately
+    thinking_msg_id = None
+    if text and not text.strip().startswith('/') and not document:
+        thinking_resp = send_message(chat_id, "Thinking...")
+        if thinking_resp and thinking_resp.get("ok"):
+            thinking_msg_id = thinking_resp["result"]["message_id"]
+
     # Deduplicate: check if this update_id was already processed
     try:
         dedup_resp = table.get_item(Key={'pk': OFFSET_PK, 'sk': f'update#{update_id}'})
         if 'Item' in dedup_resp:
             logger.info("process_update", outcome="skipped", message=f"Duplicate update_id={update_id}")
+            # Delete the "Thinking..." message if we sent one
+            if thinking_msg_id:
+                try:
+                    requests.post(f"{TELEGRAM_API}/deleteMessage",
+                                  json={"chat_id": chat_id, "message_id": thinking_msg_id}, timeout=5)
+                except Exception:
+                    pass
             return {"processed": False, "reason": "duplicate"}
         # Mark this update_id as processed
         table.put_item(Item={'pk': OFFSET_PK, 'sk': f'update#{update_id}', 'ts': int(time.time())})
     except Exception:
         pass  # If dedup check fails, process anyway
 
-    handle_result = handle_message(text, chat_id, user_id, update_id, document)
+    handle_result = handle_message(text, chat_id, user_id, update_id, document, thinking_msg_id)
 
     return {
         "processed": True,
```
