
Commit a30aa06

Merge pull request #64 from PlanExeOrg/mcp-improvements
MCP improvements
2 parents 9ae15d3 + 2f8e664 commit a30aa06

32 files changed

Lines changed: 1907 additions & 329 deletions

README.md

Lines changed: 7 additions & 3 deletions
@@ -51,9 +51,13 @@ Assuming you have an MCP-compatible client (OpenClaw, Cursor, Codex, LM Studio,
 The Tool workflow (tools-only, not MCP tasks protocol)
 
 1. `prompt_examples`
-2. `task_create`
-3. `task_status` (poll every 5 minutes until done)
-4. download the result via `task_download` or via `task_file_info`
+2. `model_profiles` (optional, helps choose `model_profile`)
+3. non-tool step: draft/approve prompt
+4. `task_create`
+5. `task_status` (poll every 5 minutes until done)
+6. download the result via `task_download` or via `task_file_info`
+
+Concurrency note: each `task_create` call returns a new `task_id`; server-side global per-client concurrency is not capped, so clients should track their own parallel tasks.
 
 ### Option A: Remote MCP (fastest path)
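The revised six-step workflow above can be sketched as a client loop. This is a minimal sketch, not PlanExe code: `call_tool(name, args)` is a hypothetical MCP-client helper, and the stub below fakes the server responses so the flow is runnable end to end.

```python
import time

def call_tool(name, args, _state={"polls": 0}):
    # Hypothetical stand-in for a real MCP client call; responses are fakes
    # shaped after the tool names in the README, not a live PlanExe server.
    if name == "prompt_examples":
        return {"samples": ["Example prompt ..."], "message": "ok"}
    if name == "model_profiles":
        return {"default_profile": "baseline", "profiles": []}
    if name == "task_create":
        return {"task_id": "task-123"}
    if name == "task_status":
        _state["polls"] += 1
        done = _state["polls"] >= 2
        return {"task_id": args["task_id"], "state": "completed" if done else "processing"}
    if name == "task_download":
        return {"saved_path": f"{args['task_id']}-report.html"}
    raise ValueError(f"unknown tool: {name}")

def run_workflow(prompt, poll_seconds=0):
    call_tool("prompt_examples", {})      # step 1: study example prompts
    call_tool("model_profiles", {})       # step 2: optional profile lookup
    # step 3 (non-tool): draft the prompt and get user approval
    task_id = call_tool("task_create", {"prompt": prompt})["task_id"]  # step 4
    while True:                           # step 5: poll until a terminal state
        status = call_tool("task_status", {"task_id": task_id})
        if status["state"] in ("completed", "failed"):
            break
        time.sleep(poll_seconds)          # README suggests 5-minute polls
    # step 6: download the result
    return call_tool("task_download", {"task_id": task_id})["saved_path"]
```

Against a real server the loop would sleep 300 seconds between `task_status` calls; the stub completes on the second poll.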

database_api/model_taskitem.py

Lines changed: 32 additions & 0 deletions
@@ -5,6 +5,31 @@
 from sqlalchemy_utils import UUIDType
 from sqlalchemy import JSON
 from sqlalchemy.orm import column_property
+from sqlalchemy import event
+
+
+def _sanitize_utf8_text(value):
+    """Normalize values into valid UTF-8-safe text for persistence."""
+    if value is None:
+        return None
+
+    if isinstance(value, str):
+        text = value
+    elif isinstance(value, (bytes, bytearray, memoryview)):
+        text = bytes(value).decode("utf-8", errors="replace")
+    else:
+        text = str(value)
+
+    # Postgres text does not support embedded NULL bytes.
+    if "\x00" in text:
+        text = text.replace("\x00", "")
+
+    # Replace unpaired surrogates or other non-encodable code points.
+    try:
+        text.encode("utf-8", errors="strict")
+    except UnicodeEncodeError:
+        text = text.encode("utf-8", errors="replace").decode("utf-8")
+    return text
 
 class TaskState(enum.Enum):
     pending = 1
@@ -113,3 +138,10 @@ def demo_items(cls) -> list['TaskItem']:
         }
     )
     return [task1, task2, task3]
+
+
+@event.listens_for(TaskItem, "before_insert")
+@event.listens_for(TaskItem, "before_update")
+def _sanitize_taskitem_fields(_mapper, _connection, target):
+    # Enforce valid UTF-8-safe prompt text regardless of writer path.
+    target.prompt = _sanitize_utf8_text(target.prompt)

database_api/tests/test_taskitem_model.py

Lines changed: 33 additions & 0 deletions
@@ -39,3 +39,36 @@ def test_stop_request_fields_default(self):
         self.assertTrue(hasattr(fetched, "run_activity_overview_json"))
         self.assertTrue(hasattr(fetched, "run_artifact_layout_version"))
         self.assertFalse(bool(fetched.stop_requested))
+
+    def test_prompt_invalid_bytes_are_sanitized(self):
+        with self.app.app_context():
+            bad_bytes = b"Hello \xe2\x80 world"
+            task = TaskItem(
+                state=TaskState.pending,
+                prompt=bad_bytes,
+                user_id="test_user",
+            )
+            db.session.add(task)
+            db.session.commit()
+
+            fetched = db.session.get(TaskItem, task.id)
+            self.assertIsInstance(fetched.prompt, str)
+            # Must be encodable after sanitization.
+            fetched.prompt.encode("utf-8")
+            self.assertIn("Hello", fetched.prompt)
+            self.assertIn("world", fetched.prompt)
+
+    def test_prompt_surrogates_are_sanitized(self):
+        with self.app.app_context():
+            task = TaskItem(
+                state=TaskState.pending,
+                prompt="prefix \ud800 suffix",
+                user_id="test_user",
+            )
+            db.session.add(task)
+            db.session.commit()
+
+            fetched = db.session.get(TaskItem, task.id)
+            self.assertIsInstance(fetched.prompt, str)
+            fetched.prompt.encode("utf-8")
+            self.assertFalse(any(0xD800 <= ord(ch) <= 0xDFFF for ch in fetched.prompt))

docker-compose.yml

Lines changed: 2 additions & 0 deletions
@@ -242,6 +242,8 @@ services:
       PLANEXE_WORKER_PLAN_URL: ${PLANEXE_WORKER_PLAN_URL:-http://worker_plan:8000}
     ports:
       - "${PLANEXE_MCP_HTTP_PORT:-8001}:8001"
+    volumes:
+      - ./llm_config:/app/llm_config:ro
     restart: unless-stopped
     healthcheck:
       test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8001/healthcheck').read()"]

docs/mcp/antigravity.md

Lines changed: 6 additions & 8 deletions
@@ -18,15 +18,13 @@ My interaction history:
 4. I didn't meant outbreak, I meant vulcanic
 5. your prompt is a bit shorter than the example prompts
 6. go ahead create the plan
-7. stop that plan you are creating.
-8. now create the plan again, this time with ALL details. Last time you had FAST selected that would leave out most details.
-9. check status
+7. check status
+8. status
+9. status
 10. status
-11. status
-12. status
-13. download the report
-14. summarize the report
-15. does it correspond to your expectations?
+11. download the report
+12. summarize the report
+13. does it correspond to your expectations?
 
 I had to manually ask about `check status` to get details how the plan creation was going. It's not something that Antigravity can do.

docs/mcp/cursor.md

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ My interaction with Cursor for creating a plan is like this:
 2. I want you to come up with a good prompt
 3. I want something ala winter olympics in Italy 2026
 4. Slightly different idea. I want Denmark to switch from DKK to EUR. Use the persona of a person representing Denmark's ministers.
-5. go ahead create plan with all details
+5. go ahead create the plan
 6. *wait for 18 minutes until the plan has been created*
 7. download the plan

docs/mcp/inspector.md

Lines changed: 7 additions & 2 deletions
@@ -68,18 +68,23 @@ When connected follow these steps:
 Now there should be a list with tool names and descriptions:
 ```
 prompt_examples
+model_profiles
 task_create
 task_status
 task_stop
 task_file_info
 ```
 
+When you inspect `task_create`, the visible input schema includes `prompt` and optional `model_profile`.
+The `speed_vs_detail` parameter is intentionally hidden and only set via tool-specific metadata, since it confuses AI agents.
+
 Follow these steps:
 ![screenshot of mcp inspector invoke tool](inspector_step5_mcp_planexe_org.webp)
 
 1. In the `Tools` panel; Click on the `prompt_examples` tool.
-2. In the `prompt_examples` right sidepanel; Click on `Run Tool`.
-3. The MCP server should respond with a list of list of example prompts.
+2. In the `prompt_examples` right sidepanel; Click on `Run Tool`.
+3. The MCP server should respond with a list of example prompts.
+4. Optionally run `model_profiles` to inspect available `model_profile` choices before `task_create`.
 
 ## Approach 2. MCP server inside docker

docs/mcp/mcp_details.md

Lines changed: 157 additions & 7 deletions
@@ -10,12 +10,13 @@ This document lists the MCP tools exposed by PlanExe and example prompts for age
 - The primary MCP server runs in the cloud (see `mcp_cloud`).
 - The local MCP proxy (`mcp_local`) forwards calls to the server and adds a local download helper.
 - Tool responses return JSON in both `content.text` and `structuredContent`.
+- Workflow note: drafting and user approval of the prompt is a non-tool step between setup tools and `task_create`.
 
 ## Tool Catalog, `mcp_cloud`
 
 ### prompt_examples
 
-Returns around five example prompts that show what good prompts look like. Each sample is typically 300–800 words: detailed context, requirements, and success criteria. Usually the AI does the heavy lifting: the user has a vague idea, the agent calls `prompt_examples`, then expands that idea into a high-quality prompt (300–800 words). The prompt is shown to the user, who can ask for further changes or confirm it’s good to go. When the user confirms, the agent then calls `task_create`. Shorter or vaguer prompts produce lower-quality plans.
+Returns around five example prompts that show what good prompts look like. Each sample is typically 300-800 words. Usually the AI does the heavy lifting: the user has a vague idea, the agent calls `prompt_examples`, then expands that idea into a high-quality prompt (300-800 words). A compact prompt shape works best: objective, scope, constraints, timeline, stakeholders, budget/resources, and success criteria. The prompt is shown to the user, who can ask for further changes or confirm it’s good to go. When the user confirms, the agent then calls `task_create`. Shorter or vaguer prompts produce lower-quality plans.
 
 Example prompt:
 ```
@@ -27,7 +28,33 @@ Example call:
 {}
 ```
 
-Response includes `samples` (array of prompt strings, each 300–800 words) and `message`.
+Response includes `samples` (array of prompt strings, each ~300-800 words) and `message`.
+
+### model_profiles
+
+Returns profile guidance and model availability for `task_create.model_profile`.
+This helps agents pick a profile without knowing internal `llm_config/*.json` details.
+Profiles with zero models are omitted from the `profiles` list.
+If no models are available in any profile, `model_profiles` returns `isError=true` with `error.code = MODEL_PROFILES_UNAVAILABLE`.
+
+Example prompt:
+```
+List available model profiles and models.
+```
+
+Example call:
+```json
+{}
+```
+
+Response includes:
+- `default_profile`
+- `profiles[]` with:
+  - `profile`
+  - `title`
+  - `summary`
+  - `model_count`
+  - `models[]` (`key`, `provider_class`, `model`, `priority`)
 
 ### task_create

@@ -41,11 +68,71 @@ Example call:
 {"prompt": "Weekly meetup for humans where participants are randomly paired every 5 minutes..."}
 ```
 
-Optional argument:
+Optional visible argument:
+```text
+model_profile: "baseline" | "premium" | "frontier" | "custom"
 ```
+
+Developer-only hidden metadata (not part of visible tool schema shown to agents):
+```text
 speed_vs_detail: "ping" | "fast" | "all"
 ```
 
+Example with visible `model_profile`:
+```json
+{"prompt": "Weekly meetup for humans where participants are randomly paired every 5 minutes...", "model_profile": "premium"}
+```
+
+Example with hidden metadata override. The `ping` only checks if the LLMs are connected and doesn't trigger a full plan to be created:
+```json
+{
+  "prompt": "Weekly meetup for humans where participants are randomly paired every 5 minutes...",
+  "metadata": {
+    "task_create": {
+      "speed_vs_detail": "ping"
+    }
+  }
+}
+```
+
+Example with hidden metadata override. The `fast` triggers a plan to be created, where the entire Luigi pipeline gets exercised, while skipping as much detail as possible:
+```json
+{
+  "prompt": "Weekly meetup for humans where participants are randomly paired every 5 minutes...",
+  "metadata": {
+    "task_create": {
+      "speed_vs_detail": "fast"
+    }
+  }
+}
+```
+
+Example with hidden metadata override. The `all` is the default setting. Creates a plan with **ALL** details:
+```json
+{
+  "prompt": "Weekly meetup for humans where participants are randomly paired every 5 minutes...",
+  "metadata": {
+    "task_create": {
+      "speed_vs_detail": "all"
+    }
+  }
+}
+```
+
+Counterexamples (do NOT use PlanExe for these):
+
+- "Give me a 5-point checklist for X."
+- "Summarize this paragraph in 6 bullets."
+- "Rewrite this email."
+- "Identify the risks of this project."
+- "Make a SWOT for this document."
+
+What to do instead:
+
+- For one-shot outputs, use a normal LLM response directly.
+- For PlanExe, send a substantial multi-phase project prompt with scope, constraints, timeline, budget, stakeholders, and success criteria.
+- PlanExe always runs a fixed end-to-end pipeline; it does not support selecting only internal pipeline subsets.
+
 ### task_status
 
 Fetch status/progress and recent files for a task.
@@ -60,6 +147,13 @@ Example call:
 {"task_id": "2d57a448-1b09-45aa-ad37-e69891ff6ec7"}
 ```
 
+State contract:
+
+- `pending`: queued and waiting for a worker, keep polling.
+- `processing`: picked up by a worker, keep polling.
+- `completed`: terminal success, proceed to download.
+- `failed`: terminal error.
+
 ### task_stop
 
 Request an active task to stop.
@@ -135,11 +229,51 @@ Example call:
 {"task_id": "2d57a448-1b09-45aa-ad37-e69891ff6ec7", "artifact": "report"}
 ```
 
+`PLANEXE_PATH` behavior for `task_download`:
+- Save directory is `PLANEXE_PATH`, or current working directory if unset.
+- Non-existing directories are created automatically.
+- If `PLANEXE_PATH` points to a file, download fails.
+- Filename is prefixed with task id (for example `<task_id>-030-report.html`).
+- Response includes `saved_path` with the exact local file location.
+
+## Minimal error-handling contract
+
+Error payload shape:
+```json
+{"error": {"code": "SOME_CODE", "message": "Human readable message", "details": {}}}
+```
+
+Common cloud/core error codes:
+- `TASK_NOT_FOUND`
+- `INVALID_USER_API_KEY`
+- `USER_API_KEY_REQUIRED`
+- `INSUFFICIENT_CREDITS`
+- `INTERNAL_ERROR`
+- `MODEL_PROFILES_UNAVAILABLE`
+- `generation_failed`
+- `content_unavailable`
+
+Common local proxy error codes:
+- `REMOTE_ERROR`
+- `DOWNLOAD_FAILED`
+
+Special case:
+- `task_file_info` may return `{}` while the artifact is not ready yet (not an error).
+
+## Concurrency semantics (practical)
+
+- Each `task_create` call creates a new task with a new `task_id`.
+- The server does not enforce a global “one active task per client” cap.
+- Parallelism is a client orchestration concern:
+  - start with 1 task
+  - scale to 2 in parallel if needed
+  - avoid more than 4 unless you have strong task-tracking UX
+
 ## Typical Flow
 
 ### 1. Get example prompts
 
-The user often starts with a vague idea. The AI calls `prompt_examples` first to see what good prompts look like (around five samples, 300–800 words each), then expands the user’s idea into a high-quality prompt and shows it to the user.
+The user often starts with a vague idea. The AI calls `prompt_examples` first to see what good prompts look like (around five samples, typically 300-800 words each), then expands the user’s idea into a high-quality prompt using this compact shape: objective, scope, constraints, timeline, stakeholders, budget/resources, and success criteria.
 
 Prompt:
 ```
@@ -151,7 +285,23 @@ Tool call:
 {}
 ```
 
-### 2. Create a plan
+### 2. Inspect model profiles (optional but recommended)
+
+Prompt:
+```
+Show model profile options and available models.
+```
+
+Tool call:
+```json
+{}
+```
+
+### 3. Draft and approve the prompt (non-tool step)
+
+At this step, the agent writes a high-quality prompt draft (typically 300-800 words, with objective, scope, constraints, timeline, stakeholders, budget/resources, and success criteria), shows it to the user, and waits for approval.
+
+### 4. Create a plan
 
 The user reviews the prompt and either asks for further changes or confirms it’s good to go. When the user confirms, the agent calls `task_create` with that prompt.

@@ -160,7 +310,7 @@ Tool call:
 {"prompt": "..."}
 ```
 
-### 3. Get status
+### 5. Get status
 
 Prompt:
 ```
@@ -172,7 +322,7 @@ Tool call:
 {"task_id": "2d57a448-1b09-45aa-ad37-e69891ff6ec7"}
 ```
 
-### 4. Download the report
+### 6. Download the report
 
 Prompt:
 ```
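The error payload shape and `task_status` state contract in `docs/mcp/mcp_details.md` can be mirrored client-side. A minimal sketch under the payload shapes documented in this commit; `ToolError`, `check_payload`, and `next_action` are illustrative names, not PlanExe APIs:

```python
# Terminal states per the documented state contract.
TERMINAL_STATES = {"completed", "failed"}

class ToolError(Exception):
    """Raised when a tool response carries the documented error payload."""
    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details or {}

def check_payload(payload):
    # Error shape: {"error": {"code": ..., "message": ..., "details": {}}}
    err = payload.get("error")
    if err:
        raise ToolError(err.get("code", "INTERNAL_ERROR"),
                        err.get("message", ""),
                        err.get("details"))
    return payload

def next_action(status_payload):
    # Map the documented task states onto a client decision.
    state = check_payload(status_payload)["state"]
    if state in ("pending", "processing"):
        return "keep_polling"
    if state == "completed":
        return "download"
    if state == "failed":
        return "report_failure"
    raise ValueError(f"unknown state: {state}")
```

A client would call `next_action` on each poll result and retry only on `keep_polling`, treating `TASK_NOT_FOUND` and friends as `ToolError` exceptions rather than states.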
